JP2015231083A

JP2015231083A - Voice synthesis call system, communication terminal, and voice synthesis call method

Info

Publication number: JP2015231083A
Application number: JP2014115527A
Authority: JP
Inventors: 友里松本; Yuri Matsumoto
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-06-04
Filing date: 2014-06-04
Publication date: 2015-12-21

Abstract

PROBLEM TO BE SOLVED: To provide a voice synthesis call system, a communication terminal, and a voice synthesis call method that allow an exchange to continue in real time, even if one party does not speak, while the call state with the other party of a voice call is maintained. SOLUTION: In a voice synthesis call system 1000, a caller terminal 100 (refer to Fig. 2) comprises: a text input unit 170 that accepts input of text; and a voice text switching unit 160 that switches to text input by the text input unit 170 while maintaining the voice call state. A media processing device (voice distribution server) 200 comprises: a voice synthesis unit 220 that converts the text into voice data; and a voice transmission unit 230 that transmits the converted voice data to a called party terminal 300. On receiving the voice data, the called party terminal 300 outputs the voice call in the voice call state together with the voice call based on the voice data.

Description

The present invention relates to a voice synthesis call system, a communication terminal, and a voice synthesis call method.

Patent Document 1 describes an e-mail reading system in which a portable device automatically reads e-mail contents aloud without a voice synthesizer being installed in each portable device and without occupying a line. In this system, a voice synthesis engine converts a copy of each mail arriving at a mail server into an audio file and stores it in a storage server. On receiving a mail arrival notification, the portable device sends an audio file request to the server system, confirms that an audio file it has not yet downloaded exists in the storage server, and downloads it.

Patent Document 1: JP 2007-233435 A

With the technique described in Patent Document 1, text entered in advance with text editing software or the like can be transmitted to the receiving portable device and played back by voice synthesis. However, if the parties were originally on a voice call, the call must first be disconnected and the user must switch to mail or text editing software to send the text. For example, a person who was talking with someone on the phone until a moment ago may board a train and become unable to speak aloud, yet still want to continue the exchange.
There is therefore a need for a method that lets the parties continue the exchange in real time, even if one or both of them do not speak, while maintaining the call state with the party they had been talking to.

The present invention has been made in view of this background. Its object is to provide a voice synthesis call system, a communication terminal, and a voice synthesis call method that allow a spoken exchange to continue in real time, even if one party does not speak, while the call state with the terminal that was on the voice call is maintained.

To solve the above problem, the invention according to claim 1 is a voice synthesis call system having first and second communication terminals connectable to a network, the first and second communication terminals each including a voice call unit that performs a voice call via a voice distribution server, wherein: the first communication terminal, which is the calling-side communication terminal, includes a text input unit that accepts text input, and a voice text switching unit that switches to text input by the text input unit while maintaining the voice call state; the first communication terminal or the voice distribution server includes a voice synthesis unit that converts the text into voice data, and a voice transmission unit that transmits the converted voice data to the second communication terminal, which is the receiving-side communication terminal; and the voice call unit of the second communication terminal, on receiving the voice data, outputs the voice call in the voice call state together with the voice call based on the voice data.

The invention according to claim 3 is a voice synthesis call method for a voice synthesis call system having first and second communication terminals connectable to a network, the first and second communication terminals performing a voice call via a voice distribution server, wherein: the first communication terminal, which is the calling-side communication terminal, executes a step of performing a voice call via the voice distribution server, a step of switching to text input while maintaining the voice call state, and a step of accepting text input; the first communication terminal or the voice distribution server executes a step of converting the text into voice data and a step of transmitting the converted voice data to the second communication terminal, which is the receiving-side communication terminal; and the second communication terminal, on receiving the voice data, executes a step of outputting the voice call in the voice call state together with the voice call based on the voice data.

In this way, according to the voice synthesis call system of the present invention, the exchange can continue in real time, even if one or both parties do not speak, while the call state with the party who had been on the voice call is maintained.

The invention according to claim 2 is a communication terminal comprising: a voice call unit that performs a voice call via a voice distribution server; a text input unit that accepts text input; a voice text switching unit that switches to text input by the text input unit while maintaining the voice call state; a voice synthesis unit that converts the text into voice data; and a voice transmission unit that transmits the converted voice data to the voice distribution server.

In this way, according to the communication terminal of the present invention, the exchange can continue in real time, even if one party does not speak, while the call state with the party who had been on the voice call is maintained.

According to the present invention, it is possible to provide a voice synthesis call system, a communication terminal, and a voice synthesis call method that allow an exchange to continue in real time, even if one party does not speak, while the call state with the party who had been on the voice call is maintained.

FIG. 1 is a diagram for explaining the outline of the present invention.
FIG. 2 is a diagram for explaining the overall configuration and processing outline of the voice synthesis call system according to the present embodiment.
FIG. 3 is a diagram showing a data configuration example of the destination address information storage unit of the voice synthesis call system according to the present embodiment.
FIG. 4 is a sequence diagram showing the processing flow of the voice synthesis call system according to the present embodiment.
FIG. 5 is a diagram for explaining the overall configuration and processing outline when the voice synthesis means of the voice synthesis call system according to the present embodiment is on the network side.
FIG. 6 is a diagram for explaining the overall configuration and processing outline when the voice synthesis means of the voice synthesis call system according to the present embodiment is on the terminal side.

Next, the voice synthesis call system 1000 and related elements in a mode for carrying out the present invention (hereinafter, "the present embodiment") are described.

<System configuration and processing overview>
FIG. 1 is a diagram showing an outline of the present invention.
In the voice synthesis call system 1000 according to the present embodiment, the caller's terminal and the called party's terminal have been on a voice call. When text is entered on one of the terminals without the call being disconnected, voice data synthesized from that text is delivered to the other terminal in real time.
If both terminals are provided with the voice synthesis means 3 of the present invention (the text receiving unit 210, voice synthesis unit 220, and voice transmission unit 230 of FIG. 2 and other figures described later), the exchange can continue in real time even if neither party speaks. The following describes an example in which one terminal is provided with the voice synthesis means 3.

As shown in FIG. 1, user A has a communication terminal 1, such as a mobile phone or smartphone capable of voice calls and text input, user B has a communication terminal 2, such as a mobile phone or smartphone capable of voice calls, and the two are on a voice call.
Voice synthesis means 3, which synthesizes text into voice data while the call state is maintained during a voice call, is provided on the communication terminal 1 or on the communication network (NW) 2. The voice synthesis means 3 can be realized either by using a voice synthesizer built into the communication terminal 1 or by using a voice synthesis server installed on the communication network (NW) 2.
Even if the communication terminal 1 (user A) moves into an environment where speaking aloud is not possible, such as inside a train, it can keep the call state with the communication terminal 2 (user B) it was originally talking to, switch from a voice-to-voice call to a text-to-voice call, have the text synthesized as voice, and continue the spoken exchange.
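The two placements of the voice synthesis means 3 differ in what leaves the terminal: text when synthesis runs in the network, audio when it runs on the terminal. The following Python sketch illustrates that choice only. The names Placement, prepare_uplink, and fake_synthesize are hypothetical and do not appear in the source, and the fake synthesizer merely tags the text instead of producing real audio.

```python
from enum import Enum
from typing import Callable, Tuple


class Placement(Enum):
    TERMINAL = "terminal"  # synthesizer built into communication terminal 1
    NETWORK = "network"    # voice synthesis server on the communication network


def prepare_uplink(
    text: str,
    placement: Placement,
    synthesize: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Return (kind, payload) for what the terminal sends toward the network."""
    if placement is Placement.TERMINAL:
        # Synthesis happens locally, so audio leaves the terminal.
        return ("audio", synthesize(text))
    # Synthesis happens in the network, so the raw text leaves the terminal.
    return ("text", text.encode("utf-8"))


def fake_synthesize(text: str) -> bytes:
    # Hypothetical stand-in for a real synthesis engine (assumption).
    return b"PCM:" + text.encode("utf-8")
```

Keeping the synthesizer behind a plain callable is one way to let the same terminal code serve both placements.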

FIG. 2 is a diagram for explaining the overall configuration and processing outline of the voice synthesis call system 1000 according to the present embodiment.
As shown in FIG. 2, the voice synthesis call system 1000 according to the present embodiment includes a caller terminal 100 (first communication terminal) connectable to a network, a media processing device 200 (voice distribution server), and a called party terminal 300 (second communication terminal) connectable to the network.

The caller terminal 100 is a communication terminal, such as a mobile phone or smartphone, capable of voice calls and text input. It includes a receiving unit 110, a transmitting unit 120, a voice call unit 130 that performs voice calls with the other party, a destination address information storage unit 140, a switch button 150, a voice text switching unit 160, and a text input unit 170.
The receiving unit 110 and the transmitting unit 120 are physical interfaces (I/F) for reception and transmission. The voice call unit 130 consists of software or middleware for performing voice calls. It has a receiving function that receives voice input via the receiving unit 110 (see the voice receiving unit 101 in FIG. 5, described later) and an utterance function that outputs voice to the transmitting unit 120 (see the utterance unit 102 in FIG. 5, described later).

The switch button 150 is, for example, a soft key assigned to a touch panel used to indicate an arbitrary position on the display screen of a display unit (not shown). With such a touch key, the user issues each operation by pressing, sliding, or releasing a finger on the user screen 155 (see FIG. 4) set as the switch button 150. The switch button 150 may also be a key on a keyboard or the like, or a dedicated function key.
In the present embodiment, text input is described as using the language input keyboard built in by default in smartphone terminals and feature phones (mobile phones).

The text input unit 170 displays a text input screen and accepts text through the terminal's built-in character input function (a keypad or the like).

The voice text switching unit 160 switches to text input while maintaining the voice call state. It transmits the text input result obtained from the text input unit 170 to the voice synthesis means 3.
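The key property of the voice text switching unit 160 is that switching changes only how the user provides input and leaves the call connected. A minimal Python sketch of that behavior follows. The class and attribute names (CallSession, InputMode, connected, mode) are illustrative assumptions, not terms from the source.

```python
from enum import Enum


class InputMode(Enum):
    VOICE = "voice"
    TEXT = "text"


class CallSession:
    """Tracks whether a call is up and how the user is currently providing input."""

    def __init__(self) -> None:
        self.connected = False
        self.mode = InputMode.VOICE

    def establish(self) -> None:
        # The voice call is set up; input starts in voice mode.
        self.connected = True
        self.mode = InputMode.VOICE

    def switch_to_text(self) -> None:
        # Switching is only meaningful during an ongoing call.
        if not self.connected:
            raise RuntimeError("no call in progress")
        # Only the input mode changes; the call itself stays connected.
        self.mode = InputMode.TEXT
```

After switch_to_text() the session is still connected, which is the point of difference from hanging up and sending a mail.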

The voice call unit 130, the voice text switching unit 160, and the text input unit 170 are realized by a CPU (Central Processing Unit) loading a program stored in a ROM (Read Only Memory) or the like into a RAM (Random Access Memory), which is the main memory, and executing it. The destination address information storage unit 140 is, specifically, a storage means such as a hard disk, flash memory, or RAM.

The media processing device 200 is a media server or the like on the network and includes the voice synthesis means 3. The voice synthesis means 3 includes a text receiving unit 210, a voice synthesis unit 220 consisting of a voice synthesis engine that generates analog voice from text, and a voice transmission unit 230 that transmits the voice data converted from the text to the called party terminal 300. As shown in FIG. 6, the voice synthesis means 3 may instead be provided in the caller terminal 100 (communication terminal A).

The called party terminal 300 is a mobile phone, smartphone, fixed-line phone, or the like capable of voice calls. It includes a receiving unit 310, a transmitting unit 320, and a voice call unit 330.
The receiving unit 310 and the transmitting unit 320 are physical interfaces (I/F) for reception and transmission. The voice call unit 330 consists of software or middleware for performing voice calls. It has a receiving function that receives voice input via the receiving unit 310 (see the voice receiving unit 301 in FIG. 5, described later) and an utterance function that outputs voice to the transmitting unit 320 (see the utterance unit 302 in FIG. 5, described later).
On receiving voice data, the voice call unit 330 outputs the voice call in the voice call state together with the voice call based on the voice data. In this way, the voice call unit 330 continues the voice call using the voice data while maintaining the voice call state.
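The source does not say how the two audio sources are combined for output. One common approach, shown here purely as an assumption, is to sum 16-bit PCM samples and clip the result to the valid range. The function name mix_pcm16 is hypothetical.

```python
from typing import List

INT16_MIN, INT16_MAX = -32768, 32767


def mix_pcm16(call_audio: List[int], synthesized: List[int]) -> List[int]:
    """Mix two 16-bit PCM sample sequences by summation with clipping."""
    length = max(len(call_audio), len(synthesized))
    mixed: List[int] = []
    for i in range(length):
        # Treat the shorter stream as silence once it runs out.
        a = call_audio[i] if i < len(call_audio) else 0
        b = synthesized[i] if i < len(synthesized) else 0
        # Clip so the sum stays within the signed 16-bit range.
        mixed.append(max(INT16_MIN, min(INT16_MAX, a + b)))
    return mixed
```

For example, mix_pcm16([1000, 30000], [500, 10000, -7]) yields [1500, 32767, -7]: the second sample is clipped and the third passes through unchanged.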

FIG. 3 is a diagram showing a data configuration example of the destination address information storage unit 140 according to the present embodiment.
As shown in FIG. 3, the destination address information storage unit 140 stores user terminal identification information, such as destination address information 142 and a telephone number 143, in association with the user ID 141 of each user. The destination address information 142 is, for example, the IP address of the called party terminal 300. The telephone number 143 holds the mobile phone number when the called party terminal 300 has a mobile phone function, as a smartphone does.
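The data layout of FIG. 3 can be sketched as a small keyed table. The Python below is an illustration under stated assumptions: the class names, method names, and sample values (an address from the documentation-only 192.0.2.0/24 range and a made-up phone number) are hypothetical and not taken from the source.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DestinationRecord:
    """One row of the destination address information storage unit 140."""
    user_id: str                 # user ID 141
    address: str                 # destination address information 142 (e.g. an IP address)
    phone_number: Optional[str]  # telephone number 143, if the terminal has one


class DestinationAddressStore:
    """In-memory stand-in for storage unit 140, keyed by user ID."""

    def __init__(self) -> None:
        self._records: Dict[str, DestinationRecord] = {}

    def hold(self, record: DestinationRecord) -> None:
        # Store (or overwrite) the record for this user, e.g. at call setup.
        self._records[record.user_id] = record

    def lookup(self, user_id: str) -> DestinationRecord:
        # Raises KeyError if no record is held for this user.
        return self._records[user_id]


store = DestinationAddressStore()
store.hold(DestinationRecord("user-b", "192.0.2.10", "090-0000-0000"))
```

The phone number is optional because the source only stores it when the called party terminal has a mobile phone function.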

An outline of the processing of the voice synthesis call system 1000 is described below with reference to FIG. 2.
First, when the call is established, the caller terminal 100 holds, in the destination address information storage unit 140, destination address information including a user ID and user terminal identification information (an IP (Internet Protocol) address, telephone number, or the like) unique to the caller terminal 100 (see reference sign a in FIG. 2).
The caller terminal 100 accepts a switching instruction issued by operating the switch button 150 (step S1: switching instruction).
On receiving the switching instruction, the voice text switching unit 160 of the caller terminal 100 instructs the text input unit 170 to start the text input function (step S2: start instruction).

On receiving the start instruction, the text input unit 170 of the caller terminal 100 returns the input result to the voice text switching unit 160 (step S3: input result return). Specifically, the text input unit 170 takes text input through the language input keyboard built in by default in smartphone terminals and feature phones.
The voice text switching unit 160 of the caller terminal 100 also refers to the destination address information held in the destination address information storage unit 140 (step S4: destination address information reference).
The voice text switching unit 160 of the caller terminal 100 then transmits the text input result, with the destination address information attached, to the media processing device 200 (step S5: text input result transmission).
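Step S5 sends the text input result with the destination address information attached. The source does not specify a wire format, so the JSON encoding and the field names "destination" and "text" in this Python sketch are assumptions made for illustration.

```python
import json


def build_text_message(text: str, destination_address: str) -> bytes:
    """Bundle the text input result with the destination address (step S5)."""
    payload = {"destination": destination_address, "text": text}
    # ensure_ascii=False keeps non-ASCII text (e.g. Japanese) readable on the wire.
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def parse_text_message(raw: bytes) -> dict:
    """Recover the destination and text on the receiving side."""
    return json.loads(raw.decode("utf-8"))
```

Carrying the destination alongside the text lets the receiving side route the synthesized audio without any per-call state of its own, which is one plausible reason for attaching it.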

Next, the text receiving unit 210 of the media processing device 200 receives the text input result from the caller terminal 100 and instructs the voice synthesis unit 220 to convert the text data into voice data (step S6: conversion instruction).
On receiving the conversion instruction, the voice synthesis unit 220 of the media processing device 200 converts the text data into voice data and instructs the voice transmission unit 230 to send it (step S7: sending instruction).
The voice transmission unit 230 of the media processing device 200 then sends the synthesized voice data to the called party terminal 300 in accordance with the sending instruction (step S8: voice data sending).
At the called party terminal 300, the receiving unit 310 receives the voice that was sent, and the voice call unit 330 outputs the voice call in the voice call state together with the voice call based on the voice data.
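Steps S6 to S8 form a short chain inside the media processing device 200: receive text, synthesize, send. The Python sketch below mirrors that chain with injected callables. The class name MediaProcessingDevice, the method receive_text, and the fake synthesizer are hypothetical, and no real speech engine or network transport is involved.

```python
from typing import Callable, List, Tuple

Synthesize = Callable[[str], bytes]
Send = Callable[[str, bytes], None]


class MediaProcessingDevice:
    """Stand-in for units 210 (receive), 220 (synthesize), and 230 (send)."""

    def __init__(self, synthesize: Synthesize, send: Send) -> None:
        self._synthesize = synthesize
        self._send = send

    def receive_text(self, text: str, destination: str) -> None:
        # Step S6: the received text is handed over for conversion.
        audio = self._synthesize(text)
        # Steps S7 and S8: the converted audio is sent to the destination.
        self._send(destination, audio)


def fake_synthesize(text: str) -> bytes:
    # Hypothetical placeholder; a real engine would return encoded audio.
    return b"AUDIO:" + text.encode("utf-8")


delivered: List[Tuple[str, bytes]] = []
device = MediaProcessingDevice(
    fake_synthesize,
    lambda dest, audio: delivered.append((dest, audio)),
)
device.receive_text("On the train, talk later", "192.0.2.10")
```

Here the send callable just records what would be delivered, so the chain can be exercised without a network.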

In this way, the voice synthesis call system 1000 according to the present embodiment lets the user of the caller terminal 100 continue the exchange in real time, even if one or both parties do not speak, while maintaining the call state with the user of the called party terminal 300 with whom the voice call was in progress. For example, if a situation arises during a voice call in which speaking can no longer continue, the user of the caller terminal 100 operates the switch button 150 to switch to the text input screen and enters text, and the entered text is sent to the media processing device 200 in real time. The voice synthesis unit 220 of the media processing device 200 converts the text into voice data, which is transmitted to the called party terminal 300 in real time.

<Process flow>
Next, the processing flow of the voice synthesis call system 1000 according to the present embodiment is described in detail.
FIG. 4 is a sequence diagram showing the processing flow of the voice synthesis call system 1000 according to the present embodiment.
As shown in FIG. 4, a call has been established between the caller terminal 100 and the called party terminal 300 (step S101). When the call is established, destination address information is held in the destination address information storage unit 140 of the caller terminal 100.
First, with the call established, the caller terminal 100 accepts a switching instruction, such as the user touching the user screen 155 to press the switch button 150 (step S102), and outputs it to the voice text switching unit 160 of the caller terminal 100.

The voice text switching unit 160 instructs the text input unit 170 of the caller terminal 100 to start the text input function (step S103).
The text input unit 170 switches the display of the user screen 155 to a text input screen on which text can be entered (step S104).
The user of the caller terminal 100 enters text on the user screen 155, and the text input result is output to the text input unit 170 (step S105).
The text input unit 170 outputs the text input result to the voice text switching unit 160 (step S106).

The voice text switching unit 160 refers to the destination address information held in the destination address information storage unit 140 (step S107).
The voice text switching unit 160 then transmits the text input result, with the destination address information attached, to the text receiving unit 210 of the media processing device 200 (step S108).
The text receiving unit 210 of the media processing device 200 receives the text input result from the caller terminal 100 and instructs the voice synthesis unit 220 to convert the text into voice (step S109).

The voice synthesis unit 220 converts the received text data into voice data in accordance with the conversion instruction (step S110).
The voice synthesis unit 220 then instructs the voice transmission unit 230 of the media processing device 200 to send the data (step S111).
The voice transmission unit 230 sends the synthesized voice data to the called party terminal 300 in accordance with the sending instruction (step S112).
At the called party terminal 300, the receiving unit 310 receives the voice data that was sent, and the voice call unit 330 outputs the voice call in the voice call state together with the voice call based on the voice data.

[Application Example 1]
FIG. 5 is a diagram for explaining the overall configuration and processing outline of a voice synthesis call system 1000A in which the voice synthesis means 3 (the voice synthesis unit 220 and so on) is on the network (NW) side. Components identical to those in FIG. 2 are given the same reference signs.
As shown in FIG. 5, the voice synthesis call system 1000A, in which the voice synthesis means 3 (the voice synthesis unit 220 and so on) is on the NW side, includes a communication terminal 100A (first communication terminal) that is the caller terminal, a voice distribution server 200A on the NW, and a communication terminal 300A (second communication terminal) that is the called party terminal.

The communication terminal 100A is a mobile phone, smartphone, or the like capable of voice calls and text input. It includes a voice receiving unit 101 (voice call unit 130), an utterance unit 102 (voice call unit 130), a destination address information storage unit 140, a switch button 150, a voice text switching unit 160, and a text input unit 170.
The voice distribution server 200A is the media processing device 200 of FIG. 2 and includes the voice synthesis means 3, which comprises the text receiving unit 210, the voice synthesis unit 220, and the voice transmission unit 230.
The communication terminal 300A is a mobile phone, smartphone, fixed-line phone, or the like capable of voice calls. It includes a voice receiving unit 301 (voice call unit) and an utterance unit 302 (voice call unit).

An outline of the processing of the voice synthesis call system 1000A is described below with reference to FIG. 5.
As shown in FIG. 5, a call has been established between the communication terminal 100A and the communication terminal 300A, which are in a call-connected state. In this state, voice data is transmitted from the utterance unit 302 of the communication terminal 300A to the voice receiving unit 101 of the communication terminal 100A (see reference sign a in FIG. 5), and from the utterance unit 102 of the communication terminal 100A to the voice receiving unit 301 of the communication terminal 300A (see reference sign b in FIG. 5).
As indicated by reference sign c in FIG. 5, when the call is established the communication terminal 100A holds, in the destination address information storage unit 140, destination address information including a user ID and user terminal identification information (an IP address, telephone number, or the like) unique to that communication terminal.
The communication terminal 100A accepts a switching instruction issued by operating the switch button 150 (step S1: switching instruction). For example, the user of the communication terminal 100A turns the button on by clicking it on the screen.

On receiving the switching instruction, the voice text switching unit 160 of the caller terminal 100 instructs the text input unit 170 to activate the text input function (step S2: activation instruction).
On receiving the activation instruction, the text input unit 170 of the communication terminal 100A returns the input result to the voice text switching unit 160 (step S3: input result return). Specifically, the text input unit 170 displays a text input screen and accepts a text sentence through the terminal's own character input function (a keypad or the like).
The voice text switching unit 160 of the communication terminal 100A also refers to the destination address information held in the destination address information storage unit 140 (step S4: destination address information reference).

The voice text switching unit 160 of the communication terminal 100A then attaches the destination address information to the text input result and transmits it to the voice distribution server 200A (step S5: text input result transmission). Specifically, the voice text switching unit 160 transmits the text input result when a button labeled "Send" or the like is pressed.
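The terminal-side steps S1 to S5 can be pictured with a minimal Python sketch. The class names, field names, and error handling below are assumptions made for illustration; the publication does not specify data formats or an API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DestinationAddress:
    """Destination address information held in storage unit 140 (fields assumed)."""
    user_id: str
    terminal_id: str  # e.g. an IP address or a telephone number


@dataclass(frozen=True)
class TextMessage:
    """Text input result with the destination address attached (step S5)."""
    destination: DestinationAddress
    text: str


class CallerTerminal:
    """Hypothetical model of the communication terminal 100A."""

    def __init__(self) -> None:
        self.call_active = False
        self.text_mode = False
        self._destination: Optional[DestinationAddress] = None

    def establish_call(self, destination: DestinationAddress) -> None:
        # Symbol c: keep the destination address when the call is established.
        self.call_active = True
        self._destination = destination

    def press_switch_button(self) -> None:
        # Steps S1-S2: enable text input; the call state is left untouched.
        if not self.call_active:
            raise RuntimeError("no call in progress")
        self.text_mode = True

    def submit_text(self, text: str) -> TextMessage:
        # Steps S3-S5: take the typed text, look up the stored destination,
        # and build the message that would be sent to the server.
        if not (self.call_active and self.text_mode) or self._destination is None:
            raise RuntimeError("text input is not active")
        return TextMessage(destination=self._destination, text=text)
```

Note that `press_switch_button` never clears `call_active`, which mirrors the requirement that the voice call state is maintained while text input is in use.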

Next, the text receiving unit 210 of the voice distribution server 200A receives the text input result from the communication terminal 100A and instructs the voice synthesis unit 220 to convert the text data into voice data (step S6: conversion instruction).
On receiving the conversion instruction, the voice synthesis unit 220 of the voice distribution server 200A converts the text input result into voice data and issues a sending instruction to the voice transmission unit 230 (step S7: sending instruction).
The voice transmission unit 230 of the voice distribution server 200A then sends the synthesized voice data to the communication terminal 300A in accordance with the sending instruction (step S8: voice data sending).
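As a rough illustration of steps S6 to S8 on the server side, the following Python sketch chains a text receiving step, a synthesis step, and a sending step. Everything here is hypothetical: `fake_synthesize` is a placeholder that does not produce real speech, and the in-memory `outbox` stands in for whatever transport an actual server would use.

```python
from typing import Callable, Dict, List

Synthesizer = Callable[[str], bytes]


def fake_synthesize(text: str) -> bytes:
    # Placeholder only: returns one zero byte per character instead of audio.
    return bytes(len(text))


class VoiceDistributionServer:
    """Hypothetical model of the voice distribution server 200A."""

    def __init__(self, synthesize: Synthesizer) -> None:
        self._synthesize = synthesize
        # Voice data queued per destination terminal (assumed transport stand-in).
        self.outbox: Dict[str, List[bytes]] = {}

    def receive_text(self, destination: str, text: str) -> None:
        # Step S6: the text receiving unit passes the text on for conversion.
        voice_data = self._synthesize(text)
        # Step S7: the synthesis unit hands the result to the transmission unit.
        self._send_voice(destination, voice_data)

    def _send_voice(self, destination: str, voice_data: bytes) -> None:
        # Step S8: send the synthesized voice data to the callee terminal.
        self.outbox.setdefault(destination, []).append(voice_data)
```

Injecting the synthesizer as a callable keeps the sketch independent of any particular speech synthesis engine.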

In this way, when the voice distribution server 200A receives a text input result from the communication terminal 100A, it converts the text input result into voice data with the voice synthesis unit 220 and then transmits the voice data to the communication terminal 300A.
The receiving unit 301 of the communication terminal 300A receives the voice data from the voice distribution server 200A and outputs it together with the voice call in the voice call state (see symbol b in FIG. 5).
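The publication does not say how the callee terminal combines the two audio sources. One common approach, shown here purely as an assumption, is to sum 16-bit PCM samples and clip the result to the valid range. The function name `mix_pcm16` and the sample format are illustrative choices, not part of the described system.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix_pcm16(call_audio: Sequence[int], synthesized: Sequence[int]) -> List[int]:
    """Mix two 16-bit PCM sample sequences by addition with clipping.

    The shorter input is treated as silence once it runs out.
    """
    length = max(len(call_audio), len(synthesized))
    mixed: List[int] = []
    for i in range(length):
        a = call_audio[i] if i < len(call_audio) else 0
        b = synthesized[i] if i < len(synthesized) else 0
        # Clip so the sum stays representable as a signed 16-bit sample.
        mixed.append(max(INT16_MIN, min(INT16_MAX, a + b)))
    return mixed
```

With this kind of mixing, the callee keeps hearing any live audio from the ongoing call while the synthesized speech plays.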

In this way, the voice synthesis call system 1000A according to the present embodiment allows the parties to continue their exchange in real time, while maintaining the call state with the party they were already talking to, even if one or both of them do not speak aloud.

In Application Example 1, the communication terminal 100A does not include the speech synthesis means 3 (the voice synthesis unit 220 and related units), so an existing mobile terminal or the like can be used with only minor modification. The communication terminal 100A can therefore be operated at low cost.

[Application Example 2]
FIG. 6 is a diagram for explaining the overall configuration and processing outline of the voice synthesis call system 1000B, in which the speech synthesis means 3 (the voice synthesis unit 220 and related units) is located on the terminal side. Components identical to those in FIGS. 2 and 5 are denoted by the same reference numerals. Although FIG. 6 describes an example in which only the communication terminal 100B includes the speech synthesis means 3, the communication terminal 300B may also include the speech synthesis means 3. With that configuration, the exchange can continue in real time even if neither party speaks aloud.
As shown in FIG. 6, the voice synthesis call system 1000B, in which the voice synthesis unit 220 is on the terminal side, includes a communication terminal 100B (first communication terminal), which is the caller terminal; a voice distribution server 200B on the network (NW); and a communication terminal 300B (second communication terminal), which is the callee terminal. The communication terminal 300B has the same configuration as the communication terminal 300A in FIG. 5.

The communication terminal 100B is a mobile phone, a smartphone, or the like capable of voice calls and text input. The communication terminal 100B includes a receiving unit 101 (voice communication unit 130), an utterance unit 102 (voice communication unit 130), a destination address information storage unit 140, a switching button 150, a voice text switching unit 160, a text input unit 170, a text receiving unit 210, a voice synthesis unit 220, and a voice transmission unit 230.
The voice distribution server 200B is the media processing device 200 of FIG. 2.

An outline of the processing of the voice synthesis call system 1000B is described below with reference to FIG. 6.
As shown in FIG. 6, a call has been established between the communication terminal 100B and the communication terminal 300B, and the two are in a call connection state. In this state, voice data is transmitted from the utterance unit 302 of the communication terminal 300B to the receiving unit 101 of the communication terminal 100B (see symbol a in FIG. 6), and from the utterance unit 102 of the communication terminal 100B to the receiving unit 301 of the communication terminal 300B (see symbol b in FIG. 6).

As indicated by symbol c in FIG. 6, when the call is established, the communication terminal 100B stores destination address information, which includes a user ID and user terminal identification information unique to the communication terminal (an IP address, a telephone number, or the like), in the destination address information storage unit 140.
The communication terminal 100B accepts a switching instruction issued by operating the switching button 150 (step S1: switching instruction). For example, the user of the communication terminal 100B turns text input on by clicking a button on the screen.
On receiving the switching instruction, the voice text switching unit 160 of the communication terminal 100B instructs the text input unit 170 to activate the text input function (step S2: activation instruction).

On receiving the activation instruction, the text input unit 170 of the communication terminal 100B returns the input result to the voice text switching unit 160 (step S3: input result return). Specifically, the text input unit 170 displays a text input screen and accepts a text sentence through the terminal's own character input function (a keypad or the like).
The voice text switching unit 160 of the communication terminal 100B also refers to the destination address information held in the destination address information storage unit 140 (step S4: destination address information reference).

The voice text switching unit 160 of the communication terminal 100B then attaches the destination address information to the text input result and outputs it to the text receiving unit 210 (step S5: text input result output). Specifically, the voice text switching unit 160 outputs the text input result when a button labeled "Send" or the like is pressed.

Next, the text receiving unit 210 receives the text input result from the voice text switching unit 160 and instructs the voice synthesis unit 220 to convert the text data into voice data (step S6: conversion instruction).
On receiving the conversion instruction, the voice synthesis unit 220 converts the text input result into voice data and issues a sending instruction to the voice transmission unit 230 (step S7: sending instruction).
The voice transmission unit 230 then sends the synthesized voice data toward the communication terminal 300B in accordance with the sending instruction (step S8: voice data sending).

In this way, the communication terminal 100B converts the text input result into voice data with the voice synthesis unit 220 and then transmits the voice data to the communication terminal 300B.
The receiving unit 301 of the communication terminal 300B receives the voice data from the voice distribution server 200B and outputs it together with the voice call in the voice call state (see symbol b in FIG. 6).

In this way, the voice synthesis call system 1000B according to the present embodiment allows the parties to continue their exchange in real time, while maintaining the call state with the party they were already talking to, even if one or both of them do not speak aloud.

In Application Example 2, the communication terminal 100B includes the speech synthesis means 3 (the voice synthesis unit 220 and related units), so an existing server can be used as is, without introducing a voice distribution server 200B that has new functions. The system can therefore be built at low cost as far as the voice distribution server 200B is concerned.
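The difference between the two application examples comes down to what leaves the caller terminal: text in Application Example 1, already synthesized voice data in Application Example 2. The short Python sketch below makes that contrast concrete. The enum, the function name, and the `("kind", payload)` tuple are assumptions for illustration, not a wire format described in the publication.

```python
from enum import Enum
from typing import Callable, Tuple


class SynthesisLocation(Enum):
    SERVER = "server"      # Application Example 1: the server synthesizes
    TERMINAL = "terminal"  # Application Example 2: the caller terminal synthesizes


def payload_sent_by_terminal(
    text: str,
    location: SynthesisLocation,
    synthesize: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Return the kind of payload and the bytes the caller terminal sends out."""
    if location is SynthesisLocation.TERMINAL:
        # The terminal runs synthesis itself; the server only relays voice data.
        return ("voice", synthesize(text))
    # The terminal sends plain text; the server must provide synthesis.
    return ("text", text.encode("utf-8"))
```

Placing synthesis on the terminal keeps the server unchanged, while placing it on the server keeps the terminal simple, which matches the cost trade-off stated for each application example.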

As described above, in the voice synthesis call system 1000 according to the present embodiment, the caller terminal 100 (see FIG. 2) includes a text input unit 170 that accepts text input and a voice text switching unit 160 that switches to text input by the text input unit 170 while maintaining the voice call state. The media processing device (voice distribution server) 200 includes a voice synthesis unit 220 that converts text into voice data and a voice transmission unit 230 that transmits the converted voice data to the callee terminal 300. On receiving the voice data, the callee terminal 300 outputs the voice call in the voice call state together with the voice call based on the voice data. As a result, when a situation arises during a voice call in which the call cannot be continued by voice, the user of the caller terminal 100 operates the switching button 150 to switch to the text input screen and enters a text sentence; the voice synthesis unit 220 then converts the text sentence into voice data, and the voice transmission unit 230 transmits it to the callee terminal 300 in real time.

Therefore, by switching to a text-to-voice call partway through a voice call while maintaining the call state, the conversation can continue even if a party moves into an environment where speaking aloud is not possible, such as inside a vehicle. In other words, the exchange can continue while the call state with the original party is maintained. From the other party's point of view, the exchange continues as a voice call, just as it was before the switch.

3 Speech synthesis means
100 Caller terminal (first communication terminal)
100A, 100B Communication terminal (first communication terminal)
101, 301 Receiving unit (voice call unit)
102, 302 Utterance unit (voice call unit)
110, 310 Reception unit
120, 320 Transmission unit
130, 330 Voice call unit
140 Destination address information storage unit
150 Switching button
160 Voice text switching unit
170 Text input unit
200 Media processing device (voice distribution server)
200A, 200B Voice distribution server
210 Text receiving unit
220 Voice synthesis unit
230 Voice transmission unit
300 Callee terminal (second communication terminal)
300A, 300B Communication terminal (second communication terminal)
1000, 1000A, 1000B Voice synthesis call system

Claims

A voice synthesis call system having first and second communication terminals connectable to a network, the first and second communication terminals each including a voice call unit that makes a voice call via a voice distribution server, wherein
the first communication terminal, which is the communication terminal on the calling side, comprises:
a text input unit that accepts text input; and
a voice text switching unit that switches to text input by the text input unit while maintaining a voice call state,
the first communication terminal or the voice distribution server comprises:
a voice synthesis unit that converts the text into voice data; and
a voice transmission unit that transmits the converted voice data to the second communication terminal, which is the communication terminal on the receiving side, and
the voice call unit of the second communication terminal,
on receiving the voice data, outputs the voice call in the voice call state together with the voice call based on the voice data.

A communication terminal comprising:
a voice call unit that makes a voice call via a voice distribution server;
a text input unit that accepts text input;
a voice text switching unit that switches to text input by the text input unit while maintaining a voice call state;
a voice synthesis unit that converts the text into voice data; and
a voice transmission unit that transmits the converted voice data to the voice distribution server.

A voice synthesis call method for a voice synthesis call system having first and second communication terminals connectable to a network, the first and second communication terminals making a voice call via a voice distribution server, wherein
the first communication terminal, which is the communication terminal on the calling side, executes:
a step of making a voice call via the voice distribution server;
a step of switching to text input while maintaining a voice call state; and
a step of accepting text input,
the first communication terminal or the voice distribution server executes:
a step of converting the text into voice data; and
a step of transmitting the converted voice data to the second communication terminal, which is the communication terminal on the receiving side, and
the second communication terminal executes
a step of outputting, when the voice data is received, the voice call in the voice call state together with the voice call based on the voice data.