JP5387095B2

JP5387095B2 - Information processing apparatus and information processing method

Info

Publication number: JP5387095B2
Application number: JP2009082786A
Authority: JP
Inventors: 聡渡辺; 勉兼安; 一郎宮本
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2009-03-30
Filing date: 2009-03-30
Publication date: 2014-01-15
Anticipated expiration: 2029-03-30
Also published as: JP2010237802A

Description

The present invention relates to an information processing apparatus and an information processing method, and in particular to an information processing apparatus and an information processing method having a chat function.

Text chat is one conceivable means of supporting communication between a hearing-impaired person and a hearing person. Because a hearing-impaired person cannot hear speech, the hearing person enters text into a chat terminal, and the hearing-impaired person reads the text to understand what the other party has said.

In communication using a chat system, efforts have been made not merely to get the message across but to make the exchange as smooth as spoken conversation. For example, Patent Document 1 discloses a technique that uses speech recognition to reduce the workload of text input. With speech recognition, even a user who is not skilled at keyboard operation can chat smoothly.

When a hearing-impaired person also has a speech impairment, that person speaks by means of text. Depending on the situation, however, it may be preferable for both sides to communicate by voice, for example when the hearing-impaired person speaks in front of several people at a lecture or meeting, or speaks on a radio broadcast. In such cases, speech synthesis can be used in the chat system.

Patent Document 1: JP 2000-285063 A

However, in conventional chat systems the only input means is typing on a keyboard or the like. A user who is not proficient at input operations cannot control the timing at which remarks are entered, so the result is often not a natural voice conversation that is easy for a third party to listen to.

The present invention has been made in view of the above problem. Its object is to provide a new and improved information processing apparatus with which, in a voice chat conducted by converting text into synthesized speech, the user can control the timing at which remarks are entered and can hold a natural voice conversation that is pleasant for a third party to hear.

To solve the above problem, according to one aspect of the present invention, there is provided an information processing apparatus having a chat function that enables conversation through the exchange of messages with another information processing apparatus connected via a network, the apparatus including: a text data storage unit that stores candidate texts, which are candidates for the messages; a voice data storage unit that stores candidate voices, which are prerecorded voice data, and candidate voice texts, each of which is linked to a candidate voice and indicates the content of that candidate voice; a display control unit that controls display of an operation screen and displays the candidate texts and the candidate voice texts on the operation screen in a selectable manner; a selected text acquisition unit that acquires a selected text, which is the candidate text selected by the user from among the candidate texts displayed on the operation screen; a selected voice acquisition unit that acquires a selected voice text, which is the candidate voice text selected by the user from among the candidate voice texts displayed on the operation screen, and a selected voice, which is the candidate voice stored in association with the selected voice text; a voice output unit that outputs the selected voice or transmits it to the other information processing apparatus; and a text transmission unit that transmits the selected text and the selected voice text to the other information processing apparatus.

With this configuration, the information processing apparatus presents the user, through the operation screen, with candidate texts and candidate voice texts that are candidates for remarks in the chat, and outputs prestored text and voice as messages in response to the user's operation. The user can therefore select the desired item from the presented text and voice candidates and confirm it as his or her own remark. As a result, the user can control the timing of remarks regardless of, for example, proficiency at keyboard input, and can hold a natural voice conversation that is pleasant for a third party to hear.

The apparatus may further include an input text acquisition unit that acquires input text entered by the user, and a vocalization control unit that controls the conversion of the input text, the selected text, and the selected voice text into voice.

The vocalization control unit may have a replacement table containing preregistered registered words and corrected words linked to the registered words. When the input text or the selected text contains a registered word, the vocalization control unit may replace that registered word with the corresponding corrected word in the replacement table.
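The replacement step described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the table entries, the function name `apply_replacements`, and the single-pass regular-expression approach are all assumptions made for the example.

```python
import re

# Hypothetical replacement table: registered word -> corrected word.
# The entries are invented for illustration only.
REPLACEMENT_TABLE = {
    "thx": "thanks",
    "ASAP": "as soon as possible",
}


def apply_replacements(text: str, table: dict) -> str:
    """Replace every registered word found in text with its corrected word."""
    if not table:
        return text
    # Try longer registered words first so a short entry cannot split a longer one.
    words = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(word) for word in words))
    # A single pass avoids re-replacing text that was just inserted.
    return pattern.sub(lambda match: table[match.group(0)], text)
```

Such a step would typically run on the input text or the selected text just before it is handed to speech synthesis, so that abbreviations or misread words are pronounced as intended.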

The apparatus may further include a speech synthesis unit that generates synthesized speech from the input text and the selected text, and the voice output unit may additionally output the synthesized speech or transmit it to the other information processing apparatus.

The apparatus may further include a voice output time calculation unit that calculates the output time of the synthesized speech generated from the input text and the selected text, and the display control unit may display the remaining output time of the synthesized speech on the operation screen based on the output time supplied by the voice output time calculation unit.

The voice output unit may stop output in accordance with a voice stop instruction input from the other information processing apparatus.

As described above, according to the present invention, in a voice chat conducted by converting text into synthesized speech, the user can control the timing at which remarks are entered and can hold a natural voice conversation that is pleasant for a third party to hear.

A configuration diagram of the speech synthesis chat system according to the first embodiment of the present invention.
A block diagram showing the functional configuration of the information processing apparatus according to the first embodiment.
An explanatory diagram showing an example of the operation screen of the information processing apparatus according to the first embodiment.
A block diagram showing the functional configuration of the information processing apparatus according to the second embodiment.
A flowchart showing an example of the operation of the vocalization control unit according to the second embodiment.
An example of the replacement table held by the vocalization control unit in the second embodiment.
A block diagram showing the functional configuration of the information processing apparatus according to the third embodiment.
An example of a message displayed by the voice output control unit in the third embodiment.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and redundant description is omitted.

<First Embodiment>
First, the configuration of the speech synthesis chat system according to the first embodiment of the present invention is described with reference to FIG. 1. FIG. 1 is a configuration diagram of the speech synthesis chat system according to the first embodiment of the present invention.

(System configuration)
The speech synthesis chat system according to the first embodiment of the present invention mainly comprises a first chat terminal device 100, a communication network 200, and a second chat terminal device 300. The first chat terminal device 100 and the second chat terminal device 300 are connected via the communication network 200. In this embodiment the system is made up of two terminal devices, but it is not limited to this and may be made up of two or more information processing apparatuses.

The first chat terminal device 100 and the second chat terminal device 300 are devices that can connect to the communication network 200. Each may be, for example, a PC (Personal Computer). Each may also be a device such as a PDA (Personal Digital Assistant), a mobile phone, a display device such as a digital television, a recording and playback device such as a video player, video deck, HDD (Hard Disk Drive) recorder, DVD (Digital Versatile Disc) player, or DVD recorder, a music playback device, or a game console.

The communication network 200 is a wired or wireless transmission path. It may include, for example, public networks such as a telephone network, a satellite communication network, and the Internet, as well as dedicated networks such as various LANs (Local Area Networks) including Ethernet (registered trademark), WANs (Wide Area Networks), and IP-VPNs (Internet Protocol-Virtual Private Networks).

(Outline of speech synthesis chat)
The first chat terminal device 100 is an example of the information processing apparatus, and the second chat terminal device 300 is an example of the other information processing apparatus. Both devices have a chat function capable of exchanging text messages.

Consider a case where the chat system according to this embodiment is used for voice communication between user A (a hearing person) and user B (who has a hearing impairment and a speech impairment). Voice communication is desirable in settings such as a meeting with many participants or a radio broadcast. The chat system according to this embodiment uses speech synthesis so that a user with a speech impairment can communicate by voice.

User A speaks aloud and also enters the content of the remark as text into the second chat terminal device 300. The text may be entered with an input means such as a keyboard, or it may be obtained by converting the speech into text with speech recognition. The entered text is transmitted to the first chat terminal device 100 via the communication network 200.

The text entered by user A is displayed on the operation screens of both the first chat terminal device 100 and the second chat terminal device 300. User B reads the text displayed on the operation screen of the first chat terminal device 100 and enters a reply into the first chat terminal device 100. The text entered by user B is transmitted to the second chat terminal device 300 via the communication network 200 and, at the same time, is converted to synthesized speech in the first chat terminal device 100 and output as sound through an output means such as a speaker. This configuration is suitable because this embodiment assumes that the first chat terminal device 100 and the second chat terminal device 300 are in the same place.

If, for example, the first chat terminal device 100 and the second chat terminal device 300 are in separate places, the text entered at the first chat terminal device 100 may be transmitted to the second chat terminal device 300, synthesized into speech there, and then output as sound. Alternatively, the text entered at the first chat terminal device 100 may be synthesized into speech in the first chat terminal device 100, converted into an audio file such as WAV or MP3, and then transferred to the second chat terminal device 300. The synthesized speech data may also be transferred to the second chat terminal device 300 by streaming.
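As one illustration of the audio-file option, the sketch below wraps raw PCM samples in a WAV container using Python's standard `wave` module so that the result could be sent to the remote terminal as bytes. The function name `pcm_to_wav_bytes` and the default format (16 kHz, mono, 16-bit) are assumptions for the example; the source does not specify an audio format or a transfer mechanism.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000,
                     channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples (assumed to come from a synthesizer) in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    # The wave module leaves a caller-supplied buffer open, so it can still be read.
    return buffer.getvalue()
```

Sending the text and synthesizing remotely keeps the payload small, while sending a file or stream lets the sender control exactly how the speech sounds; which is preferable depends on the deployment.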

The overall configuration of the speech synthesis chat system according to this embodiment and an outline of its use have been described above. Conventionally, however, for such a system to produce a voice conversation that is pleasant for a third party to hear, the user had to become proficient at using the system. A detailed configuration that helps produce a voice conversation pleasant for a third party to hear is therefore described next.

Next, the functional configuration of the first chat terminal device 100 and an example of its operation screen are described with reference to FIGS. 2 and 3. FIG. 2 is a block diagram showing the functional configuration of the first chat terminal device according to the first embodiment. FIG. 3 is an explanatory diagram showing an example of the operation screen of the first chat terminal device according to the first embodiment.

(Functional configuration)
The functional configuration of the first chat terminal device 100 according to the first embodiment of the present invention is described with reference to FIG. 2. The first chat terminal device 100 mainly includes a voice data storage unit 102, a text data storage unit 104, a selected voice acquisition unit 106, an input text acquisition unit 108, a selected text acquisition unit 110, a display control unit 112, a text transmission unit 114, a text reception unit 116, a speech synthesis unit 118, a voice output unit 120, and a voice output time calculation unit 122.

(Voice data storage unit 102)
The voice data storage unit 102 is connected to the selected voice acquisition unit 106. It stores candidate voices, which are prerecorded voice data. Each candidate voice is stored linked to a candidate voice text, which is text describing the content of the candidate voice and its characteristics. A candidate voice may be, for example, speech generated in advance by speech synthesis, a recording of a natural human voice, or a sound effect that is not a human voice. It is also effective to store candidate voices that are written the same way but differ in nuance, such as "Hmm (hesitation)" and "Hmm (agreement)" shown in the voice selection area 520 in FIG. 3. Speech synthesis normally produces synthesized sound with the same waveform whenever the written form is the same. People, however, routinely vary utterances that share a written form according to context, so that, seen as audio signals, they differ in duration, power, spectrum, and pitch contour. Storing data that reflects such differences in nuance lets the user hold a more natural conversation.
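The link between a candidate voice and its candidate voice text can be pictured as a simple keyed store. The sketch below is only an assumed data layout: the class names `CandidateVoice` and `VoiceDataStore`, the method names, and the placeholder audio bytes are hypothetical and do not come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateVoice:
    text: str    # candidate voice text shown on the operation screen
    audio: bytes  # prerecorded audio (placeholder bytes in this sketch)


class VoiceDataStore:
    """Keeps each prerecorded voice linked to the text that describes it."""

    def __init__(self) -> None:
        self._by_text = {}

    def register(self, text: str, audio: bytes) -> None:
        self._by_text[text] = CandidateVoice(text, audio)

    def candidate_texts(self) -> list:
        # Labels to list in the voice selection area, in registration order.
        return list(self._by_text)

    def lookup(self, selected_text: str) -> CandidateVoice:
        # Returns the voice stored for the label the user selected.
        return self._by_text[selected_text]
```

Because the nuance is part of the label, "Hmm (hesitation)" and "Hmm (agreement)" are distinct keys even though the spoken word is written the same way.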

(Text data storage unit 104)
The text data storage unit 104 is connected to the selected text acquisition unit 110. It stores candidate texts, which are preregistered text data and candidates for messages to be transmitted to the second chat terminal device 300 connected via the network. A candidate text may be, for example, a frequently used phrase or a sentence that is known in advance to be needed. Particularly when speech synthesis is used, as in this embodiment, the natural tempo of conversation is lost if text input takes a long time. Registering frequently used phrases and phrases known to be needed as candidate texts reduces the time spent on text input, avoids needless silences, and makes the conversation pleasant for listeners as well.

The voice data storage unit 102 and the text data storage unit 104 may be physically the same storage unit or separate storage units. Specific examples include, but are not limited to, magnetic recording media such as a hard disk, and nonvolatile memories such as EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, MRAM (Magnetoresistive Random Access Memory), FeRAM (Ferroelectric Random Access Memory), and PRAM (Phase change Random Access Memory).

(Selected voice acquisition unit 106)
The selected voice acquisition unit 106 is connected to the voice data storage unit 102, the voice output unit 120, the display control unit 112, and the text transmission unit 114. From the voice data storage unit 102 it acquires the selected voice text, which is the candidate voice text selected by the user from among the candidate voice texts displayed in the voice selection area 520 of the operation screen 500 shown in FIG. 3, and the selected voice, which is the candidate voice stored in association with the selected voice text. The selected voice acquisition unit 106 inputs the selected voice to the voice output unit 120 and inputs the selected voice text to the display control unit 112 and the text transmission unit 114.

(Input text acquisition unit 108)
The input text acquisition unit 108 is connected to the speech synthesis unit 118, the display control unit 112, and the text transmission unit 114. It acquires the input text, which is the text data entered by the user in the text input area 510 of the operation screen 500 shown in FIG. 3. The input text acquisition unit 108 inputs the input text to the display control unit 112 and the text transmission unit 114, and also to the speech synthesis unit 118.

(Selected text acquisition unit 110)
The selected text acquisition unit 110 is connected to the text data storage unit 104, the speech synthesis unit 118, the display control unit 112, and the text transmission unit 114. From the text data storage unit 104 it acquires the selected text, which is the candidate text selected by the user from among the candidate texts displayed in the text selection area 530 of the operation screen 500 shown in FIG. 3. The selected text acquisition unit 110 inputs the acquired selected text to the display control unit 112 and the text transmission unit 114, and also to the speech synthesis unit 118.
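The three acquisition units hand their results to different downstream units. The following sketch summarizes that fan-out as a routing function. The kind labels, destination labels, and the function name `route_message` are hypothetical names chosen for illustration; only the routing itself follows the description above.

```python
from typing import Optional


def route_message(kind: str, text: str,
                  recorded_audio: Optional[bytes] = None) -> list:
    """Return (destination, payload) pairs for one confirmed remark."""
    # Every kind of remark is shown on screen and sent to the other terminal as text.
    deliveries = [("display", text), ("transmit", text)]
    if kind in ("input_text", "selected_text"):
        # Typed or preset text has no recording, so it goes to speech synthesis.
        deliveries.append(("synthesize", text))
    elif kind == "selected_voice":
        if recorded_audio is None:
            raise ValueError("a selected voice needs its prerecorded audio")
        # A preset voice skips synthesis and is played back as recorded.
        deliveries.append(("voice_output", recorded_audio))
    else:
        raise ValueError(f"unknown message kind: {kind}")
    return deliveries
```

The key difference is that a selected voice bypasses synthesis, which is what allows nuance recorded in advance to be preserved.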

(Text transmission unit 114)
The text transmission unit 114 is connected to the selected voice acquisition unit 106, the input text acquisition unit 108, the selected text acquisition unit 110, and the communication network 200. It is a communication interface that transmits the text supplied to it over the communication network 200. For example, the text transmission unit 114 transmits the selected voice text, the input text, and the selected text supplied by the selected voice acquisition unit 106, the input text acquisition unit 108, and the selected text acquisition unit 110, respectively, to the second chat terminal device 300 via the communication network 200.

(Text reception unit 116)
The text reception unit 116 is connected to the display control unit 112 and the communication network 200. It is a communication interface that receives text via the communication network 200. For example, the text reception unit 116 inputs text received from the second chat terminal device 300 to the display control unit 112.

(Speech synthesis unit 118)
The speech synthesis unit 118 is connected to the input text acquisition unit 108, the selected text acquisition unit 110, the voice output unit 120, and the voice output time calculation unit 122. It generates synthesized speech from the text data supplied to it. In this embodiment, for example, the speech synthesis unit 118 receives the input text from the input text acquisition unit 108 and the selected text from the selected text acquisition unit 110. It generates synthesized speech from each of these text data and inputs the synthesized speech to the voice output unit 120. The speech synthesis unit 118 also inputs, to the voice output time calculation unit 122, information on the length of time required to output the synthesized speech as sound.

In this embodiment the speech synthesis unit 118 is installed in the first chat terminal device 100, but it may instead be installed in the second chat terminal device 300. In that case, the first chat terminal device 100 transmits text data such as the selected text and the input text to the second chat terminal device 300, and the text data is synthesized into speech in the second chat terminal device 300.

(Voice output unit 120)
The voice output unit 120 is connected to the selected voice acquisition unit 106, the speech synthesis unit 118, the voice output time calculation unit 122, and a voice output device (not shown). As described above, this embodiment assumes that the first chat terminal device 100 and the second chat terminal device 300 are in the same venue, so the voice output unit 120 is connected directly to a voice output device such as a speaker. For example, the voice output unit 120 receives the selected voice from the selected voice acquisition unit 106 and the synthesized speech from the speech synthesis unit 118. The voice output unit 120 then performs DA (Digital to Analog) conversion on the selected voice and the synthesized speech and inputs the resulting analog audio signal to a voice output device such as an external speaker.

The voice output unit 120 may also be connected to the communication network 200 instead of a voice output device. In that case, the voice output unit 120 transmits the supplied voice to the communication network 200 as digital data without DA conversion. The voice output unit 120 may compress voice data such as the selected voice and the synthesized speech before transmission.

(Voice output time calculation unit 122)
The voice output time calculation unit 122 is connected to the speech synthesis unit 118, the voice output unit 120, and the display control unit 112. It has a clock function and calculates the time remaining until output of the selected voice and the synthesized speech finishes. The voice output time calculation unit 122 calculates the remaining output time from the time length information supplied by the speech synthesis unit 118 and inputs the resulting remaining-time information to the display control unit 112.
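One plausible way to combine a clock with the supplied time length is sketched below. The class name `OutputTimer`, its methods, and the use of an injectable monotonic clock are assumptions made for this example; the source only states that the unit has a clock function and derives the remaining time from the time length information.

```python
import time


class OutputTimer:
    """Tracks how much of the current utterance is still to be played."""

    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock     # injectable so the timer can be tested
        self._ends_at = None    # clock value at which output finishes

    def start(self, duration_s: float) -> None:
        # duration_s is the time length reported for the utterance being output.
        self._ends_at = self._clock() + duration_s

    def remaining(self) -> float:
        # Remaining output time in seconds, never negative.
        if self._ends_at is None:
            return 0.0
        return max(0.0, self._ends_at - self._clock())
```

A display loop could poll `remaining()` periodically and pass the value to whatever draws the on-screen indicator.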

(Display control unit 112)
The display control unit 112 is connected to the selected voice acquisition unit 106, the input text acquisition unit 108, the selected text acquisition unit 110, the text transmission unit 114, the text reception unit 116, and the voice output time calculation unit 122. It is a functional unit that controls what is shown on a display device (not shown) connected to the first chat terminal device 100, for example the operation screen 500 shown in FIG. 3.

An example of the display control performed by the display control unit 112 is described here using the operation screen 500 shown in FIG. 3. The display control unit 112 performs all control related to the display of the operation screen 500. For example, when the user enters text with an input unit (not shown), the display control unit 112 receives the text from the input text acquisition unit 108 one character at a time and displays it in the text input area 510 in real time. When the entered text is confirmed, for example with the Enter key, the display control unit 112 displays that input text in the message display area 540. When a message confirmed at the second chat terminal device 300 connected via the communication network 200 is received from the text reception unit 116, the display control unit 112 likewise displays the received text in the message display area 540. Examples of the input unit include, but are not limited to, operation input devices such as a keyboard and a mouse, buttons, direction keys, rotary selectors such as a jog dial, and combinations of these.
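The echo-then-confirm behavior can be modeled with a small state holder. The sketch below is a simplified assumption of how such a screen model might look: the class `ChatScreen`, its attribute and method names, and the use of a newline character to stand for the Enter key are all hypothetical.

```python
class ChatScreen:
    """Toy model of the text input area and the message display area."""

    def __init__(self) -> None:
        self.input_area = ""    # text being typed, echoed character by character
        self.message_log = []   # confirmed remarks from both sides, oldest first

    def on_key(self, key: str):
        """Handle one keystroke; return the confirmed text on Enter, else None."""
        if key == "\n":  # "\n" stands in for the Enter key in this sketch
            if not self.input_area:
                return None
            confirmed, self.input_area = self.input_area, ""
            self.message_log.append(("local", confirmed))
            return confirmed
        self.input_area += key
        return None

    def on_received(self, text: str) -> None:
        # A remark confirmed at the other terminal goes straight to the log.
        self.message_log.append(("remote", text))
```

The confirmed text returned by `on_key` is what would then be transmitted and synthesized; nothing is spoken while the user is still typing, which is how the user keeps control of timing.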

The display control unit 112 also acquires the candidate voice texts stored in the voice data storage unit 102 and displays them selectably in the voice selection unit 520 of the operation screen 500. For example, when the user selects one of the candidate voice texts, the display control unit 112 displays the selected candidate voice text (the selected voice text) in the message display unit 540.

The display control unit 112 also acquires the candidate texts stored in the text data storage unit 104 and displays them selectably in the text selection unit 530 of the operation screen 500. For example, when the user selects one of the candidate texts, the display control unit 112 displays the selected candidate text (the selected text) in the message display unit 540.

The display control unit 112 also visually displays the remaining voice output time, based on the information about the remaining output time received from the voice output time calculation unit 122. The display may, for example, show a number of bars corresponding to the remaining time, as in the first voice output time display unit 550 of the operation screen 500. Alternatively, as in the second voice output time display unit 555, the remaining output time may be shown as a numerical value in clock format.
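The two display styles for the remaining output time can be illustrated with a short sketch. It is only an illustration under assumptions that do not come from the specification: the function names, the scale of one bar per second, the cap of 20 bars, and the MM:SS layout chosen for the "clock format" are all hypothetical.

```python
import math


def bar_count(remaining_s: float, seconds_per_bar: float = 1.0, max_bars: int = 20) -> int:
    """Number of bars to draw for the remaining time (hypothetical scale and cap)."""
    if remaining_s <= 0:
        return 0
    return min(max_bars, math.ceil(remaining_s / seconds_per_bar))


def clock_text(remaining_s: float) -> str:
    """Remaining time as MM:SS, rounded up to whole seconds (assumed layout)."""
    total = max(0, math.ceil(remaining_s))
    minutes, seconds = divmod(total, 60)
    return f"{minutes:02d}:{seconds:02d}"


def render(remaining_s: float) -> str:
    """Combine both styles on one line, e.g. for a text-only screen."""
    return "|" * bar_count(remaining_s) + " " + clock_text(remaining_s)
```

Calling `render` repeatedly while the voice is playing would shrink the row of bars and count the clock down together.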

(Example of effects of the first embodiment)
The functional configuration of the first chat terminal device 100 according to the first embodiment of the present invention has been described above. With this first chat terminal device 100, the user can select pre-registered text and pre-registered voices in addition to entering text with an ordinary input unit such as a keyboard. With conventional input means, the user has to type text on a keyboard or the like, and depending on the user's proficiency, smooth conversation is sometimes impossible. In particular, in a voice conversation such as that of this embodiment, needless silences make the conversation unpleasant for the listener. The input means of this embodiment is therefore a selection format in which the text or voice to be input is selected by an operation such as a click with an input unit such as a mouse. With selection-type input, even a user who is not proficient at input operations can output the desired text and voice at the moment he or she wants to speak.

In addition, because text registered in advance in the text data storage unit is displayed selectably on the operation screen, even a long sentence can be entered and converted to voice with a single click. Using prepared text prevents typing errors, and because no key input is needed, input time is reduced even for users who are not proficient with a keyboard. The quality of the voice output can therefore be improved. Unnecessary silence is reduced for the listener, and the respondent spends less time facing the terminal and can therefore pay attention to the surroundings.

In addition, a voice registered in advance in the voice data storage unit can be output with a single click by selecting it on the operation screen. The voice output here may have been generated in advance by speech synthesis, but recording a natural human voice is more effective. Because speech synthesis normally produces the same waveform for text with the same notation, it often cannot express the subtle nuances of words, and this has limited the expressiveness of conversation using synthesized speech. By registering frequently used words in advance, for example back-channel responses (aizuchi) and other words meant to convey emotion, the user can choose among voices that differ in duration, power, spectrum, and pitch variation according to the context.
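One way to picture several recordings that share the same notation is the following sketch. The record layout, the `label` field used to tell variants apart, and the file paths are hypothetical; the specification only requires that recordings with the same notation but different voice signals be stored, and be selectable, as separate candidates.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidateVoice:
    text: str        # candidate voice text shown on the operation screen
    label: str       # hypothetical hint that distinguishes variants
    audio_path: str  # hypothetical location of the pre-recorded voice


def selectable_items(voices: List[CandidateVoice]) -> List[str]:
    """Return one selectable row per recording, even when the notation repeats."""
    return [f"{v.text} ({v.label})" for v in voices]


voices = [
    CandidateVoice("I see", "bright", "voices/i_see_bright.wav"),
    CandidateVoice("I see", "doubtful", "voices/i_see_doubtful.wav"),
]
```

Here `selectable_items(voices)` yields two distinct rows for the same words, so the user can pick the nuance that fits the context.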

In addition, because the first chat terminal device 100 according to this embodiment has the voice output time calculation unit 122, the user can see on the operation screen the remaining output time of the voice the user has input and of the synthesized speech generated from the entered text. The first chat terminal device 100 is assumed to be used by a person with both hearing and speech impairments, so the remaining voice output time is displayed as an aid for grasping the surrounding situation. Because of the hearing impairment, the user cannot follow the state of the conversation by sound. By referring to the message display unit and the voice output time display unit on the operation screen, the user can grasp how the voice conversation is progressing.

<Second Embodiment>
Next, the first chat terminal device 100 of the speech synthesis chat system according to the second embodiment of the present invention is described with reference to FIG. 4. FIG. 4 is a block diagram showing the functional configuration of the first chat terminal device according to the second embodiment. The speech synthesis chat system according to the second embodiment differs from the first embodiment in part of the functional configuration of the first chat terminal device 100. Description of the configuration shared with the first embodiment is therefore omitted.

(Voice control unit 124)
The first chat terminal device 100 according to the second embodiment differs from that of the first embodiment in that it further includes a voice control unit 124. The voice control unit 124 is connected to the selected voice acquisition unit 106, the input text acquisition unit 108, the selected text acquisition unit 110, the display control unit 112, the text transmission unit 114, and the speech synthesis unit 118. The voice control unit 124 is a functional unit that controls the conversion of input text to voice. It controls whether text is converted to voice at all, and when it detects text that would not sound good to a third party if converted to voice, it controls the input text, for example by correcting it.

An example of the control performed by the voice control unit 124 is described with reference to FIG. 5. FIG. 5 is a flowchart showing an example of the operation of the voice control unit according to the second embodiment. First, in step S100, the voice control unit 124 determines whether the input text comes from the selected voice acquisition unit 106. If so, in step S114 it outputs the input text to the display control unit 112 and the text transmission unit 114. When a selected voice text is input from the selected voice acquisition unit 106, that unit passes the selected voice text to the voice control unit 124 and, at the same time, passes the selected voice to the voice output unit 120. If the selected voice text were also passed to the speech synthesis unit 118, the same content would be output as voice twice; speech synthesis is therefore not used for input from the selected voice acquisition unit 106.

If it is determined in step S100 that the input does not come from the selected voice acquisition unit 106, that is, that it comes from the input text acquisition unit 108 or the selected text acquisition unit 110, it is next determined in step S102 whether the input text is identical to the immediately preceding text. The voice control unit 124 stores the text it most recently output to the speech synthesis unit 118 and compares the input text data with the stored text data. If they are identical, it is determined in step S104 whether to confirm the input. This check and confirmation step prevents the same text from being converted to voice twice in a row because of an operating error. Such an error is especially likely for input from the selected text acquisition unit 110, since the user can enter text with a single click. Outputting the same voice twice not only wastes the listener's time but also makes the exchange feel mechanical, which makes the voice communication less natural.

Whether to confirm the input in step S104 may be decided by input from the user. For example, a message stating that the text is identical to the immediately preceding text data is displayed on the operation screen, together with a screen prompting the user to choose whether to confirm the input. If the input was an operating error and the user chooses not to confirm it, then in step S112 the input text is not output to any unit and the processing is aborted.

If the user chooses in step S104 to confirm the input, or if it is determined in step S102 that the text is not identical to the immediately preceding text, it is determined in step S106 whether the input text contains a registered word. The determination in step S106 is described in detail with reference to FIG. 6. FIG. 6 shows an example of the replacement table held by the voice control unit in this embodiment. The replacement table 600 includes registered words 610, restriction information 620, and modified words 630. Words that should be corrected when they appear in the input text are registered in advance as registered words 610. Each registered word 610 is linked to a modified word 630, and may additionally be linked to restriction information 620. In the example of FIG. 6, if 「でｓ。」 (a mistyped "desu.") is registered as the registered word 610, 「です。」 as the modified word 630, and "end of sentence" as the restriction, the voice control unit 124 refers to the replacement table 600 and replaces 「でｓ。」 with 「です。」 wherever it occurs at the end of a sentence in the input text.
One use of such a replacement table is to handle typing errors in keyboard input. For example, it is effective to analyze the user's past chat logs and register the words that the user frequently mistypes. Text containing typing errors often produces voice that is hard for the listener to follow when it is converted to voice, so using the replacement table 600 leads to improved voice quality. Another use of the replacement table is to register proper nouns, numerical values, terms prohibited in broadcasting, and other words that should be concealed. This is effective when the synthesized speech is used as-is in a broadcast over public airwaves or is played in front of a large audience.
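The replacement table 600 and its "end of sentence" restriction can be sketched as follows. This is a minimal illustration, not the structure used in the embodiment: the `Entry` type, the `"end_of_sentence"` marker string, and the `apply_table` function are hypothetical, and for simplicity the sketch treats the end of the whole input message as the end of a sentence.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Entry:
    registered: str                    # registered word 610
    corrected: str                     # modified word 630
    restriction: Optional[str] = None  # restriction information 620 (hypothetical encoding)


def apply_table(text: str, table: Sequence[Entry]) -> str:
    """Replace registered words with their modified words, honouring the restriction."""
    for entry in table:
        if entry.restriction == "end_of_sentence":
            # Assumption: the end of the message stands in for the end of a sentence.
            if text.endswith(entry.registered):
                text = text[: len(text) - len(entry.registered)] + entry.corrected
        else:
            # Unrestricted entries are replaced wherever they occur.
            text = text.replace(entry.registered, entry.corrected)
    return text


table = [
    Entry("でｓ。", "です。", "end_of_sentence"),  # the example of FIG. 6
    Entry("teh", "the"),                          # hypothetical unrestricted entry
]
```

With this table, `apply_table("テストでｓ。", table)` corrects the trailing typo, while the same characters in the middle of a message are left alone.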

If it is determined in step S106 that a registered word is contained, the voice control unit 124 replaces the registered word with the modified word in step S108 and, in step S110, outputs the resulting text data to the speech synthesis unit 118, the display control unit 112, and the text transmission unit 114.
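The flow of FIG. 5 described above (steps S100 to S114) can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment. The class name `VoiceControl`, the `confirm` callback standing in for the step S104 prompt, and the lists standing in for the speech synthesis unit 118 and for the display control unit 112 and text transmission unit 114 are all hypothetical. The sketch also assumes that text containing no registered word proceeds directly to step S110, and it omits the restriction information 620.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class VoiceControl:
    replacements: Dict[str, str]        # registered word -> modified word
    confirm: Callable[[str], bool]      # hypothetical stand-in for the S104 prompt
    last_synthesized: Optional[str] = None
    to_synth: List[str] = field(default_factory=list)    # stands in for unit 118
    to_display: List[str] = field(default_factory=list)  # stands in for units 112 and 114

    def handle(self, text: str, from_selected_voice: bool) -> bool:
        """Process one input; return False when the input is discarded (S112)."""
        # S100: input from the selected voice acquisition unit is never synthesized.
        if from_selected_voice:
            self.to_display.append(text)  # S114: display and transmit only
            return True
        # S102: same as the text most recently sent to the synthesizer?
        if text == self.last_synthesized:
            # S104: ask the user whether to confirm the repeated input.
            if not self.confirm(text):
                return False              # S112: output nothing
        # S106 / S108: replace any registered words.
        for registered, corrected in self.replacements.items():
            text = text.replace(registered, corrected)
        # S110: output to synthesis, display, and transmission.
        self.to_synth.append(text)
        self.to_display.append(text)
        self.last_synthesized = text      # stored as output to the synthesizer
        return True
```

Passing a `confirm` callback that always returns `False` gives the variant in which an identical consecutive input is simply rejected.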

The voice control unit 124 may additionally determine whether to confirm the input when more than a preset number of characters is entered at once. Also, although the decision for text identical to the immediately preceding text is made above on the basis of user input, this is not a limitation. For example, entry of exactly the same text twice in a row may simply be disallowed.

In the first embodiment, the text output from the selected voice acquisition unit 106, the input text acquisition unit 108, and the selected text acquisition unit 110 was input directly to the speech synthesis unit 118, the display control unit 112, and the text transmission unit 114. The second embodiment differs in that all of this text is input to the voice control unit 124, which in turn passes it to the speech synthesis unit 118, the display control unit 112, and the text transmission unit 114.

(Example of effects of the second embodiment)
As described above, by providing the voice control unit 124, the first chat terminal device 100 according to the second embodiment can correct operating errors, such as a word mistyped by a typo, to the correct notation and then convert the corrected text to voice. Also, when the same text is entered more than once by an operating error such as a double click, the user can decide whether to confirm the input. Consequently, when the input text contains something that would be a problem if converted to voice, it can be corrected before conversion. This makes for a voice conversation that is pleasant for the listener.

<Third Embodiment>
Next, the first chat terminal device 100 of the speech synthesis chat system according to the third embodiment of the present invention is described with reference to FIG. 7 and FIG. 8. FIG. 7 is a block diagram showing the functional configuration of the first chat terminal device according to the third embodiment. FIG. 8 shows examples of messages that the voice output control unit causes to be displayed in the third embodiment. Description of the configuration shared with the first and second embodiments is omitted below.

(Functional configuration of second chat terminal device 300)
First, the functional configuration of the second chat terminal device 300 is described with reference to FIG. 7. The second chat terminal device 300 mainly includes a voice output control unit 310, a text transmission/reception unit 320, a text display unit 330, and a text input unit 340.

(Voice output control unit 310)
The voice output control unit 310 is a functional unit that controls the voice output of the first chat terminal device 100. It transmits instruction signals for stopping voice output and for permitting voice output to the voice output unit 120 of the first chat terminal device 100. These instruction signals may be transmitted via the communication network 200.

(Functional configuration of first chat terminal device 100)
(Voice output unit 120)
On receiving an instruction signal indicating "output stop", the voice output unit 120 enters a state in which voice cannot be output. At this time, a message such as "Voice output will be stopped.", as shown in part a of FIG. 8, may be displayed on the screen of the first chat terminal device 100. On receiving an instruction signal indicating "output resume", the voice output unit 120 immediately resumes voice output. At this time, a message such as "Voice output will be resumed.", as shown in part b of FIG. 8, may be displayed on the screen of the first chat terminal device 100.

When it is undesirable to force output to stop and resume, the voice output unit 120 may, on receiving an instruction signal indicating "output stop", simply display a message without stopping voice output. In this case, a message such as "Please do not output voice.", shown in part c of FIG. 8, may be used. On receiving an instruction signal indicating "output resume", a message such as "You may output voice.", shown in part d of FIG. 8, can be displayed.
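The two behaviours of the voice output unit 120, forced control and message-only notification, can be sketched as a small state holder. The signal names, the `forced` flag, and the class and method names are hypothetical; the message strings are English renderings of the examples in FIG. 8.

```python
from dataclasses import dataclass

STOP, RESUME = "output_stop", "output_resume"  # hypothetical signal names

# (forced mode?, signal) -> message shown on the first chat terminal device
MESSAGES = {
    (True, STOP): "Voice output will be stopped.",     # FIG. 8, part a
    (True, RESUME): "Voice output will be resumed.",   # FIG. 8, part b
    (False, STOP): "Please do not output voice.",      # FIG. 8, part c
    (False, RESUME): "You may output voice.",          # FIG. 8, part d
}


@dataclass
class VoiceOutput:
    forced: bool = True   # True: signals actually block or unblock output
    enabled: bool = True  # whether voice can currently be output

    def on_signal(self, signal: str) -> str:
        """Apply an instruction signal and return the message to display."""
        if signal not in (STOP, RESUME):
            raise ValueError(f"unknown signal: {signal}")
        if self.forced:
            self.enabled = signal == RESUME
        return MESSAGES[(self.forced, signal)]

    def can_output(self) -> bool:
        return self.enabled
```

In message-only mode (`forced=False`) the state never changes, so the decision to stay silent is left to the user who reads the message.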

(Example of effects of the third embodiment)
By allowing the second chat terminal device 300 to control the voice of the first chat terminal device 100 in this way, voice output can be controlled to suit user A (a hearing person) of the second chat terminal device 300. User B (who has hearing and speech impairments) of the first chat terminal device 100 spends much of the time facing the screen and may not notice the surrounding situation in real time. When the speech synthesis chat system according to an embodiment of the present invention is used in a meeting with many participants or in a radio broadcast, it may be preferable to be able to forcibly control the output if voice that is unpleasant for the audience is being output.

In addition, because user B of the first chat terminal device 100 cannot hear voice, user B may be slow to judge whether the situation is appropriate for speaking. Displaying on user B's operation screen, in response to input from the second chat terminal device 300, a message indicating whether voice output is currently appropriate therefore avoids the risk of voice being output when it is not wanted.

Preferred embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is of course not limited to these examples. It is clear that a person with ordinary knowledge in the technical field to which the present invention belongs could conceive of various changes or modifications within the scope of the technical idea described in the claims, and these are naturally understood to fall within the technical scope of the present invention.

In this specification, the steps described in the flowchart include not only processing performed in time series in the stated order but also processing that is executed in parallel or individually without necessarily being processed in time series. Needless to say, the order of steps that are processed in time series may also be changed as appropriate.

DESCRIPTION OF SYMBOLS
100 First chat terminal device
102 Voice data storage unit
104 Text data storage unit
106 Selected voice acquisition unit
108 Input text acquisition unit
110 Selected text acquisition unit
112 Display control unit
114 Text transmission unit
116 Text reception unit
118 Speech synthesis unit
120 Voice output unit
122 Voice output time calculation unit
124 Voice control unit

Claims

An information processing apparatus having a chat function that enables conversation by exchanging messages with another information processing apparatus connected via a network, the information processing apparatus comprising:
a text data storage unit that stores candidate texts that are candidates for the message;
a voice data storage unit that stores candidate voices, which are pre-recorded voice data, and candidate voice texts, each of which is linked to a candidate voice and indicates the content of that candidate voice;
a display control unit that controls display of an operation screen and displays the candidate texts and the candidate voice texts selectably on the operation screen;
a selected text acquisition unit that accepts a selection by a user, as an utterance in the conversation, from among the candidate texts displayed on the operation screen and acquires a selected text, which is the selected candidate text;
a selected voice acquisition unit that acquires a selected voice text, which is the candidate voice text selected by the user from among the candidate voice texts displayed on the operation screen, and a selected voice, which is the candidate voice stored in correspondence with the selected voice text;
a voice output unit that outputs the selected voice or transmits the selected voice to the other information processing apparatus; and
a text transmission unit that transmits the selected text and the selected voice text to the other information processing apparatus.

The information processing apparatus according to claim 1, further comprising:
an input text acquisition unit that acquires input text entered by the user; and
a voice control unit that controls conversion to voice of the input text, the selected text, and the selected voice text.

The information processing apparatus according to claim 2, wherein the voice control unit
has a replacement table containing pre-registered registered words and correction words linked to the registered words, and,
when the input text and the selected text include a registered word, replaces the registered word in the input text and the selected text with the correction word corresponding to that registered word in the replacement table.

The information processing apparatus according to claim 2 or 3, further comprising
a speech synthesis unit that generates synthesized speech from the input text and the selected text,
wherein the voice output unit further outputs the synthesized speech or transmits it to the other information processing apparatus.

The information processing apparatus according to any one of claims 2 to 4, further comprising
a voice output time calculation unit that calculates an output time of the synthesized speech generated from the input text and the selected text,
wherein the display control unit displays the remaining output time of the synthesized speech on the operation screen on the basis of the output time input from the voice output time calculation unit.

The information processing apparatus according to claim 1, wherein the voice output unit stops output in accordance with a voice stop instruction input from the other information processing apparatus.

The information processing apparatus according to claim 6, wherein the voice output unit displays a message on the operation screen in response to the voice stop instruction.

The information processing apparatus according to claim 1, wherein
the voice data storage unit stores voice data that have the same notation but different voice signals as separate voice data, and
the display control unit displays the candidate voice texts on the operation screen so that the separate voice data can be selected as different candidates.

An information processing method executed in an information processing apparatus that is connected to another information processing apparatus via a network and has a chat function for conversation by exchanging messages, the information processing method comprising:
storing candidate texts that are candidates for the message;
storing candidate voices, which are pre-recorded voice data, and candidate voice texts, each of which is linked to a candidate voice and indicates the content of that candidate voice;
controlling display of an operation screen to display the candidate texts and the candidate voice texts selectably on the operation screen;
acquiring a selected text, which is the candidate text selected by a user from among the candidate texts displayed on the operation screen;
acquiring a selected voice text, which is the candidate voice text selected by the user from among the candidate voice texts displayed on the operation screen, and a selected voice, which is the candidate voice stored in correspondence with the selected voice text;
outputting the selected voice or transmitting the selected voice to the other information processing apparatus; and
transmitting the selected text and the selected voice text to the other information processing apparatus.