JP2014150442A

JP2014150442A - Telephone conversation system and telephone conversation relay method

Info

Publication number: JP2014150442A
Application number: JP2013018631A
Authority: JP
Inventors: Wataru Kikuchi; 渉菊地; Misako Takamatsu; 美砂子高松; Yaoki Kobayashi; 八起小林; Jun Watanabe; 純渡邉
Original assignee: Nippon Telegraph and Telephone Corp; Nippon Telegraph and Telephone East Corp
Current assignee: Nippon Telegraph and Telephone Corp; Nippon Telegraph and Telephone East Corp
Priority date: 2013-02-01
Filing date: 2013-02-01
Publication date: 2014-08-21
Anticipated expiration: 2033-02-01
Also published as: JP6064209B2

Abstract

PROBLEM TO BE SOLVED: To make it possible to understand speech content of a telephone conversation partner more easily. SOLUTION: The telephone conversation system comprises: a first telephone conversation relay section for receiving voice and video transmitted from a first telephone conversation terminal; a first voice recognition section for converting voice received by the first telephone conversation relay section into text data to generate text video which indicates the text data; a second telephone conversation relay section for receiving the voice and video transmitted from a second telephone conversation terminal; a second voice recognition section for converting the voice received by the second telephone conversation relay section into the text data to generate text video which indicates the text data; and a synthesis section for synthesizing the video received by the first telephone conversation relay section, the video received by the second telephone conversation relay section, the text video generated by the first voice recognition section, and the text video generated by the second voice recognition section to generate a synthesized video.

Description

The present invention relates to call device technology for making calls.

In recent years, ICT (Information and Communication Technology) services have advanced. As one specific example, a videophone terminal has been proposed that can convey a user's feelings and impressions to the other party in an easily understandable manner (see Patent Document 1).

Patent Document 1: JP 2009-112027 A

However, the spread of ICT services remains limited. For example, it is hard to say that ICT services are sufficiently widespread among elderly people.
Elderly people have various physical handicaps due to aging, and these handicaps hinder their use of ICT services. For example, declining hearing makes it difficult to hear what the other party to a call is saying.

In view of the above circumstances, an object of the present invention is to provide a technique that makes it possible to understand the utterance content of the other party to a call more easily.

One aspect of the present invention is a call system comprising: a first call relay unit that receives audio and video transmitted from a first call terminal; a first voice recognition unit that converts the audio received by the first call relay unit into text data and generates a text video representing the text data; a second call relay unit that receives audio and video transmitted from a second call terminal; a second voice recognition unit that converts the audio received by the second call relay unit into text data and generates a text video representing the text data; and a synthesis unit that generates a synthesized video by synthesizing the video received by the first call relay unit, the video received by the second call relay unit, the text video generated by the first voice recognition unit, and the text video generated by the second voice recognition unit, wherein the first call relay unit transmits the synthesized video to the first call terminal, and the second call relay unit transmits the synthesized video to the second call terminal.

One aspect of the present invention is the call system described above, further comprising: a first input unit that accepts input of the text video generated by the first voice recognition unit; a second input unit that accepts input of the audio and video received by the first call relay unit; a third input unit that accepts input of the audio and video received by the second call relay unit; and a fourth input unit that accepts input of the text video generated by the second voice recognition unit, wherein the synthesis unit generates the synthesized video by arranging each video input to the first to fourth input units in a predetermined region of the screen.

One aspect of the present invention is a call relay method comprising: a first call receiving step of receiving audio and video transmitted from a first call terminal; a first voice recognition step of converting the audio received in the first call receiving step into text data and generating a text video representing the text data; a second call receiving step of receiving audio and video transmitted from a second call terminal; a second voice recognition step of converting the audio received in the second call receiving step into text data and generating a text video representing the text data; a synthesis step of generating a synthesized video by synthesizing the video received in the first call receiving step, the video received in the second call receiving step, the text video generated in the first voice recognition step, and the text video generated in the second voice recognition step; a first transmission step of transmitting the synthesized video to the first call terminal; and a second transmission step of transmitting the synthesized video to the second call terminal.

According to the present invention, it becomes possible to understand the utterance content of the other party to a call more easily.

FIG. 1 is a system configuration diagram of the call system 100.
FIG. 2 is a schematic diagram showing an outline of the processing of the synthesis unit 60.
FIG. 3 is a schematic diagram showing a specific example of a synthesized video.
FIG. 4 is a sequence diagram showing a specific example of the processing flow when a call session is established in the call system 100.
FIG. 5 is a sequence diagram showing a specific example of the call processing flow in the call system 100.
FIG. 6 is a system configuration diagram of a first modification (call system 100a) of the call system 100.
FIG. 7 is a system configuration diagram of a second modification (call system 100b) of the call system 100.
FIG. 8 is a schematic diagram showing a specific example of a synthesized video generated in the call system 100b.

Hereinafter, a call system according to an embodiment of the present invention will be described.
FIG. 1 is a system configuration diagram of the call system 100. The call system 100 includes two call terminals 10 (10-1, 10-2) and a relay device 90. The call terminals 10 and the relay device 90 are connected via networks 11 (11-1, 11-2) so that bidirectional communication is possible.

The call terminal 10 is operated by a user who makes a call. The call terminal 10 includes a voice input unit, an imaging unit, a voice output unit, and a display unit.
The voice input unit is a voice input device such as a microphone or a handset, and inputs the speech of the user (speaker) of the call terminal 10. The voice input unit may instead be an interface for connecting a voice input device to the call terminal 10. In this case, the voice input unit inputs the audio signal generated by the voice input device to the call terminal 10.

The imaging unit is an imaging device such as a camera, and captures the face of the user (speaker) of the call terminal 10. The imaging unit may instead be an interface for connecting an imaging device to the call terminal 10. In this case, the imaging unit inputs the video signal generated by the imaging device to the call terminal 10.

The voice output unit is a voice output device such as a speaker, and outputs the speech of the person with whom the user of the call terminal 10 is talking. The voice output unit may instead be an interface for connecting a voice output device to the call terminal 10. In this case, the voice output unit generates an electrical signal representing the speech and outputs it to the voice output device.

The display unit is an image display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, or an organic EL (Electro Luminescence) display. The display unit displays the video data generated by the relay device 90. The display unit may instead be an interface for connecting an image display device to the call terminal 10. In this case, the display unit generates a video signal for displaying the video data generated by the relay device 90 and outputs the video signal to the image display device connected to it.

The call terminal 10 makes calls with the relay device 90 possible by communicating over the network 11 using a predetermined protocol. For example, the call terminal 10 establishes a call session with the relay device 90 by operating in accordance with SIP (Session Initiation Protocol). The call terminal 10-1 and the call terminal 10-2 can talk with each other once each has established a call session with the relay device 90. That is, when the call terminal 10-1 and the call terminal 10-2 have each established a call session with the relay device 90, the audio input by the voice input unit of the call terminal 10-1 and the video captured by its imaging unit are output at the call terminal 10-2. Similarly, the audio input by the voice input unit of the call terminal 10-2 and the video captured by its imaging unit are output at the call terminal 10-1. The user of the call terminal 10-1 and the user of the call terminal 10-2 can therefore talk while watching the video.

The relay device 90 includes a CPU (Central Processing Unit), a memory, an auxiliary storage device, and the like connected by a bus, and executes a relay program. By executing the relay program, the relay device 90 functions as a device including a plurality of call relay units 20 (20-1, 20-2), a plurality of splitters 30 (30-1, 30-2), a plurality of voice recognition units 40 (40-1, 40-2), and a synthesis device 70. All or some of the functions of the relay device 90 may be implemented using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The relay program may be recorded on a computer-readable recording medium. A computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system.

The call relay unit 20 (20-1, 20-2) makes calls with the call terminal 10 possible by communicating over the network 11 using a predetermined protocol. For example, the call relay unit 20 establishes a call session with the call terminal 10 by operating in accordance with SIP.

The call relay unit 20-1 establishes a call session with the call terminal 10-1 via the network 11-1. The call relay unit 20-1 outputs the audio received from the call terminal 10-1 to the splitter 30-1, and outputs the video received from the call terminal 10-1 to the second input unit 52. The call relay unit 20-2 establishes a call session with the call terminal 10-2 via the network 11-2. The call relay unit 20-2 outputs the audio received from the call terminal 10-2 to the splitter 30-2, and outputs the video received from the call terminal 10-2 to the third input unit 53. In addition, the call relay unit 20-1 transmits the synthesized video and audio output by the synthesis unit 60 to the call terminal 10-1 via the network 11-1. The call relay unit 20-2 transmits the synthesized video and audio output by the synthesis unit 60 to the call terminal 10-2 via the network 11-2.

The splitter 30 (30-1, 30-2) distributes the audio output by the call relay unit 20 to a plurality of destinations. The splitter 30-1 distributes the audio output by the call relay unit 20-1 to the voice recognition unit 40-1 and the second input unit 52. The splitter 30-2 distributes the audio output by the call relay unit 20-2 to the voice recognition unit 40-2 and the third input unit 53.

The voice recognition unit 40 (40-1, 40-2) converts the content of the input audio into text data. The voice recognition unit 40 then generates a video that displays the characters representing the text data (a text video). The voice recognition unit 40-1 generates a text video based on the audio distributed from the splitter 30-1 and outputs it to the first input unit 51. The voice recognition unit 40-2 generates a text video based on the audio distributed from the splitter 30-2 and outputs it to the fourth input unit 54.

The synthesis device 70 includes a first input unit 51, a second input unit 52, a third input unit 53, a fourth input unit 54, and a synthesis unit 60. The first input unit 51 inputs the text video output from the voice recognition unit 40-1 to the synthesis unit 60. The second input unit 52 inputs the audio distributed from the splitter 30-1 and the video output from the call relay unit 20-1 to the synthesis unit 60. The third input unit 53 inputs the audio distributed from the splitter 30-2 and the video output from the call relay unit 20-2 to the synthesis unit 60. The fourth input unit 54 inputs the text video output from the voice recognition unit 40-2 to the synthesis unit 60.

The synthesis unit 60 generates a synthesized video by synthesizing the videos input by the first input unit 51 through the fourth input unit 54. The synthesis unit 60 outputs the synthesized video to both the call relay unit 20-1 and the call relay unit 20-2. The synthesis unit 60 also outputs the audio input by the second input unit 52 to the call relay unit 20-2, and outputs the audio input by the third input unit 53 to the call relay unit 20-1.

FIG. 2 is a schematic diagram showing an outline of the processing of the synthesis unit 60. The synthesis unit 60 divides one video plane into a plurality of regions and generates a synthesized video by placing a video or a text video in each region. In the specific example shown in FIG. 2, one video plane is divided into four regions.

In the first region 81, located at the lower left, the synthesis unit 60 places the text video for the audio from one call terminal 10 (for example, the call terminal 10-1). That is, the synthesis unit 60 places the text video input by the first input unit 51 in the first region 81. In the second region 82, located at the upper left, the synthesis unit 60 places the video from that same call terminal 10 (for example, the call terminal 10-1). That is, the synthesis unit 60 places the video input by the second input unit 52 in the second region 82.

In the third region 83, located at the upper right, the synthesis unit 60 places the video from the other call terminal 10 (for example, the call terminal 10-2). That is, the synthesis unit 60 places the video input by the third input unit 53 in the third region 83. In the fourth region 84, located at the lower right, the synthesis unit 60 places the text video for the audio from the other call terminal 10 (for example, the call terminal 10-2). That is, the synthesis unit 60 places the text video input by the fourth input unit 54 in the fourth region 84.
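The region assignment described for the synthesis unit 60 can be illustrated with a minimal Python sketch. The frame size, the equal-quadrant split, and the names `Rect`, `layout_regions`, and `INPUT_TO_REGION` are assumptions made for illustration; the description only states that one video plane is divided into four predetermined regions at the lower left, upper left, upper right, and lower right.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: int  # left edge in pixels
    y: int  # top edge in pixels (origin at the top-left of the frame)
    w: int  # width in pixels
    h: int  # height in pixels


def layout_regions(width, height):
    """Map region numbers 81-84 to rectangles (equal quadrants are an assumption)."""
    half_w, half_h = width // 2, height // 2
    return {
        81: Rect(0, half_h, half_w, half_h),       # lower left: text video, terminal 10-1
        82: Rect(0, 0, half_w, half_h),            # upper left: camera video, terminal 10-1
        83: Rect(half_w, 0, half_w, half_h),       # upper right: camera video, terminal 10-2
        84: Rect(half_w, half_h, half_w, half_h),  # lower right: text video, terminal 10-2
    }


# Input units 51-54 feed regions 81-84 respectively, as in the description.
INPUT_TO_REGION = {51: 81, 52: 82, 53: 83, 54: 84}

# Hypothetical 1280x720 output frame.
regions = layout_regions(1280, 720)
```

Keeping the layout as a plain lookup table means a different arrangement only requires changing the rectangles, not the code that draws each input into its region.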

FIG. 3 is a schematic diagram showing a specific example of a synthesized video. Placing a video in each of the first region 81 through the fourth region 84 produces a synthesized video such as that shown in FIG. 3. As shown in FIG. 3, the face of the user of the call terminal 10-1 is displayed in the second region 82, and the utterance content of the user of the call terminal 10-1 is displayed as text in the first region 81. Likewise, the face of the user of the call terminal 10-2 is displayed in the third region 83, and the utterance content of the user of the call terminal 10-2 is displayed as text in the fourth region 84.

FIG. 4 is a sequence diagram showing a specific example of the processing flow when a call session is established in the call system 100. FIG. 4 shows the processing flow up to the point where a call session is established between the call terminal 10 operated by Mr. A and the call terminal 10 operated by Mr. B.

First, Mr. A operates the call terminal 10 and inputs an instruction to call Mr. B (step S101). On receiving the instruction to call Mr. B, the call terminal 10 transmits a call request to the call connection device (step S102). A call request includes identification information indicating the plurality of call terminals 10 between which a call session is to be established. The call request transmitted in step S102 includes the identification information of Mr. A's call terminal 10 and the identification information of Mr. B's call terminal 10.

On receiving the call request, the call connection device calls each call terminal 10 indicated by the identification information included in the call request (steps S103 and S104). The call connection device also calls the call relay unit 20 associated with each call terminal 10 that it has called. That is, the call connection device calls the Mr. A side call relay unit 20 and the Mr. B side call relay unit 20 (steps S105 and S106).

Mr. A's call terminal and Mr. B's call terminal, having received the call from the call connection device, output a ring tone to notify their users of the incoming call. When a user operates the call terminal 10 to go off-hook, the call terminal 10 responds to the call connection device (steps S107 and S108).

The Mr. A side call relay unit 20 and the Mr. B side call relay unit 20, having received the call from the call connection device, automatically transition to the off-hook state in response to the incoming call and respond to the call connection device (steps S109 and S110).

The V-shaped calling process described above establishes a call session between Mr. A's call terminal 10 and the Mr. A side call relay unit 20 (step S111). The same V-shaped calling process establishes a call session between Mr. B's call terminal 10 and the Mr. B side call relay unit 20 (step S112). The Mr. A side call relay unit 20 and the Mr. B side call relay unit 20 are connected via the synthesis unit 60. Mr. A's call terminal 10 and Mr. B's call terminal 10 are therefore able to talk with each other.
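The session establishment sequence of FIG. 4 can be sketched as follows. This is a simplified illustration, not the signaling actually used: the function names, the tuple representation of a call leg, and the identifiers "A", "B", "20-A", and "20-B" are all hypothetical, and SIP messaging is not modeled.

```python
def handle_call_request(terminal_ids, relay_for):
    """Return the call legs the call connection device originates."""
    legs = []
    # Steps S103/S104: call every terminal named in the call request.
    for tid in terminal_ids:
        legs.append(("terminal", tid))
    # Steps S105/S106: also call the relay unit associated with each terminal.
    for tid in terminal_ids:
        legs.append(("relay", relay_for[tid]))
    return legs


def establish_sessions(terminal_ids, relay_for, answered):
    """Pair each terminal with its relay unit once both legs have answered."""
    sessions = []
    for tid in terminal_ids:
        relay = relay_for[tid]
        if ("terminal", tid) in answered and ("relay", relay) in answered:
            sessions.append((tid, relay))  # steps S111/S112
    return sessions


# Hypothetical identifiers for Mr. A's and Mr. B's terminals and relay units.
relay_for = {"A": "20-A", "B": "20-B"}
legs = handle_call_request(["A", "B"], relay_for)
# Assume every leg answers: terminals when the user goes off-hook (S107/S108),
# relay units automatically (S109/S110).
sessions = establish_sessions(["A", "B"], relay_for, set(legs))
```

In this sketch a session is formed only when both the terminal leg and its relay leg have answered, which mirrors the two-legged shape of the V-shaped calling process.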

FIG. 5 is a sequence diagram showing a specific example of the call processing flow in the call system 100. FIG. 5 shows the processing flow of a call between the call terminal 10 operated by Mr. A and the call terminal 10 operated by Mr. B.

Mr. A's call terminal 10 transmits Mr. A's input audio and video to the Mr. A side call relay unit 20 (step S201). The Mr. A side call relay unit 20 outputs the received audio to the Mr. A side voice recognition unit 40 via the splitter 30 (step S202). The Mr. A side call relay unit 20 also outputs the received audio and video to the synthesis device 70 (step S203). The Mr. A side voice recognition unit 40 performs voice recognition processing on the audio output from the Mr. A side call relay unit 20 and generates a text video (step S204). The Mr. A side voice recognition unit 40 outputs the generated text video to the synthesis device 70 (step S205).

Mr. B's call terminal 10 transmits Mr. B's input audio and video to the Mr. B side call relay unit 20 (step S206). The Mr. B side call relay unit 20 outputs the received audio to the Mr. B side voice recognition unit 40 via the splitter 30 (step S207). The Mr. B side call relay unit 20 also outputs the received audio and video to the synthesis device 70 (step S208). The Mr. B side voice recognition unit 40 performs voice recognition processing on the audio output from the Mr. B side call relay unit 20 and generates a text video (step S209). The Mr. B side voice recognition unit 40 outputs the generated text video to the synthesis device 70 (step S210).

The synthesis device 70 generates a synthesized video by synthesizing the audio and video output from the Mr. A side call relay unit 20, the audio and video output from the Mr. B side call relay unit 20, the text video output from the Mr. A side voice recognition unit 40, and the text video output from the Mr. B side voice recognition unit 40 (step S211).

The synthesis device 70 outputs the synthesized video and the audio output from the Mr. B side call relay unit 20 to the Mr. A side call relay unit 20 (step S212). The Mr. A side call relay unit 20 transmits the synthesized video and audio output by the synthesis device 70 to Mr. A's call terminal 10 (step S213). Mr. A's call terminal 10 displays the received synthesized video and outputs the audio (step S214).

The synthesis device 70 outputs the synthesized video and the audio output from the Mr. A side call relay unit 20 to the Mr. B side call relay unit 20 (step S215). The Mr. B side call relay unit 20 transmits the synthesized video and audio output by the synthesis device 70 to Mr. B's call terminal 10 (step S216). Mr. B's call terminal 10 displays the received synthesized video and outputs the audio (step S217).
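The call flow of FIG. 5 can be summarized in a short Python sketch. It is only an illustration under simplifying assumptions: audio and video are represented by plain strings, `recognize` is a stand-in for the voice recognition unit 40 rather than a real recognizer, and the function names are hypothetical. What it shows is the routing: both terminals receive the same synthesized video, while each receives the other side's audio.

```python
def recognize(audio):
    """Stand-in for a voice recognition unit 40: the 'audio' here is already a string."""
    return audio.strip()


def relay_tick(a_media, b_media):
    """Process one (audio, video) pair per side and return what each terminal receives."""
    a_audio, a_video = a_media
    b_audio, b_video = b_media
    # Steps S204/S209: each side's audio is converted into a text video.
    a_text, b_text = recognize(a_audio), recognize(b_audio)
    # Step S211: one synthesized video is built from both videos and both text videos,
    # keyed here by the region numbers 81-84.
    synthesized = {81: a_text, 82: a_video, 83: b_video, 84: b_text}
    # Steps S212/S215: both sides get the same video, but the other side's audio.
    return {"A": (synthesized, b_audio), "B": (synthesized, a_audio)}


out = relay_tick(("hello ", "frameA"), ("hi", "frameB"))
```

Because a terminal never receives its own audio back, this routing avoids echoing a speaker's voice to that same speaker while still showing both transcripts on both screens.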

In the call system 100 configured as described above, when a call is made using the call terminal 10, the other party's utterance content is displayed on the screen of the call terminal 10 as a text video. Even a user whose hearing has declined can therefore understand the other party's utterance content more easily.

Elderly people also face the problem that deterioration of the mouth and throat can prevent them from speaking as they intend. To address this problem, in the call system 100 described above, the user's own utterance content is displayed to the other party as a text video. Even a user who cannot speak as clearly as intended can therefore convey the utterance content to the other party more accurately, depending on the performance of the voice recognition unit 40.

<Modification>
FIG. 6 is a system configuration diagram of a first modification (call system 100a) of the call system 100. The call system 100a differs from the call system 100 in that it includes conversion units 41 (41-1, 41-2).

The voice recognition unit 40 in the call system 100a does not generate a text video; instead, it outputs the text data resulting from voice recognition to the conversion unit 41. The conversion unit 41 converts (translates) the sentences of the text data output by the voice recognition unit 40 into sentences in another language designated by the user of the call terminal 10. The conversion unit 41 generates a video that displays the characters representing the converted text data (a text video), and outputs the generated text video to the synthesis device 70. For example, the conversion unit 41-1 converts language X into language Y, and the conversion unit 41-2 converts language Y into language X.
This configuration makes it possible to converse smoothly with a user who speaks another language.
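The role of the conversion units 41 in the first modification can be sketched as below. The word-for-word dictionaries are toy stand-ins for a real translation engine, which the description does not specify; the names `make_converter`, `converter_41_1`, and `converter_41_2` and the sample vocabulary are assumptions for illustration only.

```python
# Toy lookup tables standing in for a translation engine (assumption, not from the source).
X_TO_Y = {"hello": "bonjour"}
Y_TO_X = {"bonjour": "hello"}


def make_converter(table):
    """Build a conversion unit 41 that maps recognized text into the other language."""
    def convert(text):
        # Words missing from the table pass through unchanged.
        return " ".join(table.get(word, word) for word in text.split())
    return convert


# Conversion unit 41-1 handles language X to Y; 41-2 handles Y to X.
converter_41_1 = make_converter(X_TO_Y)
converter_41_2 = make_converter(Y_TO_X)

# The converted text, not the raw recognition result, is what gets rendered as text video.
text_for_region_81 = converter_41_1("hello world")
text_for_region_84 = converter_41_2("bonjour")
```

Placing the conversion between recognition and rendering means the synthesis device 70 needs no changes: it still receives one text video per side.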

FIG. 7 is a system configuration diagram of a second modification (call system 100b) of the call system 100. The call system 100b differs from the call system 100 in the display mode of the text video generated by the voice recognition unit 40.

FIG. 8 is a schematic diagram showing a specific example of the composite video generated in the call system 100b. As shown in FIG. 8, the characters in the text video displayed in the first area 81 and those in the text video displayed in the fourth area 84 are placed at different heights. In the text video representing the content uttered earlier on the time axis, the characters are placed at a higher position; in the text video representing the content uttered later, the characters are placed at a lower position.

The determination of which of the call terminal 10-1 and the call terminal 10-2 spoke first may be made by the voice recognition unit 40-1 and the voice recognition unit 40-2. Specifically, each voice recognition unit 40 (40-1 and 40-2) transmits a signal (flag signal) indicating that it has performed voice recognition to the other voice recognition unit 40 every time it performs voice recognition. From the time it receives a flag signal until it next transmits a flag signal, a voice recognition unit 40 generates a text video in which the text is placed in a predetermined higher region. Conversely, from the time it transmits a flag signal until it next receives one, it generates a text video in which the text is placed in a predetermined lower region.
With this configuration, each user on the call can easily determine, from the displayed utterances, which party spoke most recently.
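The flag-signal exchange described above can be sketched as a small two-state machine per voice recognition unit. This is an illustrative sketch under stated assumptions: the class `RecognitionUnit`, the helper `connect`, the method names, and the initial position (`HIGH`) are hypothetical and are not given in the specification, which only states the behavior between sending and receiving a flag signal.

```python
from dataclasses import dataclass, field
from typing import Optional

HIGH = "high"  # predetermined higher region: the earlier utterance
LOW = "low"    # predetermined lower region: the later utterance


@dataclass
class RecognitionUnit:
    """Sketch of a voice recognition unit 40 in the call system 100b."""
    name: str
    # Assumption: the initial position is not stated in the source.
    position: str = HIGH
    # The other recognition unit; excluded from repr/eq to avoid cycles.
    peer: Optional["RecognitionUnit"] = field(default=None, repr=False, compare=False)

    def on_recognized(self) -> None:
        # Voice recognition was performed: send the flag signal to the peer.
        # From now until a flag signal is received, text goes in the low region.
        self.position = LOW
        if self.peer is not None:
            self.peer.on_flag_received()

    def on_flag_received(self) -> None:
        # The other party spoke more recently: until this unit sends its own
        # flag signal, its text goes in the high region.
        self.position = HIGH


def connect(a: RecognitionUnit, b: RecognitionUnit) -> None:
    """Wire two units together so each can signal the other."""
    a.peer, b.peer = b, a


unit_40_1 = RecognitionUnit("40-1")
unit_40_2 = RecognitionUnit("40-2")
connect(unit_40_1, unit_40_2)
```

With this wiring, calling `unit_40_1.on_recognized()` moves the text of unit 40-1 to the lower region and that of unit 40-2 to the higher region, and a later `unit_40_2.on_recognized()` swaps them, so the most recent speaker's text always sits lower.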

The embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to this embodiment, and also includes designs and the like that do not depart from the gist of the present invention.

DESCRIPTION OF SYMBOLS: 100 ... call system, 10 ... call terminal, 11 ... network, 20 ... call relay unit, 30 ... splitter, 40 ... voice recognition unit, 51 ... first input unit, 52 ... second input unit, 53 ... third input unit, 54 ... fourth input unit, 60 ... synthesis unit, 70 ... synthesis device

Claims

A call system comprising:
A first call relay unit that receives audio and video transmitted from a first call terminal;
A first voice recognition unit that converts the audio received by the first call relay unit into text data and generates a text video representing the text data;
A second call relay unit that receives audio and video transmitted from a second call terminal;
A second voice recognition unit that converts the audio received by the second call relay unit into text data and generates a text video representing the text data; and
A synthesis unit that generates a composite video by synthesizing the video received by the first call relay unit, the video received by the second call relay unit, the text video generated by the first voice recognition unit, and the text video generated by the second voice recognition unit,
Wherein the first call relay unit transmits the composite video to the first call terminal, and
The second call relay unit transmits the composite video to the second call terminal.

The call system according to claim 1, further comprising:
A first input unit that receives input of the text video generated by the first voice recognition unit;
A second input unit that receives input of the audio and video received by the first call relay unit;
A third input unit that receives input of the audio and video received by the second call relay unit; and
A fourth input unit that receives input of the text video generated by the second voice recognition unit,
Wherein the synthesis unit generates the composite video by arranging each video input to the first to fourth input units in a predetermined screen area.

A call relay method comprising:
A first call receiving step of receiving audio and video transmitted from a first call terminal;
A first voice recognition step of converting the audio received in the first call receiving step into text data and generating a text video representing the text data;
A second call receiving step of receiving audio and video transmitted from a second call terminal;
A second voice recognition step of converting the audio received in the second call receiving step into text data and generating a text video representing the text data;
A synthesis step of generating a composite video by synthesizing the video received in the first call receiving step, the video received in the second call receiving step, the text video generated in the first voice recognition step, and the text video generated in the second voice recognition step;
A first transmission step of transmitting the composite video to the first call terminal; and
A second transmission step of transmitting the composite video to the second call terminal.