JP2020136921A

JP2020136921A - Video call system and computer program

Info

Publication number: JP2020136921A
Application number: JP2019028411A
Authority: JP
Inventors: Yuki Takatsuji; Mitsuru Hirayama
Original assignee: Optage Inc
Current assignee: Optage Inc
Priority date: 2019-02-20
Filing date: 2019-02-20
Publication date: 2020-08-31

Abstract

To provide a video call system that imposes a low processing load and enables a home-based worker to provide a service without worrying about the worker's face, environment, and the like.

SOLUTION: A video call system comprises: a first user device comprising imaging means for capturing a face image of a first user, who is the user operating the first user device, voice input means for inputting the voice of the first user, face information generation means for generating, from the face image of the first user, face information regarding parts of the first user's face, and communication means for transmitting the first user's face information and voice to a second user device, which is one of a plurality of user devices other than the first user device; and the second user device comprising avatar image generation means for generating an avatar image of the first user on the basis of the first user's face information transmitted from the first user device, image output means for outputting the avatar image of the first user, and voice output means for outputting the first user's voice.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a video call system and a computer program.

Interpretation services in which a user asks an interpreter to interpret over a video call are conventionally known (see, for example, Patent Document 1).

In recent years, the number of foreign visitors to Japan has increased, and foreign travelers have more opportunities to enjoy their trips by using cloud interpretation provided by interpreters who work through crowdsourcing.

In addition, with the recent growth in inbound tourism, it has become common for home-based workers to work as interpreters in cloud interpretation services.

Patent Document 1: Japanese Unexamined Patent Publication No. 2004-206452

However, as described above, more and more home-based workers are working as interpreters in cloud interpretation, and cases in which the interpreter's clothes or background (environment) are not appropriate have become noticeable.

In particular, when the interpreter's background reveals the interpreter's everyday home life to the interpreter requester, the situation is undesirable for both the interpreter and the interpreter requester.

In addition, a female interpreter has had to put on makeup for interpreting work even though she is at home, which has sometimes made her hesitate to become an interpreter.

Therefore, there has been a demand for a mechanism that allows an interpreter who is a home-based worker to perform interpreting work without worrying about face, clothes, environment, and the like. This problem is not limited to cloud interpretation and applies equally to video calls in other fields.

Furthermore, in a conventional video call system, video information with a large data volume is transmitted and received between the interpreter's device and the interpreter requester's device, so a large communication bandwidth is required and the processing load on the server device becomes large. As a result, real-time communication may become impossible, or communication between the interpreter and the interpreter requester may be interrupted.

An object of the present invention is to provide a video call system that imposes a small processing load and allows a home-based worker to provide a service without worrying about face, environment, and the like.

The first invention is a video call system including a plurality of user devices that are connected to each other by a communication network and that each receive operations from a user, wherein
a first user device, which is at least one of the user devices, includes:
photographing means for capturing a face image of a first user, who is the user operating the first user device;
voice input means for inputting the voice of the first user;
face information generating means for generating face information regarding parts of the first user's face from the captured face image of the first user; and
communication means for transmitting at least the face information and the voice of the first user to a second user device, which is one of the plurality of user devices other than the first user device, and
the second user device includes:
avatar image generating means for generating and outputting an avatar image of the first user based on the face information of the first user transmitted from the first user device; and
voice output means for outputting the voice of the first user.

Further, in the first invention,
the first user device may be operated by an interpreter who has received an interpretation request, and
the second user device may be operated by an interpreter requester who requests the interpreter to interpret.

Further, in the first invention,
the face information may be face recognition information regarding at least the presence or absence of opening and closing of the mouth, the presence or absence of opening and closing of the eyes, and the tilt and rotation of the head, the mouth and eyes being parts of the first user's face.

Further, in the first invention,
the avatar image generating means may generate the avatar image by varying, based on the face recognition information, the vertex positions of the corresponding eyes, mouth, and face in the avatar image, which is composed of a three-dimensional model.
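As an illustration only, varying vertex positions of a three-dimensional avatar model from received face information could be sketched as below. The publication does not specify the computation; the linear blend between a closed and an open mouth pose, the yaw-only rotation, and all names (`FaceInfo`, `blend`, `rotate_yaw`, `pose_mouth`) are assumptions introduced for this sketch.

```python
import math
from dataclasses import dataclass

Vertex = tuple[float, float, float]


@dataclass(frozen=True)
class FaceInfo:
    """Hypothetical subset of the face information received from the first user device."""
    mouth_open: bool
    eyes_open: bool
    head_yaw_deg: float  # head rotation about the vertical (y) axis


def blend(closed: list[Vertex], opened: list[Vertex], weight: float) -> list[Vertex]:
    """Linearly interpolate every vertex between the closed pose and the open pose."""
    return [
        tuple(c + weight * (o - c) for c, o in zip(cv, ov))
        for cv, ov in zip(closed, opened)
    ]


def rotate_yaw(vertices: list[Vertex], yaw_deg: float) -> list[Vertex]:
    """Rotate vertices about the y axis to follow the head rotation."""
    a = math.radians(yaw_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [(x * cos_a + z * sin_a, y, -x * sin_a + z * cos_a) for x, y, z in vertices]


def pose_mouth(closed: list[Vertex], opened: list[Vertex], info: FaceInfo) -> list[Vertex]:
    """Pick the mouth pose from the open/closed flag, then apply the head rotation."""
    weight = 1.0 if info.mouth_open else 0.0
    return rotate_yaw(blend(closed, opened, weight), info.head_yaw_deg)
```

The same blend could be applied to the eye vertices with `eyes_open`; a real renderer would more likely use continuous weights than a binary flag.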

Further, in the first invention,
the face information generating means may generate emotion information for expressing an emotion of the first user based on at least one of the face information and the voice of the first user, and
the avatar image generating means may generate the avatar image of the first user based on the emotion information in addition to the face information.

The second invention is a computer program that causes a computer device to function as:
photographing means for capturing a face image of a user who operates the computer device;
voice input means for inputting the voice of the user;
face information generating means for generating face information regarding parts of the user's face from the captured face image of the user, so that an avatar image of the user can be generated in another computer device; and
communication means for transmitting at least the face information and the voice of the user to the other computer device.

The third invention is a computer program that causes a computer device to function as:
communication means for receiving, from another computer device and based on an operation by a user, voice and face information regarding parts of the face of another user who operates the other computer device, the face information having been generated from a face image of the other user captured by the other computer device;
avatar image generating means for generating and outputting an avatar image of the other user based on the face information; and
voice output means for outputting the voice of the other user.

The fourth invention is a computer program that causes a computer device to function as:
communication means for receiving, from another computer device and based on an operation by a user, a face image and voice of another user who operates the other computer device, the face image having been captured by the other computer device;
face information generating means for generating face information regarding parts of the other user's face from the face image of the other user received from the other computer device;
avatar image generating means for generating and outputting an avatar image of the other user based on the face information; and
voice output means for outputting the voice of the other user.

According to the present invention, it is possible to provide a video call system that imposes a small processing load and allows a home-based worker to provide a service without worrying about face, environment, and the like.

A conceptual diagram of the video call system in the present embodiment.
A diagram showing the functional configuration of the server device in the present embodiment.
A diagram showing the functional configuration of the interpreter terminal in the present embodiment.
A diagram showing the functional configuration of the interpreter requester terminal in the present embodiment.
A diagram showing the flow of video call processing in the present embodiment.
A diagram showing face recognition information of an interpreter's face in another embodiment.
A diagram showing face recognition information of an interpreter's face in another embodiment.

[Embodiment]
A video call system 1 according to an embodiment of the present invention will be described with reference to FIGS. 1 to 5.

<Explanation of video call system 1>
As shown in FIG. 1, the video call system 1 of the present invention includes a server device 2, a plurality of interpreter terminals (first user devices) 3, and a plurality of interpreter requester terminals (second user devices) 4.

The server device 2 is connected to the interpreter terminals 3 and the interpreter requester terminals 4 via a communication network 5.

The server device 2 transmits and receives video (images) and voice between an interpreter terminal 3 and an interpreter requester terminal 4. A video call is thereby provided between the interpreter terminal 3 and the interpreter requester terminal 4.

Each interpreter terminal 3 includes an operation unit that receives operations by an interpreter (first user), and is connected to the interpreter requester terminals 4 and the server device 2 via the communication network 5.

Each interpreter requester terminal 4 includes an operation unit that receives operations by an interpreter requester (second user), and is connected to the interpreter terminals 3 and the server device 2 via the communication network 5.

<Hardware configuration>
Hereinafter, the functional configuration of the server device 2, the functional configuration of the interpreter terminal 3 on which video call processing is provided, and the functional configuration of the interpreter requester terminal 4 will be described with reference to FIGS. 2 to 4.

Each interpreter and each interpreter requester is given a different account (identification information).

When an interpreter terminal 3 communicates with the server device 2 via the communication network 5, the interpreter's account is transmitted from the interpreter terminal 3 to the server device 2.

Likewise, when an interpreter requester terminal 4 communicates with the server device 2 via the communication network 5, the interpreter requester's account is transmitted from the interpreter requester terminal 4 to the server device 2.

The transmitted account undergoes predetermined authentication at the server device 2. This enables communication between the interpreter terminal 3 or the interpreter requester terminal 4 and the server device 2 and, as a result, communication between the interpreter terminal 3 and the interpreter requester terminal 4.

<Explanation of server device 2>
As shown in FIG. 2, the server device 2 includes a control unit 20, a storage unit 21, and a network interface 22.

The storage unit 21 and the network interface 22 are connected to the control unit 20 of the server device 2 via a bus 200.

The control unit 20 controls the operation of the server device 2.

The storage unit 21 is mainly composed of an HDD (Hard Disk Drive), a RAM (Random Access Memory), and a ROM (Read Only Memory).

The storage unit 21 stores registration information on each interpreter, such as the interpreter's name, the languages the interpreter can interpret, the time slots in which the interpreter can take on interpreting work, and the areas the interpreter knows well.

The storage unit 21 also stores registration information on each interpreter requester, such as the interpreter requester's name and location and the service plan the interpreter requester has selected.

The network interface 22 is connected to the communication network 5 in order to send and receive data between the server device 2 and the interpreter terminal 3 or the interpreter requester terminal 4.

<Functional configuration of control unit 20 of server device 2>
By executing a predetermined program, the control unit 20 of the server device 2 functions as interpreter management means 201, interpreter requester management means 202, service management means 203, and communication means 204.

<Explanation of interpreter management means 201>
The interpreter management means 201 authenticates an interpreter's account by using the account transmitted from the interpreter terminal 3.

The interpreter management means 201 also stores the interpreter's registration information transmitted from the interpreter terminal 3 in the storage unit 21. This registration information is set in advance based on operations by the interpreter.

In addition, after information regarding an interpretation request is received from the interpreter requester terminal 4, the interpreter management means 201 reads the registration information of each interpreter from the storage unit 21 and transmits it to the service management means 203.

<Explanation of interpreter requester management means 202>
The interpreter requester management means 202 authenticates an interpreter requester's account by using the account transmitted from the interpreter requester terminal 4.

After receiving information regarding an interpretation request from the interpreter requester terminal 4, the interpreter requester management means 202 also stores the interpreter requester's registration information transmitted from the interpreter requester terminal 4 in the storage unit 21. This registration information is set based on operations by the interpreter requester, either in advance or at the time of the interpretation request.

In addition, the interpreter requester management means 202 reads the interpreter requester's registration information from the storage unit 21 and transmits it to the service management means 203.

<Explanation of service management means 203>
The service management means 203 receives information regarding an interpretation request from the interpreter requester terminal 4 via the interpreter requester management means 202. This information includes the interpreter requester's location, the selected language, and the like.

After receiving the information regarding the interpretation request from the interpreter requester terminal 4, the service management means 203 refers to the interpreters' registration information (interpretable languages and time slots in which interpreting work can be taken on) received via the interpreter management means 201 and selects an interpreter who can handle the request. After the interpreter is selected, the service management means 203 transmits the interpretation request (call) from the interpreter requester to the interpreter terminal 3 of the selected interpreter. A communication connection is then established between the interpreter terminal 3 and the interpreter requester terminal 4.
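The selection step described above can be sketched as a simple filter over registration records. This is a minimal sketch under stated assumptions: the record fields, the first-match policy, and the names `InterpreterRecord`, `is_available`, and `select_interpreter` are hypothetical, and each availability window is assumed not to cross midnight.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass(frozen=True)
class InterpreterRecord:
    """Hypothetical registration record held in the storage unit."""
    account: str
    languages: frozenset[str]
    available_from: time
    available_until: time


def is_available(rec: InterpreterRecord, language: str, now: time) -> bool:
    """True if the interpreter handles the language and 'now' falls inside the window."""
    return language in rec.languages and rec.available_from <= now < rec.available_until


def select_interpreter(
    records: list[InterpreterRecord], language: str, now: time
) -> Optional[InterpreterRecord]:
    """Return the first interpreter who can take the request, or None if nobody can."""
    for rec in records:
        if is_available(rec, language, now):
            return rec
    return None
```

A first-match policy is the simplest choice; the publication leaves open how to choose among several eligible interpreters.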

When the call is disconnected at either the interpreter terminal 3 or the interpreter requester terminal 4, the service management means 203 calculates the service usage time and the like, and then the service charge and the like.
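A worked sketch of this calculation follows. The publication does not state a pricing rule, so billing per started minute and the function names `usage_seconds` and `service_charge` are assumptions made only for illustration.

```python
from datetime import datetime


def usage_seconds(connected_at: datetime, disconnected_at: datetime) -> int:
    """Service usage time in whole seconds, from connection to disconnection."""
    return max(0, int((disconnected_at - connected_at).total_seconds()))


def service_charge(seconds: int, rate_per_minute: int) -> int:
    """Charge for the call, billing every started minute (assumed pricing rule)."""
    billed_minutes = (seconds + 59) // 60  # integer ceiling of seconds / 60
    return billed_minutes * rate_per_minute
```

For a call lasting 2 minutes 30 seconds at a hypothetical rate of 100 per minute, three minutes are billed and the charge is 300.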

When communication is cut off because of, for example, deteriorating communication conditions at either the interpreter terminal 3 or the interpreter requester terminal 4, the service management means 203 reads the accounts of the interpreter and the interpreter requester from the storage unit 21 and attempts to reconnect, either automatically or in response to an operation by the interpreter or another party.

<Explanation of communication means 204>
The communication means 204 receives the interpreter's account, the interpreter's registration information, information regarding the interpretation response, the interpreter's face recognition information (described later), voice information, and the like from the interpreter terminal 3.

The communication means 204 also receives the interpreter requester's account, the interpreter requester's registration information, information regarding the interpretation request, the interpreter requester's video information, voice information, and the like from the interpreter requester terminal 4.

In addition, the communication means 204 transmits the interpreter's face recognition information, voice information, and the like to the interpreter requester terminal 4.

The communication means 204 also transmits the interpreter requester's video information, voice information, and the like to the interpreter terminal 3.
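The asymmetry of the relay (face recognition information in one direction, video in the other) can be sketched as follows. This is a simplified illustration: the message keys, the `Role` enum, and the `relay` function are hypothetical, and account, registration, and request or response data are left out.

```python
from enum import Enum


class Role(Enum):
    INTERPRETER = "interpreter"
    REQUESTER = "requester"


# Media fields the server forwards to the other party, by sender role.
# The interpreter's camera video is never forwarded, only the face recognition information.
FORWARDED = {
    Role.INTERPRETER: {"face_recognition_info", "voice"},
    Role.REQUESTER: {"video", "voice"},
}


def relay(sender: Role, message: dict) -> tuple[Role, dict]:
    """Return the receiving role and the subset of the message that is forwarded."""
    receiver = Role.REQUESTER if sender is Role.INTERPRETER else Role.INTERPRETER
    payload = {key: value for key, value in message.items() if key in FORWARDED[sender]}
    return receiver, payload
```

Filtering on the server is only a defensive choice for this sketch; in the described system the interpreter terminal transmits face recognition information rather than video in the first place.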

<Explanation of interpreter terminal 3>
The interpreter terminal 3 is, for example, a laptop computer.

The interpreter terminal 3 can exchange data with the server device 2 and the interpreter requester terminal 4 via the communication network 5, such as the Internet or a LAN. A video call with the interpreter requester terminal 4 is thereby provided on the interpreter terminal 3 via the server device 2.

As shown in FIG. 3, the interpreter terminal 3 includes a control unit 30, a storage unit 31, a network interface 32, a photographing unit 33, a voice input unit 34, a graphic processing unit 35, an audio processing unit 36, and an operation unit 37.

The storage unit 31, the network interface 32, the photographing unit 33, the voice input unit 34, the graphic processing unit 35, the audio processing unit 36, and the operation unit 37 are connected to the control unit 30 of the interpreter terminal 3 via a bus 300.

The control unit 30 controls the operation of the interpreter terminal 3.

The storage unit 31 is mainly composed of an HDD, a RAM, and a ROM.

The network interface 32 is connected to the communication network 5 in order to send and receive data between the interpreter terminal 3 and the server device 2 or the interpreter requester terminal 4.

The photographing unit 33 is connected to a camera 330, which captures video of the interpreter.

The voice input unit 34 is connected to a microphone 340, which captures the interpreter's voice.

The graphic processing unit 35 is connected to a liquid crystal screen 350, which displays video of the interpreter requester.

The audio processing unit 36 is connected to a speaker 360, which outputs the interpreter requester's voice.

The operation unit 37 is connected to a keyboard and mouse 370 (hereinafter referred to as the "keyboard and the like 370"). In the present embodiment, operation signals from the interpreter are input to the operation unit 37 via the keyboard and the like 370, which are input detection devices. By operating the keyboard and the like 370, the interpreter answers interpretation requests, disconnects calls, and so on.

<Functional configuration of control unit 30 of interpreter terminal 3>
By executing a predetermined program, the control unit 30 of the interpreter terminal 3 functions as call answering means 301, photographing means 302, voice input means 303, face information generating means 304, video output means 305, voice output means 306, and communication means 307.

<Explanation of call answering means 301>
Based on an operation by the interpreter, the call answering means 301 accepts a call from an interpreter requester that has been routed by the server device 2 (an interpretation response). Specifically, the interpretation response is made when the user operates the "Answer" button displayed on the liquid crystal screen 350.

<Explanation of photographing means 302>
Once the interpreter has made an interpretation response, the photographing means 302 captures video of the interpreter. Specifically, the photographing means 302 transmits information for activating the camera 330 to the photographing unit 33. The interpreter's video information is thereby acquired.

<Explanation of voice input means 303>
Once the interpreter has made an interpretation response, the voice input means 303 acquires the interpreter's voice. Specifically, the voice input means 303 transmits information for activating the microphone 340 to the voice input unit 34. The interpreter's voice information is thereby acquired.

<Explanation of face information generating means 304>
The face information generating means 304 analyzes the interpreter's video information and generates face information regarding parts of the interpreter's face.

In the present embodiment, the face information is, for example, face recognition information regarding the presence or absence of opening and closing of the mouth and of the eyes, which are parts of the interpreter's face, and the tilt and rotation of the head. This face recognition information can be generated with, for example, an open source library that uses deep learning. The open source library determines, for example, whether the mouth is opening and closing based on information accumulated through deep learning.
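To make the shape of this face recognition information concrete, the sketch below derives an open or closed flag from landmark points and encodes the result as a compact message. The publication names no specific library or data format, so the landmark inputs, the ratio threshold, the JSON encoding, and all names (`FaceRecognitionInfo`, `opening_ratio`, `is_open`, `encode`) are assumptions.

```python
import json
import math
from dataclasses import asdict, dataclass

Point = tuple[float, float]


@dataclass(frozen=True)
class FaceRecognitionInfo:
    """Hypothetical per-frame payload sent in place of the camera video."""
    mouth_open: bool
    eyes_open: bool
    head_tilt_deg: float
    head_rotation_deg: float


def opening_ratio(top: Point, bottom: Point, left: Point, right: Point) -> float:
    """Vertical opening divided by horizontal width of a mouth or an eye."""
    width = math.dist(left, right)
    return math.dist(top, bottom) / width if width else 0.0


def is_open(ratio: float, threshold: float = 0.3) -> bool:
    """Treat the part as open when the ratio exceeds an assumed threshold."""
    return ratio > threshold


def encode(info: FaceRecognitionInfo) -> bytes:
    """Serialize the face recognition information as compact JSON."""
    return json.dumps(asdict(info), separators=(",", ":")).encode("utf-8")
```

A message of this kind is on the order of a hundred bytes per frame, which illustrates why transmitting it instead of video reduces the required bandwidth.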

The face information generating means 304 also generates emotion information on the interpreter from, for example, the interpreter's facial expression based on the face recognition information and the intonation of the interpreter's voice.

For example, when the tone of the interpreter's voice is high, the face information generating means 304 generates emotion information so that the interpreter appears to be explaining cheerfully. In predetermined cases, the face information generating means 304 also generates information for producing a heart mark, an exclamation mark, or the like around the avatar image. Reflecting this emotion information in the avatar image makes the avatar's expressions richer.
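The voice-tone rule in the example above might be sketched as a threshold on average pitch. The publication gives no concrete criterion, so the pitch samples, the per-speaker baseline, the 1.2 ratio, and the names `EmotionInfo` and `emotion_from_pitch` are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EmotionInfo:
    """Hypothetical emotion information attached to the face information."""
    mood: str  # "cheerful" or "neutral" in this sketch


def emotion_from_pitch(
    pitch_hz: list[float], baseline_hz: float, ratio: float = 1.2
) -> EmotionInfo:
    """Label the speaker cheerful when the mean pitch is well above the baseline."""
    if not pitch_hz:
        return EmotionInfo("neutral")
    if mean(pitch_hz) > baseline_hz * ratio:
        return EmotionInfo("cheerful")
    return EmotionInfo("neutral")
```

Overlay marks such as a heart or an exclamation mark could be carried as a further field; the conditions that trigger them are not specified in the text and are therefore omitted here.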

The face information generating means 304 then transmits the face recognition information to the server device 2 via the communication means 307 described later.

<Explanation of video output means 305>
Once the interpreter has made an interpretation response, the video output means 305 outputs the interpreter requester's video transmitted from the interpreter requester terminal 4. Specifically, the video output means 305 transmits information for displaying the interpreter requester's video on the liquid crystal screen 350 to the graphic processing unit 35. The interpreter requester's video is thereby displayed on the liquid crystal screen 350.

<Explanation of voice output means 306>
Once the interpreter has made an interpretation response, the voice output means 306 outputs the interpreter requester's voice transmitted from the interpreter requester terminal 4. Specifically, the voice output means 306 transmits information for outputting the interpreter requester's voice from the speaker 360 to the audio processing unit 36. The interpreter requester's voice is thereby output from the speaker 360.

<Explanation of communication means 307>
The communication means 307 receives information regarding the interpretation request, the interpreter requester's video information and voice information, and the like from the server device 2.

The communication means 307 also transmits information regarding the interpretation response, the interpreter's face recognition information and voice information, and the like to the server device 2.

＜通訳依頼者端末４の説明＞
通訳依頼者端末４は、例えば、スマートフォン、タブレットなどの端末装置である。 <Explanation of interpreter requester terminal 4>
The interpreter requester terminal 4 is, for example, a terminal device such as a smartphone or a tablet.

また、通訳依頼者端末４は、サーバ装置２および通訳者端末３との間で、インターネットあるいはＬＡＮなどの通信ネットワーク５を介して互いにデータ通信をすることができる。これにより、通訳依頼者端末４において、サーバ装置２を介して通訳者端末３との間のビデオ通話が提供される。 Further, the interpreter requester terminal 4 can perform data communication with the server device 2 and the interpreter terminal 3 via a communication network 5 such as the Internet or a LAN. As a result, the interpreter requester terminal 4 provides a video call with the interpreter terminal 3 via the server device 2.

通訳依頼者端末４は、図４のとおり、制御部４０、記憶部４１、ネットワークインターフェース４２、撮影部４３、音声入力部４４、グラフィック処理部４５、オーディオ処理部４６、操作部４７、および位置情報検出部４８を備える。 As shown in FIG. 4, the interpreter requester terminal 4 includes a control unit 40, a storage unit 41, a network interface 42, a photographing unit 43, a voice input unit 44, a graphic processing unit 45, an audio processing unit 46, an operation unit 47, and a position information detection unit 48.

記憶部４１、ネットワークインターフェース４２、撮影部４３、音声入力部４４、グラフィック処理部４５、オーディオ処理部４６、操作部４７、および位置情報検出部４８は、バス４００を介して、制御部４０に接続される。 The storage unit 41, the network interface 42, the photographing unit 43, the voice input unit 44, the graphic processing unit 45, the audio processing unit 46, the operation unit 47, and the position information detection unit 48 are connected to the control unit 40 via the bus 400.

制御部４０は、通訳依頼者端末４の動作を制御する。 The control unit 40 controls the operation of the interpreter requester terminal 4.

記憶部４１は、主にＨＤＤ、ＲＡＭおよびＲＯＭで構成される。 The storage unit 41 is mainly composed of an HDD, a RAM, and a ROM.

ネットワークインターフェース４２は、通訳依頼者端末４とサーバ装置２あるいは通訳者端末３との間でデータを送受信するために、通信ネットワーク５に接続される。 The network interface 42 is connected to the communication network 5 in order to send and receive data between the interpreter requester terminal 4 and the server device 2 or the interpreter terminal 3.

撮影部４３は、カメラ４３０と接続されている。このカメラ４３０によって通訳依頼者の映像が取得される。 The photographing unit 43 is connected to the camera 430. The image of the interpreter requester is acquired by the camera 430.

音声入力部４４は、マイク４４０と接続されている。このマイク４４０によって通訳依頼者の音声が取得される。 The voice input unit 44 is connected to the microphone 440. The voice of the interpreter requester is acquired by the microphone 440.

グラフィック処理部４５は、液晶画面４５０と接続されている。液晶画面４５０には、通訳者のアバター映像が表示される。 The graphic processing unit 45 is connected to the liquid crystal screen 450. An avatar image of the interpreter is displayed on the liquid crystal screen 450.

オーディオ処理部４６は、スピーカ４６０と接続されている。スピーカ４６０からは、通訳者の音声が出力される。 The audio processing unit 46 is connected to the speaker 460. The voice of the interpreter is output from the speaker 460.

操作部４７は、タッチパッド４７０と接続されている。本実施形態において操作部４７には、入力位置検出装置であるタッチパッド４７０を介して通訳依頼者からの操作信号が入力される。通訳依頼者はタッチパッド４７０を操作することで、通訳依頼、通話切断等を行う。 The operation unit 47 is connected to the touch pad 470. In the present embodiment, the operation signal from the interpreter requester is input to the operation unit 47 via the touch pad 470 which is an input position detection device. The interpreter requester operates the touch pad 470 to request an interpreter, disconnect the call, and the like.

位置情報検出部４８は、例えば、ＧＰＳ受信機であって、通訳依頼者端末４の現在位置を示す位置情報（例えば、経度および緯度）を検出する。 The position information detection unit 48 is, for example, a GPS receiver and detects position information (for example, longitude and latitude) indicating the current position of the interpreter requester terminal 4.

＜通訳依頼者端末４の制御部４０の機能構成＞
通訳依頼者端末４の制御部４０は、通訳アプリケーションを起動することで、通訳依頼手段４０１、撮影手段４０２、音声入力手段４０３、アバター画像生成手段４０４、音声出力手段４０５、位置情報検出手段４０６、および通信手段４０７として機能する。 <Functional configuration of the control unit 40 of the interpreter requester terminal 4>
By launching the interpreter application, the control unit 40 of the interpreter requester terminal 4 functions as the interpreter request means 401, the photographing means 402, the voice input means 403, the avatar image generation means 404, the voice output means 405, the position information detection means 406, and the communication means 407.

通訳アプリケーションは、通訳依頼者によってダウンロードされたのち、サーバ装置２に対し、通訳依頼者が氏名、国籍、サービスプランなどの登録を行うことで利用可能となる。通訳依頼者が必要な情報を登録したのちは、サーバ装置２よりＩＤ（アカウント）、パスワードが通訳依頼者端末４へ送信される。通訳依頼者はこのＩＤ等を入力することによって通訳アプリケーションにログインすることができる。 After being downloaded by the interpreter requester, the interpreter application can be used by the interpreter requester registering the name, nationality, service plan, etc. in the server device 2. After the interpreter requester registers the necessary information, the server device 2 transmits the ID (account) and password to the interpreter requester terminal 4. The interpreter requester can log in to the interpreter application by entering this ID or the like.

＜通訳依頼手段４０１の説明＞
通訳依頼手段４０１は、通訳依頼者の操作に基づいて、通訳アプリケーションを起動したのち、サーバ装置２対して通訳依頼を行う。具体的には、液晶画面４５０に表示された「通訳依頼」ボタンが表示されている位置に対応するタッチパッド４７０をユーザがタッチすることで、通訳依頼が行われる。 <Explanation of interpreter request means 401>
The interpreter request means 401 launches the interpreter application based on the operation of the interpreter requester, and then makes an interpretation request to the server device 2. Specifically, the interpretation request is made when the user touches the touch pad 470 at the position corresponding to the "Interpretation request" button displayed on the liquid crystal screen 450.

＜撮影手段４０２の説明＞
撮影手段４０２は、通訳依頼者が通訳依頼を行ったことに基づいて、通訳依頼者の映像を撮影する。具体的には、撮影手段４０２は、カメラ４３０を起動させるための情報を撮影部４３へ送信する。これにより、通訳依頼者の映像情報が取得される。 <Explanation of photographing means 402>
The photographing means 402 captures an image of the interpreter requester based on the interpreter requester's request for interpretation. Specifically, the photographing means 402 transmits information for activating the camera 430 to the photographing unit 43. As a result, the video information of the interpreter requester is acquired.

＜音声入力手段４０３の説明＞
音声入力手段４０３は、通訳依頼者が通訳依頼を行ったことに基づいて、通訳依頼者の音声を取得する。具体的には、音声入力手段４０３は、マイク４４０を起動させるための情報を音声入力部４４へ送信する。これにより、通訳依頼者の音声情報が取得される。 <Explanation of voice input means 403>
The voice input means 403 acquires the voice of the interpreter requester based on the interpreter requester's request for interpretation. Specifically, the voice input means 403 transmits information for activating the microphone 440 to the voice input unit 44. As a result, the voice information of the interpreter requester is acquired.

＜アバター画像生成手段４０４の説明＞
アバター画像生成手段４０４は、通訳者端末３から送信された通訳者の顔認証情報に基づいて３Ｄモデル（３次元モデル）で構成されるアバター映像を生成する。 <Explanation of avatar image generation means 404>
The avatar image generation means 404 generates an avatar image composed of a 3D model (three-dimensional model) based on the face recognition information of the interpreter transmitted from the interpreter terminal 3.

具体的には、アバター画像生成手段４０４は、サーバ装置２を介して通訳者端末３から送信された通訳者の顔認証情報および感情情報を、あらかじめ用意されたアバターの３Ｄモデルに対応させて、通訳者のアバター映像を生成する。 Specifically, the avatar image generation means 404 generates the interpreter's avatar image by mapping the face recognition information and emotion information of the interpreter, transmitted from the interpreter terminal 3 via the server device 2, onto a 3D model of the avatar prepared in advance.

アバターの３Ｄモデルには、例えば通訳者の口が開いた場合に３Ｄモデルの口の頂点位置を所定の位置に移動させるシェイプキーが保存されている。これにより、アバター画像生成手段４０４は、通訳者が口を開いたり、目を閉じたり、顔を振ったりすることに対応して、アバターが口を開いたり、目を閉じたり、顔を振ったりする状態を描画することが可能となる。なお、例えばアバターの口が開いた状態から閉じた状態までの描画はモーフィングを行うことにより実現することができる。 The 3D model of the avatar stores shape keys, for example a shape key that moves the vertex positions of the 3D model's mouth to predetermined positions when the interpreter's mouth opens. This allows the avatar image generation means 404 to draw the avatar opening its mouth, closing its eyes, or shaking its head in response to the interpreter opening the mouth, closing the eyes, or shaking the head. Note that, for example, the transition of the avatar's mouth from the open state to the closed state can be drawn by morphing.
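As an illustration of the shape-key and morphing idea described above, the following is a minimal Python sketch, not the implementation used in the embodiment. The `ShapeKey` structure, the function names, and the use of linear interpolation are all assumptions made for this example; the source only states that a shape key moves mouth vertices to predetermined positions and that the open-to-closed transition can be drawn by morphing.

```python
from dataclasses import dataclass

Vertex = tuple[float, float, float]


@dataclass
class ShapeKey:
    """Hypothetical shape key: target positions for a subset of mesh vertices."""
    name: str
    targets: dict[int, Vertex]  # vertex index -> position when fully applied


def apply_shape_key(base: list[Vertex], key: ShapeKey, weight: float) -> list[Vertex]:
    """Linearly blend base vertices toward the shape-key targets.

    weight 0.0 leaves the mesh unchanged; weight 1.0 reaches the targets.
    Weights outside [0, 1] are clamped.
    """
    w = max(0.0, min(1.0, weight))
    out = list(base)
    for idx, (tx, ty, tz) in key.targets.items():
        bx, by, bz = base[idx]
        out[idx] = (bx + (tx - bx) * w, by + (ty - by) * w, bz + (tz - bz) * w)
    return out


def morph_frames(base: list[Vertex], key: ShapeKey, start: float, end: float, steps: int):
    """Yield steps + 1 meshes whose weight moves evenly from start to end."""
    for i in range(steps + 1):
        yield apply_shape_key(base, key, start + (end - start) * i / steps)
```

For instance, `morph_frames(mesh, mouth_open_key, 1.0, 0.0, 4)` would produce five intermediate meshes for a mouth going from open to closed, where `mesh` and `mouth_open_key` are hypothetical inputs.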

また、アバター画像生成手段４０４は、通訳者のアバター映像を出力する。具体的には、アバター画像生成手段４０４は、液晶画面４５０に通訳者のアバター映像を表示させるための情報をグラフィック処理部４５へ送信する。これにより、通訳者のアバター映像が液晶画面４５０に表示される。 Further, the avatar image generation means 404 outputs the avatar image of the interpreter. Specifically, the avatar image generation means 404 transmits information for displaying the interpreter's avatar image on the liquid crystal screen 450 to the graphic processing unit 45. As a result, the interpreter's avatar image is displayed on the liquid crystal screen 450.

なお、本実施形態において、通訳者のアバター映像の服装および背景は、アバター映像に合うようにあらかじめ設定された服装等である。これにより、通訳者は、服装および部屋内部の状況を気にすることなく通訳業務にたずさわることができる。 In the present embodiment, the clothes and background of the interpreter's avatar image are clothes and the like preset to match the avatar image. As a result, the interpreter can be involved in the interpreting work without worrying about the clothes and the situation inside the room.

＜音声出力手段４０５の説明＞
音声出力手段４０５は、通訳者が通訳応答を行ったことに基づいて、通訳者の音声を出力する。具体的には、音声出力手段４０５は、スピーカ４６０から通訳者の音声を出力させるための情報をオーディオ処理部４６へ送信する。これにより、通訳者の音声がスピーカ４６０から出力される。 <Explanation of audio output means 405>
The voice output means 405 outputs the voice of the interpreter once the interpreter has made an interpretation response. Specifically, the voice output means 405 transmits, to the audio processing unit 46, information for outputting the voice of the interpreter from the speaker 460. As a result, the interpreter's voice is output from the speaker 460.

＜位置情報検出手段４０６の説明＞
位置情報検出手段４０６は、通訳依頼者が通訳依頼を行ったことに基づいて、通訳依頼者の現在位置を検出する。具体的には、位置情報検出手段４０６は、位置情報検出部４８を起動させるための情報を位置情報検出部４８へ送信する。 <Explanation of position information detecting means 406>
The position information detecting means 406 detects the current position of the interpreter requester based on the interpreter requester's request for interpretation. Specifically, the position information detecting means 406 transmits information for activating the position information detecting unit 48 to the position information detecting unit 48.

＜通信手段４０７の説明＞
通信手段４０７は、通訳応答に関する情報、通訳者の顔認証情報および音声情報などをサーバ装置２から受信する。 <Explanation of communication means 407>
The communication means 407 receives information on the interpretation response, the face recognition information and voice information of the interpreter, and the like from the server device 2.

また、通信手段４０７は、通訳依頼に関する情報、通訳依頼者の映像情報および音声情報などをサーバ装置２へ送信する。 Further, the communication means 407 transmits information related to the interpreter request, video information and audio information of the interpreter requester to the server device 2.

＜ビデオ通話処理の説明＞
以下、図５のフローチャートを用いて、ビデオ通話処理について説明する。なお、後述の制御手段および処理手順は一例であり、本発明の実施形態はこれらには限られない。処理手順等は、本発明の要旨を変更しない範囲で適宜設計変更が可能である。 <Explanation of video call processing>
Hereinafter, the video call processing will be described with reference to the flowchart of FIG. The control means and the processing procedure described later are examples, and the embodiments of the present invention are not limited thereto. The design of the processing procedure and the like can be appropriately changed without changing the gist of the present invention.

まず、通訳依頼者端末４の通訳依頼手段４０１が、通訳依頼者の操作に基づいて、通訳依頼者端末４にインストールされている通訳アプリケーションを起動する（ステップＳ１）。 First, the interpreter request means 401 of the interpreter requester terminal 4 activates the interpreter application installed on the interpreter requester terminal 4 based on the operation of the interpreter requester (step S1).

ついで、通訳依頼手段４０１が、通訳依頼者の操作に基づいて、必要な情報を登録するほか、通訳を依頼したい言語を選択する（ステップＳ２）。 Then, the interpreter request means 401 registers necessary information based on the operation of the interpreter requester, and also selects the language for which the interpreter is requested (step S2).

ついで、通訳依頼手段４０１が、通訳依頼者の操作に基づいて、通訳依頼を行う（ステップＳ３）。具体的には、通訳依頼手段４０１が、通信手段４０７を介して通訳依頼者のアカウント、通訳依頼に関する情報などをサーバ装置２へ送信する。 Then, the interpreter request means 401 makes an interpreter request based on the operation of the interpreter requester (step S3). Specifically, the interpreter request means 401 transmits the account of the interpreter requester, information on the interpreter request, and the like to the server device 2 via the communication means 407.

ついで、サーバ装置２が、通訳依頼者端末４との通信を開始する（ステップＳ４）。 Then, the server device 2 starts communication with the interpreter requester terminal 4 (step S4).

ついで、通訳依頼者端末４の撮影手段４０２および音声入力手段４０３が、それぞれ、通訳依頼者端末４のカメラ４３０およびマイク４４０を起動させる（ステップＳ５）。 Then, the photographing means 402 and the voice input means 403 of the interpreter requester terminal 4 activate the camera 430 and the microphone 440 of the interpreter requester terminal 4, respectively (step S5).

ついで、撮影手段４０２が通訳依頼者の映像情報を取得し、音声入力手段４０３が通訳依頼者の音声情報を取得する（ステップＳ６）。 Next, the photographing means 402 acquires the video information of the interpreter requester, and the voice input means 403 acquires the voice information of the interpreter requester (step S6).

ついで、撮影手段４０２および音声入力手段４０３が、それぞれ通信手段４０７を介して通訳依頼者の映像情報および音声情報をサーバ装置２へ送信する（ステップＳ７）。 Then, the photographing means 402 and the voice input means 403 transmit the video information and the voice information of the interpreter requester to the server device 2 via the communication means 407, respectively (step S7).

ついで、サーバ装置２のサービス管理手段２０３が、通訳者端末３と通訳依頼者端末４との接続確立のために必要な接続要求の確認を行う（ステップＳ８）。 Then, the service management means 203 of the server device 2 confirms the connection request necessary for establishing the connection between the interpreter terminal 3 and the interpreter requester terminal 4 (step S8).

ついで、サーバ装置２の通訳者管理手段２０１が、現在対応可能な通訳者を検索し、サービス管理手段２０３が対応可能な通訳者の通訳者端末３への通訳依頼の振り分けを実行する（ステップＳ９）。 Next, the interpreter management means 201 of the server device 2 searches for interpreters who are currently available, and the service management means 203 distributes the interpretation request to the interpreter terminal 3 of an available interpreter (step S9).
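Step S9 can be pictured with the following minimal Python sketch. It is only an illustration under stated assumptions: the `Interpreter` record, matching on the language selected in step S2, and picking the first available candidate are all hypothetical choices, since the source does not specify how the search or the distribution is performed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interpreter:
    """Hypothetical record kept by the interpreter management means."""
    terminal_id: str
    languages: set[str]
    available: bool = True


def find_available(interpreters: list[Interpreter], language: str) -> list[Interpreter]:
    """Search step: interpreters who are free and handle the requested language."""
    return [i for i in interpreters if i.available and language in i.languages]


def dispatch(interpreters: list[Interpreter], language: str) -> Optional[Interpreter]:
    """Distribution step: assign the request to the first available interpreter.

    Returns None when nobody is available. "First match" is an assumption
    made for this sketch, not a policy stated in the source.
    """
    candidates = find_available(interpreters, language)
    if not candidates:
        return None
    chosen = candidates[0]
    chosen.available = False  # mark as busy for the duration of the call
    return chosen
```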

ついで、サービス管理手段２０３が、通訳依頼者端末４から送信された通訳依頼者の映像情報および音声情報を、通訳依頼を振り分けた通訳者端末３へ送信する（ステップＳ１０）。 Next, the service management means 203 transmits the video information and audio information of the interpreter requester transmitted from the interpreter requester terminal 4 to the interpreter terminal 3 to which the interpreter request is distributed (step S10).

ついで、通訳者端末３の通訳応答手段３０１が、通訳者の操作に基づいて、通訳応答を行う（ステップＳ１１）。 Then, the interpreter response means 301 of the interpreter terminal 3 performs an interpreter response based on the operation of the interpreter (step S11).

ついで、通訳者端末３の撮影手段３０２および音声入力手段３０３が、それぞれ、通訳者端末３のカメラ３３０およびマイク３４０を起動させる（ステップＳ１２）。 Then, the photographing means 302 and the voice input means 303 of the interpreter terminal 3 activate the camera 330 and the microphone 340 of the interpreter terminal 3, respectively (step S12).

ついで、通訳者端末３の撮影手段３０２が通訳者の映像情報を取得し、音声入力手段３０３が通訳者の音声情報を取得する（ステップＳ１３）。 Next, the photographing means 302 of the interpreter terminal 3 acquires the video information of the interpreter, and the voice input means 303 acquires the voice information of the interpreter (step S13).

ついで、通訳者端末３の顔情報生成手段３０４が、通訳者の映像情報を解析し、通訳者の顔の部位に関する情報である顔認証情報を生成する（ステップＳ１４）。具体的には、顔情報生成手段３０４は、通訳者の映像情報から通訳者の口の開け閉め、頭の傾きなどを抽出し、通訳者の顔に基づく顔認証情報を生成する。 Next, the face information generation means 304 of the interpreter terminal 3 analyzes the video information of the interpreter and generates face recognition information which is information about the face part of the interpreter (step S14). Specifically, the face information generation means 304 extracts the opening and closing of the interpreter's mouth, the inclination of the head, and the like from the video information of the interpreter, and generates face recognition information based on the face of the interpreter.
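To make step S14 more concrete, here is a minimal Python sketch of deriving compact face information from facial landmarks. It assumes that 2D landmark coordinates have already been obtained by some face-landmark detector (not shown); the landmark names, the `FaceInfo` fields, and the thresholds are hypothetical and are not taken from the source.

```python
import math
from dataclasses import dataclass

Point = tuple[float, float]


@dataclass
class FaceInfo:
    """Hypothetical compact face information sent instead of video."""
    mouth_open: bool
    eyes_closed: bool
    head_roll_deg: float


def face_info_from_landmarks(
    lm: dict[str, Point],
    mouth_thresh: float = 0.08,
    eye_thresh: float = 0.03,
) -> FaceInfo:
    """Derive mouth/eye state and head roll from 2D landmarks.

    Gaps are normalized by face height so the result does not depend on
    how close the face is to the camera. Thresholds are illustrative only.
    """
    face_h = math.dist(lm["forehead"], lm["chin"])
    mouth_gap = math.dist(lm["upper_lip"], lm["lower_lip"]) / face_h
    eye_gap = math.dist(lm["left_eye_top"], lm["left_eye_bottom"]) / face_h
    (lx, ly), (rx, ry) = lm["left_eye_center"], lm["right_eye_center"]
    # Roll is the angle of the line joining the two eye centers.
    roll = math.degrees(math.atan2(ry - ly, rx - lx))
    return FaceInfo(mouth_gap > mouth_thresh, eye_gap < eye_thresh, roll)
```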

ついで、顔情報生成手段３０４が、顔認証情報および音声情報から感情情報を生成する（ステップＳ１５）。具体的には、顔情報生成手段３０４は、顔認証情報に基づく通訳者の表情および通訳者の声の抑揚等から通訳者の感情情報を生成する。 Next, the face information generation means 304 generates emotion information from the face recognition information and the voice information (step S15). Specifically, the face information generation means 304 generates the interpreter's emotion information from the interpreter's facial expression, as determined from the face recognition information, and from the intonation of the interpreter's voice and the like.

ついで、通訳者端末３が、サーバ装置２を介して、通訳者の顔認証情報、感情情報、および音声情報を通訳依頼者端末４へ送信する（ステップＳ１６）。 Next, the interpreter terminal 3 transmits the interpreter's face recognition information, emotion information, and voice information to the interpreter requester terminal 4 via the server device 2 (step S16).

ついで、通訳依頼者端末４のアバター画像生成手段４０４が、受信した通訳者の顔認証情報および感情情報をもとに通訳者のアバター映像を生成する（ステップＳ１７）。具体的には、アバター画像生成手段４０４は、あらかじめ用意されたアバターの３Ｄモデルに、通訳者の顔認証情報を当てはめ、通訳者の顔認証情報と同じようにアバターが動くようにアバター映像を生成する。 Next, the avatar image generation means 404 of the interpreter requester terminal 4 generates the interpreter's avatar image based on the received face recognition information and emotion information of the interpreter (step S17). Specifically, the avatar image generation means 404 applies the interpreter's face recognition information to the 3D model of the avatar prepared in advance, and generates the avatar image so that the avatar moves in the same way as indicated by the interpreter's face recognition information.

ついで、アバター画像生成手段４０４が通訳依頼者端末４の液晶画面４５０に通訳者のアバター映像を表示させ、音声出力手段４０５が通訳依頼者端末４のスピーカ４６０に通訳者の音声を出力させる（ステップＳ１８）。その後、通訳者端末３あるいは通訳依頼者端末４において通話が切断された場合には、本発明のビデオ通話処理は終了する。
以上の手順により、本発明のビデオ通話処理が実行される。 Next, the avatar image generation means 404 displays the interpreter's avatar image on the liquid crystal screen 450 of the interpreter requester terminal 4, and the voice output means 405 causes the speaker 460 of the interpreter requester terminal 4 to output the interpreter's voice (step S18). After that, when the call is disconnected at the interpreter terminal 3 or the interpreter requester terminal 4, the video call processing of the present invention ends.
According to the above procedure, the video call processing of the present invention is executed.

以上をまとめると、本実施形態のビデオ通話システム１は、
通信ネットワーク５によって互いに接続されるとともに各ユーザの操作を受けつける複数のユーザ装置を備えるビデオ通話システム１であって、
１つのユーザ装置である通訳者端末３は、
通訳者端末３を操作するユーザである通訳者の顔画像を撮影する撮影手段３０２、
通訳者の音声を入力する音声入力手段３０３、
撮影された通訳者の顔画像から通訳者の顔の部位に関する顔情報を生成する顔情報生成手段３０４、および
通訳者の顔情報および音声を、通訳依頼者端末４へ送信する通信手段３０７、
を備え、
通訳依頼者端末４は、
通訳者端末３から送信された通訳者の顔情報に基づいて通訳者のアバター画像を生成するとともに出力するアバター画像生成手段４０４、および
通訳者の音声を出力する音声出力手段４０５、
を備える。 In summary, the video call system 1 of the present embodiment is a video call system 1 including a plurality of user devices that are connected to each other by a communication network 5 and that accept operations from their respective users.
The interpreter terminal 3, which is one of the user devices, includes:
photographing means 302 for capturing a face image of the interpreter, who is the user operating the interpreter terminal 3;
voice input means 303 for inputting the voice of the interpreter;
face information generation means 304 for generating face information about parts of the interpreter's face from the captured face image of the interpreter; and
communication means 307 for transmitting the face information and voice of the interpreter to the interpreter requester terminal 4.
The interpreter requester terminal 4 includes:
avatar image generation means 404 for generating and outputting an avatar image of the interpreter based on the face information of the interpreter transmitted from the interpreter terminal 3; and
voice output means 405 for outputting the voice of the interpreter.

＜発明の効果＞
本実施形態のビデオ通話システムによれば、処理負荷が小さく、在宅ワーカーが顔および環境などを気にすることなくサービスを提供することができるビデオ通話システムを提供することができる。 <Effect of invention>
According to the video call system of the present embodiment, it is possible to provide a video call system in which the processing load is small and the home worker can provide the service without worrying about the face and the environment.

具体的には、本発明を用いれば、通訳者が在宅ワーカーであっても、通訳者は、自身の服装あるいは背景を気にすることなく通訳業務を遂行することができる。 Specifically, according to the present invention, even if the interpreter is a home-based worker, the interpreter can carry out the interpreting work without worrying about his / her clothes or background.

また、通訳者が女性であっても、通訳業務のためにわざわざ化粧をする必要がなく、気軽に通訳業務にたずさわることができる。 Moreover, even if the interpreter is a woman, she does not have to bother to put on makeup for the interpreting work, and can easily participate in the interpreting work.

さらには、通訳者端末と通訳依頼者端末との間でデータ量の小さい顔認証情報が送受信されるため、通信にタイムラグが発生することを抑制しつつ、サーバ装置の処理負荷が大きくなってしまうことを抑制することができる。 Furthermore, because what is exchanged between the interpreter terminal and the interpreter requester terminal is face recognition information, which has a small data volume, communication time lag is suppressed and an increase in the processing load on the server device is also suppressed.
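The data-volume argument can be illustrated with a rough size comparison in Python. This is only a sketch: the field names in `sample`, the JSON encoding, and the 640x480 frame size are assumptions for the example, and real video is compressed before transmission, so the gap in practice would be smaller than the uncompressed figure suggests.

```python
import json


def face_payload_bytes(face_info: dict) -> int:
    """Size in bytes of one face-information message encoded as compact JSON."""
    return len(json.dumps(face_info, separators=(",", ":")).encode("utf-8"))


def raw_frame_bytes(width: int, height: int, channels: int = 3) -> int:
    """Size in bytes of one uncompressed video frame (8 bits per channel)."""
    return width * height * channels


# Hypothetical per-frame face information; the field names are illustrative.
sample = {
    "mouth_open": True,
    "eyes_closed": False,
    "head_roll_deg": -3.5,
    "head_yaw_deg": 12.0,
    "emotion": "joy",
}

if __name__ == "__main__":
    print("face info bytes:", face_payload_bytes(sample))
    print("raw 640x480 RGB frame bytes:", raw_frame_bytes(640, 480))
```

Under these assumptions a face-information message is on the order of a hundred bytes, while a single uncompressed 640x480 RGB frame is 921,600 bytes.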

［他の実施形態］
前記実施形態では、通訳依頼者端末では常に通訳者のアバター映像が表示される例が記載されているが、本発明はこれには限られない。例えば、通訳者は、通訳者端末の操作に基づいて、通訳者の実画像を表示させるか、アバター映像（アバター映像の種類を含む。）を表示させるかを選択することができてもよい。 [Other Embodiments]
In the above embodiment, an example is described in which the avatar image of the interpreter is always displayed on the interpreter requester terminal, but the present invention is not limited to this. For example, the interpreter may be able to select whether to display the actual image of the interpreter or the avatar image (including the type of the avatar image) based on the operation of the interpreter terminal.

また、前記実施形態とは異なり、アバター映像の背景などは黒色などの単一色であってもよい。 Further, unlike the above-described embodiment, the background of the avatar image may be a single color such as black.

さらには、通訳依頼者は、通訳依頼者端末の操作に基づいて、アバター映像の種類を選択することができてもよい。 Further, the interpreter requester may be able to select the type of the avatar image based on the operation of the interpreter requester terminal.

また、前記実施形態では、第１ユーザ装置は通訳者端末であり、第２ユーザ装置は通訳依頼者端末である例が記載されているが、本発明はこれには限られない。例えば、第１ユーザ装置は、コンビニエンスストアの店員が操作する端末であり、第２ユーザ装置は客が操作する端末であってもよい。また、第１ユーザ装置は家庭教師が操作する端末であり、第２ユーザ装置は生徒が操作する端末であってもよい。 Further, in the above embodiment, an example is described in which the first user device is an interpreter terminal and the second user device is an interpreter requester terminal, but the present invention is not limited to this. For example, the first user device may be a terminal operated by a clerk at a convenience store, and the second user device may be a terminal operated by a customer. Further, the first user device may be a terminal operated by the tutor, and the second user device may be a terminal operated by the student.

また、ＶＲ（ヴァーチャル・リアリティ）において、各アバターが会話する際にも、本発明を適用することができる。さらに、ホログラム、ＡＲ（オーグメンテッド・リアリティ）にも本発明を適用することができる。 Further, in VR (Virtual Reality), the present invention can be applied even when each avatar talks. Furthermore, the present invention can be applied to holograms and AR (augmented reality).

また、前記実施形態においては、ディープラーニングを用いたオープンソースライブラリによる顔認証技術が記載されているが、本発明はこれには限られない。例えば、図６、図７の顔認証技術を用いて通訳者の顔の特徴点が生成され、この特徴点の情報に基づいて顔認証が行われてもよい。 Further, in the above-described embodiment, a face recognition technique using an open source library using deep learning is described, but the present invention is not limited to this. For example, facial feature points of the interpreter's face may be generated using the face recognition techniques of FIGS. 6 and 7, and face recognition may be performed based on the information of the feature points.

また、前記実施形態では、顔の表情、声の抑揚などによって通訳者の感情情報が生成される例が記載されているが、本発明はこれには限られない。例えば、通訳者の発した言葉に対してテキストマイニングを行うことによって、通訳者の感情情報が生成されてもよい。例えば、通訳者が「うれしい」などのポジティブな言葉を発した場合に、アバター映像の表情に、喜びの気持ちが反映されてもよい。 Further, in the above-described embodiment, an example in which emotional information of an interpreter is generated by facial expressions, inflection of voice, etc. is described, but the present invention is not limited to this. For example, emotional information of the interpreter may be generated by performing text mining on the words spoken by the interpreter. For example, when the interpreter utters a positive word such as "happy", the facial expression of the avatar image may reflect the feeling of joy.
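The text-mining variant can be sketched in Python as a simple keyword lookup. This is a deliberately naive illustration: the word lists, the emotion labels, and the substring-counting approach are assumptions for this example, and it presumes the interpreter's speech has already been transcribed to text by some means not described here.

```python
# Hypothetical keyword lexicons; a real system would use a far richer model.
POSITIVE = {"happy", "glad", "great", "うれしい"}
NEGATIVE = {"sad", "sorry", "terrible", "悲しい"}


def emotion_from_text(utterance: str) -> str:
    """Return "joy", "sadness", or "neutral" from keyword counts.

    Matching is by substring, so it does not handle negation or words
    that merely contain a keyword (for example "unhappy").
    """
    text = utterance.lower()
    pos = sum(text.count(w) for w in POSITIVE)
    neg = sum(text.count(w) for w in NEGATIVE)
    if pos > neg:
        return "joy"
    if neg > pos:
        return "sadness"
    return "neutral"
```

The returned label could then be passed to the avatar side as emotion information, for example to select a smiling expression when the label is "joy".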

さらに、顔情報生成手段は、通訳者の音声情報からリップシンクなどの技術を用いて顔情報および感情情報を生成することもできる。具体的には、例えば、通訳者の音声情報から母音情報を抽出し、それに基づいてアバターの口の動きを生成することもできる。 Further, the face information generation means can also generate face information and emotion information from the voice information of the interpreter by using a technique such as lip sync. Specifically, for example, vowel information can be extracted from the voice information of the interpreter, and the movement of the avatar's mouth can be generated based on the vowel information.
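The vowel-based lip-sync idea can be sketched in Python as a lookup from vowels to mouth-opening weights. The mapping values and function name are assumptions for illustration, and the extraction of vowels from the audio signal itself is not shown.

```python
# Hypothetical mouth-opening weight (0.0 closed .. 1.0 fully open) per vowel.
VOWEL_TO_MOUTH = {"a": 1.0, "i": 0.3, "u": 0.2, "e": 0.5, "o": 0.7}


def mouth_track(phonemes: list[str]) -> list[float]:
    """Map a phoneme sequence to mouth-opening weights.

    Anything that is not one of the five vowels is treated as a closed mouth.
    """
    return [VOWEL_TO_MOUTH.get(p, 0.0) for p in phonemes]
```

Each weight could be used to drive a mouth shape on the avatar model frame by frame.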

また、前記実施形態では、顔情報は通訳者端末で生成される例が記載されているが、本発明はこれには限られない。例えば、通訳依頼者による通訳依頼操作のあとに、通訳者端末から通訳者の顔画像（実写映像）を受信した通訳依頼者端末に備えられる顔情報生成手段が、通訳者の顔画像から通訳者の顔の部位に関する顔情報を生成し、その顔情報に基づいて通訳依頼者端末のアバター画像生成手段が通訳者のアバター画像を生成してもよい。 Further, in the above embodiment, an example in which the face information is generated by the interpreter terminal is described, but the present invention is not limited to this. For example, after the interpreter requester performs the interpretation request operation, the interpreter requester terminal may receive the interpreter's face image (live-action video) from the interpreter terminal, face information generation means provided in the interpreter requester terminal may generate face information about parts of the interpreter's face from that face image, and the avatar image generation means of the interpreter requester terminal may generate the interpreter's avatar image based on that face information.

また、前記実施形態では、顔情報生成手段は、通訳者の顔に基づいて顔情報のみを生成する例が記載されているが、本発明はこれには限られない。例えば、顔情報生成手段は、顔のほか、手の動き、上半身の動きなどに基づいて情報を生成してもよい。 Further, in the above-described embodiment, an example is described in which the face information generating means generates only face information based on the face of an interpreter, but the present invention is not limited to this. For example, the face information generating means may generate information based on the movement of the hand, the movement of the upper body, and the like in addition to the face.

また、図５の例では、サーバ装置と通訳依頼者端末との通信が開始されたのちに通訳依頼者端末のカメラおよびマイクが起動する例が記載されているが、本発明はこれには限られない。サーバ装置と通訳依頼者端末との通信が開始される前（例えば、通訳依頼者が通訳アプリケーションを起動したとき）に通訳依頼者端末のカメラおよびマイクが起動してもよい。 Further, in the example of FIG. 5, the camera and the microphone of the interpreter requester terminal are activated after communication between the server device and the interpreter requester terminal has started, but the present invention is not limited to this. The camera and microphone of the interpreter requester terminal may be activated before communication between the server device and the interpreter requester terminal starts (for example, when the interpreter requester launches the interpreter application).

また、通訳者端末に用いられるコンピュータプログラム、および、通訳依頼者端末に用いられるコンピュータプログラムは、それぞれ２つ以上のコンピュータプログラム（アプリケーション）で構成されていてもよい。 Further, the computer program used for the interpreter terminal and the computer program used for the interpreter requester terminal may each be composed of two or more computer programs (applications).

また、前記実施形態においては、通訳者端末はラップトップコンピュータである例が記載されているが、デスクトップコンピュータ、スマートフォン、タブレットなどであってもよい。 Further, in the above embodiment, the interpreter terminal is described as a laptop computer, but it may be a desktop computer, a smartphone, a tablet, or the like.

同様に、前記実施形態においては、通訳依頼者端末はスマートフォン、タブレットなどの携帯型端末である例が記載されているが、デスクトップコンピュータ、ラップトップコンピュータなどであってもよい。 Similarly, in the above-described embodiment, the interpreter requester terminal is described as a portable terminal such as a smartphone or tablet, but may be a desktop computer, a laptop computer, or the like.

１ビデオ通話システム
２サーバ装置
２０１通訳者管理手段
２０２通訳依頼者管理手段
２０３サービス管理手段
２０４通信手段
３通訳者端末
３０１通訳応答手段
３０２撮影手段
３０３音声入力手段
３０４顔情報生成手段
３０５映像出力手段
３０６音声出力手段
３０７通信手段
４通訳依頼者端末
４０１通訳依頼手段
４０２撮影手段
４０３音声入力手段
４０４アバター画像生成手段
４０５音声出力手段
４０６位置情報検出手段
４０７通信手段
５通信ネットワーク
1 Video call system
2 Server device
201 Interpreter management means
202 Interpreter requester management means
203 Service management means
204 Communication means
3 Interpreter terminal
301 Interpreter response means
302 Photographing means
303 Voice input means
304 Face information generation means
305 Video output means
306 Voice output means
307 Communication means
4 Interpreter requester terminal
401 Interpreter request means
402 Photographing means
403 Voice input means
404 Avatar image generation means
405 Voice output means
406 Position information detection means
407 Communication means
5 Communication network

Claims

A video call system comprising a plurality of user devices that are connected to each other by a communication network and that accept operations from their respective users, wherein
a first user device, which is at least one of the user devices, comprises:
photographing means for capturing a face image of a first user, who is the user operating the first user device;
voice input means for inputting the voice of the first user;
face information generation means for generating, from the captured face image of the first user, face information regarding parts of the first user's face; and
communication means for transmitting at least the face information and the voice of the first user to a second user device, which is a user device other than the first user device among the plurality of user devices, and
the second user device comprises:
avatar image generation means for generating and outputting an avatar image of the first user based on the face information of the first user transmitted from the first user device; and
voice output means for outputting the voice of the first user.

The first user device is operated by an interpreter who has been requested to interpret.
The second user device is operated by an interpreter requester who requests the interpreter to interpret.
The video call system according to claim 1.

The face information is face recognition information regarding at least the opening and closing of the mouth and of the eyes, which are parts of the first user's face, and the inclination and rotation of the head.
The video call system according to claim 1 or 2.

The avatar image generation means generates the avatar image by changing, based on the face recognition information, the vertex positions of the eyes, mouth, and face of the avatar image, which is composed of a three-dimensional model.
The video call system according to claim 3.

The face information generating means generates emotion information for expressing the emotion of the first user based on at least one of the face information of the first user and the voice.
The avatar image generation means generates the avatar image of the first user based on the emotion information in addition to the face information.
The video call system according to any one of claims 1 to 4.

A computer program that causes a computer device to function as:
photographing means for capturing a face image of a user who operates the computer device;
voice input means for inputting the user's voice;
face information generation means for generating, from the captured face image of the user, face information regarding parts of the user's face so that an avatar image of the user can be generated in another computer device; and
communication means for transmitting at least the face information and the voice of the user to the other computer device.

A computer program that causes a computer device to function as:
communication means for receiving from another computer device, based on an operation by a user, voice of another user who operates the other computer device and face information regarding parts of the other user's face, the face information having been generated from a face image of the other user captured by the other computer device;
avatar image generation means for generating and outputting an avatar image of the other user based on the face information; and
voice output means for outputting the voice of the other user.

A computer program that causes a computer device to function as:
communication means for receiving from another computer device, based on an operation by a user, a face image and voice of another user who operates the other computer device, the face image having been captured by the other computer device;
face information generation means for generating face information regarding parts of the other user's face from the face image of the other user received from the other computer device;
avatar image generation means for generating and outputting an avatar image of the other user based on the face information; and
voice output means for outputting the voice of the other user.