JP2023084986A

JP2023084986A - Display control system, display control method, and program

Info

Publication number: JP2023084986A
Application number: JP2021199424A
Authority: JP
Inventors: Hajime Kawatake; Tatsuya Inoue
Original assignee: Pocketalk Corp
Current assignee: Pocketalk Corp
Priority date: 2021-12-08
Filing date: 2021-12-08
Publication date: 2023-06-20

Abstract

To provide a display control system, a display control method, and a program capable of displaying a translation result of input voice in a timely manner. SOLUTION: An input relay unit 80 receives voice data representing voice input by a speaker. The input relay unit 80 also receives a confirmation request output in response to a prescribed operation performed by the speaker. A character string relay unit 84 performs control so that, triggered by reception of the confirmation request, translation of the voice represented by the voice data received before reception of the confirmation request is started. A display control unit 50 causes a display unit to display a screen on which is arranged an image obtained by superimposing, on an image captured by a photographing unit, a character string representing a translation result of the voice represented by the voice data received before reception of the confirmation request. SELECTED DRAWING: Figure 8A

Description

Application of Article 30, Paragraph 2 of the Patent Act has been requested. June 15, 2021: https://sourcenext.co.jp/, https://sourcenext.co.jp/pressrelease_html/JS/2021/2021061503/. September 9, 2021: https://pocketalk.jp/details/subtitles/, https://pocketalk.com/software/subtitles/verify/

本発明は、表示制御システム、表示制御方法及びプログラムに関する。 The present invention relates to a display control system, display control method and program.

There is a technique for displaying an image in which a character string representing a translation result of voice is superimposed on an image captured by a photographing unit. As an example of such a technique, Patent Document 1 describes a video conference system that displays on a screen a video signal of video data in which translated character information, obtained by translating the data of voice spoken by a speaker, is superimposed on a video signal obtained by imaging the speaker.

There is also a technique that starts translating the voice input up to that point, triggered by the absence of recognizable voice input for several seconds.

Patent Document 1: JP 2015-153408 A

If, in the technique described in Patent Document 1, translation of the voice input so far is started only once no recognizable voice has been input for several seconds, a certain amount of time passes between the input of the voice and the display of its translation result. As a result, video conference participants and others cannot grasp the translation result of the voice in a timely manner.

本発明は上記課題に鑑みてなされたものであって、その目的の１つは、入力される音声の翻訳結果を適時に表示できる表示制御システム、表示制御方法及びプログラムを提供することにある。 The present invention has been made in view of the above problems, and one of its objects is to provide a display control system, a display control method, and a program capable of displaying a translation result of input speech in a timely manner.

A display control system according to the present invention includes: voice data receiving means for receiving voice data representing voice input by a speaker; confirmation request receiving means for receiving a confirmation request output in response to a predetermined operation performed by the speaker; translation control means for performing control so that, triggered by reception of the confirmation request, translation of the voice represented by the voice data received up to reception of the confirmation request is started; and translation result display control means for causing a display unit to display a screen on which is arranged an image in which a character string representing a translation result of the voice represented by the voice data received up to reception of the confirmation request is superimposed on an image captured by a photographing unit.

In one aspect of the present invention, the display control system further includes speech recognition result display control means for causing the display unit to display a screen on which is arranged an image in which a character string representing a speech recognition result of the voice represented by the voice data is superimposed on the image captured by the photographing unit. Before the confirmation request is received, the speech recognition result display control means causes the display unit to display a screen on which is arranged an image in which a character string representing a speech recognition result of the voice represented by the already received voice data is superimposed on the image captured by the photographing unit.

In one aspect of the present invention, the translation result display control means causes the display unit to display a screen on which is arranged an image in which both a character string representing a speech recognition result of the voice represented by the voice data received up to reception of the confirmation request and a character string representing a translation result of that voice are superimposed on the image captured by the photographing unit.

In one aspect of the present invention, the display control system further includes an image output unit that outputs, to a video conference system, an image in which a character string is superimposed on the image captured by the photographing unit, and the translation result display control means causes the display unit to display the screen generated by the video conference system.

In one aspect of the present invention, the voice data receiving means receives, from a terminal, the voice data representing voice input to the terminal by the speaker; the confirmation request receiving means receives the confirmation request transmitted from the terminal in response to a predetermined operation performed on the terminal by the speaker; the translation result display control means causes a display unit of the terminal to display a character string representing a translation result of the voice represented by the voice data received up to reception of the confirmation request; and the translation result display control means causes a display unit of a client device to display a screen on which is arranged an image in which the character string representing that translation result is superimposed on the image captured by the photographing unit.

Alternatively, the voice data receiving means receives, from a client device, the voice data representing voice input to the client device by the speaker; the confirmation request receiving means receives the confirmation request transmitted from the client device in response to a predetermined operation performed on the client device by the speaker; and the translation result display control means causes the display unit of the client device to display a screen on which is arranged an image in which a character string representing a translation result of the voice represented by the voice data received up to reception of the confirmation request is superimposed on the image captured by the photographing unit.

In one aspect of the present invention, the translation control means performs control so that translation into a plurality of languages of the voice represented by the voice data received up to reception of the confirmation request is started, and the translation result display control means causes the display unit to display a screen on which is arranged an image in which, for each of the plurality of languages, a character string representing a translation result of the voice represented by the voice data is superimposed on the image captured by the photographing unit.

A display control method according to the present invention includes the steps of: receiving voice data representing voice input by a speaker; receiving a confirmation request output in response to a predetermined operation performed by the speaker; performing control so that, triggered by reception of the confirmation request, translation of the voice represented by the voice data received up to reception of the confirmation request is started; and causing a display unit to display a screen on which is arranged an image in which a character string representing a translation result of the voice represented by the voice data received up to reception of the confirmation request is superimposed on an image captured by a photographing unit.

A program according to the present invention causes a computer to execute: a procedure for receiving voice data representing voice input by a speaker; a procedure for receiving a confirmation request output in response to a predetermined operation performed by the speaker; a procedure for performing control so that, triggered by reception of the confirmation request, translation of the voice represented by the voice data received up to reception of the confirmation request is started; and a procedure for causing a display unit to display a screen on which is arranged an image in which a character string representing a translation result of the voice represented by the voice data received up to reception of the confirmation request is superimposed on an image captured by a photographing unit.

FIG. 1 is a diagram showing an example of the overall configuration of a video conference translation system according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the back surface of a terminal according to an embodiment of the present invention.
FIG. 3A is a diagram showing an example of the configuration of a terminal according to an embodiment of the present invention.
FIG. 3B is a diagram showing an example of the configuration of a client device according to an embodiment of the present invention.
FIG. 3C is a diagram showing an example of the configuration of a relay device according to an embodiment of the present invention.
FIG. 3D is a diagram showing an example of the configuration of an audio processing system according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example of a video conference screen.
FIG. 5 is a diagram showing an example of a speech recognition result image.
FIG. 6 is a diagram showing an example of a video conference screen.
FIG. 7 is a diagram showing an example of a translation result image.
A functional block diagram showing an example of functions implemented by the terminal, the relay device, and the audio processing system according to an embodiment of the present invention.
A functional block diagram showing an example of functions implemented by the client device according to an embodiment of the present invention.
A flow diagram showing an example of the flow of processing performed in the relay device according to an embodiment of the present invention.
A flow diagram showing an example of the flow of processing performed in the relay device according to an embodiment of the present invention.
A flow diagram showing an example of the flow of processing performed in the client device according to an embodiment of the present invention.
A diagram showing an example of a video conference screen.
A diagram showing an example of the configuration of a client device according to a modification of an embodiment of the present invention.

以下、本発明の一実施形態について、図面を参照しながら説明する。 An embodiment of the present invention will be described below with reference to the drawings.

図１は、本実施形態に係るテレビ会議用翻訳システム１の全体構成の一例を示す図である。図２は、本実施形態に係る端末１０の背面の一例を示す図である。図３Ａは、本実施形態に係る端末１０の構成の一例を示す図である。図３Ｂは、本実施形態に係るクライアント装置１２の構成の一例を示す図である。図３Ｃは、本実施形態に係る中継装置１４の構成の一例を示す図である。図３Ｄは、本実施形態に係る音声処理システム１６の構成の一例を示す図である。 FIG. 1 is a diagram showing an example of the overall configuration of a video conference translation system 1 according to this embodiment. FIG. 2 is a diagram showing an example of the back surface of the terminal 10 according to this embodiment. FIG. 3A is a diagram showing an example of the configuration of the terminal 10 according to this embodiment. FIG. 3B is a diagram showing an example of the configuration of the client device 12 according to this embodiment. FIG. 3C is a diagram showing an example of the configuration of the relay device 14 according to this embodiment. FIG. 3D is a diagram showing an example of the configuration of the audio processing system 16 according to this embodiment.

As shown in FIG. 1, the video conference translation system 1 according to this embodiment includes a terminal 10, a client device 12, a relay device 14, an audio processing system 16, and a video conference system 18. The terminal 10, the client device 12, the relay device 14, the audio processing system 16, and the video conference system 18 are connected to a computer network 20 such as the Internet. They can therefore communicate with one another via the computer network 20.

本実施形態に係る端末１０は、リモート会議等のテレビ会議に参加するユーザによって利用されるコンピュータである。図３Ａに示すように、本実施形態に係る端末１０には、例えば、プロセッサ１０ａ、記憶部１０ｂ、通信部１０ｃ、操作部１０ｄ、撮影部１０ｅ、タッチパネル１０ｆ、マイク１０ｇ、スピーカ１０ｈが含まれる。 A terminal 10 according to the present embodiment is a computer used by a user who participates in a video conference such as a remote conference. As shown in FIG. 3A, the terminal 10 according to this embodiment includes, for example, a processor 10a, a storage unit 10b, a communication unit 10c, an operation unit 10d, an imaging unit 10e, a touch panel 10f, a microphone 10g, and a speaker 10h.

プロセッサ１０ａは、例えば端末１０にインストールされるプログラムに従って動作するマイクロプロセッサ等のプログラム制御デバイスである。 The processor 10a is a program-controlled device such as a microprocessor that operates according to a program installed in the terminal 10, for example.

記憶部１０ｂは、例えばＲＯＭやＲＡＭ等の記憶素子などである。記憶部１０ｂには、プロセッサ１０ａによって実行されるプログラムなどが記憶される。 The storage unit 10b is, for example, a storage element such as ROM or RAM. The storage unit 10b stores programs and the like executed by the processor 10a.

通信部１０ｃは、例えばコンピュータネットワーク２０を介して中継装置１４との間でデータを授受するための通信インタフェースである。ここで通信部１０ｃに、基地局を含む携帯電話回線を経由してインターネット等のコンピュータネットワーク２０と通信を行う無線通信モジュールが含まれていてもよい。また通信部１０ｃに、Ｗｉ－Ｆｉ（登録商標）ルータ等を経由してインターネット等のコンピュータネットワーク２０と通信を行う無線ＬＡＮモジュールが含まれていてもよい。 The communication unit 10c is a communication interface for exchanging data with the relay device 14 via the computer network 20, for example. Here, the communication unit 10c may include a wireless communication module that communicates with a computer network 20 such as the Internet via a mobile phone line including a base station. The communication unit 10c may also include a wireless LAN module that communicates with a computer network 20 such as the Internet via a Wi-Fi (registered trademark) router or the like.

The operation unit 10d is, for example, an operation member such as a button or a touch sensor that outputs the content of an operation performed by the user to the processor 10a. FIG. 1 shows, as examples of the operation unit 10d, a translation button 10da that is pressed when inputting voice to be translated, a power button 10db for turning the power on and off, and a volume adjustment unit 10dc for adjusting the volume of the sound output from the speaker 10h. The translation button 10da is arranged below the touch panel 10f provided on the front surface of the terminal 10. The power button 10db and the volume adjustment unit 10dc are arranged on the right side surface of the terminal 10.

撮影部１０ｅは、例えばデジタルカメラなどの撮影デバイスである。図２に示すように、本実施形態に係る端末１０は、背面に撮影部１０ｅが設けられている。 The photographing unit 10e is, for example, a photographing device such as a digital camera. As shown in FIG. 2, the terminal 10 according to the present embodiment is provided with an imaging unit 10e on the back.

タッチパネル１０ｆは、例えばタッチセンサと液晶ディスプレイや有機ＥＬディスプレイ等のディスプレイとが一体となったものである。タッチパネル１０ｆは、端末１０の前面に設けられており、プロセッサ１０ａが生成する画面などを表示させる。 The touch panel 10f is, for example, a combination of a touch sensor and a display such as a liquid crystal display or an organic EL display. The touch panel 10f is provided on the front surface of the terminal 10, and displays a screen generated by the processor 10a.

マイク１０ｇは、例えば受け付ける音声を電気信号に変換する音声入力デバイスである。ここでマイク１０ｇが、端末１０に内蔵されている、人混みでも人の声が認識しやすいノイズキャンセリング機能を備えたデュアルマイクであってもよい。 The microphone 10g is, for example, an audio input device that converts received audio into electrical signals. Here, the microphone 10g may be a dual microphone built into the terminal 10 and equipped with a noise canceling function that makes it easy to recognize human voices even in a crowd.

スピーカ１０ｈは、例えば音声を出力する音声出力デバイスである。ここでスピーカ１０ｈが、端末１０に内蔵されている、騒がしい場所でも使えるダイナミックスピーカーであってもよい。 The speaker 10h is, for example, an audio output device that outputs audio. Here, the speaker 10h may be a dynamic speaker that is built into the terminal 10 and that can be used even in noisy places.

本実施形態に係るクライアント装置１２は、スマートフォン、タブレット端末、パーソナルコンピュータ、などの一般的なコンピュータである。図３Ｂに示すように、本実施形態に係るクライアント装置１２には、例えば、プロセッサ１２ａ、記憶部１２ｂ、通信部１２ｃ、操作部１２ｄ、撮影部１２ｅ、ディスプレイ１２ｆ、マイク１２ｇ、スピーカ１２ｈが含まれる。 The client device 12 according to this embodiment is a general computer such as a smart phone, tablet terminal, or personal computer. As shown in FIG. 3B, the client device 12 according to this embodiment includes, for example, a processor 12a, a storage unit 12b, a communication unit 12c, an operation unit 12d, a photographing unit 12e, a display 12f, a microphone 12g, and a speaker 12h. .

本実施形態に係るクライアント装置１２は、リモート会議等のテレビ会議が行われている際に、端末１０を利用するユーザによって利用されるものである。すなわち、本実施形態では、端末１０のユーザとクライアント装置１２のユーザとは同じである。 The client device 12 according to the present embodiment is used by a user who uses the terminal 10 during a teleconference such as a remote conference. That is, in this embodiment, the user of the terminal 10 and the user of the client device 12 are the same.

プロセッサ１２ａは、例えばクライアント装置１２にインストールされるプログラムに従って動作するＣＰＵ等のプログラム制御デバイスである。 The processor 12a is a program-controlled device such as a CPU that operates according to a program installed in the client device 12, for example.

記憶部１２ｂは、例えばＲＯＭやＲＡＭ等の記憶素子やソリッドステートドライブやハードディスクドライブなどである。記憶部１２ｂには、プロセッサ１２ａによって実行されるプログラムなどが記憶される。 The storage unit 12b is, for example, a storage element such as ROM or RAM, a solid state drive, a hard disk drive, or the like. The storage unit 12b stores programs and the like executed by the processor 12a.

通信部１２ｃは、例えばネットワークボードや無線ＬＡＮモジュールなどの通信インタフェースなどである。通信部１２ｃは、例えばコンピュータネットワーク２０を介して中継装置１４やテレビ会議システム１８との間でデータを授受する。 The communication unit 12c is, for example, a communication interface such as a network board or a wireless LAN module. The communication unit 12c exchanges data with the relay device 14 and the video conference system 18 via the computer network 20, for example.

操作部１２ｄは、例えばキーボードやマウスなどといったユーザインタフェースであって、ユーザの操作入力を受け付けて、その内容を示す信号をプロセッサ１２ａに出力する。 The operation unit 12d is a user interface such as a keyboard and a mouse, for example, and receives user's operation input and outputs a signal indicating the content of the input to the processor 12a.

撮影部１２ｅは、例えばデジタルビデオカメラなどの撮影デバイスである。撮影部１２ｅは、クライアント装置１２のユーザを撮影可能な位置に配置されている。本実施形態に係る撮影部１２ｅは、動画像を撮影できるようになっている。 The photographing unit 12e is, for example, a photographing device such as a digital video camera. The photographing unit 12e is arranged at a position where the user of the client device 12 can be photographed. The photographing unit 12e according to this embodiment is capable of photographing moving images.

ディスプレイ１２ｆは、例えば液晶ディスプレイや有機ＥＬディスプレイ等の表示デバイスであって、プロセッサ１２ａの指示に従って各種の画像を表示する。 The display 12f is a display device such as a liquid crystal display or an organic EL display, and displays various images according to instructions from the processor 12a.

マイク１２ｇは、例えば受け付ける音声を電気信号に変換する音声入力デバイスである。 The microphone 12g is, for example, an audio input device that converts received audio into electrical signals.

スピーカ１２ｈは、例えば音声を出力する音声出力デバイスである。 The speaker 12h is an audio output device that outputs audio, for example.

In this embodiment, the relay device 14 is a computer system, such as a server computer, that relays, for example, voice data representing voice input to the terminal 10, a speech recognition result character string representing the speech recognition result of that voice, and a translation result character string representing the translation result of that voice. The video conference translation system 1 may include one relay device 14 or a plurality of relay devices 14. As shown in FIG. 3C, the relay device 14 according to this embodiment includes, for example, a processor 14a, a storage unit 14b, and a communication unit 14c.

プロセッサ１４ａは、例えば中継装置１４にインストールされるプログラムに従って動作するＣＰＵ等のプログラム制御デバイスである。 The processor 14a is a program control device such as a CPU that operates according to a program installed in the relay device 14, for example.

記憶部１４ｂは、例えばＲＯＭやＲＡＭ等の記憶素子やソリッドステートドライブやハードディスクドライブなどである。記憶部１４ｂには、プロセッサ１４ａによって実行されるプログラムなどが記憶される。 The storage unit 14b is, for example, a storage element such as ROM or RAM, a solid state drive, a hard disk drive, or the like. Programs and the like executed by the processor 14a are stored in the storage unit 14b.

通信部１４ｃは、例えばネットワークボードなどの通信インタフェースなどである。通信部１４ｃは、例えばコンピュータネットワーク２０を介して端末１０、クライアント装置１２、及び、音声処理システム１６との間でデータを授受する。 The communication unit 14c is, for example, a communication interface such as a network board. The communication unit 14c exchanges data with the terminal 10, the client device 12, and the audio processing system 16 via the computer network 20, for example.

The audio processing system 16 is a computer system, such as a server computer, that performs audio processing such as speech recognition of the voice represented by received voice data and translation of that voice. The audio processing system 16 may consist of one computer or of a plurality of computers. As shown in FIG. 3D, the audio processing system 16 according to this embodiment includes, for example, a processor 16a, a storage unit 16b, and a communication unit 16c.

プロセッサ１６ａは、例えば音声処理システム１６にインストールされるプログラムに従って動作するＣＰＵ等のプログラム制御デバイスである。 The processor 16a is a program-controlled device such as a CPU that operates according to programs installed in the audio processing system 16, for example.

記憶部１６ｂは、例えばＲＯＭやＲＡＭ等の記憶素子やソリッドステートドライブやハードディスクドライブなどである。記憶部１６ｂには、プロセッサ１６ａによって実行されるプログラムなどが記憶される。 The storage unit 16b is, for example, a storage element such as ROM or RAM, a solid state drive, a hard disk drive, or the like. The storage unit 16b stores programs and the like executed by the processor 16a.

通信部１６ｃは、例えばネットワークボードなどの通信インタフェースなどである。通信部１６ｃは、例えばコンピュータネットワーク２０を介して中継装置１４との間でデータを授受する。 The communication unit 16c is, for example, a communication interface such as a network board. The communication unit 16c exchanges data with the relay device 14 via the computer network 20, for example.

テレビ会議システム１８は、例えば、複数の参加者によるリモート会議等のテレビ会議を実現する一般的なテレビ会議システムである。本実施形態では例えば、クライアント装置１２に、テレビ会議システム１８と連携して動作する、当該テレビ会議システム１８に係るクライアントソフトウェアがインストールされていることとする。 The video conference system 18 is, for example, a general video conference system that realizes a video conference such as a remote conference by a plurality of participants. In this embodiment, for example, it is assumed that the client device 12 is installed with client software related to the video conference system 18 that operates in cooperation with the video conference system 18 .

In this embodiment, it is assumed that a video conference such as a remote conference, in which a plurality of participants including the user of the terminal 10 and the client device 12 take part, has already been started through the functions of the video conference system 18.

In this embodiment, the user performs a predetermined operation on the terminal 10 in advance to set a pre-translation language, which is the language of the voice input to the terminal 10, and a post-translation language, which is the language into which that voice is translated. In the following description, Japanese is set as the pre-translation language and English is set as the post-translation language.

In this embodiment, speech recognition processing is performed on the voice input through the microphone 10g between the time the user presses a predetermined button provided on the terminal 10 (here, for example, the translation button 10da) with a finger and the time the user releases it. Triggered by the user releasing the translation button 10da, translation processing is then performed on the voice input through the microphone 10g between the press and the release. Hereinafter, the state in which the translation button 10da is pressed is called the input-on state, and the state in which the translation button 10da is not pressed is called the input-off state.
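The press-and-release behaviour described above can be pictured with a minimal sketch. It is an illustration only, not the implementation of the terminal 10: the names `InputState` and `PushToTalkInput` and the use of byte chunks for microphone input are assumptions introduced here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class InputState(Enum):
    OFF = "input-off"  # translation button not pressed
    ON = "input-on"    # translation button held down


@dataclass
class PushToTalkInput:
    """Hypothetical tracker for the button state and the audio captured while held."""

    state: InputState = InputState.OFF
    chunks: List[bytes] = field(default_factory=list)

    def press(self) -> None:
        # Entering the input-on state starts a fresh utterance.
        self.state = InputState.ON
        self.chunks = []

    def feed(self, chunk: bytes) -> bool:
        # Audio is kept only while the button is held; returns whether it was kept.
        if self.state is InputState.ON:
            self.chunks.append(chunk)
            return True
        return False

    def release(self) -> Optional[bytes]:
        # Releasing the button ends the utterance. The returned audio is what a
        # confirmation request would refer to; None means the button was not held.
        if self.state is not InputState.ON:
            return None
        self.state = InputState.OFF
        return b"".join(self.chunks)
```

In this sketch the release of the button, rather than a silence timeout, marks the end of the utterance, which is the point at which translation would be triggered.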

In this embodiment, for example, while the terminal 10 is in the input-on state, speech recognition processing is performed successively on the voice input from the moment the input-off state changed to the input-on state up to the present. A speech recognition result character string, which is a character string representing the speech recognition result for that voice, is then displayed on the display 12f of the client device 12 and also on the touch panel 10f of the terminal 10.

FIG. 4 is a diagram showing an example of a video conference screen 30, which is the screen of a video conference such as a remote conference displayed on the display 12f of the client device 12. As shown in FIG. 4, in this embodiment, for example, a video conference screen 30 including a superimposed image 32, in which the speech recognition result character string is superimposed on a captured image of the user who is the speaker that performed voice input to the terminal 10, is displayed on the display 12f. The captured image according to this embodiment is, for example, an image captured by the photographing unit 12e. The captured image may instead be an image captured by the photographing unit 10e.

FIG. 5 is a diagram showing an example of a speech recognition result image 34 displayed on the touch panel 10f of the terminal 10. As shown in FIG. 5, in this embodiment, the same character string as the one arranged on the video conference screen 30 shown in FIG. 4 is also arranged in the speech recognition result image 34.

In this embodiment, as described above, while the terminal 10 is in the input-on state, speech recognition processing is performed successively on the voice input from the moment the terminal 10 changed from the input-off state to the input-on state up to the present. Each time the speech recognition processing is performed, the speech recognition result character string displayed on the touch panel 10f and the display 12f is updated.
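The successive recognition described above can be sketched as follows. This is a simplified illustration under stated assumptions: `live_captions` and `fake_recognize` are hypothetical names, and the recognizer is a stand-in that merely decodes bytes as text. Each pass covers all audio received since the input-on state began, so each yielded string replaces the previously displayed one.

```python
from typing import Callable, Iterable, Iterator, List


def live_captions(
    chunks: Iterable[bytes],
    recognize: Callable[[bytes], str],
) -> Iterator[str]:
    """Yield the caption to display after each newly received audio chunk.

    Every pass recognizes everything received since the input-on state began,
    so each yielded string replaces the previous one instead of extending it.
    """
    received: List[bytes] = []
    for chunk in chunks:
        received.append(chunk)
        yield recognize(b"".join(received))


def fake_recognize(audio: bytes) -> str:
    # Stand-in for a real speech recognizer: treats the audio bytes as UTF-8 text.
    return audio.decode("utf-8")
```

With the stand-in recognizer, feeding the chunks `b"good "` and `b"morning"` yields `"good "` and then `"good morning"`; a display would show only the latest string.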

そして、ユーザが翻訳ボタン１０ｄａから指を離し、端末１０が入力オフ状態になると、端末１０から中継装置１４に確定要求が送信される。そして、端末１０が入力オン状態であった間に入力された音声に対して最終の音声認識処理が実行される。そして、当該音声認識処理の結果を表す音声認識結果文字列に対して翻訳処理が実行され、当該音声認識結果文字列を翻訳した翻訳結果文字列が生成される。ここでは例えば、日本語の文字列である音声認識結果文字列を翻訳した英語の文字列である翻訳結果文字列が生成される。 Then, when the user releases the translation button 10da and the terminal 10 enters the input-off state, the terminal 10 transmits a confirmation request to the relay device 14. A final speech recognition process is then executed on the speech input while the terminal 10 was in the input-on state. Translation processing is then executed on the speech recognition result character string representing the result of that speech recognition process, and a translation result character string, which is a translation of the speech recognition result character string, is generated. Here, for example, a translation result character string in English is generated by translating a speech recognition result character string in Japanese.

そして、このようにして生成される音声認識文字列及び翻訳結果文字列が、クライアント装置１２のディスプレイ１２ｆに表示されるとともに、端末１０のタッチパネル１０ｆにも表示される。 The speech recognition character string and the translation result character string generated in this way are displayed on the display 12f of the client device 12 and also on the touch panel 10f of the terminal 10.

例えば、図６に示すように、端末１０への音声入力を行った発話者であるユーザを撮影した撮影画像に音声認識結果文字列及び翻訳結果文字列を重畳した重畳画像３２が配置されたテレビ会議画面３０がディスプレイ１２ｆに表示される。 For example, as shown in FIG. 6, the display 12f displays the teleconference screen 30 on which a superimposed image 32 is arranged, the superimposed image 32 being one in which the speech recognition result character string and the translation result character string are superimposed on a photographed image of the user, that is, the speaker who performed voice input to the terminal 10.

また、図７に示すように、図６に示すテレビ会議画面３０に配置されている音声認識結果文字列と同じ文字列、及び、図６に示すテレビ会議画面３０に配置されている翻訳結果文字列と同じ文字列が配置された翻訳結果画像３６がタッチパネル１０ｆに表示される。 Further, as shown in FIG. 7, the touch panel 10f displays a translation result image 36 in which two character strings are arranged: the same character string as the speech recognition result character string arranged on the teleconference screen 30 shown in FIG. 6, and the same character string as the translation result character string arranged on that teleconference screen 30.

図６には、説明の都合上、翻訳結果文字列が視認しやすいテレビ会議画面３０が示されているが、実際には、翻訳結果文字列が配置される画面の背景の画像（ここでは例えば撮影画像）によっては表示されている翻訳結果文字列が見にくくなり、発話者であるユーザが翻訳結果を的確に把握できないことがあった。 For convenience of explanation, FIG. 6 shows a teleconference screen 30 on which the translation result character string is easy to see. In practice, however, depending on the background image of the screen on which the translation result character string is arranged (here, for example, the photographed image), the displayed translation result character string could become hard to read, and the user who is the speaker sometimes could not accurately grasp the translation result.

本実施形態では、図７に示すように、図６に示す翻訳結果文字列と同じ文字列が配置された翻訳結果画像３６が端末１０のタッチパネル１０ｆに表示される。 In the present embodiment, as shown in FIG. 7, a translation result image 36 in which the same character string as the translation result character string shown in FIG. 6 is arranged is displayed on the touch panel 10f of the terminal 10.

このようにして、本実施形態によれば、ユーザが入力する音声の翻訳結果を当該ユーザが的確に把握できることとなる。 In this manner, according to the present embodiment, the user can accurately grasp the translation result of the speech input by the user.

また、図４及び図６には、説明の都合上、音声認識結果文字列が視認しやすいテレビ会議画面３０が示されているが、実際には、音声認識結果文字列が配置される画面の背景の画像（ここでは例えば撮影画像）によっては表示されている音声認識結果文字列が見にくくなり、発話者であるユーザが音声認識結果を的確に把握できないことがあった。 Similarly, for convenience of explanation, FIGS. 4 and 6 show a teleconference screen 30 on which the speech recognition result character string is easy to see. In practice, however, depending on the background image of the screen on which the speech recognition result character string is arranged (here, for example, the photographed image), the displayed speech recognition result character string could become hard to read, and the user who is the speaker sometimes could not accurately grasp the speech recognition result.

本実施形態では、図５に示すように、図４に示す音声認識結果文字列と同じ文字列が配置された音声認識結果画像３４が端末１０のタッチパネル１０ｆに表示される。また、図７に示すように、図６に示す音声認識結果文字列と同じ文字列が配置された翻訳結果画像３６が端末１０のタッチパネル１０ｆに表示される。 In the present embodiment, as shown in FIG. 5, a speech recognition result image 34 in which the same character string as the speech recognition result character string shown in FIG. 4 is arranged is displayed on the touch panel 10f of the terminal 10. Also, as shown in FIG. 7, a translation result image 36 in which the same character string as the speech recognition result character string shown in FIG. 6 is arranged is displayed on the touch panel 10f of the terminal 10.

このようにして、本実施形態によれば、ユーザが入力する音声の音声認識結果を当該ユーザが的確に把握できることとなる。 In this manner, according to the present embodiment, the user can accurately grasp the speech recognition result of the speech input by the user.

また、本実施形態では、中継装置１４が確定要求を受け付けたことをトリガとして、当該確定要求の受付までに受け付けた音声データが表す音声の翻訳が開始される。このようにすることで、数秒間にわたって認識可能な音声の入力がなかったことをトリガに、それまでに入力された音声に対する翻訳を開始する場合と比較して、音声の入力が開始されてから当該音声が翻訳されるまでの時間が短くなる。このようにして本実施形態によれば、入力される音声の翻訳結果が適時に表示できることとなる。 Further, in the present embodiment, reception of the confirmation request by the relay device 14 triggers the start of translation of the speech represented by the voice data received up to the reception of that confirmation request. Compared with a case in which translation of the speech input so far is started when no recognizable speech has been input for several seconds, this shortens the time from the start of speech input until that speech is translated. Thus, according to the present embodiment, the translation result of the input speech can be displayed in a timely manner.
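
The timing argument above can be illustrated with a small calculation. This is only a sketch: the function name and all figures (a 5-second utterance, a 3-second silence timeout, 0.1 seconds for the confirmation request to reach the relay device) are assumptions chosen for illustration and do not come from the specification.

```python
def time_to_translation_start(speech_seconds: float, trigger_delay_seconds: float) -> float:
    """Seconds from the start of speech input until translation can begin.

    trigger_delay_seconds is the wait after the speaker stops talking before
    the system decides the utterance is complete.
    """
    return speech_seconds + trigger_delay_seconds


# Hypothetical figures, for illustration only.
SPEECH_SECONDS = 5.0          # length of the utterance
SILENCE_TIMEOUT = 3.0         # silence-based trigger: wait for several seconds of no input
CONFIRM_REQUEST_DELAY = 0.1   # button-release trigger: time for the request to arrive

silence_based = time_to_translation_start(SPEECH_SECONDS, SILENCE_TIMEOUT)
confirmation_based = time_to_translation_start(SPEECH_SECONDS, CONFIRM_REQUEST_DELAY)

print(f"silence-based trigger:      {silence_based:.1f} s")
print(f"confirmation-based trigger: {confirmation_based:.1f} s")
```

Under these assumed figures the confirmation-based trigger starts translation sooner by roughly the length of the silence timeout.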

以下、本実施形態に係るテレビ会議用翻訳システム１の機能、及び、テレビ会議用翻訳システム１で実行される処理についてさらに説明する。 The functions of the videoconference translation system 1 according to the present embodiment and the processing executed by the videoconference translation system 1 will be further described below.

図８Ａは、本実施形態に係る端末１０、中継装置１４、及び、音声処理システム１６で実装される機能の一例を示す機能ブロック図である。図８Ｂは、本実施形態に係るクライアント装置１２で実装される機能の一例を示す機能ブロック図である。 FIG. 8A is a functional block diagram showing an example of functions implemented by the terminal 10, the relay device 14, and the voice processing system 16 according to this embodiment. FIG. 8B is a functional block diagram showing an example of functions implemented in the client device 12 according to this embodiment.

なお、本実施形態に係る端末１０、中継装置１４、及び、音声処理システム１６で、図８Ａに示す機能のすべてが実装される必要はなく、また、図８Ａに示す機能以外の機能が実装されていても構わない。また、本実施形態に係るクライアント装置１２で、図８Ｂに示す機能のすべてが実装される必要はなく、また、図８Ｂに示す機能以外の機能が実装されていても構わない。 Note that the terminal 10, the relay device 14, and the speech processing system 16 according to the present embodiment need not implement all of the functions shown in FIG. 8A, and may implement functions other than those shown in FIG. 8A. Likewise, the client device 12 according to the present embodiment need not implement all of the functions shown in FIG. 8B, and may implement functions other than those shown in FIG. 8B.

図８Ａに示すように、本実施形態に係る端末１０には、機能的には例えば、操作入力受付部４０、音声入力受付部４２、音声バッファ４４、入力送信部４６、文字列受信部４８、表示制御部５０、が含まれる。操作入力受付部４０は、プロセッサ１０ａ、操作部１０ｄ、及び、タッチパネル１０ｆを主として実装される。音声入力受付部４２は、プロセッサ１０ａ、及び、マイク１０ｇを主として実装される。音声バッファ４４は、記憶部１０ｂを主として実装される。入力送信部４６、文字列受信部４８は、通信部１０ｃを主として実装される。表示制御部５０は、プロセッサ１０ａ、及び、タッチパネル１０ｆを主として実装される。 As shown in FIG. 8A, the terminal 10 according to the present embodiment functionally includes, for example, an operation input reception unit 40, a voice input reception unit 42, an audio buffer 44, an input transmission unit 46, a character string receiving unit 48, and a display control unit 50. The operation input reception unit 40 is mainly implemented by the processor 10a, the operation unit 10d, and the touch panel 10f. The voice input reception unit 42 is mainly implemented by the processor 10a and the microphone 10g. The audio buffer 44 is mainly implemented by the storage unit 10b. The input transmission unit 46 and the character string receiving unit 48 are mainly implemented by the communication unit 10c. The display control unit 50 is mainly implemented by the processor 10a and the touch panel 10f.

以上の機能は、コンピュータである端末１０にインストールされた、以上の機能に対応する指令を含むプログラムをプロセッサ１０ａで実行することにより実装される。このプログラムは、例えば、光ディスク、磁気ディスク、磁気テープ、光磁気ディスク、フラッシュメモリ等のコンピュータ読み取り可能な情報記憶媒体を介して、あるいは、インターネットなどを介して端末１０に供給される。 The functions described above are implemented by causing the processor 10a to execute a program including commands corresponding to the functions described above, which is installed in the terminal 10, which is a computer. This program is supplied to the terminal 10 via a computer-readable information storage medium such as an optical disk, magnetic disk, magnetic tape, magneto-optical disk, flash memory, or the like, or via the Internet.

図８Ｂに示すように、本実施形態に係るクライアント装置１２には、機能的には例えば、音声入力受付部６０、文字列受信部６２、撮影画像取得部６４、重畳画像生成部６６、テレビ会議クライアント部６８、音声出力制御部７０、表示制御部７２、が含まれる。音声入力受付部６０は、プロセッサ１２ａ、及び、マイク１２ｇを主として実装される。文字列受信部６２は、通信部１２ｃを主として実装される。撮影画像取得部６４は、プロセッサ１２ａ、及び、撮影部１２ｅを主として実装される。重畳画像生成部６６は、プロセッサ１２ａを主として実装される。テレビ会議クライアント部６８は、プロセッサ１２ａ、及び、通信部１２ｃを主として実装される。音声出力制御部７０は、プロセッサ１２ａ、及び、スピーカ１２ｈを主として実装される。表示制御部７２は、プロセッサ１２ａ、及び、ディスプレイ１２ｆを主として実装される。 As shown in FIG. 8B, the client device 12 according to the present embodiment functionally includes, for example, a voice input reception unit 60, a character string receiving unit 62, a captured image acquisition unit 64, a superimposed image generation unit 66, a videoconference client unit 68, an audio output control unit 70, and a display control unit 72. The voice input reception unit 60 is mainly implemented by the processor 12a and the microphone 12g. The character string receiving unit 62 is mainly implemented by the communication unit 12c. The captured image acquisition unit 64 is mainly implemented by the processor 12a and the photographing unit 12e. The superimposed image generation unit 66 is mainly implemented by the processor 12a. The videoconference client unit 68 is mainly implemented by the processor 12a and the communication unit 12c. The audio output control unit 70 is mainly implemented by the processor 12a and the speaker 12h. The display control unit 72 is mainly implemented by the processor 12a and the display 12f.

以上の機能は、コンピュータであるクライアント装置１２にインストールされた、以上の機能に対応する指令を含むプログラムをプロセッサ１２ａで実行することにより実装される。このプログラムは、例えば、光ディスク、磁気ディスク、磁気テープ、光磁気ディスク、フラッシュメモリ等のコンピュータ読み取り可能な情報記憶媒体を介して、あるいは、インターネットなどを介してクライアント装置１２に供給される。 The functions described above are implemented by causing the processor 12a to execute a program containing commands corresponding to the functions described above, which is installed in the client device 12, which is a computer. This program is supplied to the client device 12 via a computer-readable information storage medium such as an optical disk, magnetic disk, magnetic tape, magneto-optical disk, flash memory, or the like, or via the Internet.

図８Ａに示すように、本実施形態に係る中継装置１４には、機能的には例えば、入力中継部８０、音声バッファ８２、文字列中継部８４、が、含まれる。入力中継部８０、文字列中継部８４は、通信部１４ｃを主として実装される。音声バッファ８２は、記憶部１４ｂを主として実装される。 As shown in FIG. 8A, the relay device 14 according to the present embodiment functionally includes an input relay unit 80, a voice buffer 82, and a character string relay unit 84, for example. The input relay unit 80 and the character string relay unit 84 are mainly implemented by the communication unit 14c. The audio buffer 82 is implemented mainly in the storage unit 14b.

以上の機能は、コンピュータである中継装置１４にインストールされた、以上の機能に対応する指令を含むプログラムをプロセッサ１４ａで実行することにより実装される。このプログラムは、例えば、光ディスク、磁気ディスク、磁気テープ、光磁気ディスク、フラッシュメモリ等のコンピュータ読み取り可能な情報記憶媒体を介して、あるいは、インターネットなどを介して中継装置１４に供給される。 The functions described above are implemented by causing the processor 14a to execute a program containing commands corresponding to the functions described above, which is installed in the relay device 14, which is a computer. This program is supplied to the relay device 14 via a computer-readable information storage medium such as an optical disk, magnetic disk, magnetic tape, magneto-optical disk, flash memory, or the like, or via the Internet.

図８Ａに示すように、本実施形態に係る音声処理システム１６には、機能的には例えば、音声認識部９０、翻訳部９２、が含まれる。音声認識部９０、翻訳部９２は、プロセッサ１６ａ、及び、通信部１６ｃを主として実装される。 As shown in FIG. 8A, the speech processing system 16 according to this embodiment functionally includes, for example, a speech recognition unit 90 and a translation unit 92 . The speech recognition unit 90 and the translation unit 92 are mainly implemented by the processor 16a and the communication unit 16c.

以上の機能は、コンピュータである音声処理システム１６にインストールされた、以上の機能に対応する指令を含むプログラムをプロセッサ１６ａで実行することにより実装される。このプログラムは、例えば、光ディスク、磁気ディスク、磁気テープ、光磁気ディスク、フラッシュメモリ等のコンピュータ読み取り可能な情報記憶媒体を介して、あるいは、インターネットなどを介して音声処理システム１６に供給される。 The functions described above are implemented by causing the processor 16a to execute a program containing commands corresponding to the functions described above, which is installed in the audio processing system 16, which is a computer. This program is supplied to the audio processing system 16 via a computer-readable information storage medium such as an optical disk, magnetic disk, magnetic tape, magneto-optical disk, flash memory, or the like, or via the Internet.

端末１０の操作入力受付部４０は、本実施形態では例えば、ユーザが翻訳ボタン１０ｄａを指で押下する操作や翻訳ボタン１０ｄａから指を離す操作などといった端末１０に対する操作入力を受け付ける。 In this embodiment, the operation input reception unit 40 of the terminal 10 receives an operation input to the terminal 10 such as an operation of pressing the translation button 10da with the user's finger or an operation of releasing the finger from the translation button 10da.

端末１０の音声入力受付部４２は、本実施形態では例えば、端末１０が入力オン状態である間にマイク１０ｇを介して発話者により入力される音声を受け付ける。 In this embodiment, for example, the voice input reception unit 42 of the terminal 10 receives voice input by the speaker via the microphone 10g while the terminal 10 is in the input ON state.

端末１０の音声バッファ４４は、本実施形態では例えば、マイク１０ｇを介して入力される音声を表す音声データを記憶する。 The audio buffer 44 of the terminal 10 stores, in this embodiment, audio data representing audio input via the microphone 10g, for example.

端末１０の入力送信部４６は、本実施形態では例えば、操作入力受付部４０が受け付ける操作入力に応じた操作信号を中継装置１４に送信する。 The input transmission unit 46 of the terminal 10 transmits to the relay device 14, for example, an operation signal corresponding to the operation input received by the operation input reception unit 40 in this embodiment.

また、入力送信部４６は、本実施形態では例えば、端末１０に入力される音声を表す音声データを中継装置１４に送信する。 Also, in the present embodiment, the input transmission unit 46 transmits, for example, voice data representing voice input to the terminal 10 to the relay device 14 .

本実施形態では例えば、端末１０が入力オフ状態から入力オン状態に変化したことに応じて、入力送信部４６は、通信開始要求を中継装置１４に送信する。そして、端末１０が入力オフ状態から入力オン状態に変化してから、中継装置１４と端末１０との間の通信が確立されるまでの間にマイク１０ｇを介して入力される音声を表す音声データは、音声バッファ４４に蓄積される。 In the present embodiment, for example, the input transmission unit 46 transmits a communication start request to the relay device 14 in response to the terminal 10 changing from the input-off state to the input-on state. Audio data representing speech input through the microphone 10g between the time the terminal 10 changes from the input-off state to the input-on state and the time communication between the relay device 14 and the terminal 10 is established is accumulated in the audio buffer 44.

そして、中継装置１４と端末１０との間の通信が確立される（すなわち、端末１０が中継装置１４に接続される）と、入力送信部４６は、音声バッファ４４に蓄積されている音声データを中継装置１４に送信する。一般的には例えば、音声バッファ４４に蓄積されている、２秒間の長さの音声を表す音声データが、０．１秒程度で送信される。 Then, when communication between the relay device 14 and the terminal 10 is established (that is, when the terminal 10 is connected to the relay device 14), the input transmission unit 46 transmits the audio data accumulated in the audio buffer 44 to the relay device 14. Typically, for example, audio data accumulated in the audio buffer 44 that represents 2 seconds of speech is transmitted in about 0.1 seconds.

そして、音声バッファ４４に蓄積されている音声データがすべて中継装置１４に送信された後は、入力送信部４６は、端末１０が入力オン状態である間、音声入力受付部４２が受け付ける音声を表す音声データのパケットを中継装置１４にストリーム送信する。この場合、音声データのパケットは、音声バッファ４４に蓄積されることなく中継装置１４に直接リアルタイム送信される。なお、音声データのパケットに、翻訳前言語を示す翻訳前言語データと、翻訳後言語を示す翻訳後言語データと、が含まれていてもよい。 After all the audio data accumulated in the audio buffer 44 has been transmitted to the relay device 14, the input transmission unit 46 expresses the audio received by the audio input reception unit 42 while the terminal 10 is in the input ON state. Packets of audio data are streamed to the relay device 14 . In this case, the audio data packets are directly transmitted to the relay device 14 in real time without being accumulated in the audio buffer 44 . Note that the voice data packet may include pre-translation language data indicating the pre-translation language and post-translation language data indicating the post-translation language.
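
The buffering behaviour described above (accumulate audio until the connection is established, flush the backlog, then stream directly) can be sketched as follows. This is a minimal illustration, not the implementation in the specification: the class `InputSender`, its method names, and the `send` callable are all hypothetical stand-ins for the input transmission unit 46, the audio buffer 44, and the link to the relay device 14.

```python
from collections import deque
from typing import Callable


class InputSender:
    """Hypothetical model of the input transmission unit 46 and audio buffer 44."""

    def __init__(self, send: Callable[[bytes], None]) -> None:
        self.send = send              # delivers one chunk to the relay device
        self.connected = False        # becomes True once communication is established
        self.buffer: deque = deque()  # stands in for the audio buffer 44

    def on_audio(self, chunk: bytes) -> None:
        """Handle one chunk of microphone audio while in the input-on state."""
        if self.connected:
            self.send(chunk)           # streamed directly, without buffering
        else:
            self.buffer.append(chunk)  # held until the connection is established

    def on_connected(self) -> None:
        """Flush the backlog in arrival order once the connection is established."""
        self.connected = True
        while self.buffer:
            self.send(self.buffer.popleft())


sent = []
sender = InputSender(sent.append)
sender.on_audio(b"a1")   # spoken before the connection exists: buffered
sender.on_audio(b"a2")
sender.on_connected()    # backlog is flushed
sender.on_audio(b"a3")   # streamed directly
print(sent)
```

Because the backlog is sent in arrival order before any directly streamed chunk, the relay device receives the audio in the order it was spoken.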

中継装置１４の入力中継部８０は、本実施形態では例えば、入力送信部４６から送信される音声データを受け付ける。そして、入力中継部８０は、受け付けた音声データを、音声処理システム１６の音声認識部９０に送信する。例えば、入力中継部８０は、入力送信部４６からストリーム送信される音声データのパケットを受信して、当該パケットを音声処理システム１６の音声認識部９０に送信する。 The input relay unit 80 of the relay device 14 receives, for example, audio data transmitted from the input transmission unit 46 in this embodiment. The input relay unit 80 then transmits the received voice data to the voice recognition unit 90 of the voice processing system 16 . For example, the input relay unit 80 receives packets of audio data streamed from the input transmission unit 46 and transmits the packets to the audio recognition unit 90 of the audio processing system 16 .

なお、本実施形態において、音声処理システム１６が、それぞれ異なる言語に対応付けられる複数の音声認識部９０を含んでいてもよい。そして、入力中継部８０は、受け付けた音声データを、翻訳後言語に対応付けられる音声認識部９０に送信してもよい。 In addition, in this embodiment, the speech processing system 16 may include a plurality of speech recognition units 90 each associated with a different language. Then, the input relay unit 80 may transmit the received voice data to the voice recognition unit 90 associated with the translated language.

なお、本実施形態では、入力中継部８０は、入力送信部４６から送信されるパケットを受け付けると、当該パケットを、一旦、音声バッファ８２に記憶させる。そして、入力中継部８０は、音声バッファ８２に記憶されたパケットを音声処理システム１６の音声認識部９０に送信する。このようにすることで、音声処理システム１６と中継装置１４との間の通信における通信エラーが発生しても、パケットの送信をリトライすることが可能となる。 In the present embodiment, upon receiving a packet transmitted from the input transmission unit 46, the input relay unit 80 first stores the packet in the audio buffer 82. The input relay unit 80 then transmits the packet stored in the audio buffer 82 to the speech recognition unit 90 of the speech processing system 16. In this way, even if a communication error occurs between the speech processing system 16 and the relay device 14, transmission of the packet can be retried.
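
The store-then-forward behaviour with retry can be sketched as below. The specification states only that packets are stored in the audio buffer 82 so that transmission can be retried; the retry limit, the use of `ConnectionError` to signal a failure, and the removal of a packet from the buffer after successful delivery are assumptions made for this illustration, as are all names.

```python
from typing import Callable, List


class RelayForwarder:
    """Hypothetical model of the input relay unit 80 with the audio buffer 82."""

    def __init__(self, send: Callable[[bytes], None], max_attempts: int = 3) -> None:
        self.send = send                  # delivers a packet to the speech recognition unit
        self.max_attempts = max_attempts  # assumed retry limit
        self.buffer: List[bytes] = []     # stands in for the audio buffer 82

    def forward(self, packet: bytes) -> bool:
        """Store the packet, then try to send it; return True on success."""
        self.buffer.append(packet)
        for _ in range(self.max_attempts):
            try:
                self.send(packet)
            except ConnectionError:
                continue                  # packet is still buffered, so retry
            self.buffer.remove(packet)    # assumption: drop once delivered
            return True
        return False                      # still buffered for a later retry


attempts = []


def flaky_send(packet: bytes) -> None:
    """Simulated link that fails on the first attempt only."""
    attempts.append(packet)
    if len(attempts) == 1:
        raise ConnectionError("simulated communication error")


forwarder = RelayForwarder(flaky_send)
delivered = forwarder.forward(b"p1")
print(delivered, len(attempts))
```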

音声処理システム１６の音声認識部９０は、本実施形態では例えば、中継装置１４の入力中継部８０から送信される音声データのパケットを受信する。 The speech recognition unit 90 of the speech processing system 16 receives packets of speech data transmitted from, for example, the input relay unit 80 of the relay device 14 in this embodiment.

そして、音声処理システム１６の音声認識部９０は、本実施形態では例えば、受信する音声データが表す音声に対して音声認識処理を実行して、当該音声の音声認識結果を表す音声認識結果文字列を生成する。ここで、例えば、音声認識部９０が音声データのパケットを受信する度に、逐次、端末１０が中継装置１４に接続されてから当該パケットを受信するまでに受信した音声データに対して音声認識処理が実行され、音声認識結果文字列が生成されるようにしてもよい。 Then, in the present embodiment, for example, the speech recognition unit 90 of the speech processing system 16 executes speech recognition processing on the speech represented by the received voice data and generates a speech recognition result character string representing the speech recognition result of that speech. Here, for example, each time the speech recognition unit 90 receives a packet of voice data, speech recognition processing may be executed on all of the voice data received from when the terminal 10 was connected to the relay device 14 until that packet was received, and a speech recognition result character string may be generated.

そして、音声処理システム１６の音声認識部９０は、本実施形態では例えば、音声認識部９０によって生成される音声認識結果文字列を中継装置１４に送信する。ここで、音声認識処理が逐次実行される場合、音声認識結果文字列が生成される度に、生成された音声認識結果文字列が中継装置１４に送信されるようにしてもよい。 Then, the speech recognition unit 90 of the speech processing system 16 transmits, for example, the speech recognition result character string generated by the speech recognition unit 90 to the relay device 14 in this embodiment. Here, when the speech recognition processing is executed sequentially, the generated speech recognition result character string may be transmitted to the relay device 14 each time the speech recognition result character string is generated.
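
The successive recognition described above, in which every new packet triggers recognition over all audio received since the connection was established, can be sketched as follows. The class name and the `recognize` callable are hypothetical, and `fake_recognize` merely decodes bytes as text so that the example runs without a real speech recognizer.

```python
from typing import Callable


class IncrementalRecognizer:
    """Hypothetical model of the successive processing in the speech recognition unit 90."""

    def __init__(self, recognize: Callable[[bytes], str]) -> None:
        self.recognize = recognize   # assumed interface: all audio so far -> text
        self.received = bytearray()  # audio received since the connection was established

    def on_packet(self, packet: bytes) -> str:
        """Append the packet and re-run recognition over the cumulative audio."""
        self.received.extend(packet)
        return self.recognize(bytes(self.received))


def fake_recognize(audio: bytes) -> str:
    """Stand-in recognizer for the demo: treats the audio bytes as ASCII text."""
    return audio.decode("ascii")


recognizer = IncrementalRecognizer(fake_recognize)
results = [recognizer.on_packet(p) for p in (b"good ", b"morning")]
print(results)
```

Each result replaces the previous one, which matches the description of the displayed speech recognition result character string being updated every time recognition is executed.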

中継装置１４の文字列中継部８４は、本実施形態では例えば、上述の音声認識結果文字列を受信する。 The character string relay unit 84 of the relay device 14 receives, for example, the speech recognition result character string described above in this embodiment.

そして、端末１０が入力オン状態から入力オフ状態に変化したことに応じて、入力送信部４６は、確定要求を中継装置１４に送信する。なお、入力オン状態から入力オフ状態に変化した際に音声バッファ４４に蓄積されている音声データが存在する場合は、入力送信部４６は、音声バッファ４４に蓄積されている音声データを中継装置１４に送信してから、確定要求を中継装置１４に送信する。また、入力オン状態から入力オフ状態に変化した際に音声バッファ４４に蓄積されている音声データが存在しない場合は、入力送信部４６は、直ちに確定要求を中継装置１４に送信する。一般的には、入力オン状態から入力オフ状態に変化した際には、音声バッファ４４に蓄積されている音声データが存在しないことが多く、入力オン状態から入力オフ状態に変化したタイミングには、ほぼすべての音声データが送信済の状態となる。 Then, in response to the terminal 10 changing from the input-on state to the input-off state, the input transmission unit 46 transmits a confirmation request to the relay device 14. If audio data remains accumulated in the audio buffer 44 when the state changes from input-on to input-off, the input transmission unit 46 first transmits the audio data accumulated in the audio buffer 44 to the relay device 14 and then transmits the confirmation request to the relay device 14. If no audio data is accumulated in the audio buffer 44 when the state changes from input-on to input-off, the input transmission unit 46 transmits the confirmation request to the relay device 14 immediately. In general, the audio buffer 44 often holds no audio data when the state changes from input-on to input-off, so that at the time of the change almost all of the audio data has already been transmitted.
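
The ordering rule above (send any remaining buffered audio first, then the confirmation request) can be sketched as below. The `CONFIRM` marker, the function name, and the list-based buffer are hypothetical; the specification does not define a message format for the confirmation request.

```python
from typing import Callable, List, Union

CONFIRM = "CONFIRM"  # hypothetical marker standing in for the confirmation request


def on_input_off(buffered_chunks: List[bytes],
                 send: Callable[[Union[bytes, str]], None]) -> None:
    """Handle the change from the input-on state to the input-off state."""
    while buffered_chunks:
        send(buffered_chunks.pop(0))  # remaining audio goes out first, in order
    send(CONFIRM)                     # the confirmation request is always sent last


outbox_with_backlog = []
on_input_off([b"tail"], outbox_with_backlog.append)

outbox_empty_buffer = []
on_input_off([], outbox_empty_buffer.append)   # nothing buffered: sent immediately

print(outbox_with_backlog, outbox_empty_buffer)
```

Sending the confirmation request last means the relay device can treat everything received before it as belonging to the utterance to be translated.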

なお、本実施形態において、所定時間（例えば、３０秒）にわたって端末１０に音声が入力された際には、そのタイミングで音声の受付を終了し、確定要求が送信されるようにしてもよい。 In this embodiment, when voice is input to the terminal 10 for a predetermined period of time (for example, 30 seconds), acceptance of voice may be terminated at that timing and a confirmation request may be transmitted.
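
The time limit mentioned above can be expressed as a simple check. The 30-second value is the example given in the text; the function name, the use of plain timestamps in seconds, and the choice of `>=` at the boundary are assumptions for illustration.

```python
MAX_INPUT_SECONDS = 30.0  # example value from the text


def should_force_confirmation(input_started_at: float, now: float,
                              limit: float = MAX_INPUT_SECONDS) -> bool:
    """Return True when voice input has lasted for the predetermined time.

    When this returns True, the terminal would stop accepting speech and
    transmit the confirmation request even though the button is still held.
    """
    return now - input_started_at >= limit


print(should_force_confirmation(0.0, 12.5))
print(should_force_confirmation(0.0, 30.0))
```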

中継装置１４の入力中継部８０は、本実施形態では例えば、発話者により行われる所定の操作（ここでは例えば、翻訳ボタン１０ｄａから指を離す操作）に応じて出力される確定要求を受け付ける。例えば、中継装置１４の入力中継部８０は、発話者が翻訳ボタン１０ｄａから指を離す操作を行うことにより入力送信部４６から送信される確定要求を受け付ける。 In the present embodiment, the input relay unit 80 of the relay device 14 receives, for example, a confirmation request output in response to a predetermined operation performed by the speaker (here, for example, an operation of releasing the translation button 10da). For example, the input relay unit 80 of the relay device 14 receives a confirmation request transmitted from the input transmission unit 46 when the speaker releases the translation button 10da.

中継装置１４の文字列中継部８４は、本実施形態では例えば、入力中継部８０による確定要求の受付をトリガとして、当該確定要求の受付までに受け付けた音声データが表す音声の翻訳が開始されるよう制御する。例えば、中継装置１４の文字列中継部８４は、入力中継部８０が確定要求を受信したことに応じて、端末１０が中継装置１４に接続されてから確定要求を受信するまでに受信した音声データが表す音声の音声認識結果を表す音声認識文字列を音声処理システム１６の翻訳部９２に送信する。 In the present embodiment, for example, the character string relay unit 84 of the relay device 14 performs control so that, triggered by the reception of the confirmation request by the input relay unit 80, translation of the speech represented by the voice data received up to the reception of that confirmation request is started. For example, in response to the input relay unit 80 receiving the confirmation request, the character string relay unit 84 of the relay device 14 transmits, to the translation unit 92 of the speech processing system 16, the speech recognition character string representing the speech recognition result of the speech represented by the voice data received from when the terminal 10 was connected to the relay device 14 until the confirmation request was received.
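
The trigger described above can be sketched as follows: intermediate recognition results are only remembered, and the translation unit is called once, when the confirmation request arrives. This is a simplified illustration. The class and method names are hypothetical, the `translate` callable stands in for the translation unit 92, and the sketch assumes the most recently received recognition result already covers all audio up to the confirmation request.

```python
from typing import Callable, List


class StringRelay:
    """Hypothetical model of the character string relay unit 84."""

    def __init__(self, translate: Callable[[str], str]) -> None:
        self.translate = translate    # stands in for the translation unit 92
        self.latest_recognition = ""  # most recent speech recognition result

    def on_recognition_result(self, text: str) -> None:
        """Remember the newest recognition result; no translation is started."""
        self.latest_recognition = text

    def on_confirmation_request(self) -> str:
        """The confirmation request is the trigger that starts translation."""
        return self.translate(self.latest_recognition)


translation_calls: List[str] = []


def fake_translate(text: str) -> str:
    """Stand-in translator for the demo, backed by a one-entry dictionary."""
    translation_calls.append(text)
    return {"おはよう": "Good morning"}.get(text, text)


relay = StringRelay(fake_translate)
relay.on_recognition_result("おは")       # partial result while the button is held
relay.on_recognition_result("おはよう")   # updated result
translated = relay.on_confirmation_request()
print(translated, translation_calls)
```

Only the final recognition result is translated, so partial results produced while the speaker is still talking never reach the translation unit in this sketch.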

なお、本実施形態において、音声処理システム１６が、それぞれ異なる言語に対応付けられる複数の翻訳部９２を含んでいてもよい。そして、文字列中継部８４は、音声認識文字列を、翻訳後言語に対応付けられる翻訳部９２に送信してもよい。 In addition, in this embodiment, the speech processing system 16 may include a plurality of translation units 92 each associated with a different language. Then, the character string relay unit 84 may transmit the speech recognition character string to the translation unit 92 associated with the post-translation language.

音声処理システム１６の翻訳部９２は、本実施形態では例えば、文字列中継部８４によって送信される音声認識結果文字列を受信する。そして、音声処理システム１６の翻訳部９２は、受信した音声認識結果文字列に対して翻訳処理を実行する。そして、翻訳部９２は、当該翻訳処理の結果を表す翻訳結果文字列を生成する。 The translation unit 92 of the speech processing system 16 receives the speech recognition result character string transmitted by the character string relay unit 84 in this embodiment, for example. Then, the translation unit 92 of the speech processing system 16 translates the received speech recognition result character string. Then, the translation unit 92 generates a translation result character string representing the result of the translation processing.

そして、翻訳部９２は、本実施形態では例えば、上述のようにして生成される翻訳結果文字列を中継装置１４に送信する。 Then, in the present embodiment, for example, the translation unit 92 transmits the translation result character string generated as described above to the relay device 14 .

また、中継装置１４の文字列中継部８４は、本実施形態では例えば、上述の音声データが表す音声の音声認識結果を表す音声認識結果文字列を端末１０の通信部１０ｃ及びクライアント装置１２の通信部１２ｃの両方に送信する。例えば、文字列中継部８４は、音声処理システム１６の音声認識部９０から音声認識結果文字列を受信したことに応じて、当該音声認識文字列を端末１０とクライアント装置１２の両方に送信する。 Further, in the present embodiment, for example, the character string relay unit 84 of the relay device 14 transmits the speech recognition result character string representing the speech recognition result of the speech represented by the above-described voice data to both the communication unit 10c of the terminal 10 and the communication unit 12c of the client device 12. For example, in response to receiving the speech recognition result character string from the speech recognition unit 90 of the speech processing system 16, the character string relay unit 84 transmits that speech recognition character string to both the terminal 10 and the client device 12.

また、中継装置１４の文字列中継部８４は、本実施形態では例えば、上述の音声データが表す音声の翻訳結果を表す翻訳結果文字列を端末１０の通信部１０ｃ及びクライアント装置１２の通信部１２ｃの両方に送信する。例えば、文字列中継部８４は、音声処理システム１６の翻訳部９２から翻訳結果文字列を受信したことに応じて、当該翻訳結果文字列を端末１０とクライアント装置１２の両方に送信する。 Further, in the present embodiment, for example, the character string relay unit 84 of the relay device 14 transmits the translation result character string representing the translation result of the speech represented by the above-described voice data to both the communication unit 10c of the terminal 10 and the communication unit 12c of the client device 12. For example, in response to receiving the translation result character string from the translation unit 92 of the speech processing system 16, the character string relay unit 84 transmits that translation result character string to both the terminal 10 and the client device 12.
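
The fan-out described in the two paragraphs above, in which each string is delivered to both the terminal 10 and the client device 12, can be sketched as below. The message layout (a dictionary with `kind` and `text` keys) and the function name are hypothetical; the specification does not define a wire format.

```python
from typing import Callable, Dict, Iterable, List


def relay_string(kind: str, text: str,
                 destinations: Iterable[Callable[[Dict[str, str]], None]]) -> None:
    """Deliver one recognition or translation string to every destination."""
    message = {"kind": kind, "text": text}
    for deliver in destinations:
        deliver(message)


terminal_inbox: List[Dict[str, str]] = []  # stands in for the communication unit 10c
client_inbox: List[Dict[str, str]] = []    # stands in for the communication unit 12c
destinations = [terminal_inbox.append, client_inbox.append]

relay_string("recognition", "おはよう", destinations)
relay_string("translation", "Good morning", destinations)

print(terminal_inbox == client_inbox, len(client_inbox))
```

Since both devices receive identical strings, the text shown on the touch panel 10f always matches the text superimposed on the teleconference screen, which is what lets the speaker check the result on the terminal.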

端末１０の文字列受信部４８は、本実施形態では例えば、中継装置１４から送信される音声認識結果文字列を受信する。 The character string receiving unit 48 of the terminal 10 receives, for example, the speech recognition result character string transmitted from the relay device 14 in this embodiment.

また、端末１０の文字列受信部４８は、本実施形態では例えば、中継装置１４から送信される翻訳結果文字列を受信する。 Further, the character string receiving unit 48 of the terminal 10 receives, for example, the translation result character string transmitted from the relay device 14 in this embodiment.

端末１０の表示制御部５０は、例えば、文字列受信部４８が受信する音声認識結果文字列を端末１０の表示部（例えばタッチパネル１０ｆ）に表示させる。また、表示制御部５０は、例えば、文字列受信部４８が受信する翻訳結果文字列を端末１０の表示部（例えばタッチパネル１０ｆ）に表示させる。 The display control unit 50 of the terminal 10 causes the display unit (for example, the touch panel 10f) of the terminal 10 to display the voice recognition result character string received by the character string receiving unit 48, for example. The display control unit 50 also causes the display unit (for example, the touch panel 10f) of the terminal 10 to display the translation result character string received by the character string receiving unit 48, for example.

ここで、図７に示すように、表示制御部５０が、文字列受信部４８が受信する音声認識結果文字列及び翻訳結果文字列の両方が配置された画像である翻訳結果画像３６を生成してもよい。そして、表示制御部５０が、翻訳結果画像３６をタッチパネル１０ｆに表示させてもよい。 Here, as shown in FIG. 7, the display control unit 50 generates a translation result image 36, which is an image in which both the speech recognition result character string and the translation result character string received by the character string receiving unit 48 are arranged. may Then, the display control unit 50 may display the translation result image 36 on the touch panel 10f.

なお、本実施形態において、表示制御部５０は、単一色である背景上に当該背景とは異なる色で文字列受信部４８が受信する文字列をタッチパネル１０ｆに表示させてもよい。こうすれば、ユーザが入力する音声の翻訳結果や音声認識結果などを当該ユーザがより的確に把握できることとなる。 In the present embodiment, the display control unit 50 may cause the touch panel 10f to display the character string received by the character string receiving unit 48 in a different color from the single-color background. By doing so, the user can more accurately grasp the result of translation of the voice input by the user, the result of voice recognition, and the like.

クライアント装置１２の音声入力受付部６０は、本実施形態では例えば、マイク１２ｇを介して入力されるユーザの音声を受け付ける。そして、音声入力受付部６０は、入力された音声を表す音声データをテレビ会議クライアント部６８に出力する。 The voice input reception unit 60 of the client device 12 receives the user's voice input via the microphone 12g in this embodiment, for example. Then, the voice input reception unit 60 outputs voice data representing the input voice to the teleconference client unit 68 .

クライアント装置１２の文字列受信部６２は、本実施形態では例えば、中継装置１４から送信される音声認識結果文字列を受信する。 The character string receiving unit 62 of the client device 12 receives, for example, the speech recognition result character string transmitted from the relay device 14 in this embodiment.

また、クライアント装置１２の文字列受信部６２は、本実施形態では例えば、中継装置１４から送信される翻訳結果文字列を受信する。 Further, the character string receiving unit 62 of the client device 12 receives, for example, the translation result character string transmitted from the relay device 14 in this embodiment.

撮影画像取得部６４は、本実施形態では例えば、撮影部１２ｅによって撮影される画像である撮影画像を取得する。 The captured image acquisition unit 64 acquires a captured image, which is an image captured by the imaging unit 12e in this embodiment, for example.

重畳画像生成部６６は、本実施形態では例えば、上述の撮影画像に文字列受信部６２が受信する音声認識結果文字列を重畳させた画像である重畳画像３２を生成する。また、重畳画像生成部６６は、本実施形態では例えば、上述の撮影画像に文字列受信部６２が受信する翻訳結果文字列を重畳させた画像である重畳画像３２を生成する。 In the present embodiment, the superimposed image generation unit 66 generates, for example, the superimposed image 32, which is an image obtained by superimposing the speech recognition result character string received by the character string receiving unit 62 on the above-described captured image. In addition, in the present embodiment, the superimposed image generation unit 66 generates, for example, the superimposed image 32 which is an image in which the translation result character string received by the character string receiving unit 62 is superimposed on the above-mentioned photographed image.

ここで、図６に示すように、重畳画像生成部６６が、上述の撮影画像に文字列受信部６２が受信する翻訳結果文字列及び音声認識結果文字列の両方を重畳させた画像である重畳画像３２を生成してもよい。 Here, as shown in FIG. 6, the superimposed image generation unit 66 superimposes both the translation result character string and the voice recognition result character string received by the character string reception unit 62 on the above-described photographed image. An image 32 may be generated.

そして、重畳画像生成部６６は、本実施形態では例えば、生成される重畳画像３２をテレビ会議クライアント部６８に出力する。 Then, the superimposed image generation unit 66 outputs the generated superimposed image 32 to the videoconference client unit 68 in this embodiment, for example.

クライアント装置１２のテレビ会議クライアント部６８は、本実施形態では例えば、テレビ会議システム１８と連携して、テレビ会議に係る各種の処理を実行する。 In this embodiment, for example, the teleconference client unit 68 of the client device 12 cooperates with the teleconference system 18 to execute various processes related to the teleconference.

テレビ会議クライアント部６８は、例えば、上述の撮影画像に文字列受信部６２が受信する文字列を重畳させた重畳画像３２をテレビ会議システム１８に出力してもよい。例えば、テレビ会議クライアント部６８は、重畳画像生成部６６から受け付ける重畳画像３２をテレビ会議システム１８に出力してもよい。 The videoconference client unit 68 may output, to the videoconference system 18, a superimposed image 32 in which the character string received by the character string receiving unit 62 is superimposed on the above-described photographed image, for example. For example, the videoconference client unit 68 may output the superimposed image 32 received from the superimposed image generator 66 to the videoconference system 18 .

また、テレビ会議クライアント部６８は、例えば、音声入力受付部６０から受け付ける音声データをテレビ会議システム１８に出力してもよい。 Further, the teleconference client unit 68 may output voice data received from the voice input receiving unit 60 to the teleconference system 18, for example.

そして、テレビ会議クライアント部６８は、本実施形態では例えば、テレビ会議システム１８によって生成される、図４及び図６に示されているテレビ会議画面３０を表示制御部７２に出力する。 The videoconference client unit 68 outputs the videoconference screen 30 shown in FIGS. 4 and 6, which is generated by the videoconference system 18 in this embodiment, to the display control unit 72, for example.

In the present embodiment, the video conference client unit 68 also outputs, for example, voice data that is generated by the video conference system 18 and represents the voice of a speaker in the video conference to the voice output control unit 70.

In the present embodiment, the voice output control unit 70 of the client device 12 causes, for example, the speaker 12h to output the voice represented by the voice data received from the video conference client unit 68.

In the present embodiment, the display control unit 72 of the client device 12 causes, for example, the display 12f to display a screen in which is arranged an image obtained by superimposing, on the image captured by the photographing unit 12e, a character string representing the speech recognition result of the voice represented by the voice data. Here, before the confirmation request is received, the display control unit 72 may cause the display 12f to display a screen in which is arranged an image obtained by superimposing, on the image captured by the photographing unit 12e, a character string representing the speech recognition result of the voice represented by the voice data received so far. For example, the display control unit 72 of the client device 12 causes the display 12f of the client device 12 to display a screen in which is arranged an image obtained by superimposing the speech recognition result character string received by the character string reception unit 62 on the captured image.

In the present embodiment, the display control unit 72 also causes, for example, the display 12f to display a screen in which is arranged an image obtained by superimposing, on the image captured by the photographing unit 12e, a character string representing the translation result of the voice represented by the voice data received before the reception of the confirmation request. For example, the display control unit 72 of the client device 12 causes the display 12f of the client device 12 to display a screen in which is arranged an image obtained by superimposing the translation result character string received by the character string reception unit 62 on the captured image.

Here, as shown in FIG. 6, the display control unit 72 may cause the display 12f to display the video conference screen 30 in which is arranged the superimposed image 32 obtained by superimposing both the translation result character string and the speech recognition result character string received by the character string reception unit 62 on the captured image.

The display control unit 72 may also cause the display 12f to display a screen generated by the video conference system 18. For example, the display control unit 72 may cause the display 12f to display the video conference screen 30 received from the video conference client unit 68.

An example of the flow of the voice data relay processing executed by the relay device 14 will now be described with reference to the flowchart shown in FIG. 9.

In this processing example, the input relay unit 80 monitors for reception of a communication start request transmitted from the input transmission unit 46 of the terminal 10 (S101).

When the input relay unit 80 receives the communication start request from the input transmission unit 46 of the terminal 10, the input relay unit 80 establishes communication between the relay device 14 and the terminal 10 (S102).

The input relay unit 80 then monitors for reception of voice data packets (S103). When the input relay unit 80 receives a voice data packet, it stores the received packet in the voice buffer 82 (S104).

The input relay unit 80 then transmits the packet stored in the voice buffer 82 in the process shown in S104 to the speech recognition unit 90 of the voice processing system 16 (S105), and returns to the process shown in S103.

The processes shown in S103 to S105 continue until the process shown in S207, described later, is executed.
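
As a non-limiting illustration, the loop of S103 to S105 might be sketched in Python as follows. The function name `relay_audio`, the `packets` iterable (a stand-in for the established connection, yielding `None` when no packet arrived in a polling cycle), and the `send_to_recognizer` callback are hypothetical names introduced only for this sketch; they do not appear in the embodiment.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Optional


def relay_audio(
    packets: Iterable[Optional[bytes]],
    send_to_recognizer: Callable[[bytes], None],
) -> Deque[bytes]:
    """Relay voice data packets, loosely following S103 to S105.

    The iterable ending stands in for the disconnection performed in S207.
    """
    voice_buffer: Deque[bytes] = deque()  # plays the role of voice buffer 82
    for packet in packets:                # S103: monitor for voice data packets
        if packet is None:                # nothing received in this cycle
            continue
        voice_buffer.append(packet)       # S104: store the packet in the buffer
        send_to_recognizer(packet)        # S105: forward to speech recognition
    return voice_buffer
```

Keeping the packets in a buffer as well as forwarding them reflects the description above; whether the buffer is later drained or reused is not stated here and is left open in the sketch.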

Next, an example of the flow of the character string relay processing executed by the relay device 14 will be described with reference to the flowchart shown in FIG. 10.

In this processing example, the character string relay unit 84 monitors for reception of a speech recognition result character string transmitted from the speech recognition unit 90 of the voice processing system 16 (S201). When the character string relay unit 84 receives a speech recognition result character string, it transmits the received speech recognition result character string to the character string reception unit 62 of the client device 12 (S202).

The character string relay unit 84 then checks whether the input relay unit 80 has received the confirmation request (S203). If reception of the confirmation request is not confirmed (S203: N), the processing returns to S201. If reception of the confirmation request is confirmed (S203: Y), the character string relay unit 84 transmits, to the translation unit 92 of the voice processing system 16, the speech recognition result character string representing the speech recognition result of the voice represented by the voice data received before the reception of the confirmation request (S204).

The character string relay unit 84 then receives, from the translation unit 92 of the voice processing system 16, the translation result character string obtained by translating the speech recognition result character string transmitted in the process shown in S204 (S205).

The character string relay unit 84 then transmits, to the character string reception unit 62 of the client device 12, a confirmation flag, the translation result character string received in the process shown in S205, and the speech recognition result character string representing the speech recognition result of the voice represented by the voice data received before the reception of the confirmation request (S206).

The character string relay unit 84 then disconnects the communication between the relay device 14 and the terminal 10 (S207), and the processing shown in this processing example ends. Executing the process shown in S207 also ends the processes shown in S103 to S105.
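
The flow of S201 to S207 might be sketched as follows, under simplifying assumptions. Each item of `events` pairs a newly received speech recognition result character string with a Boolean indicating whether the confirmation request has been received by that point; `translate` stands in for the translation unit 92, and `send_to_client` for transmission to the character string reception unit 62. These names and the `ConfirmedResult` type are hypothetical and are used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class ConfirmedResult:
    confirmed: bool    # the confirmation flag
    translation: str   # translation result character string
    recognition: str   # speech recognition result character string


def relay_strings(
    events: Iterable[Tuple[str, bool]],
    translate: Callable[[str], str],
    send_to_client: Callable[[object], None],
) -> Optional[ConfirmedResult]:
    for recognition, confirmation_received in events:  # S201
        send_to_client(recognition)                    # S202: relay interim text
        if not confirmation_received:                  # S203: N, keep monitoring
            continue
        translation = translate(recognition)           # S204 and S205
        result = ConfirmedResult(True, translation, recognition)
        send_to_client(result)                         # S206
        return result                                  # S207: stop relaying
    return None
```

The point of the sketch is that translation is requested exactly once, when the confirmation request is observed, rather than for every interim recognition result.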

Next, an example of the flow of the processing for generating the superimposed image 32 executed by the client device 12 will be described with reference to the flowchart shown in FIG. 11. In this processing example, the processes shown in S301 to S305 below are executed repeatedly at the frame rate at which the photographing unit 12e captures images. In the present embodiment, for example, the processes shown in S301 to S305 may be executed at intervals of 1/30 second. The execution interval of the processes shown in S301 to S305 may be longer or shorter than 1/30 second. The execution interval may also be adjustable by the user.

First, the captured image acquisition unit 64 acquires the captured image for the current frame (S301).

The superimposed image generation unit 66 then checks whether the character string reception unit 62 has received the confirmation flag since the process shown in S202 was last executed (S302).

If it is confirmed that the confirmation flag has not been received (S302: N), the superimposed image generation unit 66 generates the superimposed image 32 in which the latest speech recognition result character string received by the character string reception unit 62 is superimposed on the captured image acquired in the process shown in S301 (S303).

If it is confirmed that the confirmation flag has been received (S302: Y), the superimposed image generation unit 66 generates the superimposed image 32 in which the latest speech recognition result character string and the latest translation result character string received by the character string reception unit 62 are superimposed on the captured image acquired in the process shown in S301 (S304).

The superimposed image generation unit 66 then outputs the superimposed image 32 generated in the process shown in S303 or S304 to the video conference client unit 68 (S305), and returns to the process shown in S301.
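
The per-frame branch of S302 to S304 might be sketched as follows. To keep the sketch free of any image library, the superimposed image is modeled as a simple record of the frame and the strings to be drawn on it; `ClientState`, `SuperimposedImage`, and `compose_frame` are hypothetical names and not part of the embodiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClientState:
    latest_recognition: str = ""  # latest speech recognition result string
    latest_translation: str = ""  # latest translation result string
    confirmed: bool = False       # True once the confirmation flag arrived


@dataclass
class SuperimposedImage:
    frame: bytes                  # captured image acquired in S301
    recognition: str              # always superimposed in this sketch
    translation: Optional[str]    # superimposed only after confirmation


def compose_frame(frame: bytes, state: ClientState) -> SuperimposedImage:
    if state.confirmed:                                   # S302: Y
        return SuperimposedImage(                         # S304
            frame, state.latest_recognition, state.latest_translation
        )
    return SuperimposedImage(frame, state.latest_recognition, None)  # S303
```

In a real client this function would be called once per captured frame, for example every 1/30 second, and its result passed on as in S305.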

In the present embodiment, the displayable area within the captured image may be settable by the user. For example, the displayable area may be selectable from among the upper part, the lower part, the entire image, and the like. The displayable area for the speech recognition result character string and the displayable area for the translation result character string may also be settable separately. For example, FIGS. 4 and 6 show examples of the video conference screen 30 when the lower part is set as the displayable area for the speech recognition result character string. FIG. 6 also shows an example of the video conference screen 30 when the entire image is set as the displayable area for the translation result character string.

In the present embodiment, a character string in a language that has no spaces separating words, such as Japanese, may be broken into lines at a predetermined number of characters. A character string in a language that has spaces separating words, such as English, may be word-wrapped at a predetermined number of characters.
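
The two line-breaking behaviors can be illustrated with a short Python sketch. The function name `wrap_caption` and the `space_delimited` parameter are assumptions made for this example; the standard-library `textwrap.wrap` is used here as one possible word-wrap implementation, not as the one used in the embodiment.

```python
import textwrap
from typing import List


def wrap_caption(text: str, max_chars: int, space_delimited: bool) -> List[str]:
    """Split a caption into lines of at most `max_chars` characters."""
    if space_delimited:
        # Languages such as English: break only at spaces (word wrap).
        return textwrap.wrap(text, width=max_chars)
    # Languages such as Japanese: break after a fixed number of characters.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

For instance, `wrap_caption("the quick brown fox", 10, True)` keeps each word intact, whereas `wrap_caption("こんにちは世界", 3, False)` simply cuts every three characters.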

To improve readability, the character size of the translation result character string may be larger than the character size of the speech recognition result character string.

In the present embodiment, the translation result character string and the speech recognition result character string need not both be superimposed on the captured image. For example, the speech recognition result character string may be omitted from the captured image when the translation result character string is superimposed on it.

In the present embodiment, the character size of the speech recognition result character string may be fixed, and the character size of the translation result character string may be variable.

In this case, the maximum character size of the characters included in the translation result character string may be the height of the screen multiplied by a predetermined ratio. The character size of the translation result character string may then be reduced as the number of characters in one line increases.
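
One way such a sizing rule could look is sketched below. The text only states that the maximum size is the screen height multiplied by a predetermined ratio and that the size decreases as the line grows; the inverse-proportional scaling, the default ratio of 0.125, the threshold `full_size_chars`, and the lower bound `min_size` are all assumptions chosen for illustration.

```python
def translation_font_size(
    screen_height: float,
    chars_per_line: int,
    max_ratio: float = 0.125,    # assumed "predetermined ratio"
    full_size_chars: int = 10,   # assumed line length that still fits at max size
    min_size: float = 12.0,      # assumed lower bound to keep text legible
) -> float:
    """Return a character size that shrinks as a line gets longer."""
    max_size = screen_height * max_ratio
    if chars_per_line <= full_size_chars:
        return max_size
    # Scale inversely with line length so the line width stays roughly constant.
    scaled = max_size * full_size_chars / chars_per_line
    return max(scaled, min_size)
```

With a 720-pixel-high screen, a line of up to 10 characters is drawn at the maximum size, and a line of 20 characters at half that size.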

Note that the character size of the speech recognition result character string may instead be variable, and the character size of the translation result character string may instead be fixed.

In the present embodiment, the number of displayable characters corresponding to the size of the displayable area may be predetermined. When a speech recognition result character string with more characters than the number of displayable characters is to be superimposed on the captured image, the speech recognition result character string may be reduced in size so as to fit within the height of the displayable area before being superimposed on the captured image. Likewise, when a translation result character string with more characters than the number of displayable characters is to be superimposed on the captured image, the translation result character string may be reduced in size so as to fit within the height of the displayable area before being superimposed on the captured image.
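
The text does not specify how the reduction factor is computed. The sketch below makes one plausible assumption: shrinking the characters by a factor s allows roughly 1/s times as many characters per line and 1/s times as many lines, so the capacity of the area grows by 1/s², which gives s = sqrt(displayable characters / actual characters). The function name `fit_scale` is hypothetical.

```python
import math


def fit_scale(num_chars: int, displayable_chars: int) -> float:
    """Return the factor by which to scale the character size (1.0 = unchanged)."""
    if num_chars <= displayable_chars:
        return 1.0  # the string already fits at the normal size
    # Capacity grows with 1 / scale**2, so solve for the scale that just fits.
    return math.sqrt(displayable_chars / num_chars)
```

Under this assumption, a string four times longer than the displayable character count would be drawn at half the normal character size.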

In the present embodiment, the character string relay unit 84 may also perform control so that translation of the voice represented by the voice data received so far is started when reception of voice data packets by the input relay unit 80 has been interrupted for a predetermined time (for example, 1.5 seconds). For example, in response to reception of voice data packets by the input relay unit 80 being interrupted for the predetermined time (for example, 1.5 seconds), the character string relay unit 84 of the relay device 14 may transmit, to the translation unit 92 of the voice processing system 16, the speech recognition result character string representing the speech recognition result of the voice represented by the voice data received from when the terminal 10 connected to the relay device 14 up to the present.
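
A silence-based trigger of this kind might be sketched as follows. The class name `SilenceTrigger` and its methods are hypothetical; timestamps are passed in explicitly (in seconds) so that the logic does not depend on a particular clock, and the 1.5-second default mirrors the example value given above.

```python
from typing import Optional


class SilenceTrigger:
    """Fires once when no voice data packet has arrived for `timeout` seconds."""

    def __init__(self, timeout: float = 1.5) -> None:
        self.timeout = timeout
        self.last_packet_at: Optional[float] = None
        self.fired = False

    def on_packet(self, now: float) -> None:
        # A new packet resets the silence window and re-arms the trigger.
        self.last_packet_at = now
        self.fired = False

    def poll(self, now: float) -> bool:
        # Nothing received yet, or already fired for this silence period.
        if self.last_packet_at is None or self.fired:
            return False
        if now - self.last_packet_at >= self.timeout:
            self.fired = True
            return True   # caller would start translation here
        return False
```

Firing only once per silence period avoids requesting the same translation repeatedly while the speaker remains silent; that behavior is a design choice of this sketch rather than something stated in the text.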

A list (log) of the speech recognition result character strings and the translation result character strings may also be displayed on a screen other than the video conference screen 30 (for example, in a browser). This log may be savable to a storage medium such as the storage unit 12b of the client device 12. In addition, a translation result character string obtained by translating the speech recognition result character string into a language different from the post-translation language described above may be displayed in the browser.

The functions of the terminal 10 of the video conference translation system 1 may also be implemented in the client device 12.

For example, as shown in FIG. 12, the client device 12 may have a function of displaying a translation button 94 on the display 12f, so that the translation button 94 is displayed on the display 12f in addition to the video conference screen 30. Each time the speaker performs a predetermined operation, such as a click, on the translation button 94, the client device 12 may switch between the input-on state and the input-off state described above. The speech recognition result character string and the translation result character string for the voice input while in the input-on state may then be displayed on the video conference screen 30.

FIG. 13 is a functional block diagram showing an example of the functions implemented in the client device 12 according to a modification of the embodiment described with reference to FIGS. 1 to 11. The client device 12 according to the present embodiment need not implement all of the functions shown in FIG. 13, and may implement functions other than those shown in FIG. 13.

As shown in FIG. 13, the client device 12 according to this modification functionally includes, for example, an operation input reception unit 40, a voice buffer 44, an input transmission unit 46, a voice input reception unit 60, a character string reception unit 62, a captured image acquisition unit 64, a superimposed image generation unit 66, a video conference client unit 68, a voice output control unit 70, and a display control unit 72. The operation input reception unit 40 is implemented mainly by the processor 12a and the operation unit 12d. The voice buffer 44 is implemented mainly by the storage unit 12b. The input transmission unit 46 and the character string reception unit 62 are implemented mainly by the communication unit 12c. The voice input reception unit 60 is implemented mainly by the processor 12a and the microphone 12g. The captured image acquisition unit 64 is implemented mainly by the processor 12a and the photographing unit 12e. The superimposed image generation unit 66 is implemented mainly by the processor 12a. The video conference client unit 68 is implemented mainly by the processor 12a and the communication unit 12c. The voice output control unit 70 is implemented mainly by the processor 12a and the speaker 12h. The display control unit 72 is implemented mainly by the processor 12a and the display 12f.

In the present embodiment, the operation input reception unit 40 causes, for example, the display 12f to display the translation button 94. The operation input reception unit 40 also receives, for example, an operation input such as a click on the translation button 94.

In the present embodiment, the voice input reception unit 60 receives, for example, the user's voice input via the microphone 12g. The voice input reception unit 60 then outputs voice data representing the input voice to the video conference client unit 68.

In the present embodiment, for example, in response to the client device 12 changing from the input-off state to the input-on state, the input transmission unit 46 transmits a communication start request to the relay device 14. Voice data representing the voice input via the microphone 12g between the time the client device 12 changes from the input-off state to the input-on state and the time communication between the relay device 14 and the terminal 10 is established is not only output to the video conference client unit 68 but also accumulated in the voice buffer 44.

In response to the client device 12 changing from the input-on state to the input-off state, the input transmission unit 46 transmits a confirmation request to the relay device 14.
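
The relationship between the button operation, the input state, and the two requests can be sketched as a small state machine. The class name `InputToggle`, the `send` callback (standing in for the input transmission unit 46), and the message strings are hypothetical and serve only to illustrate the transitions described above.

```python
from typing import Callable


class InputToggle:
    """Toggles between input-off and input-on on each button operation."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self.input_on = False   # the device starts in the input-off state
        self.send = send

    def click(self) -> None:
        self.input_on = not self.input_on
        if self.input_on:
            # Off -> on: ask the relay device to open the connection.
            self.send("communication_start_request")
        else:
            # On -> off: tell the relay device the utterance is complete.
            self.send("confirmation_request")
```

In this sketch a pair of clicks brackets one utterance: the first opens the connection and the second triggers the translation of everything spoken in between.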

The other functions of the voice buffer 44 and the input transmission unit 46 are the same as those described above with reference to FIG. 8A, so their description is omitted. Likewise, the functions of the character string reception unit 62, the captured image acquisition unit 64, the superimposed image generation unit 66, the video conference client unit 68, the voice output control unit 70, and the display control unit 72 are the same as those described above with reference to FIG. 8B, so their description is omitted. In this modification, the relay device 14 does not transmit character strings to the terminal 10.

As in the example shown in FIGS. 12 and 13, the input relay unit 80 may receive, from the client device 12, voice data representing the voice input to the client device 12 by the speaker. The input relay unit 80 may also receive a confirmation request transmitted from the client device 12 in response to a predetermined operation performed on the client device 12 by the speaker.

The display control unit 72 may then cause the display 12f of the client device 12 to display a screen in which is arranged an image obtained by superimposing, on the image captured by the photographing unit 12e, a character string representing the translation result of the voice represented by the voice data received before the reception of the confirmation request.

In the present embodiment, a plurality of languages may be settable as post-translation languages. The character string relay unit 84 may then perform control so that translation of the voice represented by the voice data received before the reception of the confirmation request into each of the set languages is started. In this case, for example, the character string relay unit 84 may transmit the speech recognition result character string to a plurality of translation units 92, each associated with one of the post-translation languages.

The display control unit 72 may then cause the display 12f to display a screen in which is arranged an image obtained by superimposing, on the captured image, the translation result character string for each of the set languages.

For example, a translation result character string obtained by translating the speech recognition result character string into English may be displayed in the lower part of the captured image, while a translation result character string obtained by translating the same speech recognition result character string into Chinese is displayed in the upper part of the captured image.
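
Translation into several languages and placement of each result in its own area might be sketched as follows. The functions `translate_to_all` and `assign_areas`, the language codes, and the area names are hypothetical; each entry of `translators` stands in for a translation unit 92 associated with one post-translation language.

```python
from typing import Callable, Dict


def translate_to_all(
    recognition: str,
    translators: Dict[str, Callable[[str], str]],
) -> Dict[str, str]:
    """Send the same recognition result to every configured translator."""
    return {lang: translate(recognition) for lang, translate in translators.items()}


def assign_areas(
    translations: Dict[str, str],
    areas: Dict[str, str],
) -> Dict[str, str]:
    """Map each translation to the display area configured for its language."""
    return {areas[lang]: text for lang, text in translations.items()}
```

For example, with `areas = {"en": "lower", "zh": "upper"}`, the English result would be placed in the lower part of the captured image and the Chinese result in the upper part, matching the example above.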

Then, once it has been confirmed that the translation result character strings for all of the post-translation languages have been displayed, these translation result character strings may be erased from the screen.

Note that the present invention is not limited to the embodiment described above.

For example, the division of roles among the terminal 10, the client device 12, the relay device 14, the voice processing system 16, and the video conference system 18 is not limited to that described above. For example, the translation processing for the speech recognition result character string may be executed in the voice processing system 16 without the speech recognition result character string passing through the relay device 14.

For example, the client device 12 may receive, from the relay device 14, the voice data transmitted from the terminal 10 to the relay device 14. The client device 12 may then output to the video conference system 18 the voice data received from the relay device 14, instead of voice data representing the voice input via the microphone 12g.

The specific character strings and numerical values given above and in the drawings are examples, and the present invention is not limited to these character strings and numerical values.

1 video conference translation system, 10 terminal, 10a processor, 10b storage unit, 10c communication unit, 10d operation unit, 10da translation button, 10db power button, 10dc volume control unit, 10e photographing unit, 10f touch panel, 10g microphone, 10h speaker, 12 client device, 12a processor, 12b storage unit, 12c communication unit, 12d operation unit, 12e photographing unit, 12f display, 12g microphone, 12h speaker, 14 relay device, 14a processor, 14b storage unit, 14c communication unit, 16 voice processing system, 16a processor, 16b storage unit, 16c communication unit, 18 video conference system, 20 computer network, 30 video conference screen, 32 superimposed image, 34 speech recognition result image, 36 translation result image, 40 operation input reception unit, 42 voice input reception unit, 44 voice buffer, 46 input transmission unit, 48 character string reception unit, 50 display control unit, 60 voice input reception unit, 62 character string reception unit, 64 captured image acquisition unit, 66 superimposed image generation unit, 68 video conference client unit, 70 voice output control unit, 72 display control unit, 80 input relay unit, 82 voice buffer, 84 character string relay unit, 90 speech recognition unit, 92 translation unit, 94 translation button.

Claims

A display control system comprising:
voice data receiving means for receiving voice data representing voice input by a speaker;
confirmation request receiving means for receiving a confirmation request output in response to a predetermined operation performed by the speaker;
translation control means for performing control, triggered by the reception of the confirmation request, so that translation of the voice represented by the voice data received before the reception of the confirmation request is started; and
translation result display control means for causing a display unit to display a screen in which is arranged an image obtained by superimposing, on an image captured by a photographing unit, a character string representing a translation result of the voice represented by the voice data received before the reception of the confirmation request.

The display control system according to claim 1, further comprising
speech recognition result display control means for causing the display unit to display a screen in which is arranged an image obtained by superimposing, on the image captured by the photographing unit, a character string representing a speech recognition result of the voice represented by the voice data,
wherein the speech recognition result display control means causes the display unit to display, before the reception of the confirmation request, a screen in which is arranged an image obtained by superimposing, on the image captured by the photographing unit, a character string representing the speech recognition result of the voice represented by the voice data received so far.

The display control system according to claim 1, wherein
the translation result display control means causes the display unit to display a screen in which is arranged an image obtained by superimposing, on the image captured by the photographing unit, both a character string representing the speech recognition result of the voice represented by the voice data received before the reception of the confirmation request and a character string representing the translation result of the voice represented by that voice data.

The display control system according to any one of claims 1 to 3, further comprising
an image output unit that outputs, to a video conference system, an image obtained by superimposing a character string on the image captured by the photographing unit,
wherein the translation result display control means causes the display unit to display a screen generated by the video conference system.

The display control system according to any one of claims 1 to 4, wherein
the voice data receiving means receives, from a terminal, the voice data representing voice input to the terminal by the speaker,
the confirmation request receiving means receives the confirmation request transmitted from the terminal in response to a predetermined operation performed on the terminal by the speaker,
the translation result display control means causes a display unit of the terminal to display a character string representing the translation result of the voice represented by the voice data received before the reception of the confirmation request, and
the translation result display control means causes a display unit of a client device to display a screen in which is arranged an image obtained by superimposing, on the image captured by the photographing unit, a character string representing the translation result of the voice represented by the voice data received before the reception of the confirmation request.

the voice data receiving means receives from the client device the voice data representing voice input by the speaker to the client device;
the confirmation request receiving means receives the confirmation request transmitted from the client device in response to a predetermined operation performed by the speaker on the client device;
The translation result display control means causes the display unit provided in the client device to display a screen on which is arranged an image obtained by superimposing, on the image captured by the imaging unit, a character string representing the translation result of the voice represented by the voice data received before the acceptance of the confirmation request;
6. The display control system according to any one of claims 1 to 4, characterized in that:

The translation control means performs control such that translation, into a plurality of languages, of the voice represented by the voice data received before the acceptance of the confirmation request is started,
The translation result display control means causes the display unit to display a screen on which is arranged an image obtained by superimposing, on the image captured by the imaging unit, a character string representing the translation result of the voice represented by the voice data for each of the plurality of languages,
7. The display control system according to any one of claims 1 to 6, characterized by:
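
Illustrative note (not part of the claims): the multi-language arrangement above can be sketched as follows. This is a minimal sketch under stated assumptions; `translate_into_languages`, `overlay_captions`, the `translate` callback, and the text stand-in for a captured image are hypothetical names introduced only for illustration and do not come from the specification.

```python
from typing import Callable, Dict, Iterable


def translate_into_languages(
    text: str,
    languages: Iterable[str],
    translate: Callable[[str, str], str],  # hypothetical backend: (text, language) -> translation
) -> Dict[str, str]:
    """Translate one confirmed utterance into each requested language."""
    return {lang: translate(text, lang) for lang in languages}


def overlay_captions(frame: str, captions: Dict[str, str]) -> str:
    """Superimpose one caption line per language on a frame.

    The frame is modelled as a plain string here; a real system would
    draw the character strings onto image pixels instead.
    """
    lines = [frame] + [f"[{lang}] {caption}" for lang, caption in captions.items()]
    return "\n".join(lines)
```

In this sketch each language gets its own caption line, so the screen shown on the display unit carries every translation result at once.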

a step of receiving voice data representing voice input by a speaker;
a step of receiving a confirmation request output in response to a predetermined operation performed by the speaker;
a step of performing control such that, triggered by the acceptance of the confirmation request, translation of the voice represented by the voice data received before the acceptance of the confirmation request is started;
a step of causing a display unit to display a screen on which is arranged an image obtained by superimposing, on an image captured by an imaging unit, a character string representing the translation result of the voice represented by the voice data received before the acceptance of the confirmation request;
A display control method comprising:
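
Illustrative note (not part of the claims): the steps of the method above can be sketched as follows. This is a minimal sketch, assuming voice data arrives in byte chunks and that recognition and translation are supplied as callbacks; `CaptionSession`, `recognize`, `translate`, and `compose_screen` are hypothetical names, and the captured image is modelled as a plain string.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CaptionSession:
    """Buffers voice data; translation starts only on a confirmation request."""

    recognize: Callable[[bytes], str]  # hypothetical speech-to-text backend
    translate: Callable[[str], str]    # hypothetical translation backend
    chunks: List[bytes] = field(default_factory=list, init=False)

    def receive_voice_data(self, chunk: bytes) -> None:
        # Step 1: accept voice data; nothing is translated yet.
        self.chunks.append(chunk)

    def receive_confirmation_request(self) -> str:
        # Steps 2 and 3: the confirmation request is the trigger that starts
        # translation of everything received before it.
        audio = b"".join(self.chunks)
        self.chunks.clear()
        return self.translate(self.recognize(audio))


def compose_screen(frame: str, caption: str) -> str:
    # Step 4: superimpose the translated character string on the captured frame.
    return f"{frame}\n[{caption}]"
```

Clearing the buffer after each confirmation request means the next translation covers only the voice data received since the previous request, which is one possible reading of "received before the acceptance of the confirmation request".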

a procedure for receiving voice data representing voice input by a speaker;
a procedure for receiving a confirmation request output in response to a predetermined operation performed by the speaker;
a procedure for performing control such that, triggered by the acceptance of the confirmation request, translation of the voice represented by the voice data received before the acceptance of the confirmation request is started;
a procedure for causing a display unit to display a screen on which is arranged an image obtained by superimposing, on an image captured by an imaging unit, a character string representing the translation result of the voice represented by the voice data received before the acceptance of the confirmation request,
A program characterized by causing a computer to execute