JP6449181B2

JP6449181B2 - Speech translation system, speech translation method, and speech translation program

Info

Publication number: JP6449181B2
Application number: JP2016017071A
Authority: JP
Inventors: 知高大越
Original assignee: RECRUIT LIFESTYLE CO., LTD.
Current assignee: RECRUIT LIFESTYLE CO., LTD.
Priority date: 2016-02-01
Filing date: 2016-02-01
Publication date: 2019-01-09
Anticipated expiration: 2036-02-01
Also published as: WO2017135214A1; JP2017138650A

Description

The present invention relates to a speech translation system, a speech translation method, and a speech translation program.

Speech translation techniques have been proposed to enable conversation between people who do not understand each other's language, for example between a store clerk (a salesperson at a restaurant or other store) and a customer (such as a tourist from abroad). In these techniques, the speaker's utterance is converted to text, the text is machine-translated into the other party's language, and the result is either displayed on a screen or played back as speech using speech synthesis technology (see, for example, Patent Document 1). Speech translation applications that embody such techniques and run on information terminals such as smartphones have also been put into practical use (see, for example, Non-Patent Document 1).

Meanwhile, an interpreting system that enables telephone calls among a plurality of users is also known (see, for example, Patent Document 2).

Patent Document 1: Japanese Patent Laid-Open No. 9-34895
Patent Document 2: Japanese Patent Laid-Open No. 2010-21692

Non-Patent Document 1: U-STAR Consortium homepage [searched on January 22, 2016], Internet <URL: http://www.ustar-consortium.com/app_ja/app.html>

In the conventional speech translation apparatus described above, when a store clerk at a restaurant asks about a customer's order or explains the ingredients of a dish, the translation engine performs machine translation as soon as speech is input. Consequently, mistranslation tends to become more likely when the input speech does not follow a basic sentence pattern of the language or when the words are spoken in a different order. When the machine translation is inaccurate and the two parties cannot communicate smoothly, the clerk can, for example, telephone an interpreter from the speech translation apparatus that the clerk carries and have the interpreter translate, which allows the two parties to communicate smoothly.

However, to telephone an interpreter while a conventional speech translation apparatus is performing speech translation processing, the user must search the telephone directory, communication history, or the like stored in the apparatus for identification information, such as a telephone number, that identifies the interpreter (the interpreter terminal used by the interpreter). After identifying the telephone number, the user must then perform a separate calling operation, which may increase the burden on the user (the speaker) and reduce convenience.

The present invention has been made in view of these circumstances. Its object is to provide a speech translation system, a speech translation method, and a speech translation program that can reduce the burden on the user and improve convenience, while also preventing mistranslation and realizing smooth communication.

To solve the above problems, a speech translation system according to one aspect of the present invention comprises an information terminal that inputs a user's speech, a server device that translates the content of the speech input to the information terminal, and an interpreter terminal that performs call processing with the information terminal. The server device includes a speech recognition unit that recognizes the content of the speech input to the information terminal, and a translation unit that translates the content recognized by the speech recognition unit into content in a different language. The information terminal includes: a speech output unit that outputs, as speech, the content translated by the translation unit of the server device; a first display processing control unit that controls processing for displaying the text of the translated content and, in addition to the text, controls processing for selectively displaying a first image; and a call processing control unit that controls call processing with the interpreter terminal and, when the first image is selected, transmits to the interpreter terminal a call processing start request for starting the call processing.

In the above speech translation system, the server device may further include a score calculation unit that calculates a score relating to translation accuracy, and the first display processing control unit may control processing for displaying the first image when the score is equal to or less than a predetermined threshold.
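
The threshold rule above can be illustrated with a minimal Python sketch. The `TranslationResult` type, the function name, the 0-to-1 score scale, and the threshold value of 0.6 are all assumptions made for illustration; the specification only states that the first image is displayed when the score is equal to or less than a predetermined threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationResult:
    """Hypothetical container for what the server returns to the terminal."""
    text: str     # translated text to display
    score: float  # translation-accuracy score (scale assumed to be 0 to 1)


def should_show_call_button(result: TranslationResult, threshold: float) -> bool:
    """Show the first image only when the score is at or below the threshold."""
    return result.score <= threshold


good = TranslationResult(text="Would you like some water?", score=0.92)
poor = TranslationResult(text="Water you like some would?", score=0.41)
print(should_show_call_button(good, threshold=0.6))  # False
print(should_show_call_button(poor, threshold=0.6))  # True
```

Note that the comparison is inclusive, so a score exactly equal to the threshold also causes the first image to be shown.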

In the above speech translation system, the server device may further include a storage unit that stores, as a translation history associated with each user, the translated content associated with the content of the input speech, and the interpreter terminal may further include a second display processing control unit that controls processing for displaying the translation history in association with each user.
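
One way to picture a per-user translation history is an in-memory mapping from a user identifier to a list of source/translation pairs. The sketch below is an illustration only: the class names, the `user_id` key, and the in-memory storage are assumptions, and the specification does not describe how the storage unit is actually organized.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoryEntry:
    source_text: str      # content of the input speech
    translated_text: str  # translated content associated with it


class TranslationHistory:
    """Hypothetical stand-in for a storage unit keyed by user."""

    def __init__(self) -> None:
        self._by_user = defaultdict(list)

    def record(self, user_id: str, source_text: str, translated_text: str) -> None:
        self._by_user[user_id].append(HistoryEntry(source_text, translated_text))

    def for_user(self, user_id: str) -> list:
        # Return a copy so callers cannot mutate the stored history.
        return list(self._by_user.get(user_id, []))


history = TranslationHistory()
history.record("clerk-1", "お水はいかがですか", "Would you like some water?")
history.record("clerk-2", "ご注文は", "May I take your order?")
for entry in history.for_user("clerk-1"):
    print(entry.source_text, "->", entry.translated_text)
```

An interpreter terminal could call something like `for_user` when a call arrives, so that the interpreter sees what the machine translation has already produced for that user.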

In the above speech translation system, the first display processing control unit may further control processing for displaying two or more second images that respectively indicate two or more languages. When the first image is selected after one of the second images has been selected, the call processing control unit may control call processing with an interpreter terminal associated with an interpreter who can use the language indicated by the selected second image.
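
The language-based selection of an interpreter terminal can be sketched as a lookup over a registry of terminals and the languages their interpreters can use. The registry contents, the terminal identifiers, the language codes, and the first-match policy below are hypothetical; the specification only requires that the chosen terminal be associated with an interpreter who can use the selected language.

```python
from typing import Optional

# Hypothetical registry: interpreter terminal ID -> languages its interpreter can use.
INTERPRETER_TERMINALS = {
    "operator-1": {"en", "zh"},
    "operator-2": {"ko"},
    "operator-3": {"en"},
}


def find_interpreter_terminal(language: str) -> Optional[str]:
    """Return the first registered terminal whose interpreter can use the language."""
    for terminal_id, languages in INTERPRETER_TERMINALS.items():
        if language in languages:
            return terminal_id
    return None  # no interpreter available for this language


print(find_interpreter_terminal("en"))  # operator-1
print(find_interpreter_terminal("ko"))  # operator-2
print(find_interpreter_terminal("fr"))  # None
```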

To solve the above problems, a speech translation method according to one aspect of the present invention includes: a step of outputting, as speech, content of a user's speech that has been translated into content in a different language; a step of controlling processing for displaying the text of the translated content and, in addition to the text, controlling processing for selectively displaying a first image; and a step of controlling call processing with an interpreter terminal, in which a call processing start request for starting the call processing is transmitted to the interpreter terminal when the first image is selected.

To solve the above problems, a speech translation program according to one aspect of the present invention causes a computer to function as: a speech output unit that outputs, as speech, content of a user's speech that has been translated into content in a different language; a first display processing control unit that controls processing for displaying the text of the translated content and, in addition to the text, controls processing for selectively displaying a first image; and a call processing control unit that controls call processing with an interpreter terminal and, when the first image is selected, transmits to the interpreter terminal a call processing start request for starting the call processing.

In the present invention, the terms "unit", "device", and "system" do not simply mean physical means; they also cover cases where the functions of the "unit", "device", or "system" are realized by software. The functions of one "unit", "device", or "system" may be realized by two or more physical means or devices, and the functions of two or more "units", "devices", or "systems" may be realized by one physical means or device.

According to the present invention, the burden on the user can be reduced and convenience improved, while mistranslation can be prevented and smooth communication realized.

FIG. 1 is a system block diagram schematically showing a preferred embodiment of a network configuration of a speech translation system according to the present invention.
FIG. 2 is a system block diagram schematically showing an example of the configuration of a user device (information terminal) in the speech translation system according to the present invention.
FIG. 3 is a functional block diagram schematically showing an example of the functional configuration of the user device (information terminal) in the speech translation system according to the present invention.
FIG. 4 is a system block diagram schematically showing an example of the configuration of a server device in the speech translation system according to the present invention.
FIG. 5 is a functional block diagram schematically showing an example of the functional configuration of the server device in the speech translation system according to the present invention.
FIG. 6 is a system block diagram schematically showing an example of the configuration of an operator terminal (interpreter device) in the speech translation system according to the present invention.
FIG. 7 is a functional block diagram schematically showing an example of the functional configuration of the operator terminal in the speech translation system according to the present invention.
FIG. 8 is a flowchart showing an example of (part of) the flow of processing in the speech translation system according to the present invention.
FIGS. 9(A) to 9(C) are plan views showing an example of display screen transitions on the information terminal according to the present invention.
FIGS. 10(A) to 10(C) are plan views showing an example of display screen transitions on the information terminal according to the present invention.
FIGS. 11(A) to 11(D) are plan views showing an example of display screen transitions on the information terminal according to the present invention.
FIG. 12 is a diagram showing an example of a display screen on the interpreter terminal according to the present invention.
FIG. 13 is a flowchart showing another example of (part of) the flow of processing in the speech translation system according to the present invention.

Embodiments of the present invention are described in detail below. The following embodiments are examples for explaining the present invention and are not intended to limit the present invention to those embodiments alone. The present invention can be modified in various ways without departing from its gist. Those skilled in the art can also adopt embodiments in which the elements described below are replaced with equivalents, and such embodiments are also included in the scope of the present invention. Positional relationships such as up, down, left, and right, where indicated, are based on the drawings unless otherwise specified. The dimensional ratios in the drawings are not limited to the ratios illustrated.

(System configuration)
FIG. 1 is a system block diagram schematically showing a preferred embodiment of a network configuration of a speech translation system according to the present invention. In this example, the speech translation system 100 includes, by way of example: an information terminal 10 that is used by a user (a speaker or another speaker) and inputs the user's speech; a server device 20 that is electronically connected to the information terminal 10 via a network N and translates the content of the speech input to the information terminal 10; and an operator terminal 30 (interpreter terminal) that is electronically connected to the information terminal 10 and the server device 20 via the network N, is used by an interpreter, and performs call processing with the information terminal 10.

FIG. 2 is a system block diagram schematically showing an example of the configuration of the user device (information terminal) in the speech translation system according to the present invention. As shown in FIG. 2, the information terminal 10 includes, by way of example, a processor 11, a storage resource 12, a voice input/output device 13 (in which, for example, the microphone and speaker may be separate or integrated), a communication interface 14, an input device 15, a display device 16, and a camera 17. When the installed speech translation application software (at least part of the speech translation program according to an embodiment of the present invention) runs on it, the information terminal 10 functions as part or all of the speech translation system according to an embodiment of the present invention. The information terminal 10 here is, for example, a portable tablet-type terminal device, including a mobile phone typified by a smartphone, that has a function of communicating with the network N.

The processor 11 consists of an arithmetic logic unit and various registers (program counter, data register, instruction register, general-purpose registers, and so on). The processor 11 interprets and executes the speech translation application software, which is the program P10 stored in the storage resource 12, and performs various kinds of processing. The speech translation application software serving as the program P10 can be distributed, for example, from the server device 20 through the network N, and may be installed and updated manually or automatically.

The network N is, for example, a communication network in which wired networks (a local area network (LAN), a wide area network (WAN), a value-added network (VAN), or the like) and wireless networks (a mobile communication network, a satellite communication network, Bluetooth (registered trademark), WiFi (Wireless Fidelity), HSDPA (High Speed Downlink Packet Access), or the like) are combined.

The storage resource 12 is a logical device provided by a storage area of a physical device (for example, a computer-readable storage medium such as a semiconductor memory), and stores the operating system program, driver programs, various kinds of information, and the like used in the processing of the information terminal 10. Examples of the driver programs include an input/output device driver program for controlling the voice input/output device 13, an input device driver program for controlling the input device 15, and an output device driver program for controlling the display device 16. The voice input/output device 13 is, for example, a general microphone and a sound player capable of reproducing sound data.

The communication interface 14 provides a connection interface with, for example, the server device 20 and the operator terminal 30, and consists of a wireless communication interface and/or a wired communication interface. The input device 15 provides an interface that accepts input operations such as taps on icons, buttons, a virtual keyboard, and the like displayed on the display device 16; examples include a touch panel as well as various input devices externally attached to the information terminal 10.

The display device 16 serves as an image display interface that presents various kinds of information to the user and, as necessary, to the other party in the conversation. Examples include an organic EL display, a liquid crystal display, and a CRT display, preferably one that employs a touch panel of any of various types. The camera 17 captures still images and moving images of various subjects.

FIG. 3 is a functional block diagram schematically showing an example of the functional configuration of the user device (information terminal) in the speech translation system according to the present invention. As shown in FIG. 3, the information terminal 10 functionally includes a voice input/output unit 101, a transmission/reception unit 103, an input operation reception unit 105, a display unit 107, an information processing unit 109, and a storage unit 117. The information processing unit 109 functionally includes a score comparison unit 111, a first display processing control unit 113, a call processing control unit 115, and an operator terminal specifying unit 116.

The voice input/output unit 101, for example, inputs the user's speech. As described later, the voice input/output unit 101 also outputs, as speech, the content translated by the server device 20 shown in FIG. 1. The voice input/output device 13 shown in FIG. 2 functions as the voice input/output unit 101.

The transmission/reception unit 103 transmits and receives various kinds of information to and from, for example, the server device 20 and the operator terminal 30 shown in FIG. 1. For example, the transmission/reception unit 103 transmits the content of the input speech to the server device 20, and receives text information, audio information, and the like of the content translated by the server device 20. The transmission/reception unit 103 also receives, for example, a score relating to translation accuracy from the server device 20. The communication interface 14 shown in FIG. 2 functions as the transmission/reception unit 103.

The input operation reception unit 105 is, for example, a block that accepts the user's input operations. The input device 15 shown in FIG. 2 functions as the input operation reception unit 105.

The display unit 107 displays various kinds of information. For example, the display unit 107 displays the text of the translated content. The display unit 107 also displays, for example, the language buttons 61 (second images) shown in FIG. 9(A) and the call start button 73 (first image) shown in FIG. 10(C). The display device 16 shown in FIG. 2 functions as the display unit 107.

The information processing unit 109 represents the functions of the processor 11 shown in FIG. 2. The score comparison unit 111, for example, compares the score relating to the translation accuracy of the translation processing performed by the server device 20 with a predetermined threshold (score). The first display processing control unit 113 is a block that controls processing for displaying various kinds of information on the display unit 107. For example, it controls processing for displaying the text of the content translated by the server device 20 and, in addition to that text, controls processing for selectively displaying the call start button 73 (first image) shown in FIG. 10(C). The call processing control unit 115 is, for example, a block that controls call processing between the information terminal 10 and the operator terminal 30; when the call start button 73 displayed on the display unit 107 is selected, it transmits to the operator terminal 30 a call processing start request for starting the call processing. The operator terminal specifying unit 116, for example, identifies the operator terminal 30 used by an interpreter who can use the language indicated by the English button selected from among the language buttons 61 shown in FIG. 9(A).
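
The interaction between a button selection and the call processing start request can be sketched as a small event handler. The following Python is a rough illustration, not the implementation described in the specification: the `CallStartRequest` fields, the class and method names, and the injected `send` callable (standing in for the transmission/reception unit 103) are all assumptions, and the operator terminal ID is simply passed in rather than resolved by a specifying unit.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class CallStartRequest:
    """Hypothetical payload of the call processing start request."""
    user_id: str
    operator_terminal_id: str
    language: Optional[str]  # language chosen with a language button, if any


class CallProcessingController:
    """Loose stand-in for the call processing control unit 115."""

    def __init__(self, user_id: str, send: Callable[[CallStartRequest], None]) -> None:
        self.user_id = user_id
        self.send = send  # stands in for the transmission/reception unit 103
        self.language: Optional[str] = None

    def on_language_button(self, language: str) -> None:
        # Remember which language button (second image) was selected.
        self.language = language

    def on_call_start_button(self, operator_terminal_id: str) -> CallStartRequest:
        # Selecting the call start button (first image) sends the request at once.
        request = CallStartRequest(self.user_id, operator_terminal_id, self.language)
        self.send(request)
        return request


outbox: List[CallStartRequest] = []
controller = CallProcessingController(user_id="clerk-1", send=outbox.append)
controller.on_language_button("en")
controller.on_call_start_button(operator_terminal_id="operator-1")
print(outbox[0])
```

Injecting `send` keeps the sketch testable without a network. The point it illustrates is that a single selection triggers the request, with no telephone directory lookup by the user.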

The storage unit 117 is a block that stores the various programs, information, and the like used in the processing of the information terminal 10. For example, the storage unit 117 stores the text information, audio information, and the like of the content translated by the server device 20 and received by the transmission/reception unit 103. The storage unit 117 also stores the score relating to the translation accuracy of the server device 20 received by the transmission/reception unit 103. The storage resource 12 shown in FIG. 2 functions as the storage unit 117. Although not shown in FIG. 3, the camera 17 shown in FIG. 2 functions as, for example, an imaging unit.

FIG. 4 is a system block diagram schematically showing an example of the configuration of the server device in the speech translation system according to the present invention. As shown in FIG. 4, the server device 20 includes, by way of example, a processor 21, a communication interface 22, and a storage resource 23. The server device 20 is constituted by, for example, a host computer with high processing capability, and provides its server functions when a predetermined server program runs on that host computer. It consists of, for example, one or more host computers functioning as a speech recognition server, a translation server, and a speech synthesis server (a single computer is shown in the figure, but the configuration is not limited to this).

The processor 21 consists of an arithmetic logic unit that processes arithmetic operations, logical operations, bit operations, and the like, and various registers (program counter, data register, instruction register, general-purpose registers, and so on). It interprets and executes the program P20 stored in the storage resource 23 and outputs predetermined processing results. The communication interface 22 is a hardware module for connecting to the information terminal 10 via the network N, and is, for example, a modulation/demodulation device such as an ISDN modem, an ADSL modem, a cable modem, an optical modem, or a soft modem.

The storage resource 23 is, for example, a logical device provided by a storage area of a physical device (a computer-readable storage medium such as a disk drive or a semiconductor memory), and stores one or more each of the program P20, various modules L20, various databases D20, and various models M20.

The program P20 is, for example, the server program mentioned above, which is the main program of the server device 20. The various modules L20 are software modules (modularized subprograms) that are called and executed as needed while the program P20 is running, in order to perform a series of information processing operations on the requests and information transmitted from the information terminal 10. Examples of the modules L20 include a speech recognition module, a translation module, and a speech synthesis module.

Examples of the various databases D20 include the various corpora required for speech translation processing (for Japanese-English speech translation, for example, a Japanese speech corpus, an English speech corpus, a Japanese character (vocabulary) corpus, an English character (vocabulary) corpus, a Japanese dictionary, an English dictionary, a Japanese-English bilingual dictionary, a Japanese-English bilingual corpus, and the like), a speech database described later, and a management database for managing information about users. Examples of the various models M20 include the acoustic models and language models used for the speech recognition described later.

FIG. 5 is a functional block diagram schematically showing an example of the functional configuration of the server device in the speech translation system according to the present invention. As shown in FIG. 5, the server device 20 functionally includes a transmission/reception unit 201, an information processing unit 203, and a storage unit 213. The information processing unit 203 includes, for example, a speech recognition unit 205, a multilingual translation unit 207, a score calculation unit 209, and a speech synthesis unit 211.

The transmission/reception unit 201 transmits and receives various kinds of information to and from, for example, the information terminal 10 and the operator terminal 30 shown in FIG. 1. For example, the transmission/reception unit 201 receives from the information terminal 10 the content of the speech input to the information terminal 10, and transmits to the information terminal 10 the text information, audio information, and the like of the content translated by the multilingual translation unit 207 described later. The transmission/reception unit 201 also transmits to the information terminal 10, for example, the score relating to translation accuracy calculated by the score calculation unit 209 described later. The communication interface 22 shown in FIG. 4 functions as the transmission/reception unit 201.

情報処理部２０３は、図４に示すプロセッサ２１の機能を示し、音声認識部２０５は、例えば、情報端末１０に入力された音声の内容を認識する。多言語翻訳部２０７は、例えば、音声認識部２０５で認識された内容を異なる言語の内容に翻訳する。スコア算出部２０９は、例えば、多言語翻訳部２０７の翻訳精度に関するスコアを算出する。音声合成部２１１は、例えば、多言語翻訳部２０７による翻訳結果に基づいて音声合成を行う。 The information processing unit 203 indicates the function of the processor 21 illustrated in FIG. 4, and the voice recognition unit 205 recognizes the content of the voice input to the information terminal 10, for example. For example, the multilingual translation unit 207 translates the content recognized by the speech recognition unit 205 into the content of a different language. For example, the score calculation unit 209 calculates a score related to the translation accuracy of the multilingual translation unit 207. For example, the speech synthesis unit 211 performs speech synthesis based on the translation result by the multilingual translation unit 207.

記憶部２１３は、例えば、サーバ装置２０の処理に用いられる各種プログラム及び情報等を記憶するブロックである。記憶部２１３は、例えば、送受信部２０１が受信した、情報端末１０に入力された音声の内容を記憶する。また、記憶部２１３は、例えば、翻訳された内容を記憶する。記憶部２１３は、例えば、入力された音声の内容に対応付けられた翻訳された内容をユーザごとに関連付けて翻訳履歴として記憶する。ここで、図４に示す記憶資源２３は、記憶部２１３として機能する。 The storage unit 213 is, for example, a block that stores various programs, information, and the like used for processing in the server device 20. The storage unit 213 stores, for example, the content of the voice input to the information terminal 10 and received by the transmission/reception unit 201. The storage unit 213 also stores, for example, the translated content. The storage unit 213 stores, for example, the translated content associated with the content of the input voice as a translation history, in association with each user. Here, the storage resource 23 illustrated in FIG. 4 functions as the storage unit 213.

図６は、本発明による音声翻訳システムにおけるオペレータ端末（通訳者装置）の構成の一例を概略的に示すシステムブロック図である。図６に示すように、オペレータ端末３０は、プロセッサ３１、記憶資源３２、音声入出力デバイス３３（例えばマイクとスピーカーが別体のものも一体のものも含む）、通信インターフェイス３４、入力デバイス３５、表示デバイス３６、及びカメラ３７を備えている。上記したとおりオペレータ端末３０は、図２に示す情報端末１０と同様なブロック構成を備えている。以下においては、特に、情報端末１０が備える構成と異なる構成について説明する。また、オペレータ端末３０は、例えば、本発明の一実施形態による音声翻訳プログラムの少なくとも一部として実行されるインストールされたＣＴＩ（Computer Telephony Integration）アプリケーションソフトが動作することにより、本発明の一実施形態による音声翻訳システムの一部又は全部として機能するものである。 FIG. 6 is a system block diagram schematically showing an example of the configuration of an operator terminal (interpreter device) in the speech translation system according to the present invention. As shown in FIG. 6, the operator terminal 30 includes a processor 31, a storage resource 32, a voice input/output device 33 (for example, a microphone and a speaker, whether separate or integrated), a communication interface 34, an input device 35, a display device 36, and a camera 37. As described above, the operator terminal 30 has the same block configuration as the information terminal 10 shown in FIG. 2. The following description focuses on the parts of the configuration that differ from those of the information terminal 10. The operator terminal 30 functions as part or all of the speech translation system according to an embodiment of the present invention when, for example, installed CTI (Computer Telephony Integration) application software, executed as at least part of the speech translation program according to an embodiment of the present invention, runs on it.

オペレータ端末３０は、図１に示す情報端末１０からの電話を受け付ける。通訳者は、オペレータ端末３０を介して、通訳を実行する。オペレータ端末３０は、電話の相手方、例えば、情報端末１０及び当該情報端末１０の操作者の少なくとも一方に関する情報や後で詳述する翻訳履歴等を表示デバイス３６に表示する。なお、オペレータ端末３０は、例示的に、ネットワークＮとの通信機能を有する、デスクトップ型パソコンを含む据え置き型の端末装置である。 The operator terminal 30 receives a call from the information terminal 10 shown in FIG. The interpreter performs interpretation via the operator terminal 30. The operator terminal 30 displays information related to at least one of the other party of the telephone, for example, the information terminal 10 and the operator of the information terminal, a translation history, which will be described in detail later, on the display device 36. The operator terminal 30 is, for example, a stationary terminal device including a desktop personal computer having a communication function with the network N.

プロセッサ３１は、記憶資源３２に格納されているプログラムＰ３０であるＣＴＩアプリケーションソフトを解釈及び実行し、各種処理を行う。入力デバイス３５は、例えば、表示デバイス３６に表示されるアイコン、ボタン、仮想キーボード等のタップ動作による入力操作を受け付けるインターフェイスを提供するものであり、オペレータ端末３０に外付けされる各種入力装置、例えばキーボードやマウスを例示することができる。なお、入力デバイス３５は、表示デバイス３６の機能を含んだ各種方式のタッチパネル等のデバイスであってもよい。 The processor 31 interprets and executes CTI application software, which is a program P30 stored in the storage resource 32, and performs various processes. The input device 35 provides an interface for accepting an input operation by a tap operation such as an icon, a button, or a virtual keyboard displayed on the display device 36, and various input devices externally attached to the operator terminal 30, for example, A keyboard and a mouse can be exemplified. The input device 35 may be a device such as a touch panel of various types including the function of the display device 36.

図７は、本発明による音声翻訳システムにおけるオペレータ端末（通訳者装置）の機能構成の一例を概略的に示す機能ブロック図である。図７に示すように、オペレータ端末３０は、機能的に、音声入出力部３０１と、送受信部３０３と、入力操作受付部３０５と、表示部３０７と、情報処理部３０９と、記憶部３１５と、を備える。また、情報処理部３０９は、機能的に、通話処理部３１１と、第２表示処理制御部３１３と、を備える。 FIG. 7 is a functional block diagram schematically showing an example of the functional configuration of an operator terminal (interpreter device) in the speech translation system according to the present invention. As shown in FIG. 7, the operator terminal 30 functionally includes a voice input/output unit 301, a transmission/reception unit 303, an input operation reception unit 305, a display unit 307, an information processing unit 309, and a storage unit 315. The information processing unit 309 functionally includes a call processing unit 311 and a second display processing control unit 313.

音声入出力部３０１は、例えば、通訳者を含むオペレータの音声を入力する。また、音声入出力部３０１は、例えば、後述するとおり、送受信部３０３が受信する翻訳履歴を示す内容を音声で出力するように構成されてもよい。ここで、図６に示す音声入出力デバイス３３は、音声入出力部３０１として機能する。 The voice input / output unit 301 inputs the voice of an operator including an interpreter, for example. In addition, the voice input / output unit 301 may be configured to output the content indicating the translation history received by the transmission / reception unit 303 by voice as described later, for example. Here, the voice input / output device 33 illustrated in FIG. 6 functions as the voice input / output unit 301.

送受信部３０３は、例えば図１に示す情報端末１０やサーバ装置２０と各種情報を送受信する。送受信部３０３は、例えば、サーバ装置２０から情報端末１０を介して送信される翻訳履歴を受信する。また、送受信部３０３は、例えば、情報端末１０から送信される通話処理開始リクエストを受信する。送受信部３０３は、例えば、通話処理開始リクエストに対する応答信号を送信する。図６に示す通信インターフェイス３４は、送受信部３０３として機能する。 The transmission / reception unit 303 transmits / receives various information to / from the information terminal 10 and the server device 20 illustrated in FIG. The transmission / reception unit 303 receives, for example, a translation history transmitted from the server device 20 via the information terminal 10. In addition, the transmission / reception unit 303 receives, for example, a call processing start request transmitted from the information terminal 10. For example, the transmission / reception unit 303 transmits a response signal to the call processing start request. The communication interface 34 illustrated in FIG. 6 functions as the transmission / reception unit 303.

入力操作受付部３０５は、例えば、オペレータの入力操作を受け付けるブロックである。ここで、図６に示す入力デバイス３５は、入力操作受付部３０５として機能する。 The input operation accepting unit 305 is a block that accepts an operator's input operation, for example. Here, the input device 35 illustrated in FIG. 6 functions as the input operation reception unit 305.

表示部３０７は、各種情報を表示する。表示部３０７は、例えば、翻訳履歴をユーザごとに関連付けて表示する。ここで、図６に示す表示デバイス３６は、表示部３０７として機能する。 The display unit 307 displays various information. For example, the display unit 307 displays the translation history in association with each user. Here, the display device 36 illustrated in FIG. 6 functions as the display unit 307.

情報処理部３０９は、図６に示すプロセッサ３１の機能を示し、通話処理部３１１は、例えば、情報端末１０から送信される通話処理開始リクエストに基づいて、オペレータ端末３０と情報端末１０との間で通話可能か否かを判断し、通話処理開始リクエストに対する応答信号を生成する。応答信号は、オペレータ端末３０と情報端末１０との間で通話可能であることを示す信号や、オペレータ端末３０と情報端末１０との間で通話不可能であることを示す信号を含む。第２表示処理制御部３１３は、例えば、表示部３０７において各種情報を表示する処理を制御するブロックである。第２表示処理制御部３１３は、例えば、表示部３０７において、翻訳履歴をユーザごとに関連付けて表示する処理を制御する。 The information processing unit 309 represents the functions of the processor 31 illustrated in FIG. 6. The call processing unit 311 determines, for example, whether a call is possible between the operator terminal 30 and the information terminal 10 based on a call processing start request transmitted from the information terminal 10, and generates a response signal to the call processing start request. The response signal includes a signal indicating that a call is possible between the operator terminal 30 and the information terminal 10 or a signal indicating that a call is not possible between them. The second display processing control unit 313 is, for example, a block that controls processing for displaying various types of information on the display unit 307. The second display processing control unit 313 controls, for example, processing for displaying the translation history on the display unit 307 in association with each user.
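
The decision made by the call processing unit 311 can be illustrated with a minimal Python sketch. All names here (`CallStatus`, `CallStartRequest`, `CallResponse`, `handle_call_start_request`, and the `operator_busy` flag) are hypothetical and do not appear in the specification; the sketch also assumes that the only criterion for availability is whether the operator is already busy, which the text does not state.

```python
from dataclasses import dataclass
from enum import Enum


class CallStatus(Enum):
    AVAILABLE = "available"      # a call can be established
    UNAVAILABLE = "unavailable"  # assumed counterpart: a call cannot be established


@dataclass(frozen=True)
class CallStartRequest:
    terminal_id: str  # hypothetical identifier of the information terminal 10


@dataclass(frozen=True)
class CallResponse:
    terminal_id: str
    status: CallStatus


def handle_call_start_request(request: CallStartRequest, operator_busy: bool) -> CallResponse:
    """Decide whether the operator terminal can take the call and build the response signal."""
    # Assumption: availability depends only on whether the operator is busy.
    status = CallStatus.UNAVAILABLE if operator_busy else CallStatus.AVAILABLE
    return CallResponse(terminal_id=request.terminal_id, status=status)
```

The response echoes the terminal identifier so that the sender can match it to its request; this is a design choice of the sketch, not something the text requires.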

記憶部３１５は、オペレータ端末３０の処理に用いられる各種プログラム及び情報等を記憶するブロックである。記憶部３１５は、例えば、送受信部３０３が受信した、サーバ装置２０から情報端末１０を介して送信される翻訳履歴を記憶する。ここで、図６に示す記憶資源３２は、記憶部３１５として機能する。なお、図６に示すカメラ３７は、図７において不図示であるが例えば撮像部として機能する。 The storage unit 315 is a block that stores various programs and information used for processing of the operator terminal 30. The storage unit 315 stores, for example, a translation history transmitted from the server device 20 via the information terminal 10 received by the transmission / reception unit 303. Here, the storage resource 32 illustrated in FIG. 6 functions as the storage unit 315. The camera 37 shown in FIG. 6 functions as an imaging unit, for example, although not shown in FIG.

以上のとおり構成された音声翻訳システム１００における、音声翻訳処理及び通話処理の操作及び動作の一例について、以下に更に説明する。 An example of the user operations and system behavior in the speech translation processing and call processing of the speech translation system 100 configured as described above is described further below.

（音声翻訳処理及び通話処理） (Speech translation processing and call processing)
（第１実施形態） (First embodiment)
図８は、本発明による音声翻訳システムにおける処理の流れ（一部）の一例を示すフローチャートである。図９（Ａ）乃至（Ｃ）、図１０（Ａ）乃至（Ｃ）、及び図１１（Ａ）乃至（Ｄ）は、本発明による情報端末における表示画面の遷移の一例を示す平面図である。図１２は、本発明による通訳者端末における表示画面の一例を示す図である。ここでは、情報端末１０のユーザが日本語を話す飲食店の店員であり、会話の相手が英語を話す顧客である場合の会話、すなわち、入力言語が日本語であり、翻訳言語が英語である会話を想定する。但し、これに限定されない。 FIG. 8 is a flowchart showing an example of (part of) the processing flow in the speech translation system according to the present invention. FIGS. 9A to 9C, FIGS. 10A to 10C, and FIGS. 11A to 11D are plan views showing an example of display screen transitions on the information terminal according to the present invention. FIG. 12 is a diagram showing an example of a display screen on the interpreter terminal according to the present invention. The description here assumes a conversation in which the user of the information terminal 10 is a Japanese-speaking restaurant clerk and the other party is an English-speaking customer, that is, a conversation in which the input language is Japanese and the translation language is English. However, the invention is not limited to this case.

まず、ユーザ（店員）が、情報端末１０の表示部１０７に表示されている音声翻訳アプリケーションソフトのアイコン（図示せず）をタップする場合、情報端末１０において当該アプリケーションを起動する（図８；ステップＳＪ１）。 First, when the user (clerk) taps the icon (not shown) of the speech translation application software displayed on the display unit 107 of the information terminal 10, the application is started on the information terminal 10 (FIG. 8; step SJ1).

当該アプリケーションが起動すると、表示部１０７に、顧客の言語選択画面が表示される（図８；ステップＳＪ２）。図９（Ａ）に示すように、この言語選択画面には、例えば顧客に言語を尋ねる旨の日本語のテキストＴ２１、その旨の英語のテキストＴ２２、及び、想定される複数の代表的な言語（ここでも、英語、中国語（例えば書体により２種類）、ハングル語）を示す言語ボタン６１（第２画像）が表示される。 When the application starts, a customer language selection screen is displayed on the display unit 107 (FIG. 8; step SJ2). As shown in FIG. 9A, this language selection screen displays, for example, Japanese text T21 asking the customer which language they speak, English text T22 with the same meaning, and language buttons 61 (second image) indicating a plurality of representative languages that are expected (here, English, Chinese (for example, two variants by script), and Korean (Hangul)).

このとき、日本語のテキストＴ２１及び英語のテキストＴ２２は、第１表示処理制御部１１３及び表示部１０７により、情報端末１０の表示部１０７の画面において、例えば異なる色の領域によって区分けされ、且つ、互いに逆向き（互いに異なる向き；図示において上下逆向き）に表示される。これにより、ユーザと顧客が対面している状態で会話を行う場合、ユーザは日本語のテキストＴ２１を確認し易い一方、顧客は、英語のテキストＴ２２を確認し易くなる。また、テキストＴ２１とテキストＴ２２が区分けして表示されるので、両者を明別して更に視認し易くなる利点がある。 At this time, the Japanese text T21 and the English text T22 are classified by the first display processing control unit 113 and the display unit 107, for example, by areas of different colors on the screen of the display unit 107 of the information terminal 10, and They are displayed in opposite directions (different directions; upside down in the figure). Thereby, when a conversation is performed in a state where the user and the customer face each other, the user can easily confirm the Japanese text T21, while the customer can easily confirm the English text T22. In addition, since the text T21 and the text T22 are displayed separately, there is an advantage that the text T21 and the text T22 are clearly distinguished from each other.

それから、ユーザは、図９（Ａ）の言語選択画面に表示されたテキストＴ２１を顧客に提示し、顧客に英語（Ｅｎｇｌｉｓｈ）のボタンをタップしてもらうことで、顧客の言語が選択される。これにより、表示デバイスには、ホーム画面として、日本語と英語の音声入力の待機画面が表示される（図８；ステップＳＪ３）。このホーム画面には、ユーザと顧客の言語の何れを発話するかを問うテキストＴ２３、並びに、日本語の音声入力を行うための日本語入力ボタン６２ａ及び英語の音声入力を行うための英語入力ボタン６２ｂが表示される。また、このホーム画面には、入力内容の履歴を表示するための履歴表示ボタン６３、言語選択画面に戻って顧客の言語を切り替える（言語選択をやり直す）ための言語選択ボタン６４、及び当該アプリケーションソフトの各種設定を行うための設定ボタン６５も表示される。 Then, the user presents the text T21 displayed on the language selection screen of FIG. 9A to the customer, and has the customer tap the English button so that the customer's language is selected. As a result, a standby screen for voice input in Japanese and English is displayed on the display device as the home screen (FIG. 8; step SJ3). On this home screen, text T23 asking which of the user's or customer's language is to be spoken, a Japanese input button 62a for performing Japanese speech input, and an English input button for performing English speech input 62b is displayed. The home screen also includes a history display button 63 for displaying a history of input contents, a language selection button 64 for returning to the language selection screen and switching the customer language (re-selecting the language), and the application software. A setting button 65 for performing various settings is also displayed.

次に、図９（Ｂ）のホーム画面において、ユーザ（店員）が日本語入力ボタン６２ａをタップして日本語の音声入力を選択すると、ユーザの日本語による発話内容を受け付ける音声入力画面となる（図９（Ｃ））。この音声入力画面が表示されると、音声入出力部１０１からの音声入力が可能な状態となる。また、この音声入力画面には、ユーザの音声入力を促すテキストＴ２４、及び、音声入力の待機状態であることを示すマイク図案６６が表示される。なお、その前の画面である図９（Ｂ）において日本語音声入力が選択されたことを示すため、図９（Ｃ）の音声入力画面には、日本語入力ボタン６２ａが表示されない。また、英語入力ボタン６２ｂは、マイク図案６６の背面に、その一部が隠れるように、且つ例えば淡い色彩で表示される（後記の図１０（Ａ）及び図１０（Ｂ）において同様）。 Next, on the home screen in FIG. 9B, when the user (clerk) taps the Japanese input button 62a and selects Japanese voice input, the voice input screen for accepting the user's Japanese utterance content is displayed. (FIG. 9C). When this voice input screen is displayed, voice input from the voice input / output unit 101 is enabled. Further, on this voice input screen, a text T24 for prompting the user to input voice and a microphone design 66 indicating that the voice input is in a standby state are displayed. Note that the Japanese input button 62a is not displayed on the voice input screen of FIG. 9C to indicate that Japanese voice input has been selected in FIG. 9B, which is the previous screen. Further, the English input button 62b is displayed in a light color so that a part of the English input button 62b is hidden behind the microphone design 66 (the same applies to FIGS. 10A and 10B described later).

また、この音声入力画面の下部には、キャンセルボタン６７が表示され、これをタップすることにより、ホーム画面である音声入力の待機画面（図９（Ｂ））へ戻って音声入力をやり直すことができる（後記の図１０（Ａ）及び図１０（Ｂ）において同様）。この状態で、ユーザにより顧客への伝達事項等が日本語で音声入力されると、図１０（Ａ）に示すように、表示部１０７の画面において、テキストＴ２４とともに、声量の大小を模式的に且つ動的に示す多重円形図案６８が表示され、音声入力レベルが発話者であるユーザへ視覚的にフィードバックされる（図８；ステップＳＪ４）。 A cancel button 67 is also displayed at the bottom of this voice input screen. Tapping it returns to the voice input standby screen (FIG. 9B), which is the home screen, so that voice input can be performed again (the same applies to FIGS. 10A and 10B described later). In this state, when the user speaks a message for the customer in Japanese, the screen of the display unit 107 displays, together with the text T24, a multiple-circle design 68 that schematically and dynamically indicates the loudness of the voice, as shown in FIG. 10A, so that the voice input level is visually fed back to the user who is speaking (FIG. 8; step SJ4).

それから、ユーザによる発話が終了し、例えば音声入力が一定期間ないことを情報端末１０の情報処理部１０９が検知すると、情報処理部１０９は、ユーザによる発話内容の受け付けを終了する。次いで、情報処理部１０９は、その音声入力に基づいて音声信号を生成し、その音声信号を送受信部１０３及びネットワークＮを通してサーバ装置２０へ送信する。 Then, when the user's utterance ends, for example, when the information processing unit 109 of the information terminal 10 detects that there is no voice input for a certain period of time, the information processing unit 109 ends the reception of the utterance content by the user. Next, the information processing unit 109 generates an audio signal based on the audio input, and transmits the audio signal to the server device 20 through the transmission / reception unit 103 and the network N.
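
The end-of-utterance detection described above (stopping when no voice input arrives for a certain period) can be sketched in Python as follows. The specification does not state how silence is measured or how long the period is, so the per-frame level values, the `silence_threshold`, the `max_silent_frames` count, and the function name `collect_utterance` are all assumptions made for illustration.

```python
from typing import Iterable, List


def collect_utterance(levels: Iterable[float],
                      silence_threshold: float = 0.05,
                      max_silent_frames: int = 3) -> List[float]:
    """Collect frame levels of one utterance, stopping after a run of quiet frames.

    Frames before the first loud frame are ignored, so leading silence does not
    end the capture before the speaker has started.
    """
    collected: List[float] = []
    started = False
    silent_run = 0
    for level in levels:
        if level >= silence_threshold:
            started = True
            silent_run = 0
        elif started:
            silent_run += 1
        if started:
            collected.append(level)
        if silent_run >= max_silent_frames:
            # Drop the trailing silent frames and stop accepting input.
            return collected[:-silent_run]
    return collected
```

A short pause inside the utterance (fewer than `max_silent_frames` quiet frames) is kept as part of the captured signal, which avoids cutting a sentence at a breath.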

次に、サーバ装置２０の情報処理部２０３の音声認識部２０５は、送受信部２０１を通してその音声信号を受信し、音声認識処理を行う（図８；ステップＳＳ１）。このとき、音声認識部２０５は、記憶部２１３から、必要なモジュールＬ２０、データベースＤ２０、及びモデルＭ２０（音声認識モジュール、日本語音声コーパス、音響モデル、言語モデル等）を呼び出し、入力音声の「音」を「読み」（文字）へ変換する。 Next, the speech recognition unit 205 of the information processing unit 203 of the server device 20 receives the voice signal through the transmission/reception unit 201 and performs speech recognition processing (FIG. 8; step SS1). At this time, the speech recognition unit 205 retrieves the necessary module L20, database D20, and model M20 (speech recognition module, Japanese speech corpus, acoustic model, language model, and the like) from the storage unit 213, and converts the "sound" of the input voice into a "reading" (characters).

ここで、情報処理部２０３は、認識された音声の「読み」（文字）に基づいてテキスト出力用のテキスト信号を生成し、送受信部２０１及びネットワークＮを通して、情報端末１０へ送信する。このとき、情報処理部２０３は、認識された音声そのものの内容に基づくテキスト信号と、予め記憶部２１３に記憶されている日本語の会話コーパスのなかから、実際の発話内容に対応するものを呼び出し、それに基づくテキスト信号を生成する。そして、図１０（Ｂ）に示すように、送受信部１０３を通してそのテキスト信号を受信した情報端末１０の第１表示処理制御部１１３は、画面において、ユーザによって入力された日本語の発話内容の認識結果として、認識された音声の内容である日本語のテキストＴ２５を表示する。 Here, the information processing unit 203 generates a text signal for text output based on the "reading" (characters) of the recognized voice, and transmits it to the information terminal 10 through the transmission/reception unit 201 and the network N. At this time, the information processing unit 203 generates a text signal based on the content of the recognized voice itself and, in addition, retrieves from the Japanese conversation corpus stored in advance in the storage unit 213 the entry corresponding to the actual utterance and generates a text signal based on that entry. Then, as shown in FIG. 10B, the first display processing control unit 113 of the information terminal 10, which has received the text signal through the transmission/reception unit 103, displays on the screen the Japanese text T25, which is the content of the recognized voice, as the recognition result of the Japanese utterance input by the user.

次いで、多言語翻訳部２０７は、認識された音声の「読み」（文字）を他の言語に翻訳する多言語翻訳処理へ移行する（図８；ステップＳＳ２）。このとき、多言語翻訳部２０７は、記憶部２１３から、必要なモジュールＬ２０及びデータベースＤ２０（翻訳モジュール、日本語文字コーパス、日本語辞書、英語辞書、日英対訳辞書、日英対訳コーパス等）を呼び出し、認識結果である入力音声の「読み」（文字列）を適切に並び替えて日本語の句、節、文等へ変換し、その変換結果に対応する英語を抽出し、それらを英文法に従って並び替えて自然な英語の句、節、文等へと変換し、記憶部２１３からそれに対応する英語の会話コーパスを選定する。その際、図１０（Ｂ）に示すように、表示部１０７には、翻訳中であることを示す日本語のテキストＴ２６、及び、翻訳中であることを示す円形図案６９を含む待機画面が表示される。 Next, the multilingual translation unit 207 proceeds to multilingual translation processing, which translates the "reading" (characters) of the recognized voice into another language (FIG. 8; step SS2). At this time, the multilingual translation unit 207 retrieves the necessary module L20 and database D20 (translation module, Japanese character corpus, Japanese dictionary, English dictionary, Japanese-English bilingual dictionary, Japanese-English bilingual corpus, and the like) from the storage unit 213. It appropriately rearranges the "reading" (character string) of the input voice obtained as the recognition result and converts it into Japanese phrases, clauses, sentences, and the like, extracts the English corresponding to the conversion result, rearranges that English according to English grammar into natural English phrases, clauses, sentences, and the like, and selects the corresponding English conversation corpus entry from the storage unit 213. Meanwhile, as shown in FIG. 10B, the display unit 107 displays a standby screen including Japanese text T26 indicating that translation is in progress and a circular design 69 indicating that translation is in progress.

記憶部２１３は、入力音声の内容に対応付けられた翻訳結果（翻訳内容）をユーザごとに関連付けて翻訳履歴として記憶する（図８；ステップＳＳ３）。例えば、記憶部２１３は、翻訳後の英語の句、節、文等に対応する英語の会話コーパス等を入力音声の内容に対応付けて翻訳履歴として記憶する。 The storage unit 213 stores the translation result (translation content) associated with the content of the input speech for each user as a translation history (FIG. 8; step SS3). For example, the storage unit 213 stores an English conversation corpus or the like corresponding to the translated English phrase, clause, sentence, or the like as a translation history in association with the content of the input speech.
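
A per-user translation history of the kind stored in step SS3 can be sketched in Python as an in-memory mapping from a user identifier to a list of (input, translation) pairs. The class and method names (`TranslationHistory`, `HistoryEntry`, `record`, `for_user`) and the use of a string user identifier are hypothetical; the specification does not describe the storage layout.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List


@dataclass(frozen=True)
class HistoryEntry:
    source_text: str      # recognized content of the input voice
    translated_text: str  # translation result associated with that input


class TranslationHistory:
    """In-memory stand-in for the per-user history kept by the storage unit 213."""

    def __init__(self) -> None:
        self._entries: DefaultDict[str, List[HistoryEntry]] = defaultdict(list)

    def record(self, user_id: str, source_text: str, translated_text: str) -> None:
        # Append in arrival order so the history reads as a conversation log.
        self._entries[user_id].append(HistoryEntry(source_text, translated_text))

    def for_user(self, user_id: str) -> List[HistoryEntry]:
        # Return a copy so callers cannot modify the stored history.
        return list(self._entries.get(user_id, []))
```

Keeping the entries keyed by user is what later allows the operator terminal to show the interpreter only the history of the caller.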

次に、音声合成部２１１は、記憶部２１３から、音声合成に必要なモジュールＬ２０、データベースＤ２０、及びモデルＭ２０（音声合成モジュール、英語音声コーパス、音響モデル、言語モデル等）を呼び出し、翻訳結果である英語の句、節、文等に対応する英語の会話コーパスを自然な音声に変換する（図８；ステップＳＳ４）。 Next, the speech synthesis unit 211 retrieves the module L20, database D20, and model M20 required for speech synthesis (speech synthesis module, English speech corpus, acoustic model, language model, and the like) from the storage unit 213, and converts the English conversation corpus entry corresponding to the English phrases, clauses, sentences, and the like obtained as the translation result into natural-sounding speech (FIG. 8; step SS4).

これらの多言語翻訳処理及び音声合成処理が完了すると、情報処理部２０３は、翻訳結果（翻訳内容）である英語の会話コーパスに基づいてテキスト表示用のテキスト信号を生成し、また、合成された音声に基づいて音声出力用の音声信号を生成し、送受信部２０１及びネットワークＮを通して、情報端末１０へ送信する。 When the multilingual translation processing and speech synthesis processing are complete, the information processing unit 203 generates a text signal for text display based on the English conversation corpus entry that is the translation result (translation content), generates a voice signal for voice output based on the synthesized speech, and transmits both signals to the information terminal 10 through the transmission/reception unit 201 and the network N.
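
The server-side sequence of steps SS1, SS2, and SS4 (recognition, translation, synthesis) followed by the response to the information terminal can be sketched in Python as a function that chains three pluggable stages. The stage functions are passed in as callables because the specification does not name concrete engines; `run_pipeline`, `TranslationResponse`, and the parameter names are hypothetical, and history storage (step SS3) and score calculation are omitted from this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TranslationResponse:
    recognized_text: str  # text signal for the recognition result
    translated_text: str  # text signal for the translation result
    audio: bytes          # voice signal for the synthesized speech


def run_pipeline(audio_in: bytes,
                 recognize: Callable[[bytes], str],
                 translate: Callable[[str], str],
                 synthesize: Callable[[str], bytes]) -> TranslationResponse:
    """Chain recognition, translation, and synthesis, then bundle the results."""
    recognized = recognize(audio_in)    # step SS1: voice to text
    translated = translate(recognized)  # step SS2: source text to target text
    audio_out = synthesize(translated)  # step SS4: target text to speech
    return TranslationResponse(recognized, translated, audio_out)
```

For example, with toy stand-ins for the three stages:

```python
response = run_pipeline(
    b"raw-audio",
    recognize=lambda audio: "konnichiwa",
    translate=lambda text: "hello",
    synthesize=lambda text: text.encode("utf-8"),
)
```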

そして、図１０（Ｃ）に示すように、送受信部１０３を通して、それらのテキスト信号及び音声信号を受信した情報端末１０の第１表示処理制御部１１３は、テキストＴ２５、テキストＴ２５に対応する日本語の会話コーパスのテキストＴ２７（ここではテキストＴ２５と同じであるが、これに限定されない）、及びその翻訳結果である英語の会話コーパスのテキストＴ２８を会話画面として表示し、さらに、当該画面において通話開始ボタン７３（第１画像）を選択的に表示する処理を制御する（図８；ステップＳＪ５）。ここで、情報端末１０の記憶部１１７は、例えば、サーバ装置２０から受信した上記テキスト信号や音声信号を翻訳履歴として記憶してもよい。 Then, as shown in FIG. 10C, the first display processing control unit 113 of the information terminal 10, which has received the text signal and the voice signal through the transmission/reception unit 103, displays as a conversation screen the text T25, the text T27 of the Japanese conversation corpus entry corresponding to the text T25 (here the same as the text T25, although it need not be), and the text T28 of the English conversation corpus entry that is its translation result, and further controls processing for selectively displaying a call start button 73 (first image) on that screen (FIG. 8; step SJ5). Here, the storage unit 117 of the information terminal 10 may store, for example, the text signal and voice signal received from the server device 20 as a translation history.

また、ステップＳＪ５と同時に、音声入出力部１０１は、翻訳結果である英語のテキストＴ２８の内容（翻訳内容）を音声で出力する（読み上げる）（図８；ステップＳＪ６）。なお、当該ステップＳＪ６は、ステップＳＪ５の前、又は、後に実行されてもよい。 Simultaneously with step SJ5, the voice input / output unit 101 outputs (reads out) the content (translation content) of the English text T28 as a translation result (FIG. 8; step SJ6). Note that step SJ6 may be executed before or after step SJ5.

このとき、図１０（Ｃ）の如く、日本語のテキストＴ２５，Ｔ２７と英語のテキストＴ２８も、情報端末１０の表示部１０７の画面において、例えば異なる色の領域や線分によって区分けされ、且つ、互いに逆向き（互いに異なる向き；図示において上下逆向き）に表示される。これにより、ユーザと顧客が対面している状態で会話を行う場合、両者が表示部１０７の画面を視認できる状態であれば、ユーザが日本語のテキストＴ２５，Ｔ２７（入力された内容）を確認し易い一方、顧客は、英語のテキストＴ２８（翻訳された内容）を確認し易くなる。また、それらのテキストＴ２５，Ｔ２７とテキストＴ２８が区分けして表示されるので、両者を明別して更に視認し易くなる利点がある。 At this time, as shown in FIG. 10C, the Japanese texts T25 and T27 and the English text T28 are also divided on the screen of the display unit 107 of the information terminal 10, for example, by different color areas and line segments, and They are displayed in opposite directions (different directions; upside down in the figure). As a result, when the user and the customer are in a face-to-face conversation, the user confirms the Japanese texts T25 and T27 (input contents) if both can see the screen of the display unit 107. On the other hand, the customer can easily confirm the English text T28 (translated content). In addition, since the texts T25, T27 and the text T28 are displayed separately, there is an advantage that the texts T25, T27 and the text T28 are clearly distinguished from each other.

なお、図１０（Ｃ）の会話画面に表示される音声出力ボタン７０をタップすることにより、音声出力が繰り返される。また、この会話画面には、その時点での翻訳を終了する旨のチェックボタン７１が表示され、これをタップすることにより、翻訳処理を終了してホーム画面（図９（Ｂ））に戻ることができる。 Tapping the voice output button 70 displayed on the conversation screen of FIG. 10C repeats the voice output. The conversation screen also displays a check button 71 for ending the translation at that point; tapping it ends the translation processing and returns to the home screen (FIG. 9B).

次に、翻訳が精度よく行われることによって、顧客がユーザ（店員）の質問事項を理解することができた場合、今度は、顧客の音声の入力、認識、翻訳、及び音声合成といった音声処理が行われる（図８；ステップＳＪ７においてＮｏ）。この顧客の音声処理では、まず、図１０（Ｃ）に表示されているチェックボタン７１をタップしてホーム画面（図９（Ｂ））を表示する。次に、そのホーム画面において、英語入力ボタン６２ｂをタップして顧客による英語の音声入力を選択する。この後の処理は、発話者がユーザから顧客に代わり、日本語の音声入力が英語の音声入力に切り替わり、且つ、英語の音声及びテキスト出力が日本語による音声及びテキスト出力に代わること以外は、上述した処理と基本的に同等であるので、ここでの詳細な説明は省略する。そして、ユーザと顧客の会話が完了した場合、一連の音声翻訳処理を終了する。 Next, when the translation is accurate enough that the customer understands the question from the user (clerk), voice processing of the customer's speech, namely input, recognition, translation, and speech synthesis, is performed in turn (FIG. 8; No in step SJ7). In this voice processing for the customer, the check button 71 displayed in FIG. 10C is first tapped to display the home screen (FIG. 9B). Next, on the home screen, the English input button 62b is tapped to select English voice input by the customer. The subsequent processing is basically the same as the processing described above, except that the speaker changes from the user to the customer, Japanese voice input is replaced by English voice input, and English voice and text output is replaced by Japanese voice and text output, so a detailed description is omitted here. When the conversation between the user and the customer is complete, the series of speech translation processing ends.

他方、店員による日本語入力、又は、顧客による英語入力の内容がその言語の基本的な文型になっていないような場合や、発話した語順等が異なる場合には、誤訳が生じてしまう可能性が高まりやすい。そして、実際に誤訳が存在する等翻訳精度が高くないような場合は、店員及び顧客のコミュニケーションが円滑に行われないおそれがある。そこで、このような場合においては、店員及び顧客の少なくとも一方は、図８のステップＳＪ５において情報端末１０の表示部１０７にて表示される通話開始ボタン７３（第１画像）を選択する場合、通話処理制御部１１５は、通訳者と通話するためにオペレータ端末３０に通話処理開始リクエストを送信する（図８；ステップＳＪ７においてＹｅｓ）。 On the other hand, when the Japanese input by the clerk or the English input by the customer does not follow a basic sentence pattern of the language, or when the spoken word order differs, mistranslation becomes more likely. When translation accuracy is not high, for example when a mistranslation actually occurs, communication between the clerk and the customer may not proceed smoothly. In such a case, when at least one of the clerk and the customer selects the call start button 73 (first image) displayed on the display unit 107 of the information terminal 10 in step SJ5 of FIG. 8, the call processing control unit 115 transmits a call processing start request to the operator terminal 30 in order to call the interpreter (FIG. 8; Yes in step SJ7).

具体的に、店員及び顧客の少なくとも一方が、図８のステップＳＪ５において情報端末１０の表示部１０７において表示される通話開始ボタン７３を選択する場合、図１１（Ｂ）に示すように、表示部１０７の画面がグレーアウトされ、当該画面上に、通訳者と通話するか否かを確認するための画像７５が表示される。そして、店員及び顧客の少なくとも一方が、当該画像７５に表示される「はい」を選択する場合、図１１（Ｃ）に示すように、第１表示処理制御部１１３は表示部１０７の画面にテキストＴ２９を表示する処理を制御する。例えば、店員及び顧客の少なくとも一方が、当該画像７５に表示される「はい」を選択する場合、通話処理制御部１１５は、通訳者と通話するために通話処理開始リクエストを送信するように構成されてもよい。 Specifically, when at least one of the clerk and the customer selects the call start button 73 displayed on the display unit 107 of the information terminal 10 in step SJ5 of FIG. 8, the screen of the display unit 107 is grayed out as shown in FIG. 11B, and an image 75 for confirming whether to call the interpreter is displayed on the screen. When at least one of the clerk and the customer selects "Yes" in the image 75, the first display processing control unit 113 controls processing for displaying text T29 on the screen of the display unit 107, as shown in FIG. 11C. For example, the call processing control unit 115 may be configured to transmit the call processing start request for calling the interpreter when at least one of the clerk and the customer selects "Yes" in the image 75.

通話処理制御部１１５は、例えば、通話開始ボタン７３が選択された時に通話処理開始リクエストを生成してもよいし、通話開始ボタン７３が選択される前にあらかじめ通話処理開始リクエストを生成してもよい。通話処理開始リクエストは、例えば、情報端末１０の識別情報を含んで構成される。また、サーバ装置２０からの翻訳履歴を含んで生成される。情報端末１０の識別情報は、例えば、情報端末１０の使用者の属性、つまり、使用者の名称、住所、生年月日、年齢、所属、家族構成等や情報端末１０の電話番号や識別番号（ＩＤ）等を含む。また、情報端末１０を利用する店員又は顧客とオペレータ端末３０を使用する通訳者との通話は、一般的な電話回線網やＩＰ電話回線網等を含むネットワークＮを介して実行される。なお、通話手段に特に制限はなく、両者の通話が可能であればよい。 For example, the call processing control unit 115 may generate a call processing start request when the call start button 73 is selected, or may generate a call processing start request in advance before the call start button 73 is selected. Good. The call processing start request includes, for example, identification information of the information terminal 10. Further, it is generated including the translation history from the server device 20. The identification information of the information terminal 10 includes, for example, the attributes of the user of the information terminal 10, that is, the user's name, address, date of birth, age, affiliation, family structure, etc., and the telephone number or identification number of the information terminal 10 ( ID) and the like. In addition, a call between a store clerk or customer who uses the information terminal 10 and an interpreter who uses the operator terminal 30 is executed through a network N including a general telephone line network, an IP telephone line network, and the like. Note that there is no particular limitation on the calling means, and it is sufficient that both calls can be made.
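
The content of the call processing start request (terminal identification information plus the translation history) can be sketched in Python as a small serializable record. The field names, the choice of JSON as the wire format, and the class name `CallProcessingStartRequest` are assumptions; the specification lists the kinds of information included but not a concrete format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class CallProcessingStartRequest:
    terminal_id: str    # identification number (ID) of the information terminal 10
    phone_number: str   # telephone number of the information terminal 10
    user_name: str = "" # one example of a user attribute; others are omitted here
    # Each history item pairs an input utterance with its translation.
    translation_history: List[Dict[str, str]] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the request; ensure_ascii=False keeps Japanese text readable."""
        return json.dumps(asdict(self), ensure_ascii=False)
```

Bundling the history with the request means the interpreter can see what has already been said before answering, without a second round trip to the server.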

ここで、店員は、通話が可能な通訳者が複数人いる場合、より適切な通訳者と通話することを望む。例えば、情報端末１０の記憶部１１７には、各通訳者又は各通訳者が使用する端末の識別情報と、各通訳者が使用できる一以上の言語を示す言語情報とが関連付けて記憶されている。図８に示すステップＳＪ３において、ユーザは、図９（Ａ）に示す言語選択画面に表示されたテキストＴ２１を顧客に提示し、顧客に英語（Ｅｎｇｌｉｓｈ）のボタンをタップしてもらうことで、顧客の言語が選択される。そうするとオペレータ端末特定部１１６は、記憶部１１７が記憶する各通訳者が使用する端末の識別情報と、各通訳者が使用できる一以上の言語を示す言語情報とを参照することにより、選択された英語ボタンが示す言語つまり英語を使用できる通訳者が使用するオペレータ端末３０を特定する。そして、通話処理制御部１１５は、当該通訳者が使用するオペレータ端末に対して通話処理開始リクエストを送信することによって、両者の通話が開始される。このように、店員と顧客とのコミュニケーションにおいて用いられる言語に対応できる通訳者を適切に特定できる。 Here, when a plurality of interpreters are available for a call, the clerk will want to talk to the most suitable one. For example, the storage unit 117 of the information terminal 10 stores identification information of each interpreter, or of the terminal used by each interpreter, in association with language information indicating the one or more languages that the interpreter can use. In step SJ3 shown in FIG. 8, the user presents the text T21 displayed on the language selection screen shown in FIG. 9A to the customer and has the customer tap the English button, whereby the customer's language is selected. The operator terminal specifying unit 116 then refers to the terminal identification information and the language information stored in the storage unit 117, and specifies the operator terminal 30 used by an interpreter who can use the language indicated by the selected button, that is, English. The call processing control unit 115 then transmits a call processing start request to the operator terminal used by that interpreter, and the call between the two parties starts. In this way, an interpreter who can handle the language used in communication between the clerk and the customer can be identified appropriately.

Furthermore, when several interpreters can use English, the store clerk will presumably want to talk to the one who interprets better. For example, in addition to the identification information of the terminal each interpreter uses and the language information indicating the one or more languages each interpreter can use, the storage unit 117 of the information terminal 10 may store information indicating each interpreter's interpreting level or ability, in association with the identification information of that interpreter or of the terminal that interpreter uses. When the English button is selected in step SJ3 shown in FIG. 8, the operator terminal specifying unit 116 may then be configured to specify, from among the interpreters who can use English, the operator terminal used by the interpreter with the higher interpreting level or ability.
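The refinement described above, choosing the most capable interpreter among those who can use the selected language, can be sketched as follows. The numeric `level` field and the rule that a larger number means a higher level are assumptions for illustration; the specification does not define how level or ability is represented.

```python
def pick_best_interpreter(selected_language, interpreters):
    """Return the terminal of the highest-level interpreter for the language.

    `interpreters` is a list of dicts with the hypothetical keys
    "terminal_id", "languages" (a set), and "level" (larger is better).
    """
    candidates = [e for e in interpreters if selected_language in e["languages"]]
    if not candidates:
        return None
    # max() keeps the first entry on ties, so registration order breaks ties.
    return max(candidates, key=lambda e: e["level"])["terminal_id"]
```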

The operator terminal specifying unit 116 may specify the interpreter at the time the call start button 73 (first image), displayed on the display unit 107 of the information terminal 10 in step SJ5 of FIG. 8, is selected. Alternatively, the operator terminal specifying unit 116 may be configured to specify in advance, for each language used in communication between the store clerk and the customer, the operator terminal used by the interpreter to be called.

On the other hand, when at least one of the store clerk and the customer selects "No" displayed in the image 75, the display returns to the screen shown in FIG. 11(A).

Next, the transmission/reception unit 303 of the operator terminal 30 receives the call processing start request from the information terminal 10 (FIG. 8; step SO1) and transmits a response signal to the information terminal 10 (FIG. 8; step SO2). For example, when permitting a call between the information terminal 10 and the operator terminal 30, the call processing unit 311 generates a response signal indicating that the call is permitted. To decide whether to permit the call, the call processing unit 311 compares, for example, the identification information of the information terminal 10 included in the received call processing start request with the identification information of call-enabled information terminals stored in advance in the storage unit 315 or in another storage resource that can communicate with the operator terminal 30. When not permitting the call between the information terminal 10 and the operator terminal 30, the call processing unit 311 generates a response signal indicating that the call is not permitted.
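The permission check performed by the call processing unit 311 can be sketched as a simple membership test. The dictionary layout of the request and of the response signal is hypothetical; the specification only states that the received identification information is compared with identification information stored in advance.

```python
def build_response_signal(request, allowed_terminal_ids):
    """Decide whether to permit the call and build the response signal.

    `request` is a dict with a hypothetical "terminal_id" key taken from the
    call processing start request; `allowed_terminal_ids` stands in for the
    call-enabled terminals stored in advance (for example in storage unit 315).
    """
    terminal_id = request.get("terminal_id")
    permitted = terminal_id in allowed_terminal_ids
    return {
        "type": "call_response",
        "terminal_id": terminal_id,
        "permitted": permitted,
    }
```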

The second display processing control unit 313 causes the display unit 307 to display the translation history, transmitted from the server device 20 via the information terminal 10, in association with each user (FIG. 8; step SO3). For example, as shown in FIG. 12, the second display processing control unit 313 controls, on the screen of the display unit 307: a process of displaying an image 81 indicating that a call is in progress; a process of displaying an image 83 that includes a field for the name of the user on the call, a field for the telephone number of the information terminal that user uses, a field for the identification number of that information terminal, and a field for other attribute information such as the user's address; and a process of displaying the translation history of the input speech of the store clerk (user 1) and the customer (user X), associated with each user, as a translation history image 85.

Because the operator terminal 30 thus displays the speech translation history on the display unit 307 in association with each user, the interpreter can check that history and respond in a way that takes into account how the communication between the store clerk and the customer has gone so far.

In addition, as shown in FIG. 12, the operator terminal 30 displays the speech translation history on the display unit 307 in chronological order, so the interpreter can more easily grasp the flow of the communication between the store clerk and the customer so far and respond appropriately in light of it.
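The two display properties described above, association with each user and chronological ordering, can be sketched as a small grouping step. The entry keys (`user`, `timestamp`, `text`) are hypothetical; the specification does not define the format of a translation history entry.

```python
from collections import defaultdict


def group_history_by_user(entries):
    """Group translation history entries per user, oldest first within each user.

    Each entry is a dict with the hypothetical keys "user", "timestamp",
    and "text".
    """
    grouped = defaultdict(list)
    # Sorting once up front keeps every per-user list in chronological order.
    for entry in sorted(entries, key=lambda e: e["timestamp"]):
        grouped[entry["user"]].append(entry)
    return dict(grouped)
```

A display routine could then render one column or color per user from the grouped result, while the sorted order preserves the flow of the conversation.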

When the information terminal 10 receives the response signal from the operator terminal 30 (FIG. 8; step SJ8), the connection between the information terminal 10 and the operator terminal 30 is established, and a call between the store clerk or customer and the interpreter takes place (FIG. 8; steps SJ9 and SO4). When this call takes place, the processor 11 displays the text T30 on the screen of the display unit 107, as shown in FIG. 11(D).

(Second Embodiment)
In the first embodiment, the information terminal displays the call start button (first image) whenever it outputs a translation result. The second embodiment differs in that the information terminal compares a score relating to translation accuracy, calculated by the server device, with a predetermined threshold, and displays the call start button (first image) when the score is equal to or lower than that threshold. The second embodiment is described with reference to FIG. 13. The description focuses on the points that differ from the flowchart of FIG. 8 used for the first embodiment; points that are the same as in FIG. 8 are not described again.

FIG. 13 is a flowchart showing another example of (part of) the processing flow in the speech translation system. As shown in FIG. 13, the multilingual translation unit 207 of the server device 20 executes multilingual translation processing that translates the "reading" (characters) of the recognized speech into another language (FIG. 13; step SS12). The storage unit 213 stores, as a translation history associated with each user, the translation result (translated content) associated with the content of the input speech and the score relating to translation accuracy corresponding to that translation result (FIG. 13; step SS13).

Here, the translation processing uses, for example, statistical translation. It outputs the translation candidate that maximizes the product of the probabilities given by two models: a translation model, which is built by extracting correspondences between words and phrases of the two languages from bilingual data and includes, for example, a bilingual dictionary with probabilities and a word-order conversion table with probabilities; and a language model, which expresses how natural the translation is as language and includes probability-weighted Japanese word-chain data representing the naturalness of word sequences. Accordingly, the score calculation unit 209 is configured to calculate, for each translation result, a score relating to translation accuracy, expressed for example as a percentage.
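The selection rule described above can be written compactly. The notation below is an assumption introduced for illustration and does not appear in the specification: f denotes the recognized source sentence, e a candidate translation, P_TM the probability assigned by the translation model, and P_LM the probability assigned by the language model.

```latex
\hat{e} \;=\; \operatorname*{arg\,max}_{e} \; P_{\mathrm{TM}}(f \mid e)\, P_{\mathrm{LM}}(e)
```

The specification does not state how the percentage score is derived from these probabilities, so no formula for the score is given here.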

When the multilingual translation processing and the speech synthesis processing are complete, the speech synthesis unit 211 generates a text signal for text display based on the English conversation corpus that is the translation result (translated content), and generates an audio signal for audio output based on the synthesized speech. The generated text signal, the generated audio signal, and the translation accuracy are then transmitted to the information terminal 10 through the transmission/reception unit 201 and the network N.

Next, the score comparison unit 111 of the information terminal 10 compares the score relating to translation accuracy calculated by the server device 20 with a predetermined threshold (FIG. 13; step SJ15). A score higher than the threshold (FIG. 13; No in step SJ15) indicates that the translation accuracy is good, and the first display processing control unit 113 displays the translation result on the display unit 107 and outputs the synthesized speech (FIG. 13; step SJ16). For example, if the predetermined threshold is 80% and the score relating to the translation accuracy of the translation processing in the server device 20 is 90%, the translation accuracy is good. When the translation is accurate enough that the customer has understood the question asked by the user (store clerk), the process returns to step SJ13 shown in FIG. 13, and this time the speech processing (input, recognition, translation, and speech synthesis) is performed on the customer's speech.

On the other hand, a score relating to translation accuracy that is equal to or lower than the predetermined threshold (FIG. 13; Yes in step SJ15) indicates that the translation accuracy is poor, and the first display processing control unit 113 displays the translation result together with the call start button on the display unit 107 (FIG. 13; step SJ17).
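The branch at step SJ15 can be sketched as follows. The function name, the returned dictionary, and the default threshold of 80 (taken from the example value in the text) are assumptions for illustration. The comparison follows the stated rule: the call start button is shown when the score is equal to or lower than the threshold.

```python
def decide_display(score, threshold=80.0):
    """Decide what the information terminal shows for one translation result.

    `score` and `threshold` are percentages. The translation result is shown
    in both branches; the call start button is added only when the score is
    equal to or lower than the threshold (Yes in step SJ15).
    """
    show_call_button = score <= threshold
    return {"show_translation": True, "show_call_button": show_call_button}
```

Note that a score exactly equal to the threshold falls on the low-accuracy side, so the button is displayed in that case.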

According to the present invention, the first display processing control unit of the information terminal controls a process of selectively displaying the call start button when the translation result is displayed on the display unit, and the call processing control unit of the information terminal transmits, when the call start button is selected, a call processing start request for starting a call between the user and the interpreter. This reduces the burden on the user and improves convenience, while also preventing mistranslation and enabling smooth communication.

Furthermore, according to the present invention, the information terminal compares the score relating to the translation accuracy of the speech translation with a predetermined threshold and displays the call start button when the translation accuracy is low. Because the call start button is displayed only when the need for a call with the interpreter is higher, the call with the interpreter can be started more smoothly.

(Other Embodiments)
The embodiments described above are intended to facilitate understanding of the present invention and are not to be construed as limiting it. The present invention may be changed or improved without departing from its gist, and includes equivalents thereof. The present invention may also be implemented with various modifications (such as combining the embodiments) within a range that does not depart from its gist.

In the embodiments above, the speech recognition, translation, and speech synthesis processes are executed by the server device 20, but these processes may instead be executed in the information terminal 10. In that case, the module L20 used for those processes may be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server device 20. Likewise, the database D20, which is a speech database, and/or the model M20, such as an acoustic model, may be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server device 20. The speech translation system therefore need not include the network N and the server device 20. Conversely, although the embodiments above describe the process of judging translation accuracy as being executed by the information terminal 10, this process may be executed in the server device 20.

The step of displaying the translation history in association with each user in step SO3 shown in FIG. 8 may be executed simultaneously with step SO1, or after step SO1 and either simultaneously with or before step SO2. Similarly, the step of displaying the translation history in association with each user in step SO13 shown in FIG. 10 may be executed simultaneously with step SO11, or after step SO11 and either simultaneously with or before step SO12.

In the embodiments above, the operator terminal 30 obtains the translation history by receiving a call processing start request that contains it, but the invention is not limited to this. For example, the operator terminal 30 may be configured to receive the translation history directly from the server device 20, either before or after receiving the call processing start request.

A gateway server or the like that converts the communication protocol may of course be interposed between the information terminal 10 and the network N, or between the operator terminal 30 and the network N. The information terminal 10 is not limited to a portable device and may be, for example, a desktop, notebook, tablet, or laptop personal computer. Likewise, the operator terminal 30 is not limited to a stationary device and may be a portable tablet terminal or the like that has a function for communicating with the network N.

DESCRIPTION OF SYMBOLS
10 Information terminal
11, 21, 31 Processor
12, 23, 32 Storage resource
13, 33 Voice input/output device
14, 22, 34 Communication interface
15, 35 Input device
16, 36 Display device
17, 37 Camera
20 Server device
30 Operator terminal
100 Speech translation system
101, 301 Voice input/output unit
103, 201, 303 Transmission/reception unit
105, 305 Input operation reception unit
107, 307 Display unit
109, 203, 309 Information processing unit
111 Score comparison unit
113 First display processing control unit
115 Call processing control unit
117, 213, 315 Storage unit
205 Speech recognition unit
207 Multilingual translation unit
209 Score calculation unit
211 Speech synthesis unit
311 Call processing unit
313 Second display processing control unit
D20 Database
L20 Module
M20 Model
N Network
P10, P20, P30 Program

Claims

A speech translation system comprising:
an information terminal for inputting a user's voice;
a server device for translating the content of the voice input to the information terminal; and
an interpreter terminal that performs call processing with the information terminal,
wherein the server device includes:
a speech recognition unit that recognizes the content of the voice input to the information terminal;
a translation unit that translates the content recognized by the speech recognition unit into content in a different language; and
a score calculation unit that calculates a score relating to translation accuracy,
and wherein the information terminal includes:
a voice output unit that outputs, as voice, the content translated by the translation unit of the server device;
a first display processing control unit that controls a process of displaying first text of the content of the input voice in a first area on a screen and displaying second text of the translated content in a second area on the screen different from the first area, and that, when the score calculated by the score calculation unit is equal to or lower than a predetermined threshold, controls a process of selectively displaying on the screen, in addition to the first text and the second text, a first image for transmitting to the interpreter terminal a call processing start request for starting call processing with the interpreter terminal; and
a call processing control unit that transmits the call processing start request to the interpreter terminal when the first image is selected.

The server device further includes a storage unit that stores the translated content associated with the content of the input voice as a translation history in association with each user,
The interpreter terminal further includes a second display processing control unit that controls processing of displaying the translation history in association with each user.
The speech translation system according to claim 1.

The speech translation system according to claim 1 or 2, wherein
the first display processing control unit controls a process of further displaying two or more second images respectively indicating two or more languages, and
the call processing control unit, when the first image is selected after one of the second images has been selected, controls the call processing with the interpreter terminal associated with an interpreter capable of using the language indicated by the selected second image.

The second area is set below the first area on the screen.
The speech translation system according to any one of claims 1 to 3.

The first display processing control unit displays the first text and the second text in opposite directions on the screen;
The speech translation system according to any one of claims 1 to 4.

A speech translation method comprising:
outputting, as voice, the content of a user's input voice translated into a different language;
controlling a process of displaying first text of the content of the input voice in a first area on a screen and displaying second text of the translated content in a second area on the screen different from the first area, and, when a score relating to translation accuracy is equal to or lower than a predetermined threshold, controlling a process of selectively displaying on the screen, in addition to the first text and the second text, a first image for transmitting to an interpreter terminal a call processing start request for starting call processing with the interpreter terminal; and
transmitting the call processing start request to the interpreter terminal when the first image is selected.

A speech translation program causing a computer to function as:
a voice output unit that outputs, as voice, the content of a user's input voice translated into a different language;
a first display processing control unit that controls a process of displaying first text of the content of the input voice in a first area on a screen and displaying second text of the translated content in a second area on the screen different from the first area, and that, when a score relating to translation accuracy is equal to or lower than a predetermined threshold, controls a process of selectively displaying on the screen, in addition to the first text and the second text, a first image for transmitting to an interpreter terminal a call processing start request for starting call processing with the interpreter terminal; and
a call processing control unit that transmits the call processing start request to the interpreter terminal when the first image is selected.