JP6383748B2

JP6383748B2 - Speech translation device, speech translation method, and speech translation program

Info

Publication number: JP6383748B2
Application number: JP2016067703A
Authority: JP
Inventors: 優基井村
Original assignee: RECRUIT LIFESTYLE CO., LTD.
Current assignee: RECRUIT LIFESTYLE CO., LTD.
Priority date: 2016-03-30
Filing date: 2016-03-30
Publication date: 2018-08-29
Anticipated expiration: 2036-03-30
Also published as: JP2017182395A

Description

The present invention relates to a speech translation device, a speech translation method, and a speech translation program.

To enable conversation between people who do not understand each other's languages, for example between a user (such as a store employee) and an interlocutor (such as a foreign customer), speech translation technologies have been proposed that convert a speaker's utterance into text, machine-translate the text into the other party's language, and then either display it on a screen or play it back as audio using speech synthesis (see, for example, Patent Documents 1 and 2).

Patent Document 1: Japanese Patent Laid-Open No. 9-34895
Patent Document 2: Japanese Unexamined Patent Publication No. 2014-16475

As illustrated by the application operation screen shown in Patent Document 2, conventional speech translation applications and devices are configured so that the languages used by the user and the interlocutor are selected first, before the conversation. Such a language selection operation is necessary for a conversation between speakers of different languages, but it means that the user must somehow ask an interlocutor, with whom no common language is shared, to select a language. This places a psychological burden on the user and is an obstacle to smooth conversation.

The present invention has been made in view of these circumstances. Its object is to provide a speech translation device, a speech translation method, and a speech translation program that allow a conversation between a user and an interlocutor to start naturally and proceed smoothly without language selection at the start of the conversation.

To solve the above problem, a speech translation device according to one aspect of the present invention includes an input unit for inputting the speech of a user and/or an interlocutor, a translation unit that acquires translations of one input utterance (one phrase) in a plurality of different languages (other than the language of the input speech), and an output unit that outputs the translations in the plurality of different languages as text and/or speech.
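The role of the translation unit described above can be illustrated with a minimal Python sketch. It assumes a per-language translation function is supplied from outside; the names `translate_to_all`, `toy_translate`, and `PHRASEBOOK`, the language codes, and the sample translations are hypothetical and are not taken from the specification.

```python
from typing import Callable, Dict, Iterable

def translate_to_all(
    text: str,
    source_lang: str,
    target_langs: Iterable[str],
    translate_one: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Translate one phrase into every candidate language except the source language."""
    return {
        lang: translate_one(text, source_lang, lang)
        for lang in target_langs
        if lang != source_lang
    }

# Toy stand-in for a real translation engine (hypothetical phrasebook).
PHRASEBOOK = {
    ("御用はございませんか？", "en"): "May I help you?",
    ("御用はございませんか？", "fr"): "Puis-je vous aider ?",
}

def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    # Fall back to a tagged copy of the input when no entry exists.
    return PHRASEBOOK.get((text, target_lang), f"[{target_lang}] {text}")

translations = translate_to_all(
    "御用はございませんか？", "ja", ["ja", "en", "fr", "de"], toy_translate
)
```

Here one utterance yields one entry per target language, and the input language itself is skipped, which mirrors the "(other than the language of the input speech)" condition.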

Specifically, the output unit may display the translated texts in the plurality of different languages at one time. The display in this case is not limited to a single screen and may be divided across a plurality of screens. Furthermore, the speech output by the output unit need not cover all of the translations in the plurality of different languages; it may output the speech of at least one of them, for example one designated by the user or the interlocutor.

The device may further include a language selection unit that selects the language used by the interlocutor after the translations in the plurality of different languages have been output. It is also preferable that the device further include a storage unit that stores or acquires the selection count or selection frequency of each of the plurality of different languages, and that the output unit output translations in a predetermined number of the languages ranked highest by selection count or selection frequency. In this case, the output unit may output the translations in the predetermined number of languages in order of selection count or selection frequency. The "selection frequency" may be a value based on the number of times a language was selected by the language selection unit after translations in the plurality of different languages were output, a value based on the number of times it was selected in advance, or a value based on the total of both.
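One way to picture ranking by selection count is the following Python sketch. It assumes the counts are held in a simple in-memory counter; the function name `top_languages`, the language codes, and the sample counts are hypothetical, and tie-breaking by language code is an arbitrary choice made here for determinism.

```python
from collections import Counter
from typing import List, Mapping

def top_languages(selection_counts: Mapping[str, int], n: int) -> List[str]:
    """Return the n most frequently selected languages, most selected first."""
    ranked = sorted(selection_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [lang for lang, _ in ranked[:n]]

# Hypothetical stored counts; a new selection simply increments the counter.
counts = Counter({"en": 120, "zh": 95, "ko": 60, "fr": 12, "de": 7})
counts["fr"] += 1
ranked_top3 = top_languages(counts, 3)
```

The returned order can be used both to decide which translations to output and to decide the order in which they are listed.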

It is also preferable that the device further include an index acquisition unit that acquires a suitability index (an index representing the precision or certainty of a translation) between the content of the one input utterance and each of the translations in the plurality of different languages, and that the output unit output a predetermined number of the translations ranked highest by suitability index.
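Filtering by suitability index can be sketched in the same way. The specification does not say how the index is computed, so the scores below are invented placeholders, and the names `Candidate` and `select_by_suitability` are hypothetical.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    language: str
    text: str
    suitability: float  # assumed here: a higher value means a more reliable translation

def select_by_suitability(candidates: List[Candidate], n: int) -> List[Candidate]:
    """Keep the n translations with the highest suitability index."""
    return sorted(candidates, key=lambda c: c.suitability, reverse=True)[:n]

# Placeholder data; the scores are illustrative only.
candidates = [
    Candidate("en", "May I help you?", 0.93),
    Candidate("fr", "Puis-je vous aider ?", 0.88),
    Candidate("de", "Kann ich Ihnen helfen?", 0.71),
]
best_two = select_by_suitability(candidates, 2)
```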

A speech translation method according to one aspect of the present invention is a method that uses a speech translation device including an input unit, a translation unit, and an output unit. That is, the method includes a step in which the input unit inputs the speech of a user and/or an interlocutor, a step in which the translation unit acquires translations of one input utterance in a plurality of different languages, and a step in which the output unit outputs the translations in the plurality of different languages as text and/or speech.

A speech translation program according to one aspect of the present invention causes a computer (not limited to a single computer or a single type, and possibly a plurality of computers or a plurality of types; the same applies hereinafter) to function as an input unit for inputting the speech of a user and/or an interlocutor, a translation unit that acquires translations of one input utterance in a plurality of different languages, and an output unit that outputs the translations in the plurality of different languages as text and/or speech.

According to the present invention, in a conversation between a user and an interlocutor, translations in a plurality of different languages are acquired for one input utterance, for example one spoken by the user, and those translations are output as text and/or speech. As a result, even if the language used by the interlocutor is unknown, an opening for the conversation can be obtained and the content of the user's utterance can be conveyed to the interlocutor. The conversation can therefore start naturally without selecting the interlocutor's language beforehand, which reduces the psychological burden on the user. In addition, because the interlocutor's language becomes known from the translation that the interlocutor was able to understand, subsequent utterances no longer need to be translated into a plurality of different languages, and the rest of the conversation can proceed smoothly.

FIG. 1 is a system block diagram schematically showing a preferred embodiment of a network configuration and the like for a speech translation device according to the present invention.
FIG. 2 is a flowchart showing an example of (part of) the processing flow in a preferred embodiment of the speech translation device according to the present invention.
FIG. 3 is a flowchart showing an example of (part of) the processing flow in a preferred embodiment of the speech translation device according to the present invention.
FIG. 4(A) to (D) are plan views showing an example of display screen transitions on the information terminal.
FIG. 5(A) and (B) are plan views showing an example of display screen transitions on the information terminal.

Embodiments of the present invention are described in detail below. The following embodiments are examples for explaining the present invention and are not intended to limit the present invention to those embodiments alone. The present invention can be modified in various ways without departing from its gist. Furthermore, those skilled in the art can adopt embodiments in which the elements described below are replaced with equivalents, and such embodiments are also included in the scope of the present invention. Positional relationships such as up, down, left, and right, where indicated, are based on the drawings unless otherwise specified. The dimensional ratios in the drawings are not limited to those illustrated.

(Device configuration)
FIG. 1 is a system block diagram schematically showing a preferred embodiment of a network configuration and the like for a speech translation device according to the present invention. In this example, the speech translation device 100 includes a server 20 that is electronically connected, via a network N, to an information terminal 10 (user device) used by the user (although the configuration is not limited to this).

The information terminal 10 employs, for example, a user interface such as a touch panel and a highly visible display. The information terminal 10 here is a portable tablet-type terminal device, a category that includes mobile phones typified by smartphones, with a function for communicating with the network N. The information terminal 10 includes a processor 11, a storage resource 12, a voice input/output device 13, a communication interface 14, an input device 15, a display device 16, and a camera 17. When the installed speech translation application software (at least part of a speech translation program according to an embodiment of the present invention) runs, the information terminal 10 functions as part or all of a speech translation device according to an embodiment of the present invention.

The processor 11 is composed of an arithmetic logic unit and various registers (program counter, data registers, instruction register, general-purpose registers, and so on). The processor 11 interprets and executes the speech translation application software, which is the program P10 stored in the storage resource 12, and performs various kinds of processing. The speech translation application software serving as the program P10 can be distributed, for example, from the server 20 through the network N, and may be installed and updated manually or automatically.

The network N is, for example, a communication network that mixes wired networks (such as a local area network (LAN), a wide area network (WAN), or a value-added network (VAN)) and wireless networks (such as a mobile communication network, a satellite communication network, Bluetooth (registered trademark), WiFi (Wireless Fidelity), or HSDPA (High Speed Downlink Packet Access)).

The storage resource 12 is a logical device provided by the storage area of a physical device (for example, a computer-readable recording medium such as a semiconductor memory), and stores the operating system program, driver programs, various data, and the like used for processing in the information terminal 10. Examples of the driver programs include an input/output device driver program for controlling the voice input/output device 13, an input device driver program for controlling the input device 15, and a display device driver program for controlling the display device 16. The voice input/output device 13 is, for example, an ordinary microphone and a sound player capable of reproducing sound data.

The communication interface 14 provides, for example, a connection interface to the server 20 and is composed of a wireless communication interface and/or a wired communication interface. The input device 15 provides an interface that accepts input operations, for example taps on icons, buttons, a virtual keyboard, or text displayed on the display device 16. Examples include a touch panel as well as various input devices externally attached to the information terminal 10.

The display device 16 serves as an image display interface that presents various kinds of information to the user and the interlocutor (the other party in the conversation). Examples include an organic EL display, a liquid crystal display, and a CRT display. The camera 17 captures still images and moving images of various subjects.

The server 20 is constituted by, for example, a host computer with high processing capability, and provides server functions when a predetermined server program runs on that host computer. It is composed of one or more host computers functioning as, for example, a speech recognition server, a translation server, and a speech synthesis server (shown as a single unit in the figure, but not limited to this). Each server 20 includes a processor 21, a communication interface 22, and a storage resource 23.

The processor 21 is composed of an arithmetic logic unit that handles arithmetic operations, logical operations, bit operations, and the like, together with various registers (program counter, data registers, instruction register, general-purpose registers, and so on). It interprets and executes the program P20 stored in the storage resource 23 and outputs predetermined processing results. The communication interface 22 is a hardware module for connecting to the information terminal 10 via the network N, for example a modulation/demodulation device such as an ISDN modem, an ADSL modem, a cable modem, an optical modem, or a soft modem.

The storage resource 23 is, for example, a logical device provided by the storage area of a physical device (such as a computer-readable recording medium, for example a disk drive or a semiconductor memory). It stores one or more programs P20, various modules L20, various databases D20, and various models M20. The storage resource 23 also stores a plurality of question template sentences prepared in advance for the user to address the interlocutor, history data of input speech, data for various settings, the selection frequency (or selection count) of each language, and the like.

The program P20 is, for example, the server program mentioned above, which is the main program of the server 20. The various modules L20 are software modules (modularized subprograms) that are called and executed as needed during the operation of the program P10 in order to perform a series of information processing on the requests and information sent from the information terminal 10. Examples of the modules L20 include a speech recognition module, a translation module, and a speech synthesis module.

Examples of the various databases D20 include the corpora required for speech translation processing (for speech translation between Japanese and other languages: a Japanese speech corpus, a speech corpus for each other language, a Japanese character (vocabulary) corpus, a character (vocabulary) corpus for each other language, a Japanese dictionary, a dictionary for each other language, Japanese/other-language bilingual dictionaries, Japanese/other-language parallel corpora, and so on), a speech database, a management database for managing information about users, and a database of suitability indices (indices representing the precision or certainty of translation) between corpora of different languages. Examples of the various models M20 include the acoustic models and language models used for speech recognition.

(First embodiment)
An example (first embodiment) of the operations and behavior of speech translation processing in the speech translation device 100 configured as described above is explained further below. FIG. 2 is a flowchart showing an example of (part of) the processing flow in the speech translation device 100 of the first embodiment. FIG. 4(A) to (D) are plan views showing an example of display screen transitions on the information terminal. The description assumes a conversation in which the user of the information terminal 10 is a store clerk (store employee) who speaks Japanese and the interlocutor (the other party in the conversation) is a foreign customer who speaks French (although the languages and situation are not limited to this).

First, when the user (store clerk) launches the application (step SU1), a voice input standby screen for Japanese and English, the default languages, is displayed as the home screen on the display device 16 of the information terminal 10 (FIG. 4(A); step SJ1). This voice input standby screen displays Japanese text T1 asking which language, the user's or the interlocutor's, will be spoken, together with an input button 42a for Japanese voice input and an input button 42b for English voice input.

The voice input standby screen also displays a greeting button 43 for selecting a list display of a plurality of preset question template sentences, a language selection button 44 for manually selecting the interlocutor's language, a history button 45 for selecting a history display of the voice input made so far, a suggest button 46 for running a suggest function that lets the conversation proceed by selecting a desired template phrase from a group of template phrases (recommended phrases) prepared in advance, and a settings button 47 for configuring the application software.

Next, when the user taps the Japanese input button 42a on the voice input standby screen shown in FIG. 4(A) to select Japanese voice input, a voice input screen that accepts the user's utterance in Japanese appears (FIG. 4(B)). Once this voice input screen is displayed, voice input from the voice input/output device 13 is enabled. The voice input screen displays text T2 prompting the user to speak, a microphone design 48 indicating that voice input is active, and an input switching button 41 for switching to text input. A cancel button B1 is also displayed on this voice input screen; tapping it either ends the conversation or returns to the voice input standby screen (FIG. 4(A)) so that voice input can be redone.

In this state, when the user utters something to be conveyed to the interlocutor (for example, a phrase such as "Is there anything I can help you with?") (step SU2), a multiple-circle design 49 that schematically and dynamically represents the loudness of the voice is displayed together with the text T2, visually feeding the voice input level back to the user, who is the speaker. When the utterance ends and the user taps the microphone design 48, the processor 11 stops accepting the user's utterance. The processor 11 of the information terminal 10 generates an audio signal based on the voice input and transmits it to the server 20 through the communication interface 14 and the network N. In this way, the information terminal 10 itself, or the processor 11 and the voice input/output device 13, functions as the "input unit".

Next, the processor 21 of the server 20 receives the audio signal through the communication interface 22 and performs speech recognition processing. At this time, the processor 21 calls the necessary modules L20, databases D20, and models M20 (speech recognition module, Japanese speech corpus, acoustic model, language model, and so on) from the storage resource 23 and converts the "sound" of the input speech into its "reading" (characters). In this way, the processor 21, or the server 20 as a whole, functions as a "speech recognition server". The processor 21 also stores the recognized content in the storage resource 23 (in an appropriate database as needed) as voice input history data.

The processor 21 then transmits the recognition result of the input speech to the information terminal 10, and the processor 11 displays it on the screen as Japanese text (not shown). The recognition result may be displayed as is, or the entry corresponding to the content of the actual input speech may be retrieved from a Japanese conversation corpus stored in advance in the storage resource 23 and displayed.

Subsequently, the processor 21 moves on to multilingual translation processing, which translates the recognized "reading" (characters) of the speech into a plurality of other languages. At this time, the processor 21 calls the necessary modules L20 and databases D20 (translation module, Japanese character corpus, Japanese dictionary, dictionaries for each other language, Japanese/other-language bilingual dictionaries, Japanese/other-language parallel corpora, and so on) from the storage resource 23. It appropriately rearranges the "reading" (character string) of the input speech obtained as the recognition result and converts it into Japanese phrases, clauses, sentences, and the like; extracts the corresponding expressions in each other language; and rearranges them according to the grammar of each language to produce natural phrases, clauses, sentences, and the like in that language. In this way, the processor 21 also functions as the "translation unit" that acquires translations of one input utterance in a plurality of different languages, and the server 20 as a whole also functions as a "translation server". If the input speech was not recognized correctly, the speech can be input again (not shown). The processor 21 can also store these Japanese and English phrases, clauses, sentences, and the like in the storage resource 23.

The processor 21 then generates output signals for the translations in the plurality of different languages obtained by the multilingual translation processing (for example, translations of "Is there anything I can help you with?") and transmits them to the information terminal 10. Based on these output signals, the processor 11 of the information terminal 10 displays on the display device 16 list screens of translated texts T3 and T4, as shown for example in FIG. 4(C) and (D). The list screens of translated texts T3 and T4 are switched, for example, by swiping the screen of the display device 16 left or right with a finger. In this way, the processors 11 and 21 and the display device 16 function as the "output unit", and the translated texts T3 and T4 in the plurality of different languages are displayed on the display device 16 at one time.

The list screen of translated text T3 shown in FIG. 4(C) displays translations in English and Asian languages (Chinese, Korean, Vietnamese, Tagalog, and so on). The list screen of translated text T4 shown in FIG. 4(D) displays translations in English and other Western languages (Italian, Spanish, German, French, and so on). Immediately below each of the translated texts T3 and T4, buttons B11 to B15 and B21 to B25 are displayed for tapping an answer in the respective language (corresponding to "yes" and "no").
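The two list screens described above can be thought of as pages built from language groups, with English repeated on each page. The following Python sketch shows one possible arrangement; the function name `build_pages`, the language codes, and the group definitions are assumptions made for illustration, and the specification does not prescribe this data layout.

```python
from typing import Dict, List, Sequence, Tuple

def build_pages(
    translations: Dict[str, str],
    groups: Sequence[Sequence[str]],
    pinned: str = "en",
) -> List[List[Tuple[str, str]]]:
    """Build one list screen per language group, with the pinned language first on each."""
    pages = []
    for group in groups:
        order = [pinned] + [lang for lang in group if lang != pinned]
        # Skip languages for which no translation is available.
        pages.append([(lang, translations[lang]) for lang in order if lang in translations])
    return pages

ASIAN = ["zh", "ko", "vi", "tl"]    # hypothetical codes for the FIG. 4(C) screen
WESTERN = ["it", "es", "de", "fr"]  # hypothetical codes for the FIG. 4(D) screen

# Placeholder texts stand in for real translations.
sample = {lang: f"<{lang} translation>" for lang in ["en"] + ASIAN + WESTERN}
pages = build_pages(sample, [ASIAN, WESTERN])
```

Each entry of a page would be rendered as a translated text with its yes/no answer buttons directly below it.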

Next, the interlocutor, who is shown the list screens of translated texts T3 and T4 by the user, can respond if one of the translations is in a language the interlocutor uses or understands, by tapping the button displayed below that translation (if the interlocutor is French, the button B25 displayed below the French translation) (step SU3).

When the interlocutor, in response to the user's question, taps the "Oui" (yes) part of the button B25, the processor 11 of the information terminal 10 transmits the selection signal to the server 20. On receiving the selection signal, the processor 21 determines that the interlocutor's language is French and selects French as the interlocutor's language for the conversation (step SJ3).
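The effect of this selection on later turns can be sketched as a small piece of conversation state. This is a minimal illustration only; the class `Conversation`, its method names, and the language codes are hypothetical, and incrementing a selection counter at this point is one assumed way of feeding the selection counts mentioned earlier.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Conversation:
    user_lang: str = "ja"
    interlocutor_lang: Optional[str] = None  # unknown until an answer button is tapped
    selection_counts: Counter = field(default_factory=Counter)

    def on_answer_tapped(self, lang: str) -> None:
        """Treat a tap on an answer button as selecting that button's language."""
        self.interlocutor_lang = lang
        self.selection_counts[lang] += 1

    def target_languages(self, candidates: List[str]) -> List[str]:
        """All candidate languages before selection; only the selected one afterwards."""
        if self.interlocutor_lang is None:
            return [lang for lang in candidates if lang != self.user_lang]
        return [self.interlocutor_lang]

conv = Conversation()
before = conv.target_languages(["ja", "en", "fr", "de"])
conv.on_answer_tapped("fr")  # e.g. the French "Oui" button
after = conv.target_languages(["ja", "en", "fr", "de"])
```

Before the tap, the user's utterance fans out to every candidate language; after it, translation is needed only between the user's language and the selected one.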

Next, the processor 21 displays the voice input standby screen on the display device 16 again (step SJ1 again). This voice input standby screen has the same configuration as the Japanese voice input standby screen shown in FIG. 4(A), except that an input button for French voice input is displayed in place of the English input button 42b (not shown). When, for example, the user taps the French input button on that standby screen to select French voice input, a voice input screen that accepts the interlocutor's utterance in French appears. This French voice input screen has the same configuration as the Japanese voice input screen shown in FIG. 4(B), except that the Japanese text is replaced with French text (not shown).

In this state, the interlocutor utters what he or she wishes to convey to the user (step SU2 again). When the interlocutor then taps the microphone design 48, the processor 11 stops accepting the utterance, and the processor 21 executes speech recognition on the input speech and translation of the French content into Japanese.

The processor 21 then proceeds to speech synthesis. At this point, the processor 21 calls the necessary module L20, database D20, and model M20 (speech synthesis module, English speech corpus, acoustic model, language model, etc.) from the storage resource 23 and converts the Japanese phrases, clauses, sentences, and the like that make up the translation result into natural speech. The processor 21 thus also functions as a "speech synthesis unit", and the server 20 as a whole also functions as a "speech synthesis server".

The processor 21 then generates an audio signal for voice output based on the synthesized speech and transmits it to the information terminal 10 through the communication interface 22 and the network N. The processor 11 of the information terminal 10 receives the audio signal through the communication interface 14 and outputs the Japanese speech using the voice input/output device 13 (the processing up to this point constitutes step SJ4). The processors 11 and 21 and the voice input/output device 13 thus also function as an "output unit". Before the voice output, the speech recognition result for the interlocutor's utterance and its translation may first be displayed on the information terminal 10, with the voice output performed after the interlocutor has confirmed them (not shown).

Thereafter, the conversation between the user and the interlocutor can proceed by repeating steps SJ1, SU2, and SJ4 as needed. Once the conversation is over, the user can close the application as appropriate (step SU4).

(Second Embodiment)
Next, another example (second embodiment) of the operation and behavior of the speech translation processing in the speech translation apparatus 100 will be described. FIG. 3 is a flowchart showing an example of (part of) the processing flow in the speech translation apparatus 100 of the second embodiment. FIGS. 5(A) and 5(B) are plan views showing an example of display screen transitions on the information terminal. The second embodiment executes the same processing as the first embodiment, except that in step SJ2 the parallel translations are also output as speech in addition to the list display of the parallel translation texts T3 and T4 in the plural languages.

For this voice output, the user may, for example, designate a parallel translation in at least one language that the user has guessed from the interlocutor's appearance, speech, or behavior, so that only the designated translation is output as speech. Alternatively, some or all of the parallel translations obtained in the plural languages may be output as speech automatically. Ways of designating the translations to be spoken include tapping the desired translation among the parallel translation texts T3 and T4 displayed in FIGS. 4(C) and 4(D), and, as shown in FIGS. 5(A) and 5(B), providing a selection check box 51 for each translation and designating (checking) it by tapping. Once the translations to be spoken have been designated, the information terminal 10 and the server 20 synthesize speech for their content, and the voice input/output device 13 outputs the translations in each language as speech, one after another (step SJ2).
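The two output modes described above (speaking only the checked translations, or speaking all of them automatically) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the translation texts are placeholders, and `synthesize` and `play` stand in for the speech synthesis and the voice input/output device 13, whose actual interfaces the description does not specify.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def translations_to_speak(
    translations: Dict[str, str],
    checked: Optional[Iterable[str]] = None,
) -> List[Tuple[str, str]]:
    """Return (language, text) pairs to speak, in on-screen order.

    checked=None models automatic output of every translation;
    otherwise only the languages whose check box is ticked are kept.
    """
    if checked is None:
        return list(translations.items())
    wanted = set(checked)
    return [(lang, text) for lang, text in translations.items() if lang in wanted]


def play_sequentially(
    items: List[Tuple[str, str]],
    synthesize: Callable[[str, str], bytes],  # hypothetical synthesis hook
    play: Callable[[bytes], None],            # hypothetical audio output hook
) -> None:
    """Synthesize and play each translation one after another."""
    for lang, text in items:
        play(synthesize(lang, text))


# Placeholder texts; real translations would come from the translation step.
translations = {"en": "<English text>", "fr": "<French text>", "de": "<German text>"}
selected = translations_to_speak(translations, checked=["fr", "en"])
```

Keeping the on-screen order for playback means the spoken sequence matches what the interlocutor sees in the list.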

(Third Embodiment)
Next, another example (third embodiment) of the operation and behavior of the speech translation processing in the speech translation apparatus 100 will be described. The third embodiment executes the same processing as the first or second embodiment, except that in step SJ2 it outputs parallel translations in the languages that are selected (used) relatively often, by count or by frequency, in conversations between users and interlocutors at the user's store or at stores in the same line of business. In other words, it outputs parallel translations in a predetermined number of top-ranked languages whose number of selections or selection frequency is relatively high among the plural languages.

In this case, for conversations held with the speech translation application at the user's store or at stores in the same line of business, the processor 21 of the server 20 stores the number of times or the frequency with which each language was selected in step SJ3 in a suitable database in the storage resource 23. Alternatively, the processor 21 acquires the number of selections or the selection frequency of each language from the information terminals 10 of a plurality of users via the network N at suitable times and stores them in a suitable database in the storage resource 23. The processor 21 and the storage resource 23 thus function as a "storage unit".

Then, in step SJ2, the processor 21 selects, from the parallel translations in the plural languages, a predetermined number of translations in languages whose number of selections or selection frequency is relatively high, and displays the selected translations, for example, as in the list screens of the parallel translation texts T3 and T4 shown in FIGS. 4(C) and 4(D). In addition to or instead of this, the processor 21 may display the translations in languages with a relatively high number of selections or selection frequency at a higher position (for example, nearer the top of the screen) in the list screens of the parallel translation texts T3 and T4 shown in FIGS. 4(C) and 4(D).
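The selection and ordering by selection count in the third embodiment can be sketched as below. This is a minimal sketch, not the patented implementation: the function name, the language codes, the counts, and the alphabetical tie-break are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def top_translations_by_selection(
    translations: Dict[str, str],
    selection_counts: Dict[str, int],
    top_n: int,
) -> List[Tuple[str, str]]:
    """Keep the top_n translations, most frequently selected language first.

    Languages with no recorded selections count as 0; ties are broken
    alphabetically by language code (an arbitrary, illustrative choice).
    """
    ordered = sorted(
        translations,
        key=lambda lang: (-selection_counts.get(lang, 0), lang),
    )
    return [(lang, translations[lang]) for lang in ordered[:top_n]]


# Illustrative counts only; real values would be accumulated in step SJ3.
counts = {"en": 120, "zh": 300, "fr": 45, "ko": 80}
texts = {lang: f"<{lang} text>" for lang in ("en", "fr", "ko", "zh", "de")}
shown = top_translations_by_selection(texts, counts, top_n=3)
```

The same sorted order serves both variants in the description: truncating to `top_n` gives the "predetermined number" display, and passing `top_n=len(translations)` gives the "higher position for frequent languages" display.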

(Fourth Embodiment)
Furthermore, another example (fourth embodiment) of the operation and behavior of the speech translation processing in the speech translation apparatus 100 will be described. The fourth embodiment executes the same processing as the first or second embodiment, except that in step SJ2 it outputs, from the parallel translations in the plural languages, those whose translation accuracy or certainty is relatively high. In other words, it outputs a predetermined number of top-ranked translations whose compatibility index between the content of the input speech and the translation is relatively high.

In this case, the processor 21, for example, acquires in advance a compatibility index for each of the parallel phrases recorded in the corpus of each language from web pages or commercial databases connected to the network N, or evaluates or collects the indices in advance through, for example, an accuracy evaluation that uses crowdsourcing. The processor 21 stores each acquired compatibility index in the storage resource 23 in association with the corresponding parallel phrase. The processor 21 and the storage resource 23 thus also function as an "index acquisition unit". One example of a compatibility evaluation method, given without limitation, is a five-level rating based on the adequacy evaluation of Walker et al., which is used for comparing meaning between two languages (Walker, K., et al.: Multiple-Translation Arabic (MTA) Part 1, Linguistic Data Consortium, Philadelphia (2003)).

Then, in step SJ2, the processor 21 selects, from the parallel translations in the plural languages, a predetermined number of translations in languages whose compatibility index is relatively high, and displays the selected translations, for example, as in the list screens of the parallel translation texts T3 and T4 shown in FIGS. 4(C) and 4(D). In addition to or instead of this, the processor 21 may display the translations in languages with a relatively high compatibility index at a higher position (for example, nearer the top of the screen) in the list screens of the parallel translation texts T3 and T4 shown in FIGS. 4(C) and 4(D).
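The fourth embodiment differs from the third in that the ranking key is stored per parallel phrase rather than per language. A minimal sketch of that lookup follows, assuming a five-level score keyed by a (phrase identifier, language) pair. The store layout, the function name, the phrase identifier, and the scores are hypothetical; the description does not specify how the indices are keyed.

```python
from typing import Dict, List, Tuple

# (phrase_id, language) -> five-level compatibility score (1 = poor, 5 = good).
IndexStore = Dict[Tuple[str, str], int]


def top_translations_by_compatibility(
    phrase_id: str,
    translations: Dict[str, str],
    index_store: IndexStore,
    top_n: int,
) -> List[Tuple[str, str]]:
    """Keep the top_n translations of one phrase, best-scored first.

    A translation with no stored score is treated as 0 so it sorts last.
    """
    scored = [
        (index_store.get((phrase_id, lang), 0), lang, text)
        for lang, text in translations.items()
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(lang, text) for _, lang, text in scored[:top_n]]


# Illustrative scores only.
store: IndexStore = {("greeting", "en"): 5, ("greeting", "zh"): 3, ("greeting", "fr"): 4}
texts = {"zh": "<zh text>", "en": "<en text>", "fr": "<fr text>", "ko": "<ko text>"}
shown = top_translations_by_compatibility("greeting", texts, store, top_n=2)
```

Because the score depends on the phrase, the languages shown can change from one utterance to the next, which is the behavior the fourth embodiment relies on.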

With the speech translation apparatus 100 configured as described above, and with the speech translation method and speech translation program that use it, parallel translations in a plurality of different languages are acquired for a single input utterance spoken by the user when a conversation between the user and the interlocutor begins. The resulting parallel translation texts T3 and T4 are displayed, for example, as in FIGS. 4(C) and 4(D) and shown to the interlocutor, and/or the translations are output as speech.

As a result, even when the interlocutor's language is unknown, the user gains an opening for conversation and can convey the content of the utterance to the interlocutor correctly. The conversation can therefore start naturally without the interlocutor's language being selected beforehand, which reduces the psychological burden on the user. In addition, the interlocutor's language (French in the example of the embodiments) becomes known from the translation that the interlocutor was able to understand and is selected as the language to use. Subsequent utterances therefore need not be translated into a plurality of different languages, and the rest of the conversation can proceed smoothly.

In addition, displaying the parallel translation texts T3 and T4 in the plural languages at one time, as in FIGS. 4(C) and 4(D), improves visibility and makes it easier for the interlocutor to grasp the content of the user's utterance. Displaying the translations in Asian languages and in Western languages on separate screens, as in FIGS. 4(C) and 4(D), further improves visibility and convenience. Such a display is particularly useful when the user can narrow down the interlocutor's language to some extent from the interlocutor's appearance, speech, or behavior.

Furthermore, by outputting parallel translations in a predetermined number of top-ranked languages whose number of selections or selection frequency is relatively high among the plural languages, and by outputting those translations in order of the number of selections or the selection frequency, the interlocutor's language can be estimated more reliably and the corresponding translation shown to the interlocutor.

Furthermore, outputting a predetermined number of top-ranked parallel translations whose compatibility index (translation accuracy or certainty) is relatively high among the translations in the plural languages raises the likelihood that an interlocutor who uses or understands several languages will understand the user's utterance in one language even if not in another. More specifically, consider an interlocutor who speaks Chinese and can understand English. If, for a given utterance by the user, the compatibility index of the English translation is higher than that of the Chinese translation, the interlocutor may be unable to understand the utterance from the Chinese translation yet able to understand it from the English translation (that is, a case where the Chinese does not get through but the English does).

As noted above, each of the embodiments described is an example for explaining the present invention and is not intended to limit the present invention to that embodiment. The present invention can also be modified in various ways without departing from its gist. For example, a person skilled in the art can replace the resources (hardware resources or software resources) described in the embodiments with equivalents, and such replacements also fall within the scope of the present invention.

In step SJ2, the list display of the parallel translation texts T3 and T4 in the plural languages may be replaced by voice output of the translations alone. In that case, the translations may be read aloud one after another. Furthermore, when the interlocutor is asked to speak rather than to give a simple answer, an utterance button may be displayed near the translation in each language in place of the buttons B11 to B15 and B21 to B25 for entering an answer by tapping, which are shown in FIGS. 4(C) and 4(D) and FIGS. 5(A) and 5(B). Specifically, for English, for example, a tappable button (icon) showing text such as "Please talk.", meaning "please speak", is displayed below the English translation. For German, for example, a tappable button (icon) showing text such as "Bitte sprechen.", meaning "please speak", is displayed below the German translation.

Although an example in which the server 20 executes the speech recognition, translation, speech synthesis, and other processing has been described, the information terminal 10 may be configured to execute this processing instead. In that case, the module L20 used for the processing may be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server 20. Likewise, the database D20, which is a speech database, and/or the model M20, such as an acoustic model, may be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server 20. The speech translation apparatus thus need not include the network N and the server 20.

A gateway server or the like that converts the communication protocol between the information terminal 10 and the network N may of course be interposed between them. The information terminal 10 is not limited to a portable device and may be, for example, a desktop personal computer, a notebook personal computer, a tablet personal computer, or a laptop personal computer.

According to the present invention, a conversation between a user and an interlocutor can be started naturally and carried on smoothly without selecting a language at the start of the conversation. The invention can therefore be used widely in activities such as the design, manufacture, provision, and sale of programs, apparatuses, systems, and methods in fields that provide services for conversation between people who cannot understand each other's language.

DESCRIPTION OF SYMBOLS
10 Information terminal
11 Processor
12 Storage resource
13 Voice input/output device
14 Communication interface
15 Input device
16 Display device
17 Camera
20 Server
21 Processor
22 Communication interface
23 Storage resource
41 Input switch button
42a, 42b Input button
43 Voice button
44 Language selection button
45 History button
46 Suggest button
47 Setting button
48 Microphone design
49 Multiple circular design
51 Check box
100 Speech translation apparatus
B1 Cancel button
B11 to B15, B21 to B25 Answer buttons
D20 Database
L20 Module
M20 Model
N Network
P10, P20 Program
T1, T2 Text
T3, T4 Parallel translation text

Claims

An input unit for inputting voices of a user and a dialogue person;
A translation unit for acquiring parallel translations in a plurality of different languages for one input voice by the user;
An output unit that outputs the parallel translations in the plurality of different languages in text or text and voice;
A language selection unit that selects a language used by the dialogue person after the parallel translations in the plurality of different languages are output;
With
When the input voice is a question item, the output unit displays the texts of the parallel translations of the question item in the plurality of different languages at one time on the screen and displays, for the text of each of the parallel translations, a button indicating an answer to the question item in the language of that parallel translation,
The language selection unit selects a language corresponding to an answer of the tapped button as a language used by the conversation person when any of the buttons is tapped.
Speech translation device.

The output unit displays on the screen the texts of the parallel translations of the question item in the plurality of different languages and the buttons indicating the answer to the question item in the language of each of the parallel translations, without displaying the language type itself on the screen,
The speech translation apparatus according to claim 1.

A storage unit for storing or acquiring the number of selections or the selection frequency of each of the plurality of different languages;
The output unit outputs parallel translations in a predetermined number of languages having a relatively high number of selections or selection frequency among the plurality of different languages;
The speech translation apparatus according to claim 1 or 2.

The output unit outputs the parallel translations in the predetermined number of languages in the order of the number of selections or the selection frequency.
The speech translation apparatus according to claim 3.

An index acquisition unit for acquiring a compatibility index between the content of the one input voice and each of the parallel translations in the plurality of different languages;
The output unit outputs a predetermined number of parallel translations having a relatively high compatibility index among the translations in the plurality of different languages.
The speech translation apparatus according to claim 1.

Using a speech translation device comprising an input unit, a translation unit, an output unit, and a language selection unit,
The input unit inputting voices of a user and a dialogue person;
The translation unit acquiring parallel translations in a plurality of different languages for one input voice by the user;
The output unit outputting the parallel translations in the plurality of different languages as text or as text and voice;
The language selection unit selecting a language used by the dialogue person after the parallel translations in the plurality of different languages are output;
Including
When the input voice is a question item, the output unit displays the texts of the parallel translations of the question item in the plurality of different languages at one time on the screen and displays, for the text of each of the parallel translations, a button indicating an answer to the question item in the language of that parallel translation,
The language selection unit selects a language corresponding to an answer of the tapped button as a language used by the conversation person when any of the buttons is tapped.
Speech translation method.

Causing a computer to function as:
An input unit for inputting voices of a user and a dialogue person;
A translation unit for acquiring parallel translations in a plurality of different languages for one input voice by the user;
An output unit that outputs the parallel translations in the plurality of different languages in text or text and voice;
A language selection unit that selects a language used by the dialogue person after the parallel translations in the plurality of different languages are output;
When the input voice is a question item, the output unit displays the texts of the parallel translations of the question item in the plurality of different languages at one time on the screen and displays, for the text of each of the parallel translations, a button indicating an answer to the question item in the language of that parallel translation,
The language selection unit selects a language corresponding to an answer of the tapped button as a language used by the conversation person when any of the buttons is tapped.
Speech translation program.