JP6250209B1 - Speech translation device, speech translation method, and speech translation program

Info

Publication number: JP6250209B1
Application number: JP2017061327A
Authority: JP
Inventors: 翔大渡辺
Original assignee: RECRUIT LIFESTYLE CO., LTD.
Current assignee: RECRUIT LIFESTYLE CO., LTD.
Priority date: 2017-03-27
Filing date: 2017-03-27
Publication date: 2017-12-20
Anticipated expiration: 2037-03-27
Also published as: JP2018163581A

Abstract

[Problem] To enable a questioner to ask an answerer an appropriate question in a conversation that uses speech translation, thereby achieving smooth communication between the two. [Solution] A speech translation device according to one aspect of the present invention includes: an input unit for inputting the speech of a user engaged in a conversation; a scene selection means presentation unit that presents scene selection means with which the user selects the scene of the conversation from among a plurality of scenes; a storage unit that stores in advance a plurality of fixed-form sentences anticipated for a specific phrase, in association with that specific phrase and each scene; and a fixed-form sentence presentation unit that recognizes the input speech and, when the recognized content is a question and contains a specific phrase, presents the fixed-form sentences stored in association with that specific phrase and the selected scene so that the user can select among them. [Selected drawing] FIG. 5

Description

The present invention relates to a speech translation device, a speech translation method, and a speech translation program.

To enable conversation between people who do not understand each other's languages, for example between a store clerk and a foreign customer, speech translation techniques have been proposed that convert a speaker's utterance into text, machine-translate that text into the listener's language, and then either display it on a screen or play it back as audio using speech synthesis (see, for example, Patent Document 1). Speech translation applications that embody such techniques and run on information terminals such as smartphones have also been put into practical use (see, for example, Non-Patent Document 1).

Patent Document 1: Japanese Unexamined Patent Application Publication No. H9-34895

Non-Patent Document 1: U-STAR website [retrieved March 8, 2017], Internet <URL: http://www.ustar-consortium.com/qws/slot/u50227/app/app.html>

In general, such conventional speech translation techniques perform recognition processing on the uttered speech to obtain its reading (characters), and then translate those characters into another language using a dictionary. Acoustic models and language models built in advance are applied to the speech recognition processing, and databases such as corpora prepared in advance for each language are used for the translation processing. However, when a speaker (questioner) such as a store clerk makes an inquiry of a foreign customer (answerer), the questioner's way of asking may be inappropriate or insufficient. In that case the answerer may not grasp what the questioner really means, and may respond with a question asking for clarification or give an answer that misses the point, so that smooth communication cannot be achieved.

The present invention has been made in view of these circumstances. Its object is to provide a speech translation device, a speech translation method, and a speech translation program with which, in a conversation that uses speech translation, a questioner can ask an answerer an appropriate question, thereby enabling smooth communication between the two.

To solve the above problem, a speech translation device according to one aspect of the present invention includes: an input unit for inputting the speech of a user engaged in a conversation; a scene selection means presentation unit that presents scene selection means with which the user selects the scene (setting or situation) of the conversation from among a plurality of scenes; a storage unit that stores in advance a plurality of fixed-form sentences anticipated for a specific phrase, in association with that specific phrase and each scene; a fixed-form sentence presentation unit that recognizes the input speech and, when the recognized content is a question and contains a specific phrase, presents the fixed-form sentences stored in association with that specific phrase and the selected scene so that the user can select among them; a translation unit that translates the content of the selected fixed-form sentence into content in a different language; and an output unit that outputs the content translated into the different language as speech and/or text. A "specific phrase" includes sentences, clauses, phrases, words, and numerals, and a "fixed-form sentence" may include images, symbols, or the like accompanying the sentence.
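The association between specific phrases, scenes, and fixed-form sentences described above can be sketched as a small lookup structure. This is only an illustrative sketch, not the patent's implementation: the class name `FixedSentenceStore`, the helper `looks_like_question`, the sample phrases and scenes, and the question-mark heuristic are all assumptions introduced here, since the text does not specify how a question is detected or how the data is laid out.

```python
from dataclasses import dataclass, field


@dataclass
class FixedSentenceStore:
    """Storage unit sketch: (specific phrase, scene) -> fixed-form sentences."""
    entries: dict = field(default_factory=dict)

    def register(self, phrase: str, scene: str, sentences: list) -> None:
        # Store the sentences in advance, keyed by phrase and scene.
        self.entries.setdefault((phrase, scene), []).extend(sentences)

    def candidates(self, recognized: str, scene: str) -> list:
        # Present candidates only when the recognized utterance is a question.
        if not looks_like_question(recognized):
            return []
        text = recognized.lower()
        found = []
        for (phrase, stored_scene), sentences in self.entries.items():
            # Match on the selected scene and on the phrase being contained.
            if stored_scene == scene and phrase in text:
                found.extend(sentences)
        return found


def looks_like_question(text: str) -> bool:
    # Placeholder heuristic (assumption): a trailing question mark.
    return text.strip().endswith(("?", "？"))


# Hypothetical sample data for two scenes sharing the same phrase.
store = FixedSentenceStore()
store.register("bag", "checkout", ["Do you need a shopping bag?",
                                   "Would you like separate bags?"])
store.register("bag", "hotel", ["Shall we keep your bag at the front desk?"])

print(store.candidates("Bag?", "checkout"))
```

Keying on the pair of phrase and scene means the same short utterance yields different candidate questions depending on the scene the user selected, which is the behavior the paragraph above describes.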

The input unit may also be configured to present input change means with which the user changes the content of the selected fixed-form sentence.

It is also preferable for the fixed-form sentence presentation unit to present the fixed-form sentences together with the content of the recognized speech.

The device may further include a fixed-form sentence input unit with which the user adds and enters fixed-form sentences.

A speech translation method according to one aspect of the present invention uses a speech translation device that includes an input unit, a scene selection means presentation unit, a storage unit, a fixed-form sentence presentation unit, a translation unit, and an output unit. The method includes: a step in which the input unit inputs the speech of a user engaged in a conversation; a step in which the scene selection means presentation unit presents scene selection means with which the user selects the scene of the conversation from among a plurality of scenes; a step in which the storage unit stores in advance a plurality of fixed-form sentences anticipated for a specific phrase, in association with that specific phrase and each scene; a step in which the fixed-form sentence presentation unit recognizes the input speech and, when the recognized content is a question and contains a specific phrase, presents the fixed-form sentences stored in association with that specific phrase and the selected scene so that the user can select among them; a step in which the translation unit translates the content of the selected fixed-form sentence into content in a different language; and a step in which the output unit outputs the content translated into the different language as speech and/or text.

A speech translation program according to one aspect of the present invention causes a computer (not limited to a single computer or a single type; there may be several computers or several types; the same applies below) to function as: an input unit for inputting the speech of a user engaged in a conversation; a scene selection means presentation unit that presents scene selection means with which the user selects the scene of the conversation from among a plurality of scenes; a storage unit that stores in advance a plurality of fixed-form sentences anticipated for a specific phrase, in association with that specific phrase and each scene; a fixed-form sentence presentation unit that recognizes the input speech and, when the recognized content is a question and contains a specific phrase, presents the fixed-form sentences stored in association with that specific phrase and the selected scene so that the user can select among them; a translation unit that translates the content of the selected fixed-form sentence into content in a different language; and an output unit that outputs the content translated into the different language as speech and/or text.

According to the present invention, what the questioner (a user conversing by means of speech translation) has said is not simply translated as-is into another language and conveyed to the other user (the answerer). Instead, after the user selects the scene of the conversation, the questioner can choose, from among the fixed-form sentences associated with the specific phrase contained in the spoken input and with that scene, the question that he or she intended or that matches his or her real meaning. Because the selected fixed-form sentence is then translated and conveyed to the answerer, the questioner can ask the answerer an appropriate question, which enables smooth communication between the two.

FIG. 1 is a system block diagram schematically showing a preferred embodiment of the network configuration and related elements of a speech translation device according to the present invention.
FIG. 2 is a flowchart showing an example of (part of) the processing flow in a preferred embodiment of the speech translation device according to the present invention.
FIG. 3(A) to (D) are plan views showing an example of display screen transitions on the information terminal.
FIG. 4(A) to (D) are plan views showing an example of display screen transitions on the information terminal.
FIG. 5 is a flowchart showing an example of (part of) the processing flow in a preferred embodiment of the speech translation device according to the present invention.
FIG. 6(A) and (B) are plan views showing an example of display screen transitions on the information terminal.

Embodiments of the present invention are described in detail below. The following embodiments are examples for explaining the present invention and are not intended to limit the invention to those embodiments alone. The present invention can be modified in various ways without departing from its gist. Those skilled in the art can also adopt embodiments in which the elements described below are replaced with equivalents, and such embodiments are likewise included in the scope of the present invention. Positional relationships such as up, down, left, and right, where indicated, are based on the drawings unless otherwise noted. The dimensional ratios in the drawings are not limited to the ratios shown.

(Device configuration)
FIG. 1 is a system block diagram schematically showing a preferred embodiment of the network configuration and related elements of a speech translation device according to the present invention. In this example, the speech translation device 100 includes a server 20 that is electronically connected via a network N to an information terminal 10 (user device) used by the user (although the configuration is not limited to this).

The information terminal 10 employs, for example, a user interface such as a touch panel and a highly visible display. The information terminal 10 here is a portable tablet-type terminal device, including mobile phones typified by smartphones, that has a function for communicating with the network N. The information terminal 10 includes a processor 11, a storage resource 12, a voice input/output device 13, a communication interface 14, an input device 15, a display device 16, and a camera 17. When the installed speech translation application software (at least part of the speech translation program according to an embodiment of the present invention) runs, the information terminal 10 functions as part or all of the speech translation device according to an embodiment of the present invention.

The processor 11 is composed of an arithmetic logic unit and various registers (program counter, data register, instruction register, general-purpose registers, and so on). The processor 11 interprets and executes the speech translation application software, which is the program P10 stored in the storage resource 12, and performs various kinds of processing. The speech translation application software serving as program P10 can be distributed, for example, from the server 20 over the network N, and may be installed and updated manually or automatically.

The network N is, for example, a communication network made up of a mixture of wired networks (such as a local area network (LAN), a wide area network (WAN), or a value-added network (VAN)) and wireless networks (such as a mobile communication network, a satellite communication network, Bluetooth (registered trademark), WiFi (Wireless Fidelity), or HSDPA (High Speed Downlink Packet Access)).

The storage resource 12 is a logical device provided by the storage area of a physical device (for example, a computer-readable recording medium such as a semiconductor memory). It stores the operating system program, driver programs, various data, and other items used in the processing of the information terminal 10. Examples of the driver programs include an input/output device driver program for controlling the voice input/output device 13, an input device driver program for controlling the input device 15, and a display device driver program for controlling the display device 16. The voice input/output device 13 is, for example, an ordinary microphone and a sound player capable of playing back sound data.

The communication interface 14 provides, for example, a connection interface to the server 20, and is composed of a wireless communication interface and/or a wired communication interface. The input device 15 provides an interface that accepts input operations made, for example, by tapping icons, buttons, a virtual keyboard, text, and the like displayed on the display device 16. Examples include a touch panel as well as various input devices attached externally to the information terminal 10.

The display device 16 serves as an image display interface that provides various kinds of information to the users (speaker and listener). Examples include an organic EL display, a liquid crystal display, and a CRT display. The camera 17 captures still images and video of various subjects.

The server 20 is composed of, for example, a host computer with high processing capability, and provides its server functions by running a predetermined server program on that host computer. It is composed of, for example, one or more host computers functioning as a speech recognition server, a translation server, and a speech synthesis server (a single server is shown in the drawing, but the configuration is not limited to this). Each server 20 includes a processor 21, a communication interface 22, and a storage resource 23.

The processor 21 is composed of an arithmetic logic unit, which handles arithmetic operations, logical operations, bit operations, and the like, and various registers (program counter, data register, instruction register, general-purpose registers, and so on). It interprets and executes the program P20 stored in the storage resource 23 and outputs predetermined processing results. The communication interface 22 is a hardware module for connecting to the information terminal 10 via the network N, for example a modulation/demodulation device such as an ISDN modem, an ADSL modem, a cable modem, an optical modem, or a soft modem.

The storage resource 23 is, for example, a logical device provided by the storage area of a physical device (such as a disk drive or a computer-readable recording medium such as a semiconductor memory). It stores one or more programs P20, various modules L20, various databases D20, and various models M20. The storage resource 23 also stores a plurality of fixed-form question sentences prepared in advance for one user in the conversation (the speaker) to address the other user (the listener), history data of input speech, data for various settings, the phrase data described later, and so on.

The program P20 is, for example, the server program mentioned above, which is the main program of the server 20. The various modules L20 are software modules (modularized subprograms) that are called and executed as needed during the operation of program P10 in order to carry out a series of information processing on the requests and information sent from the information terminal 10. Examples of the modules L20 include a speech recognition module, a translation module, and a speech synthesis module.

Examples of the various databases D20 include the corpora needed for speech translation processing (for Japanese-English speech translation, for example, a Japanese speech corpus, an English speech corpus, a Japanese character (vocabulary) corpus, an English character (vocabulary) corpus, a Japanese dictionary, an English dictionary, a Japanese-English bilingual dictionary, and a Japanese-English parallel corpus), a speech database, a management database for managing information about users, and a phrase database with the hierarchical structure described later. Examples of the various models M20 include the acoustic models and language models used for speech recognition.

(Conversation using normal speech translation)
An example of the processing operations and behavior of the speech translation device 100 configured as described above is given below. This section describes an example of normal speech translation processing in a conversation between users (a questioner and an answerer, both of whom are speakers) and/or in preparation for such a conversation. FIG. 2 is a flowchart showing an example of (part of) the processing flow in the speech translation device 100. FIG. 3(A) to (D) and FIG. 4(A) to (D) are plan views showing an example of display screen transitions on the information terminal. In this embodiment, the conversation is assumed to be one in which the language of one speaker (the questioner) is Japanese and the language of the other speaker (the answerer) is Chinese (although the languages and the situation are not limited to this).

First, when the user launches the application (step SU1), the processor 21 of the server 20 and the processor 11 of the information terminal 10 display, on the display device 16 of the information terminal 10, a language selection screen for selecting the language of the other user (FIG. 3(A); step SJ1). This language selection screen shows Japanese text T1 prompting the user to ask the other user which language he or she speaks, English text T2 asking the other user for his or her language, and language buttons 31 indicating several representative languages that are anticipated (here, English, Chinese (for example, two variants distinguished by script), and Korean). Below these, a cancel button B1 for closing the language selection screen and exiting the application is also displayed.

As shown in FIG. 3(A), the processor 11 and the display device 16 display the Japanese text T1 and the English text T2 in separate regions of the screen of the display device 16 of the information terminal 10, and in opposite orientations (orientations that differ from each other; upside down relative to each other in the figure). When the users converse face to face, this makes it easy for one user to read the Japanese text T1 and for the other user to read the English text T2. Because the Japanese text T1 and the English text T2 are displayed in separate regions, there is the further advantage that the two are clearly distinguished and even easier to read.

The language can be selected either by the user showing the listener the English text T2 on the language selection screen and having the other user tap the "Chinese" button, or by the other user selecting "Chinese", the language he or she uses, on his or her own. Once the other user's language has been selected in this way, the processor 21 of the server 20 and the processor 11 of the information terminal 10 display, as the home screen, a voice input standby screen for Japanese and Chinese on the display device 16 (FIG. 3(B); step SJ2).

This voice input standby screen shows an input button 32a, drawn as a microphone, for Japanese voice input and an input button 32b, also drawn as a microphone, for Chinese voice input. Nearer the edges of the screen than the input buttons 32a and 32b, Japanese text T3 indicating that Japanese will be converted into Chinese and Chinese text T4 indicating that Chinese will be converted into Japanese are displayed, respectively. Nearer the center of the screen than the input buttons 32a and 32b, Japanese text T5 and Chinese text T6 are displayed, respectively, prompting the users to tap the microphone-shaped input buttons 32a and 32b to start the conversation.

The voice input standby screen also shows a registered phrase button B2 for displaying a group of phrases the user has registered in advance, a text input button B3 for entering text instead of voice input, a settings button B4 for configuring the application software, and a scene selection button BS for selecting the scene of the conversation.

Next, when the user (questioner) taps the Japanese input button 32a on the voice input standby screen shown in FIG. 3(B) to select Japanese voice input, a voice input screen that accepts the user's utterance in Japanese is displayed (FIG. 3(C)). Once this voice input screen is displayed, voice input through the voice input/output device 13 becomes possible.

This voice input screen shows Japanese text T6 prompting the user to speak into the microphone of the information terminal 10, Chinese text T7 indicating that the other party is in the middle of voice input, the microphone-shaped input button 32a, and a multiple-circle graphic 33 surrounding the input button 32a. The multiple-circle graphic 33 indicates that voice input is in progress, and the size of the displayed circles changes with the volume of the voice so as to represent that volume schematically and dynamically. The voice input level is thereby fed back to the user visually.
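One way the voice volume could drive the size of the multiple-circle graphic 33 is sketched below. The text states only that the displayed circles change size with the volume, so everything else here is an assumption made for illustration: the use of an RMS level over normalized samples, the function names `rms_level` and `ring_radii`, and the default radii and ring count.

```python
import math


def rms_level(samples):
    """Root-mean-square level of samples normalized to [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def ring_radii(level, base_radius=40.0, max_extra=60.0, rings=3):
    """Radii of concentric rings around the button for a given level.

    level is clamped to [0.0, 1.0]; louder input pushes the rings outward.
    """
    level = min(max(level, 0.0), 1.0)
    outer = base_radius + max_extra * level
    step = (outer - base_radius) / rings
    return [base_radius + step * (i + 1) for i in range(rings)]


# Silence collapses the rings onto the button; loud input spreads them out.
print(ring_radii(rms_level([0.0, 0.0])))
print(ring_radii(rms_level([1.0, -1.0])))
```

Recomputing the radii for each short audio frame and redrawing gives the dynamic feedback described above; a real interface would likely smooth the level between frames.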

The cancel button B1 is also displayed on this voice input screen. Tapping it either exits the application or returns to the voice input standby screen (FIG. 3(B)) so that voice input can be started over. Near the input button 32a, Japanese text T8 is displayed for starting the speech recognition processing and multilingual translation processing, described later, once voice input has finished.

In this state, when the user (questioner) utters what he or she wants to convey to the other user (answerer) (step SU2), voice input takes place through the voice input/output device 13 (step SJ3). The processor 11 of the information terminal 10 generates a speech signal from that voice input and sends the speech signal to the server 20 through the communication interface 14 and the network N. In this way, the information terminal 10 itself, or the processor 11 together with the voice input/output device 13, functions as the "input unit".

Then, when the utterance has ended and the Japanese text T8 shown in FIG. 4(C) is tapped (touched), the processor 11 stops accepting the utterance. The processor 11 of the information terminal 10 generates a speech signal from that voice input and sends the speech signal to the server 20 through the communication interface 14 and the network N.

Next, the processor 21 of the server 20 receives the speech signal through the communication interface 22 and performs speech recognition processing (step SJ4). At this point the processor 21 calls the necessary modules L20, databases D20, and models M20 (speech recognition module, Japanese speech corpus, acoustic model, language model, and so on) from the storage resource 23 and converts the "sound" of the input speech into its "reading" (characters). In this way, the server 20 as a whole also functions as a "speech recognition server". The processor 21 also stores the recognized content in the storage resource 23 (storage unit) as voice input history data (in an appropriate database as needed).

Subsequently, the processor 21 proceeds to multilingual translation processing, which translates the "reading" (characters) of the recognized speech into a plurality of other languages (step SJ5). Here, since Chinese has been selected as the language of the other user, the processor 21 calls the necessary module L20 and database D20 (translation module, Japanese character corpus, Japanese dictionary, Chinese dictionary, Japanese/Chinese bilingual dictionary, Japanese/Chinese bilingual corpus, etc.) from the storage resource 23. It appropriately rearranges the "reading" (character string) of the input speech obtained as the recognition result into Japanese phrases, clauses, sentences, and the like, extracts the Chinese corresponding to that conversion result, and rearranges it according to Chinese grammar into natural Chinese phrases, clauses, sentences, and the like.

Thus, the processor 21 also functions as a "translation unit" that translates the content of the input speech into content in a second language (Chinese) different from the first language (Japanese), and the server 20 as a whole also functions as a "translation server". If the input speech was not recognized correctly, the speech can be input again (not shown). The processor 21 can also store those Japanese and Chinese phrases, clauses, sentences, and the like in the storage resource 23.

During this translation processing, the processor 11 of the information terminal 10 displays the translation-in-progress screen shown in FIG. 3(D). This screen displays Japanese text T9 and Chinese text T10 indicating that translation is in progress, together with a ring-shaped design 34 in which part of an arc rotates to indicate that translation is in progress. A cancel button B1 is also displayed on this screen; tapping it either terminates the application or returns to the voice input standby screen (FIG. 3(B)) so that voice input can be redone.

Next, when the multilingual translation processing is complete, the processor 21 proceeds to speech synthesis processing (step SJ6). At this time, the processor 21 calls the necessary module L20, database D20, and model M20 (speech synthesis module, Chinese speech corpus, acoustic model, language model, etc.) from the storage resource 23 and converts the Chinese phrases, clauses, sentences, and the like obtained as the translation result into natural speech. Thus, the processor 21 also functions as a "speech synthesis unit", and the server 20 as a whole also functions as a "speech synthesis server".

Next, the processor 21 generates a text signal for text display based on the Chinese translation result (a corresponding Chinese conversation corpus may also be used) and transmits it to the information terminal 10. On receiving the text signal, the processor 11 displays Japanese text T11 showing the content of the recognized input speech, and text T12 showing its Chinese translation, on the translation result display screen shown in FIG. 4(A).

This translation result display screen also displays the Japanese input button 32a and the Chinese input button 32b, each designed as a microphone, which were also displayed on the home screen of FIG. 3(B). Near them, texts T13 and T14 indicating Japanese and Chinese, respectively, are displayed.

Further, the processor 21 generates a voice signal for voice output based on the synthesized speech and transmits it to the information terminal 10. On receiving the voice signal, the processor 11 displays the texts T13 and T14 and, using the voice input/output device 13 (output unit), outputs (reads aloud) the content of the Chinese text T12 as speech (step SJ7).
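The server-side sequence of steps SJ4 to SJ6 can be sketched as a three-stage pipeline. The following Python sketch is illustrative only: the names `run_pipeline` and `TranslationResult` and the stub engines are assumptions introduced here and are not part of the disclosed implementation; real speech recognition, translation, and speech synthesis engines would take the place of the stubs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TranslationResult:
    source_text: str     # recognized utterance (shown as text T11)
    target_text: str     # translation (shown as text T12)
    target_audio: bytes  # synthesized speech played back in step SJ7


def run_pipeline(
    audio: bytes,
    recognize: Callable[[bytes, str], str],
    translate: Callable[[str, str, str], str],
    synthesize: Callable[[str, str], bytes],
    src: str = "ja",
    dst: str = "zh",
) -> TranslationResult:
    """Chain speech recognition (SJ4), translation (SJ5), and synthesis (SJ6)."""
    source_text = recognize(audio, src)
    target_text = translate(source_text, src, dst)
    target_audio = synthesize(target_text, dst)
    return TranslationResult(source_text, target_text, target_audio)


# Hypothetical stub engines, used only to exercise the control flow.
def stub_recognize(audio: bytes, lang: str) -> str:
    return audio.decode("utf-8")


def stub_translate(text: str, src: str, dst: str) -> str:
    return f"[{src}->{dst}] {text}"


def stub_synthesize(text: str, lang: str) -> bytes:
    return text.encode("utf-8")


result = run_pipeline(
    "こんにちは".encode("utf-8"), stub_recognize, stub_translate, stub_synthesize
)
```

Passing the engines in as callables mirrors the statement that the same processor 21 acts as recognition, translation, and synthesis unit by calling different modules from the storage resource 23.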

Further, a text input button B5 for entering text instead of voice is displayed near the text T11, and a check button B7 for returning to the home screen of FIG. 3(B) and a mistranslation notification button B6 for reporting an error in the translation result are displayed at the bottom of the screen. The translation result display screen also displays a display button B8 for displaying emotion characters or emotion symbols that express the user's emotion.

When the text input button B5 is tapped, the text input screen shown in FIG. 4(B) is displayed. On this screen, the Japanese text T11 displayed in FIG. 4(A) is shown in light gray. When text entry is started from the Japanese keyboard K, the text T11 is erased and the newly entered text is displayed in its place. An erase button B9 for canceling and erasing the entered text is displayed at the top of the text input screen, and a translation button B10 for translating the entered text is displayed immediately above the keyboard K. When the translation button B10 is tapped, the translation processing and speech synthesis processing described above are performed, and a screen similar to the translation result display screen shown in FIG. 4(A) is displayed.

Thereafter, when the other user answers, tapping the Chinese input button 32b displayed on the translation result display screen of FIG. 4(A) to select Chinese voice input brings up a voice input screen that accepts the other user's utterance in Chinese (FIG. 4(C)). When this voice input screen is displayed, voice input from the voice input/output device 13 becomes possible, as with the voice input screen shown in FIG. 3(C). This screen displays Chinese text T15 prompting the user to speak into the microphone of the information terminal 10, Japanese text T16 indicating that the other party is inputting speech, the input button 32b designed as a microphone, and a multiple circular design 33 surrounding the input button 32b.

A cancel button B1 is also displayed on this voice input screen; tapping it either terminates the application or returns to the voice input standby screen (FIG. 3(B)) so that voice input can be redone. In addition, near the input button 32b, Chinese text T17 is displayed for starting the speech recognition processing and multilingual translation processing described below after voice input has ended.

In this state, when the other user (answerer) utters an answer or the like to the user (questioner) (step SU2), voice input is performed through the voice input/output device 13 (step SJ3). The processor 11 of the information terminal 10 generates a voice signal based on the voice input and transmits it to the server 20 through the communication interface 14 and the network N. Then, when the utterance ends and the Chinese text T14 is tapped (touched), the processor 11 stops accepting the utterance. The processor 11 of the information terminal 10 generates a voice signal based on the voice input and transmits it to the server 20 through the communication interface 14 and the network N.

Next, the processor 21 of the server 20 receives the voice signal through the communication interface 22 and performs speech recognition processing (step SJ4). At this time, the processor 21 calls the necessary module L20, database D20, and model M20 (speech recognition module, Chinese speech corpus, acoustic model, language model, etc.) from the storage resource 23 and converts the "sound" of the input speech into a "reading" (characters). The processor 21 also stores the recognized content in the storage resource 23, in an appropriate database as necessary, as voice input history data.

Subsequently, the processor 21 proceeds to multilingual translation processing, which translates the "reading" (characters) of the recognized speech into a plurality of other languages (step SJ5). The processor 21 calls the necessary module L20 and database D20 (translation module, Chinese character corpus, Chinese dictionary, Japanese dictionary, Chinese/Japanese bilingual dictionary, Chinese/Japanese bilingual corpus, etc.) from the storage resource 23. It appropriately rearranges the "reading" (character string) of the input speech obtained as the recognition result into Chinese phrases, clauses, sentences, and the like, extracts the Japanese corresponding to that conversion result, and rearranges it according to Japanese grammar into natural Japanese phrases, clauses, sentences, and the like. If the input speech was not recognized correctly, the speech can be input again (not shown). The processor 21 can also store those Chinese and Japanese phrases, clauses, sentences, and the like in the storage resource 23.

During this translation processing, the processor 11 of the information terminal 10 displays the translation-in-progress screen shown in FIG. 4(D). This screen displays Japanese text T9 and Chinese text T10 indicating that translation is in progress, together with a ring-shaped design 34 in which part of an arc rotates to indicate that translation is in progress. A cancel button B1 is also displayed on this screen; tapping it either terminates the application or returns to the voice input standby screen (FIG. 3(B)) so that voice input can be redone.

Next, when the multilingual translation processing is complete, the processor 21 proceeds to speech synthesis processing (step SJ6). At this time, the processor 21 calls the necessary module L20, database D20, and model M20 (speech synthesis module, Japanese speech corpus, acoustic model, language model, etc.) from the storage resource 23 and converts the Japanese phrases, clauses, sentences, and the like obtained as the translation result into natural speech.

Next, the processor 21 generates a text signal for text display based on the Japanese translation result (a corresponding Japanese conversation corpus may also be used) and transmits it to the information terminal 10. On receiving the text signal, the processor 11 displays Chinese text showing the content of the recognized input speech, and text showing its Japanese translation, in the same manner as on the translation result display screen shown in FIG. 4(A).

(Conversation by speech translation taking the conversation scene into account)
Next, an example of the processing operations and behavior is described for the case where, during a conversation between users (speakers) and/or in preparation for it, the user selects the scene of the conversation and speech translation is performed taking the selected scene into account. FIG. 5 is a flowchart showing an example of (part of) the processing flow in the speech translation device 100. FIGS. 6(A) and 6(B) are plan views showing an example of display screen transitions on the information terminal. In this embodiment, a conversation is assumed in which the language of one user (questioner) is Japanese and the language of the other user (answerer) is English (the languages and the situation are not limited to these).

The processing procedure here is the same as in the "conversation by normal speech translation" shown in FIG. 2, with two differences. First, a conversation scene is selected (step SU3) prior to the utterance (step SU2). Second, the following are performed between speech recognition (step SJ4) and multilingual translation (step SJ5): a determination of whether the content of the recognized speech is a question sentence and includes a specific phrase (step SJ8); a determination of whether the specific phrase is associated with the selected scene (step SJ9); and presentation of the fixed syntaxes stored in association with the specific phrase and the selected scene (step SJ10).
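The branch inserted between steps SJ4 and SJ5 can be sketched as follows. This is a minimal Python sketch under stated assumptions: the function name `handle_recognized_text`, its callable parameters, and the returned dictionary layout are hypothetical and chosen only to make the control flow visible. An empty lookup result stands in for the "No" outcome of the scene-association check.

```python
from typing import Callable, List, Optional


def handle_recognized_text(
    text: str,
    scene: str,
    is_question: Callable[[str], bool],
    find_specific_phrase: Callable[[str], Optional[str]],
    lookup_fixed_syntaxes: Callable[[str, str], List[str]],
) -> dict:
    """Choose between presenting fixed syntaxes (SJ10) and normal translation (SJ5)."""
    # Step SJ8: the utterance must be a question AND contain a specific phrase.
    phrase = find_specific_phrase(text) if is_question(text) else None
    if phrase is not None:
        # Steps SJ9/SJ10: fixed syntaxes registered for this phrase and scene.
        candidates = lookup_fixed_syntaxes(phrase, scene)
        if candidates:
            return {"action": "present", "candidates": candidates}
    # Any "No" branch falls through to normal multilingual translation (SJ5).
    return {"action": "translate", "text": text}


# Toy predicates (assumptions) to exercise each branch.
_q = lambda t: t.endswith("?")
_f = lambda t: "seat" if "seat" in t else None
_l = lambda p, s: ["Counter or table?"] if (p, s) == ("seat", "customer service") else []

presented = handle_recognized_text("What seat?", "customer service", _q, _f, _l)
wrong_scene = handle_recognized_text("What seat?", "payment", _q, _f, _l)
not_question = handle_recognized_text("A seat.", "customer service", _q, _f, _l)
```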

That is, the user (questioner) starts the application (step SU1), a language selection screen for selecting the language of the other user (answerer) is displayed (FIG. 3(A); step SJ1), and the language of the other user (answerer) is then selected, whereupon a Japanese and English voice input standby screen is displayed on the display device 16 (similar to FIG. 3(B); step SJ2). Then, on the voice input standby screen shown in FIG. 3(B), the user (questioner) taps the scene selection button BS for selecting a conversation scene. The processor 21 of the server 20 and the processor 11 of the information terminal 10 then display a list display screen of conversation scenes on the display device 16 (FIG. 6(A)).

In the example shown in FIG. 6(A), major scene categories (for example, "food and drink", "shopping", "sightseeing", and "business") are displayed as scene heading tabs S1 to S4. When the user (questioner) taps the desired one of the scene heading tabs S1 to S4, the specific sub-categories of scenes belonging to that scene tab (when the major category is "food and drink", for example, "customer service", "guidance", "ordering", and "payment") are listed as scene bars S11 to S15.
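The two-level scene classification can be represented as a simple mapping from major category to sub-categories. The Python sketch below is an assumption about one possible data layout, not the disclosed storage format; the names `SCENES`, `major_categories`, and `sub_scenes` are hypothetical, and sub-categories are filled in only where the description names them.

```python
from typing import Dict, List

# Major categories (scene heading tabs) mapped to sub-categories (scene bars).
SCENES: Dict[str, List[str]] = {
    "food and drink": ["customer service", "guidance", "ordering", "payment"],
    "shopping": [],     # sub-categories are not listed in the description
    "sightseeing": [],
    "business": [],
}


def major_categories() -> List[str]:
    """Labels for the scene heading tabs, in display order."""
    return list(SCENES)


def sub_scenes(major: str) -> List[str]:
    """Labels for the scene bars shown after a tab is tapped."""
    return list(SCENES.get(major, []))
```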

Here, as an example, suppose that the user (questioner) is a Japanese-speaking clerk at a restaurant and the user (answerer) is an English-speaking foreign customer visiting that restaurant. The clerk, as the questioner, can tap the "food and drink" scene tab among the scene tabs S1 to S4 and then, since this is a reception on arrival, tap the "customer service" or "guidance" scene bar among the scene bars S11 to S15 to select the scene of the conversation at that time (step SU3).

When the user selects a conversation scene, the processor 21 of the server 20 and the processor 11 of the information terminal 10 again display the voice input standby screen shown in FIG. 3(B) on the display device 16 of the information terminal 10. In this state, when the questioner utters, as a question to the answerer, for example "どんな席がよろしいですか？" ("What kind of seat would you like?") instead of the phrase shown in FIG. 4(A) (step SU2), voice input is performed through the voice input/output device 13 (step SJ3). When the utterance ends and the Japanese text T8 shown in FIG. 4(C) is tapped (touched), the processor 11 stops accepting the utterance, and the processor 21 of the server 20 performs speech recognition processing (step SJ4).

In this case, specifically, the processor 21 converts the "sound" of the input speech into a "reading" (characters) and performs morphological analysis, whereby phrases such as "どんな" (what kind of), "席" (seat), "が" (subject particle), "よろしい" (desirable), and "ですか" (polite question ending) are extracted from the utterance. Next, the processor 21 determines whether the content of the input speech is a question sentence and includes a specific phrase (step SJ8). More specifically, the processor 21 determines that the content of the input speech is a question sentence because it contains an interrogative such as "どんな", or additionally because of the sentence-final expression "ですか".
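The question-sentence part of step SJ8 can be sketched on already-tokenized input. The Python sketch below is illustrative and rests on several assumptions: the morphological analyzer is not modeled (tokens are passed in directly), only "どんな" and "ですか" come from the description while the other set entries are added for illustration, and either criterion alone is treated as sufficient.

```python
from typing import Sequence

# "どんな" and "ですか" appear in the description; the rest are illustrative.
INTERROGATIVES = frozenset({"どんな", "どの", "どちら", "何", "いつ", "どこ"})
QUESTION_ENDINGS = frozenset({"ですか", "ますか"})
PUNCTUATION = frozenset({"？", "?", "。"})


def is_question(tokens: Sequence[str]) -> bool:
    """Return True if the token sequence looks like a question sentence."""
    words = [t for t in tokens if t not in PUNCTUATION]
    if not words:
        return False
    has_interrogative = any(w in INTERROGATIVES for w in words)
    has_question_ending = words[-1] in QUESTION_ENDINGS
    return has_interrogative or has_question_ending


example_tokens = ["どんな", "席", "が", "よろしい", "ですか", "？"]
```

In practice the token list would come from a morphological analyzer, and a production system would likely need a richer test than two small word sets.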

Meanwhile, the storage resource 23 (storage unit) stores in advance, as a specific phrase, for example the single phrase "席" (seat) or the combination of phrases "どんな" (what kind of) + "席", and the processor 21 therefore determines that the content of the input speech includes a specific phrase (Yes in step SJ8). If the content of the input speech is not a question sentence or does not include a specific phrase (No in step SJ8), processing proceeds to normal multilingual translation processing (step SJ5).

Next, the processor 21 determines whether the selected scene ("customer service" or "guidance") is associated with the specific phrase ("席" (seat), or "どんな" (what kind of) + "席") included in the content of the input speech (step SJ9). Here, since the storage resource 23 stores in advance the scene "customer service" in association with "席" or "どんな" + "席", it is determined that the selected scene is associated with the specific phrase included in the input speech (Yes in step SJ9). If the selected scene is not associated with the specific phrase included in the input speech (No in step SJ9), processing proceeds to normal multilingual translation processing (step SJ5).
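The specific-phrase check and the scene-association check can be sketched with a small lookup table. The Python sketch below is an assumption about one possible representation: each specific phrase is modeled as a set of tokens that must all occur in the utterance, and the names `PHRASE_SCENES`, `find_specific_phrase`, and `scene_is_associated` are hypothetical. Only the association with "customer service" is taken from the description.

```python
from typing import Dict, FrozenSet, Optional, Sequence, Set

SEAT = frozenset({"席"})
WHAT_KIND_OF_SEAT = frozenset({"どんな", "席"})

# Specific phrase -> scenes stored in association with it.
PHRASE_SCENES: Dict[FrozenSet[str], Set[str]] = {
    SEAT: {"customer service"},
    WHAT_KIND_OF_SEAT: {"customer service"},
}


def find_specific_phrase(tokens: Sequence[str]) -> Optional[FrozenSet[str]]:
    """Phrase part of step SJ8: return the most specific registered phrase found."""
    present = set(tokens)
    matches = [p for p in PHRASE_SCENES if p <= present]
    return max(matches, key=len) if matches else None


def scene_is_associated(phrase: FrozenSet[str], scene: str) -> bool:
    """Step SJ9: is the selected scene registered for this phrase?"""
    return scene in PHRASE_SCENES.get(phrase, set())


matched = find_specific_phrase(["どんな", "席", "が", "よろしい", "ですか"])
```

Preferring the longest match is a design choice made here so that the combination "どんな" + "席" wins over "席" alone when both are registered.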

Then, if the information stored in the storage resource 23 includes fixed syntaxes stored in association with both the specific phrase ("席" (seat), or "どんな" (what kind of) + "席") and the selected scene ("customer service" or "guidance"), the processor 21 extracts them and presents them as a list on the display device 16 (Yes in step SJ10; FIG. 6(B)). In the example shown in FIG. 6(B), the following fixed syntaxes, stored in the storage resource 23 in association with both the specific phrase and the selected scene, are displayed as the syntax list PL: "Would you prefer a seat indoors or on the terrace?", "Would you prefer a non-smoking or a smoking seat?", "Would you prefer a seat at the counter or at a table?", "Would you prefer a seat on the first floor or the second floor?", and "Do you have any preference for your seat?".
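The extraction of fixed syntaxes for a phrase and scene can be sketched as a dictionary lookup keyed by the pair. The Python sketch below is illustrative: the names `FIXED_SYNTAXES`, `lookup_fixed_syntaxes`, and `format_syntax_list` are hypothetical, the sentences are shown in English translation for readability although the stored entries would be in the questioner's language, and the numbered text layout is an assumption rather than the layout of FIG. 6(B).

```python
from typing import Dict, List, Tuple

# (specific phrase, scene) -> fixed syntaxes stored in advance.
FIXED_SYNTAXES: Dict[Tuple[str, str], List[str]] = {
    ("席", "customer service"): [
        "Would you prefer a seat indoors or on the terrace?",
        "Would you prefer a non-smoking or a smoking seat?",
        "Would you prefer a seat at the counter or at a table?",
        "Would you prefer a seat on the first floor or the second floor?",
        "Do you have any preference for your seat?",
    ],
}


def lookup_fixed_syntaxes(phrase: str, scene: str) -> List[str]:
    """Return the fixed syntaxes for this phrase and scene (empty if none)."""
    return list(FIXED_SYNTAXES.get((phrase, scene), []))


def format_syntax_list(candidates: List[str]) -> str:
    """Render the candidates as a numbered list for selection."""
    return "\n".join(f"{i}. {s}" for i, s in enumerate(candidates, start=1))


syntax_list = lookup_fixed_syntaxes("席", "customer service")
```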

Here, a foreign customer (the answerer) who is asked "What kind of seat would you like?" by the clerk (the questioner) often does not know what kinds of seats exist in the restaurant in the first place, or cannot tell what kinds of seats the clerk has in mind by "what kind of seat". In that case, the customer may well end up asking the clerk, for confirmation, about the kind of content exemplified in the syntax list PL of FIG. 6(B). Moreover, because this exchange takes place through speech translation, smooth communication tends to be difficult.

In contrast, when the syntax list PL is displayed as in FIG. 6(B), the clerk (questioner) can select a more precise question than, for example, "What kind of seat would you like?" (for example, "Would you prefer a seat at the counter or at a table?") and put it to the foreign customer. The clerk can thus ask the foreign customer an appropriate question, grasp the customer's wishes more accurately, and achieve smooth communication between the two.

In this way, when the clerk (questioner) taps and selects the desired syntax from the syntax list PL shown in FIG. 6(B), the content of that syntax is translated (step SJ5), followed by speech synthesis (step SJ6) and voice output (step SJ7). The syntax list display screen shown in FIG. 6(B) also displays a text input button B5 similar to the one displayed on the translation result display screen of FIG. 4(A). When the speaker taps this text input button B5, a text input screen similar to that shown in FIG. 4(B) is displayed; other content can then be entered as text, for example with reference to the syntax list PL, and that content may be additionally registered in the storage resource 23 as a fixed syntax.
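Additional registration of a user-entered fixed syntax can be sketched as an append to the stored list for the current phrase and scene. The Python sketch below is a minimal illustration under assumptions: the in-memory dictionary stands in for the storage resource 23, the name `add_fixed_syntax` is hypothetical, and rejecting blank or duplicate entries is a design choice made here rather than something the description specifies.

```python
from typing import Dict, List, Tuple

Store = Dict[Tuple[str, str], List[str]]


def add_fixed_syntax(store: Store, phrase: str, scene: str, sentence: str) -> bool:
    """Register a user-entered sentence; return False if blank or already stored."""
    cleaned = sentence.strip()
    if not cleaned:
        return False
    entries = store.setdefault((phrase, scene), [])
    if cleaned in entries:
        return False
    entries.append(cleaned)
    return True


store: Store = {}
first = add_fixed_syntax(store, "席", "customer service", " Would you like a window seat? ")
duplicate = add_fixed_syntax(store, "席", "customer service", "Would you like a window seat?")
blank = add_fixed_syntax(store, "席", "customer service", "   ")
```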

As stated above, each of the above embodiments is an example for explaining the present invention and is not intended to limit the present invention to that embodiment. The present invention can be modified in various ways without departing from its gist. For example, a person skilled in the art can replace the resources (hardware resources or software resources) described in the embodiments with equivalents, and such replacements are also included in the scope of the present invention.

Step SU3 of selecting the conversation scene may be performed, for example, after application startup (step SU1), after display of the language selection screen (step SJ1), or after the utterance (step SU2). Alternatively, the device may be configured so that the conversation scene can be selected (input) at any time after the conversation has progressed to some extent. Further, the syntax list PL may be displayed together with the content of the recognized speech (for example, the voice input "What kind of seat would you like?" mentioned above).

Although an example has been described in which the server 20 executes the speech recognition, translation, speech synthesis, and other processing, the information terminal 10 may be configured to execute this processing instead. In that case, the module L20 used for the processing may be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server 20. Likewise, the database D20, which is a speech database, and/or the model M20, such as an acoustic model, may be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server 20. Thus, the speech translation device need not include the network N and the server 20.

A gateway server or the like that converts the communication protocol between the information terminal 10 and the network N may of course be interposed between the two. The information terminal 10 is not limited to a portable device and may be, for example, a desktop PC, a notebook PC, a tablet PC, or a laptop PC.

According to the present invention, fixed syntaxes stored in association with a specific phrase and the selected conversation scene are presented so that the user can select one. As a result, in a conversation using speech translation, the questioner can ask the answerer an appropriate question, and smooth communication between the two becomes possible. The invention can therefore be widely used in activities such as the design, manufacture, provision, and sale of programs, devices, systems, and methods in the field of services relating to, for example, conversations between people who cannot understand each other's language.

10: information terminal; 11: processor; 12: storage resource; 13: voice input/output device; 14: communication interface; 15: input device; 16: display device; 17: camera; 20: server; 21: processor; 22: communication interface; 23: storage resource; 31: language button; 32a, 32b: input buttons; 33: multiple circular design; 34: ring-shaped design; 100: speech translation device; B1: cancel button; B2: registered phrase button; B3: text input button; B4: setting button; B5: text input button; B6: check button; B7: mistranslation notification button; B8: display button; B9: erase button; B10: translation button; BS: scene selection button; D20: database; E1, E2: face marks; K: keyboard; L20: module; M20: model; N: network; P10, P20: programs; PL: syntax list; S1 to S4: scene heading tabs; S11 to S15: scene bars; SJ1 to SJ10, SU1 to SU3: steps; T1 to T17: texts.

Claims

A speech translation device comprising:
an input unit for inputting the voice of a user who is having a conversation;
a scene selection means presentation unit that presents scene selection means for the user to select a scene of the conversation from among a plurality of scenes;
a storage unit that stores in advance a plurality of fixed syntaxes assumed for a specific phrase, in association with the specific phrase and each of the scenes;
a fixed syntax presentation unit that recognizes the input voice and, when the content of the recognized voice is a question sentence and includes the specific phrase, presents the fixed syntaxes stored in association with the specific phrase and the selected scene so that the user can select one;
a translation unit that translates the content of the selected fixed syntax into content in a different language; and
an output unit that outputs the content translated into the different language as voice and/or text.

The speech translation device according to claim 1, wherein
the input unit presents input changing means for the user to change the content of the selected fixed syntax, and
the translation unit translates the changed content into content of a different language.

The speech translation device according to claim 1 or 2, wherein
the fixed syntax presenting unit presents the fixed syntaxes together with the content of the recognized voice.

The speech translation device according to any one of claims 1 to 3, further comprising
a fixed syntax input unit for the user to additionally input a fixed syntax.

A speech translation method using a speech translation device including an input unit, a scene selection means presenting unit, a storage unit, a fixed syntax presenting unit, a translation unit, and an output unit, the method including:
a step in which the input unit inputs the voice of a user who is having a conversation;
a step in which the scene selection means presenting unit presents scene selection means for the user to select, from a plurality of scenes, the scene of the conversation;
a step in which the storage unit stores in advance a plurality of fixed syntaxes assumed for a specific phrase, in association with the specific phrase and each scene;
a step in which the fixed syntax presenting unit recognizes the input voice and, when the content of the recognized voice is a question sentence and includes the specific phrase, presents the fixed syntaxes stored in association with the specific phrase and the selected scene so that the user can select one of them;
a step in which the translation unit translates the content of the selected fixed syntax into content of a different language; and
a step in which the output unit outputs the content translated into the different language as voice and/or text.

A speech translation program that causes a computer to function as:
an input unit for inputting the voice of a user who is having a conversation;
a scene selection means presenting unit that presents scene selection means for the user to select, from a plurality of scenes, the scene of the conversation;
a storage unit that stores in advance a plurality of fixed syntaxes assumed for a specific phrase, in association with the specific phrase and each scene;
a fixed syntax presenting unit that recognizes the input voice and, when the content of the recognized voice is a question sentence and includes the specific phrase, presents the fixed syntaxes stored in association with the specific phrase and the selected scene so that the user can select one of them;
a translation unit that translates the content of the selected fixed syntax into content of a different language; and
an output unit that outputs the content translated into the different language as voice and/or text.
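
For illustration only, the flow recited in the claims (store fixed syntaxes per specific phrase and scene, detect a question containing the phrase, present the matching fixed syntaxes, translate the one selected) might be sketched as follows. This is a minimal sketch and not the implementation disclosed in this publication: the names `FixedSyntaxStore`, `is_question`, `present_fixed_syntaxes` and `translate_selected`, the question-mark heuristic, the example phrases and scenes, and the stub translator are all assumptions introduced here. Speech recognition, machine translation and speech synthesis are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class FixedSyntaxStore:
    """Stand-in for the storage unit: fixed syntaxes keyed by (phrase, scene)."""
    table: Dict[Tuple[str, str], List[str]] = field(default_factory=dict)

    def register(self, phrase: str, scene: str, syntaxes: List[str]) -> None:
        # Phrases are stored lower-cased so matching is case-insensitive.
        self.table.setdefault((phrase.lower(), scene), []).extend(syntaxes)


def is_question(text: str) -> bool:
    # Assumption: a trailing question mark stands in for real question detection.
    return text.strip().endswith("?")


def present_fixed_syntaxes(store: FixedSyntaxStore, recognized: str, scene: str) -> List[str]:
    """Return the fixed syntaxes to offer for the recognized text and selected scene."""
    if not is_question(recognized):
        return []
    lowered = recognized.lower()
    found: List[str] = []
    for (phrase, stored_scene), syntaxes in store.table.items():
        # Offer only entries for the selected scene whose phrase occurs in the question.
        if stored_scene == scene and phrase in lowered:
            found.extend(syntaxes)
    return found


def translate_selected(selected: str, translate: Callable[[str], str]) -> str:
    # The translator is injected; any real engine would be plugged in here.
    return translate(selected)


# Hypothetical data, invented for this sketch.
store = FixedSyntaxStore()
store.register("how much", "restaurant",
               ["How much is the set menu?", "How much is the service charge?"])
store.register("how much", "hotel", ["How much is one night?"])

options = present_fixed_syntaxes(store, "How much?", "restaurant")
chosen = options[0]  # the user would pick one of the presented options
output = translate_selected(chosen, lambda s: f"[translated] {s}")
```

Keying the table on the pair of phrase and scene mirrors the claimed association of each fixed syntax with both the specific phrase and the scene, so the same short question yields different candidates in a restaurant and in a hotel.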