JP6754154B1

JP6754154B1 - Translation programs, translation equipment, translation methods, and wearable devices

Info

Publication number: JP6754154B1
Application number: JP2020083415A
Authority: JP
Inventors: 江崎　徹; 徹江崎
Original assignee: 江崎　徹; 徹江崎
Priority date: 2020-05-11
Filing date: 2020-05-11
Publication date: 2020-09-09
Anticipated expiration: 2040-05-11
Also published as: JP2021179689A

Abstract

PROBLEM TO BE SOLVED: To provide a technique that speeds up translation processing when a word uttered by a speaker is translated into a word of another language based on the movement of the speaker's lips during the utterance. SOLUTION: In a translation program capable of translating a word uttered by a speaker into a word of another language having the same meaning and outputting it, a computer is made to function as a translation unit that translates the uttered word into, and outputs, a word of the other language that cannot be read from the lip movement made when the word is uttered but that is associated with that lip movement. [Selected drawing] FIG. 4

Description

The present invention relates to a technique for translating words uttered by a speaker into another language, and more particularly to a translation program, a translation device, a translation method, and a wearable terminal that execute translation processing based on the movement of the speaker's lips.

Technology that lets a computer, rather than a human, recognize the words a speaker utters is being researched and developed.
For example, techniques are known that recognize a speaker's words by analyzing video in which the speaker appears.
Specifically, a video analysis technique (lip reading) is known that identifies the movement of the speaker's lips in the video and reads the uttered words from the identified lip movement (see, for example, Patent Document 1).
Building on such video analysis, a technique has been proposed that performs translation by outputting the speaker's language (Japanese) as read from the lip movement together with the corresponding other language (English) (see, for example, Patent Document 2).

Patent Document 1: Japanese Unexamined Patent Publication No. 2004-15250
Patent Document 2: Japanese Unexamined Patent Publication No. 2013-45282

Conventional translation into another language based on a speaker's lip movement has had the following problems.
For example, when a speaker says "I have a pen" in English, the speaker's lip movement is identified by video analysis, the system recognizes from that movement that the speaker uttered the sounds "アイハブアペン" (ai habu a pen), that is, the English words "I have a pen", and only then translates the English "I have a pen" into the Japanese "私はペンを持っている。" ("I have a pen.").
In this case, however, multiple processes are required: first, recognizing the English words from the speaker's lip movement, and second, translating the recognized English words into Japanese words. This has been a cause of delay in translation processing.
In other words, if the Japanese could be obtained directly from the speaker's lip movement, no delay would arise; the delay comes from the wasteful step of first recognizing the English words from the lip movement.

In addition, the interpreting system of Patent Document 2 registers in advance a database in which text data for the Japanese "暖かいです" ("It is warm"), corresponding to a lip movement (feature amount), is associated with text data for the corresponding English "It is warm". When the speaker says "暖かいです" in Japanese, both the Japanese and the English can then be output.
In this case, however, each database entry consists of multiple pieces of data, namely the Japanese text data and the corresponding English text data, and therefore includes redundant data.
In other words, the database contains "English" text data that would be unnecessary if the "Japanese" could be translated directly from the speaker's lip movement. As a result, the storage area of the storage unit is occupied wastefully, or more storage capacity than necessary has to be provided.

As described above, conventional techniques for translating into another language based on a speaker's lip movement are not configured to translate directly from the lip movement into the other language, and this has caused various problems concerning delay and storage capacity.
The background to this is the common technical knowledge that, as is the case for human lip readers, the words of the language actually spoken (English) can be read from the lip movement of a speaker (for example, an English speaker), but words of a different language (Japanese) cannot. In other words, it is considered impossible to read from a speaker's lip movement words of a language the speaker did not actually utter.
Because of this common technical knowledge, many people, including those skilled in the art, did not conceive of a translation process that directly links the speaker's lip movement to another language, and the various problems concerning delay and storage capacity remained latent.

The present invention has been made in view of the above circumstances, and its object is to provide a translation program, a translation device, a translation method, and a wearable terminal that translate directly into another language based on the movement of the speaker's lips.

To achieve the above object, the translation program of the present invention is a translation program capable of translating a word uttered by a speaker into a word of another language having the same meaning, and causes a computer to function as: a storage unit that stores, in association with each other, video information from which the lip movement made when a speaker utters a word can be identified and information on the word of the other language corresponding to that word; a translation model generation unit that generates a trained translation model by learning teacher data stored in the storage unit, in which the video information from which the lip movement can be identified is the input and the information on the word of the other language is the output; and a translation unit that translates a word uttered by a speaker into the word of the other language output from the translation model when video information from which the speaker's lip movement can be identified is input to the translation model.

According to the present invention, translation into another language based on the movement of the speaker's lips can be performed at high speed.

FIG. 1 is an external view of a wearable terminal that is an example of the translation device of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of the translation device.
FIG. 3 is a block diagram showing the software configuration of the translation device.
FIG. 4 is an example of a data set (teacher data) used for machine learning.
FIG. 5 is a schematic diagram showing the trained translation model.
FIG. 6 is a diagram showing a display example of translation information.
FIG. 7 is a flowchart showing the processing procedure of the translation method of the present invention.
FIG. 8 is a diagram showing examples of a translation device composed of a plurality of devices: (a) is an example composed of a wearable terminal and a smartphone, and (b) is an example composed of a server and various terminal devices.

An embodiment of the translation device 1 of the present invention will be described with reference to the drawings.
FIG. 1 is an external view of a wearable terminal that is an example of the translation device 1 of the present invention.
As shown in FIG. 1, the translation device 1 can be configured as, for example, a glasses-type wearable terminal provided with a photographing device 16.
This embodiment assumes that a user, who is the listener, wears the wearable terminal and photographs the facing speaker with the photographing device 16, while the wearable terminal translates the words the speaker utters and displays the result (simultaneous interpretation).
This embodiment also assumes that the speaker is American, the user is Japanese, and the English words the speaker utters are translated into Japanese.
An information processing terminal such as a smartphone, a tablet terminal, or a personal computer can also be used as the translation device 1.

FIG. 2 is a block diagram showing the hardware configuration of a wearable terminal that is an example of the translation device 1.
As shown in FIG. 2, the translation device 1 is a computer including a processor 11, a memory 12, a storage 13, an input device 14, an output device 15, a photographing device 16, and a communication device 17.
The processor 11 includes a central processing unit (CPU) comprising a control unit, an arithmetic unit, registers, and the like, and controls the entire computer.
The processor 11 reads programs, data, and the like from the storage 13 or the communication device 17 into the memory 12 and executes various processes according to them.
The memory 12 is a computer-readable recording medium, such as a ROM, EPROM, EEPROM, or RAM.

The storage 13 is a computer-readable recording medium, such as a hard disk drive or a flash memory.
The storage 13 functions as the storage unit of the present invention.
The input device 14 is an input device that accepts input from the outside (for example, a keyboard, a mouse, a microphone, a switch, a button, or a sensor).
The output device 15 is an output device that performs output to the outside (for example, a monitor, a display, a display panel, a speaker, or an LED lamp).
The wearable terminal is provided with a projection device as the output device 15, and translation information projected from the projection device is displayed on the lens 151 (see FIGS. 1 and 6).

The photographing device 16 is a camera and supplies the captured video information to the processor 11.
The photographing device 16 is mounted so that it can capture video of the area in front of the user wearing the wearable terminal.
In this embodiment, while the user is wearing the wearable terminal, video of the face of the speaker in front of the user (video including at least the lips) can be captured.
The photographing device 16 may be an external camera.
Video of the speaker can also be imported from the outside and stored in the storage 13.
The communication device 17 is hardware (a transmitting and receiving device) for communication between computers over a wired and/or wireless communication line, for example a network device, a network controller, a network card, or a communication module.
In the glasses-type wearable terminal of this embodiment, each of these parts is, for example, built into or externally attached to the temple.

In the translation device 1, the functions described below are realized by the processor 11 executing a program (the translation program of the present invention) and controlling each part.
For example, the translation program is executed in response to an operation of the input device 14, such as a button, and the photographing device 16 starts capturing.
FIG. 3 is a block diagram showing the software configuration of the translation device 1.
As shown in FIG. 3, the translation device 1 includes a video input unit 21, a translation model 22, an output unit 23, and a translation model generation unit 24.

The video input unit 21 inputs video information of the speaker's face captured by the photographing device 16.
Specifically, it extracts video of the speaker's lip region, that is, video from which the movement of the speaker's lips can be identified, from the video of the speaker's face captured by the photographing device 16, and inputs it to the translation model 22.
The "video" input to the translation model 22 is a moving image showing the movement of the speaker's lips, that is, a plurality of still images with a time-series element.
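The extraction step performed by the video input unit 21 can be pictured with a minimal Python sketch. It assumes that each frame is a grayscale pixel grid and that a lip bounding box is already known for every frame; how the lips are located is not specified in this description, so the `Box` values and the function names `crop_lip_region` and `extract_lip_clip` are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]          # one grayscale still image: rows of pixel values
Box = Tuple[int, int, int, int]  # assumed lip region: (top, left, height, width)


def crop_lip_region(frame: Frame, box: Box) -> Frame:
    """Return only the pixels inside the lip bounding box."""
    top, left, height, width = box
    return [row[left:left + width] for row in frame[top:top + height]]


def extract_lip_clip(frames: List[Frame], boxes: List[Box]) -> List[Frame]:
    """Crop every frame in order, so the time sequence of the video is kept."""
    if len(frames) != len(boxes):
        raise ValueError("need exactly one lip box per frame")
    return [crop_lip_region(f, b) for f, b in zip(frames, boxes)]


# Two identical 4x4 toy frames with the "lips" in the lower middle.
face = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
lip_clip = extract_lip_clip([face, face], [(2, 1, 2, 2), (2, 1, 2, 2)])
```

The result is still an ordered list of still images, which is the form of input the translation model 22 is described as receiving.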

The translation model 22 is a program generated by the translation model generation unit 24.
To generate the translation model 22, a large number of data sets (teacher data) used for machine learning are first stored, as preprocessing, in the storage 13, which serves as the storage unit.
Each data set associates video information from which the lip movement made when a speaker utters a word can be identified with information on a word of another language that cannot be read from that lip movement but corresponds to the uttered word (that is, a word of another language having the same meaning as the uttered word).
Here, "cannot be read" means unreadable according to common technical knowledge, but is not limited to this and also covers "difficult to read" and "extremely difficult to read".
"Same meaning" is not limited to exactly the same meaning and also covers "almost the same meaning" and "roughly the same meaning".
"Other language" means a language different from the one the speaker actually spoke. For example, when the speaker utters words in English, it means a language other than English (for example, Japanese).
In other words, in the relationship between the "source language" (the language before translation) and the "target language" (the language after translation), the "other language" corresponds to the target language.
"Word" covers not only phrases and sentences that combine multiple words but also individual words.

FIG. 4 is an example of a data set (teacher data).
As shown in FIG. 4, the data set of this embodiment consists of (a) video information from which the lip movement made when an English speaker utters a word can be identified (input data) and (b) Japanese text information for that word (output data).
That is, the data set shown in FIG. 4 is intended for English-to-Japanese translation.
For example, data set No. 1 consists of (a) pre-recorded video information of the lip movement made when an English speaker says "I have an apple." in English and (b) the Japanese text information "私はリンゴを持っている。" ("I have an apple.").
Data set No. 2 consists of (a) pre-recorded video information of the lip movement made when an English speaker says "I have a pen." in English and (b) the Japanese text information "私はペンを持っている。" ("I have a pen.").
Such a data set is built, for example, by showing a simultaneous interpreter the video information of (a) (including the speaker's voice), having the interpreter translate the English, and linking the translation result to the video information of (a) as the text information of (b).
This can be expected to transfer the intuitive translation ability and technique of a simultaneous interpreter to the model.
Alternatively, a known application program that uses voice recognition to automatically translate the language spoken by a speaker into another language can be used, with its translation result serving as the information of (b).
Multiple (many) patterns of such data sets are stored for each word, and data sets are stored for a wide variety of words.
Although FIG. 4 shows data sets for phrases and sentences, data sets for individual words can also be included.
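One way to picture the data set of FIG. 4 is as a list of records that pair a lip-movement clip with target-language text only. The Python sketch below is an illustration under assumptions: the class name `TrainingPair`, the helper `group_by_target`, and the file paths are hypothetical and do not come from this description.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TrainingPair:
    """One data set entry: (a) lip-movement video, (b) target-language text."""
    number: int
    video_path: str   # hypothetical location of the pre-recorded lip video
    target_text: str  # the same utterance written in the target language


def group_by_target(pairs: List[TrainingPair]) -> Dict[str, List[str]]:
    """Collect the many video patterns stored for each target sentence."""
    grouped: Dict[str, List[str]] = {}
    for pair in pairs:
        grouped.setdefault(pair.target_text, []).append(pair.video_path)
    return grouped


# English speakers on video, Japanese text as the label (paths are made up).
DATASET = [
    TrainingPair(1, "clips/i_have_an_apple_speaker01.mp4", "私はリンゴを持っている。"),
    TrainingPair(2, "clips/i_have_a_pen_speaker01.mp4", "私はペンを持っている。"),
    TrainingPair(3, "clips/i_have_a_pen_speaker02.mp4", "私はペンを持っている。"),
]
patterns = group_by_target(DATASET)
```

Note that the record has no field for the source-language (English) text: the only label attached to each clip is the target-language sentence.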

FIG. 4 shows a data set for translating English, the source language (the language before translation), into Japanese, the target language (the language after translation).
Various translation directions can therefore be supported by changing the data that make up the data set according to the source and target languages.
For example, for Japanese (source language) to English (target language) translation, it suffices to prepare data sets in which (a) is video information from which the lip movement made when uttering a Japanese word can be identified and (b) is the English text information for that word.
Likewise, for English (source language) to French (target language) translation, it suffices to prepare data sets in which (a) is video information from which the lip movement made when uttering an English word can be identified and (b) is the French text information for that word.

The translation model generation unit 24 generates the trained translation model 22 by performing machine learning with the data sets stored in the storage 13 as teacher data.
The machine learning is performed using a neural network, such as known deep learning.
Specifically, the neural network is trained so that when the data of FIG. 4(a) is input to the input layer, the data of FIG. 4(b) is output from the output layer with high probability (for example, 90% or more).
That is, the trained translation model 22 is generated by optimizing the weight coefficients and other parameters of each neuron so that, when video information from which the lip movement made when a speaker utters a word can be identified is input to the input layer, the information on the corresponding word of the other language is derived with high probability in the output layer.
Because the object of machine learning is lip movement represented by a plurality of still images with a time-series element (video), and the words represented by that lip movement also have a time-series element, a known CNN (convolutional neural network), RNN (recurrent neural network), LSTM, or the like is used.
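The idea of optimizing weights until the correct target sentence is output with high probability can be illustrated with a deliberately small Python sketch. This is not the CNN, RNN, or LSTM named above: as a stand-in it trains a single softmax output layer by gradient descent, and it assumes each lip clip has already been reduced to a short feature vector. The feature values, learning rate, and epoch count are made up for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw output-layer scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights: List[List[float]], x: Sequence[float]) -> List[float]:
    """Probability of each target sentence for one feature vector."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])


def train(features: List[List[float]], labels: List[int], n_classes: int,
          lr: float = 0.5, epochs: int = 200) -> List[List[float]]:
    """Fit one weight row per target sentence by descending the cross-entropy loss."""
    weights = [[0.0] * len(features[0]) for _ in range(n_classes)]
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = predict(weights, x)
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)
                for i, value in enumerate(x):
                    weights[c][i] -= lr * grad * value
    return weights


# Hypothetical lip features (last value is a bias term) and their sentence ids.
TARGET_SENTENCES = ["私はリンゴを持っている。", "私はペンを持っている。"]
TOY_FEATURES = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
TOY_LABELS = [0, 1]
WEIGHTS = train(TOY_FEATURES, TOY_LABELS, n_classes=len(TARGET_SENTENCES))
```

After training, `predict(WEIGHTS, TOY_FEATURES[0])` assigns the first target sentence a probability above the 90% level mentioned in the text. A real model would replace the hand-made feature vectors with a network that consumes the frame sequence directly.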

The translation model 22 is one function of the translation unit. When video information from which the movement of the speaker's lips can be identified is input, it outputs information on the corresponding word of the other language as translation information.
Therefore, by capturing video of the speaker with the photographing device 16 and feeding that video information to the translation model 22, words of the other language are output as translation information based on the speaker's lip movement whenever the speaker utters something. This enables so-called simultaneous interpretation.
FIG. 5 is a schematic diagram of the translation model 22 generated by the translation model generation unit 24.
As shown in FIG. 5, the translation model 22 is a multi-layer neural network having an input layer, an intermediate layer, and an output layer.
FIG. 5 shows a case in which video information from which the lip movement made when an English speaker says "I have a pen." can be identified is input to the input layer, and "私はリンゴを持っている。" ("I have an apple.") is output (classified) in the output layer.
That is, when a speaker utters a word, inputting to the translation model 22 the video information from which the lip movement made during that utterance can be identified causes the translation model 22 to output the corresponding word of the other language as translation information.
Note that "the lip movement made when the speaker utters the word" includes the lip movement that would result if the speaker were to utter the word.
The model can therefore also handle so-called silent mouthing, in which the speaker moves only the mouth without producing sound.
Video information stored in the storage 13 or received from the outside by the communication device 17 can also be input to the translation model 22.
Accordingly, not only simultaneous interpretation but also translation based on a speaker's lip movement identified from past video information is possible.
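The inference step can be sketched in a few lines of Python, assuming the trained model is available as a function that returns one probability per known target sentence. The type aliases, the `translate` function, and the stub model below are hypothetical; the same call would apply whether the clip comes from the camera, the storage, or the communication device.

```python
from typing import Callable, List, Sequence

LipClip = Sequence[Sequence[float]]       # assumed: one feature vector per frame
Model = Callable[[LipClip], List[float]]  # returns one probability per target sentence


def translate(clip: LipClip, model: Model, target_sentences: List[str]) -> str:
    """Return the target-language sentence the model rates most probable."""
    probs = model(clip)
    if len(probs) != len(target_sentences):
        raise ValueError("model output size must match the sentence list")
    best = max(range(len(probs)), key=probs.__getitem__)
    return target_sentences[best]


# Stub standing in for a trained model; the probabilities are made up.
def stub_model(clip: LipClip) -> List[float]:
    return [0.03, 0.97]


sentences = ["私はリンゴを持っている。", "私はペンを持っている。"]
translation = translate([[0.1, 0.2], [0.3, 0.4]], stub_model, sentences)
```

No source-language text appears anywhere in this path: the clip goes in and the target-language sentence comes out.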

The output unit 23 causes the output device 15 to output the translation information output from the translation model 22.
In the wearable terminal of this embodiment, translation information can be displayed on the lens 151 by projection from a projection device, which is an example of the output device 15 (see FIG. 1).
FIG. 6 shows an example in which "私はリンゴを持っている。" ("I have an apple.") is displayed on the lens 151 as translation information.
In this case, the user can see the translation information displayed on the lens 151 while wearing the wearable terminal.
The wearable terminal can therefore be used, for example, for simultaneous interpretation and can support smooth conversation with the speaker.
When the translation device 1 is a personal computer, a smartphone, or a tablet terminal, the translation information can be displayed on a display or monitor.
The output unit 23 can also store the translation information in the storage 13.
This makes it possible, for example, to record meeting minutes or interviews automatically.
The output unit 23 can also have the communication device 17 transmit the translation information to the outside.

Next, the translation method of the present invention will be described.
FIG. 7 is a flowchart showing the processing procedure of the translation method of the present invention.
As shown in FIG. 7, video information from which the movement of the speaker's lips can be identified is first input (S1). This step corresponds to the pre-process of the present invention.
Next, information on the word of the other language associated with the lip movement is output as translation information (S2). This step corresponds to the post-process of the present invention.
Specifically, when video information from which the movement of the speaker's lips can be identified is input to the translation model 22, the translation model 22 outputs information on the word of the other language corresponding to that lip movement as translation information.

That is, the translation method of the present invention has two main steps: a pre-process of inputting, when a speaker utters a word, video information from which the movement of the speaker's lips can be identified, and a post-process of translating the word uttered by the speaker into a word of the other language based on the video information input in the pre-process.
In the prior art, by contrast, an intermediate process would read the source-language words uttered by the speaker from the video information input in the pre-process, and the source-language words read in that intermediate process would then be translated into words of the other language. The translation method of the present invention skips this intermediate process and outputs the other language as translation information directly from the source-language lip movement input in the pre-process.
Because this translation method eliminates a wasteful step, translation processing can be made faster.
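The difference between the conventional flow and the two-step flow described above can be made concrete with a small Python sketch. The three stage functions are hypothetical lookup-table stand-ins for trained components, and the clip identifier `"clip_pen"` is made up. The sketch only shows how many stages each flow runs and says nothing about real processing times.

```python
from typing import Callable, List

Stage = Callable[[str], str]


def run_pipeline(stages: List[Stage], lip_video: str) -> str:
    """Apply each stage in order and return the final result."""
    result = lip_video
    for stage in stages:
        result = stage(result)
    return result


# Hypothetical stand-ins for trained components.
def read_source_words(lip_video: str) -> str:
    """Intermediate process: lip movement -> source-language (English) text."""
    return {"clip_pen": "I have a pen."}[lip_video]


def translate_text(source_text: str) -> str:
    """Text-to-text translation of the recognized source-language words."""
    return {"I have a pen.": "私はペンを持っている。"}[source_text]


def direct_model(lip_video: str) -> str:
    """Lip movement -> target-language (Japanese) text in a single step."""
    return {"clip_pen": "私はペンを持っている。"}[lip_video]


conventional = [read_source_words, translate_text]  # with the intermediate process
proposed = [direct_model]                           # intermediate process omitted

conventional_result = run_pipeline(conventional, "clip_pen")
proposed_result = run_pipeline(proposed, "clip_pen")
```

Both flows end in the same Japanese sentence, but the proposed flow reaches it in one stage and never materializes the English text.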

Here, the "correlation" between the input data and the output data will be described.
Specifically, this is the correlation between (a) the "lip movement" (input data) and (b) the "words in the other language" (output data) (see FIG. 4).
The "words in the other language" in (b) are, for example, Japanese words that cannot be read from the lip movement (a) of an English speaker uttering words in English. However, because they are Japanese words with the same meaning as the English words that can be read from (a), (a) and (b) are correlated.
Moreover, although source-language words can be read from a speaker's lip movement, reading words in another language directly from that lip movement (that is, reading a language the speaker is not actually speaking) is an idea that goes beyond common technical knowledge, so this correlation is novel and distinctive.
By using input data and output data that have this distinctive correlation, the present invention translates the words spoken by the speaker directly into another language from the lip movement.
That is, the present invention solves two problems of the conventional approach: the delay in translation processing caused by the unnecessary step of first recognizing the "source-language" words from the speaker's lip movement, and the storage-capacity burden in registration processing caused by including unneeded "source-language" data in the registered information.
According to the present invention, the translation process can therefore be sped up, among other benefits.

In addition, conventional translation is performed through multiple processes: first, the "source-language" words are recognized from the "lip movement", and second, the recognized "source-language words" are translated into "words in the other language". If a misrecognition or mistranslation occurs in either the first or the second process, the final translation accuracy decreases.
Consequently, appropriate data had to be registered for each process, and translation accuracy decreased if any of that registered data was inappropriate.
In contrast, with the present invention, that is, with the "correlation" according to the present invention, these conventional problems do not arise.
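The accuracy loss in the two-process flow can be illustrated with simple arithmetic. The sketch below assumes that errors in the two processes are independent, which the source does not state, and the accuracy figures are hypothetical values chosen only for illustration.

```python
def cascaded_accuracy(p_read: float, p_translate: float) -> float:
    """Probability that both processes are correct, assuming independent errors."""
    if not (0.0 <= p_read <= 1.0 and 0.0 <= p_translate <= 1.0):
        raise ValueError("accuracies must be probabilities in [0, 1]")
    return p_read * p_translate


# Hypothetical figures: lip reading 90% accurate, text translation 90% accurate.
overall = cascaded_accuracy(0.9, 0.9)
print(f"{overall:.2f}")  # prints 0.81
```

Under this assumption the overall accuracy of the cascade can never exceed that of its weaker process, which is the compounding effect the paragraph above describes.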

(Other embodiments)
The translation device 1 can also be realized by combining a plurality of devices.
FIG. 8 shows examples of a translation device 1 composed of a plurality of devices: (a) is an example composed of a wearable terminal and a smartphone, and (b) is an example composed of a server and various devices.
As shown in FIG. 8, the translation device 1 can be realized by connecting a plurality of devices so that they can exchange data over a predetermined communication line.
The translation device 1 shown in FIG. 8(a) has a configuration in which a wearable terminal and a smartphone or the like communicate via Bluetooth (registered trademark).
With this configuration, for example, the translation program and the functions executed by the program (software configuration) can reside on the smartphone side, so that the wearable terminal side needs only a simple configuration that images the speaker and displays the translation result.
FIG. 8(b) shows a configuration in which a server, for example in a cloud computing environment, and various terminal devices communicate via an Internet line or a mobile communication line (5G or the like).
With this configuration, for example, the translation program and the functions executed by the program (software configuration) can reside on the server side, so that the terminal device side needs only a simple configuration that images the speaker and displays the translation result.
In particular, using high-speed, high-capacity, low-latency 5G keeps the terminal device configuration simple while enabling smooth, stress-free translation processing, as if it were being executed on the terminal device alone.
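The division of roles in FIG. 8 can be sketched as a thin client and a host. This is a minimal sketch under stated assumptions: the class names TranslationHost and WearableClient, the JSON message layout, and the lookup table standing in for the translation model are all hypothetical, and an in-process function call stands in for the Bluetooth or 5G link.

```python
import json
from typing import Callable, Dict, List


class TranslationHost:
    """Smartphone or server side: holds the translation function."""

    def __init__(self, table: Dict[str, str]) -> None:
        self._table = table  # stand-in for the trained translation model

    def handle(self, request: bytes) -> bytes:
        frames: List[str] = json.loads(request.decode("utf-8"))["frames"]
        words = self._table.get("-".join(frames), "")  # empty string if unknown
        return json.dumps({"translation": words}).encode("utf-8")


class WearableClient:
    """Terminal side: only captures lip frames and displays the result."""

    def __init__(self, send: Callable[[bytes], bytes]) -> None:
        self._send = send            # stand-in for the communication line
        self.display: List[str] = []  # stand-in for the output unit

    def capture_and_translate(self, frames: List[str]) -> str:
        request = json.dumps({"frames": frames}).encode("utf-8")
        reply = self._send(request)
        words = json.loads(reply.decode("utf-8"))["translation"]
        self.display.append(words)
        return words


host = TranslationHost({"open-spread-round": "こんにちは"})
client = WearableClient(host.handle)
print(client.capture_and_translate(["open", "spread", "round"]))
```

Because the client depends only on a send function, the same terminal code could be paired with either a nearby smartphone or a remote server by swapping the transport.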

As described above, the translation program of the present invention causes a computer to function as a translation unit that, when a speaker utters words, outputs as translation information the information on words in another language that cannot be read from the speaker's lip movement when uttering those words and that is associated with the information on those words.
Specifically, the program causes the computer to function as: a storage unit that stores, in association with each other, video information from which the lip movement of a speaker uttering words can be identified and information on words in another language that correspond to those words and cannot be read from that lip movement; and a translation model generation unit 24 that generates a trained translation model 22 by learning teacher data stored in the storage unit, with the video information from which the lip movement can be identified as input and the information on the words in the other language as output. When a speaker utters words, the translation unit inputs video information from which the speaker's lip movement can be identified into the translation model 22 and outputs the information on the words in the other language output from the translation model 22 as translation information.
The translation device 1 of the present invention is a computer, and the processor 11 executes the translation program so that the translation unit (translation model 22), the storage unit, and the translation model generation unit 24 function.
The translation method of the present invention has a pre-step of inputting, when a speaker utters words, video information from which the speaker's lip movement can be identified, and a post-step of translating the words uttered by the speaker into words in the other language based on the video information input in the pre-step. The post-step translates the words uttered by the speaker into words in the other language without an intermediate step of reading the language uttered by the speaker from the video information input in the pre-step.
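The relationship between the storage unit, the translation model generation unit 24, and the translation unit can be sketched as follows. This is a toy illustration only: the source does not specify a learning algorithm, so a nearest-centroid classifier over hypothetical two-dimensional lip features is swapped in, and the names Storage, generate_translation_model, and translate are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


class Storage:
    """Storage unit: lip-movement features paired with target-language words."""

    def __init__(self) -> None:
        self.pairs: List[Tuple[List[float], str]] = []

    def register(self, lip_features: Sequence[float], target_words: str) -> None:
        self.pairs.append((list(lip_features), target_words))


def generate_translation_model(storage: Storage) -> Dict[str, List[float]]:
    """Model generation: learn one feature centroid per target-language label."""
    grouped: Dict[str, List[List[float]]] = defaultdict(list)
    for features, words in storage.pairs:
        grouped[words].append(features)
    return {words: [sum(col) / len(col) for col in zip(*rows)]
            for words, rows in grouped.items()}


def translate(model: Dict[str, List[float]], lip_features: Sequence[float]) -> str:
    """Translation unit: return the label whose centroid is nearest to the input."""
    def squared_distance(centroid: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(lip_features, centroid))
    return min(model, key=lambda words: squared_distance(model[words]))


storage = Storage()
storage.register([0.1, 0.9], "こんにちは")   # hypothetical features for "hello"
storage.register([0.2, 0.8], "こんにちは")
storage.register([0.9, 0.1], "ありがとう")   # hypothetical features for "thank you"

model = generate_translation_model(storage)
print(translate(model, [0.15, 0.85]))
```

Note that the teacher data contains no source-language text at all: each record pairs lip features with target-language words only, which mirrors the registered information described above.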

According to the present invention, words are translated directly into words in another language based on the speaker's lip movement when uttering them, so delay in the translation process can be eliminated and the process can be sped up.
Specifically, the translation process can be sped up by making the configuration, processing, and steps simpler than in the conventional approach.
That is, the present invention enables immediate translation processing.
Therefore, for example, the present invention makes a stress-free simultaneous interpretation service possible.
Specifically, it can make conversation smoother when meeting and talking with a foreigner in person.
In particular, wearing a wearable terminal as the translation device 1 allows, for example, discussions and meetings with foreigners to proceed smoothly.
Likewise, even when a personal computer or smartphone is used as the translation device 1, smooth communication can be achieved in settings such as international conferences held by web conference.

The translation program, translation device, and translation method of the present invention have been described above with reference to preferred embodiments. However, the translation device and the like of the present invention are not limited to those embodiments, and it goes without saying that various modifications are possible within the scope of the present invention.
For example, the speaker's lip movement may be identified from the input video, a lip movement similar to the identified one may be extracted from a data set, and the words in the other language associated with the extracted lip movement may be output as translation information.
Voice information may also be output as translation information. This can be realized, for example, by using a known text-to-speech program capable of converting text information into voice information.
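The similarity-based variant can be sketched as a nearest-match lookup over a data set. The source does not define how similarity is measured, so this sketch assumes lip movements are encoded as sequences of hypothetical mouth-shape labels and uses the standard library's difflib.SequenceMatcher as the similarity measure; the function name most_similar_translation is also an assumption.

```python
from difflib import SequenceMatcher
from typing import Dict, Sequence, Tuple

LipMovement = Tuple[str, ...]


def most_similar_translation(dataset: Dict[LipMovement, str],
                             query: Sequence[str]) -> Tuple[str, float]:
    """Return the words paired with the stored lip movement most similar to the query."""
    best_words, best_score = "", -1.0
    for lips, words in dataset.items():
        # ratio() is a similarity in [0, 1]; 1.0 means identical sequences.
        score = SequenceMatcher(None, lips, tuple(query)).ratio()
        if score > best_score:
            best_words, best_score = words, score
    return best_words, best_score


# Toy data set (assumed): stored lip movements paired with target-language words.
dataset = {
    ("open", "spread", "round"): "こんにちは",
    ("spread", "closed", "spread"): "ありがとう",
}

# The query has one extra frame, so it matches no stored entry exactly.
words, score = most_similar_translation(dataset, ("open", "spread", "spread", "round"))
print(words)
```

A practical system would likely apply a minimum similarity threshold before accepting a match; the returned score is exposed so that a caller could do so.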

The present invention can be used for automatic translation into other languages, in particular for simultaneous interpretation.

1 Wearable terminal (translation device)
16 Imaging device
22 Translation model
24 Translation model generation unit

Claims

A translation program capable of translating words spoken by a speaker into words in another language having the same meaning, the translation program
causing a computer to function as:
a storage unit that stores, in association with each other, video information from which the lip movement of a speaker uttering words can be identified and information on words in another language corresponding to those words;
a translation model generation unit that generates a trained translation model by learning teacher data stored in the storage unit, with the video information from which the lip movement can be identified as input and the information on the words in the other language as output; and
a translation unit that translates the words spoken by a speaker into the words in the other language output from the translation model by inputting video information from which the speaker's lip movement can be identified into the translation model.

A translation program capable of translating words spoken by a speaker into words in another language having the same meaning, the translation program
causing a computer to function as:
a storage unit that stores, in association with each other, video information from which the lip movement of a speaker uttering words can be identified and information on words in another language corresponding to those words; and
a translation unit that extracts, from the storage unit, a lip movement similar to the speaker's lip movement and translates the words spoken by the speaker into the words in the other language associated with the extracted lip movement.

A translation device capable of translating words spoken by a speaker into words in another language having the same meaning, the translation device comprising:
a storage unit that stores, in association with each other, video information from which the lip movement of a speaker uttering words can be identified and information on words in another language corresponding to those words;
a translation model generation unit that generates a trained translation model by learning teacher data stored in the storage unit, with the video information from which the lip movement can be identified as input and the information on the words in the other language as output; and
a translation unit that translates the words spoken by a speaker into the words in the other language output from the translation model by inputting video information from which the speaker's lip movement can be identified into the translation model.

A translation method capable of translating words spoken by a speaker into words in another language having the same meaning, the translation method comprising:
a step of storing, in association with each other, video information from which the lip movement of a speaker uttering words can be identified and information on words in another language corresponding to those words;
a step of generating a trained translation model by learning teacher data stored in the storing step, with the video information from which the lip movement can be identified as input and the information on the words in the other language as output; and
a step of translating the words uttered by a speaker into the words in the other language output from the translation model by inputting video information from which the speaker's lip movement can be identified into the translation model.

A wearable terminal capable of running the program according to claim 1 or 2, comprising:
an imaging unit capable of imaging a predetermined range while a user is wearing the wearable terminal; and
an output unit capable of outputting predetermined information while the user is wearing the wearable terminal, wherein
the translation unit
is capable of translating, while the user is wearing the wearable terminal, the words spoken by a speaker imaged by the imaging unit into words in the other language based on the speaker's lip movement when uttering those words, and
the output unit
is capable of outputting, while the user is wearing the wearable terminal, the information on the words in the other language translated by the translation unit.