JP2023509031A - Translation method, apparatus, device and computer program based on multimodal machine learning

Info

Publication number: JP2023509031A
Application number: JP2022540553A
Authority: JP
Inventors: 凡▲東▼ 孟; 永▲競▼ 尹; ▲勁▼松 ▲蘇▼; 杰周
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-05-20
Filing date: 2021-04-29
Publication date: 2023-03-06
Also published as: US20220245365A1; CN111597830A; WO2021233112A1

Abstract

Disclosed is a translation method based on multimodal machine learning, relating to the technical field of artificial intelligence. The method performs semantic association on source statements of n modalities to construct a semantic association graph. In the semantic association graph, first connecting edges connect semantic nodes of the same modality and second connecting edges connect semantic nodes of different modalities, so that the graph fully expresses the semantic associations among the source statements of the multiple modalities. Sufficient semantic fusion is then performed on the feature vectors in the semantic association graph to obtain encoded feature vectors, and the encoded feature vectors are decoded to obtain a more accurate target statement. The target statement is closer to the content, emotion, linguistic context and the like jointly expressed by the multimodal source statements.

Description

The present application relates to the technical field of artificial intelligence, and in particular to a translation method, apparatus, device and storage medium based on multimodal machine learning.

This application claims priority to Chinese Patent Application No. 2020104325972, filed on May 20, 2020 and entitled "Translation method, apparatus, device and storage medium based on multimodal machine learning", the entire contents of which are incorporated herein by reference.

Machine translation is the process of using a computer to convert one natural language into another natural language.

In some application scenarios, a machine translation model can translate source language content expressed in several different forms into a target language, that is, translate a multimodal source language into a target language. For example, a picture and its English annotation are obtained; the machine translation model extracts features from the picture and from the English annotation separately, fuses the extracted features, and translates based on the fused features to obtain a French annotation corresponding to the picture and the English annotation.

Embodiments of the present application provide a translation method, apparatus, device and storage medium based on multimodal machine learning. In the feature encoding process, sufficient semantic fusion can be performed on source language content of multiple modalities, so that the target statement decoded from the encoded vectors is closer to the content, emotion and the like expressed by the source language. The technical solutions are as follows.

According to one aspect of the present application, a translation method based on multimodal machine learning is provided. The method is performed by a computer device and includes:
constructing a semantic association graph based on n source statements belonging to different modalities, where the semantic association graph includes semantic nodes of n different modalities, first connecting edges used to connect semantic nodes of the same modality, and second connecting edges used to connect semantic nodes of different modalities, each semantic node indicates one semantic unit of the source statement in one modality, and n is a positive integer greater than 1;
extracting a plurality of first word vectors from the semantic association graph;
encoding the plurality of first word vectors to obtain n encoded feature vectors; and
decoding the n encoded feature vectors to obtain a translated target statement.

According to another aspect of the present application, a translation apparatus based on multimodal machine learning is provided. The apparatus includes:
a semantic association module, used to construct a semantic association graph based on n source statements belonging to different modalities, where the semantic association graph includes semantic nodes of n different modalities, first connecting edges used to connect semantic nodes of the same modality, and second connecting edges used to connect semantic nodes of different modalities, each semantic node indicates one semantic unit of the source statement in one modality, and n is a positive integer greater than 1;
a feature extraction module, used to extract a plurality of first word vectors from the semantic association graph;
a vector encoding module, used to encode the plurality of first word vectors to obtain n encoded feature vectors; and
a vector decoding module, used to decode the n encoded feature vectors to obtain a translated target statement.

According to another aspect of the present application, a computer device is provided. The computer device includes:
a memory; and
a processor connected to the memory,
where the processor is configured to load and execute executable instructions to implement the translation method based on multimodal machine learning described in the above aspect and its optional embodiments.

According to another aspect of the present application, a computer-readable storage medium is provided. The computer-readable storage medium stores at least one instruction, at least one program segment, a code set or an instruction set, which is loaded and executed by a processor to implement the translation method based on multimodal machine learning described in the above aspect and its optional embodiments.

To describe the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and a person skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic structural diagram of a multimodal machine translation model provided by an exemplary embodiment of the present application.
FIG. 2 is a schematic structural diagram of a computer system provided by an exemplary embodiment of the present application.
FIG. 3 is a flowchart of a translation method based on multimodal machine learning provided by an exemplary embodiment of the present application.
FIG. 4 is a flowchart of constructing a semantic association graph provided by an exemplary embodiment of the present application.
FIG. 5 is a flowchart of a translation method based on multimodal machine learning provided by another exemplary embodiment of the present application.
FIG. 6 is a flowchart of a translation method based on multimodal machine learning provided by another exemplary embodiment of the present application.
FIG. 7 is a schematic structural diagram of a multimodal machine translation model provided by another exemplary embodiment of the present application.
FIG. 8 is a curve chart of model test results provided by an exemplary embodiment of the present application.
FIG. 9 is a curve chart of model test results provided by another exemplary embodiment of the present application.
FIG. 10 is a curve chart of model test results provided by another exemplary embodiment of the present application.
FIG. 11 is a block diagram of a translation apparatus based on multimodal machine learning provided by an exemplary embodiment of the present application.
FIG. 12 is a schematic structural diagram of a server provided by an exemplary embodiment of the present application.

To make the objectives, technical solutions and advantages of the present application clearer, the embodiments of the present application are described in more detail below with reference to the drawings.

The terms used in this application are explained as follows.

Artificial Intelligence (AI): the theory, methods, technologies and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that seeks to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision-making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operating/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include several major directions: computer vision, speech processing, natural language processing, and machine learning/deep learning.

Natural Language Processing (NLP) is an important direction in computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field concerns natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing technologies generally include text processing, semantic understanding, machine translation, robot question answering, and knowledge graphs.

Machine Learning (ML) is an interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills, and how they reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and analogical learning.

The present application provides a multimodal machine translation model that can accurately translate source statements of n different modalities into a target statement. Here, a modality refers to the form in which language is expressed; for example, a statement may be expressed graphically or in text. A source statement is a statement to be translated, and includes a to-be-translated sentence of a first language in text form and to-be-translated content in non-text form. A target statement is a translated sentence of a second language in text form, where the second language differs from the first language. For example, the source statements include an English statement and an illustration of that English statement, and the multimodal machine translation model translates them into the corresponding Chinese statement.

FIG. 1 shows a schematic structural diagram of a multimodal machine translation model 100 provided by an exemplary embodiment of the present application. The multimodal machine translation model 100 includes a multimodal graph representation layer 101, a first word vector layer 102, a multimodal fusion encoder 103 and a decoder 104.
The multimodal graph representation layer 101 is used to perform semantic association on the source language of n modalities to obtain a semantic association graph. The semantic association graph includes semantic nodes of n different modalities, first connecting edges used to connect semantic nodes of the same modality, and second connecting edges used to connect semantic nodes of different modalities, where n is a positive integer greater than 1. Each semantic node indicates one semantic unit of the source statement in one modality. In English, for example, one semantic node corresponds to one word; in Chinese, one semantic node corresponds to one Chinese character.

The first word vector layer 102 is used to extract a plurality of first word vectors from the semantic association graph.
The multimodal fusion encoder 103 is used to encode the plurality of first word vectors to obtain n encoded feature vectors.
The decoder 104 is used to decode the n encoded feature vectors to obtain the translated target statement.

In some optional embodiments, the multimodal graph representation layer 101 is used to: obtain n groups of semantic nodes, where each group of semantic nodes corresponds to the source statement of one modality; add a first connecting edge between any two semantic nodes of the same modality; and add a second connecting edge between any two semantic nodes of different modalities, to obtain the semantic association graph.

In some optional embodiments, the multimodal graph representation layer 101 is used to extract semantic nodes from the source language of each modality to obtain n groups of semantic nodes corresponding to the source language of the n modalities.
The multimodal graph representation layer 101 is further used to connect semantic nodes within the same modality across the n groups using first connecting edges, and to connect semantic nodes of different modalities across the n groups using second connecting edges, to obtain the semantic association graph.

In some optional embodiments, the source statements of the n modalities include a first source statement in text form and a second source statement in non-text form, and the n groups of semantic nodes include first semantic nodes and second semantic nodes.
The multimodal graph representation layer 101 is used to: obtain the first semantic nodes, which are obtained by processing the first source statement; obtain candidate semantic nodes, which are obtained by processing the second source statement; obtain a first probability distribution of the candidate semantic nodes, which is computed according to the semantic association between the first semantic nodes and the candidate semantic nodes; and determine the second semantic nodes from the candidate semantic nodes, where the multimodal graph representation layer determines the second semantic nodes based on the first probability distribution.

In some optional embodiments, the multimodal graph representation layer 101 is used to: extract the first semantic nodes from the first source statement and extract candidate semantic nodes from the second source statement; compute a first probability distribution of the candidate semantic nodes according to the semantic association between the first semantic nodes and the candidate semantic nodes; and determine the second semantic nodes from the candidate semantic nodes based on the first probability distribution.
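The selection of second semantic nodes from candidates can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the association scores are computed or how the probability distribution is turned into a decision, so the score matrix, the softmax normalization, the 0.5 threshold and the function names `softmax` and `select_second_nodes` are all hypothetical.

```python
import math


def softmax(scores):
    """Turn raw association scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_second_nodes(association_scores, candidates, threshold=0.5):
    """Keep candidates that some first (text) node points to with high probability.

    association_scores[t][c] is an assumed score between first semantic node t
    and candidate semantic node c; the threshold rule is an assumption.
    """
    kept = set()
    for row in association_scores:
        probs = softmax(row)  # distribution over candidates for one text node
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            kept.add(best)
    return [candidates[i] for i in sorted(kept)]


candidates = ["region_a", "region_b", "region_c"]
scores = [
    [4.0, 0.0, 0.0],  # strongly associated with region_a
    [0.0, 0.1, 0.0],  # no clear association, nothing selected
    [0.0, 0.0, 3.0],  # strongly associated with region_c
]
second_nodes = select_second_nodes(scores, candidates)
```

In this toy input, `region_b` is dropped because no text node assigns it a probability above the assumed threshold.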

In some optional embodiments, the multimodal graph representation layer 101 is used to add a first connecting edge of the i-th type between any two semantic nodes within the same modality in the i-th group of semantic nodes, where the first connecting edge of the i-th type corresponds to the i-th modality and i is a positive integer less than or equal to n.

That is, the multimodal graph representation layer 101 determines the first connecting edge of the i-th type corresponding to the i-th modality, and uses it to connect the semantic nodes within the same modality in the i-th group of semantic nodes, where i is a positive integer less than or equal to n.
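The edge-construction rule described above can be sketched in a few lines. This is a minimal illustration, assuming each modality has already been reduced to a list of semantic units; the function name `build_semantic_graph`, the tuple layout and the sample nodes are hypothetical and not taken from the application.

```python
from itertools import combinations


def build_semantic_graph(node_groups):
    """Build nodes and edges from n groups of semantic units.

    node_groups[i] holds the semantic units of modality i. Each node is a
    (modality_index, position, unit) tuple; edges refer to node ids.
    """
    nodes = [
        (modality, pos, unit)
        for modality, group in enumerate(node_groups)
        for pos, unit in enumerate(group)
    ]
    intra_edges, inter_edges = [], []
    for a, b in combinations(range(len(nodes)), 2):
        if nodes[a][0] == nodes[b][0]:
            # First connecting edge, typed by the modality it belongs to.
            intra_edges.append((a, b, nodes[a][0]))
        else:
            # Second connecting edge between nodes of different modalities.
            inter_edges.append((a, b))
    return nodes, intra_edges, inter_edges


text_nodes = ["two", "boys", "playing"]   # one node per English word
image_nodes = ["region_1", "region_2"]    # assumed visual units
nodes, intra, inter = build_semantic_graph([text_nodes, image_nodes])
```

Connecting every pair of nodes follows the "any two semantic nodes" wording; a practical system might prune inter-modal edges, which is outside what this sketch shows.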

In some optional embodiments, the n encoded feature vectors are obtained by performing intra-modal fusion and inter-modal fusion e times on the plurality of first word vectors. Intra-modal fusion refers to semantic fusion between first word vectors of the same modality, and inter-modal fusion refers to semantic fusion between first word vectors of different modalities. Here, e is a positive integer.

In some optional embodiments, the multimodal fusion encoder 103 includes e encoding modules 1031 connected in series. Each encoding module 1031 includes n intra-modal fusion layers 11 and n inter-modal fusion layers 12 in one-to-one correspondence with the n modalities, where e is a positive integer.
The first encoding module 1031 is used to input the first word vectors into the n intra-modal fusion layers 11 of the first encoding module, each of which performs semantic fusion within one modality, to obtain n first hidden layer vectors, one for each of the n modalities.
The first encoding module 1031 is further used to input the n first hidden layer vectors into each inter-modal fusion layer 12 of the first encoding module, which performs semantic fusion across different modalities on the n first hidden layer vectors, to obtain n first intermediate vectors, one for each of the n modalities.
The j-th encoding module 1031 is used to perform the j-th encoding process on the n first intermediate vectors, and so on until the last encoding module outputs n encoded feature vectors, one for each of the n modalities, where j is a positive integer greater than 1 and less than or equal to e.
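The stacking of e encoding modules, each applying intra-modal fusion followed by inter-modal fusion, can be sketched as below. This is a toy illustration, not the patented encoder: it assumes dot-product attention without learned weights for both fusion steps and a residual-style sum for merging cross-modal information, and the names `attend`, `encoding_module` and `encode` are hypothetical.

```python
import math


def attend(queries, memory):
    """Scaled dot-product attention of each query row over the memory rows."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * m[d] for w, m in zip(weights, memory)) / total
            for d in range(dim)
        ])
    return out


def encoding_module(states):
    """states: list of n matrices, one per modality (rows are node vectors)."""
    # Intra-modal fusion: each modality attends only to its own nodes.
    hidden = [attend(h, h) for h in states]
    # Inter-modal fusion: each modality attends to the nodes of all other
    # modalities and adds the result to its own hidden vectors (assumed merge).
    fused = []
    for i, h in enumerate(hidden):
        others = [row for j, other in enumerate(hidden) if j != i for row in other]
        cross = attend(h, others)
        fused.append([[a + b for a, b in zip(r, c)] for r, c in zip(h, cross)])
    return fused


def encode(first_word_vectors, e):
    """Apply e encoding modules in series; returns one matrix per modality."""
    states = first_word_vectors
    for _ in range(e):
        states = encoding_module(states)
    return states


text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three text nodes
image = [[0.5, 0.5], [1.0, 0.0]]              # two visual nodes
encoded = encode([text, image], e=2)
```

The output keeps one representation per modality with the same number of nodes as the input, matching the one-to-one correspondence between encoded feature vectors and modalities.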

In some optional embodiments, each encoding module 1031 further includes n first vector transformation layers 13, one for each of the n modalities.
The encoding module 1031 is further used to input each of the n first intermediate vectors into the first vector transformation layer 13 corresponding to its modality for non-linear transformation, to obtain n non-linearly transformed first intermediate vectors.
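A per-modality non-linear transformation could look like the following sketch. The text only states that a non-linear transformation is applied; the two-layer feed-forward form with ReLU, the parameter layout and the names `feed_forward` and `transform_per_modality` are assumptions made for illustration.

```python
def feed_forward(vectors, w1, b1, w2, b2):
    """Apply ReLU(x @ W1^T + b1) @ W2^T + b2 to every row of `vectors`."""
    out = []
    for x in vectors:
        hidden = [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w1, b1)
        ]
        out.append([
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)
        ])
    return out


def transform_per_modality(intermediate, params):
    """intermediate[i] and params[i] belong to modality i (separate weights)."""
    return [feed_forward(vecs, *p) for vecs, p in zip(intermediate, params)]


# Toy parameters for a single modality: identity first layer, mixing second layer.
params = [([[1, 0], [0, 1]], [0.0, 0.0], [[1, 1], [1, -1]], [0.5, 0.0])]
result = transform_per_modality([[[1.0, -2.0]]], params)
```

Each modality gets its own parameter tuple here, mirroring the one-layer-per-modality arrangement.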

In some optional embodiments, each of the e serially connected encoding modules 1031 has the same layer structure.

In some optional embodiments, different intra-modal fusion layers are configured with different or identical self-attention functions, and different inter-modal fusion layers are configured with different or identical feature fusion functions.

In some optional embodiments, the multimodal machine translation model 100 further includes a second word vector layer 105 and a classifier 106, and the decoder 104 includes d decoding modules 1042 connected in series, where d is a positive integer.
The second word vector layer 105 is used to obtain a first target word, which is an already translated word in the target statement, and to perform feature extraction on the first target word to obtain a second word vector.
The decoder 104 is used to perform feature extraction by combining the second word vector with the encoded feature vectors through the d serially connected decoding modules 1042, to obtain a decoded feature vector.
The classifier 106 is used to determine a probability distribution corresponding to the decoded feature vector, and to determine, based on the probability distribution, a second target word that follows the first target word.
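The classifier step can be illustrated with a small sketch that maps a decoded feature vector to a probability distribution over a vocabulary and picks the next target word. The linear projection, the softmax, the greedy argmax choice, the toy vocabulary and the name `next_target_word` are all assumptions; the text only says that a probability distribution is determined and used.

```python
import math


def next_target_word(decoded_vector, output_weights, vocabulary):
    """Return the most probable next word and the full distribution.

    output_weights has one row per vocabulary entry (assumed linear projection).
    """
    logits = [
        sum(w * x for w, x in zip(row, decoded_vector))
        for row in output_weights
    ]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # greedy choice
    return vocabulary[best], probs


vocabulary = ["deux", "garçons", "jouent"]
output_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word, probs = next_target_word([0.2, 0.9], output_weights, vocabulary)
```

Repeating this step, with each chosen word fed back as the next already translated word, yields the target statement one word at a time.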

In some optional embodiments, each of the d serially connected decoding modules 1042 includes a first self-attention layer 21 and a second self-attention layer 22.
The first decoding module 1042 is used to input the second word vector into the first self-attention layer 21 of the first decoding module 1042, which performs feature extraction on the second word vector to obtain a second hidden layer vector.
The first decoding module 1042 is further used to input the second hidden layer vector and the encoded feature vectors into the second self-attention layer 22 of the first decoding module 1042, which performs feature extraction by combining the second hidden layer vector with the encoded feature vectors to obtain a second intermediate vector.
The second intermediate vector is then input into the k-th decoding module 1042 for the k-th decoding process, and so on until the last decoding module outputs the decoded feature vector, where k is a positive integer greater than 1 and less than or equal to d.
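The chain of d decoding modules can be sketched in the same spirit. This is a simplified illustration only: it assumes unparameterized dot-product attention for both layers, applies causal masking in the first layer (the text does not specify masking), and treats the encoded feature vectors as one flat list of rows; the names `attention`, `decoding_module` and `decode_features` are hypothetical.

```python
import math


def attention(queries, memory):
    """Dot-product attention of each query row over the memory rows."""
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) for m in memory]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * m[d] for w, m in zip(weights, memory)) / total
            for d in range(len(q))
        ])
    return out


def decoding_module(target_states, encoded):
    # First attention layer: position t looks only at target positions <= t
    # (causal masking is an assumption of this sketch).
    hidden = [
        attention([target_states[t]], target_states[: t + 1])[0]
        for t in range(len(target_states))
    ]
    # Second attention layer: combine the hidden target vectors with the
    # encoded feature vectors produced by the encoder.
    return attention(hidden, encoded)


def decode_features(second_word_vectors, encoded, d):
    """Run d decoding modules in series and return the decoded features."""
    states = second_word_vectors
    for _ in range(d):
        states = decoding_module(states, encoded)
    return states


encoded_rows = [[1.0, 0.0], [0.0, 1.0]]    # toy encoder output
target_prefix = [[0.3, 0.7], [0.9, 0.1]]   # vectors of already translated words
decoded = decode_features(target_prefix, encoded_rows, d=2)
```

Because the last layer attends over the encoder output, every decoded row here is a weighted average of the encoded rows.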

In some optional embodiments, each decoding module 1042 further includes a second vector transformation layer 23.
The decoding module 1042 is used to input the second intermediate vector into the second vector transformation layer 23 for non-linear transformation, to obtain a non-linearly transformed second intermediate vector.

In summary, the multimodal machine translation model provided by this embodiment performs semantic association on the source language of n modalities through the multimodal graph representation layer to obtain a semantic association graph. In the semantic association graph, first connecting edges connect semantic nodes of the same modality and second connecting edges connect semantic nodes of different modalities, so that the graph fully expresses the semantic associations among the source language of the multiple modalities. The multimodal fusion encoder then performs sufficient semantic fusion on the feature vectors in the semantic association graph to obtain encoded feature vectors, and the encoded feature vectors are decoded to obtain a more accurate target statement. The target statement is closer to the content, emotion, linguistic context and the like jointly expressed by the multimodal source statements.

FIG. 2 shows a schematic structural diagram of a computer system provided by an exemplary embodiment of the present application. The computer system includes a terminal 220 and a server 240.

An operating system is installed on the terminal 220, and an application program that supports a multimodal source language translation function is installed on the operating system. For example, the application program may be instant messaging software, financial software, game software, shopping software, video playback software, community service software, audio software, education software, payment software, translation software or the like, with the multimodal source language translation function integrated into it.

The terminal 220 and the server 240 are connected to each other over a wired or wireless network. The server 240 includes at least one of a single server, multiple servers, a cloud computing platform and a virtualization center. For example, the server 240 includes a processor and a memory. The memory stores a computer program, and the processor reads and executes the computer program to implement the multimodal source language translation function.

Optionally, the server 240 undertakes the primary computing work and the terminal 220 undertakes the secondary computing work. Alternatively, the server 240 undertakes the secondary computing work and the terminal 220 undertakes the primary computing work. Alternatively, the server 240 and the terminal 220 perform collaborative computing using a distributed computing architecture.

In some optional embodiments, the server 240 provides a background service for the application program on the terminal 220 while the multimodal language translation function is carried out. For example, the terminal 220 collects source statements in n modalities and sends them to the server 240, and the server 240 executes the translation method based on multimodal machine learning provided by the present application, where n is a positive integer greater than 1.

For example, the terminal 220 includes a data transmission control. Through this control, the terminal 220 uploads source statements of two different modalities to the server 240, namely a statement to be translated and an image matching that statement. The server 240 executes the translation method based on multimodal machine learning provided by the present application and translates the source statements of the two modalities into a target statement.

In some optional embodiments, the source statements may include a speech signal. When the source statements in n modalities include a speech signal, the terminal 220 or the server 240 first converts the speech signal into text before the source statements are translated. For example, the terminal 220 collects the speech signal through a microphone, or the terminal 220 receives a speech signal sent by another terminal.

The translation method based on multimodal machine learning can be applied to a multimedia news translation scenario. For example, the terminal 220 uploads multimedia news containing text and images to the server 240, and the server 240 executes the translation method based on multimodal machine learning provided by the present application to translate the text in a first language in the multimedia news into text in a second language.

The translation method based on multimodal machine learning can be applied to a foreign-language document translation scenario. For example, the terminal 220 uploads the text of a foreign-language document and the illustrations corresponding to the text to the server 240, and the server 240 executes the translation method based on multimodal machine learning provided by the present application to translate the text in a first language in the document into text in a second language.

The translation method based on multimodal machine learning can be applied to a foreign-language website translation scenario. For example, the terminal 220 collects the text on a foreign-language website and the illustrations accompanying the text and uploads them to the server 240. The server 240 executes the translation method based on multimodal machine learning provided by the present application to translate the text in a first language on the website into text in a second language, thereby translating the foreign-language website.

In some optional embodiments, the terminal 220 presents the translated text in speech form or in text form.

It should be noted that, in some optional embodiments, the terminal 220 itself executes the translation method based on multimodal machine learning provided by the present application and translates the source statements in n modalities.

The terminal 220 may generally refer to one of a plurality of terminals; this embodiment is described using the terminal 220 alone as an example. The terminal 220 may include at least one of a smartphone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop portable computer, a desktop computer and a notebook computer. The following embodiments are described taking as an example a terminal 220 that includes a smartphone and a personal computer device.

Those skilled in the art will appreciate that the number of terminals 220 may be larger or smaller. For example, there may be only one terminal, or there may be tens, hundreds or more terminals. The embodiments of the present application do not limit the number or the device type of the terminals 220.

FIG. 3 shows a flowchart of a translation method based on multimodal machine learning provided by an exemplary embodiment of the present application. The method is applied to the computer device shown in FIG. 2, which includes a terminal or a server. The method includes the following steps.

Step 301: The computer device performs semantic association on source statements in n modalities to build a semantic association diagram.

The semantic association diagram includes semantic nodes of n different modalities, first connecting edges used to connect semantic nodes of the same modality, and second connecting edges used to connect semantic nodes of different modalities, where n is a positive integer greater than 1.

Taking the source statement of one modality as an example, the source statement corresponds to one group of semantic nodes, and that group includes at least one semantic node used to represent a semantic unit in the source statement.

A multimodal fusion encoder and a decoder are configured in the computer device. The computer device uses the multimodal graph representation layer to extract semantic nodes from the source statement of each modality, obtaining n groups of semantic nodes corresponding to the source statements of the n modalities. The multimodal graph representation layer uses first connecting edges to connect semantic nodes within the same modality across the n groups; that is, a first connecting edge is added between any two semantic nodes of the same modality. It uses second connecting edges to connect semantic nodes of different modalities across the n groups; that is, a second connecting edge is added between semantic nodes of different modalities. The result is the semantic association diagram.

Optionally, the source statements in n modalities include a first source statement in text form and a second source statement in non-text form, and the n groups of semantic nodes include first semantic nodes and second semantic nodes. The computer device uses the multimodal graph representation layer to extract the first semantic nodes from the first source statement and candidate semantic nodes from the second source statement. It then invokes the multimodal graph representation layer to compute a first probability distribution over the candidate semantic nodes according to the semantic association between the first semantic nodes and the candidate semantic nodes, and invokes the multimodal graph representation layer to determine the second semantic nodes from the candidate semantic nodes based on the first probability distribution.

To extract semantic nodes from the first source statement in text form, the computer device performs word segmentation on the first source statement to obtain m words; the m words correspond to the first semantic nodes in the first source statement, where m is a positive integer.
To extract semantic nodes from the second source statement in non-text form, the computer device extracts from the second source statement a target that corresponds to the semantics of at least one of the m words; this target is a second semantic node in the second source statement.
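
The selection of a second semantic node from candidates can be illustrated with a small sketch. The text does not specify how the first probability distribution is computed, so the sketch below assumes a dot-product similarity followed by a softmax and then picks the most probable candidate. The function name `select_visual_node` and the toy vectors are hypothetical.

```python
import math

def select_visual_node(phrase_vec, candidate_vecs):
    """Return (index of chosen candidate, probability distribution).

    Assumption: the semantic association is scored by a dot product and
    normalised with a softmax; the patent text does not fix either choice.
    """
    scores = [sum(p * c for p, c in zip(phrase_vec, cand))
              for cand in candidate_vecs]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Toy feature vectors for one phrase and three candidate image regions.
phrase = [1.0, 0.0]
candidates = [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]]
best, probs = select_visual_node(phrase, candidates)
```

In this toy input the second candidate has the largest dot product with the phrase, so it is the one selected.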

For example, as shown in FIG. 4, the source statements of two modalities include an image to be translated 31 and a statement to be translated 32, and the content of the statement to be translated 32 is "Two boys are playing with a toy car." Each English word corresponds to one first semantic node: Vx1, Vx2, Vx3, Vx4, Vx5, Vx6, Vx7 and Vx8, respectively. The computer device crops candidate images from the image to be translated 31 based on the semantics of the semantic nodes, computes the first probability distribution based on the semantic association between the semantic nodes and the candidate images, and, based on the first probability distribution, determines from the candidate images target image 1 and target image 2, which correspond to the semantics of Vx1 and Vx2, and target image 3, which corresponds to the semantics of Vx6, Vx7 and Vx8. Vo1, Vo2 and Vo3, corresponding to target image 1, target image 2 and target image 3 respectively, are the three second semantic nodes in the image to be translated 31. The computer device uses first connecting edges (solid lines) for intra-modal semantic connection between every pair of Vx1, Vx2, Vx3, Vx4, Vx5, Vx6, Vx7 and Vx8 and between every pair of Vo1, Vo2 and Vo3, and uses second connecting edges (dashed lines) for inter-modal semantic connection between the first semantic nodes and the second semantic nodes.
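
The FIG. 4 example can be written out as a small data structure. The following is a minimal sketch, not the implementation of the multimodal graph representation layer: the function name `build_semantic_graph` and the `grounding` mapping are hypothetical, and the sketch assumes that each visual node is linked to every word of the phrase it was matched to.

```python
from itertools import combinations

def build_semantic_graph(text_nodes, image_nodes, grounding):
    """Build node and edge sets for a two-modality semantic association diagram.

    grounding maps each visual node to the text nodes whose semantics it
    corresponds to (assumed to come from a visual grounding step).
    """
    intra_edges = set()  # first connecting edges: same modality
    inter_edges = set()  # second connecting edges: different modalities

    # Add a first connecting edge between any two nodes of the same modality.
    for group in (text_nodes, image_nodes):
        for a, b in combinations(group, 2):
            intra_edges.add((a, b))

    # Add a second connecting edge between each grounded word and its visual node.
    for visual, words in grounding.items():
        for word in words:
            inter_edges.add((word, visual))

    return {"nodes": list(text_nodes) + list(image_nodes),
            "intra": intra_edges,
            "inter": inter_edges}

# "Two boys are playing with a toy car." -> Vx1 .. Vx8
text_nodes = [f"Vx{i}" for i in range(1, 9)]
image_nodes = ["Vo1", "Vo2", "Vo3"]
grounding = {
    "Vo1": ["Vx1", "Vx2"],         # target image 1 <-> "Two boys"
    "Vo2": ["Vx1", "Vx2"],         # target image 2 <-> "Two boys"
    "Vo3": ["Vx6", "Vx7", "Vx8"],  # target image 3 <-> "a toy car"
}
graph = build_semantic_graph(text_nodes, image_nodes, grounding)
```

Under these assumptions the diagram has 11 nodes, 31 first connecting edges (28 among the words and 3 among the visual nodes) and 7 second connecting edges.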

Optionally, different modalities have different first connecting edges. When connecting semantic nodes within a modality, the computer device uses the multimodal graph representation layer to determine the i-th type of first connecting edge corresponding to the i-th modality and uses it to connect the semantic nodes within the i-th group; that is, a first connecting edge of the i-th type is added between any two semantic nodes in the i-th group, where i is a positive integer less than or equal to n.

Optionally, when translating source statements of two modalities that are text and an image respectively, the computer device uses a visual grounding tool to establish the semantic associations between the source statements of the two modalities and build the semantic association diagram.

Step 302: The computer device extracts a plurality of first word vectors from the semantic association diagram.

For example, the computer device processes the semantic association diagram using word embedding to obtain the plurality of first word vectors. Word embedding refers to mapping words to word vectors. Optionally, the word embedding method includes at least one of the following four types:
performing word embedding with a neural network model;
performing word embedding by reducing the dimensionality of a word co-occurrence matrix;
performing word embedding with a probabilistic model; and
performing word embedding on a word according to the semantics of the context in which the word is located.

For example, the words in the source statement in text form are represented by one-hot encoding, and word embedding is then performed with an embedding matrix.
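
The one-hot route can be sketched as follows. Multiplying a one-hot vector by an embedding matrix simply selects the matrix row for that word. The vocabulary, the embedding dimension of 4 and the random initial values are illustrative assumptions; in a trained model the embedding matrix would be learned.

```python
import random

def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def embed(one_hot_vec, embedding_matrix):
    """Vector-matrix product: (V,) x (V, D) -> (D,)."""
    dim = len(embedding_matrix[0])
    return [sum(one_hot_vec[r] * embedding_matrix[r][c]
                for r in range(len(one_hot_vec)))
            for c in range(dim)]

vocab = ["two", "boys", "are", "playing", "with", "a", "toy", "car"]
word_to_index = {word: i for i, word in enumerate(vocab)}

random.seed(0)  # fixed seed so the toy matrix is reproducible
embedding_matrix = [[random.uniform(-1, 1) for _ in range(4)] for _ in vocab]

vector = embed(one_hot(word_to_index["boys"], len(vocab)), embedding_matrix)
```

Because the product only selects a row, practical implementations usually replace it with a direct table lookup.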

Step 303: The computer device encodes the plurality of first word vectors to obtain n encoded feature vectors.

The computer device uses the multimodal fusion encoder to perform intra-modal feature extraction on the first word vectors and then performs inter-modal feature fusion on the vectors obtained by the feature extraction.

Take n = 3 as an example. The multimodal fusion encoder includes a first feature extraction function corresponding to the first modality, a second feature extraction function corresponding to the second modality, and a third feature extraction function corresponding to the third modality. The computer device performs feature extraction within the first modality on the first word vectors with the first feature extraction function, within the second modality with the second feature extraction function, and within the third modality with the third feature extraction function, finally obtaining three hidden layer vectors. The multimodal fusion encoder further includes a first feature fusion function corresponding to the first modality, a second feature fusion function corresponding to the second modality, and a third feature fusion function corresponding to the third modality. The computer device performs inter-modal feature fusion on the three hidden layer vectors with each of the first, second and third feature fusion functions, obtaining three hidden layer vectors after feature fusion, that is, the encoded feature vectors.

Step 304: The computer device decodes the n encoded feature vectors to obtain the translated target statement.

The computer device invokes the decoder to decode the n encoded feature vectors and obtain the translated target statement. The target statement is the statement obtained by translating the source statements in n modalities into a specified language.

In summary, the translation method based on multimodal machine learning provided by this embodiment uses the multimodal graph representation layer to perform semantic association on the source statements in n modalities and build a semantic association diagram. In the semantic association diagram, first connecting edges connect semantic nodes of the same modality and second connecting edges connect semantic nodes of different modalities, so that the diagram fully expresses the semantic associations among the source statements of the multiple modalities. The multimodal fusion encoder then performs thorough semantic fusion on the feature vectors in the semantic association diagram to obtain the encoded feature vectors, and decoding the encoded feature vectors yields a more accurate target statement. The target statement comes closer to the content, emotion, language environment and the like that the multimodal source statements jointly express.

Based on FIG. 3, the multimodal fusion encoder includes e encoding modules connected in series. Each encoding module includes n intra-modal fusion layers and n inter-modal fusion layers in one-to-one correspondence with the n modalities, where e is a positive integer. Accordingly, step 303 may include step 3031, as shown in FIG. 5:
Step 3031: The computer device performs intra-modal fusion and inter-modal fusion e times on the plurality of first word vectors through the e encoding modules connected in series, obtaining n encoded feature vectors.

Here, intra-modal fusion refers to semantic fusion between first word vectors within the same modality, and inter-modal fusion refers to semantic fusion between first word vectors of different modalities.

For example, the intra-modal and inter-modal fusion that produces the encoded feature vectors can be implemented by the following steps.

1) The first word vectors are input into the n intra-modal fusion layers of the first encoding module, and each of the n intra-modal fusion layers performs semantic fusion within its own modality on the first word vectors, obtaining n first hidden layer vectors. Each first hidden layer vector corresponds to one modality; that is, n first hidden layer vectors in one-to-one correspondence with the n modalities are obtained.

For example, the computer device inputs the first word vectors into the first intra-modal fusion layer of the first encoding module, which performs intra-modal semantic fusion on them to obtain the first of the first hidden layer vectors; inputs the first word vectors into the second intra-modal fusion layer of the first encoding module, which performs intra-modal semantic fusion on them to obtain the second of the first hidden layer vectors; and so on, until it inputs the first word vectors into the n-th intra-modal fusion layer of the first encoding module, which performs intra-modal semantic fusion on them to obtain the n-th of the first hidden layer vectors.

A feature extraction function is configured in each intra-modal fusion layer; optionally, the feature extraction function includes a self-attention function. Optionally, different intra-modal fusion layers are configured with different or identical self-attention functions. It should be noted that self-attention functions being different means that their parameters differ: if the self-attention functions corresponding to different modalities differ, the parameters of the functions corresponding to those modalities differ.

2) The n first hidden layer vectors are input into each inter-modal fusion layer of the first encoding module, and each inter-modal fusion layer performs semantic fusion across modalities on the n first hidden layer vectors, obtaining n first intermediate vectors. Each first intermediate vector corresponds to one modality; that is, n first intermediate vectors in one-to-one correspondence with the n modalities are obtained.

For example, the computer device inputs the n first hidden layer vectors into the first inter-modal fusion layer of the first encoding module, which performs inter-modal semantic fusion on them to obtain the first of the first intermediate vectors, corresponding to the first modality; inputs the n first hidden layer vectors into the second inter-modal fusion layer of the first encoding module, which performs inter-modal semantic fusion on them to obtain the second of the first intermediate vectors, corresponding to the second modality; and so on, until it inputs the n first hidden layer vectors into the n-th inter-modal fusion layer of the first encoding module, which performs inter-modal semantic fusion on them to obtain the n-th of the first intermediate vectors, corresponding to the n-th modality.

A feature fusion function is configured in each inter-modal fusion layer; optionally, the feature fusion functions configured in different inter-modal fusion layers are different or identical. It should be noted that feature fusion functions being different means that their parameters differ or that their calculation methods differ.

Optionally, each encoding module further includes n first vector transformation layers in one-to-one correspondence with the n modalities. After obtaining the n first intermediate vectors, the computer device inputs each of them into the first vector transformation layer corresponding to its modality for a nonlinear transformation, obtaining the n nonlinearly transformed first intermediate vectors.

3) The n first intermediate vectors are input into the j-th encoding module for the j-th encoding process, and this continues until the last encoding module outputs n encoded feature vectors. Each encoded feature vector corresponds to one modality; that is, the process continues until the last encoding module outputs n encoded feature vectors in one-to-one correspondence with the n modalities.

The computer device inputs the n first intermediate vectors into the second encoding module for the second encoding process, obtaining n re-encoded first intermediate vectors; and so on, inputting the n re-encoded first intermediate vectors into the j-th encoding module for the j-th encoding process to obtain n re-encoded first intermediate vectors; and so on, until the n re-encoded first intermediate vectors are input into the e-th encoding module for the e-th encoding process, obtaining the n encoded feature vectors. Here, j is a positive integer greater than 1 and less than or equal to e. Optionally, all e encoding modules connected in series have the same layer structure. That is, the j-th encoding module processes its input following the same steps that the first encoding module uses to encode the first word vectors, until the last encoding module outputs the encoded feature vectors.
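
Steps 1) to 3) describe a loop over e encoding modules, each applying intra-modal fusion, then inter-modal fusion, then an optional per-modality nonlinear transformation. The sketch below shows only this control flow. The layer functions `toy_intra`, `toy_inter` and `relu` are placeholders chosen for illustration (simple averaging and a ReLU), not the fusion functions of the embodiment, and the dictionary layout is an assumption.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def toy_intra(vectors):
    # Placeholder intra-modal fusion: mix each node with its modality's mean.
    mean = mean_vector(vectors)
    return [[0.5 * (x + mu) for x, mu in zip(v, mean)] for v in vectors]

def toy_inter(modality, hidden):
    # Placeholder inter-modal fusion: add the mean of the other modalities.
    others = [mean_vector(h) for m, h in hidden.items() if m != modality]
    context = mean_vector(others)
    return [[x + c for x, c in zip(v, context)] for v in hidden[modality]]

def relu(vectors):
    # Placeholder for the first vector transformation layer.
    return [[max(0.0, x) for x in v] for v in vectors]

def encode(word_vectors, modules):
    """Run e serially connected encoding modules over per-modality vectors."""
    states = word_vectors
    for module in modules:
        # 1) Each modality's intra-modal layer sees only its own vectors.
        hidden = {m: module["intra"][m](states[m]) for m in states}
        # 2) Each modality's inter-modal layer sees all n hidden states.
        fused = {m: module["inter"][m](m, hidden) for m in hidden}
        # Optional per-modality nonlinear transformation.
        states = {m: module["transform"][m](fused[m]) for m in fused}
    return states  # output of the last module: the encoded feature vectors

word_vectors = {
    "text": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "image": [[0.5, -0.5], [-1.0, 2.0]],
}
modalities = list(word_vectors)
e = 2
modules = [
    {
        "intra": {m: toy_intra for m in modalities},
        "inter": {m: toy_inter for m in modalities},
        "transform": {m: relu for m in modalities},
    }
    for _ in range(e)
]
encoded = encode(word_vectors, modules)
```

Keeping the layers in per-modality dictionaries mirrors the one-to-one correspondence between layers and modalities, and lets different modalities use different or identical functions.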

For example, this embodiment uses a self-attention mechanism to model the semantic information within the same modality. The j-th encoding module computes the first hidden layer vector [Equation 1] corresponding to the text statement by the formula
[Equation 2],
where [Equation 3] denotes the first word vector corresponding to the text statement or the first intermediate vector output by the (j−1)-th encoding module; x marks the semantic nodes of the text statement and the vectors computed from them; and MultiHead(Q, K, V) is a multi-head attention modeling function that takes the triplet (Queries, Keys, Values) as input, where Q is the query matrix, K is the key matrix and V is the value matrix, and Q, K and V are computed from [Equation 4] and parameter vectors.
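
To make the role of Q, K and V concrete, the following is a simplified sketch of scaled dot-product self-attention over the node vectors of one modality. It uses a single head and omits the learned parameters from which Q, K and V are derived, so the input vectors serve directly as queries, keys and values. This simplification is an assumption made for brevity; it is not the MultiHead function of the embodiment.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head self-attention with Q = K = V = vectors (no projections)."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot product between this node and every node of the modality.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * value[c] for w, value in zip(weights, vectors))
                        for c in range(dim)])
    return outputs

text_nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = self_attention(text_nodes)
```

Each output vector is a weighted average of all node vectors in the same modality, which is how information is exchanged between nodes joined by first connecting edges.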

The j-th encoding module computes the first hidden layer vector [Equation 5] corresponding to the image according to
[Equation 6]
Here, [Equation 7] denotes the first word vector corresponding to the image or the first intermediate vector output by the (j-1)-th encoding module.

This embodiment further uses a gating-based cross-modal fusion mechanism to model semantic fusion between modalities. The j-th encoding module then computes the first intermediate vector or encoded feature vector [Equation 8] corresponding to the text statement according to
[Equation 9]
[Equation 10]
Here, A denotes a set; correspondingly, [Equation 11] is the set of neighboring nodes of the first semantic node [Equation 12] in the semantic association diagram. [Equation 13] denotes the u-th semantic node of the text statement, where u is a positive integer. [Equation 14] is the semantic representation vector of the s-th semantic node of the image in the j-th encoding module, and [Equation 15] is the semantic representation vector of the u-th semantic node of the text statement in the j-th encoding module. [Equation 16] and [Equation 17] are parameter matrices, [Equation 18] denotes the exclusive-NOR (XNOR) operation, and Sigmoid() is the S-shaped (sigmoid) function. The letter o marks the semantic nodes of the image and the vectors computed from those nodes. The first intermediate vector or encoded feature vector [Equation 19] corresponding to the image is computed in the same way, and the details are not repeated here.

After the inter-modal fusion, this embodiment further uses a feedforward neural network (FeedForward Neural Network, FFN) to generate the final encoded feature vectors. The encoded feature vector corresponding to the text statement and the encoded feature vector corresponding to the image are, respectively,
[Equation 20]
[Equation 21]
Here, [Equation 22] holds, { } denotes a set, [Equation 23] denotes the encoded feature vector corresponding to the u-th semantic node of the text statement in the j-th encoding module, and [Equation 24] denotes the encoded feature vector corresponding to the s-th semantic node of the image in the j-th encoding module.

In summary, in the translation method based on multimodal machine learning provided by this embodiment, the multimodal graph representation layer performs semantic association on the source language of n modalities to construct a semantic association diagram. In the semantic association diagram, first connecting edges connect semantic nodes of the same modality and second connecting edges connect semantic nodes of different modalities, so that the diagram fully expresses the semantic associations among the source language of the multiple modalities. The multimodal fusion encoder then performs thorough semantic fusion on the feature vectors in the semantic association diagram to obtain encoded feature vectors, and decoding the encoded feature vectors yields a more accurate target statement. The target statement comes closer to the content, emotion, language environment and so on that the multimodal source language expresses as a whole.

In this method, the multimodal fusion encoder includes e serially connected encoding modules. Each encoding module includes an intra-modal fusion layer and an inter-modal fusion layer. By alternating intra-modal and inter-modal feature fusion multiple times, the encoder obtains encoded feature vectors with more complete semantic fusion, from which a more accurate target statement corresponding to the source language of the n modalities can be decoded.

Based on FIG. 3, the decoder further includes d serially connected decoding modules, where d is a positive integer. Accordingly, step 304 may include steps 3041 to 3044, as shown in FIG. 6:

Step 3041: The computer device obtains a first target phrase through the second word vector layer.

Here, the first target phrase is a phrase of the target statement that has already been translated. The computer device translates the phrases of the target statement one at a time. After translating the r-th phrase of the target statement, it takes the r-th phrase as the first target phrase and uses it to translate the (r+1)-th phrase. In other words, the computer device inputs the r-th phrase into the second word vector layer, where r is a non-negative integer.

Step 3042: The computer device performs feature extraction on the first target phrase through the second word vector layer to obtain a second word vector.

Illustratively, the computer device performs word embedding on the first target phrase through the second word vector layer to obtain the second word vector. Word embedding is a technique that represents a word as a real-valued vector in a vector space; in this embodiment, word embedding means mapping a word to a word vector. For example, if mapping "わたし" ("I") yields the word vector (0.1, 0.5, 5), then (0.1, 0.5, 5) is the word vector obtained by performing word embedding on "わたし".
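Word embedding as described above amounts to a table lookup. The following minimal sketch assumes a hypothetical three-word vocabulary and a hand-written embedding table; in a trained model the table values are learned, and only the vector (0.1, 0.5, 5) comes from the example in the text.

```python
from typing import Dict, List

# Hypothetical toy vocabulary: word -> row index in the embedding table.
VOCAB: Dict[str, int] = {"<bos>": 0, "わたし": 1, "食べる": 2}

# Hypothetical embedding table; row 1 reuses the example vector from the text.
EMBEDDINGS: List[List[float]] = [
    [0.0, 0.0, 0.0],
    [0.1, 0.5, 5.0],
    [0.3, 0.2, 1.0],
]


def embed(word: str) -> List[float]:
    """Map a word to its word vector by looking up its row in the table."""
    return list(EMBEDDINGS[VOCAB[word]])
```

Calling `embed("わたし")` returns the example vector, and an unknown word raises `KeyError`, which a real system would handle with an unknown-word entry.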

Step 3043: The computer device uses the d serially connected decoding modules to perform feature extraction on the combination of the second word vector and the encoded feature vectors, obtaining a decoded feature vector.

The computer device calls the d serially connected decoding modules to process the encoded feature vectors and the second word vector based on an attention mechanism and extract the decoded feature vector.

Optionally, each of the d serially connected decoding modules includes one first self-attention layer, one second self-attention layer and one second vector transformation layer. To extract the decoded feature vector, the computer device inputs the second word vector into the first self-attention layer of the 1st decoding module, which performs feature extraction on the second word vector to obtain a second hidden layer vector. The computer device then inputs the second hidden layer vector and the encoded feature vectors into the second self-attention layer of the 1st decoding module, which performs feature extraction on their combination to obtain a second intermediate vector. The second intermediate vector is input into the k-th decoding module for the k-th round of decoding, and this continues until the last decoding module outputs the decoded feature vector, where k is a positive integer greater than 1 and less than or equal to d.

Here, the first self-attention layer processes the second word vector based on a self-attention mechanism to extract the second hidden layer vector. The second self-attention layer processes the second hidden layer vector and the encoded feature vectors based on an attention mechanism, in the language of the target statement, to obtain the second intermediate vector. The first self-attention layer contains a first self-attention function and the second self-attention layer contains a second self-attention function, and the two functions have different parameters.

Optionally, each decoding module further includes a second vector transformation layer. After the second intermediate vector has been computed, the computer device further inputs it into the second vector transformation layer for a nonlinear transformation, obtaining the nonlinearly transformed second intermediate vector.

Step 3044: The computer device inputs the decoded feature vector into a classifier, uses the classifier to compute the probability distribution corresponding to the decoded feature vector, and determines the second target phrase, which follows the first target phrase, based on the probability distribution.

Optionally, the classifier includes a normalization (softmax) function. The computer device uses the softmax function to compute the probability distribution corresponding to the decoded feature vector and determines the second target phrase following the first target phrase based on that distribution.
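Step 3044 can be sketched as a linear layer followed by softmax and a choice of the next phrase. The sketch below makes two assumptions that the text does not state: the classifier is a single linear map with weights `W` and bias `b`, and the next phrase is chosen greedily as the most probable one (a real system might use beam search instead). The function names and the toy vocabulary are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Normalize scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def phrase_distribution(decoded: Sequence[float],
                        W: Sequence[Sequence[float]],
                        b: Sequence[float]) -> List[float]:
    """Probability of each vocabulary entry given the decoded feature vector."""
    logits = [sum(w * x for w, x in zip(row, decoded)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)


def next_phrase(decoded: Sequence[float],
                W: Sequence[Sequence[float]],
                b: Sequence[float],
                vocab: Sequence[str]) -> str:
    """Greedy choice (an assumption): return the most probable phrase."""
    probs = phrase_distribution(decoded, W, b)
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best]
```

The selected phrase then plays the role of the first target phrase in the next decoding step.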

In summary, in the translation method based on multimodal machine learning provided by this embodiment, the multimodal graph representation layer performs semantic association on the source language of n modalities to construct a semantic association diagram. In the semantic association diagram, first connecting edges connect semantic nodes of the same modality and second connecting edges connect semantic nodes of different modalities, so that the diagram fully expresses the semantic associations among the source language of the multiple modalities. The multimodal fusion encoder then performs thorough semantic fusion on the feature vectors in the semantic association diagram to obtain encoded feature vectors, and decoding the encoded feature vectors yields a more accurate target statement. The target statement comes closer to the content, emotion, language environment and so on that the multimodal source language expresses as a whole.

In addition, through the d decoding modules, the method repeatedly applies attention to the encoded feature vectors and the second hidden layer vector in the language of the target statement, so that a more accurate target statement is decoded.

It should further be noted that the multimodal machine translation model provided by this application was compared in tests with earlier multimodal neural machine translation (Neural Machine Translation, NMT) models, and the model provided by this application showed the best translation quality. The comparison is described in detail below, taking as an example input data in two source-language modalities, image and text.

The multimodal machine translation model provided by this application is built on an attention-based encoder-decoder framework, and its objective function is to maximize the log-likelihood of the training data. In essence, the multimodal fusion encoder provided by this application can be regarded as a multimodal extension of a graph neural network (Graph Neural Network, GNN). To construct the multimodal fusion encoder, the input image and text are jointly represented as one multimodal graph (that is, the semantic association diagram). Multiple multimodal fusion layers are then stacked on this multimodal graph to learn the node (that is, semantic node) representations, which supply the decoder with attention-based context vectors.
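The log-likelihood objective mentioned above can be written out as follows. This is a sketch under assumptions: D (the set of training triples) and θ (the model parameters) are notation introduced here for illustration, while X, I and Y stand for the input statement, the input image and the target statement.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(X, I, Y) \in D} \log P(Y \mid X, I; \theta)
```

Maximizing this sum favors parameters under which the reference translations in the training data receive high probability given both the text and the image.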

First, regarding the construction of the multimodal graph: formally, the multimodal graph is undirected and can be written as G = (V, E). Each node in the node set V represents either a text phrase or a visual object. Nodes corresponding to text are called semantic nodes and nodes corresponding to visual objects are called visual nodes, and the semantic associations between nodes are built with the following policies.

1. Node extraction
(1) To make full use of the text information, every word in the text is treated as a separate text node. For example, the multimodal graph in FIG. 4 contains eight text nodes in total, each corresponding to one word of the input statement (that is, the statement to be translated). (2) The Stanford parser is used to identify all noun phrases in the input statement, and a visual grounding toolkit is then applied to identify the bounding box (visual object) in the input image (that is, the image to be translated) that corresponds to each noun phrase. Every detected visual object is then treated as an independent visual node. For example, in FIG. 4, text nodes Vx1 and Vx2 correspond to visual nodes Vo1 and Vo2, and text nodes Vx6, Vx7 and Vx8 correspond to visual node Vo3.

2. To capture the various semantic associations between multimodal semantic units, two types of edges (that is, connecting edges) are used to connect the semantic nodes. The edge set E contains the following two types of edges: (1) any two semantic nodes of the same modality are connected by an intra-modal edge (first connecting edge); and (2) each text node and its corresponding visual node are connected by an inter-modal edge (second connecting edge). Illustratively, as shown in FIG. 4, Vo1 and Vo2 are connected by an intra-modal edge (solid line), and Vo1 and Vx1 are connected by an inter-modal edge (solid line).
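The node extraction and edge construction policies above can be sketched as follows. The sketch assumes that parsing and visual grounding have already been done and are summarized in a hypothetical `grounding` mapping from each visual object to the indices of the words in the noun phrase grounded to it; the function name, the node labels and this input format are illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def build_multimodal_graph(
    words: List[str], grounding: Dict[str, List[int]]
) -> Tuple[List[str], List[str], Set[Edge], Set[Edge]]:
    """Return text nodes, visual nodes, intra-modal edges and inter-modal edges.

    words: tokens of the statement to be translated, one text node per word.
    grounding: visual object id -> indices of the words grounded to it.
    """
    text_nodes = [f"x{u + 1}" for u in range(len(words))]
    visual_nodes = sorted(grounding)

    # Intra-modal edges: every pair of nodes within the same modality.
    intra: Set[Edge] = set()
    for group in (text_nodes, visual_nodes):
        intra.update(combinations(group, 2))

    # Inter-modal edges: each text node and its corresponding visual node.
    inter: Set[Edge] = {
        (text_nodes[u], obj) for obj, idxs in grounding.items() for u in idxs
    }
    return text_nodes, visual_nodes, intra, inter
```

For a graph shaped like the FIG. 4 example (eight words, with x1 and x2 assumed to link to both o1 and o2, and x6 to x8 linked to o3), this yields 28 + 3 intra-modal edges and 7 inter-modal edges.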

Second, regarding the embedding layer: before the multimodal graph is fed into the stacked multimodal fusion layers, a word embedding layer must be introduced to initialize the node states. For each text node Vxu, the initial state Hxu is defined as the sum of its word embedding and its position embedding. For the initial state Hos of a visual node Vos, visual features are first extracted by the fully-connected layer that follows the region-of-interest pooling (Region Of Interest pooling, ROI pooling) layer in Faster-RCNN, and a multilayer perceptron with the rectified linear unit (Rectified Linear Unit, ReLU) as its activation function is then used to project the visual features into the same space as the text representations.
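The two initializations can be sketched in a few lines. The text node state is the element-wise sum of a word embedding and a position embedding; for the visual node, the sketch assumes a two-layer perceptron with ReLU on the hidden layer, since the text does not state the number of layers. All function and parameter names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def text_node_state(word_emb: Sequence[float], pos_emb: Sequence[float]) -> Vector:
    """Initial state of a text node: word embedding plus position embedding."""
    if len(word_emb) != len(pos_emb):
        raise ValueError("embeddings must have the same dimension")
    return [w + p for w, p in zip(word_emb, pos_emb)]


def linear(W: Matrix, b: Sequence[float], x: Sequence[float]) -> Vector:
    """Affine map W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(x: Sequence[float]) -> Vector:
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in x]


def visual_node_state(feature: Sequence[float],
                      W1: Matrix, b1: Sequence[float],
                      W2: Matrix, b2: Sequence[float]) -> Vector:
    """Project a visual feature into the text space (assumed two-layer MLP)."""
    return linear(W2, b2, relu(linear(W1, b1, feature)))
```

The output dimension of `visual_node_state` is set by the number of rows of `W2`, which should equal the text embedding dimension so that both node types live in the same space.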

Here, RCNN refers to the region-based convolutional neural network introduced in the paper "Rich feature hierarchies for accurate object detection and semantic segmentation".

Third, as shown in FIG. 7, the left part shows the encoder. On top of the embedding layer 402, e graph-based multimodal fusion layers are stacked to encode the multimodal graph. In each multimodal fusion layer, intra-modal fusion and inter-modal fusion are performed in sequence to update all node states. In this way, the final node states encode both the context information within the same modality and the cross-modal semantic information. In particular, because visual nodes and text nodes are two types of semantic units carrying information of different modalities, functions with similar operations but different parameters are used to model their state update processes.

Illustratively, in the j-th multimodal fusion layer, updating the text node states [Equation 25] and the visual node states [Equation 26] mainly involves the following steps.

Step 1: Intra-modal fusion. In this step, self-attention is used to fuse information between neighboring nodes of the same modality and generate a contextual representation of each node. Formally, the contextual representations [Equation 27] of all text nodes are computed as
[Equation 28]
Here, MultiHead(Q, K, V) is the multi-head attention modeling function (also called the multi-head self-attention function), which takes the query matrix Q, the key matrix K and the value matrix V as input. Similarly, the contextual representations [Equation 29] of all visual nodes are computed as
[Equation 30]

In particular, the initial states of the visual objects are extracted by deep convolutional neural networks (deep CNNs). A simplified multi-head self-attention is therefore applied to represent the initial states of the visual objects, in which the learned linear projections of the values and of the final output are removed.

Step 2: Inter-modal fusion. When fusing features across modalities, a cross-modal gating mechanism with element-wise operations is used to learn the semantic information of each node's cross-modal neighborhood. Specifically, the state representation [Equation 31] of text node Vxu is generated as
[Equation 32]
[Equation 33]
Here, [Equation 34] is the set of neighboring nodes of node Vxu in the multimodal graph, and [Equation 35] and [Equation 36] are parameter matrices. Similarly, the state representation [Equation 37] of visual node Vos is generated as
[Equation 38]
[Equation 39]
Here, [Equation 40] is the set of neighboring nodes of node Vos in the multimodal graph, and [Equation 41] and [Equation 42] are parameter matrices.
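Because the equations themselves are not reproduced in this text, the following is only a sketch of one common form of gated cross-modal fusion, stated here as an assumption: a sigmoid gate is computed from the text node's context vector and a neighboring visual node's context vector using two parameter matrices, the gate scales the visual vector element-wise, and the gated vectors of all cross-modal neighbors are added to the text node's own context vector. The names `fuse_text_node`, `W1` and `W2` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def matvec(W: Matrix, x: Sequence[float]) -> Vector:
    """Matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fuse_text_node(c_x: Sequence[float],
                   neighbor_c_os: Sequence[Sequence[float]],
                   W1: Matrix, W2: Matrix) -> Vector:
    """Assumed gated fusion of a text node with its visual neighbors."""
    fused = list(c_x)                      # start from the node's own context
    a = matvec(W1, c_x)
    for c_o in neighbor_c_os:              # cross-modal neighbors only
        b = matvec(W2, c_o)
        gate = [sigmoid(ai + bi) for ai, bi in zip(a, b)]
        # Element-wise gating: each dimension of c_o is scaled by its gate.
        fused = [f + g * v for f, g, v in zip(fused, gate, c_o)]
    return fused
```

A text node without visual neighbors keeps its context vector unchanged, and the update for a visual node would follow the same pattern with its own parameter matrices.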

After the multimodal fusion process above, a feedforward neural network is used to generate the final hidden-layer representations. The text node states [Equation 43] and the image node states [Equation 44] are computed as
[Equation 45]
[Equation 46]
Here, [Equation 47] indicates that all text node states and image node states have been updated.

Fourth, the decoder is similar to a conventional Transformer decoder. Because the visual information has already been fused into all text nodes by the multiple graph-based multimodal fusion layers, the decoder can exploit the multimodal context dynamically while attending only to the text node states; that is, only the text node states are input to the decoder.

As shown in the right part of FIG. 7, d identical layers are stacked to generate the target-side hidden states, and each layer consists of three sub-layers. Specifically, the first two sub-layers are masked self-attention Ej and encoder-decoder attention Tj, which integrate the target-side and source-side contexts:
[Equation 48]
[Equation 49]
Here, S(j-1) denotes the target-side hidden states of the (j-1)-th layer. In particular, S(0) is the embedding vector of the input target phrase, and [Equation 50] is the hidden state of the top layer in the decoder. Next, a position-wise fully connected feedforward neural network is used to generate S(j):
[Equation 51]
Finally, a softmax layer defines the probability distribution for generating the target statement. This layer takes the top-layer hidden states [Equation 52] as input:
[Equation 53]
Here, X is the input statement to be translated, I is the input image to be translated, Y is the target statement (that is, the translated statement), and W and b are the parameters of the softmax layer.

In the experiments, the translation tasks are English-to-French and English-to-German, and the Multi30K dataset is used. Each image in the dataset is paired with an English description and with human translations into German and French. The training, validation and test sets contain 29,000, 1,014 and 1,000 instances, respectively. In addition, the various models are evaluated on the WMT17 test set and the ambiguous MSCOCO test set, which contain 1,000 and 461 instances, respectively. In these experiments, the preprocessed statements are used directly, and words are split into subwords using byte pair encoding with 10,000 merge operations.
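To illustrate what a "merge operation" is in byte pair encoding, the sketch below learns merges from a toy word-frequency table by repeatedly joining the most frequent adjacent symbol pair. It is a simplified illustration, not the preprocessing actually used in the experiments: there is no end-of-word marker, ties are broken by symbol order, and the function name and toy corpus are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to num_merges merge operations from word frequencies."""
    # Each word starts as a sequence of single-character symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab: Dict[Tuple[str, ...], int] = {}
        for symbols, freq in vocab.items():
            merged: List[str] = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges
```

With 10,000 merges learned on the training corpus, frequent words end up as single symbols while rare words are split into smaller subword units.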

Visual features: The Stanford parser is first used to identify the noun phrases in each source statement, and a visual grounding toolkit is then used to detect the visual objects associated with the identified noun phrases. For each phrase, only the visual object with the highest prediction probability is kept, which reduces the negative effect of an excessive number of visual objects. On average, each sentence contains about 3.5 objects and about 15.0 words. Finally, a pre-trained ResNet-100 Faster RCNN is used to compute 2048-dimensional features for these objects.

Settings: The Transformer is used as the basis. Because the training corpus is relatively small, trained models tend to overfit, so a small grid search is first performed to obtain a set of hyperparameters on the English-to-German validation set. Specifically, the word embedding dimension and the hidden size are 128 and 256, respectively. The decoder has 4 layers and the number of attention heads is 4. The dropout rate is set to 0.5. Each batch consists of approximately 2,000 source tokens and target tokens. The Adam optimizer with a predetermined learning rate is applied to optimize the various models, and the other settings are kept the same. Finally, the Bilingual Evaluation Understudy (BLEU) and METEOR metrics are used to evaluate translation quality. Note that every model was run three times in each experiment and the average results are reported.

Base models: In addition to the text-based Transformer (TF), several effective methods that use visual features were also adapted to the Transformer, and the model provided by the embodiments of this application was compared with these Transformer variants.

1. ObjectAsToken (TF). This is a variant of the Transformer in which all visual objects are treated as additional source tokens and placed in front of the input statement.

2. Enc-att (TF). An encoder-based image attention mechanism is used in the Transformer, which augments each source annotation with an attention-based visual feature vector.

3. Doubly-att (TF). This is a doubly attentive Transformer. In each decoding layer, a cross-modal multi-head attention sub-layer is inserted before the fully connected feedforward layer to generate a visual context vector from the visual features.

Correspondingly, the performance of several major multimodal neural machine translation (NMT) models is also reported, such as Doubly-att (RNN), Soft-att (RNN), Stochastic-att (RNN), Fusion-conv (RNN), Trg-mul (RNN), VMMT (RNN) and Deliberation Network (TF). Here, RNN stands for recurrent neural network (Recurrent Neural Network).

The number e of multimodal fusion layers is an important hyperparameter that directly determines the degree of fine-grained semantic fusion in the encoder. Its effect on the English-to-German validation set is therefore examined first. FIG. 8 shows the experimental results: the model reaches its best performance when e is 3. Therefore, e = 3 is used in all subsequent experiments.

[Table 1] shows the main results of the English-to-German translation task. Apart from the comparison with Fusion-conv (RNN) and Trg-mul (RNN) on METEOR, the model provided by the embodiments of this application outperforms most previous models. Those two sets of results come from the best-performing systems on the WMT2017 test set, which were selected based on METEOR. Comparison with the base models leads to the following conclusions.

まず、本願の実施例が提供するモデルはＯｂｊｅｃｔＡｓＴｏｋｅｎ（ＴＦ）よりも優れている。該モデルは領域視覚特徴とテキストとを一体に結合して、注目可能シーケンスを形成し、且つ自己注意メカニズムを利用してマルチモーダル融合を行う。その基本的な理由は２つの点を含み、第１としては、異なるモーダルのセマンティックユニットの間のセマンティック対応関係をモデリングしたことであり、第２としては、異なるモーダルのモデルパラメータを区別したことである。 First, the model provided by the embodiments of the present application is superior to ObjectAsToken (TF). The model combines regional visual features and text together to form attentionable sequences and utilizes self-attention mechanisms for multimodal fusion. The basic reasons for this include two points: first, we modeled the semantic correspondence between semantic units of different modals, and second, we distinguished the model parameters of different modals. be.

Second, the model provided by the embodiments of the present application is also significantly better than Enc-att (TF). Here, Enc-att (TF) may be regarded as a single-layer semantic fusion encoder. Beyond the benefit of modeling semantic correspondence, it is further inferred that multi-layer multimodal semantic interaction is also advantageous for NMT.

Third, compared with Doubly-att (TF), which uses only an attention mechanism to extract visual information, the model provided by the embodiments of the present application is significantly improved because it performs sufficient multimodal fusion in the encoder.

In addition, the test set is divided into different groups according to source sentence length and number of noun phrases, and the performance of the different models is then compared on each group. FIG. 9 and FIG. 10 show the BLEU scores for these groups. Overall, the model provided by the embodiments of the present application consistently achieves the best performance in all groups, which again demonstrates its effectiveness and generality. Note that sentences containing more phrases are generally longer, and for such sentences the improvement of the model over the base models is more pronounced. It is presumed that long sentences often contain more ambiguous words. Compared with short sentences, long sentences may therefore need to make better use of visual information as supplementary information, which can be achieved by the multimodal semantic interaction of the model provided by the embodiments of the present application.
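The grouping step described above can be sketched as follows. This is only an illustrative sketch: the function name `bucket_by_length`, the whitespace tokenization and the bucket boundaries are assumptions and are not taken from the present application. Scoring each group (for example with BLEU) is left to whatever evaluation tool is already in use.

```python
from collections import defaultdict


def bucket_by_length(sentences, boundaries):
    """Group sentence indices by source length in whitespace tokens.

    `boundaries` is an ascending list of upper bounds (hypothetical values).
    A sentence goes to the first bucket whose bound is >= its length,
    otherwise to an overflow bucket.
    """
    groups = defaultdict(list)
    for idx, sentence in enumerate(sentences):
        length = len(sentence.split())
        for bound in boundaries:
            if length <= bound:
                groups[f"<={bound}"].append(idx)
                break
        else:
            # No bound was large enough: overflow bucket.
            groups[f">{boundaries[-1]}"].append(idx)
    return dict(groups)


# Illustrative source sentences, not data from the experiments.
sources = [
    "a dog runs",
    "two men are playing guitar on a stage",
    "a child",
]
groups = bucket_by_length(sources, [5, 10])
```

Each list of indices can then be used to select the matching hypotheses and references so that every model is evaluated on the same subsets.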

In addition, [Table 4] shows the training and decoding speeds of the model provided by the embodiments of the present application and of the base models. During training, the model processes approximately 1.1K tokens per second, which is comparable to other multimodal models. During decoding, the model translates approximately 16.7 sentences per second, slightly slower than the Transformer. Moreover, the model introduces only a small number of additional parameters while achieving better performance.

To study the effectiveness of the different components, further experiments were conducted comparing the model provided by the embodiments of the present application with the following variants in [Table 2].

(1) Inter-modal fusion. In this variant, two independent Transformer encoders are used to learn the semantic representations of words and of visual objects respectively, and a double-attention decoder is then used to incorporate the textual and visual contexts into the decoder. The result in row 3 of [Table 2] shows that removing inter-modal fusion causes a marked drop in performance. This indicates that semantic interaction between multimodal semantic units is useful for multimodal representation learning.

(2) From visual grounding to full connection. In this variant, words and visual objects are fully connected to one another to establish the correspondence between modalities. The result in row 4 of [Table 2] shows that this change causes a marked drop in performance. The underlying reason is that fully connected semantic correspondence introduces a great deal of noise into the model provided by the embodiments of the present application.

(3) From different parameters to unified parameters. In this variant, unified parameters are assigned to update the node states of the different modalities. The performance drop reported in row 5 of [Table 2] likewise confirms the effectiveness of using different parameters.

(4) Attending to visual nodes. Unlike the model, which considers only text nodes, the decoder of this variant is allowed to consider both types of nodes by using a double-attention decoder. As can be seen from the result in row 6 of [Table 2], considering all nodes brings no further improvement. This result confirms the original assumption, namely that the visual information has already been fully incorporated into the text nodes in the encoder.

(5) Attending to text nodes versus attending to visual nodes. When only the visual nodes are considered, the performance of the model drops sharply, as shown in row 7 of [Table 2]. This is because the number of visual nodes is far smaller than the number of text nodes, so the visual nodes alone cannot produce sufficient translation context.

By way of example, experiments were also conducted on the English-to-French translation dataset. As can be seen from [Table 3], the model provided by the embodiments of the present application still achieves better performance than all previous models. This again shows that, in multimodal NMT, the model provided by the embodiments of the present application is effective and general across different languages.

In [Table 2], the machine translation model provided by the multimodal NMT system of the embodiments of the present application is compared with related multimodal NMT systems. As the BLEU and METEOR metrics show, the machine translation model provided by the present application also achieves better results for translation between English and French, with three of the four metric values being the highest (shown in bold).

FIG. 11 shows a translation apparatus based on multimodal machine learning provided by an exemplary embodiment of the present application. The apparatus may be implemented as part or all of a computer device by software, hardware, or a combination thereof, and includes a semantic association module 501, a feature extraction module 502, a vector encoding module 503, and a vector decoding module 504.

The semantic association module 501 is used to obtain a semantic association diagram based on n source statements belonging to different modalities. The semantic association diagram includes semantic nodes of n different modalities, a first connecting edge used to connect semantic nodes of the same modality, and a second connecting edge used to connect semantic nodes of different modalities. Each semantic node indicates one semantic unit of the source statement in one modality, and n is a positive integer greater than 1.

Optionally, the semantic association module 501 is used to perform semantic association on the source statements of the n modalities by means of a multimodal graph representation layer to construct the semantic association diagram, the semantic association diagram including semantic nodes of n different modalities, a first connecting edge used to connect semantic nodes of the same modality, and a second connecting edge used to connect semantic nodes of different modalities, where n is a positive integer greater than 1.
The feature extraction module 502 is used to extract a plurality of first word vectors from the semantic association diagram; optionally, the first word vectors are extracted from the semantic association diagram by a first word vector layer.
The vector encoding module 503 is used to encode the plurality of first word vectors to obtain n encoded feature vectors; optionally, the first word vectors are encoded by a multimodal fusion encoder to obtain the encoded feature vectors.
The vector decoding module 504 is used to decode the n encoded feature vectors to obtain a translated target statement; optionally, a decoder is invoked to decode the encoded feature vectors to obtain the translated target statement.

In some optional embodiments, the semantic association module 501 is used to: obtain n sets of semantic nodes, one set of semantic nodes corresponding to the source statement of one modality; and add the first connecting edge between any two semantic nodes of the same modality and add the second connecting edge between any two semantic nodes of different modalities, to obtain the semantic association diagram. Optionally, the semantic association module 501 is used to: extract semantic nodes from the source statement of each modality by means of the multimodal graph representation layer to obtain n sets of semantic nodes corresponding to the source statements of the n modalities; and, by means of the multimodal graph representation layer, connect semantic nodes within the same modality in the n sets of semantic nodes using the first connecting edge and connect semantic nodes of different modalities in the n sets of semantic nodes using the second connecting edge, to obtain the semantic association diagram.
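The construction of the semantic association diagram described above can be illustrated with a minimal sketch. The function name `build_semantic_graph`, the dictionary layout of `node_sets` and the optional `is_related` filter are assumptions made for illustration; the filter is a hypothetical hook for restricting cross-modal edges to grounded pairs and is not part of the text above.

```python
from itertools import combinations


def build_semantic_graph(node_sets, is_related=None):
    """Return (first_edges, second_edges) for the given semantic nodes.

    node_sets:  dict mapping a modality name to its list of node ids.
    is_related: optional predicate (hypothetical) deciding whether two nodes
                of different modalities should be linked; None links all pairs.
    """
    first_edges = []   # edges inside one modality, tagged with that modality
    second_edges = []  # edges between nodes of different modalities

    for modality, nodes in node_sets.items():
        for u, v in combinations(nodes, 2):
            first_edges.append((u, v, modality))

    for (_, nodes_a), (_, nodes_b) in combinations(node_sets.items(), 2):
        for u in nodes_a:
            for v in nodes_b:
                if is_related is None or is_related(u, v):
                    second_edges.append((u, v))

    return first_edges, second_edges


# Illustrative nodes: three text units and two visual objects.
node_sets = {"text": ["t1", "t2", "t3"], "image": ["v1", "v2"]}
first_edges, second_edges = build_semantic_graph(node_sets)
```

Tagging each first connecting edge with its modality keeps the edge types of different modalities distinguishable, which is one possible way to represent a separate edge type per modality.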

In some optional embodiments, the source statements of the n modalities include a first source statement in text form and a second source statement in non-text form, and the n sets of semantic nodes include first semantic nodes and second semantic nodes.
The semantic association module 501 is used to: obtain the first semantic nodes, the first semantic nodes being obtained by the multimodal graph representation layer processing the first source statement; obtain candidate semantic nodes, the candidate semantic nodes being obtained by the multimodal graph representation layer processing the second source statement; obtain a first probability distribution of the candidate semantic nodes, the first probability distribution being computed by the multimodal graph representation layer according to the semantic association between the first semantic nodes and the candidate semantic nodes; and determine the second semantic nodes from the candidate semantic nodes, the second semantic nodes being determined by the multimodal graph representation layer based on the first probability distribution.

Optionally, the semantic association module 501 is used to: extract the first semantic nodes from the first source statement and extract the candidate semantic nodes from the second source statement by means of the multimodal graph representation layer; invoke the multimodal graph representation layer to compute the first probability distribution of the candidate semantic nodes according to the semantic association between the first semantic nodes and the candidate semantic nodes; and invoke the multimodal graph representation layer to determine the second semantic nodes from the candidate semantic nodes based on the first probability distribution.
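One way to picture the selection of second semantic nodes is the sketch below. It assumes that the semantic association between a first semantic node and each candidate is available as a numeric score, turns the scores into a probability distribution with a softmax, and keeps the most probable candidate. The scores, the names `softmax` and `select_second_nodes`, and the argmax selection rule are all illustrative assumptions; the text above does not specify how the distribution is computed or thresholded.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_second_nodes(similarity):
    """Pick one candidate node per first semantic node.

    similarity: dict first_node -> dict candidate_node -> association score
                (hypothetical scores standing in for the semantic association).
    Returns dict first_node -> (chosen_candidate, probability).
    """
    selected = {}
    for first_node, candidate_scores in similarity.items():
        names = list(candidate_scores)
        probs = softmax([candidate_scores[name] for name in names])
        best = max(range(len(names)), key=probs.__getitem__)
        selected[first_node] = (names[best], probs[best])
    return selected


# Illustrative scores between two text phrases and two image regions.
similarity = {
    "two boys": {"box1": 2.0, "box2": 0.1},
    "a toy car": {"box1": 0.3, "box2": 1.5},
}
selected = select_second_nodes(similarity)
```

The chosen candidates would then serve as the second semantic nodes linked to the corresponding first semantic nodes by second connecting edges.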

In some optional embodiments, the semantic association module 501 is used to add a first connecting edge of the i-th type between any two semantic nodes within the same modality in the i-th set of semantic nodes, where the first connecting edge of the i-th type corresponds to the i-th modality and i is a positive integer not greater than n.

Optionally, the semantic association module 501 is used to determine, by means of the multimodal graph representation layer, the first connecting edge of the i-th type corresponding to the i-th modality, and to connect semantic nodes within the same modality in the i-th set of semantic nodes using the first connecting edge of the i-th type, where i is a positive integer not greater than n.

In some optional embodiments, the vector encoding module 503 is used to perform intra-modal fusion and inter-modal fusion e times on the plurality of first word vectors to obtain the n encoded feature vectors. Here, intra-modal fusion refers to semantic fusion between first word vectors within the same modality, inter-modal fusion refers to semantic fusion between first word vectors of different modalities, and e is a positive integer.

Optionally, the multimodal fusion encoder includes e encoding modules connected in series, where e is a positive integer.
The vector encoding module 503 is used to perform intra-modal fusion and inter-modal fusion e times on the first word vectors by means of the e encoding modules connected in series, to obtain the encoded feature vectors. Here, intra-modal fusion refers to semantic fusion between first word vectors within the same modality, and inter-modal fusion refers to semantic fusion between first word vectors of different modalities.
In some optional embodiments, each encoding module includes n intra-modal fusion layers and n inter-modal fusion layers in one-to-one correspondence with the n modalities.
The vector encoding module 503 is used to: input the first word vectors into the n intra-modal fusion layers of the first encoding module respectively, and perform semantic fusion within the same modality on the first word vectors by means of the n intra-modal fusion layers to obtain n first hidden layer vectors, one first hidden layer vector corresponding to one modality, that is, n first hidden layer vectors in one-to-one correspondence with the n modalities;
input the n first hidden layer vectors into each inter-modal fusion layer of the first encoding module, and perform semantic fusion between different modalities on the n first hidden layer vectors by means of each inter-modal fusion layer to obtain n first intermediate vectors, one first intermediate vector corresponding to one modality, that is, n first intermediate vectors in one-to-one correspondence with the n modalities; and
input the n first intermediate vectors into the j-th encoding module to perform the j-th encoding process, and continue until the last encoding module outputs n encoded feature vectors, one encoded feature vector corresponding to one modality, that is, until the last encoding module outputs n encoded feature vectors in one-to-one correspondence with the n modalities, where j is a positive integer greater than 1 and not greater than e.
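The data flow through the e serially connected encoding modules can be sketched as below for the case of two modalities (text and image). This is a deliberately simplified stand-in: simple averaging replaces the learned fusion layers, there are no trainable parameters, and the names `intra_modal_fusion`, `inter_modal_fusion`, `encode` and `text_to_image` are assumptions made for illustration only.

```python
def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def intra_modal_fusion(states):
    """Stand-in for an intra-modal fusion layer: mix each node with the
    mean of all nodes of its own modality."""
    context = mean(states)
    return [[0.5 * x + 0.5 * c for x, c in zip(vec, context)] for vec in states]


def inter_modal_fusion(states, other_states, links):
    """Stand-in for an inter-modal fusion layer.

    links[i] lists the indices in other_states that node i is connected to
    by second connecting edges; unlinked nodes are passed through unchanged.
    """
    fused = []
    for i, vec in enumerate(states):
        neighbours = [other_states[j] for j in links.get(i, [])]
        if neighbours:
            context = mean(neighbours)
            vec = [0.5 * x + 0.5 * c for x, c in zip(vec, context)]
        fused.append(list(vec))
    return fused


def encode(text, image, text_to_image, e=3):
    """Apply e encoding modules, each doing intra- then inter-modal fusion."""
    image_to_text = {}
    for t, visual_ids in text_to_image.items():
        for v in visual_ids:
            image_to_text.setdefault(v, []).append(t)
    for _ in range(e):
        hidden_text = intra_modal_fusion(text)    # first hidden layer vectors
        hidden_image = intra_modal_fusion(image)
        text = inter_modal_fusion(hidden_text, hidden_image, text_to_image)
        image = inter_modal_fusion(hidden_image, hidden_text, image_to_text)
    return text, image


# Two text nodes, one visual node, one cross-modal link (illustrative values).
text_vectors = [[1.0, 0.0], [0.0, 1.0]]
image_vectors = [[2.0, 2.0]]
enc_text, enc_image = encode(text_vectors, image_vectors, {0: [0]}, e=3)
```

Both inter-modal steps read the hidden vectors produced in the same module, so neither modality sees the other's already fused output within one module; this ordering is a design choice of the sketch.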

In some optional embodiments, each encoding module further includes n first vector transformation layers, one first vector transformation layer corresponding to one modality, that is, n first vector transformation layers in one-to-one correspondence with the n modalities.
The vector encoding module 503 is further used to input the n first intermediate vectors respectively into the n first vector transformation layers corresponding to the modalities to which they belong, and to perform a nonlinear transformation to obtain n nonlinearly transformed first intermediate vectors.

In some optional embodiments, each of the e encoding modules connected in series has the same layer structure.

In some optional embodiments, the vector decoding module 504 is used to: perform feature extraction on a first target word to obtain a second word vector, the first target word being an already translated word in the target statement; perform feature extraction on the second word vector in combination with the encoded feature vectors to obtain a decoded feature vector; and determine a probability distribution corresponding to the decoded feature vector, and determine, based on the probability distribution, a second target word following the first target word.

Optionally, the decoder includes d decoding modules connected in series, where d is a positive integer.
The vector decoding module 504 is used to: obtain the first target word by means of a second word vector layer, the first target word being an already translated word in the target statement; perform feature extraction on the first target word by means of the second word vector layer to obtain the second word vector;
perform feature extraction on the second word vector in combination with the encoded feature vectors by means of the d decoding modules connected in series to obtain the decoded feature vector; and input the decoded feature vector into a classifier, compute the probability distribution corresponding to the decoded feature vector by means of the classifier, and determine, based on the probability distribution, the second target word following the first target word.
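The word-by-word decoding described above can be sketched as a greedy loop. In this sketch the decoding modules and the classifier are collapsed into a single hypothetical callable `step_fn` that returns a score per vocabulary word; the softmax turns those scores into the probability distribution, and the most probable word is taken as the next target word. The names `greedy_decode`, `step_fn`, `toy_step` and the start and end markers are assumptions for illustration, and greedy selection is only one possible way of using the distribution.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(step_fn, encoded, bos="<s>", eos="</s>", max_len=20):
    """Generate target words one at a time.

    step_fn(prefix, encoded) -> dict word -> score is a hypothetical stand-in
    for the decoding modules followed by the classifier.
    """
    target = [bos]
    while len(target) < max_len:
        scores = step_fn(target, encoded)
        words = list(scores)
        probs = softmax([scores[w] for w in words])
        next_word = words[max(range(len(words)), key=probs.__getitem__)]
        if next_word == eos:
            break
        target.append(next_word)
    return target[1:]  # drop the start marker


def toy_step(prefix, encoded):
    """Toy scorer that always prefers the next word of a fixed script."""
    script = ["zwei", "jungen", "</s>"]
    wanted = script[min(len(prefix) - 1, len(script) - 1)]
    return {w: (5.0 if w == wanted else 0.0) for w in script}


translation = greedy_decode(toy_step, encoded=None)
```

In a real system `step_fn` would consume the encoded feature vectors; the toy scorer ignores them so that the loop itself can be run in isolation.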

In some optional embodiments, each of the d decoding modules connected in series includes a first self-attention layer and a second self-attention layer.
The vector decoding module 504 is used to: input the second word vector into the first self-attention layer of the first decoding module, and perform feature extraction on the second word vector by means of the first self-attention layer to obtain a second hidden layer vector;
input the second hidden layer vector and the encoded feature vectors into the second self-attention layer of the first decoding module, and perform feature extraction on the second hidden layer vector in combination with the encoded feature vectors by means of the second self-attention layer to obtain a second intermediate vector; and
input the second intermediate vector into the k-th decoding module to perform the k-th decoding process, and continue until the last decoding module outputs the decoded feature vector, where k is a positive integer greater than 1 and not greater than d.
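The flow through the d decoding modules can be sketched with plain scaled dot-product attention. The sketch makes several simplifying assumptions that are not stated in the text above: keys and values are the same vectors, there are no learned projections, residual connections or masking, and the names `attend` and `decode_features` are illustrative.

```python
import math


def attend(queries, keys_values):
    """Scaled dot-product attention with keys equal to values."""
    dim = len(keys_values[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys_values
        ]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * vec[i] for w, vec in zip(weights, keys_values))
            for i in range(dim)
        ])
    return outputs


def decode_features(second_word_vectors, encoded, d=2):
    """Pass the target-side vectors through d decoding modules."""
    state = second_word_vectors
    for _ in range(d):
        hidden = attend(state, state)    # first layer: second hidden layer vectors
        state = attend(hidden, encoded)  # second layer: second intermediate vectors
    return state                         # output of the last module


# Illustrative target-side vectors and encoded feature vectors.
prefix_vectors = [[1.0, 0.0], [0.5, 0.5]]
encoded_vectors = [[1.0, 0.0], [0.0, 1.0]]
decoded = decode_features(prefix_vectors, encoded_vectors, d=2)
```

The output of the last module plays the role of the decoded feature vector that is subsequently passed to the classifier.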

In some optional embodiments, each decoding module further includes a second vector transformation layer.
The vector decoding module 504 is further used to input the second intermediate vector into the second vector transformation layer and perform a nonlinear transformation to obtain a nonlinearly transformed second intermediate vector.

In summary, the translation apparatus based on multimodal machine learning provided by this embodiment performs semantic association on the source statements of the n modalities by means of the multimodal graph representation layer to construct a semantic association diagram. In the semantic association diagram, the first connecting edge connects semantic nodes of the same modality and the second connecting edge connects semantic nodes of different modalities, so that the semantic association diagram fully expresses the semantic associations among the source statements of the multiple modalities. The multimodal fusion encoder then performs sufficient semantic fusion on the feature vectors in the semantic association diagram to obtain the encoded feature vectors, and a more accurate target statement is obtained after the encoded feature vectors are decoded. The target statement comes closer to the content, emotion, language context and the like that the multimodal source statements jointly express.

FIG. 12 shows a schematic structural diagram of a server provided by an embodiment of the present application. The server is used to implement the steps of the translation method based on multimodal machine learning provided in the above embodiments. Specifically:
The server 600 includes a central processing unit (CPU) 601, a system memory 604 including a random access memory (RAM) 602 and a read-only memory (ROM) 603, and a system bus 605 coupling the system memory 604 to the central processing unit 601. The server 600 further includes a basic input/output (I/O) system 606 that helps transfer information between devices within the computer, and a mass storage device 607 used to store an operating system 613, application programs 614 and other program modules 615.

The basic input/output system 606 includes a display 608 used to display information and an input device 609, such as a mouse or a keyboard, used by a user to input information. The display 608 and the input device 609 are both coupled to the central processing unit 601 through an input/output controller 610 coupled to the system bus 605. The basic input/output system 606 may further include the input/output controller 610 for receiving and processing input from a number of other devices such as a keyboard, a mouse or an electronic stylus. Similarly, the input/output controller 610 also provides output to a display screen, a printer or another type of output device.

The mass storage device 607 is coupled to the central processing unit 601 through a mass storage controller (not shown) coupled to the system bus 605. The mass storage device 607 and its associated computer-readable medium provide non-volatile storage for the server 600. In other words, the mass storage device 607 may include a computer-readable medium (not shown) such as a hard disk or a compact disc read-only memory (CD-ROM) drive.

Without loss of generality, the computer-readable medium may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, tape cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media are not limited to the types listed above. The system memory 604 and the mass storage device 607 may be collectively referred to as memory.

According to various embodiments of the present application, the server 600 may also operate while connected, through a network such as the Internet, to a remote computer on the network. That is, the server 600 may be connected to a network 612 through a network interface unit 611 coupled to the system bus 605, or the network interface unit 611 may be used to connect to other types of networks or remote computer systems (not shown).

In an exemplary embodiment, a computer-readable storage medium is further provided, for example a memory 602 including instructions, where the instructions can be executed by the processor 601 of the server 600 to carry out the above translation method based on multimodal machine learning. Optionally, the computer-readable storage medium may be a non-transitory storage medium; for example, the non-transitory storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device or the like.

In an exemplary embodiment, a computer program product is further provided. The computer program product includes a computer program that can be executed by a processor of an electronic device to implement the above translation method based on multimodal machine learning.

Those skilled in the art will understand that all or some of the steps of the above embodiments may be carried out by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc or the like.

The above descriptions are merely optional embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.

11 intra-modal fusion layer
12 inter-modal fusion layer
13 first vector transformation layer
21 first self-attention layer
22 second self-attention layer
23 second vector transformation layer
31 image to be translated
32 statement to be translated
100 multimodal machine translation model
101 multimodal graph representation layer
102 first word vector layer
103 multimodal fusion encoder
104 decoder
105 second word vector layer
106 classifier
220 terminal
240 server
502 feature extraction module
503 vector encoding module
504 vector decoding module
600 server
601 central processing unit
601 processor
602 memory
604 system memory
605 system bus
606 output system
607 mass storage device
608 display
609 input device
610 input/output controller
611 network interface unit
612 network
613 operating system
614 application program
615 program module
1031 encoding module
1042 decoding module

Claims

A translation method based on multimodal machine learning, performed by a computer device, the method comprising:
obtaining a semantic association graph based on n source statements belonging to different modalities, wherein the semantic association graph includes semantic nodes of n different modalities, a first connecting edge used to connect semantic nodes of the same modality, and a second connecting edge used to connect semantic nodes of different modalities, each semantic node indicating one semantic unit of the source statement in one modality, and n being a positive integer greater than 1;
extracting a plurality of first word vectors from the semantic association graph;
encoding the plurality of first word vectors to obtain n encoded feature vectors; and
decoding the n encoded feature vectors to obtain a translated target statement.

The step of obtaining a semantic association graph based on n source statements belonging to different modalities comprises:
obtaining n sets of semantic nodes, one set of semantic nodes corresponding to the source statement of one modality; and
adding the first connecting edge between any two semantic nodes of the same modality and adding the second connecting edge between any two semantic nodes of different modalities, to obtain the semantic association graph.
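
The construction step above can be illustrated with a minimal Python sketch. It assumes the semantic nodes of each modality are already available as plain labels; the function name `build_semantic_graph`, the modality names, and the example labels are hypothetical and do not come from the application.

```python
from itertools import combinations

def build_semantic_graph(node_sets):
    """Build a graph from n sets of semantic nodes (one set per modality).

    node_sets maps a modality name to a list of semantic-unit labels.
    Returns the node list plus two edge lists of node-index pairs.
    """
    # Flatten the n node sets, remembering which modality each node came from.
    nodes = [(modality, label)
             for modality, labels in node_sets.items()
             for label in labels]
    first_edges, second_edges = [], []
    for a, b in combinations(range(len(nodes)), 2):
        if nodes[a][0] == nodes[b][0]:
            first_edges.append((a, b))   # first connecting edge: same modality
        else:
            second_edges.append((a, b))  # second connecting edge: different modalities
    return nodes, first_edges, second_edges

# Hypothetical example with n = 2 modalities: three text units, two image regions.
nodes, first_edges, second_edges = build_semantic_graph({
    "text": ["two", "boys", "toy"],
    "image": ["region_0", "region_1"],
})
print(len(first_edges), len(second_edges))  # 4 6
```

Keeping the two edge types in separate lists lets later fusion layers treat same-modality and cross-modality neighbours differently.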

The method of claim 2, wherein the source statements of the n modalities include a first source statement in text form and a second source statement in non-text form, and the n sets of semantic nodes include a first semantic node and a second semantic node;
the step of obtaining n sets of semantic nodes comprises:
obtaining the first semantic node, the first semantic node being obtained by a multimodal graph representation layer processing the first source statement;
obtaining candidate semantic nodes, the candidate semantic nodes being obtained by the multimodal graph representation layer processing the second source statement;
obtaining a first probability distribution of the candidate semantic nodes, the first probability distribution being computed by the multimodal graph representation layer according to the semantic association between the first semantic node and the candidate semantic nodes; and
determining the second semantic node from the candidate semantic nodes, the second semantic node being determined by the multimodal graph representation layer based on the first probability distribution.
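
One possible reading of this selection step is sketched below in Python. The claim does not state how the semantic association is scored or how the distribution is turned into a selection, so the dot-product score, the softmax, and the probability threshold are all assumptions made for illustration, as is the name `select_second_nodes`.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def select_second_nodes(first_node_vecs, candidate_vecs, threshold):
    """Pick second semantic nodes from the candidates.

    Assumption: a candidate's association score is its largest dot product
    with any first (text) semantic node; the scores are normalised into a
    probability distribution and candidates at or above `threshold` are kept.
    """
    scores = [max(dot(c, f) for f in first_node_vecs) for c in candidate_vecs]
    probs = softmax(scores)
    selected = [i for i, p in enumerate(probs) if p >= threshold]
    return probs, selected

# Hypothetical vectors: two text nodes, three candidate image regions.
text_nodes = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
probs, selected = select_second_nodes(text_nodes, candidates, threshold=0.3)
print(selected)  # [0, 2]
```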

The method of claim 2, wherein the step of adding the first connecting edge between any two semantic nodes of the same modality comprises:
adding an i-th type of first connecting edge between any two semantic nodes of the same modality in the i-th set of semantic nodes, the i-th type of first connecting edge corresponding to the i-th modality, and i being a positive integer less than or equal to n.

The method according to any one of claims 1 to 4, wherein the step of encoding the plurality of first word vectors to obtain n encoded feature vectors comprises:
performing intra-modal fusion and inter-modal fusion on the plurality of first word vectors e times to obtain the n encoded feature vectors, wherein intra-modal fusion refers to semantic fusion between first word vectors of the same modality, inter-modal fusion refers to semantic fusion between first word vectors of different modalities, and e is a positive integer.

A multimodal fusion encoder includes e encoding modules connected in series, and each encoding module includes n intra-modal fusion layers and n inter-modal fusion layers in one-to-one correspondence with the n modalities;
the step of performing intra-modal fusion and inter-modal fusion on the plurality of first word vectors e times to obtain n encoded feature vectors comprises:
inputting the plurality of first word vectors into the n intra-modal fusion layers of the first encoding module, respectively, and performing semantic fusion within the same modality on the first word vectors through the n intra-modal fusion layers, to obtain n first hidden layer vectors, one first hidden layer vector corresponding to one modality;
inputting the n first hidden layer vectors into each inter-modal fusion layer of the first encoding module, and performing semantic fusion between different modalities on the n first hidden layer vectors through each inter-modal fusion layer, to obtain n first intermediate vectors, one first intermediate vector corresponding to one modality; and
inputting the n first intermediate vectors into the j-th encoding module for the j-th encoding process, until the last encoding module outputs the n encoded feature vectors, one encoded feature vector corresponding to one modality, and j being a positive integer greater than 1 and less than or equal to e.
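
The data flow through the e serially connected encoding modules can be sketched as follows. This is a simplified illustration, not the claimed implementation: it assumes plain scaled dot-product attention without learned parameters for both fusion types, and a residual sum for the inter-modal step. The names `attend`, `encoding_module`, and `multimodal_fusion_encoder` are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def attend(queries, memory):
    """Scaled dot-product attention; memory vectors serve as keys and values."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in memory])
        out.append([sum(w * m[d] for w, m in zip(weights, memory))
                    for d in range(len(q))])
    return out

def encoding_module(states):
    """One encoding module: intra-modal fusion followed by inter-modal fusion."""
    # Intra-modal fusion: each modality attends only to its own vectors,
    # giving one group of first hidden layer vectors per modality.
    hidden = {m: attend(vecs, vecs) for m, vecs in states.items()}
    # Inter-modal fusion (assumed form): each modality attends to the vectors
    # of all other modalities and adds the result to its own hidden vectors.
    fused = {}
    for m, vecs in hidden.items():
        others = [v for o, ovecs in hidden.items() if o != m for v in ovecs]
        context = attend(vecs, others)
        fused[m] = [[a + b for a, b in zip(v, c)] for v, c in zip(vecs, context)]
    return fused

def multimodal_fusion_encoder(first_word_vectors, e):
    """Run e encoding modules in series; the output of one feeds the next."""
    states = first_word_vectors
    for _ in range(e):
        states = encoding_module(states)
    return states

# Hypothetical first word vectors for n = 2 modalities.
encoded = multimodal_fusion_encoder(
    {"text": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
     "image": [[0.5, 0.5], [1.0, 0.0]]},
    e=2,
)
```

Each modality keeps its own group of vectors throughout, which mirrors the requirement that one encoded feature vector corresponds to one modality.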

The method of claim 6, wherein each encoding module further includes n first vector transformation layers, one first vector transformation layer corresponding to one modality;
the method further comprises:
inputting the n first intermediate vectors into the n first vector transformation layers corresponding to their respective modalities and performing a nonlinear transformation, to obtain n nonlinearly transformed first intermediate vectors.

The method of claim 6, wherein the e serially connected encoding modules all have the same layer structure.

The method of claim 6, wherein different intra-modal fusion layers use different or identical self-attention functions, and different inter-modal fusion layers use different or identical feature fusion functions.

The step of decoding the n encoded feature vectors to obtain a translated target statement comprises:
performing feature extraction on a first target phrase to obtain a second word vector, the first target phrase being an already-translated phrase in the target statement;
performing feature extraction on the second word vector in combination with the encoded feature vectors to obtain a decoded feature vector; and
determining a probability distribution corresponding to the decoded feature vector, and determining, based on the probability distribution, a second target phrase that follows the first target phrase.
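
The last step of this claim, going from a decoded feature vector to the next target phrase, can be illustrated with a short Python sketch. It assumes the probability distribution is a softmax over dot products with one output embedding per vocabulary entry and that the most probable entry is chosen greedily; the claim fixes neither choice. The vocabulary, embeddings, and the name `next_target_phrase` are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def next_target_phrase(decoded_vec, output_embeddings, vocab):
    """Turn a decoded feature vector into a distribution and pick a phrase."""
    # One logit per vocabulary entry (assumed dot-product classifier).
    logits = [sum(a * b for a, b in zip(decoded_vec, emb))
              for emb in output_embeddings]
    probs = softmax(logits)
    # Greedy choice: the entry with the highest probability.
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs

# Hypothetical three-entry target vocabulary with 2-dimensional embeddings.
vocab = ["<eos>", "zwei", "Jungen"]
output_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
phrase, probs = next_target_phrase([0.2, 1.5], output_embeddings, vocab)
print(phrase)  # Jungen
```

Repeating this step, with each chosen phrase fed back as the new already-translated phrase, yields the target statement one phrase at a time.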

The decoder includes d decoding modules connected in series, d being a positive integer, and each of the d decoding modules includes a first self-attention layer and a second self-attention layer;
the step of performing feature extraction on the second word vector in combination with the encoded feature vectors to obtain a decoded feature vector comprises:
inputting the second word vector into the first self-attention layer of the first decoding module, and performing feature extraction on the second word vector through the first self-attention layer, to obtain a second hidden layer vector;
inputting the second hidden layer vector and the encoded feature vectors into the second self-attention layer of the first decoding module, and performing feature extraction on the second hidden layer vector in combination with the encoded feature vectors through the second self-attention layer, to obtain a second intermediate vector; and
inputting the second intermediate vector into the k-th decoding module for the k-th decoding process, until the last decoding module outputs the decoded feature vector, k being a positive integer greater than 1 and less than or equal to d.
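
The chain of d decoding modules can be sketched in the same style. This illustration assumes parameter-free scaled dot-product attention, a causal restriction in the first layer (each position sees only itself and earlier positions), and a residual sum in the second layer; none of these details is stated in the claim. The names `decoding_module` and `decoder` are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def attend(queries, memory):
    """Scaled dot-product attention; memory vectors serve as keys and values."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in memory])
        out.append([sum(w * m[d] for w, m in zip(weights, memory))
                    for d in range(len(q))])
    return out

def decoding_module(target_vecs, encoded_vecs):
    """One decoding module with two attention layers."""
    # First self-attention layer over the target-side vectors
    # (assumed causal: position t attends to positions 0..t only).
    hidden = [attend([q], target_vecs[: t + 1])[0]
              for t, q in enumerate(target_vecs)]
    # Second attention layer: combine the hidden vectors with the encoded
    # feature vectors from the encoder (assumed residual sum).
    context = attend(hidden, encoded_vecs)
    return [[h + c for h, c in zip(hv, cv)] for hv, cv in zip(hidden, context)]

def decoder(second_word_vectors, encoded_vecs, d):
    """Run d decoding modules in series; the last output is the decoded feature."""
    states = second_word_vectors
    for _ in range(d):
        states = decoding_module(states, encoded_vecs)
    return states

# Hypothetical inputs: two already-translated positions, three encoded vectors.
decoded = decoder([[1.0, 0.0], [0.0, 1.0]],
                  [[1.0, 1.0], [0.5, 0.0], [0.0, 2.0]],
                  d=2)
```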

The method of claim 11, wherein each decoding module further includes a second vector transformation layer;
the method further comprises:
inputting the second intermediate vector into the second vector transformation layer and performing a nonlinear transformation, to obtain a nonlinearly transformed second intermediate vector.

A translation device based on multimodal machine learning, the device comprising:
a semantic association module used to build a semantic association graph based on n source statements belonging to different modalities, wherein the semantic association graph includes semantic nodes of n different modalities, a first connecting edge used to connect semantic nodes of the same modality, and a second connecting edge used to connect semantic nodes of different modalities, each semantic node indicating one semantic unit of the source statement in one modality, and n being a positive integer greater than 1;
a feature extraction module used to extract a plurality of first word vectors from the semantic association graph;
a vector encoding module used to encode the plurality of first word vectors to obtain n encoded feature vectors; and
a vector decoding module used to decode the n encoded feature vectors to obtain a translated target statement.

A computer device, comprising:
a memory; and
a processor coupled to the memory,
wherein the processor is configured to load and execute executable instructions to implement the translation method based on multimodal machine learning according to any one of claims 1 to 12.

A computer-readable storage medium storing at least one program segment, the at least one program segment being loaded and executed by a processor to implement the translation method based on multimodal machine learning.