WO2021233112A1 - Translation method, apparatus, device and storage medium based on multi-modal machine learning
- Publication number: WO2021233112A1 (application PCT/CN2021/091114)
- Authority: WIPO (PCT)
- Prior art keywords: semantic, modal, vector, vectors, fusion
Classifications
- G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
- G06F40/42: Data-driven translation
- G06F40/205: Parsing
- G06F40/30: Semantic analysis
- G06F40/44: Statistical methods, e.g. probability models
- G06F40/51: Translation evaluation
- G06N20/00: Machine learning
Definitions
- This application relates to the field of artificial intelligence technology, and in particular to a translation method, device, equipment and storage medium based on multi-modal machine learning.
- Machine translation is the process of using computers to transform one natural language into another natural language.
- the machine translation model can be used to translate source languages of various forms into the target language, that is, to translate a multi-modal source language into the target language; for example, given an image and its corresponding English caption, the machine translation model separately extracts features from the image and the English caption, fuses the extracted features, and then generates the French caption corresponding to the image and the English caption based on the fused features.
- the embodiment of the application provides a translation method, device, equipment, and storage medium based on multi-modal machine learning, which can perform sufficient semantic fusion of source languages of multiple modalities during feature encoding, so that the target sentence decoded from the encoded vectors is closer to the content and emotion expressed in the source language.
- the technical solution is as follows:
- a translation method based on multi-modal machine learning is provided, which is executed by a computer device, and the method includes:
- the semantic association graph includes semantic nodes of n different modalities, a first connecting edge for connecting semantic nodes of the same modality, and a second connecting edge for connecting semantic nodes of different modalities, where a semantic node is used to represent a semantic unit of the source sentence in one modality, and n is a positive integer greater than 1;
- the n encoded feature vectors are decoded to obtain the translated target sentence.
- a translation device based on multi-modal machine learning including:
- the semantic association module is used to construct a semantic association graph based on n source sentences belonging to different modalities.
- the semantic association graph includes semantic nodes of n different modalities, a first connecting edge for connecting semantic nodes of the same modality, and a second connecting edge for connecting semantic nodes of different modalities, where a semantic node is used to represent a semantic unit of the source sentence in one modality, and n is a positive integer greater than 1;
- the feature extraction module is used to extract multiple first word vectors from the semantic association graph;
- the vector encoding module is used to encode the multiple first word vectors to obtain n encoded feature vectors;
- the vector decoding module is used to decode the n encoded feature vectors to obtain the translated target sentence.
- a computer device is provided, which includes a processor and a memory, the processor being connected to the memory;
- the processor is configured to load and execute executable instructions to implement the translation method based on multi-modal machine learning as described in the previous aspect and its optional embodiments.
- a computer-readable storage medium is provided, which stores at least one instruction, at least one program, a code set, or an instruction set; the above-mentioned at least one instruction, at least one program, code set, or instruction set is loaded and executed by a processor to implement the translation method based on multi-modal machine learning as described in the previous aspect and its optional embodiments.
- Fig. 1 is a schematic structural diagram of a multi-modal machine translation model provided by an exemplary embodiment of the present application.
- Fig. 2 is a schematic structural diagram of a computer system provided by an exemplary embodiment of the present application.
- Fig. 3 is a flowchart of a translation method based on multi-modal machine learning provided by an exemplary embodiment of the present application.
- Fig. 4 is a flowchart of constructing a semantic association graph provided by an exemplary embodiment of the present application.
- Fig. 5 is a flowchart of a translation method based on multi-modal machine learning provided by another exemplary embodiment of the present application.
- Fig. 6 is a flowchart of a translation method based on multi-modal machine learning provided by another exemplary embodiment of the present application.
- Fig. 7 is a schematic structural diagram of a multi-modal machine translation model provided by another exemplary embodiment of the present application.
- Fig. 8 is a graph of model test results provided by an exemplary embodiment of the present application.
- Fig. 9 is a graph of model test results provided by another exemplary embodiment of the present application.
- Fig. 10 is a graph of model test results provided by another exemplary embodiment of the present application.
- Fig. 11 is a block diagram of a translation device based on multi-modal machine learning provided by an exemplary embodiment of the present application.
- Fig. 12 is a schematic structural diagram of a server provided by an exemplary embodiment of the present application.
- Artificial Intelligence (AI) is the theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
- Artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
- Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
- Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including hardware-level technology and software-level technology.
- Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
- Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
- Natural language processing is an important direction in the field of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, that is, the language people use daily, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.
- Machine Learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance.
- Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence.
- Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, analogical learning and other technologies.
- a multi-modal machine translation model is provided, which can accurately translate source sentences of n different modalities into a target sentence.
- modality refers to the form in which language is expressed; for example, a sentence can be represented by an image or by text.
- the source sentence refers to the sentence to be translated, which includes a sentence to be translated in a first language in text form and content in non-text form.
- the target sentence refers to the translated sentence in a second language in text form, the second language being different from the first language.
- for example, the source sentences include an English sentence and a picture matching the English sentence, and the Chinese sentence corresponding to the English sentence and the picture can be obtained through the multi-modal machine translation model.
- FIG. 1 shows a schematic structural diagram of a multi-modal machine translation model 100 provided by an exemplary embodiment of the present application.
- the multi-modal machine translation model 100 includes a multi-modal graph representation layer 101, a first word vector layer 102, a multimodal fusion encoder 103, and a decoder 104.
- the multi-modal graph representation layer 101 is used to perform semantic association on the source languages of the n modalities to obtain a semantic association graph.
- the semantic association graph includes semantic nodes of n different modalities, a first connecting edge used to connect semantic nodes of the same modality, and a second connecting edge used to connect semantic nodes of different modalities, where n is a positive integer greater than 1.
- a semantic node is used to represent a semantic unit of a source sentence in one modality; taking English as an example, a semantic node corresponds to a word, and taking Chinese as an example, a semantic node corresponds to a Chinese character.
- the first word vector layer 102 is used to extract multiple first word vectors from the semantic association graph
- the multimodal fusion encoder 103 is configured to encode the plurality of first word vectors to obtain n encoded feature vectors;
- the decoder 104 is configured to decode the n encoded feature vectors to obtain the translated target sentence.
- the multi-modal graph representation layer 101 is used to obtain n sets of semantic nodes, where a set of semantic nodes corresponds to the source sentence of one modality; a first connecting edge is added between any two semantic nodes of the same modality, and a second connecting edge is added between any two semantic nodes of different modalities, to obtain the semantic association graph.
- the multi-modal graph representation layer 101 is used to extract semantic nodes from the source language of each modality to obtain n sets of semantic nodes corresponding to the source languages of the n modalities;
- the multi-modal graph representation layer 101 is used to connect the semantic nodes within the same modality among the n groups of semantic nodes by using the first connecting edge, and to connect the semantic nodes across different modalities by using the second connecting edge, so as to obtain the semantic association graph.
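The node-and-edge construction described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the node values and the `intra_i`/`inter` edge labels are assumed names, and fully connecting every node pair is one straightforward reading of "any two semantic nodes".

```python
from itertools import combinations, product

def build_semantic_graph(modal_nodes):
    """Build a semantic association graph from per-modality node sets.

    modal_nodes: list of lists; modal_nodes[i] holds the semantic nodes
    (e.g. words or image regions) of the i-th modality.
    Returns (nodes, edges); each edge is (u, v, edge_type), where type
    "intra_i" is the i-th type of first connecting edge and "inter" is a
    second connecting edge across modalities.
    """
    nodes = [(i, u) for i, group in enumerate(modal_nodes) for u in group]
    edges = []
    # First connecting edges: any two semantic nodes of the same modality.
    for i, group in enumerate(modal_nodes):
        for u, v in combinations(group, 2):
            edges.append(((i, u), (i, v), f"intra_{i}"))
    # Second connecting edges: semantic nodes of different modalities.
    for i, j in combinations(range(len(modal_nodes)), 2):
        for u, v in product(modal_nodes[i], modal_nodes[j]):
            edges.append(((i, u), (j, v), "inter"))
    return nodes, edges

# Hypothetical example: an English caption plus two image-region nodes.
nodes, edges = build_semantic_graph([["a", "dog", "runs"], ["region1", "region2"]])
```

With 3 text nodes and 2 image nodes, this yields 3 first connecting edges of type 0, 1 of type 1, and 6 second connecting edges.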
- the n modal source sentences include a first source sentence in a text form and a second source sentence in a non-text form, and the n groups of semantic nodes include the first semantic node and the second semantic node;
- the multi-modal graph representation layer 101 is used to obtain the first semantic nodes, which are obtained by processing the first source sentence; obtain candidate semantic nodes, which are obtained by processing the second source sentence; obtain the first probability distribution of the candidate semantic nodes, which is calculated according to the semantic association between the first semantic nodes and the candidate semantic nodes; and determine the second semantic nodes from among the candidate semantic nodes according to the first probability distribution.
- the multimodal graph representation layer 101 is used to extract the first semantic node from the first source sentence and extract candidate semantic nodes from the second source sentence; according to the first semantic node and The semantic association between the candidate semantic nodes calculates the first probability distribution of the candidate semantic nodes; the second semantic node is determined from the candidate semantic nodes according to the first probability distribution.
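As an illustration of selecting the second semantic nodes, the sketch below scores each candidate against the first semantic nodes, turns the scores into the first probability distribution with a softmax, and keeps high-probability candidates. The dot-product scoring, the softmax, and the threshold rule are all assumptions; the patent does not fix a particular association function or selection rule.

```python
import math

def select_second_nodes(first_vecs, candidate_vecs, threshold=0.2):
    """Return (probs, kept): the first probability distribution over the
    candidate semantic nodes, and the indices kept as second semantic nodes.

    first_vecs: vectors of the first semantic nodes (from the text sentence).
    candidate_vecs: vectors of candidate nodes (from the non-text source).
    """
    # Association score: total dot-product similarity to the first nodes
    # (an illustrative stand-in for the semantic association measure).
    scores = [sum(sum(f * c for f, c in zip(fv, cv)) for fv in first_vecs)
              for cv in candidate_vecs]
    # Softmax turns the scores into the first probability distribution.
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep candidates whose probability clears the (assumed) threshold.
    kept = [i for i, p in enumerate(probs) if p >= threshold]
    return probs, kept
```

For example, with one text-node vector `[1.0, 0.0]` and candidates `[1.0, 0.0]`, `[0.0, 1.0]`, `[-1.0, 0.0]`, the first two candidates clear a 0.2 threshold and become second semantic nodes.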
- the multi-modal graph representation layer 101 is used to add the i-th type of first connecting edge between any two semantic nodes of the same modality in the i-th group of semantic nodes, where the i-th type of first connecting edge corresponds to the i-th modality, and i is a positive integer less than or equal to n.
- the multi-modal graph representation layer 101 is used to determine the i-th type of first connecting edge corresponding to the i-th modality, and to use the i-th type of first connecting edge to connect the semantic nodes of the i-th group within the same modality, where i is a positive integer less than or equal to n.
- the n encoded feature vectors are obtained through the following process: performing e rounds of intra-modal fusion and inter-modal fusion on the multiple first word vectors to obtain the encoded feature vectors, where intra-modal fusion refers to semantic fusion between first word vectors of the same modality, inter-modal fusion refers to semantic fusion between first word vectors of different modalities, and e is a positive integer.
- the multi-modal fusion encoder 103 includes e serially connected encoding modules 1031, and each encoding module 1031 includes n intra-modal fusion layers 11 and n inter-modal fusion layers 12, each set corresponding one-to-one to the n modalities.
- the first encoding module 1031 is used to input the first word vectors into the n intra-modal fusion layers 11 in the first encoding module, and to perform intra-modal semantic fusion on the first word vectors through the n intra-modal fusion layers 11 respectively, obtaining n first hidden layer vectors, one first hidden layer vector per modality, that is, n first hidden layer vectors corresponding one-to-one to the n modalities;
- the first encoding module 1031 is used to input the n first hidden layer vectors into each inter-modal fusion layer 12 in the first encoding module, and each inter-modal fusion layer 12 performs semantic fusion between different modalities on the n first hidden layer vectors, obtaining n first intermediate vectors, one first intermediate vector per modality, that is, n first intermediate vectors corresponding one-to-one to the n modalities;
- the j-th encoding module 1031 is used to perform the j-th encoding process on the n first intermediate vectors, until the last encoding module outputs n encoded feature vectors, one encoded feature vector per modality, that is, n encoded feature vectors corresponding one-to-one to the n modalities, where j is a positive integer greater than 1 and less than or equal to e.
- each encoding module 1031 further includes n first vector conversion layers 13, one vector conversion layer corresponding to one modality, that is, n first vector conversion layers 13 corresponding one-to-one to the n modalities;
- the encoding module 1031 is also configured to input the n first intermediate vectors into the n first vector conversion layers 13 corresponding to their respective modalities for nonlinear conversion, obtaining n nonlinearly converted first intermediate vectors.
- each of the e serially connected encoding modules 1031 has the same structure.
- different or the same self-attention functions are set in different intra-modal fusion layers, and different or the same feature fusion functions are set in different inter-modal fusion layers.
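A toy sketch of one encoding module is given below: intra-modal fusion as plain dot-product self-attention per modality, followed by a simple inter-modal fusion. Projection matrices, residual connections, normalization, and the first vector conversion layers are omitted, and the averaging fusion function is an assumption standing in for the patent's unspecified feature fusion function.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Intra-modal fusion layer: scaled dot-product self-attention over the
    word vectors of one modality (learned projections omitted)."""
    out, d = [], len(vectors[0])
    for q in vectors:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in vectors])
        out.append([sum(w * v[j] for w, v in zip(weights, vectors))
                    for j in range(d)])
    return out

def inter_modal_fusion(hidden_by_modality, target):
    """Inter-modal fusion layer for one modality (assumes n > 1): fuse each
    of its hidden vectors with the mean hidden vector of the other
    modalities, by simple averaging."""
    d = len(hidden_by_modality[target][0])
    others = [h for m, hs in enumerate(hidden_by_modality) if m != target
              for h in hs]
    mean_other = [sum(h[j] for h in others) / len(others) for j in range(d)]
    return [[(h[j] + mean_other[j]) / 2 for j in range(d)]
            for h in hidden_by_modality[target]]

def encoding_module(word_vecs_by_modality):
    """One encoding module: intra-modal fusion per modality, then
    inter-modal fusion, yielding one intermediate vector set per modality."""
    hidden = [self_attention(vs) for vs in word_vecs_by_modality]
    return [inter_modal_fusion(hidden, m) for m in range(len(hidden))]
```

Stacking this function e times, each round feeding on the previous round's output, mirrors the e serially connected encoding modules.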
- the multimodal machine translation model 100 further includes a second word vector layer 105 and a classifier 106, and the decoder 104 includes d serially connected decoding modules 1042, where d is a positive integer;
- the second word vector layer 105 is used to obtain a first target word, where the first target word is a translated word in the target sentence; perform feature extraction on the first target word to obtain a second word vector;
- the decoder 104 is configured to perform feature extraction by combining the second word vector and the encoded feature vector through d serial decoding modules 1042 to obtain a decoded feature vector;
- the classifier 106 is configured to determine the probability distribution corresponding to the decoded feature vector, and determine the second target word after the first target word according to the probability distribution.
- each of the d serially connected decoding modules 1042 includes a first self-attention layer 21 and a second self-attention layer 22;
- the first decoding module 1042 is used to input the second word vector into the first self-attention layer 21 in the first decoding module 1042, and to perform feature extraction on the second word vector through the first self-attention layer 21 to obtain a second hidden layer vector;
- the first decoding module 1042 is used to input the second hidden layer vector and the encoded feature vectors into the second self-attention layer 22 in the first decoding module 1042, and to perform feature extraction on the second hidden layer vector and the encoded feature vectors through the second self-attention layer 22 to obtain a second intermediate vector;
- the k-th decoding module 1042 is used to perform the k-th decoding process on the second intermediate vector, until the last decoding module outputs the decoded feature vector, where k is a positive integer greater than 1 and less than or equal to d.
- each decoding module 1042 further includes: a second vector conversion layer 23;
- the decoding module 1042 is configured to input the second intermediate vector into the second vector conversion layer 23 for non-linear conversion to obtain the second intermediate vector after the non-linear conversion.
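The two-attention-layer decoding module and the classifier can be sketched as follows. This is a simplified illustration: learned projections, masking, residuals, and the second vector conversion layer are omitted, and the vocabulary embeddings and the words `"chien"`/`"chat"` in the usage example are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    """Single-query scaled dot-product attention (projections omitted)."""
    d = len(query)
    w = softmax([sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                 for key in keys])
    return [sum(wi * v[j] for wi, v in zip(w, values))
            for j in range(len(values[0]))]

def decoding_module(second_word_vecs, encoded_vecs):
    """One decoding module: the first self-attention layer attends over the
    already-translated target words; the second attends from the result to
    the encoded feature vectors of all modalities."""
    hidden = [attend(q, second_word_vecs, second_word_vecs)
              for q in second_word_vecs]          # second hidden layer vectors
    return [attend(h, encoded_vecs, encoded_vecs)
            for h in hidden]                      # second intermediate vectors

def classifier(decoded_vec, vocab_embeddings, vocab):
    """Map the decoded feature vector to a probability distribution over the
    vocabulary and pick the second target word."""
    logits = [sum(d * e for d, e in zip(decoded_vec, emb))
              for emb in vocab_embeddings]
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs

# Hypothetical usage: one translated word so far, two encoded vectors.
step = decoding_module([[1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]])
word, probs = classifier([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], ["chien", "chat"])
```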
- in summary, the multi-modal machine translation model uses the multi-modal graph representation layer to semantically associate the source languages of the n modalities to obtain the semantic association graph, in which the first connecting edges connect semantic nodes of the same modality and the second connecting edges connect semantic nodes of different modalities, so that the semantic association between the source languages of multiple modalities can be fully represented by the semantic association graph; the multimodal fusion encoder then performs sufficient semantic fusion on the feature vectors in the semantic association graph to obtain the encoded feature vectors, and after decoding the encoded feature vectors, a more accurate target sentence is obtained, one that is closer to the content, emotion, and language environment expressed by the source sentences.
- FIG. 2 shows a schematic structural diagram of a computer system provided by an exemplary embodiment of the present application.
- the computer system includes a terminal 220 and a server 240.
- An operating system is installed on the terminal 220; an application program is installed on the operating system, and the application program supports a multi-modal source language translation function.
- the above-mentioned application programs may include instant messaging software, financial software, game software, shopping software, video playback software, community service software, audio software, education software, payment software, translation software, and so on; the translation function for the above-mentioned multi-modal source language is integrated into these application programs.
- the terminal 220 and the server 240 are connected to each other through a wired or wireless network.
- the server 240 includes at least one of a server, multiple servers, a cloud computing platform, and a virtualization center.
- the server 240 includes a processor and a memory, where a computer program is stored in the memory, and the processor reads and executes the computer program to realize a multi-modal source language translation function.
- the server 240 is responsible for the main calculation work and the terminal 220 for the secondary calculation work; or the server 240 is responsible for the secondary calculation work and the terminal 220 for the main calculation work; or the server 240 and the terminal 220 perform collaborative computing using a distributed computing architecture.
- the server 240 provides background services for applications on the terminal 220 in the process of implementing the translation function of the above-mentioned multi-modal language.
- the terminal 220 collects source sentences of n modalities and sends them to the server 240; the server 240 executes the translation method based on multi-modal machine learning provided in this application, where n is a positive integer greater than 1.
- the terminal 220 includes a data transmission control; the terminal 220 uploads source sentences of two different modalities, namely the sentence to be translated and the image matching the sentence to be translated, to the server 240 through the above data transmission control, and the server 240 executes the translation method based on multi-modal machine learning provided in this application to translate the source sentences of the two modalities into the target sentence.
- the source sentences may include a voice signal; if the source sentences of the n modalities include a voice signal, before translating the source sentences of the n modalities, the terminal 220 or the server 240 first converts the voice signal into text.
- the terminal 220 collects voice signals through a microphone, or the terminal 220 receives voice signals sent by other terminals.
- the above-mentioned translation method based on multi-modal machine learning can be applied to a multimedia news translation scenario: the terminal 220 uploads multimedia news including text and images to the server 240, and the server 240 executes the translation method based on multi-modal machine learning provided in this application to translate the first-language text in the multimedia news into second-language text.
- the above-mentioned translation method based on multi-modal machine learning can be applied to a foreign-language document translation scenario: the terminal 220 uploads the text in the foreign-language document and the illustrations corresponding to the text to the server 240, and the server 240 executes the translation method based on multi-modal machine learning provided in this application to translate the first-language text in the foreign-language document into second-language text.
- the above-mentioned translation method based on multi-modal machine learning can be applied to a foreign-language website translation scenario: the terminal 220 collects the text and text images on the foreign-language website and uploads them to the server 240, and the server 240 executes the translation method based on multi-modal machine learning provided in this application to translate the first-language text on the foreign-language website into second-language text, thereby realizing the translation of the foreign-language website.
- the terminal 220 displays the translated text in a voice format or a text format.
- the terminal 220 executes the translation method based on multi-modal machine learning provided in this application, and then translates source sentences of n modalities.
- the terminal 220 may generally refer to one of multiple terminals, and this embodiment only uses the terminal 220 as an example for illustration.
- the terminal 220 may include at least one of: a smartphone, a tablet computer, an e-book reader, a Moving Picture Experts Group Audio Layer III (MP3) player, a Moving Picture Experts Group Audio Layer IV (MP4) player, a laptop computer, and a desktop computer.
- the terminal 220 including a smartphone and a personal computer is taken as an example.
- the number of the aforementioned terminals 220 may be more or less. For example, there may be only one terminal, or there may be dozens or hundreds of terminals, or more. The embodiment of the present application does not limit the number and device types of the terminals 220.
- FIG. 3 shows a flowchart of a translation method based on multi-modal machine learning provided by an exemplary embodiment of the present application.
- the method is applied to a computer device as shown in FIG. 2, where the computer device includes a terminal or a server; the method includes:
- Step 301 The computer device performs semantic association on the source sentences of n modalities, and constructs a semantic association graph.
- the above-mentioned semantic association graph includes semantic nodes of n different modalities, a first connecting edge used to connect semantic nodes of the same modality, and a second connecting edge used to connect semantic nodes of different modalities, where n is a positive integer greater than 1.
- the source sentence corresponds to a set of semantic nodes, and the set of semantic nodes includes at least one semantic node for representing a semantic unit in the source sentence.
- the computer equipment is equipped with a multi-modal fusion encoder and decoder.
- the computer device extracts semantic nodes from the source sentence of each modality through the multi-modal graph representation layer to obtain n sets of semantic nodes corresponding to the n modal source sentences; through the multi-modal graph representation layer, first connecting edges are used to connect semantic nodes of the same modality among the n sets of semantic nodes, that is, a first connecting edge is added between any two semantic nodes of the same modality,
- and second connecting edges are used to connect semantic nodes of different modalities among the n sets of semantic nodes, that is, a second connecting edge is added between semantically corresponding nodes of different modalities, to obtain the semantic association graph.
- the n modal source sentences include a first source sentence in a text form and a second source sentence in a non-text form, and the n sets of semantic nodes include the first semantic node and the second semantic node;
- the multi-modal graph representation layer extracts the first semantic nodes from the first source sentence and extracts candidate semantic nodes from the second source sentence;
- the multi-modal graph representation layer is called to calculate a first probability distribution over the candidate semantic nodes according to the semantic association between the first semantic nodes and the candidate semantic nodes; the multi-modal graph representation layer is then called to determine the second semantic nodes from the candidate semantic nodes according to the first probability distribution.
- the computer device performs word segmentation on the first source sentence to obtain m words, and the m words correspond to the first semantic nodes in the first source sentence, where m is a positive integer;
- for the extraction of semantic nodes from the second source sentence in non-text form, the computer device extracts from the second source sentence a target corresponding to the semantics of at least one of the m words, and the target serves as a second semantic node in the second source sentence.
- the source sentences of the two modalities include the image to be translated 31 and the sentence to be translated 32.
- the content of the sentence to be translated 32 is "Two boys are playing with a toy car.", and each English word corresponds to a first semantic node, namely Vx1, Vx2, Vx3, Vx4, Vx5, Vx6, Vx7, and Vx8; based on the semantics of these semantic nodes, the computer device crops candidate images from the image to be translated 31, calculates the first probability distribution according to the semantic nodes and the candidate images, and, according to the first probability distribution, determines from the candidate images the target image 1 and target image 2 corresponding to the semantics of Vx1 and Vx2, as well as the target image 3 corresponding to the semantics of Vx6, Vx7, and Vx8.
- Vo1, Vo2, and Vo3, corresponding to target image 1, target image 2, and target image 3 respectively, are the three second semantic nodes in the image to be translated 31.
- the computer device uses first connecting edges (solid lines) between Vx1, Vx2, Vx3, Vx4, Vx5, Vx6, Vx7, and Vx8 for intra-modal semantic connection, and likewise uses first connecting edges between Vo1, Vo2, and Vo3 for intra-modal semantic connection;
- second connecting edges (dotted lines) are used between the first semantic nodes and the second semantic nodes for inter-modal semantic connection.
- different modalities are correspondingly provided with different first connecting edges; when connecting semantic nodes within a modality, the computer device determines, through the multi-modal graph representation layer, the i-th first connecting edge corresponding to the i-th modality, and uses the i-th first connecting edge to connect semantic nodes within the i-th set of semantic nodes, that is, an i-th first connecting edge is added between any two semantic nodes in the i-th set, where i is a positive integer less than or equal to n.
- taking the translation of source sentences of two modalities as an example, if the source sentences of the two modalities are text and an image respectively, the computer device uses a visual grounding tool to establish the semantic association between the source sentences of the two modalities and construct the semantic association graph.
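The graph-construction procedure above (one node set per modality, first connecting edges within a modality, second connecting edges across modalities) can be sketched as follows. This is an illustrative sketch, not code from the patent; the function name, node labels, and data layout are assumptions.

```python
from itertools import combinations

def build_semantic_association_graph(modal_nodes, cross_modal_pairs):
    """modal_nodes: {modality: [node ids]}; cross_modal_pairs: pairs of
    semantically corresponding nodes from different modalities."""
    first_edges = []  # first connecting edges: any two nodes of one modality
    for modality, nodes in modal_nodes.items():
        for a, b in combinations(nodes, 2):
            first_edges.append((modality, a, b))
    second_edges = list(cross_modal_pairs)  # second connecting edges
    return {"nodes": modal_nodes, "intra": first_edges, "inter": second_edges}

# Toy example following the Vx*/Vo* naming used in the text.
graph = build_semantic_association_graph(
    {"text": ["Vx1", "Vx2", "Vx3"], "image": ["Vo1", "Vo2"]},
    [("Vx1", "Vo1"), ("Vx2", "Vo1")],
)
```

In a real system the cross-modal pairs would come from the visual grounding step rather than being given by hand.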
- Step 302 The computer device extracts multiple first word vectors from the semantic association graph.
- the computer device uses word embedding to process the semantic association graph to obtain multiple first word vectors; word embedding refers to mapping words into word vectors.
- word embedding methods include at least one of the following:
- Word embedding is carried out based on the semantics of the context in which the word is located.
- one-hot encoding (One-Hot Encoding) is used to represent words in a source sentence in text form, and then word embedding is performed through an embedding matrix.
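As a hedged illustration of the one-hot-plus-embedding-matrix method just described (the vocabulary, dimension, and values are made up for the example):

```python
import numpy as np

vocab = {"two": 0, "boys": 1, "are": 2, "playing": 3}
d_model = 4
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d_model))  # embedding matrix

def embed(word):
    one_hot = np.zeros(len(vocab))  # one-hot encoding of the word
    one_hot[vocab[word]] = 1.0
    return one_hot @ E              # word embedding via the embedding matrix
```

Multiplying a one-hot vector by the embedding matrix is equivalent to selecting the corresponding row, which is why practical implementations use a table lookup instead.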
- Step 303 The computer device encodes multiple first word vectors to obtain n encoded feature vectors.
- the computer device uses the multi-modal fusion encoder to perform intra-modal feature extraction on the first word vectors, and then performs inter-modal feature fusion on the vectors obtained by the feature extraction.
- schematically, the multi-modal fusion encoder includes a first feature extraction function corresponding to the first modality, a second feature extraction function corresponding to the second modality, and a third feature extraction function corresponding to the third modality; the computer device performs feature extraction on the first word vectors within the first modality through the first feature extraction function, within the second modality through the second feature extraction function, and within the third modality through the third feature extraction function, finally obtaining three hidden layer vectors.
- the multi-modal fusion encoder also includes a first feature fusion function corresponding to the first mode, a second feature fusion function corresponding to the second mode, and a third feature fusion function corresponding to the third mode;
- the computer device uses the first feature fusion function to perform inter-modal feature fusion on the above three hidden layer vectors, uses the second feature fusion function to perform inter-modal feature fusion on them, and uses the third feature fusion function to perform inter-modal feature fusion on them; the three fused hidden layer vectors so obtained are the encoded feature vectors.
- Step 304 The computer device decodes the n coded feature vectors to obtain the translated target sentence.
- the computer device calls the decoder to decode the n coded feature vectors to obtain the translated target sentence, which is a sentence obtained by translating n modal source sentences into a specified language class.
- in summary, the translation method based on multi-modal machine learning uses the multi-modal graph representation layer to semantically associate the n modal source sentences to construct a semantic association graph; in the semantic association graph, the first connecting edges connect semantic nodes of the same modality and the second connecting edges connect semantic nodes of different modalities.
- the semantic association graph fully expresses the semantic associations between the source sentences of the multiple modalities; the multi-modal fusion encoder then performs sufficient semantic fusion on the feature vectors in the semantic association graph to obtain the encoded feature vectors, and a more accurate target sentence is obtained after decoding them, so that the content, emotion, and language environment expressed by the target sentence and the multi-modal source sentences are closer to each other.
- the multi-modal fusion encoder includes e serially connected encoding modules, and each encoding module includes n intra-modal fusion layers and n inter-modal fusion layers in one-to-one correspondence with the n modalities, where e is a positive integer; therefore, step 303 may include step 3031, as shown in Figure 5, as follows:
- Step 3031: The computer device performs e rounds of intra-modal fusion and inter-modal fusion on the plurality of first word vectors through the e serial encoding modules to obtain n encoded feature vectors.
- intra-modal fusion refers to semantic fusion between first word vectors in the same modal
- inter-modal fusion refers to semantic fusion between first word vectors of different modalities.
- the intra-modal and inter-modal fusion of the above-mentioned encoded feature vector can be realized through the following steps:
- the computer device inputs the first word vectors into the first intra-modal fusion layer in the first encoding module, and the first intra-modal fusion layer performs intra-modal semantic fusion on the first word vectors to obtain the first first hidden layer vector; the first word vectors are input into the second intra-modal fusion layer in the first encoding module, which performs intra-modal semantic fusion to obtain the second first hidden layer vector; and so on, until the first word vectors are input into the n-th intra-modal fusion layer in the first encoding module, which performs intra-modal semantic fusion to obtain the n-th first hidden layer vector.
- a feature extraction function is set in the intra-modal fusion layer.
- the feature extraction function includes a self-attention function.
- the intra-modal fusion layers of different modalities are provided with different or identical self-attention functions. It should be noted that a difference in self-attention functions refers to a difference in the parameters of the function; if the self-attention functions corresponding to different modalities are different, the parameters in the functions corresponding to different modalities are different.
- the computer device inputs the n first hidden layer vectors into the first inter-modal fusion layer in the first encoding module, and the first inter-modal fusion layer performs inter-modal semantic fusion on the n first hidden layer vectors to obtain the first first intermediate vector corresponding to the first modality; the n first hidden layer vectors are input into the second inter-modal fusion layer in the first encoding module, which performs inter-modal semantic fusion to obtain the second first intermediate vector corresponding to the second modality; and so on, until the n first hidden layer vectors are input into the n-th inter-modal fusion layer in the first encoding module, which performs inter-modal semantic fusion to obtain the n-th first intermediate vector corresponding to the n-th modality.
- each inter-modal fusion layer is provided with a feature fusion function.
- the feature fusion functions set in different inter-modal fusion layers are different or the same. It should be noted that the difference in the feature fusion function refers to the different parameters in the function, or the different calculation methods of the function.
- each encoding module further includes n first vector conversion layers in one-to-one correspondence with the n modalities; after obtaining the n first intermediate vectors, the computer device inputs them respectively into the first vector conversion layers of the corresponding modalities for non-linear conversion, obtaining n non-linearly converted first intermediate vectors.
- the computer device inputs the n first intermediate vectors into the second encoding module for the second encoding process to obtain n re-encoded first intermediate vectors; and so on, the n re-encoded first intermediate vectors are input into the j-th encoding module for the j-th encoding process, until the n re-encoded first intermediate vectors are input into the e-th encoding module for the e-th encoding process to obtain the n encoded feature vectors, where j is a positive integer greater than 1 and less than or equal to e.
- each of the e serial encoding modules is the same, that is, the j-th encoding module processes the first intermediate vectors following the steps of the first encoding module, until the last encoding module outputs the encoded feature vectors.
- schematically, the self-attention mechanism is used to model the semantic information within the same modality; the j-th encoding module calculates the first hidden layer vector corresponding to the text sentence as (a reconstruction consistent with the surrounding description, the original formula image being absent):
- C_x^(j) = MultiHead(H_x^(j-1), H_x^(j-1), H_x^(j-1))
- where x identifies the semantic nodes of the text sentence and the vectors calculated from them; MultiHead(Q, K, V) is a multi-head attention modeling function that takes the triple (Queries, Keys, Values) as input, where Q is the query matrix, K is the key matrix, and V is the value matrix, each calculated from the parameter vectors.
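A minimal, self-contained sketch of such a MultiHead(Q, K, V) function follows; the learned projection matrices of a full implementation are omitted, and the head count and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head(Q, K, V, num_heads=2):
    """Split the feature dimension into heads, run scaled dot-product
    attention per head, and concatenate the head outputs."""
    d = Q.shape[-1] // num_heads
    outs = []
    for h in range(num_heads):
        q, k, v = (m[:, h * d:(h + 1) * d] for m in (Q, K, V))
        attn = softmax(q @ k.T / np.sqrt(d))  # attention weights, rows sum to 1
        outs.append(attn @ v)
    return np.concatenate(outs, axis=-1)
```

Self-attention, as used in the intra-modal fusion layers, is the special case where Q, K, and V are all the same set of node vectors.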
- the j-th encoding module calculates the first hidden layer vector corresponding to the image in the same way, with the image-side hidden layer vectors of the (j-1)-th module as the input of the multi-head attention function.
- a gate-based cross-modal fusion mechanism is also used to model semantic fusion between multiple modalities; the j-th encoding module calculates the first intermediate vector (or encoded feature vector) corresponding to the text sentence as (a reconstruction consistent with the surrounding description, the original formula image being absent):
- H_xu^(j) = C_xu^(j) + Σ_{s ∈ A(v_xu)} α_{u,s} ⊙ C_os^(j), with α_{u,s} = Sigmoid(W1·C_xu^(j) + W2·C_os^(j))
- where A(v_xu) is the set of neighbor nodes of the u-th text semantic node v_xu in the semantic association graph; v_xu represents the u-th semantic node of the text sentence, with u a positive integer; C_os^(j) is the semantic representation vector of the s-th semantic node of the image in the j-th encoding module; C_xu^(j) is the semantic representation vector of the u-th semantic node of the text sentence in the j-th encoding module; W1 and W2 are parameter matrices; ⊙ represents element-wise multiplication; and Sigmoid() is the s-shaped logistic function. The first intermediate vector or encoded feature vector corresponding to the image is calculated in the same way and is not repeated here.
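The gate-based cross-modal fusion just described can be sketched as below. This is a hedged illustration: the parameter matrices W1 and W2, the dimensions, and the function name are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cross_modal_fusion(h_text, h_vis_neighbors, W1, W2):
    """For one text node: add each neighboring visual node's state scaled by
    a sigmoid gate computed from both states (element-wise product)."""
    fused = h_text.copy()
    for h_vis in h_vis_neighbors:
        gate = sigmoid(W1 @ h_text + W2 @ h_vis)  # gate values in (0, 1)
        fused = fused + gate * h_vis
    return fused
```

The gate lets each text node decide, dimension by dimension, how much of each neighboring visual node's information to absorb.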
- FNN refers to a feed-forward neural network; {·} denotes a set; H_xu^(j) represents the encoded feature vector corresponding to the u-th semantic node of the text sentence in the j-th encoding module, and H_os^(j) represents the encoded feature vector corresponding to the s-th semantic node of the image in the j-th encoding module.
- in summary, the translation method based on multi-modal machine learning uses the multi-modal graph representation layer to semantically associate the n modal source sentences to construct a semantic association graph; in the semantic association graph, the first connecting edges connect semantic nodes of the same modality and the second connecting edges connect semantic nodes of different modalities.
- the semantic association graph fully expresses the semantic associations between the source sentences of the multiple modalities; the multi-modal fusion encoder then performs sufficient semantic fusion on the feature vectors in the semantic association graph to obtain the encoded feature vectors, and a more accurate target sentence is obtained after decoding them, so that the content, emotion, and language environment expressed by the target sentence and the multi-modal source sentences are closer to each other.
- the multi-modal fusion encoder includes e serial encoding modules, and each encoding module includes an intra-modal fusion layer and an inter-modal fusion layer.
- step 304 can include steps 3041 to 3044, as shown in Figure 6, the steps are as follows:
- Step 3041 The computer device obtains the first target word through the second word vector layer.
- the first target word is the translated word in the target sentence.
- the computer device translates the words in the target sentence one by one; after the r-th word in the target sentence has been translated, the r-th word is used as the first target word for translating the (r+1)-th word; that is,
- the computer device inputs the r-th word into the second word vector layer, where r is a non-negative integer.
- Step 3042 The computer device performs feature extraction on the first target word through the second word vector layer to obtain a second word vector.
- the computer device performs word embedding on the first target word through the second vector layer to obtain the second word vector.
- Word embedding is a technology that represents words as real-number vectors in a vector space; in this embodiment, word embedding refers to the word vector to which the word is mapped; for example, "I" is mapped to obtain the word vector (0.1, 0.5, 5), (0.1,0.5,5) is the word vector after embedding the word "I".
- step 3043 the computer device performs feature extraction by combining the second word vector and the encoded feature vector through d serial decoding modules to obtain the decoded feature vector.
- the computer equipment calls d serial decoding modules to process the encoded feature vector and the second word vector based on the attention mechanism, and extract the decoded feature vector.
- each of the d serial decoding modules includes a first self-attention layer, a second self-attention layer, and a second vector conversion layer; for the extraction of decoded feature vectors,
- the computer device inputs the second word vector into the first self-attention layer in the first decoding module, which performs feature extraction on the second word vector to obtain the second hidden layer vector; the second hidden layer vector and the encoded feature vectors are input into the second self-attention layer in the first decoding module, which combines them for feature extraction to obtain the second intermediate vector; the second intermediate vector is input into the k-th decoding module for the k-th decoding process, until the last decoding module outputs the decoded feature vector, where k is a positive integer greater than 1 and less than or equal to d.
- the first self-attention layer is used to process the second word vector based on the self-attention mechanism to extract the second hidden layer vector; the second self-attention layer is used to process the second hidden layer vector and the encoded feature vectors based on the attention mechanism, using the language class of the target sentence, to obtain the second intermediate vector.
- the first self-attention layer includes a first self-attention function, and the second self-attention layer includes a second self-attention function.
- the first self-attention function and the second self-attention function have different parameters.
- each decoding module further includes a second vector conversion layer; after the second intermediate vector is calculated, the computer device inputs it into the second vector conversion layer for non-linear conversion to obtain a non-linearly converted second intermediate vector.
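The d serial decoding modules, each with a first self-attention layer, a second (encoder-decoder) attention layer, and a second vector conversion layer, can be roughly sketched as follows. The tanh conversion, shapes, and function names are illustrative assumptions, not the patent's exact layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def decode_stack(target_vecs, enc_features, d=3):
    s = target_vecs                        # second word vectors decoded so far
    for _ in range(d):
        hidden = attend(s, s, s)           # first self-attention layer
        inter = attend(hidden, enc_features, enc_features)  # second layer
        s = np.tanh(inter)                 # second vector conversion layer
    return s                               # decoded feature vector
```

Each module re-attends to the encoded feature vectors, which matches the repeated attention over the encoder output described in the text.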
- Step 3044 The computer device inputs the decoded feature vector to the classifier, calculates the probability distribution corresponding to the decoded feature vector through the classifier, and determines the second target word after the first target word according to the probability distribution.
- the classifier includes a normalization (softmax) function; the computer device calculates the probability distribution corresponding to the decoded feature vector through the softmax function, and determines the second target word after the first target word according to that probability distribution.
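Step 3044 amounts to a linear projection of the decoded feature vector, a softmax over the vocabulary, and selecting the most probable word. A toy sketch follows; the vocabulary, W, b, and the decoded vector are invented for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

vocab = ["<eos>", "zwei", "jungen", "spielen"]
rng = np.random.default_rng(1)
W = rng.normal(size=(len(vocab), 6))      # classifier parameters (assumed shapes)
b = np.zeros(len(vocab))
decoded = rng.normal(size=6)              # decoded feature vector from step 3043
probs = softmax(W @ decoded + b)          # probability distribution over words
next_word = vocab[int(np.argmax(probs))]  # chosen as the second target word
```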
- in summary, the translation method based on multi-modal machine learning uses the multi-modal graph representation layer to semantically associate the n modal source sentences to construct a semantic association graph; in the semantic association graph, the first connecting edges connect semantic nodes of the same modality and the second connecting edges connect semantic nodes of different modalities.
- the semantic association graph fully expresses the semantic associations between the source sentences of the multiple modalities; the multi-modal fusion encoder then performs sufficient semantic fusion on the feature vectors in the semantic association graph to obtain the encoded feature vectors, and a more accurate target sentence is obtained after decoding them, so that the content, emotion, and language environment expressed by the target sentence and the multi-modal source sentences are closer to each other.
- in addition, the method uses the language class of the target sentence, through the d decoding modules, to repeatedly attend to the encoded feature vectors and the second hidden layer vectors, thereby decoding a more accurate target sentence.
- the multi-modal machine translation model provided by this application was tested and compared with previous multi-modal neural machine translation (NMT) models, and it can be clearly seen that the multi-modal machine translation model provided by this application achieves better performance.
- the multi-modal machine translation model provided in this application is constructed based on an attention-based coding and decoding framework, and takes maximizing the log-likelihood of training data as the objective function.
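Maximizing the log-likelihood of the training data means summing log P(y_t | y_<t, X, I) over the target positions of each training pair. A toy numeric sketch, with per-step probabilities that are made-up values:

```python
import numpy as np

# Model probabilities assigned to three gold target words (assumed values).
step_probs = np.array([0.9, 0.7, 0.8])
log_likelihood = np.log(step_probs).sum()  # equals log(0.9 * 0.7 * 0.8)
# Training maximizes this quantity (equivalently, minimizes its negative,
# the cross-entropy loss) summed over the whole training set.
```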
- the multi-modal fusion encoder provided in this application can be regarded as a multi-modal enhanced graph neural network (Graph Neural Network, GNN).
- first, the input image and text are represented as a multi-modal graph (that is, the semantic association graph); then, based on this multi-modal graph, multiple multi-modal fusion layers are stacked to learn node (that is, semantic node) representations, which provide the decoder with attention-based context vectors.
- each node represents a text word or a visual object.
- the node corresponding to the text is called a semantic node
- the node corresponding to the visual object is called a visual node.
- the following strategies are adopted to construct the semantic associations between nodes:
- (1) the multi-modal graph in Figure 4 includes a total of 8 text nodes, each corresponding to a word of the input sentence (that is, the sentence to be translated); (2) the Stanford parser is used to identify all noun phrases in the input sentence, and the visual grounding toolkit is then applied to identify, in the input image (that is, the image to be translated), the bounding box (visual object) corresponding to each noun phrase. After that, all detected visual objects are treated as independent visual nodes.
- the text nodes Vx1 and Vx2 correspond to the visual nodes Vo1 and Vo2
- the text nodes Vx6, Vx7, and Vx8 correspond to the visual node Vo3.
- edges are used to connect semantic nodes.
- the two kinds of edges in edge set E include: (1) any two semantic nodes in the same modality are connected by an intra-modal edge (the first connecting edge); (2) any text node and the corresponding visual node are connected by an inter-modal edge (the second connecting edge).
- for example, Vo1 and Vo2 are connected by an intra-modal edge (solid line), and Vo1 and Vx1 are connected by an inter-modal edge (dotted line).
- a word embedding layer needs to be introduced to initialize the node state.
- Hxu is defined as the sum of word embedding and position embedding.
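Assuming the standard sinusoidal position embedding (the text only states that Hxu is the sum of word embedding and position embedding, so the exact form is an assumption), the initialization can be sketched as:

```python
import numpy as np

def positional_encoding(pos, d_model):
    """Sinusoidal position embedding: sin on even dims, cos on odd dims."""
    pe = np.zeros(d_model)
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = np.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = np.cos(angle)
    return pe

word_emb = 0.1 * np.ones(8)                  # toy word embedding
h_xu = word_emb + positional_encoding(3, 8)  # initial text node state Hxu
```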
- ROI pooling refers to Region-Of-Interest pooling; ReLU refers to the linear rectification (rectified linear unit) function; RCNN is a rich feature hierarchy for accurate object detection and semantic segmentation, used here for precise object positioning.
- the encoder is shown in the left part.
- an e-layer graph-based multi-modal fusion layer is stacked to encode the above-mentioned multi-modal graph.
- intra-modal and inter-modal fusion are performed in sequence to update all node states.
- the final node state simultaneously encodes the context information and cross-modal semantic information in the same modality.
- since visual nodes and text nodes are two kinds of semantic units carrying different modal information, functions with similar operations but different parameters are used to model their state update processes.
- updating the text node states and the visual node states mainly involves the following steps:
- Step 1 Intra-modal fusion.
- self-attention is used for information fusion between adjacent nodes in the same modality to generate a contextual representation of each node.
- the contextual representations of all text nodes are calculated as (a reconstruction consistent with the surrounding description, the original formula image being absent):
- C_x^(j) = MultiHead(H_x^(j-1), H_x^(j-1), H_x^(j-1))
- where MultiHead(Q, K, V) is a multi-head attention modeling function (also called a multi-head self-attention function), which takes the query matrix Q, the key matrix K, and the value matrix V as input.
- the contextual representations of all visual nodes are calculated similarly; a simplified multi-head self-attention is applied to the initial states of the visual objects, in which the learned linear projections of the values and the final outputs are kept.
- Step 2 Fusion between modals.
- a cross-modal gating mechanism with element-wise operations is used to learn the semantic information of each node's cross-modal neighborhood; specifically, the state representation of the text node Vxu is generated by summing its contextual representation with the sigmoid-gated contextual representations of its neighboring visual nodes.
- as for the decoder, it is similar to a conventional Transformer decoder. Since visual information has been integrated into all text nodes through the multiple graph-based multi-modal fusion layers, the decoder is allowed to focus only on the text node states to dynamically utilize the multi-modal context, that is, only the text node states are input to the decoder.
- each layer is composed of three sub-layers.
- the first two sub-layers are the masked self-attention E^(j) and the encoder-decoder attention T^(j), which integrate the target-side and source-side contexts respectively:
- E^(j) = MultiHead(S^(j-1), S^(j-1), S^(j-1))
- T^(j) = MultiHead(E^(j), H_x^(e), H_x^(e))
- where S^(j-1) represents the target-side hidden state in the (j-1)-th layer; in particular, S^(0) is the embedding vector of the input target words, and the top-layer hidden state in the decoder is used for prediction.
- then, a position-wise fully connected feed-forward neural network is used to generate S^(j):
- S^(j) = FFN(T^(j))
- finally, the softmax layer is used to define the probability distribution of the generated target sentence, taking the top-layer hidden state as input:
- P(y_t | y_<t, X, I) = Softmax(W·S_t^(d) + b)
- X is the input sentence to be translated
- I is the input image to be translated
- Y is the target sentence (ie, the translation sentence)
- W and b are the parameters of the softmax layer.
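The position-wise feed-forward sub-layer mentioned above is commonly FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied independently at each position; a minimal sketch under that assumption (all dimensions and names are illustrative):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """ReLU feed-forward network applied to every position independently:
    x has shape (positions, d); W1, W2 project d -> d_ff -> d."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```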
- in the experiments, a Stanford parser is used to recognize noun phrases from each source sentence, and a visual grounding toolkit is then used to detect the visual objects related to the recognized noun phrases. For each phrase, only the corresponding visual object with the highest predicted probability is kept, so as to reduce the negative influence of redundant visual objects. In each sentence, the average numbers of objects and words are around 3.5 and 15.0, respectively. Finally, a pre-trained ResNet-100 Faster RCNN is used to compute 2048-dimensional features for these objects.
- ObjectAsToken (TF) is a variant of the Transformer in which all visual objects are treated as additional source tokens and placed in front of the input sentence.
- Enc-att (TF). An encoder-based image attention mechanism is used in the transformer, which adds each source annotation and attention-based visual feature vector.
- RNN refers to a recurrent neural network (Recurrent Neural Network).
- the number e of multi-modal fusion layers is an important hyperparameter that directly determines the degree of fine-grained semantic fusion in the encoder; therefore, its impact is first examined on the English-German validation set.
- Table 1 shows the main results of the English-German task. The performance of the model provided by the embodiment of this application is better than most previous models, including Fusion-conv (RNN) and Trg-mul (RNN) on METEOR; those two sets of results come from systems on the WMT2017 test set that were selected based on METEOR. Compared with the baseline models, the following conclusions can be drawn.
- the model provided by the embodiments of this application is superior to ObjectAsToken (TF), which connects regional visual features with text to form an attentionable sequence, and uses the self-attention mechanism for multi-modal fusion.
- the basic reasons include two aspects: one is to model the semantic correspondence between the semantic units of different modalities, and the other is to distinguish the model parameters of different modalities.
- in addition, the test set is divided into different groups according to the length of the source sentence and the number of noun phrases, and the performance of the different models on each group is then compared.
- Figure 9 and Figure 10 show the BLEU scores of the above group.
- the model provided in the embodiments of the present application still achieves the best performance in all groups. Therefore, the validity and versatility of the model provided in the embodiments of the present application are confirmed again.
- the improvement of the model provided by the embodiments of the present application over the baseline models is more significant on long sentences. It is inferred that long sentences often contain more ambiguous words; therefore, compared with short sentences, long sentences may need to make better use of visual information as supplementary information, which can be achieved through the multi-modal semantic interaction of the model provided in the embodiments of the present application.
- Table 4 also shows the training and decoding speed of the model provided by the embodiment of the application and the basic model.
- the model provided by the embodiment of the present application can process approximately 1.1K tokens per second, which is equivalent to other multi-modal models.
- the model provided by the embodiment of the present application translates about 16.7 sentences per second, which is slightly slower than the transformer.
- the model provided by the embodiment of the present application only introduces a small number of additional parameters and achieves better performance.
- when the decoder attends only to visual nodes rather than to text nodes, the performance of the model drops sharply, as shown in row 7 of Table 2. This is because the number of visual nodes is far smaller than that of text nodes, and visual nodes alone cannot generate enough translation context.
- FIG. 11 shows a translation device based on multi-modal machine learning provided by an exemplary embodiment of the present application.
- the device is implemented as part or all of a computer device through software, hardware, or a combination of the two.
- the device includes:
- the semantic association module 501 is configured to obtain a semantic association graph based on n source sentences belonging to different modalities. The semantic association graph includes semantic nodes of n different modalities, first connecting edges used to connect semantic nodes of the same modality, and second connecting edges used to connect semantic nodes of different modalities. A semantic node is used to represent one semantic unit of a source sentence in one modality, and n is a positive integer greater than 1.
- the feature extraction module 502 is configured to extract a plurality of first word vectors from the semantic association graph; optionally, the first word vectors are extracted from the semantic association graph through the first word vector layer;
- the vector encoding module 503 is configured to encode the plurality of first word vectors to obtain n encoded feature vectors; optionally, the first word vectors are encoded by a multi-modal fusion encoder to obtain the encoded feature vectors;
- the vector decoding module 504 is configured to decode the n encoded feature vectors to obtain the translated target sentence; optionally, a decoder is called to decode the encoded feature vectors to obtain the translated target sentence.
- the semantic association module 501 is configured to obtain n sets of semantic nodes, where one set of semantic nodes corresponds to the source sentence of one modality; add the first connecting edge between any two semantic nodes of the same modality, and add the second connecting edge between any two semantic nodes of different modalities, to obtain the semantic association graph.
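The edge-adding procedure described above can be sketched in a few lines of Python. This is an illustrative toy, not the patent's implementation: the `build_semantic_graph` helper and the example node names are assumptions. It enumerates first connecting edges within each modality and second connecting edges across modalities.

```python
from itertools import combinations

def build_semantic_graph(node_groups):
    """Build a semantic association graph from n groups of semantic nodes.

    node_groups: list of n lists; group i holds the semantic nodes
    extracted from the source sentence of modality i.
    Returns (nodes, intra_edges, inter_edges): intra_edges are the
    "first connecting edges" (same modality), inter_edges are the
    "second connecting edges" (different modalities).
    """
    nodes = [(i, node) for i, group in enumerate(node_groups) for node in group]
    # First connecting edges: every pair of nodes within one modality.
    intra_edges = [((i, a), (i, b))
                   for i, group in enumerate(node_groups)
                   for a, b in combinations(group, 2)]
    # Second connecting edges: every pair of nodes across two modalities.
    inter_edges = [((i, a), (j, b))
                   for (i, gi), (j, gj) in combinations(list(enumerate(node_groups)), 2)
                   for a in gi for b in gj]
    return nodes, intra_edges, inter_edges

# Example: a text modality with three word nodes and a visual modality
# with two object nodes (toy data, not from the patent).
nodes, intra, inter = build_semantic_graph([["a", "dog", "runs"], ["obj1", "obj2"]])
```

With 3 text nodes and 2 visual nodes this yields C(3,2)+C(2,2)=4 first connecting edges and 3×2=6 second connecting edges.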
- the semantic association module 501 is configured to extract semantic nodes from the source sentence of each modality through the multi-modal graph representation layer, to obtain n sets of semantic nodes corresponding to the source sentences of the n modalities;
- the multi-modal graph representation layer connects semantic nodes of the same modality within the n sets of semantic nodes using the first connecting edges, and connects semantic nodes of different modalities using the second connecting edges, to obtain the semantic association graph.
- the source sentences of the n modalities include a first source sentence in text form and a second source sentence in non-text form, and the n sets of semantic nodes include first semantic nodes and second semantic nodes;
- the semantic association module 501 is configured to obtain the first semantic nodes, which are obtained by the multi-modal graph representation layer processing the first source sentence; obtain candidate semantic nodes, which are obtained by the multi-modal graph representation layer processing the second source sentence; obtain a first probability distribution of the candidate semantic nodes, which is calculated by the multi-modal graph representation layer according to the semantic association between the first semantic nodes and the candidate semantic nodes; and determine the second semantic nodes from the candidate semantic nodes, the second semantic nodes being determined by the multi-modal graph representation layer according to the first probability distribution.
- the semantic association module 501 is configured to extract the first semantic nodes from the first source sentence and the candidate semantic nodes from the second source sentence through the multi-modal graph representation layer; call the multi-modal graph representation layer to calculate the first probability distribution of the candidate semantic nodes according to the semantic association between the first semantic nodes and the candidate semantic nodes; and call the multi-modal graph representation layer to determine the second semantic nodes from the candidate semantic nodes according to the first probability distribution.
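A minimal numpy sketch of this selection step follows. The dot-product scoring, softmax normalization, and top-k cutoff are assumptions made for illustration; the patent only specifies that a first probability distribution is computed from the semantic association between the first semantic nodes and the candidate nodes, and that the second semantic nodes are determined from it.

```python
import numpy as np

def select_second_nodes(text_emb, cand_emb, keep=2):
    """text_emb: (t, d) embeddings of the first (text) semantic nodes;
    cand_emb: (c, d) embeddings of the candidate (non-text) nodes.
    Returns (probs, kept_idx): the first probability distribution over
    candidates, and the indices kept as second semantic nodes."""
    # Association score of each candidate: dot product with the pooled text nodes.
    scores = cand_emb @ text_emb.sum(axis=0)          # (c,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                              # first probability distribution
    kept_idx = np.argsort(probs)[::-1][:keep]         # second semantic nodes
    return probs, kept_idx

rng = np.random.default_rng(0)
probs, kept = select_second_nodes(rng.normal(size=(3, 4)), rng.normal(size=(5, 4)))
```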
- the semantic association module 501 is configured to add the i-th type of first connecting edge between any two semantic nodes of the same modality in the i-th set of semantic nodes, where the i-th type of first connecting edge corresponds to the i-th modality, and i is a positive integer less than or equal to n.
- the semantic association module 501 is configured to determine, through the multi-modal graph representation layer, the i-th type of first connecting edge corresponding to the i-th modality, and use the i-th type of first connecting edge to connect semantic nodes of the same modality in the i-th set of semantic nodes, where i is a positive integer less than or equal to n.
- the vector encoding module 503 is configured to perform e rounds of intra-modal fusion and inter-modal fusion on the plurality of first word vectors to obtain the n encoded feature vectors, where intra-modal fusion refers to semantic fusion between the first word vectors within the same modality, inter-modal fusion refers to semantic fusion between the first word vectors of different modalities, and e is a positive integer.
- the multi-modal fusion encoder includes e serially connected encoding modules, where e is a positive integer;
- the vector encoding module 503 is configured to perform e rounds of intra-modal fusion and inter-modal fusion on the first word vectors through the e serially connected encoding modules to obtain the encoded feature vectors, where intra-modal fusion refers to semantic fusion between the first word vectors within the same modality, and inter-modal fusion refers to semantic fusion between the first word vectors of different modalities;
- each encoding module includes n intra-modal fusion layers and n inter-modal fusion layers, each in one-to-one correspondence with the n modalities;
- the vector encoding module 503 is configured to input the first word vectors into the n intra-modal fusion layers in the first encoding module, where the n intra-modal fusion layers respectively perform semantic fusion of the first word vectors within the same modality to obtain n first hidden layer vectors, one first hidden layer vector corresponding to one modality; that is, n first hidden layer vectors corresponding one-to-one to the n modalities are obtained;
- each inter-modal fusion layer performs semantic fusion between the n first hidden layer vectors of different modalities to obtain n first intermediate vectors, one intermediate vector corresponding to one modality; that is, n first intermediate vectors corresponding one-to-one to the n modalities are obtained;
- the n first intermediate vectors are input into the j-th encoding module for the j-th round of encoding, until the last encoding module outputs n encoded feature vectors, one encoded feature vector corresponding to one modality; that is, the last encoding module outputs n encoded feature vectors corresponding one-to-one to the n modalities, where j is a positive integer greater than 1 and less than or equal to e.
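The encoder flow above (n intra-modal fusion layers, then n inter-modal fusion layers, stacked e times) can be sketched as follows. The unprojected dot-product self-attention and the additive mean-based inter-modal fusion are simplifying assumptions for illustration, not the patent's exact fusion functions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(x):
    """Intra-modal fusion: plain dot-product self-attention, no learned projections."""
    d = x.shape[-1]
    return softmax(x @ x.T / np.sqrt(d)) @ x

def encode_module(word_vecs):
    """One encoding module over n modalities.
    word_vecs: list of n arrays, each of shape (len_i, d).
    Returns the n first intermediate vectors (same shapes)."""
    # n intra-modal fusion layers -> n first hidden layer vectors
    hidden = [self_attend(x) for x in word_vecs]
    # n inter-modal fusion layers: each modality absorbs a summary of the others
    out = []
    for i, h in enumerate(hidden):
        others = np.mean([o.mean(axis=0) for j, o in enumerate(hidden) if j != i], axis=0)
        out.append(h + others)  # simple additive fusion (illustrative choice)
    return out

def encode(word_vecs, e=2):
    """e serially connected encoding modules -> n encoded feature vectors."""
    for _ in range(e):
        word_vecs = encode_module(word_vecs)
    return word_vecs

rng = np.random.default_rng(1)
encoded = encode([rng.normal(size=(4, 8)), rng.normal(size=(2, 8))], e=2)
```

Each modality keeps its own sequence length throughout, matching the one-encoded-feature-vector-per-modality description above.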
- each encoding module further includes n first vector conversion layers, one vector conversion layer corresponding to one modality; that is, n first vector conversion layers corresponding one-to-one to the n modalities;
- the vector encoding module 503 is further configured to input the n first intermediate vectors into the n first vector conversion layers corresponding to their respective modalities for non-linear conversion, to obtain n first intermediate vectors after the non-linear conversion.
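As a sketch of the non-linear conversion performed by a first vector conversion layer, a position-wise feed-forward transform is a plausible instantiation; the ReLU non-linearity and the weight shapes here are assumptions, since the patent leaves the exact conversion open:

```python
import numpy as np

def vector_conversion_layer(x, w1, w2):
    """Non-linear conversion of intermediate vectors.
    x: (seq, d); w1: (d, d_ff); w2: (d_ff, d)."""
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU non-linearity (assumed)

rng = np.random.default_rng(2)
x = rng.normal(size=(3, 8))
y = vector_conversion_layer(x, rng.normal(size=(8, 16)), rng.normal(size=(16, 8)))
```

The output keeps the shape of the input, so the converted intermediate vectors can feed directly into the next encoding module.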
- the hierarchical structure in each of the e serially connected encoding modules is the same.
- different or the same self-attention functions are set in different intra-modal fusion layers, and different or the same feature fusion functions are set in different inter-modal fusion layers.
- the vector decoding module 504 is configured to perform feature extraction on a first target word to obtain a second word vector, where the first target word is a translated word in the target sentence; perform feature extraction on the second word vector in combination with the encoded feature vectors to obtain a decoded feature vector; and determine the probability distribution corresponding to the decoded feature vector and, according to the probability distribution, determine the second target word following the first target word.
- the decoder includes d serially connected decoding modules, where d is a positive integer;
- the vector decoding module 504 is configured to obtain the first target word through the second word vector layer, where the first target word is a translated word in the target sentence; perform feature extraction on the first target word through the second word vector layer to obtain the second word vector;
- perform feature extraction on the second word vector in combination with the encoded feature vectors to obtain the decoded feature vector; and input the decoded feature vector into the classifier, calculate the probability distribution corresponding to the decoded feature vector by the classifier, and determine, according to the probability distribution, the second target word following the first target word.
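The classifier step, mapping a decoded feature vector to a probability distribution over the vocabulary and reading off the second target word, can be sketched as follows (the toy vocabulary, the linear classifier, and the greedy argmax choice are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word(decoded_vec, classifier_w, vocab):
    """Classifier step: map a decoded feature vector to a probability
    distribution over the vocabulary and pick the second target word."""
    probs = softmax(classifier_w @ decoded_vec)  # (|V|,)
    return vocab[int(np.argmax(probs))], probs

# Toy vocabulary and random weights (illustrative only).
vocab = ["<eos>", "a", "dog", "runs"]
rng = np.random.default_rng(3)
word, probs = next_word(rng.normal(size=8), rng.normal(size=(4, 8)), vocab)
```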
- each of the d serially connected decoding modules includes a first self-attention layer and a second self-attention layer;
- the vector decoding module 504 is configured to input the second word vector into the first self-attention layer in the first decoding module, and the first self-attention layer performs feature extraction on the second word vector to obtain the second hidden layer vector;
- the second hidden layer vector and the encoded feature vectors are input into the second self-attention layer in the first decoding module, and the second self-attention layer performs feature extraction by combining the second hidden layer vector and the encoded feature vectors to obtain the second intermediate vector;
- the second intermediate vector is input to the k-th decoding module to perform the k-th decoding process until the last decoding module outputs the decoding feature vector, where k is a positive integer greater than 1 and less than or equal to d.
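The decoding flow above (a first self-attention layer over the translated words, then a second layer attending to the encoded feature vectors, stacked d times) can be sketched with unprojected scaled dot-product attention; this is a simplified illustration, not the patent's exact layers:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention without learned projections."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def decode_module(second_word_vecs, encoded):
    """One decoding module: the first self-attention layer runs over the
    already translated words; the second layer combines the result with
    the encoded feature vectors (cross-attention)."""
    hidden = attend(second_word_vecs, second_word_vecs, second_word_vecs)
    return attend(hidden, encoded, encoded)  # second intermediate vectors

def decode(second_word_vecs, encoded, d=2):
    """d serially connected decoding modules -> decoded feature vectors."""
    x = second_word_vecs
    for _ in range(d):
        x = decode_module(x, encoded)
    return x

rng = np.random.default_rng(4)
decoded = decode(rng.normal(size=(3, 8)), rng.normal(size=(6, 8)), d=2)
```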
- each decoding module further includes: a second vector conversion layer
- the vector decoding module 504 is further configured to input the second intermediate vector into the second vector conversion layer for non-linear conversion to obtain the second intermediate vector after the non-linear conversion.
- in summary, the translation device based on multi-modal machine learning uses the multi-modal graph representation layer to semantically associate the source sentences of n modalities and construct a semantic association graph, in which the first connecting edges connect semantic nodes of the same modality and the second connecting edges connect semantic nodes of different modalities.
- the semantic association graph fully represents the semantic associations among the source sentences of multiple modalities; the multi-modal fusion encoder then performs sufficient semantic fusion on the feature vectors in the semantic association graph to obtain the encoded feature vectors, and decoding the encoded feature vectors yields a more accurate target sentence, one that is closer to the content, emotions, and language environment expressed by the source sentences of the n modalities.
- FIG. 12 shows a schematic structural diagram of a server provided by an embodiment of the present application.
- the server is used to implement the steps of the translation method based on multi-modal machine learning provided in the foregoing embodiment. Specifically:
- the server 600 includes a CPU (Central Processing Unit) 601, a system memory 604 including a RAM (Random Access Memory) 602 and a ROM (Read-Only Memory) 603, and a system bus 605 connecting the system memory 604 and the central processing unit 601.
- the server 600 also includes a basic I/O (Input/Output) system 606 that helps transfer information between the various devices in the computer, and a mass storage device 607 for storing an operating system 613, application programs 614, and other program modules 615.
- the basic input/output system 606 includes a display 608 for displaying information and an input device 609 such as a mouse and a keyboard for the user to input information.
- the display 608 and the input device 609 are both connected to the central processing unit 601 through the input and output controller 610 connected to the system bus 605.
- the basic input/output system 606 may also include an input and output controller 610 for receiving and processing input from multiple other devices such as a keyboard, a mouse, or an electronic stylus. Similarly, the input and output controller 610 also provides output to a display screen, a printer, or other types of output devices.
- the mass storage device 607 is connected to the central processing unit 601 through a mass storage controller (not shown) connected to the system bus 605.
- the mass storage device 607 and its associated computer readable medium provide non-volatile storage for the server 600. That is, the mass storage device 607 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
- the computer-readable media may include computer storage media and communication media.
- Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer readable instructions, data structures, program modules or other data.
- Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory (Flash Memory) or other solid-state storage technologies, CD-ROM, DVD (Digital Versatile Disc, Digital Versatile Disc) or other optical storage, tape cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices.
- the aforementioned system memory 604 and mass storage device 607 may be collectively referred to as a memory.
- the server 600 may also operate by being connected, through a network such as the Internet, to a remote computer on the network. That is, the server 600 may be connected to a network 612 through a network interface unit 611 connected to the system bus 605; in other words, the network interface unit 611 may also be used to connect to other types of networks or remote computer systems (not shown).
- a storage medium is also provided, including a computer-readable storage medium such as the memory 602 storing instructions, which can be executed by the processor 601 of the server 600 to perform the above-described translation method based on multi-modal machine learning.
- the computer-readable storage medium may be a non-transitory storage medium, for example, the non-transitory storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
- a computer program product including a computer program, which can be executed by a processor of an electronic device to implement the above-mentioned translation method based on multi-modal machine learning.
- the program can be stored in a computer-readable storage medium.
- the storage medium mentioned can be a read-only memory, a magnetic disk or an optical disk, etc.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Machine Translation (AREA)
Abstract
Description
Claims (15)
- A translation method based on multi-modal machine learning, performed by a computer device, wherein the method comprises: obtaining a semantic association graph based on n source sentences belonging to different modalities, the semantic association graph comprising semantic nodes of n different modalities, first connecting edges for connecting semantic nodes of the same modality, and second connecting edges for connecting semantic nodes of different modalities, a semantic node being used to represent one semantic unit of a source sentence in one modality, n being a positive integer greater than 1; extracting a plurality of first word vectors from the semantic association graph; encoding the plurality of first word vectors to obtain n encoded feature vectors; and decoding the n encoded feature vectors to obtain a translated target sentence.
- The method according to claim 1, wherein the obtaining a semantic association graph based on n source sentences belonging to different modalities comprises: obtaining n sets of semantic nodes, one set of semantic nodes corresponding to the source sentence of one modality; and adding the first connecting edge between any two of the semantic nodes of the same modality, and the second connecting edge between any two of the semantic nodes of different modalities, to obtain the semantic association graph.
- The method according to claim 2, wherein the source sentences of the n modalities comprise a first source sentence in text form and a second source sentence in non-text form, and the n sets of semantic nodes comprise first semantic nodes and second semantic nodes; and the obtaining n sets of semantic nodes comprises: obtaining the first semantic nodes, the first semantic nodes being obtained by a multi-modal graph representation layer processing the first source sentence; obtaining candidate semantic nodes, the candidate semantic nodes being obtained by the multi-modal graph representation layer processing the second source sentence; obtaining a first probability distribution of the candidate semantic nodes, the first probability distribution being calculated by the multi-modal graph representation layer according to semantic associations between the first semantic nodes and the candidate semantic nodes; and determining the second semantic nodes from the candidate semantic nodes, the second semantic nodes being determined by the multi-modal graph representation layer according to the first probability distribution.
- The method according to claim 2, wherein the adding the first connecting edge between any two of the semantic nodes of the same modality comprises: adding an i-th type of first connecting edge between any two semantic nodes within the same modality in the i-th set of semantic nodes, the i-th type of first connecting edge corresponding to the i-th modality, i being a positive integer less than or equal to n.
- The method according to any one of claims 1 to 4, wherein the encoding the plurality of first word vectors to obtain n encoded feature vectors comprises: performing e rounds of intra-modal fusion and inter-modal fusion on the plurality of first word vectors to obtain the n encoded feature vectors, wherein the intra-modal fusion refers to semantic fusion between the first word vectors within the same modality, the inter-modal fusion refers to semantic fusion between the first word vectors of different modalities, and e is a positive integer.
- The method according to claim 5, wherein a multi-modal fusion encoder comprises e serially connected encoding modules; each encoding module comprises n intra-modal fusion layers and n inter-modal fusion layers in one-to-one correspondence with the n modalities; and the performing e rounds of intra-modal fusion and inter-modal fusion on the plurality of first word vectors to obtain the n encoded feature vectors comprises: inputting the plurality of first word vectors into the n intra-modal fusion layers in the first encoding module, the n intra-modal fusion layers respectively performing semantic fusion on the plurality of first word vectors within the same modality to obtain n first hidden layer vectors, one first hidden layer vector corresponding to one modality; inputting the n first hidden layer vectors into each inter-modal fusion layer in the first encoding module, each inter-modal fusion layer performing semantic fusion between different modalities on the n first hidden layer vectors to obtain n first intermediate vectors, one intermediate vector corresponding to one modality; and inputting the n first intermediate vectors into the j-th encoding module for the j-th round of encoding, until the last encoding module outputs the n encoded feature vectors, one encoded feature vector corresponding to one modality, j being a positive integer greater than 1 and less than or equal to e.
- The method according to claim 6, wherein each encoding module further comprises n first vector conversion layers, one vector conversion layer corresponding to one modality; and the method further comprises: inputting the n first intermediate vectors into the n first vector conversion layers corresponding to their respective modalities for non-linear conversion, to obtain n first intermediate vectors after the non-linear conversion.
- The method according to claim 6, wherein each of the e serially connected encoding modules has the same hierarchical structure.
- The method according to claim 6, wherein different intra-modal fusion layers are provided with different or the same self-attention functions, and different inter-modal fusion layers are provided with different or the same feature fusion functions.
- The method according to any one of claims 1 to 4, wherein the decoding the n encoded feature vectors to obtain a translated target sentence comprises: performing feature extraction on a first target word to obtain a second word vector, the first target word being a translated word in the target sentence; performing feature extraction on the second word vector in combination with the encoded feature vectors to obtain a decoded feature vector; and determining a probability distribution corresponding to the decoded feature vector, and determining, according to the probability distribution, a second target word following the first target word.
- The method according to claim 10, wherein a decoder comprises d serially connected decoding modules, d being a positive integer, and each of the d serially connected decoding modules comprises a first self-attention layer and a second self-attention layer; and the performing feature extraction on the second word vector in combination with the encoded feature vectors to obtain a decoded feature vector comprises: inputting the second word vector into the first self-attention layer in the first decoding module, the first self-attention layer performing feature extraction on the second word vector to obtain a second hidden layer vector; inputting the second hidden layer vector and the encoded feature vectors into the second self-attention layer in the first decoding module, the second self-attention layer performing feature extraction by combining the second hidden layer vector and the encoded feature vectors to obtain a second intermediate vector; and inputting the second intermediate vector into the k-th decoding module for the k-th round of decoding, until the last decoding module outputs the decoded feature vector, k being a positive integer greater than 1 and less than or equal to d.
- The method according to claim 11, wherein each decoding module further comprises a second vector conversion layer; and the method further comprises: inputting the second intermediate vector into the second vector conversion layer for non-linear conversion to obtain a non-linearly converted second intermediate vector.
- A translation apparatus based on multi-modal machine learning, wherein the apparatus comprises: a semantic association module, configured to construct a semantic association graph based on n source sentences belonging to different modalities, the semantic association graph comprising semantic nodes of n different modalities, first connecting edges for connecting semantic nodes of the same modality, and second connecting edges for connecting semantic nodes of different modalities, a semantic node being used to represent one semantic unit of a source sentence in one modality, n being a positive integer greater than 1; a feature extraction module, configured to extract a plurality of first word vectors from the semantic association graph; a vector encoding module, configured to encode the plurality of first word vectors to obtain n encoded feature vectors; and a vector decoding module, configured to decode the encoded feature vectors to obtain a translated target sentence.
- A computer device, wherein the computer device comprises: a memory; and a processor connected to the memory, wherein the processor is configured to load and execute executable instructions to implement the translation method based on multi-modal machine learning according to any one of claims 1 to 12.
- A computer-readable storage medium, wherein at least one program is stored in the computer-readable storage medium, and the at least one program is loaded and executed by a processor to implement the translation method based on multi-modal machine learning according to any one of claims 1 to 12.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2022540553A JP2023509031A (ja) | 2020-05-20 | 2021-04-29 | マルチモーダル機械学習に基づく翻訳方法、装置、機器及びコンピュータプログラム |
US17/719,170 US12056458B2 (en) | 2020-05-20 | 2022-04-12 | Translation method and apparatus based on multimodal machine learning, device, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010432597.2 | 2020-05-20 | ||
CN202010432597.2A CN111597830A (zh) | 2020-05-20 | 2020-05-20 | 基于多模态机器学习的翻译方法、装置、设备及存储介质 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/719,170 Continuation US12056458B2 (en) | 2020-05-20 | 2022-04-12 | Translation method and apparatus based on multimodal machine learning, device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021233112A1 true WO2021233112A1 (zh) | 2021-11-25 |
Family
ID=72187523
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/091114 WO2021233112A1 (zh) | 2020-05-20 | 2021-04-29 | 基于多模态机器学习的翻译方法、装置、设备及存储介质 |
Country Status (4)
Country | Link |
---|---|
US (1) | US12056458B2 (zh) |
JP (1) | JP2023509031A (zh) |
CN (1) | CN111597830A (zh) |
WO (1) | WO2021233112A1 (zh) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114139637A (zh) * | 2021-12-03 | 2022-03-04 | 哈尔滨工业大学(深圳) | 多智能体信息融合方法、装置、电子设备及可读存储介质 |
CN115994177A (zh) * | 2023-03-23 | 2023-04-21 | 山东文衡科技股份有限公司 | 基于数据湖的知识产权管理方法及其系统 |
CN116934754A (zh) * | 2023-09-18 | 2023-10-24 | 四川大学华西第二医院 | 基于图神经网络的肝脏影像识别方法及装置 |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111597830A (zh) * | 2020-05-20 | 2020-08-28 | 腾讯科技(深圳)有限公司 | 基于多模态机器学习的翻译方法、装置、设备及存储介质 |
CN112015955B (zh) * | 2020-09-01 | 2021-07-30 | 清华大学 | 一种多模态数据关联方法和装置 |
CN112418450A (zh) * | 2020-10-30 | 2021-02-26 | 济南浪潮高新科技投资发展有限公司 | 一种基于多模态机器学习的设备预测性维护的方法 |
CN114580692A (zh) * | 2020-12-01 | 2022-06-03 | 北京三快在线科技有限公司 | 配送路径生成方法、装置、设备及存储介质 |
CN113569584B (zh) * | 2021-01-25 | 2024-06-14 | 腾讯科技(深圳)有限公司 | 文本翻译方法、装置、电子设备及计算机可读存储介质 |
CN112800782B (zh) * | 2021-01-29 | 2023-10-03 | 中国科学院自动化研究所 | 融合文本语义特征的语音翻译方法、系统、设备 |
CN112989977B (zh) * | 2021-03-03 | 2022-09-06 | 复旦大学 | 一种基于跨模态注意力机制的视听事件定位方法及装置 |
CN112800785B (zh) * | 2021-04-13 | 2021-07-27 | 中国科学院自动化研究所 | 多模态机器翻译方法、装置、电子设备和存储介质 |
CN113052257B (zh) * | 2021-04-13 | 2024-04-16 | 中国电子科技集团公司信息科学研究院 | 一种基于视觉转换器的深度强化学习方法及装置 |
CN113344067A (zh) * | 2021-05-31 | 2021-09-03 | 中国工商银行股份有限公司 | 一种生成客户画像的方法、装置及设备 |
EP4113285A1 (en) | 2021-06-29 | 2023-01-04 | Tata Consultancy Services Limited | Method and system for translation of codes based on semantic similarity |
CN113469094B (zh) * | 2021-07-13 | 2023-12-26 | 上海中科辰新卫星技术有限公司 | 一种基于多模态遥感数据深度融合的地表覆盖分类方法 |
CN113515960B (zh) * | 2021-07-14 | 2024-04-02 | 厦门大学 | 一种融合句法信息的翻译质量自动评估方法 |
EP4170449B1 (en) * | 2021-10-22 | 2024-01-31 | Tata Consultancy Services Limited | System and method for ontology guided indoor scene understanding for cognitive robotic tasks |
CN114118111B (zh) * | 2021-11-26 | 2024-05-24 | 昆明理工大学 | 融合文本和图片特征的多模态机器翻译方法 |
CN115130435B (zh) * | 2022-06-27 | 2023-08-11 | 北京百度网讯科技有限公司 | 文档处理方法、装置、电子设备和存储介质 |
CN115080766B (zh) * | 2022-08-16 | 2022-12-06 | 之江实验室 | 基于预训练模型的多模态知识图谱表征系统及方法 |
CN115759199B (zh) * | 2022-11-21 | 2023-09-26 | 山东大学 | 基于层次化图神经网络的多机器人环境探索方法及系统 |
CN116089619B (zh) * | 2023-04-06 | 2023-06-06 | 华南师范大学 | 情感分类方法、装置、设备以及存储介质 |
CN116151263B (zh) * | 2023-04-24 | 2023-06-30 | 华南师范大学 | 多模态命名实体识别方法、装置、设备以及存储介质 |
CN117113281B (zh) * | 2023-10-20 | 2024-01-26 | 光轮智能(北京)科技有限公司 | 多模态数据的处理方法、设备、智能体和介质 |
CN117474019B (zh) * | 2023-12-27 | 2024-05-24 | 天津大学 | 一种视觉引导的目标端未来语境翻译方法 |
CN117809150B (zh) * | 2024-02-27 | 2024-04-30 | 广东工业大学 | 基于跨模态注意力机制的多模态错误信息检测方法及系统 |
CN118035435B (zh) * | 2024-04-15 | 2024-06-11 | 南京信息工程大学 | 一种新闻摘要生成方法及相关装置 |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004355481A (ja) * | 2003-05-30 | 2004-12-16 | Konica Minolta Medical & Graphic Inc | 医用画像処理装置 |
CN102262624A (zh) * | 2011-08-08 | 2011-11-30 | 中国科学院自动化研究所 | 基于多模态辅助的实现跨语言沟通系统及方法 |
CN104813318A (zh) * | 2012-09-26 | 2015-07-29 | 谷歌公司 | 用于翻译的基于上下文对消息分组的技术 |
CN106980664A (zh) * | 2017-03-21 | 2017-07-25 | 苏州大学 | 一种双语可比较语料挖掘方法及装置 |
US20170323203A1 (en) * | 2016-05-06 | 2017-11-09 | Ebay Inc. | Using meta-information in neural machine translation |
CN108647705A (zh) * | 2018-04-23 | 2018-10-12 | 北京交通大学 | 基于图像和文本语义相似度的图像语义消歧方法和装置 |
CN110245364A (zh) * | 2019-06-24 | 2019-09-17 | 中国科学技术大学 | 零平行语料多模态神经机器翻译方法 |
CN110457718A (zh) * | 2019-08-21 | 2019-11-15 | 腾讯科技(深圳)有限公司 | 一种文本生成方法、装置、计算机设备及存储介质 |
CN110489761A (zh) * | 2018-05-15 | 2019-11-22 | 科大讯飞股份有限公司 | 一种篇章级文本翻译方法及装置 |
US20200034436A1 (en) * | 2018-07-26 | 2020-01-30 | Google Llc | Machine translation using neural network models |
CN111597830A (zh) * | 2020-05-20 | 2020-08-28 | 腾讯科技(深圳)有限公司 | 基于多模态机器学习的翻译方法、装置、设备及存储介质 |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060123358A1 (en) * | 2004-12-03 | 2006-06-08 | Lee Hang S | Method and system for generating input grammars for multi-modal dialog systems |
US11397462B2 (en) * | 2012-09-28 | 2022-07-26 | Sri International | Real-time human-machine collaboration using big data driven augmented reality technologies |
US9560415B2 (en) | 2013-01-25 | 2017-01-31 | TapShop, LLC | Method and system for interactive selection of items for purchase from a video |
US20140236570A1 (en) * | 2013-02-18 | 2014-08-21 | Microsoft Corporation | Exploiting the semantic web for unsupervised spoken language understanding |
CN105869007A (zh) | 2015-11-25 | 2016-08-17 | 乐视网信息技术(北京)股份有限公司 | 一种视频购物的实现方法及系统 |
WO2017112813A1 (en) * | 2015-12-22 | 2017-06-29 | Sri International | Multi-lingual virtual personal assistant |
CN105916050A (zh) | 2016-05-03 | 2016-08-31 | 乐视控股(北京)有限公司 | 电视购物信息处理方法和装置 |
CN106202317A (zh) | 2016-07-01 | 2016-12-07 | 传线网络科技(上海)有限公司 | 基于视频的商品推荐方法及装置 |
CN108124184A (zh) | 2016-11-28 | 2018-06-05 | 广州华多网络科技有限公司 | 一种直播互动的方法及装置 |
CN108462889A (zh) | 2017-02-17 | 2018-08-28 | 阿里巴巴集团控股有限公司 | 直播过程中的信息推荐方法及装置 |
US20190287012A1 (en) * | 2018-03-16 | 2019-09-19 | Microsoft Technology Licensing, Llc | Encoder-decoder network with intercommunicating encoder agents |
CN109034115B (zh) | 2018-08-22 | 2021-10-22 | Oppo广东移动通信有限公司 | 视频识图方法、装置、终端及存储介质 |
US20200242146A1 (en) * | 2019-01-24 | 2020-07-30 | Andrew R. Kalukin | Artificial intelligence system for generating conjectures and comprehending text, audio, and visual data using natural language understanding |
US11562744B1 (en) * | 2020-02-13 | 2023-01-24 | Meta Platforms Technologies, Llc | Stylizing text-to-speech (TTS) voice response for assistant systems |
CN111652678B (zh) | 2020-05-27 | 2023-11-14 | 腾讯科技(深圳)有限公司 | 物品信息显示方法、装置、终端、服务器及可读存储介质 |
-
2020
- 2020-05-20 CN CN202010432597.2A patent/CN111597830A/zh active Pending
-
2021
- 2021-04-29 JP JP2022540553A patent/JP2023509031A/ja active Pending
- 2021-04-29 WO PCT/CN2021/091114 patent/WO2021233112A1/zh active Application Filing
-
2022
- 2022-04-12 US US17/719,170 patent/US12056458B2/en active Active
Also Published As
Publication number | Publication date |
---|---|
JP2023509031A (ja) | 2023-03-06 |
US20220245365A1 (en) | 2022-08-04 |
CN111597830A (zh) | 2020-08-28 |
US12056458B2 (en) | 2024-08-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021233112A1 (zh) | 基于多模态机器学习的翻译方法、装置、设备及存储介质 | |
CN111444709B (zh) | 文本分类方法、装置、存储介质及设备 | |
CN111488739A (zh) | 基于多粒度生成图像增强表示的隐式篇章关系识别方法 | |
CN111930942B (zh) | 文本分类方法、语言模型训练方法、装置及设备 | |
WO2023160472A1 (zh) | 一种模型训练方法及相关设备 | |
Zhang et al. | Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition | |
Tyagi et al. | Demystifying the role of natural language processing (NLP) in smart city applications: background, motivation, recent advances, and future research directions | |
US20220172710A1 (en) | Interactive systems and methods | |
CN113704460B (zh) | 一种文本分类方法、装置、电子设备和存储介质 | |
CN113987179A (zh) | 基于知识增强和回溯损失的对话情绪识别网络模型、构建方法、电子设备及存储介质 | |
CN111597341B (zh) | 一种文档级关系抽取方法、装置、设备及存储介质 | |
EP4361843A1 (en) | Neural network searching method and related device | |
WO2021129411A1 (zh) | 文本处理方法及装置 | |
CN116432019A (zh) | 一种数据处理方法及相关设备 | |
Sivarethinamohan et al. | Envisioning the potential of natural language processing (nlp) in health care management | |
Manzoor et al. | Multimodality representation learning: A survey on evolution, pretraining and its applications | |
Xue et al. | Lcsnet: End-to-end lipreading with channel-aware feature selection | |
CN118378148A (zh) | 多标签分类模型的训练方法、多标签分类方法及相关装置 | |
Özdemir et al. | Multi-cue temporal modeling for skeleton-based sign language recognition | |
CN110889505A (zh) | 一种图文序列匹配的跨媒体综合推理方法和系统 | |
Ding et al. | DialogueINAB: an interaction neural network based on attitudes and behaviors of interlocutors for dialogue emotion recognition | |
CN115759262A (zh) | 基于知识感知注意力网络的视觉常识推理方法及系统 | |
WO2021129410A1 (zh) | 文本处理方法及装置 | |
Novais | A framework for emotion and sentiment predicting supported in ensembles | |
Yuhan et al. | Sensory Features in Affective Analysis: A Study Based on Neural Network Models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21807758 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2022540553 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 180423) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21807758 Country of ref document: EP Kind code of ref document: A1 |