CN114812551A - Indoor environment robot navigation natural language instruction generation method

Indoor environment robot navigation natural language instruction generation method

Info

Publication number
CN114812551A
Authority
CN
China
Prior art keywords
robot
output
offset angle
panoramic image
dimension
Prior art date
Legal status
Granted
Application number
CN202210224196.7A
Other languages
Chinese (zh)
Other versions
CN114812551B (en)
Inventor
陈启军 (Chen Qijun)
王柳懿 (Wang Liuyi)
刘成菊 (Liu Chengju)
何宗涛 (He Zongtao)
Current Assignee
Tongji University
Original Assignee
Tongji University
Priority date
Filing date
Publication date
Application filed by Tongji University
Priority to CN202210224196.7A
Publication of CN114812551A
Application granted
Publication of CN114812551B
Legal status: Active
Anticipated expiration


Classifications

    • G01C 21/206 Navigation; instruments for performing navigational calculations specially adapted for indoor navigation
    • G06N 3/045 Neural networks; architecture; combinations of networks
    • G06N 3/047 Neural networks; architecture; probabilistic or stochastic networks
    • G06N 3/08 Neural networks; learning methods


Abstract

The invention relates to a method for generating natural language navigation instructions for a robot in an indoor environment, comprising the following steps: S1, extracting image feature vectors from the panoramic image collected by the robot camera; S2, acquiring the current offset angle of the robot, expanding its dimensionality through a trigonometric transformation, and concatenating it with the image feature vectors to form the corresponding action feature vector and panoramic image feature vector; S3, aligning the action feature vector with the panoramic image feature vector using multi-head attention and performing dimensionality reduction; S4, encoding the visual and action information of the robot with a Transformer framework and outputting the predicted language result; and S5, adding an additional auxiliary supervision task at the decoder output to help the model learn the correspondence between output sentences and input actions. Compared with the prior art, the method improves the utilization of feature information and the accuracy and generalization capability of the generation model.

Description

Indoor environment robot navigation natural language instruction generation method
Technical Field
The invention relates to the fields of computer vision and natural language generation, and in particular to a method for generating natural language navigation instructions for a robot in an indoor environment.
Background
The vision-and-language navigation task is an important research problem in artificial intelligence and one of the representative problems in the cross-modal research field spanning computer vision and natural language processing. Its goal is to issue a path instruction to a robot in natural language; the robot autonomously analyzes the target direction expressed by the instruction, adjusts its behavior, and plans a path according to the visual images fed back in real time.
However, manually annotating natural language instructions is very time-consuming and labor-intensive. Many studies have shown that introducing an autoregressive instruction generation model can effectively improve the training accuracy of robots on the vision-and-language navigation task. Existing natural language instruction generation models for guiding robot navigation, however, all rely on structurally simple RNN sequence models, which capture long-term dependencies poorly, run slowly due to serial computation, and produce semantically impoverished sentences.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art by providing a method for generating natural language navigation instructions for a robot in an indoor environment.
The object of the invention is achieved by the following technical solution:
a method for generating an indoor environment robot navigation natural language instruction comprises the following steps:
S1, extracting image feature vectors from the panoramic image collected by the robot camera using a deep convolutional neural network;
S2, acquiring the current offset angle of the robot, expanding its dimensionality through a trigonometric transformation, and concatenating it with the image feature vectors to form the corresponding action feature vector and panoramic image feature vector;
S3, aligning the action feature vector with the panoramic image feature vector using multi-head attention and performing dimensionality reduction, so that the robot focuses on the more important visual content in the environment;
S4, encoding the visual and action information of the robot with a sequence-to-sequence Transformer framework, fusing it with the masked language embedding through cross-modal attention at the decoder, and outputting the predicted language result;
S5, adding an additional auxiliary supervision task at the decoder output to assist the robot in learning the correspondence between the output sentence and the input action and to improve the network's representation of the input-output relationship.
In step S1, the deep convolutional neural network adopts a ResNet-152 network pre-trained on ImageNet; the image is fed through the ResNet-152 network, and the output of the last layer before the classification layer, obtained by forward inference, is used as the image feature vector.
The panoramic image collected by the robot camera comprises 36 sub-images: 12 observation images at 30-degree heading intervals at each of three pitch angles (looking down, level, and looking up), each observation image corresponding to one image feature vector.
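As an illustrative sketch only, and assuming a PyTorch/torchvision environment, the feature extraction of step S1 could look as follows; the preprocessing pipeline and the helper for loading the 36 views are assumptions added for illustration:

```python
# Hedged sketch of step S1: per-view features from a pretrained ResNet-152.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained ResNet-152; replace the classifier with Identity so the forward
# pass returns the 2048-d feature preceding the classification layer.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def panorama_features(view_paths):
    """view_paths: 36 image files (12 headings x 3 pitch levels).
    Returns a (36, 2048) tensor of image feature vectors v_j."""
    views = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in view_paths])
    return resnet(views)  # (36, 2048)
```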
In step S2, the offset angle of the robot comprises an action offset angle and a view offset angle. The action offset angle is the offset between the current position of the robot and its position at the previous time step; the view offset angle is the offset of each sub-image of the observed panoramic image relative to the image center. The offset angle is expressed as:
γ = (sin θ, cos θ, sin φ, cos φ)
where γ is the offset angle feature, θ is the offset heading angle, and φ is the offset pitch angle.
In step S2, the action feature vector A is formed by concatenating the image feature vector of the sub-image directly in front of the robot in the panoramic image with the dimension-expanded action offset angle vector, and the panoramic image feature vector E is formed by concatenating the image feature vectors of all sub-images in the panoramic image with the dimension-expanded view offset angle vectors.
In step S3, the output X after multi-head attention alignment and dimensionality reduction is expressed as:
Q = A·W_Q
K = E·W_K
V = E·W_V
X = softmax(Q·K^T / √d_K)·V
where Q, K and V denote the query, key and value matrices of the attention mechanism after linear transformation, W_Q, W_K and W_V are the learnable weights that linearly transform the action feature vector A and the panoramic image feature vector E, and d_K is the dimension of K.
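A minimal sketch of this alignment step, assuming PyTorch and illustrative sizes (2048-dimensional image features plus a 128-dimensional angle expansion, projected to an assumed model width of 512), is given below; nn.MultiheadAttention supplies the multi-head form of the softmax(Q·K^T/√d_K)·V formula above, with W_Q, W_K, W_V living inside the module:

```python
# Hedged sketch of step S3: action feature as query, panorama as keys/values.
import torch
import torch.nn as nn

d_feat, d_model, n_heads = 2048 + 128, 512, 8   # assumed sizes

proj_a = nn.Linear(d_feat, d_model)   # projects the action feature A
proj_e = nn.Linear(d_feat, d_model)   # projects the panoramic feature E
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

def align(A, E):
    """A: (batch, 1, d_feat) action feature, E: (batch, 36, d_feat) panorama.
    Returns X: (batch, 1, d_model), the attention-weighted visual summary."""
    q = proj_a(A)            # query side
    kv = proj_e(E)           # key/value side
    X, _ = attn(q, kv, kv)   # multi-head softmax(QK^T / sqrt(d_K)) V
    return X
```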
In step S4, since predictions are made with a Transformer, a positional sequence encoding is added to express the different influence of inputs at different time steps on the output. The output X obtained from the multi-head attention alignment and dimensionality reduction is position-encoded as:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where PE(pos, 2i) is the positional encoding value in dimension 2i of the embedding, PE(pos, 2i+1) is the value in dimension 2i+1, pos is the position of the element in the input sequence, and d_model is the dimension of the embedding.
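For illustration, a direct sketch of this sinusoidal positional encoding (assuming PyTorch and an even d_model) is:

```python
# Hedged sketch of the positional encoding defined above.
import torch

def positional_encoding(seq_len, d_model):
    """Returns a (seq_len, d_model) tensor with PE(pos, 2i) = sin(pos/10000^(2i/d_model))
    and PE(pos, 2i+1) = cos(pos/10000^(2i/d_model))."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    div = torch.pow(10000.0, i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe
```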
The Transformer comprises an encoder and a decoder. The encoder is composed of stacked modules of multi-head self-attention, a feed-forward network and residual connections; each decoder module comprises cross-attention, self-attention, a feed-forward network and residual connections.
In the decoder, the ground-truth text is shifted one position to the right and masked along the diagonal so that each text input depends only on the previous predictions; the text is then converted into an embedded representation by a linear layer and fused with the encoder output through cross-attention.
In step S5, after the additional auxiliary supervision task is added to the decoder output, the final loss function Loss is expressed as:
Loss = λ·L1 + (1 − λ)·ω·L2
L1 = − Σ_p log f_θ(x̂_p | x̂_<p)
L2 = (1/L) · Σ_{l=1..L} (Z_l − Ẑ_l)²
Ẑ_l = P(I'_j) / k
where L1 is the difference between the predicted and true values obtained with the cross-entropy loss, L2 is the difference between the predicted and true values obtained with the mean-square error, θ denotes the network parameters, f_θ(·) is the probability predicted by the network, x̂_p is the p-th true value of the output instruction, x̂_<p denotes the true values of the output instruction preceding position p, Z_l is the predicted progress value output by the network for the l-th word, L is the total number of words, P(I'_j) is the index of the sub-instruction I'_j containing the current word, k is the total number of sub-instructions, λ controls the relative weight of the two losses, ω unifies their magnitudes, and Ẑ_l is the true progress value corresponding to the l-th word.
Compared with the prior art, the invention has the following advantages:
the invention effectively integrates the action behavior and the environment observation of the robot by introducing a multi-head attention method, improves the utilization degree of input characteristic information, uses an advanced sequence generation model Transformer to replace the original RNN structural model, and improves the integral coding and decoding capability of the model.
The invention further proposes an additional progress-supervision auxiliary task that uses finer-grained sub-instructions and the corresponding sub-actions to improve the network's representation of the input-output correspondence, and introduces prior knowledge to help the model better learn the relation between language generation and actions, thereby improving the accuracy and generalization capability of the generation model with almost no increase in the number of network parameters.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a system flow diagram illustrating the method of the present invention.
FIG. 3 is a schematic diagram of a progress supervision assistance task in the method of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
Examples
The invention provides a method for generating natural language navigation instructions for a robot in an indoor environment. The overall flow of the method is shown in FIG. 1, and the method specifically comprises the following steps:
S1, extracting image features from the images collected by the robot camera using a deep convolutional neural network;
The deep convolutional neural network adopts a ResNet-152 network pre-trained on ImageNet; the output of the last layer before the classification layer, obtained by forward inference after the image is input into the network, is used as the feature vector, and v_j denotes the j-th image feature vector.
S2, acquiring the current offset angle gamma of the robot, expanding the data dimension through triangular transformation, and splicing with the image features to form a new feature vector;
the offset angle includes motion offset and view angle offset, the motion offset refers to the offset angle between the current position of the robot and the position of the robot at the previous moment, the view angle offset refers to the offset angle based on the center of each sub-image contained in a panoramic image observed by the robot, and in the example, a sine function and a cosine function are used for respectively calculating the offset pitch angle
Figure RE-GDA0003707764370000051
And a heading angle θ, the formula being:
Figure RE-GDA0003707764370000052
in order for the network model to better learn the relationship between images and motion angles, it is extended to 128 dimensions, i.e. γ j And is combined with v j Forming a characteristic vector after splicing: o j ={v j ;γ j }。
S3, aligning the robot's action with the observed panoramic image using multi-head attention and performing dimensionality reduction, so that the robot focuses on the more important visual content in the environment;
the multi-head attention uses the motion characteristic vector as a query matrix Q, uses the panoramic image as a key matrix K and a value matrix V, and can obviously reduce the dimensionality of input characteristics after the attention is paid to the motion characteristic vector and the panoramic image, filter unimportant components, enrich input semantic information, and express an attention calculation formula as follows:
Q=AW Q
K=EW K
V=EW V
Figure RE-GDA0003707764370000053
a and E represent motion feature vector and panorama image feature vector, respectively, W Q 、W K 、W V Respectively, the learnable weights for making linear changes to the motion characteristic vector A and the panoramic image characteristic vector E,
Figure RE-GDA0003707764370000054
the dimensionality of K is represented, and the model can be made to pay attention to different emphasis points by using multi-head attention, so that the training and learning capabilities of the model are further improved.
S4, encoding the visual and action information of the robot with a sequence-to-sequence Transformer framework, fusing it with the masked language embedding through cross-modal attention at the decoder, and outputting the predicted language result;
since the Transformer is a parallel structure and does not inherently have the capability of capturing the sequence of the input sequence, it is necessary to add a position sequence code to emphasize the different effects of the input on the output in time sequence, and therefore, the position code is implemented by using the following function for the input X after multi-head attention fusion:
Figure RE-GDA0003707764370000061
Figure RE-GDA0003707764370000062
wherein, PE (pos,2i) For embedding coded 2 i-dimension position coded values, PE (pos,2i+1) For the embedding of the position-coded value in the 2i +1 th dimension, pos is the actual position of the element in the input sequence, d model Is the dimension of the embedded code.
The Transformer comprises an encoder and a decoder. The encoder is built from stacked modules of multi-head self-attention, a feed-forward network and residual connections. In the decoder, the text input must depend only on the previous predictions, so the ground-truth text is shifted one position to the right and masked along the diagonal; the text is then converted into an embedded representation by a linear layer and fused with the encoder output through cross-attention. Each decoder module contains cross-attention, self-attention, a feed-forward network and residual connections, and the final word probability distribution is output through a linear layer followed by a softmax function.
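A compact sketch of this encoder-decoder arrangement, assuming PyTorch's standard Transformer modules (the layer count, model width and vocabulary size below are illustrative assumptions), is:

```python
# Hedged sketch of the sequence-to-sequence Transformer in step S4.
import torch
import torch.nn as nn

d_model, n_heads, n_layers, vocab_size = 512, 8, 6, 1000   # assumed sizes

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
embed = nn.Embedding(vocab_size, d_model)    # embedding of the right-shifted text
out_proj = nn.Linear(d_model, vocab_size)    # linear layer before the softmax

def predict_word_distribution(X, decoder_input_ids):
    """X: (batch, T_in, d_model) fused vision/action features with position
    encoding already added; decoder_input_ids: right-shifted reference text ids."""
    memory = encoder(X)                                          # vision/action memory
    T = decoder_input_ids.size(1)
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), 1)   # diagonal mask
    tgt = embed(decoder_input_ids)
    h = decoder(tgt, memory, tgt_mask=causal)                    # self-, then cross-attention
    return out_proj(h).softmax(dim=-1)                           # word probability distribution
```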
The difference between the predicted and true values is computed with the cross-entropy loss:
L1 = − Σ_p log f_θ(x̂_p | x̂_<p)
where L1 is the difference between the predicted and true values obtained with the cross-entropy loss, θ denotes the network parameters, f_θ(·) is the probability predicted by the network, x̂_p is the p-th true value of the output instruction, and x̂_<p denotes the true values of the output instruction preceding position p.
S5, adding an additional auxiliary supervision task at the decoder output to help the robot learn the correspondence between the output sentence and the input action and to improve the network's representation of the input-output relationship.
The auxiliary task is a technique commonly used to improve machine translation: by adding extra loss supervision based on prior knowledge, it helps the model learn the internal associations in the data more easily. Assuming an action can be divided into k sub-actions, each corresponding to a sub-instruction containing several words, the progress value corresponding to each word of each sub-instruction is:
Ẑ_l = P(I'_j) / k
where P(·) denotes the index of the sub-instruction I'_j containing the current word, k is the total number of sub-instructions (or sub-actions), and Ẑ_l is the true progress value corresponding to the l-th word. The auxiliary supervision task is parallel to the original text prediction output and takes the feature vectors output by the last layer of the Transformer decoder as input.
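Under the reading that P(I'_j) is the index j of the sub-instruction containing the word (an assumption consistent with the definition above), the true progress values could be generated as follows:

```python
# Hedged sketch: progress labels for the auxiliary supervision task.
def progress_labels(sub_instructions):
    """sub_instructions: list of k token lists, one per sub-action.
    Returns one true progress value per word, flattened in instruction order."""
    k = len(sub_instructions)
    labels = []
    for j, sub in enumerate(sub_instructions, start=1):
        labels.extend([j / k] * len(sub))   # every word in sub-instruction j gets j/k
    return labels

# e.g. 3 sub-instructions of 4, 3 and 5 words -> labels 1/3 (x4), 2/3 (x3), 1.0 (x5)
```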
The difference between the predicted and true progress values is computed with the mean-square error:
L2 = (1/L) · Σ_{l=1..L} (Z_l − Ẑ_l)²
where L2 is the difference between the predicted and true values obtained with the mean-square error, Z_l is the predicted progress value output by the network for the l-th word, and L is the total number of words.
The final loss function is defined as:
Loss = λ·L1 + (1 − λ)·ω·L2
where λ controls the relative weight of the two losses and ω unifies their magnitudes.
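A sketch of the joint loss under these definitions (assuming PyTorch; the values of λ and ω below are placeholders, since the patent does not fix them):

```python
# Hedged sketch of the joint training loss Loss = lambda*L1 + (1 - lambda)*omega*L2.
import torch.nn.functional as F

def joint_loss(word_logits, target_ids, progress_pred, progress_true,
               lam=0.7, omega=1.0):
    """word_logits: (batch, T, vocab); target_ids: (batch, T);
    progress_pred / progress_true: (batch, T)."""
    L1 = F.cross_entropy(word_logits.transpose(1, 2), target_ids)   # language loss
    L2 = F.mse_loss(progress_pred, progress_true)                   # progress loss
    return lam * L1 + (1 - lam) * omega * L2
```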
In summary, the invention provides a natural language instruction generation method for guiding robot navigation behavior, aimed at the robot vision-and-language navigation task. The method uses a Transformer as the sequence-to-sequence text generation framework, introduces an additional generation-progress auxiliary supervision task, designs a joint training loss function, and achieves end-to-end learning and prediction. It can effectively generate natural language instructions for robot navigation paths, thereby improving the robot's vision-and-language navigation capability without introducing additional manual annotation, and offers rich semantic information in the generated language, strong model generalization and fast training.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (10)

1. A method for generating an indoor environment robot navigation natural language instruction, characterized by comprising the following steps:
S1, extracting image feature vectors from the panoramic image collected by the robot camera using a deep convolutional neural network;
S2, acquiring the current offset angle of the robot, expanding its dimensionality through a trigonometric transformation, and concatenating it with the image feature vectors to form the corresponding action feature vector and panoramic image feature vector;
S3, aligning the action feature vector with the panoramic image feature vector using multi-head attention and performing dimensionality reduction, so that the robot focuses on the more important visual content in the environment;
S4, encoding the visual and action information of the robot with a sequence-to-sequence Transformer framework, fusing it with the masked language embedding through cross-modal attention at the decoder, and outputting the predicted language result;
S5, adding an additional auxiliary supervision task at the decoder output to assist the robot in learning the correspondence between the output sentence and the input action and to improve the network's representation of the input-output relationship.
2. The method for generating an indoor environment robot navigation natural language instruction according to claim 1, wherein in step S1 the deep convolutional neural network adopts a ResNet-152 network pre-trained on ImageNet, and the output of the last layer before the classification layer, obtained by forward inference after the image is input into the ResNet-152 network, is used as the image feature vector.
3. The method as claimed in claim 2, wherein the panoramic image collected by the robot camera comprises 36 sub-images: 12 observation images at 30-degree heading intervals at each of three pitch angles (looking down, level and looking up), each observation image corresponding to one image feature vector.
4. The method for generating an indoor environment robot navigation natural language instruction according to claim 1, wherein in step S2 the offset angle of the robot comprises an action offset angle and a view offset angle, the action offset angle being the offset angle between the current position of the robot and its position at the previous time step, the view offset angle being the offset of each sub-image of the observed panoramic image relative to the image center, and the offset angle being expressed as:
γ = (sin θ, cos θ, sin φ, cos φ)
where γ is the offset angle feature, θ is the offset heading angle, and φ is the offset pitch angle.
5. The method according to claim 4, wherein in step S2 the action feature vector A is formed by concatenating the image feature vector of the sub-image directly in front of the robot in the panoramic image with the dimension-expanded action offset angle vector, and the panoramic image feature vector E is formed by concatenating the image feature vectors of all sub-images in the panoramic image with the dimension-expanded view offset angle vectors.
6. The method for generating an indoor environment robot navigation natural language instruction according to claim 1, wherein in step S3 the output X after multi-head attention alignment and dimensionality reduction is expressed as:
Q = A·W_Q
K = E·W_K
V = E·W_V
X = softmax(Q·K^T / √d_K)·V
where Q, K and V denote the query, key and value matrices of the attention mechanism after linear transformation, W_Q, W_K and W_V are the learnable weights that linearly transform the action feature vector A and the panoramic image feature vector E, and d_K is the dimension of K.
7. The method for generating an indoor environment robot navigation natural language instruction according to claim 1, wherein in step S4, when predicting with the Transformer, a positional sequence encoding is added to express the different influence of inputs at different time steps on the output, and the output X obtained from the multi-head attention alignment and dimensionality reduction is position-encoded as:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where PE(pos, 2i) is the positional encoding value in dimension 2i of the embedding, PE(pos, 2i+1) is the value in dimension 2i+1, pos is the position of the element in the input sequence, and d_model is the dimension of the embedding.
8. The method as claimed in claim 7, wherein the Transformer comprises an encoder and a decoder, the encoder being composed of stacked modules of multi-head self-attention, a feed-forward network and residual connections, and each decoder module comprising cross-attention, self-attention, a feed-forward network and residual connections.
9. The method as claimed in claim 8, wherein in the decoder the ground-truth text is shifted one position to the right and masked along the diagonal so that the text input depends only on the previous predictions, and the text is then converted into an embedded representation by a linear layer and fused with the encoder output through cross-attention.
10. The method according to claim 8, wherein in step S5, after the additional auxiliary supervision task is added to the decoder output, the final loss function Loss is expressed as:
Loss = λ·L1 + (1 − λ)·ω·L2
L1 = − Σ_p log f_θ(x̂_p | x̂_<p)
L2 = (1/L) · Σ_{l=1..L} (Z_l − Ẑ_l)²
Ẑ_l = P(I'_j) / k
where L1 is the difference between the predicted and true values obtained with the cross-entropy loss, L2 is the difference between the predicted and true values obtained with the mean-square error, θ denotes the network parameters, f_θ(·) is the probability predicted by the network, x̂_p is the p-th true value of the output instruction, x̂_<p denotes the true values of the output instruction preceding position p, Z_l is the predicted progress value output by the network for the l-th word, L is the total number of words, P(I'_j) is the index of the sub-instruction I'_j containing the current word, k is the total number of sub-instructions, λ controls the relative weight of the two losses, ω unifies their magnitudes, and Ẑ_l is the true progress value corresponding to the l-th word.
Priority Applications (1)

Application number: CN202210224196.7A; priority date: 2022-03-09; filing date: 2022-03-09; title: Indoor environment robot navigation natural language instruction generation method; status: Active; granted publication: CN114812551B (en).

Publications (2)

Publication Number / Publication Date
CN114812551A (en) 2022-07-29
CN114812551B (en) 2024-07-26

Family ID: 82529629

Country Status (1)

CN: CN114812551B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814844A (en) * 2020-03-17 2020-10-23 同济大学 Intensive video description method based on position coding fusion
US20210374358A1 (en) * 2020-05-31 2021-12-02 Salesforce.Com, Inc. Systems and methods for composed variational natural language generation
US20220059200A1 (en) * 2020-08-21 2022-02-24 Washington University Deep-learning systems and methods for medical report generation and anomaly detection
CN112200244A (en) * 2020-10-09 2021-01-08 西安交通大学 Intelligent detection method for anomaly of aerospace engine based on hierarchical countermeasure training
CN112560438A (en) * 2020-11-27 2021-03-26 同济大学 Text generation method based on generation of confrontation network
CN113268561A (en) * 2021-04-25 2021-08-17 中国科学技术大学 Problem generation method based on multi-task joint training
CN113537024A (en) * 2021-07-08 2021-10-22 天津理工大学 Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism
CN114091466A (en) * 2021-10-13 2022-02-25 山东师范大学 Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN113988274A (en) * 2021-11-11 2022-01-28 电子科技大学 Text intelligent generation method based on deep learning
CN114092774A (en) * 2021-11-22 2022-02-25 沈阳工业大学 RGB-T image significance detection system and detection method based on information flow fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Motonari Kambara et al.: "Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions", IEEE Robotics and Automation Letters, 24 August 2021 (2021-08-24), page 8371, XP011876804, DOI: 10.1109/LRA.2021.3107026 *
Zhuang Shunan (庄暑楠): "Research and Implementation of Text Normalization Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology, no. 8, 15 August 2020 (2020-08-15), pages 138-817 *
Li Xueqing (李雪晴) et al.: "A Survey of Natural Language Generation", Journal of Computer Applications, no. 5, 31 May 2021 (2021-05-31), pages 1227-1235 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115795278A (en) * 2022-12-02 2023-03-14 广东元一科技实业有限公司 Intelligent cloth paving machine control method and device and electronic equipment
CN115795278B (en) * 2022-12-02 2023-08-04 广东元一科技实业有限公司 Intelligent cloth paving machine control method and device and electronic equipment
CN118015162A (en) * 2024-04-10 2024-05-10 哈尔滨工业大学(威海) Three-dimensional digital human head animation generation method based on phonetic prosody decomposition



Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant