CN114812551A - Indoor environment robot navigation natural language instruction generation method - Google Patents
Indoor environment robot navigation natural language instruction generation method

- Publication number: CN114812551A
- Application number: CN202210224196.7A
- Authority: CN (China)
- Legal status: Granted (status listed by Google Patents; not a legal conclusion)
Classifications

- G01C21/206 — Instruments for performing navigational calculations specially adapted for indoor navigation
- G06N3/045 — Combinations of networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/08 — Learning methods
Abstract
The invention relates to a method for generating navigation natural language instructions for a robot in an indoor environment, comprising the following steps: S1, extracting image feature vectors from the panoramic image collected by the robot camera; S2, acquiring the current offset angle of the robot, expanding its data dimension through trigonometric transformation, and splicing it with the image feature vectors to form the corresponding action feature vector and panoramic image feature vector; S3, aligning the action feature vector and the panoramic image feature vector with multi-head attention and performing dimension-reduction calculation; S4, encoding the visual and action information of the robot with a Transformer framework and outputting a predicted language result; S5, adding an auxiliary supervision task to the decoder output to help the robot learn the correspondence between output sentences and input actions. Compared with the prior art, the method makes better use of the feature information and improves the accuracy and generalization ability of the generation model.
Description
Technical Field
The invention relates to the field of computer vision and natural language generation, in particular to a method for generating a navigation natural language instruction of an indoor environment robot.
Background
The visual-language navigation task is an important research problem in artificial intelligence and a representative problem in the cross-modal field spanning computer vision and natural language processing. Its goal is to issue a path instruction to a robot in natural language; the robot then autonomously analyzes the target direction expressed by the instruction, adjusts its behavior, and plans a path according to visual images fed back in real time.
However, manually annotating natural language instructions is time-consuming and labor-intensive. Many studies have shown that introducing an autoregressive instruction-generation model can effectively improve the training accuracy of existing robots on the visual-language navigation task, but existing natural-language-instruction generation models for guiding robot navigation all rely on structurally simple RNN sequence models, which capture long-range dependencies poorly, run slowly due to serial computation, and produce semantically sparse output sentences.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a method for generating an indoor environment robot navigation natural language instruction.
The purpose of the invention can be realized by the following technical scheme:
a method for generating an indoor environment robot navigation natural language instruction comprises the following steps:
s1, extracting image feature vectors from the panoramic image collected by the robot camera using a deep convolutional neural network;
s2, acquiring the current offset angle of the robot, expanding its data dimension through trigonometric transformation, and splicing it with the image feature vectors to form the corresponding action feature vector and panoramic image feature vector;
s3, aligning the action feature vector and the panoramic image feature vector with multi-head attention and performing dimension-reduction calculation, so that the robot focuses on the more important visual content in the environment;
s4, encoding the visual and action information of the robot with a sequence-to-sequence Transformer framework, performing cross-modal attention fusion with the masked language embedding at the decoder, and outputting a predicted language result;
s5, adding an auxiliary supervision task at the decoder output to help the robot learn the correspondence between output sentences and input actions and improve the network model's expression of the input-output relation.
In step S1, the deep convolutional neural network adopts a ResNet-152 pre-trained on ImageNet; after an image is fed into the ResNet-152 network, the output of the last layer before the classification head, obtained by forward inference, is used as the image feature vector.
The panoramic image collected by the robot camera comprises 36 sub-images: 12 observation images at 30-degree intervals at each of three view angles (looking down, level, and looking up), each observation image corresponding to one image feature vector.
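As an illustration, the 36 viewpoints can be enumerated as heading/pitch pairs. This is a sketch: the concrete pitch values of ±30° for the look-down and look-up rows are an assumption not stated in the text.

```python
def panorama_view_angles():
    # 12 headings at 30-degree intervals x 3 pitch rows (down, level, up) = 36 views
    pitches = [-30.0, 0.0, 30.0]  # assumed pitch values for the three view rows
    return [(heading * 30.0, pitch) for pitch in pitches for heading in range(12)]

views = panorama_view_angles()  # 36 (heading_deg, pitch_deg) pairs
```

Each pair would index one sub-image and hence one image feature vector.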
In step S2, the offset angle of the robot includes an action offset angle and a view-angle offset angle. The action offset angle is the offset between the robot's current position and its position at the previous time step; the view-angle offset angle is the offset relative to the center of each sub-image of the panoramic image observed by the robot. The offset angle is expressed as:

γ = (sin θ, cos θ, sin φ, cos φ)

wherein γ is the offset angle feature, θ is the heading offset angle, and φ is the pitch offset angle.
In step S2, the action feature vector A is formed by splicing the image feature vector of the sub-image directly in front of the robot in the panoramic image with the dimension-expanded action offset angle vector, and the panoramic image feature vector E is formed by splicing the image feature vectors of all sub-images in the panoramic image with the dimension-expanded view-angle offset angle vectors.
In step S3, the output X after multi-head attention alignment and dimension-reduction calculation is expressed as:

X = softmax(QK^T / sqrt(d_k)) V

Q = AW_Q, K = EW_K, V = EW_V

wherein Q, K and V denote the query matrix, key matrix and value matrix of the attention mechanism after linear transformation; W_Q, W_K and W_V are the learnable weights that linearly transform the action feature vector A and the panoramic image feature vector E; and d_k is the dimension of K.
In step S4, when performing prediction with the Transformer, position encoding is added to reflect the different influence of each input position on the output in time order. The output X from the multi-head attention alignment and dimension-reduction calculation is position-encoded as:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

wherein PE(pos, 2i) and PE(pos, 2i+1) are the position-encoding values in dimensions 2i and 2i+1 of the embedding, pos is the actual position of the element in the input sequence, and d_model is the dimension of the embedding.
The Transformer comprises an encoder and a decoder. The encoder is composed of stacked multi-head self-attention modules, feed-forward networks and residual connections; each decoder module comprises cross-attention, self-attention, a feed-forward network and residual connections.
In the decoder, the ground-truth text is shifted one position to the right and diagonally masked so that each input token depends only on the previous predictions; the text is then converted into an embedded representation by a linear layer and cross-attended with the encoder output.
In step S5, after the auxiliary supervision task is added to the decoder output, the final loss function Loss is expressed as:

Loss = λL_1 + (1-λ)ωL_2

L_1 = -Σ_{p=1}^{P} log f_θ(ŷ_p | ŷ_1, …, ŷ_{p-1})

L_2 = (1/L) Σ_{l=1}^{L} (Z_l - Ẑ_l)²

Ẑ_l = P(I'_j)/k

wherein L_1 is the difference between predicted and true values obtained with the cross-entropy loss function, L_2 is the difference between predicted and true values obtained with the mean-square-error function, θ are the network parameters, f_θ(·) is the network's predicted probability, ŷ_p is the p-th true value of the output instruction, ŷ_1, …, ŷ_{p-1} are the 1st to (p-1)-th true values of the output instruction, Z_l is the predicted progress value the network outputs for the l-th word, L is the total number of words, P(I'_j) is the segment index of the sub-instruction I'_j containing the current word, k is the total number of sub-instructions, λ controls the relative weight of the two loss functions, ω unifies the magnitudes of the two loss functions, and Ẑ_l is the true progress value corresponding to the l-th word.
Compared with the prior art, the invention has the following advantages:
the invention effectively integrates the action behavior and the environment observation of the robot by introducing a multi-head attention method, improves the utilization degree of input characteristic information, uses an advanced sequence generation model Transformer to replace the original RNN structural model, and improves the integral coding and decoding capability of the model.
The invention provides an additional progress-supervision auxiliary task that uses finer-grained sub-instructions and their corresponding sub-actions to improve the network model's expression of the input-output correspondence, and introduces prior knowledge to help the model better learn the relation between language generation and actions, thereby improving the accuracy and generalization ability of the generation model while adding almost no network parameters.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a system flow diagram illustrating the method of the present invention.
FIG. 3 is a schematic diagram of a progress supervision assistance task in the method of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
Examples
The invention provides a method for generating a navigation natural language instruction of a robot in an indoor environment, wherein the overall flow block diagram of the method is shown in figure 1, and the method specifically comprises the following steps:
s1, extracting image features from the panoramic image collected by the robot camera using a deep convolutional neural network;
The deep convolutional neural network adopts a ResNet-152 pre-trained on ImageNet; the output of the last layer before the classification head, obtained by forward inference after an image is fed into the network, is used as the feature vector, with v_j denoting the j-th image feature vector.
S2, acquiring the current offset angle gamma of the robot, expanding the data dimension through triangular transformation, and splicing with the image features to form a new feature vector;
the offset angle includes motion offset and view angle offset, the motion offset refers to the offset angle between the current position of the robot and the position of the robot at the previous moment, the view angle offset refers to the offset angle based on the center of each sub-image contained in a panoramic image observed by the robot, and in the example, a sine function and a cosine function are used for respectively calculating the offset pitch angleAnd a heading angle θ, the formula being:
in order for the network model to better learn the relationship between images and motion angles, it is extended to 128 dimensions, i.e. γ j And is combined with v j Forming a characteristic vector after splicing: o j ={v j ;γ j }。
S3, aligning and performing dimensionality reduction calculation on the panoramic image of the robot action and observation by using multi-head attention, so that the robot focuses on more important visual contents in the environment;
the multi-head attention uses the motion characteristic vector as a query matrix Q, uses the panoramic image as a key matrix K and a value matrix V, and can obviously reduce the dimensionality of input characteristics after the attention is paid to the motion characteristic vector and the panoramic image, filter unimportant components, enrich input semantic information, and express an attention calculation formula as follows:
X = softmax(QK^T / sqrt(d_k)) V

Q = AW_Q, K = EW_K, V = EW_V

wherein A and E denote the action feature vector and the panoramic image feature vector respectively, W_Q, W_K and W_V are the learnable weights that linearly transform A and E, and d_k denotes the dimension of K. Using multiple attention heads lets the model attend to different emphases, further improving its training and learning ability.
S4, coding the visual and motion information of the robot by using a sequence-to-sequence Transformer frame, performing cross-modal attention fusion with a language embedded code with a mask at a decoder end, and outputting a predicted language result;
Because the Transformer is a parallel structure with no inherent notion of input order, position encoding must be added to reflect the different influence of each input position on the output in time order. The input X produced by the multi-head attention fusion is therefore position-encoded with the following functions:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

wherein PE(pos, 2i) and PE(pos, 2i+1) are the position-encoding values in dimensions 2i and 2i+1 of the embedding, pos is the actual position of the element in the input sequence, and d_model is the dimension of the embedding.
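The sinusoidal position encoding above can be computed directly; the sequence length and model dimension below are example values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); odd dimensions use cos
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2.0 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=50, d_model=512)  # added elementwise to X
```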
The Transformer comprises an encoder and a decoder. The encoder is built from stacked multi-head self-attention modules, feed-forward networks and residual connections. In the decoder, the text input must depend only on previous predictions, so the ground-truth text is shifted one position to the right and diagonally masked; it is then converted into an embedded representation by a linear layer and cross-attended with the encoder output. Each decoder module comprises cross-attention, self-attention, a feed-forward network and residual connections, and the final word probability distribution is produced by a linear layer followed by a softmax function.
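The right-shift and diagonal (causal) mask can be sketched as follows; the start-token id is an illustrative assumption.

```python
import numpy as np

def shift_right(tokens, start_id=0):
    # Prepend a start token and drop the last target token
    return [start_id] + list(tokens[:-1])

def causal_mask(n):
    # True above the diagonal marks masked (future) positions
    return np.triu(np.ones((n, n), dtype=bool), k=1)

inp = shift_right([7, 8, 9])  # decoder input: [0, 7, 8]
m = causal_mask(3)            # position t may only attend to positions <= t
```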
The difference between the predicted and true values is calculated with the cross-entropy loss function:

L_1 = -Σ_{p=1}^{P} log f_θ(ŷ_p | ŷ_1, …, ŷ_{p-1})

wherein L_1 is the difference between predicted and true values obtained with the cross-entropy loss function, θ are the network parameters, f_θ(·) is the network's predicted probability, ŷ_p is the p-th true value of the output instruction, and ŷ_1, …, ŷ_{p-1} are the 1st to (p-1)-th true values of the output instruction.
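A small numeric sketch of the cross-entropy term, with a toy 3-word vocabulary and hand-picked probabilities:

```python
import numpy as np

def cross_entropy(pred_probs, target_ids):
    # Mean negative log-likelihood of the true next word at each step
    picked = pred_probs[np.arange(len(target_ids)), target_ids]
    return float(-np.mean(np.log(picked)))

# Two decoding steps over a 3-word vocabulary
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
loss = cross_entropy(probs, [0, 1])
```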
S5, adding an additional auxiliary supervision task at the output part of the decoder to help the robot learn the corresponding relation between the output sentence and the input action and improve the expression of the network model to the input and output relation.
Auxiliary tasks are a common means of improving machine translation: by exploiting prior knowledge, an extra supervised loss helps the model learn the internal associations in the data more easily. Suppose an action can be divided into k sub-actions, each corresponding to a sub-instruction containing several words; the progress value of each word in each sub-instruction is then:

Ẑ_l = P(I'_j)/k

wherein P(I'_j) denotes the segment index of the sub-instruction I'_j containing the current word, k is the total number of sub-instructions (or sub-actions), and Ẑ_l is the true progress value corresponding to the l-th word. The auxiliary supervision task runs in parallel with the original text prediction output and takes the feature vectors output by the last layer of the Transformer decoder as input.
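Under this definition, the per-word progress targets can be generated from the sub-instruction word counts; the two-segment example is illustrative.

```python
def progress_targets(sub_instruction_word_counts):
    # A word in the j-th of k sub-instructions gets target j/k
    k = len(sub_instruction_word_counts)
    targets = []
    for j, n_words in enumerate(sub_instruction_word_counts, start=1):
        targets.extend([j / k] * n_words)
    return targets

# Two sub-instructions of 3 and 2 words
targets = progress_targets([3, 2])  # [0.5, 0.5, 0.5, 1.0, 1.0]
```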
The difference between the predicted and true values is calculated with the mean-square-error function:

L_2 = (1/L) Σ_{l=1}^{L} (Z_l - Ẑ_l)²

wherein L_2 is the difference between predicted and true values obtained with the mean-square-error function, Z_l is the predicted progress value the network outputs for the l-th word, and L is the total number of words.
The final loss function is defined as:

Loss = λL_1 + (1-λ)ωL_2

wherein λ controls the relative weight of the two loss functions and ω unifies their magnitudes.
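The joint loss combines the two terms as a weighted sum; the concrete λ and ω values below are illustrative, not taken from the patent.

```python
def total_loss(l1, l2, lam=0.5, omega=0.1):
    # Loss = lambda*L1 + (1 - lambda)*omega*L2
    # lam balances the two terms; omega rescales L2 to L1's magnitude
    return lam * l1 + (1.0 - lam) * omega * l2

loss = total_loss(2.0, 4.0)  # 0.5*2.0 + 0.5*0.1*4.0 = 1.2
```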
In conclusion, the invention provides a natural language instruction generation method for guiding robot navigation behavior in the visual-language navigation task. The method uses a Transformer as the sequence-to-sequence text generation framework, introduces an additional generation-progress auxiliary supervision task, designs a joint training loss function, and achieves end-to-end learning and prediction. It effectively generates natural language instructions for robot navigation paths, thereby improving the robot's visual-language navigation ability without introducing additional manual annotation, and offers rich semantic information in the generated language, strong model generalization and fast training.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.
Claims (10)
1. A method for generating an indoor environment robot navigation natural language instruction is characterized by comprising the following steps:
s1, extracting image feature vectors of the panoramic image collected by the robot camera by using a deep convolutional neural network;
s2, acquiring the current offset angle of the robot, expanding the data dimension through triangular transformation, and splicing the data dimension with the image feature vector to form a corresponding action feature vector and a panoramic image feature vector;
s3, aligning the motion characteristic vector and the panoramic image characteristic vector by adopting multi-head attention and performing dimensionality reduction calculation to enable the robot to focus on more important visual contents in the environment;
s4, coding the visual and motion information of the robot by adopting a sequence-to-sequence Transformer frame, performing cross-modal attention fusion with a language embedded code with a mask at a decoder end, and outputting a predicted language result;
s5, adding an additional auxiliary supervision task at the output part of the decoder, assisting the robot to learn the corresponding relation between the output sentence and the input action, and improving the expression of the network model to the input and output relation.
2. The method for generating an indoor environment robot navigation natural language instruction according to claim 1, wherein in step S1, the deep convolutional neural network adopts a ResNet-152 pre-trained on ImageNet, and the output of the last layer before the classification head, obtained by forward inference after the image is input into the ResNet-152 network, is used as the image feature vector.
3. The method as claimed in claim 2, wherein the panoramic image collected by the robot camera includes 36 sub-images: 12 observation images at 30-degree intervals at each of three view angles (looking down, level, and looking up), each observation image corresponding to one image feature vector.
4. The method for generating an indoor environment robot navigation natural language instruction according to claim 1, wherein in step S2, the offset angle of the robot includes an action offset angle and a view-angle offset angle, the action offset angle is the offset between the robot's current position and its position at the previous time step, the view-angle offset angle is the offset relative to the center of each sub-image of the panoramic image observed by the robot, and the offset angle is expressed as:

γ = (sin θ, cos θ, sin φ, cos φ)

wherein γ is the offset angle feature, θ is the heading offset angle, and φ is the pitch offset angle.
5. The method according to claim 4, wherein in step S2, the action feature vector A is formed by splicing the image feature vector of the sub-image directly in front of the robot in the panoramic image with the dimension-expanded action offset angle vector, and the panoramic image feature vector E is formed by splicing the image feature vectors of all sub-images in the panoramic image with the dimension-expanded view-angle offset angle vectors.
6. The method for generating an indoor environment robot navigation natural language instruction according to claim 1, wherein in step S3, the output X after multi-head attention alignment and dimension-reduction calculation is expressed as:

X = softmax(QK^T / sqrt(d_k)) V

Q = AW_Q, K = EW_K, V = EW_V

wherein Q, K and V denote the query matrix, key matrix and value matrix of the attention mechanism after linear transformation, W_Q, W_K and W_V are the learnable weights that linearly transform the action feature vector A and the panoramic image feature vector E, and d_k is the dimension of K.
7. The method for generating natural language instructions for indoor environment robot navigation according to claim 1, wherein in step S4, when performing prediction with the Transformer, position encoding is added to reflect the different influence of each input position on the output in time order, and the output X from the multi-head attention alignment and dimension-reduction calculation is position-encoded as:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

wherein PE(pos, 2i) and PE(pos, 2i+1) are the position-encoding values in dimensions 2i and 2i+1 of the embedding, pos is the actual position of the element in the input sequence, and d_model is the dimension of the embedding.
8. The method as claimed in claim 7, wherein the Transformer comprises an encoder and a decoder, the encoder comprises stacked multi-head self-attention modules, feed-forward networks and residual connections, and each decoder module comprises cross-attention, self-attention, a feed-forward network and residual connections.
9. The method as claimed in claim 8, wherein in the decoder the ground-truth text is shifted one position to the right and diagonally masked so that the text input depends only on previous predictions, and the text is then converted into an embedded representation by a linear layer and cross-attended with the encoder output.
10. The method according to claim 8, wherein in step S5, after the auxiliary supervision task is added to the decoder output, the final loss function Loss is expressed as:

Loss = λL_1 + (1-λ)ωL_2

L_1 = -Σ_{p=1}^{P} log f_θ(ŷ_p | ŷ_1, …, ŷ_{p-1})

L_2 = (1/L) Σ_{l=1}^{L} (Z_l - Ẑ_l)²

Ẑ_l = P(I'_j)/k

wherein L_1 is the difference between predicted and true values obtained with the cross-entropy loss function, L_2 is the difference between predicted and true values obtained with the mean-square-error function, θ are the network parameters, f_θ(·) is the network's predicted probability, ŷ_p is the p-th true value of the output instruction, ŷ_1, …, ŷ_{p-1} are the 1st to (p-1)-th true values of the output instruction, Z_l is the predicted progress value the network outputs for the l-th word, L is the total number of words, P(I'_j) is the segment index of the sub-instruction I'_j containing the current word, k is the total number of sub-instructions, λ controls the relative weight of the two loss functions, ω unifies the magnitudes of the two loss functions, and Ẑ_l is the true progress value corresponding to the l-th word.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210224196.7A CN114812551B (en) | 2022-03-09 | 2022-03-09 | Indoor environment robot navigation natural language instruction generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114812551A true CN114812551A (en) | 2022-07-29 |
CN114812551B CN114812551B (en) | 2024-07-26 |
Family
ID=82529629
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115795278A (en) * | 2022-12-02 | 2023-03-14 | 广东元一科技实业有限公司 | Intelligent cloth paving machine control method and device and electronic equipment |
CN118015162A (en) * | 2024-04-10 | 2024-05-10 | 哈尔滨工业大学(威海) | Three-dimensional digital human head animation generation method based on phonetic prosody decomposition |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111814844A (en) * | 2020-03-17 | 2020-10-23 | 同济大学 | Dense video description method based on positional-encoding fusion |
US20210374358A1 (en) * | 2020-05-31 | 2021-12-02 | Salesforce.Com, Inc. | Systems and methods for composed variational natural language generation |
US20220059200A1 (en) * | 2020-08-21 | 2022-02-24 | Washington University | Deep-learning systems and methods for medical report generation and anomaly detection |
CN112200244A (en) * | 2020-10-09 | 2021-01-08 | 西安交通大学 | Intelligent aero-engine anomaly detection method based on hierarchical adversarial training |
CN112560438A (en) * | 2020-11-27 | 2021-03-26 | 同济大学 | Text generation method based on generative adversarial networks |
CN113268561A (en) * | 2021-04-25 | 2021-08-17 | 中国科学技术大学 | Question generation method based on multi-task joint training |
CN113537024A (en) * | 2021-07-08 | 2021-10-22 | 天津理工大学 | Weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism |
CN114091466A (en) * | 2021-10-13 | 2022-02-25 | 山东师范大学 | Multi-modal emotion analysis method and system based on Transformer and multi-task learning |
CN113988274A (en) * | 2021-11-11 | 2022-01-28 | 电子科技大学 | Intelligent text generation method based on deep learning |
CN114092774A (en) * | 2021-11-22 | 2022-02-25 | 沈阳工业大学 | RGB-T image saliency detection system and detection method based on information flow fusion |
Non-Patent Citations (3)
Title |
---|
MOTONARI KAMBARA et al.: "Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions", IEEE Robotics and Automation Letters, 24 August 2021 (2021-08-24), pages 8371, XP011876804, DOI: 10.1109/LRA.2021.3107026 * |
ZHUANG SHUNAN: "Research and Implementation of Text Normalization Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology, no. 8, 15 August 2020 (2020-08-15), pages 138-817 * |
LI XUEQING et al.: "A Survey of Natural Language Generation", Journal of Computer Applications, no. 5, 31 May 2021 (2021-05-31), pages 1227-1235 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115795278A (en) * | 2022-12-02 | 2023-03-14 | 广东元一科技实业有限公司 | Intelligent cloth paving machine control method and device and electronic equipment |
CN115795278B (en) * | 2022-12-02 | 2023-08-04 | 广东元一科技实业有限公司 | Intelligent cloth paving machine control method and device and electronic equipment |
CN118015162A (en) * | 2024-04-10 | 2024-05-10 | 哈尔滨工业大学(威海) | Three-dimensional digital human head animation generation method based on phonetic prosody decomposition |
Also Published As
Publication number | Publication date |
---|---|
CN114812551B (en) | 2024-07-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114812551B (en) | Indoor environment robot navigation natural language instruction generation method | |
CN111339281B (en) | Answer selection method for reading comprehension choice questions with multi-view fusion | |
CN109145974B (en) | Multilevel image feature fusion method based on image-text matching | |
CN111967272B (en) | Visual dialogue generating system based on semantic alignment | |
CN111967277A (en) | Translation method based on multi-modal machine translation model | |
CN113065496B (en) | Neural network machine translation model training method, machine translation method and device | |
CN117273150A (en) | Visual large language model method based on few sample learning | |
Cui et al. | Representation and correlation enhanced encoder-decoder framework for scene text recognition | |
CN117292146A (en) | Industrial scene-oriented method, system and application method for constructing multi-mode large language model | |
Park et al. | Vlaad: Vision and language assistant for autonomous driving | |
Tanaka et al. | Cross-modal transformer-based neural correction models for automatic speech recognition | |
Cui et al. | An end-to-end network for irregular printed Mongolian recognition | |
CN113010662B (en) | Hierarchical conversational machine reading understanding system and method | |
Yuan et al. | VRDriving: A virtual-to-real autonomous driving framework based on adversarial learning | |
CN110197521B (en) | Visual text embedding method based on semantic structure representation | |
Huang et al. | Knowledge distilled pre-training model for vision-language-navigation | |
CN117216536A (en) | Model training method, device and equipment and storage medium | |
CN115759262A (en) | Visual common sense reasoning method and system based on knowledge perception attention network | |
Zhang et al. | Video-Language Graph Convolutional Network for Human Action Recognition | |
Li et al. | LabanFormer: Multi-scale graph attention network and transformer with gated recurrent positional encoding for labanotation generation | |
Chen et al. | A novel detection method based on DETR for drone aerial images | |
Chen et al. | LCVO: An Efficient Pretraining-Free Framework for Visual Question Answering Grounding | |
CN117710688B (en) | Target tracking method and system based on convolution and attention combination feature extraction | |
Li et al. | Limited receptive field network for real-time driving scene semantic segmentation | |
Wu et al. | Prospective Role of Foundation Models in Advancing Autonomous Vehicles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||