CN113804200A - Visual language navigation system and method based on dynamic reinforced instruction attack module - Google Patents
- Publication number
- CN113804200A (application CN202111202568.8A)
- Authority
- CN
- China
- Prior art keywords
- instruction
- module
- dynamic
- word
- visual language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/20—Instruments for performing navigational calculations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a visual language navigation system and method based on a dynamic reinforced instruction attack module. The system comprises: a dynamic reinforced instruction attacker, which computes candidate replacement words for the input instruction, assigns attack scores to the corresponding target words, and generates a perturbed instruction; a visual language navigator with an encoder-decoder structure, which completes the navigation task according to the input instruction and image information; and an optimization learning module, which iteratively optimizes the navigator and the attacker through adversarial reinforcement learning and improves the multi-modal learning ability of the navigator in a self-supervised manner. The method has the advantage of improving the navigation robustness of the navigator in perturbed environments.
Description
Technical Field
The invention relates to the field of visual language navigation, and in particular to a visual language navigation system and method based on a dynamic reinforced instruction attack module.
Background
Visual navigation tasks based on natural language show great potential in real-world robotic applications and are attracting increasing interest. To navigate successfully, the navigator must extract key information from long instructions, such as visual objects, specific rooms, or navigation directions, according to dynamic visual observations, in order to guide navigation at each time step. However, due to the complexity and semantic ambiguity of natural language, it is difficult for a navigator to learn cross-modal alignment efficiently and capture the precise semantic intent of an instruction through limited training on manually annotated instruction-path data.
Previous work has mainly adopted data augmentation strategies to address data scarcity in navigation tasks. Ronghang Hu et al. proposed the Speaker-Follower framework to generate augmentation instructions for randomly sampled paths. However, generating a large number of whole instructions is costly and may fail to emphasize the most instructive information. Other work has focused on creating challenging augmentation paths and varied visual scenes, again by directly generating augmentation instructions with the Speaker-Follower model. As a result, the improvement in the navigator's ability to understand instructions remains limited.
Disclosure of Invention
To overcome the deficiencies of the prior art, the invention provides a visual language navigation system and method based on a dynamic reinforced instruction attack module.
To achieve the above object, the invention provides a dynamic reinforcement-learning instruction attack system applied to visual language navigation tasks, comprising: a dynamic reinforced instruction attacker, a visual language navigation module, and an optimization learning module;
at the starting point, the dynamic reinforced instruction attacker acquires a piece of text, namely an instruction describing the trajectory step by step. At each time t, by considering the importance of the words in the current instruction and the substitution impact of different candidate words, the dynamic reinforced instruction attacker computes an action prediction probability, also called the attack score, replaces the target word in the instruction that has the maximum probability, and finally outputs the attacked instruction, i.e., the instruction with perturbation information.
The visual language navigation module, based on a navigator with an encoder-decoder structure, receives at each time t the perturbed instruction output by the dynamic reinforced instruction attacker together with the panoramic image input at the current moment, and completes navigation according to this information.
The optimization learning module optimizes the dynamic reinforced instruction attacker and the visual language navigation module through adversarial reinforcement learning and self-supervised learning.
Further, the dynamic reinforced instruction attacker comprises a candidate replacement word generation module, an attack score prediction module, and a perturbation generation module;
the candidate replacement word generation module constructs the target word set of each instruction by performing string matching between the instruction and an instruction vocabulary. The instruction vocabulary contains words indicating visual objects or positions, collected from the given instruction vocabulary of the dataset; a candidate replacement word set is constructed for each target word by selecting the remaining target words in the same instruction.
The attack score prediction module evaluates the target words at each time t and generates the corresponding attack scores. The attack scores are computed from the panoramic image input and the current position at each time t, and are dynamically updated as navigation proceeds.
The perturbation generation module replaces the target word in the instruction according to the attack score computed by the attack score prediction module, and then generates the instruction with perturbation information.
Further, the optimization learning module comprises an adversarial reinforcement learning module and a self-supervised learning module;
the adversarial reinforcement learning module iteratively optimizes the dynamic reinforced instruction attacker and the visual language navigation module in an adversarial reinforcement learning manner.
The self-supervised learning module predicts which target words were actually attacked, improving the cross-modal information understanding ability of the visual language navigation module.
Further, a visual language navigation method based on the dynamic reinforced instruction attack module comprises the following steps:
step S1, the dynamic reinforced instruction attacker receives an instruction describing the trajectory step by step, and the candidate replacement word generation module constructs a candidate replacement word set for the input instruction;
step S2, the attack score prediction module calculates the attack score of each corresponding replaceable word;
step S3, according to the attack scores calculated in step S2, the perturbation generation module replaces the corresponding words to generate an instruction with perturbation information;
step S4, the visual language navigation module navigates according to the instruction with perturbation information;
and step S5, the adversarial reinforcement learning module optimizes the dynamic reinforced instruction attacker and the visual language navigation module respectively, while self-supervised learning assists the optimization of the visual language navigation module.
Further, the step S2 includes the following sub-steps:
step S200, calculating word importance vectorWhereinAndrespectively representing the word features obtained by computing the target word with BiLSTM and the visual features obtained with the attention-based mechanism.Andare learnable linear-variation parameters that convert different features into the same linear space. DwRespectively, the length of the characteristic dimension of the word characteristic, DvLength of a feature dimension that is a visual feature, and DpThe number of probabilities that are the final output;
step S201, calculating substitution influence matrixes of different candidate words on each target wordWhereinAndrespectively represent the target word wjAnd candidate word w'jThe characteristics of the words of (a) are,is a learnable linear transformation;
step S202, calculating the final attack scoreWhereinRepresents the word importance vector betatAnd the substitution influence matrix gammat,jThe corresponding elements in the t-th row of (a) are multiplied one by one, atRepresenting a candidate action set having a size of L' xK.
Further, the step S5 includes the following sub-steps:
s500, designing an anti-reinforcement learning mode to respectively optimize parameters of the dynamic reinforcement instruction attacker (10) and the visual language navigation module (11), and expressing the parameters asWherein pi and eta represent the strategies of the dynamic strengthening instruction attacker (10) and the visual language navigation module (11), respectively, and rηRepresenting a reward function in reinforcement learning;
s501, assisting the optimization of the visual language navigation module (11) by using an auxiliary self-supervision task, and expressing as Where c is the set of target words for a given instruction I, Pc(c) The probability of the prediction is represented by,representing the target word characteristics, L' is the target word set size,representing visual and instruction-aware hidden state features of a decoder in a navigator,andrepresenting a learnable linear transformation.
Compared with the prior art, the invention has the following advantages:
1. A robust navigator is trained by adversarially attacking the language instructions of the navigation task; unlike attacks in prior natural language tasks, which are generally static, the adversarial attack here changes dynamically with the navigation process.
2. By formulating the perturbation generation process as a Markov decision process, the dynamic reinforced instruction attacker can be trained with reinforcement learning to generate effective perturbations, without requiring a classification-based attack target.
3. The invention uses a distinct adversarial training strategy and an auxiliary self-supervised reasoning task to improve the cross-modal understanding ability of the navigator.
4. The method can improve the robustness and accuracy of existing models on the visual language navigation task.
Drawings
FIG. 1 is a system architecture diagram of the present invention;
FIG. 2 is a flow chart of the steps of the present invention;
FIG. 3 is a diagram of an exemplary candidate alternative generation module according to an embodiment of the present invention;
Detailed Description
Other advantages and capabilities of the present invention will be readily apparent to those skilled in the art from the present disclosure by describing the embodiments of the present invention with specific embodiments thereof in conjunction with the accompanying drawings. The invention is capable of other and different embodiments and its several details are capable of modification in various other respects, all without departing from the spirit and scope of the present invention.
As shown in FIG. 1, the invention is applied to a visual language navigation system based on a dynamic reinforced instruction attack module, and comprises a dynamic reinforced instruction attacker 10, a visual language navigation module 11, and an optimization learning module 12;
at the beginning, the dynamic reinforced instruction attacker 10 obtains a piece of text, namely an instruction describing the trajectory step by step. At each time t, the dynamic reinforced instruction attacker 10 computes an action prediction probability, also referred to as the attack score, by considering the importance of the words in the current instruction and the substitution impact of different candidate words.
In an embodiment of the present invention, specifically, the dynamic reinforced instruction attacker 10 comprises a candidate replacement word generation module 100, an attack score prediction module 101, and a perturbation generation module 102;
for each instruction, the candidate replacement word generation module 100 first constructs its target word set by performing string matching between the instruction and the instruction vocabulary. The instruction vocabulary contains words indicating visual objects or positions, collected from the given instruction vocabulary of the dataset; a candidate replacement word set is constructed for each target word by selecting the remaining target words in the same instruction.
The attack score prediction module 101 evaluates the target words at each time t and generates the corresponding attack scores. The attack scores are computed from the panoramic image input and the current position at each time t, and are dynamically updated as navigation proceeds.
The perturbation generation module 102 replaces the target word in the instruction according to the attack score calculated by the attack score prediction module 101, and then generates the instruction with perturbation information.
The visual language navigation module 11, based on a navigator with an encoder-decoder structure, receives at each time t the perturbed instruction output by the dynamic reinforced instruction attacker 10 together with the panoramic image input at the current moment, and completes navigation according to this information.
The optimization learning module 12 optimizes the dynamic reinforced instruction attacker 10 and the visual language navigation module 11 through adversarial reinforcement learning and self-supervised learning.
In a specific embodiment of the present invention, the optimization learning module 12 comprises an adversarial reinforcement learning module 120 and a self-supervised learning module 121;
the adversarial reinforcement learning module 120 iteratively optimizes the dynamic reinforced instruction attacker 10 and the visual language navigation module 11 in an adversarial reinforcement learning manner.
The self-supervised learning module 121 predicts which target words were actually attacked, improving the cross-modal information understanding ability of the visual language navigation module 11.
FIG. 2 is a flowchart illustrating the steps of the visual language navigation method according to the present invention. The method comprises the following steps:
in step S1, the candidate word generation module 100 constructs a candidate substitute word set for the input instruction. In particular, for target word w in instruction Ij(j is more than or equal to 0 and less than or equal to L '), and L' is the size of the target word set. We denote the set of candidate surrogate words asWhere K is the size of the set of candidate replacement words. To facilitate understanding of a given instruction and maintain a reasonable set size, we select the remaining target words in the same instruction to construct a set of candidate replacement words for a particular target word. The construction details of the target word set and the candidate replacement word set for the visual language navigation task are shown in fig. 3, respectively.
In step S2, the attack score prediction module 101 calculates the attack score of each alternative word.
Specifically, step S2 further includes:
step S200, calculating word importance vectorWhereinAndrespectively representing the word features obtained by computing the target word with BiLSTM and the visual features obtained with the attention-based mechanism.Andare learnable linear-variation parameters that convert different features into the same linear space. . DwRespectively, the length of the characteristic dimension of the word characteristic, DvLength of a feature dimension that is a visual feature, and DpIs the number of probabilities of the final output.
A real number matrix with the size shown, step S201, calculating the substitution influence matrix of different candidate words on each target wordWhereinAndrespectively represent the target word wjAnd candidate word w'jThe characteristics of the words of (a) are,is a learnable linear transformation.
At step 202, a final attack score is calculated. After calculating the alternative influence of different candidate words of all target words in the instruction in S201, the attack score can be obtainedWhereinRepresents the word importance vector betatAnd the substitution influence matrix gammat,jThe corresponding elements in the t-th row of (a) are multiplied one by one, atRepresenting a candidate action set having a size of L' xK.
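Numerically, the score combination of steps S200 to S202 amounts to an element-wise product followed by a softmax over all L' x K candidate actions. The sketch below assumes this shape and softmax placement; the learned features that produce `beta` and `gamma` are replaced by fixed toy values.

```python
import math

# Minimal numeric sketch of the attack score a_t = softmax(beta ⊙ gamma):
# each word-importance score multiplies its row of substitution-impact
# scores, and the result is normalised over the whole L' x K action set.

def attack_scores(beta, gamma):
    """beta: list of L' importances; gamma: L' rows of K impact scores."""
    scores = [[b * g for g in row] for b, row in zip(beta, gamma)]
    flat = [s for row in scores for s in row]      # candidate action set, L'*K
    m = max(flat)
    exps = [math.exp(s - m) for s in flat]         # numerically stable softmax
    z = sum(exps)
    k = len(gamma[0])
    probs = [e / z for e in exps]
    return [probs[i * k:(i + 1) * k] for i in range(len(beta))]

a_t = attack_scores([0.2, 0.5, 0.3], [[1.0, 1.0]] * 3)  # L' = 3, K = 2
```

With uniform impact scores, the most important word (importance 0.5) receives the highest attack probability, matching the intuition that the attacker targets the most instructive word.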
In step S3, the perturbation generation module 102 replaces the corresponding words according to the attack scores calculated in step S2, and generates the instruction with perturbation information.
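The replacement step can be sketched as picking the highest-scoring (target word, candidate) pair and substituting it into the instruction. Function and variable names here are illustrative assumptions, not the patent's identifiers.

```python
# Hedged sketch of the perturbation step (module 102): select the
# (target, candidate) pair with the maximum attack score and substitute
# that target word in the tokenised instruction.

def perturb(tokens, targets, candidates, scores):
    """scores[j][k]: attack score of replacing targets[j] with candidates[j][k]."""
    j, k = max(
        ((j, k) for j in range(len(targets)) for k in range(len(candidates[j]))),
        key=lambda jk: scores[jk[0]][jk[1]],
    )
    return [candidates[j][k] if tok == targets[j] else tok for tok in tokens]

tokens = "stop at the kitchen door".split()
perturbed = perturb(tokens, ["kitchen", "door"],
                    [["door"], ["kitchen"]],
                    [[0.9], [0.1]])
```

Here the pair ("kitchen" -> "door") has the maximum score 0.9, so the perturbed instruction misleads the navigator toward the wrong landmark while remaining fluent.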
In step S4, the visual language navigation module 11 navigates according to the instruction with perturbation information. The visual language navigation module 11 is attacked at each time t, and then computes its next decision according to the attacked instruction and the current panoramic image information.
In step S5, the adversarial reinforcement learning module 120 optimizes the dynamic reinforced instruction attacker 10 and the visual language navigation module 11, and self-supervised learning assists the optimization of the visual language navigation module 11.
Specifically, step S5 further includes:
s500, designing an anti-reinforcement learning mode to respectively optimize parameters of the dynamic reinforcement instruction attacker 10 and the visual language navigation module 11, and expressing the parameters asWhere π and η represent the strategy of the attacker and navigator, respectively, and rηRepresenting a reward function in reinforcement learning. We divide the training into two phases, in the first phase we train the navigator in advance and use the pre-trained navigator to train the attacker. In the second stage, we perform alternative iterative process between the navigator and the attacker to realize joint optimizationThe A2C algorithm is used to train the RL strategy for both the attacker and the navigator.
S501, assisting the optimization of the visual language navigation module 11 with the auxiliary self-supervised task, expressed as $\mathcal{L}_{c} = -\sum_{j=1}^{L'} \log P_c(c_j)$ with $P_c(c_j) \propto \exp\big((W_a f^c_j)^{\top} W_b \tilde{h}_t\big)$, where $c$ is the target word set of a given instruction $I$, $P_c(\cdot)$ represents the predicted probability, $f^c_j$ represents the target word features, $L'$ is the target word set size, $\tilde{h}_t$ represents the visual- and instruction-aware hidden state features of the decoder in the navigator, and $W_a$ and $W_b$ represent learnable linear transformations. The prediction is optimized with a cross-entropy loss supervised by the actually attacked words. In this way, the navigator learns a better ability to align cross-modal information and acquires the ability to self-correct perturbed instructions.
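The auxiliary objective reduces to a softmax cross-entropy over the L' target words: predict which one was actually attacked. The sketch below assumes the logits have already been produced by the learnable projections; only the loss itself is shown.

```python
import math

# Numeric sketch of the self-supervised loss of S501: cross-entropy of the
# predicted "which word was attacked" distribution against the true index.

def self_supervised_loss(logits, attacked_index):
    """logits: one score per target word; attacked_index: true attacked word."""
    m = max(logits)                              # numerically stable softmax
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    prob = exps[attacked_index] / z
    return -math.log(prob)

# Word 0 has the largest logit, so predicting word 0 as attacked is cheap.
loss = self_supervised_loss([2.0, 0.1, -1.0], attacked_index=0)
```

Minimizing this loss pushes the navigator's decoder state to encode which instruction word was perturbed, which is the self-correction ability described above.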
FIG. 3 is a diagram of an exemplary candidate replacement word generation module according to an embodiment of the present invention. The target word set is constructed for each instruction by string matching between the instruction and the instruction vocabulary, which contains only words indicating visual objects and positions. The candidate replacement word set for each target word is constructed by collecting the remaining target words in the same instruction.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.
Claims (6)
1. A visual language navigation system based on a dynamic strengthening instruction attack module is characterized by comprising: the system comprises a dynamic strengthening instruction attacker (10), a visual language navigation module (11) and an optimization learning module (12);
the dynamic strengthening instruction attackers (10) acquire a section of text at the starting point, wherein the text is an instruction for describing the track step by step. At each time t, by considering the importance of words in the current instruction and the replacement influence of different candidate words, the dynamic strengthening instruction attacker (10) calculates the action prediction probability, also called the attack score, carries out corresponding replacement on the target words in the instruction according to the maximum probability, and finally outputs the attacked instruction, namely the instruction with disturbance information.
The visual language navigation module (11) receives the instruction with disturbance information output by the dynamic strengthening instruction attacker (10) and the panoramic image input at the current moment at different moments t based on the navigator with the encoder-decoder structure, and completes navigation according to the information.
The optimization learning module (12) optimizes the dynamic reinforcement instruction attacker (10) and the visual language navigation module (11) by adopting an anti-reinforcement learning mode and an automatic supervision learning mode.
2. The visual language navigation system based on the dynamic reinforced instruction attack module as claimed in claim 1, wherein the dynamic reinforced instruction attacker (10) comprises a candidate replacement word generation module (100), an attack score prediction module (101) and a perturbation generation module (102);
the candidate replacement word generation module (100) first constructs, for each instruction, its target word set by performing string matching between the instruction and an instruction vocabulary; the instruction vocabulary contains words indicating visual objects or positions, collected from the given instruction vocabulary of the dataset; a candidate replacement word set is constructed for each target word by selecting the remaining target words in the same instruction;
the attack score prediction module (101) evaluates the target words at each time t and generates the corresponding attack scores; the attack scores are computed from the panoramic image input and the current position at each time t, and are dynamically updated as navigation proceeds;
the perturbation generation module (102) replaces the target word in the instruction according to the attack score calculated by the attack score prediction module (101), and then generates the instruction with perturbation information.
3. The visual language navigation system based on the dynamic reinforced instruction attack module of claim 1, wherein the optimization learning module (12) comprises an adversarial reinforcement learning module (120) and a self-supervised learning module (121);
the adversarial reinforcement learning module (120) iteratively optimizes the dynamic reinforced instruction attacker (10) and the visual language navigation module (11) in an adversarial reinforcement learning manner;
the self-supervised learning module (121) predicts which target words were actually attacked, improving the cross-modal information understanding ability of the visual language navigation module (11).
4. A visual language navigation method using the system of claim 1, comprising the steps of:
step S1, the dynamic reinforced instruction attacker (10) receives an instruction describing the trajectory step by step, and the candidate replacement word generation module (100) constructs a candidate replacement word set for the input instruction;
step S2, the attack score prediction module (101) calculates the attack score of each corresponding replaceable word;
step S3, according to the attack scores calculated in step S2, the perturbation generation module (102) replaces the corresponding words to generate an instruction with perturbation information;
step S4, the visual language navigation module (11) navigates according to the instruction with perturbation information;
and step S5, the adversarial reinforcement learning module (120) optimizes the dynamic reinforced instruction attacker (10) and the visual language navigation module (11) respectively, while self-supervised learning assists the optimization of the visual language navigation module (11).
5. The visual language navigation method as claimed in claim 4, wherein the step S2 comprises the following sub-steps:
step S200, calculating the word importance vector $\beta_t = \mathrm{softmax}\big(W_p\,[W_w F^w;\, W_v f^v_t]\big)$, where $F^w$ and $f^v_t$ respectively denote the word features obtained by encoding the target words with a BiLSTM and the visual features obtained with the attention mechanism; $W_w$ and $W_v$ are learnable linear transformation parameters that map the different features into the same linear space; $D_w$ is the feature dimension of the word features, $D_v$ the feature dimension of the visual features, and $D_p$ the number of probabilities of the final output;
step S201, calculating the substitution impact matrix of the different candidate words on each target word, a real-valued matrix of size $L' \times K$ with entries $\gamma_{t,j,k} = \big(W_s f^w_{w_j}\big)^{\top} f^w_{w'_{j,k}}$, where $f^w_{w_j}$ and $f^w_{w'_{j,k}}$ respectively denote the word features of the target word $w_j$ and the candidate word $w'_{j,k}$, and $W_s$ is a learnable linear transformation;
step S202, calculating the final attack score $a_t = \mathrm{softmax}\big(\beta_t \odot \gamma_t\big)$, where $\odot$ denotes multiplying the word importance vector $\beta_t$ element-wise with the corresponding elements of each row of the substitution impact matrix $\gamma_t$; $a_t$ represents the candidate action set, of size $L' \times K$.
6. The visual language navigation method applied to the dynamic augmentation instruction attack module as claimed in claim 4, wherein the step S5 comprises the following sub-steps:
step S500, designing an adversarial reinforcement learning scheme to optimize the parameters of the dynamic reinforced instruction attacker (10) and of the visual language navigation module (11) respectively, where π and η denote the policies of the dynamic reinforced instruction attacker (10) and of the visual language navigation module (11), and rη denotes the reward function in reinforcement learning;
and step S501, assisting the optimization of the visual language navigation module (11) with an auxiliary self-supervised task, where c is the set of target words of a given instruction I, Pc(c) denotes the predicted probability of a target word, L' is the size of the target-word set, and the prediction combines the target-word features with the visual- and instruction-aware hidden-state features of the decoder in the navigator through learnable linear transformations.
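The expressions referenced in steps S500 and S501 appear only as images in the published record. A plausible reconstruction, consistent with the symbols π, η, rη, c, Pc, I and L' defined in the surrounding text, is the following; the symbols h, e, W1, W2, st and at are assumed names introduced for the sketch:

```latex
% Adversarial objective of S500 (assumed form): the attacker policy \pi
% minimizes the return that the navigator policy \eta maximizes.
\min_{\pi}\ \max_{\eta}\ \mathbb{E}_{\pi,\eta}\Big[\textstyle\sum_{t} r_{\eta}(s_t, a_t)\Big]

% Auxiliary self-supervised loss of S501 (assumed form): predict each
% target word c_l of instruction I from the decoder hidden state h,
% with e_{c_l} the target-word feature and W_1, W_2 learnable maps.
\mathcal{L}_{\mathrm{self}} = -\frac{1}{L'} \sum_{l=1}^{L'} \log P_c(c_l),
\qquad
P_c(c_l) = \operatorname{softmax}_{l}\!\big( (W_1 h)^{\top} W_2\, e_{c_l} \big)
```

Under this reading, the attacker and navigator play a zero-sum game over rη, while the cross-entropy term over the L' target words supplies the self-supervised signal that stabilizes the navigator's training.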
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110388939 | 2021-04-12 | ||
CN202110388939X | 2021-04-12 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113804200A true CN113804200A (en) | 2021-12-17 |
CN113804200B CN113804200B (en) | 2023-12-29 |
Family
ID=78937771
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111202568.8A Active CN113804200B (en) | 2021-04-12 | 2021-10-15 | Visual language navigation system and method based on dynamic enhanced instruction attack module |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113804200B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108229682A (en) * | 2018-02-07 | 2018-06-29 | 深圳市唯特视科技有限公司 | An image-detection countermeasure method based on back-propagation attack |
CN111209370A (en) * | 2019-12-27 | 2020-05-29 | 同济大学 | Text classification method based on neural network interpretability |
US20200285952A1 (en) * | 2019-03-08 | 2020-09-10 | International Business Machines Corporation | Quantifying Vulnerabilities of Deep Learning Computing Systems to Adversarial Perturbations |
CN112380357A (en) * | 2020-12-09 | 2021-02-19 | 武汉烽火众智数字技术有限责任公司 | Method for realizing interactive navigation of knowledge graph visualization |
CN112529295A (en) * | 2020-12-09 | 2021-03-19 | 西湖大学 | Self-supervision visual language navigator based on progress prediction and path shortening method |
US20210089891A1 (en) * | 2019-09-24 | 2021-03-25 | Hrl Laboratories, Llc | Deep reinforcement learning based method for surreptitiously generating signals to fool a recurrent neural network |
WO2021058090A1 (en) * | 2019-09-24 | 2021-04-01 | Toyota Motor Europe | System and method for navigating a vehicle using language instructions |
CN112633309A (en) * | 2019-09-24 | 2021-04-09 | 罗伯特·博世有限公司 | Efficient query black box anti-attack method based on Bayesian optimization |
Non-Patent Citations (2)
Title |
---|
NI Congyun; HUANG Hua: "Research on the Composition and Key Technologies of Cognitive Electronic Warfare Systems", Shipboard Electronic Countermeasure, no. 03 *
ZHANG Jianan; WANG Yixiang; LIU Bo; CHANG Xiaolin: "A Survey of Adversarial Attack Methods in Deep Learning", Cyberspace Security, no. 07 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115082915A (en) * | 2022-05-27 | 2022-09-20 | 华南理工大学 | Mobile robot vision-language navigation method based on multi-modal characteristics |
CN115082915B (en) * | 2022-05-27 | 2024-03-29 | 华南理工大学 | Multi-modal feature-based mobile robot vision-language navigation method |
Also Published As
Publication number | Publication date |
---|---|
CN113804200B (en) | 2023-12-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Fried et al. | Speaker-follower models for vision-and-language navigation | |
CN110765966B (en) | One-stage automatic recognition and translation method for handwritten characters | |
EP3516595B1 (en) | Training action selection neural networks | |
CN108804611B (en) | Dialog reply generation method and system based on self comment sequence learning | |
CN110770759A (en) | Neural network system | |
CN115618045B (en) | Visual question answering method, device and storage medium | |
US10572603B2 (en) | Sequence transduction neural networks | |
RU2712101C2 (en) | Prediction of probability of occurrence of line using sequence of vectors | |
CN117121015A (en) | Multimodal, less-hair learning using frozen language models | |
CN115293139A (en) | Training method of voice transcription text error correction model and computer equipment | |
CN112527993A (en) | Cross-media hierarchical deep video question-answer reasoning framework | |
CN115293138A (en) | Text error correction method and computer equipment | |
CN113804200A (en) | Visual language navigation system and method based on dynamic reinforced instruction attack module | |
Medina et al. | Towards interactive physical robotic assistance: Parameterizing motion primitives through natural language | |
Rohmatillah et al. | Causal Confusion Reduction for Robust Multi-Domain Dialogue Policy. | |
Luo et al. | Robust-EQA: robust learning for embodied question answering with noisy labels | |
CN114528387A (en) | Deep learning conversation strategy model construction method and system based on conversation flow bootstrap | |
CN111832699A (en) | Computationally efficient expressive output layer for neural networks | |
CN116842955A (en) | Medical entity relation method based on multi-feature extraction | |
CN116982054A (en) | Sequence-to-sequence neural network system using look-ahead tree search | |
CN112307769A (en) | Natural language model generation method and computer equipment | |
CN115291888A (en) | Software community warehouse mining method and device based on self-attention interactive network | |
KR20240057422A (en) | Control interactive agents using multi-mode input | |
CN115512214A (en) | Indoor visual navigation method based on causal attention | |
US20230306202A1 (en) | Language processing apparatus, learning apparatus, language processing method, learning method and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||