CN113804200A - Visual language navigation system and method based on dynamic reinforced instruction attack module - Google Patents

Visual language navigation system and method based on dynamic reinforced instruction attack module

Info

Publication number
CN113804200A
CN113804200A
Authority
CN
China
Prior art keywords
instruction
module
dynamic
word
visual language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111202568.8A
Other languages
Chinese (zh)
Other versions
CN113804200B (en)
Inventor
梁小丹
龙衍鑫
林冰倩
宋伟
朱世强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Sun Yat Sen University
Original Assignee
Zhejiang Lab
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab, Sun Yat Sen University filed Critical Zhejiang Lab
Publication of CN113804200A
Application granted
Publication of CN113804200B
Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01C: MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20: Instruments for performing navigational calculations
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G06F16/90344: Query processing by using string matching techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Automation & Control Theory (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a visual language navigation system and method based on a dynamic reinforced instruction attack module. The system comprises: a dynamic reinforced instruction attacker, which computes candidate substitute words for the input instruction, assigns attack scores to the corresponding target words, and then generates a perturbed instruction; a visual language navigator based on an encoder-decoder structure, which completes the navigation task according to the input instruction and image information; and an optimization learning module, which iteratively optimizes the navigator and the attacker through adversarial reinforcement learning and improves the multi-modal learning capability of the navigator in a self-supervised manner. The method improves the robustness of the navigator when navigating in the environment.

Description

Visual language navigation system and method based on dynamic reinforced instruction attack module
Technical Field
The invention relates to the field of visual language navigation, and in particular to a visual language navigation system and method based on a dynamic reinforced instruction attack module.
Background
Visual navigation tasks based on natural language show great potential in real-world robotic applications and are attracting increasing interest. To navigate successfully, the navigator needs to extract key information from long instructions, such as visual objects, specific rooms, or navigation directions, according to dynamic visual observations, so as to guide navigation at each time step. However, due to the complexity and semantic ambiguity of natural language, it is difficult for a navigator to efficiently learn cross-modal alignment and capture accurate semantic intent from an instruction through limited training on manually annotated instruction-path data.
Previous work has mainly adopted data augmentation strategies to address data scarcity in navigation tasks. Ronghang Hu et al. propose a Speaker-Follower framework to generate augmented instructions for randomly sampled paths. However, generating a large number of whole instructions is costly and may not emphasize the most instructive information. Other work has focused on creating challenging augmented paths and varied visual scenes, again by directly generating augmented instructions with the Speaker-Follower model. The improvement in the navigator's ability to understand instructions therefore remains limited.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a visual language navigation system and method based on a dynamic reinforced instruction attack module.
To achieve the above object, the invention provides a dynamically reinforced instruction attack system applied to a visual language navigation task, comprising: a dynamic reinforced instruction attacker, a visual language navigation module, and an optimization learning module;
at the starting point, the dynamic reinforced instruction attacker acquires a piece of text, which is an instruction describing the trajectory step by step. At each time t, by considering the importance of the words in the current instruction and the substitution influence of the different candidate words, the dynamic reinforced instruction attacker computes an action prediction probability, also called the attack score, replaces the target word in the instruction that has the maximum probability accordingly, and finally outputs the attacked instruction, i.e., the instruction carrying perturbation information.
The visual language navigation module, based on a navigator with an encoder-decoder structure, receives at each time t the perturbed instruction output by the dynamic reinforced instruction attacker together with the panoramic image input at the current moment, and completes navigation according to this information.
The optimization learning module optimizes the dynamic reinforced instruction attacker and the visual language navigation module through adversarial reinforcement learning and self-supervised learning.
Further, the dynamic reinforced instruction attacker comprises a candidate substitute word generation module, an attack score prediction module, and a perturbation generation module;
the candidate substitute word generation module constructs a target word set for each instruction by performing string matching between the instruction and an instruction vocabulary. The instruction vocabulary contains words indicating visual objects or positions, collected from the given instruction vocabulary of the dataset; a candidate substitute word set is constructed for each target word by selecting the remaining target words in the same instruction.
The attack score prediction module computes, at each time t, an attack score for each target word. The score is computed from the panoramic image input at the current position at each time t and is dynamically updated as navigation proceeds.
The perturbation generation module replaces the corresponding target word in the instruction according to the attack score computed by the attack score prediction module, and then generates the instruction with perturbation information.
Further, the optimization learning module comprises an adversarial reinforcement learning module and a self-supervised learning module;
the adversarial reinforcement learning module iteratively optimizes the dynamic reinforced instruction attacker and the visual language navigation module, respectively, through adversarial reinforcement learning.
The self-supervised learning module predicts, in a self-supervised manner, which target words were actually attacked, improving the cross-modal information understanding capability of the visual language navigation module.
Further, a visual language navigation method based on the dynamic reinforced instruction attack module comprises the following steps:
step S1, the dynamic reinforced instruction attacker receives an instruction describing the trajectory step by step, and the candidate substitute word generation module constructs a candidate substitute word set for the input instruction;
step S2, the attack score prediction module computes the attack score of each candidate substitute word;
step S3, according to the attack scores computed in step S2, the perturbation generation module replaces the corresponding words to generate an instruction with perturbation information;
step S4, the visual language navigation module navigates according to the instruction with perturbation information;
step S5, the adversarial reinforcement learning module optimizes the dynamic reinforced instruction attacker and the visual language navigation module, respectively, while self-supervised learning assists the optimization of the visual language navigation module.
Further, the step S2 includes the following sub-steps:
step S200, calculating the word importance vector β_t from the word features, obtained by encoding the target words with a BiLSTM, and the visual features, obtained with an attention mechanism; learnable linear transformation parameters project the different features into a common linear space, where D_w is the feature dimension of the word features, D_v is the feature dimension of the visual features, and D_p is the number of output probabilities;
step S201, calculating the substitution influence matrix γ_{t,j} of the different candidate words on each target word from the word features of the target word w_j and of the candidate word w'_j, combined through a learnable linear transformation;
step S202, calculating the final attack score by element-wise multiplication of the word importance vector β_t with the corresponding row of the substitution influence matrix γ_{t,j}; the result a_t represents the candidate action set, of size L'×K.
Further, the step S5 includes the following sub-steps:
s500, designing an anti-reinforcement learning mode to respectively optimize parameters of the dynamic reinforcement instruction attacker (10) and the visual language navigation module (11), and expressing the parameters as
Figure BDA0003305532170000037
Wherein pi and eta represent the strategies of the dynamic strengthening instruction attacker (10) and the visual language navigation module (11), respectively, and rηRepresenting a reward function in reinforcement learning;
s501, assisting the optimization of the visual language navigation module (11) by using an auxiliary self-supervision task, and expressing as
Figure BDA0003305532170000038
Figure BDA0003305532170000039
Where c is the set of target words for a given instruction I, Pc(c) The probability of the prediction is represented by,
Figure BDA00033055321700000310
representing the target word characteristics, L' is the target word set size,
Figure BDA00033055321700000311
representing visual and instruction-aware hidden state features of a decoder in a navigator,
Figure BDA00033055321700000312
and
Figure BDA00033055321700000313
representing a learnable linear transformation.
Compared with the prior art, the invention has the following advantages:
1. A robust navigator is trained by applying adversarial attacks to the language instructions of the navigation task; unlike attacks on previous natural language tasks, which are generally static, the adversarial attack here changes dynamically as navigation proceeds.
2. By formulating perturbation generation as a Markov decision process, the dynamic reinforced instruction attacker can be trained with reinforcement learning to generate effective perturbations, without relying on a classification-based objective.
3. The invention uses an alternating adversarial training strategy and an auxiliary self-supervised reasoning task to improve the cross-modal understanding capability of the navigator.
4. The method improves the robustness and accuracy of existing models on the visual language navigation task.
Drawings
FIG. 1 is a system architecture diagram of the present invention;
FIG. 2 is a flow chart of the steps of the present invention;
FIG. 3 is a diagram of an exemplary candidate substitute word generation module according to an embodiment of the present invention.
Detailed Description
Other advantages and capabilities of the present invention will be readily apparent to those skilled in the art from the present disclosure by describing the embodiments of the present invention with specific embodiments thereof in conjunction with the accompanying drawings. The invention is capable of other and different embodiments and its several details are capable of modification in various other respects, all without departing from the spirit and scope of the present invention.
As shown in fig. 1, the visual language navigation system based on the dynamic reinforced instruction attack module comprises a dynamic reinforced instruction attacker 10, a visual language navigation module 11, and an optimization learning module 12;
at the starting point, the dynamic reinforced instruction attacker 10 obtains a piece of text, which is an instruction describing the trajectory step by step. At each time t, the dynamic reinforced instruction attacker 10 computes an action prediction probability, also referred to as the attack score, by considering the importance of the words in the current instruction and the substitution influence of the different candidate words.
In an embodiment of the present invention, the dynamic reinforced instruction attacker 10 specifically comprises a candidate substitute word generation module 100, an attack score prediction module 101, and a perturbation generation module 102;
the candidate substitute word generation module 100 first constructs, for each instruction, its target word set by performing string matching between the instruction and the instruction vocabulary. The instruction vocabulary contains words indicating visual objects or positions, collected from the given instruction vocabulary of the dataset; a candidate substitute word set is constructed for each target word by selecting the remaining target words in the same instruction.
The attack score prediction module 101 computes, at each time t, an attack score for each target word. The score is computed from the panoramic image input at the current position and is dynamically updated as navigation proceeds.
The perturbation generation module 102 replaces the corresponding target word in the instruction according to the attack score computed by the attack score prediction module 101, and then generates the instruction with perturbation information.
The visual language navigation module 11, based on a navigator with an encoder-decoder structure, receives at each time t the perturbed instruction output by the dynamic reinforced instruction attacker 10 together with the panoramic image input at the current moment, and completes navigation according to this information.
The optimization learning module 12 optimizes the dynamic reinforced instruction attacker 10 and the visual language navigation module 11 through adversarial reinforcement learning and self-supervised learning.
In a specific embodiment of the present invention, the optimization learning module 12 comprises an adversarial reinforcement learning module 120 and a self-supervised learning module 121;
the adversarial reinforcement learning module 120 iteratively optimizes the dynamic reinforced instruction attacker 10 and the visual language navigation module 11, respectively, through adversarial reinforcement learning.
The self-supervised learning module 121 predicts, in a self-supervised manner, which target words were actually attacked, improving the cross-modal information understanding capability of the visual language navigation module 11.
FIG. 2 is a flowchart illustrating the steps of the visual language navigation method according to the present invention. The method comprises the following steps:
in step S1, the candidate word generation module 100 constructs a candidate substitute word set for the input instruction. In particular, for target word w in instruction Ij(j is more than or equal to 0 and less than or equal to L '), and L' is the size of the target word set. We denote the set of candidate surrogate words as
Figure BDA0003305532170000041
Where K is the size of the set of candidate replacement words. To facilitate understanding of a given instruction and maintain a reasonable set size, we select the remaining target words in the same instruction to construct a set of candidate replacement words for a particular target word. The construction details of the target word set and the candidate replacement word set for the visual language navigation task are shown in fig. 3, respectively.
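As a minimal illustration of the construction just described, the target word set and the candidate substitute word sets can be sketched as follows; the vocabulary below is a made-up example, not the dataset's actual instruction vocabulary:

```python
# Hypothetical example vocabulary of visual objects and locations (assumption,
# not the dataset's real instruction vocabulary).
INSTRUCTION_VOCAB = {"bedroom", "kitchen", "table", "stairs", "left", "right"}

def build_candidate_sets(instruction: str, vocab=INSTRUCTION_VOCAB):
    tokens = instruction.lower().split()
    # Target word set: tokens that string-match the instruction vocabulary.
    targets = [w for w in tokens if w in vocab]
    # Candidate substitutes for each target word: the remaining target words
    # in the same instruction.
    return {w: [c for c in targets if c != w] for w in targets}

cands = build_candidate_sets("Turn left and walk past the table into the kitchen")
```

Each target word's candidate set is simply the other target words found in the same instruction, which keeps the set size bounded without consulting an external synonym resource.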
In step S2, the attack score prediction module 101 calculates the attack score of each candidate substitute word.
Specifically, step S2 further includes:
Step S200, calculate the word importance vector β_t from the word features, obtained by encoding the target words with a BiLSTM, and the visual features, obtained with an attention mechanism; learnable linear transformation parameters project the different features into a common linear space. D_w is the feature dimension of the word features, D_v is the feature dimension of the visual features, and D_p is the number of output probabilities.
Step S201, calculate the substitution influence matrix γ_{t,j} of the different candidate words on each target word from the word features of the target word w_j and of the candidate word w'_j, combined through a learnable linear transformation.
In step S202, the final attack score is calculated. After the substitution influences of the different candidate words of all target words in the instruction have been computed in S201, the attack score is obtained by element-wise multiplication of the word importance vector β_t with the corresponding row of the substitution influence matrix γ_{t,j}; the result a_t represents the candidate action set, of size L'×K.
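The scoring of steps S200 to S202 can be sketched numerically as follows. This is an illustrative stand-in with random features rather than the patented formulas themselves, but it shows the element-wise combination of word importance and substitution influence over the L'×K candidate action set, and the selection of the maximum-score replacement:

```python
import numpy as np

# Illustrative attack-score sketch (assumed shapes, random stand-in features):
# scale each target word's substitution-influence row by its importance, then
# pick the highest-scoring (target word, candidate word) replacement action.
rng = np.random.default_rng(0)
L_prime, K = 3, 2                      # target-word set size L' and candidate set size K
importance = rng.random(L_prime)       # word importance vector (beta_t), one entry per target word
influence = rng.random((L_prime, K))   # substitution influence matrix (gamma_t), one row per target word

attack_score = importance[:, None] * influence   # element-wise scaling, shape (L', K)
target_idx, cand_idx = np.unravel_index(attack_score.argmax(), attack_score.shape)
```

The `(target_idx, cand_idx)` pair corresponds to the maximum-probability action used by the perturbation generation module to decide which word to replace and with what.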
In step S3, the perturbation generation module 102 replaces the corresponding words according to the attack scores calculated in step S2, generating an instruction with perturbation information.
In step S4, the visual language navigation module 11 navigates according to the instruction with perturbation information. The visual language navigation module 11 is attacked at each time t and then computes its next decision from the attacked instruction and the current panoramic image information.
In step S5, the adversarial reinforcement learning module 120 optimizes the dynamic reinforced instruction attacker 10 and the visual language navigation module 11, respectively, and self-supervised learning assists the optimization of the visual language navigation module 11.
Specifically, step S5 further includes:
s500, designing an anti-reinforcement learning mode to respectively optimize parameters of the dynamic reinforcement instruction attacker 10 and the visual language navigation module 11, and expressing the parameters as
Figure BDA00033055321700000512
Where π and η represent the strategy of the attacker and navigator, respectively, and rηRepresenting a reward function in reinforcement learning. We divide the training into two phases, in the first phase we train the navigator in advance and use the pre-trained navigator to train the attacker. In the second stage, we perform alternative iterative process between the navigator and the attacker to realize joint optimization
Figure BDA0003305532170000061
The A2C algorithm is used to train the RL strategy for both the attacker and the navigator.
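The two-phase schedule of S500 can be sketched as follows, with `train_navigator` and `train_attacker` standing in as assumed placeholders for the A2C policy updates of the navigator and attacker:

```python
# Schematic sketch (assumed interfaces, not the patented training code) of the
# two-phase adversarial schedule: pre-train the navigator, train the attacker
# against the frozen pre-trained navigator, then alternate updates.
def adversarial_training(navigator, attacker, train_navigator, train_attacker,
                         pretrain_iters=2, alternate_rounds=3):
    history = []
    for _ in range(pretrain_iters):       # phase 1a: pre-train navigator on clean instructions
        train_navigator(navigator, attacker=None)
        history.append("nav")
    for _ in range(pretrain_iters):       # phase 1b: train attacker vs. the pre-trained navigator
        train_attacker(attacker, navigator)
        history.append("atk")
    for _ in range(alternate_rounds):     # phase 2: alternating joint optimization
        train_attacker(attacker, navigator)
        train_navigator(navigator, attacker=attacker)
        history.extend(["atk", "nav"])
    return history

log = adversarial_training(object(), object(),
                           lambda nav, attacker: None, lambda atk, nav: None)
```

The returned `history` records the update order; in practice each placeholder call would perform one A2C update of the corresponding policy.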
S501, assist the optimization of the visual language navigation module 11 with an auxiliary self-supervised task, where c is the set of target words of the given instruction I, P_c(c) denotes the predicted probability, and L' is the target word set size; the prediction is computed from the target word features and the visual- and instruction-aware hidden state features of the decoder in the navigator through learnable linear transformations. The prediction is optimized with a cross-entropy loss supervised by the actually attacked words. In this way, the navigator learns to better align cross-modal information and acquires the ability to self-correct perturbed instructions.
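A numerical sketch of this auxiliary task follows; the bilinear scoring form, the dimensions, and all variable names are illustrative assumptions rather than the patented formula:

```python
import numpy as np

# Hedged sketch of the auxiliary self-supervised task: score each target word
# from the decoder hidden state and the target-word features through a
# learnable linear map, softmax the scores into predicted probabilities P_c,
# and apply cross-entropy against the actually attacked word. All shapes and
# the scoring form are assumptions for illustration.
rng = np.random.default_rng(0)
L_prime, D_word, D_hidden = 4, 16, 32               # target-word count and feature dims (assumed)
word_feats = rng.standard_normal((L_prime, D_word))  # features of the L' target words
hidden = rng.standard_normal(D_hidden)               # visual- and instruction-aware decoder hidden state
W = rng.standard_normal((D_word, D_hidden))          # learnable linear transformation (randomly initialized)

logits = word_feats @ (W @ hidden)                   # one score per target word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                 # softmax: predicted probability over target words
attacked_idx = 2                                     # index of the actually attacked word (example label)
loss = -np.log(probs[attacked_idx])                  # cross-entropy supervision
```

Minimizing this loss pushes the navigator's decoder state to identify which word of the instruction was perturbed, which is the self-correction ability described above.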
FIG. 3 shows an exemplary candidate substitute word generation module according to an embodiment of the present invention. The target word set is constructed for each instruction by string matching between the instruction and the instruction vocabulary, which contains only words indicating visual objects and positions. The candidate substitute word set of each target word is constructed by collecting the remaining target words in the same instruction.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.

Claims (6)

1. A visual language navigation system based on a dynamic reinforced instruction attack module, characterized by comprising: a dynamic reinforced instruction attacker (10), a visual language navigation module (11), and an optimization learning module (12);
at the starting point, the dynamic reinforced instruction attacker (10) acquires a piece of text, which is an instruction describing the trajectory step by step; at each time t, by considering the importance of the words in the current instruction and the substitution influence of the different candidate words, the dynamic reinforced instruction attacker (10) computes an action prediction probability, also called the attack score, replaces the target word in the instruction that has the maximum probability accordingly, and finally outputs the attacked instruction, i.e., the instruction with perturbation information;
the visual language navigation module (11), based on a navigator with an encoder-decoder structure, receives at each time t the instruction with perturbation information output by the dynamic reinforced instruction attacker (10) together with the panoramic image input at the current moment, and completes navigation according to this information;
the optimization learning module (12) optimizes the dynamic reinforced instruction attacker (10) and the visual language navigation module (11) through adversarial reinforcement learning and self-supervised learning.
2. The visual language navigation system based on the dynamic reinforced instruction attack module as claimed in claim 1, wherein the dynamic reinforced instruction attacker (10) comprises a candidate substitute word generation module (100), an attack score prediction module (101), and a perturbation generation module (102);
the candidate substitute word generation module (100) first constructs, for each instruction, its target word set by performing string matching between the instruction and an instruction vocabulary; the instruction vocabulary contains words indicating visual objects or positions, collected from the given instruction vocabulary of the dataset; a candidate substitute word set is constructed for each target word by selecting the remaining target words in the same instruction;
the attack score prediction module (101) computes, at each time t, an attack score for each target word; the score is computed from the panoramic image input at the current position at each time t and is dynamically updated as navigation proceeds;
the perturbation generation module (102) replaces the corresponding target word in the instruction according to the attack score computed by the attack score prediction module (101), and then generates the instruction with perturbation information.
3. The visual language navigation system based on the dynamic reinforced instruction attack module as claimed in claim 1, wherein the optimization learning module (12) comprises an adversarial reinforcement learning module (120) and a self-supervised learning module (121);
the adversarial reinforcement learning module (120) iteratively optimizes the dynamic reinforced instruction attacker (10) and the visual language navigation module (11), respectively, through adversarial reinforcement learning;
the self-supervised learning module (121) predicts, in a self-supervised manner, which target words were actually attacked, improving the cross-modal information understanding capability of the visual language navigation module (11).
4. A visual language navigation method using the system of claim 1, comprising the following steps:
step S1, the dynamic reinforced instruction attacker (10) receives an instruction describing the trajectory step by step, and the candidate substitute word generation module (100) constructs a candidate substitute word set for the input instruction;
step S2, the attack score prediction module (101) computes the attack score of each candidate substitute word;
step S3, according to the attack scores computed in step S2, the perturbation generation module (102) replaces the corresponding words to generate an instruction with perturbation information;
step S4, the visual language navigation module (11) navigates according to the instruction with perturbation information;
step S5, the adversarial reinforcement learning module (120) optimizes the dynamic reinforced instruction attacker (10) and the visual language navigation module (11), respectively, while self-supervised learning assists the optimization of the visual language navigation module (11).
5. The visual language navigation method as claimed in claim 4, wherein the step S2 comprises the following sub-steps:
step S200, calculating the word importance vector β_t from the word features, obtained by encoding the target words with a BiLSTM, and the visual features, obtained with an attention mechanism, wherein learnable linear transformation parameters project the different features into a common linear space, D_w is the feature dimension of the word features, D_v is the feature dimension of the visual features, and D_p is the number of output probabilities;
step S201, calculate the substitution impact matrix of the different candidate words on each target word, γ_{t,j} = W_γ [F_{w_j}; F_{w'_j}], where F_{w_j} and F_{w'_j} respectively denote the word features of the target word w_j and the candidate word w'_j, and W_γ is a learnable linear transformation;
step S202, calculate the final attack score a_t = β_t ⊙ γ_t, where ⊙ denotes that the word importance vector β_t and the corresponding elements in the t-th row of the substitution impact matrix γ_{t,j} are multiplied one by one, and a_t represents a candidate action set of size L'×K.
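The score computation of sub-steps S200–S202 can be sketched in NumPy. The shapes, the attention-style combination of features, and the random stand-in for the substitution impact matrix are illustrative assumptions only:

```python
# Toy sketch of attack-score computation (S200-S202).
# Shapes and the exact feature combination are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
L_target, K, D_w, D_v, D_p = 4, 3, 8, 6, 5  # target words, candidates per word, dims

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# S200: word importance from word features (BiLSTM output) and an attended visual feature
F_w = rng.normal(size=(L_target, D_w))      # word features of the target words
F_v = rng.normal(size=(D_v,))               # attended visual feature at step t
W_w = rng.normal(size=(D_w, D_p))           # learnable linear transforms (here random)
W_v = rng.normal(size=(D_v, D_p))
beta_t = softmax((F_w @ W_w) @ (F_v @ W_v))          # shape (L_target,), sums to 1

# S201: substitution impact of each candidate on each target word
gamma_t = rng.normal(size=(L_target, K))             # stand-in for W_gamma[F_wj; F_w'j]

# S202: attack score = importance multiplied element-wise into each row of impacts
a_t = beta_t[:, None] * gamma_t                      # shape (L_target, K) = L' x K

target_idx, cand_idx = np.unravel_index(a_t.argmax(), a_t.shape)
print(a_t.shape, target_idx, cand_idx)
```

The highest-scoring (target word, candidate word) pair would then drive the substitution performed by the perturbation generation module.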
6. The visual language navigation method based on the dynamic reinforced instruction attack module according to claim 4, wherein step S5 comprises the following sub-steps:
s500, designing an anti-reinforcement learning mode to respectively optimize parameters of the dynamic reinforcement instruction attacker (10) and the visual language navigation module (11), and expressing the parameters as
Figure FDA00033055321600000210
Wherein pi and eta represent the strategies of the dynamic strengthening instruction attacker (10) and the visual language navigation module (11), respectively, and rηRepresenting a reward function in reinforcement learning;
s501, assisting the optimization of the visual language navigation module (11) by using an auxiliary self-supervision task, and expressing as
Figure FDA00033055321600000211
Figure FDA00033055321600000212
Where c is the set of target words for a given instruction I, Pc(c) The probability of the prediction is represented by,
Figure FDA0003305532160000031
representing the target word characteristics, L' is the target word set size,
Figure FDA0003305532160000032
representing visual and instruction-aware hidden state features of a decoder in a navigator,
Figure FDA0003305532160000033
and
Figure FDA0003305532160000034
representing a learnable linear transformation.
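Sub-steps S500–S501 can be illustrated with a toy alternating min-max update on two scalar "policies" plus a cross-entropy target-word loss. The reward function, learning rate, and all shapes are illustrative assumptions, not the patent's actual model:

```python
# Toy sketch of S500 (alternating adversarial updates) and S501
# (self-supervised target-word prediction loss). Purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
theta_att, theta_nav = 0.0, 0.0     # scalar stand-ins for policies pi and eta
lr, eps = 0.5, 1e-3

def reward(att, nav):
    # Illustrative r_eta: the attacker lowers it, the navigator raises it.
    return nav - 0.1 * nav ** 2 + (att - nav) ** 2

for _ in range(200):                # S500: min over pi, max over eta
    g_att = (reward(theta_att + eps, theta_nav)
             - reward(theta_att - eps, theta_nav)) / (2 * eps)
    theta_att -= lr * g_att         # attacker minimizes the reward
    g_nav = (reward(theta_att, theta_nav + eps)
             - reward(theta_att, theta_nav - eps)) / (2 * eps)
    theta_nav += lr * g_nav         # navigator maximizes the reward

# S501: cross-entropy of predicting the attacked target word (softmax over L' words)
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = rng.normal(size=5)         # stand-in for (W_a F_c)^T (W_b h_t) scores
probs = softmax(logits)
true_word = 2                       # index of the word that was actually attacked
ssl_loss = -np.log(probs[true_word])
print(round(theta_att, 3), round(theta_nav, 3))
```

With this quadratic toy reward, the two players settle at the saddle point; in the claimed system both policies are neural networks updated with reinforcement learning rewards rather than finite-difference gradients.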
CN202111202568.8A 2021-04-12 2021-10-15 Visual language navigation system and method based on dynamic enhanced instruction attack module Active CN113804200B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110388939 2021-04-12
CN202110388939X 2021-04-12

Publications (2)

Publication Number Publication Date
CN113804200A true CN113804200A (en) 2021-12-17
CN113804200B CN113804200B (en) 2023-12-29

Family

ID=78937771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111202568.8A Active CN113804200B (en) 2021-04-12 2021-10-15 Visual language navigation system and method based on dynamic enhanced instruction attack module

Country Status (1)

Country Link
CN (1) CN113804200B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229682A (en) * 2018-02-07 2018-06-29 深圳市唯特视科技有限公司 A kind of image detection countercheck based on backpropagation attack
CN111209370A (en) * 2019-12-27 2020-05-29 同济大学 Text classification method based on neural network interpretability
US20200285952A1 (en) * 2019-03-08 2020-09-10 International Business Machines Corporation Quantifying Vulnerabilities of Deep Learning Computing Systems to Adversarial Perturbations
CN112380357A (en) * 2020-12-09 2021-02-19 武汉烽火众智数字技术有限责任公司 Method for realizing interactive navigation of knowledge graph visualization
CN112529295A (en) * 2020-12-09 2021-03-19 西湖大学 Self-supervision visual language navigator based on progress prediction and path shortening method
US20210089891A1 (en) * 2019-09-24 2021-03-25 Hrl Laboratories, Llc Deep reinforcement learning based method for surreptitiously generating signals to fool a recurrent neural network
WO2021058090A1 (en) * 2019-09-24 2021-04-01 Toyota Motor Europe System and method for navigating a vehicle using language instructions
CN112633309A (en) * 2019-09-24 2021-04-09 罗伯特·博世有限公司 Efficient query black box anti-attack method based on Bayesian optimization


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ni Congyun; Huang Hua: "Research on the Composition of Cognitive Electronic Warfare Systems and Their Key Technologies", Shipboard Electronic Countermeasure, no. 03 *
Zhang Jianan; Wang Yixiang; Liu Bo; Chang Xiaolin: "A Survey of Adversarial Attack Methods in Deep Learning", Cyberspace Security, no. 07 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082915A (en) * 2022-05-27 2022-09-20 华南理工大学 Mobile robot vision-language navigation method based on multi-modal characteristics
CN115082915B (en) * 2022-05-27 2024-03-29 华南理工大学 Multi-modal feature-based mobile robot vision-language navigation method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant