CN113804200A

CN113804200A - Visual language navigation system and method based on dynamic reinforced instruction attack module

Info

Publication number: CN113804200A
Application number: CN202111202568.8A
Authority: CN
Inventors: 梁小丹; 龙衍鑫; 林冰倩; 宋伟; 朱世强
Original assignee: Zhejiang Lab; Sun Yat Sen University
Current assignee: Zhejiang Lab; Sun Yat Sen University
Priority date: 2021-04-12
Filing date: 2021-10-15
Publication date: 2021-12-17
Anticipated expiration: 2041-10-15
Also published as: CN113804200B

Abstract

The invention discloses a visual language navigation system and method based on a dynamic strengthening instruction attack module, wherein the system comprises: the dynamic strengthening instruction attackers are used for calculating candidate substitute words of the input instruction and giving out attack scores of corresponding target words and then generating disturbed instructions; the visual language navigator based on the encoder decoder structure completes navigation tasks according to input instructions and image information; the optimization learning module is used for iteratively optimizing the navigator and the attacker in a counterattack reinforcement learning mode, and the multi-mode learning capability of the navigator is improved in a self-supervision mode. The method has the advantage of improving the navigation robustness of the navigator in the environment.

Description

Visual language navigation system and method based on dynamic reinforced instruction attack module

Technical Field

The invention relates to the field of visual language navigation, in particular to a visual language navigation system and a visual language navigation method based on a dynamic strengthening instruction attack module.

Background

Visual navigation tasks based on natural language show great potential in real world robotic applications and attract more and more interest. To achieve successful navigation, the navigator needs to extract key information from the long instructions, such as visual objects, specific rooms or navigation directions, according to dynamic visual observations in order to guide the navigation at each time. However, due to the complexity and semantic ambiguity of natural language, it is difficult for a navigator to efficiently learn cross-modal alignment and capture accurate semantic intent from an instruction through limited training of manually annotated instruction path data.

Previous work has mainly adopted data enhancement strategies to solve the problem of data scarcity in navigation tasks. Ronghang Hu et al propose a Speaker-Follower framework to generate augmentation instructions in random sampling paths. However, generating a large number of whole instructions is costly and may not emphasize the most instructive information. Other work has focused more on creating challenging augmentation paths and different visual scenes by directly generating augmentation instructions with the Speaker-Follower model. Therefore, the improvement of the navigator's ability to understand the instructions is still limited.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a visual language navigation system and a method based on a dynamic reinforced instruction attack module.

In order to achieve the above object, the present invention provides a dynamic reinforcement learning instruction attack system applied to a visual language navigation task, comprising: the system comprises a dynamic strengthening instruction attacker, a visual language navigation module and an optimized learning module;

in the dynamic strengthening instruction attackers, at the starting point, the attackers acquire a section of text, and the text is an instruction for describing the track step by step. At each moment t, by considering the importance of words in the current instruction and the replacement influence of different candidate words, the dynamic strengthening instruction attacker calculates the action prediction probability, also called the attack score, carries out corresponding replacement on the target words in the instruction according to the maximum probability, and finally outputs the attacked instruction, namely the instruction with disturbance information.

The visual language navigation module receives an instruction with disturbance information output by the dynamic strengthening instruction attacker and panoramic image input at the current moment at different moments t based on the navigator with the encoder and decoder structure, and completes navigation according to the information.

The optimization learning module optimizes the dynamic reinforcement instruction attackers and the visual language navigation module by adopting an confrontation reinforcement learning mode and a self-supervision learning mode.

Further, the dynamic strengthening instruction attackers comprise a candidate surrogate word generation module, an attack score prediction module and a disturbance generation module;

the candidate substitute word generation module is used for constructing a target word set of each instruction by performing character string matching between each instruction and an instruction vocabulary. The instruction vocabulary contains words indicating visual objects or positions, which words are collected from a given instruction vocabulary of the data set; a set of candidate replacement words is constructed for each target word by selecting the remaining target words in the same instruction.

And the attack score prediction module calculates the target word at each moment t and then generates a corresponding attack score. And the attack score prediction module is obtained by calculating the input of the panoramic image at each moment t and the current position and dynamically updates along with the navigation process.

And the disturbance generation module carries out corresponding replacement on the target word in the instruction according to the attack score calculated by the attack score prediction module and then generates the instruction with disturbance information.

Further, the optimized learning module comprises an confrontation reinforcement learning module and an automatic supervision learning module;

the confrontation reinforcement learning module respectively optimizes the dynamic reinforcement instruction attackers and the visual language navigation module in an iterative manner by utilizing the confrontation reinforcement learning mode.

The self-supervision learning module predicts the target words of actual attacks again by utilizing a self-supervision learning mode, and improves the cross-modal information understanding capability of the visual language navigation module.

Further, a visual language navigation method based on a dynamic strengthening instruction attack module comprises the following steps:

step S1, the dynamic strengthening instruction attackers receive instructions for describing the track step by step, and the candidate word generation module constructs a candidate substitute word set for the input instructions;

step S2, the attack score prediction module calculates the attack score of each corresponding replaceable word;

step S3, according to the attack score calculated in step S2, replacing the corresponding word with a disturbance generation module according to the attack score to generate a command with disturbance information;

step S4, the visual language navigation module navigates according to the instruction with disturbance information;

and step S5, the confrontation reinforcement learning module respectively optimizes the dynamic reinforcement instruction attackers and the visual language navigation module, and simultaneously, self-supervision learning is used for assisting the optimization of the visual language navigation module.

Further, the step S2 includes the following sub-steps:

step S200, calculating word importance vector

Wherein

And

respectively representing the word features obtained by computing the target word with BiLSTM and the visual features obtained with the attention-based mechanism.

And

are learnable linear-variation parameters that convert different features into the same linear space. D_wRespectively, the length of the characteristic dimension of the word characteristic, D_vLength of a feature dimension that is a visual feature, and D_pThe number of probabilities that are the final output;

step S201, calculating substitution influence matrixes of different candidate words on each target word

Wherein

And

respectively represent the target word w_jAnd candidate word w'_jThe characteristics of the words of (a) are,

is a learnable linear transformation;

step S202, calculating the final attack score

Wherein

Represents the word importance vector beta_tAnd the substitution influence matrix gamma_t,jThe corresponding elements in the t-th row of (a) are multiplied one by one, a_tRepresenting a candidate action set having a size of L' xK.

Further, the step S5 includes the following sub-steps:

s500, designing an anti-reinforcement learning mode to respectively optimize parameters of the dynamic reinforcement instruction attacker (10) and the visual language navigation module (11), and expressing the parameters as

Wherein pi and eta represent the strategies of the dynamic strengthening instruction attacker (10) and the visual language navigation module (11), respectively, and r_ηRepresenting a reward function in reinforcement learning;

s501, assisting the optimization of the visual language navigation module (11) by using an auxiliary self-supervision task, and expressing as

Where c is the set of target words for a given instruction I, P_c(c) The probability of the prediction is represented by,

representing the target word characteristics, L' is the target word set size,

representing visual and instruction-aware hidden state features of a decoder in a navigator,

and

representing a learnable linear transformation.

Compared with the prior art, the method has the following advantages:

1. the robust navigator is trained by utilizing the adversarial attack to the navigation task language instruction, and the adversarial attack is dynamically changed along with the navigation process unlike the prior natural language task which is generally static.

2. By describing the process of disturbance generation into a Markov decision process, the dynamic reinforcement instruction attacker can train and learn to generate effective disturbance by reinforcement learning without a target based on classification.

3. The invention utilizes a different antagonism training strategy and an auxiliary self-supervision reasoning task to improve the cross-modal understanding capability of the navigator.

4. The method can improve the robustness and accuracy of the existing model under the visual language navigation task.

Drawings

FIG. 1 is a system architecture diagram of the present invention;

FIG. 2 is a flow chart of the steps of the present invention;

FIG. 3 is a diagram of an exemplary candidate alternative generation module according to an embodiment of the present invention;

Detailed Description

Other advantages and capabilities of the present invention will be readily apparent to those skilled in the art from the present disclosure by describing the embodiments of the present invention with specific embodiments thereof in conjunction with the accompanying drawings. The invention is capable of other and different embodiments and its several details are capable of modification in various other respects, all without departing from the spirit and scope of the present invention.

As shown in fig. 1, the present invention is applied to a visual language navigation system based on a dynamic reinforced instruction attack module, and includes a dynamic reinforced instruction attacker 10, a visual language navigation module 11, and an optimization learning module 12;

the dynamic strengthening instruction attackers 10, at the beginning, obtain a text, which is an instruction describing the trajectory step by step. At each time t, the dynamically enhanced instruction attackers 10 compute an action prediction probability, also referred to as an attack score, by considering the importance of the word in the current instruction and the substitution impact of different candidate words.

In an embodiment of the present invention, specifically, the dynamic strengthening instruction attacker 10 includes a candidate substitute word generation module 100, an attack score prediction module 101, and a disturbance generation module 102;

the candidate replacement word generation module 100, for each instruction, first constructs its target word set by performing string matching between it and the instruction vocabulary. The instruction vocabulary contains words indicating visual objects or positions, which words are collected from a given instruction vocabulary of the data set; a set of candidate replacement words is constructed for each target word by selecting the remaining target words in the same instruction.

The attack score prediction module 101 calculates the target word at each time t, and then generates a corresponding attack score. The attack score prediction module 101 calculates the attack score at each time t and the panoramic image input at the current position, and dynamically updates the attack score along with the navigation process.

And the disturbance generation module 102 performs corresponding replacement on the target word in the instruction according to the attack score calculated by the attack score prediction module 101, and then generates an instruction with disturbance information.

The visual language navigation module 11, based on the navigator with the encoder/decoder structure, receives the instruction with disturbance information and the panoramic image input at the current time, which are output by the dynamic strengthening instruction attacker 10, at different times t, and completes navigation according to the information.

And the optimization learning module 12 is used for optimizing the dynamic reinforcement instruction attacker (10) and the visual language navigation module 11 by adopting an anti-reinforcement learning mode and an automatic supervision learning mode.

In a specific embodiment of the present invention, specifically, the optimized learning module 12 includes an confrontation reinforcement learning module 120 and an auto-supervised learning module 121;

the confrontation reinforcement learning module 120 respectively optimizes the dynamic reinforcement instruction attacker 10 and the visual language navigation module 11 by iteration in the confrontation reinforcement learning mode.

The self-monitoring learning module 121 predicts the target words of the actual attack again by using a self-monitoring learning mode, and improves the cross-modal information understanding capability of the visual language navigation module 11.

FIG. 2 is a flowchart illustrating the steps of the visual language navigation method according to the present invention. The method comprises the following steps:

in step S1, the candidate word generation module 100 constructs a candidate substitute word set for the input instruction. In particular, for target word w in instruction I_j(j is more than or equal to 0 and less than or equal to L '), and L' is the size of the target word set. We denote the set of candidate surrogate words as

Where K is the size of the set of candidate replacement words. To facilitate understanding of a given instruction and maintain a reasonable set size, we select the remaining target words in the same instruction to construct a set of candidate replacement words for a particular target word. The construction details of the target word set and the candidate replacement word set for the visual language navigation task are shown in fig. 3, respectively.

In step S2, the attack score prediction module 101 calculates the attack score of each alternative word.

Specifically, step S2 further includes:

step S200, calculating word importance vector

Wherein

And

And

are learnable linear-variation parameters that convert different features into the same linear space. . D_wRespectively, the length of the characteristic dimension of the word characteristic, D_vLength of a feature dimension that is a visual feature, and D_pIs the number of probabilities of the final output.

A real number matrix with the size shown, step S201, calculating the substitution influence matrix of different candidate words on each target word

Wherein

And

is a learnable linear transformation.

At step 202, a final attack score is calculated. After calculating the alternative influence of different candidate words of all target words in the instruction in S201, the attack score can be obtained

Wherein

And step S3, replacing the corresponding words with the perturbation generation module 102 according to the attack scores calculated in step S2, and generating instructions with perturbation information.

In step S4, the visual language navigation module 11 is used to navigate according to the instruction with disturbance information. The visual language navigation module 11 is attacked at each time t, and then the next decision is calculated according to the attacked instruction and the current panoramic image information.

In step S5, the dynamic-state-enhanced-instruction attacker 10 and the visual language navigation module 11 are optimized by the countermeasure-enhanced-learning module 120, and the optimization of the visual language navigation module 11 is assisted by the self-supervised learning.

Specifically, step S5 further includes:

s500, designing an anti-reinforcement learning mode to respectively optimize parameters of the dynamic reinforcement instruction attacker 10 and the visual language navigation module 11, and expressing the parameters as

Where π and η represent the strategy of the attacker and navigator, respectively, and r_ηRepresenting a reward function in reinforcement learning. We divide the training into two phases, in the first phase we train the navigator in advance and use the pre-trained navigator to train the attacker. In the second stage, we perform alternative iterative process between the navigator and the attacker to realize joint optimization

The A2C algorithm is used to train the RL strategy for both the attacker and the navigator.

S501, assisting the visual language navigation module 11 to optimize by using the auxiliary self-supervision task, and expressing as

representing the target word characteristics, L' is the target word set size,

and

representing a learnable linear transformation. The prediction is optimized by cross-entropy loss with supervision of the actual attack word. By doing so, the navigator can learn better ability to align across modality information and have self-correcting ability to perturb commands.

FIG. 3 is a diagram of an exemplary module for generating candidate replacement words according to an embodiment of the present invention. The target word set is a target word set constructed for each instruction by string matching between the instruction and the instruction vocabulary, which contain only words indicating visual objects and positions. The candidate replacement words set for each target word are constructed by collecting the remaining target words in the same instruction.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.

Claims

1. A visual language navigation system based on a dynamic strengthening instruction attack module is characterized by comprising: the system comprises a dynamic strengthening instruction attacker (10), a visual language navigation module (11) and an optimization learning module (12);

the dynamic strengthening instruction attackers (10) acquire a section of text at the starting point, wherein the text is an instruction for describing the track step by step. At each time t, by considering the importance of words in the current instruction and the replacement influence of different candidate words, the dynamic strengthening instruction attacker (10) calculates the action prediction probability, also called the attack score, carries out corresponding replacement on the target words in the instruction according to the maximum probability, and finally outputs the attacked instruction, namely the instruction with disturbance information.

The visual language navigation module (11) receives the instruction with disturbance information output by the dynamic strengthening instruction attacker (10) and the panoramic image input at the current moment at different moments t based on the navigator with the encoder-decoder structure, and completes navigation according to the information.

The optimization learning module (12) optimizes the dynamic reinforcement instruction attacker (10) and the visual language navigation module (11) by adopting an anti-reinforcement learning mode and an automatic supervision learning mode.

2. The visual language navigation system based on the dynamic augmentation instruction attack module as claimed in claim 1, wherein the dynamic augmentation instruction attacker (10) comprises a candidate surrogate word generation module (100), an attack score prediction module (101) and a perturbation generation module (102);

the candidate replacement word generation module (100) first constructs, for each instruction, a target word set for each instruction by performing string matching between each instruction and an instruction vocabulary. The instruction vocabulary contains words indicating visual objects or positions, which words are collected from a given instruction vocabulary of the data set; a set of candidate replacement words is constructed for each target word by selecting the remaining target words in the same instruction.

The attack score prediction module (101) calculates the target word at each time t and then generates a corresponding attack score. And the attack score prediction module (101) is obtained by calculating the input of the panoramic image at each moment t and the current position and dynamically updates along with the navigation process.

And the disturbance generation module (102) carries out corresponding replacement on the target word in the instruction according to the attack score calculated by the attack score prediction module (101), and then generates the instruction with disturbance information.

3. The dynamic reinforcement instruction attack module-based visual language navigation system of claim 1, wherein the optimization learning module (12) comprises a confrontation reinforcement learning module (120) and an auto-supervised learning module (121);

the confrontation reinforcement learning module (120) respectively optimizes the dynamic reinforcement instruction attacker (10) and the visual language navigation module (11) in an iterative way by using the confrontation reinforcement learning mode.

The self-supervision learning module (121) predicts the target words of actual attacks again by using a self-supervision learning mode, and improves the cross-modal information understanding capability of the visual language navigation module (11).

4. A visual language navigation method using the system of claim 1, comprising the steps of:

step S1, the dynamic strengthening instruction attackers (10) receive instructions for describing the track step by step, and the candidate word generation module (100) constructs a candidate substitute word set for the input instructions;

step S2, the attack score prediction module (101) calculates the attack score of each corresponding replaceable word;

step S3, according to the attack score calculated in S2, replacing the corresponding word by a disturbance generation module (102) according to the attack score to generate an instruction with disturbance information;

step S4, the visual language navigation module (11) navigates according to the instruction with disturbance information;

and step S5, the countermeasure reinforcement learning module (120) respectively optimizes the dynamic reinforcement instruction attacker (10) and the visual language navigation module (11), and simultaneously, self-supervision learning is used for assisting the optimization of the visual language navigation module (11).

5. The visual language navigation method applied to the dynamic augmentation instruction attack module as claimed in claim 4, wherein the step S2 comprises the following sub-steps:

step S200, calculating word importance vector

Wherein

And

And

Wherein

And

is a learnable linear transformation;

step S202, calculating the final attack score

Wherein

Represents the word importance vector beta_tAnd the substitution influence matrix gamma_t，jThe corresponding elements in the t-th row of (a) are multiplied one by one, a_tRepresenting a candidate action set having a size of L' xK.

6. The visual language navigation method applied to the dynamic augmentation instruction attack module as claimed in claim 4, wherein the step S5 comprises the following sub-steps:

representing the target word characteristics, L' is the target word set size,

and

representing a learnable linear transformation.