CN115906831A - Distance perception-based Transformer visual language navigation algorithm - Google Patents

Distance perception-based Transformer visual language navigation algorithm

Info

Publication number
CN115906831A
CN115906831A (application CN202211342144.6A)
Authority
CN
China
Prior art keywords
navigation
distance
information
visual
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211342144.6A
Other languages
Chinese (zh)
Inventor
魏忠钰
杜梦飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202211342144.6A priority Critical patent/CN115906831A/en
Publication of CN115906831A publication Critical patent/CN115906831A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Navigation (AREA)

Abstract

The invention discloses a distance-perception-based Transformer visual language navigation algorithm, belonging to the technical field of visual-language cross-modal learning. The algorithm works as follows. The visual information of the agent's perceivable area, the instruction information and a memory structure are initialized. A scene-memory updating module based on a graph data structure, combined with a vision-language multi-modal pre-trained model, fuses the exploration information gathered during navigation, enhancing the agent's perception of the environment. A distance-based progress monitor compresses the action space of each decision during navigation, reducing computational resources and accelerating model training. A dynamic distance-fusion module incorporates distance information into the action decision, so that the algorithm balances the exploration path length against global exploration and improves the efficiency of the navigation task. The proposed distance-perception-based Transformer visual language navigation algorithm preserves the high navigation success rate of scene-memory-based algorithms while markedly improving exploration efficiency.

Description

Transformer visual language navigation algorithm based on distance perception
Technical Field
The invention relates to the technical field of visual-language cross-modal learning, in particular to a Transformer visual language navigation algorithm based on distance perception.
Background
The Visual Language Navigation (VLN) task aims to train an agent to reach a target location in an unstructured, unseen environment by performing a series of actions, combining natural-language instructions with the visual information observed by the agent.
The visual language navigation task requires the agent to possess comprehensive capabilities in natural language understanding, visual environment perception, multi-modal feature alignment and reasonable policy decision-making.
Most VLN algorithms adopt a sequence-to-sequence (Seq2Seq) framework [1] and model the temporal state of agent navigation with a long short-term memory network (LSTM) to process the language-image information stream during navigation. This prevents the agent from directly accessing historical information gathered during navigation, such as the spatial layout of visited locations and previous decisions. Another class of algorithms stores the explored scene in a graph structure during navigation, expanding the local action space into a global action space; this helps the agent evaluate all currently navigable positions and gives it a strong ability to correct errors in time.
The existing methods have the following two problems:
1. The rapid expansion of the global action space significantly slows model convergence, and the excessive number of candidate positions consumes a large amount of GPU resources during training;
2. These algorithms generally lack backtracking constraints, so invalid repeated paths appear frequently, which lengthens the navigation path and reduces navigation efficiency.
Disclosure of Invention
The invention provides a distance-perception-based Transformer visual language navigation algorithm. A scene-memory updating module based on a graph data structure, combined with a vision-language multi-modal pre-trained model, further fuses the exploration information gathered during navigation and thereby enhances the agent's perception of the environment. A distance-based progress monitor compresses the action space of each decision during navigation, reducing computational resources and accelerating model training. A dynamic distance-fusion module incorporates distance information into the action decision, so that the algorithm balances the exploration path length against global exploration: it explores sufficiently and retains the ability to correct errors in time, while reducing the selection of unnecessarily distant candidate positions, thereby improving the efficiency of the vision-language navigation task.
In order to achieve the purpose, the invention adopts the following technical scheme:
the visual language navigation algorithm of the Transformer based on distance perception is provided, and is characterized by comprising the following steps of:
s1, performing navigation initialization according to a navigation task and visual information in a sensible environment, wherein a main initialization component is a scene memory structure
Figure BDA0003916472650000021
Navigation module and navigation state h 0 Navigation instruction information X;
s2, according to the scene memory structure of the current navigation progress t
Figure BDA0003916472650000022
Extracting visual information V of current position t Distance information D t And candidate position vision and space information and constructing a global action space corresponding to the current position on the basis of the candidate position vision and space information;
s3, utilizing progress monitor f monitor Screening the global action space, and selecting the first n candidate positions with the highest evaluation progress as the actual action space in the action decision process;
s4, the visual information V is processed t Command information X and distance information D t Inputting the navigation state h of the intelligent agent into a distance perception Transformer navigation module based on a dynamic distance fusion module, and updating the navigation state h of the intelligent agent t And outputting the distance weighted action probability distribution corresponding to the current position
Figure BDA0003916472650000023
Selecting candidate nodes corresponding to the maximum probability as the next action;
s5, updating the scene memory structure according to the visual characteristics observed after the action is executed
Figure BDA0003916472650000024
To>
Figure BDA0003916472650000025
And S6, repeatedly executing S2-S5 until the intelligent algorithm judges that the navigation task is completed or the maximum moving step number K is reached.
Preferably, the navigation initialization is performed by:
A1. Construct a directed graph G_0 = (N_0, ε_0) from the visual features of the observable range at the starting point, where a node in N_0 corresponds to the starting position and an edge e_(u,s) = (u,s) ∈ ε_0 encodes the relative spatial relation between positions u and s, mainly the spatial relation between the starting position and the candidate positions;
A2. Replace the randomly initialized parameters of the Transformer model in the navigation module with the parameters of a pre-trained language model;
A3. Split the natural-language instruction text U into sentences with the clause separator [SEP] according to the sentence structure of the original instruction, and add a [CLS] token at the initial position to track the recurrent state during navigation. The preprocessed sequence is passed through a multi-layer Transformer network to obtain the initialized instruction information X and the navigation state h_0, as in formula (1):
h_0, X = BERT([CLS], U, [SEP])    (1)
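For illustration, the encoding in formula (1) can be sketched as follows. This is a minimal example assuming the HuggingFace transformers API, with bert-base-uncased standing in for the PREVALENT-initialized encoder actually used by the invention; the instruction text is also invented.

```python
# Minimal sketch of step A3 / formula (1): split the instruction into clauses, join
# them with [SEP], let the tokenizer prepend [CLS], and take the [CLS] output as the
# initial navigation state h_0 and the remaining token outputs as the instruction X.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # stand-in for PREVALENT
encoder = BertModel.from_pretrained("bert-base-uncased")

instruction = "Walk past the sofa. Turn left at the kitchen. Stop by the fridge."
sentences = [s.strip() for s in instruction.split(".") if s.strip()]
text = " [SEP] ".join(sentences)                  # clause separators, as in step A3
inputs = tokenizer(text, return_tensors="pt")     # tokenizer adds [CLS] ... [SEP]

with torch.no_grad():
    outputs = encoder(**inputs)
hidden = outputs.last_hidden_state                # shape (1, L, 768)
h0 = hidden[:, 0]                                 # [CLS] vector -> navigation state h_0
X = hidden[:, 1:]                                 # remaining tokens -> instruction info X
```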
Preferably, the global action space corresponding to the current progress t is constructed from the scene memory structure G_t according to formula (2) (not reproduced here). In formula (2), the first symbol denotes the number of locations the agent has visited at the current progress; K_u denotes the candidate locations directly connected to a visited location u, i.e. a subset of the action space at the current progress; and o_(u,k) denotes the visual and relative spatial position information of the candidate locations stored in the scene memory structure G_t.
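A minimal sketch of this construction is given below, with an assumed (hypothetical) data layout for the scene memory; it simply collects one action entry o_(u,k) per navigable candidate of every visited node.

```python
# Sketch of building the global action space from a scene memory graph: every candidate
# position adjacent to a visited node u contributes one action entry o_(u,k).
from dataclasses import dataclass, field

@dataclass
class SceneMemory:
    # visited node id -> list of (candidate_id, visual_feature, relative_pose) triples
    candidates: dict = field(default_factory=dict)

    def global_action_space(self):
        """Collect o_(u,k) over all visited nodes u and their navigable candidates k."""
        actions = []
        for u, cands in self.candidates.items():
            for cand_id, vis_feat, rel_pose in cands:
                actions.append({"from": u, "candidate": cand_id,
                                "visual": vis_feat, "pose": rel_pose})
        return actions

memory = SceneMemory(candidates={
    "start": [("c1", [0.1, 0.2], (1.0, 0.0)), ("c2", [0.3, 0.1], (0.0, 2.0))],
    "c1":    [("c3", [0.5, 0.4], (2.5, 1.0))],
})
print(len(memory.global_action_space()))   # 3 candidate actions in the global space
```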
Preferably, the global action space is screened with the progress monitor f_monitor according to formula (3) (not reproduced here). In formula (3), h_(t-1) denotes the navigation state of the agent at the previous step, and the other input is the visual information of the k-th candidate location; the candidates retained by the screening form the actual action space used for the decision.
Preferably, the progress monitor f_monitor integrates the navigation state h_(t-1) of the agent at the previous step with the currently observed visual information to estimate the current navigation progress, as in formula (4) (not reproduced here). In formula (4), σ denotes the Sigmoid activation function, and W_p, W_v are parameter matrices of learnable fully-connected layers.
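A minimal sketch of the screening step follows. Since formula (4) is not reproduced in the text, the exact combination of h_(t-1) and the candidate visual feature is an assumption (an element-wise product of two learned projections followed by a Sigmoid); the dimensions and the ProgressMonitor class are likewise illustrative.

```python
# Sketch of distance-aware action-space screening: score every candidate with a
# progress estimate and keep the top-n candidates as the actual action space.
import torch
import torch.nn as nn

class ProgressMonitor(nn.Module):
    def __init__(self, state_dim=768, vis_dim=2048, hidden=512):
        super().__init__()
        self.W_p = nn.Linear(state_dim, hidden)   # projects the previous navigation state
        self.W_v = nn.Linear(vis_dim, hidden)     # projects candidate visual features

    def forward(self, h_prev, cand_vis):
        # h_prev: (state_dim,), cand_vis: (num_candidates, vis_dim)
        scores = torch.sigmoid((self.W_p(h_prev) * self.W_v(cand_vis)).sum(-1))
        return scores                              # estimated progress per candidate

monitor = ProgressMonitor()
h_prev = torch.randn(768)
cand_vis = torch.randn(50, 2048)                  # 50 candidates in the global space
progress = monitor(h_prev, cand_vis)
top_idx = torch.topk(progress, k=20).indices      # keep the n = 20 most promising ones
```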
Preferably, n =20.
Preferably, the distance information D_t encodes the distance from each candidate location to the current location as a 20-dimensional 0-1 vector using one-hot encoding.
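For example, the one-hot encoding could be realized as below; the 1-meter bin width and the capping at the last bin are assumptions, since the patent only specifies a 20-dimensional 0-1 vector.

```python
# Sketch of encoding a candidate-to-current distance as a 20-dimensional one-hot vector.
import torch

def distance_one_hot(distance_m: float, num_bins: int = 20) -> torch.Tensor:
    bin_idx = min(int(distance_m), num_bins - 1)   # assumed 1-meter bins, capped at bin 19
    return torch.nn.functional.one_hot(torch.tensor(bin_idx), num_bins).float()

print(distance_one_hot(3.4))   # 1.0 in position 3, zeros elsewhere
```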
Preferably, after receiving the visual information V_t, the instruction information X and the distance information D_t, the distance-perception Transformer navigation module based on the dynamic distance-fusion module makes the action decision at progress t and updates the current navigation state through the following steps:
B1. Compute the attention scores of the navigation state vector with respect to the language, visual and distance information, and the overall attention scores and attention weights over all language features;
B2. Use the attention weights of the language and of the vision to obtain language-weighted and vision-weighted information respectively, perform cross-modal matching by element-wise multiplication, concatenate the result with the navigation state vector output by the last Transformer layer, and map it to a new state representation;
B3. Concatenate the spatial information to the newly generated navigation state vector and map it to the final navigation state representation;
B4. Take the attention scores corresponding to vision and to distance as the action probability distribution and the distance weight distribution, respectively;
B5. Combine the action probability distribution with the distance weight distribution to obtain the distance-weighted action probability distribution.
Preferably, the agent is trained with a hybrid mode of reinforcement learning and imitation learning and optimized with formula (5) (not reproduced here). In formula (5), the sampled action is the action obtained by sampling from the predicted probability distribution; A_t denotes the reward corresponding to the action; the ground-truth action denotes the correct action; λ denotes the coefficient of the imitation-learning loss, μ the coefficient of the distance loss, and γ the coefficient of the progress-monitor loss.
Preferably, the scene memory of the agent is updated as follows: at navigation step t, after a candidate position u is selected from the action space, the algorithm adds a new node to the scene memory structure to represent u. When the agent reaches position u, the navigable node corresponding to u in G_t is deleted, and the navigable nodes discovered after reaching position u are added. Meanwhile, the visual information and the relative spatial information in the graph are updated according to the navigation state vector.
Preferably, K =15.
The invention has the following beneficial effects:
1. A scene-memory updating module based on a graph data structure is provided and, combined with a vision-language multi-modal pre-trained model, further fuses the exploration information gathered during navigation, enhancing the agent's perception of the environment;
2. A distance-based progress monitor compresses the action space of each decision during navigation, reducing computational resources and accelerating model training;
3. A dynamic distance-fusion module incorporates distance information into the action decision, so that the algorithm balances the exploration path length against global exploration and improves the efficiency of the navigation task.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a diagram of the implementation steps of the distance-perception-based Transformer visual language navigation algorithm provided by the present invention;
FIG. 2 is a system structure diagram of the distance-perception-based Transformer visual language navigation algorithm provided by the present invention.
Detailed Description
The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.
The drawings are for illustration only and are not intended to be limiting; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if the terms "upper", "lower", "left", "right", "inner", "outer", etc. are used for indicating the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not indicated or implied that the referred device or element must have a specific orientation, be constructed in a specific orientation and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes and are not to be construed as limitations of the present patent, and the specific meanings of the terms may be understood by those skilled in the art according to specific situations.
In the description of the present invention, unless otherwise explicitly specified or limited, the term "connected" or the like, if appearing to indicate a connection relationship between the components, is to be understood broadly, for example, as being fixed or detachable or integral; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or may be connected through one or more other components or may be in an interactive relationship with one another. The specific meanings of the above terms in the present invention can be understood in a specific case to those of ordinary skill in the art.
The embodiment of the invention provides a distance perception-based Transformer visual language navigation algorithm, as shown in FIG. 1, which comprises the following steps:
s1, performing navigation initialization according to a navigation task and visual information in a sensible environment, wherein the main initialized content is a scene memory structure
Figure BDA0003916472650000051
Navigation module and navigation state h 0 And navigation instruction information X. Memory structure for a scene->
Figure BDA0003916472650000052
Constructing a directed graph based on the visual characteristics of the observable range at the origin>
Figure BDA0003916472650000053
Wherein node->
Figure BDA0003916472650000054
Corresponding to the start position, edge e u,s =(u,s)∈ε t The spatial relative position relationship between the corresponding positions u and s is mainly the spatial position relationship between the starting position and the candidate position. For the navigation module, parameters of a transform model initialized randomly in the navigation module are replaced according to parameters of a pre-training language model (the invention uses parameters of a PREVALENT pre-training model). For the navigation state in the navigation instruction information, the sentence mark [ SEP ] is utilized according to the natural language instruction text U]Dividing sentences according to the sentence composition of the original instruction, and adding [ CLS ] at the initial position]The markers are used to process loop states in the navigation process.The preprocessed sequence can obtain initialized instruction information X and navigation state h through a multi-layer Transformer network 0
S2. According to the scene memory structure G_t at the current navigation progress t, extract the visual information V_t of the current position, the distance information D_t and the visual and spatial information of the candidate positions, and construct the global action space corresponding to the current position according to formula (2) (not reproduced here). In formula (2), the first symbol denotes the number of locations the agent has visited at the current progress; K_u denotes the candidate locations directly connected to a visited location u, i.e. a subset of the action space at the current progress; and o_(u,k) denotes the visual and relative spatial position information of the candidate locations stored in the scene memory structure G_t.
S3. Use the progress monitor f_monitor to screen the global action space and select the first n candidate positions with the highest estimated progress as the action space (n denotes the size of the screened action space and is an important parameter governing convergence speed and computational resource usage during training; repeated experimental comparison showed that n = 20 accelerates algorithm convergence while preserving the navigation success rate and reduces the computational resources required, so that a model that would otherwise need eight Tesla V100 GPUs can be trained on a single 2080Ti GPU). The progress monitor mainly uses the previous navigation state h_(t-1) and the currently observed visual feature to estimate the current navigation progress, as in formula (4) (not reproduced here), where σ denotes the Sigmoid activation function and W_p, W_v are parameter matrices of learnable fully-connected layers.
S4. Input the visual information V_t, the instruction information X and the distance information D_t into the distance-perception Transformer navigation module based on the dynamic distance-fusion module, update the navigation state h_t of the agent, output the distance-weighted action probability distribution p'^a_t corresponding to the current position, and select the candidate node with the maximum probability as the next action, as in formula (6) (not reproduced here).
Specifically, the state vector and the language vectors output by the k-th attention head of the l-th layer of the multi-layer Transformer structure in DistVLN BERT are denoted by dedicated symbols (shown as images in the original). The attention scores over all language features are computed as in formula (7) (not reproduced here).
Then, the model averages the scores over all K = 12 attention heads and applies a Softmax function to obtain the overall attention weights of the language features, as in formula (8) (not reproduced here).
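The head-averaging and normalisation described for formula (8) can be sketched as follows; random scores stand in for the attention scores produced by the last Transformer layer.

```python
# Sketch of aggregating per-head attention scores over the language tokens:
# average across the K = 12 heads, then normalise with Softmax.
import torch

K, num_tokens = 12, 40
head_scores = torch.randn(K, num_tokens)               # attention scores, one row per head
avg_scores = head_scores.mean(dim=0)                   # average over the 12 heads
language_weights = torch.softmax(avg_scores, dim=-1)   # overall language attention weights
print(language_weights.sum())                          # sums to 1 over the instruction tokens
```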
Similarly, attention scores and attention weights are obtained for the visual features and for the distance features. The model then applies a weighted summation to the input text features and to the visual features respectively to obtain the weighted features, as in formula (9) (not reproduced here).
Subsequently, the model performs cross-modal matching between the weighted language and visual features by element-wise multiplication, concatenates the result with the navigation state vector output by the last layer of DistVLN BERT, and maps it to a new state representation, as in formula (10) (not reproduced here). Finally, the model concatenates the directional features r_t to the newly generated navigation state vector and maps them to the final navigation state representation, as in formula (11) (not reproduced here).
The model then combines the distance-feature attention scores with the visual-feature attention scores to obtain the distance-weighted action probability distribution, as in formula (12) (not reproduced here).
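A minimal sketch of this distance weighting is shown below; because formula (12) is not reproduced in the text, the element-wise combination and renormalisation used here are assumptions.

```python
# Sketch of distance-weighted action probabilities: modulate the visual-candidate
# attention distribution by the distance attention weights, renormalise, and pick
# the candidate with the maximum probability.
import torch

num_candidates = 20
visual_attn = torch.softmax(torch.randn(num_candidates), dim=-1)       # action distribution
distance_weights = torch.softmax(torch.randn(num_candidates), dim=-1)  # distance attention

weighted = visual_attn * distance_weights
action_probs = weighted / weighted.sum()        # distance-weighted action probabilities
next_action = torch.argmax(action_probs)        # candidate with the maximum probability
```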
Reinforcement learning and imitation learning are used to train the agent. For reinforcement learning, the A2C algorithm is used: at each navigation step t the agent outputs an action probability distribution p'^a_t, and the distance from the chosen position to the task goal is used as the reward A_t for the corresponding action. For imitation learning, the agent follows the ground-truth trajectory by taking the correct action at each location, and the cross-entropy loss is computed at each decision step. To train the distance-perception module, the shortest-path distance vector from each candidate location in the scene memory graph to the current navigation location is computed, and its dot product with the distance weight vector output by the model is added as part of the training loss, helping the agent jointly consider the influence of language, vision and distance features on the action decision.
Specifically, the loss function for agent training can be expressed as formula (5) (not reproduced here). In formula (5), the sampled action is the action obtained by sampling from the predicted probability distribution; A_t denotes the reward corresponding to the action; the ground-truth action denotes the correct action; λ denotes the coefficient of the imitation-learning loss; μ denotes the coefficient of the distance loss; γ denotes the coefficient of the progress-monitor loss; r_t denotes the true task progress of the navigation; and the progress monitor outputs an estimate of the task progress.
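The following sketch illustrates a loss of this general shape; the concrete terms (an A2C policy-gradient term weighted by A_t, a cross-entropy imitation term, the distance dot product and a squared progress error) and the coefficient values are assumptions, since formula (5) itself is not reproduced in the text.

```python
# Sketch of a mixed RL + imitation + distance + progress-monitor training loss.
import torch
import torch.nn.functional as F

def mixed_loss(action_logits, sampled_action, reward_A_t, teacher_action,
               distance_vec, distance_weights, progress_pred, progress_true,
               lam=0.2, mu=0.5, gamma=0.5):
    log_probs = F.log_softmax(action_logits, dim=-1)
    rl_term = -reward_A_t * log_probs[sampled_action]          # A2C policy-gradient term
    il_term = F.cross_entropy(action_logits.unsqueeze(0),
                              teacher_action.unsqueeze(0))     # imitation-learning term
    dist_term = torch.dot(distance_weights, distance_vec)      # distance supervision
    prog_term = (progress_pred - progress_true) ** 2           # progress-monitor regression
    return rl_term + lam * il_term + mu * dist_term + gamma * prog_term

loss = mixed_loss(torch.randn(20), torch.tensor(3), torch.tensor(1.5),
                  torch.tensor(7), torch.rand(20), torch.softmax(torch.randn(20), -1),
                  torch.tensor(0.4), torch.tensor(0.6))
```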
S5. Update the scene memory structure G_t to G_(t+1) according to the visual features observed after the action is executed. Specifically, at navigation step t, after a candidate position u is selected from the action space, the algorithm adds a new node to the scene memory structure to represent u. When the agent reaches position u, the navigable node corresponding to u in G_t is deleted, and the navigable nodes discovered after reaching position u are added. Meanwhile, the algorithm updates the visual and spatial information in the graph according to the landmark- and action-related text features in the language instruction and the current navigation state, as in formulas (13) and (14) (not reproduced here).
Here h_t is the current navigation state, and the landmark and action representations extracted from the text are used to resolve the landmark and action features that the agent attends to in the instruction. Next, the model concatenates the navigation state with the landmark and action text representations respectively and maps them to a landmark-aware state and an action-aware state, as in formula (15) (not reproduced here).
Then, for the nodes and edges of the scene memory structure, the stored visual information and spatial information are updated respectively using an attention mechanism, as in formulas (16) and (17) (not reproduced here).
The invention then uses an iterative update method based on long-range reasoning, which iteratively updates the visual and directional features by exchanging messages between nodes through trainable functions, so that the visual and spatial information of the nodes and edges of the graph is considered jointly. Specifically, at the s-th iteration, the visual information of each node u is updated as in formulas (18) and (19) (not reproduced here), where the update function U(·) is implemented by a gated recurrent unit (GRU). After S iterations of aggregation, the model further refines the visual and directional features, i.e. node u captures information within its S-hop neighbourhood.
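A minimal sketch of such a GRU-based iterative update on a toy graph is given below; the mean aggregation of neighbour features is an assumption, as the exact form of formulas (18) and (19) is not reproduced in the text.

```python
# Sketch of iterative node updates: at each round every node aggregates its neighbours'
# visual features and refreshes its own feature with a GRU cell, so after S rounds a
# node has absorbed information from its S-hop neighbourhood.
import torch
import torch.nn as nn

dim = 256
gru = nn.GRUCell(input_size=dim, hidden_size=dim)   # the trainable update function U(.)

# toy graph: node index -> list of neighbour indices
neighbours = {0: [1], 1: [0, 2], 2: [1]}
node_feats = torch.randn(3, dim)                    # visual features stored on the nodes

S = 2
for _ in range(S):
    messages = torch.stack([node_feats[nbrs].mean(dim=0)
                            for _, nbrs in sorted(neighbours.items())])
    node_feats = gru(messages, node_feats)          # message as input, old feature as hidden
```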
S6. Repeat S2-S5 until the algorithm judges that the navigation task is completed or the maximum number of moving steps K is reached (K = 15 is chosen here based on experience from prior studies).
To verify the performance of the application on the vision-language navigation task, the model is evaluated on the Room-to-Room dataset, which contains 14,025, 1,020 and 2,349 navigation tasks in the training set, the validation set (seen environments) and the validation set (unseen environments), respectively. The application focuses on the following five evaluation metrics to assess the performance of the VLN algorithm:
(1) Success Rate (SR): the percentage of navigation tasks in which the agent stops less than 3 meters from the target location. This is the most direct metric of navigation performance.
(2) Navigation Error (NE): the shortest-path distance between the agent's stopping position and the target position.
(3) Trajectory Length (TL): the average total length of the agent's navigation trajectories, which reflects navigation efficiency to some extent.
(4) Oracle success Rate (OR): the proportion of navigation tasks in which, at some point during navigation, the agent's position is less than 3 meters from the target.
(5) Success weighted by Path Length (SPL) [31]: a metric that trades off SR against TL.
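For reference, the five metrics can be computed from per-episode records as sketched below; the record field names are assumptions.

```python
# Sketch of the R2R evaluation metrics described above (SR, NE, TL, OR, SPL).
import numpy as np

def evaluate(episodes, success_radius=3.0):
    """episodes: list of dicts with keys 'final_dist', 'min_dist', 'traj_len', 'shortest_len'."""
    ne = np.mean([e["final_dist"] for e in episodes])                      # Navigation Error
    tl = np.mean([e["traj_len"] for e in episodes])                        # Trajectory Length
    sr = np.mean([e["final_dist"] < success_radius for e in episodes])     # Success Rate
    orr = np.mean([e["min_dist"] < success_radius for e in episodes])      # Oracle success Rate
    spl = np.mean([(e["final_dist"] < success_radius) *
                   e["shortest_len"] / max(e["traj_len"], e["shortest_len"])
                   for e in episodes])                                     # SPL
    return {"SR": sr, "NE": ne, "TL": tl, "OR": orr, "SPL": spl}

print(evaluate([{"final_dist": 2.1, "min_dist": 1.0, "traj_len": 12.0, "shortest_len": 10.0},
                {"final_dist": 6.5, "min_dist": 2.8, "traj_len": 20.0, "shortest_len": 9.0}]))
```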
In the embodiment, the learning rate of the model is fixed at 10^-5 throughout training, and the AdamW optimizer is used. For data, a mixture of Room-to-Room and the augmented data from PREVALENT is chosen to train the model. Visual information in the environment is encoded by a ResNet-152 network pre-trained on the Places365 dataset. The training batch size is 8, and model training is performed on a single NVIDIA 2080Ti GPU.
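The optimizer setup described above corresponds, for example, to the following configuration; the model object is a placeholder.

```python
# Sketch of the training configuration: AdamW, fixed learning rate 1e-5, batch size 8.
import torch

model = torch.nn.Linear(10, 10)          # stand-in for the navigation model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
batch_size = 8
```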
The application compares the performance of sequence-to-sequence models with that of scene-memory-based models; the comparison results are given in Table 1 below. Comparing models that make navigation decisions over the local action space of the current position (rows 1-4 of Table 1) with scene-memory-based models (rows 5-7 of Table 1), Table 1 shows that the navigation success rate of scene-memory-based agents is clearly higher, but because no suitable constraint is placed on the exploration process, the navigation path length TL becomes too long and the path-length-weighted success rate SPL drops significantly. The proposed distance-perception-based Transformer visual language navigation algorithm maintains the original success rate while significantly reducing the navigation path length TL and the navigation error NE, so the SPL of the scene-memory-based agent improves markedly and navigation efficiency increases greatly.
[Table 1 (rendered as an image in the original)]
Meanwhile, an ablation experiment was carried out on the proposed algorithm on this dataset; the results are given in rows 8-10 of Table 1, where w/o denotes "without": "w/o distance fusion" removes the distance-fusion module, so the agent directly selects a candidate position to execute an action without considering the distance limit; "w/o distance fusion & action screening" removes both the distance-fusion and the action-screening modules, i.e. the global action space is not screened and distance is not considered.
As can be seen from Table 1, the Transformer model initialized from the pre-trained model with the global action space and the progress monitor (i.e. with the distance-fusion and action-screening modules removed) is superior to the previous best models in both SR and OR, on the validation set (seen environments) as well as the validation set (unseen environments). However, it requires a large amount of exploration in unfamiliar environments. The global action screening improves the model's NE, TL and SPL on the validation set (unseen environments), which indicates that controlling the size of the global action space helps the agent explore unknown environments effectively. In addition, the proposed dynamic fusion module significantly reduces the navigation path length and further improves exploration efficiency. Although these two modules cause some loss in success rate, the SPL improves, meaning they are still beneficial for efficient exploration.
The application also compares the influence of initializing the model with vision-language pre-training on navigation performance; the experimental results are given in Table 2.
[Table 2 (rendered as an image in the original)]
Table 2 compares the performance of the model with random initialization and with initialization from PREVALENT trained for 50,000 iterations. The model initialized from the pre-trained model shows clear advantages on the main metrics, which demonstrates the effectiveness of introducing the pre-trained model.
It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention. However, such variations are within the scope of the invention as long as they do not depart from the spirit of the invention. In addition, certain terminology used in the description and claims of the present application is not limiting, but is used for convenience only.

Claims (10)

1. A distance-perception-based Transformer visual language navigation algorithm, characterized by comprising the following steps:
S1. Perform navigation initialization according to the navigation task and the visual information in the perceivable environment, the main contents to be initialized being the scene memory structure G_t, the navigation module, the navigation state h_0 and the navigation instruction information X;
S2. According to the scene memory structure G_t at the current navigation progress t, extract the visual information V_t of the current position, the distance information D_t and the visual and spatial information of the candidate positions, and construct the global action space corresponding to the current position on this basis;
S3. Use the progress monitor f_monitor to screen the global action space and select the first n candidate positions with the highest estimated progress as the action space for the action decision, wherein n is a positive integer greater than 0;
S4. Input the visual information V_t, the instruction information X and the distance information D_t into the distance-perception Transformer navigation module based on the dynamic distance-fusion module, update the navigation state h_t of the agent, output the distance-weighted action probability distribution p'^a_t corresponding to the current position, and select the candidate node with the maximum probability as the next action;
S5. Update the scene memory structure G_t to G_(t+1) according to the visual features observed after the action is executed and the action space;
S6. Repeat S2-S5 until the algorithm judges that the navigation task is completed or the maximum number of moving steps K is reached.
2. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein step S1 further comprises:
A1. Construct a directed graph G_0 = (N_0, ε_0) from the visual features of the observable range at the starting point, where a node in N_0 corresponds to the starting position and an edge e_(u,s) = (u,s) ∈ ε_0 encodes the relative spatial relation between positions u and s, mainly the spatial relation between the starting position and the candidate positions;
A2. Replace the randomly initialized parameters of the Transformer model in the navigation module with the parameters of a pre-trained language model;
A3. Split the natural-language instruction text U into sentences with the clause separator [SEP] according to the sentence structure of the original instruction, and add a [CLS] token at the initial position to track the recurrent state during navigation; the preprocessed sequence is passed through a multi-layer Transformer network to obtain the initialized instruction information X and the navigation state h_0 as:
h_0, X = BERT([CLS], U, [SEP]).
3. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein step S2 further comprises constructing the global action space corresponding to the current position with formula (2) (not reproduced here), wherein the first symbol denotes the number of locations the agent has visited at the current progress; K_u denotes the candidate locations directly connected to a visited location u, i.e. a subset of the action space at the current progress; and o_(u,k) denotes the visual and relative spatial position information of the candidate locations stored in the scene memory structure G_t.
4. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein step S3 comprises evaluating the candidates with formula (3) (not reproduced here), wherein h_(t-1) denotes the navigation state of the agent at the previous step and the other input is the visual information of the k-th candidate location; the candidates retained by the screening form the finally screened action space.
5. The distance-perception-based Transformer visual language navigation algorithm according to claim 4, further comprising: the progress monitor f_monitor integrates the navigation state h_(t-1) of the agent at the previous step with the currently observed visual information to estimate the current navigation progress, as in formula (4) (not reproduced here), wherein σ denotes the Sigmoid activation function and W_p, W_v are parameter matrices of learnable fully-connected layers.
6. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein n = 20.
7. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein the distance information D_t encodes the distance from each candidate location to the current location as a 20-dimensional 0-1 vector using one-hot encoding.
8. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein, after receiving the visual information V_t, the instruction information X and the distance information D_t, the distance-perception Transformer navigation module based on the dynamic distance-fusion module makes the action decision at progress t and updates the current navigation state through the following steps:
B1. Compute the attention scores of the navigation state vector with respect to the language, visual and distance information, and the overall attention scores and attention weights over all language features;
B2. Use the attention weights of the language and of the vision to obtain language-weighted and vision-weighted information respectively, perform cross-modal matching by element-wise multiplication, concatenate the result with the navigation state vector output by the last Transformer layer, and map it to a new state representation;
B3. Concatenate the spatial information to the newly generated navigation state vector and map it to the final navigation state representation;
B4. Take the attention scores corresponding to vision and to distance as the action probability distribution and the distance weight distribution, respectively;
B5. Combine the action probability distribution with the distance weight distribution to obtain the distance-weighted action probability distribution.
9. The distance-perception-based Transformer visual language navigation algorithm according to claim 1 or 8, wherein the agent is trained with a hybrid mode of reinforcement learning and imitation learning and optimized with formula (5) (not reproduced here), wherein the sampled action is the action obtained by sampling from the predicted probability distribution; A_t denotes the reward corresponding to the action; the ground-truth action denotes the correct action; λ denotes the coefficient of the imitation-learning loss, μ the coefficient of the distance loss, and γ the coefficient of the progress-monitor loss; p'^a_t denotes the action probability output by the agent; the progress monitor outputs an estimate of the task progress; r_t denotes the true task progress of the navigation; the shortest-path distance vector and the distance weight vector output by the model are combined by inner product; T denotes the total number of steps of the navigation task; and i indexes the elements of the two vectors in the inner-product operation.
10. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein step S5 comprises: at navigation step t, after a candidate position u is selected from the action space, adding a new node to the scene memory structure to represent u; when the agent reaches position u, deleting the navigable node corresponding to u in G_t and adding the navigable nodes discovered after reaching position u; and meanwhile updating the visual information and the relative spatial information in the graph according to the navigation state vector.
CN202211342144.6A 2022-10-31 2022-10-31 Distance perception-based Transformer visual language navigation algorithm Pending CN115906831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211342144.6A CN115906831A (en) 2022-10-31 2022-10-31 Distance perception-based Transformer visual language navigation algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211342144.6A CN115906831A (en) 2022-10-31 2022-10-31 Distance perception-based Transformer visual language navigation algorithm

Publications (1)

Publication Number Publication Date
CN115906831A true CN115906831A (en) 2023-04-04

Family

ID=86480598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211342144.6A Pending CN115906831A (en) 2022-10-31 2022-10-31 Distance perception-based Transformer visual language navigation algorithm

Country Status (1)

Country Link
CN (1) CN115906831A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117875535A (en) * 2024-03-13 2024-04-12 中南大学 Method and system for planning picking and delivering paths based on historical information embedding
CN117875535B (en) * 2024-03-13 2024-06-04 中南大学 Method and system for planning picking and delivering paths based on historical information embedding


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination