CN115906831A - Distance perception-based Transformer visual language navigation algorithm - Google Patents

Distance perception-based Transformer visual language navigation algorithm

Info

Publication number
CN115906831A
CN115906831A (application CN202211342144.6A)
Authority
CN
China
Prior art keywords
navigation
distance
information
visual
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211342144.6A
Other languages
Chinese (zh)
Inventor
魏忠钰
杜梦飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202211342144.6A priority Critical patent/CN115906831A/en
Publication of CN115906831A publication Critical patent/CN115906831A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Navigation (AREA)

Abstract

The invention discloses a distance-perception-based Transformer visual language navigation algorithm, belonging to the technical field of visual-language cross-modal learning. The algorithm works as follows. The visual information of the agent's perceivable area, the instruction information and a memory structure are initialized. A scene-memory updating module based on a graph data structure, combined with a vision-language multi-modal pre-trained model, fuses the exploration information gathered during navigation, enhancing the agent's perception of the environment. A distance-based progress monitor compresses the action space of each decision during navigation, reducing computational resources and accelerating model training. A dynamic distance-fusion module incorporates distance information into the action decision, so that the algorithm balances the exploration path length against global exploration and improves the efficiency of the navigation task. The proposed distance-perception-based Transformer visual language navigation algorithm preserves the high navigation success rate of scene-memory-based algorithms while markedly improving exploration efficiency.

Description

Transformer visual language navigation algorithm based on distance perception
Technical Field
The invention relates to the technical field of visual-language cross-modal learning, in particular to a Transformer visual language navigation algorithm based on distance perception.
Background
The Visual Language Navigation (VLN) task aims to train an agent to reach a target location in an unstructured, unseen environment by performing a series of actions, combining natural-language instructions with the visual information observed by the agent.
The visual language navigation task requires the agent to possess comprehensive capabilities in natural language understanding, visual environment perception, multi-modal feature alignment and reasonable policy decision-making.
Most VLN algorithms adopt a sequence-to-sequence (Seq2Seq) framework [1] and model the temporal state of agent navigation with a long short-term memory network (LSTM) to process the language-image information stream during navigation. This prevents the agent from directly accessing historical information gathered during navigation, such as the spatial layout of visited locations and previous decisions. Another class of algorithms stores the explored scene in a graph structure during navigation, expanding the local action space into a global action space; this helps the agent evaluate all currently navigable positions and gives it a strong ability to correct errors in time.
The existing methods have the following two problems:
1. The rapid expansion of the global action space significantly slows model convergence, and the excessive number of candidate positions consumes a large amount of GPU resources during training;
2. These algorithms generally lack backtracking constraints, so invalid repeated paths appear frequently, which lengthens the navigation path and reduces navigation efficiency.
Disclosure of Invention
The invention provides a distance-perception-based Transformer visual language navigation algorithm. A scene-memory updating module based on a graph data structure, combined with a vision-language multi-modal pre-trained model, further fuses the exploration information gathered during navigation and thereby enhances the agent's perception of the environment. A distance-based progress monitor compresses the action space of each decision during navigation, reducing computational resources and accelerating model training. A dynamic distance-fusion module incorporates distance information into the action decision, so that the algorithm balances the exploration path length against global exploration: it explores sufficiently and retains the ability to correct errors in time, while reducing the selection of unnecessarily distant candidate positions, thereby improving the efficiency of the vision-language navigation task.
In order to achieve the purpose, the invention adopts the following technical scheme:
the visual language navigation algorithm of the Transformer based on distance perception is provided, and is characterized by comprising the following steps of:
s1, performing navigation initialization according to a navigation task and visual information in a sensible environment, wherein a main initialization component is a scene memory structure
Figure BDA0003916472650000021
Navigation module and navigation state h 0 Navigation instruction information X;
s2, according to the scene memory structure of the current navigation progress t
Figure BDA0003916472650000022
Extracting visual information V of current position t Distance information D t And candidate position vision and space information and constructing a global action space corresponding to the current position on the basis of the candidate position vision and space information;
s3, utilizing progress monitor f monitor Screening the global action space, and selecting the first n candidate positions with the highest evaluation progress as the actual action space in the action decision process;
s4, the visual information V is processed t Command information X and distance information D t Inputting the navigation state h of the intelligent agent into a distance perception Transformer navigation module based on a dynamic distance fusion module, and updating the navigation state h of the intelligent agent t And outputting the distance weighted action probability distribution corresponding to the current position
Figure BDA0003916472650000023
Selecting candidate nodes corresponding to the maximum probability as the next action;
s5, updating the scene memory structure according to the visual characteristics observed after the action is executed
Figure BDA0003916472650000024
To>
Figure BDA0003916472650000025
And S6, repeatedly executing S2-S5 until the intelligent algorithm judges that the navigation task is completed or the maximum moving step number K is reached.
Preferably, the navigation initialization is performed by:
A1. Construct a directed graph G_0 = (N_0, ε_0) from the visual features of the observable range at the starting point, where a node in N_0 corresponds to the starting position and an edge e_(u,s) = (u,s) ∈ ε_0 encodes the relative spatial relation between positions u and s, mainly the spatial relation between the starting position and the candidate positions;
A2. Replace the randomly initialized parameters of the Transformer model in the navigation module with the parameters of a pre-trained language model;
A3. Split the natural-language instruction text U into sentences with the clause separator [SEP] according to the sentence structure of the original instruction, and add a [CLS] token at the initial position to track the recurrent state during navigation. The preprocessed sequence is passed through a multi-layer Transformer network to obtain the initialized instruction information X and the navigation state h_0, as in formula (1):
h_0, X = BERT([CLS], U, [SEP])    (1)
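For illustration, the encoding in formula (1) can be sketched as follows. This is a minimal example assuming the HuggingFace transformers API, with bert-base-uncased standing in for the PREVALENT-initialized encoder actually used by the invention; the instruction text is also invented.

```python
# Minimal sketch of step A3 / formula (1): split the instruction into clauses, join
# them with [SEP], let the tokenizer prepend [CLS], and take the [CLS] output as the
# initial navigation state h_0 and the remaining token outputs as the instruction X.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # stand-in for PREVALENT
encoder = BertModel.from_pretrained("bert-base-uncased")

instruction = "Walk past the sofa. Turn left at the kitchen. Stop by the fridge."
sentences = [s.strip() for s in instruction.split(".") if s.strip()]
text = " [SEP] ".join(sentences)                  # clause separators, as in step A3
inputs = tokenizer(text, return_tensors="pt")     # tokenizer adds [CLS] ... [SEP]

with torch.no_grad():
    outputs = encoder(**inputs)
hidden = outputs.last_hidden_state                # shape (1, L, 768)
h0 = hidden[:, 0]                                 # [CLS] vector -> navigation state h_0
X = hidden[:, 1:]                                 # remaining tokens -> instruction info X
```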
Preferably, the global action space corresponding to the current progress t is constructed from the scene memory structure G_t according to formula (2) (not reproduced here). In formula (2), the first symbol denotes the number of locations the agent has visited at the current progress; K_u denotes the candidate locations directly connected to a visited location u, i.e. a subset of the action space at the current progress; and o_(u,k) denotes the visual and relative spatial position information of the candidate locations stored in the scene memory structure G_t.
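A minimal sketch of this construction is given below, with an assumed (hypothetical) data layout for the scene memory; it simply collects one action entry o_(u,k) per navigable candidate of every visited node.

```python
# Sketch of building the global action space from a scene memory graph: every candidate
# position adjacent to a visited node u contributes one action entry o_(u,k).
from dataclasses import dataclass, field

@dataclass
class SceneMemory:
    # visited node id -> list of (candidate_id, visual_feature, relative_pose) triples
    candidates: dict = field(default_factory=dict)

    def global_action_space(self):
        """Collect o_(u,k) over all visited nodes u and their navigable candidates k."""
        actions = []
        for u, cands in self.candidates.items():
            for cand_id, vis_feat, rel_pose in cands:
                actions.append({"from": u, "candidate": cand_id,
                                "visual": vis_feat, "pose": rel_pose})
        return actions

memory = SceneMemory(candidates={
    "start": [("c1", [0.1, 0.2], (1.0, 0.0)), ("c2", [0.3, 0.1], (0.0, 2.0))],
    "c1":    [("c3", [0.5, 0.4], (2.5, 1.0))],
})
print(len(memory.global_action_space()))   # 3 candidate actions in the global space
```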
Preferably, the global action space is screened with the progress monitor f_monitor according to formula (3) (not reproduced here). In formula (3), h_(t-1) denotes the navigation state of the agent at the previous step, and the other input is the visual information of the k-th candidate location; the candidates retained by the screening form the actual action space used for the decision.
Preferably, the progress monitor f_monitor integrates the navigation state h_(t-1) of the agent at the previous step with the currently observed visual information to estimate the current navigation progress, as in formula (4) (not reproduced here). In formula (4), σ denotes the Sigmoid activation function, and W_p, W_v are parameter matrices of learnable fully-connected layers.
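A minimal sketch of the screening step follows. Since formula (4) is not reproduced in the text, the exact combination of h_(t-1) and the candidate visual feature is an assumption (an element-wise product of two learned projections followed by a Sigmoid); the dimensions and the ProgressMonitor class are likewise illustrative.

```python
# Sketch of distance-aware action-space screening: score every candidate with a
# progress estimate and keep the top-n candidates as the actual action space.
import torch
import torch.nn as nn

class ProgressMonitor(nn.Module):
    def __init__(self, state_dim=768, vis_dim=2048, hidden=512):
        super().__init__()
        self.W_p = nn.Linear(state_dim, hidden)   # projects the previous navigation state
        self.W_v = nn.Linear(vis_dim, hidden)     # projects candidate visual features

    def forward(self, h_prev, cand_vis):
        # h_prev: (state_dim,), cand_vis: (num_candidates, vis_dim)
        scores = torch.sigmoid((self.W_p(h_prev) * self.W_v(cand_vis)).sum(-1))
        return scores                              # estimated progress per candidate

monitor = ProgressMonitor()
h_prev = torch.randn(768)
cand_vis = torch.randn(50, 2048)                  # 50 candidates in the global space
progress = monitor(h_prev, cand_vis)
top_idx = torch.topk(progress, k=20).indices      # keep the n = 20 most promising ones
```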
Preferably, n =20.
Preferably, the distance information D_t encodes the distance from each candidate location to the current location as a 20-dimensional 0-1 vector using one-hot encoding.
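For example, the one-hot encoding could be realized as below; the 1-meter bin width and the capping at the last bin are assumptions, since the patent only specifies a 20-dimensional 0-1 vector.

```python
# Sketch of encoding a candidate-to-current distance as a 20-dimensional one-hot vector.
import torch

def distance_one_hot(distance_m: float, num_bins: int = 20) -> torch.Tensor:
    bin_idx = min(int(distance_m), num_bins - 1)   # assumed 1-meter bins, capped at bin 19
    return torch.nn.functional.one_hot(torch.tensor(bin_idx), num_bins).float()

print(distance_one_hot(3.4))   # 1.0 in position 3, zeros elsewhere
```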
Preferably, after receiving the visual information V_t, the instruction information X and the distance information D_t, the distance-perception Transformer navigation module based on the dynamic distance-fusion module makes the action decision at progress t and updates the current navigation state through the following steps:
B1. Compute the attention scores of the navigation state vector with respect to the language, visual and distance information, and the overall attention scores and attention weights over all language features;
B2. Use the attention weights of the language and of the vision to obtain language-weighted and vision-weighted information respectively, perform cross-modal matching by element-wise multiplication, concatenate the result with the navigation state vector output by the last Transformer layer, and map it to a new state representation;
B3. Concatenate the spatial information to the newly generated navigation state vector and map it to the final navigation state representation;
B4. Take the attention scores corresponding to vision and to distance as the action probability distribution and the distance weight distribution, respectively;
B5. Combine the action probability distribution with the distance weight distribution to obtain the distance-weighted action probability distribution.
Preferably, the agent is trained with a hybrid mode of reinforcement learning and imitation learning and optimized with formula (5) (not reproduced here). In formula (5), the sampled action is the action obtained by sampling from the predicted probability distribution; A_t denotes the reward corresponding to the action; the ground-truth action denotes the correct action; λ denotes the coefficient of the imitation-learning loss, μ the coefficient of the distance loss, and γ the coefficient of the progress-monitor loss.
Preferably, the scene memory of the agent is updated as follows: at navigation step t, after a candidate position u is selected from the action space, the algorithm adds a new node to the scene memory structure to represent u. When the agent reaches position u, the navigable node corresponding to u in G_t is deleted, and the navigable nodes discovered after reaching position u are added. Meanwhile, the visual information and the relative spatial information in the graph are updated according to the navigation state vector.
Preferably, K =15.
The invention has the following beneficial effects:
1. A scene-memory updating module based on a graph data structure is provided and, combined with a vision-language multi-modal pre-trained model, further fuses the exploration information gathered during navigation, enhancing the agent's perception of the environment;
2. A distance-based progress monitor compresses the action space of each decision during navigation, reducing computational resources and accelerating model training;
3. A dynamic distance-fusion module incorporates distance information into the action decision, so that the algorithm balances the exploration path length against global exploration and improves the efficiency of the navigation task.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a diagram of the implementation steps of the distance-perception-based Transformer visual language navigation algorithm provided by the present invention;
FIG. 2 is a system structure diagram of the distance-perception-based Transformer visual language navigation algorithm provided by the present invention.
Detailed Description
The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.
The drawings are for illustration only and are not intended to be limiting; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if the terms "upper", "lower", "left", "right", "inner", "outer", etc. are used for indicating the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not indicated or implied that the referred device or element must have a specific orientation, be constructed in a specific orientation and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes and are not to be construed as limitations of the present patent, and the specific meanings of the terms may be understood by those skilled in the art according to specific situations.
In the description of the present invention, unless otherwise explicitly specified or limited, the term "connected" or the like, if appearing to indicate a connection relationship between the components, is to be understood broadly, for example, as being fixed or detachable or integral; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or may be connected through one or more other components or may be in an interactive relationship with one another. The specific meanings of the above terms in the present invention can be understood in a specific case to those of ordinary skill in the art.
The embodiment of the invention provides a distance perception-based Transformer visual language navigation algorithm, as shown in FIG. 1, which comprises the following steps:
s1, performing navigation initialization according to a navigation task and visual information in a sensible environment, wherein the main initialized content is a scene memory structure
Figure BDA0003916472650000051
Navigation module and navigation state h 0 And navigation instruction information X. Memory structure for a scene->
Figure BDA0003916472650000052
Constructing a directed graph based on the visual characteristics of the observable range at the origin>
Figure BDA0003916472650000053
Wherein node->
Figure BDA0003916472650000054
Corresponding to the start position, edge e u,s =(u,s)∈ε t The spatial relative position relationship between the corresponding positions u and s is mainly the spatial position relationship between the starting position and the candidate position. For the navigation module, parameters of a transform model initialized randomly in the navigation module are replaced according to parameters of a pre-training language model (the invention uses parameters of a PREVALENT pre-training model). For the navigation state in the navigation instruction information, the sentence mark [ SEP ] is utilized according to the natural language instruction text U]Dividing sentences according to the sentence composition of the original instruction, and adding [ CLS ] at the initial position]The markers are used to process loop states in the navigation process.The preprocessed sequence can obtain initialized instruction information X and navigation state h through a multi-layer Transformer network 0
S2. According to the scene memory structure G_t at the current navigation progress t, extract the visual information V_t of the current position, the distance information D_t and the visual and spatial information of the candidate positions, and construct the global action space corresponding to the current position according to formula (2) (not reproduced here). In formula (2), the first symbol denotes the number of locations the agent has visited at the current progress; K_u denotes the candidate locations directly connected to a visited location u, i.e. a subset of the action space at the current progress; and o_(u,k) denotes the visual and relative spatial position information of the candidate locations stored in the scene memory structure G_t.
S3. Use the progress monitor f_monitor to screen the global action space and select the first n candidate positions with the highest estimated progress as the action space (n denotes the size of the screened action space and is an important parameter governing convergence speed and computational resource usage during training; repeated experimental comparison showed that n = 20 accelerates algorithm convergence while preserving the navigation success rate and reduces the computational resources required, so that a model that would otherwise need eight Tesla V100 GPUs can be trained on a single 2080Ti GPU). The progress monitor mainly uses the previous navigation state h_(t-1) and the currently observed visual feature to estimate the current navigation progress, as in formula (4) (not reproduced here), where σ denotes the Sigmoid activation function and W_p, W_v are parameter matrices of learnable fully-connected layers.
S4. Input the visual information V_t, the instruction information X and the distance information D_t into the distance-perception Transformer navigation module based on the dynamic distance-fusion module, update the navigation state h_t of the agent, output the distance-weighted action probability distribution p'^a_t corresponding to the current position, and select the candidate node with the maximum probability as the next action, as in formula (6) (not reproduced here).
Specifically, the state vector and the language vectors output by the k-th attention head of the l-th layer of the multi-layer Transformer structure in DistVLN BERT are denoted by dedicated symbols (shown as images in the original). The attention scores over all language features are computed as in formula (7) (not reproduced here).
Then, the model averages the scores over all K = 12 attention heads and applies a Softmax function to obtain the overall attention weights of the language features, as in formula (8) (not reproduced here).
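The head-averaging and normalisation described for formula (8) can be sketched as follows; random scores stand in for the attention scores produced by the last Transformer layer.

```python
# Sketch of aggregating per-head attention scores over the language tokens:
# average across the K = 12 heads, then normalise with Softmax.
import torch

K, num_tokens = 12, 40
head_scores = torch.randn(K, num_tokens)               # attention scores, one row per head
avg_scores = head_scores.mean(dim=0)                   # average over the 12 heads
language_weights = torch.softmax(avg_scores, dim=-1)   # overall language attention weights
print(language_weights.sum())                          # sums to 1 over the instruction tokens
```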
Similarly, attention scores and attention weights are obtained for the visual features and for the distance features. The model then applies a weighted summation to the input text features and to the visual features respectively to obtain the weighted features, as in formula (9) (not reproduced here).
Subsequently, the model performs cross-modal matching between the weighted language and visual features by element-wise multiplication, concatenates the result with the navigation state vector output by the last layer of DistVLN BERT, and maps it to a new state representation, as in formula (10) (not reproduced here). Finally, the model concatenates the directional features r_t to the newly generated navigation state vector and maps them to the final navigation state representation, as in formula (11) (not reproduced here).
The model then combines the distance-feature attention scores with the visual-feature attention scores to obtain the distance-weighted action probability distribution, as in formula (12) (not reproduced here).
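A minimal sketch of this distance weighting is shown below; because formula (12) is not reproduced in the text, the element-wise combination and renormalisation used here are assumptions.

```python
# Sketch of distance-weighted action probabilities: modulate the visual-candidate
# attention distribution by the distance attention weights, renormalise, and pick
# the candidate with the maximum probability.
import torch

num_candidates = 20
visual_attn = torch.softmax(torch.randn(num_candidates), dim=-1)       # action distribution
distance_weights = torch.softmax(torch.randn(num_candidates), dim=-1)  # distance attention

weighted = visual_attn * distance_weights
action_probs = weighted / weighted.sum()        # distance-weighted action probabilities
next_action = torch.argmax(action_probs)        # candidate with the maximum probability
```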
Reinforcement learning and imitation learning are used to train the agent. For reinforcement learning, the A2C algorithm is used: at each navigation step t the agent outputs an action probability distribution p'^a_t, and the distance from the chosen position to the task goal is used as the reward A_t for the corresponding action. For imitation learning, the agent follows the ground-truth trajectory by taking the correct action at each location, and the cross-entropy loss is computed at each decision step. To train the distance-perception module, the shortest-path distance vector from each candidate location in the scene memory graph to the current navigation location is computed, and its dot product with the distance weight vector output by the model is added as part of the training loss, helping the agent jointly consider the influence of language, vision and distance features on the action decision.
Specifically, the loss function for agent training can be expressed as formula (5) (not reproduced here). In formula (5), the sampled action is the action obtained by sampling from the predicted probability distribution; A_t denotes the reward corresponding to the action; the ground-truth action denotes the correct action; λ denotes the coefficient of the imitation-learning loss; μ denotes the coefficient of the distance loss; γ denotes the coefficient of the progress-monitor loss; r_t denotes the true task progress of the navigation; and the progress monitor outputs an estimate of the task progress.
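The following sketch illustrates a loss of this general shape; the concrete terms (an A2C policy-gradient term weighted by A_t, a cross-entropy imitation term, the distance dot product and a squared progress error) and the coefficient values are assumptions, since formula (5) itself is not reproduced in the text.

```python
# Sketch of a mixed RL + imitation + distance + progress-monitor training loss.
import torch
import torch.nn.functional as F

def mixed_loss(action_logits, sampled_action, reward_A_t, teacher_action,
               distance_vec, distance_weights, progress_pred, progress_true,
               lam=0.2, mu=0.5, gamma=0.5):
    log_probs = F.log_softmax(action_logits, dim=-1)
    rl_term = -reward_A_t * log_probs[sampled_action]          # A2C policy-gradient term
    il_term = F.cross_entropy(action_logits.unsqueeze(0),
                              teacher_action.unsqueeze(0))     # imitation-learning term
    dist_term = torch.dot(distance_weights, distance_vec)      # distance supervision
    prog_term = (progress_pred - progress_true) ** 2           # progress-monitor regression
    return rl_term + lam * il_term + mu * dist_term + gamma * prog_term

loss = mixed_loss(torch.randn(20), torch.tensor(3), torch.tensor(1.5),
                  torch.tensor(7), torch.rand(20), torch.softmax(torch.randn(20), -1),
                  torch.tensor(0.4), torch.tensor(0.6))
```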
S5. Update the scene memory structure G_t to G_(t+1) according to the visual features observed after the action is executed. Specifically, at navigation step t, after a candidate position u is selected from the action space, the algorithm adds a new node to the scene memory structure to represent u. When the agent reaches position u, the navigable node corresponding to u in G_t is deleted, and the navigable nodes discovered after reaching position u are added. Meanwhile, the algorithm updates the visual and spatial information in the graph according to the landmark- and action-related text features in the language instruction and the current navigation state, as in formulas (13) and (14) (not reproduced here).
Here h_t is the current navigation state, and the landmark and action representations extracted from the text are used to resolve the landmark and action features that the agent attends to in the instruction. Next, the model concatenates the navigation state with the landmark and action text representations respectively and maps them to a landmark-aware state and an action-aware state, as in formula (15) (not reproduced here).
Then, for the nodes and edges of the scene memory structure, the stored visual information and spatial information are updated respectively using an attention mechanism, as in formulas (16) and (17) (not reproduced here).
The invention then uses an iterative update method based on long-range reasoning, which iteratively updates the visual and directional features by exchanging messages between nodes through trainable functions, so that the visual and spatial information of the nodes and edges of the graph is considered jointly. Specifically, at the s-th iteration, the visual information of each node u is updated as in formulas (18) and (19) (not reproduced here), where the update function U(·) is implemented by a gated recurrent unit (GRU). After S iterations of aggregation, the model further refines the visual and directional features, i.e. node u captures information within its S-hop neighbourhood.
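A minimal sketch of such a GRU-based iterative update on a toy graph is given below; the mean aggregation of neighbour features is an assumption, as the exact form of formulas (18) and (19) is not reproduced in the text.

```python
# Sketch of iterative node updates: at each round every node aggregates its neighbours'
# visual features and refreshes its own feature with a GRU cell, so after S rounds a
# node has absorbed information from its S-hop neighbourhood.
import torch
import torch.nn as nn

dim = 256
gru = nn.GRUCell(input_size=dim, hidden_size=dim)   # the trainable update function U(.)

# toy graph: node index -> list of neighbour indices
neighbours = {0: [1], 1: [0, 2], 2: [1]}
node_feats = torch.randn(3, dim)                    # visual features stored on the nodes

S = 2
for _ in range(S):
    messages = torch.stack([node_feats[nbrs].mean(dim=0)
                            for _, nbrs in sorted(neighbours.items())])
    node_feats = gru(messages, node_feats)          # message as input, old feature as hidden
```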
S6. Repeat S2-S5 until the algorithm judges that the navigation task is completed or the maximum number of moving steps K is reached (K = 15 is chosen here based on experience from prior studies).
To verify the performance of the application on the vision-language navigation task, the model is evaluated on the Room-to-Room dataset, which contains 14,025, 1,020 and 2,349 navigation tasks in the training set, the validation set (seen environments) and the validation set (unseen environments), respectively. The application focuses on the following five evaluation metrics to assess the performance of the VLN algorithm:
(1) Success Rate (SR): the percentage of navigation tasks in which the agent stops less than 3 meters from the target location. This is the most direct metric of navigation performance.
(2) Navigation Error (NE): the shortest-path distance between the agent's stopping position and the target position.
(3) Trajectory Length (TL): the average total length of the agent's navigation trajectories, which reflects navigation efficiency to some extent.
(4) Oracle success Rate (OR): the proportion of navigation tasks in which, at some point during navigation, the agent's position is less than 3 meters from the target.
(5) Success weighted by Path Length (SPL) [31]: a metric that trades off SR against TL.
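For reference, the five metrics can be computed from per-episode records as sketched below; the record field names are assumptions.

```python
# Sketch of the R2R evaluation metrics described above (SR, NE, TL, OR, SPL).
import numpy as np

def evaluate(episodes, success_radius=3.0):
    """episodes: list of dicts with keys 'final_dist', 'min_dist', 'traj_len', 'shortest_len'."""
    ne = np.mean([e["final_dist"] for e in episodes])                      # Navigation Error
    tl = np.mean([e["traj_len"] for e in episodes])                        # Trajectory Length
    sr = np.mean([e["final_dist"] < success_radius for e in episodes])     # Success Rate
    orr = np.mean([e["min_dist"] < success_radius for e in episodes])      # Oracle success Rate
    spl = np.mean([(e["final_dist"] < success_radius) *
                   e["shortest_len"] / max(e["traj_len"], e["shortest_len"])
                   for e in episodes])                                     # SPL
    return {"SR": sr, "NE": ne, "TL": tl, "OR": orr, "SPL": spl}

print(evaluate([{"final_dist": 2.1, "min_dist": 1.0, "traj_len": 12.0, "shortest_len": 10.0},
                {"final_dist": 6.5, "min_dist": 2.8, "traj_len": 20.0, "shortest_len": 9.0}]))
```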
In the embodiment, the learning rate of the model is fixed at 10^-5 throughout training, and the AdamW optimizer is used. For data, a mixture of Room-to-Room and the augmented data from PREVALENT is chosen to train the model. Visual information in the environment is encoded by a ResNet-152 network pre-trained on the Places365 dataset. The training batch size is 8, and model training is performed on a single NVIDIA 2080Ti GPU.
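The optimizer setup described above corresponds, for example, to the following configuration; the model object is a placeholder.

```python
# Sketch of the training configuration: AdamW, fixed learning rate 1e-5, batch size 8.
import torch

model = torch.nn.Linear(10, 10)          # stand-in for the navigation model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
batch_size = 8
```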
The application compares the performance of sequence-to-sequence models with that of scene-memory-based models; the comparison results are given in Table 1 below. Comparing models that make navigation decisions over the local action space of the current position (rows 1-4 of Table 1) with scene-memory-based models (rows 5-7 of Table 1), Table 1 shows that the navigation success rate of scene-memory-based agents is clearly higher, but because no suitable constraint is placed on the exploration process, the navigation path length TL becomes too long and the path-length-weighted success rate SPL drops significantly. The proposed distance-perception-based Transformer visual language navigation algorithm maintains the original success rate while significantly reducing the navigation path length TL and the navigation error NE, so the SPL of the scene-memory-based agent improves markedly and navigation efficiency increases greatly.
[Table 1 (rendered as an image in the original)]
Meanwhile, an ablation experiment was carried out on the proposed algorithm on this dataset; the results are given in rows 8-10 of Table 1, where w/o denotes "without": "w/o distance fusion" removes the distance-fusion module, so the agent directly selects a candidate position to execute an action without considering the distance limit; "w/o distance fusion & action screening" removes both the distance-fusion and the action-screening modules, i.e. the global action space is not screened and distance is not considered.
As can be seen from Table 1, the Transformer model initialized from the pre-trained model with the global action space and the progress monitor (i.e. with the distance-fusion and action-screening modules removed) is superior to the previous best models in both SR and OR, on the validation set (seen environments) as well as the validation set (unseen environments). However, it requires a large amount of exploration in unfamiliar environments. The global action screening improves the model's NE, TL and SPL on the validation set (unseen environments), which indicates that controlling the size of the global action space helps the agent explore unknown environments effectively. In addition, the proposed dynamic fusion module significantly reduces the navigation path length and further improves exploration efficiency. Although these two modules cause some loss in success rate, the SPL improves, meaning they are still beneficial for efficient exploration.
The application also compares the influence of initializing the model with vision-language pre-training on navigation performance; the experimental results are given in Table 2.
[Table 2 (rendered as an image in the original)]
Table 2 compares the performance of the model with random initialization and with initialization from PREVALENT trained for 50,000 iterations. The model initialized from the pre-trained model shows clear advantages on the main metrics, which demonstrates the effectiveness of introducing the pre-trained model.
It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention. However, such variations are within the scope of the invention as long as they do not depart from the spirit of the invention. In addition, certain terminology used in the description and claims of the present application is not limiting, but is used for convenience only.

Claims (10)

1. A distance-perception-based Transformer visual language navigation algorithm, characterized by comprising the following steps:
S1. Perform navigation initialization according to the navigation task and the visual information in the perceivable environment, the main contents to be initialized being the scene memory structure G_t, the navigation module, the navigation state h_0 and the navigation instruction information X;
S2. According to the scene memory structure G_t at the current navigation progress t, extract the visual information V_t of the current position, the distance information D_t and the visual and spatial information of the candidate positions, and construct the global action space corresponding to the current position on this basis;
S3. Use the progress monitor f_monitor to screen the global action space and select the first n candidate positions with the highest estimated progress as the action space for the action decision, wherein n is a positive integer greater than 0;
S4. Input the visual information V_t, the instruction information X and the distance information D_t into the distance-perception Transformer navigation module based on the dynamic distance-fusion module, update the navigation state h_t of the agent, output the distance-weighted action probability distribution p'^a_t corresponding to the current position, and select the candidate node with the maximum probability as the next action;
S5. Update the scene memory structure G_t to G_(t+1) according to the visual features observed after the action is executed and the action space;
S6. Repeat S2-S5 until the algorithm judges that the navigation task is completed or the maximum number of moving steps K is reached.
2. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein step S1 further comprises:
A1. Construct a directed graph G_0 = (N_0, ε_0) from the visual features of the observable range at the starting point, where a node in N_0 corresponds to the starting position and an edge e_(u,s) = (u,s) ∈ ε_0 encodes the relative spatial relation between positions u and s, mainly the spatial relation between the starting position and the candidate positions;
A2. Replace the randomly initialized parameters of the Transformer model in the navigation module with the parameters of a pre-trained language model;
A3. Split the natural-language instruction text U into sentences with the clause separator [SEP] according to the sentence structure of the original instruction, and add a [CLS] token at the initial position to track the recurrent state during navigation; the preprocessed sequence is passed through a multi-layer Transformer network to obtain the initialized instruction information X and the navigation state h_0 as:
h_0, X = BERT([CLS], U, [SEP]).
3. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein step S2 further comprises constructing the global action space corresponding to the current position with formula (2) (not reproduced here), wherein the first symbol denotes the number of locations the agent has visited at the current progress; K_u denotes the candidate locations directly connected to a visited location u, i.e. a subset of the action space at the current progress; and o_(u,k) denotes the visual and relative spatial position information of the candidate locations stored in the scene memory structure G_t.
4. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein step S3 comprises evaluating the candidates with formula (3) (not reproduced here), wherein h_(t-1) denotes the navigation state of the agent at the previous step and the other input is the visual information of the k-th candidate location; the candidates retained by the screening form the finally screened action space.
5. The distance-perception-based Transformer visual language navigation algorithm according to claim 4, further comprising: the progress monitor f_monitor integrates the navigation state h_(t-1) of the agent at the previous step with the currently observed visual information to estimate the current navigation progress, as in formula (4) (not reproduced here), wherein σ denotes the Sigmoid activation function and W_p, W_v are parameter matrices of learnable fully-connected layers.
6. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein n = 20.
7. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein the distance information D_t encodes the distance from each candidate location to the current location as a 20-dimensional 0-1 vector using one-hot encoding.
8. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein, after receiving the visual information V_t, the instruction information X and the distance information D_t, the distance-perception Transformer navigation module based on the dynamic distance-fusion module makes the action decision at progress t and updates the current navigation state through the following steps:
B1. Compute the attention scores of the navigation state vector with respect to the language, visual and distance information, and the overall attention scores and attention weights over all language features;
B2. Use the attention weights of the language and of the vision to obtain language-weighted and vision-weighted information respectively, perform cross-modal matching by element-wise multiplication, concatenate the result with the navigation state vector output by the last Transformer layer, and map it to a new state representation;
B3. Concatenate the spatial information to the newly generated navigation state vector and map it to the final navigation state representation;
B4. Take the attention scores corresponding to vision and to distance as the action probability distribution and the distance weight distribution, respectively;
B5. Combine the action probability distribution with the distance weight distribution to obtain the distance-weighted action probability distribution.
9. The distance-perception-based Transformer visual language navigation algorithm according to claim 1 or 8, wherein the agent is trained with a hybrid mode of reinforcement learning and imitation learning and optimized with formula (5) (not reproduced here), wherein the sampled action is the action obtained by sampling from the predicted probability distribution; A_t denotes the reward corresponding to the action; the ground-truth action denotes the correct action; λ denotes the coefficient of the imitation-learning loss, μ the coefficient of the distance loss, and γ the coefficient of the progress-monitor loss; p'^a_t denotes the action probability output by the agent; the progress monitor outputs an estimate of the task progress; r_t denotes the true task progress of the navigation; the shortest-path distance vector and the distance weight vector output by the model are combined by inner product; T denotes the total number of steps of the navigation task; and i indexes the elements of the two vectors in the inner-product operation.
10. The distance-perception-based Transformer visual language navigation algorithm according to claim 1, wherein step S5 comprises: at navigation step t, after a candidate position u is selected from the action space, adding a new node to the scene memory structure to represent u; when the agent reaches position u, deleting the navigable node corresponding to u in G_t and adding the navigable nodes discovered after reaching position u; and meanwhile updating the visual information and the relative spatial information in the graph according to the navigation state vector.
CN202211342144.6A 2022-10-31 2022-10-31 Distance perception-based Transformer visual language navigation algorithm Pending CN115906831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211342144.6A CN115906831A (en) 2022-10-31 2022-10-31 Distance perception-based Transformer visual language navigation algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211342144.6A CN115906831A (en) 2022-10-31 2022-10-31 Distance perception-based Transformer visual language navigation algorithm

Publications (1)

Publication Number Publication Date
CN115906831A true CN115906831A (en) 2023-04-04

Family

ID=86480598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211342144.6A Pending CN115906831A (en) 2022-10-31 2022-10-31 Distance perception-based Transformer visual language navigation algorithm

Country Status (1)

Country Link
CN (1) CN115906831A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117875535A (en) * 2024-03-13 2024-04-12 中南大学 Method and system for planning picking and delivering paths based on historical information embedding
CN117875535B (en) * 2024-03-13 2024-06-04 中南大学 Method and system for planning picking and delivering paths based on historical information embedding


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination