CN116242359A - Visual language navigation method, device and medium based on scene fusion knowledge - Google Patents
Visual language navigation method, device and medium based on scene fusion knowledge
- Publication number
- CN116242359A (application CN202310087842.4A)
- Authority
- CN
- China
- Prior art keywords
- knowledge
- scene
- features
- visual
- agent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/20—Instruments for performing navigational calculations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/35—Categorising the entire scene, e.g. birthday party or wedding scene
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a visual language navigation method, device and medium based on scene fusion knowledge. The method comprises the following steps: acquiring a visual language navigation task; acquiring natural language instruction features, scene features and object features according to the visual language navigation task, iteratively updating the weights of the object features with a semantic- and position-aware graph convolution network, and retrieving knowledge-enhanced object features using the object labels in the scene; and fusing the natural language instruction features, the scene features and the knowledge-enhanced object features with a scene- and knowledge-aware multi-modal decision module, performing action prediction and updating the running state of the agent until the agent chooses to stop. By exploiting the semantic and positional relations among the objects and the knowledge in the scene, the invention aligns the scene features with the natural language instruction features more closely and enables the agent to navigate effectively under limited visual observation and in unseen environments. The invention can be widely applied in the technical field of visual language navigation.
Description
Technical Field
The invention relates to the technical field of visual language navigation, in particular to a visual language navigation method, device and medium based on scene fusion knowledge.
Background
With the development and maturation of artificial intelligence in recent years, computer vision, natural language processing and robotics have been widely applied in many fields. Endowing robots with human-like intelligence, so that an agent can "understand" human language, "read" visual information and act autonomously in the service of humans, is a long-term goal. The field of visual language navigation studies exactly such methods: under the guidance of natural language instructions, an agent continuously explores a visual environment and finally completes a designated task.
Most existing visual language navigation methods rely on fusing and aligning visual features with natural language instruction features so that the agent can follow a path-guided instruction step by step. In real scenes, however, object-finding navigation tasks are of greater practical value. The natural language instructions in such tasks often contain only a description of the target object and give no detailed path description. Under existing models, limited instruction content and insufficient perception of the overall scene layout make it difficult for the agent to explore the environment effectively and locate the target object.
Disclosure of Invention
In order to solve at least one of the technical problems existing in the prior art to a certain extent, the invention aims to provide a visual language navigation method, device and medium based on scene fusion knowledge.
The technical scheme adopted by the invention is as follows:
a visual language navigation method based on scene fusion knowledge comprises the following steps:
acquiring a visual language navigation task, wherein the visual language navigation task comprises a natural language instruction, initial visual information and initial position information;
encoding the natural language instruction into natural language instruction characteristics and an initial running state of the agent; coding and splicing the visual information and the position information to obtain scene characteristics;
extracting object labels from the visual information, and encoding semantic labels and position information of the objects into object features so as to update node characterization in a graph convolution network;
iteratively updating weights of object features by using a graph convolution network based on semantic and location awareness, and retrieving knowledge-enhanced object features by using object tags in a scene;
and using a multi-modal decision module based on scene and knowledge perception to fuse the natural language instruction features, scene features and knowledge-enhanced object features, performing action prediction and updating the running state of the agent until the agent chooses to stop.
Further, the natural language instruction is encoded into natural language instruction characteristics and an initial running state of the agent; encoding and splicing the visual information and the position information to obtain scene characteristics, wherein the method comprises the following steps:
after the agent obtains the visual language navigation task, it obtains a natural language instruction I, where L denotes the length of the instruction; the agent is placed at the starting position;
in the initialization phase, the [CLS] token, the instruction sequence I and the separator token [SEP] are input into a Transformer for encoding, yielding the initial running state s_0 of the agent and the natural language instruction features X:
s_0, X = Transformer([CLS], I, [SEP])
navigation is a continuous iterative process: at each time t the agent acquires visual information and position information of the scene, where the visual information comprises the panoramic image at the agent's current position, divided into 36 discrete views, and CLIP-ViT-B-32 is used as the visual encoder to obtain the visual features of the panorama; the position information of a view comprises its steering angle θ_i and elevation angle φ_i relative to the current position, and a 128-dimensional direction encoding d_i is constructed by repeating the vector (cos θ_i, sin θ_i, cos φ_i, sin φ_i) 32 times;
for the current scene there are N_t navigable directions; the corresponding position information features are obtained and concatenated with the visual features to give the corresponding scene features;
for each candidate direction i, the top m most salient objects in the scene are extracted with the Faster R-CNN object detector, and the labels of these objects are recorded.
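By way of illustration only, the following Python sketch shows how such a direction encoding and scene feature can be assembled; the function names, tensor shapes and the 512-dimensional CLIP feature size are assumptions of the sketch rather than limitations of the method.

```python
import torch

def direction_encoding(theta: torch.Tensor, phi: torch.Tensor) -> torch.Tensor:
    """Repeat (cos θ, sin θ, cos φ, sin φ) 32 times to build the 128-dim code d_i."""
    base = torch.stack([theta.cos(), theta.sin(), phi.cos(), phi.sin()], dim=-1)  # [N_t, 4]
    return base.repeat(1, 32)                                                     # [N_t, 128]

def scene_features(clip_feats: torch.Tensor, theta: torch.Tensor, phi: torch.Tensor) -> torch.Tensor:
    """Concatenate visual features with the direction encoding for each navigable view."""
    d = direction_encoding(theta, phi)          # [N_t, 128]
    return torch.cat([clip_feats, d], dim=-1)   # [N_t, 512 + 128] assuming CLIP-ViT-B/32 features

# Usage (assumed shapes): clip_feats [N_t, 512], theta and phi [N_t]
# feats = scene_features(clip_feats, theta, phi)
```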
Further, the operation mechanism of the graph convolution network based on semantic and position awareness is as follows:
building an object and a knowledge graph: the method comprises the steps that an object and knowledge form a graph structure, the relation between the object and the knowledge is a corresponding relation in a knowledge base, and an implicit relation is defined between the object and the knowledge;
embedded edge relation characterization: a set of learnable parameters is used as base vectors, and the embedding of each edge relation is obtained by ordered weighted accumulation of the bases followed by normalization; the relations between different objects and knowledge are embedded into different relation characterizations, and the implicit relation between objects is also used as a special relation embedding;
and (3) embedded node characterization: carrying out semantic coding on all nodes to form vectors, carrying out position coding on accessed objects, and initializing the position coding of unaccessed objects and all knowledge entity nodes to form all-zero vectors;
combining the graph convolution network and the edge information representation, and carrying out feature update on nodes in the graph to obtain a final graph node representation;
inputting the object label type corresponding to the current scene, and outputting the object characteristics after the knowledge enhancement after the graph convolution update characterization.
Further, performing feature updates on the nodes in the graph by combining the graph convolution network with the edge information representation to obtain the final graph node representation comprises the following steps:
a1, summing the neighbor nodes, and adding an edge embedding representation to update the characteristic representation of the target node;
a2, in order to better represent the target node, the characteristic of the target node is added during the final updating, and then the output result is subjected to a nonlinear activation function to obtain the updated representation of the node;
a3, iterating the steps A1-A2 by using the multi-layer graph convolution model structure to obtain a final graph node representation.
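A minimal sketch of one such update layer is given below, assuming a PyTorch implementation; the class name, the use of nn.Embedding for relation embeddings and the ReLU activation are assumptions of the sketch, while the aggregation itself follows steps A1 and A2 above.

```python
import torch
import torch.nn as nn

class EdgeAwareGCNLayer(nn.Module):
    """One layer of graph convolution with edge (relation) embeddings (steps A1-A2)."""
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        # index 0 is reserved for "no edge"; relation ids start from 1
        self.rel_emb = nn.Embedding(num_relations, dim)
        self.proj = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h:   [N, dim] node features
        # adj: [N, N] long tensor of relation ids (0 = no edge)
        mask = (adj > 0).float()                                        # adjacency mask
        e = self.rel_emb(adj)                                           # [N, N, dim] edge embeddings
        # A1: sum neighbor features plus their edge embeddings
        msg = (mask.unsqueeze(-1) * (h.unsqueeze(0) + e)).sum(dim=1)    # [N, dim]
        # A2: add the target node's own feature, then a nonlinear activation
        return self.act(self.proj(msg) + h)
```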
Further, the constructing the object and knowledge graph includes:
and detecting and obtaining an object tag list by using a preset network model as an index, and searching k pieces of knowledge with highest correlation weight in a preset knowledge base.
Further, the operation mechanism of the multi-mode decision module based on scene and knowledge perception is as follows:
fusing the multimodal characterization using a selective attention mechanism;
updating the running state of the agent: at each moment, the weighted sum of the natural language instruction features and their attention scores, computed from the state variable at the last layer of the multi-modal decision module, is concatenated with the weighted sum of the visual features and their attention scores, and a new state feature is obtained through linear transformation;
dynamically aggregating scenes and knowledge: sorting the attention scores of the scene features and the knowledge features in the same view according to the state features, and selecting the maximum value of the attention scores as the score of the view;
decision of output agent: carrying out Softmax on the final scores of all the different views, and selecting the view corresponding to the maximum value of the scores as the moving direction of the agent; if the maximum value of the score corresponds to the current view, the agent chooses to stop.
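By way of illustration, the following sketch shows the dynamic aggregation and action decision described above; the tensor layouts and function name are assumptions of the sketch, and stopping is modelled as selecting the current view, as in the description.

```python
import torch

def predict_action(scene_scores: torch.Tensor, object_scores: torch.Tensor,
                   current_view: int) -> tuple[int, bool]:
    """Dynamic aggregation and decision (a sketch; tensor layouts are assumptions).

    scene_scores:  [V]    attention score of each view's scene feature w.r.t. the state
    object_scores: [V, M] attention scores of the knowledge-enhanced object features per view
    """
    # Aggregate: per view, keep the maximum over the scene score and its object scores
    view_scores = torch.maximum(scene_scores, object_scores.max(dim=1).values)  # [V]
    probs = torch.softmax(view_scores, dim=0)
    chosen = int(probs.argmax())
    stop = (chosen == current_view)  # the agent chooses to stop if the best view is the current one
    return chosen, stop
```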
Further, the fusing of multimodal characterization using selective attention mechanisms includes:
and inputting natural language instruction features, scene features and knowledge-enhanced object features, wherein the natural language instruction features and the scene features are only used as keys and values of an attention mechanism and are not updated, and the knowledge-enhanced object features are updated by referring to the natural language instruction features and the scene features.
Further, the graph convolution network and the multi-modal decision module are trained by:
training the agent with a combination of imitation learning (IL) and reinforcement learning (RL); in imitation learning, the agent takes the labeled action a_t* at each time step so as to follow the shortest path, making the action probability p_t as close as possible to the shortest trajectory;
let T be the total length of the agent trajectory; the imitation learning loss function is then expressed as:
L_IL = -Σ_{t=1}^{T} log p_t(a_t*)
in reinforcement learning, the agent samples an action a_t^s from the probability p_t and learns from the reward; with A_t denoting the advantage estimate, the reinforcement learning loss function is expressed as:
L_RL = -Σ_{t=1}^{T} A_t log p_t(a_t^s)
the total loss function is:
L = L_IL + λL_RL
wherein λ represents a hyper-parameter.
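As an illustration of this training objective, the sketch below combines the imitation learning and reinforcement learning terms into the total loss L = L_IL + λL_RL; the function name, tensor layouts and the value of λ are assumptions of the sketch.

```python
import torch

def navigation_loss(log_probs, gt_actions, sampled_actions, advantages, lam=0.2):
    """Combined loss L = L_IL + λ·L_RL (a sketch; λ = 0.2 is an assumed value).

    log_probs:       [T, A] log action probabilities p_t
    gt_actions:      [T]    labeled shortest-path actions a_t* (imitation learning)
    sampled_actions: [T]    actions a_t^s sampled from p_t (reinforcement learning)
    advantages:      [T]    advantage estimates A_t (e.g. from A2C)
    """
    t = torch.arange(log_probs.size(0))
    loss_il = -log_probs[t, gt_actions].sum()                       # negative log-likelihood term
    loss_rl = -(log_probs[t, sampled_actions] * advantages).sum()   # policy-gradient term
    return loss_il + lam * loss_rl
```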
The invention adopts another technical scheme that:
a visual language navigation device based on scene fusion knowledge, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The invention adopts another technical scheme that:
a computer readable storage medium, in which a processor executable program is stored, which when executed by a processor is adapted to carry out the method as described above.
The beneficial effects of the invention are as follows: by exploiting the semantic and positional relations among the objects and the knowledge in the scene, the invention aligns the scene features with the natural language instruction features more closely and enables the agent to navigate effectively under limited visual observation and in unseen environments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description is made with reference to the accompanying drawings of the embodiments of the present invention or the related technical solutions in the prior art, and it should be understood that the drawings in the following description are only for convenience and clarity of describing some embodiments in the technical solutions of the present invention, and other drawings may be obtained according to these drawings without the need of inventive labor for those skilled in the art.
Fig. 1 is a frame diagram of a visual language navigation method based on scene fusion knowledge in an embodiment of the invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present invention, it should be understood that references to orientation descriptions such as upper, lower, front, rear, left, right, etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description of the present invention and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention.
In the description of the present invention, "a number of" means one or more and "a plurality of" means two or more; greater than, less than, exceeding, etc. are understood as excluding the stated number, while above, below, within, etc. are understood as including it. The terms first and second are used only to distinguish technical features and should not be construed as indicating or implying relative importance, the number of the indicated technical features, or their precedence.
In the description of the present invention, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present invention can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
As shown in fig. 1, the present embodiment provides a visual language navigation method based on scene fusion knowledge, which adopts a semantic- and position-aware graph convolution network module (OK-GCN) to infer the relations between objects and knowledge, and fuses multi-modal information for decision making through a scene- and knowledge-aware multi-modal decision module (SK-Transformer), thereby improving the agent's perception of the environment layout and its exploration ability in unseen environments. The method comprises the following specific steps:
s1, obtaining a visual language navigation task, wherein an agent is placed at a starting point position to obtain a natural language instruction, and then obtaining visual information and position information of a scene at each moment t, wherein the visual information comprises panoramic image information of the current position of the agent, the visual information is divided into 36 discrete views, and the position information comprises steering angles and elevation angles of the views relative to the current position.
S2, constructing a global object and knowledge relation reasoning graph OK-GCN: the most salient object labels detectable by a pre-trained Faster R-CNN are used to retrieve the k pieces of knowledge with the highest confidence from the knowledge graph. All objects and knowledge entities serve as nodes; objects are connected pairwise by edges, and related objects and knowledge entities are connected by edges, forming a global object and knowledge relation graph. The nodes of the relation graph encode semantics with GloVe, the position encodings are initialized as zero vectors, and the initial node features are formed by concatenating the two.
S3, encoding the input natural language instruction with a Transformer to obtain the language feature vector and the initial state feature vector.
S4, for the scenes corresponding to the navigable directions at the current moment, CLIP-ViT-B-32 is used as the visual encoder to obtain the visual features, which are concatenated with the direction features obtained by position encoding to form the scene features. The object labels observed in the current scene are detected with the pre-trained Faster R-CNN.
S5, updating the position characteristics of the detected object labels of the current scene, then reasoning the object knowledge relation graph OK-GCN in a multi-step graph convolution mode to obtain an updated object characteristic matrix, and searching the corresponding object characteristics with knowledge enhancement by using the object characteristics of the current nodes.
S6, using a scene and knowledge fusion module SK-Transformer (Scene and Knowledge Aware Transformer) to perform cross-modal coding on the obtained language features, scene features, knowledge-enhanced object features and state vectors obtained at the last moment of the intelligent agent. And obtaining the state characteristics of the current moment and the attention scores of the corresponding aggregated scenes.
And S7, inputting the attention scores of the corresponding aggregated scenes into a Softmax classifier to obtain probability distribution of the actions of the agent, and enabling the agent to select the scene direction corresponding to the maximum value of the scores to move. If the scene corresponding to the maximum score is the current scene, the agent selects to stop.
S8, repeating the above steps until the agent chooses to stop, and finally updating the weights of the model, namely the semantic- and position-aware relation reasoning module OK-GCN and the scene- and knowledge-aware fusion module SK-Transformer.
The above method is explained in detail with reference to specific examples.
As shown in fig. 1, the invention provides a visual language navigation method for exploring a dynamic topological map by combining knowledge, which comprises the following specific steps:
(1) Extracting multi-modal features or labels: natural language instruction features, scene features, and the object labels of the corresponding scenes
After obtaining the visual language navigation task, the agent is placed at the starting position and obtains a natural language instruction I, where L denotes the length of the instruction.
First, to obtain the features of the natural language instruction, in the initialization phase the [CLS] token, the instruction sequence I and the separator token [SEP] are input into a Transformer for encoding, yielding the initial running state s_0 of the agent and the natural language instruction features X.
s_0, X = Transformer([CLS], I, [SEP])
Navigation is then a continuous iterative process. At each time t the agent acquires visual information and position information of the scene. The visual information comprises the panoramic image at the agent's current position, divided into 36 discrete views, and CLIP-ViT-B-32 is used as the visual encoder to obtain the visual features of the panorama. The position information of a view comprises its steering angle θ_i and elevation angle φ_i relative to the current position. The invention constructs a 128-dimensional direction encoding d_i by repeating the vector (cos θ_i, sin θ_i, cos φ_i, sin φ_i) 32 times.
For the current scene there are N_t navigable directions; similarly, the corresponding position information features are obtained and concatenated with the visual features to give the corresponding scene features. For each candidate direction i, the top m most salient objects in the scene are extracted with the Faster R-CNN object detector, and their labels are recorded.
In summary, we obtain the initial running state s_0 of the agent, the natural language instruction features X, the features of the adjacent scenes, and the object labels corresponding to the adjacent scenes.
(2) Constructing the global object and knowledge relation graph OK-GCN, updating node and neighbor features with a graph neural network based on position and semantic perception, and then retrieving the knowledge-enhanced object characterizations through the object label information of the scene
A knowledge graph is selected, which is a semantic representation of the real world, whose basic constituent units are entity-relationship-entity triples. The common knowledge fusion method can acquire knowledge from a structured knowledge base, such as ConceptNet, or from a semi-structured knowledge base, such as Visual Genome. Here, conceptNet is selected as the acquisition source of external knowledge.
An object–knowledge relation graph is constructed using the most salient object labels detectable by a Faster R-CNN pre-trained on Visual Genome; these labels are used as indices to retrieve the top k most relevant pieces of knowledge from ConceptNet. A piece of knowledge can be expressed as (h_i, r_{i,j}, t_j, w_{i,j}), where h_i is an object detected in the scene, t_j is an entity retrieved from the knowledge base, w_{i,j} is the weight of the knowledge correlation, and r_{i,j} is the relation between the object and the knowledge.
The object and knowledge relation graph can be represented as G_K = (H_K, E_K), where H_K is the node set containing all entities, namely the object labels in the scene and the knowledge entity labels retrieved from the knowledge base, and E_K is the edge set. Let the total number of nodes in the relation graph be N.
Nodes are encoded from semantics and position: all entities are encoded with GloVe into 300-dimensional semantic representations, and a 128-dimensional zero vector is used as the initial position representation of the objects and knowledge. The two are concatenated as the node representation, so h_i, t_j ∈ R^428 and H_K ∈ R^{N×428}.
Reasoning uses an edge-embedded graph neural network. A_K ∈ R^{N×N} is the adjacency matrix of the relation graph, whose entries represent the relations between nodes. For an object–knowledge pair, the corresponding entry is defined as the relation r_{i,j} in the knowledge base; for object–object pairs, a special implicit relation is used. The relations are numbered sequentially starting from 1.
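By way of illustration, the following sketch builds the node list and the relation-id adjacency matrix A_K from the detected object labels and the retrieved (h, r, t, w) triples; the function name and data layout are assumptions of the sketch.

```python
import numpy as np

def build_relation_graph(scene_objects, triples):
    """Build the node list and relation-id adjacency matrix A_K (a sketch).

    scene_objects: list of detected object labels
    triples: list of (head_object, relation, tail_entity, weight) retrieved from ConceptNet
    """
    nodes = list(dict.fromkeys(scene_objects + [t for _, _, t, _ in triples]))
    index = {n: i for i, n in enumerate(nodes)}
    relations = {"<implicit>": 1}                    # the special object-object implicit relation
    adj = np.zeros((len(nodes), len(nodes)), dtype=np.int64)

    # object-object pairs share the implicit relation
    for a in scene_objects:
        for b in scene_objects:
            if a != b:
                adj[index[a], index[b]] = relations["<implicit>"]

    # object-knowledge pairs use the relation from the knowledge base, numbered from 2 onward
    for h, r, t, _w in triples:
        rid = relations.setdefault(r, len(relations) + 1)
        adj[index[h], index[t]] = adj[index[t], index[h]] = rid
    return nodes, adj, relations
```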
First the edges are embedded: a set of learnable parameters is selected as base vectors, and each relation r is encoded into an embedding e_r by ordered weighted accumulation of the bases followed by LayerNorm.
The OK-GCN is then updated with multi-layer graph convolution: the features of the neighbor nodes and the corresponding edge features are aggregated together with the feature of the target node i to obtain the representation of the target node at the next layer:
h_i^(l+1) = σ( Σ_{j∈N_i} W_G ( h_j^(l) + e_{r_{i,j}}^(l) ) + h_i^(l) )
where σ denotes the activation function, N_i denotes the set of neighbor nodes of node i, W_G is a learnable parameter of the model, and e_{r_{i,j}}^(l) denotes the embedding of the corresponding relation at layer l.
Finally, the features of the different layers are concatenated as the final feature of the target node:
o_i = W_k [ h_i^(1); h_i^(2); …; h_i^(L) ]
where W_k is a learnable parameter, L is the number of layers of the model, and o_i is the knowledge-enhanced object feature that is output.
(3) The multi-modal decision model SK-Transformer based on scene and knowledge perception for multi-modal feature reasoning and decision making
We have now obtained the initial running state s_0 of the agent, the natural language instruction features X, the features of the adjacent scenes, and the knowledge-enhanced object features of each adjacent scene i, which can also be regarded as the knowledge information of the scene.
Drawing on the successful experience of RecBERT (A Recurrent Vision-and-Language BERT for Navigation) in visual language navigation and combining it with knowledge fusion, a novel scene- and knowledge-aware multi-modal decision model, the SK-Transformer, is proposed.
Selective attention mechanism:
during the process of SK-transducer fusion of the multi-modal features, a specific attention mask is set. So that the natural language instruction features and scene features are only input as keys and values SK-transducers, the instruction features and scene features remain unchanged during the update process, they only provide references as context information in the attention mechanism.
In particular, state features s t And knowledge-enhanced object featuresThe key and value are entered as a query into SK-transducer, the standard attention pattern in transducer.
Intuitively, under such attention masks, the instructions and scene observed by the object do not change, they only provide context information, and the object can choose the part that needs attention from the scene features through the features of the object and knowledge.
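A minimal sketch of this selective attention idea is given below, assuming PyTorch's nn.MultiheadAttention: only the state and knowledge-enhanced object tokens are passed as queries and are updated, while the instruction and scene tokens serve purely as keys and values; the class name, layer sizes and the residual feed-forward update are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class SelectiveFusionLayer(nn.Module):
    """Queries = state + knowledge-enhanced object tokens; keys/values = instruction +
    scene tokens, which therefore stay unchanged. A sketch of the idea only."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, state_obj: torch.Tensor, text_scene: torch.Tensor) -> torch.Tensor:
        # state_obj:  [B, 1+K, dim]  state token plus knowledge-enhanced object tokens (queries)
        # text_scene: [B, L+V, dim]  instruction tokens plus scene tokens (context only)
        ctx = torch.cat([text_scene, state_obj], dim=1)  # keys/values (queries also see themselves
                                                         # here, which is an assumption of the sketch)
        out, _ = self.attn(state_obj, ctx, ctx)          # only the query tokens are updated
        return state_obj + self.ffn(out)                 # residual update; the context stays unchanged
```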
Scene and knowledge dynamic aggregation:
for aggregation of scenes and knowledge, we select the maximum attention score among all scene and knowledge-enhanced object features in a given view and use the corresponding features to represent that view. Intuitively, this approach allows the model to select one relevant object or knowledge or the entire scene to represent each navigable direction.
Agent action decision:
The function ψ(x) denotes the output corresponding to input x at the last layer of the SK-Transformer. ψ(s_t) ∈ R^d is the last-layer output corresponding to the state feature s_t, where d is the hidden size of the SK-Transformer model. V_t denotes the scene features, covering the N_t directions of the adjacent scenes plus the direction corresponding to one stop token. ψ(V_t) is the output of the scene features V_t after aggregation at the last layer of the SK-Transformer. The attention score α(V_t) of the state feature over ψ(V_t) is computed and normalized with a softmax layer to obtain p_t, a probability distribution over the N_t navigable directions and stopping.
The action is predicted from the normalized attention scores of the agent's state feature over the scene features, and the direction with the highest probability is selected as the agent's decision.
Updating the running state of the intelligent agent:
In the SK-Transformer the state history is maintained through the state feature s_t, where s_t ∈ R^d. Specifically, s_{t+1} is updated from ψ(s_t), so that the features of the fused scene, the features of the instruction and the features of the action are incorporated into the state history. We first compute the weighted sum F_v of the scene features and their attention scores, and the weighted sum F_l of the instruction features and their attention scores.
The original state feature s_t is then multiplied element-wise with F_v and with F_l, the results are concatenated and linearly transformed, and the outcome is concatenated with the position representation of the direction selected by the agent; after a further linear transformation the new state s_{t+1} is obtained.
Here W_1 and W_2 are the learnable parameters of the two linear transformations, ⊙ denotes element-wise multiplication, [;] denotes concatenation, and the final term is the position feature of the selected direction.
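By way of illustration, the sketch below reproduces this state update: the weighted sums F_v and F_l are gated element-wise by ψ(s_t), concatenated, linearly transformed, and combined with the direction encoding of the selected view; the module name and dimensions are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class StateUpdate(nn.Module):
    """Update s_{t+1} from ψ(s_t), the weighted sums F_v, F_l and the chosen direction
    encoding (a sketch; the dimensions and exact gating are assumptions)."""
    def __init__(self, dim: int, dir_dim: int = 128):
        super().__init__()
        self.w1 = nn.Linear(2 * dim, dim)          # fuse the two gated summaries
        self.w2 = nn.Linear(dim + dir_dim, dim)    # fold in the chosen direction encoding

    def forward(self, psi_s, attn_v, scene_feats, attn_l, text_feats, dir_enc):
        # psi_s: [dim]; attn_v: [V]; scene_feats: [V, dim]; attn_l: [L]; text_feats: [L, dim]
        f_v = (attn_v.unsqueeze(-1) * scene_feats).sum(dim=0)   # weighted sum over views
        f_l = (attn_l.unsqueeze(-1) * text_feats).sum(dim=0)    # weighted sum over words
        gated = torch.cat([psi_s * f_v, psi_s * f_l], dim=-1)   # element-wise products, concatenated
        s_mid = self.w1(gated)
        return self.w2(torch.cat([s_mid, dir_enc], dim=-1))     # new state s_{t+1}
```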
(4) Training the OK-GCN and SK-Transformer models
During training, the agent is trained with a combination of imitation learning (IL) and reinforcement learning (RL). In imitation learning, the agent takes the labeled action a_t* at each time step so as to follow the shortest path, making the action probability p_t as close as possible to the shortest-path trajectory. Let T be the total length of the agent trajectory; the imitation learning loss function is then expressed as:
L_IL = -Σ_{t=1}^{T} log p_t(a_t*)
In reinforcement learning, the agent samples an action a_t^s from the probability p_t and learns from the reward, so that it learns how to explore the environment and improves its generalization ability. Referring to the advantage variable A_t proposed in the A2C algorithm, the reinforcement learning loss function used here is expressed as:
L_RL = -Σ_{t=1}^{T} A_t log p_t(a_t^s)
The total loss function is:
L = L_IL + λL_RL
where λ is a hyper-parameter used to balance the importance of the imitation learning and reinforcement learning loss functions. AdamW is used to optimize the target loss function during training and to update the parameters of the model.
(5) Visual language navigation using the OK-GCN and SK-Transformer
When the invention is used for navigation, the OK-GCN and SK-Transformer are set to test mode and gradients are not accumulated, which speeds up computation. The model no longer uses the Dropout layers, and the global mean and variance of the Batch Normalization layers are not updated.
When the agent makes navigation decisions, at each time step t it moves in the direction with the highest score; the maximum number of movement steps is set to T_max, and this navigation process is repeated until the number of movements exceeds T_max or the agent chooses to stop at its current location.
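By way of illustration, the sketch below shows such an inference loop; the model and environment interfaces, as well as the default value of T_max, are assumptions of the sketch.

```python
import torch

@torch.no_grad()
def navigate(model, env, instruction, t_max: int = 15):
    """Greedy navigation loop (a sketch; `model` and `env` interfaces are assumptions)."""
    model.eval()                                   # disables Dropout, freezes BatchNorm statistics
    state = model.encode_instruction(instruction)  # initial state s_0 and instruction features
    for _ in range(t_max):                         # at most T_max movements
        obs = env.observe()                        # panoramic views, angles, object labels
        scores, state = model.step(state, obs)     # fused decision scores over the views + stop
        action = int(scores.argmax())
        if action == obs["stop_index"]:            # the stop choice (the current view)
            break
        env.move(action)                           # move toward the selected view
    return env.position()
```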
As shown in Tables 1 and 2 below, Table 1 gives the experimental results of the SK-Transformer on the R2R dataset and Table 2 gives its experimental results on the REVERIE dataset; as can be seen from Tables 1 and 2, the SK-Transformer of this embodiment achieves superior results.
TABLE 1 Experimental results of the invention (SK-Transformer) on the R2R dataset
TABLE 2 Experimental results of the invention (SK-Transformer) on the REVERIE dataset
In summary, compared with the prior art, the invention has at least the following advantages and beneficial effects:
(1) The invention uses a relation reasoning module based on semantic and position perception, which can effectively model the interrelations between the objects in the scene and the knowledge in the knowledge graph, and outputs object representations enhanced with semantic and position information.
(2) Considering the interrelations among the scene, the objects and the knowledge, the invention provides a scene- and knowledge-aware fusion module, the SK-Transformer, which updates the knowledge with a selective attention mechanism, aggregates the scenes and knowledge of different viewing angles, scores the aggregated results and selects the final decision direction.
(3) The invention can reason over the relations between objects in the scene and over common sense, and therefore has a stronger exploration capability under limited visual observation or in unseen environments.
The embodiment also provides a visual language navigation device based on scene fusion knowledge, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method illustrated in fig. 1.
The visual language navigation device based on the scene fusion knowledge can execute any combination implementation steps of the visual language navigation method based on the scene fusion knowledge, and has corresponding functions and beneficial effects.
The present application also discloses a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
The embodiment also provides a storage medium which stores instructions or programs for executing the visual language navigation method based on scene fusion knowledge, and when the instructions or programs are run, the instructions or programs can execute any combination implementation steps of the method embodiment, and the method has corresponding functions and beneficial effects.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
In the description of this specification, reference to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic references to these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.
Claims (10)
1. A visual language navigation method based on scene fusion knowledge, characterized by comprising the following steps:
acquiring a visual language navigation task, wherein the visual language navigation task comprises a natural language instruction, initial visual information and initial position information;
encoding the natural language instruction into natural language instruction characteristics and an initial running state of the agent; coding and splicing the visual information and the position information to obtain scene characteristics;
extracting object labels from the visual information, and encoding semantic labels and position information of the objects into object features so as to update node characterization in a graph convolution network;
iteratively updating weights of object features by using a graph convolution network based on semantic and location awareness, and retrieving knowledge-enhanced object features by using object tags in a scene;
and using a multi-modal decision module based on scene and knowledge perception to fuse the natural language instruction features, scene features and knowledge-enhanced object features, performing action prediction and updating the running state of the agent until the agent chooses to stop.
2. The visual language navigation method based on scene fusion knowledge according to claim 1, wherein the operation mechanism of the graph convolution network based on semantic and position awareness is:
building an object and a knowledge graph: the method comprises the steps that an object and knowledge form a graph structure, the relation between the object and the knowledge is a corresponding relation in a knowledge base, and an implicit relation is defined between the object and the knowledge;
embedded edge relationship characterization: obtaining the embedding of the edge relation; different relation characterization is obtained by embedding the relation between different objects and knowledge, and the implicit relation between the objects is also used as a special relation embedding;
and (3) embedded node characterization: carrying out semantic coding on all nodes to form vectors, carrying out position coding on accessed objects, and initializing the position coding of unaccessed objects and all knowledge entity nodes to form all-zero vectors;
combining the graph convolution network and the edge information representation, and carrying out feature update on nodes in the graph to obtain a final graph node representation; inputting the object label type corresponding to the current scene, and outputting the object characteristics after the knowledge enhancement after the graph convolution update characterization.
3. The visual language navigation method based on scene fusion knowledge according to claim 2, wherein the combining of the graph convolution network with the edge information representation to perform feature updates on the nodes in the graph and obtain the final graph node representation comprises:
a1, summing the neighbor nodes, and adding an edge embedding representation to update the characteristic representation of the target node;
a2, in order to better represent the target node, the characteristic of the target node is added during the final updating, and then the output result is subjected to a nonlinear activation function to obtain the updated representation of the node;
a3, iterating the steps A1-A2 by using the multi-layer graph convolution model structure to obtain a final graph node representation.
4. The visual language navigation method based on scene fusion knowledge according to claim 2, wherein said constructing objects and knowledge graph comprises:
and detecting and obtaining an object tag list by using a preset network model as an index, and searching k pieces of knowledge with highest correlation weight in a preset knowledge base.
5. The visual language navigation method based on scene fusion knowledge according to claim 1, wherein the operation mechanism of the multi-modal decision module based on scene and knowledge perception is as follows:
fusing the multimodal characterization using a selective attention mechanism;
updating the running state of the agent: at each moment, the weighted sum of the natural language instruction features and their attention scores, computed from the state variable at the last layer of the multi-modal decision module, is concatenated with the weighted sum of the visual features and their attention scores, and a new state feature is obtained through linear transformation;
dynamically aggregating scenes and knowledge: sorting the attention scores of scene features and knowledge-enhanced object features in the same view according to the state features, and selecting the maximum value of the attention scores as the score of the view;
decision of output agent: carrying out Softmax on the final scores of all the different views, and selecting the view corresponding to the maximum value of the scores as the moving direction of the agent; if the maximum value of the score corresponds to the current view, the agent chooses to stop.
6. The visual language navigation method based on scene fusion knowledge of claim 5, wherein said fusing multimodal characterization using selective attention mechanisms comprises:
and inputting natural language instruction features, scene features and knowledge-enhanced object features, wherein the natural language instruction features and the scene features are only used as keys and values of an attention mechanism and are not updated, and the knowledge-enhanced object features are updated by referring to the natural language instruction features and the scene features.
7. The visual language navigation method based on scene fusion knowledge according to claim 1, wherein the natural language instruction is encoded into natural language instruction features and an initial running state of an agent; encoding and splicing the visual information and the position information to obtain scene characteristics, wherein the method comprises the following steps:
after the agent obtains the visual language navigation task, it obtains a natural language instruction I, where L denotes the length of the instruction; the agent is placed at the starting position;
in the initialization phase, the [CLS] token, the instruction sequence I and the separator token [SEP] are input into a Transformer for encoding, yielding the initial running state s_0 of the agent and the natural language instruction features X:
s_0, X = Transformer([CLS], I, [SEP])
navigation is a continuous iterative process: at each time t the agent acquires visual information and position information of the scene, where the visual information comprises the panoramic image at the agent's current position, divided into 36 discrete views, and CLIP-ViT-B-32 is used as the visual encoder to obtain the visual features of the panorama; the position information of a view comprises its steering angle θ_i and elevation angle φ_i relative to the current position, and a 128-dimensional direction encoding d_i is constructed by repeating the vector (cos θ_i, sin θ_i, cos φ_i, sin φ_i) 32 times;
for the current scene there are N_t navigable directions; the corresponding position information features are obtained and concatenated with the visual features to give the corresponding scene features.
8. The visual language navigation method based on scene fusion knowledge of claim 1, wherein the graph convolution network and the multi-modal decision module are trained by:
training the agent with a combination of imitation learning (IL) and reinforcement learning (RL); in imitation learning, the agent takes the labeled action a_t* at each time step so as to follow the shortest path, making the action probability p_t as close as possible to the shortest trajectory;
let T be the total length of the agent trajectory; the imitation learning loss function is then expressed as:
L_IL = -Σ_{t=1}^{T} log p_t(a_t*)
in reinforcement learning, the agent samples an action a_t^s from the probability p_t and learns from the reward; with A_t denoting the advantage estimate, the reinforcement learning loss function is expressed as:
L_RL = -Σ_{t=1}^{T} A_t log p_t(a_t^s)
the total loss function is:
L = L_IL + λL_RL
wherein λ represents a hyper-parameter.
9. A visual language navigation device based on scene fusion knowledge, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-8.
10. A computer readable storage medium, in which a processor executable program is stored, characterized in that the processor executable program is for performing the method according to any of claims 1-8 when being executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310087842.4A CN116242359A (en) | 2023-02-08 | 2023-02-08 | Visual language navigation method, device and medium based on scene fusion knowledge |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310087842.4A CN116242359A (en) | 2023-02-08 | 2023-02-08 | Visual language navigation method, device and medium based on scene fusion knowledge |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116242359A true CN116242359A (en) | 2023-06-09 |
Family
ID=86630695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310087842.4A Pending CN116242359A (en) | 2023-02-08 | 2023-02-08 | Visual language navigation method, device and medium based on scene fusion knowledge |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116242359A (en) |
- 2023-02-08 CN CN202310087842.4A patent/CN116242359A/en active Pending
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116737899A (en) * | 2023-06-12 | 2023-09-12 | 山东大学 | Visual natural language navigation system and method based on common sense information assistance |
CN116737899B (en) * | 2023-06-12 | 2024-01-26 | 山东大学 | Visual natural language navigation system and method based on common sense information assistance |
CN116499471A (en) * | 2023-06-30 | 2023-07-28 | 华南理工大学 | Visual language navigation method, device and medium based on open scene map |
CN116499471B (en) * | 2023-06-30 | 2023-09-12 | 华南理工大学 | Visual language navigation method, device and medium based on open scene map |
CN116524513A (en) * | 2023-07-03 | 2023-08-01 | 中国科学技术大学 | Open vocabulary scene graph generation method, system, equipment and storage medium |
CN116524513B (en) * | 2023-07-03 | 2023-10-20 | 中国科学技术大学 | Open vocabulary scene graph generation method, system, equipment and storage medium |
CN117773934A (en) * | 2023-12-29 | 2024-03-29 | 兰州大学 | Language-guide-based object grabbing method and device, electronic equipment and medium |
CN117746303A (en) * | 2024-02-20 | 2024-03-22 | 山东大学 | Zero sample visual navigation method and system based on perception correlation network |
CN117746303B (en) * | 2024-02-20 | 2024-05-17 | 山东大学 | Zero sample visual navigation method and system based on perception correlation network |
CN118305783A (en) * | 2024-03-21 | 2024-07-09 | 北京工业大学 | Language-oriented robot category-level pushing and grabbing cooperative method |
CN118258406A (en) * | 2024-05-29 | 2024-06-28 | 浙江大学湖州研究院 | Automatic guided vehicle navigation method and device based on visual language model |
CN118258406B (en) * | 2024-05-29 | 2024-08-13 | 浙江大学湖州研究院 | Automatic guided vehicle navigation method and device based on visual language model |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |