CN116168333B - Self-supervision visual language navigation pre-training method, device and storage medium - Google Patents

Self-supervision visual language navigation pre-training method, device and storage medium

Info

Publication number
CN116168333B
Authority
CN
China
Prior art keywords
navigation
track
training
instruction
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310425915.6A
Other languages
Chinese (zh)
Other versions
CN116168333A (en)
Inventor
谭明奎
林坤阳
陈沛豪
黄狄伟
杜卿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202310425915.6A priority Critical patent/CN116168333B/en
Publication of CN116168333A publication Critical patent/CN116168333A/en
Application granted granted Critical
Publication of CN116168333B publication Critical patent/CN116168333B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/49 - Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778 - Active pattern-learning, e.g. online learning of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a self-supervision visual language navigation pre-training method, device and storage medium, wherein the method comprises the following steps: acquiring house tour videos and filtering them to obtain effective frames; constructing navigation tracks from the obtained effective frames through a track generation algorithm based on an entropy minimum theory; constructing navigation instructions according to the obtained navigation tracks; constructing track-instruction pairs from the navigation tracks and navigation instructions to generate a pre-training data set; and pre-training the network architecture with a track judgment task on the obtained pre-training data set. The method is the first to construct visual language navigation pre-training data from house tour videos: it automatically generates navigation tracks and navigation instructions and constructs track-instruction pairs, effectively reducing labeling cost. In addition, a pre-training task targeted at learning layout reasoning capability is designed, enabling the visual language navigation agent to learn house layout knowledge. The method can be widely applied in the technical field of visual language navigation.

Description

Self-supervision visual language navigation pre-training method, device and storage medium
Technical Field
The application relates to the technical field of visual language navigation, in particular to a self-supervision visual language navigation pre-training method, a device and a storage medium.
Background
An important goal of embodied artificial intelligence is to develop agents that can communicate with humans in natural language and perform real-world tasks. Vision-and-language navigation is an important task toward this goal: it requires an indoor agent to navigate in unknown environments by following natural language instructions. Visual language navigation has attracted considerable attention in the computer vision and robotics communities, with applications in home robots and warehouse assistants.
One important challenge in visual language navigation is the agent's ability to generalize to unseen environments. Existing visual language navigation methods address this challenge by having the agent perform self-supervised pre-training on vision-and-language datasets, and mainly fall into two categories: one trains the agent on manually annotated simulated navigation environment data, while the other trains the agent by constructing path-instruction pairs from web image data. However, existing methods still have the following limitations: 1) the simulation datasets are limited to a small number of environments; 2) simply stitching web images together to construct trajectories can produce unreasonable room layouts, which hinders the agent from learning layout reasoning. The web contains a large number of indoor tour videos of different houses, and these videos contain real navigation experience and house layout information. However, such videos have not yet been used for visual language navigation learning.
Disclosure of Invention
In order to solve at least one of the technical problems existing in the prior art to a certain extent, the application aims to provide a self-supervision visual language navigation pre-training method, a device and a storage medium.
The technical scheme adopted by the application is as follows:
a self-supervision visual language navigation pre-training method comprises the following steps:
acquiring house tour videos, and filtering the house tour videos to obtain effective frames;
constructing a navigation track through a track generation algorithm based on an entropy minimum theory according to the obtained effective frame;
constructing a navigation instruction through an action-aware instruction generation algorithm according to the obtained navigation track, and screening out incorrect instructions by using ChatGPT;
constructing a track-instruction pair according to the navigation track and the navigation instruction to generate a pre-training data set;
and pre-training the network architecture by using a track judgment task according to the obtained pre-training data set.
Further, the acquiring of house tour videos and the filtering of the house tour videos to obtain effective frames includes:

letting the $N$-th house tour video be denoted $V_N$, and sampling the house tour video:

$$F_N = \mathrm{Sample}(V_N)$$

where $\mathrm{Sample}(\cdot)$ denotes the video frame sampling operation, $f^N_i$ denotes a sampled video frame, and the number of effective frames of the $N$-th video is $T_N$;

extracting regional features for each video frame $f^N_i$ using an object detection model:

$$\{r_1, r_2, \ldots, r_K\} = \mathrm{Det}(f^N_i)$$

where $\mathrm{Det}$ denotes the object detection model, $r_j$ denotes the feature of the $j$-th target region, and $K$ denotes the number of target regions.
Further, the target detection model comprises a trained Resnet model and a Mask RCNN model, wherein the Resnet model is used for eliminating video frames belonging to outdoor scenes, and the Mask RCNN model is used for eliminating video frames containing human beings.
Further, the constructing a navigation track according to the obtained valid frame through a track generation algorithm based on the entropy minimum theory comprises the following steps:
classifying the obtained effective frames by using the CLIP model to obtain, for each video frame $f^N_i$, its similarity to each room type and each object class:

$$s^{room}_{i,1}, \ldots, s^{room}_{i,R}, \qquad s^{obj}_{i,1}, \ldots, s^{obj}_{i,O}$$

where $s^{room}_{i,1}$ to $s^{room}_{i,R}$ denote the similarities between video frame $f^N_i$ and the $R$ room types, and $s^{obj}_{i,1}$ to $s^{obj}_{i,O}$ denote the similarities between video frame $f^N_i$ and the $O$ object classes;

letting $l_i$ denote the room type of the $i$-th frame, and, when consecutive frames from the $p$-th frame to the $q$-th frame belong to the same room type $c$, defining the $p$-th to $q$-th frames as one group in the video, and computing, for each frame $f_i$ in the group, the similarity information entropy over the $R$ room types:

$$H_i = -\sum_{r=1}^{R} \tilde{s}^{room}_{i,r} \log \tilde{s}^{room}_{i,r}$$

where $\tilde{s}^{room}_{i,r}$ is the normalized similarity of frame $f_i$ to the $r$-th room type;

taking the video frame with the minimum information entropy in each group as a key frame, defining the key frame as a room node, and defining the remaining non-key frames as transition nodes;

for each house tour video, selecting $L$ as the length of a navigation track, and randomly selecting $L$ room nodes and the transition nodes between them to form a navigation track.
Further, the constructing of a navigation instruction through an action-aware instruction generation algorithm according to the obtained navigation track, and the screening out of incorrect instructions by using ChatGPT, includes:

blanking out the nouns and navigation action words in a preset navigation instruction to generate an instruction template, where the position of a blanked noun is [NMASK] and the position of a blanked navigation action word is [VMASK];

for a navigation track having $L$ room nodes, obtaining an instruction template with a matching number of [NMASK] and [VMASK] slots;

acquiring the room type of each room node of the navigation track, and filling the acquired types into the [NMASK] slots;

acquiring the navigation actions between adjacent room nodes; for each filled [NMASK], finding the nearest [VMASK] and filling into it the navigation action required to move from that [NMASK] to the next [NMASK];

after the [VMASK] and [NMASK] slots are filled, obtaining the navigation instruction corresponding to the navigation track, denoted Ins;

inputting a prompt to the natural language processing large model ChatGPT, asking it to judge, at the grammar and logic level, whether the instruction is suitable for a visual language navigation task, allowing only the answer "yes" or "no": the generated instruction Ins is input, and if the response given by ChatGPT is "yes", the instruction is retained; if the response given by ChatGPT is "no", a new instruction is generated again according to the above flow.
Further, the constructing a track-instruction pair according to the navigation track and the navigation instruction, and generating the pre-training data set includes:
acquiring a training set according to house tour videos;
acquiring a navigation track of each video in a training set and a navigation instruction corresponding to the navigation track, and constructing a track-instruction pair;
and constructing a pre-training data set for visual language navigation according to the obtained track-instruction pair.
Further, the pre-training the network architecture by using the track judging task according to the obtained pre-training data set includes:
modeling the track judgment task as a binary classification problem, wherein the network architecture is a ViL-BERT model, the input of the ViL-BERT model is a track-instruction pair, and the output is the probability $p$ that the track is judged to be reasonable;
And training the ViL-BERT model according to the pre-training data set and the binary cross entropy loss function to obtain a trained model.
Further, the binary cross entropy loss function has the expression:
$$\mathcal{L} = -\frac{1}{M}\sum_{i=1}^{M}\left[\alpha\, y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where $M$ denotes the total number of positive and negative samples; a positive sample is defined as a reasonable track and a negative sample as an unreasonable track; $y_i$ indicates whether the $i$-th sample is a positive sample, with $y_i = 1$ if it is and $y_i = 0$ otherwise; $p_i$ denotes the probability output by the model that the $i$-th sample is a positive sample; and $\alpha$ is a factor used to mitigate the imbalance between positive and negative samples.
The application adopts another technical scheme that:
a self-supervising visual language navigation pre-training device, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The application adopts another technical scheme that:
a computer readable storage medium, in which a processor executable program is stored, which when executed by a processor is adapted to carry out the method as described above.
The beneficial effects of the application are as follows: the method adopts house tour videos to construct visual language navigation pre-training data for the first time, automatically generates navigation tracks and navigation instructions, constructs track-instruction pairs and effectively reduces labeling cost. In addition, a pre-training task aiming at layout reasoning capability learning is designed, and learning of house layout knowledge by a visual language navigation agent is realized.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following description is made with reference to the accompanying drawings of the embodiments of the present application or the related technical solutions in the prior art, and it should be understood that the drawings in the following description are only for convenience and clarity of describing some embodiments in the technical solutions of the present application, and other drawings may be obtained according to these drawings without the need of inventive labor for those skilled in the art.
FIG. 1 is a flow chart of steps of a self-supervising visual language navigation pre-training method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a self-supervising visual language navigation pre-training method according to an embodiment of the present application;
FIG. 3 is a graph of the performance of the present application versus prior art methods on the R2R dataset;
FIG. 4 is a graph of the performance of the present application versus prior art methods on the REVERIE dataset.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present application, it should be understood that references to orientation descriptions such as upper, lower, front, rear, left, right, etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description of the present application and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present application.
In the description of the present application, a number means one or more, a number means two or more, and greater than, less than, exceeding, etc. are understood to not include the present number, and above, below, within, etc. are understood to include the present number. The description of the first and second is for the purpose of distinguishing between technical features only and should not be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated or implicitly indicating the precedence of the technical features indicated.
In the description of the present application, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present application can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
Based on the problems in the prior art, the application contemplates training the visual language navigation agent with house tour videos, thereby overcoming the shortcomings of existing methods. The solution is to model the navigation experience in house tour videos as track-instruction pairs and use them as a dataset to train the agent. However, two difficulties arise: constructing track-instruction pairs from the original unlabeled videos, and learning the house layout information contained in the videos to improve layout reasoning ability. Therefore, the application provides a track generation technique based on entropy minimization, an instruction generation technique based on action awareness, and a track judgment pre-training task, which address these difficulties respectively.
Referring to fig. 1 and 2, the present embodiment provides a self-supervision visual language navigation pre-training method, which includes the following steps:
s1, acquiring house tour videos, and filtering the house tour videos to obtain effective frames.
Let the $N$-th house tour video be denoted $V_N$. The application samples the house tour video at a fixed frame rate:

$$F_N = \mathrm{Sample}(V_N)$$

where $\mathrm{Sample}(\cdot)$ denotes the video frame sampling operation, $f^N_i$ denotes a video frame obtained by sampling the video, and the number of effective frames of the $N$-th video is $T_N$. Further, this embodiment uses the Faster R-CNN object detection model to extract regional features for each video frame $f^N_i$:

$$\{r_1, r_2, \ldots, r_K\} = \mathrm{FR}(f^N_i)$$

where $\mathrm{FR}$ denotes the Faster R-CNN object detection model, $r_j$ denotes the feature of the $j$-th target region, and there are $K$ target regions in total. The present application then eliminates unsuitable video frames from the video.
Further, as an optional implementation, this embodiment uses a ResNet model trained on the Places365 dataset to eliminate frames belonging to outdoor scenes, and uses a Mask R-CNN model trained on the COCO dataset to eliminate frames containing humans.
The video frames retained after this elimination are collected as the effective frame set $F_N = \{f^N_1, \ldots, f^N_{T_N}\}$ used in the subsequent steps.
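By way of illustration only, the following Python sketch shows one possible implementation of the frame sampling and filtering of step S1. It is a minimal sketch under stated assumptions: the sampling rate, the detection score threshold, the checkpoint path and the indoor-category mapping are illustrative placeholders, not values specified by this embodiment.

```python
import cv2
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

PERSON_LABEL = 1          # COCO class index for "person" in torchvision's Mask R-CNN
INDOOR_CLASSES = set()    # indices of indoor Places365 categories (assumed mapping, to be filled)

mask_rcnn = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
scene_classifier = torch.load("resnet_places365.pth")  # hypothetical Places365-trained ResNet
scene_classifier.eval()

def sample_frames(video_path, fps=1.0):
    """Sample(V_N): uniformly sample frames from a house tour video at a fixed rate."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

@torch.no_grad()
def is_effective(frame_rgb):
    """Keep a frame only if it shows an indoor scene and contains no human."""
    x = torch.from_numpy(frame_rgb).permute(2, 0, 1).float() / 255.0
    det = mask_rcnn([x])[0]
    has_person = any(int(l) == PERSON_LABEL and float(s) > 0.7   # threshold is an assumption
                     for l, s in zip(det["labels"], det["scores"]))
    logits = scene_classifier(x.unsqueeze(0))                    # resizing/normalization omitted for brevity
    indoor = int(logits.argmax(dim=1)) in INDOOR_CLASSES
    return indoor and not has_person

effective_frames = [f for f in sample_frames("house_tour.mp4") if is_effective(f)]
```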
s2, constructing a navigation track through a track generation algorithm based on an entropy minimum theory according to the obtained effective frame.
For the video frames obtained in step S1, this embodiment uses the CLIP model with "a photo of a { }" as the prompt template, takes the $R$ room types and the $O$ object classes defined in the Matterport3D simulator as the text filled into the prompt template, and computes the similarity of each video frame $f^N_i$ to every room type and object class:

$$s^{room}_{i,1}, \ldots, s^{room}_{i,R}, \qquad s^{obj}_{i,1}, \ldots, s^{obj}_{i,O}$$

where $s^{room}_{i,1}$ to $s^{room}_{i,R}$ denote the similarities between video frame $f^N_i$ and the $R$ room types, and $s^{obj}_{i,1}$ to $s^{obj}_{i,O}$ denote the similarities between the frame and the $O$ object classes. The room type and object class with the highest similarity are taken as the room type label and the object class label of the frame, respectively.
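As an illustration of this zero-shot labeling step, the following sketch uses the public CLIP package with the "a photo of a { }" prompt; the label lists shown are small illustrative subsets, not the full Matterport3D room-type and object-class vocabularies used by the embodiment, and the backbone choice is an assumption.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone choice is an assumption

# Illustrative subsets only; the embodiment uses the full Matterport3D label sets.
ROOM_TYPES = ["bedroom", "kitchen", "living room", "bathroom", "hallway"]
OBJECT_CLASSES = ["sofa", "bed", "table", "sink", "television"]

def clip_similarities(image_path, labels):
    """Fill each label into the prompt 'a photo of a {}' and return the
    softmax-normalized CLIP similarities between the frame and all labels."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    texts = clip.tokenize([f"a photo of a {c}" for c in labels]).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(texts)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        sims = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)
    return sims.squeeze(0).cpu()

room_sims = clip_similarities("frame_0001.jpg", ROOM_TYPES)
room_label = ROOM_TYPES[int(room_sims.argmax())]   # highest-similarity room type label
```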
Let $l_i$ denote the room type of the $i$-th frame. When consecutive frames from the $p$-th frame to the $q$-th frame belong to the same room type $c$, the $p$-th to $q$-th frames are defined as one group in the video. Further, this embodiment computes, for each frame $f_i$ in the group, the similarity information entropy over the $R$ room types:

$$H_i = -\sum_{r=1}^{R} \tilde{s}^{room}_{i,r} \log \tilde{s}^{room}_{i,r}$$

where $\tilde{s}^{room}_{i,r}$ is the normalized similarity of frame $f_i$ to the $r$-th room type. In this embodiment, the frame with the minimum information entropy in each group is selected as a key frame and defined as a room node, whose visual features are the regional features extracted in step S1; the remaining non-key frames are defined as transition nodes. In addition, in order to simulate panoramic visual features, the application further combines the node features with the visual region features of the other frames in the same group as the final node features.
For each house tour video, the application randomly selects $L$ as the length of a navigation track, and randomly selects $L$ room nodes and the transition nodes between them, which finally form one track.
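The sketch below illustrates one way to assemble such a track, assuming nodes are kept in temporal order and that the transition nodes lying between the chosen room nodes are retained; the length range is an assumed example, not a value given by the embodiment.

```python
import random

def build_trajectory(room_nodes, transition_nodes, length_range=(4, 7)):
    """Randomly choose a track length L, pick L room nodes (kept in temporal
    order) and keep the transition nodes lying between the first and last of
    them; the length range is an illustrative assumption."""
    L = random.randint(*length_range)
    if len(room_nodes) < L:
        return None                                    # video too short for a track of length L
    chosen = sorted(random.sample(room_nodes, L))
    lo, hi = chosen[0], chosen[-1]
    inner_transitions = [t for t in transition_nodes if lo < t < hi]
    return sorted(chosen + inner_transitions)          # frame indices along the track
```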
S3, constructing a navigation instruction through an action perception instruction generation algorithm according to the obtained navigation track.
In this embodiment, an action aware instruction generation algorithm is provided to describe the track generated in step S2.
Specifically, in this embodiment, the nouns and navigation action words in existing manually written navigation instructions are first blanked out to generate instruction templates; the position of a blanked noun is [NMASK] and the position of a blanked navigation action word is [VMASK]. For each track generated in step S2 containing $L$ room nodes, the application randomly selects an instruction template with a matching number of [NMASK] and [VMASK] slots. Further, this embodiment obtains the category to which each room node of the track belongs and fills the categories into the [NMASK] slots in order. Next, a convolutional neural network model is trained on manually annotated simulation environment data: its input is two temporally consecutive observation frames and its output is the navigation action from the previous observation frame to the next one; this model is used to infer the navigation actions between room nodes. For each filled [NMASK], this embodiment finds the nearest [VMASK] and fills in the navigation action required to move from that [NMASK] to the next [NMASK]. After the [VMASK] and [NMASK] slots are filled, the generated instruction is passed to the natural language processing large model ChatGPT to judge its correctness; if the instruction is correct, the navigation instruction is output, otherwise the instruction is regenerated. This embodiment finally obtains a matched navigation instruction for each generated track.
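The following sketch illustrates the template-filling and LLM-verification logic described above. The template wording, the predicted actions and the `ask_llm` callable (standing in for a ChatGPT API wrapper) are hypothetical placeholders; the embodiment's actual templates and action predictor are not reproduced here.

```python
def fill_template(template, room_types, actions):
    """Fill a blanked instruction template in order: room-type nouns go into
    [NMASK] slots and predicted actions into the [VMASK] slots between them."""
    words, n_idx, v_idx = [], 0, 0
    for token in template.split():
        if token == "[NMASK]":
            words.append(room_types[n_idx]); n_idx += 1
        elif token == "[VMASK]":
            words.append(actions[v_idx]); v_idx += 1
        else:
            words.append(token)
    return " ".join(words)

def is_valid_instruction(instruction, ask_llm):
    """Ask a chat LLM (ask_llm is a hypothetical callable wrapping e.g. the
    ChatGPT API) whether the instruction is grammatically and logically
    suitable for a vision-and-language navigation task; only 'yes'/'no' allowed."""
    prompt = (
        "Judge whether the following instruction is suitable, in grammar and "
        "logic, for a vision-and-language navigation task. Answer only 'yes' "
        f"or 'no'.\nInstruction: {instruction}"
    )
    return ask_llm(prompt).strip().lower().startswith("yes")

# Example (template wording, room types and actions are illustrative only):
template = "[VMASK] the [NMASK] , then [VMASK] into the [NMASK] and stop ."
ins = fill_template(template, ["hallway", "kitchen"], ["walk through", "turn left"])
```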
S4, constructing a track-instruction pair according to the navigation track and the navigation instruction, and generating a pre-training data set.
In this embodiment, the $N$ house tour videos are split into a training set and a test set according to a preset ratio. For each video, track-instruction pairs are generated according to steps S1 to S3, finally forming the visual language navigation pre-training data set.
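A small sketch of the data set assembly is given below. The 9:1 split ratio and the pair layout are assumptions, and `make_pair` is a caller-supplied function standing in for the steps S1–S3 pipeline.

```python
import random

def build_pretraining_dataset(videos, make_pair, split_ratio=0.9, seed=0):
    """Split house tour videos into train/test sets (the 9:1 ratio is an
    assumed example) and build trajectory-instruction pairs for each training
    video; make_pair implements steps S1-S3 for one video and returns a
    (trajectory, instruction) tuple or None."""
    rng = random.Random(seed)
    shuffled = list(videos)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * split_ratio)
    train_videos, test_videos = shuffled[:n_train], shuffled[n_train:]

    dataset = []
    for video in train_videos:
        pair = make_pair(video)
        if pair is not None:
            trajectory, instruction = pair
            dataset.append({"trajectory": trajectory, "instruction": instruction})
    return dataset, test_videos
```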
S5, pre-training the network architecture by using a track judging task according to the obtained pre-training data set.
Based on the generated data set, the network architecture is pre-trained using the track judgment task provided by this embodiment in combination with existing pre-training tasks. The track judgment task is modeled as a binary classification problem: based on a ViL-BERT model, the input is a track-instruction pair, and optimization adopts a binary cross entropy loss function.
This embodiment provides the track judgment task to realize the agent's learning of house layout reasoning capability. The track judgment task requires the agent to judge whether a track is reasonable, where a positive sample is a reasonable track and a negative sample is an unreasonable track. Specifically, this embodiment defines a track generated in step S2 as a positive sample, i.e., a reasonable track, and adopts three ways to generate negative samples: (1) shuffling the transition nodes in the track; (2) shuffling all nodes; (3) keeping the positions of the room nodes unchanged and randomly replacing the transition nodes with transition nodes from other videos.
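The sketch below illustrates the three negative-sample perturbations. Representing a node as a dict with a 'kind' field ('room' or 'transition') is an assumption made for illustration only.

```python
import random

def make_negative(track, other_tracks, rng=random.Random(0)):
    """Turn a reasonable track into an unreasonable one using one of the three
    perturbations described above; nodes are dicts with a 'kind' field
    ('room' or 'transition'), which is an assumed representation."""
    track = [dict(n) for n in track]                  # work on a copy
    mode = rng.choice(["shuffle_transitions", "shuffle_all", "swap_transitions"])

    if mode == "shuffle_all":
        rng.shuffle(track)                            # (2) shuffle all nodes
        return track

    trans_idx = [i for i, n in enumerate(track) if n["kind"] == "transition"]
    if mode == "shuffle_transitions":                 # (1) shuffle only transition nodes
        shuffled = [track[i] for i in trans_idx]
        rng.shuffle(shuffled)
        for i, node in zip(trans_idx, shuffled):
            track[i] = node
    else:                                             # (3) room nodes fixed, transitions from another video
        donor = rng.choice(other_tracks)
        donor_trans = [n for n in donor if n["kind"] == "transition"]
        if donor_trans:
            for i in trans_idx:
                track[i] = dict(rng.choice(donor_trans))
    return track
```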
This embodiment adopts ViL-BERT as the model; the input is a track-instruction pair and the output is the probability $p$ that the track is judged to be reasonable. The goal is to minimize the following binary cross entropy loss function:

$$\mathcal{L} = -\frac{1}{M}\sum_{i=1}^{M}\left[\alpha\, y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where $M$ denotes the total number of positive and negative samples; $y_i$ indicates whether the $i$-th sample is a positive sample, with $y_i = 1$ if it is and $y_i = 0$ otherwise; $p_i$ denotes the probability output by the model that the $i$-th sample is a positive sample; and $\alpha$ is a factor used to mitigate the imbalance between positive and negative samples, equal to the ratio of the number of negative samples to the number of positive samples. After pre-training with the proposed track judgment task and other existing pre-training tasks (e.g., the visual mask modeling, language mask modeling and path ranking tasks in FIG. 2) on the data set of step S4, the agent is fine-tuned on downstream visual language navigation tasks.
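A minimal PyTorch sketch of the imbalance-weighted binary cross entropy objective follows. The tiny linear scorer and random features are stand-ins only; the actual model is ViL-BERT operating on trajectory-instruction pairs, whose inputs are not reproduced here.

```python
import torch

def weighted_bce_loss(probs, labels):
    """L = -(1/M) * sum_i [ alpha*y_i*log(p_i) + (1-y_i)*log(1-p_i) ],
    with alpha = (#negatives / #positives) to counter class imbalance."""
    eps = 1e-7
    pos = labels.sum().clamp(min=1.0)
    neg = (labels.numel() - labels.sum()).clamp(min=1.0)
    alpha = neg / pos
    probs = probs.clamp(eps, 1.0 - eps)
    loss = -(alpha * labels * torch.log(probs)
             + (1.0 - labels) * torch.log(1.0 - probs))
    return loss.mean()

# Toy training step with a stand-in scorer (placeholder for ViL-BERT):
scorer = torch.nn.Sequential(torch.nn.Linear(16, 1), torch.nn.Sigmoid())
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-4)

features = torch.randn(8, 16)                      # placeholder pair features
labels = torch.randint(0, 2, (8,)).float()         # 1 = reasonable track, 0 = unreasonable
probs = scorer(features).squeeze(-1)
loss = weighted_bce_loss(probs, labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```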
Experimental results
To verify the effectiveness of the application and quantify its effect, the method is compared with existing algorithms on two recognized visual language navigation benchmark datasets, the R2R dataset and the REVERIE dataset. As shown in FIG. 3, compared with other methods, the method of the application achieves a significant improvement in all navigation metrics on the R2R dataset; compared with the existing best-performing method DUET, the method improves the navigation Success Rate (SR) on the unseen validation set by 2.0%, the path length weighted success rate (SPL) on the test set by 2.0%, and the navigation Success Rate (SR) on the test set by 3.0%. On the REVERIE dataset, as shown in FIG. 4, compared with the best-performing method DUET, the application improves the navigation Success Rate (SR) on the test set by 1.81% and the path length weighted success rate (SPL) by 1.28%.
In summary, compared with the prior art, the method of the application has at least the following advantages and beneficial effects:
(1) Visual language navigation pre-training data is constructed from house tour videos for the first time; the constructed data set contains track-instruction pairs with real house layouts, diversified environments and intrinsic navigation actions, providing good data support for agents to learn visual language navigation capability.
(2) The application designs an automatic track and instruction generation method. The proposed track generation method based on entropy minimization theory can generate diversified and reliable room nodes, and the proposed action-aware instruction generation method generates matching instructions with correct actions; neither requires expensive manual labeling. The application adopts the entropy minimization method to treat consecutive video frames belonging to the same room type as one group, selects the frame with minimum classification entropy as the key frame, and uses its features as the node features. By training a neural network model, the application predicts the transition actions between two video frames whose relative angles cannot be directly obtained, fills the predicted actions according to the positional relation of the nouns corresponding to the two nodes in the instruction, and finally judges the correctness of the navigation instruction through a natural language processing large model, thereby ensuring the generation of correct, matched navigation instructions.
(3) The application designs a pre-training task targeted at learning layout reasoning capability, realizing the visual language navigation agent's learning of house layout knowledge. In addition, all the measured navigation metrics are significantly improved.
The embodiment also provides a self-supervision visual language navigation pre-training device, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method illustrated in fig. 1.
The self-supervision visual language navigation pre-training device provided by the embodiment of the application can be used for executing the self-supervision visual language navigation pre-training method provided by the embodiment of the method, and any combination of the embodiment of the method can be executed to realize the steps, so that the method has the corresponding functions and beneficial effects.
Embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
The embodiment also provides a storage medium which stores instructions or programs for executing the self-supervision visual language navigation pre-training method provided by the embodiment of the method, and when the instructions or programs are run, the steps can be implemented by any combination of the embodiment of the executable method, so that the method has corresponding functions and beneficial effects.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present application are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the application is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present application. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the application as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the application, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
In the foregoing description of the present specification, reference has been made to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like, means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the application, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (9)

1. The self-supervision visual language navigation pre-training method is characterized by comprising the following steps of:
acquiring house tour videos, and filtering the house tour videos to obtain effective frames;
constructing a navigation track through a track generation algorithm based on an entropy minimum theory according to the obtained effective frame;
constructing a navigation instruction according to the obtained navigation track;
constructing a track-instruction pair according to the navigation track and the navigation instruction to generate a pre-training data set;
pre-training the network architecture by using a track judging task according to the obtained pre-training data set;
the method for constructing the navigation track through the track generation algorithm based on the entropy minimum theory according to the obtained effective frame comprises the following steps:
classifying the obtained effective frames by using the CLIP model to obtain, for each video frame $f^N_i$, its similarity to each room type and each object class:

$$s^{room}_{i,1}, \ldots, s^{room}_{i,R}, \qquad s^{obj}_{i,1}, \ldots, s^{obj}_{i,O}$$

where $s^{room}_{i,1}$ to $s^{room}_{i,R}$ denote the similarities between video frame $f^N_i$ and the $R$ room types, and $s^{obj}_{i,1}$ to $s^{obj}_{i,O}$ denote the similarities between video frame $f^N_i$ and the $O$ object classes;

letting $l_i$ denote the room type of the $i$-th frame, and, when consecutive frames from the $p$-th frame to the $q$-th frame belong to the same room type $c$, defining the $p$-th to $q$-th frames as one group in the video, and computing, for each frame $f_i$ in the group, the similarity information entropy over the $R$ room types:

$$H_i = -\sum_{r=1}^{R} \tilde{s}^{room}_{i,r} \log \tilde{s}^{room}_{i,r}$$

where $\tilde{s}^{room}_{i,r}$ is the normalized similarity of frame $f_i$ to the $r$-th room type;

taking the video frame with the minimum information entropy in each group as a key frame, defining the key frame as a room node, and defining the remaining non-key frames as transition nodes;

for each house tour video, selecting $L$ as the length of a navigation track, and randomly selecting $L$ room nodes and the transition nodes between them to form a navigation track.
2. The method for pre-training self-supervision visual language navigation according to claim 1, wherein the steps of obtaining house tour videos, filtering the house tour videos to obtain effective frames comprise:
order theThe individual house tour videos are shown as +.>Sampling house tour videos:
wherein ,representing video sample frame operations, +.>Representing video frames->Representing the effective frame number of the nth video;
for each video frame using an object detection modelExtracting regional characteristics:
wherein ,representing the object detection model, +.>Indicate->Features of the individual target areas->Indicating the number of target areas.
3. The self-supervising visual language navigation pre-training method according to claim 2, wherein the object detection model comprises a trained Resnet model and a Mask RCNN model, the Resnet model is used for eliminating video frames belonging to outdoor scenes, and the Mask RCNN model is used for eliminating video frames containing human beings.
4. The method for pre-training self-supervising visual language navigation according to claim 1, wherein the constructing navigation instructions according to the obtained navigation track comprises:
blanking out the nouns and navigation action words in a preset navigation instruction to generate an instruction template, where the position of a blanked noun is [NMASK] and the position of a blanked navigation action word is [VMASK];

for a navigation track having $L$ room nodes, obtaining an instruction template with a matching number of [NMASK] and [VMASK] slots;

acquiring the room type of each room node of the navigation track, and filling the acquired types into the [NMASK] slots;

acquiring the navigation actions between adjacent room nodes; for each filled [NMASK], finding the nearest [VMASK] and filling into it the navigation action required to move from that [NMASK] to the next [NMASK];
after [ VMASK ] and [ NMASK ] are filled, navigation instructions corresponding to the navigation track are obtained, and the obtained navigation instructions are screened by using ChatGPT.
5. The method of claim 1, wherein constructing track-instruction pairs from the navigation tracks and the navigation instructions to generate the pre-training data set comprises:
acquiring a training set according to house tour videos;
acquiring a navigation track of each video in a training set and a navigation instruction corresponding to the navigation track, and constructing a track-instruction pair;
and constructing a pre-training data set for visual language navigation according to the obtained track-instruction pair.
6. The method for pre-training self-supervising visual language navigation according to claim 1, wherein the pre-training the network architecture using the trajectory judgment task according to the obtained pre-training data set comprises:
modeling the track judgment task as a binary classification problem, wherein the network architecture is a ViL-BERT model, the input of the ViL-BERT model is a track-instruction pair, and the output is the probability $p$ that the track is judged to be reasonable;
And training the ViL-BERT model according to the pre-training data set and the binary cross entropy loss function to obtain a trained model.
7. The method for pre-training self-supervising visual language navigation according to claim 6, wherein the binary cross entropy loss function is expressed as:
$$\mathcal{L} = -\frac{1}{M}\sum_{i=1}^{M}\left[\alpha\, y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where $M$ denotes the total number of positive and negative samples; a positive sample is defined as a reasonable track and a negative sample as an unreasonable track; $y_i$ indicates whether the $i$-th sample is a positive sample, with $y_i = 1$ if it is and $y_i = 0$ otherwise; $p_i$ denotes the probability output by the model that the $i$-th sample is a positive sample; and $\alpha$ is a factor used to mitigate the imbalance between positive and negative samples.
8. A self-supervising visual language navigation pre-training device, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-7.
9. A computer readable storage medium, in which a processor executable program is stored, characterized in that the processor executable program is for performing the method according to any of claims 1-7 when being executed by a processor.
CN202310425915.6A 2023-04-20 2023-04-20 Self-supervision visual language navigation pre-training method, device and storage medium Active CN116168333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310425915.6A CN116168333B (en) 2023-04-20 2023-04-20 Self-supervision visual language navigation pre-training method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310425915.6A CN116168333B (en) 2023-04-20 2023-04-20 Self-supervision visual language navigation pre-training method, device and storage medium

Publications (2)

Publication Number Publication Date
CN116168333A CN116168333A (en) 2023-05-26
CN116168333B true CN116168333B (en) 2023-08-22

Family

ID=86416657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310425915.6A Active CN116168333B (en) 2023-04-20 2023-04-20 Self-supervision visual language navigation pre-training method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116168333B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116682250B (en) * 2023-06-06 2024-02-13 深圳启示智能科技有限公司 Robot wireless remote control device
CN117506940B (en) * 2024-01-04 2024-04-09 中国科学院自动化研究所 Robot track language description generation method, device and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020113228A1 (en) * 2018-11-30 2020-06-04 Google Llc Controlling robots using entropy constraints
CN111310646A (en) * 2020-02-12 2020-06-19 智慧航海(青岛)科技有限公司 Method for improving navigation safety based on real-time detection of remote images
CN113011268A (en) * 2021-02-23 2021-06-22 同方威视科技江苏有限公司 Intelligent vehicle navigation method and device, electronic equipment and storage medium
CN114842340A (en) * 2022-05-13 2022-08-02 杜明芳 Robot binocular stereoscopic vision obstacle sensing method and system
CN114970457A (en) * 2022-05-30 2022-08-30 中山大学 Visual language navigation pre-training method based on prompt and automatic environment exploration
EP4105604A1 (en) * 2021-06-16 2022-12-21 Beijing Xiaomi Mobile Software Co., Ltd. Indoor navigation method, equipment, storage medium and program product
JP2023012493A (en) * 2022-05-20 2023-01-25 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Language model pre-training method, apparatus, device, and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10902616B2 (en) * 2018-08-13 2021-01-26 Nvidia Corporation Scene embedding for visual navigation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020113228A1 (en) * 2018-11-30 2020-06-04 Google Llc Controlling robots using entropy constraints
CN111310646A (en) * 2020-02-12 2020-06-19 智慧航海(青岛)科技有限公司 Method for improving navigation safety based on real-time detection of remote images
CN113011268A (en) * 2021-02-23 2021-06-22 同方威视科技江苏有限公司 Intelligent vehicle navigation method and device, electronic equipment and storage medium
EP4105604A1 (en) * 2021-06-16 2022-12-21 Beijing Xiaomi Mobile Software Co., Ltd. Indoor navigation method, equipment, storage medium and program product
CN114842340A (en) * 2022-05-13 2022-08-02 杜明芳 Robot binocular stereoscopic vision obstacle sensing method and system
JP2023012493A (en) * 2022-05-20 2023-01-25 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Language model pre-training method, apparatus, device, and storage medium
CN114970457A (en) * 2022-05-30 2022-08-30 中山大学 Visual language navigation pre-training method based on prompt and automatic environment exploration

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SentiBERT: A Pre-trained Language Model Incorporating Sentiment Information; Yang Chen; Song Xiaoning; Song Wei; Journal of Frontiers of Computer Science and Technology (09); 1-10 *

Also Published As

Publication number Publication date
CN116168333A (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN116168333B (en) Self-supervision visual language navigation pre-training method, device and storage medium
Lai et al. Video saliency prediction using spatiotemporal residual attentive networks
Ma et al. TS-LSTM and temporal-inception: Exploiting spatiotemporal dynamics for activity recognition
CN111046668B (en) Named entity identification method and device for multi-mode cultural relic data
Yan et al. Video captioning using global-local representation
Wang et al. Learning object interactions and descriptions for semantic image segmentation
CN112685597B (en) Weak supervision video clip retrieval method and system based on erasure mechanism
CN107729987A (en) The automatic describing method of night vision image based on depth convolution loop neutral net
CN110851760B (en) Human-computer interaction system for integrating visual question answering in web3D environment
Zhang et al. Language-guided navigation via cross-modal grounding and alternate adversarial learning
WO2018076122A1 (en) System and method for improving the prediction accuracy of a neural network
CN110234018A (en) Multimedia content description generation method, training method, device, equipment and medium
CN109471959B (en) Figure reasoning model-based method and system for identifying social relationship of people in image
Zhou et al. ICRC-HIT: A deep learning based comment sequence labeling system for answer selection challenge
CN109376250A (en) Entity relationship based on intensified learning combines abstracting method
Zou et al. A survey on VQA: Datasets and approaches
Arinaldi et al. Cheating video description based on sequences of gestures
CN114064974A (en) Information processing method, information processing apparatus, electronic device, storage medium, and program product
Zhang et al. Teaching chinese sign language with a smartphone
Zhou et al. Research on fast pedestrian detection algorithm based on autoencoding neural network and adaboost
CN116012627A (en) Causal time sequence dual-enhancement knowledge tracking method based on hypergraph clustering
CN113591988A (en) Knowledge cognitive structure analysis method, system, computer equipment, medium and terminal
Yang et al. Video as the New Language for Real-World Decision Making
Chu et al. The forgettable-watcher model for video question answering
CN116994695A (en) Training method, device, equipment and storage medium of report generation model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant