CN116499471A - Visual language navigation method, device and medium based on open scene map - Google Patents
- Publication number
- CN116499471A CN116499471A CN202310788171.4A CN202310788171A CN116499471A CN 116499471 A CN116499471 A CN 116499471A CN 202310788171 A CN202310788171 A CN 202310788171A CN 116499471 A CN116499471 A CN 116499471A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/20—Instruments for performing navigational calculations
- G01C21/206—Instruments for performing navigational calculations specially adapted for indoor navigation
- G01C21/38—Electronic maps specially adapted for navigation; Updating thereof
- G01C21/3863—Structures of map data
- G01C21/387—Organisation of map data, e.g. version management or database structures
- G01C21/3878—Hierarchical structures, e.g. layering
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a visual language navigation method, device and medium based on an open scene map, belonging to the technical field of intelligent navigation. The method comprises the following steps: acquiring visual image data of an intelligent agent in an environment; constructing an open scene map representation from the visual image data, the representation comprising an object attribute level map, an open scene object semantic map and a marker semantic level map; and predicting the position of a sub-target point and the navigation progress according to the constructed open scene map, and executing corresponding actions. The invention combines object attribute level information with the semantic information of open scene objects and instruction markers to construct the open scene map. This improves the map's ability to represent the attributes and positions of diverse objects in an open scene, so that the map representation is no longer limited to a small fixed set of object categories, and the added object attribute information helps the intelligent agent resolve object category ambiguity and accurately locate the object of interest.
Description
Technical Field
The invention relates to the technical field of intelligent navigation, in particular to a visual language navigation method, device and medium based on an open scene map.
Background
The emergence of intelligent agents provides an important technical route for improving the cognitive ability of current artificial intelligence and moving towards general intelligence. Through interaction with the environment, an intelligent agent can obtain real feedback from a real physical space or a virtual digital space, enabling further learning and improvement. Visual language navigation aims to let an intelligent agent navigate autonomously by following natural language instructions. It has gradually received wide attention in recent years, has become one of the research hotspots of embodied intelligence, and has great potential application value in human-computer interaction, home service robots and the like.
At present, existing methods propose a map-based modular approach to visual language navigation, representing environment information by constructing a semantic map. However, the semantic maps constructed by existing methods still have two main problems: 1) Existing map construction methods ignore the rich attribute information contained in objects (such as colors and textures), leading to object ambiguity. For example, when a room contains two sofas of different colors, a map that only represents the semantic category "sofa" cannot distinguish between them. 2) Existing map construction approaches can only represent a limited set of object classes (typically 40). Actual instructions and scenes often contain complex and diverse object category information, which existing semantic maps struggle to express effectively, degrading the navigation performance of the intelligent agent. Therefore, how to integrate detailed object attribute information into the map and accurately represent the diverse object categories in an open scene is one of the current research hotspots and difficulties of the visual language navigation task.
Disclosure of Invention
In order to solve at least one of the technical problems existing in the prior art to a certain extent, the invention aims to provide a visual language navigation method, device and medium based on an open scene map.
The technical scheme adopted by the invention is as follows:
a visual language navigation method based on an open scene map comprises the following steps:
acquiring visual image data of an intelligent agent in an environment; the visual image data includes an RGB image and a depth image;
constructing an open scene map representation according to the visual image data, wherein the open scene map representation comprises an object attribute level map, an open scene object semantic map and a marker semantic level map;
and according to the constructed open scene map, predicting the position and navigation progress of the sub-target point, and executing corresponding actions.
Further, the constructing an open scene map representation from the visual image data includes:
acquiring an object attribute level map according to the RGB image and the depth image;
acquiring an open scene object semantic map according to the RGB image, the depth image and a preset open scene object class;
acquiring a logo semantic level map according to the RGB image, the depth image and a preset navigation instruction;
and passing the object attribute level map, the open scene object semantic map and the marker semantic level map through a convolution layer respectively, concatenating the results, and obtaining the open scene map representation through a further convolution layer.
Further, the object attribute level map is obtained specifically by:
inputting the RGB image into a trained deep neural network, and obtaining an intermediate layer characteristic diagram of the deep neural network;
and mapping the obtained middle layer feature map according to the depth information of the depth image to obtain the object attribute level map.
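The depth-based mapping step above can be sketched as follows in Python with NumPy. The pinhole camera model with a known horizontal field of view, the map resolution, and the last-write pooling rule are all illustrative assumptions; the patent only specifies that each feature vector is mapped to its corresponding map position via the depth information:

```python
import numpy as np

def project_features_to_map(features, depth, hfov_deg=90.0,
                            map_size=64, cell_m=0.25):
    """Drop each pixel's feature vector into the top-down map cell its
    depth reading points at. Intrinsics, grid size, and overwrite pooling
    are assumed values, not fixed by the patent."""
    h, w, c = features.shape
    f = (w / 2.0) / np.tan(np.radians(hfov_deg) / 2.0)  # focal length in pixels
    grid = np.zeros((map_size, map_size, c), dtype=features.dtype)
    cx = w / 2.0
    for v in range(h):
        for u in range(w):
            z = depth[v, u]                       # forward distance (metres)
            if z <= 0:
                continue
            x = (u - cx) * z / f                  # lateral offset (metres)
            col = int(map_size / 2 + x / cell_m)
            row = int(map_size - 1 - z / cell_m)  # agent at the bottom edge
            if 0 <= row < map_size and 0 <= col < map_size:
                grid[row, col] = features[v, u]   # last-write pooling
    return grid
```

The same projection is reused for the two semantic maps below; only the per-pixel payload changes.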
Further, the open scene object semantic map is obtained specifically by:
inputting a preset open scene object category and RGB image to an object detector facing open vocabulary, and detecting to obtain an open scene object position;
and mapping the detected open scene object position according to the depth information of the depth image to obtain an open scene object semantic map.
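The two clauses above can be sketched as a rasterisation of open-vocabulary detections into a one-channel-per-category grid. The detection format (category plus image-space box), the choice to project only the box centre, and all grid parameters are assumptions for illustration:

```python
import numpy as np

def detections_to_semantic_map(detections, depth, categories,
                               map_size=64, cell_m=0.25, hfov_deg=90.0):
    """Rasterise detections into a semantic map with one channel per
    category. `detections` is a list of (category, (u0, v0, u1, v1))
    boxes in image coordinates; the box centre is projected through its
    depth reading. All formats here are illustrative assumptions."""
    h, w = depth.shape
    f = (w / 2.0) / np.tan(np.radians(hfov_deg) / 2.0)
    sem = np.zeros((map_size, map_size, len(categories)), dtype=np.float32)
    for cat, (u0, v0, u1, v1) in detections:
        cu, cv = (u0 + u1) // 2, (v0 + v1) // 2   # box centre pixel
        z = float(depth[cv, cu])
        if z <= 0:
            continue
        x = (cu - w / 2.0) * z / f
        col = int(map_size / 2 + x / cell_m)
        row = int(map_size - 1 - z / cell_m)
        if 0 <= row < map_size and 0 <= col < map_size:
            sem[row, col, categories.index(cat)] = 1.0
    return sem
```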
Further, the marker semantic hierarchy map is specifically obtained by:
inputting the navigation instruction into a marker parser to obtain the categories of the markers in the instruction;
inputting the obtained marker category to a target detector facing the open vocabulary to obtain a marker position;
and mapping according to the obtained marker position and the depth information of the depth image to obtain the marker semantic level map.
Further, the marker parser is implemented using a GPT large language model, and the target detector is implemented using a GLIP model.
Further, predicting the position of the sub-target point and the navigation progress according to the constructed open scene map representation and executing the corresponding action includes:
inputting the open scene map representation and the instruction into the GRU to obtain the current state characteristics of the intelligent agent;
the obtained state characteristics pass through a sub-target point predictor to predict the relative coordinate deviation of the sub-target point from the current position;
and predicting the navigation progress in the current state according to the relative coordinate deviation, and acquiring the next action of the intelligent agent according to the position of the sub-target point and the navigation progress.
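The three clauses above can be sketched in PyTorch as follows. The hidden sizes, the GRU-cell formulation, and the sigmoid progress head are illustrative assumptions; the patent only specifies a recurrent network, a sub-target point predictor, and a navigation progress predictor:

```python
import torch
import torch.nn as nn

class SubGoalPolicy(nn.Module):
    """Illustrative decision head: a GRU fuses the map representation
    with the instruction features, a sub-goal head regresses the (dx, dy)
    offset of the next sub-target point from the current position, and a
    progress head outputs a value in [0, 1]. Dimensions are assumptions."""

    def __init__(self, map_dim=128, instr_dim=64, hidden=256):
        super().__init__()
        self.gru = nn.GRUCell(map_dim + instr_dim, hidden)
        self.subgoal_head = nn.Linear(hidden, 2)   # relative (dx, dy)
        self.progress_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, map_feat, instr_feat, h):
        h = self.gru(torch.cat([map_feat, instr_feat], dim=-1), h)
        return self.subgoal_head(h), self.progress_head(h), h

policy = SubGoalPolicy()
h = torch.zeros(1, 256)                       # initial recurrent state
subgoal, progress, h = policy(torch.randn(1, 128), torch.randn(1, 64), h)
```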
The invention adopts another technical scheme that:
a visual language navigation device based on an open scene map, comprising:
the data acquisition module is used for acquiring visual image data of the intelligent agent in the environment; the visual image data includes an RGB image and a depth image;
the representation construction module is used for constructing an open scene map representation according to the visual image data, wherein the open scene map representation comprises an object attribute level map, an open scene object semantic map and a marker semantic level map;
and the navigation application module is used for predicting the position of the sub-target point and the navigation progress according to the constructed open scene map representation, and executing corresponding actions.
The invention adopts another technical scheme that:
a visual language navigation device based on an open scene map, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method as described above.
The invention adopts another technical scheme that:
a computer readable storage medium, in which a processor executable program is stored, which when executed by a processor is adapted to carry out the method as described above.
The beneficial effects of the invention are as follows: according to the invention, object attribute level information is combined with semantic information of the open scene object and the instruction marker, the above information is combined to construct the open scene map, the characterization capability of the map on the attribute and the position of various objects in the open scene is improved, the map characterization is not limited to a fixed small number of object types, and the added object attribute information can help an intelligent agent to eliminate object type ambiguity and accurately position the object of interest.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description refers to the accompanying drawings of the embodiments of the present invention or of the related prior art. It should be understood that the drawings in the following description are only intended to describe some embodiments of the present invention conveniently and clearly, and that those skilled in the art may obtain other drawings from them without inventive labor.
FIG. 1 is a flow chart of steps of a visual language navigation method based on an open scene map in an embodiment of the invention;
fig. 2 is a schematic diagram of an open scene map building module according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present invention, it should be understood that references to orientation descriptions such as upper, lower, front, rear, left, right, etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description of the present invention and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention.
In the description of the present invention, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "exceeding" and the like are understood to exclude the stated number, while "above", "below", "within" and the like are understood to include it. The terms "first" and "second" are used only to distinguish technical features and should not be construed as indicating or implying relative importance, implicitly indicating the number of technical features indicated, or implicitly indicating the precedence of the technical features indicated.
Furthermore, in the description of the present invention, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
In the description of the present invention, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present invention can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
Term interpretation:
GRU: the gate cycle unit is one of the implementation modes of the cyclic neural network (RNN). Any form of recurrent neural network may be used in the application including, but not limited to, GRU, LSTM.
As shown in fig. 1, the present embodiment provides a visual language navigation method based on an open scene map, which includes the following steps:
s1, visual image data of the agent in the environment are acquired. The visual image data includes an RGB image and a depth image.
S2, constructing an open scene map representation according to the visual image data, wherein the open scene map representation comprises an object attribute level map, an open scene object semantic map and a marker semantic level map.
Specifically, step S2 includes the following steps S21-S24:
s21, inputting the RGB image into a trained deep neural network, obtaining a feature map of a middle layer of the deep neural network, and mapping the feature map according to the depth information of the depth image to obtain an object attribute level map.
S22, inputting common open scene object categories and RGB images to an object detector facing open vocabulary, detecting to obtain common open scene object positions, and mapping the object positions according to depth information of the depth images to obtain an open scene object semantic level map.
S23, inputting a navigation instruction to a marker analyzer to obtain an instruction marker category, inputting the marker category to a target detector facing open vocabulary to obtain a marker position, and mapping according to depth information of a depth image to obtain a marker semantic hierarchical map.
S24, passing each of the three maps through a convolution layer, concatenating the results, and obtaining the open scene map representation through a further convolution layer.
S3, predicting the position and navigation progress of the sub-target point according to the constructed open scene map representation, and executing corresponding actions.
Specifically, the state features of the intelligent agent are passed through a fully connected layer to predict the relative coordinate offset of the sub-target point from the current position, and through another fully connected layer to predict the navigation progress in the current state. The next action of the intelligent agent is then determined according to the position of the sub-target point and the navigation progress.
The above method is explained in detail below with reference to fig. 2 and the specific embodiment.
The embodiment provides a visual language navigation method based on an open scene map, which comprises the following steps:
step 1: and acquiring visual images and other data of the intelligent agent in the environment.
Visual images observed by the agent in the simulation environment are acquired; the visual images include RGB images and depth images. This embodiment uses the public simulator Habitat-Sim and the public dataset VLN-CE as training and test data.
Step 2: and constructing an open scene map representation.
The constructed open scene multi-level map representation mainly comprises three parts: an object attribute level map containing object detail features, an open scene object semantic map containing object semantic features, and an instruction marker semantic map.
Interpretability studies of neural networks show that the features of different hidden layers capture different kinds of information about objects in an image: shallow features generally extract local details of objects, while deep features generally extract their global contours. Therefore, when constructing the object attribute map, a CLIP network pre-trained on the image-text matching task is used as the backbone: the RGB picture is input to the network, shallow and deep features of the network are selected and spliced to obtain the object attribute feature map, and each feature vector is mapped to its corresponding position in the map through the depth information, thereby obtaining the object attribute hierarchical map.
In order to embody high-level object semantic information, an open-vocabulary object detector is used to detect common object categories of the open scene and the positions of instruction markers in the RGB picture, and each feature vector is mapped to its corresponding position in the map through the depth information, thereby obtaining the open scene object semantic level map and the instruction marker semantic level map.
Specifically, the marker categories in an instruction are obtained by parsing it with an instruction marker parser (such as a GPT large language model). These marker categories, together with the common open scene object categories and the picture, are input to an open-vocabulary object detector (such as a GLIP model) to obtain the spatial positions of the objects in the RGB picture, and each feature vector is mapped to its corresponding position in the map through the depth information, thereby obtaining the open scene object semantic hierarchical map and the instruction marker semantic hierarchical map.
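The parsing step can be sketched as a small pluggable pipeline. The prompt wording, the `llm` callable, and `fake_llm` are all hypothetical; the patent does not fix the prompt, and the toy stand-in merely lets the sketch run without any API access:

```python
from typing import Callable, List

PROMPT_TEMPLATE = (
    "List the landmark objects mentioned in this navigation instruction, "
    "one per line:\n{instruction}"
)

def parse_landmarks(instruction: str,
                    llm: Callable[[str], str]) -> List[str]:
    """Send the instruction to a large language model (GPT in the patent)
    and split its reply into marker categories. Any LLM client with a
    text-in/text-out interface can be plugged in as `llm`."""
    reply = llm(PROMPT_TEMPLATE.format(instruction=instruction))
    return [line.strip() for line in reply.splitlines() if line.strip()]

# Deterministic stand-in for the LLM, for illustration only.
def fake_llm(prompt: str) -> str:
    return "sofa\nred chair"

landmarks = parse_landmarks(
    "Walk past the sofa and stop next to the red chair", fake_llm)
print(landmarks)  # ['sofa', 'red chair']
# These categories would then serve as the text prompt of an
# open-vocabulary detector such as GLIP to localise each marker.
```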
Map encoder: each of the three maps is passed through a convolution layer, the results are concatenated, and a further convolution layer produces the open scene map representation.
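A minimal PyTorch sketch of this map encoder, under the assumption that "connected in the subspace" means channel-wise concatenation; all channel sizes and kernel shapes are illustrative:

```python
import torch
import torch.nn as nn

class OpenSceneMapEncoder(nn.Module):
    """Each of the three maps passes through its own convolution layer,
    the results are concatenated along the channel dimension, and a final
    convolution fuses them into the open scene map representation."""

    def __init__(self, attr_ch=64, obj_ch=32, marker_ch=8, fused_ch=128):
        super().__init__()
        self.attr_conv = nn.Conv2d(attr_ch, 64, kernel_size=3, padding=1)
        self.obj_conv = nn.Conv2d(obj_ch, 64, kernel_size=3, padding=1)
        self.marker_conv = nn.Conv2d(marker_ch, 64, kernel_size=3, padding=1)
        self.fuse_conv = nn.Conv2d(64 * 3, fused_ch, kernel_size=3, padding=1)

    def forward(self, attr_map, obj_map, marker_map):
        a = torch.relu(self.attr_conv(attr_map))
        o = torch.relu(self.obj_conv(obj_map))
        m = torch.relu(self.marker_conv(marker_map))
        fused = torch.cat([a, o, m], dim=1)  # channel-wise concatenation
        return self.fuse_conv(fused)

enc = OpenSceneMapEncoder()
out = enc(torch.randn(1, 64, 48, 48),
          torch.randn(1, 32, 48, 48),
          torch.randn(1, 8, 48, 48))
print(out.shape)  # torch.Size([1, 128, 48, 48])
```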
Step 3: predicting the position of the sub-target point and the navigation progress, and executing corresponding actions.
For its action decision, the intelligent agent predicts the position of the next sub-target point at each moment. Specifically, the map representation and the instruction are input into a recurrent neural network (any recurrent neural network may be used, including but not limited to GRU and LSTM) to obtain the current state features of the intelligent agent, and the state features are passed through a sub-target point predictor to predict the relative coordinate offset of the sub-target point from the current position. The position of the sub-target point can then be marked on the map, and the next actions of the intelligent agent, including moving forward, turning left and turning right, can be obtained through an existing visual navigation algorithm (such as DDPPO). At the same time, the state features are passed through a navigation progress predictor to predict the navigation progress in the current state; when the predicted progress exceeds a certain threshold, a stop action is executed to end the current navigation.
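The stopping rule and action choice described above can be reduced to a toy decision function. The threshold value, the turn tolerance, and the agent-frame convention are assumptions; in the embodiment, low-level control toward the sub-target point is delegated to a point-goal policy such as DDPPO rather than this hand-written rule:

```python
import math

def select_action(subgoal_dxdy, progress, threshold=0.9):
    """Stop once predicted progress exceeds a threshold; otherwise turn
    to face the sub-target point and move forward. Threshold and the
    15-degree turn tolerance are illustrative values."""
    if progress > threshold:
        return "STOP"
    dx, dy = subgoal_dxdy  # offset in the agent frame: +y ahead, +x right
    heading = math.degrees(math.atan2(dx, dy))
    if heading > 15.0:
        return "TURN_RIGHT"
    if heading < -15.0:
        return "TURN_LEFT"
    return "MOVE_FORWARD"

print(select_action((0.0, 1.0), 0.2))   # MOVE_FORWARD
print(select_action((1.0, 0.0), 0.2))   # TURN_RIGHT
print(select_action((0.0, 1.0), 0.95))  # STOP
```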
In summary, the method of this embodiment effectively exploits the object attribute features contained in the hidden layer features of a deep neural network, mapping these features onto the map to obtain object attribute level information that represents the attributes of objects in the open scene (such as colors, outlines and materials). It then combines the semantic information of open scene objects and instruction markers, using an open-vocabulary object detector to locate any object in the open scene. This information is combined to construct the open scene map, which improves the map's ability to represent the attributes and positions of diverse objects in an open scene; the map representation is no longer limited to a small fixed set of object categories, and the added object attribute information helps the intelligent agent disambiguate object categories and accurately locate the object of interest.
The embodiment also provides a visual language navigation device based on the open scene map, which comprises:
the data acquisition module is used for acquiring visual image data of the intelligent agent in the environment; the visual image data includes an RGB image and a depth image;
the representation construction module is used for constructing an open scene map representation according to the visual image data, wherein the open scene map representation comprises an object attribute level map, an open scene object semantic map and a marker semantic level map;
and the navigation application module is used for predicting the position of the sub-target point and the navigation progress according to the constructed open scene map representation, and executing corresponding actions.
The visual language navigation device based on the open scene map can execute any combination implementation steps of the visual language navigation method based on the open scene map, and has corresponding functions and beneficial effects.
The embodiment also provides a visual language navigation device based on the open scene map, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method as shown in fig. 1.
The visual language navigation device based on the open scene map can execute any combination implementation steps of the visual language navigation method based on the open scene map, and has corresponding functions and beneficial effects.
The present application also discloses a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
The embodiment also provides a storage medium which stores instructions or programs for executing the visual language navigation method based on the open scene map, and when the instructions or programs are run, the instructions or programs can execute any combination implementation steps of the method embodiment, and the method has corresponding functions and beneficial effects.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one, or a combination, of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, reference to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that many changes, modifications, substitutions, and variations may be made to these embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
Although the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and those skilled in the art can make various equivalent modifications and substitutions without departing from the spirit of the present invention; such equivalent modifications and substitutions are intended to be included within the scope of the present invention as defined by the appended claims.
Claims (10)
1. A visual language navigation method based on an open scene map, characterized by comprising the following steps:
acquiring visual image data of an intelligent agent in an environment; the visual image data includes an RGB image and a depth image;
constructing an open scene map representation according to the visual image data, wherein the open scene map representation comprises an object attribute level map, an open scene object semantic map and a marker semantic level map;
and predicting the position of a sub-target point and the navigation progress according to the constructed open scene map representation, and executing corresponding actions.
2. The method of claim 1, wherein constructing an open scene map representation from visual image data comprises:
acquiring an object attribute level map according to the RGB image and the depth image;
acquiring an open scene object semantic map according to the RGB image, the depth image and a preset open scene object class;
acquiring a logo semantic level map according to the RGB image, the depth image and a preset navigation instruction;
and passing the object attribute level map, the open scene object semantic map, and the marker semantic level map through respective convolution layers, concatenating the outputs, and passing the concatenated result through a further convolution layer to obtain the open scene map representation.
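The fusion step in claim 2 can be sketched as follows: each map layer passes through its own convolution, the outputs are concatenated along the channel axis, and a final convolution produces the joint representation. This NumPy sketch uses 1x1 convolutions and random weights; the channel widths (32, 5, and 3 inputs, 8 shared) are illustrative assumptions, not the patented architecture.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution = per-cell linear projection over channels.
    x: (H, W, C_in), w: (C_in, C_out) -> (H, W, C_out)."""
    return x @ w

def fuse_maps(attr_map, object_map, marker_map, rng, d=8):
    """Pass each map layer through its own (1x1) conv, concatenate the
    results along the channel axis, then apply a final conv."""
    proj = [conv1x1(m, rng.standard_normal((m.shape[-1], d)))
            for m in (attr_map, object_map, marker_map)]
    stacked = np.concatenate(proj, axis=-1)        # (H, W, 3*d)
    return conv1x1(stacked, rng.standard_normal((3 * d, d)))

rng = np.random.default_rng(0)
H = W = 16
fused = fuse_maps(rng.standard_normal((H, W, 32)),  # object attribute layer
                  rng.standard_normal((H, W, 5)),   # open-scene object layer
                  rng.standard_normal((H, W, 3)),   # marker layer
                  rng)
print(fused.shape)  # (16, 16, 8)
```

A real system would learn these weights end-to-end; the sketch only shows the data flow of project, concatenate, project.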
3. The visual language navigation method based on an open scene map according to claim 1, wherein the object attribute level map is obtained specifically by:
inputting the RGB image into a trained deep neural network, and obtaining an intermediate layer characteristic diagram of the deep neural network;
and mapping the obtained middle layer feature map according to the depth information of the depth image to obtain the object attribute level map.
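The depth-based mapping recited in claims 3 through 5 can be sketched as a pinhole back-projection that scatters per-pixel features into a top-down egocentric grid. The camera intrinsics (fx, cx), cell size, and grid size below are assumed values for illustration only.

```python
import numpy as np

def project_to_grid(features, depth, fx, cx, cell=0.25, grid=32):
    """Scatter per-pixel features into a top-down egocentric grid,
    mean-pooling the features that land in the same cell.
    features: (H, W, C); depth: (H, W) metric depth per pixel."""
    H, W, C = features.shape
    u = np.broadcast_to(np.arange(W), (H, W))
    x = (u - cx) / fx * depth            # lateral offset in metres
    z = depth                            # forward distance in metres
    gx = np.clip((x / cell).astype(int) + grid // 2, 0, grid - 1)
    gz = np.clip((z / cell).astype(int), 0, grid - 1)
    top_down = np.zeros((grid, grid, C))
    count = np.zeros((grid, grid, 1))
    np.add.at(top_down, (gz, gx), features)   # accumulate features per cell
    np.add.at(count, (gz, gx), 1.0)           # count hits per cell
    return top_down / np.maximum(count, 1.0)  # mean-pool

feat = np.ones((48, 64, 4))                   # e.g. a mid-layer feature map
depth = np.full((48, 64), 2.0)                # flat wall 2 m ahead
grid_map = project_to_grid(feat, depth, fx=64.0, cx=32.0)
print(grid_map.shape)  # (32, 32, 4)
```

With a constant depth of 2 m and a 0.25 m cell, every pixel lands in grid row 8, so that row carries the pooled features while the rest of the grid stays empty.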
4. The visual language navigation method based on an open scene map according to claim 1, wherein the open scene object semantic map is obtained specifically by:
inputting a preset open scene object category and RGB image to an object detector facing open vocabulary, and detecting to obtain an open scene object position;
and mapping the detected open scene object position according to the depth information of the depth image to obtain an open scene object semantic map.
5. The visual language navigation method based on an open scene map according to claim 1, wherein the marker semantic hierarchy map is obtained specifically by:
inputting the navigation instruction into a marker parser to obtain the categories of the markers in the instruction;
inputting the obtained marker category to a target detector facing the open vocabulary to obtain a marker position;
and mapping according to the obtained marker position and the depth information of the depth image to obtain the marker semantic level map.
6. The visual language navigation method based on an open scene map of claim 5, wherein the marker parser is implemented using a GPT large language model and the object detector is implemented using a GLIP model.
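Claims 5 and 6 describe a marker pipeline: parse marker categories from the instruction, detect their positions, and rasterise them into a semantic layer. In this sketch the GPT-based parser is replaced by a simple vocabulary filter and the GLIP detector by a precomputed detection dictionary; both stand-ins are hypothetical simplifications of the models named in claim 6.

```python
def parse_markers(instruction, vocabulary):
    """Stand-in for the LLM-based marker parser: keep instruction words
    that appear in a known marker vocabulary, in order of first mention."""
    words = instruction.lower().replace(",", "").split()
    seen = []
    for w in words:
        if w in vocabulary and w not in seen:
            seen.append(w)
    return seen

def build_marker_layer(instruction, detections, vocabulary, grid=8):
    """Rasterise detected marker positions into one channel per marker.
    detections: {category: (row, col)} as an open-vocabulary detector
    might return after depth-based mapping."""
    cats = parse_markers(instruction, vocabulary)
    layer = [[[0.0] * len(cats) for _ in range(grid)] for _ in range(grid)]
    for k, cat in enumerate(cats):
        if cat in detections:
            r, c = detections[cat]
            layer[r][c][k] = 1.0
    return cats, layer

vocab = {"sofa", "lamp", "door"}
cats, layer = build_marker_layer(
    "walk past the sofa, then stop at the lamp",
    detections={"sofa": (2, 3), "lamp": (6, 1)},
    vocabulary=vocab)
print(cats)         # ['sofa', 'lamp']
print(layer[2][3])  # [1.0, 0.0]
```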
7. The visual language navigation method based on an open scene map according to claim 1, wherein predicting the position of the sub-target point and the navigation progress according to the constructed open scene map representation, and executing the corresponding actions, comprises:
inputting the open scene map representation and the instruction into a GRU (gated recurrent unit) to obtain current state characteristics of the intelligent agent;
passing the obtained state characteristics through a sub-target point predictor to predict the relative coordinate deviation of the sub-target point from the current position;
and predicting the navigation progress in the current state according to the relative coordinate deviation, and acquiring the next action of the intelligent agent according to the position of the sub-target point and the navigation progress.
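Claim 7 can be read as a recurrent policy: a GRU updates the agent state from the current observation features, a linear head predicts the sub-target point's relative (dx, dy) offset, and a sigmoid head estimates navigation progress. The dimensions and random weights below are illustrative assumptions, not the claimed model.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h, p):
    """One step of a standard GRU cell."""
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"])          # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"])          # reset gate
    h_tilde = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"])
    return (1 - z) * h + z * h_tilde

def step(obs_feat, h, p):
    """Update the agent state, then predict the sub-goal offset
    (relative coordinates) and the scalar navigation progress."""
    h = gru_cell(obs_feat, h, p)
    offset = h @ p["W_goal"]                        # (dx, dy) from current pose
    progress = sigmoid(float(h @ p["w_prog"]))      # in (0, 1)
    return h, offset, progress

rng = np.random.default_rng(1)
D, Hdim = 24, 16                                    # assumed feature/state sizes
p = {k: rng.standard_normal((D, Hdim)) * 0.1 for k in ("Wz", "Wr", "Wh")}
p.update({k: rng.standard_normal((Hdim, Hdim)) * 0.1 for k in ("Uz", "Ur", "Uh")})
p["W_goal"] = rng.standard_normal((Hdim, 2)) * 0.1
p["w_prog"] = rng.standard_normal(Hdim) * 0.1

h = np.zeros(Hdim)
for _ in range(3):                                  # three observation steps
    h, offset, progress = step(rng.standard_normal(D), h, p)
print(offset.shape, 0.0 < progress < 1.0)  # (2,) True
```

A trained system would feed the fused map-and-instruction features as `obs_feat` and convert the predicted offset into a low-level action; here only the recurrent prediction loop is shown.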
8. A visual language navigation device based on an open scene map, comprising:
the data acquisition module is used for acquiring visual image data of the intelligent agent in the environment; the visual image data includes an RGB image and a depth image;
the representation construction module is used for constructing an open scene map representation according to the visual image data, wherein the open scene map representation comprises an object attribute level map, an open scene object semantic map and a marker semantic level map;
and the navigation application module is used for predicting the position of the sub-target point and the navigation progress according to the constructed open scene map representation, and executing corresponding actions.
9. A visual language navigation device based on an open scene map, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-7.
10. A computer-readable storage medium in which a processor-executable program is stored, characterized in that the processor-executable program, when executed by a processor, performs the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310788171.4A CN116499471B (en) | 2023-06-30 | 2023-06-30 | Visual language navigation method, device and medium based on open scene map |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310788171.4A CN116499471B (en) | 2023-06-30 | 2023-06-30 | Visual language navigation method, device and medium based on open scene map |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116499471A true CN116499471A (en) | 2023-07-28 |
CN116499471B CN116499471B (en) | 2023-09-12 |
Family
ID=87325325
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310788171.4A Active CN116499471B (en) | 2023-06-30 | 2023-06-30 | Visual language navigation method, device and medium based on open scene map |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116499471B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117874302A (en) * | 2024-03-12 | 2024-04-12 | 暗物智能科技(广州)有限公司 | Full-open vocabulary scene graph generation method and system based on depth fusion |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10437252B1 (en) * | 2017-09-08 | 2019-10-08 | Perceptln Shenzhen Limited | High-precision multi-layer visual and semantic map for autonomous driving |
CN111645073A (en) * | 2020-05-29 | 2020-09-11 | 武汉理工大学 | Robot visual semantic navigation method, device and system |
US20200364554A1 (en) * | 2018-02-09 | 2020-11-19 | Baidu Usa Llc | Systems and methods for deep localization and segmentation with a 3d semantic map |
WO2021058090A1 (en) * | 2019-09-24 | 2021-04-01 | Toyota Motor Europe | System and method for navigating a vehicle using language instructions |
CN113312983A (en) * | 2021-05-08 | 2021-08-27 | 华南理工大学 | Semantic segmentation method, system, device and medium based on multi-modal data fusion |
US20210302585A1 (en) * | 2018-08-17 | 2021-09-30 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Smart navigation method and system based on topological map |
CN113670310A (en) * | 2021-07-27 | 2021-11-19 | 际络科技(上海)有限公司 | Visual voice navigation method, device, equipment and storage medium |
CN114384920A (en) * | 2022-03-23 | 2022-04-22 | 安徽大学 | Dynamic obstacle avoidance method based on real-time construction of local grid map |
CN114460943A (en) * | 2022-02-10 | 2022-05-10 | 山东大学 | Self-adaptive target navigation method and system for service robot |
US20220198813A1 (en) * | 2020-12-17 | 2022-06-23 | Sri International | System and method for efficient visual navigation |
CN114782530A (en) * | 2022-03-28 | 2022-07-22 | 杭州国辰机器人科技有限公司 | Three-dimensional semantic map construction method, device, equipment and medium under indoor scene |
CN114973125A (en) * | 2022-05-12 | 2022-08-30 | 武汉大学 | Method and system for assisting navigation in intelligent navigation scene by using knowledge graph |
CN115311538A (en) * | 2022-02-21 | 2022-11-08 | 上海应用技术大学 | Intelligent agent target searching method based on scene prior |
CN115824213A (en) * | 2022-11-18 | 2023-03-21 | 天津大学 | Visual language navigation method based on follower model |
CN116242359A (en) * | 2023-02-08 | 2023-06-09 | 华南理工大学 | Visual language navigation method, device and medium based on scene fusion knowledge |
2023-06-30: CN application CN202310788171.4A (patent CN116499471B), status: Active
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10437252B1 (en) * | 2017-09-08 | 2019-10-08 | Perceptln Shenzhen Limited | High-precision multi-layer visual and semantic map for autonomous driving |
US20200364554A1 (en) * | 2018-02-09 | 2020-11-19 | Baidu Usa Llc | Systems and methods for deep localization and segmentation with a 3d semantic map |
US20210302585A1 (en) * | 2018-08-17 | 2021-09-30 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Smart navigation method and system based on topological map |
WO2021058090A1 (en) * | 2019-09-24 | 2021-04-01 | Toyota Motor Europe | System and method for navigating a vehicle using language instructions |
CN111645073A (en) * | 2020-05-29 | 2020-09-11 | 武汉理工大学 | Robot visual semantic navigation method, device and system |
US20220198813A1 (en) * | 2020-12-17 | 2022-06-23 | Sri International | System and method for efficient visual navigation |
CN113312983A (en) * | 2021-05-08 | 2021-08-27 | 华南理工大学 | Semantic segmentation method, system, device and medium based on multi-modal data fusion |
CN113670310A (en) * | 2021-07-27 | 2021-11-19 | 际络科技(上海)有限公司 | Visual voice navigation method, device, equipment and storage medium |
CN114460943A (en) * | 2022-02-10 | 2022-05-10 | 山东大学 | Self-adaptive target navigation method and system for service robot |
CN115311538A (en) * | 2022-02-21 | 2022-11-08 | 上海应用技术大学 | Intelligent agent target searching method based on scene prior |
CN114384920A (en) * | 2022-03-23 | 2022-04-22 | 安徽大学 | Dynamic obstacle avoidance method based on real-time construction of local grid map |
CN114782530A (en) * | 2022-03-28 | 2022-07-22 | 杭州国辰机器人科技有限公司 | Three-dimensional semantic map construction method, device, equipment and medium under indoor scene |
CN114973125A (en) * | 2022-05-12 | 2022-08-30 | 武汉大学 | Method and system for assisting navigation in intelligent navigation scene by using knowledge graph |
CN115824213A (en) * | 2022-11-18 | 2023-03-21 | 天津大学 | Visual language navigation method based on follower model |
CN116242359A (en) * | 2023-02-08 | 2023-06-09 | 华南理工大学 | Visual language navigation method, device and medium based on scene fusion knowledge |
Non-Patent Citations (2)
Title |
---|
PEIHAO CHEN等: "Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation", 《ARXIV》, pages 1 - 13 * |
RUNHAO ZENG等: "Dense Regression Network for Video Grounding", 《2020 IEEE/CVF CVPR》, pages 10287 - 10296 * |
Also Published As
Publication number | Publication date |
---|---|
CN116499471B (en) | 2023-09-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lyu et al. | Robot path planning by leveraging the graph-encoded Floyd algorithm | |
CN106169188A (en) | A kind of method for tracing object based on the search of Monte Carlo tree | |
CN107886120A (en) | Method and apparatus for target detection tracking | |
CN108986138A (en) | Method for tracking target and equipment | |
CN106802655A (en) | Indoor map generation method and device | |
CN107784663A (en) | Correlation filtering tracking and device based on depth information | |
CN116499471B (en) | Visual language navigation method, device and medium based on open scene map | |
Zheng et al. | Active scene understanding via online semantic reconstruction | |
CN111105439B (en) | Synchronous positioning and mapping method using residual attention mechanism network | |
CN111095170B (en) | Virtual reality scene, interaction method thereof and terminal equipment | |
Wu et al. | Revisiting embodiedqa: A simple baseline and beyond | |
CN110465089B (en) | Map exploration method, map exploration device, map exploration medium and electronic equipment based on image recognition | |
CN106767755A (en) | Method and device for planning autonomous formula equipment operating point | |
CN106782030A (en) | Method and device for generating the indoor map with semantic description | |
Ye et al. | From seeing to moving: A survey on learning for visual indoor navigation (vin) | |
CN106782029A (en) | Indoor map generation method and device | |
CN106814734A (en) | The method and system of autonomous formula equipment are controlled using computing device | |
CN116109812A (en) | Target detection method based on non-maximum suppression threshold optimization | |
CN109858402B (en) | Image detection method, device, terminal and storage medium | |
CN115147637A (en) | Real-time semantic map generation method and device based on robot | |
CN111080671A (en) | Motion prediction method based on deep neural network and intelligent terminal | |
Wu et al. | Vision-language navigation: a survey and taxonomy | |
CN104484034B (en) | A kind of gesture motion primitive transition frames localization method based on gesture identification | |
Ehsani et al. | Object manipulation via visual target localization | |
Dai et al. | RGB‐D SLAM with moving object tracking in dynamic environments |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||