CN116499471A - Visual language navigation method, device and medium based on open scene map

Info

Publication number
CN116499471A
Authority
CN
China
Prior art keywords
map
open scene
open
semantic
navigation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310788171.4A
Other languages
Chinese (zh)
Other versions
CN116499471B (en)
Inventor
谭明奎
陈沛豪
吉冬昱
林坤阳
杜卿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202310788171.4A
Publication of CN116499471A
Application granted
Publication of CN116499471B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • G01C21/206 Instruments for performing navigational calculations specially adapted for indoor navigation
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/38 Electronic maps specially adapted for navigation; Updating thereof
    • G01C21/3863 Structures of map data
    • G01C21/387 Organisation of map data, e.g. version management or database structures
    • G01C21/3878 Hierarchical structures, e.g. layering
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Navigation (AREA)

Abstract

The invention discloses a visual language navigation method, device and medium based on an open scene map, belonging to the technical field of intelligent navigation. The method comprises the following steps: acquiring visual image data of an intelligent agent in an environment; constructing an open scene map representation from the visual image data, wherein the open scene map representation comprises an object attribute level map, an open scene object semantic map and a marker semantic level map; and predicting the position of a sub-target point and the navigation progress according to the constructed open scene map representation, and executing corresponding actions. The invention combines object attribute level information with semantic information about the open scene objects and the instruction markers to construct the open scene map, which improves the map's ability to characterize the attributes and positions of diverse objects in an open scene. The map representation is no longer limited to a small, fixed set of object categories, and the added object attribute information helps the intelligent agent resolve object category ambiguity and accurately locate the object of interest.

Description

Visual language navigation method, device and medium based on open scene map
Technical Field
The invention relates to the technical field of intelligent navigation, in particular to a visual language navigation method, device and medium based on an open scene map.
Background
The emergence of intelligent agents provides an important technical route for improving the cognitive ability of current artificial intelligence and moving towards general intelligence. By interacting with the environment, an intelligent agent can obtain real feedback from a real physical space or a virtual digital space, and thereby learn and improve further. Visual language navigation aims to enable an intelligent agent to navigate autonomously by following natural language instructions; it has gradually received wide attention in recent years, has become one of the research hotspots of embodied intelligence, and has huge potential application value in human-computer interaction, home service robots and the like.
At present, existing methods propose a map-based modular approach to visual language navigation, in which environment information is represented by constructing a semantic map. However, the semantic maps constructed by existing methods still have two main problems: 1) existing map construction ignores the rich attribute information carried by objects (such as colour and texture), which causes object ambiguity. For example, when a room contains two sofas of different colours, a map that only represents the semantic category "sofa" cannot distinguish between them; 2) existing map construction can only represent a limited set of object classes (typically 40). Real instructions and scenes often contain complex and diverse object category information that existing semantic maps struggle to express effectively, which degrades the navigation performance of the intelligent agent. Therefore, how to integrate detailed object attribute information into the map and accurately represent the diverse object categories in an open scene is one of the research hotspots and difficulties of the current visual language navigation task.
Disclosure of Invention
In order to solve at least one of the technical problems existing in the prior art to a certain extent, the invention aims to provide a visual language navigation method, device and medium based on an open scene map.
The technical scheme adopted by the invention is as follows:
a visual language navigation method based on an open scene map comprises the following steps:
acquiring visual image data of an intelligent agent in an environment; the visual image data includes an RGB image and a depth image;
constructing an open scene map representation according to the visual image data, wherein the open scene map representation comprises an object attribute level map, an open scene object semantic map and a marker semantic level map;
and according to the constructed open scene map, predicting the position and navigation progress of the sub-target point, and executing corresponding actions.
Further, the constructing an open scene map representation from the visual image data includes:
acquiring an object attribute level map according to the RGB image and the depth image;
acquiring an open scene object semantic map according to the RGB image, the depth image and a preset open scene object class;
acquiring a marker semantic level map according to the RGB image, the depth image and a preset navigation instruction;
and respectively passing the object attribute level map, the open scene object semantic map and the marker semantic level map through a convolution layer, and after subspace connection, obtaining the open scene map representation through the convolution layer.
Further, the object attribute level map is obtained specifically by:
inputting the RGB image into a trained deep neural network, and obtaining an intermediate layer characteristic diagram of the deep neural network;
and mapping the obtained middle layer feature map according to the depth information of the depth image to obtain the object attribute level map.
Further, the open scene object semantic map is obtained specifically by:
inputting a preset open scene object category and RGB image to an object detector facing open vocabulary, and detecting to obtain an open scene object position;
and mapping the detected open scene object position according to the depth information of the depth image to obtain an open scene object semantic map.
Further, the marker semantic hierarchy map is specifically obtained by:
inputting the navigation instruction into a marker analyzer to obtain the category of the marker in the instruction;
inputting the obtained marker category to a target detector facing the open vocabulary to obtain a marker position;
and mapping according to the obtained marker position and the depth information of the depth image to obtain the marker semantic level map.
Further, the marker parser is implemented using a GPT large language model, and the target detector is implemented using a GLIP model.
Further, the predicting the position and navigation progress of the sub-target point according to the constructed open scene map representation, and executing the corresponding action, including:
inputting the open scene map representation and the instruction into the GRU to obtain the current state characteristics of the intelligent agent;
the obtained state characteristics pass through a sub-target point predictor to predict the relative coordinate deviation of the sub-target point from the current position;
and predicting the navigation progress in the current state according to the relative coordinate deviation, and acquiring the next action of the intelligent agent according to the position of the sub-target point and the navigation progress.
The invention adopts another technical scheme that:
a visual language navigation device based on an open scene map, comprising:
the data acquisition module is used for acquiring visual image data of the intelligent agent in the environment; the visual image data includes an RGB image and a depth image;
the representation construction module is used for constructing an open scene map representation according to the visual image data, wherein the open scene map representation comprises an object attribute level map, an open scene object semantic map and a marker semantic level map;
and the navigation application module is used for predicting the position of the sub-target point and the navigation progress according to the constructed open scene map representation, and executing corresponding actions.
The invention adopts another technical scheme that:
a visual language navigation device based on an open scene map, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method as described above.
The invention adopts another technical scheme that:
a computer readable storage medium, in which a processor executable program is stored, which when executed by a processor is adapted to carry out the method as described above.
The beneficial effects of the invention are as follows: the invention combines object attribute level information with semantic information about the open scene objects and the instruction markers to construct the open scene map, which improves the map's ability to characterize the attributes and positions of diverse objects in an open scene. The map representation is no longer limited to a small, fixed set of object categories, and the added object attribute information helps the intelligent agent resolve object category ambiguity and accurately locate the object of interest.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention or in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It should be understood that the drawings in the following description only illustrate some embodiments of the present invention, and that those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a flow chart of steps of a visual language navigation method based on an open scene map in an embodiment of the invention;
fig. 2 is a schematic diagram of an open scene map building module according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present invention, it should be understood that references to orientation descriptions such as upper, lower, front, rear, left, right, etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description of the present invention and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention.
In the description of the present invention, "several" means one or more, "a plurality of" means two or more, and "greater than", "less than", "exceeding" and the like are understood to exclude the number itself, while "above", "below", "within" and the like are understood to include the number itself. The terms "first" and "second" are used only to distinguish technical features and should not be construed as indicating or implying relative importance, implicitly indicating the number of technical features indicated, or implicitly indicating the precedence of the technical features indicated.
Furthermore, in the description of the present invention, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
In the description of the present invention, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present invention can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
Term interpretation:
GRU: the gate cycle unit is one of the implementation modes of the cyclic neural network (RNN). Any form of recurrent neural network may be used in the application including, but not limited to, GRU, LSTM.
As shown in fig. 1, the present embodiment provides a visual language navigation method based on an open scene map, which includes the following steps:
S1, acquiring visual image data of the intelligent agent in the environment. The visual image data includes an RGB image and a depth image.
S2, constructing an open scene map representation according to the visual image data, wherein the open scene map representation comprises an object attribute level map, an open scene object semantic map and a marker semantic level map.
Specifically, step S2 includes the following steps S21-S24:
S21, inputting the RGB image into a trained deep neural network, obtaining an intermediate layer feature map of the deep neural network, and mapping the feature map according to the depth information of the depth image to obtain the object attribute level map (the depth-based mapping is sketched after step S24 below).
S22, inputting common open scene object categories and RGB images to an object detector facing open vocabulary, detecting to obtain common open scene object positions, and mapping the object positions according to depth information of the depth images to obtain an open scene object semantic level map.
S23, inputting a navigation instruction to a marker analyzer to obtain an instruction marker category, inputting the marker category to a target detector facing open vocabulary to obtain a marker position, and mapping according to depth information of a depth image to obtain a marker semantic hierarchical map.
S24, respectively passing the three maps through a convolution layer, connecting them in subspace, and obtaining the open scene map representation through a further convolution layer.
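The "mapping according to the depth information" operation shared by steps S21-S23 can be illustrated with a short sketch. The following minimal NumPy example is not taken from the patent: it assumes a pinhole camera with known intrinsics (fx, fy, cx, cy), an egocentric top-down grid with a fixed cell size, and a simple height filter; all parameter names and default values are illustrative assumptions.

import numpy as np

def project_features_to_map(features, depth, fx, fy, cx, cy,
                            map_size=240, cell_m=0.05, agent_height=0.88):
    """Scatter per-pixel feature vectors onto an egocentric top-down grid.
    features: (H, W, C) per-pixel vectors (e.g. CLIP features or detection scores)
    depth:    (H, W) metric depth in metres
    Returns a (map_size, map_size, C) grid centred on the agent.
    (All parameters and defaults are illustrative assumptions.)"""
    h, w, c = features.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth                                 # forward distance from the camera
    x = (u - cx) * z / fx                     # lateral offset (right positive)
    y = (v - cy) * z / fy                     # vertical offset in camera coordinates
    valid = (z > 0.1) & (np.abs(y) < agent_height + 1.0)   # drop invalid or out-of-range points
    col = (x / cell_m + map_size // 2).astype(int)          # the agent sits at the map centre
    row = (map_size // 2 - z / cell_m).astype(int)
    inside = valid & (row >= 0) & (row < map_size) & (col >= 0) & (col < map_size)
    grid = np.zeros((map_size, map_size, c), dtype=np.float32)
    count = np.zeros((map_size, map_size, 1), dtype=np.float32)
    np.add.at(grid, (row[inside], col[inside]), features[inside])
    np.add.at(count, (row[inside], col[inside]), 1.0)
    return grid / np.maximum(count, 1.0)      # average the features that fall into the same cell

The same projection is applied to the object attribute features (S21), the detected open scene objects (S22) and the instruction marker detections (S23); only the per-pixel feature content differs.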
S3, predicting the position and navigation progress of the sub-target point according to the constructed open scene map representation, and executing corresponding actions.
Specifically, the state features of the intelligent agent are passed through a fully connected layer to predict the relative coordinate offset of the sub-target point from the current position, and the navigation progress in the current state is predicted through another fully connected layer. The next action of the intelligent agent is then determined according to the position of the sub-target point and the navigation progress.
The above method is explained in detail below with reference to fig. 2 and the specific embodiment.
The embodiment provides a visual language navigation method based on an open scene map, which comprises the following steps:
step 1: and acquiring visual images and other data of the intelligent agent in the environment.
Visual images of the agent observed in the simulation environment are acquired, the visual images including RGB images and depth images, in this embodiment using a public simulator habit-sim, and using the public dataset VLN-CE as training and test data.
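The acquisition step can be sketched as follows, assuming the habitat-lab Python API for the Habitat-Sim simulator. The config path, sensor keys and the "instruction" observation below follow common habitat-lab/VLN-CE setups and are assumptions rather than details specified by the patent; exact names differ between habitat-lab versions.

import habitat

# Load a VLN-CE style task configuration (the path is a placeholder; adjust to the local install).
config = habitat.get_config("habitat_baselines/config/vln/vln_r2r.yaml")
env = habitat.Env(config=config)
observations = env.reset()
rgb = observations["rgb"]        # (H, W, 3) uint8 colour image
depth = observations["depth"]    # (H, W, 1) float depth map, typically normalised
instruction = observations.get("instruction")   # natural-language instruction, if the task provides it
# These RGB and depth observations feed the open scene map construction in step 2.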
Step 2: constructing an open scene map representation.
The constructed open scene multi-level map representation mainly comprises three parts: an object attribute level map containing object detail features, an open scene object semantic map containing object semantic features, and an instruction marker semantic map.
Existing work on neural network interpretability shows that the features of different hidden layers of a neural network capture different types of information about the objects in an image: shallow features generally extract local object details, while deep features generally extract global object contours. Therefore, when constructing the object attribute map, the RGB image is input to a CLIP network pre-trained on the image-text matching task; shallow and deep features of the network are selected and spliced to obtain an object attribute feature map, and each feature vector is mapped to the corresponding position in the map through the depth information, thereby obtaining the object attribute level map.
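As an illustration of this step, the sketch below extracts one shallow and one deep feature map from a frozen CLIP ResNet visual encoder using forward hooks and concatenates them into per-pixel attribute features, which are then projected into the map with the depth-based mapping sketched after step S24. The hooked layers, the resizing and the use of the OpenAI clip package are assumptions made for illustration; the patent only specifies a CLIP network pre-trained on the image-text matching task.

import torch
import torch.nn.functional as F
import clip   # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)    # ResNet-50 visual encoder
visual = model.visual.eval()

feats = {}
def hook(name):
    def _hook(module, inputs, output):
        feats[name] = output
    return _hook

# Hook one shallow and one deep residual stage (the layer choice is an assumption).
visual.layer1.register_forward_hook(hook("shallow"))
visual.layer3.register_forward_hook(hook("deep"))

@torch.no_grad()
def attribute_features(rgb_pil):
    """Return an (H', W', C) per-pixel attribute feature map for one RGB frame (PIL image)."""
    image = preprocess(rgb_pil).unsqueeze(0).to(device)
    visual(image.type(model.dtype))                       # populates `feats` through the hooks
    shallow, deep = feats["shallow"].float(), feats["deep"].float()
    deep = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear", align_corners=False)
    fused = torch.cat([shallow, deep], dim=1)             # splice shallow and deep features
    return fused.squeeze(0).permute(1, 2, 0).cpu().numpy()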
In order to capture high-level object semantic information, an open-vocabulary object detector is used to detect the positions of common open scene object categories and of the instruction markers in the RGB image, and each feature vector is mapped to the corresponding position in the map through the depth information, thereby obtaining the open scene object semantic level map and the instruction marker semantic level map.
Specifically, the marker categories in an instruction are obtained by parsing it with an instruction marker parser (such as a GPT large language model); these marker categories, together with the common open scene object categories and the image, are input to the open-vocabulary object detector (such as a GLIP model) to obtain the spatial positions of the corresponding objects in the RGB image; each feature vector is then mapped to the corresponding position in the map through the depth information, thereby obtaining the open scene object semantic level map and the instruction marker semantic level map.
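A sketch of this parsing and detection step is given below. It assumes the OpenAI Python client as the "GPT large language model" and leaves the open-vocabulary detector as a placeholder to be backed by GLIP (or any other open-vocabulary detector); the prompt, the model name and the detect_open_vocab helper are illustrative assumptions rather than interfaces described in the patent.

from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

def parse_markers(instruction: str) -> list:
    """Ask a GPT model for the marker (landmark) phrases mentioned in a navigation instruction."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # the model name is an assumption
        messages=[
            {"role": "system",
             "content": "Extract the object landmarks mentioned in the navigation instruction. "
                        "Answer with a comma-separated list of noun phrases only."},
            {"role": "user", "content": instruction},
        ],
    )
    return [m.strip() for m in resp.choices[0].message.content.split(",") if m.strip()]

def detect_open_vocab(rgb_image, phrases):
    """Placeholder for an open-vocabulary detector such as GLIP: given an RGB image and a list of
    text phrases, return [(phrase, bounding_box, score), ...]. The patent names GLIP for this role;
    its exact Python interface is not reproduced here."""
    raise NotImplementedError

instruction = "Walk past the blue sofa and stop next to the potted plant."
markers = parse_markers(instruction)     # e.g. ["blue sofa", "potted plant"]
# detections = detect_open_vocab(rgb, markers + open_scene_categories)
# Each detection is then projected into the map cell indicated by the depth information, as above.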
Map encoder: the three maps are each passed through a convolution layer, connected in subspace, and then passed through a further convolution layer to obtain the open scene map representation.
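A minimal PyTorch sketch of such a map encoder is given below, assuming the three maps have already been rasterised to the same spatial resolution; the channel and kernel sizes are illustrative assumptions, and "connected in subspace" is realised here as channel-wise concatenation.

import torch
import torch.nn as nn

class OpenSceneMapEncoder(nn.Module):
    """Fuse the object attribute map, open scene object map and marker map into one representation."""
    def __init__(self, c_attr=256, c_obj=64, c_marker=32, c_out=128):
        super().__init__()
        # One convolution per map branch (channel sizes are assumptions).
        self.attr_conv = nn.Sequential(nn.Conv2d(c_attr, c_out, 3, padding=1), nn.ReLU())
        self.obj_conv = nn.Sequential(nn.Conv2d(c_obj, c_out, 3, padding=1), nn.ReLU())
        self.marker_conv = nn.Sequential(nn.Conv2d(c_marker, c_out, 3, padding=1), nn.ReLU())
        # Fusion convolution applied after concatenating the three branches.
        self.fuse = nn.Sequential(nn.Conv2d(3 * c_out, c_out, 3, padding=1), nn.ReLU())

    def forward(self, attr_map, obj_map, marker_map):
        branches = [self.attr_conv(attr_map), self.obj_conv(obj_map), self.marker_conv(marker_map)]
        return self.fuse(torch.cat(branches, dim=1))   # (B, c_out, H, W) open scene map representation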
Step 3: predicting the position of the sub-target point and the navigation progress, and executing corresponding actions.
The action decision of the intelligent agent is made by predicting the position of the next sub-target point at each time step. Specifically, the map representation and the instruction are input into a recurrent neural network (any recurrent neural network may be chosen, including but not limited to GRU and LSTM) to obtain the current state features of the intelligent agent, and the state features are passed through a sub-target point predictor to predict the relative coordinate offset of the sub-target point from the current position. The position of the sub-target point can thus be marked on the map, and the next actions of the intelligent agent, including moving forward, turning left and turning right, can be obtained through an existing visual navigation algorithm (such as DDPPO). At the same time, the state features are passed through a navigation progress predictor to predict the navigation progress in the current state; when the predicted progress exceeds a certain threshold, a stop action is executed to end the current navigation.
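The decision step can be illustrated with the following PyTorch sketch; the feature dimensions, the global pooling and the stop threshold are assumptions, and the point-goal controller (for example a DDPPO policy) that turns the predicted sub-target point into forward/turn actions is left abstract.

import torch
import torch.nn as nn

class SubGoalPolicy(nn.Module):
    def __init__(self, map_dim=128, instr_dim=256, hidden=512):
        super().__init__()
        self.rnn = nn.GRUCell(map_dim + instr_dim, hidden)    # any recurrent cell (GRU/LSTM) may be used
        self.subgoal_head = nn.Linear(hidden, 2)              # (dx, dy) offset from the current position
        self.progress_head = nn.Linear(hidden, 1)             # navigation progress in [0, 1]

    def forward(self, map_repr, instr_feat, h):
        # map_repr: (B, map_dim, H, W) encoded open scene map; instr_feat: (B, instr_dim) instruction feature
        pooled = map_repr.mean(dim=(2, 3))                    # simple global pooling (an assumption)
        h = self.rnn(torch.cat([pooled, instr_feat], dim=1), h)   # current state features
        return self.subgoal_head(h), torch.sigmoid(self.progress_head(h)), h

# Usage sketch: stop when the predicted progress exceeds a threshold, otherwise hand the sub-target
# point to a point-goal controller (e.g. DDPPO) that outputs forward, turn-left and turn-right actions.
policy = SubGoalPolicy()
h = torch.zeros(1, 512)
offset, progress, h = policy(torch.randn(1, 128, 240, 240), torch.randn(1, 256), h)
action = "STOP" if progress.item() > 0.9 else "go to sub-goal offset " + str(offset.tolist())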
In summary, the method of this embodiment effectively utilizes the object attribute features contained in the hidden layer features of a deep neural network and maps these features into the map to obtain object attribute level information, which represents the attributes of objects in an open scene (such as colour, contour and material). It then combines semantic information about the open scene objects and the instruction markers, using an open-vocabulary object detector to locate arbitrary objects in the open scene. The above information is combined to construct the open scene map, improving the map's ability to characterize the attributes and positions of diverse objects in an open scene. The map representation is no longer limited to a small, fixed set of object categories, and the added object attribute information helps the intelligent agent resolve object category ambiguity and accurately locate the object of interest.
The embodiment also provides a visual language navigation device based on the open scene map, which comprises:
the data acquisition module is used for acquiring visual image data of the intelligent agent in the environment; the visual image data includes an RGB image and a depth image;
the representation construction module is used for constructing an open scene map representation according to the visual image data, wherein the open scene map representation comprises an object attribute level map, an open scene object semantic map and a marker semantic level map;
and the navigation application module is used for predicting the position of the sub-target point and the navigation progress according to the constructed open scene map representation, and executing corresponding actions.
The visual language navigation device based on the open scene map can execute any combination implementation steps of the visual language navigation method based on the open scene map, and has corresponding functions and beneficial effects.
The embodiment also provides a visual language navigation device based on the open scene map, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method as shown in fig. 1.
The visual language navigation device based on the open scene map can execute any combination implementation steps of the visual language navigation method based on the open scene map, and has corresponding functions and beneficial effects.
The present application also discloses a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
The embodiment also provides a storage medium which stores instructions or programs for executing the visual language navigation method based on the open scene map, and when the instructions or programs are run, the instructions or programs can execute any combination implementation steps of the method embodiment, and the method has corresponding functions and beneficial effects.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus or device, such as a computer-based system, a processor-containing system, or another system that can fetch instructions from the instruction execution system, apparatus or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate or transport the program for use by or in connection with the instruction execution system, apparatus or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the foregoing description of the present specification, references to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples" and the like mean that a particular feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. The visual language navigation method based on the open scene map is characterized by comprising the following steps of:
acquiring visual image data of an intelligent agent in an environment; the visual image data includes an RGB image and a depth image;
constructing an open scene map representation according to the visual image data, wherein the open scene map representation comprises an object attribute level map, an open scene object semantic map and a marker semantic level map;
and according to the constructed open scene map, predicting the position and navigation progress of the sub-target point, and executing corresponding actions.
2. The method of claim 1, wherein constructing an open scene map representation from visual image data comprises:
acquiring an object attribute level map according to the RGB image and the depth image;
acquiring an open scene object semantic map according to the RGB image, the depth image and a preset open scene object class;
acquiring a marker semantic level map according to the RGB image, the depth image and a preset navigation instruction;
and respectively passing the object attribute level map, the open scene object semantic map and the marker semantic level map through a convolution layer, and after subspace connection, obtaining the open scene map representation through the convolution layer.
3. The visual language navigation method based on an open scene map according to claim 1, wherein the object attribute level map is obtained specifically by:
inputting the RGB image into a trained deep neural network, and obtaining an intermediate layer characteristic diagram of the deep neural network;
and mapping the obtained middle layer feature map according to the depth information of the depth image to obtain the object attribute level map.
4. The visual language navigation method based on an open scene map according to claim 1, wherein the open scene object semantic map is obtained specifically by:
inputting a preset open scene object category and RGB image to an object detector facing open vocabulary, and detecting to obtain an open scene object position;
and mapping the detected open scene object position according to the depth information of the depth image to obtain an open scene object semantic map.
5. The visual language navigation method based on an open scene map according to claim 1, wherein the marker semantic hierarchy map is obtained specifically by:
inputting the navigation instruction into a marker analyzer to obtain the category of the marker in the instruction;
inputting the obtained marker category to a target detector facing the open vocabulary to obtain a marker position;
and mapping according to the obtained marker position and the depth information of the depth image to obtain the marker semantic level map.
6. The visual language navigation method based on an open scene map of claim 5, wherein the marker parser is implemented using a GPT large language model and the object detector is implemented using a GLIP model.
7. The visual language navigation method based on an open scene map according to claim 1, wherein the predicting the position and navigation progress of the sub-target point according to the constructed open scene map representation, and executing the corresponding actions, comprises:
inputting the open scene map representation and the instruction into the GRU to obtain the current state characteristics of the intelligent agent;
the obtained state characteristics pass through a sub-target point predictor to predict the relative coordinate deviation of the sub-target point from the current position;
and predicting the navigation progress in the current state according to the relative coordinate deviation, and acquiring the next action of the intelligent agent according to the position of the sub-target point and the navigation progress.
8. A visual language navigation device based on an open scene map, comprising:
the data acquisition module is used for acquiring visual image data of the intelligent agent in the environment; the visual image data includes an RGB image and a depth image;
the representation construction module is used for constructing an open scene map representation according to the visual image data, wherein the open scene map representation comprises an object attribute level map, an open scene object semantic map and a marker semantic level map;
and the navigation application module is used for predicting the position of the sub-target point and the navigation progress according to the constructed open scene map representation, and executing corresponding actions.
9. A visual language navigation device based on an open scene map, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-7.
10. A computer readable storage medium, in which a processor executable program is stored, characterized in that the processor executable program is for performing the method according to any of claims 1-7 when being executed by a processor.
CN202310788171.4A 2023-06-30 2023-06-30 Visual language navigation method, device and medium based on open scene map Active CN116499471B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310788171.4A CN116499471B (en) 2023-06-30 2023-06-30 Visual language navigation method, device and medium based on open scene map

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310788171.4A CN116499471B (en) 2023-06-30 2023-06-30 Visual language navigation method, device and medium based on open scene map

Publications (2)

Publication Number Publication Date
CN116499471A true CN116499471A (en) 2023-07-28
CN116499471B CN116499471B (en) 2023-09-12

Family

ID=87325325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310788171.4A Active CN116499471B (en) 2023-06-30 2023-06-30 Visual language navigation method, device and medium based on open scene map

Country Status (1)

Country Link
CN (1) CN116499471B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874302A (en) * 2024-03-12 2024-04-12 暗物智能科技(广州)有限公司 Full-open vocabulary scene graph generation method and system based on depth fusion

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10437252B1 (en) * 2017-09-08 2019-10-08 Perceptln Shenzhen Limited High-precision multi-layer visual and semantic map for autonomous driving
US20200364554A1 (en) * 2018-02-09 2020-11-19 Baidu Usa Llc Systems and methods for deep localization and segmentation with a 3d semantic map
US20210302585A1 (en) * 2018-08-17 2021-09-30 Beijing Jingdong Shangke Information Technology Co., Ltd. Smart navigation method and system based on topological map
WO2021058090A1 (en) * 2019-09-24 2021-04-01 Toyota Motor Europe System and method for navigating a vehicle using language instructions
CN111645073A (en) * 2020-05-29 2020-09-11 武汉理工大学 Robot visual semantic navigation method, device and system
US20220198813A1 (en) * 2020-12-17 2022-06-23 Sri International System and method for efficient visual navigation
CN113312983A (en) * 2021-05-08 2021-08-27 华南理工大学 Semantic segmentation method, system, device and medium based on multi-modal data fusion
CN113670310A (en) * 2021-07-27 2021-11-19 际络科技(上海)有限公司 Visual voice navigation method, device, equipment and storage medium
CN114460943A (en) * 2022-02-10 2022-05-10 山东大学 Self-adaptive target navigation method and system for service robot
CN115311538A (en) * 2022-02-21 2022-11-08 上海应用技术大学 Intelligent agent target searching method based on scene prior
CN114384920A (en) * 2022-03-23 2022-04-22 安徽大学 Dynamic obstacle avoidance method based on real-time construction of local grid map
CN114782530A (en) * 2022-03-28 2022-07-22 杭州国辰机器人科技有限公司 Three-dimensional semantic map construction method, device, equipment and medium under indoor scene
CN114973125A (en) * 2022-05-12 2022-08-30 武汉大学 Method and system for assisting navigation in intelligent navigation scene by using knowledge graph
CN115824213A (en) * 2022-11-18 2023-03-21 天津大学 Visual language navigation method based on follower model
CN116242359A (en) * 2023-02-08 2023-06-09 华南理工大学 Visual language navigation method, device and medium based on scene fusion knowledge

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PEIHAO CHEN et al.: "Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation", arXiv, pages 1-13 *
RUNHAO ZENG et al.: "Dense Regression Network for Video Grounding", 2020 IEEE/CVF CVPR, pages 10287-10296 *

Also Published As

Publication number Publication date
CN116499471B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
Lyu et al. Robot path planning by leveraging the graph-encoded Floyd algorithm
CN106169188A (en) A kind of method for tracing object based on the search of Monte Carlo tree
CN107886120A (en) Method and apparatus for target detection tracking
CN108986138A (en) Method for tracking target and equipment
CN106802655A (en) Indoor map generation method and device
CN107784663A (en) Correlation filtering tracking and device based on depth information
CN116499471B (en) Visual language navigation method, device and medium based on open scene map
Zheng et al. Active scene understanding via online semantic reconstruction
CN111105439B (en) Synchronous positioning and mapping method using residual attention mechanism network
CN111095170B (en) Virtual reality scene, interaction method thereof and terminal equipment
Wu et al. Revisiting embodiedqa: A simple baseline and beyond
CN110465089B (en) Map exploration method, map exploration device, map exploration medium and electronic equipment based on image recognition
CN106767755A (en) Method and device for planning autonomous formula equipment operating point
CN106782030A (en) Method and device for generating the indoor map with semantic description
Ye et al. From seeing to moving: A survey on learning for visual indoor navigation (vin)
CN106782029A (en) Indoor map generation method and device
CN106814734A (en) The method and system of autonomous formula equipment are controlled using computing device
CN116109812A (en) Target detection method based on non-maximum suppression threshold optimization
CN109858402B (en) Image detection method, device, terminal and storage medium
CN115147637A (en) Real-time semantic map generation method and device based on robot
CN111080671A (en) Motion prediction method based on deep neural network and intelligent terminal
Wu et al. Vision-language navigation: a survey and taxonomy
CN104484034B (en) A kind of gesture motion primitive transition frames localization method based on gesture identification
Ehsani et al. Object manipulation via visual target localization
Dai et al. RGB‐D SLAM with moving object tracking in dynamic environments

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant