CN117073701A - Visual language navigation technical scheme based on multi-modal perception model and large language model - Google Patents

Visual language navigation technical scheme based on multi-modal perception model and large language model

Info

Publication number
CN117073701A
CN117073701A (application CN202310815734.4A)
Authority
CN
China
Prior art keywords
navigation
language
model
visual
robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310815734.4A
Other languages
Chinese (zh)
Inventor
柳书博
(Name withheld at inventor's request)
张红生
徐嘉悦
李亮辰
杜胤葑
束九禾
吕冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202310815734.4A
Publication of CN117073701A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01C - MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 - Navigation; Navigational instruments not provided for in groups G01C1/00-G01C19/00
    • G01C21/20 - Instruments for performing navigational calculations
    • G01C21/26 - Navigation specially adapted for navigation in a road network
    • G01C21/34 - Route searching; Route guidance
    • G01C21/3446 - Details of route searching algorithms, e.g. Dijkstra, A*, arc-flags, using precalculated routes
    • G01C21/36 - Input/output arrangements for on-board computers
    • G01C21/3602 - Input other than that of destination using image analysis, e.g. detection of road signs, lanes, buildings, real preceding vehicles using a camera
    • G01C21/3605 - Destination input or retrieval
    • G01C21/3608 - Destination input or retrieval using speech input, e.g. using speech recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application provides a visual language navigation method based on multi-modal and large language pre-training models, together with an embedded hardware device, and relates to the technical field of navigation. A navigation task context is first established for the large language pre-training model; a text description of the surrounding environment is then generated by a multi-modal visual-language pre-training model; finally, the large language model combines this description with the natural language instruction to generate each navigation decision, which in turn controls the behavior of the robot. The application provides an implementation for applying multi-modal visual-language pre-training models and large language pre-training models to visual language navigation, which not only improves the navigation success rate but also, owing to the characteristics of large language models, further enhances the generalization of the visual language navigation method.

Description

Visual language navigation technical scheme based on multi-modal perception model and large language model
Technical Field
The application relates to the technical field of navigation, and in particular to a visual language navigation technique based on a multi-modal perception model and a large language model, together with an embedded hardware device.
Background
The most popular and widely used navigation method is satellite navigation, that is, navigation realized by computing the shortest or optimal path on a known map with the aid of satellite positioning systems such as GPS and the BeiDou system. Satellite navigation technology is a collection of technologies from the aerospace, computing, surveying and mapping, and communications fields assembled to deliver satellite navigation services, and is a major achievement of human science and technology. It is best suited to navigation in large-scale scenes, such as the map navigation software commonly used in daily life, which helps users and vehicles find an optimal path to a destination and reports road conditions, the direction of travel, the distance to continue straight ahead, and the like in real time. However, satellite navigation struggles in small-scale scenes such as indoor scenes, surface or underground mine sites, and forests, because such scenes are characterized by limited range, three-dimensional structure and complex environmental obstacles.
Therefore, the application provides a visual language navigation technique that allows a user to issue natural language instructions describing, in detail or in brief, a target place, a target object or a path, so that the agent can autonomously find a route and navigate to the target based on its own environment perception capability, and can even interact with the environment (for example, pick up an object). Visual language navigation has many advantages in small-scale scenes: natural language interaction, i.e. the user interacts through natural language instructions, avoiding cumbersome input and operation; environment awareness, i.e. navigation in unknown environments; adaptability, i.e. a degree of autonomy that adjusts to environmental changes and user requirements; and multi-modal fusion, i.e. fusing natural language instructions with visual information to achieve more accurate and flexible navigation. Visual language navigation gathers environmental information automatically through vision and decides the route to take; it removes the need for a known global map, handles three-dimensional requirements such as going up and down stairs, and achieves high-precision navigation in such environments.
Disclosure of Invention
The application aims to provide a visual language navigation method based on multi-modal and large language pre-training models, together with an embedded device.
The first aspect of the application provides a visual language navigation method based on multi-modal and large language pre-training models, which comprises the following steps:
1. Set an initialization task instruction that outlines the navigation task the large language model needs to complete.
2. Input the initialization task instruction into a pre-trained large language model to perform task initialization.
3. Generate an image description for the current surrounding environment image by a multi-modal visual-language pre-training model.
4. Integrate the image description of the current surrounding environment image with the natural language instruction, the past path information, the optional decision categories and the large language model answer specification, according to a preset navigation instruction grammar format, to form the current navigation instruction.
5. Input the current navigation instruction into the task-initialized large language model and generate a text representation of the current navigation decision.
6. Convert the text representation of the navigation decision into corresponding state change parameters that the embodied robot can receive, thereby controlling the behavior of the embodied robot (an illustrative sketch of how these steps fit together follows).
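The following minimal Python sketch illustrates one possible way to wire the six steps above together; every callable parameter (ask_llm, describe_image, capture_image, apply_params, decode_action, build_prompt) is a hypothetical adapter around the respective model or robot interface, not something defined by the application.

    from typing import Callable, List, Optional

    ACTIONS = ["turn left", "turn right", "move forward", "move backward", "stop", "ignore"]

    def run_navigation(ask_llm: Callable[[str], str],
                       describe_image: Callable[[bytes], str],
                       capture_image: Callable[[], bytes],
                       apply_params: Callable[[dict], None],
                       decode_action: Callable[[str], Optional[dict]],
                       build_prompt: Callable[[str, str, List[str], List[str]], str],
                       nl_instruction: str,
                       init_instruction: str,
                       max_steps: int = 50) -> List[str]:
        """One navigation episode following steps 1-6 above; returns the decision history."""
        ask_llm(init_instruction)                 # steps 1-2: task initialization
        history: List[str] = []                   # past path information
        for _ in range(max_steps):
            image = capture_image()               # image captured after the previous action
            description = describe_image(image)   # step 3: image description
            prompt = build_prompt(nl_instruction, description, history, ACTIONS)  # step 4
            decision = ask_llm(prompt)            # step 5: text navigation decision
            params = decode_action(decision)      # step 6: state change parameters
            if params is None or "stop" in decision.lower():
                break                             # terminate on "stop" or an unusable decision
            apply_params(params)                  # drive the embodied robot
            history.append(decision)
        return history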
In some modified embodiments of the first aspect of the present application, the initialization navigation task instruction needs to include three sets of important information: the requirement to complete the navigation task, keywords describing the current environment (such as indoor or outdoor), and a question-answer template for the navigation instruction.
In some modified embodiments of the first aspect of the present application, the multi-modal visual-language pre-training model generates the image description in ways including, but not limited to, the following three: directly inputting the environment image and outputting an image description; interacting with the large language model, where the large language model asks the questions, the multi-modal model answers them, and after multiple iterations the large language model summarizes all answers to generate the image description; and having the multi-modal model output image descriptions for all environment images, after which the large language model summarizes them into a unified and detailed description of the surrounding environment.
In some modified embodiments of the first aspect of the present application, the preset navigation instruction grammar format is a piece of formatted text, and the natural language instruction, the environment description, the past path information and the optional decision categories are embedded after the corresponding keyword descriptors of the grammar format; the keyword descriptors include, but are not limited to: "the navigation task to be completed is", "the surrounding environment information is", "the path already traversed includes" and "the optional behavior decisions are".
In some variant embodiments of the first aspect of the present application, the navigation decision is converted from its text representation into state change parameters as follows:
different behavior categories are preset for different embodied robots, including but not limited to: "turn left", "turn right", "move forward", "move backward", "stop" and "ignore";
each behavior is encoded according to the communication protocol of the specific embodied robot to obtain the robot state change parameters corresponding to that behavior; the behavior category contained in the navigation decision of the large language model is then extracted by string matching, and the corresponding state change parameters are indexed.
In some variations of the first aspect of the present application, the large language model includes, but is not limited to, ChatGPT, LLaMA and discourse.
In some variations of the first aspect of the present application, the multi-modal visual-language model includes, but is not limited to, BLIP-2; the controlled embodied robots include, but are not limited to, virtual intelligent agents, unmanned aerial vehicles, autonomous vehicles, and other mobile robots.
A second aspect of the present application provides an embedded hardware device, comprising:
1. and a data storage module: for storing various data generated each time a navigation task is performed, including, but not limited to, received natural language instructions, ambient image description data, question-answer records that interact with a large language model.
2. And a data processing module: for running the visual language navigation method of claim 1.
3. And a data communication module: the system is used for carrying out interactive communication with the robot body, communication data comprise natural language instructions received by the robot, surrounding environment images captured after each step of action of the robot, and navigation decisions and state change parameters given by the data processing module.
In some modified embodiments of the second aspect of the present application, the data communication module supports a plurality of embodied-robot communication protocols; after being embedded it can detect the robot's communication protocol and retrieve from the data storage module the preset behavior categories and the encoding scheme corresponding to that protocol, so that the navigation decisions and state change parameters output by the data processing module cause the embodied robot to execute the corresponding behavior.
The beneficial effects of the application are as follows:
1. In terms of model effect, the core of the application is the tight integration of a large language model, a visual-language perception model and an embodied robot. By exploiting the data breadth and reasoning capability of the large language model, the visual language navigation technique of the application greatly improves generalization compared with other existing VLN techniques and addresses the poor navigation performance of existing VLN algorithms in unknown environments.
2. The application uses a pre-trained large language model and visual-language perception model, so it does not require training on large amounts of labeled data; the existing data are sufficient to fine-tune and optimize the technique as a whole, which reduces computational cost and space-time overhead and, by creatively reusing existing results, avoids the problems of difficult data acquisition and severe data scarcity in VLN tasks. The technique therefore has high practicality and strong application prospects.
3. From the perspective of technical theory, existing VLN approaches generally suffer from inaccurate semantic understanding, inaccurate visual perception and difficulty in modeling long-term dependence. The application creatively introduces a large language model and a multi-modal perception model pre-trained on huge datasets to carry out these three tasks, improving performance on these problems, providing a new line of thought for research in the VLN field, and offering a new paradigm for applying large language models to embodied robots.
4. The application has modest hardware requirements and good algorithm performance. With the large language model as its foundation, it can easily be embedded into other integrated algorithm and software systems and deployed as a function in existing unmanned intelligent machines, such as autonomous cars and smart home robots, expanding the capabilities of the embodied robot.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flow chart of a visual language navigation method based on a multi-modal and large language pre-training model according to some embodiments of the present application.
Fig. 2 is a flow chart corresponding to an environment image description generating method in a visual language navigation method based on a multi-modal and large language pre-training model according to some embodiments of the present application.
Fig. 3 is a schematic diagram of an embedded hardware device according to some embodiments of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application pertains.
In addition, the terms "first" and "second" etc. are used to distinguish different objects and are not used to describe a particular order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
To facilitate understanding of the following embodiments of the present application, the relevant prior art is first described as follows:
at present, visual language navigation has many algorithm models aimed at language instructions with different characteristics, such as level of detail, whether there is human-computer interaction, and whether there is environment interaction, but these models share the following five problems:
1. Data scarcity. Due to the complexity of the VLN task and the difficulty of data acquisition, the datasets currently available are very limited, which poses a significant challenge for model training and evaluation.
2. Poor model generalization. Because of the limitations of the datasets, existing VLN models can only navigate in known environments and perform poorly in unknown environments.
3. Inaccurate semantic understanding. Natural language instructions are often ambiguous and vague, and existing VLN models have clear limitations in semantic understanding.
4. Inaccurate visual perception. Because of errors and noise in visual perception itself, existing VLN models require further improvement in this respect.
5. Difficulty modeling long-term dependence. Because VLN tasks typically involve long-term planning and decision-making, existing VLN models also have difficulty modeling long-term dependence.
In view of the foregoing, embodiments of the present application provide a visual language navigation method based on a multi-modal and large language pre-training model and an embedded hardware device, which are described below with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of a visual language navigation method based on a multi-modal and large language pre-training model according to some embodiments of the present application is shown, and as shown, the visual language navigation method based on the multi-modal and large language pre-training model may include the following steps:
step S101: initializing a navigation context.
Before the navigation task is executed, an initialization task instruction is set and input into the large language model, giving it the prior information that a navigation task needs to be processed. The initialization task instruction must fully explain the purpose and requirements of the task, that is, construct the navigation context: it places the large language model in an autonomous route-finding setting, informs it of the environment it is in (such as a home environment or an urban environment) in keyword form so that the model's reasoning better matches the technical requirements, and also provides a corresponding question-answer template to constrain the output format of the large language model.
The embodiment of the application does not limit the specific content of the initialization task instruction and provides a content example for reference: "Now, please imagine that you are a route guide placed in a previously unseen [a priori environment keyword], and no map is available. You need to provide route guidance to people who enter this [a priori environment keyword] to find a particular item or go to a particular location. For example, when someone asks: [path decision question], you should choose one of these possibilities: [all possible behavior decisions] as your answer, neither adding nor removing any text. Do you understand your job?"
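For illustration, a minimal Python sketch of assembling such an initialization task instruction is given below; the template wording and the function name build_init_instruction are assumptions that merely paraphrase the example above.

    def build_init_instruction(env_keyword: str, actions: list) -> str:
        """Assemble an initialization task instruction of the kind exemplified above.
        env_keyword: a priori environment keyword, e.g. "home environment";
        actions: all possible behavior decisions."""
        return (
            "Now, please imagine that you are a route guide placed in a previously "
            f"unseen {env_keyword}, and no map is available. You need to provide route "
            f"guidance to people who enter this {env_keyword} to find a particular item "
            "or go to a particular location. When someone asks a path decision question, "
            f"you should choose one of these possibilities: {', '.join(actions)} "
            "as your answer, neither adding nor removing any text. "
            "Do you understand your job?"
        )

    # Example usage (illustrative values):
    # prompt = build_init_instruction("home environment",
    #                                 ["turn left", "turn right", "move forward", "stop"])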
Step S102: an image description is generated for the current ambient image by a multimodal vision-language pre-training model.
In some embodiments, in this step, the ambient image may be directly input into the multimodal visual-language pre-training model to generate the image description.
In other embodiments, in this step, the descriptions corresponding to all the environmental images may be output by the multimodal visual-language pre-training model, and then summarized by the large language model based on these descriptions, so as to generate a unified and detailed description of the surrounding environment.
In other embodiments, in this step, the multi-modal visual-language pre-training model may interact with the large language model: the large language model asks the questions, the multi-modal visual-language pre-training model answers them based on the environment image, and after multiple iterations all answers are summarized by the large language model.
The third of the three image description generation methods provided above is described in the embodiment of the present application with reference to the drawings; the flow is as follows:
As shown in fig. 2, the image description generating method includes:
step S201: providing an initial task description, fully explaining the targets and requirements of the task, informing the large language model that the multi-mode vision-language model has input an environment image, and sending a question to the multi-mode vision-language model to obtain information contained in the image. In this document, the task specification is used to explain goals and requirements of the task.
The embodiment of the application is not limited to the specific content of the initialization task instruction, and provides a content example as a reference: there is now an ambient image that requires you to ask questions based on the image description information contained in my answers to get richer image information, you first ask questions: what information is contained in the image? An initial image description is obtained and no text can be added. After getting my answer, the next question is issued based on my answer.
Step S202: multiple question-answer iterations are performed, and each question of the large language model is based on all past answers of the multi-modal visual-language model.
In some embodiments, in this step, the question-answer record may be stored in a variable-length dictionary array, and the multi-modal visual-language model answers the previous question of the large language model according to a certain grammar format, for example: "My answer is: [answer of the multi-modal visual-language model]. You may ask the next question based on the past question-answer records [question-answer records].", so as to inform the large language model of the answer result so that it can generate the next question. The embodiment of the application does not limit the specific content of this grammar format.
Step S203: the large language model synthesizes the chat record and summarizes it into a more accurate, detailed and rich image description.
In some embodiments, in this step, a fixed limit may be placed on the number of question-answer iterations, for example ten; once the limit is reached, a certain grammar format is used, for example: "The questioning stage is over. All question-answer records are as follows: [question-answer records]. You need to summarize this record into a more accurate, detailed and rich image description; note that no text unrelated to the summary result may be output." The embodiment of the application does not limit the specific content of this grammar format.
It should be noted that the above steps S201-S203 correspond to the flowchart of fig. 2 of the accompanying drawings, not to fig. 1.
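A minimal Python sketch of this question-answer loop (steps S201-S203) is shown below; the callables ask_llm and ask_vlm, the round limit and the prompt wording are all assumptions used only to illustrate the variable-length question-answer record and the final summarization.

    from typing import Callable, Dict, List

    def describe_by_dialogue(ask_llm: Callable[[str], str],
                             ask_vlm: Callable[[str], str],
                             rounds: int = 10) -> str:
        """Steps S201-S203: the large language model asks, the multi-modal model answers
        from the environment image, and the full record is summarized at the end."""
        records: List[Dict[str, str]] = []       # variable-length question-answer record
        question = ask_llm(
            "There is now an environment image. Ask questions based on the image "
            "description information in my answers to obtain richer image information. "
            "First ask: what information is contained in the image?"
        )
        for _ in range(rounds):                  # fixed iteration limit, e.g. ten rounds
            answer = ask_vlm(question)           # the multi-modal model answers from the image
            records.append({"question": question, "answer": answer})
            question = ask_llm(
                f"My answer is: {answer}. You may ask the next question based on the "
                f"past question-answer records {records}."
            )
        return ask_llm(                          # final summarization into one description
            f"The questioning stage is over. All question-answer records are: {records}. "
            "Summarize this record into a more accurate, detailed and rich image "
            "description, without outputting any text unrelated to the summary."
        )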
Step S103: integrate the image description of the current surrounding environment image with the natural language instruction, the past path information, the optional decision categories and the large language model answer specification, according to the preset navigation instruction grammar format, to form the current navigation instruction.
In some embodiments, in this step, the preset navigation instruction grammar is a piece of formatted text, and the natural language instruction, the environment description, the past path information and the optional decision categories are embedded after the corresponding keywords of the grammar format. The embodiment of the application does not limit the types or number of the keywords, which include but are not limited to: "the navigation task to be completed is", "the surrounding environment information is", "the path already traversed includes" and "the optional behavior decisions are".
The embodiment of the application does not limit the specific content of the preset navigation instruction grammar and provides a content example: "Now another person comes here. The navigation task he needs to complete is '[natural language instruction]'. He has already taken several steps; the path he has traversed includes [historical navigation path]. He says that his surrounding environment information is: [environment description]. The optional behavior decisions are: [optional decision categories]. Please judge his progress from the past path, select one of the optional behavior decisions according to the current surrounding environment information, and answer with its serial number; no other text may be output."
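A minimal Python sketch of filling this grammar format is given below; the function name build_navigation_prompt and the exact wording are assumptions that simply follow the example text.

    def build_navigation_prompt(nl_instruction: str, environment: str,
                                history: list, actions: list) -> str:
        """Embed the natural language instruction, environment description, past path
        and optional decisions into the preset navigation instruction grammar format."""
        numbered = "; ".join(f"{i}: {a}" for i, a in enumerate(actions))
        past = " -> ".join(history) if history else "none"
        return (
            "Now another person comes here. The navigation task he needs to complete is "
            f"'{nl_instruction}'. The path he has already traversed includes: {past}. "
            f"He says that his surrounding environment information is: {environment}. "
            f"The optional behavior decisions are: {numbered}. Please judge his progress "
            "from the past path, select one of the optional behavior decisions according "
            "to the current surrounding environment information, and answer with its "
            "serial number only."
        )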
Step S104: input the constructed current navigation instruction into the large language model, obtain the decision output of the large language model, convert it into state change parameters that the corresponding embodied robot can receive, and control the behavior of the embodied robot.
In this step, different embodied robots have different behavior categories, such as "turn left", "turn right", "move forward", "move backward", "stop" and "ignore". Each behavior is encoded according to the communication protocol of the specific embodied robot to obtain the robot state change parameters corresponding to that behavior. The behavior category contained in the navigation decision of the large language model is then extracted by string matching, and the corresponding state change parameters are indexed.
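The string-matching conversion can be sketched as follows in Python; the parameter table (a differential-drive base with linear and angular velocities) and the function name decode_action are assumptions, since the actual state change parameters depend on the communication protocol of the specific embodied robot.

    # Hypothetical encoding table: each behavior category maps to the state change
    # parameters of one concrete embodied robot (values are illustrative only).
    BEHAVIOR_PARAMS = {
        "turn left":     {"linear": 0.0,  "angular":  0.5},
        "turn right":    {"linear": 0.0,  "angular": -0.5},
        "move forward":  {"linear": 0.3,  "angular":  0.0},
        "move backward": {"linear": -0.3, "angular":  0.0},
        "stop":          {"linear": 0.0,  "angular":  0.0},
        "ignore":        None,  # no state change
    }

    def decode_action(decision_text: str):
        """Extract the behavior category contained in the large language model's
        decision by string matching and index the corresponding state change parameters."""
        text = decision_text.lower()
        for behavior, params in BEHAVIOR_PARAMS.items():
            if behavior in text:
                return params
        return None  # unrecognized decision: treat it as "ignore"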
The embodiment of the application does not limit the type of robot, which may be a virtual intelligent agent, an unmanned aerial vehicle, an autonomous vehicle or another mobile robot.
The visual language navigation method based on multi-modal and large language pre-training models provided by the embodiment of the application has the following advantages. First, the large language model, the multi-modal visual-language model and the embodied robot are tightly integrated; by exploiting the data breadth and reasoning capability of the large language model, the generalization of the visual language navigation method is greatly improved, and its navigation performance in unknown environments is effectively enhanced. Second, the method uses pre-trained large language and multi-modal visual-language models and does not require training on large amounts of labeled data; the existing data are sufficient to fine-tune and optimize the technique as a whole, which reduces computational cost and space-time overhead and, by creatively reusing existing results, avoids the problems of difficult data acquisition and severe data scarcity in visual language navigation tasks. Third, when the large language model is asked for a navigation decision via a text instruction, the past path information is explicitly integrated into the instruction, which provides a solution to the difficulty of modeling the long-term dependence of visual language navigation tasks. Finally, the method has modest hardware requirements and good algorithm performance; with the large language model as its foundation, it can easily be embedded into other integrated algorithm and software systems and deployed as a function in existing unmanned intelligent machines, such as autonomous cars and smart home robots, expanding the capabilities of the embodied robot.
In the above embodiments, a visual language navigation method based on multi-modal and large language pre-training models is provided; correspondingly, the application also provides an embedded hardware device. The embedded hardware device provided by the embodiment of the application can implement the above visual language navigation method and can be realized by software, hardware or a combination of the two. For example, the embedded hardware device may comprise integrated or separate functional modules or units to support the implementation of the above method. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively brief, and reference may be made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
Referring to fig. 3, a schematic diagram of an embedded hardware device according to some embodiments of the application is shown. As shown in fig. 3, the embedded hardware device may include:
the data storage module 301: for storing various data generated each time a navigation task is performed, including, but not limited to, received natural language instructions, ambient image description data, question-answer records that interact with a large language model.
The data processing module 302: for running the visual language navigation method of claim 1.
The data communication module 303: the system is used for carrying out interactive communication with the robot body, communication data comprise natural language instructions received by the robot, surrounding environment images captured after each step of action of the robot, and navigation decisions and state change parameters given by the data processing module.
In some modified implementations of the embodiments of the present application, the data communication module supports a plurality of communication protocols of the robot with body, can detect the communication protocol of the robot after embedding, and invokes a preset behavior type and a coding mode corresponding to the communication protocol from the data storage module, so that the data processing module outputs navigation decision and state change parameters, and executes corresponding behaviors for the robot with body.
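As a sketch only, the protocol-dependent lookup performed by the data communication module might look like the following Python; the protocol names, byte encodings and function name encode_for_robot are illustrative assumptions, not part of the application.

    # Illustrative table of behavior encodings per communication protocol, as it might
    # be stored in the data storage module (all names and byte values are assumptions).
    ENCODINGS = {
        "serial_v1": {"turn left": b"\x01", "turn right": b"\x02",
                      "move forward": b"\x03", "stop": b"\x00"},
        "can_bus":   {"turn left": b"\x10\x01", "turn right": b"\x10\x02",
                      "move forward": b"\x10\x03", "stop": b"\x10\x00"},
    }

    def encode_for_robot(protocol: str, behavior: str) -> bytes:
        """Retrieve the encoding of a behavior under the communication protocol
        detected for the embodied robot."""
        try:
            return ENCODINGS[protocol][behavior]
        except KeyError as exc:
            raise ValueError(
                f"behavior '{behavior}' is not defined for protocol '{protocol}'"
            ) from exc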
The embedded hardware device provided by the embodiment of the application is general-purpose and can be readily configured into various embodied robots, such as autonomous vehicles, unmanned aerial vehicles and unmanned ships.
It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams, each arrow may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block or arrow may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In particular implementations, program code for carrying out embodiments of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Python and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
In the embodiments provided in the present application, it should be understood that, for the disclosed apparatus and method, it may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application, and are intended to be included within the scope of the appended claims and description.

Claims (10)

1. A visual language navigation method based on multi-modal and large language pre-training models, characterized by comprising the following steps:
a. set an initialization task instruction that outlines the navigation task the large language model needs to complete;
b. input the initialization task instruction into a pre-trained large language model to perform task initialization;
c. generate an image description for the current surrounding environment image by a multi-modal visual-language pre-training model;
d. integrate the image description of the current surrounding environment image with the natural language instruction, the past path information, the optional decision categories and the large language model answer specification, according to a preset navigation instruction grammar format, to form the current navigation instruction;
e. input the current navigation instruction into the task-initialized large language model and generate a text representation of the current navigation decision;
f. convert the text representation of the navigation decision into corresponding state change parameters that the embodied robot can receive, to control the behavior of the embodied robot.
2. The visual language navigation method of claim 1, wherein the initialization navigation task instruction needs to include three sets of important information: the requirement to complete the navigation task, keywords describing the current environment (such as indoor or outdoor), and a question-answer template for the navigation instruction.
3. The visual language navigation method of claim 1, wherein the ways in which the multi-modal visual-language pre-training model generates the image description include, but are not limited to: 1) directly inputting the environment image and outputting an image description; 2) interacting with the large language model, where the large language model asks the questions, the multi-modal model answers them, and after multiple iterations the large language model summarizes all answers to generate the image description; 3) having the multi-modal model output image descriptions for all environment images, after which the large language model summarizes them into a unified and detailed description of the surrounding environment.
4. The visual language navigation method of claim 1, wherein the preset navigation instruction grammar format is a piece of formatted text, and the natural language instruction, the environment description, the past path information and the optional decision categories are embedded after the corresponding keyword descriptors of the grammar format; the keyword descriptors include, but are not limited to: "the navigation task to be completed is", "the surrounding environment information is", "the path already traversed includes" and "the optional behavior decisions are".
5. The visual language navigation method of claim 1, wherein the navigation decision is converted from its text representation into state change parameters as follows:
different behavior categories are preset for different embodied robots, including but not limited to: "turn left", "turn right", "move forward", "move backward", "stop" and "ignore";
each behavior is encoded according to the communication protocol of the specific embodied robot to obtain the robot state change parameters corresponding to that behavior; the behavior category contained in the navigation decision of the large language model is extracted by string matching, and the corresponding state change parameters are indexed.
6. The visual language navigation method of claim 1, wherein the large language model includes, but is not limited to, ChatGPT, LLaMA and discourse.
7. The visual language navigation method of claim 1, wherein the multi-modal visual-language model includes, but is not limited to, BLIP-2.
8. The visual language navigation method of claim 1, wherein the controlled embodied robots include, but are not limited to, virtual intelligent agents, unmanned aerial vehicles, autonomous vehicles, and other mobile robots.
9. An embedded hardware device, comprising:
and a data storage module: for storing various data generated each time a navigation task is performed, including, but not limited to, received natural language instructions, ambient image description data, question-answer records that interact with a large language model.
And a data processing module: for running the visual language navigation method of claim 1.
And a data communication module: the system is used for carrying out interactive communication with the robot body, communication data comprise natural language instructions received by the robot, surrounding environment images captured after each step of action of the robot, and navigation decisions and state change parameters given by the data processing module.
10. The embedded hardware device of claim 9, wherein the data communication module supports a plurality of embodied-robot communication protocols, can detect the robot's communication protocol after being embedded, and retrieves from the data storage module the preset behavior categories and the encoding scheme corresponding to that protocol, so that the navigation decisions and state change parameters output by the data processing module cause the embodied robot to execute the corresponding behavior.
CN202310815734.4A 2023-07-05 2023-07-05 Visual language navigation technical scheme based on multi-modal perception model and large language model Pending CN117073701A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310815734.4A CN117073701A (en) 2023-07-05 2023-07-05 Visual language navigation technical scheme based on multi-modal perception model and large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310815734.4A CN117073701A (en) 2023-07-05 2023-07-05 Visual language navigation technical scheme based on multi-modal perception model and large language model

Publications (1)

Publication Number Publication Date
CN117073701A true CN117073701A (en) 2023-11-17

Family

ID=88712275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310815734.4A Pending CN117073701A (en) 2023-07-05 2023-07-05 Visual language navigation technical scheme based on multi-modal perception model and large language model

Country Status (1)

Country Link
CN (1) CN117073701A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117506940A (en) * 2024-01-04 2024-02-06 中国科学院自动化研究所 Robot track language description generation method, device and readable storage medium
CN117506940B (en) * 2024-01-04 2024-04-09 中国科学院自动化研究所 Robot track language description generation method, device and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication