CN117073701A - Visual language navigation technical scheme based on multi-modal perception model and large language model - Google Patents

Visual language navigation technical scheme based on multi-modal perception model and large language model

Info

Publication number
CN117073701A
CN117073701A (application CN202310815734.4A)
Authority
CN
China
Prior art keywords
navigation
language
model
visual
robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310815734.4A
Other languages
Chinese (zh)
Inventor
柳书博
(Name withheld at inventor's request)
张红生
徐嘉悦
李亮辰
杜胤葑
束九禾
吕冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202310815734.4A
Publication of CN117073701A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01C - MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 - Navigation; Navigational instruments not provided for in groups G01C1/00-G01C19/00
    • G01C21/20 - Instruments for performing navigational calculations
    • G01C21/26 - Navigation specially adapted for navigation in a road network
    • G01C21/34 - Route searching; Route guidance
    • G01C21/3446 - Details of route searching algorithms, e.g. Dijkstra, A*, arc-flags, using precalculated routes
    • G01C21/36 - Input/output arrangements for on-board computers
    • G01C21/3602 - Input other than that of destination using image analysis, e.g. detection of road signs, lanes, buildings, real preceding vehicles using a camera
    • G01C21/3605 - Destination input or retrieval
    • G01C21/3608 - Destination input or retrieval using speech input, e.g. using speech recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application provides a visual language navigation method based on multi-modal and large language pre-training models, together with an embedded hardware device, and relates to the technical field of navigation. A navigation task context is first established for the large language pre-training model; a text description of the surrounding environment is then generated by a multi-modal visual-language pre-training model; finally, the large language model combines this description with the natural language instruction to generate each navigation decision, which in turn controls the behavior of the robot. The application provides an implementation for applying multi-modal visual-language pre-training models and large language pre-training models to visual language navigation, which not only improves the navigation success rate but also, owing to the characteristics of large language models, further enhances the generalization of the visual language navigation method.

Description

Visual language navigation technical scheme based on multi-modal perception model and large language model
Technical Field
The application relates to the technical field of navigation, and in particular to a visual language navigation technique based on a multi-modal perception model and a large language model, together with an embedded hardware device.
Background
The most popular and widely used navigation method is satellite navigation, that is, navigation realized by computing the shortest or optimal path on a known map with the aid of satellite positioning systems such as GPS and the BeiDou system. Satellite navigation technology is a collection of technologies from the aerospace, computing, surveying and mapping, and communications fields assembled to deliver satellite navigation services, and is a major achievement of human science and technology. It is best suited to navigation in large-scale scenes, such as the map navigation software commonly used in daily life, which helps users and vehicles find an optimal path to a destination and reports road conditions, the direction of travel, the distance to continue straight ahead, and the like in real time. However, satellite navigation struggles in small-scale scenes such as indoor scenes, surface or underground mine sites, and forests, because such scenes are characterized by limited range, three-dimensional structure and complex environmental obstacles.
Therefore, the application provides a visual language navigation technique that allows a user to issue natural language instructions describing, in detail or in brief, a target place, a target object or a path, so that the agent can autonomously find a route and navigate to the target based on its own environment perception capability, and can even interact with the environment (for example, pick up an object). Visual language navigation has many advantages in small-scale scenes: natural language interaction, i.e. the user interacts through natural language instructions, avoiding cumbersome input and operation; environment awareness, i.e. navigation in unknown environments; adaptability, i.e. a degree of autonomy that adjusts to environmental changes and user requirements; and multi-modal fusion, i.e. fusing natural language instructions with visual information to achieve more accurate and flexible navigation. Visual language navigation gathers environmental information automatically through vision and decides the route to take; it removes the need for a known global map, handles three-dimensional requirements such as going up and down stairs, and achieves high-precision navigation in such environments.
Disclosure of Invention
The application aims to provide a visual language navigation method based on multi-modal and large language pre-training models, together with an embedded device.
The first aspect of the application provides a visual language navigation method based on multi-modal and large language pre-training models, which comprises the following steps:
1. Set an initialization task instruction that outlines the navigation task the large language model needs to complete.
2. Input the initialization task instruction into a pre-trained large language model to perform task initialization.
3. Generate an image description for the current surrounding environment image by a multi-modal visual-language pre-training model.
4. Integrate the image description of the current surrounding environment image with the natural language instruction, the past path information, the optional decision categories and the large language model answer specification, according to a preset navigation instruction grammar format, to form the current navigation instruction.
5. Input the current navigation instruction into the task-initialized large language model and generate a text representation of the current navigation decision.
6. Convert the text representation of the navigation decision into corresponding state change parameters that the embodied robot can receive, thereby controlling the behavior of the embodied robot (an illustrative sketch of how these steps fit together follows).
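The following minimal Python sketch illustrates one possible way to wire the six steps above together; every callable parameter (ask_llm, describe_image, capture_image, apply_params, decode_action, build_prompt) is a hypothetical adapter around the respective model or robot interface, not something defined by the application.

    from typing import Callable, List, Optional

    ACTIONS = ["turn left", "turn right", "move forward", "move backward", "stop", "ignore"]

    def run_navigation(ask_llm: Callable[[str], str],
                       describe_image: Callable[[bytes], str],
                       capture_image: Callable[[], bytes],
                       apply_params: Callable[[dict], None],
                       decode_action: Callable[[str], Optional[dict]],
                       build_prompt: Callable[[str, str, List[str], List[str]], str],
                       nl_instruction: str,
                       init_instruction: str,
                       max_steps: int = 50) -> List[str]:
        """One navigation episode following steps 1-6 above; returns the decision history."""
        ask_llm(init_instruction)                 # steps 1-2: task initialization
        history: List[str] = []                   # past path information
        for _ in range(max_steps):
            image = capture_image()               # image captured after the previous action
            description = describe_image(image)   # step 3: image description
            prompt = build_prompt(nl_instruction, description, history, ACTIONS)  # step 4
            decision = ask_llm(prompt)            # step 5: text navigation decision
            params = decode_action(decision)      # step 6: state change parameters
            if params is None or "stop" in decision.lower():
                break                             # terminate on "stop" or an unusable decision
            apply_params(params)                  # drive the embodied robot
            history.append(decision)
        return history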
In some modified embodiments of the first aspect of the present application, the initialization navigation task instruction needs to include three sets of important information: the requirement to complete the navigation task, keywords describing the current environment (such as indoor or outdoor), and a question-answer template for the navigation instruction.
In some modified embodiments of the first aspect of the present application, the multi-modal visual-language pre-training model generates the image description in ways including, but not limited to, the following three: directly inputting the environment image and outputting an image description; interacting with the large language model, where the large language model asks the questions, the multi-modal model answers them, and after multiple iterations the large language model summarizes all answers to generate the image description; and having the multi-modal model output image descriptions for all environment images, after which the large language model summarizes them into a unified and detailed description of the surrounding environment.
In some modified embodiments of the first aspect of the present application, the preset navigation instruction grammar format is a piece of formatted text, and the natural language instruction, the environment description, the past path information and the optional decision categories are embedded after the corresponding keyword descriptors of the grammar format; the keyword descriptors include, but are not limited to: "the navigation task to be completed is", "the surrounding environment information is", "the path already traversed includes" and "the optional behavior decisions are".
In some variant embodiments of the first aspect of the present application, the navigation decision is converted from its text representation into state change parameters as follows:
different behavior categories are preset for different embodied robots, including but not limited to: "turn left", "turn right", "move forward", "move backward", "stop" and "ignore";
each behavior is encoded according to the communication protocol of the specific embodied robot to obtain the robot state change parameters corresponding to that behavior; the behavior category contained in the navigation decision of the large language model is then extracted by string matching, and the corresponding state change parameters are indexed.
In some variations of the first aspect of the present application, the large language model includes, but is not limited to, ChatGPT, LLaMA and discourse.
In some variations of the first aspect of the present application, the multi-modal visual-language model includes, but is not limited to, BLIP-2; the controlled embodied robots include, but are not limited to, virtual intelligent agents, unmanned aerial vehicles, autonomous vehicles, and other mobile robots.
A second aspect of the present application provides an embedded hardware device, comprising:
1. and a data storage module: for storing various data generated each time a navigation task is performed, including, but not limited to, received natural language instructions, ambient image description data, question-answer records that interact with a large language model.
2. And a data processing module: for running the visual language navigation method of claim 1.
3. And a data communication module: the system is used for carrying out interactive communication with the robot body, communication data comprise natural language instructions received by the robot, surrounding environment images captured after each step of action of the robot, and navigation decisions and state change parameters given by the data processing module.
In some modified embodiments of the second aspect of the present application, the data communication module supports a plurality of embodied-robot communication protocols; after being embedded it can detect the robot's communication protocol and retrieve from the data storage module the preset behavior categories and the encoding scheme corresponding to that protocol, so that the navigation decisions and state change parameters output by the data processing module cause the embodied robot to execute the corresponding behavior.
The beneficial effects of the application are as follows:
1. In terms of model effect, the core of the application is the tight integration of a large language model, a visual-language perception model and an embodied robot. By exploiting the data breadth and reasoning capability of the large language model, the visual language navigation technique of the application greatly improves generalization compared with other existing VLN techniques and addresses the poor navigation performance of existing VLN algorithms in unknown environments.
2. The application uses a pre-trained large language model and visual-language perception model, so it does not require training on large amounts of labeled data; the existing data are sufficient to fine-tune and optimize the technique as a whole, which reduces computational cost and space-time overhead and, by creatively reusing existing results, avoids the problems of difficult data acquisition and severe data scarcity in VLN tasks. The technique therefore has high practicality and strong application prospects.
3. From the perspective of technical theory, existing VLN approaches generally suffer from inaccurate semantic understanding, inaccurate visual perception and difficulty in modeling long-term dependence. The application creatively introduces a large language model and a multi-modal perception model pre-trained on huge datasets to carry out these three tasks, improving performance on these problems, providing a new line of thought for research in the VLN field, and offering a new paradigm for applying large language models to embodied robots.
4. The application has modest hardware requirements and good algorithm performance. With the large language model as its foundation, it can easily be embedded into other integrated algorithm and software systems and deployed as a function in existing unmanned intelligent machines, such as autonomous cars and smart home robots, expanding the capabilities of the embodied robot.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flow chart of a visual language navigation method based on a multi-modal and large language pre-training model according to some embodiments of the present application.
Fig. 2 is a flow chart corresponding to an environment image description generating method in a visual language navigation method based on a multi-modal and large language pre-training model according to some embodiments of the present application.
Fig. 3 is a schematic diagram of an embedded hardware device according to some embodiments of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application pertains.
In addition, the terms "first" and "second" etc. are used to distinguish different objects and are not used to describe a particular order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
To facilitate understanding of the following embodiments of the present application, the relevant prior art is first described as follows:
at present, visual language navigation has many algorithm models aimed at language instructions with different characteristics, such as level of detail, whether there is human-computer interaction, and whether there is environment interaction, but these models share the following five problems:
1. Data scarcity. Due to the complexity of the VLN task and the difficulty of data acquisition, the datasets currently available are very limited, which poses a significant challenge for model training and evaluation.
2. Poor model generalization. Because of the limitations of the datasets, existing VLN models can only navigate in known environments and perform poorly in unknown environments.
3. Inaccurate semantic understanding. Natural language instructions are often ambiguous and vague, and existing VLN models have clear limitations in semantic understanding.
4. Inaccurate visual perception. Because of errors and noise in visual perception itself, existing VLN models require further improvement in this respect.
5. Difficulty modeling long-term dependence. Because VLN tasks typically involve long-term planning and decision-making, existing VLN models also have difficulty modeling long-term dependence.
In view of the foregoing, embodiments of the present application provide a visual language navigation method based on a multi-modal and large language pre-training model and an embedded hardware device, which are described below with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of a visual language navigation method based on a multi-modal and large language pre-training model according to some embodiments of the present application is shown, and as shown, the visual language navigation method based on the multi-modal and large language pre-training model may include the following steps:
step S101: initializing a navigation context.
Before the navigation task is executed, an initialization task instruction is set and input into the large language model, giving it the prior information that a navigation task needs to be processed. The initialization task instruction must fully explain the purpose and requirements of the task, that is, construct the navigation context: it places the large language model in an autonomous route-finding setting, informs it of the environment it is in (such as a home environment or an urban environment) in keyword form so that the model's reasoning better matches the technical requirements, and also provides a corresponding question-answer template to constrain the output format of the large language model.
The embodiment of the application does not limit the specific content of the initialization task instruction and provides a content example for reference: "Now, please imagine that you are a route guide placed in a previously unseen [a priori environment keyword], and no map is available. You need to provide route guidance to people who enter this [a priori environment keyword] to find a particular item or go to a particular location. For example, when someone asks: [path decision question], you should choose one of these possibilities: [all possible behavior decisions] as your answer, neither adding nor removing any text. Do you understand your job?"
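For illustration, a minimal Python sketch of assembling such an initialization task instruction is given below; the template wording and the function name build_init_instruction are assumptions that merely paraphrase the example above.

    def build_init_instruction(env_keyword: str, actions: list) -> str:
        """Assemble an initialization task instruction of the kind exemplified above.
        env_keyword: a priori environment keyword, e.g. "home environment";
        actions: all possible behavior decisions."""
        return (
            "Now, please imagine that you are a route guide placed in a previously "
            f"unseen {env_keyword}, and no map is available. You need to provide route "
            f"guidance to people who enter this {env_keyword} to find a particular item "
            "or go to a particular location. When someone asks a path decision question, "
            f"you should choose one of these possibilities: {', '.join(actions)} "
            "as your answer, neither adding nor removing any text. "
            "Do you understand your job?"
        )

    # Example usage (illustrative values):
    # prompt = build_init_instruction("home environment",
    #                                 ["turn left", "turn right", "move forward", "stop"])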
Step S102: an image description is generated for the current ambient image by a multimodal vision-language pre-training model.
In some embodiments, in this step, the ambient image may be directly input into the multimodal visual-language pre-training model to generate the image description.
In other embodiments, in this step, the descriptions corresponding to all the environmental images may be output by the multimodal visual-language pre-training model, and then summarized by the large language model based on these descriptions, so as to generate a unified and detailed description of the surrounding environment.
In other embodiments, in this step, the multi-modal visual-language pre-training model may interact with the large language model: the large language model asks the questions, the multi-modal visual-language pre-training model answers them based on the environment image, and after multiple iterations all answers are summarized by the large language model.
The third of the three image description generation methods provided above is described in the embodiment of the present application with reference to the drawings; the flow is as follows:
As shown in fig. 2, the image description generating method includes:
step S201: providing an initial task description, fully explaining the targets and requirements of the task, informing the large language model that the multi-mode vision-language model has input an environment image, and sending a question to the multi-mode vision-language model to obtain information contained in the image. In this document, the task specification is used to explain goals and requirements of the task.
The embodiment of the application is not limited to the specific content of the initialization task instruction, and provides a content example as a reference: there is now an ambient image that requires you to ask questions based on the image description information contained in my answers to get richer image information, you first ask questions: what information is contained in the image? An initial image description is obtained and no text can be added. After getting my answer, the next question is issued based on my answer.
Step S202: multiple question-answer iterations are performed, and each question of the large language model is based on all past answers of the multi-modal visual-language model.
In some embodiments, in this step, the question-answer record may be stored in a variable-length dictionary array, and the multi-modal visual-language model answers the previous question of the large language model according to a certain grammar format, for example: "My answer is: [answer of the multi-modal visual-language model]. You may ask the next question based on the past question-answer records [question-answer records].", so as to inform the large language model of the answer result so that it can generate the next question. The embodiment of the application does not limit the specific content of this grammar format.
Step S203: the large language model synthesizes the chat record and summarizes it into a more accurate, detailed and rich image description.
In some embodiments, in this step, a fixed limit may be placed on the number of question-answer iterations, for example ten; once the limit is reached, a certain grammar format is used, for example: "The questioning stage is over. All question-answer records are as follows: [question-answer records]. You need to summarize this record into a more accurate, detailed and rich image description; note that no text unrelated to the summary result may be output." The embodiment of the application does not limit the specific content of this grammar format.
It should be noted that the above steps S201-S203 correspond to the flowchart of fig. 2 of the accompanying drawings, not to fig. 1.
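A minimal Python sketch of this question-answer loop (steps S201-S203) is shown below; the callables ask_llm and ask_vlm, the round limit and the prompt wording are all assumptions used only to illustrate the variable-length question-answer record and the final summarization.

    from typing import Callable, Dict, List

    def describe_by_dialogue(ask_llm: Callable[[str], str],
                             ask_vlm: Callable[[str], str],
                             rounds: int = 10) -> str:
        """Steps S201-S203: the large language model asks, the multi-modal model answers
        from the environment image, and the full record is summarized at the end."""
        records: List[Dict[str, str]] = []       # variable-length question-answer record
        question = ask_llm(
            "There is now an environment image. Ask questions based on the image "
            "description information in my answers to obtain richer image information. "
            "First ask: what information is contained in the image?"
        )
        for _ in range(rounds):                  # fixed iteration limit, e.g. ten rounds
            answer = ask_vlm(question)           # the multi-modal model answers from the image
            records.append({"question": question, "answer": answer})
            question = ask_llm(
                f"My answer is: {answer}. You may ask the next question based on the "
                f"past question-answer records {records}."
            )
        return ask_llm(                          # final summarization into one description
            f"The questioning stage is over. All question-answer records are: {records}. "
            "Summarize this record into a more accurate, detailed and rich image "
            "description, without outputting any text unrelated to the summary."
        )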
Step S103: integrate the image description of the current surrounding environment image with the natural language instruction, the past path information, the optional decision categories and the large language model answer specification, according to the preset navigation instruction grammar format, to form the current navigation instruction.
In some embodiments, in this step, the preset navigation instruction grammar is a piece of formatted text, and the natural language instruction, the environment description, the past path information and the optional decision categories are embedded after the corresponding keywords of the grammar format. The embodiment of the application does not limit the types or number of the keywords, which include but are not limited to: "the navigation task to be completed is", "the surrounding environment information is", "the path already traversed includes" and "the optional behavior decisions are".
The embodiment of the application does not limit the specific content of the preset navigation instruction grammar and provides a content example: "Now another person comes here. The navigation task he needs to complete is '[natural language instruction]'. He has already taken several steps; the path he has traversed includes [historical navigation path]. He says that his surrounding environment information is: [environment description]. The optional behavior decisions are: [optional decision categories]. Please judge his progress from the past path, select one of the optional behavior decisions according to the current surrounding environment information, and answer with its serial number; no other text may be output."
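A minimal Python sketch of filling this grammar format is given below; the function name build_navigation_prompt and the exact wording are assumptions that simply follow the example text.

    def build_navigation_prompt(nl_instruction: str, environment: str,
                                history: list, actions: list) -> str:
        """Embed the natural language instruction, environment description, past path
        and optional decisions into the preset navigation instruction grammar format."""
        numbered = "; ".join(f"{i}: {a}" for i, a in enumerate(actions))
        past = " -> ".join(history) if history else "none"
        return (
            "Now another person comes here. The navigation task he needs to complete is "
            f"'{nl_instruction}'. The path he has already traversed includes: {past}. "
            f"He says that his surrounding environment information is: {environment}. "
            f"The optional behavior decisions are: {numbered}. Please judge his progress "
            "from the past path, select one of the optional behavior decisions according "
            "to the current surrounding environment information, and answer with its "
            "serial number only."
        )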
Step S104: input the constructed current navigation instruction into the large language model, obtain the decision output of the large language model, convert it into state change parameters that the corresponding embodied robot can receive, and control the behavior of the embodied robot.
In this step, different embodied robots have different behavior categories, such as "turn left", "turn right", "move forward", "move backward", "stop" and "ignore". Each behavior is encoded according to the communication protocol of the specific embodied robot to obtain the robot state change parameters corresponding to that behavior. The behavior category contained in the navigation decision of the large language model is then extracted by string matching, and the corresponding state change parameters are indexed.
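The string-matching conversion can be sketched as follows in Python; the parameter table (a differential-drive base with linear and angular velocities) and the function name decode_action are assumptions, since the actual state change parameters depend on the communication protocol of the specific embodied robot.

    # Hypothetical encoding table: each behavior category maps to the state change
    # parameters of one concrete embodied robot (values are illustrative only).
    BEHAVIOR_PARAMS = {
        "turn left":     {"linear": 0.0,  "angular":  0.5},
        "turn right":    {"linear": 0.0,  "angular": -0.5},
        "move forward":  {"linear": 0.3,  "angular":  0.0},
        "move backward": {"linear": -0.3, "angular":  0.0},
        "stop":          {"linear": 0.0,  "angular":  0.0},
        "ignore":        None,  # no state change
    }

    def decode_action(decision_text: str):
        """Extract the behavior category contained in the large language model's
        decision by string matching and index the corresponding state change parameters."""
        text = decision_text.lower()
        for behavior, params in BEHAVIOR_PARAMS.items():
            if behavior in text:
                return params
        return None  # unrecognized decision: treat it as "ignore"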
The embodiment of the application does not limit the type of robot, which may be a virtual intelligent agent, an unmanned aerial vehicle, an autonomous vehicle or another mobile robot.
The visual language navigation method based on multi-modal and large language pre-training models provided by the embodiment of the application has the following advantages. First, the large language model, the multi-modal visual-language model and the embodied robot are tightly integrated; by exploiting the data breadth and reasoning capability of the large language model, the generalization of the visual language navigation method is greatly improved, and its navigation performance in unknown environments is effectively enhanced. Second, the method uses pre-trained large language and multi-modal visual-language models and does not require training on large amounts of labeled data; the existing data are sufficient to fine-tune and optimize the technique as a whole, which reduces computational cost and space-time overhead and, by creatively reusing existing results, avoids the problems of difficult data acquisition and severe data scarcity in visual language navigation tasks. Third, when the large language model is asked for a navigation decision via a text instruction, the past path information is explicitly integrated into the instruction, which provides a solution to the difficulty of modeling the long-term dependence of visual language navigation tasks. Finally, the method has modest hardware requirements and good algorithm performance; with the large language model as its foundation, it can easily be embedded into other integrated algorithm and software systems and deployed as a function in existing unmanned intelligent machines, such as autonomous cars and smart home robots, expanding the capabilities of the embodied robot.
In the above embodiments, a visual language navigation method based on multi-modal and large language pre-training models is provided; correspondingly, the application also provides an embedded hardware device. The embedded hardware device provided by the embodiment of the application can implement the above visual language navigation method and can be realized by software, hardware or a combination of the two. For example, the embedded hardware device may comprise integrated or separate functional modules or units to support the implementation of the above method. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively brief, and reference may be made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
Referring to fig. 3, a schematic diagram of an embedded hardware device according to some embodiments of the application is shown. As shown in fig. 3, the embedded hardware device may include:
the data storage module 301: for storing various data generated each time a navigation task is performed, including, but not limited to, received natural language instructions, ambient image description data, question-answer records that interact with a large language model.
The data processing module 302: for running the visual language navigation method of claim 1.
The data communication module 303: the system is used for carrying out interactive communication with the robot body, communication data comprise natural language instructions received by the robot, surrounding environment images captured after each step of action of the robot, and navigation decisions and state change parameters given by the data processing module.
In some modified implementations of the embodiments of the present application, the data communication module supports a plurality of communication protocols of the robot with body, can detect the communication protocol of the robot after embedding, and invokes a preset behavior type and a coding mode corresponding to the communication protocol from the data storage module, so that the data processing module outputs navigation decision and state change parameters, and executes corresponding behaviors for the robot with body.
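As a sketch only, the protocol-dependent lookup performed by the data communication module might look like the following Python; the protocol names, byte encodings and function name encode_for_robot are illustrative assumptions, not part of the application.

    # Illustrative table of behavior encodings per communication protocol, as it might
    # be stored in the data storage module (all names and byte values are assumptions).
    ENCODINGS = {
        "serial_v1": {"turn left": b"\x01", "turn right": b"\x02",
                      "move forward": b"\x03", "stop": b"\x00"},
        "can_bus":   {"turn left": b"\x10\x01", "turn right": b"\x10\x02",
                      "move forward": b"\x10\x03", "stop": b"\x10\x00"},
    }

    def encode_for_robot(protocol: str, behavior: str) -> bytes:
        """Retrieve the encoding of a behavior under the communication protocol
        detected for the embodied robot."""
        try:
            return ENCODINGS[protocol][behavior]
        except KeyError as exc:
            raise ValueError(
                f"behavior '{behavior}' is not defined for protocol '{protocol}'"
            ) from exc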
The embedded hardware device provided by the embodiment of the application is general-purpose and can be readily configured into various embodied robots, such as autonomous vehicles, unmanned aerial vehicles and unmanned ships.
It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams, each arrow may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block or arrow may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In particular implementations, program code for carrying out embodiments of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Python and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
In the embodiments provided in the present application, it should be understood that, for the disclosed apparatus and method, it may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application, and are intended to be included within the scope of the appended claims and description.

Claims (10)

1. A visual language navigation method based on multi-modal and large language pre-training models, characterized by comprising the following steps:
a. set an initialization task instruction that outlines the navigation task the large language model needs to complete;
b. input the initialization task instruction into a pre-trained large language model to perform task initialization;
c. generate an image description for the current surrounding environment image by a multi-modal visual-language pre-training model;
d. integrate the image description of the current surrounding environment image with the natural language instruction, the past path information, the optional decision categories and the large language model answer specification, according to a preset navigation instruction grammar format, to form the current navigation instruction;
e. input the current navigation instruction into the task-initialized large language model and generate a text representation of the current navigation decision;
f. convert the text representation of the navigation decision into corresponding state change parameters that the embodied robot can receive, to control the behavior of the embodied robot.
2. The visual language navigation method of claim 1, wherein the initialization navigation task instruction needs to include three sets of important information: the requirement to complete the navigation task, keywords describing the current environment (such as indoor or outdoor), and a question-answer template for the navigation instruction.
3. The visual language navigation method of claim 1, wherein the ways in which the multi-modal visual-language pre-training model generates the image description include, but are not limited to: 1) directly inputting the environment image and outputting an image description; 2) interacting with the large language model, where the large language model asks the questions, the multi-modal model answers them, and after multiple iterations the large language model summarizes all answers to generate the image description; 3) having the multi-modal model output image descriptions for all environment images, after which the large language model summarizes them into a unified and detailed description of the surrounding environment.
4. The visual language navigation method of claim 1, wherein the preset navigation instruction grammar format is a piece of formatted text, and the natural language instruction, the environment description, the past path information and the optional decision categories are embedded after the corresponding keyword descriptors of the grammar format; the keyword descriptors include, but are not limited to: "the navigation task to be completed is", "the surrounding environment information is", "the path already traversed includes" and "the optional behavior decisions are".
5. The visual language navigation method of claim 1, wherein the navigation decision is converted from its text representation into state change parameters as follows:
different behavior categories are preset for different embodied robots, including but not limited to: "turn left", "turn right", "move forward", "move backward", "stop" and "ignore";
each behavior is encoded according to the communication protocol of the specific embodied robot to obtain the robot state change parameters corresponding to that behavior; the behavior category contained in the navigation decision of the large language model is extracted by string matching, and the corresponding state change parameters are indexed.
6. The visual language navigation method of claim 1, wherein the large language model includes, but is not limited to, ChatGPT, LLaMA and discourse.
7. The visual language navigation method of claim 1, wherein the multi-modal visual-language model includes, but is not limited to, BLIP-2.
8. The visual language navigation method of claim 1, wherein the controlled embodied robots include, but are not limited to, virtual intelligent agents, unmanned aerial vehicles, autonomous vehicles, and other mobile robots.
9. An embedded hardware device, comprising:
and a data storage module: for storing various data generated each time a navigation task is performed, including, but not limited to, received natural language instructions, ambient image description data, question-answer records that interact with a large language model.
And a data processing module: for running the visual language navigation method of claim 1.
And a data communication module: the system is used for carrying out interactive communication with the robot body, communication data comprise natural language instructions received by the robot, surrounding environment images captured after each step of action of the robot, and navigation decisions and state change parameters given by the data processing module.
10. The embedded hardware device of claim 9, wherein the data communication module supports a plurality of embodied-robot communication protocols, can detect the robot's communication protocol after being embedded, and retrieves from the data storage module the preset behavior categories and the encoding scheme corresponding to that protocol, so that the navigation decisions and state change parameters output by the data processing module cause the embodied robot to execute the corresponding behavior.
CN202310815734.4A 2023-07-05 2023-07-05 Visual language navigation technical scheme based on multi-modal perception model and large language model Pending CN117073701A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310815734.4A CN117073701A (en) 2023-07-05 2023-07-05 Visual language navigation technical scheme based on multi-modal perception model and large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310815734.4A CN117073701A (en) 2023-07-05 2023-07-05 Visual language navigation technical scheme based on multi-modal perception model and large language model

Publications (1)

Publication Number Publication Date
CN117073701A true CN117073701A (en) 2023-11-17

Family

ID=88712275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310815734.4A Pending CN117073701A (en) 2023-07-05 2023-07-05 Visual language navigation technical scheme based on multi-modal perception model and large language model

Country Status (1)

Country Link
CN (1) CN117073701A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117506940A (en) * 2024-01-04 2024-02-06 中国科学院自动化研究所 Robot track language description generation method, device and readable storage medium
CN117506940B (en) * 2024-01-04 2024-04-09 中国科学院自动化研究所 Robot track language description generation method, device and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication