CN112710310A - Visual language indoor navigation method, system, terminal and application - Google Patents
Visual language indoor navigation method, system, terminal and application
- Publication number
- CN112710310A (application number CN202011428332.1A)
- Authority
- CN
- China
- Prior art keywords
- visual
- information
- language
- robot
- indoor navigation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/20—Instruments for performing navigational calculations
Landscapes
- Engineering & Computer Science (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Automation & Control Theory (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Image Analysis (AREA)
- Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
Abstract
The invention belongs to the technical field of visual language navigation and discloses a visual language indoor navigation method, system, terminal and application. The invention combines the robot's visual information with natural language information to perform indoor navigation of the robot, and adopts an attention mechanism so that the robot can understand human language instructions more effectively and combine them with visual information, enabling the robot to reach the destination and complete the task as instructed. The invention mainly designs an attention mechanism that effectively combines natural language and visual information so that the robot can find an optimal path in an unknown room.
Description
Technical Field
The invention belongs to the technical field of visual language navigation, and particularly relates to a visual language indoor navigation method, a system, a terminal and application.
Background
At present, visual language navigation technology is a recently developed intelligent navigation method; the navigation task requires that, under a given language instruction, the robot reach a specified target location from a random initial position using the visual image information it acquires itself. For example, given the command "go straight down the hallway, enter the bedroom on the right, and stop at the bedside", the robot follows the command and, in combination with its own observations, continuously adjusts its direction of travel until the destination is reached. The method can be widely applied in many scenarios such as unmanned vehicles, intelligent robots and unmanned delivery carts. Unlike tasks based on visual navigation alone, visual language navigation requires the comprehensive use of natural language information and computer vision information; the robot continuously interacts with the environment it perceives to acquire the necessary information about the environment and thereby completes the designated task given by a human. After integrating the natural language information and the computer vision information, the agent needs to plan its own actions.
Through the above analysis, the problems and defects of the prior art are as follows: in the prior art, complex data raises the computing power requirement on the one hand, while the multi-dimensional input information makes key information difficult to extract on the other; at the same time, the high complexity of the network must also be faced, which reduces the accuracy and efficiency of information extraction.
The difficulty in solving the above problems and defects is as follows: the system is complex and the input information is high-dimensional; the task involves two branches of artificial intelligence, natural language processing and computer vision, so improvement is difficult and presents a considerable challenge.
The significance of solving the above problems and defects is as follows: solving the problems that the information is complicated and key information cannot be extracted can effectively reduce the computational complexity, improve the navigation effect, reduce the interference of noise and useless features on the model, improve the efficiency of the model, and increase its accuracy.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a visual language indoor navigation method, a system, a terminal and application.
The invention is realized in such a way that the visual language indoor navigation method combines natural language commands and visual information using a sequence-to-sequence method, extracts features from the natural language command information and the visual image information respectively, and, after the feature extraction is completed, applies attention screening to each set of extracted features to screen out the key information related to the task.
Furthermore, the visual language indoor navigation method performs fusion coding of the natural language command information and the visual image information so that the depth model attends to certain local information; local information is selectively screened out of a large amount of information and focused on, the feature vector is encoded, the vector is then decoded, and the decoding yields the command for the robot's action.
Further, the visual language indoor navigation method specifically comprises the following steps:
firstly, initializing, namely inputting a language description instruction into a robot, wherein the robot is positioned at an initial position;
secondly, extracting natural language features of the language description instruction by using the LSTM;
thirdly, extracting key information of the language description instruction by using a natural language attention mechanism, and screening out interference of irrelevant information;
fourthly, extracting computer vision features from the acquired image by using a CNN convolutional neural network;
fifthly, extracting visual key information from the acquired visual features in the fourth step by using a visual attention mechanism;
sixthly, mutually fusing the extracted visual key information in the fifth step and the key information of the language description instruction in the third step;
seventhly, extracting key information of the features fused in the sixth step by using an attention mechanism again;
eighthly, decoding and evaluating the key information obtained from the seventh step to obtain the advancing direction of the robot;
ninthly, repeating the second step to the eighth step;
and tenthly, reaching the destination and stopping advancing.
Further, the visual language indoor navigation method adopts the classical convolutional neural network ResNet-50 to extract features; before the ResNet-50 network extracts features it is pre-trained on the internationally known ImageNet image dataset, the trained ResNet-50 network is used to extract feature vectors, and the feature vector of the panoramic image observed by the robot at time t is V_t.
The attention feature vector v_t is extracted using the attention mechanism:
v_t = attention(H_{t-1}, V_t);
which expands to:
v_t = ∑_j softmax(H_{t-1} W_h (W_v V_t)^T) V_t;
H_t = LSTM([V_t, A_{t-1}], H_{t-1});
wherein v_t denotes the feature vector extracted by the attention mechanism, V_t denotes the feature vector extracted by the trained ResNet-50 network, H_{t-1} denotes the historical feature vector at time t-1, H_t denotes the historical feature vector at time t, A_t and A_{t-1} denote the actions taken by the robot at time t and time t-1 respectively, and W_h and W_v denote weight matrices.
Further, in the visual language indoor navigation method, for an input string of natural language instructions W = (w_1, w_2, w_3, ...), where the instruction consists of a string of words, features are extracted using the long short-term memory neural network LSTM as C = LSTM(W), where C is the feature extracted from the natural language; the natural language feature is then re-extracted using the attention mechanism, formally expressed as:
Y_t = attention(H_t, C).
Further, the visual language indoor navigation method encodes and extracts the robot's visual information and the natural language information separately, then performs fused attention extraction on the two feature vectors, fuses all the extracted information with the robot's historical information, evaluates the robot's next step, estimates the probability P of each heading direction, and determines the direction the robot should move in according to the maximum probability:
D_t = attention(Y_t, v_t);
P = softmax([H_t, v_t, Y_t, D_t] W_c W_b);
wherein D_t denotes the fused feature vector, P denotes the probability of the heading direction, and W_c and W_b denote weight matrices.
The invention also aims to provide a robot visual language navigation information data processing terminal which is used for realizing the visual language indoor navigation method.
Another object of the present invention is to provide a visual language indoor navigation system implementing the visual language indoor navigation method, the visual language indoor navigation system comprising:
a command and information combining module for combining natural language commands with visual information using a sequence-to-sequence approach;
the characteristic extraction module is used for respectively extracting the characteristics of the natural language command information and the visual image information;
and the key information screening module is used for screening the attention characteristics of the extracted characteristics respectively after completing the characteristic extraction, and screening the key information related to the task.
By combining all the technical schemes, the invention has the following advantages and positive effects: the attention mechanism takes human attention as its reference; when the human brain processes visual information, it quickly scans the global image to find the key regions that need attention, which greatly improves the efficiency and accuracy of visual processing. The attention mechanism aims to select key information with important meaning from a large amount of information; it was first borrowed by natural language processing to screen out phrases with important semantics, and since then has been widely used in many scenarios such as speech recognition and image processing. The invention combines the robot's visual information with natural language information to perform indoor navigation of the robot, and adopts the attention mechanism so that the robot can understand human language instructions more effectively and combine them with visual information, enabling the robot to reach the destination and complete the task as instructed. The invention mainly designs an attention mechanism that effectively combines natural language and visual information so that the robot can find an optimal path in an unknown room.
The visual language indoor navigation task proposed by the invention needs to combine natural language command information with visual image information; the data volume is large and the relevant key information is extensive, so without an attention mechanism the complicated data raises the computing power requirement and the high complexity of the network must be faced. To improve the accuracy and efficiency of information extraction, the invention provides a visual language indoor navigation method based on an attention mechanism.
The invention performs fusion coding of natural language command information and visual image information and makes the depth model attend to certain local information. Local information is selectively screened out of a large amount of information and focused on, the feature vector is encoded, the vector is then decoded, and the decoding yields the command for the robot's action. The attention mechanism is adopted in the process of encoding the feature vector; the attention extraction applied to natural language features differs from that applied to computer vision features, and the fused features also need attention to extract key information, so the encoding is more efficient and the extracted information is more valuable.
The invention provides an advanced visual language indoor navigation method for a robot that effectively combines natural language commands with visual information, enabling the robot to reach a destination in an unknown indoor space according to human commands, which brings the navigation closer to real-scene applications. The invention designs an attention mechanism that refines the language features and the visual features; because a large amount of natural language and visual information must be acquired during visual language indoor navigation, the attention mechanism refines the acquired information and makes the obtained features more precise. This reduces the interference of noise and useless features on the model, improves the efficiency of the model, and increases its accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments of the present application will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained from the drawings without creative efforts.
Fig. 1 is a flowchart of a visual language indoor navigation method according to an embodiment of the present invention.
FIG. 2 is a schematic structural diagram of a visual language indoor navigation system provided by an embodiment of the present invention;
in the figure: 1. a command and information combining module; 2. a feature extraction module; 3. and a key information screening module.
Fig. 3 is a flowchart of an implementation of a visual language indoor navigation method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Aiming at the problems in the prior art, the invention provides a visual language indoor navigation method, a system, a terminal and application thereof, and the invention is described in detail with reference to the accompanying drawings.
As shown in fig. 1, the visual language indoor navigation method provided by the present invention comprises the following steps:
s101: combining natural language commands with visual information by a sequence-to-sequence method;
s102: respectively extracting the characteristics of the natural language command information and the visual image information;
s103: after the feature extraction is completed, the attention features are respectively screened for the extracted features, and key information related to the task is screened out.
Those skilled in the art can also implement the visual language indoor navigation method provided by the present invention by adopting other steps, and the visual language indoor navigation method provided by the present invention in fig. 1 is only one specific embodiment.
As shown in fig. 2, the visual language indoor navigation system provided by the present invention comprises:
a command and information combining module 1 for combining natural language commands and visual information by a sequence-to-sequence method;
the characteristic extraction module 2 is used for respectively extracting the characteristics of the natural language command information and the visual image information;
and the key information screening module 3 is used for screening attention features of the extracted features respectively after feature extraction is completed, and screening key information related to the task.
The technical solution of the present invention is further described below with reference to the accompanying drawings.
As shown in fig. 3, the method provided by the invention is mainly applied to the robot's language-vision navigation module and does not involve the design of the whole robot; the current implementation mainly relies on a computer to simulate this module (an illustrative code sketch of the overall loop follows the step list below), and specifically includes the following steps:
firstly, initializing, namely inputting a language description instruction into a robot, wherein the robot is positioned at an initial position;
secondly, extracting natural language features of the language description instruction by using the LSTM;
thirdly, extracting key information of the language description instruction by using a natural language attention mechanism, and screening out interference of irrelevant information;
fourthly, extracting computer vision features from the acquired image by using a CNN convolutional neural network;
fifthly, extracting visual key information from the acquired visual features in the fourth step by using a visual attention mechanism;
sixthly, mutually fusing the extracted visual key information in the fifth step and the key information of the language description instruction in the third step;
seventhly, extracting key information of the features fused in the sixth step by using an attention mechanism again;
eighthly, decoding and evaluating the key information obtained from the seventh step to obtain the advancing direction of the robot;
ninthly, repeating the second step to the eighth step;
and tenthly, reaching the destination and stopping advancing.
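To make the data flow of these ten steps concrete, a minimal Python sketch of the loop is given below. It is only an illustrative outline: every helper it calls (encode_instruction, init_hidden, language_attention, extract_visual_features, visual_attention, fuse_features, fused_attention, decode_action) and the robot interface (observe, step, at_goal) are hypothetical placeholders standing in for the modules described in this embodiment, not functions defined by the invention.

```python
# Hypothetical helpers stand in for the modules of this embodiment; none of
# these names are defined by the invention itself.

def navigate(robot, instruction, max_steps=30):
    C = encode_instruction(instruction)               # step 2: LSTM language features
    H = init_hidden()                                 # initial historical feature vector
    for _ in range(max_steps):
        Y = language_attention(H, C)                  # step 3: key language information
        V = extract_visual_features(robot.observe())  # step 4: CNN (ResNet-50) features
        v = visual_attention(H, V)                    # step 5: key visual information
        D = fused_attention(fuse_features(Y, v))      # steps 6-7: fusion, then attention
        action, H = decode_action(H, v, Y, D)         # step 8: decode the heading direction
        if action == "STOP" or robot.at_goal():       # step 10: stop at the destination
            return
        robot.step(action)                            # step 9: advance and repeat
```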
In the invention, for computer vision images, a ResNet-50 network is adopted to extract features; before ResNet-50 extracts features it is pre-trained on the ImageNet dataset, the trained ResNet-50 is used to extract feature vectors, and the feature vector of the panoramic image observed by the robot at time t is V_t.
The attention feature vector v_t is extracted using the attention mechanism:
v_t = attention(H_{t-1}, V_t)    (1)
which expands to:
v_t = ∑_j softmax(H_{t-1} W_h (W_v V_t)^T) V_t    (2)
H_t = LSTM([V_t, A_{t-1}], H_{t-1})    (3)
wherein v_t denotes the feature vector extracted by the attention mechanism, V_t denotes the feature vector extracted by the trained ResNet-50 network, H_{t-1} denotes the historical feature vector at time t-1, H_t denotes the historical feature vector at time t, A_t and A_{t-1} denote the actions taken by the robot at time t and time t-1 respectively, and W_h and W_v denote weight matrices.
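The following PyTorch sketch illustrates one plausible implementation of equations (1)-(3). The shapes, the mean-pooling of the panorama before the LSTM update, and the use of torch.nn.LSTMCell are illustrative assumptions rather than details fixed by the invention.

```python
import torch
import torch.nn.functional as F

# Assumed shapes (illustrative only):
#   V_t:    (J, d_v)  per-view features from the pretrained ResNet-50
#   H_prev: (d_h,)    historical feature vector H_{t-1}
#   A_prev: (d_a,)    embedding of the previous action A_{t-1}
#   W_h:    (d_h, d_k)  and  W_v: (d_k, d_v)  trainable weight matrices

def visual_attention(H_prev, V_t, W_h, W_v):
    query = H_prev @ W_h                      # H_{t-1} W_h              -> (d_k,)
    keys = V_t @ W_v.T                        # (W_v V_t^T)^T            -> (J, d_k)
    weights = F.softmax(keys @ query, dim=0)  # softmax over the J views -> (J,)
    return weights @ V_t                      # v_t: attention-weighted view features

def update_history(lstm_cell, V_t, A_prev, H_prev, c_prev):
    # H_t = LSTM([V_t, A_{t-1}], H_{t-1}); lstm_cell is an nn.LSTMCell whose
    # input size is d_v + d_a, and the panorama is mean-pooled over its views
    x = torch.cat([V_t.mean(dim=0), A_prev], dim=-1)
    H_t, c_t = lstm_cell(x.unsqueeze(0), (H_prev.unsqueeze(0), c_prev.unsqueeze(0)))
    return H_t.squeeze(0), c_t.squeeze(0)
```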
In the present invention, for an input string of natural language instructions W = (w_1, w_2, w_3, ...), where the instruction consists of a string of words, features are extracted using the LSTM as C = LSTM(W), where C is the feature extracted from the natural language; the natural language feature then needs to be re-extracted using the attention mechanism, formally expressed as:
Y_t = attention(H_t, C)    (4)
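A short sketch of the instruction encoder C = LSTM(W) and the language attention of equation (4) is given below. The vocabulary size, embedding size and hidden size are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstructionEncoder(nn.Module):
    # Sketch of C = LSTM(W) and Y_t = attention(H_t, C); all sizes are assumptions.
    def __init__(self, vocab_size=1000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.W_q = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, word_ids):                 # word_ids: (1, L) token indices
        C, _ = self.lstm(self.embed(word_ids))   # C: (1, L, hidden_dim)
        return C.squeeze(0)                      # one context vector per word

    def attend(self, H_t, C):                    # H_t: (hidden_dim,), C: (L, hidden_dim)
        scores = C @ self.W_q(H_t)               # relevance of each word to the history
        alpha = F.softmax(scores, dim=0)         # attention weights over the words
        return alpha @ C                         # Y_t: re-weighted instruction summary
```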
In the invention, the robot's visual information and the natural language information are first encoded and extracted separately; because the natural language is a close description of the visual scene, the correlation between the two is good. Fused attention extraction is then performed on the two feature vectors, all the extracted information is fused together with the robot's historical information, the robot's next step is evaluated, the probability P of each heading direction is estimated, and the direction the robot should move in is determined according to the maximum probability:
D_t = attention(Y_t, v_t)    (5)
P = softmax([H_t, v_t, Y_t, D_t] W_c W_b)    (6)
wherein D_t denotes the fused feature vector, P denotes the probability of the heading direction, and W_c and W_b denote weight matrices.
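The sketch below follows equations (5)-(6). Reading D_t = attention(Y_t, v_t) as a similarity-gated copy of the language summary is only one plausible interpretation; it further assumes Y_t and v_t have been projected to a common dimension, and W_c, W_b and the number of candidate headings are assumed model parameters.

```python
import torch
import torch.nn.functional as F

def fuse_and_score(H_t, v_t, Y_t, W_c, W_b):
    # D_t = attention(Y_t, v_t): one plausible reading, gating the language
    # summary Y_t by its agreement with the visual summary v_t (same dimension assumed)
    gate = torch.sigmoid((Y_t * v_t).sum(dim=-1, keepdim=True))
    D_t = gate * Y_t

    # P = softmax([H_t, v_t, Y_t, D_t] W_c W_b): concatenate, project, normalise
    features = torch.cat([H_t, v_t, Y_t, D_t], dim=-1)
    logits = features @ W_c @ W_b                # one logit per candidate heading
    return F.softmax(logits, dim=-1)

# The robot then moves in the most probable direction:
# action = int(torch.argmax(fuse_and_score(H_t, v_t, Y_t, W_c, W_b)))
```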
The effect of the invention was tested on the public simulation dataset R2R, which collects data from 99 different scenes; the test results show that the method proposed by the invention significantly improves navigation performance. The test results of the invention are shown in Table 1.
TABLE 1 test results
Method | TL↓ | NE↓ | OSR↑ | SR↑ | SPL↑ |
Seq2Seq | 8.40 | 3.67 | 0.43 | 0.25 | 0.35 |
RCM | 10.65 | 3.53 | 0.75 | 0.46 | 0.43 |
Our | 7.86 | 3.54 | 0.78 | 0.53 | 0.58 |
In Table 1, Our denotes the method proposed by the invention, Seq2Seq denotes the existing basic navigation method, and RCM denotes another well-known navigation method. TL is the trajectory (path) length, NE is the navigation error, OSR is the oracle success rate, SR is the success rate, and SPL is the success rate weighted by inverse path length; these five indexes are internationally accepted metrics for evaluating navigation accuracy. A downward arrow indicates that a smaller value is better under that criterion, an upward arrow indicates that a larger value is better, and bold font indicates the best result. As can be seen from the table, on four of the five evaluation indexes the method proposed by the invention obtains the best result, and on the remaining index it obtains the second best.
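For reference, the SR and SPL columns in Table 1 are conventionally computed as sketched below in the R2R literature; these are the standard public definitions of the metrics rather than formulas stated in the patent, and the episode fields and the 3 m success threshold are the commonly used assumptions.

```python
# Each episode dict is a hypothetical record with fields nav_error,
# path_length and shortest_path (all in metres).

def success_rate(episodes, threshold=3.0):
    # an episode counts as a success when the final navigation error is within 3 m
    return sum(e["nav_error"] <= threshold for e in episodes) / len(episodes)

def spl(episodes, threshold=3.0):
    # success weighted by (shortest-path length / length of the path actually taken)
    total = 0.0
    for e in episodes:
        success = e["nav_error"] <= threshold
        total += success * e["shortest_path"] / max(e["path_length"], e["shortest_path"])
    return total / len(episodes)
```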
It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided on a carrier medium such as a disk, CD- or DVD-ROM, programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier, for example. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.
The above description is only for the purpose of illustrating the present invention and the appended claims are not to be construed as limiting the scope of the invention, which is intended to cover all modifications, equivalents and improvements that are within the spirit and scope of the invention as defined by the appended claims.
Claims (8)
1. A visual language indoor navigation method, characterized in that a sequence-to-sequence method is used to combine natural language commands with visual information, feature extraction is carried out on the natural language command information and the visual image information respectively, and after the feature extraction is completed, attention screening is carried out on each set of extracted features to screen out the key information related to the task.
2. The visual language indoor navigation method of claim 1, wherein the visual language indoor navigation method performs fusion coding of the natural language command information and the visual image information so that a depth model attends to certain local information; local information is selectively screened out of a large amount of information and focused on, the feature vector is encoded, the vector is then decoded, and the decoding yields the command for the robot's action.
3. The visual language indoor navigation method of claim 1, wherein the visual language indoor navigation method specifically comprises:
firstly, initializing, namely inputting a language description instruction into a robot, wherein the robot is positioned at an initial position;
secondly, extracting natural language features of the language description instruction by using the LSTM;
thirdly, extracting key information of the language description instruction by using a natural language attention mechanism, and screening out interference of irrelevant information;
fourthly, extracting computer vision features from the acquired image by using a CNN convolutional neural network;
fifthly, extracting visual key information from the acquired visual features in the fourth step by using a visual attention mechanism;
sixthly, mutually fusing the extracted visual key information in the fifth step and the key information of the language description instruction in the third step;
seventhly, extracting key information of the features fused in the sixth step by using an attention mechanism again;
eighthly, decoding and evaluating the key information obtained from the seventh step to obtain the advancing direction of the robot;
ninthly, repeating the second step to the eighth step;
and tenthly, reaching the destination and stopping advancing.
4. The visual language indoor navigation method of claim 3, wherein the visual language indoor navigation method adopts a ResNet-50 network for feature extraction; the ResNet-50 network is pre-trained on the ImageNet dataset before it extracts features, the trained ResNet-50 network is used to extract feature vectors, and the feature vector of the panoramic image observed by the robot at time t is V_t;
the attention feature vector v_t is extracted using the attention mechanism:
v_t = attention(H_{t-1}, V_t);
which expands to:
v_t = ∑_j softmax(H_{t-1} W_h (W_v V_t)^T) V_t;
H_t = LSTM([V_t, A_{t-1}], H_{t-1});
wherein v_t denotes the feature vector extracted by the attention mechanism, V_t denotes the feature vector extracted by the trained ResNet-50 network, H_{t-1} denotes the historical feature vector at time t-1, H_t denotes the historical feature vector at time t, A_t and A_{t-1} denote the actions taken by the robot at time t and time t-1 respectively, and W_h and W_v denote weight matrices.
5. The visual language indoor navigation method of claim 3, wherein, for an input string of natural language instructions W = (w_1, w_2, w_3, ...), where the instruction consists of a string of words, features are extracted using the LSTM as C = LSTM(W), where C is the feature extracted from the natural language; the natural language feature is then re-extracted using the attention mechanism, formally expressed as:
Y_t = attention(H_t, C).
6. The visual language indoor navigation method of claim 3, wherein the visual language indoor navigation method encodes and extracts the robot's visual information and the natural language information separately, then performs fused attention extraction on the two feature vectors, fuses all the extracted information with the robot's historical information, evaluates the robot's next step, estimates the probability P of each heading direction, and determines the direction the robot should move in according to the maximum probability:
D_t = attention(Y_t, v_t);
P = softmax([H_t, v_t, Y_t, D_t] W_c W_b);
wherein D_t denotes the fused feature vector, P denotes the probability of the heading direction, and W_c and W_b denote weight matrices.
7. A robot visual language navigation information data processing terminal, characterized in that the robot visual language navigation information data processing terminal is used for realizing the visual language indoor navigation method of any one of claims 1 to 6.
8. A visual language indoor navigation system for implementing the visual language indoor navigation method according to any one of claims 1 to 6, wherein the visual language indoor navigation system comprises:
a command and information combining module for combining natural language commands with visual information using a sequence-to-sequence approach;
the characteristic extraction module is used for respectively extracting the characteristics of the natural language command information and the visual image information;
and the key information screening module is used for screening the attention characteristics of the extracted characteristics respectively after completing the characteristic extraction, and screening the key information related to the task.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011428332.1A CN112710310B (en) | 2020-12-07 | 2020-12-07 | Visual language indoor navigation method, system, terminal and application |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011428332.1A CN112710310B (en) | 2020-12-07 | 2020-12-07 | Visual language indoor navigation method, system, terminal and application |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112710310A true CN112710310A (en) | 2021-04-27 |
CN112710310B CN112710310B (en) | 2024-04-19 |
Family
ID=75542756
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011428332.1A Active CN112710310B (en) | 2020-12-07 | 2020-12-07 | Visual language indoor navigation method, system, terminal and application |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112710310B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113420606A (en) * | 2021-05-31 | 2021-09-21 | 华南理工大学 | Method for realizing autonomous navigation of robot based on natural language and machine vision |
CN113670310A (en) * | 2021-07-27 | 2021-11-19 | 际络科技(上海)有限公司 | Visual voice navigation method, device, equipment and storage medium |
CN113984052A (en) * | 2021-06-16 | 2022-01-28 | 北京小米移动软件有限公司 | Indoor navigation method, indoor navigation device, equipment and storage medium |
CN115082915A (en) * | 2022-05-27 | 2022-09-20 | 华南理工大学 | Mobile robot vision-language navigation method based on multi-modal characteristics |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130237179A1 (en) * | 2012-03-08 | 2013-09-12 | Rajesh Chandra Potineni | System and method for guided emergency exit |
CN108245384A (en) * | 2017-12-12 | 2018-07-06 | 清华大学苏州汽车研究院(吴江) | Binocular vision apparatus for guiding blind based on enhancing study |
CN108981712A (en) * | 2018-08-15 | 2018-12-11 | 深圳市烽焌信息科技有限公司 | Robot goes on patrol method and robot |
CN109491613A (en) * | 2018-11-13 | 2019-03-19 | 深圳龙岗智能视听研究院 | A kind of continuous data protection storage system and its storage method using the system |
CN110348462A (en) * | 2019-07-09 | 2019-10-18 | 北京金山数字娱乐科技有限公司 | A kind of characteristics of image determination, vision answering method, device, equipment and medium |
Non-Patent Citations (1)
Title |
---|
LIU Wenqing: "Machine vision application development technology based on an AI open platform", Hunan Electric Power, vol. 39, no. 6, pages 13 - 15 *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113420606A (en) * | 2021-05-31 | 2021-09-21 | 华南理工大学 | Method for realizing autonomous navigation of robot based on natural language and machine vision |
CN113984052A (en) * | 2021-06-16 | 2022-01-28 | 北京小米移动软件有限公司 | Indoor navigation method, indoor navigation device, equipment and storage medium |
CN113984052B (en) * | 2021-06-16 | 2024-03-19 | 北京小米移动软件有限公司 | Indoor navigation method, indoor navigation device, equipment and storage medium |
CN113670310A (en) * | 2021-07-27 | 2021-11-19 | 际络科技(上海)有限公司 | Visual voice navigation method, device, equipment and storage medium |
CN113670310B (en) * | 2021-07-27 | 2024-05-31 | 际络科技(上海)有限公司 | Visual voice navigation method, device, equipment and storage medium |
CN115082915A (en) * | 2022-05-27 | 2022-09-20 | 华南理工大学 | Mobile robot vision-language navigation method based on multi-modal characteristics |
CN115082915B (en) * | 2022-05-27 | 2024-03-29 | 华南理工大学 | Multi-modal feature-based mobile robot vision-language navigation method |
Also Published As
Publication number | Publication date |
---|---|
CN112710310B (en) | 2024-04-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112710310A (en) | Visual language indoor navigation method, system, terminal and application | |
CN113705769B (en) | Neural network training method and device | |
CN110766044B (en) | Neural network training method based on Gaussian process prior guidance | |
US20220004744A1 (en) | Human posture detection method and apparatus, device and storage medium | |
CN108960407B (en) | Recurrent neural network language model training method, device, equipment and medium | |
US11915128B2 (en) | Neural network circuit device, neural network processing method, and neural network execution program | |
CN112580369B (en) | Sentence repeating method, method and device for training sentence repeating model | |
Mishra et al. | The understanding of deep learning: A comprehensive review | |
CN109766557B (en) | Emotion analysis method and device, storage medium and terminal equipment | |
CN109817276A (en) | A kind of secondary protein structure prediction method based on deep neural network | |
CN114676234A (en) | Model training method and related equipment | |
CN111626134B (en) | Dense crowd counting method, system and terminal based on hidden density distribution | |
CN112733768A (en) | Natural scene text recognition method and device based on bidirectional characteristic language model | |
CN111259735B (en) | Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network | |
CN110807335A (en) | Translation method, device, equipment and storage medium based on machine learning | |
CN113159236A (en) | Multi-focus image fusion method and device based on multi-scale transformation | |
US11948078B2 (en) | Joint representation learning from images and text | |
CN113747168A (en) | Training method of multimedia data description model and generation method of description information | |
CN116739071A (en) | Model training method and related device | |
CN111783688B (en) | Remote sensing image scene classification method based on convolutional neural network | |
CN112905754B (en) | Visual dialogue method and device based on artificial intelligence and electronic equipment | |
CN113407820A (en) | Model training method, related system and storage medium | |
CN116975347A (en) | Image generation model training method and related device | |
CN116434058A (en) | Image description generation method and system based on visual text alignment | |
CN115824213A (en) | Visual language navigation method based on follower model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||