CN114812551A - Indoor environment robot navigation natural language instruction generation method - Google Patents
Indoor environment robot navigation natural language instruction generation method

- Publication number: CN114812551A
- Application number: CN202210224196.7A
- Authority: CN (China)
- Legal status: Granted (status listed by Google Patents; not a legal conclusion)
Classifications

- G01C21/206 — Instruments for performing navigational calculations specially adapted for indoor navigation
- G06N3/045 — Combinations of networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/08 — Learning methods
Abstract
The invention relates to a method for generating navigation natural language instructions for a robot in an indoor environment, comprising the following steps: S1, extracting image feature vectors from the panoramic image collected by the robot camera; S2, acquiring the current offset angle of the robot, expanding its data dimension through trigonometric transformation, and splicing it with the image feature vectors to form the corresponding action feature vector and panoramic image feature vector; S3, aligning the action feature vector and the panoramic image feature vector with multi-head attention and performing dimension-reduction calculation; S4, encoding the visual and action information of the robot with a Transformer framework and outputting a predicted language result; S5, adding an auxiliary supervision task to the decoder output to help the robot learn the correspondence between output sentences and input actions. Compared with the prior art, the method makes better use of the feature information and improves the accuracy and generalization ability of the generation model.
Description
Technical Field
The invention relates to the field of computer vision and natural language generation, in particular to a method for generating a navigation natural language instruction of an indoor environment robot.
Background
The visual-language navigation task is an important research problem in artificial intelligence and a representative problem in the cross-modal field spanning computer vision and natural language processing. Its goal is to issue a path instruction to a robot in natural language; the robot then autonomously analyzes the target direction expressed by the instruction, adjusts its behavior, and plans a path according to visual images fed back in real time.
However, manually annotating natural language instructions is time-consuming and labor-intensive. Many studies have shown that introducing an autoregressive instruction-generation model can effectively improve the training accuracy of existing robots on the visual-language navigation task, but existing natural-language-instruction generation models for guiding robot navigation all rely on structurally simple RNN sequence models, which capture long-range dependencies poorly, run slowly due to serial computation, and produce semantically sparse output sentences.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a method for generating an indoor environment robot navigation natural language instruction.
The purpose of the invention can be realized by the following technical scheme:
a method for generating an indoor environment robot navigation natural language instruction comprises the following steps:
s1, extracting image feature vectors from the panoramic image collected by the robot camera using a deep convolutional neural network;
s2, acquiring the current offset angle of the robot, expanding its data dimension through trigonometric transformation, and splicing it with the image feature vectors to form the corresponding action feature vector and panoramic image feature vector;
s3, aligning the action feature vector and the panoramic image feature vector with multi-head attention and performing dimension-reduction calculation, so that the robot focuses on the more important visual content in the environment;
s4, encoding the visual and action information of the robot with a sequence-to-sequence Transformer framework, performing cross-modal attention fusion with the masked language embedding at the decoder, and outputting a predicted language result;
s5, adding an auxiliary supervision task at the decoder output to help the robot learn the correspondence between output sentences and input actions and improve the network model's expression of the input-output relation.
In step S1, the deep convolutional neural network adopts a ResNet-152 pre-trained on ImageNet; after an image is fed into the ResNet-152 network, the output of the last layer before the classification head, obtained by forward inference, is used as the image feature vector.
The panoramic image collected by the robot camera comprises 36 sub-images: 12 observation images at 30-degree intervals at each of three view angles (looking down, level, and looking up), each observation image corresponding to one image feature vector.
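As an illustration, the 36 viewpoints can be enumerated as heading/pitch pairs. This is a sketch: the concrete pitch values of ±30° for the look-down and look-up rows are an assumption not stated in the text.

```python
def panorama_view_angles():
    # 12 headings at 30-degree intervals x 3 pitch rows (down, level, up) = 36 views
    pitches = [-30.0, 0.0, 30.0]  # assumed pitch values for the three view rows
    return [(heading * 30.0, pitch) for pitch in pitches for heading in range(12)]

views = panorama_view_angles()  # 36 (heading_deg, pitch_deg) pairs
```

Each pair would index one sub-image and hence one image feature vector.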
In step S2, the offset angle of the robot includes an action offset angle and a view-angle offset angle. The action offset angle is the offset between the robot's current position and its position at the previous time step; the view-angle offset angle is the offset relative to the center of each sub-image of the panoramic image observed by the robot. The offset angle is expressed as:

γ = (sin θ, cos θ, sin φ, cos φ)

wherein γ is the offset angle feature, θ is the heading offset angle, and φ is the pitch offset angle.
In step S2, the action feature vector A is formed by splicing the image feature vector of the sub-image directly in front of the robot in the panoramic image with the dimension-expanded action offset angle vector, and the panoramic image feature vector E is formed by splicing the image feature vectors of all sub-images in the panoramic image with the dimension-expanded view-angle offset angle vectors.
In step S3, the output X after multi-head attention alignment and dimension-reduction calculation is expressed as:

X = softmax(QK^T / sqrt(d_k)) V

Q = AW_Q, K = EW_K, V = EW_V

wherein Q, K and V denote the query matrix, key matrix and value matrix of the attention mechanism after linear transformation; W_Q, W_K and W_V are the learnable weights that linearly transform the action feature vector A and the panoramic image feature vector E; and d_k is the dimension of K.
In step S4, when performing prediction with the Transformer, position encoding is added to reflect the different influence of each input position on the output in time order. The output X from the multi-head attention alignment and dimension-reduction calculation is position-encoded as:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

wherein PE(pos, 2i) and PE(pos, 2i+1) are the position-encoding values in dimensions 2i and 2i+1 of the embedding, pos is the actual position of the element in the input sequence, and d_model is the dimension of the embedding.
The Transformer comprises an encoder and a decoder. The encoder is composed of stacked multi-head self-attention modules, feed-forward networks and residual connections; each decoder module comprises cross-attention, self-attention, a feed-forward network and residual connections.
In the decoder, the ground-truth text is shifted one position to the right and diagonally masked so that each input token depends only on the previous predictions; the text is then converted into an embedded representation by a linear layer and cross-attended with the encoder output.
In step S5, after the auxiliary supervision task is added to the decoder output, the final loss function Loss is expressed as:

Loss = λL_1 + (1-λ)ωL_2

L_1 = -Σ_{p=1}^{P} log f_θ(ŷ_p | ŷ_1, …, ŷ_{p-1})

L_2 = (1/L) Σ_{l=1}^{L} (Z_l - Ẑ_l)²

Ẑ_l = P(I'_j)/k

wherein L_1 is the difference between predicted and true values obtained with the cross-entropy loss function, L_2 is the difference between predicted and true values obtained with the mean-square-error function, θ are the network parameters, f_θ(·) is the network's predicted probability, ŷ_p is the p-th true value of the output instruction, ŷ_1, …, ŷ_{p-1} are the 1st to (p-1)-th true values of the output instruction, Z_l is the predicted progress value the network outputs for the l-th word, L is the total number of words, P(I'_j) is the segment index of the sub-instruction I'_j containing the current word, k is the total number of sub-instructions, λ controls the relative weight of the two loss functions, ω unifies the magnitudes of the two loss functions, and Ẑ_l is the true progress value corresponding to the l-th word.
Compared with the prior art, the invention has the following advantages:
the invention effectively integrates the action behavior and the environment observation of the robot by introducing a multi-head attention method, improves the utilization degree of input characteristic information, uses an advanced sequence generation model Transformer to replace the original RNN structural model, and improves the integral coding and decoding capability of the model.
The invention provides an additional progress-supervision auxiliary task that uses finer-grained sub-instructions and their corresponding sub-actions to improve the network model's expression of the input-output correspondence, and introduces prior knowledge to help the model better learn the relation between language generation and actions, thereby improving the accuracy and generalization ability of the generation model while adding almost no network parameters.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a system flow diagram illustrating the method of the present invention.
FIG. 3 is a schematic diagram of a progress supervision assistance task in the method of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
Examples
The invention provides a method for generating a navigation natural language instruction of a robot in an indoor environment, wherein the overall flow block diagram of the method is shown in figure 1, and the method specifically comprises the following steps:
s1, extracting image features from the panoramic image collected by the robot camera using a deep convolutional neural network;
The deep convolutional neural network adopts a ResNet-152 pre-trained on ImageNet; the output of the last layer before the classification head, obtained by forward inference after an image is fed into the network, is used as the feature vector, with v_j denoting the j-th image feature vector.
S2, acquiring the current offset angle gamma of the robot, expanding the data dimension through triangular transformation, and splicing with the image features to form a new feature vector;
the offset angle includes motion offset and view angle offset, the motion offset refers to the offset angle between the current position of the robot and the position of the robot at the previous moment, the view angle offset refers to the offset angle based on the center of each sub-image contained in a panoramic image observed by the robot, and in the example, a sine function and a cosine function are used for respectively calculating the offset pitch angleAnd a heading angle θ, the formula being:
in order for the network model to better learn the relationship between images and motion angles, it is extended to 128 dimensions, i.e. γ j And is combined with v j Forming a characteristic vector after splicing: o j ={v j ;γ j }。
S3, aligning and performing dimensionality reduction calculation on the panoramic image of the robot action and observation by using multi-head attention, so that the robot focuses on more important visual contents in the environment;
the multi-head attention uses the motion characteristic vector as a query matrix Q, uses the panoramic image as a key matrix K and a value matrix V, and can obviously reduce the dimensionality of input characteristics after the attention is paid to the motion characteristic vector and the panoramic image, filter unimportant components, enrich input semantic information, and express an attention calculation formula as follows:
X = softmax(QK^T / sqrt(d_k)) V

Q = AW_Q, K = EW_K, V = EW_V

wherein A and E denote the action feature vector and the panoramic image feature vector respectively, W_Q, W_K and W_V are the learnable weights that linearly transform A and E, and d_k denotes the dimension of K. Using multiple attention heads lets the model attend to different emphases, further improving its training and learning ability.
S4, coding the visual and motion information of the robot by using a sequence-to-sequence Transformer frame, performing cross-modal attention fusion with a language embedded code with a mask at a decoder end, and outputting a predicted language result;
Because the Transformer is a parallel structure with no inherent notion of input order, position encoding must be added to reflect the different influence of each input position on the output in time order. The input X produced by the multi-head attention fusion is therefore position-encoded with the following functions:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

wherein PE(pos, 2i) and PE(pos, 2i+1) are the position-encoding values in dimensions 2i and 2i+1 of the embedding, pos is the actual position of the element in the input sequence, and d_model is the dimension of the embedding.
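The sinusoidal position encoding above can be computed directly; the sequence length and model dimension below are example values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); odd dimensions use cos
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2.0 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=50, d_model=512)  # added elementwise to X
```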
The Transformer comprises an encoder and a decoder. The encoder is built from stacked multi-head self-attention modules, feed-forward networks and residual connections. In the decoder, the text input must depend only on previous predictions, so the ground-truth text is shifted one position to the right and diagonally masked; it is then converted into an embedded representation by a linear layer and cross-attended with the encoder output. Each decoder module comprises cross-attention, self-attention, a feed-forward network and residual connections, and the final word probability distribution is produced by a linear layer followed by a softmax function.
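The right-shift and diagonal (causal) mask can be sketched as follows; the start-token id is an illustrative assumption.

```python
import numpy as np

def shift_right(tokens, start_id=0):
    # Prepend a start token and drop the last target token
    return [start_id] + list(tokens[:-1])

def causal_mask(n):
    # True above the diagonal marks masked (future) positions
    return np.triu(np.ones((n, n), dtype=bool), k=1)

inp = shift_right([7, 8, 9])  # decoder input: [0, 7, 8]
m = causal_mask(3)            # position t may only attend to positions <= t
```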
The difference between the predicted and true values is calculated with the cross-entropy loss function:

L_1 = -Σ_{p=1}^{P} log f_θ(ŷ_p | ŷ_1, …, ŷ_{p-1})

wherein L_1 is the difference between predicted and true values obtained with the cross-entropy loss function, θ are the network parameters, f_θ(·) is the network's predicted probability, ŷ_p is the p-th true value of the output instruction, and ŷ_1, …, ŷ_{p-1} are the 1st to (p-1)-th true values of the output instruction.
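A small numeric sketch of the cross-entropy term, with a toy 3-word vocabulary and hand-picked probabilities:

```python
import numpy as np

def cross_entropy(pred_probs, target_ids):
    # Mean negative log-likelihood of the true next word at each step
    picked = pred_probs[np.arange(len(target_ids)), target_ids]
    return float(-np.mean(np.log(picked)))

# Two decoding steps over a 3-word vocabulary
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
loss = cross_entropy(probs, [0, 1])
```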
S5, adding an additional auxiliary supervision task at the output part of the decoder to help the robot learn the corresponding relation between the output sentence and the input action and improve the expression of the network model to the input and output relation.
Auxiliary tasks are a common means of improving machine translation: by exploiting prior knowledge, an extra supervised loss helps the model learn the internal associations in the data more easily. Suppose an action can be divided into k sub-actions, each corresponding to a sub-instruction containing several words; the progress value of each word in each sub-instruction is then:

Ẑ_l = P(I'_j)/k

wherein P(I'_j) denotes the segment index of the sub-instruction I'_j containing the current word, k is the total number of sub-instructions (or sub-actions), and Ẑ_l is the true progress value corresponding to the l-th word. The auxiliary supervision task runs in parallel with the original text prediction output and takes the feature vectors output by the last layer of the Transformer decoder as input.
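Under this definition, the per-word progress targets can be generated from the sub-instruction word counts; the two-segment example is illustrative.

```python
def progress_targets(sub_instruction_word_counts):
    # A word in the j-th of k sub-instructions gets target j/k
    k = len(sub_instruction_word_counts)
    targets = []
    for j, n_words in enumerate(sub_instruction_word_counts, start=1):
        targets.extend([j / k] * n_words)
    return targets

# Two sub-instructions of 3 and 2 words
targets = progress_targets([3, 2])  # [0.5, 0.5, 0.5, 1.0, 1.0]
```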
The difference between the predicted and true values is calculated with the mean-square-error function:

L_2 = (1/L) Σ_{l=1}^{L} (Z_l - Ẑ_l)²

wherein L_2 is the difference between predicted and true values obtained with the mean-square-error function, Z_l is the predicted progress value the network outputs for the l-th word, and L is the total number of words.
The final loss function is defined as:

Loss = λL_1 + (1-λ)ωL_2

wherein λ controls the relative weight of the two loss functions and ω unifies their magnitudes.
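The joint loss combines the two terms as a weighted sum; the concrete λ and ω values below are illustrative, not taken from the patent.

```python
def total_loss(l1, l2, lam=0.5, omega=0.1):
    # Loss = lambda*L1 + (1 - lambda)*omega*L2
    # lam balances the two terms; omega rescales L2 to L1's magnitude
    return lam * l1 + (1.0 - lam) * omega * l2

loss = total_loss(2.0, 4.0)  # 0.5*2.0 + 0.5*0.1*4.0 = 1.2
```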
In conclusion, the invention provides a natural language instruction generation method for guiding robot navigation behavior in the visual-language navigation task. The method uses a Transformer as the sequence-to-sequence text generation framework, introduces an additional generation-progress auxiliary supervision task, designs a joint training loss function, and achieves end-to-end learning and prediction. It effectively generates natural language instructions for robot navigation paths, thereby improving the robot's visual-language navigation ability without introducing additional manual annotation, and offers rich semantic information in the generated language, strong model generalization and fast training.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.
Claims (10)
1. A method for generating an indoor environment robot navigation natural language instruction is characterized by comprising the following steps:
s1, extracting image feature vectors of the panoramic image collected by the robot camera by using a deep convolutional neural network;
s2, acquiring the current offset angle of the robot, expanding the data dimension through triangular transformation, and splicing the data dimension with the image feature vector to form a corresponding action feature vector and a panoramic image feature vector;
s3, aligning the motion characteristic vector and the panoramic image characteristic vector by adopting multi-head attention and performing dimensionality reduction calculation to enable the robot to focus on more important visual contents in the environment;
s4, coding the visual and motion information of the robot by adopting a sequence-to-sequence Transformer frame, performing cross-modal attention fusion with a language embedded code with a mask at a decoder end, and outputting a predicted language result;
s5, adding an additional auxiliary supervision task at the output part of the decoder, assisting the robot to learn the corresponding relation between the output sentence and the input action, and improving the expression of the network model to the input and output relation.
2. The method for generating an indoor environment robot navigation natural language instruction according to claim 1, wherein in step S1, the deep convolutional neural network adopts a ResNet-152 pre-trained on ImageNet, and the output of the last layer before the classification head, obtained by forward inference after the image is input into the ResNet-152 network, is used as the image feature vector.
3. The method as claimed in claim 2, wherein the panoramic image collected by the robot camera includes 36 sub-images: 12 observation images at 30-degree intervals at each of three view angles (looking down, level, and looking up), each observation image corresponding to one image feature vector.
4. The method for generating an indoor environment robot navigation natural language instruction according to claim 1, wherein in step S2, the offset angle of the robot includes an action offset angle and a view-angle offset angle, the action offset angle is the offset between the robot's current position and its position at the previous time step, the view-angle offset angle is the offset relative to the center of each sub-image of the panoramic image observed by the robot, and the offset angle is expressed as:

γ = (sin θ, cos θ, sin φ, cos φ)

wherein γ is the offset angle feature, θ is the heading offset angle, and φ is the pitch offset angle.
5. The method according to claim 4, wherein in step S2, the action feature vector A is formed by splicing the image feature vector of the sub-image directly in front of the robot in the panoramic image with the dimension-expanded action offset angle vector, and the panoramic image feature vector E is formed by splicing the image feature vectors of all sub-images in the panoramic image with the dimension-expanded view-angle offset angle vectors.
6. The method for generating an indoor environment robot navigation natural language instruction according to claim 1, wherein in step S3, the output X after multi-head attention alignment and dimension-reduction calculation is expressed as:

X = softmax(QK^T / sqrt(d_k)) V

Q = AW_Q, K = EW_K, V = EW_V

wherein Q, K and V denote the query matrix, key matrix and value matrix of the attention mechanism after linear transformation, W_Q, W_K and W_V are the learnable weights that linearly transform the action feature vector A and the panoramic image feature vector E, and d_k is the dimension of K.
7. The method for generating natural language instructions for indoor environment robot navigation according to claim 1, wherein in step S4, when performing prediction with the Transformer, position encoding is added to reflect the different influence of each input position on the output in time order, and the output X from the multi-head attention alignment and dimension-reduction calculation is position-encoded as:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

wherein PE(pos, 2i) and PE(pos, 2i+1) are the position-encoding values in dimensions 2i and 2i+1 of the embedding, pos is the actual position of the element in the input sequence, and d_model is the dimension of the embedding.
8. The method as claimed in claim 7, wherein the Transformer comprises an encoder and a decoder, the encoder comprises stacked multi-head self-attention modules, feed-forward networks and residual connections, and each decoder module comprises cross-attention, self-attention, a feed-forward network and residual connections.
9. The method as claimed in claim 8, wherein in the decoder the ground-truth text is shifted one position to the right and diagonally masked so that the text input depends only on previous predictions, and the text is then converted into an embedded representation by a linear layer and cross-attended with the encoder output.
10. The method according to claim 8, wherein in step S5, after the auxiliary supervision task is added to the decoder output, the final loss function Loss is expressed as:

Loss = λL_1 + (1-λ)ωL_2

L_1 = -Σ_{p=1}^{P} log f_θ(ŷ_p | ŷ_1, …, ŷ_{p-1})

L_2 = (1/L) Σ_{l=1}^{L} (Z_l - Ẑ_l)²

Ẑ_l = P(I'_j)/k

wherein L_1 is the difference between predicted and true values obtained with the cross-entropy loss function, L_2 is the difference between predicted and true values obtained with the mean-square-error function, θ are the network parameters, f_θ(·) is the network's predicted probability, ŷ_p is the p-th true value of the output instruction, ŷ_1, …, ŷ_{p-1} are the 1st to (p-1)-th true values of the output instruction, Z_l is the predicted progress value the network outputs for the l-th word, L is the total number of words, P(I'_j) is the segment index of the sub-instruction I'_j containing the current word, k is the total number of sub-instructions, λ controls the relative weight of the two loss functions, ω unifies the magnitudes of the two loss functions, and Ẑ_l is the true progress value corresponding to the l-th word.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210224196.7A CN114812551B (en) | 2022-03-09 | 2022-03-09 | Indoor environment robot navigation natural language instruction generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114812551A true CN114812551A (en) | 2022-07-29 |
CN114812551B CN114812551B (en) | 2024-07-26 |
Family
ID=82529629
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115795278A (en) * | 2022-12-02 | 2023-03-14 | 广东元一科技实业有限公司 | Intelligent cloth paving machine control method and device and electronic equipment |
CN118015162A (en) * | 2024-04-10 | 2024-05-10 | 哈尔滨工业大学(威海) | Three-dimensional digital human head animation generation method based on phonetic prosody decomposition |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111814844A (en) * | 2020-03-17 | 2020-10-23 | 同济大学 | Dense video description method based on positional-encoding fusion |
US20210374358A1 (en) * | 2020-05-31 | 2021-12-02 | Salesforce.Com, Inc. | Systems and methods for composed variational natural language generation |
US20220059200A1 (en) * | 2020-08-21 | 2022-02-24 | Washington University | Deep-learning systems and methods for medical report generation and anomaly detection |
CN112200244A (en) * | 2020-10-09 | 2021-01-08 | 西安交通大学 | Intelligent aero-engine anomaly detection method based on hierarchical adversarial training |
CN112560438A (en) * | 2020-11-27 | 2021-03-26 | 同济大学 | Text generation method based on generative adversarial networks |
CN113268561A (en) * | 2021-04-25 | 2021-08-17 | 中国科学技术大学 | Question generation method based on multi-task joint training |
CN113537024A (en) * | 2021-07-08 | 2021-10-22 | 天津理工大学 | Weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism |
CN114091466A (en) * | 2021-10-13 | 2022-02-25 | 山东师范大学 | Multi-modal emotion analysis method and system based on Transformer and multi-task learning |
CN113988274A (en) * | 2021-11-11 | 2022-01-28 | 电子科技大学 | Intelligent text generation method based on deep learning |
CN114092774A (en) * | 2021-11-22 | 2022-02-25 | 沈阳工业大学 | RGB-T image saliency detection system and detection method based on information flow fusion |
Non-Patent Citations (3)
Title |
---|
MOTONARI KAMBARA et al.: "Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions", IEEE Robotics and Automation Letters, 24 August 2021 (2021-08-24), pages 8371, XP011876804, DOI: 10.1109/LRA.2021.3107026 * |
ZHUANG SHUNAN: "Research and Implementation of Text Normalization Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology, no. 8, 15 August 2020 (2020-08-15), pages 138-817 * |
LI XUEQING et al.: "A Survey of Natural Language Generation", Journal of Computer Applications, no. 5, 31 May 2021 (2021-05-31), pages 1227-1235 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115795278A (en) * | 2022-12-02 | 2023-03-14 | 广东元一科技实业有限公司 | Intelligent cloth paving machine control method and device and electronic equipment |
CN115795278B (en) * | 2022-12-02 | 2023-08-04 | 广东元一科技实业有限公司 | Intelligent cloth paving machine control method and device and electronic equipment |
CN118015162A (en) * | 2024-04-10 | 2024-05-10 | 哈尔滨工业大学(威海) | Three-dimensional digital human head animation generation method based on phonetic prosody decomposition |
Also Published As
Publication number | Publication date |
---|---|
CN114812551B (en) | 2024-07-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114812551B (en) | Indoor environment robot navigation natural language instruction generation method | |
CN111339281B (en) | Answer selection method for reading comprehension choice questions with multi-view fusion | |
CN109145974B (en) | Multilevel image feature fusion method based on image-text matching | |
CN111967272B (en) | Visual dialogue generating system based on semantic alignment | |
CN111967277A (en) | Translation method based on multi-modal machine translation model | |
CN113065496B (en) | Neural network machine translation model training method, machine translation method and device | |
CN117273150A (en) | Visual large language model method based on few sample learning | |
Cui et al. | Representation and correlation enhanced encoder-decoder framework for scene text recognition | |
CN117292146A (en) | Industrial scene-oriented method, system and application method for constructing multi-mode large language model | |
Park et al. | Vlaad: Vision and language assistant for autonomous driving | |
Tanaka et al. | Cross-modal transformer-based neural correction models for automatic speech recognition | |
Cui et al. | An end-to-end network for irregular printed Mongolian recognition | |
CN113010662B (en) | Hierarchical conversational machine reading understanding system and method | |
Yuan et al. | VRDriving: A virtual-to-real autonomous driving framework based on adversarial learning | |
CN110197521B (en) | Visual text embedding method based on semantic structure representation | |
Huang et al. | Knowledge distilled pre-training model for vision-language-navigation | |
CN117216536A (en) | Model training method, device and equipment and storage medium | |
CN115759262A (en) | Visual common sense reasoning method and system based on knowledge perception attention network | |
Zhang et al. | Video-Language Graph Convolutional Network for Human Action Recognition | |
Li et al. | LabanFormer: Multi-scale graph attention network and transformer with gated recurrent positional encoding for labanotation generation | |
Chen et al. | A novel detection method based on DETR for drone aerial images | |
Chen et al. | LCVO: An Efficient Pretraining-Free Framework for Visual Question Answering Grounding | |
CN117710688B (en) | Target tracking method and system based on convolution and attention combination feature extraction | |
Li et al. | Limited receptive field network for real-time driving scene semantic segmentation | |
Wu et al. | Prospective Role of Foundation Models in Advancing Autonomous Vehicles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||