CN113420606B - Method for realizing autonomous navigation of robot based on natural language and machine vision - Google Patents

Method for realizing autonomous navigation of robot based on natural language and machine vision

Info

Publication number
CN113420606B
CN113420606B CN202110597437.8A CN202110597437A
Authority
CN
China
Prior art keywords
features
robot
feature
vector
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110597437.8A
Other languages
Chinese (zh)
Other versions
CN113420606A (en)
Inventor
董敏
聂宏蓄
毕盛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority claimed from CN202110597437.8A
Publication of CN113420606A
Application granted
Publication of CN113420606B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for realizing autonomous navigation of a robot based on natural language and machine vision, which comprises the following steps: 1) the robot starts from an initial position and acquires language information and visual information at each round of dialogue, i.e. at each moment; 2) feature extraction is performed on the language information through an attention mechanism to obtain semantic features; feature extraction is performed on the visual information through Resnet152 to obtain low-order visual features and image classification features, and through Faster R-CNN and U-net to obtain target detection features and semantic segmentation features; 3) the low-order visual features, image classification features, target detection features, semantic segmentation features and the semantic features at the current moment and the previous moment are fused through an attention mechanism to obtain fusion features; 4) the fusion features are input into a softmax classifier to predict the moving direction at the current moment. The invention uses the visual information and language information of the environment where the robot is located to perform autonomous navigation without acquiring an accurate measurement map in advance.

Description

Method for realizing autonomous navigation of robot based on natural language and machine vision
Technical Field
The invention relates to the technical field of natural language processing, image processing and autonomous navigation, in particular to a method for realizing indoor autonomous navigation of a mobile robot based on natural language and computer vision.
Background
In recent years, autonomous navigation of robots has been applied more and more widely in production and life, and more and more application scenarios require accurate and efficient autonomous navigation technology. In the conventional autonomous navigation method, the environment must first be scanned to obtain an accurate measurement map, and autonomous navigation is then performed according to this map. Acquiring an accurate measurement map requires a large amount of manpower and material resources, and an autonomous navigation method based on an accurate measurement map is difficult to migrate to an unknown environment. Therefore, research on autonomous navigation methods based on natural language and computer vision is of great significance.
At present, a method based on an accurate measurement map is mainly adopted in the aspect of robot autonomous navigation research, but the following problems are also faced:
(1) the acquisition of the accurate measurement map requires a large amount of resources and time to scan the environment in advance, and the cost for acquiring the accurate measurement map is high.
(2) In some complex scenes which are difficult to observe, the difficulty and the expense for obtaining the accurate measurement map are higher, and the navigation method based on the accurate measurement map may not be implemented.
(3) The navigation effect depends on the accuracy of the metric map, and in some situations where it is difficult to obtain an accurate metric map, the navigation effect becomes poor.
(4) The autonomous navigation method based on the accurate measurement map navigates on the basis of the metric information of the environment and does not utilize semantic information or visual information, so it is difficult to migrate to unknown environments.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for realizing indoor autonomous navigation of a mobile robot based on natural language and machine vision, which can carry out autonomous navigation of the robot under the condition of not acquiring an accurate measurement map in advance by utilizing visual information of the environment where the robot is located and natural language dialogue records.
In order to achieve the purpose, the technical scheme provided by the invention is as follows: a method for realizing autonomous navigation of a robot based on natural language and machine vision comprises the following steps:
1) the robot, starting from an initial position, acquires language information and visual information at each round of dialogue, i.e. at each moment; the language information comprises an instruction indicating the target position of the robot and a dialogue record describing the environment where the robot is located; the dialogue record comprises the dialogue generated at the current position, i.e. at the current moment, and the set of all previous dialogues; and the visual information comprises panoramic image information of the current position of the robot;
2) performing feature extraction on the language information through an attention mechanism to obtain semantic features; performing feature extraction on the visual information through Resnet152 to obtain low-order visual features and image classification features; performing feature extraction on the visual information through Faster R-CNN and U-net respectively to obtain target detection features and semantic segmentation features;
3) fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features at the current moment and the previous moment through an attention mechanism to obtain fusion features;
4) inputting the fusion features into a softmax classifier to predict the moving direction of the robot at the current moment, wherein at each moment, the robot predicts the moving direction through the fusion features, and finally, when the prediction result is stop, the robot reaches the target position.
In step 1), the dialogue record of the environment where the robot is located refers to the communication record generated when two human users navigate in the environment where the robot is located, wherein one human user grasps the topological information of the whole indoor environment and instructs the robot to walk through question-and-answer communication with the other human user. Each dialogue record is denoted H_t = {D_1, D_2, ..., D_i, ..., D_{t-1}}, where H_t represents the dialogue record at the t-th round of dialogue and D_i represents the i-th round of dialogue. The panoramic image corresponding to the visual information of the environment where the robot is located is denoted C; the panoramic image is divided into 12 sub-images, one for each of 12 directions, C = {c_1, c_2, ..., c_i, ..., c_12}, where c_i represents the i-th sub-image.
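As an illustrative, non-limiting example, the observation available to the robot at each round can be represented as follows (a minimal Python sketch; the class and field names are assumptions introduced here for illustration, not part of the invention):

```python
# A small sketch of the data described in step 1): the dialogue record H_t is the list
# of all dialogues up to round t, and the panorama C is split into 12 sub-images,
# one per candidate direction. Field names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Observation:
    dialogue_history: List[str] = field(default_factory=list)  # H_t = {D_1, ..., D_t}
    sub_images: List[Any] = field(default_factory=list)        # C = {c_1, ..., c_12}

obs = Observation(
    dialogue_history=["Q: which way should I go?", "A: head towards the kitchen door."],
    sub_images=[f"c_{i}" for i in range(1, 13)],
)
```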
In step 2), feature extraction is performed on the language information through an attention mechanism to obtain semantic features, which comprises the following steps:
2.1) the dialogue record H containing t rounds of dialogue and each round of dialogue D containing L words are described as:
H = {D_1, D_2, ..., D_i, ..., D_t}
D_i = {x_1, x_2, ..., x_i, ..., x_L}
where D_i represents the i-th round of dialogue and x_i represents the i-th word in a round of dialogue;
2.2) the dialogue record is vectorized through an embedding layer, and the corresponding vectorization result E is described as:
E = {G_1, G_2, ..., G_i, ..., G_t}
G_i = {g_1, g_2, ..., g_i, ..., g_L}
where G_i represents the embedding vector of the i-th round of dialogue, with t rounds of dialogue in total, and g_i represents the embedding vector of the i-th word in a round of dialogue, with L words in total;
2.3) the embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors, and the process is described as:
{h_{i,1}, h_{i,2}, ..., h_{i,L}} = LSTM({w_{i,1}, w_{i,2}, ..., w_{i,j}, ..., w_{i,L}})
d_i = h_{i,L}
M_{t-1} = {d_1, d_2, ..., d_{t-1}}
where w_{i,j} represents the embedding vector of the j-th word in the i-th round of dialogue, h_{i,L} represents the state vector at the last moment of the LSTM network and is denoted d_i, and M_{t-1} is the feature matrix formed by the first t-1 feature vectors of the dialogue record;
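As an illustrative, non-limiting example, the embedding and LSTM encoding of steps 2.2 and 2.3 can be sketched in PyTorch as follows (vocabulary size, embedding size and hidden size are assumed values, not those of the invention):

```python
# A minimal sketch of steps 2.2-2.3: each round of dialogue is embedded word by word
# and encoded with an LSTM; the last hidden state serves as that round's feature d_i.
import torch
import torch.nn as nn

class DialogueEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # embedding layer of step 2.2
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, L) word indices of one round of dialogue
        g = self.embedding(token_ids)          # (batch, L, embed_dim) embedding vectors
        outputs, (h_n, c_n) = self.lstm(g)     # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)                  # d_i: state vector at the last time step

# Usage: encode the first t-1 rounds into the history feature matrix M_{t-1}
encoder = DialogueEncoder()
rounds = [torch.randint(0, 10000, (1, 12)) for _ in range(4)]   # 4 earlier rounds, 12 tokens each
M_prev = torch.cat([encoder(r) for r in rounds], dim=0)         # (t-1, hidden_dim)
```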
2.4) the feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism, and the fusion process is described as:
A(d_t, d_i) = softmax_i(((W_Q d_t)^T (W_K d_i)) / √c) · (W_V d_i)
a_t = Σ_{i=1}^{t-1} A(d_t, d_i)
e_t = concat(a_t, d_t)
where d_t and d_i respectively represent the state vectors h_{t,L} and h_{i,L}, A(d_t, d_i) represents the attention of the vector d_t to d_i, W_Q, W_K and W_V represent parameters of the model, c represents the dimension of the vectors d_t and d_i, and softmax_i denotes the softmax function computed over the t-1 history vectors; concat represents the merging of vectors; a_t is the result of weighting and combining all the d_i according to the attention values, and e_t is the semantic feature corresponding to the history of the t-th round of dialogue, obtained by merging a_t and d_t;
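As an illustrative, non-limiting example, the attention fusion of step 2.4 can be sketched in PyTorch as a standard scaled dot-product attention (an assumed formulation; the exact equations used by the invention may differ, and the dimension is an assumed value):

```python
# A minimal sketch of step 2.4: the current round's feature d_t attends over the history
# features d_1..d_{t-1}; the weighted result is concatenated with d_t to give e_t.
import math
import torch
import torch.nn as nn

class HistoryAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.W_Q = nn.Linear(dim, dim, bias=False)   # model parameters W_Q, W_K, W_V
        self.W_K = nn.Linear(dim, dim, bias=False)
        self.W_V = nn.Linear(dim, dim, bias=False)
        self.dim = dim

    def forward(self, d_t, history):
        # d_t: (1, dim) current-round feature; history: (t-1, dim) matrix M_{t-1}
        q, k, v = self.W_Q(d_t), self.W_K(history), self.W_V(history)
        scores = q @ k.t() / math.sqrt(self.dim)       # attention of d_t to each d_i
        weights = torch.softmax(scores, dim=-1)        # normalized over the t-1 rounds
        a_t = weights @ v                              # weighted combination of the history
        return torch.cat([a_t, d_t], dim=-1)           # semantic feature e_t

sem = HistoryAttention()(torch.randn(1, 512), torch.randn(4, 512))   # (1, 1024)
```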
Performing feature extraction on the visual information through Resnet152 to obtain low-order visual features and image classification features, and performing feature extraction on the visual information through Faster R-CNN and U-net respectively to obtain target detection features and semantic segmentation features: in each round of dialogue the robot arrives at a new position and acquires the panoramic view at that position; the panoramic view corresponding to the t-th round of dialogue is denoted P_t. P_t is passed through the neural network model Resnet152 for feature extraction; the obtained feature result is taken as the low-order visual feature, denoted V_t, and the obtained image classification result is taken as the image classification feature, denoted C_t. P_t is input into the Faster R-CNN network, and the obtained target detection result is taken as the target detection feature, denoted O_t. P_t is input into the U-net network, and the obtained semantic segmentation result is taken as the semantic segmentation feature, denoted S_t.
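As an illustrative, non-limiting example, the visual branch of step 2) can be sketched with torchvision models as follows (the pretrained-weight identifiers and the use of fasterrcnn_resnet50_fpn as a stand-in for the Faster R-CNN detector are assumptions; the U-net semantic-segmentation branch is assumed to come from a separate model and is omitted):

```python
# A minimal sketch of the visual feature extraction: ResNet152 supplies the low-order
# visual features (penultimate layer) and image classification features (final logits)
# for each of the 12 sub-images; a Faster R-CNN detector supplies detection results.
import torch
import torchvision

resnet = torchvision.models.resnet152(weights="IMAGENET1K_V1").eval()
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])                 # up to global pooling
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def extract_visual_features(sub_images):
    # sub_images: (12, 3, H, W) tensor, one image per candidate direction
    with torch.no_grad():
        v_t = backbone(sub_images).flatten(1)   # low-order visual features V_t: (12, 2048)
        c_t = resnet(sub_images)                # image classification features C_t: (12, 1000)
        o_t = detector(list(sub_images))        # detection results O_t: boxes, labels, scores
    return v_t, c_t, o_t

v_t, c_t, o_t = extract_visual_features(torch.rand(12, 3, 224, 224))
```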
In step 3), the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features at the current moment and the previous moment are fused through an attention mechanism to obtain fusion features, which comprises the following steps:
3.1) the low-order visual features, the image classification features, the target detection features and the semantic segmentation features are fused with the fusion feature corresponding to the (t-1)-th round of dialogue, i.e. time t-1: each feature vector is mapped by the nonlinear function f_v, the fusion feature x_{t-1} obtained at time t-1 is mapped by the nonlinear function f_vlm, and the two mapped results are combined, with a scaling determined by l, to obtain the fused feature vectors; here v_{t,i}, c_{t,i}, o_{t,i} and s_{t,i} respectively represent the low-order visual feature vector, the image classification feature vector, the target detection feature vector and the semantic segmentation feature vector of the i-th sub-image in the t-th round of dialogue, i.e. the vectors of the low-order visual feature matrix V_t, the image classification feature matrix C_t, the target detection feature matrix O_t and the semantic segmentation feature matrix S_t; x_{t-1} denotes the fusion feature obtained at time t-1, f_v and f_vlm represent nonlinear mapping functions, and l represents the vector dimension of x_{t-1}; the fused low-order visual feature vector, image classification feature vector, target detection feature vector and semantic segmentation feature vector are denoted v'_{t,i}, c'_{t,i}, o'_{t,i} and s'_{t,i}, and form the fused feature matrices V'_t, C'_t, O'_t and S'_t;
3.2) the fused low-order visual features, image classification features, target detection features, semantic segmentation features and the semantic features are further fused through an attention mechanism: the semantic feature e_t of the t-th round of dialogue is multiplied by a parameter matrix so as to be mapped into the space of the fused feature vectors, the attention weights of e_t over the 12 fused sub-image vectors of each modality are computed through the softmax function with a scaling determined by h, and the attention-weighted combinations give the fused feature of each modality; here V'_t, C'_t, O'_t and S'_t respectively represent the fused low-order visual feature matrix, image classification feature matrix, target detection feature matrix and semantic segmentation feature matrix of the t-th round of dialogue; e_t represents the semantic feature of the t-th round of dialogue and h represents its dimension; softmax represents the softmax function; the low-order visual feature, image classification feature, target detection feature and semantic segmentation feature after attention-mechanism fusion are denoted z_t^v, z_t^c, z_t^o and z_t^s respectively;
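As an illustrative, non-limiting example, the attention pooling of step 3.2 for a single modality can be sketched as follows (the scaled dot-product form and the dimensions are assumptions):

```python
# A minimal sketch of step 3.2: the semantic feature e_t is mapped to a query and
# attends over the 12 fused sub-image vectors of one modality, yielding one
# attention-pooled feature for that modality.
import math
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    def __init__(self, sem_dim=1024, feat_dim=512):
        super().__init__()
        self.query = nn.Linear(sem_dim, feat_dim, bias=False)  # maps e_t into the feature space

    def forward(self, e_t, feats):
        # e_t: (1, sem_dim); feats: (12, feat_dim) fused vectors of one modality
        q = self.query(e_t)                                                    # (1, feat_dim)
        w = torch.softmax(feats @ q.t() / math.sqrt(feats.size(1)), dim=0)     # (12, 1) weights
        return (w * feats).sum(dim=0, keepdim=True)                            # pooled feature (1, feat_dim)

pooled_v = ModalityAttention()(torch.randn(1, 1024), torch.randn(12, 512))
```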
3.3) the fused features are further processed through an LSTM network and finally merged into the final coding feature, and the process is as follows:
r_t^v = LSTM(z_t^v)
r_t^c = LSTM(z_t^c)
r_t^o = LSTM(z_t^o)
r_t^s = LSTM(z_t^s)
x_t = concat(r_t^v, r_t^c, r_t^o, r_t^s)
where r_t^v, r_t^c, r_t^o and r_t^s respectively represent the low-order visual features, image classification features, target detection features and semantic segmentation features processed by the LSTM network; concat represents the merging of vectors; and x_t represents the fusion feature corresponding to the t-th round of dialogue, i.e. the final coding feature.
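As an illustrative, non-limiting example, the per-modality LSTM processing and merging of step 3.3 can be sketched as follows (one LSTM per modality and the dimensions are assumptions):

```python
# A minimal sketch of step 3.3: each attention-fused modality feature is passed through
# its own LSTM (whose state can be carried across dialogue rounds) and the four outputs
# are concatenated into the final coding feature x_t.
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.lstms = nn.ModuleDict({
            name: nn.LSTM(feat_dim, hidden_dim, batch_first=True)
            for name in ("visual", "classification", "detection", "segmentation")
        })

    def forward(self, feats, states=None):
        # feats: dict of (1, feat_dim) attention-fused features, one per modality
        states = states or {}
        outs, new_states = [], {}
        for name, x in feats.items():
            out, st = self.lstms[name](x.unsqueeze(1), states.get(name))  # one time step
            outs.append(out.squeeze(1))
            new_states[name] = st
        return torch.cat(outs, dim=-1), new_states     # final coding feature x_t and LSTM states

fuse = FusionEncoder()
x_t, states = fuse({k: torch.randn(1, 512) for k in ("visual", "classification", "detection", "segmentation")})
```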
In step 4), the fusion feature is input into a softmax classifier for moving-direction prediction, which comprises the following steps:
4.1) the final coding feature x_t is mapped with an activation function, as follows:
m_t = σ(f_m(x_t))
where σ is the sigmoid activation function, f_m is a nonlinear mapping function, and m_t is the activation result;
4.2) the final result is calculated from the activation result through the softmax function, as follows:
y_t = softmax(f_a(m_t))
where softmax represents the softmax function and f_a is a nonlinear mapping function.
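As an illustrative, non-limiting example, the classifier of step 4) can be sketched as follows (the 13-way output, i.e. 12 directions plus a stop action, and the layer sizes are assumptions):

```python
# A minimal sketch of step 4: the final coding feature is passed through a sigmoid-
# activated layer (standing in for f_m) and then through a second layer (standing in
# for f_a) followed by softmax, giving a distribution over candidate actions.
import torch
import torch.nn as nn

class DirectionClassifier(nn.Module):
    def __init__(self, in_dim=2048, hidden_dim=512, num_actions=13):
        super().__init__()
        self.f_m = nn.Linear(in_dim, hidden_dim)       # mapping f_m (sigmoid supplies the nonlinearity)
        self.f_a = nn.Linear(hidden_dim, num_actions)  # mapping f_a before the softmax

    def forward(self, x_t):
        m_t = torch.sigmoid(self.f_m(x_t))             # activation result of step 4.1
        return torch.softmax(self.f_a(m_t), dim=-1)    # direction probabilities of step 4.2

probs = DirectionClassifier()(torch.randn(1, 2048))
action = probs.argmax(dim=-1)   # index of the predicted moving direction (or stop)
```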
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention performs robot autonomous navigation using visual information and natural language, which saves the overhead of obtaining an accurate measurement map and can adapt to complex environments.
2. The invention combines natural language instructions with machine vision, so that robot autonomous navigation can be carried out more conveniently and efficiently.
3. By combining the features of two different modalities of information, the invention improves navigation efficiency and saves overhead while ensuring the navigation effect.
Drawings
FIG. 1 is a flow chart illustrating autonomous navigation according to the present invention.
FIG. 2 is a schematic diagram of a model architecture construction process for feature extraction and navigation instruction prediction based on attention mechanism.
In Fig. 2, the dialogue history represents the questions of the robot and the answer records of the human user; the current-time dialogue represents the questions of the robot and the answers of the human user in the current round of dialogue; Encoding represents encoding the dialogue information and converting it into encoding vectors; from the machine vision image, the Resnet152, Faster R-CNN and U-net models extract the low-order visual features, the image classification features, the target detection features and the semantic segmentation features; Attention denotes the attention module, which is used for feature extraction of the semantic information (yielding the semantic feature e_t), feature extraction of the visual information, and fusion of the semantic features and visual features with the fusion feature x_{t-1} obtained in the (t-1)-th round of dialogue, i.e. at time t-1, yielding the low-order visual features, image classification features, target detection features and semantic segmentation features fused with the semantic features; the fused features are input into the softmax module and the final result is calculated.
Fig. 3 is a schematic view of the principle of the attention mechanism, in which d_t and d_i respectively represent the feature vectors used to calculate attention; W_Q, W_K and W_V are the parameters used to map d_t and d_i to the same dimension; Matmul stands for matrix multiplication; the calculation result is normalized by a softmax module to obtain the attention result A(d_t, d_i); the results of all attention modules are merged and then combined with d_t to obtain the final result e_t.
Detailed Description
The present invention will be further described with reference to the following specific examples and drawings, but the embodiments of the present invention are not limited thereto.
As shown in fig. 1 to 3, the method for implementing autonomous navigation of a robot based on natural language and machine vision provided by the present embodiment includes the following steps:
1) the robot, starting from an initial position, acquires language information and visual information at each round of dialogue (each moment); the language information comprises an instruction indicating the target position of the robot and a dialogue record describing the environment where the robot is located; the dialogue record comprises the dialogue generated at the current position (i.e. the current moment) and the set of all previous dialogues, and the visual information comprises panoramic image information of the current position of the robot. The dialogue record of the environment where the robot is located refers to the communication record generated when two human users navigate in the environment where the robot is located, wherein one human user grasps the topological information of the whole indoor environment and instructs the robot to walk through question-and-answer communication with the other human user. Each dialogue record is denoted H_t = {D_1, D_2, ..., D_i, ..., D_{t-1}}, where H_t represents the dialogue record at the t-th round of dialogue and D_i represents the i-th round of dialogue. The panoramic image corresponding to the visual information of the environment where the robot is located is denoted C; it is divided into 12 sub-images, one for each of 12 directions, C = {c_1, c_2, ..., c_i, ..., c_12}, where c_i represents the i-th sub-image, as shown by the images corresponding to the alternative action directions in Fig. 2.
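As an illustrative, non-limiting example, the overall navigation loop of step 1) can be sketched as follows (the robot and model objects are dummy stand-ins introduced here; the real system would use the feature extraction and fusion described in the following steps):

```python
# A self-contained sketch of the control flow: observe the panorama, converse, predict
# a direction from the fused features, move, and stop when "stop" is predicted.
import random

N_DIRECTIONS = 12
STOP = N_DIRECTIONS            # assumed index of the stop action (after the 12 directions)

class DummyRobot:
    def observe_panorama(self):          # stands in for the camera returning 12 sub-images
        return [f"sub_image_{i}" for i in range(N_DIRECTIONS)]
    def ask_oracle(self):                # stands in for one question/answer round
        return "oracle answer"
    def move_towards(self, direction):   # stands in for issuing a motion command
        pass

class DummyModel:
    def predict(self, dialogue_history, sub_images):   # stands in for the fused-feature classifier
        return random.randrange(N_DIRECTIONS + 1)

def navigate(robot, model, instruction, max_rounds=20):
    dialogue_history = [instruction]                    # H_t grows by one dialogue per round
    for _ in range(max_rounds):
        sub_images = robot.observe_panorama()           # panoramic view split into 12 directions
        dialogue_history.append(robot.ask_oracle())     # dialogue of the current round
        action = model.predict(dialogue_history, sub_images)
        if action == STOP:                              # prediction "stop" => target reached
            break
        robot.move_towards(action)

navigate(DummyRobot(), DummyModel(), "go to the kitchen")
```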
2) The method for extracting the features of the language information through the attention mechanism to obtain the semantic features comprises the following steps:
2.1) the dialogue record H containing t rounds of dialogue and each round of dialogue D containing L words are described as:
H = {D_1, D_2, ..., D_i, ..., D_t}
D_i = {x_1, x_2, ..., x_i, ..., x_L}
where D_i represents the i-th round of dialogue and x_i represents the i-th word in a round of dialogue;
2.2) the dialogue record is vectorized through an embedding layer, and the corresponding vectorization result E is described as:
E = {G_1, G_2, ..., G_i, ..., G_t}
G_i = {g_1, g_2, ..., g_i, ..., g_L}
where G_i represents the embedding vector of the i-th round of dialogue, with t rounds of dialogue in total, and g_i represents the embedding vector of the i-th word in a round of dialogue, with L words in total;
2.3) the embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors, and the process is described as:
{h_{i,1}, h_{i,2}, ..., h_{i,L}} = LSTM({w_{i,1}, w_{i,2}, ..., w_{i,j}, ..., w_{i,L}})
d_i = h_{i,L}
M_{t-1} = {d_1, d_2, ..., d_{t-1}}
where w_{i,j} represents the embedding vector of the j-th word in the i-th round of dialogue, h_{i,L} represents the state vector at the last moment of the LSTM network and is denoted d_i, and M_{t-1} is the feature matrix formed by the first t-1 feature vectors of the dialogue record;
2.4) the feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism; the principle of the attention mechanism and the calculation process of the attention result are shown in Fig. 3, and the fusion process is described as:
A(d_t, d_i) = softmax_i(((W_Q d_t)^T (W_K d_i)) / √c) · (W_V d_i)
a_t = Σ_{i=1}^{t-1} A(d_t, d_i)
e_t = concat(a_t, d_t)
where d_t and d_i respectively represent the state vectors h_{t,L} and h_{i,L}, A(d_t, d_i) represents the attention of the vector d_t to d_i, W_Q, W_K and W_V represent parameters of the model, c represents the dimension of the vectors d_t and d_i, and softmax_i denotes the softmax function computed over the t-1 history vectors; concat represents the merging of vectors; a_t is the result of weighting and combining all the d_i according to the attention values, and e_t is the semantic feature corresponding to the history of the t-th round of dialogue, obtained by merging a_t and d_t;
as shown in fig. 2, feature extraction is performed on visual information through Resnet152 to obtain low-order visual features and image classification features, feature extraction is performed on visual information through false-RCNN and U-net to obtain target detection features and semantic segmentation features, specifically, in each round of conversation, a robot arrives at a new position, then a panoramic view at the position is obtained, and a corresponding panoramic view in t rounds of conversation is represented as Pt(ii) a Will PtFeature extraction is performed through a neural network model Resnet152, and the obtained feature result is taken as a low-order visual feature and is expressed as VtAnd taking the obtained image classification result as an image classification characteristic, which is expressed as Ct(ii) a Will PtInputting the target detection result into the master-RCNN network as the target detectionMeasured characteristic, denoted as Ot(ii) a Will PtInputting the semantic segmentation result into a U-net network, and taking the obtained semantic segmentation result as a semantic segmentation feature expressed as St
3) Fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features at the current moment and the previous moment through an attention mechanism to obtain fusion features, comprising the following steps:
3.1) the low-order visual features, the image classification features, the target detection features and the semantic segmentation features are fused with the fusion feature corresponding to the (t-1)-th round of dialogue, i.e. time t-1: each feature vector is mapped by the nonlinear function f_v, the fusion feature x_{t-1} obtained at time t-1 is mapped by the nonlinear function f_vlm, and the two mapped results are combined, with a scaling determined by l, to obtain the fused feature vectors; here v_{t,i}, c_{t,i}, o_{t,i} and s_{t,i} respectively represent the low-order visual feature vector, the image classification feature vector, the target detection feature vector and the semantic segmentation feature vector of the i-th sub-image in the t-th round of dialogue, i.e. the vectors of the low-order visual feature matrix V_t, the image classification feature matrix C_t, the target detection feature matrix O_t and the semantic segmentation feature matrix S_t; x_{t-1} denotes the fusion feature obtained at time t-1, f_v and f_vlm represent nonlinear mapping functions, and l represents the vector dimension of x_{t-1}; the fused low-order visual feature vector, image classification feature vector, target detection feature vector and semantic segmentation feature vector are denoted v'_{t,i}, c'_{t,i}, o'_{t,i} and s'_{t,i}, and form the fused feature matrices V'_t, C'_t, O'_t and S'_t;
3.2) the fused low-order visual features, image classification features, target detection features, semantic segmentation features and the semantic features are further fused through an attention mechanism: the semantic feature e_t of the t-th round of dialogue is multiplied by a parameter matrix so as to be mapped into the space of the fused feature vectors, the attention weights of e_t over the 12 fused sub-image vectors of each modality are computed through the softmax function with a scaling determined by h, and the attention-weighted combinations give the fused feature of each modality; here V'_t, C'_t, O'_t and S'_t respectively represent the fused low-order visual feature matrix, image classification feature matrix, target detection feature matrix and semantic segmentation feature matrix of the t-th round of dialogue; e_t represents the semantic feature of the t-th round of dialogue and h represents its dimension; softmax represents the softmax function; the low-order visual feature, image classification feature, target detection feature and semantic segmentation feature after attention-mechanism fusion are denoted z_t^v, z_t^c, z_t^o and z_t^s respectively;
3.3) the fused features are further processed through an LSTM network and finally merged into the final coding feature, and the process is as follows:
r_t^v = LSTM(z_t^v)
r_t^c = LSTM(z_t^c)
r_t^o = LSTM(z_t^o)
r_t^s = LSTM(z_t^s)
x_t = concat(r_t^v, r_t^c, r_t^o, r_t^s)
where r_t^v, r_t^c, r_t^o and r_t^s respectively represent the low-order visual features, image classification features, target detection features and semantic segmentation features processed by the LSTM network; concat represents the merging of vectors; and x_t represents the fusion feature corresponding to the t-th round of dialogue, i.e. the final coding feature.
4) The fusion feature is input into a softmax classifier to predict the moving direction, comprising the following steps:
4.1) the final coding feature x_t is mapped with an activation function, as follows:
m_t = σ(f_m(x_t))
where σ is the sigmoid activation function, f_m is a nonlinear mapping function, and m_t is the activation result;
4.2) the final result is calculated from the activation result through the softmax function, as follows:
y_t = softmax(f_a(m_t))
where softmax represents the softmax function and f_a is a nonlinear mapping function.
The above-described embodiments are only preferred embodiments of the present invention, and not intended to limit the scope of the present invention, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and they are included in the scope of the present invention.

Claims (4)

1. A method for realizing autonomous navigation of a robot based on natural language and machine vision is characterized by comprising the following steps:
1) the robot starts from an initial position and acquires language information and visual information at each round of dialogue, i.e. at each moment; the language information comprises an instruction indicating the target position of the robot and a dialogue record describing the environment where the robot is located; the dialogue record comprises the dialogue generated at the current position, i.e. at the current moment, and the set of all previous dialogues; and the visual information comprises panoramic image information of the current position of the robot;
2) performing feature extraction on the language information through an attention mechanism to obtain semantic features; performing feature extraction on the visual information through Resnet152 to obtain low-order visual features and image classification features; performing feature extraction on the visual information through Faster R-CNN and U-net respectively to obtain target detection features and semantic segmentation features;
3) fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features at the current moment and the previous moment through an attention mechanism to obtain fusion features, comprising the following steps:
3.1) the low-order visual features, the image classification features, the target detection features and the semantic segmentation features are fused with the fusion feature corresponding to the (t-1)-th round of dialogue, i.e. time t-1: each feature vector is mapped by the nonlinear function f_v, the fusion feature x_{t-1} obtained at time t-1 is mapped by the nonlinear function f_vlm, and the two mapped results are combined, with a scaling determined by l, to obtain the fused feature vectors; here v_{t,i}, c_{t,i}, o_{t,i} and s_{t,i} respectively represent the low-order visual feature vector, the image classification feature vector, the target detection feature vector and the semantic segmentation feature vector of the i-th sub-image in the t-th round of dialogue, i.e. the vectors of the low-order visual feature matrix V_t, the image classification feature matrix C_t, the target detection feature matrix O_t and the semantic segmentation feature matrix S_t; x_{t-1} denotes the fusion feature obtained at time t-1, f_v and f_vlm represent nonlinear mapping functions, and l represents the vector dimension of x_{t-1}; the fused low-order visual feature vector, image classification feature vector, target detection feature vector and semantic segmentation feature vector are denoted v'_{t,i}, c'_{t,i}, o'_{t,i} and s'_{t,i}, and form the fused feature matrices V'_t, C'_t, O'_t and S'_t;
3.2) the fused low-order visual features, image classification features, target detection features, semantic segmentation features and the semantic features are further fused through an attention mechanism: the semantic feature e_t of the t-th round of dialogue is multiplied by a parameter matrix so as to be mapped into the space of the fused feature vectors, the attention weights of e_t over the 12 fused sub-image vectors of each modality are computed through the softmax function with a scaling determined by h, and the attention-weighted combinations give the fused feature of each modality; here V'_t, C'_t, O'_t and S'_t respectively represent the fused low-order visual feature matrix, image classification feature matrix, target detection feature matrix and semantic segmentation feature matrix of the t-th round of dialogue; e_t represents the semantic feature of the t-th round of dialogue and h represents its dimension; softmax represents the softmax function; the low-order visual feature, image classification feature, target detection feature and semantic segmentation feature after attention-mechanism fusion are denoted z_t^v, z_t^c, z_t^o and z_t^s respectively;
3.3) the fused features are further processed through an LSTM network and finally merged into the final coding feature, and the process is as follows:
r_t^v = LSTM(z_t^v)
r_t^c = LSTM(z_t^c)
r_t^o = LSTM(z_t^o)
r_t^s = LSTM(z_t^s)
x_t = concat(r_t^v, r_t^c, r_t^o, r_t^s)
where r_t^v, r_t^c, r_t^o and r_t^s respectively represent the low-order visual features, image classification features, target detection features and semantic segmentation features processed by the LSTM network; concat represents the merging of vectors; and x_t represents the fusion feature corresponding to the t-th round of dialogue, i.e. the final coding feature;
4) inputting the fusion features into a softmax classifier to predict the moving direction of the robot at the current moment, wherein at each moment, the robot predicts the moving direction through the fusion features, and finally, when the prediction result is stop, the robot reaches the target position.
2. The method for realizing autonomous navigation of the robot based on the natural language and the machine vision according to claim 1, characterized in that: in step 1), the dialogue record of the environment where the robot is located refers to the communication record generated when two human users navigate in the environment where the robot is located, wherein one human user knows the topological information of the whole indoor environment and instructs the robot to walk through question-and-answer communication with the other human user; each dialogue record is denoted H_t = {D_1, D_2, ..., D_i, ..., D_{t-1}}, where H_t represents the dialogue record at the t-th round of dialogue and D_i represents the i-th round of dialogue; the panoramic image corresponding to the visual information of the environment where the robot is located is denoted C, and is divided into 12 sub-images, one for each of 12 directions, C = {c_1, c_2, ..., c_i, ..., c_12}, where c_i represents the i-th sub-image.
3. The method for realizing autonomous navigation of the robot based on the natural language and the machine vision according to claim 1, characterized in that: in step 2), feature extraction is performed on the language information through an attention mechanism to obtain semantic features, comprising the following steps:
2.1) the dialogue record H containing t rounds of dialogue and each round of dialogue D containing L words are described as:
H = {D_1, D_2, ..., D_i, ..., D_t}
D_i = {x_1, x_2, ..., x_i, ..., x_L}
where D_i represents the i-th round of dialogue and x_i represents the i-th word in a round of dialogue;
2.2) the dialogue record is vectorized through an embedding layer, and the corresponding vectorization result E is described as:
E = {G_1, G_2, ..., G_i, ..., G_t}
G_i = {g_1, g_2, ..., g_i, ..., g_L}
where G_i represents the embedding vector of the i-th round of dialogue, with t rounds of dialogue in total, and g_i represents the embedding vector of the i-th word in a round of dialogue, with L words in total;
2.3) the embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors, and the process is described as:
{h_{i,1}, h_{i,2}, ..., h_{i,L}} = LSTM({w_{i,1}, w_{i,2}, ..., w_{i,j}, ..., w_{i,L}})
d_i = h_{i,L}
M_{t-1} = {d_1, d_2, ..., d_{t-1}}
where w_{i,j} represents the embedding vector of the j-th word in the i-th round of dialogue, h_{i,L} represents the state vector at the last moment of the LSTM network and is denoted d_i, and M_{t-1} is the feature matrix formed by the first t-1 feature vectors of the dialogue record;
2.4) the feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism, and the fusion process is described as:
A(d_t, d_i) = softmax_i(((W_Q d_t)^T (W_K d_i)) / √c) · (W_V d_i)
a_t = Σ_{i=1}^{t-1} A(d_t, d_i)
e_t = concat(a_t, d_t)
where d_t and d_i respectively represent the state vectors h_{t,L} and h_{i,L}, A(d_t, d_i) represents the attention of the vector d_t to d_i, W_Q, W_K and W_V represent parameters of the model, c represents the dimension of the vectors d_t and d_i, and softmax_i denotes the softmax function computed over the t-1 history vectors; concat represents the merging of vectors; a_t is the result of weighting and combining all the d_i according to the attention values, and e_t is the semantic feature corresponding to the history of the t-th round of dialogue, obtained by merging a_t and d_t;
performing feature extraction on the visual information through Resnet152 to obtain low-order visual features and image classification features, and performing feature extraction on the visual information through Faster R-CNN and U-net respectively to obtain target detection features and semantic segmentation features: in each round of dialogue the robot arrives at a new position and acquires the panoramic view at that position, the panoramic view corresponding to the t-th round of dialogue being denoted P_t; P_t is passed through the neural network model Resnet152 for feature extraction, the obtained feature result being taken as the low-order visual feature, denoted V_t, and the obtained image classification result being taken as the image classification feature, denoted C_t; P_t is input into the Faster R-CNN network, and the obtained target detection result is taken as the target detection feature, denoted O_t; P_t is input into the U-net network, and the obtained semantic segmentation result is taken as the semantic segmentation feature, denoted S_t.
4. The method for realizing autonomous navigation of the robot based on the natural language and the machine vision according to claim 1, characterized in that: in step 4), the fusion feature is input into a softmax classifier for moving-direction prediction, comprising the following steps:
4.1) the final coding feature x_t is mapped with an activation function, as follows:
m_t = σ(f_m(x_t))
where σ is the sigmoid activation function, f_m is a nonlinear mapping function, and m_t is the activation result;
4.2) the final result is calculated from the activation result through the softmax function, as follows:
y_t = softmax(f_a(m_t))
where softmax represents the softmax function and f_a is a nonlinear mapping function.
CN202110597437.8A 2021-05-31 2021-05-31 Method for realizing autonomous navigation of robot based on natural language and machine vision Active CN113420606B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110597437.8A CN113420606B (en) 2021-05-31 2021-05-31 Method for realizing autonomous navigation of robot based on natural language and machine vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110597437.8A CN113420606B (en) 2021-05-31 2021-05-31 Method for realizing autonomous navigation of robot based on natural language and machine vision

Publications (2)

Publication Number Publication Date
CN113420606A CN113420606A (en) 2021-09-21
CN113420606B true CN113420606B (en) 2022-06-14

Family

ID=77713311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110597437.8A Active CN113420606B (en) 2021-05-31 2021-05-31 Method for realizing autonomous navigation of robot based on natural language and machine vision

Country Status (1)

Country Link
CN (1) CN113420606B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114029963B (en) * 2022-01-12 2022-03-29 北京具身智能科技有限公司 Robot operation method based on visual and auditory fusion
CN115082915B (en) * 2022-05-27 2024-03-29 华南理工大学 Multi-modal feature-based mobile robot vision-language navigation method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN110609891A (en) * 2019-09-18 2019-12-24 合肥工业大学 Visual dialog generation method based on context awareness graph neural network
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825829B (en) * 2019-10-16 2023-05-26 华南理工大学 Method for realizing autonomous navigation of robot based on natural language and semantic map
CN112504261B (en) * 2020-11-09 2024-02-09 中国人民解放军国防科技大学 Unmanned aerial vehicle falling pose filtering estimation method and system based on visual anchor points
CN112710310B (en) * 2020-12-07 2024-04-19 深圳龙岗智能视听研究院 Visual language indoor navigation method, system, terminal and application

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN110609891A (en) * 2019-09-18 2019-12-24 合肥工业大学 Visual dialog generation method based on context awareness graph neural network
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network

Also Published As

Publication number Publication date
CN113420606A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
CN110188598B (en) Real-time hand posture estimation method based on MobileNet-v2
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN108829677B (en) Multi-modal attention-based automatic image title generation method
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN112860888B (en) Attention mechanism-based bimodal emotion analysis method
CN113420606B (en) Method for realizing autonomous navigation of robot based on natural language and machine vision
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN110795990B (en) Gesture recognition method for underwater equipment
CN110825829B (en) Method for realizing autonomous navigation of robot based on natural language and semantic map
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
CN113064968B (en) Social media emotion analysis method and system based on tensor fusion network
CN113221663A (en) Real-time sign language intelligent identification method, device and system
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN117197878B (en) Character facial expression capturing method and system based on machine learning
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN113989933A (en) Online behavior recognition model training and detecting method and system
CN116189306A (en) Human behavior recognition method based on joint attention mechanism
CN115018819A (en) Weld point position extraction method based on Transformer neural network
CN112926569B (en) Method for detecting natural scene image text in social network
CN112800958B (en) Lightweight human body key point detection method based on heat point diagram
CN115311598A (en) Video description generation system based on relation perception
CN117576279B (en) Digital person driving method and system based on multi-mode data
CN113780350B (en) ViLBERT and BiLSTM-based image description method
CN117011650B (en) Method and related device for determining image encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant