CN113420606A - Method for realizing autonomous navigation of robot based on natural language and machine vision

Method for realizing autonomous navigation of robot based on natural language and machine vision

Info

Publication number
CN113420606A
CN113420606A (application CN202110597437.8A)
Authority
CN
China
Prior art keywords
features
feature
robot
vector
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110597437.8A
Other languages
Chinese (zh)
Other versions
CN113420606B (en)
Inventor
董敏
聂宏蓄
毕盛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110597437.8A priority Critical patent/CN113420606B/en
Publication of CN113420606A publication Critical patent/CN113420606A/en
Application granted granted Critical
Publication of CN113420606B publication Critical patent/CN113420606B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks

Abstract

The invention discloses a method for realizing autonomous navigation of a robot based on natural language and machine vision, which comprises the following steps: 1) the robot starts from an initial position and acquires language information and visual information at each round of conversation, i.e., at each moment; 2) feature extraction is performed on the language information through an attention mechanism to obtain semantic features; feature extraction is performed on the visual information through Resnet152 to obtain low-order visual features and image classification features, and through Faster R-CNN and U-Net to obtain target detection features and semantic segmentation features; 3) the low-order visual features, image classification features, target detection features, semantic segmentation features and semantic features of the current moment and the previous moment are fused through an attention mechanism to obtain fusion features; 4) the fusion features are input into a softmax classifier to predict the moving direction at the current moment. The invention uses the visual information and language information of the environment where the robot is located to perform autonomous navigation without acquiring an accurate measurement map in advance.

Description

Method for realizing autonomous navigation of robot based on natural language and machine vision
Technical Field
The invention relates to the technical fields of natural language processing, image processing and autonomous navigation, and in particular to a method for realizing indoor autonomous navigation of a mobile robot based on natural language and computer vision.
Background
In recent years, robot autonomous navigation has been applied ever more widely in production and daily life, and more and more application scenarios require accurate and efficient autonomous navigation technology. Conventional autonomous navigation methods need to scan the environment once to obtain an accurate measurement map and then navigate according to that map. Acquiring an accurate measurement map requires a large amount of manpower and material resources, and an autonomous navigation method based on such a map is difficult to migrate to an unknown environment. Therefore, research on autonomous navigation methods based on natural language and computer vision is of great significance.
At present, robot autonomous navigation research mainly adopts methods based on an accurate measurement map, which face the following problems:
(1) the acquisition of the accurate measurement map requires a large amount of resources and time to scan the environment in advance, and the cost for acquiring the accurate measurement map is high.
(2) In some complex scenes that are difficult to observe, the difficulty and expense of obtaining an accurate measurement map are even higher, and a navigation method based on such a map may not be feasible.
(3) The navigation effect depends on the accuracy of the measurement map, and in situations where an accurate measurement map is difficult to obtain, the navigation effect deteriorates.
(4) An autonomous navigation method based on an accurate measurement map navigates using only the metric information of the environment and does not utilize semantic or visual information, so it is difficult to migrate to an unknown environment.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for realizing indoor autonomous navigation of a mobile robot based on natural language and machine vision, which can perform robot autonomous navigation without acquiring an accurate measurement map in advance, by utilizing the visual information of the environment where the robot is located and natural-language dialogue records.
In order to achieve the purpose, the technical scheme provided by the invention is as follows: a method for realizing autonomous navigation of a robot based on natural language and machine vision comprises the following steps:
1) the robot, starting from an initial position, acquires language information and visual information at each round of conversation, i.e., at each moment; the language information comprises an instruction indicating the target position of the robot and a dialogue record describing the environment where the robot is located, the dialogue record comprising the dialogue generated at the current position, i.e., at the current moment, and the set of all previous dialogues; the visual information comprises panoramic image information of the current position of the robot;
2) performing feature extraction on the language information through an attention mechanism to obtain semantic features; performing feature extraction on the visual information through Resnet152 to obtain low-order visual features and image classification features; performing feature extraction on the visual information through Faster R-CNN and U-Net respectively to obtain target detection features and semantic segmentation features;
3) fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features at the current moment and the previous moment through an attention mechanism to obtain fusion features;
4) inputting the fusion features into a softmax classifier to predict the moving direction of the robot at the current moment, wherein at each moment the robot predicts its moving direction from the fusion features, and when the prediction result is 'stop', the robot has reached the target position.
In step 1), the dialogue record of the environment where the robot is located refers to the communication record generated when two human users navigate in that environment, where one human user, having the topological information of the whole indoor environment, instructs the robot how to walk through question-and-answer communication with the other human user. Each dialogue record is denoted $H_t = \{D_1, D_2, \ldots, D_i, \ldots, D_{t-1}\}$, where $H_t$ represents the dialogue record at the t-th round of conversation and $D_i$ represents the i-th round of dialogue. The panoramic image corresponding to the visual information of the environment where the robot is located is denoted C; the panoramic image is divided into 12 sub-images, one for each of 12 directions, and written $C = \{c_1, c_2, \ldots, c_i, \ldots, c_{12}\}$, where $c_i$ represents the i-th sub-image.
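By way of illustration only, the inputs described above could be organised as follows; this is a minimal sketch, and the class and field names are assumptions rather than part of the patent.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np

NUM_DIRECTIONS = 12  # the panorama is split into 12 sub-images, one per candidate heading


@dataclass
class DialogTurn:
    """One round of dialogue D_i: the navigator's question and the guide's answer."""
    question: str
    answer: str


@dataclass
class NavigationInput:
    """Inputs available to the robot at round t of the conversation."""
    instruction: str                                           # instruction indicating the target position
    history: List[DialogTurn] = field(default_factory=list)    # H_t = {D_1, ..., D_{t-1}}
    panorama: List[np.ndarray] = field(default_factory=list)   # C = {c_1, ..., c_12}
```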
In step 2), feature extraction is performed on the language information through an attention mechanism to obtain semantic features, comprising the following steps:
2.1) The dialogue record H, containing t rounds of dialogue, and each round of dialogue D, containing L words, are described as:

$$H = \{D_1, D_2, \ldots, D_i, \ldots, D_t\}$$

$$D_i = \{w_1, w_2, \ldots, w_i, \ldots, w_L\}$$

where $D_i$ denotes the i-th round of dialogue and $w_i$ denotes the i-th word in a round of dialogue;
2.2) vectorizing the dialogue records through an embedding layer, wherein a corresponding vectorization result E is described as follows:
$$E = \{G_1, G_2, \ldots, G_i, \ldots, G_t\}$$

$$G_i = \{g_1, g_2, \ldots, g_i, \ldots, g_L\}$$

where $G_i$ denotes the embedding vectors of the i-th round of dialogue, there being t rounds of dialogue in total, and $g_i$ denotes the embedding vector of the i-th word in a round of dialogue, there being L words in total;
2.3) The embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors; the process is described as follows:

$$\{h_{i,1}, h_{i,2}, \ldots, h_{i,L}\} = \mathrm{LSTM}(\{w_{i,1}, w_{i,2}, \ldots, w_{i,j}, \ldots, w_{i,L}\})$$

$$d_i = h_{i,L}$$

where $w_{i,j}$ denotes the embedding vector of the j-th word in the i-th round of dialogue, and $h_{i,L}$ denotes the state vector of the LSTM network at the last time step, written as $d_i$; the first t-1 feature vectors $\{d_1, d_2, \ldots, d_{t-1}\}$ of the dialogue record form a feature matrix;
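A minimal sketch of the embedding and LSTM encoding of step 2.3, written in PyTorch; the vocabulary size and the embedding and hidden dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DialogEncoder(nn.Module):
    """Embeds the words of one dialogue round and encodes them with an LSTM;
    the last hidden state h_{i,L} is used as the round's feature vector d_i."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):            # token_ids: (batch, L)
        g = self.embedding(token_ids)        # embedding vectors, (batch, L, embed_dim)
        h, (h_last, _) = self.lstm(g)        # h: all LSTM states, (batch, L, hidden_dim)
        return h_last[-1]                    # d_i = h_{i,L}, shape (batch, hidden_dim)

# Encoding every past round in turn yields the feature matrix {d_1, ..., d_{t-1}}.
```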
2.4) The feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism; the fusion process is described as follows:

$$A(d_t, d_i) = \mathrm{softmax}\!\left(\frac{(W_Q d_t)^{\top}(W_K d_i)}{\sqrt{c}}\right)$$

$$\hat{d}_t = \sum_{i=1}^{t-1} A(d_t, d_i)\, W_V d_i$$

$$\tilde{d}_t = \mathrm{concat}(\hat{d}_t, d_t)$$

where $d_t$ and $d_i$ denote the state vectors $h_{t,L}$ and $h_{i,L}$ respectively, $A(d_t, d_i)$ denotes the attention of vector $d_t$ to $d_i$, $W_Q$, $W_K$ and $W_V$ are parameters of the model, and $c$ is the dimension of the vectors $d_t$ and $d_i$; softmax denotes the softmax function and concat denotes the merging of vectors; $\hat{d}_t$ is the result of weighting and combining all $d_i$ by their attention values, and $\tilde{d}_t$ is the semantic feature corresponding to the dialogue history of the t-th round, obtained by merging $\hat{d}_t$ and $d_t$;
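A minimal PyTorch sketch of the attention step 2.4, assuming the standard scaled dot-product form implied by the parameters $W_Q$, $W_K$, $W_V$ and the dimension $c$; the exact formulas in the patent are published only as images, so this is an interpretation rather than the authoritative definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HistoryAttention(nn.Module):
    """Attends from the current-round feature d_t over the history features d_1..d_{t-1}
    and concatenates the attended summary with d_t to form the semantic feature."""
    def __init__(self, dim):
        super().__init__()
        self.W_Q = nn.Linear(dim, dim, bias=False)
        self.W_K = nn.Linear(dim, dim, bias=False)
        self.W_V = nn.Linear(dim, dim, bias=False)

    def forward(self, d_t, history):                  # d_t: (batch, c), history: (batch, t-1, c)
        q = self.W_Q(d_t).unsqueeze(1)                # (batch, 1, c)
        k = self.W_K(history)                         # (batch, t-1, c)
        v = self.W_V(history)
        scores = (q @ k.transpose(1, 2)) / k.size(-1) ** 0.5   # attention A(d_t, d_i)
        attn = F.softmax(scores, dim=-1)
        summary = (attn @ v).squeeze(1)               # weighted combination of all d_i
        return torch.cat([summary, d_t], dim=-1)      # semantic feature for round t
```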
Feature extraction is performed on the visual information through Resnet152 to obtain the low-order visual features and image classification features, and through Faster R-CNN and U-Net respectively to obtain the target detection features and semantic segmentation features. Specifically, in each round of dialogue the robot arrives at a new position and acquires the panoramic view at that position; the panoramic view at the t-th round of dialogue is denoted $P_t$. $P_t$ is passed through the neural network model Resnet152: the obtained feature result is taken as the low-order visual feature, denoted $V_t$, and the obtained image classification result is taken as the image classification feature, denoted $C_t$. $P_t$ is input into the Faster R-CNN network, and the obtained target detection result is taken as the target detection feature, denoted $O_t$. $P_t$ is input into a U-Net network, and the obtained semantic segmentation result is taken as the semantic segmentation feature, denoted $S_t$.
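A sketch of how the four visual feature streams could be obtained with off-the-shelf models; torchvision's ResNet-152 and Faster R-CNN (with a ResNet-50 FPN backbone) are used here, and an FCN segmentation model stands in for the U-Net named in the patent, which does not specify a particular implementation.

```python
import torch
import torchvision

# Pretrained backbones; eval() freezes dropout/batch-norm behaviour
resnet = torchvision.models.resnet152(weights="DEFAULT").eval()
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
segmenter = torchvision.models.segmentation.fcn_resnet50(weights="DEFAULT").eval()

# Low-order visual features: ResNet activations before the classification head
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])

@torch.no_grad()
def extract_visual_features(sub_images):
    """sub_images: float tensor (12, 3, H, W), one sub-image per direction of P_t."""
    v_t = backbone(sub_images).flatten(1)  # low-order visual features V_t, shape (12, 2048)
    c_t = resnet(sub_images)               # image classification features C_t (class logits)
    o_t = detector(list(sub_images))       # target detection features O_t (boxes, labels, scores per sub-image)
    s_t = segmenter(sub_images)["out"]     # semantic segmentation maps S_t (FCN used here in place of U-Net)
    return v_t, c_t, o_t, s_t
```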
In step 3), the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features of the current moment and the previous moment are fused through an attention mechanism to obtain fusion features, comprising the following steps:
3.1) The low-order visual features, image classification features, target detection features and semantic segmentation features are fused with the fusion feature corresponding to the (t-1)-th round of dialogue, i.e., time t-1; the fusion equations (one per feature type) are given only as images in the original publication. In them, $v_{t,i}$, $c_{t,i}$, $o_{t,i}$ and $s_{t,i}$ respectively denote the low-order visual, image classification, target detection and semantic segmentation feature vectors of the i-th sub-image at the t-th round of dialogue, i.e., the vectors of the low-order visual feature matrix $V_t$, the image classification feature matrix $C_t$, the target detection feature matrix $O_t$ and the semantic segmentation feature matrix $S_t$; $m_{t-1}$ denotes the fusion feature obtained at time t-1; $f_v$ and $f_{vlm}$ are non-linear mapping functions; and $l$ is the vector dimension of $m_{t-1}$. The fused low-order visual, image classification, target detection and semantic segmentation feature vectors are denoted $v_{t,i}^{mem}$, $c_{t,i}^{mem}$, $o_{t,i}^{mem}$ and $s_{t,i}^{mem}$ respectively;
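Because the fusion formulas of step 3.1 are published only as images, the following is one plausible instantiation: $f_v$ and $f_{vlm}$ are taken as learned linear-plus-ReLU mappings, and the previous fusion feature $m_{t-1}$ gates each per-sub-image feature vector. This is an assumption for illustration, not the patent's exact formula.

```python
import torch
import torch.nn as nn

class MemoryFusion(nn.Module):
    """Fuses one per-sub-image feature vector (v_{t,i}, c_{t,i}, o_{t,i} or s_{t,i})
    with the fusion feature m_{t-1} of the previous time step."""
    def __init__(self, feat_dim, mem_dim, out_dim):
        super().__init__()
        self.f_v = nn.Sequential(nn.Linear(feat_dim, out_dim), nn.ReLU())               # non-linear mapping f_v
        self.f_vlm = nn.Sequential(nn.Linear(feat_dim + mem_dim, out_dim), nn.ReLU())   # non-linear mapping f_vlm

    def forward(self, x, m_prev):                     # x: (12, feat_dim), m_prev: (mem_dim,)
        m = m_prev.unsqueeze(0).expand(x.size(0), -1)
        gate = torch.sigmoid(self.f_vlm(torch.cat([x, m], dim=-1)))
        return gate * self.f_v(x)                     # fused feature, e.g. v^{mem}_{t,i}
```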
3.2) The fused low-order visual features, image classification features, target detection features and semantic segmentation features are further fused with the semantic features through an attention mechanism; the corresponding equations are given only as images in the original publication. In them, $V_t^{mem}$, $C_t^{mem}$, $O_t^{mem}$ and $S_t^{mem}$ respectively denote the fused low-order visual, image classification, target detection and semantic segmentation feature matrices at the t-th round of dialogue; the semantic feature $\tilde{d}_t$ of the t-th round is multiplied by a parameter matrix so as to be mapped to a matching dimension, with $h$ denoting the dimension of the semantic feature; softmax denotes the softmax function; $V_t^{attn}$, $C_t^{attn}$, $O_t^{attn}$ and $S_t^{attn}$ respectively denote the low-order visual, image classification, target detection and semantic segmentation features after fusion through the attention mechanism;
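Likewise, the equations of step 3.2 are published only as images; the sketch below shows one plausible reading, in which the semantic feature is projected by a learned parameter matrix and used as a query that attends over the 12 sub-image rows of each memory-fused feature matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGuidedAttention(nn.Module):
    """Attends over the 12 sub-image rows of a memory-fused feature matrix
    (V_t^mem, C_t^mem, O_t^mem or S_t^mem) using the semantic feature as the query."""
    def __init__(self, sem_dim, feat_dim):
        super().__init__()
        self.proj = nn.Linear(sem_dim, feat_dim, bias=False)  # maps the semantic feature into the visual feature space

    def forward(self, sem, feat_matrix):              # sem: (sem_dim,), feat_matrix: (12, feat_dim)
        q = self.proj(sem)                            # projected query, (feat_dim,)
        scores = feat_matrix @ q / feat_matrix.size(-1) ** 0.5
        attn = F.softmax(scores, dim=0)               # weight per sub-image / direction
        return attn @ feat_matrix                     # attention-fused feature, e.g. V_t^attn
```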
3.3) The fused features are further processed through an LSTM network and finally merged into the final encoding feature; the process is as follows:

$$\bar{V}_t = \mathrm{LSTM}(V_t^{attn}), \quad \bar{C}_t = \mathrm{LSTM}(C_t^{attn}), \quad \bar{O}_t = \mathrm{LSTM}(O_t^{attn}), \quad \bar{S}_t = \mathrm{LSTM}(S_t^{attn})$$

$$m_t = \mathrm{concat}(\bar{V}_t, \bar{C}_t, \bar{O}_t, \bar{S}_t)$$

where $\bar{V}_t$, $\bar{C}_t$, $\bar{O}_t$ and $\bar{S}_t$ respectively denote the low-order visual, image classification, target detection and semantic segmentation features processed by the LSTM network, concat denotes the merging of vectors, and $m_t$ denotes the fused feature corresponding to the t-th round of dialogue, i.e., the final encoding feature;
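A sketch of step 3.3 under the assumption that each attention-fused feature stream is carried through its own LSTM cell across dialogue rounds and the four outputs are concatenated into the final encoding feature.

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    """Runs each attention-fused feature stream through an LSTM across dialogue rounds
    and concatenates the four outputs into the final encoding feature for round t."""
    def __init__(self, dims, hidden_dim):
        super().__init__()
        # one LSTMCell per stream: low-order visual, classification, detection, segmentation
        self.cells = nn.ModuleList([nn.LSTMCell(d, hidden_dim) for d in dims])

    def forward(self, streams, states):               # streams: 4 tensors (batch, dim_k); states: 4 (h, c) pairs
        new_states, outputs = [], []
        for cell, x, st in zip(self.cells, streams, states):
            h, c = cell(x, st)
            new_states.append((h, c))
            outputs.append(h)
        fused = torch.cat(outputs, dim=-1)            # final encoding feature m_t
        return fused, new_states
```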
In step 4), the fusion features are input into a softmax classifier for moving-direction prediction, comprising the following steps:
4.1) The final encoding feature $m_t$ is mapped by an activation function; the process is as follows:

$$a_t = \sigma(f_m(m_t))$$

where $\sigma$ is the sigmoid activation function, $f_m$ is a non-linear mapping function, and $a_t$ is the activation result;
4.2) The final result is calculated from the activation result through a softmax function; the process is as follows:

$$y_t = \mathrm{softmax}(f_a(a_t))$$

where softmax denotes the softmax function, $f_a$ is a non-linear mapping function, and $y_t$ is the predicted distribution over moving directions.
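A sketch of step 4: the final encoding feature passes through a non-linear mapping with a sigmoid activation, then a second mapping followed by softmax over the candidate actions; the action set of 12 directions plus a stop action is an assumption consistent with steps 1) and 4).

```python
import torch
import torch.nn as nn

class DirectionClassifier(nn.Module):
    """Maps the final encoding feature to a probability distribution over moving directions."""
    def __init__(self, in_dim, hidden_dim, num_actions=13):  # 12 directions + "stop" (assumed action set)
        super().__init__()
        self.f_m = nn.Linear(in_dim, hidden_dim)       # non-linear mapping f_m
        self.f_a = nn.Linear(hidden_dim, num_actions)  # non-linear mapping f_a

    def forward(self, m_t):
        a_t = torch.sigmoid(self.f_m(m_t))             # activation result
        return torch.softmax(self.f_a(a_t), dim=-1)    # probability of each moving direction
```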
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention performs robot autonomous navigation using visual information and natural language, saving the overhead of obtaining an accurate measurement map and adapting to complex environments.
2. The invention combines natural language instructions with machine vision, so that robot autonomous navigation can be carried out more conveniently and efficiently.
3. By combining the features of two different modalities, natural language and machine vision, the invention improves navigation efficiency and saves expenditure while maintaining the navigation effect.
Drawings
FIG. 1 is a flow chart illustrating autonomous navigation according to the present invention.
FIG. 2 is a schematic diagram of a model architecture construction process for feature extraction and navigation instruction prediction based on attention mechanism.
In FIG. 2, the dialogue history represents the robot's questions and the human user's answer records, and the dialogue at the current time represents the robot's question and the human user's answer in the current round of dialogue; Encoding indicates that the dialogue information is encoded and converted into an encoding vector; the machine vision image is passed through the Resnet152, Faster R-CNN and U-net models to extract the low-order visual features, image classification features, target detection features and semantic segmentation features; Attention denotes the attention module, which is used for feature extraction of the semantic information and of the visual information, and for fusing the semantic features and visual features with the fusion feature obtained in the (t-1)-th round of dialogue, i.e., at time t-1, yielding the fused low-order visual, image classification, target detection and semantic segmentation features; the fused features are input into the softmax module and the final result is calculated.
FIG. 3 is a schematic view of the principle of the attention mechanism, where $d_t$ and $d_i$ respectively denote the feature vectors used to calculate attention; $W_Q$, $W_K$ and $W_V$ are the parameters used to map $d_t$ and $d_i$ to the same dimension; Matmul stands for matrix multiplication; the calculation result is normalized by a softmax module to obtain the attention result $A(d_t, d_i)$; the results of all attention modules are combined and merged with $d_t$ to obtain the final result.
Detailed Description
The present invention will be further described with reference to the following specific examples and drawings, but the embodiments of the present invention are not limited thereto.
As shown in fig. 1 to 3, the method for implementing autonomous navigation of a robot based on natural language and machine vision provided by the present embodiment includes the following steps:
1) The robot starts from an initial position and acquires language information and visual information at each round of conversation (each moment). The language information comprises instructions indicating the target position of the robot and a dialogue record describing the environment where the robot is located; the dialogue record comprises the dialogue generated at the current position (i.e., the current moment) and the set of all previous dialogues; the visual information comprises panoramic image information of the current position of the robot. The dialogue record of the environment where the robot is located refers to the communication record generated when two human users navigate in that environment, where one human user, having the topological information of the whole indoor environment, instructs the robot how to walk through question-and-answer communication with the other human user. Each dialogue record is denoted $H_t = \{D_1, D_2, \ldots, D_i, \ldots, D_{t-1}\}$, where $H_t$ represents the dialogue record at the t-th round of conversation and $D_i$ represents the i-th round of dialogue. The panoramic image corresponding to the visual information of the environment where the robot is located is denoted C; it is divided into 12 sub-images, one for each of 12 directions, and written $C = \{c_1, c_2, \ldots, c_i, \ldots, c_{12}\}$, where $c_i$ represents the i-th sub-image, as shown by the images corresponding to the alternative action directions in FIG. 2.
2) The method for extracting the features of the language information through the attention mechanism to obtain the semantic features comprises the following steps:
2.1) The dialogue record H, containing t rounds of dialogue, and each round of dialogue D, containing L words, are described as:

$$H = \{D_1, D_2, \ldots, D_i, \ldots, D_t\}$$

$$D_i = \{w_1, w_2, \ldots, w_i, \ldots, w_L\}$$

where $D_i$ denotes the i-th round of dialogue and $w_i$ denotes the i-th word in a round of dialogue;
2.2) vectorizing the dialogue records through an embedding layer, wherein the corresponding vectorizing result is described as:
$$E = \{G_1, G_2, \ldots, G_i, \ldots, G_t\}$$

$$G_i = \{g_1, g_2, \ldots, g_i, \ldots, g_L\}$$

where $G_i$ denotes the embedding vectors of the i-th round of dialogue, there being t rounds of dialogue in total, and $g_i$ denotes the embedding vector of the i-th word in a round of dialogue, there being L words in total;
2.3) The embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors; the process is described as follows:

$$\{h_{i,1}, h_{i,2}, \ldots, h_{i,L}\} = \mathrm{LSTM}(\{w_{i,1}, w_{i,2}, \ldots, w_{i,j}, \ldots, w_{i,L}\})$$

$$d_i = h_{i,L}$$

where $w_{i,j}$ denotes the embedding vector of the j-th word in the i-th round of dialogue, and $h_{i,L}$ denotes the state vector of the LSTM network at the last time step, written as $d_i$; the first t-1 feature vectors $\{d_1, d_2, \ldots, d_{t-1}\}$ of the dialogue record form a feature matrix;
2.4) The feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism; the principle of the attention mechanism and the calculation of the attention result are shown in FIG. 3, and the fusion process is described as follows:

$$A(d_t, d_i) = \mathrm{softmax}\!\left(\frac{(W_Q d_t)^{\top}(W_K d_i)}{\sqrt{c}}\right)$$

$$\hat{d}_t = \sum_{i=1}^{t-1} A(d_t, d_i)\, W_V d_i$$

$$\tilde{d}_t = \mathrm{concat}(\hat{d}_t, d_t)$$

where $d_t$ and $d_i$ denote the state vectors $h_{t,L}$ and $h_{i,L}$ respectively, $A(d_t, d_i)$ denotes the attention of vector $d_t$ to $d_i$, $W_Q$, $W_K$ and $W_V$ are parameters of the model, and $c$ is the dimension of the vectors $d_t$ and $d_i$; softmax denotes the softmax function and concat denotes the merging of vectors; $\hat{d}_t$ is the result of weighting and combining all $d_i$ by their attention values, and $\tilde{d}_t$ is the semantic feature corresponding to the dialogue history of the t-th round, obtained by merging $\hat{d}_t$ and $d_t$;
as shown in fig. 2, feature extraction is performed on visual information through Resnet152 to obtain low-order visual features and image classification features, feature extraction is performed on visual information through false-RCNN and U-net to obtain target detection features and semantic segmentation features, specifically, in each round of conversation, a robot arrives at a new position, then a panoramic view at the position is obtained, and a corresponding panoramic view in t rounds of conversation is represented as Pt(ii) a Will PtFeature extraction is performed through a neural network model Resnet152, and the obtained feature result is taken as a low-order visual feature and is expressed as VtAnd taking the obtained image classification result as an image classification characteristic, which is expressed as Ct(ii) a Will PtInputting the target detection result into the master-RCNN network, and taking the obtained target detection result as a target detection characteristic, wherein the target detection characteristic is represented as Ot(ii) a Will PtInputting the semantic segmentation result into a U-net network, and taking the obtained semantic segmentation result as a semantic segmentation feature expressed as St
3) Fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features of the current moment and the previous moment through an attention mechanism to obtain fusion features, comprising the following steps:
3.1) The low-order visual features, image classification features, target detection features and semantic segmentation features are fused with the fusion feature corresponding to the (t-1)-th round of dialogue, i.e., time t-1; the fusion equations (one per feature type) are given only as images in the original publication. In them, $v_{t,i}$, $c_{t,i}$, $o_{t,i}$ and $s_{t,i}$ respectively denote the low-order visual, image classification, target detection and semantic segmentation feature vectors of the i-th sub-image at the t-th round of dialogue, i.e., the vectors of the low-order visual feature matrix $V_t$, the image classification feature matrix $C_t$, the target detection feature matrix $O_t$ and the semantic segmentation feature matrix $S_t$; $m_{t-1}$ denotes the fusion feature obtained at time t-1; $f_v$ and $f_{vlm}$ are non-linear mapping functions; and $l$ is the vector dimension of $m_{t-1}$. The fused low-order visual, image classification, target detection and semantic segmentation feature vectors are denoted $v_{t,i}^{mem}$, $c_{t,i}^{mem}$, $o_{t,i}^{mem}$ and $s_{t,i}^{mem}$ respectively;
3.2) The fused low-order visual features, image classification features, target detection features and semantic segmentation features are further fused with the semantic features through an attention mechanism; the corresponding equations are given only as images in the original publication. In them, $V_t^{mem}$, $C_t^{mem}$, $O_t^{mem}$ and $S_t^{mem}$ respectively denote the fused low-order visual, image classification, target detection and semantic segmentation feature matrices at the t-th round of dialogue; the semantic feature $\tilde{d}_t$ of the t-th round is multiplied by a parameter matrix so as to be mapped to a matching dimension, with $h$ denoting the dimension of the semantic feature; softmax denotes the softmax function; $V_t^{attn}$, $C_t^{attn}$, $O_t^{attn}$ and $S_t^{attn}$ respectively denote the low-order visual, image classification, target detection and semantic segmentation features after fusion through the attention mechanism;
3.3) The fused features are further processed through an LSTM network and finally merged into the final encoding feature; the process is as follows:

$$\bar{V}_t = \mathrm{LSTM}(V_t^{attn}), \quad \bar{C}_t = \mathrm{LSTM}(C_t^{attn}), \quad \bar{O}_t = \mathrm{LSTM}(O_t^{attn}), \quad \bar{S}_t = \mathrm{LSTM}(S_t^{attn})$$

$$m_t = \mathrm{concat}(\bar{V}_t, \bar{C}_t, \bar{O}_t, \bar{S}_t)$$

where $\bar{V}_t$, $\bar{C}_t$, $\bar{O}_t$ and $\bar{S}_t$ respectively denote the low-order visual, image classification, target detection and semantic segmentation features processed by the LSTM network, concat denotes the merging of vectors, and $m_t$ denotes the fused feature corresponding to the t-th round of dialogue, i.e., the final encoding feature;
4) Inputting the fusion features into a softmax classifier to predict the moving direction, comprising the following steps:
4.1) The final encoding feature $m_t$ is mapped by an activation function; the process is as follows:

$$a_t = \sigma(f_m(m_t))$$

where $\sigma$ is the sigmoid activation function, $f_m$ is a non-linear mapping function, and $a_t$ is the activation result;
4.2) The final result is calculated from the activation result through a softmax function; the process is as follows:

$$y_t = \mathrm{softmax}(f_a(a_t))$$

where softmax denotes the softmax function, $f_a$ is a non-linear mapping function, and $y_t$ is the predicted distribution over moving directions.
The above-described embodiments are only preferred embodiments of the present invention, and not intended to limit the scope of the present invention, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and they are included in the scope of the present invention.

Claims (5)

1. A method for realizing autonomous navigation of a robot based on natural language and machine vision is characterized by comprising the following steps:
1) the robot, starting from an initial position, acquires language information and visual information at each round of conversation, i.e., at each moment; the language information comprises an instruction indicating the target position of the robot and a dialogue record describing the environment where the robot is located, the dialogue record comprising the dialogue generated at the current position, i.e., at the current moment, and the set of all previous dialogues; the visual information comprises panoramic image information of the current position of the robot;
2) performing feature extraction on the language information through an attention mechanism to obtain semantic features; performing feature extraction on the visual information through Resnet152 to obtain low-order visual features and image classification features; performing feature extraction on the visual information through Faster R-CNN and U-Net respectively to obtain target detection features and semantic segmentation features;
3) fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features of the current moment and the previous moment through an attention mechanism to obtain fusion features;
4) inputting the fusion features into a softmax classifier to predict the moving direction of the robot at the current moment, wherein at each moment the robot predicts its moving direction from the fusion features, and when the prediction result is 'stop', the robot has reached the target position.
2. The method for realizing autonomous navigation of the robot based on the natural language and the machine vision according to claim 1, characterized in that: in step 1), the dialogue record of the environment where the robot is located refers to the communication record generated when two human users navigate in that environment, where one human user, having the topological information of the whole indoor environment, instructs the robot how to walk through question-and-answer communication with the other human user; each dialogue record is denoted $H_t = \{D_1, D_2, \ldots, D_i, \ldots, D_{t-1}\}$, where $H_t$ represents the dialogue record at the t-th round of conversation and $D_i$ represents the i-th round of dialogue; the panoramic image corresponding to the visual information of the environment where the robot is located is denoted C, and is divided into 12 sub-images, one for each of 12 directions, written $C = \{c_1, c_2, \ldots, c_i, \ldots, c_{12}\}$, where $c_i$ represents the i-th sub-image.
3. The method for realizing autonomous navigation of the robot based on the natural language and the machine vision according to claim 1, characterized in that: in step 2), feature extraction is performed on the language information through an attention mechanism to obtain semantic features, comprising the following steps:
2.1) The dialogue record H, containing t rounds of dialogue, and each round of dialogue D, containing L words, are described as:

$$H = \{D_1, D_2, \ldots, D_i, \ldots, D_t\}$$

$$D_i = \{w_1, w_2, \ldots, w_i, \ldots, w_L\}$$

where $D_i$ denotes the i-th round of dialogue and $w_i$ denotes the i-th word in a round of dialogue;
2.2) vectorizing the dialogue records through an embedding layer, wherein a corresponding vectorization result E is described as follows:
$$E = \{G_1, G_2, \ldots, G_i, \ldots, G_t\}$$

$$G_i = \{g_1, g_2, \ldots, g_i, \ldots, g_L\}$$

where $G_i$ denotes the embedding vectors of the i-th round of dialogue, there being t rounds of dialogue in total, and $g_i$ denotes the embedding vector of the i-th word in a round of dialogue, there being L words in total;
2.3) The embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors; the process is described as follows:

$$\{h_{i,1}, h_{i,2}, \ldots, h_{i,L}\} = \mathrm{LSTM}(\{w_{i,1}, w_{i,2}, \ldots, w_{i,j}, \ldots, w_{i,L}\})$$

$$d_i = h_{i,L}$$

where $w_{i,j}$ denotes the embedding vector of the j-th word in the i-th round of dialogue, and $h_{i,L}$ denotes the state vector of the LSTM network at the last time step, written as $d_i$; the first t-1 feature vectors $\{d_1, d_2, \ldots, d_{t-1}\}$ of the dialogue record form a feature matrix;
2.4) The feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism; the fusion process is described as follows:

$$A(d_t, d_i) = \mathrm{softmax}\!\left(\frac{(W_Q d_t)^{\top}(W_K d_i)}{\sqrt{c}}\right)$$

$$\hat{d}_t = \sum_{i=1}^{t-1} A(d_t, d_i)\, W_V d_i$$

$$\tilde{d}_t = \mathrm{concat}(\hat{d}_t, d_t)$$

where $d_t$ and $d_i$ denote the state vectors $h_{t,L}$ and $h_{i,L}$ respectively, $A(d_t, d_i)$ denotes the attention of vector $d_t$ to $d_i$, $W_Q$, $W_K$ and $W_V$ are parameters of the model, and $c$ is the dimension of the vectors $d_t$ and $d_i$; softmax denotes the softmax function and concat denotes the merging of vectors; $\hat{d}_t$ is the result of weighting and combining all $d_i$ by their attention values, and $\tilde{d}_t$ is the semantic feature corresponding to the dialogue history of the t-th round, obtained by merging $\hat{d}_t$ and $d_t$;
feature extraction is performed on the visual information through Resnet152 to obtain the low-order visual features and image classification features, and through Faster R-CNN and U-Net respectively to obtain the target detection features and semantic segmentation features: in each round of dialogue the robot arrives at a new position and acquires the panoramic view at that position; the panoramic view at the t-th round of dialogue is denoted $P_t$; $P_t$ is passed through the neural network model Resnet152, the obtained feature result is taken as the low-order visual feature, denoted $V_t$, and the obtained image classification result is taken as the image classification feature, denoted $C_t$; $P_t$ is input into the Faster R-CNN network, and the obtained target detection result is taken as the target detection feature, denoted $O_t$; $P_t$ is input into a U-Net network, and the obtained semantic segmentation result is taken as the semantic segmentation feature, denoted $S_t$.
4. The method for realizing autonomous navigation of the robot based on the natural language and the machine vision according to claim 1, characterized in that: in step 3), the low-order visual features, image classification features, target detection features, semantic segmentation features and semantic features of the current moment and the previous moment are fused through an attention mechanism to obtain fusion features, comprising the following steps:
3.1) The low-order visual features, image classification features, target detection features and semantic segmentation features are fused with the fusion feature corresponding to the (t-1)-th round of dialogue, i.e., time t-1; the fusion equations (one per feature type) are given only as images in the original publication. In them, $v_{t,i}$, $c_{t,i}$, $o_{t,i}$ and $s_{t,i}$ respectively denote the low-order visual, image classification, target detection and semantic segmentation feature vectors of the i-th sub-image at the t-th round of dialogue, i.e., the vectors of the low-order visual feature matrix $V_t$, the image classification feature matrix $C_t$, the target detection feature matrix $O_t$ and the semantic segmentation feature matrix $S_t$; $m_{t-1}$ denotes the fusion feature obtained at time t-1; $f_v$ and $f_{vlm}$ are non-linear mapping functions; and $l$ is the vector dimension of $m_{t-1}$. The fused low-order visual, image classification, target detection and semantic segmentation feature vectors are denoted $v_{t,i}^{mem}$, $c_{t,i}^{mem}$, $o_{t,i}^{mem}$ and $s_{t,i}^{mem}$ respectively;
3.2) The fused low-order visual features, image classification features, target detection features and semantic segmentation features are further fused with the semantic features through an attention mechanism; the corresponding equations are given only as images in the original publication. In them, $V_t^{mem}$, $C_t^{mem}$, $O_t^{mem}$ and $S_t^{mem}$ respectively denote the fused low-order visual, image classification, target detection and semantic segmentation feature matrices at the t-th round of dialogue; the semantic feature $\tilde{d}_t$ of the t-th round is multiplied by a parameter matrix so as to be mapped to a matching dimension, with $h$ denoting the dimension of the semantic feature; softmax denotes the softmax function; $V_t^{attn}$, $C_t^{attn}$, $O_t^{attn}$ and $S_t^{attn}$ respectively denote the low-order visual, image classification, target detection and semantic segmentation features after fusion through the attention mechanism;
3.3) The fused features are further processed through an LSTM network and finally merged into the final encoding feature; the process is as follows:

$$\bar{V}_t = \mathrm{LSTM}(V_t^{attn}), \quad \bar{C}_t = \mathrm{LSTM}(C_t^{attn}), \quad \bar{O}_t = \mathrm{LSTM}(O_t^{attn}), \quad \bar{S}_t = \mathrm{LSTM}(S_t^{attn})$$

$$m_t = \mathrm{concat}(\bar{V}_t, \bar{C}_t, \bar{O}_t, \bar{S}_t)$$

where $\bar{V}_t$, $\bar{C}_t$, $\bar{O}_t$ and $\bar{S}_t$ respectively denote the low-order visual, image classification, target detection and semantic segmentation features processed by the LSTM network, concat denotes the merging of vectors, and $m_t$ denotes the fused feature corresponding to the t-th round of dialogue, i.e., the final encoding feature.
5. The method for realizing autonomous navigation of the robot based on the natural language and the machine vision according to claim 1, characterized in that: in step 4), the fusion features are input into a softmax classifier for moving-direction prediction, comprising the following steps:
4.1) The final encoding feature $m_t$ is mapped by an activation function; the process is as follows:

$$a_t = \sigma(f_m(m_t))$$

where $\sigma$ is the sigmoid activation function, $f_m$ is a non-linear mapping function, and $a_t$ is the activation result;
4.2) The final result is calculated from the activation result through a softmax function; the process is as follows:

$$y_t = \mathrm{softmax}(f_a(a_t))$$

where softmax denotes the softmax function, $f_a$ is a non-linear mapping function, and $y_t$ is the predicted distribution over moving directions.
CN202110597437.8A 2021-05-31 2021-05-31 Method for realizing autonomous navigation of robot based on natural language and machine vision Active CN113420606B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110597437.8A CN113420606B (en) 2021-05-31 2021-05-31 Method for realizing autonomous navigation of robot based on natural language and machine vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110597437.8A CN113420606B (en) 2021-05-31 2021-05-31 Method for realizing autonomous navigation of robot based on natural language and machine vision

Publications (2)

Publication Number Publication Date
CN113420606A true CN113420606A (en) 2021-09-21
CN113420606B CN113420606B (en) 2022-06-14

Family

ID=77713311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110597437.8A Active CN113420606B (en) 2021-05-31 2021-05-31 Method for realizing autonomous navigation of robot based on natural language and machine vision

Country Status (1)

Country Link
CN (1) CN113420606B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN110609891A (en) * 2019-09-18 2019-12-24 合肥工业大学 Visual dialog generation method based on context awareness graph neural network
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN110825829A (en) * 2019-10-16 2020-02-21 华南理工大学 Method for realizing autonomous navigation of robot based on natural language and semantic map
CN112504261A (en) * 2020-11-09 2021-03-16 中国人民解放军国防科技大学 Unmanned aerial vehicle landing pose filtering estimation method and system based on visual anchor point
CN112710310A (en) * 2020-12-07 2021-04-27 深圳龙岗智能视听研究院 Visual language indoor navigation method, system, terminal and application

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YI ZHU ET AL: "Vision-Dialog Navigation by Exploring Cross-modal Memory", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pages 10727-10736 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114029963A (en) * 2022-01-12 2022-02-11 北京具身智能科技有限公司 Robot operation method based on visual and auditory fusion
CN115082915A (en) * 2022-05-27 2022-09-20 华南理工大学 Mobile robot vision-language navigation method based on multi-modal characteristics
CN115082915B (en) * 2022-05-27 2024-03-29 South China University of Technology Multi-modal feature-based mobile robot vision-language navigation method

Also Published As

Publication number Publication date
CN113420606B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
CN110188598B (en) Real-time hand posture estimation method based on MobileNet-v2
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN108829677B (en) Multi-modal attention-based automatic image title generation method
CN111243269B (en) Traffic flow prediction method based on depth network integrating space-time characteristics
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN110795990B (en) Gesture recognition method for underwater equipment
CN112860888B (en) Attention mechanism-based bimodal emotion analysis method
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN110851760B (en) Human-computer interaction system for integrating visual question answering in web3D environment
CN113420606B (en) Method for realizing autonomous navigation of robot based on natural language and machine vision
CN110825829B (en) Method for realizing autonomous navigation of robot based on natural language and semantic map
CN113221663B (en) Real-time sign language intelligent identification method, device and system
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
CN111028319B (en) Three-dimensional non-photorealistic expression generation method based on facial motion unit
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
CN114360005B (en) Micro-expression classification method based on AU region and multi-level transducer fusion module
CN111026873A (en) Unmanned vehicle and navigation method and device thereof
CN113065344A (en) Cross-corpus emotion recognition method based on transfer learning and attention mechanism
CN116091551B (en) Target retrieval tracking method and system based on multi-mode fusion
CN110046271A (en) A kind of remote sensing images based on vocal guidance describe method
CN117197878B (en) Character facial expression capturing method and system based on machine learning
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN117576279B (en) Digital person driving method and system based on multi-mode data
CN113780350B (en) ViLBERT and BiLSTM-based image description method
CN117011650B (en) Method and related device for determining image encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant