CN113420606B - Method for realizing autonomous navigation of robot based on natural language and machine vision - Google Patents

Method for realizing autonomous navigation of robot based on natural language and machine vision

Info

Publication number
CN113420606B
CN113420606B CN202110597437.8A CN202110597437A
Authority
CN
China
Prior art keywords
features
robot
feature
vector
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110597437.8A
Other languages
Chinese (zh)
Other versions
CN113420606A (en)
Inventor
董敏
聂宏蓄
毕盛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority claimed from CN202110597437.8A
Publication of CN113420606A
Application granted
Publication of CN113420606B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for realizing autonomous navigation of a robot based on natural language and machine vision, which comprises the following steps: 1) the robot starts from an initial position and acquires language information and visual information at each round of dialogue, i.e. at each moment; 2) feature extraction is performed on the language information through an attention mechanism to obtain semantic features; feature extraction is performed on the visual information through Resnet152 to obtain low-order visual features and image classification features, and through Faster R-CNN and U-net to obtain target detection features and semantic segmentation features; 3) the low-order visual features, image classification features, target detection features, semantic segmentation features and the semantic features at the current moment and the previous moment are fused through an attention mechanism to obtain fusion features; 4) the fusion features are input into a softmax classifier to predict the moving direction at the current moment. The invention uses the visual information and language information of the environment where the robot is located to perform autonomous navigation without acquiring an accurate measurement map in advance.

Description

Method for realizing autonomous navigation of robot based on natural language and machine vision
Technical Field
The invention relates to the technical field of natural language processing, image processing and autonomous navigation, in particular to a method for realizing indoor autonomous navigation of a mobile robot based on natural language and computer vision.
Background
In recent years, autonomous navigation of robots has been applied more and more widely in production and life, and more and more application scenarios require accurate and efficient autonomous navigation technology. In the conventional autonomous navigation method, the environment must first be scanned to obtain an accurate measurement map, and autonomous navigation is then performed according to this map. Acquiring an accurate measurement map requires a large amount of manpower and material resources, and an autonomous navigation method based on an accurate measurement map is difficult to migrate to an unknown environment. Therefore, research on autonomous navigation methods based on natural language and computer vision is of great significance.
At present, a method based on an accurate measurement map is mainly adopted in the aspect of robot autonomous navigation research, but the following problems are also faced:
(1) the acquisition of the accurate measurement map requires a large amount of resources and time to scan the environment in advance, and the cost for acquiring the accurate measurement map is high.
(2) In some complex scenes which are difficult to observe, the difficulty and the expense for obtaining the accurate measurement map are higher, and the navigation method based on the accurate measurement map may not be implemented.
(3) The navigation effect depends on the accuracy of the metric map, and in some situations where it is difficult to obtain an accurate metric map, the navigation effect becomes poor.
(4) The autonomous navigation method based on the accurate measurement map navigates on the basis of the metric information of the environment and does not utilize semantic information or visual information, so it is difficult to migrate to unknown environments.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for realizing indoor autonomous navigation of a mobile robot based on natural language and machine vision, which can carry out autonomous navigation of the robot under the condition of not acquiring an accurate measurement map in advance by utilizing visual information of the environment where the robot is located and natural language dialogue records.
In order to achieve the purpose, the technical scheme provided by the invention is as follows: a method for realizing autonomous navigation of a robot based on natural language and machine vision comprises the following steps:
1) the robot, starting from an initial position, acquires language information and visual information at each round of dialogue, i.e. at each moment; the language information comprises an instruction indicating the target position of the robot and a dialogue record describing the environment where the robot is located; the dialogue record comprises the dialogue generated at the current position, i.e. at the current moment, and the set of all previous dialogues; and the visual information comprises panoramic image information of the current position of the robot;
2) performing feature extraction on the language information through an attention mechanism to obtain semantic features; performing feature extraction on the visual information through Resnet152 to obtain low-order visual features and image classification features; performing feature extraction on the visual information through Faster R-CNN and U-net respectively to obtain target detection features and semantic segmentation features;
3) fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features at the current moment and the previous moment through an attention mechanism to obtain fusion features;
4) inputting the fusion features into a softmax classifier to predict the moving direction of the robot at the current moment, wherein at each moment, the robot predicts the moving direction through the fusion features, and finally, when the prediction result is stop, the robot reaches the target position.
In step 1), the dialogue record of the environment where the robot is located refers to the communication record generated when two human users navigate in the environment where the robot is located, wherein one human user grasps the topological information of the whole indoor environment and instructs the robot to walk through question-and-answer communication with the other human user. Each dialogue record is denoted H_t = {D_1, D_2, ..., D_i, ..., D_{t-1}}, where H_t represents the dialogue record at the t-th round of dialogue and D_i represents the i-th round of dialogue. The panoramic image corresponding to the visual information of the environment where the robot is located is denoted C; the panoramic image is divided into 12 sub-images, one for each of 12 directions, C = {c_1, c_2, ..., c_i, ..., c_12}, where c_i represents the i-th sub-image.
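As an illustrative, non-limiting example, the observation available to the robot at each round can be represented as follows (a minimal Python sketch; the class and field names are assumptions introduced here for illustration, not part of the invention):

```python
# A small sketch of the data described in step 1): the dialogue record H_t is the list
# of all dialogues up to round t, and the panorama C is split into 12 sub-images,
# one per candidate direction. Field names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Observation:
    dialogue_history: List[str] = field(default_factory=list)  # H_t = {D_1, ..., D_t}
    sub_images: List[Any] = field(default_factory=list)        # C = {c_1, ..., c_12}

obs = Observation(
    dialogue_history=["Q: which way should I go?", "A: head towards the kitchen door."],
    sub_images=[f"c_{i}" for i in range(1, 13)],
)
```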
In step 2), feature extraction is performed on the language information through an attention mechanism to obtain semantic features, which comprises the following steps:
2.1) the dialogue record H containing t rounds of dialogue and each round of dialogue D containing L words are described as:
H = {D_1, D_2, ..., D_i, ..., D_t}
D_i = {x_1, x_2, ..., x_i, ..., x_L}
where D_i represents the i-th round of dialogue and x_i represents the i-th word in a round of dialogue;
2.2) the dialogue record is vectorized through an embedding layer, and the corresponding vectorization result E is described as:
E = {G_1, G_2, ..., G_i, ..., G_t}
G_i = {g_1, g_2, ..., g_i, ..., g_L}
where G_i represents the embedding vector of the i-th round of dialogue, with t rounds of dialogue in total, and g_i represents the embedding vector of the i-th word in a round of dialogue, with L words in total;
2.3) the embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors, and the process is described as:
{h_{i,1}, h_{i,2}, ..., h_{i,L}} = LSTM({w_{i,1}, w_{i,2}, ..., w_{i,j}, ..., w_{i,L}})
d_i = h_{i,L}
M_{t-1} = {d_1, d_2, ..., d_{t-1}}
where w_{i,j} represents the embedding vector of the j-th word in the i-th round of dialogue, h_{i,L} represents the state vector at the last moment of the LSTM network and is denoted d_i, and M_{t-1} is the feature matrix formed by the first t-1 feature vectors of the dialogue record;
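As an illustrative, non-limiting example, the embedding and LSTM encoding of steps 2.2 and 2.3 can be sketched in PyTorch as follows (vocabulary size, embedding size and hidden size are assumed values, not those of the invention):

```python
# A minimal sketch of steps 2.2-2.3: each round of dialogue is embedded word by word
# and encoded with an LSTM; the last hidden state serves as that round's feature d_i.
import torch
import torch.nn as nn

class DialogueEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # embedding layer of step 2.2
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, L) word indices of one round of dialogue
        g = self.embedding(token_ids)          # (batch, L, embed_dim) embedding vectors
        outputs, (h_n, c_n) = self.lstm(g)     # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)                  # d_i: state vector at the last time step

# Usage: encode the first t-1 rounds into the history feature matrix M_{t-1}
encoder = DialogueEncoder()
rounds = [torch.randint(0, 10000, (1, 12)) for _ in range(4)]   # 4 earlier rounds, 12 tokens each
M_prev = torch.cat([encoder(r) for r in rounds], dim=0)         # (t-1, hidden_dim)
```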
2.4) the feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism, and the fusion process is described as:
A(d_t, d_i) = softmax_i(((W_Q d_t)^T (W_K d_i)) / √c) · (W_V d_i)
a_t = Σ_{i=1}^{t-1} A(d_t, d_i)
e_t = concat(a_t, d_t)
where d_t and d_i respectively represent the state vectors h_{t,L} and h_{i,L}, A(d_t, d_i) represents the attention of the vector d_t to d_i, W_Q, W_K and W_V represent parameters of the model, c represents the dimension of the vectors d_t and d_i, and softmax_i denotes the softmax function computed over the t-1 history vectors; concat represents the merging of vectors; a_t is the result of weighting and combining all the d_i according to the attention values, and e_t is the semantic feature corresponding to the history of the t-th round of dialogue, obtained by merging a_t and d_t;
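As an illustrative, non-limiting example, the attention fusion of step 2.4 can be sketched in PyTorch as a standard scaled dot-product attention (an assumed formulation; the exact equations used by the invention may differ, and the dimension is an assumed value):

```python
# A minimal sketch of step 2.4: the current round's feature d_t attends over the history
# features d_1..d_{t-1}; the weighted result is concatenated with d_t to give e_t.
import math
import torch
import torch.nn as nn

class HistoryAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.W_Q = nn.Linear(dim, dim, bias=False)   # model parameters W_Q, W_K, W_V
        self.W_K = nn.Linear(dim, dim, bias=False)
        self.W_V = nn.Linear(dim, dim, bias=False)
        self.dim = dim

    def forward(self, d_t, history):
        # d_t: (1, dim) current-round feature; history: (t-1, dim) matrix M_{t-1}
        q, k, v = self.W_Q(d_t), self.W_K(history), self.W_V(history)
        scores = q @ k.t() / math.sqrt(self.dim)       # attention of d_t to each d_i
        weights = torch.softmax(scores, dim=-1)        # normalized over the t-1 rounds
        a_t = weights @ v                              # weighted combination of the history
        return torch.cat([a_t, d_t], dim=-1)           # semantic feature e_t

sem = HistoryAttention()(torch.randn(1, 512), torch.randn(4, 512))   # (1, 1024)
```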
Performing feature extraction on the visual information through Resnet152 to obtain low-order visual features and image classification features, and performing feature extraction on the visual information through Faster R-CNN and U-net respectively to obtain target detection features and semantic segmentation features: in each round of dialogue the robot arrives at a new position and acquires the panoramic view at that position; the panoramic view corresponding to the t-th round of dialogue is denoted P_t. P_t is passed through the neural network model Resnet152 for feature extraction; the obtained feature result is taken as the low-order visual feature, denoted V_t, and the obtained image classification result is taken as the image classification feature, denoted C_t. P_t is input into the Faster R-CNN network, and the obtained target detection result is taken as the target detection feature, denoted O_t. P_t is input into the U-net network, and the obtained semantic segmentation result is taken as the semantic segmentation feature, denoted S_t.
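As an illustrative, non-limiting example, the visual branch of step 2) can be sketched with torchvision models as follows (the pretrained-weight identifiers and the use of fasterrcnn_resnet50_fpn as a stand-in for the Faster R-CNN detector are assumptions; the U-net semantic-segmentation branch is assumed to come from a separate model and is omitted):

```python
# A minimal sketch of the visual feature extraction: ResNet152 supplies the low-order
# visual features (penultimate layer) and image classification features (final logits)
# for each of the 12 sub-images; a Faster R-CNN detector supplies detection results.
import torch
import torchvision

resnet = torchvision.models.resnet152(weights="IMAGENET1K_V1").eval()
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])                 # up to global pooling
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def extract_visual_features(sub_images):
    # sub_images: (12, 3, H, W) tensor, one image per candidate direction
    with torch.no_grad():
        v_t = backbone(sub_images).flatten(1)   # low-order visual features V_t: (12, 2048)
        c_t = resnet(sub_images)                # image classification features C_t: (12, 1000)
        o_t = detector(list(sub_images))        # detection results O_t: boxes, labels, scores
    return v_t, c_t, o_t

v_t, c_t, o_t = extract_visual_features(torch.rand(12, 3, 224, 224))
```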
In step 3), the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features at the current moment and the previous moment are fused through an attention mechanism to obtain fusion features, which comprises the following steps:
3.1) the low-order visual features, the image classification features, the target detection features and the semantic segmentation features are fused with the fusion feature corresponding to the (t-1)-th round of dialogue, i.e. time t-1: each feature vector is mapped by the nonlinear function f_v, the fusion feature x_{t-1} obtained at time t-1 is mapped by the nonlinear function f_vlm, and the two mapped results are combined, with a scaling determined by l, to obtain the fused feature vectors; here v_{t,i}, c_{t,i}, o_{t,i} and s_{t,i} respectively represent the low-order visual feature vector, the image classification feature vector, the target detection feature vector and the semantic segmentation feature vector of the i-th sub-image in the t-th round of dialogue, i.e. the vectors of the low-order visual feature matrix V_t, the image classification feature matrix C_t, the target detection feature matrix O_t and the semantic segmentation feature matrix S_t; x_{t-1} denotes the fusion feature obtained at time t-1, f_v and f_vlm represent nonlinear mapping functions, and l represents the vector dimension of x_{t-1}; the fused low-order visual feature vector, image classification feature vector, target detection feature vector and semantic segmentation feature vector are denoted v'_{t,i}, c'_{t,i}, o'_{t,i} and s'_{t,i}, and form the fused feature matrices V'_t, C'_t, O'_t and S'_t;
3.2) the fused low-order visual features, image classification features, target detection features, semantic segmentation features and the semantic features are further fused through an attention mechanism: the semantic feature e_t of the t-th round of dialogue is multiplied by a parameter matrix so as to be mapped into the space of the fused feature vectors, the attention weights of e_t over the 12 fused sub-image vectors of each modality are computed through the softmax function with a scaling determined by h, and the attention-weighted combinations give the fused feature of each modality; here V'_t, C'_t, O'_t and S'_t respectively represent the fused low-order visual feature matrix, image classification feature matrix, target detection feature matrix and semantic segmentation feature matrix of the t-th round of dialogue; e_t represents the semantic feature of the t-th round of dialogue and h represents its dimension; softmax represents the softmax function; the low-order visual feature, image classification feature, target detection feature and semantic segmentation feature after attention-mechanism fusion are denoted z_t^v, z_t^c, z_t^o and z_t^s respectively;
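As an illustrative, non-limiting example, the attention pooling of step 3.2 for a single modality can be sketched as follows (the scaled dot-product form and the dimensions are assumptions):

```python
# A minimal sketch of step 3.2: the semantic feature e_t is mapped to a query and
# attends over the 12 fused sub-image vectors of one modality, yielding one
# attention-pooled feature for that modality.
import math
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    def __init__(self, sem_dim=1024, feat_dim=512):
        super().__init__()
        self.query = nn.Linear(sem_dim, feat_dim, bias=False)  # maps e_t into the feature space

    def forward(self, e_t, feats):
        # e_t: (1, sem_dim); feats: (12, feat_dim) fused vectors of one modality
        q = self.query(e_t)                                                    # (1, feat_dim)
        w = torch.softmax(feats @ q.t() / math.sqrt(feats.size(1)), dim=0)     # (12, 1) weights
        return (w * feats).sum(dim=0, keepdim=True)                            # pooled feature (1, feat_dim)

pooled_v = ModalityAttention()(torch.randn(1, 1024), torch.randn(12, 512))
```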
3.3) the fused features are further processed through an LSTM network and finally merged into the final coding feature, and the process is as follows:
r_t^v = LSTM(z_t^v)
r_t^c = LSTM(z_t^c)
r_t^o = LSTM(z_t^o)
r_t^s = LSTM(z_t^s)
x_t = concat(r_t^v, r_t^c, r_t^o, r_t^s)
where r_t^v, r_t^c, r_t^o and r_t^s respectively represent the low-order visual features, image classification features, target detection features and semantic segmentation features processed by the LSTM network; concat represents the merging of vectors; and x_t represents the fusion feature corresponding to the t-th round of dialogue, i.e. the final coding feature.
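As an illustrative, non-limiting example, the per-modality LSTM processing and merging of step 3.3 can be sketched as follows (one LSTM per modality and the dimensions are assumptions):

```python
# A minimal sketch of step 3.3: each attention-fused modality feature is passed through
# its own LSTM (whose state can be carried across dialogue rounds) and the four outputs
# are concatenated into the final coding feature x_t.
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.lstms = nn.ModuleDict({
            name: nn.LSTM(feat_dim, hidden_dim, batch_first=True)
            for name in ("visual", "classification", "detection", "segmentation")
        })

    def forward(self, feats, states=None):
        # feats: dict of (1, feat_dim) attention-fused features, one per modality
        states = states or {}
        outs, new_states = [], {}
        for name, x in feats.items():
            out, st = self.lstms[name](x.unsqueeze(1), states.get(name))  # one time step
            outs.append(out.squeeze(1))
            new_states[name] = st
        return torch.cat(outs, dim=-1), new_states     # final coding feature x_t and LSTM states

fuse = FusionEncoder()
x_t, states = fuse({k: torch.randn(1, 512) for k in ("visual", "classification", "detection", "segmentation")})
```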
In step 4), the fusion feature is input into a softmax classifier for moving-direction prediction, which comprises the following steps:
4.1) the final coding feature x_t is mapped with an activation function, as follows:
m_t = σ(f_m(x_t))
where σ is the sigmoid activation function, f_m is a nonlinear mapping function, and m_t is the activation result;
4.2) the final result is calculated from the activation result through the softmax function, as follows:
y_t = softmax(f_a(m_t))
where softmax represents the softmax function and f_a is a nonlinear mapping function.
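As an illustrative, non-limiting example, the classifier of step 4) can be sketched as follows (the 13-way output, i.e. 12 directions plus a stop action, and the layer sizes are assumptions):

```python
# A minimal sketch of step 4: the final coding feature is passed through a sigmoid-
# activated layer (standing in for f_m) and then through a second layer (standing in
# for f_a) followed by softmax, giving a distribution over candidate actions.
import torch
import torch.nn as nn

class DirectionClassifier(nn.Module):
    def __init__(self, in_dim=2048, hidden_dim=512, num_actions=13):
        super().__init__()
        self.f_m = nn.Linear(in_dim, hidden_dim)       # mapping f_m (sigmoid supplies the nonlinearity)
        self.f_a = nn.Linear(hidden_dim, num_actions)  # mapping f_a before the softmax

    def forward(self, x_t):
        m_t = torch.sigmoid(self.f_m(x_t))             # activation result of step 4.1
        return torch.softmax(self.f_a(m_t), dim=-1)    # direction probabilities of step 4.2

probs = DirectionClassifier()(torch.randn(1, 2048))
action = probs.argmax(dim=-1)   # index of the predicted moving direction (or stop)
```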
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention performs robot autonomous navigation using visual information and natural language, which saves the overhead of obtaining an accurate measurement map and can adapt to complex environments.
2. The invention combines natural language instructions with machine vision, so that robot autonomous navigation can be carried out more conveniently and efficiently.
3. By combining the features of two different modalities of information, the invention improves navigation efficiency and saves overhead while ensuring the navigation effect.
Drawings
FIG. 1 is a flow chart illustrating autonomous navigation according to the present invention.
FIG. 2 is a schematic diagram of a model architecture construction process for feature extraction and navigation instruction prediction based on attention mechanism.
In Fig. 2, the dialogue history represents the questions of the robot and the answer records of the human user; the current-time dialogue represents the questions of the robot and the answers of the human user in the current round of dialogue; Encoding represents encoding the dialogue information and converting it into encoding vectors; from the machine vision image, the Resnet152, Faster R-CNN and U-net models extract the low-order visual features, the image classification features, the target detection features and the semantic segmentation features; Attention denotes the attention module, which is used for feature extraction of the semantic information (yielding the semantic feature e_t), feature extraction of the visual information, and fusion of the semantic features and visual features with the fusion feature x_{t-1} obtained in the (t-1)-th round of dialogue, i.e. at time t-1, yielding the low-order visual features, image classification features, target detection features and semantic segmentation features fused with the semantic features; the fused features are input into the softmax module and the final result is calculated.
Fig. 3 is a schematic view of the principle of the attention mechanism, in which d_t and d_i respectively represent the feature vectors used to calculate attention; W_Q, W_K and W_V are the parameters used to map d_t and d_i to the same dimension; Matmul stands for matrix multiplication; the calculation result is normalized by a softmax module to obtain the attention result A(d_t, d_i); the results of all attention modules are merged and then combined with d_t to obtain the final result e_t.
Detailed Description
The present invention will be further described with reference to the following specific examples and drawings, but the embodiments of the present invention are not limited thereto.
As shown in fig. 1 to 3, the method for implementing autonomous navigation of a robot based on natural language and machine vision provided by the present embodiment includes the following steps:
1) the robot, starting from an initial position, acquires language information and visual information at each round of dialogue (each moment); the language information comprises an instruction indicating the target position of the robot and a dialogue record describing the environment where the robot is located; the dialogue record comprises the dialogue generated at the current position (i.e. the current moment) and the set of all previous dialogues, and the visual information comprises panoramic image information of the current position of the robot. The dialogue record of the environment where the robot is located refers to the communication record generated when two human users navigate in the environment where the robot is located, wherein one human user grasps the topological information of the whole indoor environment and instructs the robot to walk through question-and-answer communication with the other human user. Each dialogue record is denoted H_t = {D_1, D_2, ..., D_i, ..., D_{t-1}}, where H_t represents the dialogue record at the t-th round of dialogue and D_i represents the i-th round of dialogue. The panoramic image corresponding to the visual information of the environment where the robot is located is denoted C; it is divided into 12 sub-images, one for each of 12 directions, C = {c_1, c_2, ..., c_i, ..., c_12}, where c_i represents the i-th sub-image, as shown by the images corresponding to the alternative action directions in Fig. 2.
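As an illustrative, non-limiting example, the overall navigation loop of step 1) can be sketched as follows (the robot and model objects are dummy stand-ins introduced here; the real system would use the feature extraction and fusion described in the following steps):

```python
# A self-contained sketch of the control flow: observe the panorama, converse, predict
# a direction from the fused features, move, and stop when "stop" is predicted.
import random

N_DIRECTIONS = 12
STOP = N_DIRECTIONS            # assumed index of the stop action (after the 12 directions)

class DummyRobot:
    def observe_panorama(self):          # stands in for the camera returning 12 sub-images
        return [f"sub_image_{i}" for i in range(N_DIRECTIONS)]
    def ask_oracle(self):                # stands in for one question/answer round
        return "oracle answer"
    def move_towards(self, direction):   # stands in for issuing a motion command
        pass

class DummyModel:
    def predict(self, dialogue_history, sub_images):   # stands in for the fused-feature classifier
        return random.randrange(N_DIRECTIONS + 1)

def navigate(robot, model, instruction, max_rounds=20):
    dialogue_history = [instruction]                    # H_t grows by one dialogue per round
    for _ in range(max_rounds):
        sub_images = robot.observe_panorama()           # panoramic view split into 12 directions
        dialogue_history.append(robot.ask_oracle())     # dialogue of the current round
        action = model.predict(dialogue_history, sub_images)
        if action == STOP:                              # prediction "stop" => target reached
            break
        robot.move_towards(action)

navigate(DummyRobot(), DummyModel(), "go to the kitchen")
```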
2) The method for extracting the features of the language information through the attention mechanism to obtain the semantic features comprises the following steps:
2.1) the dialogue record H containing t rounds of dialogue and each round of dialogue D containing L words are described as:
H = {D_1, D_2, ..., D_i, ..., D_t}
D_i = {x_1, x_2, ..., x_i, ..., x_L}
where D_i represents the i-th round of dialogue and x_i represents the i-th word in a round of dialogue;
2.2) the dialogue record is vectorized through an embedding layer, and the corresponding vectorization result E is described as:
E = {G_1, G_2, ..., G_i, ..., G_t}
G_i = {g_1, g_2, ..., g_i, ..., g_L}
where G_i represents the embedding vector of the i-th round of dialogue, with t rounds of dialogue in total, and g_i represents the embedding vector of the i-th word in a round of dialogue, with L words in total;
2.3) the embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors, and the process is described as:
{h_{i,1}, h_{i,2}, ..., h_{i,L}} = LSTM({w_{i,1}, w_{i,2}, ..., w_{i,j}, ..., w_{i,L}})
d_i = h_{i,L}
M_{t-1} = {d_1, d_2, ..., d_{t-1}}
where w_{i,j} represents the embedding vector of the j-th word in the i-th round of dialogue, h_{i,L} represents the state vector at the last moment of the LSTM network and is denoted d_i, and M_{t-1} is the feature matrix formed by the first t-1 feature vectors of the dialogue record;
2.4) the feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism; the principle of the attention mechanism and the calculation process of the attention result are shown in Fig. 3, and the fusion process is described as:
A(d_t, d_i) = softmax_i(((W_Q d_t)^T (W_K d_i)) / √c) · (W_V d_i)
a_t = Σ_{i=1}^{t-1} A(d_t, d_i)
e_t = concat(a_t, d_t)
where d_t and d_i respectively represent the state vectors h_{t,L} and h_{i,L}, A(d_t, d_i) represents the attention of the vector d_t to d_i, W_Q, W_K and W_V represent parameters of the model, c represents the dimension of the vectors d_t and d_i, and softmax_i denotes the softmax function computed over the t-1 history vectors; concat represents the merging of vectors; a_t is the result of weighting and combining all the d_i according to the attention values, and e_t is the semantic feature corresponding to the history of the t-th round of dialogue, obtained by merging a_t and d_t;
as shown in fig. 2, feature extraction is performed on visual information through Resnet152 to obtain low-order visual features and image classification features, feature extraction is performed on visual information through false-RCNN and U-net to obtain target detection features and semantic segmentation features, specifically, in each round of conversation, a robot arrives at a new position, then a panoramic view at the position is obtained, and a corresponding panoramic view in t rounds of conversation is represented as Pt(ii) a Will PtFeature extraction is performed through a neural network model Resnet152, and the obtained feature result is taken as a low-order visual feature and is expressed as VtAnd taking the obtained image classification result as an image classification characteristic, which is expressed as Ct(ii) a Will PtInputting the target detection result into the master-RCNN network as the target detectionMeasured characteristic, denoted as Ot(ii) a Will PtInputting the semantic segmentation result into a U-net network, and taking the obtained semantic segmentation result as a semantic segmentation feature expressed as St
3) Fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features at the current moment and the previous moment through an attention mechanism to obtain fusion features, comprising the following steps:
3.1) the low-order visual features, the image classification features, the target detection features and the semantic segmentation features are fused with the fusion feature corresponding to the (t-1)-th round of dialogue, i.e. time t-1: each feature vector is mapped by the nonlinear function f_v, the fusion feature x_{t-1} obtained at time t-1 is mapped by the nonlinear function f_vlm, and the two mapped results are combined, with a scaling determined by l, to obtain the fused feature vectors; here v_{t,i}, c_{t,i}, o_{t,i} and s_{t,i} respectively represent the low-order visual feature vector, the image classification feature vector, the target detection feature vector and the semantic segmentation feature vector of the i-th sub-image in the t-th round of dialogue, i.e. the vectors of the low-order visual feature matrix V_t, the image classification feature matrix C_t, the target detection feature matrix O_t and the semantic segmentation feature matrix S_t; x_{t-1} denotes the fusion feature obtained at time t-1, f_v and f_vlm represent nonlinear mapping functions, and l represents the vector dimension of x_{t-1}; the fused low-order visual feature vector, image classification feature vector, target detection feature vector and semantic segmentation feature vector are denoted v'_{t,i}, c'_{t,i}, o'_{t,i} and s'_{t,i}, and form the fused feature matrices V'_t, C'_t, O'_t and S'_t;
3.2) the fused low-order visual features, image classification features, target detection features, semantic segmentation features and the semantic features are further fused through an attention mechanism: the semantic feature e_t of the t-th round of dialogue is multiplied by a parameter matrix so as to be mapped into the space of the fused feature vectors, the attention weights of e_t over the 12 fused sub-image vectors of each modality are computed through the softmax function with a scaling determined by h, and the attention-weighted combinations give the fused feature of each modality; here V'_t, C'_t, O'_t and S'_t respectively represent the fused low-order visual feature matrix, image classification feature matrix, target detection feature matrix and semantic segmentation feature matrix of the t-th round of dialogue; e_t represents the semantic feature of the t-th round of dialogue and h represents its dimension; softmax represents the softmax function; the low-order visual feature, image classification feature, target detection feature and semantic segmentation feature after attention-mechanism fusion are denoted z_t^v, z_t^c, z_t^o and z_t^s respectively;
3.3) the fused features are further processed through an LSTM network and finally merged into the final coding feature, and the process is as follows:
r_t^v = LSTM(z_t^v)
r_t^c = LSTM(z_t^c)
r_t^o = LSTM(z_t^o)
r_t^s = LSTM(z_t^s)
x_t = concat(r_t^v, r_t^c, r_t^o, r_t^s)
where r_t^v, r_t^c, r_t^o and r_t^s respectively represent the low-order visual features, image classification features, target detection features and semantic segmentation features processed by the LSTM network; concat represents the merging of vectors; and x_t represents the fusion feature corresponding to the t-th round of dialogue, i.e. the final coding feature.
4) The fusion feature is input into a softmax classifier to predict the moving direction, comprising the following steps:
4.1) the final coding feature x_t is mapped with an activation function, as follows:
m_t = σ(f_m(x_t))
where σ is the sigmoid activation function, f_m is a nonlinear mapping function, and m_t is the activation result;
4.2) the final result is calculated from the activation result through the softmax function, as follows:
y_t = softmax(f_a(m_t))
where softmax represents the softmax function and f_a is a nonlinear mapping function.
The above-described embodiments are only preferred embodiments of the present invention, and not intended to limit the scope of the present invention, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and they are included in the scope of the present invention.

Claims (4)

1. A method for realizing autonomous navigation of a robot based on natural language and machine vision is characterized by comprising the following steps:
1) the robot starts from an initial position and acquires language information and visual information at each round of dialogue, i.e. at each moment; the language information comprises an instruction indicating the target position of the robot and a dialogue record describing the environment where the robot is located; the dialogue record comprises the dialogue generated at the current position, i.e. at the current moment, and the set of all previous dialogues; and the visual information comprises panoramic image information of the current position of the robot;
2) performing feature extraction on the language information through an attention mechanism to obtain semantic features; performing feature extraction on the visual information through Resnet152 to obtain low-order visual features and image classification features; performing feature extraction on the visual information through Faster R-CNN and U-net respectively to obtain target detection features and semantic segmentation features;
3) fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features at the current moment and the previous moment through an attention mechanism to obtain fusion features, comprising the following steps:
3.1) the low-order visual features, the image classification features, the target detection features and the semantic segmentation features are fused with the fusion feature corresponding to the (t-1)-th round of dialogue, i.e. time t-1: each feature vector is mapped by the nonlinear function f_v, the fusion feature x_{t-1} obtained at time t-1 is mapped by the nonlinear function f_vlm, and the two mapped results are combined, with a scaling determined by l, to obtain the fused feature vectors; here v_{t,i}, c_{t,i}, o_{t,i} and s_{t,i} respectively represent the low-order visual feature vector, the image classification feature vector, the target detection feature vector and the semantic segmentation feature vector of the i-th sub-image in the t-th round of dialogue, i.e. the vectors of the low-order visual feature matrix V_t, the image classification feature matrix C_t, the target detection feature matrix O_t and the semantic segmentation feature matrix S_t; x_{t-1} denotes the fusion feature obtained at time t-1, f_v and f_vlm represent nonlinear mapping functions, and l represents the vector dimension of x_{t-1}; the fused low-order visual feature vector, image classification feature vector, target detection feature vector and semantic segmentation feature vector are denoted v'_{t,i}, c'_{t,i}, o'_{t,i} and s'_{t,i}, and form the fused feature matrices V'_t, C'_t, O'_t and S'_t;
3.2) the fused low-order visual features, image classification features, target detection features, semantic segmentation features and the semantic features are further fused through an attention mechanism: the semantic feature e_t of the t-th round of dialogue is multiplied by a parameter matrix so as to be mapped into the space of the fused feature vectors, the attention weights of e_t over the 12 fused sub-image vectors of each modality are computed through the softmax function with a scaling determined by h, and the attention-weighted combinations give the fused feature of each modality; here V'_t, C'_t, O'_t and S'_t respectively represent the fused low-order visual feature matrix, image classification feature matrix, target detection feature matrix and semantic segmentation feature matrix of the t-th round of dialogue; e_t represents the semantic feature of the t-th round of dialogue and h represents its dimension; softmax represents the softmax function; the low-order visual feature, image classification feature, target detection feature and semantic segmentation feature after attention-mechanism fusion are denoted z_t^v, z_t^c, z_t^o and z_t^s respectively;
3.3) the fused features are further processed through an LSTM network and finally merged into the final coding feature, and the process is as follows:
r_t^v = LSTM(z_t^v)
r_t^c = LSTM(z_t^c)
r_t^o = LSTM(z_t^o)
r_t^s = LSTM(z_t^s)
x_t = concat(r_t^v, r_t^c, r_t^o, r_t^s)
where r_t^v, r_t^c, r_t^o and r_t^s respectively represent the low-order visual features, image classification features, target detection features and semantic segmentation features processed by the LSTM network; concat represents the merging of vectors; and x_t represents the fusion feature corresponding to the t-th round of dialogue, i.e. the final coding feature;
4) inputting the fusion features into a softmax classifier to predict the moving direction of the robot at the current moment, wherein at each moment, the robot predicts the moving direction through the fusion features, and finally, when the prediction result is stop, the robot reaches the target position.
2. The method for realizing autonomous navigation of the robot based on the natural language and the machine vision according to claim 1, characterized in that: in step 1), the dialogue record of the environment where the robot is located refers to the communication record generated when two human users navigate in the environment where the robot is located, wherein one human user knows the topological information of the whole indoor environment and instructs the robot to walk through question-and-answer communication with the other human user; each dialogue record is denoted H_t = {D_1, D_2, ..., D_i, ..., D_{t-1}}, where H_t represents the dialogue record at the t-th round of dialogue and D_i represents the i-th round of dialogue; the panoramic image corresponding to the visual information of the environment where the robot is located is denoted C, and is divided into 12 sub-images, one for each of 12 directions, C = {c_1, c_2, ..., c_i, ..., c_12}, where c_i represents the i-th sub-image.
3. The method for realizing autonomous navigation of the robot based on the natural language and the machine vision according to claim 1, characterized in that: in step 2), feature extraction is performed on the language information through an attention mechanism to obtain semantic features, comprising the following steps:
2.1) the dialogue record H containing t rounds of dialogue and each round of dialogue D containing L words are described as:
H = {D_1, D_2, ..., D_i, ..., D_t}
D_i = {x_1, x_2, ..., x_i, ..., x_L}
where D_i represents the i-th round of dialogue and x_i represents the i-th word in a round of dialogue;
2.2) the dialogue record is vectorized through an embedding layer, and the corresponding vectorization result E is described as:
E = {G_1, G_2, ..., G_i, ..., G_t}
G_i = {g_1, g_2, ..., g_i, ..., g_L}
where G_i represents the embedding vector of the i-th round of dialogue, with t rounds of dialogue in total, and g_i represents the embedding vector of the i-th word in a round of dialogue, with L words in total;
2.3) the embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors, and the process is described as:
{h_{i,1}, h_{i,2}, ..., h_{i,L}} = LSTM({w_{i,1}, w_{i,2}, ..., w_{i,j}, ..., w_{i,L}})
d_i = h_{i,L}
M_{t-1} = {d_1, d_2, ..., d_{t-1}}
where w_{i,j} represents the embedding vector of the j-th word in the i-th round of dialogue, h_{i,L} represents the state vector at the last moment of the LSTM network and is denoted d_i, and M_{t-1} is the feature matrix formed by the first t-1 feature vectors of the dialogue record;
2.4) the feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism, and the fusion process is described as:
A(d_t, d_i) = softmax_i(((W_Q d_t)^T (W_K d_i)) / √c) · (W_V d_i)
a_t = Σ_{i=1}^{t-1} A(d_t, d_i)
e_t = concat(a_t, d_t)
where d_t and d_i respectively represent the state vectors h_{t,L} and h_{i,L}, A(d_t, d_i) represents the attention of the vector d_t to d_i, W_Q, W_K and W_V represent parameters of the model, c represents the dimension of the vectors d_t and d_i, and softmax_i denotes the softmax function computed over the t-1 history vectors; concat represents the merging of vectors; a_t is the result of weighting and combining all the d_i according to the attention values, and e_t is the semantic feature corresponding to the history of the t-th round of dialogue, obtained by merging a_t and d_t;
performing feature extraction on the visual information through Resnet152 to obtain low-order visual features and image classification features, and performing feature extraction on the visual information through Faster R-CNN and U-net respectively to obtain target detection features and semantic segmentation features: in each round of dialogue the robot arrives at a new position and acquires the panoramic view at that position, the panoramic view corresponding to the t-th round of dialogue being denoted P_t; P_t is passed through the neural network model Resnet152 for feature extraction, the obtained feature result being taken as the low-order visual feature, denoted V_t, and the obtained image classification result being taken as the image classification feature, denoted C_t; P_t is input into the Faster R-CNN network, and the obtained target detection result is taken as the target detection feature, denoted O_t; P_t is input into the U-net network, and the obtained semantic segmentation result is taken as the semantic segmentation feature, denoted S_t.
4. The method for realizing autonomous navigation of the robot based on the natural language and the machine vision according to claim 1, characterized in that: in step 4), the fusion feature is input into a softmax classifier for moving-direction prediction, comprising the following steps:
4.1) the final coding feature x_t is mapped with an activation function, as follows:
m_t = σ(f_m(x_t))
where σ is the sigmoid activation function, f_m is a nonlinear mapping function, and m_t is the activation result;
4.2) the final result is calculated from the activation result through the softmax function, as follows:
y_t = softmax(f_a(m_t))
where softmax represents the softmax function and f_a is a nonlinear mapping function.
CN202110597437.8A 2021-05-31 2021-05-31 Method for realizing autonomous navigation of robot based on natural language and machine vision Active CN113420606B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110597437.8A CN113420606B (en) 2021-05-31 2021-05-31 Method for realizing autonomous navigation of robot based on natural language and machine vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110597437.8A CN113420606B (en) 2021-05-31 2021-05-31 Method for realizing autonomous navigation of robot based on natural language and machine vision

Publications (2)

Publication Number Publication Date
CN113420606A CN113420606A (en) 2021-09-21
CN113420606B true CN113420606B (en) 2022-06-14

Family

ID=77713311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110597437.8A Active CN113420606B (en) 2021-05-31 2021-05-31 Method for realizing autonomous navigation of robot based on natural language and machine vision

Country Status (1)

Country Link
CN (1) CN113420606B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114029963B (en) * 2022-01-12 2022-03-29 北京具身智能科技有限公司 Robot operation method based on visual and auditory fusion
CN115082915B (en) * 2022-05-27 2024-03-29 华南理工大学 Multi-modal feature-based mobile robot vision-language navigation method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN110609891A (en) * 2019-09-18 2019-12-24 合肥工业大学 Visual dialog generation method based on context awareness graph neural network
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825829B (en) * 2019-10-16 2023-05-26 华南理工大学 Method for realizing autonomous navigation of robot based on natural language and semantic map
CN112504261B (en) * 2020-11-09 2024-02-09 中国人民解放军国防科技大学 Unmanned aerial vehicle falling pose filtering estimation method and system based on visual anchor points
CN112710310B (en) * 2020-12-07 2024-04-19 深圳龙岗智能视听研究院 Visual language indoor navigation method, system, terminal and application

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN110609891A (en) * 2019-09-18 2019-12-24 合肥工业大学 Visual dialog generation method based on context awareness graph neural network
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network

Also Published As

Publication number Publication date
CN113420606A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
CN110188598B (en) Real-time hand posture estimation method based on MobileNet-v2
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN108829677B (en) Multi-modal attention-based automatic image title generation method
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN112860888B (en) Attention mechanism-based bimodal emotion analysis method
CN113420606B (en) Method for realizing autonomous navigation of robot based on natural language and machine vision
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN110795990B (en) Gesture recognition method for underwater equipment
CN110825829B (en) Method for realizing autonomous navigation of robot based on natural language and semantic map
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
CN113064968B (en) Social media emotion analysis method and system based on tensor fusion network
CN113221663A (en) Real-time sign language intelligent identification method, device and system
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN117197878B (en) Character facial expression capturing method and system based on machine learning
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN113989933A (en) Online behavior recognition model training and detecting method and system
CN116189306A (en) Human behavior recognition method based on joint attention mechanism
CN115018819A (en) Weld point position extraction method based on Transformer neural network
CN112926569B (en) Method for detecting natural scene image text in social network
CN112800958B (en) Lightweight human body key point detection method based on heat point diagram
CN115311598A (en) Video description generation system based on relation perception
CN117576279B (en) Digital person driving method and system based on multi-mode data
CN113780350B (en) ViLBERT and BiLSTM-based image description method
CN117011650B (en) Method and related device for determining image encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant