CN113420606B - Method for realizing autonomous navigation of robot based on natural language and machine vision - Google Patents
- Publication number: CN113420606B
- Application number: CN202110597437.8A
- Authority: CN (China)
- Legal status: Active (an assumption, not a legal conclusion)
Classifications
- G06F18/24 — Pattern recognition; analysing; classification techniques
- G06F18/253 — Pattern recognition; fusion techniques of extracted features
- G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
- G06N3/045 — Neural networks; combinations of networks
Abstract
The invention discloses a method for realizing autonomous navigation of a robot based on natural language and machine vision, which comprises the following steps: 1) starting from an initial position, the robot acquires language information and visual information at each round of dialogue, i.e., at each moment; 2) feature extraction is performed on the language information through an attention mechanism to obtain semantic features; feature extraction is performed on the visual information through ResNet152 to obtain low-order visual features and image classification features, and through Faster R-CNN and U-Net to obtain target detection features and semantic segmentation features; 3) the low-order visual features, image classification features, target detection features, semantic segmentation features and semantic features at the current and previous moments are fused through an attention mechanism to obtain fusion features; 4) the fusion features are input into a softmax classifier to predict the direction of movement at the current moment. The invention uses the visual information and language information of the environment where the robot is located to perform autonomous navigation without acquiring an accurate measurement map in advance.
Description
Technical Field
The invention relates to the technical fields of natural language processing, image processing and autonomous navigation, and in particular to a method for realizing indoor autonomous navigation of a mobile robot based on natural language and computer vision.
Background
In recent years, autonomous navigation of robots has been more and more widely applied in production and daily life, and more and more application scenarios require accurate and efficient autonomous navigation technology. Conventional autonomous navigation methods first scan the environment to obtain an accurate measurement map and then navigate according to that map. Acquiring an accurate measurement map requires a large amount of manpower and material resources, and autonomous navigation methods based on such maps are difficult to migrate to unknown environments. Therefore, research on autonomous navigation methods based on natural language and computer vision is of great significance.
At present, a method based on an accurate measurement map is mainly adopted in the aspect of robot autonomous navigation research, but the following problems are also faced:
(1) the acquisition of the accurate measurement map requires a large amount of resources and time to scan the environment in advance, and the cost for acquiring the accurate measurement map is high.
(2) In some complex scenes which are difficult to observe, the difficulty and the expense for obtaining the accurate measurement map are higher, and the navigation method based on the accurate measurement map may not be implemented.
(3) The navigation effect depends on the accuracy of the measurement map; in situations where an accurate measurement map is difficult to obtain, the navigation effect deteriorates.
(4) The autonomous navigation method based on an accurate measurement map navigates using only the metric information of the environment and does not exploit semantic or visual information, so it is difficult to migrate to unknown environments.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for realizing indoor autonomous navigation of a mobile robot based on natural language and machine vision, which can carry out autonomous navigation of the robot under the condition of not acquiring an accurate measurement map in advance by utilizing visual information of the environment where the robot is located and natural language dialogue records.
In order to achieve the purpose, the technical scheme provided by the invention is as follows: a method for realizing autonomous navigation of a robot based on natural language and machine vision comprises the following steps:
1) starting from an initial position, the robot acquires language information and visual information at each round of dialogue, i.e., at each moment; the language information comprises an instruction indicating the target position of the robot and a dialogue record describing the environment where the robot is located; the dialogue record comprises the dialogue generated at the current position, i.e., the current moment, together with the set of all previous dialogues; the visual information comprises panoramic image information of the robot's current position;
2) feature extraction is performed on the language information through an attention mechanism to obtain semantic features; feature extraction is performed on the visual information through ResNet152 to obtain low-order visual features and image classification features; feature extraction is performed on the visual information through Faster R-CNN and U-Net respectively to obtain target detection features and semantic segmentation features;
3) fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features at the current moment and the previous moment through an attention mechanism to obtain fusion features;
4) the fusion features are input into a softmax classifier to predict the robot's direction of movement at the current moment; at each moment the robot predicts the direction of movement from the fusion features, and when the prediction result is "stop", the robot has reached the target position.
In step 1), the dialogue record of the environment where the robot is located refers to the communication record generated when two human users navigate in that environment, where one human user knows the topological information of the whole indoor environment and instructs the robot to walk through question-and-answer communication with the other human user. Each dialogue record is represented as H_t = {D_1, D_2, ..., D_i, ..., D_{t-1}}, where H_t denotes the dialogue record at the t-th round of dialogue and D_i denotes the i-th round of dialogue. The panoramic image corresponding to the visual information of the robot's environment is denoted C; it is divided into 12 sub-images, one per direction, C = {c_1, c_2, ..., c_i, ..., c_12}, where c_i denotes the i-th sub-image.
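The division of the panorama into 12 direction sub-images can be sketched as follows; this is a minimal illustration assuming an equirectangular panorama stored as a NumPy array and split into equal-width heading sectors (the patent does not specify the exact partitioning):

```python
import numpy as np

def split_panorama(panorama: np.ndarray, n_directions: int = 12) -> list:
    """Split a panorama of shape (H, W, 3) into n_directions equal-width
    sub-images c_1..c_n, one per candidate heading direction."""
    h, w, _ = panorama.shape
    step = w // n_directions
    return [panorama[:, i * step:(i + 1) * step] for i in range(n_directions)]

# Example: a dummy 256x1536 panorama yields 12 sub-images of width 128.
pano = np.zeros((256, 1536, 3), dtype=np.uint8)
subs = split_panorama(pano)
```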
In step 2), feature extraction is performed on the language information through an attention mechanism to obtain semantic features, comprising the following steps:
2.1) Each dialogue record H contains t rounds of dialogue, and each round of dialogue D contains L words:

H = {D_1, D_2, ..., D_i, ..., D_t}
2.2) The dialogue record is vectorized through an embedding layer, and the corresponding vectorization result E is described as:

E = {G_1, G_2, ..., G_i, ..., G_t}
G_i = {g_1, g_2, ..., g_j, ..., g_L}

where G_i denotes the embedding of the i-th round of dialogue (t rounds in total) and g_j denotes the embedding vector of the j-th word in a round of dialogue (L words in total);
2.3) The embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors:

{h_{i,1}, h_{i,2}, ..., h_{i,L}} = LSTM({w_{i,1}, w_{i,2}, ..., w_{i,j}, ..., w_{i,L}})
d_i = h_{i,L}

where w_{i,j} denotes the embedding vector of the j-th word in the i-th round of dialogue, and h_{i,L} denotes the state vector at the last time step of the LSTM network, written d_i; the first t−1 feature vectors d_1, ..., d_{t-1} of the dialogue record form a feature matrix;
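Steps 2.2 and 2.3 amount to an embedding layer followed by an LSTM whose last state is taken as the round feature d_i. A minimal PyTorch sketch, with vocabulary size and layer dimensions chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

class DialogueEncoder(nn.Module):
    """Encode one round of dialogue (L word ids) into a feature vector d_i,
    taken as the LSTM state at the last time step (d_i = h_{i,L})."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # g_j for each word
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, word_ids):            # word_ids: (batch, L)
        g = self.embed(word_ids)            # (batch, L, embed_dim)
        _, (h_last, _) = self.lstm(g)       # h_last: (1, batch, hidden_dim)
        return h_last[-1]                   # d_i = h_{i,L}, (batch, hidden_dim)

enc = DialogueEncoder()
d_i = enc(torch.randint(0, 1000, (2, 10)))  # two dialogue rounds of 10 words
```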
2.4) The feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism. The attention is computed as:

A(d_t, d_i) = softmax((W_Q d_t)·(W_K d_i) / √c)

where d_t and d_i denote the state vectors h_{t,L} and h_{i,L} respectively; A(d_t, d_i) denotes the attention of vector d_t to d_i; W_Q, W_K, W_V are parameters of the model; c is the dimension of d_t and d_i; softmax denotes the softmax function and concat denotes vector concatenation. The attention values weight and combine all the d_i (after mapping through W_V), and the semantic feature corresponding to the history of the t-th round of dialogue is obtained by concatenating this weighted result with d_t.
Feature extraction is performed on the visual information through ResNet152 to obtain low-order visual features and image classification features, and through Faster R-CNN and U-Net respectively to obtain target detection features and semantic segmentation features. In each round of dialogue the robot arrives at a new position and acquires the panorama at that position; the panorama at round t is denoted P_t. P_t is passed through the neural network model ResNet152: the extracted feature result is taken as the low-order visual feature, denoted V_t, and the image classification result is taken as the image classification feature, denoted C_t. P_t is input into the Faster R-CNN network, and the target detection result is taken as the target detection feature, denoted O_t. P_t is input into the U-Net network, and the semantic segmentation result is taken as the semantic segmentation feature, denoted S_t.
In step 3), the low-order visual features, image classification features, target detection features, semantic segmentation features and semantic features at the current and previous moments are fused through an attention mechanism to obtain fusion features, comprising the following steps:

3.1) The low-order visual features, image classification features, target detection features and semantic segmentation features are fused with the fusion feature corresponding to the t−1-th round of dialogue, i.e., time t−1. In this fusion, v_{t,i}, c_{t,i}, o_{t,i}, s_{t,i} respectively denote the low-order visual, image classification, target detection and semantic segmentation feature vectors of the i-th sub-image at round t; they are the column vectors of the low-order visual feature matrix V_t, the image classification feature matrix C_t, the target detection feature matrix O_t and the semantic segmentation feature matrix S_t; f_v and f_vlm denote nonlinear mapping functions, and l denotes the vector dimension of the fusion feature obtained at time t−1; this step yields the fused low-order visual, image classification, target detection and semantic segmentation feature vectors.
3.2) The fused low-order visual features, image classification features, target detection features and semantic segmentation features are further fused with the semantic features through an attention mechanism. Here the four fused feature matrices at round t (low-order visual, image classification, target detection and semantic segmentation) are attended by the semantic feature of round t, which is multiplied by a parameter matrix and thereby mapped to dimension h, where h denotes the dimension of the semantic feature at round t; softmax denotes the softmax function; the result is the attention-fused low-order visual, image classification, target detection and semantic segmentation features.
3.3) The fused features are further processed through an LSTM network and finally concatenated into the final encoding feature. Here the LSTM-processed low-order visual, image classification, target detection and semantic segmentation features are concatenated (concat denotes vector concatenation), giving the fusion feature corresponding to the t-th round of dialogue, i.e., the final encoding feature.
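Step 3.3 can be sketched in PyTorch as an LSTM pass over each fused modality followed by concatenation; sharing one LSTM across modalities and all dimensions here are illustrative assumptions:

```python
import torch
import torch.nn as nn

# One LSTM processes each fused modality sequence (over the 12 sub-images);
# the last states are concatenated into the final encoding feature.
lstm = nn.LSTM(input_size=24, hidden_size=32, batch_first=True)

def encode_modalities(modalities):
    """modalities: list of 4 tensors (batch, seq, 24) for the fused low-order
    visual, classification, detection and segmentation features."""
    outs = []
    for m in modalities:
        _, (h_last, _) = lstm(m)     # (1, batch, 32)
        outs.append(h_last[-1])      # (batch, 32)
    return torch.cat(outs, dim=-1)   # final encoding feature, (batch, 128)

mods = [torch.randn(2, 12, 24) for _ in range(4)]   # 12 sub-images per panorama
final_feat = encode_modalities(mods)
```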
In step 4), the fusion features are input into a softmax classifier for movement-direction prediction, comprising the following steps:

4.1) The final encoding feature is mapped using an activation function, where σ is the sigmoid activation function, f_m is a nonlinear mapping function, and the output is the activation result;

4.2) The final result is calculated from the activation result through a softmax function, where softmax denotes the softmax function and f_a is a nonlinear mapping function.
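The prediction head of step 4 can be sketched as a sigmoid-activated mapping followed by a softmax over the candidate actions; treating f_m and f_a as linear maps, and using 13 actions (12 directions plus "stop"), are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
d_enc, n_actions = 128, 13   # encoding dim; 12 directions + "stop" (assumed)

W_m = rng.normal(size=(64, d_enc)) * 0.05    # stands in for f_m
W_a = rng.normal(size=(n_actions, 64)) * 0.05  # stands in for f_a

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_direction(final_feat):
    """Step 4.1: sigma(f_m(feature)); step 4.2: softmax(f_a(activation))."""
    activated = sigmoid(W_m @ final_feat)
    return softmax(W_a @ activated)          # probability per action

probs = predict_direction(rng.normal(size=d_enc))
action = int(np.argmax(probs))   # the "stop" index would end navigation
```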
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention performs robot autonomous navigation using visual information and natural language, saving the overhead of obtaining an accurate measurement map and adapting to complex environments.
2. The invention combines natural language instructions with machine vision, so that robot autonomous navigation can be carried out more conveniently and efficiently.
3. By combining the features of two different modalities of information, the invention improves navigation efficiency and reduces overhead while maintaining the navigation effect.
Drawings
FIG. 1 is a flow chart illustrating autonomous navigation according to the present invention.
FIG. 2 is a schematic diagram of a model architecture construction process for feature extraction and navigation instruction prediction based on attention mechanism.
In the figure, the dialogue history represents the robot's questions and the human user's answer records; the current-time dialogue represents the robot's questions and the human user's answers in the current round of dialogue; Encoding indicates that the dialogue information is encoded into an encoding vector; the machine-vision image is passed through the ResNet152, Faster R-CNN and U-Net models to extract the low-order visual features, image classification features, target detection features and semantic segmentation features; Attention denotes the attention module, used for feature extraction of the semantic information and of the visual information, and for fusing the semantic features, the visual features and the fusion feature obtained in the t−1-th round of dialogue, i.e., at time t−1; the resulting features are the low-order visual, image classification, target detection and semantic segmentation features fused with the semantic features; the fused features are input into the softmax module and the final result is calculated.
Fig. 3 is a schematic view of the principle of the attention mechanism, where d_t and d_i denote the feature vectors used to compute attention; W_Q, W_K, W_V are parameters used to map d_t and d_i to the same dimension; Matmul denotes matrix multiplication; the computation result is normalized by a softmax module to obtain the attention result A(d_t, d_i); the outputs of all attention modules are concatenated with d_t to obtain the final result.
Detailed Description
The present invention will be further described with reference to the following specific examples and drawings, but the embodiments of the present invention are not limited thereto.
As shown in fig. 1 to 3, the method for implementing autonomous navigation of a robot based on natural language and machine vision provided by the present embodiment includes the following steps:
1) Starting from an initial position, the robot acquires language information and visual information at each round of dialogue (each moment). The language information comprises an instruction indicating the target position of the robot and a dialogue record describing the environment where the robot is located; the dialogue record comprises the dialogue generated at the current position (i.e., the current moment) together with the set of all previous dialogues; the visual information comprises panoramic image information of the robot's current position. The dialogue record of the environment refers to the communication record generated when two human users navigate in that environment, where one human user knows the topological information of the whole indoor environment and instructs the robot to walk through question-and-answer communication with the other human user. Each dialogue record is represented as H_t = {D_1, D_2, ..., D_i, ..., D_{t-1}}, where H_t denotes the dialogue record at the t-th round of dialogue and D_i denotes the i-th round of dialogue. The panoramic image corresponding to the visual information of the robot's environment is denoted C and is divided into 12 sub-images, one per direction, C = {c_1, c_2, ..., c_i, ..., c_12}, where c_i denotes the i-th sub-image, as shown by the images corresponding to the candidate action directions in Fig. 2.
2) Feature extraction is performed on the language information through an attention mechanism to obtain semantic features, comprising the following steps:
2.1) Each dialogue record H contains t rounds of dialogue, and each round of dialogue D contains L words:

H = {D_1, D_2, ..., D_i, ..., D_t}
2.2) The dialogue record is vectorized through an embedding layer, and the corresponding vectorization result is described as:

E = {G_1, G_2, ..., G_i, ..., G_t}
G_i = {g_1, g_2, ..., g_j, ..., g_L}

where G_i denotes the embedding of the i-th round of dialogue (t rounds in total) and g_j denotes the embedding vector of the j-th word in a round of dialogue (L words in total);
2.3) The embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors:

{h_{i,1}, h_{i,2}, ..., h_{i,L}} = LSTM({w_{i,1}, w_{i,2}, ..., w_{i,j}, ..., w_{i,L}})
d_i = h_{i,L}

where w_{i,j} denotes the embedding vector of the j-th word in the i-th round of dialogue, and h_{i,L} denotes the state vector at the last time step of the LSTM network, written d_i; the first t−1 feature vectors d_1, ..., d_{t-1} of the dialogue record form a feature matrix;
2.4) The feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism; the principle of the attention mechanism and the calculation of the attention result are shown in Fig. 3. The attention is computed as:

A(d_t, d_i) = softmax((W_Q d_t)·(W_K d_i) / √c)

where d_t and d_i denote the state vectors h_{t,L} and h_{i,L} respectively; A(d_t, d_i) denotes the attention of vector d_t to d_i; W_Q, W_K, W_V are parameters of the model; c is the dimension of d_t and d_i; softmax denotes the softmax function and concat denotes vector concatenation. The attention values weight and combine all the d_i (after mapping through W_V), and the semantic feature corresponding to the history of the t-th round of dialogue is obtained by concatenating this weighted result with d_t.
As shown in Fig. 2, feature extraction is performed on the visual information through ResNet152 to obtain low-order visual features and image classification features, and through Faster R-CNN and U-Net to obtain target detection features and semantic segmentation features. Specifically, in each round of dialogue the robot arrives at a new position and acquires the panorama at that position; the panorama at round t is denoted P_t. P_t is passed through the neural network model ResNet152: the extracted feature result is taken as the low-order visual feature, denoted V_t, and the image classification result is taken as the image classification feature, denoted C_t. P_t is input into the Faster R-CNN network, and the target detection result is taken as the target detection feature, denoted O_t. P_t is input into the U-Net network, and the semantic segmentation result is taken as the semantic segmentation feature, denoted S_t.
3) The low-order visual features, image classification features, target detection features, semantic segmentation features and semantic features at the current and previous moments are fused through an attention mechanism to obtain fusion features, comprising the following steps:

3.1) The low-order visual features, image classification features, target detection features and semantic segmentation features are fused with the fusion feature corresponding to the t−1-th round of dialogue, i.e., time t−1. In this fusion, v_{t,i}, c_{t,i}, o_{t,i}, s_{t,i} respectively denote the low-order visual, image classification, target detection and semantic segmentation feature vectors of the i-th sub-image at round t; they are the column vectors of the low-order visual feature matrix V_t, the image classification feature matrix C_t, the target detection feature matrix O_t and the semantic segmentation feature matrix S_t; f_v and f_vlm denote nonlinear mapping functions, and l denotes the vector dimension of the fusion feature obtained at time t−1; this step yields the fused low-order visual, image classification, target detection and semantic segmentation feature vectors.
3.2) The fused low-order visual features, image classification features, target detection features and semantic segmentation features are further fused with the semantic features through an attention mechanism. Here the four fused feature matrices at round t (low-order visual, image classification, target detection and semantic segmentation) are attended by the semantic feature of round t, which is multiplied by a parameter matrix and thereby mapped to dimension h, where h denotes the dimension of the semantic feature at round t; softmax denotes the softmax function; the result is the attention-fused low-order visual, image classification, target detection and semantic segmentation features.
3.3) The fused features are further processed through an LSTM network and finally concatenated into the final encoding feature. Here the LSTM-processed low-order visual, image classification, target detection and semantic segmentation features are concatenated (concat denotes vector concatenation), giving the fusion feature corresponding to the t-th round of dialogue, i.e., the final encoding feature.
4) The fusion features are input into a softmax classifier for movement-direction prediction, comprising the following steps:

4.1) The final encoding feature is mapped using an activation function, where σ is the sigmoid activation function, f_m is a nonlinear mapping function, and the output is the activation result;

4.2) The final result is calculated from the activation result through a softmax function, where softmax denotes the softmax function and f_a is a nonlinear mapping function.
The above-described embodiments are only preferred embodiments of the present invention, and not intended to limit the scope of the present invention, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and they are included in the scope of the present invention.
Claims (4)
1. A method for realizing autonomous navigation of a robot based on natural language and machine vision is characterized by comprising the following steps:
1) starting from an initial position, the robot acquires language information and visual information at each round of dialogue, i.e., at each moment; the language information comprises an instruction indicating the target position of the robot and a dialogue record describing the environment where the robot is located; the dialogue record comprises the dialogue generated at the current position, i.e., the current moment, together with the set of all previous dialogues; and the visual information comprises panoramic image information of the robot's current position;
2) feature extraction is performed on the language information through an attention mechanism to obtain semantic features; feature extraction is performed on the visual information through ResNet152 to obtain low-order visual features and image classification features; feature extraction is performed on the visual information through Faster R-CNN and U-Net respectively to obtain target detection features and semantic segmentation features;
3) the low-order visual features, image classification features, target detection features, semantic segmentation features and semantic features at the current and previous moments are fused through an attention mechanism to obtain fusion features, comprising the following steps:

3.1) the low-order visual features, image classification features, target detection features and semantic segmentation features are fused with the fusion feature corresponding to the t−1-th round of dialogue, i.e., time t−1, where v_{t,i}, c_{t,i}, o_{t,i}, s_{t,i} respectively denote the low-order visual, image classification, target detection and semantic segmentation feature vectors of the i-th sub-image at round t; they are the column vectors of the low-order visual feature matrix V_t, the image classification feature matrix C_t, the target detection feature matrix O_t and the semantic segmentation feature matrix S_t; f_v and f_vlm denote nonlinear mapping functions, and l denotes the vector dimension of the fusion feature obtained at time t−1; this step yields the fused low-order visual, image classification, target detection and semantic segmentation feature vectors;
3.2) the fused low-order visual features, image classification features, target detection features and semantic segmentation features are further fused with the semantic features through an attention mechanism, where the four fused feature matrices at round t (low-order visual, image classification, target detection and semantic segmentation) are attended by the semantic feature of round t, which is multiplied by a parameter matrix and thereby mapped to dimension h, h denoting the dimension of the semantic feature at round t; softmax denotes the softmax function; the result is the attention-fused low-order visual, image classification, target detection and semantic segmentation features;
3.3) further processing the fused features through an LSTM network, and finally combining the fused features into final coding features, wherein the process is as follows:
wherein the low-order visual features, image classification features, target detection features and semantic segmentation features processed by the LSTM network are obtained; concat represents the merging of vectors; the fusion features corresponding to the t-th round of dialogue, i.e. the final coding features, are obtained by merging;
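Step 3.3 can be sketched with a hand-rolled LSTM cell. This is an illustrative sketch only: each of the four attention-fused modality features is passed through one LSTM step from a fresh state, and the resulting hidden states are concatenated; the hidden size and the single-step simplification are assumptions, not the patent's exact recurrence over dialogue turns.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    # one LSTM time step; gates stacked in order i, f, o, g
    n = h.size
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2*n]), sigmoid(z[2*n:3*n])
    g = np.tanh(z[3*n:])
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(2)
d, n = 8, 6   # feature and hidden sizes (assumed)
W = rng.normal(size=(4*n, d))
U = rng.normal(size=(4*n, n))
b = np.zeros(4*n)

# the four attention-fused modality features (visual, classification, detection, segmentation)
feats = [rng.normal(size=d) for _ in range(4)]

outs = []
for x in feats:                      # each modality through the (shared) LSTM cell
    h, c = np.zeros(n), np.zeros(n)  # fresh state per modality
    h, c = lstm_step(x, h, c, W, U, b)
    outs.append(h)

e_t = np.concatenate(outs)           # concat -> final coding feature
print(e_t.shape)  # (24,)
```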
4) inputting the fusion features into a softmax classifier to predict the moving direction of the robot at the current moment; at each moment the robot predicts its moving direction from the fusion features, and when the prediction result is "stop", the robot has reached the target position.
2. The method for realizing autonomous navigation of the robot based on natural language and machine vision according to claim 1, characterized in that: in step 1), the dialogue record of the environment where the robot is located refers to the communication record generated when two human users navigate in that environment, wherein one human user knows the topological information of the whole indoor environment and instructs the robot to walk through question-answer communication with the other human user; each dialogue record is represented as H_t = {D_1, D_2, ..., D_i, ..., D_{t-1}}, where H_t represents the dialogue record at the t-th round of dialogue and D_i represents the i-th round of dialogue; the panoramic image corresponding to the visual information of the environment where the robot is located is denoted C, and the panoramic image is divided into 12 sub-images, each representing one of 12 directions, denoted C = {c_1, c_2, ..., c_i, ..., c_12}, where c_i represents the i-th sub-image.
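The division of the panorama into 12 directional sub-images can be sketched in numpy. The panorama resolution is an assumption made up for the example; only the equal split along the horizontal axis reflects the claim.

```python
import numpy as np

rng = np.random.default_rng(7)
# a synthetic equirectangular panorama, H x W x 3 (size assumed for illustration)
panorama = rng.integers(0, 256, size=(240, 2880, 3), dtype=np.uint8)

# split into 12 sub-images c_1 .. c_12, one per 30-degree heading
subs = np.split(panorama, 12, axis=1)
print(len(subs), subs[0].shape)  # 12 (240, 240, 3)
```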
3. The method for realizing autonomous navigation of the robot based on natural language and machine vision according to claim 1, characterized in that: in step 2), feature extraction is performed on the language information through an attention mechanism to obtain semantic features, comprising the following steps:
2.1) each dialogue record H containing t rounds of dialogue, where each round of dialogue D contains L words, is described as:
H={D1,D2,...,Di,...,Dt}
2.2) vectorizing the dialogue records through an embedding layer, wherein a corresponding vectorization result E is described as follows:
E={G1,G2,...,Gi,...,Gt}
Gi={g1,g2,...,gi,...,gL}
wherein G_i represents the embedding vector of the i-th round of dialogue in the semantic space, with t rounds of dialogue in total; g_i represents the embedding vector of the i-th word in one round of dialogue, with L words in total;
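Step 2.2's embedding lookup can be sketched as a table lookup. The toy vocabulary, sentence and dimension are assumptions; in the method the embedding layer would be learned.

```python
import numpy as np

rng = np.random.default_rng(4)
# a toy vocabulary and embedding table (both assumed for illustration)
vocab = {"go": 0, "to": 1, "the": 2, "kitchen": 3}
emb_dim = 8
E_table = rng.normal(size=(len(vocab), emb_dim))

dialog = ["go", "to", "the", "kitchen"]               # one round of dialogue, L = 4 words
G_i = np.stack([E_table[vocab[w]] for w in dialog])   # embedding vectors g_1 .. g_L
print(G_i.shape)  # (4, 8)
```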
2.3) encoding the embedding vectors of the dialogue record through an LSTM network to obtain feature vectors, as follows:
{h_{i,1}, h_{i,2}, ..., h_{i,L}} = LSTM({w_{i,1}, w_{i,2}, ..., w_{i,j}, ..., w_{i,L}})
d_i = h_{i,L}
wherein w_{i,j} represents the embedding vector of the j-th word in the i-th round of dialogue, and h_{i,L} represents the state vector of the LSTM network at the last moment, denoted d_i; the first t-1 feature vectors of the dialogue record form a feature matrix;
2.4) fusing the feature matrix of the dialogue record with the feature vector of the current round of dialogue through an attention mechanism, as follows:
wherein d_t and d_i represent the state vectors h_{t,L} and h_{i,L} respectively; A(d_t, d_i) represents the attention of vector d_t to d_i; W_Q, W_K and W_V represent model parameters, and c represents the dimension of the vectors d_t and d_i; softmax represents the softmax function, and concat represents the merging of vectors; the attention values are used to weight and combine all d_i, and the semantic features corresponding to the history of the t-th round of dialogue are obtained by merging this weighted result with d_t;
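Step 2.4's attention over the dialogue history can be sketched as standard scaled dot-product attention with W_Q, W_K, W_V. The dimensions and random features are assumptions made up for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(5)
t, c = 5, 8                      # t rounds of dialogue, feature dimension c (assumed)
D = rng.normal(size=(t, c))      # feature vectors d_1 .. d_t of the dialogue record
d_t = D[-1]                      # feature vector of the current round

W_Q = rng.normal(size=(c, c))
W_K = rng.normal(size=(c, c))
W_V = rng.normal(size=(c, c))

q = W_Q @ d_t
K = D[:-1] @ W_K.T                        # keys for the history d_1 .. d_{t-1}
V = D[:-1] @ W_V.T                        # values for the history
alpha = softmax(q @ K.T / np.sqrt(c))     # A(d_t, d_i) for each history round
context = alpha @ V                       # attention-weighted combination of all d_i
q_sem = np.concatenate([context, d_t])    # merge with d_t -> semantic features
print(q_sem.shape)  # (16,)
```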
Feature extraction is performed on the visual information through Resnet152 to obtain the low-order visual features and image classification features, and through Faster-RCNN and U-net respectively to obtain the target detection features and semantic segmentation features: in each round of dialogue the robot arrives at a new position and obtains the panoramic view at that position; the panoramic view corresponding to the t-th round of dialogue is represented as P_t; P_t is passed through the neural network model Resnet152 for feature extraction, the obtained feature result being taken as the low-order visual features, represented as V_t, and the obtained image classification result being taken as the image classification features, represented as C_t; P_t is input into the Faster-RCNN network, the obtained target detection result being taken as the target detection features, represented as O_t; P_t is input into the U-net network, the obtained semantic segmentation result being taken as the semantic segmentation features, represented as S_t.
4. The method for realizing autonomous navigation of the robot based on natural language and machine vision according to claim 1, characterized in that: in step 4), the fusion features are input into a softmax classifier for movement direction prediction, comprising the following steps:
4.1) the final coding features are mapped using an activation function, as follows:
wherein σ is the sigmoid activation function, f_m is a non-linear mapping function, and the activation result is obtained;
4.2) the final result is calculated from the activation result through a softmax function, as follows:
wherein softmax represents the softmax function and f_a is a non-linear mapping function.
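Steps 4.1 and 4.2 can be sketched end to end. Assumptions: f_m and f_a are taken as linear maps (their exact form is not specified), and the action space is 12 movement directions plus "stop", with dimensions made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(6)
d = 24                                # dimension of the final coding feature (assumed)
n_actions = 13                        # 12 movement directions + "stop" (assumed)

e_t = rng.normal(size=d)              # final coding feature
W_m = rng.normal(size=(d, d))         # parameters of the non-linear mapping f_m
W_a = rng.normal(size=(n_actions, d)) # parameters of the mapping f_a

z = sigmoid(W_m @ e_t)                # step 4.1: sigmoid activation mapping
p = softmax(W_a @ z)                  # step 4.2: softmax over movement directions
done = int(np.argmax(p)) == n_actions - 1   # navigation ends when "stop" is predicted
print(p.shape)  # (13,)
```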
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110597437.8A CN113420606B (en) | 2021-05-31 | 2021-05-31 | Method for realizing autonomous navigation of robot based on natural language and machine vision |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113420606A CN113420606A (en) | 2021-09-21 |
CN113420606B true CN113420606B (en) | 2022-06-14 |
Family
ID=77713311
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110597437.8A Active CN113420606B (en) | 2021-05-31 | 2021-05-31 | Method for realizing autonomous navigation of robot based on natural language and machine vision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113420606B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114029963B (en) * | 2022-01-12 | 2022-03-29 | 北京具身智能科技有限公司 | Robot operation method based on visual and auditory fusion |
CN115082915B (en) * | 2022-05-27 | 2024-03-29 | 华南理工大学 | Multi-modal feature-based mobile robot vision-language navigation method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344288A (en) * | 2018-09-19 | 2019-02-15 | 电子科技大学 | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism |
CN110609891A (en) * | 2019-09-18 | 2019-12-24 | 合肥工业大学 | Visual dialog generation method based on context awareness graph neural network |
CN110647612A (en) * | 2019-09-18 | 2020-01-03 | 合肥工业大学 | Visual conversation generation method based on double-visual attention network |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110825829B (en) * | 2019-10-16 | 2023-05-26 | 华南理工大学 | Method for realizing autonomous navigation of robot based on natural language and semantic map |
CN112504261B (en) * | 2020-11-09 | 2024-02-09 | 中国人民解放军国防科技大学 | Unmanned aerial vehicle falling pose filtering estimation method and system based on visual anchor points |
CN112710310B (en) * | 2020-12-07 | 2024-04-19 | 深圳龙岗智能视听研究院 | Visual language indoor navigation method, system, terminal and application |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110188598B (en) | Real-time hand posture estimation method based on MobileNet-v2 | |
CN110609891B (en) | Visual dialog generation method based on context awareness graph neural network | |
CN108829677B (en) | Multi-modal attention-based automatic image title generation method | |
CN110990543A (en) | Intelligent conversation generation method and device, computer equipment and computer storage medium | |
CN112860888B (en) | Attention mechanism-based bimodal emotion analysis method | |
CN113420606B (en) | Method for realizing autonomous navigation of robot based on natural language and machine vision | |
CN111966800B (en) | Emotion dialogue generation method and device and emotion dialogue model training method and device | |
CN110795990B (en) | Gesture recognition method for underwater equipment | |
CN110825829B (en) | Method for realizing autonomous navigation of robot based on natural language and semantic map | |
CN112163498B (en) | Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method | |
CN112329525A (en) | Gesture recognition method and device based on space-time diagram convolutional neural network | |
CN113064968B (en) | Social media emotion analysis method and system based on tensor fusion network | |
CN113221663A (en) | Real-time sign language intelligent identification method, device and system | |
CN113554032A (en) | Remote sensing image segmentation method based on multi-path parallel network of high perception | |
CN117197878B (en) | Character facial expression capturing method and system based on machine learning | |
CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
CN113989933A (en) | Online behavior recognition model training and detecting method and system | |
CN116189306A (en) | Human behavior recognition method based on joint attention mechanism | |
CN115018819A (en) | Weld point position extraction method based on Transformer neural network | |
CN112926569B (en) | Method for detecting natural scene image text in social network | |
CN112800958B (en) | Lightweight human body key point detection method based on heat point diagram | |
CN115311598A (en) | Video description generation system based on relation perception | |
CN117576279B (en) | Digital person driving method and system based on multi-mode data | |
CN113780350B (en) | ViLBERT and BiLSTM-based image description method | |
CN117011650B (en) | Method and related device for determining image encoder |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |