CN113420606A - Method for realizing autonomous navigation of robot based on natural language and machine vision - Google Patents
- Publication number
- CN113420606A (application CN202110597437.8A)
- Authority
- CN
- China
- Prior art keywords
- features
- feature
- robot
- vector
- target detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/24 — Pattern recognition; Analysing; Classification techniques
- G06F18/253 — Pattern recognition; Analysing; Fusion techniques of extracted features
- G06N3/044 — Neural networks; Architecture; Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Neural networks; Architecture; Combinations of networks
Abstract
The invention discloses a method for realizing autonomous navigation of a robot based on natural language and machine vision, comprising the following steps: 1) the robot starts from an initial position and acquires language information and visual information at each round of dialogue, i.e., at each moment; 2) features are extracted from the language information through an attention mechanism to obtain semantic features; features are extracted from the visual information through ResNet-152 to obtain low-order visual features and image classification features, and through Faster R-CNN and U-Net to obtain target detection features and semantic segmentation features; 3) the low-order visual features, image classification features, target detection features, semantic segmentation features and the semantic features of the current and previous moments are fused through an attention mechanism to obtain fusion features; 4) the fusion features are input into a softmax classifier to predict the moving direction at the current moment. The invention uses the visual and language information of the robot's environment to perform autonomous navigation without acquiring an accurate metric map in advance.
Description
Technical Field
The invention relates to the technical field of natural language processing, image processing and autonomous navigation, in particular to a method for realizing indoor autonomous navigation of a mobile robot based on natural language and computer vision.
Background
In recent years, autonomous navigation of robots has been applied ever more widely in production and daily life, and more and more application scenarios require accurate and efficient autonomous navigation technology. Conventional autonomous navigation methods first scan the environment to obtain an accurate metric map and then navigate according to that map. Acquiring an accurate metric map requires substantial manpower and material resources, and an autonomous navigation method based on such a map is difficult to migrate to unknown environments. Research on autonomous navigation methods based on natural language and computer vision is therefore of great significance.
At present, research on robot autonomous navigation mainly relies on methods based on an accurate metric map, which face the following problems:
(1) Acquiring the metric map requires a large amount of resources and time to scan the environment in advance, so the acquisition cost is high.
(2) In complex scenes that are difficult to observe, obtaining an accurate metric map is even harder and more expensive, and a navigation method based on such a map may not be feasible at all.
(3) The navigation quality depends on the accuracy of the metric map; in situations where an accurate map is difficult to obtain, navigation performance degrades.
(4) Navigation based on an accurate metric map uses only the metric information of the environment and ignores semantic and visual information, so the method is difficult to migrate to unknown environments.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for realizing indoor autonomous navigation of a mobile robot based on natural language and machine vision, which performs autonomous navigation using the visual information of the robot's environment and natural language dialogue records, without acquiring an accurate metric map in advance.
To achieve this purpose, the technical scheme provided by the invention is as follows. A method for realizing autonomous navigation of a robot based on natural language and machine vision comprises the following steps:
1) the robot starts from an initial position and acquires language information and visual information at each round of dialogue, i.e., at each moment; the language information comprises an instruction indicating the robot's target position and a dialogue record describing the robot's environment, where the dialogue record comprises the dialogue generated at the current position (i.e., the current moment) together with the set of all previous dialogues; the visual information comprises panoramic image information of the robot's current position;
2) features are extracted from the language information through an attention mechanism to obtain semantic features; features are extracted from the visual information through ResNet-152 to obtain low-order visual features and image classification features, and through Faster R-CNN and U-Net respectively to obtain target detection features and semantic segmentation features;
3) fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features at the current moment and the previous moment through an attention mechanism to obtain fusion features;
4) the fusion features are input into a softmax classifier to predict the robot's moving direction at the current moment; at each moment the robot predicts its moving direction from the fusion features, and when the prediction result is "stop" the robot has reached the target position.
In step 1), the dialogue record of the robot's environment refers to the communication record generated while two human users navigate in that environment, where one human user extracts topological information of the whole indoor environment and instructs the robot to walk through question-and-answer communication with the other human user. Each dialogue record is represented as H_t = {D_1, D_2, ..., D_i, ..., D_{t-1}}, where H_t denotes the dialogue record at the t-th round of dialogue and D_i denotes the i-th round of dialogue. The panoramic image corresponding to the visual information of the robot's environment is denoted C; it is divided into 12 sub-images, each representing one of 12 directions, written C = {c_1, c_2, ..., c_i, ..., c_12}, where c_i denotes the i-th sub-image.
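The 12-direction panorama decomposition above can be sketched as follows. The 30-degree bin width (360/12) and the equal pixel-column split are assumptions made for illustration; the patent only states that the panorama is divided into 12 sub-images, one per direction.

```python
# Illustrative sketch of the panorama division C = {c_1, ..., c_12}.
# The 30-degree bin width and the equal pixel-column split are
# assumptions, not details stated in the patent text.

def heading_to_subimage(heading_deg: float) -> int:
    """Return the 1-based index i of the sub-image c_i covering heading_deg."""
    bin_width = 360.0 / 12  # 12 directions -> 30 degrees each
    return int((heading_deg % 360.0) // bin_width) + 1

def split_panorama(panorama_width: int) -> list:
    """Return the 12 (start, end) pixel-column ranges of the sub-images."""
    step = panorama_width // 12
    return [(i * step, (i + 1) * step) for i in range(12)]
```

Under this split, for example, a heading of 45 degrees falls into sub-image c_2.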
In step 2), feature extraction is performed on the language information through an attention mechanism to obtain semantic features, comprising the following steps:
2.1) the dialogue record H, containing t rounds of dialogue each of which contains L words, is described as:
H = {D_1, D_2, ..., D_i, ..., D_t}
2.2) the dialogue record is vectorized through an embedding layer, and the corresponding vectorization result E is described as:
E = {G_1, G_2, ..., G_i, ..., G_t}
G_i = {g_1, g_2, ..., g_i, ..., g_L}
where G_i represents the embedding vectors of the i-th round of dialogue (t rounds in total) and g_i represents the embedding vector of the i-th word in a round of dialogue (L words in total);
2.3) the embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors; the process is described as:
{h_{i,1}, h_{i,2}, ..., h_{i,L}} = LSTM({w_{i,1}, w_{i,2}, ..., w_{i,j}, ..., w_{i,L}})
d_i = h_{i,L}
where w_{i,j} represents the embedding vector of the j-th word in the i-th round of dialogue and h_{i,L} represents the state vector of the LSTM network at the last step, denoted d_i; the first t-1 feature vectors of the dialogue record form a feature matrix;
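Step 2.3 can be sketched with a minimal pure-Python LSTM cell that consumes the word embeddings w_{i,1..L} of one dialogue round and keeps only the last hidden state as d_i. The weight layout (four gate rows over the concatenated [x; h]) is a standard LSTM formulation assumed here for illustration, not the patent's actual parameterization:

```python
import math

def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lstm_encode(embeddings, W, b, hidden: int):
    """Encode a word-embedding sequence; return d_i = h_{i,L}.

    embeddings: list of input vectors (the w_{i,j});
    W[g][j]: weight row for gate g (0=input, 1=forget, 2=cell, 3=output)
             and hidden unit j, applied to the concatenated [x_t; h_{t-1}];
    b[g][j]: the matching bias.
    """
    h = [0.0] * hidden
    c = [0.0] * hidden
    for x in embeddings:
        z = list(x) + h  # concatenated [x_t; h_{t-1}]
        pre = [[b[g][j] + sum(W[g][j][k] * z[k] for k in range(len(z)))
                for j in range(hidden)] for g in range(4)]
        i_g = [_sigmoid(v) for v in pre[0]]
        f_g = [_sigmoid(v) for v in pre[1]]
        g_g = [math.tanh(v) for v in pre[2]]
        o_g = [_sigmoid(v) for v in pre[3]]
        c = [f_g[j] * c[j] + i_g[j] * g_g[j] for j in range(hidden)]
        h = [o_g[j] * math.tanh(c[j]) for j in range(hidden)]
    return h  # the last state vector h_{i,L}, i.e. d_i
```

Only the final hidden state is kept, matching d_i = h_{i,L} above; intermediate states h_{i,1..L-1} are discarded.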
2.4) the feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism; the fusion process is described as:
A(d_t, d_i) = softmax((W_Q d_t)·(W_K d_i) / √c)
d̃_t = Σ_{i=1}^{t-1} A(d_t, d_i)·(W_V d_i)
s_t = concat(d̃_t, d_t)
where d_t and d_i represent the state vectors h_{t,L} and h_{i,L} respectively; A(d_t, d_i) represents the attention of vector d_t to d_i; W_Q, W_K and W_V are parameters of the model, and c is the dimension of the vectors d_t and d_i; softmax represents the softmax function and concat represents vector concatenation; d̃_t is the result of combining all d_i weighted by the attention values, and s_t, the semantic feature corresponding to the history of the t-th round of dialogue, is obtained by concatenating d̃_t and d_t;
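A minimal executable sketch of the step-2.4 fusion follows: d_t attends over the earlier round features d_1..d_{t-1} with scaled dot-product attention, and the weighted combination is concatenated with d_t to form the semantic feature s_t. For brevity the learned projections W_Q, W_K, W_V are replaced by identity maps, which is an illustrative assumption; the model uses trained parameter matrices.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(d_t, history):
    """Fuse d_t with earlier round features via scaled dot-product attention.

    history: the feature vectors d_1..d_{t-1}; all vectors share dimension c.
    Returns s_t = concat(attention-weighted combination of history, d_t).
    W_Q, W_K, W_V are taken as identity here (illustrative assumption).
    """
    c = len(d_t)
    scores = [_dot(d_t, d_i) / math.sqrt(c) for d_i in history]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over the history
    mixed = [sum(w * d_i[j] for w, d_i in zip(weights, history))
             for j in range(c)]
    return mixed + list(d_t)  # concat -> s_t
```

With d_t = [1, 0] and history [[1, 0], [0, 1]], the first history vector scores higher and dominates the weighted mixture.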
Features are extracted from the visual information through ResNet-152 to obtain low-order visual features and image classification features, and through Faster R-CNN and U-Net respectively to obtain target detection features and semantic segmentation features. Specifically, in each round of dialogue the robot arrives at a new position and acquires the panorama at that position; the panorama corresponding to the t-th round of dialogue is denoted P_t. P_t is passed through the neural network model ResNet-152, and the resulting feature map is taken as the low-order visual feature, denoted V_t, while the resulting image classification result is taken as the image classification feature, denoted C_t. P_t is input into the Faster R-CNN network, and the resulting target detection result is taken as the target detection feature, denoted O_t. P_t is input into the U-Net network, and the resulting semantic segmentation result is taken as the semantic segmentation feature, denoted S_t.
In step 3), fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features at the current time and the previous time through an attention mechanism to obtain fusion features, wherein the fusion features comprise the following steps:
3.1) the low-order visual features, image classification features, target detection features and semantic segmentation features are fused with the fusion feature corresponding to round t-1 of the dialogue, i.e., moment t-1; the fusion process is described as follows:
where v_{t,i}, c_{t,i}, o_{t,i} and s_{t,i} respectively represent the low-order visual, image classification, target detection and semantic segmentation feature vectors of the i-th sub-image in the t-th round of dialogue, i.e., the column vectors of the low-order visual feature matrix V_t, the image classification feature matrix C_t, the target detection feature matrix O_t and the semantic segmentation feature matrix S_t; the fusion feature obtained at moment t-1 enters the fusion; f_v and f_vlm represent nonlinear mapping functions, and l represents its vector dimension; the outputs are the fused low-order visual, image classification, target detection and semantic segmentation feature vectors.
3.2) the fused low-order visual features, image classification features, target detection features and semantic segmentation features are further fused with the semantic features through an attention mechanism; the process is described as follows:
where the four matrices are the fused low-order visual, image classification, target detection and semantic segmentation feature matrices of the t-th round of dialogue; the semantic feature of the t-th round is multiplied by a parameter matrix to map it to the matching dimension, and h represents the dimension of the semantic feature at round t; softmax denotes the softmax function; the outputs are the low-order visual, image classification, target detection and semantic segmentation features after fusion through the attention mechanism;
3.3) the fused features are further processed through an LSTM network and finally concatenated into the final encoding feature, as follows:
where the four vectors are the low-order visual, image classification, target detection and semantic segmentation features processed by the LSTM network; concat represents vector concatenation; the result is the fusion feature corresponding to the t-th round of dialogue, i.e., the final encoding feature.
In step 4), the fusion feature is input into a softmax classifier for movement direction prediction, comprising the following steps:
4.1) the final encoding feature is mapped with an activation function, where σ is the sigmoid activation function, f_m is a nonlinear mapping function, and the output is the activation result;
4.2) the final result is computed from the activation result through a softmax function, where softmax denotes the softmax function and f_a is a nonlinear mapping function.
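Step 4 can be sketched end-to-end as a small decision head: the fused encoding passes through a sigmoid activation, a linear map standing in for the nonlinear maps f_m and f_a, and a softmax over the candidate actions. The 13-way action set (12 movement directions plus "stop") and the weight layout are illustrative assumptions:

```python
import math

# Assumed action set: the 12 panorama directions plus a "stop" action.
ACTIONS = ["direction_%d" % i for i in range(1, 13)] + ["stop"]

def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def _softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def predict_action(fused, weights):
    """Predict the next movement from the fused encoding feature.

    fused:   the final encoding feature vector;
    weights: one row of len(fused) coefficients per action, a linear
             stand-in for the patent's nonlinear maps f_m and f_a.
    """
    activated = [_sigmoid(x) for x in fused]  # sigmoid activation
    logits = [sum(w * a for w, a in zip(row, activated)) for row in weights]
    probs = _softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ACTIONS[best], probs
```

When the predicted action is "stop", the robot is considered to have reached the target position, matching step 4 above.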
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention performs robot autonomous navigation using visual information and natural language, saving the overhead of obtaining an accurate metric map and adapting to complex environments.
2. The invention combines natural language instructions with machine vision, so robot autonomous navigation can be carried out more conveniently and efficiently.
3. By combining the features of these two different modalities, the invention improves navigation efficiency and saves overhead while maintaining the navigation quality.
Drawings
FIG. 1 is a flow chart illustrating autonomous navigation according to the present invention.
FIG. 2 is a schematic diagram of a model architecture construction process for feature extraction and navigation instruction prediction based on attention mechanism.
In FIG. 2, the dialogue history represents the robot's questions and the human user's answer records; the dialogue at the current moment represents the robot's questions and the human user's answers in the current round; Encoding represents encoding the dialogue information into encoding vectors; low-order visual features, image classification features, target detection features and semantic segmentation features are extracted from the machine vision image through the ResNet-152, Faster R-CNN and U-Net models; Attention denotes the attention module used for feature extraction of the semantic information and the visual information, and for fusing the semantic features, the visual features, and the fusion feature obtained in round t-1 of the dialogue, i.e., at moment t-1; the four resulting streams represent the low-order visual, image classification, target detection and semantic segmentation features after fusion with the semantic features; the fused features are input into the softmax module, and the final result is computed.
FIG. 3 is a schematic view of the attention mechanism, where d_t and d_i represent the feature vectors used to compute attention; W_Q, W_K and W_V are the parameters that map d_t and d_i to the same dimension; Matmul stands for matrix multiplication; the computation result is normalized by a softmax module to obtain the attention result A(d_t, d_i); the results of the attention module over all d_i are combined and concatenated with d_t to obtain the final result.
Detailed Description
The present invention will be further described with reference to the following specific examples and drawings, but the embodiments of the present invention are not limited thereto.
As shown in fig. 1 to 3, the method for implementing autonomous navigation of a robot based on natural language and machine vision provided by the present embodiment includes the following steps:
1) the robot starts from an initial position and acquires language information and visual information at each round of dialogue (each moment). The language information comprises instructions indicating the robot's target position and a dialogue record describing the robot's environment; the dialogue record comprises the dialogue generated at the current position (i.e., the current moment) together with the set of all previous dialogues, and the visual information comprises panoramic image information of the robot's current position. The dialogue record of the robot's environment refers to the communication record generated while two human users navigate in that environment, where one human user extracts topological information of the whole indoor environment and instructs the robot to walk through question-and-answer communication with the other human user. Each dialogue record is represented as H_t = {D_1, D_2, ..., D_i, ..., D_{t-1}}, where H_t denotes the dialogue record at the t-th round of dialogue and D_i denotes the i-th round of dialogue. The panoramic image corresponding to the visual information of the robot's environment is denoted C; it is divided into 12 sub-images, each representing one of 12 directions, written C = {c_1, c_2, ..., c_i, ..., c_12}, where c_i denotes the i-th sub-image, as shown by the images corresponding to the alternative action directions in FIG. 2.
2) The method for extracting the features of the language information through the attention mechanism to obtain the semantic features comprises the following steps:
2.1) the dialogue record H, containing t rounds of dialogue each of which contains L words, is described as:
H = {D_1, D_2, ..., D_i, ..., D_t}
2.2) the dialogue record is vectorized through an embedding layer, and the corresponding vectorization result E is described as:
E = {G_1, G_2, ..., G_i, ..., G_t}
G_i = {g_1, g_2, ..., g_i, ..., g_L}
where G_i represents the embedding vectors of the i-th round of dialogue (t rounds in total) and g_i represents the embedding vector of the i-th word in a round of dialogue (L words in total);
2.3) the embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors; the process is described as:
{h_{i,1}, h_{i,2}, ..., h_{i,L}} = LSTM({w_{i,1}, w_{i,2}, ..., w_{i,j}, ..., w_{i,L}})
d_i = h_{i,L}
where w_{i,j} represents the embedding vector of the j-th word in the i-th round of dialogue and h_{i,L} represents the state vector of the LSTM network at the last step, denoted d_i; the first t-1 feature vectors of the dialogue record form a feature matrix;
2.4) the feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism; the principle of the attention mechanism and the computation of the attention result are shown in FIG. 3, and the fusion process is described as:
A(d_t, d_i) = softmax((W_Q d_t)·(W_K d_i) / √c)
d̃_t = Σ_{i=1}^{t-1} A(d_t, d_i)·(W_V d_i)
s_t = concat(d̃_t, d_t)
where d_t and d_i represent the state vectors h_{t,L} and h_{i,L} respectively; A(d_t, d_i) represents the attention of vector d_t to d_i; W_Q, W_K and W_V are parameters of the model, and c is the dimension of the vectors d_t and d_i; softmax represents the softmax function and concat represents vector concatenation; d̃_t is the result of combining all d_i weighted by the attention values, and s_t, the semantic feature corresponding to the history of the t-th round of dialogue, is obtained by concatenating d̃_t and d_t;
As shown in FIG. 2, features are extracted from the visual information through ResNet-152 to obtain low-order visual features and image classification features, and through Faster R-CNN and U-Net to obtain target detection features and semantic segmentation features. Specifically, in each round of dialogue the robot arrives at a new position and acquires the panorama at that position; the panorama corresponding to the t-th round of dialogue is denoted P_t. P_t is passed through the neural network model ResNet-152, and the resulting feature map is taken as the low-order visual feature, denoted V_t, while the resulting image classification result is taken as the image classification feature, denoted C_t. P_t is input into the Faster R-CNN network, and the resulting target detection result is taken as the target detection feature, denoted O_t. P_t is input into the U-Net network, and the resulting semantic segmentation result is taken as the semantic segmentation feature, denoted S_t.
3) Fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features at the current moment and the previous moment through an attention mechanism to obtain fusion features, and the method comprises the following steps of:
3.1) the low-order visual features, image classification features, target detection features and semantic segmentation features are fused with the fusion feature corresponding to round t-1 of the dialogue, i.e., moment t-1; the fusion process is described as follows:
where v_{t,i}, c_{t,i}, o_{t,i} and s_{t,i} respectively represent the low-order visual, image classification, target detection and semantic segmentation feature vectors of the i-th sub-image in the t-th round of dialogue, i.e., the column vectors of the low-order visual feature matrix V_t, the image classification feature matrix C_t, the target detection feature matrix O_t and the semantic segmentation feature matrix S_t; the fusion feature obtained at moment t-1 enters the fusion; f_v and f_vlm represent nonlinear mapping functions, and l represents its vector dimension; the outputs are the fused low-order visual, image classification, target detection and semantic segmentation feature vectors.
3.2) the fused low-order visual features, image classification features, target detection features and semantic segmentation features are further fused with the semantic features through an attention mechanism; the process is described as follows:
where the four matrices are the fused low-order visual, image classification, target detection and semantic segmentation feature matrices of the t-th round of dialogue; the semantic feature of the t-th round is multiplied by a parameter matrix to map it to the matching dimension, and h represents the dimension of the semantic feature at round t; softmax denotes the softmax function; the outputs are the low-order visual, image classification, target detection and semantic segmentation features after fusion through the attention mechanism;
3.3) the fused features are further processed through an LSTM network and finally concatenated into the final encoding feature, as follows:
where the four vectors are the low-order visual, image classification, target detection and semantic segmentation features processed by the LSTM network; concat represents vector concatenation; the result is the fusion feature corresponding to the t-th round of dialogue, i.e., the final encoding feature.
4) The fusion feature is input into a softmax classifier for predicting the moving direction, comprising the following steps:
4.1) the final encoding feature is mapped with an activation function, as follows:
where σ is the sigmoid activation function, f_m is a nonlinear mapping function, and the output is the activation result;
4.2) the final result is computed from the activation result through a softmax function, as follows:
where softmax denotes the softmax function and f_a is a nonlinear mapping function.
The above-described embodiments are only preferred embodiments of the present invention and are not intended to limit its scope; any other changes, modifications, substitutions, combinations and simplifications that do not depart from the spirit and principle of the present invention shall be regarded as equivalents and are included within the scope of the present invention.
Claims (5)
1. A method for realizing autonomous navigation of a robot based on natural language and machine vision is characterized by comprising the following steps:
1) the robot starts from an initial position and acquires language information and visual information at each round of dialogue, i.e., at each moment; the language information comprises an instruction indicating the robot's target position and a dialogue record describing the robot's environment, where the dialogue record comprises the dialogue generated at the current position (i.e., the current moment) together with the set of all previous dialogues; the visual information comprises panoramic image information of the robot's current position;
2) features are extracted from the language information through an attention mechanism to obtain semantic features; features are extracted from the visual information through ResNet-152 to obtain low-order visual features and image classification features, and through Faster R-CNN and U-Net respectively to obtain target detection features and semantic segmentation features;
3) fusing the low-order visual features, the image classification features, the target detection features, the semantic segmentation features and the semantic features of the current moment and the previous moment through an attention mechanism to obtain fusion features;
4) the fusion features are input into a softmax classifier to predict the robot's moving direction at the current moment; at each moment the robot predicts its moving direction from the fusion features, and when the prediction result is "stop" the robot has reached the target position.
2. The method for realizing autonomous navigation of the robot based on natural language and machine vision according to claim 1, characterized in that: in step 1), the dialogue record of the robot's environment refers to the communication record generated while two human users navigate in that environment, where one human user extracts topological information of the whole indoor environment and instructs the robot to walk through question-and-answer communication with the other human user. Each dialogue record is represented as H_t = {D_1, D_2, ..., D_i, ..., D_{t-1}}, where H_t denotes the dialogue record at the t-th round of dialogue and D_i denotes the i-th round of dialogue. The panoramic image corresponding to the visual information of the robot's environment is denoted C; it is divided into 12 sub-images, each representing one of 12 directions, written C = {c_1, c_2, ..., c_i, ..., c_12}, where c_i denotes the i-th sub-image.
3. The method for realizing autonomous navigation of the robot based on the natural language and the machine vision according to claim 1, characterized in that: in step 2), performing feature extraction on the language information through an attention mechanism to obtain semantic features, and the method comprises the following steps:
2.1) the dialogue record H, containing t rounds of dialogue each of which contains L words, is described as:
H = {D_1, D_2, ..., D_i, ..., D_t}
2.2) the dialogue record is vectorized through an embedding layer, and the corresponding vectorization result E is described as:
E = {G_1, G_2, ..., G_i, ..., G_t}
G_i = {g_1, g_2, ..., g_i, ..., g_L}
where G_i represents the embedding vectors of the i-th round of dialogue (t rounds in total) and g_i represents the embedding vector of the i-th word in a round of dialogue (L words in total);
2.3) the embedding vectors of the dialogue record are encoded through an LSTM network to obtain feature vectors; the process is described as:
{h_{i,1}, h_{i,2}, ..., h_{i,L}} = LSTM({w_{i,1}, w_{i,2}, ..., w_{i,j}, ..., w_{i,L}})
d_i = h_{i,L}
where w_{i,j} represents the embedding vector of the j-th word in the i-th round of dialogue and h_{i,L} represents the state vector of the LSTM network at the last step, denoted d_i; the first t-1 feature vectors of the dialogue record form a feature matrix;
2.4) the feature matrix of the dialogue record and the feature vector of the current dialogue are fused through an attention mechanism; the fusion process is described as:
A(d_t, d_i) = softmax((W_Q d_t)·(W_K d_i) / √c)
d̃_t = Σ_{i=1}^{t-1} A(d_t, d_i)·(W_V d_i)
s_t = concat(d̃_t, d_t)
where d_t and d_i represent the state vectors h_{t,L} and h_{i,L} respectively; A(d_t, d_i) represents the attention of vector d_t to d_i; W_Q, W_K and W_V are parameters of the model, and c is the dimension of the vectors d_t and d_i; softmax represents the softmax function and concat represents vector concatenation; d̃_t is the result of combining all d_i weighted by the attention values, and s_t, the semantic feature corresponding to the history of the t-th round of dialogue, is obtained by concatenating d̃_t and d_t;
Performing feature extraction on the visual information through Resnet152 to obtain low-order visual features and image classification features, and performing feature extraction on the visual information through faster-RCNN and U-net respectively to obtain target detection features and semantic segmentation features, means that in each round of dialogue the robot arrives at a new position and obtains a panoramic view at that position, the panoramic view in the t-th round of dialogue being represented as Pt. Pt is passed through the neural network model Resnet152 for feature extraction; the resulting features are taken as the low-order visual features, represented as Vt, and the resulting image classification result is taken as the image classification features, represented as Ct. Pt is input into the faster-RCNN network, and the resulting target detection result is taken as the target detection features, represented as Ot. Pt is input into the U-net network, and the resulting semantic segmentation result is taken as the semantic segmentation features, represented as St.
4. The method for realizing autonomous navigation of the robot based on natural language and machine vision according to claim 1, characterized in that in step 3), the low-order visual features, image classification features, target detection features, semantic segmentation features and semantic features at the current time and the previous time are fused through an attention mechanism to obtain fusion features, comprising the following steps:
3.1) fusing the low-order visual features, image classification features, target detection features and semantic segmentation features with the fusion feature corresponding to round t-1 of the dialogue, i.e. time t-1, the fusion process being described as follows:
wherein vt,i, ct,i, ot,i and st,i respectively represent the low-order visual feature vector, image classification feature vector, target detection feature vector and semantic segmentation feature vector of the i-th sub-image in the t-th round of dialogue, these being vectors of the low-order visual feature matrix Vt, the image classification feature matrix Ct, the target detection feature matrix Ot and the semantic segmentation feature matrix St respectively; the fusion feature obtained at time t-1 enters the fusion, fv and fvlm represent nonlinear mapping functions, and l represents the vector dimension of the time t-1 fusion feature; the fusion yields the fused low-order visual feature vector, fused image classification feature vector, fused target detection feature vector and fused semantic segmentation feature vector respectively;
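Step 3.1 can be sketched in numpy for one modality: each per-sub-image feature is fused with the round t-1 fusion feature through two nonlinear mappings. Using ReLU-affine maps as stand-ins for fv and fvlm, and 36 panoramic sub-images, are assumptions for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fuse_with_memory(F_t, m_prev, Wf, Wm):
    """Fuse each per-sub-image feature row of F_t with the fusion feature
    m_prev from round t-1 (Wf, Wm are illustrative weights for the two
    nonlinear mappings in the text)."""
    mapped_feats = relu(F_t @ Wf.T)   # mapping applied to every sub-image feature
    mapped_mem = relu(Wm @ m_prev)    # mapping applied to the t-1 fusion feature
    return mapped_feats + mapped_mem  # broadcast: memory added to each row

rng = np.random.default_rng(2)
n_views, d_feat, l = 36, 10, 8        # 36 sub-images of the panorama (assumed)
V_t = rng.normal(size=(n_views, d_feat))  # e.g. the low-order visual feature matrix
m_prev = rng.normal(size=l)               # fusion feature from round t-1
Wf = rng.normal(scale=0.2, size=(l, d_feat))
Wm = rng.normal(scale=0.2, size=(l, l))
V_fused = fuse_with_memory(V_t, m_prev, Wf, Wm)
print(V_fused.shape)   # (36, 8)
```

The same operation is applied to Ct, Ot and St to obtain the four fused feature matrices used in step 3.2.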
3.2) further fusing the fused low-order visual features, image classification features, target detection features and semantic segmentation features with the semantic features through an attention mechanism, the process being described as follows:
wherein Vt^mem, Ct^mem, Ot^mem and St^mem respectively represent the fused low-order visual feature matrix, image classification feature matrix, target detection feature matrix and semantic segmentation feature matrix at the t-th round of dialogue; the semantic feature of the t-th round of dialogue is multiplied by a parameter matrix and mapped to dimension h, where h is the dimension of the semantic feature at the t-th round of dialogue; softmax denotes the softmax function; Vt^attn, Ct^attn, Ot^attn and St^attn respectively represent the low-order visual features, image classification features, target detection features and semantic segmentation features after attention-mechanism fusion;
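The cross-modal attention of step 3.2 can be sketched per modality: the semantic feature of round t, mapped by a parameter matrix, acts as the query that pools the fused per-sub-image features. The mapping W and the toy dimensions are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def semantic_attention(F_mem, s_t, W):
    """Attend over the fused feature matrix of one modality (e.g. Vt^mem)
    using the semantic feature s_t of round t as the query; W stands in
    for the parameter matrix that maps s_t to the feature dimension h."""
    q = W @ s_t                      # semantic feature mapped to dimension h
    weights = softmax(F_mem @ q)     # one attention weight per sub-image
    return weights @ F_mem           # attention-pooled modality feature

rng = np.random.default_rng(3)
n_views, h, sdim = 36, 8, 12
F_mem = rng.normal(size=(n_views, h))   # fused modality features (e.g. Vt^mem)
s_t = rng.normal(size=sdim)             # semantic feature of round t
W = rng.normal(scale=0.2, size=(h, sdim))
v_attn = semantic_attention(F_mem, s_t, W)
print(v_attn.shape)    # (8,)
```

Applying this to each of the four fused matrices yields the attention-fused features that are passed to the LSTM of step 3.3.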
3.3) further processing the fused features through an LSTM network and finally merging them into the final coding feature, the process being as follows:
wherein the LSTM outputs respectively represent the low-order visual features, image classification features, target detection features and semantic segmentation features processed by the LSTM network; concat represents the merging of vectors; the merged result is the fusion feature corresponding to the t-th round of dialogue, i.e. the final coding feature.
5. The method for realizing autonomous navigation of the robot based on natural language and machine vision according to claim 1, characterized in that in step 4), the fusion feature is input into a softmax classifier for moving-direction prediction, comprising the following steps:
4.1) the final coding feature is mapped using an activation function, as follows:
wherein σ is the sigmoid activation function, fm is a nonlinear mapping function, and the output is the activation result;
4.2) the final result is calculated from the activation result through a softmax function, as follows:
wherein softmax denotes the softmax function and fa is a nonlinear mapping function.
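Steps 4.1 and 4.2 can be sketched end to end in numpy: the final coding feature passes through a sigmoid-activated mapping and then a softmax layer over candidate moving directions. The affine stand-ins for fm and fa, and the choice of 4 candidate directions, are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_direction(enc, Wm, Wa):
    """Map the final coding feature to a probability distribution over
    candidate moving directions (Wm, Wa stand in for the fm / fa
    nonlinear mappings of steps 4.1 and 4.2)."""
    activated = sigmoid(Wm @ enc)    # step 4.1: sigma(fm(encoding))
    return softmax(Wa @ activated)   # step 4.2: softmax over directions

rng = np.random.default_rng(4)
enc_dim, hid, n_dirs = 16, 8, 4      # 4 candidate directions (assumed)
Wm = rng.normal(scale=0.3, size=(hid, enc_dim))
Wa = rng.normal(scale=0.3, size=(n_dirs, hid))
probs = predict_direction(rng.normal(size=enc_dim), Wm, Wa)
print(probs.shape)       # (4,)
print(probs.argmax())    # index of the predicted moving direction
```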
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110597437.8A CN113420606B (en) | 2021-05-31 | 2021-05-31 | Method for realizing autonomous navigation of robot based on natural language and machine vision |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113420606A true CN113420606A (en) | 2021-09-21 |
CN113420606B CN113420606B (en) | 2022-06-14 |
Family
ID=77713311
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110597437.8A Active CN113420606B (en) | 2021-05-31 | 2021-05-31 | Method for realizing autonomous navigation of robot based on natural language and machine vision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113420606B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344288A (en) * | 2018-09-19 | 2019-02-15 | 电子科技大学 | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism |
CN110609891A (en) * | 2019-09-18 | 2019-12-24 | 合肥工业大学 | Visual dialog generation method based on context awareness graph neural network |
CN110647612A (en) * | 2019-09-18 | 2020-01-03 | 合肥工业大学 | Visual conversation generation method based on double-visual attention network |
CN110825829A (en) * | 2019-10-16 | 2020-02-21 | 华南理工大学 | Method for realizing autonomous navigation of robot based on natural language and semantic map |
CN112504261A (en) * | 2020-11-09 | 2021-03-16 | 中国人民解放军国防科技大学 | Unmanned aerial vehicle landing pose filtering estimation method and system based on visual anchor point |
CN112710310A (en) * | 2020-12-07 | 2021-04-27 | 深圳龙岗智能视听研究院 | Visual language indoor navigation method, system, terminal and application |
Non-Patent Citations (2)
Title |
---|
YI ZHU ET AL: "Vision-Dialog Navigation by Exploring Cross-modal Memory", 《2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》, 31 December 2020 (2020-12-31), pages 10727 - 10736 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114029963A (en) * | 2022-01-12 | 2022-02-11 | 北京具身智能科技有限公司 | Robot operation method based on visual and auditory fusion |
CN115082915A (en) * | 2022-05-27 | 2022-09-20 | 华南理工大学 | Mobile robot vision-language navigation method based on multi-modal characteristics |
CN115082915B (en) * | 2022-05-27 | 2024-03-29 | 华南理工大学 | Multi-modal feature-based mobile robot vision-language navigation method |
Also Published As
Publication number | Publication date |
---|---|
CN113420606B (en) | 2022-06-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110188598B (en) | Real-time hand posture estimation method based on MobileNet-v2 | |
CN110609891B (en) | Visual dialog generation method based on context awareness graph neural network | |
CN108829677B (en) | Multi-modal attention-based automatic image title generation method | |
CN111243269B (en) | Traffic flow prediction method based on depth network integrating space-time characteristics | |
CN110990543A (en) | Intelligent conversation generation method and device, computer equipment and computer storage medium | |
CN110795990B (en) | Gesture recognition method for underwater equipment | |
CN112860888B (en) | Attention mechanism-based bimodal emotion analysis method | |
CN111966800B (en) | Emotion dialogue generation method and device and emotion dialogue model training method and device | |
CN110851760B (en) | Human-computer interaction system for integrating visual question answering in web3D environment | |
CN113420606B (en) | Method for realizing autonomous navigation of robot based on natural language and machine vision | |
CN110825829B (en) | Method for realizing autonomous navigation of robot based on natural language and semantic map | |
CN113221663B (en) | Real-time sign language intelligent identification method, device and system | |
CN112163498B (en) | Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method | |
CN111028319B (en) | Three-dimensional non-photorealistic expression generation method based on facial motion unit | |
CN112329525A (en) | Gesture recognition method and device based on space-time diagram convolutional neural network | |
CN114360005B (en) | Micro-expression classification method based on AU region and multi-level transducer fusion module | |
CN111026873A (en) | Unmanned vehicle and navigation method and device thereof | |
CN113065344A (en) | Cross-corpus emotion recognition method based on transfer learning and attention mechanism | |
CN116091551B (en) | Target retrieval tracking method and system based on multi-mode fusion | |
CN110046271A (en) | A kind of remote sensing images based on vocal guidance describe method | |
CN117197878B (en) | Character facial expression capturing method and system based on machine learning | |
CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
CN117576279B (en) | Digital person driving method and system based on multi-mode data | |
CN113780350B (en) | ViLBERT and BiLSTM-based image description method | |
CN117011650B (en) | Method and related device for determining image encoder |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||