CN111611373A - Robot-oriented specific active scene description method - Google Patents
Robot-oriented specific active scene description method
- Publication number: CN111611373A
- Application number: CN202010287188.8A
- Authority
- CN
- China
- Prior art keywords
- state
- image
- robot
- model
- action
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a robot-oriented specific active scene description method, and belongs to the technical field of image processing. By combining a static image description generation model with a navigation model for the robot, when the robot starts in an initial scene with a poor viewing angle, the trained navigation model provides an effective action sequence for the robot to adjust its viewing angle, and once a suitable final viewing angle is found, the trained static image description generation model is called to obtain the final scene description result. The invention overcomes the defect that traditional image description generation models cannot be applied to specific scenes, can generate more accurate and comprehensive image descriptions in three-dimensional specific scenes, and can be used in fields such as service robots, security monitoring, and barrier-free human-machine interaction.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a robot-oriented specific active scene description method.
Background
Image description generation is a technique for generating a corresponding text description for a given image. It allows a computer to provide human-understandable text descriptions for natural images, and has wide and important applications in security monitoring, image retrieval, network image processing, and barrier-free human-machine interaction.
In recent years, automatic description generation methods for single static pictures have made great progress. However, current image description generation models are only applicable to static pictures and are of little use in interactive scenes. For example, a robot may be facing a wall in a room; no matter which image description generation model is used, the resulting description says nothing meaningful about the indoor scene as a whole. If the robot could rotate 180 degrees, however, it might see a completely different scene.
Current image description generation methods are mainly based on a convolutional neural network and an LSTM (long short-term memory) language model. The main approach extracts image features with a convolutional neural network and then uses the LSTM model to recursively generate the final text word by word. When generating text, the LSTM model may use an attention mechanism to assign weights to regional features at different positions of the image and then take the weighted average of those regional features, improving the generation quality of the model. Some models also use pre-extracted keywords and then fuse the information of these keywords with a semantic attention mechanism to improve generation. However, all these models can only generate text for a static picture, because they can only receive a single picture as input. None of these generative models can be used when the viewpoint of the picture is not appropriate (e.g., facing a wall or window in an indoor scene).
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a robot-oriented specific active scene description method. The method combines a natural language model with a navigation model for the robot: when the robot starts in an initial scene with a poor viewing angle, the navigation model provides an effective action sequence for the robot to adjust its viewing angle, and once a suitable final viewing angle is found, the text generation model for static pictures is called to obtain a final, meaningful scene description result. The invention overcomes the defect that traditional image description generation models cannot be applied to specific scenes, can generate more accurate and comprehensive image descriptions in three-dimensional specific scenes, and can be used in fields such as service robots, security monitoring, and barrier-free human-machine interaction.
The invention provides a robot-oriented specific active scene description method, which is divided into a training stage and a using stage and is characterized by comprising the following steps of:
(1) a training stage; the method comprises the following specific steps:
(1-1) using an image caption generation dataset as the image description generation dataset, and performing text description labeling on each image in the image description generation dataset to obtain the labeled image description generation dataset;
(1-2) selecting a static image description generation framework, and training the framework with the image description generation dataset labeled in step (1-1) to obtain a trained static image description generation model Cap; the model takes any image as input and outputs the image description word sequence corresponding to that image, with the expression:
(w_1^(I), w_2^(I), ..., w_(m_I)^(I)) = Cap(I)
wherein Cap is the static image description generation model, I is the input image, (w_1^(I), w_2^(I), ..., w_(m_I)^(I)) is the image description word sequence corresponding to the input image I, and w_(m_I)^(I) is the m_I-th word of that word sequence;
(1-3) selecting a simulation environment;
(1-4) constructing a state set S to obtain a scene connection graph G = &lt;S, E&gt;; the method comprises the following specific steps:
(1-4-1) from the simulation environment, selecting all actions supported by the physical robot in the use stage to form an action space A;
(1-4-2) initializing the state set S = {s_0}, the edge set E = ∅, and the search queue S_Q = {s_0}, wherein s_0 is any reachable state of the simulation robot in the simulation scene;
(1-4-3) judging S_Q:
if S_Q is not empty, executing step (1-4-4); otherwise, executing step (1-4-6);
(1-4-4) letting the current state S_C = S_Q.dequeue(), where dequeue represents the dequeue operation of a queue data structure;
(1-4-5) for each a ∈ A, where a represents any action in the action space: if the simulation robot in state S_C is allowed to execute action a, adding the edge going from S_C through the executed action a to the state Exec(S_C, a) into the edge set E, where Exec(S_C, a) represents the new state obtained when the simulation robot executes action a from state S_C;
then judging each Exec(S_C, a):
if Exec(S_C, a) ∉ S, executing S_Q.enqueue(Exec(S_C, a)), adding the state Exec(S_C, a) to the search queue S_Q, adding Exec(S_C, a) to the state set S, and then returning to step (1-4-3); otherwise, performing no operation and returning to step (1-4-3); wherein enqueue represents the enqueue operation of a queue data structure;
(1-4-6) after the state set S is constructed, finally obtaining the scene connection graph G = &lt;S, E&gt;, wherein S is the state set, each state in S corresponds to one node of the scene connection graph, E is the set of edges connecting any two states in the state set, and each edge corresponds to the robot action connecting the two states;
(1-5) calculating all shortest paths between any two nodes in the scene connection graph G; each shortest path correspondingly connects one state sequence of the two nodes, and adjacent states in the state sequence are connected through actions corresponding to edges connecting the two adjacent states in the edge set E;
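The breadth-first construction of steps (1-4-2) to (1-4-6) can be sketched as follows; `allowed(s, a)` and `exec_action(s, a)` are hypothetical hooks into the simulator (any names and signatures here are illustrative, not part of a real simulator API):

```python
from collections import deque

def build_scene_graph(s0, actions, allowed, exec_action):
    """BFS construction of the scene connection graph G = <S, E>,
    following steps (1-4-2)-(1-4-6)."""
    S = {s0}             # state set
    E = set()            # edges, stored as (state, action, next_state)
    queue = deque([s0])  # search queue S_Q
    while queue:                       # step (1-4-3): loop until S_Q empty
        sc = queue.popleft()           # step (1-4-4): dequeue current state
        for a in actions:              # step (1-4-5)
            if not allowed(sc, a):
                continue
            nxt = exec_action(sc, a)   # Exec(S_C, a)
            E.add((sc, a, nxt))
            if nxt not in S:           # unseen state: record and enqueue
                S.add(nxt)
                queue.append(nxt)
    return S, E
```

For instance, on a 1-D corridor of four positions with left/right moves, the sketch enumerates all four states and the six traversable edges.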
(1-6) obtaining a candidate state set;
for each state s ∈ S in the state set S obtained in step (1-4), obtaining the observation image I(s) corresponding to state s in the simulation scene and the target detection set O(s) corresponding to I(s), and using the static image description generation model Cap trained in step (1-2) to obtain the word set W(s) = {Cap(I(s))} corresponding to the state, wherein {q} represents the set formed by the non-repetitive elements of the sequence q; then calculating the score corresponding to state s, with the expression:
score(s) = α·|O(s)| + |W(s) ∩ O(s)|
wherein α is a trade-off factor;
after calculating the scores corresponding to all the states in the state set S, forming the states whose scores are higher than the set score threshold into a candidate state set S_cand; the score threshold is a set proportion of the highest score among the scores of all states, namely:
S_cand = {s | s ∈ S, score(s) > β·max_(s'∈S) score(s')}
wherein β is a scaling factor;
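The candidate-state selection of step (1-6) can be sketched as below; `observe_caption` and `detect` are hypothetical stand-ins for the captioning model Cap and the simulator's object detector, and the default α and β values are illustrative only:

```python
def score_states(states, observe_caption, detect, alpha=1.0, beta=0.85):
    """Step (1-6): score each state and keep those above a fixed
    fraction beta of the best score."""
    scores = {}
    for s in states:
        W = set(observe_caption(s))   # W(s): unique words in Cap(I(s))
        O = set(detect(s))            # O(s): detected object labels
        # score(s) = alpha * |O(s)| + |W(s) ∩ O(s)|
        scores[s] = alpha * len(O) + len(W & O)
    best = max(scores.values())
    cand = {s for s, sc in scores.items() if sc > beta * best}
    return cand, scores
```

Intuitively, viewpoints that see many objects, and whose caption actually mentions those objects, survive the threshold.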
(1-7) constructing a navigation model;
The navigation model is composed of a convolutional neural network CNN, a long short-term memory model LSTM and a fully connected layer, wherein the convolutional neural network is connected to the input layer of each time step of the LSTM, and the fully connected layer is connected to the output layer of each time step of the LSTM;
let the CNN initial parameters be θ_1, the LSTM initial parameters be θ_2, the fully connected layer initial parameters be W_3, and the LSTM hidden layer initial state be h_0, c_0; then the iterative process of the navigation model is as follows:
h_(t+1), c_(t+1) = LSTM(h_t, c_t, [CNN(I_(t+1)); a_t])
p(a_(t+1)) = Softmax(W_3·h_(t+1))
wherein a_t and I_(t+1) are the inputs of the t-th iteration of the navigation model: a_t is the action executed by the robot at step t, and I_(t+1) is the observation image obtained after executing action a_t; for t = 0, setting a_t = a_start, where a_start represents the start action; p(a_(t+1)) represents the conditional probability of each action being executed at iteration step t+1 and is the output of the navigation model; Θ = [θ_1; θ_2; W_3; h_0; c_0] constitutes the current trainable parameter set of the navigation model;
and recording the result of the iteration process after step t as Nav(h_0, c_0, I_1, I_2, ..., I_t, a_0, a_1, ..., a_t) = p(a_(t+1));
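The iteration above reduces to the following control loop; the CNN, the LSTM cell, and the environment hooks are injected as plain callables standing in for the trained components (all names are illustrative), and a pure-Python softmax keeps the sketch self-contained:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def navigate(cnn, lstm_cell, W3, h0, c0, get_image, do_action,
             actions, a_start="start", a_stop="stop", max_steps=20):
    """Sketch of h_(t+1), c_(t+1) = LSTM(h_t, c_t, [CNN(I_(t+1)); a_t]),
    p(a_(t+1)) = Softmax(W_3 h_(t+1)); W3 is a list of weight rows,
    one per action."""
    h, c, a = h0, c0, a_start
    trace = []
    for _ in range(max_steps):
        img = get_image()                    # observe I_(t+1)
        h, c = lstm_cell(h, c, cnn(img), a)  # LSTM state update
        logits = [sum(w * x for w, x in zip(row, h)) for row in W3]
        probs = softmax(logits)              # p(a_(t+1))
        a = actions[max(range(len(actions)), key=lambda k: probs[k])]
        trace.append(a)
        if a == a_stop:                      # termination action ends the loop
            break
        do_action(a)
    return trace
```

With stub components that always prefer the stop action, the loop terminates after one step.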
(1-8) sampling the shortest path and training a navigation model to obtain a trained navigation model; the method comprises the following specific steps:
(1-8-1) setting B as the size of a training batch;
(1-8-2) randomly sampling B states from S as the current initial state set s_(11), s_(21), ..., s_(B1);
(1-8-3) for each current initial state s_(i1), 1 ≤ i ≤ B, randomly sampling a candidate state from the candidate state set S_cand obtained in step (1-6) as the termination candidate position s_(i,n(i)) of that initial state, 1 ≤ i ≤ B, where n(i) is the length of the state sequence traversed by the shortest path connecting s_(i1) and s_(i,n(i)); randomly selecting one shortest path from all the shortest paths connecting s_(i1) and s_(i,n(i)) to obtain the sequence of nodes and edges corresponding to that path:
wherein a_(ij) is the action traversed in the transition from state s_(ij) to the next state s_(i,j+1), I(s_(ij)) is the observation image corresponding to state s_(ij), and a_start and a_stop are the start action and stop action respectively;
(1-8-4) repeating step (1-8-3) to obtain the sequences corresponding to all current initial states, and calculating the loss function corresponding to the current initial state set, i.e., the negative log-likelihood under the navigation model Nav of the actions along each sampled sequence, summed over the batch;
(1-8-5) calculating the gradient of the loss function with respect to the current trainable parameter set Θ, optimizing the navigation model parameters with a stochastic gradient descent optimizer using the calculated gradient, and updating the current trainable parameter set;
(1-8-6) repeating steps (1-8-2) to (1-8-5) until the navigation model converges, obtaining the trained navigation model and the final trainable parameter set Θ' = [θ'_1; θ'_2; W'_3; h'_0; c'_0];
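The shortest-path sampling of step (1-8-3), and one plausible reading of the loss in step (1-8-4) as a behavior-cloning negative log-likelihood over the sampled expert actions, can be sketched as follows (the `shortest_paths` table layout and all names are assumptions for illustration):

```python
import math
import random

def sample_training_sequence(s_init, candidates, shortest_paths):
    """Step (1-8-3): pick a random termination candidate, then one of the
    shortest paths to it. shortest_paths[(u, v)] is assumed to hold a list
    of (state, action) paths."""
    s_end = random.choice(sorted(candidates))
    return random.choice(shortest_paths[(s_init, s_end)])

def path_nll(path_actions, action_probs):
    """Negative log-likelihood of the expert actions along one sampled path,
    where action_probs[t][a] stands in for the model output p(a_(t+1))."""
    return -sum(math.log(action_probs[t][a])
                for t, a in enumerate(path_actions))
```

Summing `path_nll` over the B sampled sequences gives a batch loss whose gradient can drive the SGD update of step (1-8-5).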
(2) A use stage; the method comprises the following specific steps:
(2-1) placing the physical robot at any position in the real physical scene as the initial position, initializing the action a_0 = a_start, and obtaining the observation image I_1 corresponding to the physical robot at the initial position;
(2-2) setting the LSTM hidden layer initial state of the navigation model trained in step (1) to h'_0, c'_0, and setting the iteration step number i = 1;
(2-3) inputting I_i and a_(i-1) into the navigation model trained in step (1) to obtain the model output a_i and the hidden state h_i, c_i, wherein I_i is the observation image of the i-th step of the physical robot, a_i is the action executed by the physical robot at step i, and h_i, c_i are the state parameters of the i-th step of the LSTM hidden layer of the trained navigation model;
(2-4) judging a_i: if action a_i is the termination action a_stop, entering step (2-5); otherwise, the physical robot executes action a_i to obtain the updated observation image I_(i+1) of the (i+1)-th step; then letting i = i+1 and returning to step (2-3);
(2-5) navigation ends; inputting I_i into the static image description generation model Cap obtained in step (1-2), and obtaining the word sequence of the image description corresponding to the observation image: (w_1, w_2, ..., w_m) = Cap(I_i).
the invention has the characteristics and beneficial effects that:
the invention constructs a robot-oriented specific active scene description method by utilizing a navigation model and an image description generation model based on deep learning. The method overcomes the defect that a static description generation model cannot deal with the condition that the robot is in a poor visual angle, can guide the robot to navigate and adjust the visual angle of the robot, and then generates the image description after the robot reaches the proper visual angle in the scene.
The invention can be used in the field of robots, allowing a robot to autonomously navigate to a suitable location within an area, explore the environment and generate the required natural language description. The method can be used in the fields of service robots, security monitoring, barrier-free man-machine interaction and the like, and can generate more accurate and comprehensive image description in a three-dimensional specific scene.
Detailed Description
The invention provides a robot-oriented active scene description method, which is further described in detail below with reference to specific embodiments.
The invention provides a robot-oriented specific active scene description method. The method combines a static image description generation model with a navigation model for the robot: when the robot is in an initial scene with a poor viewing angle, the navigation model provides an effective action sequence for the robot to adjust its viewing angle, and once a suitable final viewing angle is found, the static image description generation model is called to obtain the final scene description result.
The invention provides a robot-oriented active scene description method, which comprises a training stage and a using stage, and comprises the following steps:
(1) a training stage; the method comprises the following specific steps:
(1-1) The image caption generation dataset is used as the image description generation dataset (if a public dataset is used, the MSCOCO dataset may be chosen). The image description generation dataset needs to include many images (generally 10,000 to 1,000,000; 100,000 in this embodiment), and each image in the dataset is manually labeled to obtain the labeled image description generation dataset.
The label content of each image is one or more text descriptions corresponding to that image (more text descriptions per image is better; the text descriptions of the same image should differ in content; the number of texts per image need not be equal, generally 1-20, and is 5 per image in this embodiment). The style of the text descriptions should be as uniform as possible, and rarely used words should be avoided. Due to model limitations, the text descriptions should not be too long (each text description in this embodiment does not exceed 25 words).
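The labeled records of step (1-1) might look like the following; the field names follow the MSCOCO caption-annotation layout, but both the layout and the example file names and captions are illustrative, not a required schema:

```python
import json

# One image with several short captions (the embodiment uses 5 per image;
# 2 shown here for brevity). All values below are made-up examples.
dataset = {
    "images": [{"id": 1, "file_name": "kitchen_001.jpg"}],
    "annotations": [
        {"image_id": 1, "id": 10, "caption": "a kitchen with a white stove"},
        {"image_id": 1, "id": 11, "caption": "a stove next to a counter"},
    ],
}

# captions stay short (the embodiment caps them at about 25 words)
assert all(len(a["caption"].split()) <= 25 for a in dataset["annotations"])
serialized = json.dumps(dataset)
```

Keeping captions short and stylistically uniform, as the embodiment advises, simplifies the vocabulary the captioning model must learn.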
(1-2) Select a static image description generation framework (for example ImageCaptioning.pytorch, which is used in this embodiment), and train the framework with the image description generation dataset labeled in step (1-1) to obtain the trained static image description generation model Cap. The model takes any image as input and outputs the image description word sequence corresponding to that image, with the expression:
(w_1^(I), w_2^(I), ..., w_(m_I)^(I)) = Cap(I)
wherein Cap is the static image description generation model, I is the input image, (w_1^(I), w_2^(I), ..., w_(m_I)^(I)) is the image description word sequence corresponding to the input image I, and w_(m_I)^(I) is the m_I-th word of that word sequence. The number of words in the output image description corresponding to each input image is not necessarily equal.
(1-3) Select a simulation environment (the simulation environment should be as close as possible to the real physical environment of robot deployment; if an open-source simulation environment is used, AI2Thor may be chosen, and AI2Thor is used in this embodiment). The simulation environment should support virtual scenes and a simulation robot, the simulation robot should be operable by programming, and various actions of the simulation robot should be supported (the simulation robot actions supported by AI2Thor in this embodiment are turn left, turn right, move forward, get coordinates, and teleport).
(1-4) Construct a state set S to obtain the scene connection graph G = &lt;S, E&gt;; the specific steps are as follows:
(1-4-1) From the simulation environment, select all actions that the physical robot (i.e., the robot actually executing actions in step (2)) needs to support to form the action space A (in this embodiment, the 5 actions forward, backward, turn left, turn right and stop in AI2Thor are selected to form the action space);
(1-4-2) Initialize the state set S = {s_0}, the edge set E = ∅, and the search queue S_Q = {s_0}, where s_0 is any reachable state of the simulation robot in the simulation scene (a state comprises the position and posture of the robot).
(1-4-3) Judge S_Q:
if S_Q is not empty, execute step (1-4-4); otherwise, execute step (1-4-6);
(1-4-4) Let the current state S_C = S_Q.dequeue(), where dequeue represents the dequeue operation of a queue data structure;
(1-4-5) For each a ∈ A, where a represents any action in the action space: if the simulation robot in state S_C is allowed to execute action a, add the edge going from S_C through the executed action a to the state Exec(S_C, a) into the edge set E, where Exec(S_C, a) represents the new state obtained when the simulation robot executes action a from state S_C;
then judge each Exec(S_C, a):
if Exec(S_C, a) ∉ S, execute S_Q.enqueue(Exec(S_C, a)), adding the new state Exec(S_C, a) to the search queue S_Q, add Exec(S_C, a) to the state set S, and then return to step (1-4-3); otherwise, perform no operation and return to step (1-4-3); where enqueue represents the enqueue operation of a queue data structure;
(1-4-6) After the state set S is constructed, the discretized scene connection graph G = &lt;S, E&gt; is finally obtained, wherein S is the state set (each state corresponds to one node of the scene connection graph) and E is the set of edges connecting any two states in the state set (each edge corresponds to the robot action connecting the two states);
(1-5) Use the Floyd-Warshall algorithm to calculate all shortest paths between any two nodes of the scene connection graph G. The states in the state set S traversed by each shortest path connecting two nodes form a state sequence, and adjacent states in the state sequence are connected through the actions corresponding to the edges connecting those two adjacent states in the edge set E.
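Step (1-5) can be implemented with the Floyd-Warshall recurrence; a minimal sketch over the unit-cost scene graph (every edge is one robot action), with a `nxt` table so a shortest state sequence can be reconstructed:

```python
def floyd_warshall(nodes, edges):
    """All-pairs shortest path lengths for the scene connection graph.
    `edges` is a set of (u, v) pairs with unit cost; `nxt[(i, j)]` is the
    node following i on a shortest path from i to j."""
    INF = float("inf")
    dist = {(u, v): 0 if u == v else INF for u in nodes for v in nodes}
    nxt = {}
    for u, v in edges:
        dist[(u, v)] = 1
        nxt[(u, v)] = v
    for k in nodes:          # relax every pair through intermediate node k
        for i in nodes:
            for j in nodes:
                if dist[(i, k)] + dist[(k, j)] < dist[(i, j)]:
                    dist[(i, j)] = dist[(i, k)] + dist[(k, j)]
                    nxt[(i, j)] = nxt[(i, k)]
    return dist, nxt
```

On scene graphs of a few thousand states the cubic cost is acceptable because the graph is computed once, offline, before training.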
(1-6) obtaining a candidate state set;
For each state s ∈ S in the state set S obtained in step (1-4) (each state generally comprises the position and posture of the robot), use the simulation software to obtain the observation image I(s) corresponding to state s in the simulation scene, further use the simulation software to obtain the target detection set O(s) corresponding to I(s), and use the static image description generation model Cap trained in step (1-2) to obtain the word set W(s) = {Cap(I(s))} corresponding to the state (where {q} represents the set formed by the non-repetitive elements of the sequence q). Then calculate the score corresponding to state s, with the expression:
score(s) = α·|O(s)| + |W(s) ∩ O(s)|
where α is a trade-off factor (typically 0.5-5.0), which is 1.0 in this embodiment.
After calculating the scores corresponding to all the states in the state set S, form the states whose scores are higher than the set score threshold into a candidate state set S_cand; the score threshold is a set proportion of the highest score among the scores of all states, namely:
S_cand = {s | s ∈ S, score(s) > β·max_(s'∈S) score(s')}
wherein β is a scale factor (value range 0-1), which is 0.85 in this embodiment.
(1-7) constructing a navigation model;
The navigation model consists of a convolutional neural network CNN (the CNN used in this embodiment is ResNet18), a long short-term memory model LSTM and a fully connected layer, wherein the convolutional neural network is connected to the input layer of each time step of the LSTM, and the fully connected layer is connected to the output layer of each time step of the LSTM;
let the CNN initial parameters be θ_1, the LSTM initial parameters be θ_2, the fully connected layer initial parameters be W_3, and the LSTM hidden layer initial state be h_0, c_0; then the iterative process of the navigation model is as follows:
h_(t+1), c_(t+1) = LSTM(h_t, c_t, [CNN(I_(t+1)); a_t])
p(a_(t+1)) = Softmax(W_3·h_(t+1))
wherein a_t and I_(t+1) are the inputs of the t-th iteration of the navigation model: a_t is the action executed by the robot at step t, and I_(t+1) is the observation image obtained after executing action a_t. Before step 1, for t = 0, a_t is set to the special value a_t = a_start, where a_start represents the start action; p(a_(t+1)) represents the conditional probability of each action being executed at iteration step t+1 and is the output of the t-th iteration of the navigation model; Θ = [θ_1; θ_2; W_3; h_0; c_0] constitutes the current trainable parameter set of the navigation model. During the iterations, only the LSTM hidden state changes from step to step; the other parameters in the current trainable parameter set do not change.
The result of the iteration process at step t is recorded as Nav(h_0, c_0, I_1, I_2, ..., I_t, a_0, a_1, ..., a_t) = p(a_(t+1));
(1-8) sampling the shortest path and training a navigation model to obtain a trained navigation model; the method comprises the following specific steps:
(1-8-1) Set B as the size of the training batch (value range 16-256; 32 in this embodiment);
(1-8-2) Randomly sample B states from S as the current initial state set s_(11), s_(21), ..., s_(B1);
(1-8-3) For each current initial state s_(i1) (1 ≤ i ≤ B), randomly sample a candidate state from the candidate state set S_cand obtained in step (1-6) as the termination candidate position s_(i,n(i)) (1 ≤ i ≤ B) of that initial state (the termination candidate position may coincide with the initial state, which does not affect the normal operation of the algorithm), where n(i) is the length of the state sequence traversed by the shortest path connecting s_(i1) and s_(i,n(i)) calculated according to step (1-5) (the state sequences of all shortest paths between the two states have equal length); randomly select one shortest path from all the shortest paths connecting s_(i1) and s_(i,n(i)) to obtain the sequence of nodes and edges corresponding to that path:
wherein a_(ij) is the action traversed in the transition from state s_(ij) to the next state s_(i,j+1), I(s_(ij)) is the observation image corresponding to state s_(ij), a_start and a_stop are special marks for the start action and stop action respectively, and s_(i,n(i)) ∈ S_cand;
(1-8-4) Repeat step (1-8-3) to obtain the sequences corresponding to all current initial states, and calculate the loss function corresponding to the current initial state set, i.e., the negative log-likelihood under the navigation model Nav of the actions along each sampled sequence, summed over the batch;
(1-8-5) Calculate the gradient of the loss function with respect to the current trainable parameter set Θ, optimize the navigation model parameters with a stochastic gradient descent optimizer using the calculated gradient, and update the current trainable parameter set (after each batch of training, all parameters in the trainable parameter set Θ are updated);
(1-8-6) Repeat steps (1-8-2) to (1-8-5) until the navigation model converges, obtaining the trained navigation model and the final trainable parameter set Θ' = [θ'_1; θ'_2; W'_3; h'_0; c'_0];
(2) A use stage; the method comprises the following specific steps:
(2-1) Place the physical robot at any position in the real physical scene as the initial position (the model of the physical robot should be as consistent as possible with the simulation robot), initialize the action a_0 = a_start, and obtain the initial observation image I_1 of the robot at the initial position;
(2-2) Set the LSTM hidden layer initial state of the navigation model trained in step (1) to h'_0, c'_0, and set the iteration step number i = 1;
(2-3) Input I_i and a_(i-1) into the navigation model trained in step (1) to obtain the model output a_i and the hidden state h_i, c_i, wherein I_i is the observation image of the i-th step of the physical robot (obtained by executing action a_(i-1)), a_i is the action executed by the physical robot at step i, and h_i, c_i are the state parameters (intermediate state vectors) of the i-th step of the LSTM hidden layer of the trained navigation model.
(2-4) Judge a_i: if action a_i is the termination action a_stop, enter step (2-5); otherwise, the physical robot executes action a_i to obtain the updated observation image I_(i+1) of the (i+1)-th step; then let i = i+1 and return to step (2-3).
(2-5) Navigation ends; input I_i into the static image description generation model Cap obtained in step (1-2), and obtain the word sequence of the image description corresponding to the observation image: (w_1, w_2, ..., w_m) = Cap(I_i).
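The whole use stage reduces to a navigate-then-caption loop; a minimal sketch with the trained models abstracted as callables (all names and signatures here are illustrative stand-ins, not the patent's actual implementation):

```python
def describe_scene(policy_step, do_action, get_image, caption,
                   a_stop="stop", max_steps=50):
    """Use-stage sketch (steps (2-1)-(2-5)): step the trained navigation
    policy until it emits the stop action, then caption the final view.
    policy_step(image, prev_action) stands in for the navigation model,
    caption(image) for the model Cap."""
    action = "start"                         # a_0 = a_start
    for _ in range(max_steps):
        image = get_image()                  # observe I_i
        action = policy_step(image, action)  # navigation output a_i
        if action == a_stop:                 # step (2-4): stop reached
            break
        do_action(action)                    # robot executes a_i
    return caption(get_image())              # (w_1, ..., w_m) = Cap(I_i)
```

Note the captioning model is called exactly once, on the final observation, which is what distinguishes this active pipeline from per-frame captioning.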
Claims (1)
1. a robot-oriented active scene description method comprises a training phase and a using phase, and is characterized by comprising the following steps:
(1) a training stage; the method comprises the following specific steps:
(1-1) using an image caption generation dataset as the image description generation dataset, and performing text description labeling on each image in the image description generation dataset to obtain the labeled image description generation dataset;
(1-2) selecting a static image description generation framework, and training the framework with the image description generation dataset labeled in step (1-1) to obtain a trained static image description generation model Cap; the model takes any image as input and outputs the image description word sequence corresponding to that image, with the expression:
(w_1^(I), w_2^(I), ..., w_(m_I)^(I)) = Cap(I)
wherein Cap is the static image description generation model, I is the input image, (w_1^(I), w_2^(I), ..., w_(m_I)^(I)) is the image description word sequence corresponding to the input image I, and w_(m_I)^(I) is the m_I-th word of that word sequence;
(1-3) selecting a simulation environment;
(1-4) constructing a state set S to obtain a scene connection graph G = &lt;S, E&gt;; the method comprises the following specific steps:
(1-4-1) from the simulation environment, selecting all actions supported by the physical robot in the use stage to form an action space A;
(1-4-2) initializing the state set S = {s_0}, the edge set E = ∅, and the search queue S_Q = {s_0}, wherein s_0 is any reachable state of the simulation robot in the simulation scene;
(1-4-3) judging S_Q:
if S_Q is not empty, executing step (1-4-4); otherwise, executing step (1-4-6);
(1-4-4) letting the current state S_C = S_Q.dequeue(), where dequeue represents the dequeue operation of a queue data structure;
(1-4-5) for each a ∈ A, where a represents any action in the action space: if the simulation robot in state S_C is allowed to execute action a, adding the edge going from S_C through the executed action a to the state Exec(S_C, a) into the edge set E, where Exec(S_C, a) represents the new state obtained when the simulation robot executes action a from state S_C;
then judging each Exec(S_C, a):
if Exec(S_C, a) ∉ S, executing S_Q.enqueue(Exec(S_C, a)), adding the state Exec(S_C, a) to the search queue S_Q, adding Exec(S_C, a) to the state set S, and then returning to step (1-4-3); otherwise, performing no operation and returning to step (1-4-3); wherein enqueue represents the enqueue operation of a queue data structure;
(1-4-6) after the state set S is constructed, finally obtaining the scene connection graph G = &lt;S, E&gt;, wherein S is the state set, each state in S corresponds to one node of the scene connection graph, E is the set of edges connecting any two states in the state set, and each edge corresponds to the robot action connecting the two states;
(1-5) calculating all shortest paths between any two nodes in the scene connection graph G; each shortest path correspondingly connects one state sequence of the two nodes, and adjacent states in the state sequence are connected through actions corresponding to edges connecting the two adjacent states in the edge set E;
(1-6) obtaining a candidate state set;
for each state s ∈ S in the state set S obtained in step (1-4), obtaining the corresponding observation image I(s) of the state s in the simulation scene and the target detection set O(s) corresponding to I(s), and obtaining the word set W(s) = {Cap(I(s))} corresponding to the state by using the static image description generation model Cap trained in step (1-3), wherein {q} represents the set formed by the non-repetitive elements of the sequence q; and calculating the score corresponding to the state s, the expression being as follows:
score(s) = α|O(s)| + |W(s) ∩ O(s)|
wherein α is a trade-off factor;
after the scores corresponding to all the states in the state set S are calculated, the states whose score is higher than the set score threshold form the candidate state set Scand; the score threshold is a set proportion of the highest score among the scores of all states, namely:
Scand = {s | s ∈ S, score(s) > β·max_{s'∈S} score(s')}
wherein β is a scaling factor;
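A compact sketch of the scoring and thresholding of step (1-6); the concrete α and β values below are illustrative only, since the patent leaves both as tunable factors, and O and W are assumed to be precomputed dictionaries mapping states to detection and caption-word sets:

```python
def candidate_states(states, O, W, alpha=0.5, beta=0.8):
    """score(s) = alpha*|O(s)| + |W(s) ∩ O(s)|, then keep the states
    whose score exceeds beta times the best score (step (1-6)).

    O -- state -> set of detected object labels in I(s)
    W -- state -> set of words produced by Cap(I(s))
    """
    score = {s: alpha * len(O[s]) + len(W[s] & O[s]) for s in states}
    top = max(score.values())
    cand = {s for s in states if score[s] > beta * top}
    return cand, score
```

States whose observation image contains many detected objects that the caption model also mentions score highest and survive the β cut.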
(1-7) constructing a navigation model;
the navigation model is composed of a convolutional neural network CNN, a long short-term memory model LSTM and a fully connected layer, wherein the convolutional neural network is connected to the input layer of each time step of the LSTM, and the fully connected layer is connected to the output layer of each time step of the LSTM;
letting the initial parameter of the CNN be θ1, the initial parameter of the LSTM be θ2, the initial parameter of the fully connected layer be W3, and the initial state of the LSTM hidden layer be h0, c0, the iterative process of the navigation model is as follows:
ht+1,ct+1=LSTM(ht,ct,[CNN(It+1);at])
p(at+1)=Softmax(W3ht+1)
wherein at and It+1 are the inputs of the t-th iteration of the navigation model: at is the action executed by the robot at iteration step t, and It+1 is the observation image obtained after executing action at; for t = 0, at = astart is set, where astart represents the start action; p(at+1) represents the conditional probability of each action executed at iteration step t+1 and is the output of the navigation model; θ = [θ1; θ2; W3; h0; c0] constitutes the current trainable parameter set of the navigation model;
and recording the result of the iterative process after step t as Nav(h0, c0, I1, I2, ..., It, a0, a1, ..., at) = p(at+1);
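The recurrence of step (1-7) can be illustrated with a toy stand-in; here a single hand-rolled LSTM cell operates on a precomputed image feature in place of CNN(I), and all dimensions, weight shapes and random initialisation are assumptions for illustration, not the patent's architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class NavModel:
    """Toy stand-in for the navigation model of step (1-7): an image
    feature (in place of CNN(I)) concatenated with the previous action
    drives one LSTM cell; a fully connected layer W3 maps the hidden
    state to action probabilities."""

    def __init__(self, img_dim, act_dim, hid, rng):
        in_dim = img_dim + act_dim + hid          # [CNN(I); a; h]
        self.W = rng.standard_normal((4 * hid, in_dim)) * 0.1  # gate weights
        self.b = np.zeros(4 * hid)
        self.W3 = rng.standard_normal((act_dim, hid)) * 0.1    # output layer
        self.hid = hid

    def step(self, h, c, feat, act):
        """h_{t+1}, c_{t+1} = LSTM(h_t, c_t, [CNN(I_{t+1}); a_t]);
        p(a_{t+1}) = Softmax(W3 h_{t+1})."""
        z = self.W @ np.concatenate([feat, act, h]) + self.b
        i, f, g, o = np.split(z, 4)               # input/forget/cell/output gates
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new, softmax(self.W3 @ h_new)
```

One call to `step` corresponds to one iteration of the Nav recurrence; the returned probability vector is p(at+1).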
(1-8) sampling the shortest path and training a navigation model to obtain a trained navigation model; the method comprises the following specific steps:
(1-8-1) setting B as the size of a training batch;
(1-8-2) randomly sampling B states from S as the current initial state set s11, s21, ..., sB1;
(1-8-3) for each current initial state si1, 1 ≤ i ≤ B, randomly sampling one candidate state from the candidate state set Scand obtained in step (1-6) as the termination candidate position si,n(i), 1 ≤ i ≤ B, where n(i) is the length of the state sequence traversed by the shortest path connecting si1 and si,n(i); randomly selecting one shortest path from all shortest paths connecting si1 and si,n(i) to obtain the sequence of nodes and edges corresponding to this path:
wherein aij is the action taken when transitioning from state sij to the next state si,j+1, I(sij) is the observation image corresponding to the state sij, and astart and astop are the start action and the stop action, respectively;
(1-8-4) repeating step (1-8-3) to obtain the sequences corresponding to all current initial states, and calculating the loss function corresponding to the current initial state set;
(1-8-5) calculating the gradient of the loss function with respect to the current trainable parameter set θ, optimizing the navigation model parameters by a stochastic gradient descent optimizer according to the calculated gradient, and updating the current trainable parameter set;
(1-8-6) repeating steps (1-8-2) to (1-8-5) until the navigation model converges, obtaining the trained navigation model and the final trainable parameter set θ′ = [θ′1; θ′2; W′3; h′0; c′0];
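The patent's loss formula is not reproduced in this text (it was lost with the original figures). Assuming the common imitation-learning choice of summed negative log-likelihood of the expert actions along each sampled shortest path, the batch loss of step (1-8-4) might look like the following sketch:

```python
import numpy as np

def sequence_loss(prob_seqs, action_seqs):
    """Hedged sketch of the loss of step (1-8-4): for each sampled
    shortest path, sum -log p(a) of the expert action a under the
    navigation model's per-step probability vectors, then average
    over the batch of B paths. (The exact formula in the patent is
    not recoverable from this text; NLL is an assumption.)

    prob_seqs   -- per path: list of probability vectors p(a_{t+1})
    action_seqs -- per path: list of expert action indices
    """
    total = 0.0
    for probs, acts in zip(prob_seqs, action_seqs):
        total += -sum(np.log(p[a]) for p, a in zip(probs, acts))
    return total / len(prob_seqs)
```

Minimising this by stochastic gradient descent, as in step (1-8-5), drives the model toward reproducing the sampled shortest-path action sequences.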
(2) A use stage; the method comprises the following specific steps:
(2-1) placing the physical robot at any position in the real physical scene as the initial position, initializing the action â0 = astart, and obtaining the corresponding observation image Î1 of the physical robot;
(2-2) setting the initial state of the LSTM hidden layer of the navigation model trained in step (1) to h′0, c′0, and letting the iteration step number i = 1;
(2-3) inputting Îi and âi-1 into the navigation model trained in step (1) to obtain the model output âi, ĥi, ĉi, wherein Îi is the observation image of the physical robot at step i, âi is the action executed by the physical robot at step i, and ĥi, ĉi are the state parameters of the LSTM hidden layer of the trained navigation model at step i;
(2-4) judging âi: if the action âi is the termination action astop, entering step (2-5); otherwise, the physical robot executes the action âi to obtain the updated observation image Îi+1 of step i+1, then i = i+1 is set and step (2-3) is returned to;
(2-5) navigation ends; Îi is input into the static image description generation model Cap obtained in step (1-3) to obtain the word sequence of the image description corresponding to the observation image.
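The use-stage loop of steps (2-1) to (2-5) can be sketched as follows; `model`, `cap`, `observe` and `execute` are hypothetical hooks onto the trained navigation model, the caption model Cap, and the physical robot, and the step cap is an added safety assumption:

```python
def run_use_stage(model, cap, observe, execute, a_start, a_stop, max_steps=100):
    """Use-stage loop of steps (2-1)..(2-5): iterate the navigation
    policy until it emits the stop action, then caption the final view.

    model(img, prev_a) -- trained navigation model returning the next action
    cap(img)           -- static image description model Cap
    observe()/execute(a) -- hypothetical robot sensing/actuation hooks
    """
    prev_a = a_start                  # step (2-1): initialize the action
    img = observe()                   # and the first observation image
    for _ in range(max_steps):        # safety cap (not in the patent)
        a = model(img, prev_a)        # step (2-3): query the policy
        if a == a_stop:               # step (2-4): stop action reached
            return cap(img)           # step (2-5): describe the final view
        execute(a)                    # otherwise act, re-observe, repeat
        img = observe()
        prev_a = a
    return cap(img)                   # fall back to captioning anyway
```

With a stub robot and policy this loop walks until the policy signals the stop action, then hands the last observation to the caption model.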
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010287188.8A CN111611373B (en) | 2020-04-13 | 2020-04-13 | Robot-oriented specific active scene description method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111611373A true CN111611373A (en) | 2020-09-01 |
CN111611373B CN111611373B (en) | 2021-09-10 |
Family
ID=72197820
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010287188.8A Active CN111611373B (en) | 2020-04-13 | 2020-04-13 | Robot-oriented specific active scene description method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111611373B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107065881A (en) * | 2017-05-17 | 2017-08-18 | 清华大学 | Robot global path planning method based on deep reinforcement learning |
WO2018170671A1 (en) * | 2017-03-20 | 2018-09-27 | Intel Corporation | Topic-guided model for image captioning system |
CN108875807A (en) * | 2018-05-31 | 2018-11-23 | 陕西师范大学 | Image description method based on multiple attention and multiple scales |
CN109029444A (en) * | 2018-06-12 | 2018-12-18 | 深圳职业技术学院 | Indoor navigation system and navigation method based on image matching and spatial positioning |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113111192A (en) * | 2021-04-28 | 2021-07-13 | 清华大学 | Method, equipment and exploration method for intelligent agent to actively construct environment scene map |
CN113111192B (en) * | 2021-04-28 | 2022-03-29 | 清华大学 | Method, equipment and exploration method for intelligent agent to actively construct environment scene map |
Also Published As
Publication number | Publication date |
---|---|
CN111611373B (en) | 2021-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kar et al. | Meta-sim: Learning to generate synthetic datasets | |
CN107515674B (en) | Mining operation interaction implementation method based on virtual reality and augmented reality | |
Miao et al. | Parallel learning: Overview and perspective for computational learning across Syn2Real and Sim2Real | |
CN108875807B (en) | Image description method based on multiple attention and multiple scales | |
CN108229444B (en) | Pedestrian re-identification method based on integral and local depth feature fusion | |
Vinyals et al. | Show and tell: A neural image caption generator | |
Mi et al. | Hdmapgen: A hierarchical graph generative model of high definition maps | |
CN110851760B (en) | Human-computer interaction system for integrating visual question answering in web3D environment | |
Stein et al. | Genesis-rt: Generating synthetic images for training secondary real-world tasks | |
CN115769234A (en) | Template-based generation of 3D object mesh from 2D images | |
Hamdi et al. | SADA: semantic adversarial diagnostic attacks for autonomous applications | |
KR102375286B1 (en) | Learning method and learning device for generating training data from virtual data on virtual world by using generative adversarial network, to thereby reduce annotation cost required in training processes of neural network for autonomous driving | |
Hu et al. | Safe navigation with human instructions in complex scenes | |
CN113506377A (en) | Teaching training method based on virtual roaming technology | |
CN111611373B (en) | Robot-oriented specific active scene description method | |
Cruz et al. | Closing the simulation-to-reality gap using generative neural networks: Training object detectors for soccer robotics in simulation as a case study | |
US20220207831A1 (en) | Simulated control for 3- dimensional human poses in virtual reality environments | |
Xu et al. | Text-guided human image manipulation via image-text shared space | |
Chen et al. | Neural task planning with and–or graph representations | |
CN117115911A (en) | Hypergraph learning action recognition system based on attention mechanism | |
Di et al. | Multi-agent reinforcement learning of 3d furniture layout simulation in indoor graphics scenes | |
Ren et al. | InsActor: Instruction-driven Physics-based Characters | |
CN114168769B (en) | Visual question-answering method based on GAT relation reasoning | |
Sahni et al. | Addressing sample complexity in visual tasks using her and hallucinatory gans | |
Putra et al. | Designing translation tool: Between sign language to spoken text on kinect time series data using dynamic time warping |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||