CN111611373A - Robot-oriented specific active scene description method


Info

Publication number: CN111611373A
Authority: CN (China)
Prior art keywords: state, image, robot, model, action
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202010287188.8A
Other languages: Chinese (zh)
Other versions: CN111611373B
Inventors: 刘华平, 谭思楠, 郭迪, 孙富春
Current Assignee: Tsinghua University
Original Assignee: Tsinghua University
Priority date / Filing date: 2020-04-13
Publication date: 2020-09-01 (CN111611373A); 2021-09-10 (CN111611373B, grant)
Application filed by Tsinghua University
Current legal status: Active (granted)


Classifications

    • G - PHYSICS
        • G06 - COMPUTING; CALCULATING OR COUNTING
            • G06F - ELECTRIC DIGITAL DATA PROCESSING
                • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
                        • G06F16/34 - Browsing; Visualisation therefor
                • G06F18/00 - Pattern recognition
                    • G06F18/20 - Analysing
                        • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                            • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
            • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 - Computing arrangements based on biological models
                    • G06N3/02 - Neural networks
                        • G06N3/04 - Architecture, e.g. interconnection topology
                            • G06N3/044 - Recurrent networks, e.g. Hopfield networks
                            • G06N3/045 - Combinations of networks
                        • G06N3/08 - Learning methods

Abstract

The invention relates to a robot-oriented specific active scene description method and belongs to the technical field of image processing. The invention combines a static image description generation model with a navigation model of the robot: when the robot starts in an initial scene with a poor viewing angle, the trained navigation model provides an effective action sequence that lets the robot adjust its viewing angle, and once a suitable final viewing angle is found, the trained static image description generation model is called to obtain the final scene description result. The invention overcomes the defect that traditional image description generation models cannot handle such specific scenes, can generate more accurate and comprehensive image descriptions in three-dimensional specific scenes, and can be used in fields such as service robots, security monitoring and accessible human-computer interaction.

Description

Robot-oriented specific active scene description method
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a robot-oriented specific active scene description method.
Background
Image description generation refers to the technology of generating a corresponding text description for a given image. It allows a computer to provide human-readable text descriptions of natural images and has wide and important applications in security monitoring, image retrieval, web image processing and accessible human-computer interaction.
In recent years, automatic description generation for single static pictures has made great progress. However, current image description generation models are only applicable to static pictures and are of little use in interactive scenes. For example, a robot may be facing a wall in a room; no matter which image description generation model is used, the resulting description is meaningless for the indoor scene as a whole. Yet if the robot could turn 180 degrees, it might see a completely different scene.
Current image description generation methods are mainly based on a convolutional neural network and an LSTM (long short-term memory) language model. The typical approach extracts image features with a convolutional neural network and then uses the LSTM model to generate the final text recursively, word by word. When generating text, the LSTM model may use an attention mechanism to assign weights to the regional features at different positions of the image and take their weighted average, thereby improving the generation quality. Some models additionally use pre-extracted keywords and a semantic attention mechanism to fuse the information of these keywords and further improve generation. However, all of these models can only generate text for a static picture, because they accept only a single picture as input. None of these generative models can be used when the viewpoint of the picture is not appropriate (e.g., facing a wall or a window in an indoor scene).
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a robot-oriented specific active scene description method. The method combines a natural language model with a navigation model of the robot: when the robot is in an initial scene with a poor viewing angle, the navigation model provides an effective action sequence that lets the robot adjust its viewing angle, and once a suitable final viewing angle is found, the text generation model for static pictures is called to obtain a final, meaningful scene description result. The invention overcomes the defect that traditional image description generation models cannot handle such specific scenes, can generate more accurate and comprehensive image descriptions in three-dimensional specific scenes, and can be used in fields such as service robots, security monitoring and accessible human-computer interaction.
The invention provides a robot-oriented specific active scene description method, which is divided into a training stage and a use stage and is characterized by comprising the following steps:
(1) a training stage; the method comprises the following specific steps:
(1-1) using an image caption generation dataset as the image description generation dataset, and labeling each image in the image description generation dataset with a text description to obtain a labeled image description generation dataset;
(1-2) selecting a static image description generation framework and training it with the image description generation dataset labeled in step (1-1) to obtain a trained static image description generation model Cap; the input of the model is any single image and the output is the image description word sequence corresponding to that image, expressed as:
(w_1^(I), w_2^(I), ..., w_{m_I}^(I)) = Cap(I)
where Cap is the static image description generation model, I is the input image, (w_1^(I), w_2^(I), ..., w_{m_I}^(I)) is the image description word sequence corresponding to the input image I, and w_{m_I}^(I) is the m_I-th word of the image description word sequence corresponding to the input image I;
(1-3) selecting a simulation environment;
(1-4) constructing a state set S to obtain a scene connection graph G = <S, E>; the method comprises the following specific steps:
(1-4-1) from the simulation environment, selecting all actions supported by the physical robot in the use stage to form an action space A;
(1-4-2) initializing the state set S = {s_0}, the edge set E = ∅, and the search queue S_Q = {s_0}, where s_0 is any reachable state of the simulated robot in the simulation scene;
(1-4-3) judging S_Q: if S_Q is not empty, executing step (1-4-4); otherwise, executing step (1-4-6);
(1-4-4) letting the current state S_C = S_Q.dequeue(), where dequeue denotes the dequeue operation of a queue data structure;
(1-4-5) for each a ∈ A, where a denotes any action in the action space: if the simulated robot is allowed to execute action a in state S_C, adding to the edge set E the edge that goes from S_C, through the executed action a, to the state Exec(S_C, a), where Exec(S_C, a) denotes the new state obtained when the simulated robot executes action a from state S_C;
then judging every Exec(S_C, a):
if Exec(S_C, a) ∉ S, executing S_Q.enqueue(Exec(S_C, a)), i.e. adding the state Exec(S_C, a) to the search queue S_Q, adding Exec(S_C, a) to the state set S, and then returning to step (1-4-3); otherwise, performing no operation and returning to step (1-4-3); where enqueue denotes the enqueue operation of a queue data structure;
(1-4-6) after the state set S is constructed, finally obtaining the scene connection graph G = <S, E>, where S is the state set, each state in S corresponds to one node of the scene connection graph, E is the set of edges connecting any two states in the state set, and each edge corresponds to the robot action connecting the two states;
(1-5) calculating all shortest paths between any two nodes of the scene connection graph G; each shortest path corresponds to a state sequence connecting the two nodes, and adjacent states in the state sequence are connected by the action corresponding to the edge in the edge set E that joins those two adjacent states;
(1-6) obtaining a candidate state set;
for each state s ∈ S in the state set S obtained in step (1-4), obtaining the observation image I(s) corresponding to state s in the simulation scene and the object detection result set O(s) corresponding to I(s), and using the static image description generation model Cap trained in step (1-2) to obtain the word set W(s) = {Cap(I(s))} corresponding to the state, where {q} denotes the set formed by the non-repeated elements of a sequence q; then calculating the score corresponding to state s, expressed as:
score(s) = α|O(s)| + |W(s) ∩ O(s)|
where α is a trade-off factor;
after the scores of all states in the state set S have been calculated, the states whose scores are higher than a set score threshold form the candidate state set S_cand, where the score threshold is a set proportion of the highest score among all state scores, namely:
S_cand = {s | s ∈ S, score(s) > β · max_{s'∈S} score(s')}
where β is a proportion factor;
(1-7) constructing a navigation model;
the navigation model is composed of a convolutional neural network CNN, a long short-term memory model LSTM and a fully connected layer, where the convolutional neural network is connected to the input layer of each time step of the long short-term memory model and the fully connected layer is connected to the output layer of each time step of the long short-term memory model;
let the initial parameters of the CNN be θ_1, the initial parameters of the LSTM be θ_2, the initial parameters of the fully connected layer be W_3, and the initial state of the LSTM hidden layer be h_0, c_0; then the iterative process of the navigation model is:
h_{t+1}, c_{t+1} = LSTM(h_t, c_t, [CNN(I_{t+1}); a_t])
p(a_{t+1}) = Softmax(W_3 h_{t+1})
where a_t, I_{t+1} are the inputs of the t-th iteration of the navigation model, a_t is the action performed by the robot at step t, and I_{t+1} is the observation image obtained after performing action a_t; for t = 0, set a_t = a_start, where a_start denotes the start action; p(a_{t+1}) denotes the conditional probability of each action to be executed at iteration step t+1 and is the output of the navigation model; θ = [θ_1; θ_2; W_3; h_0; c_0] constitutes the current trainable parameter set of the navigation model;
the result of the iterative process up to step t is denoted Nav(h_0, c_0, I_1, I_2, ..., I_t, a_0, a_1, ..., a_t) = p(a_{t+1});
(1-8) sampling the shortest path and training a navigation model to obtain a trained navigation model; the method comprises the following specific steps:
(1-8-1) setting B as the size of a training batch;
(1-8-2) randomly sampling B states from S as the current initial state set s_{11}, s_{21}, ..., s_{B1};
(1-8-3) for each current initial state s_{i1}, 1 ≤ i ≤ B, randomly sampling one candidate state from the candidate state set S_cand obtained in step (1-6) as the termination candidate position s_{i,n(i)} of that initial state, 1 ≤ i ≤ B, where n(i) is the length of the state sequence traversed by the shortest path connecting s_{i1} and s_{i,n(i)}; randomly selecting one shortest path among all shortest paths connecting s_{i1} and s_{i,n(i)}, obtaining the sequence of nodes and edges corresponding to that path:
(a_start, I(s_{i1}), a_{i1}, I(s_{i2}), a_{i2}, ..., a_{i,n(i)-1}, I(s_{i,n(i)}), a_stop)
where a_{ij} is the action traversed when transitioning from state s_{ij} to the next state s_{i,j+1}, I(s_{ij}) is the observation image corresponding to state s_{ij}, and a_start and a_stop are the start action and stop action, respectively;
(1-8-4) repeating step (1-8-3) to obtain the sequences corresponding to all current initial states; calculating the loss function corresponding to the current initial state set, namely the average negative log-likelihood that the navigation model assigns to each ground-truth action along the sampled sequences (with teacher forcing), including the final stop action:
L(θ) = -(1/B) Σ_{i=1}^{B} Σ_{j=1}^{n(i)} log p(a_{ij}), with a_{i,n(i)} = a_stop
where p(a_{ij}) is the probability that the navigation model, fed with the observation images I(s_{i1}), ..., I(s_{ij}) and the actions a_start, a_{i1}, ..., a_{i,j-1}, assigns to the ground-truth action a_{ij};
(1-8-5) calculating the gradient ∇_θ L(θ) of the loss function with respect to the current trainable parameter set θ; optimizing the navigation model parameters with a stochastic gradient descent optimizer according to the calculated gradient and updating the current trainable parameter set;
(1-8-6) repeating steps (1-8-2) to (1-8-5) until the navigation model converges, obtaining the trained navigation model and the final trainable parameter set θ' = [θ'_1; θ'_2; W'_3; h'_0; c'_0];
(2) a use stage; the method comprises the following specific steps:
(2-1) placing the physical robot at any position in the real physical scene as the initial position, initializing the action â_0 = a_start, and obtaining the corresponding initial observation image Î_1 of the physical robot;
(2-2) setting the LSTM hidden-layer initial state of the navigation model trained in step (1) to ĥ_0 = h'_0, ĉ_0 = c'_0, and letting the iteration step number i = 1;
(2-3) inputting Î_i and â_{i-1}, together with the hidden state ĥ_{i-1}, ĉ_{i-1}, into the navigation model trained in step (1) to obtain the model output â_i, ĥ_i, ĉ_i, where Î_i is the observation image of the physical robot at step i, â_i is the action to be performed by the physical robot at step i, selected according to the conditional probability distribution output by the navigation model, and ĥ_i, ĉ_i are the LSTM hidden-layer state parameters of the trained navigation model at step i;
(2-4) judging â_i: if action â_i is the stop action a_stop, going to step (2-5); otherwise, the physical robot performs action â_i, obtains the updated observation image Î_{i+1} of step i+1, then lets i = i+1 and returns to step (2-3);
(2-5) navigation ends; inputting the final observation image Î_i into the static image description generation model Cap obtained in step (1-2), obtaining the image description word sequence corresponding to that observation image:
(w_1, w_2, ..., w_m) = Cap(Î_i)
which is the final scene description result.
the invention has the characteristics and beneficial effects that:
the invention constructs a robot-oriented specific active scene description method by utilizing a navigation model and an image description generation model based on deep learning. The method overcomes the defect that a static description generation model cannot deal with the condition that the robot is in a poor visual angle, can guide the robot to navigate and adjust the visual angle of the robot, and then generates the image description after the robot reaches the proper visual angle in the scene.
The invention can be used in the field of robots, allowing a robot to autonomously navigate to a suitable location within an area, explore the environment and generate the required natural language description. The method can be used in the fields of service robots, security monitoring, barrier-free man-machine interaction and the like, and can generate more accurate and comprehensive image description in a three-dimensional specific scene.
Detailed Description
The invention provides a robot-oriented active scene description method, which is further described in detail below with reference to specific embodiments.
The invention provides a robot-oriented specific active scene description method. The method combines a static image description generation model with a navigation model of the robot: when the robot is in an initial scene with a poor viewing angle, the navigation model provides an effective action sequence that lets the robot adjust its viewing angle, and once a suitable final viewing angle is found, the static image description generation model is called to obtain the final scene description result.
The invention provides a robot-oriented active scene description method, which comprises a training stage and a use stage, and comprises the following steps:
(1) a training stage; the method comprises the following specific steps:
(1-1) An image caption generation dataset is used as the image description generation dataset (if a public dataset is used, the MSCOCO dataset may be chosen). The image description generation dataset needs to contain a large number of images (generally 10,000 to 1,000,000; 100,000 in this embodiment), and each image in the dataset is manually labeled to obtain the labeled image description generation dataset.
The label content of each image is one or more text descriptions corresponding to that image (the more text descriptions per image the better; the text descriptions of the same image should differ in content; the number of texts per image need not be the same, generally 1-20, and in this embodiment each image has 5). The style of the text descriptions should be as uniform as possible, and rarely used words should be avoided. Due to the limitations of the model, the text descriptions should not be too long (in this embodiment each text description generally does not exceed 25 words).
(1-2) A static image description generation framework is selected (e.g. ImageCaptioning.pytorch, which is used in this embodiment) and trained with the image description generation dataset labeled in step (1-1) to obtain the trained static image description generation model Cap; the input of the model is any single image and the output is the image description word sequence corresponding to the input image, expressed as:
(w_1^(I), w_2^(I), ..., w_{m_I}^(I)) = Cap(I)
where Cap is the static image description generation model, I is the input image, (w_1^(I), w_2^(I), ..., w_{m_I}^(I)) is the image description word sequence corresponding to the input image I, and w_{m_I}^(I) is the m_I-th word of that word sequence. The number of words in the output image description corresponding to each input image is not necessarily equal.
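For illustration only, the interface of such a trained captioning model can be wrapped as follows; the class and the backbone.caption method are hypothetical stand-ins for whatever framework is actually trained in step (1-2), not part of the patent:

from typing import List

class Cap:
    """Hypothetical wrapper around a trained static image captioning model.

    `backbone` is assumed to expose a caption(image) -> str method; the
    concrete framework (an encoder-decoder with attention, for example)
    is trained beforehand on the labeled image description dataset.
    """

    def __init__(self, backbone):
        self.backbone = backbone

    def __call__(self, image) -> List[str]:
        # Generate a sentence for the image and return it as a word sequence
        # (w_1, w_2, ..., w_m), mirroring the notation Cap(I) in the text.
        sentence = self.backbone.caption(image)
        return sentence.strip().split()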
(1-3) A simulation environment is selected (it should be as close as possible to the real physical environment in which the robot is deployed; if an open-source simulation environment is used, AI2Thor may be chosen, and AI2Thor is used in this embodiment). The simulation environment supports virtual scenes and a simulated robot, the simulated robot can be operated programmatically, and various actions of the simulated robot are supported (the simulated robot actions supported by AI2Thor in this embodiment are turning left, turning right, moving forward, obtaining coordinates and teleporting).
(1-4) The state set S is constructed to obtain the scene connection graph G = <S, E>; the method comprises the following specific steps:
(1-4-1) from the simulation environment, select all actions that the physical robot (i.e. the robot that actually executes the actions in step (2)) needs to support, forming the action space A (in this embodiment, the 5 actions forward, backward, turn left, turn right and stop in AI2Thor form the action space);
(1-4-2) initialize the state set S = {s_0}, the edge set E = ∅, and the search queue S_Q = {s_0}, where s_0 is any reachable state of the simulated robot in the simulation scene (a state comprises the position and posture of the robot);
(1-4-3) judge S_Q: if S_Q is not empty, execute step (1-4-4); otherwise, execute step (1-4-6);
(1-4-4) let the current state S_C = S_Q.dequeue(), where dequeue denotes the dequeue operation of a queue data structure;
(1-4-5) for each a ∈ A, where a denotes any action in the action space: if the simulated robot is allowed to execute action a in state S_C, add to the edge set E the edge that goes from S_C, through the executed action a, to the state Exec(S_C, a), where Exec(S_C, a) denotes the new state obtained when the simulated robot executes action a from state S_C;
then judge every Exec(S_C, a):
if Exec(S_C, a) ∉ S, execute S_Q.enqueue(Exec(S_C, a)), i.e. add the new state Exec(S_C, a) to the search queue S_Q, add Exec(S_C, a) to the state set S, and then return to step (1-4-3); otherwise, perform no operation and return to step (1-4-3); where enqueue denotes the enqueue operation of a queue data structure;
(1-4-6) after the state set S is constructed, the discretized scene connection graph G = <S, E> is finally obtained, where S is the state set (each state corresponds to one node of the scene connection graph) and E is the set of edges connecting any two states in the state set (each edge corresponds to the robot action connecting the two states).
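Steps (1-4-1) to (1-4-6) amount to a breadth-first search over the states reachable by the simulated robot. A minimal Python sketch is given below; allowed(state, action) and exec_action(state, action) are hypothetical hooks into the simulator (e.g. AI2Thor), and states are assumed to be hashable (position plus posture):

from collections import deque

def build_scene_graph(s0, actions, allowed, exec_action):
    """Breadth-first construction of the scene connection graph G = <S, E>.

    s0          -- any reachable initial state of the simulated robot
    actions     -- the action space A supported by the physical robot
    allowed     -- allowed(state, action) -> bool, whether the simulator
                   permits executing `action` in `state` (hypothetical hook)
    exec_action -- exec_action(state, action) -> Exec(state, action),
                   the resulting new state (hypothetical hook)
    """
    S = {s0}                 # state set (nodes)
    E = []                   # edge set: (state, action, next_state)
    queue = deque([s0])      # search queue S_Q
    while queue:             # step (1-4-3)
        s_c = queue.popleft()            # step (1-4-4): dequeue current state
        for a in actions:                # step (1-4-5)
            if not allowed(s_c, a):
                continue
            s_next = exec_action(s_c, a)
            E.append((s_c, a, s_next))   # edge S_C --a--> Exec(S_C, a)
            if s_next not in S:          # unseen state: record and enqueue
                S.add(s_next)
                queue.append(s_next)
    return S, E              # step (1-4-6): scene connection graph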
(1-5) All shortest paths between any two nodes of the scene connection graph G are calculated using the Floyd-Warshall algorithm. The states in the state set S traversed by each shortest path between two nodes form a state sequence, and adjacent states in the state sequence are connected by the action corresponding to the edge in the edge set E that joins those two adjacent states.
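A compact sketch of this all-pairs computation is shown below, assuming every edge costs one robot action. In addition to the Floyd-Warshall distance table it shows how one of the possibly many shortest paths between two states can be sampled at random, as required later in step (1-8-3); the function names are illustrative only:

import random
from math import inf

def all_pairs_shortest(nodes, edges):
    """Floyd-Warshall over the scene connection graph; every edge costs 1 action."""
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for u, _, v in edges:
        dist[u][v] = min(dist[u][v], 1)
    for k in nodes:                      # classic O(|S|^3) triple loop
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def sample_shortest_path(src, dst, edges, dist):
    """Sample one of the shortest src -> dst paths (assumes dst is reachable).

    Returns the traversed states and actions, matching the training
    sequences sampled in step (1-8-3).
    """
    out = {}                             # adjacency: state -> [(action, next_state)]
    for u, a, v in edges:
        out.setdefault(u, []).append((a, v))
    states, actions, cur = [src], [], src
    while cur != dst:
        # any neighbour that stays on some shortest path is a valid next hop
        options = [(a, v) for a, v in out.get(cur, [])
                   if dist[cur][dst] == 1 + dist[v][dst]]
        a, cur = random.choice(options)
        actions.append(a)
        states.append(cur)
    return states, actions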
(1-6) obtaining a candidate state set;
For each state s ∈ S in the state set S obtained in step (1-4) (each state generally comprises the position and posture of the robot), the simulation software is used to obtain the observation image I(s) corresponding to state s in the simulation scene and the object detection result set O(s) corresponding to I(s), and the static image description generation model Cap trained in step (1-2) is used to obtain the word set W(s) = {Cap(I(s))} corresponding to the state (where {q} denotes the set formed by the non-repeated elements of a sequence q). The score corresponding to state s is then calculated as:
score(s) = α|O(s)| + |W(s) ∩ O(s)|
where α is a trade-off factor (typically 0.5-5.0), 1.0 in this embodiment.
After the scores of all states in the state set S have been calculated, the states whose scores are higher than a set score threshold form the candidate state set S_cand, where the score threshold is a set proportion of the highest score among all state scores, namely:
S_cand = {s | s ∈ S, score(s) > β · max_{s'∈S} score(s')}
where β is a proportion factor (value range 0-1), 0.85 in this embodiment.
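The candidate-state selection of step (1-6) can be sketched as follows; render(s) and detect_objects(s) are hypothetical simulator hooks returning the observation image I(s) and the detected object-category set O(s), and cap is the captioning model of step (1-2):

def candidate_states(states, render, detect_objects, cap, alpha=1.0, beta=0.85):
    """Score every state and keep those above beta * (best score).

    score(s) = alpha * |O(s)| + |W(s) ∩ O(s)|, where W(s) is the set of words
    produced by the captioning model on the observation image of s.
    """
    scores = {}
    for s in states:
        O_s = set(detect_objects(s))           # object categories visible in I(s)
        W_s = set(cap(render(s)))              # words of the generated caption
        scores[s] = alpha * len(O_s) + len(W_s & O_s)
    best = max(scores.values())
    return {s for s, sc in scores.items() if sc > beta * best}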
(1-7) constructing a navigation model;
The navigation model is composed of a convolutional neural network CNN (ResNet18 is used as the CNN in this embodiment), a long short-term memory model LSTM and a fully connected layer, where the convolutional neural network is connected to the input layer of each time step of the long short-term memory model and the fully connected layer is connected to the output layer of each time step of the long short-term memory model.
Let the initial parameters of the CNN be θ_1, the initial parameters of the LSTM be θ_2, the initial parameters of the fully connected layer be W_3, and the initial state of the LSTM hidden layer be h_0, c_0. The iterative process of the navigation model is:
h_{t+1}, c_{t+1} = LSTM(h_t, c_t, [CNN(I_{t+1}); a_t])
p(a_{t+1}) = Softmax(W_3 h_{t+1})
where a_t, I_{t+1} are the inputs of the t-th iteration of the navigation model, a_t is the action performed by the robot at step t, and I_{t+1} is the observation image obtained after performing action a_t. Before step 1, i.e. for t = 0, a_t is set to the special value a_t = a_start, where a_start denotes the start action. p(a_{t+1}) denotes the conditional probability of each action to be executed at iteration step t+1 and is the output of the t-th iteration of the navigation model. θ = [θ_1; θ_2; W_3; h_0; c_0] constitutes the current trainable parameter set of the navigation model. During each iteration, only the LSTM hidden-layer state changes; the other parameters in the current trainable parameter set are not changed.
The result of the iterative process up to step t is denoted Nav(h_0, c_0, I_1, I_2, ..., I_t, a_0, a_1, ..., a_t) = p(a_{t+1});
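One possible PyTorch rendering of this architecture (ResNet18 features concatenated with a one-hot encoding of the previous action, fed to an LSTM cell and a fully connected layer over the action space) is sketched below; the feature and hidden sizes, the one-hot action encoding and the use of an LSTMCell are assumptions, since the patent only fixes the CNN + LSTM + fully connected structure:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class NavigationModel(nn.Module):
    """CNN + LSTM + fully connected navigation policy (sketch of step (1-7))."""

    def __init__(self, num_actions, hidden_size=512):
        super().__init__()
        self.cnn = resnet18(weights=None)
        self.cnn.fc = nn.Identity()                    # 512-d image feature CNN(I)
        self.lstm = nn.LSTMCell(512 + num_actions, hidden_size)
        self.fc = nn.Linear(hidden_size, num_actions)  # W_3
        # learnable initial hidden state h_0, c_0 (part of the trainable set θ)
        self.h0 = nn.Parameter(torch.zeros(hidden_size))
        self.c0 = nn.Parameter(torch.zeros(hidden_size))
        self.num_actions = num_actions

    def init_state(self, batch_size):
        return (self.h0.expand(batch_size, -1).contiguous(),
                self.c0.expand(batch_size, -1).contiguous())

    def step(self, image, prev_action, state):
        """One iteration: h_{t+1}, c_{t+1} = LSTM(h_t, c_t, [CNN(I_{t+1}); a_t])."""
        h, c = state
        feat = self.cnn(image)                                       # (B, 512)
        a_onehot = F.one_hot(prev_action, self.num_actions).float()  # (B, |A|)
        h, c = self.lstm(torch.cat([feat, a_onehot], dim=1), (h, c))
        logits = self.fc(h)              # p(a_{t+1}) = softmax over these logits
        return logits, (h, c)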
(1-8) sampling the shortest path and training a navigation model to obtain a trained navigation model; the method comprises the following specific steps:
(1-8-1) set B as the size of the training batch (value range 16-256, 32 in this embodiment);
(1-8-2) randomly sample B states from S as the current initial state set s_{11}, s_{21}, ..., s_{B1};
(1-8-3) for each current initial state s_{i1} (1 ≤ i ≤ B), randomly sample one candidate state from the candidate state set S_cand obtained in step (1-6) as the termination candidate position s_{i,n(i)} (1 ≤ i ≤ B) of that initial state (the termination candidate position may coincide with the initial state, which does not affect the normal operation of the algorithm), where n(i) is the length of the state sequence traversed by the shortest path, computed according to step (1-5), connecting s_{i1} and s_{i,n(i)} (the state sequences of all shortest paths between the two states have equal length); randomly select one shortest path among all shortest paths connecting s_{i1} and s_{i,n(i)}, obtaining the sequence of nodes and edges corresponding to that path:
(a_start, I(s_{i1}), a_{i1}, I(s_{i2}), a_{i2}, ..., a_{i,n(i)-1}, I(s_{i,n(i)}), a_stop)
where a_{ij} is the action traversed when transitioning from state s_{ij} to the next state s_{i,j+1}, I(s_{ij}) is the observation image corresponding to state s_{ij}, a_start and a_stop are the special marks for the start and stop actions, respectively, and s_{i,n(i)} ∈ S_cand;
(1-8-4) repeat step (1-8-3) to obtain the sequences corresponding to all current initial states; calculate the loss function corresponding to the current initial state set, namely the average negative log-likelihood that the navigation model assigns to each ground-truth action along the sampled sequences (with teacher forcing), including the final stop action:
L(θ) = -(1/B) Σ_{i=1}^{B} Σ_{j=1}^{n(i)} log p(a_{ij}), with a_{i,n(i)} = a_stop
where p(a_{ij}) is the probability that the navigation model, fed with the observation images I(s_{i1}), ..., I(s_{ij}) and the actions a_start, a_{i1}, ..., a_{i,j-1}, assigns to the ground-truth action a_{ij};
(1-8-5) calculate the gradient ∇_θ L(θ) of the loss function with respect to the current trainable parameter set θ; optimize the navigation model parameters with a stochastic gradient descent optimizer according to the calculated gradient and update the current trainable parameter set (after one training step per batch, all parameters in the trainable parameter set θ are updated);
(1-8-6) repeat steps (1-8-2) to (1-8-5) until the navigation model converges, obtaining the trained navigation model and the final trainable parameter set θ' = [θ'_1; θ'_2; W'_3; h'_0; c'_0];
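Under the reconstruction above, one pass of steps (1-8-2) to (1-8-5) is teacher-forced behavior cloning on sampled shortest paths. The sketch below assumes the NavigationModel class from the previous listing; sample_batch is a hypothetical helper performing steps (1-8-2) and (1-8-3) and returning, for each sampled path, the list of observation-image tensors and the list of ground-truth action indices ending with the stop action:

import torch
import torch.nn.functional as F

def train_navigation(model, sample_batch, a_start, num_steps, lr=0.01):
    """Teacher-forced training loop for the navigation model (steps (1-8-2) to (1-8-6))."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)   # stochastic gradient descent
    for _ in range(num_steps):
        paths = sample_batch()          # list of (images, actions); actions end with a_stop
        opt.zero_grad()
        loss = 0.0
        for images, actions in paths:
            h, c = model.init_state(batch_size=1)
            prev_a = torch.tensor([a_start])
            for img, target in zip(images, actions):
                logits, (h, c) = model.step(img.unsqueeze(0), prev_a, (h, c))
                # negative log-likelihood of the ground-truth action at this step
                loss = loss + F.cross_entropy(logits, torch.tensor([target]))
                prev_a = torch.tensor([target])        # teacher forcing
        loss = loss / len(paths)
        loss.backward()
        opt.step()
    return model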
(2) a use stage; the method comprises the following specific steps:
(2-1) Place the physical robot at any position in the real physical scene as the initial position (the model of the physical robot should be as consistent as possible with the simulated robot), initialize the action â_0 = a_start, and obtain the initial observation image Î_1 of the robot at the initial position.
(2-2) Set the LSTM hidden-layer initial state of the navigation model trained in step (1) to ĥ_0 = h'_0, ĉ_0 = c'_0, and let the iteration step number i = 1.
(2-3) Input Î_i and â_{i-1}, together with the hidden state ĥ_{i-1}, ĉ_{i-1}, into the navigation model trained in step (1) to obtain the model output â_i, ĥ_i, ĉ_i, where Î_i is the observation image of the physical robot at step i (obtained by performing action â_{i-1}), â_i is the action to be performed by the physical robot at step i, selected according to the conditional probability distribution output by the navigation model, and ĥ_i, ĉ_i are the LSTM hidden-layer state parameters (intermediate state vectors) of the trained navigation model at step i.
(2-4) Judge â_i: if action â_i is the stop action a_stop, go to step (2-5); otherwise, the physical robot performs action â_i, obtains the updated observation image Î_{i+1} of step i+1, then lets i = i+1 and returns to step (2-3).
(2-5) Navigation ends; the final observation image Î_i is input into the static image description generation model Cap obtained in step (1-2), and the image description word sequence corresponding to that observation image is obtained as:
(w_1, w_2, ..., w_m) = Cap(Î_i)
which is the final scene description result.
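Putting the pieces together, the use stage is a closed loop: observe, let the navigation model choose an action, execute it, and caption the final view once the stop action is chosen. The sketch below reuses the NavigationModel and Cap objects introduced earlier; robot.observe() and robot.execute(action), the greedy argmax action choice and the max_steps safety cap are assumptions not specified in the patent:

import torch

def active_scene_description(robot, nav_model, cap, a_start, stop, max_steps=50):
    """Use-stage loop of step (2): navigate to a good viewpoint, then caption it."""
    nav_model.eval()
    image = robot.observe()                       # initial observation Î_1, a (3, H, W) tensor
    prev_action = torch.tensor([a_start])         # â_0 = a_start
    h, c = nav_model.init_state(batch_size=1)
    with torch.no_grad():
        for _ in range(max_steps):                # safety cap, not in the patent
            logits, (h, c) = nav_model.step(image.unsqueeze(0), prev_action, (h, c))
            action = int(logits.argmax(dim=1))    # greedy choice of â_i
            if action == stop:
                break                             # suitable viewpoint reached
            robot.execute(action)                 # physical robot performs â_i
            image = robot.observe()               # updated observation Î_{i+1}
            prev_action = torch.tensor([action])
    return cap(image)                             # (w_1, ..., w_m) = Cap(Î_i)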

Claims (1)

1. A robot-oriented active scene description method, which comprises a training phase and a use phase, and is characterized by comprising the following steps:
(1) a training stage; the method comprises the following specific steps:
(1-1) using an image caption generation dataset as the image description generation dataset, and labeling each image in the image description generation dataset with a text description to obtain a labeled image description generation dataset;
(1-2) selecting a static image description generation framework and training it with the image description generation dataset labeled in step (1-1) to obtain a trained static image description generation model Cap; the input of the model is any single image and the output is the image description word sequence corresponding to that image, expressed as:
(w_1^(I), w_2^(I), ..., w_{m_I}^(I)) = Cap(I)
where Cap is the static image description generation model, I is the input image, (w_1^(I), w_2^(I), ..., w_{m_I}^(I)) is the image description word sequence corresponding to the input image I, and w_{m_I}^(I) is the m_I-th word of the image description word sequence corresponding to the input image I;
(1-3) selecting a simulation environment;
(1-4) constructing a state set S to obtain a scene connection graph G = <S, E>; the method comprises the following specific steps:
(1-4-1) from the simulation environment, selecting all actions supported by the physical robot in the use stage to form an action space A;
(1-4-2) initializing the state set S = {s_0}, the edge set E = ∅, and the search queue S_Q = {s_0}, where s_0 is any reachable state of the simulated robot in the simulation scene;
(1-4-3) judging S_Q: if S_Q is not empty, executing step (1-4-4); otherwise, executing step (1-4-6);
(1-4-4) letting the current state S_C = S_Q.dequeue(), where dequeue denotes the dequeue operation of a queue data structure;
(1-4-5) for each a ∈ A, where a denotes any action in the action space: if the simulated robot is allowed to execute action a in state S_C, adding to the edge set E the edge that goes from S_C, through the executed action a, to the state Exec(S_C, a), where Exec(S_C, a) denotes the new state obtained when the simulated robot executes action a from state S_C;
then judging every Exec(S_C, a):
if Exec(S_C, a) ∉ S, executing S_Q.enqueue(Exec(S_C, a)), i.e. adding the state Exec(S_C, a) to the search queue S_Q, adding Exec(S_C, a) to the state set S, and then returning to step (1-4-3); otherwise, performing no operation and returning to step (1-4-3); where enqueue denotes the enqueue operation of a queue data structure;
(1-4-6) after the state set S is constructed, finally obtaining the scene connection graph G = <S, E>, where S is the state set, each state in S corresponds to one node of the scene connection graph, E is the set of edges connecting any two states in the state set, and each edge corresponds to the robot action connecting the two states;
(1-5) calculating all shortest paths between any two nodes of the scene connection graph G; each shortest path corresponds to a state sequence connecting the two nodes, and adjacent states in the state sequence are connected by the action corresponding to the edge in the edge set E that joins those two adjacent states;
(1-6) obtaining a candidate state set;
for each state s ∈ S in the state set S obtained in step (1-4), obtaining the observation image I(s) corresponding to state s in the simulation scene and the object detection result set O(s) corresponding to I(s), and using the static image description generation model Cap trained in step (1-2) to obtain the word set W(s) = {Cap(I(s))} corresponding to the state, where {q} denotes the set formed by the non-repeated elements of a sequence q; then calculating the score corresponding to state s, expressed as:
score(s) = α|O(s)| + |W(s) ∩ O(s)|
where α is a trade-off factor;
after the scores of all states in the state set S have been calculated, the states whose scores are higher than a set score threshold form the candidate state set S_cand, where the score threshold is a set proportion of the highest score among all state scores, namely:
S_cand = {s | s ∈ S, score(s) > β · max_{s'∈S} score(s')}
where β is a proportion factor;
(1-7) constructing a navigation model;
the navigation model is composed of a convolutional neural network CNN, a long short-term memory model LSTM and a fully connected layer, where the convolutional neural network is connected to the input layer of each time step of the long short-term memory model and the fully connected layer is connected to the output layer of each time step of the long short-term memory model;
let the initial parameters of the CNN be θ_1, the initial parameters of the LSTM be θ_2, the initial parameters of the fully connected layer be W_3, and the initial state of the LSTM hidden layer be h_0, c_0; then the iterative process of the navigation model is:
h_{t+1}, c_{t+1} = LSTM(h_t, c_t, [CNN(I_{t+1}); a_t])
p(a_{t+1}) = Softmax(W_3 h_{t+1})
where a_t, I_{t+1} are the inputs of the t-th iteration of the navigation model, a_t is the action performed by the robot at step t, and I_{t+1} is the observation image obtained after performing action a_t; for t = 0, set a_t = a_start, where a_start denotes the start action; p(a_{t+1}) denotes the conditional probability of each action to be executed at iteration step t+1 and is the output of the navigation model; θ = [θ_1; θ_2; W_3; h_0; c_0] constitutes the current trainable parameter set of the navigation model;
the result of the iterative process up to step t is denoted Nav(h_0, c_0, I_1, I_2, ..., I_t, a_0, a_1, ..., a_t) = p(a_{t+1});
(1-8) sampling the shortest path and training a navigation model to obtain a trained navigation model; the method comprises the following specific steps:
(1-8-1) setting B as the size of a training batch;
(1-8-2) randomly sampling B states from S as the current initial state set s_{11}, s_{21}, ..., s_{B1};
(1-8-3) for each current initial state s_{i1}, 1 ≤ i ≤ B, randomly sampling one candidate state from the candidate state set S_cand obtained in step (1-6) as the termination candidate position s_{i,n(i)} of that initial state, 1 ≤ i ≤ B, where n(i) is the length of the state sequence traversed by the shortest path connecting s_{i1} and s_{i,n(i)}; randomly selecting one shortest path among all shortest paths connecting s_{i1} and s_{i,n(i)}, obtaining the sequence of nodes and edges corresponding to that path:
(a_start, I(s_{i1}), a_{i1}, I(s_{i2}), a_{i2}, ..., a_{i,n(i)-1}, I(s_{i,n(i)}), a_stop)
where a_{ij} is the action traversed when transitioning from state s_{ij} to the next state s_{i,j+1}, I(s_{ij}) is the observation image corresponding to state s_{ij}, and a_start and a_stop are the start action and stop action, respectively;
(1-8-4) repeating step (1-8-3) to obtain the sequences corresponding to all current initial states; calculating the loss function corresponding to the current initial state set, namely the average negative log-likelihood that the navigation model assigns to each ground-truth action along the sampled sequences (with teacher forcing), including the final stop action:
L(θ) = -(1/B) Σ_{i=1}^{B} Σ_{j=1}^{n(i)} log p(a_{ij}), with a_{i,n(i)} = a_stop
where p(a_{ij}) is the probability that the navigation model, fed with the observation images I(s_{i1}), ..., I(s_{ij}) and the actions a_start, a_{i1}, ..., a_{i,j-1}, assigns to the ground-truth action a_{ij};
(1-8-5) calculating the gradient ∇_θ L(θ) of the loss function with respect to the current trainable parameter set θ; optimizing the navigation model parameters with a stochastic gradient descent optimizer according to the calculated gradient and updating the current trainable parameter set;
(1-8-6) repeating steps (1-8-2) to (1-8-5) until the navigation model converges, obtaining the trained navigation model and the final trainable parameter set θ' = [θ'_1; θ'_2; W'_3; h'_0; c'_0];
(2) a use stage; the method comprises the following specific steps:
(2-1) placing the physical robot at any position in the real physical scene as the initial position, initializing the action â_0 = a_start, and obtaining the corresponding initial observation image Î_1 of the physical robot;
(2-2) setting the LSTM hidden-layer initial state of the navigation model trained in step (1) to ĥ_0 = h'_0, ĉ_0 = c'_0, and letting the iteration step number i = 1;
(2-3) inputting Î_i and â_{i-1}, together with the hidden state ĥ_{i-1}, ĉ_{i-1}, into the navigation model trained in step (1) to obtain the model output â_i, ĥ_i, ĉ_i, where Î_i is the observation image of the physical robot at step i, â_i is the action to be performed by the physical robot at step i, selected according to the conditional probability distribution output by the navigation model, and ĥ_i, ĉ_i are the LSTM hidden-layer state parameters of the trained navigation model at step i;
(2-4) judging â_i: if action â_i is the stop action a_stop, going to step (2-5); otherwise, the physical robot performs action â_i, obtains the updated observation image Î_{i+1} of step i+1, then lets i = i+1 and returns to step (2-3);
(2-5) navigation ends; inputting the final observation image Î_i into the static image description generation model Cap obtained in step (1-2), obtaining the image description word sequence corresponding to that observation image:
(w_1, w_2, ..., w_m) = Cap(Î_i)
which is the final scene description result.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010287188.8A CN111611373B (en) 2020-04-13 2020-04-13 Robot-oriented specific active scene description method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010287188.8A CN111611373B (en) 2020-04-13 2020-04-13 Robot-oriented specific active scene description method

Publications (2)

Publication Number Publication Date
CN111611373A true CN111611373A (en) 2020-09-01
CN111611373B CN111611373B (en) 2021-09-10

Family

ID=72197820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010287188.8A Active CN111611373B (en) 2020-04-13 2020-04-13 Robot-oriented specific active scene description method

Country Status (1)

Country Link
CN (1) CN111611373B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111192A (en) * 2021-04-28 2021-07-13 清华大学 Method, equipment and exploration method for intelligent agent to actively construct environment scene map

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107065881A (en) * 2017-05-17 2017-08-18 清华大学 A kind of robot global path planning method learnt based on deeply
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN108875807A (en) * 2018-05-31 2018-11-23 陕西师范大学 A kind of Image Description Methods multiple dimensioned based on more attentions
CN109029444A (en) * 2018-06-12 2018-12-18 深圳职业技术学院 One kind is based on images match and sterically defined indoor navigation system and air navigation aid

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN107065881A (en) * 2017-05-17 2017-08-18 清华大学 A kind of robot global path planning method learnt based on deeply
CN108875807A (en) * 2018-05-31 2018-11-23 陕西师范大学 A kind of Image Description Methods multiple dimensioned based on more attentions
CN109029444A (en) * 2018-06-12 2018-12-18 深圳职业技术学院 One kind is based on images match and sterically defined indoor navigation system and air navigation aid

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111192A (en) * 2021-04-28 2021-07-13 清华大学 Method, equipment and exploration method for intelligent agent to actively construct environment scene map
CN113111192B (en) * 2021-04-28 2022-03-29 清华大学 Method, equipment and exploration method for intelligent agent to actively construct environment scene map

Also Published As

Publication number Publication date
CN111611373B (en) 2021-09-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant