CN111611373A - Robot-oriented specific active scene description method


Info

Publication number: CN111611373A
Authority: CN (China)
Prior art keywords: state, image, robot, model, action
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202010287188.8A
Other languages: Chinese (zh)
Other versions: CN111611373B
Inventors: 刘华平, 谭思楠, 郭迪, 孙富春
Current Assignee: Tsinghua University
Original Assignee: Tsinghua University
Priority date / Filing date: 2020-04-13
Publication date: 2020-09-01 (CN111611373A); 2021-09-10 (CN111611373B, grant)
Application filed by Tsinghua University
Current legal status: Active (granted)


Classifications

    • G - PHYSICS
        • G06 - COMPUTING; CALCULATING OR COUNTING
            • G06F - ELECTRIC DIGITAL DATA PROCESSING
                • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
                        • G06F16/34 - Browsing; Visualisation therefor
                • G06F18/00 - Pattern recognition
                    • G06F18/20 - Analysing
                        • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                            • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
            • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 - Computing arrangements based on biological models
                    • G06N3/02 - Neural networks
                        • G06N3/04 - Architecture, e.g. interconnection topology
                            • G06N3/044 - Recurrent networks, e.g. Hopfield networks
                            • G06N3/045 - Combinations of networks
                        • G06N3/08 - Learning methods

Abstract

The invention relates to a robot-oriented specific active scene description method and belongs to the technical field of image processing. The invention combines a static image description generation model with a navigation model of the robot: when the robot starts in an initial scene with a poor viewing angle, the trained navigation model provides an effective action sequence that lets the robot adjust its viewing angle, and once a suitable final viewing angle is found, the trained static image description generation model is called to obtain the final scene description result. The invention overcomes the defect that traditional image description generation models cannot handle such specific scenes, can generate more accurate and comprehensive image descriptions in three-dimensional specific scenes, and can be used in fields such as service robots, security monitoring and accessible human-computer interaction.

Description

Robot-oriented specific active scene description method
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a robot-oriented specific active scene description method.
Background
Image description generation refers to the technology of generating a corresponding text description for a given image. It allows a computer to provide human-readable text descriptions of natural images and has wide and important applications in security monitoring, image retrieval, web image processing and accessible human-computer interaction.
In recent years, automatic description generation for single static pictures has made great progress. However, current image description generation models are only applicable to static pictures and are of little use in interactive scenes. For example, a robot may be facing a wall in a room; no matter which image description generation model is used, the resulting description is meaningless for the indoor scene as a whole. Yet if the robot could turn 180 degrees, it might see a completely different scene.
Current image description generation methods are mainly based on a convolutional neural network and an LSTM (long short-term memory) language model. The typical approach extracts image features with a convolutional neural network and then uses the LSTM model to generate the final text recursively, word by word. When generating text, the LSTM model may use an attention mechanism to assign weights to the regional features at different positions of the image and take their weighted average, thereby improving the generation quality. Some models additionally use pre-extracted keywords and a semantic attention mechanism to fuse the information of these keywords and further improve generation. However, all of these models can only generate text for a static picture, because they accept only a single picture as input. None of these generative models can be used when the viewpoint of the picture is not appropriate (e.g., facing a wall or a window in an indoor scene).
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a robot-oriented specific active scene description method. The method combines a natural language model with a navigation model of the robot: when the robot is in an initial scene with a poor viewing angle, the navigation model provides an effective action sequence that lets the robot adjust its viewing angle, and once a suitable final viewing angle is found, the text generation model for static pictures is called to obtain a final, meaningful scene description result. The invention overcomes the defect that traditional image description generation models cannot handle such specific scenes, can generate more accurate and comprehensive image descriptions in three-dimensional specific scenes, and can be used in fields such as service robots, security monitoring and accessible human-computer interaction.
The invention provides a robot-oriented specific active scene description method, which is divided into a training stage and a use stage and is characterized by comprising the following steps:
(1) a training stage; the method comprises the following specific steps:
(1-1) using an image caption generation dataset as the image description generation dataset, and labeling each image in the image description generation dataset with a text description to obtain a labeled image description generation dataset;
(1-2) selecting a static image description generation framework and training it with the image description generation dataset labeled in step (1-1) to obtain a trained static image description generation model Cap; the input of the model is any single image and the output is the image description word sequence corresponding to that image, expressed as:
(w_1^(I), w_2^(I), ..., w_{m_I}^(I)) = Cap(I)
where Cap is the static image description generation model, I is the input image, (w_1^(I), w_2^(I), ..., w_{m_I}^(I)) is the image description word sequence corresponding to the input image I, and w_{m_I}^(I) is the m_I-th word of the image description word sequence corresponding to the input image I;
(1-3) selecting a simulation environment;
(1-4) constructing a state set S to obtain a scene connection graph G = <S, E>; the method comprises the following specific steps:
(1-4-1) from the simulation environment, selecting all actions supported by the physical robot in the use stage to form an action space A;
(1-4-2) initializing the state set S = {s_0}, the edge set E = ∅, and the search queue S_Q = {s_0}, where s_0 is any reachable state of the simulated robot in the simulation scene;
(1-4-3) judging S_Q: if S_Q is not empty, executing step (1-4-4); otherwise, executing step (1-4-6);
(1-4-4) letting the current state S_C = S_Q.dequeue(), where dequeue denotes the dequeue operation of a queue data structure;
(1-4-5) for each a ∈ A, where a denotes any action in the action space: if the simulated robot is allowed to execute action a in state S_C, adding to the edge set E the edge that goes from S_C, through the executed action a, to the state Exec(S_C, a), where Exec(S_C, a) denotes the new state obtained when the simulated robot executes action a from state S_C;
then judging every Exec(S_C, a):
if Exec(S_C, a) ∉ S, executing S_Q.enqueue(Exec(S_C, a)), i.e. adding the state Exec(S_C, a) to the search queue S_Q, adding Exec(S_C, a) to the state set S, and then returning to step (1-4-3); otherwise, performing no operation and returning to step (1-4-3); where enqueue denotes the enqueue operation of a queue data structure;
(1-4-6) after the state set S is constructed, finally obtaining the scene connection graph G = <S, E>, where S is the state set, each state in S corresponds to one node of the scene connection graph, E is the set of edges connecting any two states in the state set, and each edge corresponds to the robot action connecting the two states;
(1-5) calculating all shortest paths between any two nodes of the scene connection graph G; each shortest path corresponds to a state sequence connecting the two nodes, and adjacent states in the state sequence are connected by the action corresponding to the edge in the edge set E that joins those two adjacent states;
(1-6) obtaining a candidate state set;
for each state s ∈ S in the state set S obtained in step (1-4), obtaining the observation image I(s) corresponding to state s in the simulation scene and the object detection result set O(s) corresponding to I(s), and using the static image description generation model Cap trained in step (1-2) to obtain the word set W(s) = {Cap(I(s))} corresponding to the state, where {q} denotes the set formed by the non-repeated elements of a sequence q; then calculating the score corresponding to state s, expressed as:
score(s) = α|O(s)| + |W(s) ∩ O(s)|
where α is a trade-off factor;
after the scores of all states in the state set S have been calculated, the states whose scores are higher than a set score threshold form the candidate state set S_cand, where the score threshold is a set proportion of the highest score among all state scores, namely:
S_cand = {s | s ∈ S, score(s) > β · max_{s'∈S} score(s')}
where β is a proportion factor;
(1-7) constructing a navigation model;
the navigation model is composed of a convolutional neural network CNN, a long short-term memory model LSTM and a fully connected layer, where the convolutional neural network is connected to the input layer of each time step of the long short-term memory model and the fully connected layer is connected to the output layer of each time step of the long short-term memory model;
let the initial parameters of the CNN be θ_1, the initial parameters of the LSTM be θ_2, the initial parameters of the fully connected layer be W_3, and the initial state of the LSTM hidden layer be h_0, c_0; then the iterative process of the navigation model is:
h_{t+1}, c_{t+1} = LSTM(h_t, c_t, [CNN(I_{t+1}); a_t])
p(a_{t+1}) = Softmax(W_3 h_{t+1})
where a_t, I_{t+1} are the inputs of the t-th iteration of the navigation model, a_t is the action performed by the robot at step t, and I_{t+1} is the observation image obtained after performing action a_t; for t = 0, set a_t = a_start, where a_start denotes the start action; p(a_{t+1}) denotes the conditional probability of each action to be executed at iteration step t+1 and is the output of the navigation model; θ = [θ_1; θ_2; W_3; h_0; c_0] constitutes the current trainable parameter set of the navigation model;
the result of the iterative process up to step t is denoted Nav(h_0, c_0, I_1, I_2, ..., I_t, a_0, a_1, ..., a_t) = p(a_{t+1});
(1-8) sampling the shortest path and training a navigation model to obtain a trained navigation model; the method comprises the following specific steps:
(1-8-1) setting B as the size of a training batch;
(1-8-2) randomly sampling B states from S as the current initial state set s_{11}, s_{21}, ..., s_{B1};
(1-8-3) for each current initial state s_{i1}, 1 ≤ i ≤ B, randomly sampling one candidate state from the candidate state set S_cand obtained in step (1-6) as the termination candidate position s_{i,n(i)} of that initial state, 1 ≤ i ≤ B, where n(i) is the length of the state sequence traversed by the shortest path connecting s_{i1} and s_{i,n(i)}; randomly selecting one shortest path among all shortest paths connecting s_{i1} and s_{i,n(i)}, obtaining the sequence of nodes and edges corresponding to that path:
(a_start, I(s_{i1}), a_{i1}, I(s_{i2}), a_{i2}, ..., a_{i,n(i)-1}, I(s_{i,n(i)}), a_stop)
where a_{ij} is the action traversed when transitioning from state s_{ij} to the next state s_{i,j+1}, I(s_{ij}) is the observation image corresponding to state s_{ij}, and a_start and a_stop are the start action and stop action, respectively;
(1-8-4) repeating step (1-8-3) to obtain the sequences corresponding to all current initial states; calculating the loss function corresponding to the current initial state set, namely the average negative log-likelihood that the navigation model assigns to each ground-truth action along the sampled sequences (with teacher forcing), including the final stop action:
L(θ) = -(1/B) Σ_{i=1}^{B} Σ_{j=1}^{n(i)} log p(a_{ij}), with a_{i,n(i)} = a_stop
where p(a_{ij}) is the probability that the navigation model, fed with the observation images I(s_{i1}), ..., I(s_{ij}) and the actions a_start, a_{i1}, ..., a_{i,j-1}, assigns to the ground-truth action a_{ij};
(1-8-5) calculating the gradient ∇_θ L(θ) of the loss function with respect to the current trainable parameter set θ; optimizing the navigation model parameters with a stochastic gradient descent optimizer according to the calculated gradient and updating the current trainable parameter set;
(1-8-6) repeating steps (1-8-2) to (1-8-5) until the navigation model converges, obtaining the trained navigation model and the final trainable parameter set θ' = [θ'_1; θ'_2; W'_3; h'_0; c'_0];
(2) a use stage; the method comprises the following specific steps:
(2-1) placing the physical robot at any position in the real physical scene as the initial position, initializing the action â_0 = a_start, and obtaining the corresponding initial observation image Î_1 of the physical robot;
(2-2) setting the LSTM hidden-layer initial state of the navigation model trained in step (1) to ĥ_0 = h'_0, ĉ_0 = c'_0, and letting the iteration step number i = 1;
(2-3) inputting Î_i and â_{i-1}, together with the hidden state ĥ_{i-1}, ĉ_{i-1}, into the navigation model trained in step (1) to obtain the model output â_i, ĥ_i, ĉ_i, where Î_i is the observation image of the physical robot at step i, â_i is the action to be performed by the physical robot at step i, selected according to the conditional probability distribution output by the navigation model, and ĥ_i, ĉ_i are the LSTM hidden-layer state parameters of the trained navigation model at step i;
(2-4) judging â_i: if action â_i is the stop action a_stop, going to step (2-5); otherwise, the physical robot performs action â_i, obtains the updated observation image Î_{i+1} of step i+1, then lets i = i+1 and returns to step (2-3);
(2-5) navigation ends; inputting the final observation image Î_i into the static image description generation model Cap obtained in step (1-2), obtaining the image description word sequence corresponding to that observation image:
(w_1, w_2, ..., w_m) = Cap(Î_i)
which is the final scene description result.
the invention has the characteristics and beneficial effects that:
the invention constructs a robot-oriented specific active scene description method by utilizing a navigation model and an image description generation model based on deep learning. The method overcomes the defect that a static description generation model cannot deal with the condition that the robot is in a poor visual angle, can guide the robot to navigate and adjust the visual angle of the robot, and then generates the image description after the robot reaches the proper visual angle in the scene.
The invention can be used in the field of robots, allowing a robot to autonomously navigate to a suitable location within an area, explore the environment and generate the required natural language description. The method can be used in the fields of service robots, security monitoring, barrier-free man-machine interaction and the like, and can generate more accurate and comprehensive image description in a three-dimensional specific scene.
Detailed Description
The invention provides a robot-oriented active scene description method, which is further described in detail below with reference to specific embodiments.
The invention provides a robot-oriented specific active scene description method. The method combines a static image description generation model with a navigation model of the robot: when the robot is in an initial scene with a poor viewing angle, the navigation model provides an effective action sequence that lets the robot adjust its viewing angle, and once a suitable final viewing angle is found, the static image description generation model is called to obtain the final scene description result.
The invention provides a robot-oriented active scene description method, which comprises a training stage and a use stage, and comprises the following steps:
(1) a training stage; the method comprises the following specific steps:
(1-1) An image caption generation dataset is used as the image description generation dataset (if a public dataset is used, the MSCOCO dataset may be chosen). The image description generation dataset needs to contain a large number of images (generally 10,000 to 1,000,000; 100,000 in this embodiment), and each image in the dataset is manually labeled to obtain the labeled image description generation dataset.
The label content of each image is one or more text descriptions corresponding to that image (the more text descriptions per image the better; the text descriptions of the same image should differ in content; the number of texts per image need not be the same, generally 1-20, and in this embodiment each image has 5). The style of the text descriptions should be as uniform as possible, and rarely used words should be avoided. Due to the limitations of the model, the text descriptions should not be too long (in this embodiment each text description generally does not exceed 25 words).
(1-2) A static image description generation framework is selected (e.g. ImageCaptioning.pytorch, which is used in this embodiment) and trained with the image description generation dataset labeled in step (1-1) to obtain the trained static image description generation model Cap; the input of the model is any single image and the output is the image description word sequence corresponding to the input image, expressed as:
(w_1^(I), w_2^(I), ..., w_{m_I}^(I)) = Cap(I)
where Cap is the static image description generation model, I is the input image, (w_1^(I), w_2^(I), ..., w_{m_I}^(I)) is the image description word sequence corresponding to the input image I, and w_{m_I}^(I) is the m_I-th word of that word sequence. The number of words in the output image description corresponding to each input image is not necessarily equal.
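For illustration only, the interface of such a trained captioning model can be wrapped as follows; the class and the backbone.caption method are hypothetical stand-ins for whatever framework is actually trained in step (1-2), not part of the patent:

from typing import List

class Cap:
    """Hypothetical wrapper around a trained static image captioning model.

    `backbone` is assumed to expose a caption(image) -> str method; the
    concrete framework (an encoder-decoder with attention, for example)
    is trained beforehand on the labeled image description dataset.
    """

    def __init__(self, backbone):
        self.backbone = backbone

    def __call__(self, image) -> List[str]:
        # Generate a sentence for the image and return it as a word sequence
        # (w_1, w_2, ..., w_m), mirroring the notation Cap(I) in the text.
        sentence = self.backbone.caption(image)
        return sentence.strip().split()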
(1-3) A simulation environment is selected (it should be as close as possible to the real physical environment in which the robot is deployed; if an open-source simulation environment is used, AI2Thor may be chosen, and AI2Thor is used in this embodiment). The simulation environment supports virtual scenes and a simulated robot, the simulated robot can be operated programmatically, and various actions of the simulated robot are supported (the simulated robot actions supported by AI2Thor in this embodiment are turning left, turning right, moving forward, obtaining coordinates and teleporting).
(1-4) The state set S is constructed to obtain the scene connection graph G = <S, E>; the method comprises the following specific steps:
(1-4-1) from the simulation environment, select all actions that the physical robot (i.e. the robot that actually executes the actions in step (2)) needs to support, forming the action space A (in this embodiment, the 5 actions forward, backward, turn left, turn right and stop in AI2Thor form the action space);
(1-4-2) initialize the state set S = {s_0}, the edge set E = ∅, and the search queue S_Q = {s_0}, where s_0 is any reachable state of the simulated robot in the simulation scene (a state comprises the position and posture of the robot);
(1-4-3) judge S_Q: if S_Q is not empty, execute step (1-4-4); otherwise, execute step (1-4-6);
(1-4-4) let the current state S_C = S_Q.dequeue(), where dequeue denotes the dequeue operation of a queue data structure;
(1-4-5) for each a ∈ A, where a denotes any action in the action space: if the simulated robot is allowed to execute action a in state S_C, add to the edge set E the edge that goes from S_C, through the executed action a, to the state Exec(S_C, a), where Exec(S_C, a) denotes the new state obtained when the simulated robot executes action a from state S_C;
then judge every Exec(S_C, a):
if Exec(S_C, a) ∉ S, execute S_Q.enqueue(Exec(S_C, a)), i.e. add the new state Exec(S_C, a) to the search queue S_Q, add Exec(S_C, a) to the state set S, and then return to step (1-4-3); otherwise, perform no operation and return to step (1-4-3); where enqueue denotes the enqueue operation of a queue data structure;
(1-4-6) after the state set S is constructed, the discretized scene connection graph G = <S, E> is finally obtained, where S is the state set (each state corresponds to one node of the scene connection graph) and E is the set of edges connecting any two states in the state set (each edge corresponds to the robot action connecting the two states).
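Steps (1-4-1) to (1-4-6) amount to a breadth-first search over the states reachable by the simulated robot. A minimal Python sketch is given below; allowed(state, action) and exec_action(state, action) are hypothetical hooks into the simulator (e.g. AI2Thor), and states are assumed to be hashable (position plus posture):

from collections import deque

def build_scene_graph(s0, actions, allowed, exec_action):
    """Breadth-first construction of the scene connection graph G = <S, E>.

    s0          -- any reachable initial state of the simulated robot
    actions     -- the action space A supported by the physical robot
    allowed     -- allowed(state, action) -> bool, whether the simulator
                   permits executing `action` in `state` (hypothetical hook)
    exec_action -- exec_action(state, action) -> Exec(state, action),
                   the resulting new state (hypothetical hook)
    """
    S = {s0}                 # state set (nodes)
    E = []                   # edge set: (state, action, next_state)
    queue = deque([s0])      # search queue S_Q
    while queue:             # step (1-4-3)
        s_c = queue.popleft()            # step (1-4-4): dequeue current state
        for a in actions:                # step (1-4-5)
            if not allowed(s_c, a):
                continue
            s_next = exec_action(s_c, a)
            E.append((s_c, a, s_next))   # edge S_C --a--> Exec(S_C, a)
            if s_next not in S:          # unseen state: record and enqueue
                S.add(s_next)
                queue.append(s_next)
    return S, E              # step (1-4-6): scene connection graph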
(1-5) All shortest paths between any two nodes of the scene connection graph G are calculated using the Floyd-Warshall algorithm. The states in the state set S traversed by each shortest path between two nodes form a state sequence, and adjacent states in the state sequence are connected by the action corresponding to the edge in the edge set E that joins those two adjacent states.
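A compact sketch of this all-pairs computation is shown below, assuming every edge costs one robot action. In addition to the Floyd-Warshall distance table it shows how one of the possibly many shortest paths between two states can be sampled at random, as required later in step (1-8-3); the function names are illustrative only:

import random
from math import inf

def all_pairs_shortest(nodes, edges):
    """Floyd-Warshall over the scene connection graph; every edge costs 1 action."""
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for u, _, v in edges:
        dist[u][v] = min(dist[u][v], 1)
    for k in nodes:                      # classic O(|S|^3) triple loop
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def sample_shortest_path(src, dst, edges, dist):
    """Sample one of the shortest src -> dst paths (assumes dst is reachable).

    Returns the traversed states and actions, matching the training
    sequences sampled in step (1-8-3).
    """
    out = {}                             # adjacency: state -> [(action, next_state)]
    for u, a, v in edges:
        out.setdefault(u, []).append((a, v))
    states, actions, cur = [src], [], src
    while cur != dst:
        # any neighbour that stays on some shortest path is a valid next hop
        options = [(a, v) for a, v in out.get(cur, [])
                   if dist[cur][dst] == 1 + dist[v][dst]]
        a, cur = random.choice(options)
        actions.append(a)
        states.append(cur)
    return states, actions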
(1-6) obtaining a candidate state set;
For each state s ∈ S in the state set S obtained in step (1-4) (each state generally comprises the position and posture of the robot), the simulation software is used to obtain the observation image I(s) corresponding to state s in the simulation scene and the object detection result set O(s) corresponding to I(s), and the static image description generation model Cap trained in step (1-2) is used to obtain the word set W(s) = {Cap(I(s))} corresponding to the state (where {q} denotes the set formed by the non-repeated elements of a sequence q). The score corresponding to state s is then calculated as:
score(s) = α|O(s)| + |W(s) ∩ O(s)|
where α is a trade-off factor (typically 0.5-5.0), 1.0 in this embodiment.
After the scores of all states in the state set S have been calculated, the states whose scores are higher than a set score threshold form the candidate state set S_cand, where the score threshold is a set proportion of the highest score among all state scores, namely:
S_cand = {s | s ∈ S, score(s) > β · max_{s'∈S} score(s')}
where β is a proportion factor (value range 0-1), 0.85 in this embodiment.
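The candidate-state selection of step (1-6) can be sketched as follows; render(s) and detect_objects(s) are hypothetical simulator hooks returning the observation image I(s) and the detected object-category set O(s), and cap is the captioning model of step (1-2):

def candidate_states(states, render, detect_objects, cap, alpha=1.0, beta=0.85):
    """Score every state and keep those above beta * (best score).

    score(s) = alpha * |O(s)| + |W(s) ∩ O(s)|, where W(s) is the set of words
    produced by the captioning model on the observation image of s.
    """
    scores = {}
    for s in states:
        O_s = set(detect_objects(s))           # object categories visible in I(s)
        W_s = set(cap(render(s)))              # words of the generated caption
        scores[s] = alpha * len(O_s) + len(W_s & O_s)
    best = max(scores.values())
    return {s for s, sc in scores.items() if sc > beta * best}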
(1-7) constructing a navigation model;
The navigation model is composed of a convolutional neural network CNN (ResNet18 is used as the CNN in this embodiment), a long short-term memory model LSTM and a fully connected layer, where the convolutional neural network is connected to the input layer of each time step of the long short-term memory model and the fully connected layer is connected to the output layer of each time step of the long short-term memory model.
Let the initial parameters of the CNN be θ_1, the initial parameters of the LSTM be θ_2, the initial parameters of the fully connected layer be W_3, and the initial state of the LSTM hidden layer be h_0, c_0. The iterative process of the navigation model is:
h_{t+1}, c_{t+1} = LSTM(h_t, c_t, [CNN(I_{t+1}); a_t])
p(a_{t+1}) = Softmax(W_3 h_{t+1})
where a_t, I_{t+1} are the inputs of the t-th iteration of the navigation model, a_t is the action performed by the robot at step t, and I_{t+1} is the observation image obtained after performing action a_t. Before step 1, i.e. for t = 0, a_t is set to the special value a_t = a_start, where a_start denotes the start action. p(a_{t+1}) denotes the conditional probability of each action to be executed at iteration step t+1 and is the output of the t-th iteration of the navigation model. θ = [θ_1; θ_2; W_3; h_0; c_0] constitutes the current trainable parameter set of the navigation model. During each iteration, only the LSTM hidden-layer state changes; the other parameters in the current trainable parameter set are not changed.
The result of the iterative process up to step t is denoted Nav(h_0, c_0, I_1, I_2, ..., I_t, a_0, a_1, ..., a_t) = p(a_{t+1});
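One possible PyTorch rendering of this architecture (ResNet18 features concatenated with a one-hot encoding of the previous action, fed to an LSTM cell and a fully connected layer over the action space) is sketched below; the feature and hidden sizes, the one-hot action encoding and the use of an LSTMCell are assumptions, since the patent only fixes the CNN + LSTM + fully connected structure:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class NavigationModel(nn.Module):
    """CNN + LSTM + fully connected navigation policy (sketch of step (1-7))."""

    def __init__(self, num_actions, hidden_size=512):
        super().__init__()
        self.cnn = resnet18(weights=None)
        self.cnn.fc = nn.Identity()                    # 512-d image feature CNN(I)
        self.lstm = nn.LSTMCell(512 + num_actions, hidden_size)
        self.fc = nn.Linear(hidden_size, num_actions)  # W_3
        # learnable initial hidden state h_0, c_0 (part of the trainable set θ)
        self.h0 = nn.Parameter(torch.zeros(hidden_size))
        self.c0 = nn.Parameter(torch.zeros(hidden_size))
        self.num_actions = num_actions

    def init_state(self, batch_size):
        return (self.h0.expand(batch_size, -1).contiguous(),
                self.c0.expand(batch_size, -1).contiguous())

    def step(self, image, prev_action, state):
        """One iteration: h_{t+1}, c_{t+1} = LSTM(h_t, c_t, [CNN(I_{t+1}); a_t])."""
        h, c = state
        feat = self.cnn(image)                                       # (B, 512)
        a_onehot = F.one_hot(prev_action, self.num_actions).float()  # (B, |A|)
        h, c = self.lstm(torch.cat([feat, a_onehot], dim=1), (h, c))
        logits = self.fc(h)              # p(a_{t+1}) = softmax over these logits
        return logits, (h, c)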
(1-8) sampling the shortest path and training a navigation model to obtain a trained navigation model; the method comprises the following specific steps:
(1-8-1) set B as the size of the training batch (value range 16-256, 32 in this embodiment);
(1-8-2) randomly sample B states from S as the current initial state set s_{11}, s_{21}, ..., s_{B1};
(1-8-3) for each current initial state s_{i1} (1 ≤ i ≤ B), randomly sample one candidate state from the candidate state set S_cand obtained in step (1-6) as the termination candidate position s_{i,n(i)} (1 ≤ i ≤ B) of that initial state (the termination candidate position may coincide with the initial state, which does not affect the normal operation of the algorithm), where n(i) is the length of the state sequence traversed by the shortest path, computed according to step (1-5), connecting s_{i1} and s_{i,n(i)} (the state sequences of all shortest paths between the two states have equal length); randomly select one shortest path among all shortest paths connecting s_{i1} and s_{i,n(i)}, obtaining the sequence of nodes and edges corresponding to that path:
(a_start, I(s_{i1}), a_{i1}, I(s_{i2}), a_{i2}, ..., a_{i,n(i)-1}, I(s_{i,n(i)}), a_stop)
where a_{ij} is the action traversed when transitioning from state s_{ij} to the next state s_{i,j+1}, I(s_{ij}) is the observation image corresponding to state s_{ij}, a_start and a_stop are the special marks for the start and stop actions, respectively, and s_{i,n(i)} ∈ S_cand;
(1-8-4) repeat step (1-8-3) to obtain the sequences corresponding to all current initial states; calculate the loss function corresponding to the current initial state set, namely the average negative log-likelihood that the navigation model assigns to each ground-truth action along the sampled sequences (with teacher forcing), including the final stop action:
L(θ) = -(1/B) Σ_{i=1}^{B} Σ_{j=1}^{n(i)} log p(a_{ij}), with a_{i,n(i)} = a_stop
where p(a_{ij}) is the probability that the navigation model, fed with the observation images I(s_{i1}), ..., I(s_{ij}) and the actions a_start, a_{i1}, ..., a_{i,j-1}, assigns to the ground-truth action a_{ij};
(1-8-5) calculate the gradient ∇_θ L(θ) of the loss function with respect to the current trainable parameter set θ; optimize the navigation model parameters with a stochastic gradient descent optimizer according to the calculated gradient and update the current trainable parameter set (after one training step per batch, all parameters in the trainable parameter set θ are updated);
(1-8-6) repeat steps (1-8-2) to (1-8-5) until the navigation model converges, obtaining the trained navigation model and the final trainable parameter set θ' = [θ'_1; θ'_2; W'_3; h'_0; c'_0];
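Under the reconstruction above, one pass of steps (1-8-2) to (1-8-5) is teacher-forced behavior cloning on sampled shortest paths. The sketch below assumes the NavigationModel class from the previous listing; sample_batch is a hypothetical helper performing steps (1-8-2) and (1-8-3) and returning, for each sampled path, the list of observation-image tensors and the list of ground-truth action indices ending with the stop action:

import torch
import torch.nn.functional as F

def train_navigation(model, sample_batch, a_start, num_steps, lr=0.01):
    """Teacher-forced training loop for the navigation model (steps (1-8-2) to (1-8-6))."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)   # stochastic gradient descent
    for _ in range(num_steps):
        paths = sample_batch()          # list of (images, actions); actions end with a_stop
        opt.zero_grad()
        loss = 0.0
        for images, actions in paths:
            h, c = model.init_state(batch_size=1)
            prev_a = torch.tensor([a_start])
            for img, target in zip(images, actions):
                logits, (h, c) = model.step(img.unsqueeze(0), prev_a, (h, c))
                # negative log-likelihood of the ground-truth action at this step
                loss = loss + F.cross_entropy(logits, torch.tensor([target]))
                prev_a = torch.tensor([target])        # teacher forcing
        loss = loss / len(paths)
        loss.backward()
        opt.step()
    return model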
(2) a use stage; the method comprises the following specific steps:
(2-1) Place the physical robot at any position in the real physical scene as the initial position (the model of the physical robot should be as consistent as possible with the simulated robot), initialize the action â_0 = a_start, and obtain the initial observation image Î_1 of the robot at the initial position.
(2-2) Set the LSTM hidden-layer initial state of the navigation model trained in step (1) to ĥ_0 = h'_0, ĉ_0 = c'_0, and let the iteration step number i = 1.
(2-3) Input Î_i and â_{i-1}, together with the hidden state ĥ_{i-1}, ĉ_{i-1}, into the navigation model trained in step (1) to obtain the model output â_i, ĥ_i, ĉ_i, where Î_i is the observation image of the physical robot at step i (obtained by performing action â_{i-1}), â_i is the action to be performed by the physical robot at step i, selected according to the conditional probability distribution output by the navigation model, and ĥ_i, ĉ_i are the LSTM hidden-layer state parameters (intermediate state vectors) of the trained navigation model at step i.
(2-4) Judge â_i: if action â_i is the stop action a_stop, go to step (2-5); otherwise, the physical robot performs action â_i, obtains the updated observation image Î_{i+1} of step i+1, then lets i = i+1 and returns to step (2-3).
(2-5) Navigation ends; the final observation image Î_i is input into the static image description generation model Cap obtained in step (1-2), and the image description word sequence corresponding to that observation image is obtained as:
(w_1, w_2, ..., w_m) = Cap(Î_i)
which is the final scene description result.
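Putting the pieces together, the use stage is a closed loop: observe, let the navigation model choose an action, execute it, and caption the final view once the stop action is chosen. The sketch below reuses the NavigationModel and Cap objects introduced earlier; robot.observe() and robot.execute(action), the greedy argmax action choice and the max_steps safety cap are assumptions not specified in the patent:

import torch

def active_scene_description(robot, nav_model, cap, a_start, stop, max_steps=50):
    """Use-stage loop of step (2): navigate to a good viewpoint, then caption it."""
    nav_model.eval()
    image = robot.observe()                       # initial observation Î_1, a (3, H, W) tensor
    prev_action = torch.tensor([a_start])         # â_0 = a_start
    h, c = nav_model.init_state(batch_size=1)
    with torch.no_grad():
        for _ in range(max_steps):                # safety cap, not in the patent
            logits, (h, c) = nav_model.step(image.unsqueeze(0), prev_action, (h, c))
            action = int(logits.argmax(dim=1))    # greedy choice of â_i
            if action == stop:
                break                             # suitable viewpoint reached
            robot.execute(action)                 # physical robot performs â_i
            image = robot.observe()               # updated observation Î_{i+1}
            prev_action = torch.tensor([action])
    return cap(image)                             # (w_1, ..., w_m) = Cap(Î_i)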

Claims (1)

1. A robot-oriented active scene description method, which comprises a training phase and a use phase, and is characterized by comprising the following steps:
(1) a training stage; the method comprises the following specific steps:
(1-1) using an image caption generation dataset as the image description generation dataset, and labeling each image in the image description generation dataset with a text description to obtain a labeled image description generation dataset;
(1-2) selecting a static image description generation framework and training it with the image description generation dataset labeled in step (1-1) to obtain a trained static image description generation model Cap; the input of the model is any single image and the output is the image description word sequence corresponding to that image, expressed as:
(w_1^(I), w_2^(I), ..., w_{m_I}^(I)) = Cap(I)
where Cap is the static image description generation model, I is the input image, (w_1^(I), w_2^(I), ..., w_{m_I}^(I)) is the image description word sequence corresponding to the input image I, and w_{m_I}^(I) is the m_I-th word of the image description word sequence corresponding to the input image I;
(1-3) selecting a simulation environment;
(1-4) constructing a state set S to obtain a scene connection graph G = <S, E>; the method comprises the following specific steps:
(1-4-1) from the simulation environment, selecting all actions supported by the physical robot in the use stage to form an action space A;
(1-4-2) initializing the state set S = {s_0}, the edge set E = ∅, and the search queue S_Q = {s_0}, where s_0 is any reachable state of the simulated robot in the simulation scene;
(1-4-3) judging S_Q: if S_Q is not empty, executing step (1-4-4); otherwise, executing step (1-4-6);
(1-4-4) letting the current state S_C = S_Q.dequeue(), where dequeue denotes the dequeue operation of a queue data structure;
(1-4-5) for each a ∈ A, where a denotes any action in the action space: if the simulated robot is allowed to execute action a in state S_C, adding to the edge set E the edge that goes from S_C, through the executed action a, to the state Exec(S_C, a), where Exec(S_C, a) denotes the new state obtained when the simulated robot executes action a from state S_C;
then judging every Exec(S_C, a):
if Exec(S_C, a) ∉ S, executing S_Q.enqueue(Exec(S_C, a)), i.e. adding the state Exec(S_C, a) to the search queue S_Q, adding Exec(S_C, a) to the state set S, and then returning to step (1-4-3); otherwise, performing no operation and returning to step (1-4-3); where enqueue denotes the enqueue operation of a queue data structure;
(1-4-6) after the state set S is constructed, finally obtaining the scene connection graph G = <S, E>, where S is the state set, each state in S corresponds to one node of the scene connection graph, E is the set of edges connecting any two states in the state set, and each edge corresponds to the robot action connecting the two states;
(1-5) calculating all shortest paths between any two nodes of the scene connection graph G; each shortest path corresponds to a state sequence connecting the two nodes, and adjacent states in the state sequence are connected by the action corresponding to the edge in the edge set E that joins those two adjacent states;
(1-6) obtaining a candidate state set;
for each state s ∈ S in the state set S obtained in step (1-4), obtaining the observation image I(s) corresponding to state s in the simulation scene and the object detection result set O(s) corresponding to I(s), and using the static image description generation model Cap trained in step (1-2) to obtain the word set W(s) = {Cap(I(s))} corresponding to the state, where {q} denotes the set formed by the non-repeated elements of a sequence q; then calculating the score corresponding to state s, expressed as:
score(s) = α|O(s)| + |W(s) ∩ O(s)|
where α is a trade-off factor;
after the scores of all states in the state set S have been calculated, the states whose scores are higher than a set score threshold form the candidate state set S_cand, where the score threshold is a set proportion of the highest score among all state scores, namely:
S_cand = {s | s ∈ S, score(s) > β · max_{s'∈S} score(s')}
where β is a proportion factor;
(1-7) constructing a navigation model;
the navigation model is composed of a convolutional neural network CNN, a long short-term memory model LSTM and a fully connected layer, where the convolutional neural network is connected to the input layer of each time step of the long short-term memory model and the fully connected layer is connected to the output layer of each time step of the long short-term memory model;
let the initial parameters of the CNN be θ_1, the initial parameters of the LSTM be θ_2, the initial parameters of the fully connected layer be W_3, and the initial state of the LSTM hidden layer be h_0, c_0; then the iterative process of the navigation model is:
h_{t+1}, c_{t+1} = LSTM(h_t, c_t, [CNN(I_{t+1}); a_t])
p(a_{t+1}) = Softmax(W_3 h_{t+1})
where a_t, I_{t+1} are the inputs of the t-th iteration of the navigation model, a_t is the action performed by the robot at step t, and I_{t+1} is the observation image obtained after performing action a_t; for t = 0, set a_t = a_start, where a_start denotes the start action; p(a_{t+1}) denotes the conditional probability of each action to be executed at iteration step t+1 and is the output of the navigation model; θ = [θ_1; θ_2; W_3; h_0; c_0] constitutes the current trainable parameter set of the navigation model;
the result of the iterative process up to step t is denoted Nav(h_0, c_0, I_1, I_2, ..., I_t, a_0, a_1, ..., a_t) = p(a_{t+1});
(1-8) sampling the shortest path and training a navigation model to obtain a trained navigation model; the method comprises the following specific steps:
(1-8-1) setting B as the size of a training batch;
(1-8-2) randomly sampling B states from S as the current initial state set s_{11}, s_{21}, ..., s_{B1};
(1-8-3) for each current initial state s_{i1}, 1 ≤ i ≤ B, randomly sampling one candidate state from the candidate state set S_cand obtained in step (1-6) as the termination candidate position s_{i,n(i)} of that initial state, 1 ≤ i ≤ B, where n(i) is the length of the state sequence traversed by the shortest path connecting s_{i1} and s_{i,n(i)}; randomly selecting one shortest path among all shortest paths connecting s_{i1} and s_{i,n(i)}, obtaining the sequence of nodes and edges corresponding to that path:
(a_start, I(s_{i1}), a_{i1}, I(s_{i2}), a_{i2}, ..., a_{i,n(i)-1}, I(s_{i,n(i)}), a_stop)
where a_{ij} is the action traversed when transitioning from state s_{ij} to the next state s_{i,j+1}, I(s_{ij}) is the observation image corresponding to state s_{ij}, and a_start and a_stop are the start action and stop action, respectively;
(1-8-4) repeating step (1-8-3) to obtain the sequences corresponding to all current initial states; calculating the loss function corresponding to the current initial state set, namely the average negative log-likelihood that the navigation model assigns to each ground-truth action along the sampled sequences (with teacher forcing), including the final stop action:
L(θ) = -(1/B) Σ_{i=1}^{B} Σ_{j=1}^{n(i)} log p(a_{ij}), with a_{i,n(i)} = a_stop
where p(a_{ij}) is the probability that the navigation model, fed with the observation images I(s_{i1}), ..., I(s_{ij}) and the actions a_start, a_{i1}, ..., a_{i,j-1}, assigns to the ground-truth action a_{ij};
(1-8-5) calculating the gradient ∇_θ L(θ) of the loss function with respect to the current trainable parameter set θ; optimizing the navigation model parameters with a stochastic gradient descent optimizer according to the calculated gradient and updating the current trainable parameter set;
(1-8-6) repeating steps (1-8-2) to (1-8-5) until the navigation model converges, obtaining the trained navigation model and the final trainable parameter set θ' = [θ'_1; θ'_2; W'_3; h'_0; c'_0];
(2) a use stage; the method comprises the following specific steps:
(2-1) placing the physical robot at any position in the real physical scene as the initial position, initializing the action â_0 = a_start, and obtaining the corresponding initial observation image Î_1 of the physical robot;
(2-2) setting the LSTM hidden-layer initial state of the navigation model trained in step (1) to ĥ_0 = h'_0, ĉ_0 = c'_0, and letting the iteration step number i = 1;
(2-3) inputting Î_i and â_{i-1}, together with the hidden state ĥ_{i-1}, ĉ_{i-1}, into the navigation model trained in step (1) to obtain the model output â_i, ĥ_i, ĉ_i, where Î_i is the observation image of the physical robot at step i, â_i is the action to be performed by the physical robot at step i, selected according to the conditional probability distribution output by the navigation model, and ĥ_i, ĉ_i are the LSTM hidden-layer state parameters of the trained navigation model at step i;
(2-4) judging â_i: if action â_i is the stop action a_stop, going to step (2-5); otherwise, the physical robot performs action â_i, obtains the updated observation image Î_{i+1} of step i+1, then lets i = i+1 and returns to step (2-3);
(2-5) navigation ends; inputting the final observation image Î_i into the static image description generation model Cap obtained in step (1-2), obtaining the image description word sequence corresponding to that observation image:
(w_1, w_2, ..., w_m) = Cap(Î_i)
which is the final scene description result.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010287188.8A CN111611373B (en) 2020-04-13 2020-04-13 Robot-oriented specific active scene description method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010287188.8A CN111611373B (en) 2020-04-13 2020-04-13 Robot-oriented specific active scene description method

Publications (2)

Publication Number Publication Date
CN111611373A true CN111611373A (en) 2020-09-01
CN111611373B CN111611373B (en) 2021-09-10

Family

ID=72197820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010287188.8A Active CN111611373B (en) 2020-04-13 2020-04-13 Robot-oriented specific active scene description method

Country Status (1)

Country Link
CN (1) CN111611373B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111192A (en) * 2021-04-28 2021-07-13 清华大学 Method, equipment and exploration method for intelligent agent to actively construct environment scene map

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107065881A (en) * 2017-05-17 2017-08-18 清华大学 A kind of robot global path planning method learnt based on deeply
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN108875807A (en) * 2018-05-31 2018-11-23 陕西师范大学 A kind of Image Description Methods multiple dimensioned based on more attentions
CN109029444A (en) * 2018-06-12 2018-12-18 深圳职业技术学院 One kind is based on images match and sterically defined indoor navigation system and air navigation aid

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN107065881A (en) * 2017-05-17 2017-08-18 清华大学 A kind of robot global path planning method learnt based on deeply
CN108875807A (en) * 2018-05-31 2018-11-23 陕西师范大学 A kind of Image Description Methods multiple dimensioned based on more attentions
CN109029444A (en) * 2018-06-12 2018-12-18 深圳职业技术学院 One kind is based on images match and sterically defined indoor navigation system and air navigation aid

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111192A (en) * 2021-04-28 2021-07-13 清华大学 Method, equipment and exploration method for intelligent agent to actively construct environment scene map
CN113111192B (en) * 2021-04-28 2022-03-29 清华大学 Method, equipment and exploration method for intelligent agent to actively construct environment scene map

Also Published As

Publication number Publication date
CN111611373B (en) 2021-09-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant