CN115311538A - Intelligent agent target searching method based on scene prior - Google Patents

Intelligent agent target searching method based on scene prior

Info

Publication number
CN115311538A
Authority
CN
China
Prior art keywords
matrix
semantic
target
robot
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210156851.XA
Other languages
Chinese (zh)
Inventor
赵怀林
陆升阳
梁兰军
侯煊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Technology
Original Assignee
Shanghai Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Technology filed Critical Shanghai Institute of Technology
Priority to CN202210156851.XA priority Critical patent/CN115311538A/en
Publication of CN115311538A publication Critical patent/CN115311538A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/00 Scenes; Scene-specific elements
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks


Abstract

The invention relates to a scene-prior-based target searching method for an intelligent agent, used for robot target search, which comprises the following steps: confirming the target coding information and the target to be searched; acquiring an environment image of the scene to be searched with the robot and constructing a depth image matrix and a semantic image matrix from the environment image; extracting an object relation feature vector; constructing a spatial semantic fusion matrix; obtaining a semantic map feature vector from the spatial semantic fusion matrix; generating a fusion feature vector from the object relation feature vector, the semantic map feature vector and the target coding information; and training the value network and the target network with the fusion feature vector, then searching for the target with the trained value network once training is finished. Compared with the prior art, the method offers high navigation precision and high search accuracy and efficiency.

Description

Intelligent agent target searching method based on scene prior
Technical Field
The invention relates to the field of active visual perception, in particular to an intelligent agent target searching method based on scene prior.
Background
In recent years, robotics research has been devoted to expanding the ability of robots to explore the environment, understand it, interact with it, and communicate with people. Traditional navigation methods generally rely on an environment map and divide the navigation task into three steps: mapping, localization, and path planning. This approach typically requires a 3D map to be built in advance, together with reliable localization on that map and path tracking. In some cases, however, artificial landmarks are unknown or the robot operates in a GPS-denied environment, which makes self-motion estimation and the acquisition of scene information very difficult. For a long time, robot navigation has mainly been addressed with distance sensors such as lidar, infrared, or sonar, which are suited to small-scale static environments (each kind of distance sensor is limited by its own physical properties). In dynamic, complex, and large-scale environments, however, robot mapping and navigation face many challenges.
More recently, the success of data-driven machine learning strategies on a variety of control and perception problems has opened a new way to overcome the limitations of the earlier approaches. These methods are widely studied because they do not require map construction, depend little on the environment, and allow human-computer interaction. Their key idea is to learn an end-to-end mapping directly from raw observations to the actions of the task. Such methods exploit navigation experience acquired earlier in new but similar environments, with or without maps. Reinforcement learning is commonly used for visual navigation; however, it still suffers from poor generalization, low navigation efficiency, and limited accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an intelligent agent target searching method based on scene prior.
The purpose of the invention can be realized by the following technical scheme:
a method for searching an intelligent agent target based on scene prior is used for searching a target of a robot, and comprises the following steps:
s1: confirming target coding information and a target to be searched;
s2: acquiring an environment image of a scene to be searched through a robot, and constructing a depth image matrix and a semantic image matrix according to the environment image;
s3: carrying out object relation characteristic analysis on the environment image, identifying objects in the environment, confirming the object with the maximum possibility of relation with a target to be searched, and extracting an object relation characteristic vector;
s4: acquiring a spatial semantic point cloud according to the depth image matrix and the semantic image matrix, and constructing a spatial semantic fusion matrix according to the spatial semantic point cloud and object information in the environment;
s5: obtaining a semantic map feature vector according to the spatial semantic fusion matrix;
s6: generating a fusion feature vector according to the object relation feature vector, the semantic map feature vector and the target coding information;
s7: and training the value network and the target network according to the fusion feature vector, and searching the target based on the trained value network after the training is finished.
Preferably, the step S2 specifically includes:
s21: acquiring an environment image of a scene to be searched through a robot, wherein the environment image comprises an RGB (red, green and blue) image and a depth image of the environment;
s22: recording the depth image as a depth image matrix;
s23: and calculating the environment image by utilizing the pre-trained semantic segmentation network to generate a semantic image matrix.
Preferably, the step S3 specifically includes:
s31: acquiring a scene graph G = { V, E }, wherein V is a graph node and represents different object types in a scene, E is a graph edge and represents the position relation between two types of objects, a visual genome data set is used as a source, a knowledge graph is constructed according to the types of all objects appearing in the scene to be searched, each type is represented as a node in the graph, edges are used for linking two nodes with object relation appearance frequency larger than 3 in the visual genome data set to generate a graph structure, and the graph structure is represented by a binary adjacency matrix A;
s32: and constructing a graph convolution neural network, inputting an RGB image of the environment image, outputting a spatial relationship characteristic, and mapping the spatial relationship characteristic to 512 dimensions to obtain an object relationship characteristic vector.
Preferably, the specific step of step S4 includes:
s41: generating an all-zero matrix of size (C+2) × 224 × 224, which represents the spatial semantic fusion matrix M; the spatial semantic fusion matrix comprises C+2 layers, each of size 224 × 224;
s42: generating a spatial point cloud taking into account the pose P(x_t, y_t, z_t, θ_t) of the robot;
s43: the size of the spatial semantic point cloud is C × W × L × H, wherein C is the number of channels of the spatial semantic point cloud, each channel representing a semantic category, and W, L, H are its width, length and height respectively; the three-dimensional point cloud is summed over the height dimension and mapped to two dimensions, giving a two-dimensional mapping feature map of size C × W × L that is used as the first C layers of the spatial semantic fusion matrix;
s44: recording the walking path of the robot in the (C+1)-th layer of the spatial semantic fusion matrix, and marking the object most likely to be related to the target to be searched in the (C+2)-th layer.
Preferably, the spatial point cloud is obtained in the following manner:
\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= R \, D(u,v) \begin{bmatrix} (u - c_x)/f_x \\ (v - c_y)/f_y \\ 1 \end{bmatrix} + T
\]
wherein x, y and z are the point cloud coordinates, f_x, f_y and c_x, c_y are the camera intrinsic parameters (focal lengths and principal point), S is the semantic image matrix, D is the depth image matrix, u and v are pixel coordinates in the semantic image matrix, and R and T are the rotation matrix and translation matrix of the robot respectively; from the pose P(x_t, y_t, z_t, θ_t) of the robot, the translation matrix and rotation matrix are respectively:
\[
T = \begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix}, \qquad
R = \begin{bmatrix} \cos\theta_t & -\sin\theta_t & 0 \\ \sin\theta_t & \cos\theta_t & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]
preferably, the step S5 specifically includes:
s51: carrying out normalization processing on the spatial semantic fusion matrix;
s52: and constructing a convolutional neural network, processing the space semantic fusion matrix as input by using the convolutional neural network, and outputting a semantic map feature vector.
Preferably, the convolutional neural network comprises a convolutional layer, a nonlinear activation layer, a data normalization layer, a maximum pooling network, a convolutional layer, a data normalization layer, a maximum pooling network, a convolutional layer, and a data normalization layer which are connected in sequence, and finally, the output of the last data normalization layer is changed into a one-dimensional vector through matrix transformation, and then, the result of the matrix transformation is converted into a semantic map feature vector by utilizing linear transformation.
Preferably, the step S6 specifically includes:
and splicing the object relation feature vector, the semantic map feature vector and the target coding information vector to generate a fusion feature vector.
Preferably, the step S7 specifically includes:
s71: constructing a reward and punishment function:
\[
R(t,a)=\begin{cases} r^{+}, & \text{if the target category appears in } S \text{ and its computed distance is less than } 0.5\ \mathrm{m}\\ r^{-}, & \text{otherwise}\end{cases}
\]
wherein R(t, a) is the reward/punishment value, t denotes a moment at which the robot takes an action, and a denotes the action taken at that moment; when the target category appears in the semantic image matrix S of the robot and the computed distance between the robot and the target category is less than 0.5 m, the robot is considered to have found the target and receives the positive reward r^{+}, otherwise it receives r^{-};
s72: inputting the fusion feature vector Q into a deep convolutional neural network with initial weights; the machine imitates the navigation strategy of a human expert to obtain demonstration experience and stores it in an initialized experience pool; the value network J is initialized with random weights and the target value network J' is initialized to the current value network; each episode is then iterated to obtain the optimal value network J.
Preferably, the training process of the value network in S72 specifically includes:
training a value network by using a reinforcement learning time difference method, recording the value network J as a current value network, initializing the training times to be 0, designing experience playback capacity and sampling quantity, setting a target network J', initializing the random pose of the robot, setting the training times, and selecting actions under the current state according to a greedy strategy:
\[
a_t=\begin{cases} \arg\max_{a} J(S_t,a), & \text{with probability } 1-\varepsilon\\ \text{a random action}, & \text{with probability } \varepsilon\end{cases}
\]
wherein a_t is the action taken at the next moment; a reward R(t, a) is obtained from the action taken, and the next state S_t together with the updated reward value is stored in the experience pool; the experience pool is updated once every preset number of steps, and the current value network is updated by a gradient descent algorithm until the robot reaches the terminal state or exceeds the maximum episode length t_max of 200 actions; otherwise the target network is synchronized with the current value network, and after the set number of training iterations the value network J is obtained.
Compared with the prior art, the invention has the following advantages:
1. the robot target searching method is based on the real environment, and an indoor map-free navigation target searching system based on the visual sensor is designed, so that the robot does not need to establish a map when searching for an object and completing task navigation, and map-free target searching and indoor navigation tasks can be realized.
2. In the current research of indoor map-free visual navigation, visual information is basically used as an input matrix, and reinforcement learning or imitation learning is directly utilized to train navigation. The method has low navigation success rate and long time, and partial networks are difficult to converge, thereby causing the result of training failure. The map-free navigation target searching system based on scene prior disclosed by the invention has the advantages that visual information forms a local semantic map, so that the training speed and the navigation precision are greatly improved.
3. The invention adopts scene prior, searches for objects by utilizing the scene prior exploration environment, and is beneficial to improving the accuracy and efficiency of target search.
Drawings
Fig. 1 is a block diagram of a hardware system according to the method of the present invention.
Fig. 2 is a general flow diagram of the present invention.
Fig. 3 is a schematic view of a scene graph of the present invention.
FIG. 4 is a schematic diagram of a convolutional neural network of the present invention
FIG. 5 is a flow chart of reinforcement learning according to the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. Note that the following description of the embodiments is merely a substantial example, and the present invention is not intended to be limited to the application or the use thereof, and is not limited to the following embodiments.
Examples
The invention relates to a scene-prior-based target searching method for an intelligent agent. An RGB image and a depth map of the environment are captured; the RGB image is processed by a trained semantic segmentation network to generate a semantic image, and a local semantic map is generated from the depth map and odometry information. Object relations that occur frequently in the Visual Genome data are recorded in a prior-knowledge matrix and added on top of the semantic map to generate a spatial-relation semantic map. A convolutional network is used to encode the spatial semantic map matrix as the data fusion matrix of the local environment. A reinforcement learning model is trained as the navigator: it takes the data fusion matrix as input, outputs one of the actions forward, left, right, or stop, and thereby controls the motion direction of the robot. The equipment used for the robot target search, as shown in fig. 1, mainly consists of a robot equipped with a camera sensor and a lidar, and a server; the robot transmits what the camera sensor "sees" to the server over WiFi, and the visual information is processed on the server. Specifically, as shown in fig. 2, the method comprises the following steps:
S1: confirming the target coding information and the target to be searched. A target coding network is established and used to encode each object in the scene. Specifically, in this embodiment a human-computer interaction interface in the form of a text box is set up; the user enters the name of the target to be searched into the text box, and the target is encoded after input. The encoder consists essentially of one LSTM, each layer containing 128 hidden units.
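As an illustration of this target encoder, the following is a minimal PyTorch sketch of an LSTM encoder whose layers have 128 hidden units; the vocabulary size, embedding dimension, and the token ids in the usage line are illustrative assumptions, since the description only specifies the LSTM and its 128 hidden units.

```python
import torch
import torch.nn as nn

class TargetEncoder(nn.Module):
    """Encodes the name of the target object entered in the text box.

    Sketch only: vocabulary, embedding size and tokenization are
    illustrative assumptions; the description only states that the
    encoder is an LSTM with 128 hidden units per layer.
    """
    def __init__(self, vocab_size=100, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer indices of the target name
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)          # h_n: (1, batch, 128)
        return h_n.squeeze(0)               # target coding vector, (batch, 128)

# Usage with an assumed token sequence for the entered target name
encoder = TargetEncoder()
target_code = encoder(torch.tensor([[12, 37, 5]]))  # assumed token ids
print(target_code.shape)  # torch.Size([1, 128])
```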
S2: the method comprises the steps of obtaining an environment image of a scene to be searched through a robot, and constructing a depth image matrix and a semantic image matrix according to the environment image.
Step S2 specifically includes: the robot captures an RGB image and a depth map of the environment with its camera sensor. The RGB image is called the environment image; it is a 3 × (w × h) image, where w and h are the image width and height, i.e. it comprises 3 layers each of size w × h. The depth image is a 1 × (w × h) image comprising 1 layer of size w × h, and it is recorded as the depth image matrix D. The environment image is then processed by the pre-trained semantic segmentation network Mask R-CNN to generate a semantic segmentation result matrix, which is recorded as the semantic image matrix S.
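A minimal sketch of this step is given below; it assumes torchvision's COCO-pretrained Mask R-CNN as a stand-in for the pre-trained semantic segmentation network, and the way instance masks are collapsed into a per-pixel category map is an illustrative choice.

```python
import numpy as np
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

def build_image_matrices(rgb, depth, score_thresh=0.5):
    """Builds the depth image matrix D and a semantic image matrix S
    from one RGB-D observation.

    rgb:   float32 array of shape (3, h, w), values in [0, 1]
    depth: float32 array of shape (h, w), in metres
    Sketch: a COCO-pretrained Mask R-CNN stands in for the pre-trained
    semantic segmentation network of the description.
    """
    model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
    with torch.no_grad():
        out = model([torch.from_numpy(rgb)])[0]

    h, w = depth.shape
    S = np.zeros((h, w), dtype=np.int64)            # 0 = background
    for mask, label, score in zip(out["masks"], out["labels"], out["scores"]):
        if score >= score_thresh:
            S[mask[0].numpy() > 0.5] = int(label)   # per-pixel category id
    D = depth.copy()                                # depth image matrix
    return D, S
```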
S3: and carrying out object relation characteristic analysis on the environment image, identifying objects in the environment, confirming the object with the highest possibility of relation with the target to be searched, and extracting an object relation characteristic vector.
As shown in fig. 4, step S3 mainly uses graph convolutional networks (GCNs) to extract the relation information between targets from visual information. The scene prior knowledge is expressed in the form of an undirected scene graph G = {V, E}: the nodes in V represent different object categories, and an edge in E represents the positional relation between two categories of objects. Using the Visual Genome dataset as the source, a knowledge graph is constructed from the categories of all objects that appear in the actual scene of this experiment, each category being represented as a node in the graph. An edge links two nodes when the corresponding object relation appears more than 3 times in the Visual Genome dataset; the resulting graph structure is represented by the binary adjacency matrix A.
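The construction of the binary adjacency matrix A can be sketched as follows; the relation_counts dictionary of Visual Genome co-occurrence counts is an assumed, precomputed input.

```python
import numpy as np

def build_adjacency(categories, relation_counts, min_count=3):
    """Builds the binary adjacency matrix A of the scene graph G = {V, E}.

    categories:      list of object category names appearing in the scene
    relation_counts: dict mapping (category_i, category_j) -> number of
                     times the relation appears in the Visual Genome
                     dataset (assumed to have been counted offline)
    An undirected edge is created when the count exceeds min_count.
    """
    n = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    A = np.zeros((n, n), dtype=np.float32)
    for (ci, cj), count in relation_counts.items():
        if ci in index and cj in index and count > min_count:
            A[index[ci], index[cj]] = 1.0
            A[index[cj], index[ci]] = 1.0
    return A
```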
A graph convolutional neural network is constructed whose input is the RGB image; the outputs of its layers form the spatial relation feature matrix Z = [z_1, z_2, ..., z_{|V|}]. In this embodiment there are 83 articles in the actual environment, so 83 nodes are designed, and all node features are collected into the feature matrix F = [F_1, F_2, ..., F_{|V|}]. Normalizing the matrix A gives the matrix
\[
\hat{A} = \tilde{D}^{-\frac{1}{2}} (A + I) \tilde{D}^{-\frac{1}{2}},
\]
where \tilde{D} is the degree matrix of A + I. It is thus possible to obtain
\[
H^{(\alpha+1)} = \sigma\!\left( \hat{A} \, H^{(\alpha)} W^{(\alpha)} \right), \qquad H^{(0)} = X, \quad H^{(\beta)} = Z,
\]
wherein W^{(α)} is the parameter of the α-th layer, σ is a nonlinear activation, and β is the total number of layers of the GCNs.
As shown in fig. 3, the first part of the pipeline is a pre-trained ResNet-34; its input is the RGB image and its output, the scores over the 1000 ImageNet object classes, serves as the image feature vector. For each node, the current image feature vector is mapped to a 512-dimensional feature vector, the name of each category is mapped to a 512-dimensional feature vector by word embedding, and the two feature vectors are concatenated to form a 1024-dimensional joint representation for each graph node. The input of the graph convolutional network is the adjacency matrix A together with the node feature vectors; the first two layers output 1024-dimensional latent features, and the last layer outputs one value per node, yielding a |V|-dimensional feature vector. This feature vector is the semantic encoding of the current scene and environmental context. Finally, it is mapped to a 512-dimensional object relation feature vector e.
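A minimal PyTorch sketch of this object-relation branch is shown below; the symmetric adjacency normalization and the ReLU activations are assumptions, and the ResNet-34 image features and word embeddings that form the 1024-dimensional node features are taken as given inputs.

```python
import torch
import torch.nn as nn

def normalize_adjacency(A):
    """A_hat = D^{-1/2} (A + I) D^{-1/2} (symmetric GCN normalization)."""
    A_tilde = A + torch.eye(A.shape[0])
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)

class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_hat, H):
        return torch.relu(A_hat @ self.W(H))

class ObjectRelationNet(nn.Module):
    """Sketch: 1024-d joint node features (image feature + word embedding)
    -> two 1024-d GCN layers -> one value per node -> |V|-dim vector
    mapped to the 512-d object relation feature e."""
    def __init__(self, num_nodes=83, node_dim=1024):
        super().__init__()
        self.gcn1 = GCNLayer(node_dim, 1024)
        self.gcn2 = GCNLayer(1024, 1024)
        self.gcn3 = GCNLayer(1024, 1)        # one value output per node
        self.to_e = nn.Linear(num_nodes, 512)

    def forward(self, A, node_features):
        A_hat = normalize_adjacency(A)
        h = self.gcn1(A_hat, node_features)
        h = self.gcn2(A_hat, h)
        v = self.gcn3(A_hat, h).squeeze(-1)  # (|V|,)
        return self.to_e(v)                  # object relation feature e, (512,)
```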
S4: and acquiring a spatial semantic point cloud according to the depth image matrix and the semantic image matrix, and constructing a spatial semantic fusion matrix according to the spatial semantic point cloud and object information in the environment.
The step specifically comprises the following. First, an all-zero matrix of size (C+2) × 224 × 224 is generated, representing the spatial semantic fusion matrix M; the matrix comprises C+2 layers, each of size 224 × 224. The position of the robot is then placed in the middle of the map, at (112, 112). Taking into account the pose P(x_t, y_t, z_t, θ_t) of the robot, the spatial point cloud is computed with the following formula:
\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= R \, D(u,v) \begin{bmatrix} (u - c_x)/f_x \\ (v - c_y)/f_y \\ 1 \end{bmatrix} + T
\]
wherein x, y and z are the point cloud coordinates, f_x, f_y and c_x, c_y are the camera intrinsic parameters (focal lengths and principal point), S is the semantic image matrix, D is the depth image matrix, u and v are pixel coordinates in the semantic image matrix, and R and T are the rotation matrix and translation matrix of the robot respectively; from the pose P(x_t, y_t, z_t, θ_t) of the robot, the translation matrix and rotation matrix are:
\[
T = \begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix}, \qquad
R = \begin{bmatrix} \cos\theta_t & -\sin\theta_t & 0 \\ \sin\theta_t & \cos\theta_t & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]
According to the above formula, the spatial semantic point cloud can be calculated from the semantic image matrix S and the depth image matrix D. The computed spatial semantic point cloud has size C × W × L × H, where C is the number of channels of the point cloud, each channel representing a semantic category, and W, L, H are its width, length and height respectively. Because the spatial semantic point cloud is too expensive to process directly, it is summed over the height dimension, mapping the three-dimensional point cloud to two dimensions and giving a two-dimensional mapping feature map of size C × W × L, which is used as the first C layers of the spatial semantic fusion matrix. The walking path of the robot is recorded in the (C+1)-th layer of the spatial semantic fusion matrix, and the object most likely to be related to the target to be searched is marked in the (C+2)-th layer. As the robot moves, the two-dimensional mapping feature map is added to the spatial semantic fusion matrix M at the corresponding positions, and the path and the most likely related object are updated, completing the update of the spatial semantic fusion matrix.
Further, the semantic two-dimensional mapping feature map makes clear which objects are present in the map; using the relations between objects in the Visual Genome dataset, the object most likely to be related to the target object is obtained by statistics and calculation, and the position corresponding to that object is highlighted in the (C+2)-th layer to serve as a marker.
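The projection of one observation into the spatial semantic fusion matrix M can be sketched as follows; the map resolution cell_size and the axis conventions are assumptions that are not fixed by the description above.

```python
import numpy as np

def update_fusion_matrix(M, D, S, pose, fx, fy, cx, cy, num_classes,
                         cell_size=0.05):
    """Adds one observation to the (C+2) x 224 x 224 spatial semantic
    fusion matrix M, whose centre cell (112, 112) is the robot start.

    Sketch under assumptions: cell_size (metres per map cell) and the
    axis conventions below are illustrative.
    D: depth image (h, w) in metres, S: per-pixel category ids (h, w),
    pose = (x_t, y_t, z_t, theta_t).
    """
    x_t, y_t, _, theta = pose
    h, w = D.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # back-project pixels: camera x points right, z points forward;
    # the height coordinate is dropped (summed out) for the 2-D map
    x_cam = (u - cx) * D / fx
    z_cam = D

    # rotate by the robot heading (yaw) and translate by its position
    x_w = np.cos(theta) * x_cam - np.sin(theta) * z_cam + x_t
    y_w = np.sin(theta) * x_cam + np.cos(theta) * z_cam + y_t

    rows = np.clip((y_w / cell_size).astype(int) + 112, 0, 223)
    cols = np.clip((x_w / cell_size).astype(int) + 112, 0, 223)
    valid = (D > 0) & (S > 0)

    # first C layers: one occupancy layer per semantic category
    np.add.at(M, (S[valid] - 1, rows[valid], cols[valid]), 1.0)
    # layer C: the walking path of the robot (layer C+1 marks the
    # object most likely related to the target, set elsewhere)
    pr = int(np.clip(y_t / cell_size + 112, 0, 223))
    pc = int(np.clip(x_t / cell_size + 112, 0, 223))
    M[num_classes, pr, pc] = 1.0
    return M
```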
S5: obtaining a semantic map feature vector according to the spatial semantic map fusion matrix;
the method specifically comprises the following steps: s51: carrying out normalization processing on the spatial semantic fusion matrix;
s52: and constructing a convolutional neural network, processing the space semantic fusion matrix as input by using the convolutional neural network, and outputting a semantic map feature vector.
Convolutional neural networks are widely used for extracting image features because they require no image pre-processing and can extract additional features; their shared convolution kernels allow high-dimensional data to be processed, and as the network becomes deeper they can extract increasingly deep information from the image. Therefore, in this step a convolutional neural network is adopted to perform feature processing on the information acquired by the camera sensor.
(1) In this step, the spatial semantic fusion matrix M is first normalized; the normalization can be expressed by the following formula:
\[
x_i^{*} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}
\]
where x_i^{*} is a value of the normalized matrix M, x_i is the corresponding value of the original matrix M, x_{\min} is the minimum value of the matrix M, and x_{\max} is the maximum value of the matrix M. The spatial semantic fusion matrix M is normalized by this formula.
(2) The method for constructing the convolutional neural network specifically comprises the following steps:
setting a first layer of the convolutional neural network as a convolutional layer, wherein a convolutional kernel of the convolutional layer is a matrix of 3*3, and the number of channels is 64; the input of the convolution layer is the spatial semantic fusion matrix M after the normalization processing in the previous step; the second layer of the convolutional neural network is a nonlinear activation layer, the nonlinear activation function is a relu function, the output of the convolutional layer is used as the input of the layer, and the nonlinearity of the network is increased. The third layer of the convolutional neural network is a data normalization layer, the input of the layer is the output of the nonlinear activation layer, and the input is normalized by the following formula:
\[
\hat{x}_{v1}^{(k)} = \frac{x_{v1}^{(k)} - E\!\left[x_{v1}^{(k)}\right]}{\sqrt{\operatorname{Var}\!\left[x_{v1}^{(k)}\right]}}
\]
wherein \hat{x}_{v1}^{(k)} is the output of the normalization layer, x_{v1}^{(k)} is the output of the nonlinear activation layer for channel k (i.e. the output of the k-th channel is x_{v1}^{(k)}), E[x_{v1}^{(k)}] is the mean of x_{v1}^{(k)}, and Var[x_{v1}^{(k)}] is its variance.
The fourth layer of the convolutional neural network is a maximum pooling layer with a 2×2 kernel. The fifth layer is a convolutional layer with a 3×3 kernel and 64 channels, whose input is the output of the fourth-layer maximum pooling. The sixth layer is a nonlinear activation layer using the relu function, which takes the output of the convolutional layer as input and increases the nonlinearity of the network. The seventh layer is a data normalization layer whose input is the output of the nonlinear activation layer. The eighth layer is a maximum pooling layer with a 2×2 kernel. The ninth layer is a convolutional layer with a 3×3 kernel and 128 channels, whose input is the output of the maximum pooling layer. The tenth layer is a nonlinear activation layer using the relu function. The eleventh layer is a data normalization layer whose input is the output of the nonlinear activation layer. The twelfth layer is a maximum pooling layer with a 2×2 kernel. The thirteenth layer is a convolutional layer with a 3×3 kernel and 512 channels, whose input is the output of the maximum pooling layer. The fourteenth layer is a data normalization layer whose input is the output of the thirteenth layer. Finally, the output of this data normalization layer is reshaped into a one-dimensional vector by matrix transformation, and a linear transformation converts it into the 1 × 128 semantic map feature vector f.
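A PyTorch sketch of the fourteen-layer map-encoding network described above is given below; the padding of the 3×3 convolutions and the lazily inferred size of the final linear layer are assumptions.

```python
import torch
import torch.nn as nn

def make_map_encoder(in_channels):
    """Sketch of the map-encoding CNN described above.
    Assumptions: 3x3 convolutions use padding=1, and the final linear
    layer size is inferred lazily; neither is stated explicitly."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, 3, padding=1),   # layer 1: conv, 64 channels
        nn.ReLU(),                                  # layer 2: nonlinear activation
        nn.BatchNorm2d(64),                         # layer 3: data normalization
        nn.MaxPool2d(2),                            # layer 4: max pooling
        nn.Conv2d(64, 64, 3, padding=1),            # layer 5
        nn.ReLU(),                                  # layer 6
        nn.BatchNorm2d(64),                         # layer 7
        nn.MaxPool2d(2),                            # layer 8
        nn.Conv2d(64, 128, 3, padding=1),           # layer 9
        nn.ReLU(),                                  # layer 10
        nn.BatchNorm2d(128),                        # layer 11
        nn.MaxPool2d(2),                            # layer 12
        nn.Conv2d(128, 512, 3, padding=1),          # layer 13
        nn.BatchNorm2d(512),                        # layer 14
        nn.Flatten(),                               # matrix transformation to 1-D
        nn.LazyLinear(128),                         # linear map to the 128-d feature f
    )

# Usage: encode a normalized (C+2) x 224 x 224 fusion matrix, e.g. C = 83
encoder = make_map_encoder(85)
f = encoder(torch.zeros(1, 85, 224, 224))  # semantic map feature vector, (1, 128)
```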
S6: and generating a fusion feature vector according to the object relation feature vector, the semantic map feature vector and the target coding information. Specifically, in this embodiment, the feature vectors e and f and the target encoding information are spliced to generate the fusion feature vector Q.
Step S7, a value network model is trained by adopting a deep convolutional neural network in deep learning and a time difference method in reinforcement learning, so that the target search and navigation of the robot are realized, and the method specifically comprises the following steps:
s71: constructing a reward and punishment function:
\[
R(t,a)=\begin{cases} r^{+}, & \text{if the target category appears in } S \text{ and its computed distance is less than } 0.5\ \mathrm{m}\\ r^{-}, & \text{otherwise}\end{cases}
\]
wherein R(t, a) is the reward/punishment value, t denotes a moment at which the robot takes an action, and a denotes the action taken at that moment; when the target category appears in the semantic image matrix S of the robot and the computed distance between the robot and the target category is less than 0.5 m, the robot is considered to have found the target and receives the positive reward r^{+}, otherwise it receives r^{-};
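A minimal sketch of this reward-and-punishment function follows; the success condition matches the definition above, while the numerical values r_success and r_step are assumptions.

```python
def reward(target_visible_in_S, distance_to_target,
           r_success=10.0, r_step=-0.01):
    """R(t, a): reward for the action a taken at time t.

    The success condition follows the description above; the reward
    magnitudes r_success and r_step are illustrative assumptions.
    """
    if target_visible_in_S and distance_to_target < 0.5:
        return r_success
    return r_step
```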
s72: inputting the fusion feature vector Q into a deep convolutional neural network with initial weights; the machine imitates the navigation strategy of a human expert to obtain demonstration experience and stores it in an initialized experience pool; the value network J is initialized with random weights and the target value network J' is initialized to the current value network; each episode is then iterated to obtain the optimal value network J.
In this embodiment, as shown in fig. 5, the training process of the value network in S72 specifically includes:
The value network is trained with the temporal-difference method of reinforcement learning. The value network J is recorded as the current value network, the number of training iterations is initialized to 0, the experience replay capacity is set to 50000 and the sample size to 200, the target network J' is set up, the robot is initialized with a random pose, the total number of training iterations is set to 1,000,000, and the action in the current state is selected according to a greedy strategy:
\[
a_t=\begin{cases} \arg\max_{a} J(S_t,a), & \text{with probability } 1-\varepsilon\\ \text{a random action}, & \text{with probability } \varepsilon\end{cases}
\]
wherein a_t is the action taken at the next moment. At the beginning the robot does not know which action to take and can only act randomly, but after accumulating some experience it looks for the action with the greatest reward. A reward R(t, a) is obtained from the action taken, and the next state S_t together with the updated reward value is stored in the experience pool; the experience pool is updated every 3000 steps, and the current value network is updated by a gradient descent algorithm, until the robot reaches the terminal state or exceeds the maximum episode length t_max of 200 actions. Otherwise, the target network is synchronized with the current value network, and after the set number of training iterations the value network J is obtained.
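A sketch of this training procedure in PyTorch is given below; the environment interface env, the feature-fusion callable fuse_observation, and the hyperparameters gamma, epsilon and the learning rate are assumptions, and the demonstration-experience initialization described above is omitted for brevity.

```python
import random
from collections import deque
import torch
import torch.nn as nn

def train_value_network(J, J_target, env, fuse_observation,
                        num_actions=4, episodes=1000, t_max=200,
                        capacity=50000, batch_size=200, sync_every=3000,
                        gamma=0.99, epsilon=0.1, lr=1e-4):
    """Sketch of the value-network training described above.
    Assumptions: `env` exposes reset()/step(a) and returns observations
    that fuse_observation turns into fusion feature vectors Q; gamma,
    epsilon and lr are not given in the description.
    Actions: forward, left, right, stop."""
    replay = deque(maxlen=capacity)
    optimizer = torch.optim.Adam(J.parameters(), lr=lr)
    step_count = 0

    for _ in range(episodes):
        q_t = fuse_observation(env.reset())          # fusion feature vector Q
        for _ in range(t_max):
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(num_actions)
            else:
                a = int(J(q_t.unsqueeze(0)).argmax())
            obs_next, r, done = env.step(a)
            q_next = fuse_observation(obs_next)
            replay.append((q_t, a, r, q_next, done))
            q_t = q_next
            step_count += 1

            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                qs, acts, rs, qns, dones = map(list, zip(*batch))
                qs, qns = torch.stack(qs), torch.stack(qns)
                acts = torch.tensor(acts)
                rs = torch.tensor(rs, dtype=torch.float32)
                dones = torch.tensor(dones, dtype=torch.float32)
                # temporal-difference target using the target network J'
                with torch.no_grad():
                    target = rs + gamma * (1 - dones) * J_target(qns).max(1).values
                pred = J(qs).gather(1, acts.unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(pred, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            if step_count % sync_every == 0:
                J_target.load_state_dict(J.state_dict())  # synchronize target
            if done:
                break
    return J
```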
The above embodiments are merely examples and do not limit the scope of the present invention. These embodiments may be implemented in other various manners, and various omissions, substitutions, and changes may be made without departing from the technical spirit of the present invention.

Claims (10)

1. A method for searching an intelligent agent target based on scene prior is characterized in that the method is used for searching the target of a robot and comprises the following steps:
s1: confirming target coding information and a target to be searched;
s2: acquiring an environment image of a scene to be searched through a robot, and constructing a depth image matrix and a semantic image matrix according to the environment image;
s3: carrying out object relation characteristic analysis on the environment image, identifying objects in the environment, confirming the object with the maximum possibility of relation with a target to be searched, and extracting an object relation characteristic vector;
s4: acquiring a spatial semantic point cloud according to the depth image matrix and the semantic image matrix, and constructing a spatial semantic fusion matrix according to the spatial semantic point cloud and object information in the environment;
s5: obtaining a semantic map feature vector according to the spatial semantic fusion matrix;
s6: generating a fusion feature vector according to the object relation feature vector, the semantic map feature vector and the target coding information;
s7: and constructing a value network and a target network of the target search, training the value network and the target network according to the fusion feature vector, and searching the target based on the trained value network after the training is finished.
2. The method for searching for an intelligent object based on scene priors as claimed in claim 1, wherein said step S2 specifically comprises:
s21: acquiring an environment image of a scene to be searched through a robot, wherein the environment image comprises an RGB (red, green and blue) image and a depth image of the environment;
s22: recording the depth image as a depth image matrix;
s23: and calculating the environment image by utilizing the pre-trained semantic segmentation network to generate a semantic image matrix.
3. The method for searching for an intelligent object based on scene priors as claimed in claim 1, wherein said step S3 specifically comprises:
s31: acquiring a scene graph G = { V, E }, wherein V is a graph node and represents different object types in a scene, E is a graph edge and represents the position relation between two types of objects, a visual genome data set is used as a source, a knowledge graph is constructed according to the types of all objects appearing in the scene to be searched, each type is represented as a node in the graph, edges are used for linking two nodes with object relation appearance frequency larger than 3 in the visual genome data set to generate a graph structure, and the graph structure is represented by a binary adjacency matrix A;
s32: and constructing a graph convolution neural network, inputting an RGB image of the environment image, outputting a spatial relationship characteristic, and mapping the spatial relationship characteristic to 512 dimensions to obtain an object relationship characteristic vector.
4. The method for intelligent agent target searching based on scene priors as claimed in claim 1, wherein the specific steps of step S4 include:
s41: generating an all-zero matrix of size (C+2) × 224 × 224, which represents the spatial semantic fusion matrix M, the spatial semantic fusion matrix comprising C+2 layers, each of size 224 × 224;
s42: generating a spatial point cloud taking into account the pose P(x_t, y_t, z_t, θ_t) of the robot;
s43: the size of the spatial semantic point cloud being C × W × L × H, wherein C is the number of channels of the spatial semantic point cloud, each channel representing a semantic category, and W, L, H are its width, length and height respectively; summing the three-dimensional point cloud over the height dimension and mapping it to two dimensions to obtain a two-dimensional mapping feature map of size C × W × L, which is used as the first C layers of the spatial semantic fusion matrix;
s44: recording the walking path of the robot in the (C+1)-th layer of the spatial semantic fusion matrix, and marking the object most likely to be related to the target to be searched in the (C+2)-th layer;
s45: and acquiring the latest space point cloud, the path and the object with the maximum possibility of relation with the target to be searched of the robot in real time, and updating the space semantic fusion matrix.
5. The method of claim 4, wherein the spatial point cloud is obtained by:
\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= R \, D(u,v) \begin{bmatrix} (u - c_x)/f_x \\ (v - c_y)/f_y \\ 1 \end{bmatrix} + T
\]
wherein x, y and z are the point cloud coordinates, f_x, f_y and c_x, c_y are the camera intrinsic parameters (focal lengths and principal point), S is the semantic image matrix, D is the depth image matrix, u and v are pixel coordinates in the semantic image matrix, and R and T are the rotation matrix and translation matrix of the robot respectively; from the pose P(x_t, y_t, z_t, θ_t) of the robot, the translation matrix and rotation matrix are respectively:
\[
T = \begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix}, \qquad
R = \begin{bmatrix} \cos\theta_t & -\sin\theta_t & 0 \\ \sin\theta_t & \cos\theta_t & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]
6. the method for searching for an intelligent object based on scene priors as claimed in claim 1, wherein said step S5 specifically comprises:
s51: carrying out normalization processing on the spatial semantic fusion matrix;
s52: and constructing a convolutional neural network, processing the space semantic fusion matrix as input by using the convolutional neural network, and outputting a semantic map feature vector.
7. The method of claim 6, wherein the convolutional neural network comprises a convolutional layer, a nonlinear activation layer, a data normalization layer, a max pooling network, a convolutional layer, a data normalization layer, and finally the output of the last data normalization layer is changed into a one-dimensional vector through matrix transformation, and then the result of the matrix transformation is converted into a semantic map feature vector by using linear transformation.
8. The method for searching for the intelligent agent target based on the scene prior as claimed in claim 1, wherein the step S6 specifically includes:
and splicing the object relation feature vector, the semantic map feature vector and the target coding information vector to generate a fusion feature vector.
9. The method for searching for the intelligent agent target based on the scene prior as claimed in claim 1, wherein the step S7 specifically includes:
s71: constructing a reward and punishment function:
\[
R(t,a)=\begin{cases} r^{+}, & \text{if the target category appears in } S \text{ and its computed distance is less than } 0.5\ \mathrm{m}\\ r^{-}, & \text{otherwise}\end{cases}
\]
wherein R(t, a) is the reward/punishment value, t denotes a moment at which the robot takes an action, and a denotes the action taken at that moment; when the target category appears in the semantic image matrix S of the robot and the computed distance between the robot and the target category is less than 0.5 m, the robot is considered to have found the target and receives the positive reward r^{+}, otherwise it receives r^{-};
s72: and inputting the fusion characteristic vector Q into a deep convolutional neural network with initial weight, simulating a navigation strategy of a human expert by a machine to obtain demonstration experience, storing the demonstration experience into an initialized experience pool, initializing a value network J by using random weight values, initializing a target network J' into a current value network, and cycling each event to obtain an optimal value network J.
10. The method of claim 9, wherein the training process of the value network in S72 specifically comprises:
training a value network by using a reinforcement learning time difference method, recording the value network J as a current value network, initializing the training times to be 0, designing experience playback capacity and sampling quantity, setting a target network J', initializing the random pose of the robot, setting the training times, and selecting actions under the current state according to a greedy strategy:
\[
a_t=\begin{cases} \arg\max_{a} J(S_t,a), & \text{with probability } 1-\varepsilon\\ \text{a random action}, & \text{with probability } \varepsilon\end{cases}
\]
wherein a_t is the action taken at the next moment; a reward R(t, a) is obtained from the action taken, and the next state S_t together with the updated reward value is stored in the experience pool; the experience pool is updated once every preset number of steps, and the current value network is updated by a gradient descent algorithm until the robot reaches the terminal state or exceeds the maximum episode length t_max of 200 actions; otherwise the target network is synchronized with the current value network, and after the set number of training iterations the value network J is obtained.
CN202210156851.XA 2022-02-21 2022-02-21 Intelligent agent target searching method based on scene prior Pending CN115311538A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210156851.XA CN115311538A (en) 2022-02-21 2022-02-21 Intelligent agent target searching method based on scene prior

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210156851.XA CN115311538A (en) 2022-02-21 2022-02-21 Intelligent agent target searching method based on scene prior

Publications (1)

Publication Number Publication Date
CN115311538A true CN115311538A (en) 2022-11-08

Family

ID=83854865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210156851.XA Pending CN115311538A (en) 2022-02-21 2022-02-21 Intelligent agent target searching method based on scene prior

Country Status (1)

Country Link
CN (1) CN115311538A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858698A (en) * 2023-02-22 2023-03-28 北京融信数联科技有限公司 Intelligent agent atlas analysis method, system and readable storage medium
CN115858698B (en) * 2023-02-22 2023-06-06 北京融信数联科技有限公司 Agent profile analysis method, system and readable storage medium
CN116499471A (en) * 2023-06-30 2023-07-28 华南理工大学 Visual language navigation method, device and medium based on open scene map
CN116499471B (en) * 2023-06-30 2023-09-12 华南理工大学 Visual language navigation method, device and medium based on open scene map

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination