CN115311538A - Intelligent agent target searching method based on scene prior - Google Patents

Intelligent agent target searching method based on scene prior

Info

Publication number
CN115311538A
Authority
CN
China
Prior art keywords
matrix
semantic
target
robot
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210156851.XA
Other languages
Chinese (zh)
Inventor
赵怀林
陆升阳
梁兰军
侯煊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Technology
Original Assignee
Shanghai Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Technology filed Critical Shanghai Institute of Technology
Priority to CN202210156851.XA priority Critical patent/CN115311538A/en
Publication of CN115311538A publication Critical patent/CN115311538A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/00 Scenes; Scene-specific elements
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks


Abstract

The invention relates to a scene-prior-based target searching method for an intelligent agent, used for robot target search, which comprises the following steps: confirming the target coding information and the target to be searched; acquiring an environment image of the scene to be searched with the robot and constructing a depth image matrix and a semantic image matrix from the environment image; extracting an object relation feature vector; constructing a spatial semantic fusion matrix; obtaining a semantic map feature vector from the spatial semantic fusion matrix; generating a fusion feature vector from the object relation feature vector, the semantic map feature vector and the target coding information; and training the value network and the target network with the fusion feature vector, then searching for the target with the trained value network once training is finished. Compared with the prior art, the method offers high navigation precision and high search accuracy and efficiency.

Description

Intelligent agent target searching method based on scene prior
Technical Field
The invention relates to the field of active visual perception, in particular to an intelligent agent target searching method based on scene prior.
Background
In recent years, robotics research has been devoted to expanding the ability of robots to explore the environment, understand it, interact with it, and communicate with people. Traditional navigation methods generally rely on an environment map and divide the navigation task into three steps: mapping, localization, and path planning. This approach typically requires a 3D map to be built in advance, together with reliable localization on that map and path tracking. In some cases, however, artificial landmarks are unknown or the robot operates in a GPS-denied environment, which makes self-motion estimation and the acquisition of scene information very difficult. For a long time, robot navigation has mainly been addressed with distance sensors such as lidar, infrared, or sonar, which are suited to small-scale static environments (each kind of distance sensor is limited by its own physical properties). In dynamic, complex, and large-scale environments, however, robot mapping and navigation face many challenges.
More recently, the success of data-driven machine learning strategies on a variety of control and perception problems has opened a new way to overcome the limitations of the earlier approaches. These methods are widely studied because they do not require map construction, depend little on the environment, and allow human-computer interaction. Their key idea is to learn an end-to-end mapping directly from raw observations to the actions of the task. Such methods exploit navigation experience acquired earlier in new but similar environments, with or without maps. Reinforcement learning is commonly used for visual navigation; however, it still suffers from poor generalization, low navigation efficiency, and limited accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an intelligent agent target searching method based on scene prior.
The purpose of the invention can be realized by the following technical scheme:
a method for searching an intelligent agent target based on scene prior is used for searching a target of a robot, and comprises the following steps:
s1: confirming target coding information and a target to be searched;
s2: acquiring an environment image of a scene to be searched through a robot, and constructing a depth image matrix and a semantic image matrix according to the environment image;
s3: carrying out object relation characteristic analysis on the environment image, identifying objects in the environment, confirming the object with the maximum possibility of relation with a target to be searched, and extracting an object relation characteristic vector;
s4: acquiring a spatial semantic point cloud according to the depth image matrix and the semantic image matrix, and constructing a spatial semantic fusion matrix according to the spatial semantic point cloud and object information in the environment;
s5: obtaining a semantic map feature vector according to the spatial semantic fusion matrix;
s6: generating a fusion feature vector according to the object relation feature vector, the semantic map feature vector and the target coding information;
s7: and training the value network and the target network according to the fusion feature vector, and searching the target based on the trained value network after the training is finished.
Preferably, the step S2 specifically includes:
s21: acquiring an environment image of a scene to be searched through a robot, wherein the environment image comprises an RGB (red, green and blue) image and a depth image of the environment;
s22: recording the depth image as a depth image matrix;
s23: and calculating the environment image by utilizing the pre-trained semantic segmentation network to generate a semantic image matrix.
Preferably, the step S3 specifically includes:
s31: acquiring a scene graph G = { V, E }, wherein V is a graph node and represents different object types in a scene, E is a graph edge and represents the position relation between two types of objects, a visual genome data set is used as a source, a knowledge graph is constructed according to the types of all objects appearing in the scene to be searched, each type is represented as a node in the graph, edges are used for linking two nodes with object relation appearance frequency larger than 3 in the visual genome data set to generate a graph structure, and the graph structure is represented by a binary adjacency matrix A;
s32: and constructing a graph convolution neural network, inputting an RGB image of the environment image, outputting a spatial relationship characteristic, and mapping the spatial relationship characteristic to 512 dimensions to obtain an object relationship characteristic vector.
Preferably, the specific step of step S4 includes:
s41: generating an all-zero matrix of size (C+2) × 224 × 224, which represents the spatial semantic fusion matrix M; the spatial semantic fusion matrix comprises C+2 layers, each of size 224 × 224;
s42: generating a spatial point cloud taking into account the pose P(x_t, y_t, z_t, θ_t) of the robot;
s43: the size of the spatial semantic point cloud is C × W × L × H, wherein C is the number of channels of the spatial semantic point cloud, each channel representing a semantic category, and W, L, H are its width, length and height respectively; the three-dimensional point cloud is summed over the height dimension and mapped to two dimensions, giving a two-dimensional mapping feature map of size C × W × L that is used as the first C layers of the spatial semantic fusion matrix;
s44: recording the walking path of the robot in the (C+1)-th layer of the spatial semantic fusion matrix, and marking the object most likely to be related to the target to be searched in the (C+2)-th layer.
Preferably, the spatial point cloud is obtained in the following manner:
\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= R \, D(u,v) \begin{bmatrix} (u - c_x)/f_x \\ (v - c_y)/f_y \\ 1 \end{bmatrix} + T
\]
wherein x, y and z are the point cloud coordinates, f_x, f_y and c_x, c_y are the camera intrinsic parameters (focal lengths and principal point), S is the semantic image matrix, D is the depth image matrix, u and v are pixel coordinates in the semantic image matrix, and R and T are the rotation matrix and translation matrix of the robot respectively; from the pose P(x_t, y_t, z_t, θ_t) of the robot, the translation matrix and rotation matrix are respectively:
\[
T = \begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix}, \qquad
R = \begin{bmatrix} \cos\theta_t & -\sin\theta_t & 0 \\ \sin\theta_t & \cos\theta_t & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]
preferably, the step S5 specifically includes:
s51: carrying out normalization processing on the spatial semantic fusion matrix;
s52: and constructing a convolutional neural network, processing the space semantic fusion matrix as input by using the convolutional neural network, and outputting a semantic map feature vector.
Preferably, the convolutional neural network comprises a convolutional layer, a nonlinear activation layer, a data normalization layer, a maximum pooling network, a convolutional layer, a data normalization layer, a maximum pooling network, a convolutional layer, and a data normalization layer which are connected in sequence, and finally, the output of the last data normalization layer is changed into a one-dimensional vector through matrix transformation, and then, the result of the matrix transformation is converted into a semantic map feature vector by utilizing linear transformation.
Preferably, the step S6 specifically includes:
and splicing the object relation feature vector, the semantic map feature vector and the target coding information vector to generate a fusion feature vector.
Preferably, the step S7 specifically includes:
s71: constructing a reward and punishment function:
\[
R(t,a)=\begin{cases} r^{+}, & \text{if the target category appears in } S \text{ and its computed distance is less than } 0.5\ \mathrm{m}\\ r^{-}, & \text{otherwise}\end{cases}
\]
wherein R(t, a) is the reward/punishment value, t denotes a moment at which the robot takes an action, and a denotes the action taken at that moment; when the target category appears in the semantic image matrix S of the robot and the computed distance between the robot and the target category is less than 0.5 m, the robot is considered to have found the target and receives the positive reward r^{+}, otherwise it receives r^{-};
s72: inputting the fusion feature vector Q into a deep convolutional neural network with initial weights; the machine imitates the navigation strategy of a human expert to obtain demonstration experience and stores it in an initialized experience pool; the value network J is initialized with random weights and the target value network J' is initialized to the current value network; each episode is then iterated to obtain the optimal value network J.
Preferably, the training process of the value network in S72 specifically includes:
training a value network by using a reinforcement learning time difference method, recording the value network J as a current value network, initializing the training times to be 0, designing experience playback capacity and sampling quantity, setting a target network J', initializing the random pose of the robot, setting the training times, and selecting actions under the current state according to a greedy strategy:
\[
a_t=\begin{cases} \arg\max_{a} J(S_t,a), & \text{with probability } 1-\varepsilon\\ \text{a random action}, & \text{with probability } \varepsilon\end{cases}
\]
wherein a_t is the action taken at the next moment; a reward R(t, a) is obtained from the action taken, and the next state S_t together with the updated reward value is stored in the experience pool; the experience pool is updated once every preset number of steps, and the current value network is updated by a gradient descent algorithm until the robot reaches the terminal state or exceeds the maximum episode length t_max of 200 actions; otherwise the target network is synchronized with the current value network, and after the set number of training iterations the value network J is obtained.
Compared with the prior art, the invention has the following advantages:
1. the robot target searching method is based on the real environment, and an indoor map-free navigation target searching system based on the visual sensor is designed, so that the robot does not need to establish a map when searching for an object and completing task navigation, and map-free target searching and indoor navigation tasks can be realized.
2. In the current research of indoor map-free visual navigation, visual information is basically used as an input matrix, and reinforcement learning or imitation learning is directly utilized to train navigation. The method has low navigation success rate and long time, and partial networks are difficult to converge, thereby causing the result of training failure. The map-free navigation target searching system based on scene prior disclosed by the invention has the advantages that visual information forms a local semantic map, so that the training speed and the navigation precision are greatly improved.
3. The invention adopts scene prior, searches for objects by utilizing the scene prior exploration environment, and is beneficial to improving the accuracy and efficiency of target search.
Drawings
Fig. 1 is a block diagram of a hardware system according to the method of the present invention.
Fig. 2 is a general flow diagram of the present invention.
Fig. 3 is a schematic view of a scene graph of the present invention.
FIG. 4 is a schematic diagram of a convolutional neural network of the present invention
FIG. 5 is a flow chart of reinforcement learning according to the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. Note that the following description of the embodiments is merely a substantial example, and the present invention is not intended to be limited to the application or the use thereof, and is not limited to the following embodiments.
Examples
The invention relates to a scene-prior-based target searching method for an intelligent agent. An RGB image and a depth map of the environment are captured; the RGB image is processed by a trained semantic segmentation network to generate a semantic image, and a local semantic map is generated from the depth map and odometry information. Object relations that occur frequently in the Visual Genome data are recorded in a prior-knowledge matrix and added on top of the semantic map to generate a spatial-relation semantic map. A convolutional network is used to encode the spatial semantic map matrix as the data fusion matrix of the local environment. A reinforcement learning model is trained as the navigator: it takes the data fusion matrix as input, outputs one of the actions forward, left, right, or stop, and thereby controls the motion direction of the robot. The equipment used for the robot target search, as shown in fig. 1, mainly consists of a robot equipped with a camera sensor and a lidar, and a server; the robot transmits what the camera sensor "sees" to the server over WiFi, and the visual information is processed on the server. Specifically, as shown in fig. 2, the method comprises the following steps:
S1: confirming the target coding information and the target to be searched. A target coding network is established and used to encode each object in the scene. Specifically, in this embodiment a human-computer interaction interface in the form of a text box is set up; the user enters the name of the target to be searched into the text box, and the target is encoded after input. The encoder consists essentially of one LSTM, each layer containing 128 hidden units.
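As an illustration of this target encoder, the following is a minimal PyTorch sketch of an LSTM encoder whose layers have 128 hidden units; the vocabulary size, embedding dimension, and the token ids in the usage line are illustrative assumptions, since the description only specifies the LSTM and its 128 hidden units.

```python
import torch
import torch.nn as nn

class TargetEncoder(nn.Module):
    """Encodes the name of the target object entered in the text box.

    Sketch only: vocabulary, embedding size and tokenization are
    illustrative assumptions; the description only states that the
    encoder is an LSTM with 128 hidden units per layer.
    """
    def __init__(self, vocab_size=100, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer indices of the target name
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)          # h_n: (1, batch, 128)
        return h_n.squeeze(0)               # target coding vector, (batch, 128)

# Usage with an assumed token sequence for the entered target name
encoder = TargetEncoder()
target_code = encoder(torch.tensor([[12, 37, 5]]))  # assumed token ids
print(target_code.shape)  # torch.Size([1, 128])
```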
S2: the method comprises the steps of obtaining an environment image of a scene to be searched through a robot, and constructing a depth image matrix and a semantic image matrix according to the environment image.
Step S2 specifically includes: the robot captures an RGB image and a depth map of the environment with its camera sensor. The RGB image is called the environment image; it is a 3 × (w × h) image, where w and h are the image width and height, i.e. it comprises 3 layers each of size w × h. The depth image is a 1 × (w × h) image comprising 1 layer of size w × h, and it is recorded as the depth image matrix D. The environment image is then processed by the pre-trained semantic segmentation network Mask R-CNN to generate a semantic segmentation result matrix, which is recorded as the semantic image matrix S.
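A minimal sketch of this step is given below; it assumes torchvision's COCO-pretrained Mask R-CNN as a stand-in for the pre-trained semantic segmentation network, and the way instance masks are collapsed into a per-pixel category map is an illustrative choice.

```python
import numpy as np
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

def build_image_matrices(rgb, depth, score_thresh=0.5):
    """Builds the depth image matrix D and a semantic image matrix S
    from one RGB-D observation.

    rgb:   float32 array of shape (3, h, w), values in [0, 1]
    depth: float32 array of shape (h, w), in metres
    Sketch: a COCO-pretrained Mask R-CNN stands in for the pre-trained
    semantic segmentation network of the description.
    """
    model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
    with torch.no_grad():
        out = model([torch.from_numpy(rgb)])[0]

    h, w = depth.shape
    S = np.zeros((h, w), dtype=np.int64)            # 0 = background
    for mask, label, score in zip(out["masks"], out["labels"], out["scores"]):
        if score >= score_thresh:
            S[mask[0].numpy() > 0.5] = int(label)   # per-pixel category id
    D = depth.copy()                                # depth image matrix
    return D, S
```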
S3: and carrying out object relation characteristic analysis on the environment image, identifying objects in the environment, confirming the object with the highest possibility of relation with the target to be searched, and extracting an object relation characteristic vector.
As shown in fig. 4, step S3 mainly uses graph convolutional networks (GCNs) to extract the relation information between targets from visual information. The scene prior knowledge is expressed in the form of an undirected scene graph G = {V, E}: the nodes in V represent different object categories, and an edge in E represents the positional relation between two categories of objects. Using the Visual Genome dataset as the source, a knowledge graph is constructed from the categories of all objects that appear in the actual scene of this experiment, each category being represented as a node in the graph. An edge links two nodes when the corresponding object relation appears more than 3 times in the Visual Genome dataset; the resulting graph structure is represented by the binary adjacency matrix A.
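The construction of the binary adjacency matrix A can be sketched as follows; the relation_counts dictionary of Visual Genome co-occurrence counts is an assumed, precomputed input.

```python
import numpy as np

def build_adjacency(categories, relation_counts, min_count=3):
    """Builds the binary adjacency matrix A of the scene graph G = {V, E}.

    categories:      list of object category names appearing in the scene
    relation_counts: dict mapping (category_i, category_j) -> number of
                     times the relation appears in the Visual Genome
                     dataset (assumed to have been counted offline)
    An undirected edge is created when the count exceeds min_count.
    """
    n = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    A = np.zeros((n, n), dtype=np.float32)
    for (ci, cj), count in relation_counts.items():
        if ci in index and cj in index and count > min_count:
            A[index[ci], index[cj]] = 1.0
            A[index[cj], index[ci]] = 1.0
    return A
```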
A graph convolutional neural network is constructed whose input is the RGB image; the outputs of its layers form the spatial relation feature matrix Z = [z_1, z_2, ..., z_{|V|}]. In this embodiment there are 83 articles in the actual environment, so 83 nodes are designed, and all node features are collected into the feature matrix F = [F_1, F_2, ..., F_{|V|}]. Normalizing the matrix A gives the matrix
\[
\hat{A} = \tilde{D}^{-\frac{1}{2}} (A + I) \tilde{D}^{-\frac{1}{2}},
\]
where \tilde{D} is the degree matrix of A + I. It is thus possible to obtain
\[
H^{(\alpha+1)} = \sigma\!\left( \hat{A} \, H^{(\alpha)} W^{(\alpha)} \right), \qquad H^{(0)} = X, \quad H^{(\beta)} = Z,
\]
wherein W^{(α)} is the parameter of the α-th layer, σ is a nonlinear activation, and β is the total number of layers of the GCNs.
As shown in fig. 3, the first part of the pipeline is a pre-trained ResNet-34; its input is the RGB image and its output, the scores over the 1000 ImageNet object classes, serves as the image feature vector. For each node, the current image feature vector is mapped to a 512-dimensional feature vector, the name of each category is mapped to a 512-dimensional feature vector by word embedding, and the two feature vectors are concatenated to form a 1024-dimensional joint representation for each graph node. The input of the graph convolutional network is the adjacency matrix A together with the node feature vectors; the first two layers output 1024-dimensional latent features, and the last layer outputs one value per node, yielding a |V|-dimensional feature vector. This feature vector is the semantic encoding of the current scene and environmental context. Finally, it is mapped to a 512-dimensional object relation feature vector e.
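A minimal PyTorch sketch of this object-relation branch is shown below; the symmetric adjacency normalization and the ReLU activations are assumptions, and the ResNet-34 image features and word embeddings that form the 1024-dimensional node features are taken as given inputs.

```python
import torch
import torch.nn as nn

def normalize_adjacency(A):
    """A_hat = D^{-1/2} (A + I) D^{-1/2} (symmetric GCN normalization)."""
    A_tilde = A + torch.eye(A.shape[0])
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)

class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_hat, H):
        return torch.relu(A_hat @ self.W(H))

class ObjectRelationNet(nn.Module):
    """Sketch: 1024-d joint node features (image feature + word embedding)
    -> two 1024-d GCN layers -> one value per node -> |V|-dim vector
    mapped to the 512-d object relation feature e."""
    def __init__(self, num_nodes=83, node_dim=1024):
        super().__init__()
        self.gcn1 = GCNLayer(node_dim, 1024)
        self.gcn2 = GCNLayer(1024, 1024)
        self.gcn3 = GCNLayer(1024, 1)        # one value output per node
        self.to_e = nn.Linear(num_nodes, 512)

    def forward(self, A, node_features):
        A_hat = normalize_adjacency(A)
        h = self.gcn1(A_hat, node_features)
        h = self.gcn2(A_hat, h)
        v = self.gcn3(A_hat, h).squeeze(-1)  # (|V|,)
        return self.to_e(v)                  # object relation feature e, (512,)
```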
S4: and acquiring a spatial semantic point cloud according to the depth image matrix and the semantic image matrix, and constructing a spatial semantic fusion matrix according to the spatial semantic point cloud and object information in the environment.
The step specifically comprises the following. First, an all-zero matrix of size (C+2) × 224 × 224 is generated, representing the spatial semantic fusion matrix M; the matrix comprises C+2 layers, each of size 224 × 224. The position of the robot is then placed in the middle of the map, at (112, 112). Taking into account the pose P(x_t, y_t, z_t, θ_t) of the robot, the spatial point cloud is computed with the following formula:
\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= R \, D(u,v) \begin{bmatrix} (u - c_x)/f_x \\ (v - c_y)/f_y \\ 1 \end{bmatrix} + T
\]
wherein x, y and z are the point cloud coordinates, f_x, f_y and c_x, c_y are the camera intrinsic parameters (focal lengths and principal point), S is the semantic image matrix, D is the depth image matrix, u and v are pixel coordinates in the semantic image matrix, and R and T are the rotation matrix and translation matrix of the robot respectively; from the pose P(x_t, y_t, z_t, θ_t) of the robot, the translation matrix and rotation matrix are:
\[
T = \begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix}, \qquad
R = \begin{bmatrix} \cos\theta_t & -\sin\theta_t & 0 \\ \sin\theta_t & \cos\theta_t & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]
According to the above formula, the spatial semantic point cloud can be calculated from the semantic image matrix S and the depth image matrix D. The computed spatial semantic point cloud has size C × W × L × H, where C is the number of channels of the point cloud, each channel representing a semantic category, and W, L, H are its width, length and height respectively. Because the spatial semantic point cloud is too expensive to process directly, it is summed over the height dimension, mapping the three-dimensional point cloud to two dimensions and giving a two-dimensional mapping feature map of size C × W × L, which is used as the first C layers of the spatial semantic fusion matrix. The walking path of the robot is recorded in the (C+1)-th layer of the spatial semantic fusion matrix, and the object most likely to be related to the target to be searched is marked in the (C+2)-th layer. As the robot moves, the two-dimensional mapping feature map is added to the spatial semantic fusion matrix M at the corresponding positions, and the path and the most likely related object are updated, completing the update of the spatial semantic fusion matrix.
Further, the semantic two-dimensional mapping feature map makes clear which objects are present in the map; using the relations between objects in the Visual Genome dataset, the object most likely to be related to the target object is obtained by statistics and calculation, and the position corresponding to that object is highlighted in the (C+2)-th layer to serve as a marker.
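The projection of one observation into the spatial semantic fusion matrix M can be sketched as follows; the map resolution cell_size and the axis conventions are assumptions that are not fixed by the description above.

```python
import numpy as np

def update_fusion_matrix(M, D, S, pose, fx, fy, cx, cy, num_classes,
                         cell_size=0.05):
    """Adds one observation to the (C+2) x 224 x 224 spatial semantic
    fusion matrix M, whose centre cell (112, 112) is the robot start.

    Sketch under assumptions: cell_size (metres per map cell) and the
    axis conventions below are illustrative.
    D: depth image (h, w) in metres, S: per-pixel category ids (h, w),
    pose = (x_t, y_t, z_t, theta_t).
    """
    x_t, y_t, _, theta = pose
    h, w = D.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # back-project pixels: camera x points right, z points forward;
    # the height coordinate is dropped (summed out) for the 2-D map
    x_cam = (u - cx) * D / fx
    z_cam = D

    # rotate by the robot heading (yaw) and translate by its position
    x_w = np.cos(theta) * x_cam - np.sin(theta) * z_cam + x_t
    y_w = np.sin(theta) * x_cam + np.cos(theta) * z_cam + y_t

    rows = np.clip((y_w / cell_size).astype(int) + 112, 0, 223)
    cols = np.clip((x_w / cell_size).astype(int) + 112, 0, 223)
    valid = (D > 0) & (S > 0)

    # first C layers: one occupancy layer per semantic category
    np.add.at(M, (S[valid] - 1, rows[valid], cols[valid]), 1.0)
    # layer C: the walking path of the robot (layer C+1 marks the
    # object most likely related to the target, set elsewhere)
    pr = int(np.clip(y_t / cell_size + 112, 0, 223))
    pc = int(np.clip(x_t / cell_size + 112, 0, 223))
    M[num_classes, pr, pc] = 1.0
    return M
```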
S5: obtaining a semantic map feature vector according to the spatial semantic map fusion matrix;
the method specifically comprises the following steps: s51: carrying out normalization processing on the spatial semantic fusion matrix;
s52: and constructing a convolutional neural network, processing the space semantic fusion matrix as input by using the convolutional neural network, and outputting a semantic map feature vector.
Convolutional neural networks are widely used for extracting image features because they require no image pre-processing and can extract additional features; their shared convolution kernels allow high-dimensional data to be processed, and as the network becomes deeper they can extract increasingly deep information from the image. Therefore, in this step a convolutional neural network is adopted to perform feature processing on the information acquired by the camera sensor.
(1) In this step, the spatial semantic fusion matrix M is first normalized; the normalization can be expressed by the following formula:
\[
x_i^{*} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}
\]
where x_i^{*} is a value of the normalized matrix M, x_i is the corresponding value of the original matrix M, x_{\min} is the minimum value of the matrix M, and x_{\max} is the maximum value of the matrix M. The spatial semantic fusion matrix M is normalized by this formula.
(2) The method for constructing the convolutional neural network specifically comprises the following steps:
setting a first layer of the convolutional neural network as a convolutional layer, wherein a convolutional kernel of the convolutional layer is a matrix of 3*3, and the number of channels is 64; the input of the convolution layer is the spatial semantic fusion matrix M after the normalization processing in the previous step; the second layer of the convolutional neural network is a nonlinear activation layer, the nonlinear activation function is a relu function, the output of the convolutional layer is used as the input of the layer, and the nonlinearity of the network is increased. The third layer of the convolutional neural network is a data normalization layer, the input of the layer is the output of the nonlinear activation layer, and the input is normalized by the following formula:
\[
\hat{x}_{v1}^{(k)} = \frac{x_{v1}^{(k)} - E\!\left[x_{v1}^{(k)}\right]}{\sqrt{\operatorname{Var}\!\left[x_{v1}^{(k)}\right]}}
\]
wherein \hat{x}_{v1}^{(k)} is the output of the normalization layer, x_{v1}^{(k)} is the output of the nonlinear activation layer for channel k (i.e. the output of the k-th channel is x_{v1}^{(k)}), E[x_{v1}^{(k)}] is the mean of x_{v1}^{(k)}, and Var[x_{v1}^{(k)}] is its variance.
The fourth layer of the convolutional neural network is a maximum pooling layer with a 2×2 kernel. The fifth layer is a convolutional layer with a 3×3 kernel and 64 channels, whose input is the output of the fourth-layer maximum pooling. The sixth layer is a nonlinear activation layer using the relu function, which takes the output of the convolutional layer as input and increases the nonlinearity of the network. The seventh layer is a data normalization layer whose input is the output of the nonlinear activation layer. The eighth layer is a maximum pooling layer with a 2×2 kernel. The ninth layer is a convolutional layer with a 3×3 kernel and 128 channels, whose input is the output of the maximum pooling layer. The tenth layer is a nonlinear activation layer using the relu function. The eleventh layer is a data normalization layer whose input is the output of the nonlinear activation layer. The twelfth layer is a maximum pooling layer with a 2×2 kernel. The thirteenth layer is a convolutional layer with a 3×3 kernel and 512 channels, whose input is the output of the maximum pooling layer. The fourteenth layer is a data normalization layer whose input is the output of the thirteenth layer. Finally, the output of this data normalization layer is reshaped into a one-dimensional vector by matrix transformation, and a linear transformation converts it into the 1 × 128 semantic map feature vector f.
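A PyTorch sketch of the fourteen-layer map-encoding network described above is given below; the padding of the 3×3 convolutions and the lazily inferred size of the final linear layer are assumptions.

```python
import torch
import torch.nn as nn

def make_map_encoder(in_channels):
    """Sketch of the map-encoding CNN described above.
    Assumptions: 3x3 convolutions use padding=1, and the final linear
    layer size is inferred lazily; neither is stated explicitly."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, 3, padding=1),   # layer 1: conv, 64 channels
        nn.ReLU(),                                  # layer 2: nonlinear activation
        nn.BatchNorm2d(64),                         # layer 3: data normalization
        nn.MaxPool2d(2),                            # layer 4: max pooling
        nn.Conv2d(64, 64, 3, padding=1),            # layer 5
        nn.ReLU(),                                  # layer 6
        nn.BatchNorm2d(64),                         # layer 7
        nn.MaxPool2d(2),                            # layer 8
        nn.Conv2d(64, 128, 3, padding=1),           # layer 9
        nn.ReLU(),                                  # layer 10
        nn.BatchNorm2d(128),                        # layer 11
        nn.MaxPool2d(2),                            # layer 12
        nn.Conv2d(128, 512, 3, padding=1),          # layer 13
        nn.BatchNorm2d(512),                        # layer 14
        nn.Flatten(),                               # matrix transformation to 1-D
        nn.LazyLinear(128),                         # linear map to the 128-d feature f
    )

# Usage: encode a normalized (C+2) x 224 x 224 fusion matrix, e.g. C = 83
encoder = make_map_encoder(85)
f = encoder(torch.zeros(1, 85, 224, 224))  # semantic map feature vector, (1, 128)
```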
S6: and generating a fusion feature vector according to the object relation feature vector, the semantic map feature vector and the target coding information. Specifically, in this embodiment, the feature vectors e and f and the target encoding information are spliced to generate the fusion feature vector Q.
Step S7, a value network model is trained by adopting a deep convolutional neural network in deep learning and a time difference method in reinforcement learning, so that the target search and navigation of the robot are realized, and the method specifically comprises the following steps:
s71: constructing a reward and punishment function:
\[
R(t,a)=\begin{cases} r^{+}, & \text{if the target category appears in } S \text{ and its computed distance is less than } 0.5\ \mathrm{m}\\ r^{-}, & \text{otherwise}\end{cases}
\]
wherein R(t, a) is the reward/punishment value, t denotes a moment at which the robot takes an action, and a denotes the action taken at that moment; when the target category appears in the semantic image matrix S of the robot and the computed distance between the robot and the target category is less than 0.5 m, the robot is considered to have found the target and receives the positive reward r^{+}, otherwise it receives r^{-};
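A minimal sketch of this reward-and-punishment function follows; the success condition matches the definition above, while the numerical values r_success and r_step are assumptions.

```python
def reward(target_visible_in_S, distance_to_target,
           r_success=10.0, r_step=-0.01):
    """R(t, a): reward for the action a taken at time t.

    The success condition follows the description above; the reward
    magnitudes r_success and r_step are illustrative assumptions.
    """
    if target_visible_in_S and distance_to_target < 0.5:
        return r_success
    return r_step
```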
s72: inputting the fusion feature vector Q into a deep convolutional neural network with initial weights; the machine imitates the navigation strategy of a human expert to obtain demonstration experience and stores it in an initialized experience pool; the value network J is initialized with random weights and the target value network J' is initialized to the current value network; each episode is then iterated to obtain the optimal value network J.
In this embodiment, as shown in fig. 5, the training process of the value network in S72 specifically includes:
The value network is trained with the temporal-difference method of reinforcement learning. The value network J is recorded as the current value network, the number of training iterations is initialized to 0, the experience replay capacity is set to 50000 and the sample size to 200, the target network J' is set up, the robot is initialized with a random pose, the total number of training iterations is set to 1,000,000, and the action in the current state is selected according to a greedy strategy:
\[
a_t=\begin{cases} \arg\max_{a} J(S_t,a), & \text{with probability } 1-\varepsilon\\ \text{a random action}, & \text{with probability } \varepsilon\end{cases}
\]
wherein a_t is the action taken at the next moment. At the beginning the robot does not know which action to take and can only act randomly, but after accumulating some experience it looks for the action with the greatest reward. A reward R(t, a) is obtained from the action taken, and the next state S_t together with the updated reward value is stored in the experience pool; the experience pool is updated every 3000 steps, and the current value network is updated by a gradient descent algorithm, until the robot reaches the terminal state or exceeds the maximum episode length t_max of 200 actions. Otherwise, the target network is synchronized with the current value network, and after the set number of training iterations the value network J is obtained.
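A sketch of this training procedure in PyTorch is given below; the environment interface env, the feature-fusion callable fuse_observation, and the hyperparameters gamma, epsilon and the learning rate are assumptions, and the demonstration-experience initialization described above is omitted for brevity.

```python
import random
from collections import deque
import torch
import torch.nn as nn

def train_value_network(J, J_target, env, fuse_observation,
                        num_actions=4, episodes=1000, t_max=200,
                        capacity=50000, batch_size=200, sync_every=3000,
                        gamma=0.99, epsilon=0.1, lr=1e-4):
    """Sketch of the value-network training described above.
    Assumptions: `env` exposes reset()/step(a) and returns observations
    that fuse_observation turns into fusion feature vectors Q; gamma,
    epsilon and lr are not given in the description.
    Actions: forward, left, right, stop."""
    replay = deque(maxlen=capacity)
    optimizer = torch.optim.Adam(J.parameters(), lr=lr)
    step_count = 0

    for _ in range(episodes):
        q_t = fuse_observation(env.reset())          # fusion feature vector Q
        for _ in range(t_max):
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(num_actions)
            else:
                a = int(J(q_t.unsqueeze(0)).argmax())
            obs_next, r, done = env.step(a)
            q_next = fuse_observation(obs_next)
            replay.append((q_t, a, r, q_next, done))
            q_t = q_next
            step_count += 1

            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                qs, acts, rs, qns, dones = map(list, zip(*batch))
                qs, qns = torch.stack(qs), torch.stack(qns)
                acts = torch.tensor(acts)
                rs = torch.tensor(rs, dtype=torch.float32)
                dones = torch.tensor(dones, dtype=torch.float32)
                # temporal-difference target using the target network J'
                with torch.no_grad():
                    target = rs + gamma * (1 - dones) * J_target(qns).max(1).values
                pred = J(qs).gather(1, acts.unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(pred, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            if step_count % sync_every == 0:
                J_target.load_state_dict(J.state_dict())  # synchronize target
            if done:
                break
    return J
```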
The above embodiments are merely examples and do not limit the scope of the present invention. These embodiments may be implemented in other various manners, and various omissions, substitutions, and changes may be made without departing from the technical spirit of the present invention.

Claims (10)

1. A method for searching an intelligent agent target based on scene prior is characterized in that the method is used for searching the target of a robot and comprises the following steps:
s1: confirming target coding information and a target to be searched;
s2: acquiring an environment image of a scene to be searched through a robot, and constructing a depth image matrix and a semantic image matrix according to the environment image;
s3: carrying out object relation characteristic analysis on the environment image, identifying objects in the environment, confirming the object with the maximum possibility of relation with a target to be searched, and extracting an object relation characteristic vector;
s4: acquiring a spatial semantic point cloud according to the depth image matrix and the semantic image matrix, and constructing a spatial semantic fusion matrix according to the spatial semantic point cloud and object information in the environment;
s5: obtaining a semantic map feature vector according to the spatial semantic fusion matrix;
s6: generating a fusion feature vector according to the object relation feature vector, the semantic map feature vector and the target coding information;
s7: and constructing a value network and a target network of the target search, training the value network and the target network according to the fusion feature vector, and searching the target based on the trained value network after the training is finished.
2. The method for searching for an intelligent object based on scene priors as claimed in claim 1, wherein said step S2 specifically comprises:
s21: acquiring an environment image of a scene to be searched through a robot, wherein the environment image comprises an RGB (red, green and blue) image and a depth image of the environment;
s22: recording the depth image as a depth image matrix;
s23: and calculating the environment image by utilizing the pre-trained semantic segmentation network to generate a semantic image matrix.
3. The method for searching for an intelligent object based on scene priors as claimed in claim 1, wherein said step S3 specifically comprises:
s31: acquiring a scene graph G = { V, E }, wherein V is a graph node and represents different object types in a scene, E is a graph edge and represents the position relation between two types of objects, a visual genome data set is used as a source, a knowledge graph is constructed according to the types of all objects appearing in the scene to be searched, each type is represented as a node in the graph, edges are used for linking two nodes with object relation appearance frequency larger than 3 in the visual genome data set to generate a graph structure, and the graph structure is represented by a binary adjacency matrix A;
s32: and constructing a graph convolution neural network, inputting an RGB image of the environment image, outputting a spatial relationship characteristic, and mapping the spatial relationship characteristic to 512 dimensions to obtain an object relationship characteristic vector.
4. The method for intelligent agent target searching based on scene priors as claimed in claim 1, wherein the specific steps of step S4 include:
s41: generating an all-zero matrix of size (C+2) × 224 × 224, which represents the spatial semantic fusion matrix M, the spatial semantic fusion matrix comprising C+2 layers, each of size 224 × 224;
s42: generating a spatial point cloud taking into account the pose P(x_t, y_t, z_t, θ_t) of the robot;
s43: the size of the spatial semantic point cloud being C × W × L × H, wherein C is the number of channels of the spatial semantic point cloud, each channel representing a semantic category, and W, L, H are its width, length and height respectively; summing the three-dimensional point cloud over the height dimension and mapping it to two dimensions to obtain a two-dimensional mapping feature map of size C × W × L, which is used as the first C layers of the spatial semantic fusion matrix;
s44: recording the walking path of the robot in the (C+1)-th layer of the spatial semantic fusion matrix, and marking the object most likely to be related to the target to be searched in the (C+2)-th layer;
s45: and acquiring the latest space point cloud, the path and the object with the maximum possibility of relation with the target to be searched of the robot in real time, and updating the space semantic fusion matrix.
5. The method of claim 4, wherein the spatial point cloud is obtained by:
\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= R \, D(u,v) \begin{bmatrix} (u - c_x)/f_x \\ (v - c_y)/f_y \\ 1 \end{bmatrix} + T
\]
wherein x, y and z are the point cloud coordinates, f_x, f_y and c_x, c_y are the camera intrinsic parameters (focal lengths and principal point), S is the semantic image matrix, D is the depth image matrix, u and v are pixel coordinates in the semantic image matrix, and R and T are the rotation matrix and translation matrix of the robot respectively; from the pose P(x_t, y_t, z_t, θ_t) of the robot, the translation matrix and rotation matrix are respectively:
\[
T = \begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix}, \qquad
R = \begin{bmatrix} \cos\theta_t & -\sin\theta_t & 0 \\ \sin\theta_t & \cos\theta_t & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]
6. the method for searching for an intelligent object based on scene priors as claimed in claim 1, wherein said step S5 specifically comprises:
s51: carrying out normalization processing on the spatial semantic fusion matrix;
s52: and constructing a convolutional neural network, processing the space semantic fusion matrix as input by using the convolutional neural network, and outputting a semantic map feature vector.
7. The method of claim 6, wherein the convolutional neural network comprises a convolutional layer, a nonlinear activation layer, a data normalization layer, a max pooling network, a convolutional layer, a data normalization layer, and finally the output of the last data normalization layer is changed into a one-dimensional vector through matrix transformation, and then the result of the matrix transformation is converted into a semantic map feature vector by using linear transformation.
8. The method for searching for the intelligent agent target based on the scene prior as claimed in claim 1, wherein the step S6 specifically includes:
and splicing the object relation feature vector, the semantic map feature vector and the target coding information vector to generate a fusion feature vector.
9. The method for searching for the intelligent agent target based on the scene prior as claimed in claim 1, wherein the step S7 specifically includes:
s71: constructing a reward and punishment function:
\[
R(t,a)=\begin{cases} r^{+}, & \text{if the target category appears in } S \text{ and its computed distance is less than } 0.5\ \mathrm{m}\\ r^{-}, & \text{otherwise}\end{cases}
\]
wherein R(t, a) is the reward/punishment value, t denotes a moment at which the robot takes an action, and a denotes the action taken at that moment; when the target category appears in the semantic image matrix S of the robot and the computed distance between the robot and the target category is less than 0.5 m, the robot is considered to have found the target and receives the positive reward r^{+}, otherwise it receives r^{-};
s72: and inputting the fusion characteristic vector Q into a deep convolutional neural network with initial weight, simulating a navigation strategy of a human expert by a machine to obtain demonstration experience, storing the demonstration experience into an initialized experience pool, initializing a value network J by using random weight values, initializing a target network J' into a current value network, and cycling each event to obtain an optimal value network J.
10. The method of claim 9, wherein the training process of the value network in S72 specifically comprises:
training a value network by using a reinforcement learning time difference method, recording the value network J as a current value network, initializing the training times to be 0, designing experience playback capacity and sampling quantity, setting a target network J', initializing the random pose of the robot, setting the training times, and selecting actions under the current state according to a greedy strategy:
\[
a_t=\begin{cases} \arg\max_{a} J(S_t,a), & \text{with probability } 1-\varepsilon\\ \text{a random action}, & \text{with probability } \varepsilon\end{cases}
\]
wherein a_t is the action taken at the next moment; a reward R(t, a) is obtained from the action taken, and the next state S_t together with the updated reward value is stored in the experience pool; the experience pool is updated once every preset number of steps, and the current value network is updated by a gradient descent algorithm until the robot reaches the terminal state or exceeds the maximum episode length t_max of 200 actions; otherwise the target network is synchronized with the current value network, and after the set number of training iterations the value network J is obtained.
CN202210156851.XA 2022-02-21 2022-02-21 Intelligent agent target searching method based on scene prior Pending CN115311538A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210156851.XA CN115311538A (en) 2022-02-21 2022-02-21 Intelligent agent target searching method based on scene prior

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210156851.XA CN115311538A (en) 2022-02-21 2022-02-21 Intelligent agent target searching method based on scene prior

Publications (1)

Publication Number Publication Date
CN115311538A true CN115311538A (en) 2022-11-08

Family

ID=83854865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210156851.XA Pending CN115311538A (en) 2022-02-21 2022-02-21 Intelligent agent target searching method based on scene prior

Country Status (1)

Country Link
CN (1) CN115311538A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858698A (en) * 2023-02-22 2023-03-28 北京融信数联科技有限公司 Intelligent agent atlas analysis method, system and readable storage medium
CN115858698B (en) * 2023-02-22 2023-06-06 北京融信数联科技有限公司 Agent profile analysis method, system and readable storage medium
CN116499471A (en) * 2023-06-30 2023-07-28 华南理工大学 Visual language navigation method, device and medium based on open scene map
CN116499471B (en) * 2023-06-30 2023-09-12 华南理工大学 Visual language navigation method, device and medium based on open scene map

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination