CN114663880A - Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism - Google Patents

Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism

Info

Publication number
CN114663880A
CN114663880A (application CN202210253116.0A)
Authority
CN
China
Prior art keywords
dimensional
rgb
depth
target detection
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210253116.0A
Other languages
Chinese (zh)
Inventor
曹原周汉
李浥东
张慧
郎丛妍
陈乃月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202210253116.0A priority Critical patent/CN114663880A/en
Publication of CN114663880A publication Critical patent/CN114663880A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The invention provides a three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism. The method comprises: constructing a training set and a test set from RGB image data; constructing a three-dimensional target detection model comprising an RGB backbone network, a depth backbone network, a classifier and a regressor; training the three-dimensional target detection model with the training set and test set data and verifying the training effect with the test set to obtain a trained three-dimensional target detection model; and detecting three-dimensional targets in RGB images with the trained model. The method acquires depth structure information over the global scene from the depth feature map and organically combines it with appearance information to improve the accuracy of the three-dimensional target detection algorithm, so that the category, position, size, posture and other information of three-dimensional objects in a two-dimensional RGB image are effectively detected.

Description

Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
Technical Field
The invention relates to the technical field of target detection, in particular to a three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism.
Background
Three-dimensional target detection is an important branch of computer vision and has strong application value in many scenarios such as intelligent transportation, robot vision, three-dimensional reconstruction, virtual reality and augmented reality. The purpose of three-dimensional target detection is to recover information such as the class, position, depth, size and pose of an object in three-dimensional space. According to the type of data processed, three-dimensional target detection techniques fall mainly into two categories: detection from two-dimensional images and detection from point cloud data.
Imaging a three-dimensional object maps points in three-dimensional space onto a two-dimensional plane, losing the depth information in the process. Detecting an object in three-dimensional space inevitably requires this lost depth information, which is one of the main differences between three-dimensional and two-dimensional target detection and is the difficult point of three-dimensional target detection. Three-dimensional target detection methods based on two-dimensional images recover depth information directly from the two-dimensional image and then detect the three-dimensional target. Acquiring this depth information mainly depends on several constraint conditions, such as geometric constraints of the three-dimensional scene and the shape and semantic constraints of the three-dimensional object. Because the depth information contained in a two-dimensional image is limited, and the constraint conditions are strongly restricted by scenes and objects, the accuracy achieved by such three-dimensional target detection methods is low.
A point cloud is the set of points in three-dimensional space corresponding to the pixels of a two-dimensional image, and three-dimensional target detection based on point cloud data can acquire depth information by processing the point cloud; such methods can be further divided into two types. The first directly processes point cloud data in three-dimensional space, lifting the pixel-level operations of two-dimensional target detection to three dimensions so that the point cloud is processed directly. Because of the increased operation dimensionality, this approach has high computational complexity, and noise in the point cloud directly affects the detection accuracy of the algorithm. The second trains a depth prediction model from point cloud data, obtains a two-dimensional depth image through this model, and then extracts depth information from the two-dimensional depth image for three-dimensional target detection. Such algorithms do not operate on the point cloud data directly but reduce it to a two-dimensional depth map, which lowers the operation complexity; meanwhile, the depth prediction model can remove part of the point cloud noise, so this approach is widely used in practical applications.
One three-dimensional target detection method in the prior art is as follows: after the two-dimensional depth map is obtained, because the depth prediction model trained on point cloud data has the ability to acquire depth information, the method further trains a three-dimensional target detection model on two-dimensional RGB images on the basis of the depth prediction model. The disadvantage of this method is that, for a three-dimensional target detection task, the category and position information of the target are acquired from a two-dimensional image or video frame, so it is not necessary to directly process three-dimensional point cloud data, and the point cloud data usually contains a large amount of noise.
Another three-dimensional target detection method in the prior art is as follows: the two-dimensional depth image is used as an independent model input, depth information is obtained from the depth image through an additional model, and this depth information is combined with the two-dimensional RGB image input to carry out three-dimensional target detection. The disadvantage of this method is that the depth information obtainable from a two-dimensional image is very limited, and geometric constraints are inevitably used in the process of obtaining it, so the detection accuracy of the algorithm is poor.
Disclosure of Invention
The embodiment of the invention provides a three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism, so as to effectively detect the type, position and posture of a three-dimensional object in a two-dimensional RGB image.
In order to achieve the purpose, the invention adopts the following technical scheme.
A three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism comprises the following steps:
constructing training set and test set data by using RGB image data;
constructing a three-dimensional target detection model, wherein the three-dimensional target detection model comprises an RGB (red, green and blue) trunk network, a depth trunk network, a classifier and a regressor;
training the three-dimensional target detection model by using the training set and the test set data, verifying the training effect of the three-dimensional target detection model by using the test set, respectively acquiring RGB features and depth features by using the RGB backbone network and the depth backbone network, inputting the RGB features and the depth features into a cross-modal self-attention learning module, updating the RGB features, and learning a classifier and a regressor by using the updated RGB features to obtain the trained three-dimensional target detection model;
and detecting the category, the position and the posture of the three-dimensional object in the two-dimensional RGB image to be recognized by utilizing the classifier and the regressor in the trained three-dimensional target detection model.
Preferably, the constructing training set and test set data by using RGB image data includes:
collecting RGB images and dividing them into a training set and a test set at a ratio of approximately 1:1, performing normalization processing on the image data in the training set and the test set, acquiring two-dimensional depth images of the training set images through a depth estimation algorithm, labeling the category of each object in the training set images, and labeling the coordinates of the two-dimensional detection frame and the center position, size and corner angle of the three-dimensional detection frame of each image.
Preferably, the RGB backbone network, the depth backbone network, the classifier and the regressor in the three-dimensional target detection model each comprise convolution layers, fully-connected layers and normalization layers, and the RGB backbone network and the depth backbone network have the same structure, each comprising 4 convolution modules.
Preferably, the training of the three-dimensional target detection model using the training set and the test set data, in which the RGB backbone network and the depth backbone network respectively acquire RGB features and depth features, the RGB features and the depth features are input into a cross-modal self-attention learning module to update the RGB features, and the trained three-dimensional target detection model is obtained by using the updated RGB features to learn the classifier and the regressor, includes:
step S3-1: initializing the parameters of the convolution layers, fully-connected layers and normalization layers contained in the RGB backbone network, the depth backbone network, the classifier and the regressor of the three-dimensional target detection model;
step S3-2: setting the training parameters of the stochastic gradient descent algorithm, including the learning rate, momentum, batch size and number of iterations;
step S3-3: for any iteration batch, respectively inputting all RGB images and depth images into an RGB backbone network and a depth backbone network to obtain multi-level RGB features and depth features, constructing a cross-modal self-attention learning module, inputting the RGB features and the depth features into the cross-modal self-attention learning module, learning to obtain a self-attention matrix based on depth information, updating the RGB features through the self-attention matrix, learning a classifier and a regressor by utilizing the updated RGB features, and using the classifier and the regressor for target detection of a three-dimensional object in a two-dimensional RGB image,
and calculating the error between the network estimation values and the actual labeled values to obtain the objective function values, wherein three objective function values are respectively calculated by formulas (1), (2) and (3), formula (1) being the classification objective and formulas (2) and (3) being the regression objectives over the two-dimensional estimation frame and the three-dimensional estimation frame; in formula (1), s_i and p_i are respectively the label and the estimated probability of the i-th target class; formulas (2) and (3) respectively compare the two-dimensional estimation frame and the three-dimensional estimation frame of the i-th target with the actual labeled value gt; and N is the total number of targets;
step S3-4: adding the three objective function values to obtain a total objective function value, respectively calculating the partial derivatives with respect to all parameters in the three-dimensional target detection model, and updating the parameters by the stochastic gradient descent method;
step S3-5: repeating step S3-3 and step S3-4, continuously updating the parameters of the three-dimensional target detection model until convergence, and outputting the parameters of the trained three-dimensional target detection model.
Preferably, the inputting of the RGB features and the depth features into the cross-modal self-attention learning module, updating the RGB features, and learning a classifier and a regressor by using the updated RGB features to obtain the trained three-dimensional target detection model, includes:

for any two-dimensional RGB feature map R and two-dimensional depth feature map D, assuming that their dimensions are C × H × W, where C, H and W are respectively the number of channels, the height and the width, representing the two-dimensional RGB feature map R and the two-dimensional depth feature map D each as a set of N C-dimensional features: R = [r_1, r_2, ..., r_N]^T and D = [d_1, d_2, ..., d_N]^T, where N = H × W;

constructing a fully-connected graph for the input feature map R, wherein each feature r_i serves as a node and an edge (r_i, r_j) represents the relation between nodes r_i and r_j; the edges are obtained by learning from the two-dimensional depth feature map D, and the current two-dimensional RGB feature map R is updated through them, which is specifically expressed as:

\hat{r}_i = \frac{1}{\mathcal{Z}(d)} \sum_{\forall j} \delta( d_\theta(i)^T d_\phi(j) ) \, r_g(j)

wherein \mathcal{Z}(d) is the normalization parameter, δ is the softmax function, j enumerates all positions associated with i, and \hat{r}_i is the updated RGB feature; the above formula is written in the form of matrix multiplication:

\hat{R} = A(D) R_g, \quad A(D) = \delta( D_\theta D_\phi^T )

wherein A(D) is the self-attention matrix, and D_θ, D_φ and R_g all have dimensions N × C';

taking the feature vector r_i of each spatial position as a node and searching the entire spatial region for the nodes associated with r_i; for any node i in the depth feature map, sampling S representative features s(n), n = 1, ..., S, from all nodes related to i, wherein each s(n) is a feature vector of dimension C' obtained by a sampling function; the cross-modal self-attention learning module is then represented as:

\hat{r}_i = \sum_{n=1}^{S} \delta( d_\theta(i)^T s_\phi(n) ) \, s_g(n)

where n indexes the sampled nodes related to i, δ is the softmax function, d_θ(i) = W_θd(i), s_φ(i) = W_φs(i), s_g(i) = W_gs(i), and W_θ, W_φ and W_g are respectively three transformation matrices for the linear transformations.
It can be seen from the technical solutions provided by the embodiments of the present invention that a multi-level cross-modal self-attention learning mechanism is provided for three-dimensional target detection: depth structure information over the global scene is obtained from the depth feature map and organically combined with appearance information to improve the accuracy of the three-dimensional target detection algorithm. Meanwhile, several strategies are adopted to reduce the computational complexity so as to meet the processing-speed requirements of scenarios such as unmanned driving.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a three-dimensional target detection method based on a cross-modal self-attention mechanism according to an embodiment of the present invention.
Fig. 2 is a structural diagram of a three-dimensional object detection model according to an embodiment of the present invention.
Fig. 3 is a flowchart of training a three-dimensional target detection model according to an embodiment of the present invention.
Fig. 4 is a structural diagram of a cross-modal self-attention learning module according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
Based on the main defects existing in the current three-dimensional target detection algorithm, the invention obtains depth information through a two-dimensional depth map, and formalizes the utilization of the depth information into a cross-modal self-attention module learning problem. Depth information and appearance information are combined through a cross-modal self-attention mechanism, and meanwhile depth information is extracted in a non-iterative mode in a global range through a self-attention learning mechanism, so that detection accuracy is improved. When the depth information is acquired, the invention further uses a plurality of measures to further reduce the operation complexity and ensure that the method can be used in scenes with real-time processing requirements, such as automatic driving and the like.
The invention provides a three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism, which takes a two-dimensional RGB image and a depth image as input and combines, through the self-attention mechanism, the appearance information acquired from the two-dimensional RGB image with the structural information acquired from the depth image, so as to achieve accurate detection results while avoiding the high computational cost of point cloud processing. In addition, because a self-attention mechanism acquires a large amount of redundant information while acquiring global structural information, the method adopts an improved self-attention mechanism: for a given region, the structural information is computed only with respect to the globally most correlated regions, which further reduces the computation on the premise of guaranteeing detection accuracy.
The three-dimensional target detection method based on the multi-level cross-modal self-attention mechanism comprises the following processing procedures:
Data set construction: a training set and a test set of the three-dimensional target detection model are constructed, which specifically includes acquiring RGB images for training and testing, extracting the depth information corresponding to the training set images through a depth model, labeling the class, two-dimensional coordinates, three-dimensional coordinates, depth and size of the objects in the training images, and pre-processing the image data.
Constructing a three-dimensional target detection model: the method comprises the steps of constructing a three-dimensional target detection model based on a convolutional neural network, and specifically comprising the construction of an RGB image feature extraction network, a depth image feature extraction network and a cross-modal self-attention learning network.
Training a three-dimensional target detection model: the parameters of the three-dimensional target detection model are updated until convergence by calculating the loss functions for two-dimensional target detection and for the classification and regression of three-dimensional target detection, and by applying the stochastic gradient descent algorithm.
Detecting the three-dimensional target: the three-dimensional object is detected through the provided color image or video frame.
The processing flow chart of the three-dimensional target detection method based on the multi-level cross-modal self-attention mechanism provided by the embodiment of the invention is shown in fig. 1, and comprises the following steps:
step S1: and constructing a training set and a testing set. The RGB image was acquired and was taken as approximately 1: the scale of 1 is divided into a training set and a test set. Because the three-dimensional target detection method provided by the embodiment of the invention obtains the depth information through the two-dimensional depth image instead of the point cloud data adopted by the traditional method, the two-dimensional depth image is obtained through the depth estimation algorithm for the color image in the training set. In addition, for the objects in the training set images, firstly, the categories of the objects are labeled, and simultaneously, the coordinates of the two-dimensional detection frame, and the central position, the size and the corner of the three-dimensional detection frame are labeled. And finally, carrying out normalization processing on the image data in the training set and the test set.
Step S2: after the training set and the test set are obtained, a three-dimensional target detection model is constructed, which comprises an RGB backbone network, a depth backbone network, a classifier and a regressor. The structure of the three-dimensional target detection model provided by the embodiment of the invention is shown in fig. 2. Because features need to be extracted from the RGB image and the depth image respectively during training, two feature extraction backbone networks need to be constructed. In the embodiment of the invention, the RGB backbone network and the depth backbone network have the same structure and each comprise 4 convolution modules for extracting multi-level features.
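A minimal PyTorch sketch of the model structure of step S2 is given below, assuming two backbones of identical structure with 4 convolution modules each and simple classifier and regressor heads; the channel widths, strides and output dimensions are illustrative assumptions, and the cross-modal self-attention module described later is omitted here.

```python
import torch
import torch.nn as nn

def conv_module(in_ch, out_ch):
    # One of the 4 convolution modules of a backbone: 3x3 convolution with
    # stride 2, batch normalization and ReLU; widths and strides are assumptions.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Backbone(nn.Module):
    # Shared structure of the RGB backbone and the depth backbone:
    # 4 convolution modules producing multi-level feature maps.
    def __init__(self, in_ch, widths=(64, 128, 256, 512)):
        super().__init__()
        chans = (in_ch,) + tuple(widths)
        self.stages = nn.ModuleList(
            [conv_module(chans[i], chans[i + 1]) for i in range(4)]
        )

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)            # collect multi-level features
        return feats

class DetectionHead(nn.Module):
    # Classifier and regressor heads built from convolution, fully-connected
    # and normalization layers; output sizes are illustrative assumptions
    # (2D box: 4 values, 3D centre: 3, 3D size: 3, corner angle: 1).
    def __init__(self, in_ch=512, num_classes=3):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=1),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(in_ch, num_classes)
        self.regressor = nn.Linear(in_ch, 4 + 3 + 3 + 1)

    def forward(self, feat):
        v = self.pool(self.reduce(feat)).flatten(1)
        return self.classifier(v), self.regressor(v)

# Two backbones with identical structure, as required during training.
rgb_backbone = Backbone(in_ch=3)
depth_backbone = Backbone(in_ch=1)
head = DetectionHead()
```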
Step S3: and training a three-dimensional target detection model. After the three-dimensional object detection model is constructed, the model may be trained through the training set obtained in step S1, and the training effect of the three-dimensional object detection model is verified by using the test set. A training flowchart of a three-dimensional target detection model provided in an embodiment of the present invention is shown in fig. 3, and specifically includes the following steps:
step S3-1: the initialization model parameters include parameters in the convolution layer, the full link layer and the normalization layer included in the RGB backbone network, the deep backbone network, the classifier and the regressor.
Step S3-2: setting the training parameters. The three-dimensional target detection model provided by the embodiment of the invention is trained with the SGD (stochastic gradient descent) algorithm, and the relevant training parameters, including the learning rate, momentum, batch size and number of iterations, need to be set before training.
Step S3-3: calculating the objective function values. For any iteration batch, all RGB images and depth images are input into the RGB backbone network and the depth backbone network respectively to obtain multi-level features, the updated RGB features are obtained through the cross-modal self-attention learning module, and the estimated class, position, posture and depth value of the target object are then obtained through the classifier and the regressor. Finally, the error between the network estimation values and the actual labeled values is calculated to obtain the objective function values. Three objective function values are calculated in total during model training:
formula (1) is the classification objective, in which s_i and p_i are respectively the label and the estimated probability of the i-th target class; formulas (2) and (3) respectively compare the two-dimensional estimation frame and the three-dimensional estimation frame of the i-th target with the actual labeled value gt; and N is the total number of targets.
Step S3-4: the objective function values are added to obtain a total objective function value, the partial derivatives with respect to all parameters in the model are calculated, and the parameters are then updated by the stochastic gradient descent method.
Step S3-5: step S3-3 and step S3-4 are repeated, the model parameters are continuously updated until convergence, and the model parameters are finally output.
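The training procedure of steps S3-1 to S3-5 can be sketched as follows; since the exact objective functions (1) to (3) are not reproduced in this text, cross-entropy and smooth-L1 losses are used purely as stand-ins, and the batch layout returned by the data loader is an assumption.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs=12, lr=0.01, momentum=0.9):
    # Steps S3-2 to S3-5: stochastic gradient descent with learning rate,
    # momentum, batch size (fixed by the loader) and number of iterations.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    for _ in range(epochs):
        for rgb, depth, cls_gt, box2d_gt, box3d_gt in loader:   # assumed batch layout
            cls_pred, box2d_pred, box3d_pred = model(rgb, depth)
            loss_cls = F.cross_entropy(cls_pred, cls_gt)        # stand-in for formula (1)
            loss_2d = F.smooth_l1_loss(box2d_pred, box2d_gt)    # stand-in for formula (2)
            loss_3d = F.smooth_l1_loss(box3d_pred, box3d_gt)    # stand-in for formula (3)
            total = loss_cls + loss_2d + loss_3d                # step S3-4: total objective
            optimizer.zero_grad()
            total.backward()                                    # partial derivatives of all parameters
            optimizer.step()                                    # parameter update
    return model
```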
All parameters of the three-dimensional target detection model in the embodiment of the invention are obtained, and finally, the object in the two-dimensional image provided by the user is detected.
Step S4: after the multilevel RGB features and the depth features are obtained, a cross-modal self-attention learning module is constructed, the module takes the RGB features and the depth features as input simultaneously, a self-attention matrix based on depth information is obtained through learning, the RGB features are updated through the self-attention matrix, and structural information in the RGB features is increased. And finally, learning a classifier and a regressor by utilizing the updated RGB features, wherein the classifier and the regressor are used for target detection of the three-dimensional object in the two-dimensional RGB image, the classifier can identify the category of the three-dimensional target, and the regressor can identify the position and the posture of the three-dimensional target.
The three-dimensional target detection model in the embodiment of the invention comprises an RGB backbone network, a depth backbone network, a classifier and a regressor. After training is finished, the RGB backbone network retains the depth structure information through the cross-modal self-attention learning module. At test time, only a two-dimensional RGB image needs to be provided, and depth features do not need to be extracted by the depth backbone network.
According to the cross-modal self-attention learning module disclosed by the embodiment of the invention, the depth structure information can be obtained through the learning of the depth map and is embedded into the RGB image characteristics, so that the accuracy of three-dimensional target detection is improved. As described in detail below.
The structural flow chart of the cross-modal self-attention learning module provided by the embodiment of the invention is shown in fig. 4 and mainly comprises four sub-modules: a sampling point generating module, a multi-level attention learning module, an information updating module and an information fusion module. The core idea is to learn a self-attention matrix based on depth information from multi-level depth feature maps; this self-attention matrix reflects the structural similarity between different positions over the whole image, and the RGB feature map is updated through it to obtain structural features over the whole image, finally improving the accuracy of three-dimensional target detection. In actual operation, the module can be extended to multiple levels of depth features; fig. 4 shows a two-level depth feature map as an example.
For any two-dimensional RGB feature map R and two-dimensional depth feature map D, it is assumed that their dimensions are C × H × W, where C, H and W are respectively the number of channels, the height and the width. The two-dimensional RGB feature map R and the two-dimensional depth feature map D can both be represented as a set of N C-dimensional features: R = [r_1, r_2, ..., r_N]^T and D = [d_1, d_2, ..., d_N]^T, where N = H × W. A fully-connected graph is constructed for the input feature map R, wherein each feature r_i serves as a node and an edge (r_i, r_j) represents the relation between nodes r_i and r_j. For the two-dimensional RGB feature map R, appearance features such as color and texture are prominent, while structural information such as depth is insufficient. The cross-modal self-attention learning module in the embodiment of the present invention therefore learns the edges from the two-dimensional depth feature map D and then updates the current two-dimensional RGB feature map R to enrich its structural features, which can be specifically expressed as:

\hat{r}_i = \frac{1}{\mathcal{Z}(d)} \sum_{\forall j} \delta( d_\theta(i)^T d_\phi(j) ) \, r_g(j)

wherein \mathcal{Z}(d) is the normalization parameter, δ is the softmax function, j enumerates all positions associated with i, and \hat{r}_i is the updated RGB feature. The above formula can further be written in the form of matrix multiplication:

\hat{R} = A(D) R_g, \quad A(D) = \delta( D_\theta D_\phi^T )

wherein A(D) is the self-attention matrix, and D_θ, D_φ and R_g all have dimensions N × C'.
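Under the assumption that the linear transformations W_θ, W_φ and W_g are implemented as 1 × 1 convolutions, the single-level update described by the two formulas above can be sketched in PyTorch as follows; this is a minimal sketch, and the reduced width C' (c_mid) is an illustrative choice rather than a value fixed by the embodiment.

```python
import torch
import torch.nn as nn

class CrossModalSelfAttention(nn.Module):
    # Single-level sketch: the self-attention matrix A(D) = softmax(D_theta D_phi^T)
    # is learned from the depth feature map D and used to update the RGB
    # feature map R. The 1x1 convolutions stand in for W_theta, W_phi and W_g.
    def __init__(self, channels, c_mid=None):
        super().__init__()
        c_mid = c_mid or channels // 2
        self.theta = nn.Conv2d(channels, c_mid, kernel_size=1)
        self.phi = nn.Conv2d(channels, c_mid, kernel_size=1)
        self.g = nn.Conv2d(channels, c_mid, kernel_size=1)

    def forward(self, rgb, depth):
        b, _, h, w = rgb.shape
        d_theta = self.theta(depth).flatten(2).transpose(1, 2)         # B x N x C'
        d_phi = self.phi(depth).flatten(2).transpose(1, 2)             # B x N x C'
        r_g = self.g(rgb).flatten(2).transpose(1, 2)                   # B x N x C'
        attn = torch.softmax(d_theta @ d_phi.transpose(1, 2), dim=-1)  # A(D): B x N x N
        updated = attn @ r_g                                           # B x N x C'
        return updated.transpose(1, 2).reshape(b, -1, h, w)            # C'-channel update
```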
A single-level cross-modal self-attention learning module is thus constructed, which can learn a self-attention matrix containing structural information from a single-level depth feature map and update the RGB feature map of the corresponding level. However, as can be seen from the matrix multiplication formula above, the complexity of updating the RGB feature map is O(C' × N^2). For three-dimensional target detection, and especially for scenarios such as unmanned driving, the resolution of the input image or video frame is usually large, so computing the self-attention matrix A(D) is too time-consuming, which is unfavorable for application scenarios with real-time processing requirements. When constructing the fully-connected graph, the feature vector r_i of each spatial position is used as a node, the nodes associated with r_i are searched for over the entire spatial region, and the self-attention matrix is calculated. Since the nodes associated with r_i over the whole spatial region overlap heavily, the cross-modal self-attention learning module in the embodiment of the present invention selects only the nodes with the highest degree of relevance among the nodes related to r_i, removes the large number of redundant nodes, and then calculates the self-attention matrix, which greatly improves operational efficiency while still covering relevance over the whole spatial region. The cross-modal self-attention learning module involving the sampling mechanism is described in detail below.
For any node i in the depth feature map, S representative features are sampled from all nodes related to i, wherein each sampled feature vector s(n), n = 1, ..., S, has dimension C' and is produced by a sampling function. The cross-modal self-attention learning module in the embodiment of the present invention can then be expressed as:

\hat{r}_i = \sum_{n=1}^{S} \delta( d_\theta(i)^T s_\phi(n) ) \, s_g(n)    (7)

where n indexes the sampled nodes related to i, δ is the softmax function, d_θ(i) = W_θd(i), s_φ(i) = W_φs(i), s_g(i) = W_gs(i), and W_θ, W_φ and W_g are respectively three transformation matrices for the linear transformations. By adding the sampling module, the number of nodes involved in calculating the self-attention matrix is reduced from N to S, with S << N, so the operation complexity is greatly reduced. For example, for a feature map with a spatial dimension of 80 × 80, N is 6400, whereas in the embodiment of the present invention the number of selected samples is 9.
The present invention draws on the idea of deformable convolution and dynamically selects the sampling points by estimating offsets. Specifically, for a certain position p in the feature map, the sampling function can be expressed as:

s(p_n) = d(p + \Delta p_n)

wherein \Delta p_n is the offset obtained by regression. Since the result of a convolution operation usually contains a fractional part while the coordinates of sampling points can only be integers, the values at integer coordinates are obtained by bilinear interpolation:

d(p_s) = \sum_{t} K(p_s, t) \, d(t)

wherein p_s = p + \Delta p_n, t enumerates the four adjacent points of the sampling point that have integer coordinates, and K is the bilinear interpolation kernel function. In practical application, for each node in the RGB feature map, its offsets are obtained by a linear transformation whose transformation matrix is W_s; the output offset dimension is 2S, corresponding to the offsets of the coordinates along the horizontal and vertical axes. Through bilinear interpolation, the S most representative nodes are obtained for each node.
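A sketch of the sampling-point generation combined with the sampled attention of formula (7) is given below, using torch.nn.functional.grid_sample for the bilinear interpolation; treating the regressed offsets directly as normalized grid coordinates, and gathering both s_φ and s_g from the depth feature map, are simplifying assumptions of this example rather than details fixed by the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampledCrossModalAttention(nn.Module):
    # For every position, 2*S offsets are regressed by a 1x1 convolution (W_s),
    # S depth features are gathered at the offset positions by bilinear
    # interpolation (grid_sample), and attention is computed over the S sampled
    # nodes only. S = 9 follows the example given in the text.
    def __init__(self, channels, c_mid=None, num_samples=9):
        super().__init__()
        self.c_mid = c_mid or channels // 2
        self.s = num_samples
        self.offset = nn.Conv2d(channels, 2 * num_samples, kernel_size=1)
        self.theta = nn.Conv2d(channels, self.c_mid, kernel_size=1)
        self.phi = nn.Conv2d(channels, self.c_mid, kernel_size=1)
        self.g = nn.Conv2d(channels, self.c_mid, kernel_size=1)

    def forward(self, rgb, depth):
        b, _, h, w = rgb.shape
        # Base positions p on a normalized [-1, 1] grid (grid_sample convention).
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, h, device=rgb.device),
            torch.linspace(-1.0, 1.0, w, device=rgb.device), indexing="ij")
        base = torch.stack((xs, ys), dim=-1)                          # H x W x 2
        # Offsets delta p_n regressed from the RGB features (2 per sample point).
        off = self.offset(rgb).permute(0, 2, 3, 1).reshape(b, h, w, self.s, 2)
        grid = (base.view(1, h, w, 1, 2) + off).reshape(b, h, w * self.s, 2)
        # Bilinear interpolation gathers S depth features per position.
        s_phi = F.grid_sample(self.phi(depth), grid, align_corners=True)
        s_g = F.grid_sample(self.g(depth), grid, align_corners=True)
        s_phi = s_phi.reshape(b, self.c_mid, h * w, self.s)           # B x C' x N x S
        s_g = s_g.reshape(b, self.c_mid, h * w, self.s)
        d_theta = self.theta(depth).reshape(b, self.c_mid, h * w, 1)  # B x C' x N x 1
        attn = torch.softmax((d_theta * s_phi).sum(dim=1), dim=-1)    # B x N x S
        updated = (s_g * attn.unsqueeze(1)).sum(dim=-1)               # B x C' x N
        return updated.reshape(b, self.c_mid, h, w)                   # C'-channel update
```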
After the most representative sampling nodes are obtained through the depth feature map and the self-attention matrix is calculated, the RGB feature map can be updated. In the cross-modal self-attention learning module of the embodiment of the present invention, the RGB feature map is updated using the structure of a residual network, which can be specifically expressed as:

y_i = W_y \hat{r}_i + r_i    (11)

wherein \hat{r}_i is the updated RGB feature of formula (7) above, W_y is a linear transformation matrix, W_y \hat{r}_i is the residual obtained by learning, r_i is the original input RGB feature, and y_i is the final updated RGB feature. The cross-modal self-attention learning module constructed on the residual network structure can be embedded into any neural network model.

As can be seen from the above description, constructing a single-level cross-modal self-attention learning module requires 5 linear transformation matrices: W_θ, W_φ and W_g in formula (7), W_y in formula (11), and the linear transformation matrix W_s used to generate the sampling points. To further reduce the number of parameters, the cross-modal self-attention learning module is constructed as a bottleneck structure, that is, W_θ, W_φ and W_g in formula (7) are fused into a single linear transformation matrix W used to obtain d_θ, s_φ and s_g. In this way, only 3 linear transformation matrices are needed to construct the single-level cross-modal self-attention learning module. All linear transformations are implemented by 1 × 1 convolutions, each followed by a batch normalization operation.
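The residual update of formula (11), with W_y realized as a 1 × 1 convolution followed by batch normalization, can be sketched as follows; the channel widths are assumptions of this example.

```python
import torch.nn as nn

class ResidualUpdate(nn.Module):
    # Residual-style update following formula (11): y = W_y * r_hat + r, with
    # W_y realized as a 1x1 convolution plus batch normalization that projects
    # the C'-channel attended feature back to C channels.
    def __init__(self, c_mid, channels):
        super().__init__()
        self.w_y = nn.Sequential(
            nn.Conv2d(c_mid, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, r_hat, rgb):
        return self.w_y(r_hat) + rgb    # learned residual added to the original RGB features
```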
As shown in fig. 4, the cross-modal self-attention learning module in the embodiment of the present invention can learn self-attention matrices containing structural information from a multi-level depth feature map and update the RGB feature map accordingly, so the multi-level information needs to be fused at the end. The fusion operation can be specifically expressed as:

y_i = r_i + \sum_{j} W_y^j \, \hat{r}_i^j

where j enumerates the depth hierarchy levels, W_y^j is the linear transformation matrix of the corresponding level, and \hat{r}_i^j is the updated RGB feature of the corresponding level, calculated by formula (7).
It should be noted that, in order to further reduce the operation complexity of the embodiment of the present invention, the feature maps may be further grouped at the spatial and channel levels when the self-attention matrix is calculated. At the spatial level, a feature map with dimensions C × H × W can be divided into several regions, each containing a number of C × 1 feature vectors; a pooling operation is performed on each region so that one region serves as a single node, which means the matrix operation is carried out once per region rather than per feature and the operation complexity is therefore greatly reduced. Similarly, at the channel level, all feature channels may be equally divided into G groups, the feature map of each group having dimensions C' × H × W with C' = C/G. The computation is first carried out within each group, and all grouped features are then concatenated to obtain the final features.
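The channel-level grouping can be sketched as follows; the group count G = 4 is an arbitrary illustrative choice, attention_fn stands for any of the attention computations sketched above (instantiated with C/G channels), and the spatial pooling variant is omitted.

```python
import torch

def grouped_attention(rgb, depth, attention_fn, groups=4):
    # Split the C channels of both feature maps into G groups, run the attention
    # computation per C/G-channel group, and concatenate the results.
    rgb_groups = torch.chunk(rgb, groups, dim=1)
    depth_groups = torch.chunk(depth, groups, dim=1)
    outs = [attention_fn(r, d) for r, d in zip(rgb_groups, depth_groups)]
    return torch.cat(outs, dim=1)
```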
In summary, the present invention organically combines, through the cross-modal self-attention mechanism, the depth structure information obtained from the depth map with the appearance information obtained from the RGB map to achieve accurate detection results, instead of simply fusing the two kinds of information. When the depth structure information is acquired, the correlation among different positions is considered over the global scene rather than being limited to a neighborhood. This benefits mainly from the characteristics of the self-attention learning mechanism and from learning through multi-level features. In addition, when the correlations among different positions are obtained over the global scene, only a single pass is performed without iteration, so the method can effectively detect the class, position and posture of three-dimensional objects in two-dimensional RGB images.
When the cross-modal self-attention mechanism is used for acquiring the correlation among different positions, the self-attention matrix is calculated only for the positions with high correlation, so that the calculation of the self-attention matrix among a large number of redundant positions can be avoided, and the operation complexity is reduced while the effect is ensured. In addition, the depth features can be grouped at the dimension and space level when the self-attention matrix is calculated so as to further reduce the operation complexity.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
All the embodiments in the present specification are described in a progressive manner, the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for the apparatus or system embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for relevant parts reference may be made to the partial description of the method embodiments. The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A three-dimensional target detection method based on a multi-level cross-modal self-attention mechanism is characterized by comprising the following steps:
constructing training set and test set data by using RGB image data;
constructing a three-dimensional target detection model, wherein the three-dimensional target detection model comprises an RGB (red, green and blue) trunk network, a depth trunk network, a classifier and a regressor;
training the three-dimensional target detection model by using the training set and the test set data, verifying the training effect of the three-dimensional target detection model by using the test set, respectively acquiring RGB features and depth features by using the RGB backbone network and the depth backbone network, inputting the RGB features and the depth features into a cross-modal self-attention learning module, updating the RGB features, and learning a classifier and a regressor by using the updated RGB features to obtain the trained three-dimensional target detection model;
and detecting the category, the position and the posture of the three-dimensional object in the two-dimensional RGB image to be recognized by utilizing the classifier and the regressor in the trained three-dimensional target detection model.
2. The method of claim 1, wherein said constructing training set and test set data using RGB image data comprises:
collecting RGB images and dividing them into a training set and a test set at a ratio of approximately 1:1, performing normalization processing on the image data in the training set and the test set, acquiring two-dimensional depth images of the training set images through a depth estimation algorithm, labeling the category of each object in the training set images, and labeling the coordinates of the two-dimensional detection frame and the center position, size and corner angle of the three-dimensional detection frame of each image.
3. The method of claim 2, wherein the RGB backbone network, the depth backbone network, the classifier and the regressor in the three-dimensional object detection model each comprise convolution layers, fully-connected layers and normalization layers, and the RGB backbone network and the depth backbone network have the same structure, each comprising 4 convolution modules.
4. The method according to claims 2 and 3, wherein the training of the three-dimensional target detection model using the training set and the test set data, in which the RGB backbone network and the depth backbone network respectively obtain RGB features and depth features, the RGB features and the depth features are input into a cross-modal self-attention learning module to update the RGB features, and the trained three-dimensional target detection model is obtained by using the updated RGB features to learn the classifier and the regressor, comprises:
step S3-1: initializing the parameters of the convolution layers, fully-connected layers and normalization layers contained in the RGB backbone network, the depth backbone network, the classifier and the regressor of the three-dimensional target detection model;
step S3-2: setting the training parameters of the stochastic gradient descent algorithm, including the learning rate, momentum, batch size and number of iterations;
step S3-3: for any iteration batch, respectively inputting all RGB images and depth images into an RGB backbone network and a depth backbone network to obtain multi-level RGB features and depth features, constructing a cross-modal self-attention learning module, inputting the RGB features and the depth features into the cross-modal self-attention learning module, learning to obtain a self-attention matrix based on depth information, updating the RGB features through the self-attention matrix, learning a classifier and a regressor by utilizing the updated RGB features, and using the classifier and the regressor for target detection of a three-dimensional object in a two-dimensional RGB image,
and calculating the error between the network estimation values and the actual labeled values to obtain the objective function values, wherein three objective function values are respectively calculated by formulas (1), (2) and (3), formula (1) being the classification objective and formulas (2) and (3) being the regression objectives over the two-dimensional estimation frame and the three-dimensional estimation frame; in formula (1), s_i and p_i are respectively the label and the estimated probability of the i-th target class; formulas (2) and (3) respectively compare the two-dimensional estimation frame and the three-dimensional estimation frame of the i-th target with the actual labeled value gt; and N is the total number of targets;
step S3-4: adding the three objective function values to obtain a total objective function value, respectively calculating the partial derivatives with respect to all parameters in the three-dimensional target detection model, and updating the parameters by the stochastic gradient descent method;
step S3-5: repeating step S3-3 and step S3-4, continuously updating the parameters of the three-dimensional target detection model until convergence, and outputting the parameters of the trained three-dimensional target detection model.
5. The method of claim 4, wherein the inputting of the RGB features and the depth features into the cross-modal self-attention learning module, updating the RGB features, and obtaining the trained three-dimensional object detection model by using the updated RGB features to learn the classifier and the regressor, comprises:

for any two-dimensional RGB feature map R and two-dimensional depth feature map D, assuming that their dimensions are C × H × W, where C, H and W are respectively the number of channels, the height and the width, representing the two-dimensional RGB feature map R and the two-dimensional depth feature map D each as a set of N C-dimensional features: R = [r_1, r_2, ..., r_N]^T and D = [d_1, d_2, ..., d_N]^T, where N = H × W;

constructing a fully-connected graph for the input feature map R, wherein each feature r_i serves as a node and an edge (r_i, r_j) represents the relation between nodes r_i and r_j, the edges being obtained by learning from the two-dimensional depth feature map D, and updating the current two-dimensional RGB feature map R through the edges, which is specifically expressed as:

\hat{r}_i = \frac{1}{\mathcal{Z}(d)} \sum_{\forall j} \delta( d_\theta(i)^T d_\phi(j) ) \, r_g(j)

wherein \mathcal{Z}(d) is the normalization parameter, δ is the softmax function, j enumerates all positions associated with i, and \hat{r}_i is the updated RGB feature; the above formula is written in the form of matrix multiplication:

\hat{R} = A(D) R_g, \quad A(D) = \delta( D_\theta D_\phi^T )

wherein A(D) is the self-attention matrix, and D_θ, D_φ and R_g all have dimensions N × C';

taking the feature vector r_i of each spatial position as a node and searching the entire spatial region for the nodes associated with r_i; for any node i in the depth feature map, sampling S representative features s(n), n = 1, ..., S, from all nodes related to i, wherein each s(n) is a feature vector of dimension C' obtained by a sampling function; the cross-modal self-attention learning module is then represented as:

\hat{r}_i = \sum_{n=1}^{S} \delta( d_\theta(i)^T s_\phi(n) ) \, s_g(n)

where n indexes the sampled nodes related to i, δ is the softmax function, d_θ(i) = W_θd(i), s_φ(i) = W_φs(i), s_g(i) = W_gs(i), and W_θ, W_φ and W_g are respectively three transformation matrices for the linear transformations.
CN202210253116.0A 2022-03-15 2022-03-15 Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism Pending CN114663880A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210253116.0A CN114663880A (en) 2022-03-15 2022-03-15 Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210253116.0A CN114663880A (en) 2022-03-15 2022-03-15 Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism

Publications (1)

Publication Number Publication Date
CN114663880A true CN114663880A (en) 2022-06-24

Family

ID=82029592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210253116.0A Pending CN114663880A (en) 2022-03-15 2022-03-15 Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism

Country Status (1)

Country Link
CN (1) CN114663880A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972958A (en) * 2022-07-27 2022-08-30 北京百度网讯科技有限公司 Key point detection method, neural network training method, device and equipment
CN116503418A (en) * 2023-06-30 2023-07-28 贵州大学 Crop three-dimensional target detection method under complex scene
CN116503418B (en) * 2023-06-30 2023-09-01 贵州大学 Crop three-dimensional target detection method under complex scene

Similar Documents

Publication Publication Date Title
CN111325794B (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
CN108665496B (en) End-to-end semantic instant positioning and mapping method based on deep learning
Laskar et al. Camera relocalization by computing pairwise relative poses using convolutional neural network
Remez et al. Learning to segment via cut-and-paste
CN110503680B (en) Unsupervised convolutional neural network-based monocular scene depth estimation method
CN111627065B (en) Visual positioning method and device and storage medium
Liu et al. Sift flow: Dense correspondence across different scenes
CN110738697A (en) Monocular depth estimation method based on deep learning
CN110246181B (en) Anchor point-based attitude estimation model training method, attitude estimation method and system
CN114863573B (en) Category-level 6D attitude estimation method based on monocular RGB-D image
CN112784736B (en) Character interaction behavior recognition method based on multi-modal feature fusion
CN108171249B (en) RGBD data-based local descriptor learning method
CN108764250B (en) Method for extracting essential image by using convolutional neural network
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN114511778A (en) Image processing method and device
CN113724379B (en) Three-dimensional reconstruction method and device for fusing image and laser point cloud
CN113177592B (en) Image segmentation method and device, computer equipment and storage medium
CN113449691A (en) Human shape recognition system and method based on non-local attention mechanism
CN114140623A (en) Image feature point extraction method and system
CN113570658A (en) Monocular video depth estimation method based on depth convolutional network
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
CN112465021A (en) Pose track estimation method based on image frame interpolation method
CN109919215B (en) Target detection method for improving characteristic pyramid network based on clustering algorithm
CN113011359A (en) Method for simultaneously detecting plane structure and generating plane description based on image and application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination