CN114359660B - Multi-modal target detection method and system suitable for modal intensity change

Multi-modal target detection method and system suitable for modal intensity change

Info

Publication number
CN114359660B
CN114359660B
Authority
CN
China
Prior art keywords
modal
fusion
dimensional
training
data
Prior art date
Legal status
Active
Application number
CN202111566871.6A
Other languages
Chinese (zh)
Other versions
CN114359660A (en)
Inventor
程腾
侯登超
张峻宁
石琴
陈炯
姜俊昭
Current Assignee
Anhui Guandun Technology Co ltd
Hefei University Of Technology Asset Management Co ltd
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202111566871.6A
Publication of CN114359660A
Application granted
Publication of CN114359660B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal target detection method and system suitable for modal intensity change. The target detection method comprises the following steps: selecting a sufficiently large training data set and obtaining the corresponding modal feature tensors with the corresponding feature extraction modules; in the first model training, constructing a fusion detection network model that does not consider the interrelation among the multi-modal data, and obtaining a dictionary set in the channel direction once training is completed; in the second model training, constructing a fusion detection network model that does consider the interrelation among the multi-modal data, calculating the corresponding modal sparsity with the obtained dictionary set, and correcting the weights of the multi-modal features in the fusion stage; after training is completed, the model can be used for real-time detection and recognition. By quantifying the strength relation between modal data through a reasonable means, the invention improves the effect and precision of 3D target detection and recognition while saving operation parameters and operation time.

Description

Multi-modal target detection method and system suitable for modal intensity change
Technical Field
The invention relates to the technical field of automatic driving target detection, and in particular to a multi-modal target detection method and system suitable for modal intensity change.
Background
In recent years, target recognition and detection have made remarkable progress in face recognition, image recognition, video recognition, automatic driving, and other fields. Target recognition and detection are particularly important in automatic driving: detecting and identifying the traffic participants on the road is essential for safe driving, emergency avoidance, and other vehicle operations.
At present, there are three main multi-modal environment perception approaches in target recognition and detection. The first acquires each modality with a separate sensor and superimposes and fuses the modal data before perception, also called pre (early) fusion; its drawback is that it requires the physical quantities acquired by the sensors, such as speed and steering angle, to be strongly related. The second designs a neural network for each modality, extracts the required local and global features with those networks, and superimposes and fuses the corresponding modal features at the feature level, also called intermediate fusion; its drawback is that the interrelation among the different modalities is not sufficiently considered: for example, with a camera and a millimeter-wave radar under dim conditions, the network should compare the strength of the two modalities and place more emphasis on the millimeter-wave radar when outputting the perception result. The third makes a logical selection among the perception results of the individual modalities and synthesizes a final result, namely post (late) fusion; its drawback is that the fused result resembles an intersection that discards as many unreliable factors as possible.
In an automatic driving environment, there may be no strong dependency among the modal data acquired by the various sensors, yet the data acquired by each sensor are indispensable for completing the automatic driving task, and taking the intersection of rough detection and recognition results degrades the final detection and recognition effect. Moreover, when each modality's features are extracted with a neural network and the feature layers are simply superimposed and fused, the interrelation that exists among the modal data in extreme scenes is ignored, which often causes missed or false detections.
Disclosure of Invention
The present invention provides a method and a system for multi-modal object detection suitable for modal intensity variation, so as to solve the problems mentioned in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
a multi-modal target detection method suitable for modal intensity variation comprises the following steps:
s1, feature extraction of multi-modal information: selecting a large number of training data sets, wherein the training data sets comprise modal data of different dimensions in the same time and space under various extreme scenes, and acquiring corresponding modal feature tensors by using corresponding feature extraction modules aiming at the modal data of different dimensions;
S2, training a model for the first time: on the basis of sparse coding, for the modal feature tensors of different dimensions obtained in step S1, the feature-map scales other than the channel dimension are taken for dictionary set training, and the dictionary set loss terms corresponding to the dictionary sets are incorporated into the overall loss function during model training; a first fusion detection network model that does not consider the interrelation among the multi-modal data is constructed and then trained, and once the training of the first fusion detection network model is finished, perfect dictionary sets corresponding to the modal data of different dimensions are obtained at the same time;
s3, training a second model: calculating the modal sparsity corresponding to different dimensionality modal data by using the obtained perfect dictionary set, calculating a superposition coefficient distribution proportion according to the modal sparsity, constructing and forming a second fusion detection network model considering the interrelation among the multi-modal data, then training the second fusion detection network model, and correcting the weight of the relevant features to be fused according to the superposition coefficient distribution proportion in the feature fusion stage of the training process until the second fusion detection network model is trained;
s4, detection application: and (4) performing feature extraction operation of the step S1 on modal data of different dimensions in the same time and space in an actual driving scene to obtain a corresponding real-time modal feature tensor, inputting the real-time modal feature tensor into a trained second fusion detection network model, predicting the traffic participants, and outputting the second fusion detection network model to obtain a final target detection effect.
As a further aspect of the present invention, the process of extracting features from the three-dimensional space point cloud data in step S1 is as follows: firstly, a point cloud sparse tensor is obtained from the three-dimensional space point cloud data by voxel partitioning, point cloud grouping and VFE (voxel feature encoding); then, several three-dimensional convolution operations are performed on the point cloud sparse tensor with a three-dimensional convolution kernel, each three-dimensional convolution operation yielding a four-dimensional tensor representing a corresponding scale;
the process of extracting features from the two-dimensional color image data in step S1 is as follows: the image tensor corresponding to the two-dimensional color image data is convolved multiple times with a 3 × 3 convolution kernel, the result of each convolution operation is used to construct a tree-like structure, the output of the current convolution result is taken as the input of a later convolution to construct skip links, and finally the three-dimensional tensor of the image is obtained.
As a further scheme of the invention, the extreme scenes of the three-dimensional space point cloud data comprise a strong light scene, a smoke scene, a rain scene or/and a snow scene;
the extreme scenes of the two-dimensional color image data include a rainy night scene, a snowy night scene, a daytime rainy scene or/and a daytime snowy scene.
As a further scheme of the present invention, the first fusion detection network model in step S2 includes a 3D target detection model, a SMOKE task network, a VSA extraction module, a feature fusion module, a refinement module, and an overall loss function, and the first model training specifically includes the steps of:
s21, defining dictionary set and constructing dictionary set loss items: after obtaining modal feature tensors of different dimensions, respectively defining corresponding dictionary sets for the modal feature tensors of different dimensions according to sparse coding definition, and carrying out averaging according to the number of channels of the corresponding modal feature tensors to construct corresponding dictionary set loss terms;
s22, constructing a foreground point loss term and a 3D bounding box loss term: calculating to obtain foreground point loss items and 3D surrounding frame loss items according to a rough 3D surrounding frame obtained by outputting of the 3D target detection model and a truth value label in the training data set;
s23, constructing a key point loss item: obtaining a key point loss item according to the target 2D key point obtained by the SMOKE task network and the category of the target;
s24, constructing a confidence coefficient loss term and a correction loss term: obtaining a confidence coefficient loss term and a correction loss term according to the confidence coefficient prediction result and the frame correction parameter obtained by the fine correction module;
s25, constructing and forming a first fusion detection network model without considering the interrelation among the multi-modal data: the overall loss function of the first fusion detection network model comprises a foreground point loss term, a 3D bounding box loss term, a key point loss term, a confidence coefficient loss term, a correction loss term and a dictionary set loss term;
s26, training a first fusion detection network model: training the first fusion detection network model by using the modal feature tensor extracted in the step S1, and finishing the training of the first fusion detection network model when the overall loss function reaches the minimum; meanwhile, the dictionary set is trained, and a perfect dictionary set corresponding to different dimension modal data is obtained.
As a further scheme of the present invention, the second fusion detection network model in step S3 includes a perfect dictionary set, a 3D target detection model, a SMOKE task network, a VSA extraction module, a feature fusion module, a Softmax function, a refinement module, and an overall loss function, and the second model training specifically includes:
s31, calculating the modal sparsity corresponding to different dimension modal data by using the obtained perfect dictionary set, forming a vector by using the modal sparsity corresponding to the different dimension modal data, mapping the vector to a [0,1] space by using a softmax function, and obtaining the superposition coefficient distribution proportion corresponding to the different dimension modal data by setting the sum to 1;
s32, constructing a foreground point loss item and a 3D bounding box loss item: same as step S22;
s33, constructing a key point loss item: same as step S23;
s34, constructing a confidence coefficient loss term and a correction loss term: same as step S24;
s35, constructing and forming a second fusion detection network model considering the interrelation among the multi-modal data: the overall loss function of the second fusion detection network model comprises a foreground point loss term, a 3D bounding box loss term, a key point loss term, a confidence coefficient loss term and a correction loss term;
s36, training a second fusion detection network model: and (5) training the second fusion detection network model by using the modal feature tensor extracted in the step (S1), and finishing the training of the second fusion detection network model when the overall loss function reaches the minimum.
As a further aspect of the present invention, the dictionary set includes a three-dimensional dictionary set corresponding to three-dimensional space point cloud data and a two-dimensional dictionary set corresponding to two-dimensional color image data; dictionary set loss terms also include three-dimensional dictionary set loss terms and two-dimensional dictionary set loss terms.
As a further scheme of the present invention, the specific training process of the first fusion detection network model in step S26 is as follows:
s261, 3D target preliminary identification stage: the 3D target detection model takes a four-dimensional tensor corresponding to the three-dimensional space point cloud data finally output in the step S1, a projection aerial view of the four-dimensional tensor is used as a point cloud aerial view tensor, a classification branch of the 3D target detection model is used for determining a foreground point of a target in the point cloud aerial view tensor, a regression branch of the 3D target detection model is used for regressing a 2D aerial view frame of the target, a Z-axis point cloud space projection is ignored, a rough 3D surrounding frame is obtained, and the three-dimensional space point cloud data are divided into a foreground and a background of the target;
s262, determining a 3D central point: outputting to obtain 2D key points of the target in the two-dimensional color image data by using the key point branch of the SMOKE task network according to the three-dimensional tensor of the image corresponding to the two-dimensional color image data in the step S1, and reversely deducing the 3D key points of the target by combining camera internal parameters; searching a point closest to the 3D key point in the three-dimensional space point cloud data according to an Euclidean distance method, and taking the point as a 3D central point of a target;
s263, VSA extraction stage: respectively converging the point cloud sparse tensor obtained in the step S1 and the four-dimensional tensor obtained by each three-dimensional convolution operation to a 3D central point by using a VSA module to obtain a 3D central point field set of each target, obtaining voxel characteristics corresponding to the 3D central point by max pooling operation, and obtaining multi-level voxel characteristics of the 3D central point by combining all the voxel characteristics;
performing VSA operation on the initial three-dimensional space point cloud data by using the 3D central point to obtain global characteristics, and combining the global characteristics with multi-level voxel characteristics to obtain two-dimensional modal data related characteristics;
projecting the 3D central point onto a point cloud aerial view, and performing bilinear interpolation to obtain an aerial view interpolation characteristic;
s264, a first feature fusion stage: the feature fusion module performs add operation on the relevant features of the two-dimensional modal data and the aerial view interpolation features, simple splicing is achieved, and first fusion features are obtained;
s265, a fine trimming stage: inputting the first fusion characteristics and the rough 3D bounding box into a fine modification module, and calculating by the fine modification module to obtain a confidence prediction result and a frame modification parameter;
s266, completing the training of the dictionary set: and repeating the steps S261 to S265 until the overall loss function reaches the minimum, finishing training the first fusion detection network model, and finishing training the dictionary set to obtain the perfect dictionary sets corresponding to the modal data with different dimensions.
As a further scheme of the present invention, the specific training process of the second fusion detection network model in step S36 is as follows:
s361, a 3D target preliminary identification stage: same as step S261;
s362, determining a 3D center point stage: same as step S262;
s363, VSA extraction stage: same as step 263;
s364, a second feature fusion stage: the feature fusion module performs fusion splicing on the two-dimensional modal data relevant features and the aerial view interpolation features according to the superposition coefficient distribution proportion to obtain second fusion features;
s365, a fine modification stage: inputting the second fusion characteristics and the rough 3D bounding box into a refinement module, and calculating by the refinement module to obtain a confidence coefficient prediction result and a border correction parameter;
s366, a second fusion detection network model training completion stage: and repeating the steps S361 to S365 until the overall loss function reaches the minimum and the training of the second fusion detection network model is finished.
As a further scheme of the present invention, the specific process of detecting the application in step S4 is as follows:
s41, performing the feature extraction operation of the step S1 on modal data of different dimensions in the same space-time in an actual driving scene to obtain a corresponding real-time modal feature tensor, and inputting the real-time modal feature tensor into a second fusion detection network model;
s42, predicting the traffic participants by the second fusion detection network model, outputting a rough 3D bounding box by the 3D target detection model, and outputting a confidence prediction result and a frame correction parameter by the fine modification module;
and S43, synthesizing the three outputs, namely the rough 3D bounding box, the confidence prediction result and the frame correction parameters: the detected traffic participant is marked with a new 3D bounding box and the probability of the target's category is obtained, which is the result of the 3D target detection.
A multi-modal target detection system adapted for modal intensity variation, comprising: a sufficiently large training data set, wherein the training data set comprises modal data of different dimensions in the same time and space under various extreme scenes;
the characteristic extraction module is used for extracting characteristics of modal data with different dimensions in the training data set to obtain corresponding modal characteristic tensors;
the dictionary set is used for taking the scale of a feature map except a channel for carrying out dictionary set training on the acquired modal feature tensors with different dimensions on the basis of sparse coding, obtaining a complete dictionary set corresponding to modal data with different dimensions after the training of the first fusion detection network model is finished, and calculating the modal sparsity corresponding to the modal data with different dimensions through the complete dictionary set;
the SMOKE task network outputs and obtains 2D key points of a target in the two-dimensional color image data by using the key point branches of the SMOKE task network according to the three-dimensional tensor of the image corresponding to the two-dimensional color image data, and reversely deduces the 3D key points of the target by combining camera internal parameters; in the three-dimensional space point cloud data, searching a point closest to the 3D key point according to an Euclidean distance method, and taking the point as a 3D central point of a target;
the VSA extraction module is used for carrying out VSA operation on the modal characteristic tensor obtained by each three-dimensional convolution operation in the characteristic extraction process to obtain multi-level voxel characteristics combined with the 3D central point;
carrying out VSA operation on the initial three-dimensional space point cloud data by using the 3D central point to obtain global characteristics;
projecting the 3D central point onto a point cloud aerial view, and performing binomial interpolation to obtain an aerial view interpolation feature;
obtaining two-dimensional modal data relevant characteristics by combining the global characteristics and the multi-level voxel characteristics;
the feature fusion module is used for performing add operation on the two-dimensional modal data related features and the aerial view interpolation features in the process of training the first fusion detection network model, so that simple splicing is realized, and first fusion features are obtained;
in the process of training the second fusion detection network model, the feature fusion module performs fusion splicing on the two-dimensional modal data relevant features and the aerial view interpolation features according to the superposition coefficient distribution proportion to obtain second fusion features;
the 3D target detection model, which receives the projection aerial view of the modal feature tensor corresponding to the three-dimensional space point cloud data finally output by the feature extraction module as the point cloud aerial view tensor and outputs a rough 3D bounding box;
the fine modification module is used for combining the fusion features and the rough 3D bounding box and outputting a modified feature vector, and the modified feature vector is used for predicting confidence and modifying the rough 3D bounding box;
the integral loss function is used for measuring the inconsistency degree of the predicted value and the true value of the fusion detection network model, and when the integral loss function reaches the minimum value, the training of the fusion detection network model is finished;
in the process of training the first fusion detection network model, the overall loss function comprises a foreground point loss term, a 3D bounding box loss term, a key point loss term, a confidence coefficient loss term, a correction loss term, a two-dimensional dictionary set loss term and a three-dimensional dictionary set loss term;
in the process of training the second fusion detection network model, the overall loss function comprises a foreground point loss term, a 3D bounding box loss term, a key point loss term, a confidence coefficient loss term and a correction loss term;
and a Softmax function: in the process of training the second fusion detection network model, the modal sparsities corresponding to the modal data of different dimensions form a vector, which the Softmax function maps into the [0,1] interval with components summing to 1, giving the superposition coefficient distribution proportion corresponding to the modal data of different dimensions.
Compared with the prior art, the invention has the beneficial effects that: in the first model training, constructing and forming a fusion detection network model without considering the interrelation among multi-mode data, on the basis of sparse coding, taking characteristic graph scales except channels for modal characteristic tensors of different dimensions to perform dictionary set training, incorporating dictionary set loss terms corresponding to the dictionary set into an overall loss function of the model training, and obtaining the dictionary set in the channel direction after the model training is completed; in the second model training, constructing and forming a fusion detection network model considering the interrelation among the multi-modal data, calculating the corresponding modal sparsity by using the obtained dictionary set, correcting the weight of the multi-modal characteristics in the fusion stage, and performing real-time detection and recognition by using the model after the model training is finished;
The method considers the strength relation among the modal data: under extreme conditions, even if the modal data received by some sensors are poor and do not provide enough useful information, they are not simply superimposed in the feature-layer fusion stage; instead, the method performs feature fusion according to reasonably distributed weight coefficients, which effectively guarantees the effectiveness and robustness of the final perception task;
according to the method, the strength relation among different modal data is quantized through a reasonable and effective technical means, the modal sparsity of the modal data with different dimensions is defined through the mathematical expression of a sparse coding set, and further, the characteristics are fully considered when the characteristic level fusion is carried out, so that the final 3D target detection and identification effect and accuracy are improved;
in addition, compared with the traditional point cloud identification system, the target detection system disclosed by the invention has the advantages that 2D key points of the target in the two-dimensional color image data are identified and detected, 3D key points of the target are reversely deduced according to the 2D key points of the target, the three-dimensional space point cloud data are accurately screened according to the characteristics of the 3D key points, the operation parameters can be greatly saved, and the 3D target identification accuracy rate is improved.
Drawings
FIG. 1 is a connection block diagram of a multi-modal target detection system suitable for modal intensity variation during a first model training;
FIG. 2 is a block diagram of an overall loss function of a multi-modal target detection system adapted for modal intensity variation during a first model training;
FIG. 3 is a flowchart of a first model training process in a multi-modal target detection method suitable for modal intensity variation;
FIG. 4 is a connection block diagram of a multi-modal target detection system adapted for modal intensity variation during a second model training;
FIG. 5 is a block diagram of an overall loss function of a multi-modal object detection system adapted for modal intensity variation during a second model training;
fig. 6 is a flow chart of a second model training in the multi-modal target detection method suitable for modal intensity variation.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example (b): a multi-modal target detection method suitable for modal intensity variation comprises the following steps:
s1, feature extraction of multi-modal information: selecting a large number of training data sets, wherein the training data sets comprise modal data of different dimensions in the same time and space under various extreme scenes, and acquiring corresponding modal feature tensors by using corresponding feature extraction modules aiming at the modal data of different dimensions; obviously, the training data set should also contain modal data and truth labels in a general scene, the modal data may be three-dimensional space point cloud data (such as bin file) acquired by laser radar, two-dimensional color image data (such as RGB image file) acquired by camera shooting, etc.,
Specifically, the process of extracting features from the three-dimensional space point cloud data in step S1 is as follows: firstly, a point cloud sparse tensor is obtained from the three-dimensional space point cloud data by voxel partitioning, point cloud grouping and VFE (voxel feature encoding); then, several three-dimensional convolution operations are performed on the point cloud sparse tensor with a three-dimensional convolution kernel, each three-dimensional convolution operation yielding a four-dimensional tensor representing a corresponding scale;
In the present embodiment, the three-dimensional space point cloud data are expressed mathematically as a tensor of size (D, H, W), where D, H and W respectively denote the extent of the points along the length, width and height of the spatial coordinate system. Voxel partitioning, point cloud grouping and VFE feature encoding yield a point cloud sparse tensor of size (C', D', H', W'), where C' is the feature dimension of the three-dimensional space point cloud data and D', H' and W' are respectively the length, width and height of the configured voxel grid. Because analysing and computing every single point is too expensive, the voxel method divides the three-dimensional point cloud into voxel blocks of equal size, which reduces the amount of computation. Next, the point cloud sparse tensor (C', D', H', W') is convolved with a three-dimensional convolution kernel; the main difference from 2D convolution is that the kernel also slides along the spatial depth dimension, which extracts object features in 3D space well. In this embodiment, three convolutions produce three four-dimensional tensors representing different scales.
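For illustration, the following is a minimal numpy sketch of the voxel partitioning and grouping step described above; the grid range, voxel size and the mean-pooling stand-in for VFE encoding are assumptions chosen for the example, not parameters taken from the patent.

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4),
             grid_range=((0, 70.4), (-40, 40), (-3, 1))):
    """Group raw lidar points (N, 3+) into voxel blocks and return a sparse
    representation: occupied voxel coordinates plus one pooled feature per voxel.
    A crude stand-in for the voxel partition / grouping / VFE encoding step."""
    mins = np.array([r[0] for r in grid_range])
    maxs = np.array([r[1] for r in grid_range])
    size = np.array(voxel_size)
    mask = np.all((points[:, :3] >= mins) & (points[:, :3] < maxs), axis=1)
    pts = points[mask]
    coords = ((pts[:, :3] - mins) / size).astype(np.int64)           # (M, 3) voxel indices
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)   # group points by voxel
    inverse = np.asarray(inverse).reshape(-1)
    feats = np.zeros((len(uniq), pts.shape[1]))
    np.add.at(feats, inverse, pts)
    feats /= np.bincount(inverse)[:, None]                           # mean of points per voxel
    return uniq, feats    # sparse tensor: voxel coords + per-voxel features

# Example: 10k random points with a reflectance channel
cloud = np.random.rand(10000, 4) * np.array([70, 80, 4, 1]) + np.array([0, -40, -3, 0])
voxel_coords, voxel_feats = voxelize(cloud)
print(voxel_coords.shape, voxel_feats.shape)
```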
The process of extracting features from the two-dimensional color image data in step S1 is as follows: the image tensor corresponding to the two-dimensional color image data is convolved multiple times with a 3 × 3 convolution kernel, the result of each convolution operation is used to construct a tree-like structure, the output of the current convolution result is taken as the input of a later convolution to construct skip links, and finally the three-dimensional tensor of the image is obtained.
In the present embodiment, the two-dimensional color image data are expressed mathematically as a tensor of size (H, W, 3), where H, W and 3 respectively denote the height, the width and the three RGB channels of the image. The two-dimensional color image data are convolved multiple times with a 3 × 3 convolution kernel, and each result is used to build a tree-like structure. Taking the output of the current convolution as the input of a later convolution constructs skip links, which better aggregate the depth and width of the network, and finally an image three-dimensional tensor of size (H/4, W/4, 256) is obtained.
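As an illustration of this image branch, the sketch below builds a small 3 × 3-convolution backbone with one skip link that maps an (H, W, 3) image to an (H/4, W/4, 256)-style tensor; the layer count, channel widths and the single lateral connection are illustrative assumptions, not the patent's exact network.

```python
import torch
import torch.nn as nn

class TinyImageBackbone(nn.Module):
    """Illustrative 3x3-conv backbone with a skip link: the output of an earlier
    convolution is re-used as input to a later one, and the image is downsampled
    by 4 to give a 256-channel (H/4, W/4) feature tensor."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3,   64,  3, stride=2, padding=1)   # H/2
        self.conv2 = nn.Conv2d(64,  128, 3, stride=2, padding=1)   # H/4
        self.conv3 = nn.Conv2d(128, 256, 3, stride=1, padding=1)
        self.conv4 = nn.Conv2d(256, 256, 3, stride=1, padding=1)
        self.lateral = nn.Conv2d(128, 256, 1)                      # skip link from conv2
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x1 = self.relu(self.conv1(x))
        x2 = self.relu(self.conv2(x1))
        x3 = self.relu(self.conv3(x2))
        x4 = self.conv4(x3) + self.lateral(x2)   # skip link (jump connection)
        return self.relu(x4)                     # (B, 256, H/4, W/4)

feat = TinyImageBackbone()(torch.randn(1, 3, 384, 1280))
print(feat.shape)   # torch.Size([1, 256, 96, 320])
```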
Specifically, the extreme scenes of the three-dimensional space point cloud data include a strong light scene, a smoke scene, a rain scene or/and a snow scene, and the like;
the extreme scenes of the two-dimensional color image data comprise a rainy night scene, a snowy night scene, a daytime rainy scene or/and a daytime snowy scene and the like.
S2, training a model for the first time: on the basis of sparse coding, for the modal feature tensors of different dimensions obtained in step S1, the feature-map scales other than the channel dimension are taken for dictionary set training, and the dictionary set loss terms corresponding to the dictionary sets are incorporated into the overall loss function during model training; a first fusion detection network model that does not consider the interrelation among the multi-modal data is constructed and then trained, and once the training of the first fusion detection network model is finished, perfect dictionary sets corresponding to the modal data of different dimensions are obtained at the same time;
specifically, the first fusion detection network model in step S2 includes a 3D target detection model, an SMOKE task network, a VSA extraction module, a feature fusion module, a refinement module, and an overall loss function, and the first model training specifically includes the following steps:
s21, defining dictionary set and constructing dictionary set loss items: after obtaining modal feature tensors of different dimensions, respectively defining corresponding dictionary sets for the modal feature tensors of different dimensions according to sparse coding definition, and carrying out averaging according to the number of channels of the corresponding modal feature tensors to construct corresponding dictionary set loss terms;
specifically, the dictionary set comprises a three-dimensional dictionary set corresponding to three-dimensional space point cloud data and a two-dimensional dictionary set corresponding to two-dimensional color image data; dictionary set loss terms also include three-dimensional dictionary set loss terms and two-dimensional dictionary set loss terms.
In this embodiment, taking the image three-dimensional tensor as an example: after the (H/4, W/4, 256) image tensor is obtained, a set of 256 two-dimensional dictionaries can be defined according to the sparse coding definition. For one of the 256 channels, the input Y of size (H/4, W/4) is represented under the corresponding dictionary D of size (H/4, K) by a coefficient matrix X of size (K, W/4), with error matrix R = Y - DX. The final iterative training objective is to make the modal input restored by the dictionary set and the corresponding coefficient set as close to the original as possible (the error as small as possible) while keeping the coefficient matrix as sparse as possible, i.e. the input is sufficiently represented. Averaging over the channels of the image three-dimensional tensor gives the corresponding two-dimensional dictionary set loss term; by the same principle, a three-dimensional dictionary set can be defined and the corresponding three-dimensional dictionary set loss term constructed.
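The loss-term construction can be sketched as follows; the per-channel layout and the L1 sparsity penalty with its weight are illustrative assumptions, and only the structure (per-channel reconstruction error R = Y - DX, averaged over the channels) follows the description above.

```python
import numpy as np

def dictionary_set_loss(feature_map, dictionaries, coeffs, l1_weight=0.01):
    """Channel-averaged dictionary-set loss for one modal feature tensor.
    feature_map:  (H, W, C)  modal feature tensor (channel last)
    dictionaries: (C, H, K)  one dictionary D_c per channel
    coeffs:       (C, K, W)  one coefficient matrix X_c per channel
    Returns mean_c( ||Y_c - D_c X_c||_F^2 + l1 * ||X_c||_1 ): reconstruction
    error plus a sparsity penalty, averaged over the C channels."""
    H, W, C = feature_map.shape
    loss = 0.0
    for c in range(C):
        Y = feature_map[:, :, c]                 # (H, W)
        R = Y - dictionaries[c] @ coeffs[c]      # error matrix R = Y - D X
        loss += np.sum(R ** 2) + l1_weight * np.sum(np.abs(coeffs[c]))
    return loss / C

# Toy example with the (H/4, W/4, 256) image tensor shrunk for readability
H, W, C, K = 8, 12, 4, 16
Y = np.random.randn(H, W, C)
D = np.random.randn(C, H, K)
X = np.random.randn(C, K, W)
print(dictionary_set_loss(Y, D, X))
```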
S22, constructing a foreground point loss term and a 3D bounding box loss term: calculating to obtain a foreground point loss item and a 3D bounding box loss item according to a rough 3D bounding box obtained by the output of the 3D target detection model and a truth label in the training data set;
s23, constructing a key point loss item: obtaining a key point loss item according to the target 2D key point obtained by the SMOKE task network and the category of the target;
s24, constructing a confidence coefficient loss term and a correction loss term: obtaining a confidence coefficient loss term and a correction loss term according to the confidence coefficient prediction result and the frame correction parameter obtained by the fine correction module;
s25, constructing and forming a first fusion detection network model without considering the interrelation among the multi-modal data: the overall loss function of the first fusion detection network model comprises a foreground point loss term, a 3D bounding box loss term, a key point loss term, a confidence coefficient loss term, a correction loss term and a dictionary set loss term;
s26, training a first fusion detection network model: training the first fusion detection network model by using the modal feature tensor extracted in the step S1, and finishing the training of the first fusion detection network model when the overall loss function reaches the minimum; meanwhile, the dictionary set is trained, and a perfect dictionary set corresponding to different dimension modal data is obtained.
In this embodiment, the specific training process of the first fusion detection network model in step S26 is as follows:
S261, 3D target preliminary identification stage: the 3D target detection model takes the four-dimensional tensor corresponding to the three-dimensional space point cloud data finally output in step S1 and uses its projection aerial view as the point cloud aerial view tensor, whose dimensions are (H', W', 128); the classification branch of the 3D target detection model determines the foreground points of targets in the point cloud aerial view tensor, the regression branch regresses the 2D aerial view frame of the target, the projection of the point cloud space along the Z axis is then ignored to obtain a rough 3D bounding box, and the three-dimensional space point cloud data are divided into the foreground and background of the target;
S262, determining a 3D central point: from the image three-dimensional tensor corresponding to the two-dimensional color image data in step S1, the key point branch of the SMOKE task network outputs the 2D key points of the target in the two-dimensional color image data, and the 3D key points of the target are deduced in reverse by combining the camera internal parameters; assuming the camera internal parameter matrix is K, the mapping between the 2D key points and the 3D key points is as follows:
z_c · [u, v, 1]^T = K · [x, y, z]^T, i.e. [x, y, z]^T = z_c · K^(-1) · [u, v, 1]^T, where (u, v) is the 2D key point, (x, y, z) is the 3D key point in the camera coordinate system and z_c is its depth;
searching a point closest to the 3D key point in the three-dimensional space point cloud data according to an Euclidean distance method, and taking the point as a 3D central point of a target;
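A sketch of this step, assuming the standard pinhole back-projection written above and an explicitly supplied depth estimate (the patent does not state where the depth comes from); the intrinsic values below are example numbers only.

```python
import numpy as np

def backproject_keypoint(uv, depth, K):
    """Back-project a 2D key point (u, v) with an assumed depth estimate z_c
    through the camera intrinsics K, using z_c * [u, v, 1]^T = K [x, y, z]^T."""
    uv1 = np.array([uv[0], uv[1], 1.0])
    return depth * (np.linalg.inv(K) @ uv1)          # (x, y, z) in the camera frame

def nearest_cloud_point(keypoint_3d, cloud):
    """Pick the lidar point closest (Euclidean distance) to the 3D key point;
    this point serves as the target's 3D center point."""
    d = np.linalg.norm(cloud[:, :3] - keypoint_3d, axis=1)
    return cloud[np.argmin(d), :3]

K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0,   0.0,   1.0]])                  # illustrative intrinsics
kp3d = backproject_keypoint((640.0, 180.0), depth=12.0, K=K)
cloud = np.random.rand(5000, 4) * 40.0
print(kp3d, nearest_cloud_point(kp3d, cloud))
```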
S263, VSA extraction stage: the VSA module aggregates the point cloud sparse tensor obtained in step S1 and the four-dimensional tensor obtained by each three-dimensional convolution operation around the 3D central points to obtain the 3D central point neighborhood set of each target, and one max pooling operation yields the voxel feature f_i^(pvk) corresponding to the 3D central point:
f_i^(pvk) = maxpool({ [ f_j^(k) ; v_j^(k) - p_i ] }),
where f_j^(k) is the C-dimensional feature vector of a voxel block at the k-th scale, p_i is the predicted 3D central point, and v_j^(k) is the coordinate of that voxel block in the voxel feature;
then all the voxel features f_i^(pvk) are combined into the multi-level voxel feature f_i^(pv) of the 3D central point, f_i^(pv) = [f_i^(pv1), f_i^(pv2), f_i^(pv3), f_i^(pv4)], for i = 1, ..., n;
a VSA operation on the initial three-dimensional space point cloud data with the 3D central points gives the global feature f_i^(raw), and combining the global feature f_i^(raw) with the multi-level voxel feature f_i^(pv) gives the two-dimensional modal data related feature f_i^(rgb) = [f_i^(raw), f_i^(pv)];
projecting the 3D central point onto the point cloud aerial view and applying bilinear interpolation gives the aerial view interpolation feature f_i^(bev), which is closely related to the key points of the point cloud voxel method;
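A simplified numpy sketch of the aggregation just described: gather voxel features near a 3D central point, append the relative offsets, max-pool, and concatenate across scales. The radius, feature widths and two-scale setup are assumptions made for the example.

```python
import numpy as np

def vsa_pool(center, voxel_coords, voxel_feats, radius=2.0):
    """One scale: take voxels within `radius` of the 3D center point, concatenate
    [feature ; relative offset] and max-pool over the neighborhood (f_i^(pvk))."""
    offsets = voxel_coords - center
    mask = np.linalg.norm(offsets, axis=1) < radius
    if not mask.any():
        return np.zeros(voxel_feats.shape[1] + 3)
    neighborhood = np.concatenate([voxel_feats[mask], offsets[mask]], axis=1)
    return neighborhood.max(axis=0)

def multi_level_feature(center, scales):
    """Concatenate the pooled features of all convolution scales into f_i^(pv)."""
    return np.concatenate([vsa_pool(center, c, f) for c, f in scales])

# Toy example: two scales with different voxel counts and feature widths
scales = [(np.random.rand(300, 3) * 10, np.random.randn(300, 16)),
          (np.random.rand(80, 3) * 10, np.random.randn(80, 32))]
center = np.array([5.0, 5.0, 5.0])
print(multi_level_feature(center, scales).shape)   # (16+3) + (32+3) = 54
```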
s264, a first feature fusion stage: the feature fusion module performs add operation on the two-dimensional modal data relevant features and the aerial view interpolation features to realize simple splicing to obtain first fusion features;
S265, a refinement stage: the first fusion features and the rough 3D bounding box are input into the refinement module; 6 × 6 grid points are uniformly sampled inside the rough 3D bounding box, a variable radius is set, the final features of nearby central points are aggregated and regularized as the features of the grid points, the feature f_i^(g) corresponding to each rough 3D bounding box is generated through a multilayer perceptron and max pooling, and finally a 256-dimensional feature vector is obtained through a two-layer MLP network, giving the confidence prediction result and the frame correction parameters;
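A rough sketch of the grid-point feature gathering in this refinement stage. An n × n × n grid inside the box is assumed here (the text above mentions 6 × 6 grid points, so the third dimension is an assumption), the grid is kept axis-aligned for brevity, the radius and feature width are illustrative, and the MLP that follows is omitted.

```python
import numpy as np

def grid_point_features(box_center, box_size, center_points, center_feats,
                        n=6, radius=1.0):
    """Uniformly sample an n x n x n grid inside the coarse 3D box (axis-aligned,
    heading ignored) and give each grid point the max-pooled feature of the 3D
    center points within `radius` of it."""
    lin = [np.linspace(-s / 2, s / 2, n) for s in box_size]
    gx, gy, gz = np.meshgrid(*lin, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3) + box_center   # (n^3, 3)
    feats = np.zeros((grid.shape[0], center_feats.shape[1]))
    for i, g in enumerate(grid):
        mask = np.linalg.norm(center_points - g, axis=1) < radius
        if mask.any():
            feats[i] = center_feats[mask].max(axis=0)
    return feats   # would then be fed to an MLP + max pooling for the box feature

grid_feats = grid_point_features(np.array([10.0, 2.0, -0.5]), (3.9, 1.6, 1.5),
                                 np.random.rand(50, 3) * 10, np.random.randn(50, 64))
print(grid_feats.shape)   # (216, 64)
```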
s266, completing the training of the dictionary set: and repeating the steps S261-S265 until the overall loss function reaches the minimum, finishing the training of the first fusion detection network model, and simultaneously finishing the training of the dictionary set to obtain the perfect dictionary sets corresponding to the different-dimension modal data.
S3, training a second model: calculating the modal sparsity corresponding to different dimensionality modal data by using the obtained perfect dictionary set, calculating a superposition coefficient distribution proportion according to the modal sparsity, constructing and forming a second fusion detection network model considering the interrelation among the multi-modal data, then training the second fusion detection network model, and correcting the weight of the relevant features to be fused according to the superposition coefficient distribution proportion in the feature fusion stage of the training process until the second fusion detection network model is trained;
specifically, the second fusion detection network model in step S3 includes a perfect dictionary set, a 3D target detection model, a SMOKE task network, a VSA extraction module, a feature fusion module, a Softmax function, a refinement module, and an overall loss function, and the second model training specifically includes the steps of:
S31, calculating the modal sparsity corresponding to the modal data of each dimension by using the obtained perfect dictionary sets, forming the modal sparsities into a vector, mapping the vector with a softmax function into the [0,1] interval with components summing to 1, and thereby obtaining the superposition coefficient distribution proportion corresponding to the modal data of different dimensions;
After the training of step S2, each modality has obtained its own perfect dictionary set, and the extracted data can be restored with the corresponding dictionary set and coefficient matrix; although the restoration is not 100%, it roughly reflects how sufficiently the modal data represent the complete information of the first stage.
In this embodiment, taking the two-dimensional dictionary set (H/4, W/4, 256) as an example, the mathematical expression of the sparsity is the ratio between the squared Frobenius norm of the coefficient matrix (K, W/4) (that is, the sum of the squared absolute values of its elements) and the square of its largest element multiplied by the matrix size K × W/4.
The feature tensor channels of the modal data of different dimensions differ, so the number of layers in their dictionary sets also differs; the per-layer values are therefore averaged and taken as the sparsity of the input information of that modal feature tensor.
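Putting the sparsity and weighting steps together, the following small sketch (assuming the ratio-based sparsity described above and plain softmax normalization) shows how the superposition coefficient distribution proportion could be computed:

```python
import numpy as np

def coeff_sparsity(X):
    """Sparsity of one coefficient matrix X (K, W): squared Frobenius norm divided
    by (largest element squared * number of elements), per the description above."""
    return np.sum(X ** 2) / (np.max(np.abs(X)) ** 2 * X.size + 1e-12)

def modal_sparsity(coeff_stack):
    """Average the per-channel (per-layer) sparsities of one modality's dictionary set."""
    return float(np.mean([coeff_sparsity(X) for X in coeff_stack]))

def fusion_weights(sparsities):
    """Map the vector of modal sparsities through softmax to [0, 1] weights that
    sum to 1 (the superposition coefficient distribution proportion)."""
    s = np.array(sparsities)
    e = np.exp(s - s.max())
    return e / e.sum()

# Toy example: image modality (256 channels) vs point-cloud modality (128 channels)
img_coeffs = [np.random.randn(16, 12) for _ in range(256)]   # denser coefficients
pc_coeffs = []
for _ in range(128):                                         # visibly sparser coefficients
    X = np.random.randn(16, 12)
    X[np.abs(X) < 1.5] = 0.0
    pc_coeffs.append(X)
w = fusion_weights([modal_sparsity(img_coeffs), modal_sparsity(pc_coeffs)])
print(w, w.sum())   # these weights re-scale the modal features before they are spliced
```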
S32, constructing a foreground point loss term and a 3D bounding box loss term: same as step S22;
s33, constructing a key point loss item: same as step S23;
s34, constructing a confidence coefficient loss term and a correction loss term: same as step S24;
s35, constructing and forming a second fusion detection network model considering the interrelation among the multi-modal data: the overall loss function of the second fusion detection network model comprises a foreground point loss term, a 3D bounding box loss term, a key point loss term, a confidence coefficient loss term and a correction loss term;
s36, training a second fusion detection network model: and (5) training the second fusion detection network model by using the modal feature tensor extracted in the step (S1), and finishing the training of the second fusion detection network model when the overall loss function reaches the minimum.
In this embodiment, the specific training process of the second fusion detection network model in step S36 is as follows:
s361, a 3D target preliminary identification stage: same as step S261;
s362, determining a 3D center point stage: same as step S262;
s363, VSA extraction stage: same as step 263;
s364, a second feature fusion stage: the feature fusion module performs fusion splicing on the two-dimensional modal data relevant features and the aerial view interpolation features according to the superposition coefficient distribution proportion to obtain second fusion features;
s365, a fine modification stage: inputting the second fusion characteristics and the rough 3D bounding box into a fine modification module, and calculating by the fine modification module to obtain a confidence prediction result and a frame modification parameter;
s366, a second fusion detection network model training completion stage: and repeating the steps S361 to S365 until the overall loss function reaches the minimum and the training of the second fusion detection network model is finished.
S4, detection application: and (3) performing the feature extraction operation of the step (S1) on modal data of different dimensions in the same space-time in the actual driving scene to obtain a corresponding real-time modal feature tensor, inputting the real-time modal feature tensor into the trained second fusion detection network model, predicting the traffic participant, and outputting the second fusion detection network model to obtain the final target detection effect.
Specifically, the specific process of detecting the application in step S4 is as follows:
s41, performing the feature extraction operation of the step S1 on modal data of different dimensions in the same space-time in an actual driving scene to obtain a corresponding real-time modal feature tensor, and inputting the real-time modal feature tensor into a second fusion detection network model;
s42, predicting the traffic participants by the second fusion detection network model, outputting a rough 3D bounding box by the 3D target detection model, and outputting a confidence prediction result and a frame correction parameter by the fine modification module;
S43, synthesizing the three outputs, namely the rough 3D bounding box, the confidence prediction result and the frame correction parameters: the detected traffic participant is marked with a new 3D bounding box and the probability of the target's category is obtained, which is the result of the 3D target detection.
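A minimal sketch of step S43, assuming a residual box parameterization (center offsets, log-size scales, yaw offset) for the frame correction parameters, which the patent does not specify, and a confidence threshold chosen only for the example:

```python
import numpy as np

def refine_box(coarse_box, correction, confidence, threshold=0.5):
    """Combine the three outputs: apply residual corrections
    (dx, dy, dz, dl, dw, dh, dtheta) to the coarse 3D box (x, y, z, l, w, h, theta)
    and keep the box only if its confidence clears the threshold."""
    if confidence < threshold:
        return None
    box = np.array(coarse_box, dtype=float)
    box[:3] += correction[:3]             # shift the box center
    box[3:6] *= np.exp(correction[3:6])   # apply the size residuals as log-scales
    box[6] += correction[6]               # correct the yaw angle
    return box

coarse = (12.0, 1.5, -0.8, 3.9, 1.6, 1.5, 0.1)
corr = np.array([0.2, -0.1, 0.05, 0.02, 0.0, -0.01, 0.03])
print(refine_box(coarse, corr, confidence=0.83))
```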
A multi-modal target detection system adapted for modal intensity variation, comprising: a sufficiently large training data set, wherein the training data set comprises modal data of different dimensions in the same time and space under various extreme scenes;
the characteristic extraction module is used for extracting characteristics of modal data with different dimensions in the training data set to obtain corresponding modal characteristic tensors;
the dictionary set is used for taking the scale of a feature map except a channel for carrying out dictionary set training on the acquired modal feature tensors with different dimensions on the basis of sparse coding, obtaining a complete dictionary set corresponding to modal data with different dimensions after the training of the first fusion detection network model is finished, and calculating the modal sparsity corresponding to the modal data with different dimensions through the complete dictionary set;
the SMOKE task network outputs and obtains 2D key points of a target in the two-dimensional color image data by using the key point branches of the SMOKE task network according to the three-dimensional tensor of the image corresponding to the two-dimensional color image data, and reversely deduces the 3D key points of the target by combining camera internal parameters; searching a point closest to the 3D key point in the three-dimensional space point cloud data according to an Euclidean distance method, and taking the point as a 3D central point of a target;
the VSA extraction module is used for carrying out VSA operation on the modal characteristic tensor obtained by each three-dimensional convolution operation in the characteristic extraction process to obtain multi-level voxel characteristics combined with the 3D central point;
carrying out VSA operation on the initial three-dimensional space point cloud data by using the 3D central point to obtain global characteristics;
projecting the 3D central point onto a point cloud aerial view, and performing bilinear interpolation to obtain an aerial view interpolation feature;
obtaining two-dimensional modal data relevant characteristics by combining the global characteristics and the multi-level voxel characteristics;
the feature fusion module is used for performing add operation on the two-dimensional modal data related features and the aerial view interpolation features in the process of training the first fusion detection network model, so that simple splicing is realized, and first fusion features are obtained;
in the process of training the second fusion detection network model, the feature fusion module performs fusion splicing on the two-dimensional modal data relevant features and the aerial view interpolation features according to the superposition coefficient distribution proportion to obtain second fusion features;
the 3D target detection model, which receives the projection aerial view of the modal feature tensor corresponding to the three-dimensional space point cloud data finally output by the feature extraction module as the point cloud aerial view tensor and outputs a rough 3D bounding box;
the fine modification module is used for combining the fusion features (the first fusion features or the second fusion features) and the rough 3D bounding box and outputting a modified feature vector, and the modified feature vector is used for predicting confidence and modifying the rough 3D bounding box;
the integral loss function is used for measuring the inconsistency degree of the predicted value and the true value of the fusion detection network model, and when the integral loss function reaches the minimum value, the training of the fusion detection network model is finished;
in the process of training the first fusion detection network model, the overall loss function comprises a foreground point loss term, a 3D bounding box loss term, a key point loss term, a confidence coefficient loss term, a correction loss term, a two-dimensional dictionary set loss term and a three-dimensional dictionary set loss term;
in the process of training the second fusion detection network model, the overall loss function comprises a foreground point loss term, a 3D bounding box loss term, a key point loss term, a confidence coefficient loss term and a correction loss term;
and the Softmax function is used for forming a vector by the modal sparsity corresponding to the modal data of different dimensions in the training process of the second fusion detection network model, mapping the vector to a [0,1] space by using the Softmax function, and obtaining the distribution proportion of the superposition coefficients corresponding to the modal data of different dimensions by setting the sum to 1.
The method considers the strength relation among the modal data: under extreme conditions, even if the modal data received by some sensors are poor and do not provide enough useful information, they are not simply superimposed in the feature-layer fusion stage; instead, the method performs feature fusion according to reasonably distributed weight coefficients, which effectively guarantees the effectiveness and robustness of the final perception task;
according to the method, the strength relation among different modal data is quantized through a reasonable and effective technical means, the modal sparsity of the modal data with different dimensions is defined through the mathematical expression of a sparse coding set, and further, the characteristics are fully considered when the characteristic level fusion is carried out, so that the final 3D target detection and identification effect and accuracy are improved;
in addition, compared with the traditional point cloud identification system, the target detection system disclosed by the invention has the advantages that 2D key points of the target in the two-dimensional color image data are identified and detected, 3D key points of the target are reversely deduced according to the 2D key points of the target, the three-dimensional space point cloud data are accurately screened according to the characteristics of the 3D key points, the operation parameters can be greatly saved, and the 3D target identification accuracy rate is improved.
In the present invention, terms such as "upper", "lower", "left", "right", "front", "rear", "vertical", "horizontal", "side", "bottom", etc. indicate orientation and/or positional relationship based on the orientation and/or positional relationship shown in the drawings, and are only relational terms determined for convenience of describing structural relationship of the respective parts or/and elements of the present invention, and do not particularly refer to any parts or/and elements of the present invention, and are not to be construed as limiting the present invention.

Claims (8)

1. A multi-modal target detection method suitable for modal intensity variation is characterized in that: the method comprises the following steps:
s1, feature extraction of multi-modal information: selecting a large number of training data sets, wherein the training data sets comprise modal data of different dimensions in the same time and space under various extreme scenes, and acquiring corresponding modal feature tensors by using corresponding feature extraction modules aiming at the modal data of different dimensions;
s2, training a model for the first time: on the basis of sparse coding, aiming at the modal feature tensors with different dimensions obtained in the step S1, feature graph scales of the modal feature tensors except the channel are taken for dictionary set training, dictionary set loss terms corresponding to the dictionary set are incorporated into an overall loss function during model training, a first fusion detection network model without considering the interrelation among multimodal data is constructed, then the first fusion detection network model is trained, and after the first fusion detection network model is trained, perfect dictionary sets corresponding to the modal data with different dimensions can be obtained at the same time;
s3, training a second model: calculating the modal sparsity corresponding to different dimensionality modal data by using the obtained perfect dictionary set, calculating a superposition coefficient distribution proportion according to the modal sparsity, constructing and forming a second fusion detection network model considering the interrelation among the multi-modal data, then training the second fusion detection network model, and correcting the weight of the relevant features to be fused according to the superposition coefficient distribution proportion in the feature fusion stage of the training process until the second fusion detection network model is trained;
s4, detection application: carrying out the feature extraction operation of the step S1 on modal data of different dimensions in the same time and space in an actual driving scene to obtain a corresponding real-time modal feature tensor, inputting the real-time modal feature tensor into a trained second fusion detection network model, predicting traffic participants, and outputting the second fusion detection network model to obtain a final target detection effect;
in step S2, the first fusion detection network model includes a 3D target detection model, a SMOKE task network, a VSA extraction module, a feature fusion module, a refinement module, and an overall loss function, and the first model training specifically includes:
s21, defining a dictionary set and constructing a dictionary set loss item: after obtaining modal feature tensors of different dimensions, respectively defining corresponding dictionary sets for the modal feature tensors of different dimensions according to sparse coding definition, and carrying out averaging according to the number of channels of the corresponding modal feature tensors to construct corresponding dictionary set loss terms;
s22, constructing a foreground point loss term and a 3D bounding box loss term: calculating to obtain foreground point loss items and 3D surrounding frame loss items according to a rough 3D surrounding frame obtained by outputting of the 3D target detection model and a truth value label in the training data set;
s23, constructing a key point loss item: obtaining a key point loss item according to the target 2D key point obtained by the SMOKE task network and the category of the target;
s24, constructing a confidence coefficient loss term and a correction loss term: obtaining a confidence coefficient loss term and a correction loss term according to the confidence coefficient prediction result and the frame correction parameter obtained by the fine correction module;
s25, constructing and forming a first fusion detection network model without considering the interrelation among the multi-modal data: the overall loss function of the first fusion detection network model comprises a foreground point loss term, a 3D bounding box loss term, a key point loss term, a confidence coefficient loss term, a correction loss term and a dictionary set loss term;
s26, training a first fusion detection network model: training the first fusion detection network model by using the modal feature tensor extracted in the step S1, and finishing the training of the first fusion detection network model when the overall loss function reaches the minimum; meanwhile, the dictionary set is trained, and a perfect dictionary set corresponding to different dimension modal data is obtained;
in step S3, the second fusion detection network model includes a perfect dictionary set, a 3D target detection model, a SMOKE task network, a VSA extraction module, a feature fusion module, a Softmax function, a refinement module, and an overall loss function, and the second model training specifically includes the steps of:
s31, calculating the modal sparsity corresponding to different dimension modal data by using the obtained perfect dictionary set, forming a vector by using the modal sparsity corresponding to the different dimension modal data, mapping the vector to a [0,1] space by using a softmax function, and obtaining the superposition coefficient distribution proportion corresponding to the different dimension modal data by setting the sum to 1;
s32, constructing a foreground point loss term and a 3D bounding box loss term: same as step S22;
s33, constructing a key point loss item: same as step S23;
s34, constructing a confidence coefficient loss term and a correction loss term: same as step S24;
s35, constructing and forming a second fusion detection network model considering the interrelation among the multi-modal data: the overall loss function of the second fusion detection network model comprises a foreground point loss term, a 3D bounding box loss term, a key point loss term, a confidence coefficient loss term and a correction loss term;
s36, training a second fusion detection network model: and (5) training the second fusion detection network model by using the modal feature tensor extracted in the step (S1), and finishing the training of the second fusion detection network model when the overall loss function reaches the minimum.
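For steps S21 and S26 of claim 1, a common sparse-coding form of a dictionary set loss is the reconstruction error plus an L1 penalty, averaged over the channels of the modal feature tensor; the sketch below follows that convention and is only an assumed concrete form, since the claim does not fix the exact expression:

import numpy as np

def dictionary_set_loss(features, dictionary, codes, l1_weight=0.1):
    """Per-channel sparse-coding loss  ||X - A D^T||_F^2 + lambda * ||A||_1,
    averaged over the channel dimension of the modal feature tensor.

    features   : (C, N)  one feature vector of length N per channel
    dictionary : (N, K)  dictionary atoms stored as columns
    codes      : (C, K)  sparse codes, one row per channel
    """
    recon = codes @ dictionary.T                        # (C, N) reconstruction
    recon_err = np.sum((features - recon) ** 2, axis=1)
    l1_pen = l1_weight * np.sum(np.abs(codes), axis=1)
    return np.mean(recon_err + l1_pen)                  # average over channels

C, N, K = 64, 256, 128                                  # illustrative sizes
X = np.random.randn(C, N)
D = np.random.randn(N, K)
A = np.random.randn(C, K) * (np.random.rand(C, K) < 0.1)   # mostly zero codes
print(dictionary_set_loss(X, D, A))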
2. The method for multi-modal object detection as claimed in claim 1, characterized in that: the process of extracting features from the three-dimensional spatial point cloud data in step S1 is as follows: firstly, a point cloud sparse tensor is obtained from the three-dimensional space point cloud data by voxel partitioning, point cloud grouping and VFE (voxel feature encoding) coding; then, a plurality of three-dimensional convolution operations are performed on the point cloud sparse tensor with three-dimensional convolution kernels, each three-dimensional convolution operation yielding a four-dimensional tensor representing a corresponding scale;
the process of extracting features from the two-dimensional color image data in step S1 is as follows: the image tensor corresponding to the two-dimensional color image data is convolved multiple times with 3 x 3 convolution kernels, the results of the successive convolution operations are organized into a tree-like structure, the output of the current convolution operation is taken as the input of a later convolution operation to construct jump links, and finally the three-dimensional tensor of the image is obtained.
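A minimal PyTorch-style sketch of the image branch of claim 2: repeated 3 x 3 convolutions whose intermediate output is routed forward through a jump link in a tree-like pattern. The layer widths, the single skip connection and the pooling used to match resolutions are illustrative assumptions rather than the patent's exact backbone:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipImageBackbone(nn.Module):
    """Stack of 3x3 convolutions where an earlier stage's output is fed,
    via a jump link, into a later stage, yielding the 3-D image tensor."""
    def __init__(self, in_ch=3, width=32):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(2 * width, width, 3, padding=1), nn.ReLU())

    def forward(self, img):
        f1 = self.stage1(img)
        f2 = self.stage2(f1)
        f1_down = F.avg_pool2d(f1, 2)          # jump link, resolution-matched
        return self.stage3(torch.cat([f1_down, f2], dim=1))

feat = SkipImageBackbone()(torch.randn(1, 3, 384, 1280))   # -> (1, 32, 192, 640)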
3. The method of claim 2, characterized in that: the extreme scenes of the three-dimensional space point cloud data comprise strong light scenes, smoke scenes, rain scenes or/and snow scenes;
the extreme scenes of the two-dimensional color image data include a rainy night scene, a snowy night scene, a daytime rainy scene or/and a daytime snowy scene.
4. The method of claim 2, characterized in that: the dictionary set comprises a three-dimensional dictionary set corresponding to the three-dimensional space point cloud data and a two-dimensional dictionary set corresponding to the two-dimensional color image data; the dictionary set loss terms likewise include a three-dimensional dictionary set loss term and a two-dimensional dictionary set loss term.
5. The method for multi-modal object detection as claimed in claim 4, wherein the method comprises the following steps: the specific training process of the first fusion detection network model in step S26 is as follows:
s261, 3D target preliminary identification stage: the 3D target detection model takes the four-dimensional tensor corresponding to the three-dimensional spatial point cloud data finally output in step S1 and uses its projected aerial view as the point cloud aerial view tensor; the classification branch of the 3D target detection model determines the foreground points of the target in the point cloud aerial view tensor, the regression branch regresses the 2D aerial view frame of the target (the Z axis of the point cloud space being omitted by the projection), a rough 3D surrounding frame is obtained, and the three-dimensional spatial point cloud data are divided into the foreground and the background of the target;
s262, determining a 3D central point: outputting to obtain 2D key points of the target in the two-dimensional color image data by using the key point branch of the SMOKE task network according to the three-dimensional tensor of the image corresponding to the two-dimensional color image data in the step S1, and reversely deducing the 3D key points of the target by combining camera internal parameters; searching a point closest to the 3D key point in the three-dimensional space point cloud data according to an Euclidean distance method, and taking the point as a 3D central point of a target;
s263, VSA extraction stage: respectively converging the point cloud sparse tensor obtained in the step S1 and the four-dimensional tensor obtained by each three-dimensional convolution operation to a 3D central point by using a VSA module to obtain a 3D central point field set of each target, obtaining voxel characteristics corresponding to the 3D central point by max pooling operation, and obtaining multi-level voxel characteristics of the 3D central point by combining all the voxel characteristics;
performing VSA operation on the initial three-dimensional space point cloud data by using the 3D central point to obtain global characteristics, and combining the global characteristics with multi-level voxel characteristics to obtain two-dimensional modal data related characteristics;
projecting the 3D central point onto a point cloud aerial view, and performing bilinear interpolation to obtain an aerial view interpolation characteristic;
s264, a first feature fusion stage: the feature fusion module performs add operation on the two-dimensional modal data relevant features and the aerial view interpolation features to realize simple splicing to obtain first fusion features;
s265, a fine trimming stage: inputting the first fusion characteristics and the rough 3D bounding box into a fine modification module, and calculating by the fine modification module to obtain a confidence prediction result and a frame modification parameter;
s266, completing the training of the dictionary set: and repeating the steps S261-S265 until the overall loss function reaches the minimum, finishing the training of the first fusion detection network model, and simultaneously finishing the training of the dictionary set to obtain the perfect dictionary sets corresponding to the different-dimension modal data.
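The aerial-view interpolation of step S263 can be sketched as a bilinear lookup of the point cloud aerial-view feature map at the projected 3D central point; the grid size and channel count below are illustrative assumptions:

import numpy as np

def aerial_view_interpolate(bev, x, y):
    """Bilinearly interpolate a (C, H, W) aerial-view feature map at the
    continuous grid position (x, y), x indexing width and y indexing height."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, bev.shape[2] - 1)
    y1 = min(y0 + 1, bev.shape[1] - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * bev[:, y0, x0] + wx * bev[:, y0, x1]
    bottom = (1 - wx) * bev[:, y1, x0] + wx * bev[:, y1, x1]
    return (1 - wy) * top + wy * bottom       # (C,) interpolation feature

bev = np.random.randn(128, 200, 176)          # illustrative aerial-view map
feat = aerial_view_interpolate(bev, 88.4, 103.7)   # projected 3D central point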
6. The method of claim 5, characterized in that: the specific training process of the second fusion detection network model in step S36 is as follows:
s361, a 3D target preliminary identification stage: same as step S261;
s362, determining a 3D center point stage: same as step S262;
s363, VSA extraction stage: same as step S263;
s364, a second feature fusion stage: the feature fusion module performs fusion splicing on the two-dimensional modal data relevant features and the aerial view interpolation features according to the superposition coefficient distribution proportion to obtain second fusion features;
s365, a fine modification stage: inputting the second fusion characteristics and the rough 3D bounding box into a fine modification module, and calculating by the fine modification module to obtain a confidence prediction result and a frame modification parameter;
s366, a second fusion detection network model training completion stage: and repeating the steps S361 to S365 until the overall loss function reaches the minimum and the training of the second fusion detection network model is finished.
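Step S364 replaces the plain add of the first training round with a fusion weighted by the superposition coefficient distribution proportion; a minimal sketch is given below, where the coefficient values are illustrative and the choice of a weighted sum (rather than a weighted concatenation) is an assumption:

import numpy as np

def weighted_fusion(image_feat, bev_feat, coeffs):
    """Scale each modality's feature by its superposition coefficient
    before combining, instead of an unweighted add."""
    return coeffs[0] * image_feat + coeffs[1] * bev_feat

img_feat = np.random.randn(256)               # two-dimensional modal data feature
bev_feat = np.random.randn(256)               # aerial view interpolation feature
fused = weighted_fusion(img_feat, bev_feat, coeffs=[0.62, 0.38])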
7. The method of claim 4, characterized in that: the specific process of the detection application in step S4 is as follows:
s41, performing the feature extraction operation of the step S1 on modal data of different dimensions in the same space-time in an actual driving scene to obtain a corresponding real-time modal feature tensor, and inputting the real-time modal feature tensor into a second fusion detection network model;
s42, predicting the traffic participants by the second fusion detection network model, outputting a rough 3D bounding box by the 3D target detection model, and outputting a confidence prediction result and a frame correction parameter by the fine modification module;
s43, synthesizing three output information of the rough 3D surrounding frame, the confidence degree prediction result and the frame correction parameter, identifying the detected target by the traffic participant through a new 3D surrounding frame, and obtaining the probability of the type of the target, namely the result of the 3D target detection.
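Step S43 combines the rough 3D surrounding frame, the frame correction parameters and the confidence prediction into the final detection; the sketch below assumes an additive residual parameterisation of the correction and an illustrative confidence threshold, neither of which is fixed by the claim:

import numpy as np

def refine_boxes(rough_boxes, deltas, scores, score_thr=0.5):
    """Apply the frame correction parameters to the rough 3D surrounding
    frames (x, y, z, l, w, h, yaw) and keep detections whose confidence
    prediction exceeds the threshold."""
    refined = rough_boxes + deltas            # additive residuals (assumption)
    keep = scores >= score_thr
    return refined[keep], scores[keep]

rough = np.array([[12.1, -0.4, 1.6, 3.9, 1.6, 1.5, 0.02]])   # illustrative frame
delta = np.array([[0.15, 0.02, -0.05, 0.1, 0.0, 0.0, 0.01]])
boxes, conf = refine_boxes(rough, delta, np.array([0.91]))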
8. A multi-modal target detection system suitable for modal intensity variation, characterized in that the system comprises: a sufficiently large number of training data sets, wherein the training data sets comprise modal data of different dimensions at the same time and in the same space under various extreme scenes;
the characteristic extraction module is used for extracting characteristics of modal data with different dimensions in the training data set to obtain corresponding modal characteristic tensors;
the dictionary set is used for performing dictionary set training, on the basis of sparse coding, over the feature-map dimensions other than the channel dimension of the acquired modal feature tensors of different dimensions; after the training of the first fusion detection network model is finished, the perfect dictionary sets corresponding to the modal data of different dimensions are obtained, and the modal sparsity corresponding to the modal data of different dimensions is calculated through the perfect dictionary sets;
the SMOKE task network outputs and obtains 2D key points of a target in the two-dimensional color image data by using the key point branches of the SMOKE task network according to the three-dimensional tensor of the image corresponding to the two-dimensional color image data, and reversely deduces the 3D key points of the target by combining camera internal parameters; searching a point closest to the 3D key point in the three-dimensional space point cloud data according to an Euclidean distance method, and taking the point as a 3D central point of a target;
the VSA extraction module is used for carrying out VSA operation on the modal characteristic tensor obtained by each three-dimensional convolution operation in the characteristic extraction process to obtain multi-level voxel characteristics combined with the 3D central point;
carrying out VSA operation on the initial three-dimensional space point cloud data by using the 3D central point to obtain global characteristics;
projecting the 3D central point onto a point cloud aerial view, and performing bilinear interpolation to obtain an aerial view interpolation feature;
obtaining two-dimensional modal data relevant characteristics by combining the global characteristics and the multi-level voxel characteristics;
the feature fusion module is used for performing add operation on the two-dimensional modal data related features and the aerial view interpolation features in the process of training the first fusion detection network model, so that simple splicing is realized, and first fusion features are obtained;
in the process of training the second fusion detection network model, the feature fusion module performs fusion splicing on the two-dimensional modal data relevant features and the aerial view interpolation features according to the superposition coefficient distribution proportion to obtain second fusion features;
the method comprises the steps that a 3D target detection model is used for obtaining a modal feature tensor corresponding to three-dimensional space point cloud data finally output by a feature extraction module, a projection aerial view of the modal feature tensor is used as a point cloud aerial view tensor, and the point cloud aerial view tensor is input into the 3D target detection model to obtain a rough 3D surrounding frame;
the fine modification module is used for combining the fusion features and the rough 3D bounding box and outputting a modified feature vector, and the modified feature vector is used for predicting confidence and modifying the rough 3D bounding box;
the integral loss function is used for measuring the inconsistency degree of the predicted value and the true value of the fusion detection network model, and when the integral loss function reaches the minimum value, the training of the fusion detection network model is finished;
in the process of training the first fusion detection network model, the overall loss function comprises a foreground point loss term, a 3D bounding box loss term, a key point loss term, a confidence coefficient loss term, a correction loss term, a two-dimensional dictionary set loss term and a three-dimensional dictionary set loss term;
in the process of training the second fusion detection network model, the overall loss function comprises a foreground point loss term, a 3D bounding box loss term, a key point loss term, a confidence coefficient loss term and a correction loss term;
and a Softmax function, wherein in the process of training the second fusion detection network model, the modal sparsity corresponding to the different dimensional modal data forms a vector, the vector is mapped to a [0,1] space by using the Softmax function, and the sum is 1, so that the distribution proportion of the superposition coefficients corresponding to the different dimensional modal data is obtained.
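Claim 8 leaves the concrete computation of modal sparsity to the mathematical expression of the sparse coding set; one standard realisation is sketched below, coding each channel of a modal feature over the learned (perfect) dictionary with ISTA and reporting the fraction of zero coefficients. Both the solver and this particular sparsity measure are assumptions made for illustration:

import numpy as np

def sparse_codes_ista(x, D, lam=0.1, n_iter=100):
    """Solve min_a 0.5 * ||x - D a||^2 + lam * ||a||_1 with ISTA."""
    L = np.linalg.norm(D, ord=2) ** 2         # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return a

def modal_sparsity(features, D):
    """Average fraction of (near-)zero coefficients over all channels when
    each channel's feature vector is coded over the dictionary D."""
    codes = np.stack([sparse_codes_ista(f, D) for f in features])
    return float(np.mean(np.isclose(codes, 0.0)))

D = np.linalg.qr(np.random.randn(64, 64))[0]  # illustrative orthonormal dictionary
feats = np.random.randn(8, 64)                # 8 channels of length-64 features
print(modal_sparsity(feats, D))

The resulting per-modality sparsities are exactly the values that the Softmax function of the claim turns into superposition coefficients.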
CN202111566871.6A 2021-12-20 2021-12-20 Multi-modal target detection method and system suitable for modal intensity change Active CN114359660B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111566871.6A CN114359660B (en) 2021-12-20 2021-12-20 Multi-modal target detection method and system suitable for modal intensity change

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111566871.6A CN114359660B (en) 2021-12-20 2021-12-20 Multi-modal target detection method and system suitable for modal intensity change

Publications (2)

Publication Number Publication Date
CN114359660A CN114359660A (en) 2022-04-15
CN114359660B true CN114359660B (en) 2022-08-26

Family

ID=81102078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111566871.6A Active CN114359660B (en) 2021-12-20 2021-12-20 Multi-modal target detection method and system suitable for modal intensity change

Country Status (1)

Country Link
CN (1) CN114359660B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437404A (en) * 2023-10-26 2024-01-23 合肥工业大学 Multi-mode target detection method based on virtual point cloud

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052109A (en) * 2021-04-01 2021-06-29 西安建筑科技大学 3D target detection system and 3D target detection method thereof
CN113656977A (en) * 2021-08-25 2021-11-16 绵阳市维博电子有限责任公司 Coil fault intelligent diagnosis method and device based on multi-mode feature learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052911B (en) * 2017-12-20 2021-12-07 上海海洋大学 Deep learning-based multi-mode remote sensing image high-level feature fusion classification method
CN113076847B (en) * 2021-03-29 2022-06-17 济南大学 Multi-mode emotion recognition method and system
CN113706480B (en) * 2021-08-13 2022-12-09 重庆邮电大学 Point cloud 3D target detection method based on key point multi-scale feature fusion

Also Published As

Publication number Publication date
CN114359660A (en) 2022-04-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230922

Address after: Building B4, 5th Floor v3, Zhongguancun Collaborative Innovation Industrial Park, Intersection of Lanzhou Road and Chongqing Road, Baohe Economic Development Zone, Hefei City, Anhui Province, 230000

Patentee after: Anhui Guandun Technology Co.,Ltd.

Address before: 230009 No. 193, Tunxi Road, Hefei, Anhui

Patentee before: HeFei University of Technology Asset Management Co.,Ltd.

Effective date of registration: 20230922

Address after: 230009 No. 193, Tunxi Road, Hefei, Anhui

Patentee after: HeFei University of Technology Asset Management Co.,Ltd.

Address before: 230009 No. 193, Tunxi Road, Hefei, Anhui

Patentee before: Hefei University of Technology
