Disclosure of Invention
The present invention is directed to a method and a system for multi-modal object detection with missing modality, so as to solve the problems mentioned in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
A multi-modal target detection method suitable for missing modalities comprises the following steps:
S1, real modality generation stage:
S1-1, acquiring modal data of different dimensions at the same time and space based on an open-source data set;
S1-2, processing each item of modal data through a feature extraction unit, and extracting the modal feature tensor corresponding to each item of modal data;
S1-3, concatenating the obtained modal feature tensors of different dimensions along the channel dimension, inputting the concatenated tensor into an attention network unit, which assigns a randomly initialized weight coefficient to each channel, and outputting a multi-modal feature matrix stacked along the channel direction; inputting this matrix into an information fusion unit, which processes the stacked matrix with a 1x1 convolution to obtain the final multi-modal feature matrix;
S1-4, inputting the multi-modal feature matrix obtained in the previous step into a neural network unit for the designated perception task, training it on the target detection task, and obtaining, on completion of training, a neural network unit with target detection capability;
S1-5, after the processing of steps S1-1 to S1-3, inputting all modal data of the data set into the trained neural network unit for detection, and extracting and storing the real multi-modal feature matrix produced during the detection of each instance as the ground truth used by the discrimination network unit when training the generation network unit;
S2, multi-modal fusion stage in the missing-modality scenario:
S2-1, deleting all modal data of a certain dimension from the data set;
S2-2, processing the modal data of the remaining dimensions through the feature extraction unit, and extracting the corresponding modal feature tensors;
S2-3, creating a new random vector to simulate the missing modal information, and inputting the random vector into a generation network unit to generate a pseudo modal feature tensor;
S2-4, concatenating the pseudo modal feature tensor with the modal feature tensors from step S2-2 along the channel dimension, inputting the result into the attention network unit, which assigns randomly initialized weight coefficients to the channels to represent the network's degree of interest, and outputting a multi-modal feature matrix stacked along the channel direction;
S2-5, inputting the weighted multi-modal feature matrix into the information fusion unit, applying a 1x1 convolution to reduce the number of channels and fuse the feature information of the multiple modalities, and outputting the result as a pseudo multi-modal feature matrix;
S2-6, inputting the pseudo multi-modal feature matrix and the real multi-modal feature matrix from step S1-5 into a discrimination network unit, which treats the pseudo multi-modal features as the predicted value and the real multi-modal features from step S1-5 as the ground truth, and passing both to a loss function module to obtain the matrix similarity;
S2-7, back-propagating the matrix similarity to the generation network unit, creating a new random vector, and repeating steps S2-3 to S2-6 until the discrimination network unit judges the pseudo multi-modal features to be real, at which point the training of the generation network unit is complete;
S3, target detection application stage:
S3-1, acquiring modal data in real time through data acquisition devices, where the modal data of a certain dimension is missing; the trained generation network unit generates the corresponding missing modal feature tensor from a random vector, the feature extraction unit extracts the feature tensors of the remaining intact modalities, and the missing and intact modal feature tensors are processed as in step S1-3 to obtain a multi-modal feature matrix;
S3-2, inputting the multi-modal feature matrix into the neural network unit trained in step S1-4, which outputs the category of each target and marks each detected target with a 2D bounding box.
As a further scheme of the present invention, the data acquisition devices include a lidar, a camera and roadside edge devices, and the data set consists of modal data of different dimensions acquired by the data acquisition devices at the same time and space.
As a further aspect of the present invention, the feature extraction unit includes an RNN for extracting one-dimensional modal data, a ResNet for extracting two-dimensional modal data, and a PointNet for extracting three-dimensional modal data.
As a further scheme of the present invention, the generation network unit adopts a neural network model based on a GAN, comprising a model input, a generating module and a model output, wherein the model input is a random vector that stands in for the missing modal information and is fed to the generating module; the generating module is an improved GAN in which the encoding and decoding layers are connected through skip layers to capture feature information more accurately; after back-propagation from the discrimination network unit and a large amount of training, the generation network unit can generate the feature matrix of the missing modality.
As a further aspect of the present invention, the discrimination network unit includes a model input, a CNN module, a loss function module and a model output.
As a further scheme of the invention, the loss function module adopts a composite global loss function based on the L1 distance and structural similarity.
As a further scheme of the present invention, the neural network unit specifically adopts a YOLO v3 network.
A multi-modal target detection system suitable for missing modalities comprises data acquisition devices, a data set, a feature extraction unit, a generation network unit, an attention network unit, an information fusion unit, a discrimination network unit and a neural network unit for target detection;
the data set comprises data of multiple dimensions from a lidar, a camera and roadside edge devices at the same time and space; the neural network unit for the designated perception task is trained with the complete multi-modal data, and the multi-modal feature matrix of each instance is extracted as the ground truth for the discrimination network unit;
the feature extraction unit is used for extracting the features of the modalities that are neither faulty nor missing;
the generation network unit connects its encoding and decoding layers through internal skip layers to capture the feature information of the random vector; after back-propagation from the discrimination network unit and a large amount of training, the generation network unit can generate the feature matrix of the missing modality;
the attention network unit is a channel-based attention network; the weight coefficients of the channels are randomly initialized and continuously optimized through back-propagation, so that the network captures the information of interest;
the information fusion unit concatenates the feature matrices of the multiple modalities along the channel dimension, applies the weight coefficient assigned to each channel by the attention network unit, and convolves the concatenated feature matrix with 1x1 convolution kernels, fusing the multi-modal information of different dimensions into one feature matrix;
the discrimination network unit comprises a model input, a CNN module, a loss function module and a model output, wherein the model input comprises the real multi-modal feature data extracted by the trained neural network and the pseudo multi-modal feature data generated from random vectors by the generation network unit, the attention network unit and the information fusion unit; the attention network unit assigns weight coefficients to the channels to represent the degree of interest in the features of each channel;
the CNN module extracts features from the two input matrices and passes the real and pseudo features to the loss function module, which calculates the similarity of the two matrices;
the model output is the similarity of the two matrices; the similarity is back-propagated to the generation network unit, which continues to randomly generate missing-modality matrices whose pseudo multi-modal features are input to the discrimination network unit through the attention network unit and the information fusion unit; this cycle repeats until the real and pseudo matrices are so similar that the discrimination network unit cannot distinguish them, at which point the multi-modal features are output;
the loss function module is a composite global loss function based on the L1 distance and structural similarity.
Compared with the prior art, the invention has the following beneficial effects: compared with schemes that delete samples with missing modalities or impute the missing data, the method generates the feature representation of the missing modality with the generation network unit, avoiding the extra noise introduced by data imputation; compared with generating the entire modal information, the method greatly reduces the computational cost and complexity of the model; and compared with schemes that obtain a consistent representation of the missing modality through matrix factorization or that complete the missing-modality similarity matrix with Laplacian regularization, the method improves the consistency of the multi-modal information representation.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A multi-modal target detection method suitable for missing modalities comprises the following steps:
S1, please refer to fig. 1, the real modality generation stage:
S1-1, acquiring modal data of different dimensions at the same time and space based on an open-source data set;
S1-2, processing each item of modal data through a feature extraction unit, and extracting the modal feature tensor corresponding to each item of modal data;
S1-3, concatenating the obtained modal feature tensors of different dimensions along the channel dimension, inputting the concatenated tensor into an attention network unit, which assigns a randomly initialized weight coefficient to each channel, and outputting a multi-modal feature matrix stacked along the channel direction; inputting this matrix into an information fusion unit, which processes the stacked matrix with a 1x1 convolution to obtain the final multi-modal feature matrix;
S1-4, inputting the multi-modal feature matrix obtained in the previous step into a neural network unit for the designated perception task, training it on the target detection task, and obtaining, on completion of training, a neural network unit with target detection capability;
S1-5, after the processing of steps S1-1 to S1-3, inputting all modal data of the data set into the trained neural network unit for detection, and extracting and storing the real multi-modal feature matrix produced during the detection of each instance as the ground truth used by the discrimination network unit when training the generation network unit;
S2, please refer to fig. 2, the multi-modal fusion stage in the missing-modality scenario:
S2-1, deleting all modal data of a certain dimension from the data set;
S2-2, processing the modal data of the remaining dimensions through the feature extraction unit, and extracting the corresponding modal feature tensors;
S2-3, creating a new random vector to simulate the missing modal information, and inputting the random vector into the generation network unit to generate a pseudo modal feature tensor;
S2-4, concatenating the pseudo modal feature tensor with the modal feature tensors from step S2-2 along the channel dimension, inputting the result into the attention network unit, which assigns randomly initialized weight coefficients to the channels to represent the network's degree of interest, and outputting a multi-modal feature matrix stacked along the channel direction;
S2-5, inputting the weighted multi-modal feature matrix into the information fusion unit, applying a 1x1 convolution to reduce the number of channels and fuse the feature information of the multiple modalities, and outputting the result as a pseudo multi-modal feature matrix;
S2-6, inputting the pseudo multi-modal feature matrix and the real multi-modal feature matrix from step S1-5 into the discrimination network unit, which treats the pseudo multi-modal features as the predicted value and the real multi-modal features from step S1-5 as the ground truth, and passing both to the loss function module to obtain the matrix similarity;
S2-7, back-propagating the matrix similarity to the generation network unit, creating a new random vector, and repeating steps S2-3 to S2-6 until the discrimination network unit judges the pseudo multi-modal features to be real, at which point the training of the generation network unit is complete;
S3, please refer to fig. 3, the target detection application stage:
S3-1, acquiring modal data in real time through the data acquisition devices, where the modal data of a certain dimension is missing; the trained generation network unit generates the corresponding missing modal feature tensor from a random vector, the feature extraction unit extracts the feature tensors of the remaining intact modalities, and the missing and intact modal feature tensors are processed as in step S1-3 to obtain a multi-modal feature matrix;
S3-2, inputting the multi-modal feature matrix into the neural network unit trained in step S1-4, which outputs the category of each target and marks each detected target with a 2D bounding box.
The data acquisition devices comprise a lidar, a camera and roadside edge devices, and the data set consists of modal data of different dimensions acquired by the data acquisition devices at the same time and space.
The feature extraction unit comprises an RNN for extracting one-dimensional modal data, a ResNet for extracting two-dimensional modal data, and a PointNet for extracting three-dimensional modal data.
The generation network unit adopts a neural network model based on a GAN, comprising a model input, a generating module and a model output, wherein the model input is a random vector that stands in for the missing modal information and is fed to the generating module; the generating module is an improved GAN in which the encoding and decoding layers are connected through skip layers to capture feature information more accurately; after back-propagation from the discrimination network unit and a large amount of training, the generation network unit can generate the feature matrix of the missing modality.
The discrimination network unit comprises a model input, a CNN module, a loss function module and a model output.
The loss function module adopts a composite global loss function based on the L1 distance and structural similarity (SSIM).
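As an illustration, such a composite global loss can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the structural-similarity term follows the standard SSIM formula computed globally over the whole matrix (no sliding window), and the two terms are blended with a weighting factor `alpha`; neither the weighting nor the exact form of the combination is fixed by this disclosure.

```python
import numpy as np

def composite_global_loss(pred, target, alpha=0.85, C1=0.01**2, C2=0.03**2):
    """Composite loss = alpha * (1 - global SSIM) + (1 - alpha) * L1.

    `alpha`, C1 and C2 are illustrative assumptions; the disclosure
    only names the L1 and structural-similarity components.
    """
    # L1 term: mean absolute difference between the two feature matrices.
    l1 = np.mean(np.abs(pred - target))

    # Global SSIM term computed over the whole matrix.
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    ssim = ((2 * mu_p * mu_t + C1) * (2 * cov + C2)) / (
        (mu_p**2 + mu_t**2 + C1) * (var_p + var_t + C2))

    return alpha * (1.0 - ssim) + (1.0 - alpha) * l1

x = np.random.rand(64, 64, 256).astype(np.float32)
print(composite_global_loss(x, x))  # ~0 for identical inputs
```

For identical matrices the SSIM term equals 1 and the L1 term equals 0, so the loss vanishes; the matrix similarity reported back to the generation network unit can then be taken as one minus this loss.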
The neural network unit specifically adopts a YOLO v3 algorithm network.
A multi-modal target detection system suitable for missing modalities comprises data acquisition devices, a data set, a feature extraction unit, a generation network unit, an attention network unit, an information fusion unit, a discrimination network unit and a neural network unit for target detection;
the data set comprises data of multiple dimensions from a lidar, a camera and roadside edge devices at the same time and space; the neural network unit for the designated perception task is trained with the complete multi-modal data, and the multi-modal feature matrix of each instance is extracted as the ground truth for the discrimination network unit;
the feature extraction unit is used for extracting the features of the modalities that are neither faulty nor missing;
the generation network unit connects its encoding and decoding layers through internal skip layers to capture the feature information of the random vector; after back-propagation from the discrimination network unit and a large amount of training, the generation network unit can generate the feature matrix of the missing modality;
the attention network unit is a channel-based attention network; the weight coefficients of the channels are randomly initialized and continuously optimized through back-propagation, so that the network captures the information of interest;
the information fusion unit concatenates the feature matrices of the multiple modalities along the channel dimension, applies the weight coefficient assigned to each channel by the attention network unit, and convolves the concatenated feature matrix with 1x1 convolution kernels, fusing the multi-modal information of different dimensions into one feature matrix;
the discrimination network unit comprises a model input, a CNN module, a loss function module and a model output, wherein the model input comprises the real multi-modal feature data extracted by the trained neural network and the pseudo multi-modal feature data generated from random vectors by the generation network unit, the attention network unit and the information fusion unit; the attention network unit assigns weight coefficients to the channels to represent the degree of interest in the features of each channel;
the CNN module extracts features from the two input matrices and passes the real and pseudo features to the loss function module, which calculates the similarity of the two matrices;
the model output is the similarity of the two matrices; the similarity is back-propagated to the generation network unit, which continues to randomly generate missing-modality matrices whose pseudo multi-modal features are input to the discrimination network unit through the attention network unit and the information fusion unit; this cycle repeats until the real and pseudo matrices are so similar that the discrimination network unit cannot distinguish them, at which point the multi-modal features are output;
the loss function module is a composite global loss function based on the L1 distance and structural similarity.
Example 1: a multi-modal target detection method suitable for missing modalities. In the following embodiment, H denotes height, W width and D depth; two-dimensional RGB image data collected by a camera and three-dimensional spatial point cloud data collected by a lidar are taken as examples and described in detail. The method specifically comprises the following steps:
S1, real modality generation stage:
1) acquiring three-dimensional spatial point cloud data from the lidar and two-dimensional RGB image data at the same time and space based on the open-source multi-modal data set KITTI;
2) in mathematical terms, the two-dimensional RGB image data is a modal feature tensor of size (H, W, 3), representing the height, width and RGB channels of the image; a ResNet is used as the feature extraction unit for the two-dimensional RGB image data to further improve extraction accuracy, finally yielding a modal feature tensor of size (H/4, W/4, 256);
3) in mathematical terms, the three-dimensional spatial point cloud data is a modal feature tensor of size (D, H, W), representing the position of each point in a spatial coordinate system; because analyzing and computing every point individually is too expensive, the voxel method divides the three-dimensional space into voxel blocks of equal size, reducing the amount of computation; through voxel partitioning, point cloud grouping, VFE (voxel feature encoding) and 3D sparse convolution, local and global point cloud features are fully considered; a PointNet is adopted as the feature extraction unit for the three-dimensional spatial point cloud data, finally obtaining a modal feature tensor of size (H/4, W/4, 256);
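The voxel partitioning and point cloud grouping described above can be sketched in NumPy as follows; the voxel size and the per-voxel mean aggregation (a crude stand-in for the VFE step) are illustrative assumptions, not values fixed by this embodiment.

```python
import numpy as np

def voxelize(points, voxel_size=0.2):
    """Group raw (N, 3) lidar points into equal-sized voxel blocks.

    Returns the occupied voxel grid coordinates and one mean feature
    per occupied voxel (simplified voxel feature encoding).
    """
    # Integer grid coordinate of each point.
    idx = np.floor(points / voxel_size).astype(np.int64)
    # Unique occupied voxels and the mapping point -> voxel.
    voxels, inverse = np.unique(idx, axis=0, return_inverse=True)
    # Aggregate the points of each voxel by their mean.
    counts = np.bincount(inverse, minlength=len(voxels))
    feats = np.zeros((len(voxels), 3))
    for d in range(3):
        feats[:, d] = np.bincount(inverse, weights=points[:, d],
                                  minlength=len(voxels)) / counts
    return voxels, feats

pts = np.random.rand(1000, 3) * 10.0  # synthetic 10 m cube of points
vox, feats = voxelize(pts)
print(vox.shape, feats.shape)
```

Only the occupied voxels are kept, which is what makes the subsequent 3D sparse convolution cheap compared with processing every point.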
4) concatenating the two obtained modal feature tensors along the channel dimension to obtain a tensor of size (H/4, W/4, 512), inputting it into the attention network unit, which assigns a randomly initialized weight coefficient to each channel, and outputting a multi-modal feature matrix stacked along the channel direction; inputting this into the information fusion unit and processing the stacked tensor with a 1x1 convolution to obtain a final multi-modal feature matrix of size (H/4, W/4, 256);
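The concatenation, channel weighting and 1x1 convolution of this step can be sketched as follows. The sigmoid squashing of the randomly initialized channel weights and the initialization scale are illustrative assumptions; the disclosure only states that the weights are randomly initialized and later refined by back-propagation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 16, 16  # stand-ins for H/4 and W/4

# Two modal feature tensors of size (H/4, W/4, 256) from the extractors.
img_feat = rng.standard_normal((H, W, 256))
pcd_feat = rng.standard_normal((H, W, 256))

# Concatenate along the channel dimension -> (H/4, W/4, 512).
stacked = np.concatenate([img_feat, pcd_feat], axis=-1)

# Channel attention: one randomly initialized coefficient per channel,
# squashed to (0, 1); training would refine these by back-propagation.
w = 1.0 / (1.0 + np.exp(-rng.standard_normal(512)))
weighted = stacked * w  # broadcast over the 512 channels

# 1x1 convolution = per-pixel linear map over channels, 512 -> 256.
kernel = rng.standard_normal((512, 256)) * 0.01
fused = weighted @ kernel  # final multi-modal feature matrix

print(stacked.shape, fused.shape)  # (16, 16, 512) (16, 16, 256)
```

The 1x1 convolution mixes information across channels only, which is why it can fuse the two modalities while leaving the spatial resolution unchanged.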
5) inputting the multi-modal feature matrix obtained in the previous step into a YOLO v3 network, training it on a two-dimensional target detection task, and finally obtaining a neural network unit with two-dimensional target detection capability;
6) processing all two-dimensional RGB image data and three-dimensional spatial point cloud data in the data set through steps S1-1 to S1-3, inputting them into the trained neural network unit for target detection, and extracting and storing the real multi-modal feature matrix produced during the detection of each instance as the ground truth used by the discrimination network unit when training the generation network unit;
S2, please refer to fig. 4, the multi-modal fusion stage in the missing-modality scenario:
1) deleting all two-dimensional RGB image data from the data set to simulate the loss of two-dimensional data under a camera fault;
2) extracting features from the three-dimensional spatial point cloud data using a PointNet as the feature extraction unit, finally obtaining a modal feature tensor of size (H/4, W/4, 256);
3) creating a new random vector to simulate the missing modal information, reshaping the random vector into a feature tensor of size (H/4, W/4, 256), inputting it into the generation network unit, and generating a pseudo modal feature tensor of size (H/4, W/4, 256); the skip connections allow the decoding layers to incorporate high-level features, ensuring feature diversity and giving the generation network unit more robust performance without increasing the parameter count of the encoding network or the computational complexity of the decoding network; referring to fig. 5, E1-E7 correspond to the encoding layers, D1-D7 to the decoding layers, and S1-S7 to the skip layers.
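A reduced sketch of the skip-layer encoder-decoder of fig. 5: three encode/decode levels instead of seven, 2x2 average-pool downsampling, nearest-neighbour upsampling and additive skip connections are all illustrative simplifications of the improved GAN generator described above.

```python
import numpy as np

def down(x):
    """Encoding layer: 2x2 average pooling halves H and W."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def up(x):
    """Decoding layer: nearest-neighbour upsampling doubles H and W."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def generate(z, levels=3):
    """Encoder-decoder with additive skip connections (E_i -> D_i)."""
    skips, x = [], z
    for _ in range(levels):       # encoding path E1..E3
        skips.append(x)
        x = down(x)
    for _ in range(levels):       # decoding path D3..D1
        x = up(x) + skips.pop()   # skip layer S_i reinjects E_i features
    return x

# Random vector reshaped to the (H/4, W/4, 256) missing-modality shape.
z = np.random.rand(16, 16, 256)
fake = generate(z)
print(fake.shape)  # (16, 16, 256)
```

The skips add no parameters of their own, which matches the stated property that robustness improves without growing the encoding network or the decoding cost.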
4) concatenating the pseudo modal feature tensor obtained from the random vector through the generation network unit with the modal feature tensor of the real three-dimensional spatial point cloud data (step S2-2) along the channel dimension, inputting the result into the attention network unit, which assigns randomly initialized weight coefficients to the channels to represent the network's degree of interest, and outputting a feature matrix of size (H/4, W/4, 512), still stacked along the channel direction;
5) inputting the weighted feature matrix into the information fusion unit, applying a 1x1 convolution to reduce the number of channels and fuse the feature information of the two modalities, and outputting a pseudo multi-modal feature matrix of size (H/4, W/4, 256);
6) inputting the pseudo multi-modal feature matrix and the real multi-modal feature matrix (step S1-5) into the discrimination network unit; the discrimination network unit is a simple multi-layer neural network, and the discrimination task is a supervised binary classification problem in which the output of the information fusion unit is compared with the given label (the real multi-modal feature matrix) by similarity analysis to yield a positive or negative result; to further improve discrimination accuracy, in this embodiment an attention network unit is also placed in front of the information fusion unit to assign weight coefficients representing the network's degree of interest to the channels; the discrimination network unit performs feature extraction with its CNN module, treats the pseudo multi-modal features as the predicted value and the real multi-modal features as the ground truth, and passes them to the loss function module to obtain the matrix similarity;
7) back-propagating the matrix similarity to the generation network unit, creating a new random vector, and repeating steps S2-3 to S2-6 until the discrimination network unit judges the pseudo multi-modal features to be real (i.e. when the output matrix similarity reaches 1), at which point the training of the generation network unit is complete.
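The alternating loop of steps S2-3 to S2-7 can be sketched as control flow. The callables (`generate`, `similarity`, `update`) and the 0.99 stopping threshold are hypothetical stand-ins: a numerical similarity approaches but rarely reaches exactly 1, so a threshold close to 1 would be used in practice. The toy generator below is not a real GAN; each "update" simply nudges it toward the stored real matrix to exercise the loop.

```python
import numpy as np

def train_generator(generate, similarity, update, threshold=0.99,
                    max_iters=1000, shape=(16, 16, 256)):
    """Repeat S2-3..S2-6 until the pseudo features fool the discriminator."""
    for step in range(max_iters):
        z = np.random.rand(*shape)        # S2-3: new random vector
        fake = generate(z)                # pseudo multi-modal features
        sim = similarity(fake)            # S2-6: matrix similarity
        if sim >= threshold:              # S2-7: discriminator is fooled
            return step, sim
        generate = update(generate, sim)  # back-propagate the similarity
    raise RuntimeError("generator did not converge")

# Toy stand-ins for the real networks.
real = np.random.rand(16, 16, 256)
def make_gen(mix):  # mix=0: pure noise, mix=1: perfect copy of `real`
    return lambda z: (1 - mix) * z + mix * real

sim_fn = lambda fake: 1.0 - np.mean(np.abs(fake - real))
steps = iter(np.linspace(0.1, 1.0, 50))
update = lambda gen, sim: make_gen(next(steps))

step, sim = train_generator(make_gen(0.0), sim_fn, update)
print(step, round(sim, 3))
```

The structure mirrors the patent's loop: a fresh random vector each iteration, a similarity check against the stored real features, and an update of the generator whenever the check fails.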
S3, target detection application stage:
1) when the application scene involves extremely severe weather or a camera fault, modal data are collected in real time by the data acquisition devices (lidar and camera) and the two-dimensional RGB image data are missing; the trained generation network unit generates the corresponding missing modal feature tensor from a random vector, the feature extraction unit extracts the intact three-dimensional modal feature tensor, and the missing and intact modal feature tensors undergo the concatenation and convolution processing of step S1-3 to obtain a multi-modal feature matrix;
2) inputting the multi-modal feature matrix into the neural network unit trained in step S1-4, which outputs the category of each target and marks each detected target with a 2D bounding box.
Specifically, after entering the neural network unit, the multi-modal feature matrix is repeatedly convolved to generate three feature maps of different scales, which are fed into the internal FPN and RPN modules; the RPN module completes the classification and regression of targets from the 256-channel feature layers output by the last convolutional layer. Classification determines the category of each target, regression places a 2D bounding box around each detected target, and the classification and regression results together constitute the output of the method.
In this embodiment, only the case of missing two-dimensional modal data is illustrated; obviously, the method of the present invention can likewise perform target detection from the two-dimensional modal data when the three-dimensional modal data is missing, and details are not repeated here.