CN114359586A - Multi-modal target detection method and system suitable for modal loss


Info

Publication number: CN114359586A
Authority: CN (China)
Prior art keywords: modal, network unit, feature, data, unit
Legal status: Granted
Application number: CN202111456527.1A
Other languages: Chinese (zh)
Other versions: CN114359586B (en)
Inventors: 程腾, 孙磊, 张峻宁, 陈炯, 石琴, 丁莉
Current Assignee: Anhui Guandun Technology Co., Ltd.; Hefei University of Technology Asset Management Co., Ltd.
Original Assignee: Hefei University of Technology
Priority date: 2021-12-01
Filing date: 2021-12-01
Publication date: 2022-04-15
Application filed by Hefei University of Technology
Priority to CN202111456527.1A
Publication of CN114359586A
Application granted
Publication of CN114359586B
Legal status: Active
Anticipated expiration

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal target detection method and system suitable for modal loss, wherein the method comprises the following steps: S1, training a neural network unit on the multi-modal data in a data set, inputting all modal data of the data set into the trained neural network unit for detection, and storing the detection results; S2, extracting the modal data of the remaining dimensions, generating a pseudo-modal feature tensor with the generation network unit, splicing it, and inputting it into the attention network unit, the information fusion unit and the discrimination network unit until training of the generation network unit is complete; S3, acquiring modal data in real time through data acquisition equipment, and generating the type and bounding-box identification of each target with the trained neural network unit. The method virtually generates the missing modal data while avoiding noise introduction and loss of feature information, greatly reduces the computation and complexity of the model, and improves the representation consistency of the generated modalities.

Description

Multi-modal target detection method and system suitable for modal loss
Technical Field
The invention relates to the technical field of automatic driving, and in particular to a multi-modal target detection method and system suitable for modal loss.
Background
With the rise and progress of technologies such as computer vision (CV) and machine learning in the field of artificial intelligence, cameras and image processing equipment can replace human eyes to detect, recognize, track and count targets more efficiently. Target detection is the basic task underlying target recognition, tracking and counting. In the field of automatic driving in particular, a vehicle-mounted terminal generally needs to combine data acquisition devices such as a lidar, a camera and roadside edge equipment to acquire modal data of several different dimensions, and performs target detection and recognition on the acquired multi-dimensional modal data so as to improve detection and recognition accuracy.
However, in practical application, modal data loss inevitably occurs: when a camera fails or encounters extremely severe weather, for example, an entire modality or part of its information may be lost.
For the problems of modal data loss and partial information loss in complex environments, prior-art solutions generally either: 1. delete the modality-missing samples or fill in the missing modal features with data completion techniques, then process them with existing multi-modal learning methods, which introduces extra noise into the detection results; or 2. obtain a consistent representation of the missing modality by matrix decomposition, or complete the missing-modality similarity matrix with Laplacian regularization, which avoids introducing extra noise but loses effective modal feature information. How to design a multi-modal target detection method and system suitable for modal loss that virtually generates the missing modal data while avoiding noise introduction and feature-information loss, and thereby improves the representation consistency of the generated modalities, is the technical problem this application seeks to solve.
Disclosure of Invention
The present invention is directed to a multi-modal target detection method and system suitable for modal loss, so as to solve the problems raised in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
a multi-modal target detection method suitable for modal loss comprises the following steps:
s1, a real mode generation stage:
s1-1, acquiring modal data of different dimensions under the same time and space based on an open source data set;
s1-2, processing the modal data through a feature extraction unit, and extracting a modal feature tensor corresponding to each modal data;
s1-3, splicing the obtained modal feature tensors with different dimensions along channels to obtain a spliced modal feature tensor, inputting the spliced modal feature tensor into an attention network unit, randomly distributing weights to endow different channels with weight coefficients, outputting to obtain a multi-modal feature matrix stacked along the channel direction, inputting the multi-modal feature matrix into an information fusion unit, inputting the multi-modal feature matrix into the information fusion unit, and processing the stacked multi-modal feature matrix by using 1x1 convolution to obtain a final multi-modal feature matrix;
s1-4, inputting the multi-modal feature matrix obtained in the last step into a neural network unit for appointing a perception target task, training the target detection perception task, and finally finishing the training to obtain the neural network unit with the target detection perception capability;
s1-5, inputting all modal data of the data set into the trained neural network unit for detection after the processing of the steps S1-1 to S1-3, extracting and storing the real multi-modal characteristic matrix in the detection process of each instance as a true value of the discrimination network unit in the training process of the generation network unit;
S2, a multi-modal fusion stage in the missing-modality scene:
S2-1, deleting all modal data of a certain dimension from the data set;
S2-2, processing the modal data of the remaining dimensions through the feature extraction unit, and extracting the modal feature tensors corresponding to the modal data of the remaining dimensions;
S2-3, creating a new random vector to simulate the missing modal information, and inputting the random vector into a generation network unit to generate a pseudo-modal feature tensor;
S2-4, splicing the pseudo-modal feature tensor with the modal feature tensors of step S2-2 along the channel direction, inputting the spliced tensor into the attention network unit, which assigns randomly initialized weight coefficients to the different channels to represent the network's degree of interest, and outputting a multi-modal feature matrix stacked along the channel direction;
S2-5, inputting the weighted multi-modal feature matrix into the information fusion unit, reducing the channels with a 1x1 convolution to fuse the feature information of the multiple modalities, and outputting a pseudo multi-modal feature matrix;
S2-6, inputting the pseudo multi-modal feature matrix and the real multi-modal feature matrix of step S1-5 into a discrimination network unit, which takes the pseudo multi-modal features as the predicted value and the real multi-modal features of step S1-5 as the true value and feeds them into the loss function module to obtain the matrix similarity;
S2-7, backpropagating the matrix similarity to the generation network unit, creating a new random vector, and repeating steps S2-3 to S2-6 until the discrimination network unit judges the pseudo multi-modal features to be real, at which point training of the generation network unit is complete;
S3, a target detection application stage:
S3-1, acquiring modal data in real time through data acquisition equipment; when the modal data of a certain dimension is missing, the trained generation network unit generates the corresponding missing modal feature tensor from a random vector, the feature extraction unit extracts the feature tensors of the remaining intact modalities, and the missing modal feature tensor and the remaining intact modal feature tensors are processed as in step S1-3 to obtain a multi-modal feature matrix;
S3-2, inputting the multi-modal feature matrix into the neural network unit trained in step S1-4, which outputs the type of each target and marks each detected target with a 2D bounding box.
As a further scheme of the present invention, the data acquisition equipment comprises a lidar, a camera and roadside edge equipment, and the data set is modal data of different dimensions acquired by the data acquisition equipment at the same time and in the same space.
As a further aspect of the present invention, the feature extraction unit comprises an RNN (recurrent neural network) for extracting one-dimensional modal data, a ResNet network for extracting two-dimensional modal data, and a PointNet network for extracting three-dimensional modal data.
As a further scheme of the present invention, the generation network unit adopts a neural network model based on a GAN, comprising a model input, a generation module and a model output; the model input is a random vector, used in place of the missing modal information and fed to the generation module; the generation module is an improved GAN in which the coding layers and decoding layers are connected by skip layers to capture feature information more accurately; after backpropagation from the discrimination network unit and extensive training, the generation network unit can generate the feature matrix of the missing modality.
As a further aspect of the present invention, the discrimination network unit comprises a model input, a CNN module, a loss function module and a model output.
As a further scheme of the invention, the loss function module adopts a composite global loss function based on L1 and structural similarity.
As a further scheme of the present invention, the neural network unit specifically adopts a YOLO v3 network.
A multi-modal target detection system suitable for modal loss comprises data acquisition equipment, a data set, a feature extraction unit, a generation network unit, an attention network unit, an information fusion unit, a discrimination network unit and a neural network unit for target detection;
the data set comprises multi-dimensional data information from a lidar, a camera and roadside edge equipment at the same time and in the same space; the complete multi-modal data information is used to train the neural network unit for the designated perception task, and the multi-modal feature matrix of each instance is extracted to serve as the ground truth for the discrimination network unit;
the feature extraction unit extracts the features of the modalities that are not faulty or missing;
the generation network unit connects its coding layers and decoding layers through internal skip layers to capture the feature information of the random vector; after backpropagation from the discrimination network unit and extensive training, it can generate the feature matrix of the missing modality;
the attention network unit is a channel-based attention network; the channel weight coefficients are randomly initialized and continuously optimized through backpropagation, so that the network captures the information of interest;
the information fusion unit splices the feature matrices of the multiple modalities along the channel direction, applies the weight coefficient assigned to each channel by the attention network unit, and convolves the spliced feature matrix with a 1x1 convolution kernel, fusing the multi-modal information of different dimensions into a single feature matrix;
the discrimination network unit comprises a model input, a CNN module, a loss function module and a model output; the model input comprises the real multi-modal feature data extracted by the trained neural network and the pseudo multi-modal feature data generated from random vectors by the generation network unit, the attention network unit and the information fusion unit, the attention network unit assigning weight coefficients to the channels to represent the degree of interest in the different channel features;
the CNN module extracts features from the two input matrices and feeds the real and pseudo features into the loss function module to calculate the degree of similarity of the two matrices;
the model output is the similarity of the two matrices; the similarity is backpropagated to the generation network unit, which continues to randomly generate missing-modality matrices whose pseudo multi-modal features are fed through the attention network unit and the information fusion layer to the discrimination network unit; this cycle repeats until the real and pseudo matrices are so similar that the discrimination network unit cannot distinguish them, at which point the multi-modal features are output;
the loss function module is a composite global loss function based on L1 and structural similarity.
Compared with the prior art, the invention has the following beneficial effects: compared with the technical schemes of deleting modality-missing samples or completing the missing data, generating the missing modal feature representation with the generation network unit avoids the extra noise introduced by data completion; compared with generating the entire modal information, the method greatly reduces the computation and complexity of the model; and compared with the technical schemes of obtaining a consistent representation of the missing modality by matrix decomposition or completing the missing-modality similarity matrix with Laplacian regularization, the method improves the consistency of the multi-modal information representation.
Drawings
FIG. 1 is a schematic flow chart of a real mode generation phase of the method of the present invention;
FIG. 2 is a schematic flow chart of a multimodal fusion stage in a modality-missing scenario of the method of the present invention;
FIG. 3 is a schematic flow chart of the target detection application stage of the method of the present invention;
FIG. 4 is a schematic flow chart of the multimodal fusion stage in a missing-modality scene in embodiment 1 of the method of the present invention;
FIG. 5 is a schematic diagram of the generation network unit in embodiment 1 of the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A multi-modal target detection method suitable for modal loss comprises the following steps:
S1, referring to FIG. 1, the real mode generation stage:
S1-1, acquiring modal data of different dimensions at the same time and in the same space, based on an open-source data set;
S1-2, processing the modal data through the feature extraction unit, and extracting the modal feature tensor corresponding to each modal data;
S1-3, splicing the obtained modal feature tensors of different dimensions along the channel direction to obtain a spliced modal feature tensor, inputting it into the attention network unit, which assigns randomly initialized weight coefficients to the different channels, and outputting a multi-modal feature matrix stacked along the channel direction; inputting this matrix into the information fusion unit, which processes the stacked matrix with a 1x1 convolution to obtain the final multi-modal feature matrix (one possible realization of this step is sketched below, after step S1-5);
S1-4, inputting the multi-modal feature matrix obtained in the previous step into the neural network unit for the designated perception task and training it on the target detection perception task; when training is complete, a neural network unit with target detection perception capability is obtained;
S1-5, after the processing of steps S1-1 to S1-3, inputting all modal data of the data set into the trained neural network unit for detection, and extracting and storing the real multi-modal feature matrix of each instance during detection, to serve as the ground truth for the discrimination network unit when training the generation network unit;
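For illustration only, the attention-plus-fusion processing of step S1-3 could be realized along the lines of the following PyTorch sketch. The patent does not give concrete layer definitions, so the SE-style channel attention, the channel counts and the module name ChannelAttentionFusion are assumptions, not the patented implementation:

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Sketch of the attention network unit plus information fusion unit:
    per-channel weight coefficients are learned (randomly initialized,
    refined by backpropagation) and a 1x1 convolution fuses the channels."""

    def __init__(self, in_channels: int, fused_channels: int, reduction: int = 16):
        super().__init__()
        # SE-style channel attention; an assumption, since the patent only
        # says "channel-based attention mechanism network".
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, in_channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // reduction, in_channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # 1x1 convolution that reduces the channels and fuses the modalities.
        self.fuse = nn.Conv2d(in_channels, fused_channels, kernel_size=1)

    def forward(self, stacked: torch.Tensor) -> torch.Tensor:
        weights = self.attention(stacked)    # weight coefficient per channel
        return self.fuse(stacked * weights)  # final multi-modal feature matrix

# Usage: two (B, 256, H/4, W/4) modal feature tensors spliced along channels.
image_feat = torch.randn(1, 256, 96, 320)
cloud_feat = torch.randn(1, 256, 96, 320)
stacked = torch.cat([image_feat, cloud_feat], dim=1)  # (1, 512, 96, 320)
fused = ChannelAttentionFusion(512, 256)(stacked)     # (1, 256, 96, 320)
```

The sigmoid output plays the role of the per-channel weight coefficients, which start from a random initialization and are refined by backpropagation as described above.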
S2, referring to FIG. 2, the multimodal fusion stage in the missing-modality scene:
S2-1, deleting all modal data of a certain dimension from the data set;
S2-2, processing the modal data of the remaining dimensions through the feature extraction unit, and extracting the modal feature tensors corresponding to the modal data of the remaining dimensions;
S2-3, creating a new random vector to simulate the missing modal information, and inputting the random vector into the generation network unit to generate a pseudo-modal feature tensor;
S2-4, splicing the pseudo-modal feature tensor with the modal feature tensors of step S2-2 along the channel direction, inputting the spliced tensor into the attention network unit, which assigns randomly initialized weight coefficients to the different channels to represent the network's degree of interest, and outputting a multi-modal feature matrix stacked along the channel direction;
S2-5, inputting the weighted multi-modal feature matrix into the information fusion unit, reducing the channels with a 1x1 convolution to fuse the feature information of the multiple modalities, and outputting a pseudo multi-modal feature matrix;
S2-6, inputting the pseudo multi-modal feature matrix and the real multi-modal feature matrix of step S1-5 into the discrimination network unit, which takes the pseudo multi-modal features as the predicted value and the real multi-modal features of step S1-5 as the true value and feeds them into the loss function module to obtain the matrix similarity;
S2-7, backpropagating the matrix similarity to the generation network unit, creating a new random vector, and repeating steps S2-3 to S2-6 until the discrimination network unit judges the pseudo multi-modal features to be real, at which point training of the generation network unit is complete;
S3, referring to FIG. 3, the target detection application stage:
S3-1, acquiring modal data in real time through the data acquisition equipment; when the modal data of a certain dimension is missing, the trained generation network unit generates the corresponding missing modal feature tensor from a random vector, the feature extraction unit extracts the feature tensors of the remaining intact modalities, and the missing modal feature tensor and the remaining intact modal feature tensors are processed as in step S1-3 to obtain a multi-modal feature matrix;
S3-2, inputting the multi-modal feature matrix into the neural network unit trained in step S1-4, which outputs the type of each target and marks each detected target with a 2D bounding box.
The data acquisition equipment comprises a lidar, a camera and roadside edge equipment, and the data set is modal data of different dimensions acquired by the data acquisition equipment at the same time and in the same space.
The feature extraction unit comprises an RNN (recurrent neural network) for extracting one-dimensional modal data, a ResNet network for extracting two-dimensional modal data, and a PointNet network for extracting three-dimensional modal data.
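As a rough sketch of how such a feature extraction unit might route modal data of different dimensionality to the appropriate backbone (the tiny stand-in backbones below are placeholders for the RNN, ResNet and PointNet networks named above; all names and sizes are assumptions):

```python
import torch
import torch.nn as nn

class FeatureExtractionUnit(nn.Module):
    """Sketch: dispatch each modality to a backbone chosen by its
    dimensionality. The tiny stand-in backbones are illustrative only."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.cnn = nn.Sequential(  # stand-in for a ResNet backbone
            nn.Conv2d(3, hidden, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        self.point_mlp = nn.Sequential(  # stand-in for a PointNet per-point MLP
            nn.Linear(3, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden),
        )

    def forward(self, data: torch.Tensor, modality: str) -> torch.Tensor:
        if modality == "1d":               # (B, T, 1) sequence data
            out, _ = self.rnn(data)
            return out[:, -1]              # last hidden state as the feature
        if modality == "2d":               # (B, 3, H, W) RGB image
            return self.cnn(data)          # (B, hidden, H/4, W/4) feature tensor
        # "3d": (B, N, 3) point cloud; max-pool over points, PointNet-style
        return self.point_mlp(data).max(dim=1).values

extractor = FeatureExtractionUnit()
img_feat = extractor(torch.randn(1, 3, 384, 1280), "2d")  # (1, 256, 96, 320)
```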
The generation network unit adopts a neural network model based on a GAN, comprising a model input, a generation module and a model output; the model input is a random vector, used in place of the missing modal information and fed to the generation module; the generation module is an improved GAN in which the coding layers and decoding layers are connected by skip layers to capture feature information more accurately; after backpropagation from the discrimination network unit and extensive training, the generation network unit can generate the feature matrix of the missing modality.
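A minimal sketch of such an encoder-decoder generator with skip-layer connections follows; this is a U-Net-style reading of the "improved GAN network", with a depth of only two coding/decoding layers for brevity (FIG. 5 of the embodiment shows seven of each), and the channel counts are assumptions:

```python
import torch
import torch.nn as nn

class SkipGenerator(nn.Module):
    """Sketch: encoder-decoder generator whose decoding layers are joined
    to the corresponding coding layers through skip connections."""

    def __init__(self, channels: int = 256):
        super().__init__()
        c = channels
        self.enc1 = nn.Conv2d(c, c, 3, stride=2, padding=1)       # coding layers
        self.enc2 = nn.Conv2d(c, c, 3, stride=2, padding=1)
        self.dec2 = nn.ConvTranspose2d(c, c, 4, stride=2, padding=1)      # decoding
        self.dec1 = nn.ConvTranspose2d(2 * c, c, 4, stride=2, padding=1)  # layers
        self.act = nn.ReLU(inplace=True)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        e1 = self.act(self.enc1(z))
        e2 = self.act(self.enc2(e1))
        d2 = self.act(self.dec2(e2))
        return self.dec1(torch.cat([d2, e1], dim=1))  # skip connection, then output

# z is a random vector reshaped to the missing modality's feature size.
z = torch.randn(1, 256, 96, 320)
pseudo_modal = SkipGenerator()(z)  # pseudo modal feature tensor (1, 256, 96, 320)
```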
The discrimination network unit comprises a model input, a CNN module, a loss function module and a model output.
The loss function module adopts a composite global loss function based on L1 and structural similarity.
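One plausible form of such a composite loss is sketched below; the simplified SSIM built from average-pooled local statistics and the weighting factor alpha are assumptions, since the patent does not spell out the exact formula:

```python
import torch
import torch.nn.functional as F

def ssim(x: torch.Tensor, y: torch.Tensor, window: int = 11) -> torch.Tensor:
    """Simplified structural similarity over local windows (average pooling)."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x = F.avg_pool2d(x, window, stride=1)
    mu_y = F.avg_pool2d(y, window, stride=1)
    sigma_x = F.avg_pool2d(x * x, window, stride=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, window, stride=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, window, stride=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).mean()

def composite_loss(pred: torch.Tensor, target: torch.Tensor,
                   alpha: float = 0.85) -> torch.Tensor:
    """Composite global loss: weighted sum of (1 - SSIM) and the L1 distance.
    alpha is an assumed weighting; the patent does not specify it."""
    l1 = torch.mean(torch.abs(pred - target))
    return alpha * (1.0 - ssim(pred, target)) + (1.0 - alpha) * l1
```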
The neural network unit specifically adopts a YOLO v3 network.
A multi-modal target detection system suitable for modal loss comprises data acquisition equipment, a data set, a feature extraction unit, a generation network unit, an attention network unit, an information fusion unit, a discrimination network unit and a neural network unit for target detection;
the data set comprises multi-dimensional data information from a lidar, a camera and roadside edge equipment at the same time and in the same space; the complete multi-modal data information is used to train the neural network unit for the designated perception task, and the multi-modal feature matrix of each instance is extracted to serve as the ground truth for the discrimination network unit;
the feature extraction unit extracts the features of the modalities that are not faulty or missing;
the generation network unit connects its coding layers and decoding layers through internal skip layers to capture the feature information of the random vector; after backpropagation from the discrimination network unit and extensive training, it can generate the feature matrix of the missing modality;
the attention network unit is a channel-based attention network; the channel weight coefficients are randomly initialized and continuously optimized through backpropagation, so that the network captures the information of interest;
the information fusion unit splices the feature matrices of the multiple modalities along the channel direction, applies the weight coefficient assigned to each channel by the attention network unit, and convolves the spliced feature matrix with a 1x1 convolution kernel, fusing the multi-modal information of different dimensions into a single feature matrix;
the discrimination network unit comprises a model input, a CNN module, a loss function module and a model output; the model input comprises the real multi-modal feature data extracted by the trained neural network and the pseudo multi-modal feature data generated from random vectors by the generation network unit, the attention network unit and the information fusion unit, the attention network unit assigning weight coefficients to the channels to represent the degree of interest in the different channel features;
the CNN module extracts features from the two input matrices and feeds the real and pseudo features into the loss function module to calculate the degree of similarity of the two matrices;
the model output is the similarity of the two matrices; the similarity is backpropagated to the generation network unit, which continues to randomly generate missing-modality matrices whose pseudo multi-modal features are fed through the attention network unit and the information fusion layer to the discrimination network unit; this cycle repeats until the real and pseudo matrices are so similar that the discrimination network unit cannot distinguish them, at which point the multi-modal features are output;
the loss function module is a composite global loss function based on L1 and structural similarity.
Example 1: a multi-modal target detection method suitable for modal loss. In the following embodiment, H denotes height, W width and D depth; the method is described in detail taking as examples the two-dimensional RGB image data collected by a camera and the three-dimensional spatial point cloud data collected by a lidar. The method specifically comprises the following steps:
S1, the real mode generation stage:
1) acquiring the lidar three-dimensional spatial point cloud data and the two-dimensional RGB image data at the same time and in the same space, based on the open-source multi-modal data set KITTI;
2) the two-dimensional RGB image data is expressed mathematically as a modal feature tensor of size (H, W, 3), the dimensions representing the height, width and RGB channels of the image; a ResNet network is used as the feature extraction unit for the two-dimensional RGB image data to further improve extraction precision, finally yielding a modal feature tensor of size (H/4, W/4, 256);
3) the three-dimensional spatial point cloud data is expressed mathematically as a modal feature tensor of size (D, H, W), the dimensions representing the length, width and height of the points in the spatial coordinate system; since analyzing and computing every individual point is too computationally expensive, the voxel method divides the three-dimensional space into voxel blocks of equal size, reducing the computation (a sketch of this voxel partition is given after step 6) below); through voxel partitioning, point cloud grouping, VFE feature encoding and 3D sparse convolution, local and global point cloud features are fully considered; a PointNet network is adopted as the feature extraction unit for the three-dimensional spatial point cloud data, finally yielding a modal feature tensor of size (H/4, W/4, 256);
4) splicing the two obtained modal feature tensors along the channel direction to obtain a tensor of size (H/4, W/4, 512), inputting it into the attention network unit, which assigns randomly initialized weight coefficients to the different channels, and outputting a multi-modal feature matrix stacked along the channel direction; inputting it into the information fusion unit, which processes the stacked tensor with a 1x1 convolution to obtain the final multi-modal feature matrix of size (H/4, W/4, 256);
5) inputting the multi-modal feature matrix obtained in the previous step into a YOLO v3 network and training it on the two-dimensional target detection perception task; when training is complete, a neural network unit with two-dimensional target detection perception capability is obtained;
6) processing all the two-dimensional RGB image data and three-dimensional spatial point cloud data in the data set as in steps S1-1 to S1-3, inputting them into the trained neural network unit for target detection, and extracting and storing the real multi-modal feature matrix of each instance during detection, to serve as the ground truth for the discrimination network unit when training the generation network unit;
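As referenced in step 3) above, a rough Python sketch of the voxel partition is given here; the voxel size, point-cloud range and per-voxel point cap are illustrative values borrowed from common lidar setups, not from the patent:

```python
import numpy as np

def voxelize(points: np.ndarray,
             voxel_size=(0.2, 0.2, 0.4),
             point_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0),
             max_points: int = 35):
    """Sketch: group an (N, 3) point cloud into equal-size voxel blocks,
    capping the points kept per voxel, ready for VFE-style encoding."""
    x0, y0, z0, x1, y1, z1 = point_range
    mask = ((points[:, 0] >= x0) & (points[:, 0] < x1) &
            (points[:, 1] >= y0) & (points[:, 1] < y1) &
            (points[:, 2] >= z0) & (points[:, 2] < z1))
    pts = points[mask]
    # Integer voxel index of every remaining point.
    idx = ((pts - np.array([x0, y0, z0])) / np.array(voxel_size)).astype(np.int32)
    voxels: dict = {}
    for p, i in zip(pts, map(tuple, idx)):
        bucket = voxels.setdefault(i, [])
        if len(bucket) < max_points:   # cap the points stored per voxel
            bucket.append(p)
    return voxels  # {voxel index: list of points}

grid = voxelize(np.random.uniform(low=[0, -40, -3], high=[70.4, 40, 1],
                                  size=(1000, 3)))
```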
S2, referring to FIG. 4, the multimodal fusion stage in the missing-modality scene:
1) deleting all the two-dimensional RGB image data from the data set, so as to simulate the loss of two-dimensional data under a camera failure;
2) extracting the three-dimensional spatial point cloud data with a PointNet network as the feature extraction unit for the three-dimensional spatial point cloud data, finally obtaining a modal feature tensor of size (H/4, W/4, 256);
3) creating a new random vector to simulate the missing modal information, reshaping the random vector into a feature tensor of size (H/4, W/4, 256), inputting it into the generation network unit, and generating a pseudo-modal feature tensor of size (H/4, W/4, 256); this structure lets the decoding layers combine high-level features through the skip connections, ensuring feature diversity and giving the generation network unit more robust performance without increasing the parameter count of the coding network or the computational complexity of the decoding network; referring to FIG. 5, E1-E7 correspond to the coding layers, D1-D7 to the decoding layers, and S1-S7 to the skip layers.
4) splicing the pseudo-modal feature tensor obtained from the random vector through the generation network unit with the modal feature tensor corresponding to the real three-dimensional spatial point cloud data (step S2-2) along the channel direction, inputting the result into the attention network unit, which assigns randomly initialized weight coefficients to the different channels to represent the network's degree of interest, and outputting a two-layer feature matrix still stacked along the channel direction, of size (H/4, W/4, 512);
5) inputting the weighted two-layer feature matrix into the information fusion unit, reducing the channels with a 1x1 convolution to fuse the feature information of the two modalities, and outputting a pseudo multi-modal feature matrix of size (H/4, W/4, 256);
6) inputting the pseudo multi-modal feature matrix and the real multi-modal feature matrix (step S1-5) into the discrimination network unit; the discrimination network unit is structured as a simple multilayer neural network, and the discrimination task is a supervised binary classification problem that takes the output of the information fusion unit as input and analyzes its similarity to the given label (the real multi-modal feature matrix) to obtain a positive or negative result; to further improve discrimination accuracy, this embodiment also places an attention network unit in front of the information fusion unit to assign weight coefficients to the different channels, representing the network's degree of interest; the discrimination network unit extracts features with its CNN module, takes the pseudo multi-modal features as the predicted value and the real multi-modal features as the true value, and feeds them into the loss function module to obtain the matrix similarity;
7) backpropagating the matrix similarity to the generation network unit, creating a new random vector, and repeating steps S2-3 to S2-6 until the discrimination network unit judges the pseudo multi-modal features to be real (i.e., when the output matrix similarity is 1), at which point training of the generation network unit is complete (steps 3) to 7) are sketched as a training loop below).
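The training loop referenced in step 7) could be sketched as follows, reusing the ChannelAttentionFusion, SkipGenerator and composite_loss sketches given earlier; the placeholder tensors, optimizer settings and the stopping threshold standing in for "matrix similarity equals 1" are assumptions:

```python
import torch

# Placeholder stand-ins for the stored real data of steps S1-5 and S2-2.
cloud_feat = torch.randn(1, 256, 96, 320)  # real point-cloud modal features
real_multi = torch.randn(1, 256, 96, 320)  # stored real multi-modal matrix

generator = SkipGenerator()                # from the earlier sketch
fusion = ChannelAttentionFusion(512, 256)  # from the earlier sketch
opt = torch.optim.Adam(
    list(generator.parameters()) + list(fusion.parameters()), lr=2e-4)

for step in range(10_000):
    z = torch.randn(1, 256, 96, 320)                    # new random vector (S2-3)
    pseudo_modal = generator(z)                         # pseudo modal tensor (S2-3)
    stacked = torch.cat([pseudo_modal, cloud_feat], 1)  # splice along channels (S2-4)
    pseudo_multi = fusion(stacked)                      # pseudo multi-modal matrix (S2-5)
    loss = composite_loss(pseudo_multi, real_multi)     # matrix (dis)similarity (S2-6)
    opt.zero_grad()
    loss.backward()
    opt.step()                                          # backpropagate (S2-7)
    if loss.item() < 1e-3:  # assumed proxy for "pseudo judged to be real"
        break
```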
S3, the target detection application stage:
1) when the application scene involves an extremely severe weather environment, a camera failure or the like, modal data is collected in real time by the data acquisition equipment (the lidar and the camera) and the two-dimensional RGB image data is lost; the trained generation network unit generates the corresponding missing modal feature tensor from a random vector, the feature extraction unit extracts the intact three-dimensional modal feature tensor, and the missing modal feature tensor and the intact three-dimensional modal feature tensor undergo the splicing and convolution processing of step S1-3 to obtain a multi-modal feature matrix;
2) inputting the multi-modal feature matrix into the neural network unit trained in step S1-4, which outputs the target types and marks each detected target with a 2D bounding box.
Specifically, after entering the neural network unit, the multi-modal feature matrix is further convolved to generate feature maps at three different scales, which are input to the internal FPN and RPN modules; the RPN module completes the classification and regression of the targets from the 256-channel feature layers output by the last convolutional layer. Classification determines the type of each target, regression draws a 2D bounding box around each detected target, and the classification and regression results together are the output of the method.
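As a loose illustration of this three-scale prediction, the following sketch shows only the shape flow of a YOLO v3-style head producing classification-plus-regression outputs at three scales; the channel count, class count and anchors per scale are assumptions, and the FPN/RPN internals are omitted:

```python
import torch
import torch.nn as nn

class MultiScaleHead(nn.Module):
    """Sketch of a YOLO v3-style head: predictions at three feature scales,
    each carrying class scores plus 2D box parameters per anchor."""

    def __init__(self, channels: int = 256, num_classes: int = 3, anchors: int = 3):
        super().__init__()
        out = anchors * (num_classes + 5)  # 4 box offsets + 1 objectness score
        self.down1 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.heads = nn.ModuleList(nn.Conv2d(channels, out, 1) for _ in range(3))

    def forward(self, fused: torch.Tensor):
        p1 = fused             # finest scale
        p2 = self.down1(p1)    # middle scale
        p3 = self.down2(p2)    # coarsest scale
        # Per-scale predictions: classification plus 2D box regression.
        return [head(p) for head, p in zip(self.heads, (p1, p2, p3))]

preds = MultiScaleHead()(torch.randn(1, 256, 96, 320))
# Shapes: (1, 24, 96, 320), (1, 24, 48, 160), (1, 24, 24, 80)
```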
This embodiment is illustrated with only two kinds of modal data (with one of them missing); evidently, the method of the present invention can also perform target detection with three kinds of modal data (with one or two of them missing), and the details are not repeated here.

Claims (8)

1. A multi-modal target detection method suitable for modal deficiency, characterized in that the method comprises the following steps:
S1, a real mode generation stage:
S1-1, acquiring modal data of different dimensions at the same time and in the same space, based on an open-source data set;
S1-2, processing the modal data through a feature extraction unit, and extracting the modal feature tensor corresponding to each modal data;
S1-3, splicing the obtained modal feature tensors of different dimensions along the channel direction to obtain a spliced modal feature tensor, inputting it into an attention network unit, which assigns randomly initialized weight coefficients to the different channels, and outputting a multi-modal feature matrix stacked along the channel direction; inputting this matrix into an information fusion unit, which processes the stacked matrix with a 1x1 convolution to obtain the final multi-modal feature matrix;
S1-4, inputting the multi-modal feature matrix obtained in the previous step into a neural network unit for the designated perception task and training it on the target detection perception task; when training is complete, a neural network unit with target detection perception capability is obtained;
S1-5, after the processing of steps S1-1 to S1-3, inputting all modal data of the data set into the trained neural network unit for detection, and extracting and storing the real multi-modal feature matrix of each instance during detection, to serve as the ground truth for the discrimination network unit when training the generation network unit;
S2, a multi-modal fusion stage in the missing-modality scene:
S2-1, deleting all modal data of a certain dimension from the data set;
S2-2, processing the modal data of the remaining dimensions through the feature extraction unit, and extracting the modal feature tensors corresponding to the modal data of the remaining dimensions;
S2-3, creating a new random vector to simulate the missing modal information, and inputting the random vector into a generation network unit to generate a pseudo-modal feature tensor;
S2-4, splicing the pseudo-modal feature tensor with the modal feature tensors of step S2-2 along the channel direction, inputting the spliced tensor into the attention network unit, which assigns randomly initialized weight coefficients to the different channels to represent the network's degree of interest, and outputting a multi-modal feature matrix stacked along the channel direction;
S2-5, inputting the weighted multi-modal feature matrix into the information fusion unit, reducing the channels with a 1x1 convolution to fuse the feature information of the multiple modalities, and outputting a pseudo multi-modal feature matrix;
S2-6, inputting the pseudo multi-modal feature matrix and the real multi-modal feature matrix of step S1-5 into a discrimination network unit, which takes the pseudo multi-modal features as the predicted value and the real multi-modal features of step S1-5 as the true value and feeds them into the loss function module to obtain the matrix similarity;
S2-7, backpropagating the matrix similarity to the generation network unit, creating a new random vector, and repeating steps S2-3 to S2-6 until the discrimination network unit judges the pseudo multi-modal features to be real, at which point training of the generation network unit is complete;
S3, a target detection application stage:
S3-1, acquiring modal data in real time through data acquisition equipment; when the modal data of a certain dimension is missing, the trained generation network unit generates the corresponding missing modal feature tensor from a random vector, the feature extraction unit extracts the feature tensors of the remaining intact modalities, and the missing modal feature tensor and the remaining intact modal feature tensors are processed as in step S1-3 to obtain a multi-modal feature matrix;
S3-2, inputting the multi-modal feature matrix into the neural network unit trained in step S1-4, which outputs the type of each target and marks each detected target with a 2D bounding box.
2. The method of claim 1, wherein the data acquisition equipment comprises a lidar, a camera and roadside edge equipment, and the data set is modal data of different dimensions acquired by the data acquisition equipment at the same time and in the same space.
3. The method of claim 1, wherein the feature extraction unit comprises an RNN (recurrent neural network) for extracting one-dimensional modal data, a ResNet network for extracting two-dimensional modal data, and a PointNet network for extracting three-dimensional modal data.
4. The method of claim 1, wherein the generation network unit adopts a neural network model based on a GAN, comprising a model input, a generation module and a model output; the model input is a random vector, used in place of the missing modal information and fed to the generation module; the generation module is an improved GAN in which the coding layers and decoding layers are connected by skip layers to capture feature information more accurately; after backpropagation from the discrimination network unit and extensive training, the generation network unit can generate the feature matrix of the missing modality.
5. The method of claim 1, wherein the discrimination network unit comprises a model input, a CNN module, a loss function module and a model output.
6. The method of claim 5, wherein the loss function module adopts a composite global loss function based on L1 and structural similarity.
7. The method of claim 1, wherein the neural network unit specifically adopts a YOLO v3 network.
8. A multi-modal target detection system suitable for modal deficiency, characterized in that: the system comprises data acquisition equipment, a data set, a feature extraction unit, a generation network unit, an attention network unit, an information fusion unit, a discrimination network unit and a neural network unit for target detection;
the data set comprises multi-dimensional data information from a lidar, a camera and roadside edge equipment at the same time and in the same space; the complete multi-modal data information is used to train the neural network unit for the designated perception task, and the multi-modal feature matrix of each instance is extracted to serve as the ground truth for the discrimination network unit;
the feature extraction unit extracts the features of the modalities that are not faulty or missing;
the generation network unit connects its coding layers and decoding layers through internal skip layers to capture the feature information of the random vector; after backpropagation from the discrimination network unit and extensive training, it can generate the feature matrix of the missing modality;
the attention network unit is a channel-based attention network; the channel weight coefficients are randomly initialized and continuously optimized through backpropagation, so that the network captures the information of interest;
the information fusion unit splices the feature matrices of the multiple modalities along the channel direction, applies the weight coefficient assigned to each channel by the attention network unit, and convolves the spliced feature matrix with a 1x1 convolution kernel, fusing the multi-modal information of different dimensions into a single feature matrix;
the discrimination network unit comprises a model input, a CNN module, a loss function module and a model output; the model input comprises the real multi-modal feature data extracted by the trained neural network and the pseudo multi-modal feature data generated from random vectors by the generation network unit, the attention network unit and the information fusion unit, the attention network unit assigning weight coefficients to the channels to represent the degree of interest in the different channel features;
the CNN module extracts features from the two input matrices and feeds the real and pseudo features into the loss function module to calculate the degree of similarity of the two matrices;
the model output is the similarity of the two matrices; the similarity is backpropagated to the generation network unit, which continues to randomly generate missing-modality matrices whose pseudo multi-modal features are fed through the attention network unit and the information fusion layer to the discrimination network unit; this cycle repeats until the real and pseudo matrices are so similar that the discrimination network unit cannot distinguish them, at which point the multi-modal features are output;
the loss function module is a composite global loss function based on L1 and structural similarity.
CN202111456527.1A, filed 2021-12-01, priority 2021-12-01: Multi-modal target detection method and system suitable for modal loss. Status: Active. Granted publication: CN114359586B (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111456527.1A | 2021-12-01 | 2021-12-01 | Multi-modal target detection method and system suitable for modal loss


Publications (2)

Publication Number | Publication Date
CN114359586A | 2022-04-15
CN114359586B | 2022-08-05

Family

ID=81096914

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111456527.1A (Active, CN114359586B) | Multi-modal target detection method and system suitable for modal loss | 2021-12-01 | 2021-12-01

Country Status (1)

Country Link
CN (1) CN114359586B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114998567A * | 2022-07-18 | 2022-09-02 | Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences | Infrared point group target identification method based on multi-modal feature discrimination


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110288537A * | 2019-05-20 | 2019-09-27 | Hunan University | Face image completion method based on a self-attention deep generative adversarial network
US20200372369A1 * | 2019-05-22 | 2020-11-26 | Royal Bank Of Canada | System and method for machine learning architecture for partially-observed multimodal data
CN111460494A * | 2020-03-24 | 2020-07-28 | Guangzhou University | Privacy protection method and system for multi-modal deep learning
CN113361559A * | 2021-03-12 | 2021-09-07 | South China University of Technology | Multi-modal data knowledge information extraction method based on a deep-width joint neural network
CN113628294A * | 2021-07-09 | 2021-11-09 | Nanjing University of Posts and Telecommunications | Image reconstruction method and device for a cross-modal communication system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
WENHUI WANG: "VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts", arXiv:2111.02358v1
ZIJIAN GUO: "A data imputation method for multivariate time series based on generative adversarial network", Neurocomputing
杜若画: "Research on missing-modality image classification based on machine learning", China Master's Theses Full-text Database, Medicine and Health Sciences
檀华东: "Research on cross-modal generation and synchronous discrimination for audio-visual data", China Master's Theses Full-text Database, Information Science and Technology
黄子敬: "Research on multi-period exit flow prediction for multiple expressway toll stations based on a spatio-temporal attention mechanism", China Master's Theses Full-text Database, Engineering Science and Technology II


Also Published As

Publication number Publication date
CN114359586B (en) 2022-08-05


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
TR01: Transfer of patent right (effective date of registration: 20230925)
Address after: Building B4, 5th Floor, V3, Zhongguancun Collaborative Innovation Industrial Park, intersection of Lanzhou Road and Chongqing Road, Baohe Economic Development Zone, Hefei City, Anhui Province, 230000
Patentee after: Anhui Guandun Technology Co., Ltd.
Address before: 230009 No. 193, Tunxi Road, Hefei, Anhui
Patentee before: Hefei University of Technology Asset Management Co., Ltd.
TR01: Transfer of patent right (effective date of registration: 20230925)
Address after: 230009 No. 193, Tunxi Road, Hefei, Anhui
Patentee after: Hefei University of Technology Asset Management Co., Ltd.
Address before: 230009 No. 193, Tunxi Road, Hefei, Anhui
Patentee before: Hefei University of Technology