CN114445363A - System, method, device, processor and storage medium for realizing dangerous goods identification based on multi-mode data attention model - Google Patents


Info

Publication number
CN114445363A
Authority
CN
China
Prior art keywords
attention
image
dangerous goods
module
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210087402.4A
Other languages
Chinese (zh)
Inventor
常青青
李维姣
沈天明
刘伟豪
欧阳光
姚鸿达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Research Institute of the Ministry of Public Security
Original Assignee
Third Research Institute of the Ministry of Public Security
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Third Research Institute of the Ministry of Public Security filed Critical Third Research Institute of the Ministry of Public Security
Priority to CN202210087402.4A priority Critical patent/CN114445363A/en
Publication of CN114445363A publication Critical patent/CN114445363A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10116 X-ray image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Analysing Materials By The Use Of Radiation (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a system for realizing dangerous goods identification based on a multi-modal data attention model. The system comprises a feature extraction processing module for extracting image feature vectors from an input perspective image and backscatter image; an attention fusion processing module for performing attention fusion processing on the extracted image features; and a recognition output module for performing training processing of image classification, type detection, pixel segmentation and effective atomic number mapping on the obtained images according to the specific recognition task. The invention also relates to a corresponding method, device, processor and storage medium. By adopting the system, method, device, processor and storage medium for realizing dangerous goods identification based on the multi-modal data attention model, not only can dangerous goods with distinctive shape characteristics, such as guns, knives and lighters in luggage packages, be identified, but also the material composition of liquids in containers can be extracted, realizing the identification of dangerous goods such as flammable liquids without opening the luggage.

Description

System, method, device, processor and storage medium for realizing dangerous goods identification based on multi-mode data attention model
Technical Field
The invention relates to the technical field of computer vision, in particular to the technical field of image recognition, and specifically relates to a system, a method, a device, a processor and a computer readable storage medium for realizing dangerous goods recognition based on a multi-mode data attention model.
Background
X-ray inspection equipment is among the most widely used equipment in the security inspection field. Traditional X-ray equipment mostly uses transmission (perspective) imaging to perform non-contact imaging of the luggage carried by passengers, and whether dangerous goods are present is checked quickly by reading the images. The recognizable categories are limited, the level of intelligence is low, and the process depends heavily on the image-reading skill of the operators, which affects both security inspection efficiency and the accuracy of dangerous goods identification. With the development of new technologies such as big data and deep learning, some X-ray security inspection equipment learns the shapes of objects in X-ray images by embedding a deep-learning-based image recognition module, and to a certain extent identifies dangerous goods with obvious shape characteristics such as guns and knives; however, this technology has no capability to distinguish materials and cannot identify liquid dangerous goods such as gasoline and alcohol in luggage. X-ray backscatter imaging is another detection imaging technology; it is sensitive to low-effective-atomic-number, high-density organic matter, can highlight dangerous goods such as flammable and explosive articles, and can further improve inspection accuracy.
In deep-learning-based dangerous goods identification, convolutional neural networks (CNNs) are widely used as feature extractors for X-ray images. For multi-modal data fusion, a multi-stream deep architecture can be built by feeding in the perspective and backscatter images of the same piece of luggage, but this approach has two disadvantages: first, the features of the different modalities are extracted independently, so important high-level features shared between the two modalities are missed; second, simply concatenating or combining the separately extracted features introduces redundant information, which makes the system prone to overfitting.
Disclosure of Invention
The present invention is directed to overcoming the above-mentioned disadvantages of the prior art, and providing a system, method, apparatus, processor and computer readable storage medium thereof for implementing dangerous goods identification based on multi-modal data attention model, which can implement cross-attention fusion feature extraction.
In order to achieve the above objects, the system, method, apparatus, processor and computer readable storage medium thereof for realizing identification of dangerous goods based on multi-modal data attention model of the present invention are as follows:
the system for realizing dangerous goods identification based on the multi-mode data attention model is mainly characterized by comprising the following components:
the characteristic extraction processing module is used for extracting image characteristic vectors from the input perspective image and the backscattering image through a basic neural network;
the attention fusion processing module is connected with the feature extraction processing module and is used for carrying out fusion processing on the self-attention and the cross-attention of each image feature acquired in a plurality of extraction stages to obtain a fusion image feature; and
and the recognition output module is connected with the attention fusion processing module and is used for performing training processing of image classification, type detection, pixel segmentation and effective atomic number mapping on the obtained images according to the specific recognition task.
Preferably, the feature extraction processing module specifically includes:
carrying out feature extraction processing on the high-energy image training sample T_H and the low-energy image training sample T_L of the X-ray perspective image, and on the backscatter image.
Preferably, the attention fusion processing module includes:
the self-attention processing unit is used for generating a space attention mask for the perspective image processed by the convolutional neural network model and generating a substance attention mask for the backscatter image;
the cross attention processing unit is connected with the self attention processing unit and is used for performing enhancement processing on the material characteristic by using the image characteristic;
and the modality fusion extraction unit is connected with the self-attention processing unit and the cross-attention processing unit and is used for performing output combination processing on the perspective image and the backscatter image so as to highlight important parts of two modalities.
Preferably, the identification output module includes:
the detection submodule is connected with the attention fusion processing module and used for detecting and identifying the interested object in the image and outputting a bounding box of the interested object;
the segmentation sub-module is connected with the attention fusion processing module and is used for identifying and processing the pixels in the image according to the corresponding object types; and
and the atomic number regression submodule is connected with the attention fusion processing module and is used for estimating the equivalent atomic number of the scanned object pixel by pixel.
The method for realizing the identification of the dangerous goods based on the multi-mode data attention model by utilizing the system is mainly characterized by comprising the following steps of:
(1) the basic neural network carries out feature extraction processing on the input original perspective image and the back scattering image;
(2) performing multi-mode information fusion processing based on an attention mechanism;
(3) and performing network training and reasoning processing of multi-task joint learning on the acquired images.
Preferably, the step (1) comprises the following steps:
(1.1) carrying out feature extraction on the input original perspective image and the input back scattering image through a basic neural network;
(1.2) inputting the perspective image and the backscatter image collected by the security inspection machine, each with height W1, width H1 and feature dimension D1, and outputting a first feature tensor of size W1 × H1 × D1.
Preferably, in a specific implementation the step (1.1) adopts a deep convolutional neural network composed of a plurality of convolution layers and pooling layers; the output of each convolution operation is batch-normalized, and residual connections between a plurality of different stages make gradient propagation smoother during training, so that the network is easier to train.
Preferably, the step (2) comprises:
(2.1) fusing the features of the perspective and backscatter images through a plurality of cascaded, interleaved self-attention and cross-attention layers to obtain fused multi-modal features;
(2.2) separately inputting each modal feature, with height W2, width H2 and feature dimension D2, and outputting a second feature tensor of size W2 × H2 × D2 after processing and fusion by a plurality of attention layers.
Preferably, the step (3) specifically includes the following steps:
(3.1) the detection submodule, modeled on the Region Proposal Network and the ROIAlign layer, detects the dangerous goods and outputs bounding box information for the categories of interest;
(3.2) the segmentation submodule realizes pixel-level segmentation and extracts the edge of the dangerous goods through a deconvolution layer and an up-sampling layer;
and (3.3) the effective atomic number regression module generates an effective atomic number map corresponding to the luggage package, which is used to assist in identifying dangerous goods contained in the luggage.
Preferably, on the manual labeling data set, the feature extraction processing module, the attention fusion module and the recognition output module are subjected to multi-task training according to a joint loss function of a plurality of final tasks through a labeled training set, so as to obtain better network parameters.
The device for realizing dangerous goods identification based on the multi-mode data attention model is mainly characterized by comprising the following components:
a processor configured to execute computer-executable instructions;
a memory storing one or more computer-executable instructions that, when executed by the processor, perform the steps of the above-described method for multi-modal data attention model-based threat identification.
The processor for realizing dangerous goods identification based on the multi-modal data attention model is mainly characterized in that the processor is configured to execute computer executable instructions, and when the computer executable instructions are executed by the processor, the steps of the dangerous goods identification method based on the multi-modal data attention model are realized.
The computer-readable storage medium is mainly characterized by having a computer program stored thereon, wherein the computer program is executable by a processor to implement the steps of the above-mentioned method for identifying a hazardous material based on a multi-modal data attention model.
By adopting the system, method, device, processor and computer readable storage medium for realizing dangerous goods identification based on the multi-modal data attention model, an attention model over multi-modal data is built on X-ray dual-energy perspective and backscatter image data: a convolutional neural network (CNN) extracts the spatial features of the X-ray image, and material features are extracted and fused through a cross-attention mechanism. As a result, not only can dangerous goods with shape characteristics, such as guns, knives and lighters in luggage packages, be identified, but also the material composition of liquids in containers can be extracted, realizing the identification of dangerous goods such as flammable liquids without opening the luggage and markedly improving performance in visual reasoning tasks.
Drawings
Fig. 1 is a block diagram of a system for identifying dangerous goods based on a multi-modal data attention model according to the present invention.
FIG. 2 is a flow chart of a method of implementing multi-modal data attention model-based threat identification in accordance with the present invention.
Detailed Description
In order to more clearly describe the technical contents of the present invention, the following further description is given in conjunction with specific embodiments.
Before describing in detail embodiments that are in accordance with the present invention, it should be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, the system for identifying dangerous goods based on multi-modal data attention model includes:
the characteristic extraction processing module is used for extracting image characteristic vectors from the input perspective image and the backscattering image through a basic neural network;
the attention fusion processing module is connected with the feature extraction processing module and is used for carrying out fusion processing on the self-attention and the cross-attention of each image feature acquired in a plurality of extraction stages to obtain a fusion image feature; and
and the recognition output module is connected with the attention fusion processing module and is used for performing training processing of image classification, type detection, pixel segmentation and effective atomic number mapping on the obtained images according to the specific recognition task.
As a preferred embodiment of the present invention, the feature extraction processing module specifically includes:
carrying out feature extraction processing on the high-energy image training sample T_H and the low-energy image training sample T_L of the X-ray perspective image, and on the backscatter image.
As a preferred embodiment of the present invention, the attention fusion processing module includes:
the self-attention processing unit is used for generating a space attention mask for the perspective image processed by the convolutional neural network model and generating a substance attention mask for the backscatter image;
the cross attention processing unit is connected with the self attention processing unit and is used for performing enhancement processing on the material characteristic by using the image characteristic;
and the modality fusion extraction unit is connected with the self-attention processing unit and the cross-attention processing unit and is used for performing output combination processing on the perspective image and the backscatter image so as to highlight important parts of two modalities.
As a preferred embodiment of the present invention, the identification output module includes:
the detection submodule is connected with the attention fusion processing module and used for detecting and identifying the interested object in the image and outputting a bounding box of the interested object;
the segmentation sub-module is connected with the attention fusion processing module and is used for identifying and processing the pixels in the image according to the corresponding object types; and
and the atomic number regression submodule is connected with the attention fusion processing module and is used for estimating the equivalent atomic number of the scanned object pixel by pixel.
The method for realizing the identification of the dangerous goods based on the multi-modal data attention model by using the system comprises the following steps:
(1) the basic neural network carries out feature extraction processing on the input original perspective image and the back scattering image;
(2) performing multi-mode information fusion processing based on an attention mechanism;
(3) and performing network training and reasoning processing of multi-task joint learning on the acquired images.
As a preferred embodiment of the present invention, the step (1) comprises the steps of:
(1.1) carrying out feature extraction on the input original perspective image and the input back scattering image through a basic neural network;
(1.2) inputting the perspective image and the backscatter image collected by the security inspection machine, each with height W1, width H1 and feature dimension D1, and outputting a first feature tensor of size W1 × H1 × D1.
As a preferred embodiment of the invention, in a specific implementation the step (1.1) adopts a deep convolutional neural network composed of a plurality of convolution layers and pooling layers; the output of each convolution operation is batch-normalized, and residual connections between a plurality of different stages make gradient propagation smoother during training, so that the network is easier to train.
As a preferred embodiment of the present invention, the step (2) comprises:
(2.1) fusing the features of the perspective and backscatter images through a plurality of cascaded, interleaved self-attention and cross-attention layers to obtain fused multi-modal features;
(2.2) separately inputting each modal feature, with height W2, width H2 and feature dimension D2, and outputting a second feature tensor of size W2 × H2 × D2 after processing and fusion by a plurality of attention layers.
As a preferred embodiment of the present invention, the step (3) specifically comprises the following steps:
(3.1) the detection submodule, modeled on the Region Proposal Network and the ROIAlign layer, detects the dangerous goods and outputs bounding box information for the categories of interest;
(3.2) the segmentation submodule realizes pixel-level segmentation and extracts the edge of the dangerous goods through a deconvolution layer and an up-sampling layer;
and (3.3) the effective atomic number regression module generates an effective atomic number map corresponding to the luggage package, which is used to assist in identifying dangerous goods contained in the luggage.
As a preferred embodiment of the invention, on a manually labeled data set, the feature extraction processing module, the attention fusion module and the recognition output module are subjected to multi-task training according to a joint loss function of a plurality of final tasks through a labeled training set so as to obtain better network parameters.
The device for realizing dangerous goods identification based on the multi-modal data attention model comprises:
a processor configured to execute computer-executable instructions;
a memory storing one or more computer-executable instructions that, when executed by the processor, perform the steps of the above-described method for multi-modal data attention model-based threat identification.
The processor for identifying dangerous goods based on the multi-modal data attention model is configured to execute computer executable instructions, and when the computer executable instructions are executed by the processor, the steps of the method for identifying dangerous goods based on the multi-modal data attention model are realized.
The computer-readable storage medium has stored thereon a computer program which is executable by a processor to implement the steps of the above-described method for identifying a hazardous material based on a multimodal data attention model.
In practical application, the feature extraction processing module of the invention has the following main functions:
the module has the main functions of extracting characteristic opening vectors from perspective images and back scattering images respectively, inputting the characteristic opening vectors into original perspective images and back scattering images respectively, ensuring consistent resolution ratio and ensuring one-to-one correspondence relationship. To accommodate different input image resolutions, a full convolution neural network (FCN) is employed, consisting of several layers of convolution and pooling layers. The output of each convolution operation is standardized in batch, and residual connection among a plurality of different stages ensures that gradient transfer is smoother during training, so that the training is easier.
The weights of this part can usually be initialized directly from an existing pretrained network and fine-tuned on X-ray image training data. If enough labeled X-ray data is available, the weights can also be randomly initialized and the network trained from scratch.
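For illustration only, a minimal PyTorch-style sketch of such a fully convolutional backbone is given below. The stage widths, the use of max pooling, and the names ConvStage and FCNBackbone are assumptions made for this example, not details taken from the patent.

    import torch
    import torch.nn as nn

    class ConvStage(nn.Module):
        """One backbone stage: two convolutions with batch normalization, a residual shortcut, then pooling."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),          # batch-normalize each convolution output
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
            )
            self.shortcut = nn.Conv2d(in_ch, out_ch, 1)  # 1x1 projection so the residual add matches channels
            self.pool = nn.MaxPool2d(2)

        def forward(self, x):
            # residual connection between stages keeps gradient propagation smooth during training
            return self.pool(torch.relu(self.body(x) + self.shortcut(x)))

    class FCNBackbone(nn.Module):
        """Fully convolutional feature extractor; one instance per modality."""
        def __init__(self, in_ch, dims=(32, 64, 128)):
            super().__init__()
            self.stages = nn.ModuleList()
            for d in dims:
                self.stages.append(ConvStage(in_ch, d))
                in_ch = d

        def forward(self, x):
            feats = []   # per-stage feature maps, later consumed by the attention fusion
            for stage in self.stages:
                x = stage(x)
                feats.append(x)
            return feats

    # one backbone per modality: perspective (assumed 2 channels for the high/low-energy images) and backscatter
    perspective_net, backscatter_net = FCNBackbone(in_ch=2), FCNBackbone(in_ch=1)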
In practical application, the multi-modal information fusion based on attention mechanism of the invention has the following main functions:
the part is the core of the technical scheme, and the information of the perspective and back scattering images is fused through the attention layers of a plurality of stages to obtain the fused characteristics. Similar to the neural message posting, a fused feature is obtained by stacking the self-attention (self-attention) and cross-attention (cross-attention) of multiple stages, specifically, the self-attention layer for the feature T
SA(T)=SA(QT,KT,VT)=softmax(QT×KT).dot(VT)
For a cross anchorage layer,
CA(T1,T2)=CT(QT2,KT1,VT1)=softmax(QT2×KT1).dot(VT1)
the complete computation also includes a residual connection, then the complete input and output relationships from the attention layer are:
SA(T)=MLP(BatchNorm(SA(T)+T
the input and output relationship of the mutual attention layer is as follows:
CT(T1,T2)=MLP(BatchNorm(CA(T1,T2)+T2
the input of the module is W multiplied by H multiplied by D tenor, the output is W multiplied by H multiplied by D tenor, and the dimension is guaranteed to be unchanged. However, in the process, the features of different positions of the image and the features of different modes are gradually integrated by mutual attention, so that the feature fusion is completed.
In practical application, the main functions of the multi-task joint learning of the invention are as follows:
the multi-task learning is proved to be capable of improving the effect of a single task by using the commonality between different tasks. And the computation complexity is reduced by sharing the computation by a plurality of tasks. In the invention, the multi-mode characteristics can be fused in the multi-task learning, the characteristics of different modes for different tasks are fully utilized, and the effect is improved. For this purpose, at this stage, different tasks are performed by means of multi-task learning via a plurality of different branches and are trained simultaneously.
The first branch can follow the Region Proposal Network and the ROIAlign layer to detect objects such as guns, knives and flammable liquids in the image and output information such as bounding boxes for the categories of interest.
The second branch simultaneously performs pixel-level segmentation, separating dangerous goods from the background, which facilitates visual display.
The third branch directly regresses the effective atomic number to generate an effective atomic number map of the luggage package, which is used to identify dangerous goods such as gasoline and explosives in the package.
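As a rough illustration (not the patented network), the segmentation and atomic-number branches could be sketched as below; the detection branch would reuse a standard Region Proposal Network plus ROIAlign pipeline and is omitted here for brevity. Channel counts, upsampling factors and the class number are assumed values.

    import torch.nn as nn

    class SegmentationHead(nn.Module):
        """Pixel-level segmentation branch: deconvolution plus up-sampling back toward input resolution."""
        def __init__(self, in_ch, num_classes):
            super().__init__()
            self.layers = nn.Sequential(
                nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2),      # deconvolution layer
                nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),   # up-sampling layer
                nn.Conv2d(in_ch // 2, num_classes, kernel_size=1),                   # per-pixel class logits
            )

        def forward(self, fused):
            return self.layers(fused)

    class AtomicNumberHead(nn.Module):
        """Effective atomic number branch: per-pixel regression producing a Z_eff map of the luggage."""
        def __init__(self, in_ch):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
                nn.Conv2d(in_ch, 1, kernel_size=1),                                  # one regressed value per pixel
            )

        def forward(self, fused):
            return self.layers(fused)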
Throughout the training process, the loss function is the sum of the losses of all branches, i.e. the object class loss, the mask loss and so on.
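A minimal sketch of such a joint loss, with hypothetical per-branch loss terms and illustrative weights, is shown below; the detection classification and box losses are assumed to be produced by the RPN/ROI heads.

    import torch.nn.functional as F

    def joint_loss(det_cls_loss, det_box_loss, seg_logits, seg_target, z_pred, z_target,
                   w_det=1.0, w_seg=1.0, w_z=1.0):
        """Sum of the per-branch losses used for multi-task training."""
        seg_loss = F.cross_entropy(seg_logits, seg_target)     # pixel-wise segmentation loss
        z_loss = F.smooth_l1_loss(z_pred, z_target)            # effective atomic number regression loss
        return w_det * (det_cls_loss + det_box_loss) + w_seg * seg_loss + w_z * z_loss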
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by suitable instruction execution devices.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, and the program may be stored in a computer readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of terms "an embodiment," "some embodiments," "an example," "a specific example," or "an embodiment," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
By adopting the system, method, device, processor and computer readable storage medium for realizing dangerous goods identification based on the multi-modal data attention model, an attention model over multi-modal data is built on X-ray dual-energy perspective and backscatter image data: a convolutional neural network (CNN) extracts the spatial features of the X-ray image, and material features are extracted and fused through a cross-attention mechanism. As a result, not only can dangerous goods with shape characteristics, such as guns, knives and lighters in luggage packages, be identified, but also the material composition of liquids in containers can be extracted, realizing the identification of dangerous goods such as flammable liquids without opening the luggage and markedly improving performance in visual reasoning tasks.
In this specification, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (13)

1. A system for identifying hazardous materials based on a multi-modal data attention model, the system comprising:
the characteristic extraction processing module is used for extracting image characteristic vectors from the input perspective image and the backscattering image through a basic neural network;
the attention fusion processing module is connected with the feature extraction processing module and is used for carrying out fusion processing on the self-attention and the cross-attention of each image feature acquired in a plurality of extraction stages to obtain a fusion image feature; and
and the recognition output module is connected with the attention fusion processing module and is used for performing training processing of image classification, type detection, pixel segmentation and effective atomic number mapping on the obtained images according to the specific recognition task.
2. The system for realizing dangerous goods identification based on multi-modal data attention model according to claim 1, wherein the feature extraction processing module is specifically:
carrying out feature extraction processing on the high-energy image training sample T_H and the low-energy image training sample T_L of the X-ray perspective image, and on the backscatter image.
3. The system for performing threat identification based on multimodal data attention models of claim 2, wherein the attention fusion processing module comprises:
the self-attention processing unit is used for generating a space attention mask for the perspective image processed by the convolutional neural network model and generating a substance attention mask for the backscatter image;
the cross attention processing unit is connected with the self attention processing unit and is used for performing enhancement processing on the material characteristics by using the image characteristics;
and the modality fusion extraction unit is connected with the self-attention processing unit and the cross-attention processing unit and is used for performing output combination processing on the perspective image and the backscatter image so as to highlight important parts of two modalities.
4. The system for performing threat identification according to claim 3, wherein the identification output module comprises:
the detection sub-module is connected with the attention fusion processing module and is used for detecting and identifying the interested object in the image and outputting a bounding box of the interested object;
the segmentation sub-module is connected with the attention fusion processing module and is used for identifying and processing the pixels in the image according to the corresponding object types; and
and the atomic number regression submodule is connected with the attention fusion processing module and is used for estimating the equivalent atomic number of the scanned object pixel by pixel.
5. A method for performing a multi-modal data attention model based threat identification using the system of claims 1-4, the method comprising the steps of:
(1) the basic neural network carries out feature extraction processing on the input original perspective image and the back scattering image;
(2) performing multi-mode information fusion processing based on an attention mechanism;
(3) and performing network training and reasoning processing of multi-task joint learning on the acquired images.
6. The method for realizing the identification of the dangerous goods based on the multi-modal data attention model according to claim 5, wherein the step (1) comprises the following steps:
(1.1) carrying out feature extraction on the input original perspective image and the input back scattering image through a basic neural network;
(1.2) inputting the perspective image and the backscatter image collected by the security inspection machine, each with height W1, width H1 and feature dimension D1, and outputting a first feature tensor of size W1 × H1 × D1.
7. The method for realizing the identification of dangerous goods based on the multi-modal data attention model as claimed in claim 6, wherein the step (1.1) is implemented by using a deep convolutional neural network, which is composed of a plurality of layers of convolution and a pooling layer, the output of each convolution operation is standardized in batch, and the residual connection between a plurality of different stages ensures that the gradient transmission is smoother during training, thereby being easier to train.
8. The method for performing multi-modal data attention model-based threat identification as claimed in claim 6, wherein said step (2) comprises:
(2.1) fusing the characteristics of the perspective and backscatter images through a plurality of cascade-crossed self-attention layers and mutual-attention layers to obtain fused multi-modal characteristics;
(2.2) separately inputting each modal feature, with height W2, width H2 and feature dimension D2, and outputting a second feature tensor of size W2 × H2 × D2 after processing and fusion by a plurality of attention layers.
9. The method for realizing the identification of the dangerous goods based on the multi-modal data attention model according to claim 8, wherein the step (3) comprises the following steps:
(3.1) the detection submodule refers to the Region Proposal Network and the ROIAlign layer to detect the dangerous goods and output the bounding box information of the interested category;
(3.2) the segmentation submodule realizes pixel-level segmentation and extracts the edge of the dangerous goods through a deconvolution layer and an up-sampling layer;
and (3.3) generating an effective atomic number mapping map corresponding to the luggage package by the effective atomic number regression module, wherein the effective atomic number mapping map is used for assisting in identifying dangerous goods contained in the luggage package.
10. The method for realizing dangerous goods identification based on multi-modal data attention model according to claim 9, characterized in that on the manually labeled data set, the feature extraction processing module, the attention fusion module and the identification output module are multi-tasked trained according to the joint loss function of the final multiple tasks through the labeled training set to obtain better network parameters.
11. An apparatus for realizing dangerous goods identification based on a multi-modal data attention model, which is characterized in that the apparatus comprises:
a processor configured to execute computer-executable instructions;
a memory storing one or more computer-executable instructions that, when executed by the processor, implement the steps of the method for realizing dangerous goods identification based on the multi-modal data attention model according to claim 10.
12. A processor for performing threat identification based on a multimodal data attention model, wherein the processor is configured to execute computer-executable instructions that, when executed by the processor, perform the steps of the method for threat identification based on a multimodal data attention model of claim 10.
13. A computer-readable storage medium, having stored thereon a computer program which is executable by a processor for performing the steps of the method for threat identification based on multimodal data attention model as claimed in claim 10.
CN202210087402.4A 2022-01-25 2022-01-25 System, method, device, processor and storage medium for realizing dangerous goods identification based on multi-mode data attention model Pending CN114445363A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210087402.4A CN114445363A (en) 2022-01-25 2022-01-25 System, method, device, processor and storage medium for realizing dangerous goods identification based on multi-mode data attention model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210087402.4A CN114445363A (en) 2022-01-25 2022-01-25 System, method, device, processor and storage medium for realizing dangerous goods identification based on multi-mode data attention model

Publications (1)

Publication Number Publication Date
CN114445363A true CN114445363A (en) 2022-05-06

Family

ID=81369608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210087402.4A Pending CN114445363A (en) 2022-01-25 2022-01-25 System, method, device, processor and storage medium for realizing dangerous goods identification based on multi-mode data attention model

Country Status (1)

Country Link
CN (1) CN114445363A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117347396A (en) * 2023-08-18 2024-01-05 北京声迅电子股份有限公司 XGBoost model-based substance type identification method
CN117347396B (en) * 2023-08-18 2024-05-03 北京声迅电子股份有限公司 Material type identification method based on XGBoost model

Similar Documents

Publication Publication Date Title
US20200125885A1 (en) Vehicle insurance image processing method, apparatus, server, and system
Raghavan et al. Optimized building extraction from high-resolution satellite imagery using deep learning
CN107909093B (en) Method and equipment for detecting articles
KR101930940B1 (en) Apparatus and method for analyzing image
CN108491848A (en) Image significance detection method based on depth information and device
CN110751090B (en) Three-dimensional point cloud labeling method and device and electronic equipment
Nordeng et al. DEBC detection with deep learning
CN114445363A (en) System, method, device, processor and storage medium for realizing dangerous goods identification based on multi-mode data attention model
CN116092096A (en) Method, system, device and medium for verifying the authenticity of a declared message
CN116645586A (en) Port container damage detection method and system based on improved YOLOv5
Andrews et al. Representation-learning for anomaly detection in complex x-ray cargo imagery
CN113095404B (en) X-ray contraband detection method based on front-back background convolution neural network
Zhao et al. Simultaneous material segmentation and 3D reconstruction in industrial scenarios
CN112884755B (en) Method and device for detecting contraband
Esmaeily et al. Building roof wireframe extraction from aerial images using a three-stream deep neural network
CN115546824B (en) Taboo picture identification method, apparatus and storage medium
CN116228637A (en) Electronic component defect identification method and device based on multi-task multi-size network
CN113312970A (en) Target object identification method, target object identification device, computer equipment and storage medium
CN116188361A (en) Deep learning-based aluminum profile surface defect classification method and device
Li et al. Deep Learning-based Model for Automatic Salt Rock Segmentation
CN116994024A (en) Method, device, equipment, medium and product for identifying parts in container image
CN114140612A (en) Method, device, equipment and storage medium for detecting hidden danger of power equipment
CN113191237A (en) Improved YOLOv 3-based fruit tree image small target detection method and device
Delgado et al. Methodology for generating synthetic labeled datasets for visual container inspection
CN111950475A (en) Yalhe histogram enhancement type target recognition algorithm based on yoloV3

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination