CN115731400A - X-ray image foreign matter detection method based on self-supervision learning

Info

Publication number: CN115731400A
Application number: CN202211509961.6A
Authority: CN
Inventors: 庞子龙; 甘志华; 宋亚林; 冯世杰; 邓欣冉
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2022-11-29
Filing date: 2022-11-29
Publication date: 2023-03-03

Abstract

The invention provides an X-ray image foreign matter detection method based on self-supervised learning. The method comprises the following steps. Step 1: construct a ConvMAE model based on structural reparameterization, denoted the RepConvMAE self-supervised model, which comprises an Encoder for extracting multi-scale feature maps of an input image and a Decoder for reconstructing the image from the extracted multi-scale feature maps. Step 2: pre-train the RepConvMAE self-supervised model on an X-ray image data set. Step 3: construct a detection model EFFR-CNN which comprises, in order from shallow to deep, a backbone network, an FPN layer, a shared feature layer, an RPN layer, an ROI Pooling layer and a fully connected layer, where the backbone network is the Encoder of the pre-trained RepConvMAE self-supervised model. Step 4: train the detection model EFFR-CNN on an X-ray foreign matter image data set. Step 5: input the X-ray image to be detected into the trained detection model EFFR-CNN to obtain a detection result.

Description

X-ray image foreign matter detection method based on self-supervision learning

Technical Field

The invention relates to the technical field of deep learning, in particular to an X-ray image foreign matter detection method based on self-supervision learning.

Background

A needle detector is mainly used to find broken needles and metal fragments in textiles so that consumers are not injured by sharp metal foreign matter. A traditional needle detector works by magnetic induction: a permanent magnet is mounted in its detection head, so it can only detect ferromagnetic metals (iron, cobalt, nickel, and alloys containing them) and cannot detect other fine, high-purity metals. Newer needle detectors address this limitation with deep-learning foreign matter detection algorithms.

However, existing foreign matter detection algorithms have shortcomings. For example, patent document CN112525931A discloses the metal foreign matter identification network Super-FODNet, which requires collecting tens of thousands of X-ray images of shoes and clothes with and without metal foreign matter and labeling them manually. The labeling cost is high and consumes a great deal of manpower and material resources. Moreover, manual labeling is never fully accurate: missed and incorrect labels occur, which harms training speed and accuracy. Patent document CN109886935A discloses a deep-learning method for detecting foreign matter on a road surface, which uses deep learning to identify multiple foreign matter blocks and output the foreign matter category.

Disclosure of Invention

To address the low detection accuracy that results from scarce data samples, unbalanced samples and imperfectly accurate manual labels, the invention provides an X-ray image foreign matter detection method based on self-supervised learning. The method also improves the detection performance of the model on small objects (such as broken metal needles).

The invention provides an X-ray image foreign matter detection method based on self-supervision learning, which comprises the following steps:

Step 1: constructing a ConvMAE model based on structural reparameterization, denoted the RepConvMAE self-supervised model, which comprises an Encoder for extracting multi-scale feature maps of an input image and a Decoder for reconstructing the image from the extracted multi-scale feature maps;

Step 2: pre-training the RepConvMAE self-supervised model on an X-ray image data set;

Step 3: constructing a detection model EFFR-CNN, which comprises, in order from shallow to deep: a backbone network, an FPN layer, a shared feature layer, an RPN layer, an ROI Pooling layer and a fully connected layer; the backbone network is the Encoder of the pre-trained RepConvMAE self-supervised model;

Step 4: training the detection model EFFR-CNN on an X-ray foreign matter image data set;

Step 5: inputting the X-ray image to be detected into the trained detection model EFFR-CNN to obtain a detection result.

Further, the Encoder in the RepConvMAE self-supervised model comprises 4 stages from shallow to deep. The first three stages have the same structure, each comprising a Patch Embedding layer and a Masked Convolution Block based on structural reparameterization; the last stage comprises a Patch Embedding layer and a Transformer Block.

The Masked Convolution Block based on structural reparameterization comprises, in order from shallow to deep: a Batch Norm layer, a 1 × 1 convolution layer, a K × K convolution layer based on structural reparameterization, a Batch Norm layer, a 1 × 1 convolution layer, and a 1 × 1 convolution layer. The K × K convolution layer based on structural reparameterization means the following: during training, the K × K convolution layer is replaced with three branches, namely a 1 × K convolution layer followed by a Batch Norm layer, a K × 1 convolution layer followed by a Batch Norm layer, and a K × K convolution layer followed by a Batch Norm layer; at inference, the three branches are replaced with a single K × K convolution layer.

Further, the masking strategy of the RepConvMAE self-supervised model is as follows: a random mask is applied directly to the input of the last stage, and the resulting last-stage mask is then upsampled by different factors to obtain the masks of the first three stages. The mask of each patch is 0 or 1, where 0 means masked and 1 means unmasked.

Further, the Decoder in the RepConvMAE self-supervised model comprises a Linear layer and a Transformer Block; the output of the Linear layer is supplemented with the masked patches and then fed into the Transformer Block.

Further, the processing of the Linear layer is expressed by formula (2):

$$E_d = \mathrm{Linear}\big(\mathrm{StrideConv}(E_1, 8) + \mathrm{StrideConv}(E_2, 4) + \mathrm{StrideConv}(E_3, 2) + E_4\big) \tag{2}$$

where $E_d$ is the output of the Linear layer, $\mathrm{StrideConv}(\cdot, X)$ denotes a convolution with stride $X$, and $E_1$, $E_2$, $E_3$, $E_4$ denote feature maps of four different scales, ordered from largest to smallest.

Further, the FPN layer is an improved FPN, obtained by adding a fusion factor to the original FPN. Specifically, the improved FPN aggregates adjacent feature layers using formula (3). In formula (3), a 1 × 1 convolution layer performs channel matching, $F_{up}$ denotes a two-fold upsampling operation that matches the feature map sizes, $\mathrm{Conv}()$ denotes a convolution operation for feature processing, $\alpha$ is the fusion factor, and $P_i$ denotes the i-th feature layer.

Further, the value of the fusion factor $\alpha$ is determined using formula (4), whose two quantities respectively represent the number of objects on the $P_{i+1}$ layer and on the $P_i$ layer.

The invention has the beneficial effects that:

(1) The X-ray image foreign matter detection method based on self-supervised learning comprises a pre-training model and a detection model. The pre-training model is trained in a self-supervised manner, so it does not need a large labeled data set, which saves labor cost. It also does not require the number of samples per class to be balanced, which removes the effect of data imbalance on the model. In the detection model, feature fusion can use the conventional FPN structure, and an improved FPN layer is also provided to further raise detection precision on tiny objects;

(2) The pre-training model uses an Encoder with a multi-level structure, so it can extract multi-scale features and improve the expressive power of the feature maps. Structural reparameterization is added to the pre-training model so that it has a large number of parameters during training while inference remains fast, which improves detection efficiency in practical use;

(3) In the detection model, the Encoder of the pre-training model serves as the backbone network, multi-scale features are fused, and the improved FPN structure is applied. This further improves detection precision on tiny objects and meets the precision and efficiency requirements of industrial practice;

(4) The invention has wide application scene and can be applied to the foreign matter detection in the industrial production fields of food, textile, medicine, electronics and the like.

Drawings

FIG. 1 is a flowchart of an X-ray image foreign matter detection method based on self-supervised learning according to an embodiment of the present invention;

FIG. 2 is a network structure diagram of the RepConvMAE self-supervised model according to an embodiment of the present invention;

FIG. 3 is a structural diagram of the Masked Convolution Block based on structural reparameterization according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the structural reparameterization operation in the Masked Convolution Block based on structural reparameterization according to an embodiment of the present invention;

FIG. 5 is a network structure diagram of the detection algorithm EFFR-CNN according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be described clearly below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

Example 1

As shown in fig. 1, an embodiment of the present invention provides an X-ray image foreign object detection method based on self-supervised learning, including the following steps:

S101: constructing a ConvMAE model based on structural reparameterization, denoted the RepConvMAE self-supervised model, which comprises an Encoder for extracting multi-scale feature maps of an input image and a Decoder for reconstructing the image from the extracted multi-scale feature maps. RepConvMAE stands for Re-parameterized Convolution Masked AutoEncoders.

Specifically, structural reparameterization is added to the model so that it has a large number of parameters during training while inference remains fast, which improves detection efficiency in practical use. In addition, whereas the MAE model can extract features at only a single scale, the RepConvMAE self-supervised model of this embodiment can extract multi-scale features.

S102: pre-training the RepConvMAE self-supervised model on an X-ray image data set;

Specifically, an open-source X-ray image data set is used for pre-training. Each image in the data set is divided into patches (e.g., 16 × 16) and fed into the RepConvMAE self-supervised model.

S103: constructing a detection model EFFR-CNN which, as shown in FIG. 5, comprises in order from shallow to deep: a backbone network, an FPN layer, a shared feature layer, an RPN layer, an ROI Pooling layer and a fully connected layer; the backbone network is the Encoder of the pre-trained RepConvMAE self-supervised model. EFFR-CNN stands for Effective Fusion Factor Region-CNN;

Specifically, the backbone network first extracts features, and the FPN layer fuses them to obtain the shared feature layer. The shared feature layer is fed into the RPN (region proposal network) layer to obtain candidate boxes. The ROI Pooling layer then screens all candidate boxes against the shared feature layer, and the screened candidates pass through the fully connected layer and softmax to produce the position, confidence and foreign matter category of each detection box. The fully connected layer works as follows: the IoU (intersection over union) between each candidate box and the GT (ground truth) box is compared with a preset threshold to obtain the four-dimensional information of the foreign matter detection boxes that satisfy the condition. This comparison is repeated, with the output of the previous comparison used as the input of the current one and the IoU threshold raised each time. The detection boxes that finally satisfy the condition are output together with their position (e.g., the four-dimensional information), confidence (e.g., the IoU value) and foreign matter category.
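The repeated IoU comparison with a rising threshold can be illustrated with a minimal sketch. It assumes boxes in (x, y, w, h) form with a top-left origin and uses hypothetical thresholds of 0.5, 0.6 and 0.7; the patent does not state the threshold values, and the sketch only filters candidates, leaving out any box refinement between comparisons.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h), top-left origin."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def cascade_filter(candidates, gt_box, thresholds=(0.5, 0.6, 0.7)):
    """Feed the survivors of each comparison into the next, stricter comparison.

    The threshold values are illustrative assumptions, not taken from the patent.
    """
    kept = list(candidates)
    for threshold in thresholds:
        kept = [box for box in kept if iou(box, gt_box) >= threshold]
    return kept


gt = (10, 10, 20, 20)
candidates = [(10, 10, 20, 20), (12, 12, 20, 20), (20, 20, 20, 20), (50, 50, 10, 10)]
survivors = cascade_filter(candidates, gt)
print(survivors)
```

With these inputs the second candidate has an IoU of about 0.68 with the ground truth box, so it passes the first two comparisons and is rejected by the third.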

S104: training the detection model EFFR-CNN on an X-ray foreign matter image data set;

Specifically, X-ray images of the objects to be inspected are collected as the data set of the detection model, and the objects collected over multiple batches are labeled. The labeling information comprises the bounding boxes (x, y, w, h) and the class label of each bounding box. The data set is then split into a training set and a test set at a ratio of 4:1.
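As a small illustration of the annotation format and the 4:1 split, the following sketch shuffles a list of labeled samples and cuts it at four fifths. The field names, file names, class name and fixed random seed are hypothetical choices made for the example.

```python
import random


def split_dataset(samples, train_parts=4, test_parts=1, seed=0):
    """Shuffle the samples and split them train:test = train_parts:test_parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = len(items) * train_parts // (train_parts + test_parts)
    return items[:cut], items[cut:]


# Hypothetical annotations: one bounding box (x, y, w, h) and one class label per image.
annotations = [
    {"image": f"xray_{i:03d}.png", "bbox": (4 * i, 2 * i, 8, 8), "label": "broken_needle"}
    for i in range(10)
]
train_set, test_set = split_dataset(annotations)
print(len(train_set), len(test_set))
```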

S105: inputting the X-ray image to be detected into the trained detection model EFFR-CNN to obtain a detection result.

In addition, in practical application the data set is preprocessed before it is input into the detection model for training or testing. The purpose is to augment the data set, which improves detection accuracy and prevents overfitting. The preprocessing comprises at least one of image mixing (Mix-up), label smoothing (Label Smoothing), random geometric transformation, random angle rotation, and random color transformation. The random geometric transformations (each with a configured range and probability) comprise random cropping, random expansion, random horizontal flipping, and random resizing (with random interpolation). The random color transformations comprise random changes of contrast, brightness, saturation, and hue.

In the detection model EFFR-CNN of this embodiment, the feature-extraction backbone (i.e., the Encoder of the RepConvMAE self-supervised model) is pre-trained in a self-supervised manner. Compared with supervised pre-training, this means only a small amount of labeled X-ray image data is needed for the downstream task, which saves much of the labor otherwise spent on labeling.
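Two of the listed preprocessing steps have simple closed forms, sketched below on plain Python lists. Mix-up blends two samples with a weight λ, and label smoothing moves a fraction ε of the probability mass from the true class to a uniform distribution. The Beta(1.5, 1.5) draw for λ and ε = 0.1 are assumed values, and the class-vector form shown here is the classification variant; how the patent applies these steps to detection boxes is not specified.

```python
import random


def mix_up(image_a, image_b, label_a, label_b, lam):
    """Blend two images and their label vectors with weight lam in [0, 1]."""
    mixed_image = [
        [lam * a + (1 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    mixed_label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_image, mixed_label


def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: y' = (1 - epsilon) * y + epsilon / num_classes."""
    num_classes = len(one_hot)
    return [(1 - epsilon) * y + epsilon / num_classes for y in one_hot]


rng = random.Random(0)
lam = rng.betavariate(1.5, 1.5)  # assumed Beta parameters
mixed_image, mixed_label = mix_up(
    [[0.0, 1.0], [0.5, 0.5]], [[1.0, 1.0], [0.0, 0.5]], [1.0, 0.0], [0.0, 1.0], lam
)
smoothed = smooth_labels([1.0, 0.0, 0.0, 0.0])
print(mixed_label, smoothed)
```

Both operations keep the label vector normalized, so the result can still be used as a target distribution for a softmax classifier head.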

Example 2

The inventors observe that a convolutional neural network reduces local redundancy and computation through small-range convolutions, but is limited in capturing global dependencies. ViT (Vision Transformer), by contrast, captures long-range dependencies effectively through Self-Attention, but has difficulty encoding local features effectively in shallow layers. This embodiment therefore combines ViT with a convolutional neural network to improve the detection performance of the model on metal foreign matter such as broken needles.

As shown in FIG. 2, as one implementation, the Encoder in the RepConvMAE self-supervised model adopts a multi-level structure comprising 4 stages (stage1, stage2, stage3, stage4) from shallow to deep. The first three stages have the same structure, each comprising a Patch Embedding layer and a Masked Convolution Block based on structural reparameterization (also called the Masked Re-Param Convolution Block, or Masked RepConv Block for short); the last stage comprises a Patch Embedding layer and a Transformer Block. Each stage outputs a feature map at one scale.

As shown in FIG. 3, the Masked Convolution Block based on structural reparameterization comprises, in order from shallow to deep: a Batch Norm layer, a 1 × 1 convolution layer, a K × K convolution layer based on structural reparameterization, a Batch Norm layer, a 1 × 1 convolution layer, and a 1 × 1 convolution layer. As shown in FIG. 4, the K × K convolution layer based on structural reparameterization means the following: during training, the K × K convolution layer is replaced with three branches, namely a 1 × K convolution layer followed by a Batch Norm layer, a K × 1 convolution layer followed by a Batch Norm layer, and a K × K convolution layer followed by a Batch Norm layer; at inference, the three branches are replaced with a single K × K convolution layer. The structural reparameterization can be expressed by formula (1):

$$\mathrm{Reparam}(K \times K) = \mathrm{BN}(\mathrm{Conv}(K, K)) + \mathrm{BN}(\mathrm{Conv}(1, K)) + \mathrm{BN}(\mathrm{Conv}(K, 1)) \tag{1}$$

where $\mathrm{Reparam}(K \times K)$ denotes the reparameterized convolution, $\mathrm{Conv}(K, K)$, $\mathrm{Conv}(1, K)$ and $\mathrm{Conv}(K, 1)$ denote the K × K, 1 × K and K × 1 convolutions, and $\mathrm{BN}()$ denotes the Batch Norm layer. Preferably, in this embodiment, K = 4.
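The equivalence behind formula (1) can be checked numerically. The sketch below is a single-channel, pure-Python illustration under several assumptions: Batch Norm is in inference mode with fixed statistics, convolutions use zero padding that preserves the input size, and K = 3 is chosen so that the 1 × K and K × 1 kernels have an unambiguous center row and column inside the K × K kernel (how the patent aligns the branches for its preferred K = 4 is not stated). All function names are hypothetical.

```python
import math
import random


def conv2d_same(x, k):
    """Single-channel 2-D cross-correlation with zero padding (output size = input size)."""
    h, w = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for a in range(kh):
                for b in range(kw):
                    y, z = i + a - ph, j + b - pw
                    if 0 <= y < h and 0 <= z < w:
                        out[i][j] += x[y][z] * k[a][b]
    return out


def bn_scale_shift(bn, eps=1e-5):
    """Inference-mode Batch Norm written as y = scale * x + shift."""
    gamma, beta, mean, var = bn
    scale = gamma / math.sqrt(var + eps)
    return scale, beta - mean * scale


def branch_forward(x, k, bn):
    """One training-time branch: convolution followed by Batch Norm."""
    scale, shift = bn_scale_shift(bn)
    return [[v * scale + shift for v in row] for row in conv2d_same(x, k)]


def reparameterize(branches, K):
    """Fold each BN into its kernel, zero-pad to K x K, and sum kernels and biases."""
    kernel = [[0.0] * K for _ in range(K)]
    bias = 0.0
    for k, bn in branches:
        scale, shift = bn_scale_shift(bn)
        top, left = (K - len(k)) // 2, (K - len(k[0])) // 2
        for a, row in enumerate(k):
            for b, v in enumerate(row):
                kernel[top + a][left + b] += v * scale
        bias += shift
    return kernel, bias


K, N = 3, 6
rng = random.Random(0)
rand_grid = lambda rows, cols: [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
rand_bn = lambda: (rng.uniform(0.5, 1.5), rng.uniform(-0.5, 0.5),
                   rng.uniform(-0.5, 0.5), rng.uniform(0.5, 1.5))

branches = [(rand_grid(K, K), rand_bn()), (rand_grid(1, K), rand_bn()), (rand_grid(K, 1), rand_bn())]
x = rand_grid(N, N)

# Training-time form: sum of the three Conv + BN branches.
outs = [branch_forward(x, k, bn) for k, bn in branches]
train_out = [[sum(o[i][j] for o in outs) for j in range(N)] for i in range(N)]

# Inference-time form: one K x K convolution plus a bias.
kernel, bias = reparameterize(branches, K)
infer_out = [[v + bias for v in row] for row in conv2d_same(x, kernel)]

max_diff = max(abs(train_out[i][j] - infer_out[i][j]) for i in range(N) for j in range(N))
print(max_diff)
```

The two forms agree up to floating-point rounding because convolution and inference-mode Batch Norm are both linear, so the three branches collapse into one kernel and one bias.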

In addition, global Self-Attention has a larger receptive field but is computationally expensive. In practical application, the 6 layers of stage4 in RepConvMAE can therefore be configured as follows when needed: layers {1, 3, 5} use Swin-Attention with a 7 × 7 window, and layers {2, 4, 6} use global Self-Attention, which reduces the amount of computation and the GPU memory footprint.
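The alternating layer layout and its effect on cost can be sketched as follows. The cost measure simply counts the query-key pairs that one attention layer scores, and the 28 × 28 token grid is a hypothetical size chosen so that a 7 × 7 window is smaller than the full grid; the patent does not give the stage4 token count for this configuration.

```python
def stage4_attention_schedule(num_layers=6, window_size=7):
    """Odd layers (1, 3, 5) use windowed attention; even layers (2, 4, 6) use global attention."""
    return [
        (layer, "window", window_size) if layer % 2 == 1 else (layer, "global", None)
        for layer in range(1, num_layers + 1)
    ]


def attention_pairs(height, width, window_size=None):
    """Query-key pairs scored by one attention layer on a height x width token grid."""
    tokens = height * width
    keys_per_query = tokens if window_size is None else window_size * window_size
    return tokens * keys_per_query


schedule = stage4_attention_schedule()
grid = 28  # hypothetical token grid side length
cost = {kind: attention_pairs(grid, grid, win) for _, kind, win in schedule}
print(schedule)
print(cost)
```

On this grid a windowed layer scores 16 times fewer pairs than a global layer, which is the saving the alternating layout is meant to capture.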

In this embodiment, the Encoder adopts a multi-level structure, which is more complex than a single-level structure and may take longer during actual foreign matter detection. To overcome this and shorten the time needed to inspect an X-ray image, this embodiment uses structural reparameterization. Structural reparameterization constructs one set of parameters during training and converts it into an equivalent, smaller set of parameters for inference. More parameters generally give better model performance, while fewer parameters give faster inference; structural reparameterization combines both advantages. In this embodiment, structural reparameterization strengthens the ability of the model to extract X-ray image features during training without adding computation at inference, which saves detection time.

With structural reparameterization, the model can attend to both fine texture details and shape features, so the extracted features are more expressive.

It should be noted that, because the Encoder in this embodiment adopts a multi-level structure, the conventional masking strategy of the MAE model cannot be used directly: if the features extracted by stage1 were randomly masked directly, every token of stage3 would still contain partially visible information. RepConvMAE therefore applies a random mask to the input of the last stage (stage4) and then upsamples that mask by different factors to obtain the masks of the first three stages. For example, the stage4 mask is upsampled by 8×, 4× and 2× to obtain the masks of stage1, stage2 and stage3, respectively. The mask of each patch is 0 or 1, where 0 means masked and 1 means unmasked.
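The masking strategy can be sketched as a random 0/1 grid at stage4 resolution that is repeated by nearest-neighbor upsampling for the shallower stages. The finest stage needs the largest factor, so the sketch assigns 8× to stage1, 4× to stage2 and 2× to stage3. The 7 × 7 stage4 grid and the 75% mask ratio are assumptions made for the example.

```python
import random


def random_mask(height, width, mask_ratio, rng):
    """Random 0/1 grid; 0 marks a masked patch and 1 an unmasked patch."""
    total = height * width
    n_masked = int(total * mask_ratio)
    flat = [0] * n_masked + [1] * (total - n_masked)
    rng.shuffle(flat)
    return [flat[r * width:(r + 1) * width] for r in range(height)]


def upsample_mask(mask, factor):
    """Nearest-neighbor upsampling: each entry becomes a factor x factor block."""
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


rng = random.Random(0)
mask4 = random_mask(7, 7, 0.75, rng)  # assumed stage4 grid size and mask ratio
masks = {
    "stage1": upsample_mask(mask4, 8),
    "stage2": upsample_mask(mask4, 4),
    "stage3": upsample_mask(mask4, 2),
    "stage4": mask4,
}
for name, mask in masks.items():
    print(name, len(mask), "x", len(mask[0]))
```

Because every shallow-stage mask is a block-wise copy of the stage4 mask, a token that is masked at stage4 has no visible pixels anywhere in its receptive region at the earlier stages.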

Example 3

Based on the foregoing embodiments, as one implementation and as shown in FIG. 2, the Decoder in the RepConvMAE self-supervised model comprises a Linear layer and a Transformer Block; the output of the Linear layer is supplemented with the masked patches and then fed into the Transformer Block.

For example, the feature maps produced by the four stages of the Encoder have scales {1/4, 1/8, 1/16, 1/32}. The feature maps of the first three stages are passed through strided convolutions, added to the feature map of the last stage, and then fed through the Linear layer into the Transformer Block.

As one implementation, the processing of the Linear layer is expressed by formula (2):

$$E_d = \mathrm{Linear}\big(\mathrm{StrideConv}(E_1, 8) + \mathrm{StrideConv}(E_2, 4) + \mathrm{StrideConv}(E_3, 2) + E_4\big) \tag{2}$$

where $E_d$ is the output of the Linear layer, $\mathrm{StrideConv}(\cdot, X)$ denotes a convolution with stride $X$, and $E_1$, $E_2$, $E_3$, $E_4$ denote feature maps of four different sizes, ordered from largest to smallest. For example, building on the example above, $E_1$, $E_2$, $E_3$, $E_4$ are the outputs of stage1, stage2, stage3 and stage4, respectively.
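A quick size check shows why the strides 8, 4 and 2 in formula (2) let the four feature maps be added. The sketch assumes a 224 × 224 input and strided convolutions whose kernel size equals their stride with no padding; neither value is given in the patent.

```python
def strided_conv_out(size, stride):
    """Output side length of a convolution with kernel size == stride and no padding."""
    return (size - stride) // stride + 1


input_size = 224  # assumed input resolution
downscale = {"E1": 4, "E2": 8, "E3": 16, "E4": 32}   # encoder scales 1/4 .. 1/32
stride = {"E1": 8, "E2": 4, "E3": 2, "E4": 1}        # strides from formula (2); E4 is used as-is

fused_size = {
    name: strided_conv_out(input_size // downscale[name], stride[name])
    for name in downscale
}
print(fused_size)
```

Each stride is the ratio between a stage's resolution and the stage4 resolution, so all four terms land on the same grid before the Linear layer.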

Example 4

The FPN layer in the detection model EFFR-CNN can use a conventional FPN for feature fusion. In foreign matter inspection in production workshops for textiles, shoes, clothing, bags and the like, some of the foreign matter to be detected is large, but metal foreign matter such as broken needles is usually small, typically ranging from 0.4 mm to 1.0 cm. The conventional FPN performs poorly on some small objects, so to further improve detection performance this embodiment also provides an improved FPN layer.

For detecting small objects, each layer of the multi-scale feature map must not only focus on targets at its own scale but also help the other layers obtain more training samples. To control the priority between these two requirements, a fusion factor is added to the improved FPN layer. Specifically, the improved FPN aggregates adjacent feature layers using formula (3). In formula (3), a 1 × 1 convolution layer performs channel matching, $F_{up}$ denotes a two-fold upsampling operation that matches the feature map sizes, $\mathrm{Conv}()$ denotes a convolution operation for feature processing, $\alpha$ is the fusion factor, and $P_i$ denotes the i-th feature layer;

Further, the fusion factor describes the degree of coupling between adjacent FPN layers, and it has an optimal value range. To obtain an ideal $\alpha$ so that the detection model performs best on small targets, this embodiment computes $\alpha$ from statistics of the data set. Specifically, the value of the fusion factor $\alpha$ is determined using formula (4), whose two quantities respectively represent the number of objects on the $P_{i+1}$ layer and on the $P_i$ layer.
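Formulas (3) and (4) are not reproduced in this text, so the following is only a sketch of how a fusion factor could enter a top-down FPN step. It assumes that α is the ratio of the object count on the deeper layer to the count on the current layer, and that the fused map is the lateral feature plus α times the 2× upsampled deeper feature. The 1 × 1 channel-matching convolution and the post-fusion convolution are replaced by identity, and the object counts are invented for the example.

```python
def fusion_factor(n_deeper, n_current):
    """Assumed form of the fusion factor: objects on P_{i+1} divided by objects on P_i."""
    return n_deeper / n_current


def upsample2x(feature):
    """Nearest-neighbor two-fold upsampling of a 2-D feature map."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def aggregate(lateral, deeper, alpha):
    """Top-down fusion with the deeper layer's contribution scaled by alpha.

    The lateral 1x1 convolution and the post-fusion convolution are omitted (identity).
    """
    up = upsample2x(deeper)
    return [[l + alpha * u for l, u in zip(l_row, u_row)] for l_row, u_row in zip(lateral, up)]


# Hypothetical per-layer object counts gathered from a data set.
objects_per_layer = {"P3": 800, "P4": 400}
alpha = fusion_factor(objects_per_layer["P4"], objects_per_layer["P3"])

p4 = [[1.0, 2.0], [3.0, 4.0]]
c3 = [[1.0] * 4 for _ in range(4)]
p3 = aggregate(c3, p4, alpha)
print(alpha, p3)
```

A value of α below 1 damps how much the deeper layer influences the shallower one, which is the coupling that the fusion factor is described as controlling.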

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An X-ray image foreign matter detection method based on self-supervised learning, characterized by comprising the following steps:

Step 1: constructing a ConvMAE model based on structural reparameterization, denoted the RepConvMAE self-supervised model, which comprises an Encoder for extracting multi-scale feature maps of an input image and a Decoder for reconstructing the image from the extracted multi-scale feature maps;

Step 2: pre-training the RepConvMAE self-supervised model on an X-ray image data set;

Step 3: constructing a detection model EFFR-CNN, which comprises, in order from shallow to deep: a backbone network, an FPN layer, a shared feature layer, an RPN layer, an ROI Pooling layer and a fully connected layer; the backbone network is the Encoder of the pre-trained RepConvMAE self-supervised model;

Step 4: training the detection model EFFR-CNN on an X-ray foreign matter image data set;

Step 5: inputting the X-ray image to be detected into the trained detection model EFFR-CNN to obtain a detection result.

2. The X-ray image foreign matter detection method based on self-supervised learning according to claim 1, wherein the Encoder in the RepConvMAE self-supervised model comprises 4 stages from shallow to deep; the first three stages have the same structure, each comprising a Patch Embedding layer and a Masked Convolution Block based on structural reparameterization; the last stage comprises a Patch Embedding layer and a Transformer Block;

the Masked Convolution Block based on structural reparameterization comprises, in order from shallow to deep: a Batch Norm layer, a 1 × 1 convolution layer, a K × K convolution layer based on structural reparameterization, a Batch Norm layer, a 1 × 1 convolution layer, and a 1 × 1 convolution layer; wherein the K × K convolution layer based on structural reparameterization means: during training, the K × K convolution layer is replaced with three branches, namely a 1 × K convolution layer followed by a Batch Norm layer, a K × 1 convolution layer followed by a Batch Norm layer, and a K × K convolution layer followed by a Batch Norm layer; at inference, the three branches are replaced with a single K × K convolution layer.

3. The X-ray image foreign matter detection method based on self-supervised learning according to claim 2, wherein the masking strategy of the RepConvMAE self-supervised model is: a random mask is applied directly to the input of the last stage, and the resulting last-stage mask is then upsampled by different factors to obtain the masks of the first three stages; wherein the mask of each patch is 0 or 1, 0 meaning masked and 1 meaning unmasked.

4. The X-ray image foreign matter detection method based on self-supervised learning according to claim 1, wherein the Decoder in the RepConvMAE self-supervised model comprises a Linear layer and a Transformer Block; the output of the Linear layer is supplemented with the masked patches and then fed into the Transformer Block.

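5. The X-ray image foreign matter detection method based on self-supervised learning according to claim 4, wherein the processing of the Linear layer is expressed by formula (2):

$$E_d = \mathrm{Linear}\big(\mathrm{StrideConv}(E_1, 8) + \mathrm{StrideConv}(E_2, 4) + \mathrm{StrideConv}(E_3, 2) + E_4\big) \tag{2}$$

where $E_d$ is the output of the Linear layer, $\mathrm{StrideConv}(\cdot, X)$ denotes a convolution with stride $X$, and $E_1$, $E_2$, $E_3$, $E_4$ denote feature maps of four different scales, ordered from largest to smallest.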

6. The X-ray image foreign matter detection method based on self-supervised learning according to claim 1, wherein the FPN layer is an improved FPN obtained by adding a fusion factor to the original FPN; specifically, the improved FPN aggregates adjacent feature layers using formula (3), in which a 1 × 1 convolution layer performs channel matching, $F_{up}$ denotes a two-fold upsampling operation that matches the feature map sizes, $\mathrm{Conv}()$ denotes a convolution operation for feature processing, $\alpha$ is the fusion factor, and $P_i$ denotes the i-th feature layer.

7. The X-ray image foreign matter detection method based on self-supervised learning according to claim 6, wherein the value of the fusion factor $\alpha$ is determined using formula (4), whose two quantities respectively represent the number of objects on the $P_{i+1}$ layer and on the $P_i$ layer.