CN114581698A - Target classification method based on space cross attention mechanism feature fusion - Google Patents

Target classification method based on space cross attention mechanism feature fusion

Info

Publication number
CN114581698A
Authority
CN
China
Prior art keywords
feature
output
dimensional
features
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210084352.4A
Other languages
Chinese (zh)
Inventor
李岳阳
顾中轩
罗海驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202210084352.4A priority Critical patent/CN114581698A/en
Publication of CN114581698A publication Critical patent/CN114581698A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

The invention discloses a target classification method based on spatial cross-attention feature fusion, belonging to the technical field of computer-aided detection. The method captures the important features of similar feature maps through a parameter-free attention mechanism, and then recalibrates one feature map with the features of another through a channel-attention excitation method, thereby achieving the effect of fusion.

Description

Target classification method based on space cross attention mechanism feature fusion
Technical Field
The invention relates to a target classification method based on spatial cross-attention feature fusion, and belongs to the technical field of computer-aided detection.
Background
With the development of image processing technology, image-based object recognition and classification have been widely applied in many fields, including computer-aided detection. In the medical field, for example, when screening for lung nodules a clinician judges whether a nodule indicates early lung cancer by comprehensively weighing factors such as the presence of the nodule, its size and density, the patient's long-term smoking history, and any family history of lung cancer.
Low-dose helical computed tomography (LDCT) is currently the most widely used lung-nodule screening modality and can detect nodules larger than 3 mm. However, a single LDCT scan produces several hundred axial images, and during nodule screening a specialist must examine every one of them, which is time-consuming and labor-intensive. To reduce the physicians' workload, computer-aided detection (CAD) systems for pulmonary tumor nodules are now commonly used to assist radiologists.
In use, a doctor inputs the CT image to be examined into the system, which quickly locates suspected nodule regions and gives the probability that each candidate nodule is positive, i.e., it distinguishes true lung nodules in the CT image from the shadows of organs, blood vessels, and the like. The system consists of two parts: nodule candidate detection, and lung-nodule identification and classification. The ability of the identification-and-classification stage to reject false-positive candidates determines overall system performance, so improving its accuracy is the main direction of development for aided lung-nodule detection.
In recent years, with the spread of deep learning, computer-aided medical diagnosis systems based on convolutional neural networks (CNN) have become a research hotspot, and work on lung-nodule identification has produced some results. Early CNN models identify false-positive nodules from two-dimensional CT images. Because lung nodules differ in position and appearance (their morphology can be subdivided into solitary, vascular-adhesion, pleural-adhesion, ground-glass, and cavitary types), false-positive identification is difficult. Three-dimensional images carry richer semantic information than two-dimensional ones, so identifying false-positive nodules from three-dimensional images can improve model robustness.
However, existing approaches to identifying false-positive nodules from three-dimensional images usually rely on multi-scale feature fusion to improve detection accuracy: the multi-scale features extracted at each stage are fused, typically by pyramid-style upward fusion after downsampled feature extraction, and target classification is performed on the fused features. This style of fusion causes feature redundancy, which in turn lowers classification accuracy.
Disclosure of Invention
In order to further improve the precision of target classification in the computer aided detection technology, the invention provides a target classification method based on spatial cross attention mechanism feature fusion, which comprises the following steps:
step 1: acquiring a three-dimensional image to be classified, and setting the length, width and height of a target area in the three-dimensional image to be classified as L, W and H respectively;
step 2: performing feature extraction on a target area in an image to be classified by adopting a 3DSeNet backbone network to obtain four primary feature vectors; the 3DSeNet backbone network is composed of a plurality of SeBlock blocks, and the SeBlock blocks are obtained by adding an SE three-dimensional channel attention module in a ResBlock block;
step 3: respectively carrying out feature refinement on four primary feature vectors output by a 3DSeNet backbone network to obtain refined feature vectors;
step 4: using the feature of one feature map to calibrate another feature map by using the feature of the feature map to re-calibrate the four thinned feature vectors by a channel attention mechanism excitation method, and finally obtaining a feature vector for classification;
step 5: and classifying the three-dimensional image to be classified according to the finally obtained characteristic vector for classification.
Optionally, the SE three-dimensional channel attention module first performs a Squeeze operation on the input feature x ∈ R^(C×L×W×H) to obtain a channel-based global feature map z ∈ R^(C×1×1×1), performs an Excitation operation on the global feature map to obtain a feature map s_c, and then uses a Scale operation to multiply s_c with the originally input feature x, completing the feature recalibration that corrects the features.
Optionally, the SE three-dimensional channel attention module in Step 2 obtains the four preliminary feature vectors from the input feature x ∈ R^(C×L×W×H) as follows:

performing the Squeeze operation on the input feature x to obtain the channel-based global feature of the feature map:

z_c = F_sq(x_c) = (1/(L·W·H)) Σ_{i=1..L} Σ_{j=1..W} Σ_{k=1..H} x_c(i, j, k)    (2)

that is, three-dimensional global adaptive average pooling encodes the entire spatial feature on a channel into one global feature, outputting the feature map z ∈ R^(C×1×1×1), where C denotes the number of channels;

obtaining the feature map s_c through the Excitation operation:

s_c = σ(w_2 δ(w_1 z))    (3)

where z is the output of the Squeeze operation, σ is the sigmoid activation function, δ is the ReLU activation function, w_1 ∈ R^((C/R)×C) and w_2 ∈ R^(C×(C/R)) are the excitation weights, and R denotes a reduction factor;

multiplying the s_c obtained from the Excitation operation with x ∈ R^(C×L×W×H) through the Scale operation:

x̃_c = F_scale(x_c, s_c) = s_c · x_c    (4)

where x̃ = [x̃_1, x̃_2, …, x̃_C]; the output of the SE three-dimensional channel attention module has the same size as the input feature map;

the four preliminary feature vectors are denoted x_1, x_2, x_3, and x_4.
Optionally, Step 3 includes:

passing the input feature x_2 through a multi-scale feature refinement module to obtain an output x'_2, fusing the input feature x_1 with the upsampled x'_2 and feeding the result into the multi-scale feature refinement module f_s to obtain an output x'_1, and finally fusing x_1 with x'_1 to obtain an output s_1:

x'_1 = f_s(λ_1 x_1 + λ_2 Up(x'_2)),  s_1 = λ'_1 x_1 + λ'_2 x'_1    (5)

where f_s is the feature refinement module, λ_1, λ_2 are one set of linear parameters, λ'_1, λ'_2 are another set, and Up(x'_2) denotes upsampling the refined feature x'_2;

passing the input feature x_3 through the multi-scale feature refinement module to obtain an output x'_3, fusing the input feature x_2 with the upsampled x'_3 and feeding the result into f_s to obtain an output x'_2, and finally fusing x_2, x'_2, and the downsampled output feature s_1 to obtain an output s_2:

x'_2 = f_s(λ_1 x_2 + λ_2 Up(x'_3)),  s_2 = λ'_1 x_2 + λ'_2 x'_2 + λ'_3 Down(s_1)    (6)

where Down(s_1) denotes downsampling the stage-1 fused feature s_1, and λ'_1, λ'_2, λ'_3 are a set of linear parameters;

passing the input feature x_4 through the multi-scale feature refinement module to obtain an output x'_4, fusing the input feature x_3 with the upsampled x'_4 and feeding the result into f_s to obtain an output x'_3, and finally fusing x_3, x'_3, and the downsampled output feature s_2 to obtain an output s_3:

x'_3 = f_s(λ_1 x_3 + λ_2 Up(x'_4)),  s_3 = λ'_1 x_3 + λ'_2 x'_3 + λ'_3 Down(s_2)    (7)

passing the input feature x_4 through the multi-scale feature refinement module to obtain the output x'_4, and fusing x_4, x'_4, and the downsampled output feature s_3 of stage 3 to obtain the output feature s_4:

x'_4 = f_s(x_4),  s_4 = λ'_1 x_4 + λ'_2 x'_4 + λ'_3 Down(s_3)    (8)
Optionally, Step 4 includes:

fusing s_4 and s_3 through the cross-attention mechanism, calibrating the features of s_4 onto s_3 and outputting the fusion feature F_3;

fusing F_3 and s_2 through the cross-attention fusion mechanism, calibrating the features of F_3 onto s_2 and outputting the fusion feature F_2;

fusing F_2 and s_1 through the cross-attention mechanism, calibrating the features of F_2 onto s_1 and outputting the fusion feature F_1.

Optionally, fusing s_4 and s_3 through the cross-attention mechanism, calibrating the features of s_4 onto s_3 and outputting the fusion feature F_3, comprises:

performing similarity recalibration in three-dimensional space on the two channel-identical features s_4 and s_3:

s'_3 = SimAM(s_3),  s'_4 = SimAM(s_4)    (14)

compressing s'_4 into a channel-based one-dimensional vector f_4 through the channel-attention Squeeze-Excitation operation, and multiplying it with s'_3 to obtain the output F_3, which recalibrates the features of s_3:

f_4 = σ(w_2 δ(w_1 F_sq(s'_4))),  F_3 = s'_3 · f_4    (15)

where F_sq is the Squeeze operation of equation (2), SimAM(·) weights each pixel by its linear separability within its channel, and σ denotes the sigmoid activation function.
Step 5 includes:

pooling the feature F_1 over three-dimensional space into a one-dimensional classification vector of length 4379, from which the classifier outputs the confidence sequence of the two classes.
The application also provides a lung-nodule positive-probability prediction method that uses the above method to obtain a lung-nodule positive-probability prediction from suspected-lung-nodule data output by a CAD system.
The invention has the beneficial effects that:
the method not only effectively fuses the characteristics of the two characteristic graphs, but also is different from the method for fusing similar characteristic graphs by means of upsampling of a characteristic pyramid.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a three-dimensional representation of a segmentation of the pulmonary parenchyma provided in one embodiment of the present invention.
Fig. 2 is a schematic model structure diagram of an object classification method based on spatial cross-attention mechanism feature fusion provided in an embodiment of the present invention.
FIG. 3 is a schematic diagram of an SE three-dimensional channel attention module provided in one embodiment of the present invention.
Fig. 4 is a schematic diagram of a SeBlock provided in an embodiment of the present invention.
FIG. 5 is a schematic diagram of a feature refinement module provided in one embodiment of the present invention.
FIG. 6 is a schematic diagram of a feature fusion method provided in an embodiment of the present invention.
FIG. 7 is a schematic diagram of a cross attention feature fusion Module (CSFA) provided in an embodiment of the present invention.
FIG. 8 is a graph illustrating an example effect of a positive lung nodule prediction provided in one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The first embodiment is as follows:
the embodiment provides a target classification method based on spatial cross attention mechanism feature fusion, which comprises the following steps:
step 1: acquiring a three-dimensional image to be classified, and setting the length, width and height of a target area in the three-dimensional image to be classified as L, W and H respectively;
step 2: performing feature extraction on a target area in an image to be classified by adopting a 3DSeNet backbone network to obtain four primary feature vectors; the 3DSeNet backbone network is composed of a plurality of SeBlock blocks, and the SeBlock blocks are obtained by adding an SE three-dimensional channel attention module in a ResBlock block;
step 3: respectively carrying out feature refinement on four primary feature vectors output by a 3DSeNet backbone network to obtain refined feature vectors;
step 4: re-calibrating the other characteristic diagram by using the characteristics of one characteristic diagram by using the four thinned characteristic vectors in a channel attention mechanism excitation method, and finally obtaining characteristic vectors for classification;
step 5: and classifying the three-dimensional image to be classified according to the finally obtained characteristic vector for classification.
Example two:
This embodiment provides a target classification method based on spatial cross-attention feature fusion. A CAD (computer-aided detection) system determines suspected nodule regions from a CT image, and the method then processes each region to give the probability that it is a positive lung nodule (i.e., each suspected nodule region is classified as a true or a false lung nodule). The method comprises:
step one, obtaining suspected pulmonary nodule data:
in the actual detection process, the CAD system may directly give the data of the suspected lung nodule, and in this embodiment, the LUNA16 data set is taken as an example for description, so that the data in the data set is preprocessed to obtain the data of the suspected lung nodule, specifically, the method includes:
step 1-1: pulmonary parenchymal segmentation
In the original CT image, X-ray attenuation values are recorded in Hounsfield units (HU). The CT value of a substance reflects its density: the higher the CT value, the denser the substance. The HU value of the lung is around -500, so for lung parenchyma segmentation the threshold interval can be set to [-1000, 400], i.e., HU values greater than 400 are set to 400 and values less than -1000 are set to -1000; the HU values are then normalized to the range [0, 255]. Fig. 1 shows a three-dimensional image of the segmented lung parenchyma.
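For illustration, the thresholding and normalization just described can be sketched as follows (a minimal sketch; the function name and the use of NumPy are ours, not the patent's):

```python
import numpy as np

def normalize_hu(volume: np.ndarray, lo: float = -1000.0, hi: float = 400.0) -> np.ndarray:
    """Clip HU values to [lo, hi] and rescale them to the [0, 255] range."""
    v = np.clip(volume.astype(np.float32), lo, hi)  # HU > 400 -> 400, HU < -1000 -> -1000
    return (v - lo) / (hi - lo) * 255.0             # min-max normalization to [0, 255]
```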
Step 1-2: extracting data of suspected pulmonary nodules
For each candidate nodule in the LUNA16 data set, i.e., each suspected lung-nodule region, its coordinate file is read to obtain the candidate's three-dimensional world coordinate v_world, and the voxel coordinate v_voxel is calculated by the following formula:

v_voxel = (v_world - v_origin) / d_spacing    (1)

where v_origin is the origin coordinate of the lung and d_spacing is the pixel spacing.
A cube with equal length, width, and height is then cropped around the voxel coordinate of each candidate nodule as the center point.

According to the size distribution of suspected lung-nodule blocks in the LUNA16 data set, cubes with side lengths of 24, 28, 32, 36, and 40 mm are cropped and saved as .npy files for convenient use in subsequent training and testing. If the cropping range exceeds the image area, it is padded with zeros.
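Equation (1) and the zero-padded cube cropping can be sketched as below. The sketch assumes the volume has already been resampled to 1 mm isotropic spacing (so millimetre side lengths equal voxel counts) and uses (z, y, x) axis order; the function names are hypothetical:

```python
import numpy as np

def world_to_voxel(v_world, v_origin, d_spacing) -> np.ndarray:
    """Equation (1): v_voxel = (v_world - v_origin) / d_spacing."""
    return np.rint((np.asarray(v_world) - np.asarray(v_origin))
                   / np.asarray(d_spacing)).astype(int)

def crop_cube(volume: np.ndarray, center: np.ndarray, side: int) -> np.ndarray:
    """Crop a side^3 cube around `center`; regions outside the image stay zero."""
    out = np.zeros((side,) * 3, dtype=volume.dtype)
    lo = center - side // 2
    # Source slices clipped to the image, destination slices shifted accordingly.
    src = tuple(slice(max(l, 0), min(l + side, s)) for l, s in zip(lo, volume.shape))
    dst = tuple(slice(max(-l, 0), max(-l, 0) + (sl.stop - sl.start))
                for l, sl in zip(lo, src))
    out[dst] = volume[src]
    return out
```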
Step 1-3: data enhancement
Analyzing the selected LUNA16 data set and counting all positive and negative samples: there are 1557 positive lung nodules (positive samples, label 1), 0.21% of the total, and 753418 false-positive nodules (negative samples, label 0), 99.79% of the total. Since negatives vastly outnumber positives, a trained binary classifier tends to predict the negative class, which prevents effective evaluation of its performance. To alleviate the sample-number imbalance, data enhancement is used to expand the number of positive samples. The selected data-enhancement methods are as follows:
(1) rotating: the cube image is rotated 90 °, 180 °, 270 ° along the cross-section.
(2) Mirroring: the cube image is symmetrically inverted along the coronal and sagittal planes, respectively.
By these methods, the positive samples are augmented so that the final number of positive samples is 20 times the original number.
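The two enhancement operations can be sketched as follows, assuming (z, y, x) axis order so that axes (1, 2) span the cross-section; the function is illustrative only:

```python
import numpy as np

def augment_positive(cube: np.ndarray) -> list:
    """Return the original cube plus its axial rotations and mirror flips."""
    out = [cube]
    out += [np.rot90(cube, k, axes=(1, 2)) for k in (1, 2, 3)]  # 90/180/270 degrees in the cross-section
    out += [cube[:, ::-1, :], cube[:, :, ::-1]]                 # coronal and sagittal mirroring
    return out
```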
Note that the LUNA16 data set is a public data set comprising 888 low-dose pulmonary CT images (mhd format). Each original image is three-dimensional and consists of a series of axial two-dimensional slices of the thorax; the number of slices varies with the scanning machine, the scan layer thickness, and the patient.
Step two: modeling
Analysis of the preprocessed data set shows that constructing a suitable model requires solving the following two problems:
(1) Depending on their location and manifestation, lung nodules can be subdivided into solitary, vascular-adhesion, pleural-adhesion, ground-glass, and cavitary types. Their small size and irregular shape increase the difficulty of identifying negative samples.
(2) The numbers of positive and negative samples remain unbalanced after data enhancement, so the classifier must be adjusted during model construction and training so that the model perceives the two classes approximately equally.
For the above two problems, as shown in fig. 2, the model design is divided into three stages to obtain better classification, and the three stages are respectively: a backbone network feature extraction stage, a multi-scale feature fusion stage and a feature classification stage.
In the backbone feature extraction stage, a 3DSeNet model is adopted; its three-dimensional channel attention mechanism lets it extract features better. 3DSeNet is the basic network model; for details refer to Yang J, Jiang X, Ma X. 3DSeNet: 3D Spatial Attention Region Ensemble Network for Real-time 3D Hand Position Estimation [C] // 2020 10th International Conference on Information Science and Technology (ICIST), 2020.
In the multi-scale feature fusion stage, feature refinement is carried out by acquiring feature maps of different stages of a backbone network, and similar feature maps are fused through a spatial cross attention fusion mechanism, so that the effect of multi-scale feature fusion is achieved.
In the feature classification stage, learnable parameters are introduced into the classifier and adapt to the proportion of positive and negative samples during training; this linear change improves the perception of the minority class and yields better classification performance.
Step 2-1: backbone network feature extraction stage
A backbone network is built for feature extraction; it adopts 3DSeNet and consists of several SeBlock blocks. Fig. 4 shows a schematic diagram of a SeBlock: an SE three-dimensional channel attention module is added to a ResBlock (the main module of the backbone 3DResNet model), which strengthens the feature extraction capability.
The principle of the SE three-dimensional channel attention module is shown in FIG. 3. The method comprises the following specific steps:
(1) Squeeze operation

For the input feature x ∈ R^(C×L×W×H) (where C denotes the number of channels and L, W, and H denote the length, width, and height of the cube), a Squeeze operation is first performed to obtain the channel-based global feature of the feature map:

z_c = F_sq(x_c) = (1/(L·W·H)) Σ_{i=1..L} Σ_{j=1..W} Σ_{k=1..H} x_c(i, j, k)    (2)

That is, three-dimensional global adaptive average pooling encodes the entire spatial feature on a channel into one global feature, outputting the feature map z ∈ R^(C×1×1×1).
(2) Excitation operation

The Excitation operation obtains the feature map s_c through the following formula:

s_c = σ(w_2 δ(w_1 z))    (3)

where z is the output of the Squeeze operation, σ is the sigmoid activation function, δ is the ReLU activation function, w_1 ∈ R^((C/R)×C) and w_2 ∈ R^(C×(C/R)) are the excitation weights, and R denotes a reduction factor, here taken as 16.
(3) Scale operation

Using the learned channel weights, the Scale operation multiplies s_c with the original feature map, completing the feature recalibration: the corrected features retain the valuable features and reject the worthless ones. The formula of the Scale operation is:

x̃_c = F_scale(x_c, s_c) = s_c · x_c    (4)

where x̃ = [x̃_1, x̃_2, …, x̃_C]. The output of the SE three-dimensional channel attention module has the same size as the input feature map.
The SE three-dimensional channel attention module can be inserted as a plug-and-play module into the ResBlock feature extraction module of 3DResNet to construct the SeBlock module, achieving three-dimensional feature recalibration.
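A minimal PyTorch sketch of the SE three-dimensional channel attention module of equations (2)-(4) follows; the layer names are ours, δ is taken as ReLU and σ as sigmoid as described above, and R = 16:

```python
import torch
import torch.nn as nn

class SE3D(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool3d(1)           # F_sq: (C, L, W, H) -> (C, 1, 1, 1), eq. (2)
        self.excite = nn.Sequential(                     # s_c = sigma(w_2 delta(w_1 z)), eq. (3)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, C, L, W, H)
        n, c = x.shape[:2]
        z = self.squeeze(x).view(n, c)                    # channel-based global feature
        s = self.excite(z).view(n, c, 1, 1, 1)            # channel weights
        return x * s                                      # Scale recalibration, eq. (4)
```

Inserted after the convolutions of a ResBlock, this module yields the SeBlock of fig. 4.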
Step 2-2: multi-scale feature fusion phase
The four outputs of the backbone in the feature extraction stage (stage1, stage2, stage3, and stage4) pass through the multi-scale feature fusion module shown in fig. 2 to produce the feature vector used for classification. Lung nodules vary in size, are irregular in shape, and are randomly distributed, which hinders effective feature extraction by the backbone network alone. To improve the feature extraction effect of the model, the output of each backbone stage is first refined at multiple scales and then fused with the features of the other stages. The detailed steps are as follows:
(1) Feature refinement module

The invention adopts a multi-scale feature refinement module, shown in fig. 5, which can fuse the next level's upsampled information into its input (see the dashed box) and refines the features with four branches: 1×1×1, 3×3×3, 5×5×5, and 7×7×7 three-dimensional convolutions, each followed by a 3×3×3 three-dimensional dilated convolution with dilation rates 1, 3, 5, and 7, respectively.
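One way to realize the four-branch refinement module is sketched below; the text does not specify how the branch outputs are merged, so element-wise summation is assumed:

```python
import torch.nn as nn

class FeatureRefine(nn.Module):
    """Multi-scale feature refinement f_s: a k x k x k convolution (k = 1, 3, 5, 7)
    followed by a 3 x 3 x 3 dilated convolution with dilation rate 1, 3, 5, or 7;
    padding keeps the spatial size unchanged."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv3d(channels, channels, k, padding=k // 2),
                nn.Conv3d(channels, channels, 3, padding=d, dilation=d),
            )
            for k, d in zip((1, 3, 5, 7), (1, 3, 5, 7))
        )

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)  # merged by summation (assumption)
```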
(2) Feature fusion framework
The feature fusion method from stage1 to stage4 can be described as follows:

stage 1: the input feature x_2 of stage2 passes through a multi-scale feature refinement module to obtain an output x'_2; the input feature x_1 of stage1 is fused with the upsampled x'_2 and fed into the multi-scale feature refinement module f_s to obtain an output x'_1; finally x_1 and x'_1 are fused to obtain an output s_1. The feature fusion formula of stage1 is:

x'_1 = f_s(λ_1 x_1 + λ_2 Up(x'_2)),  s_1 = λ'_1 x_1 + λ'_2 x'_1    (5)

where f_s is the feature refinement module, λ_1, λ_2 are one set of linear parameters, λ'_1, λ'_2 are another set, and Up(x'_2) denotes upsampling the refined feature x'_2;

stage 2: the input feature x_3 of stage3 passes through the multi-scale feature refinement module to obtain an output x'_3; the input feature x_2 of stage2 is fused with the upsampled x'_3 and fed into f_s to obtain an output x'_2; finally x_2, x'_2, and the output feature s_1 of stage1 are fused to obtain an output s_2. The feature fusion formula of stage2 is:

x'_2 = f_s(λ_1 x_2 + λ_2 Up(x'_3)),  s_2 = λ'_1 x_2 + λ'_2 x'_2 + λ'_3 Down(s_1)    (6)

where f_s is the feature refinement module, Down(s_1) denotes downsampling the stage1 fused feature s_1, and λ'_3 is a linear parameter;

stage 3: the input feature x_4 of stage4 passes through the multi-scale feature refinement module to obtain an output x'_4; the input feature x_3 of stage3 is fused with the upsampled x'_4 and fed into f_s to obtain an output x'_3; finally x_3, x'_3, and the downsampled output feature s_2 of stage2 are fused to obtain an output s_3. The feature fusion formula of stage3 is:

x'_3 = f_s(λ_1 x_3 + λ_2 Up(x'_4)),  s_3 = λ'_1 x_3 + λ'_2 x'_3 + λ'_3 Down(s_2)    (7)

stage 4: the input feature x_4 passes through the multi-scale feature refinement module to obtain the output x'_4; x_4, x'_4, and the downsampled output feature s_3 of stage3 are fused to obtain the output feature s_4. The feature fusion formula of stage4 is:

x'_4 = f_s(x_4),  s_4 = λ'_1 x_4 + λ'_2 x'_4 + λ'_3 Down(s_3)    (8)
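Equations (5)-(8) can be sketched as follows, reusing the FeatureRefine module above. The sketch assumes the stage features have been projected to a common channel width, treats the λ parameters as learnable scalars, uses trilinear interpolation for Up and adaptive average pooling for Down, and resolves the text's reuse of the x'_k symbols by computing each x'_k once in cascaded form; none of these choices is fixed by the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fs = nn.ModuleList(FeatureRefine(channels) for _ in range(4))
        self.lam = nn.Parameter(torch.ones(4, 2))   # lambda_1, lambda_2 per stage
        self.lamp = nn.Parameter(torch.ones(4, 3))  # lambda'_1 .. lambda'_3 per stage

    @staticmethod
    def up(x, ref):    # Up(.): upsample to the reference spatial size
        return F.interpolate(x, size=ref.shape[2:], mode="trilinear", align_corners=False)

    @staticmethod
    def down(x, ref):  # Down(.): downsample to the reference spatial size
        return F.adaptive_avg_pool3d(x, ref.shape[2:])

    def forward(self, x1, x2, x3, x4):
        x4p = self.fs[3](x4)                                                       # x'_4, eq. (8)
        x3p = self.fs[2](self.lam[2, 0] * x3 + self.lam[2, 1] * self.up(x4p, x3))  # x'_3, eq. (7)
        x2p = self.fs[1](self.lam[1, 0] * x2 + self.lam[1, 1] * self.up(x3p, x2))  # x'_2, eq. (6)
        x1p = self.fs[0](self.lam[0, 0] * x1 + self.lam[0, 1] * self.up(x2p, x1))  # x'_1, eq. (5)
        s1 = self.lamp[0, 0] * x1 + self.lamp[0, 1] * x1p
        s2 = self.lamp[1, 0] * x2 + self.lamp[1, 1] * x2p + self.lamp[1, 2] * self.down(s1, x2)
        s3 = self.lamp[2, 0] * x3 + self.lamp[2, 1] * x3p + self.lamp[2, 2] * self.down(s2, x3)
        s4 = self.lamp[3, 0] * x4 + self.lamp[3, 1] * x4p + self.lamp[3, 2] * self.down(s3, x4)
        return s1, s2, s3, s4
```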
(3) spatial cross-attention fusion mechanism
Analysis of the fusion framework's output vectors and of the fusion process shows that the output vectors of each layer have the same number of channels and have already fused similar features. Further fusing them by feature-pyramid upsampling and concatenation would cause feature redundancy and degrade precision. The invention therefore proposes a new feature fusion method: fusing similar feature maps with a spatial cross-attention mechanism.
(a) Attention mechanism SimAM

The invention uses the SimAM attention mechanism to compute the similarity of different vectors. SimAM evaluates the importance of each pixel of a three-dimensional feature map through an energy function and derives the weights without any additional parameters; it is a parameter-free three-dimensional attention module.
Suppose the input feature map is x ∈ R^(C×L×W×H). An energy function e_t is defined for each target pixel t:

e_t(w_t, b_t, y, x_i) = (1/(M-1)) Σ_{i=1..M-1} (y_o - (w_t x_i + b_t))^2 + (y_t - (w_t t + b_t))^2    (9)

where M = L×W×H is the number of pixels in a channel, t is the target pixel, x_i denotes the other pixels of the same channel, and y_t and y_o are binary labels (y_t takes 1 and y_o takes -1). Minimizing equation (9) is equivalent to training the linear separability between pixel t and the other pixels of the same channel. To improve the generalization capability of the model, a regularization coefficient λ can be added to the function, giving:

e_t = (1/(M-1)) Σ_{i=1..M-1} (-1 - (w_t x_i + b_t))^2 + (1 - (w_t t + b_t))^2 + λ w_t^2    (10)

Minimizing the energy function e_t of equation (10) yields w_t and b_t:

w_t = -2(t - μ_t) / ((t - μ_t)^2 + 2σ_t^2 + 2λ),  b_t = -(t + μ_t) w_t / 2    (11)

where μ_t and σ_t^2 are the mean and variance of all pixels of the channel except t; w_t and b_t are computed within a single channel.

Assuming that all pixels in a single channel follow the same distribution, the mean and variance can be computed once over all pixels and reused to assess the importance of every pixel on that channel. The minimum energy of a pixel can then be expressed as:

e_t* = 4(σ̂^2 + λ) / ((t - μ̂)^2 + 2σ̂^2 + 2λ)    (12)

where μ̂ = (1/M) Σ_{i=1..M} x_i and σ̂^2 = (1/M) Σ_{i=1..M} (x_i - μ̂)^2. The importance of each pixel can therefore be expressed as 1/e_t*: the lower the energy, the more linearly separable pixel t is from the other pixels, and the higher its importance.

Given the input feature map X, the output of the SimAM attention mechanism can be expressed as:

X̃ = sigmoid(1/E) ⊙ X    (13)

where E groups all e_t* over the channel and spatial dimensions.
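Because SimAM's weights have the closed form of equations (11)-(12), the whole mechanism reduces to a few tensor operations. The sketch below mirrors the published SimAM implementation, extended to three dimensions; the regularization coefficient λ is a hyperparameter:

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    def __init__(self, lam: float = 1e-4):
        super().__init__()
        self.lam = lam  # regularization coefficient lambda

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, C, L, W, H)
        n = x[0, 0].numel() - 1                           # M - 1 pixels per channel
        d = (x - x.mean(dim=(2, 3, 4), keepdim=True)).pow(2)  # (t - mu)^2 for every pixel
        v = d.sum(dim=(2, 3, 4), keepdim=True) / n            # channel variance estimate
        e_inv = d / (4 * (v + self.lam)) + 0.5                # 1 / e_t*, eq. (12)
        return x * torch.sigmoid(e_inv)                       # eq. (13)
```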
(b) Spatial cross-attention fusion

The multi-scale features [s_1, s_2, s_3, s_4] output by the feature fusion framework produce the feature vector for classification through pairwise feature fusion. The invention replaces the upsampling method of feature-pyramid fusion with a spatial cross-attention mechanism, whose main steps are as follows.

In fig. 7, the fusion of s_3 and s_4 is taken as an example to describe the cross-attention fusion mechanism (CSFA), a method of fusing two features:

s_3 and s_4 have equal channel numbers, and each has already absorbed features of similar scales during feature fusion. The SimAM attention mechanism performs similarity recalibration in three-dimensional space on the two channel-identical features:

s'_3 = SimAM(s_3),  s'_4 = SimAM(s_4)    (14)

The channel-attention compressed-excitation method (Squeeze operation) then compresses s'_4 into a channel-based one-dimensional vector f_4, which is multiplied with s'_3 to obtain the output F_3, recalibrating the features of s_3:

f_4 = σ(w_2 δ(w_1 F_sq(s'_4))),  F_3 = s'_3 · f_4    (15)

Because s_3 already carries features similar to those of s_4, recalibrating s_3 with s_4 is equivalent in effect to fusion, while reducing the parameter count and feature redundancy compared with fusing similar features by upsampling.
The fusion process of the multi-scale features [s_1, s_2, s_3, s_4] output by the feature fusion framework is shown in fig. 6 (a code sketch follows this list):

1. s_4 and s_3 pass through CSFA; the features of s_4 are calibrated onto s_3, outputting the fusion feature F_3.

2. F_3 and s_2 pass through CSFA; the features of F_3 are calibrated onto s_2, outputting the fusion feature F_2.

3. F_2 and s_1 pass through CSFA; the features of F_2 are calibrated onto s_1, outputting the fusion feature F_1.
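A sketch of one CSFA step (equations (14)-(15)) is given below. It reuses the SimAM sketch above, and its excitation again assumes the ReLU/sigmoid pair with reduction factor R:

```python
import torch.nn as nn

class CSFA(nn.Module):
    """Cross-attention fusion: SimAM recalibration, then the channel weights of the
    higher-level feature scale the lower-level one. The spatial sizes may differ
    (the excitation vector is 1 x 1 x 1 spatially), but channel counts must match."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.simam = SimAM()
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, s_low, s_high):  # e.g. s_low = s_3, s_high = s_4
        a = self.simam(s_low)          # s'_3, eq. (14)
        b = self.simam(s_high)         # s'_4, eq. (14)
        n, c = b.shape[:2]
        f = self.excite(b.mean(dim=(2, 3, 4))).view(n, c, 1, 1, 1)  # f_4, eq. (15)
        return a * f                   # F_3 = s'_3 * f_4

# Top-down fusion chain of fig. 6:
#   F3 = csfa(s3, s4); F2 = csfa(s2, F3); F1 = csfa(s1, F2)
```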
Step 2-3: feature classification phase
In the feature classification stage, the invention adopts a long-tail learning classifier. When features are input, the confidence is adaptively adjusted according to their distribution. The principle can be described as follows.
For the labels [0, 1], let the output of the model, i.e., the predicted confidence sequence, be z = [z_0, z_1]. The class confidences are adjusted by a linear transformation:

z̃_j = α_j z_j + β_j    (16)

where α_j, β_j are correction parameters to be learned that adjust each class's probability distribution. A confidence function is then defined to combine the aligned probability with the original probability, giving the corrected probability ẑ_j:

ẑ_j = σ(f(x)) z̃_j + (1 - σ(f(x))) z_j    (17)

where σ(x) is the activation function and f(x) is a confidence score computed from the input features.
The vector produced by spatial pyramid pooling (SPP) is fed into the long-tail learning classifier, whose input dimension is 4379 and output dimension is 2, to obtain the confidence sequence of the two classes.
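Equations (16)-(17) admit the following sketch. The patent does not spell out where the mixing score σ(f(x)) comes from, so the gate below (a linear layer on the SPP vector) is one plausible reading, not the patent's definitive formulation:

```python
import torch
import torch.nn as nn

class LongTailClassifier(nn.Module):
    def __init__(self, in_dim: int = 4379, n_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(in_dim, n_classes)        # original confidences z
        self.alpha = nn.Parameter(torch.ones(n_classes))
        self.beta = nn.Parameter(torch.zeros(n_classes))
        self.gate = nn.Linear(in_dim, 1)              # assumed source of the mixing score f(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.fc(x)
        z_adj = self.alpha * z + self.beta            # eq. (16): linear confidence adjustment
        g = torch.sigmoid(self.gate(x))               # sigma(f(x))
        return g * z_adj + (1 - g) * z                # eq. (17): mix aligned and original
```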
Step three: model training
The model was trained on 4 GeForce RTX 3090 GPUs with the batch size set to 128 for 100 epochs. The optimizer is SGD (initial learning rate 0.001, momentum factor 0.9), the loss function adopts cross entropy, and ten-fold cross-validation is performed.
Step four: weighted fusion of the multi-size prediction confidence sequences
Because lung nodules vary widely in shape, size, and texture, samples of different sizes are cropped around the same suspicious lesion region and trained separately, each model finally outputting a confidence sequence z = [z_0, z_1].

The confidence sequences of the five sizes 24³, 28³, 32³, 36³, and 40³ mm³ are fused by weighting:

z_final = Σ_{i=1..5} ω z_i    (18)

where ω = 0.2. This yields the confidence output by the final model, and argmax selects the dimension with the higher confidence as the final prediction.
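Equation (18) is a plain weighted sum; a sketch (function name ours):

```python
import numpy as np

def fuse_confidences(z_list, weight: float = 0.2):
    """Weighted fusion of the five size-specific confidence sequences, eq. (18)."""
    z = np.asarray(z_list)             # shape (5, 2): one [z_0, z_1] per input size
    fused = (weight * z).sum(axis=0)   # omega = 0.2 for every size
    return fused, int(fused.argmax())  # final confidences and the predicted class
```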
Step five: model evaluation
The model was evaluated using the FROC standard. The independent variable of FROC is the average number of false-positive samples per CT scan (FPPS), and the dependent variable is sensitivity.
The sensitivity is calculated as:

sensitivity = TP / (TP + FN)    (19)
where TP denotes true positives and FN denotes false negatives.
FROC reflects the classification performance of a model at different thresholds. Seven representative points are taken from the FROC curve: 0.125, 0.25, 0.5, 1, 2, 4, and 8 FPPS. CPM, the average over these seven points, summarizes the classification performance of the model:

CPM = (1/7) Σ_{i} Recall_{fpr=i}    (20)

where Recall_{fpr=i} denotes the recall corresponding to false-positive rate i.
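Equations (19)-(20) can be computed directly once the recall values have been read off the FROC curve; the function names are hypothetical:

```python
def sensitivity(tp: int, fn: int) -> float:
    """Eq. (19): sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)

def cpm(recall_at_fpps: dict) -> float:
    """Eq. (20): CPM is the mean recall over the seven representative FPPS points."""
    points = (0.125, 0.25, 0.5, 1, 2, 4, 8)
    return sum(recall_at_fpps[p] for p in points) / len(points)
```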
The CPM results of the present invention are shown in Table 1.

Table 1: CPM results (the table is provided as an image in the original publication and is not reproduced here).
Step 6: results display
Fig. 8 shows some positive lung-nodule prediction results, displaying a two-dimensional cross-section and the corresponding confidence for each case. As fig. 8 shows, (a), (b), and (c) are simple samples on which the model performs very well; (d), (e), and (f) are very small or very large nodules that the model also recognizes well; (g), (h), and (i) are difficult samples with interfering information and irregular shapes, on which the model still achieves a considerable recognition effect.
Table 2: uniformly taking suspected lung nodule blocks with the length, width and height of 32mm as input contrast effects:
Figure RE-GDA0003625028910000124
As shown in Table 2, M2 reduces the number of backbone layers relative to the method M1 of the present application; although it achieves the highest score at 8 FPPS, all its other points are lower than M1. M3 replaces the loss function with cross entropy and scores higher than M1 at 0.125 FPPS, but its results at 4 and 8 FPPS are unsatisfactory, showing that GRWLoss has a certain adjusting effect on DisAlignLinear. M4 replaces CSFA with an FPN upsampling connection and underperforms M1, showing that CSFA improves the fusion of similar features. M5 and M6 adopt a traditional fully connected layer: M5 uses Focal Loss to address the data imbalance, while M6 balances the data by sampling as many negative samples as positive ones for the training set.
Table 3: comparison of effects for different input sizes:
Figure RE-GDA0003625028910000131
CPM score for the model at 243To 403Shows an ascending trend but is 403To 563The interval (a) is in a downward trend. Accordingly, the present application selects 243To 4035 sizes for weighted fusion:
table 4: selecting 243To 4035 sizes of
Figure RE-GDA0003625028910000132
Some steps in the embodiments of the present invention may be implemented by software, and the corresponding software program may be stored in a readable storage medium, such as an optical disc or a hard disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. A method for classifying an object based on spatial cross attention mechanism feature fusion, the method comprising:
step 1: acquiring a three-dimensional image to be classified, and setting the length, width and height of a target area in the three-dimensional image to be classified as L, W and H respectively;
step 2: performing feature extraction on a target area in an image to be classified by adopting a 3DSeNet backbone network to obtain four primary feature vectors; the 3DSeNet backbone network is composed of a plurality of SeBlock blocks, and the SeBlock blocks are obtained by adding an SE three-dimensional channel attention module in a ResBlock block;
step 3: respectively carrying out feature refinement on four primary feature vectors output by a 3DSeNet backbone network to obtain refined feature vectors;
step 4: re-calibrating the other characteristic diagram by using the characteristics of one characteristic diagram by using the four thinned characteristic vectors in a channel attention mechanism excitation method, and finally obtaining characteristic vectors for classification;
step 5: and classifying the three-dimensional image to be classified according to the finally obtained characteristic vector for classification.
2. The method of claim 1, wherein the SE three-dimensional channel attention module first performs a Squeeze operation on the input feature x ∈ R^(C×L×W×H) to obtain a channel-based global feature map z ∈ R^(C×1×1×1), performs an Excitation operation on the global feature map to obtain a feature map s_c, and then uses a Scale operation to multiply s_c with the originally input feature x, completing the feature recalibration that corrects the features.
3. The method of claim 2, wherein the SE three-dimensional channel attention module in Step 2 obtains the four preliminary feature vectors from the input feature x ∈ R^(C×L×W×H) by:

performing the Squeeze operation on the input feature x to obtain the channel-based global feature of the feature map:

z_c = F_sq(x_c) = (1/(L·W·H)) Σ_{i=1..L} Σ_{j=1..W} Σ_{k=1..H} x_c(i, j, k)    (2)

that is, three-dimensional global adaptive average pooling encodes the entire spatial feature on a channel into one global feature, outputting the feature map z ∈ R^(C×1×1×1), where C denotes the number of channels;

obtaining the feature map s_c through the Excitation operation:

s_c = σ(w_2 δ(w_1 z))    (3)

where z is the output of the Squeeze operation, σ is the sigmoid activation function, δ is the ReLU activation function, w_1 ∈ R^((C/R)×C) and w_2 ∈ R^(C×(C/R)) are the excitation weights, and R denotes a reduction factor;

multiplying the s_c obtained from the Excitation operation with the input feature x ∈ R^(C×L×W×H) through the Scale operation:

x̃_c = F_scale(x_c, s_c) = s_c · x_c    (4)

where x̃ = [x̃_1, x̃_2, …, x̃_C]; the output of the SE three-dimensional channel attention module has the same size as the input feature map;

the four preliminary feature vectors are denoted x_1, x_2, x_3, and x_4.
4. The method of claim 3, wherein Step 3 comprises:

passing the input feature x_2 through a multi-scale feature refinement module to obtain an output x'_2, fusing the input feature x_1 with the upsampled x'_2 and feeding the result into the multi-scale feature refinement module f_s to obtain an output x'_1, and finally fusing x_1 with x'_1 to obtain an output s_1:

x'_1 = f_s(λ_1 x_1 + λ_2 Up(x'_2)),  s_1 = λ'_1 x_1 + λ'_2 x'_1    (5)

where f_s is the multi-scale feature refinement module, λ_1, λ_2 are one set of linear parameters, λ'_1, λ'_2 are another set, and Up(x'_2) denotes upsampling the refined feature x'_2;

passing the input feature x_3 through the multi-scale feature refinement module to obtain an output x'_3, fusing the input feature x_2 with the upsampled x'_3 and feeding the result into f_s to obtain an output x'_2, and finally fusing x_2, x'_2, and the downsampled output feature s_1 to obtain an output s_2:

x'_2 = f_s(λ_1 x_2 + λ_2 Up(x'_3)),  s_2 = λ'_1 x_2 + λ'_2 x'_2 + λ'_3 Down(s_1)    (6)

where Down(s_1) denotes downsampling the stage-1 fused feature s_1, and λ'_1, λ'_2, λ'_3 are a set of linear parameters;

passing the input feature x_4 through the multi-scale feature refinement module to obtain an output x'_4, fusing the input feature x_3 with the upsampled x'_4 and feeding the result into f_s to obtain an output x'_3, and finally fusing x_3, x'_3, and the downsampled output feature s_2 to obtain an output s_3:

x'_3 = f_s(λ_1 x_3 + λ_2 Up(x'_4)),  s_3 = λ'_1 x_3 + λ'_2 x'_3 + λ'_3 Down(s_2)    (7)

passing the input feature x_4 through the multi-scale feature refinement module to obtain the output x'_4, and fusing x_4, x'_4, and the downsampled output feature s_3 of stage 3 to obtain the output feature s_4:

x'_4 = f_s(x_4),  s_4 = λ'_1 x_4 + λ'_2 x'_4 + λ'_3 Down(s_3)    (8)
5. The method of claim 4, wherein Step 4 comprises:

fusing s_4 and s_3 through the cross-attention mechanism, calibrating the features of s_4 onto s_3 and outputting the fusion feature F_3;

fusing F_3 and s_2 through the cross-attention mechanism, calibrating the features of F_3 onto s_2 and outputting the fusion feature F_2;

fusing F_2 and s_1 through the cross-attention mechanism, calibrating the features of F_2 onto s_1 and outputting the fusion feature F_1.
6. The method of claim 5, wherein fusing s_4 and s_3 through the cross-attention mechanism, calibrating the features of s_4 onto s_3 and outputting the fusion feature F_3, comprises:

performing similarity recalibration in three-dimensional space on the two channel-identical features s_4 and s_3:

s'_3 = SimAM(s_3),  s'_4 = SimAM(s_4)    (14)

compressing s'_4 into a channel-based one-dimensional vector f_4 through the channel-attention Squeeze operation, and multiplying it with s'_3 to obtain the output F_3, which recalibrates the features of s_3:

f_4 = σ(w_2 δ(w_1 F_sq(s'_4))),  F_3 = s'_3 · f_4    (15)

where F_sq is the Squeeze operation of equation (2), SimAM(·) weights each pixel by its linear separability within its channel, and σ denotes the sigmoid activation function.
7. The method of claim 6, wherein Step 5 comprises:

pooling the feature F_1 over three-dimensional space into a one-dimensional classification vector of length 4379, from which the classifier outputs the confidence sequence of the two classes.
8. A method for predicting lung nodule positive probability, which is characterized in that the method adopts the method of any one of claims 1-7 to obtain a lung nodule positive probability prediction value based on suspected lung nodule data output by a CAD system.
CN202210084352.4A 2022-01-20 2022-01-20 Target classification method based on space cross attention mechanism feature fusion Pending CN114581698A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210084352.4A CN114581698A (en) 2022-01-20 2022-01-20 Target classification method based on space cross attention mechanism feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210084352.4A CN114581698A (en) 2022-01-20 2022-01-20 Target classification method based on space cross attention mechanism feature fusion

Publications (1)

Publication Number Publication Date
CN114581698A 2022-06-03

Family

ID=81772515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210084352.4A Pending CN114581698A (en) 2022-01-20 2022-01-20 Target classification method based on space cross attention mechanism feature fusion

Country Status (1)

Country Link
CN (1) CN114581698A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115276784A (en) * 2022-07-26 2022-11-01 西安电子科技大学 Deep learning-based orbital angular momentum modal identification method
CN115276784B (en) * 2022-07-26 2024-01-23 西安电子科技大学 Deep learning-based orbital angular momentum modal identification method
CN116188392A (en) * 2022-12-30 2023-05-30 阿里巴巴(中国)有限公司 Image processing method, computer-readable storage medium, and computer terminal

Similar Documents

Publication Publication Date Title
Santos et al. Artificial intelligence, machine learning, computer-aided diagnosis, and radiomics: advances in imaging towards to precision medicine
Binczyk et al. Radiomics and artificial intelligence in lung cancer screening
Halder et al. Lung nodule detection from feature engineering to deep learning in thoracic CT images: a comprehensive review
Sun et al. Deep learning-based classification of liver cancer histopathology images using only global labels
CN107016665B (en) CT pulmonary nodule detection method based on deep convolutional neural network
Froz et al. Lung nodule classification using artificial crawlers, directional texture and support vector machine
CN110766051A (en) Lung nodule morphological classification method based on neural network
Shaukat et al. Computer-aided detection of lung nodules: a review
CN112270666A (en) Non-small cell lung cancer pathological section identification method based on deep convolutional neural network
WO2018107371A1 (en) Image searching system and method
ur Rehman et al. An appraisal of nodules detection techniques for lung cancer in CT images
Liu Stbi-yolo: A real-time object detection method for lung nodule recognition
CN111798424B (en) Medical image-based nodule detection method and device and electronic equipment
CN114581698A (en) Target classification method based on space cross attention mechanism feature fusion
US11062443B2 (en) Similarity determination apparatus, similarity determination method, and program
CN111767952A (en) Interpretable classification method for benign and malignant pulmonary nodules
CN112258461A (en) Pulmonary nodule detection method based on convolutional neural network
CN116091490A (en) Lung nodule detection method based on YOLOv4-CA-CBAM-K-means++ -SIOU
Mobiny et al. Lung cancer screening using adaptive memory-augmented recurrent networks
CN115715416A (en) Medical data inspector based on machine learning
Zhou et al. Deep learning-based breast region extraction of mammographic images combining pre-processing methods and semantic segmentation supported by Deeplab v3+
Zhang et al. LungSeek: 3D Selective Kernel residual network for pulmonary nodule diagnosis
CN117710760B (en) Method for detecting chest X-ray focus by using residual noted neural network
Tian et al. Radiomics and its clinical application: artificial intelligence and medical big data
CN114565786A (en) Tomography image classification device and method based on channel attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination