CN115187819B - Training method and device for image classification model, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115187819B
Authority
CN
China
Prior art keywords
image
classification
encoder
mask
branch
Prior art date
Legal status
Active
Application number
CN202211014238.0A
Other languages
Chinese (zh)
Other versions
CN115187819A (en)
Inventor
贾潇
王子腾
丁佳
吕晨翀
Current Assignee
Zhejiang Yizhun Intelligent Technology Co ltd
Original Assignee
Beijing Yizhun Medical AI Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yizhun Medical AI Co Ltd
Priority to CN202211014238.0A
Publication of CN115187819A
Application granted
Publication of CN115187819B
Legal status: Active

Classifications

    • G06V 10/764 (image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects)
    • G06T 7/0012 (image analysis; inspection of images, e.g. flaw detection; biomedical image inspection)
    • G06T 7/11 (image analysis; segmentation; region-based segmentation)
    • G06T 9/00 (image coding)
    • G06V 10/774 (generating sets of training patterns; bootstrap methods, e.g. bagging or boosting)
    • G06V 10/80 (fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level)
    • G06T 2207/10116 (image acquisition modality: X-ray image)
    • G06T 2207/30061 (subject of image: lung)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

The disclosure provides a training method, a training device, electronic equipment and a storage medium for an image classification model, wherein the method comprises the following steps: confirming a first sample image and a first mask image corresponding to the first sample image; training a first encoder included in an image classification model based on the first mask image, and confirming that the trained first encoder is a second encoder; inputting a second sample image into a classification branch, inputting a second mask image corresponding to the second sample image into a comparison branch, and performing feature fusion on a first feature image output by the classification branch and a second feature image output by the comparison branch to obtain a first fusion feature; inputting the first fusion characteristic into a full-connection layer classifier included in the classification branch, and confirming that the output of the full-connection layer classifier is a lung prediction classification result corresponding to the second sample image; and adjusting parameters of the classification branch based on the lung labeling classification result of the second sample image and the lung prediction classification result of the second sample image.

Description

Training method and device for image classification model, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular relates to a training method and device for an image classification model, electronic equipment and a storage medium.
Background
A Masked Autoencoder (MAE) exploits the redundancy of image information by using the reconstruction of an original image from a partially observed image as a proxy task; the MAE encoder thereby learns to infer the content of masked image regions by aggregating context information. However, when the MAE encoder is trained on a chest X-ray image dataset and its weights are used only as initial weights in downstream tasks, the prior knowledge learned from the chest X-ray image dataset is not well utilized.
Disclosure of Invention
The disclosure provides a training method, device, electronic equipment and storage medium for an image classification model, so as to at least solve the technical problems in the prior art.
According to a first aspect of the present disclosure, there is provided a training method of an image classification model, including:
confirming a first sample image and a first mask image corresponding to the first sample image; the first mask image is obtained based on a first sample image, and a lung field area in the first mask image comprises a mask;
adjusting parameters of a first encoder included in the image classification model based on the first mask image, and confirming that the first encoder after adjusting the parameters is a second encoder; the second encoder is applied to a classification branch and a comparison branch included in the image classification model;
inputting a second sample image into the classification branch, inputting a second mask image corresponding to the second sample image into the comparison branch, and performing feature fusion on a first feature image output by the classification branch and a second feature image output by the comparison branch to obtain a first fusion feature;
inputting the first fusion characteristic into a full-connection layer classifier included in the classification branch, and confirming that the output of the full-connection layer classifier is a lung prediction classification result corresponding to the second sample image;
and adjusting parameters of the classification branch based on the lung labeling classification result of the second sample image and the lung prediction classification result of the second sample image.
In the above solution, the identifying the first sample image and the first mask image corresponding to the first sample image includes:
dividing the first sample image, and determining a lung field area and a non-lung field area in the first sample image based on a division result;
and replacing the lung field area of the first sample image based on the mask and/or the sub-image in the first patch library to obtain a first mask image corresponding to the first sample image.
In the above solution, the replacing the lung field area of the first sample image based on the mask and/or the sub-image in the first patch library to obtain the first mask image corresponding to the first sample image includes:
replacing a lung field region of the first sample image based on the mask to obtain a first mask image corresponding to the first sample image; the lung field areas in the first mask image are all masks;
or, based on the mask and the sub-images in the first patch library, replacing the lung field area of the first sample image to obtain a first mask image corresponding to the first sample image; the lung field area in the first mask image is partly masked and partly subimages in the first patch library.
In the above aspect, the adjusting parameters of the first encoder included in the image classification model based on the first mask image, and confirming that the first encoder after adjusting the parameters is the second encoder, includes:
inputting the first mask image into the first encoder, and confirming that the output of the first encoder is at least one characteristic image corresponding to the first mask image;
inputting the at least one characteristic image into a decoder included in the image classification model, and confirming that the output of the decoder is a first reconstructed image corresponding to the first mask image;
and adjusting parameters of the first encoder based on the first reconstructed image and the first sample image, and confirming that the first encoder after adjusting the parameters is a second encoder.
In the above solution, the inputting the second sample image into the classification branch, inputting the second mask image corresponding to the second sample image into the comparison branch, and performing feature fusion on the first feature image output by the classification branch and the second feature image output by the comparison branch to obtain a first fusion feature includes:
inputting the second sample image into a third encoder included in the classification branch, and confirming that the output of the third encoder is a first characteristic image corresponding to the second sample image; the parameters of the third encoder are the same as those of the second encoder;
inputting a second mask image corresponding to the second sample image into a fourth encoder included in the comparison branch, and confirming that the output of the fourth encoder is a second characteristic image corresponding to the second sample image; the parameters of the fourth encoder are the same as those of the second encoder;
and carrying out feature fusion on the first feature image and the second feature image to obtain a first fusion feature.
In the above scheme, the feature fusion of the first feature image output by the classification branch and the second feature image output by the comparison branch to obtain a first fusion feature includes:
performing pixel-by-pixel difference on the first characteristic image and the second characteristic image to obtain a first difference characteristic;
the first fusion feature is obtained based on the first feature image and the first difference feature.
In the above solution, the adjusting the parameters of the classification branch based on the lung labeling classification result of the second sample image and the lung prediction classification result of the second sample image includes:
determining cross entropy loss based on the lung labeling classification result of the second sample image and the lung prediction classification result of the second sample image;
and adjusting parameters of a third encoder included in the classification branch and parameters of the full-connection layer classifier based on the cross entropy loss.
According to a second aspect of the present disclosure, there is provided an image classification method implemented based on an image classification model trained by the method provided in the first aspect, the method comprising:
inputting a first image to be classified into a classification branch included in the image classification model, and confirming that the output of the classification branch is a classification result of the first image to be classified;
and/or, confirming a third mask image corresponding to the first image to be classified; and inputting the third mask image into a comparison branch included in the image classification model, and confirming that the output of the comparison branch is a comparison image corresponding to the first image to be classified.
According to a third aspect of the present disclosure, there is provided a training apparatus of an image classification model, comprising:
a dividing unit configured to confirm a first sample image and a first mask image corresponding to the first sample image; the first mask image is obtained based on a first sample image, and a lung field area in the first mask image comprises a mask;
the first training unit is used for adjusting parameters of a first encoder included in the image classification model based on the first mask image, and confirming that the first encoder after the parameters are adjusted is a second encoder; the second encoder is applied to a classification branch and a comparison branch included in the image classification model;
the feature fusion unit is used for inputting a second sample image into the classification branch, inputting a second mask image corresponding to the second sample image into the comparison branch, and carrying out feature fusion on a first feature image output by the classification branch and a second feature image output by the comparison branch to obtain a first fusion feature;
the second training unit is used for inputting the first fusion characteristic into a full-connection layer classifier included in the classification branch, and confirming that the output of the full-connection layer classifier is a lung prediction classification result corresponding to the second sample image;
and the adjusting unit is used for adjusting parameters of the classification branches based on the lung labeling classification result of the second sample image and the lung prediction classification result of the second sample image.
According to a fourth aspect of the present disclosure, there is provided an image classification apparatus comprising:
a first input unit, configured to input a first image to be classified into a classification branch included in the image classification model, and confirm that an output of the classification branch is a classification result of the first image to be classified;
a second input unit, configured to confirm a third mask image corresponding to the first image to be classified; and inputting the third mask image into a comparison branch included in the image classification model, and confirming that the output of the comparison branch is a comparison image corresponding to the first image to be classified.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods described in the present disclosure.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the present disclosure.
According to the training method of the image classification model, a first sample image and a first mask image corresponding to the first sample image are confirmed; the first mask image is obtained based on a first sample image, and a lung field area in the first mask image comprises a mask; adjusting parameters of a first encoder included in the image classification model based on the first mask image, and confirming that the first encoder after adjusting the parameters is a second encoder; the second encoder is applied to a classification branch and a comparison branch included in the image classification model; inputting a second sample image into the classification branch, inputting a second mask image corresponding to the second sample image into the comparison branch, and performing feature fusion on a first feature image output by the classification branch and a second feature image output by the comparison branch to obtain a first fusion feature; inputting the first fusion characteristic into a full-connection layer classifier included in the classification branch, and confirming that the output of the full-connection layer classifier is a lung prediction classification result corresponding to the second sample image; adjusting parameters of the classification branch based on a lung labeling classification result of the second sample image and a lung prediction classification result of the second sample image; therefore, priori knowledge learned by the image classification model in the healthy chest radiography data set can be fully utilized and applied to training of downstream classification tasks, and classification effect of the image classification model is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 is a schematic diagram showing the structure of MAE in the related art;
FIG. 2 illustrates an alternative flow diagram of a training method for an image classification model provided by embodiments of the present disclosure;
FIG. 3 illustrates another alternative flow diagram of a training method for an image classification model provided by embodiments of the present disclosure;
FIG. 4 shows a schematic diagram of lung field segmentation provided by an embodiment of the present disclosure;
FIG. 5 shows a partitioning illustration of an image provided by an embodiment of the present disclosure;
FIG. 6 illustrates an alternative schematic diagram of validating a first mask image provided by an embodiment of the present disclosure;
FIG. 7 illustrates an alternative schematic diagram of an image classification model provided by an embodiment of the present disclosure;
FIG. 8 illustrates another alternative schematic diagram of an image classification model provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of an alternative flow chart of an image classification method according to an embodiment of the disclosure;
FIG. 10 shows an alternative structural schematic of a training apparatus for image classification models provided by embodiments of the present disclosure;
FIG. 11 is a schematic view showing an alternative configuration of an image classification apparatus according to an embodiment of the present disclosure;
fig. 12 is a schematic diagram showing a composition structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, features and advantages of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure will be clearly described in conjunction with the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. Based on the embodiments in this disclosure, all other embodiments that a person skilled in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
Transformers are widely used in the field of natural language processing because of their ability to establish relationships between long-distance objects through the self-attention mechanism. In the computer vision field, the Vision Transformer (ViT) divides the input image into a plurality of blocks (patches), e.g., 16×16, and projects each patch into a fixed-length vector that is fed into the Transformer. When enough data is available for pre-training, the performance of ViT exceeds that of convolutional neural networks, breaking through the limitation of lacking inductive bias, and a good transfer effect can be obtained in downstream tasks.
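As an illustration of the patch-projection step just described, the following minimal sketch (a PyTorch illustration, not the patented model) splits an image into non-overlapping 16×16 patches and projects each patch into a fixed-length vector; all sizes are assumptions chosen for the example:

```python
import torch
import torch.nn as nn

class PatchProjection(nn.Module):
    """Split an image into non-overlapping patches and project each one to a token vector."""
    def __init__(self, img_size=224, patch_size=16, in_chans=1, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A patch_size-strided convolution is equivalent to flattening each patch
        # and applying a shared fully connected (linear) projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, embed_dim)

tokens = PatchProjection()(torch.randn(2, 1, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```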
Fig. 1 shows a schematic diagram of the structure of an MAE in the related art.
The masked autoencoder has proved effective for pre-training ViT in natural image analysis. As shown in fig. 1, the MAE uses the redundancy of image information and takes reconstructing the original image from an observed part of the image as a proxy task; the encoder of the MAE thus has the ability to infer the content of the masked image area by aggregating context information. This context-aggregation ability is also crucial in the medical image field; for example, in chest X-ray images, anatomical structures (ribs, lung fields) are functionally and mechanically related to other structures and regions.
When the MAE is applied to chest X-ray image analysis tasks, the input image is reconstructed under a masking strategy that randomly masks 75% of the image blocks, and the trained MAE encoder achieves higher performance in the downstream chest X-ray multi-label disease diagnosis task.
However, when the MAE encoder is pre-trained on a healthy chest X-ray image dataset, the encoder (a ViT model) is pre-trained with in-painting of the missing lung field area as the pretext task, and the resulting ViT weights (i.e., the parameters of the encoder) are only used as the initial weights of the ViT encoder in the downstream task, with fine-tuning (Fine-tune) performed on top of these weights. As a result, the domain knowledge learned by the image classification model on the healthy chest X-ray image dataset is not fully utilized. Specifically, different healthy chest X-ray images often contain similar structural tissues, such as ribs, clavicles and lung hila, at the same positions, but this additional information obtained in the pre-training task is not fused into the subsequent computer-aided diagnosis (CAD) model, so the image classification model ultimately cannot achieve higher classification performance.
Based on this, the embodiments of the disclosure provide a training method for an image classification model, which can make full use of the prior knowledge learned by the image classification model on a healthy chest radiography dataset and apply it to the training of the downstream classification task, thereby improving the classification effect of the image classification model.
Fig. 2 is a schematic flow chart of an alternative method for training an image classification model according to an embodiment of the disclosure, and will be described according to the steps.
Step S101, confirming a first sample image and a first mask image corresponding to the first sample image.
In some embodiments, the training device (hereinafter referred to as a first device) of the image classification model confirms the first sample image and the first mask image corresponding to the first sample image; the first sample image may be a healthy (non-diseased) chest X-ray image; the first mask image is obtained based on a first sample image, and a lung field area in the first mask image comprises a mask.
In specific implementation, the first device may segment the first sample image, and determine a lung field area and a non-lung field area in the first sample image based on a segmentation result; alternatively, the first device may input the first sample image into a trained lung field segmentation model (UNet) to obtain a first mask image. Then, replacing a lung field area of the first sample image based on a mask to obtain a first mask image corresponding to the first sample image; or replacing the lung field area of the first sample image based on the mask and the sub-images in the first patch library to obtain a first mask image corresponding to the first sample image. The first patch library comprises at least one healthy chest X-ray image and a plurality of sub-images segmented based on the at least one healthy chest X-ray image.
Specifically, if the first device replaces the lung field area of the first sample image based on the mask, the lung field area of the first sample image is replaced with the mask, and the non-lung field area is not processed, so that the image with the mask of the lung field area is confirmed to be the first mask image.
Or if the first device replaces the lung field region of the first sample image based on the mask and the sub-image in the first patch library, after confirming the non-lung field region and the lung field region, filling the lung field region based on the mask; the lung field area does not contain any information at this time; the first device can randomly replace the mask of the lung field area in the first sample image through the sub-image in the first patch library, and the structure or the position of the sub-image during replacement needs to correspond to the structure and the position of the mask of the lung field area replaced; specifically, the structure and position of the mask may be the structure and position of the original lung field area corresponding to the mask.
In some optional embodiments, the first device divides the image in which the lung field area has been filled with the mask into at least one sub-image (patch), identifies the sub-images that do not include any information (i.e., whose pixel sum is 0), and forms a first sub-image set from them; all sub-images in the first sub-image set are numbered, the order is shuffled, the first "first threshold" number of sub-images in the shuffled order are taken out, sub-images with the same positions or numbers are acquired from the first patch library, those sub-images are replaced with the corresponding sub-images in the first patch library, and the replaced image is taken as the first mask image. The first threshold may be determined according to actual requirements or experimental results.
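A minimal sketch of this construction is given below; it assumes the lung field mask has already been produced by a segmentation model (for example a trained UNet), that images are 2-D NumPy arrays, and that the function and parameter names (e.g. lam for λ, patch_library_image) are illustrative rather than part of the disclosure:

```python
import numpy as np

def build_first_mask_image(sample_image, lung_field_mask, patch_library_image=None,
                           patch_size=16, lam=2, seed=0):
    """Fill the lung-field area of a healthy chest X-ray with a mask (zeros) and,
    optionally, replace a shuffled subset of the all-zero sub-images with the
    sub-images at the same positions in a healthy image from the first patch library."""
    rng = np.random.default_rng(seed)
    masked = sample_image.copy()
    masked[lung_field_mask > 0] = 0              # variant 1: lung field entirely mask

    if patch_library_image is None:
        return masked

    # variant 2: collect indices of sub-images that contain no information (pixel sum 0)
    zero_idx = []
    h, w = masked.shape
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            if masked[i:i + patch_size, j:j + patch_size].sum() == 0:
                zero_idx.append((i, j))

    # shuffle the index list and take the first len(list) // lam entries (the first threshold)
    order = rng.permutation(len(zero_idx))
    for k in order[: len(zero_idx) // lam]:
        i, j = zero_idx[k]
        patch = patch_library_image[i:i + patch_size, j:j + patch_size]
        masked[i:i + patch_size, j:j + patch_size] = patch
    return masked
```

Calling the function without patch_library_image yields the all-mask variant; passing a healthy library image yields the partially replaced variant.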
Compared with directly replacing the lung field region with the mask to generate the first mask image, this provides extra auxiliary information for the pre-training task without introducing extra labels, can increase the convergence speed of the first encoder, and helps the cxrMAE model (the first encoder) fill in the lung field region of the original chest radiograph better. Replacing part of the mask image with sub-images from the first patch library establishes associations between different healthy chest radiographs, so that the structural tissue information shared by healthy chest radiographs is better learned; the result serves as a prior-knowledge model of the healthy-chest lung field tissue structure with better generalization performance and is applied to downstream tasks. For the same healthy chest radiograph, multiple possible latent-space feature vectors and filling results can be obtained by substituting different mask images.
Further, the first device may acquire the mask images corresponding to all sample images in the training set; because the size of the lung field area differs between sample images, the entire mask image is selected as the input of the image classification model. Correspondingly, the sample images in the training set are all healthy chest X-ray images.
Step S102, training a first encoder included in the image classification model based on the first mask image, and confirming that the trained first encoder is a second encoder.
In some embodiments, the first device inputs the first mask image into the first encoder, and confirms that the output of the first encoder is at least one feature image corresponding to the first mask image; inputting the at least one characteristic image into the decoder, and confirming that the output of the decoder is a first reconstructed image corresponding to the first mask image; and adjusting parameters of the first encoder based on the first reconstructed image and the first sample image, and confirming that the first encoder after adjusting the parameters is a second encoder.
In some embodiments, the image classification model may further include a first full-connection layer located before the first encoder for performing a dimension conversion on the segmented image after the first mask image segmentation, and a second full-connection layer located after the decoder for performing a dimension conversion on the reconstructed sub-image output by the decoder.
In the implementation, the first device segments the first mask image into at least one segment image, wherein the dimension of the segment image is m×n, and then the at least one segment image is input to a first full-connection layer included in an image classification model for dimension conversion; wherein, every divided image is not overlapped, the size of every divided image is the same, and the sum of the areas of all divided images is equal to the area of the first mask image. Further, the first device inputs at least one segmented image corresponding to the first mask image after dimension conversion into the first encoder, and determines the output of the first encoder as the at least one feature image corresponding to the obtained first mask image; optionally, the number of the feature images may be the same as the number of the split images, or may be different from the number of the split images; the segmented image corresponds to at least one feature image having the same dimensions as the dimension of the at least one segmented image after the dimension conversion. Then, the first device inputs at least one characteristic image corresponding to the first mask image into the decoder, and determines the output of the decoder as at least one reconstructed sub-image; the number of the reconstructed sub-images is the same as that of the segmented images, and the size of the reconstructed sub-images is the same as that of the segmented images; the device inputs the at least one reconstructed sub-image into a second fully connected layer, and confirms that the output of the second fully connected layer is the first reconstructed image.
Each reconstructed sub-image has a unique one corresponding to its corresponding segmented image, which corresponds in position to the first reconstructed image or the first mask image, e.g., the reconstructed sub-image of the a-th row and b-th column in the first reconstructed image corresponds to the segmented image of the a-th row and b-th column in the first mask image, which is identical in size and dimension, and similar or identical in feature.
In some embodiments, the first means confirms a sum of squares of euclidean distances between the reconstructed sub-image and the segmented image, which are identical in position, as the first sub-loss value; the number of at least one reconstructed sub-image corresponding to the first reconstructed image is the same as the number of at least one segmented image corresponding to the first sample image.
Specifically, the same position may include that the reconstructed sub-image of the a-th row and the b-th column in the first reconstructed image is the same as the position of the segmented image of the a-th row and the b-th column in the first mask image, the reconstructed sub-image and the segmented image of the first reconstructed image and the first mask image which are the same in position may be set as image pairs, the square of the euclidean distance (L2 distance) between each pair of images is calculated, and then the squares of the euclidean distances between all the pairs of images are summed, and the result of the summation is confirmed to be the first sub-loss value.
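A minimal sketch of this loss computation, assuming the first reconstructed image and the first sample image are PyTorch tensors of shape (B, C, H, W) and that the helper name patchify is an assumption for illustration:

```python
import torch

def patchify(img, patch_size=16):
    """Split (B, C, H, W) images into flattened non-overlapping patches: (B, N, C*ps*ps)."""
    b, c, h, w = img.shape
    p = patch_size
    x = img.reshape(b, c, h // p, p, w // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(b, (h // p) * (w // p), c * p * p)
    return x

def reconstruction_loss(reconstructed, target, patch_size=16):
    """Sum, over same-position patch pairs, of the squared L2 distance;
    averaging over the batch is an implementation choice of this sketch."""
    pred_patches = patchify(reconstructed, patch_size)   # (B, N, D)
    tgt_patches = patchify(target, patch_size)           # (B, N, D)
    return ((pred_patches - tgt_patches) ** 2).sum(dim=(-1, -2)).mean()
```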
In some alternative embodiments, after adjusting the parameters of the first encoder and/or decoder based on the first sub-loss value, the first apparatus may further repeatedly perform steps S101 to S102, that is, repeatedly train the first encoder and/or decoder until the sub-loss value satisfies a first condition, confirm that the training of the first encoder is completed, and confirm that the first encoder after the training is completed is the second encoder. The first condition may be that the sub-loss value is smaller than a preset threshold, or the sub-loss value converges, or other conditions set based on actual requirements or experimental results, which are not specifically limited in the disclosure.
In some embodiments, the second encoder is applied to the classification branch and the comparison branch included in the image classification model; specifically, the second encoder serves as the third encoder in the classification branch and as the fourth encoder in the comparison branch.
Step S103, inputting a second sample image into the classification branch, inputting a second mask image corresponding to the second sample image into the comparison branch, and performing feature fusion on the first feature image output by the classification branch and the second feature image output by the comparison branch to obtain a first fusion feature.
In some embodiments, after the first device trains to obtain the second encoder, the second encoder is applied to a downstream classification task, specifically, the second sample image is input into a third encoder included in the classification branch, and the output of the third encoder is confirmed to be a first characteristic image corresponding to the second sample image; the parameters of the third encoder are the same as those of the second encoder; inputting a second mask image corresponding to the second sample image into a fourth encoder included in the comparison branch, and confirming that the output of the fourth encoder is a second characteristic image corresponding to the second sample image; the parameters of the fourth encoder are the same as those of the second encoder; and carrying out feature fusion on the first feature image and the second feature image to obtain a first fusion feature. Wherein, a second mask image corresponding to the second sample image may be obtained based on step S101.
In specific implementation, the device can perform pixel-by-pixel difference on the first characteristic image and the second characteristic image to obtain a first difference characteristic; obtaining the first fused feature based on the first feature image and the first difference feature (e.g., summing the first feature image and the first difference feature pixel by pixel); the device may further perform feature fusion on the first feature image and the second feature image based on other feature fusion methods in the related art, and the disclosure is not particularly limited.
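A minimal sketch of this pixel-by-pixel difference fusion (the tensor shapes below are illustrative assumptions; any pair of same-shape feature maps is handled the same way):

```python
import torch

def difference_attention_fusion(cls_feat, ctrl_feat):
    """Fuse the classification-branch feature with the comparison-branch feature.

    The pixel-by-pixel difference highlights where the input deviates from the healthy
    reference; adding it back to the classification feature enhances those responses.
    """
    diff = cls_feat - ctrl_feat        # first difference feature (pixel-by-pixel)
    return cls_feat + diff             # first fusion feature

fused = difference_attention_fusion(torch.randn(2, 196, 768), torch.randn(2, 196, 768))
print(fused.shape)  # torch.Size([2, 196, 768])
```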
Step S104, inputting the first fusion characteristic into a full-connection layer classifier included in the classification branch, and confirming that the output of the full-connection layer classifier is a lung prediction classification result corresponding to the second sample image.
In some embodiments, the fully connected layer classifier includes an average pooling layer and a fully connected layer; and the first device inputs the first fusion characteristic into the full-connection layer classifier to obtain a prediction classification result corresponding to the second sample image.
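A minimal sketch of such a full-connection layer classifier head; the embedding size and the number of classes are illustrative assumptions:

```python
import torch.nn as nn

class FullyConnectedClassifier(nn.Module):
    """Average-pool the fused feature tokens, then classify with a fully connected layer."""
    def __init__(self, embed_dim=768, num_classes=14):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                               # x: (B, N, D) fused feature
        x = self.pool(x.transpose(1, 2)).squeeze(-1)    # (B, D)
        return self.fc(x)                               # (B, num_classes)
```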
Step S105, adjusting parameters of the classification branch based on the lung labeling classification result of the second sample image and the lung prediction classification result of the second sample image.
In some embodiments, the first means determines a cross entropy loss based on a lung labeling classification result of the second sample image and a lung prediction classification result of the second sample image; and adjusting parameters of a third encoder included in the classification branch and parameters of the full-connection layer classifier based on the cross entropy loss. The lung labeling classification result and the lung prediction classification result may be lung diseases corresponding to the second sample image.
In some alternative embodiments, the lung labeling classification result includes an identification value for each sub-classification result, where 0 may indicate that the disease (sub-classification result) is absent and 1 indicates that the disease (sub-classification result) is present. For example, the lung labeling classification result may be a 1-dimensional vector whose number of elements is the total number of sub-classification results (disease types), with 1 or 0 indicating whether the corresponding disease type is present.
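A hedged sketch of one downstream fine-tuning step, assuming PyTorch modules named third_encoder, fourth_encoder and classifier (these names are assumptions) and the multi-hot label vector described above; the multi-label binary cross-entropy follows the loss variant mentioned later in this description:

```python
import torch
import torch.nn.functional as F

def train_classification_branch_step(third_encoder, fourth_encoder, classifier,
                                     optimizer, sample_image, mask_image, label):
    """One downstream training step: only the classification branch (third encoder
    plus fully connected classifier) is updated; the comparison branch (fourth
    encoder) is kept frozen."""
    fourth_encoder.eval()
    with torch.no_grad():
        ctrl_feat = fourth_encoder(mask_image)       # second feature image (comparison branch)
    cls_feat = third_encoder(sample_image)           # first feature image (classification branch)
    fused = cls_feat + (cls_feat - ctrl_feat)        # first fusion feature (difference attention)
    logits = classifier(fused)                       # lung prediction classification result
    # multi-hot label vector: 1 = disease present, 0 = absent (per sub-classification result)
    loss = F.binary_cross_entropy_with_logits(logits, label.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```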
In this way, with the training method of the image classification model provided by the embodiments of the present disclosure, a feature-difference attention mechanism is provided through the encoder of the trained cxrMAE model (the second encoder) to enhance the potential disease features in the feature map of the input chest radiograph (namely, the feature fusion described above) and assist the model in diagnosis. Meanwhile, the image classification model also has stronger diagnostic interpretability, in that the healthy feature vector encoded by the second encoder (the fourth encoder in the comparison branch) and the healthy pattern of the input chest X-ray image recovered by the cxrMAE decoder during pre-training can be used as the basis for the classification given by the image classification model.
Fig. 3 is a schematic flow chart of another alternative method for training an image classification model according to an embodiment of the disclosure, and will be described according to the steps.
Step S201, a training set is acquired.
In some embodiments, the first device may process the sample image based on the trained lung field segmentation model to obtain a lung field region of the sample image, and replace the lung field region based on a mask to obtain a mask region; correspondingly, a non-mask area in the sample image can be obtained, and the mask area and the non-mask area of the sample image are combined to obtain a mask image; or the lung field region can be obtained after the sample image is processed based on the lung field segmentation model, the lung field region is processed based on the mask, and the processed image is confirmed to be the mask image.
Fig. 4 shows a schematic diagram of lung field segmentation provided by an embodiment of the present disclosure.
As shown in fig. 4, the sample image is input to the lung field segmentation model, and the lung field region is represented by a mask and the non-lung field region is normally represented in the output of the obtained lung field segmentation model. The mask region has the same shape as the lung field region.
In other embodiments, the first device may further perform a segmentation operation on the mask image (the lung field area is all the mask) and divide the mask image into a plurality of sub-images with consistent shapes and sizes.
Fig. 5 shows a schematic of division of an image provided by an embodiment of the present disclosure. As shown in fig. 5, the image is divided into 16 sub-images (patch) of uniform shape and size; it should be understood that fig. 5 is only illustrative, and in implementation, the mask image may be divided into more than 16 sub-images, so that the lung field area (mask) may be divided into a plurality of sub-images, so as to facilitate the replacement based on the sub-images in the first patch library in the later stage.
Fig. 6 shows an alternative schematic diagram of validating a first mask image provided by an embodiment of the present disclosure.
In some embodiments, the device segments out lung field areas and replaces all the lung field areas with masks, after obtaining a mask image, the mask image is divided into at least one sub-image, and sub-images belonging to the lung field areas in the mask image are replaced based on the sub-images in the first patch library, and after replacement, a first mask image is obtained.
Specifically, a random replacement mode may be adopted during replacement; alternatively, the sub-images (patches) that do not contain any information (i.e., whose pixel sum is 0) in the mask image are identified in advance, the index information (index) of these sub-images is recorded in a list, the order of the index information in the list is shuffled, and the first len(List_index)/λ index entries are taken out (this number is the first threshold), where λ is an integer greater than 1 and may be set to 2, 3, etc. (2 is selected in the present disclosure); the sub-images in the mask image corresponding to these first len(List_index)/λ index entries are then replaced based on the sub-images in the first patch library. Optionally, the image in the first patch library is divided in the same manner as the mask image and marked with index number information, and the substitution is performed based on the index number information (for example, the sub-image with index number 1 in the mask image is replaced by the sub-image with index number 1 in the first patch library); after the first mask image is generated, the first encoder is trained based on the first mask image. Here, List_index denotes the list of index information (index) of the sub-images whose mask pixel sum is 0, and len(List_index) denotes the length of that list.
Specifically, the image in the first patch library is at least one healthy chest X-ray image; the sub-image in the first patch library may be obtained by dividing any healthy chest X-ray image in the first patch library according to the dividing manner of the mask image (for example, the mask image is divided according to 20×30, and then the healthy chest X-ray image is also divided according to 20×30).
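A small sketch of how a library image could be divided with the same grid as the mask image and indexed, so that replacement can be performed by index number; the 20×30 grid mirrors the example above and the dictionary representation is an assumption:

```python
import numpy as np

def index_library_image(library_image, grid=(20, 30)):
    """Divide a healthy chest X-ray from the first patch library into a rows x cols grid
    and map each index number to its sub-image, so a masked sub-image can be replaced
    by the library sub-image carrying the same index number."""
    rows, cols = grid
    h, w = library_image.shape
    ph, pw = h // rows, w // cols
    patches = {}
    for r in range(rows):
        for c in range(cols):
            patches[r * cols + c] = library_image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
    return patches

library = index_library_image(np.zeros((600, 600)))
print(len(library))  # 600 sub-images, indexed 0..599
```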
At the time of replacement, the sub-images in the first patch library replace only the sub-images in the mask image whose pixel sum is 0 and, as shown in fig. 6, do not replace sub-images whose pixel sum is not 0 (sub-images that lie entirely in the non-lung-field area, or that are only partly lung field).
As shown in fig. 6, the left image is the input first sample image (a healthy chest X-ray image), and the right image is the completion result after the lung field area has been replaced with the mask and 1/2 of the masked sub-images have been replaced with healthy patches (sub-images from the first patch library). Comparing the two images, it can be found that sub-images with the same index information have similar chest structural tissue; for example, patch No. 1 in the figure contains clavicle and rib structures, patch No. 2 contains rib structures, and patch No. 3 contains lung hilum and rib structural tissue.
Therefore, compared with directly replacing the lung field region with the mask to generate the first mask image, this provides extra auxiliary information for the pre-training task without introducing extra labels, accelerates model convergence, and helps the cxrMAE model (the first encoder) fill in the lung field region of the original chest radiograph better. Replacing part of the mask image with sub-images from the first patch library establishes associations between different healthy chest radiographs, so that the structural tissue information shared by healthy chest radiographs is better learned; the result serves as a prior-knowledge model of the healthy-chest lung field tissue structure with better generalization performance and is applied to downstream tasks. For the same healthy chest radiograph, multiple possible latent-space feature vectors and filling results can be obtained by substituting different mask images.
In some embodiments, the lung field region of each sample image is not uniform in size and thus the masked region is input together with the non-masked region into the cxrMAE (image classification model) for feature extraction.
After the training set is acquired, the first encoder is trained based on the images in the training set to obtain the pre-training weights of the first encoder (i.e., the second encoder and/or the parameters of the second encoder). The present disclosure then proposes a new training method for the image classification model based on a cross-model attention mechanism: the feature vector of a possible healthy chest structure that the healthy-chest encoder (second encoder) of cxrMAE outputs for the input chest radiograph is used as a reference for the image classification model, the difference between the healthy feature and the original feature is used to mine the features of potential lesion areas in the input image, and the classification performance of the image classification model is enhanced by enhancing the features of this difference part.
Step S202, training the first encoder.
FIG. 7 illustrates an alternative schematic diagram of an image classification model provided by embodiments of the present disclosure. As shown in fig. 7, the image classification model includes a first encoder and a decoder. It should be noted that in fig. 7 the input image is a mask image in which the lung field area is replaced by the mask; those skilled in the art should understand that, after the lung field area obtained in step S201 has been replaced by the mask, the input image may also be a first mask image obtained by replacing the lung-field sub-images whose pixel sum is 0 with sub-images from the first patch library. The input image in fig. 7 is merely used as an example to illustrate the training process of the first encoder and is not intended to limit the disclosure.
In some embodiments, as shown in fig. 7, the first apparatus divides the first mask image into image blocks (divided images) of a preset image block size (patch size) without overlapping, the number of image blocks being the size of the input image divided by the size of an image block. The divided image blocks are subjected to dimension conversion through a first full-connection layer (Patch Embedding layer), the dimension of each image block being converted from m×n to 1×(m·n). Each image block is then input into the first encoder, for which ViT-Base or ViT-Large may be selected; the at least one characteristic image output by the first encoder is input into the decoder after layer normalization, and finally the pixel values in each image block of the first sample image are regressed through the second full-connection layer to obtain the first reconstructed image.
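A minimal sketch of this pre-training pipeline, assuming PyTorch; nn.TransformerEncoder stands in for the ViT-Base/ViT-Large encoder, and positional embeddings, masking bookkeeping and other ViT details are omitted for brevity:

```python
import torch
import torch.nn as nn

class CxrMAESketch(nn.Module):
    """Illustrative pipeline: patch embedding FC -> encoder -> layer norm -> decoder
    -> FC regressing the pixel values of each image block."""
    def __init__(self, patch_size=16, in_chans=1, embed_dim=768, dec_dim=512):
        super().__init__()
        patch_dim = in_chans * patch_size * patch_size           # m*n pixels per block
        self.patch_embed = nn.Linear(patch_dim, embed_dim)       # first full-connection layer
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=12, batch_first=True), num_layers=4)
        self.norm = nn.LayerNorm(embed_dim)
        self.decoder = nn.Sequential(nn.Linear(embed_dim, dec_dim), nn.GELU())
        self.pixel_head = nn.Linear(dec_dim, patch_dim)          # second full-connection layer

    def forward(self, patches):                  # patches: (B, N, patch_dim) from the mask image
        tokens = self.patch_embed(patches)       # 1 x (m*n) -> embed_dim per block
        feats = self.norm(self.encoder(tokens))  # feature images output by the encoder
        return self.pixel_head(self.decoder(feats))   # regressed pixel values per block

out = CxrMAESketch()(torch.randn(2, 196, 256))
print(out.shape)  # torch.Size([2, 196, 256])
```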
The first sub-loss value corresponding to the first encoder is the sum of the squared L2 distances between the image blocks with the mask before and after reconstruction, and the specific calculation formula is as follows:

Loss = Σ_{i=1}^{N} || P_{i,pred} - P_{i,target} ||_2^2

where N is the total number of image blocks into which the first mask image is partitioned, P_{i,pred} is the i-th block (e.g., the a-th row and b-th column) of the first reconstructed image, and P_{i,target} is the i-th block (e.g., the a-th row and b-th column) of the first sample image.
And adjusting parameters of the first encoder and/or decoder based on the first sub-loss value, wherein the first device can also repeatedly execute step S202, namely repeatedly train the first encoder and/or decoder until the sub-loss value meets a first condition, confirm that the training of the first encoder is completed, and confirm that the first encoder with the completed training is a second encoder. The first condition may be that the sub-loss value is smaller than a preset threshold, or the sub-loss value converges, or other conditions set based on actual requirements or experimental results, which are not specifically limited in the disclosure.
Step S203, training the comparison branch and the classification branch included in the image classification model.
Fig. 8 shows another alternative schematic diagram of an image classification model provided by an embodiment of the present disclosure.
As shown in fig. 8, the image classification model includes a comparison branch and a classification branch. The comparison branch includes a fourth encoder (whose parameters are the same as those of the second encoder) and, optionally, a decoder into which the features output by the fourth encoder are input to obtain a reconstructed image; the classification branch includes a third encoder (whose parameters are the same as those of the second encoder) and a full-connection layer classifier. The comparison branch provides a second characteristic image that the classification branch can use as a reference; by comparing the second characteristic image with the first characteristic image output by the classification branch, the abnormal-value region caused by a lesion in the first characteristic image of the classification branch is obtained. The differential attention mechanism mainly uses the difference between the characteristic images output by the two branches to enhance the potential disease features of the classification branch, so as to improve the classification performance (or diagnostic performance) of the image classification model for the disease.
In some embodiments, after the first device trains the second encoder, the second encoder is applied to a downstream classification task, specifically, the second sample image (original chest X-ray image) is input into a third encoder included in the classification branch, and output of the third encoder is confirmed to be a first characteristic image corresponding to the second sample image; the parameters of the third encoder are the same as those of the second encoder; inputting a second mask image corresponding to the second sample image into a fourth encoder included in the comparison branch, and confirming that the output of the fourth encoder is a second characteristic image corresponding to the second sample image; the parameters of the fourth encoder are the same as those of the second encoder; and carrying out feature fusion on the first feature image and the second feature image to obtain a first fusion feature. Wherein, a second mask image corresponding to the second sample image may be obtained based on step S201.
In specific implementation, the device can perform pixel-by-pixel difference on the first characteristic image and the second characteristic image to obtain a first difference characteristic; obtaining the first fusion feature based on the first feature image and the first difference feature; the device may further perform feature fusion on the first feature image and the second feature image based on other feature fusion methods in the related art, and the disclosure is not particularly limited.
In some embodiments, the fully connected layer classifier includes an average pooling layer and a fully connected layer; and the first device inputs the first fusion characteristic into the full-connection layer classifier to obtain a prediction classification result corresponding to the second sample image.
In some embodiments, the first means determines a weighted binary cross-entropy loss or a multi-label loss function based on the lung labeling classification result of the second sample image and the lung prediction classification result of the second sample image; based on the loss function, the parameters of the third encoder included in the classification branch and the parameters of the full-connection layer classifier are adjusted by gradient back-propagation, while the parameters of the fourth encoder in the comparison branch remain unchanged.
Optionally, the ViT model is ViT-base, the size of the input image is 224, and the size of the segmented sub-image is 16 when training the first encoder, the classification branch and the comparison branch.
In this way, the training method of the image classification model provided by the embodiments of the present disclosure provides a mask-replacement representation to accelerate convergence of the pre-training model (the first encoder) and to mine the feature-space representation of the structural tissue shared among healthy patients, which is then used as prior knowledge to assist the downstream image classification model (the chest disease diagnosis model). A feature-difference attention mechanism is provided through the trained encoder of the cxrMAE model (the second encoder) to enhance the potential disease features in the feature images output by the two branches and assist the model in diagnosis. Meanwhile, the image classification model has stronger diagnostic interpretability, in that the healthy feature vector encoded by the cxrMAE encoder, and the healthy pattern of the input chest radiograph that the cxrMAE decoder can restore as during pre-training, can be used as the basis of the classification given by the image classification model; specifically, a decoder may be connected after the fourth encoder of the comparison branch, and the output of the decoder is confirmed as the healthy chest X-ray image corresponding to the input image. The parameters of the decoder may be determined according to step S102 or step S201.
Fig. 9 is a schematic flowchart of an alternative image classification method according to an embodiment of the disclosure, and the description will be made according to the steps.
Step S301, inputting a first image to be classified into a classification branch included in the image classification model, and confirming that the output of the classification branch is a classification result of the first image to be classified.
In some embodiments, the image classification device (hereinafter referred to as the second device) inputs the first image to be classified into the third encoder included in the classification branch, and confirms that the output of the third encoder is the feature image corresponding to the first image to be classified; the feature image corresponding to the first image to be classified is then input into the full-connection layer classifier included in the classification branch, and the output of the full-connection layer classifier is confirmed as the classification result of the first image to be classified.
Further, the second apparatus may also obtain a comparison image corresponding to the first image to be classified based on the comparison branch included in the image classification model, which may specifically include:
step S302, confirming a third mask image corresponding to the first image to be classified; and inputting the third mask image into a comparison branch included in the image classification model, and confirming that the output of the comparison branch is a comparison image corresponding to the first image to be classified.
In some embodiments, the second apparatus may acquire the third mask image corresponding to the first image to be classified with reference to step S101 or step S201; input the third mask image into the fourth encoder included in the comparison branch, and confirm that the output of the fourth encoder is a feature image corresponding to the third mask image; and input the feature image into the decoder to acquire the comparison image corresponding to the first image to be classified.
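Combining steps S301 and S302, the inference flow may be sketched as follows; make_mask_image is a hypothetical helper standing in for the mask generation of step S101 or step S201, and the module names are illustrative.

import torch

@torch.no_grad()
def classify_with_contrast(third_encoder, classifier, fourth_encoder, decoder,
                           image, make_mask_image):
    # Classification branch: the feature image of the image to be classified goes
    # directly to the full-connection layer classifier.
    feature = third_encoder(image)
    probabilities = torch.sigmoid(classifier(feature))

    # Comparison branch: the third mask image is encoded and decoded into the comparison image.
    mask_image = make_mask_image(image)
    contrast_image = decoder(fourth_encoder(mask_image))
    return probabilities, contrast_image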
Therefore, with the image classification method provided by the embodiments of the present disclosure, on the one hand the classification result obtained based on the image classification model has higher accuracy, and on the other hand the result has stronger diagnostic interpretability: by connecting the decoder after the fourth encoder of the comparison branch and taking the output of the decoder as the healthy chest X-ray image corresponding to the input image, the doctor or the patient can be assisted in understanding the condition.
Fig. 10 is a schematic diagram showing an alternative structure of a training apparatus for an image classification model according to an embodiment of the present disclosure, which will be described in terms of its units.
In some embodiments, the training apparatus 400 of the image classification model includes a segmentation unit 401, a first training unit 402, a feature fusion unit 403, a second training unit 404, and an adjustment unit 405.
The segmentation unit 401 is configured to confirm a first sample image and a first mask image corresponding to the first sample image; the first mask image is obtained based on a first sample image, and a lung field area in the first mask image comprises a mask;
the first training unit 402 is configured to train a first encoder included in an image classification model based on the first mask image, and confirm that the trained first encoder is a second encoder; the second encoder is applied to a classification branch and a comparison branch included in the image classification model;
the feature fusion unit 403 is configured to input a second sample image into the classification branch, input a second mask image corresponding to the second sample image into the comparison branch, and perform feature fusion on a first feature image output by the classification branch and a second feature image output by the comparison branch to obtain a first fusion feature;
the second training unit 404 is configured to input the first fusion feature into a full-connection layer classifier included in the classification branch, and confirm that an output of the full-connection layer classifier is a lung prediction classification result corresponding to the second sample image;
The adjusting unit 405 is configured to adjust parameters of the classification branch based on a lung labeling classification result of the second sample image and a lung prediction classification result of the second sample image.
The segmentation unit 401 is specifically configured to segment the first sample image, and determine a lung field area and a non-lung field area in the first sample image based on a segmentation result;
and replacing the lung field area of the first sample image based on the mask and/or the sub-image in the first patch library to obtain a first mask image corresponding to the first sample image.
The segmentation unit 401 is specifically configured to replace a lung field area of the first sample image based on the mask, so as to obtain a first mask image corresponding to the first sample image; the lung field areas in the first mask image are all masks;
or, based on the mask and the sub-images in the first patch library, replacing the lung field area of the first sample image to obtain a first mask image corresponding to the first sample image; part of the lung field area in the first mask image is a mask and part is sub-images from the first patch library.
The first training unit 402 is specifically configured to input the first mask image to the first encoder, and confirm that the output of the first encoder is at least one feature image corresponding to the first mask image;
Inputting the at least one characteristic image into a decoder included in the image classification model, and confirming that the output of the decoder is a first reconstructed image corresponding to the first mask image;
and adjusting parameters of the first encoder based on the first reconstructed image and the first sample image, and confirming that the first encoder after adjusting the parameters is a second encoder.
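As an illustration of this pre-training operation, a hedged sketch follows; the mean-squared-error reconstruction loss is an assumption, since the disclosure only requires that the parameters of the first encoder be adjusted based on the first reconstructed image and the first sample image.

import torch
import torch.nn as nn

def pretraining_step(first_encoder, decoder, optimizer, first_mask_image, first_sample_image):
    features = first_encoder(first_mask_image)   # feature image(s) of the masked input
    reconstruction = decoder(features)            # first reconstructed image
    loss = nn.functional.mse_loss(reconstruction, first_sample_image)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()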
The feature fusion unit 403 is specifically configured to input the second sample image to a third encoder included in the classification branch, and confirm that an output of the third encoder is a first feature image corresponding to the second sample image; the third encoder is the second encoder;
inputting a second mask image corresponding to the second sample image into a fourth encoder included in the comparison branch, and confirming that the output of the fourth encoder is a second characteristic image corresponding to the second sample image; the fourth encoder is the second encoder;
and carrying out feature fusion on the first feature image and the second feature image to obtain a first fusion feature.
The feature fusion unit 403 is specifically configured to perform pixel-by-pixel difference on the first feature image and the second feature image to obtain a first difference feature;
The first fusion feature is obtained based on the first feature image and the first difference feature.
The adjusting unit 405 is specifically configured to determine a cross entropy loss based on a lung labeling classification result of the second sample image and a lung prediction classification result of the second sample image;
and adjusting parameters of a third encoder included in the classification branch and parameters of the full-connection layer classifier based on the cross entropy loss.
Fig. 11 is a schematic diagram showing an alternative configuration of an image classification apparatus according to an embodiment of the present disclosure, which will be described in terms of its respective units.
In some embodiments, the image classification apparatus 500 includes a first input unit 501 and a second input unit 502.
The first input unit 501 is configured to input a first image to be classified into a classification branch included in the image classification model, and confirm that an output of the classification branch is a classification result of the first image to be classified;
the second input unit 502 is configured to confirm a third mask image corresponding to the first image to be classified; and inputting the third mask image into a comparison branch included in the image classification model, and confirming that the output of the comparison branch is a comparison image corresponding to the first image to be classified.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device and a readable storage medium.
Fig. 12 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in electronic device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as the training method of the image classification model and/or the image classification method. For example, in some embodiments, these methods may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform these methods by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto; any person skilled in the art can readily conceive of changes or substitutions within the technical scope of the disclosure, and such changes or substitutions are intended to be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (12)

1. A method of training an image classification model, the method comprising:
confirming a first sample image and a first mask image corresponding to the first sample image; the first mask image is obtained based on a first sample image, and a lung field area in the first mask image comprises a mask;
training a first encoder included in an image classification model based on the first mask image, and confirming that the trained first encoder is a second encoder; the second encoder is applied to a classification branch and a comparison branch included in the image classification model;
inputting a second sample image into the classification branch, inputting a second mask image corresponding to the second sample image into the comparison branch, and performing feature fusion on a first feature image output by the classification branch and a second feature image output by the comparison branch to obtain a first fusion feature; wherein the comparison branch includes a fourth encoder configured to output the second feature image based on the second mask image; the parameters of the fourth encoder are the same as those of the second encoder;
inputting the first fusion characteristic into a full-connection layer classifier included in the classification branch, and confirming that the output of the full-connection layer classifier is a lung prediction classification result corresponding to the second sample image;
And adjusting parameters of the classification branch based on the lung labeling classification result of the second sample image and the lung prediction classification result of the second sample image.
2. The method of claim 1, wherein the confirming a first sample image and a first mask image corresponding to the first sample image comprises:
dividing the first sample image, and determining a lung field area and a non-lung field area in the first sample image based on a division result;
and replacing the lung field area of the first sample image based on the mask and/or the sub-image in the first patch library to obtain a first mask image corresponding to the first sample image.
3. The method according to claim 2, wherein the replacing the lung field area of the first sample image based on the mask and/or the sub-images in the first patch library to obtain the first mask image corresponding to the first sample image comprises:
replacing a lung field region of the first sample image based on the mask to obtain a first mask image corresponding to the first sample image; the lung field areas in the first mask image are all masks;
or, based on the mask and the sub-images in the first patch library, replacing the lung field area of the first sample image to obtain a first mask image corresponding to the first sample image; part of the lung field area in the first mask image is a mask and part is sub-images from the first patch library.
4. The method of claim 1, wherein the training a first encoder included in an image classification model based on the first mask image, and confirming that the trained first encoder is a second encoder, comprises:
inputting the first mask image into the first encoder, and confirming that the output of the first encoder is at least one characteristic image corresponding to the first mask image;
inputting the at least one characteristic image into a decoder included in the image classification model, and confirming that the output of the decoder is a first reconstructed image corresponding to the first mask image;
and adjusting parameters of the first encoder based on the first reconstructed image and the first sample image, and confirming that the first encoder after adjusting the parameters is a second encoder.
5. The method according to claim 1, wherein inputting the second sample image into the classification branch, inputting the second mask image corresponding to the second sample image into the comparison branch, and performing feature fusion on the first feature image output by the classification branch and the second feature image output by the comparison branch to obtain a first fusion feature, includes:
Inputting the second sample image into a third encoder included in the classification branch, and confirming that the output of the third encoder is a first characteristic image corresponding to the second sample image; the parameters of the third encoder are the same as those of the second encoder;
and carrying out feature fusion on the first feature image and the second feature image to obtain a first fusion feature.
6. The method according to claim 1 or 5, wherein the performing feature fusion on the first feature image output by the classification branch and the second feature image output by the comparison branch to obtain a first fusion feature comprises:
performing pixel-by-pixel difference on the first characteristic image and the second characteristic image to obtain a first difference characteristic;
the first fusion feature is obtained based on the first feature image and the first difference feature.
7. The method of claim 1, wherein the adjusting the parameters of the classification branch based on the pulmonary labeling classification result of the second sample image and the pulmonary prediction classification result of the second sample image comprises:
determining cross entropy loss based on the lung labeling classification result of the second sample image and the lung prediction classification result of the second sample image;
And adjusting parameters of a third encoder included in the classification branch and parameters of the full-connection layer classifier based on the cross entropy loss.
8. An image classification method, characterized in that it is implemented based on an image classification model trained by the method of any one of claims 1-7, the method comprising:
inputting a first image to be classified into a classification branch included in the image classification model, and confirming that the output of the classification branch is a classification result of the first image to be classified;
and/or, confirming a third mask image corresponding to the first image to be classified; and inputting the third mask image into a comparison branch included in the image classification model, and confirming that the output of the comparison branch is a comparison image corresponding to the first image to be classified.
9. An apparatus for training an image classification model, the apparatus comprising:
a dividing unit configured to confirm a first sample image and a first mask image corresponding to the first sample image; the first mask image is obtained based on a first sample image, and a lung field area in the first mask image comprises a mask;
the first training unit is used for adjusting parameters of a first encoder included in the image classification model based on the first mask image, and confirming that the first encoder after the parameters are adjusted is a second encoder; the second encoder is applied to a classification branch and a comparison branch included in the image classification model;
The feature fusion unit is used for inputting a second sample image into the classification branch, inputting a second mask image corresponding to the second sample image into the comparison branch, and carrying out feature fusion on a first feature image output by the classification branch and a second feature image output by the comparison branch to obtain a first fusion feature; wherein the comparison branch includes a fourth encoder configured to output the second feature image based on the second mask image; the parameters of the fourth encoder are the same as those of the second encoder;
the second training unit is used for inputting the first fusion characteristic into a full-connection layer classifier included in the classification branch, and confirming that the output of the full-connection layer classifier is a lung prediction classification result corresponding to the second sample image;
and the adjusting unit is used for adjusting parameters of the classification branches based on the lung labeling classification result of the second sample image and the lung prediction classification result of the second sample image.
10. An image classification apparatus, characterized in that it is implemented based on an image classification model trained by the method of any one of claims 1-7, the apparatus comprising:
A first input unit, configured to input a first image to be classified into a classification branch included in the image classification model, and confirm that an output of the classification branch is a classification result of the first image to be classified;
a second input unit, configured to confirm a third mask image corresponding to the first image to be classified; and inputting the third mask image into a comparison branch included in the image classification model, and confirming that the output of the comparison branch is a comparison image corresponding to the first image to be classified.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7;
alternatively, the method of claim 8 is performed.
12. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-7;
Alternatively, the method of claim 8 is performed.
CN202211014238.0A 2022-08-23 2022-08-23 Training method and device for image classification model, electronic equipment and storage medium Active CN115187819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211014238.0A CN115187819B (en) 2022-08-23 2022-08-23 Training method and device for image classification model, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115187819A CN115187819A (en) 2022-10-14
CN115187819B true CN115187819B (en) 2023-05-16

Family

ID=83523742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211014238.0A Active CN115187819B (en) 2022-08-23 2022-08-23 Training method and device for image classification model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115187819B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629315B (en) * 2023-05-23 2024-02-20 北京百度网讯科技有限公司 Training method, device, equipment and medium of perception model
CN117437207A (en) * 2023-11-09 2024-01-23 重庆师范大学 Multi-expert fusion chest X-ray image auxiliary diagnosis system and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109903749A (en) * 2019-02-26 2019-06-18 天津大学 The sound identification method of robust is carried out based on key point coding and convolutional neural networks
CN113536003A (en) * 2021-06-08 2021-10-22 支付宝(杭州)信息技术有限公司 Feature extraction model training method, image retrieval method, device and equipment
CN113889090A (en) * 2021-09-29 2022-01-04 北京中科智加科技有限公司 Multi-language recognition model construction and training method based on multi-task learning
CN114648638A (en) * 2022-04-02 2022-06-21 北京百度网讯科技有限公司 Training method of semantic segmentation model, semantic segmentation method and device
CN114882008A (en) * 2022-06-09 2022-08-09 南京工业大学 Pathological image feature-based tumor driving gene differential expression detection algorithm

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874921B (en) * 2015-12-11 2020-12-04 清华大学 Image classification method and device
EP3340624B1 (en) * 2016-12-20 2019-07-03 Axis AB Encoding a privacy masked image
CN108090447A (en) * 2017-12-19 2018-05-29 青岛理工大学 Hyperspectral image classification method and device under double branch's deep structures
WO2020110775A1 (en) * 2018-11-30 2020-06-04 富士フイルム株式会社 Image processing device, image processing method, and program
CN109872328B (en) * 2019-01-25 2021-05-07 腾讯科技(深圳)有限公司 Brain image segmentation method, device and storage medium
CN112669197A (en) * 2019-10-16 2021-04-16 顺丰科技有限公司 Image processing method, image processing device, mobile terminal and storage medium
CN111311563B (en) * 2020-02-10 2023-06-09 北京工业大学 Image tampering detection method based on multi-domain feature fusion
CN112801164B (en) * 2021-01-22 2024-02-13 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of target detection model
CN114066900A (en) * 2021-11-12 2022-02-18 北京百度网讯科技有限公司 Image segmentation method and device, electronic equipment and storage medium
CN114119979A (en) * 2021-12-06 2022-03-01 西安电子科技大学 Fine-grained image classification method based on segmentation mask and self-attention neural network
CN114387283A (en) * 2021-12-21 2022-04-22 山东众阳健康科技集团有限公司 Medical image pneumonia region segmentation intelligent diagnosis system
CN114141339B (en) * 2022-01-26 2022-08-05 杭州未名信科科技有限公司 Pathological image classification method, device, equipment and storage medium for membranous nephropathy
CN114897914B (en) * 2022-03-16 2023-07-07 华东师范大学 Semi-supervised CT image segmentation method based on countermeasure training


Also Published As

Publication number Publication date
CN115187819A (en) 2022-10-14

Similar Documents

Publication Publication Date Title
CN115187819B (en) Training method and device for image classification model, electronic equipment and storage medium
Aresta et al. iW-Net: an automatic and minimalistic interactive lung nodule segmentation deep network
CN111429421B (en) Model generation method, medical image segmentation method, device, equipment and medium
CN111325739B (en) Method and device for detecting lung focus and training method of image detection model
Al Arif et al. Shape-aware deep convolutional neural network for vertebrae segmentation
US11929174B2 (en) Machine learning method and apparatus, program, learned model, and discrimination apparatus using multilayer neural network
CN110570426A (en) Joint registration and segmentation of images using deep learning
CN114565763B (en) Image segmentation method, device, apparatus, medium and program product
CN111667459B (en) Medical sign detection method, system, terminal and storage medium based on 3D variable convolution and time sequence feature fusion
US20200380365A1 (en) Learning apparatus, method, and program
CN112070781A (en) Processing method and device of craniocerebral tomography image, storage medium and electronic equipment
CN113538235B (en) Training method and device for image processing model, electronic equipment and storage medium
CN113901909B (en) Video-based target detection method and device, electronic equipment and storage medium
CN112541876A (en) Satellite image processing method, network training method, related device and electronic equipment
CN115409990B (en) Medical image segmentation method, device, equipment and storage medium
Solovyev et al. Bayesian feature pyramid networks for automatic multi-label segmentation of chest X-rays and assessment of cardio-thoratic ratio
CN113782181A (en) CT image-based lung nodule benign and malignant diagnosis method and device
CN116843901A (en) Medical image segmentation model training method and medical image segmentation method
CN113763366B (en) Face changing method, device, equipment and storage medium
CN115294400B (en) Training method and device for image classification model, electronic equipment and storage medium
CN111724360B (en) Lung lobe segmentation method, device and storage medium
CN116245832B (en) Image processing method, device, equipment and storage medium
Tran et al. Deep learning-based inpainting for chest X-ray image
CN116486071A (en) Image blocking feature extraction method, device and storage medium
CN115375706A (en) Image segmentation model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 3011, 2nd Floor, Building A, No. 1092 Jiangnan Road, Nanmingshan Street, Liandu District, Lishui City, Zhejiang Province, 323000

Patentee after: Zhejiang Yizhun Intelligent Technology Co.,Ltd.

Address before: No. 1202-1203, 12 / F, block a, Zhizhen building, No. 7, Zhichun Road, Haidian District, Beijing 100083

Patentee before: Beijing Yizhun Intelligent Technology Co.,Ltd.