CN115984578A - Tandem fusion DenseNet and Transformer skin image feature extraction method - Google Patents

Info

Publication number
CN115984578A
Authority
CN
China
Prior art keywords
transformer
densenet
layer
feature extraction
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211570369.7A
Other languages
Chinese (zh)
Inventor
白雪梅
王帅
张晨洁
史新瑞
赵荟圆
侯聪聪
王澳
师宏锦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Science and Technology filed Critical Changchun University of Science and Technology
Priority to CN202211570369.7A
Publication of CN115984578A
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a skin image feature extraction method that fuses DenseNet and a Transformer in series, and belongs to the field of deep learning image classification. An input picture is preprocessed, converted into a tensor and sent to the DenseNet part for feature extraction to obtain local features of the face; the feature map produced by DenseNet is sent into the Transformer to obtain global features of the face; the global features and the local features are fused to obtain fusion features, and skin image identification is carried out with these fusion features; the feature map output by the Transformer passes through a Layer Normalization layer, an average pooling layer and a fully connected layer, and finally the predicted category and the disease probability are output. The invention makes full use of the skin information contained in the global and local features, thereby improving the accuracy of skin diagnosis and reliably judging the type and probability of skin disease.

Description

Tandem fusion DenseNet and Transformer skin image feature extraction method
Technical Field
The invention relates to the field of deep learning image classification, and in particular to a method for extracting skin disease image features more fully by fusing DenseNet and a Transformer in series.
Background
Skin diseases are common and frequently occurring diseases in medicine, and skin detection technology is receiving more and more attention. Traditional manual diagnosis has a certain subjectivity and cannot meet the detection requirements of complex and varied skin diseases. In recent years, deep learning techniques have been applied in an increasing number of fields, and in many tasks the features obtained by deep learning have proved to be more representative than features constructed by traditional methods.
Research on deep learning has become an application trend. The Convolutional Neural Network (CNN) model has long been the mainstream model in the CV field with the best application prospects; it has gradually become the most widely used model in machine learning and computer vision and has achieved good results. The convolution operation in DenseNet is good at extracting local features but lacks the ability to capture global characteristics; to perceive the global information of an image, convolution layers must be stacked and pooling operations used to enlarge the receptive field. The Transformer, by contrast, has a global and dynamic receptive field, which has broken the monopoly of the CNN in visual representation and achieved better results on image recognition tasks. Feature extraction with deep networks is widely applied to images, speech, video and other domains.
One difficulty facing the field of medical image analysis, including skin diagnosis, is the insufficient amount of high-quality medical image data. When the amount of data is limited, the information contained in each image needs to be extracted more fully. For the auxiliary diagnosis of skin disease images, fully fusing the CNN algorithm with the Transformer improves image processing performance and diagnostic accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a method that fuses DenseNet and a Transformer in series to extract depth features from the eight skin disease classes of the ISIC2019 data set: melanoma, melanocytic nevus, basal cell carcinoma, actinic keratosis, benign keratosis, dermatofibroma, hemangioma and squamous cell carcinoma. DenseNet is good at extracting the local features of an image, the Transformer structure is good at extracting its global features, and a skin image requires attention to local features of the lesion area such as its edges and texture. Key information is therefore first extracted with DenseNet and then analysed globally with the Transformer, so that lesion features can be extracted from the skin image more effectively and the accuracy of auxiliary diagnosis is improved.
In order to achieve this purpose, the invention is realized by adopting the following technical scheme:
Step one, downloading the open-source data set ISIC2019 and compressing all pictures to 448 × 448;
Step two, retaining the first convolution layer, the pooling layer, the first Transition Layer and the first two Dense Block layers of the DenseNet as a local feature extraction module; converting the input picture into a tensor and sending the tensor to the feature extraction module for local feature extraction;
Step three, passing the feature map output by the DenseNet local feature extraction module through a convolution layer with a kernel size of 1 × 1 and 96 convolution kernels to reduce its number of channels, so that it matches the number of input channels required by the first Stage of the Transformer;
Step four, sending the dimension-reduced feature map into the Transformer algorithm for further feature extraction, wherein Swin Transformer-Tiny is selected as the Transformer; the algorithm is divided into 4 Stages, and the number of Swin Transformer Blocks in each Stage is 2, 2, 6 and 2 respectively;
Step five, passing the feature vector extracted by the Transformer through a Layer Normalization layer, a pooling layer and a fully connected layer to output the prediction result of the image classification.
Drawings
FIG. 1 is a flow chart of the use of the algorithm of the present invention.
Fig. 2 is a block diagram of the algorithm proposed by the present invention.
FIG. 3 is an internal structure diagram of the DenseNet Block of the present invention.
FIG. 4 is an internal structure diagram of the Swin Transformer Blocks in the present invention.
Detailed Description
For a further understanding of the invention, reference is made to the following detailed description taken in conjunction with the accompanying drawings. The specific application of the invention is realized through the following steps:
Step one, downloading the open-source data set ISIC2019 of the International Skin Imaging Collaboration, wherein the data set comprises 25331 pictures covering eight skin disease types: melanoma, melanocytic nevus, basal cell carcinoma, actinic keratosis, benign keratosis, dermatofibroma, hemangioma and squamous cell carcinoma. The pictures of each category in the data set are divided into a training set and a testing set in a ratio of 8:2. In order to alleviate the problem of data imbalance and avoid overfitting during training, data enhancement is applied to the skin disease pictures, specifically geometric transformations such as rotation and translation. Since the improved algorithm requires an input picture size of 448 × 448 × 3, the data set is downscaled so that the resolution of each picture becomes 448 × 448 × 3.
Step two, for the detection of skin diseases, more attention needs to be paid to the local characteristics of the skin lesion, such as the edge shape of the lesion and the texture inside the lesion area, while the characteristics of the skin surface outside the lesion area are not considered. The DenseNet algorithm is mainly responsible for extracting these local features. The complete DenseNet is not required as the front end of the Swin Transformer; only the first convolution layer, the pooling layer, the first two Dense Blocks and the first Transition Layer of the DenseNet are needed to form the local feature extraction module. This module extracts a large number of image features and produces a feature map of the image. The core of the convolution layer, whose function is to perform feature extraction on the data, generally consists of multiple convolution kernels. Each convolution kernel is connected to a local area of the previous layer's feature map; this local area is the receptive field of the convolution kernel on the previous layer, and the kernel produces a new feature map through the convolution operation. The computation of a feature map is generally divided into two steps: first, a convolution operation is carried out on the previous layer's data with a convolution kernel, and then a nonlinear function is applied to each result. The typical form of a convolution layer is:
x_j^(l) = f( Σ_{i=1}^{k} x_i^(l-1) * w_ij^(l) + b_j^(l) )    (1)
where x_j^(l) is the j-th feature vector output by layer l, f is the excitation function, k is the number of feature maps of layer l-1, x_i^(l-1) is the i-th feature map output by layer l-1, * is the convolution operation, w_ij^(l) is the weight matrix of the j-th convolution kernel of layer l, and b_j^(l) is the bias of the j-th convolution kernel of layer l.
First, the 448 × 448 × 3 image is passed through the first convolution layer of the DenseNet, which has a kernel size of 7 × 7, 64 convolution kernels and a stride of 2, producing a 224 × 224 × 64 feature map. This output is passed through a max pooling layer of size 3 × 3 with stride 2, giving a 112 × 112 × 64 feature map. It then enters the first Dense Block module, which contains the optional Bottleneck layers; the input of each layer is the concatenation of the outputs of all previous layers along the channel dimension. The first Dense Block contains six groups of 1 × 1 and 3 × 3 convolution layers. The 112 × 112 × 64 feature map entering the first Dense Block passes through the Batch Normalization layer and the ReLU layer of the block, which leave its dimensions unchanged at 112 × 112 × 64. It then enters the optional Bottleneck layer, which uses 1 × 1 convolution kernels to counteract the excessive depth caused by concatenating feature maps along the channel dimension, and outputs 112 × 112 × 128. After a convolution layer with a kernel size of 3 × 3 and 32 convolution kernels, the output after the first Dense Block is 112 × 112 × 32.
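A minimal sketch of a single layer inside a Dense Block as described above (Batch Normalization, ReLU, 1 × 1 Bottleneck with 128 kernels, Batch Normalization, ReLU, 3 × 3 convolution with 32 kernels), together with the channel concatenation that makes each layer's input the outputs of all previous layers. The class name and the example input are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_channels, growth_rate=32, bottleneck_width=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, bottleneck_width, kernel_size=1, bias=False),   # Bottleneck
            nn.BatchNorm2d(bottleneck_width), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_width, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Concatenate the new feature maps with the input on the channel axis, so the next
        # layer's input is the concatenation of the outputs of all previous layers.
        return torch.cat([x, self.body(x)], dim=1)

x = torch.randn(1, 64, 112, 112)        # feature map after the first convolution and max pooling
print(DenseLayer(64)(x).shape)          # torch.Size([1, 96, 112, 112]): 64 + growth rate 32
```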
The feature map then passes through the Transition Block module. This module sits between two Dense Blocks, serves as the connection between them and consists of a convolution layer and a pooling layer. Its input is the 112 × 112 × 32 feature map output by the previous Dense Block. The convolution in the Transition Block is an optional layer that uses multiple convolution kernels of size 1 × 1; the number of parameters can be reduced by compressing the channels with a preset compression coefficient θ (between 0 and 1), giving an output of 112 × 112 × (32 × θ). The feature map then passes through the second Dense Block, which contains twelve groups of 1 × 1 and 3 × 3 convolutions, and a 56 × 56 × 32 feature map is obtained. The feature tensor obtained after DenseNet is calculated as follows:
M = D2(T(D1(P3(Conv7(Z)))))    (2)
where Z is the input picture tensor, Conv7 is the first 7 × 7 convolution layer, P3 is the 3 × 3 pooling layer, D1 is the first Dense Block layer, T is the Transition Layer, and D2 is the second Dense Block layer.
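The Transition Layer T in equation (2) can be sketched as follows: a 1 × 1 convolution compresses the number of channels by the coefficient θ and a 2 × 2 average pooling halves the height and width. This is an assumed, standard DenseNet-style implementation; the function name and the choice θ = 0.5 are illustrative.

```python
import torch
import torch.nn as nn

def transition_layer(in_channels, theta=0.5):
    out_channels = int(in_channels * theta)            # compression coefficient theta in (0, 1)
    return nn.Sequential(
        nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),  # 1 x 1 compression
        nn.AvgPool2d(kernel_size=2, stride=2),                            # halves height and width
    )

x = torch.randn(1, 32, 112, 112)       # channel count taken from the description above
print(transition_layer(32)(x).shape)   # torch.Size([1, 16, 56, 56]) with theta = 0.5
```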
And step three, the feature map obtained from the local feature extraction module has a height and width of 56, but its depth does not meet the requirement of the Transformer. It is therefore passed through a convolution layer with a kernel size of 1 × 1 and 96 convolution kernels, so that the number of channels of the feature map becomes 96. To counteract the change in the data distribution of the intermediate layers during training, thereby preventing vanishing or exploding gradients and speeding up training, the feature map then passes through a Batch Normalization layer. Finally, the feature vector is sent into the Transformer for further feature extraction.
The feature vector fed into the Transformer is:
C = BN(Conv1(M))    (3)
where M is the feature tensor obtained after DenseNet, Conv1 is the convolution layer containing 96 convolution kernels of size 1 × 1, and BN is the Batch Normalization layer.
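Equation (3) corresponds to a small adapter module, sketched below under the assumption that the DenseNet prefix outputs 512 channels (the value for a DenseNet-121 truncated after its second Dense Block; the patent's own channel count may differ).

```python
import torch
import torch.nn as nn

in_channels = 512    # depth of M; depends on the DenseNet prefix actually used
adapter = nn.Sequential(
    nn.Conv2d(in_channels, 96, kernel_size=1),   # Conv1: 96 kernels of size 1 x 1
    nn.BatchNorm2d(96),                          # BN
)

M = torch.randn(1, in_channels, 56, 56)
C = adapter(M)
print(C.shape)                                   # torch.Size([1, 96, 56, 56])
```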
And step four, further feature extraction in the Transformer. The Swin Transformer algorithm is selected as the Transformer. The algorithm is divided into 4 Stages, and the number of Transformer Blocks in each Stage is 2, 2, 6 and 2 respectively. The depths of the input vectors of the four Stages are 96, 192, 384 and 768 respectively. The first Stage passes the 56 × 56 × 96 feature map through its Transformer Blocks, which contain paired W-MSA and SW-MSA modules performing window self-attention so that the weights focus more on the skin lesion. The Blocks do not change the dimensions of the feature map; after Patch Merging, the height and width of the feature map are halved and its depth is doubled, giving a feature map of size 28 × 28 × 192. After Stage 3 and Stage 4 of the Transformer, the final feature map has a size of 7 × 7 × 768.
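The Patch Merging operation that produces the 56 → 28 → 14 → 7 and 96 → 192 → 384 → 768 progression described above can be sketched as follows (an assumed implementation following the Swin Transformer design, not code from the patent): 2 × 2 neighbouring positions are concatenated on the channel axis and linearly projected from 4C to 2C channels.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                       # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                # the four 2 x 2 neighbours
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)

x = torch.randn(1, 56, 56, 96)                   # output of the first Swin Stage
print(PatchMerging(96)(x).shape)                 # torch.Size([1, 28, 28, 192])
```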
And step five, for the classification task, the feature map output by the Transformer further passes through a Layer Normalization layer, an average pooling layer and a fully connected layer, and finally the predicted category is output. The fully connected layer is similar to a conventional neural network in that each neuron is connected to all neurons of the previous layer, so the fully connected layer contains the global information of the data. Each neuron of the fully connected layer is connected to a Softmax function, which is usually used in the output layer of classification problems and expresses the prediction result in the form of a probability. Its formula is:
S_m = e^(Z_m) / Σ_{c=1}^{C} e^(Z_c)    (4)
where S_m is the probability obtained by converting the output value Z_m of the m-th neuron through the Softmax function, C is the number of neurons, and Z_c is the output value of the c-th neuron.
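A short sketch of the step-five head and of equation (4), assuming the 7 × 7 × 768 Transformer output has been flattened into 49 tokens; the layer sizes follow the description above and the Softmax line reproduces equation (4).

```python
import torch
import torch.nn as nn

norm = nn.LayerNorm(768)
fc = nn.Linear(768, 8)                     # 8 skin disease classes

feat = torch.randn(1, 7 * 7, 768)          # Transformer output as 49 tokens of depth 768
z = fc(norm(feat).mean(dim=1))             # LayerNorm, average pooling over tokens, then FC layer
probs = torch.softmax(z, dim=-1)           # S_m = exp(Z_m) / sum_c exp(Z_c)
print(probs.sum())                         # approximately 1: the outputs form a probability distribution
```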

Claims (6)

1. A tandem fusion DenseNet and Transformer skin image feature extraction method, characterized by comprising the following steps:
step one, downloading the open-source data set ISIC2019, and compressing all pictures to 448 × 448;
step two, retaining the first convolution layer, the pooling layer, the first Transition Layer and the first two Dense Block layers of the DenseNet as a local feature extraction module, converting the input picture into a tensor, and sending the tensor to the feature extraction module for local feature extraction;
step three, passing the feature map output by the DenseNet local feature extraction module through a convolution layer with a kernel size of 1 × 1 and 96 convolution kernels to reduce its number of channels, so that it matches the number of input channels required by the first Stage of the Transformer;
step four, sending the dimension-reduced feature map into the Transformer algorithm for further feature extraction, wherein Swin Transformer-Tiny is selected as the Transformer, the algorithm is divided into 4 Stages, and the number of Blocks in each Stage is 2, 2, 6 and 2 respectively;
step five, passing the feature vector extracted by the Transformer through a Layer Normalization (LN) layer, a pooling layer and a fully connected layer to output the prediction result of the image classification.
2. The method for extracting skin image features through tandem fusion of DenseNet and Transformer as claimed in claim 1, wherein: first, the input picture is converted into a tensor and sent to the DenseNet part for feature extraction, and the DenseNet part is mainly responsible for extracting local features.
3. The method for extracting skin image features by fusing DenseNet and Transformer in series according to claim 1, wherein: the feature map obtained after DenseNet has a height and width of 56, but its depth does not meet the requirement of the Transformer; therefore, the feature map is passed through a convolution layer with a kernel size of 1 × 1 and 96 convolution kernels so that its number of channels becomes 96, and the feature vector is then sent into the Transformer for further feature extraction.
4. The method for extracting skin image features by fusing DenseNet and Transformer in series according to claim 1, wherein: the feature tensor sent into the Transformer after DenseNet is M = δ(F1(O)), where F1 is the 1 × 1 convolution transformation, O is the feature map output by the CNN end, and δ is an activation function.
5. The method for extracting skin image features by fusing DenseNet and Transformer in series according to claim 1, wherein: for the classification task, the feature map output by the Transformer passes through a Layer Normalization layer, an average pooling layer and a fully connected layer, and finally the prediction category is output.
6. The method for extracting skin image features by fusing DenseNet and Transformer in series according to claim 1, wherein: during prediction, the input picture is processed to remove the non-face parts; the processed picture is uniformly cut to a size of 448 × 448, converted into tensors and sent into the network model in turn for prediction; finally, the type and the disease probability of the skin disease contained in the picture are obtained.
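A hedged sketch of the prediction procedure in this claim: the face-cropped picture is brought to 448 × 448 (a resize is used here where the claim speaks of uniformly cutting the picture), converted into a tensor and sent through the trained network, and the predicted skin disease type is reported together with its probability. The model object, the file path and the class-name list are assumptions for illustration.

```python
import torch
import torchvision.transforms as T
from PIL import Image

classes = ["melanoma", "melanocytic nevus", "basal cell carcinoma", "actinic keratosis",
           "benign keratosis", "dermatofibroma", "hemangioma", "squamous cell carcinoma"]
preprocess = T.Compose([T.Resize((448, 448)), T.ToTensor()])

def predict(model, image_path):
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=-1)[0]
    idx = int(probs.argmax())
    return classes[idx], float(probs[idx])   # skin disease type and its disease probability
```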
CN202211570369.7A 2022-12-12 2022-12-12 Tandem fusion DenseNet and Transformer skin image feature extraction method Pending CN115984578A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211570369.7A CN115984578A (en) 2022-12-12 2022-12-12 Tandem fusion DenseNet and Transformer skin image feature extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211570369.7A CN115984578A (en) 2022-12-12 2022-12-12 Tandem fusion DenseNet and Transformer skin image feature extraction method

Publications (1)

Publication Number Publication Date
CN115984578A true CN115984578A (en) 2023-04-18

Family

ID=85965665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211570369.7A Pending CN115984578A (en) 2022-12-12 2022-12-12 Tandem fusion DenseNet and Transformer skin image feature extraction method

Country Status (1)

Country Link
CN (1) CN115984578A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117636064A (en) * 2023-12-21 2024-03-01 浙江大学 Intelligent neuroblastoma classification system based on pathological sections of children
CN117636064B (en) * 2023-12-21 2024-05-28 浙江大学 Intelligent neuroblastoma classification system based on pathological sections of children

Similar Documents

Publication Publication Date Title
CN108491849B (en) Hyperspectral image classification method based on three-dimensional dense connection convolution neural network
CN111191660B (en) Colon cancer pathology image classification method based on multi-channel collaborative capsule network
CN112819910B (en) Hyperspectral image reconstruction method based on double-ghost attention machine mechanism network
CN112070158B (en) Facial flaw detection method based on convolutional neural network and bilateral filtering
CN113034505B (en) Glandular cell image segmentation method and glandular cell image segmentation device based on edge perception network
CN111738363B (en) Alzheimer disease classification method based on improved 3D CNN network
CN109190511B (en) Hyperspectral classification method based on local and structural constraint low-rank representation
CN113112416B (en) Semantic-guided face image restoration method
Karthiga et al. Transfer learning based breast cancer classification using one-hot encoding technique
WO2024040828A1 (en) Method and device for fusion and classification of remote sensing hyperspectral image and laser radar image
CN115984578A (en) Tandem fusion DenseNet and Transformer skin image feature extraction method
CN116310329A (en) Skin lesion image segmentation method based on lightweight multi-scale UNet
Qian et al. Classification of rice seed variety using point cloud data combined with deep learning
Yang et al. An image super-resolution deep learning network based on multi-level feature extraction module
CN117576467B (en) Crop disease image identification method integrating frequency domain and spatial domain information
CN110991554A (en) Improved PCA (principal component analysis) -based deep network image classification method
Patil et al. Expression invariant face recognition using semidecimated DWT, Patch-LDSMT, feature and score level fusion
Li et al. The effectiveness of image augmentation in breast cancer type classification using deep learning
CN117315481A (en) Hyperspectral image classification method based on spectrum-space self-attention and transducer network
CN116823868A (en) Melanin tumor image segmentation method
CN115100509B (en) Image identification method and system based on multi-branch block-level attention enhancement network
CN113837263B (en) Gesture image classification method based on feature fusion attention module and feature selection
CN113269684B (en) Hyperspectral image restoration method based on single RGB image and unsupervised learning
CN112991194A (en) Infrared thermal wave image deblurring method based on depth residual error network
Zeng et al. Aircraft segmentation from remote sensing image by transferring natual image trained forground extraction CNN model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination