CN115458174A - Method for constructing intelligent diagnosis model of diabetic retinopathy - Google Patents


Info

Publication number
CN115458174A
Authority
CN
China
Prior art keywords: task, encoder, network, training, model
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211142980.XA
Other languages
Chinese (zh)
Inventor
欧阳继红
逯晨阳
刘思光
郭泽琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Jilin University
Priority to: CN202211142980.XA
Publication of: CN115458174A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50: ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for simulation or modelling of medical disorders
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00: ICT specially adapted for the handling or processing of medical images


Abstract

The invention discloses a method for constructing an intelligent diagnosis model of diabetic retinopathy which, aimed at the staged nature of DR, combines self-supervised contrastive learning with supervised deep learning to improve the training process of a deep neural network (DNN), so that the model can mine prior knowledge without labeled data to assist the intelligent DR diagnosis model and be used for early DR detection. The construction method comprises data preprocessing, a pretext task and a downstream task. The preprocessing applies noise reduction and normalization to the images to improve data quality. The pretext task is an unsupervised model pre-training process used to mine prior knowledge from the unlabeled data set to assist model training. The downstream task is a supervised fine-tuning process for the classification model, which uses the prior knowledge to improve the quality of training and thereby the DR classification performance of the model when trained with only a small amount of labeled data.

Description

Method for constructing intelligent diagnosis model of diabetic retinopathy
Technical Field
The invention relates to the technical field of deep learning, in particular to a method for constructing an intelligent diagnosis model of diabetic retinopathy.
Background
Diabetic retinopathy (DR) is one of the most common ophthalmic complications of diabetes and a preventable disease. In clinical practice, determination of the disease stage relies mainly on experienced ophthalmologists observing and evaluating color retinal images, with staging mostly established according to the international five-stage DR grading standard in order to formulate an optimized diagnosis and treatment plan. Deep learning can effectively extract pathological features from images based on labeled data and obtain good diagnostic results, but in practice it faces many challenges.
In training deep neural networks, most existing methods rely on large amounts of data. Because existing open-source fundus image data sets suffer from problems such as differences between imaging devices and regional differences in disease presentation, models trained on large-scale open-source data can produce inaccurate classification results, and such methods are ill-suited for extending models to medical institutions with insufficient labeled data or to other countries and regions.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a method for constructing an intelligent diagnosis model of diabetic retinopathy.
In order to achieve the purpose, the invention adopts the following technical scheme:
A method for constructing an intelligent diagnosis model of diabetic retinopathy comprises the following specific process:
S1, image preprocessing: performing noise reduction and normalization on the fundus image data;
S2, model training: this mainly comprises two stages: a pretext task based on contrastive self-supervised learning, and a downstream task based on a convolutional neural network;
the specific process of the pretext task is as follows:
2.1.1, data enhancement: during training, a batch of input fundus images is given as the set X = {x_k}, k = 1, 2, …, N, where N is the number of images in the input batch; each input fundus image x_k is randomly transformed to generate a positive-sample fundus image pair (x_i, x_j); the set X thus yields, after data enhancement, the sets X_i = {x_i}, i = 1, 3, …, 2N−1 and X_j = {x_j}, j = 2, 4, …, 2N;
2.1.2, encoder: the encoder f(·) uses two weight-sharing convolutional neural networks, which encode the input sets X_i and X_j into the corresponding 2048-dimensional feature representation sequences H_i and H_j, as shown in equation (3):
H_i = f(X_i)
H_j = f(X_j)    (3)
2.1.3, projection network (Projection Net): the projection network p(·) comprises two weight-sharing multilayer perceptron (MLP) modules, each consisting of two fully connected layers F1 and F2 joined by a ReLU activation layer; after the nonlinear transformations in the projection network, the feature representations H_i and H_j are further mapped to Z_i and Z_j; the projection process is shown in equation (4):
Z_i = p(H_i) = W_2 σ(W_1 H_i)
Z_j = p(H_j) = W_2 σ(W_1 H_j)    (4)
in equation (4), W_1 and W_2 denote the parameters of the two fully connected layers F1 and F2 of the MLP, and σ denotes the ReLU nonlinear transformation;
2.1.4, contrastive loss function: the contrastive loss NT-Xent is selected as the loss function; N fundus images are input per batch, and after the data enhancement process the batch size grows to 2N; the NT-Xent loss for each batch is computed as follows:
first, the cosine similarity between every two of the 2N fundus images is calculated, as shown in equation (5):
sim(Z_m, Z_n) = (Z_m · Z_n) / (‖Z_m‖ ‖Z_n‖)    (5)
in equation (5), m ∈ {1, 2, …, 2N} and n ∈ {1, 2, …, 2N}; the cosine similarities sim(Z_m, Z_n) are then normalized with the Softmax function to obtain the probability that two fundus images are similar; the negative logarithm of this probability is then taken as the loss of the image pair, referred to as the noise-contrastive estimation loss; for a positive pair (Z_i, Z_j), the loss is computed as shown in equation (6):
ℓ(i, j) = −log [ exp(sim(Z_i, Z_j)/τ) / Σ_{k≠i} exp(sim(Z_i, Z_k)/τ) ]    (6)
in equation (6), k ≠ i means that all images other than Z_i itself enter the denominator, and τ denotes a tunable temperature parameter that scales the cosine similarities, which lie in the range [−1, 1]; finally, the final loss over all positive pairs, including both (X_i, X_j) and (X_j, X_i), i.e. the final loss of the whole network, is computed as shown in equation (7):
L = (1 / 2N) Σ_{k=1}^{N} [ℓ(2k−1, 2k) + ℓ(2k, 2k−1)]    (7)
through the pretext-task training, a self-supervised contrastive learning method requiring no labeled data, the learned prior knowledge is transferred across networks: the parameters learned by the encoder in the pretext task are migrated into the feature extraction network of the downstream task as initialization network parameters;
the downstream task comprises three components: (1) encoder; (2) classifier; (3) objective function:
2.2.1, encoder: the CNN encoder in the downstream task has the same structure as the pretext task encoder; the parameters learned by the encoder in the pretext task are loaded into the encoder of the downstream task; the fundus image data x is input into the encoder to obtain a high-dimensional feature map representation sequence, which is then reduced in dimension by the output layer of the encoder to obtain a feature vector h; h is then input into the classifier for classification; the calculation process is as follows:
h = GAP(f(x))    (8)
in equation (8), f(·) is the convolution-pooling process, GAP denotes the global average pooling layer, and h denotes the feature vector finally output by the encoder;
2.2.2, classifier: the output layer of the encoder is followed by a Softmax classifier, consisting of an FC layer and a Softmax function, which computes the probability ŷ that a sample is positive; the calculation process is as follows:
{z_1, z_2} = FC(h)    (9)
ŷ = exp(z_2) / (exp(z_1) + exp(z_2))    (10)
in equation (9), FC denotes a fully connected network with two output nodes, z_1 and z_2 denoting the two output values; in equation (10), the outputs are normalized with the Softmax function (taking z_2 as the positive-class output) to obtain the probability ŷ that the sample is positive;
2.2.3, objective function: based on the Softmax classifier, a binary cross-entropy loss function is adopted as the objective function of this task; for a given batch input sequence {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, the loss function is computed as follows:
L = −(1/N) Σ_{i=1}^{N} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]    (11)
in equation (11), y_i is the true label of the sample; in this task the positive label is set to 1 and the negative label to 0, so y_i ∈ {0, 1}, and ŷ_i denotes the predicted probability that the sample is positive.
Further, the specific process of step S1 is as follows:
firstly, the redundant black boundary pixels in the fundus image are cropped to remove redundant-information noise;
then, the sizes of the fundus images are normalized, uniformly adjusting the resolution of all fundus images to 256 × 256;
finally, the brightness and contrast of the fundus image are improved as shown in equations (1) and (2):
I_gaussian = G(x, y; σ) ∗ I(x, y)    (1)
I_enhance = α·I(x, y) + β·I_gaussian + γ    (2)
In equation (1), the size-normalized input image I(x, y) is convolved with a Gaussian filter G of standard deviation σ to obtain I_gaussian; the enhanced fundus image I_enhance is then obtained by the weighted summation of equation (2), where the values of α, β, σ, γ are set to 4, −4, 10, 128 respectively.
Further, in step 2.1.1, the data enhancement includes: random cropping of the fundus image, random flipping or rotation by a certain angle, random conversion to a grayscale image, and random modification of the brightness, contrast or saturation of the fundus image.
Further, in step 2.1.2, a ResNet50 network pre-trained on ImageNet is selected as the basic network structure of the encoder.
The beneficial effects of the invention are as follows: the invention combines self-supervised contrastive learning with supervised deep learning, improves the training process of the DNN, and provides a method for constructing an intelligent diagnosis model of diabetic retinopathy. The construction method comprises data preprocessing, a pretext task and a downstream task. The preprocessing applies noise reduction and normalization to the images to improve data quality. The pretext task is an unsupervised model pre-training process used to mine prior knowledge from the unlabeled data set to assist model training. The downstream task is a supervised fine-tuning process for the classification model, which uses the prior knowledge to improve the quality of training and thereby the DR classification performance of the model when trained with only a small amount of labeled data.
Drawings
FIG. 1 is a schematic diagram illustrating data enhancement effect in a method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a method according to an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings. It should be noted that this embodiment is based on the above technical solution and provides a detailed implementation and specific operation process, but the protection scope of the present invention is not limited to this embodiment.
This embodiment provides a method for constructing an intelligent diagnosis model of diabetic retinopathy, in which a key question is how to train the model with the unlabeled data and the small amount of labeled image data in a database.
The convolutional neural network (CNN), one of the best techniques for extracting image features, has become the most effective deep learning model in this field. Such models are usually trained with labeled data under supervision, but their ability to mine features from unlabeled data and small amounts of labeled data is insufficient.
Aimed at the staged nature of DR, and in order to let the model mine prior knowledge without labeled data to assist the intelligent diagnosis model and serve early detection of diabetic retinopathy, this embodiment combines self-supervised contrastive learning with supervised deep learning, improves the training process of the DNN, and provides a method for constructing an intelligent diagnosis model of diabetic retinopathy (DR model). The construction method comprises data preprocessing, a Pretext Task and a Downstream Task. The preprocessing applies noise reduction and normalization to the images to improve data quality. The pretext task is an unsupervised model pre-training process used to mine prior knowledge from the unlabeled data set to assist model training. The downstream task is a supervised fine-tuning process for the classification model, which uses the prior knowledge to improve the quality of training and thereby the DR classification performance of the model when trained with only a small amount of labeled data.
The technical route of the method of this embodiment is as follows: the original images in the database are preprocessed, the processed data are input into the model, and after pretext-task and downstream-task training and optimization, the intelligent diagnosis model for diabetic retinopathy is obtained. The specific process is as follows:
S1, image preprocessing:
the difference between individuals is large because the fundus image data sources and the imaging devices are different. Several phenomena can occur in the data set: large size color differences, overexposure, information redundancy, etc. In order to improve the quality of data, the fundus images in the database need to be subjected to noise reduction and normalization processing. Firstly, cutting redundant black boundary pixels in a fundus image to remove information redundant noise; then, normalizing the size of the fundus images, and uniformly adjusting the resolution of all the fundus images to 256 multiplied by 256; finally, the brightness and contrast of the fundus image are improved as shown in the formula (1) and the formula (2):
I_gaussian = G(x, y; σ) ∗ I(x, y)    (1)
I_enhance = α·I(x, y) + β·I_gaussian + γ    (2)
In equation (1), the size-normalized input image I(x, y) is convolved with a Gaussian filter G of standard deviation σ to obtain I_gaussian; the enhanced fundus image I_enhance is then obtained by the weighted summation of equation (2), where the values of α, β, σ and γ are 4, −4, 10 and 128 respectively. The original images and the preprocessing results are shown in FIG. 1, where (a) and (c) are original images and (b) and (d) are the corresponding preprocessing results.
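The preprocessing described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions (the border threshold, the function names, and the 3σ kernel truncation are our own choices, not part of the patent); it crops the black border and applies the enhancement of equations (1) and (2) with α = 4, β = −4, σ = 10, γ = 128, implementing the Gaussian convolution as a separable 1-D filter:

```python
import numpy as np

def crop_black_border(img, thresh=10):
    """Remove (near-)black boundary rows/columns of a grayscale fundus image."""
    mask = img > thresh
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return img[rows.min():rows.max() + 1, cols.min():cols.max() + 1]

def gaussian_blur(img, sigma):
    """Eq. (1): I_gaussian = G(x, y; sigma) * I, as a separable convolution."""
    radius = int(3 * sigma)                       # truncate kernel at 3 sigma
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()                        # normalize to unit mass
    # filter along columns (axis 0), then rows (axis 1), with reflect padding
    pad = np.pad(img, ((radius, radius), (0, 0)), mode="reflect")
    tmp = np.stack([np.convolve(pad[:, j], kernel, mode="valid")
                    for j in range(img.shape[1])], axis=1)
    pad = np.pad(tmp, ((0, 0), (radius, radius)), mode="reflect")
    return np.stack([np.convolve(pad[i, :], kernel, mode="valid")
                     for i in range(img.shape[0])], axis=0)

def enhance(img, alpha=4.0, beta=-4.0, sigma=10, gamma=128.0):
    """Eq. (2): I_enhance = alpha*I + beta*I_gaussian + gamma, clipped to [0, 255]."""
    out = alpha * img + beta * gaussian_blur(img, sigma) + gamma
    return np.clip(out, 0, 255)
```

On a constant image the blur returns the image unchanged, so the enhancement maps every pixel to γ = 128; this illustrates how the transform suppresses the low-frequency background while centering the output.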
S2, model training:
the clinical medical image database stores a large number of fundus images, which are not fully utilized because of no labeling. The comparison learning plays an important role in mining unmarked data samples to improve the generalization capability of the model.
As shown in fig. 2, the model training of this embodiment mainly comprises two stages: a pretext task based on contrastive self-supervised learning, and a downstream task based on a convolutional neural network. The pretext task is an unsupervised model pre-training process used to mine prior knowledge from the unlabeled data set to assist model training. The downstream task is a supervised fine-tuning process for the classification model, which uses the prior knowledge to improve the quality of training and thereby the DR classification performance of the model when trained with only a small amount of labeled data.
In this method, the key to model training is how to mine prior knowledge from the unlabeled data with the pretext task and transfer it to the downstream task through knowledge transfer, thereby improving the classification performance of the model.
2.1, the pretext task comprises four stages: data enhancement, encoder, projection network, and contrastive loss. The encoder f(·) uses two weight-sharing convolutional neural networks, which after training are transferred to the feature extraction network of the downstream task.
The specific process of the pretext task is as follows:
2.1.1, data enhancement: during training, a batch of input fundus images is given as the set X = {x_k}, k = 1, 2, …, N, where N is the number of images in the input batch; each input fundus image x_k is randomly transformed twice to generate a positive-sample fundus image pair (x_i, x_j); the set X thus yields, after data enhancement, the sets X_i = {x_i}, i = 1, 3, …, 2N−1 and X_j = {x_j}, j = 2, 4, …, 2N. Finally, the input to the encoder for one batch is the set of pairs {(x_i, x_j)}, i = 1, 3, …, 2N−1, j = i + 1, where each pair is generated from one original image x ∈ X. Several data enhancement methods are selected in this embodiment, including: random cropping of the fundus image, random flipping or rotation by a certain angle, random conversion to a grayscale map, and random modification of the brightness, contrast or saturation of the fundus image.
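The pair-generation step above can be sketched as follows. This is a hypothetical NumPy illustration in which the crop size (224), the jitter ranges, and the transform probabilities are illustrative assumptions rather than values fixed by the patent:

```python
import numpy as np

def augment(img, rng, crop=224):
    """One random view of an H x W x 3 fundus image with values in [0, 255]."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)      # random crop offsets
    left = rng.integers(0, w - crop + 1)
    out = img[top:top + crop, left:left + crop].astype(np.float64)
    if rng.random() < 0.5:                   # random horizontal flip
        out = out[:, ::-1]
    # random brightness/contrast jitter
    out = out * rng.uniform(0.8, 1.2) + rng.uniform(-20.0, 20.0)
    if rng.random() < 0.2:                   # random conversion to grayscale
        out[:] = out.mean(axis=2, keepdims=True)
    return np.clip(out, 0, 255)

def positive_pair(img, rng):
    """Two independent random views (x_i, x_j) of the same image."""
    return augment(img, rng), augment(img, rng)
```

Applying `positive_pair` to every image of a batch of N thus yields the 2N views that enter the encoder.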
2.1.2, encoder (Base Encoder): the encoder f(·) uses two weight-sharing convolutional neural networks, which encode the input sets X_i and X_j into the corresponding 2048-dimensional feature representation sequences H_i and H_j, as shown in equation (3):
H_i = f(X_i)
H_j = f(X_j)    (3)
The ResNet50 network has proven very effective at extracting high-level semantic information, which is critical for medical image analysis. Therefore, in this embodiment, a ResNet50 network pre-trained on ImageNet is selected as the basic network structure of the encoder, giving the encoder network better initialization parameters and speeding up its convergence during training. In addition, the encoder position in the network architecture is generic, i.e. the ResNet50 can be replaced with other types of encoders.
2.1.3, projection network (Projection Net): the projection network p(·) comprises two weight-sharing multilayer perceptron (MLP) modules, each consisting of two fully connected layers F1 and F2 joined by a ReLU activation layer, as shown in fig. 2. After the nonlinear transformations in the projection network, the feature representation sequences H_i and H_j obtained in step 2.1.2 are further mapped to Z_i and Z_j. The projection process is shown in equation (4):
Z_i = p(H_i) = W_2 σ(W_1 H_i)
Z_j = p(H_j) = W_2 σ(W_1 H_j)    (4)
In equation (4), W_1 and W_2 denote the parameters of the two fully connected layers F1 and F2 of the MLP, and σ denotes the ReLU nonlinear transformation.
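The projection of equation (4) amounts to two matrix multiplications with a ReLU between them. The NumPy sketch below is illustrative: the hidden and output dimensions (2048 and 128) are assumptions, since the patent does not fix the projection output size; the same weight-shared p(·) is applied to both branches:

```python
import numpy as np

def project(H, W1, W2):
    """Projection network p(.) of Eq. (4): Z = sigma(H W1) W2, sigma = ReLU.
    H: (batch, 2048) encoder feature representations."""
    return np.maximum(H @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.01, size=(2048, 2048))  # fully connected layer F1
W2 = rng.normal(0, 0.01, size=(2048, 128))   # fully connected layer F2

H_i = rng.normal(size=(8, 2048))             # features of branch i
H_j = rng.normal(size=(8, 2048))             # features of branch j
Z_i = project(H_i, W1, W2)                   # identical weights on both
Z_j = project(H_j, W1, W2)                   # branches (weight sharing)
```

Weight sharing here is simply the reuse of the same (W1, W2) for both branches, which is what makes the two MLP modules a single p(·).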
2.1.4, contrastive loss function: unlike a supervised learning task, no manual labels are involved in self-supervised learning. Therefore, in the pretext task, the key to model training is to construct pseudo-labels from the unlabeled training data and design a corresponding loss function with which to pseudo-supervise the network. The loss function selected in this embodiment is the contrastive loss NT-Xent, whose principle is to compute the loss from the distances between positive and negative examples; optimizing this loss pulls positive examples closer together and pushes negative examples farther apart.
In this task, N fundus images are input per batch when training the network model, and after the data enhancement process the batch size grows to 2N. The NT-Xent loss for each batch is computed as follows. First, the cosine similarity between every two of the 2N fundus images is calculated, as shown in equation (5):
sim(Z_m, Z_n) = (Z_m · Z_n) / (‖Z_m‖ ‖Z_n‖)    (5)
In equation (5), m ∈ {1, 2, …, 2N} and n ∈ {1, 2, …, 2N}. The cosine similarities sim(Z_m, Z_n) are then normalized with the Softmax function to obtain the probability that two fundus images are similar. The negative logarithm of this probability is then taken as the loss of the image pair, referred to as the noise-contrastive estimation loss (NCE loss). For a positive pair (Z_i, Z_j), the loss is computed as shown in equation (6):
ℓ(i, j) = −log [ exp(sim(Z_i, Z_j)/τ) / Σ_{k≠i} exp(sim(Z_i, Z_k)/τ) ]    (6)
In equation (6), k ≠ i means that all images other than Z_i itself enter the denominator, and τ denotes a tunable temperature parameter that scales the cosine similarities, which lie in the range [−1, 1]. Finally, the final loss over all positive pairs, including both (X_i, X_j) and (X_j, X_i), i.e. the final loss of the whole network, is computed as shown in equation (7):
L = (1 / 2N) Σ_{k=1}^{N} [ℓ(2k−1, 2k) + ℓ(2k, 2k−1)]    (7)
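Equations (5) to (7) can be sketched compactly as follows. This is a hypothetical NumPy reference implementation in which consecutive row pairs (1, 2), (3, 4), … of the projection matrix form the positive pairs, matching the index convention above:

```python
import numpy as np

def nt_xent(Z, tau=0.5):
    """NT-Xent loss of Eqs. (5)-(7).
    Z: (2N, d) projections; rows (0, 1), (2, 3), ... are positive pairs."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sim = Zn @ Zn.T                      # Eq. (5): pairwise cosine similarity
    logits = sim / tau                   # temperature scaling
    np.fill_diagonal(logits, -np.inf)    # exclude k = i from the denominator
    # row-wise log-softmax; Eq. (6) is its value at the positive partner
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    partner = np.arange(Z.shape[0]) ^ 1  # partner index: 0<->1, 2<->3, ...
    # Eq. (7): average the pair losses over both orderings (i, j) and (j, i)
    return -log_prob[np.arange(Z.shape[0]), partner].mean()
```

With well-separated pairs the loss approaches zero, and it grows as positive and negative examples become harder to distinguish, which is exactly the pressure that pulls the two views of one fundus image together.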
through the training of the preposed task, the learned priori knowledge is migrated through a network by using a self-supervision contrast learning training method without marked data, and parameters learned by an encoder in the preposed task are migrated into a feature extraction network of a downstream task to be used as initialization network parameters, as shown in fig. 2.
2.2, because the model has already learned some prior knowledge through the pretext task, good classification performance can be obtained by fine-tuning the encoder and the classifier with only a small amount of labeled data; the downstream task comprises three components: (1) encoder; (2) classifier; (3) objective function.
2.2.1, encoder: the CNN encoder in the downstream task has the same structure as the pretext task encoder, i.e. it is also a ResNet50. The parameters learned by the encoder in the pretext task are loaded into the encoder of the downstream task. The fundus image data x is input into the encoder to obtain a high-dimensional feature map representation sequence, which is reduced in dimension by the output layer of the encoder (a GAP layer) to obtain a feature vector h; h is then input into the classifier for classification. The calculation process is as follows:
h = GAP(f(x))    (8)
In equation (8), f(·) is the convolution-pooling process, GAP denotes the global average pooling layer, and h denotes the feature vector finally output by the encoder.
2.2.2, classifier: the output layer of the encoder is followed by a Softmax classifier, consisting of an FC layer and a Softmax function, which computes the probability that a sample belongs to RDR (referable DR), i.e. the probability ŷ that the sample is positive. The calculation process is as follows:
{z_1, z_2} = FC(h)    (9)
ŷ = exp(z_2) / (exp(z_1) + exp(z_2))    (10)
In equation (9), FC denotes a fully connected network with two output nodes, z_1 and z_2 denoting the two output values. In equation (10), the outputs are normalized with the Softmax function (taking z_2 as the positive-class output) to obtain the probability ŷ that the sample is positive.
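The FC-plus-Softmax head of equations (9) and (10) reduces to a few lines of NumPy. In this illustrative sketch the weights W and b are hypothetical (untrained) parameters, and taking the second output node as the positive class is our assumption, since the patent does not state which node is positive:

```python
import numpy as np

def classify(h, W, b):
    """Softmax classifier of Eqs. (9)-(10).
    h: (2048,) encoder feature vector; returns P(sample is positive)."""
    z = h @ W + b                 # Eq. (9): FC layer with two output nodes
    z = z - z.max()               # numerically stable Softmax
    e = np.exp(z)
    p = e / e.sum()               # Eq. (10): normalize the two outputs
    return p[1]                   # probability of the positive class

rng = np.random.default_rng(0)
h = rng.normal(size=(2048,))
W = np.zeros((2048, 2))           # zero-initialized head: both classes equal
b = np.zeros(2)
```

With the zero-initialized head above, `classify(h, W, b)` returns 0.5; fine-tuning on labeled data is what moves this probability toward the true label.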
2.2.3, objective function: based on the Softmax classifier, a binary cross-entropy loss function is adopted as the objective function of this task. For a given batch input sequence {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, the loss function is computed as follows:
L = −(1/N) Σ_{i=1}^{N} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]    (11)
In equation (11), y_i is the true label of the sample; in this task the positive label is set to 1 and the negative label to 0, so y_i ∈ {0, 1}, and ŷ_i denotes the predicted probability that the sample is positive.
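The objective of equation (11) is the standard binary cross-entropy; the NumPy sketch below adds ε-clipping for numerical safety, a detail the patent does not mention:

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy of Eq. (11).
    y: (N,) true labels in {0, 1}; y_hat: (N,) predicted P(positive)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```

A completely uncertain classifier (ŷ = 0.5 everywhere) incurs a loss of log 2 ≈ 0.693 for any labels, while perfect predictions drive the loss toward zero; minimizing this loss is the fine-tuning step of the downstream task.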
Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims (4)

1. A method for constructing an intelligent diagnosis model of diabetic retinopathy, characterized by comprising the following specific process:
S1, image preprocessing: performing noise reduction and normalization on the fundus image data;
S2, model training: this mainly comprises two stages: a pretext task based on contrastive self-supervised learning, and a downstream task based on a convolutional neural network;
the specific process of the pretext task is as follows:
2.1.1, data enhancement: during training, a batch of input fundus images is given as the set X = {x_k}, k = 1, 2, …, N, where N is the number of images in the input batch; each input fundus image x_k is randomly transformed to generate a positive-sample fundus image pair (x_i, x_j); the set X thus yields, after data enhancement, the sets X_i = {x_i}, i = 1, 3, …, 2N−1 and X_j = {x_j}, j = 2, 4, …, 2N;
2.1.2, encoder: the encoder f(·) uses two weight-sharing convolutional neural networks, which encode the input sets X_i and X_j into the corresponding 2048-dimensional feature representation sequences H_i and H_j, as shown in equation (3):
H_i = f(X_i)
H_j = f(X_j)    (3)
2.1.3, projection network (Projection Net): the projection network p(·) comprises two weight-sharing multilayer perceptron (MLP) modules, each consisting of two fully connected layers F1 and F2 joined by a ReLU activation layer; after the nonlinear transformations in the projection network, the feature representations H_i and H_j are further mapped to Z_i and Z_j; the projection process is shown in equation (4):
Z_i = p(H_i) = W_2 σ(W_1 H_i)
Z_j = p(H_j) = W_2 σ(W_1 H_j)    (4)
in equation (4), W_1 and W_2 denote the parameters of the two fully connected layers F1 and F2 of the MLP, and σ denotes the ReLU nonlinear transformation;
2.1.4, contrastive loss function: the contrastive loss NT-Xent is selected as the loss function; N fundus images are input per batch, and after the data enhancement process the batch size grows to 2N; the NT-Xent loss for each batch is computed as follows:
first, the cosine similarity between every two of the 2N fundus images is calculated, as shown in equation (5):
sim(Z_m, Z_n) = (Z_m · Z_n) / (‖Z_m‖ ‖Z_n‖)    (5)
in equation (5), m ∈ {1, 2, …, 2N} and n ∈ {1, 2, …, 2N}; the cosine similarities sim(Z_m, Z_n) are then normalized with the Softmax function to obtain the probability that two fundus images are similar; the negative logarithm of this probability is then taken as the loss of the image pair, referred to as the noise-contrastive estimation loss; for a positive pair (Z_i, Z_j), the loss is computed as shown in equation (6):
ℓ(i, j) = −log [ exp(sim(Z_i, Z_j)/τ) / Σ_{k≠i} exp(sim(Z_i, Z_k)/τ) ]    (6)
in equation (6), k ≠ i means that all images other than Z_i itself enter the denominator, and τ denotes a tunable temperature parameter that scales the cosine similarities, which lie in the range [−1, 1]; finally, the final loss over all positive pairs, including both (X_i, X_j) and (X_j, X_i), i.e. the final loss of the whole network, is computed as shown in equation (7):
L = (1 / 2N) Σ_{k=1}^{N} [ℓ(2k−1, 2k) + ℓ(2k, 2k−1)]    (7)
through the pretext-task training, a self-supervised contrastive learning method requiring no labeled data, the learned prior knowledge is transferred across networks: the parameters learned by the encoder in the pretext task are migrated into the feature extraction network of the downstream task as initialization network parameters;
the downstream task comprises three components: (1) encoder; (2) classifier; (3) objective function:
2.2.1, encoder: the structure of a CNN encoder in a downstream task is the same as that of a preposed task encoder; loading parameters learned by an encoder in a preposed task into an encoder of a downstream task; inputting fundus image data x into an encoder to obtain a high-dimensional characteristic map representation sequence, then performing dimensionality reduction on the fundus image data x through an output layer of the encoder to obtain a characteristic vector h, and then inputting the characteristic vector h into a classifier for classification; the calculation process is as follows:
h=GAP(f(x)) (8)
in formula (8), f(·) is the convolution-pooling process, GAP denotes the global average pooling layer, and h is the feature vector finally output by the encoder;
2.2.2, classifier: the output layer of the encoder is followed by a Softmax classifier, consisting of an FC layer and a Softmax function, used to calculate the probability ŷ that a sample is positive.
The calculation process is as follows:
{z_1, z_2} = FC(h)   (9)
ŷ = exp(z_2) / (exp(z_1) + exp(z_2))   (10)
in equation (9), FC denotes a fully connected network with two output nodes, z_1 and z_2 denoting the two output values; in equation (10), the outputs are normalized using the Softmax function to obtain the probability ŷ that the sample is positive;
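Equations (8)-(10) chain global average pooling, a two-node FC layer, and Softmax normalization. A minimal NumPy sketch follows; the function names and the randomly initialized weights W and bias b are illustrative assumptions, and z_2 is taken as the positive-class logit:

```python
import numpy as np

def encoder_head(feature_map):
    """Global average pooling over the spatial dims: (C, H, W) -> (C,), eq. (8)."""
    return feature_map.mean(axis=(1, 2))

def softmax_classifier(h, W, b):
    """Two-node FC layer followed by Softmax, eqs. (9)-(10); returns the
    normalized probability that the sample is positive. W (2, C) and b (2,)
    are illustrative parameters, not trained values."""
    z = W @ h + b                       # {z_1, z_2} = FC(h)
    z = z - z.max()                     # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()     # Softmax normalization
    return p[1]                         # probability of the positive class
```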
2.2.3, objective function: based on the Softmax classifier, a two-class cross-entropy loss function is adopted as the objective function in this task; for a given batch of inputs {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, the loss function is computed as follows:
L = −(1/N) Σ_{i=1}^{N} [ y_i · log ŷ_i + (1 − y_i) · log(1 − ŷ_i) ]   (11)
in equation (11), y_i is the true label of sample i; in this task the positive-sample label is set to 1 and the negative-sample label to 0, so y_i ∈ {0, 1}, and ŷ_i denotes the probability that sample i is predicted to be positive.
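The two-class cross-entropy of equation (11) is a one-line computation. A minimal NumPy sketch, with an added clipping term (an implementation detail not stated in the claim) to avoid log(0):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Two-class cross-entropy of equation (11), averaged over the batch.
    y_true holds labels in {0, 1}; y_pred holds predicted positive
    probabilities in [0, 1]."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

For a maximally uncertain prediction of 0.5 on every sample the loss equals log 2 ≈ 0.693, and it approaches 0 as predictions become correct and confident.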
2. The construction method according to claim 1, wherein the specific process of step S1 is as follows:
firstly, the redundant black border pixels in the fundus image are cropped away to remove redundant-information noise;
then, the fundus image sizes are normalized, uniformly adjusting the resolution of all fundus images to 256 × 256;
finally, the brightness and contrast of the fundus image are enhanced as shown in formulas (1) and (2):
I_gaussian = G_σ(x, y) * I(x, y)   (1)
I_enhance = α·I(x, y) + β·I_gaussian + γ   (2)
in formula (1), the size-normalized input image I(x, y) is convolved with a Gaussian filter of standard deviation σ to obtain I_gaussian; the enhanced fundus image I_enhance is then obtained by the weighted summation of formula (2), where the values of α, β, σ, γ are set to 4, −4, 10, 128 respectively.
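Formulas (1) and (2) amount to subtracting a blurred copy of the image from a scaled original and re-centering the intensities. A minimal NumPy sketch on a single-channel image, assuming reflective border padding (an implementation detail not stated in the claim):

```python
import numpy as np

def gaussian_blur(img, sigma=10):
    """Separable Gaussian convolution with standard deviation sigma, eq. (1)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()                                     # normalized kernel
    pad = np.pad(img, radius, mode='reflect')        # assumed border handling
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode='valid'), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='valid'), 0, rows)

def enhance_fundus(img, alpha=4, beta=-4, sigma=10, gamma=128):
    """Brightness/contrast enhancement of eq. (2):
    I_enhance = alpha*I + beta*I_gaussian + gamma, clipped to [0, 255]."""
    blurred = gaussian_blur(img.astype(float), sigma)
    return np.clip(alpha * img + beta * blurred + gamma, 0, 255)
```

On a region of uniform intensity the scaled original and the blurred term cancel exactly, leaving the constant γ = 128, which is what pushes local structure (vessels, lesions) toward high contrast against a neutral background.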
3. The construction method according to claim 1, wherein in step 2.1.1 the data enhancement comprises: randomly cropping the fundus image, randomly rotating it by a certain angle, randomly converting it to a grayscale image, and randomly modifying its brightness, contrast or saturation.
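The augmentations listed in claim 3 can be illustrated with plain NumPy on an (H, W, 3) array. This is a hypothetical sketch: the crop size, flip in place of an arbitrary rotation, per-transform probabilities, and brightness-jitter range are all assumptions, not the patent's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_augment(img, out_size=224):
    """Illustrative random crop, horizontal flip, grayscale conversion, and
    brightness change on an (H, W, 3) image; parameters are assumptions."""
    h, w, _ = img.shape
    top = rng.integers(0, h - out_size + 1)
    left = rng.integers(0, w - out_size + 1)
    img = img[top:top + out_size, left:left + out_size]     # random crop
    if rng.random() < 0.5:
        img = img[:, ::-1]                                  # random horizontal flip
    if rng.random() < 0.2:                                  # random grayscale
        gray = img @ np.array([0.299, 0.587, 0.114])
        img = np.repeat(gray[..., None], 3, axis=2)
    factor = rng.uniform(0.8, 1.2)                          # random brightness jitter
    return np.clip(img * factor, 0, 255)
```

In the contrastive pretext task, two independent calls on the same fundus image would yield the positive pair (X_i, X_j).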
4. The construction method according to claim 1, wherein in step 2.1.2 a ResNet50 network with ImageNet pre-training parameters is selected as the base network structure of the encoder.
CN202211142980.XA 2022-09-20 2022-09-20 Method for constructing intelligent diagnosis model of diabetic retinopathy Pending CN115458174A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211142980.XA CN115458174A (en) 2022-09-20 2022-09-20 Method for constructing intelligent diagnosis model of diabetic retinopathy

Publications (1)

Publication Number Publication Date
CN115458174A true CN115458174A (en) 2022-12-09

Family

ID=84305332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211142980.XA Pending CN115458174A (en) 2022-09-20 2022-09-20 Method for constructing intelligent diagnosis model of diabetic retinopathy

Country Status (1)

Country Link
CN (1) CN115458174A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117373656A (en) * 2023-10-30 2024-01-09 Beijing Institute of Technology Diabetes weak supervision classification method based on heterogeneous data
CN117557840A (en) * 2023-11-10 2024-02-13 China University of Mining and Technology Fundus lesion grading method based on small sample learning
CN117557840B (en) * 2023-11-10 2024-05-24 China University of Mining and Technology Fundus lesion grading method based on small sample learning

Similar Documents

Publication Publication Date Title
CN112308158B (en) Multi-source field self-adaptive model and method based on partial feature alignment
CN109886121B (en) Human face key point positioning method for shielding robustness
CN110287849B (en) Lightweight depth network image target detection method suitable for raspberry pi
CN108154118B (en) A kind of target detection system and method based on adaptive combined filter and multistage detection
CN109345508B (en) Bone age evaluation method based on two-stage neural network
CN109993100B (en) Method for realizing facial expression recognition based on deep feature clustering
CN111401384B (en) Transformer equipment defect image matching method
CN111242288B (en) Multi-scale parallel deep neural network model construction method for lesion image segmentation
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN112307958A (en) Micro-expression identification method based on spatiotemporal appearance movement attention network
CN115458174A (en) Method for constructing intelligent diagnosis model of diabetic retinopathy
Hsueh et al. Human behavior recognition from multiview videos
JP2022551683A (en) Methods and systems for non-invasive genetic testing using artificial intelligence (AI) models
CN111062329B (en) Unsupervised pedestrian re-identification method based on augmented network
CN111582044A (en) Face recognition method based on convolutional neural network and attention model
CN108596044B (en) Pedestrian detection method based on deep convolutional neural network
CN116758397A (en) Single-mode induced multi-mode pre-training method and system based on deep learning
CN114565628A (en) Image segmentation method and system based on boundary perception attention
CN114463340A (en) Edge information guided agile remote sensing image semantic segmentation method
CN113763417A (en) Target tracking method based on twin network and residual error structure
Sekmen et al. Unsupervised deep learning for subspace clustering
CN116310335A (en) Method for segmenting pterygium focus area based on Vision Transformer
CN115830401A (en) Small sample image classification method
CN113192076B (en) MRI brain tumor image segmentation method combining classification prediction and multi-scale feature extraction
CN115995040A (en) SAR image small sample target recognition method based on multi-scale network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination