CN115294075A - OCTA image retinal vessel segmentation method based on attention mechanism - Google Patents


Info

Publication number
CN115294075A
Authority
CN
China
Prior art keywords
convolution
attention
layer
segmentation
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210960639.9A
Other languages
Chinese (zh)
Inventor
崔少国
文浩
张宇楠
柳耘豪
杨泽华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Normal University
Original Assignee
Chongqing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Normal University filed Critical Chongqing Normal University
Priority to CN202210960639.9A priority Critical patent/CN115294075A/en
Publication of CN115294075A publication Critical patent/CN115294075A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30041Eye; Retina; Ophthalmic

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an OCTA image retinal vessel segmentation method based on an attention mechanism, comprising the following steps: building a convolutional neural network segmentation model with an attention mechanism, training the model and optimizing its parameters, and rapidly locating and accurately segmenting the OCTA-based retinal vascular structure, wherein the convolutional neural network segmentation model is composed of trunk feature extractors of different scales, a structural feature extractor, an enhanced feature extractor and a classifier. The method builds a deep learning hybrid model: during segmentation, features are extracted with depthwise separable convolution layers using large convolution kernels, a new down-sampling scheme reduces the loss of important vessel information at the image border, and a spatial-channel attention module (STAM) is introduced to better learn the spatial and channel information of feature maps of different scales and to capture the complete vessel structure, realizing rapid and accurate segmentation of OCTA retinal vessel images.

Description

OCTA image retinal vessel segmentation method based on attention mechanism
Technical Field
The invention relates to the technical field of fully automatic semantic segmentation of medical images, in particular to an OCTA image retinal vessel segmentation method based on an attention mechanism.
Background
The eye is an important organ through which humans perceive the world, and retinal diseases are a significant threat to human health. Medical image segmentation is an important processing step in intelligent computer-aided diagnosis and can help doctors perform image-guided intervention or more effective radiological diagnosis. Many fundus lesions occur around blood vessels, and retinal fundus images are rich in vascular features. Analyzing structural characteristics of retinal vessels such as length, width, curvature and bifurcation pattern yields clinical case features of fundus disease and is of great significance for preventing and treating related diseases. In clinical medicine, retinal vessel extraction can assist doctors in diagnosing whether a patient has an eye-related disease through a series of analyses of the retinal vessels. The conventional retinal vessel image is the common color fundus image; optical coherence tomography angiography (OCTA) is a new imaging modality built on optical coherence tomography (OCT) and is an emerging non-invasive technique that can visualize vessel information in different retinal layers. OCTA is therefore becoming one of the important tools for observing fundus-related diseases.
In recent years, convolutional neural networks, which combine low-level features into abstract deep features, have performed various visual tasks ever better. Deep learning solves such problems by learning from data: early retinal vessel segmentation algorithms took the fully convolutional network as their core and aimed at better recovering the information lost during convolutional down-sampling. Three lines of design developed later: first, U-shaped encoder-decoder symmetric structures represented by U-net; second, designs introducing dilated (atrous) convolution, represented by DeepLab; finally, self-attention structures based on the self-attention of the Transformer encoder. Retinal vessel data sets generally contain few samples, and both segmentation networks based on self-attention computation and the DeepLab series need large amounts of training data to achieve good segmentation results. Therefore, only networks improved on the basis of U-net can achieve a good segmentation effect when the number of data set samples is small.
Although methods improved from U-net have been applied to vessel segmentation of ordinary color fundus images with good results, the inventors of the present application found through research that work on extracting blood vessels from OCTA fundus images is still relatively scarce, and that current deep-learning-based OCTA retinal vessel segmentation has the following shortcomings: (1) the scale of retinal vessels varies greatly; tiny capillaries can be as narrow as 1-2 pixels, and vessel ends are easily confused with the background, so that vessel ends are misclassified as background; (2) retinal vessels have complex tree-like structures with bifurcations and crossings, irregular shapes and uneven distribution, causing vessel discontinuity and breakage during segmentation; (3) lesions such as microaneurysms and exudates at the edges of some retinal vessels also affect the segmentation result; (4) while the network extracts semantic information at different scales, the changing feature-map size causes vessels at the image border to be misclassified and missed; (5) the depth and width of the networks are relatively large, resulting in relatively slow segmentation.
Disclosure of Invention
Aiming at the technical problems of existing deep-learning-based OCTA retinal vessel segmentation, namely loss of important vessel information, slow segmentation speed, inaccurate segmentation of vessel-end regions, incompletely captured vessel structures, and discontinuous, broken vessels in the segmentation results, the invention provides an OCTA image retinal vessel segmentation method based on an attention mechanism.
In order to solve the technical problems, the invention adopts the following technical scheme:
an OCTA image retinal vessel segmentation method based on an attention mechanism comprises the following steps:
s1, building a convolutional neural network segmentation model with an attention mechanism:
S11, the convolutional neural network segmentation model with an attention mechanism consists of trunk feature extractors of different scales, a structural feature extractor, an enhanced feature extractor and a classifier, wherein the trunk feature extractors of different scales down-sample the retinal vessel feature map of the OCTA fundus image four times to obtain four vessel feature maps whose sizes are 1/4, 1/16, 1/64 and 1/256 of the input feature map, extract vessel detail features from the four feature maps of different scales, and connect them serially in order of feature-map size from large to small; the structural feature extractor accurately extracts vessel structure features from the 1/4 and 1/16 vessel feature maps produced by the trunk feature extractor before these are channel-concatenated with the corresponding-scale feature maps in the enhanced feature extractor; the enhanced feature extractor up-samples the 1/256 high-order feature map produced by the trunk feature extractor, gradually restoring the feature-map size through the ratios 1/64, 1/16, 1/4 and 1/1 while extracting vessel detail features at each scale; the classifier assigns a label to each pixel according to the vessel features extracted from the feature maps of different scales; the input of the segmentation network has three channels and the output has two channels, the input and output images are both 512 × 512 in size, and end-to-end semantic segmentation is realized;
S12, the trunk feature extractors of different scales comprise ten convolution layer groups, two convolution layers and four adjacent-block merging layers, with one adjacent-block merging layer after every two convolution layer groups and the two convolution layers located between the last two convolution layer groups; each convolution layer group consists of a channel-by-channel convolution layer with a 7 × 7 convolution kernel and stride 1 and a point-by-point convolution layer with a 1 × 1 convolution kernel and stride 1; the structural feature extractor comprises two attention layers, each containing a channel attention submodule and a spatial attention submodule; the enhanced feature extractor comprises eight convolution layer groups and four up-sampling layers, with two convolution layer groups after each up-sampling layer, each convolution layer group again consisting of a 7 × 7, stride-1 channel-by-channel convolution layer and a 1 × 1, stride-1 point-by-point convolution layer; the classifier consists of a class prediction layer and a softmax regression layer, the softmax regression layer converting class prediction scores into a probability distribution;
s2, model training and parameter optimization:
s21, initializing the convolution neural network segmentation model parameters with the attention mechanism built in the step S1 by adopting an Xavier method;
S22, preprocessing and online-enhancing the retinal vessel data set with retinal vessel segmentation labels, dividing the data set into a training set, a validation set and a test set in a 7:2:1 ratio, and pre-training the network segmentation model with 10-fold cross-validation;
s23, inputting the OCTA images of the same retinal vessel section into a network, and generating a retinal vessel segmentation result through network forward calculation, wherein the network forward calculation comprises convolution operation, nonlinear excitation, probability value conversion and multi-head attention calculation;
S24, adopting the categorical cross-entropy loss function as the optimization target of the segmentation network, the objective function being defined as follows:
$$J(\theta') = -\frac{1}{S}\sum_{s=1}^{S}\sum_{c=1}^{C} Y'_{s,c}\log Y_{s,c} + \frac{\lambda}{2Q}\sum_{q=1}^{Q}\theta_q'^{2}$$
where θ' is the classification network parameter, Y' is the segmentation label, Y is the predicted probability, S is the number of image pixels, C is the number of pixel classes, $\frac{\lambda}{2Q}\sum_{q=1}^{Q}\theta_q'^{2}$ is the regularization term, λ is the regularization factor, and Q is the number of model parameters;
s25, optimizing a function:
optimizing an objective function by adopting a stochastic gradient descent algorithm, and updating parameters of the retinal vessel segmentation network model by using error back propagation, wherein the optimization process comprises the following steps:
$$g_t = \nabla_{\theta} L(\theta_{t-1})$$
$$m_t = \mu\, m_{t-1} + \eta_t\, g_t$$
$$\theta_t = \theta_{t-1} - m_t$$
where t denotes the iteration number, $\nabla_{\theta}$ denotes the gradient operator, θ corresponds to θ' in the objective function of step S24, L(θ_{t-1}) is the loss function evaluated with the network parameters θ_{t-1}, g_t, m_t and μ are the gradient, momentum and momentum coefficient respectively, and η_t is the learning rate;
s3, fast positioning and accurate segmentation of the retinal vascular structure based on OCTA:
s31, sequentially carrying out data preprocessing and online enhancement processing on the fundus images based on the OCTA to obtain processed images;
S32, inputting the online-enhanced image as a three-channel input into the feature extractor consisting of the trunk feature extractor, the structural feature extractor and the enhanced feature extractor for feature extraction, automatically locating and outputting the reconstructed retinal vessel image feature map;
s33, inputting the reconstructed retinal blood vessel image feature map into a classifier, and predicting feature image pixels one by one in a sliding window mode to generate two pixel label prediction value maps with the same size as the original image;
s34, converting the prediction score into probability distribution by using a softmax function;
and S35, taking the subscript component where the maximum probability of each pixel is located as a pixel class label, and obtaining a retina blood vessel segmentation result binary image while realizing rapid positioning of a blood vessel structure.
Further, in the trunk feature extractors of different scales in step S12, each of the two convolution layers has a 1 × 1 convolution kernel and a stride of 1, with 1024 and 512 convolution kernels respectively, and each adjacent-block merging layer consists of a convolution layer with a 2 × 2 convolution kernel and a stride of 2 followed by a convolution layer with a 1 × 1 convolution kernel and a stride of 1.
Further, in the structural feature extractor of step S12, the channel attention submodule comprises, connected in sequence, a global max pooling layer, a global average pooling layer, a depthwise separable convolution layer with a 1 × 2 convolution kernel and two convolution layers with 1 × 1 convolution kernels; the spatial attention submodule comprises, connected in sequence, a convolution layer with a 1 × 1 convolution kernel, a multi-head self-attention calculation layer, a fully connected layer and a convolution layer with a 1 × 1 convolution kernel.
Further, each up-sampling layer in the enhanced feature extractor of step S12 is a deconvolution layer with a convolution kernel size of 3 × 3 and a stride of 2.
Further, the size of the convolution kernel of the class prediction layer in the classifier of step S12 is 1 × 1, the number of convolution kernels is 2, and the step size is 1.
Further, the data preprocessing in step S22 uniformly resizes the images to 512 × 512 pixels before they are fed to the network, and the online enhancement applies horizontal flipping, vertical flipping, cropping, and rotation by 45°, 90°, 135°, 180°, 225°, 270° and 315°, increasing the training samples to ten times the original.
Further, the convolution operation in step S23 is: the output feature map Z_i corresponding to any convolution kernel in the network is calculated with the following formula:
$$Z_i = f\left(\sum_{r=1}^{k} W_{ir} * X_r + b_i\right)$$
where f denotes the nonlinear excitation function, r denotes the input channel index, k denotes the number of input channels, W_{ir} denotes the r-th channel weight matrix of the i-th convolution kernel, b_i is the convolution bias, and X_r denotes the r-th input channel image.
Further, the nonlinear excitation in step S23 is: the rectified linear unit ReLU is used as the nonlinear excitation function to nonlinearly transform the output feature map Z_i generated by the convolution kernel; the rectified linear unit ReLU is defined as follows:
f(x) = max(0, x)
where f(x) is the rectified linear unit function, max takes the maximum value, and x is the input value.
Further, the probability value conversion in step S23 is: the prediction scores output by the network are converted into a probability distribution using the softmax function, defined as follows:
$$Y_j = \frac{e^{O_j}}{\sum_{k=1}^{K} e^{O_k}}$$
where Y_j is the probability that a pixel belongs to class j, O_j is the prediction score of the pixel on the j-th class, and K is the number of classes.
Further, in the multi-head attention calculation of step S23, the multi-head self-attention mechanism performs its attention computation only once, in parallel across heads, and finally combines the results; it is implemented with the scaled dot-product attention operation, defined as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where Attention is the attention function, Q, K and V are the input matrices of the scaled dot-product attention, and d_k is the dimensionality of the K matrix; Q is multiplied by the transpose of K, scaled, passed through the softmax function, and then multiplied by V to obtain the final self-attention output.
Compared with the prior art, the OCTA image retinal vessel segmentation method based on the attention mechanism has the following beneficial effects:
1. Each convolution layer group in the trunk feature extractor and the enhanced feature extractor consists of a channel-by-channel convolution layer and a point-by-point convolution layer, so the retinal vessel feature extraction stage uses depthwise separable convolution based on large convolution kernels; this extracts vessel detail features better, reduces the number of vessel-end pixels misclassified as background, lowers the computational cost of the convolution operations, and increases segmentation speed;
2. In the down-sampling stage, the trunk feature extractor adopts adjacent-block merging, capturing detail information around the feature-map border while shrinking the feature map, reducing the loss of important information such as vessels at the border of feature maps of different scales, and improving segmentation accuracy;
3. A convolutional self-attention module (STAM) is introduced into the structural feature extractor to better learn the spatial and channel information of feature maps of different scales and to capture the complete vessel structure, avoiding vessel discontinuity and breakage in the retinal vessel segmentation result and improving segmentation precision.
Drawings
Fig. 1 is a schematic structural diagram of the OCTA image retinal vessel segmentation network based on an attention mechanism provided by the invention.
FIG. 2 is a schematic diagram of the depthwise separable convolution operation based on a large convolution kernel according to the present invention.
Fig. 3 is a schematic diagram of a neighboring block merging operation provided by the present invention.
Fig. 4a is a schematic diagram of the overall structure of the STAM module provided by the present invention.
FIG. 4b is a schematic diagram of a channel attention submodule structure in the STAM module provided by the present invention.
Fig. 4c is a schematic structural diagram of a spatial attention submodule in the STAM module provided in the present invention.
Fig. 5 is a schematic diagram of an embedded jump connection structure of a STAM module according to the present invention.
Fig. 6 is a schematic diagram of an original image and its manually segmented ground-truth label provided by the present invention.
Detailed Description
In order to make the technical means, creative features, objectives and effects of the invention easy to understand, the invention is further described below with reference to the specific drawings.
Referring to fig. 1, the present invention provides an attention-based method for segmenting retinal blood vessels in an OCTA image, including the following steps:
s1, building a convolutional neural network segmentation model with an attention mechanism:
S11, the convolutional neural network segmentation model with an attention mechanism processes the input OCTA fundus image and consists of trunk feature extractors of different scales (numbers 1-16 in Table 1), a structural feature extractor (numbers 29-30 in Table 1), an enhanced feature extractor (numbers 17-27 in Table 1) and a classifier (number 31 in Table 1). The trunk feature extractors of different scales extract different contextual features of the retinal vessels from feature maps of different scales: they down-sample the retinal vessel feature map of the OCTA fundus image four times to obtain four vessel feature maps whose sizes are 1/4, 1/16, 1/64 and 1/256 of the input feature map, extract vessel detail features from these four feature maps, and connect them serially in order of feature-map size from large to small. The structural feature extractor accurately extracts retinal vessel structure features: it operates on the 1/4 and 1/16 vessel feature maps produced by the trunk feature extractor before these are channel-concatenated with the corresponding-scale feature maps in the enhanced feature extractor. The enhanced feature extractor performs further vessel feature extraction at different scales while restoring the retinal vessel feature map from the 1/256 ratio to the 1/1 ratio: it up-samples the 1/256 high-order feature map produced by the trunk feature extractor, gradually restoring the feature-map size through the ratios 1/64, 1/16, 1/4 and 1/1 while extracting vessel detail features at each scale, and after each up-sampling it is skip-connected along the channel dimension with the feature maps produced by the trunk feature extractor and the structural feature extractor, so that the numbers of input channels of the layers numbered 18, 21, 24 and 27 in Table 1 are 1024, 512, 256 and 128 respectively; once the feature map is restored to the 1/1 ratio, pixel classes can be predicted accurately. The classifier assigns a label to each pixel according to the vessel features extracted from the feature maps of different scales. The input of the segmentation network has three channels, representing the RGB color channels of the OCTA fundus image; the output has two channels, representing the probabilities that a pixel belongs to the vessel region and to the non-vessel region (background); the input and output images have the same size, 512 × 512, so end-to-end semantic segmentation is realized;
S12, the trunk feature extractors of different scales comprise ten convolution layer groups, two convolution layers and four adjacent-block merging layers, with one adjacent-block merging layer after every two convolution layer groups and the two convolution layers located between the last two convolution layer groups; each convolution layer group consists of a channel-by-channel convolution layer with a 7 × 7 kernel and stride 1 and a point-by-point convolution layer with a 1 × 1 kernel and stride 1; each of the two convolution layers has a 1 × 1 kernel and stride 1, with 1024 and 512 kernels respectively. The depthwise separable convolution operation based on a large kernel is shown in fig. 2: with n channels before the operation, each channel is first convolved channel-by-channel with one of n kernels of size 7 × 7, the results are concatenated along the channel dimension, and a point-by-point convolution with m kernels of size 1 × 1 then produces the final m output channels; the specific values of m and n are the numbers of convolution kernels of each convolution layer group in Table 1. Each adjacent-block merging layer consists of a convolution layer with a 2 × 2 kernel and stride 2 and a convolution layer with a 1 × 1 kernel and stride 1; the merging operation is shown in fig. 3, where red, yellow, blue and green represent four adjacent blocks and the numbers 1, 2, 3, 4 inside the blocks are the index values of corresponding positions within the adjacent blocks. The structural feature extractor comprises two attention layers (STAM modules), each containing a channel attention submodule and a spatial attention submodule, as shown in fig. 4a; the channel attention submodule comprises, connected in sequence, a global max pooling layer, a global average pooling layer, a depthwise separable convolution layer with a 1 × 2 kernel and two convolution layers with 1 × 1 kernels, as shown in fig. 4b; the spatial attention submodule comprises, connected in sequence, a convolution layer with a 1 × 1 kernel, a multi-head self-attention calculation layer, a fully connected layer and a convolution layer with a 1 × 1 kernel, as shown in fig. 4c; a schematic of how the structural feature extractor is embedded is shown in fig. 5. The enhanced feature extractor comprises eight convolution layer groups and four up-sampling layers, with two convolution layer groups after each up-sampling layer; each convolution layer group consists of a 7 × 7, stride-1 channel-by-channel convolution layer and a 1 × 1, stride-1 point-by-point convolution layer, and each up-sampling layer is a deconvolution layer with a 3 × 3 kernel and stride 2. The classifier consists of a class prediction layer and a softmax regression layer; the softmax regression layer converts class prediction scores into a probability distribution, and the class prediction layer has 1 × 1 kernels, 2 kernels in number, and stride 1. The parameters of the convolutional neural network segmentation model with an attention mechanism are listed in Table 1 below.
Table 1 Parameter table of the convolutional neural network segmentation model with an attention mechanism
[Table 1 appears as an image in the original publication; its layer-by-layer parameters (layers numbered 1-31) are not reproduced here.]
In the above table, Padding = 3 is used with a convolution kernel size of 7 × 7, Padding = 1 with a kernel size of 3 × 3, and Padding = 0 with a kernel size of 1 × 1.
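As an illustrative sketch only (the patent provides no reference code, and details such as normalization or activation placement are not recoverable from Table 1, so those choices are our assumption), the convolution layer group and adjacent-block merging layer of step S12 might be expressed in PyTorch as follows; all module and variable names are our own:

```python
import torch
import torch.nn as nn

class ConvLayerGroup(nn.Module):
    """One convolution layer group: a 7x7 channel-by-channel (depthwise)
    convolution with stride 1 followed by a 1x1 point-by-point convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Padding=3 keeps the spatial size, matching the note under Table 1
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=7, stride=1,
                                   padding=3, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1)
        self.act = nn.ReLU(inplace=True)  # activation choice assumed, per step S23

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class AdjacentBlockMerging(nn.Module):
    """Adjacent-block merging (down-sampling): a 2x2 stride-2 convolution
    followed by a 1x1 stride-1 convolution, halving each spatial dimension."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.merge = nn.Conv2d(in_ch, in_ch, kernel_size=2, stride=2)
        self.project = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1)

    def forward(self, x):
        return self.project(self.merge(x))

x = torch.randn(1, 3, 512, 512)        # three-channel OCTA input
y = AdjacentBlockMerging(64, 128)(ConvLayerGroup(3, 64)(x))
print(y.shape)                          # torch.Size([1, 128, 256, 256])
```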
S2, model training and parameter optimization:
s21, initializing the convolution neural network segmentation model parameters with the attention mechanism built in the step S1 by adopting an Xavier method;
S22, preprocessing and online-enhancing the retinal vessel data set with retinal vessel segmentation labels, dividing the data set into a training set, a validation set and a test set in a 7:2:1 ratio, and pre-training the network segmentation model with 10-fold cross-validation. As a specific implementation, the inventors of the present application obtained data of 300 patients with pixel-level segmentation labels; specifically, the data set adopted is the segmentation data set in the OCTA_6M subset of the OCTA-500 retinal blood vessel data set publicly released on IEEE DataPort by its authors. The original images selected are whole-eye OCTA images, and the labels are the retinal vessel segmentation labels (manually segmented ground truth); there are 300 original images and 300 ground-truth labels, an example of which is shown in fig. 6. The data set is divided by random sampling into a training set, a validation set and a test set in a 7:2:1 ratio, containing 210, 60 and 30 images respectively. To unify the input sizes of the different networks in the subsequent comparative experiments, the images are uniformly resized to 512 × 512 pixels before being fed to the network, as shown in table 2 below:
Table 2 Data set distribution

Subset            Number of images
Training set      210
Validation set    60
Test set          30
Total             300
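A minimal sketch of the 7:2:1 random-sampling split described above (the file paths are hypothetical):

```python
import random

paths = [f"OCTA_6M/{i:03d}.png" for i in range(1, 301)]  # 300 hypothetical image paths
random.seed(0)
random.shuffle(paths)                                    # random sampling
train, val, test = paths[:210], paths[210:270], paths[270:]  # 210 / 60 / 30 split
```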
During training, a small amount of training data is one of the important factors causing model under-fitting. Because OCTA images are scarce and the OCTA-500 retinal vessel segmentation data set is small, online data enhancement with horizontal flipping, vertical flipping, cropping, and rotation by 45°, 90°, 135°, 180°, 225°, 270° and 315° is used to reduce the effect of under-fitting, increasing the training samples to ten times the original.
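This online enhancement could be sketched with torchvision functional transforms as follows (the function name is ours, and exactly how the cropping step is counted toward the tenfold increase is our assumption):

```python
import torchvision.transforms.functional as TF

def enhance_online(image, label):
    """Return the original plus augmented (image, label) pairs:
    horizontal flip, vertical flip, and rotations by 45..315 degrees,
    giving ten samples per original; the cropping step is omitted here."""
    pairs = [(image, label),
             (TF.hflip(image), TF.hflip(label)),
             (TF.vflip(image), TF.vflip(label))]
    for angle in (45, 90, 135, 180, 225, 270, 315):
        pairs.append((TF.rotate(image, angle), TF.rotate(label, angle)))
    return pairs
```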
S23, inputting the OCTA image of the same retinal vessel section into a network, and generating a retinal vessel segmentation result through network forward calculation, wherein the network forward calculation comprises the following steps:
Convolution operation: the output feature map Z_i corresponding to any convolution kernel in the network is calculated with the following formula:
$$Z_i = f\left(\sum_{r=1}^{k} W_{ir} * X_r + b_i\right)$$
where f denotes the nonlinear excitation function, r denotes the input channel index, k denotes the number of input channels, W_{ir} denotes the r-th channel weight matrix of the i-th convolution kernel, b_i is the convolution bias, and X_r denotes the r-th input channel image.
Nonlinear excitation: the rectified linear unit ReLU, the activation function of the network, is used as the nonlinear excitation function f to nonlinearly transform the output feature map Z_i generated by each convolution kernel; the rectified linear unit ReLU is defined as follows:
f(x) = max(0, x)
where f(x) is the rectified linear unit function, max takes the maximum value, and x is the input value.
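The per-kernel computation Z_i = f(Σ_r W_ir * X_r + b_i) with ReLU excitation can be made concrete with explicit tensors (all names are ours, dimensions arbitrary):

```python
import torch
import torch.nn.functional as F

k, H, W = 3, 8, 8
X = torch.randn(1, k, H, W)    # input channel images X_r, r = 1..k
W_i = torch.randn(1, k, 3, 3)  # channel weight matrices W_ir of kernel i
b_i = torch.zeros(1)           # convolution bias b_i

# Z_i = f( sum_r W_ir * X_r + b_i ), with f = ReLU
Z_i = F.relu(F.conv2d(X, W_i, bias=b_i, padding=1))
print(Z_i.shape)               # torch.Size([1, 1, 8, 8])
```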
Probability value conversion: the prediction scores output by the network are converted into a probability distribution using the softmax function, defined as follows:
$$Y_j = \frac{e^{O_j}}{\sum_{k=1}^{K} e^{O_k}}$$
where Y_j is the probability that a pixel belongs to class j, O_j is the prediction score of the pixel on the j-th class, and K is the number of classes.
Multi-head self-attention calculation: the attention function applies a weighting transformation to the output feature map, and compared with the ordinary self-attention mechanism, the multi-head self-attention mechanism can obtain multi-dimensional features. In the multi-head attention calculation, the mechanism performs its attention computation only once, in parallel across heads, and finally combines the results; it is implemented with the scaled dot-product attention operation (Attention), defined as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where Attention is the attention function, Q, K and V are the input matrices of the scaled dot-product attention, and d_k is the dimensionality of the K matrix; Q is multiplied by the transpose of K, scaled, passed through the softmax function, and then multiplied by V to obtain the final self-attention output.
In the multi-head self-attention mechanism, h different linear mappings are applied, scaled dot-product attention is computed on the different mappings in parallel, the results are concatenated and fed into a linear mapping layer, and the output of the multi-head self-attention mechanism is finally obtained. Through the h linear mappings, the model can learn relevant information in different representation subspaces.
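A self-contained sketch of this h-head mechanism (layer sizes and names are ours, not the patent's):

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return scores.softmax(dim=-1) @ V

class MultiHeadSelfAttention(nn.Module):
    """h parallel heads: linear mappings of the input, scaled dot-product
    attention per head, concatenation, then a final linear mapping."""
    def __init__(self, dim, heads):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_k = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)   # the h linear mappings, fused
        self.out = nn.Linear(dim, dim)       # final linear mapping layer

    def forward(self, x):                    # x: (batch, tokens, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (batch, heads, tokens, d_k)
        split = lambda t: t.view(b, n, self.heads, self.d_k).transpose(1, 2)
        z = scaled_dot_product_attention(split(q), split(k), split(v))
        z = z.transpose(1, 2).reshape(b, n, d)   # concatenate the heads
        return self.out(z)

x = torch.randn(2, 64, 128)
print(MultiHeadSelfAttention(128, 8)(x).shape)   # torch.Size([2, 64, 128])
```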
S24, adopting a classified cross entropy loss function as a segmentation network optimization target, wherein the loss function is designed as follows:
$$L(\theta') = -\frac{1}{S}\sum_{s=1}^{S}\sum_{c=1}^{C} Y'_{s,c}\log Y_{s,c}$$
where θ' is the classification network parameter, Y' is the segmentation label, Y is the predicted probability, S is the number of image pixels, and C is the number of pixel classes; in the experiments C = 2 and S = 512 × 512 = 262144;
to prevent overfitting, an L2 regularization term is added to the loss function to obtain the final objective function, defined as follows:
$$J(\theta') = L(\theta') + \frac{\lambda}{2Q}\sum_{q=1}^{Q}\theta_q'^{2}$$
where $\frac{\lambda}{2Q}\sum_{q=1}^{Q}\theta_q'^{2}$ is the regularization term, λ is the regularization factor, and Q is the number of model parameters.
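A sketch of this objective in PyTorch; the exact scaling of the L2 term is our assumption, as the patent only names λ and Q:

```python
import torch
import torch.nn.functional as F

def objective(logits, target, model, lam=1e-4):
    """Categorical cross-entropy averaged over the S pixels, plus an
    L2 regularization term lam/(2Q) * sum(theta'^2) over the Q parameters."""
    ce = F.cross_entropy(logits, target)  # logits: (N, 2, 512, 512), target: (N, 512, 512)
    params = [p for p in model.parameters() if p.requires_grad]
    Q = sum(p.numel() for p in params)    # number of model parameters
    l2 = sum((p ** 2).sum() for p in params)
    return ce + lam / (2 * Q) * l2
```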
S25, optimizing a function:
optimizing the objective function with a stochastic gradient descent algorithm and updating the retinal vessel segmentation network model parameters by error back-propagation; the specific optimization process is:
$$g_t = \nabla_{\theta} L(\theta_{t-1})$$
$$m_t = \mu\, m_{t-1} + \eta_t\, g_t$$
$$\theta_t = \theta_{t-1} - m_t$$
where t denotes the iteration number, $\nabla_{\theta}$ denotes the gradient operator, θ corresponds to θ' in the objective function of step S24, L(θ_{t-1}) is the loss function evaluated with the network parameters θ_{t-1}, and g_t, m_t and μ are the gradient, momentum and momentum coefficient respectively, e.g. μ = 0.99; η_t is the learning rate, initially set to 1e-3 and decreased by a factor of 10 every 50 iterations until it reaches 1e-5.
S3, fast positioning and accurate segmentation of the retinal vascular structure based on OCTA:
S31, sequentially performing data preprocessing and online enhancement on the OCTA fundus images to obtain processed images; for specifics, refer to the data preprocessing and online enhancement techniques of step S22;
S32, inputting the online-enhanced image as a three-channel input into the feature extractor consisting of the trunk feature extractors of different scales, the structural feature extractor and the enhanced feature extractor for feature extraction, automatically locating and outputting the reconstructed retinal vessel image feature map;
s33, inputting the reconstructed retinal blood vessel image feature map into a classifier, and predicting feature image pixels one by one in a sliding window mode to generate two pixel label prediction value maps with the same size as the original image;
s34, converting the prediction score into probability distribution by using a softmax function;
s35, taking the subscript component (0 or 1) where the maximum probability of each pixel is located as a pixel class label, and obtaining a retina blood vessel segmentation result binary image (a blood vessel region and a non-blood vessel region) while realizing rapid positioning of a blood vessel structure.
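Steps S33-S35 amount to a forward pass followed by softmax and a per-pixel argmax; a sketch under assumed tensor shapes (the network object is a placeholder):

```python
import torch

def segment(net, image):
    """image: (1, 3, 512, 512) enhanced OCTA input; returns the binary map."""
    with torch.no_grad():
        scores = net(image)                   # S33: two pixel-label prediction maps
        probs = torch.softmax(scores, dim=1)  # S34: scores -> probability distribution
        mask = probs.argmax(dim=1)            # S35: 0 = non-vessel, 1 = vessel
    return mask                               # (1, 512, 512) segmentation binary map
```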
Compared with the prior art, the OCTA image retinal vessel segmentation method based on the attention mechanism has the following beneficial effects:
1. Each convolution layer group in the trunk feature extractor and the enhanced feature extractor consists of a channel-by-channel convolution layer and a point-by-point convolution layer, so the retinal vessel feature extraction stage uses depthwise separable convolution based on large convolution kernels; this extracts vessel detail features better, reduces the number of vessel-end pixels misclassified as background, lowers the computational cost of the convolution operations, and increases segmentation speed;
2. In the down-sampling stage, the trunk feature extractor adopts adjacent-block merging, capturing detail information around the feature-map border while shrinking the feature map, reducing the loss of important information such as vessels at the border of feature maps of different scales, and improving segmentation accuracy;
3. A convolutional self-attention module (STAM) is introduced into the structural feature extractor to better learn the spatial and channel information of feature maps of different scales and to capture the complete vessel structure, avoiding vessel discontinuity and breakage in the retinal vessel segmentation result and improving segmentation precision.
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. An OCTA image retinal vessel segmentation method based on an attention mechanism is characterized by comprising the following steps:
s1, building a convolutional neural network segmentation model with an attention mechanism:
S11, the convolutional neural network segmentation model with an attention mechanism consists of trunk feature extractors of different scales, a structural feature extractor, an enhanced feature extractor and a classifier, wherein the trunk feature extractors of different scales down-sample the retinal vessel feature map of the OCTA fundus image four times to obtain four vessel feature maps whose sizes are 1/4, 1/16, 1/64 and 1/256 of the input feature map, extract vessel detail features from the four feature maps of different scales, and connect them serially in order of feature-map size from large to small; the structural feature extractor accurately extracts vessel structure features from the 1/4 and 1/16 vessel feature maps produced by the trunk feature extractor before these are channel-concatenated with the corresponding-scale feature maps in the enhanced feature extractor; the enhanced feature extractor up-samples the 1/256 high-order feature map produced by the trunk feature extractor, gradually restoring the feature-map size through the ratios 1/64, 1/16, 1/4 and 1/1 while extracting vessel detail features at each scale; the classifier assigns a label to each pixel according to the vessel features extracted from the feature maps of different scales; the input of the segmentation network has three channels and the output has two channels, the input and output images are both 512 × 512 in size, and end-to-end semantic segmentation is realized;
S12, the trunk feature extractors of different scales comprise ten convolution layer groups, two convolution layers and four adjacent-block merging layers, with one adjacent-block merging layer after every two convolution layer groups and the two convolution layers located between the last two convolution layer groups; each convolution layer group consists of a channel-by-channel convolution layer with a 7 × 7 convolution kernel and stride 1 and a point-by-point convolution layer with a 1 × 1 convolution kernel and stride 1; the structural feature extractor comprises two attention layers, each containing a channel attention submodule and a spatial attention submodule; the enhanced feature extractor comprises eight convolution layer groups and four up-sampling layers, with two convolution layer groups after each up-sampling layer, each convolution layer group again consisting of a 7 × 7, stride-1 channel-by-channel convolution layer and a 1 × 1, stride-1 point-by-point convolution layer; the classifier consists of a class prediction layer and a softmax regression layer, the softmax regression layer converting class prediction scores into a probability distribution;
s2, model training and parameter optimization:
s21, initializing the convolution neural network segmentation model parameters with the attention mechanism built in the step S1 by adopting an Xavier method;
S22, preprocessing and online-enhancing the retinal vessel data set with retinal vessel segmentation labels, dividing the data set into a training set, a validation set and a test set in a 7:2:1 ratio, and pre-training the network segmentation model with 10-fold cross-validation;
s23, inputting the OCTA images of the same retinal vessel section into a network, and generating a retinal vessel segmentation result through network forward calculation, wherein the network forward calculation comprises convolution operation, nonlinear excitation, probability value conversion and multi-head attention calculation;
S24, adopting the categorical cross-entropy loss function as the optimization target of the segmentation network, the objective function being defined as follows:
$$J(\theta') = -\frac{1}{S}\sum_{s=1}^{S}\sum_{c=1}^{C} Y'_{s,c}\log Y_{s,c} + \frac{\lambda}{2Q}\sum_{q=1}^{Q}\theta_q'^{2}$$
where θ' is the classification network parameter, Y' is the segmentation label, Y is the predicted probability, S is the number of image pixels, C is the number of pixel classes, $\frac{\lambda}{2Q}\sum_{q=1}^{Q}\theta_q'^{2}$ is the regularization term, λ is the regularization factor, and Q is the number of model parameters;
s25, optimizing a function:
optimizing an objective function by adopting a stochastic gradient descent algorithm, and updating parameters of the retinal vessel segmentation network model by using error back propagation, wherein the optimization process comprises the following steps:
$$g_t = \nabla_{\theta} L(\theta_{t-1})$$
$$m_t = \mu\, m_{t-1} + \eta_t\, g_t$$
$$\theta_t = \theta_{t-1} - m_t$$
where t denotes the iteration number, $\nabla_{\theta}$ denotes the gradient operator, θ corresponds to θ' in the objective function of step S24, L(θ_{t-1}) is the loss function evaluated with the network parameters θ_{t-1}, g_t, m_t and μ are the gradient, momentum and momentum coefficient respectively, and η_t is the learning rate;
s3, fast positioning and accurate segmentation of the retinal vascular structure based on OCTA:
s31, sequentially carrying out data preprocessing and online enhancement processing on the fundus images based on the OCTA to obtain processed images;
S32, inputting the online-enhanced image as a three-channel input into the feature extractor consisting of the trunk feature extractor, the structural feature extractor and the enhanced feature extractor for feature extraction, automatically locating and outputting the reconstructed retinal vessel image feature map;
s33, inputting the reconstructed retinal blood vessel image feature map into a classifier, and predicting feature image pixels one by one in a sliding window mode to generate two pixel label prediction value maps with the same size as the original image;
s34, converting the prediction score into probability distribution by using a softmax function;
s35, taking the subscript component where the maximum probability of each pixel is located as a pixel class label, and obtaining a retina blood vessel segmentation result binary image while realizing rapid positioning of a blood vessel structure.
2. The OCTA image retinal vessel segmentation method based on an attention mechanism as claimed in claim 1, wherein in the trunk feature extractors of different scales in step S12, each of the two convolution layers has a 1 × 1 convolution kernel and a stride of 1, with 1024 and 512 convolution kernels respectively, and each adjacent-block merging layer consists of a convolution layer with a 2 × 2 convolution kernel and a stride of 2 and a convolution layer with a 1 × 1 convolution kernel and a stride of 1.
3. The OCTA image retinal vessel segmentation method based on an attention mechanism as claimed in claim 1, wherein in the structural feature extractor of step S12, the channel attention submodule comprises, connected in sequence, a global max pooling layer, a global average pooling layer, a depthwise separable convolution layer with a 1 × 2 convolution kernel and two convolution layers with 1 × 1 convolution kernels; the spatial attention submodule comprises, connected in sequence, a convolution layer with a 1 × 1 convolution kernel, a multi-head self-attention calculation layer, a fully connected layer and a convolution layer with a 1 × 1 convolution kernel.
4. The method of claim 1, wherein each up-sampling layer in the enhanced feature extractor of step S12 is a deconvolution layer with a convolution kernel size of 3 × 3 and a stride of 2.
5. The method for retinal vessel segmentation based on an OCTA image of an attention mechanism as claimed in claim 1, wherein the size of the convolution kernel of the class prediction layer in the classifier of the step S12 is 1 x 1, the number of the convolution kernels is 2, and the step size is 1.
6. The method for retinal vessel segmentation based on OCTA images in an attention mechanism according to claim 1, wherein the data preprocessing in step S22 uniformly resizes the images to 512 × 512 pixels before they are fed to the network, and the online enhancement applies horizontal flipping, vertical flipping, cropping, and rotation by 45°, 90°, 135°, 180°, 225°, 270° and 315°, increasing the training samples to ten times the original.
7. The method for segmenting retinal blood vessels based on an OCTA image of claim 1, wherein the convolution operation in step S23 is: the output feature map Z_i corresponding to any convolution kernel in the network is calculated with the following formula:
$$Z_i = f\left(\sum_{r=1}^{k} W_{ir} * X_r + b_i\right)$$
where f denotes the nonlinear excitation function, r denotes the input channel index, k denotes the number of input channels, W_{ir} denotes the r-th channel weight matrix of the i-th convolution kernel, b_i is the convolution bias, and X_r denotes the r-th input channel image.
8. The method for segmenting retinal blood vessels based on an OCTA image of an attention mechanism as claimed in claim 1, wherein the nonlinear excitation in step S23 is: the rectified linear unit ReLU is used as the nonlinear excitation function to nonlinearly transform the output feature map Z_i generated by the convolution kernel; the rectified linear unit ReLU is defined as follows:
f(x) = max(0, x)
where f(x) is the rectified linear unit function, max takes the maximum value, and x is the input value.
9. The method for retinal vessel segmentation based on an OCTA image of an attention mechanism as claimed in claim 1, wherein the probability value conversion in step S23 is: the prediction scores output by the network are converted into a probability distribution using the softmax function, defined as follows:
$$Y_j = \frac{e^{O_j}}{\sum_{k=1}^{K} e^{O_k}}$$
where Y_j is the probability that a pixel belongs to class j, O_j is the prediction score of the pixel on the j-th class, and K is the number of classes.
10. The method for segmenting retinal vessels of an OCTA image based on an attention mechanism as claimed in claim 1, wherein in the multi-head attention calculation of step S23, the multi-head self-attention mechanism performs its attention computation only once, in parallel across heads, and finally combines the results; it is implemented with the scaled dot-product attention operation, defined as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where Attention is the attention function, Q, K and V are the input matrices of the scaled dot-product attention, and d_k is the dimensionality of the K matrix; Q is multiplied by the transpose of K, scaled, passed through the softmax function, and then multiplied by V to obtain the final self-attention output.
CN202210960639.9A 2022-08-11 2022-08-11 OCTA image retinal vessel segmentation method based on attention mechanism Pending CN115294075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210960639.9A CN115294075A (en) 2022-08-11 2022-08-11 OCTA image retinal vessel segmentation method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210960639.9A CN115294075A (en) 2022-08-11 2022-08-11 OCTA image retinal vessel segmentation method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN115294075A true CN115294075A (en) 2022-11-04

Family

ID=83828949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210960639.9A Pending CN115294075A (en) 2022-08-11 2022-08-11 OCTA image retinal vessel segmentation method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN115294075A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012367A (en) * 2023-02-14 2023-04-25 山东省人工智能研究院 Deep learning-based stomach mucosa feature and position identification method
CN116012367B (en) * 2023-02-14 2023-09-12 山东省人工智能研究院 Deep learning-based stomach mucosa feature and position identification method
CN115984952A (en) * 2023-03-20 2023-04-18 杭州叶蓁科技有限公司 Eye movement tracking system and method based on bulbar conjunctiva blood vessel image recognition
CN115984952B (en) * 2023-03-20 2023-11-24 杭州叶蓁科技有限公司 Eye movement tracking system and method based on bulbar conjunctiva blood vessel image recognition
CN117058380A (en) * 2023-08-15 2023-11-14 北京学图灵教育科技有限公司 Multi-scale lightweight three-dimensional point cloud segmentation method and device based on self-attention
CN117058380B (en) * 2023-08-15 2024-03-26 北京学图灵教育科技有限公司 Multi-scale lightweight three-dimensional point cloud segmentation method and device based on self-attention

Similar Documents

Publication Publication Date Title
CN108021916B (en) Deep learning diabetic retinopathy sorting technique based on attention mechanism
CN109345538B (en) Retinal vessel segmentation method based on convolutional neural network
CN108898175B (en) Computer-aided model construction method based on deep learning gastric cancer pathological section
CN115294075A (en) OCTA image retinal vessel segmentation method based on attention mechanism
CN107977932A (en) It is a kind of based on can differentiate attribute constraint generation confrontation network face image super-resolution reconstruction method
CN107256550A (en) A kind of retinal image segmentation method based on efficient CNN CRF networks
CN107506761A (en) Brain image dividing method and system based on notable inquiry learning convolutional neural networks
CN115205300B (en) Fundus blood vessel image segmentation method and system based on cavity convolution and semantic fusion
WO2021115084A1 (en) Structural magnetic resonance image-based brain age deep learning prediction system
CN114038037B (en) Expression label correction and identification method based on separable residual error attention network
CN114283158A (en) Retinal blood vessel image segmentation method and device and computer equipment
CN113223005B (en) Thyroid nodule automatic segmentation and grading intelligent system
CN112508864A (en) Retinal vessel image segmentation method based on improved UNet +
CN115223715A (en) Cancer prediction method and system based on multi-modal information fusion
CN113610859B (en) Automatic thyroid nodule segmentation method based on ultrasonic image
CN113012163A (en) Retina blood vessel segmentation method, equipment and storage medium based on multi-scale attention network
CN110738660A (en) Spine CT image segmentation method and device based on improved U-net
CN111444844A (en) Liquid-based cell artificial intelligence detection method based on variational self-encoder
CN114511502A (en) Gastrointestinal endoscope image polyp detection system based on artificial intelligence, terminal and storage medium
CN112288749A (en) Skull image segmentation method based on depth iterative fusion depth learning model
CN113361353A (en) Zebrafish morphological scoring method based on DeepLabV3Plus
CN114565628B (en) Image segmentation method and system based on boundary perception attention
CN109508640A (en) A kind of crowd's sentiment analysis method, apparatus and storage medium
CN113344933B (en) Glandular cell segmentation method based on multi-level feature fusion network
Sallam et al. Diabetic retinopathy grading using resnet convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination