CN117912092B - Fundus image identification method and device based on binocular feature fusion and storage medium
- Publication number: CN117912092B (application CN202410101607.2A)
- Authority: CN (China)
- Prior art keywords: binocular, fundus image, feature map, image, feature
- Legal status: Active
Classifications
- G06V40/193: Eye characteristics - Preprocessing; Feature extraction
- G06V40/197: Eye characteristics - Matching; Classification
- G06N3/045: Neural networks - Combinations of networks
- G06N3/0464: Neural networks - Convolutional networks [CNN, ConvNet]
- G06N3/0475: Neural networks - Generative networks
- G06N3/094: Learning methods - Adversarial learning
- G06V10/764: Image or video recognition using pattern recognition or machine learning - classification
- G06V10/774: Generating sets of training patterns; Bootstrap methods
- G06V10/806: Fusion of extracted features
- G06V10/82: Image or video recognition using neural networks
Abstract
The invention relates to a fundus image recognition method, device and storage medium based on binocular feature fusion, applied to the technical field of fundus image recognition. Because the paired binocular fundus image data available for model training are very limited, a loose pairing method is combined with a WGAN generative adversarial network to effectively expand the binocular fundus image data used for training. On the basis of a ResNet residual network, the neural network is trained with combined channel attention and spatial attention mechanisms, so that the network focuses more on lesion feature information, which effectively improves the feature extraction capability of the model and reduces redundant computation. The binocular fundus image features are fused by a moving average method, so that the binocular information is fully utilized, feature loss and redundancy are avoided, the stability and consistency of the fusion result are improved, time and computing resources are saved, and the overall processing efficiency is improved.
Description
Technical Field
The invention relates to the technical field of fundus image recognition, in particular to a fundus image recognition method and device based on binocular feature fusion and a storage medium.
Background
The fundus is a critical part of the interior of the eye and of the visual organ, and is of vital physiological and medical significance. Fundus disease, however, is influenced by a variety of factors, including genetics, age, environment, lifestyle and general health; these factors interweave with each other and increase the complexity of the disease. Owing to the specificity and complexity of fundus diseases, the probability that fundus images contain two or more diseases is extremely high, and if these fundus diseases are not promptly diagnosed and treated, they may unknowingly cause irreversible vision loss in both eyes.
Most fundus disease studies in existence are based solely on monocular fundus images.
However, because fundus diseases are complex and largely independent between the two eyes, the diseases of a patient's left and right eyes often differ, so a diagnosis based only on the lesion information of a monocular fundus image lacks globality. Furthermore, some fundus diseases (such as glaucoma) have a long latent onset period; when the early lesion information is not obvious, disease detection based only on a monocular fundus image may fail to diagnose the disease accurately in time and thus miss the optimal treatment window. Meanwhile, existing research shows that the paired binocular fundus image data available for model training are very limited, and expanding these image data by relying only on conventional data enhancement methods is insufficient. Moreover, traditional feature extraction methods focus only on global or local features, ignore the integrity and detail of the features, and involve much redundant computation and huge numbers of parameters, which makes the models inefficient and difficult to train.
Disclosure of Invention
In view of the above, the present invention aims to provide a fundus image recognition method, device and storage medium based on binocular feature fusion, so as to solve the following problems in the prior art: disease detection based only on a monocular fundus image may miss the optimal treatment period because the disease is not accurately diagnosed in time; the paired binocular fundus image data available for model training are very limited, and expanding these image data only by traditional data enhancement methods is insufficient; and traditional feature extraction methods focus only on global or local features, ignore the integrity and detail of the features, and involve much redundant computation and huge numbers of parameters, resulting in low model efficiency and difficult training.
According to a first aspect of an embodiment of the present invention, there is provided a fundus image identification method based on binocular feature fusion, the method comprising:
Acquiring a binocular fundus image dataset comprising binocular fundus images of preset disease classifications;
Performing abnormal image cleaning on the binocular fundus image dataset, and performing image preprocessing operation on the binocular fundus image dataset after the abnormal image cleaning;
Constructing a WGAN network framework, and training the WGAN network framework on the image-preprocessed binocular fundus image dataset to obtain a trained WGAN network;
Inputting the binocular fundus image subjected to image preprocessing into the trained WGAN network, and generating an expanded training data set by inputting random noise;
selecting an input sample from the expanded training data set, wherein the input sample is a left eye fundus image and a right eye fundus image containing the same disease classification, so as to obtain an original pairing list;
performing loose pairing on the original pairing list to obtain a loose pairing list; combining the original pairing list and the loose pairing list to obtain a new binocular fundus image dataset;
Setting left eye fundus images and right eye fundus images in the new binocular fundus image dataset to be of preset sizes, inputting the left eye fundus images and the right eye fundus images into a pre-built residual attention module, and extracting channel attention characteristics of the new binocular fundus image dataset by using channel attention weights of channel attention modules in the residual attention module to obtain a weighted characteristic diagram;
inputting the weighted feature map into a spatial attention module of the residual attention module to extract the spatial attention feature of the weighted feature map, so as to obtain a final feature map;
Performing feature fusion on the final feature map of the left eye and the final feature map of the right eye output by the residual attention module by setting a moving average variable to obtain a final binocular fusion feature;
and inputting the final binocular fusion characteristics into a pre-trained classifier, and outputting a disease type label.
Preferably,
the performing feature fusion on the final feature map of the left eye and the final feature map of the right eye output by the residual attention module by setting a moving average variable to obtain a final binocular fusion feature comprises the following steps:
Inputting the final feature map of the left eye and the final feature map of the right eye output by the residual attention module into a pre-built binocular feature fusion module;
Initializing a moving average variable into a zero matrix in the binocular feature fusion module, wherein the size of the moving average variable is consistent with the input features;
setting a moving average decay coefficient, and multiplying the left eye final feature map and the right eye final feature map of each frame element-wise to obtain the input feature at the current moment;
updating a moving average variable, obtaining an output fusion characteristic at the current moment through the updated moving average variable, taking the output fusion characteristic at the current moment as the input of a next frame, selecting a specific time step as the termination, and outputting a final binocular fusion characteristic.
Preferably,
the performing loose pairing on the original pairing list to obtain a loose pairing list comprises the following steps:
setting a disease category label for each input sample in the original pairing list; if the disease category labels of any two input samples are the same, pairing the left eye fundus image of one input sample with the right eye fundus image of the other input sample, and pairing the right eye fundus image of the one input sample with the left eye fundus image of the other input sample, and adding both pairs to a preset empty loose pairing list; traversing the whole original pairing list to obtain the loose pairing list.
Preferably,
the extracting the channel attention features of the new binocular fundus image dataset by using the channel attention weights of the channel attention module in the residual attention module to obtain a weighted feature map comprises the following steps:
performing a convolution operation on fundus images in the new binocular fundus image dataset using a 3×3 convolution kernel, with a stride of 1 and boundary padding of 1, to obtain a first convolution feature map;
performing batch normalization on the first convolution feature map, computing the mean and variance on each channel, and then applying scaling and shifting operations to the first convolution feature map;
applying a ReLU activation function to perform nonlinear activation on the scaled and shifted first convolution feature map;
performing a second convolution operation on the activated first convolution feature map to generate a second convolution feature map, and applying batch normalization, scaling, shifting and nonlinear activation operations to the second convolution feature map;
then performing a third convolution operation on the second convolution feature map using 128 3×3 convolution kernels to obtain a feature map;
Performing global average pooling operation on the feature map to obtain a first feature vector;
mapping the first feature vector into a channel attention weight W through a full connection layer to generate a second feature vector;
and multiplying the second feature vector by the corresponding channel of the feature map element by element to obtain a weighted feature map.
Preferably,
The step of inputting the weighted feature map to the spatial attention module of the residual attention module to extract the spatial attention feature of the weighted feature map, and the step of obtaining a final feature map includes:
Carrying out global average pooling on each channel in the weighted feature map to obtain a third feature vector with a first preset size;
performing global max pooling on each channel in the weighted feature map to obtain a fourth feature vector of the first preset size;
concatenating the third feature vector and the fourth feature vector along the channel direction to obtain a feature map of a second preset size;
performing a convolution operation on the feature map of the second preset size using a 1×1 convolution kernel to obtain a target feature map;
and performing nonlinear activation on the target feature map by using a ReLU activation function to obtain a final feature map.
Preferably,
The image preprocessing operation for the binocular fundus image data set after the abnormal image cleaning comprises the following steps:
Respectively carrying out image normalization operation, image weighting enhancement operation and image enhancement operation on the binocular fundus image data set after the abnormal image cleaning;
The image weighting enhancement operation includes:
performing a convolution operation on the original fundus image in the binocular fundus image dataset with a Gaussian kernel to generate a blurred image;
setting weighting coefficients, and obtaining a weighted and enhanced fundus image from the weighting coefficients, the original fundus image and the blurred image;
the image enhancement operation includes:
rotating the fundus image by 45 ° or 90 °;
randomly translating the fundus image along the horizontal or vertical direction, wherein the distance of translation is between 0 and 10 percent of the width or the height of the fundus image;
and randomly flipping the translated fundus image to obtain the enhanced fundus image.
Preferably,
the constructing of the WGAN network framework includes:
respectively setting the loss functions of the generator and the discriminator in the WGAN network framework;
performing weight clipping on the weights of the discriminator, limiting the weights within a preset range;
minimizing the Wasserstein GAN loss function using an Adam optimizer.
According to a second aspect of an embodiment of the present invention, there is provided a fundus image recognition apparatus based on binocular feature fusion, the apparatus comprising:
The data set construction module: used for acquiring a binocular fundus image dataset, wherein the binocular fundus image dataset comprises binocular fundus images of preset disease classifications;
The preprocessing module: used for performing abnormal image cleaning on the binocular fundus image dataset, and performing image preprocessing operations on the binocular fundus image dataset after the abnormal image cleaning;
The WGAN network module: used for constructing a WGAN network framework, and training the WGAN network framework on the image-preprocessed binocular fundus image dataset to obtain a trained WGAN network;
The expansion module: used for inputting the image-preprocessed binocular fundus images into the trained WGAN network, and generating an expanded training data set by inputting random noise;
The original pairing module: used for selecting input samples from the expanded training data set, wherein an input sample is a left eye fundus image and a right eye fundus image containing the same disease classification, so as to obtain an original pairing list;
The loose pairing module: used for performing loose pairing on the original pairing list to obtain a loose pairing list, and combining the original pairing list and the loose pairing list to obtain a new binocular fundus image dataset;
The channel attention extraction module: used for setting the left eye fundus images and right eye fundus images in the new binocular fundus image dataset to a preset size, inputting them into a pre-built residual attention module, and extracting the channel attention features of the new binocular fundus image dataset by using the channel attention weights of the channel attention module in the residual attention module, to obtain a weighted feature map;
The spatial attention extraction module: used for inputting the weighted feature map into the spatial attention module of the residual attention module to extract the spatial attention features of the weighted feature map, so as to obtain a final feature map;
The feature fusion module: used for performing feature fusion on the final feature map of the left eye and the final feature map of the right eye output by the residual attention module by setting a moving average variable, to obtain a final binocular fusion feature;
The output module: used for inputting the final binocular fusion feature into a pre-trained classifier and outputting a disease type label.
According to a third aspect of embodiments of the present invention, there is provided a storage medium storing a computer program which, when executed by a master controller, implements the steps of the above method.
The technical scheme provided by the embodiment of the invention can comprise the following beneficial effects:
Aiming at the situation that the paired binocular fundus image data available for model training are very limited, the binocular fundus image data used for training are effectively expanded by combining a loose pairing method with a WGAN generative adversarial network. On the basis of a ResNet residual network, the neural network is trained with combined channel attention and spatial attention mechanisms, so that the network focuses more on lesion feature information, which effectively improves the feature extraction capability of the model and reduces redundant computation. The binocular fundus image features are fused by a moving average method, so that the binocular information is fully utilized, feature loss and redundancy are avoided, the stability and consistency of the fusion result are improved, time and computing resources are saved, and the overall processing efficiency is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flow diagram illustrating a binocular feature fusion-based fundus image recognition method according to an exemplary embodiment;
FIG. 2 is a schematic diagram of a residual attention module shown according to an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating the channel attention mechanism, according to an exemplary embodiment;
fig. 4 is a system diagram of a fundus image recognition apparatus based on binocular feature fusion, which is illustrated according to another exemplary embodiment;
In the accompanying drawings: the system comprises a 1-data set construction module, a 2-preprocessing module, a 3-WGAN network module, a 4-expansion module, a 5-original pairing module, a 6-loose pairing module, a 7-channel attention extraction module, an 8-space attention extraction module, a 9-feature fusion module and a 10-output module.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention.
Example 1
Fig. 1 is a flow chart illustrating a binocular feature fusion-based fundus image recognition method according to an exemplary embodiment, as shown in fig. 1, the method including:
s1, acquiring a binocular fundus image dataset, wherein the binocular fundus image dataset comprises binocular fundus images of preset disease classification;
S2, performing abnormal image cleaning on the binocular fundus image dataset, and performing image preprocessing operation on the binocular fundus image dataset after performing abnormal image cleaning;
S3, constructing a WGAN network framework, and training the WGAN network framework on the image-preprocessed binocular fundus image dataset to obtain a trained WGAN network;
S4, inputting the binocular fundus image subjected to image preprocessing into the trained WGAN network, and generating an expanded training data set by inputting random noise;
S5, selecting an input sample from the expanded training data set, wherein the input sample is a left eye fundus image and a right eye fundus image containing the same disease classification, and an original pairing list is obtained;
S6, carrying out loose pairing on the original pairing list to obtain a loose pairing list; combining the original pairing list and the loose pairing list to obtain a new binocular fundus image dataset;
S7, setting left eye fundus images and right eye fundus images in the new binocular fundus image dataset to be of preset sizes, inputting the left eye fundus images and the right eye fundus images into a pre-built residual attention module, and extracting channel attention characteristics of the new binocular fundus image dataset by using channel attention weights of a channel attention module in the residual attention module to obtain a weighted characteristic diagram;
S8, inputting the weighted feature map into a spatial attention module of the residual attention module to extract the spatial attention feature of the weighted feature map, so as to obtain a final feature map;
S9, carrying out feature fusion on the final feature map of the left eye and the final feature map of the right eye output by the residual attention module by setting a moving average variable to obtain a final binocular fusion feature;
s10, inputting the final binocular fusion characteristics into a pre-trained classifier, and outputting a disease type label;
it can be understood that the specific implementation process of the above scheme is given in this embodiment, as follows:
1.1 image dataset
A binocular fundus image dataset is prepared, comprising eight categories of images: normal (N), diabetes (D), glaucoma (G), cataract (C), age-related macular degeneration (A), hypertension (H), pathological myopia (M) and other diseases/abnormalities (O). These categories cover common fundus diseases as well as images generally diagnosed as abnormal or not belonging to the specific diseases listed; after data cleaning of abnormal images, labels are produced for the dataset.
1.2 Data preprocessing
After the image data is subjected to operations such as abnormal data cleaning, other preprocessing operations are performed on the image data, and the process mainly comprises three components: image normalization, image weighting enhancement, and image enhancement;
In this process the original image I_org is the input; after image normalization and image weighting enhancement the image I_weight is obtained, and after the final image enhancement operation the output image I_aug is obtained;
Wherein, the image normalization operation includes:
Adjusting the image pixel value range: performing range scaling processing on each channel (R, G, B) to ensure that the numerical ranges of the channels are consistent;
Mean subtraction: calculating the mean value of the image pixels and subtracting it from each pixel, so that the image has more consistency;
Standardization: calculating the standard deviation of the image pixels and dividing each pixel by it, reducing the variation amplitude of the image data so that the data are more stable;
Size adjustment: the fundus images are resized to 256×256 to ensure consistency of the data and conform to the input requirements of the model.
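By way of illustration only, the normalization steps above can be sketched as follows (Python with OpenCV/NumPy; the function name, the epsilon constants and the per-channel min-max scaling are assumptions of this sketch, not details taken from the patent):

```python
import cv2
import numpy as np

def normalize_fundus_image(img_bgr: np.ndarray, size: int = 256) -> np.ndarray:
    """Range-scale each channel, subtract the mean, divide by the std, then resize to 256x256."""
    img = img_bgr.astype(np.float32)

    # Per-channel range scaling so that the R, G, B channels share a consistent numerical range.
    for c in range(img.shape[2]):
        ch = img[:, :, c]
        lo, hi = ch.min(), ch.max()
        img[:, :, c] = (ch - lo) / (hi - lo + 1e-8)

    # Mean subtraction and standardization over all pixels.
    img = img - img.mean()
    img = img / (img.std() + 1e-8)

    # Resize to the model input size.
    return cv2.resize(img, (size, size), interpolation=cv2.INTER_AREA)
```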
Wherein the image weighting enhancement comprises:
Performing Gaussian blur (Gaussian Blurring) on the original image to reduce high-frequency detail information of the image;
The original image is convolved with a Gaussian kernel to produce a blurred image. The Gaussian blur formula is as follows:
I_blur = I_org * kernel_{h×w};
where I_org is the original image, I_blur is the Gaussian-blurred image, * denotes convolution, and kernel_{h×w} is a Gaussian kernel of size h×w;
The mathematical expression of the weighted enhancement is: I_weight = α·I_org + β·I_blur + γ;
This formula describes the generation of the enhanced image I_weight by weighting the original image I_org and the Gaussian-blurred image I_blur and adding the constant γ, where α and β are the weighting coefficients.
Wherein the image enhancement comprises:
Rotating the image by 45° or 90°; randomly translating the image along the horizontal (x-axis) or vertical (y-axis) direction, where the translation distance is between 0 and 10 percent of the image width or height; and randomly flipping the image; the preprocessed image I_aug is output after image enhancement.
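A minimal sketch of the weighted enhancement and augmentation described above is given below (Python with OpenCV; the Gaussian kernel size and the values of α, β and γ are illustrative assumptions, as the patent does not fix them, while the 45°/90° rotation and 0-10% translation follow the text):

```python
import random
import cv2
import numpy as np

def weighted_enhance(img: np.ndarray, alpha: float = 4.0, beta: float = -4.0,
                     gamma: float = 128.0, ksize: int = 9) -> np.ndarray:
    """I_weight = alpha * I_org + beta * I_blur + gamma, with I_blur a Gaussian-blurred copy."""
    blur = cv2.GaussianBlur(img, (ksize, ksize), 0)        # I_blur = I_org * kernel_{h x w}
    return cv2.addWeighted(img, alpha, blur, beta, gamma)  # weighted combination of the two images

def augment(img: np.ndarray) -> np.ndarray:
    """Rotate by 45 or 90 degrees, translate by 0-10% along one axis, then randomly flip."""
    h, w = img.shape[:2]
    angle = random.choice([45, 90])
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    out = cv2.warpAffine(img, rot, (w, h))

    if random.random() < 0.5:                              # horizontal or vertical translation
        tx, ty = random.uniform(0, 0.1) * w, 0.0
    else:
        tx, ty = 0.0, random.uniform(0, 0.1) * h
    shift = np.float32([[1, 0, tx], [0, 1, ty]])
    out = cv2.warpAffine(out, shift, (w, h))

    if random.random() < 0.5:                              # random flip
        out = cv2.flip(out, 1)
    return out
```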
1.3 WGAN network and loose pairing
1.3.1 WGAN network:
The GAN network consists of two parts: a generator G and a discriminator D. The two compete with each other and learn to generate realistic samples in a game-like manner. WGAN introduces the Wasserstein distance as the training target of the discriminator, instead of the JS divergence (Jensen-Shannon divergence) used in the original GAN. The Wasserstein distance measures the "distance" between two distributions; intuitively, it is the expected difference between the real samples and the generated samples. Using this distance makes WGAN easier to train and better able to produce high-quality samples;
The loss function of the generator is:
L_G^{WGAN} = -E_{z~P_noise}[D(G(z))];
where L_G^{WGAN} denotes the Wasserstein GAN loss function of the generator, E_{z~P_noise}[·] denotes the expectation over the random noise, z is random noise sampled from the noise distribution P_noise, G(z) is the output of the generator, i.e. a synthesized fundus image generated from the random noise, and D(G(z)) denotes the output of the discriminator for the generated fundus image, i.e. the discrimination probability assigned to the generated data;
The loss function of the discriminator is:
L_D^{WGAN} = E_{z~P_noise}[D(G(z))] - E_{I_aug~P_real}[D(I_aug)];
where L_D^{WGAN} denotes the Wasserstein GAN loss function of the discriminator, E_{I_aug~P_real}[·] denotes the expectation over real samples, I_aug is a data-preprocessed fundus image drawn from the real data distribution P_real, and D(I_aug) and D(G(z)) denote the outputs of the discriminator for the real and generated fundus images respectively;
WGAN is also innovative in introducing weight clipping, which is applied to the discriminator to force it to maintain Lipschitz continuity and ensure that the computation of the Wasserstein distance is well defined; this weight clipping method helps to stabilize the training process;
the WGAN network concretely comprises the following steps:
① Preparing a data-preprocessed binocular fundus image dataset, each sample containing images of both eyes;
② Constructing a network:
the input of the generator is a random noise vector z, and the output is a synthesized fundus image G (z);
The input of the discriminator is either a data-preprocessed real fundus image I_aug or a synthesized fundus image G(z);
The network architectures are built by stacking convolution layers, batch normalization and activation functions (Leaky ReLU) in series;
③ Determining the loss functions:
The loss function of the generator is L_G^{WGAN} = -E_{z~P_noise}[D(G(z))];
The loss function of the discriminator is L_D^{WGAN} = E_{z~P_noise}[D(G(z))] - E_{I_aug~P_real}[D(I_aug)];
④ Weight clipping is implemented: performing weight clipping on the weights of the discriminators, limiting the weights to a small range to ensure Lipschitz continuity;
⑤ Selecting the Adam optimizer to minimize the Wasserstein GAN loss functions;
⑥ Training process:
a, sampling a batch of samples from the data-preprocessed real fundus image dataset I_aug;
b, sampling a batch of random noise z from noise distribution;
c, generating a synthesized fundus image by using a generator;
d, calculating the loss of the discriminator and the loss of the generator;
e, updating parameters of the discriminator and the generator;
f, repeating the steps;
⑦ Adjusting hyperparameters such as the learning rate, number of epochs and batch size according to the training results, to optimize the generation process;
⑧ Using the trained generator, synthesized binocular fundus images X_w = G(z) are generated by inputting random noise z; X_w is the training data set obtained by expanding the data-preprocessed data set I_aug through the WGAN network.
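The training loop of steps ①-⑧ can be sketched in PyTorch as follows; the generator and discriminator stubs, the clipping threshold, the latent dimension and the optimizer settings are illustrative assumptions rather than the architectures and hyperparameters of the patent:

```python
import torch
from torch import nn

# Illustrative stubs: the patent's networks stack convolution layers, batch normalization
# and Leaky ReLU activations; flattened fully connected layers are used here for brevity.
latent_dim, img_dim = 128, 2 * 3 * 64 * 64
G = nn.Sequential(nn.Linear(latent_dim, 1024), nn.LeakyReLU(0.2), nn.Linear(1024, img_dim))
D = nn.Sequential(nn.Linear(img_dim, 1024), nn.LeakyReLU(0.2), nn.Linear(1024, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))
clip_value = 0.01                                      # weight clipping range (Lipschitz constraint)

def wgan_step(real_batch: torch.Tensor, n_critic: int = 5):
    b = real_batch.size(0)
    for _ in range(n_critic):                          # d. compute the discriminator loss, e. update
        z = torch.randn(b, latent_dim)
        fake = G(z).detach()
        loss_d = D(fake).mean() - D(real_batch).mean() # L_D = E[D(G(z))] - E[D(I_aug)]
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        for p in D.parameters():                       # weight clipping on the discriminator
            p.data.clamp_(-clip_value, clip_value)
    z = torch.randn(b, latent_dim)
    loss_g = -D(G(z)).mean()                           # L_G = -E[D(G(z))]
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```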
1.3.2 Loose pairing:
To obtain more training instances, the natural strategy is to strictly pair the left and right fundus images of the same person according to their disease tags. Suppose the training set contains two glaucoma patients a and b, giving four glaucoma-patient eyes: the left eye left_a and right eye right_a of patient a, and the left eye left_b and right eye right_b of patient b, i.e. set a = {left_a, right_a} and set b = {left_b, right_b}. Only two sets of strict pairings can be formed from these. In order to increase the number of binocular fundus image training instances, the present application constructs input pairs based on disease category labels instead of on the left and right eyes of the same patient: a left eye fundus image can be paired with any right eye fundus image as long as the disease category labels are the same. Based on this loose pairing method created by the present application, the original two available training instances from patients a and b are expanded into four available training instances: {left_a, right_a}, {left_a, right_b}, {left_b, right_a}, {left_b, right_b}. In other words, as long as the disease category labels of a left eye and a right eye are the same, they can be paired together to form a loose image pair. It can thus be seen that, by this method, the application builds more training instances based on the patients' category labels, and the original training data are expanded quadratically;
The loose pairing concretely comprises the following steps:
① Inputting X_w as the dataset, where each sample in X_w contains fundus images of the left eye (left_X_w) and the right eye (right_X_w);
② Each sample i is assigned a true disease category label. The disease category label of sample i is denoted P_i, where P_i is one of the eight disease category labels {'N', 'D', 'G', 'C', 'A', 'H', 'M', 'O'};
③ Establishing a disease category dictionary, mapping the disease category label of each sample to the corresponding fundus image data;
④ For each sample i, the left eye (left_X_w_i) and right eye (right_X_w_i) are taken as the original pair;
⑤ Loose pairing strategy:
a. Initializing an empty loose pairing list (LoosePairs = []);
b. Traversing the dataset X_w, for each sample i;
c. Traversing the dataset X_w, for each sample j (j ≠ i);
d. If the disease category labels of sample i and sample j are the same (P_i = P_j), then (left_X_w_i, right_X_w_j) and (left_X_w_j, right_X_w_i) are added to the loose pairing list;
e. Constructing new fundus image pairs (the expanded instances);
f. Combining the original pairing list and the loose pairing list to obtain the new binocular fundus image dataset X as output;
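A plain-Python sketch of the loose pairing of steps ①-⑤ (the representation of a sample as a (left path, right path, label) tuple is an assumption of this sketch):

```python
from typing import List, Tuple

Sample = Tuple[str, str, str]   # (left eye image, right eye image, disease category label P_i)

def build_pairs(samples: List[Sample]) -> List[Tuple[str, str]]:
    """Return the original pairs plus the loose pairs built from shared disease labels."""
    original_pairs = [(left, right) for left, right, _ in samples]

    loose_pairs: List[Tuple[str, str]] = []
    for i in range(len(samples)):
        left_i, right_i, label_i = samples[i]
        for j in range(i + 1, len(samples)):       # j > i so each cross-pair is added only once
            left_j, right_j, label_j = samples[j]
            if label_i == label_j:                 # same disease category label
                loose_pairs.append((left_i, right_j))
                loose_pairs.append((left_j, right_i))

    return original_pairs + loose_pairs            # new binocular fundus image dataset X
```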
1.4 Residual Attention Module (RAM)
As shown in fig. 2, the specific implementation process of the residual attention module is as follows:
① Inputting the fundus image data X (256×256×3) that has been expanded by the WGAN network and loosely paired;
② Channel attention mechanism module
a. Performing a convolution operation on X using 3×3 convolution kernels, with a stride of 1 and boundary padding of 1, to obtain a convolution feature map a_1 (256×256×64);
b. Performing batch normalization on the convolution feature map a_1, computing the mean and variance on each channel, and normalizing through scaling and shifting;
c. Applying a ReLU activation function to the feature map a_1 for nonlinear activation;
d. Generating a feature map a_2 with dimensions 256×256×64 through a second convolution operation, which likewise includes convolution, batch normalization and an activation function;
e. Performing a third convolution operation on a_2 using 128 3×3 convolution kernels to obtain a feature map U (256×256×128);
f. Performing global average pooling on the feature map U to obtain a feature vector s_c (1×1×128);
g. Mapping the feature vector s_c into the channel attention weight W through a fully connected layer to generate a feature vector s (1×1×128);
h. Multiplying the feature vector s (1×1×128) element-wise with the corresponding channels of the feature map U (256×256×128) to obtain the weighted feature map;
③ Spatial attention mechanism module
a. Inputting the weighted feature map;
b. Performing global average pooling on each channel of the weighted feature map to obtain a feature vector l_1 of size (1×1×128);
c. Performing global max pooling on each channel of the weighted feature map to obtain a feature vector l_2 of size (1×1×128);
d. Concatenating l_1 and l_2 along the channel direction to obtain a feature map of size (1×1×256);
e. Performing a convolution operation on it with a 1×1 convolution kernel to obtain the final feature map y (256×256×3);
f. Applying a ReLU activation function to the feature map y (256×256×3) for nonlinear activation;
1.4.1 SE channel attention mechanism
As shown in fig. 3, the SE (Squeeze-and-Excitation) channel attention mechanism is a method for enhancing the model's attention to important features. It adaptively learns the importance of each channel and re-weights the channel features, so that feature maps containing more useful information receive larger weights and feature maps containing less useful information receive smaller weights, improving the model's perception of key features and thereby the classification accuracy for binocular multi-label fundus images. The mechanism is as follows:
Input feature map conversion: for a given feature map X with C′ feature channels, a feature map U with C feature channels is obtained through the operation F_tr, which can be expressed as: U = F_tr(X), X ∈ R^(H′×W′×C′), U ∈ R^(H×W×C), where F_tr denotes a feature transformation operation used to adjust the number of channels;
Global average pooling operation: global average pooling of the feature map U generates a vector s_c of size 1×1×C, in which each channel is represented by a single number; this can be expressed as: s_c = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} U_{i,j,c}, where H×W is the spatial dimension of the feature map and U_{i,j,c} is the value of the c-th channel of the feature map U at position (i, j);
Introducing the learned weight W: the weight W is obtained through network learning, generating a weight matrix of size 1×1×C; the weight information can be expressed as: W = f_Net(s_c), where f_Net is the network structure that learns the weight information W from the global feature vector s_c;
Generating the characteristic feature vector s from the learned weight W:
s = F_es(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z));
Channel weighting: finally, a channel weighting operation is carried out using the generated feature vector s (1×1×C) and the feature map U (H×W×C); the corresponding channels are multiplied element by element to obtain the weighted feature map, i.e. each channel U_c of U is scaled by its corresponding weight s_c.
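A minimal PyTorch sketch of the SE channel attention described above (squeeze by global average pooling, excitation through a small fully connected network, then channel-wise reweighting); the reduction ratio of 16 is an illustrative assumption:

```python
import torch
from torch import nn

class SEChannelAttention(nn.Module):
    """s = sigma(W2 * delta(W1 * z)); output = s broadcast over U, channel by channel."""
    def __init__(self, channels: int = 128, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        s_c = u.mean(dim=(2, 3))           # squeeze: global average pooling -> (B, C)
        s = self.fc(s_c).view(b, c, 1, 1)  # excitation: learned channel weights -> (B, C, 1, 1)
        return u * s                       # element-wise reweighting of each channel of U
```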
1.4.2 spatial attention mechanism
The key idea of the spatial attention mechanism is to selectively allocate weights to different spatial positions when processing images, so that the model can focus more specifically on important areas. The main process of the spatial attention mechanism is as follows:
Feature extraction: the input image is passed through a feature extraction network to generate a feature map X, which contains rich information about the image at different spatial positions. The feature map is typically a three-dimensional tensor with dimensions (H×W×C), representing height, width and number of channels respectively;
Attention calculation: the feature map X is converted, through learned weight matrices, into Query (Q), Key (K) and Value (V) representations. This generally involves three weight matrices W_q, W_k, W_v, which are used to compute the query, key and value representations respectively: Q = X×W_q, K = X×W_k, V = X×W_v. The matrix Q contains a query representation corresponding to each location in the feature map, used in the subsequent computation of attention-related information; the matrix K contains a key representation corresponding to each position in the feature map, used to compute the attention similarities; the matrix V contains a value representation corresponding to each position in the feature map, to be used in the final weighted summary;
Attention weight calculation: a similarity matrix S between the query Q and the key K is computed, and the similarity matrix is then converted into a probability distribution using a softmax function to obtain the attention weight matrix A. The common computation is S = Q·K^T, where · denotes the dot product and K^T the transpose of the key matrix; each element S[i, j] of the similarity matrix S represents the similarity between the i-th query in Q and the j-th key in K. To make the attention distribution smoother, the similarity is usually scaled; a common scaling is S′ = S / √d_k, where d_k is the dimension of the keys. This scaling factor helps avoid overly large or small similarities and makes the subsequent softmax operation more stable. The final attention weights are therefore A = softmax(S′), where the softmax function converts each element of S′ into a probability: A[i, j] = exp(S′[i, j]) / Σ_j exp(S′[i, j]), with i indexing the rows and j the columns of the matrix. Through the softmax operation, each element of A becomes a probability value on its row, which ensures that the weights of the keys in each row sum to 1; each row corresponds to an attention weight distribution used to weight and summarize the Value matrix, allowing the model to focus more on keys with high similarity to the query when attending to the Value matrix;
Feature weighted summarization: the attention weight matrix A is used to perform a weighted summary of the values V, yielding the output feature map y after the spatial attention mechanism, where y = A·V; the feature map is focused on important areas of the image, so that the model can capture key features in a more targeted manner;
Through this process, the spatial attention mechanism enables the model to selectively attend to information at different positions when processing an image, improving the model's perception of important features in the image and allowing it to adapt better to complex image data.
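A minimal PyTorch sketch of the query/key/value spatial attention described above (one token per spatial position, scaled dot-product similarity, softmax weights, weighted summary); the channel and key dimensions are illustrative, and for large feature maps the H·W × H·W attention matrix would need tiling or downsampling:

```python
import math
import torch
from torch import nn

class SpatialSelfAttention(nn.Module):
    """y = softmax(Q K^T / sqrt(d_k)) V over the spatial positions of a feature map."""
    def __init__(self, channels: int = 128, d_k: int = 64):
        super().__init__()
        self.d_k = d_k
        self.w_q = nn.Linear(channels, d_k)        # W_q
        self.w_k = nn.Linear(channels, d_k)        # W_k
        self.w_v = nn.Linear(channels, channels)   # W_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C): one token per position
        q, k, v = self.w_q(tokens), self.w_k(tokens), self.w_v(tokens)
        s = q @ k.transpose(1, 2) / math.sqrt(self.d_k)  # scaled similarity S' (B, HW, HW)
        a = torch.softmax(s, dim=-1)                     # attention weight matrix A
        y = a @ v                                        # weighted summary y = A . V
        return y.transpose(1, 2).reshape(b, c, h, w)     # back to feature-map layout
```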
1.5 Binocular feature fusion Module (BFM)
When performing automatic diagnosis of fundus diseases, diseases whose early lesion information is not obvious are difficult to diagnose accurately by examining monocular fundus images alone. In addition, binocular fundus images usually contain one or more diseases, so diagnosing a patient's diseases on the basis of a monocular fundus image also lacks globality. The deep learning training process pursues a globally optimal solution; however, the setting of hyperparameters (such as the learning rate) often causes the model to fall into local optima at certain points and stop optimizing. Constructing a binocular feature fusion module (BFM) that combines the features output by the model on the basis of binocular features can significantly improve the model's ability to analyse multi-lesion information comprehensively, and helps escape the optimization dilemma caused by local optima;
The feature outputs of the binocular fundus images are fused using a moving average method, as follows:
1. Input features: assume the feature extracted from the left eye is y_left and the feature extracted from the right eye is y_right, both of size H×W×C;
2. Initializing the moving average variable: a moving average variable is set to record the fused features; it is initialized to a zero matrix whose size is consistent with the input features:
y_avg = 0
3. Updating the moving average:
For each time step t, the following steps are performed:
a. Updating the fusion feature:
y_avg = β·y_avg + (1-β)·y_t
where β is the decay coefficient of the moving average and takes a value close to 1; this coefficient determines how strongly historical information influences the current fusion result, and y_t denotes the element-wise product of the left and right eye features at the current moment;
b. Outputting the fusion feature:
y_output = y_avg
4. Outputting the result:
The final fusion feature y_output is the feature obtained by fusing the left- and right-eye features by means of the moving average;
The specific implementation steps are as follows:
① Taking the binocular fundus features y_left and y_right extracted by the RAM module as input;
② Initializing the moving average variable y_avg as a zero matrix whose size is consistent with the input features;
③ Setting the moving average decay coefficient β to 0.9;
④ For the left and right eye features y_left and y_right of each frame:
a. Multiplying y_left and y_right element-wise to obtain the input feature y_t at the current moment;
b. Updating the moving average variable y_avg;
c. Outputting the fusion feature y_output = y_avg;
d. Taking y_output as the input of the next frame;
e. Selecting a specific time step t as the termination, and outputting the final binocular fusion feature y_output;
Through this moving average process, the fusion feature at each moment takes the information of historical moments into account, achieving smooth fusion of the binocular fundus image features;
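The moving-average fusion of steps ①-⑤ can be sketched as follows (PyTorch, with β = 0.9 as in the text; treating the left/right features as a sequence of frames is how this sketch reads the description):

```python
import torch

def fuse_binocular(lefts, rights, beta: float = 0.9) -> torch.Tensor:
    """Moving-average fusion of per-frame left/right eye features extracted by the RAM module."""
    y_avg = torch.zeros_like(lefts[0])           # step 2: initialize the moving average to zeros
    for y_left, y_right in zip(lefts, rights):   # step 4: per-frame update up to the chosen step
        y_t = y_left * y_right                   # element-wise product of the two eyes
        y_avg = beta * y_avg + (1 - beta) * y_t  # y_avg = beta * y_avg + (1 - beta) * y_t
    return y_avg                                 # step 5: final binocular fusion feature y_output
```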
1.6 classifier
The binocular fusion feature y_output is taken as input and passed through a fully connected layer into the classifier. The construction of the classifier comprises: building and optimizing the classifier based on a preset classification algorithm, where the preset classification algorithm includes a support vector machine algorithm or a random forest algorithm;
Training the classifier using the fused features: a classification algorithm such as a Support Vector Machine (SVM), Random Forest, K-Nearest Neighbors (KNN) or Multilayer Perceptron (MLP) is selected to construct and optimize the classification model. The training objective of the classifier is to minimize a loss function L(Y, Ŷ), where Y is the true label and Ŷ is the predicted probability of the classifier.
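As an illustration, the sketch below trains an MLP classifier (one of the choices named above) on the fused feature with a cross-entropy loss; the global pooling, the layer sizes and the use of cross-entropy as the minimized loss L(Y, Ŷ) are assumptions of this sketch, not values fixed by the patent:

```python
import torch
from torch import nn

num_classes = 8                                    # N, D, G, C, A, H, M, O
classifier = nn.Sequential(                        # MLP classifier on the fused binocular feature
    nn.AdaptiveAvgPool2d(1),                       # pool the fused H x W x C feature map
    nn.Flatten(),                                  # vector fed into the fully connected layers
    nn.Linear(128, 64),
    nn.ReLU(inplace=True),
    nn.Linear(64, num_classes),
)
criterion = nn.CrossEntropyLoss()                  # minimizes -sum Y log(Y_hat)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def classifier_step(y_output: torch.Tensor, labels: torch.Tensor) -> float:
    logits = classifier(y_output)                  # fused feature y_output -> class scores
    loss = criterion(logits, labels)               # Y: true labels, softmax(logits): Y_hat
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```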
In addition, the trained classifier can take a new pair of left and right fundus images as a whole and perform classification prediction through the constructed network; the final classification prediction result is one of eight disease type labels: normal (N), diabetes (D), glaucoma (G), cataract (C), age-related macular degeneration (A), hypertension (H), pathological myopia (M), or other diseases/abnormalities (O).
Aiming at the situation that the paired binocular fundus image data available for model training are very limited, the binocular fundus image data used for training are effectively expanded by combining a loose pairing method with a WGAN generative adversarial network. On the basis of a ResNet residual network, the neural network is trained with combined channel attention and spatial attention mechanisms, so that the network focuses more on lesion feature information, which effectively improves the feature extraction capability of the model and reduces redundant computation. The binocular fundus image features are fused by a moving average method, so that the binocular information is fully utilized, feature loss and redundancy are avoided, and the stability and consistency of the fusion result are improved. Meanwhile, time and computing resources are saved, and the overall processing efficiency is improved.
Example 2
Fig. 4 is a system diagram of a fundus image recognition apparatus based on binocular feature fusion, according to another exemplary embodiment, including:
Data set construction module 1: used for acquiring a binocular fundus image dataset, wherein the binocular fundus image dataset comprises binocular fundus images of preset disease classifications;
Preprocessing module 2: used for performing abnormal image cleaning on the binocular fundus image dataset, and performing image preprocessing operations on the binocular fundus image dataset after the abnormal image cleaning;
WGAN network module 3: used for constructing a WGAN network framework, and training the WGAN network framework on the image-preprocessed binocular fundus image dataset to obtain a trained WGAN network;
Expansion module 4: used for inputting the image-preprocessed binocular fundus images into the trained WGAN network, and generating an expanded training data set by inputting random noise;
Original pairing module 5: used for selecting input samples from the expanded training data set, wherein an input sample is a left eye fundus image and a right eye fundus image containing the same disease classification, so as to obtain an original pairing list;
Loose pairing module 6: used for performing loose pairing on the original pairing list to obtain a loose pairing list, and combining the original pairing list and the loose pairing list to obtain a new binocular fundus image dataset;
Channel attention extraction module 7: used for setting the left eye fundus images and right eye fundus images in the new binocular fundus image dataset to a preset size, inputting them into a pre-built residual attention module, and extracting the channel attention features of the new binocular fundus image dataset by using the channel attention weights of the channel attention module in the residual attention module, to obtain a weighted feature map;
Spatial attention extraction module 8: used for inputting the weighted feature map into the spatial attention module of the residual attention module to extract the spatial attention features of the weighted feature map, so as to obtain a final feature map;
Feature fusion module 9: used for performing feature fusion on the final feature map of the left eye and the final feature map of the right eye output by the residual attention module by setting a moving average variable, to obtain a final binocular fusion feature;
Output module 10: used for inputting the final binocular fusion feature into a pre-trained classifier and outputting a disease type label.
Embodiment III:
The present embodiment provides a storage medium storing a computer program which, when executed by a processor, implements each step of the above method.
It is to be understood that the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
It is to be understood that identical or similar parts of the above embodiments may be referred to each other; content not described in detail in one embodiment may refer to the identical or similar description in another embodiment.
It should be noted that in the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present invention, unless otherwise indicated, "plurality" means at least two.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention also includes further implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those reasonably skilled in the art.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, the steps may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.
Claims (9)
1. A fundus image identification method based on binocular feature fusion, characterized by comprising the following steps:
Acquiring a binocular fundus image dataset comprising binocular fundus images of preset disease classifications;
Performing abnormal image cleaning on the binocular fundus image dataset, and performing image preprocessing operation on the binocular fundus image dataset after the abnormal image cleaning;
Constructing a WGAN network framework, and training the WGAN network framework with the binocular fundus image dataset after image preprocessing to obtain a trained WGAN network;
Inputting the binocular fundus image subjected to image preprocessing into the trained WGAN network, and generating an expanded training data set by inputting random noise;
selecting an input sample from the expanded training data set, wherein the input sample is a left eye fundus image and a right eye fundus image containing the same disease classification, so as to obtain an original pairing list;
loosely pairing is carried out on the original pairing list, and a loosely paired list is obtained; combining the original pairing list and the loose pairing list to obtain a new binocular fundus image dataset;
Setting left eye fundus images and right eye fundus images in the new binocular fundus image dataset to be of preset sizes, inputting the left eye fundus images and the right eye fundus images into a pre-built residual attention module, and extracting channel attention characteristics of the new binocular fundus image dataset by using channel attention weights of channel attention modules in the residual attention module to obtain a weighted characteristic diagram;
inputting the weighted feature map into a spatial attention module of the residual attention module to extract the spatial attention feature of the weighted feature map, so as to obtain a final feature map;
Performing feature fusion on the final feature map of the left eye and the final feature map of the right eye output by the residual attention module by setting a moving average variable to obtain a final binocular fusion feature;
and inputting the final binocular fusion characteristics into a pre-trained classifier, and outputting a disease type label.
2. The method according to claim 1, characterized in that
the step of performing feature fusion on the final feature map of the left eye and the final feature map of the right eye output by the residual attention module by setting a moving average variable to obtain the final binocular fusion feature comprises:
Inputting the final feature map of the left eye and the final feature map of the right eye output by the residual attention module into a pre-built binocular feature fusion module;
Initializing a moving average variable into a zero matrix in the binocular feature fusion module, wherein the size of the moving average variable is consistent with the input features;
setting a moving average decay coefficient, and multiplying the left-eye final feature map and the right-eye final feature map of each frame element by element to obtain the input feature at the current moment;
updating the moving average variable, obtaining the output fusion feature at the current moment through the updated moving average variable, taking the output fusion feature at the current moment as the input of the next frame, terminating at a selected time step, and outputting the final binocular fusion feature.
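For illustration only, a minimal sketch of this moving-average fusion is given below; the decay coefficient of 0.9 and the five time steps are assumed values, not parameters fixed by the claim.

```python
# Sketch of the claimed moving-average binocular fusion (illustrative values).
import torch

def ema_fuse(f_left: torch.Tensor, f_right: torch.Tensor,
             decay: float = 0.9, steps: int = 5) -> torch.Tensor:
    ema = torch.zeros_like(f_left)              # moving average variable (zero matrix)
    x = f_left * f_right                        # element-wise product of the final feature maps
    for _ in range(steps):                      # terminate at a chosen time step
        ema = decay * ema + (1.0 - decay) * x   # update the moving average
        x = ema                                 # current fused output feeds the next step
    return ema                                  # final binocular fusion feature
```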
3. The method according to claim 2, characterized in that
the step of performing loose pairing on the original pairing list to obtain the loose pairing list comprises:
setting a disease category label for each input sample in the original pairing list; if the disease category labels of any two input samples are the same, selecting the left fundus image of one input sample together with the right fundus image of the other input sample, as well as the right fundus image of the one input sample together with the left fundus image of the other input sample, and adding these pairs to a preset blank loose pairing list; traversing the entire original pairing list to obtain the loose pairing list.
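For illustration only, a minimal sketch of this loose pairing is given below; it assumes each sample is stored as a (left_image, right_image, label) tuple, and all names are illustrative.

```python
# Sketch of the claimed loose pairing over same-label samples.
def loose_pair(original_pairs):
    loose = []                                    # preset blank loose pairing list
    for i, (l_i, r_i, y_i) in enumerate(original_pairs):
        for l_j, r_j, y_j in original_pairs[i + 1:]:
            if y_i == y_j:                        # same disease category label
                loose.append((l_i, r_j, y_i))     # left of one sample, right of the other
                loose.append((l_j, r_i, y_i))     # and the reverse combination
    return original_pairs + loose                 # new binocular fundus image dataset
```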
4. The method according to claim 1, characterized in that
the step of extracting channel attention features of the new binocular fundus image dataset using the channel attention weights of the channel attention module in the residual attention module to obtain a weighted feature map comprises:
performing a convolution operation on the fundus images in the new binocular fundus image dataset with a 3×3 convolution kernel, a stride of 1 and a padding of 1, to obtain a first convolution feature map;
carrying out batch normalization on the first convolution feature map, calculating the mean and variance on each channel, and then carrying out scaling and shifting operations on the first convolution feature map;
performing nonlinear activation on the scaled and shifted first convolution feature map using a ReLU activation function;
performing a second convolution operation on the activated first convolution feature map to generate a second convolution feature map, and performing batch normalization, scaling, shifting and nonlinear activation operations on the second convolution feature map;
performing a third convolution operation on the second convolution feature map with 128 convolution kernels of size 3×3 to obtain a feature map;
Performing global average pooling operation on the feature map to obtain a first feature vector;
mapping the first feature vector into a channel attention weight W through a fully connected layer to generate a second feature vector;
and multiplying the second feature vector by the corresponding channel of the feature map element by element to obtain a weighted feature map.
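For illustration only, a minimal PyTorch sketch of this channel attention path is given below; the intermediate channel width of 64 and the sigmoid gate on the attention weight are assumptions, while the 3×3 kernels, stride 1, padding 1 and the 128 kernels of the third convolution follow the claim.

```python
# Sketch of the claimed channel attention path (channel widths partly assumed).
import torch
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    def __init__(self, in_ch: int = 3, mid_ch: int = 64, out_ch: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, stride=1, padding=1),   # first convolution
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),      # BN (scale/shift) + ReLU
            nn.Conv2d(mid_ch, mid_ch, 3, stride=1, padding=1),  # second convolution
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 3, stride=1, padding=1),  # third convolution, 128 kernels
        )
        self.fc = nn.Linear(out_ch, out_ch)                     # maps pooled vector to weight W

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.features(x)                                 # feature map
        v = feat.mean(dim=(2, 3))                               # global average pooling
        w = torch.sigmoid(self.fc(v))                           # channel attention weight
        return feat * w[:, :, None, None]                       # weighted feature map
```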
5. The method according to claim 4, characterized in that
the step of inputting the weighted feature map into the spatial attention module of the residual attention module to extract its spatial attention features and obtain a final feature map comprises:
Carrying out global average pooling on each channel in the weighted feature map to obtain a third feature vector with a first preset size;
carrying out global max pooling on each channel in the weighted feature map to obtain a fourth feature vector with a first preset size;
concatenating the third feature vector and the fourth feature vector along the channel direction to obtain a feature map with a second preset size;
performing a convolution operation on the feature map with the second preset size using a 1×1 convolution kernel to obtain a target feature map;
and performing nonlinear activation on the target feature map by using a ReLU activation function to obtain a final feature map.
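For illustration only, a minimal sketch of this spatial attention step is given below, read in the usual CBAM-like way: per-position averages and maxima across channels are concatenated and passed through a 1×1 convolution. This reading of the pooling step, and applying the resulting map back to the weighted features, are assumptions; the claim's literal wording stops at the ReLU output.

```python
# Sketch of the claimed spatial attention (pooling interpretation assumed).
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)                   # 1x1 convolution kernel

    def forward(self, weighted_feat: torch.Tensor) -> torch.Tensor:
        avg_map = weighted_feat.mean(dim=1, keepdim=True)            # average across channels
        max_map = weighted_feat.max(dim=1, keepdim=True).values      # max across channels
        stacked = torch.cat([avg_map, max_map], dim=1)               # concatenate along channels
        attn = torch.relu(self.conv(stacked))                        # target map + ReLU activation
        return weighted_feat * attn                                  # final feature map
```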
6. The method according to claim 1, characterized in that
the step of performing image preprocessing operations on the binocular fundus image dataset after the abnormal image cleaning comprises:
respectively carrying out an image normalization operation, an image weighted enhancement operation and an image enhancement operation on the binocular fundus image dataset after the abnormal image cleaning;
the image weighted enhancement operation comprises:
performing a convolution operation on the original fundus image in the binocular fundus image dataset with a Gaussian kernel to generate a blurred image;
setting a weighting coefficient, and obtaining a weighted and enhanced fundus image from the weighting coefficient, the original fundus image and the blurred image;
the image enhancement operation includes:
rotating the fundus image by 45° or 90°;
randomly translating the fundus image along the horizontal or vertical direction, wherein the distance of translation is between 0 and 10 percent of the width or the height of the fundus image;
And randomly overturning the translated fundus image to obtain an enhanced fundus image.
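For illustration only, a minimal OpenCV sketch of the weighted enhancement and image enhancement operations is given below; the Gaussian sigma and the 4/−4/128 weighting coefficients are assumed values in the style of common fundus-preprocessing recipes, not parameters fixed by the claim.

```python
# Sketch of the claimed weighted enhancement and augmentation (values assumed).
import random
import cv2
import numpy as np

def weighted_enhance(img: np.ndarray) -> np.ndarray:
    blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=10)      # Gaussian-kernel convolution
    return cv2.addWeighted(img, 4, blurred, -4, 128)        # weighted original + blurred image

def augment(img: np.ndarray) -> np.ndarray:
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), random.choice([45, 90]), 1.0)
    img = cv2.warpAffine(img, m, (w, h))                    # rotate by 45 or 90 degrees
    if random.random() < 0.5:                               # translate horizontally or vertically
        shift = np.float32([[1, 0, random.uniform(0, 0.1) * w], [0, 1, 0]])
    else:
        shift = np.float32([[1, 0, 0], [0, 1, random.uniform(0, 0.1) * h]])
    img = cv2.warpAffine(img, shift, (w, h))                # shift by 0-10% of width/height
    return cv2.flip(img, random.choice([0, 1]))             # randomly flip the translated image
```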
7. The method according to claim 6, characterized in that
the step of constructing the WGAN network framework comprises:
respectively setting the loss functions of the generator and the discriminator in the WGAN network framework;
performing weight clipping on the weights of the discriminator, limiting the weights within a preset range;
minimizing the Wasserstein GAN loss functions using an Adam optimizer.
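For illustration only, a minimal sketch of one WGAN training step with weight clipping is given below; the generator G and critic D are assumed to be nn.Module instances defined elsewhere, and the clipping bound, learning rate and noise dimension are illustrative values.

```python
# Sketch of a WGAN step with weight clipping and Adam (the claimed optimizer).
import torch

def wgan_step(G, D, real_images, opt_g, opt_d, z_dim: int = 100, clip: float = 0.01):
    # Critic update: maximize D(real) - D(fake), i.e. minimize its negative.
    z = torch.randn(real_images.size(0), z_dim)
    fake = G(z).detach()
    d_loss = -(D(real_images).mean() - D(fake).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    for p in D.parameters():
        p.data.clamp_(-clip, clip)                 # limit discriminator weights to a preset range

    # Generator update: minimize -D(G(z)).
    z = torch.randn(real_images.size(0), z_dim)
    g_loss = -D(G(z)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Optimizers per the claim, e.g.:
# opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
# opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
```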
8. A fundus image recognition device based on binocular feature fusion, characterized in that the device comprises:
a dataset construction module: configured to acquire a binocular fundus image dataset, the binocular fundus image dataset comprising binocular fundus images of preset disease classifications;
a preprocessing module: configured to perform abnormal image cleaning on the binocular fundus image dataset, and to perform image preprocessing operations on the binocular fundus image dataset after the abnormal image cleaning;
a WGAN network module: configured to construct a WGAN network framework, and to train the WGAN network framework with the binocular fundus image dataset after image preprocessing to obtain a trained WGAN network;
an expansion module: configured to input the binocular fundus images after image preprocessing into the trained WGAN network, and to generate an expanded training dataset from random noise inputs;
an original pairing module: configured to select input samples from the expanded training dataset, each input sample being a left fundus image and a right fundus image of the same disease classification, so as to obtain an original pairing list;
a loose pairing module: configured to perform loose pairing on the original pairing list to obtain a loose pairing list, and to combine the original pairing list and the loose pairing list into a new binocular fundus image dataset;
a channel attention extraction module: configured to resize the left and right fundus images in the new binocular fundus image dataset to a preset size, input them into a pre-built residual attention module, and extract channel attention features of the new binocular fundus image dataset using the channel attention weights of the channel attention module within the residual attention module, obtaining a weighted feature map;
a spatial attention extraction module: configured to input the weighted feature map into the spatial attention module of the residual attention module to extract its spatial attention features, obtaining a final feature map;
a feature fusion module: configured to perform feature fusion on the left-eye final feature map and the right-eye final feature map output by the residual attention module by maintaining a moving average variable, obtaining a final binocular fusion feature;
an output module: configured to input the final binocular fusion feature into a pre-trained classifier and output a disease type label.
9. A storage medium storing a computer program which, when executed by a processor, implements the steps of the binocular feature fusion-based fundus image recognition method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410101607.2A CN117912092B (en) | 2024-01-24 | 2024-01-24 | Fundus image identification method and device based on binocular feature fusion and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117912092A CN117912092A (en) | 2024-04-19 |
CN117912092B true CN117912092B (en) | 2024-07-05 |
Family
ID=90694846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410101607.2A Active CN117912092B (en) | 2024-01-24 | 2024-01-24 | Fundus image identification method and device based on binocular feature fusion and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117912092B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111563884A (en) * | 2020-04-26 | 2020-08-21 | 北京小白世纪网络科技有限公司 | Neural network-based fundus disease identification method, computer device, and medium |
CN114693961A (en) * | 2020-12-11 | 2022-07-01 | 北京航空航天大学 | Fundus photo classification method, fundus image processing method and system |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220189012A1 (en) * | 2019-01-14 | 2022-06-16 | Aiinsight Inc. | Deep learning architecture system for automatic fundus image reading and automatic fundus image reading method using deep learning architecture system |
CN112884729B (en) * | 2021-02-04 | 2023-08-01 | 北京邮电大学 | Fundus disease auxiliary diagnosis method and device based on bimodal deep learning |
CN113362329B (en) * | 2021-08-11 | 2021-11-19 | 北京航空航天大学杭州创新研究院 | Method for training focus detection model and method for recognizing focus in image |
CN115641309A (en) * | 2022-10-17 | 2023-01-24 | 北京石油化工学院 | Method and device for identifying age of eye ground color photo of residual error network model and storage medium |
CN116503373A (en) * | 2023-05-11 | 2023-07-28 | 首都医科大学附属北京天坛医院 | Eye age difference index construction method and system based on fundus image and electronic equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |