CN116310335A - Method for segmenting pterygium focus area based on Vision Transformer - Google Patents

Method for segmenting pterygium focus area based on Vision Transformer

Info

Publication number
CN116310335A
CN116310335A (application CN202310254245.6A)
Authority
CN
China
Prior art keywords
image
feature map
image feature
pterygium
multiplied
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310254245.6A
Other languages
Chinese (zh)
Inventor
朱绍军
方新闻
郑博
吴茂念
杨卫华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huzhou University
Original Assignee
Huzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huzhou University filed Critical Huzhou University
Priority to CN202310254245.6A priority Critical patent/CN116310335A/en
Publication of CN116310335A publication Critical patent/CN116310335A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for segmenting a pterygium focus area based on Vision Transformer, belongs to the technical field of image processing and application, and aims to solve the problem of inaccurate localization and segmentation of the pterygium focus area in the prior art. The method comprises the following steps: anterior ocular segment images provided by a cooperating hospital are selected as original samples, divided into a training set, a validation set and a test set, and subjected to a series of preprocessing operations; a semantic segmentation model that fuses Vision Transformer, a convolutional network and a pyramid pooling module is provided to perform semantic segmentation of the pterygium focus area in the anterior segment image. The Vision Transformer-based pterygium focus area segmentation method provided by the invention can extract more target information, so that the pterygium in the anterior segment image can be segmented efficiently and accurately.

Description

Method for segmenting pterygium focus area based on Vision Transformer
Technical Field
The invention belongs to the technical field of medical image processing and application, and particularly relates to a segmentation method of pterygium focus areas based on deep learning.
Background
The prevalence of pterygium is about 12% worldwide. Before 2015, researchers mostly achieved segmentation of target objects through traditional machine learning. Traditional segmentation methods include thresholding, region growing, edge detection and the like, but the segmentation accuracy and efficiency of traditional machine learning on medical images are difficult to meet practical application requirements.
In recent years, many studies have realized classification diagnosis of diseases through convolution techniques in deep learning and achieved an accuracy of about 95%. However, a classification result alone cannot provide accurate localization of the lesion area for the surgical treatment of pterygium. At present, convolution techniques are widely applied in the field of medical image segmentation, and their segmentation accuracy is superior to that of traditional machine learning. Although target information can be extracted through deep convolution, much edge detail information is lost in the convolution process, so the edge segmentation effect is not ideal.
Disclosure of Invention
The invention aims to: the invention provides a method for segmenting a pterygium focus area based on Vision Transformer, aiming at the problems of scarce pterygium data, low segmentation accuracy, difficult boundary segmentation and the like. The method uses Vision Transformer as the main branch and a convolutional neural network as the auxiliary branch, fuses an attention mechanism, trains the model with an expert-labeled pterygium focus region data set, takes extracting the complete information of the pterygium region as the goal, and provides a new segmentation method according to the structural characteristics of the model network and the requirements of medical image segmentation tasks, so as to realize accurate segmentation of the pterygium.
The technical scheme is as follows:
1. the segmentation method based on Vision Transformer pterygium focus area comprises an acquisition module, a semantic segmentation network module and a training module, and the segmentation processing is carried out on the diseased anterior segment image by utilizing the data acquisition module, the semantic segmentation network module and the training module, and is characterized by comprising the following steps:
(1) The anterior segment images of diseased eyes form a group of pterygium segmentation data sets which are used as the original data samples, and the data acquisition module performs a preprocessing operation on the images in the original data samples to ensure that the width and the height of the images are the same, so as to form a group of training set images;
(2) The training set images are segmented by the semantic segmentation network module, which comprises a Vision Transformer network and a convolutional network; the Vision Transformer network processes the training set images with an image blocking method and obtains the association relationships among image blocks by stacking multiple layers of the multi-head attention mechanism, so as to obtain an image attention map; the convolutional network obtains an image feature map through multi-layer convolution operations; the image attention map and the image feature map are combined through a matrix addition operation, and a pterygium segmentation map is obtained through a pyramid pooling method;
(3) The segmentation model is trained by the training module: the pterygium segmentation data set is input into the semantic segmentation network module for training, and the model parameters are adjusted during training by setting the learning rate, the loss function and the learning iteration period, so as to finally form the pterygium focus region segmentation model;
the preprocessing operation is as follows:
the method of the invention requires an input image size of M×N×3, where M and N are positive integers, and the original image size is H×W, where H and W are positive integers; first, the image is scaled to M×((N/H)×W), and then gray borders are added evenly on both sides of the shorter side to convert the size to M×N;
the image blocking method comprises the following steps:
an image of size M×N×3 is up-sampled to M′×N′×3 through an up-sampling operation, the image of size M′×N′×3 is input into the Vision Transformer, the input picture is divided into (M′/Patch)×(N′/Patch) image blocks, and a trainable position information parameter of size 1×((M′/Patch)×(N′/Patch))×(3×Patch×Patch) is added to the image block sequence;
the multi-head attention mechanism is as follows:
the image blocks are input into the multi-head attention mechanism as units, the relationships among the image blocks are calculated through matrix operations to generate new image features of size ((M′/Patch)×(N′/Patch))×(3×Patch×Patch), and the multi-head attention mechanism is cycled 12 to 16 times;
the image features generated by the multi-head attention mechanism are transformed to obtain the image attention map, and a convolution module with a convolution kernel size of 3×3 is connected to obtain an image attention map of size 30×30×2048;
the convolution network is as follows:
the parameters obtained by pre-training a ResNet50 model on the public data set ImageNet are used as the initialization parameters of the convolutional network, and the image feature map is extracted through 4 layers of convolution modules with different sizes and structures;
the pyramid pooling method comprises the following steps:
the image attention map and the image feature map are combined through a matrix addition operation to obtain a new image feature map, which is input into the pyramid pooling module, and the channel dimension of the image feature map is converted to 1/4 of its input dimension through a convolution operation; pooling operations are then performed with 4 pooling blocks of different sizes to obtain image feature map a, image feature map b, image feature map c and image feature map d; finally, image feature maps a, b, c and d are up-sampled and stacked with the new image feature map to obtain image feature map (e);
image feature maps a, b, c and d are also input into the stage up-sampling module, where an image feature map is obtained through up-sampling and feature fusion operations and is stacked with image feature map (e) to obtain a brand new image feature map f; finally, a convolution operation is performed on image feature map f to obtain the semantic segmentation image.
2. The segmentation method according to claim 1, wherein
the Loss function comprises a cross entropy Loss function and a Dice Loss function; the cross entropy Loss function and the Dice Loss function are fused as the Loss function of the semantic segmentation network model, and the following objective function is minimized:
Loss=Cross Entropy Loss+Dice Loss
where Cross Entropy Loss denotes the cross entropy Loss function and Dice Loss denotes the Dice Loss function.
3. The segmentation method according to claim 1, wherein
the learning iteration period is 80 epochs; freeze training is adopted in epochs 0-40, normal training is adopted in epochs 40-80, and the learning rate is 1e-5.
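As an illustration of this schedule, the following is a minimal PyTorch-style training-loop sketch; the Adam optimizer, the freezing granularity (the pretrained ResNet50 backbone only) and the names model, train_loader and criterion are assumptions and placeholders, not specifics of the invention.

```python
# Hypothetical sketch of the 80-epoch schedule: freeze training for epochs
# 0-40, normal training for epochs 40-80, learning rate 1e-5. The optimizer
# choice and the "backbone" attribute are assumptions.
import torch

def train(model, train_loader, criterion, epochs=80, freeze_epochs=40, lr=1e-5):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        freeze = epoch < freeze_epochs
        for p in model.backbone.parameters():   # freeze/unfreeze the pretrained backbone
            p.requires_grad = not freeze
        model.train()
        for images, masks in train_loader:
            optimizer.zero_grad()
            logits = model(images)               # (B, num_classes, H, W)
            loss = criterion(logits, masks)      # fused cross entropy + Dice loss
            loss.backward()
            optimizer.step()
```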
The beneficial effects are that:
the pterygium segmentation dataset marked by the expert is used for training, so that the authority of training is ensured.
ResNet50 is used as the feature extraction network and is pre-trained on the public data set ImageNet through a transfer learning method, so that the deep convolutional network ensures that the model can extract sufficiently complete focus region features.
The multi-head attention mechanism in the Vision Transformer can largely preserve the detailed information of the lesion area outline by relating the internal regions of the image to one another.
A stage up-sampling module is added to the pyramid pooling module; while context information is extracted with pooling blocks of different sizes, the stage up-sampling module retains target detail information through feature map fusion.
The segmentation effect on the pterygium is improved by fusing the cross entropy Loss and the Dice Loss as the Loss function of the network model.
Drawings
FIG. 1 is a schematic diagram of a semantic segmentation network structure
FIG. 2 is a schematic diagram of a pyramid pooling module structure
FIG. 3 is a schematic diagram of a Vision Transformer structure
FIG. 4 is a schematic diagram of the stage up-sampling structure
FIG. 5 is a comparison of the data before and after preprocessing
FIG. 6 is a schematic diagram of a segmentation result
Detailed Description
Examples: the method for segmenting the pterygium focus area based on Vision Transformer provided by the invention is used to segment the pterygium focus area, and comprises the following steps:
1. Pterygium dataset
The pterygium segmentation data set contains 517 anterior ocular segment images with pterygium (covering pterygium symptoms of varying severity), of which 367 are used for training and validation and 150 for testing; the focus area of each pterygium is manually labeled by an ophthalmologist.
2. Data preprocessing
The input image size required by the method is 473×473×3. For an original image of size H×W (H > W), the image is first scaled to 473×((473/H)×W)×3, and gray borders are then added evenly on both sides of the shorter side to convert the size to 473×473×3.
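A minimal sketch of this resize-and-pad step, assuming Pillow; the mid-gray value of 128 for the padded borders is an assumption, since the text only says "gray".

```python
# Scale the height to 473 while keeping the aspect ratio, then pad the
# shorter (width) side evenly with gray to reach 473x473x3.
from PIL import Image

def preprocess(path, target=473, gray=128):
    img = Image.open(path).convert("RGB")
    w, h = img.size                              # PIL reports (width, height)
    new_w = max(1, round(w * target / h))        # width becomes (473/H) * W
    img = img.resize((new_w, target), Image.BILINEAR)
    canvas = Image.new("RGB", (target, target), (gray, gray, gray))
    canvas.paste(img, ((target - new_w) // 2, 0))  # center horizontally, gray on both sides
    return canvas                                # 473x473 RGB image
```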
3. Training semantic segmentation networks
In the network training process, data are input into the network in batches. First, an image of size 473×473×3 is up-sampled to 480×480×3 through an up-sampling operation, and the 480×480×3 image is input into the Vision Transformer. The input picture is divided into a 30×30 grid of image blocks by a convolution with kernel size 16×16 and stride 16, and the features are flattened to obtain serialized picture features of size 900×768×1; a trainable position information parameter of size 900×768×1 is added to the sequence as the position encoding of the image blocks. The sequence features are fed into the multi-head attention mechanism, transformed into q, k and v matrices, and the relationships among the image blocks are calculated through matrix operations to generate new image features of size 900×768. The multi-head attention operation is cycled 12 times, and the image features generated by the multi-head attention mechanism are normalized and linearly transformed to obtain an image attention map of size 30×30×768. A convolution module with a convolution kernel size of 3×3 is then applied to obtain an image attention map (A) of size 30×30×2048;
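The sketch below illustrates this branch in PyTorch, following the sizes given above (16×16 stride-16 patch convolution, a 900×768 trainable position parameter, 12 encoder layers, and a final 3×3 convolution to 2048 channels); the number of attention heads and the feed-forward width are assumptions, so this is a sketch rather than the exact network of the invention.

```python
# Sketch of the Vision Transformer branch producing the image attention map (A).
import torch
import torch.nn as nn

class ViTBranch(nn.Module):
    def __init__(self, embed_dim=768, depth=12, heads=12, out_ch=2048):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        self.pos_embed = nn.Parameter(torch.zeros(1, 900, embed_dim))  # 30*30 tokens
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           dim_feedforward=embed_dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # 12 attention blocks
        self.norm = nn.LayerNorm(embed_dim)
        self.proj = nn.Conv2d(embed_dim, out_ch, kernel_size=3, padding=1)

    def forward(self, x):                        # x: (B, 3, 480, 480)
        x = self.patch_embed(x)                  # (B, 768, 30, 30) patch features
        x = x.flatten(2).transpose(1, 2)         # (B, 900, 768) serialized features
        x = x + self.pos_embed                   # add trainable position information
        x = self.norm(self.encoder(x))           # multi-head attention, then normalization
        x = x.transpose(1, 2).reshape(-1, 768, 30, 30)
        return self.proj(x)                      # image attention map (A): (B, 2048, 30, 30)
```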
Second, the convolutional network adopts ResNet50 with a transfer learning method, and the parameters obtained by training the ResNet50 model on ImageNet are used as the initialization parameters of the convolutional network. An image of size 473×473×3 is input into the ResNet50 network and passes through four layers of different convolution modules, whose convolution blocks are repeated 3, 4, 6 and 3 times respectively, so as to obtain an image feature map (B) of size 30×30×2048;
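A sketch of this convolutional branch using torchvision's ResNet50 with ImageNet weights is shown below; dilating the last stage so that a 473×473 input yields a roughly 30×30×2048 map is an assumption made to match the stated sizes, not a detail given in the text.

```python
# Sketch of the ResNet50 branch producing the image feature map (B).
import torch.nn as nn
import torchvision

class ResNetBranch(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(
            weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1,
            replace_stride_with_dilation=[False, False, True])  # keep an overall stride of 16
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4

    def forward(self, x):             # x: (B, 3, 473, 473)
        x = self.stem(x)
        x = self.layer1(x)            # 3 bottleneck blocks
        x = self.layer2(x)            # 4 bottleneck blocks
        c3 = self.layer3(x)           # 6 bottleneck blocks (third-layer features, reused later)
        c4 = self.layer4(c3)          # 3 bottleneck blocks -> (B, 2048, ~30, ~30)
        return c3, c4
```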
A new image feature map (C) is obtained by element-wise addition of the image attention map (A) and the image feature map (B), and the new image feature map (C) is input into the pyramid pooling module, where a convolution with kernel_size 1×1 converts the channel dimension of the image feature map (C) to 1/4 of the input dimension; pooling operations are then performed with pooling blocks of sizes 1×1, 2×2, 3×3 and 6×6 to obtain image feature map a, image feature map b, image feature map c and image feature map d; finally, image feature maps a, b, c and d are up-sampled to 30×30×512 and stacked together with feature map (C) by channel concatenation to obtain image feature map (D);
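The following sketch illustrates this pyramid pooling step, assuming average pooling with bin sizes 1, 2, 3 and 6, a 1×1 reduction from 2048 to 512 channels applied before the pooling, and BatchNorm/ReLU after the reduction; these choices are assumptions where the text is ambiguous.

```python
# Sketch of the pyramid pooling module: reduce (C) to 512 channels, pool with
# four bin sizes to get feature maps a-d, up-sample them back to 30x30 and
# channel-stack them with (C) to form feature map (D).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, in_ch // 4, kernel_size=1),
                                    nn.BatchNorm2d(in_ch // 4), nn.ReLU(inplace=True))
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(b) for b in bins])

    def forward(self, fused):                     # fused = (A) + (B): (B, 2048, 30, 30)
        size = fused.shape[-2:]
        reduced = self.reduce(fused)              # (B, 512, 30, 30)
        maps = [pool(reduced) for pool in self.pools]        # feature maps a, b, c, d
        upsampled = [F.interpolate(m, size=size, mode="bilinear",
                                   align_corners=False) for m in maps]
        stacked = torch.cat([fused, *upsampled], dim=1)      # feature map (D)
        return maps, stacked
```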
Image feature maps a, b, c and d are also input into the stage up-sampling module, where they are fused through up-sampling and feature fusion operations to obtain an image feature map of size 30×30×512, which is channel-stacked with image feature map (D) to obtain a brand new feature map (e1);
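The text does not spell out the fusion rule inside the stage up-sampling module; the sketch below reads it as bilinear up-sampling of feature maps a-d to 30×30 followed by element-wise summation, one plausible interpretation rather than the definitive design.

```python
# Hypothetical stage up-sampling: fuse the pooled maps a-d (1x1, 2x2, 3x3, 6x6,
# each with 512 channels) from coarse to fine into a single 30x30x512 map.
import torch.nn.functional as F

def stage_upsample(maps, size=(30, 30)):
    fused = None
    for m in sorted(maps, key=lambda t: t.shape[-1]):   # coarse to fine
        m = F.interpolate(m, size=size, mode="bilinear", align_corners=False)
        fused = m if fused is None else fused + m       # feature fusion by summation
    return fused                                        # (B, 512, 30, 30)
```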
Finally, the operations of the pyramid pooling module and the stage up-sampling module are repeated on the image feature map of the third layer of ResNet50 to obtain an image feature map (e2), which is fused with the brand new feature map (e1) through a feature fusion operation to obtain the pterygium semantic segmentation map;
The loss between the pterygium semantic segmentation map and the ground-truth pterygium segmentation map is then calculated, and the network parameters are updated. The invention fuses the cross entropy Loss function and the Dice Loss function as the Loss function of the semantic segmentation network model and minimizes the following objective function:
Loss=Cross Entropy Loss+Dice Loss
where Cross Entropy Loss denotes the cross entropy Loss function and Dice Loss denotes the Dice Loss function.
Denote the ground-truth label as y = y_truth and the predicted value as y′ = y_pred; the following Cross Entropy Loss objective function is defined:
Cross Entropy Loss = −y·log(y′) − (1−y)·log(1−y′)
the more accurate the pixel classification, the smaller the Loss.
A and B respectively denote the predicted contour area point set and the real contour area point set; the following Dice Loss objective function is defined:
Dice Loss = 1 − 2|A ∩ B| / (|A| + |B|)
the greater the overlapping ratio of the predicted lesion area to the real lesion area, the smaller the Loss.
4. Analysis of processing results
The method uses the following four performance metrics to quantify the processing results: single-class intersection over union (IoU), mean intersection over union (MIoU), single-class pixel accuracy (PA) and mean pixel accuracy (MPA). The calculation formulas are as follows:
IoU_i = |p_i ∩ g_i| / |p_i ∪ g_i|
MIoU = (1/(k+1)) · Σ_{i=0..k} IoU_i
where p_i denotes the predicted (segmented) region of class i and g_i denotes the corresponding ground-truth region. The intersection over union (IoU) is the ratio of the intersection to the union of the prediction and the ground truth; the mean intersection over union (MIoU) is obtained by computing the IoU of each class (including the background class) and averaging over all classes.
PA = Σ_i p_ii / Σ_i Σ_j p_ij
MPA = (1/(k+1)) · Σ_{i=0..k} (p_ii / Σ_j p_ij)
where p_ii denotes the number of pixels of class i that are correctly predicted, and p_ij denotes the number of pixels of class i that are predicted as class j. Pixel accuracy (PA) represents the proportion of correctly labeled pixels among the total pixels; mean pixel accuracy (MPA) first computes the proportion of correctly classified pixels for each class and then averages over all classes.
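A sketch of how these four metrics can be computed from a confusion matrix is shown below; reporting the pterygium-class values as the single-class IoU and PA follows the reading of the text above and is an assumption.

```python
# Metrics from a 2x2 confusion matrix where entry [i, j] counts pixels of
# ground-truth class i predicted as class j (class 0 = background, 1 = pterygium).
import numpy as np

def metrics(conf):
    conf = conf.astype(float)
    tp = np.diag(conf)
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)    # per-class IoU
    acc = tp / conf.sum(axis=1)                              # per-class pixel accuracy
    return {"IoU": iou[1], "MIoU": iou.mean(),               # single-class and mean IoU
            "PA": acc[1], "MPA": acc.mean()}                 # single-class and mean PA

# Building the confusion matrix from flattened label maps with values in {0, 1}:
# conf = np.bincount(2 * gt.ravel() + pred.ravel(), minlength=4).reshape(2, 2)
```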
The results of the invention on the pterygium test set are MIoU: 87.43%, MPA: 92.57%, IoU: 79.44%, PA: 87.16%. Extensive application shows that the Vision Transformer-based pterygium focus area segmentation method provided by the invention has high segmentation performance, which is of great significance in the medical field.
As described above, while the present invention has been shown and described with reference to certain preferred embodiments, this is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. The Vision Transformer-based pterygium focus area segmentation method comprises a data acquisition module, a semantic segmentation network module and a training module, and performs segmentation processing on the diseased anterior segment image by utilizing the data acquisition module, the semantic segmentation network module and the training module, and is characterized by comprising the following steps:
(1) The anterior segment images of diseased eyes form a group of pterygium segmentation data sets which are used as the original data samples, and the data acquisition module performs a preprocessing operation on the images in the original data samples to ensure that the width and the height of the images are the same, so as to form a group of training set images;
(2) The training set images are segmented by the semantic segmentation network module, which comprises a Vision Transformer network and a convolutional network; the Vision Transformer network processes the training set images with an image blocking method and obtains the association relationships among image blocks by stacking multiple layers of the multi-head attention mechanism, so as to obtain an image attention map; the convolutional network obtains an image feature map through multi-layer convolution operations; the image attention map and the image feature map are combined through a matrix addition operation, and a pterygium segmentation map is obtained through a pyramid pooling method;
(3) The segmentation model is trained by the training module: the pterygium segmentation data set is input into the semantic segmentation network module for training, and the model parameters are adjusted during training by setting the learning rate, the loss function and the learning iteration period, so as to finally form the pterygium focus region segmentation model;
the preprocessing operation is as follows:
the method of the invention requires an input image size of M×N×3, where M and N are positive integers, and the original image size is H×W, where H and W are positive integers; first, the image is scaled to M×((N/H)×W), and then gray borders are added evenly on both sides of the shorter side to convert the size to M×N;
the image blocking method comprises the following steps:
an image of size M×N×3 is up-sampled to M′×N′×3 through an up-sampling operation, the image of size M′×N′×3 is input into the Vision Transformer, the input picture is divided into (M′/Patch)×(N′/Patch) image blocks, and a trainable position information parameter of size 1×((M′/Patch)×(N′/Patch))×(3×Patch×Patch) is added to the image block sequence;
the multi-head attention mechanism is as follows:
the image blocks are input into the multi-head attention mechanism as units, the relationships among the image blocks are calculated through matrix operations to generate new image features of size ((M′/Patch)×(N′/Patch))×(3×Patch×Patch), and the multi-head attention mechanism is cycled 12 to 16 times;
the image features generated by the multi-head attention mechanism are transformed to obtain the image attention map, and a convolution module with a convolution kernel size of 3×3 is connected to obtain an image attention map of size 30×30×2048;
the convolution network is as follows:
the parameters obtained by pre-training a ResNet50 model on the public data set ImageNet are used as the initialization parameters of the convolutional network, and the image feature map is extracted through 4 layers of convolution modules with different sizes and structures;
the pyramid pooling method comprises the following steps:
the image attention map and the image feature map are combined through a matrix addition operation to obtain a new image feature map, which is input into the pyramid pooling module, and the channel dimension of the image feature map is converted to 1/4 of its input dimension through a convolution operation; pooling operations are then performed with 4 pooling blocks of different sizes to obtain image feature map a, image feature map b, image feature map c and image feature map d; finally, image feature maps a, b, c and d are up-sampled and stacked with the new image feature map to obtain image feature map (e);
image feature maps a, b, c and d are also input into the stage up-sampling module, where an image feature map is obtained through up-sampling and feature fusion operations and is stacked with image feature map (e) to obtain a brand new image feature map f; finally, a convolution operation is performed on image feature map f to obtain the semantic segmentation image.
2. The segmentation method according to claim 1, wherein
the Loss function comprises a cross entropy Loss function and a Dice Loss function; the cross entropy Loss function and the Dice Loss function are fused as the Loss function of the semantic segmentation network model, and the following objective function is minimized:
Loss=Cross Entropy Loss+Dice Loss
where Cross Entropy Loss denotes the cross entropy Loss function and Dice Loss denotes the Dice Loss function.
3. The segmentation method according to claim 1, wherein
the learning iteration period is 80 epochs; freeze training is adopted in epochs 0-40, normal training is adopted in epochs 40-80, and the learning rate is 1e-5.
CN202310254245.6A 2023-03-11 2023-03-11 Method for segmenting pterygium focus area based on Vision Transformer Pending CN116310335A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310254245.6A CN116310335A (en) 2023-03-11 2023-03-11 Method for segmenting pterygium focus area based on Vision Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310254245.6A CN116310335A (en) 2023-03-11 2023-03-11 Method for segmenting pterygium focus area based on Vision Transformer

Publications (1)

Publication Number Publication Date
CN116310335A true CN116310335A (en) 2023-06-23

Family

ID=86812807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310254245.6A Pending CN116310335A (en) 2023-03-11 2023-03-11 Method for segmenting pterygium focus area based on Vision Transformer

Country Status (1)

Country Link
CN (1) CN116310335A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116993756A (en) * 2023-07-05 2023-11-03 石河子大学 Method for dividing verticillium wilt disease spots of field cotton
CN116993756B (en) * 2023-07-05 2024-09-27 石河子大学 Method for dividing verticillium wilt disease spots of field cotton

Similar Documents

Publication Publication Date Title
CN111242288B (en) Multi-scale parallel deep neural network model construction method for lesion image segmentation
CN112308158A (en) Multi-source field self-adaptive model and method based on partial feature alignment
CN108335303B (en) Multi-scale palm skeleton segmentation method applied to palm X-ray film
CN113902761B (en) Knowledge distillation-based unsupervised segmentation method for lung disease focus
WO2023045231A1 (en) Method and apparatus for facial nerve segmentation by decoupling and divide-and-conquer
CN112734764A (en) Unsupervised medical image segmentation method based on countermeasure network
CN113192076B (en) MRI brain tumor image segmentation method combining classification prediction and multi-scale feature extraction
CN104077742B (en) Human face sketch synthetic method and system based on Gabor characteristic
CN111881743B (en) Facial feature point positioning method based on semantic segmentation
CN111008650B (en) Metallographic structure automatic grading method based on deep convolution antagonistic neural network
CN109767459A (en) Novel ocular base map method for registering
CN111724401A (en) Image segmentation method and system based on boundary constraint cascade U-Net
CN112488963A (en) Method for enhancing crop disease data
CN113205509A (en) Blood vessel plaque CT image segmentation method based on position convolution attention network
CN115375711A (en) Image segmentation method of global context attention network based on multi-scale fusion
CN114511502A (en) Gastrointestinal endoscope image polyp detection system based on artificial intelligence, terminal and storage medium
CN114565628B (en) Image segmentation method and system based on boundary perception attention
CN116310335A (en) Method for segmenting pterygium focus area based on Vision Transformer
CN112927237A (en) Honeycomb lung focus segmentation method based on improved SCB-Unet network
CN112597919A (en) Real-time medicine box detection method based on YOLOv3 pruning network and embedded development board
CN115458174A (en) Method for constructing intelligent diagnosis model of diabetic retinopathy
CN118430790A (en) Mammary tumor BI-RADS grading method based on multi-modal-diagram neural network
CN116739949B (en) Blastomere edge enhancement processing method of embryo image
CN116188435B (en) Medical image depth segmentation method based on fuzzy logic
CN114627123B (en) Leucocyte detection method integrating double-current weighting network and spatial attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination