CN116823852A - Strip-shaped skin scar image segmentation method and system based on convolutional neural network - Google Patents


Info

Publication number
CN116823852A
Authority
CN
China
Prior art keywords
module
shaped
strip
decoding
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310682091.0A
Other languages
Chinese (zh)
Other versions
CN116823852B (en)
Inventor
石霏
周健
夏文涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202310682091.0A
Publication of CN116823852A
Application granted
Publication of CN116823852B
Legal status: Active

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a strip-shaped skin scar image segmentation method and system based on a convolutional neural network. The method comprises the following steps: acquiring an image with a strip-shaped skin scar; inputting the image into a U-shaped encoding and decoding network, and realizing image segmentation of the strip-shaped skin scar through the U-shaped encoding and decoding network. The U-shaped encoding and decoding network comprises an SADC module and a GGMC module: the SADC module is used for making the U-shaped encoding and decoding network focus on the strip-shaped characteristics of skin scars, and the GGMC module is used for increasing the perception capability of the U-shaped encoding and decoding network for skin scars of different scales and different lengths. The SADC module designed by the application makes the network pay more attention to the strip-shaped characteristic of scars, and the GGMC module gives the network stronger perception of targets of different scales and different lengths.

Description

Strip-shaped skin scar image segmentation method and system based on convolutional neural network
Technical Field
The application relates to the technical field of image segmentation, in particular to a strip-shaped skin scar image segmentation method and system based on a convolutional neural network.
Background
A dermal scar forms when the dermal tissue is completely damaged and fibrous connective tissue develops during healing. Scar analysis may be used for injury identification, crime evidence collection, and skin repair assessment. With the rapid development of computer technology, scar analysis methods based on digital images have gradually emerged and are mainly used for measuring and quantitatively analyzing clinical scars. Segmenting scars in skin scar images remains a difficult problem, and at present most segmentation is done manually or with semi-automatic segmentation tools. Therefore, automatic segmentation of scars in skin scar images can greatly improve the efficiency of scar analysis and avoid the subjectivity of manual measurement.
Image segmentation assigns a class label to each pixel in an image. Traditional image segmentation methods segment the image using basic features such as color and texture. In recent years, image segmentation techniques based on deep learning have developed rapidly; among them, the fully convolutional network (Fully Convolutional Network, FCN), based on a convolutional neural network, was the first to be applied to image segmentation tasks. Since then, various convolutional-neural-network-based methods have emerged one after another.
In recent years, many image segmentation algorithms based on deep learning have appeared. The fully convolutional network FCN replaces the final fully connected layer of a traditional classification CNN with a convolution layer and recovers the feature size by up-sampling, so that the network can handle input images of any size and perform end-to-end segmentation. UNet is a variant of FCN designed for medical image segmentation: its encoder uses cascaded convolution blocks, and its decoder uses bilinear interpolation and skip connections to recover features, where the skip connections fuse low-level position information with deep semantic information through concatenation. PSPNet proposes a pyramid pooling module (Pyramid Pooling Module, PPM) that aggregates multi-scale features and fuses information between different scales and different subregions. The DeepLab series of networks use dilated convolution to enlarge the receptive field and build an atrous spatial pyramid pooling structure (Atrous Spatial Pyramid Pooling, ASPP) to extract multi-scale targets. CENet proposes a context extractor to generate higher-level semantic feature maps, containing a dense atrous convolution (Dense Atrous Convolution, DAC) block and a residual multi-kernel pooling (Residual Multi-kernel Pooling, RMP) block. CPFNet adds two pyramid modules to fuse global and multi-scale context information, namely a scale-aware pyramid fusion (Scale-Aware Pyramid Fusion, SAPF) module and a global pyramid guidance (Global Pyramid Guidance, GPG) module.
The disadvantages of the prior art are as follows:
In image segmentation tasks, encoder-decoder convolutional neural networks represented by UNet extract features through convolution, enlarge the receptive field through pooled down-sampling, and compensate for detail loss during feature recovery through skip connections. Although existing encoder-decoder networks represented by UNet achieve good segmentation results in many tasks, the following disadvantages remain:
(1) UNet uses ordinary convolution to extract features; at a given resolution it has only local perception, and its ability to extract global context is insufficient. In addition, UNet's ability to fuse features across levels is very limited, being implemented only by channel concatenation (concat) of the corresponding-level features.
(2) In the decoder part, the prior art typically uses only a simple cascade of convolution and up-sampling, where up-sampling usually employs bilinear interpolation or deconvolution. This decoding process is inadequate for recovering the spatial information of the features, because the semantic information of each level of features is not fully utilized.
(3) In terms of data, deep learning models are driven by large amounts of supervised data, so model performance depends on the amount and quality of that data. When the amount of data is insufficient, it is difficult to train a model with strong generalization ability. Data augmentation techniques such as random rotation, flipping, and contrast enhancement are typically used to expand the data, but they bring only limited improvement in model performance.
Disclosure of Invention
Therefore, the application aims to solve the technical problems in the prior art of insufficient perception of strip-shaped targets by the neural network and insufficient extraction of multi-scale features.
In order to solve the technical problems, the application provides a strip-shaped skin scar image segmentation method based on a convolutional neural network, which comprises the following steps:
acquiring an image with strip-shaped skin scars;
inputting the image into a U-shaped encoding and decoding network, and realizing image segmentation of the strip-shaped skin scar through the U-shaped encoding and decoding network;
the U-shaped encoding and decoding network comprises an SADC module and a GGMC module, wherein the SADC module is used for enabling the U-shaped encoding and decoding network to focus on strip-shaped characteristics of skin scars, and the GGMC module is used for increasing the perception capability of the U-shaped encoding and decoding network on the skin scars with different scales and different lengths.
In one embodiment of the present application, the U-shaped codec network includes an encoder for extracting features, a decoder for feature recovery of the features extracted by the encoder, and a header structure for image segmentation of the features recovered by the decoder;
the header structure comprises up-sampling, 3 x 3 convolution and 1 x 1 convolution connected in sequence;
the encoder adopts ResNet34, which comprises a convolution layer and a first, second, third and fourth residual coding module connected in sequence; each of the first, second, third and fourth residual coding modules comprises a 3×3 convolution, ReLU activation and a second 3×3 convolution connected in sequence, and the input features are added to the output of the second 3×3 convolution to obtain the output result;
the decoder comprises a third decoding module, a second decoding module, a first decoding module and a zeroth decoding module which are sequentially connected;
the characteristics output by the first residual error coding module are input to a first decoding module after passing through a CBR module;
the characteristics output by the second residual error coding module are input to a second decoding module after passing through a CBR module;
the characteristics output by the third residual error coding module are input to a third decoding module after passing through a CBR module;
the characteristics output by the fourth residual error coding module are input to a third decoding module after passing through a CBR module;
in the first decoding module, the second decoding module and the third decoding module, the features of the upper layer pass through the SADC module, and are added and fused with the features of the layer corresponding to the residual coding module after residual operation and up-sampling, so that feature reconstruction is realized;
in the zeroth decoding module, the last-level features pass through the SADC module, and output results are obtained after residual error operation and up-sampling;
the first decoding module and the second decoding module are added with the output of the zeroth decoding module after passing through the GGMC module, and then the added result is input into the head structure.
In one embodiment of the application, the CBR module comprises a 3×3 convolution, a batch normalization layer, and a ReLU activation function layer connected in sequence.
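As an illustration only, a minimal PyTorch sketch of such a CBR block might look as follows; the class and argument names are illustrative and are not taken from the patent:

```python
import torch.nn as nn

class CBR(nn.Module):
    """3x3 convolution -> batch normalization -> ReLU, as described for the CBR module."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```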
In one embodiment of the present application, the SADC module includes:
for an input feature map X ∈ R^(C×H×W), where C denotes the number of channels and H and W denote the height and width of the feature map, respectively: the feature map X is first processed by convolution, batch normalization and an activation layer; the processed feature map is subjected to horizontal pooling and vertical pooling to obtain two feature vectors X_v ∈ R^(C×H×1) and X_h ∈ R^(C×1×W); the feature vectors X_v and X_h are each passed through a 3×3 convolution, batch normalization and an up-sampling operation and then added, and the sum is passed through a convolution and batch normalization to obtain X_m ∈ R^(C×H×W); X_m is passed through a Sigmoid function and multiplied with the feature map X to obtain a global descriptor Z;
the global descriptor Z is input into a multi-layer perceptron formed by two fully connected layers, which outputs a kernel allocation matrix K = [k_1, k_2, ..., k_N], where N denotes the number of independent 3×3 convolution kernels and k_i denotes the weight of the i-th convolution kernel; the weights are constrained by Softmax so that Σ_i k_i = 1. Let the N independent convolution kernel parameters be θ = [θ_1, θ_2, ..., θ_N]; the kernel allocation matrix K and the convolution kernel parameters θ are fused and then applied to the feature map X to produce the final output O.
In one embodiment of the present application, the fusion of the kernel distribution matrix K and the convolution kernel parameter θ is applied to the feature map X, and the final output O is given by the formula:
O = DConv(X) = Conv(X, Σ_i k_i·θ_i)
where DConv denotes a dynamic convolution operation and Conv denotes a convolution operation.
In one embodiment of the present application, the GGMC module includes:
for an input feature map X ∈ R^(C×H×W), where C denotes the number of channels and H and W denote the height and width of the feature map, respectively: average pooling is used to scale the features in the spatial dimension, converting the feature map X into a feature X_s ∈ R^(C×h×w) with height h and width w; the feature X_s is then flattened along the spatial dimension, and the flattened feature is passed through a multi-layer perceptron to obtain a global descriptor G ∈ R^(C×1×1); the global descriptor G and X_s are fused channel by channel to obtain a feature map X_G ∈ R^(C×h×w) with global perception;
the feature map X_G is passed through depthwise convolution kernels of sizes 3×3, 5×5 and 7×7, respectively, to capture multi-scale local perception features; a 1×1 point-wise convolution is applied to the feature map X_G to capture channel perception features;
after the 3×3, 5×5 and 7×7 depthwise convolutions and the 1×1 point-wise convolution, batch normalization is applied to align the features; finally, the multi-scale local perception features and the channel perception features are added to obtain the final output.
In one embodiment of the present application, the method further comprises training the U-shaped codec network by constructing an image dataset, the method of constructing an image dataset comprising:
acquiring multi-view images of the same strip skin scar, performing three-dimensional reconstruction on the multi-view images by using an SFM algorithm to obtain a three-dimensional model of the strip skin scar and camera parameters of each view, and obtaining a rotation transformation matrix of each view image relative to the three-dimensional model according to the camera parameters;
interpolation is carried out between adjacent rotation transformation matrices, and the three-dimensional model is projected in three dimensions using the interpolation results, so as to generate a plurality of two-dimensional pseudo views.
In one embodiment of the present application, the rotation transformation matrix of each view image relative to the three-dimensional model is obtained from the camera parameters and is expressed as a quaternion:
q_i = [cos(θ_i/2), u_ix·sin(θ_i/2), u_iy·sin(θ_i/2), u_iz·sin(θ_i/2)], i = 1, 2, ..., N
where q_i denotes the rotation transformation of the i-th view, [u_ix, u_iy, u_iz] is a three-dimensional vector of unit length, u_i denotes the rotation axis of the i-th view defined by the unit vector, θ_i denotes the rotation angle of the i-th view, and N denotes the total number of views.
In one embodiment of the present application, the interpolation between adjacent rotation transformation matrices is performed by linear interpolation:
q_t = (1 − t)·q_i + t·q_(i+1)
where t ∈ (0, 1) denotes the interpolation ratio parameter, q_i and q_(i+1) denote the rotation transformations of two adjacent views, and q_t denotes the interpolation result.
In order to solve the technical problems, the application provides a strip-shaped skin scar image segmentation system based on a convolutional neural network, which comprises:
an acquisition module, for obtaining an image with a strip-shaped skin scar;
an image segmentation module, for inputting the image into a U-shaped encoding and decoding network and realizing image segmentation of the strip-shaped skin scar through the U-shaped encoding and decoding network;
the U-shaped encoding and decoding network comprises an SADC module and a GGMC module, wherein the SADC module is used for enabling the U-shaped encoding and decoding network to focus on strip-shaped characteristics of skin scars, and the GGMC module is used for increasing the perception capability of the U-shaped encoding and decoding network on the skin scars with different scales and different lengths.
Compared with the prior art, the technical scheme of the application has the following advantages:
the application creatively designs a dynamic convolution module based on strip attention (the SADC module), which uses strip pooling as its entry point to enhance the network's perception of strip-shaped targets, and at the same time endows the convolution with attention, making feature acquisition more flexible;
the application creatively designs a globally guided multi-core convolution module (GGMC module), which solves the problem of scale alignment existing in feature fusion from three angles of global, local and channel;
according to the application, a teacher model based on multi-view geometric projection data augmentation is adopted to enhance a training strategy, so that the identification capability of the network to scars at different positions and angles is improved under the condition of few data samples, and the generalization capability of the network is enhanced.
Drawings
In order that the application may be more readily understood, a more particular description of the application will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings.
FIG. 1 is an overall flow chart of an embodiment of the present application;
FIG. 2 is a schematic diagram of a U-type codec network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a SADC module according to an embodiment of the present application;
FIG. 4 is a schematic view of a GGMC module according to an embodiment of the application;
FIG. 5 is a flow chart of a teacher-student model enhanced training strategy based on multi-view geometric projection data augmentation in an embodiment of the present application;
fig. 6 is a graph comparing the segmentation results of striped skin scar images of the network of the present application with other segmentation networks.
Detailed Description
The present application will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the application and practice it.
Example 1
The application relates to a strip-shaped skin scar image segmentation method based on a convolutional neural network, which comprises the following steps:
acquiring an image with strip-shaped skin scars;
inputting the image into a U-shaped encoding and decoding network, and realizing image segmentation of the strip-shaped skin scar through the U-shaped encoding and decoding network;
the U-shaped codec network in this embodiment includes a dynamic convolution based on stripe attention (Spatial Attention Based Dynamic Convolution, SADC) module for focusing the U-shaped codec network on stripe features of the scar and a globally directed multi-core convolution module (Global Guided Multi-kernel Convolution, GGMC) module for increasing the perceptibility of the U-shaped codec network to scars of different sizes and different lengths.
The SADC module in this embodiment overcomes the drawbacks of the existing method that the characteristic acquisition capability is insufficient due to the insensitivity to the stripe target and the limitation of the attention mechanism, generates the stripe attention map by performing stripe pooling fusion in the horizontal and vertical directions on the characteristic map, and then dynamically distributes weights of a plurality of convolution kernels based on the stripe attention map, so that the network is more focused on the stripe characteristic of the scar. The GGMC module in the embodiment overcomes the defect that global and multi-scale semantic information cannot be captured during feature fusion in the existing method, so that the network has stronger perceptibility for targets with different scales and different lengths.
The present embodiment is described in detail below:
the embodiment is mainly used for accurately dividing the skin scar in the image. As shown in fig. 1, first, supervised training is performed using labeled data to obtain a pre-trained model. And then carrying out three-dimensional reconstruction on each scar sample in the labeled image to obtain a three-dimensional model of each scar sample. And interpolating and projecting two-dimensional pseudo views with more angles by using the three-dimensional model as unlabeled data. And for the unlabeled data, loading a pre-training model by adopting a teacher-student model strategy to carry out enhancement training, and finally using the trained student model for model test. The same network structure is adopted for the teacher model and the student model in the pre-training model, the teacher model and the student model.
(1) Network structure
The network structure of this embodiment is shown in fig. 2. The overall network is a U-shaped codec network comprising an encoder, a decoder, and a header structure (i.e., a segmentation head) for the segmentation task. Specifically, the encoder adopts ResNet34 as the feature extractor and comprises a convolution layer and four residual coding modules (the first, second, third and fourth residual coding modules) connected in sequence; each residual coding module comprises a 3×3 convolution, ReLU activation and a second 3×3 convolution connected in sequence, and the input features are added to the output of the second 3×3 convolution (the residual operation) to obtain the output result. The decoder comprises four decoding modules (the third, second, first and zeroth decoding modules) connected in sequence. The header structure comprises up-sampling, a 3×3 convolution and a 1×1 convolution connected in sequence. Each level of features output by the encoder is input to the decoder after passing through a CBR module consisting of a 3×3 convolution, batch normalization and a ReLU activation function. In the third, second and first decoding modules, the features from the previous level pass through the SADC module, undergo a residual operation and up-sampling, and are then added to and fused with the corresponding-level encoder features, realizing feature reconstruction. In the zeroth decoding module, the features from the previous level pass through the SADC module and undergo a residual operation and up-sampling to obtain the output result. The outputs of the first and second decoding modules are each passed through a GGMC module and added to the output of the zeroth decoding module, and the sum is input into the header structure. In short, the GGMC module is used for multi-level feature fusion in the decoder: the outputs of the two middle decoder levels pass through GGMC modules and are added to the output of the last decoder level.
In the U-shaped codec network of this embodiment, the first residual coding module corresponds to the first decoding module (128×128), the second residual coding module corresponds to the second decoding module (64×64), the third residual coding module corresponds to the third decoding module (32×32), and the correspondence here means that the length and the width of the output feature map of the corresponding module are identical.
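For illustration only, one decoder stage described above (SADC on the deeper feature, a residual operation, up-sampling, then additive fusion with the CBR-processed encoder skip feature) could be wired roughly as in the following PyTorch sketch; the internal layout of the residual branch, the interface of the SADC module and all names are assumptions, since the text does not fix them:

```python
import torch.nn as nn
import torch.nn.functional as F

class DecodeBlock(nn.Module):
    """One decoder stage: SADC on the deeper feature, a residual operation,
    2x up-sampling, then additive fusion with the encoder skip feature."""
    def __init__(self, channels, sadc_module):
        super().__init__()
        self.sadc = sadc_module               # assumed interface: (B, C, H, W) -> (B, C, H, W)
        self.res_branch = nn.Sequential(      # assumed layout of the "residual operation"
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, deep_feat, skip_feat):
        x = self.sadc(deep_feat)                        # stripe-attention dynamic convolution
        x = F.relu(x + self.res_branch(x))              # residual operation
        x = F.interpolate(x, scale_factor=2,
                          mode="bilinear", align_corners=False)   # up-sample to the skip resolution
        return x + skip_feat                            # additive fusion with the CBR-processed skip
```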
(1.1) Dynamic convolution module based on stripe attention (SADC module): as shown in fig. 3, for each level of feature map the SADC module mainly comprises two steps: (a) determining the spatial position of the strip-shaped target based on a stripe attention mechanism and generating a global descriptor; (b) using the global descriptor to regress a kernel allocation matrix through a multi-layer perceptron, and using the kernel allocation matrix to weight and fuse a plurality of independent convolution kernels, so that weights are dynamically allocated to the independent kernels according to different input features. Specifically:
given a feature map X E R C×H×W As an input, where C represents the number of channels, and H and W represent the height and width of the feature map, respectively. Firstly, a feature map X is subjected to convolution, batch normalization and activation layer processing (namely a CBR module in fig. 3), and is subjected to horizontal pooling and vertical pooling to respectively obtain two feature vectors X v ∈R C×H×1 and Xh ∈R C ×1×W Then X is taken up v and Xh By adding after the convolution, batch normalization and up-sampling operations of 3X 3 (i.e., CBU in fig. 3), and then rolling and batch normalization (i.e., CB in fig. 3) of the added result to obtain X m ∈R C×H×W . X is to be m And through a Sigmoid function and multiplying the Sigmoid function with the feature map X, obtaining a global descriptor Z.
The global descriptor Z is input into a multi-layer perceptron formed by two fully connected layers (the MLP in fig. 3), which outputs a kernel allocation matrix K = [k_1, k_2, ..., k_N], where N denotes the number of independent 3×3 convolution kernels and k_i denotes the weight of the i-th convolution kernel. This embodiment uses Softmax for parameter constraint so that Σ_i k_i = 1. Dynamic convolution applies an attention mechanism on the convolution kernels by assigning weights to the different kernels. Let the N independent convolution kernel parameters be θ = [θ_1, θ_2, ..., θ_N]; the kernel allocation matrix K and the convolution kernel parameters θ are fused and then applied to the feature map X, yielding the final output O:
O = DConv(X) = Conv(X, Σ_i k_i·θ_i)   (1)
where DConv denotes the dynamic convolution operation and Conv denotes the convolution operation.
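A hedged PyTorch sketch of this SADC computation is given below. The stripe pooling, stripe attention map, global descriptor and Softmax-weighted fusion of N independent 3×3 kernels follow the description; the global average pooling in front of the MLP, the hidden width of the MLP, the number of kernels N and the per-sample convolution loop are assumptions made only to obtain runnable code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SADC(nn.Module):
    """Stripe-attention based dynamic convolution (sketch, under stated assumptions)."""
    def __init__(self, channels, num_kernels=4):
        super().__init__()
        self.cbr = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.conv_v = nn.Sequential(            # branch for the C x H x 1 vertical stripe
            nn.Conv2d(channels, channels, 3, padding=1, bias=False), nn.BatchNorm2d(channels))
        self.conv_h = nn.Sequential(            # branch for the C x 1 x W horizontal stripe
            nn.Conv2d(channels, channels, 3, padding=1, bias=False), nn.BatchNorm2d(channels))
        self.conv_m = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False), nn.BatchNorm2d(channels))
        # MLP regressing the kernel-allocation vector K; a global pooling step is assumed
        # here to reduce Z to a vector, which the patent text does not spell out.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(inplace=True),
            nn.Linear(channels // 4, num_kernels))
        # N independent 3x3 convolution kernels theta_1 .. theta_N
        self.kernels = nn.Parameter(torch.randn(num_kernels, channels, channels, 3, 3) * 0.01)

    def forward(self, x):
        b, c, h, w = x.shape
        f = self.cbr(x)
        xv = F.adaptive_avg_pool2d(f, (h, 1))   # vertical stripe pooling -> C x H x 1
        xh = F.adaptive_avg_pool2d(f, (1, w))   # horizontal stripe pooling -> C x 1 x W
        xv = F.interpolate(self.conv_v(xv), size=(h, w), mode="bilinear", align_corners=False)
        xh = F.interpolate(self.conv_h(xh), size=(h, w), mode="bilinear", align_corners=False)
        xm = self.conv_m(xv + xh)               # X_m in R^(C x H x W)
        z = torch.sigmoid(xm) * x               # stripe-attention-weighted features Z
        k = torch.softmax(self.mlp(z.mean(dim=(2, 3))), dim=1)   # kernel allocation, sum_i k_i = 1
        out = []
        for i in range(b):                      # per-sample dynamic kernel fusion
            fused = (k[i].view(-1, 1, 1, 1, 1) * self.kernels).sum(dim=0)
            out.append(F.conv2d(x[i:i + 1], fused, padding=1))
        return torch.cat(out, dim=0)            # O = Conv(X, sum_i k_i * theta_i)
```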
(1.2) Globally guided multi-kernel convolution module (GGMC module): as shown in fig. 4, for the input features the GGMC module mainly comprises two steps: (a) the input features are scaled and flattened to obtain a global descriptor, which is used to refine the input features into globally aware features; (b) the globally aware features are passed through a multi-branch convolution structure to capture channel characterization and multi-scale local characterization, and finally all features are fused to obtain the output. Specifically:
for the input feature diagram X ε R C×H×W Feature scaling is performed in the spatial dimension by adopting average pooling, and an input feature map X is converted into a feature X with height and width of h and w respectively s ∈R C×h×w Then feature X s Leveling according to the space dimension to obtainWill->Obtaining a global descriptor G E R through a multi-layer perceptron C×1×1 . Global descriptors G and X s Channel-by-channel fusion, i.e. X s Each pixel is added with a value to obtain a feature map X with global perceptibility G ∈R C×h×w
For the local perception features, a multi-branch convolution structure is used: the feature map X_G is passed through depthwise convolution kernels of sizes 3×3, 5×5 and 7×7, respectively, to capture multi-scale local perception features. For the channel perception features, a 1×1 point-wise convolution is applied to the feature map X_G to capture the relationship of each pixel along the channel dimension. Batch normalization (BN) is applied to each branch result for feature alignment. Finally, the local perception features and the channel perception features are added to obtain the final output.
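The GGMC computation above can be sketched in PyTorch as follows; the pooled spatial size (h, w), the MLP hidden width and the handling of the output resolution are assumptions not fixed by the text:

```python
import torch.nn as nn
import torch.nn.functional as F

class GGMC(nn.Module):
    """Globally guided multi-kernel convolution (sketch, under stated assumptions)."""
    def __init__(self, channels, pooled_size=8):        # pooled (h, w); the exact value is assumed
        super().__init__()
        self.pooled_size = pooled_size
        n = pooled_size * pooled_size
        self.mlp = nn.Sequential(nn.Linear(n, n // 2), nn.ReLU(inplace=True), nn.Linear(n // 2, 1))
        self.dw3 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels, bias=False)
        self.dw7 = nn.Conv2d(channels, channels, 7, padding=3, groups=channels, bias=False)
        self.pw1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.ModuleList([nn.BatchNorm2d(channels) for _ in range(4)])

    def forward(self, x):
        xs = F.adaptive_avg_pool2d(x, self.pooled_size)  # X_s in R^(C x h x w)
        g = self.mlp(xs.flatten(2))                      # per-channel global descriptor G, shape (B, C, 1)
        xg = xs + g.unsqueeze(-1)                        # channel-wise fusion -> X_G
        branches = [self.dw3(xg), self.dw5(xg), self.dw7(xg), self.pw1(xg)]
        return sum(bn(br) for bn, br in zip(self.bn, branches))   # local + channel perception, summed
```

The 3×3/5×5/7×7 branches are depthwise (groups = channels), matching the depthwise convolution kernels in the description, while the 1×1 branch is an ordinary point-wise convolution. Note that this sketch returns features at the pooled resolution; how they are aligned with the decoder output before the final addition is not detailed in the text.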
(1.3) loss function
The present embodiment uses a joint loss function L_seg combining the binary Focal loss and the Dice loss:
L_seg = L_focal + L_dice   (2)
The Focal loss function is defined as:
L_focal = −Σ_i [ y_i (1 − ŷ_i)^γ log(ŷ_i) + (1 − y_i) ŷ_i^γ log(1 − ŷ_i) ]   (3)
where γ is an adjusting factor with γ > 0; in this embodiment γ is set to 0.5.
The Dice loss function L_dice measures the similarity between the network prediction and the manual annotation and is computed as:
L_dice = 1 − (2 Σ_i ŷ_i·y_i + ε) / (Σ_i ŷ_i + Σ_i y_i + ε)   (4)
where ŷ_i is the predicted value of the i-th pixel, y_i is the ground-truth label of the i-th pixel, and ε is a smoothing factor that prevents the numerator or denominator from being zero, set to 10^(−6).
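A compact sketch of the joint loss is shown below; the Dice term follows formula (4), while the focal term uses the common binary focal loss form, which may differ in detail from the patent's formula (3):

```python
import torch

def focal_loss(pred, target, gamma=0.5, eps=1e-6):
    """Binary focal loss in its common form; gamma = 0.5 follows the text."""
    pred = pred.clamp(eps, 1.0 - eps)
    loss = -(target * (1 - pred) ** gamma * torch.log(pred)
             + (1 - target) * pred ** gamma * torch.log(1 - pred))
    return loss.mean()

def dice_loss(pred, target, eps=1e-6):
    """Dice loss with smoothing factor eps = 1e-6, as in formula (4)."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def seg_loss(pred, target, gamma=0.5):
    """Joint segmentation loss L_seg = L_focal + L_dice (formula (2))."""
    return focal_loss(pred, target, gamma) + dice_loss(pred, target)
```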
(2) Data augmentation based on multi-view geometric projection
The data augmentation scheme of this embodiment is shown in fig. 5. First, three-dimensional reconstruction is performed on multi-view images of the same sample (a strip-shaped skin scar) using a multi-view stereo geometry algorithm (the SFM algorithm), yielding a three-dimensional model and the camera parameters of each view. Then, the rotation transformation matrices formed from the camera parameters are interpolated, and the interpolation results are used to project the three-dimensional model into two-dimensional camera views (pseudo views). Finally, all the generated views form an unlabeled data set that is fed into the teacher model for training. The data augmentation scheme is specifically as follows:
for the same strip-shaped skin scar sample, images are respectively shot from multiple visual angles, and multiple views are obtained. And adopting a motion restoration Structure (SFM) algorithm in multi-view solid geometry to carry out three-dimensional reconstruction on the input multi-view. The three-dimensional model of the strip-shaped skin scar sample and the camera parameters corresponding to each view can be obtained through three-dimensional reconstruction, wherein the camera parameters comprise internal parameters and external parameters of the camera. The rotation transformation matrix of each view relative to the three-dimensional model is obtained through camera parameters, and the rotation transformation matrix of the view is represented by using a quaternary value in the embodiment, wherein the formula is as follows:
wherein ,representing a rotation transformation of the ith view, [ u ] ix u iy u iz ]Representing a three-dimensional vector of unit length, u i Representing the rotation axis, θ, of the ith view in unit vector definition i The angle of rotation of the ith view is represented, and N represents the total view number.
Interpolation is performed between adjacent rotation transformation matrices, and the interpolation results are used to project the three-dimensional model in three dimensions, thereby generating two-dimensional views from additional angles. Given the rotation transformations q_i and q_(i+1) of two adjacent views, the interpolated rotation q_t is obtained by linear interpolation:
q_t = (1 − t)·q_i + t·q_(i+1)   (6)
where t ∈ (0, 1) is the interpolation ratio parameter and q_t denotes the interpolation result; q_t is applied to the three-dimensional model to obtain a synthesized pseudo view.
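The axis-angle-to-quaternion conversion of formula (5) and the linear interpolation of formula (6) can be sketched as follows; the renormalization of the interpolated quaternion is a practical assumption not stated in the text:

```python
import numpy as np

def axis_angle_to_quaternion(u, theta):
    """Unit quaternion q = [cos(theta/2), u*sin(theta/2)] for rotation axis u and angle theta."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)
    return np.concatenate(([np.cos(theta / 2.0)], u * np.sin(theta / 2.0)))

def interpolate_rotation(q_i, q_j, t):
    """Linear interpolation q_t = (1 - t)*q_i + t*q_j, t in (0, 1),
    renormalized so the result stays a valid unit quaternion (assumed detail)."""
    q_t = (1.0 - t) * np.asarray(q_i) + t * np.asarray(q_j)
    return q_t / np.linalg.norm(q_t)

# Example: interpolate halfway (t = 0.5) between two neighbouring views
q1 = axis_angle_to_quaternion([0.0, 1.0, 0.0], np.deg2rad(10.0))
q2 = axis_angle_to_quaternion([0.0, 1.0, 0.0], np.deg2rad(30.0))
q_mid = interpolate_rotation(q1, q2, 0.5)
```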
(3) Enhanced training based on teacher-student model
Taking t = 0.5 in formula (6), twice the number of views is generated for each scar sample, i.e., the amount of data is doubled; strong data augmentation such as data mixing, rotation, flipping and translation is then applied to obtain the training data of the teacher and student models in the second stage. Using the generated data, this embodiment performs enhanced training of the pre-trained model with a teacher-student training strategy. As shown in fig. 5, the teacher model and the student model adopt the same network structure and are preloaded with the model parameters obtained by supervised training on the original data set. The input pseudo view is passed through the teacher model to obtain a pseudo label, and at the same time the pseudo view is input to the student model to obtain a segmentation result; the loss is then computed from the pseudo label and the segmentation result output by the student model, so that the teacher model guides the learning of the student model and the generalization of the student model keeps improving. The trained student model is finally used as the U-shaped encoding and decoding network of this embodiment.
The application adopts an exponential moving average (Exponential Moving Average, EMA) strategy to gradually transfer the parameters of the student model into the teacher model, with the formula:
θ′_t = α_t·θ′_(t−1) + (1 − α_t)·θ_t   (7)
where θ′_t denotes the weights of the teacher model at the current epoch, θ′_(t−1) denotes the teacher weights at the previous epoch, θ_t denotes the weights of the current student model, and total_epoch denotes the total number of training epochs. α_t denotes the weight ratio between the teacher model and the student model. As training proceeds, the performance of the student model keeps improving, so the weight share of the student model can be appropriately increased; this embodiment adopts a simple linear change strategy, expressed by formula (8), where α_0 denotes the initial teacher weight and is set to 0.99.
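A sketch of the EMA update of formula (7) is given below; the linear schedule for α_t is only an assumption consistent with the text (formula (8) itself is not reproduced here), starting from α_0 = 0.99 and gradually increasing the student's share:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, epoch, total_epoch, alpha0=0.99):
    """Formula (7): theta'_t = alpha_t * theta'_{t-1} + (1 - alpha_t) * theta_t.
    The linear schedule for alpha_t below is an illustrative assumption."""
    alpha_t = alpha0 * (1.0 - epoch / float(total_epoch))   # teacher share shrinks over training
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha_t).add_(s_param, alpha=1.0 - alpha_t)
```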
The experimental analysis is as follows:
To verify the effectiveness of the method, the application was validated on collected real strip-shaped skin scar images.
1) Data set
The data used in this experiment are all clinical data from a forensic identification science institution in Shanghai. A total of 744 strip-shaped skin scar images were acquired by mobile phone photography, covering 130 scar samples with 4-10 different viewing angles per sample. Experiments were performed using five-fold cross-validation. The input images were uniformly resized to a resolution of 512×512. The Dice coefficient, the intersection-over-union (IoU) and the sensitivity (Sen) were used as segmentation evaluation indices, defined as
Dice = 2TP / (2TP + FP + FN), IoU = TP / (TP + FP + FN), Sen = TP / (TP + FN)
where TP, FP and FN denote true positives, false positives and false negatives, respectively, with the scar area as the positive class and the background area as the negative class. TP means the true value is positive and the predicted value is positive; FP means the true value is negative while the predicted value is positive; FN means the true value is positive but the predicted value is negative.
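For reference, the three evaluation indices can be computed from binary masks as in the following sketch (function names are illustrative):

```python
import numpy as np

def segmentation_metrics(pred, label):
    """Dice, IoU and sensitivity from binary prediction and label masks,
    with the scar region as the positive class (TP/FP/FN as defined above)."""
    pred = pred.astype(bool)
    label = label.astype(bool)
    tp = np.logical_and(pred, label).sum()
    fp = np.logical_and(pred, ~label).sum()
    fn = np.logical_and(~pred, label).sum()
    dice = 2.0 * tp / (2.0 * tp + fp + fn + 1e-6)
    iou = tp / (tp + fp + fn + 1e-6)
    sen = tp / (tp + fn + 1e-6)
    return dice, iou, sen
```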
To comprehensively evaluate model performance, the experiments also compare the parameter counts (Param) and computation amounts (FLOPs) of the different segmentation networks. The parameter count measures the space occupied by the stored network parameters, i.e., the space complexity of the model. The computation amount measures the number of floating-point operations required by the network, i.e., the time complexity of the model.
2) Results
The present application employs a base network (Baseline) that uses a ResNet34 pre-trained model as the encoder; each decoder stage consists of up-sampling and convolution blocks, and the encoder features are reduced in dimension and fused into the decoder through skip connections. Corresponding ablation experiments were performed for the SADC module and the GGMC module. In the comparison experiments, the effectiveness of the proposed method was verified by comparing it with other excellent convolutional-neural-network-based segmentation networks, including UNet, CSNet, SegNet, FCN, CENet, DeepLabv3+, CPFNet and PSPNet. All experimental results are supplemented with results obtained using the teacher-student training, verifying the effectiveness of the data augmentation scheme and training strategy proposed by the application.
Table 1 lists the ablation results. Compared with the baseline network, the performance of the proposed method is improved: the intersection-over-union, the Dice coefficient and the sensitivity increase from 79.14%, 87.94% and 88.81% to 80.85%, 89.16% and 92.11%, respectively. Adding the GGMC module improves the network on the IoU and Dice indices. Adding the SADC module improves the network on the IoU, Dice and Sen indices. In terms of computation amount and parameter count, the GGMC and SADC modules proposed by the application do not introduce excessive parameters or computation.
Table 1. Comparison of ablation experiment results for strip-shaped skin scar image segmentation

| Method | Teacher-student | mIoU | mDice | mSen | Param (M) | FLOPs (G) |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | × | 0.7914±0.0242 | 0.8794±0.0170 | 0.8881±0.0114 | 21.739 | 31.87 |
| Baseline+GGMC | × | 0.7971±0.0270 | 0.8833±0.0182 | 0.8830±0.0213 | 21.807 | 31.88 |
| Baseline+SADC | × | 0.7977±0.0238 | 0.8834±0.0165 | 0.8912±0.0128 | 21.896 | 32.27 |
| The present application | × | 0.8007±0.0275 | 0.8861±0.0181 | 0.9079±0.0149 | 21.964 | 32.27 |
| Baseline | √ | 0.7995±0.0232 | 0.8848±0.0159 | 0.8967±0.0141 | 21.739 | 31.87 |
| Baseline+GGMC | √ | 0.8013±0.0277 | 0.8859±0.0187 | 0.8951±0.0133 | 21.807 | 31.88 |
| Baseline+SADC | √ | 0.8013±0.0249 | 0.8863±0.0169 | 0.8977±0.0139 | 21.896 | 32.27 |
| The present application | √ | 0.8085±0.0253 | 0.8916±0.0164 | 0.9111±0.0142 | 21.964 | 32.27 |
Table 2 lists the results of the comparison experiments. Since the present application uses a ResNet34 pre-trained model as the encoder, for fairness the encoder parts of the other networks, except UNet and CSNet, were replaced with the ResNet34 pre-trained model during comparison. For UNet and CSNet, the encoder cannot be replaced with the ResNet34 pre-trained model because of their particular network structures. As can be seen from Table 2, the indices of UNet and CSNet, which lack pre-trained models, are relatively low. Because the skin scar images are photographed with a smartphone, they belong to natural images; the ResNet34 model is pre-trained on the natural image data set ImageNet, so for natural images the pre-trained model provides better initialization parameters for the network encoder, making training converge more easily to an optimum. Regarding the teacher-student training strategy and the data augmentation scheme based on multi-view geometric projection, the table shows that after enhanced training with the teacher-student model, all methods improve to some extent on the indices. In addition, the application has clear advantages over the other segmentation networks in computation amount and parameter count.
Table 2. Comparison of comparative experiment results for strip-shaped skin scar image segmentation

| Method | Backbone | Teacher-student | mIoU | mDice | mSen | Param (M) | FLOPs (G) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| UNet | / | × | 0.7365±0.0284 | 0.8368±0.0228 | 0.8354±0.0322 | 8.637 | 65.7 |
| CSNet | / | × | 0.7498±0.0224 | 0.8481±0.0178 | 0.8549±0.0333 | 8.401 | 55.98 |
| SegNet | Res34 | × | 0.7648±0.0187 | 0.8625±0.0142 | 0.8798±0.0174 | 38.438 | 37.68 |
| FCN | Res34 | × | 0.7800±0.0235 | 0.8710±0.0182 | 0.8761±0.0206 | 25.21 | 40.97 |
| CENet | Res34 | × | 0.7913±0.0269 | 0.8791±0.0181 | 0.8915±0.0195 | 29.003 | 35.57 |
| DeepLabv3+ | Res34 | × | 0.7926±0.0274 | 0.8806±0.0214 | 0.8905±0.0227 | 26.711 | 54.59 |
| CPFNet | Res34 | × | 0.7880±0.0266 | 0.8772±0.0184 | 0.8840±0.0162 | 30.651 | 32.13 |
| PSPNet | Res34 | × | 0.7736±0.0265 | 0.8680±0.0184 | 0.8822±0.0158 | 27.5 | 23.49 |
| The present application | Res34 | × | 0.8007±0.0275 | 0.8861±0.0181 | 0.9079±0.0149 | 21.96 | 32.27 |
| UNet | / | √ | 0.7453±0.0251 | 0.8448±0.0191 | 0.8468±0.0305 | 8.637 | 65.7 |
| CSNet | / | √ | 0.7577±0.0196 | 0.8540±0.0154 | 0.8628±0.0259 | 8.401 | 55.98 |
| SegNet | Res34 | √ | 0.7666±0.0176 | 0.8641±0.0132 | 0.8827±0.0186 | 38.438 | 37.68 |
| FCN | Res34 | √ | 0.7853±0.0235 | 0.8754±0.0173 | 0.8791±0.0170 | 25.21 | 40.97 |
| CENet | Res34 | √ | 0.7960±0.0286 | 0.8821±0.0192 | 0.8966±0.0210 | 29.003 | 35.57 |
| DeepLabv3+ | Res34 | √ | 0.7950±0.0273 | 0.8823±0.0187 | 0.8944±0.0238 | 26.711 | 54.59 |
| CPFNet | Res34 | √ | 0.7917±0.0248 | 0.8799±0.0171 | 0.8883±0.0150 | 30.651 | 32.13 |
| PSPNet | Res34 | √ | 0.7772±0.0236 | 0.8712±0.0163 | 0.8912±0.0162 | 27.5 | 23.49 |
| The present application | Res34 | √ | 0.8085±0.0253 | 0.8916±0.0164 | 0.9111±0.0142 | 21.96 | 32.27 |
Fig. 6 shows the segmentation results of strip-shaped scar images for the different methods; from left to right: the original image and the segmentation results of SegNet, PSPNet, FCN, CENet, DeepLabv3+, CPFNet and the network of the present application. As can be seen from fig. 6, the segmentation accuracy of the proposed network on strip-shaped scar images is clearly improved.
Thus, a novel deep learning network suitable for strip-shaped skin scar images has been implemented and validated. The teacher-student enhanced training strategy based on multi-view geometric projection data augmentation alleviates network under-fitting caused by small data samples. The dynamic convolution module based on stripe attention addresses the insensitivity to strip-shaped targets and the insufficient feature acquisition caused by the limitations of existing attention mechanisms. The globally guided multi-kernel convolution module addresses the inability of encoder-decoder network structures to capture global and multi-scale semantic information during feature fusion. The application was verified by comprehensive experiments on 744 collected strip-shaped skin scar images, and the experimental results show that the method achieves better segmentation performance on strip-shaped scar image data.
Example two
The application relates to a strip-shaped skin scar image segmentation system based on a convolutional neural network, which comprises:
an acquisition module, for obtaining an image with a strip-shaped skin scar;
an image segmentation module, for inputting the image into a U-shaped encoding and decoding network and realizing image segmentation of the strip-shaped skin scar through the U-shaped encoding and decoding network;
the U-shaped encoding and decoding network comprises an SADC module and a GGMC module, wherein the SADC module is used for enabling the U-shaped encoding and decoding network to focus on strip-shaped characteristics of skin scars, and the GGMC module is used for increasing the perception capability of the U-shaped encoding and decoding network on the skin scars with different scales and different lengths.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It is apparent that the above examples are given by way of illustration only and are not limiting of the embodiments. Other variations and modifications of the present application will be apparent to those of ordinary skill in the art in light of the foregoing description. It is not necessary here nor is it exhaustive of all embodiments. And obvious variations or modifications thereof are contemplated as falling within the scope of the present application.

Claims (10)

1. A strip-shaped skin scar image segmentation method based on a convolutional neural network, characterized by comprising the following steps:
acquiring an image with strip-shaped skin scars;
inputting the image into a U-shaped encoding and decoding network, and realizing image segmentation of the strip-shaped skin scar through the U-shaped encoding and decoding network;
the U-shaped encoding and decoding network comprises an SADC module and a GGMC module, wherein the SADC module is used for enabling the U-shaped encoding and decoding network to focus on strip-shaped characteristics of skin scars, and the GGMC module is used for increasing the perception capability of the U-shaped encoding and decoding network on the skin scars with different scales and different lengths.
2. The strip-shaped skin scar image segmentation method based on the convolutional neural network as set forth in claim 1, wherein: the U-shaped coding and decoding network comprises an encoder, a decoder and a header structure, wherein the encoder is used for extracting features, the decoder is used for carrying out feature recovery on the features extracted by the encoder, and the header structure is used for carrying out image segmentation on the features recovered by the decoder;
the header structure comprises up-sampling, 3 x 3 convolution and 1 x 1 convolution connected in sequence;
the encoder adopts ResNet34, which comprises a convolution layer and a first, second, third and fourth residual coding module connected in sequence; each of the first, second, third and fourth residual coding modules comprises a 3×3 convolution, ReLU activation and a second 3×3 convolution connected in sequence, and the input features are added to the output of the second 3×3 convolution to obtain the output result;
the decoder comprises a third decoding module, a second decoding module, a first decoding module and a zeroth decoding module which are sequentially connected;
the characteristics output by the first residual error coding module are input to a first decoding module after passing through a CBR module;
the characteristics output by the second residual error coding module are input to a second decoding module after passing through a CBR module;
the characteristics output by the third residual error coding module are input to a third decoding module after passing through a CBR module;
the characteristics output by the fourth residual error coding module are input to a third decoding module after passing through a CBR module;
in the first decoding module, the second decoding module and the third decoding module, the features of the upper layer pass through the SADC module, and are added and fused with the features of the layer corresponding to the residual coding module after residual operation and up-sampling, so that feature reconstruction is realized;
in the zeroth decoding module, the last-level features pass through the SADC module, and output results are obtained after residual error operation and up-sampling;
the first decoding module and the second decoding module are added with the output of the zeroth decoding module after passing through the GGMC module, and then the added result is input into the head structure.
3. The strip-shaped skin scar image segmentation method based on the convolutional neural network as set forth in claim 2, wherein: the CBR module comprises a 3×3 convolution layer, a batch normalization layer and a ReLU activation function layer connected in sequence.
4. The strip-shaped skin scar image segmentation method based on the convolutional neural network as set forth in claim 1, wherein: the SADC module includes:
for an input feature map X ∈ R^(C×H×W), where C denotes the number of channels and H and W denote the height and width of the feature map, respectively: the feature map X is first processed by convolution, batch normalization and an activation layer; the processed feature map is subjected to horizontal pooling and vertical pooling to obtain two feature vectors X_v ∈ R^(C×H×1) and X_h ∈ R^(C×1×W); the feature vectors X_v and X_h are each passed through a 3×3 convolution, batch normalization and an up-sampling operation and then added, and the sum is passed through a convolution and batch normalization to obtain X_m ∈ R^(C×H×W); X_m is passed through a Sigmoid function and multiplied with the feature map X to obtain a global descriptor Z;
the global descriptor Z is input into a multi-layer perceptron formed by two fully connected layers, which outputs a kernel allocation matrix K = [k_1, k_2, ..., k_N], where N denotes the number of independent 3×3 convolution kernels and k_i denotes the weight of the i-th convolution kernel; the weights are constrained by Softmax so that Σ_i k_i = 1. Let the N independent convolution kernel parameters be θ = [θ_1, θ_2, ..., θ_N]; the kernel allocation matrix K and the convolution kernel parameters θ are fused and then applied to the feature map X to produce the final output O.
5. The strip-shaped skin scar image segmentation method based on a convolutional neural network as set forth in claim 4, wherein: the kernel distribution matrix K and the convolution kernel parameter theta are fused and then act on the feature map X, and finally output O, wherein the formula is as follows:
O = DConv(X) = Conv(X, Σ_i k_i·θ_i)
where DConv denotes a dynamic convolution operation and Conv denotes a convolution operation.
6. The strip-shaped skin scar image segmentation method based on the convolutional neural network as set forth in claim 1, wherein: the GGMC module comprises:
for an input feature map X ∈ R^(C×H×W), where C denotes the number of channels and H and W denote the height and width of the feature map, respectively: average pooling is used to scale the features in the spatial dimension, converting the feature map X into a feature X_s ∈ R^(C×h×w) with height h and width w; the feature X_s is then flattened along the spatial dimension, and the flattened feature is passed through a multi-layer perceptron to obtain a global descriptor G ∈ R^(C×1×1); the global descriptor G and X_s are fused channel by channel to obtain a feature map X_G ∈ R^(C×h×w) with global perception;
the feature map X_G is passed through depthwise convolution kernels of sizes 3×3, 5×5 and 7×7, respectively, to capture multi-scale local perception features; a 1×1 point-wise convolution is applied to the feature map X_G to capture channel perception features;
after the 3×3, 5×5 and 7×7 depthwise convolutions and the 1×1 point-wise convolution, batch normalization is applied to align the features; finally, the multi-scale local perception features and the channel perception features are added to obtain the final output.
7. The strip-shaped skin scar image segmentation method based on the convolutional neural network as set forth in claim 1, wherein: the method further comprises training the U-shaped coding and decoding network by constructing an image data set, and the method for constructing the image data set comprises the following steps:
acquiring multi-view images of the same strip skin scar, performing three-dimensional reconstruction on the multi-view images by using an SFM algorithm to obtain a three-dimensional model of the strip skin scar and camera parameters of each view, and obtaining a rotation transformation matrix of each view image relative to the three-dimensional model according to the camera parameters;
and interpolating between adjacent rotation transformation matrixes, and carrying out three-dimensional projection on the three-dimensional model by interpolation results to generate a plurality of two-dimensional pseudo views.
8. The strip-shaped skin scar image segmentation method based on a convolutional neural network as set forth in claim 7, wherein: the rotation transformation matrix of each view angle image relative to the three-dimensional model is obtained according to camera parameters, and the formula is as follows:
q_i = [cos(θ_i/2), u_ix·sin(θ_i/2), u_iy·sin(θ_i/2), u_iz·sin(θ_i/2)], i = 1, 2, ..., N
where q_i denotes the rotation transformation of the i-th view, [u_ix, u_iy, u_iz] is a three-dimensional vector of unit length, u_i denotes the rotation axis of the i-th view defined by the unit vector, θ_i denotes the rotation angle of the i-th view, and N denotes the total number of views.
9. The strip-shaped skin scar image segmentation method based on a convolutional neural network as set forth in claim 7, wherein: the interpolation is carried out between adjacent rotation transformation matrixes, and the formula is as follows:
q_t = (1 − t)·q_i + t·q_(i+1)
where t ∈ (0, 1) denotes the interpolation ratio parameter, q_i and q_(i+1) denote the rotation transformations of two adjacent views, and q_t denotes the interpolation result.
10. A strip-shaped skin scar image segmentation system based on a convolutional neural network, characterized by comprising:
an acquisition module, for obtaining an image with a strip-shaped skin scar;
an image segmentation module, for inputting the image into a U-shaped encoding and decoding network and realizing image segmentation of the strip-shaped skin scar through the U-shaped encoding and decoding network;
the U-shaped encoding and decoding network comprises an SADC module and a GGMC module, wherein the SADC module is used for enabling the U-shaped encoding and decoding network to focus on strip-shaped characteristics of skin scars, and the GGMC module is used for increasing the perception capability of the U-shaped encoding and decoding network on the skin scars with different scales and different lengths.
CN202310682091.0A 2023-06-09 2023-06-09 Strip-shaped skin scar image segmentation method and system based on convolutional neural network Active CN116823852B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310682091.0A CN116823852B (en) 2023-06-09 2023-06-09 Strip-shaped skin scar image segmentation method and system based on convolutional neural network


Publications (2)

Publication Number Publication Date
CN116823852A 2023-09-29
CN116823852B (en) 2024-07-19

Family

ID=88125070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310682091.0A Active CN116823852B (en) 2023-06-09 2023-06-09 Strip-shaped skin scar image segmentation method and system based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN116823852B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220208355A1 (en) * 2020-12-30 2022-06-30 London Health Sciences Centre Research Inc. Contrast-agent-free medical diagnostic imaging
CN112767406A (en) * 2021-02-02 2021-05-07 苏州大学 Deep convolution neural network suitable for corneal ulcer segmentation of fluorescence staining slit lamp image
CN114004811A (en) * 2021-11-01 2022-02-01 西安交通大学医学院第二附属医院 Image segmentation method and system based on multi-scale residual error coding and decoding network
CN114820636A (en) * 2022-05-20 2022-07-29 南京邮电大学 Three-dimensional medical image segmentation model and training method and application thereof
CN115457021A (en) * 2022-09-30 2022-12-09 云南大学 Skin disease image segmentation method and system based on joint attention convolution neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YUTONG CAI: "MA-Unet: An improved version of Unet based on multi-scale and attention mechanism for medical image segmentation", arXiv, 20 December 2020 (2020-12-20), pages 1-13 *
韩慧慧: "Semantic segmentation with encoder-decoder structure" (编码—解码结构的语义分割), Journal of Image and Graphics (中国图象图形学报), 16 February 2020 (2020-02-16), pages 45-56 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117649633A (en) * 2024-01-30 2024-03-05 武汉纺织大学 Pavement pothole detection method for highway inspection
CN117649633B (en) * 2024-01-30 2024-04-26 武汉纺织大学 Pavement pothole detection method for highway inspection

Also Published As

Publication number Publication date
CN116823852B (en) 2024-07-19


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant