CN108268870B - Multi-scale feature fusion ultrasonic image semantic segmentation method based on adversarial learning - Google Patents

Multi-scale feature fusion ultrasonic image semantic segmentation method based on adversarial learning

Info

Publication number
CN108268870B
CN108268870B
Authority
CN
China
Prior art keywords
segmentation
network
layer
image
feature
Prior art date
Legal status
Active
Application number
CN201810085384.XA
Other languages
Chinese (zh)
Other versions
CN108268870A (en)
Inventor
崔少国
张建勋
刘畅
Current Assignee
Chongqing Normal University
Original Assignee
Chongqing Normal University
Priority date
Filing date
Publication date
Application filed by Chongqing Normal University
Priority to CN201810085384.XA
Publication of CN108268870A
Application granted
Publication of CN108268870B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03 Recognition of patterns in medical or anatomical images

Abstract

The invention provides a multi-scale feature fusion ultrasonic image semantic segmentation method based on adversarial learning, which comprises the following steps: building a multi-scale feature fusion semantic segmentation network model, building an adversarial discrimination network model, performing adversarial training and model parameter learning, and automatically segmenting the breast lesion. The segmentation method predicts pixel classes by combining multi-scale features extracted from input images of different resolutions, which improves the accuracy of pixel class label prediction; it adopts dilated convolution in place of part of the pooling to raise the resolution of the segmentation map; and it uses an adversarial discrimination network to guide the segmentation network so that the generated segmentation maps become indistinguishable from the segmentation labels, ensuring good appearance and spatial continuity. A more accurate, high-resolution segmentation map of ultrasonic breast lesions is thereby obtained.

Description

Multi-scale feature fusion ultrasonic image semantic segmentation method based on adversarial learning
Technical Field
The invention relates to the technical field of medical image understanding, in particular to a multi-scale feature fusion ultrasonic image semantic segmentation method based on adversarial learning.
Background
Breast cancer is a malignant tumor that arises in mammary epithelial tissue and seriously threatens the health and quality of life of women. Ultrasound examination is simple, convenient, economical and radiation-free, and has become an important tool for clinical breast cancer diagnosis. B-mode ultrasound images combined with ultrasound elastography are an important means of clinically diagnosing breast disease: the B-mode image depicts the breast tissue structure, elasticity imaging measures the elasticity of the breast tissue, and contrasting the two modalities allows breast lesions to be detected and located more accurately.
Accurately identifying and segmenting the breast lesion region from the ultrasound image provides important support for the precise diagnosis and treatment of breast disease, since the position, size, shape and number of lesion regions are all important bases for clinical decisions. Because of speckle noise in B-mode images and artifact noise in elasticity images, traditional image processing algorithms perform poorly in this field and commonly suffer from under-segmentation or over-segmentation, leaving boundaries that are not accurate enough. Automatically detecting and segmenting the lesion region from multi-modal ultrasound images with artificial intelligence algorithms is therefore now the mainstream approach.
Deep learning has made significant breakthroughs across computer vision tasks. Ultrasonic image semantic segmentation labels each pixel with a semantic category using image features, and deep learning based on convolutional neural networks can markedly improve segmentation precision. Through end-to-end supervised training, a convolutional neural network automatically learns task-specific hierarchical features from training data and achieves better segmentation than traditional machine learning. However, current deep-learning-based breast ultrasound semantic segmentation has the following disadvantages: (1) pixel classes are predicted from single-scale features, so local features and global context features are not fully exploited and misclassified points arise easily; (2) multi-level pooling sharply reduces the resolution of the feature maps (a typical VGG network pools five times, shrinking the feature map to 1/32 of the original image), so the final segmentation map is much smaller than the input; (3) the prediction of each pixel label ignores the similarity between the features of neighboring pixels, so the segmentation map output by the network lacks high-order spatial continuity.
Disclosure of Invention
Aiming at these problems of existing methods, the invention provides a multi-scale feature fusion ultrasonic image semantic segmentation method based on adversarial learning, which predicts pixel classes from the combined multi-scale features of input images at different resolutions, improving the accuracy of pixel class label prediction; it raises the resolution of the segmentation map by adopting dilated convolution in place of part of the pooling; and it uses an adversarial discrimination network to guide the segmentation network until the generated segmentation maps cannot be distinguished from the segmentation labels, giving the segmentation map good appearance and spatial continuity and yielding a more accurate, high-resolution segmentation map of ultrasonic breast lesions.
In order to solve the technical problems, the invention adopts the following technical scheme:
A multi-scale feature fusion ultrasonic image semantic segmentation method based on adversarial learning comprises the following steps:
S1, building the multi-scale feature fusion semantic segmentation network model:
S11, the multi-scale feature fusion semantic segmentation network is used for generating a lesion area segmentation map of an input ultrasonic image; it comprises a multi-resolution feature extractor, a cascade feature fusion device and a classifier, the multi-resolution feature extractor being formed by at least two parallel deep convolutional network branches with identical structures and shared weights and extracting context features of different scales from input images of different resolutions, the cascade feature fusion device fusing the multi-scale features obtained by the different branches step by step to generate a comprehensive feature map, and the classifier predicting pixel class labels from the comprehensive feature map;
S12, each deep convolutional network branch comprises first to fifth convolutional layer groups, first to third max-pooling layers, a first dilated convolutional layer and a second dilated convolutional layer, the first max-pooling layer being located after the first convolutional layer group, the second max-pooling layer after the second convolutional layer group, the third max-pooling layer after the third convolutional layer group, the first dilated convolutional layer after the fourth convolutional layer group, and the second dilated convolutional layer after the fifth convolutional layer group; the classifier comprises a first feature projection layer, a second feature projection layer, a class prediction layer and a Softmax regression layer arranged in sequence after the second dilated convolutional layer;
S2, establishing the adversarial discrimination network model:
S21, the adversarial discrimination network is used for distinguishing whether its input is a segmentation map generated by the segmentation network or a manual segmentation label;
S22, the adversarial discrimination network comprises first to sixth adversarial convolutional layers, first to third fully-connected layers and a Softmax regression layer arranged in sequence;
S3, adversarial training and model parameter learning: alternately training the multi-scale feature fusion semantic segmentation network and the adversarial discrimination network in an adversarial manner on the expanded training data, designing objective functions to optimize the network parameters, and generating the optimal segmentation network model, specifically comprising the following steps:
S31, initializing the network model parameters with the Xavier method;
S32, obtaining breast ultrasound multi-modal gray-level image training samples and corresponding manual segmentation labels, each breast ultrasound multi-modal gray-level image comprising a B-mode image and an elasticity image that are spatially registered; the training data are expanded, divided into a training set and a validation set at a ratio of 4:1, and the network model is trained with five-fold cross-validation;
S33, inputting the B-mode ultrasonic image and the ultrasonic elasticity image into the multi-scale feature fusion semantic segmentation network as two channels, generating a pixel class label prediction probability distribution map, and adopting the following mixed loss function as the optimization objective function:

L_seg(θ_s) = (1/N) Σ_{n=1}^{N} [ L1(s(x_n), y_n) + λ1 L2(a(s(x_n)), 1) ]

the mixed loss function is composed of two classification cross-entropy loss functions L1 and L2, representing the segmentation loss and the discrimination loss respectively; θ_s is the segmentation network model parameter, θ_a is the discrimination network model parameter, x_n is the n-th input training sample, y_n is the segmentation label corresponding to the n-th training sample, s(·) is the segmentation function, a(·) is the discrimination function, N is the batch size, and λ1 is the adversarial factor;
the cross-entropy loss function L1 in the objective function of this step is:

L1(ŷ, y) = - Σ_{i=1}^{H×W} Σ_{c=1}^{C} y_{i,c} ln(ŷ_{i,c})

where ŷ = s(x) is the label prediction probability vector generated by the segmentation function s(·), y is the probability vector corresponding to the segmentation label, H × W is the dimension of the image, C is the number of pixel classes, and ln(·) is the natural logarithm;
the cross-entropy loss function L2 in the objective function of this step is:

L2(ẑ, z) = - Σ_{j=1}^{2} z_j ln(ẑ_j)

where ẑ is the image source label prediction probability vector generated by the adversarial discrimination network, and z is the probability vector of the true source label of the input image;
S34, inputting the segmentation labels and the segmentation maps output by the semantic segmentation network into the adversarial discrimination network, which predicts whether its input comes from the segmentation labels or from the segmentation network, adopting the following loss function as the optimization objective function:

L_adv(θ_a) = (1/N) Σ_{n=1}^{N} [ L2(a(y_n), 1) + L2(a(s(x_n)), 0) ]

the cross-entropy loss function L2 in the loss function of this step is the same as that in step S33;
S35, training the semantic segmentation network and the adversarial discrimination network in an alternating manner, with minimizing L_seg(θ_s) and L_adv(θ_a) as the respective optimization targets; the objective functions are optimized with a stochastic gradient descent algorithm and the network model parameters θ_s and θ_a are updated with the error back-propagation algorithm, the adversarial discrimination network parameters θ_a being fixed while training the semantic segmentation network and the semantic segmentation network parameters θ_s being fixed while training the adversarial discrimination network;
S4, automatic segmentation of breast lesions:
S41, inputting the spatially registered pairs of breast ultrasound B-mode images and corresponding ultrasound elasticity images, at different resolutions, into the branches of the multi-resolution feature extractor as two channels, generating feature maps of different sizes;
S42, inputting the context features of different scales extracted by the multi-resolution feature extractor into the cascade feature fusion device to generate the comprehensive feature map;
S43, inputting the comprehensive feature map into the classifier to generate pixel class label prediction score maps;
S44, converting the pixel class label prediction score maps into a pixel class label prediction probability distribution map with the Softmax function, the distribution map representing the probability of each pixel for each class;
S45, taking the index of each pixel's maximum-probability component as its class label, forming the breast ultrasound lesion semantic segmentation map;
S46, upsampling the breast ultrasound lesion semantic segmentation map by a factor of 8 to obtain the final segmentation map with the same size as the original image.
Further, in step S11, the multi-resolution feature extractor is composed of three parallel deep convolution network branches, namely, a high-resolution feature extractor, a medium-resolution feature extractor, and a low-resolution feature extractor.
Further, the high resolution feature extractor branch input image size is the same as the original image size, the medium resolution feature extractor branch input image size is 1/2 of the original image size, and the low resolution feature extractor branch input image size is 1/4 of the original image size.
Further, the cascade feature fusion device first upsamples the last feature layer of the low-resolution feature extractor branch by a factor of 2 with bilinear interpolation and adds it to the last feature layer of the medium-resolution feature extractor branch, generating fused feature map 1 with the same size as the medium-resolution feature map; it then upsamples fused feature map 1 by a factor of 2 with bilinear interpolation and adds it to the last feature layer of the high-resolution feature extractor branch, generating fused feature map 2 with the same size as the high-resolution feature map, fused feature map 2 being the final comprehensive feature map.
Further, in step S12, each convolutional layer group consists of two convolutional layers, each with 3 × 3 convolution kernels and stride 1, the numbers of convolution kernels of the first to fifth convolutional layer groups being 64, 128, 256, 512 and 512 in sequence; each dilated convolutional layer has 3 × 3 convolution kernels and stride 1, with dilation factors of 3 and 6 for the first and second dilated convolutional layers respectively; each max-pooling layer has a 2 × 2 pooling kernel and stride 2; each feature projection layer has 1 × 1 convolution kernels and stride 1, with 256 and 128 kernels for the first and second feature projection layers respectively; the class prediction layer has 1 × 1 convolution kernels, stride 1, and 2 kernels.
Further, in step S22, each adversarial convolutional layer has 5 × 5 convolution kernels and stride 2, the numbers of convolution kernels of the first to sixth adversarial convolutional layers being 32, 64, 128, 256, 512 and 1024 in sequence; the first to third fully-connected layers have 1024, 512 and 2 neurons respectively, where 2 corresponds to the two categories of whether the input map comes from the segmentation network or from a segmentation label.
Further, the source label of a segmentation map generated by the segmentation network is 0, and the source label of a manual segmentation label is 1.
Further, in the method, the output feature map Z_i corresponding to any convolution kernel is calculated with the following formula:

Z_i = f( Σ_{r=1}^{k} W_{ir} ∗ X_r )

where f is a nonlinear excitation function, r is the input channel index, k is the number of input channels, W_{ir} is the r-th channel weight matrix of the i-th convolution kernel, ∗ denotes the convolution operation, and X_r is the r-th input channel image.
Further, in step S33, a regularization term is added to the optimization objective function, giving the final optimization objective function:

L_seg(θ_s) = (1/N) Σ_{n=1}^{N} [ L1(s(x_n), y_n) + λ1 L2(a(s(x_n)), 1) ] + (λ2/2) Σ_{q=1}^{Q1} θ_{s,q}²

where λ2 is the regularization factor and Q1 is the number of parameters in θ_s.
Further, in step S34, a regularization term is likewise added to the optimization objective function, giving the final optimization objective function:

L_adv(θ_a) = (1/N) Σ_{n=1}^{N} [ L2(a(y_n), 1) + L2(a(s(x_n)), 0) ] + (λ3/2) Σ_{q=1}^{Q2} θ_{a,q}²

where λ3 is the regularization factor and Q2 is the number of parameters in θ_a.
Compared with the prior art, the adversarial-learning-based multi-scale feature fusion ultrasonic image semantic segmentation method provided by the invention has the following advantages:
1. with images of different resolutions as input, the multi-branch feature extractor extracts context features of different scales, and using these features jointly to predict the pixel labels markedly improves prediction accuracy;
2. sharing weights across the branch feature extractors reduces the number of network parameters, making the network easier to train;
3. dilated convolution enlarges the neuron receptive field without shrinking the feature map, raising the resolution of the feature map and thus producing a higher-resolution segmentation map;
4. through adversarial training, the adversarial loss and the pixel label prediction loss jointly optimize the segmentation network; the adversarial gradient flows into the segmentation network, so the learned segmentation model outputs segmentation maps with good appearance and spatial continuity, segments breast ultrasound lesions more accurately, and produces fine-grained segmentation maps with more detail.
Drawings
FIG. 1 is a schematic diagram of the system model of the adversarial-learning-based multi-scale feature fusion ultrasound image semantic segmentation method provided by the invention (the adversarial discrimination network is used only during training; at segmentation time only the semantic segmentation network is needed).
Fig. 2 is a schematic diagram of a multi-scale feature fusion semantic segmentation network structure provided by the present invention (the upper side of the legend indicates the number of feature maps, and the lower left side of the legend indicates the dimension of the feature maps).
Fig. 3 is a schematic diagram of the countermeasure discrimination network structure provided by the present invention (the upper side of the diagram indicates the number of feature maps or the number of neurons, and the lower left side of the diagram indicates the dimension of the feature maps).
FIG. 4 is a schematic diagram of the dilated convolution provided by the invention; the convolution kernel sizes shown in this figure are described in the embodiments below.
Detailed Description
To make the technical means, creative features, objectives and effects of the invention easy to understand, the invention is further described below with reference to the drawings and the preferred embodiments.
Referring to figs. 1 to 4, the present invention provides a multi-scale feature fusion ultrasound image semantic segmentation method based on adversarial learning, comprising the following steps:
S1, building the multi-scale feature fusion semantic segmentation network model:
S11, the multi-scale feature fusion semantic segmentation network is used for generating a lesion area segmentation map of an input ultrasonic image; it comprises a multi-resolution feature extractor, a cascade feature fusion device and a classifier, the multi-resolution feature extractor being formed by at least two parallel deep convolutional network branches with identical structures and shared weights and extracting context features of different scales from the input images of different resolutions fed to the branches, the cascade feature fusion device fusing the multi-scale features obtained by the different branches step by step to generate a comprehensive feature map, and the classifier predicting pixel class labels from the comprehensive feature map; specifically, the input of the multi-scale feature fusion semantic segmentation network has 2 channels, representing the B-mode ultrasound image and the ultrasound elasticity image respectively, and the output of the network also has 2 channels, representing the probabilities that each pixel belongs to normal tissue and to lesion tissue;
S12, each deep convolutional network branch comprises first to fifth convolutional layer groups, first to third max-pooling layers, a first dilated convolutional layer and a second dilated convolutional layer, the first max-pooling layer being located after the first convolutional layer group, the second max-pooling layer after the second convolutional layer group, the third max-pooling layer after the third convolutional layer group, the first dilated convolutional layer after the fourth convolutional layer group, and the second dilated convolutional layer after the fifth convolutional layer group; to ensure that the feature map has the same size after convolution as before it, Padding is set to 1 during convolution, i.e. the image is zero-padded; the classifier comprises a first feature projection layer, a second feature projection layer, a class prediction layer and a Softmax regression layer arranged in sequence after the second dilated convolutional layer.
As a specific embodiment, the multi-resolution feature extractor consists of three parallel deep convolutional network branches, namely a high-resolution feature extractor, a medium-resolution feature extractor and a low-resolution feature extractor; the three branches have identical structures and share weights, and extract different context features from the input images of different sizes fed to the high-, medium- and low-resolution branches. Correspondingly, the cascade feature fusion device fuses the multi-scale features acquired by the three branches step by step: it first upsamples the last feature layer of the low-resolution branch by a factor of 2 with bilinear interpolation and adds it to the last feature layer of the medium-resolution branch, generating fused feature map 1 with the same size as the medium-resolution feature map; it then upsamples fused feature map 1 by a factor of 2 with bilinear interpolation and adds it to the last feature layer of the high-resolution branch, generating fused feature map 2 with the same size as the high-resolution feature map, fused feature map 2 being the final comprehensive feature map. A minimal sketch of this fusion step is given below.
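As an illustration only (not part of the patent text), the following minimal PyTorch sketch renders the cascade fusion just described; the function name and the assumption that all three branch outputs share the same channel count are ours.

```python
import torch.nn.functional as F

def cascade_fuse(feat_low, feat_mid, feat_high):
    """Progressive fusion of the three branch outputs.

    feat_low / feat_mid / feat_high: last feature layers of the low-,
    medium- and high-resolution branches; each is half the spatial size
    of the next one.
    """
    # fused map 1: 2x bilinear upsampling of the low-resolution features,
    # then element-wise addition with the medium-resolution features
    fused1 = F.interpolate(feat_low, scale_factor=2, mode="bilinear",
                           align_corners=False) + feat_mid
    # fused map 2 (the final comprehensive feature map): 2x bilinear
    # upsampling of fused map 1, then addition with the high-res features
    fused2 = F.interpolate(fused1, scale_factor=2, mode="bilinear",
                           align_corners=False) + feat_high
    return fused2
```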
As a specific embodiment, assume the original image size is 480 × 480. The high-resolution branch input is 480 × 480, the same as the original image; the medium-resolution branch input is 1/2 the original size, i.e. 240 × 240, obtained by downsampling the original image by a factor of 2; and the low-resolution branch input is 1/4 the original size, i.e. 120 × 120, obtained by downsampling by a factor of 4. In this way each deep convolutional network branch receives an input image of a different size.
As a specific embodiment, the detailed structure of the multi-scale feature fusion semantic segmentation network model is shown in Table 1 below; Table 1 is illustrated for the high-resolution image size of 480 × 480, and the medium- and low-resolution branches follow analogously:
Table 1 Ultrasound image semantic segmentation system model parameter table (Padding = 1)
[Table 1 is reproduced as an image in the original publication; its parameters are summarized in the following paragraph.]
As can be seen from Table 1, in step S12 each convolutional layer group consists of two convolutional layers with 3 × 3 kernels and stride 1, the numbers of kernels of the first to fifth groups being 64, 128, 256, 512 and 512 in sequence; each dilated convolutional layer has 3 × 3 kernels and stride 1, with dilation factors 3 and 6 for the first and second dilated convolutional layers; each max-pooling layer has a 2 × 2 pooling kernel and stride 2; each feature projection layer has 1 × 1 kernels and stride 1, with 256 and 128 kernels for the first and second projection layers; the class prediction layer has 1 × 1 kernels, stride 1, and 2 kernels; and the Softmax regression layer converts the pixel class prediction scores into a probability distribution. The multi-resolution feature maps output at No. 15 are cascade-fused and then input to No. 16, i.e. the input of No. 16 is the multi-resolution-fused feature from No. 15. A sketch of one branch with these parameters follows.
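As a non-authoritative sketch under the parameters above (the channel width of the two dilated layers is not stated in the text and is assumed here to be 512; the ReLU activations are likewise an assumption), one branch could be written in PyTorch as follows.

```python
import torch.nn as nn

def conv_group(in_ch, out_ch):
    # two 3x3 convolutions, stride 1, padding 1 (size-preserving), with ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True))

class Branch(nn.Module):
    def __init__(self, in_ch=2):            # 2 channels: B-mode + elasticity
        super().__init__()
        self.features = nn.Sequential(
            conv_group(in_ch, 64),  nn.MaxPool2d(2, 2),   # -> 1/2 size
            conv_group(64, 128),    nn.MaxPool2d(2, 2),   # -> 1/4 size
            conv_group(128, 256),   nn.MaxPool2d(2, 2),   # -> 1/8 size
            conv_group(256, 512),
            # first dilated layer: dilation 3, padding 3 keeps the map size
            nn.Conv2d(512, 512, 3, padding=3, dilation=3), nn.ReLU(inplace=True),
            conv_group(512, 512),
            # second dilated layer: dilation 6, padding 6 keeps the map size
            nn.Conv2d(512, 512, 3, padding=6, dilation=6), nn.ReLU(inplace=True))

    def forward(self, x):                    # output is 1/8 of the input size
        return self.features(x)
```

The classifier head of Table 1 would then follow the fused features: two 1 × 1 projection convolutions (256 and 128 channels) and a 1 × 1 class prediction convolution with 2 kernels.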
S2, establishing a confrontation authentication network model:
s21, the confrontation discrimination network is used for distinguishing whether the input is a segmentation graph from a segmentation network or a manual segmentation label (GT);
s22, the countermeasure identification network includes a first to sixth countermeasure convolutional layers, a first to third full-link layers and a Softmax regression layer arranged in sequence, i.e. the full-link layer is arranged after the countermeasure convolutional layer, and the detailed structure is shown in table 2 below:
table 2 confrontation authentication network model parameter table (Padding ═ 1)
Figure BDA0001562250210000101
As can be seen from table 2, the convolution kernel size of each of the pair of convolutional layers is 5 × 5, the step size is 2, and the number of convolution kernels of the first to sixth pair of convolutional layers is 32, 64, 128, 256, 512, 1024 in sequence; the first through third fully-connected layers are 1024, 512, and 2, respectively, where 2 represents two categories of whether the input graph is from a split net or a split label. Specifically, the input of the confrontation identification network is 2 channels, which respectively represent probability distribution maps of pixels belonging to 2 categories of normal and focus, wherein the segmentation map and the segmentation label of each pair of images (B mode image and elastic image) correspond to two probability distribution maps, one is the probability distribution map of each pixel belonging to normal tissue, the other is the probability distribution map of each pixel belonging to lesion tissue, and the output of the network is a binary prediction result of 0-1. In a preferred embodiment, the segmentation network generates a segmentation map with a source label of 0 and a source label of 1, so that whether the input map is from the segmentation network or the segmentation label can be distinguished by countering the 0-1 binary prediction result output by the discrimination network.
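The following PyTorch sketch is illustrative only: Table 2 fixes the convolution and fully-connected sizes, while the ReLU activations and the lazily-initialized first fully-connected layer (whose input dimension depends on the size of the input maps) are our assumptions.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, in_ch=2):             # 2 channels: normal / lesion maps
        super().__init__()
        chans, layers, c = [32, 64, 128, 256, 512, 1024], [], in_ch
        for out_c in chans:                   # six 5x5, stride-2 convolutions
            layers += [nn.Conv2d(c, out_c, 5, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            c = out_c
        self.conv = nn.Sequential(*layers)
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.LazyLinear(1024), nn.ReLU(inplace=True),
                                nn.Linear(1024, 512), nn.ReLU(inplace=True),
                                nn.Linear(512, 2))  # 0: from seg net, 1: label

    def forward(self, x):
        return self.fc(self.conv(x))          # logits; Softmax applied in loss
```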
As a specific embodiment, the operation of the network model is as follows:
(1) Convolution operation: the output feature map Z_i corresponding to any convolution kernel in the method is calculated with the following formula:

Z_i = f( Σ_{r=1}^{k} W_{ir} ∗ X_r )   (1)

where f is a nonlinear excitation function, r is the input channel index, k is the number of input channels, W_{ir} is the r-th channel weight matrix of the i-th convolution kernel, ∗ denotes the convolution operation, and X_r is the r-th input channel image. A direct NumPy rendering of this formula is given below.
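As an illustration (not the patent's code), this sketch evaluates formula (1) for a single kernel; following deep-learning convention, the "convolution" is implemented as cross-correlation, and valid padding with stride 1 is assumed.

```python
import numpy as np

def conv_output_map(X, W, f=lambda v: np.maximum(0.0, v)):  # f: ReLU, eq. (2)
    """X: (k, H, W) input channels; W: (k, kh, kw) weights of one kernel."""
    k, H, Wd = X.shape
    _, kh, kw = W.shape
    Z = np.zeros((H - kh + 1, Wd - kw + 1))
    for r in range(k):                     # sum over the k input channels
        for i in range(Z.shape[0]):
            for j in range(Z.shape[1]):
                # channel r convolved with its weight matrix W[r]
                Z[i, j] += np.sum(X[r, i:i+kh, j:j+kw] * W[r])
    return f(Z)                            # nonlinear excitation, formula (1)
```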
(2) Dilated convolution:
Dilated convolution is a convolution operation performed with a dilated convolution kernel. The dilated kernel is obtained by upsampling the ordinary kernel: the original weights keep their positions while the intermediate positions are filled with 0. By using different dilation factors, dilated convolution enlarges the receptive field without increasing the network parameters or the computation. Specifically, referring to fig. 4, black represents the weights of the 3 × 3 convolution kernel and gray represents the padded 0 values. A dilation factor of 1 is equivalent to ordinary convolution; as the dilation factor grows, the receptive field grows while the number of weights stays unchanged. A small sketch of this kernel dilation follows.
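A minimal NumPy sketch of the kernel dilation just described (the helper name is ours): zeros are inserted between the original weights, so the parameter count is unchanged while the effective field grows.

```python
import numpy as np

def dilate_kernel(W, d):
    """Insert (d-1) zeros between adjacent weights of a 2-D kernel W."""
    kh, kw = W.shape
    out = np.zeros((d * (kh - 1) + 1, d * (kw - 1) + 1), dtype=W.dtype)
    out[::d, ::d] = W                      # original weights keep positions
    return out

W = np.arange(1.0, 10.0).reshape(3, 3)     # a 3x3 kernel
print(dilate_kernel(W, 1).shape)           # (3, 3): ordinary convolution
print(dilate_kernel(W, 3).shape)           # (7, 7) effective field
print(dilate_kernel(W, 6).shape)           # (13, 13) effective field
```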
(3) ReLU nonlinear excitation function:
In formula (1) above, f is the rectified linear unit (ReLU) function, the activation function of the network; the rectified linear unit applies a nonlinear transformation to the feature map Z_i generated by each convolution kernel and is defined as follows:
f(x)=max(0,x) (2)
where f (x) is the rectified linear unit function and x is an input value.
(4) Softmax classification function:
The Softmax function converts the prediction scores output by the network into a probability distribution and is defined as follows:

Y_j = exp(O_j) / Σ_{c=1}^{C} exp(O_c)   (3)

where O_j is the prediction score of a pixel for the j-th class in the final network output, Y_j is the probability that the pixel belongs to the j-th class, C is the number of classes (here 2), and exp(·) is the exponential function with the natural constant e as base.
S3, adversarial training and model parameter learning: alternately training the multi-scale feature fusion semantic segmentation network and the adversarial discrimination network in an adversarial manner on the expanded training data, designing objective functions to optimize the network parameters, and generating the optimal segmentation network model, specifically comprising the following steps:
S31, initializing the network model parameters with the Xavier method;
S32, acquiring 500 breast ultrasound multi-modal gray-level image training samples and corresponding manual segmentation labels, where each breast ultrasound multi-modal gray-level image comprises a B-mode image and an elasticity image that are spatially registered, and the training samples and their segmentation labels are all 480 × 480; the training samples are increased to 10 times their initial number using horizontal flipping, vertical flipping, cropping, and rotation by 45°, 90°, 135°, 180°, 225°, 270° and 315°; the expanded training data are then divided into a training set and a validation set at a ratio of 4:1, and the network model is trained with five-fold cross-validation (a sketch of this expansion is given below);
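Purely as an illustration of the ten-fold expansion (cropping and the synchronized transformation of the paired images are omitted here), a Pillow-based sketch might look as follows.

```python
from PIL import Image

def augment(img: Image.Image):
    """Return the original image plus 9 variants (10x expansion).

    In practice the same transform must be applied jointly to the B-mode
    image, the elasticity image and the segmentation label of a sample.
    """
    variants = [img,
                img.transpose(Image.FLIP_LEFT_RIGHT),   # horizontal flip
                img.transpose(Image.FLIP_TOP_BOTTOM)]   # vertical flip
    for angle in (45, 90, 135, 180, 225, 270, 315):     # listed rotations
        variants.append(img.rotate(angle))
    return variants
```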
S33, inputting the B-mode ultrasonic image and the ultrasonic elasticity image into the multi-scale feature fusion semantic segmentation network as two channels, generating a pixel class label prediction probability distribution map, and adopting the following mixed loss function as the optimization objective function:

L_seg(θ_s) = (1/N) Σ_{n=1}^{N} [ L1(s(x_n), y_n) + λ1 L2(a(s(x_n)), 1) ]   (4)

the mixed loss function is composed of two classification cross-entropy loss functions L1 and L2, representing the segmentation loss and the discrimination loss respectively; θ_s is the segmentation network model parameter, θ_a is the discrimination network model parameter, x_n is the n-th input training sample, y_n is the segmentation label corresponding to the n-th training sample, s(·) is the segmentation function (which uses the parameters θ_s to predict labels for an input image sample x_n), a(·) is the discrimination function (which uses the parameters θ_a to predict the class of the network input, 0 denoting a segmentation map generated by the segmentation network and 1 denoting a segmentation label), N is the batch size, i.e. the number of samples used in each iteration of stochastic gradient descent, set to 32, and λ1 is the adversarial factor, set to 0.5 (a sketch of the two objectives is given below);
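The following PyTorch sketch renders the two objectives as reconstructed here; it is illustrative only, with seg_net, disc_net and all shapes assumed (in particular, the labels y are assumed to be integer class maps at the resolution of the network's score maps).

```python
import torch
import torch.nn.functional as F

def one_hot_maps(y, num_classes=2):
    # (N,H,W) integer labels -> (N,2,H,W) probability maps for the critic
    return F.one_hot(y, num_classes).permute(0, 3, 1, 2).float()

def seg_loss(seg_net, disc_net, x, y, lam1=0.5):
    """L_seg = L1(s(x), y) + lambda1 * L2(a(s(x)), 1), formula (4)."""
    logits = seg_net(x)                           # pixel class score maps
    l1 = F.cross_entropy(logits, y)               # segmentation loss L1
    src = disc_net(F.softmax(logits, dim=1))      # source prediction a(s(x))
    real = torch.ones(len(x), dtype=torch.long, device=x.device)
    return l1 + lam1 * F.cross_entropy(src, real) # push critic toward 1

def adv_loss(seg_net, disc_net, x, y):
    """L_adv = L2(a(y), 1) + L2(a(s(x)), 0) for the discrimination network."""
    real = torch.ones(len(x), dtype=torch.long, device=x.device)
    fake = torch.zeros_like(real)
    l_real = F.cross_entropy(disc_net(one_hot_maps(y)), real)  # labels -> 1
    with torch.no_grad():                         # theta_s is fixed here
        gen = F.softmax(seg_net(x), dim=1)
    l_fake = F.cross_entropy(disc_net(gen), fake)              # seg maps -> 0
    return l_real + l_fake
```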
the cross-entropy loss function L1 in the objective function of this step is:

L1(ŷ, y) = - Σ_{i=1}^{H×W} Σ_{c=1}^{C} y_{i,c} ln(ŷ_{i,c})

where ŷ = s(x) is the label prediction probability vector generated by the segmentation function s(·), y is the probability vector corresponding to the segmentation label, H × W is the dimension of the image, here 480 × 480, C is the number of pixel classes, here C = 2, and ln(·) is the natural logarithm;
the cross-entropy loss function L2 in the objective function of this step is:

L2(ẑ, z) = - Σ_{j=1}^{2} z_j ln(ẑ_j)

where ẑ is the image source label prediction probability vector generated by the adversarial discrimination network, and z is the probability vector of the true source label of the input image;
In order to prevent overfitting, a regularization term is added to the optimization objective function of formula (4), giving the final optimization objective function:

L_seg(θ_s) = (1/N) Σ_{n=1}^{N} [ L1(s(x_n), y_n) + λ1 L2(a(s(x_n)), 1) ] + (λ2/2) Σ_{q=1}^{Q1} θ_{s,q}²

where λ2 is the regularization factor, set to 0.1, and Q1 is the number of parameters in θ_s.
S34, inputting the segmentation labels and the segmentation maps output by the semantic segmentation network into the adversarial discrimination network, which predicts whether its input comes from the segmentation labels or from the segmentation network, adopting the following loss function as the optimization objective function:

L_adv(θ_a) = (1/N) Σ_{n=1}^{N} [ L2(a(y_n), 1) + L2(a(s(x_n)), 0) ]

the cross-entropy loss function L2 in the loss function of this step is the same as that in step S33;
likewise to prevent overfitting, a regularization term is added to the optimization objective function of the adversarial discrimination network, giving the final optimization objective function:

L_adv(θ_a) = (1/N) Σ_{n=1}^{N} [ L2(a(y_n), 1) + L2(a(s(x_n)), 0) ] + (λ3/2) Σ_{q=1}^{Q2} θ_{a,q}²

where λ3 is the regularization factor, set to 0.1, and Q2 is the number of parameters in θ_a.
S35, training the semantic segmentation network and the adversarial discrimination network alternately, switching every 500 iterations, with minimizing L_seg(θ_s) and L_adv(θ_a) as the respective optimization targets; the objective functions are optimized with a stochastic gradient descent algorithm and the network model parameters θ_s and θ_a are updated with the error back-propagation algorithm. The goal of training the discrimination network is to distinguish segmentation maps from segmentation labels well; the goal of training the segmentation network is to predict the pixel labels accurately while making the discriminator unable to tell the segmentation maps it generates from the corresponding ground-truth segmentation labels, i.e. the generated segmentation maps become very similar or nearly identical to the ground-truth labels. Specifically, when training the segmentation network, the discrimination network parameters θ_a are fixed and θ_s is optimized, with momentum coefficient μ = 0.9 and initial learning rate η_t = 1e-3, decreased by a factor of 10 every 1000 iterations until it reaches 1e-6; when training the adversarial discrimination network, the segmentation network parameters θ_s are fixed and θ_a is optimized, with the initial learning rate set to η_t = 1e-4, decreased by a factor of 10 every 1000 iterations until it reaches 1e-7. A sketch of this alternating scheme follows.
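A hedged sketch of the alternation, reusing the seg_loss and adv_loss sketches above; loader, seg_net and disc_net are assumed to exist, and stepping only one optimizer at a time is what keeps the other network's parameters fixed.

```python
import itertools
import torch

opt_s = torch.optim.SGD(seg_net.parameters(),  lr=1e-3, momentum=0.9)
opt_a = torch.optim.SGD(disc_net.parameters(), lr=1e-4, momentum=0.9)

for it, (x, y) in enumerate(itertools.cycle(loader)):    # batches of size 32
    train_seg = (it // 500) % 2 == 0         # switch every 500 iterations
    if train_seg:                            # theta_a fixed: only opt_s steps
        loss, opt = seg_loss(seg_net, disc_net, x, y), opt_s
    else:                                    # theta_s fixed: only opt_a steps
        loss, opt = adv_loss(seg_net, disc_net, x, y), opt_a
    opt.zero_grad()
    loss.backward()                          # error back-propagation
    opt.step()
    if it % 1000 == 999:                     # decay both rates by 10x
        for g in opt_s.param_groups:
            g["lr"] = max(g["lr"] / 10, 1e-6)
        for g in opt_a.param_groups:
            g["lr"] = max(g["lr"] / 10, 1e-7)
```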
S4, automatic segmentation of breast lesions:
S41, inputting the spatially registered pairs of breast ultrasound B-mode images and corresponding ultrasound elasticity images, at different resolutions, into the branches of the multi-resolution feature extractor as two channels, generating feature maps of different sizes; for example, the original breast ultrasound B-mode image and ultrasound elasticity image are input into the high-resolution feature extractor as two channels, generating a feature map 1/8 the size of the original image; the original image is then downsampled by factors of 2 and 4 to form the medium-resolution and low-resolution images, which are input into the medium-resolution and low-resolution feature extractors, generating feature maps 1/16 and 1/32 the size of the original image respectively;
S42, inputting the context features of different scales extracted by the multi-resolution feature extractor into the cascade feature fusion device to generate the comprehensive feature map: the low-resolution feature map is upsampled by a factor of 2 and fused with the medium-resolution feature map by pixel-wise addition, the fused feature map is upsampled by a factor of 2 and fused with the high-resolution feature map by pixel-wise addition, finally generating the comprehensive feature map;
S43, inputting the comprehensive feature map into the classifier to generate 2 pixel class label prediction score maps, representing the prediction scores of each pixel for normal tissue and lesion tissue respectively;
S44, converting the pixel class label prediction score maps into a pixel class label prediction probability distribution map with the Softmax function, the distribution map representing the probability of each pixel for each class;
S45, taking the index (0 or 1) of each pixel's maximum-probability component as its class label, forming the breast ultrasound lesion semantic segmentation map;
S46, upsampling the breast ultrasound lesion semantic segmentation map by a factor of 8 to obtain the final segmentation map with the same size as the original image; specifically, pooling in this embodiment reduces the image resolution to 1/8 of the original image, so 8× upsampling is needed to recover a segmentation map of the original size. A sketch of steps S44 to S46 is given below.
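An illustrative sketch of the inference post-processing (shapes assume a 480 × 480 input; nearest-neighbour upsampling is our choice for keeping the hard labels intact, as the text does not specify the interpolation for this step).

```python
import torch
import torch.nn.functional as F

def predict(seg_net, x):
    scores = seg_net(x)                           # (N, 2, 60, 60) score maps
    probs = F.softmax(scores, dim=1)              # S44: per-class probability
    labels = probs.argmax(dim=1, keepdim=True)    # S45: 0 = normal, 1 = lesion
    full = F.interpolate(labels.float(), scale_factor=8, mode="nearest")
    return full.squeeze(1).long()                 # S46: 480x480 label map
```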
Compared with the prior art, the adversarial-learning-based multi-scale feature fusion ultrasonic image semantic segmentation method provided by the invention has the following advantages:
1. with images of different resolutions as input, the multi-branch feature extractor extracts context features of different scales, and using these features jointly to predict the pixel labels markedly improves prediction accuracy;
2. sharing weights across the branch feature extractors reduces the number of network parameters, making the network easier to train;
3. dilated convolution enlarges the neuron receptive field without shrinking the feature map, raising the resolution of the feature map and thus producing a higher-resolution segmentation map;
4. through adversarial training, the adversarial loss and the pixel label prediction loss jointly optimize the segmentation network; the adversarial gradient flows into the segmentation network, so the learned segmentation model outputs segmentation maps with good appearance and spatial continuity, segments breast ultrasound lesions more accurately, and produces fine-grained segmentation maps with more detail.
Finally, the above embodiments only illustrate the technical solutions of the invention and do not limit them. Although the invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the invention without departing from their spirit and scope, and all such modifications and substitutions are intended to be covered by the claims of the invention.

Claims (10)

1. A multi-scale feature fusion ultrasonic image semantic segmentation method based on adversarial learning, characterized by comprising the following steps:
S1, building the multi-scale feature fusion semantic segmentation network model:
S11, the multi-scale feature fusion semantic segmentation network is used for generating a lesion area segmentation map of an input ultrasonic image; it comprises a multi-resolution feature extractor, a cascade feature fusion device and a classifier, the multi-resolution feature extractor being formed by at least two parallel deep convolutional network branches with identical structures and shared weights and extracting context features of different scales from input images of different resolutions, the cascade feature fusion device fusing the multi-scale features obtained by the different branches step by step to generate a comprehensive feature map, and the classifier predicting pixel class labels from the comprehensive feature map;
S12, each deep convolutional network branch comprises first to fifth convolutional layer groups, a first max-pooling layer, a second max-pooling layer, a third max-pooling layer, a first dilated convolutional layer and a second dilated convolutional layer, the first max-pooling layer being located after the first convolutional layer group, the second max-pooling layer after the second convolutional layer group, the third max-pooling layer after the third convolutional layer group, the first dilated convolutional layer after the fourth convolutional layer group, and the second dilated convolutional layer after the fifth convolutional layer group; the classifier comprises a first feature projection layer, a second feature projection layer, a class prediction layer and a Softmax regression layer arranged in sequence after the second dilated convolutional layer;
S2, establishing the adversarial discrimination network model:
S21, the adversarial discrimination network is used for distinguishing whether its input is a segmentation map generated by the segmentation network or a manual segmentation label;
S22, the adversarial discrimination network comprises first to sixth adversarial convolutional layers, first to third fully-connected layers and a Softmax regression layer arranged in sequence;
S3, adversarial training and model parameter learning: alternately training the multi-scale feature fusion semantic segmentation network and the adversarial discrimination network in an adversarial manner on the expanded training data, designing objective functions to optimize the network parameters, and generating the optimal segmentation network model, specifically comprising the following steps:
S31, initializing the network model parameters with the Xavier method;
S32, obtaining breast ultrasound multi-modal gray-level image training samples and corresponding manual segmentation labels, each breast ultrasound multi-modal gray-level image comprising a B-mode image and an elasticity image that are spatially registered; the training data are expanded, divided into a training set and a validation set at a ratio of 4:1, and the network model is trained with five-fold cross-validation;
S33, inputting the B-mode ultrasonic image and the ultrasonic elasticity image into the multi-scale feature fusion semantic segmentation network as two channels, generating a pixel class label prediction probability distribution map, and adopting the following mixed loss function as the optimization objective function:

L_seg(θ_s) = (1/N) Σ_{n=1}^{N} [ L1(s(x_n), y_n) + λ1 L2(a(s(x_n)), 1) ]

the mixed loss function is composed of two classification cross-entropy loss functions L1 and L2, representing the segmentation loss and the discrimination loss respectively; θ_s is the segmentation network model parameter, θ_a is the discrimination network model parameter, x_n is the n-th input training sample, y_n is the segmentation label corresponding to the n-th training sample, s(·) is the segmentation function, a(·) is the discrimination function, N is the batch size, and λ1 is the adversarial factor;
the cross-entropy loss function L1 in the objective function of this step is:

L1(ŷ, y) = - Σ_{i=1}^{H×W} Σ_{c=1}^{C} y_{i,c} ln(ŷ_{i,c})

where ŷ = s(x) is the label prediction probability vector generated by the segmentation function s(·), y is the probability vector corresponding to the segmentation label, H × W is the dimension of the image, C is the number of pixel classes, and ln(·) is the natural logarithm;
the cross-entropy loss function L2 in the objective function of this step is:

L2(ẑ, z) = - Σ_{j=1}^{2} z_j ln(ẑ_j)

where ẑ is the image source label prediction probability vector generated by the adversarial discrimination network, and z is the probability vector of the true source label of the input image;
S34, inputting the segmentation labels and the segmentation maps output by the semantic segmentation network into the adversarial discrimination network, which predicts whether its input comes from the segmentation labels or from the segmentation network, adopting the following loss function as the optimization objective function:

L_adv(θ_a) = (1/N) Σ_{n=1}^{N} [ L2(a(y_n), 1) + L2(a(s(x_n)), 0) ]

the cross-entropy loss function L2 in the loss function of this step is the same as that in step S33;
S35, training the semantic segmentation network and the adversarial discrimination network in an alternating manner, with minimizing L_seg(θ_s) and L_adv(θ_a) as the respective optimization targets; the objective functions are optimized with a stochastic gradient descent algorithm and the network model parameters θ_s and θ_a are updated with the error back-propagation algorithm, the adversarial discrimination network parameters θ_a being fixed while training the semantic segmentation network and the semantic segmentation network parameters θ_s being fixed while training the adversarial discrimination network;
S4, automatic segmentation of breast lesions:
S41, inputting the spatially registered pairs of breast ultrasound B-mode images and corresponding ultrasound elasticity images, at different resolutions, into the branches of the multi-resolution feature extractor as two channels, generating feature maps of different sizes;
S42, inputting the context features of different scales extracted by the multi-resolution feature extractor into the cascade feature fusion device to generate the comprehensive feature map;
S43, inputting the comprehensive feature map into the classifier to generate pixel class label prediction score maps;
S44, converting the pixel class label prediction score maps into a pixel class label prediction probability distribution map with the Softmax function, the distribution map representing the probability of each pixel for each class;
S45, taking the index of each pixel's maximum-probability component as its class label, forming the breast ultrasound lesion semantic segmentation map;
S46, upsampling the breast ultrasound lesion semantic segmentation map by a factor of 8 to obtain the final segmentation map with the same size as the original image.
2. The adversarial-learning-based multi-scale feature fusion ultrasonic image semantic segmentation method of claim 1, wherein in step S11 the multi-resolution feature extractor is composed of three parallel deep convolutional network branches, namely a high-resolution feature extractor, a medium-resolution feature extractor and a low-resolution feature extractor.
3. The method of claim 2, wherein the high resolution feature extractor branch input image size is the same as the original image size, the medium resolution feature extractor branch input image size is 1/2 of the original image size, and the low resolution feature extractor branch input image size is 1/4 of the original image size.
4. The adversarial-learning-based multi-scale feature fusion ultrasonic image semantic segmentation method of claim 2, wherein the cascade feature fusion device first upsamples the last feature layer of the low-resolution feature extractor branch by a factor of 2 with bilinear interpolation and adds it to the last feature layer of the medium-resolution feature extractor branch, generating fused feature map 1 with the same size as the medium-resolution feature map; it then upsamples fused feature map 1 by a factor of 2 with bilinear interpolation and adds it to the last feature layer of the high-resolution feature extractor branch, generating fused feature map 2 with the same size as the high-resolution feature map, fused feature map 2 being the final comprehensive feature map.
5. The adversarial-learning-based multi-scale feature fusion ultrasonic image semantic segmentation method of claim 1, wherein in step S12 each convolutional layer group consists of two convolutional layers, each with 3 × 3 convolution kernels and stride 1, the numbers of convolution kernels of the first to fifth convolutional layer groups being 64, 128, 256, 512 and 512 in sequence; each dilated convolutional layer has 3 × 3 convolution kernels and stride 1, with dilation factors of 3 and 6 for the first and second dilated convolutional layers respectively; each max-pooling layer has a 2 × 2 pooling kernel and stride 2; each feature projection layer has 1 × 1 convolution kernels and stride 1, with 256 and 128 kernels for the first and second feature projection layers respectively; and the class prediction layer has 1 × 1 convolution kernels, stride 1, and 2 kernels.
6. The adversarial-learning-based multi-scale feature fusion ultrasonic image semantic segmentation method of claim 1, wherein in step S22 each adversarial convolutional layer has 5 × 5 convolution kernels and stride 2, the numbers of convolution kernels of the first to sixth adversarial convolutional layers being 32, 64, 128, 256, 512 and 1024 in sequence; the first to third fully-connected layers have 1024, 512 and 2 neurons respectively, where 2 corresponds to the two categories of whether the input map comes from the segmentation network or from a segmentation label.
7. The adversarial-learning-based multi-scale feature fusion ultrasonic image semantic segmentation method of claim 6, wherein the source label of a segmentation map generated by the segmentation network is 0 and the source label of a manual segmentation label is 1.
8. The adversarial-learning-based multi-scale feature fusion ultrasonic image semantic segmentation method of claim 1, wherein the output feature map Z_i corresponding to any convolution kernel in the method is calculated with the following formula:

Z_i = f( Σ_{r=1}^{k} W_{ir} ∗ X_r )

where f is a nonlinear excitation function, r is the input channel index, k is the number of input channels, W_{ir} is the r-th channel weight matrix of the i-th convolution kernel, ∗ denotes the convolution operation, and X_r is the r-th input channel image.
9. The adversarial-learning-based multi-scale feature fusion ultrasound image semantic segmentation method according to claim 1, wherein in step S33 a regularization term is further added to the optimization objective function, so that the final optimization objective function becomes

$$\min_{\theta_s}\;\ell(\theta_s) + \lambda_2 \sum_{q=1}^{Q_1} \theta_{s,q}^{2}$$

where $\ell(\theta_s)$ is the step-S33 optimization objective before regularization, $\lambda_2$ is the regularization factor, and $Q_1$ is the number of parameters in $\theta_s$.
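In code, such a regularization term is a single extra summand on top of the step-S33 objective. The sketch below uses a one-layer stand-in for the segmentation network, so only the regularization mechanics (not the network itself) reflect the patent:

```python
import torch
import torch.nn as nn

seg_net = nn.Conv2d(1, 2, kernel_size=3, padding=1)   # hypothetical stand-in
x = torch.randn(2, 1, 16, 16)
target = torch.randint(0, 2, (2, 16, 16))

seg_loss = nn.CrossEntropyLoss()(seg_net(x), target)  # stand-in for the S33 objective
lambda_2 = 1e-4                                       # regularization factor
l2_term = sum((p ** 2).sum() for p in seg_net.parameters())
(seg_loss + lambda_2 * l2_term).backward()
```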
10. The adversarial-learning-based multi-scale feature fusion ultrasound image semantic segmentation method according to claim 1, wherein in step S34 a regularization term is likewise added to the optimization objective function, so that the final optimization objective function becomes

$$\min_{\theta_\alpha}\;\ell(\theta_\alpha) + \lambda_3 \sum_{q=1}^{Q_2} \theta_{\alpha,q}^{2}$$

where $\ell(\theta_\alpha)$ is the step-S34 optimization objective before regularization, $\lambda_3$ is the regularization factor, and $Q_2$ is the number of parameters in $\theta_\alpha$.
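The regularized objectives of claims 9 and 10 come together in an alternating training step. The following sketch is assumption-heavy (tiny stand-in networks, arbitrary learning rates); the λ·Σθ² penalties are realized through `weight_decay`, which PyTorch applies as the equivalent gradient term:

```python
import torch
import torch.nn as nn

seg_net = nn.Conv2d(1, 2, 3, padding=1)               # stand-in segmentation net
disc = nn.Sequential(nn.Conv2d(2, 8, 5, stride=2, padding=2),
                     nn.Flatten(), nn.Linear(8 * 8 * 8, 2))  # stand-in discriminator
ce = nn.CrossEntropyLoss()

# weight_decay stands in for the lambda_2 / lambda_3 regularization terms.
opt_s = torch.optim.Adam(seg_net.parameters(), lr=1e-4, weight_decay=1e-4)
opt_a = torch.optim.Adam(disc.parameters(), lr=1e-4, weight_decay=1e-4)

img = torch.randn(4, 1, 16, 16)
gt = torch.randint(0, 2, (4, 16, 16))
gt_onehot = nn.functional.one_hot(gt, 2).permute(0, 3, 1, 2).float()

# (1) Discriminator step: segmentation labels -> 1, generated maps -> 0.
pred = torch.softmax(seg_net(img), dim=1).detach()
loss_a = ce(disc(gt_onehot), torch.ones(4, dtype=torch.long)) \
       + ce(disc(pred), torch.zeros(4, dtype=torch.long))
opt_a.zero_grad(); loss_a.backward(); opt_a.step()

# (2) Segmentation step: pixel-wise loss plus fooling the discriminator.
pred = torch.softmax(seg_net(img), dim=1)
loss_s = ce(seg_net(img), gt) + ce(disc(pred), torch.ones(4, dtype=torch.long))
opt_s.zero_grad(); loss_s.backward(); opt_s.step()  # opt_s updates seg_net only
```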
CN201810085384.XA 2018-01-29 2018-01-29 Multi-scale feature fusion ultrasonic image semantic segmentation method based on counterstudy Active CN108268870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810085384.XA CN108268870B (en) 2018-01-29 2018-01-29 Multi-scale feature fusion ultrasonic image semantic segmentation method based on counterstudy

Publications (2)

Publication Number Publication Date
CN108268870A CN108268870A (en) 2018-07-10
CN108268870B (en) 2020-10-09

Family

ID=62776960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810085384.XA Active CN108268870B (en) 2018-01-29 2018-01-29 Multi-scale feature fusion ultrasonic image semantic segmentation method based on counterstudy

Country Status (1)

Country Link
CN (1) CN108268870B (en)

Families Citing this family (106)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109166095B (en) * 2018-07-11 2021-06-25 广东技术师范学院 Fundus image cup and disc segmentation method based on generation countermeasure mechanism
CN109034224B (en) * 2018-07-16 2022-03-11 西安电子科技大学 Hyperspectral classification method based on double branch network
CN109064455B (en) * 2018-07-18 2021-06-25 清华大学深圳研究生院 BI-RADS-based classification method for breast ultrasound image multi-scale fusion
CN109086770B (en) * 2018-07-25 2021-12-17 成都快眼科技有限公司 Image semantic segmentation method and model based on accurate scale prediction
CN108961295B (en) * 2018-07-27 2022-01-28 重庆师范大学 Purple soil image segmentation and extraction method based on normal distribution H threshold
CN109166130B (en) * 2018-08-06 2021-06-22 北京市商汤科技开发有限公司 Image processing method and image processing device
CN109241872B (en) * 2018-08-20 2022-03-18 电子科技大学 Image semantic fast segmentation method based on multistage network
US10318842B1 (en) * 2018-09-05 2019-06-11 StradVision, Inc. Learning method, learning device for optimizing parameters of CNN by using multiple video frames and testing method, testing device using the same
WO2020048359A1 (en) * 2018-09-06 2020-03-12 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method, system, and computer-readable medium for improving quality of low-light images
CN109359655B (en) * 2018-09-18 2021-07-16 河南大学 Image segmentation method based on context regularization cycle deep learning
CN109508627A (en) * 2018-09-21 2019-03-22 国网信息通信产业集团有限公司 The unmanned plane dynamic image identifying system and method for shared parameter CNN in a kind of layer
CN109389057B (en) * 2018-09-22 2021-08-06 天津大学 Object detection method based on multi-scale advanced semantic fusion network
CN109345456B (en) * 2018-09-30 2021-01-19 京东方科技集团股份有限公司 Generation countermeasure network training method, image processing method, device, and storage medium
MX2020013580A (en) 2018-09-30 2021-02-26 Boe Technology Group Co Ltd Apparatus and method for image processing, and system for training neural network.
CN111091524A (en) * 2018-10-08 2020-05-01 天津工业大学 Prostate transrectal ultrasound image segmentation method based on deep convolutional neural network
CN109472799B (en) * 2018-10-09 2021-02-23 清华大学 Image segmentation method and device based on deep learning
CN109523569B (en) * 2018-10-18 2020-01-31 中国科学院空间应用工程与技术中心 optical remote sensing image segmentation method and device based on multi-granularity network fusion
CN109509192B (en) * 2018-10-18 2023-05-30 天津大学 Semantic segmentation network integrating multi-scale feature space and semantic space
CN109447990B (en) * 2018-10-22 2021-06-22 北京旷视科技有限公司 Image semantic segmentation method and device, electronic equipment and computer readable medium
WO2020093782A1 (en) 2018-11-08 2020-05-14 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method, system, and computer-readable medium for improving quality of low-light images
CN109598269A (en) * 2018-11-14 2019-04-09 天津大学 A kind of semantic segmentation method based on multiresolution input with pyramid expansion convolution
CN109583454A (en) * 2018-11-14 2019-04-05 天津大学 Image characteristic extracting method based on confrontation neural network
CN109785399B (en) * 2018-11-19 2021-01-19 北京航空航天大学 Synthetic lesion image generation method, device, equipment and readable storage medium
CN109544555B (en) * 2018-11-26 2021-09-03 陕西师范大学 Tiny crack segmentation method based on generation type countermeasure network
WO2020108009A1 (en) * 2018-11-26 2020-06-04 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method, system, and computer-readable medium for improving quality of low-light images
CN109712145B (en) * 2018-11-28 2021-01-08 山东师范大学 Image matting method and system
CN109636817B (en) * 2018-11-30 2020-10-30 华中科技大学 Lung nodule segmentation method based on two-dimensional convolutional neural network
CN109598728B (en) 2018-11-30 2019-12-27 腾讯科技(深圳)有限公司 Image segmentation method, image segmentation device, diagnostic system, and storage medium
CN109472798A (en) * 2018-12-10 2019-03-15 重庆理工大学 Live pig fat content detection model training method and live pig fat content detection method
CN109598732B (en) * 2018-12-11 2022-06-14 厦门大学 Medical image segmentation method based on three-dimensional space weighting
CN109544585A (en) * 2018-12-19 2019-03-29 中国石油大学(华东) A kind of cholelithiasis CT medical image data Enhancement Method based on lightweight convolutional neural networks
CN109784194B (en) * 2018-12-20 2021-11-23 北京图森智途科技有限公司 Target detection network construction method, training method and target detection method
CN111369567B (en) * 2018-12-26 2022-12-16 腾讯科技(深圳)有限公司 Method and device for segmenting target object in three-dimensional image and electronic equipment
CN109784380A (en) * 2018-12-27 2019-05-21 西安交通大学 A kind of various dimensions weeds in field recognition methods based on generation confrontation study
CN109711413B (en) * 2018-12-30 2023-04-07 陕西师范大学 Image semantic segmentation method based on deep learning
CN110008949B (en) * 2019-01-24 2020-03-17 华南理工大学 Image target detection method, system, device and storage medium
CN111488964A (en) * 2019-01-29 2020-08-04 北京市商汤科技开发有限公司 Image processing method and device and neural network training method and device
CN110414631B (en) * 2019-01-29 2022-02-01 腾讯科技(深圳)有限公司 Medical image-based focus detection method, model training method and device
CN109949276B (en) * 2019-02-28 2021-06-11 华中科技大学 Lymph node detection method for improving SegNet segmentation network
CN109902809B (en) * 2019-03-01 2022-08-12 成都康乔电子有限责任公司 Auxiliary semantic segmentation model by using generated confrontation network
CN109948707B (en) * 2019-03-20 2023-04-18 腾讯科技(深圳)有限公司 Model training method, device, terminal and storage medium
CN109978863B (en) * 2019-03-27 2021-10-08 北京青燕祥云科技有限公司 Target detection method based on X-ray image and computer equipment
CN110033422B (en) * 2019-04-10 2021-03-23 北京科技大学 Fundus OCT image fusion method and device
CN110110617B (en) * 2019-04-22 2021-04-20 腾讯科技(深圳)有限公司 Medical image segmentation method and device, electronic equipment and storage medium
CN110074813B (en) * 2019-04-26 2022-03-04 深圳大学 Ultrasonic image reconstruction method and system
CN110222578B (en) * 2019-05-08 2022-12-27 腾讯科技(深圳)有限公司 Method and apparatus for challenge testing of speak-with-picture system
CN110232386B (en) * 2019-05-09 2022-04-05 杭州深睿博联科技有限公司 Pulmonary nodule classification method and device based on local feature pyramid
CN110276354B (en) * 2019-05-27 2023-04-07 东南大学 High-resolution streetscape picture semantic segmentation training and real-time segmentation method
CN110288574A (en) * 2019-06-13 2019-09-27 南通市传染病防治院(南通市第三人民医院) A kind of adjuvant Ultrasonographic Diagnosis hepatoncus system and method
CN110348531B (en) * 2019-07-17 2022-12-30 沈阳亚视深蓝智能科技有限公司 Deep convolution neural network construction method with resolution adaptability and application
CN110363168A (en) * 2019-07-19 2019-10-22 山东浪潮人工智能研究院有限公司 A kind of 3 dimensional drawing identifying system based on convolutional neural networks
CN110414526B (en) * 2019-07-31 2022-04-08 达闼科技(北京)有限公司 Training method, training device, server and storage medium for semantic segmentation network
CN110490884B (en) * 2019-08-23 2023-04-28 北京工业大学 Lightweight network semantic segmentation method based on countermeasure
CN110647889B (en) * 2019-08-26 2022-02-08 中国科学院深圳先进技术研究院 Medical image recognition method, medical image recognition apparatus, terminal device, and medium
CN110765844B (en) * 2019-09-03 2023-05-26 华南理工大学 Automatic non-induction type dinner plate image data labeling method based on countermeasure learning
CN110648311B (en) * 2019-09-03 2023-04-18 南开大学 Acne image focus segmentation and counting network model based on multitask learning
CN110599476B (en) * 2019-09-12 2023-05-23 腾讯科技(深圳)有限公司 Disease grading method, device, equipment and medium based on machine learning
CN112492323B (en) * 2019-09-12 2022-07-19 上海哔哩哔哩科技有限公司 Live broadcast mask generation method, readable storage medium and computer equipment
CN110853039B (en) * 2019-10-16 2023-06-02 深圳信息职业技术学院 Sketch image segmentation method, system and device for multi-data fusion and storage medium
CN110827255A (en) * 2019-10-31 2020-02-21 杨本强 Plaque stability prediction method and system based on coronary artery CT image
CN110910351B (en) * 2019-10-31 2022-07-29 上海交通大学 Ultrasound image modality migration and classification method and terminal based on generation countermeasure network
CN111160413B (en) * 2019-12-12 2023-11-17 天津大学 Thyroid nodule classification method based on multi-scale feature fusion
CN111161279B (en) * 2019-12-12 2023-05-26 中国科学院深圳先进技术研究院 Medical image segmentation method, device and server
CN111179275B (en) * 2019-12-31 2023-04-25 电子科技大学 Medical ultrasonic image segmentation method
CN111161273B (en) * 2019-12-31 2023-03-21 电子科技大学 Medical ultrasonic image segmentation method based on deep learning
CN111259898B (en) * 2020-01-08 2023-03-24 西安电子科技大学 Crop segmentation method based on unmanned aerial vehicle aerial image
CN111275712B (en) * 2020-01-15 2022-03-01 浙江工业大学 Residual semantic network training method oriented to large-scale image data
CN111325093A (en) * 2020-01-15 2020-06-23 北京字节跳动网络技术有限公司 Video segmentation method and device and electronic equipment
CN111368899B (en) * 2020-02-28 2023-07-25 中国人民解放军南部战区总医院 Method and system for segmenting echocardiogram based on recursion aggregation deep learning
CN111292317B (en) * 2020-03-11 2022-06-07 四川大学华西医院 Method for enhancing image local feature type multitask segmentation of in-situ cancer region in mammary duct
CN111402257B (en) * 2020-03-11 2023-04-07 华侨大学 Automatic medical image segmentation method based on multi-task collaborative cross-domain migration
CN111640120B (en) * 2020-04-09 2023-08-29 之江实验室 Pancreas CT automatic segmentation method based on significance dense connection expansion convolution network
CN111723841A (en) * 2020-05-09 2020-09-29 北京捷通华声科技股份有限公司 Text detection method and device, electronic equipment and storage medium
CN111584073B (en) * 2020-05-13 2023-05-09 山东大学 Method for constructing diagnosis models of benign and malignant lung nodules in various pathological types
CN111681166B (en) * 2020-06-02 2023-04-18 重庆理工大学 Image super-resolution reconstruction method of stacked attention mechanism coding and decoding unit
CN111784701A (en) * 2020-06-10 2020-10-16 深圳市人民医院 Ultrasonic image segmentation method and system combining boundary feature enhancement and multi-scale information
CN111784652B (en) * 2020-06-24 2024-02-06 西安电子科技大学 MRI (magnetic resonance imaging) segmentation method based on reinforcement learning multi-scale neural network
CN111899169B (en) * 2020-07-02 2024-01-26 佛山市南海区广工大数控装备协同创新研究院 Method for segmenting network of face image based on semantic segmentation
CN111881768A (en) * 2020-07-03 2020-11-03 苏州开心盒子软件有限公司 Document layout analysis method
CN111915622B (en) * 2020-07-09 2024-01-23 沈阳先进医疗设备技术孵化中心有限公司 Training of image segmentation network model and image segmentation method and device
CN113935928B (en) * 2020-07-13 2023-04-11 四川大学 Rock core image super-resolution reconstruction based on Raw format
CN111870279B (en) * 2020-07-31 2022-01-28 西安电子科技大学 Method, system and application for segmenting left ventricular myocardium of ultrasonic image
CN111985549B (en) * 2020-08-12 2023-03-31 中国科学院光电技术研究所 Deep learning method for automatic positioning and identification of components for given rigid body target
CN112215847B (en) * 2020-09-30 2022-06-24 武汉大学 Method for automatically segmenting overlapped chromosomes based on counterstudy multi-scale features
CN112815493A (en) * 2021-01-11 2021-05-18 珠海格力电器股份有限公司 Air conditioner control method and device, storage medium and air conditioner
CN112837270B (en) * 2021-01-11 2023-04-07 成都圭目机器人有限公司 Synthetic method and network model of road surface image with semantic annotation
CN112750111B (en) * 2021-01-14 2024-02-06 浙江工业大学 Disease identification and segmentation method in tooth full-view film
CN112750124B (en) * 2021-01-22 2021-11-09 推想医疗科技股份有限公司 Model generation method, image segmentation method, model generation device, image segmentation device, electronic equipment and storage medium
US11580646B2 (en) * 2021-03-26 2023-02-14 Nanjing University Of Posts And Telecommunications Medical image segmentation method based on U-Net
CN113239951B (en) * 2021-03-26 2024-01-30 无锡祥生医疗科技股份有限公司 Classification method, device and storage medium for ultrasonic breast lesions
CN113077471B (en) * 2021-03-26 2022-10-14 南京邮电大学 Medical image segmentation method based on U-shaped network
CN113362360B (en) * 2021-05-28 2022-08-30 上海大学 Ultrasonic carotid plaque segmentation method based on fluid velocity field
CN113421270B (en) * 2021-07-05 2022-07-19 上海市精神卫生中心(上海市心理咨询培训中心) Method, system, device, processor and storage medium for realizing medical image domain adaptive segmentation based on single-center calibration data
CN113657388B (en) * 2021-07-09 2023-10-31 北京科技大学 Image semantic segmentation method for super-resolution reconstruction of fused image
CN113298080B (en) * 2021-07-26 2021-11-05 城云科技(中国)有限公司 Target detection enhancement model, target detection method, target detection device and electronic device
CN113703048B (en) * 2021-08-30 2022-08-23 中国科学院地质与地球物理研究所 Method and system for detecting high-resolution earthquake fault of antagonistic neural network
CN113989585B (en) * 2021-10-13 2022-08-26 北京科技大学 Medium-thickness plate surface defect detection method based on multi-feature fusion semantic segmentation
CN114693967B (en) * 2022-03-20 2023-10-31 电子科技大学 Multi-classification semantic segmentation method based on classification tensor enhancement
CN114387346A (en) * 2022-03-25 2022-04-22 阿里巴巴达摩院(杭州)科技有限公司 Image recognition and prediction model processing method, three-dimensional modeling method and device
CN114445430B (en) * 2022-04-08 2022-06-21 暨南大学 Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN115393301B (en) * 2022-08-16 2024-03-12 中山大学附属第一医院 Image histology analysis method and device for liver two-dimensional shear wave elastic image
CN115578360B (en) * 2022-10-24 2023-12-26 电子科技大学 Multi-target semantic segmentation method for ultrasonic cardiac image
CN115619641B (en) * 2022-10-24 2023-06-02 中山大学附属第五医院 FFDM-based breast image processing method, FFDM-based breast image processing system, FFDM-based terminal and FFDM-based breast image processing medium
CN115423806B (en) * 2022-11-03 2023-03-24 南京信息工程大学 Breast mass detection method based on multi-scale cross-path feature fusion
CN116030363B (en) * 2023-02-20 2023-06-23 北京数慧时空信息技术有限公司 Remote sensing image class activation mapping chart optimizing method
CN116563553B (en) * 2023-07-10 2023-09-29 武汉纺织大学 Unmanned aerial vehicle image segmentation method and system based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106886801A (en) * 2017-04-14 2017-06-23 北京图森未来科技有限公司 A kind of image, semantic dividing method and device
CN107220980A (en) * 2017-05-25 2017-09-29 重庆理工大学 A kind of MRI image brain tumor automatic division method based on full convolutional network
CN107346436A (en) * 2017-06-29 2017-11-14 北京以萨技术股份有限公司 A kind of vision significance detection method of fused images classification

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7397945B2 (en) * 2003-11-20 2008-07-08 Hewlett-Packard Development Company, L.P. Method and system of image segmentation using regression clustering
US20080030497A1 (en) * 2005-12-08 2008-02-07 Yangqiu Hu Three dimensional modeling of objects
US8411919B2 (en) * 2008-07-07 2013-04-02 Siemens Aktiengesellschaft Fluid dynamics approach to image segmentation

Also Published As

Publication number Publication date
CN108268870A (en) 2018-07-10

Similar Documents

Publication Publication Date Title
CN108268870B (en) Multi-scale feature fusion ultrasonic image semantic segmentation method based on counterstudy
CN109493308B (en) Medical image synthesis and classification method for generating confrontation network based on condition multi-discrimination
CN110930397B (en) Magnetic resonance image segmentation method and device, terminal equipment and storage medium
CN108492297B (en) MRI brain tumor positioning and intratumoral segmentation method based on deep cascade convolution network
CN111784671B (en) Pathological image focus region detection method based on multi-scale deep learning
Jannesari et al. Breast cancer histopathological image classification: a deep learning approach
Lin et al. Automatic retinal vessel segmentation via deeply supervised and smoothly regularized network
CN109754007A (en) Peplos intelligent measurement and method for early warning and system in operation on prostate
CN112102266A (en) Attention mechanism-based cerebral infarction medical image classification model training method
Zheng et al. Stain standardization capsule for application-driven histopathological image normalization
CN112150476A (en) Coronary artery sequence vessel segmentation method based on space-time discriminant feature learning
CN108629772A (en) Image processing method and device, computer equipment and computer storage media
Mhaske et al. Deep learning algorithm for classification and prediction of lung cancer using CT scan images
CN110991254B (en) Ultrasonic image video classification prediction method and system
CN117015796A (en) Method for processing tissue images and system for processing tissue images
CN114648806A (en) Multi-mechanism self-adaptive fundus image segmentation method
WO2023014789A1 (en) System and method for pathology image analysis using a trained neural network and active learning framework
Guo et al. CAFR-CNN: coarse-to-fine adaptive faster R-CNN for cross-domain joint optic disc and cup segmentation
CN111383222A (en) Intervertebral disc MRI image intelligent diagnosis system based on deep learning
Eltoukhy et al. Classification of multiclass histopathological breast images using residual deep learning
CN116664911A (en) Breast tumor image classification method based on interpretable deep learning
CN113870194B (en) Breast tumor ultrasonic image processing device with fusion of deep layer characteristics and shallow layer LBP characteristics
Dovbysh et al. Decision-Making support system for diagnosis of breast oncopathologies by histological images
CN112967295B (en) Image processing method and system based on residual network and attention mechanism
Singh et al. A Shallow Convolution Neural Network for Predicting Lung Cancer Through Analysis of Histopathological Tissue Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200518

Address after: No. 12 Tianchen Road, Shapingba District, Chongqing 400000

Applicant after: Chongqing Normal University

Address before: No. 69 Hongguang Avenue, Lijiatuo, Banan District, Chongqing 400054

Applicant before: Chongqing University of Technology

GR01 Patent grant