CN114998647B - Breast cancer full-size pathological image classification method based on attention multi-instance learning - Google Patents
- Publication number: CN114998647B (Application CN202210526657.6A)
- Authority
- CN
- China
- Prior art keywords
- network
- stage
- full
- instances
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks
  - G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
  - G06N3/08—Learning methods
Abstract
The breast cancer full-size pathological image classification method based on attention multi-instance learning comprises the following steps: step 1: acquiring a data set and labels; step 2: preprocessing the data set; step 3: constructing a two-stage full-size pathological image (WSI) classification network; step 4: saving the optimal weights of the two-stage network; step 5: calculating the accuracy of the network on the test set. The SAMIL of the present invention introduces a lightweight and efficient SA module that fuses spatial attention and channel attention, which are used to capture pixel-level pairwise relationships and channel dependencies, respectively. SAMIL stacks MHA with LSTM to adaptively highlight the most distinctive instance features and better model the correlations between selected instances, improving classification accuracy.
Description
Technical Field
The invention relates to the technical field of image classification methods, in particular to a breast cancer full-size pathological image classification method based on attention multi-instance learning.
Background
According to recent global cancer statistics, about 2.3 million new breast cancer cases were diagnosed in women in 2020, and breast cancer surpassed lung cancer to become the most common cancer worldwide. At the same time, the digitization of full-size images (WSI) of hematoxylin-eosin (H&E) stained biopsy specimens provides an exact reference for breast cancer diagnosis.
In recent years, with the breakthrough success of deep learning in various computer vision tasks, computer-aided WSI classification methods for cancer diagnosis have received increasing attention. In particular, some researchers cast WSI classification as a weakly supervised task and introduce multi-instance learning (MIL) to cope with the massive scale of WSIs and the difficulty of pixel-level labeling in fully supervised learning. A MIL solution mainly involves two key steps: first, an instance-level selection module computes the positive probability of each slice-level image from the extracted deep features and takes the top-K slices with the highest probability as candidate instances; second, an aggregation operator generates bag embeddings used to compute the score of each bag.
Although multi-instance learning has made great progress in full-slide pathology image classification, existing methods have the following drawbacks: feature correlations of sub-features are rarely described in the spatial or channel dimensions, which hinders the discovery of cancer cells in micro-metastases of breast cancer lymph nodes; and there are limitations in capturing the dependencies between different instances that help classify a WSI.
Disclosure of Invention
The invention aims to provide a full-size breast cancer pathological image classification method based on attention multi-instance learning, which can acquire a more discriminative patch-level representation and improve the accuracy of classifying pathological images of breast cancer lymph node metastases.
A breast cancer full-size pathological image classification method based on attention multi-instance learning comprises the following steps:
Step 1: acquiring a data set and a label: acquiring a data set and a label of a breast cancer histopathological image, and randomly dividing the breast cancer histopathological image into a training set, a verification set and a test set according to a proportion;
Step 2: preprocessing a data set: preprocessing the divided data set based on inverse binarization thresholding operation, generating a mask of background/tissue area for each WSI picture, cutting the tissue area into slices with a size of a×a, and storing the coordinate set of the slices. In order to further reduce the calculated amount, a probability p is added, when the part of the tissue region in the slice is larger than the probability p, the coordinates of the slice are saved, and the processed WSI image X 'i can be expressed as X' i={xi,1,xi,2…,xi,m }, wherein m is the number of the slices in each full-size breast cancer pathological image;
Step 3: a two-stage full-scale pathology image (WSI) classification network is constructed: the method comprises the steps of selecting an instance in a first stage, extracting features of slices by using an SA-ResNet network, selecting the first K instances with the highest probability in each WSI (wireless sensor array) by using a multi-instance learning method, predicting the full-size level in a second stage, and reliably predicting the whole WSI image by using an aggregator constructed by superposing a multi-head attention (MHA) network and a long short-term memory (LSTM) network;
Step 31: at one stage, the SA-ResNet network performs feature extraction on the slice: taking a slice X ' ∈R C×H×W as the input of a pre-trained SA-ResNet network, obtaining a feature matrix X ε R c×h×w after the residual structure of ResNet, dividing X into G groups along the channel dimension by replacement attention, namely X= [ X 1,…,XG],Xk∈Rc/G×h×w,Xk ] is continuously divided into two branches, namely X k1,Xk2∈Rc/2G×h×w, one branch utilizes the inter-channel correlation, outputting a channel attention map, the other branch utilizes the inter-feature spatial relationship, generating a space attention map, connecting the results of the two branches, enabling the number of channels X ' k to be the same as the number of channels of X k, and then carrying out polymerization operation on all feature matrices X ' k, wherein the final output of the SA module is X out∈Rc×h×w.Xout, and generating the feature vector X gap of the slice through global average pooling.
Step 32: acquiring a small training SA-ResNet network: after the feature vector of each slice is obtained, the probability of each slice is obtained through a Softmax function, the probabilities of the slices in each full-size image are ordered from small to large, and the T small blocks with the top probability rank in each full-size image are taken to train the SA-ResNet network.
Step 33: input V to obtain full-size level prediction: and predicting the slices in each WSI by using a one-stage pre-trained optimal weight file, sequencing the predicted probabilities, and taking the first K instances with the highest probability in each full-size image as the input V= [ V 1,…,vK]∈RK×C ] of full-size level prediction.
Step 34: the first K instances with highest aggregate probability: with MHA and LSTM, for the i-th head attention unit (H i) in MHA, the calculation formula is as follows:
Wherein v= [ V 1,…,vK]∈RK×C, V represents the number of instances of the first K selected instance features, K represents the number of instances, V 1,…,vK represents a single instance feature, V j,vk e V, C is the instance feature embedding dimension, the convolution kernels are W e R D×1 and Z e R D×C, D is the feature embedding dimension. The hyperbolic tangent tanh is the activation function. After element multiplication o, for MHA, another convolution is performed to project back to the original dimension for all outputs of the connector unit:
where Ṽ represents the top-K instances after feature enhancement, V = [v_1, …, v_K] ∈ R^{K×C} denotes the selected top-K instance features, K is the number of instances, v_1, …, v_K are single instance features, W_pro ∈ R^{(H×D)×C} is a convolution kernel, T denotes the transpose of a matrix, H_1, …, H_h are the attention head units, h is the number of heads, and C and D are feature embedding dimensions.
Step 35: further modeling the dependencies between the selected Top-K instances: LSTM is further used to construct interactions and fuse interaction instances to obtain differentiated image level representations. LSTM can capture short-term and long-term dependencies, given an input feature sequence (v 1,…,vK), and the hidden layer of LSTM is recursively calculated from t=1 to t=k using the following formula:
where f_t, i_t, o_t denote the forget gate, input gate, and output gate, respectively; W_{f,i,o,c} and U_{f,i,o,c} are the weight matrices to be learned; b_{f,i,o,c} are bias vectors; h_{t-1} is the hidden vector; c_t is the memory cell; sigmoid and hyperbolic tangent tanh are activation functions. The output of the last LSTM step is used as the final bag-level representation vector for prediction.
Step 4: saving the optimal weight of the two-stage network: inputting the data set into a two-stage classification network, training the one-stage network by adopting a training set, updating network parameters in each iteration, verifying the verification set once every three iterations, storing the optimal weight of the one-stage network according to the accuracy of the optimal verification set, processing the data set by using the optimal weight of the one-stage, selecting K instances with the highest probability rank in each WSI as the input of the two stages, initializing the two-stage network by using the optimal weight of the one-stage, verifying once after finishing one iteration in each training, and storing the optimal weight of the two-stage network according to the accuracy of the optimal verification set;
Step 5: calculating the accuracy of the network on the test set: and initializing a network by using two-stage optimal weights, inputting a test set into the network to obtain a prediction result of each WSI, comparing the prediction result with real tag data, counting the number of WSIs which are correctly predicted and incorrectly predicted, and calculating the accuracy of the network on the test set.
Compared with the prior art, the invention has the following beneficial effects:
(1) SAMIL introduces a lightweight and efficient SA module that fuses spatial attention and channel attention, which are used to capture pixel-level pairwise relationships and channel dependencies, respectively.
(2) SAMIL stacks MHA with LSTM to adaptively highlight the most distinctive instance features and better model the correlations between selected instances, improving classification accuracy.
Drawings
Fig. 1 is an overall frame diagram of the SAMIL model.
Detailed Description
The experimental data used in the present invention come from the lymph node metastasis data set of the 2016 Camelyon Grand Challenge. The data set contains 399 complete full-size images, including both normal and metastatic cases, for the detection of metastases in H&E-stained tissue sections of sentinel lymph nodes of breast cancer patients.
As shown in the schematic diagram of the invention, the two-stage breast cancer full-size pathological image classification method based on attention multi-instance learning comprises the following steps:
Step 1: acquiring the data set and labels: the lymph node metastasis data set is randomly divided into training, verification, and test sets at a ratio of 2:1:1, giving 204 training, 95 verification, and 100 test images.
Step 2: preprocessing a data set: the method is used for preprocessing the divided data set based on inverse binarization thresholding operation, generating a mask of a background/tissue area for each WSI picture, dividing the tissue area into sections with the size of 512 multiplied by 512, and storing coordinate sets of the sections. In order to further reduce the calculated amount, a probability value of 0.4 is added, coordinates of the slice are saved when the part of the tissue area in the slice is larger than 0.4, and the processed WSI image X 'i can be expressed as X' i={xi,1,xi,2…,xi,m, wherein m is the number of the slices in each full-size breast cancer pathological image;
Step 3: a two-stage full-scale pathology image (WSI) classification network is constructed: the method comprises the steps of selecting a first stage for example, extracting features of slices by using an SA-ResNet network, selecting 10 examples with the highest probability in each WSI (wireless sensor array) by using a multi-example learning method, predicting the whole WSI image by using a full-size level prediction model, and reliably predicting the whole WSI image by using an aggregator constructed by superposing a multi-head attention (MHA) network and a long short-term memory (LSTM) network;
Step 31: at one stage, the SA-ResNet network performs feature extraction on the slice: slice x i,j∈R3×512×512 is scaled to 224 x 3 pixels as input to the pre-training SA-ResNet network. The SA module is inserted into each residual stage (e.g., conv2_x) in ResNet-50. The input to SA is the feature matrix X ε R 256×56×56. The SA module first divides X into 64 groups along the channel dimension, i.e., x= [ X 1,…,Xk,…,X64],Xk∈R4×56×56,Xk ] is further divided into two branches, X k1,Xk2∈R2×56×56 respectively, one branch uses the inter-channel relationship, outputs a channel attention pattern X 'k1∈R2×56×56, the other branch uses the inter-feature spatial relationship, generates a spatial attention pattern X' k2∈R2×56×56, connects the two branches to obtain X 'k∈R4×56×56, and then performs an aggregation operation on all feature matrices X' k, and the final output of the SA module is X out∈R256×56×56. The SA modules in conv3_x, conv4_x, conv5_x residual blocks are the same, and the feature vector generated by global average pooling of X out is X gap∈R2048×1×1.
Step 32: acquiring a small training SA-ResNet network: after the feature vector of each slice is obtained, the probability of each slice is obtained through a Softmax function, the probabilities of the slices in each full-size image are ordered from small to large, and 2 small blocks with the highest probability rank in each full-size image are taken to train the SA-ResNet network.
Step 33: input V to obtain full-size level prediction: and predicting the slices in each WSI by using a one-stage pre-trained optimal weight file, sequencing the predicted probabilities, and taking the first 10 instances with the highest probability in each full-size image as input V= [ V 1,…,v10]∈R2048×1 ] of two-stage full-size level prediction.
Step 34: the first K instances with highest aggregate probability: with MHA and LSTM, for the i-th head attention unit in the multi-head attention, the calculation formula is as follows:
Where v= [ V 1,…,v10]∈R10×2048, V denotes the first 10 example features selected, V 1,…,v10 denotes the single example feature, V j,vk e V, and the convolution kernels are W e R 512×1 and Z e R 512×2048. The hyperbolic tangent tanh is the activation function. In element multiplication Thereafter, the key instances are highlighted according to the relationship between them. For MHA, all outputs of the connector unit of the invention, another convolution is performed to project back to the original dimension:
where Ṽ represents the top-10 instances after feature enhancement, V = [v_1, …, v_10] ∈ R^{10×2048} denotes the selected top-10 instance features, v_1, …, v_10 are single instance features, W_pro ∈ R^{(3×512)×2048} is a convolution kernel, T denotes the transpose of a matrix, H_1, …, H_h are the attention head units, and h is the number of heads; in this study h = 3. The multi-head attention recalibrates all instance features from different representation subspaces, enriching the originally selected instances V.
Step 35: further modeling the dependencies between the first 10 selected instances: LSTM is further used to construct interactions and fuse interaction instances to obtain differentiated image level representations. LSTM can capture short-term and long-term dependencies, given an input feature sequence (v 1,…,v10), the hidden layer of LSTM is recursively calculated from t=1 to t=10 using the following formula: :
where f_t, i_t, o_t denote the forget gate, input gate, and output gate, respectively; W_{f,i,o,c} and U_{f,i,o,c} are the weight matrices to be learned; b_{f,i,o,c} are bias vectors; h_t is the hidden vector; c_t is the memory cell; sigmoid and hyperbolic tangent tanh are activation functions. In the feature fusion module, the invention stacks two LSTM layers so that the enhanced instances can interact more fully. The output of the last LSTM step is used as the final bag-level representation vector for prediction.
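The LSTM aggregation above can be illustrated with a single NumPy step implementing the gate structure this section names (forget, input, and output gates plus the memory cell). The dimensions and random weights below are illustrative only, not the trained parameters of the invention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(v_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters for the forget (f),
    input (i), output (o) gates and the cell candidate (c)."""
    f = sigmoid(W["f"] @ v_t + U["f"] @ h_prev + b["f"])   # forget gate
    i = sigmoid(W["i"] @ v_t + U["i"] @ h_prev + b["i"])   # input gate
    o = sigmoid(W["o"] @ v_t + U["o"] @ h_prev + b["o"])   # output gate
    c_hat = np.tanh(W["c"] @ v_t + U["c"] @ h_prev + b["c"])
    c_t = f * c_prev + i * c_hat                           # memory cell update
    h_t = o * np.tanh(c_t)                                 # hidden vector
    return h_t, c_t

rng = np.random.default_rng(0)
D_in, D_h, K = 8, 4, 10                                    # toy dimensions
W = {k: rng.standard_normal((D_h, D_in)) for k in "fioc"}
U = {k: rng.standard_normal((D_h, D_h)) for k in "fioc"}
b = {k: np.zeros(D_h) for k in "fioc"}
h, c = np.zeros(D_h), np.zeros(D_h)
for t in range(K):                # fold the K enhanced instances in sequence
    h, c = lstm_step(rng.standard_normal(D_in), h, c, W, U, b)
# The final h plays the role of the bag-level representation vector.
assert h.shape == (D_h,)
```

Feeding the enhanced instances through the recurrence one by one is what lets the aggregator model order-dependent interactions among the selected top instances.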
Step 4: saving the optimal weight of the two-stage network: inputting the data set into a two-stage classification network, training the one-stage network by adopting a training set, updating network parameters in each iteration, verifying the verification set once every three iterations, storing the optimal weight of the one-stage network according to the accuracy of the optimal verification set, and during the training process, using an Adam optimizer to relieve the gradient vibration problem, wherein the learning rate is set to be 1e-4, and the weight attenuation is set to be 1e-5. Processing the data set by using one-stage optimal weights, selecting the 10 instances with the highest probability ranking in each WSI as two-stage inputs, initializing a two-stage network by using the one-stage optimal weights, setting the learning rate to be 1e-4 and the weight attenuation to be 1e-4 by using an Adam optimizer in the two-stage training process, performing 1 verification after each training is completed for 1 iteration, and storing the optimal weights of the two-stage network according to the accuracy of the optimal verification set;
Step 5: calculating the accuracy of the network on the test set: and initializing a network by using two-stage optimal weights, inputting a test set into the network to obtain a prediction result of each WSI, comparing the prediction result with 100 real label data of the test set, and counting the number of WSIs which are correctly predicted and incorrectly predicted so as to calculate SAMIL accuracy rate on the test set.
Following the above steps, the invention provides a novel SAMIL model for the breast cancer WSI classification task. SAMIL uses a shuffle attention (SA) module to select discriminative instances and uses multi-head attention (MHA) stacked with LSTM to implement bag-level prediction, thus exploring the benefits of attention mechanisms for solving the MIL problem. In addition, the experimental results show that, compared with the most advanced MIL methods, the method performs excellently on the Camelyon data set, with an accuracy of up to 96.56%.
Claims (1)
1. The breast cancer full-size pathological image classification method based on attention multi-instance learning is characterized by comprising the following steps: step 1: acquiring a data set and labels: acquire the data set and labels of breast cancer histopathological images, and randomly divide them into a training set, a verification set, and a test set according to a proportion; step 2: preprocessing the data set: preprocess the divided data set with an inverse binarization thresholding operation, generate a background/tissue mask for each WSI image, cut the tissue region into slices of size a×a, and store the slice coordinate set; to further reduce computation, a threshold probability p is introduced, and the coordinates of a slice are saved when the proportion of tissue region in the slice exceeds p; the processed WSI image X'_i can be expressed as X'_i = {x_{i,1}, x_{i,2}, …, x_{i,m}}, where m is the number of slices in each full-size breast cancer pathological image; step 3: constructing a two-stage full-size pathological image (WSI) classification network: the first stage performs instance selection, using an SA-ResNet network to extract slice features and a multi-instance learning method to select the top-K instances with the highest probability in each WSI; the second stage performs full-size-level prediction, using an aggregator built by stacking a multi-head attention (MHA) network and a long short-term memory (LSTM) network to reliably predict the whole WSI image; step 4: saving the optimal weights of the two-stage network: input the data set into the two-stage classification network; train the first-stage network on the training set, updating the network parameters at each iteration and validating on the verification set once every three iterations; save the optimal first-stage weights according to the best verification accuracy; process the data set with the optimal first-stage weights, select the K instances with the highest probability rank in each WSI as the second-stage input, initialize the second-stage network with the optimal first-stage weights, validate once after each training iteration, and save the optimal second-stage weights according to the best verification accuracy; step 5: calculating the accuracy of the classification network on the test set: initialize the network with the optimal two-stage weights, input the test set into the classification network to obtain the prediction result of each WSI, compare the predictions with the real label data, count the numbers of correctly and incorrectly predicted WSIs, and calculate the accuracy of the classification network on the test set; in step 3, step 31: in the first stage, the SA-ResNet network extracts slice features: a slice x' ∈ R^{C×H×W} is taken as the input of the pre-trained SA-ResNet network; after the residual structure of ResNet, a feature matrix X ∈ R^{c×h×w} is obtained; shuffle attention divides X into G groups along the channel dimension, i.e., X = [X_1, …, X_G], X_k ∈ R^{c/G×h×w}; each X_k is further split into two branches X_{k1}, X_{k2} ∈ R^{c/2G×h×w}: one branch exploits the inter-channel correlation and outputs a channel attention map, and the other exploits the spatial relationship between features and generates a spatial attention map; the results of the two branches are concatenated so that X'_k has the same number of channels as X_k; all feature matrices X'_k are then aggregated, and the final output of the SA module is X_out ∈ R^{c×h×w}; X_out is passed through global average pooling to generate the slice feature vector X_gap; step 32: selecting top patches to train the SA-ResNet network: after the feature vector of each slice is obtained, the probability of each slice is computed with a Softmax function; the slice probabilities within each full-size image are sorted, and the T patches with the highest probability in each full-size image are taken to train the SA-ResNet network; step 33: obtaining the input V for full-size-level prediction: predict the slices in each WSI with the optimal weight file pre-trained in the first stage, sort the predicted probabilities, and take the top-K instances with the highest probability in each full-size image as the input V = [v_1, …, v_K] ∈ R^{K×C} of full-size-level prediction; step 34: aggregating the top-K instances with the highest probability: using MHA and LSTM, for the i-th attention head unit (H_i) in MHA, the calculation formula is as follows:
Where v= [ V 1,…,vK]∈RK×C, V denotes the number of instances of the first K selected instance features, K denotes the number of instances, V 1,…,vK denotes the single instance feature, V j,vk e V, C is the instance feature embedding dimension, the convolution kernels are W e R D×1 and Z e R D×C, D is the feature embedding dimension, hyperbolic tangent tanh is the activation function, and after element multiplication, another convolution is performed for all outputs of the connector unit to project back to the original dimension:
where Ṽ represents the top-K instances after feature enhancement, V = [v_1, …, v_K] ∈ R^{K×C} denotes the selected top-K instance features, K is the number of instances, v_1, …, v_K are single instance features, W_pro ∈ R^{(H×D)×C} is a convolution kernel, T denotes the transpose of a matrix, H_1, …, H_h are the attention head units, h is the number of heads, and C and D are feature embedding dimensions; step 35: further modeling the dependencies between the selected top-K instances: LSTM is further used to construct interactions and fuse the interacting instances to obtain a discriminative image-level representation; LSTM can capture short-term and long-term dependencies; given the input feature sequence (v_1, …, v_K), the hidden layer of the LSTM is computed recursively from t = 1 to t = K using the following formulas:
where f_t, i_t, o_t denote the forget gate, input gate, and output gate, respectively; W_{f,i,o,c} and U_{f,i,o,c} are the weight matrices to be learned; b_{f,i,o,c} are bias vectors; h_{t-1} is the hidden vector; c_t is the memory cell; sigmoid and hyperbolic tangent tanh are activation functions; the output of the last LSTM step is used as the final bag-level representation vector for prediction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210526657.6A CN114998647B (en) | 2022-05-16 | 2022-05-16 | Breast cancer full-size pathological image classification method based on attention multi-instance learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114998647A CN114998647A (en) | 2022-09-02 |
CN114998647B true CN114998647B (en) | 2024-05-07 |
Family
ID=83027208
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210526657.6A Active CN114998647B (en) | 2022-05-16 | 2022-05-16 | Breast cancer full-size pathological image classification method based on attention multi-instance learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114998647B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117237781B (en) * | 2023-11-16 | 2024-03-19 | 哈尔滨工业大学(威海) | Attention mechanism-based double-element fusion space-time prediction method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110415212A (en) * | 2019-06-18 | 2019-11-05 | 平安科技(深圳)有限公司 | Abnormal cell detection method, device and computer readable storage medium |
CN114238577A (en) * | 2021-12-17 | 2022-03-25 | 中国计量大学上虞高等研究院有限公司 | Multi-task learning emotion classification method integrated with multi-head attention mechanism |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110083705B (en) * | 2019-05-06 | 2021-11-02 | 电子科技大学 | Multi-hop attention depth model, method, storage medium and terminal for target emotion classification |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108804530B (en) | Subtitling areas of an image | |
CN107229757B (en) | Video retrieval method based on deep learning and Hash coding | |
CN109033978B (en) | Error correction strategy-based CNN-SVM hybrid model gesture recognition method | |
CN108764019A (en) | A kind of Video Events detection method based on multi-source deep learning | |
CN104077742B (en) | Human face sketch synthetic method and system based on Gabor characteristic | |
CN111325237B (en) | Image recognition method based on attention interaction mechanism | |
CN111276240A (en) | Multi-label multi-mode holographic pulse condition identification method based on graph convolution network | |
CN112597324A (en) | Image hash index construction method, system and equipment based on correlation filtering | |
CN112163114B (en) | Image retrieval method based on feature fusion | |
Salazar | On Statistical Pattern Recognition in Independent Component Analysis Mixture Modelling | |
CN113868448A (en) | Fine-grained scene level sketch-based image retrieval method and system | |
CN114998647B (en) | Breast cancer full-size pathological image classification method based on attention multi-instance learning | |
CN116091946A (en) | Yolov 5-based unmanned aerial vehicle aerial image target detection method | |
CN116363750A (en) | Human body posture prediction method, device, equipment and readable storage medium | |
Ma et al. | Dirichlet process mixture of generalized inverted dirichlet distributions for positive vector data with extended variational inference | |
CN113240033B (en) | Visual relation detection method and device based on scene graph high-order semantic structure | |
CN113516019B (en) | Hyperspectral image unmixing method and device and electronic equipment | |
Afzal et al. | Discriminative feature abstraction by deep L2 hypersphere embedding for 3D mesh CNNs | |
Wei et al. | A multiobjective group sparse hyperspectral unmixing method with high correlation library | |
CN113822134A (en) | Instance tracking method, device, equipment and storage medium based on video | |
Deffo et al. | CNNSFR: A convolutional neural network system for face detection and recognition | |
Termritthikun et al. | Evolutionary neural architecture search based on efficient CNN models population for image classification | |
CN116188428A (en) | Bridging multi-source domain self-adaptive cross-domain histopathological image recognition method | |
CN113887509B (en) | Rapid multi-modal video face recognition method based on image set | |
CN114821631A (en) | Pedestrian feature extraction method based on attention mechanism and multi-scale feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||