CN113469119B - Cervical cell image classification method based on vision transformer and graph convolution network - Google Patents


Info

Publication number
CN113469119B
Authority
CN
China
Prior art keywords: image, formula, cell image, layer, cervical
Prior art date
Legal status
Active
Application number
CN202110820463.2A
Other languages
Chinese (zh)
Other versions
CN113469119A (en)
Inventor
Shi Jun (史骏)
Tang Kunming (唐昆铭)
Zhu Xinyu (祝新宇)
Sun Yu (孙宇)
Li Jun (李俊)
Zheng Yushan (郑钰山)
Jiang Zhiguo (姜志国)
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202110820463.2A
Publication of CN113469119A
Application granted
Publication of CN113469119B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F18/00 Pattern recognition > G06F18/20 Analysing > G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation > G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F18/00 Pattern recognition > G06F18/20 Analysing > G06F18/24 Classification techniques > G06F18/245 Classification techniques relating to the decision surface > G06F18/2451 Classification techniques relating to the decision surface, linear, e.g. hyperplane
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N3/00 Computing arrangements based on biological models > G06N3/02 Neural networks > G06N3/04 Architecture, e.g. interconnection topology > G06N3/045 Combinations of networks
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N3/00 Computing arrangements based on biological models > G06N3/02 Neural networks > G06N3/08 Learning methods

Abstract

The invention discloses a cervical cell image classification method based on a vision transformer and a graph convolution network, which comprises the following steps: 1. acquiring a cervical liquid-based cell image dataset and preprocessing it; 2. for the vision transformer branch, splitting the input image into blocks, encoding features with positional encoding and the vision transformer, and mining the long-distance dependencies within the image; 3. for the graph convolution network branch, performing superpixel segmentation on the input image, pooling the convolutional neural network features corresponding to each superpixel to serve as the nodes of a graph structure, modeling the graph structure inside the cell with the graph convolution network, and perceiving the spatial topological relations inside the cell; 4. fusing the features of the two branches to form the final feature representation of the cervical cell, which is used for classification. The invention completes representation extraction and classification of cervical cells by extracting the long-distance dependencies and the spatial topological structure information within the cervical liquid-based cell structure.

Description

Cervical cell image classification method based on vision transformer and graph convolution network
Technical Field
The invention relates to the technical field of computer vision, in particular to a cervical cell image classification method based on a vision transformer and a graph convolution neural network.
Background
Classification of cervical cells is widely used in clinical work as a primary screening means for gynecological cervical lesions. In actual diagnosis, a pathologist needs to visually examine tens of thousands of cells under a microscope. Because each pathologist must process a large number of patient specimens every day, reading fatigue is common and misdiagnosis sometimes occurs. Therefore, there is a need for an efficient and quantitative cervical cell diagnosis method that reduces the reading burden on pathologists and improves the accuracy of cervical cell identification. At present, algorithms for cervical cell image classification fall mainly into two types: traditional feature-based classification algorithms and supervised classification algorithms based on deep learning.
Traditional classification methods generally model image content with manually designed features and train a classifier to distinguish the features of different classes. Common handcrafted visual features include low-level cues such as color, texture and edges, which have difficulty describing the latent high-level semantic information of an image. Because pathological images have complex structures and variable appearances, handcrafted feature extraction cannot model the image content well, so the classification results are unsatisfactory.
In recent years, deep learning models have achieved remarkable results in various fields of computer vision, and some researchers have applied convolutional neural networks, such as the residual network (ResNet) and the dense convolutional network (DenseNet), to cervical cell classification tasks. However, these methods only use the neural network to extract global features of a single image and ignore local features of the cervical cell, so it is difficult for them to handle the large intra-class differences and high inter-class similarities of cells.
Disclosure of Invention
The invention provides a cervical cell image classification method based on a vision transformer and a graph convolution network, aiming to solve the problem that cervical cell images are difficult to classify because pathological image structures are complex and variable and the feature information is rich.
To achieve this purpose, the invention adopts the following technical scheme:
The invention relates to a cervical cell image classification method based on a vision transformer and a graph convolution network, which is characterized by comprising the following steps:
Step 1, marking the nucleus positions and nucleus types in cervical digital slide images, and cropping according to the marked nucleus positions to obtain each cell image and its corresponding class, denoted as {(X_i, y_i) | i = 1, 2, …, N}, where X_i ∈ R^{C×H×W} represents the i-th cell image, C, H and W denote the number of channels, the height and the width of the image respectively, y_i denotes the class corresponding to the i-th cell image X_i, and N is the number of cell images;
Step 2, constructing a vision transformer ViT composed of L encoders as the first branch, where each encoder comprises two normalization layers, a multi-head attention layer and a multi-layer perceptron;
Step 2.1, splitting the i-th cell image X_i into blocks to obtain a sequence containing m image blocks {x_i^1, x_i^2, …, x_i^m}, where x_i^j ∈ R^{p^2·C} represents the j-th image block of the i-th cell image X_i, p×p denotes the size of each image block, and m = H×W/p^2;
Step 2.2, setting a learnable classification token x_class, and obtaining the D-dimensional embedded representation z_0 of the m image blocks and the classification token x_class by formula (1), which serves as the input of the 1st encoder:
z_0 = [x_class; x_i^1 E; x_i^2 E; …; x_i^m E] + E_pos   (1)
In formula (1), E_pos encodes the spatial positions of the m image blocks and the classification token x_class in the i-th cell image X_i; E denotes the embedding matrix;
Step 2.3, obtaining the output z'_l of the multi-head attention layer of the l-th encoder for the m image blocks and the classification token x_class by formula (2):
z'_l = MSA(LN(z_{l-1})) + z_{l-1},  l = 1, …, L   (2)
In formula (2), MSA(·) denotes the multi-head self-attention layer, LN(·) denotes the normalization layer, and z_{l-1} denotes the output of the (l-1)-th encoder;
Step 2.4, obtaining the output z_l of the multi-layer perceptron of the l-th encoder by formula (3), and thus the output z_L of the last encoder:
z_l = MLP(LN(z'_l)) + z'_l,  l = 1, …, L   (3)
Step 2.5, averaging the m+1 D-dimensional features contained in the output z_L to obtain the ViT feature representation z_t ∈ R^D of the i-th cell image X_i;
Step 3, constructing a graph convolution network as the second branch;
Step 3.1, over-segmenting the i-th cell image X_i with the SLIC algorithm to obtain a set of superpixels of the i-th cell image X_i;
Step 3.2, extracting the CNN feature map of the i-th cell image X_i with a pre-trained residual network ResNet;
Step 3.3, according to the positions of the superpixels on the CNN feature map, max-pooling the CNN features covered by each superpixel, thereby obtaining the CNN feature vector representation corresponding to each superpixel;
Step 3.4, taking the CNN feature vector representation corresponding to each superpixel as a node of the graph structure, thereby forming the node set B = {B_1, B_2, …, B_S} ∈ R^{S×t}, where S denotes the number of superpixels and t is the dimension of the CNN feature corresponding to each superpixel;
Step 3.5, constructing the adjacency matrix A ∈ R^{S×S} by formula (4):
A_{αβ} = 1, if the α-th superpixel B_α and the β-th superpixel B_β are spatial neighbors; A_{αβ} = 0, otherwise   (4)
In formula (4), A_{αβ} denotes the entry of the adjacency matrix A describing the spatial neighbor relationship between the α-th superpixel B_α and the β-th superpixel B_β;
Step 3.6, updating the relation-aware representations of the nodes by formula (5) according to the layer-wise propagation rule of the graph convolution network:
H^{(q+1)} = σ(D^{-1/2} Â D^{-1/2} H^{(q)} W^{(q)})   (5)
In formula (5), W^{(q)} ∈ R^{d_q×d_{q+1}} is the trainable weight of the q-th layer and σ(·) is an activation function; H^{(q)} ∈ R^{S×d_q} denotes the node features of the q-th layer and d_q is their dimension; H^{(q+1)} ∈ R^{S×d_{q+1}} denotes the node features of the (q+1)-th layer, and h_s^{(q+1)} denotes the feature of the s-th node output by the (q+1)-th layer; when q = 0, let H^{(0)} = B; and:
Â = A + I,   D_{ss} = Σ_u Â_{us}   (6)
In formula (6), Â is the adjacency matrix A with added self-connections, I denotes the identity matrix, D denotes a diagonal matrix, D_{ss} denotes the element in the s-th row and s-th column of D, and Â_{us} denotes the element in the u-th row and s-th column of Â;
Step 3.7, obtaining the graph convolution feature representation z_g of the i-th cell image X_i by aggregating the node output features of the last (Q-th) layer of the graph convolution network with formula (7):
z_g = Concat(h_1^{(Q)}, h_2^{(Q)}, …, h_S^{(Q)})   (7)
In formula (7), Concat(·) denotes the feature vector concatenation operation, S denotes the total number of nodes, and h_s^{(Q)} denotes the feature of the s-th node in the Q-th layer;
Step 4, obtaining the final characterization z of the i-th cell image X_i by formula (8):
z = Concat(z_t, z_g)   (8)
In formula (8), z_t is the ViT feature representation obtained in step 2.5 and z_g is the graph convolution feature representation obtained in step 3.7;
Step 5, applying a linear transformation to the final characterization z by formula (9) to obtain the output result p_pred of the linear classifier:
p_pred = Linear(z)   (9)
In formula (9), Linear(·) denotes the linear classification function and p_pred ∈ R^c, where c denotes the number of cell classes;
Step 6, constructing the cross-entropy loss function L by formula (10), and training the network formed by the two branches and the linear classifier with a gradient descent algorithm until the cross-entropy loss function L converges, thereby obtaining the trained cervical cell image classification model:
L = -(1/N) Σ_{i=1}^{N} y_label · log(softmax(p_pred))   (10)
In formula (10), y_label is the ground-truth class corresponding to the cell image sample and N is the total number of cell image samples.
Compared with the prior art, the invention has the following advantages:
1. The method uses the vision transformer to learn features of the cervical cell image; the vision transformer can learn long-distance dependencies within the image, thereby establishing the dependency relationships among the parts of the cervical cell structure and improving classification accuracy;
2. The invention uses the graph convolution network to model the spatial topological relations of the cell structure; the graph convolution neural network can learn and fit the spatial topological structure inside the image well, so the whole cell image is modeled into a representation and the feature representation capability of the cell image is improved;
3. The invention fuses the two groups of features carrying the long-distance dependencies inside the cell and the spatial topological structure information to obtain a more robust cell image representation that is far superior to the global features extracted by a conventional CNN, thereby enhancing the discriminative power of the features;
4. The invention is trained in a supervised manner, embedding semantic information into the image features and thereby improving classification accuracy.
Drawings
FIG. 1 is a block diagram of a network in accordance with the present invention;
FIG. 2 is a general flow diagram of the present invention;
FIG. 3 is a diagram of cervical cell image training samples according to the present invention.
Detailed Description
In this embodiment, a cervical cell image classification method based on a vision transformer and a graph convolution network comprehensively considers the long-distance dependencies and the spatial topological structure inside the cell structure. The input image is fed into a vision transformer branch and a graph convolution network branch respectively to obtain ViT features and GCN features, the two sets of features are fused to obtain the final cervical cell image feature representation, and this representation is sent to a classifier for classification, thereby completing the classification of the cervical cell image. As shown in FIG. 2, the specific steps are as follows:
Step 1, marking the nucleus positions and nucleus types in cervical digital slide images, and cropping according to the marked nucleus positions to obtain each cell image and its corresponding class, denoted as {(X_i, y_i) | i = 1, 2, …, N}, where X_i ∈ R^{C×H×W} represents the i-th cell image, C, H and W denote the number of channels, the height and the width of the image respectively, y_i denotes the class corresponding to the i-th cell image X_i, and N is the number of cell images. The data used in this embodiment contain 7 categories, namely normal superficial cells, normal intermediate-basal cells, granulocytes, glandular cells, atypical squamous cells, koilocytes (hollow cells) and cells with a high nuclear-to-cytoplasmic ratio, as shown in FIG. 3, with 500 images per category; each cell image has size 224×224, so C = 3, H = 224 and W = 224. 80% of each class in the dataset is used for training and the remaining 20% for testing;
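A minimal Python sketch of the data preparation in step 1, assuming the annotations are available as nucleus center coordinates and the images are handled with Pillow; the helper names crop_cell and split_per_class are illustrative and not part of the patent:

```python
import random
from PIL import Image

def crop_cell(slide_img: Image.Image, cx: int, cy: int, size: int = 224) -> Image.Image:
    """Crop a size x size patch centered on an annotated nucleus position (cx, cy)."""
    half = size // 2
    left, upper = max(cx - half, 0), max(cy - half, 0)
    return slide_img.crop((left, upper, left + size, upper + size))

def split_per_class(samples_by_class: dict, train_ratio: float = 0.8, seed: int = 0):
    """samples_by_class: {class_name: [image_path, ...]} -> (train, test) lists of (path, label)."""
    rng = random.Random(seed)
    train, test = [], []
    for label, paths in samples_by_class.items():
        paths = paths[:]              # copy before shuffling
        rng.shuffle(paths)
        k = int(len(paths) * train_ratio)
        train += [(p, label) for p in paths[:k]]
        test += [(p, label) for p in paths[k:]]
    return train, test
```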
Step 2, establishing a deep learning network model based on the vision transformer and the graph convolution network as shown in FIG. 1; the network comprises two branches, the vision transformer ViT and the graph convolution network. First, a vision transformer ViT composed of L encoders is constructed as the first branch, where each encoder comprises two normalization layers, a multi-head attention layer and a multi-layer perceptron;
Step 2.1, splitting the i-th cell image X_i into blocks to obtain a sequence containing m image blocks {x_i^1, x_i^2, …, x_i^m}, where x_i^j ∈ R^{p^2·C} represents the j-th image block of the i-th cell image X_i, p×p denotes the size of each image block, and m = H×W/p^2; in this embodiment, each image block has size 16×16, so p = 16 and m = 196.
Step 2.2, setting a learnable classification token x_class, and obtaining the D-dimensional embedded representation z_0 of the m image blocks and the classification token x_class by formula (1), which serves as the input of the 1st encoder:
z_0 = [x_class; x_i^1 E; x_i^2 E; …; x_i^m E] + E_pos   (1)
In formula (1), E_pos encodes the spatial positions of the m image blocks and the classification token x_class in the i-th cell image X_i, and E denotes the embedding matrix. In this embodiment, D = 768; x_class is a 768-dimensional vector of random numbers, E is a randomly initialized matrix with 768 rows and 768 columns, and E_pos is a randomly initialized matrix with 197 rows and 768 columns.
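A PyTorch sketch of step 2.2 (formula (1)) under the embodiment's settings p = 16, m = 196, D = 768; the module and attribute names are illustrative, and the linear projection stands in for the embedding matrix E:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Formula (1): split X_i into p x p blocks, project each block with E,
    prepend the learnable classification token x_class and add E_pos."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.patch = patch
        self.m = (img_size // patch) ** 2                     # m = H*W/p^2 = 196
        self.proj = nn.Linear(patch * patch * in_ch, dim)     # embedding matrix E
        self.x_class = nn.Parameter(torch.randn(1, 1, dim))   # classification token
        self.e_pos = nn.Parameter(torch.randn(1, self.m + 1, dim))  # E_pos (197 x 768)

    def forward(self, x):                                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        p = self.patch
        # (B, C, H, W) -> (B, m, p*p*C): flatten each non-overlapping block
        x = x.unfold(2, p, p).unfold(3, p, p)                  # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, self.m, p * p * c)
        z0 = self.proj(x)                                      # x_i^j E
        cls = self.x_class.expand(b, -1, -1)
        return torch.cat([cls, z0], dim=1) + self.e_pos        # z_0, shape (B, m+1, D)
```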
Step 2.3, obtaining m image blocks and classification marks x by using the formula (2) class Output z 'of multi-head attention device layer at l-th encoder' l
z′ l =MSA(LN(z l-1 ))+z l-1 ,l=1,…,L (2)
In the formula (2), LN () represents processing of the normalization layer; z is a radical of l-1 The output of the L-1 st encoder is shown, L =12 in the present embodiment.
Step 2.4, obtaining the output z of the multi-layer perceptron of the first encoder by using the formula (3) l To obtain the output z of the last encoder L
z l =MLP(LN(z′ l ))+z′ l ,l=1,…,L (3)
In this embodiment, the multi-layer perceptron comprises two layers of networks and one GELU non-linear active layer.
Step 2.5, outputting z containing m + 1D dimensional characteristics L Obtaining the ith cell image X by averaging i ViT feature representation of
Figure BDA0003171814880000063
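A sketch of one encoder (formulas (2) and (3)) and of the L = 12 encoder stack with the token averaging of step 2.5; the number of attention heads (12) and the MLP hidden width (3072) are assumptions borrowed from the common ViT-Base configuration and are not stated in the patent:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """One ViT encoder: formula (2) followed by formula (3), both with residual connections."""
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = self.msa(h, h, h, need_weights=False)[0] + z   # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
        z = self.mlp(self.ln2(z)) + z                      # z_l  = MLP(LN(z'_l)) + z'_l
        return z

class ViTBranch(nn.Module):
    """L = 12 encoders; the m+1 output tokens of z_L are averaged into the ViT feature z_t."""
    def __init__(self, embed, depth=12, dim=768):
        super().__init__()
        self.embed = embed                                  # e.g. the PatchEmbedding sketch above
        self.blocks = nn.ModuleList(Encoder(dim) for _ in range(depth))

    def forward(self, x):
        z = self.embed(x)
        for blk in self.blocks:
            z = blk(z)
        return z.mean(dim=1)                                # z_t, shape (B, D)
```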
Step 3, constructing a graph convolution network as the second branch;
Step 3.1, over-segmenting the i-th cell image X_i with the SLIC algorithm to obtain a set of superpixels of the i-th cell image X_i;
Step 3.2, extracting the CNN feature map of the i-th cell image X_i with a pre-trained residual network ResNet; in this embodiment, ResNet18 is used to extract the feature map of the cell image;
Step 3.3, according to the positions of the superpixels on the CNN feature map, max-pooling the CNN features covered by each superpixel, thereby obtaining the CNN feature vector representation corresponding to each superpixel;
Step 3.4, taking the CNN feature vector representation corresponding to each superpixel as a node of the graph structure, thereby forming the node set B = {B_1, B_2, …, B_S} ∈ R^{S×t}, where S denotes the number of superpixels and t is the dimension of the CNN feature corresponding to each superpixel; in this embodiment, the number of superpixels S is 16 and the dimension t of the CNN feature corresponding to each superpixel is 512;
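A sketch of steps 3.1 to 3.4, assuming scikit-image for SLIC and torchvision for ResNet18: the superpixel map is projected onto the 7x7 feature grid by nearest-neighbour subsampling, which is a simplification of ours, the helper name superpixel_nodes is illustrative, and SLIC may return slightly more or fewer than the requested 16 segments:

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision import models
from skimage.segmentation import slic

backbone = models.resnet18(pretrained=True)
# keep everything up to the last residual stage: 224x224 input -> 512 x 7 x 7 feature map
feature_extractor = nn.Sequential(*list(backbone.children())[:-2]).eval()

def superpixel_nodes(image: np.ndarray, n_segments: int = 16) -> torch.Tensor:
    """Return the node set B (S x t): max-pooled ResNet18 features per SLIC superpixel."""
    labels = slic(image, n_segments=n_segments, start_label=0)          # (H, W) superpixel ids
    x = torch.from_numpy(image).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        fmap = feature_extractor(x)[0]                                   # (512, h, w)
    # project the pixel-level superpixel map onto the feature-map grid
    h, w = fmap.shape[1:]
    lab_small = labels[::labels.shape[0] // h, ::labels.shape[1] // w][:h, :w]
    nodes = []
    for s in range(int(lab_small.max()) + 1):
        mask = torch.from_numpy(lab_small == s)
        region = fmap[:, mask] if mask.any() else fmap.reshape(512, -1)
        nodes.append(region.max(dim=1).values)                           # max pooling over the region
    return torch.stack(nodes)                                             # (S, 512)
```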
Step 3.5, constructing the adjacency matrix A ∈ R^{S×S} by formula (4):
A_{αβ} = 1, if the α-th superpixel B_α and the β-th superpixel B_β are spatial neighbors; A_{αβ} = 0, otherwise   (4)
In formula (4), A_{αβ} denotes the entry of the adjacency matrix A describing the spatial neighbor relationship between the α-th superpixel B_α and the β-th superpixel B_β; in this embodiment, the spatial 4-neighborhood is used, i.e. the four spatial relations up, down, left and right;
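A NumPy sketch of step 3.5 (formula (4)), building the binary adjacency matrix from the superpixel label map with the 4-neighborhood relation; the function name is illustrative:

```python
import numpy as np

def adjacency_from_superpixels(labels: np.ndarray, n_nodes: int) -> np.ndarray:
    """Formula (4): A[a, b] = 1 if superpixels a and b touch in the 4-neighborhood sense."""
    A = np.zeros((n_nodes, n_nodes), dtype=np.float32)
    # compare each pixel with its right and bottom neighbor (covers every 4-neighbor pair once)
    for shift_r, shift_c in ((0, 1), (1, 0)):
        a = labels[: labels.shape[0] - shift_r, : labels.shape[1] - shift_c]
        b = labels[shift_r:, shift_c:]
        diff = a != b
        A[a[diff], b[diff]] = 1.0
        A[b[diff], a[diff]] = 1.0
    return A
```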
Step 3.6, updating the relation-aware representations of the nodes by formula (5) according to the layer-wise propagation rule of the graph convolution network:
H^{(q+1)} = σ(D^{-1/2} Â D^{-1/2} H^{(q)} W^{(q)})   (5)
In formula (5), W^{(q)} ∈ R^{d_q×d_{q+1}} is the trainable weight of the q-th layer and σ(·) is an activation function; H^{(q)} ∈ R^{S×d_q} denotes the node features of the q-th layer and d_q is their dimension; H^{(q+1)} ∈ R^{S×d_{q+1}} denotes the node features of the (q+1)-th layer, and h_s^{(q+1)} denotes the feature of the s-th node output by the (q+1)-th layer; when q = 0, let H^{(0)} = B; and:
Â = A + I,   D_{ss} = Σ_u Â_{us}   (6)
In formula (6), Â is the adjacency matrix A with added self-connections, I denotes the identity matrix, D denotes a diagonal matrix, D_{ss} denotes the element in the s-th row and s-th column of D, and Â_{us} denotes the element in the u-th row and s-th column of Â; in this embodiment, the sigmoid function is used as the activation function σ(·);
Step 3.7, obtaining the graph convolution feature representation z_g of the i-th cell image X_i by aggregating the node output features of the last (Q-th) layer of the graph convolution network with formula (7):
z_g = Concat(h_1^{(Q)}, h_2^{(Q)}, …, h_S^{(Q)})   (7)
In formula (7), Concat(·) denotes the feature vector concatenation operation, S denotes the total number of nodes, and h_s^{(Q)} denotes the feature of the s-th node in the Q-th layer; in this embodiment, the number of layers Q of the graph convolution network is 2;
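A PyTorch sketch of steps 3.6 and 3.7 (formulas (5) to (7)) with Q = 2 layers and the sigmoid activation of the embodiment; the hidden and output widths (256 and 128) are our assumptions, since the patent does not state the GCN layer dimensions:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One propagation step of formula (5) with the normalization of formula (6)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)    # W^(q)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0), device=A.device)       # A^ = A + I
        d = A_hat.sum(dim=0)                                     # D_ss = sum_u A^_us
        D_inv_sqrt = torch.diag(d.pow(-0.5))
        H_next = D_inv_sqrt @ A_hat @ D_inv_sqrt @ self.weight(H)
        return torch.sigmoid(H_next)                             # sigmoid activation (embodiment)

class GCNBranch(nn.Module):
    """Q = 2 layers; the S node outputs of the last layer are concatenated into z_g (formula (7))."""
    def __init__(self, in_dim=512, hidden=256, out_dim=128):
        super().__init__()
        self.gc1 = GCNLayer(in_dim, hidden)
        self.gc2 = GCNLayer(hidden, out_dim)

    def forward(self, B, A):                                     # B: (S, t), A: (S, S)
        H = self.gc2(self.gc1(B, A), A)                          # (S, out_dim)
        return H.reshape(-1)                                      # z_g = Concat(h_1, ..., h_S)
```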
Step 4, obtaining the final characterization z of the i-th cell image X_i by formula (8):
z = Concat(z_t, z_g)   (8)
In formula (8), z_t is the ViT feature representation obtained in step 2.5 and z_g is the graph convolution feature representation obtained in step 3.7;
Step 5, applying a linear transformation to the final characterization z by formula (9) to obtain the output result p_pred of the linear classifier:
p_pred = Linear(z)   (9)
In formula (9), Linear(·) denotes the linear classification function and p_pred ∈ R^c, where c denotes the number of cell classes;
Step 6, constructing the cross-entropy loss function L by formula (10), and training the network formed by the two branches and the linear classifier with a gradient descent algorithm until the cross-entropy loss function L converges, thereby obtaining the trained cervical cell image classification model:
L = -(1/N) Σ_{i=1}^{N} y_label · log(softmax(p_pred))   (10)
In formula (10), y_label is the ground-truth class corresponding to the cell image sample and N is the total number of cell image samples.
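A sketch of steps 4 to 6 under the assumptions of the previous blocks (concatenation as the fusion of formula (8), a GCN output width of 128 per node, 7 classes); the class and function names are illustrative and the loop processes a single image at a time for simplicity:

```python
import torch
import torch.nn as nn

class CervicalCellClassifier(nn.Module):
    """Two-branch model: the ViT feature z_t and the GCN feature z_g are fused and linearly classified."""
    def __init__(self, vit_branch, gcn_branch, vit_dim=768, gcn_dim=16 * 128, n_classes=7):
        super().__init__()
        self.vit = vit_branch
        self.gcn = gcn_branch
        self.linear = nn.Linear(vit_dim + gcn_dim, n_classes)   # formula (9)

    def forward(self, image, nodes, adjacency):
        z_t = self.vit(image)                                    # (1, 768)
        z_g = self.gcn(nodes, adjacency).unsqueeze(0)            # (1, S * out_dim)
        z = torch.cat([z_t, z_g], dim=1)                         # formula (8): fused characterization
        return self.linear(z)                                    # p_pred

def train_step(model, optimizer, image, nodes, adjacency, label):
    """One gradient-descent update with the cross-entropy loss of formula (10)."""
    criterion = nn.CrossEntropyLoss()                            # softmax + negative log-likelihood
    optimizer.zero_grad()
    logits = model(image, nodes, adjacency)
    loss = criterion(logits, label)                              # label: LongTensor of shape (1,)
    loss.backward()
    optimizer.step()
    return loss.item()
```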

Claims (1)

1. A cervical cell image classification method based on a vision transformer and a graph convolution network, characterized by comprising the following steps:
Step 1, marking the nucleus positions and nucleus types in cervical digital slide images, and cropping according to the marked nucleus positions to obtain each cell image and its corresponding class, denoted as {(X_i, y_i) | i = 1, 2, …, N}, where X_i ∈ R^{C×H×W} represents the i-th cell image, C, H and W denote the number of channels, the height and the width of the image respectively, y_i denotes the class corresponding to the i-th cell image X_i, and N is the number of cell images;
Step 2, constructing a vision transformer ViT composed of L encoders as the first branch, where each encoder comprises two normalization layers, a multi-head attention layer and a multi-layer perceptron;
Step 2.1, splitting the i-th cell image X_i into blocks to obtain a sequence containing m image blocks {x_i^1, x_i^2, …, x_i^m}, where x_i^j ∈ R^{p^2·C} represents the j-th image block of the i-th cell image X_i, p×p denotes the size of each image block, and m = H×W/p^2;
Step 2.2, setting a learnable classification token x_class, and obtaining the D-dimensional embedded representation z_0 of the m image blocks and the classification token x_class by formula (1), which serves as the input of the 1st encoder:
z_0 = [x_class; x_i^1 E; x_i^2 E; …; x_i^m E] + E_pos   (1)
In formula (1), E_pos encodes the spatial positions of the m image blocks and the classification token x_class in the i-th cell image X_i; E denotes the embedding matrix;
Step 2.3, obtaining the output z'_l of the multi-head attention layer of the l-th encoder for the m image blocks and the classification token x_class by formula (2):
z'_l = MSA(LN(z_{l-1})) + z_{l-1},  l = 1, …, L   (2)
In formula (2), MSA(·) denotes the multi-head self-attention layer, LN(·) denotes the normalization layer, and z_{l-1} denotes the output of the (l-1)-th encoder;
Step 2.4, obtaining the output z_l of the multi-layer perceptron of the l-th encoder by formula (3), and thus the output z_L of the last encoder:
z_l = MLP(LN(z'_l)) + z'_l,  l = 1, …, L   (3)
Step 2.5, averaging the m+1 D-dimensional features contained in the output z_L to obtain the ViT feature representation z_t ∈ R^D of the i-th cell image X_i;
Step 3, constructing a graph convolution network as the second branch;
Step 3.1, over-segmenting the i-th cell image X_i with the SLIC algorithm to obtain a set of superpixels of the i-th cell image X_i;
Step 3.2, extracting the CNN feature map of the i-th cell image X_i with a pre-trained residual network ResNet;
Step 3.3, according to the positions of the superpixels on the CNN feature map, max-pooling the CNN features covered by each superpixel, thereby obtaining the CNN feature vector representation corresponding to each superpixel;
Step 3.4, taking the CNN feature vector representation corresponding to each superpixel as a node of the graph structure, thereby forming the node set B = {B_1, B_2, …, B_S} ∈ R^{S×t}, where S denotes the number of superpixels and t is the dimension of the CNN feature corresponding to each superpixel;
Step 3.5, constructing the adjacency matrix A ∈ R^{S×S} by formula (4):
A_{αβ} = 1, if the α-th superpixel B_α and the β-th superpixel B_β are spatial neighbors; A_{αβ} = 0, otherwise   (4)
In formula (4), A_{αβ} denotes the entry of the adjacency matrix A describing the spatial neighbor relationship between the α-th superpixel B_α and the β-th superpixel B_β;
Step 3.6, updating the relation-aware representations of the nodes by formula (5) according to the layer-wise propagation rule of the graph convolution network:
H^{(q+1)} = σ(D^{-1/2} Â D^{-1/2} H^{(q)} W^{(q)})   (5)
In formula (5), W^{(q)} ∈ R^{d_q×d_{q+1}} is the trainable weight of the q-th layer and σ(·) is an activation function; H^{(q)} ∈ R^{S×d_q} denotes the node features of the q-th layer and d_q is their dimension; H^{(q+1)} ∈ R^{S×d_{q+1}} denotes the node features of the (q+1)-th layer, and h_s^{(q+1)} denotes the feature of the s-th node output by the (q+1)-th layer; when q = 0, let H^{(0)} = B; and:
Â = A + I,   D_{ss} = Σ_u Â_{us}   (6)
In formula (6), Â is the adjacency matrix A with added self-connections, I denotes the identity matrix, D denotes a diagonal matrix, D_{ss} denotes the element in the s-th row and s-th column of D, and Â_{us} denotes the element in the u-th row and s-th column of Â;
Step 3.7, obtaining the graph convolution feature representation z_g of the i-th cell image X_i by aggregating the node output features of the last (Q-th) layer of the graph convolution network with formula (7):
z_g = Concat(h_1^{(Q)}, h_2^{(Q)}, …, h_S^{(Q)})   (7)
In formula (7), Concat(·) denotes the feature vector concatenation operation, S denotes the total number of nodes, and h_s^{(Q)} denotes the feature of the s-th node in the Q-th layer;
Step 4, obtaining the final characterization z of the i-th cell image X_i by formula (8):
z = Concat(z_t, z_g)   (8)
In formula (8), z_t is the ViT feature representation obtained in step 2.5 and z_g is the graph convolution feature representation obtained in step 3.7;
Step 5, applying a linear transformation to the final characterization z by formula (9) to obtain the output result p_pred of the linear classifier:
p_pred = Linear(z)   (9)
In formula (9), Linear(·) denotes the linear classification function and p_pred ∈ R^c, where c denotes the number of cell classes;
Step 6, constructing the cross-entropy loss function L by formula (10), and training the network formed by the two branches and the linear classifier with a gradient descent algorithm until the cross-entropy loss function L converges, thereby obtaining the trained cervical cell image classification model:
L = -(1/N) Σ_{i=1}^{N} y_label · log(softmax(p_pred))   (10)
In formula (10), y_label is the ground-truth class corresponding to the cell image sample and N is the total number of cell image samples.
CN202110820463.2A 2021-07-20 2021-07-20 Cervical cell image classification method based on vision transformer and graph convolution network Active CN113469119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110820463.2A CN113469119B (en) Cervical cell image classification method based on vision transformer and graph convolution network


Publications (2)

Publication Number Publication Date
CN113469119A CN113469119A (en) 2021-10-01
CN113469119B true CN113469119B (en) 2022-10-04

Family

ID=77881307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110820463.2A Active CN113469119B (en) Cervical cell image classification method based on vision transformer and graph convolution network

Country Status (1)

Country Link
CN (1) CN113469119B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119977B (en) * 2021-12-01 2022-12-30 昆明理工大学 Graph convolution-based Transformer gastric cancer canceration region image segmentation method
CN115661688B (en) * 2022-10-09 2024-04-26 武汉大学 Unmanned aerial vehicle target re-identification method, system and equipment with rotation invariance
CN117036811A (en) * 2023-08-14 2023-11-10 桂林电子科技大学 Intelligent pathological image classification system and method based on double-branch fusion network


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017215284A1 (en) * 2016-06-14 2017-12-21 Shandong University Gastrointestinal tumor microscopic hyper-spectral image processing method based on convolutional neural network
CN110363188A (en) * 2019-04-18 2019-10-22 Motic (Xiamen) Medical Diagnostic Systems Co., Ltd. Cervical cell image classification method based on convolutional neural networks
JPWO2021132633A1 (en) * 2019-12-26 2021-07-01
CN111274903A (en) * 2020-01-15 2020-06-12 Hefei University of Technology Cervical cell image classification method based on graph convolution neural network
CN111461258A (en) * 2020-04-26 2020-07-28 Wuhan University Remote sensing image scene classification method of coupling convolution neural network and graph convolution network
CN112750115A (en) * 2021-01-15 2021-05-04 Hangzhou Dianzi University Multi-modal cervical carcinoma pre-lesion image recognition method based on graph neural network
CN113052257A (en) * 2021-04-13 2021-06-29 Information Science Academy of China Electronics Technology Group Corporation Deep reinforcement learning method and device based on visual converter

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cervical cell classification with graph convolutional network; Jun Shi et al.; Computer Methods and Programs in Biomedicine; 2021-01-31; full text *
Graph classification model based on deep graph convolution capsule network (基于深度图卷积胶囊网络的图分类模型); Liu Haichao et al.; Computer Science (计算机科学); 2020-09-30; full text *

Also Published As

Publication number Publication date
CN113469119A (en) 2021-10-01

Similar Documents

Publication Publication Date Title
CN113469119B (en) Cervical cell image classification method based on vision transformer and graph convolution network
CN110555446B (en) Remote sensing image scene classification method based on multi-scale depth feature fusion and migration learning
CN108985238B (en) Impervious surface extraction method and system combining deep learning and semantic probability
CN111368896B (en) Hyperspectral remote sensing image classification method based on dense residual three-dimensional convolutional neural network
CN107506761B (en) Brain image segmentation method and system based on significance learning convolutional neural network
Panchal et al. Biogeography based satellite image classification
CN104992223B (en) Intensive Population size estimation method based on deep learning
CN110135459B (en) Zero sample classification method based on double-triple depth measurement learning network
CN108090472B (en) Pedestrian re-identification method and system based on multi-channel consistency characteristics
CN112347970B (en) Remote sensing image ground object identification method based on graph convolution neural network
CN111914907A (en) Hyperspectral image classification method based on deep learning space-spectrum combined network
CN108052966A (en) Remote sensing images scene based on convolutional neural networks automatically extracts and sorting technique
CN113378792B (en) Weak supervision cervical cell image analysis method fusing global and local information
CN111090764A (en) Image classification method and device based on multitask learning and graph convolution neural network
CN111222519B (en) Construction method, method and device of hierarchical colored drawing manuscript line extraction model
CN111161244B (en) Industrial product surface defect detection method based on FCN + FC-WXGboost
CN113988147B (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
CN113780292A (en) Semantic segmentation network model uncertainty quantification method based on evidence reasoning
CN110619059A (en) Building marking method based on transfer learning
CN111652273A (en) Deep learning-based RGB-D image classification method
CN113298129A (en) Polarized SAR image classification method based on superpixel and graph convolution network
CN114299324A (en) Pathological image classification method and system based on multi-scale domain confrontation network
CN114863348A (en) Video target segmentation method based on self-supervision
CN114692732A (en) Method, system, device and storage medium for updating online label
Bhimavarapu et al. Analysis and characterization of plant diseases using transfer learning

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant