CN110852396A - Sample data processing method for cervical image - Google Patents
- Publication number: CN110852396A
- Application number: CN201911125170.1A
- Authority
- CN
- China
- Prior art keywords
- data
- sample
- image
- data set
- cervical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/24143—Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06T5/20—Image enhancement or restoration using local operators
- G06T5/70—Denoising; Smoothing
- G06T7/0012—Biomedical image inspection
- G06T7/11—Region-based segmentation
- G06T7/136—Segmentation; Edge detection involving thresholding
- G06T7/194—Segmentation involving foreground-background segmentation
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
- G06T2207/20032—Median filtering
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06T2207/30096—Tumor; Lesion
- G06V2201/03—Recognition of patterns in medical or anatomical images
Abstract
The invention discloses a sample data processing method for a cervical image, comprising the following steps. Establishing classification. Data preprocessing. Segmentation. Data enhancement: classifying the target image data, identifying the differences between the classes of target image data, and applying enhancement processing targeted at those differences. Equalization processing: for the difference in total volume between the classes of target image data, supplementing minority-class samples by data fitting so that the class totals are balanced. Data set construction: for each class of equalized target image data, randomly dividing the data in proportion into a training data set, a verification data set, and a test data set. Model construction: based on the training data set and/or the verification data set and/or the test data set, mapping the data set onto the contrast data set to obtain the corresponding classification of the sample data. The method alleviates the class-imbalance problem in cervical image data classification, improves the precision and efficiency of image classification, and improves the effect and quality of auxiliary diagnosis.
Description
Technical Field
The invention belongs to computer-aided application methods in the medical field, and particularly relates to a sample data processing method for cervical images.
Background
Cervical lesions have a well-defined etiology, so clinical prevention is feasible and the high mortality of cervical cancer can be reduced to a large extent. However, China is still a developing country with a high population density, and HPV vaccines are difficult to popularize comprehensively; screening for early cervical lesions therefore remains the principal measure for preventing and treating cervix-related diseases. At present, the main methods used in hospitals for screening cervical lesions are the Pap smear (Pap test), liquid-based cytology (TCT), HPV-DNA detection, electronic colposcopy, and histopathological examination. Each mainstream precancerous-lesion screening method still has its own shortcomings, so in some cases several diagnostic methods must be combined to confirm a diagnosis, and the accuracy of precancerous screening still needs improvement. In addition, the final determination of the disease requires a physician to observe and analyze the lesion area or lesion image carefully and draw a conclusion, which places high demands on the physician's expertise. Where diagnosis must be made from medical images, long sessions of reading and observing images easily fatigue the physician and in turn reduce diagnostic accuracy. Given the current situation in traditional medicine, the high incidence and lethality of cervical cancer, and the rapid development of artificial intelligence and machine learning, a new auxiliary diagnosis system for cervical lesion screening is particularly necessary.
At present, computer-aided diagnosis (CAD) systems that work with medical endoscopes and dermoscopes, such as those for gastric cancer, skin cancer, digestive-tract cancer, and intestinal cancer, have developed vigorously. Building such a system requires a large number of color images acquired by medical equipment. After preprocessing operations such as image filtering, image enhancement, and image segmentation, valuable image features are selected through feature extraction and feature screening, the selected features are fed into a machine-learning classification model for training, and the model parameters are tuned until a well-performing CAD system is obtained. However, auxiliary diagnosis systems for cervical lesions remain relatively rare, traditional cancer classification is mostly binary, and the performance of traditional machine-learning algorithms on pathological classification has reached a bottleneck.
Disclosure of Invention
Cervical cancer, the only gynecological malignancy with a definite etiology, has high clinical morbidity and mortality. A clear direction for clinical diagnosis and treatment can therefore greatly improve the cure rate of patients and reduce mortality. The invention aims to design a deep-learning-based auxiliary diagnosis system for cervical lesion screening that relieves physicians' diagnostic workload and substantially improves the accuracy of disease diagnosis. The invention first uses an electronic colposcope to collect color images of the cervical region from different patients and obtains usable patient image data through data cleaning; it then applies a series of preprocessing operations to the acquired images, such as image filtering, image segmentation (ROI extraction), and image enhancement; to address the class imbalance and small quantity of the lesion data to be classified, it performs equalization by means of the SMOTE algorithm, data augmentation, and similar techniques; finally, the processed image data are fed into a deep learning model for learning and training, yielding a six-way classification into normal, inflammation, cervical intraepithelial neoplasia I (CIN I), cervical intraepithelial neoplasia II (CIN II), cervical intraepithelial neoplasia III (CIN III), and cancer. The method provides good assistance in the definitive diagnosis of cervical lesions.
In order to achieve the above object, the present invention discloses a method for processing sample data of a cervical image, comprising the following steps:
Establishing classification: establishing a contrast data set and, on its basis, obtaining a classification standard for cervical image features. The position of this operation in the text does not represent its actual place in the process; it may be performed after any step, or even concurrently with other steps, without affecting the scheme of the invention.
Data preprocessing: acquiring sample data of a cervical image (the sample data may be the raw image data or data already screened in preprocessing; when the raw data contains content that interferes with subsequent processing, it is screened first, and this screening at least includes directly deleting images that are over-bright, over-dark, or blurred, or whose field of view contains medical instruments or other foreign objects);
Segmentation: segmenting the preprocessed image data to obtain target image data;
Data enhancement: classifying the target image data, identifying the differences between the classes of target image data, and applying enhancement processing targeted at those differences;
Equalization processing: for the difference in total volume between the classes of target image data, supplementing minority-class samples by data fitting to balance the class totals;
Data set construction: for each class of equalized target image data, randomly dividing the data in proportion into a training data set, a verification data set, and a test data set;
Model construction: based on the training data set and/or the verification data set and/or the test data set, mapping the data set onto the contrast data set to obtain the corresponding classification of the sample data.
In an improvement of the disclosed method, the segmentation operation divides the image data with the Otsu algorithm:
assume the image data comprises foreground pixel data and background pixel data, and compute a threshold that distinguishes the two;
divide the image data into foreground pixel data and background pixel data by this threshold.
In a further improvement, the segmentation operation also applies morphological operations to the separated foreground pixel data and/or background pixel data to obtain the target image data.
In a further improvement, the morphological operations include at least one of addition, filling, deletion, and segmentation.
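As an illustration of the "filling" operation named above, a morphological closing over a binary foreground mask can be sketched in plain NumPy. This is an illustrative sketch only; the patent does not specify an implementation, and the 3 × 3 cross structuring element is an assumption:

```python
import numpy as np

def dilate(mask, it=1):
    """Binary dilation with a 3x3 cross structuring element."""
    m = mask.astype(bool)
    for _ in range(it):
        grown = m.copy()
        grown[1:, :] |= m[:-1, :]   # neighbour above
        grown[:-1, :] |= m[1:, :]   # neighbour below
        grown[:, 1:] |= m[:, :-1]   # neighbour to the left
        grown[:, :-1] |= m[:, 1:]   # neighbour to the right
        m = grown
    return m

def erode(mask, it=1):
    """Binary erosion expressed as the complement of dilating the complement."""
    return ~dilate(~mask.astype(bool), it)

def close_holes(mask, it=1):
    """Morphological closing (dilate then erode): fills small gaps in the
    foreground mask, one way to realize the 'filling' operation."""
    return erode(dilate(mask, it), it)
```

For example, a foreground mask with a one-pixel hole is made solid by `close_holes`, while the mask boundary is left unchanged.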
In a further improvement, the equalization operation processes the minority-class sample data in the target data with the SMOTE algorithm:
analyzing the minority-class sample data;
synthesizing new samples from the minority-class samples and adding them to the original minority-class samples to form a new minority-class sample set, until the class totals of the target image data are balanced.
In a further improvement, the SMOTE algorithm used to process the minority-class sample data in the target data is as follows:
Step 1: for each sample x in the minority class, compute the Euclidean distance from x to every sample in the minority-class sample set S_min, and take the k samples at the smallest distances as the k nearest neighbors of x;
Step 2: set a sampling ratio according to the class-imbalance ratio to determine a sampling multiplier N, and for each minority-class sample x randomly select several samples from its k nearest neighbors; denote a selected neighbor by x_n;
Step 3: for each randomly selected neighbor x_n, construct a new sample from it and the original sample according to the formula:
x_new = x + rand(0,1) · (x_n − x), where rand(0,1) is a random value in the range 0 to 1.
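The three steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the patent's implementation; the function name and parameters are assumptions, and the sampling multiplier N is expressed here simply as the number of new samples requested:

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Synthesize n_new samples from a minority-class array of shape
    (n_samples, n_features) by interpolating between each sample x and one
    of its k nearest neighbours x_n: x_new = x + rand(0,1) * (x_n - x)."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    # Step 1: pairwise Euclidean distances within the minority set
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]  # k nearest neighbours per sample
    new = []
    for _ in range(n_new):
        i = rng.integers(n)                    # Step 2: pick a minority sample x
        j = neighbours[i, rng.integers(min(k, n - 1))]  # ...and one neighbour x_n
        gap = rng.random()                     # rand(0,1)
        # Step 3: interpolate between x and x_n
        new.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.vstack(new)
```

Because each new sample is a convex combination of two existing minority samples, it always lies on the segment between them, which is what keeps the synthetic data plausible.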
In a further improvement, the data preprocessing operation at least includes deleting over-bright, over-dark, or blurred image data, and/or image data containing foreign objects, from the sample data.
In a further improvement, the deletion operation is either complete deletion of an image, or partial deletion of a target area after the image is segmented.
In a further improvement, in the data set construction operation, each class of target image data is divided into training, verification, and test data sets in the proportions 80%, 15%, and 5%.
In a further improvement, the training, verification, and test data sets of each class of target image data are constructed by random division in proportion.
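The 80%/15%/5% random division described above can be sketched with the standard library alone. The function name, ratios default, and seed handling are illustrative assumptions:

```python
import random

def split_dataset(samples, ratios=(0.80, 0.15, 0.05), seed=0):
    """Randomly split one class's samples into training/verification/test
    subsets in the given proportions (80%/15%/5% as in the patent)."""
    items = list(samples)
    random.Random(seed).shuffle(items)          # random division, reproducible
    n = len(items)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    return (items[:n_train],                     # training data set
            items[n_train:n_train + n_val],      # verification data set
            items[n_train + n_val:])             # test data set (remainder)
```

Giving the test set the remainder (rather than rounding it separately) guarantees the three subsets partition the class exactly, with no sample dropped or duplicated.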
In general, the method first cleans the cervical images acquired by the electronic colposcope, deleting images that are blurred, too dark, or contain foreign matter in the key field of view; then, under the guidance of specialist physicians and relevant experts, classifies (labels) the image data in combination with the electronic medical records; completes preprocessing operations on the image data such as noise reduction, segmentation, and enhancement to obtain region-of-interest (ROI) images; uses the SMOTE algorithm to artificially synthesize minority-class samples where the class sizes differ greatly; applies data augmentation to further expand the number of samples; and finally feeds the image data into a deep learning model for training, tuning the model's classification performance through parameter adjustment and transfer learning to achieve the effect of auxiliary diagnosis.
The existing auxiliary diagnosis systems mainly have the following problems. First, the amount of training data, and in particular of effectively labeled samples, is small. The training and test samples selected here are therefore carefully screened and confirmed by specialist physicians and relevant experts, and every acquired image is labeled with the corresponding disease label, which ensures the accuracy of model training and diagnosis. To address the small sample size, the invention synthesizes minority-class samples with the SMOTE algorithm and then expands the sample size with data augmentation techniques such as center cropping, vertical flipping, horizontal flipping, and brightness adjustment, until the final sample size reaches a satisfactory level. Second, the auxiliary diagnosis systems currently applied to cervical lesion detection in electronic colposcopic images either use a traditional machine-learning classification model, which requires tedious, time-consuming feature extraction and feature screening and whose performance has reached a bottleneck that cannot be improved well, or train a new deep learning model directly on images that have only undergone noise reduction and enhancement, which does not let the computer focus its learning on the features of the key lesion region. The present method improves the learning effect in this respect to a certain extent.
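The augmentation techniques named above (center cropping, vertical and horizontal flipping, brightness adjustment) can be sketched in NumPy. This is an illustrative sketch; the 80% crop size and the 0.8 to 1.2 brightness range are assumptions not stated in the patent:

```python
import numpy as np

def augment(img, rng=None):
    """Return the augmented variants named in the text: centre crop,
    vertical flip, horizontal flip, and a brightness-scaled copy."""
    rng = np.random.default_rng(rng)
    h, w = img.shape[:2]
    ch, cw = int(h * 0.8), int(w * 0.8)       # centre crop to 80% (assumed size)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    crop = img[y0:y0 + ch, x0:x0 + cw]
    flip_v = img[::-1]                         # up-down flip
    flip_h = img[:, ::-1]                      # left-right flip
    factor = rng.uniform(0.8, 1.2)             # brightness adjustment (assumed range)
    bright = np.clip(img.astype(float) * factor, 0, 255).astype(np.uint8)
    return [crop, flip_v, flip_h, bright]
```

Each input image thus yields four additional samples, which multiplies the data set size on top of the SMOTE synthesis.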
Drawings
To illustrate the embodiments of the present application or the prior art more clearly, the drawings needed in their description are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an embodiment of a method for processing sample data of a cervical image according to the present invention;
fig. 2 is a schematic diagram of a segmentation operation of an embodiment of the cervical image sample data processing method of the present invention;
fig. 3 is a schematic diagram of model construction of an embodiment of the method for processing sample data of a cervical image according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the embodiments shown in the drawings. The embodiments do not limit the invention; structural, methodological, or functional changes made by those skilled in the art according to these embodiments are all included within the scope of the invention.
The system implementation flow of the present invention is shown in fig. 1. It mainly comprises the following parts:
and an image preprocessing part which comprises image data cleaning, image noise reduction, image segmentation (ROI extraction) and image enhancement.
And a sample amplification part which comprises a SMOTE algorithm to artificially synthesize a new minority of samples and data enhancement.
And (4) constructing a classification model part, sending the image data into a CNN model for training, and introducing. And the training precision is further improved by the transfer learning.
The detailed process of the invention is explained in detail as follows:
the method comprises the following steps of firstly, carrying out data cleaning on a cervical data set collected by an electronic colposcope and stored in a workstation, and directly deleting images of over-bright images, over-dark images, blurred images and images with medical instruments and sundries in image visual fields.
Step 2: divide the data set into six classes, normal, inflammation, CIN I, CIN II, CIN III, and cancer, according to the opinion of specialist physicians and the electronic medical records.
Step 3: images collected by the electronic colposcope are not only limited by the hardware during shooting, compression, transmission, and storage, but are also affected by various objective factors in the external environment, so they often contain considerable noise. Noise in the image not only affects the physician's visual perception but also interferes with the CAD system's extraction of image features, degrading the system's recognition and diagnosis performance. In medical image processing, invalid signals in an image that can affect image feature extraction are referred to as noise. After experimental comparison, median filtering was selected to filter the noise in the colposcopic images.
Median filtering: the median filtering method is a nonlinear smoothing technique that sets the gray value of each pixel to the median of the gray values of all pixels within a neighborhood window around that point.
Median filtering is a nonlinear signal-processing technique, based on order statistics, that effectively suppresses noise. Its basic principle is to replace the value of a point in a digital image or sequence with the median of all point values in a neighborhood of that point, so that the surrounding pixel values approach the true values and isolated noise points are eliminated. Concretely, a two-dimensional sliding template of a given structure sorts the pixels it covers by gray value, producing a monotonically ascending (or descending) two-dimensional data sequence. The two-dimensional median filter output is g(x, y) = med{ f(x − k, y − l), (k, l) ∈ W }, where f(x, y) and g(x, y) are the original and processed images respectively. W is a two-dimensional template, typically a 3 × 3 or 5 × 5 region, though other shapes such as lines, circles, crosses, and rings are also possible.
For example, the filtering is realized as follows:
1: take an odd number of data points from a sampling window in the image and sort them;
2: replace the data point being processed with the sorted median value.
Step 4: in the cervical disease diagnosis process, the physician only needs to observe and analyze the region of interest, so image segmentation is necessary to obtain the image information of that region. The invention adopts the Otsu algorithm: assume the image contains two classes of pixels (foreground pixels and background pixels) and its histogram is bimodal, then compute the optimal separating threshold so that the intra-class variance is minimal or, equivalently, the inter-class variance is maximal. This distinguishes the ROI from the irrelevant background area, and with additional morphological operations such as filling, deletion, and segmentation, the image of the cervical region of interest is acquired. The image segmentation results are shown in fig. 2.
Otsu algorithm:
for an image I(x, y), denote the segmentation threshold between foreground (i.e., target) and background as T, the proportion of foreground pixels in the whole image as ω0, and their average gray level as μ0; the proportion of background pixels is ω1, with average gray level μ1. The overall mean gray level of the image is denoted μ and the inter-class variance is denoted g.
Assuming that the background of the image is dark and the size of the image is M × N, the number of pixels in the image with the gray scale value smaller than the threshold T is denoted as N0, and the number of pixels with the gray scale value larger than the threshold T is denoted as N1, there are:
(1)ω0=N0/(M×N)
(2)ω1=N1/(M×N)
(3)N0+N1=M×N
(4)ω0+ω1=1
(5)μ=ω0*μ0+ω1*μ1
(6)g=ω0*(μ0-μ)²+ω1*(μ1-μ)²
Substituting formula (5) into formula (6) yields the equivalent formula:
(7)g=ω0*ω1*(μ0-μ1)²
The threshold T that maximizes the inter-class variance g is obtained by traversal. The part of the image with gray value smaller than T is then the foreground, and the part with gray value larger than T is the background.
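The traversal can be sketched as follows (an illustrative pure-Python version operating on a flat list of 8-bit gray values; the patent provides no code, and the example data are hypothetical):

```python
def otsu_threshold(pixels):
    """Exhaustively traverse thresholds T and return the one maximizing the
    inter-class variance g = w0*w1*(mu0 - mu1)^2 (formula (7) above)."""
    n = len(pixels)
    best_t, best_g = 0, -1.0
    for t in range(1, 256):
        fg = [p for p in pixels if p < t]   # foreground: gray value < T
        bg = [p for p in pixels if p >= t]  # background: gray value >= T
        if not fg or not bg:
            continue
        w0, w1 = len(fg) / n, len(bg) / n
        mu0, mu1 = sum(fg) / len(fg), sum(bg) / len(bg)
        g = w0 * w1 * (mu0 - mu1) ** 2
        if g > best_g:
            best_g, best_t = g, t
    return best_t

# Two well-separated gray-level populations; the first threshold that
# separates them wins because g is identical for every t in (20, 200]:
print(otsu_threshold([20] * 50 + [200] * 50))  # -> 21
```

A production version would work on the 256-bin histogram rather than the raw pixel list, which reduces the cost of each candidate threshold to O(1) with running sums.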
Fifthly, image enhancement in the CAD system highlights the effective information in the image: that information is suitably amplified, and the differences between the features of different classes of pictures are amplified with it, so that the CAD system can identify the differences between images more accurately. The key to identifying a cervical image is the transformation zone near the cervical os. In a cervical image taken by colposcope this area is rich in detailed texture and usually lies in a higher gray-scale region, so enhancement of the cervical image should highlight the high-gray part near the cervical os and compress the "highlight" parts far from it. The invention adopts gamma correction to enhance the images, making the feature differences between cervical images more obvious; the numerical differences are amplified, and classification and identification become easier.
Gamma correction is mainly used to correct images whose gray levels are too high (over-exposure) or too low (under-exposure), thereby enhancing contrast. The transformation applies a power-law operation to each pixel value r of the original image:
s = c·r^γ
When γ < 1, regions of lower gray level in the image are stretched while the higher-gray parts are compressed; when γ > 1, regions of higher gray level are stretched while the lower-gray parts are compressed. Adjusting γ therefore enhances detail in either the low-gray or the high-gray range. The gamma transformation has an obvious enhancement effect on colposcopic images with low contrast and high overall brightness.
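A minimal sketch of s = c·r^γ on 8-bit pixel values (illustrative; the normalization of r to [0, 1] and the default c = 1 are conventional assumptions, not stated in the text):

```python
def gamma_correct(pixel, c=1.0, gamma=1.0):
    """Apply s = c * r**gamma to one 8-bit pixel, with r normalized to [0, 1]."""
    r = pixel / 255.0
    return round(c * (r ** gamma) * 255)

# gamma < 1 stretches the low-gray range, gamma > 1 compresses it:
print(gamma_correct(64, gamma=0.5))  # -> 128 (dark pixel brightened)
print(gamma_correct(64, gamma=2.0))  # -> 16  (dark pixel darkened)
```

Applying the same function with γ > 1 to every pixel of a bright, low-contrast colposcopic image compresses the uninteresting highlights while spreading the mid-gray detail, which is the behavior the paragraph above describes.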
Sixthly, to address the sample differences among the cervical-lesion classes and prevent the precision loss caused by class imbalance, the method introduces the SMOTE algorithm: the minority-class samples are analyzed, new samples are artificially synthesized from them, and the synthesized samples are added to the data set, balancing the six categories. In the data set acquired by the method the inflammation images are the most numerous and the remaining classes are fewer; after SMOTE processing, the sample counts of the remaining categories match that of the inflammation class.
SMOTE stands for Synthetic Minority Oversampling Technique, an improved scheme based on the random oversampling algorithm. Because random oversampling simply copies samples to enlarge the minority classes, it easily causes model overfitting: the information learned by the model becomes too specific to generalize. The basic idea of SMOTE is instead to analyze the minority-class samples and synthesize new samples from them to add to the data set. The specific algorithm flow is as follows:
Step 1: for each sample x in the minority class, calculate the distance from x to every other sample in the minority-class set S_min using the Euclidean distance, and obtain the k nearest neighbors of x.
Step 2: set a sampling ratio according to the sample imbalance ratio to determine the sampling rate N. For each minority-class sample x, randomly select several samples from its k nearest neighbors; denote a selected neighbor by x_n.
Step 3: for each randomly selected neighbor x_n, construct a new sample together with the original sample according to the following formula:
x_new = x_n + rand(0,1)·|x − x_n|
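The three steps can be sketched in pure Python. One hedge: the sketch uses the common SMOTE interpolation x_new = x + rand(0,1)·(x_n − x), which places each synthetic point on the segment between x and its neighbor; the formula as printed in the patent (x_new = x_n + rand(0,1)·|x − x_n|) differs slightly. The sample data are hypothetical.

```python
import math
import random

def smote(minority, n_per_sample=1, k=2, seed=0):
    """Synthesize new minority-class samples (each a list of floats).

    Step 1: find the k nearest neighbors of each sample (Euclidean distance).
    Step 2: randomly pick a neighbor x_n for each synthetic point.
    Step 3: interpolate x_new = x + rand(0,1) * (x_n - x) -- the common
            SMOTE formula (the patent's printed formula differs slightly).
    """
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        dists = sorted((math.dist(x, y), y) for y in minority if y is not x)
        neighbors = [y for _, y in dists[:k]]
        for _ in range(n_per_sample):
            xn = rng.choice(neighbors)
            gap = rng.random()
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, xn)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new = smote(minority, n_per_sample=2, k=2)
print(len(new))  # -> 6, each point lying between two original samples
```

For image data, x would be a flattened feature vector rather than raw pixels; libraries such as imbalanced-learn provide a tested implementation of the same idea.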
Seventhly, for the balanced data set, each class of image is randomly divided into a training set, a verification set, and a test set in proportions of 80%, 15%, and 5%, respectively. Because the sample data size is small and may affect the generalization performance of the classification diagnosis model, data enhancement is performed on the training set and on the verification and test sets respectively. The invention adopts various data enhancement operations such as center cropping, vertical flipping, horizontal flipping, and brightness and chromaticity changes.
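The 80/15/5 random split can be sketched as follows (illustrative only; the random seed and the decision to give rounding remainders to the test set are assumptions the text does not make):

```python
import random

def split_dataset(items, ratios=(0.80, 0.15, 0.05), seed=42):
    """Randomly split one class's samples into train/verification/test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])
    return (items[:n_train],                 # 80% training
            items[n_train:n_train + n_val],  # 15% verification
            items[n_train + n_val:])         # remaining ~5% test

train, val, test = split_dataset(range(200))
print(len(train), len(val), len(test))  # -> 160 30 10
```

Splitting each class separately, as the patent describes, keeps the class proportions identical across the three sets (a stratified split).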
Eighthly, a CNN classification model is constructed. The invention selects the VGG19 network structure and further modifies it: all original fully connected layers are deleted and three new layers are added at the end of the network. The first two are fully connected layers, which reassemble the local features extracted by the convolutional layers into a complete representation through weight matrices; the third is a softmax activation layer, the final output layer of the model, realizing six-class differentiation of cervical lesions. On this basis transfer learning is introduced, and the final cervical-lesion screening auxiliary diagnosis system is realized by freezing and unfreezing some of the convolutional layers and fine-tuning the network parameters.
The CNN model consists of an input layer, an output layer, hidden layers, and the weights (parameters) connecting them. Each layer contains multiple neurons; the neurons of one layer are mapped to those of the next by an activation function, each connection carries a weight, and the output is the classification category. The CNN model has the following advantages. First, parameter sharing: in a convolutional neural network the parameters are the kernel values, which are the same for all regions, so the number of parameters is small and overfitting is effectively prevented. Second, sparse connectivity: each "cell" of a convolutional layer's output depends only on the corresponding part of the input image and is unrelated to the rest, so the computation is small. Finally, hierarchical feature extraction: successively higher-level features are extracted through repeated convolution and pooling.
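The parameter-sharing advantage can be made concrete by counting parameters (layer sizes chosen for illustration only; they are not the patent's configuration):

```python
def conv_params(in_ch, out_ch, k):
    """A convolutional layer shares one k x k kernel per (input, output)
    channel pair, so its parameter count is independent of image size."""
    return in_ch * out_ch * k * k + out_ch          # weights + biases

def dense_params(in_units, out_units):
    """A fully connected layer needs one weight per input/output pair."""
    return in_units * out_units + out_units

# 224 x 224 RGB input mapped to 64 feature maps vs. 64 dense units:
print(conv_params(3, 64, 3))            # -> 1792
print(dense_params(224 * 224 * 3, 64))  # -> 9633856
```

The same 3 × 3 kernels slide over every position of the 224 × 224 image, which is why the convolutional layer needs roughly 5000× fewer parameters than a dense layer over the same input.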
Of course, the VGG19 deep neural network is selected in the present embodiment; in practice the deep model may adopt other CNN networks, such as VGG16, InceptionNet, ResNet, etc.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.
Claims (10)
1. A sample data processing method of a cervical image comprises the following steps:
establishing classification: establishing a contrast data set, and acquiring a classification standard for cervical image characteristics on the basis of the contrast data set;
data preprocessing: acquiring and denoising sample data of a cervical image;
and (3) dividing: in the preprocessed data, segmenting the image data to obtain target image data;
data enhancement: classifying the target image data, confirming the difference between various target image data, and implementing enhancement processing aiming at the difference;
and (3) equalization processing: aiming at the total amount difference between various target image data, supplementing a few types of samples by adopting data fitting to realize the total amount balance between various target image data;
and (3) data set construction: aiming at various target image data after equalization processing, respectively and randomly dividing the target image data into a training data set, a verification data set and a test data set in proportion;
constructing a model: based on the training data set and/or the verification data set and/or the test data set, the data set is mapped to the comparison data set to obtain the corresponding classification of the sample data.
2. The method for processing sample data of a cervical image according to claim 1, wherein in the segmentation operation, the segmentation of the image data is performed by using the Otsu algorithm:
setting image data comprising foreground pixel data and background pixel data, and calculating to obtain a threshold value for distinguishing the foreground pixel data from the background pixel data;
the image data is divided into foreground pixel data and background pixel data by the threshold.
3. The method for processing the sample data of the cervical image according to claim 2, wherein the segmentation operation further includes performing a morphological operation on each of the foreground pixel data and/or the background pixel data obtained by the division, so as to obtain the target image data.
4. The method of claim 3, wherein said morphological operation includes at least any of adding, padding, deleting and segmenting.
5. The method for processing the sample data of the cervical image according to claim 1, wherein in the equalizing operation, the minority-class sample data in the target data are processed by using the SMOTE algorithm:
analyzing a few types of sample data;
and synthesizing a new sample according to the minority sample, and adding the synthesized new sample into the original minority sample to form a new minority sample set until the minority sample set realizes total balance among various types of target image data.
6. The method for processing the sample data of the cervical image according to claim 5, wherein in the equalizing operation, the SMOTE algorithm processes the minority-class sample data in the target data as follows:
step 1: for each sample x in the minority-class samples, calculating the distance from x to every sample in the minority-class sample set S_min using the Euclidean distance as the standard, and obtaining the k nearest neighbors of x;
step 2: setting a sampling ratio according to the sample imbalance ratio to determine a sampling multiplying factor N, and randomly selecting a plurality of samples from the k neighbors of each minority-class sample x, the selected neighbor being denoted x_n;
step 3: for each randomly selected neighbor x_n, constructing a new sample together with the original sample according to the following formula:
x_new = x_n + rand(0,1)·|x − x_n|, where rand(0,1) refers to a random value in the range of 0 to 1.
7. The method for processing the sample data of the cervical image according to claim 1, wherein the data preprocessing operation at least includes a deleting operation performed on over-bright and/or over-dark and/or blurred and/or impurity-containing image data in the sample data.
8. The method for processing sample data of a cervical image according to claim 7, wherein the deletion operation includes a complete deletion operation performed on each image or a partial deletion operation performed on a target region after segmenting each image.
9. The method for processing the sample data of the cervical image according to claim 1, wherein in the data set constructing operation, the division ratios of the training data set, the verification data set and the testing data set of each class of target image data are 80%, 15% and 5%, respectively.
10. The method for processing the sample data of the cervical image according to claim 9, wherein the training data set, the verification data set and the testing data set of each class of target image data are constructed by random division according to the stated proportions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911125170.1A CN110852396A (en) | 2019-11-15 | 2019-11-15 | Sample data processing method for cervical image |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110852396A true CN110852396A (en) | 2020-02-28 |
Family
ID=69600588
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111640097A (en) * | 2020-05-26 | 2020-09-08 | 上海鹰瞳医疗科技有限公司 | Skin mirror image identification method and equipment |
CN111666872A (en) * | 2020-06-04 | 2020-09-15 | 电子科技大学 | Efficient behavior identification method under data imbalance |
CN111723856A (en) * | 2020-06-11 | 2020-09-29 | 广东浪潮大数据研究有限公司 | Image data processing method, device and equipment and readable storage medium |
CN111784593A (en) * | 2020-06-04 | 2020-10-16 | 广东省智能制造研究所 | Lung nodule CT image data enhancement method and system for deep learning |
CN111863118A (en) * | 2020-07-20 | 2020-10-30 | 湖南莱博赛医用机器人有限公司 | Method for carrying out TCT and DNA ploidy analysis based on TCT film-making |
CN112241715A (en) * | 2020-10-23 | 2021-01-19 | 北京百度网讯科技有限公司 | Model training method, expression recognition method, device, equipment and storage medium |
CN112861734A (en) * | 2021-02-10 | 2021-05-28 | 北京农业信息技术研究中心 | Trough food residue monitoring method and system |
CN113052865A (en) * | 2021-04-16 | 2021-06-29 | 南通大学 | Power transmission line small sample temperature image amplification method based on image similarity |
CN113139944A (en) * | 2021-04-25 | 2021-07-20 | 山东大学齐鲁医院 | Deep learning-based colposcopic image classification computer-aided diagnosis system and method |
CN113268623A (en) * | 2021-06-01 | 2021-08-17 | 上海市第一人民医院 | Artificial intelligence gastroscope image recognition processing system |
WO2022121032A1 (en) * | 2020-12-10 | 2022-06-16 | 广州广电运通金融电子股份有限公司 | Data set division method and system in federated learning scene |
TWI779284B (en) * | 2020-05-06 | 2022-10-01 | 商之器科技股份有限公司 | Device for marking image data |
CN117541482A (en) * | 2024-01-10 | 2024-02-09 | 中国人民解放军空军军医大学 | Cervical image enhancement system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102495901A (en) * | 2011-12-16 | 2012-06-13 | 山东师范大学 | Method for keeping balance of implementation class data through local mean |
CN105574859A (en) * | 2015-12-14 | 2016-05-11 | 中国科学院深圳先进技术研究院 | Liver tumor segmentation method and device based on CT (Computed Tomography) image |
CN109410196A (en) * | 2018-10-24 | 2019-03-01 | 东北大学 | Cervical cancer tissues pathological image diagnostic method based on Poisson annular condition random field |
CN109961838A (en) * | 2019-03-04 | 2019-07-02 | 浙江工业大学 | A kind of ultrasonic image chronic kidney disease auxiliary screening method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200228 |
|