CN108564114B - Method for automatic identification of leukocytes in human feces based on machine learning

Info

Publication number: CN108564114B
Application number: CN201810262889.9A
Authority: CN
Inventors: 刘娟秀; 王祥舟; 申志杰; 邓鼎文; 杜晓辉; 赵家喜; 张静; 倪光明; 郝如茜; 刘霖; 刘永; 周辉
Original assignee: University of Electronic Science and Technology of China; Ningbo Momi Innovation Works Electronic Technology Co Ltd
Current assignee: University of Electronic Science and Technology of China; Ningbo Momi Innovation Works Electronic Technology Co Ltd
Priority date: 2018-03-28
Filing date: 2018-03-28
Publication date: 2022-05-27
Anticipated expiration: 2038-03-28
Also published as: CN108564114A

Abstract

The invention discloses a machine-learning-based method for automatically identifying leukocytes in human feces. It relates to automatically identifying and detecting leukocytes in human feces using digital image processing, and in particular machine learning. The method offers accurate recognition, high speed, high efficiency, and a small amount of computation.

Description

Method for automatic identification of leukocytes in human feces based on machine learning

Technical Field

The invention relates to a method for automatically identifying and detecting leukocytes (white blood cells) in human feces using digital image processing, and in particular machine learning.

Background

Leukocytes in human feces are usually found in mucous and purulent-bloody stool, and are mostly segmented neutrophils. Degeneration makes the cells swollen and indistinct. When present in large numbers, the cells clump together and their membranes are incomplete or ruptured; such cells are also called pus cells and indicate serious infection. Unlike blood and urine, human feces contain a large amount of impurities such as food residue, so the image background is complex; in addition, some leukocyte membranes are not intact and adhere to impurities. All of this makes leukocytes difficult to segment. Because cellular degeneration makes leukocyte morphology variable, leukocytes are also hard to recognize with morphology-based methods.

In 2014 the applicant filed a patent for an automatic detection method for leukocytes in bronchoalveolar lavage smears (patent number 2014103696359). That invention grays and binarizes a microscopic image of a bronchoalveolar lavage fluid smear captured with a microscope, then screens candidates using the external shape features and internal features of leukocytes to finally identify them. Because it detects the leukocytes in a sample with pure image processing, the amount of computation is large in practical applications.

Disclosure of Invention

The invention aims to provide an automatic identification method for leukocytes in human feces that copes with the complex background of fecal microscopic images and the diverse morphological changes of leukocytes.

The technical solution provided by the invention is a machine-learning-based method for automatically identifying leukocytes in human feces, comprising the following steps:

step 1: diluting, stirring, and settling a human fecal sample, and acquiring a microscopic image of the sample with a biological microscope;

step 2: converting the image from step 1 to a grayscale image;

step 3: applying local inverse binarization to the grayscale image from step 2 to obtain a binary image;

step 4: performing hole filling on the binary image from step 3, filling regions with an area smaller than S1, where S1 ranges from 180 to 230 pixels;

step 5: dilating the binary image from step 4 with a circular structuring element of radius R1; filling holes in the dilated binary image; then eroding the hole-filled binary image with a circular structuring element of radius R2; R1 ranges from 2 to 5 pixels, R2 ranges from 1 to 3 pixels, and R1 is larger than R2;

step 6: labeling the connected regions of the binary image from step 5, and computing the area and circumscribed rectangle of each connected region; retaining regions whose area is larger than S2, whose area occupies more than proportion D1 of the circumscribed rectangle, and whose circumscribed-rectangle width and height are both larger than H1 and smaller than H2; S2 ranges from 1400 to 1550 pixels, D1 from 60% to 80%, H1 from 30 to 50 pixels, and H2 from 100 to 120 pixels;

step 7: cropping the binary image to the circumscribed rectangle of each region retained in step 6; for each cropped binary image, first computing the center and radius of its maximum inscribed circle and retaining only regions whose inscribed-circle radius is larger than R3; then retaining only the points of the binary image that are less than L1 from the circle center; R3 ranges from 15 to 20 pixels, and L1 from 20 to 30 pixels;

step 8: computing the minimum circumscribed circle of the binary image from step 7, and retaining binary images whose binary region occupies more than proportion D2 of the circle's area; D2 ranges from 72% to 90%;

step 9: recomputing and correcting the circumscribed-rectangle coordinates of the binary image from step 8; retaining regions whose circumscribed-rectangle width and height are both larger than H3 and whose absolute width-height difference is smaller than H4; H3 ranges from 50 to 57 pixels, and H4 from 12 to 18 pixels;

step 10: cropping the corresponding grayscale and color images using the newly computed circumscribed-rectangle coordinates of the regions retained in step 9; computing the variance, mean gray level, and sharpness of each grayscale crop; retaining regions with variance larger than F1, mean gray level larger than G1, and sharpness larger than Q1; F1 ranges from 295 to 308, G1 from 55 to 70, and Q1 from 11 to 20;

the remaining steps are divided into two processes, classifier training and actual detection;

steps 11 to 16 are the classifier training process; steps 17 to 20 are the actual detection process;

step 11: comparing the region coordinates retained in step 10 with the coordinates of the corresponding manually boxed leukocyte regions, removing the real leukocytes, and cropping the color images of the remaining regions as impurities for the negative sample set; the positive training sample set consists of the manually boxed leukocytes; the parameters used in steps 4 to 10 are chosen so that the leukocytes in all samples fall within the regions retained after step 10;

step 12: normalizing all sample images in the sample sets, and then extracting Gabor features and LBP features from each of the three color channels;

step 13: preprocessing the features extracted from the sample set in step 12;

step 13-1: first performing principal component analysis (PCA) on the feature vectors, retaining only 99% of the principal components;

step 13-2: applying 'L2' normalization to the features obtained in step 13-1;

step 13-3: standardizing the features obtained in step 13-2;

step 14: using k-fold cross-validation, dividing the sample set equally into N parts, where N is greater than 3, each time using one or two parts as the training set and the remaining parts as the test set;

step 15: training a support vector machine (SVM) on the training set from step 14 to obtain classification model 1;

step 16: training classification model 2 on the basis of classification model 1 obtained in step 15;

step 17: cropping the color images corresponding to the regions retained in step 10, extracting and combining the Gabor and LBP features, and preprocessing them by the method of step 13 to obtain the feature vectors to be classified;

step 18: inputting the feature vectors from step 17 into classifier 1 to obtain a classification result;

step 19: extracting HOG (Histogram of Oriented Gradients) features from the color images classified as leukocytes in step 18, preprocessing them by the method of step 13, and feeding the processed feature vectors to classifier 2 to obtain the final classification result;

step 20: marking all leukocytes detected in step 19 on the color microscopic image from step 1.
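
The inscribed-circle screening of step 7 can be illustrated with a minimal NumPy sketch. It is an assumption-laden illustration, not the patented implementation: the function names are hypothetical, the distance transform is computed by brute force (adequate only for small crops), pixels outside the crop are treated as background, and the defaults 18 and 24 are the R3 and L1 values used in the embodiment.

```python
import numpy as np

def max_inscribed_circle(mask):
    """Return ((row, col), radius) of the largest circle inside the foreground."""
    # Surround the crop with background so the circle cannot leave the crop.
    padded = np.pad(np.asarray(mask, dtype=bool), 1, constant_values=False)
    fy, fx = np.nonzero(padded)
    by, bx = np.nonzero(~padded)
    # Brute-force distance transform: squared distance from every foreground
    # pixel to every background pixel (fine for small crops only).
    d2 = (fy[:, None] - by[None, :]) ** 2 + (fx[:, None] - bx[None, :]) ** 2
    nearest = np.sqrt(d2.min(axis=1))
    best = int(np.argmax(nearest))
    # Subtract 1 to undo the padding offset.
    return (int(fy[best]) - 1, int(fx[best]) - 1), float(nearest[best])

def inscribed_circle_screen(mask, min_radius=18, max_dist=24):
    """Reject regions with a small inscribed circle, then trim far-away points."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return None
    (cy, cx), radius = max_inscribed_circle(mask)
    if radius <= min_radius:
        return None  # inscribed circle too small: discard the region
    yy, xx = np.indices(mask.shape)
    near_center = (yy - cy) ** 2 + (xx - cx) ** 2 < max_dist ** 2
    return mask & near_center  # keep only points closer than max_dist

# Usage: a disk of radius 20 with a thin spur attached to it.
yy, xx = np.indices((61, 61))
blob = (yy - 30) ** 2 + (xx - 30) ** 2 <= 20 ** 2
blob[30, 30:] = True
trimmed = inscribed_circle_screen(blob)
```

Trimming to the points near the inscribed-circle center removes thin attachments such as adhering debris, which is presumably why the subsequent circularity test in step 8 is applied to the trimmed region.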

Further, in the local binarization of step 3, the local region is a square with a side length of 11 pixels, the threshold is the local gray mean, and the threshold scaling coefficient is 0.98.
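
As a rough illustration of this local inverse binarization, the following NumPy sketch marks a pixel as foreground when it is darker than 0.98 times the mean of its 11 × 11 neighborhood. How the scaling coefficient enters the threshold and how image borders are handled are not specified in the text, so both are assumptions here (multiplicative scaling, edge replication); the function name is hypothetical.

```python
import numpy as np

def local_inverse_binarize(gray, side=11, scale=0.98):
    """Return 255 where a pixel is darker than scale * local mean, else 0."""
    gray = np.asarray(gray, dtype=np.float64)
    half = side // 2  # side is assumed to be odd
    # Replicate edge pixels so every pixel has a full side x side window.
    padded = np.pad(gray, half, mode="edge")
    # Summed-area table with a leading row and column of zeros.
    sat = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    sat[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = gray.shape
    # Window sum for every pixel from four table lookups.
    box_sum = (sat[side:side + h, side:side + w] - sat[:h, side:side + w]
               - sat[side:side + h, :w] + sat[:h, :w])
    local_mean = box_sum / (side * side)
    # Inverse binarization: dark objects on a bright background become white.
    return np.where(gray < scale * local_mean, 255, 0).astype(np.uint8)

# Usage: a dark 5 x 5 patch on a bright background becomes foreground.
img = np.full((21, 21), 200, dtype=np.uint8)
img[8:13, 8:13] = 50
binary = local_inverse_binarize(img)
```

A local threshold of this kind adapts to uneven illumination and cluttered backgrounds, which a single global threshold would not.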

Further, the sharpness calculation in step 10 uses a re-blur (secondary blur) method, with the following steps:

step 10-1: computing the gradient value at each pixel of the grayscale image to obtain the original gradient map;

step 10-2: applying Gaussian blur to the grayscale image, with a 9 × 9 template and a variance of 1.5;

step 10-3: computing the gradient value at each pixel of the grayscale image blurred in step 10-2 to obtain the blurred gradient map;

step 10-4: computing the mean absolute difference between the original gradient map from step 10-1 and the blurred gradient map from step 10-3 to obtain the sharpness value.
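
Steps 10-1 to 10-4 can be sketched in NumPy as follows. The gradient operator is not named in the text, so central differences via `np.gradient` are an assumption, as is edge replication at the borders; the value 1.5 is taken literally as the Gaussian variance, as stated. Function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_1d(size=9, variance=1.5):
    """Normalized 1-D Gaussian; the 9 x 9 template is its outer product."""
    x = np.arange(size) - size // 2
    k = np.exp(-(x ** 2) / (2.0 * variance))
    return k / k.sum()

def gaussian_blur(gray, size=9, variance=1.5):
    """Separable Gaussian blur with edge replication at the borders."""
    k = gaussian_kernel_1d(size, variance)
    padded = np.pad(np.asarray(gray, dtype=np.float64), size // 2, mode="edge")
    # Filter along rows, then along columns; "valid" trims the padding again.
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def gradient_magnitude(gray):
    """Per-pixel gradient magnitude from central differences."""
    gy, gx = np.gradient(np.asarray(gray, dtype=np.float64))
    return np.hypot(gx, gy)

def reblur_sharpness(gray):
    original = gradient_magnitude(gray)                # step 10-1
    blurred = gradient_magnitude(gaussian_blur(gray))  # steps 10-2 and 10-3
    return float(np.mean(np.abs(original - blurred)))  # step 10-4

# Usage: a noisy (sharp) image scores higher than its blurred copy.
rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
soft = gaussian_blur(sharp)
scores = (reblur_sharpness(sharp), reblur_sharpness(soft))
```

The idea is that blurring an already blurry crop changes its gradients very little, while blurring a sharp crop changes them a lot, so a larger difference indicates a sharper crop.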

Further, the specific method of step 12 is as follows:

scaling the image to 80 × 80, and then extracting Gabor features and LBP features from each of the three color channels; for the Gabor features, the scale is 15; 8 orientations are used, {0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°}; the wavelength is π/4; the spatial aspect ratio is 1.0; the standard deviation is 1.5; the phase offset is 0; and 8× downsampling is applied to reduce the feature dimension; the LBP (Local Binary Pattern) features use the uniform-pattern LBP; the Gabor features and LBP features are then combined.
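
The Gabor part of this feature extraction might look like the NumPy sketch below. Several points are assumptions rather than facts from the text: "scale 15" is read as a 15 × 15 kernel, the kernel is the usual real-valued Gabor form (Gaussian envelope times cosine carrier), the raw filter response is used as the feature, and 8× downsampling is implemented by keeping every 8th pixel. Function names are hypothetical, and LBP extraction is not shown.

```python
import numpy as np

def gabor_kernel(theta, size=15, wavelength=np.pi / 4, sigma=1.5,
                 aspect=1.0, phase=0.0):
    """Real Gabor kernel: Gaussian envelope times a cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (aspect * yr) ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * xr / wavelength + phase)

def filter_same(img, kernel):
    """Correlate img with kernel, reflecting at the borders (same-size output)."""
    kh, kw = kernel.shape
    h, w = img.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="reflect")
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def gabor_features(rgb, step=8):
    """Filter each color channel at 8 orientations and keep every 8th pixel."""
    thetas = np.deg2rad(np.arange(0, 360, 45))  # 0, 45, ..., 315 degrees
    feats = []
    for c in range(rgb.shape[2]):
        channel = rgb[:, :, c].astype(np.float64)
        for theta in thetas:
            response = filter_same(channel, gabor_kernel(theta))
            feats.append(response[::step, ::step].ravel())
    return np.concatenate(feats)

# Usage on a random stand-in for an 80 x 80 color crop.
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(80, 80, 3), dtype=np.uint8)
features = gabor_features(patch)
```

Under these assumptions the feature length is 3 channels × 8 orientations × 10 × 10 samples = 2400, which matches the Gabor dimension reported in the embodiment.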

Further, in step 15:

when training the SVM model, an RBF kernel is used, gamma is selected automatically, the penalty coefficient C is 2, and the weight ratio of positive to negative samples is 2:1;

the specific steps of step 16 are as follows:

step 16-1: taking all negative samples falsely detected by model 1 in step 15 (impurities that model 1 classified as leukocytes) as the negative sample set of model 2; the positive sample set is the scaled, manually boxed leukocytes;

step 16-2: extracting HOG (Histogram of Oriented Gradients) features from the grayscale images of the sample set; the cell size is 8 × 8 with 9 gradient orientations; the block size is 2 × 2; the stride is 8; this yields a 3600-dimensional feature vector;

step 16-3: applying the data preprocessing of step 13 to the sample set features to obtain 2320-dimensional feature vectors;

step 16-4: training an SVM with the same training method as model 1 to obtain classification model 2, which further removes falsely detected impurities while retaining the leukocytes; an RBF kernel is used, gamma is selected automatically, the penalty coefficient C is 1, the positive-to-negative sample weight ratio is 6:1, and all other parameters are defaults.

The method has the advantages of accurate recognition, high speed, high efficiency, and a small amount of computation.

Drawings

FIG. 1 is a flow chart of the detection process of the automatic identification method for leukocytes in human feces.

FIG. 2 is a flow chart of the corresponding classification model training.

FIG. 3 shows the final automatic recognition result.

Detailed Description

A method for detecting human fecal leukocytes, the method comprising the steps of:

Step 1: Human feces are pretreated by dilution, stirring, settling, and similar operations, and a biological microscope is used to acquire microscopic images of the fecal sample.

Step 2: The image from step 1 is converted to a grayscale image.

Step 3: Local inverse binarization is applied to the grayscale image from step 2 to obtain a binary image. The local region is a square with a side length of 11 pixels; the threshold is the local gray mean, and the threshold scaling coefficient is 0.98.

Step 4: Hole filling is performed on the binary image from step 3, and regions with an area smaller than 200 are then removed.

Step 5: The binary image from step 4 is dilated with a circular structuring element of radius 4. Holes in the dilated binary image are filled. The hole-filled binary image is then eroded with a circular structuring element of radius 2.

Step 6: The connected regions of the binary image from step 5 are labeled, and the area and circumscribed rectangle of each connected region are computed. Regions are retained if their area is larger than 1500, they occupy more than 70% of the circumscribed rectangle, and the width and height of the circumscribed rectangle are both larger than 35 and smaller than 110.

Step 7: The binary image is cropped to the circumscribed rectangle of each region retained in step 6. For each cropped binary image, the center and radius of its maximum inscribed circle are computed first, and only regions whose inscribed-circle radius is larger than 18 are retained. Then only the points of the binary image that are less than 24 from the circle center are kept.

Step 8: The minimum circumscribed circle of the binary image from step 7 is computed, and binary images whose binary region occupies more than 84% of the circle's area are retained.

Step 9: The circumscribed-rectangle coordinates of the binary image from step 8 are recomputed and corrected. Regions are retained if the width and height of the circumscribed rectangle are both larger than 53 and the absolute width-height difference is smaller than 15.

Step 10: The corresponding grayscale and color images are cropped using the newly computed circumscribed-rectangle coordinates of the regions retained in step 9. The variance, mean gray level, and sharpness of each grayscale crop are computed. Regions with variance greater than 300, mean gray level greater than 60, and sharpness greater than 15 are retained.

The subsequent steps are divided into classifier training and actual detection. Steps 11 to 16 are the classifier training process. Steps 17 to 20 are the actual detection process.

Step 11: The region coordinates retained in step 10 are compared with the coordinates of the corresponding manually boxed leukocyte regions, the real leukocytes are removed, and the color images of the remaining regions are cropped as impurities for the negative sample set. The positive sample set used for training consists of the manually boxed leukocytes. The parameters used in steps 4 to 10 are chosen so that the leukocytes in all samples fall within the regions retained after step 10.

Step 12: For all sample sets, the images are scaled to 80 × 80, and Gabor and LBP features are then extracted from each of the three color channels. For the Gabor features, the scale is 15; 8 orientations are used, {0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°}; the wavelength is π/4; the spatial aspect ratio is 1.0; the standard deviation is 1.5; the phase offset is 0. 8× downsampling is also applied to reduce the feature dimension, giving 2400 Gabor dimensions in total. The LBP (Local Binary Pattern) features use the uniform-pattern LBP and yield 78 dimensions. Combining the Gabor and LBP features gives a 2478-dimensional feature vector.

Step 13: The features extracted from the sample set in step 12 are preprocessed to obtain 1342-dimensional feature vectors.

Step 14: Using k-fold cross-validation, the sample set is divided equally into 5 parts, 1 of which is used as the training set and the remaining 4 as the test set.

Step 15: A support vector machine (SVM) model is trained on the training set from step 14 to obtain classification model 1. Based on the cross-validation results, an RBF kernel is selected, gamma is selected automatically, the penalty coefficient C is 2, the positive-to-negative sample weight ratio is 2:1, and all other parameters are defaults.

Step 16: Classification model 2 is trained on the basis of classification model 1 obtained in step 15.

Step 17: The color images corresponding to the regions retained in step 10 are cropped, the Gabor and LBP features are extracted and combined, and the data are preprocessed by the method of step 13 to obtain the feature vectors to be classified.

Step 18: The feature vectors from step 17 are input into classifier 1 to obtain a classification result.

Step 19: HOG (Histogram of Oriented Gradients) features are extracted from the color images classified as leukocytes in step 18, preprocessed by the method of step 13, and fed to classifier 2 to obtain the final classification result.
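
The geometric screening rules of steps 6 and 9 reduce to a few comparisons on each region's area and circumscribed rectangle. The plain-Python sketch below uses the embodiment's thresholds as defaults; the function names and the example regions are hypothetical.

```python
def passes_step6(area, width, height,
                 min_area=1500, min_fill=0.70, min_side=35, max_side=110):
    """Step 6: area, fill ratio of the circumscribed rectangle, and side limits."""
    fill = area / (width * height)
    sides_ok = all(min_side < s < max_side for s in (width, height))
    return area > min_area and fill > min_fill and sides_ok

def passes_step9(width, height, min_side=53, max_diff=15):
    """Step 9: both sides large enough and the rectangle close to square."""
    return min(width, height) > min_side and abs(width - height) < max_diff

# Usage with made-up candidate regions given as (area, width, height).
round_cell = (2800, 60, 58)   # compact, nearly square: kept
sliver = (1600, 100, 40)      # fills only 40% of its rectangle: rejected
speck = (900, 36, 36)         # area too small: rejected
kept = [r for r in (round_cell, sliver, speck)
        if passes_step6(*r) and passes_step9(r[1], r[2])]
```

These cheap tests discard most impurities before any feature extraction or classification is run, which is consistent with the stated goal of a small amount of computation.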

The sharpness calculation in step 10 uses a re-blur (secondary blur) method, with the following steps:

Step 10-1: The gradient value at each pixel of the grayscale image is computed to obtain the original gradient map.

The specific steps in step 13 are as follows:

Step 13-1: First, principal component analysis (PCA) is performed on the 2478-dimensional feature vectors, and only 99% of the principal components are retained, giving 1342-dimensional feature vectors.

Step 13-2: 'L2' normalization is applied to the features obtained in step 13-1.

Step 13-3: The features obtained in step 13-2 are standardized.
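
A minimal NumPy sketch of this three-stage preprocessing is given below. It assumes that "99%" refers to the cumulative explained variance kept by PCA, that the L2 normalization is applied per sample (row), and that standardization means zero mean and unit variance per feature; none of these details is spelled out in the text, and the function names are hypothetical.

```python
import numpy as np

def l2_rows(Z):
    """Scale every sample (row) to unit Euclidean length."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, 1e-12)

def fit_preprocessor(X, keep=0.99):
    """Fit PCA (99% explained variance) and the statistics for z-scoring."""
    mean = X.mean(axis=0)
    # SVD of centered data; squared singular values are component variances.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    n_keep = int(np.searchsorted(ratio, keep) + 1)
    components = vt[:n_keep]
    Z = l2_rows((X - mean) @ components.T)
    return {"mean": mean, "components": components,
            "mu": Z.mean(axis=0), "sd": Z.std(axis=0) + 1e-12}

def transform(params, X):
    Z = (X - params["mean"]) @ params["components"].T  # step 13-1
    Z = l2_rows(Z)                                      # step 13-2
    return (Z - params["mu"]) / params["sd"]            # step 13-3

# Usage on synthetic data: 50 features driven by 5 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.01 * rng.normal(size=(200, 50))
params = fit_preprocessor(X)
Z = transform(params, X)
```

The parameters are fitted once on the training features and reused unchanged at detection time, which is what "preprocessing according to the method in step 13" in steps 17 and 19 requires for the classifier inputs to be comparable.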

The specific steps in the step 16 are as follows:

Step 16-1: All negative samples falsely detected by model 1 in step 15 (impurities that model 1 classified as leukocytes) are taken as the negative sample set of model 2. The positive sample set is the scaled, manually boxed leukocytes.

Step 16-3: The data preprocessing of step 13 is applied to the sample set features to obtain 2320-dimensional feature vectors.

Step 16-4: An SVM is trained with the same training method as model 1 to obtain classification model 2, which further removes falsely detected impurities while retaining the leukocytes. An RBF kernel is used, gamma is selected automatically, the penalty coefficient C is 1, the positive-to-negative sample weight ratio is 6:1, and all other parameters are defaults.
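
The two-stage training and detection logic can be sketched with scikit-learn. The text does not name a library, so using `sklearn.svm.SVC` is an assumption, as is mapping "gamma is selected automatically" to `gamma="auto"` and the weight ratios to `class_weight`. The function names are hypothetical, and the random arrays stand in for the preprocessed Gabor+LBP and HOG feature vectors.

```python
import numpy as np
from sklearn.svm import SVC

def train_cascade(X1, X2, y):
    """X1: stage-1 features, X2: stage-2 features, y: 1 = leukocyte, 0 = impurity."""
    model1 = SVC(kernel="rbf", gamma="auto", C=2, class_weight={1: 2, 0: 1})
    model1.fit(X1, y)
    # Hard negatives: impurities that model 1 wrongly accepts as leukocytes.
    hard_neg = (y == 0) & (model1.predict(X1) == 1)
    if not hard_neg.any():
        raise ValueError("model 1 made no false detections; nothing to train on")
    use = (y == 1) | hard_neg
    model2 = SVC(kernel="rbf", gamma="auto", C=1, class_weight={1: 6, 0: 1})
    model2.fit(X2[use], y[use])
    return model1, model2

def detect(model1, model2, X1, X2):
    """Only candidates accepted by model 1 are passed on to model 2."""
    accepted = model1.predict(X1) == 1
    final = np.zeros(len(X1), dtype=bool)
    if accepted.any():
        final[accepted] = model2.predict(X2[accepted]) == 1
    return final

# Usage with synthetic, overlapping classes standing in for real features.
rng = np.random.default_rng(0)
n = 150
y = np.r_[np.ones(n, dtype=int), np.zeros(n, dtype=int)]
X1 = np.r_[rng.normal(1.0, 1.0, (n, 4)), rng.normal(0.0, 1.0, (n, 4))]
X2 = np.r_[rng.normal(1.0, 1.0, (n, 6)), rng.normal(0.0, 1.0, (n, 6))]
m1, m2 = train_cascade(X1, X2, y)
final = detect(m1, m2, X1, X2)
```

The heavier positive weight in the second stage (6:1 versus 2:1) penalizes losing a true leukocyte more than keeping an impurity, in line with the stated aim of removing false detections while retaining the leukocytes.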

Claims

1. A machine-learning-based method for automatically identifying leukocytes in human feces, comprising the following steps:

step 2: converting the image from step 1 to a grayscale image;

step 6: labeling the connected regions of the binary image from step 5, and computing the area and circumscribed rectangle of each connected region; retaining regions whose area is larger than S2, whose area occupies more than proportion D1 of the circumscribed rectangle, and whose circumscribed-rectangle width and height are both larger than H1 and smaller than H2; S2 ranges from 1400 to 1550 pixels, D1 from 60% to 80%, H1 from 30 to 50 pixels, and H2 from 100 to 120 pixels;

step 8: computing the minimum circumscribed circle of the binary image from step 7, and retaining binary images whose binary region occupies more than proportion D2 of the circle's area; D2 ranges from 72% to 90%;

step 9: recomputing and correcting the circumscribed-rectangle coordinates of the binary image from step 8; retaining regions whose circumscribed-rectangle width and height are both larger than H3 and whose absolute width-height difference is smaller than H4; H3 ranges from 50 to 57 pixels, and H4 from 12 to 18 pixels;

step 13-1: first performing principal component analysis (PCA) on the feature vectors, retaining only 99% of the principal components;

step 13-2: applying 'L2' normalization to the features obtained in step 13-1;

step 13-3: standardizing the features obtained in step 13-2;

step 15: training a support vector machine (SVM) model on the training set from step 14 to obtain classification model 1;

step 16-1: taking all negative samples falsely detected by model 1 in step 15 as the negative sample set of model 2, the positive sample set being the scaled, manually boxed leukocytes;

step 16-2: extracting HOG (Histogram of Oriented Gradients) features from the grayscale images of the sample set; the cell size is 8 × 8 with 9 gradient orientations, the block size is 2 × 2, and the stride is 8, yielding 3600-dimensional feature vectors;

step 16-3: applying the data preprocessing of step 13 to the sample set features to obtain 2320-dimensional feature vectors;

step 16-4: training an SVM with the same training method as model 1 to obtain classification model 2, which further removes falsely detected impurities while retaining the leukocytes; an RBF kernel is used, gamma is selected automatically, the penalty coefficient C is 1, the positive-to-negative sample weight ratio is 6:1, and all other parameters are defaults;

step 19: extracting HOG features from the color images classified as leukocytes in step 18, preprocessing them by the method of step 13, and feeding the processed feature vectors to classifier 2 to obtain the final classification result;

2. The machine-learning-based method for automatically identifying leukocytes in human feces according to claim 1, wherein in the local binarization of step 3 the local region is a square with a side length of 11 pixels, the threshold is the local gray mean, and the threshold scaling coefficient is 0.98.

3. The machine-learning-based method for automatically identifying leukocytes in human feces according to claim 1, wherein the sharpness calculation in step 10 uses a re-blur (secondary blur) method comprising the following steps:

step 10-1: computing the gradient value at each pixel of the grayscale image to obtain the original gradient map;

step 10-2: applying Gaussian blur to the grayscale image, with a 9 × 9 template and a variance of 1.5;

step 10-3: computing the gradient value at each pixel of the grayscale image blurred in step 10-2 to obtain the blurred gradient map;

step 10-4: computing the mean absolute difference between the original gradient map from step 10-1 and the blurred gradient map from step 10-3 to obtain the sharpness value.

4. The machine-learning-based method for automatically identifying leukocytes in human feces according to claim 1, wherein the specific method of step 12 is as follows:

scaling the image to 80 × 80, and then extracting Gabor features and LBP features from each of the three color channels; for the Gabor features, the scale is 15; 8 orientations are used, {0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°}; the wavelength is π/4; the spatial aspect ratio is 1.0; the standard deviation is 1.5; the phase offset is 0; and 8× downsampling is applied to reduce the feature dimension; the LBP features use the uniform-pattern LBP; the Gabor features and LBP features are then combined.

5. The machine-learning-based method for automatically identifying leukocytes in human feces according to claim 1, wherein in step 15:

when training the SVM model, an RBF kernel is used, gamma is selected automatically, the penalty coefficient C is 2, and the weight ratio of positive to negative samples is 2:1.