CN112686329A

CN112686329A - Electronic laryngoscope image classification method based on dual-core convolution feature extraction

Info

Publication number: CN112686329A
Application number: CN202110013954.6A
Authority: CN
Inventors: 陈皓; 闫庆元; 郝马阳
Original assignee: Xian University of Posts and Telecommunications
Current assignee: Xian University of Posts and Telecommunications
Priority date: 2021-01-06
Filing date: 2021-01-06
Publication date: 2021-04-20

Abstract

The invention discloses an electronic laryngoscope image classification method based on dual-core convolution feature extraction, belonging to the field of computer vision and pattern recognition. The method comprises the following steps: first, the laryngoscope video is preprocessed by splitting it into frames and cropping each image so that only the effective image information is retained, and the image is then resized to 224 × 224; second, a deep convolutional neural network capable of capturing detailed image information is designed, the image is input into the network, and high-order image features carrying detail information are extracted; finally, the extracted image features are used to train an ensemble classifier with the extreme gradient boosting (Xgboost) ensemble method, which yields the classification result for the electronic laryngoscope image. The extracted image features are rich in detail, such as texture features, shape features and position information, and combining them with the Xgboost ensemble classification method effectively improves the accuracy of electronic laryngoscope image classification.

Description

Electronic laryngoscope image classification method based on dual-core convolution feature extraction

Technical Field

The invention belongs to the field of computer vision and pattern recognition. Specifically, it discloses a deep Convolutional Neural Network (CNN) capable of extracting rich detail features, together with an extreme gradient boosting (Xgboost) ensemble classification method, for classifying and recognizing electronic laryngoscope images. By exploiting low-order texture features, shape features, position information and the like, the method can classify and recognize parts of the electronic laryngoscope image whose visual signs are not obvious, thereby improving recognition of the nasopharynx and closed vocal cord regions and raising the overall classification accuracy.

Background

The electronic laryngoscope is the main auxiliary tool with which otolaryngologists diagnose diseases of the ear, nose and throat, and analysis of its images is a direct reference for their diagnosis. An electronic laryngoscopy examination produces a large number of video images, among which clear, high-quality pictures of the major organ sites are the primary basis on which technicians write examination reports and doctors diagnose disease. At present, pictures of the different sites are mainly captured manually by eye during the examination, which leads to missed selections, low efficiency and poor reliability. Automatic classification of electronic laryngoscope images by computer-aided methods is therefore an important means of improving diagnostic efficiency and accuracy, and is also a mainstream direction of intelligent medical care.

Current research on electronic laryngoscope image classification falls roughly into two types. The first uses traditional image classification: several low-order image features of the laryngeal image are extracted, such as color features, texture features, geometric shape features, image intensity gradient direction and frequency content, and a Support Vector Machine (SVM) is then trained as a classification model to classify and recognize the laryngeal sites in the electronic laryngoscope image. This approach extracts only low-order image features, not deeper ones, and the model is prone to overfitting. The second type uses deep learning: an existing classical convolutional neural network or a transfer learning method classifies and recognizes the electronic laryngoscope image. However, such methods often ignore the particular characteristics and detail information of electronic laryngoscope images, which tends to result in low classification efficiency and accuracy.

Disclosure of Invention

To address these problems in existing electronic laryngoscope image classification and recognition, the invention provides a deep CNN that extracts high-order image features carrying detail information and enhanced low-order features. The network extracts image features rich in detail, and an Xgboost ensemble method then performs the classification, which effectively improves the efficiency and accuracy with which a deep learning model classifies and recognizes electronic laryngoscope images. To achieve this purpose, the technical scheme of the invention is as follows:

Step 1: Data preprocessing. Specifically:

The acquired electronic laryngoscope video is separated frame by frame to obtain all of its image frames, which constitute the electronic laryngoscope images. The images fall into 6 classes: nasal cavity, nasopharynx, epiglottis, vocal cord closure, vocal cord opening and extracorporeal/blurred. The redundant part of each image is cropped, removing the black area around the image that contains no laryngoscope content and keeping the useful image information. In addition, the retained image data is flipped, color-adjusted and randomly cropped using TensorFlow data augmentation methods to expand the training data set, and each image is resized to 224 × 224 using the resize method.
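The cropping and flipping steps can be sketched as follows. This is a minimal pure-Python illustration, not the patented implementation: the grayscale nested-list image, the `threshold` value and the function names `crop_black_border` and `flip_horizontal` are assumptions made for the example, and the TensorFlow augmentation and the resize to 224 × 224 are omitted.

```python
def crop_black_border(image, threshold=10):
    """Return the smallest rectangle containing every pixel brighter than threshold.

    `image` is a list of rows of grayscale values (0-255). The threshold of 10
    is an assumed cut-off for "black", not a value taken from the source.
    """
    rows = [r for r, row in enumerate(image) if any(v > threshold for v in row)]
    cols = [c for c in range(len(image[0]))
            if any(row[c] > threshold for row in image)]
    if not rows or not cols:
        return []  # the frame is entirely black
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def flip_horizontal(image):
    """Mirror the image left to right, one simple augmentation."""
    return [row[::-1] for row in image]


# Toy 4 x 5 frame with a black border around a 2 x 2 region of content.
frame = [
    [0, 0, 0, 0, 0],
    [0, 50, 80, 0, 0],
    [0, 60, 90, 0, 0],
    [0, 0, 0, 0, 0],
]
cropped = crop_black_border(frame)
flipped = flip_horizontal(cropped)
```

Here `cropped` keeps only the bright 2 × 2 region and `flipped` is its mirror image; a real pipeline would operate on color frames and follow this with the resize step.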

Step 2: Extract image features using the dual-core convolution method. Specifically:

Step 2.1: The processed electronic laryngoscope image is input into a convolution layer with a 3 × 3 convolution kernel to construct feature sequence 1. To provide richer features for classifying the electronic laryngoscope image, image features are extracted using convolution layers with kernel sizes of 3 and 5 respectively, each of which performs a two-dimensional convolution. A Relu activation function is used in every convolution layer. Besides providing the activation itself, Relu sets the output of some neurons to 0, so that those neurons are not activated and have no effect; the network becomes sparse and its computational efficiency improves. A Batch Normalization (BN) layer is also used in the network to normalize the features produced by the convolution layers, which speeds up convergence during model training and improves model precision. Every convolution kernel is two-dimensional and is computed in the same way; the formula is as follows:

wherein Con_d(i, j) denotes the two-dimensional convolution; d denotes the convolution kernel size; X_k denotes the kth input matrix; W_k denotes the kth weight matrix; b denotes the bias term; k denotes the number of input filters; i denotes the abscissa of the image matrix; and j denotes the ordinate of the image matrix.
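Formula (1) itself does not appear in the text above. A standard multi-channel two-dimensional convolution consistent with these symbol definitions would read as follows. This is a hedged reconstruction rather than the original formula: the upper limit K (the number of input filters), the offsets m and n, and the summation ranges are assumptions.

```latex
Con_d(i,j) = \sum_{k=1}^{K} \sum_{m=0}^{d-1} \sum_{n=0}^{d-1} X_k(i+m,\, j+n)\, W_k(m,n) + b \tag{1}
```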

The Relu activation function is used in the convolutional layer; its formula is as follows:

Relu_d(i,j) = max(0, Con_d(i,j)) (2)

wherein Relu_d(i, j) denotes the Relu activation function and max denotes the maximum operator over a set of elements. The batch normalization formula is as follows:

wherein BN_d(i, j) denotes batch normalization; E[·] denotes the mean of the input matrix; and Var[·] denotes the variance of the input matrix.
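The three operations described above (convolution, Relu and normalization) can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions: it handles a single input channel with no padding, the normalization uses the standard form (x − E[x]) / sqrt(Var[x] + eps) because formula (3) does not appear in the text, and the function names, the `eps` value and the toy matrices are all hypothetical.

```python
import math


def conv2d(x, w, b=0.0):
    """Valid 2-D convolution of one channel (cross-correlation, as CNN libraries compute it)."""
    d = len(w)
    out_h = len(x) - d + 1
    out_w = len(x[0]) - d + 1
    return [[sum(x[i + m][j + n] * w[m][n] for m in range(d) for n in range(d)) + b
             for j in range(out_w)]
            for i in range(out_h)]


def relu(x):
    """Formula (2): clamp negative responses to zero."""
    return [[max(0.0, v) for v in row] for row in x]


def batch_norm(x, eps=1e-5):
    """Subtract the matrix mean and divide by its standard deviation (eps is assumed)."""
    values = [v for row in x for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[(v - mean) / math.sqrt(var + eps) for v in row] for row in x]


image = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
edge_kernel = [[1.0, 0.0, -1.0]] * 3                       # responds to horizontal intensity change
identity_kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

features = conv2d(image, edge_kernel)                      # 2 x 2 response map
activated = relu(features)                                 # negative responses become 0
normalized = batch_norm(conv2d(image, identity_kernel))    # zero-mean 2 x 2 map
```

A 3 × 3 kernel on a 4 × 4 input yields a 2 × 2 map here; a real network would use padding and many channels so that the 3 × 3 and 5 × 5 branches keep matching spatial sizes.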

The layers used in the network are as follows:

Conv_layer(kernel_size = 5) + BN + Relu (4)

Conv_layer(kernel_size = 3) + BN + Relu (5)

Conv_layer(kernel_size = 1) + BN + Relu (6)

MaxPooling_layer(pool_size = 2) (7)

Step 2.2: The convolution features of the previous layer and the features extracted by the dual-core CNN are fused into a new feature. Passing the low-order features of the image on to the next layer provides the next unit with a variety of low-order feature information, such as texture, position and shape, and improves the propagation of low-order features. The model thereby learns high-order features that contain detail information, which improves its perception of image details. A pooling operation of size 2 is then applied to the fused high-dimensional features to reduce their dimensionality and compress them, which speeds up training and improves the fault tolerance of the model. The feature fusion formula is as follows:

where Output(i, j) denotes the feature-fusion output obtained from the different convolutions; BN₃(i, j) denotes the matrix obtained by convolution with a kernel size of 3 followed by normalization; BN₅(i, j) denotes the matrix obtained by convolution with a kernel size of 5 followed by normalization; Concatenate denotes feature concatenation; the operator symbol in the formula denotes the splicing operation; and Con_d(i, j) denotes the two-dimensional convolution. The pooling layer formula is as follows:

MaxPooling(i,j) = max(Output(i,j)) (9)

where MaxPooling(i, j) denotes the max-pooled output.
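The fusion and pooling of Step 2.2 can be illustrated with a small pure-Python sketch. It assumes that both branches produce feature maps of the same spatial size, that fusion means concatenation along the channel axis, and that pooling uses non-overlapping windows; the function names and the toy values are hypothetical.

```python
def concatenate(*branches):
    """Join branch outputs along the channel axis (each branch is a list of 2-D maps)."""
    fused = []
    for branch in branches:
        fused.extend(branch)
    return fused


def max_pool(channel, size=2):
    """Non-overlapping max pooling over size x size windows, as in formula (9)."""
    return [[max(channel[i + m][j + n] for m in range(size) for n in range(size))
             for j in range(0, len(channel[0]) - size + 1, size)]
            for i in range(0, len(channel) - size + 1, size)]


# One toy 4 x 4 channel from each branch (values are made up for illustration).
bn3 = [[[1, 2, 0, 1], [3, 4, 1, 0], [0, 1, 5, 6], [1, 0, 7, 8]]]   # 3 x 3 kernel branch
bn5 = [[[9, 0, 2, 2], [0, 0, 2, 3], [4, 4, 0, 0], [4, 5, 0, 1]]]   # 5 x 5 kernel branch

output = concatenate(bn3, bn5)              # 2 channels of 4 x 4
pooled = [max_pool(ch) for ch in output]    # 2 channels of 2 x 2
```

Concatenation doubles the channel count while pooling halves each spatial dimension, which is the dimensionality reduction the step describes.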

Step 2.3: Feature sequence 1 and feature sequence 2 are further fused and enhanced. A convolution layer with a 1 × 1 kernel and a pooling layer with a 2 × 2 window then perform convolution and pooling on the fused features to obtain the further image features of step 2.3, which supply more high-order image features to the subsequent classification model training and yield features with strong semantics.

Steps 2.2 and 2.3 are repeated four times. The features of each layer are fused and enhanced and then input into the next module, providing richer features for electronic laryngoscope image classification. The final pooling operation is replaced by a fully connected layer, which supplies a variety of texture and position features to the next module.

Step 3: Train the classification network model. Specifically:

An Xgboost ensemble learner is used as the classifier of the model, and model training yields an electronic endoscope image classification model with high precision and good generalization. The image features obtained in Step 2 are input into Xgboost to train the classifier. A random mini-batch training strategy is adopted: before each training pass, images are drawn at random to form a mini-batch of size 60. During the experiment, the raw data is divided into training, validation and test sets at a ratio of 7:2:1, and 10-fold cross-validation is used to evaluate the predictive performance of the model. The Relu activation function is used in the feature extraction stage, and the Adam optimization method is used to optimize the parameters during training. To avoid overfitting, Dropout is applied before the fully connected layer of the network and L2 regularization is applied to all weight parameters. Throughout training, the Dropout rate and the L2 regularization weight are set to 0.5 and 0.0005 respectively, the initial learning rate is 0.0001, and the number of iterations is 500, which yields the final fused feature.
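The 7:2:1 split and the mini-batches of size 60 can be sketched as follows. This is a minimal standard-library illustration: the data set size of 1000, the fixed seeds and the function names `split_indices` and `mini_batches` are assumptions, and the Xgboost training itself is not shown.

```python
import random


def split_indices(n, ratios=(7, 2, 1), seed=0):
    """Shuffle sample indices and split them into train/validation/test by ratio."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


def mini_batches(indices, batch_size=60, seed=0):
    """Yield randomly ordered batches; the last batch may be smaller."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


train, val, test = split_indices(1000)    # 1000 is a hypothetical data set size
batches = list(mini_batches(train))
```

With 1000 samples this gives 700 training, 200 validation and 100 test indices, and the training set is covered by eleven full batches of 60 plus one batch of 40.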

Drawings

FIG. 1 is an image of a data pre-processing stage in an embodiment of the present invention;

FIG. 2 is a view showing the nasopharyngeal area in an electronic laryngoscope image according to an embodiment of the invention;

FIG. 3 is a diagram of a network architecture of the present invention in an embodiment of the present invention;

FIG. 4 is a view showing the visualization of the feature of the nasopharyngeal area in the embodiment of the present invention;

FIG. 5 is a confusion matrix of recognition results in an embodiment of the present invention;

FIG. 6 is an overall process of the present invention for classifying electronic laryngoscope video images;

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

Step 1: Data preprocessing, as shown in FIG. 1.

Step 2: First, a picture to be processed is taken from Step 1, as shown in FIG. 2, and input into the dual-core convolution feature extraction network of FIG. 3. Second, to show the effect of the network's feature extraction, the features of the layer preceding the fully connected layer are obtained and visualized, giving the result shown in FIG. 4. Finally, the image is passed through the fully connected layer to obtain a 1 × 512-dimensional image feature F.

F = [0, 0, 0.24433999, 0, 0.7739287, 2.735432, 1.2492355, 0, 0, 5.9589076, …, 2.735432, 1.2492355, 0, 0, 0, 5.9589076]

Step 3: The feature F obtained in Step 2 is input into the trained Xgboost classifier to obtain the probabilities P of the predicted categories.

P = [1.0717337e-02, 9.7289306e-01, 1.3869060e-02, 2.4896192e-03, 3.8891599e-06, 2.7013968e-05]

P shows that the maximum value is 9.7289306e-01; given the order of the labels, FIG. 2 is therefore predicted to be nasopharynx, in full agreement with the true label.
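Reading the prediction off P amounts to taking the index of the largest probability. The sketch below uses the P values given above; the order of `CLASS_NAMES` is an assumption taken from the order in which the six classes are listed in Step 1, and the function name `predict_label` is hypothetical.

```python
# Assumed label order, following the class list in Step 1.
CLASS_NAMES = ["nasal cavity", "nasopharynx", "epiglottis",
               "vocal cord closure", "vocal cord opening", "extracorporeal/blurred"]

# Class probabilities P from the embodiment.
P = [1.0717337e-02, 9.7289306e-01, 1.3869060e-02,
     2.4896192e-03, 3.8891599e-06, 2.7013968e-05]


def predict_label(probabilities, class_names):
    """Return the name and probability of the most likely class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best], probabilities[best]


label, confidence = predict_label(P, CLASS_NAMES)
```

Under the assumed ordering, the largest entry is the second one, which corresponds to the nasopharynx class named in the text.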

Step 4: To verify the performance of the model, 3382 images from the 6 categories (nasal cavity, nasopharynx, epiglottis, vocal cord closure, vocal cord opening and blurred) were input into the trained model for prediction, and a confusion matrix was drawn to evaluate the performance of the model, as shown in FIG. 5. The accuracies for nasal cavity, nasopharynx, epiglottis, vocal cord closure, vocal cord opening and blurred are 96.180%, 84.399%, 97.104%, 87.742%, 87.290% and 99.572%, respectively.
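Per-class accuracy can be read from a confusion matrix as the diagonal count divided by the row total. The sketch below is illustrative only: the actual matrix of FIG. 5 is not reproduced in the text, so `toy_confusion` is a hypothetical 3-class example, and the convention that rows are true classes is an assumption. The unweighted mean of the six reported accuracies is computed for convenience and is not a figure stated in the source.

```python
def per_class_accuracy(confusion):
    """Diagonal count divided by row total, for a matrix whose rows are true classes."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


# Hypothetical 3-class confusion matrix, for illustration only.
toy_confusion = [
    [48, 2, 0],
    [5, 40, 5],
    [0, 1, 99],
]
toy_accuracy = per_class_accuracy(toy_confusion)

# Per-class accuracies reported in Step 4, in percent.
reported = [96.180, 84.399, 97.104, 87.742, 87.290, 99.572]
macro_average = sum(reported) / len(reported)   # unweighted mean over the 6 classes
```

The reported figures show the two weakest classes to be nasopharynx and vocal cord opening, which is the kind of pattern a confusion matrix makes visible.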

Claims

1. An electronic laryngoscope image classification method based on dual-core convolution feature extraction, comprising the following steps:

Step 1: data preprocessing, specifically:

separating the collected electronic laryngoscope video frame by frame to obtain all image frames of the laryngoscope video as electronic laryngoscope images, wherein the electronic laryngoscope images comprise 6 classes: nasal cavity, nasopharynx, epiglottis, vocal cord closure, vocal cord opening and blurred; cropping the redundant part of each image, removing the black area around the image that contains no laryngoscope content, and keeping the useful image information; in addition, flipping, color-adjusting and randomly cropping the retained image data using TensorFlow data augmentation methods to expand the training data set, and resizing each image to 224 × 224 using the resize method;

Step 2: extracting image features using the dual-core convolution method, specifically:

Step 2.1: inputting the processed electronic laryngoscope image into a convolution layer with a 3 × 3 convolution kernel to construct feature sequence 1; to provide richer features for classifying the electronic laryngoscope image, extracting image features using convolution layers with kernel sizes of 3 and 5 respectively, each of which performs a two-dimensional convolution; a Relu activation function is used in every convolution layer, which, besides providing the activation itself, sets the output of some neurons to 0, so that those neurons are not activated and have no effect, the network becomes sparse and its computational efficiency improves; a Batch Normalization (BN) layer is used in the network to normalize the features produced by the convolution layers, which speeds up convergence during model training and improves model precision; every convolution kernel is two-dimensional and is computed in the same way, and the formula is as follows:

wherein Con_d(i, j) denotes the two-dimensional convolution, d denotes the convolution kernel size, X_k denotes the kth input matrix, W_k denotes the kth weight matrix, b denotes the bias term, k denotes the number of input filters, i denotes the abscissa of the image matrix, and j denotes the ordinate of the image matrix; the Relu activation function used in the convolution layer is as follows:

Relu_d(i,j) = max(0, Con_d(i,j)) (2)

wherein Relu_d(i, j) denotes the Relu activation function and max denotes the maximum operator over a set of elements; the batch normalization formula is as follows:

wherein BN_d(i, j) denotes batch normalization; E[·] denotes the mean of the input matrix; and Var[·] denotes the variance of the input matrix;

the layers used in the network are as follows:

Conv_layer(kernel_size = 5) + BN + Relu (4)

Conv_layer(kernel_size = 3) + BN + Relu (5)

Conv_layer(kernel_size = 1) + BN + Relu (6)

MaxPooling_layer(pool_size = 2) (7)

Step 2.2: fusing the convolution features of the previous layer and the features extracted by the dual-core CNN into a new feature; passing the low-order features of the image on to the next layer, so that the next unit is provided with a variety of low-order feature information such as texture, position and shape, and the propagation of low-order features is improved; the model thereby learns high-order features that contain detail information, which improves its perception of image details; then applying a pooling operation of size 2 to the fused high-dimensional features to reduce their dimensionality and compress them, which speeds up training and improves the fault tolerance of the model; the feature fusion formula is as follows:

wherein Output(i, j) denotes the feature-fusion output obtained from the different convolutions; BN₃(i, j) denotes the matrix obtained by convolution with a kernel size of 3 followed by normalization; BN₅(i, j) denotes the matrix obtained by convolution with a kernel size of 5 followed by normalization; Concatenate denotes feature concatenation; the operator symbol in the formula denotes the splicing operation; Con_d(i, j) denotes the two-dimensional convolution; the pooling layer formula is as follows:

MaxPooling(i,j) = max(Output(i,j)) (9)

wherein MaxPooling(i, j) denotes the max-pooled output;

Step 2.3: further fusing and enhancing feature sequence 1 and feature sequence 2, then performing convolution and pooling on the fused features using a convolution layer with a 1 × 1 kernel and a pooling layer with a 2 × 2 window to obtain the further image features of step 2.3, which supply more high-order image features to the subsequent classification model training and yield features with strong semantics;

repeating step 2.2 and step 2.3 four times, fusing and enhancing the features of each layer and inputting them into the next module, so as to provide richer features for electronic laryngoscope image classification, and replacing the final pooling operation with a fully connected layer, which supplies a variety of texture and position features to the next module;

Step 3: training the classification network model, specifically:

using an extreme gradient boosting (Xgboost) ensemble learner as the classifier of the model and training it to obtain an electronic endoscope image classification model with high precision and good generalization; inputting the image features obtained in Step 2 into Xgboost to train the classifier, adopting a random mini-batch training strategy in which, before each training pass, images are drawn at random to form a mini-batch of size 60; during the experiment, dividing the raw data into a training set, a validation set and a test set at a ratio of 7:2:1; using 10-fold cross-validation during the experiment to evaluate the predictive performance of the model; using the Relu activation function in the feature extraction stage and optimizing the parameters with the Adam optimization method during training; to avoid overfitting, applying Dropout before the fully connected layer of the network and applying L2 regularization to all weight parameters; throughout training, setting the Dropout rate and the L2 regularization weight to 0.5 and 0.0005 respectively, the initial learning rate to 0.0001, and the number of iterations to 500, so as to obtain the final fused feature.