CN116071589A - Endoscope smoke image classification method based on deep learning - Google Patents

Endoscope smoke image classification method based on deep learning

Info

Publication number
CN116071589A
Authority
CN
China
Prior art keywords
training
layer
image
foggy
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310076261.0A
Other languages
Chinese (zh)
Inventor
庞宇
王鲲
王慧倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202310076261.0A
Publication of CN116071589A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an endoscope smoke image classification method based on deep learning, belonging to the field of image processing and comprising the following steps: S1: selecting laparoscope images and rendering a portion of them to obtain a foggy data set; using the foggy data set and the unrendered fog-free data set as a training set and a test set, wherein the ratio of fog-free to foggy images in both the training set and the test set is 4:1; S2: improving the Poolformer network by replacing its token mixer with a ConvNext-like multi-path branch structure, and training this network with the training set; S3: converting the ConvNext Block into a RepVgg-like single-path structure, RepConvNext Block, for inference; S4: passing the input image through the cascaded network to output probability values for the smoke categories, and confirming the class of the suspected smoke image from those probability values.

Description

Endoscope smoke image classification method based on deep learning
Technical Field
The invention belongs to the field of image processing, and relates to an endoscope smoke image classification method based on deep learning.
Background
With continued economic development, even most hospitals in remote areas are now equipped with endoscopes. Taking laparoscopic surgery as an example, compared with traditional open surgery with a large wound, endoscopic surgery reaches the affected area through a very small incision: a small cold-light-source lens is inserted, and the condition of the affected area is observed on an external screen display through an image transmission system, greatly reducing the patient's pain. During the operation, however, the smoke generated when an advanced plasma scalpel cuts focal tissue significantly blurs the image of the target area and seriously hinders accurate treatment by the doctor. Assisting diagnosis by means of image processing is therefore highly necessary.
Smoke removal is an active research field in many scenarios, such as the processing of smoke images by consumer imaging equipment, smoke denoising for traffic monitoring facilities, and the purification of surgical smoke. With the continuous spread of endoscopic equipment as the economy develops, surgical smoke removal has become a research hotspot and is mainly realized with deep-learning-based methods.
Deep-learning-based methods mainly use an end-to-end network model to purify noisy images directly. Tang et al. used U-Net as the basic framework without task-specific optimization, which causes overexposure of the image after denoising; Mohammed et al. added image pyramid decomposition to the downsampling part of U-Net to optimize noise reduction and retain more image detail. Beyond U-Net-based end-to-end models, Divakar proposed an adversarially trained convolutional-neural-network denoising model, which combines multi-scale features with regularization in a GAN training scheme to achieve image denoising.
In traditional medical practice, judgment of key target areas during or even after an operation relies heavily on the clinical experience of doctors. Such judgment is essentially statistical, so doctors must accumulate rich experience to diagnose accurately. In partially underdeveloped areas, however, it is difficult for young doctors to accumulate useful experience in practice without the guidance of excellent physicians, bringing greater risk and hidden danger to patients. AI-assisted diagnosis of medical images is therefore highly desirable, reducing image smoke noise in real time to maintain a clear view and help doctors identify useful information more accurately. However, to improve prediction accuracy, the depth and width of existing neural networks keep increasing while computing resources remain limited. Many hospitals in remote areas cannot afford large-scale high-performance computing equipment, so the complexity of the network structure must be reduced while accuracy is maintained, ensuring rapid identification of endoscope images and lowering the hardware requirements.
Existing defogging methods adopt an end-to-end U-Net architecture or an adversarial neural network to purify smoke directly, defogging every intraoperative endoscope image. However, smoke is not generated at every moment of an operation; running the smoke-purification pipeline on fog-free frames greatly increases the computational load and wastes equipment resources. Classifying images as foggy or fog-free before purifying only the foggy ones can therefore greatly reduce the demand on equipment resources and improve the real-time performance of defogging.
Disclosure of Invention
In view of the above, the present invention aims to provide a Poolformer-based endoscope smoke image classification method that improves the model in two respects: the token mixer in the encoder is upgraded from a simple pooling layer to a multi-path branch structure similar to the pure convolutional neural network ConvNext, and this structure is converted into a single-branch model at inference time to improve the speed at which the model processes images.
In order to achieve the above purpose, the present invention provides the following technical solutions:
An endoscope smoke image classification method based on deep learning comprises the following steps:
S1: selecting laparoscope images and rendering a portion of them to obtain a foggy data set; using the foggy data set and the unrendered fog-free data set as a training set and a test set, wherein the ratio of fog-free to foggy images in both the training set and the test set is 4:1;
S2: improving the Poolformer network by replacing its token mixer with a ConvNext-like multi-path branch structure, and training this network with the training set;
S3: converting the ConvNext Block into a RepVgg-like single-path structure, RepConvNext Block, for inference;
S4: passing the input image through the cascaded network to output probability values for the smoke categories, and confirming the class of the suspected smoke image from those probability values.
Further, in step S2, the Poolformer network whose token mixer is replaced by the ConvNext-like multi-path branch structure is trained as follows: the encoder comprises a downsampling module and a ConvNext Block module. The downsampling module consists of a Layer Norm layer and a convolution layer with a 2×2 kernel and stride 2; its channel number matches the input image data. For data input as H×W×C, the ConvNext Block module outputs H×W×dim through a first convolution layer (kernel size 7×7, stride 1, padding 3, channel number dim) and a Layer Norm layer; outputs H×W×4dim through a second convolution layer (kernel size 1×1, stride 1, channel number 4dim) and a GELU layer; and outputs H×W×dim by fusing a third convolution layer (kernel size 1×1, channel number dim), a Layer Scale layer and a Drop Path layer with the initial input data.
For the encoder, taking ViT-B/16 as an example (the two-dimensional matrix x1 has format [197, 768]), as shown in fig. 3: for an image input of H×W×C, the data in the H and C dimensions are interchanged by the mapping of step (1) in fig. 3 to obtain a matrix x2. The matrix x2 then passes through three convolution modules, each consisting of a downsampling module and a ConvNext Block module (the channel numbers dim of the three ConvNext Block modules are 197, 394 and 788 in turn); since each downsampling halves the spatial size, the output has size (H/8)×(W/8)×788. The data in the H and C dimensions are then exchanged again by the mapping of step (2) in fig. 3 to obtain a matrix x3, and a Linear layer finally outputs an H×W×C image. Each layer of the encoder extracts different features of the smoke image, and the multi-layer downsampling operation extracts features from different frequency domains of the image.
To classify the label data identified by the model during training, the foggy image label is set to 1 and the fog-free image label to 0 in the present invention. The training output is passed through a classifier to obtain a group of vectors containing probability values, and fitting is achieved through back propagation. An input endoscope image is forward-propagated through the network to obtain the probability values of the two categories, from which it is judged whether the image is foggy.
The model uses the GELU activation function and the Adam optimizer, with 100 epochs, an initial learning rate of 0.001, a batch size of 32 and patch size = 16; 10-fold cross-validation confirms the reliability of the training effect.
Further, in step S3, the workflow of RepConvNext Block is as follows: the ConvNext Block used in training is converted into the structure shown in fig. 4 for inference. This operation does not change the basic parameters learned by the ConvNext Block module during training, but a single-path branch network is used without fusion when predicting the classification result, which saves memory, speeds up inference and further improves real-time performance.
In step S4, the output of the L-layer Transformer encoder is aggregated through sequence pooling; the data sequence contains information about the different parts of the input image together with category information. Sequence pooling embeds the sequence over the latent space generated by the Transformer encoder, and the sequence-pooled output is finally passed through the linear classifier to obtain the result.
The invention has the beneficial effects that: in the field of deep-learning classification and recognition, evaluating a model requires performance metrics such as Acc (accuracy), Sens (sensitivity) and inference speed in fps (images processed per second). Accuracy and sensitivity are two metrics widely used in information retrieval and statistical classification to evaluate the quality of results, while inference time is the key measure of model speed.
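As a concrete reference (not part of the patent text), both metrics follow directly from the binary confusion matrix, with foggy = 1 as the positive class as set during training; a minimal Python sketch:

```python
def accuracy_sensitivity(preds, labels):
    """Acc and Sens for binary smoke classification (1 = foggy, 0 = fog-free)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))  # foggy caught
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))  # fog-free caught
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))  # foggy missed
    acc = (tp + tn) / len(labels)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    return acc, sens
```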
The invention upgrades the token mixer in the encoder from a simple pooling layer to a multi-path branch structure similar to the pure convolutional neural network ConvNext, and converts this structure into a single-branch model at inference time to improve the speed at which the model processes images.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to preferred embodiments and the accompanying drawings, in which:
FIG. 1 is a flow chart of a network architecture;
FIG. 2 is a block diagram of an encoder;
FIG. 3 is a block diagram of a modified Poolformer encoder;
FIG. 4 is a structural diagram of RepConvNext Block;
FIG. 5 is a chart of the evaluation metrics on the test set.
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure of this specification, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or applied through other, different embodiments, and the details in this specification may be modified or varied in various ways without departing from the spirit of the invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention schematically, and the following embodiments and their features may be combined with each other where no conflict arises.
The drawings are for illustrative purposes only; they are schematic rather than physical representations and are not intended to limit the invention. For the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced, and do not represent the size of the actual product. It will be appreciated by those skilled in the art that certain well-known structures in the drawings, and their descriptions, may be omitted.
The same or similar reference numbers in the drawings of the embodiments correspond to the same or similar components. In the description of the present invention, terms such as "upper", "lower", "left", "right", "front" and "rear" indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience and simplification of description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore merely illustrative and should not be construed as limiting the invention; their specific meaning can be understood by those of ordinary skill in the art according to the specific circumstances.
The invention classifies endoscope smoke images based on the Poolformer network. On the basis of this network structure, the Multi-Head Attention module in the encoding block of the traditional Vision Transformer is changed to a simple pooling layer; the pooling layer is then improved into a multi-branch pure-convolutional structure similar to ConvNext and trained on the training set, while for the test set the structure is converted into a single-path model in the predictive inference network that produces the classification result, ensuring both accuracy and real-time performance. The overall flow chart of the invention is shown in fig. 1.
The method model used in the invention mainly comprises the following steps:
S1: To ensure data balance while training on a small data set, the invention uses real laparoscope images from the Hamlyn Centre laparoscopic/endoscopic video dataset. 5000 images are selected, of which 1000 are rendered with smoke to obtain the foggy data set; together with the 4000 unrendered fog-free images, these form the training set (3800 images) and the test set (1200 images). Both sets are uniformly distributed, and the ratio of fog-free to foggy images is 4:1 in each.
S2: The Poolformer network is improved by changing the token mixer part into a ConvNext-like multi-path branch structure, strengthening the characterization of image detail, and the resulting network is trained on the training set images. ConvNext is a pure convolutional neural network architecture whose performance approaches that of Transformer networks while greatly reducing the parameter count; it provides spatial inductive bias, dispenses with positional embeddings, and accelerates network convergence, making the training process more stable. Its design changes the per-stage computation ratio, uses grouped convolution with an inverted-bottleneck macro design, applies larger convolution kernels to small details, and replaces the ReLU activation function with GELU, among other small adjustments. ConvNext achieves faster inference and higher accuracy than Swin Transformer, reaching 87.8% accuracy on ImageNet-22K.
In the most basic Vision Transformer (ViT) model, the encoder typically contains two components: an attention module that mixes information between tokens, called the token mixer, and a channel MLP with residual connections. Ignoring the implementation details of the attention-based token mixer, this architecture can be abstracted into the MetaFormer architecture shown in fig. 2(a). Compared with the traditional ViT model, Poolformer replaces the multi-head attention mechanism with a simple pooling layer, as shown in fig. 2(b). Thanks to the superiority of the overall MetaFormer framework, the Poolformer model reaches 82.1% accuracy on the ImageNet-1K dataset, exceeding DeiT-B and ResMLP-B24 (an MLP architecture), while the pooling layer greatly reduces computation, machine load and required video memory. However, the pooling layer loses some information during dimension reduction, whereas local information in medical images is very important and cannot be discarded lightly; convolutional neural networks retain local information better than pooling layers do. Based on this, the token mixer part is replaced with a ConvNext-like multi-path branch structure, as shown in fig. 2(c).
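To make the MetaFormer abstraction concrete, the hedged PyTorch sketch below shows a generic block whose token mixer is a pluggable module, so that pooling (Poolformer), attention (ViT) or a ConvNext-like mixer can be swapped in; the class name and the (B, N, dim) token layout are illustrative assumptions rather than the patent's code:

```python
import torch.nn as nn

class MetaFormerBlock(nn.Module):
    """Generic MetaFormer block: token mixer + channel MLP, each with
    a residual connection; only the mixer differs between variants."""
    def __init__(self, dim, token_mixer: nn.Module, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = token_mixer          # pooling / attention / ConvNext-like
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(         # channel MLP
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                 # x: (B, N, dim) token sequence
        x = x + self.mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x
```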
The encoder of the Poolformer network comprises a downsampling module and a ConvNext Block module. The downsampling module consists of a Layer Norm layer and a convolution layer with a 2×2 kernel and stride 2, whose channel number matches the input image data. For data input as H×W×C, the ConvNext Block module outputs H×W×dim through a first convolution layer and a Layer Norm layer, where the first convolution layer has a 7×7 kernel, stride 1, padding 3 and channel number dim; outputs H×W×4dim through a second convolution layer and a GELU layer, where the second convolution layer has a 1×1 kernel, stride 1 and channel number 4dim; and outputs H×W×dim by fusing a third convolution layer, a Layer Scale layer and a Drop Path layer with the initial input data, where the third convolution layer has a 1×1 kernel and channel number dim.
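A minimal PyTorch sketch of the two modules as just described follows; note that the 7×7 convolution is made depthwise (groups = dim) after the original ConvNeXt design, which the patent does not state, and plain Dropout stands in for the stochastic-depth Drop Path layer:

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.LayerNorm):
    """Layer norm applied over the channel dim of NCHW feature maps."""
    def forward(self, x):
        x = x.permute(0, 2, 3, 1)            # NCHW -> NHWC
        x = super().forward(x)
        return x.permute(0, 3, 1, 2)         # back to NCHW

class Downsample(nn.Sequential):
    """Layer norm + 2x2 stride-2 convolution, halving H and W."""
    def __init__(self, in_ch, out_ch):
        super().__init__(LayerNorm2d(in_ch),
                         nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2))

class ConvNextBlock(nn.Module):
    def __init__(self, dim, drop_rate=0.0, ls_init=1e-6):
        super().__init__()
        self.conv7 = nn.Conv2d(dim, dim, 7, stride=1, padding=3, groups=dim)
        self.norm = LayerNorm2d(dim)
        self.pw1 = nn.Conv2d(dim, 4 * dim, 1)          # expand to 4*dim
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(4 * dim, dim, 1)          # project back to dim
        self.gamma = nn.Parameter(ls_init * torch.ones(dim, 1, 1))  # Layer Scale
        self.drop = nn.Dropout(drop_rate)              # stand-in for Drop Path

    def forward(self, x):                              # x: (B, dim, H, W)
        y = self.pw2(self.act(self.pw1(self.norm(self.conv7(x)))))
        return x + self.drop(self.gamma * y)           # fuse with the input
```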
The two-dimensional matrix x1 obtained by the convolution and flattening operations of fig. 2 serves as the input sequence of the improved Poolformer encoder. Taking ViT-B/16 as an example (the format of the two-dimensional matrix x1 is [197, 768]), the structure and specific steps are shown in fig. 3. For an image input of H×W×C, the data in the H and C dimensions are interchanged by the mapping of step (1) in fig. 3 to obtain a matrix x2. The matrix x2 then passes through three convolution modules, each consisting of a downsampling module and a ConvNext Block module (the channel numbers dim of the three ConvNext Block modules are 197, 394 and 788 in turn); since each downsampling halves the spatial size, the output has size (H/8)×(W/8)×788. The data in the H and C dimensions are then exchanged again by the mapping of step (2) in fig. 3 to obtain a matrix x3, and a Linear layer finally outputs an H×W×C image. Each layer of the encoder extracts different features of the smoke image, and the multi-layer downsampling operation extracts features from different frequency domains of the image.
To classify the label data identified by the model during training, the foggy image label is set to 1 and the fog-free image label to 0 in the present invention. The training output is passed through a classifier to obtain a group of vectors containing probability values, and fitting is achieved through back propagation. An input endoscope image is forward-propagated through the network to obtain the probability values of the two categories, from which it is judged whether the image is foggy.
The model uses the GELU activation function and the Adam optimizer, with 100 epochs, an initial learning rate of 0.001, a batch size of 32 and patch size = 16; 10-fold cross-validation confirms the reliability of the training effect.
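A hedged training-loop sketch matching these hyperparameters is given below; `model` and `train_loader` are assumed to exist, and the 10-fold cross-validation wrapper is omitted for brevity:

```python
import torch
import torch.nn as nn

def train(model, train_loader, device="cuda"):
    """Labels follow the convention above: 1 = foggy, 0 = fog-free."""
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # initial lr
    for epoch in range(100):                       # 100 epochs
        for images, labels in train_loader:        # batches of 32
            images, labels = images.to(device), labels.to(device)
            logits = model(images)                 # (B, 2) class scores
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                        # back propagation
            optimizer.step()
    # at inference, softmax over the logits yields the two class probabilities
```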
S3: To further improve real-time performance, the ConvNext Block is converted into a RepVgg-like single-path structure, RepConvNext Block, for inference, as shown in fig. 4. This operation does not change the basic parameters learned by the ConvNext Block module during training, but a single-path branch network is used without fusion when predicting the classification result, which saves memory, speeds up inference and further improves real-time performance.
Training with multiple parallel branches increases the representational capacity of the model, while converting the multi-branch model into a single-path model at inference time brings the following advantages.
First, it is faster, mainly on account of the hardware's degree of computational parallelism and the model's MAC (memory access cost) during inference. For a multi-branch model, the hardware must compute the result of each branch separately; some branches finish quickly and others slowly, so the fast branches must wait before the results can be fused, meaning the hardware's computing power is not fully utilized and the degree of parallelism is limited. Each branch also accesses memory once and must write its result back to memory after computation, and this continual reading and writing wastes considerable time on I/O. Second, it saves memory.
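The algebra behind such re-parameterisation can be illustrated with the simplest case, folding an identity shortcut into a 1×1 convolution; the real RepConvNext Block fuses the full structure of fig. 4, which the patent does not spell out, so the sketch below demonstrates only the principle:

```python
import torch
import torch.nn as nn

def fuse_residual_into_conv1x1(conv: nn.Conv2d) -> nn.Conv2d:
    """y = conv(x) + x equals a single 1x1 conv whose kernel gains an
    identity on its diagonal, so the skip branch disappears at inference."""
    assert conv.kernel_size == (1, 1) and conv.in_channels == conv.out_channels
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, 1,
                      bias=conv.bias is not None)
    with torch.no_grad():
        eye = torch.eye(conv.in_channels).view_as(conv.weight)  # (C, C, 1, 1)
        fused.weight.copy_(conv.weight + eye)   # fold the skip branch in
        if conv.bias is not None:
            fused.bias.copy_(conv.bias)
    return fused

# sanity check: identical outputs, single branch at inference
x = torch.randn(1, 8, 4, 4)
conv = nn.Conv2d(8, 8, 1)
assert torch.allclose(conv(x) + x, fuse_residual_into_conv1x1(conv)(x), atol=1e-6)
```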
S4: The input image passes through the cascaded network to output probability values for the smoke categories, and the class of the suspected endoscope smoke image is confirmed from these probability values. In the Vision Transformer (ViT) module, a sequence of tokens (vectors) must be supplied as input to handle images of different sizes. Taking ViT-B/16 as an example, a convolution with kernel size 16×16, stride 16 and 768 convolution kernels is applied directly to the input image, dividing the input image x into 16×16 patches, i.e. each patch is a 16×16 region mapped to a 768-dimensional token. Although a large convolution kernel and stride enlarge the receptive field, allowing wider areas of the feature map to be examined and yielding better global features on large data sets, detail information between patches is easily lost on small data sets such as endoscope images.
Therefore, a convolution-based patching method is introduced, which reduces the loss of detail information; the patch size no longer needs to be fixed, so the model can adapt to data sets of different sizes.
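A hedged sketch of such a convolution-based patch embedding (in the style of compact convolutional transformers) follows; the kernel sizes, strides and two-stage layout are illustrative assumptions, since the patent does not fix them:

```python
import torch.nn as nn

class ConvPatchEmbed(nn.Module):
    """Small overlapping convolutions replace the rigid 16x16 cut,
    preserving detail between neighbouring patches."""
    def __init__(self, in_ch=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1))

    def forward(self, x):                    # (B, 3, H, W) image
        x = self.proj(x)                     # (B, 768, H/4, W/4) feature map
        return x.flatten(2).transpose(1, 2)  # (B, N, 768) token sequence
```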
Meanwhile, after the improved Poolformer encoder, the output of the L-layer Transformer encoder is aggregated through sequence pooling; unlike the traditional ViT model, no separate class token needs to be split off to generate the classification result. The data sequence contains information about the different parts of the input image together with category information, which keeps the model compact; sequence pooling embeds the sequence over the latent space generated by the Transformer encoder and better correlates the input data. Finally, the sequence-pooled output is passed through the linear classifier to obtain the result.
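The sketch below shows sequence pooling in the form popularised by Compact Convolutional Transformers, which matches the description above; the module name and default dimensions are assumptions:

```python
import torch
import torch.nn as nn

class SeqPool(nn.Module):
    """A learned attention map weights every token of the encoder output,
    replacing the class token; a linear classifier consumes the pooled vector."""
    def __init__(self, dim=768, num_classes=2):
        super().__init__()
        self.attn = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, x):                       # x: (B, N, dim) encoder output
        w = torch.softmax(self.attn(x), dim=1)  # (B, N, 1) token weights
        pooled = (w * x).sum(dim=1)             # (B, dim) weighted sum
        return self.classifier(pooled)          # (B, 2) foggy / fog-free logits
```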
As described above, the foggy image label is set to 1 and the fog-free image label to 0 during training; the same GELU activation function, Adam optimizer, 100 epochs, initial learning rate of 0.001, batch size of 32, patch size = 16 and 10-fold cross-validation are used, confirming the reliability of the training effect. The results on the test set are shown in fig. 5.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (4)

1. An endoscope smoke image classification method based on deep learning, the method comprising the steps of:
S1: selecting laparoscope images and rendering a portion of them to obtain a foggy data set; using the foggy data set and the unrendered fog-free data set as a training set and a test set, wherein the ratio of fog-free to foggy images in both the training set and the test set is 4:1;
S2: improving the Poolformer network by replacing its token mixer with a ConvNext-like multi-path branch structure, and training this network with the training set;
S3: converting the ConvNext Block into a RepVgg-like single-path structure, RepConvNext Block, for inference;
S4: passing the input image through the cascaded network to output probability values for the smoke categories, and confirming the class of the suspected smoke image from those probability values.
2. The deep learning based endoscopic smoke image classification method according to claim 1, wherein: the Poolformer network improvement described in step S2 is as follows:
the encoder of the Poolformer network comprises a downsampling module and a ConvNext Block module; the downsampling module consists of a Layer Norm layer and a convolution layer with a 2×2 kernel and stride 2, whose channel number matches the input image data; for data input as H×W×C, the ConvNext Block module outputs H×W×dim through a first convolution layer and a Layer Norm layer, the first convolution layer having a 7×7 kernel, stride 1, padding 3 and channel number dim; outputs H×W×4dim through a second convolution layer and a GELU layer, the second convolution layer having a 1×1 kernel, stride 1 and channel number 4dim; and outputs H×W×dim by fusing a third convolution layer, a Layer Scale layer and a Drop Path layer with the initial input data, the third convolution layer having a 1×1 kernel and channel number dim;
in the training process, the foggy image label is set to 1 and the fog-free image label to 0; the training output is passed through a classifier to obtain a group of vectors containing probability values, and fitting is achieved through back propagation; an input endoscope image is forward-propagated through the network to obtain the corresponding probability values of the two categories, from which it is judged whether the image is foggy;
the GELU activation function and the Adam optimizer are used, with 100 epochs, an initial learning rate of 0.001, a batch size of 32 and patch size = 16; 10-fold cross-validation confirms the reliability of the training effect.
3. The deep learning based endoscopic smoke image classification method according to claim 1, wherein: the workflow in step S3 is as follows: the ConvNext Block used in training is converted into RepConvNext Block for inference; the basic parameters learned by the ConvNext Block module during training are not changed, but a single-path branch network is adopted without fusion when predicting the classification result.
4. The deep learning based endoscopic smoke image classification method according to claim 1, wherein: in step S4, the output of the L-layer Transformer encoder is aggregated through sequence pooling; the data sequence contains information about the different parts of the input image together with category information; sequence pooling embeds the sequence over the latent space generated by the Transformer encoder, and the sequence-pooled output finally passes through the linear classifier to obtain the result.
CN202310076261.0A 2023-01-18 2023-01-18 Endoscope smoke image classification method based on deep learning Pending CN116071589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310076261.0A CN116071589A (en) 2023-01-18 2023-01-18 Endoscope smoke image classification method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310076261.0A CN116071589A (en) 2023-01-18 2023-01-18 Endoscope smoke image classification method based on deep learning

Publications (1)

Publication Number Publication Date
CN116071589A 2023-05-05

Family

ID=86178233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310076261.0A Pending CN116071589A (en) 2023-01-18 2023-01-18 Endoscope smoke image classification method based on deep learning

Country Status (1)

Country Link
CN (1) CN116071589A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116563673A (en) * 2023-07-10 2023-08-08 浙江华诺康科技有限公司 Smoke training data generation method and device and computer equipment
CN116563673B (en) * 2023-07-10 2023-12-12 浙江华诺康科技有限公司 Smoke training data generation method and device and computer equipment
CN116612336B (en) * 2023-07-19 2023-10-03 浙江华诺康科技有限公司 Method, apparatus, computer device and storage medium for classifying smoke in endoscopic image

Similar Documents

Publication Publication Date Title
CN113066026B (en) Endoscope image smoke purification method based on deep neural network
CN111369562B (en) Image processing method, image processing device, electronic equipment and storage medium
CN112085760B (en) Foreground segmentation method for laparoscopic surgery video
CN116071589A (en) Endoscope smoke image classification method based on deep learning
Jin et al. Exploring intra-and inter-video relation for surgical semantic scene segmentation
Xu et al. Dktnet: dual-key transformer network for small object detection
WO2022194152A1 (en) Image processing method and apparatus based on image processing model, and electronic device, storage medium and computer program product
Liao et al. Joint image quality assessment and brain extraction of fetal MRI using deep learning
Chen et al. Endo3d: Online workflow analysis for endoscopic surgeries based on 3d cnn and lstm
Liu et al. Self-supervised depth estimation to regularise semantic segmentation in knee arthroscopy
Pan et al. DeSmoke-LAP: improved unpaired image-to-image translation for desmoking in laparoscopic surgery
Xu et al. Unsupervised binocular depth prediction network for laparoscopic surgery
Zhang et al. Unsupervised depth estimation from monocular videos with hybrid geometric-refined loss and contextual attention
Wang et al. PaI‐Net: A modified U‐Net of reducing semantic gap for surgical instrument segmentation
Lu et al. PKRT-Net: prior knowledge-based relation transformer network for optic cup and disc segmentation
Lin et al. A desmoking algorithm for endoscopic images based on improved U‐Net model
Xue et al. A new weakly supervised strategy for surgical tool detection
Lin et al. CSwinDoubleU-Net: A double U-shaped network combined with convolution and Swin Transformer for colorectal polyp segmentation
Cao et al. Face detection for rail transit passengers based on single shot detector and active learning
CN115965785A (en) Image segmentation method, device, equipment, program product and medium
Su et al. Multi-stages de-smoking model based on CycleGAN for surgical de-smoking
Wang et al. An improved CapsNet applied to recognition of 3D vertebral images
Qin et al. A reconstruction and convolution operations enabled variant vision transformer with gastroscopic images for automatic locating of polyps in Internet of Medical Things
Liu et al. An end to end thyroid nodule segmentation model based on optimized U-net convolutional neural network
CN116229065B (en) Multi-branch fusion-based robotic surgical instrument segmentation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination