CN114092450A - Real-time image segmentation method, system and device based on gastroscopy video - Google Patents

Real-time image segmentation method, system and device based on gastroscopy video

Info

Publication number
CN114092450A
CN114092450A (application CN202111411214.4A)
Authority
CN
China
Prior art keywords
image
network
model
image segmentation
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111411214.4A
Other languages
Chinese (zh)
Inventor
孔德润
董兰芳
董天意
马涛
彭杰
宋绍方
吴艾久
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Zhongna Medical Instrument Co ltd
Original Assignee
Hefei Zhongna Medical Instrument Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Zhongna Medical Instrument Co ltd filed Critical Hefei Zhongna Medical Instrument Co ltd
Priority to CN202111411214.4A priority Critical patent/CN114092450A/en
Publication of CN114092450A publication Critical patent/CN114092450A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06T5/70
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10068Endoscopic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20024Filtering details
    • G06T2207/20032Median filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • G06T2207/20104Interactive definition of region of interest [ROI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30092Stomach; Gastric

Abstract

The invention belongs to the field of image processing, and particularly relates to a real-time image segmentation method, system and device based on gastroscopy video. The method comprises the following steps. S1: constructing a lightweight image segmentation network model based on the Mask-RCNN framework. S2: acquiring gastroscopy images as samples to form an original data set, denoising the images, and then dividing the original data set into a training set and a test set. S3: replacing the IoU and GIoU terms of the original loss function with a CIoU-Loss function and training the network model. S4: testing the network model with the test set and keeping the best-performing model. S5: acquiring gastroscopy video stream data, framing it, performing real-time segmentation on the framed images, and outputting segmentation results for target regions with lesion characteristics. The invention solves the problems that existing image analysis tools are inefficient and inaccurate and cannot meet the demand of processing high-frame-rate, large-scale video stream data.

Description

Real-time image segmentation method, system and device based on gastroscopy video
Technical Field
The invention belongs to the field of image processing, and particularly relates to a real-time image segmentation method, a real-time image segmentation system and a real-time image segmentation device based on gastroscopy videos.
Background
Early detection of gastric cancer relies on gastroscopy, in which the probe of a gastroscopy apparatus is inserted into the body of the examination subject to take endoscopic images of the digestive tract and the stomach wall. Professional examiners then judge from the endoscopic images whether the subject's tissue shows cancerous characteristics and issue the corresponding examination result. The image analysis in existing gastroscopic examinations is mainly performed by specialist physicians. However, professional medical image analysis tools are gradually being applied in the gastroscopy process; these tools can process the acquired original images, extract features of regions at risk of canceration, and segment the target regions at risk, thereby providing auxiliary information for the diagnosis made by the gastroscopy staff.
Most medical image analysis tools applied at present extract features from a gastroscopic image through dimensions such as color, texture, shape and spatial layout. Existing mature techniques can effectively model the image as a feature vector to extract such traditional features. However, most of these methods rely on low-level feature extraction, and the extracted low-level features cannot model the high-level semantics in a gastroscopic image; the tools themselves also lack self-learning and self-adjusting capabilities. Furthermore, the feature content extracted by these conventional medical image analysis tools is relatively fixed and often plagued by high-dimensional information. The identification of early gastric cancer lesions is therefore limited to a certain extent: the indexing and matching process of such analysis tools is extremely inefficient, the image segmentation effect is poor, and the accuracy is low.
In some studies, deep learning based network models have produced better results in medical image analysis. However, most of the deep learning network models have large scale, low image processing rate and extremely high requirements on hardware performance, and are difficult to meet the large-scale data analysis requirements of real-time video stream data.
Disclosure of Invention
To address the problems that the image analysis tools used in the existing gastroscopy process are inefficient, have low accuracy, and cannot meet the requirement of processing high-frame-rate, large-scale video stream data, the invention provides a real-time image segmentation method, a real-time image segmentation system and a real-time image segmentation device based on gastroscopy video.
The invention is realized by adopting the following technical scheme:
a real-time image segmentation method based on gastroscopy video comprises the following steps:
S1: constructing a lightweight image segmentation network model based on a Mask-RCNN framework; medical images from gastroscopy are input into the image segmentation network model, and segmentation results of target regions with lesion characteristics are output. The model construction process comprises the following steps:
S11: acquiring a conventional Mask-RCNN network comprising a backbone network and an ROI (Region of Interest) part.
S12: replacing the backbone network Resnet50 used to extract image features in the Mask-RCNN network with MobileNet.
S13: performing model compression on the network model of the previous step by one or more of model pruning, model quantization or knowledge distillation to obtain the required lightweight image segmentation network model.
S2: acquiring a plurality of real gastroscopy images with canceration characteristics as sample data, and carrying out manual primary screening and denoising on the images; then, amplifying the number of images by an image enhancement method, and manually marking the type and the position of a target region in the amplified images; the marked images form a required original data set; the raw data set is divided into a training set and a test set.
S3: replacing the IoU and GIoU originally used to compute the loss function of the backbone network in the image segmentation network with a CIoU-Loss function; setting training parameters for model training, training the image segmentation network with the training set, and iteratively updating the network parameters until the loss function converges.
S4: and testing the image segmentation network model trained in the previous step by using a test set, and reserving the model with the best test effect as the network model for real-time image segmentation.
S5: acquiring video stream data of gastroscopy, performing framing processing on the video stream data, inputting the framed image into the network model stored in the step S4, performing real-time segmentation processing on the framed image by the image segmentation network model, and outputting a segmentation result of a target region with lesion characteristics.
As a further improvement of the present invention, in step S12, the Resnet50 network may be replaced by any deep-learning-based feature extraction model that can extract color features, texture features, shape features and spatial features from an image and has fewer parameters or a faster processing rate than the Resnet50 network.
As a further improvement of the present invention, in step S13, the model pruning procedure is carried out together with the pre-training of the network model; parameters are continuously adjusted during pre-training, and connections between neurons whose removal affects the accuracy of the network's classification result by less than a preset loss rate are pruned; this continues until the neuron connections in the network model are minimized, that is, until pruning any of the remaining connections would reduce the accuracy of the network model's classification result.
As a further improvement of the present invention, in step S13, the model quantization process is implemented based on the ideas of weight sharing and clustering; assuming K classes are given, i.e. the weight parameters of the network model take K values after quantization, a K-Means clustering operation is performed on the weights to obtain K intervals, all weight parameters are assigned to these K intervals, the values of the corresponding intervals replace the original weight data, and the storage per weight is reduced from 32 bits to log2(K) bits.
As a further improvement of the present invention, the knowledge distillation procedure replaces the original large network model with a small network model having fewer network nodes, without affecting the distribution of the output of the final softmax layer.
As a further improvement of the present invention, in step S2, the manual prescreening removes images in the sample that do not belong to the specified region at all, as well as images whose quality is so poor that even the human eye cannot distinguish cancerous features. Image denoising is performed with the mean filtering, Gaussian filtering and median filtering methods respectively.
The pixel calculation formula of the mean filtering is as follows:
G1(x,y)=∑F(x,y)/m
in the above formula, x and y are the pixel coordinates, F(x, y) is the original pixel value, G1(x, y) is the mean-filtered pixel value; m is the total number of pixels contained in the convolution kernel.
The pixel calculation formula of gaussian filtering is as follows:
G2(x, y) = (1/(2πσ²)) · e^(−(x² + y²)/(2σ²))
in the above formula, x and y are the image pixel coordinates, G2(x, y) is the Gaussian kernel weight applied during filtering; σ is a coefficient that determines the degree of smoothing of the overall image.
The median filtering is to sort all pixels in the neighborhood of a certain pixel point through a statistical sorting filter, and then to take the median value as the pixel of the neighborhood center.
Image enhancement methods that augment the number of images in the original data set include rotation and flipping of the images.
As a further improvement of the present invention, in step S3, the calculation formula of the CIoU-Loss function is:
L_CIoU = 1 − IoU + ρ²(A, B)/c² + αv
in the above equation, IoU is the original intersection-over-union of the predicted position and the actual position; A and B are the rectangular bounding boxes marking the target positions; ρ is the Euclidean distance between the center-point coordinates of box A and box B; c is the diagonal distance of the smallest box enclosing boxes A and B; α is a weight function, and v is a coefficient used to measure the consistency of the aspect ratio;
the calculation formulas of the weight coefficient α and the coefficient v are as follows:
α = v / ((1 − IoU) + v)
v = (4/π²) · (arctan(w_gt/h_gt) − arctan(w/h))²
in the above formulas, w is the predicted width of the bounding box; h is the predicted height of the bounding box; w_gt is the width of the original labeled position box; h_gt is the height of the original labeled position box.
The invention also comprises a real-time image segmentation system based on the gastroscopy video, which adopts the real-time image segmentation method based on the gastroscopy video to process the video stream data in the gastroscopy process so as to segment the target area containing the canceration characteristics in the image. The real-time image segmentation system comprises: the system comprises a video framing module, an image segmentation network model and an image display module.
The video framing module is used for acquiring original video stream data acquired by gastroscopy equipment, and then framing the video stream data to obtain gastroscope original images of all frames.
The image segmentation network model takes the gastroscope original images output by the video framing module as input, and extracts and segments the regions with cancerous characteristics in the images. The image segmentation network model is a lightweight network model improved from the Mask-RCNN network, and comprises a feature extraction unit and an RPN network unit. The feature extraction unit serves as the backbone network of the model; it adopts MobileNet or another deep-learning-based feature extraction model that can extract color features, texture features, shape features and spatial features from an image and has fewer parameters or a faster processing rate than the Resnet50 network. The feature extraction unit obtains the corresponding feature map from the input medical image. The RPN network unit is configured to: (1) acquire a number of candidate regions of interest from the feature map; (2) send the feature map and the candidate regions of interest into binary classification and BBOX regression, filtering out part of the candidate regions of interest; (3) perform the ROIAlign operation on the remaining regions of interest; (4) perform N-class classification and BBOX regression on the remaining regions of interest, and perform a full convolution operation within each ROI; (5) perform non-maximum suppression on highly overlapping recognition-region results within the regions of interest, and select the region with the highest confidence as the final segmentation result to be output.
The image display module is used for simultaneously displaying images in original video stream data of gastroscopy and segmentation results of the regions with the canceration characteristics output by the image segmentation network.
As a further improvement of the method, in the training process of the image segmentation network model, a CIoU-Loss function is used instead of the original IoU and GIoU to calculate the loss function of the backbone network; meanwhile, one or more of model pruning, model quantization and knowledge distillation are adopted to further compress the image segmentation network model, achieving a lightweight network model.
The invention also comprises a gastroscopic video based real-time image segmentation apparatus comprising a memory, a processor and a computer program stored on said memory and executable on said processor, the processor when executing the program implementing the steps of the gastroscopic video based real-time image segmentation method as described above.
The technical scheme provided by the invention has the following beneficial effects:
the method provided by the invention constructs an improved image segmentation network, and the network model can be used for rapidly extracting the characteristics of the image such as color, texture, shape, spatial distribution and the like, thereby laying a foundation for rapidly positioning the lesion tissues. The network model constructed by the invention is a deep learning network with self-learning and self-adjusting capabilities, and can effectively utilize information of different levels in an image, thereby realizing the modeling of high-level semantics, improving the identification capability of the network model for canceration characteristics, and improving the efficiency and the accuracy of characteristic matching.
In the method provided by the invention, the conventional network model is made as lightweight as possible, and the scale of the network model is greatly reduced without significantly harming its accuracy and generalization performance, so that the network model can run on a conventional hardware platform and the cost of deploying and running it is reduced. Meanwhile, the network model can perform image segmentation on high-frame-rate gastroscopy video, significantly improving the real-time performance of gastric cancer feature image segmentation and thereby providing auxiliary diagnosis for the physician during gastroscopy.
Drawings
Fig. 1 is a flowchart illustrating steps of a real-time image segmentation method based on gastroscopy video according to embodiment 1 of the present invention.
Fig. 2 is a structural framework diagram of a conventional Mask-RCNN network in embodiment 1 of the present invention.
FIG. 3 is a comparison of an original image of a real gastroscopy and a corresponding annotated mask image in example 1 of the present invention.
FIG. 4 is a comparison chart between the model-predicted result and the true labeled result of the same image in embodiment 1 of the present invention.
Fig. 5 is a block diagram of a real-time image segmentation system based on gastroscopy video according to embodiment 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
The embodiment provides a real-time image segmentation method based on gastroscopy video, as shown in fig. 1, the medical image segmentation method comprises the following steps:
s1: constructing a lightweight image segmentation network model based on a Mask-RCNN framework, inputting medical images of gastroscopy into the image segmentation network model, and outputting segmentation results of target regions with pathological changes. The model construction process comprises the following steps:
s11: acquiring a traditional Mask-RCNN network containing a backbone network and a ROI (Region of interest) part.
The segmentation result of the target region having lesion features is output by drawing a polygonal frame around the target region in the whole image to identify the different cancerous individuals contained in the image. This is essentially an image segmentation task in computer vision. Therefore, this embodiment selects the classic Faster-RCNN network as the basic model for completing the image segmentation task; specifically, the Mask-RCNN network, which adds a branch for executing the segmentation task, is selected in this embodiment.
A block diagram of Mask-RCNN is shown in FIG. 2. In the Mask-RCNN processing pipeline, the first stage inputs the image into a convolutional neural network, automatically extracts the image features and obtains candidate frames (Proposals). In the second stage, in addition to predicting the class and position of the content of each candidate frame, a fully convolutional network branch is added to produce a binary mask of the image: a pixel is marked 1 when it belongs to the target position and 0 elsewhere, indicating whether the given pixel is part of the target.
The specific process of the second stage is as follows: (1) and acquiring a plurality of candidate interest areas according to the characteristic diagram. (2) And (4) sending the feature map and the candidate interesting regions into binary classification and BBOX regression, and filtering part of the candidate interesting regions. (3) And performing ROIAlign operation on the rest of the region of interest. (4) N-class classification, BBOX regression, and full convolution operations within each ROI were performed on the remaining regions of interest. (5) And performing non-maximum suppression on a plurality of identification region results which are highly overlapped in the region of interest, and selecting the region with the highest confidence coefficient as a final segmentation result and outputting the final segmentation result.
S12: and replacing a backbone network Resnet50 for extracting image features in the Mask-RCNN network with Mobile Net.
Relatively complex models tend to have higher recognition accuracy and generalization performance, but larger models also place higher demands on the computing hardware and reduce the model's detection rate. The detection rate of the Resnet50 module in a conventional Mask-RCNN network on a device with a GTX1080 graphics card only reaches 12 images per second, and when performing real-time visual recognition on a video stream, the output frame rate of the video further drops to about 8 frames per second. This clearly does not meet the requirements for real-time processing of gastroscopic video. In this embodiment, after replacing the Resnet50 module with the MobileNet module, the detection rate of the model increases significantly, the recognition accuracy does not drop noticeably, and the accuracy loss can be recovered through the subsequent training process.
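For illustration only, a Mask-RCNN with a MobileNet backbone can be assembled in PyTorch/torchvision roughly as in the following sketch; the anchor sizes, pooler settings and class count are assumptions, not values taken from the patent.

```python
# Hypothetical sketch: Mask-RCNN with a MobileNetV2 backbone in torchvision.
# Anchor sizes, number of classes and pooler settings are illustrative only.
import torchvision
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# MobileNetV2 feature extractor replaces the ResNet-50 backbone
# (older torchvision versions use pretrained=True instead of weights="DEFAULT").
backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
backbone.out_channels = 1280  # channels of the last MobileNetV2 feature map

anchor_generator = AnchorGenerator(
    sizes=((32, 64, 128, 256, 512),),
    aspect_ratios=((0.5, 1.0, 2.0),),
)
box_roi_pool = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)
mask_roi_pool = MultiScaleRoIAlign(featmap_names=["0"], output_size=14, sampling_ratio=2)

# e.g. background plus several lesion categories (assumed count)
model = MaskRCNN(
    backbone,
    num_classes=5,
    rpn_anchor_generator=anchor_generator,
    box_roi_pool=box_roi_pool,
    mask_roi_pool=mask_roi_pool,
)
```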
S13: and performing model compression on the network model in the previous step by adopting one or more modes of model pruning, model quantification or knowledge distillation to obtain the required light-weight image segmentation network model.
The backbone network was replaced in the above step to reduce the model size; this embodiment further lightens the network model through model compression. Specifically, the adopted lightweighting means include pruning, quantization and knowledge distillation.
First, model pruning is carried out; the pruning procedure is performed together with the pre-training of the network model. Parameters are continuously adjusted during pre-training, and connections between neurons whose removal affects the accuracy of the network's classification result by less than a preset loss rate are pruned, until the neuron connections in the network model are minimized, that is, until pruning any of the remaining connections would reduce the accuracy of the model's classification result. The criterion for removing a neuron connection is its influence on the accuracy of the final detection output: when the influence of a pruned connection on the final result is small, i.e. the accuracy loss is within an acceptable range, the connection can be pruned; if a connection has a significant effect on the accuracy of the final output, it should be preserved.
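A minimal sketch of magnitude-based connection pruning is given below, assuming the torch.nn.utils.prune utilities; the 30% pruning ratio and the prune/fine-tune/check loop mentioned in the comment are assumptions, not parameters from the patent.

```python
# Hypothetical sketch of magnitude-based pruning with torch.nn.utils.prune.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_model(model: nn.Module, amount: float = 0.3) -> nn.Module:
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            # Zero out the connections (weights) with the smallest L1 magnitude,
            # i.e. those expected to affect the output the least.
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the zeroed weights permanent
    return model

# Typical use: prune a little, fine-tune, check that the accuracy loss stays
# below the preset tolerance, and repeat until further pruning hurts accuracy.
```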
In some successful cases, for example on the ImageNet dataset, pruning can reduce the number of parameters of AlexNet by a factor of 9 without loss of accuracy; VGG-16 shows a similar phenomenon, with the total number of parameters reducible by about 13 times without accuracy loss.
Second, model quantization is carried out, implemented based on the ideas of weight sharing and clustering. Assuming K classes are given, i.e. the weight parameters of the network model take K values after quantization, a K-Means clustering operation is performed on the weights to obtain K intervals; all weight parameters are assigned to these K intervals, the value of the corresponding interval replaces the original weight data, and the storage per weight is reduced from 32 bits to log2(K) bits.
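For illustration, weight-sharing quantization with K-Means clustering could be sketched as follows, assuming scikit-learn's KMeans; the value K = 16 is only an example.

```python
# Hypothetical sketch of weight-sharing quantization with K-Means clustering.
# Each weight tensor is replaced by K shared centroid values, so a weight can
# be stored as a log2(K)-bit index instead of a 32-bit float.
import numpy as np
from sklearn.cluster import KMeans

def quantize_weights(weights: np.ndarray, k: int = 16):
    flat = weights.reshape(-1, 1).astype(np.float32)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(flat)
    indices = km.predict(flat)               # log2(k)-bit codes per weight
    centroids = km.cluster_centers_.ravel()  # k shared float values
    quantized = centroids[indices].reshape(weights.shape)
    return quantized, indices.reshape(weights.shape), centroids
```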
Finally, knowledge distillation is carried out; the original large network model is replaced by a small network with far fewer nodes, without affecting the distribution of the output of the final softmax layer. Knowledge distillation transfers knowledge from the large network to the small one, so that the model can be computed at high speed. The large network naturally has better expressive and generalization ability, and although the small network has far fewer nodes, it can approach the classification capability of the large network in a given scenario, much as a more complex function can be approximated by a simpler one over a limited range of values. Moreover, the final result of the network classifier is expressed as a probability and the classification depends on the maximum probability, so a final classification with a maximum probability of 90% is the same as one with a maximum probability of 60%. The task of this embodiment belongs to such a scenario, so a knowledge distillation strategy can further simplify the network model and reduce its scale.
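A minimal sketch of a knowledge-distillation loss, in which the student network mimics the softened softmax distribution of the teacher, is given below; the temperature T and the weighting alpha are assumptions.

```python
# Hypothetical sketch of a knowledge-distillation loss: the student imitates
# the teacher's softened softmax distribution. T and alpha are illustrative.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # KL divergence between softened teacher and student distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Ordinary cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```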
S2: acquiring a plurality of real gastroscopy images with canceration characteristics as sample data, and carrying out manual primary screening and denoising on the images; then, amplifying the number of images by an image enhancement method, and manually marking the type and the position of a target region in the amplified images; the marked images form a required original data set; the raw data set is divided into a training set and a test set.
This embodiment acquires real gastroscopic images of patients from the historical medical records of the gastroscopy department. To ensure the validity of the sample data and reduce the interference of invalid data with model training, this embodiment also manually prescreens the acquired original images; the prescreening criterion is to remove images that do not belong to the specified region and images whose quality is so poor that even the human eye cannot distinguish cancerous features, for example images that are unclear, blurred or too noisy.
In addition, for the original images in the selected data set, this embodiment also performs manual annotation to determine the position and type of canceration in the sample data, which serves as reference data for the model training process. The annotation of the image data is completed by professional physicians using professional annotation tools such as Labelme and Pair, and the original images are classified into multiple categories such as normal, cancer, advanced cancer and inflammation. In this embodiment, the annotation result produced by the annotation software is a file in JSON format, from which a mask image corresponding to the original image is generated. Meanwhile, another class of NBI images using laser staining is boxed with polygons in categories such as inflammation and cancer for subsequent study. As shown in FIG. 3, the left half of FIG. 3 is an original gastroscopy image and the right half is the corresponding annotated mask image.
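As an illustration of how such annotations can be turned into mask images, the following sketch converts a Labelme-style JSON file (polygon points per lesion) into a binary mask; the field names follow the common Labelme format and are assumptions rather than details given by the patent.

```python
# Hypothetical sketch: convert a Labelme-style JSON annotation into a mask image.
import json
import numpy as np
import cv2

def labelme_json_to_mask(json_path: str, height: int, width: int) -> np.ndarray:
    with open(json_path, "r", encoding="utf-8") as f:
        ann = json.load(f)
    mask = np.zeros((height, width), dtype=np.uint8)
    for shape in ann.get("shapes", []):
        pts = np.array(shape["points"], dtype=np.int32)
        cv2.fillPoly(mask, [pts], 255)  # mark the annotated lesion polygon
    return mask
```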
When the gastroscope device shoots under normal conditions, the endoscopic imaging can display normal anatomical and physiological structures and highlight abnormal pathological structures. However, when the endoscope performs close-range shooting inside the body, part of the images have obvious defects due to limitations of shooting angle, lighting and the like, and the diseased region cannot be observed well; therefore, the original gastroscopy images need preprocessing such as noise reduction to improve their quality and thus the accuracy of identifying the lesion position and degree. The image preprocessing used in this embodiment is mainly image denoising, performed with the mean filtering, Gaussian filtering and median filtering methods respectively.
Mean filtering is a simple filter that takes the average of the pixel values in a K × K window as the output. This filter is equivalent to convolving the image with an all-ones kernel and then scaling the result. The pixel calculation formula of mean filtering is as follows:
G1(x,y)=∑F(x,y)/m
in the above formula, x and y are the pixel coordinates, F(x, y) is the original pixel value, G1(x, y) is the mean-filtered pixel value; m is the total number of pixels contained in the convolution kernel.
The specific operation of gaussian filtering is to replace the value of the central pixel of the template with the weighted average gray value of the pixels in the neighborhood determined by the template for each pixel in a convolution scan image. The pixel calculation formula of gaussian filtering is as follows:
G2(x, y) = (1/(2πσ²)) · e^(−(x² + y²)/(2σ²))
in the above formula, x and y are the image pixel coordinates, G2(x, y) is the Gaussian kernel weight applied during filtering; σ is a coefficient that determines the degree of smoothing of the overall image.
Median filtering sorts all pixels in the neighborhood of a pixel with a statistical sorting filter and then takes the median value as the pixel at the neighborhood center. Median filtering is a non-linear method of removing noise that, in some cases, can remove noise while protecting the edges of the image. Its principle is to replace the value of a point in the digital image by the median of the values of the points in a region around that point. A neighborhood of a certain length or shape around a point is called a window; for median filtering of two-dimensional images, a 3 × 3 or 5 × 5 window is typically used.
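For illustration, the three denoising filters described above can be applied with OpenCV roughly as follows; the kernel sizes and the sigma value are assumptions, not parameters specified by the patent.

```python
# Hypothetical sketch of the three denoising filters using OpenCV.
import cv2

def denoise(image):
    mean_filtered = cv2.blur(image, (3, 3))                # mean filtering, 3x3 window
    gauss_filtered = cv2.GaussianBlur(image, (5, 5), 1.0)  # Gaussian filtering, sigma = 1.0
    median_filtered = cv2.medianBlur(image, 3)             # median filtering, 3x3 window
    return mean_filtered, gauss_filtered, median_filtered
```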
In the training process of the network model, when the training set is too small, overfitting easily occurs and the detection performance of the model suffers. In this embodiment, the original images are augmented by image enhancement methods to increase the size of the data set. Specifically, the image enhancement methods used to increase the number of images in the original data set include rotating and flipping the images: rotation by 90, 180 and 270 degrees, and left-right and up-down flipping. These image processing operations yield multiple images with the same lesion characteristics, which improves the generalization performance of the model and therefore the recognition accuracy of the network model in later application.
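A minimal sketch of the rotation and flipping augmentation described above is given below; it produces the five additional views per image (90, 180 and 270 degree rotations, plus left-right and up-down flips).

```python
# Hypothetical sketch of the rotation/flip augmentation using OpenCV.
import cv2

def augment(image):
    return [
        cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE),         # 90 degrees
        cv2.rotate(image, cv2.ROTATE_180),                   # 180 degrees
        cv2.rotate(image, cv2.ROTATE_90_COUNTERCLOCKWISE),   # 270 degrees
        cv2.flip(image, 1),                                  # left-right flip
        cv2.flip(image, 0),                                  # up-down flip
    ]
```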
In addition, in other embodiments, data enhancement may also be achieved by generating augmented samples with a generative adversarial network (GAN).
S3: IoU and GIoU of a Loss function of a backbone network in the original computed image segmentation network are replaced by a CIoU-Loss function; and setting training parameters for model training, training the image segmentation network by using a training set, and iteratively updating network parameters until the loss function is converged.
In gastroscopic images of gastric cancer, large lesion regions with inconspicuous features often appear, along with small cancerous targets whose features are clearly distinguishable. The conventional IoU measures the overlap of the predicted position with the true position (GT), but does not fully reflect the positional relationship between different individuals; moreover, if the boxes do not overlap, IoU provides no gradient when computing the loss function. Although Mask-RCNN uses the later-proposed GIoU, which solves the no-gradient problem of IoU as a loss function by adding a minimum enclosing box as a penalty term, it still has problems: GIoU first needs the predicted box to intersect the target box before it can begin to increase their overlap, so a large number of iterations are required to converge, and it degrades to IoU when one box contains the other, without ensuring a tight fit to the target. Therefore, this embodiment adopts CIoU-Loss, a position loss function that integrates multiple aspects; CIoU-Loss takes into account the distance, overlap rate, scale and a penalty term between the target and the anchor box, making the regression of the target box more stable and avoiding divergence during training as can occur with IoU and GIoU, so better training results and prediction performance can be obtained.
The calculation formula of the CIoU-Loss function is as follows:
L_CIoU = 1 − IoU + ρ²(A, B)/c² + αv
in the above equation, IoU is the original intersection-over-union of the predicted position and the actual position; A and B are the rectangular bounding boxes marking the target positions; ρ is the Euclidean distance between the center-point coordinates of box A and box B; c is the diagonal distance of the smallest box enclosing boxes A and B; α is a weight function, and v is a coefficient used to measure the consistency of the aspect ratio;
the calculation formulas of the weight coefficient α and the coefficient v are as follows:
α = v / ((1 − IoU) + v)
v = (4/π²) · (arctan(w_gt/h_gt) − arctan(w/h))²
in the above formulas, w is the predicted width of the bounding box; h is the predicted height of the bounding box; w_gt is the width of the original labeled position box; h_gt is the height of the original labeled position box.
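As an illustrative aid (not the patent's own code), the CIoU loss defined by the formulas above can be computed for axis-aligned boxes roughly as in the following PyTorch sketch; the (x1, y1, x2, y2) box convention and the eps constant are assumptions.

```python
# Hypothetical sketch of the CIoU loss for boxes given as (x1, y1, x2, y2) tensors.
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # intersection and union
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared center distance rho^2 and enclosing-box diagonal c^2
    cx_p = (pred[:, 0] + pred[:, 2]) / 2;   cy_p = (pred[:, 1] + pred[:, 3]) / 2
    cx_t = (target[:, 0] + target[:, 2]) / 2; cy_t = (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio consistency term v and weight alpha
    w_p = pred[:, 2] - pred[:, 0];    h_p = pred[:, 3] - pred[:, 1]
    w_t = target[:, 2] - target[:, 0]; h_t = target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```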
The network model training process is roughly as follows (a simplified sketch of this loop is given after the list):
(1) The model loads the pre-training parameters from ImageNet image classification.
(2) Training images are read into the network in batches, and feedforward propagation is computed.
(3) The loss function is computed from the obtained prediction vectors and the ground-truth label files containing class, position and mask coordinates, and back-propagation is performed.
(4) The optimizer iteratively updates the network parameters until the loss function converges.
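The simplified training-loop sketch referenced above is given here; the optimizer, learning rate and epoch count are assumptions, and the loss dictionary follows the behavior of torchvision detection models in training mode.

```python
# Hypothetical sketch of the training loop: read batches, feed forward,
# compute the losses, back-propagate, and update until convergence.
import torch

def train(model, data_loader, epochs=50, lr=1e-3, device="cuda"):
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):
        for images, targets in data_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            loss_dict = model(images, targets)  # detection models return a dict of losses
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```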
S4: and testing the image segmentation network model trained in the previous step by using a test set, and reserving the model with the best test effect as the network model for real-time image segmentation.
S5: acquiring video stream data of gastroscopy, performing framing processing on the video stream data, inputting the framed image into the network model stored in the step S4, performing real-time segmentation processing on the framed image by the image segmentation network model, and outputting a segmentation result of a target region with lesion characteristics.
In this embodiment, for the same original image of the gastroscopy, the image pair of the model prediction result and the real labeling result is shown in fig. 4, wherein the left half part is the model prediction result and the right half part is the artificial real labeling result.
In the task of this embodiment, lesion model construction is the core of early gastric cancer gastroscopic image processing, and the most important part is feature extraction from the gastroscopic picture; whether identifying the lesion position or the degree of canceration, feature extraction plays a key role in judging gastric cancer. Traditional gastroscopic image features mainly refer to 3 visual features: color features, texture features and shape features.
The color feature is the most intuitive feature in the judgment of the gastroscope image by the doctor, is described by the color presented by the gastroscope image and has integrity. The color feature extraction method comprises a color histogram, a color set, a color matrix and the like.
Texture features are an important visual cue, and are ubiquitous and difficult to describe features in images. The targets of the texture feature extraction are: the extracted textural features are small in dimension, strong in identification capability, good in robustness, small in calculation amount in the extraction process and capable of guiding practical application. The texture change features of gastric cancer lesions are relatively obvious in gastroscopy, and therefore, the texture features are also applied to image processing of gastric cancer identification.
The shape feature describes the contour and shape of the object. In early gastric cancer detection, the shape change of the lesion is not obvious and it is difficult to analyze the lesion with shape features, so the analysis and application of shape features for early gastric cancer are less important than color and texture features; nevertheless, extracting shape features from gastroscopic images is significant in the treatment of other gastric diseases such as gastric polyps and tumors and can serve as a reference.
Therefore, considering the influence of the above features on gastric cancer image analysis, in other embodiments the Resnet50 network of this embodiment may be replaced by any deep-learning-based feature extraction model that can extract color features, texture features, shape features and spatial features from an image and has fewer parameters or a faster processing rate than the Resnet50 network.
To verify the segmentation performance of the image segmentation network model provided in this embodiment on video stream data, a verification experiment was designed. The hardware environment of the experiment was: Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz, 6 GB of memory, and a GTX 1080 graphics card.
The performance indexes adopted in the verification experiment of this embodiment are as follows (a short sketch computing them is given after this list):
TP: number of true positive samples
FN: number of false negative samples
TN: number of true negative samples
FP: number of false positive samples
PREC: rate of accuracy
PREC=TP/(TP+FP)
ACC: rate of accuracy
ACC=(TP+TN)/(TP+FN+TN+FP)
TPR: sensitivity (positive sample recall ratio)
TPR=TP/(TP+FN)
TNR: specificity (negative sample recall)
TNR=TN/(TN+FP)
F1: weighted average of precision and recall
F1=2*(PREC*TPR)/(PREC+TPR)
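The following sketch (an illustration, not part of the patent) computes the indexes defined above from the four confusion counts; applied to the threshold-0.5 row of Table 1 it reproduces the listed values.

```python
# Hypothetical sketch computing the performance indexes from confusion counts.
def metrics(tp, fn, tn, fp):
    prec = tp / (tp + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    f1 = 2 * prec * tpr / (prec + tpr)
    return {"PREC": prec, "ACC": acc, "TPR": tpr, "TNR": tnr, "F1": f1}

# metrics(1013, 472, 597, 114) -> PREC ~= 0.899, ACC ~= 0.733, TPR ~= 0.682, ...
```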
In the verification experiment, a real examination video manually confirmed to contain cancerous features was used; the total number of frames in the video is 2196. Under different confidence thresholds, the statistics and performance indexes of the network model of this embodiment during testing are as follows:
table 1: test statistical results of the network model of the embodiment under different confidence thresholds
Threshold value TP FN TN FP PREC ACC TPR TNR F1
0.5 1013 472 597 114 89.89 73.32 68.22 83.97 76.57
0.6 970 515 610 101 90.57 71.95 65.32 85.79 75.90
0.7 934 551 620 91 91.12 70.77 62.90 87.20 74.42
0.8 903 582 631 80 91.86 69.85 60.81 88.75 73.18
0.9 834 651 651 60 93.29 67.62 56.16 91.56 70.11
Analysis of the data in the table shows that, even at the lowest confidence threshold, the detection precision of the network model provided by this embodiment reaches about 90%, so the identification accuracy for positive cancerous features is very high. In addition, the verification test also shows that the network model of this embodiment has very good real-time performance when processing video stream data: the output frame rate can reach more than 30 frames per second, which basically meets the requirement of real-time processing of gastroscopy video, and with better hardware the output frame rate can be raised further.
Example 2
The present embodiment further provides a real-time image segmentation system based on gastroscopy video, which employs the real-time image segmentation method based on gastroscopy video as in embodiment 1 to process video stream data of a gastroscopy process, so as to segment a target region containing cancerous features in the image. As shown in fig. 5, the real-time image segmentation system includes: the system comprises a video framing module, an image segmentation network model and an image display module.
The video framing module is used for acquiring original video stream data acquired by gastroscopy equipment, and then framing the video stream data to obtain gastroscope original images of all frames.
The image segmentation network model takes the gastroscope original images output by the video framing module as input, and extracts and segments the regions with cancerous characteristics in the images. The image segmentation network model is a lightweight network model improved from the Mask-RCNN network, and comprises a feature extraction unit and an RPN network unit. The feature extraction unit serves as the backbone network of the model; it adopts MobileNet or another deep-learning-based feature extraction model that can extract color features, texture features, shape features and spatial features from an image and has fewer parameters or a faster processing rate than the Resnet50 network. The feature extraction unit obtains the corresponding feature map from the input medical image. The RPN network unit is configured to: (1) acquire a number of candidate regions of interest from the feature map; (2) send the feature map and the candidate regions of interest into binary classification and BBOX regression, filtering out part of the candidate regions of interest; (3) perform the ROIAlign operation on the remaining regions of interest; (4) perform N-class classification and BBOX regression on the remaining regions of interest, and perform a full convolution operation within each ROI; (5) perform non-maximum suppression on highly overlapping recognition-region results within the regions of interest, and select the region with the highest confidence as the final segmentation result to be output.
The image display module is used for simultaneously displaying images in original video stream data of gastroscopy and segmentation results of the regions with the canceration characteristics output by the image segmentation network.
As a further improvement of the method, in the training process of the image segmentation network model, a CIoU-Loss function is used instead of the original IoU and GIoU to calculate the loss function of the backbone network; meanwhile, one or more of model pruning, model quantization and knowledge distillation are adopted to further compress the image segmentation network model, achieving a lightweight network model.
Example 3
The present embodiment also provides a gastroscopic video-based real-time image segmentation apparatus, which comprises a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor executes the program to implement the steps of the gastroscopic video-based real-time image segmentation method as in embodiment 1.
The computer device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers) capable of executing programs, and the like. The computer device of the embodiment at least includes but is not limited to: a memory, a processor communicatively coupled to each other via a system bus.
In this embodiment, the memory (i.e., the readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the computer device. Of course, the memory may also include both internal and external storage devices for the computer device. In this embodiment, the memory is generally used for storing an operating system, various types of application software, and the like installed in the computer device. In addition, the memory may also be used to temporarily store various types of data that have been output or are to be output.
The processor may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to execute the program code stored in the memory or process the data to implement the processing procedure of the real-time image segmentation method based on gastroscopic video in embodiment 1, so as to obtain the segmentation result of the region where the cancerous feature appears in the image according to the given medical image.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A real-time image segmentation method based on gastroscopy video is characterized by comprising the following steps:
s1: constructing a lightweight image segmentation network model based on a Mask-RCNN framework, wherein the input of the image segmentation network model is a medical image of gastroscopy, and the output of the image segmentation network model is a segmentation result of a target region with lesion characteristics; the model construction process comprises the following steps:
s11: acquiring a traditional Mask-RCNN network comprising a backbone network and an ROI part;
S12: replacing the backbone network Resnet50 used to extract image features in the Mask-RCNN network with MobileNet;
s13: performing model compression on the network model in the previous step by adopting one or more modes of model pruning, model quantization or knowledge distillation to obtain a required light-weight image segmentation network model;
s2: acquiring a plurality of real gastroscopy images with canceration characteristics as sample data, carrying out artificial primary screening and denoising on the images, then amplifying the number of the images by an image enhancement method, and carrying out artificial marking on the type and the position of a target region in the amplified images, wherein the marked images form a required original data set; dividing an original data set into a training set and a testing set;
S3: replacing the IoU and GIoU originally used to compute the loss function of the backbone network in the image segmentation network with a CIoU-Loss function; setting training parameters for model training, training the image segmentation network with the training set, and iteratively updating the network parameters until the loss function converges;
s4: testing the image segmentation network model trained in the previous step by using the test set, and reserving the model with the best test effect as a network model for real-time image segmentation;
s5: acquiring video stream data of gastroscopy, performing framing processing on the video stream data, inputting the framed image into the network model stored in the step S4, performing real-time segmentation processing on the framed image by the image segmentation network model, and outputting a segmentation result of a target region with lesion characteristics.
2. A method of real-time image segmentation based on gastroscopic video according to claim 1, wherein: in step S12, the Resnet50 network may be replaced with any deep-learning-based feature extraction model that can extract color features, texture features, shape features and spatial features from an image and has fewer parameters or a faster processing rate than the Resnet50 network.
3. The gastroscopic video based real-time image segmentation method of claim 1, wherein: in step S13, the model pruning procedure is carried out together with the pre-training of the network model; parameters are continuously adjusted during pre-training, and connections between neurons whose removal affects the accuracy of the network's classification result by less than a preset loss rate are pruned, until the neuron connections in the network model are minimized, that is, until pruning any of the remaining connections would reduce the accuracy of the classification result of the network model.
4. The gastroscopic video based real-time image segmentation method of claim 1, wherein: in step S13, the model quantization process is implemented based on the ideas of weight sharing and clustering; assuming K classes are given, i.e. the weight parameters of the network model take K values after quantization, a K-Means clustering operation is performed on the weights to obtain K intervals, all weight parameters are assigned to these K intervals, the values of the corresponding intervals replace the original weight data, and the storage per weight is reduced from 32 bits to log2(K) bits.
5. The gastroscopic video based real-time image segmentation method of claim 1, wherein: the knowledge distillation process replaces the original large network model with a small network model having fewer network nodes, without affecting the distribution of the output of the final softmax layer.
6. The gastroscopic video based real-time image segmentation method of claim 1, wherein: in step S2, the manual prescreening removes images that do not belong to the specified region at all, as well as images whose quality is too poor to allow even the human eye to distinguish cancerous features; image denoising is performed using the mean filtering, Gaussian filtering and median filtering methods respectively;
the pixel calculation formula of the mean filtering is as follows:
G1(x,y)=∑F(x,y)/m
in the above formula, x and y are the pixel coordinates, F(x, y) is the original pixel value, G1(x, y) is the mean-filtered pixel value; m is the total number of pixels contained in the convolution kernel;
the gaussian filtered pixel calculation formula is as follows:
Figure FDA0003374144750000021
in the above formula, x and y are image pixel coordinates, G1(x, y) are gaussian filtered pixel values; σ is a coefficient that determines the degree of smoothness of the overall image;
the median filtering uses a statistical sorting filter to sort all the pixels in the neighborhood of a pixel point and then takes the median value as the value of the pixel at the neighborhood center;
image enhancement methods that augment the number of images in the original data set include rotation and flipping of the images.
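For illustration only, the snippet below applies the three denoising filters and the rotation/flip augmentations with OpenCV; the 5×5 kernel, sigma value and file name are placeholders rather than parameters fixed by the claim.

    import cv2

    img = cv2.imread("gastroscope_frame.png")           # placeholder input frame

    # denoising variants described above
    mean_f   = cv2.blur(img, (5, 5))                    # mean filter: average over the kernel's m pixels
    gauss_f  = cv2.GaussianBlur(img, (5, 5), 1.5)       # Gaussian filter; sigma sets the smoothing degree
    median_f = cv2.medianBlur(img, 5)                   # median of the sorted 5x5 neighborhood

    # augmentation by rotation and flipping
    rot90  = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)
    h_flip = cv2.flip(img, 1)                           # horizontal flip
    v_flip = cv2.flip(img, 0)                           # vertical flip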
7. The gastroscopic video based real-time image segmentation method of claim 1, wherein: in step S3, the calculation formula of the CIoU-Loss function is:
L_CIoU = 1 − IoU + ρ²(A, B) / c² + αv
in the above equation, IoU is the intersection-over-union ratio of the predicted position and the actual position; A and B are the rectangular bounding boxes marking the target positions; ρ is the Euclidean distance between the center-point coordinates of box A and box B; c is the diagonal distance of the minimum box enclosing boxes A and B; α is a weight coefficient, and v is a coefficient used to measure the consistency of the aspect ratio;
the calculation formulas of the weight coefficient alpha and the coefficient v are as follows:
α = v / ((1 − IoU) + v)
v = (4 / π²) · (arctan(w_gt / h_gt) − arctan(w / h))²
in the above formulas, w is the predicted width of the bounding box; h is the predicted height of the bounding box; w_gt is the width of the original label position box; h_gt is the height of the original label position box.
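A self-contained sketch of this CIoU loss for axis-aligned boxes in (x1, y1, x2, y2) form is given below, assuming PyTorch; the small eps constant is added only for numerical stability and is not part of the claimed formula.

    import math
    import torch

    def ciou_loss(pred, target, eps=1e-7):
        # intersection over union of the predicted and ground-truth boxes
        ix1 = torch.max(pred[..., 0], target[..., 0])
        iy1 = torch.max(pred[..., 1], target[..., 1])
        ix2 = torch.min(pred[..., 2], target[..., 2])
        iy2 = torch.min(pred[..., 3], target[..., 3])
        inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
        area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
        area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
        iou = inter / (area_p + area_t - inter + eps)

        # squared centre distance rho^2 and squared enclosing-box diagonal c^2
        cxp, cyp = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
        cxt, cyt = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2
        rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2
        ex1 = torch.min(pred[..., 0], target[..., 0])
        ey1 = torch.min(pred[..., 1], target[..., 1])
        ex2 = torch.max(pred[..., 2], target[..., 2])
        ey2 = torch.max(pred[..., 3], target[..., 3])
        c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

        # aspect-ratio consistency term v and its weight alpha
        w_p, h_p = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
        w_t, h_t = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
        v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
        alpha = v / ((1 - iou) + v + eps)

        return 1 - iou + rho2 / c2 + alpha * v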
8. A real-time image segmentation system based on gastroscopy video, characterized in that the real-time image segmentation method based on gastroscopy video according to any one of claims 1 to 7 is adopted to process the video stream data of a gastroscopy procedure so as to segment the target regions containing canceration characteristics in the images; the real-time image segmentation system comprises:
the video framing module is used for acquiring original video stream data acquired by gastroscopy equipment, and then framing the video stream data to obtain gastroscope original images of all frames;
the image segmentation network model is used for receiving the gastroscope original images output by the video framing module, and extracting and segmenting the regions with canceration characteristics in the images; the image segmentation network model is a lightweight network model improved on the basis of the Mask-RCNN network, and comprises a feature extraction unit and an RPN network unit; the feature extraction unit serves as the backbone network of the model and adopts MobileNet or another deep-learning-based feature extraction model that can extract color, texture, shape and spatial features from an image and that has fewer parameters or a faster processing speed than the Resnet50 network; the feature extraction unit is used for obtaining a corresponding feature map from the input medical image; the RPN network unit is configured to: (1) acquire a plurality of candidate regions of interest according to the feature map; (2) send the feature map and the candidate regions of interest into binary classification and BBOX regression, and filter out part of the candidate regions of interest; (3) perform a ROIAlign operation on the remaining regions of interest; (4) perform N-class classification and BBOX regression on the remaining regions of interest, and perform a full convolution operation within each ROI; (5) apply non-maximum suppression to the highly overlapping identification results among the regions of interest, and select the region with the highest confidence as the final segmentation result and output it; and
the image display module is used for simultaneously displaying the images in the original video stream data of the gastroscopy and the segmentation results of the regions with canceration characteristics output by the image segmentation network model.
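As a non-authoritative construction sketch of such a lightweight Mask-RCNN with a MobileNet backbone, the code below follows torchvision's detection API; the anchor sizes, pooling settings and two-class setup (background plus lesion) are illustrative choices, not the configuration claimed above.

    import torchvision
    from torchvision.models.detection import MaskRCNN
    from torchvision.models.detection.rpn import AnchorGenerator
    from torchvision.ops import MultiScaleRoIAlign

    # MobileNetV2 features as the lightweight backbone in place of Resnet50
    backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
    backbone.out_channels = 1280                      # MaskRCNN must know the feature-map depth

    anchor_gen = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                 aspect_ratios=((0.5, 1.0, 2.0),))
    box_pool  = MultiScaleRoIAlign(featmap_names=["0"], output_size=7,  sampling_ratio=2)
    mask_pool = MultiScaleRoIAlign(featmap_names=["0"], output_size=14, sampling_ratio=2)

    model = MaskRCNN(backbone,
                     num_classes=2,                   # background + lesion region
                     rpn_anchor_generator=anchor_gen,
                     box_roi_pool=box_pool,
                     mask_roi_pool=mask_pool)
    model.eval()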
9. The gastroscopic video based real-time image segmentation system of claim 8, wherein: in the training process of the image segmentation network model, a CIoU-Loss function is used in place of the original IoU and GIoU to calculate the loss function of the backbone network; meanwhile, one or more of model pruning, model quantization and knowledge distillation are adopted to further compress the image segmentation network model, thereby achieving a lightweight network model.
10. A gastroscopic video based real-time image segmentation apparatus comprising a memory, a processor and a computer program stored on said memory and executable on said processor, characterized in that: the processor, when executing the program, performs the steps of the method for gastroscopic video based real-time image segmentation according to any one of claims 1 to 7.
CN202111411214.4A 2021-11-25 2021-11-25 Real-time image segmentation method, system and device based on gastroscopy video Pending CN114092450A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111411214.4A CN114092450A (en) 2021-11-25 2021-11-25 Real-time image segmentation method, system and device based on gastroscopy video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111411214.4A CN114092450A (en) 2021-11-25 2021-11-25 Real-time image segmentation method, system and device based on gastroscopy video

Publications (1)

Publication Number Publication Date
CN114092450A true CN114092450A (en) 2022-02-25

Family

ID=80304364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111411214.4A Pending CN114092450A (en) 2021-11-25 2021-11-25 Real-time image segmentation method, system and device based on gastroscopy video

Country Status (1)

Country Link
CN (1) CN114092450A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419041A (en) * 2022-03-29 2022-04-29 武汉大学 Identification method and device for focus color
CN115082503A (en) * 2022-07-02 2022-09-20 哈尔滨理工大学 Method and device for segmenting pathological image of stomach
CN115938546A (en) * 2023-02-21 2023-04-07 四川大学华西医院 Early gastric cancer image synthesis method, system, equipment and storage medium
CN115938546B (en) * 2023-02-21 2023-07-14 四川大学华西医院 Early gastric cancer image synthesis method, system, equipment and storage medium
CN117437392A (en) * 2023-12-15 2024-01-23 杭州锐健医疗科技有限公司 Cruciate ligament dead center marker and model training method and arthroscope system thereof
CN117437392B (en) * 2023-12-15 2024-03-26 杭州锐健医疗科技有限公司 Cruciate ligament dead center marker and model training method and arthroscope system thereof

Similar Documents

Publication Publication Date Title
CN109461495B (en) Medical image recognition method, model training method and server
Khan et al. Lungs nodule detection framework from computed tomography images using support vector machine
CN108010021B (en) Medical image processing system and method
CN109272048B (en) Pattern recognition method based on deep convolutional neural network
CN110310287B (en) Automatic organ-at-risk delineation method, equipment and storage medium based on neural network
El-Regaily et al. Survey of computer aided detection systems for lung cancer in computed tomography
CN114092450A (en) Real-time image segmentation method, system and device based on gastroscopy video
US9959617B2 (en) Medical image processing apparatus and breast image processing method thereof
US20220092789A1 (en) Automatic pancreas ct segmentation method based on a saliency-aware densely connected dilated convolutional neural network
CN110992377B (en) Image segmentation method, device, computer-readable storage medium and equipment
CN112150428A (en) Medical image segmentation method based on deep learning
Liu et al. A fully automatic segmentation algorithm for CT lung images based on random forest
CN112365973B (en) Pulmonary nodule auxiliary diagnosis system based on countermeasure network and fast R-CNN
CN110706225B (en) Tumor identification system based on artificial intelligence
CN110766659A (en) Medical image recognition method, apparatus, device and medium
Khordehchi et al. Automatic lung nodule detection based on statistical region merging and support vector machines
CN114332132A (en) Image segmentation method and device and computer equipment
Liu et al. Extracting lungs from CT images via deep convolutional neural network based segmentation and two-pass contour refinement
CN109978004B (en) Image recognition method and related equipment
CN112508057A (en) Pulmonary nodule classification method, medium and electronic device
CN113780421B (en) Brain PET image identification method based on artificial intelligence
CN109800820A (en) A kind of classification method based on ultrasonic contrast image uniform degree
CN111598144B (en) Training method and device for image recognition model
Shaziya et al. Comprehensive review of automatic lung segmentation techniques on pulmonary CT images
Paul et al. Computer-Aided Diagnosis Using Hybrid Technique for Fastened and Accurate Analysis of Tuberculosis Detection with Adaboost and Learning Vector Quantization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination