CN112396601B - Real-time neurosurgical instrument segmentation method based on endoscope images - Google Patents

Real-time neurosurgical instrument segmentation method based on endoscope images

Info

Publication number
CN112396601B
CN112396601B (application CN202011418220.8A)
Authority
CN
China
Prior art keywords
image
instrument
graph
label
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011418220.8A
Other languages
Chinese (zh)
Other versions
CN112396601A (en)
Inventor
黄凯
龚瑾
郭英
何海勇
郭思璐
宋日辉
梁宏立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Affiliated Hospital Sun Yat Sen University
Sun Yat Sen University
Original Assignee
Third Affiliated Hospital Sun Yat Sen University
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Third Affiliated Hospital Sun Yat Sen University and Sun Yat Sen University
Priority to CN202011418220.8A
Publication of CN112396601A
Application granted
Publication of CN112396601B
Legal status: Active
Anticipated expiration


Classifications

    • G06T 7/0012 Biomedical image inspection (G06T 7/00 Image analysis)
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 Combinations of networks
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06T 7/136 Segmentation; Edge detection involving thresholding
    • G06V 10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30004 Biomedical image processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the fields of medical image processing and image segmentation, and specifically relates to a real-time neurosurgical instrument segmentation method based on endoscope images. It provides a real-time instrument instance segmentation method for the endoscopic neurosurgery scene that can be applied clinically to assist neurosurgery in real time during an operation. The invention also provides a data augmentation method targeting noise such as light spots, specular reflections and blur, which enriches the training samples while improving the learning capability and adaptability of the model.

Description

Real-time neurosurgical instrument segmentation method based on endoscope images
Technical Field
The invention belongs to the fields of medical image processing and image segmentation, and specifically relates to a real-time neurosurgical instrument segmentation method based on endoscope images.
Background
Existing instance segmentation methods fall into two main categories: two-stage and one-stage. At present there is no real-time instance segmentation work for the neurosurgical endoscopic image scene.
Data augmentation is a common technique in deep learning. It is mainly used to enlarge and diversify the training data set so that the trained model generalizes better. Existing data augmentation mainly includes horizontal/vertical flipping, rotation, scaling, cropping, translation, contrast adjustment, color jittering, noise, and so on. However, conventional data augmentation methods are not designed for endoscopic surgery images, nor for scenes containing light spots, specular reflection and blur.
Chinese patent CN111724365A, published on 2020.09.29, discloses a method for detecting interventional devices in endovascular aneurysm repair surgery. It uses a trained fast attention network to generate a binary segmentation mask of the interventional device, then overlays the mask on the image under detection to obtain an image of the device. That invention is based on X-ray transmission images and improves the accuracy and speed of classifying instruments against the tissue background. However, it was not developed for the neurosurgical endoscopic scene and cannot handle the light spots, specular reflections and blur that commonly appear there. Moreover, most existing instrument segmentation techniques assist doctors in the preoperative examination stage and cannot provide real-time prompts during an operation.
At present, the instance segmentation algorithms with the best results are derived from object detection methods, but instance segmentation is considerably harder than object detection. The accuracy of a two-stage detector depends on feature localization, a sequential process that cannot be accelerated. A one-stage detector makes the localization parallel, but it still performs a large amount of computation after localization, which is likewise difficult to accelerate. Real-time instance segmentation has therefore long been difficult to achieve.
Disclosure of Invention
The present invention overcomes at least one of the above drawbacks of the prior art and provides a real-time neurosurgical instrument segmentation method for endoscopic images that adapts to the segmentation task of a surgical scene and achieves a high segmentation speed.
To solve the above technical problems, the invention adopts the following technical scheme: a real-time neurosurgical instrument segmentation method based on endoscopic images, comprising the steps of:
S1, collecting endoscopic surgery image data and labeling the images manually, the labels spatially segmenting and semantically classifying the foreground, namely the instruments, and the background; constructing a data set, setting cross-validation samples, and establishing an instrument instance segmentation database divided into a training set and a validation set;
S2, performing data augmentation on the data set, including flipping, rotation, image intensity adjustment, light spot/Gaussian noise addition and image mixing, so that the number of samples of the data set is increased and the samples are enriched;
S3, constructing a network model comprising a feature backbone network, a feature pyramid network, a prototype prediction branch and a mask coefficient prediction branch; the input is a two-dimensional image and the output is the prediction result for the image, consisting of a set of object detection bounding boxes, masks and corresponding categories;
S4, using the training data set as training samples, training the network model constructed in step S3 with a back-propagation strategy and minimizing a loss function to obtain optimized network weights;
S5, testing the model: testing the trained network model with the validation data samples, feeding the validation images into the network model to obtain prediction results, comparing the predictions with the labels, and judging whether the network generalizes well.
Further, in step S2, when selecting a specific data augmentation mode, the augmentation modes of picture flipping, picture rotation, image intensity adjustment and light spot/Gaussian noise addition are each selected for a given picture by a randomly generated probability.
Further, the random probability scheme is as follows: first, a picture rotation probability, a picture flipping probability, an image intensity adjustment probability and a light spot/Gaussian noise addition probability are set separately; then a floating-point random number between 0 and 1 is generated, and the corresponding augmentation mode is applied to the current picture when the random number is greater than the preset threshold probability.
Further, the light spot/Gaussian noise addition is as follows: to counteract the influence of light spots, elliptical light spots are added to the original image by image processing; the spots have random sizes and random positions in the image, so that the network learns to treat the spots as noise rather than as background or foreground.
Further, the elliptical light spots are added as follows: an integer less than 8 is randomly generated as the number of light spots, elliptical spots are drawn on an image of the same size as the original image, and that image and the original image are added.
Further, image mixing mode one comprises the following steps:
A1. selecting an image a and an image b, wherein image b contains an instrument on which tissue texture is reflected; extracting the number of label colors of images a and b, a single picture's label using multiple colors to distinguish different instruments;
A2. cutting out the reflective instrument in image b, obtained by setting the background of the reflective instrument image to black (0, 0, 0);
A3. overlaying image a with the black-background instrument image obtained in step A2, i.e. adding the two images pixel-wise to obtain a new image c;
A4. overlaying the instrument label of image b onto the corresponding position of the label of image a, and renumbering the colors of the labels of the new instruments according to the color count to obtain the label of image c.
Further, image mixing mode two comprises the following steps:
B1. selecting an image a and an image b, and extracting the number of label colors of images a and b;
B2. covering the instrument of image a by rotating the image and replacing the instrument region with the rotated image; where rotation cannot cover all instruments, covering the remaining instrument regions with equally sized nearby areas, i.e. by translation;
B3. cutting out the instrument in image b and setting the background of image b to black (0, 0, 0);
B4. adding the black-background instrument image from image b to image a with its instrument covered, i.e. adding corresponding pixels; during the addition the instrument pixels of image b are multiplied by a coefficient transmittance and the corresponding pixels of image a by 1 - transmittance; the sum is a new image c whose instrument region has a certain transparency, i.e. part of the background shows through on the instrument, simulating reflection;
B5. generating a label for image c, which, since the instrument of image a has been covered by background, is simply the instrument label of image b.
Further, the network model is divided into two parallel tasks: a. prototype generation: generating a series of prototype masks that have the same size as the original image and do not depend on any single instance; b. mask coefficients: predicting for each instance a series of mask coefficients that encode the instance's representation in the prototype mask space; the prototype masks are then linearly combined with the corresponding predicted coefficients and cropped with the predicted bounding boxes to obtain the instance segmentation result of the whole image.
The present invention also provides an electronic device comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method as described above when executing the computer program.
The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described above.
Compared with the prior art, the beneficial effects are:
1. The method is fast: it achieves real-time instance segmentation of neurosurgical endoscopic images, distinguishing the instruments in the field of view from other tissues as well as distinguishing the individual instrument entities from one another;
2. A data augmentation scheme is designed for the neurosurgical scene; by changing illumination brightness, adding light spots, adding random noise and simulating the reflection of tissue texture on instruments, the model's performance on data containing light spots and reflections is improved.
Drawings
FIG. 1 is a schematic flow diagram of the overall method of the present invention.
FIG. 2 is a flow chart of the random probability selection scheme in an embodiment of the present invention.
FIG. 3 is a flow chart of image mixing mode one in an embodiment of the present invention.
FIG. 4 is a flow chart of image mixing mode two in an embodiment of the present invention.
FIG. 5 is a schematic diagram of the network model structure in an embodiment of the present invention.
FIG. 6 is a schematic diagram of the prototype generation network in an embodiment of the present invention.
FIG. 7 is a flow chart of mask coefficient processing in an embodiment of the present invention.
Detailed Description
The drawings are for illustration purposes only and are not to be construed as limiting the invention; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the invention.
As shown in fig. 1, a real-time neurosurgical instrument segmentation method based on endoscopic images comprises the following steps:
Step 1, collect endoscopic surgery image data and label the images manually, the labels spatially segmenting and semantically classifying the foreground, namely the instruments, and the background; construct a data set, set cross-validation samples, and build an instrument instance segmentation database divided into a training set and a validation set.
Step 2, perform data augmentation on the data set, including flipping, rotation, image intensity adjustment, light spot/Gaussian noise addition and image mixing, so that the number of samples is increased and the samples are enriched.
in this embodiment, four data augmentation methods are mainly used:
1. traditional image transformation: including flipping, rotating, adjusting image intensity, adding gaussian noise.
2. Randomly adding spot noise: in order to eliminate the influence of the light spots, some elliptical light spots are added to the original image through image processing. The spots are of random size and are distributed at random positions in the image. Thereby causing the network to learn the spots as noise rather than background or foreground.
3. Image mixing: the two pictures are combined into a new example in two different ways so as to artificially simulate the reflection of the tissue texture on the instrument.
For the first two augmentation modes, a mode of randomly generating probability is adopted to select which augmentation mode is used for each picture. And respectively setting picture rotation probability, picture turning probability, picture brightness change probability, Gaussian noise addition and light spot probability. And generating a floating point random number of 0-1, and using a corresponding augmentation mode for the current picture when the random number is greater than a preset threshold probability. At least one data augmentation mode is in effect by default. The augmentation may only be performed at most once for each type of data. The flow chart is shown in fig. 2.
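As an illustration, the following Python sketch implements this selection flow. The threshold values, the intensity range and the Gaussian noise strength are assumptions for the sketch; the patent does not disclose them.

```python
import random

import cv2
import numpy as np

# Hypothetical per-augmentation thresholds; the patent sets one probability
# per augmentation but does not disclose the actual values.
THRESH = {"flip": 0.5, "rotate": 0.5, "intensity": 0.5, "gauss": 0.5}

def random_augment(img, label):
    """Apply each augmentation at most once when a 0-1 draw exceeds its
    threshold; redraw until at least one augmentation has fired."""
    while True:
        out, lab, applied = img.copy(), label.copy(), False
        if random.random() > THRESH["flip"]:
            code = random.choice([-1, 0, 1])   # both axes / x-axis / y-axis
            out, lab = cv2.flip(out, code), cv2.flip(lab, code)
            applied = True
        if random.random() > THRESH["rotate"]:
            k = random.choice([1, 2, 3])       # rotate by k * 90 degrees
            out, lab = np.rot90(out, k).copy(), np.rot90(lab, k).copy()
            applied = True
        if random.random() > THRESH["intensity"]:
            out = cv2.convertScaleAbs(out, alpha=random.uniform(0.7, 1.3))
            applied = True
        if random.random() > THRESH["gauss"]:
            noise = np.random.normal(0.0, 10.0, out.shape)
            out = np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)
            applied = True
        if applied:                            # at least one mode in effect
            return out, lab
```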
The remaining three augmentation modes are described in detail below.
Adding light spot noise: the spots are obtained by adding the original image to an equally sized all-black image containing only random elliptical light spots, i.e. by adding the pixel values of corresponding points. The RGB value of an elliptical spot is (150, 150, 150). Each spot is generated with OpenCV's ellipse function, whose parameters include the picture matrix, center point, major and minor axes, rotation angle, ellipse start and end angles (0-360), and edge thickness. The number of spots in a picture is a random integer up to a maximum (8 by default). The center position and the lengths of the major and minor axes are random values within a certain proportion of the picture size. This randomization can be implemented with the random library, as sketched below.
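A minimal sketch of the spot generator, assuming the axis-length proportions below; only the RGB value (150, 150, 150) and the default maximum of 8 spots come from the text.

```python
import random

import cv2
import numpy as np

SPOT_COLOR = (150, 150, 150)   # RGB value given in the text
MAX_SPOTS = 8                  # default maximum spot count given in the text

def add_spot_noise(img):
    """Draw filled random ellipses on an all-black canvas of the same size
    as img, then add the two images pixel-wise (saturating at 255)."""
    h, w = img.shape[:2]
    canvas = np.zeros_like(img)
    for _ in range(random.randint(1, MAX_SPOTS)):
        center = (random.randint(0, w - 1), random.randint(0, h - 1))
        # Axis lengths as a proportion of the picture size (assumed range).
        axes = (random.randint(w // 40, w // 8), random.randint(h // 40, h // 8))
        angle = random.uniform(0.0, 360.0)
        # thickness -1 fills the ellipse; the text also allows an edge thickness
        cv2.ellipse(canvas, center, axes, angle, 0, 360, SPOT_COLOR, -1)
    return cv2.add(img, canvas)
```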
Image mixing mode one:
As shown in fig. 3, it comprises the following steps (a code sketch follows the list):
A1. select an image a and an image b, where image b contains an instrument on which tissue texture is reflected; extract the number of label colors of images a and b, a single picture's label using multiple colors to distinguish different instruments;
A2. cut out the reflective instrument in image b by setting the background of the reflective instrument image to black (0, 0, 0);
A3. overlay image a with the black-background instrument image obtained in step A2, i.e. add the two images pixel-wise to obtain a new image c;
A4. overlay the instrument label of image b onto the corresponding position of the label of image a, and renumber the colors of the labels of the new instruments according to the color count to obtain the label of image c.
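A sketch of mode one under simplifying assumptions: instrument_mask_b stands for the boolean instrument region extracted from image b's label, and the color renumbering of step A4 is reduced to copying b's label pixels over a's.

```python
import numpy as np

def blend_mode_one(img_a, label_a, img_b, label_b, instrument_mask_b):
    """Paste the reflective instrument of image b onto image a (steps A2-A3)
    and merge the labels (step A4, without color renumbering)."""
    m = instrument_mask_b[..., None]          # H x W x 1 boolean mask
    cut = np.where(m, img_b, 0)               # A2: black background (0, 0, 0)
    # A3: the cut-out is black outside the instrument, so covering image a
    # amounts to zeroing a under the mask and adding the two images.
    img_c = np.where(m, 0, img_a) + cut
    label_c = np.where(m, label_b, label_a)   # A4: overlay b's label
    return img_c, label_c
```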
Image mixing mode two:
As shown in fig. 4, it comprises the following steps (a code sketch follows the list):
B1. select an image a and an image b, and extract the number of label colors of images a and b;
B2. cover the instrument of image a by rotating the image and replacing the instrument region with the rotated image; where rotation cannot cover all instruments, cover the remaining instrument regions with equally sized nearby areas, i.e. by translation;
B3. cut out the instrument in image b and set the background of image b to black (0, 0, 0);
B4. add the black-background instrument image from image b to image a with its instrument covered, i.e. add corresponding pixels; during the addition the instrument pixels of image b are multiplied by a coefficient transmittance and the corresponding pixels of image a by 1 - transmittance; the sum is a new image c whose instrument region has a certain transparency, i.e. part of the background shows through on the instrument, simulating reflection;
B5. generate a label for image c, which, since the instrument of image a has been covered by background, is simply the instrument label of image b.
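A sketch of mode two under the same conventions; the 180-degree rotation stands in for step B2 (the translation fallback is omitted) and the default transmittance value is an assumption.

```python
import numpy as np

def blend_mode_two(img_a, img_b, label_b, inst_mask_a, inst_mask_b,
                   transmittance=0.4):
    """Hide image a's instrument, then alpha-blend image b's instrument over
    it so the background shows through, simulating reflection (steps B2-B5)."""
    # B2: cover a's instrument with the 180-degree-rotated frame
    # (the translation fallback for uncovered instrument pixels is omitted).
    covered = np.where(inst_mask_a[..., None], np.rot90(img_a, 2), img_a)
    # B3: cut b's instrument out onto a black background.
    cut = np.where(inst_mask_b[..., None], img_b, 0)
    # B4: inside b's instrument region mix transmittance * b with
    # (1 - transmittance) * a; elsewhere keep the covered image a.
    m = inst_mask_b[..., None].astype(np.float32)
    img_c = covered * (1.0 - m * transmittance) + cut * (m * transmittance)
    img_c = np.clip(img_c, 0, 255).astype(np.uint8)
    # B5: a's instrument is gone, so the new label is b's instrument label.
    return img_c, label_b.copy()
```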
Step 3, construct a network model comprising a feature backbone network, a feature pyramid network, a prototype prediction branch and a mask coefficient prediction branch; the input is a two-dimensional image and the output is the prediction result for the image, consisting of a set of object detection bounding boxes, masks and corresponding categories.
As shown in fig. 5, the network model is divided into two parallel tasks: a. prototype generation: generating a series of prototype masks that have the same size as the original image and do not depend on any single instance; b. mask coefficients: predicting for each instance a series of mask coefficients that encode the instance's representation in the prototype mask space. The prototype masks are then linearly combined with the corresponding predicted coefficients and cropped with the predicted bounding boxes to obtain the instance segmentation result of the whole image.
Prototype generation: the prototype branch predicts k prototype masks for each image and is implemented as a fully convolutional network (FCN). An FCN classifies the image at the pixel level: a deconvolution layer upsamples the feature map of the last convolutional layer back to the size of the input image, so that a prediction is produced for every pixel. In the prototype branch, the last layer of the FCN has k channels, one per prototype, and the branch is attached to a backbone feature layer. The prototype network is shown in fig. 6; feature sizes and channel counts are given for a 550 × 550 input image, arrows denote 3 × 3 convolutional layers, and the branch ends with an upsampling followed by a 1 × 1 convolutional layer. Taking the prototype branch from deeper backbone features produces more robust masks, and higher-resolution prototypes not only yield higher-quality masks but also work better on small targets. The FPN is therefore used, because its largest feature layer is also its deepest; that layer is then upsampled to one quarter of the input image size to improve detection performance on small objects. A sketch follows.
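A PyTorch sketch of such a prototype branch; the channel widths and the prototype count k are assumptions, while the overall shape (3 × 3 convolutions, upsampling, 1 × 1 convolution, k output channels) follows the description above.

```python
import torch.nn as nn
import torch.nn.functional as F

class ProtoNet(nn.Module):
    """Prototype branch: 3x3 convolutions on the deepest/largest FPN feature
    map, bilinear upsampling, then a 1x1 convolution emitting k prototypes."""
    def __init__(self, in_channels=256, k=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.proto = nn.Conv2d(256, k, kernel_size=1)

    def forward(self, feat):
        x = self.body(feat)
        # Upsample toward one quarter of the input image size to help with
        # small objects (x2 from an FPN feature at 1/8 scale).
        x = F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False)
        return F.relu(self.proto(x))   # k nonnegative prototype masks
```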
Mask coefficients: a typical anchor-based object detector has two heads: one predicting confidence scores for c classes and one predicting the 4 coordinates of the bounding box. To predict mask coefficients, a third head is added that predicts k mask coefficients, one per prototype, so each anchor predicts 4 + c + k numbers. So that prototypes can be subtracted from, and not only added to, the final mask, a tanh nonlinearity is applied to the k mask coefficients, which yields more stable outputs than leaving them unprocessed. The process is illustrated in fig. 7 and sketched below.
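A matching sketch of the prediction head; the anchor count per location and the channel width are assumptions.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Per anchor, predict c class confidences, 4 box coordinates and k mask
    coefficients (4 + c + k numbers); tanh lets coefficients be negative so
    prototypes can be subtracted from the final mask as well as added."""
    def __init__(self, in_channels=256, anchors=3, c=2, k=32):
        super().__init__()
        self.c, self.k = c, k
        self.cls = nn.Conv2d(in_channels, anchors * c, 3, padding=1)
        self.box = nn.Conv2d(in_channels, anchors * 4, 3, padding=1)
        self.cof = nn.Conv2d(in_channels, anchors * k, 3, padding=1)

    def forward(self, x):
        n = x.size(0)
        conf = self.cls(x).permute(0, 2, 3, 1).reshape(n, -1, self.c)
        loc = self.box(x).permute(0, 2, 3, 1).reshape(n, -1, 4)
        coef = torch.tanh(self.cof(x).permute(0, 2, 3, 1).reshape(n, -1, self.k))
        return conf, loc, coef
```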
Mask assembly: to produce the masks of the instances, the prototype branch and the mask coefficient branch are combined by linear combination, and a sigmoid nonlinearity is applied to the result to obtain the final masks. This can be implemented efficiently as a single matrix multiplication:
M = σ(PC^T)
where P is the h × w × k matrix of prototype masks and C is the n × k matrix of mask coefficients for the n instances that survive NMS and score thresholding.
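The assembly step can be written as a single matrix multiplication, for example:

```python
import torch

def assemble_masks(protos, coeffs):
    """M = sigmoid(P C^T).  protos: (h, w, k) prototype masks; coeffs:
    (n, k) tanh mask coefficients of the n instances surviving NMS and
    score thresholding; returns (h, w, n) instance masks."""
    return torch.sigmoid(torch.einsum("hwk,nk->hwn", protos, coeffs))
```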
Mask cropping: to preserve small objects in the prototypes, the final mask is cropped with the predicted bounding box; during training the ground-truth bounding box is used instead, and the mask loss L_mask is divided by the area of the ground-truth bounding box.
In this embodiment, the loss function consists of: 1. a classification loss L_cls; 2. a bounding-box regression loss L_box; 3. a mask loss L_mask = BCE(M, M_gt), where M is the predicted mask, M_gt is the ground-truth mask, and BCE is the pixel-wise binary cross-entropy.
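A sketch of the cropped, area-normalized mask loss of the two preceding paragraphs, assuming integer pixel box coordinates and sigmoid-activated predicted masks:

```python
import torch
import torch.nn.functional as F

def mask_loss(pred_masks, gt_masks, gt_boxes):
    """pred_masks, gt_masks: (n, h, w) tensors, predictions already in
    [0, 1]; gt_boxes: list of integer (x1, y1, x2, y2). BCE is computed
    inside the ground-truth box and divided by that box's area."""
    losses = []
    for m, g, (x1, y1, x2, y2) in zip(pred_masks, gt_masks, gt_boxes):
        area = max((x2 - x1) * (y2 - y1), 1)
        bce = F.binary_cross_entropy(m[y1:y2, x1:x2],
                                     g[y1:y2, x1:x2].float(),
                                     reduction="sum")
        losses.append(bce / area)
    return torch.stack(losses).mean()
```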
Step 4, using the training data set as training samples, train the network model constructed in step 3 with a back-propagation strategy, minimizing the loss function to obtain optimized network weights.
Step 5, model testing: test the trained network model with the validation data samples, feed the validation images into the network to obtain prediction results, compare the predictions with the labels, and judge whether the network generalizes well.
With this method, instrument instance segmentation runs in real time: the average frame rate is 66 fps with ResNet50 as the backbone network and 49.79 fps with ResNet101, fully achieving real-time instance segmentation.
The method reaches an accuracy of 89.17% at a frame rate of 66 fps, surpassing the current state-of-the-art real-time semantic segmentation methods.
Although embodiments of the present invention have been shown and described above, they are exemplary and are not to be construed as limiting the invention; those of ordinary skill in the art may make variations, modifications, substitutions and alterations to the above embodiments within the scope of the invention.
It should be understood that the above embodiments are merely examples given to clearly illustrate the invention and do not limit its embodiments. Other variations and modifications will be apparent to those skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims.

Claims (8)

1. A real-time neurosurgical instrument segmentation method based on endoscopic images, comprising the steps of:
S1, collecting endoscopic surgery image data and labeling the images manually, the labels spatially segmenting and semantically classifying the foreground, namely the instruments, and the background; constructing a data set, setting cross-validation samples, and establishing an instrument instance segmentation database divided into a training set and a validation set;
S2, performing data augmentation on the data set, including flipping, rotation, image intensity adjustment, light spot/Gaussian noise addition and image mixing, so that the number of samples of the data set is increased and the samples are enriched;
S3, constructing a network model comprising a feature backbone network, a feature pyramid network, a prototype prediction branch and a mask coefficient prediction branch, wherein the input is a two-dimensional image and the output is the prediction result for the image, consisting of a set of object detection bounding boxes, masks and corresponding categories;
S4, using the training data set as training samples, training the network model constructed in step S3 with a back-propagation strategy and minimizing a loss function to obtain optimized network weights;
S5, testing the trained network model with the validation data samples: feeding the validation images into the network model to obtain prediction results, comparing the predictions with the labels, and judging whether the network generalizes well; wherein the image mixing specifically comprises the following steps:
A1. selecting an image a and an image b, wherein image b contains an instrument on which tissue texture is reflected; extracting the number of label colors of images a and b, a single picture's label using multiple colors to distinguish different instruments;
A2. cutting out the reflective instrument in image b, obtained by setting the background of the reflective instrument image to black (0, 0, 0);
A3. overlaying image a with the black-background instrument image obtained in step A2, i.e. adding the two images pixel-wise to obtain a new image c;
A4. overlaying the instrument label of image b onto the corresponding position of the label of image a, and renumbering the colors of the labels of the new instruments according to the color count to obtain the label of image c;
or, the image mixing specifically comprises the following steps:
B1. selecting an image a and an image b, and extracting the number of label colors of images a and b;
B2. covering the instrument of image a by rotating the image and replacing the instrument region with the rotated image; where rotation cannot cover all instruments, covering the remaining instrument regions with equally sized nearby areas, i.e. by translation;
B3. cutting out the instrument in image b and setting the background of image b to black (0, 0, 0);
B4. adding the black-background instrument image from image b to image a with its instrument covered, i.e. adding corresponding pixels, wherein during the addition the instrument pixels of image b are multiplied by a coefficient transmittance and the corresponding pixels of image a by 1 - transmittance; the sum is a new image c whose instrument region has a certain transparency, i.e. part of the background shows through on the instrument, simulating reflection;
B5. generating a label for image c, which, since the instrument of image a has been covered by background, is the instrument label of image b.
2. The real-time neurosurgical instrument segmentation method based on endoscopic images as claimed in claim 1, wherein in step S2, when selecting a specific data augmentation mode, the augmentation modes of picture flipping, picture rotation, image intensity adjustment and light spot/Gaussian noise addition are each selected for a given picture by a randomly generated probability.
3. The real-time neurosurgical instrument segmentation method based on endoscopic images as claimed in claim 2, wherein the randomly generated probability specifically comprises: first setting a picture rotation probability, a picture flipping probability, an image intensity adjustment probability and a light spot/Gaussian noise addition probability; then generating a floating-point random number between 0 and 1, and applying the corresponding augmentation mode to the current picture when the random number is greater than the preset threshold probability.
4. The real-time neurosurgical instrument segmentation method based on endoscopic images as claimed in claim 1, wherein the light spot/Gaussian noise addition specifically comprises: to counteract the influence of light spots, adding elliptical light spots to the original image by image processing, the spots having random sizes and random positions in the image, so that the network learns to treat the spots as noise rather than background or foreground.
5. The real-time neurosurgical instrument segmentation method based on endoscopic images as claimed in claim 4, wherein the elliptical light spots are added as follows: randomly generating an integer less than 8 as the number of light spots, drawing elliptical spots on an image of the same size as the original image, and adding that image and the original image.
6. The real-time neurosurgical instrument segmentation method based on endoscopic images as claimed in any one of claims 1 to 5, wherein the network model is divided into two parallel tasks comprising: a. prototype generation: generating a series of prototype masks that have the same size as the original image and do not depend on any single instance; b. mask coefficients: predicting for each instance a series of mask coefficients that encode the instance's representation in the prototype mask space; and then linearly combining the prototype masks with the corresponding predicted coefficients and cropping with the predicted bounding boxes to obtain the instance segmentation result of the whole image.
7. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method according to any one of claims 1 to 6 when executing the computer program.
8. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
Application CN202011418220.8A, priority and filing date 2020-12-07: Real-time neurosurgical instrument segmentation method based on endoscope images. Granted as CN112396601B (Active).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011418220.8A (filed 2020-12-07; granted as CN112396601B): Real-time neurosurgical instrument segmentation method based on endoscope images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011418220.8A (filed 2020-12-07; granted as CN112396601B): Real-time neurosurgical instrument segmentation method based on endoscope images

Publications (2)

Publication Number Publication Date
CN112396601A CN112396601A (en) 2021-02-23
CN112396601B (en) 2022-07-29

Family

ID=74605173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011418220.8A (Active; granted as CN112396601B): Real-time neurosurgical instrument segmentation method based on endoscope images

Country Status (1)

Country Link
CN (1) CN112396601B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107624193A (en) * 2015-04-29 2018-01-23 西门子公司 Method and system for semantic segmentation in laparoscopic and endoscopic 2D/2.5D image data
CN108510493A (en) * 2018-04-09 2018-09-07 深圳大学 Boundary localization method, storage medium and terminal for a target object in medical images

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215079B (en) * 2018-07-17 2021-01-15 艾瑞迈迪医疗科技(北京)有限公司 Image processing method, surgical navigation device, electronic device, and storage medium
WO2020147957A1 (en) * 2019-01-17 2020-07-23 Toyota Motor Europe System and method for generating a mask for object instances in an image
AU2020219858A1 (en) * 2019-02-08 2021-09-30 The Board Of Trustees Of The University Of Illinois Image-guided surgery system
CN109934831A (en) * 2019-03-18 2019-06-25 安徽紫薇帝星数字科技有限公司 Real-time navigation method for tumor surgery based on indocyanine green fluorescence imaging
CN110781924B (en) * 2019-09-29 2023-02-14 哈尔滨工程大学 Side-scan sonar image feature extraction method based on full convolution neural network
CN111597920B (en) * 2020-04-27 2022-11-15 东南大学 Fully convolutional single-stage human instance segmentation method for natural scenes


Also Published As

Publication number Publication date
CN112396601A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
EP3553742B1 (en) Method and device for identifying pathological picture
CN111292264B (en) Image high dynamic range reconstruction method based on deep learning
Startsev et al. 360-aware saliency estimation with conventional image saliency predictors
EP3547179A1 (en) Method and system for adjusting color of image
US10600171B2 (en) Image-blending via alignment or photometric adjustments computed by a neural network
CN110490896B (en) Video frame image processing method and device
CN109712165B (en) Similar foreground image set segmentation method based on convolutional neural network
CN111767760A (en) Living body detection method and apparatus, electronic device, and storage medium
EP3675034A1 (en) Image realism predictor
CN110648331B (en) Detection method for medical image segmentation, medical image segmentation method and device
US20230377097A1 (en) Laparoscopic image smoke removal method based on generative adversarial network
CN114170227B (en) Product surface defect detection method, device, equipment and storage medium
CN113989407B (en) Training method and system for limb part recognition model in CT image
Han et al. Perceptual CT loss: implementing CT image specific perceptual loss for CNN-based low-dose CT denoiser
CN112396601B (en) Real-time neurosurgical instrument segmentation method based on endoscope images
CN112818774A (en) Living body detection method and device
CN110728630A (en) Internet image processing method based on augmented reality and augmented reality glasses
EP4283566A2 (en) Single image 3d photography with soft-layering and depth-aware inpainting
CN114972611A (en) Depth texture synthesis method based on guide matching loss and related equipment
Yin et al. Visual Attention and ODE-inspired Fusion Network for image dehazing
JP7349005B1 (en) Program, information processing method, information processing device, and learning model generation method
Park et al. Improving Instance Segmentation using Synthetic Data with Artificial Distractors
CN116797611B (en) Polyp focus segmentation method, device and storage medium
Swingler A Suite of Incremental Image Degradation Operators for Testing Image Classification Algorithms.
CN117011407A (en) Image generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant