CN116030260A - Surgical whole-scene semantic segmentation method based on long-strip convolution attention - Google Patents


Publication number
CN116030260A
CN116030260A (application CN202310304276.8A)
Authority
CN
China
Prior art keywords
convolution
representing
boundary
segmentation
surgical
Prior art date
Legal status
Granted
Application number
CN202310304276.8A
Other languages
Chinese (zh)
Other versions
CN116030260B (en)
Inventor
刘敏
朱悦豪
汪嘉正
张哲
王耀南
Current Assignee
Hunan University
Original Assignee
Hunan University
Priority date
Filing date
Publication date
Application filed by Hunan University
Priority to CN202310304276.8A
Publication of CN116030260A
Application granted
Publication of CN116030260B
Legal status: Active


Abstract

The invention discloses a surgical whole-scene semantic segmentation method based on long-strip convolution attention, which comprises the following steps: acquiring image data of an endoscopic surgery video and the ground-truth labels corresponding to the image data; constructing a surgical whole-scene semantic segmentation model; encoding the image data; passing the encoding results through a long-strip convolution attention module to output feature maps; performing up-sampling and concatenation on the feature maps corresponding to the encoding results of each stage to obtain a segmentation result; convolving the feature map with the largest size to obtain a boundary map; setting a boundary-guided segmentation head and passing the ground-truth label through it to obtain a target boundary map; calculating a boundary loss from the boundary map and the target boundary map; calculating a segmentation loss from the segmentation result and the ground-truth label; combining the boundary loss and the segmentation loss to construct a hybrid loss function; and optimizing the surgical whole-scene semantic segmentation model with the hybrid loss function. The method meets the precision requirement of surgical scene segmentation on region boundaries.

Description

Surgical whole-scene semantic segmentation method based on long-strip convolution attention
Technical Field
The invention relates to the technical field of surgical scene segmentation, in particular to a surgical whole-scene semantic segmentation method based on long-strip convolution attention.
Background
The intelligent endoscopic surgical robot is a typical application of robot-assisted minimally invasive surgery, and can effectively improve the success rate of surgery, shorten the recovery period of surgery and improve the safety of patients. Automated surgical scene segmentation is a key technology for computer-assisted surgery and intelligent surgical robots. Its task is to segment the anatomical region and the medical device objects in the surgical scene and assign a class label to each pixel. The segmentation results may be used for a number of clinical tasks such as lesion tissue localization, surgical decision making, surgical navigation, and surgical skill assessment, among others.
Accurate interpretation of the entire scene in an endoscopic surgery video is a very challenging task. Compared with conventional natural scenes, the local features of segmentation targets in a surgical scene have lower contrast, and different biological tissues or instruments show higher feature similarity within local regions. Most existing work adopts attention mechanisms to combine the local semantic features of a target with its global features and to capture long-range dependencies. DANet adaptively combines local features with their global dependencies by using position attention and channel attention in parallel, but the computational complexity of position attention is high, and its feature modeling relies only on the information of the current feature map, which remains limited for surgical scene segmentation. Transformer-based models associate richer cues with each pixel through self-attention; however, attending over more views with self-attention incurs computational complexity that grows quadratically with the number of pixel embeddings, and the self-attention mechanism considers only adaptability in the spatial dimension while ignoring adaptability in the channel dimension, which is also important for visual tasks.
Another non-negligible challenge is the precision requirement of surgical scene segmentation on region boundaries. The clinician must pay constant attention to the cutting boundary and control errors while performing the procedure, which requires the network model to resolve tissue boundaries accurately. GSCNN proposes a dual-stream CNN architecture for semantic segmentation that routes shape information through a dedicated processing branch and can produce sharper predictions around object boundaries. The model improves the performance of the baseline, but the multi-path branching operation increases the computational complexity of the network.
Disclosure of Invention
In view of the above problems, it is necessary to provide a surgical whole-scene semantic segmentation method based on long-strip convolution attention.
The invention provides a surgical whole-scene semantic segmentation method based on long-strip convolution attention, which comprises the following steps:
S1: acquiring image data of an endoscopic surgery video and the ground-truth label corresponding to the image data; constructing a surgical whole-scene semantic segmentation model; the surgical whole-scene semantic segmentation model comprises an encoder, a long-strip convolution attention module and a segmentation module;
S2: the encoder encodes the image data and outputs an encoding result; the encoding result comprises encoding results of different stages;
S3: the encoding result passes through the long-strip convolution attention module to output a feature map; the feature map comprises feature maps corresponding to the encoding results of all stages;
S4: the segmentation module performs up-sampling and concatenation on the feature maps corresponding to the encoding results of each stage to obtain a segmentation result;
S5: convolving the feature map with the largest size to obtain a boundary map; setting a boundary-guided segmentation head, and passing the ground-truth label through the boundary-guided segmentation head to obtain a target boundary map;
S6: calculating a boundary loss from the boundary map and the target boundary map; calculating a segmentation loss from the segmentation result and the ground-truth label; combining the boundary loss and the segmentation loss to construct a hybrid loss function; optimizing the surgical whole-scene semantic segmentation model with the hybrid loss function;
S7: inputting the image data to be segmented into the optimized surgical whole-scene semantic segmentation model, and outputting the final segmentation result.
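For orientation only, the following PyTorch-style sketch shows how steps S2 to S6 could be wired together in one training iteration. Every name in it (model.encode, model.attend, model.decode, boundary_head, boundary_gt_head, joint_loss_fn) is a hypothetical placeholder for the components detailed in the description, not an interface defined by the invention.

```python
import torch

def training_step(model, boundary_head, boundary_gt_head, joint_loss_fn,
                  optimizer, images, labels):
    """One hypothetical optimization step covering steps S2-S6."""
    feats = model.encode(images)                 # S2: multi-stage encoding results
    feats = model.attend(feats)                  # S3: long-strip convolution attention
    seg_logits, largest = model.decode(feats)    # S4: up-sample + concatenate
    bm_x = boundary_head(largest)                # S5: predicted boundary map
    with torch.no_grad():
        bm_gt = boundary_gt_head(labels)         # S5: target boundary map from labels
    loss = joint_loss_fn(seg_logits, labels, bm_x, bm_gt)  # S6: hybrid loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```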
Preferably, in S2, HRNetV2 is used as the encoder; the encoding results output by the stages have different sizes.
Preferably, in S3, the long-strip convolution attention module includes a region feature extraction block and an instrument feature extraction block; the region feature extraction block and the instrument feature extraction block operate in parallel on the encoding result of each stage to obtain region features and instrument features; the encoding result of each stage is then added to the corresponding region features and instrument features to obtain the feature map corresponding to that stage's encoding result.
Preferably, the region feature extraction block includes a depthwise convolution, a first multi-branch depthwise strip convolution, and a 1×1 convolution; the region features are extracted as follows:
the depthwise convolution aggregates local information of the encoding result of each stage; the local information is denoted $x_i^{l}$ and computed as
$x_i^{l} = BN\big(DW_{5\times5}(x_i)\big)$;
the first multi-branch depthwise strip convolution and the 1×1 convolution obtain a first attention map from the local information; the first attention map is denoted $Att_i^{an}$ and computed as
$Att_i^{an} = W_{1\times1}\Big(\textstyle\sum_{j} DW_{1\times k_j}\big(DW_{k_j\times1}(x_i^{l})\big)\Big)$;
the first attention map is multiplied element-wise with the local information to obtain the region features:
$x_i^{an} = Att_i^{an} \otimes x_i^{l}$;
where $DW_{5\times5}$ denotes a depthwise convolution with a 5×5 kernel; $BN$ denotes a batch normalization operation; $x_i$ denotes the encoding result of the $i$-th stage; $W_{1\times1}$ denotes a 1×1 convolution; $j$ indexes the branches; $x_i^{an}$ denotes the region features of the $i$-th stage; $k_j$ denotes the strip convolution kernel size; $\otimes$ denotes element-wise multiplication.
Preferably, the instrument feature extraction block comprises a depthwise convolution, a second multi-branch depthwise strip convolution, and a 1×1 convolution; the instrument features are extracted as follows:
the depthwise convolution aggregates local information of the encoding result of each stage; the local information is denoted $x_i^{l}$ and computed as
$x_i^{l} = BN\big(DW_{5\times5}(x_i)\big)$;
the second multi-branch depthwise strip convolution and the 1×1 convolution obtain a second attention map from the local information; the second attention map is denoted $Att_i^{ins}$ and computed as
$Att_i^{ins} = W_{1\times1}\Big(\textstyle\sum_{j} \big(DW_{1\times k_j}(x_i^{l}) + DW_{k_j\times1}(x_i^{l})\big)\Big)$;
the second attention map is multiplied element-wise with the local information to obtain the instrument features:
$x_i^{ins} = Att_i^{ins} \otimes x_i^{l}$;
where $DW_{5\times5}$ denotes a depthwise convolution with a 5×5 kernel; $BN$ denotes a batch normalization operation; $x_i$ denotes the encoding result of the $i$-th stage; $W_{1\times1}$ denotes a 1×1 convolution; $j$ indexes the branches; $x_i^{ins}$ denotes the instrument features of the $i$-th stage; $k_j$ denotes the strip convolution kernel size; $\otimes$ denotes element-wise multiplication.
Preferably, in S4, the segmentation module selects the feature map with the largest size and up-samples the remaining feature maps to that size; the up-sampled feature maps are concatenated, and the concatenation result is up-sampled by a set multiple to obtain the segmentation result.
Preferably, in S5, the calculation formula of the boundary map is:
Figure SMS_17
wherein ,bm x representing a boundary map;UP ×4 representing a 4-fold upsampling operation;W 1×1 representing a1 x 1 convolution;W bg a convolution operation representing a combination of a Relu activation function and a batch normalization;
Figure SMS_18
a feature map having a maximum size;
the boundary guiding dividing head comprises three branches; each branch extracts the boundary feature of the truth label by applying the convolution network with the Laplace kernel, and each branchStep length is inconsistent; splicing boundary features extracted by each branch to obtain boundary information, and setting a fixed contour threshold to refine the boundary information to obtain a target boundary map; the target boundary map is recorded as:bm gt
Preferably, in S6, the boundary loss is calculated as
$L_{bg} = \alpha L_{dice}(bm_x, bm_{gt}) + \beta L_{ce}(bm_x, bm_{gt})$;
the segmentation result is calculated as
$x_{seg} = UP_{\times4}(x_{maff})$;
the segmentation loss is calculated as
$L_{seg} = L_{ce}(x_{seg}, gt)$;
where $L_{bg}$ denotes the boundary loss; $L_{dice}(\cdot)$ denotes the Dice loss function; $bm_x$ denotes the boundary map; $bm_{gt}$ denotes the target boundary map; $L_{ce}(\cdot)$ denotes the cross-entropy loss function; $\alpha$ and $\beta$ are constants; $L_{seg}$ denotes the segmentation loss; $x_{seg}$ denotes the segmentation result; $gt$ denotes the ground-truth label; $UP_{\times4}$ denotes a 4× up-sampling operation; $x_{maff}$ denotes the concatenation result.
Preferably, in S6, the calculation formula of the mixing loss function is:
Figure SMS_22
wherein ,L joint representing a mixing loss function;L seg representing segmentation loss;L bg representing boundary loss;γandεis constant.
Preferably, the set multiple is 4 times.
Beneficial effects: through the constructed surgical whole-scene semantic segmentation model, the method provided by the invention overcomes the low contrast of local features of segmentation targets in the surgical scene, and also meets the precision requirement of surgical scene segmentation on region boundaries.
Drawings
Exemplary embodiments of the present invention may be more fully understood by reference to the following drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification; together with the embodiments of the application they serve to explain the invention and do not constitute a limitation of the invention. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a flow chart of a method provided according to an exemplary embodiment of the present application.
Fig. 2 is a schematic structural diagram of the surgical whole-scene semantic segmentation model provided according to an exemplary embodiment of the present application.
Fig. 3 is a schematic diagram of a long strip convolution attention module provided according to an exemplary embodiment of the present application.
Fig. 4 is a schematic diagram of a segmentation effect provided according to an exemplary embodiment of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs.
In addition, the terms "first" and "second" etc. are used to distinguish different objects and are not used to describe a particular order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
The embodiment of the application provides a surgical whole-scene semantic segmentation method based on long-strip convolution attention, and the method is described below with reference to the accompanying drawings.
Referring to Fig. 1 and Fig. 2, which illustrate the flow and the structure of a surgical whole-scene semantic segmentation method based on long-strip convolution attention according to some embodiments of the present application, the method may include the following steps:
S1: acquiring image data of an endoscopic surgery video and the ground-truth label corresponding to the image data; constructing a surgical whole-scene semantic segmentation model; the surgical whole-scene semantic segmentation model comprises an encoder, a long-strip convolution attention module (LSKA) and a segmentation module;
S2: the encoder encodes the image data and outputs an encoding result; the encoding result comprises encoding results of different stages;
In this embodiment, the encoder employs HRNetV2; the encoding results output by the stages have different sizes.
S3: the encoding result passes through the long-strip convolution attention module to output a feature map; the feature map comprises feature maps corresponding to the encoding results of all stages;
Specifically, as shown in fig. 3, the long-strip convolution attention module includes a region feature extraction block (An-block) and an instrument feature extraction block (Ins-block); the two blocks operate in parallel on the encoding result of each stage to obtain region features and instrument features; the encoding result of each stage is then added to the corresponding region features and instrument features to obtain the feature map corresponding to that stage's encoding result.
The region feature extraction block comprises a depthwise convolution, a first multi-branch depthwise strip convolution and a 1×1 convolution; the region features are extracted as follows:
the depthwise convolution aggregates local information of the encoding result of each stage; the local information is denoted $x_i^{l}$ and computed as
$x_i^{l} = BN\big(DW_{5\times5}(x_i)\big)$;
the first multi-branch depthwise strip convolution and the 1×1 convolution obtain a first attention map from the local information; the first attention map is denoted $Att_i^{an}$ and computed as
$Att_i^{an} = W_{1\times1}\Big(\textstyle\sum_{j} DW_{1\times k_j}\big(DW_{k_j\times1}(x_i^{l})\big)\Big)$;
the first attention map is multiplied element-wise with the local information to obtain the region features:
$x_i^{an} = Att_i^{an} \otimes x_i^{l}$;
where $DW_{5\times5}$ denotes a depthwise convolution with a 5×5 kernel; $BN$ denotes a batch normalization operation; $x_i$ denotes the encoding result of the $i$-th stage, $i\in\{0,1,2,3\}$; $W_{1\times1}$ denotes a 1×1 convolution; $j$ indexes the branches, $j\in\{0,1,2\}$; $x_i^{an}$ denotes the region features of the $i$-th stage; $k_j$ denotes the strip convolution kernel size, which may be set to 11, 21 or 31; $\otimes$ denotes element-wise multiplication.
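As a concrete illustration of the region feature extraction block described above, the following PyTorch sketch implements one plausible reading of the formulas: a 5×5 depthwise convolution with batch normalization, three branches of cascaded k×1 and 1×k depthwise strip convolutions (k = 11, 21, 31), a 1×1 convolution producing the attention map, and an element-wise multiplication with the local information. PyTorch itself, the class name RegionAttention, and the exact branch composition are assumptions for illustration, not a definitive statement of the claimed structure.

```python
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    """Sketch of the region feature extraction block (An-block)."""

    def __init__(self, channels, kernel_sizes=(11, 21, 31)):
        super().__init__()
        # 5x5 depthwise convolution + BN aggregates local information (x_i^l).
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2, groups=channels),
            nn.BatchNorm2d(channels),
        )
        # Each branch cascades a k x 1 and a 1 x k depthwise strip convolution,
        # approximating a block-shaped k x k receptive field.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
            )
            for k in kernel_sizes
        ])
        self.proj = nn.Conv2d(channels, channels, 1)  # W_{1x1}

    def forward(self, x):
        local = self.local(x)                          # x_i^l
        att = sum(branch(local) for branch in self.branches)
        att = self.proj(att)                           # Att_i^an
        return att * local                             # x_i^an
```

For example, `RegionAttention(64)(torch.randn(1, 64, 128, 160))` returns a tensor of the same shape as its input.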
The instrument feature extraction block comprises a depthwise convolution, a second multi-branch depthwise strip convolution and a 1×1 convolution; the instrument features are extracted as follows:
the depthwise convolution aggregates local information of the encoding result of each stage; the local information is denoted $x_i^{l}$ and computed as
$x_i^{l} = BN\big(DW_{5\times5}(x_i)\big)$;
the second multi-branch depthwise strip convolution and the 1×1 convolution obtain a second attention map from the local information; the second attention map is denoted $Att_i^{ins}$ and computed as
$Att_i^{ins} = W_{1\times1}\Big(\textstyle\sum_{j} \big(DW_{1\times k_j}(x_i^{l}) + DW_{k_j\times1}(x_i^{l})\big)\Big)$;
the second attention map is multiplied element-wise with the local information to obtain the instrument features:
$x_i^{ins} = Att_i^{ins} \otimes x_i^{l}$;
where $DW_{5\times5}$ denotes a depthwise convolution with a 5×5 kernel; $BN$ denotes a batch normalization operation; $x_i$ denotes the encoding result of the $i$-th stage; $W_{1\times1}$ denotes a 1×1 convolution; $j$ indexes the branches; $x_i^{ins}$ denotes the instrument features of the $i$-th stage; $k_j$ denotes the strip convolution kernel size; $\otimes$ denotes element-wise multiplication.
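In the same spirit, the sketch below adds an instrument-feature block and a wrapper that runs both blocks in parallel and adds their outputs back to the stage input, as described above. It reuses the RegionAttention class from the previous sketch; InstrumentAttention and LSKABlock are hypothetical names, and summing (rather than cascading) the strip convolutions in the instrument branch follows the later description of extracting strip-shaped instrument features by addition of depthwise strip convolutions.

```python
import torch
import torch.nn as nn

class InstrumentAttention(nn.Module):
    """Sketch of the instrument feature extraction block (Ins-block)."""

    def __init__(self, channels, kernel_sizes=(11, 21, 31)):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2, groups=channels),
            nn.BatchNorm2d(channels),
        )
        # Parallel k x 1 and 1 x k depthwise strip convolutions whose outputs
        # are added, giving a cross/strip-shaped receptive field.
        self.vert = nn.ModuleList([
            nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)
            for k in kernel_sizes])
        self.horz = nn.ModuleList([
            nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels)
            for k in kernel_sizes])
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        local = self.local(x)
        att = sum(v(local) + h(local) for v, h in zip(self.vert, self.horz))
        return self.proj(att) * local                  # x_i^ins


class LSKABlock(nn.Module):
    """Runs the An-block and Ins-block in parallel and adds both to the input."""

    def __init__(self, channels):
        super().__init__()
        self.an = RegionAttention(channels)            # from the previous sketch
        self.ins = InstrumentAttention(channels)

    def forward(self, x):
        return x + self.an(x) + self.ins(x)
```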
S4: the segmentation module performs up-sampling and concatenation on the feature maps corresponding to the encoding results of each stage to obtain the segmentation result;
Specifically, the segmentation module selects the feature map with the largest size and up-samples the remaining feature maps to that size; the up-sampled feature maps are concatenated, and the concatenation result is up-sampled by a set multiple to obtain the segmentation result.
In this embodiment, the set multiple is 4 times.
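A minimal sketch of this fusion step is shown below, assuming four stage feature maps and a 1×1 classification convolution before the final 4× up-sampling; the classifier, the class name SegmentationHead, and the stage channel widths are assumptions added for illustration, since the text above folds the prediction step into the concatenation result.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    """Up-samples all stage features to the largest size, concatenates them,
    and produces the segmentation result at input resolution (4x up-sampling)."""

    def __init__(self, stage_channels, num_classes):
        super().__init__()
        self.classifier = nn.Conv2d(sum(stage_channels), num_classes, 1)

    def forward(self, feats):
        # feats[0] is assumed to be the largest (highest-resolution) feature map.
        size = feats[0].shape[-2:]
        upsampled = [feats[0]] + [
            F.interpolate(f, size=size, mode="bilinear", align_corners=False)
            for f in feats[1:]
        ]
        x_maff = torch.cat(upsampled, dim=1)           # concatenation result
        logits = self.classifier(x_maff)               # 1x1 classifier (added assumption)
        x_seg = F.interpolate(logits, scale_factor=4, mode="bilinear",
                              align_corners=False)     # UP_x4
        return x_seg, x_maff
```

For example, `SegmentationHead((32, 64, 128, 256), num_classes=12)` would accept four feature maps whose channel widths follow that (hypothetical) order.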
S5: convolving the feature map with the largest size to obtain a boundary map; setting a boundary-guided segmentation head, and passing the ground-truth label through the boundary-guided segmentation head to obtain a target boundary map;
Specifically, the boundary map is calculated as
$bm_x = UP_{\times4}\big(W_{1\times1}(W_{bg}(x_{max}))\big)$;
where $bm_x$ denotes the boundary map (the boundary map of the feature map with the largest size, i.e. of the highest-resolution branch); $UP_{\times4}$ denotes a 4× up-sampling operation; $W_{1\times1}$ denotes a 1×1 convolution; $W_{bg}$ denotes a convolution operation combined with batch normalization and a ReLU activation; $x_{max}$ denotes the feature map with the largest size.
The boundary-guided segmentation head comprises three branches; each branch applies a convolution with a Laplacian kernel to extract boundary features from the ground-truth label, and the strides of the branches differ; the boundary features extracted by the branches are concatenated to obtain boundary information, and a fixed contour threshold is set to refine the boundary information into the target boundary map, denoted $bm_{gt}$.
In this embodiment, the boundary-guided segmentation head is designed with three branches using different strides to obtain multi-scale information; the boundary feature maps of different sizes are up-sampled to the same resolution and dynamically re-weighted through the concatenation operation to obtain richer boundary information.
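The sketch below illustrates both halves of this step under stated assumptions: a predicted-boundary branch built from a 3×3 Conv-BN-ReLU (standing in for W_bg), a 1×1 convolution and 4× up-sampling, and a target-boundary generator that convolves the ground-truth label with a fixed 3×3 Laplacian kernel at three different strides, up-samples, concatenates and thresholds the result. The kernel values, the strides (1, 2, 4), the threshold, and the per-branch maximum used in place of the learned re-weighting are illustrative choices, not values stated in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryBranch(nn.Module):
    """Predicts the boundary map bm_x from the largest feature map."""

    def __init__(self, channels):
        super().__init__()
        self.w_bg = nn.Sequential(                     # W_bg: conv + BN + ReLU
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.w_1x1 = nn.Conv2d(channels, 1, 1)

    def forward(self, x_max):
        bm = self.w_1x1(self.w_bg(x_max))
        return F.interpolate(bm, scale_factor=4, mode="bilinear", align_corners=False)


def boundary_gt_head(labels, strides=(1, 2, 4), threshold=0.1):
    """Builds the target boundary map bm_gt from integer ground-truth labels."""
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                       device=labels.device).view(1, 1, 3, 3)
    x = labels.float().unsqueeze(1)                    # (N, 1, H, W)
    feats = []
    for s in strides:
        edge = F.conv2d(x, lap, stride=s, padding=1).abs()
        feats.append(F.interpolate(edge, size=x.shape[-2:], mode="bilinear",
                                   align_corners=False))
    boundary = torch.cat(feats, dim=1).max(dim=1, keepdim=True).values
    return (boundary > threshold).float()              # fixed contour threshold
```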
S6: calculating a boundary loss from the boundary map and the target boundary map; calculating a segmentation loss from the segmentation result and the ground-truth label; combining the boundary loss and the segmentation loss to construct a hybrid loss function; optimizing the surgical whole-scene semantic segmentation model with the hybrid loss function;
Specifically, the boundary loss is calculated as
$L_{bg} = \alpha L_{dice}(bm_x, bm_{gt}) + \beta L_{ce}(bm_x, bm_{gt})$;
the segmentation result is calculated as
$x_{seg} = UP_{\times4}(x_{maff})$;
the segmentation loss is calculated as
$L_{seg} = L_{ce}(x_{seg}, gt)$;
where $L_{bg}$ denotes the boundary loss; $L_{dice}(\cdot)$ denotes the Dice loss function; $bm_x$ denotes the boundary map; $bm_{gt}$ denotes the target boundary map; $L_{ce}(\cdot)$ denotes the cross-entropy loss function; $\alpha$ and $\beta$ are constants, and in this embodiment both default to 1; $L_{seg}$ denotes the segmentation loss; $x_{seg}$ denotes the segmentation result; $gt$ denotes the ground-truth label; $UP_{\times4}$ denotes a 4× up-sampling operation; $x_{maff}$ denotes the concatenation result.
The hybrid loss function is calculated as
$L_{joint} = \gamma L_{seg} + \varepsilon L_{bg}$;
where $L_{joint}$ denotes the hybrid loss function; $L_{seg}$ denotes the segmentation loss; $L_{bg}$ denotes the boundary loss; $\gamma$ and $\varepsilon$ are constants, and in this embodiment both default to 1.
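The loss terms can be written compactly as below; the soft Dice formulation for the binary boundary map and the use of binary cross-entropy for that branch are assumptions consistent with the formulas above, and all weights default to 1 as stated.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for the binary boundary map (pred: logits, target: {0,1})."""
    prob = torch.sigmoid(pred)
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def boundary_loss(bm_x, bm_gt, alpha=1.0, beta=1.0):
    """L_bg = alpha * L_dice + beta * L_ce, with cross-entropy taken as BCE here."""
    return (alpha * dice_loss(bm_x, bm_gt)
            + beta * F.binary_cross_entropy_with_logits(bm_x, bm_gt))

def segmentation_loss(x_seg, gt):
    """L_seg = L_ce(x_seg, gt); x_seg are class logits, gt integer labels."""
    return F.cross_entropy(x_seg, gt)

def joint_loss(x_seg, gt, bm_x, bm_gt, gamma=1.0, epsilon=1.0):
    """L_joint = gamma * L_seg + epsilon * L_bg."""
    return gamma * segmentation_loss(x_seg, gt) + epsilon * boundary_loss(bm_x, bm_gt)
```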
S7: inputting the image data to be segmented into the optimized surgical whole-scene semantic segmentation model, and outputting the final segmentation result.
Fig. 4 is a schematic diagram of the segmentation effect, in which (a) is the 151st frame of the 1st test sequence in the EndoVis2018 dataset, (a1) is the ground-truth label of (a), and (a2) is the segmentation result of (a) obtained by the method provided in this embodiment; (b) is the 150th frame of the 2nd test sequence in the EndoVis2018 dataset, (b1) is the ground-truth label of (b), and (b2) is the segmentation result of (b) obtained by this method; (c) is the 153rd frame of the 3rd test sequence in the EndoVis2018 dataset, (c1) is the ground-truth label of (c), and (c2) is the segmentation result of (c) obtained by this method. As can be seen from the figure, the segmentation results of the method provided in this embodiment are very close to the ground-truth labels, which demonstrates that the method achieves a good segmentation effect.
The method provided by this embodiment adaptively learns region features and instrument features through the long-strip convolution attention module. The module extracts the block-shaped features of surgical regions and the strip-shaped features of surgical instruments in parallel, using cascaded and additive depthwise strip convolutions respectively, and establishes long-range pixel dependencies through convolution kernels with an equivalent size of up to 31×31, which enlarges the receptive field of the network and reduces misrecognition caused by the similarity of region features. In addition, a boundary segmentation head is designed as deep supervision, guiding the model to learn boundary features and improving its ability to distinguish surgical boundaries.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the embodiments, and are intended to be included within the scope of the claims and description.

Claims (10)

1. A surgical whole-scene semantic segmentation method based on long-strip convolution attention, characterized by comprising the following steps:
S1: acquiring image data of an endoscopic surgery video and the ground-truth label corresponding to the image data; constructing a surgical whole-scene semantic segmentation model; the surgical whole-scene semantic segmentation model comprises an encoder, a long-strip convolution attention module and a segmentation module;
S2: the encoder encodes the image data and outputs an encoding result; the encoding result comprises encoding results of different stages;
S3: the encoding result passes through the long-strip convolution attention module to output a feature map; the feature map comprises feature maps corresponding to the encoding results of all stages;
S4: the segmentation module performs up-sampling and concatenation on the feature maps corresponding to the encoding results of each stage to obtain a segmentation result;
S5: convolving the feature map with the largest size to obtain a boundary map; setting a boundary-guided segmentation head, and passing the ground-truth label through the boundary-guided segmentation head to obtain a target boundary map;
S6: calculating a boundary loss from the boundary map and the target boundary map; calculating a segmentation loss from the segmentation result and the ground-truth label; combining the boundary loss and the segmentation loss to construct a hybrid loss function; optimizing the surgical whole-scene semantic segmentation model with the hybrid loss function;
S7: inputting the image data to be segmented into the optimized surgical whole-scene semantic segmentation model, and outputting the final segmentation result.
2. The surgical whole-scene semantic segmentation method based on long-strip convolution attention according to claim 1, wherein in S2, the encoder adopts HRNetV2, and the encoding results output by the stages have different sizes.
3. The surgical whole-scene semantic segmentation method based on long-strip convolution attention according to claim 2, wherein in S3, the long-strip convolution attention module comprises a region feature extraction block and an instrument feature extraction block; the region feature extraction block and the instrument feature extraction block operate in parallel on the encoding result of each stage to obtain region features and instrument features; and the encoding result of each stage is added to the corresponding region features and instrument features to obtain the feature map corresponding to that stage's encoding result.
4. The surgical whole-scene semantic segmentation method based on long-strip convolution attention according to claim 3, wherein the region feature extraction block comprises a depthwise convolution, a first multi-branch depthwise strip convolution and a 1×1 convolution; the region features are extracted as follows:
the depthwise convolution aggregates local information of the encoding result of each stage; the local information is denoted $x_i^{l}$ and computed as
$x_i^{l} = BN\big(DW_{5\times5}(x_i)\big)$;
the first multi-branch depthwise strip convolution and the 1×1 convolution obtain a first attention map from the local information; the first attention map is denoted $Att_i^{an}$ and computed as
$Att_i^{an} = W_{1\times1}\Big(\textstyle\sum_{j} DW_{1\times k_j}\big(DW_{k_j\times1}(x_i^{l})\big)\Big)$;
the first attention map is multiplied element-wise with the local information to obtain the region features:
$x_i^{an} = Att_i^{an} \otimes x_i^{l}$;
where $DW_{5\times5}$ denotes a depthwise convolution with a 5×5 kernel; $BN$ denotes a batch normalization operation; $x_i$ denotes the encoding result of the $i$-th stage; $W_{1\times1}$ denotes a 1×1 convolution; $j$ indexes the branches; $x_i^{an}$ denotes the region features of the $i$-th stage; $k_j$ denotes the strip convolution kernel size; $\otimes$ denotes element-wise multiplication.
5. The surgical whole-scene semantic segmentation method based on long-strip convolution attention according to claim 4, wherein the instrument feature extraction block comprises a depthwise convolution, a second multi-branch depthwise strip convolution and a 1×1 convolution; the instrument features are extracted as follows:
the depthwise convolution aggregates local information of the encoding result of each stage; the local information is denoted $x_i^{l}$ and computed as
$x_i^{l} = BN\big(DW_{5\times5}(x_i)\big)$;
the second multi-branch depthwise strip convolution and the 1×1 convolution obtain a second attention map from the local information; the second attention map is denoted $Att_i^{ins}$ and computed as
$Att_i^{ins} = W_{1\times1}\Big(\textstyle\sum_{j} \big(DW_{1\times k_j}(x_i^{l}) + DW_{k_j\times1}(x_i^{l})\big)\Big)$;
the second attention map is multiplied element-wise with the local information to obtain the instrument features:
$x_i^{ins} = Att_i^{ins} \otimes x_i^{l}$;
where $DW_{5\times5}$ denotes a depthwise convolution with a 5×5 kernel; $BN$ denotes a batch normalization operation; $x_i$ denotes the encoding result of the $i$-th stage; $W_{1\times1}$ denotes a 1×1 convolution; $j$ indexes the branches; $x_i^{ins}$ denotes the instrument features of the $i$-th stage; $k_j$ denotes the strip convolution kernel size; $\otimes$ denotes element-wise multiplication.
6. The surgical whole-scene semantic segmentation method based on long-strip convolution attention according to claim 5, wherein in S4, the segmentation module selects the feature map with the largest size and up-samples the remaining feature maps to that size; the up-sampled feature maps are concatenated, and the concatenation result is up-sampled by a set multiple to obtain the segmentation result.
7. The surgical whole-scene semantic segmentation method based on long-strip convolution attention according to claim 6, wherein in S5, the boundary map is calculated as
$bm_x = UP_{\times4}\big(W_{1\times1}(W_{bg}(x_{max}))\big)$;
where $bm_x$ denotes the boundary map; $UP_{\times4}$ denotes a 4× up-sampling operation; $W_{1\times1}$ denotes a 1×1 convolution; $W_{bg}$ denotes a convolution operation combined with batch normalization and a ReLU activation; $x_{max}$ denotes the feature map with the largest size;
the boundary-guided segmentation head comprises three branches; each branch applies a convolution with a Laplacian kernel to extract boundary features from the ground-truth label, and the strides of the branches differ; the boundary features extracted by the branches are concatenated to obtain boundary information, and a fixed contour threshold is set to refine the boundary information into the target boundary map, denoted $bm_{gt}$.
8. The surgical whole-scene semantic segmentation method based on long-strip convolution attention according to claim 7, wherein in S6, the boundary loss is calculated as
$L_{bg} = \alpha L_{dice}(bm_x, bm_{gt}) + \beta L_{ce}(bm_x, bm_{gt})$;
the segmentation result is calculated as
$x_{seg} = UP_{\times4}(x_{maff})$;
the segmentation loss is calculated as
$L_{seg} = L_{ce}(x_{seg}, gt)$;
where $L_{bg}$ denotes the boundary loss; $L_{dice}(\cdot)$ denotes the Dice loss function; $bm_x$ denotes the boundary map; $bm_{gt}$ denotes the target boundary map; $L_{ce}(\cdot)$ denotes the cross-entropy loss function; $\alpha$ and $\beta$ are constants; $L_{seg}$ denotes the segmentation loss; $x_{seg}$ denotes the segmentation result; $gt$ denotes the ground-truth label; $UP_{\times4}$ denotes a 4× up-sampling operation; $x_{maff}$ denotes the concatenation result.
9. The surgical whole-scene semantic segmentation method based on long-strip convolution attention according to claim 8, wherein in S6, the hybrid loss function is calculated as
$L_{joint} = \gamma L_{seg} + \varepsilon L_{bg}$;
where $L_{joint}$ denotes the hybrid loss function; $L_{seg}$ denotes the segmentation loss; $L_{bg}$ denotes the boundary loss; $\gamma$ and $\varepsilon$ are constants.
10. The surgical whole-scene semantic segmentation method based on long-strip convolution attention according to claim 6, wherein the set multiple is 4.
CN202310304276.8A 2023-03-27 2023-03-27 Surgical whole-scene semantic segmentation method based on long-strip convolution attention Active CN116030260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310304276.8A CN116030260B (en) 2023-03-27 2023-03-27 Surgical whole-scene semantic segmentation method based on long-strip convolution attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310304276.8A CN116030260B (en) 2023-03-27 2023-03-27 Surgical whole-scene semantic segmentation method based on long-strip convolution attention

Publications (2)

Publication Number Publication Date
CN116030260A true CN116030260A (en) 2023-04-28
CN116030260B CN116030260B (en) 2023-08-01

Family

ID=86077847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310304276.8A Active CN116030260B (en) 2023-03-27 2023-03-27 Surgical whole-scene semantic segmentation method based on long-strip convolution attention

Country Status (1)

Country Link
CN (1) CN116030260B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3171297A1 (en) * 2015-11-18 2017-05-24 CentraleSupélec Joint boundary detection image segmentation and object recognition using deep learning
WO2018111940A1 (en) * 2016-12-12 2018-06-21 Danny Ziyi Chen Segmenting ultrasound images
WO2021030629A1 (en) * 2019-08-14 2021-02-18 Genentech, Inc. Three dimensional object segmentation of medical images localized with object detection
WO2021104056A1 (en) * 2019-11-27 2021-06-03 中国科学院深圳先进技术研究院 Automatic tumor segmentation system and method, and electronic device
CN111833273A (en) * 2020-07-17 2020-10-27 华东师范大学 Semantic boundary enhancement method based on long-distance dependence
CN112634279A (en) * 2020-12-02 2021-04-09 四川大学华西医院 Medical image semantic segmentation method based on attention Unet model
US20220309674A1 (en) * 2021-03-26 2022-09-29 Nanjing University Of Posts And Telecommunications Medical image segmentation method based on u-net
CN113807355A (en) * 2021-07-29 2021-12-17 北京工商大学 Image semantic segmentation method based on coding and decoding structure
CN114359102A (en) * 2022-01-10 2022-04-15 天津大学 Image depth restoration evidence obtaining method based on attention mechanism and edge guide
CN114723669A (en) * 2022-03-08 2022-07-08 同济大学 Liver tumor two-point five-dimensional deep learning segmentation algorithm based on context information perception
CN114565628A (en) * 2022-03-23 2022-05-31 中南大学 Image segmentation method and system based on boundary perception attention
CN114972756A (en) * 2022-05-30 2022-08-30 湖南大学 Semantic segmentation method and device for medical image
CN114998373A (en) * 2022-06-15 2022-09-02 南京信息工程大学 Improved U-Net cloud picture segmentation method based on multi-scale loss function
CN115035295A (en) * 2022-06-15 2022-09-09 湖北工业大学 Remote sensing image semantic segmentation method based on shared convolution kernel and boundary loss function
CN115661462A (en) * 2022-11-14 2023-01-31 郑州大学 Medical image segmentation method based on convolution and deformable self-attention mechanism
CN115601549A (en) * 2022-12-07 2023-01-13 山东锋士信息技术有限公司(Cn) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
HAO DU: "SwinPA-Net: Swin Transformer-Based Multiscale Feature Pyramid Aggregation Network for Medical Image Segmentation", IEEE *
MENG-HAO GUO: "SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation", ARXIV *
YAN GUANGYU; LIU ZHENGXI: "Real-time semantic segmentation algorithm based on hybrid attention", Modern Computer (现代计算机), no. 10 *
完美屁桃: "Paper reading - SegNeXt: Rethinking convolutional attention design for semantic segmentation", Retrieved from the Internet <URL:https://blog.csdn.net/qq_43687860/article/details/129122842> *
狗熊会: "A survey of medical image segmentation", Retrieved from the Internet <URL:https://roll.sohu.com/a/533482881_455817> *

Also Published As

Publication number Publication date
CN116030260B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
Shvets et al. Automatic instrument segmentation in robot-assisted surgery using deep learning
Münzer et al. Content-based processing and analysis of endoscopic images and videos: A survey
US11699236B2 (en) Systems and methods for the segmentation of multi-modal image data
Dorent et al. CrossMoDA 2021 challenge: Benchmark of cross-modality domain adaptation techniques for vestibular schwannoma and cochlea segmentation
Reiter et al. Appearance learning for 3D tracking of robotic surgical tools
US10366488B2 (en) Image processing used to estimate abnormalities
Xu et al. Class-incremental domain adaptation with smoothing and calibration for surgical report generation
CN112184579B (en) Tissue lesion area image auxiliary restoration system and method
CN116563252A (en) Esophageal early cancer lesion segmentation method based on attention double-branch feature fusion
JP2021533451A (en) Systems and methods for automatic detection of visual objects in medical images
WO2013016113A1 (en) Tool tracking during surgical procedures
Song et al. An efficient deep learning based coarse-to-fine cephalometric landmark detection method
Oliva Maza et al. An ORB-SLAM3-based approach for surgical navigation in ureteroscopy
Shi et al. Attention gate based dual-pathway network for vertebra segmentation of X-ray spine images
CN116030260B (en) Surgical whole-scene semantic segmentation method based on long-strip convolution attention
CN113813053A (en) Operation process analysis method based on laparoscope endoscopic image
Lin et al. CSwinDoubleU-Net: A double U-shaped network combined with convolution and Swin Transformer for colorectal polyp segmentation
Ali et al. Towards robotic knee arthroscopy: multi-scale network for tissue-tool segmentation
CN116188486A (en) Video segmentation method and system for laparoscopic liver operation
Hofman et al. First‐in‐human real‐time AI‐assisted instrument deocclusion during augmented reality robotic surgery
CN115311317A (en) Laparoscope image segmentation method and system based on ScaleFormer algorithm
Rueckert et al. Methods and datasets for segmentation of minimally invasive surgical instruments in endoscopic images and videos: A review of the state of the art
CN115049709A (en) Deep learning point cloud lumbar registration method for spinal minimally invasive surgery navigation
Liu et al. LGI Net: Enhancing local-global information interaction for medical image segmentation
US10299864B1 (en) Co-localization of multiple internal organs based on images obtained during surgery

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant