CN114972293B - Video polyp segmentation method and device based on semi-supervised space-time attention network - Google Patents

Video polyp segmentation method and device based on semi-supervised space-time attention network

Info

Publication number
CN114972293B
CN114972293B (application number CN202210672002.XA)
Authority
CN
China
Prior art keywords
layer
video
attention
decoder
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210672002.XA
Other languages
Chinese (zh)
Other versions
CN114972293A (en)
Inventor
万翔
李冠彬
李镇
吴振华
赵欣恺
谭双翼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Research Institute of Big Data SRIBD
Original Assignee
Shenzhen Research Institute of Big Data SRIBD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Research Institute of Big Data SRIBD filed Critical Shenzhen Research Institute of Big Data SRIBD
Priority to CN202210672002.XA priority Critical patent/CN114972293B/en
Publication of CN114972293A publication Critical patent/CN114972293A/en
Application granted granted Critical
Publication of CN114972293B publication Critical patent/CN114972293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30028Colon; Small intestine
    • G06T2207/30032Colon polyp
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Radiology & Medical Imaging (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video polyp segmentation method, apparatus, computer device and storage medium based on a semi-supervised spatio-temporal attention network. A polyp segmentation model is trained by a semi-supervised training method, and a U-shaped network architecture is built from a multi-layer encoder and a multi-layer decoder, with a temporal local context attention module or a near inter-frame spatio-temporal attention module arranged between each encoder layer and the corresponding decoder layer. The temporal local context attention module exploits the continuity between adjacent frames and cross-layer features of different scales to segment video frames better, while the near inter-frame spatio-temporal attention module exploits the temporal and spatial information of nearby video frames, so that segmentation accuracy is effectively improved. This resolves the dilemma that video datasets, owing to the excessive labeling workload and quality requirements, provide too little labeled data to be used for training, and improves the practicability of the model.

Description

Video polyp segmentation method and device based on semi-supervised space-time attention network
Technical Field
The present invention relates to the field of medical image processing technologies, and in particular, to a video polyp segmentation method, apparatus, computer device, and storage medium based on a semi-supervised spatio-temporal attention network.
Background
In recent years, with rising living standards and changing living and working habits, the prevalence of colon cancer has increased year by year. The most effective means of preventing colon cancer is enteroscopy: a doctor can detect intestinal polyps through enteroscopy, take samples and perform subsequent pathological analysis, so that the disease is detected early and treated early.
With the continuous development of artificial intelligence technology, more and more AI-assisted diagnosis techniques are being added to the enteroscopy examination process in order to assist doctors in making diagnoses, improve their efficiency, and reduce misdiagnosis and missed diagnosis. These techniques fall mainly into four categories: fully supervised video polyp segmentation, fully supervised image polyp segmentation, semi-supervised video polyp segmentation and semi-supervised image polyp segmentation.
Fully supervised image polyp segmentation can generally achieve higher accuracy, but every image must carry a polyp label, which consumes considerable labeling manpower and material resources. Fully supervised video polyp segmentation can achieve better results because it takes the continuity of the video into account, but the data volume of video is larger than that of images, the labeling workload is correspondingly greater, and the practicability of the algorithm is therefore lower. Current semi-supervised algorithms can perform segmentation with a small amount of labeled data plus the remaining unlabeled data, but their segmentation accuracy is low because the information in the unlabeled data is not fully learned and the continuity of the enteroscopy data is not exploited.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a video polyp segmentation method, apparatus, computer device and storage medium based on a semi-supervised spatio-temporal attention network, so as to solve the problems of poor practicability and low segmentation accuracy of video polyp segmentation in the prior art.
In a first aspect, a video polyp segmentation method based on a semi-supervised spatio-temporal attention network is provided, comprising:
obtaining enteroscopy video data of a patient to be detected;
dividing the enteroscopy video data into video segments of a preset size, and inputting the video segments into a preset polyp segmentation model, wherein the preset polyp segmentation model comprises an N-layer encoder and an N-layer decoder;
sequentially taking the output features of each layer of the encoder, in order from layer 1 to layer N-1, as the input features of the next-layer encoder and as the input features of a temporal local context attention module;
performing attention mechanism calculation on the output features of the N-th layer encoder through a near inter-frame spatio-temporal attention module, so as to obtain spatio-temporal features that serve as the input features of the first-layer decoder;
and sequentially performing attention mechanism calculation, through the temporal local context attention module and in order from layer 1 to layer N-1, on the output features of each layer of the decoder and the output features of the corresponding-layer encoder, so as to obtain temporal features, and taking the temporal features together with the decoder output features as the input features of the next-layer decoder, until the actual segmentation result is output by the N-th layer decoder.
In one embodiment, performing the attention mechanism calculation on the output features of the N-th layer encoder through the near inter-frame spatio-temporal attention module comprises:
converting the output features of the N-th layer encoder into preset dimensions and dividing them into a plurality of blocks of a preset size;
performing attention mechanism calculation on the blocks through a temporal attention module to obtain a first feature;
performing an attention mechanism operation on the first feature through a spatial attention module to obtain a second feature;
and inputting the second feature into a multi-layer perceptron for processing, so as to obtain the spatio-temporal features.
In an embodiment, performing the attention mechanism calculation on the blocks through the temporal attention module to obtain the first feature comprises:
performing attention mechanism calculation between each block of one video frame and the blocks at the same position in the remaining video frames, so as to obtain the first feature.
In an embodiment, performing the attention mechanism operation on the first feature through the spatial attention module to obtain the second feature comprises:
connecting the first feature with the output features of the N-th layer encoder and inputting the result into the spatial attention module;
and, in the spatial attention module, performing attention mechanism calculation between each block in a video frame and the blocks at different positions of the same video frame, so as to obtain the second feature.
In one embodiment, performing the attention mechanism calculation through the temporal local context attention module to obtain the temporal features comprises:
separately calculating a feature map from the output features of each layer of the decoder to obtain a third feature;
and multiplying the third feature by the output features of the corresponding-layer encoder and then adding those output features, so as to obtain the temporal features.
In an embodiment, the preset polyp segmentation model is obtained by:
creating an original polyp segmentation model;
acquiring a training video data set, wherein the training video data set comprises labeled video data and unlabeled video data;
and inputting the labeled video data and the unlabeled video data into the original polyp segmentation model for semi-supervised training, so as to generate the preset polyp segmentation model.
In an embodiment, inputting the labeled video data and the unlabeled video data into the original polyp segmentation model for semi-supervised training comprises:
inputting the labeled video data and the unlabeled video data into the original polyp segmentation model;
calculating cross entropy and Dice loss functions on the segmentation results output by the original polyp segmentation model, and updating the original polyp segmentation model;
and calculating a distance loss between the segmentation results of adjacent data frames, and performing supervised learning on the original polyp segmentation model.
In a second aspect, a video polyp segmentation apparatus based on a semi-supervised spatio-temporal attention network is provided, comprising:
an enteroscopy video data acquisition unit, configured to acquire enteroscopy video data of a patient to be detected;
a video data processing unit, configured to divide the enteroscopy video data into video segments of a preset size and input the video segments into a preset polyp segmentation model, wherein the preset polyp segmentation model comprises an N-layer encoder and an N-layer decoder;
an encoder processing unit, configured to sequentially take the output features of each layer of the encoder, in order from layer 1 to layer N-1, as the input features of the next-layer encoder and as the input features of a temporal local context attention module;
an attention calculating unit, configured to perform attention mechanism calculation on the output features of the N-th layer encoder through a near inter-frame spatio-temporal attention module, so as to obtain spatio-temporal features that serve as the input features of the first-layer decoder;
and an actual segmentation result output unit, configured to sequentially perform attention mechanism calculation, through the temporal local context attention module and in order from layer 1 to layer N-1, on the output features of each layer of the decoder and the output features of the corresponding-layer encoder so as to obtain temporal features, and to take the temporal features together with the decoder output features as the input features of the next-layer decoder, until the actual segmentation result is output by the N-th layer decoder.
In a third aspect, a computer device is provided, comprising a memory, a processor and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the steps of the video polyp segmentation method based on a semi-supervised spatio-temporal attention network as described above.
In a fourth aspect, there is provided one or more readable storage media storing computer readable instructions that, when executed by a processor, implement the steps of a video polyp segmentation method based on a semi-supervised spatiotemporal attention network as described above.
The video polyp segmentation method, apparatus, computer device and storage medium based on a semi-supervised spatio-temporal attention network are realized by a method comprising the following steps: obtaining enteroscopy video data of a patient to be detected; dividing the enteroscopy video data into video segments of a preset size, and inputting the video segments into a preset polyp segmentation model, wherein the preset polyp segmentation model comprises an N-layer encoder and an N-layer decoder; sequentially taking the output features of each layer of the encoder, in order from layer 1 to layer N-1, as the input features of the next-layer encoder and as the input features of a temporal local context attention module; performing attention mechanism calculation on the output features of the N-th layer encoder through a near inter-frame spatio-temporal attention module, so as to obtain spatio-temporal features that serve as the input features of the first-layer decoder; and sequentially performing attention mechanism calculation, through the temporal local context attention module and in order from layer 1 to layer N-1, on the output features of each layer of the decoder and the output features of the corresponding-layer encoder so as to obtain temporal features, and taking the temporal features together with the decoder output features as the input features of the next-layer decoder, until the actual segmentation result is output by the N-th layer decoder. In the present application, the temporal local context attention module exploits the continuity between adjacent frames and cross-layer features of different scales to segment video frames better, while the near inter-frame spatio-temporal attention module exploits the temporal and spatial information of nearby video frames, so that segmentation accuracy is effectively improved, the dilemma that video datasets previously provided too little data for training because of the excessive labeling workload and quality requirements is avoided, and practicability is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an application environment of a video polyp segmentation method based on a semi-supervised spatio-temporal attention network according to an embodiment of the present invention;
FIG. 2 is a flow chart of a video polyp segmentation method based on a semi-supervised spatio-temporal attention network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a near inter-frame spatio-temporal attention module according to an embodiment of the present invention;
FIG. 4 is a flowchart of the processing performed by the temporal local context attention module according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a video polyp segmentation apparatus based on a semi-supervised spatio-temporal attention network according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computer device in accordance with an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The video polyp segmentation method based on the semi-supervised spatio-temporal attention network provided by this embodiment can be applied to a network architecture as shown in fig. 1. In this architecture, a U-shaped basic structure is formed by five encoder layers and five decoder layers; a temporal local context attention module is added between the encoder and decoder of each of the first four layers, and a near inter-frame spatio-temporal attention module is added between the fifth-layer encoder and decoder. The video data are processed sequentially by the five encoder layers and then passed sequentially through the decoders, until the segmentation result is output by the last decoder layer.
In the training phase, for the segmentation results output by the decoder, the model can be updated through cross entropy and Dice losses on the labeled video data (for example, where the first frame and the last frame of a segment are labeled), and supervised through a smooth L1 distance loss between adjacent frames, for example between the first, middle and last frames.
In one embodiment, as shown in fig. 2, a video polyp segmentation method based on a semi-supervised spatio-temporal attention network is provided, comprising the steps of:
in step S110, enteroscopy video data of a patient to be detected is acquired;
In the embodiment of the present application, the enteroscopy video data may be enteroscopy video data generated in real time while the patient to be detected undergoes enteroscopy.
In step S120, the enteroscopy video data is segmented into video segments of a preset size, and input into a preset polyp segmentation model, wherein the preset polyp segmentation model comprises an N-layer encoder and an N-layer decoder;
In the embodiment of the present application, after the enteroscopy video data of the patient to be detected are acquired, the video data may be divided into video segments of 10 frames each and input into the preset polyp segmentation model for segmentation prediction, as illustrated by the sketch below.
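For illustration only, the following minimal sketch (in Python with PyTorch; the patent itself provides no code, and all names here are the editor's assumptions) shows one way such fixed-size clips could be cut from a decoded frame tensor:

```python
import torch

def split_into_clips(frames: torch.Tensor, clip_len: int = 10) -> list:
    """Split a (T, C, H, W) frame tensor into (clip_len, C, H, W) clips.

    The 10-frame clip length follows the embodiment above; trailing frames
    that do not fill a whole clip are simply dropped in this sketch.
    """
    return [frames[s:s + clip_len]
            for s in range(0, frames.shape[0] - clip_len + 1, clip_len)]
```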
In this embodiment of the present application, the preset polyp segmentation model may include an N-layer encoder and an N-layer decoder. When a video segment of the preset size is input into the preset polyp segmentation model, it is first processed sequentially by the encoders in order from the first layer to the N-th layer; the N-th layer encoder is connected with the first-layer decoder, so the processed features are input into the first-layer decoder and then processed sequentially by the decoders in order from the first layer to the N-th layer, until the actual segmentation result is output by the N-th layer decoder.
Further, in the embodiment of the present application, the encoder layers are in one-to-one correspondence with the decoder layers, i.e., the first-layer encoder corresponds to the N-th layer decoder, the second-layer encoder corresponds to the (N-1)-th layer decoder, and so on, with the N-th layer encoder finally corresponding to the first-layer decoder.
In one embodiment of the present application, the encoder may include 5 layers and the decoder may likewise include 5 layers, with the 5th encoder layer connected to the 1st decoder layer; the encoder and the decoder form a U-shaped network architecture, and each encoder layer corresponds to one decoder layer.
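As a reading aid, the following hedged sketch shows how such a model could be assembled in PyTorch; the module classes are placeholders standing in for whatever backbone an implementer chooses, not the patent's reference implementation:

```python
import torch.nn as nn

class PolypSegmentationModel(nn.Module):
    """U-shaped model: N encoder layers, N decoder layers, N-1 temporal local
    context (TLC) attention modules, and one near inter-frame spatio-temporal
    attention (NIFSTA) module between the last encoder and the first decoder."""

    def __init__(self, encoders, decoders, tlc_modules, nifsta):
        super().__init__()
        assert len(encoders) == len(decoders)         # N-layer encoder and decoder
        assert len(tlc_modules) == len(encoders) - 1  # TLC between layers 1..N-1
        self.encoders = nn.ModuleList(encoders)       # layer 1 ... layer N
        self.decoders = nn.ModuleList(decoders)       # layer 1 ... layer N
        self.tlc = nn.ModuleList(tlc_modules)         # temporal local context attention
        self.nifsta = nifsta                          # near inter-frame spatio-temporal attention
```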
In step S130, the output features of each layer of the encoder are sequentially used, in order from layer 1 to layer N-1, as the input features of the next-layer encoder and as the input features of the temporal local context attention module;
In embodiments of the present application, an attention layer is provided between the encoder and the decoder, and may include temporal local context attention modules and a near inter-frame spatio-temporal attention module. The near inter-frame spatio-temporal attention module is disposed between the N-th layer encoder and the first-layer decoder, and the temporal local context attention modules are disposed between the 1st to (N-1)-th layer encoders and the 2nd to N-th layer decoders.
In this embodiment of the present application, the enteroscopy video clip of the patient to be detected is first processed by the first-layer encoder to generate a first output feature, which is input into the second-layer encoder and the first temporal local context attention module; the second-layer encoder processes the first output feature to generate a second output feature, which is input into the third-layer encoder and the second temporal local context attention module; and so on, until the N-th layer encoder has processed the features input by the previous encoder layer, whereupon the result is input into the near inter-frame spatio-temporal attention module.
In step S140, attention mechanism calculation is performed on the output features of the N-th layer encoder through the near inter-frame spatio-temporal attention module, so as to obtain spatio-temporal features that are used as the input features of the first-layer decoder;
In the embodiment of the present application, the near inter-frame spatio-temporal attention module may be used to acquire the temporal information and spatial information shared between nearby video frames.
Referring to fig. 3, in an embodiment of the present application, the near inter-frame spatio-temporal attention module may include a temporal attention module, a spatial attention module and a multi-layer perceptron. The output features of the N-th layer encoder first pass through the temporal attention module; after further processing by the spatial attention module and the multi-layer perceptron, the spatio-temporal features are extracted and input into the first-layer decoder.
In step S150, the output features of each layer of the decoder and the output features of the corresponding-layer encoder are sequentially subjected, in order from layer 1 to layer N-1, to attention mechanism calculation by the temporal local context attention module so as to obtain temporal features, and the temporal features together with the decoder output features are used as the input features of the next-layer decoder, until the actual segmentation result is output by the N-th layer decoder.
In this embodiment of the present application, the output features of the first-layer decoder and the output features of the (N-1)-th layer encoder are input into the temporal local context attention module; after attention calculation, the temporal features are obtained, and the temporal features together with the output features of the first-layer decoder are input into the second-layer decoder for processing. The output features of the second-layer decoder and the output features of the (N-2)-th layer encoder are then input into the temporal local context attention module; after attention calculation, the resulting temporal features together with the output features of the second-layer decoder are input into the third-layer decoder, and so on, until the actual segmentation result is output by the N-th layer decoder.
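Continuing the hedged PolypSegmentationModel sketch above, the data flow of steps S130 to S150 could be wired as follows; whether the temporal feature and the decoder output are summed or concatenated is not fixed by the text, so the sum below is an illustrative assumption:

```python
def forward_pass(model: "PolypSegmentationModel", clip):
    """One interpretation of steps S130-S150 for a clip of video frames."""
    feats, x = [], clip
    for enc in model.encoders:                 # S130: layers 1 .. N, caching outputs
        x = enc(x)
        feats.append(x)
    y = model.nifsta(feats[-1])                # S140: spatio-temporal feature
    y = model.decoders[0](y)                   # first-layer decoder
    for i, dec in enumerate(model.decoders[1:], start=1):
        skip = feats[-(i + 1)]                 # corresponding-layer encoder output
        t = model.tlc[i - 1](y, skip)          # S150: temporal feature via TLC attention
        y = dec(t + y)                         # next decoder consumes both features
    return y                                   # actual segmentation result
```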
In an embodiment of the present application, a video polyp segmentation method based on a semi-supervised spatio-temporal attention network is provided, comprising: obtaining enteroscopy video data of a patient to be detected; dividing the enteroscopy video data into video segments of a preset size, and inputting the video segments into a preset polyp segmentation model, wherein the preset polyp segmentation model comprises an N-layer encoder and an N-layer decoder; sequentially taking the output features of each layer of the encoder, in order from layer 1 to layer N-1, as the input features of the next-layer encoder and as the input features of a temporal local context attention module; performing attention mechanism calculation on the output features of the N-th layer encoder through a near inter-frame spatio-temporal attention module, so as to obtain spatio-temporal features that serve as the input features of the first-layer decoder; and sequentially performing attention mechanism calculation, through the temporal local context attention module and in order from layer 1 to layer N-1, on the output features of each layer of the decoder and the output features of the corresponding-layer encoder so as to obtain temporal features, and taking the temporal features together with the decoder output features as the input features of the next-layer decoder, until the actual segmentation result is output by the N-th layer decoder. In the present application, the temporal local context attention module exploits the continuity between adjacent frames and cross-layer features of different scales to segment video frames better, while the near inter-frame spatio-temporal attention module exploits the temporal and spatial information of nearby video frames, so that segmentation accuracy is effectively improved, the dilemma that video datasets previously provided too little data for training because of the excessive labeling workload and quality requirements is avoided, and practicability is improved.
In an embodiment of the present application, an implementation flow of the video polyp segmentation method based on the semi-supervised spatio-temporal attention network is provided, including the following steps:
in step S110, enteroscopy video data of a patient to be detected is acquired;
In the embodiment of the present application, the enteroscopy video data may be enteroscopy video data generated in real time while the patient to be detected undergoes enteroscopy.
In step S120, the enteroscopy video data is segmented into video segments of a preset size, and input into a preset polyp segmentation model, wherein the preset polyp segmentation model comprises an N-layer encoder and an N-layer decoder;
In the embodiment of the present application, after the enteroscopy video data of the patient to be detected are acquired, the video data may be divided into video segments of 10 frames each and input into the preset polyp segmentation model for segmentation prediction.
In this embodiment of the present application, the preset polyp segmentation model may include an N-layer encoder and an N-layer decoder. When a video segment of the preset size is input into the preset polyp segmentation model, it is first processed sequentially by the encoders in order from the first layer to the N-th layer; the N-th layer encoder is connected with the first-layer decoder, so the processed features are input into the first-layer decoder and then processed sequentially by the decoders in order from the first layer to the N-th layer, until the actual segmentation result is output by the N-th layer decoder.
Further, in the embodiment of the present application, the encoder layers are in one-to-one correspondence with the decoder layers, i.e., the first-layer encoder corresponds to the N-th layer decoder, the second-layer encoder corresponds to the (N-1)-th layer decoder, and so on, with the N-th layer encoder finally corresponding to the first-layer decoder.
In one embodiment of the present application, the encoder may include 5 layers and the decoder may likewise include 5 layers, with the 5th encoder layer connected to the 1st decoder layer; the encoder and the decoder form a U-shaped network architecture, and each encoder layer corresponds to one decoder layer.
In the embodiment of the present application, the preset polyp segmentation model is obtained by the following method:
creating an original polyp segmentation model;
acquiring a training video data set, wherein the training video data set comprises labeled video data and unlabeled video data;
and inputting the labeled video data and the unlabeled video data into the original polyp segmentation model for semi-supervised training, so as to generate the preset polyp segmentation model.
In the embodiment of the present application, the training video data set is clinical enteroscopy video data, and the labeled videos contain polyp areas annotated manually by doctors.
In the embodiment of the present application, part of the clinical enteroscopy video data is labeled, and the labeled video data and the unlabeled video data are then input into the original polyp segmentation model for semi-supervised training; the model is continuously updated, iterated and supervised, and when the segmentation accuracy reaches the desired value, the preset polyp segmentation model is generated.
In this embodiment of the present application, inputting the labeled video data and the unlabeled video data into the original polyp segmentation model for semi-supervised training includes:
inputting the labeled video data and the unlabeled video data into the original polyp segmentation model;
calculating cross entropy and Dice loss functions on the segmentation results output by the original polyp segmentation model, and updating the original polyp segmentation model;
and calculating a distance loss between the segmentation results of adjacent data frames, and performing supervised learning on the original polyp segmentation model.
In the embodiment of the present application, in the training stage, for the labeled video data, a loss is calculated on the segmentation results output by the original polyp segmentation model through the cross entropy and Dice loss functions, and the parameters of the original polyp segmentation model are updated according to the calculated loss, for example the mean of the cross entropy and Dice losses. For all video data, i.e., both the labeled and the unlabeled video data, the original polyp segmentation model can be supervised and guided by calculating a smooth L1 distance loss between the segmentation results of the current video frame and the adjacent video frames, until the loss meets the expected value and the preset polyp segmentation model is generated. Since the method relies on only a small amount of labeled data, the data labeling cost can be greatly reduced, and the two-part loss function better improves the segmentation accuracy on unlabeled data.
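A minimal sketch of this two-part loss follows, assuming binary masks so that the cross entropy takes its binary form; the Dice implementation and the equal weighting of the two supervised terms are illustrative assumptions, not the patent's specification:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_logits, target, eps: float = 1e-6):
    pred = torch.sigmoid(pred_logits)
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def semi_supervised_loss(pred_logits, labels, labeled_mask):
    """pred_logits: (N, 1, H, W) per-frame masks; labeled_mask: (N,) bool."""
    sup = pred_logits.new_zeros(())
    if labeled_mask.any():                        # supervised part, labeled frames only
        p, t = pred_logits[labeled_mask], labels[labeled_mask]
        sup = 0.5 * (F.binary_cross_entropy_with_logits(p, t) + dice_loss(p, t))
    # consistency part: smooth L1 between adjacent-frame predictions, all frames
    cons = F.smooth_l1_loss(pred_logits[1:], pred_logits[:-1])
    return sup + cons
```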
In the embodiment of the present application, the training stage is divided into two phases: the first phase trains with the semi-supervised method described above, and the second phase uses the model obtained in the first phase to segment the unlabeled data and then performs fully supervised training on all the labeled data and the now pseudo-labeled data, so that the resulting preset model can better exploit the feature information of the unlabeled data and obtain a more accurate segmentation result, as sketched below.
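The following hedged sketch of that two-stage schedule takes the actual training routines as caller-supplied functions, since the patent does not name them; the pseudo-labels are simply the stage-one model's own segmentations of the unlabeled clips:

```python
import torch

def two_stage_training(model, labeled_set, unlabeled_clips, train_semi, train_full):
    # Stage 1: semi-supervised training on labeled and unlabeled data.
    train_semi(model, labeled_set, unlabeled_clips)
    # Stage 2: segment the unlabeled clips with the stage-1 model to get pseudo-labels,
    # then train fully supervised on labeled plus pseudo-labeled data.
    with torch.no_grad():
        pseudo = [(clip, model(clip)) for clip in unlabeled_clips]
    train_full(model, list(labeled_set) + pseudo)
    return model
```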
In step S130, the output characteristics of each layer of encoder are sequentially used as the input characteristics of the encoder of the next layer and the input characteristics of the time local context attention module according to the sequence of 1 to N-1 layers;
in embodiments of the present application, an attention layer is provided between the encoder and decoder, which may include a temporal local context attention module and a near inter-frame spatiotemporal attention module. Wherein the near inter-frame space-time attention module is disposed between the N-th layer encoder and the first layer decoder, and the plurality of temporal local context attention modules are disposed between the 1 st to N-1 layer encoders and the 2-N layer decoder.
In this embodiment of the present application, the enteroscopy video clip of the patient to be detected is first processed by the first layer encoder, and generates a first output feature, the first output feature is input into the second layer encoder and the first temporal local context attention module, after the second layer encoder processes the first output feature, the second output feature is generated, and is input into the third layer encoder and the second temporal local context attention module, until the nth layer encoder processes the feature input by the last layer encoder, and then is input into the inter-frame spatial-temporal attention module.
In step S140, the attention mechanism calculation is performed on the output features of the nth layer encoder by approaching the inter-frame spatio-temporal attention module, so as to obtain spatio-temporal features, and the spatio-temporal features are used as input features of the first layer decoder;
in the embodiment of the application, the near inter-frame spatial attention module may be used to obtain temporal information and spatial information between the near frames in the video frame.
In an embodiment of the present application, performing the attention mechanism calculation on the output features of the N-th layer encoder through the near inter-frame spatio-temporal attention module includes:
converting the output features of the N-th layer encoder into preset dimensions and dividing them into a plurality of blocks of a preset size;
performing attention mechanism calculation on the blocks through a temporal attention module to obtain a first feature;
performing an attention mechanism operation on the first feature through a spatial attention module to obtain a second feature;
and inputting the second feature into a multi-layer perceptron for processing, so as to obtain the spatio-temporal features.
In embodiments of the present application, the near inter-frame spatio-temporal attention module may include a temporal attention module, a spatial attention module, and a multi-layer perceptron.
In this embodiment of the present application, the output features of the N-th layer encoder are subjected to a dimension transformation and then divided into 64×64 blocks, i.e., each video frame may be divided into a plurality of 64×64 blocks (see the sketch below), which are input into the temporal attention module for the attention mechanism operation; the result of that operation is connected with the original input and input into the spatial attention module for the attention operation; and that result, again connected with the original input, is input into the multi-layer perceptron for processing, so as to obtain the spatio-temporal features.
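Purely for illustration, such a block division could be expressed with torch.Tensor.unfold; the 64-pixel block size follows the embodiment, while the function name and the assumption that the spatial size divides evenly are the editor's, not the patent's:

```python
import torch

def to_blocks(feat: torch.Tensor, block: int = 64) -> torch.Tensor:
    """(n, c, h, w) -> (n, num_blocks, c, block, block); h and w must divide by block."""
    n, c, h, w = feat.shape
    tiles = feat.unfold(2, block, block).unfold(3, block, block)  # (n, c, h//b, w//b, b, b)
    return tiles.permute(0, 2, 3, 1, 4, 5).reshape(n, -1, c, block, block)
```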
In particular, the output features of the N-th layer encoder may be expressed as $\varepsilon_5(x)\in\mathbb{R}^{b\times n\times c\times h\times w}$, where $b$ denotes the batch size, $n$ the number of frames in the same batch, $c$ the number of channels of the image frames, and $h$ and $w$ the spatial size of the image frames. After being input into the temporal attention module, the features are first reshaped to $\mathbb{R}^{B_1\times n\times c}$, where $B_1=b\times h\times w$, and the temporal attention weights are calculated; they are then reshaped to $\mathbb{R}^{B_2\times (h\times w)\times c}$, where $B_2=b\times n$, the spatial attention weights are calculated, and the result is passed through the multi-layer perceptron (MLP) module, the resulting spatio-temporal features serving as the input of the first-layer decoder.
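A hedged sketch of this divided temporal-then-spatial attention follows, treating each spatial position of the reshaped feature as one block (the patch-embedding step is elided) and using standard multi-head attention; the head count and MLP width are illustrative choices:

```python
import torch
import torch.nn as nn

class NearInterFrameSpatioTemporalAttention(nn.Module):
    def __init__(self, c: int, heads: int = 4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(c, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(c, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(c, 4 * c), nn.GELU(), nn.Linear(4 * c, c))

    def forward(self, x):                                       # x: (b, n, c, h, w)
        b, n, c, h, w = x.shape
        # temporal attention: each position attends across the n frames
        t = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, n, c)   # (B1, n, c), B1 = b*h*w
        t = t + self.temporal(t, t, t)[0]                       # join with original input
        # spatial attention: each position attends to the others within one frame
        s = t.reshape(b, h, w, n, c).permute(0, 3, 1, 2, 4).reshape(b * n, h * w, c)
        s = s + self.spatial(s, s, s)[0]                        # (B2, h*w, c), B2 = b*n
        out = s + self.mlp(s)                                   # spatio-temporal feature
        return out.reshape(b, n, h, w, c).permute(0, 1, 4, 2, 3)  # back to (b, n, c, h, w)
```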
In an embodiment of the present application, performing the attention mechanism calculation on the blocks through the temporal attention module to obtain the first feature includes:
performing attention mechanism calculation between each block of one video frame and the blocks at the same position in the remaining video frames, so as to obtain the first feature.
In this embodiment of the present application, the video frames are input into the preset polyp segmentation model batch by batch for segmentation prediction. Therefore, after the video frames are divided into blocks of the preset size, each block of any one video frame can be subjected, through the temporal attention module, to the attention mechanism operation with the blocks at the same position in the remaining video frames of the same batch, so as to obtain the first feature.
In an embodiment of the present application, performing the attention mechanism operation on the first feature through the spatial attention module to obtain the second feature includes:
connecting the first feature with the output features of the N-th layer encoder and inputting the result into the spatial attention module;
and, in the spatial attention module, performing attention mechanism calculation between each block in a video frame and the blocks at different positions of the same video frame, so as to obtain the second feature.
In this embodiment of the present application, after the first feature is calculated by the temporal attention module, it may be connected to the output features of the N-th layer encoder, that is, added to them, and then input into the spatial attention module, where each block of each video frame is separately subjected to attention mechanism calculation with the other blocks at different positions of the same video frame, whereupon the second feature is obtained.
In step S150, the output features of each layer of the decoder and the output features of the corresponding-layer encoder are sequentially subjected, in order from layer 1 to layer N-1, to attention mechanism calculation by the temporal local context attention module so as to obtain temporal features, and the temporal features together with the decoder output features are used as the input features of the next-layer decoder, until the actual segmentation result is output by the N-th layer decoder.
In this embodiment of the present application, the output features of the first-layer decoder and the output features of the (N-1)-th layer encoder are input into the temporal local context attention module; after attention calculation, the temporal features are obtained, and the temporal features together with the output features of the first-layer decoder are input into the second-layer decoder for processing. The output features of the second-layer decoder and the output features of the (N-2)-th layer encoder are then input into the temporal local context attention module; after attention calculation, the resulting temporal features together with the output features of the second-layer decoder are input into the third-layer decoder, and so on, until the actual segmentation result is output by the N-th layer decoder.
Referring to fig. 4, in an embodiment of the present application, performing the attention mechanism calculation through the temporal local context attention module to obtain the temporal features includes:
separately calculating a feature map from the output features of each layer of the decoder to obtain a third feature;
and multiplying the third feature by the output features of the corresponding-layer encoder and then adding those output features, so as to obtain the temporal features.
In this embodiment, the second-layer decoder is taken as an example. First, a feature map is calculated from the output features of the first-layer decoder through the temporal local context attention module, i.e., the absolute error between the segmentation result of each video frame and the segmentation results of the other frames of the same video segment is calculated and then averaged, so as to obtain the feature map of the first-layer decoder's output. This feature map is multiplied by the features output by the encoder of the layer corresponding to the second-layer decoder, i.e., the output features of the (N-1)-th layer encoder, and the output features of the (N-1)-th layer encoder are then added; the result serves as the input features of the second-layer decoder.
In the embodiment of the present application, the feature map may be calculated according to the following formula:

$$M_d(x)=\frac{1}{N-1}\sum_{y\neq x}\left|P_d(x)-P_d(y)\right|$$

where $M$ denotes the feature map, $d$ the layer index of the decoder, $P(x)$ the segmentation mask output by the decoder for frame $x$, and $N$ the number of frames in one batch.
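The formula can be transcribed directly; only the tensor layout is an assumption, with `masks` holding the layer-$d$ decoder masks for the $N$ frames of one batch:

```python
import torch

def temporal_feature_map(masks: torch.Tensor) -> torch.Tensor:
    """masks: (N, 1, H, W) with N >= 2 -> per-frame map M_d of the same shape."""
    n = masks.shape[0]
    diffs = (masks.unsqueeze(0) - masks.unsqueeze(1)).abs()  # (N, N, 1, H, W), |P(x)-P(y)|
    return diffs.sum(dim=1) / (n - 1)                        # the y = x term is zero
```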
In the embodiment of the present application, the temporal local context attention module exploits the continuity between adjacent frames and cross-layer features of different scales to segment video frames better, while the near inter-frame spatio-temporal attention module exploits the temporal and spatial information between nearby video frames, so that segmentation accuracy is effectively improved, the dilemma that video datasets previously provided too little data for training because of the excessive labeling workload and quality requirements is avoided, and practicability is improved.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
In one embodiment, a video polyp segmentation apparatus based on a semi-supervised spatio-temporal attention network is provided, which corresponds one-to-one to the video polyp segmentation method based on the semi-supervised spatio-temporal attention network in the above embodiments. As shown in fig. 5, the video polyp segmentation apparatus based on the semi-supervised spatio-temporal attention network includes an enteroscopy video data acquisition unit 10, a video data processing unit 20, an encoder processing unit 30, an attention calculation unit 40 and an actual segmentation result output unit 50. The functional modules are described in detail as follows:
an enteroscopy video data acquisition unit 10, configured to acquire enteroscopy video data of a patient to be detected;
a video data processing unit 20, configured to divide the enteroscopy video data into video segments of a preset size and input the video segments into a preset polyp segmentation model, wherein the preset polyp segmentation model comprises an N-layer encoder and an N-layer decoder;
an encoder processing unit 30, configured to sequentially take the output features of each layer of the encoder, in order from layer 1 to layer N-1, as the input features of the next-layer encoder and as the input features of the temporal local context attention module;
an attention calculating unit 40, configured to perform attention mechanism calculation on the output features of the N-th layer encoder through the near inter-frame spatio-temporal attention module, so as to obtain spatio-temporal features that serve as the input features of the first-layer decoder;
and an actual segmentation result output unit 50, configured to sequentially perform attention mechanism calculation, through the temporal local context attention module and in order from layer 1 to layer N-1, on the output features of each layer of the decoder and the output features of the corresponding-layer encoder so as to obtain temporal features, and to take the temporal features together with the decoder output features as the input features of the next-layer decoder, until the actual segmentation result is output by the N-th layer decoder.
In an embodiment, the attention calculating unit 40 is further configured to:
converting the output features of the N-th layer encoder into preset dimensions and dividing them into a plurality of blocks of a preset size;
performing attention mechanism calculation on the blocks through a temporal attention module to obtain a first feature;
performing an attention mechanism operation on the first feature through a spatial attention module to obtain a second feature;
and inputting the second feature into a multi-layer perceptron for processing, so as to obtain the spatio-temporal features.
In an embodiment, the attention calculating unit 40 is further configured to:
performing the attention mechanism calculation on the blocks through the temporal attention module to obtain the first feature includes:
performing attention mechanism calculation between each block of one video frame and the blocks at the same position in the remaining video frames, so as to obtain the first feature.
In an embodiment of the present application, the attention calculating unit 40 is further configured to:
performing the attention mechanism operation on the first feature through the spatial attention module to obtain the second feature includes:
connecting the first feature with the output features of the N-th layer encoder and inputting the result into the spatial attention module;
and, in the spatial attention module, performing attention mechanism calculation between each block in a video frame and the blocks at different positions of the same video frame, so as to obtain the second feature.
The actual segmentation result output unit 50 is further configured to:
separately calculating a feature map from the output features of each layer of the decoder to obtain a third feature;
and multiplying the third feature by the output features of the corresponding-layer encoder and then adding those output features, so as to obtain the temporal features.
In an embodiment of the present application, the apparatus further includes a preset polyp segmentation model obtaining unit, configured to:
creating an original polyp segmentation model;
acquiring a training video data set, wherein the training video data set comprises labeled video data and unlabeled video data;
and inputting the labeled video data and the unlabeled video data into the original polyp segmentation model for semi-supervised training, so as to generate the preset polyp segmentation model.
In an embodiment of the present application, the preset polyp segmentation model obtaining unit is further configured to:
inputting the labeled video data and the unlabeled video data into the original polyp segmentation model;
calculating cross entropy and Dice loss functions on the segmentation results output by the original polyp segmentation model, and updating the original polyp segmentation model;
and calculating a distance loss between the segmentation results of adjacent data frames, and performing supervised learning on the original polyp segmentation model.
In the embodiment of the present application, the temporal local context attention module exploits the continuity between adjacent frames and cross-layer features of different scales to segment video frames better, while the near inter-frame spatio-temporal attention module exploits the temporal and spatial information between nearby video frames, so that segmentation accuracy is effectively improved, the dilemma that video datasets previously provided too little data for training because of the excessive labeling workload and quality requirements is avoided, and practicability is improved.
In one embodiment, a computer device is provided, which may be a terminal device, and the internal structure thereof may be as shown in fig. 6. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a readable storage medium. The readable storage medium stores computer readable instructions. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer readable instructions, when executed by a processor, implement a method for video polyp segmentation based on a semi-supervised spatio-temporal attention network. The readable storage medium provided by the present embodiment includes a nonvolatile readable storage medium and a volatile readable storage medium.
In an embodiment of the present application, a computer device is provided, comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, which when executed by the processor implement the steps of the semi-supervised spatiotemporal attention network based video polyp segmentation method as described above.
In an embodiment of the present application, one or more readable storage media are provided, which store computer readable instructions that, when executed by a processor, implement the steps of a video polyp segmentation method based on a semi-supervised spatiotemporal attention network as described above.
Those skilled in the art will appreciate that implementing all or part of the above-described embodiment methods may be accomplished by instructing the associated hardware through computer readable instructions stored on a non-volatile readable storage medium or a volatile readable storage medium, which, when executed, may comprise the above-described embodiment methods. Any reference to memory, storage, database or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (7)

1. A video polyp segmentation method based on a semi-supervised spatio-temporal attention network, the method comprising:
acquiring enteroscopy video data of a patient to be examined;
dividing the enteroscopy video data into video clips of a preset size and inputting the video clips into a preset polyp segmentation model, wherein the preset polyp segmentation model comprises an N-layer encoder and an N-layer decoder;
taking, in order from layer 1 to layer N-1, the output features of each encoder layer as the input features of the next encoder layer and as the input features of a temporal local-context attention module;
performing attention-mechanism calculation on the output features of the N-th encoder layer through an adjacent inter-frame spatio-temporal attention module to obtain spatio-temporal features, which serve as the input features of the first decoder layer, wherein the adjacent inter-frame spatio-temporal attention module comprises a temporal attention module, a spatial attention module and a multi-layer perceptron; attention-mechanism calculation is performed, through the temporal attention module, between each block of one video frame and the blocks at the same position in the remaining video frames to obtain a first feature; the first feature is concatenated with the output features of the N-th encoder layer and input into the spatial attention module; in the spatial attention module, attention-mechanism calculation is performed between each block of a video frame and the blocks at different positions of the same video frame to obtain a second feature, and the second feature is input into the multi-layer perceptron for processing to obtain the spatio-temporal features; and
performing, in order from layer 1 to layer N-1 and through the temporal local-context attention module, attention-mechanism calculation on the output features of each decoder layer and the output features of the previous-layer encoder to obtain temporal features, and taking the temporal features together with the decoder output features as the input features of the next decoder layer, until the actual segmentation result is output by the N-th decoder layer.
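For illustration only, and not part of the claimed subject matter: a minimal PyTorch sketch of one way the adjacent inter-frame spatio-temporal attention module of claim 1 could be organized. The class name, head count, the (B, T, P, C) block-token layout, and the use of nn.MultiheadAttention are assumptions of this sketch, not the patented implementation.

```python
import torch
import torch.nn as nn

class NearInterFrameSTAttention(nn.Module):
    """Hypothetical sketch of the adjacent inter-frame spatio-temporal
    attention module of claim 1; shapes and layer sizes are assumptions."""
    def __init__(self, dim: int, heads: int = 8):  # dim must divide by heads
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        # The first feature is concatenated with the N-th encoder output
        # before spatial attention, hence the 2*dim -> dim projection.
        self.fuse = nn.Linear(2 * dim, dim)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, P, C) -- T frames per clip, P blocks (patches) per frame.
        B, T, P, C = x.shape
        # Temporal attention: each block attends to the blocks at the same
        # position in the remaining frames -> "first feature".
        t_in = x.permute(0, 2, 1, 3).reshape(B * P, T, C)
        first, _ = self.temporal(t_in, t_in, t_in)
        first = first.reshape(B, P, T, C).permute(0, 2, 1, 3)
        # Concatenate with the encoder output and feed spatial attention:
        # each block attends to blocks at other positions of the same frame.
        s_in = self.fuse(torch.cat([first, x], dim=-1)).reshape(B * T, P, C)
        second, _ = self.spatial(s_in, s_in, s_in)
        # The multi-layer perceptron yields the spatio-temporal features.
        return self.mlp(second).reshape(B, T, P, C)
```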
2. The video polyp segmentation method based on a semi-supervised spatio-temporal attention network as set forth in claim 1, wherein performing attention-mechanism calculation through the temporal local-context attention module to obtain the temporal features comprises:
computing a feature map from the output features of each decoder layer to obtain a third feature; and
multiplying the third feature by the output features of the previous-layer encoder and adding the output features of the previous-layer encoder to obtain the temporal features.
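Again for illustration only: a minimal sketch of the temporal local-context attention of claim 2, under the assumption that the feature map computed from the decoder output is a sigmoid-gated attention map produced by a 1x1 convolution (the gating form is an assumed choice; the arithmetic third * encoder + encoder follows the claim).

```python
import torch
import torch.nn as nn

class TemporalLocalContextAttention(nn.Module):
    """Hypothetical sketch of the temporal local-context attention (claim 2)."""
    def __init__(self, dec_channels: int, enc_channels: int):
        super().__init__()
        # "Third feature": a feature map computed from the decoder output;
        # the 1x1 conv + sigmoid parameterization is this sketch's assumption.
        self.to_map = nn.Sequential(
            nn.Conv2d(dec_channels, enc_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, dec_feat: torch.Tensor, enc_feat: torch.Tensor):
        # dec_feat: output of the current decoder layer,  (B*T, Cd, H, W)
        # enc_feat: output of the previous-layer encoder, (B*T, Ce, H, W)
        third = self.to_map(dec_feat)
        # Gate the encoder feature and keep a residual path, as in claim 2:
        # temporal = third * enc_feat + enc_feat
        return third * enc_feat + enc_feat
```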
3. The video polyp segmentation method based on a semi-supervised spatio-temporal attention network as set forth in claim 1, wherein the preset polyp segmentation model is obtained by:
creating an original polyp segmentation model;
acquiring a training video data set, wherein the training video data set comprises annotated video data and unannotated video data; and
inputting the annotated video data and the unannotated video data into the original polyp segmentation model and performing semi-supervised training to generate the preset polyp segmentation model.
4. The video polyp segmentation method based on a semi-supervised spatio-temporal attention network as set forth in claim 3, wherein said inputting the annotated video data and the unannotated video data into the original polyp segmentation model for semi-supervised training comprises:
inputting the annotated video data and the unannotated video data into the original polyp segmentation model;
calculating a cross-entropy loss and a Dice loss on the segmentation results output by the original polyp segmentation model, and updating the original polyp segmentation model accordingly; and
calculating a distance loss between the segmentation results of adjacent data frames to perform unsupervised learning on the original polyp segmentation model.
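An illustrative sketch of the semi-supervised objective of claim 4: cross-entropy plus Dice on annotated clips, and an adjacent-frame distance (temporal-consistency) loss on unannotated clips. Binary cross-entropy with logits, the squared-L2 distance, and the loss weight `w_consist` are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps: float = 1e-6):
    # Soft Dice loss per frame; target is a float binary mask in {0, 1}.
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(-2, -1))
    union = prob.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return 1.0 - (2.0 * inter + eps) / (union + eps)

def semi_supervised_loss(labeled_logits, masks, unlabeled_logits,
                         w_consist: float = 0.1):
    # labeled_logits, masks:  (B, T, H, W) for annotated clips.
    # unlabeled_logits:       (B, T, H, W) for unannotated clips.
    # Supervised term: cross entropy + Dice on the annotated frames.
    sup = F.binary_cross_entropy_with_logits(labeled_logits, masks) \
        + dice_loss(labeled_logits, masks).mean()
    # Unsupervised term: distance between the segmentation results of
    # adjacent frames of the unannotated clip.
    prob = torch.sigmoid(unlabeled_logits)
    consist = (prob[:, 1:] - prob[:, :-1]).pow(2).mean()
    return sup + w_consist * consist
```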
5. A video polyp segmentation apparatus based on a semi-supervised spatio-temporal attention network, the apparatus comprising:
an enteroscopy video data acquisition unit, configured to acquire enteroscopy video data of a patient to be examined;
a video data processing unit, configured to divide the enteroscopy video data into video clips of a preset size and input the video clips into a preset polyp segmentation model, wherein the preset polyp segmentation model comprises an N-layer encoder and an N-layer decoder;
an encoder processing unit, configured to take, in order from layer 1 to layer N-1, the output features of each encoder layer as the input features of the next encoder layer and as the input features of a temporal local-context attention module;
an attention calculation unit, configured to perform attention-mechanism calculation on the output features of the N-th encoder layer through an adjacent inter-frame spatio-temporal attention module to obtain spatio-temporal features, which serve as the input features of the first decoder layer, wherein the adjacent inter-frame spatio-temporal attention module comprises a temporal attention module, a spatial attention module and a multi-layer perceptron; attention-mechanism calculation is performed, through the temporal attention module, between each block of one video frame and the blocks at the same position in the remaining video frames to obtain a first feature; the first feature is concatenated with the output features of the N-th encoder layer and input into the spatial attention module; in the spatial attention module, attention-mechanism calculation is performed between each block of a video frame and the blocks at different positions of the same video frame to obtain a second feature, and the second feature is input into the multi-layer perceptron for processing to obtain the spatio-temporal features; and
an actual segmentation result output unit, configured to perform, in order from layer 1 to layer N-1 and through the temporal local-context attention module, attention-mechanism calculation on the output features of each decoder layer and the output features of the previous-layer encoder to obtain temporal features, and to take the temporal features together with the decoder output features as the input features of the next decoder layer, until the actual segmentation result is output by the N-th decoder layer.
6. A computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the steps of the video polyp segmentation method based on a semi-supervised spatio-temporal attention network of any one of claims 1 to 4.
7. One or more readable storage media storing computer-readable instructions which, when executed by a processor, implement the steps of the video polyp segmentation method based on a semi-supervised spatio-temporal attention network of any one of claims 1 to 4.
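To show how the pieces of claims 1-5 fit together, a hypothetical end-to-end wiring of the N-layer encoder/decoder, reusing the NearInterFrameSTAttention and TemporalLocalContextAttention sketches above. The stride-2 convolutional backbone, channel widths, bilinear upsampling, and concatenation of the temporal and decoder features into the next decoder layer are all assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

class PolypSegModel(nn.Module):
    """Hypothetical N-layer encoder/decoder skeleton for one video clip."""
    def __init__(self, n: int = 4, ch: int = 64):
        super().__init__()
        chs = [ch * 2 ** i for i in range(n)]                 # e.g. 64..512
        self.enc = nn.ModuleList(
            nn.Conv2d(3 if i == 0 else chs[i - 1], chs[i], 3, 2, 1)
            for i in range(n))
        self.st_attn = NearInterFrameSTAttention(chs[-1])
        self.dec, self.tlc = nn.ModuleList(), nn.ModuleList()
        in_ch = chs[-1]
        for i in range(n - 1, 0, -1):   # decoder layers 1..N-1
            self.dec.append(nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear",
                            align_corners=False),
                nn.Conv2d(in_ch, chs[i - 1], 3, 1, 1), nn.ReLU(inplace=True)))
            self.tlc.append(TemporalLocalContextAttention(chs[i - 1],
                                                          chs[i - 1]))
            in_ch = 2 * chs[i - 1]      # temporal feature + decoder output
        self.head = nn.Sequential(      # N-th decoder layer -> 1-channel mask
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_ch, 1, 3, 1, 1))

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, 3, H, W), H and W assumed divisible by 2**n.
        B, T, C, H, W = clip.shape
        x = clip.reshape(B * T, C, H, W)
        skips = []
        for enc in self.enc:            # encoder layers 1..N (claim 1)
            x = enc(x)
            skips.append(x)
        # Adjacent inter-frame spatio-temporal attention on the N-th output.
        _, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2).reshape(B, T, h * w, c)
        x = self.st_attn(tokens).reshape(B * T, h * w, c).transpose(1, 2)
        x = x.reshape(B * T, c, h, w)
        for dec, tlc, skip in zip(self.dec, self.tlc, reversed(skips[:-1])):
            d = dec(x)                  # decoder output features
            t = tlc(d, skip)            # temporal features (claim 2)
            x = torch.cat([t, d], 1)    # both feed the next decoder layer
        return self.head(x).reshape(B, T, H, W)  # per-frame logits
```

Under these assumptions, `PolypSegModel()(torch.randn(1, 5, 3, 256, 256))` returns per-frame segmentation logits of shape (1, 5, 256, 256) for a five-frame clip.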
CN202210672002.XA 2022-06-14 2022-06-14 Video polyp segmentation method and device based on semi-supervised space-time attention network Active CN114972293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210672002.XA CN114972293B (en) 2022-06-14 2022-06-14 Video polyp segmentation method and device based on semi-supervised space-time attention network


Publications (2)

Publication Number Publication Date
CN114972293A (en) 2022-08-30
CN114972293B (en) 2023-08-01

Family

ID=82964009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210672002.XA Active CN114972293B (en) 2022-06-14 2022-06-14 Video polyp segmentation method and device based on semi-supervised space-time attention network

Country Status (1)

Country Link
CN (1) CN114972293B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030303B (en) * 2023-01-09 2024-01-30 深圳市大数据研究院 Video colorectal lesion typing method based on semi-supervised twin network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689083B (en) * 2019-09-30 2022-04-12 苏州大学 Context pyramid fusion network and image segmentation method
CN111028242A (en) * 2019-11-27 2020-04-17 中国科学院深圳先进技术研究院 Automatic tumor segmentation system and method and electronic equipment
CN111401177B (en) * 2020-03-09 2023-04-07 山东大学 End-to-end behavior recognition method and system based on adaptive space-time attention mechanism
WO2021179205A1 (en) * 2020-03-11 2021-09-16 深圳先进技术研究院 Medical image segmentation method, medical image segmentation apparatus and terminal device
CN111985343B (en) * 2020-07-23 2024-04-09 深圳大学 Construction method of behavior recognition depth network model and behavior recognition method
CN113139974B (en) * 2021-04-13 2023-08-22 广东工业大学 Focus segmentation model training and application method based on semi-supervised learning
CN113298815A (en) * 2021-06-21 2021-08-24 江苏建筑职业技术学院 Semi-supervised remote sensing image semantic segmentation method and device and computer equipment


Similar Documents

Publication Publication Date Title
WO2020238902A1 (en) Image segmentation method, model training method, apparatuses, device and storage medium
Gao et al. Dynamic zoom-in network for fast object detection in large images
WO2021159774A1 (en) Object detection model training method and apparatus, object detection method and apparatus, computer device, and storage medium
CN110490202B (en) Detection model training method and device, computer equipment and storage medium
CN111260055B (en) Model training method based on three-dimensional image recognition, storage medium and device
CN111932561A (en) Real-time enteroscopy image segmentation method and device based on integrated knowledge distillation
CN110930378B (en) Emphysema image processing method and system based on low data demand
CN112767532B (en) Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning
WO2021208799A1 (en) Transfer model training method and apparatus and fault detection method and apparatus
AU2021354030B2 (en) Processing images using self-attention based neural networks
CN114972293B (en) Video polyp segmentation method and device based on semi-supervised space-time attention network
Cho et al. Tackling background distraction in video object segmentation
CN113902945A (en) Multi-modal breast magnetic resonance image classification method and system
CN116884636A (en) Infectious disease data analysis method, infectious disease data analysis device, computer equipment and storage medium
CN117315244A (en) Multi-scale feature fused medical image segmentation method, device and storage medium
CN111047654A (en) High-definition high-speed video background modeling method based on color information
CN112734907B (en) Ultrasonic or CT medical image three-dimensional reconstruction method
CN112734906B (en) Three-dimensional reconstruction method of ultrasonic or CT medical image based on knowledge distillation
CN112700534B (en) Ultrasonic or CT medical image three-dimensional reconstruction method based on feature migration
CN116385823B (en) Semi-supervised segmentation model generation method and system for class semantic consistency representation
CN113269021B (en) Non-supervision video target segmentation method based on local global memory mechanism
CN111881729B (en) Living body flow direction screening method, device, equipment and storage medium based on thermal imaging
CN116883914B (en) Video segmentation method and device integrating semi-supervised and contrast learning
US11983903B2 (en) Processing images using self-attention based neural networks
Liu et al. Reducing Capacity Gap in Knowledge Distillation with Review Mechanism for Crowd Counting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant