WO2023193238A1 - Surgical instrument, behavior and target tissue joint identification method and apparatus - Google Patents


Info

Publication number
WO2023193238A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
feature
behaviors
surgical instruments
category
Application number
PCT/CN2022/085837
Other languages
French (fr)
Chinese (zh)
Inventor
夏彤
贾富仓
Original Assignee
中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 中国科学院深圳先进技术研究院
Priority to PCT/CN2022/085837 priority Critical patent/WO2023193238A1/en
Publication of WO2023193238A1 publication Critical patent/WO2023193238A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Definitions

  • the present application relates to the field of medical image processing, specifically, to a method and device for joint identification of surgical instruments, behavior and target tissue.
  • the joint identification of surgical instruments, behaviors, and target tissues is key to surgical scene parsing. Precise operation of surgical instruments guarantees the safety and effectiveness of surgery. Instruments are the most prominent targets in video images that guide surgery. Accurate instrument identification is the primary task in scene perception and is also the basis for judging surgical actions and target tissues. Surgical behavior recognition is based on instrument identification and integrates the target tissue involved in the movement of the instrument and the movement of the surgical instrument to accurately judge the specific surgical operation currently being performed. Workflow identification is a global perception of the surgical process at the stage level based on specific instruments and surgical behaviors.
  • Jin et al. exploited the long short-term memory network as an effective time-series model and, combining it with a deep convolutional network in an end-to-end architecture, were the first to extract sufficiently rich spatio-temporal fusion features to realize workflow recognition. Alshirbaji et al. transferred this method to the instrument recognition task and likewise achieved recognition accuracy exceeding previous methods.
  • Jin et al. proposed a multi-task surgical instrument and workflow joint recognition network based on a joint loss function.
  • the instrument recognition and workflow recognition branches share the spatial features of the backbone network.
  • the workflow task branch is followed by a long short-term memory network to fuse action information in the time dimension.
  • a weighted loss function is used to construct a joint loss function for multi-task network training.
  • Nwoye et al. constructed three types of key content (instruments, actions and target tissues) to describe the instrument-tissue interactions in the surgical scene, and used a 3D interaction-space mapping function to achieve multi-task joint learning.
  • Embodiments of the present invention provide a method and device for joint recognition of surgical instruments, behaviors, and target tissues, so as to at least solve the technical problem that the existing technology lacks recognition tasks for describing surgical actions.
  • a method for joint identification of surgical instruments, behaviors and target tissues including the following steps:
  • the long short-term memory network is introduced to perform spatio-temporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling;
  • the surgical instrument, behavior and target tissue subtasks after spatio-temporal feature fusion are identified through a fully connected layer.
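  • As a minimal illustration of how the three steps above could be wired together, the following PyTorch-style sketch composes a frame-level feature extractor, a per-task long short-term memory layer and per-task fully connected classifiers; the ResNet-18 stand-in backbone, the 512-dimensional feature width and the category counts used here are assumptions chosen for brevity and are not values taken from the disclosure.

```python
# A minimal end-to-end sketch of steps S100-S300, assuming a PyTorch-style
# implementation. The ResNet-18 stand-in backbone, the 512-d feature width and
# the category counts are illustrative placeholders, not values from the text.
import torch
import torch.nn as nn
from torchvision import models

class JointRecognitionSketch(nn.Module):
    def __init__(self, n_classes=None):
        super().__init__()
        n_classes = n_classes or {"instrument": 6, "behavior": 10, "tissue": 15}
        backbone = models.resnet18(weights=None)       # simplified stand-in backbone
        backbone.fc = nn.Identity()                    # keep the 512-d pooled feature
        self.backbone = backbone                       # S100: per-frame visual features
        self.lstms = nn.ModuleDict(                    # S200: temporal feature fusion
            {t: nn.LSTM(512, 512, batch_first=True) for t in n_classes})
        self.heads = nn.ModuleDict(                    # S300: fully connected classifiers
            {t: nn.Linear(512, n) for t, n in n_classes.items()})

    def forward(self, clip):                           # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1)).view(b, t, -1)
        out = {}
        for task in self.heads:
            fused, _ = self.lstms[task](feats)         # spatio-temporal fusion features
            out[task] = self.heads[task](fused[:, -1]) # logits for the current frame
        return out

# illustrative usage:
# logits = JointRecognitionSketch()(torch.randn(2, 8, 3, 224, 224))
```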
  • the technical solutions adopted by the embodiments of this application also include: using the category-aligned channel attention mechanism to perform feature category alignment and decoupling of surgical instruments, behaviors, and target tissue subtasks in the scene, including:
  • the multi-label cross-channel loss based on channel attention is used in a deep convolutional network to extract spatial features of surgical instruments, behaviors and target tissue subtasks in the scene.
  • the technical solutions adopted by the embodiments of this application also include: using multi-label cross-channel loss based on channel attention to act on a deep convolutional network to extract spatial features of surgical instruments, behaviors and target tissue subtasks in the scene, including:
  • the corresponding global features are divided into category-aligned feature groups based on the total number of categories for each task.
  • the technical solutions adopted by the embodiments of this application also include: using a deep residual network as the backbone module to initially extract deep features, and then using a global pooling operation to obtain multi-dimensional feature vectors to construct subtask branches including:
  • a fifty-layer deep residual network composed of four residual modules is used as the backbone module to initially extract deep features, and then a global pooling operation is used to obtain a 2048-dimensional feature vector as the output of the backbone module;
  • a 1×1 convolution operation is used to transform the extracted 2048-dimensional feature vector into the number of channels suitable for each task branch.
  • the technical solution adopted by the embodiment of the present application also includes: dividing the corresponding global features into category-aligned feature groups based on the total number of categories of each task, including:
  • Laparoscopic cholecystectomy involves 15 types of target tissues.
  • a 1×1 convolution operation is used to obtain the 2040-dimensional global feature F, which is divided into 15 groups of features F = {F 0 , F 1 , …, F 14 };
  • Each group F i contains ⁇ channels, which are used to extract the diverse local fine-grained features corresponding to the i-th type of target tissue in the surgical scene;
  • the multi-label cross-channel loss consists of a discriminative module and a diversity module, which act respectively between 15 groups of features F and within each group of features F i on a single task;
  • for the i-th group of features F i , the discriminative module first applies a Mask operation (in the deep-learning sense) to the ξ channels in the group through a randomly generated 0-1 diagonal matrix M i , and then performs a cross-channel maximum pooling operation on the masked features within the group to obtain the final response of the current image to the i-th category.
  • the specific discriminative module is expressed in terms of the following quantities:
  • W and H represent the width and height of the feature map;
  • F i,j,k represents the k-th element position on the j-th channel in the i-th group of features.
  • after the responses Dis(F 0 ) to Dis(F n−1 ) of the input image to each category are obtained, the multi-label discriminative loss function L dis is obtained through a Softmax operation, where y i represents the true label of the current image for the i-th category and n represents the total number of categories of the subtask.
  • the diversity module performs an element-wise Softmax operation within each group of features F i , and then performs a cross-channel average pooling operation on the feature maps in the group; once the average response on each map has been computed, the diversity loss L div is calculated from these responses;
  • the complete multi-label cross-channel loss is obtained as a weighted sum of the discriminative and diversity terms: L MC (F) = λ 1 L dis + λ 2 L div ;
  • the corresponding weights λ 1 and λ 2 are adjusted and set according to the needs of the specific task.
  • the technical solutions adopted by the embodiments of this application also include: introducing a long short-term memory network to perform spatiotemporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling, including:
  • a single-layer long short-term memory network is used to extract motion features within a period of input, and 512-dimensional spatio-temporal fusion features are obtained, and the corresponding task recognition is finally achieved through a fully connected layer.
  • the technical solutions adopted by the embodiments of this application also include: using a skip-link method at the visual feature level to achieve cascaded and effective transfer of visual features, in which the overall loss function of the long short-term memory network is a weighted combination of the cross-channel loss at the visual feature level and the standard cross-entropy loss on the classification results obtained from the spatio-temporal fusion features.
  • a device for joint identification of surgical instruments, behaviors and target tissues including:
  • the category-aligned fine-grained visual feature extraction module is used to perform feature category alignment and decoupling of the surgical instrument, behavior and target tissue subtasks in the scene using the category-aligned channel attention mechanism;
  • the spatiotemporal feature fusion module is used to introduce the long short-term memory network to perform spatiotemporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling;
  • the multi-task cascade module is used to identify surgical instruments, behaviors and target tissue subtasks after spatio-temporal feature fusion through fully connected layers.
  • a storage medium that stores program files capable of realizing any of the above methods for joint identification of surgical instruments, behaviors, and target tissues.
  • a processor is used to run a program, wherein when the program is running, it executes any one of the above methods for joint identification of surgical instruments, behaviors, and target tissues.
  • the method and device for joint recognition of surgical instruments, behaviors and target tissues in the embodiments of the present invention first use the category-aligned channel attention mechanism to perform feature category alignment and decoupling of the surgical instrument, behavior and target tissue subtasks in the scene; they then introduce a long short-term memory network to perform spatio-temporal feature fusion on the action information of these subtasks after feature category alignment and decoupling; finally, a fully connected layer identifies the surgical instrument, behavior and target tissue subtasks after spatio-temporal feature fusion.
  • this invention achieves a more adequate spatial feature description by extracting locally diverse fine-grained features in surgical scenes, achieves accurate identification of multiple instruments and multiple targets in surgical operations through category decoupling, and comprehensively realizes accurate, specific and automatic real-time analysis of the key content of the surgical scene.
  • Figure 1 is a flow chart of the method for joint identification of surgical instruments, behaviors and target tissues according to the present invention
  • Figure 2 is a multi-task learning framework diagram of the method for joint identification of surgical instruments, behaviors and target tissues according to the present invention
  • Figure 3 is a functional principle diagram of the diversity loss module of the method for joint identification of surgical instruments, behaviors and target tissues according to the present invention
  • Figure 4 is a module diagram of a device for joint identification of surgical instruments, behavior and target tissue according to the present invention.
  • Surgical scene perception is an important task for modern smart operating rooms to develop information integration and intelligence under the condition of sophisticated hardware equipment and rich real-time sensing signals.
  • in endoscope-guided computer-assisted minimally invasive surgery, by understanding and processing the key information in the current surgical field of view, a surgical scene perception system can monitor the entire surgical process in real time and provide the surgeon with specific auxiliary information at any time.
  • in minimally invasive surgeries represented by laparoscopic cholecystectomy, the tiny incisions on the body surface reduce the burden of surgery on the patient, but the limitations of the endoscopic imaging field of view create certain difficulties for surgical guidance.
  • the viewing range of the endoscopic lens limits the doctor's surgical field of view, and intracavity smoke and specular reflections also block the doctor's view.
  • the high similarity and overlap of the target tissue textures under the limited viewing angle also make it difficult for doctors to judge the current surgical environment, so the risks of surgery are hard to predict. Therefore, in order to improve the safety of surgery while retaining the advantages of minimally invasive surgery, identifying and analyzing the key content of the surgical scene from the real-time video signal acquired by the intraoperative endoscope, and providing the surgeon with real-time surgical monitoring, scene parsing and auxiliary intervention, is a key technology in the development of modern operating-room scene awareness systems.
  • a method for joint identification of surgical instruments, behaviors and target tissues is provided. See Figure 1, which includes the following steps:
  • S100: use the category-aligned channel attention mechanism to perform feature category alignment and decoupling of the surgical instrument, behavior and target tissue subtasks in the scene;
  • S200: introduce a long short-term memory network to perform spatio-temporal feature fusion on the action information of the surgical instrument, behavior and target tissue subtasks in the scene after feature category alignment and decoupling;
  • S300: identify the surgical instrument, behavior and target tissue subtasks after spatio-temporal feature fusion through a fully connected layer.
  • the method for joint identification of surgical instruments, behaviors and target tissues in the embodiment of the present invention first uses the category-aligned channel attention mechanism to perform feature category alignment and decoupling of the surgical instrument, behavior and target tissue subtasks in the scene; it then introduces a long short-term memory network to perform spatio-temporal feature fusion on the action information of these subtasks after feature category alignment and decoupling; finally, a fully connected layer identifies the surgical instrument, behavior and target tissue subtasks after spatio-temporal feature fusion.
  • this invention achieves a more adequate spatial feature description by extracting locally diverse fine-grained features in surgical scenes, achieves accurate identification of multiple instruments and multiple targets in surgical operations through category decoupling, and comprehensively realizes accurate, specific and automatic real-time analysis of the key content of the surgical scene.
  • the category-aligned channel attention mechanism is used to perform feature category alignment and decoupling of the surgical instrument, behavior and target tissue subtasks in the scene, including:
  • the multi-label cross-channel loss based on channel attention is used in a deep convolutional network to extract spatial features of surgical instruments, behaviors and target tissue subtasks in the scene.
  • the multi-label cross-channel loss based on channel attention is used to act on the deep convolutional network to extract spatial features of surgical instruments, behaviors and target tissue subtasks in the scene, including:
  • the corresponding global features are divided into category-aligned feature groups based on the total number of categories for each task.
  • a deep residual network is used as the backbone module to initially extract deep features, and then a global pooling operation is used to obtain multi-dimensional feature vectors to build subtask branches including:
  • a fifty-layer deep residual network composed of four residual modules is used as the backbone module to initially extract deep features, and then a global pooling operation is used to obtain a 2048-dimensional feature vector as the output of the backbone module;
  • a 1×1 convolution operation is used to transform the extracted 2048-dimensional feature vector into the number of channels suitable for each task branch.
  • the corresponding global features are divided into category-aligned feature groups based on the total number of categories of each task, including:
  • Laparoscopic cholecystectomy involves 15 types of target tissues.
  • a 1×1 convolution operation is used to obtain the 2040-dimensional global feature F, which is divided into 15 groups of features F = {F 0 , F 1 , …, F 14 };
  • Each group F i contains ⁇ channels, which are used to extract the diverse local fine-grained features corresponding to the i-th type of target tissue in the surgical scene;
  • the multi-label cross-channel loss consists of a discriminative module and a diversity module, which act respectively between 15 groups of features F and within each group of features F i on a single task;
  • for the i-th group of features F i , the discriminative module first applies a Mask operation (in the deep-learning sense) to the ξ channels in the group through a randomly generated 0-1 diagonal matrix M i , and then performs a cross-channel maximum pooling operation on the masked features within the group to obtain the final response of the current image to the i-th category.
  • the specific discriminative module is expressed in terms of the following quantities:
  • W and H represent the width and height of the feature map;
  • F i,j,k represents the k-th element position on the j-th channel in the i-th group of features.
  • after the responses Dis(F 0 ) to Dis(F n−1 ) of the input image to each category are obtained, the multi-label discriminative loss function L dis is obtained through a Softmax operation, where y i represents the true label of the current image for the i-th category and n represents the total number of categories of the subtask.
  • the diversity module performs an element-wise Softmax operation within each group of features F i , and then performs a cross-channel average pooling operation on the feature maps in the group; once the average response on each map has been computed, the diversity loss L div is calculated from these responses;
  • the complete multi-label cross-channel loss is obtained as a weighted sum of the discriminative and diversity terms: L MC (F) = λ 1 L dis + λ 2 L div ;
  • the corresponding weights λ 1 and λ 2 are adjusted and set according to the needs of the specific task.
  • the long short-term memory network is introduced to perform spatio-temporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling, including:
  • a single-layer long short-term memory network is used to extract motion features within a period of input, and 512-dimensional spatio-temporal fusion features are obtained, and the corresponding task recognition is finally achieved through a fully connected layer.
  • the skip-link method is used at the visual feature level to achieve cascaded and effective transfer of visual features.
  • the overall loss function of the long short-term memory network is a weighted combination of the cross-channel loss at the visual feature level and the standard cross-entropy loss on the classification results obtained from the spatio-temporal fusion features.
  • the joint identification of surgical instruments, actions, and target tissues is a key technology for computer-assisted surgical intervention and minimally invasive surgery.
  • fine-grained characteristics such as the texture similarity of the target tissue, the similar structure of the instrument tip, and repeated non-specific behavioral actions during the surgical stage all make it difficult to accurately identify these key surgical contents.
  • the purpose of the present invention is to provide a more accurate and specific surgical scene analysis method using the joint identification of surgical instruments, target tissues and execution action subtasks. By extracting local diverse fine-grained features in surgical scenes, a more adequate spatial feature description is achieved, and category decoupling is used to achieve accurate identification of multiple instruments and targets in surgical operations.
  • the present invention proposes a method for joint identification of surgical instruments, behaviors and target tissues based on a multi-label mutual-channel loss, which is mainly used for scene and action recognition in computer-assisted minimally invasive surgery represented by laparoscopic cholecystectomy, and is dedicated to solving the problem of fusing key global and local visual features with action features under long-term dependencies in surgical videos through fine-grained classification and multi-task learning models.
  • the present invention uses a category-aligned channel attention mechanism to realize visual feature decoupling, introduces a long short-term memory network to extract temporal features of action information in the scene, and realizes multi-task joint recognition in a cascaded manner. Through experimental verification, the present invention achieves better recognition results than previous methods in both single tasks and joint tasks.
  • the present invention applies a multi-label cross-channel loss function to realize feature category alignment and decoupling on channels, thereby achieving the purpose of paying attention to multiple local details in the surgical scene.
  • in the three subtask branches, long short-term memory network modules are connected for the tasks involving actions in continuous time, so as to integrate spatio-temporal features.
  • the long short-term memory network module uses cascaded transfer of spatial features to strengthen the interaction between instrument presence and tissue behavior, and comprehensively achieves accurate, specific and automatic real-time analysis of the key content of surgical scenes.
  • the multi-task learning framework of the present invention is shown in Figure 2, and mainly includes a three-part structure of a category-aligned fine-grained visual feature extraction module, a spatio-temporal feature fusion module and a multi-task cascade module.
  • this module introduces a multi-label cross-channel loss based on channel attention to act on the spatial features extracted by the deep convolutional network.
  • a fifty-layer deep residual network composed of four residual modules is first used as the backbone module to initially extract deep features, and then a global pooling operation is used to obtain a 2048-dimensional feature vector as the output of the backbone module.
  • a 1×1 convolution operation is used to transform the extracted 2048-dimensional feature vector into the number of channels suitable for each task branch.
  • 1×1 convolutions are used to transform the global visual features into 2040, 2000 and 2040 channels for the three task branches respectively, and each category-aligned group then contains local features of 340, 200 and 136 channels respectively.
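  • One possible PyTorch rendering of this branch construction is sketched below. The ResNet-50 backbone and the 2040/2000/2040 branch widths follow the text; keeping the spatial feature maps (so that the cross-channel pooling operations described later have a W×H extent to act on) and the implied 6 instrument and 10 behavior categories (2040/340 and 2000/200) are assumptions.

```python
# Sketch of the backbone and the per-task 1x1 branch transforms, assuming
# PyTorch. Branch widths (2040/2000/2040) and group sizes (340/200/136) follow
# the text; the implied 6 instrument and 10 behavior categories are inferred,
# and the choice to keep spatial maps (instead of pooled vectors) is an assumption.
import torch
import torch.nn as nn
from torchvision import models

class CategoryAlignedBranches(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)                         # 50-layer residual backbone
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # output: (B, 2048, h, w)
        self.groups = {"instrument": 6, "behavior": 10, "tissue": 15}
        self.branches = nn.ModuleDict({
            "instrument": nn.Conv2d(2048, 2040, 1),   # 6 groups x 340 channels
            "behavior":   nn.Conv2d(2048, 2000, 1),   # 10 groups x 200 channels
            "tissue":     nn.Conv2d(2048, 2040, 1),   # 15 groups x 136 channels
        })

    def forward(self, x):                             # x: (B, 3, H, W) endoscopic frame
        fmap = self.backbone(x)                       # deep spatial features
        out = {}
        for task, conv in self.branches.items():
            f = conv(fmap)                            # per-task channel transform
            n = self.groups[task]
            # category-aligned grouping: (B, n_classes, xi, h, w)
            out[task] = f.view(f.size(0), n, -1, f.size(2), f.size(3))
        return out

# illustrative usage:
# grouped = CategoryAlignedBranches()(torch.randn(2, 3, 224, 224))
```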
  • the corresponding global features can be divided into category-aligned feature groups based on the total number of categories of each task.
  • laparoscopic cholecystectomy involves 15 types of target tissues.
  • a 1×1 convolution operation is used to obtain a 2040-dimensional global feature F, which is therefore divided into 15 groups of features F = {F 0 , F 1 , …, F 14 };
  • Each group F i contains ⁇ channels, which are used to extract the diverse local fine-grained features corresponding to the i-th type of target tissue in the surgical scene.
  • the multi-label cross-channel loss consists of a discriminative module and a diversity module, which act respectively between 15 groups of features F and within each group of features F i on a single task.
  • the discriminative module is used to guide the different groups of features to learn features related to their corresponding categories and to distinguish them. For the i-th group of features F i , the discriminative module first applies a Mask operation to the ξ channels in the group through a randomly generated 0-1 diagonal matrix M i , and then performs a cross-channel maximum pooling operation on the masked features in the group, thereby retaining the maximum response to the category at each position in the feature map, and finally obtains the response of the current image to the i-th category through global average pooling.
  • the specific form of the discriminative module can be expressed as Dis(F i ) = (1/(W·H)) · Σ k max j (M i F i ) j,k , where the maximum is taken over the ξ channels j of the group, the sum runs over the W·H spatial positions k, W and H represent the width and height of the feature map, and F i,j,k represents the k-th element position on the j-th channel in the i-th group of features.
  • after the responses Dis(F 0 ) to Dis(F n−1 ) of the input image to each category are obtained, the multi-label discriminative loss function L dis can be obtained through a Softmax operation, where y i represents the true label of the current image for the i-th category and n represents the total number of categories for this subtask.
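  • A possible PyTorch sketch of the discriminative term is given below, under the assumption that it follows the mutual-channel-style construction the text describes (random 0-1 channel masking, cross-channel max pooling, global average pooling, then a Softmax-based multi-label loss); the number of channels kept by the random mask is not specified in this excerpt and is an arbitrary choice here, and the grouped feature layout matches the earlier branch sketch.

```python
# Sketch of the discriminative term of the multi-label cross-channel loss,
# assuming PyTorch. The masking ratio (here: keep half of the xi channels per
# group) is an assumption; the text only states that a randomly generated 0-1
# diagonal matrix M_i masks the channels of the i-th group.
import torch
import torch.nn.functional as F

def discriminative_loss(feats, labels):
    """feats: (B, n_cls, xi, H, W) grouped maps; labels: (B, n_cls) multi-hot."""
    b, n_cls, xi, h, w = feats.shape
    keep = max(1, xi // 2)
    # random 0-1 mask over the xi channels of every group (diagonal of M_i)
    idx = torch.rand(n_cls, xi, device=feats.device).argsort(dim=1)[:, :keep]
    mask = torch.zeros(n_cls, xi, device=feats.device).scatter_(1, idx, 1.0)
    masked = feats * mask.view(1, n_cls, xi, 1, 1)
    ccmp = masked.amax(dim=2)                      # cross-channel max pooling: (B, n_cls, H, W)
    response = ccmp.mean(dim=(2, 3))               # global average pooling: Dis(F_i)
    logp = F.log_softmax(response, dim=1)          # Softmax over the class responses
    return -(labels * logp).sum(dim=1).mean()      # multi-label discriminative loss

# illustrative usage:
# y = torch.zeros(2, 15); y[:, [3, 7]] = 1.0
# loss = discriminative_loss(torch.randn(2, 15, 136, 7, 7), y)
```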
  • the diversity module performs an element-wise Softmax operation within each group of features F i , and then performs a cross-channel average pooling operation on the feature maps in the group; once the average response on each map has been computed, the diversity loss L div is calculated from these responses;
  • the complete multi-label cross-channel loss is obtained as a weighted sum of the discriminative and diversity terms: L MC (F) = λ 1 L dis + λ 2 L div ;
  • the corresponding weights λ 1 and λ 2 are adjusted and set according to the needs of the specific task.
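  • Since this excerpt describes the diversity computation in prose only, the sketch below follows the standard mutual-channel-loss formulation (spatial Softmax per channel map, cross-channel pooling, a reward for spread-out per-channel peaks); in particular the use of max pooling across channels, rather than the average pooling mentioned above, and the loss weights are assumptions rather than values from the disclosure.

```python
# Sketch of the diversity term and the combined multi-label cross-channel loss,
# assuming PyTorch. The cross-channel max pooling used here follows the
# mutual-channel-loss literature and is an assumption; the text itself refers
# to an average response per map. discriminative_loss is the earlier sketch.
import torch
import torch.nn.functional as F

def diversity_loss(feats):
    """feats: (B, n_cls, xi, H, W) grouped feature maps."""
    soft = F.softmax(feats.flatten(start_dim=3), dim=-1)   # spatial Softmax per channel map
    pooled = soft.amax(dim=2)                              # cross-channel pooling: (B, n_cls, H*W)
    spread = pooled.sum(dim=-1)                            # large when channel peaks differ
    return -spread.mean()                                  # minimizing encourages diversity

def multilabel_cross_channel_loss(feats, labels, lam1=1.0, lam2=0.1):
    """L_MC(F) = lam1 * L_dis + lam2 * L_div; the weights are task-dependent settings."""
    return lam1 * discriminative_loss(feats, labels) + lam2 * diversity_loss(feats)
```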
  • after the fine-grained visual feature extraction module of each task, a single-layer long short-term memory network is used to extract motion features within a period of input, 512-dimensional spatio-temporal fusion features are obtained, and a fully connected layer finally realizes the identification of the corresponding task.
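  • A per-task spatio-temporal fusion head matching this description might look as follows in PyTorch; the dimensionality of the per-frame visual feature fed into the LSTM is an assumption (here, the width of the corresponding task branch).

```python
# Sketch of a per-task spatio-temporal fusion head, assuming PyTorch: a
# single-layer LSTM yields 512-dimensional fused features, followed by a
# fully connected classifier. The input feature width is an assumption.
import torch
import torch.nn as nn

class TemporalFusionHead(nn.Module):
    def __init__(self, in_dim, n_classes, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, seq):                 # seq: (B, T, in_dim) per-frame visual features
        fused, _ = self.lstm(seq)           # (B, T, 512) spatio-temporal fusion features
        return self.fc(fused[:, -1])        # task logits for the current frame

# illustrative usage:
# logits = TemporalFusionHead(in_dim=2040, n_classes=15)(torch.randn(2, 8, 2040))
```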
  • the skip-link method is used at the visual feature level to achieve cascaded and effective transmission of visual features.
  • the overall loss function of the long short-term memory network is a weighted combination of the cross-channel loss at the visual feature level and the standard cross-entropy loss that uses the spatio-temporal fusion features to obtain the classification results.
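  • The overall objective described above could be assembled roughly as follows. How the skip links route visual features between the cascaded branches is not detailed in this excerpt and is therefore not shown; the per-task loss weights are illustrative, and binary cross-entropy with logits stands in for the standard cross-entropy so that the multi-label case is handled.

```python
# Sketch of the overall training objective, assuming PyTorch: for every task,
# the classification loss on the spatio-temporally fused prediction is combined
# with the cross-channel loss applied at the visual-feature level. The weights
# are illustrative, and binary cross-entropy with logits is used here as a
# multi-label form of the classification loss. multilabel_cross_channel_loss
# refers to the earlier sketch.
import torch
import torch.nn.functional as F

def overall_loss(logits, grouped_feats, labels, w_ce=1.0, w_mc=0.5):
    """logits, grouped_feats and labels are dicts keyed by task name."""
    total = 0.0
    for task in logits:
        ce = F.binary_cross_entropy_with_logits(logits[task], labels[task].float())
        mc = multilabel_cross_channel_loss(grouped_feats[task], labels[task])
        total = total + w_ce * ce + w_mc * mc      # weighted per-task combination
    return total
```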
  • the innovative technical points of the method of the present invention are at least:
  • the multi-label cross-channel loss function improved by the present invention can fully extract the local features distributed in different areas of the field of view in the laparoscopic surgery scene; the loss design for the multi-label case can better cope with multiple instruments and multiple surgical operations being executed simultaneously in the surgical scene; the category-aligned decoupling mechanism increases the visibility and interpretability of the model; and the cascaded joint identification of instruments, behaviors and target tissues better utilizes the correlation between the multiple tasks and improves single-task and multi-task recognition accuracy, thereby providing more specific and precise guidance for real-time assistance during surgery.
  • the multi-task learning method proposed by the present invention, based on the joint identification of surgical instruments, behaviors and target tissues with a multi-label cross-channel loss, has been tested on the public data sets CholecT40 and HeiCholec and achieves effective improvements over the previously mentioned methods in both single tasks and multi-task combinations. Validation on multiple data sets also shows the robustness of the model, which can meet the needs of assisted analysis of instruments, behaviors and target tissues in laparoscopic surgery scenarios.
  • the cross-channel loss function under multi-label proposed by the present invention can effectively realize the decoupling and category alignment of local fine-grained features of the image.
  • the long short-term memory network module effectively extracts the action information implicit in continuous time from the decoupled feature sequences; the cascaded multi-task joint recognition structure makes full use of the prior relationships between instruments, behaviors and target tissues, so that the joint recognition network proposed by the present invention is significantly improved compared with existing methods.
  • a device for joint identification of surgical instruments, behaviors and target tissues is provided. See Figure 4, which includes:
  • the category-aligned fine-grained visual feature extraction module 100 is used to use the category-aligned channel attention mechanism to perform feature category alignment and decoupling of surgical instruments, behaviors, and target tissue subtasks in the scene;
  • the spatiotemporal feature fusion module 200 is used to introduce a long short-term memory network to perform spatiotemporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling;
  • the multi-task cascade module 300 is used to identify surgical instruments, behaviors and target tissue subtasks after spatio-temporal feature fusion through a fully connected layer.
  • the device for joint recognition of surgical instruments, behaviors and target tissues in the embodiment of the present invention first uses the category-aligned channel attention mechanism to perform feature category alignment and decoupling of the surgical instrument, behavior and target tissue subtasks in the scene; it then introduces a long short-term memory network to perform spatio-temporal feature fusion on the action information of these subtasks after feature category alignment and decoupling; finally, a fully connected layer identifies the surgical instrument, behavior and target tissue subtasks after spatio-temporal feature fusion.
  • this invention achieves a more adequate spatial feature description by extracting locally diverse fine-grained features in surgical scenes, achieves accurate identification of multiple instruments and multiple targets in surgical operations through category decoupling, and comprehensively realizes accurate, specific and automatic real-time analysis of the key content of the surgical scene.
  • the joint recognition of surgical instruments, actions and target tissues is a key technology for computer-assisted surgical intervention and minimally invasive surgery.
  • fine-grained characteristics such as the texture similarity of the target tissue, the similar structure of the instrument tip, and repeated non-specific behavioral actions during the surgical stage all make it difficult to accurately identify these key surgical contents.
  • the purpose of the present invention is to provide a more accurate and specific surgical scene analysis device using the joint identification of surgical instruments, target tissues and execution action subtasks. By extracting local diverse fine-grained features in surgical scenes, a more adequate spatial feature description is achieved, and category decoupling is used to achieve accurate identification of multiple instruments and targets in surgical operations.
  • the present invention proposes a device for joint identification of surgical instruments, behaviors and target tissues based on a multi-label mutual-channel loss, which is mainly used for scene and action recognition in computer-assisted minimally invasive surgery represented by laparoscopic cholecystectomy, and is dedicated to solving the problem of fusing key global and local visual features with action features under long-term dependencies in surgical videos through fine-grained classification and multi-task learning models.
  • the present invention uses a category-aligned channel attention mechanism to realize visual feature decoupling, introduces a long short-term memory network to extract temporal features of action information in the scene, and realizes multi-task joint recognition in a cascaded manner. Through experimental verification, the present invention achieves better recognition results than previous methods in both single tasks and joint tasks.
  • the present invention applies a multi-label cross-channel loss function to realize feature category alignment and decoupling on the channels, thereby achieving the goal of attending to multiple local details in the surgical scene.
  • in the three subtask branches, long short-term memory network modules are connected for the tasks involving actions in continuous time, so as to integrate spatio-temporal features.
  • the long short-term memory network module uses the cascaded transfer of spatial features to strengthen the interaction between instrument presence and tissue behavior, and comprehensively achieves accurate, specific and automatic real-time analysis of the key content of surgical scenes.
  • the multi-task learning framework of the present invention is shown in Figure 2, and mainly includes a three-part structure of a category-aligned fine-grained visual feature extraction module 100, a spatio-temporal feature fusion module 200, and a multi-task cascade module 300.
  • this module introduces a multi-label cross-channel loss based on channel attention to act on the spatial features extracted by the deep convolutional network.
  • a fifty-layer deep residual network composed of four residual modules is first used as the backbone module to initially extract deep features, and then a global pooling operation is used to obtain a 2048-dimensional feature vector as the output of the backbone module.
  • a 1×1 convolution operation is used to transform the extracted 2048-dimensional feature vector into the number of channels suitable for each task branch.
  • 1×1 convolutions are used to transform the global visual features into 2040, 2000 and 2040 channels for the three task branches respectively, and each category-aligned group then contains local features of 340, 200 and 136 channels respectively.
  • the corresponding global features can be divided into category-aligned feature groups based on the total number of categories of each task.
  • laparoscopic cholecystectomy involves 15 types of target tissues.
  • a 1×1 convolution operation is used to obtain a 2040-dimensional global feature F, which is therefore divided into 15 groups of features F = {F 0 , F 1 , …, F 14 };
  • Each group F i contains ⁇ channels, which are used to extract the diverse local fine-grained features corresponding to the i-th type of target tissue in the surgical scene.
  • the multi-label cross-channel loss consists of a discriminative module and a diversity module, which act respectively between 15 groups of features F and within each group of features F i on a single task.
  • the discriminative module is used to guide the different groups of features to learn features related to their corresponding categories and to distinguish them. For the i-th group of features F i , the discriminative module first applies a Mask operation to the ξ channels in the group through a randomly generated 0-1 diagonal matrix M i , and then performs a cross-channel maximum pooling operation on the masked features in the group, thereby retaining the maximum response to the category at each position in the feature map, and finally obtains the response of the current image to the i-th category through global average pooling.
  • the specific form of the discriminative module can be expressed as Dis(F i ) = (1/(W·H)) · Σ k max j (M i F i ) j,k , where the maximum is taken over the ξ channels j of the group, the sum runs over the W·H spatial positions k, W and H represent the width and height of the feature map, and F i,j,k represents the k-th element position on the j-th channel in the i-th group of features.
  • after the responses Dis(F 0 ) to Dis(F n−1 ) of the input image to each category are obtained, the multi-label discriminative loss function L dis can be obtained through a Softmax operation, where y i represents the true label of the current image for the i-th category and n represents the total number of categories for this subtask.
  • the diversity module performs an element-wise Softmax operation within each group of features F i , and then performs a cross-channel average pooling operation on the feature maps in the group; once the average response on each map has been computed, the diversity loss L div is calculated from these responses;
  • the complete multi-label cross-channel loss is obtained as a weighted sum of the discriminative and diversity terms: L MC (F) = λ 1 L dis + λ 2 L div ;
  • the corresponding weights λ 1 and λ 2 are adjusted and set according to the needs of the specific task.
  • after the fine-grained visual feature extraction module of each task, a single-layer long short-term memory network is used to extract motion features within a period of input, 512-dimensional spatio-temporal fusion features are obtained, and a fully connected layer finally realizes the identification of the corresponding task.
  • the skip-link method is used at the visual feature level to achieve cascaded and effective transmission of visual features.
  • the overall loss function of the long short-term memory network is a weighted combination of the cross-channel loss at the visual feature level and the standard cross-entropy loss that uses the spatio-temporal fusion features to obtain the classification results.
  • the innovative technical points of the device of the present invention are at least:
  • the multi-label cross-channel loss function improved by the present invention can fully extract the local features distributed in different areas of the field of view in the laparoscopic surgery scene; the loss design for the multi-label case can better cope with multiple instruments and multiple surgical operations being executed simultaneously in the surgical scene; the category-aligned decoupling mechanism increases the visibility and interpretability of the model; and the cascaded joint identification of instruments, behaviors and target tissues better utilizes the correlation between the multiple tasks and improves single-task and multi-task recognition accuracy, thereby providing more specific and precise guidance for real-time assistance during surgery.
  • the multi-task learning device proposed by the present invention, based on the joint recognition of surgical instruments, behaviors and target tissues with a multi-label cross-channel loss, has been tested on the public data sets CholecT40 and HeiCholec and achieves effective improvements over the previously mentioned methods in both single tasks and multi-task combinations. Verification on multiple data sets also shows the robustness of the model, which can meet the needs of assisted analysis of instruments, behaviors and target tissues in laparoscopic surgery scenarios.
  • the cross-channel loss function under multi-label proposed by the present invention can effectively realize the decoupling and category alignment of local fine-grained features of the image.
  • the long short-term memory network module effectively extracts the action information implicit in continuous time from the decoupled feature sequences; the cascaded multi-task joint recognition structure makes full use of the prior relationships between instruments, behaviors and target tissues, so that the joint recognition network proposed by the present invention is significantly improved compared with existing methods.
  • a storage medium that stores program files capable of realizing any of the above methods for joint identification of surgical instruments, behaviors, and target tissues.
  • a processor is used to run a program, wherein when the program is running, it executes any one of the above methods for joint identification of surgical instruments, behaviors, and target tissues.
  • the disclosed technical content can be implemented in other ways.
  • the system embodiments described above are only illustrative.
  • the division of units can be a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the units or modules may be in electrical or other forms.
  • Units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e. they may be located in one place, or they may be distributed over multiple units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • Integrated units may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products.
  • the technical solution of the present invention, in essence or in the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which can be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present invention.
  • the aforementioned storage media include: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A surgical instrument, behavior and target tissue joint identification method and apparatus. The method comprises: first performing feature category alignment decoupling on surgical instrument, behavior and target tissue sub-tasks in a scene by using a category-aligned channel attention mechanism (S100); then introducing a long short-term memory network to perform spatial-temporal feature fusion on action information of the surgical instrument, behavior and target tissue sub-tasks in the scene after the feature category alignment decoupling (S200); and identifying the surgical instrument, behavior and target tissue sub-tasks after the spatial-temporal feature fusion by means of a fully connected layer (S300). By extracting local diversity fine-grained features in a surgical scene, more sufficient spatial feature description is achieved, accurate identification under multi-instrument and multi-target conditions in a surgical operation is achieved by means of category decoupling, and automatic real-time analysis of the key content of an accurate and specific surgical scene is comprehensively achieved.

Description

A method and device for joint identification of surgical instruments, behaviors and target tissue

Technical Field

The present application relates to the field of medical image processing, and in particular to a method and device for joint identification of surgical instruments, behaviors and target tissue.

Background Art

The joint identification of surgical instruments, behaviors and target tissues is key to surgical scene parsing. Precise operation of surgical instruments guarantees the safety and effectiveness of surgery. Instruments are the most prominent targets in the video images that guide surgery, so accurate instrument identification is the primary task in scene perception and is also the basis for judging surgical actions and target tissues. Surgical behavior recognition builds on instrument identification and integrates the target tissue involved in the movement of the instrument with the motion of the surgical instrument to accurately judge the specific surgical operation currently being performed. Workflow identification is a global, stage-level perception of the surgical process based on the specific instruments and surgical behaviors. Through the joint identification of instruments, behaviors and target tissues, surgeons can be provided with sufficient intraoperative analysis of the surgical situation and surgical decision support, the remaining operating time can be estimated, and personnel coordination within and between operating rooms can be assisted, effectively improving the safety and efficiency of laparoscopic minimally invasive surgery. After surgery, accurate parsing of the surgical video content also greatly facilitates surgical recording and teaching. Therefore, high-precision joint identification of surgical instruments, behaviors and target tissues is the basis of and key to computer-assisted intervention in minimally invasive surgery.
In response to the problem of surgical scene perception, previous researchers have mainly carried out a series of single-task and joint multi-task recognition work on laparoscopic cholecystectomy, covering workflow and surgical instruments. Early methods used manually selected features such as intensity, gradient, shape, color and tissue texture to recognize the surgical workflow and instruments from single images. Considering inter-frame correlation, some researchers used time-series models, represented by the hidden Markov model, to process surgical video over a continuous period of time. With the widespread application of deep learning methods to natural scenes, Twinanda et al. first introduced the deep convolutional network EndoNet for extracting deep visual features of surgical scenes, while retaining the hidden Markov model for extracting inter-frame correlation information, and used two independent networks to classify the workflow and the surgical instruments respectively. To address the limitation that EndoNet processes spatial and temporal features independently, Jin et al. exploited the long short-term memory network as an effective time-series model and, combining it with a deep convolutional network in an end-to-end architecture, were the first to extract sufficiently rich spatio-temporal fusion features to realize workflow recognition. Alshirbaji et al. transferred this method to the instrument recognition task and likewise achieved recognition accuracy exceeding previous methods.

Observing the strong correlations between different tasks in surgical scenes, Jin et al. proposed a multi-task joint recognition network for surgical instruments and workflow based on a joint loss function. The instrument recognition and workflow recognition branches share the spatial features of the backbone network; the workflow branch is followed by a long short-term memory network to fuse action information in the time dimension; finally, a weighted loss function is used to construct a joint loss function for multi-task network training. In order to parse the key content of the surgical scene more richly and specifically, Nwoye et al. constructed three types of key content (instruments, actions and target tissues) to describe the instrument-tissue interactions in the surgical scene, and used a 3D interaction-space mapping function to achieve multi-task joint learning.

Most previous workflow and instrument recognition work on surgical scene perception used generic deep convolutional networks to extract visual features and fully connected layers to recognize the corresponding classes. These methods obtain only a coarse-grained description of the spatial features of the current moment by extracting category-fused global features; they do not attend to the rich fine-grained characteristics of laparoscopic surgical scenes, in which target tissue textures are highly similar and overlapping and instruments differ only in local details such as their tips, nor do they address the multi-label, multi-target problem that arises when multiple instruments appear simultaneously. In addition, existing research on surgical scene perception mainly addresses the workflow and instrument tasks, and lacks recognition tasks that describe the surgical actions more specifically. The collaborative recognition across multiple tasks only uses a simple weighted average of the loss functions, failing to fully exploit the correlations between different surgical tasks.
Summary of the Invention

Embodiments of the present invention provide a method and device for joint recognition of surgical instruments, behaviors and target tissues, so as to at least solve the technical problem that the existing technology lacks recognition tasks describing surgical actions.

According to an embodiment of the present invention, a method for joint identification of surgical instruments, behaviors and target tissues is provided, including the following steps:

using a category-aligned channel attention mechanism to perform feature category alignment and decoupling of the surgical instrument, behavior and target tissue subtasks in the scene;

introducing a long short-term memory network to perform spatio-temporal feature fusion on the action information of the surgical instrument, behavior and target tissue subtasks in the scene after feature category alignment and decoupling;

identifying the surgical instrument, behavior and target tissue subtasks after spatio-temporal feature fusion through a fully connected layer.
The technical solutions adopted by the embodiments of this application further include: using the category-aligned channel attention mechanism to perform feature category alignment and decoupling of the surgical instrument, behavior and target tissue subtasks in the scene includes:

applying a multi-label cross-channel loss based on channel attention to a deep convolutional network to extract the spatial features of the surgical instrument, behavior and target tissue subtasks in the scene.

The technical solutions adopted by the embodiments of this application further include: applying the multi-label cross-channel loss based on channel attention to the deep convolutional network to extract the spatial features of the surgical instrument, behavior and target tissue subtasks in the scene includes:

using a deep residual network as the backbone module to initially extract deep features, and then using a global pooling operation to obtain multi-dimensional feature vectors to construct the subtask branches;

dividing the corresponding global features into category-aligned feature groups based on the total number of categories of each task.

The technical solutions adopted by the embodiments of this application further include: using a deep residual network as the backbone module to initially extract deep features, and then using a global pooling operation to obtain multi-dimensional feature vectors to construct the subtask branches includes:

first using a fifty-layer deep residual network composed of four residual modules as the backbone module to initially extract deep features, and then using a global pooling operation to obtain a 2048-dimensional feature vector as the output of the backbone module;

using a 1×1 convolution operation to transform the extracted 2048-dimensional feature vector into the number of channels suitable for each task branch.
本申请实施例采取的技术方案还包括:基于各任务的总类别数将对应的全局特征划分为类别对齐的特征组包括:The technical solution adopted by the embodiment of the present application also includes: dividing the corresponding global features into category-aligned feature groups based on the total number of categories of each task, including:
腹腔镜胆囊切除术涉及目标组织15类,利用1×1卷积操作得到2040维度的全局特征F,将其划分为15组特征:Laparoscopic cholecystectomy involves 15 types of target tissues. A 1×1 convolution operation is used to obtain the 2040-dimensional global feature F, which is divided into 15 groups of features:
F={F 0,F 1,…,F 14}; F={F 0 ,F 1 ,…,F 14 };
其中每组F i包含ξ个通道,用于提取第i类目标组织对应在手术场景中的多样性局部细粒度特征; Each group F i contains ξ channels, which are used to extract the diverse local fine-grained features corresponding to the i-th type of target tissue in the surgical scene;
多标签互通道损失由区分性模块和多样性模块组成,在单个任务上分别作用于15组特征F之间和每组特征F i内部; The multi-label cross-channel loss consists of a discriminative module and a diversity module, which act respectively between 15 groups of features F and within each group of features F i on a single task;
对于第i组特征F i,区分性模块首先通过随机生成的0-1对角矩阵M i对该组内ξ个通道进行深度学习中的Mask操作,再对Mask操作后的组内特征进行跨通道的最大池化操作,得到当前图像对第i个类别的最终响应,具体区分性模块表示为: For the i-th group of features F i , the discriminative module first performs the Mask operation in deep learning on the ξ channels in the group through the randomly generated 0-1 diagonal matrix M i , and then performs cross-operation on the features within the group after the Mask operation. The maximum pooling operation of the channel is used to obtain the final response of the current image to the i-th category. The specific distinguishing module is expressed as:
Figure PCTCN2022085837-appb-000001
where W and H denote the width and height of the feature map, and F_{i,j,k} denotes the k-th element position on the j-th channel of the i-th group of features;
after the final responses Dis(F_0) to Dis(F_{n-1}) of the input image to each category are obtained, the multi-label discriminative loss function is obtained through a Softmax operation:
Figure PCTCN2022085837-appb-000002
Figure PCTCN2022085837-appb-000003
where y_i denotes the true label of the current image for the i-th category, and n denotes the total number of categories of the subtask;
the diversity module performs an element-wise Softmax operation within each group of features F_i, and then performs a cross-channel average pooling operation over the feature maps within the group:
Figure PCTCN2022085837-appb-000004
Figure PCTCN2022085837-appb-000005
after the average response on each feature map has been computed, the diversity loss is calculated as:
Figure PCTCN2022085837-appb-000006
the complete multi-label mutual-channel loss is obtained as a weighted sum of the diversity module and the discriminative module:
L_MC(F) = λ_1·L_dis + λ_2·L_div;
where the corresponding weights are tuned according to the requirements of the specific task.
The technical solution adopted by the embodiments of the present application further includes: introducing the long short-term memory network to perform spatio-temporal feature fusion on the action information of the surgical instrument, behavior and target tissue subtasks in the scene after the category-aligned feature decoupling includes:
after the fine-grained visual feature extraction module of each task, extracting the motion features within a period of input through a single-layer long short-term memory network to obtain a 512-dimensional spatio-temporal fusion feature, and finally recognizing the corresponding task through a fully connected layer.
The technical solution adopted by the embodiments of the present application further includes: using a skip-connection approach at the visual feature level to realize cascaded transfer of effective visual features, where the overall loss function of the long short-term memory network is a weighted combination of the mutual-channel loss at the visual feature level and the standard cross-entropy loss of the classification results obtained from the spatio-temporal fusion features.
According to another embodiment of the present invention, a device for joint identification of surgical instruments, behaviors and target tissues is provided, comprising:
a category-aligned fine-grained visual feature extraction module, configured to perform category-aligned feature decoupling on the surgical instrument, behavior and target tissue subtasks in a scene by using a category-aligned channel attention mechanism;
a spatio-temporal feature fusion module, configured to introduce a long short-term memory network to perform spatio-temporal feature fusion on the action information of the surgical instrument, behavior and target tissue subtasks in the scene after the category-aligned feature decoupling;
a multi-task cascade module, configured to identify the surgical instrument, behavior and target tissue subtasks from the fused spatio-temporal features through a fully connected layer.
A storage medium is provided, storing a program file capable of implementing any one of the above methods for joint identification of surgical instruments, behaviors and target tissues.
A processor is provided, configured to run a program, wherein, when running, the program executes any one of the above methods for joint identification of surgical instruments, behaviors and target tissues.
In the method and device for joint identification of surgical instruments, behaviors and target tissues of the embodiments of the present invention, a category-aligned channel attention mechanism is first used to perform category-aligned feature decoupling on the surgical instrument, behavior and target tissue subtasks in the scene; a long short-term memory network is then introduced to perform spatio-temporal feature fusion on the action information of these subtasks after the category-aligned feature decoupling; and the surgical instrument, behavior and target tissue subtasks are then identified from the fused spatio-temporal features through a fully connected layer. By extracting locally diverse fine-grained features of the surgical scene, the present invention achieves a more complete spatial feature description; through category decoupling, it achieves accurate identification in multi-instrument, multi-target surgical situations; and it thereby realizes accurate and specific automatic real-time parsing of the key content of the surgical scene.
Description of the drawings
The drawings described herein are intended to provide a further understanding of the present invention and constitute a part of this application; the illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention. In the drawings:
Figure 1 is a flow chart of the method for joint identification of surgical instruments, behaviors and target tissues according to the present invention;
Figure 2 is a diagram of the multi-task learning framework of the method for joint identification of surgical instruments, behaviors and target tissues according to the present invention;
Figure 3 is a schematic diagram of the operating principle of the diversity loss module of the method for joint identification of surgical instruments, behaviors and target tissues according to the present invention;
Figure 4 is a module diagram of the device for joint identification of surgical instruments, behaviors and target tissues according to the present invention.
Detailed description of the embodiments
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second", etc. in the description and claims of the present application and in the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having", as well as any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such process, method, product or device.
Surgical scene perception is an important task of the modern intelligent operating room, which, equipped with sophisticated hardware and rich real-time sensing signals, is developing toward information integration and intelligence. In endoscope-guided computer-assisted minimally invasive surgery, by understanding and processing the key information in the current surgical field of view, a surgical scene perception system can monitor the entire surgical procedure in real time and provide the surgeon with specific auxiliary information at any moment. In minimally invasive surgery, represented by laparoscopic cholecystectomy, the small incisions on the body surface reduce the burden of the operation on the patient, but the limited field of view of endoscopic imaging makes guiding the operation more difficult. Specifically, the viewing range of the endoscopic lens restricts the surgeon's operative field, intracavity smoke and specular reflections further occlude the view, and the high texture similarity and overlap of target tissues under the limited viewing angle make it hard for the surgeon to judge the current surgical environment, all of which make surgical risks difficult to predict. Therefore, in order to improve surgical safety while retaining the advantages of minimally invasive surgery, identifying and parsing the key content of the surgical scene from the real-time surgical video signal acquired by the intraoperative endoscope, and providing the surgeon with real-time surgical monitoring and scene parsing for auxiliary intervention, is a key technology in the development of modern operating-room scene perception systems.
Embodiment 1
According to an embodiment of the present invention, a method for joint identification of surgical instruments, behaviors and target tissues is provided. Referring to Figure 1, the method includes the following steps:
S100: performing category-aligned feature decoupling on the surgical instrument, behavior and target tissue subtasks in a scene by using a category-aligned channel attention mechanism;
S200: introducing a long short-term memory network to perform spatio-temporal feature fusion on the action information of the surgical instrument, behavior and target tissue subtasks in the scene after the category-aligned feature decoupling;
S300: identifying the surgical instrument, behavior and target tissue subtasks from the fused spatio-temporal features through a fully connected layer.
In the method for joint identification of surgical instruments, behaviors and target tissues of the embodiment of the present invention, a category-aligned channel attention mechanism is first used to perform category-aligned feature decoupling on the surgical instrument, behavior and target tissue subtasks in the scene; a long short-term memory network is then introduced to perform spatio-temporal feature fusion on the action information of these subtasks after the category-aligned feature decoupling; and the surgical instrument, behavior and target tissue subtasks are then identified from the fused spatio-temporal features through a fully connected layer. By extracting locally diverse fine-grained features of the surgical scene, the present invention achieves a more complete spatial feature description; through category decoupling, it achieves accurate identification in multi-instrument, multi-target surgical situations; and it thereby realizes accurate and specific automatic real-time parsing of the key content of the surgical scene.
Performing category-aligned feature decoupling on the surgical instrument, behavior and target tissue subtasks in the scene by using the category-aligned channel attention mechanism includes:
applying a channel-attention-based multi-label mutual-channel loss to a deep convolutional network to extract spatial features of the surgical instrument, behavior and target tissue subtasks in the scene.
Applying the channel-attention-based multi-label mutual-channel loss to the deep convolutional network to extract spatial features of the surgical instrument, behavior and target tissue subtasks in the scene includes:
using a deep residual network as a backbone module to initially extract deep features, and then using a global pooling operation to obtain multi-dimensional feature vectors for constructing the subtask branches;
dividing the corresponding global features into category-aligned feature groups based on the total number of categories of each task.
Using the deep residual network as the backbone module to initially extract deep features and then using the global pooling operation to obtain multi-dimensional feature vectors for constructing the subtask branches includes:
first using a fifty-layer deep residual network composed of four residual modules as the backbone module to initially extract deep features, and then using a global pooling operation to obtain a 2048-dimensional feature vector as the output of the backbone module;
using a 1×1 convolution operation to transform the extracted 2048-dimensional features to the number of channels suited to each task branch.
Dividing the corresponding global features into category-aligned feature groups based on the total number of categories of each task includes:
laparoscopic cholecystectomy involves 15 classes of target tissue; a 1×1 convolution operation is used to obtain a 2040-dimensional global feature F, which is divided into 15 groups of features:
F = {F_0, F_1, …, F_14};
where each group F_i contains ξ channels and is used to extract the diverse local fine-grained features corresponding to the i-th class of target tissue in the surgical scene;
the multi-label mutual-channel loss consists of a discriminative module and a diversity module which, on a single task, act between the 15 groups of features F and within each group of features F_i, respectively;
for the i-th group of features F_i, the discriminative module first applies a mask operation (in the deep-learning sense) to the ξ channels within the group using a randomly generated 0-1 diagonal matrix M_i, and then applies a cross-channel max pooling operation to the masked in-group features to obtain the final response of the current image to the i-th category; the discriminative module is expressed as:
Figure PCTCN2022085837-appb-000007
where W and H denote the width and height of the feature map, and F_{i,j,k} denotes the k-th element position on the j-th channel of the i-th group of features;
after the final responses Dis(F_0) to Dis(F_{n-1}) of the input image to each category are obtained, the multi-label discriminative loss function is obtained through a Softmax operation:
Figure PCTCN2022085837-appb-000008
Figure PCTCN2022085837-appb-000009
where y_i denotes the true label of the current image for the i-th category, and n denotes the total number of categories of the subtask;
the diversity module performs an element-wise Softmax operation within each group of features F_i, and then performs a cross-channel average pooling operation over the feature maps within the group:
Figure PCTCN2022085837-appb-000010
Figure PCTCN2022085837-appb-000011
after the average response on each feature map has been computed, the diversity loss is calculated as:
Figure PCTCN2022085837-appb-000012
the complete multi-label mutual-channel loss is obtained as a weighted sum of the diversity module and the discriminative module:
L_MC(F) = λ_1·L_dis + λ_2·L_div;
where the corresponding weights are tuned according to the requirements of the specific task.
Introducing the long short-term memory network to perform spatio-temporal feature fusion on the action information of the surgical instrument, behavior and target tissue subtasks in the scene after the category-aligned feature decoupling includes:
after the fine-grained visual feature extraction module of each task, extracting the motion features within a period of input through a single-layer long short-term memory network to obtain a 512-dimensional spatio-temporal fusion feature, and finally recognizing the corresponding task through a fully connected layer.
A skip-connection approach is used at the visual feature level to realize cascaded transfer of effective visual features, where the overall loss function of the long short-term memory network is a weighted combination of the mutual-channel loss at the visual feature level and the standard cross-entropy loss of the classification results obtained from the spatio-temporal fusion features.
The method for joint identification of surgical instruments, behaviors and target tissues of the present invention is described in detail below with reference to specific embodiments:
The joint identification of surgical instruments, behaviors and target tissues is a key technology of computer-assisted minimally invasive surgical intervention. However, under the limited viewing angle of the laparoscope, fine-grained characteristics such as the texture similarity of target tissues, the similar structures of instrument tips, and the repeated, non-specific actions within a surgical phase all make accurate identification of this key surgical content difficult. In view of the existing shortcomings of previous methods, the purpose of the present invention is to provide a more accurate and specific surgical scene parsing method through the joint identification of the surgical instrument, target tissue and performed-action subtasks. By extracting locally diverse fine-grained features of the surgical scene, a more complete spatial feature description is achieved, and through category decoupling, accurate identification in multi-instrument, multi-target surgical situations is realized.
The present invention proposes a method for joint identification of surgical instruments, behaviors and target tissues based on a multi-label mutual-channel loss, mainly intended for intraoperative and postoperative scene and action recognition in computer-assisted minimally invasive surgery, represented by laparoscopic cholecystectomy. It addresses, through fine-grained classification and a multi-task learning model, the extraction of the key globally and locally fused visual features and of the action features under long-term dependencies in surgical videos. The present invention uses a category-aligned channel attention mechanism to realize visual feature decoupling, introduces a long short-term memory network to extract temporal features from the action information in the scene, and realizes multi-task joint recognition in a cascaded manner. Experiments verify that the present invention achieves recognition results superior to previous methods on both single tasks and joint tasks.
In the category-aligned fine-grained visual feature extraction module shared by the multi-task branches, the present invention applies a multi-label mutual-channel loss function to realize category-aligned feature decoupling over the channels, so as to attend to multiple local details of the surgical scene. In the three subtask branches, the tasks involving actions over continuous time are followed by a long short-term memory network module to fuse spatio-temporal features. Finally, for multi-task joint recognition, the long short-term memory network module uses cascaded transfer of spatial features to reinforce the relationship between the presence of instruments and their interaction with tissues and behaviors, comprehensively realizing accurate and specific automatic real-time parsing of the key content of the surgical scene.
The multi-task learning framework of the present invention is shown in Figure 2 and mainly consists of three parts: a category-aligned fine-grained visual feature extraction module, a spatio-temporal feature fusion module and a multi-task cascade module.
1. Category-aligned fine-grained visual feature extraction module
For the identification of subtasks such as instruments, behaviors and target tissues, the common approach is to use a deep convolutional network to extract category-fused global visual features. In order to fully parse the local, detailed visual features of the surgical scene and achieve accurate recognition in multi-label, multi-entity situations, this module introduces a channel-attention-based multi-label mutual-channel loss acting on the spatial features extracted by the deep convolutional network.
Specifically, since the multiple subtask branches share part of the visual features of the surgical scene, a fifty-layer deep residual network composed of four residual modules is first used as the backbone module to initially extract deep features, and a global pooling operation is then used to obtain a 2048-dimensional feature vector as the output of the backbone module. To facilitate the application and computation of the multi-label mutual-channel loss in the different task branches, a 1×1 convolution operation is used to transform the extracted 2048-dimensional features to the number of channels suited to each task branch. In laparoscopic cholecystectomy there are 6 classes of surgical instruments, 10 classes of performed actions and 15 classes of target tissues involved; therefore, for the three task branches, 1×1 convolutions are used to obtain global visual features of 2040, 2000 and 2040 channels respectively, in which each per-class group contains 340, 200 and 136 channels of local features.
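Purely as an illustrative sketch (not part of the original disclosure), the shared backbone and the per-task 1×1 convolution branches described above could be organized as follows in PyTorch; the module and parameter names are hypothetical, and it is assumed here that the 1×1 convolutions act on the spatial feature map before pooling, so that each branch keeps W×H feature maps for the channel-grouped loss:

import torch
import torch.nn as nn
import torchvision

class SharedBackbone(nn.Module):
    # ResNet-50 trunk (four residual stages) shared by the instrument,
    # behavior and target-tissue branches; hypothetical sketch only.
    def __init__(self, task_channels=(2040, 2000, 2040)):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.trunk = nn.Sequential(*list(resnet.children())[:-2])   # -> (B, 2048, H, W)
        # one 1x1 convolution per task branch to reach the branch channel count
        self.branch_convs = nn.ModuleList(
            [nn.Conv2d(2048, c, kernel_size=1) for c in task_channels])

    def forward(self, x):
        feat = self.trunk(x)                                                    # (B, 2048, H, W)
        pooled = torch.flatten(nn.functional.adaptive_avg_pool2d(feat, 1), 1)   # 2048-d backbone output
        branch_feats = [conv(feat) for conv in self.branch_convs]               # 2040 / 2000 / 2040 channels
        return branch_feats, pooled

With 6 instrument, 10 behavior and 15 tissue classes, the branch channel counts above split evenly into per-class groups of 340, 200 and 136 channels, matching the figures stated in the text.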
2. Operating principle and composition of the multi-label mutual-channel loss
After the subtask branches are constructed, the corresponding global features can be divided into category-aligned feature groups based on the total number of categories of each task. Taking the target tissue identification branch as an example, laparoscopic cholecystectomy involves 15 classes of target tissue, and the 1×1 convolution operation yields a 2040-dimensional global feature F, which is therefore divided into 15 groups of features:
F = {F_0, F_1, …, F_14};
where each group F_i contains ξ channels and is used to extract the diverse local fine-grained features corresponding to the i-th class of target tissue in the surgical scene. The multi-label mutual-channel loss consists of a discriminative module and a diversity module which, on a single task, act between the 15 groups of features F and within each group of features F_i, respectively.
The discriminative module guides the different groups of features to learn features related to their corresponding categories and to distinguish them from one another. For the i-th group of features F_i, the discriminative module first applies a mask operation to the ξ channels within the group using a randomly generated 0-1 diagonal matrix M_i, then applies a cross-channel max pooling operation to the masked in-group features, thereby retaining the maximum response to that category at each position of the feature map, and finally obtains the final response of the current image to the i-th category through global average pooling; the discriminative module is expressed as:
Figure PCTCN2022085837-appb-000013
where W and H denote the width and height of the feature map, and F_{i,j,k} denotes the k-th element position on the j-th channel of the i-th group of features.
After the final responses Dis(F_0) to Dis(F_{n-1}) of the input image to each category are obtained, the multi-label discriminative loss function can be obtained through a Softmax operation:
Figure PCTCN2022085837-appb-000014
Figure PCTCN2022085837-appb-000015
where y_i denotes the true label of the current image for the i-th category, and n denotes the total number of categories of the subtask.
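For illustration only, one possible reading of this discriminative component is sketched below in PyTorch; the exact formula in the original filing is rendered as an image and is not reproduced here, so the random-mask ratio and the multi-label form of the cross-entropy are assumptions:

import torch
import torch.nn.functional as F_nn

def discriminative_loss(feat, labels, n_classes, keep_ratio=0.5):
    # feat:   (B, n_classes * xi, H, W) category-grouped branch features
    # labels: (B, n_classes) multi-hot ground truth for the subtask
    b, c, h, w = feat.shape
    xi = c // n_classes
    groups = feat.view(b, n_classes, xi, h, w)
    # random 0-1 mask over the xi channels of each group (the diagonal matrix M_i)
    mask = (torch.rand(b, n_classes, xi, 1, 1, device=feat.device) < keep_ratio).float()
    masked = groups * mask
    # cross-channel max pooling, then global average pooling -> per-class response Dis(F_i)
    response = masked.max(dim=2).values.mean(dim=(2, 3))              # (B, n_classes)
    # Softmax over the class responses followed by a multi-label cross-entropy
    log_prob = F_nn.log_softmax(response, dim=1)
    return -(labels * log_prob).sum(dim=1).mean()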
The diversity module performs an element-wise Softmax operation within each group of features F_i, and then performs a cross-channel average pooling operation over the feature maps within the group:
Figure PCTCN2022085837-appb-000016
Figure PCTCN2022085837-appb-000017
After the average response on each feature map has been computed, the diversity loss can be calculated as:
Figure PCTCN2022085837-appb-000018
It is worth noting that, in a multi-label, multi-entity surgical scene in which multiple instruments appear simultaneously, the diversity loss module operates as shown in Figure 3. In Figure 3, the left side is a schematic of the multi-label diversity module and the right side is a schematic of the single-label diversity module.
The complete multi-label mutual-channel loss is obtained as a weighted sum of the diversity module and the discriminative module:
L_MC(F) = λ_1·L_dis + λ_2·L_div;
where the corresponding weights are tuned according to the requirements of the specific task.
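As with the discriminative component, the diversity component and the combined loss can only be sketched under stated assumptions, since the original formulas appear as images. The sketch below assumes a Softmax over the spatial positions of each channel, a cross-channel average pooling within each group, and a penalty on the peak of the averaged response map (a low peak means the channels of a group attend to different local regions); it reuses the discriminative_loss sketch given earlier:

import torch

def diversity_loss(feat, n_classes):
    # feat: (B, n_classes * xi, H, W) category-grouped branch features
    b, c, h, w = feat.shape
    xi = c // n_classes
    groups = feat.view(b, n_classes, xi, h * w)
    spatial_softmax = torch.softmax(groups, dim=-1)    # each channel sums to 1 over positions
    avg_response = spatial_softmax.mean(dim=2)         # cross-channel average pooling
    return avg_response.max(dim=-1).values.mean()      # small peak <=> diverse channels

def mutual_channel_loss(feat, labels, n_classes, lambda_dis=1.0, lambda_div=1.0):
    # weighted sum of the two components; the weights are tuned per task
    # (discriminative_loss is defined in the sketch above)
    return (lambda_dis * discriminative_loss(feat, labels, n_classes)
            + lambda_div * diversity_loss(feat, n_classes))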
3. Spatio-temporal feature fusion, multi-task cascade and overall loss function
To capture the motion information contained between consecutive frames, the fine-grained visual feature extraction module of each task is followed by a single-layer long short-term memory network that extracts motion features over a period of input, yielding a 512-dimensional spatio-temporal fusion feature, and the corresponding task is finally recognized through a fully connected layer. In addition, considering that surgical instruments are the most salient features and a prerequisite for surgical actions, and that instruments and actions act on target tissues according to the regularities of surgical operation, a skip-connection approach is adopted at the visual feature level to realize cascaded transfer of effective visual features. The overall loss function of the long short-term memory network is a weighted combination of the mutual-channel loss at the visual feature level and the standard cross-entropy loss of the classification results obtained from the spatio-temporal fusion features.
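A minimal sketch of one task branch's temporal head follows, assuming the per-frame branch features are pooled to a vector before the LSTM; the pooling step, the cascade wiring in the comments and all names are hypothetical and not taken from the original disclosure:

import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    # single-layer LSTM over a clip of per-frame branch features,
    # followed by a fully connected classifier
    def __init__(self, in_channels, n_classes, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(in_channels, hidden, num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, clip_feat):                 # (B, T, C, H, W)
        vec = clip_feat.mean(dim=(3, 4))          # pool each frame's feature map to a C-d vector
        fused, _ = self.lstm(vec)
        fused = fused[:, -1]                      # 512-d spatio-temporal fusion feature
        return self.fc(fused), fused

# Assumed cascade wiring via skip connections at the visual feature level:
# inst_logits, _ = instrument_head(inst_feat)
# beh_logits,  _ = behavior_head(torch.cat([beh_feat, inst_feat], dim=2))
# tis_logits,  _ = tissue_head(torch.cat([tis_feat, inst_feat, beh_feat], dim=2))
# Overall loss (assumed): the mutual-channel loss on the visual features plus a weighted
# standard cross-entropy on the classification results of the fused features.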
The innovative technical points of the method of the present invention include at least:
1. an improved design of the diversity module in the mutual-channel loss function for the multi-label, multi-entity case;
2. an improved design of the discriminative module in the mutual-channel loss function for the multi-label, multi-entity case;
3. a cascaded spatial-feature transfer structure for the joint identification of surgical instruments, behaviors and target tissues;
4. the application of the improved multi-label mutual-channel loss to the extraction of fine-grained spatial features in laparoscopic scenes.
The beneficial effects of the method of the present invention include at least:
The improved multi-label mutual-channel loss function designed by the present invention can fully extract the local features distributed over different regions of the field of view in laparoscopic surgery scenes; the multi-label loss design better handles surgical scenes in which multiple instruments and multiple entities perform operations simultaneously; the category-aligned decoupling mechanism increases the visibility and interpretability of the model; and the cascaded joint identification of instruments, behaviors and target tissues makes better use of the correlations among the tasks, improving both single-task and multi-task recognition accuracy and thereby providing more specific and precise indications for real-time assistance during surgery.
The multi-task learning method for joint identification of surgical instruments, behaviors and target tissues based on the multi-label mutual-channel loss proposed by the present invention has been evaluated on the public datasets CholecT40 and HeiCholec, and achieves effective improvements over the aforementioned previous methods on both single tasks and joint multi-task recognition. Validation on multiple datasets also shows the robustness of the model, which can meet the needs of assisted parsing of instruments, behaviors and target tissues in laparoscopic surgery scenes. Experiments verify that the proposed multi-label mutual-channel loss function effectively realizes the decoupling and category alignment of local fine-grained image features, that the long short-term memory network module effectively extracts the action information contained over continuous time from the decoupled feature sequences, and that the cascaded multi-task joint recognition structure makes full use of the prior relationships from instruments and behaviors to target tissues, so that the joint recognition network proposed by the present invention achieves a clear improvement over existing methods.
Embodiment 2
According to another embodiment of the present invention, a device for joint identification of surgical instruments, behaviors and target tissues is provided. Referring to Figure 4, the device includes:
a category-aligned fine-grained visual feature extraction module 100, configured to perform category-aligned feature decoupling on the surgical instrument, behavior and target tissue subtasks in a scene by using a category-aligned channel attention mechanism;
a spatio-temporal feature fusion module 200, configured to introduce a long short-term memory network to perform spatio-temporal feature fusion on the action information of the surgical instrument, behavior and target tissue subtasks in the scene after the category-aligned feature decoupling;
a multi-task cascade module 300, configured to identify the surgical instrument, behavior and target tissue subtasks from the fused spatio-temporal features through a fully connected layer; an illustrative sketch of how these three modules can be wired together is given after this list.
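Purely as an illustration of how these three modules could be composed (the class and argument names are hypothetical and not taken from the original disclosure):

import torch.nn as nn

class JointRecognitionDevice(nn.Module):
    # hypothetical top-level composition of the three modules described above
    def __init__(self, feature_module, fusion_module, cascade_module):
        super().__init__()
        self.feature_module = feature_module   # category-aligned fine-grained visual features (module 100)
        self.fusion_module = fusion_module     # LSTM-based spatio-temporal feature fusion (module 200)
        self.cascade_module = cascade_module   # fully connected multi-task cascade heads (module 300)

    def forward(self, clip):
        branch_feats = self.feature_module(clip)     # per-task, category-aligned feature groups
        fused = self.fusion_module(branch_feats)     # 512-d spatio-temporal features per task
        return self.cascade_module(fused)            # instrument / behavior / target-tissue predictions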
In the device for joint identification of surgical instruments, behaviors and target tissues of the embodiment of the present invention, a category-aligned channel attention mechanism is first used to perform category-aligned feature decoupling on the surgical instrument, behavior and target tissue subtasks in the scene; a long short-term memory network is then introduced to perform spatio-temporal feature fusion on the action information of these subtasks after the category-aligned feature decoupling; and the surgical instrument, behavior and target tissue subtasks are then identified from the fused spatio-temporal features through a fully connected layer. By extracting locally diverse fine-grained features of the surgical scene, the present invention achieves a more complete spatial feature description; through category decoupling, it achieves accurate identification in multi-instrument, multi-target surgical situations; and it thereby realizes accurate and specific automatic real-time parsing of the key content of the surgical scene.
The device for joint identification of surgical instruments, behaviors and target tissues of the present invention is described in detail below with reference to specific embodiments:
The joint identification of surgical instruments, behaviors and target tissues is a key technology of computer-assisted minimally invasive surgical intervention. However, under the limited viewing angle of the laparoscope, fine-grained characteristics such as the texture similarity of target tissues, the similar structures of instrument tips, and the repeated, non-specific actions within a surgical phase all make accurate identification of this key surgical content difficult. In view of the existing shortcomings of previous methods, the purpose of the present invention is to provide a more accurate and specific surgical scene parsing device through the joint identification of the surgical instrument, target tissue and performed-action subtasks. By extracting locally diverse fine-grained features of the surgical scene, a more complete spatial feature description is achieved, and through category decoupling, accurate identification in multi-instrument, multi-target surgical situations is realized.
The present invention proposes a device for joint identification of surgical instruments, behaviors and target tissues based on a multi-label mutual-channel loss, mainly intended for intraoperative and postoperative scene and action recognition in computer-assisted minimally invasive surgery, represented by laparoscopic cholecystectomy. It addresses, through fine-grained classification and a multi-task learning model, the extraction of the key globally and locally fused visual features and of the action features under long-term dependencies in surgical videos. The present invention uses a category-aligned channel attention mechanism to realize visual feature decoupling, introduces a long short-term memory network to extract temporal features from the action information in the scene, and realizes multi-task joint recognition in a cascaded manner. Experiments verify that the present invention achieves recognition results superior to previous methods on both single tasks and joint tasks.
In the category-aligned fine-grained visual feature extraction module 100 shared by the multi-task branches, the present invention applies a multi-label mutual-channel loss function to realize category-aligned feature decoupling over the channels, so as to attend to multiple local details of the surgical scene. In the three subtask branches, the tasks involving actions over continuous time are followed by a long short-term memory network module to fuse spatio-temporal features. Finally, for multi-task joint recognition, the long short-term memory network module uses cascaded transfer of spatial features to reinforce the relationship between the presence of instruments and their interaction with tissues and behaviors, comprehensively realizing accurate and specific automatic real-time parsing of the key content of the surgical scene.
The multi-task learning framework of the present invention is shown in Figure 2 and mainly consists of three parts: the category-aligned fine-grained visual feature extraction module 100, the spatio-temporal feature fusion module 200 and the multi-task cascade module 300.
1. Category-aligned fine-grained visual feature extraction module 100
For the identification of subtasks such as instruments, behaviors and target tissues, the common approach is to use a deep convolutional network to extract category-fused global visual features. In order to fully parse the local, detailed visual features of the surgical scene and achieve accurate recognition in multi-label, multi-entity situations, this module introduces a channel-attention-based multi-label mutual-channel loss acting on the spatial features extracted by the deep convolutional network.
Specifically, since the multiple subtask branches share part of the visual features of the surgical scene, a fifty-layer deep residual network composed of four residual modules is first used as the backbone module to initially extract deep features, and a global pooling operation is then used to obtain a 2048-dimensional feature vector as the output of the backbone module. To facilitate the application and computation of the multi-label mutual-channel loss in the different task branches, a 1×1 convolution operation is used to transform the extracted 2048-dimensional features to the number of channels suited to each task branch. In laparoscopic cholecystectomy there are 6 classes of surgical instruments, 10 classes of performed actions and 15 classes of target tissues involved; therefore, for the three task branches, 1×1 convolutions are used to obtain global visual features of 2040, 2000 and 2040 channels respectively, in which each per-class group contains 340, 200 and 136 channels of local features.
2. Operating principle and composition of the multi-label mutual-channel loss
After the subtask branches are constructed, the corresponding global features can be divided into category-aligned feature groups based on the total number of categories of each task. Taking the target tissue identification branch as an example, laparoscopic cholecystectomy involves 15 classes of target tissue, and the 1×1 convolution operation yields a 2040-dimensional global feature F, which is therefore divided into 15 groups of features:
F = {F_0, F_1, …, F_14};
where each group F_i contains ξ channels and is used to extract the diverse local fine-grained features corresponding to the i-th class of target tissue in the surgical scene. The multi-label mutual-channel loss consists of a discriminative module and a diversity module which, on a single task, act between the 15 groups of features F and within each group of features F_i, respectively.
The discriminative module guides the different groups of features to learn features related to their corresponding categories and to distinguish them from one another. For the i-th group of features F_i, the discriminative module first applies a mask operation to the ξ channels within the group using a randomly generated 0-1 diagonal matrix M_i, then applies a cross-channel max pooling operation to the masked in-group features, thereby retaining the maximum response to that category at each position of the feature map, and finally obtains the final response of the current image to the i-th category through global average pooling; the discriminative module is expressed as:
Figure PCTCN2022085837-appb-000019
where W and H denote the width and height of the feature map, and F_{i,j,k} denotes the k-th element position on the j-th channel of the i-th group of features.
After the final responses Dis(F_0) to Dis(F_{n-1}) of the input image to each category are obtained, the multi-label discriminative loss function can be obtained through a Softmax operation:
Figure PCTCN2022085837-appb-000020
Figure PCTCN2022085837-appb-000021
where y_i denotes the true label of the current image for the i-th category, and n denotes the total number of categories of the subtask.
The diversity module performs an element-wise Softmax operation within each group of features F_i, and then performs a cross-channel average pooling operation over the feature maps within the group:
Figure PCTCN2022085837-appb-000022
Figure PCTCN2022085837-appb-000023
After the average response on each feature map has been computed, the diversity loss can be calculated as:
Figure PCTCN2022085837-appb-000024
It is worth noting that, in a multi-label, multi-entity surgical scene in which multiple instruments appear simultaneously, the diversity loss module operates as shown in Figure 3. In Figure 3, the left side is a schematic of the multi-label diversity module and the right side is a schematic of the single-label diversity module.
The complete multi-label mutual-channel loss is obtained as a weighted sum of the diversity module and the discriminative module:
L_MC(F) = λ_1·L_dis + λ_2·L_div;
where the corresponding weights are tuned according to the requirements of the specific task.
3. Spatio-temporal feature fusion, multi-task cascade and overall loss function
To capture the motion information contained between consecutive frames, the fine-grained visual feature extraction module of each task is followed by a single-layer long short-term memory network that extracts motion features over a period of input, yielding a 512-dimensional spatio-temporal fusion feature, and the corresponding task is finally recognized through a fully connected layer. In addition, considering that surgical instruments are the most salient features and a prerequisite for surgical actions, and that instruments and actions act on target tissues according to the regularities of surgical operation, a skip-connection approach is adopted at the visual feature level to realize cascaded transfer of effective visual features. The overall loss function of the long short-term memory network is a weighted combination of the mutual-channel loss at the visual feature level and the standard cross-entropy loss of the classification results obtained from the spatio-temporal fusion features.
The innovative technical points of the device of the present invention include at least:
1. an improved design of the diversity module in the mutual-channel loss function for the multi-label, multi-entity case;
2. an improved design of the discriminative module in the mutual-channel loss function for the multi-label, multi-entity case;
3. a cascaded spatial-feature transfer structure for the joint identification of surgical instruments, behaviors and target tissues;
4. the application of the improved multi-label mutual-channel loss to the extraction of fine-grained spatial features in laparoscopic scenes.
The beneficial effects of the device of the present invention include at least:
The improved multi-label mutual-channel loss function designed by the present invention can fully extract the local features distributed over different regions of the field of view in laparoscopic surgery scenes; the multi-label loss design better handles surgical scenes in which multiple instruments and multiple entities perform operations simultaneously; the category-aligned decoupling mechanism increases the visibility and interpretability of the model; and the cascaded joint identification of instruments, behaviors and target tissues makes better use of the correlations among the tasks, improving both single-task and multi-task recognition accuracy and thereby providing more specific and precise indications for real-time assistance during surgery.
The multi-task learning device for joint identification of surgical instruments, behaviors and target tissues based on the multi-label mutual-channel loss proposed by the present invention has been evaluated on the public datasets CholecT40 and HeiCholec, and achieves effective improvements over the aforementioned previous methods on both single tasks and joint multi-task recognition. Validation on multiple datasets also shows the robustness of the model, which can meet the needs of assisted parsing of instruments, behaviors and target tissues in laparoscopic surgery scenes. Experiments verify that the proposed multi-label mutual-channel loss function effectively realizes the decoupling and category alignment of local fine-grained image features, that the long short-term memory network module effectively extracts the action information contained over continuous time from the decoupled feature sequences, and that the cascaded multi-task joint recognition structure makes full use of the prior relationships from instruments and behaviors to target tissues, so that the joint recognition network proposed by the present invention achieves a clear improvement over existing methods.
Embodiment 3
A storage medium is provided, storing a program file capable of implementing any one of the above methods for joint identification of surgical instruments, behaviors and target tissues.
Embodiment 4
A processor is provided, configured to run a program, wherein, when running, the program executes any one of the above methods for joint identification of surgical instruments, behaviors and target tissues.
The above serial numbers of the embodiments of the present invention are for description only and do not represent the superiority or inferiority of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts that are not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The system embodiments described above are merely illustrative; for example, the division into units may be a division by logical function, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through certain interfaces, units or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated units may be implemented in the form of hardware or in the form of software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disk.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims (10)

  1. A method for joint identification of surgical instruments, behaviors and target tissues, characterized in that it comprises the following steps:
    using a category-aligned channel attention mechanism to perform category-aligned decoupling of features for the surgical instrument, behavior and target tissue subtasks in the scene;
    introducing a long short-term memory network to perform spatio-temporal feature fusion on the action information of the surgical instrument, behavior and target tissue subtasks in the scene after the category-aligned feature decoupling;
    identifying, through fully connected layers, the surgical instrument, behavior and target tissue subtasks after the spatio-temporal feature fusion.
  2. The method for joint identification of surgical instruments, behaviors and target tissues according to claim 1, characterized in that said using a category-aligned channel attention mechanism to perform category-aligned decoupling of features for the surgical instrument, behavior and target tissue subtasks in the scene comprises:
    applying a multi-label mutual-channel loss based on channel attention to a deep convolutional network to extract spatial features for the surgical instrument, behavior and target tissue subtasks in the scene.
  3. The method for joint identification of surgical instruments, behaviors and target tissues according to claim 2, characterized in that said applying a multi-label mutual-channel loss based on channel attention to a deep convolutional network to extract spatial features for the surgical instrument, behavior and target tissue subtasks in the scene comprises:
    using a deep residual network as a backbone module to preliminarily extract deep features, and then using a global pooling operation to obtain multi-dimensional feature vectors for constructing the subtask branches;
    dividing the corresponding global features into category-aligned feature groups based on the total number of categories of each task.
  4. The method for joint identification of surgical instruments, behaviors and target tissues according to claim 3, characterized in that said using a deep residual network as a backbone module to preliminarily extract deep features, and then using a global pooling operation to obtain multi-dimensional feature vectors for constructing the subtask branches comprises:
    first using a fifty-layer deep residual network composed of four residual modules as the backbone module to preliminarily extract deep features, and then using a global pooling operation to obtain a 2048-dimensional feature vector as the output of the backbone module;
    using a 1×1 convolution operation to transform the extracted 2048-dimensional feature vector into the number of channels suited to each task branch.
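For illustration only, the backbone described in this claim could be assembled in PyTorch roughly as follows. The use of torchvision's ResNet-50 as the fifty-layer residual network and the tissue-branch width of 15 × 136 = 2040 channels (from claim 5) are taken as assumptions; applying the 1×1 convolution before the global pooling, so that spatial maps remain available for the loss of claim 5, is an ordering choice made for the example.

```python
# Backbone sketch (assumption: torchvision ResNet-50 as the 50-layer,
# four-stage residual network; tissue branch sized as 15 classes x 136 channels).
import torch
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet50(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-2])      # drop avgpool and fc
tissue_branch = nn.Conv2d(2048, 15 * 136, kernel_size=1)     # 2040 class-aligned channels

frame = torch.randn(1, 3, 224, 224)
deep_maps = backbone(frame)                     # (1, 2048, 7, 7) deep features
tissue_maps = tissue_branch(deep_maps)          # (1, 2040, 7, 7), 15 groups of 136 channels
tissue_vec = tissue_maps.mean(dim=(2, 3))       # global average pooling -> (1, 2040)
```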
  5. The method for joint identification of surgical instruments, behaviors and target tissues according to claim 3, characterized in that said dividing the corresponding global features into category-aligned feature groups based on the total number of categories of each task comprises:
    laparoscopic cholecystectomy involves 15 categories of target tissue; a 1×1 convolution operation is used to obtain a 2040-dimensional global feature F, which is divided into 15 groups of features:
    $$F=\{F_0,F_1,\ldots,F_{14}\};$$
    wherein each group F_i contains ξ channels, which are used to extract the diverse local fine-grained features of the surgical scene corresponding to the i-th category of target tissue;
    the multi-label mutual-channel loss consists of a discriminative module and a diversity module, which, on a single task, act between the 15 groups of features F and within each group of features F_i, respectively;
    for the i-th group of features F_i, the discriminative module first performs a masking operation, in the deep-learning sense, on the ξ channels within the group through a randomly generated 0-1 diagonal matrix M_i, and then performs a cross-channel max pooling operation on the masked features within the group to obtain the final response of the current image to the i-th category; the discriminative module is specifically expressed as:
    $$\mathrm{Dis}(F_i)=\frac{1}{W\times H}\sum_{k=1}^{W\times H}\max_{1\le j\le\xi}\left(M_i F_i\right)_{j,k};$$
    where W and H denote the width and height of the feature map, and F_{i,j,k} denotes the k-th element on the j-th channel of the i-th group of features;
    after the final responses Dis(F_0) to Dis(F_{n-1}) of the input image to each category are obtained, the multi-label discriminative loss function is obtained through a Softmax operation:
    $$p_i=\frac{\exp\left(\mathrm{Dis}(F_i)\right)}{\sum_{m=0}^{n-1}\exp\left(\mathrm{Dis}(F_m)\right)};$$
    $$L_{dis}=-\sum_{i=0}^{n-1}y_i\log p_i;$$
    where y_i denotes the true label of the current image for the i-th category, and n denotes the total number of categories of the subtask;
    the diversity module performs an element-wise Softmax operation within each group of features F_i, and then performs a cross-channel average pooling operation over the feature maps within the group:
    $$\hat{F}_{i,j,k}=\frac{\exp\left(F_{i,j,k}\right)}{\sum_{k'=1}^{W\times H}\exp\left(F_{i,j,k'}\right)};$$
    $$\bar{F}_{i,k}=\frac{1}{\xi}\sum_{j=1}^{\xi}\hat{F}_{i,j,k};$$
    after the average response on each feature map has been computed in this way, the diversity loss can be calculated as:
    $$L_{div}=\frac{1}{n}\sum_{i=0}^{n-1}\max_{1\le k\le W\times H}\bar{F}_{i,k};$$
    the complete multi-label mutual-channel loss is obtained as a weighted sum of the diversity module and the discriminative module:
    $$L_{MC}(F)=\lambda_1 L_{dis}+\lambda_2 L_{div};$$
    wherein the corresponding weights are adjusted and set according to the requirements of the specific task.
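The loss of this claim could be sketched as follows; this is a minimal illustration assuming the reconstructed formulas above, and the masking ratio, the normalisation of the multi-label targets, and the exact diversity formulation are assumptions rather than details taken from the filing.

```python
# Minimal sketch of a multi-label mutual-channel loss: n category-aligned
# groups of xi channels, a discriminative term with random channel masking
# and cross-channel max pooling, and a diversity term built from per-channel
# spatial softmax followed by cross-channel averaging.
import torch
import torch.nn.functional as F


def multilabel_mutual_channel_loss(feat, labels, n_cls, xi,
                                   lambda_dis=1.0, lambda_div=1.0, keep=0.7):
    """feat: (B, n_cls*xi, H, W) class-aligned maps; labels: (B, n_cls) multi-hot."""
    B, _, H, W = feat.shape
    groups = feat.view(B, n_cls, xi, H * W)              # one group of xi channels per class

    # Discriminative term: random 0-1 channel mask, cross-channel max pooling,
    # spatial averaging -> one response per class, then softmax cross-entropy
    # against the (normalised) multi-hot labels.
    mask = (torch.rand(B, n_cls, xi, 1, device=feat.device) < keep).float()
    dis = (groups * mask).max(dim=2).values.mean(dim=2)  # (B, n_cls)
    target = labels.float() / labels.float().sum(dim=1, keepdim=True).clamp(min=1)
    l_dis = -(target * F.log_softmax(dis, dim=1)).sum(dim=1).mean()

    # Diversity term: spatial softmax per channel, average across channels,
    # penalise a single dominant peak so channels attend to different regions.
    spatial = F.softmax(groups, dim=3)                   # normalise each channel over H*W
    avg_map = spatial.mean(dim=2)                        # (B, n_cls, H*W)
    l_div = avg_map.max(dim=2).values.mean()

    return lambda_dis * l_dis + lambda_div * l_div
```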
  6. The method for joint identification of surgical instruments, behaviors and target tissues according to claim 1, characterized in that said introducing a long short-term memory network to perform spatio-temporal feature fusion on the action information of the surgical instrument, behavior and target tissue subtasks in the scene after the category-aligned feature decoupling comprises:
    after the fine-grained visual feature extraction module of each task, extracting the motion features within a period of input through a single-layer long short-term memory network to obtain a 512-dimensional spatio-temporal fusion feature, and finally identifying the corresponding task through a fully connected layer.
  7. The method for joint identification of surgical instruments, behaviors and target tissues according to claim 6, characterized in that a skip-connection approach is adopted at the visual feature level to achieve cascaded and effective transfer of visual features, wherein the overall loss function of the long short-term memory network is composed of a weighted combination of the mutual-channel loss at the visual feature level and the standard cross-entropy loss of the classification results obtained from the spatio-temporal fusion features.
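Claims 6 and 7 could be illustrated with the following sketch of one task head: a single-layer LSTM produces the 512-dimensional spatio-temporal fusion feature, a fully connected layer produces the task prediction, and the branch loss weights the visual-level mutual-channel loss against a classification loss on the fused features. The weighting coefficients and the use of a binary cross-entropy term for the multi-label prediction are assumptions made for the example.

```python
# Illustrative temporal head for one task branch (claims 6-7).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalTaskHead(nn.Module):
    def __init__(self, in_dim, n_cls, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden, n_cls)

    def forward(self, seq):              # seq: (B, T, in_dim) per-frame branch features
        fused, _ = self.lstm(seq)        # (B, T, 512) spatio-temporal fusion features
        return self.fc(fused[:, -1]), fused[:, -1]

def branch_loss(logits, labels, mc_loss_value, alpha=1.0, beta=1.0):
    # Weighted combination of the visual-level mutual-channel loss and the
    # classification loss on the fused prediction (claim 7).
    ce = F.binary_cross_entropy_with_logits(logits, labels.float())
    return alpha * mc_loss_value + beta * ce
```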
  8. An apparatus for joint identification of surgical instruments, behaviors and target tissues, characterized in that it comprises:
    a category-aligned fine-grained visual feature extraction module, configured to use a category-aligned channel attention mechanism to perform category-aligned decoupling of features for the surgical instrument, behavior and target tissue subtasks in the scene;
    a spatio-temporal feature fusion module, configured to introduce a long short-term memory network to perform spatio-temporal feature fusion on the action information of the surgical instrument, behavior and target tissue subtasks in the scene after the category-aligned feature decoupling;
    a multi-task cascade module, configured to identify, through fully connected layers, the surgical instrument, behavior and target tissue subtasks after the spatio-temporal feature fusion.
  9. A storage medium, characterized in that the storage medium stores a program file capable of implementing the method for joint identification of surgical instruments, behaviors and target tissues according to any one of claims 1 to 7.
  10. A processor, characterized in that the processor is configured to run a program, wherein, when the program runs, the method for joint identification of surgical instruments, behaviors and target tissues according to any one of claims 1 to 7 is executed.
PCT/CN2022/085837 2022-04-08 2022-04-08 Surgical instrument, behavior and target tissue joint identification method and apparatus WO2023193238A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/085837 WO2023193238A1 (en) 2022-04-08 2022-04-08 Surgical instrument, behavior and target tissue joint identification method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/085837 WO2023193238A1 (en) 2022-04-08 2022-04-08 Surgical instrument, behavior and target tissue joint identification method and apparatus

Publications (1)

Publication Number Publication Date
WO2023193238A1 true WO2023193238A1 (en) 2023-10-12

Family

ID=88243916

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/085837 WO2023193238A1 (en) 2022-04-08 2022-04-08 Surgical instrument, behavior and target tissue joint identification method and apparatus

Country Status (1)

Country Link
WO (1) WO2023193238A1 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210150710A1 (en) * 2019-11-15 2021-05-20 Arizona Board Of Regents On Behalf Of Arizona State University Systems, methods, and apparatuses for implementing a self-supervised chest x-ray image analysis machine-learning model utilizing transferable visual words
US20210327567A1 (en) * 2020-04-20 2021-10-21 Explorer Surgical Corp. Machine-Learning Based Surgical Instrument Recognition System and Method to Trigger Events in Operating Room Workflows
CN112037263A (en) * 2020-09-14 2020-12-04 山东大学 Operation tool tracking system based on convolutional neural network and long-short term memory network
CN112347908A (en) * 2020-11-04 2021-02-09 大连理工大学 Surgical instrument image identification method based on space grouping attention model
CN112932663A (en) * 2021-03-02 2021-06-11 成都与睿创新科技有限公司 Intelligent auxiliary method and system for improving safety of laparoscopic cholecystectomy
CN113887553A (en) * 2021-08-26 2022-01-04 合肥工业大学 Operation interaction behavior recognition method and device, storage medium and electronic equipment
CN113887545A (en) * 2021-12-07 2022-01-04 南方医科大学南方医院 Laparoscopic surgical instrument identification method and device based on target detection model

Similar Documents

Publication Publication Date Title
Münzer et al. Content-based processing and analysis of endoscopic images and videos: A survey
Volkov et al. Machine learning and coresets for automated real-time video segmentation of laparoscopic and robot-assisted surgery
Nwoye et al. Recognition of instrument-tissue interactions in endoscopic videos via action triplets
CN112932663B (en) Intelligent auxiliary system for improving safety of laparoscopic cholecystectomy
Zia et al. Surgical activity recognition in robot-assisted radical prostatectomy using deep learning
Reiter et al. Appearance learning for 3D tracking of robotic surgical tools
Liu et al. An anchor-free convolutional neural network for real-time surgical tool detection in robot-assisted surgery
Yang et al. Image-based laparoscopic tool detection and tracking using convolutional neural networks: a review of the literature
Bawa et al. The saras endoscopic surgeon action detection (esad) dataset: Challenges and methods
US20200170710A1 (en) Surgical decision support using a decision theoretic model
CN114724682B (en) Auxiliary decision-making device for minimally invasive surgery
Fathabadi et al. Multi-class detection of laparoscopic instruments for the intelligent box-trainer system using faster R-CNN architecture
US20240169579A1 (en) Prediction of structures in surgical data using machine learning
Salazar-Colores et al. Desmoking laparoscopy surgery images using an image-to-image translation guided by an embedded dark channel
Loukas Surgical phase recognition of short video shots based on temporal modeling of deep features
Dhabliya et al. Computer vision: Advances in image and video analysis
WO2023193238A1 (en) Surgical instrument, behavior and target tissue joint identification method and apparatus
CN113496257A (en) Image classification method, system, electronic device and storage medium
Alam et al. Rat-capsnet: A deep learning network utilizing attention and regional information for abnormality detection in wireless capsule endoscopy
Liu et al. Towards surgical tools detection and operative skill assessment based on deep learning
Philipp et al. Localizing neurosurgical instruments across domains and in the wild
Nema et al. Unpaired deep adversarial learning for multi‐class segmentation of instruments in robot‐assisted surgical videos
Nwoye Deep learning methods for the detection and recognition of surgical tools and activities in laparoscopic videos
Tao et al. LAST: LAtent space-constrained transformers for automatic surgical phase recognition and tool presence detection
Jaafari et al. The impact of ensemble learning on surgical tools classification during laparoscopic cholecystectomy

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22936176

Country of ref document: EP

Kind code of ref document: A1