WO2023193238A1

WO2023193238A1 - Surgical instrument, behavior and target tissue joint identification method and apparatus

Info

Publication number: WO2023193238A1
Application number: PCT/CN2022/085837
Authority: WO
Inventors: 夏彤; 贾富仓
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2022-04-08
Filing date: 2022-04-08
Publication date: 2023-10-12

Abstract

A surgical instrument, behavior and target tissue joint identification method and apparatus. The method comprises: first performing feature category alignment decoupling on surgical instrument, behavior and target tissue sub-tasks in a scene by using a category-aligned channel attention mechanism (S100); then introducing a long short-term memory network to perform spatial-temporal feature fusion on action information of the surgical instrument, behavior and target tissue sub-tasks in the scene after the feature category alignment decoupling (S200); and identifying the surgical instrument, behavior and target tissue sub-tasks after the spatial-temporal feature fusion by means of a fully connected layer (S300). By extracting local diversity fine-grained features in a surgical scene, more sufficient spatial feature description is achieved, accurate identification under multi-instrument and multi-target conditions in a surgical operation is achieved by means of category decoupling, and automatic real-time analysis of the key content of an accurate and specific surgical scene is comprehensively achieved.

Description

A method and device for joint identification of surgical instruments, behavior and target tissue

Technical field

The present application relates to the field of medical image processing, specifically, to a method and device for joint identification of surgical instruments, behavior and target tissue.

Background technique

The joint identification of surgical instruments, behaviors, and target tissues is key to surgical scene parsing. Precise operation of surgical instruments guarantees the safety and effectiveness of surgery. Instruments are the most prominent targets in video images that guide surgery. Accurate instrument identification is the primary task in scene perception and is also the basis for judging surgical actions and target tissues. Surgical behavior recognition is based on instrument identification and integrates the target tissue involved in the movement of the instrument and the movement of the surgical instrument to accurately judge the specific surgical operation currently being performed. Workflow identification is a global perception of the surgical process at the stage level based on specific instruments and surgical behaviors. Through the joint identification of instruments, behaviors and target tissues, it can provide surgeons with sufficient surgical status analysis and surgical decision support during the operation, while also achieving estimation of the remaining time of the operation, and providing assistance for personnel coordination in and between operating rooms. , effectively improving the safety and efficiency of laparoscopic minimally invasive surgery. After surgery, accurate surgical video content analysis also greatly facilitates corresponding surgical recording and teaching. Therefore, it is proposed that high-precision joint identification of surgical instruments, behaviors and target tissues is the basis and key to achieve computer-assisted intervention in minimally invasive surgery.

In response to the problem of surgical scene perception, previous researchers have mainly carried out a series of single-task and multi-task joint recognition work on laparoscopic cholecystectomy, such as workflow and surgical instruments. Early methods used manually screened features such as intensity, gradient, shape, color, and tissue texture to identify surgical workflows and instruments based on a single image. Considering the inter-frame correlation, some scholars use the time series model represented by the hidden Markov model to process surgical videos within a continuous period of time. With the widespread application of deep learning methods in natural scenes, Twinanda et al. introduced the deep convolutional network EndoNet for the first time to extract deep visual features of surgical scenes. At the same time, they continued the hidden Markov model to extract inter-frame related information, using two Two independent networks realize the classification and identification of workflow and surgical instruments respectively. In view of the limitations of EndoNet's independent processing of spatiotemporal features, Jin et al. used the characteristics of long short-term memory network as an effective time series model, and combined the end-to-end network constructed with deep convolutional network to extract sufficient spatiotemporal fusion features for the first time. Implement workflow identification. Alshirbaji et al. transferred this method to the device recognition task and also achieved recognition accuracy that exceeded previous methods.

Observing the strong correlation between different tasks in surgical scenes, Jin et al. proposed a multi-task surgical instrument and workflow joint recognition network based on a joint loss function. The equipment recognition and workflow recognition branches share the spatial features on the backbone network. The workflow task branch is followed by a long and short-term memory network to fuse action information in the time dimension. Finally, a weighted loss function is used to construct a joint loss function for multi-task network training. . In order to analyze the key content in the surgical scene more richly and specifically, Nwoye et al. constructed three types of key content: instruments, actions and target tissues to describe the interaction of instruments and tissues in the surgical scene, and used a 3D mapping interaction space function to achieve multi-tasking. Federated learning.

Most of the previous workflow and instrument recognition work around the problem of surgical scene perception used general deep convolutional networks to extract visual features and used fully connected layers to realize the recognition of corresponding classes. These methods achieve coarse-grained description of spatial features at the current moment by extracting global features of category fusion, and do not pay attention to the characteristics of laparoscopy surgical scenes due to highly similar target tissue textures, overlaps, and differences in instruments. Only local details such as tips exist. It enriches fine-grained features and does not pay attention to the multi-label and multi-target problem when multiple instruments appear at the same time in surgical scenes. In addition, existing research on surgical scene perception mainly focuses on workflow and instrument tasks, but lacks recognition tasks that more specifically describe surgical actions. The collaborative recognition method between multiple tasks only uses a simple weighted average of the loss function, which fails to fully utilize the correlation between different surgical tasks.

Contents of the invention

Embodiments of the present invention provide a method and device for joint recognition of surgical instruments, behaviors, and target tissues, so as to at least solve the technical problem that the existing technology lacks recognition tasks for describing surgical actions.

According to an embodiment of the present invention, a method for joint identification of surgical instruments, behaviors and target tissues is provided, including the following steps:

Utilize the category-aligned channel attention mechanism to decouple the feature category alignment of surgical instruments, behaviors, and target tissue subtasks in the scene;

The long short-term memory network is introduced to perform spatio-temporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling;

The sub-task of identifying surgical instruments, behaviors and target tissues after spatio-temporal feature fusion is used through the fully connected layer.

The technical solutions adopted by the embodiments of this application also include: using the category-aligned channel attention mechanism to perform feature category alignment and decoupling of surgical instruments, behaviors, and target tissue subtasks in the scene, including:

The multi-label cross-channel loss based on channel attention is used in a deep convolutional network to extract spatial features of surgical instruments, behaviors and target tissue subtasks in the scene.

The technical solutions adopted by the embodiments of this application also include: using multi-label cross-channel loss based on channel attention to act on a deep convolutional network to extract spatial features of surgical instruments, behaviors and target tissue subtasks in the scene, including:

Use a deep residual network as the backbone module to initially extract deep features, and then use global pooling operations to obtain multi-dimensional feature vectors to build subtask branches;

The corresponding global features are divided into category-aligned feature groups based on the total number of categories for each task.

The technical solutions adopted by the embodiments of this application also include: using a deep residual network as the backbone module to initially extract deep features, and then using a global pooling operation to obtain multi-dimensional feature vectors to construct subtask branches including:

First, a fifty-layer deep residual network composed of four residual modules is used as the backbone module to initially extract deep features, and then a global pooling operation is used to obtain a 2048-dimensional feature vector as the output of the backbone module;

A 1×1 convolution operation is used to transform the extracted 2048-dimensional feature vector into the number of channels suitable for each task branch.

The technical solution adopted by the embodiment of the present application also includes: dividing the corresponding global features into category-aligned feature groups based on the total number of categories of each task, including:

Laparoscopic cholecystectomy involves 15 types of target tissues. A 1×1 convolution operation is used to obtain the 2040-dimensional global feature F, which is divided into 15 groups of features:

F＝{F ₀ ,F ₁ ,…,F ₁₄ };

Each group F _i contains ξ channels, which are used to extract the diverse local fine-grained features corresponding to the i-th type of target tissue in the surgical scene;

The multi-label cross-channel loss consists of a discriminative module and a diversity module, which act respectively between 15 groups of features F and within each group of features F _i on a single task;

For the i-th group of features F _i , the discriminative module first performs the Mask operation in deep learning on the ξ channels in the group through the randomly generated 0-1 diagonal matrix M _i , and then performs cross-operation on the features within the group after the Mask operation. The maximum pooling operation of the channel is used to obtain the final response of the current image to the i-th category. The specific distinguishing module is expressed as:

Where W and H represent the width and height of the feature map, F _i,j,k represents the k-th element position on the j-th channel in the i-th group of features;

After the final response of the input image to each category Dis(F ₀ ) to Dis(F _n-1 ) is obtained, the multi-label discriminative loss function is obtained through the Softmax operation:

where _yi represents the true label of the current image for the i-th category, and n represents the total number of categories of the subtask;

The diversity module performs an element-wise softmax operation within each group of features F _i , and then performs a cross-channel average pooling operation on each feature map in the group:

When the average response on each graph is calculated, the diversity loss can be calculated as:

The complete multi-label cross-channel loss is obtained by the weighted sum of the diversity module and the discriminative module:

L _MC (F)=λ ₁ L _dis +λ ₂ L _div ;

The corresponding weights are adjusted and set according to the needs of specific tasks.

The technical solutions adopted by the embodiments of this application also include: introducing a long short-term memory network to perform spatiotemporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling, including:

After the fine-grained visual feature extraction module of each task, a single-layer long short-term memory network is used to extract motion features within a period of input, and 512-dimensional spatio-temporal fusion features are obtained, and the corresponding task recognition is finally achieved through a fully connected layer.

The technical solutions adopted by the embodiments of this application also include: using the jump link method at the visual feature level to achieve cascaded effective visual feature transfer, in which the overall loss function of the long short-term memory network is composed of the cross-channel loss and spatio-temporal fusion at the visual feature level. Features are weighted by standard cross-entropy loss to obtain classification results.

According to another embodiment of the present invention, a device for joint identification of surgical instruments, behaviors and target tissues is provided, including:

The category-aligned fine-grained visual feature extraction module is used to decouple the feature category alignment of surgical instruments, behaviors and target tissue subtasks in the scene using the category-aligned channel attention mechanism;

The spatiotemporal feature fusion module is used to introduce the long short-term memory network to perform spatiotemporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling;

The multi-task cascade module is used to identify surgical instruments, behaviors and target tissue subtasks after spatio-temporal feature fusion through fully connected layers.

A storage medium that stores program files capable of realizing any of the above methods for joint identification of surgical instruments, behaviors, and target tissues.

A processor is used to run a program, wherein when the program is running, it executes any one of the above methods for joint identification of surgical instruments, behaviors, and target tissues.

The method and device for joint recognition of surgical instruments, behaviors and target tissues in the embodiment of the present invention first use the category-aligned channel attention mechanism to align and decouple the feature categories of the surgical instruments, behaviors and target tissue subtasks in the scene; and then The long short-term memory network is introduced to perform spatio-temporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling; and then the fully connected layer is used to fuse the spatio-temporal features of the surgical instruments, behaviors and target tissues. Identify subtasks. This invention achieves more adequate spatial feature description by extracting local diversity fine-grained features in surgical scenes, achieves accurate identification of multiple instruments and multiple targets in surgical operations through category decoupling, and comprehensively realizes accurate and specific surgeries. Automatic real-time analysis of key content of the scene.

Description of the drawings

The drawings described here are used to provide a further understanding of the present invention and constitute a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention. In the attached picture:

Figure 1 is a flow chart of the method for joint identification of surgical instruments, behaviors and target tissues according to the present invention;

Figure 2 is a multi-task learning framework diagram of the method for joint identification of surgical instruments, behaviors and target tissues according to the present invention;

Figure 3 is a functional principle diagram of the diversity loss module of the method for joint identification of surgical instruments, behaviors and target tissues according to the present invention;

Figure 4 is a module diagram of a device for joint identification of surgical instruments, behavior and target tissue according to the present invention.

Detailed ways

In order to enable those in the technical field to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only These are part of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative efforts should fall within the scope of protection of this application.

It should be noted that the terms "first", "second", etc. in the description and claims of this application and the above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It is to be understood that the data so used are interchangeable under appropriate circumstances so that the embodiments of the present application described herein can be practiced in sequences other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusions, e.g., a process, method, system, product, or apparatus that encompasses a series of steps or units and need not be limited to those explicitly listed. Those steps or elements may instead include other steps or elements not expressly listed or inherent to the process, method, product or apparatus.

Surgical scene perception is an important task for modern smart operating rooms to develop information integration and intelligence under the condition of sophisticated hardware equipment and rich real-time sensing signals. In endoscope-guided computer-assisted minimally invasive surgery, by understanding and processing key information in the current surgical field of view, the surgical scene perception system can monitor the entire surgical process in real time and provide the surgeon with specific auxiliary information at any time. In minimally invasive surgeries represented by laparoscopic cholecystectomy, the tiny incisions on the body surface reduce the burden of the surgery on the patient, but the limitations of the endoscopic imaging field of view create certain difficulties in surgical guidance. Specifically, the viewing range of the endoscopic lens limits the doctor's surgical field of view, and intracavity smoke and screen reflections also block the doctor's field of view. The high similarity and overlap of textures of the target tissue under the limited viewing angle also provide doctors with a better understanding of the current surgical environment. This creates difficulties in judgment and makes the risks of surgery difficult to predict. Therefore, in order to improve the safety of surgery while retaining the advantages of minimally invasive surgery, the key content of the surgical scene is identified and analyzed based on the real-time surgical video signal acquired by the intraoperative endoscope, and the surgeon is provided with real-time surgical monitoring and scenarios. Parsing to provide auxiliary intervention is a key technology in the development of modern operating room scene awareness systems.

Example 1

According to an embodiment of the present invention, a method for joint identification of surgical instruments, behaviors and target tissues is provided. See Figure 1, which includes the following steps:

S100: Utilize the category-aligned channel attention mechanism to decouple the feature category alignment of surgical instruments, behaviors, and target tissue subtasks in the scene;

S200: Introduce the long short-term memory network to perform spatio-temporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling;

S300: Recognize the subtask of surgical instruments, behaviors and target tissues after spatio-temporal feature fusion through the fully connected layer.

The method for joint identification of surgical instruments, behaviors and target tissues in the embodiment of the present invention first uses the category-aligned channel attention mechanism to perform feature category alignment and decoupling of the sub-tasks of surgical instruments, behaviors and target tissues in the scene; then introduces long and short The temporal memory network performs spatio-temporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling; then the fully connected layer fuses the spatiotemporal feature fusion of surgical instruments, behaviors and target tissue subtasks. to identify. This invention achieves more adequate spatial feature description by extracting local diversity fine-grained features in surgical scenes, achieves accurate identification of multiple instruments and multiple targets in surgical operations through category decoupling, and comprehensively realizes accurate and specific surgeries. Automatic real-time analysis of key content of the scene.

Among them, the category-aligned channel attention mechanism is used to decouple the feature category alignment of surgical instruments, behaviors and target tissue subtasks in the scene, including:

Among them, the multi-label cross-channel loss based on channel attention is used to act on the deep convolutional network to extract spatial features of surgical instruments, behaviors and target tissue subtasks in the scene, including:

Among them, a deep residual network is used as the backbone module to initially extract deep features, and then a global pooling operation is used to obtain multi-dimensional feature vectors to build subtask branches including:

Among them, the corresponding global features are divided into category-aligned feature groups based on the total number of categories of each task, including:

F＝{F ₀ ,F ₁ ,…,F ₁₄ };

L _MC (F)=λ ₁ L _dis +λ ₂ L _div ;

Among them, the long short-term memory network is introduced to perform spatio-temporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling, including:

Among them, the jump link method is used at the visual feature level to achieve cascaded effective visual feature transfer. The overall loss function of the long short-term memory network is the standard cross-entropy loss of the classification result obtained by the cross-channel loss at the visual feature level and the spatio-temporal fusion feature. Weighted composition.

The method for joint identification of surgical instruments, behaviors and target tissues of the present invention will be described in detail below with specific examples:

The joint identification of surgical instruments, actions, and target tissues is a key technology for computer-assisted surgical intervention and minimally invasive surgery. However, under the limited viewing angle of a laparoscope, fine-grained characteristics such as the texture similarity of the target tissue, the similar structure of the instrument tip, and repeated non-specific behavioral actions during the surgical stage all make it difficult to accurately identify these key surgical contents. In view of the existing shortcomings of previous methods, the purpose of the present invention is to provide a more accurate and specific surgical scene analysis method using the joint identification of surgical instruments, target tissues and execution action subtasks. By extracting local diverse fine-grained features in surgical scenes, a more adequate spatial feature description is achieved, and category decoupling is used to achieve accurate identification of multiple instruments and targets in surgical operations.

The present invention proposes a method for joint identification of surgical instruments, behaviors and target tissues based on multi-label mutual channel loss, which is mainly used for computer-assisted minimally invasive surgery represented by laparoscopic cholecystectomy and postoperative scenes and actions. Recognition, dedicated to solving the problem of key global and local fusion of visual features and action features under long-term dependencies in surgical videos through fine-grained classification and multi-task learning models. The present invention uses a category-aligned channel attention mechanism to realize visual feature decoupling, introduces a long short-term memory network to extract temporal features of action information in the scene, and realizes multi-task joint recognition in a cascaded manner. Through experimental verification, the present invention achieves better recognition results than previous methods in both single tasks and joint tasks.

In the fine-grained visual feature extraction module for category alignment shared by multi-task branches, the present invention applies a multi-label cross-channel loss function to realize feature category alignment and decoupling on channels, thereby achieving the purpose of paying attention to multiple local details in the surgical scene. . In the three subtask branches, long short-term memory network modules are connected for tasks involving actions in continuous time to achieve the purpose of integrating spatiotemporal features. Finally, in terms of multi-task joint recognition, the long-short-term memory network module uses the cascade transfer of spatial features to strengthen the interaction between instrument presentation and organizational behavior, and comprehensively achieves automatic real-time analysis of accurate and specific key content of surgical scenes.

The multi-task learning framework of the present invention is shown in Figure 2, and mainly includes a three-part structure of a category-aligned fine-grained visual feature extraction module, a spatio-temporal feature fusion module and a multi-task cascade module.

1. Category-aligned fine-grained visual feature extraction module

For the identification of subtasks such as devices, behaviors, and target tissues, a common method is to use deep convolutional networks to extract global visual features of category fusion. In order to fully analyze the local detailed visual features in the surgical scene to achieve accurate recognition in multi-label and multi-entity situations, this module introduces a multi-label cross-channel loss based on channel attention to act on the spatial features extracted by the deep convolutional network.

Specifically, since multiple subtask branches share some visual features in the surgical scene, a fifty-layer deep residual network composed of four residual modules is first used as the backbone module to initially extract deep features, and then a global pooling operation is used to obtain 2048 dimensional feature vector as the output of the backbone module. In order to facilitate the application and calculation of multi-label cross-channel loss in different task branches, a 1×1 convolution operation is used to transform the extracted 2048-dimensional feature vector into the number of channels suitable for each task branch. For laparoscopic cholecystectomy, there are 6 categories of surgical instruments, 10 categories of execution actions, and 15 categories of target tissues involved. Therefore, for the three task branches, 1×1 convolution is used to transform the global vision of 2040, 2000 and 2040 channels respectively. Features, each group contains local features with 340, 200 and 136 channel numbers.

2. The function and composition principle of multi-label mutual channel loss

After constructing the subtask branches, the corresponding global features can be divided into category-aligned feature groups based on the total number of categories of each task. Taking the target tissue identification task branch as an example, laparoscopic cholecystectomy involves 15 types of target tissues. A 1×1 convolution operation is used to obtain a 2040-dimensional global feature F, so it is divided into 15 groups of features:

F＝{F ₀ ,F ₁ ,…,F ₁₄ };

Each group F _i contains ξ channels, which are used to extract the diverse local fine-grained features corresponding to the i-th type of target tissue in the surgical scene. The multi-label cross-channel loss consists of a discriminative module and a diversity module, which act respectively between 15 groups of features F and within each group of features F _i on a single task.

The discriminative module is used to guide different groups of features to learn features related to corresponding categories and distinguish them. For the i-th group of features F _i , the discriminative module first performs a Mask operation on ξ channels in the group through a randomly generated 0-1 diagonal matrix M _i , and then performs cross-channel maximum pooling on the masked features in the group. operation, thereby retaining the maximum response to the category at each position in the feature map, and obtaining the final response of the current image to the i-th category through global average pooling. The specific distinguishing module is expressed as:

Where W and H represent the width and height of the feature map, and F _i,j,k represents the k-th element position on the j-th channel in the i-th group of features.

After the final response Dis(F ₀ ) to Dis(F _n-1 ) of the input image for each category is obtained, the multi-label discriminative loss function can be obtained through the Softmax operation:

where _yi represents the true label of the current image for the i-th category, and n represents the total number of categories for this subtask.

It is worth noting that in a multi-label multi-entity surgery scenario where multiple instruments appear simultaneously, the working principle of the diversity loss module is shown in Figure 3. In Figure 3, the left side is the schematic diagram of the multi-label diversity module, and the right side is the schematic diagram of the single-label diversity module.

L _MC (F)=λ ₁ L _dis +λ ₂ L _div ;

3. Spatiotemporal feature fusion, multi-task cascade and overall loss function

In order to capture the motion information contained between consecutive frames, after the fine-grained visual feature extraction module of each task, a single-layer long short-term memory network is used to extract motion features within a period of input, and 512-dimensional spatio-temporal fusion features are obtained and finally through the full The connection layer realizes the identification of corresponding tasks. In addition, considering that surgical instruments are the most prominent features and prerequisites for surgical actions, and the surgical operation rules of the instruments and actions acting on the target tissue, the jump link method is used at the visual feature level to achieve cascading effective visual feature transmission. The overall loss function of the long short-term memory network is weighted by the cross-channel loss at the visual feature level and the standard cross-entropy loss that uses spatio-temporal fusion features to obtain classification results.

The innovative technical points of the method of the present invention are at least:

1. Improved design of the diversity module in the cross-channel loss function in the case of multi-label and multi-entity;

2. Improved design of the discriminative module in the cross-channel loss function in the case of multi-label and multi-entity;

3. Cascade spatial feature transfer structure for joint identification of surgical instruments, behaviors and target tissues;

4. Design the application of improved multi-label cross-channel loss in extracting fine-grained spatial features in laparoscopic scenes.

The beneficial effects of the method of the present invention are at least:

The multi-label cross-channel loss function improved by the present invention can fully extract the local features distributed in different areas of the field of view in the laparoscopic surgery scene; the loss design in the multi-label case can better cope with the simultaneous execution of multiple instruments and multiple entities in the surgical scene. The application of surgical operations; the decoupling mechanism of category alignment increases the visibility and interpretability of the model; the cascaded joint identification of instruments, behaviors and target tissues better utilizes the correlation between multiple tasks and improves the efficiency of a single Task and multi-task recognition accuracy, thereby providing more specific and precise instructions for real-time assistance during surgery.

The multi-task learning method proposed by the present invention based on the joint identification of surgical instruments, behaviors and target tissues based on multi-label cross-channel loss has been tested on the public data sets CholecT40 and HeiCholec, and has achieved results in both single tasks and multi-task combinations. It is an effective improvement over the previous methods mentioned above. Validation on multiple data sets also shows the robustness of the model, which can meet the needs of assisted analysis of instruments, behaviors and target tissues in laparoscopic surgery scenarios. Through experimental verification, the cross-channel loss function under multi-label proposed by the present invention can effectively realize the decoupling and category alignment of local fine-grained features of the image. The long-short-term memory network module effectively extracts the implication in continuous time based on the decoupled feature sequence. Action information, the cascaded multi-task joint recognition structure makes full use of the prior relationships between instruments, behaviors and target tissues, so that the joint recognition network proposed by the present invention is significantly improved compared to existing methods.

Example 2

According to another embodiment of the present invention, a device for joint identification of surgical instruments, behaviors and target tissues is provided. See Figure 4, which includes:

The category-aligned fine-grained visual feature extraction module 100 is used to use the category-aligned channel attention mechanism to perform feature category alignment and decoupling of surgical instruments, behaviors, and target tissue subtasks in the scene;

The spatiotemporal feature fusion module 200 is used to introduce a long short-term memory network to perform spatiotemporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling;

The multi-task cascade module 300 is used to identify surgical instruments, behaviors and target tissue subtasks after spatio-temporal feature fusion through a fully connected layer.

The device for joint recognition of surgical instruments, behaviors and target tissues in the embodiment of the present invention first uses the category-aligned channel attention mechanism to perform feature category alignment and decoupling of the sub-tasks of surgical instruments, behaviors and target tissues in the scene; and then introduces long and short The temporal memory network performs spatio-temporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling; then the fully connected layer fuses the spatiotemporal feature fusion of surgical instruments, behaviors and target tissue subtasks. to identify. This invention achieves more adequate spatial feature description by extracting local diversity fine-grained features in surgical scenes, achieves accurate identification of multiple instruments and multiple targets in surgical operations through category decoupling, and comprehensively realizes accurate and specific surgeries. Automatic real-time analysis of key content of the scene.

The device for joint identification of surgical instruments, behaviors and target tissues of the present invention will be described in detail below with specific embodiments:

The joint recognition of surgical instruments, actions and target tissues is a key technology for computer-assisted surgical intervention and minimally invasive surgery. However, under the limited viewing angle of a laparoscope, fine-grained characteristics such as the texture similarity of the target tissue, the similar structure of the instrument tip, and repeated non-specific behavioral actions during the surgical stage all make it difficult to accurately identify these key surgical contents. In view of the existing shortcomings of previous methods, the purpose of the present invention is to provide a more accurate and specific surgical scene analysis device using the joint identification of surgical instruments, target tissues and execution action subtasks. By extracting local diverse fine-grained features in surgical scenes, a more adequate spatial feature description is achieved, and category decoupling is used to achieve accurate identification of multiple instruments and targets in surgical operations.

The present invention proposes a device for joint identification of surgical instruments, behaviors and target tissues based on multi-label mutual channel loss, which is mainly used for computer-assisted minimally invasive surgery represented by laparoscopic cholecystectomy and postoperative scenes and actions. Recognition, dedicated to solving the problem of key global and local fusion of visual features and action features under long-term dependencies in surgical videos through fine-grained classification and multi-task learning models. The present invention uses a category-aligned channel attention mechanism to realize visual feature decoupling, introduces a long short-term memory network to extract temporal features of action information in the scene, and realizes multi-task joint recognition in a cascaded manner. Through experimental verification, the present invention achieves better recognition results than previous methods in both single tasks and joint tasks.

In the category-aligned fine-grained visual feature extraction module 100 shared by multi-task branches, the present invention applies a multi-label cross-channel loss function to realize feature category alignment and decoupling on channels, thereby achieving the goal of paying attention to multiple local details in the surgical scene. Purpose. In the three subtask branches, long and short-term memory network modules are connected for tasks involving actions in continuous time to achieve the purpose of integrating spatiotemporal features. Finally, in terms of multi-task joint recognition, the long-short-term memory network module uses the cascade transfer method of spatial features to strengthen the interaction between instrument presentation and organizational behavior, and comprehensively achieves automatic real-time analysis of accurate and specific key content of surgical scenes.

The multi-task learning framework of the present invention is shown in Figure 2, and mainly includes a three-part structure of a category-aligned fine-grained visual feature extraction module 100, a spatio-temporal feature fusion module 200, and a multi-task cascade module 300.

1. Category-aligned fine-grained visual feature extraction module 100

2. The function and composition principle of multi-label mutual channel loss

F＝{F ₀ ,F ₁ ,…,F ₁₄ };

L _MC (F)=λ ₁ L _dis +λ ₂ L _div ;

3. Spatiotemporal feature fusion, multi-task cascade and overall loss function

The innovative technical points of the device of the present invention are at least:

The beneficial effects of the device of the present invention are at least:

The multi-task learning device proposed by the present invention based on the joint recognition of surgical instruments, behaviors and target tissues based on multi-label cross-channel loss has been tested on the public data sets CholecT40 and HeiCholec, and has achieved results in both single tasks and multi-task combinations. It is an effective improvement over the previous methods mentioned above. Verification on multiple data sets also shows the robustness of the model, which can meet the needs of assisted analysis of instruments, behaviors and target tissues in laparoscopic surgery scenarios. Through experimental verification, the cross-channel loss function under multi-label proposed by the present invention can effectively realize the decoupling and category alignment of local fine-grained features of the image. The long-short-term memory network module effectively extracts the implication in continuous time based on the decoupled feature sequence. Action information, the cascaded multi-task joint recognition structure makes full use of the prior relationships between instruments, behaviors and target tissues, so that the joint recognition network proposed by the present invention is significantly improved compared to existing methods.

Example 3

Example 4

The above serial numbers of the embodiments of the present invention are only for description and do not represent the advantages and disadvantages of the embodiments.

In the above-mentioned embodiments of the present invention, each embodiment is described with its own emphasis. For parts that are not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The system embodiments described above are only illustrative. For example, the division of units can be a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components can be combined or integrated into Another system, or some features can be ignored, or not implemented. On the other hand, the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the units or modules may be in electrical or other forms.

Units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e. they may be located in one place, or they may be distributed over multiple units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically alone, or two or more units can be integrated into one unit. The above integrated units can be implemented in the form of hardware or software functional units.

Integrated units may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. Based on this understanding, the technical solution of the present invention is essentially or contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including several instructions to cause a computer device (which can be a personal computer, a server or a network device, etc.) to execute all or part of the steps of the methods of various embodiments of the present invention. The aforementioned storage media include: U disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program code. .

The above are only preferred embodiments of the present invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principles of the present invention. These improvements and modifications can also be made. should be regarded as the protection scope of the present invention.

Claims

A method for joint identification of surgical instruments, behaviors and target tissues, which is characterized by including the following steps:

Utilize the category-aligned channel attention mechanism to decouple the feature category alignment of surgical instruments, behaviors, and target tissue subtasks in the scene;

The long short-term memory network is introduced to perform spatio-temporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling;

The sub-task of identifying surgical instruments, behaviors and target tissues after spatio-temporal feature fusion is used through the fully connected layer.
The method for joint identification of surgical instruments, behaviors and target tissues according to claim 1, characterized in that the channel attention mechanism using category alignment is used to align the feature categories of the surgical instruments, behaviors and target tissue subtasks in the scene. Decoupling includes:

The multi-label cross-channel loss based on channel attention is used in a deep convolutional network to extract spatial features of surgical instruments, behaviors and target tissue subtasks in the scene.
The method for joint identification of surgical instruments, behaviors and target tissues according to claim 2, characterized in that the multi-label cross-channel loss based on channel attention is used to act on the deep convolution network to identify the surgical instruments and behaviors in the scene. And the target organization subtask to extract spatial features includes:

Use a deep residual network as the backbone module to initially extract deep features, and then use global pooling operations to obtain multi-dimensional feature vectors to build subtask branches;

The corresponding global features are divided into category-aligned feature groups based on the total number of categories for each task.
The method for joint identification of surgical instruments, behaviors and target tissues according to claim 3, characterized in that the deep residual network is used as the backbone module to initially extract deep features, and then a global pooling operation is used to obtain a multi-dimensional feature vector to construct Subtask branches include:

First, a fifty-layer deep residual network composed of four residual modules is used as the backbone module to initially extract deep features, and then a global pooling operation is used to obtain a 2048-dimensional feature vector as the output of the backbone module;

A 1×1 convolution operation is used to transform the extracted 2048-dimensional feature vector into the number of channels suitable for each task branch.
The method for joint identification of surgical instruments, behaviors and target tissues according to claim 3, wherein the dividing the corresponding global features into category-aligned feature groups based on the total number of categories of each task includes:

Laparoscopic cholecystectomy involves 15 types of target tissues. A 1×1 convolution operation is used to obtain the 2040-dimensional global feature F, which is divided into 15 groups of features:

F＝{F 0 ,F 1 ,…,F 14 };

Each group F i contains ξ channels, which are used to extract the diverse local fine-grained features corresponding to the i-th type of target tissue in the surgical scene;

The multi-label cross-channel loss consists of a discriminative module and a diversity module, which act respectively between 15 groups of features F and within each group of features F i on a single task;

For the i-th group of features F i , the discriminative module first performs the Mask operation in deep learning on the ξ channels in the group through the randomly generated 0-1 diagonal matrix M i , and then performs cross-operation on the features within the group after the Mask operation. The maximum pooling operation of the channel is used to obtain the final response of the current image to the i-th category. The specific distinguishing module is expressed as:

Where W and H represent the width and height of the feature map, F i,j,k represents the k-th element position on the j-th channel in the i-th group of features;

After the final response of the input image to each category Dis(F 0 ) to Dis(F n-1 ) is obtained, the multi-label discriminative loss function is obtained through the Softmax operation:

where yi represents the true label of the current image for the i-th category, and n represents the total number of categories of the subtask;

The diversity module performs an element-wise softmax operation within each group of features F i , and then performs a cross-channel average pooling operation on each feature map in the group:

When the average response on each graph is calculated, the diversity loss can be calculated as:

The complete multi-label cross-channel loss is obtained by the weighted sum of the diversity module and the discriminative module:

L MC (F)=λ 1 L dis +λ 2 L div ;

The corresponding weights are adjusted and set according to the needs of specific tasks.
The method for joint identification of surgical instruments, behaviors and target tissues according to claim 1, characterized in that the introduction of a long short-term memory network to align and decouple the feature categories of the sub-tasks of surgical instruments, behaviors and target tissues in the scene Spatiotemporal feature fusion of action information includes:

After the fine-grained visual feature extraction module of each task, a single-layer long short-term memory network is used to extract motion features within a period of input, and 512-dimensional spatio-temporal fusion features are obtained, and the corresponding task recognition is finally achieved through a fully connected layer.
The method for joint identification of surgical instruments, behaviors and target tissues according to claim 6, characterized in that the jump link method is used at the visual feature level to achieve cascaded effective visual feature transmission, in which the overall loss of the long short-term memory network The function is weighted by a cross-channel loss at the visual feature level and a standard cross-entropy loss that uses spatiotemporal fusion features to obtain classification results.
A device for joint recognition of surgical instruments, behaviors and target tissues, which is characterized by including:

The category-aligned fine-grained visual feature extraction module is used to decouple the feature category alignment of surgical instruments, behaviors and target tissue subtasks in the scene using the category-aligned channel attention mechanism;

The spatiotemporal feature fusion module is used to introduce the long short-term memory network to perform spatiotemporal feature fusion on the action information of surgical instruments, behaviors and target tissue subtasks in the scene after feature category alignment and decoupling;

The multi-task cascade module is used to identify surgical instruments, behaviors and target tissue subtasks after spatio-temporal feature fusion through fully connected layers.
A storage medium, characterized in that the storage medium stores program files capable of implementing the method for joint identification of surgical instruments, behaviors and target tissues in any one of claims 1 to 7.
A processor, characterized in that the processor is used to run a program, wherein when the program is run, the method for joint identification of surgical instruments, behaviors and target tissues described in any one of claims 1 to 7 is executed.