CN118213051A - System and method for predicting operation stage - Google Patents

System and method for predicting operation stage

Info

Publication number: CN118213051A
Authority
CN
China
Prior art keywords
surgical
video data
visual information
stage prediction
stage
Legal status: Pending
Application number
CN202410185469.0A
Other languages
Chinese (zh)
Inventor
杨予皓
陈阵
吴锦林
刘宏斌
Current Assignee: Artificial Intelligence And Robotics Innovation Center Hong Kong Institute Of Innovation Chinese Academy Of Sciences Ltd
Original Assignee: Artificial Intelligence And Robotics Innovation Center Hong Kong Institute Of Innovation Chinese Academy Of Sciences Ltd
Application filed by Artificial Intelligence And Robotics Innovation Center Hong Kong Institute Of Innovation Chinese Academy Of Sciences Ltd
Priority application: CN202410185469.0A

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a surgical stage prediction system and method, the system comprising: a visual model module for extracting visual information from surgical video data; a language model module for converting the visual information into a natural language description and obtaining a first surgical stage prediction result based on the natural language description; and a multimodal understanding module for obtaining a second surgical stage prediction result and surgical procedure interpretation information based on the visual information, the natural language description and the first surgical stage prediction result. The invention predicts the surgical stage accurately and interpretably based on multimodal data.

Description

System and method for predicting operation stage
Technical Field
The invention relates to the technical field of computer vision, and in particular to a surgical stage prediction system and method.
Background
During a surgical procedure, the doctor needs to adjust the operative strategy in time according to the surgical stage; a surgical assistance system is therefore required not only to provide accurate real-time data, but also to predict the next surgical stage and give advice.
While existing surgical assistance systems can provide a degree of machine vision and decision support, they have a number of shortcomings. Most current surgical assistance systems are limited to static data analysis and lack the ability to adapt to a dynamic surgical environment. Relying on limited data inputs, such as two-dimensional images and physiological signals, they struggle to achieve a comprehensive understanding and prediction of the overall procedure, resulting in low accuracy of surgical stage prediction.
Disclosure of Invention
The invention provides a surgical stage prediction system and method to overcome the low accuracy of surgical stage prediction in the prior art and to improve prediction accuracy.
The present invention provides a surgical stage prediction system comprising:
a visual model module for extracting visual information from surgical video data;
a language model module for converting the visual information into a natural language description and obtaining a first surgical stage prediction result based on the natural language description;
and a multimodal understanding module for obtaining a second surgical stage prediction result and surgical procedure interpretation information based on the visual information, the natural language description and the first surgical stage prediction result.
According to the surgical stage prediction system provided by the invention, the visual model module is specifically used for:
performing image segmentation, target recognition, labeling and scene understanding on the surgical video data to obtain the visual information in the surgical video data.
According to the surgical stage prediction system provided by the invention, the visual model module is specifically used for:
performing image segmentation on the surgical video data to obtain a surgical instrument area and a human hand area in the surgical video data;
performing target recognition and labeling on the surgical video data to obtain the types of surgical instruments, surgical steps, gestures and hand displacements in the surgical video data;
performing scene understanding on the surgical video data to obtain the surgical progress and the change of the surgical area in the surgical video data;
and taking one or more of the surgical instrument area, the human hand area, the type of surgical instrument, the surgical step, the gesture, the hand displacement, the surgical progress and the change of the surgical area as the visual information.
The surgical stage prediction system provided by the invention further comprises a risk alarm module for:
matching surgical data in the surgical video data with historical successful surgery data whose surgical stage is the second surgical stage prediction result, wherein the surgical data comprises the visual information and/or the natural language description of the surgical video data;
predicting a surgical risk of the surgical video data based on the surgical data that failed to match;
and displaying alarm information based on the surgical risk.
The surgical stage prediction system provided by the invention further comprises a voice interaction module for receiving a voice instruction of a user;
the risk alarm module is used for displaying the alarm information in the case that the voice instruction is a risk alarm instruction;
and the multimodal understanding module is used for taking the voice instruction, the visual information, the natural language description and the first surgical stage prediction result as input to obtain the second surgical stage prediction result and the surgical procedure interpretation information in the case that the voice instruction is a surgical stage prediction instruction.
The invention also provides a surgical stage prediction method, comprising:
extracting visual information from surgical video data;
converting the visual information into a natural language description, and obtaining a first surgical stage prediction result based on the natural language description;
and obtaining a second surgical stage prediction result and surgical procedure interpretation information based on the visual information, the natural language description and the first surgical stage prediction result.
According to the surgical stage prediction method provided by the invention, extracting the visual information from the surgical video data comprises:
performing image segmentation, target recognition, labeling and scene understanding on the surgical video data to obtain the visual information in the surgical video data.
According to the surgical stage prediction method provided by the invention, performing image segmentation, target recognition, labeling and scene understanding on the surgical video data to obtain the visual information in the surgical video data comprises:
performing image segmentation on the surgical video data to obtain a surgical instrument area and a human hand area in the surgical video data;
performing target recognition on the surgical video data to obtain the types of surgical instruments, surgical steps, gestures and hand displacements in the surgical video data;
performing scene understanding on the surgical video data to obtain the surgical progress and the change of the surgical area in the surgical video data;
and taking one or more of the surgical instrument area, the human hand area, the type of surgical instrument, the surgical step, the gesture, the hand displacement, the surgical progress and the change of the surgical area as the visual information.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a surgical stage prediction method as described in any of the above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of predicting a surgical stage as described in any one of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a method of predicting a surgical stage as described in any one of the above.
According to the surgical stage prediction system and method provided by the invention, the visual information output by the visual model module, the natural language description output by the language model module and the first surgical stage prediction result together form multimodal data, and the multimodal understanding module predicts the surgical stage accurately and interpretably from this multimodal data.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a surgical stage prediction system provided by the present invention;
FIG. 2 is a schematic process flow diagram of the surgical stage prediction system provided by the present invention;
FIG. 3 is a flow chart of the surgical stage prediction method provided by the present invention;
Fig. 4 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A surgical stage prediction system of the present invention is described below in conjunction with fig. 1, including a vision model module 101, a language model module 102, and a multimodal understanding module 103, wherein:
the vision model module 101 is used for extracting vision information in the operation video data;
Surgical video data is collected during a surgical procedure, and may be from a real-time surgical procedure or pre-recorded video.
Surgical video data includes streaming medical images or video from an endoscope, cameras on a robotic arm, or other surgical equipment.
The surgical video data contains rich surgical information captured by internal and external cameras, such as the doctor's actions, the use of surgical instruments, various indicators of the surgical area and the patient, the condition of tissues, and the like.
The vision model module 101 processes the surgical video data and can use cutting-edge machine learning and deep learning techniques for target recognition, segmentation and labeling, generating rich and accurate visual information for subsequent steps.
The vision model module 101, through learning of a large number of surgical videos, is able to understand and recognize various surgical instruments, surgical procedures, and gestures and movements of the surgeon. This understanding and recognition capability enables the model to conduct advanced analysis of the surgical video to generate visual information useful for subsequent steps.
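As a concrete illustration of the three-step pipeline above (segmentation, recognition and labeling, scene understanding), the sketch below shows one possible shape for the vision model module's output. It is a toy stand-in: the frame is assumed to be a pre-processed dict, and every field name is an illustrative assumption, not part of the invention; in a real system each step would be a trained segmentation, detection, or scene-understanding network.

```python
def extract_visual_info(frame):
    """Toy stand-in for the vision model module: each step would in practice
    be a trained network; `frame` is assumed to be a pre-processed dict."""
    # 1) Image segmentation: surgical instrument area and human hand area.
    segmentation = {
        "instrument_region": frame.get("instrument_mask"),
        "hand_region": frame.get("hand_mask"),
    }
    # 2) Target recognition and labeling: instrument types, surgical step,
    #    gesture and hand displacement.
    recognition = {
        "instrument_types": frame.get("instruments", []),
        "surgical_step": frame.get("step", "unknown"),
        "gesture": frame.get("gesture", "none"),
        "hand_displacement": frame.get("displacement", 0.0),
    }
    # 3) Scene understanding: surgical progress and change of the surgical area.
    scene = {
        "progress": frame.get("progress", 0.0),
        "area_change": frame.get("area_change", "stable"),
    }
    # One or more of these fields serve as the visual information.
    return {**segmentation, **recognition, **scene}
```

The merged dict is the "visual information" consumed by the downstream modules.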
The language model module 102 is configured to convert the visual information into a natural language description, and obtain a first operation stage prediction result based on the natural language description;
The language model module 102 receives the output of the visual model module 101 and converts visual information provided by the visual model module 101 into a natural language description.
After the description of the surgical procedure is generated, the language model module 102 further performs surgical stage prediction using natural language processing methods and medical expertise, based on the various inputs and its understanding and knowledge of the surgical procedure.
The language model module 102 may be a large language model (Large Language Model, LLM model) capable of generating well-interpreted, easily understood textual descriptions to assist doctors, medical students, and researchers in understanding the surgical procedure.
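A minimal sketch of the language model module's two tasks follows: rendering the visual information as a natural language description, then obtaining the first surgical stage prediction from it. The prompt wording and the keyword fallback (which merely keeps the sketch self-contained in place of a real LLM call) are assumptions for illustration, not the patent's actual model.

```python
def describe_visual_info(info):
    """Render the structured visual information as a natural language description."""
    instruments = ", ".join(info["instrument_types"]) or "no visible instruments"
    return (
        f"The surgeon is performing the '{info['surgical_step']}' step "
        f"using {instruments}; estimated progress is {info['progress']:.0%}."
    )

def predict_first_stage(description, llm=None):
    """First surgical stage prediction from the description alone.
    `llm` would be a call into a large language model; the keyword
    fallback below only keeps this sketch runnable without one."""
    prompt = f"Surgical scene: {description}\nWhat is the current surgical stage?"
    if llm is None:
        return "dissection" if "incision" in description else "unknown"
    return llm(prompt)
```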
The multi-modal understanding module 103 is configured to obtain a second surgical stage prediction result and surgical procedure interpretation information based on the visual information, the natural language description, and the first surgical stage prediction result.
The multimodal understanding module 103 is a powerful integrated platform that receives and fuses the outputs of the visual model module 101 and the language model module 102 to generate a detailed surgical understanding for the user. For example, it can generate a comprehensive report containing the visual information, the natural language description and the surgical stage prediction. The multimodal understanding module 103 can be a LLaVA (large multimodal) model.
The multimodal understanding module 103 combines visual and linguistic information using advanced algorithms to generate comprehensive surgical stage predictions. The output of this module includes not only predictions of the surgical stage, but also detailed interpretations of the surgical procedure, helping the user to understand the surgical procedure in depth.
From the collected information, the language model module 102 creates a set of surgical concept units describing the surgery in text and images; for a neurosurgical procedure, the set includes a doctor action description, a tool operation description, an anatomical structure description, a vital sign description and a surgical team description.
The multimodal understanding module 103 generates a comprehensive understanding of the neurosurgery from the continuously collected set of surgical concept units, and recognizes and interprets the surgical stages by combining this understanding with actual scene information.
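The fusion performed by the multimodal understanding module can be sketched as follows. A real system would feed all three inputs to a multimodal model such as LLaVA; here a transparent rule stands in for the model so the data flow is visible, and the 0.9 progress threshold and the stage labels are invented for illustration only.

```python
def multimodal_predict(visual_info, description, first_prediction):
    """Fuse visual information, natural language description and the first
    prediction into a second prediction plus interpretation information.
    A transparent rule stands in for a multimodal model here."""
    second = first_prediction
    if visual_info.get("progress", 0.0) > 0.9:
        # Visual evidence of near-complete progress overrides the text-only guess.
        second = "closing"
    interpretation = (
        f"Predicted stage '{second}'. Basis: {description} "
        f"First-pass (language-only) prediction: '{first_prediction}'."
    )
    return second, interpretation
```

The returned pair corresponds to the second surgical stage prediction result and the surgical procedure interpretation information.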
In this embodiment, the visual information output by the visual model module, the natural language description output by the language model module and the first surgical stage prediction result form multimodal data, from which the multimodal understanding module predicts the surgical stage accurately and interpretably.
On the basis of the above embodiment, the vision model module in this embodiment is specifically configured to:
perform image segmentation, target recognition, labeling and scene understanding on the surgical video data to obtain the visual information in the surgical video data.
Image segmentation refers to the segmentation of each frame of image in the surgical video data into multiple image regions to distinguish between background and different tissues, instruments. For example, into the region of a doctor's hand, the region of a surgical instrument, and the surgical region of a patient.
Target recognition refers to identifying target information in the surgical video data, such as recognizing surgical instruments and human tissues.
Labeling refers to adding descriptive text, such as labeling the name and function of the surgical instrument, or describing the ongoing surgical stage and problems encountered, etc.
Scene understanding refers to understanding the overall condition of a procedure in the surgical video data, such as the progress of the procedure, changes in the surgical area, and the like.
On the basis of the above embodiment, the vision model module in this embodiment is specifically configured to:
perform image segmentation on the surgical video data to obtain a surgical instrument area and a human hand area in the surgical video data;
perform target recognition and labeling on the surgical video data to obtain the types of surgical instruments, surgical steps, gestures and hand displacements in the surgical video data;
perform scene understanding on the surgical video data to obtain the surgical progress and the change of the surgical area in the surgical video data;
and take one or more of the surgical instrument area, the human hand area, the type of surgical instrument, the surgical step, the gesture, the hand displacement, the surgical progress and the change of the surgical area as the visual information.
On the basis of the above embodiments, this embodiment further includes a risk alarm module, configured to:
matching surgical data in the surgical video data with historical successful surgery data whose surgical stage is the second surgical stage prediction result, wherein the surgical data comprises the visual information and/or the natural language description of the surgical video data;
predicting a surgical risk of the surgical video data based on the surgical data that failed to match;
and displaying alarm information based on the surgical risk.
The historical successful surgery data is surgical data from historically successful surgeries, and can be extracted from the surgical video data of those surgeries.
The surgical data in the current surgical video data, such as the surgical progress and changes of the surgical area, is matched against the historical successful surgery data.
The more items of surgical data that fail to match, the greater the surgical risk; likewise, the greater the impact a failed-match item has on the surgery, the greater the risk. The surgical risk can therefore be determined from the number of failed-match items and the extent of their impact on the surgery.
The alarm information can comprise surgical risk and surgical data of failed matching for doctors to refer to.
The risk alarm module is a key safety function and generates real-time alarm information based on operation stage and scene information. The risk alarm module can predict possible risk points according to the current operation stage, real-time operation data and historical operation cases and timely give an alarm to doctors. The early warning function can help doctors avoid possible risks and improve the safety of operations.
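The matching-and-risk logic described above (risk grows with the number of failed-match items and their impact on the surgery) might be sketched as follows; the item keys, impact weights and alarm threshold are illustrative assumptions.

```python
def assess_risk(current_data, historical_success, impact_weights=None):
    """Match each item of current surgical data against the historical
    successful-surgery data for the predicted stage; the risk score sums
    the impact of every item that fails to match (default impact 1.0)."""
    impact_weights = impact_weights or {}
    failed = [key for key, value in current_data.items()
              if historical_success.get(key) != value]
    risk = sum(impact_weights.get(key, 1.0) for key in failed)
    return risk, failed

def alarm_info(risk, failed, threshold=1.0):
    """Alarm information shown to the doctor once risk reaches the threshold."""
    if risk < threshold:
        return None
    return f"Surgical risk {risk:.1f}; mismatched items: {', '.join(failed)}"
```

Returning the mismatched items alongside the score mirrors the requirement that the alarm information include the failed-match surgical data for the doctor's reference.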
On the basis of the above embodiment, this embodiment further comprises a voice interaction module for receiving a voice instruction of a user;
the risk alarm module is used for displaying the alarm information in the case that the voice instruction is a risk alarm instruction;
and the multimodal understanding module is used for taking the voice instruction, the visual information, the natural language description and the first surgical stage prediction result as input to obtain the second surgical stage prediction result and the surgical procedure interpretation information in the case that the voice instruction is a surgical stage prediction instruction.
The voice interaction module can analyze voice instructions of doctors. This module includes a powerful speech recognition engine that can accurately recognize the physician's voice instructions in a noisy operating room environment. This recognition capability allows the physician to interact with the system through voice to obtain predictions and interpretations of the surgical stage.
The voice interaction module receives the voice instruction of the doctor, and converts the voice instruction into a format which can be understood by the system through a voice recognition technology. The system will then provide corresponding information, such as predictions of the surgical stage, warnings of surgical risk, etc., according to the physician's instructions.
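A sketch of how the voice interaction module might dispatch a recognized instruction to the risk alarm module or the multimodal understanding module; the keyword matching is a stand-in for real intent recognition, and the module names are illustrative.

```python
def handle_voice_instruction(instruction, modules):
    """Dispatch a recognized voice instruction to the appropriate module.
    `modules` maps module names to callables; the keyword matching below
    is a stand-in for real intent recognition."""
    text = instruction.lower()
    if "risk" in text or "alarm" in text:
        return modules["risk_alarm"]()        # display alarm information
    if "stage" in text or "predict" in text:
        return modules["stage_prediction"]()  # second prediction + interpretation
    return "unrecognized instruction"
```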
Fig. 2 is a schematic process flow diagram of the surgical stage prediction system provided by the present invention. As shown in fig. 2, the system automatically processes inputs such as images to predict the surgical stage, gives relevant auxiliary information and warns of risk at critical moments. The doctor issues a voice instruction to the system; the voice interaction module parses the doctor's voice instruction and inputs it into the large surgical language model, which understands the doctor's intent and overlays important surgical visual information onto the doctor's field of view.
The surgical procedure can be accurately simulated by digital twin techniques, and the surgical robot can be driven to execute the doctor's surgical instructions through simulated robot control instructions.
The surgical stage prediction system provided by the invention can also be used for surgical teaching. In a surgical teaching simulation environment, the simulated surgical steps performed by students are input into the system. The system recognizes and analyzes voice commands and acquires the corresponding surgical stage information through the voice interaction module. The large multimodal surgical model provides real-time feedback and advice based on the instructions of the surgical procedure and on-site information from the simulated environment; this may include methods of adjusting the surgical procedure, or a new surgical plan. After the simulated surgery is completed, the system analyzes the entire surgical procedure, summarizes the critical stages and events, and provides a valuable surgical record. These records can be used for further surgical teaching and training.
The following describes a surgical stage prediction method provided by the present invention, and the surgical stage prediction method described below and the surgical stage prediction system described above may be referred to correspondingly to each other.
Fig. 3 is a flow chart of the method for predicting the operation stage provided by the invention. As shown in fig. 3, the method includes:
step 301, extracting visual information in surgical video data;
Step 302, converting the visual information into natural language description, and obtaining a first operation stage prediction result based on the natural language description;
And step 303, obtaining a second operation stage prediction result and operation process interpretation information based on the visual information, the natural language description and the first operation stage prediction result.
In this embodiment, the visual information output by the visual model, the natural language description output by the language model and the first surgical stage prediction result form multimodal data, from which the large multimodal model predicts the surgical stage accurately and interpretably.
On the basis of the above embodiment, the extracting visual information in the surgical video data in this embodiment includes:
performing image segmentation, target recognition, labeling and scene understanding on the surgical video data to obtain the visual information in the surgical video data.
On the basis of the above embodiment, in this embodiment, performing image segmentation, object recognition, labeling, and scene understanding on the surgical video data to obtain visual information in the surgical video data includes:
performing image segmentation on the surgical video data to obtain a surgical instrument area and a human hand area in the surgical video data;
performing target recognition on the surgical video data to obtain the types of surgical instruments, surgical steps, gestures and hand displacements in the surgical video data;
performing scene understanding on the surgical video data to obtain the surgical progress and the change of the surgical area in the surgical video data;
and taking one or more of the surgical instrument area, the human hand area, the type of surgical instrument, the surgical step, the gesture, the hand displacement, the surgical progress and the change of the surgical area as the visual information.
Fig. 4 illustrates a physical schematic diagram of an electronic device, as shown in fig. 4, which may include: processor 410, communication interface (Communications Interface) 420, memory 430, and communication bus 440, wherein processor 410, communication interface 420, and memory 430 communicate with each other via communication bus 440. Processor 410 may invoke logic instructions in memory 430 to perform a surgical stage prediction method comprising: extracting visual information in the surgical video data; converting the visual information into natural language description, and obtaining a first operation stage prediction result based on the natural language description; and obtaining a second surgical stage prediction result and surgical procedure interpretation information based on the visual information, the natural language description and the first surgical stage prediction result.
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the surgical stage prediction method provided by the methods described above, the method comprising: extracting visual information in the surgical video data; converting the visual information into natural language description, and obtaining a first operation stage prediction result based on the natural language description; and obtaining a second surgical stage prediction result and surgical procedure interpretation information based on the visual information, the natural language description and the first surgical stage prediction result.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the method of surgical stage prediction provided by the methods described above, the method comprising: extracting visual information in the surgical video data; converting the visual information into natural language description, and obtaining a first operation stage prediction result based on the natural language description; and obtaining a second surgical stage prediction result and surgical procedure interpretation information based on the visual information, the natural language description and the first surgical stage prediction result.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A surgical stage prediction system, comprising:
a visual model module for extracting visual information from surgical video data;
a language model module for converting the visual information into a natural language description and obtaining a first surgical stage prediction result based on the natural language description;
and a multimodal understanding module for obtaining a second surgical stage prediction result and surgical procedure interpretation information based on the visual information, the natural language description and the first surgical stage prediction result.
2. The surgical stage prediction system according to claim 1, wherein the vision model module is specifically configured for:
performing image segmentation, target recognition, labeling, and scene understanding on the surgical video data to obtain the visual information in the surgical video data.
3. The surgical stage prediction system according to claim 2, wherein the vision model module is specifically configured for:
performing image segmentation on the surgical video data to obtain a surgical instrument area and a human hand area in the surgical video data;
performing target recognition and labeling on the surgical video data to obtain the types of surgical instruments, the surgical steps, the gestures, and the hand displacements in the surgical video data;
performing scene understanding on the surgical video data to obtain the surgical progress and the change of the surgical area in the surgical video data; and
taking one or more of the surgical instrument area, the human hand area, the types of the surgical instruments, the surgical steps, the gestures, the hand displacements, the surgical progress, and the change of the surgical area as the visual information.
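The extraction pipeline of claim 3 can be sketched as follows. This is an illustrative sketch only: `VisualInfo`, `extract_visual_info`, and the `segmenter`, `detector`, and `scene_model` callables (and the dictionary keys they return) are hypothetical stand-ins, since the claim does not fix any particular model architecture or interface.

```python
from dataclasses import dataclass, field

@dataclass
class VisualInfo:
    # Fields mirror the items enumerated in claim 3.
    instrument_regions: list = field(default_factory=list)  # segmented surgical instrument areas
    hand_regions: list = field(default_factory=list)        # segmented human hand areas
    instrument_types: list = field(default_factory=list)    # recognized instrument types
    surgical_step: str = ""                                 # current surgical step label
    gestures: list = field(default_factory=list)            # recognized hand gestures
    hand_displacement: float = 0.0                          # measured hand displacement
    progress: float = 0.0                                   # surgical progress in [0.0, 1.0]
    area_change: str = ""                                   # change of the surgical area

def extract_visual_info(frame, segmenter, detector, scene_model) -> VisualInfo:
    """Run segmentation, target recognition, and scene understanding on one frame.

    `segmenter`, `detector`, and `scene_model` are hypothetical callables; any
    concrete models with compatible outputs could fill these roles.
    """
    masks = segmenter(frame)    # -> {"instrument": [...], "hand": [...]}
    targets = detector(frame)   # -> {"types": [...], "step": str, "gestures": [...], "displacement": float}
    scene = scene_model(frame)  # -> {"progress": float, "area_change": str}
    return VisualInfo(
        instrument_regions=masks["instrument"],
        hand_regions=masks["hand"],
        instrument_types=targets["types"],
        surgical_step=targets["step"],
        gestures=targets["gestures"],
        hand_displacement=targets["displacement"],
        progress=scene["progress"],
        area_change=scene["area_change"],
    )
```

The resulting `VisualInfo` record bundles all eight items so that any subset of them can be passed downstream as "the visual information" in the sense of the claim.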
4. The surgical stage prediction system according to any one of claims 1 to 3, further comprising a risk alarm module configured for:
matching surgical data in the surgical video data against historical data of successful surgeries whose surgical stage corresponds to the second surgical stage prediction result, wherein the surgical data comprises the visual information and/or the natural language description of the surgical video data;
predicting a surgical risk of the surgical video data based on the surgical data that fails to match; and
displaying alarm information based on the surgical risk.
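The match-and-alarm logic of claim 4 can be sketched as a simple comparison against historical successful-surgery data for the predicted stage. The function names, the flat key/value representation of "surgical data", and the alarm text are illustrative assumptions, not part of the claimed system:

```python
def predict_risk(current_data: dict, historical_success: dict) -> list:
    """Compare current surgical data with historical successful-surgery data
    for the same (second-)predicted stage; items that fail to match are
    collected as risk indicators.
    """
    mismatches = []
    for key, expected in historical_success.items():
        observed = current_data.get(key)
        if observed != expected:
            mismatches.append({"item": key, "expected": expected, "observed": observed})
    return mismatches

def format_alarm(mismatches: list) -> str:
    """Render the predicted surgical risk as displayable alarm information."""
    if not mismatches:
        return "No surgical risk detected."
    return "\n".join(
        f"RISK: {m['item']} deviates (expected {m['expected']}, observed {m['observed']})"
        for m in mismatches
    )
```

In practice the comparison would likely be a similarity measure over features rather than exact equality; strict equality is used here only to keep the sketch self-contained.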
5. The surgical stage prediction system according to claim 4, further comprising a voice interaction module configured to receive a voice instruction from a user;
wherein the risk alarm module is configured to display the alarm information when the voice instruction is a risk alarm instruction; and
the multimodal understanding module is configured to take the voice instruction, the visual information, the natural language description, and the first surgical stage prediction result as input to obtain the second surgical stage prediction result and the surgical procedure interpretation information when the voice instruction is a surgical stage prediction instruction.
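The two-way dispatch in claim 5 amounts to routing a recognized voice instruction either to the risk alarm module or to the multimodal understanding module. The command strings, context keys, and module call signatures below are hypothetical; the claim only requires distinguishing the two instruction types:

```python
def handle_voice_command(command: str, risk_module, understanding_module, context: dict):
    """Dispatch a recognized voice instruction to the matching module.

    `risk_module` and `understanding_module` are hypothetical callables
    standing in for the modules of claims 4 and 1 respectively.
    """
    if command == "risk_alarm":
        # Risk alarm instruction: display the alarm information.
        return risk_module(context)
    if command == "stage_prediction":
        # Stage prediction instruction: per claim 5, feed the voice
        # instruction together with the visual information, natural-language
        # description, and first-stage prediction into the multimodal
        # understanding module.
        return understanding_module(
            command,
            context["visual_info"],
            context["description"],
            context["first_stage_prediction"],
        )
    raise ValueError(f"Unrecognized voice command: {command}")
```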
6. A surgical stage prediction method, comprising:
extracting visual information from surgical video data;
converting the visual information into a natural language description, and obtaining a first surgical stage prediction result based on the natural language description; and
obtaining a second surgical stage prediction result and surgical procedure interpretation information based on the visual information, the natural language description, and the first surgical stage prediction result.
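The three steps of claim 6 can be sketched end to end as follows. All three models are hypothetical callables standing in for the vision, language, and multimodal understanding modules; their interfaces are assumptions made only to keep the sketch runnable:

```python
def predict_surgical_stage(video_frame, vision_model, language_model, multimodal_model):
    """End-to-end sketch of the three-step surgical stage prediction method.

    Returns the second (refined) stage prediction together with the
    surgical procedure interpretation information.
    """
    # Step 1: extract visual information from the surgical video data.
    visual_info = vision_model(video_frame)
    # Step 2: convert the visual information into a natural-language
    # description and obtain a first stage prediction from it.
    description = language_model.describe(visual_info)
    first_prediction = language_model.predict_stage(description)
    # Step 3: fuse all three signals into a second stage prediction
    # plus an interpretation of the surgical procedure.
    second_prediction, explanation = multimodal_model(
        visual_info, description, first_prediction
    )
    return second_prediction, explanation
```

Keeping the first prediction as an explicit input to the multimodal step, rather than discarding it, is what lets the second prediction both refine and explain the first.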
7. The surgical stage prediction method according to claim 6, wherein the extracting visual information from the surgical video data comprises:
performing image segmentation, target recognition, labeling, and scene understanding on the surgical video data to obtain the visual information in the surgical video data.
8. The surgical stage prediction method according to claim 7, wherein the performing image segmentation, target recognition, labeling, and scene understanding on the surgical video data to obtain the visual information in the surgical video data comprises:
performing image segmentation on the surgical video data to obtain a surgical instrument area and a human hand area in the surgical video data;
performing target recognition on the surgical video data to obtain the types of surgical instruments, the surgical steps, the gestures, and the hand displacements in the surgical video data;
performing scene understanding on the surgical video data to obtain the surgical progress and the change of the surgical area in the surgical video data; and
taking one or more of the surgical instrument area, the human hand area, the types of the surgical instruments, the surgical steps, the gestures, the hand displacements, the surgical progress, and the change of the surgical area as the visual information.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the surgical stage prediction method according to any one of claims 6 to 8.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the surgical stage prediction method according to any one of claims 6 to 8.
CN202410185469.0A 2024-02-19 2024-02-19 System and method for predicting operation stage Pending CN118213051A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410185469.0A CN118213051A (en) 2024-02-19 2024-02-19 System and method for predicting operation stage


Publications (1)

Publication Number Publication Date
CN118213051A (en) 2024-06-18

Family

ID=91454746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410185469.0A Pending CN118213051A (en) 2024-02-19 2024-02-19 System and method for predicting operation stage

Country Status (1)

Country Link
CN (1) CN118213051A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination