CN117218074A - Training method, soft tissue deformation estimation method, device, equipment and storage medium - Google Patents


Info

Publication number
CN117218074A
Authority
CN
China
Prior art keywords
deformation
soft tissue
neural network
images
estimation
Prior art date
Legal status
Pending
Application number
CN202311141090.1A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee
Shenzhen Konuositeng Technology Co ltd
Original Assignee
Shenzhen Konuositeng Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Konuositeng Technology Co ltd filed Critical Shenzhen Konuositeng Technology Co ltd
Priority to CN202311141090.1A
Publication of CN117218074A

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a training method, a soft tissue deformation estimation method, a device, equipment and a storage medium. The training method comprises the following steps: acquiring a training sample set, wherein different samples in the training sample set comprise point clouds at different moments and velocity vectors of the three-dimensional points in the point clouds, any sample being obtained based on depth estimation and optical flow estimation of four soft tissue images acquired from two different directions at adjacent moments; and performing self-supervised training on a preset neural network model using the samples in the training sample set to obtain a soft tissue deformation estimation model. In these embodiments, the soft tissue deformation estimation model aggregates semantic information and motion flow information, improving the accuracy of soft tissue deformation estimation.

Description

Training method, soft tissue deformation estimation method, device, equipment and storage medium
Technical Field
The present application relates to the field of image processing, and in particular, to a training method for a soft tissue deformation estimation model, a soft tissue deformation estimation method, a device, an electronic apparatus, and a computer readable storage medium.
Background
Soft tissue deformation is ubiquitous during medical procedures, and its estimation is the basis for various downstream applications, including but not limited to intelligent perception and surgical decision assistance (by estimating deformation, properties of the soft tissue such as stiffness can be predicted and presented to the doctor in an image, supporting intelligent surgical decisions) and robotic surgery automation (estimated deformation enables prediction and tracking of key points on the tissue, which can serve as feedback for visual servo control to automate soft tissue manipulation). Despite extensive research on this task, soft tissue deformation estimation schemes in the related art still fall short in accuracy.
Disclosure of Invention
In view of the above, the present application provides a training method of a soft tissue deformation estimation model, a soft tissue deformation estimation method, a device, an electronic apparatus, and a computer readable storage medium.
Specifically, the application is realized by the following technical scheme:
in a first aspect, an embodiment of the present application provides a training method for a soft tissue deformation estimation model, including:
acquiring a training sample set, wherein different samples in the training sample set comprise point clouds at different moments and velocity vectors of the three-dimensional points in the point clouds; any sample is obtained based on depth estimation and optical flow estimation of four soft tissue images acquired from two different directions at adjacent moments;
performing self-supervised training on a preset neural network model using the samples in the training sample set to obtain a soft tissue deformation estimation model;
wherein the self-supervised training indicates that the following iterative process is repeated until an iteration end condition is satisfied:
in each iteration, acquiring a velocity field output by the neural network model based on samples in the training sample set, and integrating the velocity field to obtain a first deformation field and a second deformation field that satisfy a diffeomorphic mapping relation;
warping the two soft tissue images acquired at the earlier of the adjacent moments with the first deformation field to obtain two first deformed images, and warping the two first deformed images with the second deformation field to obtain two second deformed images;
and, when the iteration end condition is not satisfied, adjusting parameters of the neural network model according to the four soft tissue images, the two first deformed images and the two second deformed images to obtain the neural network model for the next iteration.
According to a second aspect of one or more embodiments of the present specification, there is provided a soft tissue deformation estimation method, comprising:
acquiring four soft tissue images acquired from two different directions at adjacent moments;
performing depth estimation and optical flow estimation on the four soft tissue images to obtain a point cloud at the earlier of the adjacent moments and velocity vectors of the three-dimensional points in the point cloud;
inputting the point cloud and the velocity vectors of the three-dimensional points in the point cloud into the soft tissue deformation estimation model for processing to obtain a velocity field; wherein the soft tissue deformation estimation model is trained based on the method of any one of the first aspects;
and integrating the velocity field to obtain soft tissue deformation information from the integration result.
According to a third aspect of embodiments of the present disclosure, there is provided a training apparatus of a soft tissue deformation estimation model, including:
a training sample acquisition module, configured to acquire a training sample set, wherein different samples in the training sample set comprise point clouds at different moments and velocity vectors of the three-dimensional points in the point clouds; any sample is obtained based on depth estimation and optical flow estimation of four soft tissue images acquired from two different directions at adjacent moments;
a self-supervised training module, configured to perform self-supervised training on a preset neural network model using the samples in the training sample set to obtain a soft tissue deformation estimation model;
the self-supervised training module is specifically configured to repeat the following iterative process until an iteration end condition is satisfied: in each iteration, acquiring a velocity field output by the neural network model based on samples in the training sample set, and integrating the velocity field to obtain a first deformation field and a second deformation field that satisfy a diffeomorphic mapping relation; warping the two soft tissue images acquired at the earlier of the adjacent moments with the first deformation field to obtain two first deformed images, and warping the two first deformed images with the second deformation field to obtain two second deformed images; and, when the iteration end condition is not satisfied, adjusting parameters of the neural network model according to the four soft tissue images, the two first deformed images and the two second deformed images to obtain the neural network model for the next iteration.
According to a fourth aspect of embodiments of the present disclosure, there is provided a soft tissue deformation estimating apparatus, comprising:
an image acquisition module, configured to acquire four soft tissue images acquired from two different directions at adjacent moments;
an image processing module, configured to perform depth estimation and optical flow estimation on the four soft tissue images to obtain a point cloud at the earlier of the adjacent moments and velocity vectors of the three-dimensional points in the point cloud;
a deformation estimation module, configured to input the point cloud and the velocity vectors of the three-dimensional points in the point cloud into the soft tissue deformation estimation model for processing to obtain a velocity field; wherein the soft tissue deformation estimation model is trained based on the method of any one of the first aspects;
and a soft tissue deformation information acquisition module, configured to integrate the velocity field to obtain soft tissue deformation information from the integration result.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor, when executing the executable instructions, is configured to implement the method of the first aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of any of the methods described above.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
In the embodiments of the disclosure, soft tissue deformation estimation can be realized by means of a soft tissue deformation estimation model. The model is obtained by self-supervised training based on point clouds at different moments and the velocity vectors of the three-dimensional points in those point clouds; the point cloud at each moment and the velocity vectors of its three-dimensional points are obtained by depth estimation and optical flow estimation of four soft tissue images acquired from two different directions at adjacent moments. The point cloud represents the semantic information of the soft tissue, and the velocity vectors of the three-dimensional points represent its motion flow information. The soft tissue deformation estimation model can therefore aggregate the semantic information and motion flow information of the soft tissue and perceive the law of soft tissue deformation in three-dimensional space. The obtained first and second deformation fields satisfy a diffeomorphic mapping relation, which ensures that the estimated deformation fields conform to the physical laws of soft tissue deformation, improves the accuracy of the estimated deformation fields, and realizes accurate estimation of soft tissue deformation in actual three-dimensional space.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Fig. 1 is a schematic view of a robotic surgical system according to an exemplary embodiment of the application.
Fig. 2 is a schematic view of a patient-side robot according to an exemplary embodiment of the present application.
FIG. 3 is a flow chart illustrating a training method of a soft tissue deformation estimation model according to an exemplary embodiment of the present application.
Fig. 4 is a schematic diagram of acquiring a point cloud and velocity vectors of three-dimensional points in the point cloud according to an exemplary embodiment of the present application.
FIG. 5 is a schematic diagram of a soft tissue deformation estimation model structure and of first and second deformation fields satisfying a diffeomorphic mapping relation according to an exemplary embodiment of the present application.
Fig. 6 is a flow chart illustrating a soft tissue deformation estimation method according to an exemplary embodiment of the present application.
Fig. 7 is a schematic structural view of an electronic device according to an exemplary embodiment of the present application.
Fig. 8 is a schematic structural diagram of a training device for a soft tissue deformation estimation model according to an exemplary embodiment of the present application.
Fig. 9 is a schematic structural view of a soft tissue deformation estimating apparatus according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
Soft tissue refers to the various tissues other than bone, including muscle, fat, skin, blood vessels, nerves, viscera, etc. These soft tissues play an important role in support, protection and function in the body. Soft tissue deformation is ubiquitous during medical procedures, and its estimation is the basis of, or a prerequisite for, many downstream tasks and applications, including intra-operative navigation, tool-tissue interaction analysis, surgical assessment, anatomical mechanics estimation, and soft tissue manipulation automation.
For example, intelligent perception and surgical decision assistance: by estimating the deformation of the soft tissue, properties such as its predicted stiffness can be further identified, and the related information can be overlaid on the image presented to the doctor, providing intelligent decisions for the doctor's surgery.
For another example, robotic surgery automation: by estimating the deformation of the soft tissue, prediction and tracking of key points on the tissue are realized, and this information can be used as feedback for visual servo control to realize automated manipulation of the soft tissue.
The difficulties in estimating soft tissue deformation are: (1) the environment of soft tissue manipulation is highly complex and dynamic; (2) occlusion by surgical instruments leaves the soft tissue only partially visible during deformation; (3) tissue movement during surgery, illumination changes, smoke, blood, and the like.
One implementation in the related art estimates soft tissue deformation by tracking and matching pixel-level dense feature points, but this approach has many drawbacks in the surgical setting. First, the background and texture in a surgical scene are typically monotonous, and a monotonous textured background makes point-to-point matching inaccurate. Second, feature-point matching is based on two-dimensional images and lacks perception of three-dimensional structure: some matched feature points that appear reasonable in two-dimensional space may not be reasonable when projected into three-dimensional space. Both issues result in inaccurate estimation of soft tissue deformation.
Based on the above, the embodiments of the application provide a soft tissue deformation estimation method that realizes soft tissue deformation estimation by means of a soft tissue deformation estimation model. The model is obtained by self-supervised training based on point clouds at different moments and the velocity vectors of the three-dimensional points in those point clouds; the point cloud at each moment and the velocity vectors of its three-dimensional points are obtained by depth estimation and optical flow estimation of four soft tissue images acquired from two different directions at adjacent moments. The point cloud represents the semantic information of the soft tissue, and the velocity vectors of the three-dimensional points represent its motion flow information. The soft tissue deformation estimation model can therefore aggregate the semantic information and motion flow information of the soft tissue and perceive the law of soft tissue deformation in three-dimensional space. The obtained first and second deformation fields satisfy a diffeomorphic mapping relation, which ensures that the estimated deformation fields conform to the physical laws of soft tissue deformation, improves the accuracy of the estimated deformation fields, and realizes accurate estimation of soft tissue deformation in actual three-dimensional space.
By way of example, the soft tissue deformation estimation method and the training method of the soft tissue deformation estimation model may be performed by electronic devices including, but not limited to, servers, cloud servers, smartphones, tablet computers, personal digital assistants (PDAs), laptop computers, desktop computers, media content players, video game stations/systems, virtual reality systems, augmented reality systems, wearable devices (e.g., watches, eyeglasses, gloves, or head-worn devices such as virtual reality headsets, augmented reality headsets, head mounted devices (HMDs), and headbands), or any other type of device. It will be appreciated that the soft tissue deformation estimation method and the training method of the soft tissue deformation estimation model may be performed by different electronic devices or by the same electronic device, which is not limited in this embodiment.
In an exemplary application scenario, minimally invasive surgery based on a surgical robot has the advantages of accurate positioning, stable operation, small surgical trauma, quick postoperative recovery, and the like. Computer-aided automation based on surgical robots is an important part of intelligent medical treatment: it can reduce the rate of misjudgment by doctors during surgery and relieve heavy, repetitive surgical tasks. Among these, soft tissue deformation estimation plays an important role in minimally invasive surgery based on surgical robots. Through deformation estimation of soft tissue, the surgical robot can predict and track key points on the tissue, and this information can be used as feedback for visual servo control to realize automated manipulation of the soft tissue.
For example, referring to fig. 1, a schematic diagram of a robotic surgical system 100 is shown. In operation, a patient is positioned in front of a patient-side robot (Patient Side Robot, PSR) 101. The patient-side robot 101 includes one or more robotic arms 101a, the distal end of each robotic arm 101a being configured to hold one or more surgical instruments 001, and the surgeon may control the robotic arms 101a via a Surgeon Console (SGC) 102 to control the surgical instruments 001 to perform a surgical procedure on the patient. A robotic arm 101a may also hold an image acquisition device (e.g., an endoscope, not shown); the surgeon may control the arm holding the endoscope via the surgeon console 102 to move and hold the endoscope near the lesion area of the patient, acquiring a surgical view that includes the patient's lesion, its surrounding tissue, and the surgical instrument 001, i.e., the soft tissue image mentioned in the embodiments of the present application.
During surgery, the surgical instrument 001 and/or endoscope on the robotic arm 101a is inserted into the patient through a pre-set aperture in the patient and is rotatable about a pre-set aperture center point, commonly referred to as a remote center of motion point (Remote Center of Motion point, RCM). The images acquired by the endoscope will be transmitted to a Vision Cart (VCT) 103 for image processing and recording, and the processed images will be displayed on respective display devices of the Vision Cart 103 and doctor console 102 for viewing by doctors and other surgical staff.
For example, referring to fig. 2, a schematic diagram of the patient-side robot 101 is shown. The patient-side robot 101 includes a chassis 101b, a push handle 101c, and at least one robotic arm 101a (only one robotic arm 101a is shown in the figure for convenience of illustration), and each robotic arm 101a includes an adjustment arm 101a-1 and an operation arm 101a-2. The operation arm 101a-2 may include an image acquisition device, through which soft tissue images near the lesion area of the patient can be acquired, so that the training of the soft tissue deformation estimation model and the subsequent soft tissue deformation estimation process can be performed based on the acquired soft tissue images.
In some embodiments, a training process of the soft tissue deformation estimation model is illustrated:
the training process of the soft tissue deformation estimation model may be: firstly, a model is shown through a modeling table, then, the model is evaluated through construction of an evaluation function, and finally, the evaluation function is optimized according to sample data and an optimization method, and the model is adjusted to be optimal.
1. Modeling (Modeling): in the training process, a model architecture suitable for the task is first selected. The model may be a neural network, a decision tree, a support vector machine, and so on. The goal of modeling is to select a model that can fit the data efficiently and with the appropriate complexity.
2. Evaluation function (Evaluation Function): during the training process, an evaluation function needs to be defined to measure the difference or error between the model prediction result and the real label. This evaluation Function, commonly referred to as Loss Function (Loss Function), is capable of measuring the performance of the model.
3. Optimization method (Optimization Method): during training, an optimization method needs to be selected to minimize the loss function. Common optimization methods include stochastic gradient descent (Stochastic Gradient Descent, SGD), Adam, RMSprop, etc. These methods improve the predictive power of the model by adjusting the model parameters such that the loss function is gradually reduced.
4. Sample Data (Training Data): model training requires the use of large amounts of sample data. These data are typically divided into Training sets (Training sets), validation sets (Validation sets), and Test sets (Test sets). The training set is used for parameter updating, the verification set is used for adjusting the model super parameters, and the test set is used for evaluating the generalization capability of the model on unseen data.
5. Training iteration (Training Iterations): during model training, the model parameters are adjusted step by iterative optimization. Each training iteration includes the steps of forward propagation (Forward Propagation), calculation of loss function values, backward propagation (Backward Propagation), and parameter updating. The number of training iterations is typically adjusted dynamically by setting a fixed number of iterations or according to the performance of the model on the validation set.
Through the above steps, the model can be trained on sample data, and after repeated iterative optimization an optimal model that predicts accurately and generalizes well is finally obtained. Note that during model training, techniques such as overfitting control, learning rate adjustment and regularization are also needed to improve the training effect and generalization ability of the model.
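The five generic steps above can be sketched as a minimal training loop. This is illustrative only: a one-parameter least-squares model stands in for the neural network, and all names (`train`, `learning_rate`, `num_iterations`) are assumptions, not from the patent.

```python
# Minimal illustration of the modeling / evaluation-function / optimization /
# sample-data / iteration steps, with a one-parameter linear model y = w * x.

def train(xs, ys, learning_rate=0.05, num_iterations=200):
    w = 0.0                                   # modeling: single parameter
    loss = float("inf")
    for _ in range(num_iterations):           # training iterations
        preds = [w * x for x in xs]           # forward propagation
        # evaluation function: mean squared error (the loss function)
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        # backward propagation: analytic gradient dLoss/dw of the MSE
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        w -= learning_rate * grad             # parameter update (SGD-style)
    return w, loss

# sample data generated by y = 2x; training should recover w ≈ 2
w, final_loss = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In a real setting the validation set would drive hyperparameter choices such as `learning_rate`, and a held-out test set would measure generalization, as described in step 4 above.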
In some embodiments, referring to fig. 3, fig. 3 shows a flowchart of a training method of a soft tissue deformation estimation model. The method may be performed by an electronic device, the method comprising:
in S101, a training sample set is obtained, where different samples in the training sample set include point clouds at different moments and velocity vectors of three-dimensional points in the point clouds; wherein, any sample is based on the depth estimation and the optical flow estimation of four soft tissue images acquired from two different directions at adjacent moments.
The point cloud in any sample is obtained by performing depth estimation on four soft tissue images acquired from two different directions at adjacent moments, and the velocity vector of the three-dimensional point in the point cloud in any sample is obtained by performing optical flow estimation on four soft tissue images acquired from two different directions at adjacent moments.
Illustratively, the four soft tissue images corresponding to any one sample include left and right views acquired at a previous time instant in the adjacent time instant, and left and right views acquired at a subsequent time instant in the adjacent time instant. Of course, other orientations than the two different orientations of the left and right views are possible, and the present embodiment is not limited in this regard.
In S102, self-supervised training is performed on a preset neural network model using the samples in the training sample set to obtain a soft tissue deformation estimation model.
The self-supervised training indicates that the following iterative process is repeated until an iteration end condition is satisfied: in each iteration, a velocity field output by the neural network model based on samples in the training sample set is acquired, and the velocity field is integrated to obtain a first deformation field and a second deformation field that satisfy a diffeomorphic mapping relation; the two soft tissue images acquired at the earlier of the adjacent moments are warped with the first deformation field to obtain two first deformed images, and the two first deformed images are warped with the second deformation field to obtain two second deformed images; and, when the iteration end condition is not satisfied, the parameters of the neural network model are adjusted according to the four soft tissue images, the two first deformed images and the two second deformed images to obtain the neural network model for the next iteration.
In this embodiment, each sample in the training sample set includes a point cloud representing soft tissue semantic information and velocity vectors representing soft tissue motion flow information. Self-supervised training of a preset neural network model on these samples yields a soft tissue deformation estimation model that aggregates the semantic and motion flow information and estimates a stable velocity field for the soft tissue. Integrating the velocity field yields a first deformation field and a second deformation field that satisfy a diffeomorphic mapping relation, ensuring that the estimated deformation fields conform to the physical laws of soft tissue deformation. This helps improve the accuracy of the estimated deformation fields, so that the trained soft tissue deformation estimation model can accurately estimate the deformation of soft tissue in actual three-dimensional space.
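The patent does not spell out how the velocity field is integrated. A common way to obtain a deformation field that is guaranteed diffeomorphic from a stationary velocity field is scaling and squaring; the sketch below is a hedged one-dimensional illustration of that idea (function and variable names are assumptions, not from the patent).

```python
import numpy as np

def integrate_velocity(v, num_steps=6):
    """Integrate a stationary 1-D velocity field into a displacement field
    by scaling and squaring: scale the field down by 2**num_steps, then
    compose the resulting small deformation with itself num_steps times.
    Each small step is near-identity, so the composed map stays invertible
    (diffeomorphic). `v[i]` is the velocity at grid point i."""
    n = len(v)
    grid = np.arange(n, dtype=float)
    disp = np.asarray(v, dtype=float) / (2 ** num_steps)   # scaling step
    for _ in range(num_steps):                # squaring: phi <- phi o phi
        disp = disp + np.interp(grid + disp, grid, disp)
    return disp

# A zero velocity field must integrate to the identity map (zero displacement).
identity_disp = integrate_velocity(np.zeros(8))
```

In the same spirit, the first and second deformation fields of the embodiment could be two such integrals of the predicted velocity field (e.g., over two successive time intervals), each diffeomorphic by construction.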
1. With respect to training samples.
In some embodiments, considering that the surgical instrument may occlude the soft tissue in the acquired soft tissue images, a pre-trained instrument segmentation network may be used to segment the surgical instrument out of the soft tissue images acquired by the image acquisition device in figs. 1 and 2, yielding soft tissue images with the surgical instrument removed and thereby improving the accuracy of soft tissue deformation estimation.
For example, the soft tissue image acquired by the image acquisition device and a mask image of the surgical instrument may be input into a pre-trained instrument segmentation network, the surgical instrument including at least two parts movable relative to each other, the mask image being rendered based on motion information of the surgical instrument acquired in synchronization with the soft tissue image. For example, the robotic arm 101a-2 in fig. 2 may include one or more sensors, such as a displacement meter, an orientation sensor, and/or a position sensor. The detection values of the sensors can be used to obtain movement information of the robot arm 101a and the surgical instrument 001 held by the robot arm 101a, for example, pose information of each part in the surgical instrument 001.
The instrument segmentation network can respectively extract features of the soft tissue image and the mask image to obtain a first feature and a second feature; then, according to the connection relation among all parts in the surgical instrument, the first characteristic and the second characteristic are respectively processed to obtain a first directed graph and a second directed graph; each of the first directed graph and the second directed graph comprises nodes corresponding to each part of the surgical instrument; and finally, according to the first characteristic, the second characteristic, the first directed graph and the second directed graph, the end part of the surgical instrument is segmented, and the end part segmentation result of the surgical instrument is obtained. The electronic equipment can process the soft tissue image acquired by the image acquisition device according to the end part segmentation result of the surgical instrument, so that the soft tissue image from which the surgical instrument is removed is obtained.
In some embodiments, the point cloud contained in any sample is described here: after the four soft tissue images acquired from two different orientations at adjacent times are obtained, depth estimation is performed on the two soft tissue images acquired from the two different orientations at the previous of the adjacent times to obtain first depth information; then, using the first depth information and the intrinsic and extrinsic parameters of the image acquisition device, the pixels in the soft tissue image acquired from one orientation at the previous time are mapped from two-dimensional space to three-dimensional space, obtaining the point cloud at the previous of the adjacent times. This embodiment acquires semantic information of three-dimensional space from two-dimensional soft tissue images: the point cloud indicates where in three-dimensional space the soft tissue deformation appears, so that the deformation of the soft tissue can subsequently be estimated accurately in three-dimensional space (i.e., real-world space) rather than in two-dimensional space.
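The back-projection step can be sketched as follows, assuming a simple pinhole model with hypothetical intrinsics K and ignoring the extrinsic transform (i.e., working in camera coordinates):

```python
import numpy as np

def backproject_to_point_cloud(depth, K):
    """Map each pixel (u, v) with depth d to a 3-D point using the
    pinhole model: X = d * K^-1 [u, v, 1]^T (camera coordinates)."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3)
    rays = pixels @ np.linalg.inv(K).T          # normalized camera rays
    return rays * depth.reshape(-1, 1)          # (H*W, 3) point cloud

# Hypothetical intrinsics: fx = fy = 100, principal point at (2, 2).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
depth = np.full((5, 5), 2.0)                    # flat surface 2 m away
cloud = backproject_to_point_cloud(depth, K)
```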
In some embodiments, the velocity vectors of the three-dimensional points in the point cloud contained in any sample are described here: after the four soft tissue images acquired from two different orientations at adjacent times are obtained, depth estimation is performed on the two soft tissue images acquired from the two different orientations at the previous of the adjacent times to obtain first depth information, and on the two soft tissue images acquired at the subsequent of the adjacent times to obtain second depth information; then, optical flow estimation is performed with an optical flow estimation network on the two soft tissue images acquired from the same orientation at the adjacent times to obtain per-pixel velocity vectors; finally, the per-pixel velocity vectors are projection-mapped using the first depth information and the second depth information to obtain the velocity vectors of the three-dimensional points in the point cloud. This embodiment obtains velocity vectors in three-dimensional space, which indicate the direction and distance of the soft tissue's displacement in three-dimensional space, so that the deformation of the soft tissue can subsequently be estimated accurately in three-dimensional space (i.e., real-world space) rather than in two-dimensional space.
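The lifting of per-pixel optical flow to 3-D velocity vectors can be sketched as follows (NumPy, hypothetical intrinsics, and a nearest-neighbour lookup of the target depth in place of a real resampling scheme):

```python
import numpy as np

def lift_flow_to_3d(flow, depth_t, depth_t1, K):
    """For each pixel p at time t, its 2-D flow gives the matching pixel
    p' at t+1; back-projecting p with depth_t and p' with depth_{t+1}
    yields two 3-D points whose difference is the 3-D velocity vector."""
    K_inv = np.linalg.inv(K)
    h, w = depth_t.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    # 3-D position at time t
    pix_t = np.stack([us, vs, np.ones_like(us)], -1).astype(float)
    pts_t = (pix_t @ K_inv.T) * depth_t[..., None]
    # matched pixel at t+1 (nearest-neighbour lookup of the target depth)
    u1 = np.clip(np.round(us + flow[..., 0]).astype(int), 0, w - 1)
    v1 = np.clip(np.round(vs + flow[..., 1]).astype(int), 0, h - 1)
    pix_t1 = np.stack([us + flow[..., 0], vs + flow[..., 1],
                       np.ones_like(us)], -1)
    pts_t1 = (pix_t1 @ K_inv.T) * depth_t1[v1, u1][..., None]
    return pts_t1 - pts_t          # (H, W, 3) per-pixel 3-D velocity

K = np.array([[50.0, 0.0, 2.0], [0.0, 50.0, 2.0], [0.0, 0.0, 1.0]])
flow = np.zeros((5, 5, 2))         # no image-plane motion...
v = lift_flow_to_3d(flow, np.full((5, 5), 1.0), np.full((5, 5), 1.5), K)
# ...so velocity is purely along each camera ray (depth 1.0 -> 1.5)
```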
For example, a depth estimation model may be pre-trained and used to perform the depth estimation. For instance, an STTR (STereo TRansformer) depth estimation model may be trained, which predicts depth information for each pixel of a soft tissue image based on the Transformer architecture. By introducing self-attention and cross-attention mechanisms, the STTR model can automatically learn features and relationships from stereo image pairs and generate pixel-level depth estimates. The STTR model includes an encoder and a decoder: the encoder extracts feature representations of the input stereo images, and the decoder uses these features to predict depth pixel by pixel. The STTR model is trained by minimizing the difference between the predicted depth and the true depth, updating the network parameters through a back-propagation algorithm. Through its self-attention and cross-attention mechanisms, the STTR model can capture contextual relations between pixels and improve the accuracy of depth estimation.
The two soft tissue images acquired from the two different orientations at the previous of the adjacent times may be input into the STTR model for processing, obtaining the first depth information output by the STTR model; similarly, the two soft tissue images acquired from the two different orientations at the subsequent of the adjacent times may be input into the STTR model for processing, obtaining the second depth information output by the STTR model.
For example, the optical flow estimation network may be trained with a self-supervised loss composed of an occlusion-aware photometric consistency term and an edge-aware smoothing term. The occlusion-aware photometric consistency term measures the photometric difference between the image warped according to the predicted optical flow and the target image, with occluded pixels excluded; the edge-aware smoothing term is a first-order edge-aware smoothness penalty intended to smooth the prediction of the optical flow estimation network.
For example, referring to fig. 4, I_t^l denotes the left view at the previous time t of the adjacent times, I_t^r the right view at time t, I_{t+1}^l the left view at time t+1, and I_{t+1}^r the right view at time t+1. The electronic device can perform depth estimation on I_t^l and I_t^r to obtain the first depth information D_t, i.e., the depth map corresponding to the left view at time t, and perform depth estimation on I_{t+1}^l and I_{t+1}^r to obtain the second depth information D_{t+1}, i.e., the depth map corresponding to the left view at time t+1. The electronic device can also use the pre-trained instrument segmentation network to segment the surgical instrument out of I_t^l and I_{t+1}^l, obtaining the instrument-removed left view M_t at time t and left view M_{t+1} at time t+1. The electronic device can perform back-projection based on D_t, M_t, and the intrinsic and extrinsic parameters of the image acquisition device to obtain the point cloud S_t at time t. The electronic device can further input M_t and M_{t+1} into the optical flow estimation network for optical flow estimation to obtain per-pixel velocity vectors, and then use D_t and D_{t+1} to projection-map these velocity vectors, obtaining the velocity vectors V_t of the three-dimensional points in the point cloud.
2. Design of the model structure and the loss function.
In some embodiments, the electronic device may perform self-supervised training on the preset neural network model by using samples in the training sample set to obtain the soft tissue deformation estimation model.
Referring to fig. 5, the neural network model (i.e., the soft tissue deformation estimation model) includes a first encoder, a second encoder, and a decoder. The first encoder is used to extract features from the point cloud contained in a sample, obtaining point cloud features. The second encoder is used to extract features from the velocity vectors of the three-dimensional points in that point cloud, obtaining motion features. The decoder is used to perform deformation estimation according to the point cloud features and the motion features, obtaining a velocity field.
Illustratively, the neural network model is based on a U-shaped neural network model. The U-shaped neural network model is a convolutional neural network (CNN) architecture commonly used for semantic segmentation tasks. It comprises an encoder and a decoder: the encoder is mainly responsible for feature extraction, while the decoder restores the feature maps extracted by the encoder, through upsampling, skip connections, and the like, into an output of the same size as the input image. In the U-shaped neural network model, the encoder typically consists of alternating stacks of convolutional and pooling layers that extract image features layer by layer while reducing the image size; this reduces computation and memory consumption while preserving the important features, improving training speed and effectiveness. The decoder then uses upsampling or deconvolution to recover the low-resolution feature maps produced by the encoder into a high-resolution output of the same size as the original input image. To counter problems such as information loss and blurring, the U-shaped neural network model introduces a skip-connection mechanism, allowing the decoder to take feature maps directly from some encoder layers and use these low-level features for fine-grained modeling. Owing to its good feature extraction and segmentation performance, the U-shaped neural network model is widely applied in image segmentation, medical image analysis, natural language processing, and other fields, and has become one of the mainstream models for semantic segmentation tasks.
The neural network model provided by the embodiment of the application is based on a U-shaped neural network model and includes a first encoder, a second encoder, and a decoder, where the first encoder is skip-connected to the second encoder and the second encoder is skip-connected to the decoder. Referring to fig. 5, the broken line between the first encoder and the second encoder represents a skip connection, as does the broken line between the second encoder and the decoder.
Illustratively, the first encoder and the second encoder each include N neural network layers, where N is an integer greater than 1 and each neural network layer is a convolutional layer or a pooling layer. The neural network layers of the first encoder are skip-connected to those of the second encoder. The input data of the (n+1)-th neural network layer in the first encoder is the output data of the n-th neural network layer in the first encoder; the input data of the (n+1)-th neural network layer in the second encoder is the concatenation of the output data of the n-th neural network layer in the second encoder with the output data of the n-th neural network layer in the first encoder, where 1 ≤ n < N. In this embodiment, through the skip connections between the first encoder and the second encoder, the output of the second encoder fuses the point cloud features with the motion features.
The decoder includes N neural network layers, N being an integer greater than 1, and the neural network layers of the decoder are skip-connected to those of the second encoder. The input data of the first neural network layer of the decoder includes the output data of the last neural network layer in the second encoder; it will be appreciated that since the input of every non-first layer in the second encoder is fused with data output by the first encoder's layers, the output data of the last layer of the second encoder includes both point cloud features and motion features. The input data of the (m+1)-th neural network layer of the decoder is the concatenation of the output data of the m-th neural network layer in the decoder with the output data of the neural network layer in the second encoder corresponding to the (m+1)-th layer of the decoder, where 1 ≤ m < N. The output of the last neural network layer of the decoder is the velocity field of the previous of the adjacent times. In this embodiment, the decoder can skip some levels from the second encoder and use low-level features directly for fine-grained modeling, overcoming problems such as information loss and blurring.
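The layer-wise fusion between the two encoders can be sketched as follows; the random linear `layer` is a stand-in for real convolution/pooling layers, and all names and widths are illustrative, not the patent's actual architecture:

```python
import numpy as np

def layer(x, out_ch, rng):
    """Stand-in for one convolution/pooling layer: a random linear map
    over the channel dimension followed by a nonlinearity."""
    w = rng.standard_normal((x.shape[-1], out_ch))
    return np.tanh(x @ w)

def dual_encoder(points, motion, widths=(8, 16, 32), seed=0):
    """Layer n+1 of the motion (second) encoder consumes the concatenation
    of its own layer-n output with the point-cloud (first) encoder's
    layer-n output, so the final features fuse semantics and motion."""
    rng = np.random.default_rng(seed)
    p, m = points, motion
    for i, width in enumerate(widths):
        m_in = m if i == 0 else np.concatenate([m, p], axis=-1)
        p = layer(p, width, rng)
        m = layer(m_in, width, rng)
    return m

# Hypothetical toy inputs: 10 points with xyz coordinates and velocities.
feats = dual_encoder(np.ones((10, 3)), np.ones((10, 3)))
```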
For example, referring to fig. 5, in order to improve the accuracy of soft tissue deformation estimation, the input data of the last neural network layer in the decoder includes, in addition to the output data of the preceding decoder layer and the output data of the corresponding layer in the second encoder, the velocity field estimated by the neural network model at the previous time. Using this previously estimated velocity field as part of the reference information for predicting the current velocity field allows better temporal modeling and better ensures temporal consistency of the subsequently predicted deformation fields. That is, the neural network model (i.e., the soft tissue deformation estimation model) provided by the embodiment of the application can aggregate semantic information, motion flow information, and long-range temporal context, so that it exhibits both temporal consistency and physical plausibility.
It will be appreciated that if no velocity field from the previous time exists, an all-zero velocity field is used as input.
In some embodiments, the self-supervised training process is a continuous iterative optimization process, in which the following iteration may be repeated until the iteration end condition is met. In each iteration, a velocity field output by the neural network model is obtained based on samples in the training sample set, and the velocity field is integrated to obtain a first deformation field and a second deformation field satisfying the diffeomorphic mapping relation. The electronic device can then deform the two soft tissue images acquired at the previous of the adjacent times with the first deformation field to obtain two first deformation images, and deform the two first deformation images with the second deformation field to obtain two second deformation images. If the iteration end condition is not met, the parameters of the neural network model are adjusted according to the four soft tissue images, the two first deformation images, and the two second deformation images, yielding the neural network model for the next iteration.
Exemplarily, the iteration end conditions include but are not limited to: reaching a preset number of iterations, the difference between the first deformation image and the soft tissue image acquired at the subsequent of the adjacent times being smaller than a first preset difference, and/or the difference between the second deformation image and the soft tissue image acquired at the previous of the adjacent times being smaller than a second preset difference. The preset number of iterations, the first preset difference, and the second preset difference can be set according to the actual application scenario, which is not limited in this embodiment.
Referring to fig. 5, after acquiring the velocity field v_t output by the neural network model, the electronic device can integrate the velocity field; the integration can be approximated by the scaling-and-squaring algorithm, yielding a first deformation field φ_{t→t+1} and a second deformation field φ_{t+1→t} that satisfy the diffeomorphic mapping relation. Modeling with diffeomorphisms better ensures that the resulting deformation fields are spatially smooth and invertible bijections. The first deformation field φ_{t→t+1} characterizes the deformation from the previous time t to the subsequent time t+1 of the adjacent times, and the second deformation field φ_{t+1→t} characterizes the deformation from time t+1 back to time t.
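A scaling-and-squaring integration of a velocity field can be sketched as follows (the patent's field is three-dimensional; a 2-D field with bilinear resampling is used here purely for brevity):

```python
import numpy as np

def bilinear_sample(field, xs, ys):
    """Bilinearly sample a (H, W, C) field at float coordinates."""
    h, w = field.shape[:2]
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    wx = (np.clip(xs, 0, w - 1) - x0)[..., None]
    wy = (np.clip(ys, 0, h - 1) - y0)[..., None]
    top = field[y0, x0] * (1 - wx) + field[y0, x1] * wx
    bot = field[y1, x0] * (1 - wx) + field[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def integrate_velocity(v, steps=6):
    """Scaling and squaring: start from the small displacement v / 2**steps
    and self-compose it `steps` times, which approximates the exponential
    map of the velocity field and keeps the result a diffeomorphism."""
    h, w = v.shape[:2]
    us, vs = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))
    d = v / (2 ** steps)
    for _ in range(steps):
        # phi <- phi o phi, i.e. d(x) <- d(x) + d(x + d(x))
        d = d + bilinear_sample(d, us + d[..., 0], vs + d[..., 1])
    return d

v = np.zeros((4, 4, 2))
v[..., 0] = 2.0                      # constant rightward velocity field
disp = integrate_velocity(v)         # integrates back to ~2.0 everywhere
```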
Since the samples in the training sample set are three-dimensional information, the velocity field output by the neural network model is three-dimensional, and the first and second deformation fields obtained by integrating it under the diffeomorphic mapping relation are three-dimensional as well. Thus, the first deformation image and the second deformation image are generated by projecting the deformed point cloud into two-dimensional space.
The generation process of the first deformation image is described here. Deforming the two soft tissue images acquired at the previous of the adjacent times with the first deformation field to obtain two first deformation images includes: for each soft tissue image acquired at the previous time, deforming the point cloud corresponding to that soft tissue image with the first deformation field to obtain a first deformed point cloud, and projecting the first deformed point cloud into two-dimensional space using the intrinsic and extrinsic parameters of the image acquisition device to obtain the first deformation image corresponding to that soft tissue image. The point cloud corresponding to a soft tissue image can be obtained by mapping the soft tissue image from two-dimensional space to three-dimensional space based on the first depth information and the intrinsic and extrinsic parameters of the image acquisition device.
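The deform-then-project step can be sketched as follows, again with hypothetical intrinsics and a toy per-point displacement standing in for the first deformation field:

```python
import numpy as np

def project_to_image(points, K):
    """Pinhole projection p = K X / X_z, the inverse of back-projection."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

# Hypothetical intrinsics and a two-point "cloud" at 2 m depth.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
cloud = np.array([[0.0, 0.0, 2.0],
                  [0.02, 0.0, 2.0]])
displacement = np.array([[0.02, 0.0, 0.0],    # toy per-point deformation
                         [0.0, 0.0, 0.0]])
warped_uv = project_to_image(cloud + displacement, K)  # deformed-image pixels
```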
For example, for the left view at time t, a predicted left view (i.e., the first deformed image) at time t+1 can be obtained based on the above steps. Similarly, for the right view at time t, a predicted right view (i.e., the first deformed image) at time t+1 can be obtained based on the above steps.
The generation process of the second deformation image is described here. Deforming the two first deformation images with the second deformation field to obtain two second deformation images includes: for each first deformation image, deforming the first deformed point cloud corresponding to that image again with the second deformation field to obtain a second deformed point cloud, and then projecting the second deformed point cloud into two-dimensional space using the intrinsic and extrinsic parameters of the image acquisition device to obtain the second deformation image.
With the above example, for the predicted left view at time t+1, the predicted left view at time t (i.e., the second deformation image) can be obtained based on the above steps. Similarly, for the predicted right view at time t+1, the predicted right view at time t (i.e., the second deformation image) can be obtained based on the above steps.
By way of example, the embodiment of the application adopts cycle consistency as the learning objective of self-supervised training, so that a soft tissue image returns to its original state after undergoing a forward-backward cycle, and the intermediate deformed state is consistent with the state shown in the real image at that time. Specifically, the learning objectives of self-supervised training include: (1) the first deformation image is the same as or similar to the soft tissue image acquired at the subsequent of the adjacent times, and (2) the second deformation image is the same as or similar to the soft tissue image acquired at the previous of the adjacent times. The first learning objective ensures that the estimated soft tissue deformation is consistent with the deformation actually undergone by the soft tissue; the second makes the soft tissue deformation estimation model robust against discontinuous semantic sequences caused by surgical instrument occlusion. Combining the two objectives helps produce a stronger learning signal and avoids the optimization collapsing to a trivial solution.
A first loss function may be constructed based on the first learning objective, and a second loss function based on the second.
For example, the electronic device may calculate a first loss value corresponding to the first loss function according to the two first deformation images and the two soft tissue images acquired at a later time of the adjacent time instants; calculating a second loss value corresponding to the second loss function according to the two second deformation images and the two soft tissue images acquired at the previous moment in the adjacent moment; and then adjusting parameters of the neural network model according to the first loss value and the second loss value. For example, parameters of the neural network model may be adjusted based on a weighted sum of the first loss value and the second loss value.
The first loss function is used to calculate the photometric loss between the pixel values of the soft tissue image acquired at the subsequent of the adjacent times and the pixel values of the first deformation image of the same orientation; the first loss value includes this photometric loss value.
Following the above example, the photometric loss between the pixel values of the left view at time t+1 and the pixel values of the predicted left view at time t+1 (a first deformation image) can be calculated through the first loss function to obtain one photometric loss value, and likewise between the right view at time t+1 and the predicted right view at time t+1 to obtain another; the two photometric loss values are then weighted and summed to obtain the first loss value.
Wherein, for two deformed images, the same loss function can be used to calculate the luminosity loss; different loss functions may also be used to calculate the photometric loss (i.e. the first loss function comprises two different sub-functions for calculating the photometric loss).
Calculating the photometric loss with two different sub-functions for the two deformation images is illustrated as follows. For the left view, the left view at time t+1 and the predicted left view at time t+1 (the first deformation image) may each be Census-transformed, and the Hamming distance between the transformed results taken as the photometric loss value. The Census transform is a non-parametric image transform that is good at detecting local structural features in an image, such as edges and corners; it essentially encodes the gray values of an image's pixels into a binary code stream recording how each neighboring pixel's gray value compares with that of the central pixel. For the right view, the angle between the RGB vector of each pixel in the right view at time t+1 and that of the corresponding pixel in the predicted right view at time t+1 (the first deformation image) may be calculated to obtain the photometric loss value.
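A Census transform and the associated Hamming-distance loss can be sketched as follows (window radius 1, giving an 8-bit code per pixel); note how a constant brightness offset leaves the codes unchanged, which is what makes the loss robust to illumination changes:

```python
import numpy as np

def census_transform(img, r=1):
    """Encode each pixel as a bit-string: one bit per neighbour in the
    (2r+1)^2 window, set when the neighbour is brighter than the centre."""
    h, w = img.shape
    pad = np.pad(img, r, mode="edge")
    bits = []
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dx == 0 and dy == 0:
                continue
            neigh = pad[r + dy : r + dy + h, r + dx : r + dx + w]
            bits.append((neigh > img).astype(np.uint8))
    return np.stack(bits, axis=-1)          # (H, W, 8) binary descriptor

def census_loss(a, b):
    """Mean Hamming distance between the Census codes of two images."""
    return float(np.mean(np.sum(census_transform(a) != census_transform(b),
                                axis=-1)))

img = np.arange(25, dtype=float).reshape(5, 5)
same = census_loss(img, img)                # identical images -> 0
shifted = census_loss(img, img + 10.0)      # constant offset -> still 0
```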
The second loss function is used to calculate the L1-norm distance between pixel coordinates in the soft tissue image acquired at the previous of the adjacent times and pixel coordinates in the second deformation image of the same orientation; the second loss value includes this L1-norm distance.
Following the above example, the L1-norm distance between the pixel coordinates in the left view at time t and those in the predicted left view at time t (i.e., the second deformation image) can be calculated through the second loss function, and likewise for the right view at time t and the predicted right view at time t. The two L1-norm distances may then be weighted and summed to obtain the second loss value, or one of them selected as the second loss value, which is not limited in this embodiment.
In some embodiments, to increase the convergence speed of the model, a third loss function may be further added, where the third loss function is used to determine a similarity between the velocity vector of the three-dimensional point in the point cloud and the first deformation field; for example, the L1 norm distance between the velocity vector of the three-dimensional point in the point cloud and the first deformation field can be calculated as a third loss value. The optimization objective of the third loss function is to minimize the L1 norm distance. That is, the electronic device may calculate a third loss value corresponding to the third loss function according to a similarity between the velocity vector of the three-dimensional point in the point cloud and the first deformation field; and then adjusting parameters of the neural network model according to the third loss value, the first loss value and the second loss value.
In some embodiments, to avoid abrupt changes in the deformation results, a fourth loss function may also be added, which is used to make the first deformation field and the second deformation field smoother. That is, the electronic device may calculate a fourth loss value corresponding to the fourth loss function according to a gradient change between the first deformation field and the second deformation field; and then adjusting parameters of the neural network model according to the fourth loss value, the first loss value and the second loss value. For example, the fourth loss function is first-order edge-aware smoothness, and the optimization objective is to smooth the result of the deformation field output, so as to avoid abrupt deformation results.
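A first-order edge-aware smoothness term of this kind can be sketched as follows; `alpha` is an illustrative edge-sensitivity weight, not a value from the patent:

```python
import numpy as np

def edge_aware_smoothness(field, image, alpha=10.0):
    """First-order edge-aware smoothness: penalise field gradients,
    down-weighted where the image itself has strong gradients (edges),
    so the deformation may stay sharp at tissue boundaries."""
    dx_f = np.abs(np.diff(field, axis=1))
    dy_f = np.abs(np.diff(field, axis=0))
    dx_i = np.exp(-alpha * np.abs(np.diff(image, axis=1)))[..., None]
    dy_i = np.exp(-alpha * np.abs(np.diff(image, axis=0)))[..., None]
    return float(np.mean(dx_f * dx_i) + np.mean(dy_f * dy_i))

img = np.zeros((4, 4))
flat_field = np.ones((4, 4, 2))          # constant field -> zero penalty
loss_flat = edge_aware_smoothness(flat_field, img)
ramp_field = np.zeros((4, 4, 2))
ramp_field[..., 0] = np.arange(4.0)      # gradient in x -> positive penalty
loss_ramp = edge_aware_smoothness(ramp_field, img)
```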
In some embodiments, the parameters of the neural network model may be adjusted by combining the third loss value and the fourth loss value, and the first loss value and the second loss value, so as to improve the training efficiency of the model and the accuracy of the output result.
The following is an exemplary description of the application of the soft tissue deformation estimation model:
referring to fig. 6, an embodiment of the present application provides a soft tissue deformation estimation method, including:
in S201, four soft tissue images acquired from two different orientations at adjacent times are acquired.
In S202, depth estimation and optical flow estimation are performed on the four soft tissue images, so as to obtain a point cloud at a previous time among adjacent times and a velocity vector of a three-dimensional point in the point cloud.
In S203, inputting the point cloud and the velocity vector of the three-dimensional point in the point cloud into a soft tissue deformation estimation model for processing, so as to obtain a velocity field; the soft tissue deformation estimation model is obtained by training based on the training method of the soft tissue deformation estimation model.
In S204, the velocity field is integrated to obtain soft tissue deformation information according to the result of the integration.
In this embodiment, the point cloud representing the semantic information of the soft tissue and the velocity vectors representing its motion flow information are input into the soft tissue deformation estimation model. The model aggregates the semantic information and the motion flow information and perceives the soft tissue deformation law from three-dimensional space; the velocity field is then integrated to obtain an integration result satisfying the diffeomorphic mapping relation, ensuring that the estimated soft tissue deformation information conforms to the physical laws of the soft tissue deformation process, so that the deformation of the soft tissue in actual three-dimensional space is estimated accurately.
In some embodiments, the four soft tissue images include left and right views acquired at a previous time instant in an adjacent time instant, and left and right views acquired at a subsequent time instant in an adjacent time instant.
In some embodiments, deriving the point cloud at the previous of the adjacent times includes: performing depth estimation on the two soft tissue images acquired from the two different orientations at the previous of the adjacent times to obtain first depth information; and mapping the pixels in the soft tissue image acquired from the target orientation (one of the two orientations) at the previous time from two-dimensional space to three-dimensional space using the first depth information and the intrinsic and extrinsic parameters of the image acquisition device, obtaining the point cloud at the previous of the adjacent times.
In some embodiments, deriving a velocity vector for a three-dimensional point in a point cloud includes: performing depth estimation on two soft tissue images acquired from two different directions at the previous moment in the adjacent moment to obtain first depth information; and performing depth estimation on two soft tissue images acquired from two different directions at a later time in the adjacent time to obtain second depth information; performing optical flow estimation on two soft tissue images acquired from the same azimuth at the adjacent moment by using an optical flow estimation network to obtain a speed vector of a pixel; and performing projection mapping on the velocity vector of the pixel by using the first depth information and the second depth information to obtain the velocity vector of the three-dimensional point in the point cloud.
In some embodiments, the method further includes acquiring the velocity field of the previous time estimated by the soft tissue deformation estimation model. In this case, inputting the point cloud and the velocity vectors of its three-dimensional points into the soft tissue deformation estimation model for processing to obtain a velocity field includes: inputting the point cloud, the velocity vectors of the three-dimensional points in the point cloud, and the velocity field of the previous time estimated by the model into the soft tissue deformation estimation model for processing to obtain the velocity field. In this embodiment, the soft tissue deformation estimation model can aggregate semantic information, motion flow information, and long-range temporal context, so that its output exhibits both temporal consistency and physical plausibility.
The various technical features of the above embodiments may be arbitrarily combined as long as there is no conflict or contradiction between the features, but are not described in detail, and therefore, the arbitrary combination of the various technical features of the above embodiments is also within the scope of the disclosure of the present specification.
The application also provides an embodiment of a training device of the soft tissue deformation estimation model corresponding to the embodiment of the training method of the soft tissue deformation estimation model.
The embodiment of the training device of the soft tissue deformation estimation model can be applied to an electronic device. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. Taking software implementation as an example, the device in a logical sense is formed by the processor of the electronic device where it is located reading corresponding computer program instructions from nonvolatile memory into memory for execution. In terms of hardware, fig. 7 shows a hardware structure diagram of the electronic device where the training device of the soft tissue deformation estimation model of the present application is located; in addition to the processor, memory, network interface, and nonvolatile memory shown in fig. 7, the electronic device in this embodiment generally includes other hardware according to the actual function of the training device, which is not described herein again.
Referring to fig. 8, an embodiment of the present application provides a training device for a soft tissue deformation estimation model, including:
a training sample acquiring module 301, configured to acquire a training sample set, where different samples in the training sample set include point clouds at different moments and velocity vectors of three-dimensional points in the point clouds; any sample is obtained based on depth estimation and optical flow estimation of four soft tissue images acquired from two different directions at adjacent moments;
The self-supervision training module 302 is configured to perform self-supervision training on a preset neural network model by using samples in the training sample set to obtain a soft tissue deformation estimation model;
The self-supervision training module is specifically configured to repeat the following iterative process until an iteration end condition is satisfied: in each iteration, acquire the velocity field output by the neural network model based on the samples in the training sample set, and integrate the velocity field to obtain a first deformation field and a second deformation field that satisfy a diffeomorphic mapping relation; use the first deformation field to respectively deform the two soft tissue images acquired at the earlier of the adjacent moments to obtain two first deformation images, and use the second deformation field to respectively deform the two first deformation images to obtain two second deformation images; and, when the iteration end condition is not satisfied, adjust the parameters of the neural network model according to the four soft tissue images, the two first deformation images, and the two second deformation images to obtain the neural network model for the next iteration.
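One common way to integrate a stationary velocity field into a diffeomorphic deformation field is scaling and squaring: the field is scaled down, then repeatedly composed with itself. Integrating +v gives the forward field and integrating −v its approximate inverse, which is one way to obtain the two deformation fields above. The 2-D numpy sketch below is illustrative only; the bilinear sampler, the step count, and the pixel-displacement parameterisation are assumptions, not the application's implementation.

```python
import numpy as np

def bilinear(field, coords):
    """Sample a (H, W) field at real-valued (y, x) locations of shape (2, H, W)."""
    H, W = field.shape
    y = np.clip(coords[0], 0, H - 1)
    x = np.clip(coords[1], 0, W - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * field[y0, x0] + (1 - wy) * wx * field[y0, x1]
            + wy * (1 - wx) * field[y1, x0] + wy * wx * field[y1, x1])

def integrate_velocity(vel, steps=6):
    """Scaling and squaring: phi = exp(v), with vel of shape (2, H, W) in pixels."""
    disp = vel / float(2 ** steps)                # start from a small displacement
    grid = np.mgrid[0:vel.shape[1], 0:vel.shape[2]].astype(float)
    for _ in range(steps):
        # compose the field with itself: phi(x) = x + d(x) + d(x + d(x))
        warped = np.stack([bilinear(disp[c], grid + disp) for c in range(2)])
        disp = disp + warped
    return disp
```

Integrating `vel` would yield the first (forward) deformation field and integrating `-vel` the second (backward) one, which is what makes the pair approximately inverse-consistent.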
In some embodiments, the first deformation field characterizes deformation from a previous time to a subsequent time of the adjacent times, and the second deformation field characterizes deformation from a subsequent time to a previous time of the adjacent times.
In some embodiments, the learning objective of the self-supervised training includes: the first deformation image is the same as or similar to the soft tissue image acquired at a later one of the adjacent moments in time, and the second deformation image is the same as or similar to the soft tissue image acquired at a previous one of the adjacent moments in time.
In some embodiments, the iteration end condition comprises: the number of iterations reaches a preset count, the difference between the first deformation image and the soft tissue image acquired at the later of the adjacent moments is smaller than a first preset difference, and/or the difference between the second deformation image and the soft tissue image acquired at the earlier of the adjacent moments is smaller than a second preset difference.
In some embodiments, the self-supervised training module 302 is specifically configured to: calculating a first loss value according to the two first deformation images and the two soft tissue images acquired at the later moment of the adjacent moment; calculating a second loss value according to the two second deformation images and the two soft tissue images acquired at the previous moment in the adjacent moment; and adjusting parameters of the neural network model according to the first loss value and the second loss value.
In some embodiments, the first loss value comprises a photometric loss value between a pixel value of a soft tissue image acquired at a later one of the adjacent moments in time and a pixel value of a first deformed image belonging to the same orientation; the second loss value includes an L1 norm distance between pixel coordinates in the soft tissue image acquired at a previous one of the adjacent moments and pixel coordinates in a second deformation image belonging to the same orientation.
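A minimal sketch of these two terms follows. The application does not specify weights, robust variants, or the exact coordinate formulation, so everything below is illustrative: a mean-absolute photometric term for the forward pass, and a round-trip (cycle) reading of the coordinate L1 term for the backward pass.

```python
import numpy as np

def photometric_l1(warped, target):
    """First term: mean absolute intensity difference between a first
    deformation image and the later frame from the same orientation."""
    return float(np.mean(np.abs(warped - target)))

def cycle_coordinate_l1(fwd_disp, bwd_disp):
    """Second term (one reading of the text): after warping forward and then
    back, every pixel should land on its starting coordinate, so penalise the
    L1 length of the round-trip displacement. Simplification: the backward
    field is read at the original grid rather than at the warped locations."""
    return float(np.mean(np.abs(fwd_disp + bwd_disp)))
```

Both terms vanish exactly when the warps reproduce the target frame and the two deformation fields are mutual inverses, which matches the stated learning objective.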
In some embodiments, the self-supervision training module 302 is further configured to: calculate a third loss value according to the similarity between the velocity vectors of the three-dimensional points in the point cloud and the first deformation field; and/or calculate a fourth loss value according to the gradient change between the first deformation field and the second deformation field; and adjust the parameters of the neural network model according to the first loss value, the second loss value, and at least one of the third loss value and the fourth loss value.
In some embodiments, the neural network model includes a first encoder, a second encoder, and a decoder. The first encoder is configured to extract features from the point cloud contained in the sample to obtain point cloud features; the second encoder is configured to extract features from the velocity vectors of the three-dimensional points in the point cloud contained in the sample to obtain motion features; and the decoder is configured to perform deformation estimation according to the point cloud features and the motion features to obtain a velocity field.
In some embodiments, the neural network model is a U-shaped neural network model, in which the first encoder is skip-connected to the second encoder, and the second encoder is skip-connected to the decoder.
In some embodiments, the first encoder and the second encoder each include N neural network layers, where N is an integer greater than 1; the neural network layers in the first encoder are skip-connected to the neural network layers of the second encoder; the input data of the (n+1)-th neural network layer in the first encoder is the output data of the n-th neural network layer in the first encoder, and the input data of the (n+1)-th neural network layer in the second encoder is the concatenation of the output data of the n-th neural network layer in the second encoder and the output data of the n-th neural network layer in the first encoder, where 1 ≤ n &lt; N.
In some embodiments, the decoder includes N neural network layers, where N is an integer greater than 1; the input data of the first neural network layer of the decoder includes the output data of the last neural network layer in the second encoder; the input data of the (m+1)-th neural network layer of the decoder is the concatenation of the output data of the m-th neural network layer in the decoder and the output data of the neural network layer in the second encoder corresponding to the (m+1)-th neural network layer of the decoder, where 1 ≤ m &lt; N; and the input data of the last neural network layer in the decoder also includes the velocity field estimated by the neural network model at the previous time step.
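The skip-connection wiring between the two encoders can be sketched as follows. Dense layers and tanh activations stand in for whatever layer types the application actually uses; the shapes, initialisation, and layer count are illustrative assumptions only.

```python
import numpy as np

class DualEncoderSketch:
    """Wiring sketch of the two skip-connected encoders described above.

    Layer n of the geometry (point-cloud) encoder feeds both its own layer
    n+1 and layer n+1 of the motion encoder, which concatenates that output
    with its own previous output before processing it.
    """

    def __init__(self, n_layers, width, seed=0):
        rng = np.random.default_rng(seed)
        self.enc1 = [rng.standard_normal((width, width)) * 0.1
                     for _ in range(n_layers)]
        # every motion-encoder layer after the first sees concatenated input
        self.enc2 = [rng.standard_normal((width, width)) * 0.1] + \
                    [rng.standard_normal((width, 2 * width)) * 0.1
                     for _ in range(n_layers - 1)]

    def forward(self, point_feat, motion_feat):
        out1 = np.tanh(self.enc1[0] @ point_feat)
        out2 = np.tanh(self.enc2[0] @ motion_feat)
        for lin1, lin2 in zip(self.enc1[1:], self.enc2[1:]):
            nxt1 = np.tanh(lin1 @ out1)
            out2 = np.tanh(lin2 @ np.concatenate([out2, out1]))  # skip link
            out1 = nxt1
        return out1, out2   # out2 is what the decoder would consume first
```

The decoder (not shown) would mirror this pattern, concatenating each of its layer outputs with the corresponding second-encoder output, and receiving the previous time step's velocity field at its last layer.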
In some embodiments, the point cloud contained in any sample is obtained by: performing depth estimation on the two soft tissue images acquired from the two different orientations at the earlier of the adjacent moments to obtain first depth information; and mapping the pixels of the soft tissue image acquired from one of the orientations at the earlier moment from two-dimensional space to three-dimensional space using the first depth information and the intrinsic and extrinsic parameters of the image acquisition device, thereby obtaining the point cloud at the earlier of the adjacent moments.
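The back-projection step can be sketched with the standard pinhole model. Only the intrinsic matrix is used here; applying the extrinsic pose afterwards is omitted, and the matrix `K` and the row/column convention are assumptions for illustration.

```python
import numpy as np

def backproject(depth, K):
    """Lift each pixel (u, v) with depth d to the 3-D point d * K^{-1} [u, v, 1]^T.

    depth: (H, W) array; K: 3x3 intrinsic matrix. Returns an (H*W, 3) point
    cloud in the camera frame; the extrinsics would then map it to the world.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])  # homogeneous
    rays = np.linalg.inv(K) @ pixels                           # unit-depth rays
    return (rays * depth.ravel()).T
```

With an identity intrinsic matrix, the pixel at (u, v) with depth d maps to (u·d, v·d, d), which makes the geometry easy to check.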
In some embodiments, the velocity vectors of the three-dimensional points in the point cloud contained in any sample are obtained by: performing depth estimation on the two soft tissue images acquired from the two different orientations at the earlier of the adjacent moments to obtain first depth information, and on the two soft tissue images acquired at the later of the adjacent moments to obtain second depth information; performing optical flow estimation, using an optical flow estimation network, on the two soft tissue images acquired from the same orientation at the adjacent moments to obtain per-pixel velocity vectors; and projecting the per-pixel velocity vectors using the first depth information and the second depth information to obtain the velocity vectors of the three-dimensional points in the point cloud.
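Lifting the 2-D flow to 3-D velocities can be sketched as follows. Nearest-pixel sampling of the target depth map and a unit time step are simplifying assumptions; a real pipeline would typically use bilinear sampling and occlusion handling.

```python
import numpy as np

def flow_to_scene_flow(flow, depth_t0, depth_t1, K, dt=1.0):
    """Back-project each pixel at time t with depth_t0, back-project its flow
    target at t+1 with depth_t1, and divide the 3-D displacement by dt.

    flow: (2, H, W) per-pixel (du, dv); depths: (H, W); K: 3x3 intrinsics.
    Returns (H*W, 3) velocity vectors, one per point of the time-t point cloud.
    """
    H, W = depth_t0.shape
    v0, u0 = np.mgrid[0:H, 0:W].astype(float)
    # flow target, snapped to the nearest valid pixel of the t+1 depth map
    u1 = np.clip(np.rint(u0 + flow[0]), 0, W - 1).astype(int)
    v1 = np.clip(np.rint(v0 + flow[1]), 0, H - 1).astype(int)
    Kinv = np.linalg.inv(K)
    ones = np.ones(H * W)
    p0 = (Kinv @ np.stack([u0.ravel(), v0.ravel(), ones])) * depth_t0.ravel()
    p1 = (Kinv @ np.stack([u1.ravel().astype(float), v1.ravel().astype(float),
                           ones])) * depth_t1[v1, u1].ravel()
    return ((p1 - p0) / dt).T
```

When the flow is zero and the two depth maps agree, the velocity vectors are exactly zero, as expected for a static scene.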
In some embodiments, the soft tissue image is an image obtained by segmenting a surgical instrument in the soft tissue image using a pre-trained instrument segmentation network.
In some embodiments, the four soft tissue images include left and right views acquired at a previous time instant in an adjacent time instant, and left and right views acquired at a subsequent time instant in an adjacent time instant.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
The application also provides an embodiment of the soft tissue deformation estimating device corresponding to the embodiment of the soft tissue deformation estimating method. The embodiment of the soft tissue deformation estimating device of the present application may be applied to an electronic device, and the hardware structure of the electronic device is similar to that of fig. 7, and will not be described herein.
Referring to fig. 9, an embodiment of the present application provides a soft tissue deformation estimating apparatus, including:
the image acquisition module 401 is configured to acquire four soft tissue images acquired from two different directions at adjacent moments.
The image processing module 402 is configured to perform depth estimation and optical flow estimation on the four soft tissue images, so as to obtain a point cloud at a previous time of the adjacent time and a velocity vector of a three-dimensional point in the point cloud.
The deformation estimation module 403 is configured to input the point cloud and the velocity vectors of the three-dimensional points in the point cloud into the soft tissue deformation estimation model for processing, so as to obtain a velocity field; the soft tissue deformation estimation model is trained based on the method described above.
The soft tissue deformation information acquisition module 404 is configured to integrate the velocity field and obtain soft tissue deformation information according to the integration result.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, since they essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art can understand and implement the present application without creative effort.
In some embodiments, the present application further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method of any one of the above when executing the program.
In some embodiments, the present description embodiments also provide a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method as described in any of the above.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and structural equivalents thereof, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing a computer program include, by way of example, those based on general and/or special purpose microprocessors, or any other kind of central processing unit. Generally, a central processing unit receives instructions and data from read-only memory and/or random access memory. The essential elements of a computer include a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes, or is operatively coupled to receive data from or transfer data to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Furthermore, a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disk or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings are not necessarily required to be in the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing descriptions are merely preferred embodiments of the present application and are not intended to limit it; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the application shall fall within its scope of protection.

Claims (20)

1. A method of training a soft tissue deformation estimation model, comprising:
acquiring a training sample set, wherein different samples in the training sample set comprise point clouds at different moments and speed vectors of three-dimensional points in the point clouds; any sample is obtained based on depth estimation and optical flow estimation of four soft tissue images acquired from two different directions at adjacent moments;
performing self-supervision training on a preset neural network model by utilizing samples in the training sample set to obtain a soft tissue deformation estimation model;
wherein the self-supervised training indicates that the following iterative process is repeated until an iteration end condition is satisfied:
in each iteration process, acquiring a velocity field output by the neural network model based on samples in the training sample set, and integrating the velocity field to obtain a first deformation field and a second deformation field which satisfy a diffeomorphic mapping relation;
using the first deformation field to respectively deform the two soft tissue images acquired at the earlier of the adjacent moments to obtain two first deformation images, and using the second deformation field to respectively deform the two first deformation images to obtain two second deformation images;
And under the condition that the iteration ending condition is not met, adjusting parameters of the neural network model according to the four soft tissue images, the two first deformation images and the two second deformation images to obtain the neural network model in the next iteration process.
2. The method of claim 1, wherein the first deformation field characterizes deformation from a previous time to a subsequent time of the adjacent times, and the second deformation field characterizes deformation from a subsequent time to a previous time of the adjacent times.
3. The method according to claim 1 or 2, wherein the learning objective of the self-supervised training comprises: the first deformation image is the same as or similar to the soft tissue image acquired at a later one of the adjacent moments in time, and the second deformation image is the same as or similar to the soft tissue image acquired at a previous one of the adjacent moments in time.
4. The method according to claim 1, wherein the performing the deformation processing on the two soft tissue images acquired at the previous time of the adjacent time using the first deformation field to obtain two first deformed images respectively includes:
For each soft tissue image acquired at the previous moment in the adjacent moment, carrying out deformation processing on the point cloud corresponding to the soft tissue image by using a first deformation field to obtain a first deformation point cloud;
projecting the first deformation point cloud to a two-dimensional space by utilizing internal and external parameters of the image acquisition device to obtain a first deformation image corresponding to the soft tissue image;
the step of performing deformation processing on the two first deformation images by using the second deformation field to obtain two second deformation images respectively includes:
for each first deformation image, performing deformation processing on the first deformation point cloud corresponding to the first deformation image by using a second deformation field to obtain a second deformation point cloud;
and projecting the second deformation point cloud to a two-dimensional space by utilizing the internal and external parameters of the image acquisition device to obtain a second deformation image.
5. The method of claim 1, wherein said adjusting parameters of the neural network model based on the four soft tissue images, the two first deformation images, and the two second deformation images comprises:
calculating a first loss value according to the two first deformation images and the two soft tissue images acquired at the later moment of the adjacent moment; calculating a second loss value according to the two second deformation images and the two soft tissue images acquired at the previous moment in the adjacent moment;
And adjusting parameters of the neural network model according to the first loss value and the second loss value.
6. The method of claim 5, wherein the first loss value comprises a photometric loss value between a pixel value of a soft tissue image acquired at a later one of the adjacent moments in time and a pixel value of a first deformed image belonging to the same orientation;
the second loss value includes an L1 norm distance between pixel coordinates in the soft tissue image acquired at a previous one of the adjacent moments and pixel coordinates in a second deformation image belonging to the same orientation.
7. The method according to claim 5 or 6, characterized in that the method further comprises:
calculating a third loss value according to the similarity between the speed vector of the three-dimensional point in the point cloud and the first deformation field; and/or calculating a fourth loss value from a gradient change between the first deformation field and the second deformation field;
the adjusting the parameters of the neural network model according to the first loss value and the second loss value further includes:
and adjusting parameters of the neural network model according to the first loss value, the second loss value, and at least one of the third loss value and the fourth loss value.
8. The method of claim 1, wherein the neural network model comprises a first encoder, a second encoder, and a decoder;
the first encoder is configured to extract features from the point cloud contained in the sample to obtain point cloud features;
the second encoder is configured to extract features from the velocity vectors of the three-dimensional points in the point cloud contained in the sample to obtain motion features;
and the decoder is configured to perform deformation estimation according to the point cloud features and the motion features to obtain a velocity field.
9. The method of claim 8, wherein the neural network model is a U-shaped neural network model, in which the first encoder is skip-connected to the second encoder, and the second encoder is skip-connected to the decoder.
10. The method according to claim 8 or 9, wherein the first encoder and the second encoder each comprise N neural network layers; n is an integer greater than 1;
the neural network layers in the first encoder are skip-connected to the neural network layers of the second encoder;
the input data of the (n+1)-th neural network layer in the first encoder is the output data of the n-th neural network layer in the first encoder, and the input data of the (n+1)-th neural network layer in the second encoder is the concatenation of the output data of the n-th neural network layer in the second encoder and the output data of the n-th neural network layer in the first encoder, where 1 ≤ n &lt; N.
11. The method of claim 10, wherein the decoder comprises N neural network layers; N is an integer greater than 1;
the neural network layers in the decoder are skip-connected to the neural network layers of the second encoder;
the input data of the first neural network layer of the decoder includes the output data of the last neural network layer in the second encoder;
the input data of the (m+1)-th neural network layer of the decoder is the concatenation of the output data of the m-th neural network layer in the decoder and the output data of the neural network layer in the second encoder corresponding to the (m+1)-th neural network layer of the decoder, where 1 ≤ m &lt; N;
the input data of the last neural network layer in the decoder also includes the velocity field estimated by the neural network model at the previous time step.
12. The method of claim 1, wherein the point cloud contained by any sample is obtained by:
performing depth estimation on two soft tissue images acquired from two different directions at the previous moment in the adjacent moment to obtain first depth information;
and mapping pixels in the soft tissue image acquired from one azimuth at the previous moment in the adjacent moment from a two-dimensional space to a three-dimensional space by utilizing the first depth information and internal and external parameters of the image acquisition device, so as to obtain a point cloud at the previous moment in the adjacent moment.
13. The method of claim 12, wherein the velocity vector of a three-dimensional point in the point cloud comprised by any sample is obtained by:
performing depth estimation on two soft tissue images acquired from two different directions at the previous moment in the adjacent moment to obtain first depth information; and performing depth estimation on two soft tissue images acquired from two different directions at a later time in the adjacent time to obtain second depth information;
performing optical flow estimation, using an optical flow estimation network, on the two soft tissue images acquired from the same orientation at the adjacent moments to obtain velocity vectors of pixels;
and performing projection mapping on the velocity vector of the pixel by using the first depth information and the second depth information to obtain the velocity vector of the three-dimensional point in the point cloud.
14. The method of claim 1, wherein the soft tissue image is an image obtained by segmenting a surgical instrument in the soft tissue image using a pre-trained instrument segmentation network.
15. The method of claim 1, wherein the four soft tissue images include left and right views acquired at a previous time instant in an adjacent time instant and left and right views acquired at a subsequent time instant in an adjacent time instant.
16. A method for estimating soft tissue deformation, comprising:
acquiring four soft tissue images acquired from two different directions at adjacent moments;
performing depth estimation and optical flow estimation on the four soft tissue images to obtain a point cloud at the earlier of the adjacent moments and velocity vectors of the three-dimensional points in the point cloud;
inputting the point cloud and the velocity vectors of the three-dimensional points in the point cloud into a soft tissue deformation estimation model for processing to obtain a velocity field; wherein the soft tissue deformation estimation model is trained based on the method of any one of claims 1 to 15;
and integrating the velocity field to obtain soft tissue deformation information according to the integration result.
17. A training device for a soft tissue deformation estimation model, comprising:
the training sample acquisition module is used for acquiring a training sample set, wherein different samples in the training sample set comprise point clouds at different moments and speed vectors of three-dimensional points in the point clouds; any sample is obtained based on depth estimation and optical flow estimation of four soft tissue images acquired from two different directions at adjacent moments;
the self-supervision training module is used for carrying out self-supervision training on a preset neural network model by utilizing the samples in the training sample set to obtain a soft tissue deformation estimation model;
The self-supervision training module is specifically configured to repeat the following iterative process until an iteration end condition is satisfied: in each iteration, acquire the velocity field output by the neural network model based on the samples in the training sample set, and integrate the velocity field to obtain a first deformation field and a second deformation field that satisfy a diffeomorphic mapping relation; use the first deformation field to respectively deform the two soft tissue images acquired at the earlier of the adjacent moments to obtain two first deformation images, and use the second deformation field to respectively deform the two first deformation images to obtain two second deformation images; and, when the iteration end condition is not satisfied, adjust the parameters of the neural network model according to the four soft tissue images, the two first deformation images, and the two second deformation images to obtain the neural network model for the next iteration.
18. A soft tissue deformation estimating apparatus, comprising:
the image acquisition module is used for acquiring four soft tissue images acquired from two different directions at adjacent moments;
the image processing module is used for carrying out depth estimation and optical flow estimation on the four soft tissue images to obtain a point cloud at the previous moment in the adjacent moment and a speed vector of a three-dimensional point in the point cloud;
the deformation estimation module is configured to input the point cloud and the velocity vectors of the three-dimensional points in the point cloud into the soft tissue deformation estimation model for processing to obtain a velocity field; wherein the soft tissue deformation estimation model is trained based on the method of any one of claims 1 to 15;
and the soft tissue deformation information acquisition module is used for integrating the speed field so as to obtain soft tissue deformation information according to the integrated result.
19. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 16 when the program is executed.
20. A computer readable storage medium, characterized in that a computer program is stored thereon, which program, when being executed by a processor, implements the method of any of claims 1 to 16.
CN202311141090.1A 2023-09-05 2023-09-05 Training method, soft tissue deformation estimation method, device, equipment and storage medium Pending CN117218074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311141090.1A CN117218074A (en) 2023-09-05 2023-09-05 Training method, soft tissue deformation estimation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117218074A 2023-12-12

Family

ID=89036336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311141090.1A Pending CN117218074A (en) 2023-09-05 2023-09-05 Training method, soft tissue deformation estimation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117218074A (en)

Similar Documents

Publication Publication Date Title
Du et al. Articulated multi-instrument 2-D pose estimation using fully convolutional networks
Sharp et al. Accurate, robust, and flexible real-time hand tracking
Sridhar et al. Fast and robust hand tracking using detection-guided optimization
Grasa et al. EKF monocular SLAM with relocalization for laparoscopic sequences
Qin et al. Surgical instrument segmentation for endoscopic vision with data fusion of cnn prediction and kinematic pose
Castle et al. Combining monoSLAM with object recognition for scene augmentation using a wearable camera
CN110660017A (en) Dance music recording and demonstrating method based on three-dimensional gesture recognition
US11605192B2 (en) Skeleton model update apparatus, skeleton model update method, and program
CN113034652A (en) Virtual image driving method, device, equipment and storage medium
US20200286286A1 (en) Tracking rigged polygon-mesh models of articulated objects
WO2017116814A1 (en) Calibrating object shape
CN110555426A (en) Sight line detection method, device, equipment and storage medium
Park et al. 3D hand tracking in the presence of excessive motion blur
CN114663575A (en) Method, apparatus and computer-readable storage medium for image processing
Li et al. Learning laparoscope actions via video features for proactive robotic field-of-view control
Luo et al. Evolutionarily optimized electromagnetic sensor measurements for robust surgical navigation
Bianchi et al. High-fidelity visuo-haptic interaction with virtual objects in multi-modal AR systems
CN117218074A (en) Training method, soft tissue deformation estimation method, device, equipment and storage medium
Shin et al. A single camera tracking system for 3D position, grasper angle, and rolling angle of laparoscopic instruments
WO2022195305A1 (en) Adaptive visualization of contextual targets in surgical video
CN115994944A (en) Three-dimensional key point prediction method, training method and related equipment
Luo et al. Diversity-enhanced condensation algorithm and its application for robust and accurate endoscope three-dimensional motion tracking
CN112099330A (en) Holographic human body reconstruction method based on external camera and wearable display control equipment
CN113052883B (en) Fused reality operation navigation registration system and method in large-scale dynamic environment
Chao et al. Surgical action detection based on path aggregation adaptive spatial network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination