WO2024050918A1 - Endoscope positioning method, electronic device, and non-transitory computer-readable storage medium - Google Patents
- Publication number: WO2024050918A1
- Application: PCT/CN2022/125009 (CN2022125009W)
- Authority: WIPO (PCT)
- Prior art keywords: image, depth, endoscope, virtual, network
Classifications
- A61B34/20 — Surgical navigation systems; devices for tracking or guiding surgical instruments, e.g. for frameless stereotaxis (under A61B34/00 — Computer-aided surgery; manipulators or robots specially adapted for use in surgery)
- A61B5/06 — Devices, other than those using radiation, for detecting or locating foreign bodies; determining position of probes within or on the body of the patient (under A61B5/00 — Measuring for diagnostic purposes; identification of persons)
- G06T7/00 — Image analysis
- G06T7/70 — Determining position or orientation of objects or cameras
- the present application relates to the technical field of endoscope positioning, and in particular to an endoscope positioning method, electronic device and non-transitory computer-readable storage medium.
- An endoscope is an inspection instrument that integrates traditional optics, ergonomics, precision machinery, modern electronics, mathematics, and software. It has image sensors, optical lenses, light sources, mechanical devices, etc. It can enter the stomach through the mouth or enter the body through other natural orifices. Endoscopes can reveal lesions that X-rays cannot show, so they have become a commonly used technique in medical examinations.
- Existing endoscope positioning methods include: (1) extracting the depth of the endoscopic image through the shape-from-shading (SFS) method and identifying the part with greater depth as the airway; after the airway is extracted, it is compared with the model reconstructed from the preoperative CT, and the current image is mapped to the airway branch where the camera is located, or the endoscope movement is estimated from changes in the deepest position of the airway in adjacent images. This approach works at airway bifurcations, but it is difficult for it to provide continuous positioning information when there is no airway, or only one airway, in the field of view. (2) Extracting the feature points of the endoscopic image through the Structure from Motion (SFM) method.
- This application provides an endoscope positioning method, electronic device and non-transitory computer-readable storage medium to overcome the shortcomings of the existing technology, namely the inability to provide continuous positioning information and the tendency to lose positioning, and to achieve rapid positioning, accurate positioning and continuous pose information for the endoscope.
- This application provides an endoscope positioning method, including:
- obtaining, based on a pre-trained depth extraction network, the depth image of the t-th frame image collected by the real endoscope, and obtaining the depth image d_{t-n} of the (t-n)-th frame target virtual image collected by the virtual endoscope, or the depth image of the (t-n)-th frame image collected by the real endoscope; wherein the virtual endoscope is determined based on the real endoscope;
- inputting the depth image of the t-th frame and the depth image d_{t-n}, or the depth images of the t-th and (t-n)-th frame real images, into the pre-trained depth registration network to obtain the relative pose estimation information of the real endoscope between the t-th frame image and the (t-n)-th frame image;
- superimposing the relative pose estimation information onto the pose estimation information of the real endoscope at the (t-n)-th frame image, to obtain the pose estimation information of the t-th frame image collected by the real endoscope, and positioning the real endoscope based on this pose estimation information.
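By way of illustration, the superposition step can be sketched as the composition of homogeneous transforms; the names `pose_to_matrix` and `superimpose` and the toy values are illustrative, not part of the application:

```python
import numpy as np

def pose_to_matrix(R, t):
    """Pack a rotation matrix R (3x3) and translation t (3,) into a 4x4 homogeneous pose."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def superimpose(T_prev, T_rel):
    """Superimpose the relative pose T_rel (frame t-n -> frame t) onto the
    absolute pose T_prev of frame t-n to get the absolute pose of frame t."""
    return T_prev @ T_rel

# toy example: previous pose and relative motion are both pure translations along z
T_prev = pose_to_matrix(np.eye(3), np.array([0.0, 0.0, 5.0]))
T_rel = pose_to_matrix(np.eye(3), np.array([0.0, 0.0, 1.0]))
T_curr = superimpose(T_prev, T_rel)
print(T_curr[:3, 3])  # absolute position of frame t
```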
- the depth extraction network is a depth extraction network based on a cycle generative adversarial network (Cycle GAN) and the pre-trained depth registration network;
- the cycle generative adversarial network includes a first generator, a first discriminator, a second generator and a second discriminator;
- the first generator is used to convert a depth image into a real-style endoscopic image;
- the second generator is used to convert a real-style endoscopic image into a depth image;
- the depth extraction network based on the cycle generative adversarial network and the depth registration network is trained in the following way:
- Establish a virtual model, obtain the depth image of the virtual image collected by the virtual endoscope in the virtual model, and obtain the virtual pose information corresponding to the virtual endoscope when collecting the virtual image;
- a loss function is obtained based on the weighted summation of cycle consistency loss, identity loss, generative adversarial loss, reconstruction loss, and geometric consistency loss that constrain the initial depth extraction network;
- the depth extraction network is a depth extraction network based on SfMLearner or a depth extraction network based on a cycle generative adversarial network;
- before the depth image of the t-th frame and the depth image d_{t-n}, or the depth images of the t-th and (t-n)-th frame real images, are input into the pre-trained depth registration network,
- the method further includes:
- the depth registration network is trained in the following manner:
- the loss function is obtained by performing a weighted sum of the translation loss and rotation loss between the relative pose estimation information and the virtual relative pose information;
- An endoscope positioning method provided according to this application also includes:
- a registration method based on an iterative optimization algorithm is used to run in parallel with the depth registration network, and the pose estimation information of the real endoscope is corrected according to the corrected pose obtained by the registration method based on the iterative optimization algorithm, to eliminate cumulative error.
- a method for obtaining the corrected pose according to the registration method based on an iterative optimization algorithm includes:
- taking the pose estimation information of the current corrected image as an initial value, optimizing and solving to obtain the corrected pose of the current corrected image.
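A minimal sketch of such an iterative optimization, assuming a sum-of-squared-differences similarity measure and a toy one-parameter "rendering"; `render_depth`, the learning rate, and the initial value are all illustrative stand-ins, not the application's actual procedure:

```python
import numpy as np

TRUE_Z = 2.0  # hypothetical true pose parameter

def render_depth(z):
    """Toy stand-in for rendering a virtual depth image at a candidate pose."""
    return np.full((8, 8), z)

# pretend this depth came from the depth extraction network for the corrected frame
observed = render_depth(TRUE_Z)

def similarity(z):
    """Sum-of-squared-differences similarity measure between rendered and observed depth."""
    return np.sum((render_depth(z) - observed) ** 2)

# iterative optimization, starting from the network's pose estimate as initial value
z = 0.5      # initial value: pose estimate of the current corrected image
lr = 1e-3    # step size for gradient descent
for _ in range(200):
    eps = 1e-4
    # numerical gradient of the similarity measure
    grad = (similarity(z + eps) - similarity(z - eps)) / (2 * eps)
    z -= lr * grad

print(round(z, 3))  # converges toward the corrected pose parameter
```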
- a method for obtaining the corrected pose according to the registration method based on an iterative optimization algorithm includes:
- An endoscope positioning method provided according to this application also includes:
- the RGB image feature extraction method is used to extract the feature information of the t-th frame image collected by the real endoscope, and the feature information of the t-th frame image and the depth image are input into the pre-trained depth registration network together;
- the RGB image feature extraction method is used to extract the feature information of the (t-n)-th frame image collected by the real endoscope, or the feature information of the (t-n)-th frame target virtual image collected by the virtual endoscope, wherein the feature information of the (t-n)-th frame target virtual image is extracted after texture mapping is applied to the (t-n)-th frame target virtual image;
- This application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, it implements any one of the above endoscope positioning methods.
- The present application also provides a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements any one of the above endoscope positioning methods.
- The present application also provides a computer program product, which includes a computer program; when the computer program is executed by a processor, it implements any one of the above endoscope positioning methods.
- The endoscope positioning method provided by this application can quickly, accurately and continuously obtain the current pose information of the real endoscope by using the pre-trained depth extraction network and depth registration network, when the initial pose of the real endoscope is known.
- The depth extraction network and depth registration network in this method can be used directly for different patients after training; they do not need to be retrained before surgery, which is convenient and time-saving.
- Figure 1 is one of the flow diagrams of the endoscope positioning method provided by this application.
- FIG. 2 is a schematic diagram of the depth extraction network structure provided by this application.
- Figure 3 is a schematic flow chart of the training method of the depth extraction network provided by this application.
- Figure 4a is a schematic diagram of the depth extraction network generator architecture provided by this application.
- FIG. 4b is a schematic diagram of the depth extraction network Resnet block architecture provided by this application.
- Figure 4c is a schematic diagram of the depth extraction network discriminator architecture provided by this application.
- Figure 5 is a schematic flow chart of the training method of the deep registration network provided by this application.
- Figure 6 is a schematic diagram of the deep registration network architecture provided by this application.
- Figure 7 is one of the flow diagrams of the method for obtaining the corrected pose using the registration method based on the iterative optimization algorithm provided by this application;
- Figure 8 is the second schematic flow chart of the method for obtaining the corrected pose using the registration method based on the iterative optimization algorithm provided by this application;
- Figure 9 is the second schematic flow chart of the endoscope positioning method provided by this application.
- Figure 10 is a schematic structural diagram of an electronic device provided by this application.
- the endoscope positioning method of the present application is described below in conjunction with Figures 1-9. As shown in Figure 1, the method includes:
- the endoscope positioning method can be used in the natural cavities of the human body such as the respiratory tract, biliary tract, and cerebral ventricle.
- A depth image, also known as a range image, is an image in which the distance (depth) from the image collector to each point in the scene is used as the pixel value. It directly reflects the geometry of the visible surfaces of the scene.
- A depth image can be converted into point cloud data through coordinate conversion, and regular point cloud data with the necessary information can also be back-calculated into depth image data.
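The depth-to-point-cloud conversion can be sketched with a pinhole back-projection; the intrinsic values below are illustrative, not the application's calibration result:

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a depth image into a point cloud using pinhole intrinsics
    (fx, fy: focal lengths in pixels; cx, cy: principal point)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

depth = np.ones((4, 4))  # toy depth image: every pixel 1 unit away
pts = depth_to_pointcloud(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
print(pts.shape)  # one 3D point per pixel
```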
- S102 Obtain the depth image d_{t-n} of the (t-n)-th frame target virtual image collected by the virtual endoscope at the (t-n)-th frame positioning pose in the target virtual model, or obtain, based on the pre-trained depth extraction network, the depth image of the (t-n)-th frame image collected by the real endoscope; wherein the virtual endoscope is determined based on the real endoscope.
- That is, either the depth image d_{t-n} of the (t-n)-th frame target virtual image collected by the virtual endoscope at the (t-n)-th frame positioning pose in the target virtual model is obtained, or the depth image of the (t-n)-th frame image collected by the real endoscope is obtained.
- the virtual endoscope moves together with the movement of the real endoscope in the target virtual model.
- The (t-n)-th frame positioning pose of the virtual endoscope in the target virtual model is the pose in the target virtual model corresponding to the positioning pose of the real endoscope when it collected the (t-n)-th frame image.
- Optionally, n ≤ 10; that is, the (t-n)-th frame is within ten frames before the current frame image, so that the (t-n)-th frame and the t-th frame have more similar feature points.
- n in this method is not fixed.
- The virtual endoscope needs to be determined based on the real endoscope, so the internal parameters of the virtual endoscope need to be consistent with those of the real endoscope.
- Illustratively, MATLAB software can be used to perform checkerboard calibration on the real endoscope to obtain its internal parameters.
- the internal reference of the real endoscope is:
- the image pixels are:
- width × height
- the parameters of the virtual endoscope are:
- The depth image of the t-th frame and the depth image d_{t-n} can be input into the pre-trained depth registration network to obtain the relative pose estimation information of the real endoscope between the t-th frame image and the (t-n)-th frame image.
- Alternatively, the depth images of the t-th and (t-n)-th frame real images can be input into the pre-trained depth registration network to obtain the same relative pose estimation information.
- S104 Superimpose the relative pose estimation information onto the pose estimation information of the real endoscope at the (t-n)-th frame image, to obtain the pose estimation information of the t-th frame image collected by the real endoscope, and position the real endoscope based on this pose estimation information.
- That is, by superimposing the obtained relative pose estimation information onto the pose estimation information of the (t-n)-th frame image, the pose estimation information of the t-th frame image collected by the real endoscope is obtained, and the real endoscope is positioned according to it.
- The pose information of the initial position of the real endoscope can be obtained when the depth registration network is initialized.
- The depth extraction network is a depth extraction network based on a cycle generative adversarial network (Cycle GAN) and the pre-trained depth registration network;
- the cycle generative adversarial network includes a first generator, a first discriminator, a second generator and a second discriminator;
- the first generator is used to convert a depth image into a real-style endoscopic image;
- the second generator is used to convert a real-style endoscopic image into a depth image;
- the depth extraction network based on the cycle generative adversarial network and the depth registration network is trained in the following way:
- S301 Establish a virtual model, obtain the depth image of the virtual image collected by the virtual endoscope in the virtual model, and obtain the virtual pose information corresponding to the virtual endoscope when collecting the virtual image.
- In this structure, the depth registration network needs to be trained first, and the depth extraction network applies the trained depth registration network.
- the style of an image refers to the texture, color, and visual patterns at different spatial scales in the image.
- Using the virtual pose information to supervise the training of the depth extraction network can improve the robustness of the depth extraction network.
- Virtual models include, for example, virtual models of the respiratory tract and of the biliary tract; a corresponding virtual model can be established according to the needs of use.
- the target body corresponding to the preset real endoscopic image is consistent with the target body corresponding to the virtual model.
- For example, if the virtual model is a virtual model of the respiratory tract established based on the respiratory tract, the preset real endoscopic images are also images collected from the respiratory tract.
- S303 Use the preset real endoscopic image, the depth image of the virtual image, and the virtual pose information as training data to perform weakly supervised training on the initial depth extraction network.
- the depth image and virtual pose information obtained in the above steps are used as training data to perform weakly supervised training on the initial depth extraction network.
- S304 Obtain a loss function based on the weighted summation of cycle consistency loss, identity loss, generative adversarial loss, reconstruction loss, and geometric consistency loss that constrain the initial depth extraction network.
- Cycle GAN includes a first generator G_image, a first discriminator D_image, a second generator G_depth and a second discriminator D_depth. The depth image domain and the endoscopic image domain are denoted Z and X respectively.
- For an endoscopic image x ∈ X, the depth extraction algorithm aims to learn a mapping G_depth: X → Z. Next, the mapping G_image: Z → X reconstructs the result back to domain X, and the difference from the original x_t after reconstruction to domain X is penalized. The conversion from domain Z to domain X is similar. In this reconstruction loop, the network model imposes a cycle consistency loss on G_image and G_depth: L_cyc = E_{y∼p(X)}[‖G_image(G_depth(y)) − y‖_1] + E_{y∼p(Z)}[‖G_depth(G_image(y)) − y‖_1]
- y is a variable, representing a certain frame of image
- p represents the probability distribution
- The discriminators D_image and D_depth respectively learn to determine whether the input endoscopic image and depth image are real or generated, while the generators try to fool the discriminators by generating images that the discriminators judge to be real; therefore, a generative adversarial loss is introduced, for which the LS-GAN loss can be used:
- the symbol • is a placeholder for "image" or "depth";
- y ∼ p(data) represents a sample drawn from the distribution of domain X or domain Z.
- the motion trajectory of the virtual endoscope can be collected from the virtual model, and the pose and corresponding depth image of the virtual endoscope at each moment can be recorded.
- The recorded poses and corresponding depth images impose view consistency constraints between the generated image frames collected by the real endoscope; on top of the adversarial loss, an image view consistency loss based on Perspective-n-Point (PnP) is added.
- t_{t-n,t} = (t_x, t_y, t_z) is the translation vector of the camera from time t-n to time t;
- the camera rotation matrix R_{t-n,t} from time t-n to time t is calculated from the Euler angles (α, β, γ), its entries expressed through the shorthand s_1 = sin α, s_2 = sin β, s_3 = sin γ and c_1 = cos α, c_2 = cos β, c_3 = cos γ.
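A sketch of building the rotation matrix from the three Euler angles; the Z-Y-X composition order below is an assumption (the application fixes its own convention in the original formula):

```python
import numpy as np

def euler_to_matrix(alpha, beta, gamma):
    """Rotation matrix from Euler angles (alpha, beta, gamma).
    Assumes Z-Y-X composition: R = Rz(gamma) @ Ry(beta) @ Rx(alpha)."""
    c1, s1 = np.cos(alpha), np.sin(alpha)
    c2, s2 = np.cos(beta), np.sin(beta)
    c3, s3 = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, c1, -s1], [0, s1, c1]])
    Ry = np.array([[c2, 0, s2], [0, 1, 0], [-s2, 0, c2]])
    Rz = np.array([[c3, -s3, 0], [s3, c3, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

R = euler_to_matrix(0.0, 0.0, np.pi / 2)  # 90 degrees about the z axis
print(np.round(R @ np.array([1.0, 0.0, 0.0]), 6))  # the x axis maps to the y axis
```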
- View consistency is also imposed on x_{t-n} and x_t and the generated depth maps. Although the relative pose of the endoscope cannot be collected for real images, the pre-trained depth registration network provides a depth-based pose estimation algorithm, so the relative pose of the corresponding endoscope can be calculated from the two generated depth maps. The pre-trained pose estimation network is therefore loaded during training to estimate the relative motion p_{t-n,t} of the endoscope. An ideal depth image estimate should contain the information that allows the pose estimation network to capture the motion of the endoscope, which yields the reconstruction loss obtained from view consistency:
- The inconsistency z_diff between the two generated depth maps is defined as:
- the geometric consistency loss is defined as:
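A sketch of a geometric consistency term between two depth maps; the normalized-difference form of z_diff below follows common practice in self-supervised depth estimation and is an assumption, not the application's exact formula:

```python
import numpy as np

def geometry_consistency_loss(d_a, d_b_warped):
    """Geometric consistency between depth map d_a and the second frame's
    depth warped into the same view. z_diff is the per-pixel normalized
    absolute difference (assumed form); the loss is its mean."""
    z_diff = np.abs(d_a - d_b_warped) / (d_a + d_b_warped)
    return z_diff.mean()

d_a = np.full((4, 4), 2.0)
d_b = np.full((4, 4), 2.0)
print(geometry_consistency_loss(d_a, d_b))    # identical depths -> 0.0
d_b2 = np.full((4, 4), 6.0)
print(geometry_consistency_loss(d_a, d_b2))   # |2-6|/(2+6) = 0.5 per pixel
```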
- the total loss function of depth extraction network training is:
- λ, μ, ν, ω_1, ω_2 and η are hyperparameters that adjust the weight of each loss.
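The weighted summation can be sketched as follows; the default weights echo the values reported later in the text (λ, μ, ν = 10, 5, 1 and ω_1, ω_2, η = 0.3, 5, 5), but which weight attaches to which term is our reading and should be treated as an assumption:

```python
def total_loss(l_cyc, l_idt, l_gan, l_rec1, l_rec2, l_geo,
               lam=10.0, mu=5.0, nu=1.0, w1=0.3, w2=5.0, eta=5.0):
    """Weighted sum of the loss terms: cycle consistency, identity,
    generative adversarial, the two view-consistency/reconstruction terms,
    and geometric consistency."""
    return (lam * l_cyc + mu * l_idt + nu * l_gan
            + w1 * l_rec1 + w2 * l_rec2 + eta * l_geo)

# hypothetical per-term values for one batch
print(total_loss(0.8, 0.3, 0.5, 0.2, 0.2, 0.1))
```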
- S305 Optimize the loss function and update the parameters of the initial depth extraction network based on the cycle generative adversarial network and the depth registration network for a preset number of rounds, to obtain the trained depth extraction network based on the cycle generative adversarial network and the depth registration network.
- Figures 4(a), 4(b) and 4(c) show schematic diagrams of the depth extraction network architecture: (a) the generator, (b) the Resnet block in the generator, and (c) the discriminator.
- The dimensionality of the tensors shown in the figure is based on an input image of size 1×256×256; Res(256, 256) represents a Resnet block with 256 input and output channels; IN represents the Instance Norm layer, and Leaky ReLU represents the Leaky ReLU activation function.
- The depth extraction network can be trained with 7 preset real endoscopic video segments and 8 segments of data collected by virtual endoscopy, comprising multiple preset real endoscopic images, 2,187 depth images and the corresponding virtual endoscope poses.
- the generator is a conventional encoder-decoder architecture, in which the bottleneck layer consists of six Resnet blocks and the discriminator consists of five convolutional layers.
- the Adam optimizer is used to train for 100 rounds.
- ω_1, ω_2 and η are set to 0.3, 5 and 5 respectively.
- λ, μ and ν are set to 10, 5 and 1 respectively throughout the training process.
- the parameters of the depth extraction network are updated by continuously optimizing the loss function obtained in the above steps until the final depth extraction network is determined by the preset number of rounds.
- the preset number of rounds can be 50 to 300 rounds, and further can be 100 rounds to 200 rounds.
- The trained depth extraction network can generate depth images with clearer outlines than depth extraction networks such as SfMLearner. Compared with using a depth extraction network such as Cycle GAN alone, it ensures that the structure of the input image is not changed, and it can generate depth images with a stable, known scale (essentially the same scale as the training data).
- the depth extraction network is a depth extraction network based on SfMLearner or a depth extraction network based on a cycle generative adversarial network;
- before the depth image of the t-th frame and the depth image d_{t-n}, or the depth images of the t-th and (t-n)-th frame real images, are input into the pre-trained depth registration network, the method further includes:
- the depth estimation network estimates the depth information z from an input endoscopic image
- the pose network estimates the relative poses T and R of the camera between the two images through the input two endoscopic images.
- the depth estimation network can estimate the depth images of the two frames;
- the pose network can estimate the relative camera motion t_{t-n,t} and R_{t-n,t}.
- warping refers to manipulating the image to deform the pixels in the image.
- the geometric consistency loss is defined as:
- the loss function can include the following losses:
- For an endoscopic image x ∈ X, the depth extraction algorithm aims to learn a mapping G_depth: X → Z;
- the mapping G_image: Z → X then rebuilds the result back to domain X, completing the loop;
- the conversion from domain Z to domain X is similar;
- the network model imposes a cycle consistency loss on G_image and G_depth:
- p represents the probability distribution, and E denotes the expectation.
- The discriminators D_image and D_depth respectively learn to determine whether the input endoscopic image and depth image are real or generated, while the generators try to fool the discriminators by generating images that the discriminators judge to be real.
- Therefore, a generative adversarial loss is introduced, here the LS-GAN loss:
- the symbol • is a placeholder for "image" or "depth";
- y ∼ p(data) represents a sample drawn from the distribution of domain X or domain Z.
- The scale of the depth images obtained by the above two depth extraction networks is ambiguous and unitless, so it needs to be calibrated.
- Two specific calibration methods are available; at least one of them can be used when calibrating:
- (1) The field of view of the real endoscope is segmented according to a depth threshold, and the above-threshold region is matched to the region around the corresponding depth peak in the lumen of the virtual model established before surgery. The depths of the matched regions are compared to obtain the scale of the real endoscope. For example, if the depth threshold is set to 5 and the above-threshold portion of the depth image extracted by the real endoscope segments into a circle 10 pixels in diameter, the corresponding contour in the virtual depth image can likewise be found as a circle of 10-pixel diameter around its depth peak, and comparing the depth values of the two regions yields the scale.
- (2) The pose network and the depth network have the same ambiguous scale, so a scale found for one applies to the other.
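A sketch of the region-matching scale calibration; the threshold value and the mean-depth comparison are illustrative choices, not the application's exact procedure:

```python
import numpy as np

def calibrate_scale(real_depth, virtual_depth, threshold=5.0):
    """Estimate the metric scale of a unitless depth map by comparing its
    above-threshold (deep lumen) region with the matched region of the
    metrically scaled virtual-model depth map."""
    real_region = real_depth[real_depth > threshold]
    virt_region = virtual_depth[virtual_depth > threshold]
    return virt_region.mean() / real_region.mean()

real = np.full((6, 6), 1.0)
real[2:4, 2:4] = 8.0        # unitless deep-lumen peak in the extracted depth image
virtual = np.full((6, 6), 1.0)
virtual[2:4, 2:4] = 16.0    # same lumen region in the metric virtual model
print(calibrate_scale(real, virtual))  # scale factor relating the two depth maps
```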
- the deep registration network is trained in the following manner:
- S501 Establish a virtual model, obtain the depth image of the virtual image collected by the virtual endoscope in the virtual model, and obtain the corresponding virtual pose information when the virtual endoscope collects the virtual image.
- the deep registration network is a deep neural network in the form of an encoder-decoder.
- the network input is two frames of depth information.
- the encoder uses the structure of the FlowNetC encoder (the optical flow extracted by FlowNet approximates the motion field).
- the decoder uses several layers of CNN (convolutional neural network) to finally transform the encoded information into the 6-DOF pose parameters (i.e., 3D translation and 3D Euler angles) as output.
- S502 Input the depth image of the virtual image into an initial depth registration network, and the initial depth registration network outputs the relative pose estimation information of the virtual endoscope when two adjacent frames of virtual images are collected.
- the depth image of the virtual image obtained in the above steps is input into the initial depth registration network for weak supervision training.
- The output of the initial depth registration network gives the relative pose estimation information of the virtual endoscope when collecting two adjacent frames of virtual images.
- S503 Use the virtual pose information as a training true value, and obtain the virtual relative pose information when the virtual endoscope collects the two adjacent frames of virtual images according to the virtual pose information.
- The virtual pose information is used as the training ground truth.
- From the virtual pose information, the virtual relative pose information of the virtual endoscope when collecting the two adjacent frames of virtual images can be obtained.
- In this way, both the ground-truth relative pose information and the relative pose estimation information for two adjacent frames are available.
- S504 Obtain the loss function by performing a weighted sum of the translation loss and rotation loss between the relative pose estimation information and the virtual relative pose information.
- the translation loss and rotation loss between the relative pose estimation information of the virtual endoscope and the real relative pose are calculated respectively, and the translation loss and rotation loss are weighted and summed to obtain the final loss function:
- L_t is the translation loss; T_{t-n,t} and its estimate are the translation vectors in the ground-truth relative pose information and the relative pose estimation information respectively;
- L_r is the rotation loss; R_{t-n,t} and its estimate are the rotation vectors in the ground-truth relative pose information and the relative pose estimation information respectively;
- β is a hyperparameter used to adjust the proportion between the rotation loss and the translation loss.
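The weighted registration loss can be sketched as follows; the L2 form of each term is an assumption, while β = 100 matches the hyperparameter value given for training below:

```python
import numpy as np

def pose_loss(t_true, t_pred, r_true, r_pred, beta=100.0):
    """Weighted sum of translation and rotation losses between the
    ground-truth and estimated relative poses (L2 form assumed)."""
    l_t = np.linalg.norm(t_true - t_pred)   # translation loss
    l_r = np.linalg.norm(r_true - r_pred)   # rotation (Euler-vector) loss
    return l_t + beta * l_r

# hypothetical ground truth vs. network estimate
t_true = np.array([1.0, 0.0, 0.0]); t_pred = np.array([1.0, 0.0, 0.3])
r_true = np.zeros(3);               r_pred = np.array([0.0, 0.01, 0.0])
print(pose_loss(t_true, t_pred, r_true, r_pred))  # 0.3 + 100 * 0.01
```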
- The pose estimation network is trained with pose and depth images collected along 37 virtual endoscope trajectories, comprising 11,904 frames.
- the network uses a pre-trained FlowNetC encoder to regress pose vectors with three convolutional blocks.
- The network is trained using the Adam optimizer with an initial learning rate of 1e-5 for 300 epochs; β is set to 100.
- S505 Optimize the loss function and update the parameters of the initial depth registration network until convergence to obtain the depth registration network.
- The depth registration network learns the endoscope pose transformation parameters between two input depth images through deep learning, thereby updating the endoscope pose for each input endoscopic image.
- This depth registration network is based on depth registration rather than image intensity, allowing the algorithm to have no additional requirements for the rendering of virtual images acquired by virtual endoscopes in the simulator.
- the deep learning algorithm estimates the pose transformation directly, allowing the algorithm to run quickly and in real time to produce real-time positioning results.
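The per-frame update described above, chaining the network's relative pose estimate onto the previous absolute estimate, can be sketched with 4x4 homogeneous transforms; representing poses this way and the left-multiplication order are assumptions made for illustration.

```python
def mat4_mul(a, b):
    # multiply two 4x4 homogeneous transform matrices
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def chain_pose(prev_pose, rel_pose):
    # superimpose the relative pose estimate onto the previous absolute pose
    return mat4_mul(rel_pose, prev_pose)

def translation(tx, ty, tz):
    # helper: pure-translation homogeneous transform
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

# moving by (1,0,0) and then by a relative (0,2,0) yields a (1,2,0) pose
pose = chain_pose(translation(1.0, 0.0, 0.0), translation(0.0, 2.0, 0.0))
```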
- in some embodiments, the method further includes:
- a registration method based on an iterative optimization algorithm runs in parallel with the depth registration network; the corrected pose obtained by this registration method is used to correct the pose estimation information of the real endoscope and eliminate the cumulative error.
- the registration method based on the iterative optimization algorithm is slow, so it runs in parallel with the depth registration network for pose correction; it corrects the pose estimation information of the real endoscope lazily, so that the cumulative error does not keep growing, improving positioning accuracy.
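The lazy parallel correction described above can be sketched with a single background worker: the slow iterative registration runs off the main loop, and its corrected pose overwrites the fast network's estimate for the frame it was started on whenever it becomes available. The class and its interface are hypothetical, for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

class LazyCorrector:
    """Runs a slow iterative registration in the background and applies its
    corrected pose when ready, so estimates are corrected lazily rather than
    frame by frame."""
    def __init__(self, slow_register):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._slow_register = slow_register
        self._pending = None  # (frame index k, future)

    def submit(self, k, frame):
        # start a new slow registration only if the previous one has finished
        if self._pending is None:
            self._pending = (k, self._pool.submit(self._slow_register, frame))

    def apply_if_ready(self, poses):
        # overwrite the fast estimate for frame k once the corrected pose is
        # available, removing the drift accumulated up to that frame
        if self._pending is not None and self._pending[1].done():
            k, fut = self._pending
            poses[k] = fut.result()
            self._pending = None
```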
- a method for obtaining a corrected pose according to a registration method based on an iterative optimization algorithm includes:
- S701 Obtain the k-th frame image collected by the real endoscope as the current corrected image, and obtain the depth image of the k-th frame image through the depth extraction network, where k ≤ t.
- this correction method runs more slowly than the network that estimates the real endoscope's pose, so when correcting in parallel, not every frame is corrected.
- the k-th frame image with k ≤ t is taken as the current corrected image; that is, the pose estimation information of the real endoscope corresponding to this image frame has already been estimated.
- S702 Obtain the pose estimation information of the k-th frame image collected by the real endoscope based on the depth registration network.
- the pose estimation information of the k-th frame image has already been estimated and can be obtained directly.
- S703 Use the current corrected image, or the depth image of the k-th frame image, or both together, to perform semantic segmentation of the lumen images in the real endoscope's field of view.
- segmentation here refers to the regional segmentation, i.e. partitioning, of all lumen images in the detection field of view.
- the lumen can be segmented using the depth image, the RGB image x_t, or the RGBD image (x_t together with the depth image).
- the segmentation method can use a depth threshold to segment the depth image, or a network can be trained to segment the lumens in RGB or RGBD images.
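A minimal sketch of the depth-threshold variant mentioned above: pixels deeper than a threshold are labeled as lumen, partitioning the field of view. The fixed threshold value is an assumption; as the text notes, a trained network over RGB or RGBD input could be used instead.

```python
def segment_lumen(depth, threshold):
    # label pixels deeper than the threshold as lumen (1), the rest as wall (0)
    return [[1 if d > threshold else 0 for d in row] for row in depth]

# toy 2x2 depth map: the bottom row is much deeper, so it is labeled as lumen
mask = segment_lumen([[0.2, 0.3],
                      [5.0, 6.0]], threshold=1.0)
```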
- S704 Based on the image similarity measure and the semantic segmentation similarity measure, optimize with the pose estimation information as the initial value to obtain the corrected pose of the current corrected image.
- this method is a correction method based on image registration.
- the segmentation operation is denoted Seg(·); the corrected pose of the real endoscope at time k, with its corresponding airway segmentation result, is obtained by starting the optimization from the pose estimation information as the initial value.
- the optimization process can be described as maximizing, over the candidate pose, the sum of the image similarity measure and the segmentation similarity measure, where:
- SIM1(·) is the image similarity measure;
- SIM2(·) is the segmentation similarity measure;
- P′_t is the pose variable being optimized;
- Seg(P′_t) is the result of segmenting the image or depth map rendered when the virtual endoscope is at the virtual pose P′_t.
- this compensates for the case in which two lumens, one deep and one shallow, appear in the field of view when only an image similarity measure is used: similarity measures such as NCC (Normalized Cross-Correlation) focus on aligning the deep lumen in the two depth maps and ignore the features of the shallow lumen, leading to inaccurate results.
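To make the combined objective of S704 concrete, here is a sketch pairing NCC as the image similarity SIM1 with a Dice overlap as the segmentation similarity SIM2; using Dice and weighting the terms equally are assumptions, since the text only names NCC as an example.

```python
import math

def ncc(a, b):
    # Normalized Cross-Correlation between two equally sized depth maps (flattened)
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = math.sqrt(sum((x - ma) ** 2 for x in a) / n) + 1e-8
    sb = math.sqrt(sum((x - mb) ** 2 for x in b) / n) + 1e-8
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n * sa * sb)

def dice(m1, m2):
    # Dice overlap between two binary lumen segmentations (stand-in for SIM2)
    inter = sum(1 for x, y in zip(m1, m2) if x and y)
    return 2.0 * inter / (sum(m1) + sum(m2) + 1e-8)

def objective(real_depth, virt_depth, real_seg, virt_seg, w=1.0):
    # combined score to maximize over the candidate pose P'_t:
    # image similarity SIM1 (NCC) plus segmentation similarity SIM2 (Dice)
    return ncc(real_depth, virt_depth) + w * dice(real_seg, virt_seg)
```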
- a method for obtaining a corrected pose according to a registration method based on an iterative optimization algorithm includes:
- S801 Obtain the k-th frame image collected by the real endoscope as the current corrected image, and obtain the depth image of the k-th frame image through the depth extraction network, where k ≤ t.
- this correction method runs more slowly than the network that estimates the real endoscope's pose, so when correcting in parallel, not every frame is corrected; when a correction is performed, the k-th frame image with k ≤ t is taken as the current corrected image.
- the virtual endoscope moves together with the movement of the real endoscope in the target virtual model.
- the positioning pose of the virtual endoscope at the k-th frame in the target virtual model corresponds, within the target virtual model, to the pose of the real endoscope when collecting the k-th frame image.
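The depth images used by this point-cloud variant can be back-projected with a standard pinhole camera model; the intrinsics (fx, fy, cx, cy) are assumed to come from endoscope calibration, which the text does not spell out here.

```python
def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (2D list, one depth value per pixel) into a
    3D point cloud with a pinhole camera model, as done before registering
    the real and virtual point clouds."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            x = (u - cx) * z / fx   # pixel column -> camera X
            y = (v - cy) * z / fy   # pixel row    -> camera Y
            points.append((x, y, z))
    return points

pts = depth_to_pointcloud([[1.0, 2.0]], fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```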
- the method further includes:
- S901 Use an RGB image feature extraction method to extract the feature information of the t-th frame image collected by the real endoscope, and input the feature information of the t-th frame image together with its depth image into the pre-trained depth registration network;
- S902 Use the RGB image feature extraction method to extract the feature information of the (t-n)-th frame image collected by the real endoscope, or to extract the feature information of the (t-n)-th frame target virtual image collected by the virtual endoscope, where the feature information of the (t-n)-th frame target virtual image is extracted after texture mapping is applied to the (t-n)-th frame target virtual image;
- S903 Input the feature information of the (t-n)-th frame target virtual image together with the depth image d_{t-n}, or the feature information of the (t-n)-th frame image together with its depth image, into the pre-trained depth registration network.
- RGB feature extraction is thus integrated into the relative pose calculation for real-time positioning.
- this input compensates for the difficulty of estimating the endoscope pose when the depth map structure is monotonous, and assists in estimating the motion of the real endoscope.
- texture mapping must be applied to the virtual endoscope image, and the texture should be close to that of the images collected by the real endoscope.
- given the initial pose of the real endoscope, the endoscope positioning method provided by this application can quickly, accurately and continuously obtain the current pose information of the real endoscope by using the pre-trained depth extraction network and depth registration network.
- once trained, the depth extraction network and depth registration network in this method can be used directly for different patients; they do not need to be retrained before surgery, which is convenient and saves time.
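Putting the pieces together, the overall loop can be sketched with toy one-dimensional poses and stand-in networks; `depth_net` and `registration_net` are hypothetical placeholders for the pre-trained depth extraction and depth registration networks.

```python
def track_endoscope(frames, initial_pose, depth_net, registration_net, n=1):
    """Toy end-to-end loop: extract a depth image per frame, estimate the
    relative pose between frame t and frame t-n, and superimpose it onto the
    earlier absolute estimate. Poses are scalars here purely for illustration;
    the sketch assumes n=1 so every index t-n is valid."""
    poses = [initial_pose]              # the initial pose is known
    depths = [depth_net(frames[0])]
    for t in range(1, len(frames)):
        depths.append(depth_net(frames[t]))
        rel = registration_net(depths[t], depths[t - n])  # relative pose
        poses.append(rel + poses[t - n])                  # superposition
    return poses

# stand-ins: depth is the frame itself, relative pose is the depth difference
trajectory = track_endoscope([0, 1, 2, 3], initial_pose=10,
                             depth_net=lambda f: f,
                             registration_net=lambda a, b: a - b)
```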
- Figure 10 illustrates a schematic diagram of the physical structure of an electronic device.
- the electronic device may include: a processor (processor) 1010, a communications interface (Communications Interface) 1020, a memory (memory) 1030 and a communication bus 1040.
- the processor 1010, the communication interface 1020, and the memory 1030 complete communication with each other through the communication bus 1040.
- the processor 1010 can call logical instructions in the memory 1030 to perform an endoscope positioning method, which includes: obtaining, based on a pre-trained depth extraction network, the depth image of the current frame, i.e. the t-th frame image, collected by the real endoscope; obtaining the depth image d_{t-n} of the (t-n)-th frame target virtual image collected by the virtual endoscope at the (t-n)-th frame positioning pose in the target virtual model, or obtaining, based on the pre-trained depth extraction network, the depth image of the (t-n)-th frame image collected by the real endoscope, where the virtual endoscope is determined based on the real endoscope; inputting the depth image of the t-th frame image and the depth image d_{t-n}, or the depth image of the t-th frame image and the depth image of the (t-n)-th frame image, into a pre-trained depth registration network to obtain the relative pose estimation information of the real endoscope between collecting the t-th frame image and the (t-n)-th frame image; and superimposing the relative pose estimation information on the pose estimation information of the real endoscope when collecting the (t-n)-th frame image, to obtain the pose estimation information of the real endoscope when collecting the t-th frame image, and positioning the real endoscope based on this pose estimation information, where the pose information of the initial position of the real endoscope is known.
- the above-mentioned logical instructions in the memory 1030 can be implemented in the form of software functional units and can be stored in a computer-readable storage medium when sold or used as an independent product.
- the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
- the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which can be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.
- the aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks and other media that can store program code.
- the present application also provides a computer program product.
- the computer program product includes a computer program.
- the computer program can be stored on a non-transitory computer-readable storage medium.
- the computer can execute the endoscope positioning method provided by each of the above methods.
- the method includes: obtaining, based on a pre-trained depth extraction network, the depth image of the current frame, i.e. the t-th frame image, collected by the real endoscope; obtaining the depth image d_{t-n} of the (t-n)-th frame target virtual image collected by the virtual endoscope at the (t-n)-th frame positioning pose in the target virtual model, or obtaining, based on the pre-trained depth extraction network, the depth image of the (t-n)-th frame image collected by the real endoscope, where the virtual endoscope is determined based on the real endoscope; inputting the depth image of the t-th frame image and the depth image d_{t-n}, or the depth image of the t-th frame image and the depth image of the (t-n)-th frame image, into a pre-trained depth registration network to obtain the relative pose estimation information of the real endoscope between collecting the t-th frame image and the (t-n)-th frame image; and superimposing the relative pose estimation information on the pose estimation information of the real endoscope when collecting the (t-n)-th frame image, to obtain the pose estimation information of the real endoscope when collecting the t-th frame image, and positioning the real endoscope based on this pose estimation information, where the pose information of the initial position of the real endoscope is known.
- the present application also provides a non-transitory computer-readable storage medium on which a computer program is stored.
- the computer program, when executed by a processor, implements the endoscope positioning method provided by each of the above methods.
- the method includes: obtaining, based on a pre-trained depth extraction network, the depth image of the current frame, i.e. the t-th frame image, collected by the real endoscope; obtaining the depth image d_{t-n} of the (t-n)-th frame target virtual image collected by the virtual endoscope at the (t-n)-th frame positioning pose in the target virtual model, or obtaining, based on the pre-trained depth extraction network, the depth image of the (t-n)-th frame image collected by the real endoscope, where the virtual endoscope is determined based on the real endoscope; inputting the depth image of the t-th frame image and the depth image d_{t-n}, or the depth image of the t-th frame image and the depth image of the (t-n)-th frame image, into a pre-trained depth registration network to obtain the relative pose estimation information of the real endoscope between collecting the t-th frame image and the (t-n)-th frame image; and superimposing the relative pose estimation information on the pose estimation information of the real endoscope when collecting the (t-n)-th frame image, to obtain the pose estimation information of the real endoscope when collecting the t-th frame image, and positioning the real endoscope based on this pose estimation information, where the pose information of the initial position of the real endoscope is known.
- the device embodiments described above are only illustrative.
- the units described as separate components may or may not be physically separated.
- the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of this embodiment. Persons of ordinary skill in the art can understand and implement the method without creative effort.
- each embodiment can be implemented by software plus a necessary general hardware platform, and of course, it can also be implemented by hardware.
- the computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., including a number of instructions to cause a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in various embodiments or certain parts of the embodiments.
Abstract
Provided in the present application are an endoscope positioning method, an electronic device, and a non-transitory computer-readable storage medium. The method comprises: on the basis of a depth extraction network, acquiring a depth image (I) of a t-th image frame collected by a real endoscope; acquiring a depth image dt-n of a (t-n)-th target virtual image frame collected by a virtual endoscope, or on the basis of the depth extraction network, acquiring a depth image (II) of a (t-n)-th image frame collected by the real endoscope; inputting the depth image (I) and the depth image dt-n or inputting the depth image (I) and the depth image (II) into a depth registration network to obtain the relative position and orientation estimation information (III) of the real endoscope; and superposing the relative position and orientation estimation information (III) with the position and orientation estimation information (IIII) of the real endoscope collecting the (t-n)-th image frame, so as to obtain the position and orientation estimation information (IV) of the real endoscope collecting the t-th image frame. The method can quickly, accurately and continuously obtain the current position and orientation information of the real endoscope.
Description
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 202211086312.X, filed on September 6, 2022 and entitled "Endoscope positioning method, electronic device, and non-transitory computer-readable storage medium", which is incorporated herein by reference in its entirety.
The present application relates to the technical field of endoscope positioning, and in particular to an endoscope positioning method, an electronic device and a non-transitory computer-readable storage medium.
An endoscope is an inspection instrument that integrates traditional optics, ergonomics, precision machinery, modern electronics, mathematics and software. It comprises image sensors, optical lenses, illumination sources, mechanical devices and so on, and can enter the stomach through the mouth or enter the body through other natural orifices. Endoscopes can reveal lesions that X-rays cannot show, so they have become a commonly used technical means in medical examinations.
Currently, commonly used methods for endoscope positioning include: (1) The shape-from-shading (SFS) method extracts depth from the endoscopic image and identifies the deeper parts as airways. After the airways are extracted, they are compared against the model reconstructed from preoperative CT, and the current image is mapped to the airway branch in which the camera is located, or the endoscope motion is estimated from the change of the deepest airway position between adjacent images. This method may work at airway bifurcations, but it is difficult to provide continuous endoscope positioning information when there is no airway, or only one airway, in the field of view. (2) The structure-from-motion (SFM) method extracts feature points from the endoscopic images, matches the feature points one by one between two adjacent frames, and solves Perspective-n-Point (PnP) accordingly to estimate the endoscope pose. When the endoscopic image has few or no feature points, PnP cannot be solved, and the endoscope positioning is lost. (3) The 2D/3D registration method registers the 2D image captured by the endoscope to the virtual model reconstructed before surgery, thereby obtaining the position of the endoscope in the model. This method is based on an iterative optimization algorithm, so obtaining the positioning of each frame requires a long computation time; however, the pose of the endoscope changes rapidly during actual inspection, and an excessive computation time easily causes positioning loss.
Summary of the invention
The present application provides an endoscope positioning method, an electronic device and a non-transitory computer-readable storage medium, to overcome the defects of the prior art of being unable to provide continuous positioning information and easily causing positioning loss, and to achieve fast and accurate positioning of the endoscope while obtaining continuous pose information.
The present application provides an endoscope positioning method, including:
obtaining, based on a pre-trained depth extraction network, the depth image of the t-th frame image collected by a real endoscope;
obtaining the depth image d_{t-n} of the (t-n)-th frame target virtual image collected by a virtual endoscope at the (t-n)-th frame positioning pose in a target virtual model, or obtaining, based on the pre-trained depth extraction network, the depth image of the (t-n)-th frame image collected by the real endoscope, wherein the virtual endoscope is determined based on the real endoscope;
inputting the depth image of the t-th frame image and the depth image d_{t-n}, or the depth image of the t-th frame image and the depth image of the (t-n)-th frame image, into a pre-trained depth registration network to obtain the relative pose estimation information of the real endoscope between collecting the t-th frame image and the (t-n)-th frame image;
superimposing the relative pose estimation information on the pose estimation information of the real endoscope when collecting the (t-n)-th frame image, to obtain the pose estimation information of the real endoscope when collecting the t-th frame image, and positioning the real endoscope according to this pose estimation information.
According to the endoscope positioning method provided by this application, the depth extraction network is a depth extraction network based on a recurrent generative adversarial network and the pre-trained depth registration network. The recurrent generative adversarial network includes a first generator, a first discriminator, a second generator and a second discriminator; the first generator is used to convert a depth image into a real-style endoscopic image, and the second generator is used to convert a real-style endoscopic image into a depth image.
The depth extraction network based on the recurrent generative adversarial network and the depth registration network is trained in the following way:
establishing a virtual model, obtaining the depth images of the virtual images collected by the virtual endoscope in the virtual model, and obtaining the virtual pose information of the virtual endoscope when collecting the virtual images;
obtaining preset real endoscopic images;
using the preset real endoscopic images, the depth images of the virtual images and the virtual pose information as training data to perform weakly supervised training on the initial depth extraction network;
obtaining the loss function as a weighted sum of the cycle consistency loss, identity loss, generative adversarial loss, reconstruction loss and geometric consistency loss that constrain the initial depth extraction network;
optimizing the loss function and updating the parameters of the initial depth extraction network based on the recurrent generative adversarial network and the depth registration network, for a preset number of rounds, to obtain the depth extraction network based on the recurrent generative adversarial network and the depth registration network.
According to the endoscope positioning method provided by this application, the depth extraction network is a depth extraction network based on SfMLearner or a depth extraction network based on a recurrent generative adversarial network.
Before inputting the depth image of the t-th frame image and the depth image d_{t-n}, or the depth image of the t-th frame image and the depth image of the (t-n)-th frame image, into the pre-trained depth registration network, the method further includes:
performing scale calibration on the depth image of the t-th frame image and the depth image of the (t-n)-th frame image to determine their units.
According to the endoscope positioning method provided by this application, the depth registration network is trained in the following manner:
establishing a virtual model, obtaining the depth images of the virtual images collected by the virtual endoscope in the virtual model, and obtaining the corresponding virtual pose information when the virtual endoscope collects the virtual images;
inputting the depth images of the virtual images into an initial depth registration network, which outputs the relative pose estimation information of the virtual endoscope when collecting two adjacent frames of virtual images;
using the virtual pose information as the training ground truth, and obtaining from it the virtual relative pose information of the virtual endoscope when collecting the two adjacent frames of virtual images;
obtaining the loss function as a weighted sum of the translation loss and the rotation loss between the relative pose estimation information and the virtual relative pose information;
optimizing the loss function and updating the parameters of the initial depth registration network until convergence, to obtain the depth registration network.
The endoscope positioning method provided by this application further includes:
running a registration method based on an iterative optimization algorithm in parallel with the depth registration network, and correcting the pose estimation information of the real endoscope with the corrected pose obtained by this registration method, so as to eliminate the cumulative error.
According to the endoscope positioning method provided by this application, the method of obtaining the corrected pose by the registration method based on the iterative optimization algorithm includes:
obtaining the k-th frame image collected by the real endoscope as the current corrected image, and obtaining the depth image of the k-th frame image through the depth extraction network, where k ≤ t;
obtaining the pose estimation information of the real endoscope when collecting the k-th frame image, as estimated by the depth registration network;
performing semantic segmentation of the lumen images in the field of view of the real endoscope using the current corrected image, or the depth image of the k-th frame image, or the current corrected image together with the depth image of the k-th frame image;
based on the image similarity measure and the semantic segmentation similarity measure, optimizing with the pose estimation information as the initial value to obtain the corrected pose of the current corrected image;
replacing the pose estimation information of the real endoscope when collecting the k-th frame image with the corrected pose.
According to the endoscope positioning method provided by this application, the method of obtaining the corrected pose by the registration method based on the iterative optimization algorithm includes:
obtaining the k-th frame image collected by the real endoscope as the current corrected image, and obtaining the depth image of the k-th frame image through the depth extraction network, where k ≤ t;
obtaining the depth image d_k of the k-th frame target virtual image collected by the virtual endoscope at the k-th frame positioning pose in the target virtual model;
converting the depth image of the k-th frame image into the corresponding point cloud, and converting the depth image d_k into a point cloud image Y_k;
using the relative pose between the two point clouds to correct the pose estimation information of the real endoscope when collecting the k-th frame image.
The endoscope positioning method provided by this application further includes:
using an RGB image feature extraction method to extract the feature information of the t-th frame image collected by the real endoscope, and inputting the feature information of the t-th frame image together with its depth image into the pre-trained depth registration network;
using the RGB image feature extraction method to extract the feature information of the (t-n)-th frame image collected by the real endoscope, or to extract the feature information of the (t-n)-th frame target virtual image collected by the virtual endoscope, wherein the feature information of the (t-n)-th frame target virtual image is extracted after texture mapping is applied to the (t-n)-th frame target virtual image;
inputting the feature information of the (t-n)-th frame target virtual image together with the depth image d_{t-n}, or the feature information of the (t-n)-th frame image together with its depth image, into the pre-trained depth registration network.
The present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the endoscope positioning method according to any one of the above is implemented.
The present application further provides a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the endoscope positioning method according to any one of the above is implemented.
The present application further provides a computer program product, including a computer program; when the computer program is executed by a processor, the endoscope positioning method according to any one of the above is implemented.
With the endoscope positioning method provided by this application, given the initial pose of the real endoscope, the pre-trained depth extraction network and depth registration network can quickly, accurately and continuously obtain the current pose information of the real endoscope. Once trained, the depth extraction network and depth registration network in this method can be used directly for different patients; they do not need to be retrained before surgery, which is convenient and saves time.
In order to explain the technical solutions of this application or of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative effort.
Figure 1 is the first schematic flowchart of the endoscope positioning method provided by this application;
Figure 2 is a schematic diagram of the depth extraction network structure provided by this application;
Figure 3 is a schematic flowchart of the training method of the depth extraction network provided by this application;
Figure 4a is a schematic diagram of the generator architecture of the depth extraction network provided by this application;
Figure 4b is a schematic diagram of the ResNet block architecture of the depth extraction network provided by this application;
Figure 4c is a schematic diagram of the discriminator architecture of the depth extraction network provided by this application;
Figure 5 is a schematic flowchart of the training method of the depth registration network provided by this application;
Figure 6 is a schematic diagram of the depth registration network architecture provided by this application;
Figure 7 is the first schematic flowchart of the method of obtaining the corrected pose by the registration method based on the iterative optimization algorithm provided by this application;
Figure 8 is the second schematic flowchart of the method of obtaining the corrected pose by the registration method based on the iterative optimization algorithm provided by this application;
Figure 9 is the second schematic flowchart of the endoscope positioning method provided by this application;
Figure 10 is a schematic structural diagram of the electronic device provided by this application.
To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
The endoscope positioning method of the present application is described below in conjunction with Figures 1-9. As shown in Figure 1, the method includes:
S101: Obtain the depth image of the t-th frame image captured by the real endoscope, based on a pre-trained depth extraction network.

In the embodiments of the present application, the endoscope positioning method can be used in natural body cavities such as the respiratory tract, the biliary tract, and the cerebral ventricles. The method first obtains the depth image of the current frame captured by the real endoscope, i.e., the t-th frame image.

A depth image, also called a range image, is an image whose pixel values are the distances (depths) from the image collector to points in the scene; it directly reflects the geometry of the visible surface of the scene. A depth image can be converted into point cloud data through coordinate transformation, and point cloud data with regular structure and the necessary information can likewise be back-calculated into depth image data.
S102: Obtain the depth image d_{t-n} of the (t-n)-th frame target virtual image captured by the virtual endoscope at the (t-n)-th frame positioning pose in the target virtual model, or obtain the depth image of the (t-n)-th frame image captured by the real endoscope based on the pre-trained depth extraction network, wherein the virtual endoscope is determined based on the real endoscope.

The virtual endoscope moves in the target virtual model together with the real endoscope. The (t-n)-th frame positioning pose of the virtual endoscope in the target virtual model is obtained by mapping the positioning pose of the real endoscope at the time the (t-n)-th frame image was captured into the target virtual model. Here n ≤ 10, i.e., the (t-n)-th frame lies within the ten frames preceding the current frame, so that frames t-n and t share a sufficient number of similar feature points. The value of n is not fixed in this method. For example, when the current frame is the 8th frame, t-n may equal 7 (the 7th frame, so n = 1) or 3 (the 3rd frame, so n = 5). When the current frame is the 9th frame, t-n may equal 7 (the 7th frame, so n = 2).
The virtual endoscope is determined based on the real endoscope, so the intrinsic parameters of the virtual endoscope must be consistent with those of the real endoscope.
Illustratively, checkerboard calibration of the real endoscope is performed using MATLAB to obtain the intrinsic parameters of the endoscope.

The intrinsic parameters of the real endoscope include the principal point (cx, cy) and the focal length focal_length, with an image size of width×height pixels.

Let:

x-axis coordinate of the window center: wcx = -2×(cx - width/2)/width
y-axis coordinate of the window center: wcy = 2×(cy - height/2)/height

The virtual endoscope is then designed with the following parameters:

Field of view: ViewAngle = 180/π × (2.0 × atan2(height/2.0, focal_length))
Window size: WindowSize = [width, height]
Window center position: WindowCenter = [wcx, wcy]
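As a hedged illustration of the parameter mapping above, the virtual camera settings can be computed directly from the real endoscope intrinsics. The intrinsic values below are made-up placeholders, not values from the application:

```python
import math

def virtual_endoscope_params(cx, cy, focal_length, width, height):
    """Map real-endoscope intrinsics to virtual-endoscope window/view parameters."""
    # Normalized window-center offsets (formulas from the description above)
    wcx = -2.0 * (cx - width / 2.0) / width
    wcy = 2.0 * (cy - height / 2.0) / height
    # Vertical field of view in degrees
    view_angle = 180.0 / math.pi * (2.0 * math.atan2(height / 2.0, focal_length))
    return {
        "ViewAngle": view_angle,
        "WindowSize": [width, height],
        "WindowCenter": [wcx, wcy],
    }

# Hypothetical intrinsics for illustration only: centered principal point,
# focal length equal to half the image height gives a 90-degree field of view
params = virtual_endoscope_params(cx=128.0, cy=128.0, focal_length=128.0,
                                  width=256, height=256)
print(params["ViewAngle"])     # approximately 90 degrees for these placeholders
print(params["WindowCenter"])  # zero offsets when the principal point is centered
```

A principal point off the image center produces nonzero WindowCenter offsets, which shift the virtual rendering window accordingly.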
S103: Input the depth image of the t-th frame and the depth image d_{t-n}, or the depth images of the t-th and (t-n)-th frames extracted by the depth extraction network, into the pre-trained depth registration network, to obtain the relative pose estimate of the real endoscope between capturing the t-th frame image and capturing the (t-n)-th frame image.

Specifically, the relative pose estimate of the real endoscope between the t-th and (t-n)-th frames can be obtained either by feeding the extracted depth image of the t-th frame together with the virtual depth image d_{t-n} into the pre-trained depth registration network, or by feeding the extracted depth images of the t-th and (t-n)-th frames into the network.
S104: Superimpose the relative pose estimate on the pose estimate of the real endoscope at the time the (t-n)-th frame image was captured, to obtain the pose estimate of the real endoscope at the t-th frame, and position the real endoscope according to this pose estimate.

Specifically, superimposing the obtained relative pose estimate on the pose estimate of the real endoscope at the (t-n)-th frame yields the pose estimate of the real endoscope at the t-th frame, and the real endoscope is positioned according to this pose estimate.
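The superposition in S104 amounts to composing rigid transforms. A minimal sketch, assuming 4×4 homogeneous pose matrices and that the relative pose is expressed in the frame of the (t-n)-th pose (the application itself does not fix a representation):

```python
def compose_pose(pose_prev, relative_pose):
    """Chain 4x4 homogeneous poses: T_t = T_{t-n} @ T_{t-n -> t}."""
    n = 4
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[i][j] = sum(pose_prev[i][k] * relative_pose[k][j] for k in range(n))
    return out

def translation(tx, ty, tz):
    """Identity rotation with a pure translation, for illustration only."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

pose_tn = translation(0.0, 0.0, 5.0)   # pose at frame t-n
rel = translation(0.0, 0.0, 1.0)       # estimated relative motion
pose_t = compose_pose(pose_tn, rel)    # pose at frame t
print(pose_t[2][3])                    # accumulated z-translation: 5 + 1 = 6
```

Chaining per-frame relative estimates in this way also explains why the initial pose must be known: every subsequent pose is accumulated from it.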
The pose information of the initial position of the real endoscope can be obtained when the depth registration network is initialized.
In one embodiment, as shown in Figure 2, the depth extraction network is a depth extraction network based on a cycle generative adversarial network (Cycle GAN) and the pre-trained depth registration network. The cycle generative adversarial network includes a first generator, a first discriminator, a second generator and a second discriminator; the first generator is used to convert a depth image into a real-style endoscopic image, and the second generator is used to convert a real-style endoscopic image into a depth image.
As shown in Figure 3, the depth extraction network based on the cycle generative adversarial network and the depth registration network is trained as follows:
S301: Establish a virtual model, obtain the depth images of the virtual images captured by the virtual endoscope in the virtual model, and obtain the virtual pose information of the virtual endoscope at the time each virtual image was captured.

Specifically, before the above depth extraction network is trained, the depth registration network must be trained first, since the depth extraction network relies on the trained depth registration network. The style of an image refers to its textures, colors and visual patterns at different spatial scales.

In practice, it is difficult to obtain the pose of the endoscope during real endoscopy. Therefore, a virtual model is established, and a large number of depth images and virtual pose information are obtained with the virtual endoscope to supervise the training of the depth extraction network, which improves its robustness. Various virtual models are possible, such as a virtual model of the respiratory tract or a virtual model of the biliary tract; the corresponding virtual model can be established as needed.
S302: Obtain preset real endoscopic images.

The target body corresponding to the preset real endoscopic images is consistent with the target body on which the virtual model is based. For example, if the virtual model is a virtual model of the respiratory tract, the preset real endoscopic images are also images of the respiratory tract.
S303: Use the preset real endoscopic images, the depth images of the virtual images and the virtual pose information as training data to perform weakly supervised training of the initial depth extraction network.

Specifically, the depth images and virtual pose information obtained in the above steps are used, together with the preset real endoscopic images, as training data for weakly supervised training of the initial depth extraction network.
S304: Obtain the loss function as a weighted sum of the cycle consistency loss, identity loss, generative adversarial loss, reconstruction loss and geometric consistency loss that constrain the initial depth extraction network.
Specifically, referring to Figure 2, the cycle generative adversarial network (Cycle GAN) includes a first generator G_image, a first discriminator D_image, a second generator G_depth and a second discriminator D_depth. The depth image domain and the endoscopic image domain are denoted Z and X, respectively.
Cycle consistency loss:

For an endoscopic image x_t ∈ X, the depth extraction algorithm aims to learn a mapping G_depth: X→Z that generates the corresponding depth image ẑ_t from x_t. The mapping G_image: Z→X then reconstructs ẑ_t back to domain X, completing the cycle; the cycle consistency loss is the gap between this reconstruction and x_t. The conversion from domain Z to domain X is analogous. In this reconstruction cycle, the network model imposes a cycle consistency loss on G_image and G_depth:

L_cyc = E_{y~p(X)} ||G_image(G_depth(y)) − y||_1 + E_{y~p(Z)} ||G_depth(G_image(y)) − y||_1

where y is a variable denoting an image, p denotes a probability distribution, and E denotes expectation.
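The cycle above can be illustrated numerically with toy stand-in generators (simple scalar functions here; the real generators are CNNs, so this is only a sketch of the loss structure):

```python
def cycle_consistency_l1(g_depth, g_image, xs, zs):
    """Mean L1 gap after a full X->Z->X and Z->X->Z cycle on toy scalar 'images'."""
    loss_x = sum(abs(g_image(g_depth(x)) - x) for x in xs) / len(xs)
    loss_z = sum(abs(g_depth(g_image(z)) - z) for z in zs) / len(zs)
    return loss_x + loss_z

# Toy generators that are exact inverses: the cycle loss vanishes
g_depth = lambda x: x / 2.0    # "endoscopic image" -> "depth"
g_image = lambda z: z * 2.0    # "depth" -> "endoscopic image"
print(cycle_consistency_l1(g_depth, g_image, [1.0, 2.0], [0.5, 1.0]))   # 0.0

# An imperfect reconstruction incurs a positive cycle loss
g_image_bad = lambda z: z * 2.0 + 0.1
print(cycle_consistency_l1(g_depth, g_image_bad, [1.0, 2.0], [0.5, 1.0]))
```

Minimizing this term pushes each generator toward being the inverse of the other, which is what lets the unpaired depth and endoscopic domains be linked.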
Identity loss:

To add constraints on the learning of the mappings, an identity loss is introduced:

L_identity = E_{y~p(X)} ||G_image(y) − y||_1 + E_{y~p(Z)} ||G_depth(y) − y||_1
Generative adversarial loss:

While the generators complete the mapping cycle, the discriminators D_image and D_depth learn to judge whether an input endoscopic image or depth image, respectively, is real or fake, while each generator tries to fool its discriminator by generating images that the discriminator judges to be real. A generative adversarial loss is therefore introduced; the LS-GAN loss can be used here:

L_GAN(G_image, D_image) = E_{x~p(X)} (D_image(x) − 1)^2 + E_{z~p(Z)} (D_image(G_image(z)))^2

with the loss for G_depth and D_depth defined analogously; the subscript · is used below as a placeholder for image or depth, and y~p(data) denotes that samples follow the distribution of domain X or Z.
Reconstruction loss:

To make the network learn depth image estimation at a given scale, motion trajectories of the virtual endoscope can be collected from the virtual model, recording the virtual endoscope pose and the corresponding depth image at each moment. Using the collected virtual poses and corresponding depth images, a view consistency constraint is imposed between the generated real-style endoscopic image frames; on top of the adversarial loss, an image view consistency loss is added based on Perspective-n-Point (PnP).
Given depth images z_{t-n} and z_t, feeding each into the generator G_image yields generated endoscopic images x̂_{t-n} and x̂_t. Since the virtual pose information at times t-n and t is recorded during data collection, the virtual relative pose p_{t-n,t} = (t_x, t_y, t_z, θ, φ, ψ) from time t-n to time t can be computed. With the known camera intrinsic matrix K, a pixel u_t in homogeneous coordinates can be warped to û_{t-n}:

û_{t-n} ~ K (R_{t-n,t} · z_t(u_t) · K^{-1} u_t + t_{t-n,t})

where t_{t-n,t} = (t_x, t_y, t_z) is the camera translation vector from time t-n to time t, and the camera rotation matrix R_{t-n,t} from time t-n to time t is computed as:

R_{t-n,t} =
[ β_2β_3    α_1α_2β_3 − α_3β_1    α_1α_3 + α_2β_1β_3 ]
[ α_3β_2    α_1α_2α_3 + β_1β_3    α_2α_3β_1 − α_1β_3 ]
[ −α_2      α_1β_2                β_1β_2             ]

where α_1 = sinθ, α_2 = sinφ, α_3 = sinψ, β_1 = cosθ, β_2 = cosφ, β_3 = cosψ.
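As a hedged cross-check of this Euler-angle parameterization, the rotation matrix can be built in code. The R_z(ψ)·R_y(φ)·R_x(θ) composition is an assumption here, since the application does not state the convention explicitly:

```python
import math

def rotation_matrix(theta, phi, psi):
    """Rotation from Euler angles (theta, phi, psi), composed as Rz(psi) @ Ry(phi) @ Rx(theta)."""
    a1, a2, a3 = math.sin(theta), math.sin(phi), math.sin(psi)
    b1, b2, b3 = math.cos(theta), math.cos(phi), math.cos(psi)
    return [
        [b2 * b3, a1 * a2 * b3 - a3 * b1, a1 * a3 + a2 * b1 * b3],
        [a3 * b2, a1 * a2 * a3 + b1 * b3, a2 * a3 * b1 - a1 * b3],
        [-a2, a1 * b2, b1 * b2],
    ]

R = rotation_matrix(0.1, 0.2, 0.3)
# A proper rotation matrix has orthonormal rows (R @ R^T = I)
dot = sum(R[0][k] * R[1][k] for k in range(3))
print(abs(dot) < 1e-12)   # rows are orthogonal
```

Zero angles reproduce the identity matrix, and any angle triple yields an orthonormal matrix, which is a quick sanity check for an implementation of this step.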
n ≤ 5 is used during training; an overly large n cannot guarantee a sufficient co-visible region between the two images.
Since û_{t-n} is generally non-integer, bilinear sampling to integer pixel coordinates is required, finally yielding the image x̂'_{t-n} warped from x̂_t. Since x̂'_{t-n} should be consistent with x̂_{t-n}, view consistency gives the reconstruction loss:

L_rec1 = Σ_{u∈x} | x̂_{t-n}(u) − w(x̂_t)(u) |

where w(·) is the operator that warps into the x̂_{t-n} space using the depth image obtained by reprojecting z_t through the relative translation vector t_{t-n,t} and the relative rotation matrix R_{t-n,t}, and u denotes a pixel in image x. In this way, G_image is encouraged to learn an unbiased estimate from depth images to the corresponding endoscopic images. Due to the cycle consistency constraint, G_depth will likewise be encouraged to learn an unbiased estimate from endoscopic images to depth images, i.e., to generate depth images consistent in scale with the input depth maps.
To further constrain the learning of the generator G_depth: X→Z, view consistency is also imposed between x_{t-n} and x_t and the generated depth maps ẑ_{t-n} and ẑ_t. Although the relative pose of the real endoscope cannot be collected in this case, the pre-trained depth registration network provides a depth-based pose estimation algorithm, so the corresponding relative endoscope pose can be computed from ẑ_{t-n} and ẑ_t. The pre-trained pose estimation network is loaded during training to estimate the relative motion p̂_{t-n,t} of the endoscope. An ideal depth image estimate should then contain the information that allows the pose estimation network to capture the endoscope motion, giving a second view-consistency reconstruction loss:

L_rec2 = Σ_{u∈x} | x_{t-n}(u) − w(x_t)(u) |

The total view-consistency reconstruction loss is therefore:

L_rec = L_rec1 + L_rec2
Geometric consistency loss:

For the generated depth maps ẑ_{t-n} and ẑ_t, if they correspond to the same 3D scene, their depth information should agree. The inconsistency z_diff between ẑ_{t-n} and ẑ_t is defined as:

z_diff = | z̃_{t-n} − z̄_{t-n} | / ( z̃_{t-n} + z̄_{t-n} )

where z̃_{t-n} is the depth image obtained by reprojecting ẑ_t using the virtual endoscope relative pose p_{t-n,t} computed by the pre-trained depth registration network, and z̄_{t-n} is the depth map sampled from ẑ_{t-n}. The error is computed between z̃_{t-n} and z̄_{t-n} rather than between z̃_{t-n} and ẑ_{t-n} because the reprojection result does not lie on an integer coordinate grid; ẑ_{t-n} must be sampled into the same coordinate system before the difference can be computed.

The geometric consistency loss is defined as:

L_gc = (1/|V|) Σ_{u∈V} z_diff(u)

where V denotes the set of valid pixels.
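A hedged numeric sketch of this geometric consistency measure, operating on two small aligned depth maps given as nested lists (the symmetric normalized per-pixel difference follows the definition above; warping and sampling are assumed to have already produced the two aligned maps):

```python
def geometric_consistency_loss(z_reproj, z_sampled):
    """Mean normalized depth inconsistency over all pixels of two aligned depth maps."""
    diffs = []
    for row_a, row_b in zip(z_reproj, z_sampled):
        for a, b in zip(row_a, row_b):
            diffs.append(abs(a - b) / (a + b))   # z_diff per pixel
    return sum(diffs) / len(diffs)

# Toy 2x2 depth maps: identical maps give zero loss
print(geometric_consistency_loss([[1.0, 2.0], [3.0, 4.0]],
                                 [[1.0, 2.0], [3.0, 4.0]]))   # 0.0
# One mismatched pixel contributes |1-3|/(1+3) = 0.5 to the 4-pixel mean
print(geometric_consistency_loss([[1.0, 2.0], [3.0, 4.0]],
                                 [[3.0, 2.0], [3.0, 4.0]]))   # 0.125
```

The normalization by the sum of the two depths keeps the measure in [0, 1) and makes it insensitive to the absolute depth scale.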
In summary, the total loss function for depth extraction network training is:

L = β·L_cyc + γ·L_identity + δ·L_GAN + θ_1·L_rec1 + θ_2·L_rec2 + η·L_gc

where β, γ, δ, θ_1, θ_2 and η are hyperparameters that adjust the weight of each loss.
S305: Optimize the loss function and update the parameters of the initial depth extraction network based on the cycle generative adversarial network and the depth registration network until a preset number of rounds is reached, to obtain the depth extraction network based on the cycle generative adversarial network and the depth registration network.
Figures 4(a), 4(b) and 4(c) are schematic diagrams of the depth extraction network architecture: (a) the generator, (b) the Resnet block in the generator, and (c) the discriminator. The tensor dimensions shown in the figures assume an input image of size 1×256×256; Res(256, 256) denotes a Resnet block with 256 input and output channels; IN denotes an Instance Norm layer, and Leaky ReLU denotes the Leaky ReLU activation function.
Illustratively, the depth extraction network can be trained with 7 preset real endoscopic videos and data collected in 8 virtual endoscopy sessions, including a number of preset real endoscopic images, 2187 depth images and the corresponding virtual endoscope poses. In the depth extraction network architecture, the generator is a conventional encoder-decoder, in which the bottleneck consists of six Resnet blocks, and the discriminator consists of five convolutional layers. The Adam optimizer is used to train for 100 epochs. At the beginning of training, the learning rate is set to 0.001 and θ_1 = θ_2 = η = 0, to avoid imposing consistency constraints on the poor depth maps generated early in training. After 10 epochs, θ_1, θ_2 and η are set to 0.3, 5 and 5, respectively. β, γ and δ are set to 10, 5 and 1, respectively, throughout training.
During training, the parameters of the depth extraction network are updated by continuously optimizing the loss function obtained in the above steps, until the preset number of rounds is reached and the final depth extraction network is determined. The preset number of rounds may be 50 to 300, and further may be 100 to 200. Compared with depth extraction networks of the SfMLearner type, the trained depth extraction network can generate depth images with clearer contours. Compared with networks that use only a Cycle GAN, it can guarantee that the structure of the input image is not changed, and it can generate depth images with a stable and known scale (essentially the same scale as the training data).
In one embodiment, the depth extraction network is a depth extraction network based on SfMLearner or a depth extraction network based on a cycle generative adversarial network;
before inputting the depth image of the t-th frame and the depth image d_{t-n}, or the depth images of the t-th and (t-n)-th frames, into the pre-trained depth registration network, the method further includes:

performing scale calibration on the depth images to obtain their units.
Specifically, for the depth extraction network based on SfMLearner:

A depth estimation network and a pose network are trained simultaneously. The depth estimation network estimates the depth information z from a single input endoscopic image; the pose network estimates the relative camera pose, T and R, between two input endoscopic images.
For two consecutive input endoscopic frames x_{t-n} and x_t, the depth estimation network can estimate the depth images ẑ_{t-n} and ẑ_t of the two frames, and the pose network can estimate the relative camera motion t_{t-n,t} and R_{t-n,t}.

With the known camera intrinsic matrix K, a pixel u_t in homogeneous coordinates can be warped to û_{t-n}:

û_{t-n} ~ K (R_{t-n,t} · ẑ_t(u_t) · K^{-1} u_t + t_{t-n,t})
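The warping step can be sketched numerically. The intrinsic matrix below is an identity placeholder, not a real calibration, and the function name is illustrative; it back-projects a homogeneous pixel with its depth, applies the rigid motion, and reprojects:

```python
def warp_pixel(K, K_inv, R, t, depth, u):
    """Warp homogeneous pixel u from frame t to frame t-n: K (R * depth * K^-1 u + t)."""
    def matvec(M, v):
        return [sum(M[i][k] * v[k] for k in range(3)) for i in range(3)]
    ray = [depth * c for c in matvec(K_inv, u)]        # back-project to a 3D point
    cam = [a + b for a, b in zip(matvec(R, ray), t)]   # rigid transform to the other frame
    proj = matvec(K, cam)
    return [proj[0] / proj[2], proj[1] / proj[2]]      # dehomogenize

# Identity intrinsics and identity rotation, pure z-translation (placeholders)
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u = [2.0, 1.0, 1.0]   # homogeneous pixel coordinates
warped = warp_pixel(I3, I3, I3, [0.0, 0.0, 1.0], depth=2.0, u=u)
print(warped)   # translating along +z moves the point farther, so the pixel shifts toward the principal point
```

The resulting coordinates are generally non-integer, which is why the surrounding text resamples to the integer pixel grid before comparing images.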
Since û_{t-n} is generally non-integer, bilinear sampling to integer pixel coordinates is required, finally yielding the image x'_{t-n} warped from x_t, which should be consistent with x_{t-n}. View consistency gives the reconstruction loss:

L_rec = Σ_{u∈x} | x_{t-n}(u) − w(x_t)(u) |

where w(·) is the operator that warps into the x_{t-n} space using the depth image obtained by reprojecting ẑ_t through the relative translation vector t_{t-n,t} and the relative rotation matrix R_{t-n,t}; u denotes a pixel in image x, and warping refers to manipulating an image so that its pixels are deformed. Through this loss function, the pose network and the depth estimation network are self-supervised, completing the network training.
To stabilize the scale of the depth images generated by the network, a geometric consistency loss is added. For the generated depth images ẑ_{t-n} and ẑ_t, if they correspond to the same 3D scene, their depth information should agree. The inconsistency z_diff between ẑ_{t-n} and ẑ_t is defined as:

z_diff = | z̃_{t-n} − z̄_{t-n} | / ( z̃_{t-n} + z̄_{t-n} )

where z̃_{t-n} is the depth map obtained by reprojecting ẑ_t using the relative motion of the real endoscope computed by the pose network, and z̄_{t-n} is the depth map sampled from ẑ_{t-n}. The error is computed between z̃_{t-n} and z̄_{t-n} rather than between z̃_{t-n} and ẑ_{t-n} because the reprojection result does not lie on an integer coordinate grid; ẑ_{t-n} must be sampled into the same coordinate system before the difference can be computed.

The geometric consistency loss is defined as:

L_gc = (1/|V|) Σ_{u∈V} z_diff(u)

where V denotes the set of valid pixels.
In summary, the loss function is L = a·L_rec + b·L_gc, where a and b are hyperparameters that adjust the weight of each loss.
Specifically, for the depth extraction network based on Cycle GAN, the loss function can include the following losses:
For an endoscopic image x ∈ X, the depth extraction algorithm aims to learn a mapping G_depth: X→Z that generates the corresponding depth map ẑ from x. The mapping G_image: Z→X then reconstructs ẑ back to domain X, completing the cycle. The conversion from domain Z to domain X is analogous. In this reconstruction cycle, the network model imposes a cycle consistency loss on G_image and G_depth:

L_cyc = E_{y~p(X)} ||G_image(G_depth(y)) − y||_1 + E_{y~p(Z)} ||G_depth(G_image(y)) − y||_1
To add constraints on the learning of the mappings, the other loss functions include an identity loss:

L_identity = E_{y~p(X)} ||G_image(y) − y||_1 + E_{y~p(Z)} ||G_depth(y) − y||_1
While the generators complete the mapping cycle, the discriminators D_image and D_depth learn to judge whether an input endoscopic image or depth image, respectively, is real or fake, while each generator tries to fool its discriminator by generating images that the discriminator judges to be real. A generative adversarial loss is introduced; the LS-GAN loss is used here:

L_GAN(G_image, D_image) = E_{x~p(X)} (D_image(x) − 1)^2 + E_{z~p(Z)} (D_image(G_image(z)))^2

with the loss for G_depth and D_depth defined analogously; y~p(data) denotes that samples follow the distribution of domain X or Z.
It is difficult to guarantee scale-stable depth images using only a Cycle GAN, so adding the geometric consistency loss may also be considered.
The scale of the depth images obtained by the above two depth extraction networks is ambiguous and unitless, so calibration is required. Specific calibration methods include the following two; at least one of them can be used:
(1) When the real endoscope enters the lumen, the visible range of the real endoscope is segmented according to a depth threshold, and the diameter of the region above the threshold is compared with the depth at the same diameter around the depth peak of the lumen in the preoperatively established virtual model, thereby obtaining the scale of the real endoscope. Illustratively, suppose the depth threshold is set to 5, and the region above this threshold segmented from the depth image (frame 0) extracted from the real endoscope is a circle with a diameter of 10 pixels. For the virtual model established for the main airway, assuming the real endoscope is at the center of the main airway, a circle of 10-pixel diameter around the peak can be found on the contour lines of the corresponding depth map. If the depth corresponding to this contour is 1 cm, the scale of the depth network is 1/5 = 0.2 cm.
(2) Based on the depth extraction network of the above embodiment, the pose network and the depth network share the same ambiguous scale. As the real endoscope advances, the robot control signal can be used as a reference against which the relative pose estimate of the pose network is calibrated. For example, if the robot control signal advances the endoscope by 1 cm while the relative translation vector obtained by the pose network is a translation of 2 in the advancing direction, the scale is 1/2 = 0.5 cm.
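Both calibration strategies reduce to dividing a known physical length by the corresponding unitless network quantity. A hedged sketch reproducing the two worked examples above (the function names are illustrative, not from the application):

```python
def scale_from_depth_peak(contour_depth_cm, network_depth_value):
    """Method (1): physical depth at the matched contour vs. the thresholded network depth."""
    return contour_depth_cm / network_depth_value

def scale_from_robot_motion(commanded_advance_cm, estimated_translation):
    """Method (2): commanded physical advance vs. pose-network translation magnitude."""
    return commanded_advance_cm / estimated_translation

print(scale_from_depth_peak(1.0, 5.0))      # 0.2 cm per depth unit, as in example (1)
print(scale_from_robot_motion(1.0, 2.0))    # 0.5 cm per translation unit, as in example (2)
```

The resulting scale factor converts the unitless network outputs into metric units before they are passed to the depth registration network.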
In one embodiment, as shown in Figure 5, the depth registration network is trained as follows:
S501: Establish a virtual model, obtain the depth images of the virtual images captured by the virtual endoscope in the virtual model, and obtain the virtual pose information of the virtual endoscope at the time each virtual image was captured.
Specifically, the depth registration network is a deep neural network in encoder-decoder form. The network input is two frames of depth information; the encoder adopts the structure of the FlowNetC encoder (the optical flow extracted by FlowNet is a simulation of the motion field), and the decoder uses several CNN (Convolutional Neural Network) layers to finally convert the encoded information into a 6-DOF pose parameter output (i.e., 3D translation and 3D Euler angles).

When training the depth registration network, a virtual model must first be established, and a large number of depth images and virtual pose information are obtained with the virtual endoscope to supervise the training of the depth registration network, so as to improve its robustness.
S502: Input the depth images of the virtual images into the initial depth registration network, which outputs the relative pose estimate of the virtual endoscope between two adjacent frames of virtual images.

Specifically, the depth images of the virtual images obtained in the above steps are input into the initial depth registration network for weakly supervised training; the network output gives the relative pose estimate of the virtual endoscope between two adjacent virtual frames.
S503: Use the virtual pose information as the training ground truth, and obtain from it the virtual relative pose information of the virtual endoscope between the two adjacent frames of virtual images.

Meanwhile, with the virtual pose information used as the training ground truth, the virtual relative pose between the two adjacent virtual frames can be computed from it. At this point, both the ground-truth relative pose information and the estimated relative pose information for two adjacent frames are available.
S504: Obtain the loss function as a weighted sum of the translation loss and the rotation loss between the relative pose estimation information and the virtual relative pose information.
Specifically, the translation loss and the rotation loss between the estimated relative pose of the virtual endoscope and the true relative pose are computed separately, and the final loss function is obtained as their weighted sum:

L(z_{t-m}, z_t) = L_t(z_{t-m}, z_t) + ω·L_r(z_{t-m}, z_t)

where L_t is the translation loss, defined on the translation vectors T_{t-m,t} of the true relative pose and of the estimated relative pose respectively; L_r is the rotation loss, defined on the rotation vectors R_{t-m,t} of the true relative pose and of the estimated relative pose respectively; and ω is a hyperparameter that adjusts the relative contributions of the rotation loss and the translation loss.
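As a non-limiting illustration, the weighted loss above can be sketched as follows. The use of the L2 norm for both loss terms is an assumption; the disclosure specifies only a weighted sum of translation and rotation losses:

```python
import numpy as np

def pose_loss(t_pred, t_gt, r_pred, r_gt, omega=100.0):
    """Weighted pose loss L = L_t + omega * L_r.

    t_* and r_* are (B, 3) arrays of translation and rotation vectors for
    the estimated and ground-truth relative poses. The L2 distance used for
    each term is an assumption, as is the batch averaging.
    """
    l_t = np.linalg.norm(t_pred - t_gt, axis=-1).mean()  # translation loss
    l_r = np.linalg.norm(r_pred - r_gt, axis=-1).mean()  # rotation loss
    return l_t + omega * l_r
```

With ω = 100, as in the training configuration described below the formula, a unit rotation error contributes one hundred times as much to the loss as a unit translation error.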
Figure 6 shows a schematic diagram of the depth registration architecture.
The pose estimation network is trained with virtual endoscope poses and depth images collected along 37 virtual endoscope trajectories, totalling 11,904 frames. The network uses a pre-trained FlowNetC encoder and regresses the pose vector with three convolutional blocks. It is trained with the Adam optimizer at an initial learning rate of 1e-5 for 300 epochs, and ω is set to 100.
S505: Optimize the loss function and update the parameters of the initial depth registration network until convergence, so as to obtain the depth registration network.
The depth registration network learns, by deep learning, the endoscope pose transformation parameters between two input depth images, thereby updating the endoscope pose transformation for each input endoscopic image. Because this network registers depth rather than image intensity, the algorithm places no additional requirements on the rendering of the virtual images acquired by the virtual endoscope in the simulator. The deep learning algorithm estimates the pose transformation directly, so the algorithm can run quickly in real time and produce real-time positioning results.
In one embodiment, the method further includes:
running a registration method based on an iterative optimization algorithm in parallel with the depth registration network, and correcting the pose estimation information of the real endoscope with the corrected pose obtained from that registration method, so as to eliminate accumulated error.
Specifically, the registration method based on the iterative optimization algorithm is computationally slower; by running it in parallel with the depth registration network for pose correction, the pose estimation information of the real endoscope can be corrected with some latency, so the accumulated error does not keep growing and positioning accuracy is improved.
In one embodiment, as shown in Figure 7, the method of obtaining the corrected pose according to the registration method based on an iterative optimization algorithm includes:
S701: Obtain the k-th frame image collected by the real endoscope as the current correction image, and obtain the depth image of the k-th frame image through the depth extraction network, where k ≤ t.
Specifically, this correction method runs more slowly than the network that estimates the real endoscope's pose, so when correction is run in parallel it is not performed frame by frame. The k-th frame image with k ≤ t is taken as the current correction image, i.e. the pose estimation information of the real endoscope for the frame being corrected has already been estimated.
S702: Obtain the pose estimation information of the real endoscope at the k-th frame, as obtained from the depth registration network.
Specifically, since k ≤ t, the pose estimation information of the k-th frame has already been estimated by the time the k-th frame is corrected, and can be obtained directly.
S703: Perform semantic segmentation of the lumen image in the real endoscope's field of view using the current correction image, or its depth image, or both the current correction image and the depth image.
Experiments show that, because a similarity measure is used during registration, when one deep lumen and several shallow lumens appear in the image at the same time, the optimization preferentially aligns the deep lumen, whose depth values are larger than those of the others, and easily neglects the registration of the shallower lumens; the structural information of the shallow lumens is then ignored. To address this, the depth image is used to segment the lumen image before registration, so that the registration must match not only similar depths but also similar lumen structures.
Segmentation here means partitioning all lumen regions in the detection field of view into separate areas. For an input endoscopic image x_t, the lumens can be segmented from the depth image, from the RGB image x_t, or from the RGBD image (x_t together with its depth image). The segmentation method may be depth-threshold segmentation of the depth image, or a network trained to segment lumens in RGB or RGBD images.
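As a non-limiting illustration of the depth-threshold option named above, lumen regions can be separated by thresholding followed by connected-component labelling. The threshold value and the use of scipy.ndimage are assumptions for illustration only:

```python
import numpy as np
from scipy import ndimage

def segment_lumens(depth: np.ndarray, thresh: float) -> np.ndarray:
    """Label connected lumen regions in a depth image by thresholding.

    Pixels deeper than `thresh` are treated as lumen candidates, and
    connected-component labelling separates multiple lumens. Returns an
    integer label image (0 = background, 1..n = lumen regions).
    """
    mask = depth > thresh
    labels, _n = ndimage.label(mask)
    return labels
```

A label image of this kind gives each lumen its own region, so a later similarity measure can be evaluated per lumen rather than being dominated by the deepest one.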
S704: Based on an image similarity measure and a semantic-segmentation similarity measure, perform an optimization with the pose estimation information of the k-th frame as the initial value, obtaining the corrected pose of the current correction image.
Specifically, this is a correction method based on image registration. Denote the segmentation process as Seg(·); the airway segmentation result corresponding to the corrected pose of the real endoscope at time k is obtained through Seg(·). Given the camera pose at time t-1, the corrected pose is solved by optimization starting from the initial pose estimate. The optimization process is described as follows:
where SIM1(·) is the image similarity measure, SIM2(·) is the segmentation similarity measure, and P′_t is the variable. Seg(P′_t) is the result of segmenting the image or depth map corresponding to the virtual endoscope at virtual pose P′_t. The Powell algorithm is again used as the optimization strategy. Illustratively, taking k = t, i.e. using the most recently computed pose estimation information as the initial value, improves the convergence of the algorithm and reduces the number of iterations.
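As a non-limiting illustration, the Powell-based optimization combining the two similarity measures can be sketched as follows. Here render_depth(p) and seg(p) stand in for the simulator's virtual-camera depth rendering and its segmentation Seg(·), and maximizing the summed similarities (by minimizing their negation) is an assumed sign convention:

```python
import numpy as np
from scipy.optimize import minimize

def register_pose(p0, render_depth, seg, d_obs, s_obs, sim1, sim2):
    """Refine a 6-DoF pose so the rendered depth and segmentation match
    the observed ones, starting from the network's estimate p0.

    sim1/sim2 are similarity measures (larger = more similar); the cost
    negates their sum so that scipy's minimizer maximizes similarity.
    """
    def cost(p):
        return -(sim1(render_depth(p), d_obs) + sim2(seg(p), s_obs))
    res = minimize(cost, p0, method="Powell")  # derivative-free, as in the text
    return res.x
```

Powell's method is derivative-free, which suits this setting: the rendering and segmentation steps are generally not differentiable with respect to the pose.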
This method compensates for the case where only an image similarity measure is used and one deep and one shallow lumen appear together: a similarity measure such as NCC (Normalized Cross-Correlation) would concentrate on aligning the deep-lumen parts of the two depth maps and ignore the features of the shallow lumen, causing inaccurate results.
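As a non-limiting illustration of the NCC measure mentioned above, a minimal implementation on two equal-shaped images is:

```python
import numpy as np

def ncc(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized cross-correlation between two images of equal shape.

    Returns a value in [-1, 1]; 1 means the (mean-subtracted) images are
    identical up to a positive scale.
    """
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```

Because NCC sums contributions over all pixels, regions with large depth values dominate the score, which is exactly the deep-lumen bias the segmentation term is introduced to counter.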
S705: Replace the pose estimation information of the real endoscope at the k-th frame with the corrected pose.
After the corrected pose is obtained, the pose estimation information of the real endoscope at the k-th frame is replaced with the corrected pose; the pose at which the k-th frame image was collected on the real endoscope trajectory is thereby corrected.

In one embodiment, as shown in Figure 8, the method of obtaining the corrected pose according to the registration method based on an iterative optimization algorithm includes:
S801: Obtain the k-th frame image collected by the real endoscope as the current correction image, and obtain the depth image of the k-th frame image through the depth extraction network, where k ≤ t.
Specifically, this correction method runs more slowly than the network that estimates the real endoscope's pose, so when correction is run in parallel it is not performed frame by frame. When a correction is performed, the k-th frame image with k ≤ t is taken as the current correction image.
S802: Obtain the depth image d_k of the k-th frame target virtual image collected by the virtual endoscope at the k-th frame positioning pose in the target virtual model.
Specifically, the virtual endoscope moves in the target virtual model together with the movement of the real endoscope; the virtual endoscope's positioning pose at the k-th frame in the target virtual model is obtained by mapping the positioning pose of the real endoscope at the k-th frame into the target virtual model.
S803: Convert the depth image of the k-th real frame into the corresponding point cloud, and convert the depth image d_k into the point cloud image Y_k.
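As a non-limiting illustration of step S803, a depth image can be back-projected into a point cloud with a pinhole camera model. The intrinsic parameters fx, fy, cx, cy are assumptions, since the disclosure does not state the virtual camera's intrinsics:

```python
import numpy as np

def depth_to_points(depth: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """Back-project an (H, W) depth image into an (H*W, 3) point cloud.

    Each pixel (u, v) with depth z maps to camera coordinates
    ((u - cx) * z / fx, (v - cy) * z / fy, z) under the pinhole model.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grids
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Two point clouds produced this way can then be registered (for example with an ICP-style method) to obtain the relative pose used in the subsequent correction step.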
S805: Use the relative pose to correct the pose estimation information of the real endoscope at the k-th frame.
Specifically, the relative pose is used to correct the pose estimation information of the real endoscope at the k-th frame; the pose at which the k-th frame image was collected on the real endoscope trajectory is thereby corrected.
In one embodiment, the method further includes:
S901: Use an RGB image feature extraction method to extract feature information of the t-th frame image collected by the real endoscope, and input the feature information of the t-th frame image together with its depth image into the pre-trained depth registration network.
S902: Use the RGB image feature extraction method to extract feature information of the (t-n)-th frame image collected by the real endoscope, or feature information of the (t-n)-th frame target virtual image collected by the virtual endoscope, where the feature information of the (t-n)-th frame target virtual image is extracted after texture mapping has been applied to that image.
S903: Input the feature information of the (t-n)-th frame target virtual image together with the depth image d_{t-n}, or the feature information of the (t-n)-th frame image together with its depth image, into the pre-trained depth registration network.
Current algorithms use only RGB image information or only depth information. Although depth-based positioning has been shown to be more robust, in practice, when only one lumen is in the field of view, the depth image contains a single circular depth-peak region, and the rotational and translational motion of the endoscope then becomes difficult to estimate.
Therefore, RGB feature extraction is fused into the relative pose computation for real-time positioning. Specifically, features such as lumen texture can be extracted from two endoscopic frames with feature descriptors (e.g. SIFT, ORB) or a pre-trained feature extraction network, and then fed into the depth registration network together with the depth images. This compensates for the difficulty of estimating the endoscope pose when the depth map structure is simple, and assists in estimating the motion of the real endoscope. In this case, virtual endoscope images, depth images and the corresponding virtual endoscope poses need to be collected to train the depth extraction network.
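As a non-limiting illustration, one way to feed the extracted RGB features "together with" the depth images into the registration network is channel-wise concatenation. This fusion choice is an assumption; the disclosure does not specify how the inputs are combined:

```python
import numpy as np

def fuse_inputs(feat_a: np.ndarray, feat_b: np.ndarray,
                depth_a: np.ndarray, depth_b: np.ndarray) -> np.ndarray:
    """Stack per-frame RGB feature maps and depth maps into one
    multi-channel input tensor for the registration network.

    feat_a/feat_b are (H, W, C) feature maps for the two frames;
    depth_a/depth_b are (H, W) depth images appended as extra channels.
    """
    return np.concatenate(
        [feat_a, feat_b, depth_a[..., None], depth_b[..., None]], axis=-1
    )
```

When the depth map is nearly rotationally symmetric (the single-lumen case above), the texture channels supply the asymmetry needed to disambiguate rotation.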
During data collection, texture mapping must be applied to the virtual endoscope images, and the texture needs to be close to that of images collected by a real endoscope.
With the endoscope positioning method provided by this application, given the initial pose of the real endoscope, the current pose information of the real endoscope can be obtained quickly and continuously using the pre-trained depth extraction network and depth registration network. Once trained, the depth extraction network and depth registration network in this method can be used directly for different patients without pre-operative training, which is convenient and saves time.
Figure 10 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Figure 10, the electronic device may include a processor 1010, a communications interface 1020, a memory 1030 and a communication bus 1040, where the processor 1010, the communications interface 1020 and the memory 1030 communicate with one another through the communication bus 1040. The processor 1010 can call logical instructions in the memory 1030 to perform an endoscope positioning method, the method including: obtaining, based on a pre-trained depth extraction network, the depth image of the current frame, i.e. the t-th frame image, collected by the real endoscope; obtaining the depth image d_{t-n} of the (t-n)-th frame target virtual image collected by the virtual endoscope at the (t-n)-th frame positioning pose in the target virtual model, or obtaining, based on the pre-trained depth extraction network, the depth image of the (t-n)-th frame image collected by the real endoscope, where the virtual endoscope is determined based on the real endoscope; inputting the depth image of the t-th frame and the depth image d_{t-n}, or the depth images of the t-th and (t-n)-th frames, into a pre-trained depth registration network to obtain the relative pose estimation information between the t-th frame image and the (t-n)-th frame image collected by the real endoscope; and superimposing the relative pose estimation information onto the pose estimation information of the real endoscope at the (t-n)-th frame to obtain the pose estimation information of the real endoscope at the t-th frame, and positioning the real endoscope according to that pose estimation information, where the pose information of the initial position of the real endoscope is known.
In addition, the logical instructions in the memory 1030 described above may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
On another aspect, the present application further provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer can perform the endoscope positioning method provided by each of the above methods, the method including: obtaining, based on a pre-trained depth extraction network, the depth image of the current frame, i.e. the t-th frame image, collected by the real endoscope; obtaining the depth image d_{t-n} of the (t-n)-th frame target virtual image collected by the virtual endoscope at the (t-n)-th frame positioning pose in the target virtual model, or obtaining, based on the pre-trained depth extraction network, the depth image of the (t-n)-th frame image collected by the real endoscope, where the virtual endoscope is determined based on the real endoscope; inputting the depth image of the t-th frame and the depth image d_{t-n}, or the depth images of the t-th and (t-n)-th frames, into a pre-trained depth registration network to obtain the relative pose estimation information between the t-th frame image and the (t-n)-th frame image collected by the real endoscope; and superimposing the relative pose estimation information onto the pose estimation information of the real endoscope at the (t-n)-th frame to obtain the pose estimation information of the real endoscope at the t-th frame, and positioning the real endoscope according to that pose estimation information, where the pose information of the initial position of the real endoscope is known.
In yet another aspect, the present application further provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the endoscope positioning method provided by each of the above methods, the method including: obtaining, based on a pre-trained depth extraction network, the depth image of the current frame, i.e. the t-th frame image, collected by the real endoscope; obtaining the depth image d_{t-n} of the (t-n)-th frame target virtual image collected by the virtual endoscope at the (t-n)-th frame positioning pose in the target virtual model, or obtaining, based on the pre-trained depth extraction network, the depth image of the (t-n)-th frame image collected by the real endoscope, where the virtual endoscope is determined based on the real endoscope; inputting the depth image of the t-th frame and the depth image d_{t-n}, or the depth images of the t-th and (t-n)-th frames, into a pre-trained depth registration network to obtain the relative pose estimation information between the t-th frame image and the (t-n)-th frame image collected by the real endoscope; and superimposing the relative pose estimation information onto the pose estimation information of the real endoscope at the (t-n)-th frame to obtain the pose estimation information of the real endoscope at the t-th frame, and positioning the real endoscope according to that pose estimation information, where the pose information of the initial position of the real endoscope is known.
The device embodiments described above are merely illustrative. Units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement this without creative effort.
From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the part of the above technical solution that in essence contributes to the prior art can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disk, and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.
Claims (10)
- An endoscope positioning method, comprising: obtaining, based on a pre-trained depth extraction network, the depth image of the t-th frame image collected by a real endoscope; obtaining the depth image d_{t-n} of the (t-n)-th frame target virtual image collected by a virtual endoscope at the (t-n)-th frame positioning pose in a target virtual model, or obtaining, based on the pre-trained depth extraction network, the depth image of the (t-n)-th frame image collected by the real endoscope, wherein the virtual endoscope is determined based on the real endoscope; inputting the depth image of the t-th frame and the depth image d_{t-n}, or the depth images of the t-th and (t-n)-th frames, into a pre-trained depth registration network to obtain relative pose estimation information between the t-th frame image and the (t-n)-th frame image collected by the real endoscope; and superimposing the relative pose estimation information onto the pose estimation information of the real endoscope at the (t-n)-th frame to obtain the pose estimation information of the real endoscope at the t-th frame, and positioning the real endoscope according to the pose estimation information.
- The endoscope positioning method according to claim 1, wherein the depth extraction network is a depth extraction network based on a recurrent generative adversarial network and the pre-trained depth registration network; the recurrent generative adversarial network includes a first generator, a first discriminator, a second generator and a second discriminator, the first generator being used to convert depth images into real-style endoscopic images, and the second generator being used to convert real-style endoscopic images into depth images; the depth extraction network based on the recurrent generative adversarial network and the depth registration network is trained as follows: establishing a virtual model, obtaining depth images of the virtual images collected by the virtual endoscope in the virtual model, and obtaining the virtual pose information of the virtual endoscope when collecting the virtual images; obtaining preset real endoscopic images; performing weakly supervised training of an initial depth extraction network with the preset real endoscopic images, the depth images of the virtual images and the virtual pose information as training data; obtaining a loss function as a weighted sum of the cycle-consistency loss, identity loss, generative adversarial loss, reconstruction loss and geometric-consistency loss that constrain the initial depth extraction network; and optimizing the loss function and updating the parameters of the initial depth extraction network based on the recurrent generative adversarial network and the depth registration network for a preset number of rounds, so as to obtain the depth extraction network based on the recurrent generative adversarial network and the depth registration network.
- The endoscope positioning method according to claim 1, wherein the depth extraction network is a depth extraction network based on SfMLearner or a depth extraction network based on a recurrent generative adversarial network; before inputting the depth image of the t-th frame and the depth image d_{t-n}, or the depth images of the t-th and (t-n)-th frames, into the pre-trained depth registration network, the method further includes:
- The endoscope positioning method according to claim 1, wherein the depth registration network is trained as follows: establishing a virtual model, obtaining depth images of the virtual images collected by the virtual endoscope in the virtual model, and obtaining the corresponding virtual pose information when the virtual endoscope collects the virtual images; inputting the depth images of the virtual images into an initial depth registration network, the initial depth registration network outputting the relative pose estimation information of the virtual endoscope between two adjacent frames of virtual images; using the virtual pose information as the training ground truth, and obtaining from it the virtual relative pose information of the virtual endoscope between the two adjacent frames of virtual images; obtaining the loss function as a weighted sum of the translation loss and the rotation loss between the relative pose estimation information and the virtual relative pose information; and optimizing the loss function and updating the parameters of the initial depth registration network until convergence, so as to obtain the depth registration network.
- The endoscope positioning method according to any one of claims 1 to 4, further comprising: running a registration method based on an iterative optimization algorithm in parallel with the depth registration network, and correcting the pose estimation information of the real endoscope with the corrected pose obtained from the registration method based on the iterative optimization algorithm, so as to eliminate accumulated error.
- The endoscope positioning method according to claim 5, wherein obtaining the corrected pose from the iterative-optimization-based registration method comprises: obtaining the k-th frame image collected by the real endoscope as the current correction image, and obtaining the depth image of the k-th frame image through the depth extraction network, where k ≤ t; obtaining the pose estimation information of the real endoscope for the k-th frame image as produced by the depth registration network; performing semantic segmentation of the lumen in the real endoscope's field of view using the current correction image, the depth image, or both the current correction image and the depth image; and, taking the pose estimation information as the initial value, solving an optimization based on an image similarity measure and a semantic-segmentation similarity measure to obtain the corrected pose of the current correction image.
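A minimal sketch of the combined objective used in the optimization of this claim, assuming normalized cross-correlation (NCC) as the image similarity measure and the Dice coefficient as the segmentation similarity measure; the claim does not name specific measures, and the blending weight `alpha` is introduced here only for illustration.

```python
import numpy as np

def combined_similarity(real_img, virt_img, real_seg, virt_seg, alpha=0.5):
    """Weighted combination of an image similarity measure (NCC, assumed)
    between the real endoscope image and the virtual rendering, and a
    segmentation similarity measure (Dice, assumed) between the lumen masks.
    Higher is better; an iterative optimizer would maximize this over the
    candidate pose, starting from the network's pose estimate."""
    a = real_img - real_img.mean()
    b = virt_img - virt_img.mean()
    ncc = (a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    inter = np.logical_and(real_seg, virt_seg).sum()
    dice = 2.0 * inter / (real_seg.sum() + virt_seg.sum() + 1e-8)
    return alpha * ncc + (1 - alpha) * dice
```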
- The endoscope positioning method according to claim 5, wherein obtaining the corrected pose from the iterative-optimization-based registration method comprises: obtaining the k-th frame image collected by the real endoscope as the current correction image, and obtaining the depth image of the k-th frame image through the depth extraction network, where k ≤ t; obtaining the depth image d_k of the k-th frame target virtual image collected by the virtual endoscope at the k-th frame positioning pose in the target virtual model; and converting the depth image of the k-th frame image into a corresponding point cloud, and converting the depth image d_k into a point cloud image Y_k;
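The depth-to-point-cloud conversion in this claim can follow standard pinhole back-projection. A sketch, assuming the camera intrinsics `fx`, `fy`, `cx`, `cy` are known from endoscope calibration (the claim does not specify the conversion):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (H x W, metric depth per pixel) into a
    3-D point cloud under a pinhole camera model. Returns an (H*W, 3)
    array of camera-frame points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Two such clouds (from the real depth image and from d_k) could then be aligned with a rigid registration such as ICP to obtain the corrected pose.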
- The endoscope positioning method according to any one of claims 1 to 4, further comprising: using an RGB image feature extraction method to extract feature information of the t-th frame image collected by the real endoscope, and inputting the feature information of the t-th frame image together with the depth image into the pre-trained depth registration network; and using the RGB image feature extraction method to extract feature information of the (t-n)-th frame image collected by the real endoscope, or feature information of the (t-n)-th frame target virtual image collected by the virtual endoscope, wherein the feature information of the (t-n)-th frame target virtual image is extracted after texture mapping is applied to the (t-n)-th frame target virtual image;
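As a rough illustration of forming a multi-channel network input from RGB feature information plus a depth image, the sketch below uses simple finite-difference image gradients as a stand-in for the RGB feature extraction method, which the claim leaves unspecified:

```python
import numpy as np

def build_registration_input(rgb, depth):
    """Stack a grayscale channel and its x/y gradients (a hand-crafted
    stand-in for learned RGB features) with the depth image, producing a
    (4, H, W) input tensor for a registration network."""
    gray = rgb.mean(axis=-1)  # (H, W) grayscale from (H, W, 3) RGB
    gx = np.zeros_like(gray)
    gx[:, 1:] = gray[:, 1:] - gray[:, :-1]  # horizontal gradient
    gy = np.zeros_like(gray)
    gy[1:, :] = gray[1:, :] - gray[:-1, :]  # vertical gradient
    return np.stack([gray, gx, gy, depth], axis=0)
```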
- An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the endoscope positioning method according to any one of claims 1 to 8.
- A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the endoscope positioning method according to any one of claims 1 to 8.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211086312.X | 2022-09-06 | ||
CN202211086312.XA CN117710279A (en) | 2022-09-06 | 2022-09-06 | Endoscope positioning method, electronic device, and non-transitory computer-readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024050918A1 true WO2024050918A1 (en) | 2024-03-14 |
Family
ID=90142942
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/125009 WO2024050918A1 (en) | 2022-09-06 | 2022-10-13 | Endoscope positioning method, electronic device, and non-transitory computer-readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117710279A (en) |
WO (1) | WO2024050918A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070013710A1 (en) * | 2005-05-23 | 2007-01-18 | Higgins William E | Fast 3D-2D image registration method with application to continuously guided endoscopy |
CN104540439A (en) * | 2012-08-14 | 2015-04-22 | 直观外科手术操作公司 | Systems and methods for registration of multiple vision systems |
CN111325797A (en) * | 2020-03-03 | 2020-06-23 | 华东理工大学 | Pose estimation method based on self-supervision learning |
CN111772792A (en) * | 2020-08-05 | 2020-10-16 | 山东省肿瘤防治研究院(山东省肿瘤医院) | Endoscopic surgery navigation method, system and readable storage medium based on augmented reality and deep learning |
CN114022527A (en) * | 2021-10-20 | 2022-02-08 | 华中科技大学 | Monocular endoscope depth and pose estimation method and device based on unsupervised learning |
- 2022-09-06 CN CN202211086312.XA patent/CN117710279A/en active Pending
- 2022-10-13 WO PCT/CN2022/125009 patent/WO2024050918A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
CN117710279A (en) | 2024-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109448041B (en) | Capsule endoscope image three-dimensional reconstruction method and system | |
Song et al. | Mis-slam: Real-time large-scale dense deformable slam system in minimal invasive surgery based on heterogeneous computing | |
Visentini-Scarzanella et al. | Deep monocular 3D reconstruction for assisted navigation in bronchoscopy | |
Song et al. | Dynamic reconstruction of deformable soft-tissue with stereo scope in minimal invasive surgery | |
JP5797352B1 (en) | Method for tracking a three-dimensional object | |
CN111080778B (en) | Online three-dimensional reconstruction method of binocular endoscope soft tissue image | |
US20180174311A1 (en) | Method and system for simultaneous scene parsing and model fusion for endoscopic and laparoscopic navigation | |
CN112614169B (en) | 2D/3D spine CT (computed tomography) level registration method based on deep learning network | |
CN111783582A (en) | Unsupervised monocular depth estimation algorithm based on deep learning | |
CN108090954A (en) | Abdominal cavity environmental map based on characteristics of image rebuilds the method with laparoscope positioning | |
Wu et al. | Three-dimensional modeling from endoscopic video using geometric constraints via feature positioning | |
CN110992431B (en) | Combined three-dimensional reconstruction method for binocular endoscope soft tissue image | |
US20220198693A1 (en) | Image processing method, device and computer-readable storage medium | |
CN112598649A (en) | 2D/3D spine CT non-rigid registration method based on generation of countermeasure network | |
CN116452752A (en) | Intestinal wall reconstruction method combining monocular dense SLAM and residual error network | |
CN111260765A (en) | Dynamic three-dimensional reconstruction method for microsurgery operative field | |
WO2024050918A1 (en) | Endoscope positioning method, electronic device, and non-transitory computer-readable storage medium | |
Liu et al. | Sparse-to-dense coarse-to-fine depth estimation for colonoscopy | |
CN114399527A (en) | Method and device for unsupervised depth and motion estimation of monocular endoscope | |
CN115018890A (en) | Three-dimensional model registration method and system | |
WO2021213053A1 (en) | System and method for estimating motion of target inside tissue on basis of soft tissue surface deformation | |
Luo et al. | Bronchoscopy navigation beyond electromagnetic tracking systems: a novel bronchoscope tracking prototype | |
CN114298986A (en) | Thoracic skeleton three-dimensional construction method and system based on multi-viewpoint disordered X-ray film | |
CN113538335A (en) | In-vivo relative positioning method and device of wireless capsule endoscope | |
CN114092643A (en) | Soft tissue self-adaptive deformation method based on mixed reality and 3DGAN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22957884; Country of ref document: EP; Kind code of ref document: A1 |