CN117392312A - New view image generation method for a monocular endoscope based on a deformable neural radiance field - Google Patents

New view image generation method for a monocular endoscope based on a deformable neural radiance field

Info

Publication number
CN117392312A
Authority
CN
China
Prior art keywords
image
radiance field
monocular endoscope
depth
deformable
Prior art date
Legal status
Pending
Application number
CN202311280252.XA
Other languages
Chinese (zh)
Inventor
刘金华
黄东晋
刘玉华
曾子洋
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202311280252.XA
Publication of CN117392312A
Legal status: Pending

Classifications

    • G06T 17/00 Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N 3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G06T 7/55 Depth or shape recovery from multiple images
    • G06T 7/62 Analysis of geometric attributes of area, perimeter, diameter or volume
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T 2207/10024 Color image
    • G06T 2207/10028 Range image; depth image; 3D point clouds
    • G06T 2207/10068 Endoscopic image
    • G06T 2207/20081 Training; learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30244 Camera pose

Abstract

The invention relates to a method for generating new view images of a monocular endoscope based on a deformable neural radiance field, comprising the following steps: acquiring multi-view original images and camera parameters captured by a monocular endoscope; creating Mask images from the original images to obtain Mask images for the multi-view original images; inputting the original images into a pre-trained monocular endoscope depth estimation network model to obtain dense depth maps; obtaining the effective areas of the original images and the dense depth maps based on the Mask images; performing implicit reconstruction with a deformable neural radiance field, based on the camera parameters and the effective areas, to obtain spatial volume density and spatial color information; and volume-rendering the spatial volume density and color information to generate the new view image. Compared with the prior art, the method improves the accuracy of new view image generation.

Description

New view image generation method for a monocular endoscope based on a deformable neural radiance field
Technical Field
The invention relates to the technical field of medical image reconstruction and new view generation, and in particular to a method for generating new view images of a monocular endoscope based on a deformable neural radiance field.
Background
Monocular endoscopes are widely used for examination, diagnosis, and treatment of the digestive tract, including the esophagus, stomach, and intestines. However, a monocular endoscopic image sequence typically provides only a limited field of view, which restricts the physician's observation of lesions, particularly at stenosed or curved sites. New view generation techniques can synthesize novel views from image data captured at multiple viewpoints, thereby expanding the field of view of the original images. Research on new view generation for monocular endoscopic images is therefore important for assisting physicians in accurate diagnosis and for improving treatment quality.
Researchers have proposed several new view generation methods. Conventional methods cannot recover fine appearance details. In recent years, methods based on neural radiance fields have performed well in novel view synthesis for natural images. However, monocular endoscopic images have complex characteristics: the field of view and depth range of the input images are limited, black invalid areas occlude part of the frame, and in vivo soft tissue deforms. Using neural-radiance-field-based methods to accomplish three-dimensional reconstruction of soft tissue and new view generation in a dynamic, minimally invasive surgical environment therefore remains a challenging task.
Disclosure of Invention
The invention aims to provide a method for generating new view images of a monocular endoscope based on a deformable neural radiance field that improves the accuracy of new view image generation.
The aim of the invention can be achieved by the following technical scheme:
a method for generating a new view image of a monocular endoscope based on a deformable neural radiation field, comprising the steps of:
acquiring a plurality of visual angle original images and camera parameters acquired by a monocular endoscope;
based on the original image, mask image manufacturing is carried out to obtain Mask images of the original images with multiple view angles;
inputting the original image into a pre-trained monocular endoscope depth estimation network model to obtain a dense depth map;
obtaining an original image and an effective area of a dense depth map based on the Mask image;
based on the camera parameters and the effective area, implicitly reconstructing by adopting a deformable nerve radiation field to obtain space volume density information and space color information;
and rendering the space volume density information and the space color information by a volume to obtain an image generated by the new view angle.
Further, the camera parameters are acquired using COLMAP.
Further, the camera parameters include the camera pose and intrinsic parameters.
Further, the specific steps of obtaining the Mask images include:
selecting one of the multi-view original images and loading it into labelme;
manually drawing a closed-loop binary Mask along the boundary between the invalid area and the effective area to obtain the Mask image of a single original image;
and replicating the Mask image of the single original image to obtain the Mask images corresponding to all multi-view original images.
Further, the training process of the monocular endoscope depth estimation network model specifically comprises:
pre-training the monocular endoscope depth estimation network model on datasets composed of different monocular endoscope videos, and initializing the network parameters;
retraining the network model on the multi-view original images to complete the training of the monocular endoscope depth estimation network model.
Further, during training of the monocular endoscope depth estimation network model, a scale-invariant loss function is adopted to adjust the network weights,
in which L_1 denotes the scale-invariant loss value, D̂_i denotes the dense depth map predicted by the monocular endoscope depth estimation network model from the i-th original image, D*_i denotes the sparse depth map corresponding to the i-th original image, and q denotes a pixel.
Further, obtaining the effective areas of the original images and the dense depth maps specifically comprises:
judging, based on the Mask image, whether each pixel of the original image and the dense depth map lies in the black occluded area; if so, the pixel lies in the invalid area, and if not, the pixel lies in the effective area.
Further, the specific steps of obtaining the spatial volume density and spatial color information include:
obtaining the 3D points corresponding to the pixels in the effective areas;
and inputting the camera parameters and the 3D points corresponding to the effective areas into the deformation network and the canonical network of the trained deformable neural radiance field, and outputting the spatial volume density and spatial color information, wherein the deformation network and the canonical network of the deformable neural radiance field are trained with a total loss built from a color loss and a depth loss.
Further, the total loss function is expressed as:
L = L_RGB + L_Depth
where L is the total loss, L_RGB is the color loss, and L_Depth is the depth loss; C_i(p) and C_i-gt(p) are the color of pixel p of the i-th image predicted by the canonical network and the color of pixel p in the i-th original image; D_i(p) and D_i-gt(p) are the depth value of pixel p of the i-th image predicted by the canonical network and the depth value of pixel p in the effective depth map of the i-th image; N_S is the set of sampled rays in the input views; and δ is a parameter of the depth loss function.
Further, in the volume rendering formula,
C_i(p) is the predicted color value of the corresponding pixel p in the 2D image, d is the direction of the ray cast from the camera center through pixel p, h is the position parameter of a sample point on the ray, τ_i(h) denotes the accumulated transmittance along the ray from the nearest point h_n to the farthest point h_f, and P_i(h) is the 3D point on the camera ray transformed into the canonical space by the deformation network.
Compared with the prior art, the invention has the following beneficial effects:
(1) To address the limited field of view and depth range of monocular endoscope image sequences, monocular endoscope depth estimation is used to predict dense depth maps, which are then optimized during reconstruction in the deformable neural radiance field; scene depth information is thereby recovered from sparse views, yielding accurate new view images.
(2) The deformable neural radiance field models the shape and color of the scene through a deformation network and a canonical network, and handles deformation and motion in the scene well, avoiding problems such as blurring and artifacts.
(3) The invention uses labelme to create mask images that clearly separate the invalid area from the effective area and thereby constrain the ray sampling range, so that the deformable neural radiance field models only the effective area of the image, improving the results.
(4) The invention achieves accurate new view synthesis, can be applied in the medical imaging field, helps physicians observe details of diseased areas more accurately, and supports more reliable preoperative diagnosis.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the invention;
FIG. 2 is a diagram of an overall architecture of an embodiment of the present invention;
FIG. 3 is a comparison of the new view images predicted by the new view generation model trained in the embodiment of the present invention against the results of several existing methods.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following examples.
This embodiment provides a method for generating new view images of a monocular endoscope based on a deformable neural radiance field; as shown in FIG. 1, the method comprises the following steps:
step 1: and acquiring a plurality of visual angle original images and camera parameters acquired by the monocular endoscope.
The camera parameters of the original multi-view images are acquired using COLMAP, and a coarse depth estimate is computed. The camera poses and intrinsics generated by COLMAP's SfM (Structure from Motion) guide the implicit scene modeling described later; the coarse depth image is the sparse depth map generated by SfM, which guides the retraining of the monocular endoscope depth estimation network described later.
Estimating the camera parameters and coarse depth map comprises the following steps (a tooling sketch follows the list):
1-1: input the original multi-view images into COLMAP;
1-2: perform feature extraction, feature matching, and sparse reconstruction on the images in sequence to obtain the camera intrinsics, camera poses, and a sparse 3D point cloud. The camera intrinsics and poses guide the implicit scene modeling described later, and the sparse 3D point cloud guides the retraining of the monocular endoscope depth estimation network described later;
1-3: store the camera intrinsics, camera poses, and sparse 3D point cloud in bin format;
1-4: convert the files to LLFF format so that the deformable neural radiance field model can read them.
Step 2: and based on the original image, making a Mask image to obtain Mask images of the original images with multiple view angles.
The mask image making in the step 2 comprises the following operations:
2-1: selecting one input labelme from the original multiple view images;
2-2: and manually manufacturing a closed-loop binary Mask at the edges of the black invalid region and the effective region to obtain a Mask image. In the binary mask, a black pixel region corresponds to an effective region within the endoscopic image, and a white pixel region corresponds to a black ineffective region of the endoscopic image. This image can be used to distinguish between an invalid region and an valid region of the image. Outputting and storing the manufactured mask image into json format;
2-3: converting a json format Mask image obtained by 2-2 into a png format, and then expanding the png format Mask image into a plurality of Mask images I corresponding to the original plurality of view images mask
Step 3: inputting the original image into a pre-trained monocular endoscope depth estimation network model to obtain a dense depth map.
This step of acquiring a dense depth map includes the operations of:
3-1: monocular endoscopic depth estimation networks are an existing, self-supervising, dense depth estimation network that performs pre-training on data sets composed of different monocular endoscopic videos. Initializing network parameters by using a pre-trained dense depth estimation model, so that transfer learning can be realized, and a reliable dense depth map can be generated;
3-2: inputting the original multiple view images into a monocular endoscope depth estimation network, and using the sparse depth map generated by SfM from the step 1 as a self-supervision signal to finely tune the pre-trained monocular endoscope depth estimation network;
3-3: in order to avoid the acquired depth map having scale ambiguity, the weights of the monocular endoscope depth estimation network are fine-tuned using standard back propagation with a loss function 1;
the loss function 1 is a scale invariant loss function expression as follows:
wherein,representing the i-th original image input into the dense depth map predicted by the monocular endoscope depth estimation network model,/for the monocular endoscope depth estimation network model>Representing the rough depth image corresponding to the i-th original image.
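The displayed formula is rendered as an image in the source; a common scale-invariant depth loss matching this description is the Eigen-style log-depth loss, sketched below as an assumption (the weighting is fixed to 1, and supervision is restricted to pixels where SfM provides depth):

```python
import torch

def scale_invariant_loss(pred_depth, sparse_depth, eps=1e-6):
    """Hedged sketch of loss function 1: scale-invariant log-depth loss,
    evaluated only at the sparse pixels q where SfM depth is defined."""
    valid = sparse_depth > 0
    d = torch.log(pred_depth[valid] + eps) - torch.log(sparse_depth[valid] + eps)
    n = d.numel()
    return (d ** 2).mean() - (d.sum() ** 2) / (n ** 2)
```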
3-4: the network model trains 150 epochs in total;
3-5: inputting the original multiple view images into a trained monocular endoscope depth estimation network model to obtain a dense depth map I predicted by the model ddepth
Step 4: and obtaining an original image and an effective area of the dense depth map based on the Mask image.
The effective area of the original image is obtained as follows: according to the Mask image I_mask obtained in step 2, determine whether each pixel of the original image lies in the effective area not occluded by black invalid pixels. If the pixel lies in the black occluded area, no subsequent ray sampling is performed for it. If the pixel lies in the effective area, step 5 is executed: rays are cast through randomly sampled pixels of the effective area and sampled to obtain 3D position coordinates.
The effective area of the dense depth map is obtained as follows: according to the Mask image I_mask obtained in step 2, determine whether each pixel of the dense depth map lies in the effective area not occluded by black invalid pixels. If the pixel lies in the black occluded area, its depth value is set to 0. If the pixel lies in the effective area, the original depth value is kept. This yields the effective depth map I_ldepth, which, combined with loss function 3, serves as a self-supervision signal for training the implicit deformable neural radiance field. A masking sketch follows.
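A minimal sketch of step 4; the array names are illustrative, and the mask convention matches step 2, with black (0) marking the effective area:

```python
import numpy as np

valid = mask == 0                          # effective area of the view

# Effective depth map I_ldepth: zero depth in the occluded area,
# keep the predicted dense depth in the effective area.
I_ldepth = np.where(valid, I_ddepth, 0.0)

# Step 5 samples rays only through pixels of the effective area
# (4096 rays per batch, as specified in 5-3).
ys, xs = np.nonzero(valid)
idx = np.random.choice(len(xs), size=4096, replace=False)
sampled_pixels = np.stack([xs[idx], ys[idx]], axis=1)
```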
Step 5: based on the camera parameters and the effective area, implicit reconstruction is performed by using a deformable neural radiation field to obtain spatial volume density information and spatial color information.
The implicit reconstruction of the scene at this step includes the following operations:
5-1: the D-NeRF (Deformable Neural Radiance Fields), deformable neural radiation field, proposes a deformation network and a canonical network, consisting of two cascaded multi-layer perceptrons (Multilayer Perceptron, MLP) capable of reconstructing and rendering rigid and non-rigid scenes acquired by monocular cameras. The invention reconstructs and renders the deformed scene in the monocular endoscopic image based on D-NeRF. The reconstruction process is shown in fig. 2, where randomly extracted pixels located within the active area of the original image project rays and sample to obtain 3D position coordinates (x, y, z). The position offset (Deltax, deltay, deltaz) from the 1 st original image to the i-th original image is predicted by a deformation network. The new position coordinates (x+Δx, y+Δy, z+Δz) and the viewing angle generated by SfM of step 1 are then fed into the canonical network to obtain the reconstructed color and bulk density. Through training of the network, a field state with continuous color and volume density in a three-dimensional space can be obtained as an implicit expression of the scene.
5-2: total loss function sampling loss function 2 and lossThe loss functions 3 are combined so as to self-supervise the training of the deformable neural radiation field of the endoscope scene. The loss function 2 is the color C of the pixel p of the ith image predicted based on canonical network i Color C of pixel p of (p) and ith original image i-gt (p) established color loss L RGB . The loss function 3 is the depth value D of the pixel p of the ith image predicted based on the canonical network i (p) and depth value D of pixel p of effective depth map of ith image obtained in step 4 i-gt (p) established depth loss L Depth . The total loss function can be expressed as:
L=L RGB +L Depth (2)
wherein N is S Representing a set of sampled rays in the input view;
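The per-term formulas are displayed as images in the source; a plausible reconstruction consistent with the variable definitions is given below. Treating δ as the threshold of a Huber-type depth penalty is an assumption suggested by the parameter named in claim 9, not the patent's verbatim definition:

```latex
L_{\mathrm{RGB}}   = \sum_{p \in N_S} \big\lVert C_i(p) - C_{i\text{-}gt}(p) \big\rVert_2^2,
\qquad
L_{\mathrm{Depth}} = \sum_{p \in N_S} H_\delta\big( D_i(p) - D_{i\text{-}gt}(p) \big),

H_\delta(e) =
\begin{cases}
  \tfrac{1}{2} e^2, & |e| \le \delta \\
  \delta \left( |e| - \tfrac{1}{2}\delta \right), & |e| > \delta
\end{cases}
```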
5-3: the MLP neural network model trains a total of 200k epochs, the batch size of the ray is set to 4096, and each pixel is sampled 32 times along the ray.
Step 6: and rendering the space volume density information and the space color information by the volume to obtain an image generated by the new view angle.
The new view angle generation of the scene comprises the following specific operation steps:
6-1: inputting the 3D position coordinates and the observation visual angles corresponding to the pixels in the images of the test set into a deformation network and a standard network of the trained deformable nerve radiation field to obtain the color and volume density of the sampling points;
6-2: according to the color C and the volume density sigma of the sampling point obtained by 6-1, predicting the color value C of the corresponding pixel point p in the 2D image through a volume rendering formula i (p) and then generating the whole new view angle image. The volume rendering formula is as follows:
where d is the direction of the light from the camera center through pixel p, h is the position parameter representing the sampling point on the ray, τ i (h) Representing the ray from the nearest point h n To the furthest point h f P i (h) Is a 3D point on the camera ray transformed into canonical space using a deformation network.
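The displayed rendering formula is not reproduced in the text; written with the variables defined above, the standard NeRF volume-rendering integral (a reconstruction, not the patent's verbatim equation) reads:

```latex
C_i(p) = \int_{h_n}^{h_f} \tau_i(h)\, \sigma\!\big(P_i(h)\big)\, \mathbf{c}\!\big(P_i(h),\, d\big)\, dh,
\qquad
\tau_i(h) = \exp\!\left( -\int_{h_n}^{h} \sigma\!\big(P_i(s)\big)\, ds \right)
```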
The network model can thus be trained and tested on the constructed dataset and accurately synthesizes high-fidelity views.
The method of this embodiment selects several sparse multi-view images from the public endoscope dataset Nerthus to verify the network's effectiveness. The method is used to generate new views of monocular endoscopic images and is compared with 7 new view generation methods: neural radiance fields (NeRF), neural radiance fields for dynamic scenes (D-NeRF), DietNeRF, guided optimization of neural radiance fields for indoor multi-view stereo (NerfingMVS), depth-supervised NeRF (DS-NeRF), fast dynamic radiance fields based on time-aware neural voxels (TiNeuVox), and neural radiance feature fields (NRFF). FIG. 3 compares the predicted details of the method of the present invention and the existing methods. As the comparison in FIG. 3 shows, the method of the present invention accurately infers the geometry and appearance of the real soft-tissue inner wall when synthesizing new views of monocular endoscopic images, and obtains the clearest details. Furthermore, the proposed method does not produce implausible pixels in the invalid occluded areas. The new views synthesized by the method of this embodiment are therefore closer to the original colors of the ground truth.
The above functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein. The schemes in the embodiments of the present invention can be implemented in various computer languages, such as the object-oriented programming language Java and the scripting language JavaScript.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A method for generating a new view image of a monocular endoscope based on a deformable neural radiance field, comprising the following steps:
acquiring multi-view original images and camera parameters captured by a monocular endoscope;
creating Mask images from the original images to obtain Mask images for the multi-view original images;
inputting the original images into a pre-trained monocular endoscope depth estimation network model to obtain dense depth maps;
obtaining the effective areas of the original images and the dense depth maps based on the Mask images;
performing implicit reconstruction with a deformable neural radiance field, based on the camera parameters and the effective areas, to obtain spatial volume density and spatial color information;
and volume-rendering the spatial volume density and color information to generate the new view image.
2. The method for generating a new view image of a monocular endoscope based on a deformable neural radiance field according to claim 1, wherein the camera parameters are acquired using COLMAP.
3. The method for generating a new view image of a monocular endoscope based on a deformable neural radiance field according to claim 1, wherein the camera parameters include the camera pose and intrinsic parameters.
4. The method for generating a new view image of a monocular endoscope based on a deformable neural radiance field according to claim 1, wherein the specific steps of obtaining the Mask images include:
selecting one of the multi-view original images and loading it into labelme;
manually drawing a closed-loop binary Mask along the boundary between the invalid area and the effective area to obtain the Mask image of a single original image;
and replicating the Mask image of the single original image to obtain the Mask images corresponding to all multi-view original images.
5. The method for generating a new view image of a monocular endoscope based on a deformable neural radiance field according to claim 1, wherein the training process of the monocular endoscope depth estimation network model specifically comprises:
pre-training the monocular endoscope depth estimation network model on datasets composed of different monocular endoscope videos, and initializing the network parameters;
retraining the network model on the multi-view original images to complete the training of the monocular endoscope depth estimation network model.
6. The method for generating a new view image of a monocular endoscope based on a deformable neural radiance field according to claim 5, wherein during training of the monocular endoscope depth estimation network model, a scale-invariant loss function is adopted to adjust the network weights,
in which L_1 denotes the scale-invariant loss value, D̂_i denotes the dense depth map predicted by the monocular endoscope depth estimation network model from the i-th original image, D*_i denotes the sparse depth map corresponding to the i-th original image, and q denotes a pixel.
7. The method for generating a new view image of a monocular endoscope based on a deformable neural radiance field according to claim 1, wherein obtaining the effective areas of the original images and the dense depth maps specifically comprises:
judging, based on the Mask image, whether each pixel of the original image and the dense depth map lies in the black occluded area; if so, the pixel lies in the invalid area, and if not, the pixel lies in the effective area.
8. The method for generating a new view image of a monocular endoscope based on a deformable neural radiance field according to claim 1, wherein the specific steps of obtaining the spatial volume density and spatial color information include:
obtaining the 3D points corresponding to the pixels in the effective areas;
and inputting the camera parameters and the 3D points corresponding to the effective areas into the deformation network and the canonical network of the trained deformable neural radiance field, and outputting the spatial volume density and spatial color information, wherein the deformation network and the canonical network of the deformable neural radiance field are trained with a total loss built from a color loss and a depth loss.
9. The method for generating a new view image of a monocular endoscope based on a deformable neural radiance field according to claim 8, wherein the total loss function is expressed as:
L = L_RGB + L_Depth
where L is the total loss, L_RGB is the color loss, and L_Depth is the depth loss; C_i(p) and C_i-gt(p) are the color of pixel p of the i-th image predicted by the canonical network and the color of pixel p in the i-th original image; D_i(p) and D_i-gt(p) are the depth value of pixel p of the i-th image predicted by the canonical network and the depth value of pixel p in the effective depth map of the i-th image; N_S is the set of sampled rays in the input views; and δ is a parameter of the depth loss function.
10. The method for generating a new view image of a monocular endoscope based on a deformable neural radiance field according to claim 1, wherein in the volume rendering formula,
C_i(p) is the predicted color value of the corresponding pixel p in the 2D image, d is the direction of the ray cast from the camera center through pixel p, h is the position parameter of a sample point on the ray, τ_i(h) denotes the accumulated transmittance along the ray from the nearest point h_n to the farthest point h_f, and P_i(h) is the 3D point on the camera ray transformed into the canonical space by the deformation network.
CN202311280252.XA 2023-09-28 2023-09-28 New view image generation method for a monocular endoscope based on a deformable neural radiance field Pending CN117392312A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311280252.XA CN117392312A (en) 2023-09-28 2023-09-28 New view image generation method for a monocular endoscope based on a deformable neural radiance field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311280252.XA CN117392312A (en) 2023-09-28 2023-09-28 New view image generation method for a monocular endoscope based on a deformable neural radiance field

Publications (1)

Publication Number Publication Date
CN117392312A true CN117392312A (en) 2024-01-12

Family

ID=89440089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311280252.XA Pending CN117392312A (en) 2023-09-28 2023-09-28 New view image generation method of monocular endoscope based on deformable nerve radiation field

Country Status (1)

Country Link
CN (1) CN117392312A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117745924A (en) * 2024-02-19 2024-03-22 北京渲光科技有限公司 Neural rendering method, system and equipment based on depth unbiased estimation
CN117745924B (en) * 2024-02-19 2024-05-14 北京渲光科技有限公司 Neural rendering method, system and equipment based on depth unbiased estimation



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination