CN112330729B - Image depth prediction method, device, terminal equipment and readable storage medium


Info

Publication number
CN112330729B
Authority
CN
China
Prior art keywords
sparse
depth map
depth
map
displacement
Prior art date
Legal status
Active
Application number
CN202011359229.6A
Other languages
Chinese (zh)
Other versions
CN112330729A (en)
Inventor
廖祥云
王琼
王平安
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202011359229.6A priority Critical patent/CN112330729B/en
Publication of CN112330729A publication Critical patent/CN112330729A/en
Application granted granted Critical
Publication of CN112330729B publication Critical patent/CN112330729B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application is applicable to the technical fields of computer vision and image processing, and provides an image depth prediction method, an image depth prediction device, terminal equipment and a computer readable storage medium. The method comprises the following steps: acquiring an image to be predicted of a tissue surface acquired by a monocular endoscope; inputting the image to be predicted into a trained convolutional neural network model for processing, and outputting a dense depth map of the image to be predicted. The trained convolutional neural network model is obtained by training according to a sparse depth map and a sparse displacement map of sample images in a training set, and the sample images are images of tissue surfaces acquired through a monocular endoscope. The method and the device can solve the problem that traditional depth prediction methods produce sparse and unevenly distributed three-dimensional reconstructions and therefore poor image depth prediction results, and can predict the depth of images acquired through a monocular endoscope well even when the camera and the light source move together.

Description

Image depth prediction method, device, terminal equipment and readable storage medium
Technical Field
The application belongs to the technical field of computer vision and image processing, and particularly relates to an image depth prediction method, an image depth prediction device, terminal equipment and a readable storage medium.
Background
An endoscope is an instrument that can reach the inside of a human organ or of industrial equipment through a narrow passageway for viewing. Its emergence and development have broken through the visual limits of the human eye and brought great convenience to medical diagnosis and to the detection of defects inside industrial equipment. Depth information of a scene plays a vital role in many research topics, and depth prediction techniques are increasingly applied in various fields, such as three-dimensional reconstruction, obstacle detection and visual navigation.
Because the images obtained in a monocular endoscope scene have sparse texture, lack distinctive features and cover only a small field of view, traditional depth prediction methods produce sparse and unevenly distributed three-dimensional reconstructions, so the depth prediction results for such images are poor.
Disclosure of Invention
The embodiments of the application provide an image depth prediction method, an image depth prediction device and terminal equipment, which can solve the problem that traditional depth prediction methods produce sparse and unevenly distributed three-dimensional reconstructions and therefore poor image depth prediction results.
In a first aspect, an embodiment of the present application provides an image depth prediction method, including:
acquiring an image to be predicted of a tissue surface acquired by a monocular endoscope; inputting the image to be predicted into a trained convolutional neural network model for processing, and outputting a dense depth map of the image to be predicted; the trained convolutional neural network model is obtained by training according to a sparse depth map and a sparse displacement map of sample images in a training set, and the sample images are images of tissue surfaces acquired through a monocular endoscope.
In a possible implementation manner of the first aspect, the method includes:
extracting feature points of sample images of different views in the training set; matching the feature points among the sample images of different views to obtain mutually matched feature points in the sample images; performing sparse reconstruction according to the matched feature points and the camera internal parameters to obtain sparse point cloud data and the camera pose; and performing projection mapping and data transformation on the sparse point cloud data according to the camera pose and the camera internal parameters to obtain the sparse depth map and the sparse displacement map.
In a possible implementation manner of the first aspect, training the convolutional neural network model according to the sparse depth map and the sparse displacement map of the sample images in the training set includes:
respectively inputting a first sample image and a second sample image into two branches of the convolutional neural network model, performing feature learning through the two branches, and obtaining a first predicted depth map of the first sample image and a second predicted depth map of the second sample image respectively output by the two branches; the first sample image and the second sample image are two frames of images in the training set whose overlapping area meets a preset condition; inputting a first sparse depth map and the first predicted depth map into a scaling depth network layer, and performing enlargement or reduction processing through the scaling depth network layer to obtain a first scaled depth map, the first sparse depth map being the sparse depth map of the first sample image; inputting a second sparse depth map and the second predicted depth map into the scaling depth network layer, and performing enlargement or reduction processing through the scaling depth network layer to obtain a second scaled depth map, the second sparse depth map being the sparse depth map of the second sample image; inputting the first scaled depth map into a coordinate transformation network layer, and obtaining a first dense depth map output by the coordinate transformation network layer after coordinate transformation; inputting the second scaled depth map into the coordinate transformation network layer, and obtaining a second dense depth map output by the coordinate transformation network layer after coordinate transformation; respectively performing projection transformation on the first scaled depth map and the second scaled depth map to obtain a first dense displacement map corresponding to the first scaled depth map and a second dense displacement map corresponding to the second scaled depth map; and training and updating the parameters of the convolutional neural network model through the effective depth loss between the first scaled depth map and the first sparse depth map and the effective depth loss between the second scaled depth map and the second sparse depth map, the depth difference loss between the first scaled depth map and the first dense depth map and the depth difference loss between the second scaled depth map and the second dense depth map, and the projection displacement loss between the first sparse displacement map and the first dense displacement map and the projection displacement loss between the second sparse displacement map and the second dense displacement map.
In a possible implementation manner of the first aspect, an effective depth loss between the first scaled depth map and the first sparse depth map and an effective depth loss between the second scaled depth map and the second sparse depth map are calculated by an effective depth loss function, where the effective depth loss function is expressed as follows:
where L_edl(j,k) is the effective depth loss, M_j is the sparse mask of the j-th frame image, M_k is the sparse mask of the k-th frame image, Y_j is the first scaled depth map, Y_j* is the first sparse depth map, Y_k is the second scaled depth map, and Y_k* is the second sparse depth map; n is the number of effective pixel points in the sparse depth map, and its value is an integer greater than 1.
In a possible implementation manner of the first aspect, the projected displacement loss between the first sparse displacement map and the first dense displacement map and the projected displacement loss between the second sparse displacement map and the second dense displacement map are calculated by a projected displacement loss function, which is expressed as follows:
where L_psl(j,k) is the projection displacement loss, M_j is the sparse mask of the j-th frame image, M_k is the sparse mask of the k-th frame image, S_j,k is the first dense displacement map, S_j,k* is the first sparse displacement map, S_k,j is the second dense displacement map, and S_k,j* is the second sparse displacement map; n' is the number of effective pixel points in the sparse displacement map, and its value is an integer greater than 1.
In a possible implementation manner of the first aspect, the depth difference loss between the first scaled depth map and the first dense depth map and the depth difference loss between the second scaled depth map and the second dense depth map are calculated by a depth difference loss function, the depth difference loss function being expressed as follows:
where L_dcl(j,k) is the depth difference loss, Y_j is the first scaled depth map, Ŷ_j,k is the first dense depth map, Y_k is the second scaled depth map, and Ŷ_k,j is the second dense depth map; n is the number of effective pixel points in the sparse depth map, and its value is an integer greater than 1.
In a possible implementation manner of the first aspect, the trained convolutional neural network model includes a dense network layer, an upsampling network layer, a downsampling network layer, and a residual network layer; inputting the image to be predicted into a trained convolutional neural network model for processing, wherein the method comprises the following steps of:
extracting features of the image to be predicted through the dense network layer, and outputting a feature map;
inputting the feature map output by the dense network layer into the up-sampling network layer or the down-sampling network layer, and outputting a sampled feature map after up-sampling or down-sampling processing;
and inputting the sampled feature map output by the up-sampling network layer or the down-sampling network layer into the residual network layer, where the residual network layer assigns weight values to the channels of the sampled feature map and outputs a feature map with preset dimensions.
In a second aspect, an embodiment of the present application provides an image depth prediction apparatus, including:
an acquisition unit for acquiring an image to be predicted of a tissue surface acquired by a monocular endoscope;
the processing unit is used for inputting the image to be predicted into the trained convolutional neural network model for processing and outputting a dense depth map of the image to be predicted; the trained convolutional neural network model is obtained by training according to a sparse depth map and a sparse displacement map of sample images in a training set, and the sample images are images of tissue surfaces acquired through a monocular endoscope.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program that when executed by a processor implements the method.
In a fifth aspect, embodiments of the present application provide a computer program product for, when run on a terminal device, causing the terminal device to perform the method of any one of the first aspects.
It will be appreciated that the advantages of the second to fifth aspects may be found in the relevant description of the first aspect, and are not described here again.
Compared with the prior art, the embodiments of the application have the following beneficial effects: the terminal device acquires an image to be predicted of a tissue surface acquired through a monocular endoscope, inputs the image to be predicted into a trained convolutional neural network model for processing, and outputs a dense depth map of the image to be predicted; the trained convolutional neural network model is obtained by training according to a sparse depth map and a sparse displacement map of sample images in a training set, where the sample images are images of tissue surfaces acquired through a monocular endoscope. By training the convolutional neural network model with the sparse depth map and the sparse displacement map of the sample images and processing the image to be predicted with the trained model, a dense depth map of the image to be predicted is obtained; the dense depth map reflects the depth information of the image better and achieves a better prediction of the image depth. The method therefore has strong usability and practicability.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required for the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of an application scenario provided in an embodiment of the present application;
fig. 2 is a flowchart of an image depth prediction method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a sample image preprocessing process provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a network architecture for model training provided by embodiments of the present application;
FIG. 5 is a schematic diagram of a convolutional neural network model provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of each network layer of a convolutional neural network model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a residual network layer provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an upsampling network layer and a downsampling network layer according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an image depth prediction apparatus according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a terminal device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when" or "once" or "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining" or "in response to determining" or "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
In addition, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
Currently, depth prediction methods for monocular endoscope scenes include traditional multi-view stereo methods and deep-learning-based methods. However, images acquired by a monocular endoscope have sparse texture, lack distinctive features, and so on, so traditional multi-view stereo methods produce sparse and unevenly distributed reconstructions when used for depth prediction. Deep-learning-based depth prediction is generally fully supervised and therefore requires depth labels corresponding to the acquired images; because devices that acquire depth images are difficult to use inside a living body or a narrow pipeline, real depth labels cannot be obtained, and fully supervised convolutional neural network methods therefore cannot be used for depth prediction on images acquired by a monocular endoscope.
In the field of computer vision, self-supervised methods for single-frame image depth prediction include computing a point cloud of a target view through a depth convolutional network and a camera pose network, projecting the point cloud onto a source view, computing the pixel values at each position of the source view by bilinear interpolation and warping them onto a new view corresponding to the target view, taking this new view as a synthesized view, and computing the pixel differences between the synthesized view and the target view at the same positions so as to predict image depth. Other approaches include GeoNet, an unsupervised learning framework that jointly learns depth, optical flow and camera pose, and methods that extract sparse data based on structure from motion (Structure from Motion, SfM) as a self-supervision signal to train a convolutional neural network model, thereby realizing depth prediction of images.
However, the above methods do not perform well for image depth prediction in monocular endoscope application scenarios. On the one hand, during use of a monocular endoscope, the cold light source moves together with the camera, so the assumption of a constant light source is not met; for the same object, differences in illumination and shooting pose may therefore lead to differences in the prediction results. Further, although extracting sparse data based on structure from motion (Structure from Motion, SfM) is invariant to illumination, the acquired image texture is sparse for tissue structures with high surface reflectivity, which reduces the network's ability to capture features. In addition, the monocular endoscope has a small field of view and the depth differences are not significant, so smoother tissue surfaces do not show good depth differentiation.
To address the above problems, the embodiments of the application provide an image depth prediction method in which a convolutional neural network model is trained through a new self-supervised method; the trained convolutional neural network model achieves good depth prediction for objects with sparse texture, small depth variation and few distinctive features, and gives a good reconstruction effect.
Referring to fig. 1, a flow chart of an application scenario provided in an embodiment of the present application is shown. After the convolutional neural network model is trained by the self-supervised method, a trained convolutional neural network model is obtained; an image to be predicted is input into the trained convolutional neural network model, which performs feature extraction and feature learning and outputs a dense depth map of the image to be predicted, thereby realizing prediction of the image depth.
In the embodiment of the application, the sample data of the training set is preprocessed, sparse reconstruction data is obtained through a three-dimensional reconstruction tool COLMAP, and supervision data (or training data) is obtained through projection transformation processing of the sparse reconstruction data. In the data preprocessing stage of the image, sparse reconstruction is carried out according to the internal parameters of the camera, training data can be obtained without manually marking a depth map or other imaging modes, and the training data comprises a sparse depth map and a sparse displacement map.
In the training process of the convolutional neural network model, the adopted data comprise images acquired through a monocular endoscope, camera internal parameters, a sparse depth map, a sparse displacement map, camera pose and the like. Training of convolutional neural network models also involves scaling the depth network layer, coordinate transformation network layer, projectively transformed dense displacement maps, and application of corresponding loss functions.
The following describes the specific contents of data preprocessing, model training and model architecture in conjunction with the implementation steps of the image depth prediction method provided in the embodiments of the present application.
Referring to fig. 2, a flowchart of an image depth prediction method provided in an embodiment of the present application is shown; the image depth prediction method comprises the following steps in the application process:
step S201, an image to be predicted of the tissue surface acquired by the monocular endoscope is acquired.
In some embodiments, the endoscope is an instrument that can reach the inside of a human organ or of industrial equipment through a narrow channel for viewing; by connecting a camera, video or images within the field of view can be acquired. The terminal device obtains an image to be predicted of the tissue surface through the monocular endoscope; constrained by the size of the monocular endoscope probe, the field of view is limited and only a local two-dimensional image can be obtained, which is unfavorable for analyzing and judging a key region. Therefore, depth prediction is performed on the acquired image to be predicted, so that information such as the position, size and structure of the key region can be obtained in a complex pipeline structure, for example the bronchi in the medical field or a hole region of a water supply pipeline in the industrial field.
Wherein the image to be predicted may be one or more video frames in the video acquired during the monocular endoscope movement. The terminal device may be a device integrating the photographing apparatus or a terminal device connected to the photographing apparatus by a wired or wireless manner.
Step S202, inputting an image to be predicted into a trained convolutional neural network model for processing, and outputting a dense depth map of the image to be predicted; the trained convolutional neural network model is obtained by training according to a sparse depth map and a sparse displacement map of sample images in a training set, and the sample images are images of tissue surfaces acquired through a monocular endoscope.
In some embodiments, the convolutional neural network model is trained by a sparse depth map and a sparse displacement map of sample images in a training set, which are obtained by performing data preprocessing on the sample images.
For example, the sample images in the training set may be images of the tissue surface acquired by a monocular endoscope based on different view angles.
In some embodiments, preprocessing the sample images in the training set includes:
A1, extracting feature points of sample images of different views in the training set.
In some embodiments, the sample images of different views are multi-frame images containing the same region taken based on different angles. Based on the illumination and size invariance function, feature points in the sample image can be extracted. The features of the feature points include color features, texture features, shape features, spatial relationship features, and the like.
For example, referring to fig. 3, which is a schematic diagram of the sample image preprocessing process provided in the embodiment of the present application, the sample images in the training set and the camera internal parameters are input into the three-dimensional reconstruction tool COLMAP, and feature extraction is performed on the sample images to obtain their feature points. COLMAP is an automatic three-dimensional reconstruction tool based on the structure-from-motion algorithm; it can extract the features of a sample image with only the sample image as input.
And A2, carrying out feature point matching between sample images of different views to obtain feature points matched with each other in the sample images.
In some embodiments, the sample images for different views contain the same capture site; for different sample images of the same or partially the same scene, the extracted features are the same, i.e. the extracted feature points or partial feature points have the same features. And matching the characteristic points with the same characteristic among the sample images of different views.
Exemplary, sample images of different views are input into a three-dimensional reconstruction tool COLMAP, feature point matching is performed between the sample images of different views after feature extraction is performed, and feature points with the same features are determined.
And A3, performing sparse reconstruction according to the matched feature points and the camera internal parameters to obtain sparse point cloud data and the camera pose.
In some embodiments, sparse reconstruction is performed with a structure-from-motion (Structure from Motion, SfM) algorithm from the known camera internal parameters and the mutually matched feature points, producing the data of a three-dimensional sparse point cloud (i.e., the sparse point cloud data) and the camera pose. The camera internal parameters are the intrinsic matrix of the camera.
And A4, performing projection mapping and data transformation on the sparse point cloud data according to the camera pose and the camera internal parameters to obtain a sparse depth map and a sparse displacement map.
In some embodiments, as shown in fig. 3, according to the pose of the camera and the internal parameters of the camera, the three-dimensional sparse point cloud data projection after sparse reconstruction is mapped onto a two-dimensional plane of a corresponding image; for example, if a point in a three-dimensional point cloud is derived from a multi-frame image in an image sequence, that point may be projected onto a corresponding location on the corresponding multi-frame image.
Specifically, the data transformation process is the process of obtaining the sparse depth map and the sparse displacement map. The process of obtaining the sparse depth map is as follows: assume that, in the sparsely reconstructed three-dimensional (3D) sparse point cloud, the homogeneous coordinate of the n-th 3D point in the world coordinate system is P_nw, and the camera coordinate of the n-th 3D point corresponding to the j-th frame image of the multi-frame images is P_nj; the correspondence between the two coordinates is P_nj = K·T_wj·P_nw, where K·T_wj is the coordinate transformation matrix. The depth value D_nj of the n-th 3D point in the j-th frame image is the z-axis component of P_nj; the projection position of the n-th 3D point on the image plane of the j-th frame is u_nj, and the depth value stored at this position is D_nj. The depth value is set to 0 at positions onto which no point is projected, thereby obtaining the sparse depth map of the j-th frame image; for example, the sparse depth map of the j-th frame image may be denoted Y_j*, and the sparse depth map of the k-th frame image may be denoted Y_k*.
In addition, the sparse displacement map describes the two-dimensional projected motion of the sparse reconstruction; it is determined by the displacement of the positions onto which the sparse point cloud projects in the j-th frame image relative to the k-th frame image, or of the positions onto which it projects in the k-th frame image relative to the j-th frame image. For example, S_j,k* represents the displacement of the sparse point cloud's projection positions in the j-th frame image relative to the k-th frame image, and S_k,j* represents the displacement of the projection positions in the k-th frame image relative to the j-th frame image. The j-th frame image and the k-th frame image are two frames whose shared image content exceeds a certain threshold; for example, the shared portion may occupy 70% or more of the original image, and other thresholds may be set as needed.
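As an illustration of this data transformation, the following sketch shows how a sparse depth map and a sparse displacement map for a frame pair could be built from the sparse point cloud under a standard pinhole projection model; it is not code from the patent, and all function and variable names are hypothetical.

import numpy as np

def project_points(points_w, K, T_wj):
    """Project Nx3 world points into frame j. Returns pixel coords (Nx2) and depths (N,)."""
    pts_h = np.hstack([points_w, np.ones((points_w.shape[0], 1))])  # homogeneous Nx4
    pts_cam = (T_wj @ pts_h.T).T[:, :3]          # camera coordinates of each point
    depths = pts_cam[:, 2]                        # z-axis component = depth D_nj
    pix = (K @ pts_cam.T).T
    pix = pix[:, :2] / depths[:, None]            # perspective division -> u_nj
    return pix, depths

def sparse_depth_and_displacement(points_w, K, T_wj, T_wk, h, w):
    """Build the sparse depth map of frame j and the sparse displacement map S*_{j,k}."""
    pix_j, depth_j = project_points(points_w, K, T_wj)
    pix_k, _ = project_points(points_w, K, T_wk)

    depth_map = np.zeros((h, w), dtype=np.float32)       # 0 where no point projects
    disp_map = np.zeros((h, w, 2), dtype=np.float32)     # 2D motion of each projection
    mask = np.zeros((h, w), dtype=np.float32)            # sparse mask M_j

    for (uj, dj, uk) in zip(pix_j, depth_j, pix_k):
        x, y = int(round(uj[0])), int(round(uj[1]))
        if 0 <= x < w and 0 <= y < h and dj > 0:
            depth_map[y, x] = dj
            disp_map[y, x] = uk - uj                      # movement from frame j to frame k
            mask[y, x] = 1.0
    return depth_map, disp_map, mask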
In some embodiments, the training data of the convolutional neural network model includes supervision data, a sequence of sample images, a camera pose, and the like, where the supervision data is a sparse depth map and a sparse displacement map, and is used to construct a convergence function of the entire network architecture in a training process of the convolutional neural network model.
In some embodiments, training a convolutional neural network model from a sparse depth map and a sparse displacement map of sample images in a training set, comprising:
b1, respectively inputting a first sample image and a second sample image into two paths of convolutional neural network models, respectively performing feature learning through the two paths of convolutional neural network models, and then obtaining a first predicted depth map of the first sample image and a second predicted depth map of the second sample image which are respectively output by the two paths of convolutional neural network models; the first sample image and the second sample image are two frames of images, wherein the overlapping area of the two frames of images in the training set meets the preset condition.
In some embodiments, in the process of training the convolutional neural network, the convolutional neural network model is trained by adopting network architectures of two identical branches. As shown in fig. 4, a network architecture diagram of model training provided in an embodiment of the present application is shown. The convolution neural network models SE-FCDenseNet of the two branches are the same model, and the scaling depth network layer and the coordinate transformation network layer in each branch are the same; and respectively inputting a first sample image j and a second sample image k into the two branches, respectively obtaining respective scaling depth images and dense displacement images through processing of each network layer, and then combining the two branches to construct a loss function so as to realize weight sharing of the convolution neural network models of the two branches.
The first sample image and the second sample image input in the training process are two frames of images in a group of video frame sequences, and the group of video frames is one video shot by the camera in a corresponding group of poses. In the training process, selecting one frame image in the group of video frames as the input of one branch, and selecting the other frame image in the range from the 2 nd frame image to the 20 th frame image in the group of video frame sequences as the input of the other branch; the overlapping area (or the area with the same image content) of the two input frames of images meets the preset condition, for example, the overlapping area accounts for 70% of the original image; while ensuring that two frames of images have enough overlapping areas, certain randomness exists.
B2, inputting the first sparse depth map and the first predicted depth map into a scaling depth network layer, and performing size enlarging or shrinking treatment on the first sparse depth map and the first predicted depth map through the scaling depth network layer to obtain a first scaling depth map; the first sparse depth map is a sparse depth map of the first sample image.
In some embodiments, the first sparse depth map is a sparse depth map obtained by preprocessing data of the first sample image. Performing amplification or reduction processing on the pixel value of the first prediction depth map according to the pixel value of the first sparse depth map; for example, if the maximum pixel value in the first sparse depth map is 100 and the minimum pixel value is 10, scaling the pixel value of the first prediction depth map to be within the range of 10 to 100 pixel values, so as to obtain the first scaled depth map.
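A minimal sketch of the scaling behavior described here, assuming the min-max range interpretation of the example above; the names are illustrative and this is not the patent's implementation.

import torch

def scale_depth(pred_depth, sparse_depth):
    """Rescale the predicted depth map into the value range of the sparse depth map.

    Only positions with a valid (non-zero) sparse depth define the target range;
    this min-max form is one plausible reading of the example in the text.
    """
    valid = sparse_depth > 0
    lo, hi = sparse_depth[valid].min(), sparse_depth[valid].max()
    p_lo, p_hi = pred_depth.min(), pred_depth.max()
    scaled = (pred_depth - p_lo) / (p_hi - p_lo + 1e-8)   # normalise to [0, 1]
    return scaled * (hi - lo) + lo                         # map into [lo, hi]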
B3, inputting the second sparse depth map and the second predicted depth map into a scaling depth network layer, and performing size enlarging or shrinking treatment on the second sparse depth map and the second predicted depth map through the scaling depth network layer to obtain a second scaling depth map; the second sparse depth map is a sparse depth map of the second sample image.
In some embodiments, the second sparse depth map is a sparse depth map obtained by preprocessing the second sample image. And (3) according to the same scaling principle as the first predicted depth map, performing amplification or reduction processing on the pixel value of the second predicted depth map according to the pixel value of the second sparse depth map, and scaling the pixel value of the second predicted depth map to be within the pixel value range of the second sparse depth map to obtain a second scaled depth map. For example, if the maximum pixel value in the second sparse depth map is 100 and the minimum pixel value is 10, scaling the pixel value of the second prediction depth map to be within the pixel value range of 10 to 100, so as to obtain the second scaled depth map.
And B4, inputting the first scaled depth map into a coordinate transformation network layer, and obtaining a first dense depth map output by the coordinate transformation network layer after coordinate transformation.
In some embodiments, the first sample image and the second sample image are two frames with sufficient overlap, one frame being captured after the camera has moved relative to the other. The first scaled depth map corresponding to the first sample image is converted, through the coordinate transformation network layer, into an image under the coordinates of the second sample image, and the depth values of the coordinate-transformed image are obtained by bilinear interpolation, giving the first dense depth map of the first scaled depth map.
And B5, inputting the second scaled depth map into the coordinate transformation network layer, and obtaining a second dense depth map output by the coordinate transformation network layer after coordinate transformation.
In some embodiments, the same coordinate transformation principle as the first scaling depth map is used to transform the second scaling depth map corresponding to the second sample image into an image under the coordinates of the first sample image through a coordinate transformation network layer, and a bilinear interpolation method is used to obtain the depth value of the image after coordinate transformation, so as to obtain a second dense depth map of the second scaling depth map.
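A hedged sketch of the coordinate transformation follows. It assumes the common depth-warping formulation (back-project with the intrinsics, transform with the relative pose, and bilinearly sample on the other frame's pixel grid); the patent only states that coordinate transformation and bilinear interpolation are used, so the details and names below are illustrative.

import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift an HxW depth map to camera-frame 3D points (3 x H*W)."""
    h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs.float(), ys.float(), torch.ones_like(xs, dtype=torch.float32)])
    rays = K_inv @ pix.reshape(3, -1)                     # ray of each pixel
    return rays * depth.reshape(1, -1)                    # scale by depth

def warp_depth(depth_j, depth_k, K, T_kj):
    """Dense depth map of frame j expressed in frame k's coordinate system.

    T_kj is the 4x4 pose taking frame-j camera coordinates to frame-k camera
    coordinates (assumed available from the camera poses). The result is
    sampled on frame k's pixel grid with bilinear interpolation.
    """
    h, w = depth_j.shape
    K_inv = torch.inverse(K)

    # 1) depth values of frame j re-expressed in frame k's camera frame
    pts_j = backproject(depth_j, K_inv)                               # 3 x N
    pts_j_in_k = (T_kj[:3, :3] @ pts_j) + T_kj[:3, 3:4]
    depth_j_in_k = pts_j_in_k[2].reshape(1, 1, h, w)                  # still on frame j's grid

    # 2) for every pixel of frame k, find its location in frame j (flow k -> j)
    T_jk = torch.inverse(T_kj)
    pts_k = backproject(depth_k, K_inv)
    pts_k_in_j = (T_jk[:3, :3] @ pts_k) + T_jk[:3, 3:4]
    proj = K @ pts_k_in_j
    uv = proj[:2] / proj[2:3].clamp(min=1e-6)                         # pixel coords in frame j

    # 3) bilinear sampling of the transformed depth at those locations
    grid_x = uv[0].reshape(h, w) / (w - 1) * 2 - 1                    # normalise to [-1, 1]
    grid_y = uv[1].reshape(h, w) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0)         # 1 x H x W x 2
    return F.grid_sample(depth_j_in_k, grid, mode="bilinear",
                         align_corners=True).squeeze()                # dense depth map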
And B6, respectively carrying out projection transformation on the first scaling depth map and the second scaling depth map to obtain a first dense displacement map corresponding to the first scaling depth map and a second dense displacement map corresponding to the second scaling depth map.
In some embodiments, the first zoom depth map and the second zoom depth map are respectively combined with the camera internal parameters and the corresponding camera pose, dense reconstruction is performed to determine a three-dimensional dense point cloud, the three-dimensional dense point cloud is projected onto a two-dimensional plane to obtain two-frame two-dimensional plane images, displacement amounts of corresponding points of the two-frame two-dimensional plane images are calculated (namely, two-dimensional projection movement of the three-dimensional dense point cloud of the dense reconstruction is calculated), and a first dense displacement map of the first zoom depth map and a second dense displacement map of the second zoom depth map are obtained.
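Similarly, the dense displacement map of step B6 can be sketched as the 2D motion of each pixel's reprojection; this is an illustrative implementation under the same pinhole assumptions, not code given in the patent.

import torch

def dense_displacement(depth_j, K, T_kj):
    """Dense displacement map S_{j,k}: for every pixel of frame j, back-project
    with the scaled depth, move the 3D point into frame k's camera frame with
    the relative pose T_kj, re-project with K, and record the 2D motion."""
    h, w = depth_j.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)   # homogeneous pixels
    pts_j = torch.inverse(K) @ pix * depth_j.reshape(1, -1)           # 3D points in frame j
    pts_k = T_kj[:3, :3] @ pts_j + T_kj[:3, 3:4]                      # same points in frame k
    proj = K @ pts_k
    uv_k = proj[:2] / proj[2:3].clamp(min=1e-6)                       # projection in frame k
    disp = (uv_k - pix[:2]).reshape(2, h, w)                          # 2D displacement per pixel
    return disp.permute(1, 2, 0)                                      # H x W x 2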
Illustratively, in the training phase, the network architecture is a self-supervision dual-branch network architecture, and the input data of the network architecture includes video sequence frames (i.e. sample images in a training set), camera pose, camera internal parameters, sparse depth map and sparse displacement map. Two frames of sample images with enough overlapping parts are selected from the same video sequence, and are respectively input into two identical branch networks for training. The two frames of sample images are denoted as frame j and frame k, respectively. The convolutional neural network participating in training is a SE-FCDenseNet network model, and a weight sharing mechanism is arranged between the two branched network models, so that the parameter number of the network models can be reduced, and the correlation between two frames of sample images is enhanced.
And scaling the depth map predicted by the SE-FCDenseNet network model and the sparse depth map subjected to data preprocessing to obtain scaled depth maps of the two branches. The scaling depth network layer performs scale change on the predicted depth map by taking the sparse depth map as an anchor point, so as to obtain a scaling depth map; for example, if the maximum pixel value in the sparse depth map is 100 and the minimum pixel value is 10, the predicted depth map is scaled to the same range of pixel values by this range of values.
Coordinate transformation is performed between the scaled depth map of frame j and the scaled depth map of frame k to generate a depth map of frame j in the coordinate system of frame k and a depth map of frame k in the coordinate system of frame j. Frame j and frame k have sufficient overlap, one frame being captured after the camera has moved relative to the other. Based on the camera poses of the two frames, the scaled depth map under one camera coordinate system is converted into the other camera coordinate system: the scaled depth map of frame j is converted into the coordinate system of the image of frame k, and the scaled depth map of frame k is converted into the coordinate system of the image of frame j; depth values after the coordinate conversion are then obtained by bilinear interpolation, giving the two dense depth maps corresponding to frame j and frame k.
The dense displacement map is a representation of the two-dimensional projection motion of the dense reconstruction. The dense displacement map is obtained by reconstructing the zoom depth map into dense point clouds by combining camera internal parameters and camera pose, projecting the dense point clouds onto corresponding two-dimensional planes, and calculating the dense reconstructed two-dimensional projection motion, namely the displacement of corresponding points of two frames of images.
Further, frame j and frame k are two frames with sufficient overlap, one frame being captured after the camera has moved relative to the other. The transformed depth values are obtained by transforming one frame of image into the coordinate system of the other frame and then applying bilinear interpolation; Ŷ_j,k represents the dense depth map of frame j transformed into the coordinate system of frame k, and Ŷ_k,j represents the dense depth map of frame k transformed into the coordinate system of frame j.
The camera pose describes the moving track of the camera and comprises parameters such as translation amount, rotation angle and the like of the camera; camera intrinsic refers to the intrinsic matrix of the camera.
And B7, training and updating the parameters of the convolutional neural network model through the effective depth loss between the first scaled depth map and the first sparse depth map and the effective depth loss between the second scaled depth map and the second sparse depth map, the depth difference loss between the first scaled depth map and the first dense depth map and the depth difference loss between the second scaled depth map and the second dense depth map, and the projection displacement loss between the first sparse displacement map and the first dense displacement map and the projection displacement loss between the second sparse displacement map and the second dense displacement map.
In some embodiments, the first sparse displacement map represents a movement amount of the first sample image projected by the sparse point cloud during data preprocessing to the corresponding same feature point position of the first sample image relative to the second sample image, and the second sparse displacement map represents a movement amount of the second sample image projected by the sparse point cloud during data preprocessing to the corresponding same feature point position of the second sample image relative to the first sample image.
In some embodiments, the parameters of the convolutional neural network model are adjusted by training the convolutional neural network model with three loss functions. The three loss functions include: an effective depth loss function, a projected displacement loss function, and a depth difference loss function; the effective depth loss function directly monitors the correlation between the sparse depth map and the target depth, the projection displacement loss function monitors the correlation between the sparse reconstruction and the dense reconstruction projection position change, and the depth difference value loss function compensates the relative difference between the scaled depth map and the dense depth map.
In some embodiments, the effective depth loss between the first scaled depth map and the first sparse depth map and the effective depth loss between the second scaled depth map and the second sparse depth map are calculated by an effective depth loss function, the effective depth loss function being represented as follows:
(1)
where L_edl(j,k) is the effective depth loss, M_j is the sparse mask of the j-th frame image, M_k is the sparse mask of the k-th frame image (a sparse mask may be a two-dimensional matrix array or a multi-valued image), Y_j is the first scaled depth map, Y_j* is the first sparse depth map, Y_k is the second scaled depth map, and Y_k* is the second sparse depth map; n is the number of effective pixels in the sparse depth map, an integer greater than 1. The effective depth loss function can effectively update the parameters of the network model, so that the loss function converges more quickly.
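The formula image labelled (1) is not reproduced in this text. A masked L1 form consistent with the variable definitions above is sketched here; the exact norm and normalization used in the original formula may differ:

L_{edl(j,k)} \;=\; \frac{1}{n}\sum_{p} M_j(p)\,\bigl|Y_j(p)-Y_j^{*}(p)\bigr| \;+\; \frac{1}{n}\sum_{p} M_k(p)\,\bigl|Y_k(p)-Y_k^{*}(p)\bigr|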
In some embodiments, in order to generate a dense depth map consistent with the sparse reconstruction, a projection displacement loss is introduced; it is constructed from the difference between the sparse displacement map and the dense displacement map as a convergence condition and works together with the effective depth loss. The projection displacement loss between the first sparse displacement map and the first dense displacement map and the projection displacement loss between the second sparse displacement map and the second dense displacement map are calculated by a projection displacement loss function, which is expressed as follows:
(2)
where L_psl(j,k) is the projection displacement loss, M_j is the sparse mask of the j-th frame image, M_k is the sparse mask of the k-th frame image (a sparse mask may be a two-dimensional matrix array or a multi-valued image), S_j,k is the first dense displacement map, S_j,k* is the first sparse displacement map, S_k,j is the second dense displacement map, and S_k,j* is the second sparse displacement map; n' is the number of effective pixels in the sparse displacement map, an integer greater than 1.
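The formula image labelled (2) is likewise not reproduced. A masked form consistent with the variable definitions above is sketched here; the exact norm and normalization of the original may differ:

L_{psl(j,k)} \;=\; \frac{1}{n'}\sum_{p} M_j(p)\,\bigl\|S_{j,k}(p)-S_{j,k}^{*}(p)\bigr\| \;+\; \frac{1}{n'}\sum_{p} M_k(p)\,\bigl\|S_{k,j}(p)-S_{k,j}^{*}(p)\bigr\|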
In some embodiments, the depth difference loss between the first scaled depth map and the first dense depth map and the depth difference loss between the second scaled depth map and the second dense depth map are calculated by a depth difference loss function, the depth difference loss function being represented as follows:
(3)
where L_dcl(j,k) is the depth difference loss, Y_j is the first scaled depth map, Ŷ_j,k is the first dense depth map, Y_k is the second scaled depth map, and Ŷ_k,j is the second dense depth map; n indexes the effective pixels in the sparse depth map and N is the total number of effective pixels, an integer greater than 1. The depth difference loss function adds a geometric constraint between two frames of the same video frame sequence and compensates the relative difference between the scaled depth map and the dense depth map, which helps depth prediction for images of smoother tissue surfaces.
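The formula image labelled (3) is not reproduced either. One plausible form is sketched below; the pairing shown (each dense depth map compared with the scaled depth map of the frame into whose coordinate system it was transformed) is an assumption, and the original image may pair the terms or choose the norm differently:

L_{dcl(j,k)} \;=\; \frac{1}{N}\sum_{p} \bigl|Y_k(p)-\hat{Y}_{j,k}(p)\bigr| \;+\; \frac{1}{N}\sum_{p} \bigl|Y_j(p)-\hat{Y}_{k,j}(p)\bigr|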
In some embodiments, the overall loss function of the training convolutional neural network model SE-FCDenseNet is a weighted sum of the three loss functions described above, the overall loss function being expressed as follows:
(4)
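Since the formula image labelled (4) is not reproduced here, the weighted sum described in the text would read, as a reconstruction:

L_{total} \;=\; \lambda_1 L_{edl(j,k)} \;+\; \lambda_2 L_{psl(j,k)} \;+\; \lambda_3 L_{dcl(j,k)}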
where the λ values are determined through a number of experiments. Typically λ1 = 5, λ2 = 20, and λ3 is 0.1 for the first 40,000 iterations, after which it may be changed to 5 for the remaining iterative training. The effective depth loss function and the projection displacement loss function mainly constrain the sparse depth map and the predicted depth map and minimize the difference between the sparse displacement map and the dense displacement map, making the predicted depth map closer to the effective depth values in the sparse depth map. When the effective depth loss function and the projection displacement loss function have converged to a certain degree and the predicted depth map is closer to the sparse depth map, the weight of the depth difference loss function is increased to compensate the relative difference between the scaled depth map and the dense depth map.
In the training process, training is generally stopped when the overall loss function converges to a stable value, or a limit of 200,000 iterations is set, after which training stops.
It can be understood that in the data preprocessing process and the training process of the convolutional neural network model, the sample images serving as the input data are multiple groups of video frames, and each group of video frames is a video shot when corresponding to a group of camera poses. During the shooting process, each group of videos shot in a moving way corresponds to a group of camera poses (namely, the moving track of the camera).
In some embodiments, the trained convolutional neural network model includes a dense network layer, an upsampling network layer, a downsampling network layer, and a residual network layer; inputting the image to be predicted into a trained convolutional neural network model for processing, wherein the method comprises the following steps of:
extracting features of the image to be predicted through the dense network layer, and outputting a feature map; inputting the feature map output by the dense network layer into an up-sampling network layer or a down-sampling network layer, and outputting a sampling feature map after up-sampling or down-sampling treatment; and inputting the sampling feature map into a residual network layer, distributing weight values to channels of the sampling feature map by the residual network layer, and outputting the feature map with preset dimensions.
Referring to fig. 5, a schematic structural diagram of a convolutional neural network model provided in an embodiment of the present application is shown. The trained neural network model SE-FCDenseNet mainly comprises a compressed extraction network module (SE Block) with an attention mechanism, and the compressed extraction network module is used for completely convoluting the DenseNet network architecture. As shown in fig. 5, in the structure of the trained convolutional neural network model SE-FCDenseNet, the image to be predicted is input into a 3×3 convolutional layer, and after one convolutional process, is input into a Dense network layer Dense Block, and the feature extraction is performed by the Dense network layer Dense Block, so as to output a feature map. The Dense network layer Dense Block is cascaded with a downsampling network layer, and the downsampling network layer performs downsampling processing on the feature map and outputs a sampling feature map.
The sampled feature map is input into the residual network layer, which weights the feature channels, enhancing the features of useful channels and suppressing the features of useless channels. The weighted features output by the residual network layer are input again into the Dense Block, which continues feature extraction; the extracted features are input into the up-sampling network layer Transition Up for up-sampling; the up-sampled feature map is input into a residual network layer, which performs feature-channel weighting and outputs weighted features. The residual network layer and the dense network layer are cascaded: the channel-weighted features are input into the dense network layer, features are extracted again through the dense network layer, the feature map is input into the up-sampling network layer, the sampled feature map after up-sampling is input into the residual network layer, and after feature-channel weighting by the residual network layer the feature map is input into the dense network layer. After feature extraction through the dense network layer, the feature map is input into the last 1×1 convolution layer, and after convolution processing the predicted dense depth image is output.
The dashed lines at the cascaded parts indicate skip connections between the front and rear layers. Inside a Dense Block, the outputs of all previous layers are used as the input of the current layer, which reuses features, reduces the number of network parameters and alleviates the vanishing-gradient problem. The residual network layer (Squeeze-Excitation Block) attends to the relations between feature channels and learns the importance of the features of different channels. The trained convolutional neural network model SE-FCDenseNet can therefore not only reuse the feature maps of multiple layers, but also assign weights to channels according to the information content of the channel features, realizing dynamic use of image features.
Corresponding to fig. 5, fig. 6 shows a schematic structural diagram of each network layer of the convolutional neural network model according to the embodiment of the present application. As shown in fig. 6, the trained convolutional neural network model corresponds to the network structure in fig. 5, and the specific structure of each network layer is as follows: input, m=3; 3×3 convolution, m=48; DB (4 layers) + TD + SE Block, m=96; DB (4 layers) + TD + SE Block, m=144; DB (4 layers) + TD + SE Block, m=192; DB (4 layers) + TD + SE Block, m=240; DB (4 layers) + TD + SE Block, m=288; DB (15 layers), m=288; TU + SE Block + DB (4 layers), m=336; TU + SE Block + DB (4 layers), m=288; TU + SE Block + DB (4 layers), m=240; TU + SE Block + DB (4 layers), m=192; TU + SE Block + DB (4 layers), m=144; 1×1 convolution, m=1; activation layer ReLU. Here DB is a dense network layer, TD is a down-sampling network layer, TU is an up-sampling network layer, SE Block is the residual network layer, and m is the number of channels.
An SE Block is added in both the TD stage and the TU stage; in the compression (squeeze) stage, the feature dimension is 1/4 of the input. In order to adapt the network output to the depth prediction task, the number of channels of the last convolutional layer in the up-sampling stage is changed to 1, and the activation function is changed to a linear activation function.
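For reference, the channel counts listed for fig. 6 grow by 48 after each four-layer dense block, which is consistent with an assumed growth rate of 12 (4 × 12 = 48), as in FC-DenseNet56; the short sketch below merely reproduces that bookkeeping and is an illustration rather than a configuration taken from the patent.

def channel_progression(stem_channels: int = 48, growth_rate: int = 12,
                        layers_per_block: int = 4, num_down_blocks: int = 5) -> list:
    # Assumes each dense block adds layers_per_block * growth_rate channels and that
    # the transition-down / SE Block stages leave the channel count unchanged.
    m = stem_channels
    progression = [m]
    for _ in range(num_down_blocks):
        m += layers_per_block * growth_rate
        progression.append(m)
    return progression

print(channel_progression())  # [48, 96, 144, 192, 240, 288], matching the m values listed above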
As shown in fig. 7, which is a structural diagram of the residual network layer provided in the embodiment of the present application, a sampling feature map processed by the down-sampling network layer TD or the up-sampling network layer TU is input into the residual network layer. The input sampling feature map has feature dimensions of channel (c) × height (h) × width (w). The global pooling layer of the residual network layer first extracts a global feature (c×1×1) of the sampling feature map, covering the feature dimensions of the c channels, and this global feature then passes through two fully connected layers: the first fully connected layer compresses the number of channels, reducing the feature dimension to 1/4 of the original channel number (c/4×1×1), and the second fully connected layer restores it to c channels (c×1×1). Finally, the result is normalized and activated by a Sigmoid (c×1×1) and used as coefficients to weight the feature channels of the input sampling feature map, thereby enhancing the features of useful channels and suppressing the features of useless channels.
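The channel-weighting step just described can be illustrated with a generic squeeze-and-excitation sketch (global average pooling, a fully connected bottleneck to c/4, restoration to c channels, Sigmoid gating); this is a standard SE implementation offered for clarity under those assumptions, not the exact residual network layer of the patent.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-excitation channel weighting: (b, c, h, w) -> (b, c, h, w).
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global feature of size c x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # compress the channel number to c/4
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # restore to c channels
            nn.Sigmoid(),                                # normalize the weights to (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        # Re-weight each channel: useful channels are enhanced, useless ones suppressed.
        return x * w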
Referring to fig. 8, schematic structural diagrams of the down-sampling network layer and the up-sampling network layer are provided in an embodiment of the present application. As shown in fig. 8 (a), the specific structure of the down-sampling network layer includes: batch normalization (Batch Normalization); an activation layer ReLU; a 1×1 convolution; dropout with a neuron drop probability of p=0.2; and 2×2 max pooling (Max Pooling). As shown in fig. 8 (b), the specific structure of the up-sampling network layer includes: a 3×3 transposed convolution with stride 2.
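Following the layer compositions shown in fig. 8, a hedged sketch of the transition-down (TD) and transition-up (TU) layers might look as follows; the channel arguments and the padding choices are assumptions made so that the sketch halves or doubles the spatial resolution as described.

import torch.nn as nn

def transition_down(channels: int, drop_p: float = 0.2) -> nn.Sequential:
    # TD: batch normalization -> ReLU -> 1x1 convolution -> dropout (p=0.2) -> 2x2 max pooling.
    return nn.Sequential(
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, kernel_size=1, bias=False),
        nn.Dropout2d(p=drop_p),
        nn.MaxPool2d(kernel_size=2),      # halves the spatial resolution
    )

def transition_up(in_channels: int, out_channels: int) -> nn.ConvTranspose2d:
    # TU: 3x3 transposed convolution with stride 2, doubling the spatial resolution.
    return nn.ConvTranspose2d(in_channels, out_channels, kernel_size=3,
                              stride=2, padding=1, output_padding=1)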
In the trained convolutional neural network model SE-FCDenseNet provided by the embodiment of the present application, the SE Block module is added into the FC-DenseNet network architecture and can assign weights to the feature channels according to the importance of the channel features, thereby improving the utilization of useful features and suppressing useless features. The network model has a stronger feature-extraction capability and a better prediction effect on tissue surfaces with sparse textures and little depth information. Meanwhile, three loss functions are provided and weighted into a total loss function, which strengthens the relationship between the sparse depth map and the dense depth map and improves the convergence speed.
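The formulas of the three loss functions appear only as images in the original publication and are not reproduced in this text. Purely as a hedged illustration of what masked losses of this kind commonly look like, and explicitly not as the patented definitions, one plausible form consistent with the variable definitions given in the claims is:

% Hypothetical masked-L1 forms; the patent's exact formulas are not reproduced here.
L_{edl(j,k)} = \frac{1}{n}\sum \lvert M_j \odot (Y_j - Y_j^{*}) \rvert
             + \frac{1}{n}\sum \lvert M_k \odot (Y_k - Y_k^{*}) \rvert
L_{psl(j,k)} = \frac{1}{n'}\sum \lVert M_j \odot (S_{j,k} - S_{j,k}^{*}) \rVert
             + \frac{1}{n'}\sum \lVert M_k \odot (S_{k,j} - S_{k,j}^{*}) \rVert
L_{dcl(j,k)} = \frac{1}{n}\sum \lvert Y_j - \hat{Y}_{j,k} \rvert
             + \frac{1}{n}\sum \lvert Y_k - \hat{Y}_{k,j} \rvert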
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
The method for predicting the depth image of the tissue surface image according to the embodiment of the present invention is invariant to illumination, so a good depth image can be predicted even in a scene where the camera and the light source move. In addition, the network model obtained by the training method provided in the embodiment of the present application has a stronger feature-extraction capability and a better prediction effect on tissue surfaces with sparse textures and little depth information.
Fig. 9 shows a block diagram of an image depth prediction apparatus according to an embodiment of the present application, corresponding to the image depth prediction method described in the above embodiment, and only a portion related to the embodiment of the present application is shown for convenience of explanation.
Referring to fig. 9, the apparatus includes:
an acquisition unit 91 for acquiring an image to be predicted of a tissue surface acquired by a monocular endoscope;
the processing unit 92 is configured to input the image to be predicted into a trained convolutional neural network model for processing, and output a dense depth map of the image to be predicted; the trained convolutional neural network model is obtained by training according to a sparse depth map and a sparse displacement map of sample images in a training set, and the sample images are images of tissue surfaces acquired through a monocular endoscope.
According to the embodiment of the invention, the convolutional neural network model is trained according to the sparse depth map and the sparse displacement map of the sample images in the training set to obtain the trained convolutional neural network model, and the image to be predicted acquired by the monocular endoscope is processed by the trained convolutional neural network model to obtain a dense depth map of the image to be predicted. The dense depth map better reflects the depth information of the image to be predicted, so a better image depth prediction effect is achieved.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the various method embodiments described above.
Embodiments of the present application also provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to perform the steps of the various method embodiments described above.
Fig. 10 is a schematic structural diagram of a terminal device 10 according to an embodiment of the present application. As shown in fig. 10, the terminal device 10 of this embodiment includes: at least one processor 100 (only one is shown in fig. 10), a memory 101, and a computer program 102 stored in the memory 101 and executable on the at least one processor 100; the processor 100 implements the steps in any of the various method embodiments described above when executing the computer program 102.
The terminal device 10 may be a computing device such as a desktop computer, a notebook computer, a palm computer, a cloud server, etc. The terminal device 10 may include, but is not limited to, a processor 100, a memory 101. It will be appreciated by those skilled in the art that fig. 10 is merely an example of the terminal device 10 and is not intended to limit the terminal device 10, and may include more or fewer components than shown, or may combine certain components, or may include different components, such as input-output devices, network access devices, etc.
The processor 100 may be a central processing unit (Central Processing Unit, CPU), and the processor 100 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 101 may in some embodiments be an internal storage unit of the terminal device 10, such as a hard disk or a memory of the terminal device 10. The memory 101 may in other embodiments also be an external storage device of the terminal device 10, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal device 10. Further, the memory 101 may also include both an internal storage unit and an external storage device of the terminal device 10. The memory 101 is used for storing an operating system, application programs, boot loader (BootLoader), data, other programs, etc., such as program codes of the computer program. The memory 101 may also be used to temporarily store data that has been output or is to be output.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application implements all or part of the flow of the method of the above embodiments, and may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, where the computer program, when executed by a processor, may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing device/terminal apparatus, recording medium, computer Memory, read-Only Memory (ROM), random access Memory (RAM, random Access Memory), electrical carrier signals, telecommunications signals, and software distribution media. Such as a U-disk, removable hard disk, magnetic or optical disk, etc. In some jurisdictions, computer readable media may not be electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. An image depth prediction method, comprising:
acquiring an image to be predicted of a tissue surface acquired by a monocular endoscope;
inputting the image to be predicted into a trained convolutional neural network model for processing, and outputting a dense depth map of the image to be predicted;
the trained convolutional neural network model is obtained by training according to a sparse depth map and a sparse displacement map of sample images in a training set, wherein the sample images are images of tissue surfaces acquired through a monocular endoscope;
the training stage is a self-supervision dual-branch network architecture; the data input into the self-supervision dual-branch network architecture comprises a sample image, a camera pose, camera internal parameters, a sparse depth map and a sparse displacement map in a training set; a weight sharing mechanism is arranged between the two branch network architectures, and the network architecture of the trained convolutional neural network model distributes weights to channels according to the information quantity of the channel characteristics;
training a convolutional neural network model by adopting three loss functions, wherein the three loss functions comprise an effective depth loss function, a projection displacement loss function and a depth difference loss function; the effective depth loss function directly supervises the correlation between the sparse depth map and the target depth, the projection displacement loss function supervises the correlation between the sparse reconstruction and the dense reconstruction, and the depth difference loss function compensates the relative difference between the scaled depth map and the dense depth map;
The effective depth loss function is expressed as follows:
wherein L_edl(j,k) is the effective depth loss, M_j is the sparse mask of the j-th frame image, M_k is the sparse mask of the k-th frame image, Y_j is the first scaled depth map, Y_j* is the first sparse depth map, Y_k is the second scaled depth map, and Y_k* is the second sparse depth map; n is the number of effective pixel points in the sparse depth map, and n is an integer greater than 1;
the projected displacement loss function is expressed as follows:
wherein L_psl(j,k) is the projection displacement loss, M_j is the sparse mask of the j-th frame image, M_k is the sparse mask of the k-th frame image, S_j,k is the first dense displacement map, S_j,k* is the first sparse displacement map, S_k,j is the second dense displacement map, and S_k,j* is the second sparse displacement map; n' is the number of effective pixel points in the sparse displacement map, and n' is an integer greater than 1;
the depth difference loss function is expressed as follows:
wherein L_dcl(j,k) is the depth difference loss, Y_j is the first scaled depth map, Ŷ_j,k is the first dense depth map, Y_k is the second scaled depth map, and Ŷ_k,j is the second dense depth map; n is the number of effective pixel points in the sparse depth map, and n is an integer greater than 1.
2. The method of claim 1, wherein the method comprises:
Extracting characteristic points of sample images of different views in the training set;
matching the characteristic points among the sample images of different views to obtain mutually matched characteristic points in the sample images;
performing sparse reconstruction according to the matched characteristic points and the camera internal parameters to obtain sparse point cloud data and camera pose;
and performing projection mapping and data transformation on the sparse point cloud data according to the camera pose and the camera internal parameters to obtain the sparse depth map and the sparse displacement map.
3. The method of claim 1, wherein training the convolutional neural network model from the sparse depth map and the sparse displacement map of the sample images in the training set comprises:
respectively inputting a first sample image and a second sample image into two paths of convolutional neural network models, respectively performing feature learning through the two paths of convolutional neural network models, and then obtaining a first predicted depth map of the first sample image and a second predicted depth map of the second sample image which are respectively output by the two paths of convolutional neural network models; the first sample image and the second sample image are two frames of images, wherein the overlapping area of the two frames of images in the training set meets the preset condition;
Inputting a first sparse depth map and the first predicted depth map into a scaling depth network layer, and performing size enlargement or reduction processing on the first sparse depth map and the first predicted depth map through the scaling depth network layer to obtain a first scaling depth map; the first sparse depth map is a sparse depth map of the first sample image;
inputting a second sparse depth map and the second predicted depth map into a scaling depth network layer, and performing size enlargement or reduction processing on the second sparse depth map and the second predicted depth map through the scaling depth network layer to obtain a second scaling depth map; the second sparse depth map is a sparse depth map of a second sample image;
inputting the first scaled depth map into a coordinate transformation network layer, and obtaining a first dense depth map output by the coordinate transformation network layer after coordinate transformation;
inputting the second zoom depth map into a coordinate transformation network layer, and obtaining a second dense depth map output by the coordinate transformation network layer after coordinate transformation;
respectively carrying out projection transformation on the first scaling depth map and the second scaling depth map to obtain a first dense displacement map corresponding to the first scaling depth map and a second dense displacement map corresponding to the second scaling depth map;
And training and updating parameters of the convolutional neural network model through an effective depth loss between the first scaled depth map and the first sparse depth map and an effective depth loss between the second scaled depth map and the second sparse depth map, a depth difference loss between the first scaled depth map and the first dense depth map and a depth difference loss between the second scaled depth map and the second dense depth map, a projection displacement loss between a first sparse displacement map and the first dense displacement map, and a projection displacement loss between a second sparse displacement map and the second dense displacement map.
4. A method as claimed in claim 3, wherein the effective depth loss between the first scaled depth map and the first sparse depth map and the effective depth loss between the second scaled depth map and the second sparse depth map are calculated by an effective depth loss function.
5. A method according to claim 3, wherein the projected displacement loss between the first sparse displacement map and the first dense displacement map and the projected displacement loss between the second sparse displacement map and the second dense displacement map are calculated by a projected displacement loss function.
6. The method of claim 3, wherein a depth difference loss between the first scaled depth map and the first dense depth map and a depth difference loss between the second scaled depth map and the second dense depth map are calculated by a depth difference loss function.
7. The method of any one of claims 1 to 6, wherein the trained convolutional neural network model comprises a dense network layer, an upsampling network layer, a downsampling network layer, and a residual network layer; inputting the image to be predicted into a trained convolutional neural network model for processing, wherein the method comprises the following steps of:
extracting features of the image to be predicted through the dense network layer, and outputting a feature map;
inputting the feature map output by the dense network layer into the up-sampling network layer or the down-sampling network layer, and outputting a sampling feature map after up-sampling or down-sampling treatment;
and inputting the sampling feature images output by the up-sampling network layer or the down-sampling network layer into a residual network layer, wherein the residual network layer distributes weight values for channels of the sampling feature images and outputs feature images with preset dimensions.
8. An image depth prediction apparatus, comprising:
An acquisition unit for acquiring an image to be predicted of a tissue surface acquired by a monocular endoscope;
the processing unit is used for inputting the image to be predicted into the trained convolutional neural network model for processing and outputting a dense depth map of the image to be predicted; the trained convolutional neural network model is obtained by training according to a sparse depth map and a sparse displacement map of sample images in a training set, wherein the sample images are images of tissue surfaces acquired through a monocular endoscope;
the training stage is a self-supervision dual-branch network architecture; the data input into the self-supervision dual-branch network architecture comprises a sample image, a camera pose, camera internal parameters, a sparse depth map and a sparse displacement map in a training set; a weight sharing mechanism is arranged between the two branch network architectures, and the network architecture of the trained convolutional neural network model distributes weights to channels according to the information quantity of the channel characteristics; the processing unit is further used for training the convolutional neural network model by adopting three loss functions, wherein the three loss functions comprise an effective depth loss function, a projection displacement loss function and a depth difference loss function; the effective depth loss function directly supervises the correlation between the sparse depth map and the target depth, the projection displacement loss function supervises the correlation between the sparse reconstruction and the dense reconstruction, and the depth difference loss function compensates the relative difference between the scaled depth map and the dense depth map; the effective depth loss function is expressed as follows:
wherein L_edl(j,k) is the effective depth loss, M_j is the sparse mask of the j-th frame image, M_k is the sparse mask of the k-th frame image, Y_j is the first scaled depth map, Y_j* is the first sparse depth map, Y_k is the second scaled depth map, and Y_k* is the second sparse depth map; n is the number of effective pixel points in the sparse depth map, and n is an integer greater than 1;
the projected displacement loss function is expressed as follows:
wherein L_psl(j,k) is the projection displacement loss, M_j is the sparse mask of the j-th frame image, M_k is the sparse mask of the k-th frame image, S_j,k is the first dense displacement map, S_j,k* is the first sparse displacement map, S_k,j is the second dense displacement map, and S_k,j* is the second sparse displacement map; n' is the number of effective pixel points in the sparse displacement map, and n' is an integer greater than 1;
the depth difference loss function is expressed as follows:
wherein L_dcl(j,k) is the depth difference loss, Y_j is the first scaled depth map, Ŷ_j,k is the first dense depth map, Y_k is the second scaled depth map, and Ŷ_k,j is the second dense depth map; n is the number of effective pixel points in the sparse depth map, and n is an integer greater than 1.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the method according to any one of claims 1 to 7.
CN202011359229.6A 2020-11-27 2020-11-27 Image depth prediction method, device, terminal equipment and readable storage medium Active CN112330729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011359229.6A CN112330729B (en) 2020-11-27 2020-11-27 Image depth prediction method, device, terminal equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112330729A CN112330729A (en) 2021-02-05
CN112330729B true CN112330729B (en) 2024-01-12

Family

ID=74308645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011359229.6A Active CN112330729B (en) 2020-11-27 2020-11-27 Image depth prediction method, device, terminal equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112330729B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022170562A1 (en) * 2021-02-10 2022-08-18 中国科学院深圳先进技术研究院 Digestive endoscope navigation method and system
CN112949761A (en) * 2021-03-31 2021-06-11 东莞中国科学院云计算产业技术创新与育成中心 Training method and device for three-dimensional image neural network model and computer equipment
CN113240726B (en) * 2021-05-20 2022-10-14 南开大学 Real-time measurement method for optical target size under endoscope
CN113780349B (en) * 2021-08-09 2023-07-11 深圳奥锐达科技有限公司 Training sample set acquisition method, model training method and related device
CN113763447B (en) * 2021-08-24 2022-08-26 合肥的卢深视科技有限公司 Method for completing depth map, electronic device and storage medium
CN115880347B (en) * 2021-09-27 2023-10-20 荣耀终端有限公司 Image processing method, electronic device, storage medium, and program product
CN114429458A (en) * 2022-01-21 2022-05-03 小荷医疗器械(海南)有限公司 Endoscope image processing method and device, readable medium and electronic equipment
CN114758076A (en) * 2022-04-22 2022-07-15 北京百度网讯科技有限公司 Training method and device for deep learning model for building three-dimensional model
CN117115225B (en) * 2023-09-01 2024-04-30 安徽羽亿信息科技有限公司 Intelligent comprehensive informatization management platform for natural resources
CN117172294B (en) * 2023-11-02 2024-01-26 烟台大学 Method, system, equipment and storage medium for constructing sparse brain network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010099178A (en) * 2008-10-22 2010-05-06 Osaka Univ Apparatus and method for image processing
WO2017017103A1 (en) * 2015-07-29 2017-02-02 Universal Consulting Gmbh Unternehmensberatung System for the stereoscopic representation of images of an endoscope
CN110956655A (en) * 2019-12-09 2020-04-03 清华大学 Dense depth estimation method based on monocular image
CN111145238A (en) * 2019-12-12 2020-05-12 中国科学院深圳先进技术研究院 Three-dimensional reconstruction method and device of monocular endoscope image and terminal equipment
CN111275635A (en) * 2020-01-13 2020-06-12 沈阳先进医疗设备技术孵化中心有限公司 Image processing method and device
US10682108B1 (en) * 2019-07-16 2020-06-16 The University Of North Carolina At Chapel Hill Methods, systems, and computer readable media for three-dimensional (3D) reconstruction of colonoscopic surfaces for determining missing regions
CN111563923A (en) * 2020-07-15 2020-08-21 浙江大华技术股份有限公司 Method for obtaining dense depth map and related device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SLAM Endoscopy enhanced by adversarial depth prediction; Richard J. Chen et al.; arXiv; pp. 1-4 *
Three-dimensional reconstruction of porous structures from low-resolution monocular endoscopic images; Zhang Jianxun et al.; Optics and Precision Engineering; Vol. 28, No. 9; pp. 2085-2095 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant