CN115423856A - Monocular depth estimation system and method for intelligent pump cavity endoscope image - Google Patents

Monocular depth estimation system and method for intelligent pump cavity endoscope image

Info

Publication number
CN115423856A
Authority
CN
China
Prior art keywords
depth estimation
depth
monocular
image
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211070319.2A
Other languages
Chinese (zh)
Inventor
程一飞
范舒铭
董国庆
李玉道
郭素英
李志远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jining Antai Mine Equipment Manufacturing Co ltd
Original Assignee
Jining Antai Mine Equipment Manufacturing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jining Antai Mine Equipment Manufacturing Co ltd filed Critical Jining Antai Mine Equipment Manufacturing Co ltd
Priority to CN202211070319.2A priority Critical patent/CN115423856A/en
Publication of CN115423856A publication Critical patent/CN115423856A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10068Endoscopic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

The invention relates to the technical field of endoscope detection, and in particular to a monocular depth estimation system and method for intelligent pump cavity endoscope images. The invention uses the Transformer series as the backbone network of a self-supervised monocular depth estimation system: for the depth encoder, swin_transformer is used as the backbone network, and for the pose extractor, vision_transformer is used as the backbone network. Using different Transformer-based networks as the backbone of the self-supervised monocular depth estimation system effectively improves the accuracy of depth estimation from monocular pictures.

Description

Monocular depth estimation system and method for intelligent pump cavity endoscope image
Technical Field
The invention relates to the technical field of endoscope detection, in particular to a monocular depth estimation system and a monocular depth estimation method for an intelligent pump cavity endoscope image.
Background
In the era of rapidly developing artificial intelligence, intelligent pumps are widely applied in important fields such as industry, agricultural production, energy, petrochemicals, aviation, steel and the military industry, and play an important role in national economic development. Although China is a large manufacturing country, its intelligent pump manufacturing technology still has many problems, such as pump faults caused by insufficient R&D investment, weak independent innovation capability, and weak basic supporting components. Internal pump cavity defects such as cracking, corrosion and rusting are difficult to diagnose by human observation, which shortens the service life of intelligent pumps, delays the development of manufacturing technology, and hinders technical innovation. The endoscope integrates traditional optics, ergonomics, precision machinery, modern electronics and software; it is a detection instrument with an image sensor, optical illumination and a mechanical device, able to probe the inside of bent pipelines and observe parts that cannot be viewed directly by the human eye. Endoscopes are commonly used in industry for non-destructive testing: the inspected body does not need to be disassembled, interior surface conditions such as cracks, weld seams and rust can be observed directly, and the whole inspection can be recorded as video or photographs, which facilitates fault diagnosis and subsequent quantitative analysis.
Depth estimation refers to the process of recovering the three-dimensional information of a scene from the two-dimensional information captured by a camera. Monocular methods use only one image to achieve this goal: they estimate the distance between scene objects and the camera from a single viewpoint. In deep learning, this is realized as a mapping from the pixels of the image to actual depth values.
Depth estimation methods are roughly classified into two categories according to the number of views required for prediction: binocular depth estimation and monocular depth estimation. Binocular (or multi-view) depth estimation infers image depth through a stereo matching algorithm or structure from motion. At present, binocular depth estimation can obtain fairly good results, but its drawback is obvious: the cost is too high. Not only are the equipment requirements demanding, but the data-processing pipeline is also relatively complicated. Monocular depth estimation addresses these problems, and is itself divided into two approaches: supervised monocular depth estimation and self-supervised monocular depth estimation.
With the rapid development of deep neural networks, especially over the last decade, self-supervised monocular depth estimation algorithms built on deep learning have become practical to deploy.
The existing self-supervision monocular depth estimation method has the following defects:
First, existing self-supervised monocular depth estimation methods have low overall accuracy and can hardly achieve practical effect, while the detection of the intelligent pump cavity endoscope requires high accuracy.
Second, in the process of estimating the pose, the networks commonly used are limited in their ability to capture the characteristics of pose change, especially since the pose of the intelligent pump cavity endoscope image differs from that of a conventional camera.
Disclosure of Invention
In order to solve the above technical problems, it is an object of the present invention to provide a monocular depth estimation system and method for intelligent pump cavity endoscope images, which uses the Transformer series of networks, which have recently performed outstandingly in various image-processing tasks (including image classification and image segmentation), as the backbone network of a self-supervised monocular depth estimation system. Compared with earlier frameworks that use a convolutional neural network as the backbone, when a Transformer extracts features from an image, the feature extraction at each stage has a global field of view thanks to its self-attention modules. For the depth encoder, swin_transformer is used as the backbone network. The swin_transformer adds a sliding window when extracting image features, so in terms of positional priors and multi-scale features it offers performance not inferior to a convolutional neural network on dense prediction tasks. For the pose extractor, vision_transformer is used as the backbone network. The vision_transformer adds position information before extracting image features, so the corresponding positions of the image can be marked and the information of pose change obtained more accurately; since pose change does not place high demands on scale, no sliding window is needed to extract information at different scales.
The invention adopts the following specific technical scheme:
a monocular depth estimation system for intelligent pump cavity endoscope images adopts transformers series instead of CNN as a backbone network of an auto-supervision monocular depth estimation system, and the system consists of a depth encoder, a depth decoder and a pose extractor; the depth encoder comprises a patch partition layer and 4 stages, wherein the stage1 consists of a linear embedding and two switch _ transformer blocks, the internal structures of the stage2 and the stage3 are the same and both consist of a patch merging and two switch _ transformer blocks, and the stage4 consists of a patch merging and six switch _ transformer blocks; the pose extractor uses a Vision _ Transformer as a backbone network, and the structure of the Vision _ Transformer comprises a Linear project of FattenePatches and a Transformer Encoder, wherein the default depth of the Transformer Encoder is 6.
The invention also discloses a method for monocular depth estimation of an intelligent pump cavity endoscope image, which comprises the following steps:
step 1: constructing a data set suitable for training, namely preprocessing the image;
step 2: sending the preprocessed image into the self-supervised monocular depth estimation framework for operation;
step 3: obtaining the corresponding predicted depth map.
The process of performing the self-supervised monocular depth estimation operation on the preprocessed image comprises the following steps:
step 2.1: sending the current frame of the preprocessed image into the depth encoder module, where the image undergoes sufficient down-sampling feature extraction through the swin_transformer;
step 2.2: sending the features obtained by the depth encoder into the depth decoder to obtain a predicted depth map; in this process the input of the decoder is a feature group, and in order to meet the input requirement of the resnet decoder, an up-sampling operation is first performed on the first group of shallow features in the feature group before input;
step 2.3: taking the current frame and the next frame of the preprocessed image as an image pair and sending the image pair to the pose extractor to obtain a group of vectors, a rotation vector and a translation vector; in this process the image pair is position-encoded and then sent into the 6-layer transformer_encoder of the vision_transformer to obtain the pose change vectors between the front and rear frames;
step 2.4: obtaining reconstructed front and rear frame images from the depth map obtained in step 2.2, the pose vectors obtained in step 2.3 and the original front and rear frame images, calculating a loss value from the reconstructed images, and performing back propagation to update the network parameters.
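Before the reprojection of step 2.4, the rotation and translation vectors produced in step 2.3 are typically assembled into a rigid-body transform. The patent does not specify this conversion, so the following is only the standard axis-angle (Rodrigues) construction, offered as an illustration:

```python
import numpy as np

def pose_vec_to_mat(rot_vec, trans):
    """Convert an axis-angle rotation vector and a translation vector
    (the outputs of step 2.3) into a 4x4 homogeneous transform.

    Uses Rodrigues' formula; this is a common convention assumed here,
    not a construction stated in the patent.
    """
    theta = np.linalg.norm(rot_vec)
    if theta < 1e-8:
        R = np.eye(3)                      # near-zero rotation: identity
    else:
        k = rot_vec / theta                # unit rotation axis
        K = np.array([[0.0, -k[2], k[1]],  # skew-symmetric cross-product matrix
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = trans
    return T
```

The resulting matrix, together with the predicted depth map and the camera intrinsics, is what warps the source frame onto the target frame for the photometric loss below.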
The loss function used as the constraint in the present system consists of two parts: the reprojection loss and the smoothness loss.
The reprojection loss L_p, shown in equation (1), is the minimum of the photometric error function Pe between the frame at time t and the frame projected from time t-1 to time t; the photometric error function Pe is shown in equation (2), and SSIM, which measures the similarity of two pictures in both structure and gray value, is computed as in equation (3):

L_p = \min_{t-1} Pe(I_t, I_{t-1 \to t})    (1)

Pe(I_a, I_b) = \frac{\alpha}{2} (1 - SSIM(I_a, I_b)) + (1 - \alpha) \| I_a - I_b \|_1    (2)

SSIM(I_a, I_b) = [l(I_a, I_b)]^{\alpha} [c(I_a, I_b)]^{\beta} [s(I_a, I_b)]^{\gamma}    (3)
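For illustration, with the common choice of exponents \alpha = \beta = \gamma = 1, equation (3) reduces to a single expression. The sketch below computes it from global image statistics; real implementations use a sliding local window, and the stabilizing constants c1 and c2 are assumed defaults, not values from the patent:

```python
import numpy as np

def ssim_global(a, b, c1=0.01**2, c2=0.03**2):
    """Simplified SSIM of equation (3) with alpha = beta = gamma = 1,
    computed from whole-image statistics (a sketch, not a windowed
    implementation). Inputs are float arrays scaled to [0, 1]."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    l = (2 * mu_a * mu_b + c1) / (mu_a**2 + mu_b**2 + c1)          # luminance term
    c = (2 * np.sqrt(var_a * var_b) + c2) / (var_a + var_b + c2)   # contrast term
    s = (cov + c2 / 2) / (np.sqrt(var_a * var_b) + c2 / 2)         # structure term
    return l * c * s
```

An identical pair of images yields an SSIM of 1, which drives the SSIM part of Pe in equation (2) to zero.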
The smoothness loss L_s is shown in equation (4), where the mean-normalized disparity d^* is defined in equation (5); L_s is a regularization term of the system network that prevents the parameters from over-fitting:

L_s = |\partial_x d^*_t| \, e^{-|\partial_x I_t|} + |\partial_y d^*_t| \, e^{-|\partial_y I_t|}    (4)

d^*_t = d_t / \bar{d}_t    (5)
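A minimal numeric sketch of equations (4) and (5), assuming the usual edge-aware reading of the symbols in which disparity gradients are penalized less where the image itself has strong gradients (the patent's formula figures are not reproduced, so this follows the standard form):

```python
import numpy as np

def smoothness_loss(disp, img):
    """Edge-aware smoothness term of equations (4)-(5).

    Mean-normalizes the disparity (equation (5)), then penalizes its
    spatial gradients, down-weighted by exp(-|image gradient|) so that
    depth is allowed to change sharply at image edges. A sketch, not
    the patent's exact implementation.
    """
    d = disp / (disp.mean() + 1e-7)         # d* of equation (5)
    dx = np.abs(d[:, 1:] - d[:, :-1])       # |d/dx of d*|
    dy = np.abs(d[1:, :] - d[:-1, :])       # |d/dy of d*|
    ix = np.abs(img[:, 1:] - img[:, :-1])   # image gradients used as edge weights
    iy = np.abs(img[1:, :] - img[:-1, :])
    return (dx * np.exp(-ix)).mean() + (dy * np.exp(-iy)).mean()
```

A constant disparity map has zero gradients and therefore zero smoothness loss, which matches the role of L_s as a pure regularizer.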
The actual total loss L is shown in equation (6), where \lambda is 0.001 and \mu, computed as in equation (7), acts as a mask that determines per pixel whether the reprojection error is smaller than the original photometric error: if smaller, \mu is 1; otherwise it is 0:

L = \mu L_p + \lambda L_s    (6)

\mu = [\min_{t-1} Pe(I_t, I_{t-1 \to t}) < \min_{t-1} Pe(I_t, I_{t-1})]    (7)
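Equations (6) and (7) combine into a few lines of code: \mu keeps only the pixels where the warped-frame error beats the un-warped (identity) error, masking out regions that do not change between frames. A hedged numeric sketch, with per-pixel errors supplied as arrays:

```python
import numpy as np

def total_loss(pe_reproj, pe_identity, smooth, lam=0.001):
    """Equations (6)-(7): auto-masked reprojection loss plus weighted
    smoothness. pe_reproj and pe_identity are per-pixel photometric
    errors for the warped and un-warped source frame; lam = 0.001 is
    the weight stated in the text. A sketch, not the patent's code.
    """
    mu = (pe_reproj < pe_identity).astype(float)   # Iverson bracket of equation (7)
    return (mu * pe_reproj).mean() + lam * smooth
```

For example, a pixel whose warped error (0.1) is below its identity error (0.2) contributes to the loss, while one whose warped error (0.5) exceeds its identity error (0.4) is masked out.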
The invention uses the Transformer series instead of a CNN as the backbone network of the system, thereby improving the accuracy of self-supervised monocular depth estimation as a whole; for the pose extractor, position encoding is applied to the image pair before it is input, which strengthens the structural correspondence between matching positions and makes the output pose vector more accurate.
The invention has the beneficial effects that:
First, using the recent Transformer series instead of a CNN as the backbone network of the self-supervised monocular depth estimation system improves the accuracy of self-supervised monocular depth estimation as a whole; the depth encoder uses swin_transformer and the pose extractor uses vision_transformer.
Second, for the pose extractor in the system, the Transformer adds position information before extracting image features, so the corresponding positions of the image can be marked, pose change information can be obtained better, and the accuracy of the resulting pose vector is improved.
Drawings
FIG. 1 is an overall framework diagram of the self-supervised monocular depth estimation network system of the present invention.
FIG. 2 is a depth prediction network diagram according to the present invention.
Fig. 3 is a structural diagram of the depth_encoder according to the present invention.
FIG. 4 is a structural diagram of a swin_transformer block according to the present invention.
FIG. 5 is a diagram of a vision_transformer block according to the present invention.
Fig. 6 is a pose prediction network diagram of the present invention.
Detailed Description
The present invention will be further described in detail below with reference to examples in order to facilitate understanding and practice of the invention by those of ordinary skill in the art, and it should be understood that the examples described herein are for illustration and explanation only and are not intended to limit the invention.
Example: as shown in fig. 1, a monocular depth estimation system for intelligent pump cavity endoscope images adopts the Transformer series instead of a CNN as the backbone network of a self-supervised monocular depth estimation system; the system consists of a depth encoder, a depth decoder and a pose extractor. The depth encoder comprises a patch partition layer and 4 stages: stage 1 consists of a linear embedding and two swin_transformer blocks; stage 2 and stage 3 have the same internal structure, each consisting of a patch merging and two swin_transformer blocks; and stage 4 consists of a patch merging and six swin_transformer blocks. The pose extractor takes a vision_transformer as its backbone network, whose structure comprises a Linear Projection of Flattened Patches and a Transformer Encoder, where the default depth of the Transformer Encoder is 6.
The monocular depth estimation of the endoscope image of the intelligent pump cavity by using the system specifically comprises the following steps:
step 1: constructing a data set suitable for training, namely preprocessing the image;
step 2: sending the preprocessed image into the self-supervised monocular depth estimation framework for operation;
step 3: obtaining the corresponding predicted depth map.
The process of performing the self-supervised monocular depth estimation operation on the preprocessed image comprises the following steps:
step 2.1: sending the current frame of the preprocessed image into the depth encoder module, where the image undergoes sufficient down-sampling feature extraction through the swin_transformer;
step 2.2: sending the features obtained by the depth encoder into the depth decoder to obtain a predicted depth map; in this process the input of the decoder is a feature group, and in order to meet the input requirement of the resnet decoder, an up-sampling operation is first performed on the first group of shallow features in the feature group before input;
step 2.3: taking the current frame and the next frame of the preprocessed image as an image pair and sending the image pair to the pose extractor to obtain a group of vectors, a rotation vector and a translation vector; in this process the image pair is position-encoded and then sent into the 6-layer transformer_encoder of the vision_transformer to obtain the pose change vectors between the front and rear frames;
step 2.4: obtaining reconstructed front and rear frame images from the depth map obtained in step 2.2, the pose vectors obtained in step 2.3 and the original front and rear frame images, calculating a loss value from the reconstructed images, and performing back propagation to update the network parameters.
The loss function used as the constraint in the present system consists of two parts: the reprojection loss and the smoothness loss.
As shown in equation (1), the reprojection loss L_p is the minimum of the photometric error function Pe between the frame at time t and the frame projected from time t-1 to time t; the photometric error function Pe is shown in equation (2), and SSIM, which measures the similarity of two pictures in both structure and gray value, is computed as in equation (3):

L_p = \min_{t-1} Pe(I_t, I_{t-1 \to t})    (1)

Pe(I_a, I_b) = \frac{\alpha}{2} (1 - SSIM(I_a, I_b)) + (1 - \alpha) \| I_a - I_b \|_1    (2)

SSIM(I_a, I_b) = [l(I_a, I_b)]^{\alpha} [c(I_a, I_b)]^{\beta} [s(I_a, I_b)]^{\gamma}    (3)
The smoothness loss L_s is shown in equation (4), where the mean-normalized disparity d^* is defined in equation (5); L_s is a regularization term of the system network that prevents the parameters from over-fitting:

L_s = |\partial_x d^*_t| \, e^{-|\partial_x I_t|} + |\partial_y d^*_t| \, e^{-|\partial_y I_t|}    (4)

d^*_t = d_t / \bar{d}_t    (5)
The actual total loss L is shown in equation (6), where \lambda is 0.001 and \mu, computed as in equation (7), acts as a mask that determines per pixel whether the reprojection error is smaller than the original photometric error: if smaller, \mu is 1; otherwise it is 0:

L = \mu L_p + \lambda L_s    (6)

\mu = [\min_{t-1} Pe(I_t, I_{t-1 \to t}) < \min_{t-1} Pe(I_t, I_{t-1})]    (7)
the above description is only an embodiment of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (5)

1. A monocular depth estimation system for an intelligent pump cavity endoscope image, characterized by comprising a depth encoder, a depth decoder and a pose extractor; the depth encoder comprises a patch partition layer and 4 stages, wherein stage 1 consists of a linear embedding and two swin_transformer blocks, stage 2 and stage 3 have the same internal structure and each consist of a patch merging and two swin_transformer blocks, and stage 4 consists of a patch merging and six swin_transformer blocks; the pose extractor takes a vision_transformer as a backbone network, and the structure of the vision_transformer comprises a Linear Projection of Flattened Patches and a Transformer Encoder, wherein the default depth of the Transformer Encoder is 6.
2. A method for monocular depth estimation of intelligent pump cavity endoscopic images, using the monocular depth estimation system of claim 1, comprising in particular the steps of:
step 1: constructing a data set suitable for training, namely preprocessing the image;
step 2: sending the preprocessed image into a self-supervised monocular depth estimation framework for operation;
step 3: obtaining a corresponding predicted depth map.
3. The method for monocular depth estimation of intelligent pump cavity endoscope images of claim 2, wherein the process of performing the self-supervised monocular depth estimation operation on the preprocessed images in step 2 comprises:
step 2.1: sending the current frame of the preprocessed image into a depth encoder module;
step 2.2: sending the features obtained by the depth encoder into the depth decoder to obtain a predicted depth map;
step 2.3: taking the current frame and the next frame of the preprocessed image as an image pair, and sending the image pair to a pose extractor to obtain a group of vectors: a rotation vector and a translation vector;
step 2.4: obtaining reconstructed front and rear frame images from the depth map obtained in step 2.2, the pose vectors obtained in step 2.3 and the original front and rear frame images, calculating a loss value from them, and performing back propagation to update the network parameters.
4. The method for monocular depth estimation of intelligent pump cavity endoscope images of claim 3, wherein the loss function used for the constraint in step 2.4 consists of two parts: a reprojection loss and a smoothness loss.
5. The method for monocular depth estimation of an intelligent pump cavity endoscope image according to claim 4, characterized in that the reprojection loss L_p, shown in formula (1), is the minimum of the photometric error function Pe between the frame at time t and the frame projected from time t-1 to time t, wherein the photometric error function Pe is shown in formula (2), and SSIM, which measures the similarity of two pictures in both structure and gray value, is computed as in formula (3):

L_p = \min_{t-1} Pe(I_t, I_{t-1 \to t})    (1)

Pe(I_a, I_b) = \frac{\alpha}{2} (1 - SSIM(I_a, I_b)) + (1 - \alpha) \| I_a - I_b \|_1    (2)

SSIM(I_a, I_b) = [l(I_a, I_b)]^{\alpha} [c(I_a, I_b)]^{\beta} [s(I_a, I_b)]^{\gamma}    (3)
the smoothness loss L_s is shown in formula (4), where the mean-normalized disparity d^* is defined in formula (5); L_s is a regularization term of the system network that prevents the parameters from over-fitting,

L_s = |\partial_x d^*_t| \, e^{-|\partial_x I_t|} + |\partial_y d^*_t| \, e^{-|\partial_y I_t|}    (4)

d^*_t = d_t / \bar{d}_t    (5)
the actual total loss L is shown in formula (6), where \lambda is 0.001 and \mu, computed as in formula (7), acts as a mask that determines whether the reprojection error is smaller than the original photometric error; if smaller, \mu is 1, otherwise it is 0:

L = \mu L_p + \lambda L_s    (6)

\mu = [\min_{t-1} Pe(I_t, I_{t-1 \to t}) < \min_{t-1} Pe(I_t, I_{t-1})]    (7).
CN202211070319.2A 2022-09-02 2022-09-02 Monocular depth estimation system and method for intelligent pump cavity endoscope image Pending CN115423856A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211070319.2A CN115423856A (en) 2022-09-02 2022-09-02 Monocular depth estimation system and method for intelligent pump cavity endoscope image


Publications (1)

Publication Number Publication Date
CN115423856A true CN115423856A (en) 2022-12-02

Family

ID=84202739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211070319.2A Pending CN115423856A (en) 2022-09-02 2022-09-02 Monocular depth estimation system and method for intelligent pump cavity endoscope image

Country Status (1)

Country Link
CN (1) CN115423856A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168067A (en) * 2022-12-21 2023-05-26 东华大学 Supervised multi-modal light field depth estimation method based on deep learning
CN116168067B (en) * 2022-12-21 2023-11-21 东华大学 Supervised multi-modal light field depth estimation method based on deep learning

Similar Documents

Publication Publication Date Title
CN110910447B (en) Visual odometer method based on dynamic and static scene separation
CN108564041B (en) Face detection and restoration method based on RGBD camera
CN110517306B (en) Binocular depth vision estimation method and system based on deep learning
CN111027415B (en) Vehicle detection method based on polarization image
CN115619826A (en) Dynamic SLAM method based on reprojection error and depth estimation
CN114067197A (en) Pipeline defect identification and positioning method based on target detection and binocular vision
CN111383257A (en) Method and device for determining loading and unloading rate of carriage
CN115272271A (en) Pipeline defect detecting and positioning ranging system based on binocular stereo vision
CN115423856A (en) Monocular depth estimation system and method for intelligent pump cavity endoscope image
CN110889868B (en) Monocular image depth estimation method combining gradient and texture features
CN114648669A (en) Motor train unit fault detection method and system based on domain-adaptive binocular parallax calculation
CN111105451B (en) Driving scene binocular depth estimation method for overcoming occlusion effect
Basak et al. Monocular depth estimation using encoder-decoder architecture and transfer learning from single RGB image
CN115035172A (en) Depth estimation method and system based on confidence degree grading and inter-stage fusion enhancement
CN114331961A (en) Method for defect detection of an object
CN110807799B (en) Line feature visual odometer method combined with depth map inference
CN111104532A (en) RGBD image joint recovery method based on double-current network
CN116524340A (en) AUV near-end docking monocular pose estimation method and device based on dense point reconstruction
CN116091793A (en) Light field significance detection method based on optical flow fusion
CN115953460A (en) Visual odometer method based on self-supervision deep learning
CN113763261B (en) Real-time detection method for far small target under sea fog weather condition
CN112561979B (en) Self-supervision monocular depth estimation method based on deep learning
CN113096176B (en) Semantic segmentation-assisted binocular vision unsupervised depth estimation method
CN115239559A (en) Depth map super-resolution method and system for fusion view synthesis
CN112069997B (en) Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination