CN115423856A - Monocular depth estimation system and method for intelligent pump cavity endoscope image - Google Patents

Monocular depth estimation system and method for intelligent pump cavity endoscope image

Info

Publication number
CN115423856A
Authority
CN
China
Prior art keywords
depth estimation
depth
monocular
image
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211070319.2A
Other languages
Chinese (zh)
Inventor
程一飞
范舒铭
董国庆
李玉道
郭素英
李志远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jining Antai Mine Equipment Manufacturing Co ltd
Original Assignee
Jining Antai Mine Equipment Manufacturing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jining Antai Mine Equipment Manufacturing Co ltd filed Critical Jining Antai Mine Equipment Manufacturing Co ltd
Priority to CN202211070319.2A priority Critical patent/CN115423856A/en
Publication of CN115423856A publication Critical patent/CN115423856A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10068Endoscopic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

The invention relates to the technical field of endoscope detection, and in particular to a monocular depth estimation system and method for intelligent pump cavity endoscope images. The invention uses the Transformer series as the backbone network of a self-supervised monocular depth estimation system: for the depth encoder, swin_transformer is used as the backbone network, and for the pose extractor, vision_transformer is used as the backbone network. Using different Transformer-based networks as the backbone of the self-supervised monocular depth estimation system effectively improves the accuracy of depth estimation from monocular pictures.

Description

Monocular depth estimation system and method for intelligent pump cavity endoscope image
Technical Field
The invention relates to the technical field of endoscope detection, in particular to a monocular depth estimation system and a monocular depth estimation method for an intelligent pump cavity endoscope image.
Background
In the era of rapidly developing artificial intelligence, intelligent pumps are widely applied in important fields such as industry, agricultural production, energy, petrochemicals, aviation, steel and the military industry, and play an important role in national economic development. Although China is a large manufacturing country, its intelligent pump manufacturing technology still has many problems, such as pump faults caused by insufficient R&D investment, weak independent innovation capability, and weak basic supporting components. Internal pump cavity defects such as cracking, corrosion and rusting are difficult to diagnose by human observation, which shortens the service life of intelligent pumps, delays the development of manufacturing technology, and hinders technical innovation. The endoscope integrates traditional optics, ergonomics, precision machinery, modern electronics and software; it is a detection instrument with an image sensor, optical illumination and a mechanical device, able to probe the inside of bent pipelines and observe parts that cannot be viewed directly by the human eye. Endoscopes are commonly used in industry for non-destructive testing: the inspected body does not need to be disassembled, interior surface conditions such as cracks, weld seams and rust can be observed directly, and the whole inspection can be recorded as video or photographs, which facilitates fault diagnosis and subsequent quantitative analysis.
Depth estimation refers to the process of recovering the three-dimensional information of a scene from the two-dimensional information captured by a camera. Monocular methods use only one image to achieve this goal: they estimate the distance between scene objects and the camera from a single viewpoint. In deep learning, this is realized as a mapping from the pixels of the image to actual depth values.
Depth estimation methods are roughly classified into two categories according to the number of views required for prediction: binocular depth estimation and monocular depth estimation. Binocular (or multi-view) depth estimation infers image depth through a stereo matching algorithm or structure from motion. At present, binocular depth estimation can obtain fairly good results, but its drawback is obvious: the cost is too high. Not only are the equipment requirements demanding, but the data-processing pipeline is also relatively complicated. Monocular depth estimation addresses these problems, and is itself divided into two approaches: supervised monocular depth estimation and self-supervised monocular depth estimation.
With the rapid development of deep neural networks, especially over the last decade, self-supervised monocular depth estimation algorithms built on deep learning have become practical to deploy.
The existing self-supervision monocular depth estimation method has the following defects:
First, existing self-supervised monocular depth estimation methods have low overall accuracy and can hardly achieve practical effect, while the detection of the intelligent pump cavity endoscope requires high accuracy.
Second, in the process of estimating the pose, the networks commonly used are limited in their ability to capture the characteristics of pose change, especially since the pose of the intelligent pump cavity endoscope image differs from that of a conventional camera.
Disclosure of Invention
In order to solve the above technical problems, it is an object of the present invention to provide a monocular depth estimation system and method for intelligent pump cavity endoscope images, which uses the Transformer series of networks, which have recently performed outstandingly in various image-processing tasks (including image classification and image segmentation), as the backbone network of a self-supervised monocular depth estimation system. Compared with earlier frameworks that use a convolutional neural network as the backbone, when a Transformer extracts features from an image, the feature extraction at each stage has a global field of view thanks to its self-attention modules. For the depth encoder, swin_transformer is used as the backbone network. The swin_transformer adds a sliding window when extracting image features, so in terms of positional priors and multi-scale features it offers performance not inferior to a convolutional neural network on dense prediction tasks. For the pose extractor, vision_transformer is used as the backbone network. The vision_transformer adds position information before extracting image features, so the corresponding positions of the image can be marked and the information of pose change obtained more accurately; since pose change does not place high demands on scale, no sliding window is needed to extract information at different scales.
The invention adopts the following specific technical scheme:
a monocular depth estimation system for intelligent pump cavity endoscope images adopts transformers series instead of CNN as a backbone network of an auto-supervision monocular depth estimation system, and the system consists of a depth encoder, a depth decoder and a pose extractor; the depth encoder comprises a patch partition layer and 4 stages, wherein the stage1 consists of a linear embedding and two switch _ transformer blocks, the internal structures of the stage2 and the stage3 are the same and both consist of a patch merging and two switch _ transformer blocks, and the stage4 consists of a patch merging and six switch _ transformer blocks; the pose extractor uses a Vision _ Transformer as a backbone network, and the structure of the Vision _ Transformer comprises a Linear project of FattenePatches and a Transformer Encoder, wherein the default depth of the Transformer Encoder is 6.
The invention also discloses a method for monocular depth estimation of an intelligent pump cavity endoscope image, which comprises the following steps:
step 1: constructing a data set suitable for training, namely preprocessing the image;
step 2: sending the preprocessed image into the self-supervised monocular depth estimation framework for operation;
step 3: obtaining the corresponding predicted depth map.
The process of performing the self-supervised monocular depth estimation operation on the preprocessed image comprises the following steps:
step 2.1: sending the current frame of the preprocessed image into the depth encoder module, where the image undergoes sufficient down-sampling feature extraction through the swin_transformer;
step 2.2: sending the features obtained by the depth encoder into the depth decoder to obtain a predicted depth map; in this process the input of the decoder is a feature group, and in order to meet the input requirement of the resnet decoder, an up-sampling operation is first performed on the first group of shallow features in the feature group before input;
step 2.3: taking the current frame and the next frame of the preprocessed image as an image pair and sending the image pair to the pose extractor to obtain a group of vectors, a rotation vector and a translation vector; in this process the image pair is position-encoded and then sent into the 6-layer transformer_encoder of the vision_transformer to obtain the pose change vectors between the front and rear frames;
step 2.4: obtaining reconstructed front and rear frame images from the depth map obtained in step 2.2, the pose vectors obtained in step 2.3 and the original front and rear frame images, calculating a loss value from the reconstructed images, and performing back propagation to update the network parameters.
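Before the reprojection of step 2.4, the rotation and translation vectors produced in step 2.3 are typically assembled into a rigid-body transform. The patent does not specify this conversion, so the following is only the standard axis-angle (Rodrigues) construction, offered as an illustration:

```python
import numpy as np

def pose_vec_to_mat(rot_vec, trans):
    """Convert an axis-angle rotation vector and a translation vector
    (the outputs of step 2.3) into a 4x4 homogeneous transform.

    Uses Rodrigues' formula; this is a common convention assumed here,
    not a construction stated in the patent.
    """
    theta = np.linalg.norm(rot_vec)
    if theta < 1e-8:
        R = np.eye(3)                      # near-zero rotation: identity
    else:
        k = rot_vec / theta                # unit rotation axis
        K = np.array([[0.0, -k[2], k[1]],  # skew-symmetric cross-product matrix
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = trans
    return T
```

The resulting matrix, together with the predicted depth map and the camera intrinsics, is what warps the source frame onto the target frame for the photometric loss below.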
The loss function used as the constraint in the present system consists of two parts: the reprojection loss and the smoothness loss.
The reprojection loss L_p, shown in equation (1), is the minimum of the photometric error function Pe between the frame at time t and the frame projected from time t-1 to time t; the photometric error function Pe is shown in equation (2), and SSIM, which measures the similarity of two pictures in both structure and gray value, is computed as in equation (3):

L_p = \min_{t-1} Pe(I_t, I_{t-1 \to t})    (1)

Pe(I_a, I_b) = \frac{\alpha}{2} (1 - SSIM(I_a, I_b)) + (1 - \alpha) \| I_a - I_b \|_1    (2)

SSIM(I_a, I_b) = [l(I_a, I_b)]^{\alpha} [c(I_a, I_b)]^{\beta} [s(I_a, I_b)]^{\gamma}    (3)
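For illustration, with the common choice of exponents \alpha = \beta = \gamma = 1, equation (3) reduces to a single expression. The sketch below computes it from global image statistics; real implementations use a sliding local window, and the stabilizing constants c1 and c2 are assumed defaults, not values from the patent:

```python
import numpy as np

def ssim_global(a, b, c1=0.01**2, c2=0.03**2):
    """Simplified SSIM of equation (3) with alpha = beta = gamma = 1,
    computed from whole-image statistics (a sketch, not a windowed
    implementation). Inputs are float arrays scaled to [0, 1]."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    l = (2 * mu_a * mu_b + c1) / (mu_a**2 + mu_b**2 + c1)          # luminance term
    c = (2 * np.sqrt(var_a * var_b) + c2) / (var_a + var_b + c2)   # contrast term
    s = (cov + c2 / 2) / (np.sqrt(var_a * var_b) + c2 / 2)         # structure term
    return l * c * s
```

An identical pair of images yields an SSIM of 1, which drives the SSIM part of Pe in equation (2) to zero.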
The smoothness loss L_s is shown in equation (4), where the mean-normalized disparity d^* is defined in equation (5); L_s is a regularization term of the system network that prevents the parameters from over-fitting:

L_s = |\partial_x d^*_t| \, e^{-|\partial_x I_t|} + |\partial_y d^*_t| \, e^{-|\partial_y I_t|}    (4)

d^*_t = d_t / \bar{d}_t    (5)
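A minimal numeric sketch of equations (4) and (5), assuming the usual edge-aware reading of the symbols in which disparity gradients are penalized less where the image itself has strong gradients (the patent's formula figures are not reproduced, so this follows the standard form):

```python
import numpy as np

def smoothness_loss(disp, img):
    """Edge-aware smoothness term of equations (4)-(5).

    Mean-normalizes the disparity (equation (5)), then penalizes its
    spatial gradients, down-weighted by exp(-|image gradient|) so that
    depth is allowed to change sharply at image edges. A sketch, not
    the patent's exact implementation.
    """
    d = disp / (disp.mean() + 1e-7)         # d* of equation (5)
    dx = np.abs(d[:, 1:] - d[:, :-1])       # |d/dx of d*|
    dy = np.abs(d[1:, :] - d[:-1, :])       # |d/dy of d*|
    ix = np.abs(img[:, 1:] - img[:, :-1])   # image gradients used as edge weights
    iy = np.abs(img[1:, :] - img[:-1, :])
    return (dx * np.exp(-ix)).mean() + (dy * np.exp(-iy)).mean()
```

A constant disparity map has zero gradients and therefore zero smoothness loss, which matches the role of L_s as a pure regularizer.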
The actual total loss L is shown in equation (6), where \lambda is 0.001 and \mu, computed as in equation (7), acts as a mask that determines per pixel whether the reprojection error is smaller than the original photometric error: if smaller, \mu is 1; otherwise it is 0:

L = \mu L_p + \lambda L_s    (6)

\mu = [\min_{t-1} Pe(I_t, I_{t-1 \to t}) < \min_{t-1} Pe(I_t, I_{t-1})]    (7)
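Equations (6) and (7) combine into a few lines of code: \mu keeps only the pixels where the warped-frame error beats the un-warped (identity) error, masking out regions that do not change between frames. A hedged numeric sketch, with per-pixel errors supplied as arrays:

```python
import numpy as np

def total_loss(pe_reproj, pe_identity, smooth, lam=0.001):
    """Equations (6)-(7): auto-masked reprojection loss plus weighted
    smoothness. pe_reproj and pe_identity are per-pixel photometric
    errors for the warped and un-warped source frame; lam = 0.001 is
    the weight stated in the text. A sketch, not the patent's code.
    """
    mu = (pe_reproj < pe_identity).astype(float)   # Iverson bracket of equation (7)
    return (mu * pe_reproj).mean() + lam * smooth
```

For example, a pixel whose warped error (0.1) is below its identity error (0.2) contributes to the loss, while one whose warped error (0.5) exceeds its identity error (0.4) is masked out.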
The invention uses the Transformer series instead of a CNN as the backbone network of the system, thereby improving the accuracy of self-supervised monocular depth estimation as a whole; for the pose extractor, position encoding is applied to the image pair before it is input, which strengthens the structural correspondence between matching positions and makes the output pose vector more accurate.
The invention has the beneficial effects that:
First, using the recent Transformer series instead of a CNN as the backbone network of the self-supervised monocular depth estimation system improves the accuracy of self-supervised monocular depth estimation as a whole; the depth encoder uses swin_transformer and the pose extractor uses vision_transformer.
Second, for the pose extractor in the system, the Transformer adds position information before extracting image features, so the corresponding positions of the image can be marked, pose change information can be obtained better, and the accuracy of the resulting pose vector is improved.
Drawings
FIG. 1 is an overall framework diagram of the self-supervised monocular depth estimation network system of the present invention.
FIG. 2 is a depth prediction network diagram according to the present invention.
Fig. 3 is a structural diagram of the depth_encoder according to the present invention.
FIG. 4 is a structural diagram of a swin_transformer block according to the present invention.
FIG. 5 is a diagram of a vision_transformer block according to the present invention.
Fig. 6 is a pose prediction network diagram of the present invention.
Detailed Description
The present invention will be further described in detail below with reference to examples in order to facilitate understanding and practice of the invention by those of ordinary skill in the art, and it should be understood that the examples described herein are for illustration and explanation only and are not intended to limit the invention.
Example: as shown in fig. 1, a monocular depth estimation system for intelligent pump cavity endoscope images adopts the Transformer series instead of a CNN as the backbone network of a self-supervised monocular depth estimation system; the system consists of a depth encoder, a depth decoder and a pose extractor. The depth encoder comprises a patch partition layer and 4 stages: stage 1 consists of a linear embedding and two swin_transformer blocks; stage 2 and stage 3 have the same internal structure, each consisting of a patch merging and two swin_transformer blocks; and stage 4 consists of a patch merging and six swin_transformer blocks. The pose extractor takes a vision_transformer as its backbone network, whose structure comprises a Linear Projection of Flattened Patches and a Transformer Encoder, where the default depth of the Transformer Encoder is 6.
The monocular depth estimation of the endoscope image of the intelligent pump cavity by using the system specifically comprises the following steps:
step 1: constructing a data set suitable for training, namely preprocessing the image;
step 2: sending the preprocessed image into the self-supervised monocular depth estimation framework for operation;
step 3: obtaining the corresponding predicted depth map.
The process of performing the self-supervised monocular depth estimation operation on the preprocessed image comprises the following steps:
step 2.1: sending the current frame of the preprocessed image into the depth encoder module, where the image undergoes sufficient down-sampling feature extraction through the swin_transformer;
step 2.2: sending the features obtained by the depth encoder into the depth decoder to obtain a predicted depth map; in this process the input of the decoder is a feature group, and in order to meet the input requirement of the resnet decoder, an up-sampling operation is first performed on the first group of shallow features in the feature group before input;
step 2.3: taking the current frame and the next frame of the preprocessed image as an image pair and sending the image pair to the pose extractor to obtain a group of vectors, a rotation vector and a translation vector; in this process the image pair is position-encoded and then sent into the 6-layer transformer_encoder of the vision_transformer to obtain the pose change vectors between the front and rear frames;
step 2.4: obtaining reconstructed front and rear frame images from the depth map obtained in step 2.2, the pose vectors obtained in step 2.3 and the original front and rear frame images, calculating a loss value from the reconstructed images, and performing back propagation to update the network parameters.
The loss function used as the constraint in the present system consists of two parts: the reprojection loss and the smoothness loss.
As shown in equation (1), the reprojection loss L_p is the minimum of the photometric error function Pe between the frame at time t and the frame projected from time t-1 to time t; the photometric error function Pe is shown in equation (2), and SSIM, which measures the similarity of two pictures in both structure and gray value, is computed as in equation (3):

L_p = \min_{t-1} Pe(I_t, I_{t-1 \to t})    (1)

Pe(I_a, I_b) = \frac{\alpha}{2} (1 - SSIM(I_a, I_b)) + (1 - \alpha) \| I_a - I_b \|_1    (2)

SSIM(I_a, I_b) = [l(I_a, I_b)]^{\alpha} [c(I_a, I_b)]^{\beta} [s(I_a, I_b)]^{\gamma}    (3)
The smoothness loss L_s is shown in equation (4), where the mean-normalized disparity d^* is defined in equation (5); L_s is a regularization term of the system network that prevents the parameters from over-fitting:

L_s = |\partial_x d^*_t| \, e^{-|\partial_x I_t|} + |\partial_y d^*_t| \, e^{-|\partial_y I_t|}    (4)

d^*_t = d_t / \bar{d}_t    (5)
The actual total loss L is shown in equation (6), where \lambda is 0.001 and \mu, computed as in equation (7), acts as a mask that determines per pixel whether the reprojection error is smaller than the original photometric error: if smaller, \mu is 1; otherwise it is 0:

L = \mu L_p + \lambda L_s    (6)

\mu = [\min_{t-1} Pe(I_t, I_{t-1 \to t}) < \min_{t-1} Pe(I_t, I_{t-1})]    (7)
the above description is only an embodiment of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (5)

1. A monocular depth estimation system for an intelligent pump cavity endoscope image, characterized by comprising a depth encoder, a depth decoder and a pose extractor; the depth encoder comprises a patch partition layer and 4 stages, wherein stage 1 consists of a linear embedding and two swin_transformer blocks, stage 2 and stage 3 have the same internal structure and each consist of a patch merging and two swin_transformer blocks, and stage 4 consists of a patch merging and six swin_transformer blocks; the pose extractor takes a vision_transformer as a backbone network, and the structure of the vision_transformer comprises a Linear Projection of Flattened Patches and a Transformer Encoder, wherein the default depth of the Transformer Encoder is 6.
2. A method for monocular depth estimation of intelligent pump cavity endoscopic images, using the monocular depth estimation system of claim 1, comprising in particular the steps of:
step 1: constructing a data set suitable for training, namely preprocessing the image;
step 2: sending the preprocessed image into a self-supervised monocular depth estimation framework for operation;
step 3: obtaining a corresponding predicted depth map.
3. The method for monocular depth estimation of intelligent pump cavity endoscope images of claim 2, wherein the process of performing the self-supervised monocular depth estimation operation on the preprocessed images in step 2 comprises:
step 2.1: sending the current frame of the preprocessed image into a depth encoder module;
step 2.2: sending the features obtained by the depth encoder into the depth decoder to obtain a predicted depth map;
step 2.3: taking the current frame and the next frame of the preprocessed image as an image pair, and sending the image pair to a pose extractor to obtain a group of vectors: a rotation vector and a translation vector;
step 2.4: obtaining reconstructed front and rear frame images from the depth map obtained in step 2.2, the pose vectors obtained in step 2.3 and the original front and rear frame images, calculating a loss value from them, and performing back propagation to update the network parameters.
4. The method for monocular depth estimation of intelligent pump cavity endoscope images of claim 3, wherein the loss function used for the constraint in step 2.4 consists of two parts: a reprojection loss and a smoothness loss.
5. The method for monocular depth estimation of an intelligent pump cavity endoscope image according to claim 4, characterized in that the reprojection loss L_p, shown in formula (1), is the minimum of the photometric error function Pe between the frame at time t and the frame projected from time t-1 to time t, wherein the photometric error function Pe is shown in formula (2), and SSIM, which measures the similarity of two pictures in both structure and gray value, is computed as in formula (3):

L_p = \min_{t-1} Pe(I_t, I_{t-1 \to t})    (1)

Pe(I_a, I_b) = \frac{\alpha}{2} (1 - SSIM(I_a, I_b)) + (1 - \alpha) \| I_a - I_b \|_1    (2)

SSIM(I_a, I_b) = [l(I_a, I_b)]^{\alpha} [c(I_a, I_b)]^{\beta} [s(I_a, I_b)]^{\gamma}    (3)
the smoothness loss L_s is shown in formula (4), where the mean-normalized disparity d^* is defined in formula (5); L_s is a regularization term of the system network that prevents the parameters from over-fitting,

L_s = |\partial_x d^*_t| \, e^{-|\partial_x I_t|} + |\partial_y d^*_t| \, e^{-|\partial_y I_t|}    (4)

d^*_t = d_t / \bar{d}_t    (5)
the actual total loss L is shown in formula (6), where \lambda is 0.001 and \mu, computed as in formula (7), acts as a mask that determines whether the reprojection error is smaller than the original photometric error; if smaller, \mu is 1, otherwise it is 0:

L = \mu L_p + \lambda L_s    (6)

\mu = [\min_{t-1} Pe(I_t, I_{t-1 \to t}) < \min_{t-1} Pe(I_t, I_{t-1})]    (7).
CN202211070319.2A 2022-09-02 2022-09-02 Monocular depth estimation system and method for intelligent pump cavity endoscope image Pending CN115423856A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211070319.2A CN115423856A (en) 2022-09-02 2022-09-02 Monocular depth estimation system and method for intelligent pump cavity endoscope image


Publications (1)

Publication Number Publication Date
CN115423856A true CN115423856A (en) 2022-12-02

Family

ID=84202739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211070319.2A Pending CN115423856A (en) 2022-09-02 2022-09-02 Monocular depth estimation system and method for intelligent pump cavity endoscope image

Country Status (1)

Country Link
CN (1) CN115423856A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168067A (en) * 2022-12-21 2023-05-26 东华大学 Supervised multi-modal light field depth estimation method based on deep learning
CN116168067B (en) * 2022-12-21 2023-11-21 东华大学 Supervised multi-modal light field depth estimation method based on deep learning

Similar Documents

Publication Publication Date Title
CN110910447B (en) Visual odometer method based on dynamic and static scene separation
CN108564041B (en) Face detection and restoration method based on RGBD camera
CN110517306B (en) Binocular depth vision estimation method and system based on deep learning
CN111027415B (en) Vehicle detection method based on polarization image
CN115619826A (en) Dynamic SLAM method based on reprojection error and depth estimation
CN114067197A (en) Pipeline defect identification and positioning method based on target detection and binocular vision
CN111383257A (en) Method and device for determining loading and unloading rate of carriage
CN115272271A (en) Pipeline defect detecting and positioning ranging system based on binocular stereo vision
CN115423856A (en) Monocular depth estimation system and method for intelligent pump cavity endoscope image
CN110889868B (en) Monocular image depth estimation method combining gradient and texture features
CN114648669A (en) Motor train unit fault detection method and system based on domain-adaptive binocular parallax calculation
CN111105451B (en) Driving scene binocular depth estimation method for overcoming occlusion effect
Basak et al. Monocular depth estimation using encoder-decoder architecture and transfer learning from single RGB image
CN115035172A (en) Depth estimation method and system based on confidence degree grading and inter-stage fusion enhancement
CN114331961A (en) Method for defect detection of an object
CN110807799B (en) Line feature visual odometer method combined with depth map inference
CN111104532A (en) RGBD image joint recovery method based on double-current network
CN116524340A (en) AUV near-end docking monocular pose estimation method and device based on dense point reconstruction
CN116091793A (en) Light field significance detection method based on optical flow fusion
CN115953460A (en) Visual odometer method based on self-supervision deep learning
CN113763261B (en) Real-time detection method for far small target under sea fog weather condition
CN112561979B (en) Self-supervision monocular depth estimation method based on deep learning
CN113096176B (en) Semantic segmentation-assisted binocular vision unsupervised depth estimation method
CN115239559A (en) Depth map super-resolution method and system for fusion view synthesis
CN112069997B (en) Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination