CN113762358B - Semi-supervised learning three-dimensional reconstruction method based on relative depth training - Google Patents

Semi-supervised learning three-dimensional reconstruction method based on relative depth training

Info

Publication number
CN113762358B
Authority
CN
China
Prior art keywords
dimensional reconstruction
parallax
images
relative depth
training
Prior art date
Legal status
Active
Application number
CN202110946711.8A
Other languages
Chinese (zh)
Other versions
CN113762358A (en)
Inventor
顾寄南
胡君杰
Current Assignee
Jiangsu University
Original Assignee
Jiangsu University
Priority date
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN202110946711.8A
Publication of CN113762358A
Application granted
Publication of CN113762358B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • G06T7/596Depth or shape recovery from multiple images from stereo images from three or more stereo images

Abstract

The invention provides a semi-supervised learning three-dimensional reconstruction method based on relative depth training. The method first constructs a target-object image dataset for training a three-dimensional reconstruction network, builds a three-dimensional reconstruction network model with a U-shaped structure, and performs unsupervised and semi-supervised training on the model; the trained network is then pruned by cutting the multi-scale prediction branches used for calculating losses, leaving only the output of the last layer. During prediction, the model takes a single image as input and outputs a disparity map, from which a depth map is calculated using the binocular camera parameters and the disparity-depth conversion relationship, finally completing the three-dimensional reconstruction. The invention addresses problems of existing deep-learning-based three-dimensional reconstruction algorithms, such as the low accuracy of unsupervised training and the difficulty of acquiring real data for supervised training.

Description

Semi-supervised learning three-dimensional reconstruction method based on relative depth training
Technical Field
The invention relates to the technical field of three-dimensional reconstruction of machine vision, in particular to a semi-supervised learning three-dimensional reconstruction method based on relative depth training.
Background
Three-dimensional reconstruction is one of the key technologies of environment perception; its applications include automatic driving, virtual reality, moving-target monitoring, behavior analysis, security monitoring, crowd monitoring and the like. At present, most three-dimensional reconstruction is based on estimating the conversion from a two-dimensional RGB image to an RGB-D image, mainly comprising Shape-from-X methods that recover scene depth from image brightness, viewing angle, photometry, texture information and the like, and algorithms that predict camera pose in combination with SFM, SLAM and other approaches. Although some devices, such as laser radar, can acquire depth directly, they are expensive and are currently used mainly in technology research, development and testing, still some distance from large-scale market application. Meanwhile, with the rapid development of convolutional neural networks in recent years, three-dimensional reconstruction techniques based on deep learning are becoming a research hotspot.
At present, many researchers at home and abroad have conducted intensive research in the field of three-dimensional reconstruction and made great progress, and three-dimensional reconstruction algorithms based on supervised and unsupervised deep learning have achieved very good results. At the same time, however, each class of algorithm has its problems: (1) fully supervised methods require real three-dimensional data for training, but depth data are difficult and costly to acquire; (2) unsupervised or self-supervised methods do not utilize three-dimensional information at all, resulting in poor accuracy, and require mining prior knowledge.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a semi-supervised learning three-dimensional reconstruction method based on relative depth training, which solves problems such as the difficult data acquisition of supervised methods and the low precision and low robustness of unsupervised methods.
The present invention achieves the above technical object by the following means.
A semi-supervised learning three-dimensional reconstruction method based on relative depth training comprises the following steps:
S1, acquiring stereoscopic image pairs of a target object through a binocular camera, and processing each pair of images, including correction and manual labeling of the images; the processed images form a training data set;
S2, building a three-dimensional reconstruction network model of a U-shaped structure, wherein the three-dimensional reconstruction network model comprises a feature extraction part and a decoding part, a basic network module of the feature extraction part adopts a residual structure, and the feature extraction part also introduces a convolution kernel attention mechanism;
S3, feeding the stereoscopic images into the three-dimensional reconstruction network for feature extraction, predicting a pair of disparity maps, reconstructing a pair of original images using the predicted disparity maps and the disparity-map-to-original-image relationship, and calculating the reconstruction error loss by comparing the reconstructed original images with the real original images;
S4, performing training on the predicted disparity maps obtained in S3, constructing corresponding loss terms, and penalizing pixel point pairs that do not satisfy the relative depth values;
S5, pruning the trained three-dimensional reconstruction network by cutting the multi-scale prediction branches used for calculating losses, leaving only the output of the last layer; during prediction, the three-dimensional reconstruction network model takes a single image as input and outputs a disparity map, and a depth map is then calculated by combining the binocular camera parameters with the disparity-depth conversion relationship, finally completing the three-dimensional reconstruction.
In the above technical scheme, the manual labeling specifically comprises:
Labeling is performed on the stereoscopic image pairs of the target object: different pixel point pairs are selected for labeling on the two images of a stereoscopic pair, two pairs of pixel points are selected on each image, the relative depth relation of the two pairs is labeled, and the relative depth relation is quantified and converted into a relative depth value R; according to the point-taking order, if the first point is farther than the second point, let R=1; if the first point is nearer than the second point, let R=-1; and if the two points are at the same depth, let R=0.
In the above technical solution, the residual structure specifically includes:
a later layer is skip-connected with an earlier layer: the input features first pass through a residual block whose main branch begins with the Attention-block; the two branches are raised in dimension by 1×1 convolutions and added element-wise, then activated by a BN layer and ReLU and fed into a second residual block; the main branch of the second residual block consists of two 3×3 convolutions, after which the result is added element-wise directly with the input of the second residual block, then activated by a BN layer and ReLU and output.
In the above technical solution, the convolution kernel attention mechanism specifically includes:
The input feature map is convolved with 3×3 and 7×7 kernels respectively, halving the resolution of the original map; the results of the two branches are fused by element-wise addition, and global average pooling is applied to the fused feature map to obtain a C×1×1 one-dimensional vector; after two fully connected layers, two C×1×1 one-dimensional vectors are obtained, which are then fed into a softmax layer for non-negativity and normalization to generate a weight matrix; the feature maps of the two branches are multiplied by their respective weight matrices and then added element-wise to obtain the final output feature, where C is the number of channels.
In the above technical solution, the first half of the three-dimensional reconstruction network model performs feature extraction and the second half performs upsampling; after the resolution of the training dataset of the target object is uniformly adjusted, the images are input into the three-dimensional reconstruction network model; one convolution and one downsampling are performed first, then the basic network module is applied 4 times, followed by 6 rounds of upsampling, same-level concatenation and convolution, and disparity prediction is performed on the feature maps obtained in the last 4 rounds for training.
In the above technical solution, S3 is specifically: the stereoscopic image pair is input into the built three-dimensional reconstruction network model to obtain predicted disparity maps at 4 scales, where the disparity map predicted from the left image is called the left disparity map and that predicted from the right image the right disparity map; the left disparity map of the same size as the original image is combined with the right image and interpolated to generate an estimate of the left image, and the right disparity map of the same size as the original image is combined with the left image and interpolated to generate an estimate of the right image, producing a pair of reconstructed original images; the reconstruction loss is formed by comparing the reconstructed original images with the real original images, with loss function

L_re = (1/N) · Σ_ij [1 − SSIM(I_ij, Ĩ_ij)]

where I_ij is each pixel point of one view of the stereoscopic image pair, Ĩ_ij is the corresponding pixel on the view reconstructed from the predicted disparity map, N is the total number of pixels, and SSIM is the structural similarity (filter) function.
In the above technical solution, the training performed in S4 on the predicted disparity maps obtained in S3 is specifically:
The two-dimensional position coordinates of the manually labeled pixel point pairs containing relative depth information are looked up on the predicted disparity maps obtained in S3, and the predicted disparity value of each pixel point is obtained; the predicted relative depth is obtained from the magnitude relation between the predicted disparity values of a pair of pixel points and quantified into a predicted relative depth value D: in query-point order, if the disparity of the first point is smaller than that of the second point, D=1; if the disparity of the first point is larger than that of the second point, D=-1; and if the two disparities are equal, D=0;
The corresponding loss term is constructed, specifically the loss function

L_rd = Σ_I l_ij, where l_ij = log(1 + exp(R·(d_i − d_j))) if R ≠ 0, and l_ij = (d_i − d_j)² if R = 0,

where I represents the currently processed image, D is the predicted relative depth value, R is the true relative depth value, d is the predicted disparity value, i is the first point in the pixel pair, and j is the second point in the pixel pair.
In the above technical solution, the disparity-map-to-original-image relationship is: using a known disparity map, each pixel point of one view of the stereoscopic image pair is coordinate-shifted to reconstruct the other view of the stereoscopic pair.
The beneficial effects of the invention are as follows:
(1) The invention corrects and manually labels the stereo image pairs of the target object acquired by the binocular camera, which facilitates training.
(2) The basic network module of the feature extraction part of the three-dimensional reconstruction network model adopts a residual structure, avoiding gradient vanishing or gradient explosion when the network is trained to deep layers; the feature extraction part also introduces a convolution kernel attention mechanism, improving the accuracy of the three-dimensional reconstruction.
(3) The invention introduces the concept of relative depth on the basis of unsupervised training; three-dimensional information is converted into training data through manual labeling, and this auxiliary three-dimensional supervision significantly improves the robustness of the three-dimensional reconstruction algorithm and the fineness of the prediction results.
(4) The unsupervised and semi-supervised training methods require no real depth data as training data, greatly reducing the difficulty of data collection and the cost of training.
Drawings
FIG. 1 is a flow chart of a semi-supervised learning three-dimensional reconstruction method based on relative depth training according to the invention;
FIG. 2 is a flow chart of the deep learning three-dimensional reconstruction algorithm according to the present invention;
FIG. 3 is a schematic diagram of an Attention-mechanism (Attention-block) structure according to the present invention;
FIG. 4 is a schematic diagram of a Basic-block architecture of the present invention;
fig. 5 is a schematic diagram of a three-dimensional reconstruction network model (modified U-Net) according to the present invention.
Detailed Description
The invention will be further described with reference to the drawings and the specific embodiments, but the scope of the invention is not limited thereto.
As shown in fig. 1, the invention relates to a semi-supervised learning three-dimensional reconstruction method based on relative depth training, which specifically comprises the following steps:
Step (1), constructing a target object image data set for three-dimensional reconstruction network training
A large number of stereoscopic image pairs (i.e., left and right views) of the target object are acquired through a binocular camera, and the binocular camera is calibrated to obtain the extrinsic matrix, intrinsic matrix, distortion parameters and structural parameters. Distortion correction is performed on the stereoscopic images using the intrinsic and distortion parameters to eliminate the imaging distortion produced by physical distortion of the binocular camera lenses; epipolar rectification (parallel alignment) of the left and right views is then performed using the structural parameters, so that objects have the same size in the two images and corresponding pixel points lie on the same horizontal line. After all source images are processed, new corrected images are generated, and relative depth is manually labeled on the corrected images in preparation for semi-supervised training, where the relative depth is the relative distance relation of two pixel points to the binocular camera plane. Selecting two points on a stereoscopic image, recording their two-dimensional coordinate values, and labeling them respectively as the near point or the far point completes one relative-depth annotation; the labeling quality determines the effect of the supervised training. The invention provides the following relative depth labeling strategy: labeling is performed on the left and right views of the target object, and different pixel point pairs are selected for labeling on the two images; 4 pixel points, i.e., 2 pairs, are selected, one pair with an obvious depth difference and one pair with a small depth difference; the relative depth relation of the two pairs is labeled, quantified and converted into a relative depth value R. According to the point-taking order, if the first point is farther than the second point, let R=1; if the first point is nearer than the second point, let R=-1; and if the two points are at the same depth, let R=0 (a minimal sketch of this quantization rule is given below). In this way, all the images collected and corrected in step (1) are labeled, and the labeled images and related files are saved to form the training dataset of the target object.
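To make the quantization rule concrete, the following Python sketch converts one annotation record into the value R; the record format and field names are illustrative assumptions, not part of the patent.

```python
# Hypothetical annotation record for one labeled pixel pair; the field names
# and file layout are assumptions for illustration.
def quantize_relative_depth(judgment: str) -> int:
    """Map the annotator's near/far judgment (in point-taking order) to R."""
    return {"first_farther": 1, "first_nearer": -1, "same_depth": 0}[judgment]

annotation = {
    "points": [(120, 85), (310, 190)],   # (x, y) coordinates of the two pixels
    "judgment": "first_farther",         # first point is farther from the camera
}
R = quantize_relative_depth(annotation["judgment"])  # R = 1
```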
Step (2), building a three-dimensional reconstruction network model
The three-dimensional reconstruction network model generally adopts a U-shaped structure and comprises a feature extraction (encoding) part and a decoding part, a basic network module of the feature extraction part adopts a residual structure, and a convolution kernel attention mechanism is introduced.
As shown in FIG. 3, a preferred structure of the attention mechanism (Attention-block) of the invention is as follows. Because receptive fields (convolution kernels) of different sizes have different effects on targets of different scales (near or far, large or small), a fixed convolution kernel is not used; instead, a convolution kernel attention mechanism is introduced into the feature extraction part of the network to dynamically generate convolution kernels for different input images. Preferably, the input feature map is convolved with 3×3 and 7×7 kernels respectively, halving the resolution of the original map, with a BN layer and ReLU after each convolution. The results of the two branches are fused by element-wise addition, and global average pooling is applied to the fused feature map to compress the channel information into a C×1×1 one-dimensional vector representing the importance of each channel. This vector then passes through two fully connected layers to yield two C×1×1 one-dimensional vectors, which are fed into a softmax layer for non-negativity and normalization to generate a weight matrix for each branch. The feature maps of the two branches are multiplied by their respective weight matrices and added element-wise to obtain the output feature, whose height H and width W are half those of the input.
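A minimal PyTorch sketch of this selective-kernel style Attention-block follows; the channel-reduction ratio of the fully connected layers and the exact placement of BN/ReLU are assumptions where the text leaves them open.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Two parallel convolutions (3x3 and 7x7, stride 2) are fused, squeezed by
    global average pooling, and softmax weights re-weight each branch."""
    def __init__(self, in_ch: int, out_ch: int, reduction: int = 4):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branch7 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.fc1 = nn.Linear(out_ch, out_ch // reduction)  # squeeze FC layer
        self.fc3 = nn.Linear(out_ch // reduction, out_ch)  # per-branch FC layers
        self.fc7 = nn.Linear(out_ch // reduction, out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u3, u7 = self.branch3(x), self.branch7(x)      # both halve the resolution
        s = (u3 + u7).mean(dim=(2, 3))                 # element-wise fuse + GAP -> (B, C)
        z = torch.relu(self.fc1(s))
        w = torch.softmax(torch.stack([self.fc3(z), self.fc7(z)], dim=1), dim=1)
        w3 = w[:, 0, :, None, None]                    # (B, C, 1, 1) branch weights
        w7 = w[:, 1, :, None, None]
        return u3 * w3 + u7 * w7                       # weighted element-wise sum
```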
As shown in fig. 4, the basic network module (Basic-block) of the present invention adopts a residual structure, that is, a later layer is skip-connected with an earlier layer, so that low-dimensional features are retained while the features are continuously updated.
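Combining this skip connection with the structure given earlier (an Attention-block heading the first residual block's main branch, 1×1 dimension-raising convolutions on both branches, then a second residual block of two 3×3 convolutions), a sketch continuing the AttentionBlock listing above might look like this; the channel widths and the stride of the shortcut are assumptions chosen to make the shapes match.

```python
class BasicBlock(nn.Module):
    """Basic network module: attention residual block + plain residual block."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # first residual block: main branch starts with the Attention-block
        self.attn = AttentionBlock(in_ch, out_ch)              # halves resolution
        self.main1x1 = nn.Conv2d(out_ch, out_ch, 1, bias=False)
        self.short1x1 = nn.Conv2d(in_ch, out_ch, 1, stride=2, bias=False)
        self.act1 = nn.Sequential(nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # second residual block: two 3x3 convolutions, identity shortcut
        self.conv_a = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv_b = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.act2 = nn.Sequential(nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act1(self.main1x1(self.attn(x)) + self.short1x1(x))
        return self.act2(self.conv_b(self.conv_a(y)) + y)
```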
As shown in fig. 5, the first half of the three-dimensional reconstruction network model performs feature extraction and the second half performs upsampling. The resolution of the target-object training dataset obtained in step (1) is uniformly adjusted to 256×512 and input into the model. First, one convolution and one downsampling are performed to obtain a feature map with resolution 64×128; then 4 Basic-blocks (basic network modules) are applied, sequentially yielding feature maps with resolutions 32×64, 16×32, 8×16 and 4×8. Upsampling then starts from the 4×8 feature map, doubling its resolution to 8×16; channel concatenation is performed with the feature map of the same 8×16 size from the first half, followed by convolution, and this process is repeated 6 times until a feature map of the same size as the original image (256×512) is obtained. In the last 4 rounds of upsampling, concatenation and convolution, a disparity value is predicted for each pixel of the feature map, and these multi-scale predictions are used to compute the loss functions for training.
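The following compact sketch, continuing the modules above, traces these resolutions end to end; the channel widths, the exact skip wiring, and the sigmoid bounding of the disparity heads are assumptions for illustration.

```python
import torch.nn.functional as F

class DisparityUNet(nn.Module):
    """U-shaped sketch: stem downsampling, four BasicBlocks, six rounds of
    upsample/concat/convolve; the last four rounds also emit disparity maps."""
    def __init__(self):
        super().__init__()
        chs = [32, 64, 128, 256, 512]                      # stem + 4 BasicBlocks
        self.stem = nn.Sequential(                         # 3x256x512 -> 32x64x128
            nn.Conv2d(3, chs[0], 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(chs[0]), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.blocks = nn.ModuleList(BasicBlock(chs[i], chs[i + 1]) for i in range(4))
        dec_in = [512 + 256, 256 + 128, 128 + 64, 64 + 32, 32, 16]
        dec_out = [256, 128, 64, 32, 16, 16]
        self.dec = nn.ModuleList(nn.Conv2d(i, o, 3, padding=1) for i, o in zip(dec_in, dec_out))
        # 2-channel heads: one left and one right disparity map per scale
        self.heads = nn.ModuleList(nn.Conv2d(o, 2, 3, padding=1) for o in dec_out[-4:])

    def forward(self, x: torch.Tensor):
        skips = [self.stem(x)]                              # 64x128
        for block in self.blocks:
            skips.append(block(skips[-1]))                  # 32x64 ... 4x8
        y, disparities = skips[-1], []
        for k, conv in enumerate(self.dec):
            y = F.interpolate(y, scale_factor=2, mode="bilinear", align_corners=False)
            if k < 4:                                       # same-size encoder feature exists
                y = torch.cat([y, skips[-(k + 2)]], dim=1)
            y = torch.relu(conv(y))
            if k >= 2:                                      # last 4 rounds predict disparity
                disparities.append(torch.sigmoid(self.heads[k - 2](y)))
        return disparities                                  # 32x64, 64x128, 128x256, 256x512
```

A quick check: `DisparityUNet()(torch.randn(1, 3, 256, 512))` returns four maps with spatial sizes 32×64, 64×128, 128×256 and 256×512, matching the multi-scale prediction branches that are pruned in step (5).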
the technical parameters of the three-dimensional reconstruction network model are shown in table 1:
Table 1. Network model technical parameters
Stack () in table 1 is a channel dimension splicing operation, and the features after up-sampling each time are spliced with features with the same size in the feature extraction part, so that low-dimensional information is reserved, the network can be trained deeper, and the accuracy is higher.
Step (3), unsupervised training
An unsupervised (or self-supervised) training method requiring no real three-dimensional data is adopted: the stereoscopic images are fed into the three-dimensional reconstruction network for feature extraction, a pair of disparity maps is predicted, a pair of original images is reconstructed using the predicted disparity maps and the disparity-map-to-original-image relationship, and the reconstruction error loss is calculated by comparing the reconstructed original images with the real original images. The disparity-map-to-original-image relationship is as follows: the difference between the coordinates of a real-world point in the two views of a stereoscopic pair is called its disparity, and a disparity map is obtained by computing this disparity for every point on the target object from the two views. Using a known disparity map, each pixel point of one view of the stereoscopic pair is coordinate-shifted, with the offset equal to the pixel's disparity value, thereby reconstructing the other view of the stereoscopic pair.
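This coordinate shift can be implemented differentiably with bilinear sampling, which is what makes the reconstruction loss trainable end to end. In the sketch below (continuing the imports above), the sign convention assumes the left view is being reconstructed from the right image.

```python
def warp_with_disparity(src: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """Reconstruct one view of a rectified pair by shifting every pixel of the
    other view (src) horizontally by its predicted disparity.

    src:  (B, 3, H, W) image to sample from (e.g. the right view)
    disp: (B, 1, H, W) disparity in pixels for the view being reconstructed
    """
    b, _, h, w = src.shape
    xs = torch.linspace(-1, 1, w, device=src.device).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1, 1, h, device=src.device).view(1, h, 1).expand(b, h, w)
    # convert the pixel offset to normalized grid units: 2 / (w - 1) per pixel
    grid = torch.stack([xs - 2.0 * disp.squeeze(1) / (w - 1), ys], dim=-1)
    return F.grid_sample(src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```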
Step (4), semi-supervised training
Relative depth is introduced as auxiliary information for semi-supervised training: training is performed on the predicted disparity maps obtained in step (3), corresponding loss terms are constructed, and pixel point pairs that do not satisfy the relative depth values are penalized.
The specific processes of the steps (3) and (4) are as follows:
As shown in fig. 2, on the basis of the unsupervised training method, "relative depth" is introduced as a supervised training label for semi-supervised training. The left and right view pairs of the target object are input into the built three-dimensional reconstruction network model to obtain predicted disparity maps at 4 scales; the left image yields the left disparity map and the right image the right disparity map. The left disparity map of the same size as the original image is combined with the right image and interpolated to generate an estimate of the left image; the right disparity map of the same size as the original image is combined with the left image and interpolated to generate an estimate of the right image, producing a pair of reconstructed original images. Comparing the reconstructed original images with the real original images forms the reconstruction loss, which completes the unsupervised training, with loss function

L_re = (1/N) · Σ_ij [1 − SSIM(I_ij, Ĩ_ij)]

where I_ij is each pixel point of one view of the stereoscopic image pair, Ĩ_ij is the corresponding pixel on the view reconstructed from the predicted disparity map, N is the total number of pixels, and SSIM is the structural similarity (filter) function.
Meanwhile, the unsupervised training yields a pair of predicted disparity maps at the resolution of the original image. The two-dimensional position coordinates of the manually labeled pixel point pairs containing relative depth information are looked up on this pair of disparity maps, giving the predicted disparity value of each pixel point. The predicted relative depth is obtained from the magnitude relation between the predicted disparity values of a pair of pixel points and quantified into a predicted relative depth value D: in query-point order, if the disparity of the first point is smaller than that of the second point, D=1; if the disparity of the first point is larger, D=-1; and if the two disparities are equal, D=0. For each labeled pixel pair, the predicted relative depth value D is obtained and the true relative depth value R of the pair is looked up in the annotation file for comparison; if D=R the prediction is correct, and if D≠R the prediction is incorrect. A loss function is designed so that different prediction outcomes contribute differently to gradient descent: a correct prediction contributes little and an incorrect prediction contributes much. The loss function is

L_rd = Σ_I l_ij, where l_ij = log(1 + exp(R·(d_i − d_j))) if R ≠ 0, and l_ij = (d_i − d_j)² if R = 0,

where I represents the currently processed image, D is the predicted relative depth value, R is the true relative depth value, d is the predicted disparity value, i is the first point in the pixel pair, and j is the second point.
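Under the ranking-loss form reconstructed above (itself an assumption about the elided formula), the penalty can be sketched as follows, continuing the imports above: a correctly ordered pair yields a small log term, a wrongly ordered pair a large one, and equal-depth pairs are pulled together quadratically.

```python
def relative_depth_loss(disp: torch.Tensor, pairs, labels) -> torch.Tensor:
    """Ranking penalty over labeled pixel pairs (a sketch, not the patent's
    verbatim formula). pairs: list of ((x1, y1), (x2, y2)); labels: R values."""
    loss = disp.new_zeros(())
    for ((x1, y1), (x2, y2)), r in zip(pairs, labels):
        diff = disp[..., y1, x1] - disp[..., y2, x2]   # d_i - d_j
        if r == 0:
            loss = loss + diff.pow(2).mean()           # same depth: pull disparities together
        else:
            # r = 1 (first farther) expects d_i < d_j, so r*(d_i - d_j) < 0 -> small loss
            loss = loss + torch.log1p(torch.exp(r * diff)).mean()
    return loss
```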
After both the unsupervised training and the supervised training are completed, the semi-supervised training is complete.
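In practice the two penalties are combined into a single objective for semi-supervised training; the weighting factor in the sketch below is an assumption, not a value given in the patent.

```python
def semi_supervised_loss(recon_loss: torch.Tensor, rd_loss: torch.Tensor,
                         weight: float = 0.1) -> torch.Tensor:
    """Total objective: unsupervised reconstruction term plus the weighted
    relative-depth ranking term (the weight value is illustrative)."""
    return recon_loss + weight * rd_loss
```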
Step (5), three-dimensional reconstruction
Pruning is performed on the trained three-dimensional reconstruction network: the prediction branches with sizes 32×64×1, 64×128×1 and 128×256×1 are cut, leaving only the last 256×512×1 scale as output, so as to improve prediction speed. During prediction, only a single image with resolution 256×512×3 needs to be input, and a 256×512×1 disparity map is output; a depth map is then calculated by combining the binocular camera parameters with the disparity-depth conversion relationship. The conversion relationship between the disparity map and the depth map is:
Z=(f*b)/d1
In the above formula, Z is the absolute depth of a pixel point, d1 is the pixel's disparity value, f is the focal length of the binocular camera, and b is the translational offset (baseline) between the two cameras of the binocular rig.
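As a worked example under assumed camera parameters (focal length 700 px, baseline 0.12 m), a pixel with disparity 35 gives Z = 700 × 0.12 / 35 = 2.4 m. A one-function sketch:

```python
def disparity_to_depth(disp: torch.Tensor, focal_px: float, baseline_m: float,
                       eps: float = 1e-6) -> torch.Tensor:
    """Apply Z = (f * b) / d1 element-wise; eps guards against zero disparity."""
    return focal_px * baseline_m / disp.clamp(min=eps)
```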
The examples are preferred embodiments of the present invention, but the present invention is not limited to the above-described embodiments, and any obvious modifications, substitutions or variations that can be made by one skilled in the art without departing from the spirit of the present invention are within the scope of the present invention.

Claims (8)

1. A semi-supervised learning three-dimensional reconstruction method based on relative depth training, characterized by comprising the following steps:
S1, acquiring stereoscopic image pairs of a target object through a binocular camera, and processing each pair of images, including correction and manual labeling of the images; the processed images form a training data set;
S2, building a three-dimensional reconstruction network model of a U-shaped structure, wherein the three-dimensional reconstruction network model comprises a feature extraction part and a decoding part, a basic network module of the feature extraction part adopts a residual structure, and the feature extraction part also introduces a convolution kernel attention mechanism;
S3, feeding the stereoscopic images into the three-dimensional reconstruction network for feature extraction, predicting a pair of disparity maps, reconstructing a pair of original images using the predicted disparity maps and the disparity-map-to-original-image relationship, and calculating the reconstruction error loss by comparing the reconstructed original images with the real original images;
S4, performing training on the predicted disparity maps obtained in S3, constructing corresponding loss terms, and penalizing pixel point pairs that do not satisfy the relative depth values;
S5, pruning the trained three-dimensional reconstruction network by cutting the multi-scale prediction branches used for calculating losses, leaving only the output of the last layer; during prediction, the three-dimensional reconstruction network model takes a single image as input and outputs a disparity map, and a depth map is then calculated by combining the binocular camera parameters with the disparity-depth conversion relationship, finally completing the three-dimensional reconstruction.
2. The semi-supervised learning three-dimensional reconstruction method based on relative depth training as set forth in claim 1, wherein the manual labeling is specifically:
Labeling is performed on the stereoscopic image pairs of the target object: different pixel point pairs are selected for labeling on the two images of a stereoscopic pair, two pairs of pixel points are selected on each image, the relative depth relation of the two pairs is labeled, and the relative depth relation is quantified and converted into a relative depth value R; according to the point-taking order, if the first point is farther than the second point, let R=1; if the first point is nearer than the second point, let R=-1; and if the two points are at the same depth, let R=0.
3. The semi-supervised learning three dimensional reconstruction method based on relative depth training of claim 1, wherein the residual structure is specifically:
a later layer is skip-connected with an earlier layer: the input features first pass through a residual block whose main branch begins with the Attention-block; the two branches are raised in dimension by 1×1 convolutions and added element-wise, then activated by a BN layer and ReLU and fed into a second residual block; the main branch of the second residual block consists of two 3×3 convolutions, after which the result is added element-wise directly with the input of the second residual block, then activated by a BN layer and ReLU and output.
4. The relative depth training-based semi-supervised learning three dimensional reconstruction method as set forth in claim 1, wherein the convolution kernel attention mechanism is specifically:
The input feature map is convolved with 3×3 and 7×7 kernels respectively, halving the resolution of the original map; the results of the two branches are fused by element-wise addition, and global average pooling is applied to the fused feature map to obtain a C×1×1 one-dimensional vector; after two fully connected layers, two C×1×1 one-dimensional vectors are obtained, which are then fed into a softmax layer for non-negativity and normalization to generate a weight matrix; the feature maps of the two branches are multiplied by their respective weight matrices and then added element-wise to obtain the final output feature, where C is the number of channels.
5. The semi-supervised learning three-dimensional reconstruction method based on relative depth training according to claim 3, wherein the first half of the three-dimensional reconstruction network model performs feature extraction and the second half performs upsampling; after the resolution of the training dataset of the target object is uniformly adjusted, the images are input into the three-dimensional reconstruction network model; one convolution and one downsampling are performed first, then the basic network module is applied 4 times, followed by 6 rounds of upsampling, same-level concatenation and convolution, and disparity prediction is performed on the feature maps obtained in the last 4 rounds for training.
6. The semi-supervised learning three-dimensional reconstruction method based on relative depth training of claim 1, wherein S3 is specifically: the stereoscopic image pair is input into the built three-dimensional reconstruction network model to obtain predicted disparity maps at 4 scales, where the disparity map predicted from the left image is called the left disparity map and that predicted from the right image the right disparity map; the left disparity map of the same size as the original image is combined with the right image and interpolated to generate an estimate of the left image, and the right disparity map of the same size as the original image is combined with the left image and interpolated to generate an estimate of the right image, producing a pair of reconstructed original images; the reconstruction loss is formed by comparing the reconstructed original images with the real original images, with loss function

L_re = (1/N) · Σ_ij [1 − SSIM(I_ij, Ĩ_ij)]

where I_ij is each pixel point of one view of the stereoscopic image pair, Ĩ_ij is the corresponding pixel on the view reconstructed from the predicted disparity map, N is the total number of pixels, and SSIM is the structural similarity (filter) function.
7. The semi-supervised learning three-dimensional reconstruction method based on relative depth training according to claim 6, wherein the training performed in S4 on the predicted disparity maps obtained in S3 is specifically:
The two-dimensional position coordinates of the manually labeled pixel point pairs containing relative depth information are looked up on the predicted disparity maps obtained in S3, and the predicted disparity value of each pixel point is obtained; the predicted relative depth is obtained from the magnitude relation between the predicted disparity values of a pair of pixel points and quantified into a predicted relative depth value D: in query-point order, if the disparity of the first point is smaller than that of the second point, D=1; if the disparity of the first point is larger than that of the second point, D=-1; and if the two disparities are equal, D=0;
the corresponding loss term is constructed, specifically the loss function

L_rd = Σ_I l_ij, where l_ij = log(1 + exp(R·(d_i − d_j))) if R ≠ 0, and l_ij = (d_i − d_j)² if R = 0,

where I represents the currently processed image, D is the predicted relative depth value, R is the true relative depth value, d is the predicted disparity value, i is the first point in the pixel pair, and j is the second point in the pixel pair.
8. The semi-supervised learning three-dimensional reconstruction method based on relative depth training of claim 1, wherein the disparity-map-to-original-image relationship is: using a known disparity map, each pixel point of one view of the stereoscopic image pair is coordinate-shifted to reconstruct the other view of the stereoscopic pair.
CN202110946711.8A 2021-08-18 2021-08-18 Semi-supervised learning three-dimensional reconstruction method based on relative depth training Active CN113762358B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110946711.8A CN113762358B (en) 2021-08-18 2021-08-18 Semi-supervised learning three-dimensional reconstruction method based on relative depth training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110946711.8A CN113762358B (en) 2021-08-18 2021-08-18 Semi-supervised learning three-dimensional reconstruction method based on relative depth training

Publications (2)

Publication Number Publication Date
CN113762358A (en) 2021-12-07
CN113762358B (en) 2024-05-14

Family

ID=78790328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110946711.8A Active CN113762358B (en) 2021-08-18 2021-08-18 Semi-supervised learning three-dimensional reconstruction method based on relative depth training

Country Status (1)

Country Link
CN (1) CN113762358B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936117B (en) * 2021-12-14 2022-03-08 中国海洋大学 High-frequency region enhanced luminosity three-dimensional reconstruction method based on deep learning
CN114842287B (en) * 2022-03-25 2022-12-06 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer
TWI787141B (en) * 2022-06-21 2022-12-11 鴻海精密工業股份有限公司 Method and equipment for training depth estimation model, and method and equipment for depth estimation
CN115829005B (en) * 2022-12-09 2023-06-27 之江实验室 Automatic defect diagnosis and repair method and device for convolutional neural classification network
CN116105632B (en) * 2023-04-12 2023-06-23 四川大学 Self-supervision phase unwrapping method and device for structured light three-dimensional imaging
CN117333758B (en) * 2023-12-01 2024-02-13 博创联动科技股份有限公司 Land route identification system based on big data analysis

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018000752A1 (en) * 2016-06-27 2018-01-04 浙江工商大学 Monocular image depth estimation method based on multi-scale cnn and continuous crf
CN109472819A (en) * 2018-09-06 2019-03-15 杭州电子科技大学 A kind of binocular parallax estimation method based on cascade geometry context neural network
CN109377530A (en) * 2018-11-30 2019-02-22 天津大学 A kind of binocular depth estimation method based on deep neural network

Also Published As

Publication number Publication date
CN113762358A (en) 2021-12-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant