CN113762358B - Semi-supervised learning three-dimensional reconstruction method based on relative depth training - Google Patents

Semi-supervised learning three-dimensional reconstruction method based on relative depth training

Info

Publication number
CN113762358B
Authority
CN
China
Prior art keywords
dimensional reconstruction
parallax
images
relative depth
training
Prior art date
Legal status
Active
Application number
CN202110946711.8A
Other languages
Chinese (zh)
Other versions
CN113762358A (en)
Inventor
顾寄南
胡君杰
Current Assignee
Jiangsu University
Original Assignee
Jiangsu University
Priority date
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN202110946711.8A
Publication of CN113762358A
Application granted
Publication of CN113762358B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • G06T7/596Depth or shape recovery from multiple images from stereo images from three or more stereo images

Abstract

The invention provides a semi-supervised learning three-dimensional reconstruction method based on relative depth training. The method first constructs a target-object image dataset for training a three-dimensional reconstruction network, builds a three-dimensional reconstruction network model with a U-shaped structure, and performs unsupervised and semi-supervised training on the model; the trained network is then pruned by cutting the multi-scale prediction branches used for calculating losses, leaving only the output of the last layer. During prediction, the model takes a single image as input and outputs a disparity map, from which a depth map is calculated using the binocular camera parameters and the disparity-depth conversion relationship, finally completing the three-dimensional reconstruction. The invention addresses problems of existing deep-learning-based three-dimensional reconstruction algorithms, such as the low accuracy of unsupervised training and the difficulty of acquiring real data for supervised training.

Description

Semi-supervised learning three-dimensional reconstruction method based on relative depth training
Technical Field
The invention relates to the technical field of three-dimensional reconstruction of machine vision, in particular to a semi-supervised learning three-dimensional reconstruction method based on relative depth training.
Background
Three-dimensional reconstruction is one of the key technologies of environment perception; its applications include automatic driving, virtual reality, moving-target monitoring, behavior analysis, security monitoring, crowd monitoring and the like. At present, most three-dimensional reconstruction is based on estimating the conversion from a two-dimensional RGB image to an RGB-D image, mainly comprising Shape-from-X methods that recover scene depth from image brightness, viewing angle, photometry, texture information and the like, and algorithms that predict camera pose in combination with SFM, SLAM and other approaches. Although some devices, such as laser radar, can acquire depth directly, they are expensive and are currently used mainly in technology research, development and testing, still some distance from large-scale market application. Meanwhile, with the rapid development of convolutional neural networks in recent years, three-dimensional reconstruction techniques based on deep learning are becoming a research hotspot.
At present, many researchers at home and abroad have conducted intensive research in the field of three-dimensional reconstruction and made great progress, and three-dimensional reconstruction algorithms based on supervised and unsupervised deep learning have achieved very good results. At the same time, however, each class of algorithm has its problems: (1) fully supervised methods require real three-dimensional data for training, but depth data are difficult and costly to acquire; (2) unsupervised or self-supervised methods do not utilize three-dimensional information at all, resulting in poor accuracy, and require mining prior knowledge.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a semi-supervised learning three-dimensional reconstruction method based on relative depth training, which solves problems such as the difficult data acquisition of supervised methods and the low precision and low robustness of unsupervised methods.
The present invention achieves the above technical object by the following means.
A semi-supervised learning three-dimensional reconstruction method based on relative depth training comprises the following steps:
S1, acquiring stereoscopic image pairs of a target object through a binocular camera, and processing each pair of images, including correction and manual labeling of the images; the processed images form a training data set;
S2, building a three-dimensional reconstruction network model of a U-shaped structure, wherein the three-dimensional reconstruction network model comprises a feature extraction part and a decoding part, a basic network module of the feature extraction part adopts a residual structure, and the feature extraction part also introduces a convolution kernel attention mechanism;
S3, feeding the stereoscopic images into the three-dimensional reconstruction network for feature extraction, predicting a pair of disparity maps, reconstructing a pair of original images using the predicted disparity maps and the disparity-map-to-original-image relationship, and calculating the reconstruction error loss by comparing the reconstructed original images with the real original images;
S4, performing training on the predicted disparity maps obtained in S3, constructing corresponding loss terms, and penalizing pixel point pairs that do not satisfy the relative depth values;
S5, pruning the trained three-dimensional reconstruction network by cutting the multi-scale prediction branches used for calculating losses, leaving only the output of the last layer; during prediction, the three-dimensional reconstruction network model takes a single image as input and outputs a disparity map, and a depth map is then calculated by combining the binocular camera parameters with the disparity-depth conversion relationship, finally completing the three-dimensional reconstruction.
In the above technical scheme, the manual labeling specifically comprises:
Labeling is performed on the stereoscopic image pairs of the target object: different pixel point pairs are selected for labeling on the two images of a stereoscopic pair, two pairs of pixel points are selected on each image, the relative depth relation of the two pairs is labeled, and the relative depth relation is quantified and converted into a relative depth value R; according to the point-taking order, if the first point is farther than the second point, let R=1; if the first point is nearer than the second point, let R=-1; and if the two points are at the same depth, let R=0.
In the above technical solution, the residual structure specifically includes:
a later layer is skip-connected with an earlier layer: the input features first pass through a residual block whose main branch begins with the Attention-block; the two branches are raised in dimension by 1×1 convolutions and added element-wise, then activated by a BN layer and ReLU and fed into a second residual block; the main branch of the second residual block consists of two 3×3 convolutions, after which the result is added element-wise directly with the input of the second residual block, then activated by a BN layer and ReLU and output.
In the above technical solution, the convolution kernel attention mechanism specifically includes:
The input feature map is convolved with 3×3 and 7×7 kernels respectively, halving the resolution of the original map; the results of the two branches are fused by element-wise addition, and global average pooling is applied to the fused feature map to obtain a C×1×1 one-dimensional vector; after two fully connected layers, two C×1×1 one-dimensional vectors are obtained, which are then fed into a softmax layer for non-negativity and normalization to generate a weight matrix; the feature maps of the two branches are multiplied by their respective weight matrices and then added element-wise to obtain the final output feature, where C is the number of channels.
In the above technical solution, the first half of the three-dimensional reconstruction network model performs feature extraction and the second half performs upsampling; after the resolution of the training dataset of the target object is uniformly adjusted, the images are input into the three-dimensional reconstruction network model; one convolution and one downsampling are performed first, then the basic network module is applied 4 times, followed by 6 rounds of upsampling, same-level concatenation and convolution, and disparity prediction is performed on the feature maps obtained in the last 4 rounds for training.
In the above technical solution, S3 is specifically: the stereoscopic image pair is input into the built three-dimensional reconstruction network model to obtain predicted disparity maps at 4 scales, where the disparity map predicted from the left image is called the left disparity map and that predicted from the right image the right disparity map; the left disparity map of the same size as the original image is combined with the right image and interpolated to generate an estimate of the left image, and the right disparity map of the same size as the original image is combined with the left image and interpolated to generate an estimate of the right image, producing a pair of reconstructed original images; the reconstruction loss is formed by comparing the reconstructed original images with the real original images, with loss function

L_re = (1/N) · Σ_ij [1 − SSIM(I_ij, Ĩ_ij)]

where I_ij is each pixel point of one view of the stereoscopic image pair, Ĩ_ij is the corresponding pixel on the view reconstructed from the predicted disparity map, N is the total number of pixels, and SSIM is the structural similarity (filter) function.
In the above technical solution, the training performed in S4 on the predicted disparity maps obtained in S3 is specifically:
The two-dimensional position coordinates of the manually labeled pixel point pairs containing relative depth information are looked up on the predicted disparity maps obtained in S3, and the predicted disparity value of each pixel point is obtained; the predicted relative depth is obtained from the magnitude relation between the predicted disparity values of a pair of pixel points and quantified into a predicted relative depth value D: in query-point order, if the disparity of the first point is smaller than that of the second point, D=1; if the disparity of the first point is larger than that of the second point, D=-1; and if the two disparities are equal, D=0;
The corresponding loss term is constructed, specifically the loss function

L_rd = Σ_I l_ij, where l_ij = log(1 + exp(R·(d_i − d_j))) if R ≠ 0, and l_ij = (d_i − d_j)² if R = 0,

where I represents the currently processed image, D is the predicted relative depth value, R is the true relative depth value, d is the predicted disparity value, i is the first point in the pixel pair, and j is the second point in the pixel pair.
In the above technical solution, the disparity-map-to-original-image relationship is: using a known disparity map, each pixel point of one view of the stereoscopic image pair is coordinate-shifted to reconstruct the other view of the stereoscopic pair.
The beneficial effects of the invention are as follows:
(1) The invention corrects and manually labels the stereo image pairs of the target object acquired by the binocular camera, which facilitates training.
(2) The basic network module of the feature extraction part of the three-dimensional reconstruction network model adopts a residual structure, avoiding gradient vanishing or gradient explosion when the network is trained to deep layers; the feature extraction part also introduces a convolution kernel attention mechanism, improving the accuracy of the three-dimensional reconstruction.
(3) The invention introduces the concept of relative depth on the basis of unsupervised training; three-dimensional information is converted into training data through manual labeling, and this auxiliary three-dimensional supervision significantly improves the robustness of the three-dimensional reconstruction algorithm and the fineness of the prediction results.
(4) The unsupervised and semi-supervised training methods require no real depth data as training data, greatly reducing the difficulty of data collection and the cost of training.
Drawings
FIG. 1 is a flow chart of a semi-supervised learning three-dimensional reconstruction method based on relative depth training according to the invention;
FIG. 2 is a flow chart of the deep learning three-dimensional reconstruction algorithm according to the present invention;
FIG. 3 is a schematic diagram of an Attention-mechanism (Attention-block) structure according to the present invention;
FIG. 4 is a schematic diagram of a Basic-block architecture of the present invention;
fig. 5 is a schematic diagram of a three-dimensional reconstruction network model (modified U-Net) according to the present invention.
Detailed Description
The invention will be further described with reference to the drawings and the specific embodiments, but the scope of the invention is not limited thereto.
As shown in fig. 1, the invention relates to a semi-supervised learning three-dimensional reconstruction method based on relative depth training, which specifically comprises the following steps:
Step (1), constructing a target object image data set for three-dimensional reconstruction network training
A large number of stereoscopic image pairs (i.e., left and right views) of the target object are acquired through a binocular camera, and the binocular camera is calibrated to obtain the extrinsic matrix, intrinsic matrix, distortion parameters and structural parameters. Distortion correction is performed on the stereoscopic images using the intrinsic and distortion parameters to eliminate the imaging distortion produced by physical distortion of the binocular camera lenses; epipolar rectification (parallel alignment) of the left and right views is then performed using the structural parameters, so that objects have the same size in the two images and corresponding pixel points lie on the same horizontal line. After all source images are processed, new corrected images are generated, and relative depth is manually labeled on the corrected images in preparation for semi-supervised training, where the relative depth is the relative distance relation of two pixel points to the binocular camera plane. Selecting two points on a stereoscopic image, recording their two-dimensional coordinate values, and labeling them respectively as the near point or the far point completes one relative-depth annotation; the labeling quality determines the effect of the supervised training. The invention provides the following relative depth labeling strategy: labeling is performed on the left and right views of the target object, and different pixel point pairs are selected for labeling on the two images; 4 pixel points, i.e., 2 pairs, are selected, one pair with an obvious depth difference and one pair with a small depth difference; the relative depth relation of the two pairs is labeled, quantified and converted into a relative depth value R. According to the point-taking order, if the first point is farther than the second point, let R=1; if the first point is nearer than the second point, let R=-1; and if the two points are at the same depth, let R=0 (a minimal sketch of this quantization rule is given below). In this way, all the images collected and corrected in step (1) are labeled, and the labeled images and related files are saved to form the training dataset of the target object.
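To make the quantization rule concrete, the following Python sketch converts one annotation record into the value R; the record format and field names are illustrative assumptions, not part of the patent.

```python
# Hypothetical annotation record for one labeled pixel pair; the field names
# and file layout are assumptions for illustration.
def quantize_relative_depth(judgment: str) -> int:
    """Map the annotator's near/far judgment (in point-taking order) to R."""
    return {"first_farther": 1, "first_nearer": -1, "same_depth": 0}[judgment]

annotation = {
    "points": [(120, 85), (310, 190)],   # (x, y) coordinates of the two pixels
    "judgment": "first_farther",         # first point is farther from the camera
}
R = quantize_relative_depth(annotation["judgment"])  # R = 1
```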
Step (2), building a three-dimensional reconstruction network model
The three-dimensional reconstruction network model generally adopts a U-shaped structure and comprises a feature extraction (encoding) part and a decoding part, a basic network module of the feature extraction part adopts a residual structure, and a convolution kernel attention mechanism is introduced.
As shown in FIG. 3, a preferred structure of the attention mechanism (Attention-block) of the invention is as follows. Because receptive fields (convolution kernels) of different sizes have different effects on targets of different scales (near or far, large or small), a fixed convolution kernel is not used; instead, a convolution kernel attention mechanism is introduced into the feature extraction part of the network to dynamically generate convolution kernels for different input images. Preferably, the input feature map is convolved with 3×3 and 7×7 kernels respectively, halving the resolution of the original map, with a BN layer and ReLU after each convolution. The results of the two branches are fused by element-wise addition, and global average pooling is applied to the fused feature map to compress the channel information into a C×1×1 one-dimensional vector representing the importance of each channel. This vector then passes through two fully connected layers to yield two C×1×1 one-dimensional vectors, which are fed into a softmax layer for non-negativity and normalization to generate a weight matrix for each branch. The feature maps of the two branches are multiplied by their respective weight matrices and added element-wise to obtain the output feature, whose height H and width W are half those of the input.
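A minimal PyTorch sketch of this selective-kernel style Attention-block follows; the channel-reduction ratio of the fully connected layers and the exact placement of BN/ReLU are assumptions where the text leaves them open.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Two parallel convolutions (3x3 and 7x7, stride 2) are fused, squeezed by
    global average pooling, and softmax weights re-weight each branch."""
    def __init__(self, in_ch: int, out_ch: int, reduction: int = 4):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branch7 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.fc1 = nn.Linear(out_ch, out_ch // reduction)  # squeeze FC layer
        self.fc3 = nn.Linear(out_ch // reduction, out_ch)  # per-branch FC layers
        self.fc7 = nn.Linear(out_ch // reduction, out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u3, u7 = self.branch3(x), self.branch7(x)      # both halve the resolution
        s = (u3 + u7).mean(dim=(2, 3))                 # element-wise fuse + GAP -> (B, C)
        z = torch.relu(self.fc1(s))
        w = torch.softmax(torch.stack([self.fc3(z), self.fc7(z)], dim=1), dim=1)
        w3 = w[:, 0, :, None, None]                    # (B, C, 1, 1) branch weights
        w7 = w[:, 1, :, None, None]
        return u3 * w3 + u7 * w7                       # weighted element-wise sum
```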
As shown in fig. 4, the basic network module (Basic-block) of the present invention adopts a residual structure, that is, a later layer is skip-connected with an earlier layer, so that low-dimensional features are retained while the features are continuously updated.
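Combining this skip connection with the structure given earlier (an Attention-block heading the first residual block's main branch, 1×1 dimension-raising convolutions on both branches, then a second residual block of two 3×3 convolutions), a sketch continuing the AttentionBlock listing above might look like this; the channel widths and the stride of the shortcut are assumptions chosen to make the shapes match.

```python
class BasicBlock(nn.Module):
    """Basic network module: attention residual block + plain residual block."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # first residual block: main branch starts with the Attention-block
        self.attn = AttentionBlock(in_ch, out_ch)              # halves resolution
        self.main1x1 = nn.Conv2d(out_ch, out_ch, 1, bias=False)
        self.short1x1 = nn.Conv2d(in_ch, out_ch, 1, stride=2, bias=False)
        self.act1 = nn.Sequential(nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # second residual block: two 3x3 convolutions, identity shortcut
        self.conv_a = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv_b = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.act2 = nn.Sequential(nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act1(self.main1x1(self.attn(x)) + self.short1x1(x))
        return self.act2(self.conv_b(self.conv_a(y)) + y)
```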
As shown in fig. 5, the first half of the three-dimensional reconstruction network model performs feature extraction and the second half performs upsampling. The resolution of the target-object training dataset obtained in step (1) is uniformly adjusted to 256×512 and input into the model. First, one convolution and one downsampling are performed to obtain a feature map with resolution 64×128; then 4 Basic-blocks (basic network modules) are applied, sequentially yielding feature maps with resolutions 32×64, 16×32, 8×16 and 4×8. Upsampling then starts from the 4×8 feature map, doubling its resolution to 8×16; channel concatenation is performed with the feature map of the same 8×16 size from the first half, followed by convolution, and this process is repeated 6 times until a feature map of the same size as the original image (256×512) is obtained. In the last 4 rounds of upsampling, concatenation and convolution, a disparity value is predicted for each pixel of the feature map, and these multi-scale predictions are used to compute the loss functions for training.
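The following compact sketch, continuing the modules above, traces these resolutions end to end; the channel widths, the exact skip wiring, and the sigmoid bounding of the disparity heads are assumptions for illustration.

```python
import torch.nn.functional as F

class DisparityUNet(nn.Module):
    """U-shaped sketch: stem downsampling, four BasicBlocks, six rounds of
    upsample/concat/convolve; the last four rounds also emit disparity maps."""
    def __init__(self):
        super().__init__()
        chs = [32, 64, 128, 256, 512]                      # stem + 4 BasicBlocks
        self.stem = nn.Sequential(                         # 3x256x512 -> 32x64x128
            nn.Conv2d(3, chs[0], 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(chs[0]), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.blocks = nn.ModuleList(BasicBlock(chs[i], chs[i + 1]) for i in range(4))
        dec_in = [512 + 256, 256 + 128, 128 + 64, 64 + 32, 32, 16]
        dec_out = [256, 128, 64, 32, 16, 16]
        self.dec = nn.ModuleList(nn.Conv2d(i, o, 3, padding=1) for i, o in zip(dec_in, dec_out))
        # 2-channel heads: one left and one right disparity map per scale
        self.heads = nn.ModuleList(nn.Conv2d(o, 2, 3, padding=1) for o in dec_out[-4:])

    def forward(self, x: torch.Tensor):
        skips = [self.stem(x)]                              # 64x128
        for block in self.blocks:
            skips.append(block(skips[-1]))                  # 32x64 ... 4x8
        y, disparities = skips[-1], []
        for k, conv in enumerate(self.dec):
            y = F.interpolate(y, scale_factor=2, mode="bilinear", align_corners=False)
            if k < 4:                                       # same-size encoder feature exists
                y = torch.cat([y, skips[-(k + 2)]], dim=1)
            y = torch.relu(conv(y))
            if k >= 2:                                      # last 4 rounds predict disparity
                disparities.append(torch.sigmoid(self.heads[k - 2](y)))
        return disparities                                  # 32x64, 64x128, 128x256, 256x512
```

A quick check: `DisparityUNet()(torch.randn(1, 3, 256, 512))` returns four maps with spatial sizes 32×64, 64×128, 128×256 and 256×512, matching the multi-scale prediction branches that are pruned in step (5).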
the technical parameters of the three-dimensional reconstruction network model are shown in table 1:
Table 1. Network model technical parameters
Stack () in table 1 is a channel dimension splicing operation, and the features after up-sampling each time are spliced with features with the same size in the feature extraction part, so that low-dimensional information is reserved, the network can be trained deeper, and the accuracy is higher.
Step (3), unsupervised training
An unsupervised (or self-supervised) training method requiring no real three-dimensional data is adopted: the stereoscopic images are fed into the three-dimensional reconstruction network for feature extraction, a pair of disparity maps is predicted, a pair of original images is reconstructed using the predicted disparity maps and the disparity-map-to-original-image relationship, and the reconstruction error loss is calculated by comparing the reconstructed original images with the real original images. The disparity-map-to-original-image relationship is as follows: the difference between the coordinates of a real-world point in the two views of a stereoscopic pair is called its disparity, and a disparity map is obtained by computing this disparity for every point on the target object from the two views. Using a known disparity map, each pixel point of one view of the stereoscopic pair is coordinate-shifted, with the offset equal to the pixel's disparity value, thereby reconstructing the other view of the stereoscopic pair.
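This coordinate shift can be implemented differentiably with bilinear sampling, which is what makes the reconstruction loss trainable end to end. In the sketch below (continuing the imports above), the sign convention assumes the left view is being reconstructed from the right image.

```python
def warp_with_disparity(src: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """Reconstruct one view of a rectified pair by shifting every pixel of the
    other view (src) horizontally by its predicted disparity.

    src:  (B, 3, H, W) image to sample from (e.g. the right view)
    disp: (B, 1, H, W) disparity in pixels for the view being reconstructed
    """
    b, _, h, w = src.shape
    xs = torch.linspace(-1, 1, w, device=src.device).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1, 1, h, device=src.device).view(1, h, 1).expand(b, h, w)
    # convert the pixel offset to normalized grid units: 2 / (w - 1) per pixel
    grid = torch.stack([xs - 2.0 * disp.squeeze(1) / (w - 1), ys], dim=-1)
    return F.grid_sample(src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```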
Step (4), semi-supervised training
Relative depth is introduced as auxiliary information for semi-supervised training: training is performed on the predicted disparity maps obtained in step (3), corresponding loss terms are constructed, and pixel point pairs that do not satisfy the relative depth values are penalized.
The specific processes of the steps (3) and (4) are as follows:
As shown in fig. 2, on the basis of the unsupervised training method, "relative depth" is introduced as a supervised training label for semi-supervised training. The left and right view pairs of the target object are input into the built three-dimensional reconstruction network model to obtain predicted disparity maps at 4 scales; the left image yields the left disparity map and the right image the right disparity map. The left disparity map of the same size as the original image is combined with the right image and interpolated to generate an estimate of the left image; the right disparity map of the same size as the original image is combined with the left image and interpolated to generate an estimate of the right image, producing a pair of reconstructed original images. Comparing the reconstructed original images with the real original images forms the reconstruction loss, which completes the unsupervised training, with loss function

L_re = (1/N) · Σ_ij [1 − SSIM(I_ij, Ĩ_ij)]

where I_ij is each pixel point of one view of the stereoscopic image pair, Ĩ_ij is the corresponding pixel on the view reconstructed from the predicted disparity map, N is the total number of pixels, and SSIM is the structural similarity (filter) function.
Meanwhile, the unsupervised training yields a pair of predicted disparity maps at the resolution of the original image. The two-dimensional position coordinates of the manually labeled pixel point pairs containing relative depth information are looked up on this pair of disparity maps, giving the predicted disparity value of each pixel point. The predicted relative depth is obtained from the magnitude relation between the predicted disparity values of a pair of pixel points and quantified into a predicted relative depth value D: in query-point order, if the disparity of the first point is smaller than that of the second point, D=1; if the disparity of the first point is larger, D=-1; and if the two disparities are equal, D=0. For each labeled pixel pair, the predicted relative depth value D is obtained and the true relative depth value R of the pair is looked up in the annotation file for comparison; if D=R the prediction is correct, and if D≠R the prediction is incorrect. A loss function is designed so that different prediction outcomes contribute differently to gradient descent: a correct prediction contributes little and an incorrect prediction contributes much. The loss function is

L_rd = Σ_I l_ij, where l_ij = log(1 + exp(R·(d_i − d_j))) if R ≠ 0, and l_ij = (d_i − d_j)² if R = 0,

where I represents the currently processed image, D is the predicted relative depth value, R is the true relative depth value, d is the predicted disparity value, i is the first point in the pixel pair, and j is the second point.
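Under the ranking-loss form reconstructed above (itself an assumption about the elided formula), the penalty can be sketched as follows, continuing the imports above: a correctly ordered pair yields a small log term, a wrongly ordered pair a large one, and equal-depth pairs are pulled together quadratically.

```python
def relative_depth_loss(disp: torch.Tensor, pairs, labels) -> torch.Tensor:
    """Ranking penalty over labeled pixel pairs (a sketch, not the patent's
    verbatim formula). pairs: list of ((x1, y1), (x2, y2)); labels: R values."""
    loss = disp.new_zeros(())
    for ((x1, y1), (x2, y2)), r in zip(pairs, labels):
        diff = disp[..., y1, x1] - disp[..., y2, x2]   # d_i - d_j
        if r == 0:
            loss = loss + diff.pow(2).mean()           # same depth: pull disparities together
        else:
            # r = 1 (first farther) expects d_i < d_j, so r*(d_i - d_j) < 0 -> small loss
            loss = loss + torch.log1p(torch.exp(r * diff)).mean()
    return loss
```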
After both the unsupervised training and the supervised training are completed, the semi-supervised training is complete.
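In practice the two penalties are combined into a single objective for semi-supervised training; the weighting factor in the sketch below is an assumption, not a value given in the patent.

```python
def semi_supervised_loss(recon_loss: torch.Tensor, rd_loss: torch.Tensor,
                         weight: float = 0.1) -> torch.Tensor:
    """Total objective: unsupervised reconstruction term plus the weighted
    relative-depth ranking term (the weight value is illustrative)."""
    return recon_loss + weight * rd_loss
```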
Step (5), three-dimensional reconstruction
Pruning is performed on the trained three-dimensional reconstruction network: the prediction branches with sizes 32×64×1, 64×128×1 and 128×256×1 are cut, leaving only the last 256×512×1 scale as output, so as to improve prediction speed. During prediction, only a single image with resolution 256×512×3 needs to be input, and a 256×512×1 disparity map is output; a depth map is then calculated by combining the binocular camera parameters with the disparity-depth conversion relationship. The conversion relationship between the disparity map and the depth map is:
Z=(f*b)/d1
In the above formula, Z is the absolute depth of a pixel point, d1 is the pixel's disparity value, f is the focal length of the binocular camera, and b is the translational offset (baseline) between the two cameras of the binocular rig.
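As a worked example under assumed camera parameters (focal length 700 px, baseline 0.12 m), a pixel with disparity 35 gives Z = 700 × 0.12 / 35 = 2.4 m. A one-function sketch:

```python
def disparity_to_depth(disp: torch.Tensor, focal_px: float, baseline_m: float,
                       eps: float = 1e-6) -> torch.Tensor:
    """Apply Z = (f * b) / d1 element-wise; eps guards against zero disparity."""
    return focal_px * baseline_m / disp.clamp(min=eps)
```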
The examples are preferred embodiments of the present invention, but the present invention is not limited to the above-described embodiments, and any obvious modifications, substitutions or variations that can be made by one skilled in the art without departing from the spirit of the present invention are within the scope of the present invention.

Claims (8)

1. A semi-supervised learning three-dimensional reconstruction method based on relative depth training, characterized by comprising the following steps:
S1, acquiring stereoscopic image pairs of a target object through a binocular camera, and processing each pair of images, including correction and manual labeling of the images; the processed images form a training data set;
S2, building a three-dimensional reconstruction network model of a U-shaped structure, wherein the three-dimensional reconstruction network model comprises a feature extraction part and a decoding part, a basic network module of the feature extraction part adopts a residual structure, and the feature extraction part also introduces a convolution kernel attention mechanism;
S3, feeding the stereoscopic images into the three-dimensional reconstruction network for feature extraction, predicting a pair of disparity maps, reconstructing a pair of original images using the predicted disparity maps and the disparity-map-to-original-image relationship, and calculating the reconstruction error loss by comparing the reconstructed original images with the real original images;
S4, performing training on the predicted disparity maps obtained in S3, constructing corresponding loss terms, and penalizing pixel point pairs that do not satisfy the relative depth values;
S5, pruning the trained three-dimensional reconstruction network by cutting the multi-scale prediction branches used for calculating losses, leaving only the output of the last layer; during prediction, the three-dimensional reconstruction network model takes a single image as input and outputs a disparity map, and a depth map is then calculated by combining the binocular camera parameters with the disparity-depth conversion relationship, finally completing the three-dimensional reconstruction.
2. The semi-supervised learning three-dimensional reconstruction method based on relative depth training as set forth in claim 1, wherein the manual labeling is specifically:
Labeling is performed on the stereoscopic image pairs of the target object: different pixel point pairs are selected for labeling on the two images of a stereoscopic pair, two pairs of pixel points are selected on each image, the relative depth relation of the two pairs is labeled, and the relative depth relation is quantified and converted into a relative depth value R; according to the point-taking order, if the first point is farther than the second point, let R=1; if the first point is nearer than the second point, let R=-1; and if the two points are at the same depth, let R=0.
3. The semi-supervised learning three dimensional reconstruction method based on relative depth training of claim 1, wherein the residual structure is specifically:
a later layer is skip-connected with an earlier layer: the input features first pass through a residual block whose main branch begins with the Attention-block; the two branches are raised in dimension by 1×1 convolutions and added element-wise, then activated by a BN layer and ReLU and fed into a second residual block; the main branch of the second residual block consists of two 3×3 convolutions, after which the result is added element-wise directly with the input of the second residual block, then activated by a BN layer and ReLU and output.
4. The relative depth training-based semi-supervised learning three dimensional reconstruction method as set forth in claim 1, wherein the convolution kernel attention mechanism is specifically:
The input feature map is convolved with 3×3 and 7×7 kernels respectively, halving the resolution of the original map; the results of the two branches are fused by element-wise addition, and global average pooling is applied to the fused feature map to obtain a C×1×1 one-dimensional vector; after two fully connected layers, two C×1×1 one-dimensional vectors are obtained, which are then fed into a softmax layer for non-negativity and normalization to generate a weight matrix; the feature maps of the two branches are multiplied by their respective weight matrices and then added element-wise to obtain the final output feature, where C is the number of channels.
5. The semi-supervised learning three-dimensional reconstruction method based on relative depth training according to claim 3, wherein the first half of the three-dimensional reconstruction network model performs feature extraction and the second half performs upsampling; after the resolution of the training dataset of the target object is uniformly adjusted, the images are input into the three-dimensional reconstruction network model; one convolution and one downsampling are performed first, then the basic network module is applied 4 times, followed by 6 rounds of upsampling, same-level concatenation and convolution, and disparity prediction is performed on the feature maps obtained in the last 4 rounds for training.
6. The semi-supervised learning three-dimensional reconstruction method based on relative depth training of claim 1, wherein S3 is specifically: the stereoscopic image pair is input into the built three-dimensional reconstruction network model to obtain predicted disparity maps at 4 scales, where the disparity map predicted from the left image is called the left disparity map and that predicted from the right image the right disparity map; the left disparity map of the same size as the original image is combined with the right image and interpolated to generate an estimate of the left image, and the right disparity map of the same size as the original image is combined with the left image and interpolated to generate an estimate of the right image, producing a pair of reconstructed original images; the reconstruction loss is formed by comparing the reconstructed original images with the real original images, with loss function

L_re = (1/N) · Σ_ij [1 − SSIM(I_ij, Ĩ_ij)]

where I_ij is each pixel point of one view of the stereoscopic image pair, Ĩ_ij is the corresponding pixel on the view reconstructed from the predicted disparity map, N is the total number of pixels, and SSIM is the structural similarity (filter) function.
7. The semi-supervised learning three-dimensional reconstruction method based on relative depth training according to claim 6, wherein the training performed in S4 on the predicted disparity maps obtained in S3 is specifically:
The two-dimensional position coordinates of the manually labeled pixel point pairs containing relative depth information are looked up on the predicted disparity maps obtained in S3, and the predicted disparity value of each pixel point is obtained; the predicted relative depth is obtained from the magnitude relation between the predicted disparity values of a pair of pixel points and quantified into a predicted relative depth value D: in query-point order, if the disparity of the first point is smaller than that of the second point, D=1; if the disparity of the first point is larger than that of the second point, D=-1; and if the two disparities are equal, D=0;
the corresponding loss term is constructed, specifically the loss function

L_rd = Σ_I l_ij, where l_ij = log(1 + exp(R·(d_i − d_j))) if R ≠ 0, and l_ij = (d_i − d_j)² if R = 0,

where I represents the currently processed image, D is the predicted relative depth value, R is the true relative depth value, d is the predicted disparity value, i is the first point in the pixel pair, and j is the second point in the pixel pair.
8. The semi-supervised learning three-dimensional reconstruction method based on relative depth training of claim 1, wherein the disparity-map-to-original-image relationship is: using a known disparity map, each pixel point of one view of the stereoscopic image pair is coordinate-shifted to reconstruct the other view of the stereoscopic pair.
CN202110946711.8A 2021-08-18 2021-08-18 Semi-supervised learning three-dimensional reconstruction method based on relative depth training Active CN113762358B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110946711.8A CN113762358B (en) 2021-08-18 2021-08-18 Semi-supervised learning three-dimensional reconstruction method based on relative depth training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110946711.8A CN113762358B (en) 2021-08-18 2021-08-18 Semi-supervised learning three-dimensional reconstruction method based on relative depth training

Publications (2)

Publication Number Publication Date
CN113762358A (en) 2021-12-07
CN113762358B (en) 2024-05-14

Family

ID=78790328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110946711.8A Active CN113762358B (en) 2021-08-18 2021-08-18 Semi-supervised learning three-dimensional reconstruction method based on relative depth training

Country Status (1)

Country Link
CN (1) CN113762358B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936117B (en) * 2021-12-14 2022-03-08 中国海洋大学 High-frequency region enhanced luminosity three-dimensional reconstruction method based on deep learning
CN114842287B (en) * 2022-03-25 2022-12-06 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer
TWI787141B (en) * 2022-06-21 2022-12-11 鴻海精密工業股份有限公司 Method and equipment for training depth estimation model, and method and equipment for depth estimation
CN115829005B (en) * 2022-12-09 2023-06-27 之江实验室 Automatic defect diagnosis and repair method and device for convolutional neural classification network
CN116105632B (en) * 2023-04-12 2023-06-23 四川大学 Self-supervision phase unwrapping method and device for structured light three-dimensional imaging
CN117333758B (en) * 2023-12-01 2024-02-13 博创联动科技股份有限公司 Land route identification system based on big data analysis

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018000752A1 (en) * 2016-06-27 2018-01-04 浙江工商大学 Monocular image depth estimation method based on multi-scale cnn and continuous crf
CN109472819A (en) * 2018-09-06 2019-03-15 杭州电子科技大学 A kind of binocular parallax estimation method based on cascade geometry context neural network
CN109377530A (en) * 2018-11-30 2019-02-22 天津大学 A kind of binocular depth estimation method based on deep neural network

Also Published As

Publication number Publication date
CN113762358A (en) 2021-12-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant