CN111292425B - View synthesis method based on monocular and binocular mixed data set - Google Patents

View synthesis method based on monocular and binocular mixed data set

Info

Publication number
CN111292425B
Authority
CN
China
Prior art keywords
binocular
image
disparity
monocular
pseudo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010072802.9A
Other languages
Chinese (zh)
Other versions
CN111292425A (en)
Inventor
肖春霞
李文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202010072802.9A priority Critical patent/CN111292425B/en
Publication of CN111292425A publication Critical patent/CN111292425A/en
Application granted granted Critical
Publication of CN111292425B publication Critical patent/CN111292425B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/97 Determining parameters from multiple pictures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20228 Disparity calculation for image-based rendering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a view synthesis method based on a mixed monocular and binocular data set. A disparity estimation network is first pre-trained on a small-scale set of rectified left-right binocular image pairs; the pre-trained network is then used to generate a right image and a disparity label for each image in a large-scale monocular image set, forming a large-scale set of pseudo-binocular image pairs; a second disparity estimation network is trained on the generated pseudo-binocular pairs; and view synthesis is finally performed with disparity-map-based rendering. The invention has the following advantages: a disparity estimation network is pre-trained from only a small-scale set of left-right binocular images; a large-scale pseudo-binocular data set with disparity labels is generated from a large-scale monocular picture set; a disparity estimation network is then trained on this self-generated "pseudo data set". Because the proposed method trains the disparity estimation network with a small-scale set of left-right binocular image pairs plus a large-scale monocular image set, the data set is much easier to construct, and factors such as illumination inconsistency, camera motion and object motion do not need to be considered for the monocular image set.

Description

View synthesis method based on monocular and binocular mixed data set
Technical Field
The invention belongs to the field of computer vision and image rendering, and relates to a view synthesis method based on deep learning, in particular to a view synthesis method based on a small-scale binocular training set.
Background
View synthesis techniques are required in many everyday applications, such as virtual image rendering in virtual reality, 3D display, and 2D-to-3D video conversion. Existing view synthesis methods are mainly based on deep learning: a convolutional neural network is used as the image processing model to extract image features and estimate the depth information of the scene, and depth-map-based rendering is then used to generate an image from a new viewpoint. However, most existing deep-learning-based methods rely on binocular or multi-view data sets, and the required data sets are large. Although some large-scale binocular image data sets and monocular video data sets are available for training, the scenes they contain are relatively simple and homogeneous, which limits the generalization of the trained models. On the one hand, constructing a binocular or multi-view data set that covers diverse scenes consumes a large amount of time, labor and equipment; by comparison, a monocular picture data set is much easier to construct, since diverse single pictures only need to be collected from the internet. On the other hand, monocular video data sets suffer from camera motion, moving objects in the scene and similar conditions that make model training harder, whereas training with a monocular picture data set avoids these problems.
Disclosure of Invention
The invention aims to overcome the defects of existing methods and provides a view synthesis method based on a mixed data set consisting of a small-scale set of left-right binocular picture pairs and a large-scale monocular picture set.
The technical problem of the invention is mainly solved by the following technical scheme, and the view synthesis method based on the monocular and binocular mixed data set comprises the following steps:
step 1, constructing a mixed data set containing a small-scale set of left-right binocular image pairs and a large-scale monocular image set;
step 2, pre-training a monocular disparity estimation network with the small-scale left-right binocular image pairs;
step 3, using the model pre-trained in step 2, treating every picture in the monocular image set of the mixed data set as a "left picture" and estimating a "pseudo-disparity map" for each picture;
step 4, generating a corresponding "pseudo right picture" from each monocular image and its estimated "pseudo-disparity map" by disparity-map-based rendering;
step 5, forming a "pseudo-binocular" data set with disparity labels from the monocular image set and the "pseudo-disparity maps" and "pseudo right pictures" generated in steps 3 and 4;
step 6, retraining a binocular disparity estimation network with the "pseudo-binocular" data set generated in step 5;
step 7, using the binocular disparity estimation network trained in step 6 to estimate disparity maps for an input pair of left and right binocular test pictures, and rendering based on the disparity maps to generate new view synthesis results along the camera baseline between the left and right pictures.
Further, the data set constructed in step 1 is a mixed data set of a small-scale set of left-right binocular image pairs and a large-scale monocular image set, where the small-scale left-right binocular image pairs are stereo-rectified image pairs on the order of 10^2 pairs, and the large-scale monocular image set is an image set collected from the internet containing various indoor and outdoor scenes, on the order of 10^4 images.
Further, when the monocular disparity estimation network is pre-trained with the small-scale left-right binocular images in step 2, the left image is used as the network input and the right image is used for supervision; the network outputs left and right disparity maps corresponding to the left and right images, and disparity-map-based rendering is used to generate a right image and a left image respectively. The process can be expressed as:

(D_l, D_r) = N_g(I_l)

Î_r(i, j) = I_l(i, j + D_r(i, j))

Î_l(i, j) = I_r(i, j - D_l(i, j))

where I_l denotes the left image of a small-scale left-right binocular image pair, N_g denotes the disparity estimation network, (D_l, D_r) denote the left and right disparity maps output by the network, Î_r denotes the right image generated by rendering from the left image and the predicted right disparity map, Î_l denotes the left image generated by rendering from the right image and the predicted left disparity map, and (i, j) denotes the pixel coordinates of the picture.
Further, when the monocular disparity estimation network is pre-trained with the small-scale left-right binocular images in step 2, the real left and right images are used as bidirectional supervision. Taking the supervision on the left image as an example, the specific implementation is as follows:

Step 2.1, the generated left image Î_l is compared with the real left image I_l, and a weighted combination of the SSIM loss and the L1 loss is computed:

L_ap^l = (1/N) Σ_{i,j} [ α · (1 - SSIM(I_l(i,j), Î_l(i,j))) / 2 + (1 - α) · |I_l(i,j) - Î_l(i,j)| ]

where N denotes the total number of pixels of the left image and α is a weight balancing the SSIM loss and the L1 loss.

Step 2.2, the gradient of the generated left disparity map is constrained with an edge-aware gradient smoothing term, so that the generated disparity map is sufficiently smooth:

L_ds^l = (1/N) Σ_{i,j} [ |∂_x D_l(i,j)| · e^{-|∂_x I_l(i,j)|} + |∂_y D_l(i,j)| · e^{-|∂_y I_l(i,j)|} ]

where ∂ denotes the partial derivative, e is the base of the natural logarithm, and |·| denotes the absolute value.

Step 2.3, a consistency constraint is imposed on the generated left and right disparity maps, so that they satisfy the geometric relation between the left and right views:

L_lr^l = (1/N) Σ_{i,j} |D_l(i,j) - D_r(i, j - D_l(i,j))|

Step 2.4, exchanging the roles of the left and right images in the loss functions of steps 2.1, 2.2 and 2.3 gives the corresponding losses for the right image, L_ap^r, L_ds^r and L_lr^r. The overall loss function is:

L_total = α_ap (L_ap^l + L_ap^r) + α_ds (L_ds^l + L_ds^r) + α_lr (L_lr^l + L_lr^r)

where α_ap, α_ds and α_lr are weights controlling the ratio of the three losses. The network N_g is supervised and its gradients are updated by minimizing L_total.
Further, each picture in the monocular image set in step 3 is regarded as a "left image", and the network N_g pre-trained in step 2 is used to estimate a disparity map for each picture. The process can be expressed as:

D_m = N_g(I_m)

where I_m denotes a picture in the monocular data set and D_m denotes the "pseudo-disparity map" predicted by feeding the monocular data set into the network N_g pre-trained in step 2.
Further, in step 4 a "pseudo right picture" is generated from the monocular image set and the "pseudo-disparity map" generated in step 3 using disparity-map-based rendering. The process is defined as:

Î_r(i, j) = I_m(i, j + D_m(i, j))

where Î_r denotes the generated "pseudo right picture" and (i, j) denotes the pixel coordinates of the picture.
further, step 5 uses the monocular image set and the "pseudo-disparity map" and the "pseudo-right map" generated in steps 3 and 4 to form a "pseudo-binocular" data set with disparity labels:
Figure BDA0002377717610000039
the data set is used as a data set for network training in the subsequent step, and the subsequent training of the parallax estimation network is converted into a supervised training process.
Further, in step 6 a binocular disparity estimation network is retrained on the "pseudo-binocular" data set generated in step 5, with the "pseudo-disparity maps" in the "pseudo-binocular" data set used as the supervision signal. The specific implementation is as follows:

Step 6.1, the left and right images in the "pseudo-binocular" data set are input into the network, and a disparity map is estimated:

D = N_a(I_m, Î_r)

where N_a denotes the newly trained binocular disparity estimation network and D denotes the disparity map predicted by the network for the left and right views.

Step 6.2, the predicted disparity map D is compared with the "pseudo-disparity map" D_m in the "pseudo-binocular" data set, and the L1 loss is computed:

L_1 = (1/N) Σ_{i,j} |D(i,j) - D_m(i,j)|

The network N_a is supervised and its gradients are updated by minimizing L_1.
Further, in step 7 the binocular disparity estimation network trained in step 6 takes a real-world left-right binocular image pair as input, estimates its disparity map, and uses disparity-map-based rendering to generate a series of intermediate view results along the camera baseline between the left and right images. The process is implemented as follows:

Step 7.1, the binocular disparity estimation network trained in step 6 estimates the disparity map of an input real-world left-right binocular image pair:

D = N_a(I_l, I_r)

where (I_l, I_r) denotes the real-world left-right image pair, N_a denotes the trained binocular disparity estimation network, and D denotes the disparity map estimated for (I_l, I_r).

Step 7.2, the disparity map estimated in step 7.1 is used to compute the disparity map of the view at position α on the camera baseline between the left and right images:

D_α(i, j) = α · D(i, j)

where α ∈ [0, 1] denotes the relative position of the target view with respect to the left image on the camera baseline of the left and right images; for example, α = 0.5 means the distance from that position to the left image is 0.5 times the camera distance between the left and right images.

Step 7.3, the disparity map at position α generated in step 7.2 and disparity-map-based rendering are used to generate the image at position α:

I_α(i, j) = I_l(i, j + D_α(i, j))

where I_l denotes the left image of the real-world left-right image pair and (i, j) denotes the image pixel coordinates.
Compared with the prior art, the invention has the following advantages:
1. The invention trains a disparity estimation network from a small-scale binocular data set (on the order of 10^2 image pairs);
2. The invention generates a large-scale "pseudo-binocular data set" with disparity labels from a large-scale monocular data set;
3. The invention trains a disparity estimation network on the self-generated "pseudo data set";
4. The invention proposes training a disparity estimation network with a small-scale binocular set and a large-scale monocular data set; such a data set is much easier to construct, and factors such as illumination inconsistency, camera motion and object motion do not arise in the monocular picture set.
Drawings
Fig. 1 is a general flow chart of the present invention.
Detailed Description
The technical solution of the present invention is further explained with reference to the drawings and the embodiments.
As shown in Fig. 1, a view synthesis method based on a small-scale left-right binocular training set and a large-scale monocular training set includes the following steps:
Step 1, constructing a mixed data set containing a small-scale set of left-right binocular image pairs and a large-scale monocular image set. The specific implementation is as follows:
A small-scale set of left-right binocular image pairs, on the order of 10^2 pairs, is constructed and stereo-rectified, and an image set containing various indoor and outdoor scenes is collected from the internet to build a large-scale monocular image set on the order of 10^4 images.
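As a concrete illustration of how such a mixed data set might be organized for training, a minimal PyTorch Dataset sketch is given below. The directory layout (paired left/ and right/ folders plus a folder of single pictures), the class name MixedStereoMonoDataset and the transform handling are assumptions introduced here, not part of the patent.

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class MixedStereoMonoDataset(Dataset):
    """Sketch of the mixed data set: ~10^2 rectified stereo pairs plus ~10^4 single pictures.
    Assumes matching file names in <stereo_dir>/left and <stereo_dir>/right."""

    def __init__(self, stereo_dir=None, mono_dir=None, transform=None):
        self.stereo_pairs = []
        if stereo_dir is not None:
            names = sorted(os.listdir(os.path.join(stereo_dir, "left")))
            self.stereo_pairs = [(os.path.join(stereo_dir, "left", n),
                                  os.path.join(stereo_dir, "right", n)) for n in names]
        self.mono_images = []
        if mono_dir is not None:
            self.mono_images = [os.path.join(mono_dir, n) for n in sorted(os.listdir(mono_dir))]
        self.transform = transform

    def __len__(self):
        return len(self.stereo_pairs) + len(self.mono_images)

    def __getitem__(self, idx):
        if idx < len(self.stereo_pairs):
            l_path, r_path = self.stereo_pairs[idx]
            left = Image.open(l_path).convert("RGB")
            right = Image.open(r_path).convert("RGB")
            if self.transform:
                left, right = self.transform(left), self.transform(right)
            return {"left": left, "right": right, "is_stereo": True}
        # Monocular pictures are later treated as "left" images (step 3).
        img = Image.open(self.mono_images[idx - len(self.stereo_pairs)]).convert("RGB")
        if self.transform:
            img = self.transform(img)
        return {"left": img, "is_stereo": False}
```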
Step 2, pre-training a monocular disparity estimation network with the small-scale left-right binocular images, where the network uses the existing DispNet network structure. The specific implementation is as follows:
Step 2.1, the left image is used as the network input; the network outputs left and right disparity maps corresponding to the left and right images, and disparity-map-based rendering is used to generate a right image and a left image respectively. The process can be expressed as:

(D_l, D_r) = N_g(I_l)

Î_r(i, j) = I_l(i, j + D_r(i, j))

Î_l(i, j) = I_r(i, j - D_l(i, j))

where I_l denotes the left image of a small-scale left-right binocular image pair, N_g denotes the disparity estimation network, (D_l, D_r) denote the left and right disparity maps output by the network, Î_r denotes the right image generated by rendering from the left image and the predicted right disparity map, Î_l denotes the left image generated by rendering from the right image and the predicted left disparity map, and (i, j) denotes the pixel coordinates of the picture.
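The "rendering based on the disparity maps" used above amounts to horizontally resampling one image according to a disparity map. Below is a minimal sketch of such a warp with bilinear sampling, assuming PyTorch tensors of shape (B, 3, H, W) for images and (B, 1, H, W) for disparities in pixels, and the sign convention of the equations above; it illustrates the operation rather than reproducing the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def warp_by_disparity(src, disp, direction=+1.0):
    """Resample `src` horizontally by `disp` (in pixels).

    With direction=+1, pixel (i, j) of the output samples column j + disp(i, j) of `src`,
    matching Î_r(i, j) = I_l(i, j + D_r(i, j)); use direction=-1 for the symmetric
    left-image reconstruction Î_l(i, j) = I_r(i, j - D_l(i, j)).
    """
    b, _, h, w = src.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=src.dtype, device=src.device),
        torch.arange(w, dtype=src.dtype, device=src.device),
        indexing="ij",
    )
    xs = xs.unsqueeze(0).expand(b, -1, -1) + direction * disp.squeeze(1)
    ys = ys.unsqueeze(0).expand(b, -1, -1)
    # Normalize to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * xs / (w - 1) - 1.0
    grid_y = 2.0 * ys / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Example: render a "right" image from a left image and a right-view disparity map.
# left = torch.rand(1, 3, 192, 640); disp_r = torch.rand(1, 1, 192, 640) * 30
# right_hat = warp_by_disparity(left, disp_r, direction=+1.0)
```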
Step 2.2, when the monocular disparity estimation network is pre-trained with the small-scale left-right binocular images, the real left and right images are used as bidirectional supervision. Taking the supervision on the left image as an example, the specific implementation is as follows:

Step 2.2.1, the generated left image Î_l is compared with the real left image I_l, and a weighted combination of the SSIM loss and the L1 loss is computed:

L_ap^l = (1/N) Σ_{i,j} [ α · (1 - SSIM(I_l(i,j), Î_l(i,j))) / 2 + (1 - α) · |I_l(i,j) - Î_l(i,j)| ]

where N denotes the total number of pixels of the left image and α is a weight balancing the SSIM loss and the L1 loss; here α = 0.85.
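A compact sketch of the weighted SSIM + L1 appearance loss of step 2.2.1 follows. The 3x3 average-pooled SSIM and its constants c1 and c2 are common simplifications assumed here; only the weight α = 0.85 is taken from the text.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM map computed with 3x3 average pooling, clamped to [0, 1]."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp(num / den, 0, 1)

def appearance_loss(img, img_hat, alpha=0.85):
    """Weighted combination of the SSIM and L1 terms, averaged over all pixels."""
    ssim_term = (1.0 - ssim(img, img_hat)) / 2.0
    l1_term = torch.abs(img - img_hat)
    return (alpha * ssim_term + (1.0 - alpha) * l1_term).mean()
```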
Step 2.2.2, the gradient of the generated left disparity map is constrained with an edge-aware gradient smoothing term, so that the generated disparity map is sufficiently smooth:

L_ds^l = (1/N) Σ_{i,j} [ |∂_x D_l(i,j)| · e^{-|∂_x I_l(i,j)|} + |∂_y D_l(i,j)| · e^{-|∂_y I_l(i,j)|} ]

where ∂ denotes the partial derivative, e is the base of the natural logarithm, and |·| denotes the absolute value.

Step 2.2.3, a consistency constraint is imposed on the generated left and right disparity maps, so that they satisfy the geometric relation between the left and right views:

L_lr^l = (1/N) Σ_{i,j} |D_l(i,j) - D_r(i, j - D_l(i,j))|

Step 2.2.4, exchanging the roles of the left and right images in the loss functions of steps 2.2.1, 2.2.2 and 2.2.3 gives the corresponding losses for the right image, L_ap^r, L_ds^r and L_lr^r. The overall loss function is:

L_total = α_ap (L_ap^l + L_ap^r) + α_ds (L_ds^l + L_ds^r) + α_lr (L_lr^l + L_lr^r)

where α_ap, α_ds and α_lr are weights controlling the ratio of the three losses; here α_ap = 1, α_ds = 0.1 and α_lr = 1. The network N_g is supervised and its gradients are updated by minimizing L_total.
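The remaining loss terms of steps 2.2.2 to 2.2.4 can be sketched in the same style. The sketch below reuses warp_by_disparity and appearance_loss from the earlier sketches; the finite differences stand in for the partial derivatives in the formulas, and the sampling directions in the consistency term follow the sign convention assumed above.

```python
import torch

def smoothness_loss(disp, img):
    """Edge-aware smoothness: disparity gradients weighted by exp(-|image gradient|)."""
    dx_d = torch.abs(disp[:, :, :, 1:] - disp[:, :, :, :-1])
    dy_d = torch.abs(disp[:, :, 1:, :] - disp[:, :, :-1, :])
    dx_i = torch.mean(torch.abs(img[:, :, :, 1:] - img[:, :, :, :-1]), 1, keepdim=True)
    dy_i = torch.mean(torch.abs(img[:, :, 1:, :] - img[:, :, :-1, :]), 1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def lr_consistency_loss(disp_a, disp_b, direction):
    """|D_a(i, j) - D_b(i, j + direction * D_a(i, j))| averaged over all pixels."""
    disp_b_warped = warp_by_disparity(disp_b, disp_a, direction=direction)
    return torch.abs(disp_a - disp_b_warped).mean()

def total_loss(I_l, I_r, I_l_hat, I_r_hat, D_l, D_r, a_ap=1.0, a_ds=0.1, a_lr=1.0):
    """Overall objective: weighted sum of the left and right versions of the three terms."""
    ap = appearance_loss(I_l, I_l_hat) + appearance_loss(I_r, I_r_hat)
    ds = smoothness_loss(D_l, I_l) + smoothness_loss(D_r, I_r)
    lr = (lr_consistency_loss(D_l, D_r, direction=-1.0)
          + lr_consistency_loss(D_r, D_l, direction=+1.0))
    return a_ap * ap + a_ds * ds + a_lr * lr
```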
Step 3, each picture in the monocular image set of the mixed data set is regarded as a "left image", and the network N_g pre-trained in step 2 is used to estimate a disparity map for each picture. The process can be expressed as:

D_m = N_g(I_m)

where I_m denotes a picture in the monocular data set and D_m denotes the "pseudo-disparity map" predicted by feeding the monocular data set into the network N_g pre-trained in step 2.
Step 4, a "pseudo right picture" is generated from the monocular image set and the "pseudo-disparity map" generated in step 3 using disparity-map-based rendering. The process is defined as:

Î_r(i, j) = I_m(i, j + D_m(i, j))

where Î_r denotes the generated "pseudo right picture" and (i, j) denotes the pixel coordinates of the picture.
Step 5, the monocular image set together with the "pseudo-disparity maps" and "pseudo right pictures" generated in steps 3 and 4 forms a "pseudo-binocular" data set with disparity labels. The data set is specifically composed as:

S = { (I_m, Î_r, D_m) }

This data set is used for network training in the subsequent steps, which turns the subsequent training of the disparity estimation network into a supervised training process.
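Steps 3 to 5 reduce to a short labelling loop: each monocular picture is treated as a left image, a pseudo-disparity map is predicted with the pre-trained network, a pseudo right picture is rendered from it, and the triple is stored. The sketch below reuses warp_by_disparity from the step 2.1 sketch and assumes, for simplicity, that the pre-trained network returns a single disparity map per image and that a data loader yields batches of monocular images; writing the results to disk and other engineering details are omitted.

```python
import torch

@torch.no_grad()
def build_pseudo_binocular_dataset(net_g, mono_loader, device="cuda"):
    """Steps 3-5: produce (left image, pseudo right image, pseudo-disparity) triples."""
    net_g.to(device).eval()
    pseudo_dataset = []
    for img in mono_loader:                                   # img: (B, 3, H, W), treated as left images
        img = img.to(device)
        disp = net_g(img)                                     # step 3: pseudo-disparity map, (B, 1, H, W)
        right = warp_by_disparity(img, disp, direction=+1.0)  # step 4: pseudo right picture
        for i in range(img.shape[0]):                         # step 5: collect the labelled triples
            pseudo_dataset.append((img[i].cpu(), right[i].cpu(), disp[i].cpu()))
    return pseudo_dataset
```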
Step 6, a binocular disparity estimation network is retrained on the "pseudo-binocular" data set generated in step 5, with the "pseudo-disparity maps" in the "pseudo-binocular" data set used as the supervision signal. The specific implementation is as follows:

Step 6.1, the left and right images in the "pseudo-binocular" data set are input into the network, and a disparity map is estimated:

D = N_a(I_m, Î_r)

where N_a denotes the newly trained binocular disparity estimation network and D denotes the disparity map predicted by the network for the left and right views.

Step 6.2, the predicted disparity map D is compared with the "pseudo-disparity map" D_m in the "pseudo-binocular" data set, and the L1 loss is computed:

L_1 = (1/N) Σ_{i,j} |D(i,j) - D_m(i,j)|

The network N_a is supervised and its gradients are updated by minimizing L_1.
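Step 6 is ordinary supervised training with an L1 loss against the pseudo-disparity labels. A minimal training-loop sketch is shown below; the optimizer, learning rate, epoch count and the assumption that the binocular network N_a takes the left and right images as two arguments are illustrative choices, not details specified by the patent.

```python
import torch
import torch.nn.functional as F

def train_binocular_net(net_a, pseudo_loader, epochs=10, lr=1e-4, device="cuda"):
    """Step 6: supervise N_a with the pseudo-disparity maps via an L1 loss."""
    net_a.to(device).train()
    optimizer = torch.optim.Adam(net_a.parameters(), lr=lr)
    for epoch in range(epochs):
        for left, right, disp_label in pseudo_loader:
            left = left.to(device)
            right = right.to(device)
            disp_label = disp_label.to(device)
            disp_pred = net_a(left, right)              # step 6.1: predict the disparity map
            loss = F.l1_loss(disp_pred, disp_label)     # step 6.2: L1 loss against the pseudo label
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return net_a
```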
Step 7, the binocular disparity estimation network trained in step 6 takes a real-world left-right binocular image pair as input, estimates its disparity map, and uses disparity-map-based rendering to generate a series of intermediate view results along the camera baseline between the left and right images. The process is implemented as follows:

Step 7.1, the binocular disparity estimation network trained in step 6 estimates the disparity map of an input real-world left-right binocular image pair:

D = N_a(I_l, I_r)

where (I_l, I_r) denotes the real-world left-right image pair, N_a denotes the trained binocular disparity estimation network, and D denotes the disparity map estimated for (I_l, I_r).

Step 7.2, the disparity map estimated in step 7.1 is used to compute the disparity map of the view at position α on the camera baseline between the left and right images:

D_α(i, j) = α · D(i, j)

where α ∈ [0, 1] denotes the relative position of the target view with respect to the left image on the camera baseline of the left and right images; for example, α = 0.5 means the distance from that position to the left image is 0.5 times the camera distance between the left and right images.

Step 7.3, the disparity map at position α generated in step 7.2 and disparity-map-based rendering are used to generate the image at position α:

I_α(i, j) = I_l(i, j + D_α(i, j))

where I_l denotes the left image of the real-world left-right image pair and (i, j) denotes the image pixel coordinates.
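Step 7 scales the estimated disparity by α and renders each intermediate view with the same warping primitive. The sketch below reuses warp_by_disparity from the step 2.1 sketch; rendering the α view from the left image only and the chosen sampling direction follow the equations above and are otherwise assumptions.

```python
import torch

@torch.no_grad()
def synthesize_intermediate_views(net_a, img_l, img_r, alphas=(0.25, 0.5, 0.75)):
    """Step 7: estimate disparity for a real stereo pair and render views along the baseline."""
    net_a.eval()
    disp = net_a(img_l, img_r)                   # step 7.1: disparity of the input pair
    views = []
    for alpha in alphas:
        disp_alpha = alpha * disp                # step 7.2: disparity at position alpha
        view = warp_by_disparity(img_l, disp_alpha, direction=+1.0)  # step 7.3: render the view
        views.append(view)
    return views
```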
Compared with the prior art, the invention has the following advantages:
1. The invention trains a disparity estimation network from a small-scale binocular data set (on the order of 10^2 image pairs);
2. The invention generates a large-scale "pseudo-binocular data set" with disparity labels from a large-scale monocular data set;
3. The invention trains a disparity estimation network on the self-generated "pseudo data set";
4. The invention proposes training a disparity estimation network with a small-scale binocular set and a large-scale monocular data set; such a data set is much easier to construct, and factors such as illumination inconsistency, camera motion and object motion do not arise in the monocular picture set.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (9)

1. A view synthesis method based on a monocular and binocular mixed data set is characterized by comprising the following steps:
step 1, constructing a mixed data set containing a small-scale set of left-right binocular image pairs and a large-scale monocular image set;
step 2, pre-training a monocular disparity estimation network with the small-scale left-right binocular image pairs;
step 3, using the model pre-trained in step 2, treating every picture in the monocular image set of the mixed data set as a "left picture" and estimating a "pseudo-disparity map" for each picture;
step 4, generating a corresponding "pseudo right picture" from each monocular image and its estimated "pseudo-disparity map" by disparity-map-based rendering;
step 5, forming a "pseudo-binocular" data set with disparity labels from the monocular image set and the "pseudo-disparity maps" and "pseudo right pictures" generated in steps 3 and 4;
step 6, retraining a binocular disparity estimation network with the "pseudo-binocular" data set generated in step 5;
step 7, using the binocular disparity estimation network trained in step 6 to estimate disparity maps for an input pair of left and right binocular test pictures, and rendering based on the disparity maps to generate new view synthesis results along the camera baseline between the left and right pictures.
2. The view synthesis method based on a monocular and binocular mixed data set according to claim 1, wherein: the data set constructed in step 1 is a mixed data set of a small-scale set of left-right binocular image pairs and a large-scale monocular image set, where the small-scale left-right binocular image pairs are stereo-rectified image pairs on the order of 10^2 pairs, and the large-scale monocular image set is an image set collected from the internet containing various indoor and outdoor scenes, on the order of 10^4 images.
3. The view synthesis method based on a monocular and binocular mixed data set according to claim 1, wherein: in step 2, when the monocular disparity estimation network is pre-trained with the small-scale left-right binocular images, the left image is used as the network input; the network outputs left and right disparity maps corresponding to the left and right images, and disparity-map-based rendering is used to generate a right image and a left image respectively. The process is expressed as:

(D_l, D_r) = N_g(I_l)

Î_r(i, j) = I_l(i, j + D_r(i, j))

Î_l(i, j) = I_r(i, j - D_l(i, j))

where I_l denotes the left image of a small-scale left-right binocular image pair, N_g denotes the disparity estimation network, (D_l, D_r) denote the left and right disparity maps output by the network, Î_r denotes the right image generated by rendering from the left image and the predicted right disparity map, Î_l denotes the left image generated by rendering from the right image and the predicted left disparity map, and (i, j) denotes the pixel coordinates of the picture.
4. The view synthesis method based on a monocular and binocular mixed data set according to claim 3, wherein: in step 2, when the monocular disparity estimation network is pre-trained with the small-scale left-right binocular images, the real left and right images are used as bidirectional supervision. Taking the supervision on the left image as an example, the specific implementation is as follows:

step 2.1, the generated left image Î_l is compared with the real left image I_l, and a weighted combination of the SSIM loss and the L1 loss is computed:

L_ap^l = (1/N) Σ_{i,j} [ α · (1 - SSIM(I_l(i,j), Î_l(i,j))) / 2 + (1 - α) · |I_l(i,j) - Î_l(i,j)| ]

where N denotes the total number of pixels of the left image and α is a weight balancing the SSIM loss and the L1 loss;

step 2.2, the gradient of the generated left disparity map is constrained with an edge-aware gradient smoothing term, so that the generated disparity map is sufficiently smooth:

L_ds^l = (1/N) Σ_{i,j} [ |∂_x D_l(i,j)| · e^{-|∂_x I_l(i,j)|} + |∂_y D_l(i,j)| · e^{-|∂_y I_l(i,j)|} ]

where ∂ denotes the partial derivative, e is the base of the natural logarithm, and |·| denotes the absolute value;

step 2.3, a consistency constraint is imposed on the generated left and right disparity maps, so that they satisfy the geometric relation between the left and right views:

L_lr^l = (1/N) Σ_{i,j} |D_l(i,j) - D_r(i, j - D_l(i,j))|

step 2.4, exchanging the roles of the left and right images in the loss functions of steps 2.1, 2.2 and 2.3 gives the corresponding losses for the right image, L_ap^r, L_ds^r and L_lr^r; the overall loss function is:

L_total = α_ap (L_ap^l + L_ap^r) + α_ds (L_ds^l + L_ds^r) + α_lr (L_lr^l + L_lr^r)

where α_ap, α_ds and α_lr are weights controlling the ratio of the three losses; the network N_g is supervised and its gradients are updated by minimizing L_total.
5. The view synthesis method based on a monocular and binocular mixed data set according to claim 1, wherein: each picture in the monocular image set in step 3 is regarded as a "left image", and the network N_g pre-trained in step 2 is used to estimate a disparity map for each picture. The process is expressed as:

D_m = N_g(I_m)

where I_m denotes a picture in the monocular data set and D_m denotes the "pseudo-disparity map" predicted by feeding the monocular data set into the network N_g pre-trained in step 2.
6. The view synthesis method based on a monocular and binocular mixed data set according to claim 5, wherein: in step 4, a "pseudo right picture" is generated from the monocular image set and the "pseudo-disparity map" generated in step 3 using disparity-map-based rendering. The process is defined as:

Î_r(i, j) = I_m(i, j + D_m(i, j))

where Î_r denotes the generated "pseudo right picture" and (i, j) denotes the pixel coordinates of the picture.
7. The view synthesis method based on a monocular and binocular mixed data set according to claim 6, wherein: in step 5, the monocular image set together with the "pseudo-disparity maps" and "pseudo right pictures" generated in steps 3 and 4 forms a "pseudo-binocular" data set with disparity labels:

S = { (I_m, Î_r, D_m) }

This data set is used for network training in the subsequent steps, which turns the subsequent training of the disparity estimation network into a supervised training process.
8. The view synthesis method based on a monocular and binocular mixed data set according to claim 7, wherein: in step 6, a binocular disparity estimation network is retrained on the "pseudo-binocular" data set generated in step 5, with the "pseudo-disparity maps" in the "pseudo-binocular" data set used as the supervision signal; the specific implementation is as follows:

step 6.1, the left and right images in the "pseudo-binocular" data set are input into the network, and a disparity map is estimated:

D = N_a(I_m, Î_r)

where N_a denotes the newly trained binocular disparity estimation network and D denotes the disparity map predicted by the network for the left and right views;

step 6.2, the predicted disparity map D is compared with the "pseudo-disparity map" D_m in the "pseudo-binocular" data set, and the loss is computed:

L_1 = (1/N) Σ_{i,j} |D(i,j) - D_m(i,j)|

The network N_a is supervised and its gradients are updated by minimizing L_1.
9. The view synthesis method based on a monocular and binocular mixed data set according to claim 8, wherein: in step 7, the binocular disparity estimation network trained in step 6 takes a real-world left-right binocular image pair as input, estimates its disparity map, and uses disparity-map-based rendering to generate a series of intermediate view results along the camera baseline between the left and right images; the process is implemented as follows:

step 7.1, the binocular disparity estimation network trained in step 6 estimates the disparity map of an input real-world left-right binocular image pair:

D = N_a(I_l, I_r)

where (I_l, I_r) denotes the real-world left-right image pair, N_a denotes the trained binocular disparity estimation network, and D denotes the disparity map estimated for (I_l, I_r);

step 7.2, the disparity map estimated in step 7.1 is used to compute the disparity map of the view at position α on the camera baseline between the left and right images:

D_α(i, j) = α · D(i, j)

where α ∈ [0, 1] denotes the relative position of the target view with respect to the left image on the camera baseline of the left and right images;

step 7.3, the disparity map at position α generated in step 7.2 and disparity-map-based rendering are used to generate the image at position α:

I_α(i, j) = I_l(i, j + D_α(i, j))

where I_l denotes the left image of the real-world left-right image pair and (i, j) denotes the image pixel coordinates.
CN202010072802.9A 2020-01-21 2020-01-21 View synthesis method based on monocular and binocular mixed data set Active CN111292425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010072802.9A CN111292425B (en) 2020-01-21 2020-01-21 View synthesis method based on monocular and binocular mixed data set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010072802.9A CN111292425B (en) 2020-01-21 2020-01-21 View synthesis method based on monocular and binocular mixed data set

Publications (2)

Publication Number Publication Date
CN111292425A CN111292425A (en) 2020-06-16
CN111292425B true CN111292425B (en) 2022-02-01

Family

ID=71024323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010072802.9A Active CN111292425B (en) 2020-01-21 2020-01-21 View synthesis method based on monocular and binocular mixed data set

Country Status (1)

Country Link
CN (1) CN111292425B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436264B (en) * 2021-08-25 2021-11-19 深圳市大道智创科技有限公司 Pose calculation method and system based on monocular and monocular hybrid positioning
TWI798094B (en) * 2022-05-24 2023-04-01 鴻海精密工業股份有限公司 Method and equipment for training depth estimation model and depth estimation
CN115909446B (en) * 2022-11-14 2023-07-18 华南理工大学 Binocular face living body discriminating method, device and storage medium
CN117372494B (en) * 2023-08-07 2024-09-03 合肥工业大学 Power grid operator parallax estimation and positioning method based on single-binocular vision cooperation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102903096A (en) * 2012-07-04 2013-01-30 北京航空航天大学 Monocular video based object depth extraction method
CN109087346A (en) * 2018-09-21 2018-12-25 北京地平线机器人技术研发有限公司 Training method, training device and the electronic equipment of monocular depth model
CN110113595A (en) * 2019-05-08 2019-08-09 北京奇艺世纪科技有限公司 A kind of 2D video turns the method, apparatus and electronic equipment of 3D video
CN110443843A (en) * 2019-07-29 2019-11-12 东北大学 A kind of unsupervised monocular depth estimation method based on generation confrontation network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105931240B (en) * 2016-04-21 2018-10-19 西安交通大学 Three dimensional depth sensing device and method
CN106600583B (en) * 2016-12-07 2019-11-01 西安电子科技大学 Parallax picture capturing method based on end-to-end neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102903096A (en) * 2012-07-04 2013-01-30 北京航空航天大学 Monocular video based object depth extraction method
CN109087346A (en) * 2018-09-21 2018-12-25 北京地平线机器人技术研发有限公司 Training method, training device and the electronic equipment of monocular depth model
CN110113595A (en) * 2019-05-08 2019-08-09 北京奇艺世纪科技有限公司 A kind of 2D video turns the method, apparatus and electronic equipment of 3D video
CN110443843A (en) * 2019-07-29 2019-11-12 东北大学 A kind of unsupervised monocular depth estimation method based on generation confrontation network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Novel Monocular Disparity Estimation Network with Domain Transformation and Ambiguity Learning; Bello J et al.; 2019 IEEE International Conference on Image Processing (ICIP); 2019-09-25; pp. 474-478 *
Real-time monocular depth estimation based on LRSDR-Net; Zhang Zhetao et al.; Electronic Measurement Technology; 2019-10-31; Vol. 42, No. 19; pp. 164-169 *

Also Published As

Publication number Publication date
CN111292425A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
CN111292425B (en) View synthesis method based on monocular and binocular mixed data set
US11210803B2 (en) Method for 3D scene dense reconstruction based on monocular visual slam
CN108986136B (en) Binocular scene flow determination method and system based on semantic segmentation
CN110335343B (en) Human body three-dimensional reconstruction method and device based on RGBD single-view-angle image
CN108388882B (en) Gesture recognition method based on global-local RGB-D multi-mode
CN113393522B (en) 6D pose estimation method based on monocular RGB camera regression depth information
CN108876814B (en) Method for generating attitude flow image
CN110782490A (en) Video depth map estimation method and device with space-time consistency
CN108932725B (en) Scene flow estimation method based on convolutional neural network
CN103971366B (en) A kind of solid matching method being polymerize based on double weights
CN112308918B (en) Non-supervision monocular vision odometer method based on pose decoupling estimation
CN109758756B (en) Gymnastics video analysis method and system based on 3D camera
CN110910437B (en) Depth prediction method for complex indoor scene
CN113077505B (en) Monocular depth estimation network optimization method based on contrast learning
CN108510520B (en) A kind of image processing method, device and AR equipment
CN114429555A (en) Image density matching method, system, equipment and storage medium from coarse to fine
CN113284173A (en) End-to-end scene flow and pose joint learning method based on pseudo laser radar
CN111860651A (en) Monocular vision-based semi-dense map construction method for mobile robot
Gao et al. Joint optimization of depth and ego-motion for intelligent autonomous vehicles
CN114693720A (en) Design method of monocular vision odometer based on unsupervised deep learning
CN111311664A (en) Joint unsupervised estimation method and system for depth, pose and scene stream
CN112686952A (en) Image optical flow computing system, method and application
CN117197388A (en) Live-action three-dimensional virtual reality scene construction method and system based on generation of antagonistic neural network and oblique photography
CN113034681B (en) Three-dimensional reconstruction method and device for spatial plane relation constraint
CN117711066A (en) Three-dimensional human body posture estimation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant