CN116723305B - Virtual viewpoint quality enhancement method based on a generative adversarial network - Google Patents
Virtual viewpoint quality enhancement method based on a generative adversarial network
- Publication number
- CN116723305B (application CN202310445621.XA)
- Authority
- CN
- China
- Prior art keywords
- network
- quality
- virtual viewpoint
- virtual
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/10—Processing, recording or transmission of stereoscopic or multi-view image signals
- H04N13/106—Processing image signals
- H04N13/161—Encoding, multiplexing or demultiplexing different image signal components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a virtual viewpoint quality enhancement method based on a generative adversarial network, belonging to the technical field of virtual viewpoint synthesis for three-dimensional video. It solves the technical problem of virtual viewpoint distortion caused by three-dimensional video encoding/decoding and viewpoint synthesis. The technical scheme is as follows: first, the correlation between network and video characteristics and virtual viewpoint quality is analyzed; then, a virtual viewpoint quality enhancement flow framework is designed; finally, for the quality enhancement model part of the framework, a virtual viewpoint quality enhancement model based on a generative adversarial network is constructed. The beneficial effects of the invention are as follows: by designing a quality enhancement network model oriented to the virtual viewpoint, the quality of the synthesized virtual viewpoint is improved and better subjective visual quality is obtained.
Description
Technical Field
The invention relates to the technical field of virtual viewpoint synthesis for three-dimensional video, in particular to a virtual viewpoint quality enhancement method based on a generative adversarial network for virtual viewpoints at the three-dimensional video decoding end.
Background
In recent years, with the vigorous development of multimedia information technology and the continued expansion of the video field, television technology has been updated continuously. On the one hand, televisions have developed from standard definition to high definition and even full high definition, with a growing number of supported pixels. On the other hand, televisions have developed from two-dimensional planar display to three-dimensional stereoscopic and even free-viewpoint television, with a growing number of supported viewpoints. From standard definition to high definition, and from planar to stereoscopic, video technology has undergone several innovations and is moving toward the ultra-high-definition era. To meet these developments, the Three-Dimensional High Efficiency Video Coding (3D-HEVC) standard has emerged.
In the 3D-HEVC video coding standard, the texture map and the depth map are encoded jointly in sequence; at the decoder, virtual view synthesis is realized with the Depth Image Based Rendering (DIBR) technique. In measuring coding distortion, the conventional rate-distortion optimization method computes coding distortion as the sum of absolute errors or the sum of squared differences between the current coding block in the current frame and the reference block in the reference frame. However, since the depth map is not viewed directly by human eyes but is used only for synthesizing virtual views, the coding quality of the depth map is closely related to the degree of distortion of the virtual view. To ensure that the synthesized virtual view exhibits no obvious distortion, virtual view distortion must be taken into account when measuring the coding distortion of the depth map.
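The DIBR warping step described above can be sketched for a rectified horizontal camera setup. This is a simplified illustration, not the 3D-HEVC renderer: the 8-bit inverse-depth decoding and the disparity formula are standard textbook forms, and all function and parameter names are hypothetical.

```python
import numpy as np

def dibr_warp_row(texture_row, depth_row, focal, baseline, z_near, z_far):
    """Warp one image row from the reference view to a virtual view.

    depth_row holds 8-bit inverse-depth codes (0 = far, 255 = near),
    a convention common in 3D-HEVC test material; the rectified-camera
    disparity model below is illustrative, not the patent's renderer.
    """
    width = texture_row.shape[0]
    warped = np.zeros_like(texture_row)
    filled = np.zeros(width, dtype=bool)
    for x in range(width):
        # Decode the 8-bit code into metric depth Z.
        z = 1.0 / (depth_row[x] / 255.0 * (1.0 / z_near - 1.0 / z_far)
                   + 1.0 / z_far)
        disparity = int(round(focal * baseline / z))
        xv = x - disparity  # target column in the virtual view
        if 0 <= xv < width:
            warped[xv] = texture_row[x]
            filled[xv] = True
    # Columns with filled == False are disocclusion holes: background
    # that was hidden in the reference view.
    return warped, filled
```

The columns left unfilled correspond exactly to the de-occluded background regions discussed below as one cause of virtual viewpoint distortion.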
For virtual view distortion calculation, the current 3D-HEVC video coding standard employs the Synthesized View Distortion Change (SVDC) model. The SVDC model alleviates, to a certain extent, the problem that depth map distortion cannot be accurately mapped to virtual viewpoint distortion because of occlusion and disocclusion phenomena. In real application scenarios, however, distortion is always inevitably introduced. Since the DIBR technique uses depth map information to map pixels in the original view to pixels in the virtual view, this process depends on both the original texture map and the original depth map. Therefore, texture map distortion, depth map distortion, and distortion generated during virtual view synthesis can all cause distortion in the synthesized virtual view. Depending on the starting point of the optimization, quality enhancement of three-dimensional video virtual viewpoints can generally be achieved in four ways: texture map quality enhancement, depth map quality enhancement, view synthesis optimization, and video post-processing.
Existing learning-based methods effectively enhance the quality of images or video after compression coding. However, such methods are mainly used for artifact removal after image or H.265/HEVC video compression coding and cannot be directly used for enhancing the quality of virtual views in 3D-HEVC.
In terms of virtual viewpoint quality enhancement, Zhu et al. treat the quality enhancement of a virtual viewpoint as an image reconstruction task that considers both geometric distortion and compression distortion, and propose a CNN-based virtual viewpoint quality enhancement method. Pan et al. propose a method based on a dual-stream attention network, which learns global context information and extracts local texture information, removing distortion from the synthesized virtual viewpoint more comprehensively than the method of Zhu et al. However, research in this direction is still limited, and how to further mine the deep features of the synthesized virtual viewpoint is a key problem for improving the subjective visual quality experience of three-dimensional video.
Disclosure of Invention
The invention aims to provide a virtual viewpoint quality enhancement method based on a generative adversarial network, which solves the problem of virtual viewpoint distortion caused by three-dimensional video encoding/decoding and viewpoint synthesis, effectively improves the quality of the synthesized virtual viewpoint, raises PSNR by 1.127 dB on average, and achieves good subjective visual quality.
In order to achieve the above aim, the invention adopts the following technical scheme. A virtual viewpoint quality enhancement method based on a generative adversarial network comprises the following steps:
1.1, analyzing the correlation between network and video characteristics and virtual viewpoint quality;
1.2, designing a virtual viewpoint quality enhancement flow framework;
1.3, constructing a virtual viewpoint quality enhancement model based on a generative adversarial network.
As a further optimization of the virtual viewpoint quality enhancement method based on a generative adversarial network, step 1.1 specifically comprises the following steps:
2.1, analyzing the causes of virtual viewpoint distortion, mainly including compression coding, disocclusion of background regions, non-overlapping viewpoint regions and inaccurate depth information;
2.2, establishing a virtual viewpoint distortion evaluation criterion C, defined as C = f(p_video(d_texture, d_depth, d_synthesis), p_network(v, b, d)), where p_video(·) represents the video characteristic parameters, d_texture the texture map coding distortion, d_depth the depth map coding distortion, and d_synthesis the distortion of the virtual viewpoint synthesis process; p_network(·) represents the network characteristic parameters, v the network data transmission rate, b the network channel bandwidth, and d the total network data delay.
As a further optimization of the virtual viewpoint quality enhancement method based on a generative adversarial network, step 1.2 specifically comprises the following steps:
3.1, obtaining a low-quality virtual viewpoint to be enhanced by encoding and decoding an original texture map and a depth map and performing viewpoint synthesis;
3.2, preprocessing the low-quality virtual view and the high-quality virtual view data set obtained in the step 3.1 based on a virtual view distortion evaluation criterion C;
3.3, constructing a generating network, judging the network, defining a loss function, and constructing a virtual viewpoint quality enhancement model;
3.4, reconstructing the low-quality virtual viewpoint into a high-quality virtual viewpoint by using the trained virtual viewpoint quality enhancement model.
As a further optimization of the virtual viewpoint quality enhancement method based on a generative adversarial network, step 1.3 specifically comprises the following steps:
4.1, combining 1 generation network module, 1 discrimination network module and 1 loss feedback module into a virtual viewpoint quality enhancement model;
4.2, combining 1 convolutional layer of 64-channel 3×3 convolution kernels and 16 residual units into the generation network module;
4.3, combining 6 groups of convolutional layers, each comprising a 3×3 convolution kernel, a BN layer and a Leaky ReLU activation function, into the discrimination network module;
4.4, composing the mean square error L_MSE measuring pixel-level loss, the loss L_PSNR measuring the objective difference in image quality, the perceptual loss L_P measuring image style (color, texture, contrast, etc.), and the adversarial loss L_A measuring the cross entropy between the discrimination result and the real image into the loss function L_G of the generation network, with the formula L_G = λ_MSE·L_MSE + λ_PSNR·L_PSNR + λ_P·L_P + λ_A·L_A, where λ_MSE, λ_PSNR, λ_P and λ_A are parameters set during training of the generation network model; L_MSE = (1/(W·H·C))·Σ(I_o - I_y)², where W, H and C are respectively the width, height and number of channels of the image; L_PSNR is computed from the mean square error loss L_MSE; L_P = (1/(W_m,n·H_m,n))·Σ(φ_m,n(I_o) - φ_m,n(I_y))², where φ_m,n denotes the feature distribution of the n-th convolutional layer before the m-th pooling layer in the VGG19 network, φ_m,n(I_o) is the feature distribution of the original high-quality virtual viewpoint image, φ_m,n(I_y) is the feature distribution of the high-quality virtual viewpoint image generated by the generation network, and W_m,n and H_m,n are the sizes of the features;
4.5, obtaining the loss function of the discrimination network by calculating the probability that the original high-quality virtual viewpoint is judged true and the generated ("spurious") low-quality virtual viewpoint is judged false, with the formula L_D = -log(D(I_o)) - log(1 - D(G(I_x))), where I_x is the low-quality virtual viewpoint image synthesized from the compressed video, I_o is the original high-quality virtual viewpoint image, G(·) represents the image generated by the generation network module, and D(·) represents the probability with which the discrimination network module judges the generated image to be a real image.
Compared with the prior art, the invention has the beneficial effects that:
(1) Aiming at the problem of virtual viewpoint synthesis distortion caused by three-dimensional video encoding/decoding and viewpoint synthesis, and considering that generative adversarial networks perform well in image restoration, the invention analyzes the causes of virtual viewpoint distortion and designs a quality enhancement network model oriented to the virtual viewpoint to realize virtual viewpoint quality enhancement.
(2) The invention analyzes the quality characteristics of the synthesized virtual viewpoint and the main causes of virtual viewpoint synthesis distortion, laying a foundation for preprocessing the data set;
(3) The invention designs a virtual viewpoint quality enhancement flow framework, mainly comprising two parts, data preprocessing and virtual viewpoint quality enhancement, and establishes the overall technical route and implementation steps of the method;
(4) The invention provides a virtual viewpoint quality enhancement network model based on a generative adversarial network, realizing quality enhancement through alternate training of the generation network and the discrimination network. In objective evaluation, the PSNR of the virtual viewpoint is improved by 1.127 dB on average over the original HTM-16.0 method, and the SSIM is improved by 0.0267 on average. In subjective evaluation, the enhanced video shows essentially no visible difference from the original video and achieves good subjective visual quality.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention.
Fig. 1 is an overall flowchart of the virtual viewpoint quality enhancement method based on a generative adversarial network provided by the present invention.
Fig. 2 is a schematic diagram of virtual viewpoint distortion caused by background region de-occlusion and non-viewpoint overlapping regions in the present invention.
Fig. 3 is a schematic diagram of virtual viewpoint distortion caused by inaccurate depth information in the present invention.
Fig. 4 is the flow framework for virtual viewpoint quality enhancement in the present invention.
Fig. 5 is a schematic diagram of the virtual viewpoint quality enhancement model based on a generative adversarial network in the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. Of course, the specific embodiments described herein are for purposes of illustration only and are not intended to limit the invention.
Example 1
Referring to fig. 1, the technical scheme provided in this embodiment is as follows: a virtual viewpoint quality enhancement method based on a generative adversarial network comprises the following steps:
step 1, analyzing the relevance between network and video characteristics and virtual viewpoint quality;
step 2, designing a virtual viewpoint quality enhancement flow framework;
step 3, constructing a virtual viewpoint quality enhancement model based on a generative adversarial network.
Specifically, referring to fig. 2 and 3, in step 1, the correlation between the network and video features and the virtual viewpoint quality is analyzed, and the method specifically includes the following steps:
1) Analyzing the causes of virtual viewpoint distortion, mainly including compression coding, disocclusion of background regions, non-overlapping viewpoint regions and inaccurate depth information;
2) Establishing a virtual viewpoint distortion evaluation criterion C, defined as C = f(p_video(d_texture, d_depth, d_synthesis), p_network(v, b, d)), where p_video(·) represents the video characteristic parameters, d_texture the texture map coding distortion, d_depth the depth map coding distortion, and d_synthesis the distortion of the virtual viewpoint synthesis process; p_network(·) represents the network characteristic parameters, v the network data transmission rate, b the network channel bandwidth, and d the total network data delay.
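The patent leaves the combining function f of the criterion C unspecified. The sketch below is therefore a purely hypothetical linear instance: the weights, the inverse dependence on rate and bandwidth, and the function name are all assumptions made for illustration only.

```python
def distortion_criterion(d_texture, d_depth, d_synthesis, v, b, d,
                         video_weights=(0.4, 0.4, 0.2),
                         network_weights=(0.5, 0.3, 0.2)):
    """Hypothetical instance of C = f(p_video(...), p_network(v, b, d)).

    The patent does not specify f; this sketch combines the video-side
    distortions and the network terms linearly. All weights are
    illustrative assumptions, not values from the patent.
    """
    wt, wd, ws = video_weights
    p_video = wt * d_texture + wd * d_depth + ws * d_synthesis
    wv, wb, wdly = network_weights
    # Lower rate/bandwidth and higher delay are assumed to worsen quality.
    p_network = wv / max(v, 1e-9) + wb / max(b, 1e-9) + wdly * d
    return p_video + p_network
```

A larger C would then flag a viewpoint as needing stronger enhancement during data-set preprocessing.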
Specifically, referring to fig. 4, in step 2, a virtual viewpoint quality enhancement flow framework is designed, including the steps of:
1) The method comprises the steps of obtaining a low-quality virtual viewpoint to be enhanced by encoding and decoding an original texture map and a depth map and performing viewpoint synthesis;
2) Preprocessing the low-quality and high-quality virtual viewpoint data sets obtained in step 1) based on the virtual viewpoint distortion evaluation criterion C;
3) Constructing a generating network, judging the network, defining a loss function, and constructing a virtual viewpoint quality enhancement model;
4) And reconstructing the low-quality virtual view into a high-quality virtual view by using the trained virtual view quality enhancement model.
Specifically, referring to fig. 5, in step 3, a virtual viewpoint quality enhancement model based on a generative adversarial network is constructed, including the steps of:
1) Combining the 1 generation network module, the 1 discrimination network module and the 1 loss feedback module into a virtual viewpoint quality enhancement model;
2) Combining 1 convolutional layer of 64-channel 3×3 convolution kernels and 16 residual units into the generation network module;
3) Combining 6 groups of convolutional layers, each comprising a 3×3 convolution kernel, a BN layer and a Leaky ReLU activation function, into the discrimination network module;
4) The mean square error L_MSE measuring pixel-level loss, the loss L_PSNR measuring the objective difference in image quality, the perceptual loss L_P measuring image style (color, texture, contrast, etc.), and the adversarial loss L_A measuring the cross entropy between the discrimination result and the real image are combined into the loss function L_G of the generation network: L_G = λ_MSE·L_MSE + λ_PSNR·L_PSNR + λ_P·L_P + λ_A·L_A, where λ_MSE, λ_PSNR, λ_P and λ_A are parameters set during training of the generation network model; L_MSE = (1/(W·H·C))·Σ(I_o - I_y)², where W, H and C are respectively the width, height and number of channels of the image; L_PSNR is computed from the mean square error loss L_MSE; L_P = (1/(W_m,n·H_m,n))·Σ(φ_m,n(I_o) - φ_m,n(I_y))², where φ_m,n denotes the feature distribution of the n-th convolutional layer before the m-th pooling layer in the VGG19 network, φ_m,n(I_o) is the feature distribution of the original high-quality virtual viewpoint image, φ_m,n(I_y) is the feature distribution of the high-quality virtual viewpoint image generated by the generation network, and W_m,n and H_m,n are the sizes of the features;
5) Obtaining the loss function of the discrimination network by calculating the probability that the original high-quality virtual viewpoint is judged true and the generated ("spurious") low-quality virtual viewpoint is judged false: L_D = -log(D(I_o)) - log(1 - D(G(I_x))), where I_x is the low-quality virtual viewpoint image synthesized from the compressed video, I_o is the original high-quality virtual viewpoint image, G(·) represents the image generated by the generation network module, and D(·) represents the probability with which the discrimination network module judges the generated image to be a real image.
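The generator and discriminator structure of steps 2) and 3) can be sketched in PyTorch (the framework named later in this embodiment). The layer counts follow the text (one 64-channel 3×3 convolution plus 16 residual units; six conv-BN-LeakyReLU groups), but the internal layout of each residual unit, the discriminator's channel widths and strides, and the classifier head are assumptions not specified in the patent.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    # The patent does not detail the residual units; a standard
    # conv-BN-PReLU-conv-BN block with a skip connection is assumed.
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """One 64-channel 3x3 conv layer followed by 16 residual units, as
    in step 2); the output conv back to 3 channels is an assumption."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.PReLU())
        self.body = nn.Sequential(*[ResidualUnit() for _ in range(16)])
        self.tail = nn.Conv2d(64, 3, 3, padding=1)

    def forward(self, x):
        # Predict a residual correction on top of the low-quality input.
        return x + self.tail(self.body(self.head(x)))

class Discriminator(nn.Module):
    """Six groups of (3x3 conv, BN, LeakyReLU) per step 3); channel
    widths, strides and the sigmoid head are assumptions."""
    def __init__(self):
        super().__init__()
        layers, c_in = [], 3
        for i, c_out in enumerate([64, 64, 128, 128, 256, 256]):
            layers += [
                nn.Conv2d(c_in, c_out, 3, stride=1 + i % 2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.2),
            ]
            c_in = c_out
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, x):
        # Probability that x is an original (real) high-quality view.
        return self.head(self.features(x))
```

The sigmoid output of the discriminator plays the role of D(·) in the loss L_D above.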
In order to test the performance of the proposed method, the virtual viewpoint quality enhancement model was built with the PyTorch framework, and all training was performed on a server; the training environment configuration is shown in Table 1. The model was trained with the Adam method, the initial learning rate was set to 0.0001 and fine-tuned as training progressed, to obtain the optimal parameters of the generative adversarial loss function. The test sequences were Balloons, Kendo, Newspaper, Poznan_Hall2, Poznan_Street and Undo_Dancer.
Table 1 training environment settings
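A single alternate-training step matching the described procedure (discriminator first, then generator, both with Adam) might look as follows. The weight values, the ε stabilizer, and the omission of the VGG19 perceptual term L_P are simplifications; this is a sketch of the training scheme, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, low_q, high_q,
               weights=(1.0, 0.05, 0.001)):
    """One alternate-training step for the quality enhancement GAN.

    weights = (w_mse, w_psnr, w_adv) stand in for the patent's lambda
    parameters; their values here are arbitrary assumptions, and the
    VGG19 perceptual term is omitted for brevity.
    """
    eps = 1e-8
    # Discriminator update: L_D = -log D(I_o) - log(1 - D(G(I_x))).
    fake = G(low_q).detach()
    l_d = -(torch.log(D(high_q) + eps).mean()
            + torch.log(1.0 - D(fake) + eps).mean())
    opt_d.zero_grad(); l_d.backward(); opt_d.step()
    # Generator update with the weighted multi-term loss L_G.
    w_mse, w_psnr, w_adv = weights
    fake = G(low_q)
    mse = F.mse_loss(fake, high_q)
    l_psnr = 10.0 * torch.log10(mse + eps)  # minimizing this raises PSNR
    l_adv = -torch.log(D(fake) + eps).mean()
    l_g = w_mse * mse + w_psnr * l_psnr + w_adv * l_adv
    opt_g.zero_grad(); l_g.backward(); opt_g.step()
    return float(l_g), float(l_d)
```

In a full run this step would be repeated over mini-batches of paired low-/high-quality virtual viewpoint patches, with the learning rate decayed over epochs as described.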
Table 2 shows the comparison between the proposed method and two current mainstream virtual viewpoint quality enhancement methods on objective quality evaluation indices. All experimental results were obtained under the training environment of this embodiment. The evaluation indices are the difference between the SSIM of the image produced by the proposed method (against the original image) and the SSIM of the image produced by the HTM-16.0 test platform codec (against the original image), and the corresponding difference in PSNR. SSIM is computed as SSIM(X, Y) = ((2·μ_X·μ_Y + c_1)(2·σ_XY + c_2)) / ((μ_X² + μ_Y² + c_1)(σ_X² + σ_Y² + c_2)), where X and Y are the two images to be compared, μ_X and μ_Y are their means, σ_X² and σ_Y² their variances, σ_XY their covariance, and c_1 = (k_1·L)² and c_2 = (k_2·L)² are stabilizing constants with k_1 = 0.01 and k_2 = 0.03; L is the dynamic range of the pixel values (255 for 8-bit images). PSNR is computed as PSNR = 10·log10((2^n - 1)² / ((1/(M·N))·Σ_i,j(x_i,j - y_i,j)²)), where M and N are the image dimensions, x_i,j and y_i,j are the gray values of the original image and the processed image at row i, column j, and n is the image bit depth. Table 2 shows the results of the proposed method and the reference methods.
As shown in Table 2, the virtual viewpoint quality enhancement model proposed in this embodiment improves the PSNR of the synthesized virtual viewpoint by 1.127 dB on average over the HTM-16.0 codec, better than the 0.804 dB of TSAN [4] and the 0.315 dB of the method of Zhu et al. [6]. On SSIM gain, the proposed model improves by 0.0267 over HTM-16.0 codec and viewpoint synthesis, also better than the 0.0117 of TSAN [4] and the 0.0046 of the method of Zhu et al. [6]. It follows that the proposed method performs better in enhancing virtual viewpoint quality.
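The SSIM and PSNR evaluation measures can be implemented directly from the printed formulas. Note that the SSIM here uses global image statistics exactly as the formula states; common library implementations average over sliding Gaussian windows, so their values will differ slightly.

```python
import numpy as np

def psnr(x, y, bits=8):
    """PSNR per the formula in the text: 10*log10((2^n - 1)^2 / MSE)."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10((2 ** bits - 1) ** 2 / mse)

def ssim_global(x, y, bits=8, k1=0.01, k2=0.03):
    """Single-window SSIM with the constants given in the text."""
    L = 2 ** bits - 1
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return (((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2))
            / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))
```

The reported gains in Table 2 are differences of these two measures between the enhanced and the HTM-16.0-decoded images, each evaluated against the original.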
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
Claims (1)
1. A virtual viewpoint quality enhancement method based on a generative adversarial network, characterized by comprising the following steps:
Step1.1, analyzing the relevance of network and video characteristics and virtual viewpoint quality;
Step 1.2, designing a virtual viewpoint quality enhancement flow framework;
step 1.3, constructing a virtual viewpoint quality enhancement model based on a generative adversarial network;
the step 1.1 specifically comprises the following steps:
2.1, analyzing the causes of virtual viewpoint distortion, including compression coding, disocclusion of background regions, non-overlapping viewpoint regions and inaccurate depth information;
2.2, establishing a virtual viewpoint distortion evaluation criterion C, defined as C = f(p_video(d_texture, d_depth, d_synthesis), p_network(v, b, d)), where p_video(·) represents the video characteristic parameters, d_texture the texture map coding distortion, d_depth the depth map coding distortion, and d_synthesis the distortion of the virtual viewpoint synthesis process; p_network(·) represents the network characteristic parameters, v the network data transmission rate, b the network channel bandwidth, and d the total network data delay;
The step 1.2 specifically comprises the following steps:
3.1, obtaining a low-quality virtual viewpoint to be enhanced by encoding and decoding an original texture map and a depth map and performing viewpoint synthesis;
3.2, preprocessing the low-quality virtual view and the high-quality virtual view data set obtained in the step 3.1 based on a virtual view distortion evaluation criterion C;
3.3, constructing a generating network, judging the network, defining a loss function, and constructing a virtual viewpoint quality enhancement model;
3.4, rebuilding the low-quality virtual view into a high-quality virtual view by using the trained virtual view quality enhancement model;
the step 1.3 specifically comprises the following steps:
4.1, combining 1 generation network module, 1 discrimination network module and 1 loss feedback module into a virtual viewpoint quality enhancement model;
4.2, combining 1 convolutional layer of 64-channel 3×3 convolution kernels and 16 residual units into the generation network module;
4.3, combining 6 groups of convolutional layers, each comprising a 3×3 convolution kernel, a BN layer and a Leaky ReLU activation function, into the discrimination network module;
4.4, composing the mean square error L_MSE measuring pixel-level loss, the loss L_PSNR measuring the objective difference in image quality, the perceptual loss L_P measuring image style, and the adversarial loss L_A measuring the cross entropy between the discrimination result and the real image into the loss function L_G of the generation network, with the formula L_G = λ_MSE·L_MSE + λ_PSNR·L_PSNR + λ_P·L_P + λ_A·L_A, where λ_MSE, λ_PSNR, λ_P and λ_A are parameters set during training of the generation network model; L_MSE = (1/(W·H·C))·Σ(I_o - I_y)², where W, H and C are respectively the width, height and number of channels of the image; L_PSNR is computed from the mean square error loss L_MSE; L_P = (1/(W_m,n·H_m,n))·Σ(φ_m,n(I_o) - φ_m,n(I_y))², where φ_m,n denotes the feature distribution of the n-th convolutional layer before the m-th pooling layer in the VGG19 network, φ_m,n(I_o) is the feature distribution of the original high-quality virtual viewpoint image, φ_m,n(I_y) is the feature distribution of the high-quality virtual viewpoint image generated by the generation network, and W_m,n and H_m,n are the sizes of the features;
4.5, obtaining the loss function of the discrimination network by calculating the probability that the original high-quality virtual viewpoint is judged true and the generated ("spurious") low-quality virtual viewpoint is judged false, with the formula L_D = -log(D(I_o)) - log(1 - D(G(I_x))), where I_x is the low-quality virtual viewpoint image synthesized from the compressed video, I_o is the original high-quality virtual viewpoint image, G(·) represents the image generated by the generation network module, and D(·) represents the probability with which the discrimination network module judges the generated image to be a real image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310445621.XA CN116723305B (en) | 2023-04-24 | 2023-04-24 | Virtual viewpoint quality enhancement method based on generation type countermeasure network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116723305A CN116723305A (en) | 2023-09-08 |
CN116723305B true CN116723305B (en) | 2024-05-03 |
Family
ID=87864931
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102892021A (en) * | 2012-10-15 | 2013-01-23 | 浙江大学 | New method for synthesizing virtual viewpoint image |
CN104853175A (en) * | 2015-04-24 | 2015-08-19 | 张艳 | Novel synthesized virtual viewpoint objective quality evaluation method |
CN108495110A (en) * | 2018-01-19 | 2018-09-04 | 天津大学 | A kind of virtual visual point image generating method fighting network based on production |
CN112489198A (en) * | 2020-11-30 | 2021-03-12 | 江苏科技大学 | Three-dimensional reconstruction system and method based on counterstudy |
WO2021093584A1 (en) * | 2019-11-13 | 2021-05-20 | 南京大学 | Free viewpoint video generation and interaction method based on deep convolutional neural network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102325259A (en) * | 2011-09-09 | 2012-01-18 | 青岛海信数字多媒体技术国家重点实验室有限公司 | Method and device for synthesizing virtual viewpoints in multi-viewpoint video |
US11024009B2 (en) * | 2016-09-15 | 2021-06-01 | Twitter, Inc. | Super resolution using a generative adversarial network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||