CN113179396B - Double-viewpoint stereo video fusion method based on K-means model - Google Patents

Double-viewpoint stereo video fusion method based on K-means model

Info

Publication number
CN113179396B
CN113179396B (application CN202110295931.9A)
Authority
CN
China
Prior art keywords
image
viewpoint
depth
foreground
depth image
Prior art date
Legal status
Active
Application number
CN202110295931.9A
Other languages
Chinese (zh)
Other versions
CN113179396A (en)
Inventor
周洋
张博文
崔金鹏
梁文青
殷海兵
陆宇
黄晓峰
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202110295931.9A
Publication of CN113179396A
Application granted
Publication of CN113179396B
Active
Anticipated expiration


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/275Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals
    • H04N13/279Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals the virtual viewpoint locations being selected by the viewers or determined by tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/15Processing image signals for colour aspects of image signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/156Mixing image signals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a double-viewpoint stereo video fusion method based on a K-means model. The method first preprocesses the left-viewpoint and right-viewpoint depth maps to obtain denoised, smoothed left-viewpoint and right-viewpoint depth images. The left and right depth images are then each segmented with the K-means method, and three-dimensional projection is applied to the segmented foreground-region and background-region depth images to obtain foreground and background rendered images for the left and right viewpoints. Where the foreground rendered image is blank, its empty areas are filled from the background rendered image, and the filled left-viewpoint and right-viewpoint rendered images are fused into a virtual-viewpoint rendered image. Finally, the hole areas of the virtual-viewpoint rendered image are filled by weighting the pixel information around each hole to obtain the final output image. Because the method processes the hole areas precisely at the pixel level, the rendered result is visually better and more consistent.

Description

Double-viewpoint stereo video fusion method based on K-means model
Technical Field
The invention belongs to the technical field of stereo video coding and decoding, relates to a double-viewpoint stereo video fusion method based on a K-means model, and aims to improve the double-viewpoint image fusion process.
Background
Depth Image Based Rendering (DIBR) is currently the main method for rendering images observed from different viewing angles: an image at another viewing angle is rendered from an existing image, so that views from different angles can be obtained. The most critical part of this rendering method is the 3D warping process, in which an image is first back-projected into a three-dimensional model and the model is then re-projected onto a target plane near another target viewpoint to obtain the image at the virtual viewpoint. During this back-projection and re-projection, the depth information of the image is essential: the depth of each pixel matters, and the relationship between the depth information of the two views directly influences the rendered result.
The most critical part of the DIBR technique is the 3D image transformation (3D image warping), an operation that relocates the pixels of an image: the pixel points of the reference image are mapped into the target view through the three-dimensional transformation, forming an initial target view corresponding to the reference image.
The whole virtual-viewpoint rendering process can be divided into two parts. First, a depth map corresponding to the virtual viewpoint is obtained by projecting the input depth image. The most convenient way to obtain the depth image at the virtual viewpoint is the three-dimensional projection operation (3D warping): an image is back-projected into three-dimensional space to form a three-dimensional model, and the model is then re-projected onto the target plane at the virtual viewpoint to obtain the virtual-viewpoint image.
Disclosure of Invention
The invention aims to provide a double-viewpoint stereo video fusion method based on a K-means model.
The method comprises the following steps:
Step (1): preprocess the left-viewpoint and right-viewpoint depth maps to obtain the left-viewpoint and right-viewpoint depth images;
Step (2): segment the left-viewpoint and right-viewpoint depth images separately with the K-means method into foreground-region and background-region depth images, and apply the three-dimensional projection operation to the foreground-region and background-region depth images to obtain the foreground and background rendered images of the left viewpoint and the foreground and background rendered images of the right viewpoint;
Step (3): fuse the foreground and background rendered images of the left viewpoint and of the right viewpoint, respectively: where the foreground rendered image is blank, fill its empty areas from the background rendered image, yielding a left-viewpoint rendered image and a right-viewpoint rendered image; then fuse the filled left-viewpoint and right-viewpoint rendered images into a virtual-viewpoint rendered image;
Step (4): fill the hole areas of the virtual-viewpoint rendered image by weighting the pixel information around each hole to obtain the final output image.
Further, the preprocessing in step (1) comprises noise removal and image smoothing: the noise removal uses a residual neural network, and the image smoothing applies a morphological opening operation to the denoised image.
Further, the step (2) is specifically:
The input left-viewpoint depth image is clustered and segmented with the K-means method, as follows:
The pixel-value distribution probabilities of the input depth image are {p_0, p_1, …, p_255}, where p_i (i = 0, 1, …, 255) is the probability that a pixel has value i;
Set k thresholds {τ_1, τ_2, …, τ_k} as the input to the K-means operation and compute, for each pixel value, the minimum Euclidean distance to the thresholds, d_min = min_{1≤j≤k} |i - τ_j|. The threshold corresponding to the minimum Euclidean distance is taken as the initial filling value; at the same time k new thresholds {τ′_1, τ′_2, …, τ′_k} are output and used as the input of the next iteration;
After each iteration the selected thresholds τ_j change: τ_j is adjusted to the statistical mean of all elements in the set C_j, where C_j is the set of pixels whose values lie between τ_j and τ_{j+1}:
τ′_j = (1/|C_j|) Σ_{x∈C_j} x,
where x denotes a pixel in C_j;
The foreground-region mask map MOD_FG is derived pixel-wise from the clustering result (its defining equation is shown as an image in the original), and the foreground-region depth image is FG = Depth × MOD_FG. The background-region mask map MOD_BG is defined analogously (equation shown as an image in the original), and the background-region depth image is BG = Depth × MOD_BG. Here K(i, j) denotes the filling-result pixel value at pixel coordinate (i, j), and Depth denotes the depth image;
Apply the three-dimensional projection operation to the foreground-region depth image and the background-region depth image to obtain the foreground-region rendered depth image and the background-region rendered depth image;
Apply the three-dimensional projection operation to the foreground-region rendered depth image and to the background-region rendered depth image together with the color image of the original viewpoint to obtain the foreground rendered image and the background rendered image of the left viewpoint;
Perform the same operations on the depth image of the right viewpoint to obtain the foreground rendered image and the background rendered image of the right viewpoint.
Further, in step (3) the filled left-viewpoint and right-viewpoint rendered images are fused into the virtual-viewpoint rendered image. The pixel value I_blend(x, y) of the pixel at coordinate (x, y) is given by a piecewise fusion equation (shown as an image in the original) that distinguishes the following cases: l(x, y) ∈ H and r(x, y) ∈ H denote that the left viewpoint and the right viewpoint, respectively, are hole regions at coordinate (x, y), while l(x, y) ∉ H and r(x, y) ∉ H denote that they are not hole regions; I_L(x, y) and I_R(x, y) denote the pixel values of the left-viewpoint and right-viewpoint rendered images at coordinate (x, y). The left-viewpoint weight W_L and the right-viewpoint weight W_R (their defining equations are shown as images in the original) are computed from R_L and R_R, the spatial rotation matrices of the left and right viewpoints, and T_L and T_R, the spatial translation vectors of the left and right viewpoints.
Further, in step (4) the hole area consists of all pixels with I_blend(x, y) = 0. For each pixel with I_blend(x, y) = 0, the 5 × 5 region centered on its coordinate (x, y) is taken as the filling region Ω, and the 24 pixels of Ω other than (x, y) are divided into four groups of six pixels each.
The finally output pixel value at image coordinate (x, y) is computed from the group statistics (the output equation is shown as an image in the original). The mean of the non-hole pixels in the m-th group (equation shown as an image in the original) uses H(x, y) to indicate whether coordinate (x, y) is a hole region: H(x, y) = 0 if it is a hole region and H(x, y) = 1 if it is not. The weighted pixel sum (equation shown as an image in the original) uses Y(x, y), the priority of coordinate (x, y), and W, the priority weight (defined by an equation shown as an image in the original); T is used as an abbreviation for the priority Y(x, y).
The method of the invention preprocesses the depth images with a noise-removal network and a smoothing filter, which effectively reduces rendering artifacts that the depth images may otherwise introduce during virtual-viewpoint rendering and benefits the subsequent refinement of the rendered color image. The method fuses the two viewpoint images and optimizes the fused image with a weighting filter based on geometric distance. Instead of the conventional search for a best matching block, it processes the hole areas precisely at the pixel level, so the rendered result is visually better and more consistent.
Detailed Description
The double-viewpoint stereo video fusion method based on the K-means model specifically comprises the following steps:
the method comprises the following steps of (1) preprocessing a left viewpoint depth map and a right viewpoint depth map to obtain a left viewpoint depth image and a right viewpoint depth image; the preprocessing comprises noise removal processing and image smoothing processing; the noise removing processing is to select a residual error neural network for processing, and the image smoothing processing is to perform opening operation processing on the image after the noise removing processing.
Step (2): segment the left-viewpoint and right-viewpoint depth images separately with the K-means method into foreground-region and background-region depth images, and apply the three-dimensional projection operation (3D warping) to the foreground-region and background-region depth images to obtain the foreground and background rendered images of the left viewpoint and the foreground and background rendered images of the right viewpoint. The procedure is as follows:
The input left-viewpoint depth image is clustered and segmented with the K-means method. K-means is a clustering method based on data statistics: data points with high similarity are gathered into the same class, the data are partitioned into k clusters, and the separation between clusters is made as large as possible. The procedure is as follows:
The pixel-value distribution probabilities of the input depth image are {p_0, p_1, …, p_255}, where p_i (i = 0, 1, …, 255) is the probability that a pixel has value i;
Set k thresholds {τ_1, τ_2, …, τ_k} as the input to the K-means operation and compute, for each pixel value, the minimum Euclidean distance to the thresholds, d_min = min_{1≤j≤k} |i - τ_j|. The threshold corresponding to the minimum Euclidean distance is taken as the initial filling value; at the same time k new thresholds {τ′_1, τ′_2, …, τ′_k} are output and used as the input of the next iteration;
After each iteration the selected thresholds τ_j change: τ_j is adjusted to the statistical mean of all elements in the set C_j, where C_j is the set of pixels whose values lie between τ_j and τ_{j+1}:
τ′_j = (1/|C_j|) Σ_{x∈C_j} x,
where x denotes a pixel in C_j;
The foreground-region mask map MOD_FG is derived pixel-wise from the clustering result (its defining equation is shown as an image in the original), and the foreground-region depth image is FG = Depth × MOD_FG. The background-region mask map MOD_BG is defined analogously (equation shown as an image in the original), and the background-region depth image is BG = Depth × MOD_BG. Here K(i, j) denotes the filling-result pixel value at pixel coordinate (i, j), and Depth denotes the depth image.
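A minimal Python sketch of this K-means segmentation follows. The initialisation of the thresholds and the rule that assigns the cluster with the largest centre to the foreground are assumptions, since the text defines the mask maps only through equations rendered as images.

```python
import numpy as np

def kmeans_thresholds(depth: np.ndarray, k: int = 2, iters: int = 20) -> np.ndarray:
    """1-D K-means over the pixel values of a depth image.

    Each pixel value is assigned to the threshold (cluster centre) at minimum
    Euclidean distance; each threshold is then updated to the mean of its
    cluster, matching the iteration described above.
    """
    pixels = depth.astype(np.float64).ravel()
    # Thresholds initialised uniformly over the value range (assumed strategy).
    tau = np.linspace(pixels.min(), pixels.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(pixels[:, None] - tau[None, :]), axis=1)
        for j in range(k):
            members = pixels[labels == j]
            if members.size:
                tau[j] = members.mean()
    return np.sort(tau)

def split_foreground_background(depth: np.ndarray, k: int = 2):
    """Split a depth image into foreground/background region depth images."""
    tau = kmeans_thresholds(depth, k)
    labels = np.argmin(np.abs(depth[..., None].astype(np.float64) - tau), axis=-1)
    filled = tau[labels]                              # K(i, j): pixel value replaced by its centre
    mod_fg = (filled >= tau[-1]).astype(depth.dtype)  # assumed rule: largest centre = foreground
    mod_bg = (1 - mod_fg).astype(depth.dtype)
    return depth * mod_fg, depth * mod_bg             # FG = Depth x MOD_FG, BG = Depth x MOD_BG
```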
Apply the three-dimensional projection operation to the foreground-region depth image and the background-region depth image to obtain the foreground-region rendered depth image and the background-region rendered depth image. The three-dimensional projection operation is as follows:
(The projection equation is shown as an image in the original.) Here M_v and M denote the coordinate vectors of the virtual viewpoint and the original viewpoint, respectively; I_v and I denote the camera intrinsic parameter matrices of the virtual and original viewpoints, which depend only on the internal characteristics of the camera; R_v and R denote the camera extrinsic rotation matrices of the virtual and original viewpoints, of size 3 × 3, representing the rotation of the camera in three-dimensional space; T_v and T denote the camera translation vectors of the virtual and original viewpoints, of size 3 × 1, representing the translation of the camera in three-dimensional space.
Taking a 4 × 1 original-viewpoint coordinate vector M as an example, M = [x, y, C, D(x, y)]^T, where T denotes the transpose, x and y are the abscissa and ordinate of M, C is a constant coefficient (C = 1 in this embodiment), and D(x, y) is the depth value of M. The depth value reflects the distance of the scene point from the camera and can be computed from the depth image information:
(The depth-conversion equation is shown as an image in the original.) Here Depth(x, y) is the pixel value of the depth map at (x, y), MAXZ is the maximum value the depth map can take (MAXZ = 255 in this embodiment), and Z_min and Z_max are the minimum and maximum actual depths, i.e. the closest and farthest distances from the photographed subject to the camera.
Respectively carrying out three-dimensional projection operation on the foreground area rendering depth image, the background area rendering depth image and the color image of the original viewpoint to obtain a foreground rendering image and a background rendering image of the left viewpoint;
and performing the same operation on the depth image of the right viewpoint to obtain a foreground rendering image and a background rendering image of the right viewpoint.
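The sketch below illustrates the 3D warping step in Python under stated assumptions. The exact projection and depth-conversion equations are given only as images in the original, so the code uses the pinhole back-projection/re-projection form and the inverse-depth conversion that are standard in DIBR; occlusion handling (z-buffering) and hole detection are omitted for brevity.

```python
import numpy as np

def depth_to_z(depth: np.ndarray, z_min: float, z_max: float, maxz: float = 255.0) -> np.ndarray:
    """Convert 8-bit depth values to metric Z (standard DIBR inverse-depth mapping,
    assumed here since the original equation is shown only as an image)."""
    return 1.0 / (depth.astype(np.float64) / maxz * (1.0 / z_min - 1.0 / z_max) + 1.0 / z_max)

def warp_to_virtual(color, z, K, R, T, K_v, R_v, T_v):
    """Forward-warp a colour image to the virtual viewpoint (minimal sketch).

    K, R, T       -- intrinsics, rotation (3x3) and translation (3,) of the original camera
    K_v, R_v, T_v -- the same for the virtual camera
    color         -- H x W x 3 image, z -- H x W metric depth
    """
    h, w = z.shape
    out = np.zeros_like(color)
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])   # homogeneous pixels, 3 x N
    cam = np.linalg.inv(K) @ pix * z.ravel()                   # back-project to camera space
    world = np.linalg.inv(R) @ (cam - T.reshape(3, 1))         # camera space -> world space
    proj = K_v @ (R_v @ world + T_v.reshape(3, 1))             # world -> virtual image plane
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    valid = (proj[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out[v[valid], u[valid]] = color.reshape(-1, 3)[valid]      # no z-buffer: last write wins
    return out
```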
Step (3): fuse the foreground and background rendered images of the left viewpoint and of the right viewpoint, respectively: where the foreground rendered image is blank, fill its empty areas from the background rendered image, yielding a left-viewpoint rendered image and a right-viewpoint rendered image; then fuse the filled left-viewpoint and right-viewpoint rendered images into a virtual-viewpoint rendered image. The pixel value I_blend(x, y) of the pixel at coordinate (x, y) is given by a piecewise fusion equation (shown as an image in the original) that distinguishes the following cases: l(x, y) ∈ H and r(x, y) ∈ H denote that the left viewpoint and the right viewpoint, respectively, are hole regions at coordinate (x, y), while l(x, y) ∉ H and r(x, y) ∉ H denote that they are not hole regions; I_L(x, y) and I_R(x, y) denote the pixel values of the left-viewpoint and right-viewpoint rendered images at coordinate (x, y). The left-viewpoint weight W_L and the right-viewpoint weight W_R (their defining equations are shown as images in the original) are computed from R_L and R_R, the spatial rotation matrices of the left and right viewpoints, and T_L and T_R, the spatial translation vectors of the left and right viewpoints.
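A minimal Python sketch of the two-view fusion follows. The piecewise rule and the weight formulas are given only as images in the original, so the weights w_l and w_r appear as plain parameters here (a baseline-distance ratio derived from the camera translations is a typical choice), and the handling of hole/non-hole pixels is the assumed reading of the case distinctions above.

```python
import numpy as np

def blend_views(img_l, img_r, hole_l, hole_r, w_l=0.5, w_r=0.5):
    """Fuse left/right rendered images into the virtual view.

    hole_l / hole_r are boolean masks, True where the corresponding rendered
    view has a hole at that pixel.
    """
    img_l = img_l.astype(np.float64)
    img_r = img_r.astype(np.float64)
    both   = (~hole_l & ~hole_r)[..., None]   # both views valid -> weighted sum
    only_l = (~hole_l &  hole_r)[..., None]   # only the left view valid
    only_r = ( hole_l & ~hole_r)[..., None]   # only the right view valid
    blend  = both * (w_l * img_l + w_r * img_r) + only_l * img_l + only_r * img_r
    return blend                               # pixels left at 0 are the remaining holes
```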
Step (4): fill the hole areas of the virtual-viewpoint rendered image by weighting the pixel information around each hole to obtain the final output image. The hole area consists of all pixels with I_blend(x, y) = 0. For each pixel with I_blend(x, y) = 0, the 5 × 5 region centered on its coordinate (x, y) is taken as the filling region Ω, and the 24 pixels of Ω other than (x, y) are divided into four groups of six pixels each.
The finally output pixel value at image coordinate (x, y) is computed from the group statistics (the output equation is shown as an image in the original). The mean of the non-hole pixels in the m-th group (equation shown as an image in the original) uses H(x, y) to indicate whether coordinate (x, y) is a hole region: H(x, y) = 0 if it is a hole region and H(x, y) = 1 if it is not. The weighted pixel sum (equation shown as an image in the original) uses Y(x, y), the priority of coordinate (x, y), and W, the priority weight (defined by an equation shown as an image in the original); T is used as an abbreviation for the priority Y(x, y).

Claims (1)

1. The double-viewpoint stereo video fusion method based on the K-means model is characterized by comprising the following steps:
Step (1): preprocess the left-viewpoint and right-viewpoint depth maps to obtain the left-viewpoint and right-viewpoint depth images; the preprocessing comprises noise removal and image smoothing: the noise removal uses a residual neural network, and the image smoothing applies a morphological opening operation to the denoised image;
Step (2): segment the left-viewpoint and right-viewpoint depth images separately with the K-means method into foreground-region and background-region depth images, and apply the three-dimensional projection operation to the foreground-region and background-region depth images to obtain the foreground and background rendered images of the left viewpoint and the foreground and background rendered images of the right viewpoint, as follows:
The input left-viewpoint depth image is clustered and segmented with the K-means method:
The pixel-value distribution probabilities of the input depth image are {p_0, p_1, …, p_255}, where p_i (i = 0, 1, …, 255) is the probability that a pixel has value i;
Set k thresholds {τ_1, τ_2, …, τ_k} as the input to the K-means operation and compute, for each pixel value, the minimum Euclidean distance to the thresholds, d_min = min_{1≤j≤k} |i - τ_j|; the threshold corresponding to the minimum Euclidean distance is taken as the initial filling value, and at the same time k new thresholds {τ′_1, τ′_2, …, τ′_k} are output and used as the input of the next iteration;
After each iteration the selected thresholds τ_j change: τ_j is adjusted to the statistical mean of all elements in the set C_j, where C_j is the set of pixels whose values lie between τ_j and τ_{j+1}: τ′_j = (1/|C_j|) Σ_{x∈C_j} x, where x denotes a pixel in C_j;
The foreground-region mask map MOD_FG is derived pixel-wise from the clustering result (its defining equation is shown as an image in the original), and the foreground-region depth image is FG = Depth × MOD_FG; the background-region mask map MOD_BG is defined analogously (equation shown as an image in the original), and the background-region depth image is BG = Depth × MOD_BG; here K(i, j) denotes the filling-result pixel value at pixel coordinate (i, j), and Depth denotes the depth image;
Apply the three-dimensional projection operation to the foreground-region depth image and the background-region depth image to obtain the foreground-region rendered depth image and the background-region rendered depth image;
Apply the three-dimensional projection operation to the foreground-region rendered depth image and to the background-region rendered depth image together with the color image of the original viewpoint to obtain the foreground rendered image and the background rendered image of the left viewpoint; the three-dimensional projection operation is as follows (the projection equation is shown as an image in the original): M_v and M denote the coordinate vectors of the virtual viewpoint and the original viewpoint, respectively; I_v and I denote the camera intrinsic parameter matrices of the virtual and original viewpoints, which depend only on the internal characteristics of the camera; R_v and R denote the camera extrinsic rotation matrices of the virtual and original viewpoints, of size 3 × 3, representing the rotation of the camera in three-dimensional space; T_v and T denote the camera translation vectors of the virtual and original viewpoints, of size 3 × 1, representing the translation of the camera in three-dimensional space;
Perform the same operations on the depth image of the right viewpoint to obtain the foreground rendered image and the background rendered image of the right viewpoint;
Step (3): fuse the foreground and background rendered images of the left viewpoint and of the right viewpoint, respectively: where the foreground rendered image is blank, fill its empty areas from the background rendered image, yielding a left-viewpoint rendered image and a right-viewpoint rendered image; then fuse the filled left-viewpoint and right-viewpoint rendered images into a virtual-viewpoint rendered image; the pixel value I_blend(x, y) of the pixel at coordinate (x, y) is given by a piecewise fusion equation (shown as an image in the original) that distinguishes the following cases: l(x, y) ∈ H and r(x, y) ∈ H denote that the left viewpoint and the right viewpoint, respectively, are hole regions at coordinate (x, y), while l(x, y) ∉ H and r(x, y) ∉ H denote that they are not hole regions; I_L(x, y) and I_R(x, y) denote the pixel values of the left-viewpoint and right-viewpoint rendered images at coordinate (x, y); the left-viewpoint weight W_L and the right-viewpoint weight W_R (their defining equations are shown as images in the original) are computed from R_L and R_R, the spatial rotation matrices of the left and right viewpoints, and T_L and T_R, the spatial translation vectors of the left and right viewpoints;
Step (4): fill the hole areas of the virtual-viewpoint rendered image by weighting the pixel information around each hole to obtain the final output image;
The hole area consists of all pixels with I_blend(x, y) = 0; for each pixel with I_blend(x, y) = 0, the 5 × 5 region centered on its coordinate (x, y) is taken as the filling region Ω, and the 24 pixels of Ω other than (x, y) are divided into four groups of six pixels each;
The finally output pixel value at image coordinate (x, y) is computed from the group statistics (the output equation is shown as an image in the original); the mean of the non-hole pixels in the m-th group (equation shown as an image in the original) uses H(x, y) to indicate whether coordinate (x, y) is a hole region: H(x, y) = 0 if it is a hole region and H(x, y) = 1 if it is not; the weighted pixel sum (equation shown as an image in the original) uses Y(x, y), the priority of coordinate (x, y), and W, the priority weight (defined by an equation shown as an image in the original); T is used as an abbreviation for the priority Y(x, y).
CN202110295931.9A 2021-03-19 2021-03-19 Double-viewpoint stereo video fusion method based on K-means model Active CN113179396B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110295931.9A CN113179396B (en) 2021-03-19 2021-03-19 Double-viewpoint stereo video fusion method based on K-means model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110295931.9A CN113179396B (en) 2021-03-19 2021-03-19 Double-viewpoint stereo video fusion method based on K-means model

Publications (2)

Publication Number Publication Date
CN113179396A (en) 2021-07-27
CN113179396B true CN113179396B (en) 2022-11-11

Family

ID=76922161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110295931.9A Active CN113179396B (en) 2021-03-19 2021-03-19 Double-viewpoint stereo video fusion method based on K-means model

Country Status (1)

Country Link
CN (1) CN113179396B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102055982A (en) * 2011-01-13 2011-05-11 浙江大学 Coding and decoding methods and devices for three-dimensional video
CN105141940A (en) * 2015-08-18 2015-12-09 太原科技大学 3D video coding method based on regional division

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102592275B (en) * 2011-12-16 2013-12-25 天津大学 Virtual viewpoint rendering method
CN102903110B (en) * 2012-09-29 2015-11-25 宁波大学 To the dividing method of image with deep image information
CN103269438A (en) * 2013-05-27 2013-08-28 中山大学 Method for drawing depth image on the basis of 3D video and free-viewpoint television
CN103996174B (en) * 2014-05-12 2017-05-10 上海大学 Method for performing hole repair on Kinect depth images
CN109712067B (en) * 2018-12-03 2021-05-28 北京航空航天大学 Virtual viewpoint drawing method based on depth image
CN111385554B (en) * 2020-03-28 2022-07-08 浙江工业大学 High-image-quality virtual viewpoint drawing method of free viewpoint video


Also Published As

Publication number Publication date
CN113179396A (en) 2021-07-27

Similar Documents

Publication Publication Date Title
CN113192179B (en) Three-dimensional reconstruction method based on binocular stereo vision
CN111325693B (en) Large-scale panoramic viewpoint synthesis method based on single viewpoint RGB-D image
CN110853151A (en) Three-dimensional point set recovery method based on video
KR100560464B1 (en) Multi-view display system with viewpoint adaptation
CN111462030A (en) Multi-image fused stereoscopic set vision new angle construction drawing method
CN115298708A (en) Multi-view neural human body rendering
Tomiyama et al. Algorithm for dynamic 3D object generation from multi-viewpoint images
Xu et al. Layout-guided novel view synthesis from a single indoor panorama
CN111951368A (en) Point cloud, voxel and multi-view fusion deep learning method
Zhu et al. An improved depth image based virtual view synthesis method for interactive 3D video
Ma et al. Depth-guided inpainting algorithm for free-viewpoint video
Jantet et al. Joint projection filling method for occlusion handling in depth-image-based rendering
Zhu et al. Occlusion-free scene recovery via neural radiance fields
CN117501313A (en) Hair rendering system based on deep neural network
CN113179396B (en) Double-viewpoint stereo video fusion method based on K-means model
CN117730530A (en) Image processing method and device, equipment and storage medium
CN116681839A (en) Live three-dimensional target reconstruction and singulation method based on improved NeRF
CN113450274B (en) Self-adaptive viewpoint fusion method and system based on deep learning
CN115409932A (en) Texture mapping and completion method of three-dimensional human head and face model
CN114742954A (en) Method for constructing large-scale diversified human face image and model data pairs
CN111178163B (en) Stereoscopic panoramic image salient region prediction method based on cube projection format
CN113763474A (en) Scene geometric constraint-based indoor monocular depth estimation method
CN112364711A (en) 3D face recognition method, device and system
Lee et al. Hole concealment for depth image using pixel classification in multiview system
Lee et al. Removing foreground objects by using depth information from multi-view images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant