CN114007058A - Depth map correction method, video processing method, video reconstruction method and related devices

Info

Publication number: CN114007058A
Application number: CN202010740742.3A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 盛骁杰
Applicant and current assignee: Alibaba Group Holding Ltd
Prior art keywords: map, depth map, depth, background, viewpoint
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00: Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20: Image signal generators
    • H04N 13/271: Image signal generators wherein the generated image signals comprise depth maps or disparity maps
    • H04N 13/282: Image signal generators for generating image signals corresponding to three or more geometrical viewpoints, e.g. multi-view systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The depth map correction method comprises the following steps: acquiring a target depth map to be processed, where the target depth map is estimated based on an original texture map at the corresponding frame moment in the video frame sequence of the corresponding viewpoint; acquiring a reference video frame sequence of the corresponding viewpoint, where the video frames in the reference video frame sequence comprise original texture maps; obtaining estimated depth maps corresponding to the original texture maps in the reference video frame sequence; performing temporal filtering on the original texture maps and the corresponding estimated depth maps in the reference video frame sequence, respectively, to obtain a background texture map and a background depth map of the corresponding viewpoint; and correcting the target depth map using the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map. The scheme can improve the quality of the depth map and further improve the image quality of free viewpoint video.

Description

Depth map correction method, video processing method, video reconstruction method and related devices
Technical Field
The embodiment of the specification relates to the technical field of video processing, in particular to a depth map correction method, a video processing method, a video reconstruction method and a related device.
Background
Free viewpoint video is a technology that provides a viewing experience with a high degree of freedom: during viewing, a user can adjust the viewing angle through interactive operations and watch from any desired free viewpoint, which greatly improves the viewing experience.
Depth map estimation is a very important step in free viewpoint video reconstruction, and the quality of depth map estimation largely determines the quality of the finally generated virtual viewpoint image, so how to improve the quality of depth map estimation becomes an important problem.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a depth map correction method, a video processing method, a video reconstruction method, and a related apparatus, which can improve the quality of a depth map and further improve the image quality of a free viewpoint video.
First, an embodiment of the present specification provides a depth map correction method, including:
acquiring a target depth map to be processed, wherein the target depth map is estimated and obtained based on an original texture map of a corresponding frame moment in a video frame sequence of a corresponding viewpoint;
acquiring a reference video frame sequence of corresponding viewpoints, wherein video frames in the reference video frame sequence comprise: an original texture map;
obtaining an estimated depth map corresponding to an original texture map in a reference video frame sequence;
respectively carrying out time domain filtering on an original texture map and a corresponding estimated depth map in the reference video frame sequence to obtain a background texture map and a background depth map of the corresponding viewpoint;
and correcting the target depth map by adopting the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map.
Optionally, the performing temporal filtering on the original texture map and the corresponding estimated depth map in the reference video frame sequence respectively to obtain the background texture map and the background depth map of the corresponding viewpoint includes:
and respectively carrying out time domain median filtering on pixels in the original texture map and pixels in the corresponding estimated depth map in the reference video frame sequence to obtain a background texture map and a background depth map of the corresponding viewpoint.
Optionally, the obtaining a corrected depth map by performing correction processing on the target depth map by using the background texture map and the background depth map of the corresponding viewpoint includes at least one of:
acquiring the depth value of any pixel in the target depth map, comparing it with the depth value of the corresponding pixel in the background depth map of the corresponding viewpoint, and selecting the larger of the two as the depth value of the corresponding pixel in the corrected depth map;
and comparing the pixel value of the corresponding pixel in the original texture map corresponding to the target depth map with the pixel value of the corresponding pixel in the background texture map of the corresponding viewpoint, and, when the difference between the two is smaller than a preset threshold, selecting the depth value of the corresponding pixel in the background depth map of the corresponding viewpoint as the depth value of the corresponding pixel in the corrected depth map.
Optionally, the original texture maps in the reference video frame sequence contain fewer foreground objects than the original texture map corresponding to the target depth map.
Optionally, the target depth maps include multiple groups, and the multiple groups of target depth maps are estimated based on original texture maps acquired synchronously from multiple viewpoints.
Optionally, the obtaining a corrected depth map by performing correction processing on the target depth map by using the background texture map and the background depth map of the corresponding viewpoint includes:
and respectively adopting the background texture image and the background depth image of the corresponding viewpoint to carry out correction processing on the multiple groups of target depth images to obtain multiple groups of corrected depth images.
Optionally, the performing, by using the background texture map and the background depth map of the corresponding viewpoint to correct the multiple sets of target depth maps to obtain multiple sets of corrected depth maps includes:
and respectively adopting the background texture image and the background depth image of the corresponding viewpoint to the multiple groups of target depth images, and performing correction processing in parallel to obtain multiple groups of corrected depth images.
An embodiment of the present specification further provides a video processing method, including:
acquiring a video frame sequence of a plurality of synchronous viewpoints, wherein video frames in the video frame sequence of the plurality of viewpoints are synchronously acquired according to a time sequence;
respectively estimating, for the texture map at any frame time in the video frame sequence of each viewpoint, an estimated depth map of the corresponding viewpoint as a target depth map to be processed;
acquiring a background texture map and a background depth map corresponding to each viewpoint, wherein the background texture map and the background depth map corresponding to each viewpoint are obtained by performing time domain filtering respectively based on an original texture map and a corresponding estimated depth map in a reference video frame sequence of the corresponding viewpoint;
and correcting the target depth map by adopting the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map.
Optionally, the correcting the target depth map using the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map includes:
acquiring the depth value of any pixel in the target depth map, comparing it with the depth value of the corresponding pixel in the corresponding background depth map, and selecting the larger of the two as the depth value of the corresponding pixel in the corrected depth map;
and comparing the pixel value of the corresponding pixel in the original texture map corresponding to the target depth map with the pixel value of the corresponding pixel in the background texture map of the corresponding viewpoint, and, when the difference between the two is smaller than a preset threshold, selecting the depth value of the corresponding pixel in the corresponding background depth map as the depth value of the corresponding pixel in the corrected depth map.
An embodiment of the present specification further provides a free viewpoint video reconstruction method, including:
acquiring video frames of a plurality of frame moments, wherein the video frames comprise texture maps of a plurality of synchronous viewpoints and depth maps of corresponding viewpoints, the depth maps are obtained by correcting background texture maps and background depth maps of the corresponding viewpoints, and the background texture maps and the background depth maps of the corresponding viewpoints are obtained by performing time domain filtering on original texture maps and corresponding estimated depth maps in a reference video frame sequence respectively;
and reconstructing to obtain an image of the virtual viewpoint according to the position information of the virtual viewpoint and the parameter data corresponding to the video frame based on the texture maps and the corresponding depth maps of the multiple synchronous viewpoints contained in the video frame.
An embodiment of the present specification further provides a depth map correction device, including:
the target depth map acquisition unit is suitable for acquiring a target depth map to be processed, where the target depth map is estimated based on an original texture map of the corresponding viewpoint;
a reference view acquisition unit adapted to acquire a reference video frame sequence of corresponding views, a video frame of the reference video frame sequence comprising: obtaining an original texture map, and obtaining an estimated depth map corresponding to the original texture map in a reference video frame sequence;
a background view filtering unit, adapted to perform time-domain filtering on the original texture map and the corresponding estimated depth map in the reference video frame sequence, respectively, to obtain a background texture map and a corresponding background depth map of the corresponding viewpoint;
and the correction unit is suitable for correcting the target depth map by adopting the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map.
An embodiment of the present specification further provides a video processing apparatus, including:
the video acquisition unit is suitable for acquiring a plurality of synchronous viewpoint video frame sequences, and video frames in the plurality of viewpoint video frame sequences are synchronously acquired according to a time sequence;
the target depth map acquisition unit is suitable for respectively estimating and obtaining an estimated depth map of a corresponding viewpoint as a target depth map to be processed for a texture map at any frame time in the video frame sequence of each viewpoint;
the background view acquisition unit is suitable for acquiring a background texture map and a background depth map corresponding to each viewpoint, and the background texture map and the background depth map corresponding to each viewpoint are obtained by performing time domain filtering on the basis of an original texture map and a corresponding estimated depth map in a reference video frame sequence corresponding to the viewpoint;
and the correction unit is suitable for performing correction processing on the estimated depth map to be processed by adopting the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map.
An embodiment of the present specification further provides a free viewpoint video reconstruction apparatus, including:
the video frame acquisition unit is suitable for acquiring video frames at a plurality of frame moments, wherein the video frames comprise texture maps of a plurality of synchronous viewpoints and depth maps of corresponding viewpoints, the depth maps are obtained by correcting background texture maps and background depth maps of the corresponding viewpoints, and the background texture maps and the background depth maps of the corresponding viewpoints are obtained by performing time domain filtering on an original texture map and a corresponding estimated depth map in a reference video frame sequence;
and the image reconstruction unit is suitable for reconstructing and obtaining the image of the virtual viewpoint according to the position information of the virtual viewpoint and the parameter data corresponding to the video frame based on the texture maps and the corresponding depth maps of the plurality of synchronous viewpoints contained in the video frame.
The present specification further provides an electronic device, which includes a memory and a processor, where the memory stores computer instructions executable on the processor, and the processor executes the computer instructions to perform the steps of the method according to any one of the foregoing embodiments.
The present specification also provides a computer readable storage medium, on which computer instructions are stored, wherein the computer instructions are executed to perform the steps of the method of any one of the foregoing embodiments.
Compared with the prior art, the technical scheme of the embodiment of the specification has the following beneficial effects:
by adopting the scheme of the embodiment of the present specification, for a target depth map to be processed obtained based on texture map estimation, an original texture map in a reference video frame sequence is obtained by obtaining the reference video frame sequence of a corresponding viewpoint, an estimated depth map corresponding to the original texture map in the reference video frame sequence is further obtained, then time-domain filtering is respectively performed on the original texture map and the corresponding estimated depth map in the reference video frame sequence, so as to obtain a background texture map and a background depth map of the corresponding viewpoint, and then the target depth map is corrected by adopting the background texture map and the background depth map of the corresponding viewpoint, so that a corrected depth map can be obtained. In the depth map correction scheme, the original texture map and the corresponding estimated depth map in the reference video frame sequence are subjected to time domain filtering, so that a stable background texture map and a stable background depth map can be effectively extracted, and then the depth value of a background object interfered in the target depth map can be corrected based on two stable background information, namely the background texture map and the background depth map, so that the image quality of the estimated target depth map can be improved, the jitter of the reconstructed free viewpoint video image in the time domain can be reduced, and the quality of the reconstructed free viewpoint video image can be improved.
Furthermore, by obtaining the depth value of any pixel in the target depth map, comparing the depth value with the depth value of the corresponding pixel in the background depth map of the corresponding viewpoint, and selecting the maximum value of the two as the depth value of the corresponding pixel in the corrected depth map, the adoption of the correction mode based on the background depth map can avoid the lack of the background object in the target depth map, thereby improving the image quality of the target depth map and the image quality of the free viewpoint video reconstructed based on the target depth map.
Further, by comparing the pixel values of the pixels in the original texture map corresponding to the target depth map with the pixel values of the corresponding pixels in the background texture map of the corresponding viewpoint, and when the difference between the two values is smaller than a preset threshold, selecting the depth value of the corresponding pixel in the background depth map of the corresponding viewpoint as the depth value of the corresponding pixel in the corrected depth map, and by adopting the method of correcting based on the difference with the background texture map, the depth value of the background part in the target depth map can be directly corrected, so that the image quality of the target depth map can be improved, and the image quality of the free viewpoint video reconstructed based on the target depth map can be improved.
Drawings
Fig. 1 is a schematic diagram of an application specific system of a free viewpoint video presentation in an embodiment of the present specification;
FIG. 2 is a schematic diagram of an interactive interface of a terminal device in an embodiment of the present specification;
FIG. 3 is a schematic diagram of an arrangement of a collecting apparatus in an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an interactive interface of another terminal device in the embodiment of the present specification;
fig. 5 is a schematic diagram of a free viewpoint video data generation process in an embodiment of the present specification;
FIG. 6 is a schematic diagram illustrating the generation and processing of 6DoF video data according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a header file in an embodiment of the present disclosure;
fig. 8 is a schematic diagram of a user side processing 6DoF video data in an embodiment of the present specification;
FIG. 9A is a specific example of a texture map in a sequence of video frames in an embodiment of the present specification;
FIG. 9B is a specific example of a depth map estimated based on the texture map shown in FIG. 9A;
FIG. 10A is a specific example of another texture map in a sequence of video frames in an embodiment of the present specification;
fig. 10B is a specific example of a depth map estimated based on the texture map shown in fig. 10A;
FIG. 11 is a flowchart of a depth map correction method in an embodiment of the present disclosure;
FIG. 12 is a schematic diagram illustrating a depth map correction method for a specific application scenario in an embodiment of the present disclosure;
fig. 13 is a flowchart of a video processing method in an embodiment of the present description;
fig. 14 is a schematic diagram of a structure of a stitched image of a free viewpoint video frame in the embodiment of the present specification;
fig. 15 is a schematic structural diagram of another stitched image of a free viewpoint video frame in the embodiment of the present specification;
fig. 16 is a flowchart of a free viewpoint video reconstruction method in an embodiment of the present specification;
fig. 17 is a schematic structural diagram of a depth map correction apparatus in an embodiment of the present disclosure;
fig. 18 is a schematic structural diagram of a video processing apparatus in an embodiment of the present specification;
fig. 19 is a schematic structural diagram of a free viewpoint video reconstruction apparatus in an embodiment of the present specification;
fig. 20 is a schematic structural diagram of an electronic device in an embodiment of the present specification;
fig. 21 is a schematic structural diagram of a video processing system in an embodiment of the present specification.
Detailed Description
For those skilled in the art to better understand and implement the embodiments in the present specification, the following first provides an exemplary description of an implementation of a free viewpoint video with reference to the drawings and a specific application scenario.
Referring to fig. 1, a specific application system for free viewpoint video presentation in an embodiment of the present invention may include an acquisition system 11 with multiple acquisition devices, a server 12, and a display device 13, where the acquisition system 11 may perform image acquisition on an area to be viewed; the acquisition system 11 or the server 12 may process the acquired multiple texture maps in synchronization, and generate multi-angle free view data capable of supporting the display device 13 to perform virtual viewpoint switching. The display device 13 may present a reconstructed image generated based on the multi-angle free view data, the reconstructed image corresponding to a virtual viewpoint, present reconstructed images corresponding to different virtual viewpoints according to a user instruction, and switch viewing positions and viewing angles.
In a specific implementation, the process of reconstructing the image to obtain the reconstructed image may be implemented by the display device 13, or may be implemented by a device located in a Content Delivery Network (CDN) in an edge computing manner. It is to be understood that fig. 1 is an example only and is not limiting of the acquisition system, the server, the terminal device, and the specific implementation.
With continued reference to fig. 1, the user may view the area to be viewed through the display device 13, in this embodiment, the area to be viewed is a basketball court. As described above, the viewing position and the viewing angle are switchable.
For example, the user may slide on the screen to switch the virtual viewpoint. In an embodiment of the present invention, with combined reference to FIG. 2, when the user's finger slides on the screen along direction D22, the virtual viewpoint used for viewing can be switched. With continued reference to FIG. 3, the position of the virtual viewpoint before sliding may be VP1; after the sliding switches the virtual viewpoint, its position may be VP2. Referring to fig. 4, after sliding the screen, the reconstructed image presented on the screen may be as shown in fig. 4. The reconstructed image can be obtained by image reconstruction based on multi-angle free-view data generated from images acquired by a plurality of acquisition devices in an actual acquisition situation.
It is to be understood that the image viewed before switching may be a reconstructed image. The reconstructed image may be a frame image in a video stream. In addition, the manner of switching the virtual viewpoint according to the user instruction may be various, and is not limited herein.
In a specific implementation, the viewpoint may be represented by coordinates with 6 degrees of freedom (6DoF), where the spatial position of the viewpoint may be represented as (x, y, z) and the viewing angle may be represented as three rotational directions (for example, yaw, pitch and roll).
Accordingly, based on the 6-degree-of-freedom coordinates, a virtual viewpoint, including both a position and a view angle, may be determined.
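As an illustrative aside (not part of the patent text), such a 6DoF viewpoint can be carried in a simple record type; the sketch below is in Python, and the field names, including the yaw/pitch/roll naming of the three rotational directions, are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Viewpoint6DoF:
    # spatial position of the (virtual) viewpoint
    x: float
    y: float
    z: float
    # viewing angle as three rotational directions; naming is illustrative
    yaw: float
    pitch: float
    roll: float

# e.g., a viewpoint 1.6 m above the ground, 5 m back, looking slightly down
court_side = Viewpoint6DoF(x=0.0, y=1.6, z=-5.0, yaw=0.0, pitch=-0.1, roll=0.0)
```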
The virtual viewpoint is a three-dimensional concept, and three-dimensional information is required for generating a reconstructed image. In a specific implementation, the multi-angle freeview data may include depth map data for providing third-dimensional information outside the planar image. The data volume of the depth map data is small compared to other implementations, for example, providing three-dimensional information through point cloud data.
In the embodiment of the invention, the switching of the virtual viewpoint can be performed within a certain range, namely a multi-angle free visual angle range. That is, the position and view angle of the virtual viewpoint can be arbitrarily switched within the multi-angle free view angle range.
The multi-angle free viewing angle range is related to the arrangement of the acquisition devices: the wider the shooting coverage of the acquisition devices, the larger the multi-angle free viewing angle range. The quality of the picture displayed by the terminal device is related to the number of acquisition devices; generally, the more acquisition devices are deployed, the fewer hole regions appear in the displayed picture.
Furthermore, the range of multi-angle freeviews is related to the spatial distribution of the acquisition device. The range of the multi-angle free viewing angle and the interaction mode with the display device at the terminal side can be set based on the spatial distribution relation of the acquisition device.
It can be understood by those skilled in the art that the foregoing embodiments and the corresponding drawings are only exemplary illustrations, and are not limited to the setting of the capturing device and the association relationship between the multi-angle free viewing angle ranges, nor the interaction manner and the display effect of the display device.
With reference to fig. 5, free viewpoint video reconstruction requires texture map acquisition and depth map calculation, and involves three main steps: Multi-Camera Video Capturing, Camera Parameter Estimation, and Depth Map Calculation. For multi-camera video capture, the video captured by the various cameras should be frame-level aligned. Texture Images are obtained through the multi-camera video capture; Camera Parameters, which may include internal parameter data and external parameter data of the cameras, are obtained through calculation of the cameras' intrinsic and extrinsic parameters; and Depth Maps are obtained through the depth map calculation. The multiple synchronized texture maps, the depth maps of the corresponding viewing angles, and the camera parameters together form the 6DoF video data.
In the embodiment of the present specification, a special camera, such as a light field camera, is not required for capturing the video. Likewise, complicated camera calibration prior to acquisition is not required. Multiple cameras can be laid out and arranged to better capture objects or scenes to be photographed.
After the above three steps, the texture maps collected from the multiple cameras, the camera parameters of all cameras, and the depth map of each camera are obtained. These three portions of data may be referred to as data files in multi-angle free-view video data, and may also be referred to as 6-degree-of-freedom video data (6DoF video data). With this data, the user end can generate a virtual viewpoint at any virtual 6-Degree-of-Freedom (DoF) position, thereby providing a 6DoF video experience.
With reference to fig. 6, the 6DoF video data and the indicative data may be compressed and transmitted to the user side, and the user side may obtain the 6DoF expression of the user side according to the received data, that is, the 6DoF video data and the metadata. The indicative data may also be referred to as Metadata (Metadata), where the video data includes texture map and depth map data of each viewpoint corresponding to multiple cameras, and the texture map and the depth map may be stitched according to a certain stitching rule or a stitching mode to form a stitched image.
Referring to fig. 7 in combination, the metadata may be used to describe a data schema of the 6DoF video data, and specifically may include: stitching Pattern metadata (Stitching Pattern metadata) indicating storage rules for pixel data and depth map data of a plurality of texture maps in a stitched image; edge protection metadata (Padding pattern metadata), which may be used to indicate the way edge protection is performed in the stitched image, and Other metadata (Other metadata). The metadata may be stored in a header file, and the specific order of storage may be as shown in FIG. 7, or in other orders.
With reference to fig. 8, the user side obtains 6DoF video data, which includes camera parameters, stitched images (texture map and depth map), and description metadata (metadata), and besides, interactive behavior data of the user side. Through these data, the user side may perform 6DoF Rendering in a Depth Image-Based Rendering (DIBR) manner, so as to generate an Image of a virtual viewpoint at a specific 6DoF position generated according to a user behavior, that is, according to a user instruction, determine a virtual viewpoint at the 6DoF position corresponding to the instruction.
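To make the DIBR step more concrete, the following is a much-simplified sketch of depth-image-based rendering in Python/NumPy: each source pixel is back-projected using its depth value and the source camera parameters, then re-projected into the virtual camera with a z-buffer. All function and parameter names here are assumptions for illustration; a real DIBR renderer would additionally handle hole filling, blending of several source views and edge protection.

```python
import numpy as np

def dibr_forward_warp(texture, depth, K_src, RT_src, K_dst, RT_dst):
    """Warp one source view into a virtual view (simplified sketch).

    texture: HxWx3 texture map; depth: HxW metric depth map;
    K_*: 3x3 intrinsics; RT_*: 3x4 world-to-camera extrinsics [R | t].
    """
    h, w = depth.shape
    out = np.zeros_like(texture)
    zbuf = np.full((h, w), np.inf)

    # pixel grid in homogeneous coordinates, flattened row-major
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3).T

    # back-project to source camera space, then to world space
    cam = np.linalg.inv(K_src) @ pix * depth.reshape(1, -1)
    R, t = RT_src[:, :3], RT_src[:, 3:]
    world = R.T @ (cam - t)

    # project into the virtual camera
    Rv, tv = RT_dst[:, :3], RT_dst[:, 3:]
    cam_v = Rv @ world + tv
    z = cam_v[2]
    uv = (K_dst @ cam_v)[:2] / np.maximum(z, 1e-6)
    u2, v2 = np.round(uv[0]).astype(int), np.round(uv[1]).astype(int)

    ok = (z > 0) & (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    src_u, src_v = us.reshape(-1), vs.reshape(-1)
    for i in np.flatnonzero(ok):          # z-buffered nearest-pixel splat
        if z[i] < zbuf[v2[i], u2[i]]:
            zbuf[v2[i], u2[i]] = z[i]
            out[v2[i], u2[i]] = texture[src_v[i], src_u[i]]
    return out
```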
In general depth map calculation, the depth map is calculated independently for each frame moment. The inventors have found that, even for a stationary background, mismatched depth values may be obtained across frames, which results in visible jitter of the picture in the time domain.
Fig. 9A and fig. 10A are texture maps acquired from the same viewpoint at different frame times in a video frame sequence continuously captured by an acquisition device (camera); for convenience of description they are referred to as texture maps Tm and Tn, respectively. Referring to fig. 9B and fig. 10B, texture map Tm is used to calculate depth map Dm, and texture map Tn is used to calculate depth map Dn. For convenience of description, a depth map obtained by direct depth map calculation is referred to in this specification as an estimated depth map. Referring to fig. 9A, 9B, 10A and 10B, region Q contains a camera fixed to the basketball stand by a swing arm; since this camera does not move at all, it should have completely consistent depth values throughout the video. However, owing to disturbances between video frames, as shown in region Q of fig. 10A, an athlete whose clothing color is very similar to the camera's color disturbs the extraction of the camera's depth values, producing wrong depth values, as shown in region Q of fig. 10B. In fig. 9A, region Q is not disturbed, so the correct depth values are obtained, as shown in region Q of fig. 9B.
As can be seen from the depth maps obtained by the above depth map calculation on a specific video frame sequence, due to complex changes in the video, background objects in some video frames are strongly disturbed, so methods based on the texture map of a single frame may produce wrong depth values, and those wrong depth values ultimately degrade the image quality obtained by DIBR and cause jitter of the free viewpoint video in the time domain.
In view of the foregoing problems, embodiments of the present disclosure provide a depth map correction scheme, which corrects an estimated depth map obtained by directly calculating a depth map, so as to improve image quality of the depth map and improve temporal stability of a free-viewpoint video image. For any viewpoint texture map, the depth map correction can be performed by the method in the embodiment of the present specification, and the following detailed description is made by specific embodiments with reference to the drawings.
As shown in the flowchart of the depth map correction method in fig. 11, the method may specifically include the following steps:
and S111, acquiring a target depth map to be processed, wherein the target depth map is estimated and obtained based on an original texture map of a corresponding frame moment in the video frame sequence of the corresponding viewpoint.
In an implementation, in the scenario shown in FIG. 3, an acquisition system formed by acquisition devices CJ1 to CJ6 placed on site captures video, and videos of the corresponding viewpoints can be obtained. It is to be understood that embodiments of the present description are also suitable when a video frame sequence is acquired from only one viewpoint. Any video frame in the acquired video frame sequence may be referred to as an original texture map; based on the original texture map at the corresponding frame time in the video frame sequence of the corresponding viewpoint, a corresponding depth map can be estimated through depth map calculation, referred to herein, as described above, as an estimated depth map. For convenience of description and understanding, any estimated depth map to be processed is referred to herein as a target depth map.
S112, acquiring a reference video frame sequence of the corresponding viewpoint, wherein video frames in the reference video frame sequence comprise: and (5) original texture maps.
In a specific implementation, a video segment from the same viewpoint as the target depth map may be selected as a reference video frame sequence, and a video frame in the reference video frame sequence may be an original texture map directly acquired.
For the selection of the reference video frame sequence, as a preferred example, a video segment whose original texture maps contain fewer foreground objects than the original texture map corresponding to the target depth map can be selected as the reference video frame sequence. For a basketball game, a video segment from a period with relatively few players on the court can be selected as the reference video frame sequence, for example, a segment shot from the corresponding viewpoint before the formal game begins.
S113, obtaining an estimated depth map corresponding to the original texture map in the reference video frame sequence.
In specific implementation, the estimated depth map corresponding to the original texture map in the reference video frame sequence may be directly obtained, or the corresponding estimated depth map may be obtained by performing depth map calculation on the original texture map at each frame time of the reference video frame sequence. The depth map calculation may be performed by using a known depth map calculation method, and the specific depth map calculation method does not limit the scope of the embodiments of the present disclosure, and detailed descriptions thereof are not provided herein.
S114, respectively carrying out time domain filtering on the original texture map and the corresponding estimated depth map in the reference video frame sequence to obtain the background texture map and the background depth map of the corresponding viewpoint.
In particular implementations, there are a number of ways in which temporal filtering may be implemented.
For example, a mean filtering method, more specifically, an arithmetic mean filtering, a median average filtering, a moving average filtering, or the like may be employed.
As another example, a median filtering method may be employed. Specifically, temporal median filtering may be performed on pixels in the original texture map and pixels in the corresponding estimated depth map in the reference video frame sequence, respectively, to obtain the background texture map and the background depth map of the corresponding viewpoint.
As an alternative example, the original texture map sequence and the corresponding estimated depth map sequence over a time period t1 to t2 may be selected from a video frame sequence X captured from the same viewpoint as the target depth map. The sampled values at each pixel position in the original texture map sequence and in the estimated depth map sequence can be sorted by magnitude, and the middle value is taken, respectively, as the effective value of the corresponding pixel position in the background texture map and the background depth map. To make taking the median convenient, the number of original texture maps and corresponding estimated depth maps sampled between t1 and t2 should be odd, e.g., 3, 5 or 7 consecutive frames. This can be expressed by the formula:

P(x_t) = med({ I_{x,i} | i ∈ [t1, t2] })

where P(x_t) denotes the value at pixel position x in the background texture map or background depth map, I_{x,i} denotes the sequence of values at the same pixel position x in all the original texture maps or estimated depth maps at times t1 to t2, and med denotes taking the median of I_{x,i}.
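The temporal median filter above can be sketched in a few lines of Python/NumPy. This is a minimal illustration under the assumption that the reference frames are already loaded as arrays; the function name is ours, not the patent's.

```python
import numpy as np

def temporal_median_background(textures, depths):
    """Per-pixel temporal median over the reference frames sampled in [t1, t2].

    textures: list of HxWx3 original texture maps; depths: list of HxW
    estimated depth maps for the same frames. Returns (bg_texture, bg_depth).
    """
    # an odd frame count makes the median an actual sampled value
    assert len(textures) % 2 == 1 and len(textures) == len(depths)
    bg_tex = np.median(np.stack(textures, axis=0), axis=0).astype(textures[0].dtype)
    bg_depth = np.median(np.stack(depths, axis=0), axis=0).astype(depths[0].dtype)
    return bg_tex, bg_depth
```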
It is understood that in the specific implementation, other temporal filtering methods, such as clipping filtering, first-order lag filtering, etc., may also be used according to the environmental characteristics involved in the specific video and the specific requirements.
And S115, correcting the target depth map by adopting the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map.
In a specific implementation, the background texture map or the background depth map of the corresponding viewpoint may be used alone to perform the correction processing on the target depth map, or the background texture map and the background depth map of the corresponding viewpoint may be used simultaneously to perform the correction processing on the target depth map, so as to obtain the corrected depth map.
Through steps S112 to S114, time domain background modeling may be implemented to obtain a background texture map and a background depth map, obtain texture map background information and depth map background information that are stable in time domain, and further correct an estimated depth map calculated based on a single frame depth map through the background texture map and the background depth map, so that the stability of a background object in the depth map may be significantly improved.
In the depth map correction scheme, the original texture map and the corresponding estimated depth map in the reference video frame sequence are subjected to time domain filtering, so that a stable background texture map and a stable background depth map can be effectively extracted, and then the depth value of a background object interfered in the target depth map can be corrected based on two stable background information, namely the background texture map and the background depth map, so that the image quality of the estimated target depth map can be improved, the jitter of the reconstructed free viewpoint video image in the time domain can be reduced, and the quality of the reconstructed free viewpoint video image can be improved.
Some specific correction methods that may be employed in step S115 are illustrated below for a better understanding and implementation by those skilled in the art.
In one aspect, the background depth map of the corresponding viewpoint may be used for depth map correction.
As a specific optional example, the depth value of any pixel in the target depth map may be obtained, and compared with the depth value of the corresponding pixel in the background depth map of the corresponding viewpoint, and the maximum value of the two may be selected as the depth value of the corresponding pixel in the corrected depth map. The following can be expressed by the formula:
Depth_out(i, j) = max(Depth_background(i, j), Depth_in(i, j))

where Depth_in(i, j) is the depth value of the pixel with coordinates (i, j) in the target depth map, Depth_background(i, j) is the depth value of the pixel with coordinates (i, j) in the background depth map of the corresponding viewpoint, and Depth_out(i, j) is the depth value of the pixel with coordinates (i, j) in the corrected depth map.
By adopting the correction mode based on the background depth map, the deficiency of background objects in the target depth map can be avoided, so that the image quality of the target depth map and the image quality of the free viewpoint video reconstructed based on the target depth map can be improved.
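As a one-line NumPy sketch of this max-based rule (names illustrative):

```python
import numpy as np

def correct_with_background_depth(depth_in, depth_bg):
    # Depth_out(i, j) = max(Depth_background(i, j), Depth_in(i, j)), element-wise
    return np.maximum(depth_bg, depth_in)
```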
On the other hand, the background texture map of the corresponding viewpoint can be used for depth map correction.
As a specific optional example, comparing the pixel value of the pixel in the original texture map corresponding to the target depth map with the pixel value of the corresponding pixel in the background texture map of the corresponding viewpoint, and if the difference between the two is smaller than a preset threshold, selecting the depth value of the corresponding pixel in the background depth map of the corresponding viewpoint as the depth value of the corresponding pixel in the corrected depth map. The following can be expressed by the formula:
If |Pixel(i, j) - Pixel_background(i, j)| < Thr, then Depth_out(i, j) = Depth_background(i, j)

where Pixel(i, j) denotes the pixel value of the pixel with coordinates (i, j) in the original texture map corresponding to the target depth map, Pixel_background(i, j) denotes the pixel value of the pixel with coordinates (i, j) in the background texture map of the viewpoint corresponding to the target depth map, Thr is a preset difference threshold, Depth_background(i, j) is the depth value of the pixel with coordinates (i, j) in the background depth map of the corresponding viewpoint, and Depth_out(i, j) is the depth value of the pixel with coordinates (i, j) in the corrected depth map.
By adopting the method of correcting based on the difference with the background texture map, the depth value belonging to the background part in the target depth map can be directly corrected, thereby improving the image quality of the target depth map and the image quality of the free viewpoint video reconstructed based on the target depth map.
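A corresponding sketch of the texture-difference rule follows. Summing the absolute difference over color channels is our assumption; the text does not fix a particular way of comparing pixel values.

```python
import numpy as np

def correct_with_background_texture(depth_in, depth_bg, tex_in, tex_bg, thr):
    # treat a pixel as background where its texture is close to the background texture
    diff = np.abs(tex_in.astype(np.int32) - tex_bg.astype(np.int32))
    if diff.ndim == 3:                 # color images: sum the difference over channels
        diff = diff.sum(axis=-1)
    out = depth_in.copy()
    mask = diff < thr                  # |Pixel - Pixel_background| < Thr
    out[mask] = depth_bg[mask]         # take the background depth at these pixels
    return out
```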
Referring to the schematic diagram of the depth map correction method for a specific application scenario shown in fig. 12: based on an original texture map Tx acquired at viewpoint Vx, a corresponding estimated depth map Dx can be obtained through depth map estimation and taken as the target depth map. To correct the target depth map Dx, a reference video frame sequence from the same viewpoint Vx may be obtained, for example original texture maps Tp to Tq. Depth map estimation is performed separately on each of the frames Tp to Tq to obtain corresponding estimated depth maps Dp to Dq; temporal filtering is then performed on the original texture maps Tp to Tq and the estimated depth maps Dp to Dq, respectively, yielding the background texture map Tb and the background depth map Db of viewpoint Vx. Finally, using the background texture map Tb and the background depth map Db, correction processing is performed on the target depth map Dx, and the corrected depth map Dc is output.
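Putting the pieces together for the fig. 12 scenario, here is a hedged end-to-end sketch reusing the functions from the previous snippets; estimate_depth stands in for whatever single-frame depth estimator is used, and the threshold value is arbitrary.

```python
def correct_target_depth(Tx, ref_textures, estimate_depth, thr=30):
    """Model the background from a reference clip, then correct Tx's depth map."""
    ref_depths = [estimate_depth(t) for t in ref_textures]        # Dp .. Dq
    Tb, Db = temporal_median_background(ref_textures, ref_depths)  # background model
    Dx = estimate_depth(Tx)                                        # target depth map
    Dc = correct_with_background_depth(Dx, Db)                     # max-based rule
    return correct_with_background_texture(Dc, Db, Tx, Tb, thr)   # texture-based rule
```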
Similarly, the estimated depth maps corresponding to other original texture maps acquired from viewpoint Vx can also be subjected to depth map correction processing using the background texture map Tb and the background depth map Db.
If the acquisition system includes acquisition devices of multiple viewpoints, for the estimated depth maps corresponding to the original texture maps acquired from other viewpoints except the viewpoint Vx, the background texture map and the background depth map of the corresponding viewpoint may also be used, and the depth map correction processing is performed separately in the manner described in the foregoing embodiment, so as to obtain corresponding corrected depth maps.
In some embodiments of the present description, the target depth maps include multiple groups, and multiple groups of target depth maps are estimated based on original texture maps synchronously acquired from multiple viewpoints, so that the depth map correction method according to the embodiments of the present description may be adopted to respectively perform correction processing on the multiple groups of target depth maps. Specifically, the multiple sets of target depth maps are corrected by respectively using the background texture map and the background depth map of the corresponding viewpoint, so as to obtain multiple sets of corrected depth maps.
In order to improve the data processing speed, the background texture map and the background depth map of the corresponding viewpoints are respectively adopted for the multiple groups of target depth maps, and correction processing can be carried out in a batch processing or parallel processing mode to obtain multiple groups of corrected depth maps.
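Since the per-viewpoint corrections are independent, they can be fanned out across processes. The helper below is a sketch of such a parallel mode, assuming the correction functions from the earlier snippets.

```python
from concurrent.futures import ProcessPoolExecutor

def _correct_one(args):
    depth_in, tex_in, depth_bg, tex_bg, thr = args
    fixed = correct_with_background_depth(depth_in, depth_bg)
    return correct_with_background_texture(fixed, depth_bg, tex_in, tex_bg, thr)

def correct_views_in_parallel(per_view_inputs, thr=30):
    # per_view_inputs: one (depth_in, tex_in, depth_bg, tex_bg) tuple per viewpoint
    jobs = [(d, t, db, tb, thr) for d, t, db, tb in per_view_inputs]
    with ProcessPoolExecutor() as pool:
        return list(pool.map(_correct_one, jobs))
```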
In order to improve the temporal stability of the depth maps in a free viewpoint video and the quality of the free viewpoint video images reconstructed based on them, embodiments of the present specification further provide a corresponding video processing method, which, depending on the specific application scenario, may be executed by the server 12 shown in fig. 1, or by the data processing device A2 or the cloud server cluster A3 shown in fig. 21. Referring to the flow chart of the video processing method shown in fig. 13, the specific steps are as follows:
s131, acquiring a synchronous video frame sequence of multiple viewpoints, wherein video frames in the synchronous video frame sequence of multiple viewpoints are synchronously acquired according to a time sequence.
As described in the foregoing embodiment, a plurality of acquisition devices are arranged at different positions in a field to form an acquisition array, and the plurality of acquisition devices in the acquisition array can acquire images synchronously to obtain video frame sequences of corresponding viewpoints respectively.
And S132, respectively estimating and obtaining an estimated depth map of a corresponding viewpoint as a target depth map to be processed for a texture map at any frame time in the video frame sequence of each viewpoint.
S133, obtaining a background texture map and a background depth map corresponding to each viewpoint, wherein the background texture map and the background depth map corresponding to each viewpoint are obtained by performing time domain filtering respectively based on an original texture map and a corresponding estimated depth map in a reference video frame sequence of the corresponding viewpoint.
In a specific implementation, for each viewpoint, a corresponding background texture map and a background depth map may be obtained, and specifically, the background texture map and the background depth map may be obtained by performing temporal filtering based on an original texture map and a corresponding estimated depth map in a reference video frame sequence corresponding to the viewpoint. Regarding the selection of the reference video frame sequence of each view, and the specific manner of temporal filtering, etc., all can be described with reference to the foregoing embodiments.
And S134, correcting the target depth map by adopting the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map.
Through the step S133, time domain background modeling of each viewpoint can be realized, a stable background texture map and a background depth map corresponding to the viewpoint are obtained, and stable texture background information and depth background information corresponding to the viewpoint are obtained, so that interference of various foreground objects in a depth map estimation process based on a single-frame original texture map can be avoided, the quality of the obtained depth map is improved, the time domain stability of the free viewpoint video is improved, the situation of video image jitter is avoided, and the image quality of the reconstructed free viewpoint video is improved.
For the target depth map, performing correction processing by using the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map, which may specifically adopt at least one of the following manners:
first, correction processing may be performed based on the background depth map of the corresponding viewpoint as follows: and obtaining the depth value of any pixel in the target depth map, comparing the depth value with the depth value of the corresponding pixel in the corresponding background depth map, and selecting the maximum value of the two as the depth value of the corresponding pixel in the corrected depth map.
Secondly, the background texture map based on the corresponding viewpoint can be corrected in the following way: and comparing the pixel value of the corresponding pixel in the original texture map corresponding to the target depth map with the pixel value of the corresponding pixel in the background texture map of the corresponding viewpoint, and selecting the depth value of the corresponding pixel in the corresponding background depth map as the depth value of the corresponding pixel in the corrected depth map when the difference value of the two is less than a preset threshold value.
In specific implementation, for scenarios with strict delay requirements such as live or quasi-live broadcast, in order to increase the data processing speed, the target depth maps of the respective viewpoints, or even the pixels within each target depth map, may be corrected in a batch processing or parallel processing manner.
After the depth maps are corrected, the texture map of each viewpoint at the corresponding frame time and the corresponding corrected depth map may be stitched to obtain a stitched image, as in the schematic structural diagram of the stitched image shown in fig. 14, where the top half of the stitched image is the texture map region and the bottom half is the depth map region. After the depth maps are corrected, in order to improve the quality of the reconstructed video image as much as possible under limited bandwidth resources, the corrected depth maps may also be downsampled, and the original texture map of each viewpoint and the downsampled depth map of the corresponding viewpoint are stitched according to a preset stitching rule to form a stitched image, as shown in fig. 15, where the depth maps are downsampled to 1/4.
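A hedged sketch of assembling such a stitched image: texture maps side by side in the top half, downsampled depth maps packed into the bottom half. The exact layout, the padding, and reading "downsampled to 1/4" as half resolution per axis (a quarter of the pixels) are all assumptions, since the text leaves the stitching rule open; depth maps are assumed already quantized to the texture dtype.

```python
import numpy as np

def stitch_frame(textures, depths, factor=2):
    # texture region: synchronized texture maps side by side
    tex_row = np.concatenate(textures, axis=1)
    # depth region: nearest-neighbour downsampling by `factor` per axis
    small = [d[::factor, ::factor] for d in depths]
    depth_row = np.concatenate(small, axis=1)
    # replicate depth to 3 channels and pad to the texture row's width
    depth_rgb = np.repeat(depth_row[..., None], 3, axis=-1).astype(tex_row.dtype)
    pad = tex_row.shape[1] - depth_rgb.shape[1]
    depth_rgb = np.pad(depth_rgb, ((0, 0), (0, pad), (0, 0)))
    return np.concatenate([tex_row, depth_rgb], axis=0)
```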
It can be understood that the texture maps of different viewpoints and the depth maps of corresponding viewpoints may be arranged according to a preset arrangement rule and spliced together. For example, the position of a specific viewpoint image in the stitched image and the corresponding relationship between the texture map region and the depth map region of the overall stitched image may be set according to the positional relationship of different viewpoints. The embodiments of the present description do not limit the specific stitching rules for stitching images.
After the spliced image is obtained, in order to save transmission bandwidth, the spliced image of each video frame can be transmitted after being subjected to video compression.
The display device or the terminal device adapted to display the image may perform free viewpoint video reconstruction based on the received free viewpoint video.
Referring to the flowchart of the free viewpoint video reconstruction method shown in fig. 16, in some embodiments of the present description, the following steps may be specifically adopted to perform video reconstruction:
s161, obtaining video frames of multiple frame moments, where the video frames include texture maps of multiple synchronous viewpoints and depth maps of corresponding viewpoints, the depth maps are obtained by performing correction processing on a background texture map and a background depth map of the corresponding viewpoints, and the background texture map and the background depth map of the corresponding viewpoints are obtained by performing time-domain filtering on an original texture map and a corresponding estimated depth map in a reference video frame sequence.
For a specific correction method, reference may be made to the foregoing embodiment example, and for a specific applicable temporal filtering manner, reference may also be made to the foregoing specific example in the depth map correction method embodiment, which is not described in detail here.
And S162, reconstructing to obtain an image of the virtual viewpoint according to the position information of the virtual viewpoint and the parameter data corresponding to the video frame based on the texture maps and the corresponding depth maps of the plurality of synchronous viewpoints contained in the video frame.
In a specific implementation, the position information of the virtual viewpoint may be obtained based on an interaction behavior of a user, or may be pre-specified by an upstream device such as a front-end server, a cloud server, or a broadcast guide processing device, and is indicated by metadata in the free viewpoint video. The parameter data corresponding to the video frame can also be transmitted to the display equipment or the terminal equipment at the user side through the free viewpoint video.
In a specific implementation, according to the position information of the virtual viewpoint, the texture map and the corresponding depth map of a part of viewpoints included in the video frame may be used to complete image reconstruction of the virtual viewpoint, or the texture map and the corresponding depth map of all viewpoints included in the video frame may be used to perform image reconstruction of the virtual viewpoint.
By adopting the free viewpoint video reconstruction method, the depth map in the video frame is corrected based on the background texture map and the background depth map of the corresponding viewpoint, and the background texture map and the background depth map of the corresponding viewpoint are respectively obtained by performing time domain filtering on the original texture map and the corresponding estimated depth map in the reference video frame sequence, so that the depth map has higher time domain stability, the quality of the image of the free viewpoint video reconstructed based on the depth map is correspondingly improved, and the jitter phenomenon of the free viewpoint video image can be reduced.
The embodiments of the present disclosure further provide apparatuses corresponding to the foregoing method embodiments, described below in turn. Those skilled in the art will understand that each of the following apparatuses may perform depth map correction using the foregoing method embodiments, or perform image reconstruction based on the corrected depth maps so obtained.
Referring to the schematic structural diagram shown in fig. 17, in some embodiments of the present description, the depth map correction device 170 may include: a target depth map acquisition unit 171, a reference view acquisition unit 172, a background view filtering unit 173, and a correction unit 174, wherein:
the target depth map obtaining unit 171 is adapted to obtain a target depth map to be processed, where the target depth map is estimated based on an original texture map of a corresponding viewpoint;
the reference view acquiring unit 172 is adapted to acquire a reference video frame sequence of the corresponding viewpoint, where the video frames in the reference video frame sequence include original texture maps, and to obtain the estimated depth maps corresponding to the original texture maps in the reference video frame sequence;
the background view filtering unit 173 is adapted to perform temporal filtering on the original texture map and the corresponding estimated depth map in the reference video frame sequence, respectively, to obtain the background texture map and the corresponding background depth map of the corresponding viewpoint;
the correcting unit 174 is adapted to perform correction processing on the target depth map by using the background texture map and the background depth map of the corresponding viewpoint, so as to obtain a corrected depth map.
With this depth map correction device, time-domain filtering of the original texture maps and the corresponding estimated depth maps in the reference video frame sequence effectively extracts a stable background texture map and a stable background depth map. Based on these two kinds of stable background information, the depth values of disturbed background objects in the target depth map can be corrected, which improves the quality of the estimated target depth map, reduces temporal jitter in the reconstructed free viewpoint video images, and thus improves their quality.
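The two stages this device performs can be sketched in Python/NumPy as follows: per-pixel temporal median filtering to extract the background maps, and the two correction rules described in the foregoing method embodiments (take the larger of the target and background depth values per pixel, and where the texture is close to the background texture, adopt the background depth). The array shapes, function names, depth-value convention, and threshold are illustrative assumptions:

```python
import numpy as np

def extract_background(textures, depths):
    """Per-pixel temporal median over a reference video frame sequence.

    textures: (T, H, W, 3) original texture maps of one viewpoint
    depths:   (T, H, W)    corresponding estimated depth maps
    Returns the background texture map and the background depth map.
    """
    return np.median(textures, axis=0), np.median(depths, axis=0)

def correct_depth(target_depth, texture, bg_texture, bg_depth, thresh=10.0):
    """Correct one target depth map using the stable background maps.

    Rule 1: per pixel, keep the larger of the target depth value and
            the background depth value.
    Rule 2: where the texture differs from the background texture by
            less than `thresh` (summed over channels), treat the pixel
            as background and adopt the background depth value.
    """
    corrected = np.maximum(target_depth, bg_depth)
    diff = np.abs(texture.astype(np.float32)
                  - bg_texture.astype(np.float32)).sum(axis=2)
    return np.where(diff < thresh, bg_depth, corrected)
```

Intuitively, rule 1 pulls a depth estimate that was disturbed by a passing foreground object back toward the stable background value, and rule 2 trusts the background depth wherever the texture shows that the background is actually visible at that pixel.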
The embodiments of the present specification further provide a video processing apparatus. Referring to the schematic structural diagram shown in fig. 18, the video processing apparatus 180 may include: a video acquisition unit 181, a target depth map acquisition unit 182, a background view acquisition unit 183, and a correction unit 184, wherein:
the video acquiring unit 181 is adapted to acquire video frame sequences of multiple synchronized viewpoints, where the video frames in the sequences are acquired synchronously in time order;
the target depth map obtaining unit 182 is adapted to obtain, for the texture map at any frame moment in each viewpoint's video frame sequence, the estimated depth map of the corresponding viewpoint as the target depth map to be processed;
the background view acquiring unit 183 is adapted to acquire a background texture map and a background depth map corresponding to each viewpoint, where the background texture map and the background depth map corresponding to each viewpoint are obtained by performing time-domain filtering based on an original texture map and a corresponding estimated depth map in a reference video frame sequence of the corresponding viewpoint;
the correcting unit 184 is adapted to perform correction processing on the estimated depth map to be processed by using the background texture map and the background depth map of the corresponding viewpoint, so as to obtain a corrected depth map.
With this video processing device, the background view acquiring unit obtains a stable background texture map and background depth map for each viewpoint, i.e. stable texture and depth background information of the corresponding viewpoint, and the correction unit corrects the target depth map according to them. This avoids the interference of foreground objects that arises when a depth map is estimated from a single original texture map frame, improves the quality of the resulting depth maps, improves the temporal stability of the free viewpoint video and avoids video image jitter, and thus improves the image quality of the reconstructed free viewpoint video. A per-frame pipeline tying the units together is sketched below.
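Such a per-frame pipeline over all synchronized viewpoints might look like the following sketch; the sequence objects with .textures and .depths arrays, and the hypothetical extract_background and correct_depth helpers from the earlier example, are assumptions of this illustration:

```python
def process_frame(viewpoint_sequences, t, ref_len=30):
    """Correct the estimated depth map of every viewpoint at frame t.

    viewpoint_sequences: one object per viewpoint, holding
        .textures -- (T, H, W, 3) original texture maps
        .depths   -- (T, H, W)    estimated depth maps
    The preceding ref_len frames serve as the reference sequence
    (assumes t >= ref_len; a real system might cache the background
    maps instead of refiltering them for every frame).
    """
    corrected = []
    for seq in viewpoint_sequences:
        bg_tex, bg_dep = extract_background(
            seq.textures[t - ref_len:t], seq.depths[t - ref_len:t])
        corrected.append(correct_depth(
            seq.depths[t], seq.textures[t], bg_tex, bg_dep))
    return corrected
```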
Referring to the schematic structural diagram shown in fig. 19, in some embodiments of the present specification, the free viewpoint video reconstruction apparatus 190 may include: a video frame acquisition unit 191 and an image reconstruction unit 192, wherein:
the video frame obtaining unit 191 is adapted to obtain video frames at multiple frame moments, where the video frames include texture maps of multiple synchronized viewpoints and depth maps of the corresponding viewpoints; the depth maps are obtained by correction processing using the background texture map and background depth map of the corresponding viewpoint, which are in turn obtained by time-domain filtering of the original texture maps and the corresponding estimated depth maps in a reference video frame sequence, respectively;
the image reconstructing unit 192 is adapted to reconstruct an image of the virtual viewpoint, based on the texture maps and corresponding depth maps of the multiple synchronized viewpoints contained in the video frame, according to the position information of the virtual viewpoint and the parameter data corresponding to the video frame.
With this free viewpoint video reconstruction device, the depth maps in the video frames acquired by the video frame acquisition unit have been corrected using the background texture map and background depth map of the corresponding viewpoint, which are obtained by time-domain filtering of the original texture maps and the corresponding estimated depth maps in a reference video frame sequence. The depth maps therefore have higher temporal stability, the image quality of the free viewpoint video reconstructed from them improves correspondingly, and jitter in the free viewpoint video images can be reduced.
The embodiments of the present disclosure also provide a computer-readable storage medium storing computer instructions which, when executed, perform the steps of the methods of any of the foregoing embodiments; the foregoing embodiments may be consulted for details, which are not repeated here.
In particular implementations, the computer-readable storage medium may be a variety of suitable readable storage media such as an optical disk, a mechanical hard disk, a solid state disk, and so on.
The present specification further provides an electronic device, referring to a schematic structural diagram of the electronic device shown in fig. 20, where the electronic device 200 may include a memory 201 and a processor 202, where the memory 201 stores computer instructions executable on the processor 202, and when the processor executes the computer instructions, the steps of the method in any one of the foregoing embodiments may be executed.
The electronic device may also include other electronic components or assemblies, depending on where the electronic device sits in the overall video processing system.
For example, with continued reference to fig. 20, the electronic device 200 may further include a communication component 203, which may communicate with the acquisition system or a cloud server to obtain a video frame sequence containing the original texture maps, or to directly obtain the estimated depth map of the corresponding viewpoint produced by depth map computation, to serve as the target depth map for depth map correction. Alternatively, if the depth maps in the video frames acquired by the communication component 203 have already been post-processed with the correction methods of the embodiments of this specification, the processor 202 may perform free viewpoint video reconstruction based on those video frames and the virtual viewpoint position.
As another example, in some electronic devices, with continued reference to fig. 20, the electronic device 200 may further include a display component 204 (e.g., a display, a touch screen, a projector) to display the reconstructed video image.
In some embodiments of the present description, the memory, processor, communication component, and display component may communicate over a bus network.
In a specific implementation, the communication component 203, the display component 204, and the like may be components disposed inside the electronic device 200, or may be external devices connected through an expansion component such as an expansion interface, a docking station, or an extension cable.
In a specific implementation, the processor 202 may be implemented by any one of, or cooperatively by several of, a Central Processing Unit (CPU) (e.g., a single-core or multi-core processor), a CPU group, a Graphics Processing Unit (GPU), an Artificial Intelligence (AI) chip, a Field Programmable Gate Array (FPGA) chip, and the like.
In a specific implementation, when large numbers of depth maps must be processed, the processing may be implemented cooperatively by an electronic device cluster formed of multiple electronic devices in order to reduce processing delay, for example by dispatching the independent corrections in parallel as sketched below.
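As a single-machine sketch of such parallelism (a cluster would partition the work across devices along the same lines), the per-viewpoint corrections are independent and can be dispatched concurrently; correct_depth is the hypothetical helper sketched earlier:

```python
from concurrent.futures import ProcessPoolExecutor

def correct_all_parallel(jobs, max_workers=8):
    """Run correct_depth over many independent correction jobs.

    jobs: iterable of (target_depth, texture, bg_texture, bg_depth)
    tuples, e.g. one per viewpoint or per frame moment.
    """
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(correct_depth, *job) for job in jobs]
        return [f.result() for f in futures]
```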
For better understanding and implementation by those skilled in the art, a specific application scenario is described below. Fig. 21 shows the schematic structure of a video processing system deployed for a basketball game: the video processing system A0 includes a capture array A1 composed of multiple capture devices, a data processing device A2, a cloud server cluster A3, a play control device A4, a play terminal A5, and an interaction terminal A6.
Referring to fig. 21, the basketball hoop on the left is taken as the core viewpoint, and a sector region centered on the core viewpoint and lying in the same plane is taken as the preset multi-angle free viewing angle range. The capture devices in the capture array A1 can be arranged fan-wise at different positions of the on-site capture area according to this preset range, and can synchronously capture video data streams from their respective angles in real time.
In particular implementations, the capture devices may also be placed in the ceiling area of the basketball venue, on the basketball stand, or the like. The capture devices may be arranged along a straight line, a fan, an arc, a circle, or irregularly; the specific arrangement may be set according to factors such as the site environment, the number and characteristics of the capture devices, and the imaging requirements. A capture device may be any device with a camera function, such as an ordinary camera, a mobile phone, or a professional camera.
So as not to interfere with the work of the capture devices, the data processing device A2 may be located in a non-capture area of the site and may be regarded as an on-site server. The data processing device A2 may send a stream-pulling instruction to each capture device in the capture array A1 through a wireless local area network, and each capture device, based on that instruction, transmits its video data stream to the data processing device A2 in real time through the switch A7. The capture array A1 and the switch A7 together form the acquisition system.
When the data processing device A2 receives a video frame capture instruction, it extracts synchronized video frames at the specified frame moment from the multiple received video data streams and uploads the synchronized video frames so obtained to the server cluster A3 in the cloud.
Correspondingly, the cloud server cluster A3 takes the received original texture maps of the multiple synchronized video frames as an image combination, determines the parameter data corresponding to the image combination and the estimated depth map corresponding to each original texture map in it, and performs frame image reconstruction along the obtained virtual viewpoint path based on the parameter data of the image combination, the pixel data of its texture maps, and the depth data of the corresponding depth maps, obtaining the corresponding multi-angle free-view video data.
As a depth map post-processing step, the depth map correction method introduced in the foregoing embodiment of the present specification may be used to perform depth map correction on the estimated depth map of the corresponding viewpoint.
The servers may be placed in the cloud, and, to process data in parallel more quickly, the cloud server cluster A3 may be composed of several different servers or server groups according to the kind of data to be processed.
For example, the cloud server cluster A3 may include: a first cloud server A31, a second cloud server A32, a third cloud server A33, and a fourth cloud server A34. The first cloud server A31 may be configured to determine the parameter data corresponding to the image combination; the second cloud server A32 may be configured to determine the estimated depth map of each viewpoint's original texture map in the image combination and to perform the depth map correction processing; the third cloud server A33 may perform frame image reconstruction with a Depth Image Based Rendering (DIBR) algorithm, according to the position information of the virtual viewpoint and based on the parameter data, texture maps, and depth maps of the image combination, to obtain the image of the virtual viewpoint; and the fourth cloud server A34 may be configured to generate the free viewpoint video (multi-angle free-view video).
It can be understood that each of the first cloud server A31, the second cloud server A32, the third cloud server A33, and the fourth cloud server A34 may itself be a server group composed of a server array or a server sub-cluster, which is not limited by the embodiments of the present invention.
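As a heavily simplified illustration of a DIBR-style warp (not the actual implementation of the third cloud server): assume rectified cameras, a virtual camera displaced along the shared x-axis by `baseline`, and 8-bit depth codes decoded linearly to metric depth in [z_near, z_far] with z_near > 0. Real DIBR additionally blends several reference views, resolves overlapping writes by depth order, and fills disocclusion holes:

```python
import numpy as np

def dibr_warp(texture, depth, focal, baseline, z_near, z_far):
    """Forward-warp one reference view to a horizontally displaced
    virtual camera. texture: (H, W, 3) uint8; depth: (H, W) uint8.
    Disoccluded pixels are left black in this sketch."""
    h, w = depth.shape
    z = z_near + depth.astype(np.float32) / 255.0 * (z_far - z_near)
    disparity = np.round(focal * baseline / z).astype(np.int64)
    cols = np.arange(w)[None, :] - disparity           # destination columns
    rows = np.broadcast_to(np.arange(h)[:, None], (h, w))
    valid = (cols >= 0) & (cols < w)                   # clip to the frame
    out = np.zeros_like(texture)
    out[rows[valid], cols[valid]] = texture[valid]
    return out
```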
Then, the play control device A4 may insert the received free viewpoint video frames into the video stream to be played, and the play terminal A5 receives the stream to be played from the play control device A4 and plays it in real time. The play control device A4 may be a manual broadcast directing device or a virtual one; in a specific implementation, a dedicated server capable of automatically switching video streams may be set up as a virtual play control device to control the data sources. A broadcast directing device such as a director station may serve as one of the play control devices A4 in embodiments of the present invention.
The interaction terminal A6 may play the free viewpoint video based on user interaction.
It can be understood that the capture devices in the capture array A1 and the data processing device A2 may be connected through the switch A7 and/or a local area network; the numbers of play terminals A5 and interaction terminals A6 may each be one or more, and a play terminal A5 and an interaction terminal A6 may be the same terminal device; the data processing device A2 may be placed in a non-capture area of the site or in the cloud depending on the scenario; and the server cluster A3 and the play control device A4 may be placed in a non-capture area of the site, in the cloud, or at the terminal access side depending on the scenario. This embodiment is not intended to limit the specific implementation or the protection scope of the present invention.
Although the embodiments of the present invention are disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected by one skilled in the art without departing from the spirit and scope of the embodiments of the invention as defined in the appended claims.

Claims (15)

1. A depth map correction method, comprising:
acquiring a target depth map to be processed, wherein the target depth map is estimated and obtained based on an original texture map of a corresponding frame moment in a video frame sequence of a corresponding viewpoint;
acquiring a reference video frame sequence of corresponding viewpoints, wherein video frames in the reference video frame sequence comprise: an original texture map;
obtaining an estimated depth map corresponding to an original texture map in a reference video frame sequence;
respectively carrying out time domain filtering on an original texture map and a corresponding estimated depth map in the reference video frame sequence to obtain a background texture map and a background depth map of the corresponding viewpoint;
and correcting the target depth map by adopting the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map.
2. The method of claim 1, wherein the temporally filtering the original texture map and the corresponding estimated depth map in the sequence of reference video frames to obtain the background texture map and the background depth map of the corresponding viewpoint respectively comprises:
and respectively carrying out time domain median filtering on pixels in the original texture map and pixels in the corresponding estimated depth map in the reference video frame sequence to obtain a background texture map and a background depth map of the corresponding viewpoint.
3. The method according to claim 1 or 2, wherein the correcting the target depth map by using the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map includes at least one of:
acquiring the depth value of any pixel in the target depth map, comparing it with the depth value of the corresponding pixel in the background depth map of the corresponding viewpoint, and selecting the larger of the two as the depth value of the corresponding pixel in the corrected depth map;
and comparing the pixel value of the corresponding pixel in the original texture map corresponding to the target depth map with the pixel value of the corresponding pixel in the background texture map of the corresponding viewpoint, and, when the difference between the two is smaller than a preset threshold, selecting the depth value of the corresponding pixel in the background depth map of the corresponding viewpoint as the depth value of the corresponding pixel in the corrected depth map.
4. The method according to claim 1 or 2, wherein the original texture maps in the reference video frame sequence contain fewer foreground objects than the original texture map corresponding to the target depth map.
5. The method of claim 1, wherein the target depth maps comprise a plurality of sets of target depth maps estimated based on raw texture maps acquired simultaneously from a plurality of viewpoints.
6. The method of claim 5, wherein the performing a correction process on the target depth map by using the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map comprises:
and performing correction processing on the multiple sets of target depth maps respectively using the background texture maps and background depth maps of the corresponding viewpoints, to obtain multiple sets of corrected depth maps.
7. The method of claim 6, wherein the performing correction processing on the multiple sets of target depth maps by respectively using the background texture map and the background depth map of the corresponding viewpoints to obtain multiple sets of corrected depth maps comprises:
and performing correction processing on the multiple sets of target depth maps in parallel, respectively using the background texture maps and background depth maps of the corresponding viewpoints, to obtain multiple sets of corrected depth maps.
8. A video processing method, comprising:
acquiring a video frame sequence of a plurality of synchronous viewpoints, wherein video frames in the video frame sequence of the plurality of viewpoints are synchronously acquired according to a time sequence;
for the texture map at any frame moment in the video frame sequence of each viewpoint, respectively estimating and obtaining an estimated depth map of the corresponding viewpoint as a target depth map to be processed;
acquiring a background texture map and a background depth map corresponding to each viewpoint, wherein the background texture map and the background depth map corresponding to each viewpoint are obtained by performing time domain filtering respectively based on an original texture map and a corresponding estimated depth map in a reference video frame sequence of the corresponding viewpoint;
and correcting the target depth map by adopting the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map.
9. The method of claim 8, wherein the performing a correction process on the target depth map by using the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map comprises:
acquiring the depth value of any pixel in the target depth map, comparing it with the depth value of the corresponding pixel in the corresponding background depth map, and selecting the larger of the two as the depth value of the corresponding pixel in the corrected depth map;
and comparing the pixel value of the corresponding pixel in the original texture map corresponding to the target depth map with the pixel value of the corresponding pixel in the background texture map of the corresponding viewpoint, and, when the difference between the two is smaller than a preset threshold, selecting the depth value of the corresponding pixel in the corresponding background depth map as the depth value of the corresponding pixel in the corrected depth map.
10. A free viewpoint video reconstruction method, comprising:
acquiring video frames of a plurality of frame moments, wherein the video frames comprise texture maps of a plurality of synchronous viewpoints and depth maps of corresponding viewpoints, the depth maps are obtained by correction processing using the background texture maps and background depth maps of the corresponding viewpoints, and the background texture maps and background depth maps of the corresponding viewpoints are obtained by respectively performing time domain filtering on original texture maps and corresponding estimated depth maps in a reference video frame sequence;
and reconstructing to obtain an image of the virtual viewpoint according to the position information of the virtual viewpoint and the parameter data corresponding to the video frame based on the texture maps and the corresponding depth maps of the multiple synchronous viewpoints contained in the video frame.
11. A depth map correction apparatus, comprising:
the target depth map acquisition unit is suitable for acquiring a target depth map to be processed, and the target depth map is estimated based on an original texture map of a corresponding viewpoint;
a reference view acquisition unit, adapted to acquire a reference video frame sequence of the corresponding viewpoint, the video frames in the reference video frame sequence comprising original texture maps, and to obtain the estimated depth maps corresponding to the original texture maps in the reference video frame sequence;
a background view filtering unit, adapted to perform time-domain filtering on the original texture map and the corresponding estimated depth map in the reference video frame sequence, respectively, to obtain a background texture map and a corresponding background depth map of the corresponding viewpoint;
and the correction unit is suitable for correcting the target depth map by adopting the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map.
12. A video processing apparatus, comprising:
the video acquisition unit is suitable for acquiring a plurality of synchronous viewpoint video frame sequences, and video frames in the plurality of viewpoint video frame sequences are synchronously acquired according to a time sequence;
the target depth map acquisition unit is suitable for obtaining, for the texture map at any frame moment in the video frame sequence of each viewpoint, the estimated depth map of the corresponding viewpoint as the target depth map to be processed;
the background view acquisition unit is suitable for acquiring a background texture map and a background depth map corresponding to each viewpoint, and the background texture map and the background depth map corresponding to each viewpoint are obtained by performing time domain filtering on the basis of an original texture map and a corresponding estimated depth map in a reference video frame sequence corresponding to the viewpoint;
and the correction unit is suitable for performing correction processing on the target depth map by adopting the background texture map and the background depth map of the corresponding viewpoint to obtain a corrected depth map.
13. A free viewpoint video reconstruction apparatus, comprising:
the video frame acquisition unit is suitable for acquiring video frames at a plurality of frame moments, wherein the video frames comprise texture maps of a plurality of synchronous viewpoints and depth maps of corresponding viewpoints, the depth maps are obtained by correction processing using the background texture maps and background depth maps of the corresponding viewpoints, and the background texture maps and background depth maps of the corresponding viewpoints are obtained by performing time domain filtering on the original texture maps and corresponding estimated depth maps in a reference video frame sequence, respectively;
and the image reconstruction unit is suitable for reconstructing and obtaining the image of the virtual viewpoint according to the position information of the virtual viewpoint and the parameter data corresponding to the video frame based on the texture maps and the corresponding depth maps of the plurality of synchronous viewpoints contained in the video frame.
14. An electronic device comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the steps of the method of any one of claims 1 to 7, 8 or 9, or 10.
15. A computer readable storage medium having computer instructions stored thereon, wherein the computer instructions when executed perform the steps of the method of any one of claims 1 to 7, 8 or 9, or 10.
CN202010740742.3A 2020-07-28 2020-07-28 Depth map correction method, video processing method, video reconstruction method and related devices Pending CN114007058A (en)

Priority Applications (1)

CN202010740742.3A (filed 2020-07-28, priority 2020-07-28): Depth map correction method, video processing method, video reconstruction method and related devices

Publications (1)

Publication Number: CN114007058A; Publication Date: 2022-02-01

Family

ID=79920607


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101771893A (en) * 2010-01-05 2010-07-07 浙江大学 Video frequency sequence background modeling based virtual viewpoint rendering method
CN101873509A (en) * 2010-06-30 2010-10-27 清华大学 Method for eliminating background and edge shake of depth map sequence
CN102622768A (en) * 2012-03-14 2012-08-01 清华大学 Depth-map gaining method of plane videos
WO2017201751A1 (en) * 2016-05-27 2017-11-30 北京大学深圳研究生院 Hole filling method and device for virtual viewpoint video or image, and terminal


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination