CN113542721B - Depth map processing method, video reconstruction method and related devices - Google Patents

Depth map processing method, video reconstruction method and related devices

Info

Publication number
CN113542721B
CN113542721B (application CN202010312853.4A)
Authority
CN
China
Prior art keywords
depth map
depth
processed
values
video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010312853.4A
Other languages
Chinese (zh)
Other versions
CN113542721A
Inventor
盛骁杰
魏开进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority application: CN202010312853.4A (granted as CN113542721B)
PCT application: PCT/CN2021/088024 (WO2021213301A1)
Publication of CN113542721A
Application granted
Publication of CN113542721B
Legal status: Active
Anticipated expiration


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/128 Adjusting depth or disparity
    • H04N13/20 Image signal generators
    • H04N13/271 Image signal generators wherein the generated image signals comprise depth maps or disparity maps
    • H04N13/282 Image signal generators for generating image signals corresponding to three or more geometrical viewpoints, e.g. multi-view systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/70
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G06T2207/20 Special algorithmic details
    • G06T2207/20172 Image enhancement details
    • G06T2207/20182 Noise reduction or smoothing in the temporal domain; Spatio-temporal filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Generation (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A depth map processing method, a video reconstruction method and related devices are disclosed. The method includes: obtaining a depth map to be processed from an image combination of a current video frame of a multi-angle free view, where the image combination of the current video frame of the multi-angle free view includes a plurality of angle-synchronized texture maps and depth maps having a corresponding relationship; acquiring a video frame sequence of a preset window in the time domain that contains the current video frame; acquiring a window filter coefficient value corresponding to each video frame in the video frame sequence, where the window filter coefficient value is generated from weight values of at least two dimensions, including a first filter coefficient weight value corresponding to the pixel confidence; and filtering the pixel at the corresponding position in the depth map to be processed according to a preset filtering mode based on the window filter coefficient values corresponding to the respective video frames, to obtain the filtered depth value of the pixel at the corresponding position in the depth map to be processed. With this solution, the stability of the depth map in the time domain can be improved.

Description

Depth map processing method, video reconstruction method and related devices
Technical Field
The embodiments of the present specification relate to the technical field of video processing, and in particular to a depth map processing method, a video reconstruction method and related devices.
Background
The 6 Degrees of Freedom (6DoF) technology provides a high-degree-of-freedom viewing experience: during viewing, a user can adjust the viewing angle through interactive operations, so that the content can be watched from the desired viewing angle.
In a wide range of scenarios, such as sports events, achieving high-degree-of-freedom viewing through Depth Image Based Rendering (DIBR) techniques is a solution with great potential and feasibility. In contrast to point cloud reconstruction schemes, which have deficiencies in reconstructed viewpoint quality and stability, DIBR techniques can already approximate the quality of the originally acquired views in terms of reconstructed viewpoint quality.
In DIBR schemes, the stability of the depth map in the time domain has an important impact on the quality of the final reconstructed image.
Disclosure of Invention
In view of this, in order to improve the stability of the depth map in the time domain, an aspect of the embodiments of the present disclosure provides a depth map processing method and a related apparatus.
In order to improve the image quality of the reconstructed video, another aspect of the embodiments of the present disclosure provides a video reconstruction method and a related apparatus.
First, an embodiment of the present disclosure provides a depth map processing method, including:
obtaining a depth map to be processed from an image combination of a current video frame of a multi-angle free view, wherein the image combination of the current video frame of the multi-angle free view comprises a plurality of texture maps and depth maps which are synchronous in angle and have corresponding relation;
acquiring a video frame sequence containing a preset window in the time domain of the current video frame;
acquiring a window filter coefficient value corresponding to each video frame in the video frame sequence, wherein the window filter coefficient value is generated by weight values of at least two dimensions, and the window filter coefficient value comprises: the first filter coefficient weight value corresponding to the pixel confidence coefficient is obtained by adopting the following mode: obtaining confidence values of pixels corresponding to positions in the depth map to be processed and each second depth map, and determining a first filter coefficient weight value corresponding to the confidence values, wherein: the second depth map is a depth map with the same view angle as the depth map to be processed in each video frame of the video frame sequence;
and filtering the pixel at the corresponding position in the depth map to be processed according to a preset filtering mode based on the window filter coefficient values corresponding to the respective video frames, to obtain the filtered depth value of the pixel at the corresponding position in the depth map to be processed.
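As an illustration of the four steps above, the following minimal Python sketch (an assumption for illustration only, not the patent's reference implementation) filters the depth map of the current frame against the same-view depth maps of a temporal window, given per-frame, per-pixel window filter coefficient values that have already been computed from the confidence weight and any other weight dimensions:

```python
import numpy as np

def temporal_filter_depth(window_depths, window_weights):
    """Filter the depth map of the current frame over a temporal window.

    window_depths  : list of HxW depth maps with the same view angle
                     (the depth map to be processed plus the second depth maps).
    window_weights : matching list of HxW window filter coefficient values
                     (e.g. confidence x frame-distance x similarity weights).
    Returns the filtered HxW depth map as a per-pixel weighted average.
    """
    depths = np.stack(window_depths).astype(np.float64)    # (D, H, W)
    weights = np.stack(window_weights).astype(np.float64)  # (D, H, W)
    weight_sum = np.maximum(weights.sum(axis=0), 1e-12)    # guard against all-zero weights
    return (depths * weights).sum(axis=0) / weight_sum
```

A normalized per-pixel weighted average is used here because the preset filtering mode described later reduces to exactly such a sum when the weight values of the individual dimensions are multiplied.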
Optionally, the weight value of the window filter coefficient further includes: at least one of a second filter coefficient weight value corresponding to the frame distance and a third filter coefficient weight value corresponding to the pixel similarity; the second filter coefficient weight value and the third filter coefficient weight value are obtained by adopting the following modes:
acquiring frame distances between each video frame in the video frame sequence and the current video frame, and determining a second filter coefficient weight value corresponding to the frame distances;
obtaining similarity values of the pixels at corresponding positions in the texture maps corresponding to the second depth maps and the texture map corresponding to the depth map to be processed, and determining third filter coefficient weight values corresponding to the similarity values.
Optionally, the obtaining of the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map includes at least one of the following:
obtaining a depth map in a preset view angle range around the corresponding view angles of the depth map to be processed and each second depth map, obtaining a third depth map of the corresponding view angles, and determining confidence values of corresponding pixels in the depth map to be processed and each second depth map based on the third depth map of each corresponding view angle;
and determining the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map based on the spatial consistency between each such pixel and the pixels in a preset area around it in the depth map where it is located.
Optionally, the determining, based on the third depth map of each corresponding view, a confidence value of a corresponding pixel in the depth map to be processed and in each second depth map includes:
obtaining the texture map corresponding to the depth map to be processed and the texture maps corresponding to the second depth maps, and, according to the depth values of the pixels at corresponding positions in the depth map to be processed and in each second depth map, respectively mapping the texture values at the corresponding positions in those texture maps to the corresponding positions in the texture maps corresponding to the third depth maps of the corresponding views, so as to obtain mapped texture values corresponding to the third depth maps of the corresponding views;
and respectively matching the mapping texture values with actual texture values of corresponding positions in texture maps corresponding to the third depth maps of the corresponding view angles, and determining confidence values of corresponding pixels in the depth maps to be processed and the second depth maps based on distribution intervals of matching degrees of the texture values corresponding to the third depth maps of the corresponding view angles.
Optionally, the determining, based on the third depth map of each corresponding view, a confidence value of a corresponding pixel in the depth map to be processed and in each second depth map includes:
mapping corresponding pixels in the depth map to be processed and the positions in each second depth map to the third depth map of each corresponding view angle, and obtaining mapping depth values of the pixels in the corresponding positions in the third depth map of each corresponding view angle;
and matching the mapping depth values of the pixels at the corresponding positions in the third depth map of each corresponding view angle with the actual depth values of the pixels at the corresponding positions in the third depth map of each corresponding view angle, and determining the confidence values of the pixels at the corresponding positions in the depth map to be processed and each second depth map based on the distribution interval of the matching degree of the depth values corresponding to the third depth map of each corresponding view angle.
Optionally, the determining, based on the third depth map of each corresponding view, a confidence value of a corresponding pixel in the to-be-processed depth map and each second depth map includes:
respectively obtaining the depth values of the pixels at corresponding positions in the depth map to be processed and in each second depth map, mapping each such pixel to the corresponding pixel position in the third depth map of the corresponding view according to its depth value, then reversely mapping that position back into the depth map to be processed or the corresponding second depth map according to the depth value at that pixel position in the third depth map of the corresponding view, so as to obtain, in the depth map to be processed and in each second depth map, the mapped pixel positions corresponding to the third depth maps of the corresponding views;
and respectively calculating the pixel distances between the actual pixel positions of the pixels at corresponding positions in the depth map to be processed and in each second depth map and the mapped pixel positions obtained by the reverse mapping through the third depth maps of the corresponding views, and determining the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map based on the distribution interval of the calculated pixel distances.
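The forward-then-reverse mapping described above is essentially a reprojection consistency check. The sketch below is one possible reading of it, assuming pinhole cameras with known intrinsics K and world-to-camera extrinsics (R, t); the helper names and the conversion of the pixel distance into a confidence value are illustrative assumptions, since the text only requires that the confidence follow the distribution interval of the distances:

```python
import numpy as np

def backproject(u, v, depth, K, R, t):
    """Pixel (u, v) with its depth in one camera -> 3D point in world coordinates."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    cam_point = ray * depth
    return R.T @ (cam_point - t)

def project(point_w, K, R, t):
    """3D world point -> (u, v, depth) in another camera."""
    cam_point = R @ point_w + t
    u, v = (K @ cam_point)[:2] / cam_point[2]
    return u, v, cam_point[2]

def reprojection_confidence(u, v, depth_a, depth_b, cam_a, cam_b, sigma=2.0):
    """Map a pixel of view A into a third view B using its depth, map it back
    using B's depth at the landing pixel, and turn the round-trip pixel error
    into a confidence value in (0, 1]. Assumes the landing pixel lies inside B."""
    Ka, Ra, ta = cam_a
    Kb, Rb, tb = cam_b
    # forward mapping A -> B
    ub, vb, _ = project(backproject(u, v, depth_a, Ka, Ra, ta), Kb, Rb, tb)
    # reverse mapping B -> A using B's actual depth at the landing pixel
    db = depth_b[int(round(vb)), int(round(ub))]
    ua, va, _ = project(backproject(ub, vb, db, Kb, Rb, tb), Ka, Ra, ta)
    pixel_distance = np.hypot(ua - u, va - v)
    return float(np.exp(-(pixel_distance ** 2) / (2.0 * sigma ** 2)))
```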
Optionally, the determining the confidence value of the pixel corresponding to the position in the depth map to be processed and each second depth map based on the spatial consistency between the pixel corresponding to the position in the depth map to be processed and the pixel in the preset area around the depth map where the pixel is located includes at least one of:
respectively matching the pixels corresponding to the positions in the depth map to be processed and the second depth maps with the depth values of the pixels in the preset area around the depth map where the pixels are located, and respectively determining confidence values of the pixels corresponding to the positions in the depth map to be processed and the second depth maps based on the matching degree of the depth values and the number of the pixels with the matching degree meeting a preset pixel matching degree threshold value;
and matching the pixels at corresponding positions in the depth map to be processed and in each second depth map with the weighted average of the depth values of the pixels in a preset area around them in the depth map where they are located, and respectively determining the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map based on the matching degree between each such pixel and the corresponding weighted average.
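A simple way to read the spatial-consistency criterion is sketched below: for every pixel, the depth value is compared with a weighted average of its neighbourhood, and the agreement is turned into a confidence value. The 3x3 uniform neighbourhood and the exponential mapping are assumptions; the text only requires a matching degree against the surrounding preset area:

```python
import numpy as np

def spatial_consistency_confidence(depth, radius=1, sigma=5.0):
    """Per-pixel confidence from agreement with the local weighted average."""
    padded = np.pad(depth.astype(np.float64), radius, mode="edge")
    h, w = depth.shape
    # uniform average over the (2*radius+1)^2 neighbourhood of every pixel
    neighbourhood = np.stack([
        padded[dy:dy + h, dx:dx + w]
        for dy in range(2 * radius + 1)
        for dx in range(2 * radius + 1)
    ])
    local_mean = neighbourhood.mean(axis=0)
    # large deviation from the neighbourhood average -> low confidence
    return np.exp(-((depth - local_mean) ** 2) / (2.0 * sigma ** 2))
```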
Optionally, filtering the depth value of the pixel corresponding to the position of the depth map to be processed according to a preset filtering mode based on the window filtering coefficient value corresponding to each video frame to obtain the depth value after the pixel corresponding to the position of the depth map to be processed is filtered, including:
taking the product, or a weighted average, of the first filter coefficient weight value and at least one of the second filter coefficient weight value and the third filter coefficient weight value as the window filter coefficient value corresponding to each video frame;
and calculating a weighted average value of products of the depth values of the pixels corresponding to the positions in the depth map to be processed and the second depth map and window filter coefficient values corresponding to the video frames, and obtaining the depth values of the pixels corresponding to the positions in the depth map to be processed after filtering.
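For the case where the weight values of the three dimensions are multiplied, the two steps above can be written out as follows (the normalization by the sum of the window filter coefficient values is assumed, as is usual for a weighted average):

```latex
\mathrm{Weight}_i = W^{c}_i \cdot W^{d}_i \cdot W^{s}_i , \qquad
\hat{D}_T(x, y) = \frac{\sum_{i=T-N}^{T+N} \mathrm{Weight}_i \, D_i(x, y)}
                       {\sum_{i=T-N}^{T+N} \mathrm{Weight}_i}
```

Here W^c_i, W^d_i and W^s_i denote the first, second and third filter coefficient weight values of the i-th video frame, D_i(x, y) is the depth value at position (x, y) in the same-view depth map of the i-th frame, and the filtered depth value of the pixel at position (x, y) in the depth map to be processed is the weighted average on the right.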
Optionally, the current video frame is located in the middle of the video frame sequence.
The embodiment of the specification provides another depth map processing method, which comprises the following steps:
obtaining a depth map to be processed from an image combination of a current video frame of a multi-angle free view, wherein the image combination of the current video frame of the multi-angle free view comprises a plurality of texture maps and depth maps which are synchronous in angle and have corresponding relation;
Acquiring a video frame sequence containing a preset window in the time domain of the current video frame;
acquiring a window filter coefficient value corresponding to each video frame in the video frame sequence, wherein the window filter coefficient value is generated by weight values of at least two dimensions, and the window filter coefficient value comprises: the first filter coefficient weight value corresponding to the pixel confidence coefficient is obtained by adopting the following mode: obtaining a depth map in a preset view angle range around the corresponding view angles of the depth map to be processed and each second depth map, obtaining a third depth map of the corresponding view angles, and determining confidence values of corresponding pixels in the depth map to be processed and each second depth map based on the third depth map of each corresponding view angle; determining a first filter coefficient weight value corresponding to the confidence coefficient value;
and filtering pixels corresponding to the position in the depth map to be processed according to a preset filtering mode based on the window filtering coefficient values corresponding to each video frame to obtain depth values after the pixels corresponding to the position in the depth map to be processed are filtered.
The embodiment of the specification also provides a video reconstruction method, which comprises the following steps:
acquiring image combinations of video frames of multi-angle free view angles, parameter data corresponding to the image combinations of the video frames and virtual view point position information based on user interaction, wherein the image combinations of the video frames comprise a plurality of groups of texture images and depth images with corresponding relations, wherein the groups of texture images and depth images have synchronous angles;
Obtaining a filtered depth map by adopting the depth map processing method described in any embodiment;
according to the virtual viewpoint position information and parameter data corresponding to the image combination of the video frame, selecting a texture map and a filtered depth map of a corresponding group in the image combination of the video frame at the time of user interaction according to a preset rule;
and based on the virtual viewpoint position information and parameter data corresponding to the texture map and the depth map of the corresponding group in the image combination of the video frame at the user interaction moment, carrying out combined rendering on the texture map and the filtered depth map of the corresponding group in the image combination of the video frame at the selected user interaction moment to obtain a reconstructed image corresponding to the virtual viewpoint position at the user interaction moment.
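As a hedged sketch of how the selection and combined-rendering steps might be orchestrated (the nearest-camera selection rule, the helper names and the DIBR renderer interface are assumptions; the text leaves the preset rule and the rendering details open):

```python
import numpy as np

def select_nearest_groups(virtual_position, camera_positions, count=2):
    """Assumed preset rule: pick the texture/depth groups whose camera
    positions are closest to the virtual viewpoint."""
    distances = np.linalg.norm(
        np.asarray(camera_positions, dtype=float) - np.asarray(virtual_position, dtype=float),
        axis=1,
    )
    return list(np.argsort(distances)[:count])

def reconstruct_virtual_view(virtual_position, camera_params, textures, filtered_depths, render_fn):
    """Combine the selected groups into a reconstructed image.

    render_fn is a DIBR routine supplied by the caller; it is assumed to warp
    each selected texture map with its filtered depth map to the virtual
    viewpoint and blend the results.
    """
    positions = [p["position"] for p in camera_params]  # assumed parameter layout
    selected = select_nearest_groups(virtual_position, positions)
    return render_fn(
        virtual_position,
        [camera_params[i] for i in selected],
        [textures[i] for i in selected],
        [filtered_depths[i] for i in selected],
    )
```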
The embodiment of the specification also provides a depth map processing device, which comprises:
the depth map acquisition unit is suitable for acquiring a depth map to be processed from the image combination of the current video frame of the multi-angle free view, wherein the image combination of the current video frame of the multi-angle free view comprises a plurality of groups of texture maps and depth maps with corresponding relations, wherein the groups of texture maps and the depth maps are in angle synchronization;
a frame sequence obtaining unit, adapted to obtain a video frame sequence containing a preset window in the time domain of the current video frame;
A window filter coefficient value obtaining unit, adapted to obtain a window filter coefficient value corresponding to each video frame in the video frame sequence, where the window filter coefficient value is generated from weight values of at least two dimensions, including a first filter coefficient weight value corresponding to the pixel confidence; the window filter coefficient value obtaining unit includes: a first filter coefficient weight value obtaining subunit, adapted to obtain the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map, and determine the first filter coefficient weight value corresponding to the confidence values, where the second depth map is a depth map, in each video frame of the video frame sequence, having the same view angle as the depth map to be processed;
and the filtering unit is suitable for filtering the pixels corresponding to the position in the depth image to be processed according to a preset filtering mode based on the window filtering coefficient values corresponding to the video frames, and obtaining the depth values of the pixels corresponding to the position in the depth image to be processed after filtering.
Optionally, the window filter coefficient value acquisition unit further includes at least one of:
a second filter coefficient weight value obtaining subunit, adapted to obtain a frame distance between each video frame in the video frame sequence and the current video frame, and determine a second filter coefficient weight value corresponding to the frame distance;
And the third filter coefficient weight value acquisition subunit is suitable for acquiring similarity values of pixels corresponding to positions of texture maps corresponding to the second depth maps and the texture maps corresponding to the depth maps to be processed, and determining the third filter coefficient weight values corresponding to the similarity values.
The embodiment of the specification also provides a video reconstruction system, which comprises:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is suitable for acquiring image combinations of video frames of multi-angle free view angles, parameter data corresponding to the image combinations of the video frames and virtual view point position information based on user interaction, wherein the image combinations of the video frames comprise a plurality of groups of texture images and depth images with corresponding relations, wherein the groups of texture images and depth images have synchronous angles;
the filtering module is suitable for filtering the depth map in the video frame;
the selection module is suitable for selecting a texture map and a filtered depth map of a corresponding group in the image combination of the video frame at the time of user interaction according to the virtual viewpoint position information and parameter data corresponding to the image combination of the video frame and a preset rule;
the image reconstruction module is suitable for carrying out combined rendering on the texture map and the filtered depth map of the corresponding group in the image combination of the video frame at the selected user interaction moment based on the virtual viewpoint position information and the parameter data corresponding to the texture map and the depth map of the corresponding group in the image combination of the video frame at the user interaction moment to obtain a reconstructed image corresponding to the virtual viewpoint position at the user interaction moment;
Wherein, the filtering module includes:
the depth map acquisition unit is suitable for acquiring a depth map to be processed from the image combination of the current video frame of the multi-angle free view, wherein the image combination of the current video frame of the multi-angle free view comprises a plurality of groups of texture maps and depth maps with corresponding relations, wherein the groups of texture maps and the depth maps are in angle synchronization;
a frame sequence obtaining unit, adapted to obtain a video frame sequence containing a preset window in the time domain of the current video frame;
a window filter coefficient value obtaining unit, adapted to obtain a window filter coefficient value corresponding to each video frame in the video frame sequence, where the window filter coefficient value is generated from weight values of at least two dimensions, including a first filter coefficient weight value corresponding to the pixel confidence; the window filter coefficient value obtaining unit includes: a first filter coefficient weight value obtaining subunit, adapted to obtain the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map, and determine the first filter coefficient weight value corresponding to the confidence values, where the second depth map is a depth map, in each video frame of the video frame sequence, having the same view angle as the depth map to be processed;
And the filtering unit is suitable for filtering the pixels corresponding to the position in the depth image to be processed according to a preset filtering mode based on the window filtering coefficient values corresponding to the video frames, and obtaining the depth values of the pixels corresponding to the position in the depth image to be processed after filtering.
Optionally, the window filter coefficient value acquisition unit further includes at least one of:
a second filter coefficient weight value obtaining subunit, adapted to obtain a frame distance between each video frame in the video frame sequence and the current video frame, and determine a second filter coefficient weight value corresponding to the frame distance;
and the third filter coefficient weight value acquisition subunit is suitable for acquiring similarity values of pixels corresponding to positions of texture maps corresponding to the second depth maps and the texture maps corresponding to the depth maps to be processed, and determining the third filter coefficient weight values corresponding to the similarity values.
The present description also provides an electronic device comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the steps of the method according to any one of the preceding embodiments.
The present description also provides a computer-readable storage medium having stored thereon computer instructions which, when executed, perform the steps of the method of any of the preceding embodiments.
Compared with the prior art, the technical scheme of the embodiment of the specification has the following beneficial effects:
According to the depth map processing scheme, filtering is performed in the time domain by acquiring a depth map to be processed from the image combination of a current video frame of a multi-angle free view. For the depth maps having the same view angle as the depth map to be processed in the video frames of a video frame sequence of a preset window in the time domain, namely the second depth maps, the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map are acquired, the first filter coefficient weight value corresponding to the confidence values is determined, the window filter coefficient values are generated based on the first filter coefficient weight value, and the pixels at the corresponding positions in the depth map to be processed are filtered according to a preset filtering mode. In this way, the influence of unreliable depth values in the depth map to be processed and in each second depth map on the filtering result can be avoided, so that the stability of the depth map in the time domain can be improved.
With the video reconstruction scheme of the embodiments of the present disclosure, the depth maps in the video frames are filtered in the time domain. For the depth maps having the same view angle as the depth map to be processed in the video frames of a video frame sequence of a preset window in the time domain, namely the second depth maps, the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map are acquired, the first filter coefficient weight value corresponding to the confidence values is determined, and the first filter coefficient weight value is incorporated into the window filter coefficient values. The influence of unreliable depth values in the depth map to be processed and in each second depth map on the filtering result can thus be avoided, so that the stability of the depth map in the time domain, and hence the image quality of the reconstructed video, can be improved.
Drawings
FIG. 1 is a schematic diagram of a data processing system in a specific application scenario to which the depth map processing method according to the embodiments of the present disclosure is applicable.
Fig. 2 is a schematic diagram of a multi-angle freeview data generation process according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a user side processing 6DoF video data in an embodiment of the present disclosure.
Fig. 4 is a schematic diagram of input and output of a video reconstruction system according to an embodiment of the present disclosure.
Fig. 5 is a flowchart of a depth map processing method in an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of a video frame sequence in an application scenario according to an embodiment of the present disclosure.
Fig. 7 is a flowchart of a method for obtaining confidence values of corresponding pixels in a depth map to be processed and in a position of each second depth map in an embodiment of the present disclosure.
Fig. 8 is a flowchart of another method for obtaining confidence values of corresponding pixels in the depth map to be processed and in the positions of the second depth maps according to the embodiment of the present disclosure.
Fig. 9 is a flowchart of another method for obtaining confidence values of corresponding pixels in the depth map to be processed and in the positions of the second depth maps according to the embodiment of the present disclosure.
Fig. 10 is a schematic view of a scenario in which confidence values of corresponding pixels in a depth map to be processed and in a position of each second depth map are determined in the embodiment of the present disclosure.
Fig. 11 is a flowchart of a video reconstruction method according to an embodiment of the present disclosure.
Fig. 12 is a schematic structural diagram of a depth map processing apparatus according to an embodiment of the present disclosure.
Fig. 13 is a schematic structural diagram of a video reconstruction system according to an embodiment of the present disclosure.
Fig. 14 is a schematic structural view of an electronic device in the embodiment of the present specification.
Detailed Description
In a conventional video playing scenario, for example the broadcast of a sports game, a user can only watch from a single viewpoint position and cannot freely switch viewpoint positions to watch the game pictures or the game process from different viewpoint positions, so the user cannot experience the feeling of watching the game while moving the viewpoint on site.
The 6 Degrees of Freedom (6DoF) technology is adopted to provide a high-degree-of-freedom viewing experience: the user can adjust the viewing angle through interactive means during viewing and watch from the desired free viewpoint, which greatly improves the viewing experience.
In order to realize a 6DoF scene, there are currently the Free-D playback technology, the depth map based DIBR technology, and the like. The Free-D playback technology acquires point cloud data of a scene through multi-angle shooting to express a 6DoF image, and reconstructs the 6DoF image or video based on the point cloud data. The depth map based 6DoF video generating method performs combined rendering on the texture maps and depth maps of the corresponding groups in the image combination of the video frame at the user interaction moment, based on the virtual viewpoint position and the parameter data corresponding to those texture maps and depth maps, to reconstruct the 6DoF image or video.
In contrast to the deficiencies of point cloud reconstruction schemes with respect to reconstructed view quality and stability, DIBR techniques may already approximate the quality of the originally acquired view in terms of reconstructed view quality. In order to improve the reconstructed viewpoint quality, the depth map is filtered in the time domain in the DIBR process, so that the time domain stability of the depth map reconstruction is improved.
However, the inventors have found that in some cases the quality of the depth map generated after filtering is instead degraded. For this reason, the inventors conducted further in-depth studies and experiments and found that the depth values of the pixels in the depth maps participating in temporal filtering are not always reliable, and including unreliable depth values in the filtering results in a degradation of the quality of the depth map finally generated after filtering.
Referring to fig. 1, which is a schematic structural diagram of a data processing system in a specific application scenario, a layout scenario of a data processing system of a basketball game is shown, where a data processing system 10 includes an acquisition array 11 formed by a plurality of acquisition devices, a data processing device 12, a cloud server cluster 13, a play control device 14, a play terminal 15, and an interaction terminal 16. With the data processing system 10, reconstruction of multi-angle freeview video can be achieved, and a user can view multi-angle freeview video with low latency.
Referring to fig. 1, the left basketball rim is taken as a core point of view, the core point of view is taken as a center, and a sector area which is in the same plane with the core point of view is taken as a preset multi-angle free view angle range. Each collection device in the collection array 11 can be arranged at different positions of the field collection area in a fan shape according to the preset multi-angle free view angle range, and can respectively and synchronously collect video data streams from corresponding angles in real time.
The data processing device 12 may be located in an off-site area, which may be considered an on-site server, in order not to interfere with the operation of the acquisition device. The data processing device 12 may send a streaming instruction to each acquisition device in the acquisition array 11 through a wireless local area network, and each acquisition device in the acquisition array 11 transmits the obtained video data stream to the data processing device 12 in real time based on the streaming instruction sent by the data processing device 12.
When the data processing device 12 receives a video frame interception instruction, intercepting video frames at a specified frame time from a received multi-path video data stream to obtain frame images of a plurality of synchronous video frames, and uploading the obtained plurality of synchronous video frames at the specified frame time to a server cluster 13 of a cloud.
Correspondingly, the cloud server cluster 13 uses the received frame images of the plurality of synchronous video frames as an image combination, determines parameter data corresponding to the image combination and depth data of each frame image in the image combination, and reconstructs the frame image of a preset virtual viewpoint path based on the parameter data corresponding to the image combination, pixel data and depth data of the preset frame image in the image combination, so as to obtain corresponding multi-angle free view video data, wherein the multi-angle free view video data may include: multi-angle freeview spatial data and multi-angle freeview temporal data of frame images ordered according to frame moments.
In an implementation, the cloud server cluster 13 may store the pixel data and the depth data of the image combination in the following manner:
A spliced image corresponding to the frame moment is generated based on the pixel data and the depth data of the image combination, where the spliced image includes a first field and a second field: the first field contains the pixel data of the preset frame image in the image combination, and the second field contains the depth data of the preset frame image in the image combination. The obtained spliced image and the corresponding parameter data can be stored in a data file; when the spliced image or the parameter data needs to be acquired, it can be read from the corresponding storage space according to the corresponding storage address in the header file of the data file.
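For illustration, a minimal way to pack the pixel (texture) data and the depth data of one frame image into the first and second fields of such a spliced image could look as follows; the top/bottom layout and the 8-bit quantization of the depth are assumptions, since the actual storage rule is described by the stitching pattern metadata:

```python
import numpy as np

def make_spliced_image(texture, depth, depth_max):
    """Stack a texture map (HxWx3, uint8) on top of a quantized depth map (HxW)."""
    depth_8bit = np.clip(depth / depth_max * 255.0, 0, 255).astype(np.uint8)
    depth_field = np.repeat(depth_8bit[:, :, None], 3, axis=2)  # grey 3-channel field
    return np.vstack([texture, depth_field])  # first field on top, second field below

def split_spliced_image(spliced):
    """Recover the two fields, assuming the equal-height top/bottom layout above."""
    half = spliced.shape[0] // 2
    return spliced[:half], spliced[half:]
```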
Then, the play control device 14 may insert the received multi-angle freeview video data into a data stream to be played, and the play terminal 15 receives the data stream to be played from the play control device 14 and plays the data stream in real time. The play control device 14 may be an artificial play control device or a virtual play control device. In specific implementation, a special server capable of automatically switching video streams can be set as a virtual playing control device to control a data source. A director control device such as a director station may be used as a play control device in the embodiments of the present specification.
When the cloud server cluster 13 receives the image reconstruction instruction from the interactive terminal 16, the spliced image of the preset frame image in the corresponding image combination and the parameter data corresponding to the corresponding image combination can be extracted and transmitted to the interactive terminal 16.
The interactive terminal 16 determines the interactive frame time information based on the triggering operation, sends an image reconstruction instruction containing the interactive frame time information to the server cluster 13, receives a spliced image of a preset frame image and corresponding parameter data in an image combination corresponding to the interactive frame time returned from the server cluster 13 in the cloud, determines virtual viewpoint position information based on the interactive operation, selects corresponding pixel data and depth data and corresponding parameter data in the spliced image according to a preset rule, performs combined rendering on the selected pixel data and depth data, and reconstructs multi-angle free view video data corresponding to the virtual viewpoint position at the interactive frame time and plays the video data.
Generally, the entities in the video are not completely stationary. For example, with the data processing system described above, during a basketball game the entities captured by the acquisition array, such as athletes, the basketball and referees, are mostly in motion. Accordingly, both the texture data and the depth data in the image combinations of the acquired video frames also change continuously over time.
In order to improve the quality of the generated multi-angle freeview video image, the cloud server cluster 13 may perform temporal filtering on the depth map for generating the multi-angle freeview video. For example, for a depth map to be processed, a corresponding filter coefficient may be set to perform temporal filtering based on a similarity between a texture map of the depth map to be processed and a texture map of a depth map having the same view angle as the depth map to be processed in the temporal domain.
The inventors have found that the actually acquired pixel values in the depth map to be processed, or in a depth map having the same view angle as the depth map to be processed, may be wrong; for example, some entities are occluded under that view angle, so the depth values of the pixels in the depth maps participating in filtering may be unreliable, and including unreliable depth values in the filtering may instead reduce the quality of the depth map finally generated after filtering. In view of this problem, in the depth map processing scheme of the embodiments of the present disclosure, for the depth map to be processed in the current video frame of a video frame sequence of a preset window in the time domain, and for the depth maps in the other video frames that have the same view angle as the depth map to be processed, the confidence values of the corresponding pixels are taken into account, and the filter coefficient weight values corresponding to the confidence values are added to the filter coefficient values of the preset window in the time domain. The influence of unreliable depth values in these depth maps on the filtering result can thus be avoided, and the stability of the depth map in the time domain can be improved.
In order for those skilled in the art to better understand and implement the depth map processing scheme and video reconstruction scheme in the embodiments of the present specification, the following first briefly describes the principle of obtaining 6DoF video based on DIBR.
First, video data or image data can be acquired through acquisition devices and depth map calculation can be carried out, which mainly comprises three steps: Multi-camera Video Capturing, Camera Parameter Estimation, and Depth Map Calculation. For multi-camera acquisition, it is desirable that the videos acquired by the cameras be aligned at the frame level. Referring to fig. 2, texture images 21 can be obtained through multi-camera video capturing (step S21); through the camera internal and external parameter calculation (step S22), the camera parameters 22, i.e., the parameter data hereinafter, including the internal parameter data and external parameter data of the cameras, can be obtained; through the depth map calculation (step S23), the depth maps 23 can be obtained. After the above three steps are completed, the texture maps acquired from the multiple cameras, all the camera parameters, and the depth map of each camera are obtained. These three pieces of data may be referred to as the data files in the multi-angle free-view video data, and may also be referred to as 6 Degrees of Freedom video data (6DoF video data). With the 6DoF video data, the client can generate a virtual viewpoint based on a virtual 6DoF position, thereby providing a 6DoF video experience.
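The three outputs of the pipeline above can be grouped into a single 6DoF video data record; the following dataclass sketch is only an illustrative assumption about how such data might be organized in code:

```python
from dataclasses import dataclass
from typing import Dict

import numpy as np

@dataclass
class CameraParameters:
    """Internal (intrinsic) and external (extrinsic) parameter data of one camera."""
    intrinsics: np.ndarray   # 3x3 matrix K
    rotation: np.ndarray     # 3x3 matrix R (world to camera)
    translation: np.ndarray  # 3-vector t

@dataclass
class SixDoFVideoFrame:
    """One frame of 6DoF video data: per-camera texture maps, depth maps and parameters."""
    textures: Dict[int, np.ndarray]       # camera id -> HxWx3 texture map
    depth_maps: Dict[int, np.ndarray]     # camera id -> HxW depth map
    cameras: Dict[int, CameraParameters]  # camera id -> parameter data
```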
The 6DoF video data, together with indicative data, which may also be referred to as metadata, may be compressed and transmitted to the user side. The user side can obtain the 6DoF expression, i.e., the 6DoF video data and the metadata, from the received data, and then perform 6DoF rendering on the user side.
The metadata may be used to describe a data mode of the 6DoF video data, and may specifically include: stitching pattern metadata (Stitching Pattern metadata) for indicating storage rules for pixel data and depth data for a plurality of images in a stitched image; edge protection metadata (Padding pattern metadata) may be used to indicate the manner in which edge protection is performed in the stitched image, as well as Other metadata (Other metadata). Metadata may be stored in the header file.
Referring to fig. 3, the user side has, in addition to its interactive behavior data 34, the 6DoF video data (including the camera parameters 31, the texture maps and depth maps 32, and the metadata 33). From these data, depth image based rendering may be employed for 6DoF rendering (step S30), whereby an image of a virtual viewpoint is generated at a specific 6DoF position 35 determined by user behavior, i.e., the virtual viewpoint at the 6DoF position corresponding to the instruction is determined according to the user instruction.
In the video reconstruction system or DIBR application software adopted in the embodiments of the present disclosure, the camera parameters, texture map, depth map, and the 6DoF position of the virtual camera may be received as inputs, while the generated texture map and depth map at the virtual 6DoF position are output. The 6DoF position of the virtual camera is the 6DoF position determined from the user behavior as described above. The DIBR application software may be software that implements virtual viewpoint-based image reconstruction in the embodiments of the present specification.
In a DIBR software used in the embodiment of the present disclosure, in conjunction with fig. 4, DIBR software 40 may receive camera parameters 41, texture map 42, depth map 43, and 6DoF position data 44 of the virtual camera as input, and may generate a texture map and a depth map at a virtual 6DoF position by generating texture map S41 and generating depth map S42, and output the generated texture map and depth map at the same time.
The input depth map may be processed, e.g. filtered in the time domain, before generating the texture map and depth map at the virtual 6DoF location.
The following describes in detail, by way of specific embodiments, a depth map processing method that may improve stability of a depth map in a time domain and is used in the embodiments of the present specification, with reference to the accompanying drawings.
Referring to the flowchart of the depth map processing method shown in fig. 5, the following steps may be specifically adopted to perform filtering processing on the depth map:
s51, obtaining a depth map to be processed from an image combination of a current video frame of a multi-angle free view, wherein the image combination of the current video frame of the multi-angle free view comprises a plurality of groups of texture maps and depth maps with corresponding relations, wherein the groups of texture maps and the depth maps are in angle synchronization.
S52, obtaining a video frame sequence containing a preset window in the time domain of the current video frame.
In a specific implementation, after the current video frame containing the depth map to be processed is acquired, a video frame sequence of a preset window in the time domain containing the current video frame may be acquired. As shown in the schematic diagram of the video frame sequence in fig. 6, let the T-th frame in the video sequence be the current video frame, let the preset window size D in the time domain be equal to 2N+1 frames, and let the current video frame be in the middle of the video frame sequence intercepted by the preset window in the time domain; then the video frame sequence of 2N+1 frames in total, from the (T-N)-th frame to the (T+N)-th frame, can be obtained.
It will be appreciated that in implementations, the current video frame may not be located in the middle of the sequence of video frames of the preset window.
It should be noted that the size of the preset window in the time domain may be set empirically according to the filtering precision requirement and the processing resource requirement. In an embodiment of the present disclosure, the window size D is 5 video frames, that is, 2N+1 = 5 and N = 2; in other embodiments, N may take other values such as 3 or 4, and the specific value may be determined according to the final filtering effect. The size of the preset window in the time domain can also be adjusted according to the position of the current frame in the whole video stream.
In a specific implementation, the depth maps in the first N video frames of the entire video stream may be left unfiltered, i.e. the filtering is performed starting from the (N+1)-th frame; in this embodiment, T is greater than N.
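A small helper, given only as an assumed illustration, shows how the window [T-N, T+N] could be formed while leaving the first N frames (and, here, also the last N frames, which the text does not prescribe) unfiltered:

```python
def temporal_window(t, n, num_frames):
    """Return the frame indices of the 2N+1 window centred on frame t,
    or None when frame t should be left unfiltered."""
    if t < n or t + n >= num_frames:      # boundary frames: skip filtering
        return None
    return list(range(t - n, t + n + 1))  # T-N ... T+N, 2N+1 frames in total
```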
S53, obtaining a window filter coefficient value corresponding to each video frame in the video frame sequence, where the window filter coefficient value is generated from weight values of at least two dimensions and includes a first filter coefficient weight value corresponding to the pixel confidence.
In a specific implementation, the first filter coefficient weight value may be obtained as follows:
obtaining confidence values of pixels corresponding to positions in the depth map to be processed and each second depth map, and determining a first filter coefficient weight value corresponding to the confidence values, wherein: the second depth map is a depth map with the same view angle as the depth map to be processed in each video frame of the video frame sequence.
In a specific implementation, there are various methods for evaluating the confidence of the pixels at corresponding positions in the depth map to be processed and in each second depth map, and thereby obtaining their confidence values.
For example, the depth maps within a preset view angle range around the corresponding view angles of the depth map to be processed and of the second depth map in each video frame in the preset window may be obtained as the third depth maps of the corresponding views, and the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map may be determined based on the third depth maps of the corresponding views.
For another example, the confidence value of the pixel corresponding to the position in the depth map to be processed and the pixel corresponding to the position in each second depth map may be determined based on the spatial consistency of the pixel corresponding to the position in the depth map to be processed and the pixel in the preset area around the depth map where the pixel is located.
The specific obtaining of the confidence values of the pixels corresponding to the positions in the depth map to be processed and each second depth map will be described in detail below through specific application scenes.
In the embodiment of the present disclosure, a correspondence between the confidence value and the first filter coefficient weight value may be preset. The larger the confidence value c, the larger the corresponding first filter coefficient weight value Weight_c; the smaller the confidence value c, the smaller the corresponding first filter coefficient weight value Weight_c; the two are positively correlated.
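One simple preset correspondence satisfying this positive correlation, given purely as an assumed example, maps the confidence value linearly into a weight range:

```python
def confidence_weight(c, c_min=0.0, c_max=1.0, w_min=0.1, w_max=1.0):
    """Monotonically increasing mapping from confidence value c to Weight_c."""
    c = min(max(c, c_min), c_max)
    return w_min + (w_max - w_min) * (c - c_min) / (c_max - c_min)
```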
And S54, filtering the pixels corresponding to the positions in the depth map to be processed according to a preset filtering mode based on the window filtering coefficient values corresponding to the video frames, and obtaining the depth values after the pixels corresponding to the positions in the depth map to be processed are filtered.
With this embodiment, for the depth maps having the same view angle as the depth map to be processed in the video frames of the video frame sequence of the preset window in the time domain, i.e., the second depth maps, the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map are acquired, the first filter coefficient weight value corresponding to the confidence values is determined, the window filter coefficient values are generated based on the first filter coefficient weight value, and the pixels at the corresponding positions in the depth map to be processed are filtered according to the preset filtering mode. In this way, the influence of unreliable depth values in the depth map to be processed and in each second depth map on the filtering result can be avoided, and the stability of the depth map in the time domain can be improved.
In this embodiment of the present disclosure, in order to improve the stability of the depth map in the time domain, as described in the previous embodiment, the window filter coefficient value is generated by weight values of at least two dimensions, and the weight value of one dimension is the first filter coefficient weight value corresponding to the pixel confidence. For a better understanding and implementation of the present embodiments by those skilled in the art, the weight values of the selected other dimensions that generate the window filter coefficient values are illustrated by specific embodiments below.
It will be appreciated that the window filter coefficient values may be generated based on the first filter coefficient weight value and the filter coefficient weight values of one or more other dimensions; besides the example dimensions below, the window filter coefficient values may also be generated from the first filter coefficient weight value, a filter coefficient weight value of at least one of the following dimensions, and filter coefficient weight values of further dimensions.
Example dimension one: second filter coefficient weight value corresponding to frame distance
Specifically, a frame distance between each video frame in the video frame sequence and the current video frame is obtained, and a second filter coefficient weight value corresponding to the frame distance is determined.
In particular implementations, the frame spacing may be expressed in terms of differences in frame positions in the sequence of video frames or in terms of time intervals between corresponding video frames in the sequence of video frames. Since the frames in a sequence of frames are typically equally spaced, the difference in frame positions in the sequence of video frames is chosen for ease of operation. With continued reference to fig. 6, for example, the frame distance between the T-1 th and t+1 th frames and the current video frame (T frame) is 1 frame, the frame distance between the T-2 th and t+2 th frames and the current video frame (T frame) is 2 frames, and so on, the frame distance between the T-N th and t+n th frames and the current video frame (T frame) is N frames.
In a specific implementation, a correspondence between the frame distance d and the corresponding second filter coefficient weight value Weight_d may be preset. The smaller the frame distance d, the larger the corresponding second filter coefficient weight value Weight_d; the larger the frame distance d, the smaller the corresponding second filter coefficient weight value Weight_d; the two are anti-correlated.
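A commonly used monotonically decreasing correspondence, assumed here only for illustration, is a Gaussian of the frame distance:

```python
import math

def frame_distance_weight(d, sigma_d=2.0):
    """Weight_d decreases as the frame distance d to the current frame grows."""
    return math.exp(-(d ** 2) / (2.0 * sigma_d ** 2))
```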
Example dimension two: third filter coefficient weight value corresponding to pixel similarity
Specifically, a similarity value of a texture map corresponding to each second depth map and a pixel corresponding to a position in the texture map corresponding to the depth map to be processed may be obtained, and a third filter coefficient weight value corresponding to the similarity value may be determined.
Continuing with fig. 6, for example, the depth map T_M corresponding to view angle M in the current video frame (the T-th frame) is acquired as the depth map to be processed, and the depth maps corresponding to view angle M within the window [T-N, T+N], namely (T-N)_M, ..., (T-2)_M, (T-1)_M, (T+1)_M, (T+2)_M, ..., (T+N)_M, are in turn the depth maps in the (T-N)-th, ..., (T-2)-th, (T-1)-th, (T+1)-th, (T+2)-th, ..., (T+N)-th frames that have the same view angle as the depth map to be processed T_M, i.e., the second depth maps corresponding to the depth map to be processed T_M in each video frame of the video frame sequence other than the current video frame T.
With continued reference to fig. 6, for the depth map T_M of view angle M in the T-th frame, the pixel at any position (x1, y1) in the corresponding texture map is, for convenience of description, referred to as the first pixel, and its texture value is denoted Color(x1, y1). The texture value Color(x1', y1') of the pixel at the position corresponding to the first pixel in the texture map corresponding to each second depth map can be obtained, the similarity value s of each such texture value Color(x1', y1') relative to the texture value Color(x1, y1) of the first pixel can then be computed, and the third filter coefficient weight value Weight_s corresponding to the similarity value s can be determined.
In a specific implementation, a correspondence between the similarity value s and the corresponding third filter coefficient weight value Weight_s may be preset: the larger the similarity value s, the larger the corresponding Weight_s, and the smaller the similarity value s, the smaller Weight_s, so that pixels whose texture better matches that of the first pixel contribute more strongly to the filtering.
As to how the window filter coefficient values are generated in the embodiments of the present specification, if, besides the first filter coefficient weight value corresponding to the pixel confidence, the filter coefficient weight values of the two example dimensions above (the second filter coefficient weight value corresponding to the frame distance and the third filter coefficient weight value corresponding to the pixel similarity) are considered at the same time, then in some embodiments of the present disclosure the product of the first filter coefficient weight value Weight_i_c, the second filter coefficient weight value Weight_i_d and the third filter coefficient weight value Weight_i_s may be taken as the window filter coefficient value Weight_i corresponding to each video frame, namely: Weight_i = Weight_i_c * Weight_i_d * Weight_i_s (i = T-N to T+N). Then, the weighted average of the depth values of the pixels at corresponding positions in the depth map to be processed and in each second depth map, weighted by the window filter coefficient values corresponding to the respective video frames, is calculated to obtain the filtered depth value of the pixel at the corresponding position in the depth map to be processed.
It will be appreciated that in a specific implementation, the product, or a weighted average, of the first filter coefficient weight value and only one of the second and third filter coefficient weight values may also be used as the window filter coefficient value corresponding to each video frame. The weighted average of the depth values of the pixels at corresponding positions in the depth map to be processed and in each second depth map, weighted by the window filter coefficient values corresponding to the respective video frames, is then calculated to obtain the filtered depth value of the pixel at the corresponding position in the depth map to be processed.
Continuing with the description of FIG. 6, consider any pixel in the depth map T_M to be processed of view angle M in the T-th frame; for convenience of description it is called the second pixel, and its depth value is denoted Depth_T^M(x2, y2). The filtering process can be performed using the following formula to obtain the filtered depth value Depth'_T^M(x2, y2) of the second pixel:

Depth'_T^M(x2, y2) = Σ_{i=T-N}^{T+N} Weight_i · Depth_i^M(x2, y2) / Σ_{i=T-N}^{T+N} Weight_i

In the formula, the pixel corresponding to the position of the second pixel in each depth map is represented by (x2, y2), and the view angle and the frame to which each depth map belongs are distinguished by the superscript M and the subscript i of Depth_i^M(x2, y2).
It will be appreciated that in particular implementations the way the window filter coefficient values are obtained is not limited to the above embodiments; for example, the window filter coefficient value may also be obtained as an arithmetic average or a weighted average of the first filter coefficient weight value Weight_i_c, the second filter coefficient weight value Weight_i_d and the third filter coefficient weight value Weight_i_s, or by other weight distribution schemes.
In a specific implementation, the pixels in the depth map to be processed may be filtered one by one; to increase the filtering speed, the pixels in the depth map to be processed may also be filtered in parallel in the manner of the above embodiments, or filtered in batches of multiple pixels, as sketched below.
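As an illustration of the weighted-average filtering described above, the following Python/NumPy sketch filters the centre frame of a stack of view-M depth maps; the stacked inputs (per-frame confidence, frame-distance and similarity weights) are assumed to have been computed as described in the preceding paragraphs and are not defined by the original text.

```python
import numpy as np

def temporal_filter_depth(depth_stack, conf_w, dist_w, sim_w, eps=1e-8):
    """Temporally filter the view-M depth map of the current frame.

    depth_stack : (2N+1, H, W) depth maps of view angle M for frames T-N..T+N
    conf_w      : (2N+1, H, W) first filter coefficient weights (pixel confidence)
    dist_w      : (2N+1,)      second filter coefficient weights (frame distance)
    sim_w       : (2N+1, H, W) third filter coefficient weights (texture similarity)

    Weight_i = Weight_i_c * Weight_i_d * Weight_i_s per pixel, and the filtered
    depth value is the weighted average of the per-frame depth values.
    """
    dist_w = np.asarray(dist_w, dtype=float)
    w = conf_w * dist_w[:, None, None] * sim_w          # per-frame, per-pixel weights
    return (w * depth_stack).sum(axis=0) / (w.sum(axis=0) + eps)
```

Because each output pixel depends only on the depth values and weights of its own temporal window, the computation vectorizes over the whole image, which is what the parallel or batched filtering mentioned above amounts to.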
With the depth map processing method of the above embodiments, during the filtering of the depth map to be processed, not only are the frame distance between each video frame in the preset temporal window and the current video frame and/or the similarity between the texture values of the pixels at corresponding positions in the texture maps corresponding to the second depth maps and in the texture map corresponding to the depth map to be processed taken into account, but the confidences of the pixels at corresponding positions in the depth map to be processed and in each second depth map are also added to the window filter coefficient weights. The influence on the temporal filtering result of introducing unreliable depth values (whether in the depth map to be processed or in the second depth maps) can thus be avoided, and the stability of the depth map in the time domain can be improved.
In order for those skilled in the art to better understand and practice the embodiments of the present disclosure, how to obtain confidence values of corresponding pixels in the depth map to be processed and in each second depth map will be described in detail below through some specific embodiments.
In a first mode, the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map are determined based on the depth maps within a preset view angle range around the view angles corresponding to the depth map to be processed and to each second depth map.
Specifically, the depth maps within a preset view angle range around the view angle corresponding to the depth map to be processed and to the second depth map in each video frame may be obtained as third depth maps of the corresponding view angles, and the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map are determined based on these third depth maps.
Referring to fig. 6, for the depth map to be processed of view angle M in the current video frame, the second depth maps of view angle M in the other video frames may be obtained. In a specific implementation, each depth map of view angle M (including the depth map to be processed and each second depth map) corresponds to the depth maps within a preset view angle range [M-K, M+K] around that view angle, which for convenience of description may be called third depth maps. For example, the range may extend 15°, 30°, 45°, 60° or the like to both sides, centred on the view angle M. It will be appreciated that these values are merely exemplary and are not intended to limit the scope of the present invention; the concrete value is related to the viewpoint density of the image combination of each video frame: the higher the viewpoint density, the smaller the range may be, and the lower the viewpoint density, the more the range may be correspondingly expanded.
In a specific implementation, the view angle range may also be determined by the spatial distribution of the viewpoints corresponding to the image combination. For example, for texture maps synchronously acquired by 40 acquisition devices arranged in an arc and the depth maps obtained from them, M and K may denote acquisition device positions: if M denotes the view angle of the 10th acquisition device from the left and K takes the value 3, the confidence values of the pixels at corresponding positions in the depth map corresponding to the 10th acquisition device may be determined based on the depth maps corresponding to the view angles of the 7th to 9th and the 11th to 13th acquisition devices from the left.
It should be noted that the value interval of the preset view angle range need not be centred on the view angle of the depth map to be processed; the specific values may be determined according to the spatial position relationship of the depth maps in each video frame. For example, one or more depth maps whose viewpoints are closest to that of the depth map to be processed may be selected in each video frame for determining the confidences of the pixels in the depth map to be processed and in each second depth map, as sketched below.
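A minimal sketch of such a view-selection rule is given below; the indexing of the views along the physical arrangement of the acquisition devices and the clipping at the ends of the rig are assumptions, since the text above only requires that the third depth maps be taken from a neighbourhood of the current view angle.

```python
def third_view_indices(m, k, num_views):
    """View angles within the preset range [M-K, M+K] around view m, excluding m.

    The views are assumed to be indexed along the physical arrangement of the
    acquisition devices (e.g. 40 cameras on an arc), so clipping at the ends of
    the rig naturally yields an off-centre value interval as noted above.
    """
    lo, hi = max(0, m - k), min(num_views - 1, m + k)
    return [v for v in range(lo, hi + 1) if v != m]

# Example from the text: 40 devices, m = 9 (the 10th from the left, 0-based), k = 3
# -> indices 6..12 except 9, i.e. the 7th-9th and 11th-13th devices
```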
In a second mode, the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map are determined based on the spatial consistency between each such pixel and the pixels in a preset area around it in the depth map where it is located.
It should be noted that mode one may be implemented in different ways; two examples are described below. In a specific implementation, either example may be used alone, the two may be used in combination, and either example or their combination may further be combined with other embodiments. The examples in this specification are only intended to help those skilled in the art better understand and implement the present invention and are not intended to limit its scope of protection.
Example one of mode one: Determining confidence of pixels based on matching differences of texture maps
Referring to the flowchart of the method for obtaining confidence values of corresponding pixels in the depth map to be processed and the positions of the second depth maps shown in fig. 7, the following steps may be specifically adopted:
S71, obtaining a texture map corresponding to the depth map to be processed and texture maps corresponding to the second depth maps.
As described in the foregoing embodiment, since the image combination of the multi-angle freeview video frame includes multiple sets of texture maps and depth maps with corresponding relationships in angle synchronization, the texture map of the corresponding view angle can be obtained from the image combination of the current video frame and the video frame of each second depth map in the video frame sequence of the preset window. Referring to fig. 6, texture maps corresponding to depth maps with M view angles in all of the T-N frame to the t+n frame video frames can be obtained respectively.
And S72, respectively mapping the texture values of the corresponding positions in the texture map corresponding to the depth map to be processed and the texture map corresponding to the second depth map to the corresponding positions in the texture map corresponding to the third depth map of each corresponding view angle according to the depth values of the corresponding pixels in the positions in the depth map to be processed and the second depth map, and obtaining mapped texture values corresponding to the third depth map of each corresponding view angle.
In a specific implementation, based on a spatial position relationship of image combinations of different view angles in the image combinations of each video frame, according to depth values of corresponding pixels in the to-be-processed depth map and each second depth map, texture values in corresponding positions in texture maps corresponding to the to-be-processed depth map and texture maps corresponding to each second depth map are mapped to corresponding positions in texture maps corresponding to third depth maps of each corresponding view angle respectively, so as to obtain mapped texture values corresponding to the third depth maps of each corresponding view angle.
Continuing with the description of FIG. 6, the texture value Color_(T-N)^M(x, y) at any position (x, y) of the texture map corresponding to the depth map of view angle M in the T-N-th video frame can, according to the depth value of the pixel at that position, be mapped to the corresponding positions in the texture maps corresponding to the third depth maps of the view angles within [M-K, M+K] of the same video frame, obtaining the mapped texture value Color'(x, y) corresponding to each of those third depth maps; the same applies to the other video frames in the window.
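The depth-based mapping of texture values between view angles can be sketched as follows. A pinhole camera model with known intrinsics K and world-to-camera extrinsics [R|t] per view is assumed here; the text above only presupposes that the spatial position relationship between the view angles of one image combination is known from the parameter data.

```python
import numpy as np

def warp_pixel_to_view(x, y, depth, K_src, RT_src, K_dst, RT_dst):
    """Map pixel (x, y) of a source view into a destination view using its depth.

    K_* are 3x3 intrinsic matrices and RT_* are 3x4 world-to-camera extrinsics;
    this pinhole model is an assumption made for the sketch.
    """
    # back-project to a 3D point in the source camera, then into world coordinates
    p_cam = depth * (np.linalg.inv(K_src) @ np.array([x, y, 1.0]))
    R, t = RT_src[:, :3], RT_src[:, 3]
    p_world = R.T @ (p_cam - t)
    # project the world point into the destination view
    Rd, td = RT_dst[:, :3], RT_dst[:, 3]
    q = K_dst @ (Rd @ p_world + td)
    return q[0] / q[2], q[1] / q[2]   # position (x', y') in the destination texture map
```

The texture value Color(x, y) of the source view is then taken as the mapped texture value Color'(x', y') at that position; the same routine can serve for mapping depth values in the later embodiments.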
And S73, respectively matching the mapping texture values with actual texture values of corresponding positions in texture maps corresponding to the third depth maps of the corresponding visual angles, and determining confidence values of corresponding pixels in the depth maps to be processed and the second depth maps based on distribution intervals of matching degrees of the texture values corresponding to the third depth maps of the corresponding visual angles.
The higher the matching degree of the texture values corresponding to the third depth maps of the respective view angles, the smaller the difference between the corresponding texture maps, and the more reliable the depth value of the corresponding pixel in the depth map to be processed or in the second depth map; accordingly, the higher the confidence value.
For how to determine the confidence value of the corresponding pixel in the depth map to be processed and the position of each second depth map based on the distribution interval of the matching degree of the texture value corresponding to the third depth map of each corresponding view angle, various embodiments are possible.
In a specific implementation, the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map can be determined jointly from the matching degrees of the texture values corresponding to the third depth maps of the respective view angles and the number of third depth maps satisfying the corresponding matching degree. For example, it may be set that when the number of third depth maps whose matching degree is greater than a preset first matching degree threshold exceeds a preset first number threshold, the corresponding confidence value is set to 1, and otherwise to 0. Similarly, in an implementation, a graded correspondence between the matching degree threshold of the texture values, the number threshold of third depth maps satisfying that matching degree threshold, and the confidence value may also be set.
From the above, the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map may be binarized, i.e. take the value 0 or 1, or may be set to any value in [0, 1] or to a set of discrete values.
For convenience of description, assume that the actual texture value at the corresponding position in the texture map corresponding to each third depth map is Color_1(x, y). The mapped texture value Color'(x, y) corresponding to each third depth map of each video frame may then be matched against the actual texture value Color_1(x, y) at the corresponding position in the texture map corresponding to that third depth map; for example, the first matching degree threshold may be set to 80%.
In an embodiment of the present disclosure, the third depth maps corresponding to the depth map to be processed and to each second depth map in each video frame are the third depth maps within a 30° range on either side of the respective view angle, and three third depth maps of different view angles exist within that range for the depth map to be processed and for each second depth map. Taking occlusion between view angles into account, for example: if the number of third depth maps satisfying the preset first matching degree threshold is greater than or equal to 2, the confidence value of the pixel at the corresponding position in the second depth map of that video frame is determined to be 1; if that number is 0, the confidence value is likewise determined to be 1, since the mismatch may simply be caused by occlusion in all the third view angles; and if that number is equal to 1, the confidence value is determined to be 0.5. For the depth map to be processed, for which three third depth maps of different view angles likewise exist, the confidence value of the pixel at the corresponding position is determined under the same conditions as for the pixels at corresponding positions in the second depth maps.
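A minimal sketch of this occlusion-tolerant decision rule is given below; the 80% threshold and the 1 / 1 / 0.5 gradient follow the example above, and the function name and interface are illustrative only.

```python
def texture_match_confidence(match_degrees, first_match_thresh=0.8):
    """Confidence of one pixel from its texture matching degrees in the third views.

    match_degrees holds the matching degree of the mapped texture value against the
    actual texture value in each third depth map's texture map (three views here).
    The 1 / 1 / 0.5 gradient mirrors the occlusion-tolerant example above.
    """
    n_good = sum(m >= first_match_thresh for m in match_degrees)
    if n_good >= 2:
        return 1.0    # verified by at least two neighbouring view angles
    if n_good == 0:
        return 1.0    # no view matches: treated as occlusion, not penalised
    return 0.5        # verified by exactly one view: ambiguous
```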
Example two of mode one: determining confidence of pixels based on consistency of depth maps
When determining the confidence of a pixel based on the consistency of the depth maps, two implementations are possible, depending on the direction in which the depth maps are mapped. In the first, the depth values of the pixels at corresponding positions in the depth map to be processed and in each second depth map are mapped onto the third depth maps of the respective corresponding view angles, and the mapped depth values of the pixels at the corresponding positions in those third depth maps are matched against their actual depth values. In the second, the depth values obtained at the corresponding pixel positions in the third depth maps are mapped back to the depth map to be processed and to each second depth map, and the resulting mapped pixel positions are compared with the actual pixel positions there. Both are described in detail below.
Referring to the method flowchart shown in fig. 8 for obtaining confidence values of pixels corresponding to positions in the second depth maps and in the to-be-processed depth maps, the depth values of pixels corresponding to positions in the to-be-processed depth maps and in the second depth maps may be mapped onto third depth maps of respective corresponding view angles, and the mapped depth values of pixels corresponding to positions in the third depth maps of respective view angles may be respectively matched with actual depth values of pixels corresponding to positions, which specifically may include the following steps:
And S81, mapping the depth values of the pixels corresponding to the positions in the to-be-processed depth map and the second depth map to the third depth map of each corresponding view angle to obtain the mapping depth values of the pixels corresponding to the positions in the third depth map of each corresponding view angle.
As described in the previous embodiments, the image combination of a multi-angle free-view video frame includes multiple groups of texture maps and depth maps with corresponding relationships, so any video frame contains depth maps of multiple view angles. According to the preset spatial position relationship, the depth values of the pixels at corresponding positions in the depth map to be processed and in the second depth map of each video frame within the preset window are mapped onto the third depth maps of the respective corresponding view angles, giving the mapped depth values of the pixels at the corresponding positions in those third depth maps. Referring to fig. 6, for each video frame in the window from the T-N-th to the T+N-th frame, the third depth maps within the preset view angle range [M-K, M+K] corresponding to the depth map of view angle M (the depth map to be processed or a second depth map) can be obtained, and within the same frame the depth values of the pixels of the view-M depth map are mapped onto the third depth maps of the other view angles in [M-K, M+K].
And S82, respectively matching the mapping depth values of the pixels at the corresponding positions in the third depth map of each corresponding view angle with the actual depth values of the pixels at the corresponding positions in the third depth map of each corresponding view angle, and determining the confidence values of the pixels at the corresponding positions in the depth map to be processed and each second depth map based on the distribution interval of the matching degree of the depth values corresponding to the third depth map of each corresponding view angle.
The higher the matching degree of the depth values corresponding to the third depth maps of the respective view angles, the smaller the difference between the corresponding depth maps, and the more reliable the depth value of the corresponding pixel in the depth map to be processed or in the second depth map; accordingly, the higher the confidence value.
For how to determine the confidence value of the corresponding pixel in the depth map to be processed and the position of each second depth map based on the distribution interval of the matching degree of the depth values corresponding to the third depth map of each corresponding view angle, various embodiments are possible.
In a specific implementation, the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map can be determined jointly from the matching degrees of the depth values corresponding to the third depth maps of the respective view angles and the number of third depth maps satisfying the corresponding matching degree. For example, it may be set that when the number of third depth maps whose matching degree is greater than a preset second matching degree threshold (as a specific example, 80% or 70%) exceeds a preset second number threshold, the corresponding confidence value is set to 1, and otherwise to 0. Similarly, in an implementation, a graded correspondence between the matching degree threshold of the depth values, the number threshold of third depth maps satisfying that matching degree threshold, and the confidence value may also be set.
From the above, the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map may be binarized, i.e. take the value 0 or 1, or may be set to any value in [0, 1] or to a set of discrete values.
In an embodiment of the present disclosure, the third depth maps corresponding to the depth map to be processed or to a second depth map in each video frame are the third depth maps within a 30° range on either side of its view angle, and three third depth maps of different view angles exist within that range. Taking occlusion between view angles into account, for example: if the number of third depth maps satisfying the preset second matching degree threshold is greater than or equal to 2, the confidence value of the pixel at the corresponding position in the depth map to be processed or in the second depth map of that video frame is determined to be 1; if that number is 0, the confidence value is likewise determined to be 1, since the mismatch may simply be caused by occlusion in all the third view angles; and if that number is equal to 1, the confidence value is determined to be 0.5.
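Under the same kind of assumptions, the depth-consistency variant could be sketched as follows; measuring the matching degree as one minus the relative depth difference is an assumption, since the text above only requires some matching degree compared with a preset second matching degree threshold.

```python
import numpy as np

def depth_match_confidence(mapped_depths, actual_depths, second_match_thresh=0.8):
    """Confidence of one pixel from forward-mapped depth consistency.

    mapped_depths[i] is the depth value of the pixel after mapping into the i-th
    third depth map, actual_depths[i] the depth value actually stored there; the
    matching degree is taken as 1 minus the relative difference (an assumption).
    """
    mapped = np.asarray(mapped_depths, dtype=float)
    actual = np.asarray(actual_depths, dtype=float)
    match = 1.0 - np.abs(mapped - actual) / np.maximum(np.abs(actual), 1e-8)
    n_good = int((match >= second_match_thresh).sum())
    if n_good >= 2:
        return 1.0
    if n_good == 0:
        return 1.0    # unverifiable, e.g. occluded in all third views
    return 0.5
```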
Referring to the flowchart shown in fig. 9 of the method for obtaining the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map, the depth values obtained at the corresponding pixel positions in the third depth maps may be mapped back to the corresponding positions in the depth map to be processed and in each second depth map; the mapped pixel positions derived from the third depth maps of the respective view angles are then compared with the actual pixel positions at the corresponding positions, and the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map are determined from the distance between the two positions. This may specifically include the following steps:
S91, respectively obtaining the depth values of the pixels at corresponding positions in the depth map to be processed and in each second depth map, mapping each pixel according to its depth value to the corresponding pixel position in the third depth map of each corresponding view angle, obtaining the depth value at that pixel position in the third depth map and mapping it back to the corresponding pixel position in the depth map to be processed and in each second depth map, thereby obtaining the mapped pixel positions, in the depth map to be processed and in each second depth map, derived from the third depth maps of the respective corresponding view angles.
As described in the previous embodiments, the image combination of a multi-angle free-view video frame includes multiple groups of angle-synchronized texture maps and depth maps with corresponding relationships, so any video frame contains depth maps of multiple view angles. For each video frame in the window from the T-N-th to the T+N-th frame, the depth values of the pixels at corresponding positions in the depth map to be processed and in the second depth map are obtained and, according to those depth values, the pixels are mapped to the corresponding pixel positions in the third depth maps within the preset view angle range [M-K, M+K], where the depth values at those positions are read out. Then, according to the preset spatial position relationship, those depth values are mapped back to the corresponding pixel positions in the depth map to be processed and in each second depth map, giving the mapped pixel positions derived from the third depth maps of the respective view angles.
Referring to fig. 6, for each video frame in the window from the T-N-th to the T+N-th frame, the depth values of the pixels at corresponding positions in the depth map of view angle M (the depth map to be processed or a second depth map) can be obtained and, according to those depth values, mapped to the corresponding pixel positions in the third depth maps within the preset view angle range [M-K, M+K]. The depth values at those positions in the third depth maps of the same frame are then read out and, according to the preset spatial position relationship, mapped back into the depth map of view angle M of the same video frame, giving the mapped pixel positions derived from the third depth maps of the respective view angles. For example, in the T-N-th frame, the depth values at the corresponding pixel positions of the third depth maps of view angles M-2, M-1, M+1 and M+2 can be mapped back into the depth map of view angle M of the T-N-th frame, yielding the corresponding mapped pixel positions.
S92, respectively calculating the pixel distance between the actual pixel position of each pixel at the corresponding position in the depth map to be processed and in each second depth map and the mapped pixel position obtained by reverse mapping from the third depth map of each corresponding view angle, and determining the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map based on the distribution intervals of the calculated pixel distances.
The smaller the pixel distance, the more reliable the depth value of the pixel at the corresponding position in the depth map to be processed or in the second depth map; accordingly, the higher the confidence value.
As to how the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map are determined based on the distribution intervals of the pixel distances corresponding to the third depth maps of the respective view angles, various embodiments are possible.
In a specific implementation, the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map can be determined jointly from the pixel distances corresponding to the third depth maps of the respective view angles and the number of third depth maps falling within the corresponding distance threshold interval. For example, it may be set that when the number of third depth maps whose pixel distance is smaller than a preset distance threshold d0 is greater than a preset third number threshold, the corresponding confidence value is set to 1, and otherwise to 0. Similarly, in an implementation, a graded correspondence between the distance thresholds, the number threshold of third depth maps satisfying each distance threshold, and the confidence value may also be set.
From the above, the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map may be binarized, i.e. take the value 0 or 1, or may be set to any value in [0, 1] or to a set of discrete values.
In an embodiment of the present disclosure, the third depth maps corresponding to the depth map to be processed and to each second depth map in each video frame are the third depth maps within a 30° range on either side of the respective view angle, and three third depth maps of different view angles exist within that range. Taking occlusion between view angles into account, for example: if the number of third depth maps whose pixel distance is smaller than the preset first distance threshold is greater than or equal to 2, the confidence value of the pixel at the corresponding position in the depth map to be processed or in the second depth map of that video frame is determined to be 1; if that number is 0, the confidence value is likewise determined to be 1, since the round trip may fail simply because of occlusion in all the third view angles; and if that number is equal to 1, the confidence value is determined to be 0.5.
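A sketch of the reprojection-distance variant is given below; the distance threshold of one pixel and the confidence gradient mirror the example above and are assumptions.

```python
import numpy as np

def reprojection_confidence(original_xy, remapped_xys, dist_thresh=1.0):
    """Confidence of one pixel from round-trip mapping through the third depth maps.

    remapped_xys are the pixel positions obtained by mapping the pixel into each
    third depth map and back again; dist_thresh is a distance threshold in pixels
    (its value here is an assumption).
    """
    d = np.linalg.norm(np.asarray(remapped_xys, dtype=float)
                       - np.asarray(original_xy, dtype=float), axis=1)
    n_good = int((d < dist_thresh).sum())
    if n_good >= 2:
        return 1.0
    if n_good == 0:
        return 1.0    # unverifiable, e.g. occluded in all third views
    return 0.5
```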
Specific implementation examples of determining the confidence of a pixel based on the matching difference of the texture map and determining the confidence of a pixel based on the consistency of the depth map are given above. In implementations, the confidence of the pixel may also be determined jointly by a combination of the two. The following examples are given in connection with some embodiments, it being understood that the following examples are not intended to limit the scope of the invention.
Combination mode one: the product of the confidence determined from the matching difference of the texture maps and the confidence determined from the consistency of the depth maps is taken as the confidence of the pixel at the corresponding position in the depth map to be processed and in each second depth map, which can be expressed by the following formula:
Weight_c=Weight_c_texture*Weight_c_depth;
where Weight_c represents the confidence of the pixel at the corresponding position in the depth map to be processed or in a second depth map, Weight_c_texture represents the confidence determined from the matching difference of the texture maps, and Weight_c_depth represents the confidence determined from the consistency of the depth maps.
Combination mode two: the weighted sum of the confidence determined from the matching difference of the texture maps and the confidence determined from the consistency of the depth maps is taken as the confidence of the pixel at the corresponding position in the depth map to be processed and in each second depth map, which can be expressed as follows:
Weight_c=a*Weight_c_texture+b*Weight_c_depth;
where Weight_c represents the confidence of the pixel at the corresponding position in the depth map to be processed or in a second depth map, Weight_c_texture represents the confidence determined from the matching difference of the texture maps, Weight_c_depth represents the confidence determined from the consistency of the depth maps, a is the weighting coefficient of the confidence determined from the matching difference of the texture maps, and b is the weighting coefficient of the confidence determined from the consistency of the depth maps.
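Both combination modes can be captured in one small helper; the default weighting coefficients are placeholders, since the values of a and b are not fixed by the text above.

```python
def combined_confidence(w_texture, w_depth, a=0.5, b=0.5, mode="product"):
    """Combine the two confidences according to the two combination modes above.

    mode="product":  Weight_c = Weight_c_texture * Weight_c_depth
    mode="weighted": Weight_c = a * Weight_c_texture + b * Weight_c_depth
    """
    if mode == "product":
        return w_texture * w_depth
    return a * w_texture + b * w_depth
```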
The above describes how the pixel confidence is determined in mode one; two implementation examples of determining the pixel confidence in mode two are given below:
example one: and respectively matching the pixels corresponding to the positions in the depth map to be processed and the second depth maps with the depth values of the pixels in the preset area around the depth map where the pixels are located, and respectively determining the confidence values of the pixels corresponding to the positions in the depth map to be processed and the second depth maps based on the matching degree of the depth values and the number of the pixels with the matching degree meeting a preset pixel matching degree threshold value.
Referring to any second depth map Px shown in fig. 10, take as the pixel whose confidence is to be determined the pixel (x1', y1') in Px that corresponds to a pixel position in the depth map to be processed. Its depth value may be matched against the depth value of each pixel in the preset region R around (x1', y1') in the second depth map Px; for example, if the matching degree of 5 of the 8 pixels in the preset region R is greater than the preset pixel matching degree threshold of 60%, the confidence of the pixel (x1', y1') in the second depth map Px may be determined to be 0.8.
In an implementation, the preset area may take a circular shape, a rectangular shape or an irregular shape, and the specific shape is not limited, and may surround the pixel to be determined with the confidence coefficient, and the size of the preset area may be set empirically.
Example two: and matching the corresponding pixels in the depth map to be processed and the positions in each second depth map with weighted average values of depth values of pixels in preset areas around the depth map where the pixels are located, and respectively determining confidence values of the corresponding pixels in the depth map to be processed and the positions in each second depth map based on the matching degree of the corresponding pixels in the depth map to be processed and the positions in each second depth map with the weighted average values.
With continued reference to fig. 10, in an embodiment of the present disclosure, the depth values of the pixels in the preset region R around the pixel (x1', y1') are weighted-averaged and similarity matching is then performed; for example, if the matching degree between the weighted-average depth value and the depth value of the pixel (x1', y1') is greater than 50%, the confidence of the pixel (x1', y1') in the second depth map Px may be determined to be 1. A sketch covering both examples of mode two follows.
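A combined sketch of both mode-two examples is given below; the matching degree is again taken as one minus the relative depth difference, a plain average stands in for the weighted average, and the thresholds and returned values mirror the examples above, all of which are assumptions.

```python
import numpy as np

def neighbourhood_confidence(depth_map, x, y, radius=1,
                             pixel_match_thresh=0.6, count_thresh=5,
                             use_weighted_average=False):
    """Mode-two confidence of the pixel at (x, y) from its spatial neighbourhood.

    With use_weighted_average=False this follows example one (count the
    neighbouring pixels whose depth matches the centre pixel); with True it
    follows example two (match against an average of the region, standing in
    for the weighted average). Thresholds and the 0.8 / 1 / 0 return values
    mirror the examples above.
    """
    h, w = depth_map.shape
    centre = float(depth_map[y, x])
    ys = slice(max(0, y - radius), min(h, y + radius + 1))
    xs = slice(max(0, x - radius), min(w, x + radius + 1))
    region = depth_map[ys, xs].astype(float)
    denom = max(abs(centre), 1e-8)

    if use_weighted_average:
        avg_match = 1.0 - abs(region.mean() - centre) / denom
        return 1.0 if avg_match > 0.5 else 0.0

    match = 1.0 - np.abs(region - centre) / denom
    n_good = int((match > pixel_match_thresh).sum()) - 1   # exclude the centre pixel
    return 0.8 if n_good >= count_thresh else 0.0
```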
A number of ways of determining the confidence of the pixels at corresponding positions in the depth map to be processed and in each second depth map are given above; in particular implementations, two or more of them may also be used in combination. By adding the first filter coefficient weight value corresponding to these confidence values to the window filter coefficient values and filtering the depth values of the pixels at corresponding positions in the depth map to be processed according to the preset filtering mode, the filtered depth values are obtained while the influence of unreliable depth values in the depth map to be processed and in the second depth maps on the filtering result is avoided, so the stability of the depth map in the time domain can be improved.
With the depth map processing method of the above embodiments, filtering the depth map in the time domain can improve the image quality of video reconstruction. So that those skilled in the art can better understand and implement this, how to perform video reconstruction is described below by way of an embodiment.
Referring to the flowchart of the video reconstruction method shown in fig. 11, the method specifically may include the following steps:
S111, acquiring an image combination of a multi-angle free-view video frame, parameter data corresponding to the image combination of the video frame, and virtual viewpoint position information based on user interaction, where the image combination of the video frame includes multiple groups of angle-synchronized texture maps and depth maps with corresponding relationships.
S112, performing filtering processing on the depth maps in the time domain.
In a specific implementation, the depth map processing method of the embodiments of the present disclosure may be used for the filtering; the specific steps can be found in the description of the foregoing embodiments and are not repeated here.
S113, selecting, according to the virtual viewpoint position information, the parameter data corresponding to the image combination of the video frame and a preset rule, the texture maps and filtered depth maps of the corresponding groups in the image combination of the video frame at the user-interaction moment.
S114, based on the virtual viewpoint position information and the parameter data corresponding to the texture maps and depth maps of the corresponding groups in the image combination of the video frame at the user-interaction moment, performing combined rendering on the selected texture maps and filtered depth maps of those groups to obtain a reconstructed image corresponding to the virtual viewpoint position at the user-interaction moment; a sketch of these two steps is given below.
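A minimal sketch of steps S113 and S114 is given here; select_views and render stand in for the preset selection rule and the DIBR-style combined rendering, neither of which is spelled out by the text above, and the data layout is assumed.

```python
def reconstruct_frame(textures, filtered_depths, params, virtual_viewpoint,
                      select_views, render):
    """Sketch of steps S113 and S114 for one user-interaction moment.

    textures / filtered_depths map a view identifier to the texture map and the
    temporally filtered depth map (S112) of that view; select_views stands in for
    the preset selection rule and render for the DIBR-style combined rendering.
    """
    views = select_views(virtual_viewpoint, params)                       # S113
    pairs = [(textures[v], filtered_depths[v]) for v in views]
    return render(pairs, params, virtual_viewpoint)                       # S114
```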
With the above video reconstruction method, when the depth maps in the video frames are filtered in the time domain, the depth maps with the same view angle as the depth map to be processed in each video frame of the video frame sequence of a preset temporal window, i.e. the second depth maps, are obtained; the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map are obtained; and the first filter coefficient weight value corresponding to those confidence values is determined and added to the window filter coefficient values. The influence of unreliable depth values in the depth map to be processed and in the second depth maps on the filtering result can thus be avoided, the stability of the depth map in the time domain improved, and the image quality of the reconstructed video improved.
The embodiments of the present specification further provide specific apparatuses and systems capable of implementing the methods of the foregoing embodiments, and the description is made below by way of specific embodiments with reference to the accompanying drawings.
The embodiment of the specification provides a depth map processing device, which can perform filtering processing on a depth map in the time domain. Referring to the schematic structure of the depth map processing apparatus shown in fig. 12, the depth map processing apparatus 120 may include:
a depth map obtaining unit 121, adapted to obtain a depth map to be processed from an image combination of a current video frame of a multi-angle free view, where the image combination of the current video frame of the multi-angle free view includes a plurality of groups of texture maps and depth maps having a corresponding relationship with each other and having a plurality of synchronous angles;
a frame sequence obtaining unit 122, adapted to obtain a video frame sequence including a preset window in the time domain of the current video frame;
a window filter coefficient value obtaining unit 123, adapted to obtain the window filter coefficient value corresponding to each video frame in the video frame sequence, where the window filter coefficient value is generated from weight values of at least two dimensions, including a first filter coefficient weight value corresponding to the pixel confidence; the window filter coefficient value obtaining unit 123 includes: a first filter coefficient weight value obtaining subunit 1231, adapted to obtain the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map and to determine the first filter coefficient weight value corresponding to those confidence values, where the second depth map is the depth map with the same view angle as the depth map to be processed in each video frame of the video frame sequence;
The filtering unit 124 is adapted to filter the pixels corresponding to the positions in the depth map to be processed according to a preset filtering mode based on the window filtering coefficient values corresponding to the video frames, so as to obtain the depth values after the pixels corresponding to the positions in the depth map to be processed are filtered.
In an implementation, the first filter coefficient weight value obtaining subunit 1231 may include at least one of the following confidence value determining subunits:
the first confidence value determining means 12311 is adapted to obtain the depth map in the preset view angle range around the corresponding view angles of the depth map to be processed and the second depth maps, obtain a third depth map of the corresponding view angles, and determine the confidence values of the corresponding pixels in the depth map to be processed and the second depth maps based on the third depth map of the corresponding view angles;
the second confidence value determining means 12312 is adapted to determine a confidence value of a pixel in the depth map to be processed and in each second depth map, based on spatial correspondence between the pixel in the depth map to be processed and the pixel in a preset area around the depth map in which the pixel is located.
In an embodiment of the present disclosure, the first confidence value determining unit 12311 is adapted to obtain the texture map corresponding to the depth map to be processed and the texture maps corresponding to the second depth maps, and to map, according to the depth values of the pixels at corresponding positions in the depth map to be processed and in each second depth map, the texture values at the corresponding positions in the texture map corresponding to the depth map to be processed and in the texture maps corresponding to the second depth maps to the corresponding positions in the texture maps corresponding to the third depth maps of the respective corresponding view angles, so as to obtain the mapped texture values corresponding to those third depth maps; and to match the mapped texture values respectively against the actual texture values at the corresponding positions in the texture maps corresponding to the third depth maps of the respective view angles, and determine the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map based on the distribution intervals of the matching degrees of the texture values corresponding to those third depth maps.
In another embodiment of the present disclosure, the first confidence value determining unit 12311 maps the corresponding pixels in the depth map to be processed and the positions in the second depth maps to the third depth maps of the corresponding perspectives, to obtain mapped depth values of the pixels in the corresponding positions in the third depth maps of the corresponding perspectives; and matching the mapping depth values of the pixels at the corresponding positions in the third depth map of each corresponding view angle with the actual depth values of the pixels at the corresponding positions in the third depth map of each corresponding view angle, and determining the confidence values of the pixels at the corresponding positions in the depth map to be processed and each second depth map based on the distribution interval of the matching degree of the depth values corresponding to the third depth map of each corresponding view angle.
In yet another embodiment of the present disclosure, the first confidence value determining means 12311 is adapted to obtain the depth values of the pixels at corresponding positions in the depth map to be processed and in each second depth map, map each pixel according to its depth value to the corresponding pixel position in the third depth map of the corresponding view angle, obtain the depth value at that pixel position in the third depth map and map it back to the corresponding pixel position in the depth map to be processed and in each second depth map, thereby obtaining the mapped pixel positions derived from the third depth maps of the respective view angles; and to calculate the pixel distance between the actual pixel position of each pixel at the corresponding position in the depth map to be processed and in each second depth map and the mapped pixel position obtained by reverse mapping from the third depth map of each corresponding view angle, and determine the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map based on the distribution intervals of the calculated pixel distances.
In an embodiment of the present disclosure, the second confidence value determining means 12312 is adapted to match the corresponding pixels in the depth map to be processed and the positions in each second depth map with the depth values of the pixels in the preset area around the depth map where the pixels are located, and determine the confidence values of the corresponding pixels in the depth map to be processed and the positions in each second depth map based on the matching degree of the depth values and the number of pixels whose matching degree meets the preset pixel matching degree threshold.
In another embodiment of the present disclosure, the second confidence value determining unit 12312 matches the pixels corresponding to the positions in the depth map to be processed and the weighted average of the depth values of the pixels in the preset area around the depth map where the pixels are located, and determines the confidence values of the pixels corresponding to the positions in the depth map to be processed and the second depth map based on the matching degree of the pixels corresponding to the positions in the depth map to be processed and the weighted average of the pixels corresponding to the positions in the second depth map.
In a specific implementation, the weight value of the window filter coefficient may further include at least one of the following: and a second filter coefficient weight value corresponding to the frame distance and a third filter coefficient weight value corresponding to the pixel similarity.
Accordingly, the window filter coefficient value obtaining unit 123 may further include at least one of:
a second filter coefficient weight value obtaining subunit 1232, adapted to obtain a frame distance between each video frame in the video frame sequence and the current video frame, and determine a second filter coefficient weight value corresponding to the frame distance;
the third filter coefficient weight value obtaining subunit 1233 is adapted to obtain the similarity value of the texture map corresponding to each second depth map and the pixel corresponding to the position in the texture map corresponding to the depth map to be processed, and determine the third filter coefficient weight value corresponding to the similarity value.
In an embodiment of the present disclosure, the filtering unit 124 is adapted to take the product of the first filter coefficient weight value and at least one of the second filter coefficient weight value and the third filter coefficient weight value, or a weighted average value as a corresponding window filter coefficient value for each video frame; and calculating a weighted average value of products of depth values of pixels corresponding to positions in the depth map to be processed and each second depth map and window filter coefficient values corresponding to each video frame, and obtaining the filtered depth values corresponding to positions in the depth map to be processed.
The embodiment of the specification also provides a video reconstruction system, and the video reconstruction system is adopted to reconstruct the video, so that the image quality of the reconstructed video can be improved. Referring to the schematic structure of the video reconstruction system shown in fig. 13, the video reconstruction system 130 includes: an acquisition module 131, a filtering module 132, a selection module 133 and an image reconstruction module 134, wherein:
the obtaining module 131 is adapted to obtain an image combination of a video frame of a multi-angle free view, parameter data corresponding to the image combination of the video frame, and virtual viewpoint position information based on user interaction, where the image combination of the video frame includes a plurality of groups of texture maps and depth maps with corresponding relationships and with synchronous angles;
the filtering module 132 is adapted to filter a depth map in the video frame;
the selection module 133 is adapted to select, according to the virtual viewpoint position information and parameter data corresponding to the image combination of the video frame, a texture map and a filtered depth map of a corresponding group in the image combination of the video frame at the user interaction time according to a preset rule;
the image reconstruction module 134 is adapted to perform combined rendering on the texture map and the filtered depth map of the corresponding group in the image combination of the video frame at the selected user interaction time based on the virtual viewpoint position information and the parameter data corresponding to the texture map and the depth map of the corresponding group in the image combination of the video frame at the user interaction time, so as to obtain a reconstructed image corresponding to the virtual viewpoint position at the user interaction time;
Wherein the filtering module 132 may include:
the depth map obtaining unit 1321 is adapted to obtain a depth map to be processed from an image combination of a current video frame of a multi-angle free view, where the image combination of the current video frame of the multi-angle free view includes a plurality of groups of texture maps and depth maps with corresponding relationships in which angles are synchronous;
a frame sequence obtaining unit 1322, adapted to obtain a video frame sequence including a time domain preset window of the current video frame;
a window filter coefficient value obtaining unit 1323, adapted to obtain the window filter coefficient value corresponding to each video frame in the video frame sequence, where the window filter coefficient value is generated from weight values of at least two dimensions, including a first filter coefficient weight value corresponding to the pixel confidence; the window filter coefficient value obtaining unit includes: a first filter coefficient weight value obtaining subunit 13231, adapted to obtain the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map and to determine the first filter coefficient weight value corresponding to those confidence values, where the second depth map is the depth map with the same view angle as the depth map to be processed in each video frame of the video frame sequence;
The filtering unit 1324 is adapted to filter the pixels corresponding to the positions in the depth map to be processed according to a preset filtering manner based on the window filtering coefficient values corresponding to the video frames, so as to obtain the depth values after the pixels corresponding to the positions in the depth map to be processed are filtered.
In a specific implementation, the window filter coefficient value obtaining unit 1323 further includes at least one of:
a second filter coefficient weight value obtaining subunit 13232, adapted to obtain a frame distance between each video frame in the video frame sequence and the current video frame, and determine a second filter coefficient weight value corresponding to the frame distance;
the third filter coefficient weight value obtaining subunit 13233 is adapted to obtain similarity values of corresponding pixels in the texture map corresponding to each second depth map and the texture map corresponding to the depth map to be processed, and determine the third filter coefficient weight value corresponding to the similarity value.
For the specific implementation of the filtering module 132, reference may be made to fig. 12: the depth map processing apparatus shown in fig. 12 may serve as the filtering module 132 to perform the temporal filtering, and reference may also be made to the depth map processing apparatus and depth map processing method of the foregoing embodiments. It should be noted that the depth map processing apparatus may be implemented by corresponding software, hardware, or a combination of software and hardware. The calculation of the filter coefficient weight values can be implemented by one or more CPUs, by one or more GPUs, or by CPUs and GPUs working cooperatively; a CPU may communicate with one or more GPU chips or GPU modules and control each GPU chip or GPU module to carry out the filtering of the depth map.
The embodiment of the present disclosure further provides an electronic device. Referring to the schematic structural diagram of the electronic device shown in fig. 14, the electronic device 140 may include a memory 141 and a processor 142, where the memory 141 stores computer instructions executable on the processor 142, and when the processor 142 executes the computer instructions, the steps of the depth map processing method or of the video reconstruction method described in any one of the foregoing embodiments may be performed. For the specific steps, reference may be made to the description of the foregoing embodiments, which is not repeated here.
It should be noted that the processor 142 may specifically include a CPU chip 1421 formed by one or more CPU cores, or may include a GPU chip 1422, or a chip module formed by the CPU chip 1421 and the GPU chip 1422. The processor 142 and the memory 141 may communicate with each other via a bus or the like, and the chips may also communicate with each other via corresponding communication interfaces.
The embodiments of the present disclosure further provide a computer readable storage medium having computer instructions stored thereon, where the computer instructions, when executed, may perform the steps of the depth map processing method or of the video reconstruction method described in any one of the foregoing embodiments. For the specific steps, reference may be made to the description of the foregoing embodiments, which is not repeated here.
For better understanding and implementation by those skilled in the art, an example of a specific application in the scenario shown in fig. 1 is given below.
The cloud server cluster 13 may first perform time-domain filtering on the depth map by using the embodiment of the present disclosure, and then perform image reconstruction based on the texture map of the corresponding group in the image combination of the video frame and the filtered depth map, so as to obtain a reconstructed multi-angle free view image.
In an implementation, the cloud server cluster 13 may include a first cloud server 131, a second cloud server 132, a third cloud server 133, and a fourth cloud server 134. The first cloud server 131 may be configured to determine the parameter data corresponding to the image combination; the second cloud server 132 may be configured to determine the depth data of each frame image in the image combination; the third cloud server 133 may reconstruct the frame images of a preset virtual viewpoint path using a depth image based rendering (DIBR) algorithm for virtual viewpoint reconstruction, based on the parameter data corresponding to the image combination, the pixel data of the image combination, and the depth data; and the fourth cloud server 134 may be configured to generate a multi-angle free-view video, where the multi-angle free-view video data may include multi-angle free-view spatial data and multi-angle free-view temporal data of the frame images ordered by frame moment.
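As a rough illustration of the virtual viewpoint reconstruction step performed by the third cloud server 133, the sketch below shows only the forward-warping core of a DIBR-style algorithm under simplified assumptions (pinhole cameras, known intrinsics and relative pose, no hole filling or view blending); the function name dibr_forward_warp and all parameters are illustrative, not an API from this disclosure.

```python
import numpy as np

def dibr_forward_warp(texture, depth, K_ref, K_virt, R, t):
    """Simplified DIBR forward warp: splat reference-view pixels into a virtual view.

    texture: HxWx3 reference texture map
    depth:   HxW reference depth map (metric depth along the camera z-axis)
    K_ref, K_virt: 3x3 intrinsic matrices of the reference and virtual cameras
    R, t:    rotation (3x3) and translation (3,) from the reference to the virtual camera
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3xN

    # Back-project reference pixels to 3D, transform to the virtual camera, re-project.
    pts_ref = np.linalg.inv(K_ref) @ pix * depth.reshape(1, -1)
    pts_virt = R @ pts_ref + t.reshape(3, 1)
    proj = K_virt @ pts_virt
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    z = proj[2]

    warped = np.zeros_like(texture)
    zbuf = np.full((h, w), np.inf)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (z > 0)
    src = texture.reshape(-1, 3)
    for idx in np.flatnonzero(valid):
        # Z-buffering: keep the closest surface when several pixels land on the same target.
        if z[idx] < zbuf[v[idx], u[idx]]:
            zbuf[v[idx], u[idx]] = z[idx]
            warped[v[idx], u[idx]] = src[idx]
    return warped  # holes would still need inpainting or blending with other reference views
```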
It should be understood that the first cloud server 131, the second cloud server 132, the third cloud server 133, and the fourth cloud server 134 may also be a server group formed by a server array or a server subset, which is not limited in the embodiment of the present disclosure.
As a specific example, the second cloud server 132 may obtain a depth map from the image combination of the current video frame of the multi-angle free view as the depth map to be processed, perform temporal filtering on it according to the foregoing embodiments of the present disclosure to improve the temporal stability of the depth map, and then perform video reconstruction using the temporally filtered depth map, so that the quality of the reconstructed video image can be improved whether it is played on the playing terminal 15 or on the interactive terminal 16.
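Putting the two sketches together, a per-frame flow like the one just described could look roughly as follows; the data are synthetic placeholders and the function names come from the illustrative sketches above, not from the embodiments.

```python
import numpy as np

# Illustrative only: reuses the sketch functions above on synthetic data.
h, w, win = 270, 480, 5
depth_window = [np.full((h, w), 3.0) for _ in range(win)]   # dummy depth maps (same view angle)
conf_window = [np.ones((h, w)) for _ in range(win)]         # dummy per-pixel confidences
tex_window = [np.zeros((h, w)) for _ in range(win)]         # dummy grayscale textures
texture_rgb = np.zeros((h, w, 3), dtype=np.float64)         # dummy texture map of the current frame
K = np.array([[500.0, 0, w / 2], [0, 500.0, h / 2], [0, 0, 1.0]])
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])                 # small baseline toward the virtual view

filtered = temporal_filter_depth(depth_window, conf_window, tex_window, center_idx=win // 2)
virtual_view = dibr_forward_warp(texture_rgb, filtered, K, K, R, t)
```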
In some embodiments, the acquisition devices may also be located in the ceiling area of a basketball venue, on a basketball stand, and so on. The acquisition devices may be arranged along a straight line, a sector, an arc, a circle, a matrix, or an irregular shape. The specific arrangement may be set according to one or more factors such as the specific site environment, the number of acquisition devices, the characteristics of the acquisition devices, and the imaging effect requirements. Each acquisition device may be any device with a camera function, such as an ordinary camera, a mobile phone, or a professional camera.
In some embodiments of the present disclosure, as shown in fig. 1, each acquisition device in the acquisition array 11 may transmit the obtained video data stream to the data processing device 12 in real time through a switch 17 or a local area network, etc.
It will be appreciated that the data processing device 12 may be disposed in a non-acquisition area on site or in the cloud according to the specific scenario, and the server (cluster) and the play control device may be disposed in a non-acquisition area on site, in the cloud, or on the terminal access side according to the specific scenario; this is not intended to limit the specific implementation and protection scope of the present invention. For the specific implementation, operating principle, and specific actions and effects of each device, system, or apparatus in the embodiments of the present disclosure, reference may be made to the specific description in the method embodiments.
It can be understood that the above embodiments are applicable to live or quasi-live broadcast scenarios, but are not limited thereto; the solutions in the embodiments of the present disclosure may also be applied to non-live broadcast scenarios, such as recorded broadcast, rebroadcast, and other scenarios with less stringent latency requirements, for video or image acquisition, data processing of the video data streams, and image generation by the server.
Although the embodiments of the present specification are disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the embodiments of the present invention, and the scope of the invention is therefore intended to be limited only by the appended claims.

Claims (17)

1. A depth map processing method, comprising:
obtaining a depth map to be processed from an image combination of a current video frame of a multi-angle free view, wherein the image combination of the current video frame of the multi-angle free view comprises a plurality of texture maps and depth maps which are synchronous in angle and have corresponding relation;
acquiring a video frame sequence containing a preset window in the time domain of the current video frame;
acquiring a window filter coefficient value corresponding to each video frame in the video frame sequence, wherein the window filter coefficient value is generated from weight values of at least two dimensions, comprising a first filter coefficient weight value corresponding to pixel confidence, which is obtained in the following manner: obtaining confidence values of pixels at corresponding positions in the depth map to be processed and in each second depth map, and determining the first filter coefficient weight value corresponding to the confidence values, wherein: the second depth map is the depth map, in each video frame of the video frame sequence, with the same view angle as the depth map to be processed;
And filtering the pixels corresponding to the position in the depth map to be processed according to a preset filtering mode based on the window filtering coefficient values corresponding to the video frames to obtain depth values after the pixels corresponding to the position in the depth map to be processed are filtered.
2. The depth map processing method according to claim 1, wherein the weight values of the window filter coefficients further include: at least one of a second filter coefficient weight value corresponding to the frame distance and a third filter coefficient weight value corresponding to the pixel similarity; the second filter coefficient weight value and the third filter coefficient weight value are obtained by adopting the following modes:
acquiring frame distances between each video frame in the video frame sequence and the current video frame, and determining a second filter coefficient weight value corresponding to the frame distances;
and obtaining similarity values of corresponding pixels in positions of texture maps corresponding to the second depth maps and the texture maps corresponding to the depth maps to be processed, and determining third filter coefficient weight values corresponding to the similarity values.
3. The depth map processing method according to claim 1 or 2, wherein the obtaining confidence values of pixels at corresponding positions in the depth map to be processed and in each second depth map includes at least one of:
Obtaining a depth map in a preset view angle range around the corresponding view angles of the depth map to be processed and each second depth map, obtaining a third depth map of the corresponding view angles, and determining confidence values of corresponding pixels in the depth map to be processed and each second depth map based on the third depth map of each corresponding view angle;
and determining confidence values of the pixels corresponding to the positions in the depth map to be processed and the second depth maps based on the spatial consistency of the pixels corresponding to the positions in the depth map to be processed and the pixels in the preset area around the depth map where the pixels are located.
4. A depth map processing method according to claim 3, wherein determining confidence values of corresponding pixels in the depth map to be processed and in each second depth map based on the third depth map of each corresponding view comprises:
obtaining the texture map corresponding to the depth map to be processed and the texture maps corresponding to the second depth maps, and, according to the depth values of the pixels at corresponding positions in the depth map to be processed and in each second depth map, respectively mapping the texture values at those positions in the texture map corresponding to the depth map to be processed and in the texture maps corresponding to the second depth maps to corresponding positions in the texture maps corresponding to the third depth maps of the corresponding view angles, so as to obtain mapped texture values corresponding to the third depth maps of the corresponding view angles;
And respectively matching the mapping texture values with actual texture values of corresponding positions in texture maps corresponding to the third depth maps of the corresponding view angles, and determining confidence values of corresponding pixels in the depth maps to be processed and the second depth maps based on distribution intervals of matching degrees of the texture values corresponding to the third depth maps of the corresponding view angles.
5. A depth map processing method according to claim 3, wherein determining confidence values of corresponding pixels in the depth map to be processed and in each second depth map based on the third depth map of each corresponding view comprises:
mapping corresponding pixels in the depth map to be processed and the positions in each second depth map to the third depth map of each corresponding view angle, and obtaining mapping depth values of the pixels in the corresponding positions in the third depth map of each corresponding view angle;
and matching the mapping depth values of the pixels at the corresponding positions in the third depth map of each corresponding view angle with the actual depth values of the pixels at the corresponding positions in the third depth map of each corresponding view angle, and determining the confidence values of the pixels at the corresponding positions in the depth map to be processed and each second depth map based on the distribution interval of the matching degree of the depth values corresponding to the third depth map of each corresponding view angle.
6. A depth map processing method according to claim 3, wherein determining the confidence value of the corresponding pixel in the depth map to be processed and each second depth map based on the third depth map of each corresponding view comprises:
respectively obtaining the depth values of the pixels at corresponding positions in the depth map to be processed and in each second depth map, mapping those pixels to corresponding pixel positions in the third depth maps of the corresponding view angles according to the depth values, and then reverse-mapping them back to the depth map to be processed and each second depth map according to the depth values at the corresponding pixel positions in the third depth maps of the corresponding view angles, so as to obtain, in the depth map to be processed and in each second depth map, mapped pixel positions corresponding to the third depth maps of the corresponding view angles;
and respectively calculating the pixel distances between the actual pixel positions of the pixels at corresponding positions in the depth map to be processed and in each second depth map and the mapped pixel positions obtained by reverse mapping from the third depth maps of the corresponding view angles, and determining the confidence values of the pixels at corresponding positions in the depth map to be processed and in each second depth map based on the distribution intervals of the calculated pixel distances.
7. A depth map processing method according to claim 3, wherein determining the confidence value of the corresponding pixel in the depth map to be processed and each second depth map based on the spatial consistency between the corresponding pixel in the depth map to be processed and the pixels in the preset area around the depth map where the pixel is located, includes at least one of:
respectively matching the pixels corresponding to the positions in the depth map to be processed and the second depth maps with the depth values of the pixels in the preset area around the depth map where the pixels are located, and respectively determining confidence values of the pixels corresponding to the positions in the depth map to be processed and the second depth maps based on the matching degree of the depth values and the number of the pixels with the matching degree meeting a preset pixel matching degree threshold value;
and matching the corresponding pixels in the depth map to be processed and the positions in each second depth map with weighted average values of depth values of pixels in preset areas around the depth map where the pixels are located, and respectively determining confidence values of the corresponding pixels in the depth map to be processed and the positions in each second depth map based on the matching degree of the corresponding pixels in the depth map to be processed and the positions in each second depth map with the corresponding weighted average values.
8. The depth map processing method according to claim 2, wherein the filtering the depth values of the pixels corresponding to the positions in the depth map to be processed according to the preset filtering mode based on the window filtering coefficient values corresponding to each video frame to obtain the depth values after the filtering of the pixels corresponding to the positions in the depth map to be processed includes:
taking the product, or a weighted average, of the first filter coefficient weight value and at least one of the second filter coefficient weight value and the third filter coefficient weight value as the window filter coefficient value corresponding to each video frame;
and calculating a weighted average value of products of the depth values of the pixels corresponding to the positions in the depth map to be processed and the second depth map and window filter coefficient values corresponding to the video frames, and obtaining the depth values of the pixels corresponding to the positions in the depth map to be processed after filtering.
9. The depth map processing method according to claim 1 or 2, wherein the current video frame is located in the middle of the video frame sequence.
10. A depth map processing method, comprising:
obtaining a depth map to be processed from an image combination of a current video frame of a multi-angle free view, wherein the image combination of the current video frame of the multi-angle free view comprises a plurality of texture maps and depth maps which are synchronous in angle and have corresponding relation;
Acquiring a video frame sequence containing a preset window in the time domain of the current video frame;
acquiring a window filter coefficient value corresponding to each video frame in the video frame sequence, wherein the window filter coefficient value is generated from weight values of at least two dimensions, comprising a first filter coefficient weight value corresponding to pixel confidence, which is obtained in the following manner: obtaining depth maps within a preset view angle range around the view angles corresponding to the depth map to be processed and to each second depth map, to obtain third depth maps of the corresponding view angles, determining confidence values of pixels at corresponding positions in the depth map to be processed and in each second depth map based on the third depth maps of the corresponding view angles, and determining the first filter coefficient weight value corresponding to the confidence values; wherein: the second depth map is the depth map, in each video frame of the video frame sequence, with the same view angle as the depth map to be processed;
and filtering pixels corresponding to the position in the depth map to be processed according to a preset filtering mode based on the window filtering coefficient values corresponding to each video frame to obtain depth values after the pixels corresponding to the position in the depth map to be processed are filtered.
11. A method of video reconstruction, comprising:
acquiring image combinations of video frames of multi-angle free view angles, parameter data corresponding to the image combinations of the video frames and virtual view point position information based on user interaction, wherein the image combinations of the video frames comprise a plurality of groups of texture images and depth images with corresponding relations, wherein the groups of texture images and depth images have synchronous angles;
obtaining a filtered depth map by the depth map processing method according to any one of claims 1 to 9;
according to the virtual viewpoint position information and parameter data corresponding to the image combination of the video frame, selecting a texture map and a filtered depth map of a corresponding group in the image combination of the video frame at the time of user interaction according to a preset rule;
and based on the virtual viewpoint position information and parameter data corresponding to the texture map and the depth map of the corresponding group in the image combination of the video frame at the user interaction moment, carrying out combined rendering on the texture map and the filtered depth map of the corresponding group in the image combination of the video frame at the selected user interaction moment to obtain a reconstructed image corresponding to the virtual viewpoint position at the user interaction moment.
12. A depth map processing apparatus, comprising:
The depth map acquisition unit is suitable for acquiring a depth map to be processed from the image combination of the current video frame of the multi-angle free view, wherein the image combination of the current video frame of the multi-angle free view comprises a plurality of groups of texture maps and depth maps with corresponding relations, wherein the groups of texture maps and the depth maps are in angle synchronization;
a frame sequence obtaining unit, adapted to obtain a video frame sequence containing a preset window in the time domain of the current video frame;
a window filter coefficient value obtaining unit, adapted to obtain a window filter coefficient value corresponding to each video frame in the video frame sequence, wherein the window filter coefficient value is generated from weight values of at least two dimensions; the window filter coefficient value obtaining unit includes: a first filter coefficient weight value obtaining subunit, adapted to obtain confidence values of pixels at corresponding positions in the depth map to be processed and in each second depth map, and determine a first filter coefficient weight value corresponding to the confidence values, wherein: the second depth map is the depth map, in each video frame of the video frame sequence, with the same view angle as the depth map to be processed;
and the filtering unit is suitable for filtering the pixels corresponding to the position in the depth image to be processed according to a preset filtering mode based on the window filtering coefficient values corresponding to the video frames, and obtaining the depth values of the pixels corresponding to the position in the depth image to be processed after filtering.
13. The depth map processing device of claim 12, wherein the window filter coefficient value acquisition unit further comprises at least one of:
a second filter coefficient weight value obtaining subunit, adapted to obtain a frame distance between each video frame in the video frame sequence and the current video frame, and determine a second filter coefficient weight value corresponding to the frame distance;
and the third filter coefficient weight value acquisition subunit is suitable for acquiring similarity values of pixels corresponding to positions of texture maps corresponding to the second depth maps and the texture maps corresponding to the depth maps to be processed, and determining the third filter coefficient weight values corresponding to the similarity values.
14. A video reconstruction system, comprising:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is suitable for acquiring image combinations of video frames of multi-angle free view angles, parameter data corresponding to the image combinations of the video frames and virtual view point position information based on user interaction, wherein the image combinations of the video frames comprise a plurality of groups of texture images and depth images with corresponding relations, wherein the groups of texture images and depth images have synchronous angles;
the filtering module is suitable for filtering the depth map in the video frame;
the selection module is suitable for selecting a texture map and a filtered depth map of a corresponding group in the image combination of the video frame at the time of user interaction according to the virtual viewpoint position information and parameter data corresponding to the image combination of the video frame and a preset rule;
The image reconstruction module is suitable for carrying out combined rendering on the texture map and the filtered depth map of the corresponding group in the image combination of the video frame at the selected user interaction moment based on the virtual viewpoint position information and the parameter data corresponding to the texture map and the depth map of the corresponding group in the image combination of the video frame at the user interaction moment to obtain a reconstructed image corresponding to the virtual viewpoint position at the user interaction moment;
wherein, the filtering module includes:
the depth map acquisition unit is suitable for acquiring a depth map to be processed from the image combination of the current video frame of the multi-angle free view, wherein the image combination of the current video frame of the multi-angle free view comprises a plurality of groups of texture maps and depth maps with corresponding relations, wherein the groups of texture maps and the depth maps are in angle synchronization;
a frame sequence obtaining unit, adapted to obtain a video frame sequence containing a preset window in the time domain of the current video frame;
a window filter coefficient value obtaining unit, adapted to obtain a window filter coefficient value corresponding to each video frame in the video frame sequence, wherein the window filter coefficient value is generated from weight values of at least two dimensions; the window filter coefficient value obtaining unit includes: a first filter coefficient weight value obtaining subunit, adapted to obtain confidence values of pixels at corresponding positions in the depth map to be processed and in each second depth map, and determine a first filter coefficient weight value corresponding to the confidence values, wherein: the second depth map is the depth map, in each video frame of the video frame sequence, with the same view angle as the depth map to be processed;
And the filtering unit is suitable for filtering the pixels corresponding to the position in the depth image to be processed according to a preset filtering mode based on the window filtering coefficient values corresponding to the video frames, and obtaining the depth values of the pixels corresponding to the position in the depth image to be processed after filtering.
15. The video reconstruction system according to claim 14, wherein the window filter coefficient value acquisition unit further comprises at least one of:
a second filter coefficient weight value obtaining subunit, adapted to obtain a frame distance between each video frame in the video frame sequence and the current video frame, and determine a second filter coefficient weight value corresponding to the frame distance;
and the third filter coefficient weight value acquisition subunit is suitable for acquiring similarity values of pixels corresponding to positions of texture maps corresponding to the second depth maps and the texture maps corresponding to the depth maps to be processed, and determining the third filter coefficient weight values corresponding to the similarity values.
16. An electronic device comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the method of any of claims 1 to 9 or the steps of the method of claim 10 or 11.
17. A computer readable storage medium having stored thereon computer instructions which, when run, perform the method of any of claims 1 to 9 or the steps of the method of claim 10 or 11.
CN202010312853.4A 2020-04-20 2020-04-20 Depth map processing method, video reconstruction method and related devices Active CN113542721B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010312853.4A CN113542721B (en) 2020-04-20 2020-04-20 Depth map processing method, video reconstruction method and related devices
PCT/CN2021/088024 WO2021213301A1 (en) 2020-04-20 2021-04-19 Depth map processing method, video reconstruction method, and related apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010312853.4A CN113542721B (en) 2020-04-20 2020-04-20 Depth map processing method, video reconstruction method and related devices

Publications (2)

Publication Number Publication Date
CN113542721A CN113542721A (en) 2021-10-22
CN113542721B true CN113542721B (en) 2023-04-25

Family

ID=78123605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010312853.4A Active CN113542721B (en) 2020-04-20 2020-04-20 Depth map processing method, video reconstruction method and related devices

Country Status (2)

Country Link
CN (1) CN113542721B (en)
WO (1) WO2021213301A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117599519B (en) * 2024-01-24 2024-04-12 山东泽林农业科技有限公司 Intelligent control method for digital back flush integrated machine

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800129B (en) * 2012-06-20 2015-09-30 浙江大学 A kind of scalp electroacupuncture based on single image and portrait edit methods
EP3869797B1 (en) * 2012-08-21 2023-07-19 Adeia Imaging LLC Method for depth detection in images captured using array cameras
KR20160024419A (en) * 2014-08-25 2016-03-07 국방과학연구소 System and Method for identifying stereo-scopic camera in Depth-Image-Based Rendering
JP6450226B2 (en) * 2015-03-13 2019-01-09 日本放送協会 Camera control apparatus and program thereof
EP3216006B1 (en) * 2015-04-28 2020-06-10 Huawei Technologies Co., Ltd. An image processing apparatus and method
CN110390690B (en) * 2019-07-11 2021-04-16 Oppo广东移动通信有限公司 Depth map processing method and device

Also Published As

Publication number Publication date
CN113542721A (en) 2021-10-22
WO2021213301A1 (en) 2021-10-28

Similar Documents

Publication Publication Date Title
US11924394B2 (en) Methods and apparatus for receiving and/or using reduced resolution images
US11653065B2 (en) Content based stream splitting of video data
US11711504B2 (en) Enabling motion parallax with multilayer 360-degree video
CN110869980B (en) Distributing and rendering content as a spherical video and 3D portfolio
CN112738534B (en) Data processing method and system, server and storage medium
CN113784148A (en) Data processing method, system, related device and storage medium
US20120120201A1 (en) Method of integrating ad hoc camera networks in interactive mesh systems
CN112581627A (en) System and apparatus for user-controlled virtual camera for volumetric video
KR20200057484A (en) Method and apparatus for displaying a strike zone
CN113542721B (en) Depth map processing method, video reconstruction method and related devices
JP2017103613A (en) Information acquisition apparatus, information acquisition method, and information acquisition program
JP6450305B2 (en) Information acquisition apparatus, information acquisition method, and information acquisition program
CN112738009B (en) Data synchronization method, device, synchronization system, medium and server
US20150375109A1 (en) Method of Integrating Ad Hoc Camera Networks in Interactive Mesh Systems
CN112738646B (en) Data processing method, device, system, readable storage medium and server
CN112734821B (en) Depth map generation method, computing node cluster and storage medium
Li et al. Spherical Convolution empowered FoV Prediction in 360-degree Video Multicast with Limited FoV Feedback
CN114007058A (en) Depth map correction method, video processing method, video reconstruction method and related devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant