WO2022022548A1 - Free viewpoint video reconstruction and playing processing method, device, and storage medium - Google Patents

Free viewpoint video reconstruction and playing processing method, device, and storage medium

Info

Publication number
WO2022022548A1
Authority
WO
WIPO (PCT)
Prior art keywords: viewpoint, virtual, background, texture map, video frame
Prior art date
Application number
PCT/CN2021/108827
Other languages
French (fr)
Chinese (zh)
Inventor
王荣刚
蔡砚刚
顾嵩
盛骁杰
Original Assignee
阿里巴巴集团控股有限公司
北京大学深圳研究生院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 and 北京大学深圳研究生院
Publication of WO2022022548A1 publication Critical patent/WO2022022548A1/en


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 Processing image signals
    • H04N 13/111 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N 13/117 Transformation of image signals corresponding to virtual viewpoints, the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
    • H04N 13/20 Image signal generators
    • H04N 13/271 Image signal generators wherein the generated image signals comprise depth maps or disparity maps
    • H04N 13/282 Image signal generators for generating image signals corresponding to three or more geometrical viewpoints, e.g. multi-view systems

Definitions

  • the embodiments of this specification relate to the technical field of video processing, and in particular, to a method, device, and storage medium for free-view video reconstruction and playback processing.
  • Free viewpoint video is a technology that can provide a viewing experience with a high degree of freedom. During viewing, users can adjust the viewing angle through interactive operations and watch from whichever free viewpoint they choose, which can greatly improve the viewing experience.
  • Depth Image Based Rendering (DIBR) technology is mainly divided into the steps of viewpoint selection, preprocessing, mapping, view fusion, and post-processing.
  • When synthesizing virtual viewpoints, part of the background texture that is occluded by a foreground object may be invisible in the reference viewpoint but visible in the virtual viewpoint. Therefore, after view fusion, some unfilled hole areas still remain in the image of the virtual viewpoint.
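As a purely illustrative sketch (not the method claimed in this application), the following Python snippet forward-maps a single scanline of a reference view toward a virtual view using a depth-derived disparity with a depth test; the `None` entries left behind are exactly the disocclusion holes described above. The pinhole-style disparity formula and all names are simplifications.

```python
def warp_row_to_virtual(texture, depth, baseline, focal):
    """Forward-map one scanline of a reference view toward a virtual view.

    Each pixel shifts horizontally by a disparity proportional to 1/depth,
    with a depth test so nearer pixels win.  Target positions that no
    reference pixel maps to stay None -- the disocclusion holes that the
    later post-processing steps must fill.
    """
    out = [None] * len(texture)
    zbuf = [None] * len(texture)
    for x, (t, d) in enumerate(zip(texture, depth)):
        xv = x + round(baseline * focal / d)  # nearer pixels shift more
        if 0 <= xv < len(out) and (zbuf[xv] is None or d < zbuf[xv]):
            out[xv], zbuf[xv] = t, d
    return out

# A near foreground object (depth 2) in front of a far background (depth 100):
tex   = ["bg", "bg", "fg", "fg", "bg", "bg"]
depth = [100,  100,  2,    2,    100,  100]
virtual = warp_row_to_virtual(tex, depth, baseline=1.0, focal=2.0)
# The foreground shifts one pixel to the right, uncovering background that
# was occluded in the reference view: virtual[2] is now a hole (None).
```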
  • the embodiments of this specification provide a free-view video reconstruction and playback processing method, device, and storage medium, which can improve the quality of hole filling, thereby improving the image quality of the free-view video.
  • the embodiments of this specification provide a free-view video reconstruction method, including:
  • the video frame includes the original texture maps of multiple original viewpoints and the original depth maps of the corresponding viewpoints;
  • hole-filling post-processing is performed on the hole area in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
  • the acquiring the background texture map and the background depth map of the viewpoint corresponding to the target video frame includes:
  • Temporal filtering is performed on the reference texture map sequence and the reference depth map sequence, respectively, to obtain a background texture map and a background depth map of the viewpoint corresponding to the target video frame.
  • the temporal filtering is performed on the reference texture map sequence and the reference depth map sequence, respectively, to obtain the background texture map and the background depth map of the viewpoint corresponding to the target video frame, including:
  • Temporal median filtering is performed on the pixels in the reference texture map sequence and the reference depth map sequence, respectively, to obtain a background texture map and a background depth map of the viewpoint corresponding to the target video frame.
  • the texture map of the virtual viewpoint is synthesized.
  • the acquiring the background texture map and the background depth map of the viewpoint corresponding to the target video frame includes:
  • Temporal filtering is performed on the reference texture map sequence and the reference depth map sequence, respectively, to obtain a background texture map and a background depth map of the selected corresponding original viewpoint.
  • the acquiring the background texture map and the background depth map of the viewpoint corresponding to the target video frame includes:
  • the background depth map of the corresponding viewpoint is acquired according to a background texture map of the corresponding viewpoint captured when no foreground object is present in the field of view targeted by the target video frame.
  • using the background texture map of the virtual viewpoint to perform hole-filling post-processing on the hole area in the texture map of the virtual viewpoint to obtain the reconstructed image of the virtual viewpoint includes:
  • the background texture map of the virtual viewpoint is used, and a joint bilateral filtering method is applied to perform interpolation processing on the hole area in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
  • the method further includes:
  • Filtering is performed on the foreground edge in the texture map of the virtual viewpoint obtained after the post-processing of hole filling, so as to obtain the reconstructed image of the virtual viewpoint.
  • the embodiments of this specification also provide a free-view video playback processing method, including:
  • hole-filling post-processing is performed on the hole area in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
  • the determining a virtual viewpoint includes at least one of the following:
  • the virtual viewpoint is determined based on the virtual viewpoint position information contained in the video stream.
  • the method further includes:
  • the virtual information image and the reconstructed image of the virtual viewpoint are synthesized and displayed.
  • the acquiring the virtual information image generated based on the augmented reality special effect input data of the virtual rendering target object includes:
  • the embodiments of this specification also provide a free-viewpoint video reconstruction device, including:
  • a video frame acquisition unit adapted to acquire a free-view video frame, the video frame including the original texture maps of multiple original viewpoints and the original depth maps of the corresponding viewpoints;
  • a target video frame determination unit adapted to obtain the target video frame corresponding to the virtual viewpoint
  • a virtual viewpoint texture map synthesis unit adapted to use the original texture maps and corresponding original depth maps of multiple original viewpoints in the target video frame to synthesize the texture maps of the virtual viewpoints;
  • a virtual viewpoint background texture map synthesis unit adapted to obtain the background texture map and background depth map of the viewpoint corresponding to the target video frame, and obtain the background texture of the virtual viewpoint according to the background texture map and background depth map of the corresponding viewpoint picture;
  • the post-processing unit is adapted to use the background texture map of the virtual viewpoint to perform hole-filling post-processing on the hole area in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
  • the embodiments of this specification also provide a free-viewpoint video playback processing device, including:
  • a virtual viewpoint determination unit adapted to determine a virtual viewpoint
  • a target video frame determination unit adapted to determine a target video frame according to the virtual viewpoint
  • a virtual viewpoint texture map synthesis unit adapted to use the original texture maps and corresponding original depth maps of multiple original viewpoints in the target video frame to synthesize the texture maps of the virtual viewpoints;
  • a virtual viewpoint background texture map synthesis unit adapted to obtain the background texture map and background depth map of the viewpoint corresponding to the target video frame, and obtain the background texture of the virtual viewpoint according to the background texture map and background depth map of the corresponding viewpoint picture;
  • the post-processing unit is adapted to use the background texture map of the virtual viewpoint to perform hole-filling post-processing on the hole area in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
  • Embodiments of the present specification further provide an electronic device, including a memory and a processor, wherein the memory stores computer instructions that can be executed on the processor, and the processor implements the steps of any of the foregoing methods when executing the computer instructions.
  • the embodiments of this specification also provide an electronic device, including: a communication component, a processor, and a display component, wherein:
  • the communication component adapted to obtain free-view video
  • the display component is adapted to display the reconstructed image of the virtual viewpoint obtained after processing by the processor.
  • The embodiments of the present specification further provide a computer-readable storage medium on which computer instructions are stored, wherein, when the computer instructions are executed, the steps of the methods described in any of the foregoing embodiments are performed.
  • In the solution of the embodiments of the present specification, the complete background texture map of the virtual viewpoint is obtained by reconstruction, and the texture map of the virtual viewpoint corresponding to the synthesized target video frame is subjected to hole-filling post-processing.
  • Compared with the scheme that only uses the texture around the hole for filtering, this can avoid the artifacts and blurring caused by incomplete hole filling, improve the quality of hole filling, and thus improve the image quality of free-viewpoint video.
  • The original texture maps and corresponding original depth maps of some of the original viewpoints in the target video frame are selected according to preset rules as reference texture maps and reference depth maps for synthesizing the texture map of the virtual viewpoint, which can reduce the amount of data processing in the video reconstruction process and improve video reconstruction efficiency.
  • The background texture map and the background depth map of the viewpoint corresponding to the target video frame are obtained from the reference texture map sequence and reference depth map sequence of that viewpoint, that is, from the texture information and depth information in the temporal domain of that viewpoint, not only from the texture in the spatial domain of the target video frame. Therefore, the integrity and authenticity of the obtained background texture map and background depth map can be improved, and the artifacts and blurring caused by foreground occlusion can be avoided, thereby improving the quality of image hole filling.
  • FIG. 1 is a schematic diagram of a specific application system of a free-view video display in an embodiment of this specification
  • FIG. 2 is a schematic diagram of an interactive interface of a terminal device in an embodiment of this specification
  • FIG. 3 is a schematic diagram of a setting mode of a collection device in an embodiment of the present specification
  • FIG. 4 is a schematic diagram of another terminal device interaction interface in the embodiment of this specification.
  • FIG. 5 is a schematic diagram of a free-viewpoint video data generation process in an embodiment of the present specification
  • FIG. 6 is a schematic diagram of the generation and processing of 6DoF video data in an embodiment of this specification;
  • FIG. 7 is a schematic structural diagram of a data header file in an embodiment of the present specification.
  • FIG. 8 is a schematic diagram of a user side processing 6DoF video data in an embodiment of the present specification
  • FIG. 9 is a flowchart of a free-viewpoint video reconstruction method in an embodiment of the present specification.
  • FIG. 10 is a schematic diagram of a free-viewpoint video reconstruction method for a specific application scenario in an embodiment of the present specification
  • FIG. 11 is a flowchart of a free-viewpoint video playback processing method in an embodiment of the present specification
  • FIG. 13 to FIG. 17 are schematic diagrams of display interfaces of an interactive terminal in the embodiments of this specification;
  • FIG. 18 is a schematic structural diagram of a device for free-view video reconstruction in an embodiment of the present specification.
  • FIG. 19 is a schematic structural diagram of a free-viewpoint video playback processing device in an embodiment of the present specification;
  • FIG. 21 is a schematic structural diagram of another electronic device in the embodiment of this specification;
  • FIG. 22 is a schematic structural diagram of a video processing system in an embodiment of the present specification.
  • a specific application system for free-view video display in an embodiment of the present invention may include a collection system 11 of multiple collection devices, a server 12 , and a display device 13 , wherein the collection system 11 can collect images of the area to be viewed.
  • the acquisition system 11 or the server 12 can process the acquired synchronized multiple texture maps to generate multi-angle free viewing angle data that can support the display device 13 to perform virtual viewpoint switching.
  • the display device 13 can display reconstructed images generated based on multi-angle free viewing angle data, the reconstructed images correspond to virtual viewpoints, and can display reconstructed images corresponding to different virtual viewpoints according to user instructions, and switch the viewing position and viewing angle.
  • the process of performing image reconstruction to obtain a reconstructed image may be implemented by the display device 13, or may be implemented by a device located in a content delivery network (Content Delivery Network, CDN) by means of edge computing.
  • the user can view the area to be viewed through the display device 13 , and in this embodiment, the area to be viewed is a basketball court. As mentioned earlier, the viewing position and viewing angle can be switched.
  • the user can swipe across the screen to switch virtual viewpoints.
  • the virtual viewpoint for viewing can be switched.
  • the position of the virtual viewpoint before sliding may be VP1, and after sliding, the position of the virtual viewpoint may be VP2.
  • the reconstructed image displayed on the screen may be as shown in FIG. 4 .
  • the reconstructed image may be obtained by performing image reconstruction based on multi-angle free viewing angle data generated from images collected by multiple collection devices in an actual collection situation.
  • the image viewed before switching may also be a reconstructed image.
  • the reconstructed images may be frame images in the video stream.
  • the manner of switching the virtual viewpoint according to the user's instruction may be various, which is not limited here.
  • the viewpoint can be represented by coordinates of 6 degrees of freedom (DoF), wherein the spatial position of the viewpoint can be represented as (x, y, z), and the viewing angle can be represented as three rotation directions. Accordingly, based on the 6-degree-of-freedom coordinates, a virtual viewpoint, including position and viewing angle, can be determined.
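As a purely illustrative sketch of such a 6-DoF coordinate, the following Python dataclass groups a spatial position with three rotation directions; the angle names (yaw, pitch, roll) are an assumption for the example, not terminology defined by this application.

```python
from dataclasses import dataclass

@dataclass
class Viewpoint6DoF:
    # spatial position (x, y, z)
    x: float
    y: float
    z: float
    # three rotation directions describing the viewing angle;
    # the yaw/pitch/roll naming is illustrative only
    yaw: float
    pitch: float
    roll: float

# A virtual viewpoint 1.5 m up, 3 m back, looking slightly right and down:
vp = Viewpoint6DoF(x=0.0, y=1.5, z=-3.0, yaw=10.0, pitch=-5.0, roll=0.0)
```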
  • the multi-angle free viewing angle data may include depth map data, which is used to provide third-dimensional information outside the plane image. Compared with other implementations, such as providing three-dimensional information through point cloud data, the data volume of the depth map data is smaller.
  • the switching of the virtual viewpoints may be performed within a certain range, which is a multi-angle free viewing angle range. That is, within the multi-angle free viewing angle range, the position and viewing angle of the virtual viewpoint can be switched arbitrarily.
  • the multi-angle free viewing angle range is related to the arrangement of the acquisition device.
  • the wider the shooting coverage of the acquisition devices, the larger the multi-angle free viewing angle range.
  • the quality of the picture displayed by the terminal device is related to the number of collection devices. Generally, the more collection devices are set, the fewer empty areas in the displayed picture.
  • the range of multi-angle free viewing angles is related to the spatial distribution of the acquisition devices.
  • the range of multi-angle free viewing angles and the interaction mode with the display device on the terminal side can be set based on the spatial distribution relationship of the collection devices.
  • texture map acquisition and depth map calculation are required, including three main steps, namely Multi-camera Video Capturing, camera internal and external parameter calculation (Camera Parameter Estimation), and Depth Map Calculation.
  • For multi-camera video capturing, it is required that the video captured by each camera can be aligned at the frame level.
  • the texture image (Texture Image) can be obtained through the video acquisition of multiple cameras;
  • the camera parameters (Camera Parameter) can be obtained through the calculation of the internal and external parameters of the camera, and the camera parameters can include the internal parameter data of the camera and the external parameter data;
  • Through depth map calculation, the depth map of each viewing angle can be obtained; the multiple synchronized texture maps, the depth maps, and the camera parameters corresponding to the viewing angles can then be combined to form the 6DoF video data.
  • the texture map collected from multiple cameras, the camera parameters of all cameras, and the depth map of each camera are obtained.
  • These three parts of data can be referred to as data files in the multi-angle free-view video data, and can also be referred to as 6DoF video data. With these data, the client can render virtual viewpoints according to a virtual 6-degrees-of-freedom (DoF) position, thereby providing a 6DoF video experience.
  • 6DoF video data and indicative data can reach the user side through compression and transmission, and the user side can obtain the user side 6DoF expression according to the received data, that is, the aforementioned 6DoF video data and metadata.
  • The indicative data may also be called metadata. The video data includes the texture map and depth map data of each viewpoint corresponding to the multiple cameras, and the texture maps and depth maps can be spliced according to certain splicing rules or splicing modes to form a stitched image.
  • Metadata can be used to describe the data pattern of the 6DoF video data, and may specifically include: stitching pattern metadata (Stitching Pattern metadata), used to indicate the storage rules of the pixel data of the multiple texture maps and depth maps in the stitched image; edge protection metadata (Padding pattern metadata), used to indicate the way of edge protection in the stitched image; and other metadata (Other metadata).
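To make the stitching-pattern idea concrete, here is a hedged Python sketch of locating one camera's texture or depth tile inside a stitched image under a simple grid layout; all field names (`tile_width`, `cols`, `texture_rows`) are hypothetical examples, not the metadata fields defined by this application.

```python
def subimage_rect(meta, camera_index, kind):
    """Return (x, y, w, h) of one camera's texture or depth tile
    in the stitched image, given a simple grid stitching pattern.

    The `meta` layout is a hypothetical example: texture tiles occupy the
    top rows, depth tiles sit below them, with `cols` tiles per row.
    """
    w, h, cols = meta["tile_width"], meta["tile_height"], meta["cols"]
    row, col = divmod(camera_index, cols)
    y = row * h
    if kind == "depth":
        y += meta["texture_rows"] * h  # depth tiles start below all texture tiles
    return (col * w, y, w, h)

# 8 cameras, 960x540 tiles, 4 tiles per row, 2 rows of texture tiles:
meta = {"tile_width": 960, "tile_height": 540, "cols": 4, "texture_rows": 2}
# camera 5 -> second row, second column of the texture region
assert subimage_rect(meta, 5, "texture") == (960, 540, 960, 540)
assert subimage_rect(meta, 5, "depth") == (960, 1620, 960, 540)
```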
  • The user side obtains the 6DoF video data, which includes camera parameters, stitched images (texture maps and depth maps), and description metadata, in addition to user-side interactive behavior data.
  • The user side can use Depth Image-Based Rendering (DIBR) for 6DoF rendering, so as to generate a virtual viewpoint image at the specific 6DoF position determined according to the user's instruction.
  • During post-processing, the effective texture information around the hole is usually used for filtering. However, the actual hole repair effect is often not ideal, and artifacts and blurring are easily produced, resulting in poor image quality of the reconstructed free-viewpoint video.
  • To this end, the embodiments of this specification provide a free-view video reconstruction scheme: the complete background texture map of the virtual viewpoint is reconstructed and used to perform hole-filling post-processing on the texture map of the virtual viewpoint corresponding to the synthesized target video frame. Compared with the scheme that only uses the texture around the hole for filtering, this can avoid the artifacts and blurring caused by incomplete hole filling, improve the quality of hole filling, and thus improve the image quality of free-view video.
  • Free viewpoint video reconstruction is performed as follows:
  • S91 Acquire a free-view video frame, where the video frame includes original texture maps of multiple original viewpoints and original depth maps of corresponding viewpoints that are synchronized.
  • a free-view video frame may include synchronized original texture maps of multiple original viewpoints and original depth maps of corresponding viewpoints.
  • a free-view video frame may be obtained based on the aforementioned 6DoF video data, where the corresponding viewing angle is also the corresponding viewpoint.
  • a free-view video stream can be downloaded through a network, or a free-view video frame can be obtained from a locally stored free-view video file.
  • The virtual viewpoint may be determined according to user interaction behavior, or may be preset. If determined based on user interaction behavior, the virtual viewpoint position at the corresponding interaction moment can be obtained from the trajectory data of the user's interactive operation.
  • the location information of the virtual viewpoint corresponding to the corresponding video frame may also be preset on the server (such as the server or the cloud), and the set virtual viewpoint is transmitted in the header file of the free viewpoint video.
  • the corresponding video frame in the free viewpoint video corresponding to the virtual viewpoint may be determined as the target video frame.
  • the original texture maps and corresponding original depth maps of all viewpoints included in the target video frame may be used to synthesize the texture map of the virtual viewpoint.
  • Alternatively, the original texture maps and corresponding original depth maps of some of the viewpoints in the target video frame can be selected based on the position information of the virtual viewpoint, for synthesizing the texture map of the virtual viewpoint.
  • The original texture maps and corresponding original depth maps of corresponding original viewpoints in the target video frame may be selected according to preset rules, and the selected original texture maps and original depth maps are then used to synthesize the texture map of the virtual viewpoint.
  • an original texture map and a corresponding original depth map of a corresponding original viewpoint satisfying a preset distance condition from the virtual viewpoint may be selected based on the spatial positional relationship between the virtual viewpoint and the positions of each original viewpoint.
  • an original texture map and a corresponding original depth map of a corresponding original viewpoint that satisfy a preset spatial position relationship with the virtual viewpoint and satisfy a preset number threshold may be selected.
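A minimal Python sketch of such a selection rule, assuming a plain Euclidean distance as the "preset distance condition" and a fixed count as the "preset number threshold" (both concrete choices are illustrative, not mandated by this application):

```python
import math

def select_reference_viewpoints(virtual_pos, original_positions, count):
    """Pick the `count` original viewpoints closest to the virtual
    viewpoint position; their texture and depth maps would then serve
    as the reference maps for synthesizing the virtual view."""
    ranked = sorted(range(len(original_positions)),
                    key=lambda i: math.dist(virtual_pos, original_positions[i]))
    return ranked[:count]

# Four cameras along the x axis; virtual viewpoint sits at x = 1.2:
cams = [(0.0, 0, 0), (1.0, 0, 0), (2.0, 0, 0), (5.0, 0, 0)]
assert select_reference_viewpoints((1.2, 0, 0), cams, 2) == [1, 2]
```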
  • S94 Acquire a background texture map and a background depth map of a viewpoint corresponding to the target video frame, and acquire a background texture map of the virtual viewpoint according to the background texture map and background depth map of the corresponding viewpoint.
  • There are various ways to obtain the background texture map and the background depth map of the viewpoint corresponding to the target video frame.
  • the time domain filtering method can be used, or the pre-collection method can be used.
  • the specific implementation will be described in detail later in combination with specific application scenarios.
  • virtual viewpoint synthesis may be performed in the same manner as in step S93 to obtain the background texture map of the virtual viewpoint.
  • post-processing of hole filling may be performed on the background texture map of the virtual viewpoint, so as to enhance the image quality of the background texture map of the virtual viewpoint.
  • the method of joint bilateral filtering can be used to perform post-processing of hole filling on the background texture map of the virtual viewpoint.
  • Background texture maps and background depth maps of multiple viewpoints may be used, wherein the density of the selected viewpoints may be greater than the density of the viewpoints corresponding to the target video frame.
  • Using the background texture map of the virtual viewpoint, there may be various ways to perform hole-filling post-processing on the hole area in the texture map of the virtual viewpoint.
  • The texture map of the virtual viewpoint and the background texture map of the virtual viewpoint may be compared pixel by pixel, to determine the pixels in the background area of the texture map of the virtual viewpoint that are inconsistent with the background texture map of the virtual viewpoint, or the pixels whose pixel value difference is greater than a preset threshold, and the values of the corresponding pixels in the texture map of the virtual viewpoint are modified to the values of the corresponding pixels in the background texture map of the virtual viewpoint.
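A simplified Python illustration of this per-pixel pass, reduced to copying background values into marked hole pixels (single-channel images as 2-D lists; the thresholded pixel-difference variant described above would additionally compare values against the background, which is omitted here for brevity):

```python
def fill_holes_from_background(texture, background, hole=None):
    """Per-pixel pass: wherever the synthesized virtual-view texture has a
    hole marker, copy the value from the virtual-view background texture."""
    return [[b if t is hole else t for t, b in zip(rt, rb)]
            for rt, rb in zip(texture, background)]

# Two hole pixels (None) in a 2x3 virtual-view texture:
tex = [[50, None, 52], [None, 60, 61]]
bg  = [[10, 11,   12], [13,   14, 15]]
assert fill_holes_from_background(tex, bg) == [[50, 11, 52], [13, 60, 61]]
```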
  • A joint bilateral filtering method may be used to perform interpolation processing on the hole area in the texture map of the virtual viewpoint to obtain the reconstructed image of the virtual viewpoint.
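The joint bilateral interpolation can be sketched as follows in Python, assuming single-channel images stored as 2-D lists and using the virtual-view background texture as the guide image; the kernel radius and sigma values are illustrative defaults, not values from this application.

```python
import math

def joint_bilateral_fill(texture, guide, hole=None,
                         radius=2, sigma_s=1.5, sigma_r=12.0):
    """Interpolate hole pixels of `texture` from their valid neighbours,
    weighting each neighbour by spatial closeness and by how similar the
    guide image (here: the virtual-view background texture) is there."""
    h, w = len(texture), len(texture[0])
    out = [row[:] for row in texture]
    for y in range(h):
        for x in range(w):
            if texture[y][x] is not hole:
                continue
            num = den = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < h and 0 <= nx < w):
                        continue
                    t = texture[ny][nx]
                    if t is hole:
                        continue  # only valid neighbours contribute
                    ws = math.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
                    dr = guide[ny][nx] - guide[y][x]
                    wr = math.exp(-(dr * dr) / (2 * sigma_r ** 2))
                    num += ws * wr * t
                    den += ws * wr
            if den > 0:
                out[y][x] = num / den
    return out

# One hole surrounded by uniform texture; a flat guide weights all
# neighbours equally, so the hole is interpolated to (about) 100:
tex = [[100.0, 100.0, 100.0],
       [100.0, None,  100.0],
       [100.0, 100.0, 100.0]]
guide = [[50.0] * 3 for _ in range(3)]
filled = joint_bilateral_fill(tex, guide)
```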
  • a special joint bilateral filter may be used for implementation, or a corresponding software execution logic may be invoked for implementation.
  • The background texture map of the virtual viewpoint may also be used as a guide image, and a guided filtering method is used to fill the hole area in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
  • Other filtering methods, such as bilateral filtering or median smoothing filtering, may also be used, as long as the texture map of the virtual viewpoint is subjected to hole-filling post-processing based on the input background texture map of the virtual viewpoint; the examples are not enumerated here one by one.
  • In step S95, after hole-filling post-processing is performed on the hole region in the texture map of the virtual viewpoint, in order to further improve the quality of the reconstructed image, filtering may also be performed on the foreground edges in the texture map of the virtual viewpoint obtained after the hole-filling post-processing, so as to obtain the reconstructed image of the virtual viewpoint.
  • The texture map of the virtual viewpoint corresponding to the synthesized target video frame is subjected to hole-filling post-processing. This solution can avoid the artifacts and blurring caused by incomplete hole filling, improve the quality of hole filling, and thus improve the image quality of free-view video.
  • Example 1 Select a reference texture map sequence and a reference depth map sequence of a viewpoint corresponding to the target video frame, and then obtain a background texture map and a background depth map of the viewpoint corresponding to the target video frame.
  • Manner 1 For any original viewpoint among all original viewpoints corresponding to the original texture map and the original depth map included in the target video frame, obtain the corresponding reference texture map sequence and reference depth map sequence.
  • the target video frame includes the original texture maps and corresponding original depth maps of 30 viewpoints, for these 30 viewpoints, corresponding reference texture map sequences and reference depth map sequences are obtained respectively.
  • a reference texture map sequence and a corresponding reference depth map sequence of the original viewpoint selected for synthesizing the texture map of the virtual viewpoint are used.
  • a reference texture map sequence and a reference depth map sequence corresponding to the selected original viewpoint can be obtained, and temporal filtering is performed on the reference texture map sequence and the reference depth map sequence respectively to obtain the selected corresponding original viewpoint background texture map and background depth map.
  • using only the reference texture map sequences and reference depth map sequences of the selected original viewpoints can reduce the amount of data computation and improve the generation efficiency of the background texture map of the virtual viewpoint.
  • the selected reference texture map sequence and reference depth map sequence may be selected from a video clip independent of the target video frame, or may be selected from a video clip including the target video frame.
  • temporal filtering may be performed on the reference texture map sequence and the reference depth map sequence, respectively, to obtain the background texture map and background depth map of the viewpoint corresponding to the target video frame.
  • the average filtering method can be used, and more specifically, arithmetic average filtering, median value average filtering, moving average filtering and other methods can be used.
  • a median filter method can be used. Specifically, temporal median filtering may be performed on the pixels in the reference texture map sequence and the pixels in the corresponding reference depth map sequence, respectively, to obtain the background texture map and the background depth map of the viewpoint corresponding to the target video frame.
  • a video frame sequence from time t1 to t2 can be selected from a video X of the same viewpoint as the reference texture map sequence for this time period, together with the corresponding reference depth map sequence; the sampled values at each pixel position in the reference texture map sequence and the reference depth map sequence can then be sorted by magnitude, and the median taken as the value of the corresponding pixel position of the background texture map and the background depth map, respectively.
  • the number of images in the reference texture map sequence and the corresponding reference depth map sequence sampled from time t1 to t2 should be odd, for example, 3, 5, or 7 consecutive frames.
  • the formula can be expressed as: P(x_t) = med{ I_{x,i}, i = t1, ..., t2 }, where P(x_t) represents any pixel in the background texture map or background depth map, I_{x,i} represents the sequence of pixel values at the same position as P(x_t) in the reference texture map sequence or reference depth map sequence from t1 to t2, and med denotes taking the median value of I_{x,i}.
  • other time-domain filtering methods can also be used, for example, amplitude-limiting filtering, first-order lag filtering, and the like.
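The per-pixel temporal median filtering described above can be sketched as follows. This is a minimal illustration assuming NumPy and a frame stack already loaded into memory; the function name is illustrative and not part of this specification.

```python
import numpy as np

def temporal_median_background(frames):
    """Per-pixel temporal median over a reference frame sequence.

    frames: array-like of shape (T, H, W) or (T, H, W, C); T should be odd
    so the median coincides with an actually sampled value, as the text
    suggests (e.g. 3, 5, or 7 consecutive frames).
    Returns the estimated background image of shape (H, W[, C]).
    """
    stack = np.asarray(frames)
    if stack.shape[0] % 2 == 0:
        raise ValueError("use an odd number of reference frames (e.g. 3, 5, 7)")
    # P(x_t) = med{ I_{x,i} } along the time axis, independently per pixel
    return np.median(stack, axis=0)
```

Applying the same operation to the reference depth map sequence yields the corresponding background depth map.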
  • Example 2: Pre-collect a background texture map of the viewpoint corresponding to the target video frame in which no foreground object is present, and then obtain the background depth map of the corresponding viewpoint accordingly.
  • specifically, a background texture map in which no foreground object is present at the corresponding viewpoint in the field of view targeted by the target video frame may be pre-collected, and the background depth map of the corresponding viewpoint may be obtained from this background texture map.
  • the background in the image is fixed relative to the acquisition viewpoint. Based on this, in some embodiments of this specification, if a texture map is pre-collected at the corresponding viewpoint when no foreground object is present in the field of view targeted by the target video frame, the texture map contains only background texture information. Therefore, it can be used directly as the background texture map of the corresponding viewpoint, from which the background depth map of the corresponding viewpoint can then be obtained.
  • one or more images without foreground objects can be collected at the corresponding viewpoint before the game starts. If one image is collected, it can be used directly as the background texture map; if multiple images are collected, they can be used as a reference texture map sequence and temporally filtered to obtain the background texture map of the corresponding viewpoint. Correspondingly, the reference depth map of each collected reference texture map can be estimated through depth calculation. The reference depth map corresponding to a single reference texture map can be used directly as the background depth map; for a sequence of multiple reference texture maps, the corresponding reference depth map sequence is obtained and then temporally filtered to obtain the background depth map.
  • a plurality of free-viewpoint video frames I can be obtained first, wherein any free-viewpoint video frame I includes synchronized original texture maps of multiple original viewpoints and original depth maps of the corresponding viewpoints
  • the background texture map Tb and background depth map Db of the viewpoint corresponding to the target video frame can be obtained and, based on these, the background texture map Tb0 of the virtual viewpoint can be obtained through virtual viewpoint reconstruction; the texture map T0 of the virtual viewpoint is then subjected to hole-filling post-processing using the background texture map Tb0 of the virtual viewpoint, to obtain the final free-viewpoint video reconstructed image Te.
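The hole-filling step above (using the virtual-viewpoint background texture Tb0 to fill the holes left in the synthesized texture T0) can be sketched as below. The hole-mask representation is an assumption made for illustration; the specification does not prescribe how holes are marked.

```python
import numpy as np

def fill_holes_with_background(virtual_texture, hole_mask, background_texture):
    """Fill hole pixels of the synthesized virtual-viewpoint texture (T0)
    with the co-located pixels of the background texture reconstructed at
    the same virtual viewpoint (Tb0).

    virtual_texture:    (H, W, C) synthesized texture containing holes
    hole_mask:          (H, W) boolean array, True where T0 has no valid data
    background_texture: (H, W, C) virtual-viewpoint background texture
    """
    filled = virtual_texture.copy()
    filled[hole_mask] = background_texture[hole_mask]
    return filled
```

In the full pipeline this would be followed by the edge filtering described in step S95.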
  • S111 Determine a virtual viewpoint, and determine a target video frame according to the virtual viewpoint.
  • the virtual viewpoint may be generated in real time during the playback of the free viewpoint video, or may be preset. More specifically, the virtual viewpoint may be determined in response to a user's gesture interaction operation. For example, the virtual viewpoint at the corresponding interaction moment is determined by acquiring the trajectory data corresponding to the user interaction operation.
  • the position information of the virtual viewpoint corresponding to each video frame can be preset on the server side (such as a server or the cloud), and the set virtual viewpoint position information can be transmitted in the header file of the free-view video stream; the virtual viewpoint is then determined based on the virtual viewpoint position information contained in the video stream.
  • the corresponding frame moment and the video frame at the corresponding frame moment may be determined as the target video frame according to the virtual viewpoint.
  • the original texture maps of some original viewpoints in the target video frame and the original depth maps of the corresponding viewpoints can be selected according to preset rules for combined rendering, to synthesize the texture map of the virtual viewpoint.
  • original texture maps and original depth maps corresponding to 2 to N viewpoints closest to the virtual viewpoint position in the target video frame may be selected.
  • N is the number of original texture images in the target video frame, that is, the number of acquisition devices corresponding to the original texture images.
  • the quantitative relationship value may be fixed or variable.
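The selection of the 2 to N original viewpoints closest to the virtual viewpoint can be sketched as a nearest-neighbor query over camera positions. Representing viewpoints by 3-D position vectors and using Euclidean distance are illustrative assumptions; the specification only requires "closest".

```python
import numpy as np

def select_nearest_viewpoints(virtual_pos, camera_positions, k):
    """Return the indices of the k original viewpoints whose camera
    positions are closest (Euclidean distance) to the virtual viewpoint.

    virtual_pos:      3-vector, position of the virtual viewpoint
    camera_positions: (N, 3) array of original-viewpoint camera positions
    k:                number of viewpoints to select (2 <= k <= N)
    """
    cams = np.asarray(camera_positions, dtype=float)
    dists = np.linalg.norm(cams - np.asarray(virtual_pos, dtype=float), axis=1)
    return np.argsort(dists)[:k].tolist()
```

The selected indices determine which original texture maps and depth maps take part in the combined rendering.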
  • S113 Acquire a background texture map and a background depth map of a viewpoint corresponding to the target video frame, and acquire a background texture map of the virtual viewpoint according to the background texture map and background depth map of the corresponding viewpoint.
  • for the method of acquiring the background texture map and background depth map of the viewpoint corresponding to the target video frame, and for the specific implementation of acquiring the background texture map of the virtual viewpoint from them, refer to the introduction of the corresponding step and its specific implementation manners in the foregoing embodiments, which will not be repeated here.
  • any one or more filtering methods, such as bilateral filtering, joint bilateral filtering, and guided filtering, may be used to perform hole-filling post-processing on the hole region in the texture map of the virtual viewpoint, to obtain the reconstructed image of the virtual viewpoint.
  • since the background texture map of the virtual viewpoint is obtained by temporally filtering the reference texture maps and reference depth maps of the original viewpoints used to synthesize the texture map of the virtual viewpoint, it contains stable and complete background texture information; using it for hole-filling post-processing can therefore improve the reconstructed image quality of the virtual viewpoint.
  • the position of the camera can be configured through a specific viewpoint configuration algorithm or system.
  • the three-dimensional space information of the field of view, the number of selectable viewpoints, and the internal and external parameters of the cameras (including the horizontal field of view, vertical field of view, and other camera parameters) can be obtained; by matching against a preset configuration model, the operation can output a suggested camera arrangement and the corresponding camera positions.
  • the implantation of AR special effects can be implemented in the following manner:
  • certain objects in the image of the free-view video may be determined as virtual rendering target objects based on certain indication information; the indication information may be generated based on user interaction, or obtained based on certain preset trigger conditions or third-party instructions.
  • the virtual rendering target object in the reconstructed image of the virtual viewpoint may be acquired in response to the special effect generating the interactive control instruction.
  • S122 Acquire a virtual information image generated based on the augmented reality special effect input data of the virtual rendering target object.
  • the implanted AR special effects are presented in the form of virtual information images.
  • the virtual information image may be generated based on augmented reality special effect input data of the target object.
  • a virtual information image generated based on the augmented reality special effect input data of the virtual rendering target object may be acquired.
  • the virtual information image corresponding to the virtual rendering target object may be generated in advance, or may be generated immediately in response to the special effect generation instruction.
  • a virtual information image matching the position of the virtual rendering target object can be obtained based on the position of the virtual rendering target object in the reconstructed image obtained through three-dimensional calibration, so that the obtained virtual information image better matches the position of the virtual rendering target object in three-dimensional space; the displayed virtual information image thus better conforms to its real state in three-dimensional space, making the displayed composite image more realistic and vivid and enhancing the user's visual experience.
  • a virtual information image corresponding to the target object may be generated according to a preset special effect generation method based on the augmented reality special effect input data of the virtual rendering target object.
  • for example, the augmented reality special effect input data of the virtual rendering target object may be input into a preset three-dimensional model, which outputs a virtual information image matching the position of the virtual rendering target object in the image obtained through three-dimensional calibration; alternatively, the augmented reality special effect input data of the virtual rendering target object may be input into a preset machine learning model, which likewise outputs a virtual information image matching that position.
  • the virtual information image and the reconstructed image of the virtual viewpoint can be synthesized and displayed in various ways, and two specific implementation examples are given below:
  • Example 1 The virtual information image and the corresponding reconstructed image are fused to obtain a fused image, and the fused image is displayed;
  • Example 2 The virtual information image is superimposed on the corresponding reconstructed image to obtain a superimposed composite image, and the superimposed composite image is displayed.
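The two composition modes above can be sketched as follows. The per-pixel alpha weight for fusion and the boolean mask for superimposition are representation assumptions made for illustration.

```python
import numpy as np

def fuse(reconstructed, virtual_info, alpha):
    """Example 1: fuse (alpha-blend) the virtual information image with the
    reconstructed image; alpha is a per-pixel (H, W, 1) weight in [0, 1]."""
    return alpha * virtual_info + (1.0 - alpha) * reconstructed

def superimpose(reconstructed, virtual_info, mask):
    """Example 2: superimpose (hard overlay) virtual-information pixels
    wherever the boolean (H, W) mask is True, leaving the rest of the
    reconstructed frame untouched."""
    out = reconstructed.copy()
    out[mask] = virtual_info[mask]
    return out
```

Fusion gives soft transitions (e.g. semi-transparent effects), while superimposition keeps the effect fully opaque within its mask.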
  • the obtained composite image may be displayed directly, or the obtained composite image may be inserted into a video stream to be played for playback and display.
  • the fused image may be inserted into the video stream to be played for display.
  • the free viewpoint video may include a special effect display identifier.
  • the superimposed position of the virtual information image in the image of the virtual viewpoint may be determined based on the special effect display identifier, and the virtual information image may then be superimposed and displayed at the determined position in the image of the virtual viewpoint.
  • the interactive terminal T1 plays the video in real time.
  • the video frame P1 is displayed.
  • the video frame P2 displayed by the interactive terminal includes a plurality of special effects display identifiers such as the special effect display identifier I1.
  • the special effect display identifier in the video frame P2 is represented by an inverted triangle symbol pointing to the target object, as shown in Figure 14. It can be understood that the special effect display identifier may also be displayed in other manners.
  • if the terminal user touches and clicks the special effect display identifier I1, the system automatically acquires the virtual information image corresponding to the special effect display identifier I1 and superimposes it for display in the video frame P3, as shown in FIG.: centered on the position of the site where Q1 stands, a three-dimensional ring R1 is rendered.
  • the end user touches and clicks the special effect display identifier I2 in the video frame P3, and the system automatically acquires the virtual information image corresponding to the special effect display identifier I2, and displays the virtual information image in a superimposed manner.
  • the hit rate information display board M0 displays the number, position, name, and hit rate information of the target object, namely the athlete Q2.
  • the end user can continue to click other special effect display signs displayed in the video frame to watch the video showing the AR special effect corresponding to each special effect display sign.
  • the embodiments of this specification also provide a corresponding free-viewpoint video reconstruction apparatus.
  • the free-viewpoint video reconstruction apparatus 180 may include: a video frame obtaining unit 181, a target video frame determining unit 182, a virtual viewpoint texture map synthesis unit 183, a virtual viewpoint background texture map synthesis unit 184, and a post-processing unit 185, wherein:
  • the video frame obtaining unit 181 is adapted to obtain a free-view video frame, the video frame including the original texture maps of multiple original viewpoints and the original depth maps of the corresponding viewpoints;
  • the target video frame determining unit 182 is adapted to obtain the target video frame corresponding to the virtual viewpoint;
  • the virtual viewpoint texture map synthesis unit 183 is adapted to use the original texture maps and corresponding original depth maps of multiple original viewpoints in the target video frame to synthesize the texture maps of the virtual viewpoints;
  • the virtual viewpoint background texture map synthesis unit 184 is adapted to obtain the background texture map and background depth map of the viewpoint corresponding to the target video frame, and obtain the virtual viewpoint according to the background texture map and background depth map of the corresponding viewpoint The background texture map of ;
  • the post-processing unit 185 is adapted to use the background texture map of the virtual viewpoint to perform post-processing for filling voids in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
  • the complete background texture map of the virtual viewpoint obtained by reconstruction is used to perform hole-filling post-processing on the synthesized texture map of the virtual viewpoint corresponding to the target video frame; compared with schemes that filter based only on the texture surrounding the holes, this can avoid the artifacts and blurring caused by incomplete hole filling, improve the quality of hole filling, and thus improve the image quality of the free-viewpoint video.
  • each unit in the virtual viewpoint video reconstruction apparatus may be implemented by using the specific method examples and specific manners of the corresponding steps in the aforementioned free viewpoint video reconstruction method.
  • the embodiments of this specification also provide a corresponding free-viewpoint video playback processing apparatus.
  • the free-viewpoint video playback processing apparatus 190 may include: a virtual viewpoint determination unit 191, a target video frame determination unit 192, a virtual viewpoint texture map synthesis unit 193, a virtual viewpoint background texture map synthesis unit 194, and a post-processing unit 195, wherein:
  • the virtual viewpoint determination unit 191 is adapted to determine a virtual viewpoint; the target video frame determination unit 192 is adapted to determine a target video frame according to the virtual viewpoint;
  • the virtual viewpoint texture map synthesizing unit 193 is adapted to use the original texture maps and corresponding original depth maps of multiple original viewpoints in the target video frame to synthesize the texture maps of the virtual viewpoints;
  • the virtual viewpoint background texture map synthesis unit 194 is adapted to obtain the background texture map and background depth map of the viewpoint corresponding to the target video frame, and obtain the background texture map of the virtual viewpoint according to the background texture map and background depth map of the corresponding viewpoint;
  • the post-processing unit 195 is adapted to use the background texture map of the virtual viewpoint to perform post-processing for filling voids in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
  • the synthesized texture map of the virtual viewpoint corresponding to the target video frame is subjected to hole-filling post-processing; compared with schemes that filter based only on the texture surrounding the holes, this can avoid the artifacts and blurring caused by incomplete hole filling, improve the quality of hole filling, and thus improve the image quality of the free-viewpoint video.
  • each unit in the virtual viewpoint video playback processing device can be implemented by using the specific method examples and specific manners of the corresponding steps in the aforementioned free viewpoint video reconstruction method, and details can be found in the foregoing embodiments.
  • each specific unit such as the virtual viewpoint video reconstruction apparatus, the virtual viewpoint video playback processing apparatus, etc. may be implemented by software, hardware, or a combination of software and hardware.
  • the electronic device 200 may include a memory 201 and a processor 202; the memory 201 stores computer instructions capable of running on the processor 202, wherein, when the processor executes the computer instructions, the steps of the method described in any of the foregoing embodiments can be performed.
  • the electronic device may also include other electronic components or assemblies.
  • the electronic device 210 may include a communication component 211, a processor 212, and a display component 213, wherein:
  • the communication component 211 is adapted to obtain free-view video
  • the processor 212 is adapted to execute the steps of the method in any of the foregoing embodiments;
  • the display component 213 is adapted to display the reconstructed image of the virtual viewpoint obtained after processing by the processor.
  • the display component 213 may specifically be one or more of a display, a touch screen, a projector, and the like.
  • the communication component 211 and the display component 213 may be components disposed inside the electronic device 210, or may be external devices connected through expansion components such as expansion interfaces, docking stations, and expansion cables.
  • the processor 212 may be implemented by any one or more of the following working collaboratively: a central processing unit (CPU) (such as a single-core or multi-core processor), a CPU group, a graphics processing unit (GPU), an artificial intelligence (AI) chip, a field programmable gate array (FPGA) chip, and the like.
  • the memory, the processor, the communication component and the display component in the electronic device may communicate through a bus network.
  • the video processing system A0 includes a collection array A1 composed of multiple collection devices, a data processing device A2, a server cluster A3 in the cloud, a playback control device A4, a playback terminal A5 and an interactive terminal A6.
  • each acquisition device in the acquisition array A1 can be placed in a fan shape at different positions in the on-site acquisition area, and can synchronously acquire video data streams from corresponding angles in real time.
  • the collection device may also be arranged in the ceiling area of the basketball stadium, on the basketball hoop, and the like.
  • the collection devices can be arranged and distributed along a straight line, a fan shape, an arc line, a circle or an irregular shape.
  • the specific arrangement can be set according to one or more factors such as the specific on-site environment, the number of acquisition devices, the characteristics of the acquisition devices, and imaging effect requirements.
  • the collection device may be any device with a camera function, such as a common camera, a mobile phone, a professional camera, and the like.
  • the data processing device A2 can be placed in a non-acquisition area on site, and can be regarded as an on-site server.
  • the data processing device A2 may send a stream pull instruction to each acquisition device in the acquisition array A1 through a wireless local area network, and each acquisition device in the acquisition array A1 transmits the video data stream obtained based on the stream pull instruction to the data processing device A2 in real time.
  • each acquisition device in the acquisition array A1 can transmit the obtained video data stream to the data processing device A2 in real time through the switch A7.
  • the acquisition array A1 and the switch A7 together form an acquisition system.
  • when the data processing device A2 receives a video frame interception instruction, it intercepts the video frames at the specified frame moment from the received multi-channel video data streams to obtain frame images of multiple synchronized video frames, and uploads the obtained multiple synchronized video frames at the specified frame moment to the server cluster A3 in the cloud.
  • the server cluster A3 in the cloud uses the received original texture maps of the multiple synchronized video frames as an image combination, determines the parameter data corresponding to the image combination and the original depth map corresponding to each original texture map in the image combination, and, based on the parameter data of the image combination, the pixel data of the texture maps and the depth data of the corresponding depth maps in the image combination, performs image stitching based on the acquired virtual viewpoint to obtain the corresponding multi-angle free-view video data.
  • the server can be placed in the cloud, and in order to process data in parallel more quickly, a server cluster A3 in the cloud can be composed of multiple different servers or server groups according to different data processed.
  • the cloud server cluster A3 may include: a first cloud server A31, a second cloud server A32, a third cloud server A33, and a fourth cloud server A34.
  • the first cloud server A31 can be used to determine the corresponding parameter data of the image combination;
  • the second cloud server A32 can be used to determine the estimated depth map of the original texture map of each viewpoint in the image combination and perform depth map correction processing
  • the third cloud server A33 can be based on the position information of the virtual viewpoint, based on the corresponding parameter data of the image combination, the texture map and the depth map of the image combination, use the virtual viewpoint reconstruction based on the depth map (Depth Image Based Rendering, DIBR ) algorithm to reconstruct frame images to obtain images of virtual viewpoints;
  • the fourth cloud server A34 can be used to generate free viewpoint videos (multi-angle free viewpoint videos).
  • the first cloud server A31, the second cloud server A32, the third cloud server A33, and the fourth cloud server A34 may also be server groups composed of server arrays or server sub-clusters, which is not limited in this embodiment of the present invention.
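The depth-image-based rendering (DIBR) reconstruction performed by the third cloud server A33 can be sketched as a per-pixel back-projection and re-projection. This is a minimal pinhole-camera sketch under assumed conventions (3x3 intrinsics, source-to-destination rotation R and translation t); it computes only the warped pixel coordinates, leaving resampling, fusion, and hole handling to the later steps described in this specification.

```python
import numpy as np

def dibr_warp_points(depth, K_src, K_dst, R, t):
    """Back-project every source pixel with its depth to a 3-D point,
    transform it into the destination (virtual) camera frame, and
    re-project it with the destination intrinsics.

    depth: (H, W) positive depth per source pixel
    K_src, K_dst: 3x3 intrinsic matrices of source and virtual cameras
    R: 3x3 rotation, t: 3-vector (source-to-destination extrinsics)
    Returns an (H, W, 2) array of destination pixel coordinates (u, v).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    rays = np.linalg.inv(K_src) @ pix          # back-project pixels to rays
    pts = rays * depth.reshape(1, -1)          # 3-D points in source frame
    pts_dst = R @ pts + np.asarray(t, dtype=float).reshape(3, 1)
    proj = K_dst @ pts_dst
    uv = (proj[:2] / proj[2]).T.reshape(H, W, 2)  # perspective divide
    return uv
```

With identical cameras and an identity transform, every pixel maps to itself regardless of depth, which is a useful sanity check for the geometry.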
  • the playback control device A4 can insert the received free-view video frame into the to-be-played video stream, and the playback terminal A5 receives the to-be-played video stream from the playback control device A4 and plays it in real time.
  • the playback control device A4 may be a manual playback control device or a virtual playback control device.
  • a dedicated server that can automatically switch video streams can be set up as a virtual playback control device to control the data source.
  • a broadcast director control device such as a broadcast director station, can be used as a playback control device A4 in the embodiment of the present invention.
  • the interaction device A6 can play free-view video based on user interaction.
  • each acquisition device in the acquisition array A1 and the data processing device A2 can be connected through a switch A7 and/or a local area network; the number of playback terminals A5 and interactive terminals A6 can be one or more, and the playback terminal A5 and the interactive terminal A6 may be the same terminal device.
  • the data processing device A2 may be placed in a non-collection area on site or in the cloud according to the specific scenario, and the server cluster A3 and the playback control device A4 may, depending on the specific scenario, be placed in a non-collection area on site, in the cloud, or on the terminal access side; this embodiment is not intended to limit the specific implementation and protection scope of the present invention.
  • the embodiments of the present specification further provide a computer-readable storage medium on which computer instructions are stored, wherein, when the computer instructions are executed, the steps of the methods described in any of the foregoing embodiments may be performed.
  • the computer-readable storage medium may be any suitable readable storage medium such as an optical disc, a mechanical hard disk, or a solid-state drive.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Free viewpoint video reconstruction and playing processing methods, a device, and a storage medium. The video reconstruction method comprises: acquiring a free viewpoint video frame, wherein the video frame comprises synchronous original texture images of a plurality of original viewpoints, and original depth images of the corresponding viewpoints; acquiring a target video frame corresponding to a virtual viewpoint; using original texture images of the plurality of original viewpoints and corresponding original depth images in the target video frame to synthesize a texture image of the virtual viewpoint; acquiring background texture images and background depth images of the corresponding viewpoints in the target video frame, and acquiring a background texture image of the virtual viewpoint according to the background texture images and the background depth images of the corresponding viewpoints; and using the background texture image of the virtual viewpoint to perform hole filling on a hole region in the texture image of the virtual viewpoint, and then performing processing to obtain a reconstructed image of the virtual viewpoint. By means of the scheme, the hole filling quality can be improved, thereby improving the image quality of a free viewpoint video.

Description

Free viewpoint video reconstruction and playback processing method, device and storage medium

This disclosure claims the priority of the Chinese patent application No. 202010759861.3, filed on July 31, 2020 and titled "Free-viewpoint video reconstruction and playback processing method, device and storage medium", the entire contents of which are incorporated into this application by reference.
Technical Field

The embodiments of this specification relate to the technical field of video processing, and in particular, to a free-viewpoint video reconstruction and playback processing method, device, and storage medium.
背景技术Background technique
自由视点视频是一种能够提供高自由度观看体验的技术,用户可以在观看过程中通过交互操作,调整观看视角,从想观看的自由视点角度进行观看,从而可以大幅提升观看体验。Free viewpoint video is a technology that can provide a high degree of freedom viewing experience. Users can adjust the viewing angle through interactive operations during the viewing process, and watch from the free viewpoint they want to watch, which can greatly improve the viewing experience.
为实现自由视点观看,可以采用虚拟视点合成技术。在虚拟视点合成技术中,基于深度图的图像绘制(Depth Image Based Rendering,DIBR)技术成为虚拟视点合成的一种重要方法,其仅需要参考视点的纹理图及对应的深度图,经过三维坐标变换即可得到原本不存在相机的视点的视图。In order to realize free viewpoint viewing, virtual viewpoint synthesis technology can be used. In the virtual viewpoint synthesis technology, the Depth Image Based Rendering (DIBR) technology has become an important method of virtual viewpoint synthesis, which only needs to refer to the texture map of the viewpoint and the corresponding depth map, after three-dimensional coordinate transformation You can get a view of the viewpoint where the camera does not originally exist.
DIBR mainly comprises the steps of viewpoint selection, preprocessing, mapping, view fusion, and post-processing. During the mapping step, parts of the background texture that are occluded by foreground objects may be invisible in the reference viewpoint yet visible in the virtual viewpoint. Consequently, even after view fusion, some unfilled hole regions remain in the virtual viewpoint.
For filling holes in regions occluded by foreground objects, there currently exist methods that filter using the valid texture information around the hole. However, these methods are not effective: they are prone to artifacts and blurring, resulting in poor image quality in the reconstructed free viewpoint video.
Summary of the Invention
In view of this, the embodiments of this specification provide a free viewpoint video reconstruction and playback processing method, device, and storage medium, which can improve hole-filling quality and thereby improve the image quality of free viewpoint video.
First, the embodiments of this specification provide a free viewpoint video reconstruction method, comprising:
acquiring a free viewpoint video frame, the video frame comprising synchronized original texture maps of a plurality of original viewpoints and original depth maps of the corresponding viewpoints;
acquiring a target video frame corresponding to a virtual viewpoint;
synthesizing a texture map of the virtual viewpoint using the original texture maps of the plurality of original viewpoints and the corresponding original depth maps in the target video frame;
acquiring a background texture map and a background depth map of a viewpoint corresponding to the target video frame, and obtaining a background texture map of the virtual viewpoint from the background texture map and background depth map of the corresponding viewpoint; and
performing hole-filling post-processing on hole regions in the texture map of the virtual viewpoint using the background texture map of the virtual viewpoint, to obtain a reconstructed image of the virtual viewpoint.
Optionally, acquiring the background texture map and background depth map of the viewpoint corresponding to the target video frame comprises:
selecting a reference texture map sequence and a reference depth map sequence of the viewpoint corresponding to the target video frame; and
performing temporal filtering on the reference texture map sequence and the reference depth map sequence, respectively, to obtain the background texture map and background depth map of the viewpoint corresponding to the target video frame.
Optionally, performing temporal filtering on the reference texture map sequence and the reference depth map sequence, respectively, to obtain the background texture map and background depth map of the viewpoint corresponding to the target video frame comprises:
performing temporal median filtering on the pixels of the reference texture map sequence and the reference depth map sequence, respectively, to obtain the background texture map and background depth map of the viewpoint corresponding to the target video frame.
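Per-pixel temporal median filtering of this kind can be sketched in a few lines. This is a minimal illustration, not the disclosed implementation; it assumes the frames of the reference sequence are already aligned (a fixed camera position):

```python
import numpy as np

def temporal_median_background(frames):
    """Estimate a background image by per-pixel temporal median filtering.

    frames: (T, H, W) or (T, H, W, C) stack of aligned frames from one
    viewpoint; the same call works for a texture-map sequence or a
    depth-map sequence. A pixel covered by moving foreground in only a
    minority of the frames converges to its static background value.
    """
    return np.median(np.asarray(frames), axis=0)
```

Because the median is taken along the time axis, a foreground object must occupy a pixel in more than half of the reference frames before it leaks into the estimated background.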
Optionally, synthesizing the texture map of the virtual viewpoint using the original texture maps of the plurality of original viewpoints and the corresponding original depth maps in the target video frame comprises:
based on the virtual viewpoint, selecting the original texture maps and corresponding original depth maps of corresponding original viewpoints in the target video frame according to a preset rule; and
synthesizing the texture map of the virtual viewpoint using the selected original texture maps and corresponding original depth maps of the corresponding original viewpoints.
Optionally, acquiring the background texture map and background depth map of the viewpoint corresponding to the target video frame comprises:
acquiring a reference texture map sequence and a reference depth map sequence corresponding to the selected original viewpoints; and
performing temporal filtering on the reference texture map sequence and the reference depth map sequence, respectively, to obtain the background texture maps and background depth maps of the selected corresponding original viewpoints.
Optionally, acquiring the background texture map and background depth map of the viewpoint corresponding to the target video frame comprises:
pre-capturing, for the corresponding viewpoint, a background texture map of the field of view targeted by the target video frame in which no foreground object is present; and
obtaining a background depth map of the corresponding viewpoint from that foreground-free background texture map.
Optionally, performing hole-filling post-processing on the hole regions in the texture map of the virtual viewpoint using the background texture map of the virtual viewpoint to obtain the reconstructed image of the virtual viewpoint comprises:
using the background texture map of the virtual viewpoint, interpolating the hole regions in the texture map of the virtual viewpoint with a joint bilateral filtering method, to obtain the reconstructed image of the virtual viewpoint.
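One way such guided interpolation could work is sketched below: a naive joint bilateral filter in which the range weight is computed on the virtual-viewpoint background texture (the guide), so that edges present in the background steer the interpolation. This is a slow reference sketch under our own assumptions, not the patented implementation, and all names are illustrative:

```python
import numpy as np

def fill_holes_joint_bilateral(synth, hole_mask, background, radius=3,
                               sigma_s=2.0, sigma_r=0.1):
    """Fill hole pixels of a synthesized view by joint bilateral interpolation.

    synth:      (H, W, 3) synthesized virtual-view texture containing holes
    hole_mask:  (H, W) bool, True where the synthesized view has no data
    background: (H, W, 3) virtual-view background texture used as the guide
    Only non-hole neighbours contribute to each interpolated value.
    """
    H, W, _ = synth.shape
    out = synth.copy()
    for y, x in zip(*np.nonzero(hole_mask)):
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        x0, x1 = max(0, x - radius), min(W, x + radius + 1)
        patch = synth[y0:y1, x0:x1]
        valid = ~hole_mask[y0:y1, x0:x1]
        gy, gx = np.mgrid[y0:y1, x0:x1]
        spatial = np.exp(-((gy - y) ** 2 + (gx - x) ** 2) / (2 * sigma_s ** 2))
        diff = background[y0:y1, x0:x1] - background[y, x]
        range_w = np.exp(-(diff ** 2).sum(axis=-1) / (2 * sigma_r ** 2))
        w = spatial * range_w * valid
        if w.sum() == 0:
            out[y, x] = background[y, x]  # no valid neighbour: use background
        else:
            out[y, x] = (patch * w[..., None]).sum(axis=(0, 1)) / w.sum()
    return out
```

When a hole pixel has no valid neighbour inside the window, the sketch falls back to copying the background texture value directly, which matches the intuition of the claim: the reconstructed background supplies the missing occluded content.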
Optionally, after performing the hole-filling post-processing on the hole regions in the texture map of the virtual viewpoint and before obtaining the reconstructed image of the virtual viewpoint, the method further comprises:
filtering the foreground edges in the hole-filled texture map of the virtual viewpoint, to obtain the reconstructed image of the virtual viewpoint.
The embodiments of this specification further provide a free viewpoint video playback processing method, comprising:
determining a virtual viewpoint, and determining a target video frame according to the virtual viewpoint;
synthesizing a texture map of the virtual viewpoint using the original texture maps of a plurality of original viewpoints and the corresponding original depth maps in the target video frame;
acquiring a background texture map and a background depth map of a viewpoint corresponding to the target video frame, and obtaining a background texture map of the virtual viewpoint from the background texture map and background depth map of the corresponding viewpoint; and
performing hole-filling post-processing on hole regions in the texture map of the virtual viewpoint using the background texture map of the virtual viewpoint, to obtain a reconstructed image of the virtual viewpoint.
Optionally, determining the virtual viewpoint comprises at least one of the following:
determining the virtual viewpoint in response to a user interaction behavior; and
determining the virtual viewpoint based on virtual viewpoint position information contained in the video stream.
Optionally, the method further comprises:
acquiring a virtual rendering target object in the reconstructed image of the virtual viewpoint;
acquiring a virtual information image generated from augmented reality special-effect input data of the virtual rendering target object; and
compositing the virtual information image with the reconstructed image of the virtual viewpoint and displaying the result.
Optionally, acquiring the virtual information image generated from the augmented reality special-effect input data of the virtual rendering target object comprises:
obtaining a virtual information image matching the position of the virtual rendering target object, according to the position of the virtual rendering target object in the reconstructed image of the virtual viewpoint obtained by three-dimensional calibration.
The embodiments of this specification further provide a free viewpoint video reconstruction apparatus, comprising:
a video frame acquisition unit, adapted to acquire a free viewpoint video frame, the video frame comprising synchronized original texture maps of a plurality of original viewpoints and original depth maps of the corresponding viewpoints;
a target video frame determination unit, adapted to acquire a target video frame corresponding to a virtual viewpoint;
a virtual viewpoint texture map synthesis unit, adapted to synthesize a texture map of the virtual viewpoint using the original texture maps of the plurality of original viewpoints and the corresponding original depth maps in the target video frame;
a virtual viewpoint background texture map synthesis unit, adapted to acquire a background texture map and a background depth map of a viewpoint corresponding to the target video frame, and to obtain a background texture map of the virtual viewpoint from the background texture map and background depth map of the corresponding viewpoint; and
a post-processing unit, adapted to perform hole-filling post-processing on hole regions in the texture map of the virtual viewpoint using the background texture map of the virtual viewpoint, to obtain a reconstructed image of the virtual viewpoint.
The embodiments of this specification further provide a free viewpoint video playback processing apparatus, comprising:
a virtual viewpoint determination unit, adapted to determine a virtual viewpoint;
a target video frame determination unit, adapted to determine a target video frame according to the virtual viewpoint;
a virtual viewpoint texture map synthesis unit, adapted to synthesize a texture map of the virtual viewpoint using the original texture maps of a plurality of original viewpoints and the corresponding original depth maps in the target video frame;
a virtual viewpoint background texture map synthesis unit, adapted to acquire a background texture map and a background depth map of a viewpoint corresponding to the target video frame, and to obtain a background texture map of the virtual viewpoint from the background texture map and background depth map of the corresponding viewpoint; and
a post-processing unit, adapted to perform hole-filling post-processing on hole regions in the texture map of the virtual viewpoint using the background texture map of the virtual viewpoint, to obtain a reconstructed image of the virtual viewpoint.
The embodiments of this specification further provide an electronic device, comprising a memory and a processor, the memory storing computer instructions executable on the processor, wherein the processor, when running the computer instructions, performs the steps of the method of any of the foregoing embodiments.
The embodiments of this specification further provide an electronic device, comprising a communication component, a processor, and a display component, wherein:
the communication component is adapted to acquire a free viewpoint video;
the processor is adapted to perform the steps of the method of any of the foregoing embodiments; and
the display component is adapted to display the reconstructed image of the virtual viewpoint obtained after processing by the processor.
The embodiments of this specification further provide a computer-readable storage medium on which computer instructions are stored, wherein, when the computer instructions are run, the steps of the method of any of the foregoing embodiments are performed.
Compared with the prior art, the technical solutions of the embodiments of this specification have the following beneficial effects:
In the solutions of the embodiments of this specification, a complete background texture map of the virtual viewpoint is obtained by reconstruction and used for hole-filling post-processing of the texture map of the virtual viewpoint corresponding to the synthesized target video frame. Compared with schemes that filter using only the texture around a hole, this avoids the artifacts and blurring caused by incomplete hole filling, improves hole-filling quality, and thereby improves the image quality of free viewpoint video.
Further, based on the virtual viewpoint, the original texture maps and corresponding original depth maps of several original viewpoints in the target video frame are selected according to a preset rule and used as reference texture maps and reference depth maps for synthesizing the texture map of the virtual viewpoint, which reduces the amount of data processed during video reconstruction and improves reconstruction efficiency.
Further, by performing temporal filtering on the selected reference texture map sequence and reference depth map sequence of the viewpoint corresponding to the target video frame, the background texture map and background depth map of that viewpoint are obtained. Because this takes into account the texture and depth information of the corresponding viewpoint in the temporal domain, rather than only the spatial-domain texture and depth information of the target video frame, the completeness and fidelity of the resulting background texture map and background depth map are improved, avoiding the artifacts and blurring caused by foreground occlusion and thereby improving the quality of image hole filling.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of a specific application system for free viewpoint video presentation in an embodiment of this specification;
Fig. 2 is a schematic diagram of an interactive interface of a terminal device in an embodiment of this specification;
Fig. 3 is a schematic diagram of an arrangement of capture devices in an embodiment of this specification;
Fig. 4 is a schematic diagram of another terminal device interactive interface in an embodiment of this specification;
Fig. 5 is a schematic diagram of a free viewpoint video data generation process in an embodiment of this specification;
Fig. 6 is a schematic diagram of the generation and processing of 6DoF video data in an embodiment of this specification;
Fig. 7 is a schematic structural diagram of a data header file in an embodiment of this specification;
Fig. 8 is a schematic diagram of user-side processing of 6DoF video data in an embodiment of this specification;
Fig. 9 is a flowchart of a free viewpoint video reconstruction method in an embodiment of this specification;
Fig. 10 is a schematic diagram of a free viewpoint video reconstruction method in a specific application scenario in an embodiment of this specification;
Fig. 11 is a flowchart of a free viewpoint video playback processing method in an embodiment of this specification;
Fig. 12 is a flowchart of another free viewpoint video playback processing method in an embodiment of this specification;
Figs. 13 to 17 are schematic diagrams of display interfaces of an interactive terminal in embodiments of this specification;
Fig. 18 is a schematic structural diagram of a free viewpoint video reconstruction apparatus in an embodiment of this specification;
Fig. 19 is a schematic structural diagram of a free viewpoint video playback processing apparatus in an embodiment of this specification;
Fig. 20 is a schematic structural diagram of an electronic device in an embodiment of this specification;
Fig. 21 is a schematic structural diagram of another electronic device in an embodiment of this specification;
Fig. 22 is a schematic structural diagram of a video processing system in an embodiment of this specification.
Detailed Description
To enable those skilled in the art to better understand and implement the embodiments of this specification, the following first gives an exemplary introduction to the implementation of free viewpoint video with reference to the accompanying drawings and specific application scenarios.
Referring to Fig. 1, a specific application system for free viewpoint video presentation in an embodiment of the present invention may include a capture system 11 comprising multiple capture devices, a server 12, and a display device 13. The capture system 11 can capture images of the area to be viewed, and the capture system 11 or the server 12 can process the acquired synchronized texture maps to generate multi-angle free-view data capable of supporting virtual viewpoint switching on the display device 13. The display device 13 can present reconstructed images generated from the multi-angle free-view data; each reconstructed image corresponds to a virtual viewpoint, and reconstructed images corresponding to different virtual viewpoints can be presented according to user instructions, switching the viewing position and viewing angle.
In a specific implementation, the image reconstruction process that produces the reconstructed image may be performed by the display device 13, or by a device located in a content delivery network (CDN) by means of edge computing. It should be understood that Fig. 1 is only an example and does not limit the capture system, the server, the terminal device, or the specific implementation.
Continuing to refer to Fig. 1, the user can view the area to be viewed through the display device 13; in this embodiment, the area to be viewed is a basketball court. As described above, the viewing position and viewing angle can be switched.
For example, the user can swipe across the screen to switch the virtual viewpoint. In an embodiment of the present invention, with reference to Fig. 2, when the user's finger slides across the screen in direction D22, the virtual viewpoint used for viewing can be switched. Continuing to refer to Fig. 3, the position of the virtual viewpoint before sliding may be VP1; after sliding the screen to switch the virtual viewpoint, the position may be VP2. With reference to Fig. 4, after the screen is slid, the reconstructed image displayed on the screen may be as shown in Fig. 4. The reconstructed image may be obtained by image reconstruction based on multi-angle free-view data generated from images captured by multiple capture devices in an actual capture scenario.
It should be understood that the image viewed before switching may also be a reconstructed image, and reconstructed images may be frame images in a video stream. In addition, the virtual viewpoint may be switched according to user instructions in various ways, which are not limited here.
In a specific implementation, a viewpoint can be represented by coordinates with six degrees of freedom (DoF): the spatial position of the viewpoint can be expressed as (x, y, z), and the viewing angle can be expressed as three rotation directions. Accordingly, a virtual viewpoint, including its position and viewing angle, can be determined from the 6DoF coordinates.
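A minimal container for such a 6DoF viewpoint might look as follows. The rotation-angle field names (yaw, pitch, roll) are our own illustrative choice; the text only specifies "three rotation directions":

```python
from dataclasses import dataclass

@dataclass
class Viewpoint6DoF:
    # Spatial position of the viewpoint.
    x: float
    y: float
    z: float
    # Three rotation directions of the viewing angle (names are assumptions).
    yaw: float = 0.0
    pitch: float = 0.0
    roll: float = 0.0
```

Such a structure is enough to parameterize both the user-selected viewing position and its orientation when requesting a reconstructed image.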
A virtual viewpoint is a three-dimensional concept, and three-dimensional information is required to generate a reconstructed image. In one specific implementation, the multi-angle free-view data may include depth map data, which provides the third dimension beyond the planar image. Compared with other implementations, such as providing three-dimensional information through point cloud data, the data volume of depth map data is small.
In the embodiments of the present invention, the virtual viewpoint can be switched within a certain range, namely the multi-angle free-view range. That is, within the multi-angle free-view range, the position and angle of the virtual viewpoint can be switched arbitrarily.
The multi-angle free-view range is related to the arrangement of the capture devices: the wider the shooting coverage of the capture devices, the larger the multi-angle free-view range. The picture quality presented by the terminal device is related to the number of capture devices; generally, the more capture devices are deployed, the fewer hole regions appear in the presented picture.
In addition, the multi-angle free-view range is related to the spatial distribution of the capture devices. The multi-angle free-view range, and the manner of interaction between the terminal side and the display device, can be configured based on the spatial distribution of the capture devices.
Those skilled in the art will understand that the above embodiments and the corresponding figures are merely illustrative examples; they limit neither the relationship between the capture device arrangement and the multi-angle free-view range, nor the interaction modes or presentation effects of the display device.
With reference to Fig. 5, free viewpoint video reconstruction requires texture map capture and depth map calculation, comprising three main steps: multi-camera video capturing, camera parameter estimation, and depth map calculation. For multi-camera video capturing, the videos captured by the individual cameras must be frame-level aligned. Texture images are obtained through multi-camera video capture; camera parameters, which may include intrinsic and extrinsic parameter data, are obtained through camera parameter estimation; and depth maps are obtained through depth map calculation. The multiple synchronized texture maps, the depth maps of the corresponding views, and the camera parameters together form the 6DoF video data.
In the solutions of the embodiments of this specification, no special camera, such as a light field camera, is required for video capture, and no complex camera calibration work is needed before capture. The positions of the multiple cameras can be laid out and arranged to better photograph the objects or scenes of interest.
After the above three steps, the texture maps captured by the multiple cameras, the camera parameters of all cameras, and the depth map of each camera are obtained. These three parts of data may be called the data files of the multi-angle free-view video data, or 6DoF video data. With these data, the user side can generate a virtual viewpoint according to a virtual six-degree-of-freedom (DoF) position, thereby providing a 6DoF video experience.
With reference to Fig. 6, the 6DoF video data and indicative data can be compressed and transmitted to the user side, which can obtain the user-side 6DoF representation, namely the aforementioned 6DoF video data and metadata, from the received data. The indicative data may also be called metadata. The video data includes the texture map and depth map data of each view corresponding to the multiple cameras; the texture maps and depth maps can be stitched according to certain stitching rules or stitching modes to form a stitched image.
With reference to Fig. 7, the metadata can describe the data pattern of the 6DoF video data and may specifically include: stitching pattern metadata, indicating the storage rules for the pixel data of the multiple texture maps and the depth map data in the stitched image; padding pattern metadata, indicating how edge protection is applied in the stitched image; and other metadata. The metadata may be stored in a data header file; the specific storage order may be as shown in Fig. 7, or another order may be used.
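The actual header layout is defined by Fig. 7 and is not reproduced here. Purely as an assumed illustration of the three metadata categories just listed (all field names are hypothetical), one could serialize them as:

```python
import json

def make_metadata_header(rows, cols, padding_px, other=None):
    """Build a header describing a stitched 6DoF image (hypothetical layout).

    rows, cols: grid arrangement of the per-camera texture/depth sub-images
    padding_px: width of the edge-protection band around each sub-image
    The real header is a binary data header file ordered as in Fig. 7;
    JSON is used here only to make the three categories concrete.
    """
    return json.dumps({
        "stitching_pattern": {"rows": rows, "cols": cols},
        "padding_pattern": {"padding_px": padding_px},
        "other_metadata": other or {},
    })
```

A decoder reading such a header would know, before touching any pixel data, how to slice the stitched image back into per-viewpoint texture and depth maps.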
结合参考图8,用户侧得到了6DoF视频数据,其中包括了摄像机参数,拼接图像(纹理图以及深度图),以及描述元数据(元数据),除此之外,还有用户端的交互行为数据。通过这些数据,用户侧可以采用基于深度图的渲染(DIBR,Depth Image-Based Rendering)方式进行的6DoF渲染,从而在一个特定的根据用户行为产生的6DoF位置产生虚拟视点的图像,也即根据用户指示,确定与该指示对应的6DoF位置的虚拟视点。Referring to Figure 8, the user side obtains 6DoF video data, which includes camera parameters, stitched images (texture map and depth map), and description metadata (metadata), in addition to user-end interactive behavior data . Through these data, the user side can use Depth Image-Based Rendering (DIBR, Depth Image-Based Rendering) for 6DoF rendering, so as to generate a virtual viewpoint image at a specific 6DoF position generated according to user behavior, that is, according to the user's behavior. Indicate, determine the virtual viewpoint of the 6DoF position corresponding to the indication.
其中,在通常的深度图计算中,是在每一个帧时刻单独计算。发明人经研究发现,如此会在固定不动的背景中,求出不一致的深度值,从而导致时域上看到的画面的抖动。Among them, in the usual depth map calculation, it is calculated separately at each frame moment. The inventor has found through research that in this way, in a fixed background, inconsistent depth values are obtained, resulting in jitter of the picture seen in the time domain.
As mentioned above, hole filling of regions occluded by foreground objects is currently performed by filtering with the valid texture information around the hole. In practice, however, the hole-repair effect is unsatisfactory: artifacts and blurring are easily produced, resulting in poor image quality of the reconstructed free-viewpoint video.
To this end, embodiments of this specification provide a free-viewpoint video reconstruction scheme in which a complete background texture map of the virtual viewpoint is reconstructed and used for hole-filling post-processing of the texture map of the virtual viewpoint corresponding to the synthesized target video frame. Compared with schemes that filter only with the texture around the hole, this avoids the artifacts and blurring caused by incomplete hole filling, improves hole-filling quality, and thereby improves the image quality of the free-viewpoint video.
The scheme, principle, and advantages of the hole-filling post-processing in the free-viewpoint video reconstruction process according to embodiments of this specification are described in detail below with reference to the accompanying drawings and specific application scenarios.
Referring to the flowchart of the free-viewpoint video reconstruction method shown in FIG. 9, in a specific implementation applied to the free-viewpoint video display system shown in FIG. 1, the method may be performed by the server 12 or the display device 13, and free-viewpoint video reconstruction may be carried out with the following steps:
S91: Acquire a free-viewpoint video frame, where the video frame includes synchronized original texture maps of multiple original viewpoints and original depth maps of the corresponding viewpoints.
In a specific implementation, a free-viewpoint video frame may include synchronized original texture maps of multiple original viewpoints and original depth maps of the corresponding viewpoints. As an optional example, a free-viewpoint video frame may be obtained based on the aforementioned 6DoF video data, where a corresponding viewing angle is equivalent to a corresponding viewpoint.
In a specific implementation, a free-viewpoint video stream may be downloaded over a network, or a free-viewpoint video frame may be obtained from a locally stored free-viewpoint video file.
S92: Acquire a target video frame corresponding to the virtual viewpoint.
In a specific implementation, the virtual viewpoint may be determined according to user interaction behavior or according to a preset. If it is determined based on user interaction behavior, the virtual viewpoint position at the corresponding interaction moment can be determined from the trajectory data of the user's interactive operation, thereby determining the virtual viewpoint.
In some embodiments of this specification, the position information of the virtual viewpoint corresponding to each video frame may also be preset on the server side (e.g., a server or the cloud), and the preset virtual viewpoint position information may be transmitted in the header file of the free-viewpoint video.
After the virtual viewpoint is determined, the video frame in the free-viewpoint video corresponding to the virtual viewpoint may be determined as the target video frame.
S93: Synthesize the texture map of the virtual viewpoint using the original texture maps and corresponding original depth maps of multiple original viewpoints in the target video frame.
In a specific implementation, according to the position information of the virtual viewpoint, the original texture maps and corresponding original depth maps of all viewpoints contained in the target video frame may be used to synthesize the texture map of the virtual viewpoint.
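The DIBR synthesis described here can be sketched as a forward warp: each source pixel is back-projected using its depth value and re-projected into the virtual camera, with a z-buffer resolving occlusions. The following is a minimal illustrative sketch, not the implementation of this specification; the 3x3 intrinsics / 4x4 world-from-camera pose convention and the name `dibr_warp` are assumptions. Pixels onto which no source pixel lands remain as holes, which is precisely what the later hole-filling post-processing addresses.

```python
import numpy as np

def dibr_warp(texture, depth, K_src, RT_src, K_dst, RT_dst):
    """Forward-warp one source view into the virtual view (point splatting).

    texture: (H, W, 3) color image; depth: (H, W) metric depth along the
    source camera's z-axis. K_*: 3x3 intrinsics; RT_*: 4x4 world-from-camera
    poses (assumed convention). Returns the warped texture and a hole mask
    (True where no source pixel landed).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    # Back-project to source-camera coordinates, then to world coordinates.
    cam = np.linalg.inv(K_src) @ pix * depth.reshape(1, -1)
    world = RT_src @ np.vstack([cam, np.ones((1, cam.shape[1]))])
    # Project the world points into the virtual camera.
    cam_dst = np.linalg.inv(RT_dst) @ world
    z = cam_dst[2]
    proj = K_dst @ cam_dst[:3]
    uu = np.round(proj[0] / z).astype(int)
    vv = np.round(proj[1] / z).astype(int)
    out = np.zeros_like(texture)
    zbuf = np.full((H, W), np.inf)
    colors = texture.reshape(-1, 3)
    ok = (uu >= 0) & (uu < W) & (vv >= 0) & (vv < H) & (z > 0)
    for i in np.flatnonzero(ok):  # z-buffered splat: nearest surface wins
        if z[i] < zbuf[vv[i], uu[i]]:
            zbuf[vv[i], uu[i]] = z[i]
            out[vv[i], uu[i]] = colors[i]
    hole_mask = np.isinf(zbuf)
    return out, hole_mask
```

When warping a view into itself (identical cameras), every pixel maps back to its own position and no holes appear; holes arise only when the virtual camera sees surfaces occluded in the source view.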
To reduce the amount of data processing and speed up image reconstruction while maintaining reconstruction quality, the original texture maps and corresponding original depth maps of only some of the viewpoints in the target video frame may instead be selected, based on the position information of the virtual viewpoint, for synthesizing the texture map of the virtual viewpoint.
Specifically, based on the virtual viewpoint, the original texture maps and corresponding original depth maps of corresponding original viewpoints in the target video frame may be selected according to a preset rule, and the texture map of the virtual viewpoint may then be synthesized from the selected maps. For example, based on the spatial positional relationship between the virtual viewpoint and each original viewpoint, the original texture maps and corresponding original depth maps of the original viewpoints satisfying a preset distance condition with respect to the virtual viewpoint may be selected. As another example, the original texture maps and corresponding original depth maps of original viewpoints that satisfy a preset spatial positional relationship with the virtual viewpoint and a preset number threshold may be selected.
It can be understood that the above are merely examples of some optional implementations for selecting the original texture maps and corresponding original depth maps of some of the original viewpoints, and are not mandatory selection conditions.
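Assuming the original viewpoints and the virtual viewpoint are expressed as positions in a common world coordinate system, the distance-based selection rule described above can be sketched as:

```python
import numpy as np

def select_nearest_viewpoints(virtual_pos, camera_positions, k=2):
    """Pick the k original viewpoints closest to the virtual viewpoint.

    virtual_pos: (3,) position of the virtual viewpoint;
    camera_positions: (N, 3) positions of the N original viewpoints.
    Returns the indices of the k nearest original viewpoints, nearest first.
    """
    d = np.linalg.norm(np.asarray(camera_positions) - np.asarray(virtual_pos), axis=1)
    return np.argsort(d)[:k].tolist()
```

The count `k` corresponds to the preset number threshold mentioned above and may be fixed or varied per frame.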
S94: Acquire the background texture maps and background depth maps of the viewpoints corresponding to the target video frame, and obtain the background texture map of the virtual viewpoint from the background texture maps and background depth maps of the corresponding viewpoints.
In a specific implementation, the background texture map and background depth map of a viewpoint corresponding to the target video frame can be obtained in various ways, for example by temporal filtering or by pre-capture. Specific implementations are described in detail later in combination with specific application scenarios.
After the background texture maps and background depth maps of the viewpoints corresponding to the target video frame are obtained, virtual viewpoint synthesis may be performed in the same manner as in step S93 to obtain the background texture map of the virtual viewpoint.
In a specific implementation, hole-filling post-processing may be performed on the background texture map of the virtual viewpoint to enhance its image quality. As a specific example, joint bilateral filtering may be used for this hole-filling post-processing.
In a specific implementation, to obtain a more complete background texture map of the virtual viewpoint, the background texture maps and background depth maps of multiple viewpoints may be used, where the selected viewpoints may be denser than the viewpoints corresponding to the target video frame.
S95: Using the background texture map of the virtual viewpoint, perform hole-filling post-processing on the hole regions in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
With the background texture map of the virtual viewpoint, hole-filling post-processing of the hole regions in the texture map of the virtual viewpoint can be performed in various ways.
In a specific implementation, the texture map of the virtual viewpoint may be compared pixel by pixel with the background texture map of the virtual viewpoint, to find where the background region of the virtual viewpoint's texture map is inconsistent with the background texture map, or to find pixels whose value difference exceeds a preset threshold; the values of those pixels in the texture map of the virtual viewpoint are then replaced with the values of the corresponding pixels in the background texture map of the virtual viewpoint.
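The pixel-by-pixel comparison just described can be sketched as follows. This is an illustrative sketch, not the exact implementation of this specification: the threshold value and the optional `hole_mask` parameter (marking pixels that the warping step left empty) are assumptions, and in practice the comparison would be restricted to background regions so that legitimate foreground pixels are not overwritten.

```python
import numpy as np

def fill_holes_from_background(virtual_tex, virtual_bg_tex, hole_mask=None, threshold=30):
    """Replace suspect pixels with values from the background texture map.

    Pixels flagged in hole_mask, or whose L1 color difference from the virtual
    viewpoint's background texture map exceeds `threshold` (an assumed tuning
    parameter), are overwritten with the background value.
    """
    tex = virtual_tex.astype(np.int32)
    bg = virtual_bg_tex.astype(np.int32)
    diff = np.abs(tex - bg).sum(axis=-1)  # per-pixel L1 color difference
    replace = diff > threshold
    if hole_mask is not None:
        replace |= hole_mask
    out = virtual_tex.copy()
    out[replace] = virtual_bg_tex[replace]
    return out
```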
As another example, a joint bilateral filtering method may be used to interpolate the hole regions in the texture map of the virtual viewpoint to obtain the reconstructed image of the virtual viewpoint. In a specific implementation, this may be performed by a dedicated joint bilateral filter, or by invoking corresponding software execution logic. Joint bilateral filtering preserves the foreground edges in the texture map of the virtual viewpoint while removing background noise.
In other embodiments of this specification, the background texture map of the virtual viewpoint is used as a guidance image, and a guided filtering method is used to fill the hole regions in the texture map of the virtual viewpoint to obtain the reconstructed image of the virtual viewpoint.
In a specific implementation, other filtering methods, such as bilateral filtering or median smoothing filtering, may also be used to perform hole-filling post-processing on the texture map of the virtual viewpoint based on the input background texture map of the virtual viewpoint; these are not enumerated one by one here.
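As a minimal sketch of the joint bilateral idea applied to hole filling: each hole pixel is interpolated from its valid neighbors, weighted both by spatial closeness and by similarity in a guide image (here, the virtual viewpoint's background texture map), so that edges present in the guide are respected. Grayscale images and the sigma defaults are simplifying assumptions for illustration; this is not the specification's exact filter.

```python
import numpy as np

def joint_bilateral_fill(texture, hole_mask, guide, radius=3, sigma_s=2.0, sigma_r=10.0):
    """Fill hole pixels by joint bilateral interpolation (grayscale sketch).

    texture: (H, W) image with holes; hole_mask: (H, W) bool, True at holes;
    guide: (H, W) guide image, e.g. the virtual viewpoint's background texture.
    sigma_s / sigma_r control the spatial and range Gaussian weights.
    """
    H, W = texture.shape
    out = texture.astype(float).copy()
    ys, xs = np.nonzero(hole_mask)
    for y, x in zip(ys, xs):
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        x0, x1 = max(0, x - radius), min(W, x + radius + 1)
        patch = texture[y0:y1, x0:x1].astype(float)
        valid = ~hole_mask[y0:y1, x0:x1]
        if not valid.any():
            continue  # no valid neighbor in range; leave for a second pass
        yy, xx = np.mgrid[y0:y1, x0:x1]
        w_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
        g = guide[y0:y1, x0:x1].astype(float)
        w_r = np.exp(-((g - float(guide[y, x])) ** 2) / (2 * sigma_r ** 2))
        w = w_s * w_r * valid  # zero weight for other hole pixels
        if w.sum() > 0:
            out[y, x] = (w * patch).sum() / w.sum()
    return out
```

Because the range weight is computed on the guide rather than on the hole-ridden texture itself, a filled pixel is pulled toward neighbors that belong to the same side of a background edge.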
After the hole-filling post-processing of the hole regions in the texture map of the virtual viewpoint in step S95, to further improve the quality of the reconstructed image, the foreground edges in the texture map of the virtual viewpoint obtained after hole-filling post-processing may additionally be filtered to obtain the reconstructed image of the virtual viewpoint.
With the above embodiments, by reconstructing the complete background texture map of the virtual viewpoint and using it for hole-filling post-processing of the texture map of the virtual viewpoint corresponding to the synthesized target video frame, compared with schemes that filter only with the texture around the hole, artifacts and blurring caused by incomplete hole filling are avoided, hole-filling quality is improved, and the image quality of the free-viewpoint video can thereby be improved.
For better understanding and implementation by those skilled in the art, some examples of specific implementations for obtaining the background texture maps and background depth maps of the viewpoints corresponding to the target video frame are given below.
Example 1: Select a reference texture map sequence and a reference depth map sequence of the viewpoint corresponding to the target video frame, and then obtain the background texture map and background depth map of that viewpoint.
For the selection of the reference texture map sequence and reference depth map sequence, in some embodiments the following manners are adopted:
Manner 1: For every original viewpoint corresponding to the original texture maps and original depth maps contained in the target video frame, obtain the corresponding reference texture map sequence and reference depth map sequence.
For example, if the target video frame contains the original texture maps and corresponding original depth maps of 30 viewpoints, the corresponding reference texture map sequences and reference depth map sequences are obtained for each of these 30 viewpoints.
Manner 2: Use the reference texture map sequences and corresponding reference depth map sequences of only the original viewpoints selected for synthesizing the texture map of the virtual viewpoint.
Specifically, the reference texture map sequences and reference depth map sequences corresponding to the selected original viewpoints may be obtained, and temporal filtering may be performed on each reference texture map sequence and reference depth map sequence to obtain the background texture maps and background depth maps of the selected original viewpoints.
For example, if only the original texture maps and corresponding original depth maps of the two original viewpoints closest to the virtual viewpoint were selected when synthesizing the texture map of the virtual viewpoint, then only the reference texture map sequences and reference depth map sequences of those two original viewpoints need to be obtained, which reduces the amount of data computation and improves the generation efficiency of the background texture map of the virtual viewpoint.
In addition, the selected reference texture map sequence and reference depth map sequence may be taken either from a video clip independent of the one containing the target video frame, or from the video clip containing the target video frame.
After the reference texture map sequence and reference depth map sequence of the viewpoint corresponding to the target video frame are selected, temporal filtering may be performed on each of them to obtain the background texture map and background depth map of that viewpoint.
In a specific implementation, temporal filtering can be realized in various ways.
For example, an averaging filter may be used; more specifically, arithmetic mean filtering, median-mean filtering, moving-average filtering, and similar methods may be used.
As another example, a median filtering method may be used. Specifically, temporal median filtering may be performed separately on the pixels of the reference texture map sequence and the pixels of the corresponding reference depth map sequence to obtain the background texture map and background depth map of the viewpoint corresponding to the target video frame.
As an optional example, a sequence of video frames from time t1 to t2 may be selected from a video X taken from the same viewpoint as the target depth map, as the reference texture map sequence for this time period, together with the corresponding reference depth map sequence. The sampled values at each pixel position in the reference texture map sequence and in the reference depth map sequence are sorted by magnitude, and the median is taken as the valid value at the corresponding pixel position of the background texture map and of the background depth map, respectively. For convenience of taking the median, the number of images in the reference texture map sequence and the corresponding reference depth map sequence sampled from t1 to t2 should be odd, for example 3, 5, or 7 consecutive frames. This can be expressed by the following formula:
P(x_t) = med({I_{x,i} | i ∈ [t1, t2]})
where P(x_t) denotes any pixel of the background texture map or background depth map, I_{x,i} denotes the sequence of pixel values at the same pixel position as P(x_t) in the reference texture map sequence or reference depth map sequence from t1 to t2, and med denotes taking the median of I_{x,i}.
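The per-pixel temporal median above can be sketched directly as a stacked median (a sketch assuming equally sized frames; NumPy is used purely for illustration):

```python
import numpy as np

def temporal_median_background(frames):
    """Estimate a background map as the per-pixel temporal median of a
    frame sequence, implementing P(x_t) = med({I_{x,i} | i in [t1, t2]}).

    frames: list or array of equally sized texture (or depth) maps sampled
    from t1 to t2. An odd count (3, 5, 7, ...) makes each median an actual
    sampled value rather than an average of two samples.
    """
    stack = np.stack(frames, axis=0)  # (num_frames, H, W[, C])
    return np.median(stack, axis=0)
```

Because a moving foreground object covers any given pixel in only a minority of the frames, the median recovers the stationary background value at that pixel.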
It can be understood that, in a specific implementation, depending on the characteristics of the environment involved in the specific video and on specific requirements, other temporal filtering methods may also be used; for example, amplitude-limiting filtering or first-order lag filtering.
Example 2: Pre-capture a background texture map of the viewpoint corresponding to the target video frame in which no foreground object is present, and then obtain the background depth map of that viewpoint.
In a specific implementation, a background texture map in which no foreground object is present at the corresponding viewpoint in the field of view covered by the target video frame may be captured in advance, and the background depth map of the corresponding viewpoint may be obtained from that background texture map.
Since the background in an image is a fixed object relative to the capture viewpoint, in some embodiments of this specification a texture image may be captured in advance at the corresponding viewpoint when no foreground object is present in the field of view covered by the target video frame. The captured texture image then contains only background texture information, so it can be used as the background texture map of the corresponding viewpoint, from which the background depth map of that viewpoint can in turn be obtained.
For example, for a live broadcast of a basketball game, one or more images without foreground objects may be captured at each relevant viewpoint before the game starts. If a single image is captured, it can be used directly as the background texture map; if multiple images are captured, they can be used as a reference texture map sequence and temporally filtered to obtain the background texture map of the corresponding viewpoint. Correspondingly, a reference depth map can be estimated from each captured reference texture map by depth computation: the reference depth map corresponding to a single reference texture map can be used directly as the background depth map, while the sequence of reference depth maps obtained from multiple reference texture maps can be temporally filtered to obtain the background depth map.
Referring to the schematic diagram of a free-viewpoint video reconstruction method in a specific application scenario shown in FIG. 10, in an embodiment of this specification, multiple free-viewpoint video frames I may first be obtained, where any free-viewpoint video frame I includes synchronized original texture maps of multiple original viewpoints and original depth maps of the corresponding viewpoints. Based on the virtual viewpoint, a target video frame I0 is acquired, and the texture map T0 of the virtual viewpoint can be synthesized by virtual viewpoint reconstruction from the original texture maps and corresponding original depth maps of the multiple original viewpoints contained in I0. In addition, based on the target video frame, the background texture map Tb and background depth map Db of the corresponding viewpoints can be obtained; from Tb and Db, the background texture map Tb0 of the virtual viewpoint can be obtained by virtual viewpoint reconstruction. Using Tb0 to perform hole-filling post-processing on the texture map T0 of the virtual viewpoint yields the final free-viewpoint video reconstructed image Te.
The free-viewpoint video reconstruction method has been described in detail above through some specific examples. Embodiments of this specification also provide a corresponding free-viewpoint video playback processing method. Referring to the flowchart of the free-viewpoint video playback processing method shown in FIG. 11, it may specifically include the following steps:
S111: Determine a virtual viewpoint, and determine a target video frame according to the virtual viewpoint.
In a specific implementation, the virtual viewpoint may be generated in real time during playback of the free-viewpoint video, or may be preset. More specifically, the virtual viewpoint may be determined in response to a user's gesture interaction, for example by obtaining the trajectory data of the user's interactive operation to determine the virtual viewpoint at the corresponding interaction moment. Alternatively, the position information of the virtual viewpoint corresponding to each video frame may be preset on the server side (e.g., a server or the cloud) and transmitted in the header file of the free-viewpoint video stream, so that the virtual viewpoint can be determined based on the virtual viewpoint position information contained in the video stream.
After the virtual viewpoint is determined, the corresponding frame moment may be determined according to the virtual viewpoint, and the video frame at that frame moment is taken as the target video frame.
S112: Synthesize the texture map of the virtual viewpoint using the original texture maps and corresponding original depth maps of multiple original viewpoints in the target video frame.
After the virtual viewpoint is determined, to save data-processing resources, the original texture maps and original depth maps of some of the original viewpoints in the target video frame may be selected according to preset rules, based on the virtual viewpoint position and the parameter data corresponding to the target video frame, and combined and rendered to synthesize the texture map of the virtual viewpoint. For example, the original texture maps and original depth maps corresponding to the 2 to N viewpoints closest to the virtual viewpoint position in the target video frame may be selected, where N is the number of original texture maps in the target video frame, i.e., the number of capture devices corresponding to the original texture maps. In a specific implementation, this number may be fixed or variable.
S113: Acquire the background texture maps and background depth maps of the viewpoints corresponding to the target video frame, and obtain the background texture map of the virtual viewpoint from the background texture maps and background depth maps of the corresponding viewpoints.
For the manner of acquiring the background texture maps and background depth maps of the viewpoints corresponding to the target video frame, and the specific implementation of obtaining the background texture map of the virtual viewpoint from them, reference may be made to the description of step S94 and its specific implementations in the foregoing embodiments, which is not repeated here.
S114: Using the background texture map of the virtual viewpoint, perform hole-filling post-processing on the hole regions in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
In a specific implementation, any one or more filtering methods, such as bilateral filtering, joint bilateral filtering, or guided filtering, may be used to perform hole-filling post-processing on the hole regions in the texture map of the virtual viewpoint to obtain the reconstructed image of the virtual viewpoint.
With the above free-viewpoint video playback processing method, all displayed video is hole-filled based on the background texture map of the corresponding virtual viewpoint, and each such background texture map is obtained by temporal filtering of the reference texture maps and reference depth maps of the original viewpoints used for synthesizing the virtual viewpoint's texture map. The background texture map of the virtual viewpoint therefore contains stable and complete background texture information, so using it for hole-filling post-processing of the texture map of the virtual viewpoint improves the quality of the reconstructed image of the virtual viewpoint.
To reduce image holes, the positions of the cameras (capture devices) can be configured by a specific viewpoint configuration algorithm or system. In a specific implementation, the three-dimensional spatial information of the field of view, the number of selectable viewpoints, and the intrinsic and extrinsic parameters of the cameras (including the cameras' horizontal and vertical field-of-view angles and other parameters) can be obtained and matched and computed according to a preset configuration model, which can output a recommended camera arrangement and the corresponding camera positions.
In a specific implementation, the playback mode of the free-viewpoint video can be further optimized and extended on the basis of the above embodiments. An exemplary extension is given below.
To enrich the user's visual experience, Augmented Reality (AR) special effects can be implanted in the reconstructed free-viewpoint images. In some embodiments of this specification, referring to the flowchart of the free-viewpoint video playback processing method shown in FIG. 12, the implantation of AR special effects can be realized as follows:
S121: Acquire a virtual rendering target object in the reconstructed image of the virtual viewpoint.
In a specific implementation, certain objects in the images of the free-viewpoint video may be determined as virtual rendering target objects based on certain indication information, which may be generated by user interaction, by certain preset trigger conditions, or by third-party instructions. In an optional embodiment of this specification, in response to a special-effect-generation interactive control instruction, the virtual rendering target object in the reconstructed image of the virtual viewpoint may be acquired.
S122: Acquire a virtual information image generated based on the augmented reality special-effect input data of the virtual rendering target object.
In embodiments of this specification, the implanted AR special effect is presented in the form of a virtual information image, which may be generated based on the augmented reality special-effect input data of the target object. After the virtual rendering target object is determined, the virtual information image generated based on its augmented reality special-effect input data may be acquired.
In embodiments of this specification, the virtual information image corresponding to the virtual rendering target object may be generated in advance, or generated on the fly in response to a special-effect generation instruction.
In a specific implementation, a virtual information image matching the position of the virtual rendering target object may be obtained based on the position of the virtual rendering target object in the reconstructed image obtained by three-dimensional calibration. The obtained virtual information image thus better matches the position of the virtual rendering target object in three-dimensional space, and the displayed virtual information image better conforms to the real state in three-dimensional space, so the displayed composite image is more realistic and vivid, enhancing the user's visual experience.
在具体实施中,可以基于虚拟渲染目标对象的增强现实特效输入数据,按照预设的特效生成方式,生成所述目标对象对应的虚拟信息图像。In a specific implementation, a virtual information image corresponding to the target object may be generated according to a preset special effect generation method based on the augmented reality special effect input data of the virtual rendering target object.
在具体实施中,可以采用多种特效生成方式。In a specific implementation, a variety of special effect generation methods can be adopted.
例如，可以将所述目标对象的增强现实特效输入数据输入至预设的三维模型，基于三维标定得到的所述虚拟渲染目标对象在所述图像中的位置，输出与所述虚拟渲染目标对象匹配的虚拟信息图像；For example, the augmented reality special effect input data of the target object may be fed into a preset three-dimensional model, which, based on the position of the virtual rendering target object in the image obtained by three-dimensional calibration, outputs a virtual information image matching the virtual rendering target object;
又如，可以将所述虚拟渲染目标对象的增强现实特效输入数据，输入至预设的机器学习模型，基于三维标定得到的所述虚拟渲染目标对象在所述图像中的位置，输出与所述虚拟渲染目标对象匹配的虚拟信息图像。For another example, the augmented reality special effect input data of the virtual rendering target object may be fed into a preset machine learning model, which, based on the position of the virtual rendering target object in the image obtained by three-dimensional calibration, outputs a virtual information image matching the virtual rendering target object.
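Both generation paths above rely on the 3D-calibrated position of the target object. As an illustrative sketch only (not the patent's specific calibration procedure), a world-space anchor point can be projected into the virtual viewpoint's image through a hypothetical pinhole model with assumed parameters `K`, `R`, `t`, giving the pixel position at which the virtual information image is placed:

```python
import numpy as np

def project_anchor(world_point, K, R, t):
    """Project a 3D world-space anchor point (e.g. the spot where the
    target object stands) into pixel coordinates of the virtual viewpoint,
    so a rendered effect can be placed to match the 3D position."""
    p_cam = R @ world_point + t      # world coordinates -> camera coordinates
    u, v, w = K @ p_cam              # camera coordinates -> homogeneous pixels
    return np.array([u / w, v / w])  # perspective divide

# Hypothetical pinhole parameters for the virtual viewpoint.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])        # camera 5 m in front of the world origin

# The world origin projects onto the principal point (960, 540).
anchor_px = project_anchor(np.array([0.0, 0.0, 0.0]), K, R, t)
```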
S123,将所述虚拟信息图像与所述虚拟视点的图像进行合成处理并展示。S123 , synthesizing and displaying the virtual information image and the image of the virtual viewpoint.
在具体实施中,可以有多种方式将所述虚拟信息图像与所述虚拟视点的重建图像进行合成处理并展示,以下给出两种具体可实现示例:In a specific implementation, the virtual information image and the reconstructed image of the virtual viewpoint can be synthesized and displayed in various ways, and two specific implementation examples are given below:
示例一：将所述虚拟信息图像与对应的重建图像进行融合处理，得到融合图像，对所述融合图像进行展示；Example 1: The virtual information image and the corresponding reconstructed image are fused to obtain a fused image, and the fused image is displayed;
示例二:将所述虚拟信息图像叠加在对应的重建图像之上,得到叠加合成图像,对所述叠加合成图像进行展示。Example 2: The virtual information image is superimposed on the corresponding reconstructed image to obtain a superimposed composite image, and the superimposed composite image is displayed.
在具体实施中,可以将得到的合成图像直接展示,也可以将得到的合成图像插入待播放的视频流进行播放展示。例如,可以将所述融合图像插入待播放视频流进行播放展示。In a specific implementation, the obtained composite image may be displayed directly, or the obtained composite image may be inserted into a video stream to be played for playback and display. For example, the fused image may be inserted into the video stream to be played for display.
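The two compositing examples above (fusion and superposition) can both be expressed as alpha blending. A minimal sketch, assuming NumPy and a hypothetical `alpha_mask` that is 1 where the AR effect is opaque and 0 where the reconstruction shows through:

```python
import numpy as np

def overlay(reconstructed, virtual_info, alpha_mask):
    """Composite a virtual information image (RGB) onto the reconstructed
    image of the virtual viewpoint using per-pixel alpha blending."""
    a = alpha_mask[..., None]  # broadcast the mask over the color channels
    return (a * virtual_info + (1.0 - a) * reconstructed).astype(reconstructed.dtype)

# Toy 2x2 frames: a gray reconstruction and a red effect covering one pixel.
recon = np.full((2, 2, 3), 128, dtype=np.uint8)
effect = np.zeros((2, 2, 3), dtype=np.uint8)
effect[0, 0] = (255, 0, 0)
mask = np.zeros((2, 2))
mask[0, 0] = 1.0

out = overlay(recon, effect, mask)  # red at (0, 0), reconstruction elsewhere
```

A fused image (Example 1) corresponds to a fractional mask; a hard superposition (Example 2) corresponds to a binary mask.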
自由视点视频中可以包括特效展示标识，在具体实施中，可以基于特效展示标识，确定所述虚拟信息图像在所述虚拟视点的图像中的叠加位置，之后，可以将所述虚拟信息图像在所确定的叠加位置进行叠加展示。The free viewpoint video may include special effect display identifiers. In a specific implementation, the superposition position of the virtual information image within the image of the virtual viewpoint may be determined based on a special effect display identifier, and the virtual information image may then be superimposed and displayed at the determined position.
为使本领域技术人员更好地理解和实施，以下通过一交互终端的图像展示过程进行详细说明。参照图13至图17所示的交互终端的视频播放画面示意图，交互终端T1实时地进行视频的播放。其中，参照图13，展示视频帧P1，接下来，交互终端所展示的视频帧P2中包含特效展示标识I1等多个特效展示标识，视频帧P2中通过指向目标对象的倒三角符号表示，如图14所示。可以理解的是，也可以采用其他的方式展示所述特效展示标识。终端用户触摸点击所述特效展示标识I1，则系统自动获取对应于所述特效展示标识I1的虚拟信息图像，将所述虚拟信息图像叠加展示在视频帧P3中，如图15所示，以运动员Q1站立的场地位置为中心，渲染出一个立体圆环R1。接下来，如图16及图17所示，终端用户触摸点击视频帧P3中的特效展示标识I2，系统自动获取对应于所述特效展示标识I2的虚拟信息图像，将所述虚拟信息图像叠加展示在视频帧P3上，得到叠加图像，即视频帧P4，其中展示了命中率信息展示板M0。命中率信息展示板M0上展示了目标对象即运动员Q2的号位、姓名及命中率信息。To help those skilled in the art better understand and implement this, a detailed description follows through the image display process of an interactive terminal. Referring to the schematic diagrams of the video playback screens of the interactive terminal shown in FIG. 13 to FIG. 17, the interactive terminal T1 plays the video in real time. As shown in FIG. 13, video frame P1 is displayed. Next, video frame P2 displayed by the interactive terminal contains multiple special effect display identifiers, such as identifier I1, represented in frame P2 by inverted triangle symbols pointing at the target objects, as shown in FIG. 14. It can be understood that the special effect display identifiers may also be presented in other ways. When the end user taps the special effect display identifier I1, the system automatically acquires the virtual information image corresponding to I1 and superimposes it on video frame P3; as shown in FIG. 15, a three-dimensional ring R1 is rendered centered on the spot where athlete Q1 stands. Next, as shown in FIG. 16 and FIG. 17, the end user taps the special effect display identifier I2 in video frame P3, and the system automatically acquires the corresponding virtual information image and superimposes it on frame P3, obtaining a superimposed image, namely video frame P4, which shows the hit-rate information board M0. The board M0 displays the jersey number, name, and hit-rate information of the target object, athlete Q2.
如图13至图17所示,终端用户可以继续点击视频帧中展示的其他特效展示标识,观看展示各特效展示标识相应的AR特效的视频。As shown in FIG. 13 to FIG. 17 , the end user can continue to click other special effect display signs displayed in the video frame to watch the video showing the AR special effect corresponding to each special effect display sign.
可以理解的是,可以通过不同类型的特效展示标识区分不同类型的植入特效。It can be understood that different types of implanted special effects can be distinguished by different types of special effect display signs.
本说明书实施例还提供了相应的自由视点视频重建装置，参见图18所示的自由视点视频重建装置的结构示意图，其中，自由视点视频重建装置180可以包括：视频帧获取单元181、目标视频帧确定单元182、虚拟视点纹理图合成单元183、虚拟视点背景纹理图合成单元184和后处理单元185，其中：The embodiments of this specification further provide a corresponding free-viewpoint video reconstruction apparatus. Referring to the schematic structural diagram shown in FIG. 18, the free-viewpoint video reconstruction apparatus 180 may include: a video frame acquisition unit 181, a target video frame determination unit 182, a virtual viewpoint texture map synthesis unit 183, a virtual viewpoint background texture map synthesis unit 184, and a post-processing unit 185, wherein:
所述视频帧获取单元181,适于获取自由视点视频帧,所述视频帧包括同步的多个原始视点的原始纹理图和对应视点的原始深度图;The video frame obtaining unit 181 is adapted to obtain a free-view video frame, the video frame including the original texture maps of multiple original viewpoints and the original depth maps of the corresponding viewpoints;
所述目标视频帧确定单元182,适于获取虚拟视点对应的目标视频帧;The target video frame determining unit 182 is adapted to obtain the target video frame corresponding to the virtual viewpoint;
所述虚拟视点纹理图合成单元183,适于采用所述目标视频帧中多个原始视点的原始纹理图和对应的原始深度图,合成所述虚拟视点的纹理图;The virtual viewpoint texture map synthesis unit 183 is adapted to use the original texture maps and corresponding original depth maps of multiple original viewpoints in the target video frame to synthesize the texture maps of the virtual viewpoints;
所述虚拟视点背景纹理图合成单元184，适于获取所述目标视频帧对应视点的背景纹理图和背景深度图，并根据所述对应视点的背景纹理图和背景深度图，获取所述虚拟视点的背景纹理图；The virtual viewpoint background texture map synthesis unit 184 is adapted to acquire the background texture map and background depth map of the viewpoints corresponding to the target video frame, and to obtain the background texture map of the virtual viewpoint from the background texture maps and background depth maps of those corresponding viewpoints;
所述后处理单元185,适于采用所述虚拟视点的背景纹理图,对所述虚拟视点的纹理图中的空洞区域进行空洞填补后处理,得到所述虚拟视点的重建图像。The post-processing unit 185 is adapted to use the background texture map of the virtual viewpoint to perform post-processing for filling voids in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
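As described in the corresponding method claims, the background texture map and background depth map consumed by unit 184 can be obtained by temporal median filtering of a reference sequence of the corresponding viewpoint. A minimal per-pixel sketch, assuming NumPy and toy single-pixel frames:

```python
import numpy as np

def temporal_median_background(frames):
    """Per-pixel temporal median over a reference sequence of frames
    (texture or depth). Pixels covered by moving foreground in only a
    minority of frames fall back to the static background value."""
    return np.median(np.stack(frames, axis=0), axis=0)

# Toy 1-pixel sequence: background value 10, briefly occluded by foreground 200.
seq = [np.array([[10.0]]), np.array([[200.0]]), np.array([[10.0]]),
       np.array([[10.0]]), np.array([[200.0]])]
bg = temporal_median_background(seq)  # the transient foreground is filtered out
```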
采用上述自由视点视频重建装置180，采用重建得到所述虚拟视点的完整的背景纹理图，来对合成得到的目标视频帧对应的虚拟视点的纹理图进行空洞填补后处理，相对于仅利用空洞周围的纹理进行滤波的方案，可以避免空洞填补不完整而导致的伪影和模糊的现象，提高空洞填补质量，进而可以提高自由视点视频的图像质量。With the above free-viewpoint video reconstruction apparatus 180, a complete background texture map of the virtual viewpoint is reconstructed and used for hole-filling post-processing of the synthesized virtual-viewpoint texture map corresponding to the target video frame. Compared with schemes that filter using only the texture around a hole, this avoids the artifacts and blurring caused by incomplete hole filling, improves hole-filling quality, and in turn improves the image quality of the free-viewpoint video.
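At its simplest, background-based hole filling amounts to copying co-located pixels of the reconstructed background texture map into the disocclusion mask. This is an illustration only; the specification additionally describes joint bilateral filtering interpolation and foreground-edge filtering as refinements:

```python
import numpy as np

def fill_holes_from_background(synth, hole_mask, background):
    """Fill disocclusion holes in the synthesized virtual-viewpoint texture
    with the co-located pixels of the reconstructed background texture map,
    rather than filtering with only the texture around each hole."""
    out = synth.copy()
    out[hole_mask] = background[hole_mask]
    return out

# Toy example: value 0 marks a disocclusion hole in the synthesized texture.
synth = np.array([[50, 0, 50],
                  [50, 0, 50]], dtype=np.uint8)
holes = synth == 0
background = np.full_like(synth, 80)  # background texture of the virtual view

filled = fill_holes_from_background(synth, holes, background)
```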
在具体实施中，所述虚拟视点视频重建装置中各单元可以采用前述自由视点视频重建方法中相应步骤的具体方法示例、具体方式等进行实施，具体可以参见前述实施例详述。In a specific implementation, each unit in the virtual viewpoint video reconstruction apparatus may be implemented using the specific method examples and manners of the corresponding steps in the aforementioned free-viewpoint video reconstruction method; for details, refer to the foregoing embodiments.
本说明书实施例还提供了相应的自由视点视频播放处理装置，参照图19所示的自由视点视频播放处理装置的结构示意图，在本说明书一些实施例中，如图19所示，自由视点视频播放处理装置190可以包括：虚拟视点确定单元191、目标视频帧确定单元192、虚拟视点纹理图合成单元193、虚拟视点背景纹理图合成单元194和后处理单元195，其中：The embodiments of this specification further provide a corresponding free-viewpoint video playback processing apparatus. Referring to the schematic structural diagram shown in FIG. 19, in some embodiments of this specification, the free-viewpoint video playback processing apparatus 190 may include: a virtual viewpoint determination unit 191, a target video frame determination unit 192, a virtual viewpoint texture map synthesis unit 193, a virtual viewpoint background texture map synthesis unit 194, and a post-processing unit 195, wherein:
虚拟视点确定单元191,适于确定虚拟视点;a virtual viewpoint determination unit 191, adapted to determine a virtual viewpoint;
目标视频帧确定单元192,适于根据所述虚拟视点,确定目标视频帧;a target video frame determining unit 192, adapted to determine a target video frame according to the virtual viewpoint;
虚拟视点纹理图合成单元193,适于采用所述目标视频帧中多个原始视点的原始纹理图和对应的原始深度图,合成所述虚拟视点的纹理图;The virtual viewpoint texture map synthesizing unit 193 is adapted to use the original texture maps and corresponding original depth maps of multiple original viewpoints in the target video frame to synthesize the texture maps of the virtual viewpoints;
虚拟视点背景纹理图合成单元194，适于获取所述目标视频帧对应视点的背景纹理图和背景深度图，并根据所述对应视点的背景纹理图和背景深度图，获取所述虚拟视点的背景纹理图；The virtual viewpoint background texture map synthesis unit 194 is adapted to acquire the background texture map and background depth map of the viewpoints corresponding to the target video frame, and to obtain the background texture map of the virtual viewpoint from the background texture maps and background depth maps of those corresponding viewpoints;
后处理单元195,适于采用所述虚拟视点的背景纹理图,对所述虚拟视点的纹理图中的空洞区域进行空洞填补后处理,得到所述虚拟视点的重建图像。The post-processing unit 195 is adapted to use the background texture map of the virtual viewpoint to perform post-processing for filling voids in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
采用上述自由视点视频播放处理装置，采用重建得到所述虚拟视点的完整的背景纹理图，来对合成得到的目标视频帧对应的虚拟视点的纹理图进行空洞填补后处理，相对于仅利用空洞周围的纹理进行滤波的方案，可以避免空洞填补不完整而导致的伪影和模糊的现象，提高空洞填补质量，进而可以提高自由视点视频的图像质量。With the above free-viewpoint video playback processing apparatus, a complete background texture map of the virtual viewpoint is reconstructed and used for hole-filling post-processing of the synthesized virtual-viewpoint texture map corresponding to the target video frame. Compared with schemes that filter using only the texture around a hole, this avoids the artifacts and blurring caused by incomplete hole filling, improves hole-filling quality, and in turn improves the image quality of the free-viewpoint video.
在具体实施中,所述虚拟视点视频播放处理装置中各单元可以采用前述自由视点视频重建方法中相应步骤的具体方法示例、具体方式等进行实施,具体可以参见前述实施例详述。In a specific implementation, each unit in the virtual viewpoint video playback processing device can be implemented by using the specific method examples and specific manners of the corresponding steps in the aforementioned free viewpoint video reconstruction method, and details can be found in the foregoing embodiments.
在本说明书实施例中,所述虚拟视点视频重建装置、所述虚拟视点视频播放处理装置等各具体单元可以由软件、硬件或者软硬件结合的方式实施。In the embodiments of this specification, each specific unit such as the virtual viewpoint video reconstruction apparatus, the virtual viewpoint video playback processing apparatus, etc. may be implemented by software, hardware, or a combination of software and hardware.
参照图20所示的电子设备的结构示意图，在本说明书一些实施例中，如图20所示，电子设备200可以包括存储器201和处理器202，所述存储器201上存储有可在所述处理器202上运行的计算机指令，其中，所述处理器运行所述计算机指令时可以执行前述任一实施例所述方法的步骤。Referring to the schematic structural diagram of the electronic device shown in FIG. 20, in some embodiments of this specification, the electronic device 200 may include a memory 201 and a processor 202, where the memory 201 stores computer instructions executable on the processor 202, and the processor, when running the computer instructions, may perform the steps of the method described in any of the foregoing embodiments.
基于所述电子设备在整个视频处理系统所处位置,所述电子设备还可以包括其他的电子部件或组件。Based on the location of the electronic device in the entire video processing system, the electronic device may also include other electronic components or assemblies.
参照图21所示的另一种电子设备的结构示意图,在本说明书另一些实施例中,如图21所示,电子设备210可以包括通信组件211、处理器212和显示组件213,其中:Referring to the schematic structural diagram of another electronic device shown in FIG. 21, in other embodiments of this specification, as shown in FIG. 21, the electronic device 210 may include a communication component 211, a processor 212, and a display component 213, wherein:
所述通信组件211,适于获取自由视点视频;The communication component 211 is adapted to obtain free-view video;
所述处理器212,适于执行前述任一实施例所述方法的步骤;The processor 212 is adapted to execute the steps of the method in any of the foregoing embodiments;
所述显示组件213,适于显示所述处理器处理后得到的虚拟视点的重建图像。The display component 213 is adapted to display the reconstructed image of the virtual viewpoint obtained after processing by the processor.
在具体实施中,显示组件213具体可以是显示器、触摸屏、投影仪等其中一种或多种。In a specific implementation, the display component 213 may specifically be one or more of a display, a touch screen, a projector, and the like.
在具体实施中,通信组件211和显示组件213等可以为设置在所述电子设备210内部的组件,也可以为通过扩展接口、扩展坞、扩展线等扩展组件连接的外接设备。In a specific implementation, the communication component 211 and the display component 213 may be components disposed inside the electronic device 210, or may be external devices connected through expansion components such as expansion interfaces, docking stations, and expansion cables.
在具体实施中，所述处理器212可以通过中央处理器（Central Processing Unit，CPU）（例如单核处理器、多核处理器）、CPU组、图形处理器（Graphics Processing Unit，GPU）、人工智能（Artificial Intelligence，AI）芯片、现场可编程门阵列（Field Programmable Gate Array，FPGA）芯片等其中任意一种或多种协同实施。In a specific implementation, the processor 212 may be implemented by any one of, or a cooperative combination of, a central processing unit (CPU) (e.g., a single-core or multi-core processor), a CPU group, a graphics processing unit (GPU), an artificial intelligence (AI) chip, a field programmable gate array (FPGA) chip, and the like.
在本说明书一些实施例中,电子设备中的存储器、处理器、通信组件和显示组件之间可以通过总线网络进行通信。In some embodiments of this specification, the memory, the processor, the communication component and the display component in the electronic device may communicate through a bus network.
为使本领域技术人员更好地理解和实施，以下以一个具体的应用场景进行说明。参照图22所示的视频处理系统的结构示意图，如图22所示，为一种应用场景中视频处理系统的结构示意图，其中，示出了一场篮球赛的数据处理系统的布置场景，所述视频处理系统A0包括由多个采集设备组成的采集阵列A1、数据处理设备A2、云端的服务器集群A3、播放控制设备A4，播放终端A5和交互终端A6。For better understanding and implementation by those skilled in the art, a specific application scenario is described below. FIG. 22 is a schematic structural diagram of a video processing system in one application scenario, showing the deployment of a data processing system for a basketball game. The video processing system A0 includes a collection array A1 composed of multiple collection devices, a data processing device A2, a cloud server cluster A3, a playback control device A4, a playback terminal A5, and an interactive terminal A6.
参照图21,以左侧的篮球框作为核心看点,以核心看点为圆心,与核心看点位于同一平面的扇形区域作为预设的多角度自由视角范围。所述采集阵列A1中各采集设备可以根据所述预设的多角度自由视角范围,成扇形置于现场采集区域不同位置,可以分别从相应角度实时同步采集视频数据流。Referring to FIG. 21 , the basketball hoop on the left is used as the core point of view, the core point of view is the center of the circle, and the fan-shaped area on the same plane as the core point of view is used as the preset multi-angle free viewing angle range. According to the preset multi-angle free viewing angle range, each acquisition device in the acquisition array A1 can be placed in a fan shape at different positions in the on-site acquisition area, and can synchronously acquire video data streams from corresponding angles in real time.
在具体实施中,采集设备还可以设置在篮球场馆的顶棚区域、篮球架上等。各采集设备可以沿直线、扇形、弧线、圆形或者不规则形状排列分布。具体排列方式可以根据具体的现场环境、采集设备数量、采集设备的特点、成像效果需求等一种或多种因素进行设置。所述采集设备可以是任何具有摄像功能的设备,例如,普通的摄像机、手机、专业摄像机等。In a specific implementation, the collection device may also be arranged in the ceiling area of the basketball stadium, on the basketball hoop, and the like. The collection devices can be arranged and distributed along a straight line, a fan shape, an arc line, a circle or an irregular shape. The specific arrangement can be set according to one or more factors such as the specific on-site environment, the number of acquisition devices, the characteristics of the acquisition devices, and imaging effect requirements. The collection device may be any device with a camera function, such as a common camera, a mobile phone, a professional camera, and the like.
而为了不影响采集设备工作，所述数据处理设备A2可以置于现场非采集区域，可视为现场服务器。所述数据处理设备A2可以通过无线局域网向所述采集阵列A1中各采集设备分别发送拉流指令，所述采集阵列A1中各采集设备基于所述数据处理设备A2发送的拉流指令，将获得的视频数据流实时传输至所述数据处理设备A2。其中，所述采集阵列A1中各采集设备可以通过交换机A7将获得的视频数据流实时传输至所述数据处理设备A2。采集阵列A1和交换机A7一起形成采集系统。In order not to affect the operation of the acquisition devices, the data processing device A2 may be placed in a non-acquisition area on site and can be regarded as an on-site server. The data processing device A2 may send stream-pull instructions to the acquisition devices in the acquisition array A1 through a wireless local area network, and each acquisition device in the array, based on the stream-pull instruction sent by the data processing device A2, transmits its acquired video data stream to the data processing device A2 in real time, for example through the switch A7. The acquisition array A1 and the switch A7 together form the acquisition system.
当所述数据处理设备A2接收到视频帧截取指令时，从接收到的多路视频数据流中对指定帧时刻的视频帧截取得到多个同步视频帧的帧图像，并将获得的所述指定帧时刻的多个同步视频帧上传至云端的服务器集群A3。When the data processing device A2 receives a video frame interception instruction, it intercepts the video frames at the specified frame moment from the received multiple video data streams to obtain frame images of multiple synchronized video frames, and uploads the obtained synchronized video frames at the specified frame moment to the cloud server cluster A3.
相应地，云端的服务器集群A3将接收的多个同步视频帧的原始纹理图作为图像组合，确定所述图像组合相应的参数数据及所述图像组合中各原始纹理图对应的原始深度图，并基于所述图像组合相应的参数数据、所述图像组合中纹理图的像素数据和对应深度图的深度数据，基于获取到的虚拟视点进行图像拼接，获得相应的多角度自由视角视频数据。Correspondingly, the cloud server cluster A3 takes the received original texture maps of the multiple synchronized video frames as an image combination, determines the parameter data corresponding to the image combination and the original depth map corresponding to each original texture map in the combination, and then, based on the parameter data of the image combination, the pixel data of its texture maps, and the depth data of the corresponding depth maps, performs image stitching for the acquired virtual viewpoint to obtain the corresponding multi-angle free-view video data.
服务器可以置于云端,并且为了能够更快速地并行处理数据,可以按照处理数据的不同,由多个不同的服务器或服务器组组成云端的服务器集群A3。The server can be placed in the cloud, and in order to process data in parallel more quickly, a server cluster A3 in the cloud can be composed of multiple different servers or server groups according to different data processed.
例如，所述云端的服务器集群A3可以包括：第一云端服务器A31，第二云端服务器A32，第三云端服务器A33，第四云端服务器A34。其中，第一云端服务器A31可以用于确定所述图像组合相应的参数数据；第二云端服务器A32可以用于确定所述图像组合中各视点的原始纹理图的估计深度图以及进行深度图校正处理；第三云端服务器A33可以根据虚拟视点的位置信息，基于所述图像组合相应的参数数据、所述图像组合的纹理图和深度图，使用基于深度图的虚拟视点重建（Depth Image Based Rendering，DIBR）算法，进行帧图像重建，得到虚拟视点的图像；所述第四云端服务器A34可以用于生成自由视点视频（多角度自由视角视频）。For example, the cloud server cluster A3 may include a first cloud server A31, a second cloud server A32, a third cloud server A33, and a fourth cloud server A34. The first cloud server A31 may determine the parameter data corresponding to the image combination; the second cloud server A32 may determine the estimated depth map of the original texture map of each viewpoint in the image combination and perform depth map correction; the third cloud server A33 may, according to the position information of the virtual viewpoint and based on the parameter data, texture maps, and depth maps of the image combination, reconstruct frame images using a Depth Image Based Rendering (DIBR) algorithm to obtain the image of the virtual viewpoint; and the fourth cloud server A34 may generate the free viewpoint video (multi-angle free-view video).
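The DIBR step performed by the third cloud server A33 can be illustrated with a deliberately minimal 1-D forward-warping sketch for rectified views. Real implementations also handle occlusion ordering, sub-pixel rounding, and blending of multiple reference views; `baseline_focal` here is a hypothetical combined baseline-times-focal-length parameter:

```python
import numpy as np

def dibr_shift(texture, depth, baseline_focal):
    """Minimal 1-D DIBR sketch for rectified views: each reference pixel is
    shifted by a disparity proportional to baseline * focal / depth to form
    the virtual-view texture. Unassigned pixels stay -1, marking the
    disocclusion holes later filled by background-based post-processing."""
    h, w = texture.shape
    virt = np.full((h, w), -1, dtype=texture.dtype)
    for y in range(h):
        for x in range(w):
            d = int(round(baseline_focal / depth[y, x]))  # disparity in pixels
            xv = x + d
            if 0 <= xv < w:
                virt[y, xv] = texture[y, x]
    return virt

tex = np.array([[1, 2, 3, 4]], dtype=np.int32)
dep = np.array([[4.0, 4.0, 2.0, 2.0]])  # nearer pixels (depth 2) shift more
virt = dibr_shift(tex, dep, 4.0)
# virt == [[-1, 1, 2, -1]]; positions 0 and 3 are disocclusion holes
```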
可以理解的是，所述第一云端服务器A31、第二云端服务器A32、第三云端服务器A33、第四云端服务器A34也可以为服务器阵列或服务器子集群组成的服务器组，本发明实施例不做限制。It can be understood that the first cloud server A31, the second cloud server A32, the third cloud server A33, and the fourth cloud server A34 may each also be a server group composed of a server array or server sub-cluster, which is not limited in the embodiments of the present invention.
然后,播放控制设备A4可以将接收到的自由视点视频帧插入待播放视频流中,播放终端A5接收来自所述播放控制设备A4的待播放视频流并进行实时播放。其中,播放控制设备A4可以为人工播放控制设备,也可以为虚拟播放控制设备。在具体实施中,可以设置专门的可以自动切换视频流的服务器作为虚拟播放控制设备进行数据源的控制。导播控制设备如导播台可以作为本发明实施例中的一种播放控制设备A4。Then, the playback control device A4 can insert the received free-view video frame into the to-be-played video stream, and the playback terminal A5 receives the to-be-played video stream from the playback control device A4 and plays it in real time. The playback control device A4 may be a manual playback control device or a virtual playback control device. In a specific implementation, a dedicated server that can automatically switch video streams can be set up as a virtual playback control device to control the data source. A broadcast director control device, such as a broadcast director station, can be used as a playback control device A4 in the embodiment of the present invention.
交互设备A6可以基于用户交互,进行自由视点视频的播放。The interaction device A6 can play free-view video based on user interaction.
可以理解的是，所述采集阵列A1中各采集设备与所述数据处理设备A2之间可以通过交换机A7和/或局域网进行连接，播放终端A5、交互终端A6数量均可以是一个或多个，所述播放终端A5与所述交互终端A6可以为同一终端设备，所述数据处理设备A2可以根据具体情景置于现场非采集区域或云端，所述服务器集群A3和播放控制设备A4可以根据具体情景置于现场非采集区域、云端或者终端接入侧，本实施例并不用于限制本发明的具体实现和保护范围。It can be understood that each acquisition device in the acquisition array A1 may be connected to the data processing device A2 through the switch A7 and/or a local area network; the number of playback terminals A5 and of interactive terminals A6 may each be one or more; the playback terminal A5 and the interactive terminal A6 may be the same terminal device; the data processing device A2 may be placed in the on-site non-acquisition area or in the cloud according to the specific scenario; and the server cluster A3 and the playback control device A4 may be placed in the on-site non-acquisition area, in the cloud, or at the terminal access side according to the specific scenario. This embodiment is not intended to limit the specific implementation and protection scope of the present invention.
本说明书实施例还提供了一种计算机可读存储介质,其上存储有计算机指令,其中,所述计算机指令运行时执行前述任一实施例所述方法的步骤,具体可以参见前述实施例介绍,此处不再赘述。The embodiments of the present specification further provide a computer-readable storage medium on which computer instructions are stored, wherein, when the computer instructions are executed, the steps of the methods described in any of the foregoing embodiments may be performed. For details, reference may be made to the introduction of the foregoing embodiments. It will not be repeated here.
在具体实施中,所述计算机可读存储介质可以是光盘、机械硬盘、固态硬盘等各种适当的可读存储介质。In a specific implementation, the computer-readable storage medium may be various suitable readable storage mediums such as an optical disc, a mechanical hard disk, and a solid-state hard disk.
虽然本说明书实施例披露如上,但本发明并非限定于此。任何本领域技术人员,在不脱离本说明书实施例的精神和范围内,均可作各种更动与修改,因此本发明的保护范围应当以权利要求所限定的范围为准。Although the embodiments of the present specification are disclosed as above, the present invention is not limited thereto. Any person skilled in the art can make various changes and modifications without departing from the spirit and scope of the embodiments of this specification. Therefore, the protection scope of the present invention should be based on the scope defined by the claims.

Claims (17)

  1. 一种自由视点视频重建方法,其中,包括:A free-viewpoint video reconstruction method, comprising:
    获取自由视点视频帧,所述视频帧包括同步的多个原始视点的原始纹理图和对应视点的原始深度图;Obtaining a free-view video frame, the video frame includes the original texture maps of multiple original viewpoints and the original depth maps of the corresponding viewpoints;
    获取虚拟视点对应的目标视频帧;Obtain the target video frame corresponding to the virtual viewpoint;
    采用所述目标视频帧中多个原始视点的原始纹理图和对应的原始深度图,合成所述虚拟视点的纹理图;Using the original texture maps of multiple original viewpoints and the corresponding original depth maps in the target video frame to synthesize the texture maps of the virtual viewpoints;
    获取所述目标视频帧对应视点的背景纹理图和背景深度图,并根据所述对应视点的背景纹理图和背景深度图,获取所述虚拟视点的背景纹理图;Obtain the background texture map and the background depth map of the viewpoint corresponding to the target video frame, and obtain the background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint;
    采用所述虚拟视点的背景纹理图,对所述虚拟视点的纹理图中的空洞区域进行空洞填补后处理,得到所述虚拟视点的重建图像。Using the background texture map of the virtual viewpoint, post-processing is performed on the void area in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
  2. 根据权利要求1所述的方法,其中,所述获取所述目标视频帧对应视点的背景纹理图和背景深度图,包括:The method according to claim 1, wherein the acquiring the background texture map and the background depth map of the viewpoint corresponding to the target video frame comprises:
    选择所述目标视频帧对应视点的参考纹理图序列和参考深度图序列;selecting the reference texture map sequence and the reference depth map sequence of the viewpoint corresponding to the target video frame;
    对所述参考纹理图序列和参考深度图序列分别进行时域滤波,得到所述目标视频帧对应视点的背景纹理图和背景深度图。Temporal filtering is performed on the reference texture map sequence and the reference depth map sequence, respectively, to obtain a background texture map and a background depth map of the viewpoint corresponding to the target video frame.
  3. 根据权利要求2所述的方法,其中,所述对所述参考纹理图序列和参考深度图序列分别进行时域滤波,得到所述目标视频帧对应视点的背景纹理图和背景深度图,包括:The method according to claim 2, wherein the performing temporal filtering on the reference texture map sequence and the reference depth map sequence respectively to obtain a background texture map and a background depth map of the viewpoint corresponding to the target video frame, comprising:
    对所述参考纹理图序列和参考深度图序列中的像素分别进行时域中值滤波,得到所述目标视频帧对应视点的背景纹理图和背景深度图。Temporal median filtering is performed on the pixels in the reference texture map sequence and the reference depth map sequence, respectively, to obtain a background texture map and a background depth map of the viewpoint corresponding to the target video frame.
  4. 根据权利要求1所述的方法,其中,所述采用所述目标视频帧中多个原始视点的原始纹理图和对应的原始深度图,合成所述虚拟视点的纹理图,包括:The method according to claim 1, wherein, using the original texture maps and corresponding original depth maps of multiple original viewpoints in the target video frame to synthesize the texture maps of the virtual viewpoints, comprising:
    基于所述虚拟视点,按照预设规则选择所述目标视频帧中相应原始视点的原始纹理图和对应的原始深度图;Based on the virtual viewpoint, select the original texture map and the corresponding original depth map of the corresponding original viewpoint in the target video frame according to a preset rule;
    采用所选择的相应原始视点的原始纹理图和对应的原始深度图,合成所述虚拟视点的纹理图。Using the selected original texture map of the corresponding original viewpoint and the corresponding original depth map, the texture map of the virtual viewpoint is synthesized.
  5. 根据权利要求4所述的方法,其中,所述获取所述目标视频帧对应视点的背景纹理图和背景深度图,包括:The method according to claim 4, wherein the acquiring a background texture map and a background depth map of the viewpoint corresponding to the target video frame comprises:
    获取与所选择的原始视点相应的参考纹理图序列和参考深度图序列;Obtain a reference texture map sequence and a reference depth map sequence corresponding to the selected original viewpoint;
    对所述参考纹理图序列和参考深度图序列分别进行时域滤波,得到所选择的相应原始视点的背景纹理图和背景深度图。Temporal filtering is performed on the reference texture map sequence and the reference depth map sequence, respectively, to obtain a background texture map and a background depth map of the selected corresponding original viewpoint.
  6. 根据权利要求1所述的方法,其中,所述获取所述目标视频帧对应视点的背景纹理图和背景深度图,包括:The method according to claim 1, wherein the acquiring the background texture map and the background depth map of the viewpoint corresponding to the target video frame comprises:
    预先采集所述目标视频帧所针对的视场中对应视点不存在前景对象的背景纹理图;Pre-collecting a background texture map in which there is no foreground object at the corresponding viewpoint in the field of view targeted by the target video frame;
    根据所述目标视频帧所针对的视场中对应视点不存在前景对象的背景纹理图,获取对应视点的背景深度图。The background depth map of the corresponding viewpoint is acquired according to the background texture map in which the corresponding viewpoint does not have a foreground object in the field of view targeted by the target video frame.
  7. 根据权利要求1至6任一项所述的方法，其中，所述采用所述虚拟视点的背景纹理图，对所述虚拟视点的纹理图中的空洞区域进行空洞填补后处理，得到所述虚拟视点的重建图像，包括：The method according to any one of claims 1 to 6, wherein using the background texture map of the virtual viewpoint to perform hole-filling post-processing on the hole areas in the texture map of the virtual viewpoint to obtain the reconstructed image of the virtual viewpoint comprises:
    采用所述虚拟视点的背景纹理图,使用联合双边滤波方法对所述虚拟视点的纹理图中的空洞区域进行插值处理,得到所述虚拟视点的重建图像。The background texture map of the virtual viewpoint is used, and a joint bilateral filtering method is used to perform interpolation processing on the hollow area in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
  8. 根据权利要求1至6任一项所述的方法,其中,在对所述虚拟视点的纹理图中的空洞区域进行空洞填补后处理之后,得到所述虚拟视点的重建图像之前,还包括:The method according to any one of claims 1 to 6, wherein after the post-processing of hole filling is performed on the hole region in the texture map of the virtual viewpoint, and before the reconstructed image of the virtual viewpoint is obtained, the method further comprises:
    对空洞填补后处理后得到的虚拟视点的纹理图中的前景边缘进行滤波处理,以得到所述虚拟视点的重建图像。Filtering is performed on the foreground edge in the texture map of the virtual viewpoint obtained after the post-processing of hole filling, so as to obtain the reconstructed image of the virtual viewpoint.
  9. A free-viewpoint video playback processing method, comprising:
    determining a virtual viewpoint, and determining a target video frame according to the virtual viewpoint;
    synthesizing a texture map of the virtual viewpoint from the original texture maps of multiple original viewpoints in the target video frame and the corresponding original depth maps;
    obtaining a background texture map and a background depth map of each viewpoint corresponding to the target video frame, and obtaining a background texture map of the virtual viewpoint according to the background texture maps and background depth maps of the corresponding viewpoints;
    using the background texture map of the virtual viewpoint, performing hole-filling post-processing on the hole regions in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
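The texture-synthesis step of claim 9 is an instance of depth-image-based rendering. As a deliberately simplified sketch (a 1-D horizontally shifted pinhole model, not the patent's actual warping), each source pixel is forward-warped by a disparity proportional to baseline*focal/depth, with a z-test resolving collisions; uncovered pixels remain holes, which the background-based post-processing of the final step must repair. All names and the camera model are assumptions.

```python
import numpy as np

def warp_to_virtual(tex, depth, baseline, focal):
    """Forward-warp one original view (tex, depth; both H x W) into a
    horizontally shifted virtual view. Returns the warped texture and a
    hole mask (True where no source pixel landed)."""
    H, W = tex.shape
    out = np.zeros_like(tex)
    hole = np.ones((H, W), dtype=bool)
    zbuf = np.full((H, W), np.inf)
    for y in range(H):
        for x in range(W):
            d = depth[y, x]
            nx = x + int(round(baseline * focal / d))  # disparity shift
            if 0 <= nx < W and d < zbuf[y, nx]:  # z-test: keep nearest
                zbuf[y, nx] = d
                out[y, nx] = tex[y, x]
                hole[y, nx] = False
    return out, hole
```

In practice one such warp is computed per original viewpoint and the results blended before hole filling.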
  10. The method according to claim 9, wherein determining the virtual viewpoint comprises at least one of:
    determining the virtual viewpoint in response to a user interaction;
    determining the virtual viewpoint based on virtual viewpoint position information contained in the video stream.
  11. The method according to claim 9, further comprising:
    obtaining a virtual rendering target object in the reconstructed image of the virtual viewpoint;
    obtaining a virtual information image generated from augmented-reality special-effect input data of the virtual rendering target object;
    compositing the virtual information image with the reconstructed image of the virtual viewpoint and displaying the result.
  12. The method according to claim 11, wherein obtaining the virtual information image generated from the augmented-reality special-effect input data of the virtual rendering target object comprises:
    obtaining a virtual information image matching the position of the virtual rendering target object according to that object's position in the reconstructed image of the virtual viewpoint, the position being obtained by three-dimensional calibration.
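Anchoring the virtual information image at the calibrated object position, as in claim 12, amounts to projecting the object's 3-D point through the virtual camera. A minimal pinhole-projection sketch (the camera model and function name are illustrative assumptions, not the patent's stated procedure):

```python
import numpy as np

def project_point(K, R, t, X):
    """Project a 3-D point X (world coordinates) into a camera with
    intrinsics K and pose (R, t); the returned pixel (u, v) is where the
    virtual information image would be anchored over the target object."""
    Xc = R @ X + t          # world -> camera coordinates
    uvw = K @ Xc            # camera -> homogeneous image coordinates
    return uvw[:2] / uvw[2]  # perspective divide
```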
  13. A free-viewpoint video reconstruction apparatus, comprising:
    a video frame acquisition unit, adapted to acquire free-viewpoint video frames, each video frame including synchronized original texture maps of multiple original viewpoints and original depth maps of the corresponding viewpoints;
    a target video frame determination unit, adapted to obtain a target video frame corresponding to a virtual viewpoint;
    a virtual viewpoint texture map synthesis unit, adapted to synthesize a texture map of the virtual viewpoint from the original texture maps of multiple original viewpoints in the target video frame and the corresponding original depth maps;
    a virtual viewpoint background texture map synthesis unit, adapted to obtain the background texture maps and background depth maps of the viewpoints corresponding to the target video frame, and to obtain a background texture map of the virtual viewpoint according to the background texture maps and background depth maps of the corresponding viewpoints;
    a post-processing unit, adapted to use the background texture map of the virtual viewpoint to perform hole-filling post-processing on the hole regions in the texture map of the virtual viewpoint, obtaining a reconstructed image of the virtual viewpoint.
  14. A free-viewpoint video playback processing apparatus, comprising:
    a virtual viewpoint determination unit, adapted to determine a virtual viewpoint;
    a target video frame determination unit, adapted to determine a target video frame according to the virtual viewpoint;
    a virtual viewpoint texture map synthesis unit, adapted to synthesize a texture map of the virtual viewpoint from the original texture maps of multiple original viewpoints in the target video frame and the corresponding original depth maps;
    a virtual viewpoint background texture map synthesis unit, adapted to obtain the background texture maps and background depth maps of the viewpoints corresponding to the target video frame, and to obtain a background texture map of the virtual viewpoint according to the background texture maps and background depth maps of the corresponding viewpoints;
    a post-processing unit, adapted to use the background texture map of the virtual viewpoint to perform hole-filling post-processing on the hole regions in the texture map of the virtual viewpoint, obtaining a reconstructed image of the virtual viewpoint.
  15. An electronic device comprising a memory and a processor, the memory storing computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the steps of the method of any one of claims 1 to 8 or claims 9 to 12.
  16. An electronic device, comprising a communication component, a processor and a display component, wherein:
    the communication component is adapted to acquire free-viewpoint video;
    the processor is adapted to perform the steps of the method of any one of claims 1 to 8 or claims 9 to 12;
    the display component is adapted to display the reconstructed image of the virtual viewpoint obtained after processing by the processor.
  17. A computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed, perform the steps of the method of any one of claims 1 to 8 or claims 9 to 12.
PCT/CN2021/108827 2020-07-31 2021-07-28 Free viewpoint video reconstruction and playing processing method, device, and storage medium WO2022022548A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010759861.3A CN114071115A (en) 2020-07-31 2020-07-31 Free viewpoint video reconstruction and playing processing method, device and storage medium
CN202010759861.3 2020-07-31

Publications (1)

Publication Number Publication Date
WO2022022548A1 true WO2022022548A1 (en) 2022-02-03

Family

ID=80037141

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/108827 WO2022022548A1 (en) 2020-07-31 2021-07-28 Free viewpoint video reconstruction and playing processing method, device, and storage medium

Country Status (2)

Country Link
CN (1) CN114071115A (en)
WO (1) WO2022022548A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101771893A (en) * 2010-01-05 2010-07-07 浙江大学 Video frequency sequence background modeling based virtual viewpoint rendering method
CN104618797A (en) * 2015-02-06 2015-05-13 腾讯科技(北京)有限公司 Information processing method and device and client
WO2017201751A1 (en) * 2016-05-27 2017-11-30 北京大学深圳研究生院 Hole filling method and device for virtual viewpoint video or image, and terminal
US20180214777A1 (en) * 2015-07-24 2018-08-02 Silver Curve Games, Inc. Augmented reality rhythm game
CN109361913A (en) * 2015-05-18 2019-02-19 韩国电子通信研究院 For providing the method and apparatus of 3-D image for head-mounted display
CN110602476A (en) * 2019-08-08 2019-12-20 南京航空航天大学 Hole filling method of Gaussian mixture model based on depth information assistance

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110660131B (en) * 2019-09-24 2022-12-27 宁波大学 Virtual viewpoint hole filling method based on deep background modeling


Also Published As

Publication number Publication date
CN114071115A (en) 2022-02-18

Similar Documents

Publication Publication Date Title
US10650574B2 (en) Generating stereoscopic pairs of images from a single lens camera
US11197038B2 (en) Systems and methods for synchronizing surface data management operations for virtual reality
US7573475B2 (en) 2D to 3D image conversion
CN113099204B (en) Remote live-action augmented reality method based on VR head-mounted display equipment
WO2022002181A1 (en) Free viewpoint video reconstruction method and playing processing method, and device and storage medium
US11501118B2 (en) Digital model repair system and method
CN107274469A (en) The coordinative render method of Virtual reality
JP6778163B2 (en) Video synthesizer, program and method for synthesizing viewpoint video by projecting object information onto multiple surfaces
US10699749B2 (en) Methods and systems for customizing virtual reality data
WO2017128887A1 (en) Method and system for corrected 3d display of panoramic image and device
Lee et al. Free viewpoint video (FVV) survey and future research direction
US11417060B2 (en) Stereoscopic rendering of virtual 3D objects
WO2022022501A1 (en) Video processing method, apparatus, electronic device, and storage medium
US20220174257A1 (en) Videotelephony with parallax effect
TW202029742A (en) Image synthesis
CN106780759A (en) Method, device and the VR systems of scene stereoscopic full views figure are built based on picture
WO2022001865A1 (en) Depth map and video processing and reconstruction methods and apparatuses, device, and storage medium
JP2004326179A (en) Image processing device, image processing method, image processing program, and recording medium storing it
WO2022022548A1 (en) Free viewpoint video reconstruction and playing processing method, device, and storage medium
CN114788287A (en) Encoding and decoding views on volumetric image data
BR112021014724A2 (en) APPARATUS FOR RENDERING IMAGES, APPARATUS FOR GENERATING AN IMAGE SIGNAL, METHOD FOR RENDERING IMAGES, METHOD FOR GENERATING AN IMAGE SIGNAL AND COMPUTER PROGRAM PRODUCT
Budd et al. Web delivery of free-viewpoint video of sport events
Li et al. Texture Blending for Photorealistic Composition on Mobile AR Platform
CN114007058A (en) Depth map correction method, video processing method, video reconstruction method and related devices

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21850630

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21850630

Country of ref document: EP

Kind code of ref document: A1