CN114071115A - Free viewpoint video reconstruction and playing processing method, device and storage medium - Google Patents


Info

Publication number
CN114071115A
Authority
CN
China
Prior art keywords
viewpoint
texture map
virtual viewpoint
background
virtual
Prior art date
Legal status
Pending
Application number
CN202010759861.3A
Other languages
Chinese (zh)
Inventor
王荣刚
蔡砚刚
顾嵩
盛骁杰
Current Assignee
Peking University Shenzhen Graduate School
Alibaba Group Holding Ltd
Original Assignee
Peking University Shenzhen Graduate School
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School, Alibaba Group Holding Ltd filed Critical Peking University Shenzhen Graduate School
Priority to CN202010759861.3A
Priority to PCT/CN2021/108827 (published as WO2022022548A1)
Publication of CN114071115A

Classifications

    • H04N13/00: Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/271: Image signal generators wherein the generated image signals comprise depth maps or disparity maps
    • H04N13/106: Processing image signals
    • H04N13/111: Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N13/117: Transformation of image signals corresponding to virtual viewpoints, the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
    • H04N13/282: Image signal generators for generating image signals corresponding to three or more geometrical viewpoints, e.g. multi-view systems

Abstract

A free viewpoint video reconstruction and playing processing method, device, and storage medium are provided. The video reconstruction method comprises the following steps: acquiring a free viewpoint video frame, where the video frame comprises synchronized original texture maps of a plurality of original viewpoints and original depth maps of the corresponding viewpoints; acquiring a target video frame corresponding to a virtual viewpoint; synthesizing a texture map of the virtual viewpoint using the original texture maps and corresponding original depth maps of the plurality of original viewpoints in the target video frame; acquiring a background texture map and a background depth map of a corresponding viewpoint of the target video frame, and obtaining a background texture map of the virtual viewpoint from the background texture map and background depth map of the corresponding viewpoint; and performing hole filling post-processing on the hole areas in the texture map of the virtual viewpoint using the background texture map of the virtual viewpoint, to obtain a reconstructed image of the virtual viewpoint. This scheme improves the hole filling quality and thus the image quality of the free viewpoint video.

Description

Free viewpoint video reconstruction and playing processing method, device and storage medium
Technical Field
The embodiments of the present disclosure relate to the field of video processing technologies, and in particular, to a method, a device, and a storage medium for reconstructing and playing a free viewpoint video.
Background
Free viewpoint video is a technology that provides a viewing experience with a high degree of freedom: during viewing, a user can adjust the viewing angle through interactive operations and watch from any desired free viewpoint, which greatly improves the viewing experience.
To achieve free viewpoint viewing, virtual viewpoint synthesis techniques may be employed. Among them, Depth Image Based Rendering (DIBR) has become an important method for virtual viewpoint synthesis: a view at a viewpoint where no physical camera exists can be obtained using only the texture map of a reference viewpoint, its corresponding depth map, and a three-dimensional coordinate transformation.
DIBR mainly comprises steps such as viewpoint selection, preprocessing, mapping, view fusion, and post-processing. During mapping, a background texture portion occluded by a foreground object in the reference viewpoint may be invisible from the reference viewpoint yet visible from the virtual viewpoint. Therefore, after view fusion, the virtual view still contains some hole regions that are not filled.
To fill the holes in areas occluded by foreground objects, existing methods filter with the valid texture information around the hole. However, the results are often unsatisfactory: artifacts and blurring are easily produced, degrading the image quality of the reconstructed free viewpoint video.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a method, an apparatus, and a storage medium for reconstructing and playing a free viewpoint video, which can improve the hole filling quality and further improve the image quality of the free viewpoint video.
First, an embodiment of the present specification provides a free viewpoint video reconstruction method, including:
acquiring a free viewpoint video frame, wherein the video frame comprises original texture maps of a plurality of synchronous original viewpoints and original depth maps of corresponding viewpoints;
acquiring a target video frame corresponding to the virtual viewpoint;
synthesizing texture maps of the virtual viewpoints by adopting original texture maps and corresponding original depth maps of a plurality of original viewpoints in the target video frame;
acquiring a background texture map and a background depth map of a corresponding viewpoint of the target video frame, and acquiring a background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint;
and carrying out hole filling post-processing on a hole area in the texture map of the virtual viewpoint by adopting the background texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
Optionally, the obtaining a background texture map and a background depth map of a viewpoint corresponding to the target video frame includes:
selecting a reference texture map sequence and a reference depth map sequence of a corresponding viewpoint of the target video frame;
and respectively carrying out time domain filtering on the reference texture map sequence and the reference depth map sequence to obtain a background texture map and a background depth map of a viewpoint corresponding to the target video frame.
Optionally, the performing time-domain filtering on the reference texture map sequence and the reference depth map sequence respectively to obtain a background texture map and a background depth map of a viewpoint corresponding to the target video frame includes:
and respectively carrying out time domain median filtering on the pixels in the reference texture map sequence and the reference depth map sequence to obtain a background texture map and a background depth map of a corresponding viewpoint of the target video frame.
Optionally, the synthesizing the texture map of the virtual viewpoint by using the original texture maps and the corresponding original depth maps of the multiple original viewpoints in the target video frame includes:
based on the virtual viewpoint, selecting an original texture map and a corresponding original depth map of a corresponding original viewpoint in the target video frame according to a preset rule;
and synthesizing the texture map of the virtual viewpoint by adopting the selected original texture map of the corresponding original viewpoint and the corresponding original depth map.
Optionally, the obtaining a background texture map and a background depth map of a viewpoint corresponding to the target video frame includes:
acquiring a reference texture map sequence and a reference depth map sequence corresponding to the selected original viewpoint;
and respectively carrying out time domain filtering on the reference texture map sequence and the reference depth map sequence to obtain a background texture map and a background depth map of the selected corresponding original viewpoint.
Optionally, the obtaining a background texture map and a background depth map of a viewpoint corresponding to the target video frame includes:
pre-capturing, for the field of view covered by the target video frame, a background texture map of the corresponding viewpoint in which no foreground object is present;
and obtaining a background depth map of the corresponding viewpoint from the background texture map, captured without any foreground object, of the corresponding viewpoint in the field of view covered by the target video frame.
Optionally, the performing, by using the background texture map of the virtual viewpoint, hole filling post-processing on a hole region in the texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint includes:
and performing interpolation processing on the void region in the texture map of the virtual viewpoint by adopting the background texture map of the virtual viewpoint and using a joint bilateral filtering method to obtain a reconstructed image of the virtual viewpoint.
Optionally, after performing hole filling post-processing on a hole region in the texture map of the virtual viewpoint and before obtaining a reconstructed image of the virtual viewpoint, the method further includes:
and filtering the foreground edge in the texture map of the virtual viewpoint obtained after the hole filling post-processing to obtain a reconstructed image of the virtual viewpoint.
An embodiment of the present specification further provides a free viewpoint video playing processing method, including:
determining a virtual viewpoint, and determining a target video frame according to the virtual viewpoint;
synthesizing texture maps of the virtual viewpoints by adopting original texture maps and corresponding original depth maps of a plurality of original viewpoints in the target video frame;
acquiring a background texture map and a background depth map of a corresponding viewpoint of the target video frame, and acquiring a background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint;
and carrying out hole filling post-processing on a hole area in the texture map of the virtual viewpoint by adopting the background texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
Optionally, the determining a virtual viewpoint includes at least one of:
determining a virtual viewpoint in response to the user interaction behavior;
determining the virtual viewpoint based on virtual viewpoint position information contained in the video stream.
Optionally, the method further comprises:
acquiring a virtual rendering target object in the reconstructed image of the virtual viewpoint;
acquiring a virtual information image generated based on augmented reality special effect input data of the virtual rendering target object;
and synthesizing the virtual information image and the reconstructed image of the virtual viewpoint and displaying the synthesized image.
Optionally, the acquiring a virtual information image generated based on augmented reality effect input data of the virtual rendering target object includes:
and obtaining a virtual information image matched with the position of the virtual rendering target object according to the position of the virtual rendering target object in the reconstructed image of the virtual viewpoint obtained by three-dimensional calibration.
An embodiment of the present specification further provides a free viewpoint video reconstruction apparatus, including:
the video frame acquisition unit is suitable for acquiring a free viewpoint video frame, and the video frame comprises original texture maps of a plurality of synchronous original viewpoints and original depth maps of corresponding viewpoints;
the target video frame determining unit is suitable for acquiring a target video frame corresponding to the virtual viewpoint;
the virtual viewpoint texture map synthesizing unit is suitable for synthesizing the texture map of the virtual viewpoint by adopting the original texture maps and the corresponding original depth maps of a plurality of original viewpoints in the target video frame;
the virtual viewpoint background texture map synthesizing unit is suitable for acquiring a background texture map and a background depth map of a viewpoint corresponding to the target video frame, and acquiring the background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint;
and the post-processing unit is suitable for carrying out hole filling post-processing on a hole area in the texture map of the virtual viewpoint by adopting the background texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
An embodiment of the present specification further provides a free viewpoint video playing processing apparatus, including:
a virtual viewpoint determining unit adapted to determine a virtual viewpoint;
a target video frame determination unit adapted to determine a target video frame from the virtual viewpoint;
the virtual viewpoint texture map synthesizing unit is suitable for synthesizing the texture map of the virtual viewpoint by adopting the original texture maps and the corresponding original depth maps of a plurality of original viewpoints in the target video frame;
the virtual viewpoint background texture map synthesizing unit is suitable for acquiring a background texture map and a background depth map of a viewpoint corresponding to the target video frame, and acquiring the background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint;
and the post-processing unit is suitable for carrying out hole filling post-processing on a hole area in the texture map of the virtual viewpoint by adopting the background texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
The present specification further provides an electronic device, which includes a memory and a processor, where the memory stores computer instructions executable on the processor, and the processor executes the computer instructions to perform the steps of the method according to any one of the foregoing embodiments.
An embodiment of the present specification further provides an electronic device, including: a communication component, a processor, and a display component, wherein:
the communication component is suitable for acquiring a free viewpoint video;
the processor is adapted to perform the steps of the method of any of the preceding embodiments;
and the display component is suitable for displaying the reconstructed image of the virtual viewpoint obtained after the processing of the processor.
The present specification also provides a computer readable storage medium, on which computer instructions are stored, wherein the computer instructions are executed to perform the steps of the method of any one of the foregoing embodiments.
Compared with the prior art, the technical scheme of the embodiment of the specification has the following beneficial effects:
In the solution of the embodiments of this specification, a complete background texture map of the virtual viewpoint is obtained by reconstruction and is used to perform hole filling post-processing on the synthesized virtual viewpoint texture map of the target video frame. Compared with schemes that filter only with the textures around a hole, this avoids the artifacts and blurring caused by incomplete hole filling, improves the hole filling quality, and in turn improves the image quality of the free viewpoint video.
Further, based on the virtual viewpoint, the original texture maps and corresponding original depth maps of multiple original viewpoints in the target video frame are selected according to a preset rule and used as the reference texture maps and reference depth maps for synthesizing the virtual viewpoint texture map. This reduces the amount of data processed during video reconstruction and improves reconstruction efficiency.
Further, the reference texture map sequence and reference depth map sequence of the viewpoint corresponding to the selected target video frame are each temporally filtered to obtain the background texture map and background depth map of that viewpoint. Because this draws on the texture and depth information of the corresponding viewpoint over the time domain, rather than only the spatial information of the target video frame itself, the completeness and authenticity of the obtained background texture map and background depth map are improved, artifacts and blurring caused by foreground occlusion are avoided, and the hole filling quality is improved.
Drawings
Fig. 1 is a schematic diagram of an application specific system of a free viewpoint video presentation in an embodiment of the present specification;
FIG. 2 is a schematic diagram of an interactive interface of a terminal device in an embodiment of the present specification;
FIG. 3 is a schematic diagram of an arrangement of a collecting apparatus in an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an interactive interface of another terminal device in the embodiment of the present specification;
fig. 5 is a schematic diagram of a free viewpoint video data generation process in an embodiment of the present specification;
FIG. 6 is a schematic diagram illustrating the generation and processing of 6DoF video data according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a header file in an embodiment of the present disclosure;
fig. 8 is a schematic diagram of a user side processing 6DoF video data in an embodiment of the present specification;
fig. 9 is a flowchart of a free viewpoint video reconstruction method in an embodiment of the present disclosure;
fig. 10 is a schematic diagram of a free viewpoint video reconstruction method for a specific application scene in an embodiment of the present specification;
fig. 11 is a flowchart of a free viewpoint video playing processing method in an embodiment of the present specification;
fig. 12 is a flowchart of another free viewpoint video playing processing method in the embodiment of the present specification;
fig. 13 to 17 are schematic diagrams of display interfaces of an interactive terminal in an embodiment of the present specification;
fig. 18 is a schematic structural diagram of a free viewpoint video reconstruction apparatus in an embodiment of the present specification;
fig. 19 is a schematic structural diagram of a free viewpoint video playing processing apparatus in an embodiment of the present specification;
fig. 20 is a schematic structural diagram of an electronic device in an embodiment of the present specification;
fig. 21 is a schematic structural diagram of another electronic device in an embodiment of the present specification;
fig. 22 is a schematic structural diagram of a video processing system in an embodiment of the present specification.
Detailed Description
For those skilled in the art to better understand and implement the embodiments in the present specification, the following first provides an exemplary description of an implementation of a free viewpoint video with reference to the drawings and a specific application scenario.
Referring to fig. 1, a specific application system for free viewpoint video presentation in an embodiment of the present invention may include an acquisition system 11 with multiple acquisition devices, a server 12, and a display device 13, where the acquisition system 11 may perform image acquisition on an area to be viewed; the acquisition system 11 or the server 12 may process the acquired multiple texture maps in synchronization, and generate multi-angle free view data capable of supporting the display device 13 to perform virtual viewpoint switching. The display device 13 may present a reconstructed image generated based on the multi-angle free view data, the reconstructed image corresponding to a virtual viewpoint, present reconstructed images corresponding to different virtual viewpoints according to a user instruction, and switch viewing positions and viewing angles.
In a specific implementation, the process of reconstructing the image to obtain the reconstructed image may be implemented by the display device 13, or may be implemented by a device located in a Content Delivery Network (CDN) in an edge computing manner. It is to be understood that fig. 1 is an example only and is not limiting of the acquisition system, the server, the terminal device, and the specific implementation.
With continued reference to fig. 1, the user may view the area to be viewed through the display device 13, in this embodiment, the area to be viewed is a basketball court. As described above, the viewing position and the viewing angle are switchable.
For example, the user may slide on the screen to switch the virtual viewpoint. In an embodiment of the present invention, with combined reference to fig. 2, when the user's finger slides on the screen along the direction D2, the virtual viewpoint used for viewing can be switched. With continued reference to fig. 3, the position of the virtual viewpoint before sliding may be VP1; after the sliding switches the virtual viewpoint, its position may be VP2. Referring to fig. 4, after the screen is slid, the reconstructed image presented on the screen may be as shown in fig. 4. The reconstructed image can be obtained by image reconstruction based on multi-angle free view data generated from images acquired by a plurality of acquisition devices in an actual acquisition scenario.
It is to be understood that the image viewed before switching may be a reconstructed image. The reconstructed image may be a frame image in a video stream. In addition, the manner of switching the virtual viewpoint according to the user instruction may be various, and is not limited herein.
In a specific implementation, the viewpoint may be represented by coordinates with 6 Degrees of Freedom (DoF), where the spatial position of the viewpoint may be expressed as (x, y, z) and the viewing angle as three rotational directions (e.g., θx, θy, θz).
Accordingly, based on the 6DoF coordinates, a virtual viewpoint, including both its position and its view angle, may be determined.
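Purely as an illustration of this representation (the class and field names below are hypothetical, not from the patent text), a 6DoF viewpoint could be modeled as:

```python
from dataclasses import dataclass

@dataclass
class Viewpoint6DoF:
    # Spatial position of the viewpoint.
    x: float
    y: float
    z: float
    # Viewing angle as three rotational directions (assumed Euler angles, radians).
    theta_x: float
    theta_y: float
    theta_z: float
```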
The virtual viewpoint is a three-dimensional concept, and generating a reconstructed image requires three-dimensional information. In a specific implementation, the multi-angle free view data may include depth map data, which supplies the third dimension beyond the planar image. Compared with other implementations, such as providing three-dimensional information through point cloud data, the data volume of depth map data is small.
In the embodiment of the invention, the switching of the virtual viewpoint can be performed within a certain range, namely a multi-angle free visual angle range. That is, the position and view angle of the virtual viewpoint can be arbitrarily switched within the multi-angle free view angle range.
The multi-angle free view angle range is related to the arrangement of the acquisition devices: the wider the shooting coverage of the acquisition devices, the larger the multi-angle free view angle range. The quality of the picture displayed by the terminal device is related to the number of acquisition devices; generally, the more acquisition devices are deployed, the fewer hole areas appear in the displayed picture.
Furthermore, the range of multi-angle freeviews is related to the spatial distribution of the acquisition device. The range of the multi-angle free viewing angle and the interaction mode with the display device at the terminal side can be set based on the spatial distribution relation of the acquisition device.
It can be understood by those skilled in the art that the foregoing embodiments and the corresponding drawings are only exemplary illustrations, and are not limited to the setting of the capturing device and the association relationship between the multi-angle free viewing angle ranges, nor the interaction manner and the display effect of the display device.
With reference to fig. 5, free viewpoint video reconstruction requires texture map acquisition and depth map calculation, which involves three main steps: Multi-Camera Video Capturing, Camera Parameter Estimation, and Depth Map Calculation. For multi-camera video capture, the video captured by the various cameras should be frame-level aligned. Texture images (Texture Image) are obtained through the multi-camera video acquisition; the camera parameters (Camera Parameter), which may include internal and external parameter data of each camera, are obtained by computing the cameras' internal and external parameters; and depth maps (Depth Map) are obtained through the depth map calculation. The synchronized texture maps, the depth maps of the corresponding view angles, and the camera parameters together form the 6DoF video data.
In the embodiment of the present specification, a special camera, such as a light field camera, is not required for capturing the video. Likewise, complicated camera calibration prior to acquisition is not required. Multiple cameras can be laid out and arranged to better capture objects or scenes to be photographed.
After the above three steps are processed, the texture map collected from the multiple cameras, the camera parameters of all the cameras, and the depth map of each camera are obtained. These three portions of data may be referred to as data files in multi-angle freeview video data, and may also be referred to as 6-degree-of-freedom video data (6DoF video data). With the data, the user end can generate a virtual viewpoint according to a virtual 6 Degree of Freedom (DoF) position, thereby providing a video experience of 6 DoF.
With reference to fig. 6, the 6DoF video data and the indicative data may be compressed and transmitted to the user side, which can then obtain its 6DoF expression from the received data, that is, from the 6DoF video data and the metadata. The indicative data may also be referred to as metadata (Metadata). The video data includes the texture map and depth map data of each viewpoint corresponding to the multiple cameras; the texture maps and depth maps may be stitched according to a certain stitching rule or stitching mode to form a stitched image.
Referring to fig. 7 in combination, the metadata may be used to describe a data schema of the 6DoF video data, and specifically may include: stitching Pattern metadata (Stitching Pattern metadata) indicating storage rules for pixel data and depth map data of a plurality of texture maps in a stitched image; edge protection metadata (Padding pattern metadata), which may be used to indicate the way edge protection is performed in the stitched image, and Other metadata (Other metadata). The metadata may be stored in a header file, and the specific order of storage may be as shown in FIG. 7, or in other orders.
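As a hedged illustration of such a header (the patent names the metadata categories but not a concrete layout, so every field name below is an assumption), the metadata could be modeled as:

```python
from dataclasses import dataclass, field

@dataclass
class StitchingPatternMetadata:
    # Storage rule for texture and depth tiles inside the stitched image.
    rows: int
    cols: int
    tile_width: int
    tile_height: int

@dataclass
class PaddingPatternMetadata:
    # Edge-protection mode applied to tiles in the stitched image.
    mode: str
    pad_pixels: int

@dataclass
class Header6DoF:
    stitching: StitchingPatternMetadata
    padding: PaddingPatternMetadata
    other: dict = field(default_factory=dict)  # remaining ("Other") metadata
```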
With reference to fig. 8, the user side obtains the 6DoF video data, which includes the camera parameters, the stitched images (texture maps and depth maps), and the descriptive metadata, together with the interaction behavior data of the user side. With these data, the user side may perform 6DoF rendering in a Depth Image-Based Rendering (DIBR) manner to generate an image of the virtual viewpoint at the specific 6DoF position determined by the user behavior; that is, according to a user instruction, the virtual viewpoint at the 6DoF position corresponding to that instruction is determined.
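To make the DIBR step concrete, below is a minimal sketch of forward 3D warping from a reference view into the virtual view under a standard pinhole model. The camera conventions (world-to-camera [R|t] extrinsics, metric z-depth) are assumptions, and this simplified splat omits z-buffering and view fusion; it is not the patent's specific implementation:

```python
import numpy as np

def warp_to_virtual(texture, depth, K_ref, RT_ref, K_virt, RT_virt):
    """Forward-warp a reference texture map into the virtual view (DIBR).

    texture: (H, W, 3) uint8; depth: (H, W) metric z-depth of the reference view.
    K_*: (3, 3) intrinsics; RT_*: (3, 4) world-to-camera [R|t] extrinsics.
    Unmapped pixels stay 0 and become the hole regions to be filled later.
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3).T
    # Back-project reference pixels to camera, then to world coordinates.
    cam = np.linalg.inv(K_ref) @ pix * depth.reshape(-1)
    R, t = RT_ref[:, :3], RT_ref[:, 3:]
    world = R.T @ (cam - t)
    # Re-project world points into the virtual camera.
    cam_v = K_virt @ (RT_virt[:, :3] @ world + RT_virt[:, 3:])
    z = cam_v[2].copy()
    z[z == 0] = 1e-9                       # guard against division by zero
    u_v = np.round(cam_v[0] / z).astype(int)
    v_v = np.round(cam_v[1] / z).astype(int)
    out = np.zeros_like(texture)
    ok = (cam_v[2] > 0) & (u_v >= 0) & (u_v < w) & (v_v >= 0) & (v_v < h)
    out[v_v[ok], u_v[ok]] = texture.reshape(-1, 3)[ok]   # nearest-pixel splat
    return out
```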
In typical depth map calculation, the depth map is computed independently for each frame. The inventors found that, for a stationary background, this can produce inconsistent depth values across frames, which results in temporally visible blurring of the picture.
As described above, holes in areas occluded by foreground objects are currently filled by filtering with the valid texture information around the hole, but the actual repair effect is not ideal: artifacts and blurring are easily produced, and the image quality of the reconstructed free viewpoint video is poor.
Therefore, embodiments of this specification provide a free viewpoint video reconstruction scheme in which a complete background texture map of the virtual viewpoint is reconstructed and used to perform hole filling post-processing on the synthesized virtual viewpoint texture map of the target video frame. Compared with schemes that filter only with the textures around a hole, this avoids artifacts and blurring caused by incomplete hole filling, improves the hole filling quality, and thus improves the image quality of the free viewpoint video.
The following describes in detail a scheme, a principle, an advantage, and the like of the hole filling post-processing in the free viewpoint video reconstruction process according to the embodiment of the present disclosure with reference to the accompanying drawings and a specific application scenario.
Referring to the flowchart of the free viewpoint video reconstruction method shown in fig. 9, in a specific implementation, if the method is applied to a specific application system for displaying a free viewpoint video shown in fig. 1, the method may be implemented by the server 12 or the display device 13, and specifically may perform free viewpoint video reconstruction by using the following steps:
s91, obtaining a free viewpoint video frame, wherein the video frame comprises original texture maps of a plurality of synchronous original viewpoints and original depth maps of corresponding viewpoints.
In a particular implementation, a free-viewpoint video frame may include original texture maps for multiple original viewpoints and original depth maps for corresponding viewpoints in synchronization. As an alternative example, a free viewpoint video frame may be obtained based on the aforementioned 6DoF video data, wherein the corresponding view angle is also the corresponding viewpoint.
In a specific implementation, the free viewpoint video stream may be downloaded over a network, or the free viewpoint video frame may be obtained from a locally stored free viewpoint video file.
And S92, acquiring the target video frame corresponding to the virtual viewpoint.
In a specific implementation, the virtual viewpoint may be determined according to user interaction behavior, or according to a preset setting. If the determination is based on the user interaction behavior, the virtual viewpoint position at the corresponding interaction moment can be determined by acquiring the track data corresponding to the user interaction operation, and the virtual viewpoint is determined.
In some embodiments of the present description, position information of a virtual viewpoint corresponding to a corresponding video frame may also be preset at a server (e.g., a server or a cloud), and the set position information of the virtual viewpoint may be transmitted in a header file of the free viewpoint video.
After the virtual viewpoint is determined, a corresponding video frame in the free viewpoint video corresponding to the virtual viewpoint may be determined as a target video frame.
S93, synthesizing the texture map of the virtual viewpoint by using the original texture maps and the corresponding original depth maps of the original viewpoints in the target video frame.
In a specific implementation, according to the position information of the virtual viewpoint, the texture map of the virtual viewpoint may be synthesized by using the original texture maps and the corresponding original depth maps of all viewpoints included in the target video frame.
To reduce the amount of data processed and increase the image reconstruction speed while still ensuring reconstruction quality, the original texture maps and corresponding original depth maps of a subset of the viewpoints in the target video frame may be selected, based on the position information of the virtual viewpoint, to synthesize the texture map of the virtual viewpoint.
Specifically, based on the virtual viewpoint, the original texture map and the corresponding original depth map of the corresponding original viewpoint in the target video frame may be selected according to a preset rule, and then the texture map of the virtual viewpoint is synthesized by using the selected original texture map and the corresponding original depth map of the corresponding original viewpoint. For example, the original texture map and the corresponding original depth map of the corresponding original viewpoint that satisfies the preset distance condition with the virtual viewpoint may be selected based on the spatial position relationship between the virtual viewpoint and each original viewpoint position. For another example, the original texture map and the corresponding original depth map of the corresponding original viewpoint that satisfy the preset spatial position relationship with the virtual viewpoint and satisfy the preset number threshold may be selected.
It will be appreciated that the above are merely examples of alternative implementations for selecting the original texture maps and corresponding original depth maps of a subset of the original viewpoints; the specific selection conditions are not limited here.
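As one hedged illustration of such a preset rule (the Euclidean distance metric and the count k are assumptions, not mandated by the text), the nearest original viewpoints could be selected as follows:

```python
import numpy as np

def select_reference_views(virtual_pos, camera_positions, k=2):
    """Return the indices of the k original viewpoints whose camera
    positions are closest to the virtual viewpoint position.

    virtual_pos: (3,) position of the virtual viewpoint.
    camera_positions: (N, 3) positions of the original acquisition cameras.
    """
    d = np.linalg.norm(np.asarray(camera_positions) - np.asarray(virtual_pos), axis=1)
    return np.argsort(d)[:k]
```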
S94, obtaining the background texture map and the background depth map of the viewpoint corresponding to the target video frame, and obtaining the background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint.
In specific implementation, there are various ways to obtain the background texture map and the background depth map of the corresponding viewpoint of the target video frame, for example, a temporal filtering way may be adopted, or a pre-acquisition way may be adopted. The specific implementation will be described in detail later with reference to specific application scenarios.
After the background texture map and the background depth map of the viewpoint corresponding to the target video frame are obtained, virtual viewpoint synthesis may be performed in the same manner as in step S93 to obtain the background texture map of the virtual viewpoint.
In specific implementation, the background texture map of the virtual viewpoint may be subjected to hole filling post-processing to enhance the image quality of the background texture map of the virtual viewpoint. As a specific example, a joint bilateral filtering method may be adopted to perform hole filling post-processing on the background texture map of the virtual viewpoint.
In a specific implementation, in order to obtain a more complete background texture map of the virtual viewpoint, a background texture map and a background depth map of multiple viewpoints may be used, wherein the density of the selected viewpoint may be greater than that of the viewpoint corresponding to the target video frame.
And S95, performing hole filling post-processing on the hole area in the texture map of the virtual viewpoint by using the background texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
By adopting the background texture map of the virtual viewpoint, the hole filling post-processing can be carried out on the hole area in the texture map of the virtual viewpoint in various ways.
In a specific implementation, the texture map of the virtual viewpoint and the background texture map of the virtual viewpoint may be compared pixel by pixel. Wherever the background area of the virtual viewpoint texture map is inconsistent with the background texture map, or the difference between the pixel values exceeds a preset threshold, the value of the corresponding pixel in the virtual viewpoint texture map is replaced with the value of the corresponding pixel in the background texture map.
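A minimal numpy sketch of this pixel-wise substitution, assuming the hole mask and background-area mask are known (e.g. marked during warping and view fusion) and an illustrative difference threshold:

```python
import numpy as np

def fill_from_background(texture, background, hole_mask, bg_mask, thresh=20):
    """Replace hole pixels, and background pixels that disagree with the
    reconstructed background map, with the background texture values.

    texture, background: (H, W, 3) uint8 virtual-view texture / background maps.
    hole_mask: (H, W) bool, True where view fusion left the pixel unfilled.
    bg_mask: (H, W) bool, True for pixels belonging to the background area.
    thresh: illustrative per-channel difference threshold.
    """
    diff = np.abs(texture.astype(np.int32) - background.astype(np.int32)).max(axis=2)
    replace = hole_mask | (bg_mask & (diff > thresh))
    out = texture.copy()
    out[replace] = background[replace]
    return out
```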
For another example, a joint bilateral filtering method may be used to interpolate the hole regions in the texture map of the virtual viewpoint, obtaining the reconstructed image of the virtual viewpoint. In specific implementations, this may be executed with a dedicated joint bilateral filter, or by invoking corresponding software logic. Joint bilateral filtering can protect the foreground edges in the virtual viewpoint texture map while removing background noise.
In other embodiments of this specification, the background texture map of the virtual viewpoint is used as a guide map, and a guide filtering method is used to fill up a hole in a hole region in the texture map of the virtual viewpoint, so as to obtain a reconstructed image of the virtual viewpoint.
In a specific implementation, other filtering methods, such as bilateral filtering or median smoothing filtering, may also be used to perform hole filling post-processing on the texture map of the virtual viewpoint based on the input background texture map of the virtual viewpoint; these are not enumerated here.
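As a hedged sketch of the filtering-based variants above, the following uses OpenCV's joint bilateral filter (cv2.ximgproc, from the opencv-contrib-python package) with the virtual-view background texture map as the joint/guide image; the filter diameter and sigma values are illustrative assumptions:

```python
import cv2
import numpy as np

def fill_holes_joint_bilateral(texture, background, hole_mask,
                               d=9, sigma_color=25.0, sigma_space=9.0):
    """Fill hole pixels in the virtual-view texture map by joint bilateral
    filtering, guided by the virtual-view background texture map.
    Requires opencv-contrib-python (cv2.ximgproc)."""
    # Seed the holes with background values so the filter has data there.
    seeded = texture.copy()
    seeded[hole_mask] = background[hole_mask]
    # The range kernel is computed on the background guide image, which
    # keeps filled regions consistent with background structure.
    filtered = cv2.ximgproc.jointBilateralFilter(
        background, seeded, d, sigma_color, sigma_space)
    out = texture.copy()
    out[hole_mask] = filtered[hole_mask]
    return out
```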
After the step S95 performs the hole filling post-processing on the hole region in the texture map of the virtual viewpoint, in order to further improve the quality of the reconstructed image, the foreground edge in the texture map of the virtual viewpoint obtained after the hole filling post-processing may be further filtered to obtain the reconstructed image of the virtual viewpoint.
With this embodiment, a complete background texture map of the virtual viewpoint is reconstructed and used to perform hole filling post-processing on the synthesized virtual viewpoint texture map of the target video frame. Compared with schemes that filter only with the textures around a hole, this avoids artifacts and blurring caused by incomplete hole filling, improves the hole filling quality, and thus improves the image quality of the free viewpoint video.
For better understanding and implementation by those skilled in the art, the following first gives some examples of specific implementations of obtaining the background texture map and the background depth map of the corresponding viewpoint of the target video frame.
In an example one, the reference texture map sequence and the reference depth map sequence of the viewpoint corresponding to the target video frame are selected, and then the background texture map and the background depth map of the viewpoint corresponding to the target video frame are obtained.
Wherein, for the selection of the reference texture map sequence and the reference depth map sequence, in some embodiments, the following manner is adopted:
in a first mode, for any original viewpoint in all original viewpoints corresponding to an original texture map and an original depth map contained in the target video frame, a corresponding reference texture map sequence and a reference depth map sequence are obtained.
For example, the target video frame includes original texture maps and corresponding original depth maps of 30 viewpoints, and corresponding reference texture map sequences and reference depth map sequences are obtained for the 30 viewpoints respectively.
And in the second mode, a reference texture map sequence and a corresponding reference depth map sequence of the original viewpoint selected by synthesizing the texture map of the virtual viewpoint are adopted.
Specifically, a reference texture map sequence and a reference depth map sequence corresponding to the selected original viewpoint may be obtained, and the reference texture map sequence and the reference depth map sequence are respectively subjected to time-domain filtering to obtain a background texture map and a background depth map of the selected corresponding original viewpoint.
For example, if only the original texture maps and the corresponding original depth maps of the two original viewpoints closest to the virtual viewpoint are selected when the texture maps of the virtual viewpoint are synthesized, only the reference texture map sequences and the reference depth map sequences of the two original viewpoints closest to the virtual viewpoint may be obtained, so that the amount of data calculation may be reduced, and the generation efficiency of the virtual viewpoint background texture map may be improved.
In addition, the selected reference texture map sequence and the reference depth map sequence may be selected from video clips independent from the video clip containing the target video frame, or selected from video clips containing the target video frame.
After the reference texture map sequence and the reference depth map sequence of the viewpoint corresponding to the target video frame are selected, time-domain filtering may be performed on the reference texture map sequence and the reference depth map sequence, respectively, to obtain a background texture map and a background depth map of the viewpoint corresponding to the target video frame.
In particular implementations, there are a number of ways in which temporal filtering may be implemented.
For example, a mean filtering method, more specifically, an arithmetic mean filtering, a median average filtering, a moving average filtering, or the like may be employed.
As another example, a median filtering method may be employed. Specifically, the temporal median filtering may be performed on the pixels in the reference texture map sequence and the pixels in the corresponding reference depth map sequence, respectively, to obtain a background texture map and a background depth map of the viewpoint corresponding to the target video frame.
As an alternative example, a sequence of video frames from time t1 to time t2 may be selected from a video X of the same viewpoint as the target depth map, serving as the reference texture map sequence for that time period together with its corresponding reference depth map sequence. The sampled values at each pixel position in the reference texture map sequence and the reference depth map sequence may be sorted by magnitude, and the middle value taken as the effective value of the corresponding pixel position in the background texture map and the background depth map, respectively. To make taking the median convenient, the number of images sampled from t1 to t2 in the reference texture map sequence and the corresponding reference depth map sequence should be odd, for example 3, 5, or 7 consecutive frames. This can be expressed by the formula:

P(x_t) = med({I_{x,i} | i ∈ [t1, t2]})

where P(x_t) denotes the background texture map or the background depth map, I_{x,i} denotes the sequence of pixel values, from t1 to t2, at the same pixel position in the reference texture map sequence or reference depth map sequence as x_t in P, and med denotes taking the middle value of I_{x,i}.
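A minimal numpy sketch of this per-pixel temporal median; it assumes the reference frames are already aligned and stacked along the first axis:

```python
import numpy as np

def temporal_median(frames):
    """Per-pixel temporal median over a reference sequence.

    frames: sequence of T aligned texture maps (H, W, C) or depth maps
    (H, W) sampled between t1 and t2; T should be odd (e.g. 3, 5, 7)
    so the median is an actual sampled value.
    Returns the background texture map or background depth map P.
    """
    return np.median(np.stack(frames, axis=0), axis=0)
```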
It is understood that in the specific implementation, other temporal filtering methods, such as clipping filtering, first-order lag filtering, etc., may also be used according to the environmental characteristics involved in the specific video and the specific requirements.
In a second example, a background texture map of the corresponding viewpoint of the target video frame, with no foreground object present, is captured in advance, and a background depth map of the corresponding viewpoint is then obtained from it.
In specific implementation, a background texture map in which no foreground object exists in a corresponding viewpoint in a field of view to which the target video frame is directed may be collected in advance, and a background depth map of the corresponding viewpoint may be obtained according to the background texture map in which no foreground object exists in the corresponding viewpoint in the field of view to which the target video frame is directed.
Because the background in an image is fixed relative to the acquisition viewpoint, in some embodiments of this specification a texture image is captured in advance at the corresponding viewpoint while no foreground object is present in the field of view covered by the target video frame. Such a texture image contains only background texture information, so it can be used as the background texture map of the corresponding viewpoint, and the background depth map of that viewpoint can then be obtained from it.
For example, for a live basketball game, one or more images without foreground objects can be captured at the corresponding viewpoints before the game starts. If a single image is captured, it can be used directly as the background texture map; if multiple images are captured, they can be used as a reference texture map sequence and temporally filtered to obtain the background texture map of the corresponding viewpoint. Correspondingly, a reference depth map can be estimated from each captured reference texture map through depth calculation. The reference depth map corresponding to a single reference texture map can be used directly as the background depth map; for multiple reference texture maps, the corresponding reference depth map sequence can be obtained and then temporally filtered to obtain the background depth map.
Referring to fig. 10, which is a schematic diagram of a free viewpoint video reconstruction method for a specific application scene, in an embodiment of this specification, a plurality of free viewpoint video frames I may first be obtained, where any free viewpoint video frame I includes the original texture maps of a plurality of original viewpoints and the original depth maps of the corresponding viewpoints. A target video frame I0 is determined based on the virtual viewpoint. Based on the original texture maps and corresponding original depth maps of the multiple original viewpoints contained in the target video frame I0, the texture map T0 of the virtual viewpoint can be synthesized through virtual viewpoint reconstruction. Based on the target video frame, the background texture map Tb and background depth map Db of the corresponding viewpoint are obtained; from Tb and Db, the background texture map Tb0 of the virtual viewpoint is obtained through virtual viewpoint reconstruction. The background texture map Tb0 of the virtual viewpoint is then used to perform hole filling post-processing on the texture map T0 of the virtual viewpoint, yielding the final free viewpoint video reconstructed image Te.
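Tying the fig. 10 flow together, here is a minimal end-to-end sketch that reuses the hypothetical helpers sketched earlier (select_reference_views, warp_to_virtual, temporal_median, fill_holes_joint_bilateral) and assumes an illustrative dict layout for frame and camera data; it is not the patent's actual implementation:

```python
import numpy as np

def reconstruct_virtual_view(frames, idx, vvp, cams):
    """Sketch of the fig. 10 pipeline: I0 -> T0, Tb/Db -> Tb0 -> Te."""
    target = frames[idx]                                   # target frame I0
    # Pick the nearest original viewpoint (k=1 for brevity).
    v = int(select_reference_views(vvp["pos"], cams["pos"], k=1)[0])
    # Texture map T0 of the virtual viewpoint; zero pixels mark holes.
    t0 = warp_to_virtual(target["tex"][v], target["depth"][v],
                         cams["K"][v], cams["RT"][v], vvp["K"], vvp["RT"])
    hole_mask = (t0 == 0).all(axis=2)
    # Background texture map Tb and depth map Db via temporal median.
    tb = temporal_median([f["tex"][v] for f in frames]).astype(np.uint8)
    db = temporal_median([f["depth"][v] for f in frames])
    # Background texture map Tb0 of the virtual viewpoint.
    tb0 = warp_to_virtual(tb, db, cams["K"][v], cams["RT"][v],
                          vvp["K"], vvp["RT"])
    # Hole filling post-processing yields the reconstructed image Te.
    return fill_holes_joint_bilateral(t0, tb0, hole_mask)
```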
In the foregoing, a free viewpoint video reconstruction method is described in detail by using some specific examples, and an embodiment of the present specification further provides a corresponding free viewpoint video playing processing method, where referring to a flowchart of the free viewpoint video playing processing method shown in fig. 11, the method may specifically include the following steps:
and S111, determining a virtual viewpoint and determining a target video frame according to the virtual viewpoint.
In specific implementation, the virtual viewpoint may be generated in real time during the playing process of the free viewpoint video, or may be preset. More specifically, a virtual viewpoint may be determined in response to a gesture interaction by a user. For example, the virtual viewpoint at the corresponding interaction time is determined by acquiring trajectory data corresponding to the user interaction operation. Alternatively, the location information of the virtual viewpoint corresponding to the corresponding video frame may be preset in a server (such as a server or a cloud), and the set location information of the virtual viewpoint may be transmitted in a header file of the free-view video stream, so that the virtual viewpoint may be determined based on the virtual viewpoint location information included in the video stream.
After the virtual viewpoint is determined, the corresponding frame time and the video frame at the corresponding frame time can be determined as the target video frame according to the virtual viewpoint.
And S112, synthesizing the texture map of the virtual viewpoint by adopting the original texture maps and the corresponding original depth maps of the plurality of original viewpoints in the target video frame.
After the virtual viewpoint is determined, to save data processing resources, the original texture maps of a subset of original viewpoints and the original depth maps of the corresponding viewpoints in the target video frame may be selected according to a preset rule, based on the virtual viewpoint position and the parameter data corresponding to the target video frame, and combined for rendering to synthesize the texture map of the virtual viewpoint. For example, the original texture maps and original depth maps corresponding to the 2 to N viewpoints closest to the virtual viewpoint position in the target video frame may be selected, where N is the number of original texture maps in the target video frame, that is, the number of acquisition devices producing them. In specific implementations, this number may be fixed or may vary.
S113, obtaining a background texture map and a background depth map of a corresponding viewpoint of the target video frame, and obtaining the background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint.
A specific implementation manner of obtaining the background texture map and the background depth map of the viewpoint corresponding to the target video frame, and obtaining the background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint may refer to the description of step S113 and the specific implementation manner in the foregoing embodiment, and a description thereof is not repeated here.
And S114, carrying out hole filling post-processing on a hole area in the texture map of the virtual viewpoint by adopting the background texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
In specific implementation, any one or more filtering modes such as bilateral filtering, joint bilateral filtering, guided filtering and the like may be adopted to perform hole filling post-processing on a hole region in the texture map of the virtual viewpoint, so as to obtain a reconstructed image of the virtual viewpoint.
With this free viewpoint video playing processing method, every displayed video frame undergoes hole post-processing based on the background texture map of the corresponding virtual viewpoint. Because that background texture map is obtained by temporally filtering the reference texture maps and reference depth maps of the original viewpoints used to synthesize the virtual viewpoint texture map, it contains stable and complete background texture information; using it for hole filling post-processing therefore improves the quality of the reconstructed image of the virtual viewpoint.
To reduce image holes, the position of the camera (acquisition device) can be configured by a specific viewpoint configuration algorithm or system. In specific implementation, three-dimensional space information of a field of view, the number of selectable viewpoints, internal and external parameters of a camera (including parameters such as a horizontal field angle and a vertical field angle of the camera) and the like can be acquired, matching and operation are performed according to a preset configuration model, and a suggested camera arrangement mode and a corresponding camera position can be output.
In specific implementation, the playing mode of the free viewpoint video can be further optimized and expanded on the basis of the above embodiment. An exemplary extension is given below.
In order to enrich the visual experience of the user, an Augmented Reality (AR) special effect can be implanted in the reconstructed free viewpoint image. In some embodiments of the present description, referring to the flowchart of the free viewpoint video playing processing method shown in fig. 12, implantation of the AR special effect may be implemented in the following manner:
And S121, acquiring a virtual rendering target object in the reconstructed image of the virtual viewpoint.
In a specific implementation, certain objects in an image of the free viewpoint video may be determined as virtual rendering target objects based on indication information, which may be generated through user interaction or obtained from a preset trigger condition or a third-party instruction. In an optional embodiment of the present specification, the virtual rendering target object in the reconstructed image of the virtual viewpoint may be acquired in response to a special effect generation interaction control instruction.
And S122, acquiring a virtual information image generated based on the augmented reality special effect input data of the virtual rendering target object.
In the embodiment of the specification, the implanted AR special effect is presented in the form of a virtual information image. The virtual information image may be generated based on augmented reality special effects input data of the target object. After determining the virtual rendering target object, a virtual information image generated based on augmented reality effect input data of the virtual rendering target object may be acquired.
In this embodiment of the present specification, the virtual information image corresponding to the virtual rendering target object may be generated in advance, or may be generated in real time in response to a special effect generation instruction.
In a specific implementation, a virtual information image matched with the position of the virtual rendering target object can be obtained based on the position of the virtual rendering target object in the reconstructed image obtained by three-dimensional calibration. The obtained virtual information image then matches the target object's position in three-dimensional space more closely, so the displayed virtual information image better conforms to its real state in three-dimensional space, the displayed composite image appears more real and vivid, and the visual experience of the user is enhanced.
In specific implementation, a virtual information image corresponding to a target object may be generated according to a preset special effect generation manner based on augmented reality special effect input data of a virtual rendering target object.
In a specific implementation, a variety of special effect generation approaches may be employed; two examples follow, with a projection sketch after them.
For example, augmented reality special effect input data of the target object may be input to a preset three-dimensional model, and a virtual information image matched with the virtual rendering target object may be output based on a position of the virtual rendering target object in the image obtained by three-dimensional calibration;
for another example, the augmented reality special effect input data of the virtual rendering target object may be input to a preset machine learning model, and based on the position of the virtual rendering target object in the image obtained by three-dimensional calibration, a virtual information image matched with the virtual rendering target object is output.
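To make the first (preset three-dimensional model) path concrete, the sketch below rasterizes a simple ground ring, of the kind shown later as the stereoscopic ring R1, at the calibrated 3-D position of a target object and projects it with the camera parameters into a BGRA virtual information image. All names and the pinhole model are assumptions for illustration; neither the patent's 3-D model nor its machine learning model is reproduced here.

```python
import numpy as np

def project(points_3d, K, R, t):
    """Pinhole projection of Nx3 world points; (R, t) is world-to-camera."""
    cam = points_3d @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def ring_effect_image(center_3d, radius, K, R, t, hw):
    """Rasterize a thin ground ring around center_3d into a BGRA overlay."""
    h, w = hw
    angles = np.linspace(0.0, 2.0 * np.pi, 720)
    circle = center_3d + radius * np.stack(
        [np.cos(angles), np.zeros_like(angles), np.sin(angles)], axis=1)
    overlay = np.zeros((h, w, 4), dtype=np.uint8)
    for u, v in project(circle, K, R, t).round().astype(int):
        if 0 <= v < h and 0 <= u < w:
            overlay[v, u] = (0, 200, 255, 255)  # opaque ring pixel (BGRA)
    return overlay
```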
And S123, synthesizing and displaying the virtual information image and the image of the virtual viewpoint.
In a specific implementation, the virtual information image and the reconstructed image of the virtual viewpoint may be synthesized and displayed in various ways; two realizable examples are given below, followed by a short blending sketch:
example one: fusing the virtual information image and the corresponding reconstructed image to obtain a fused image, and displaying the fused image;
example two: superimposing the virtual information image on the corresponding reconstructed image to obtain a superimposed composite image, and displaying the composite image.
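A minimal sketch of example two follows, assuming the virtual information image carries an alpha channel; overlay_composite is an illustrative name, and example one's fusion could reuse the same blend with a spatially varying weight.

```python
import numpy as np

def overlay_composite(reconstructed_bgr, info_bgra):
    """Example two as alpha blending: superimpose the BGRA virtual
    information image on the reconstructed virtual-viewpoint image."""
    alpha = info_bgra[..., 3:4].astype(np.float32) / 255.0
    blended = (1.0 - alpha) * reconstructed_bgr + alpha * info_bgra[..., :3]
    return blended.astype(np.uint8)
```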
In a specific implementation, the obtained composite image (for example, the fused image) may be displayed directly, or inserted into the video stream to be played for playing and display.
In a specific implementation, the position at which the virtual information image is superimposed on the image of the virtual viewpoint may be determined based on the special effect display identifier, and the virtual information image may then be displayed superimposed at the determined position.
For a better understanding and implementation by those skilled in the art, the following describes the image presentation process of an interactive terminal in detail. Referring to the video playback screen views of the interactive terminal shown in figs. 13 to 17, the interactive terminal T1 plays a video in real time. Referring to fig. 13, a video frame P1 is shown first. Next, the video frame P2 shown by the interactive terminal includes a plurality of special effect display identifiers, such as the special effect display identifier I1, represented by an inverted triangle symbol pointing to the target object, as shown in fig. 14. It is understood that a special effect display identifier may also be displayed in other manners. When the terminal user touches the special effect display identifier I1, the system automatically acquires the virtual information image corresponding to I1 and displays it in video frame P3 in an overlapping manner: as shown in fig. 15, a stereoscopic ring R1 is rendered centered on the spot where player Q1 stands. Next, as shown in figs. 16 and 17, the terminal user touches the special effect display identifier I2 in video frame P3; the system automatically acquires the corresponding virtual information image and superimposes it on video frame P3, obtaining video frame P4, in which a hit rate information display board M0 is shown. The board M0 presents the number, name, and hit rate information of the target object, namely player Q2.
As shown in fig. 13 to 17, the terminal user may continue to click on other special effect display identifiers displayed in the video frame, and view a video displaying an AR special effect corresponding to each special effect display identifier.
It will be appreciated that different types of implant special effects may be distinguished by different types of special effect presentation indicia.
The embodiment of the present specification further provides a corresponding free viewpoint video reconstruction apparatus, referring to a schematic structural diagram of the free viewpoint video reconstruction apparatus shown in fig. 18, where the free viewpoint video reconstruction apparatus 180 may include: a video frame acquisition unit 181, a target video frame determination unit 182, a virtual viewpoint texture map synthesis unit 183, a virtual viewpoint background texture map synthesis unit 184, and a post-processing unit 185, wherein:
the video frame acquiring unit 181 is adapted to acquire a free viewpoint video frame, where the video frame includes original texture maps of a plurality of original viewpoints and original depth maps of corresponding viewpoints that are synchronized;
the target video frame determining unit 182 is adapted to obtain a target video frame corresponding to the virtual viewpoint;
the virtual viewpoint texture map synthesizing unit 183 is adapted to synthesize a texture map of the virtual viewpoint by using original texture maps of a plurality of original viewpoints in the target video frame and corresponding original depth maps;
the virtual viewpoint background texture map synthesizing unit 184 is adapted to obtain a background texture map and a background depth map of a viewpoint corresponding to the target video frame, and obtain a background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint;
the post-processing unit 185 is adapted to perform hole filling post-processing on a hole area in the texture map of the virtual viewpoint by using the background texture map of the virtual viewpoint, so as to obtain a reconstructed image of the virtual viewpoint.
With the free viewpoint video reconstruction apparatus 180, a complete background texture map of the virtual viewpoint is obtained by reconstruction and used for hole filling post-processing on the synthesized texture map of the virtual viewpoint corresponding to the target video frame, so that the quality of the reconstructed image of the virtual viewpoint can be improved.
In a specific implementation, each unit in the free viewpoint video reconstruction apparatus may be implemented with reference to the specific method examples and manners of the corresponding steps in the foregoing free viewpoint video reconstruction method; for details, reference may be made to the foregoing embodiments.
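Purely as an illustration of how the five units might chain together, and not as the apparatus itself, the sketch below wires five caller-supplied callables in the order of units 181 through 185; all names are hypothetical.

```python
class FreeViewpointVideoReconstructor:
    """Sketch wiring of apparatus 180; each callable stands in for a unit."""

    def __init__(self, acquire, pick_target, synth_texture,
                 synth_background, postprocess):
        self.acquire = acquire                    # unit 181
        self.pick_target = pick_target            # unit 182
        self.synth_texture = synth_texture        # unit 183
        self.synth_background = synth_background  # unit 184
        self.postprocess = postprocess            # unit 185

    def reconstruct(self, stream, virtual_viewpoint):
        frame = self.acquire(stream)
        target = self.pick_target(frame, virtual_viewpoint)
        tex = self.synth_texture(target, virtual_viewpoint)
        bg_tex, bg_depth = self.synth_background(target, virtual_viewpoint)
        return self.postprocess(tex, bg_tex)
```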
In some embodiments of the present specification, as shown in fig. 19, the free viewpoint video playing processing device 190 may include: a virtual viewpoint determining unit 191, a target video frame determining unit 192, a virtual viewpoint texture map synthesizing unit 193, a virtual viewpoint background texture map synthesizing unit 194, and a post-processing unit 195, wherein:
a virtual viewpoint determining unit 191 adapted to determine a virtual viewpoint;
a target video frame determination unit 192 adapted to determine a target video frame from the virtual viewpoint;
a virtual viewpoint texture map synthesizing unit 193 adapted to synthesize a texture map of the virtual viewpoint by using original texture maps of a plurality of original viewpoints and corresponding original depth maps in the target video frame;
the virtual viewpoint background texture map synthesizing unit 194 is adapted to obtain a background texture map and a background depth map of a viewpoint corresponding to the target video frame, and obtain a background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint;
the post-processing unit 195 is adapted to perform hole filling post-processing on a hole area in the texture map of the virtual viewpoint by using the background texture map of the virtual viewpoint, so as to obtain a reconstructed image of the virtual viewpoint.
With the free viewpoint video playing processing device 190, a complete background texture map of the virtual viewpoint is obtained by reconstruction and used for hole filling post-processing on the synthesized texture map of the virtual viewpoint corresponding to the target video frame, so that the quality of the reconstructed image of the virtual viewpoint can be improved.
In a specific implementation, each unit in the free viewpoint video playing processing apparatus may be implemented with reference to the specific method examples and manners of the corresponding steps in the foregoing method embodiments; for details, refer to the foregoing embodiments.
In this embodiment, each specific unit of the virtual viewpoint video reconstruction apparatus, the virtual viewpoint video playing processing apparatus, and the like may be implemented by software, hardware, or a combination of software and hardware.
Referring to the schematic structural diagram of the electronic device shown in fig. 20, in some embodiments of the present specification, an electronic device 200 may include a memory 201 and a processor 202, where the memory 201 stores computer instructions executable on the processor 202, and the processor, when executing the computer instructions, performs the steps of the method of any one of the foregoing embodiments.
The electronic device may also include other electronic components or assemblies, depending on where the electronic device is located in the overall video processing system.
Referring to the schematic structural diagram of another electronic device shown in fig. 21, in other embodiments of the present description, an electronic device 210 may include a communication component 211, a processor 212, and a display component 213, where:
the communication component 211 is adapted to obtain a free viewpoint video;
the processor 212, adapted to perform the steps of the method according to any of the previous embodiments;
the display component 213 is adapted to display the reconstructed image of the virtual viewpoint processed by the processor.
In particular implementations, the display component 213 may be specifically one or more of a display, a touch screen, a projector, and the like.
In a specific implementation, the communication component 211, the display component 213, and the like may be components disposed inside the electronic device 210, or may be external devices connected through an expansion component such as an expansion interface, a docking station, an expansion line, and the like.
In a specific implementation, the processor 212 may be implemented cooperatively by any one or more of a Central Processing Unit (CPU) (e.g., a single-core processor, a multi-core processor), a CPU group, a Graphics Processing Unit (GPU), an Artificial Intelligence (AI) chip, a Field Programmable Gate Array (FPGA) chip, and the like.
In some embodiments of the present description, communication between the memory, the processor, the communication component, and the display component in the electronic device may be over a bus network.
For a better understanding and implementation by those skilled in the art, a specific application scenario is described below. Fig. 22 shows a schematic structural diagram of a video processing system in this application scenario, namely the arrangement of a data processing system for a basketball game. The video processing system A0 includes an acquisition array A1 composed of multiple acquisition devices, a data processing device A2, a server cluster A3 in the cloud, a playback control device A4, a playback terminal A5, and an interaction terminal A6.
Referring to fig. 22, the basketball frame on the left side is taken as the core viewpoint, and a sector area centered on the core viewpoint and lying in the same plane is taken as the preset multi-angle free viewing angle range. The acquisition devices in the acquisition array A1 can be arranged fan-wise at different positions of the field acquisition area according to this preset range, and can synchronously acquire video data streams from their respective angles in real time.
In a specific implementation, the acquisition devices may also be located in the ceiling area of the basketball venue, on a basketball stand, or the like. The acquisition devices may be arranged along a straight line, a fan shape, an arc, a circle, or an irregular shape; the specific arrangement may be set according to one or more factors such as the field environment, the number of acquisition devices, the characteristics of the acquisition devices, and the imaging effect requirements. An acquisition device may be any device with a camera function, such as an ordinary camera, a mobile phone, or a professional camera.
In order not to affect the operation of the acquisition devices, the data processing device A2 may be located in a non-acquisition area of the field and may be regarded as a field server. The data processing device A2 may send a stream pulling instruction to each acquisition device in the acquisition array A1 through a wireless local area network, and each acquisition device transmits its video data stream to the data processing device A2 in real time based on that instruction, for example through the switch A7. The acquisition array A1 and the switch A7 together form the acquisition system.
When the data processing device A2 receives a video frame capture instruction, it captures the synchronized frame images at the specified frame time from the multiple received video data streams, and uploads the obtained synchronized video frames of the specified frame time to the server cluster A3 in the cloud.
Correspondingly, the cloud server cluster A3 uses the original texture maps of the received synchronized video frames as an image combination, determines the parameter data corresponding to the image combination and the original depth maps corresponding to the original texture maps in the combination, and then renders images for the determined virtual viewpoints based on the parameter data, the pixel data of the texture maps, and the depth data of the corresponding depth maps, so as to obtain the corresponding multi-angle free view video data.
The servers may be placed in the cloud, and in order to process data in parallel more quickly, the cloud server cluster A3 may be composed of several different servers or server groups according to the kinds of data to be processed.
For example, the cloud server cluster A3 may include a first cloud server A31, a second cloud server A32, a third cloud server A33, and a fourth cloud server A34. The first cloud server A31 may be configured to determine the parameter data corresponding to the image combination; the second cloud server A32 may be configured to estimate the depth map of the original texture map of each viewpoint in the image combination and perform depth map correction processing; the third cloud server A33 may perform frame image reconstruction using a depth image based rendering (DIBR) algorithm, based on the parameter data corresponding to the image combination, the texture maps and depth maps of the image combination, and the position information of the virtual viewpoint, to obtain the image of the virtual viewpoint; and the fourth cloud server A34 may be configured to generate the free viewpoint video (multi-angle free view video).
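A minimal sketch of the DIBR step performed by a server such as A33 is given below, under the usual pinhole assumptions: each source pixel is back-projected with its depth, transformed into the virtual camera, and splatted with a z-buffer; pixels nothing maps to form the hole mask handled by the post-processing described earlier. The function name and calling convention are assumptions, not the patent's algorithm text.

```python
import numpy as np

def dibr_forward_warp(src_tex, src_depth, K_src, K_dst, R, t):
    """Depth-image-based rendering by forward warping one source view.

    (R, t) maps source-camera coordinates to virtual-camera coordinates.
    Returns the warped texture and a boolean hole mask."""
    h, w = src_depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(np.float64)
    pts = (pix @ np.linalg.inv(K_src).T) * src_depth.reshape(-1, 1)  # back-project
    pts = pts @ R.T + t                                              # to virtual cam
    uvw = pts @ K_dst.T
    uu = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    vv = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    out = np.zeros_like(src_tex)
    zbuf = np.full((h, w), np.inf)
    ok = (uvw[:, 2] > 0) & (uu >= 0) & (uu < w) & (vv >= 0) & (vv < h)
    colors = src_tex.reshape(-1, src_tex.shape[-1])
    for i in np.flatnonzero(ok):               # z-buffered splatting
        if uvw[i, 2] < zbuf[vv[i], uu[i]]:
            zbuf[vv[i], uu[i]] = uvw[i, 2]
            out[vv[i], uu[i]] = colors[i]
    return out, np.isinf(zbuf)                 # holes where nothing mapped
```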
It can be understood that the first cloud server a31, the second cloud server a32, the third cloud server a33, and the fourth cloud server a34 may also be a server group composed of a server array or a server sub-cluster, which is not limited in the embodiment of the present invention.
Then, the playback control device A4 may insert the received free viewpoint video frames into the video stream to be played, and the playback terminal A5 receives the video stream from the playback control device A4 and plays it in real time. The playback control device A4 may be a manual playback control device or a virtual playback control device. In a specific implementation, a dedicated server capable of automatically switching video streams may be set up as a virtual playback control device to control the data sources. A broadcast directing device, such as a director station, may serve as one of the playback control devices A4 in embodiments of the present invention.
The interaction terminal A6 may play the free viewpoint video based on user interaction.
It can be understood that each acquisition device in the acquisition array A1 and the data processing device A2 may be connected through the switch A7 and/or a local area network; the numbers of playback terminals A5 and interaction terminals A6 may each be one or more; the playback terminal A5 and the interaction terminal A6 may be the same terminal device; the data processing device A2 may be placed in the field non-acquisition area or in the cloud according to the specific scenario; and the server cluster A3 and the playback control device A4 may be placed in the field non-acquisition area, in the cloud, or on the terminal access side according to the specific scenario. This embodiment is not intended to limit the specific implementation and protection scope of the present invention.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which computer instructions are stored, where the computer instructions, when executed, perform the steps of the method according to any one of the foregoing embodiments, which may be specifically described with reference to the foregoing embodiments, and are not described herein again.
In particular implementations, the computer-readable storage medium may be a variety of suitable readable storage media such as an optical disk, a mechanical hard disk, a solid state disk, and so on.
Although the embodiments of the present invention are disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected by one skilled in the art without departing from the spirit and scope of the embodiments of the invention as defined in the appended claims.

Claims (17)

1. A free viewpoint video reconstruction method, comprising:
acquiring a free viewpoint video frame, wherein the video frame comprises original texture maps of a plurality of synchronous original viewpoints and original depth maps of corresponding viewpoints;
acquiring a target video frame corresponding to the virtual viewpoint;
synthesizing texture maps of the virtual viewpoints by adopting original texture maps and corresponding original depth maps of a plurality of original viewpoints in the target video frame;
acquiring a background texture map and a background depth map of a corresponding viewpoint of the target video frame, and acquiring a background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint;
and carrying out hole filling post-processing on a hole area in the texture map of the virtual viewpoint by adopting the background texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
2. The method of claim 1, wherein the obtaining the background texture map and the background depth map of the corresponding viewpoint of the target video frame comprises:
selecting a reference texture map sequence and a reference depth map sequence of a corresponding viewpoint of the target video frame;
and respectively carrying out time domain filtering on the reference texture map sequence and the reference depth map sequence to obtain a background texture map and a background depth map of a viewpoint corresponding to the target video frame.
3. The method of claim 2, wherein the temporally filtering the reference texture map sequence and the reference depth map sequence to obtain a background texture map and a background depth map of a corresponding viewpoint of the target video frame comprises:
and respectively carrying out time domain median filtering on the pixels in the reference texture map sequence and the reference depth map sequence to obtain a background texture map and a background depth map of a corresponding viewpoint of the target video frame.
4. The method of claim 1, wherein the synthesizing the texture map of the virtual viewpoint using the original texture maps and the corresponding original depth maps of the plurality of original viewpoints in the target video frame comprises:
based on the virtual viewpoint, selecting an original texture map and a corresponding original depth map of a corresponding original viewpoint in the target video frame according to a preset rule;
and synthesizing the texture map of the virtual viewpoint by adopting the selected original texture map of the corresponding original viewpoint and the corresponding original depth map.
5. The method of claim 4, wherein the obtaining the background texture map and the background depth map of the corresponding viewpoint of the target video frame comprises:
acquiring a reference texture map sequence and a reference depth map sequence corresponding to the selected original viewpoint;
and respectively carrying out time domain filtering on the reference texture map sequence and the reference depth map sequence to obtain a background texture map and a background depth map of the selected corresponding original viewpoint.
6. The method of claim 1, wherein the obtaining the background texture map and the background depth map of the corresponding viewpoint of the target video frame comprises:
pre-acquiring a background texture map, containing no foreground object, of the corresponding viewpoint in the field of view targeted by the target video frame;
and obtaining a background depth map of the corresponding viewpoint according to the background texture map, containing no foreground object, of the corresponding viewpoint in the field of view targeted by the target video frame.
7. The method according to any one of claims 1 to 6, wherein the obtaining the reconstructed image of the virtual viewpoint by performing hole filling post-processing on a hole area in the texture map of the virtual viewpoint by using the background texture map of the virtual viewpoint includes:
and performing interpolation processing on the hole region in the texture map of the virtual viewpoint by using the background texture map of the virtual viewpoint with a joint bilateral filtering method, to obtain the reconstructed image of the virtual viewpoint.
8. The method according to any one of claims 1 to 6, wherein after the hole filling post-processing is performed on the hole region in the texture map of the virtual viewpoint, before obtaining the reconstructed image of the virtual viewpoint, further comprising:
and filtering the foreground edge in the texture map of the virtual viewpoint obtained after the hole filling post-processing to obtain a reconstructed image of the virtual viewpoint.
9. A free viewpoint video playing processing method comprises the following steps:
determining a virtual viewpoint, and determining a target video frame according to the virtual viewpoint;
synthesizing texture maps of the virtual viewpoints by adopting original texture maps and corresponding original depth maps of a plurality of original viewpoints in the target video frame;
acquiring a background texture map and a background depth map of a corresponding viewpoint of the target video frame, and acquiring a background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint;
and carrying out hole filling post-processing on a hole area in the texture map of the virtual viewpoint by adopting the background texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
10. The method of claim 9, wherein the determining a virtual viewpoint comprises at least one of:
determining a virtual viewpoint in response to the user interaction behavior;
determining the virtual viewpoint based on virtual viewpoint position information contained in a video stream.
11. The method of claim 9, further comprising:
acquiring a virtual rendering target object in the reconstructed image of the virtual viewpoint;
acquiring a virtual information image generated based on augmented reality special effect input data of the virtual rendering target object;
and synthesizing the virtual information image and the reconstructed image of the virtual viewpoint and displaying the synthesized image.
12. The method of claim 11, wherein said obtaining a virtual information image generated based on augmented reality effect input data of the virtual rendering target object comprises:
and obtaining a virtual information image matched with the position of the virtual rendering target object according to the position of the virtual rendering target object in the reconstructed image of the virtual viewpoint obtained by three-dimensional calibration.
13. A free viewpoint video reconstruction apparatus, comprising:
the video frame acquisition unit is suitable for acquiring a free viewpoint video frame, and the video frame comprises original texture maps of a plurality of synchronous original viewpoints and original depth maps of corresponding viewpoints;
the target video frame determining unit is suitable for acquiring a target video frame corresponding to the virtual viewpoint;
the virtual viewpoint texture map synthesizing unit is suitable for synthesizing the texture map of the virtual viewpoint by adopting the original texture maps and the corresponding original depth maps of a plurality of original viewpoints in the target video frame;
the virtual viewpoint background texture map synthesizing unit is suitable for acquiring a background texture map and a background depth map of a viewpoint corresponding to the target video frame, and acquiring the background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint;
and the post-processing unit is suitable for carrying out hole filling post-processing on a hole area in the texture map of the virtual viewpoint by adopting the background texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
14. A free viewpoint video playback processing apparatus, comprising:
a virtual viewpoint determining unit adapted to determine a virtual viewpoint;
a target video frame determination unit adapted to determine a target video frame from the virtual viewpoint;
the virtual viewpoint texture map synthesizing unit is suitable for synthesizing the texture map of the virtual viewpoint by adopting the original texture maps and the corresponding original depth maps of a plurality of original viewpoints in the target video frame;
the virtual viewpoint background texture map synthesizing unit is suitable for acquiring a background texture map and a background depth map of a viewpoint corresponding to the target video frame, and acquiring the background texture map of the virtual viewpoint according to the background texture map and the background depth map of the corresponding viewpoint;
and the post-processing unit is suitable for carrying out hole filling post-processing on a hole area in the texture map of the virtual viewpoint by adopting the background texture map of the virtual viewpoint to obtain a reconstructed image of the virtual viewpoint.
15. An electronic device comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the steps of the method of any of claims 1 to 8 or claims 9 to 12.
16. An electronic device, comprising: a communication component, a processor, and a display component, wherein:
the communication component is suitable for acquiring a free viewpoint video;
the processor adapted to perform the steps of the method of any one of claims 1 to 8 or claims 9 to 12;
and the display component is suitable for displaying the reconstructed image of the virtual viewpoint obtained after the processing of the processor.
17. A computer readable storage medium having computer instructions stored thereon, wherein the computer instructions when executed perform the steps of the method of any of claims 1 to 8 or claims 9 to 12.