WO2024159553A1 - Decoding method for volumetric video, and storage medium and electronic device - Google Patents

Decoding method for volumetric video, and storage medium and electronic device

Info

Publication number
WO2024159553A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
decoded
features
processing
frame
Prior art date
Application number
PCT/CN2023/075406
Other languages
French (fr)
Chinese (zh)
Inventor
张煜
岳鑫
邵志兢
孙伟
Original Assignee
珠海普罗米修斯视觉技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 珠海普罗米修斯视觉技术有限公司 filed Critical 珠海普罗米修斯视觉技术有限公司
Publication of WO2024159553A1


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding

Definitions

  • the present application relates to the field of computer technology, and in particular to a volumetric video decoding method, a storage medium, and an electronic device.
  • Volumetric video is a model sequence of continuous three-dimensional models. Volumetric video usually includes a large number of three-dimensional models. There is usually a need to encode and decode volumetric video. At present, in the related technology, there is a solution that encodes the three-dimensional model in the volumetric video into encoded data such as vertex data and facet data, and decodes the encoded data through a large number of complex decoding calculations to play the volumetric video.
  • the decoding of volumetric video has the problem of low decoding efficiency and poor decoding effect.
  • the embodiment of the present application provides a solution that can improve the decoding efficiency of volumetric video and improve the decoding effect.
  • a method for decoding a volumetric video includes: obtaining multiple frames of images to be decoded corresponding to the volumetric video; extracting global features corresponding to each frame of the image to be decoded; performing depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; performing rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth to obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
  • the extracting of global features corresponding to each frame of the image to be decoded includes: for each frame of the image to be decoded, performing feature extraction processing on the image to be decoded to obtain image features corresponding to the image to be decoded; performing feature fusion processing on the image features corresponding to the image to be decoded to obtain global features corresponding to the image to be decoded.
  • the deconvolution features output by the deconvolution processing are spliced with the image features of the same level to obtain the spliced features for convolution processing, including: calculating the attention distribution for the image features of the same level, and calculating the weighted average based on the attention distribution to obtain the weighted average features of the same level; splicing the deconvolution features output by the deconvolution processing with the weighted average features of the same level to obtain the spliced features for convolution processing.
  • the depth analysis processing is performed based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded, including: inputting the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
  • the global features corresponding to each frame of the image to be decoded are respectively input into a recurrent neural network for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network, including: the global features corresponding to each frame of the image to be decoded are respectively input into a gated recurrent unit for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
  • rendering processing is performed based on each frame of the image to be decoded and the corresponding global features and the depth to obtain a decoded multi-frame three-dimensional model, including: inputting each frame of the image to be decoded and the corresponding global features and the depth into a convolutional neural network for rendering processing to obtain a three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.
  • the method includes: serializing the decoded multi-frame three-dimensional model in the order corresponding to the image to be decoded corresponding to each frame of the three-dimensional model to obtain the volumetric video.
  • a decoding device for volumetric video includes: an acquisition module, which is used to acquire multiple frames of images to be decoded corresponding to the volumetric video; an extraction module, which is used to extract global features corresponding to each frame of the image to be decoded; an analysis module, which is used to perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded, and obtain the depth corresponding to each frame of the image to be decoded; a rendering module, which is used to perform rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth, and obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
  • the extraction module is used to perform feature extraction processing on the image to be decoded for each frame of the image to be decoded to obtain image features corresponding to the image to be decoded; and perform feature fusion processing on the image features corresponding to the image to be decoded to obtain global features corresponding to the image to be decoded.
  • the extraction module is used to: perform multi-level encoding processing on the image to be decoded to obtain image features output by the encoding processing at each level; wherein the encoding processing at each level includes convolution processing and maximum pooling processing performed in sequence; the image features output by the encoding processing at the previous level are used for the encoding processing at the next level; the extraction module is also used to: perform multi-level decoding processing on the features to be fused to obtain fused features output by the decoding processing at each level; wherein the decoding processing at each level includes deconvolution processing, splicing processing and convolution processing performed in sequence; the fused features output by the decoding processing at the previous level are used for the decoding processing at the next level; the splicing processing at each level includes: splicing the deconvolution features output by the deconvolution processing with the image features of the same level to obtain splicing features for convolution processing; and obtaining the global features corresponding to the image to be decoded based on the fused features output by the last level of decoding processing.
  • the extraction module is also used to: calculate the attention distribution of the image features at the same level, and calculate the weighted average based on the attention distribution to obtain the weighted average features at the same level; splice the deconvolution features output by the deconvolution processing with the weighted average features at the same level to obtain the spliced features for convolution processing.
  • the analysis module is used to: input the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis and processing, and obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
  • the analysis module is used to: input the global features corresponding to each frame of the image to be decoded into the gated recurrent unit for depth analysis processing, and obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
  • the rendering module is used to: input each frame of the image to be decoded and the corresponding global features and the depth into a convolutional neural network for rendering processing, so as to obtain a three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.
  • the device further includes a generation module, which is used to: serialize the decoded multi-frame three-dimensional models in the order corresponding to the image to be decoded corresponding to each frame of the three-dimensional model to obtain the volumetric video.
  • a storage medium stores a computer program thereon, and when the computer program is executed by a processor of a computer, the computer executes the method described in the embodiment of the present application.
  • an electronic device may include: a memory storing a computer program; and a processor reading the computer program stored in the memory to execute the method described in the embodiment of the present application.
  • a computer program product or a computer program includes a computer instruction stored in a computer-readable storage medium.
  • a processor of a computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the method provided in various optional implementations described in the embodiments of the present application.
  • multiple frames of images to be decoded corresponding to the volumetric video are obtained; global features corresponding to each frame of the image to be decoded are extracted; depth analysis processing is performed based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; rendering processing is performed based on each frame of the image to be decoded and the corresponding global features and the depth to obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
  • the volumetric video is provided in the form of multiple frames of images to be decoded.
  • the global features of each frame of the image to be decoded are enhanced, and the depth is analyzed through the global features.
  • the three-dimensional model is rendered by combining the image to be decoded, the global features and the depth. The overall decoding process is efficient and the decoded three-dimensional model is highly reliable, which can effectively improve the decoding efficiency of the volumetric video and improve the decoding effect.
  • FIG. 1 shows a schematic diagram of a system to which an embodiment of the present application can be applied.
  • FIG. 2 shows a flow chart of a method for decoding volumetric video according to an embodiment of the present application.
  • FIG. 3 shows a block diagram of a volumetric video decoding device according to another embodiment of the present application.
  • FIG. 4 shows a block diagram of an electronic device according to an embodiment of the present application.
  • FIG. 1 shows a schematic diagram of a system 100 to which an embodiment of the present application can be applied.
  • the system 100 may include a server 101 and a terminal 102 .
  • Server 101 can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers. It can also be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, as well as big data and artificial intelligence platforms.
  • the terminal 102 may be any device, including but not limited to mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, VR/AR devices, smart watches, etc.
  • the server 101 or the terminal 102 may be a node device in a blockchain network or a map vehicle networking platform.
  • the server 101 or the terminal 102 may: obtain multiple frames of images to be decoded corresponding to the volumetric video; extract global features corresponding to each frame of the image to be decoded; perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; perform rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth respectively to obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
  • Fig. 2 schematically shows a flow chart of a method for decoding volumetric video according to an embodiment of the present application.
  • the method for decoding volumetric video may be executed by any device, such as the server 101 or the terminal 102 shown in Fig. 1 .
  • the volumetric video decoding method may include steps S210 to S240 .
  • Step S210 obtaining multiple frames of images to be decoded corresponding to the volumetric video; step S220, extracting global features corresponding to each frame of the image to be decoded; step S230, performing depth analysis processing based on the global features corresponding to each frame of the image to be decoded, and obtaining the depth corresponding to each frame of the image to be decoded; step S240, performing rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth, and obtaining a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
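  • For orientation only, the following is a minimal sketch (not part of the disclosure) of how steps S210 to S240 could be wired together in code; the three helper callables are hypothetical names standing in for the feature-extraction, depth-analysis and rendering networks described below.

```python
# Hypothetical sketch of steps S210-S240. The three callables are placeholders for the
# networks discussed later in this description; they are not defined by the application.
def decode_volumetric_video(frames_to_decode, extract_global_features, estimate_depth, render_model):
    models = []
    for image in frames_to_decode:                        # S210: frames to be decoded, already obtained
        feats = extract_global_features(image)            # S220: global features for this frame
        depth = estimate_depth(feats)                     # S230: depth analysis from the global features
        models.append(render_model(image, feats, depth))  # S240: render one 3D model per frame
    return models  # connected in frame order afterwards to regenerate the volumetric video
```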
  • Volumetric video is a model sequence of multi-frame three-dimensional models.
  • the three-dimensional model can be a three-dimensional model corresponding to a person, an animal, etc.
  • Volumetric video can demonstrate the behavior of an object (such as dancing) through continuous multi-frame three-dimensional models.
  • Each frame of the three-dimensional model can be reconstructed through multiple two-dimensional images from multiple perspectives.
  • the created volumetric video may be encoded in advance into corresponding multiple frames of images to be decoded.
  • the volumetric video may be encoded into corresponding multiple frames of images to be decoded in a manner that: the volumetric video may be encoded into a multi-perspective color image for reconstructing each frame of the three-dimensional model in the volumetric video; or the volumetric video may be encoded into a multi-perspective model image captured from different angles for each frame of the three-dimensional model.
  • Each frame of the image to be decoded may correspond to a frame of the three-dimensional model, and each frame of the image to be decoded may include at least one image.
  • the global features containing global information corresponding to each frame of the image to be decoded can be extracted.
  • a depth analysis process is performed to obtain the depth corresponding to each frame of the image to be decoded.
  • rendering is performed based on each frame of the image to be decoded and the global features and depth corresponding to the image to be decoded, so as to obtain a decoded multi-frame 3D model.
  • the multi-frame 3D model can be serialized to obtain the restored/decoded volumetric video.
  • the volumetric video is provided in the form of multiple frames of images to be decoded.
  • the global features of each frame of the image to be decoded are enhanced, and the depth is analyzed through the global features.
  • the three-dimensional model is rendered by combining the image to be decoded, the global features and the depth. The overall decoding process is efficient and the decoded three-dimensional model is highly reliable, which can effectively improve the decoding efficiency of the volumetric video and improve the decoding effect.
  • step S220 extracting global features corresponding to each frame of the image to be decoded, includes: for each frame of the image to be decoded, performing feature extraction processing on the image to be decoded to obtain image features corresponding to the image to be decoded; performing feature fusion processing on the image features corresponding to the image to be decoded to obtain global features corresponding to the image to be decoded.
  • the image to be decoded can be subjected to feature extraction processing through a feature extraction network (such as a convolutional network) to obtain image features corresponding to the image to be decoded. Furthermore, the image features corresponding to the image to be decoded can be subjected to feature fusion processing through a feature fusion network (such as a fully connected network) to obtain global features corresponding to the image to be decoded.
  • step S220, extracting the global features corresponding to each frame of the image to be decoded, includes: for each frame of the image to be decoded, calculating the histogram corresponding to the image to be decoded as its global feature through a histogram calculation function (such as the histogram calculation function in OpenCV).
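  • As a hedged illustration of this histogram-based alternative, a per-frame global feature could be computed with OpenCV's calcHist; the use of three color channels and 64 bins below is an assumption, not something specified by the application.

```python
import cv2
import numpy as np

def histogram_global_feature(image_bgr, bins=64):
    # One histogram per color channel, concatenated into a single global feature vector
    # (channel count and bin size are illustrative choices).
    hists = [cv2.calcHist([image_bgr], [c], None, [bins], [0, 256]) for c in range(3)]
    feature = np.concatenate(hists).ravel().astype(np.float32)
    return feature / (feature.sum() + 1e-8)  # normalized so frames of different sizes are comparable
```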
  • a multi-level encoding process is performed on the image to be decoded through a feature extraction network (encoder).
  • the feature extraction network may include a multi-layer cascaded extraction network, and each level of the extraction network can output the image features of the corresponding level through the encoding process.
  • the encoding process performed in the extraction network of each level specifically includes convolution processing and maximum pooling processing performed in sequence.
  • the encoding process performed in the extraction network of the first level may include: firstly performing convolution processing on the image to be decoded to obtain convolution features, and then performing maximum pooling processing on the convolution features to obtain the image features output by the encoding process of the first level. Further, the image features output by the encoding process of the previous level are used for the encoding process of the next level.
  • the image features of the first level are used as the input features of the extraction network of the second level.
  • the image features of the first level are firstly convolved to obtain convolution features, and then the convolution features are subjected to maximum pooling processing to obtain the image features output by the encoding process of the second level.
  • the features to be fused are subjected to multi-level decoding processing through a feature fusion network (decoder), and the feature fusion network (decoder) may include a multi-layer cascaded fusion network, and each level of the fusion network may output the fusion features of the corresponding level through decoding processing.
  • the decoding processing performed in the fusion network of each level specifically includes deconvolution processing, splicing processing and convolution processing performed in sequence.
  • the decoding processing in the fusion network of the first level may include: first performing deconvolution processing on the features to be fused to obtain deconvolution features, then performing splicing processing on the deconvolution features to obtain spliced features, and then performing convolution processing on the spliced features to obtain the fused features output by the decoding processing of the first level.
  • the fusion features output by the decoding processing of the previous level are used for the decoding processing of the next level.
  • the fusion features of the first level are used as the input features of the fusion network of the second level, and the fusion features of the first level are first subjected to deconvolution processing, splicing processing and convolution processing in sequence in the fusion network of the second level.
  • the splicing processing at each level includes: splicing the deconvolution features output by the deconvolution processing with the image features at the same level to obtain the splicing features used for the convolution processing.
  • the splicing processing at the first level includes: splicing the deconvolution features output by the deconvolution processing at the first level with the image features output by the encoding processing at the first level to obtain the splicing features used for the convolution processing at the first level.
  • the global features corresponding to the image to be decoded are obtained.
  • the fused features output by the last level of decoding processing can be used as the global features corresponding to the image to be decoded, or the fused features output by the last level of decoding processing can be reduced in dimension, and the reduced-dimensional features can be used as the global features corresponding to the image to be decoded.
  • the feature extraction network is a UNet network, which includes a feature extraction network (encoder) on the left and a feature fusion network (decoder) on the right.
  • the feature extraction network (encoder) includes 4 layers of extraction networks, and the feature fusion network (decoder) also includes 4 layers of fusion networks.
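  • The PyTorch sketch below shows one possible shape of such a four-level UNet-style encoder/decoder, with convolution plus max pooling at each encoding level and deconvolution plus same-level splicing plus convolution at each decoding level; the channel widths, activations and final output choice are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class UNetGlobalFeatures(nn.Module):
    """Illustrative 4-level encoder/decoder; channel widths and activations are assumptions."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chs:  # each encoding level: convolution, then max pooling (pooling applied in forward)
            self.encoders.append(nn.Sequential(nn.Conv2d(prev, c, 3, padding=1), nn.ReLU()))
            prev = c
        self.pool = nn.MaxPool2d(2)
        self.upsamples = nn.ModuleList()
        self.decoders = nn.ModuleList()
        for c in reversed(chs):  # each decoding level: deconvolution, splice, convolution
            self.upsamples.append(nn.ConvTranspose2d(prev, c, kernel_size=2, stride=2))
            self.decoders.append(nn.Sequential(nn.Conv2d(c * 2, c, 3, padding=1), nn.ReLU()))
            prev = c

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)        # convolution processing
            skips.append(x)   # image features of this level, kept for the same-level splice
            x = self.pool(x)  # maximum pooling processing
        for up, dec, skip in zip(self.upsamples, self.decoders, reversed(skips)):
            x = up(x)                        # deconvolution processing
            x = torch.cat([x, skip], dim=1)  # splicing with the same-level image features
            x = dec(x)                       # convolution processing
        return x  # fused features of the last decoding level, taken here as the global features

# Example (illustrative): feats = UNetGlobalFeatures()(torch.randn(1, 3, 256, 256))
# The input height and width must be divisible by 16 for the four pooling steps.
```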
  • the step of splicing the deconvolution features output by the deconvolution process with the image features at the same level to obtain the spliced features for performing the convolution process includes: calculating the attention distribution for the image features at the same level, and calculating the weighted average according to the attention distribution to obtain the weighted average features at the same level; and splicing the deconvolution features output by the deconvolution processing with the weighted average features at the same level to obtain the spliced features for convolution processing.
  • the weighted average features of the image features at the same level are further calculated through the attention mechanism, and then the deconvolution features output by the deconvolution process are spliced with the weighted average features to obtain the spliced features for convolution processing.
  • the weighted average features of the image features output by the encoding process of the first level are calculated through the attention mechanism, and then the deconvolution features output by the deconvolution process of the first level are spliced with the weighted average features of the first level to obtain the spliced features for convolution processing in the first level.
  • the extracted global features can contain more global information, further improving the decoding effect of the three-dimensional model as a whole.
  • the weighted average features of the image features at the same level are calculated through the attention mechanism, which specifically includes: calculating the attention distribution of the image features at the same level, and calculating the weighted average according to the attention distribution to obtain the weighted average features at the same level.
  • the weighted average features can be calculated through the soft attention mechanism.
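  • A hedged sketch of this attention-weighted splice, using a soft (softmax-based) attention distribution over the spatial positions of the same-level image features; the 1x1-convolution scoring head and the way the weighted average is broadcast back to the spatial grid for splicing are assumptions not fixed by the application.

```python
import torch
import torch.nn as nn

class AttentionSplice(nn.Module):
    """Splice deconvolution features with attention-weighted same-level image features.
    The scoring head and soft (softmax) attention are illustrative assumptions."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # produces one attention score per position

    def forward(self, deconv_feat, skip_feat):
        b, c, h, w = skip_feat.shape
        # Attention distribution over the spatial positions of the same-level image features.
        scores = self.score(skip_feat).view(b, 1, h * w)
        attn = torch.softmax(scores, dim=-1).view(b, 1, h, w)
        # Weighted average of the image features under that distribution.
        weighted_avg = (skip_feat * attn).sum(dim=(2, 3), keepdim=True)          # (b, c, 1, 1)
        # Broadcast back to the spatial grid so it can be spliced with the deconvolution features.
        weighted_avg = weighted_avg.expand(-1, -1, deconv_feat.shape[2], deconv_feat.shape[3])
        return torch.cat([deconv_feat, weighted_avg], dim=1)  # spliced features for the next convolution
```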
  • the depth analysis processing is performed based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded, including: inputting the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis processing, and obtaining the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
  • a recurrent neural network performs depth analysis based on the global features corresponding to each frame of the image to be decoded, and obtains the depth corresponding to each frame of the image to be decoded.
  • the sequence of global features corresponding to the image to be decoded can be input into the recurrent neural network for depth analysis, and the depth corresponding to the image to be decoded output by the recurrent neural network can be obtained.
  • the parameters in the recurrent neural network are shared at different times, and the accurate depth can be output through analysis and processing.
  • the recurrent neural network can specifically include a long short-term memory network (LSTM) or a gated recurrent unit (GRU), etc.
  • the global features corresponding to each frame of the image to be decoded are respectively input into a recurrent neural network for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network, including: the global features corresponding to each frame of the image to be decoded are respectively input into a gated recurrent unit for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
  • a gated recurrent unit is specifically used to perform depth analysis based on the global features corresponding to each frame of the image to be decoded, and the depth corresponding to each frame of the image to be decoded is obtained.
  • the sequence of global features corresponding to the image to be decoded is input into the gated recurrent unit (GRU) for depth analysis, and the depth corresponding to the image to be decoded output by the gated recurrent unit (GRU) is obtained.
  • the gated recurrent unit (GRU) is a gated recurrent neural network that can efficiently analyze and obtain reliable depth.
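  • A minimal sketch of the GRU-based depth analysis, assuming the per-frame global features are pooled into a vector before entering the gated recurrent unit; the pooling step, hidden size and one-depth-output-per-frame head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GRUDepth(nn.Module):
    """Per-frame depth analysis over the sequence of global features (illustrative sizes)."""
    def __init__(self, feat_dim=32, hidden=128, depth_dim=1):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, depth_dim)

    def forward(self, global_feats):               # (batch, frames, channels, H, W)
        pooled = global_feats.mean(dim=(3, 4))     # assumption: spatial average to one vector per frame
        out, _ = self.gru(pooled)                  # recurrent parameters are shared across time steps
        return self.head(out)                      # one depth value (or small vector) per frame
```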
  • rendering processing is performed based on each frame of the image to be decoded and the corresponding global features and the depth to obtain a decoded multi-frame three-dimensional model, including: inputting each frame of the image to be decoded and the corresponding global features and the depth into a convolutional neural network for rendering processing to obtain a three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.
  • a convolutional neural network is used to render the image to be decoded and the global features and depth corresponding to the image to be decoded to obtain a decoded three-dimensional model.
  • the first frame of the image to be decoded and the global features and depth corresponding to the first frame of the image to be decoded are input into the convolutional neural network for rendering, and the three-dimensional model corresponding to the first frame of the image to be decoded output by the convolutional neural network is obtained.
  • the three-dimensional models corresponding to other frames of the image to be decoded can be obtained, and then the decoded multi-frame three-dimensional models are obtained.
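  • The application does not fix the rendering network's architecture or the representation of the output three-dimensional model; purely as an illustration, the sketch below concatenates the image to be decoded, the (spatially matched) global features and the depth map along the channel axis and lets a small convolutional network predict a per-pixel 3D point, which is only one possible model representation.

```python
import torch
import torch.nn as nn

class RenderCNN(nn.Module):
    """Illustrative rendering network: image + global features + depth -> per-pixel 3D points."""
    def __init__(self, image_ch=3, feat_ch=32, depth_ch=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(image_ch + feat_ch + depth_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, 1),  # assumption: XYZ coordinates per pixel as the model output
        )

    def forward(self, image, global_feats, depth):
        # Assumption: all three inputs share the image resolution so they can be concatenated.
        x = torch.cat([image, global_feats, depth], dim=1)
        return self.net(x)
```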
  • the rendering process is performed based on each frame of the image to be decoded and the corresponding global features and the depth to obtain a decoded multi-frame 3D model
  • the multi-frame 3D model is used to generate the volumetric video, including: serializing the decoded multi-frame 3D model in the order corresponding to the image to be decoded corresponding to each frame of the 3D model to obtain the volumetric video.
  • a corresponding frame of the 3D model can be obtained for each frame of the image to be decoded, and all the obtained 3D models are sequentially connected in the order of the corresponding images to be decoded to obtain the decoded volumetric video.
  • Volumetric video is also called spatial video, volumetric 3D video, 6-DOF video, etc.
  • volumetric video in the aforementioned embodiments of the present application is a technology that captures information in 3D space (such as depth and color information, etc.) and generates a 3D dynamic model sequence.
  • volumetric videos add the concept of space to videos, using 3D models to better restore the real 3D world, rather than using 2D flat videos plus camera movements to simulate the sense of space of the real 3D world.
  • volumetric video is essentially a 3D model sequence, users can adjust to any viewing angle to watch it as they like, and it has a higher degree of restoration and immersion than 2D flat videos.
  • the three-dimensional model used to construct the volumetric video can be reconstructed as follows: First, color images and depth images of the object from different perspectives, as well as camera parameters corresponding to the color images, are obtained; then, a neural network model that implicitly expresses the three-dimensional model of the object is trained based on the obtained color images and their corresponding depth images and camera parameters, and isosurface extraction is performed based on the trained neural network model to achieve three-dimensional reconstruction of the object and obtain a three-dimensional model of the object.
  • the present application embodiment does not specifically limit the neural network model architecture, and can be selected by those skilled in the art according to actual needs.
  • a multilayer perceptron (MLP) without a normalization layer can be selected as the basic model for model training.
  • multiple color cameras and depth cameras can be used simultaneously to shoot the object to be three-dimensionally reconstructed from multiple perspectives, obtaining color images and corresponding depth images of the object at multiple different perspectives at the same shooting time (shots whose actual shooting times differ by no more than a time threshold are regarded as having the same shooting time): the color camera at each perspective captures a color image of the object at that perspective, and correspondingly, the depth camera at each perspective captures a depth image of the object at that perspective.
  • the object can be any object, including but not limited to living objects such as people, animals and plants, or non-living objects such as machinery, furniture, and dolls.
  • the color images of the object at different viewing angles all have corresponding depth images, that is, when shooting, the color camera and the depth camera can adopt the configuration of a camera group, and the color camera of the same viewing angle cooperates with the depth camera to synchronously shoot the same object.
  • a studio can be built, the central area of the studio is the shooting area, and around the shooting area, multiple groups of color cameras and depth cameras are paired and arranged at certain angles in the horizontal and vertical directions.
  • the color images of the object at different viewing angles and the corresponding depth images can be obtained by shooting with these color cameras and depth cameras.
  • the camera parameters of the color camera corresponding to each color image are further obtained.
  • the camera parameters include the internal and external parameters of the color camera, which can be determined by calibration.
  • the camera internal parameters are parameters related to the characteristics of the color camera itself, including but not limited to the focal length, pixels and other data of the color camera.
  • the camera external parameters are parameters of the color camera in the world coordinate system, including but not limited to the position (coordinates) of the color camera and the rotation direction of the camera.
  • the object can be reconstructed in three dimensions based on these color images and their corresponding depth images.
  • the present application trains a neural network model to realize the implicit expression of the three-dimensional model of the object, thereby realizing the three-dimensional reconstruction of the object based on the neural network model.
  • the present application uses a multilayer perceptron (MLP) without a normalization layer as the basic model and trains it in the following manner: Based on the corresponding camera parameters, the pixel points in each color image are converted into rays; multiple sampling points are sampled on the rays, and the first coordinate information of each sampling point and the SDF value of each sampling point from the pixel point are determined; the first coordinate information of the sampling point is input into the basic model to obtain the predicted SDF value and the predicted RGB color value of each sampling point output by the basic model; based on the first difference between the predicted SDF value and the SDF value, and the second difference between the predicted RGB color value and the RGB color value of the pixel point, the parameters of the basic model are adjusted until the preset stop condition is met; the basic model that meets the preset stop condition is used as the neural network model of the three-dimensional model of the implicit expression object.
  • a pixel in the color image is converted into a ray, which can be a ray passing through the pixel and perpendicular to the color image surface; then, multiple sampling points are sampled on the ray, and the sampling process of the sampling points can be performed in two steps.
  • some sampling points can be uniformly sampled first, and then multiple additional sampling points can be sampled at key positions based on the depth value of the pixel point, so that as many sampling points as possible lie near the model surface; then, the first coordinate information of each sampled point in the world coordinate system and the signed distance field (SDF) value of each sampling point are calculated according to the camera parameters and the depth value of the pixel point. The SDF value can be taken as the difference between the depth value of the pixel point and the distance from the sampling point to the imaging plane of the camera.
  • the difference is a signed value. When the difference is a positive value, it indicates that the sampling point is outside the three-dimensional model. When the difference is a negative value, it indicates that the sampling point is inside the three-dimensional model. When the difference is zero, it indicates that the sampling point is on the surface of the three-dimensional model.
  • the first coordinate information of the sampling point in the world coordinate system is further input into a basic model (the basic model is configured to map the input coordinate information into an SDF value and an RGB color value and then output), the SDF value output by the basic model is recorded as a predicted SDF value, and the RGB color value output by the basic model is recorded as a predicted RGB color value. Then, based on a first difference between the predicted SDF value and the SDF value corresponding to the sampling point, and a second difference between the predicted RGB color value and the RGB color value of the pixel corresponding to the sampling point, the parameters of the basic model are adjusted.
  • sampling points are sampled in the same manner as described above, and then the coordinate information of the sampling points in the world coordinate system is input into the basic model to obtain the corresponding predicted SDF value and predicted RGB color value, which are used to adjust the parameters of the basic model until the preset stop condition is met.
  • the preset stop condition can be configured as the number of iterations of the basic model reaches a preset number, or the preset stop condition can be configured as the convergence of the basic model.
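  • As a hedged illustration of the training described above, the sketch below uses a plain multilayer perceptron without normalization layers to map a sampled 3D point to a predicted SDF value and RGB color, and adjusts its parameters using the sum of the first difference (SDF) and the second difference (RGB); the layer sizes, the L1 form of the differences and the optimizer are assumptions.

```python
import torch
import torch.nn as nn

class ImplicitMLP(nn.Module):
    """Plain MLP (no normalization layers) mapping a 3D point to (SDF value, RGB color)."""
    def __init__(self, hidden=256, layers=6):
        super().__init__()
        dims = [3] + [hidden] * layers
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.body = nn.Sequential(*blocks)
        self.sdf_head = nn.Linear(hidden, 1)
        self.rgb_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, points):                     # (N, 3) first coordinate information of the samples
        h = self.body(points)
        return self.sdf_head(h), self.rgb_head(h)  # predicted SDF value and predicted RGB color

def train_step(model, optimizer, points, sdf_gt, rgb_gt):
    pred_sdf, pred_rgb = model(points)
    # First difference (SDF) plus second difference (RGB); L1 is an illustrative choice.
    loss = (pred_sdf - sdf_gt).abs().mean() + (pred_rgb - rgb_gt).abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```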
  • the isosurface extraction algorithm can be used to extract the surface of the three-dimensional model from the trained neural network model to obtain the three-dimensional model of the object.
  • an imaging plane of the color image is determined according to camera parameters; and a ray passing through a pixel point in the color image and perpendicular to the imaging plane is determined to be a ray corresponding to the pixel point.
  • the coordinate information of the color image in the world coordinate system can be determined according to the camera parameters of the color camera corresponding to the color image. Then, the ray passing through the pixel point in the color image and perpendicular to the imaging plane can be determined as the ray corresponding to the pixel point.
  • the second coordinate information and the rotation angle of the color camera in the world coordinate system are determined according to the camera parameters; and the imaging plane of the color image is determined according to the second coordinate information and the rotation angle.
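  • A small sketch of constructing such a ray, under the assumption that the camera parameters are given as a 3x3 intrinsic matrix K and a world-to-camera rotation R and translation t; these parameter conventions are not specified by the application and are used here only for illustration.

```python
import numpy as np

def pixel_ray(u, v, K, R, t):
    """Ray through pixel (u, v), perpendicular to the imaging plane, as described above.
    K: 3x3 intrinsics, R: 3x3 rotation, t: 3-vector translation (world-to-camera); assumed conventions."""
    # Pixel position on the (unit-depth) imaging plane, in camera coordinates.
    p_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Camera-to-world transform: X_world = R^T (X_cam - t).
    origin = R.T @ (p_cam - t)
    direction = R.T @ np.array([0.0, 0.0, 1.0])  # optical axis, i.e. the normal of the imaging plane
    return origin, direction / np.linalg.norm(direction)
```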
  • a first number of first sampling points are sampled at equal intervals on the ray; a plurality of key sampling points are determined according to the depth value of the pixel point, and a second number of second sampling points are sampled according to the key sampling points; and the first number of first sampling points and the second number of second sampling points are determined as a plurality of sampling points obtained by sampling on the ray.
  • n (i.e., the first number of) first sampling points are uniformly sampled on the ray, where n is a positive integer greater than 2; then, according to the depth value of the aforementioned pixel point, a preset number of key sampling points closest to the aforementioned pixel point are determined from the n first sampling points, or key sampling points whose distance to the aforementioned pixel point is less than a distance threshold are determined from the n first sampling points; then, m second sampling points are sampled according to the determined key sampling points, where m is a positive integer greater than 1; finally, the sampled n+m sampling points are determined as the multiple sampling points sampled on the ray.
  • sampling m more sampling points at the key sampling points can make the training effect of the model more accurate on the surface of the three-dimensional model, thereby improving the reconstruction accuracy of the three-dimensional model.
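  • The two-stage sampling can be pictured with the toy sketch below: n equally spaced first sampling points along the ray, then m second sampling points concentrated around the key samples closest to the pixel's depth value; the near/far range, the number of key samples and the spread are illustrative assumptions.

```python
import numpy as np

def sample_ray(origin, direction, pixel_depth, n=32, m=16, near=0.1, far=5.0, spread=0.05):
    """First stage: n equally spaced samples; second stage: m samples near the surface
    implied by the pixel depth. Range and spread values are illustrative assumptions."""
    t_uniform = np.linspace(near, far, n)                                   # the first number of samples
    key = t_uniform[np.argsort(np.abs(t_uniform - pixel_depth))[:4]]        # key samples closest to the depth
    t_fine = np.random.uniform(key.min() - spread, key.max() + spread, m)   # the second number of samples
    t_all = np.sort(np.concatenate([t_uniform, t_fine]))
    return origin + t_all[:, None] * direction                              # (n + m, 3) sample positions
```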
  • the depth value corresponding to the pixel is determined according to the depth image corresponding to the color image; the SDF value of each sampling point from the pixel is calculated based on the depth value; and the coordinate information of each sampling point is calculated according to the camera parameters and the depth value.
  • the distance between the shooting position of the color camera and the corresponding point on the object is determined according to the camera parameters and the depth value of the pixel point, and then the SDF value of each sampling point is calculated one by one based on the distance, and the coordinate information of each sampling point is calculated.
  • the trained basic model can predict its corresponding SDF value.
  • the predicted SDF value represents the positional relationship (inside, outside or on the surface) between the point and the three-dimensional model of the object, thereby realizing the implicit expression of the three-dimensional model of the object and obtaining a neural network model for implicitly expressing the three-dimensional model of the object.
  • isosurface extraction is performed on the above neural network model.
  • an isosurface extraction algorithm (Marching Cubes, MC) can be used to draw the surface of the three-dimensional model to obtain the three-dimensional model surface, and then the three-dimensional model of the object is obtained based on the three-dimensional model surface.
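  • One common way to perform this isosurface extraction, shown here as an assumption using scikit-image's Marching Cubes over an SDF grid sampled from the trained network (reusing the illustrative ImplicitMLP above; the grid resolution and bounds are arbitrary choices):

```python
import numpy as np
import torch
from skimage import measure

def extract_mesh(model, resolution=128, bound=1.0):
    # Sample the trained network's SDF on a regular grid and extract the zero level set.
    axis = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
    with torch.no_grad():
        sdf, _ = model(torch.from_numpy(grid).float())
    volume = sdf.numpy().reshape(resolution, resolution, resolution)
    verts, faces, normals, _ = measure.marching_cubes(volume, level=0.0)
    # Map voxel indices back to world coordinates.
    verts = verts / (resolution - 1) * 2 * bound - bound
    return verts, faces
```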
  • the 3D reconstruction scheme provided by the present application uses a neural network to implicitly model the 3D model of the object, and incorporates depth information to improve the speed and accuracy of model training.
  • the 3D model of the photographed object is continuously reconstructed in time sequence, and the 3D model of the photographed object at different times can be obtained.
  • the 3D model sequence composed of these 3D models at different times in time sequence is the volumetric video obtained by photographing the photographed object.
  • "volume video shooting" can be performed on any photographed object to obtain a volumetric video presenting specific content.
  • a volumetric video of a dancing subject can be shot to obtain a volumetric video of the subject's dance that can be viewed from any angle.
  • a volumetric video of a teaching subject can be shot to obtain a volumetric video of the subject's teaching that can be viewed from any angle, and so on.
  • the embodiment of the present application also provides a volumetric video decoding device based on the above volumetric video decoding method.
  • the meanings of the terms are the same as those in the above volumetric video decoding method, and the specific implementation details can refer to the description in the method embodiment.
  • Figure 3 shows a block diagram of a volumetric video decoding device according to an embodiment of the present application.
  • a volumetric video decoding device 300 may include: an acquisition module 310 , an extraction module 320 , an analysis module 330 , and a rendering module 340 .
  • the acquisition module 310 can be used to acquire multiple frames of images to be decoded corresponding to the volumetric video; the extraction module 320 can be used to extract global features corresponding to each frame of the image to be decoded; the analysis module 330 can be used to perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded, and obtain the depth corresponding to each frame of the image to be decoded; the rendering module 340 can be used to perform rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth, and obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
  • the extraction module is used to perform feature extraction processing on the image to be decoded for each frame of the image to be decoded to obtain image features corresponding to the image to be decoded; and perform feature fusion processing on the image features corresponding to the image to be decoded to obtain global features corresponding to the image to be decoded.
  • the extraction module is used to: perform multi-level encoding processing on the image to be decoded to obtain image features output by the encoding processing at each level; wherein the encoding processing at each level includes convolution processing and maximum pooling processing performed in sequence; the image features output by the encoding processing at the previous level are used for the encoding processing at the next level; the extraction module is also used to: perform multi-level decoding processing on the features to be fused to obtain fused features output by the decoding processing at each level; wherein the decoding processing at each level includes deconvolution processing, splicing processing and convolution processing performed in sequence; the fused features output by the decoding processing at the previous level are used for the decoding processing at the next level; the splicing processing at each level includes: splicing the deconvolution features output by the deconvolution processing with the image features of the same level to obtain splicing features for convolution processing; and obtaining the global features corresponding to the image to be decoded based on the fused features output by the last level of decoding processing.
  • the extraction module is also used to: calculate the attention distribution of the image features at the same level, and calculate the weighted average based on the attention distribution to obtain the weighted average features at the same level; splice the deconvolution features output by the deconvolution processing with the weighted average features at the same level to obtain the spliced features for convolution processing.
  • the analysis module is used to: input the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis and processing, and obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
  • the analysis module is used to: input the global features corresponding to each frame of the image to be decoded into the gated recurrent unit for depth analysis processing, and obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
  • the rendering module is used to: input each frame of the image to be decoded and the corresponding global features and the depth into a convolutional neural network for rendering processing, so as to obtain a three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.
  • the device further includes a generation module, which is used to: serialize the decoded multi-frame three-dimensional models in the order corresponding to the image to be decoded corresponding to each frame of the three-dimensional model to obtain the volumetric video.
  • an embodiment of the present application further provides an electronic device, which may be a terminal or a server, as shown in FIG. 4, which shows a schematic diagram of the structure of the electronic device involved in the embodiment of the present application, specifically:
  • the electronic device may include components such as a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, and an input unit 404.
  • the processor 401 is the control center of the electronic device. It uses various interfaces and lines to connect various parts of the entire computer device.
  • the processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface and application programs, etc., and the modem processor mainly handles wireless communications. It is understandable that the above-mentioned modem processor may not be integrated into the processor 401.
  • the memory 402 can be used to store software programs and modules.
  • the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402.
  • the memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the data storage area may store data created according to the use of the computer device, etc.
  • the memory 402 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one disk storage device, a flash memory device, or other non-volatile solid-state storage devices. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
  • the electronic device also includes a power supply 403 for supplying power to each component.
  • the power supply 403 can be logically connected to the processor 401 through a power management system, so that the power management system can manage charging, discharging, power consumption and other functions.
  • the power supply 403 can also include one or more DC or AC power supplies, recharging systems, power failure detection circuits, power converters or inverters, power status indicators and other arbitrary components.
  • the electronic device may further include an input unit 404, which may be used to receive input digital or character information and generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
  • the electronic device may further include a display unit, etc., which will not be described in detail herein.
  • the processor 401 in the electronic device will load the executable files corresponding to the processes of one or more computer programs into the memory 402 according to the following instructions, and the processor 401 will run the computer programs stored in the memory 402, thereby realizing various functions in the aforementioned embodiments of the present application.
  • the processor 401 can execute the following steps: obtain multiple frames of images to be decoded corresponding to the volumetric video; extract the global features corresponding to each frame of the image to be decoded; perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; perform rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth respectively to obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
  • an embodiment of the present application further provides a storage medium, in which a computer program is stored.
  • the computer program can be loaded by a processor to execute the steps in any method provided in the embodiment of the present application.
  • the storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Processing (AREA)

Abstract

Disclosed in the present application are a decoding method for a volumetric video, and a storage medium and an electronic device. The method comprises: acquiring a plurality of frames of images to be decoded; extracting global features of the images to be decoded; performing depth analysis processing on the basis of the global features, so as to obtain depths; and performing rendering processing on the basis of the frames of the images to be decoded, the global features and the depths, so as to obtain a multi-frame three-dimensional model for generating a volumetric video.

Description

体积视频的解码方法、存储介质、以及电子设备Volumetric video decoding method, storage medium, and electronic device 技术领域Technical Field
本申请涉及计算机技术领域,具体涉及一种体积视频的解码方法、存储介质、以及电子设备。The present application relates to the field of computer technology, and in particular to a volumetric video decoding method, a storage medium, and an electronic device.
背景技术Background Art
体积视频是连续的三维模型的模型序列,体积视频中通常包括大量的三维模型,通常存在对体积视频进行编解码的需求。目前,相关技术中,存在将体积视频中三维模型编码为顶点数据及面片数据等编码数据,通过大量复杂解码计算对编码数据进行解码来播放体积视频的方案。Volumetric video is a model sequence of continuous three-dimensional models. Volumetric video usually includes a large number of three-dimensional models. There is usually a need to encode and decode volumetric video. At present, in the related technology, there is a solution that encodes the three-dimensional model in the volumetric video into encoded data such as vertex data and facet data, and decodes the encoded data through a large number of complex decoding calculations to play the volumetric video.
技术问题Technical issues
目前的方案中,存在体积视频的解码存在解码效率较低及解码效果较差的问题。In the current solution, the decoding of volumetric video has the problem of low decoding efficiency and poor decoding effect.
技术解决方案Technical Solutions
本申请实施例提供一种方案,可以提升体积视频的解码效率且提升解码效果。The embodiment of the present application provides a solution that can improve the decoding efficiency of volumetric video and improve the decoding effect.
为解决上述技术问题,本申请实施例提供以下技术方案:
根据本申请的一个实施例,一种体积视频的解码方法,所述方法包括:获取体积视频对应的多帧待解码图像;提取每一帧所述待解码图像对应的全局特征;基于每一帧所述待解码图像对应的全局特征进行深度分析处理,得到每一帧所述待解码图像对应的深度;基于每一帧所述待解码图像及对应的所述全局特征与所述深度分别进行渲染处理,得到解码出的多帧三维模型,所述多帧三维模型用于生成所述体积视频。
To solve the above technical problems, the present application provides the following technical solutions:
According to one embodiment of the present application, a method for decoding a volumetric video includes: obtaining multiple frames of images to be decoded corresponding to the volumetric video; extracting global features corresponding to each frame of the image to be decoded; performing depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; performing rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth to obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
In some embodiments of the present application, the extracting of the global features corresponding to each frame of the image to be decoded includes: for each frame of the image to be decoded, performing feature extraction processing on the image to be decoded to obtain image features corresponding to the image to be decoded; and performing feature fusion processing on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded.

In some embodiments of the present application, the feature extraction processing of the image to be decoded to obtain the image features corresponding to the image to be decoded includes: performing multi-level encoding processing on the image to be decoded to obtain the image features output by the encoding processing of each level, wherein the encoding processing of each level includes convolution processing and maximum pooling processing performed in sequence, and the image features output by the encoding processing of the previous level are used for the encoding processing of the next level. The feature fusion processing of the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded includes: performing multi-level decoding processing on the features to be fused to obtain the fused features output by the decoding processing of each level, wherein the decoding processing of each level includes deconvolution processing, splicing processing and convolution processing performed in sequence, and the fused features output by the decoding processing of the previous level are used for the decoding processing of the next level; the splicing processing of each level includes: splicing the deconvolution features output by the deconvolution processing with the image features of the same level to obtain the spliced features used for convolution processing; and obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding processing of the last level.

In some embodiments of the present application, the splicing of the deconvolution features output by the deconvolution processing with the image features of the same level to obtain the spliced features for convolution processing includes: calculating an attention distribution for the image features of the same level, and calculating a weighted average based on the attention distribution to obtain weighted average features of the same level; and splicing the deconvolution features output by the deconvolution processing with the weighted average features of the same level to obtain the spliced features for convolution processing.

In some embodiments of the present application, the performing of depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded includes: inputting the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis processing, to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
In some embodiments of the present application, the inputting of the global features corresponding to each frame of the image to be decoded into the recurrent neural network for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network includes: inputting the global features corresponding to each frame of the image to be decoded into a gated recurrent unit for depth analysis processing, to obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
In some embodiments of the present application, the performing of rendering processing based on each frame of the image to be decoded and the corresponding global features and depth to obtain the decoded multi-frame three-dimensional model includes: inputting each frame of the image to be decoded and the corresponding global features and depth into a convolutional neural network for rendering processing, to obtain the three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.
In some embodiments of the present application, after the rendering processing is performed based on each frame of the image to be decoded and the corresponding global features and depth to obtain the decoded multi-frame three-dimensional model, the method includes: serializing the decoded multi-frame three-dimensional model in the order of the images to be decoded to which the frames of the three-dimensional model correspond, to obtain the volumetric video.
According to one embodiment of the present application, a decoding device for volumetric video includes: an acquisition module, used to acquire multiple frames of images to be decoded corresponding to the volumetric video; an extraction module, used to extract global features corresponding to each frame of the image to be decoded; an analysis module, used to perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; and a rendering module, used to perform rendering processing based on each frame of the image to be decoded and the corresponding global features and depth, to obtain a decoded multi-frame three-dimensional model, where the multi-frame three-dimensional model is used to generate the volumetric video.

In some embodiments of the present application, the extraction module is used to perform, for each frame of the image to be decoded, feature extraction processing on the image to be decoded to obtain image features corresponding to the image to be decoded, and to perform feature fusion processing on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded.

In some embodiments of the present application, the extraction module is used to: perform multi-level encoding processing on the image to be decoded to obtain image features output by the encoding processing at each level, wherein the encoding processing at each level includes convolution processing and maximum pooling processing performed in sequence, and the image features output by the encoding processing at the previous level are used for the encoding processing at the next level. The extraction module is also used to: perform multi-level decoding processing on the features to be fused to obtain fused features output by the decoding processing at each level, wherein the decoding processing at each level includes deconvolution processing, splicing processing and convolution processing performed in sequence, and the fused features output by the decoding processing at the previous level are used for the decoding processing at the next level; the splicing processing at each level includes: splicing the deconvolution features output by the deconvolution processing with the image features of the same level to obtain spliced features for convolution processing; and obtaining the global features corresponding to the image to be decoded based on the fused features output by the decoding processing at the last level.

In some embodiments of the present application, the extraction module is also used to: calculate an attention distribution for the image features at the same level, and calculate a weighted average based on the attention distribution to obtain weighted average features at the same level; and splice the deconvolution features output by the deconvolution processing with the weighted average features at the same level to obtain the spliced features for convolution processing.

In some embodiments of the present application, the analysis module is used to: input the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis processing, and obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
In some embodiments of the present application, the analysis module is used to: input the global features corresponding to each frame of the image to be decoded into a gated recurrent unit for depth analysis processing, and obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
In some embodiments of the present application, the rendering module is used to: input each frame of the image to be decoded and the corresponding global features and depth into a convolutional neural network for rendering processing, to obtain the three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.

In some embodiments of the present application, the device further includes a generation module, used to: serialize the decoded multi-frame three-dimensional model in the order of the images to be decoded to which the frames of the three-dimensional model correspond, to obtain the volumetric video.

According to another embodiment of the present application, a storage medium stores a computer program thereon, and when the computer program is executed by a processor of a computer, the computer executes the method described in the embodiments of the present application.

According to another embodiment of the present application, an electronic device may include: a memory storing a computer program; and a processor reading the computer program stored in the memory to execute the method described in the embodiments of the present application.

According to another embodiment of the present application, a computer program product or computer program includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the method provided in the various optional implementations described in the embodiments of the present application.

Beneficial Effects

In the volumetric video decoding scheme of the embodiments of the present application, multiple frames of images to be decoded corresponding to the volumetric video are obtained; global features corresponding to each frame of the image to be decoded are extracted; depth analysis processing is performed based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; and rendering processing is performed based on each frame of the image to be decoded and the corresponding global features and depth, to obtain a decoded multi-frame three-dimensional model, where the multi-frame three-dimensional model is used to generate the volumetric video.
In this way, the volumetric video is provided in the form of being encoded into multiple frames of images to be decoded. Global features are extracted from each frame of the image to be decoded, the depth is analyzed from the global features, and a three-dimensional model is rendered by combining the image to be decoded, the global features, and the depth. The overall decoding process is efficient and the decoded three-dimensional model is highly reliable, which can effectively improve the decoding efficiency of volumetric video and improve the decoding effect.
Brief Description of the Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained based on these drawings without creative work.

FIG. 1 shows a schematic diagram of a system to which an embodiment of the present application can be applied.

FIG. 2 shows a flowchart of a volumetric video decoding method according to an embodiment of the present application.

FIG. 3 shows a block diagram of a volumetric video decoding device according to another embodiment of the present application.

FIG. 4 shows a block diagram of an electronic device according to an embodiment of the present application.
Embodiments of the Present Invention

The technical solutions in the embodiments of the present application will be described clearly and completely below in combination with the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those skilled in the art without creative work fall within the scope of protection of the present application.

FIG. 1 shows a schematic diagram of a system 100 to which an embodiment of the present application can be applied. As shown in FIG. 1, the system 100 may include a server 101 and a terminal 102.

The server 101 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.

The terminal 102 may be any device, including but not limited to mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, VR/AR devices, smart watches, and the like. In one embodiment, the server 101 or the terminal 102 may be a node device in a blockchain network or a map Internet-of-Vehicles platform.

In one implementation of this example, the server 101 or the terminal 102 may: obtain multiple frames of images to be decoded corresponding to the volumetric video; extract global features corresponding to each frame of the image to be decoded; perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; and perform rendering processing based on each frame of the image to be decoded and the corresponding global features and depth, to obtain a decoded multi-frame three-dimensional model, where the multi-frame three-dimensional model is used to generate the volumetric video.
FIG. 2 schematically shows a flowchart of a volumetric video decoding method according to an embodiment of the present application. The volumetric video decoding method may be executed by any device, such as the server 101 or the terminal 102 shown in FIG. 1.

As shown in FIG. 2, the volumetric video decoding method may include steps S210 to S240.

Step S210: obtain multiple frames of images to be decoded corresponding to the volumetric video. Step S220: extract global features corresponding to each frame of the image to be decoded. Step S230: perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded. Step S240: perform rendering processing based on each frame of the image to be decoded and the corresponding global features and depth, to obtain a decoded multi-frame three-dimensional model, where the multi-frame three-dimensional model is used to generate the volumetric video.

Volumetric video is a model sequence of multi-frame three-dimensional models. A three-dimensional model may be one corresponding to a person, an animal, or the like. Through continuous multi-frame three-dimensional models, a volumetric video can present the behavior of an object (for example, dancing), and each frame of the three-dimensional model can be reconstructed from multiple two-dimensional images captured from multiple perspectives.

The created volumetric video may be encoded in advance into corresponding multiple frames of images to be decoded. The volumetric video may be encoded into the corresponding multiple frames of images to be decoded in the following manners: the volumetric video may be encoded into multi-perspective color images used for reconstructing each frame of the three-dimensional model in the volumetric video; or the volumetric video may be encoded into multi-perspective model images captured from different angles for each frame of the three-dimensional model. Each frame of the image to be decoded may correspond to one frame of the three-dimensional model, and each frame of the image to be decoded may include at least one image.

By performing feature extraction of global information on each frame of the image to be decoded, the global features containing global information corresponding to each frame of the image to be decoded can be extracted. Depth analysis processing is then performed based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded.

Finally, rendering processing is performed based on each frame of the image to be decoded and the global features and depth corresponding to the image to be decoded, to obtain a decoded multi-frame three-dimensional model. The multi-frame three-dimensional model can be serialized to obtain the restored/decoded volumetric video.
In this way, based on steps S210 to S240, the volumetric video is provided in the form of being encoded into multiple frames of images to be decoded. Global features are extracted from each frame of the image to be decoded, the depth is analyzed from the global features, and a three-dimensional model is rendered by combining the image to be decoded, the global features, and the depth. The overall decoding process is efficient and the decoded three-dimensional model is highly reliable, which can effectively improve the decoding efficiency of volumetric video and improve the decoding effect.
The following describes further specific optional embodiments of the steps performed when decoding the volumetric video in the embodiment of FIG. 2.

In one embodiment, step S220, extracting the global features corresponding to each frame of the image to be decoded, includes: for each frame of the image to be decoded, performing feature extraction processing on the image to be decoded to obtain image features corresponding to the image to be decoded; and performing feature fusion processing on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded.

The image to be decoded can be subjected to feature extraction processing through a feature extraction network (for example, a convolutional network) to obtain the image features corresponding to the image to be decoded. Further, the image features corresponding to the image to be decoded can be subjected to feature fusion processing through a feature fusion network (for example, a fully connected network) to obtain the global features corresponding to the image to be decoded.

In one embodiment, step S220, extracting the global features corresponding to each frame of the image to be decoded, includes: for each frame of the image to be decoded, calculating the histogram corresponding to the image to be decoded as the global feature through a histogram calculation function (for example, the histogram calculation function in OpenCV).
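For illustration only, the histogram variant above could be realized with OpenCV's cv2.calcHist roughly as in the sketch below; the bin count, per-channel concatenation, and normalization are assumptions made for the example rather than details taken from the disclosure.

```python
import cv2
import numpy as np

def histogram_global_feature(image_bgr: np.ndarray, bins: int = 64) -> np.ndarray:
    """Compute a per-channel color histogram and use it as a global feature vector."""
    features = []
    for channel in range(3):  # B, G, R channels
        hist = cv2.calcHist([image_bgr], [channel], None, [bins], [0, 256])
        features.append(hist.flatten())
    feature = np.concatenate(features)
    # Normalize so the feature is independent of image resolution (an assumption for this sketch).
    return feature / (feature.sum() + 1e-8)
```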
In one embodiment, the feature extraction processing of the image to be decoded to obtain the image features corresponding to the image to be decoded includes: performing multi-level encoding processing on the image to be decoded to obtain the image features output by the encoding processing of each level, wherein the encoding processing of each level includes convolution processing and maximum pooling processing performed in sequence, and the image features output by the encoding processing of the previous level are used for the encoding processing of the next level. The feature fusion processing of the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded includes: performing multi-level decoding processing on the features to be fused to obtain the fused features output by the decoding processing of each level, wherein the decoding processing of each level includes deconvolution processing, splicing processing and convolution processing performed in sequence, and the fused features output by the decoding processing of the previous level are used for the decoding processing of the next level; the splicing processing of each level includes: splicing the deconvolution features output by the deconvolution processing with the image features of the same level to obtain the spliced features used for convolution processing; and obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding processing of the last level.

In this embodiment, first, for each frame of the image to be decoded, multi-level encoding processing is performed on the image to be decoded through a feature extraction network (encoder). The feature extraction network (encoder) may include multiple cascaded levels of extraction networks, and the extraction network of each level can output the image features of the corresponding level through the encoding processing. The encoding processing performed in the extraction network of each level specifically includes convolution processing and maximum pooling processing performed in sequence. For example, the encoding processing in the extraction network of the first level may include: first performing convolution processing on the image to be decoded to obtain convolution features, and then performing maximum pooling processing on the convolution features to obtain the image features output by the encoding processing of the first level. Further, the image features output by the encoding processing of the previous level are used for the encoding processing of the next level. For example, the image features of the first level are used as the input features of the extraction network of the second level; in the extraction network of the second level, convolution processing is first performed on the image features of the first level to obtain convolution features, and then maximum pooling processing is performed on the convolution features to obtain the image features output by the encoding processing of the second level.

Further, multi-level decoding processing is performed on the features to be fused through a feature fusion network (decoder). The feature fusion network (decoder) may include multiple cascaded levels of fusion networks, and the fusion network of each level can output the fused features of the corresponding level through the decoding processing. The decoding processing performed in the fusion network of each level specifically includes deconvolution processing, splicing processing, and convolution processing performed in sequence. For example, the decoding processing in the fusion network of the first level may include: first performing deconvolution processing on the features to be fused to obtain deconvolution features, then performing splicing processing on the deconvolution features to obtain spliced features, and then performing convolution processing on the spliced features to obtain the fused features output by the decoding processing of the first level. Further, the fused features output by the decoding processing of the previous level are used for the decoding processing of the next level. For example, the fused features of the first level are used as the input features of the fusion network of the second level, and in the fusion network of the second level, deconvolution processing, splicing processing, and convolution processing are performed in sequence on the fused features of the first level. Further, the splicing processing of each level includes: splicing the deconvolution features output by the deconvolution processing with the image features of the same level to obtain the spliced features used for convolution processing. For example, the splicing processing of the first level includes: splicing the deconvolution features output by the deconvolution processing of the first level with the image features output by the encoding processing of the first level to obtain the spliced features used for convolution processing in the first level.

Finally, the global features corresponding to the image to be decoded are obtained according to the fused features output by the decoding processing of the last level. The fused features output by the decoding processing of the last level may be used directly as the global features corresponding to the image to be decoded, or the fused features output by the decoding processing of the last level may be reduced in dimension and the dimension-reduced features used as the global features corresponding to the image to be decoded.

In one implementation of this example, the feature extraction network is a UNet network. The UNet network includes a feature extraction network (encoder) on the left and a feature fusion network (decoder) on the right; the feature extraction network (encoder) includes 4 levels of extraction networks, and the feature fusion network (decoder) also includes 4 levels of fusion networks.
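The following is a minimal PyTorch-style sketch of such a 4-level encoder/decoder with skip splicing, assuming illustrative channel widths and a simple spatial average to form the final global feature; it is not the network actually used in the disclosure.

```python
import torch
import torch.nn as nn

class EncoderLevel(nn.Module):
    """One encoder level: convolution followed by maximum pooling."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feat = self.conv(x)           # image features of this level (kept for splicing)
        return feat, self.pool(feat)  # pooled output feeds the next level

class DecoderLevel(nn.Module):
    """One decoder level: deconvolution, splicing with same-level features, then convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2)  # deconvolution
        self.conv = nn.Sequential(nn.Conv2d(out_ch * 2, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        up = self.up(x)
        spliced = torch.cat([up, skip], dim=1)  # splice deconvolution output with same-level features
        return self.conv(spliced)

class GlobalFeatureUNet(nn.Module):
    """Illustrative 4-level encoder/decoder; channel widths are assumptions."""
    def __init__(self, in_ch: int = 3, widths=(32, 64, 128, 256)):
        super().__init__()
        chans = [in_ch] + list(widths)
        self.encoders = nn.ModuleList(EncoderLevel(chans[i], chans[i + 1]) for i in range(4))
        self.bottleneck = nn.Conv2d(widths[-1], widths[-1] * 2, 3, padding=1)
        rev = list(reversed(widths))
        dec_in = [widths[-1] * 2] + rev[:-1]
        self.decoders = nn.ModuleList(DecoderLevel(dec_in[i], rev[i]) for i in range(4))

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            feat, x = enc(x)
            skips.append(feat)
        x = self.bottleneck(x)
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = dec(x, skip)
        # Use the last-level fused features, spatially averaged, as the global feature.
        return x.mean(dim=(2, 3))
```

Under these assumptions, a call such as GlobalFeatureUNet()(torch.randn(1, 3, 256, 256)) would yield a (1, 32) global feature for one image to be decoded.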
In one embodiment, the splicing of the deconvolution features output by the deconvolution processing with the image features at the same level to obtain the spliced features for convolution processing includes:

calculating an attention distribution for the image features at the same level, and calculating a weighted average according to the attention distribution to obtain weighted average features at the same level; and splicing the deconvolution features output by the deconvolution processing with the weighted average features at the same level to obtain the spliced features for convolution processing.
In this embodiment, weighted average features are first calculated for the image features at the same level through an attention mechanism, and then the deconvolution features output by the deconvolution processing are spliced with the weighted average features to obtain the spliced features for convolution processing. For example, first, weighted average features are calculated through the attention mechanism for the image features output by the encoding processing of the first level; then, the deconvolution features output by the deconvolution processing of the first level are spliced with the weighted average features of the first level to obtain the spliced features used for convolution processing in the first level. In this way, the extracted global features can contain more global information, which further improves the decoding effect of the three-dimensional model as a whole.

Calculating the weighted average features of the image features at the same level through the attention mechanism specifically includes: calculating an attention distribution for the image features at the same level, and calculating a weighted average according to the attention distribution to obtain the weighted average features at the same level. Specifically, the weighted average features can be calculated through a soft attention (Soft Attention) mechanism.
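As a hedged illustration of the soft-attention step, the sketch below scores each spatial position of a level's feature map with an assumed 1x1 convolution, forms the attention distribution with a softmax, computes the weighted average, and broadcasts it back to the spatial size so it can be spliced with the deconvolution features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttentionPool(nn.Module):
    """Attention distribution over spatial positions, then a weighted average feature."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # assumed scoring layer

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        attn = F.softmax(self.score(feat).view(b, 1, h * w), dim=-1)  # attention distribution
        weighted = (feat.view(b, c, h * w) * attn).sum(dim=-1)        # weighted average, shape (b, c)
        # Broadcast back to the spatial size so it can be spliced with the deconvolution features.
        return weighted.view(b, c, 1, 1).expand(b, c, h, w)
```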
In one embodiment, the performing of depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded includes: inputting the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis processing, to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.

In this embodiment, a recurrent neural network (RNN) performs depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded. Specifically, the sequence of global features corresponding to the images to be decoded is input into the recurrent neural network for depth analysis processing, and the depth corresponding to the images to be decoded output by the recurrent neural network is obtained. The parameters of the recurrent neural network are shared at different time steps, so accurate depth can be output through the analysis processing. The recurrent neural network (RNN) may specifically include a long short-term memory network (Long Short Term Memory Network, LSTM) or a gated recurrent unit (GRU), etc.

In one embodiment, the inputting of the global features corresponding to each frame of the image to be decoded into the recurrent neural network for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network includes: inputting the global features corresponding to each frame of the image to be decoded into a gated recurrent unit for depth analysis processing, to obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.

In this embodiment, a gated recurrent unit (GRU) is specifically used to perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded. Specifically, the sequence of global features corresponding to the images to be decoded is input into the gated recurrent unit (GRU) for depth analysis processing, and the depth corresponding to the images to be decoded output by the gated recurrent unit (GRU) is obtained. The gated recurrent unit (GRU) is a gated recurrent neural network that can efficiently produce reliable depth through analysis.
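A minimal sketch of the GRU-based depth analysis might look as follows, assuming the per-frame global features are stacked into a sequence and the depth is regressed as a fixed-resolution map; the hidden size and output resolution are assumptions, not values from the disclosure.

```python
import torch
import torch.nn as nn

class GRUDepthHead(nn.Module):
    """Run per-frame global features through a GRU and regress a depth map per frame."""
    def __init__(self, feature_dim: int, hidden_dim: int = 256, out_hw=(64, 64)):
        super().__init__()
        self.gru = nn.GRU(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, out_hw[0] * out_hw[1])
        self.out_hw = out_hw

    def forward(self, global_features: torch.Tensor) -> torch.Tensor:
        # global_features: (batch, num_frames, feature_dim) -- one global feature per frame to decode
        hidden_states, _ = self.gru(global_features)
        depth = self.head(hidden_states)                   # (batch, num_frames, H*W)
        return depth.view(*depth.shape[:2], *self.out_hw)  # (batch, num_frames, H, W)
```

For example, features of shape (1, num_frames, 32) fed to GRUDepthHead(feature_dim=32) would yield depth maps of shape (1, num_frames, 64, 64) under these assumptions.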
In one embodiment, the performing of rendering processing based on each frame of the image to be decoded and the corresponding global features and depth to obtain the decoded multi-frame three-dimensional model includes: inputting each frame of the image to be decoded and the corresponding global features and depth into a convolutional neural network for rendering processing, to obtain the three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.

In this embodiment, a convolutional neural network is used to perform rendering processing by combining the image to be decoded and the global features and depth corresponding to the image to be decoded, to obtain the decoded three-dimensional model. For example, the first frame of the image to be decoded and the global features and depth corresponding to the first frame of the image to be decoded are input into the convolutional neural network for rendering processing, and the three-dimensional model corresponding to the first frame of the image to be decoded output by the convolutional neural network is obtained. Similarly, the three-dimensional models corresponding to the other frames of the images to be decoded can be obtained, and thus the decoded multi-frame three-dimensional model is obtained.
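The disclosure does not fix the concrete output representation of the rendering network, so the sketch below is only one hedged possibility: a small CNN that fuses the image, its depth map, and the broadcast global feature and predicts per-pixel 3D coordinates plus colors as a simple stand-in for the three-dimensional model. The channel sizes and the output format are assumptions.

```python
import torch
import torch.nn as nn

class RenderingCNN(nn.Module):
    """Fuse image, depth and global feature; output per-pixel XYZ + RGB as a simple 3D representation."""
    def __init__(self, global_dim: int, hidden: int = 64):
        super().__init__()
        in_ch = 3 + 1 + global_dim  # RGB image + depth map + broadcast global feature
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 6, 1),  # 3 channels for XYZ, 3 for color
        )

    def forward(self, image, depth, global_feature):
        # image: (B, 3, H, W); depth: (B, 1, H, W); global_feature: (B, global_dim)
        b, _, h, w = image.shape
        g = global_feature.view(b, -1, 1, 1).expand(b, global_feature.shape[1], h, w)
        out = self.net(torch.cat([image, depth, g], dim=1))
        xyz, rgb = out[:, :3], out[:, 3:]
        return xyz, rgb
```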
In one embodiment, the multi-frame three-dimensional model obtained by the rendering processing being used to generate the volumetric video includes: serializing the decoded multi-frame three-dimensional model in the order of the images to be decoded to which the frames of the three-dimensional model correspond, to obtain the volumetric video. Each frame of the image to be decoded yields one corresponding frame of the three-dimensional model, and connecting all of the obtained three-dimensional models in sequence according to the order of the corresponding images to be decoded yields the decoded volumetric video.

The volumetric video (Volumetric Video, also called volume video, spatial video, volumetric three-dimensional video, or 6-degree-of-freedom video, etc.) in the foregoing embodiments of the present application is a technology that captures information in three-dimensional space (such as depth and color information) and generates a sequence of three-dimensional dynamic models. Compared with traditional video, volumetric video adds the concept of space to video and uses three-dimensional models to better restore the real three-dimensional world, rather than using two-dimensional flat video plus camera movement to simulate the sense of space of the real three-dimensional world. Since volumetric video is essentially a sequence of three-dimensional models, users can adjust to any viewing angle they like when watching, and it provides a higher degree of restoration and immersion than two-dimensional flat video.
Optionally, in the present application, before step S210, the three-dimensional models used to construct the volumetric video (these three-dimensional models are not the three-dimensional models obtained by decoding in steps S210 to S240, but rather three-dimensional models obtained by three-dimensional reconstruction before step S210) can be reconstructed as follows:

First, color images and depth images of the photographed object from different perspectives, as well as the camera parameters corresponding to the color images, are obtained; then, based on the obtained color images and their corresponding depth images and camera parameters, a neural network model that implicitly expresses the three-dimensional model of the photographed object is trained, and isosurface extraction is performed based on the trained neural network model to achieve three-dimensional reconstruction of the photographed object and obtain the three-dimensional model of the photographed object.
It should be noted that the embodiments of the present application do not specifically limit the architecture of the neural network model, which can be selected by those skilled in the art according to actual needs. For example, a multilayer perceptron (MLP) without a normalization layer can be selected as the basic model for model training.

The three-dimensional model reconstruction method provided by the present application will be described in detail below.

First, multiple color cameras and depth cameras can be used synchronously to photograph the object to be three-dimensionally reconstructed from multiple perspectives, obtaining color images of the object at multiple different perspectives and the corresponding depth images. That is, at the same shooting moment (shooting moments whose actual difference is less than or equal to a time threshold are regarded as the same), the color camera at each perspective captures a color image of the object at the corresponding perspective, and correspondingly, the depth camera at each perspective captures a depth image of the object at the corresponding perspective. It should be noted that the object may be any object, including but not limited to living objects such as people, animals, and plants, or non-living objects such as machinery, furniture, and dolls.

In this way, the color images of the object at different perspectives all have corresponding depth images. That is, when shooting, the color cameras and depth cameras can be configured as camera groups, with the color camera and depth camera at the same perspective photographing the same object synchronously. For example, a studio can be built, with the central area of the studio as the shooting area; around the shooting area, multiple groups of color cameras and depth cameras are arranged in pairs at certain angular intervals in the horizontal and vertical directions. When the object is in the shooting area surrounded by these color cameras and depth cameras, color images of the object at different perspectives and the corresponding depth images can be captured by these cameras.

In addition, the camera parameters of the color camera corresponding to each color image are further obtained. The camera parameters include the intrinsic and extrinsic parameters of the color camera, which can be determined by calibration. The camera intrinsic parameters are parameters related to the characteristics of the color camera itself, including but not limited to the focal length and pixels of the color camera; the camera extrinsic parameters are the parameters of the color camera in the world coordinate system, including but not limited to the position (coordinates) of the color camera and the rotation direction of the camera.

As described above, after obtaining the color images of the object at multiple different perspectives at the same shooting moment and their corresponding depth images, the object can be three-dimensionally reconstructed based on these color images and their corresponding depth images. Different from the related-art approach of converting depth into a point cloud for three-dimensional reconstruction, the present application trains a neural network model to implicitly express the three-dimensional model of the object, thereby achieving three-dimensional reconstruction of the object based on the neural network model.
Optionally, the present application uses a multilayer perceptron (MLP) without a normalization layer as the basic model and trains it in the following manner:

Based on the corresponding camera parameters, the pixel points in each color image are converted into rays; multiple sampling points are sampled on each ray, and the first coordinate information of each sampling point and the SDF value of each sampling point relative to the pixel point are determined; the first coordinate information of the sampling points is input into the basic model to obtain the predicted SDF value and the predicted RGB color value of each sampling point output by the basic model; based on the first difference between the predicted SDF value and the SDF value, and the second difference between the predicted RGB color value and the RGB color value of the pixel point, the parameters of the basic model are adjusted until a preset stop condition is met; and the basic model that meets the preset stop condition is used as the neural network model that implicitly expresses the three-dimensional model of the object.
First, based on the camera parameters corresponding to the color image, a pixel point in the color image is converted into a ray, which may be a ray passing through the pixel point and perpendicular to the color image plane. Then, multiple sampling points are sampled on the ray; the sampling may be performed in two steps: some sampling points are first sampled uniformly, and then multiple additional sampling points are sampled at key positions based on the depth value of the pixel point, so as to ensure that as many sampling points as possible are sampled near the model surface. Then, according to the camera parameters and the depth value of the pixel point, the first coordinate information of each sampled point in the world coordinate system and the signed distance field (SDF) value of each sampling point are calculated, where the SDF value may be the difference between the depth value of the pixel point and the distance from the sampling point to the imaging plane of the camera. This difference is a signed value: when the difference is positive, the sampling point is outside the three-dimensional model; when the difference is negative, the sampling point is inside the three-dimensional model; and when the difference is zero, the sampling point is on the surface of the three-dimensional model. Then, after the sampling points have been sampled and the SDF value corresponding to each sampling point has been calculated, the first coordinate information of the sampling points in the world coordinate system is further input into the basic model (the basic model is configured to map the input coordinate information into an SDF value and an RGB color value and output them); the SDF value output by the basic model is recorded as the predicted SDF value, and the RGB color value output by the basic model is recorded as the predicted RGB color value. Then, the parameters of the basic model are adjusted based on the first difference between the predicted SDF value and the SDF value corresponding to the sampling point, and the second difference between the predicted RGB color value and the RGB color value of the pixel point corresponding to the sampling point.

In addition, for the other pixel points in the color image, sampling points are likewise sampled in the manner described above, and the coordinate information of the sampling points in the world coordinate system is input into the basic model to obtain the corresponding predicted SDF values and predicted RGB color values, which are used to adjust the parameters of the basic model until a preset stop condition is met. For example, the preset stop condition may be configured such that the number of iterations of the basic model reaches a preset number, or the preset stop condition may be configured as convergence of the basic model. When the iteration of the basic model meets the preset stop condition, a neural network model that can accurately and implicitly express the three-dimensional model of the object is obtained. Finally, an isosurface extraction algorithm can be used to extract the three-dimensional model surface from the neural network model, thereby obtaining the three-dimensional model of the object.
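Purely as an illustrative sketch of this kind of training loop (not the disclosed implementation), the basic model and one parameter update could look as follows; the network width, the L1 loss choice, and the loss weighting are assumptions.

```python
import torch
import torch.nn as nn

class SDFColorMLP(nn.Module):
    """MLP without normalization layers: maps a 3D coordinate to (SDF value, RGB color)."""
    def __init__(self, hidden: int = 256, layers: int = 6):
        super().__init__()
        mods, dim = [], 3
        for _ in range(layers):
            mods += [nn.Linear(dim, hidden), nn.ReLU(inplace=True)]
            dim = hidden
        self.body = nn.Sequential(*mods)
        self.sdf_head = nn.Linear(hidden, 1)
        self.rgb_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, xyz):
        h = self.body(xyz)
        return self.sdf_head(h).squeeze(-1), self.rgb_head(h)

def train_step(model, optimizer, sample_xyz, sample_sdf, pixel_rgb, color_weight=1.0):
    """One parameter update from (sampled coordinates, ground-truth SDF, pixel colors)."""
    pred_sdf, pred_rgb = model(sample_xyz)
    loss = nn.functional.l1_loss(pred_sdf, sample_sdf)                           # first difference (SDF)
    loss = loss + color_weight * nn.functional.l1_loss(pred_rgb, pixel_rgb)      # second difference (color)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```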
Optionally, in some embodiments, the imaging plane of the color image is determined according to the camera parameters, and the ray passing through a pixel point in the color image and perpendicular to the imaging plane is determined as the ray corresponding to the pixel point.

Specifically, the coordinate information of the color image in the world coordinate system, that is, the imaging plane, can be determined according to the camera parameters of the color camera corresponding to the color image. Then, the ray passing through a pixel point in the color image and perpendicular to the imaging plane can be determined as the ray corresponding to that pixel point.

Optionally, in some embodiments, the second coordinate information and the rotation angle of the color camera in the world coordinate system are determined according to the camera parameters, and the imaging plane of the color image is determined according to the second coordinate information and the rotation angle.

Optionally, in some embodiments, a first number of first sampling points are sampled at equal intervals on the ray; a plurality of key sampling points are determined according to the depth value of the pixel point, and a second number of second sampling points are sampled according to the key sampling points; and the first number of first sampling points and the second number of second sampling points are determined as the multiple sampling points sampled on the ray.

Specifically, n (that is, the first number of) first sampling points are first uniformly sampled on the ray, where n is a positive integer greater than 2; then, according to the depth value of the aforementioned pixel point, a preset number of key sampling points closest to the pixel point are determined from the n first sampling points, or key sampling points whose distance to the pixel point is less than a distance threshold are determined from the n first sampling points; then, m second sampling points are sampled according to the determined key sampling points, where m is a positive integer greater than 1; and finally, the n + m sampled points are determined as the multiple sampling points sampled on the ray. Sampling m additional points at the key sampling points makes the training effect of the model more precise near the surface of the three-dimensional model, thereby improving the reconstruction accuracy of the three-dimensional model.
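A small sketch of the two-stage sampling follows, assuming the near/far range, the counts n and m, and the window around the pixel depth are all illustrative values rather than values from the disclosure.

```python
import numpy as np

def sample_ts_on_ray(pixel_depth, near=0.1, far=5.0, n=32, m=16, window=0.05):
    """Two-stage sampling of distances along a ray: n uniform samples plus
    m extra samples concentrated around the pixel's depth (assumed surface region)."""
    t_uniform = np.linspace(near, far, n)                              # first sampling points
    t_key = pixel_depth + np.random.uniform(-window, window, size=m)   # second sampling points near the surface
    return np.sort(np.concatenate([t_uniform, t_key]))
```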
Optionally, in some embodiments, the depth value corresponding to the pixel point is determined according to the depth image corresponding to the color image; the SDF value of each sampling point relative to the pixel point is calculated based on the depth value; and the coordinate information of each sampling point is calculated according to the camera parameters and the depth value.

Specifically, after multiple sampling points have been sampled on the ray corresponding to each pixel point, for each sampling point, the distance between the shooting position of the color camera and the corresponding point on the object is determined according to the camera parameters and the depth value of the pixel point, and then the SDF value and the coordinate information of each sampling point are calculated one by one based on this distance.
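Continuing the sketch above, the sampled distances can be turned into world coordinates and approximate SDF values (pixel depth minus the sample's distance along the ray); treating the distance along the ray as the distance to the imaging plane is a simplifying assumption made only for this example.

```python
import numpy as np

def points_and_sdf(ray_origin, ray_direction, ts, pixel_depth):
    """World coordinates of samples along the ray and their SDF values:
    positive outside the surface, negative inside, zero on the surface."""
    d = ray_direction / np.linalg.norm(ray_direction)
    points = ray_origin[None, :] + ts[:, None] * d[None, :]  # first coordinate information of each sample
    sdf = pixel_depth - ts
    return points, sdf
```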
It should be noted that, once training of the base model is complete, the trained base model can predict the SDF value corresponding to the coordinate information of any given point. The predicted SDF value represents the positional relationship between that point and the three-dimensional model of the object (inside, outside, or on the surface). The object's three-dimensional model is thereby expressed implicitly, yielding a neural network model that implicitly represents it.
Finally, isosurface extraction is performed on this neural network model. For example, the Marching Cubes (MC) algorithm can be used to extract the surface of the three-dimensional model, and the object's three-dimensional model is then obtained from that surface.
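As an illustration of this last step, the sketch below runs Marching Cubes (via scikit-image's measure.marching_cubes) over a dense grid of SDF predictions. The network interface (a callable mapping (N, 3) coordinates to (N,) SDF values), the grid resolution, and the bounding box are assumptions.

```python
import numpy as np
import torch
from skimage import measure

@torch.no_grad()
def extract_mesh(sdf_net, resolution=256, bound=1.0):
    """Sketch of isosurface extraction from a trained implicit SDF network."""
    xs = np.linspace(-bound, bound, resolution, dtype=np.float32)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)  # (R,R,R,3)

    pts = torch.from_numpy(grid.reshape(-1, 3))
    sdf = sdf_net(pts).reshape(resolution, resolution, resolution).cpu().numpy()

    # Marching Cubes draws the zero level set, i.e. the model surface.
    verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)

    # Map voxel indices back to world coordinates.
    verts = verts / (resolution - 1) * 2.0 * bound - bound
    return verts, faces, normals
```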
The three-dimensional reconstruction scheme provided by this application implicitly models the object's three-dimensional model with a neural network and incorporates depth to improve both the speed and the accuracy of model training. By continuously performing three-dimensional reconstruction of a subject over time, three-dimensional models of the subject at different moments are obtained; the sequence formed by these models in chronological order is the volumetric video captured of the subject. In this way, "volumetric video shooting" can be performed on any subject to obtain a volumetric video of specific content. For example, a dancing subject can be captured to obtain a volumetric video of the dance that can be viewed from any angle, a teaching subject can be captured to obtain a volumetric video of the lesson that can be viewed from any angle, and so on.
To facilitate implementation of the volumetric video decoding method provided by the embodiments of this application, the embodiments further provide a volumetric video decoding apparatus based on the above decoding method. The terms have the same meanings as in the decoding method described above, and the specific implementation details can be found in the method embodiments. Figure 3 shows a block diagram of a volumetric video decoding apparatus according to an embodiment of this application.
As shown in Figure 3, the volumetric video decoding apparatus 300 may include an acquisition module 310, an extraction module 320, an analysis module 330, and a rendering module 340.
The acquisition module 310 may be used to obtain multiple frames of images to be decoded corresponding to the volumetric video; the extraction module 320 may be used to extract the global features corresponding to each frame of the image to be decoded; the analysis module 330 may be used to perform depth analysis based on the global features of each frame to obtain the depth corresponding to each frame; and the rendering module 340 may be used to perform rendering based on each frame of the image to be decoded together with its global features and depth, obtaining the decoded multi-frame three-dimensional models used to generate the volumetric video.
In some embodiments of this application, the extraction module is configured to: for each frame of the image to be decoded, perform feature extraction on the image to obtain its image features, and perform feature fusion on those image features to obtain the global features of the image.
In some embodiments of this application, the extraction module is configured to perform multi-level encoding on the image to be decoded, obtaining the image features output by the encoding at each level, where each level of encoding consists of convolution followed by max pooling and the image features output by one level are used by the encoding at the next level. The extraction module is further configured to perform multi-level decoding on the features to be fused, obtaining the fused features output by the decoding at each level, where each level of decoding consists of deconvolution, concatenation, and convolution performed in sequence, and the fused features output by one level are used by the decoding at the next level. The concatenation at each level splices the deconvolution features output by the deconvolution with the image features of the same level, producing the concatenated features used for the convolution. The global features of the image to be decoded are obtained from the fused features output by the last decoding level.
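This encode/decode structure resembles a U-Net. The following PyTorch sketch is one possible reading of it, with illustrative channel counts; it is not the application's exact network.

```python
import torch
import torch.nn as nn

class FeatureFusionNet(nn.Module):
    """Minimal sketch of the multi-level encode / decode scheme.

    Each encoding level: convolution followed by max pooling; each decoding
    level: deconvolution, concatenation with the same-level image features,
    then convolution. Channel sizes are illustrative assumptions.
    """
    def __init__(self, in_ch=3, chs=(32, 64, 128)):
        super().__init__()
        self.enc = nn.ModuleList()
        c_prev = in_ch
        for c in chs:
            self.enc.append(nn.Sequential(
                nn.Conv2d(c_prev, c, 3, padding=1), nn.ReLU(inplace=True)))
            c_prev = c
        self.pool = nn.MaxPool2d(2)

        self.up, self.dec = nn.ModuleList(), nn.ModuleList()
        for c in reversed(chs[:-1]):
            self.up.append(nn.ConvTranspose2d(c_prev, c, 2, stride=2))
            self.dec.append(nn.Sequential(
                nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU(inplace=True)))
            c_prev = c

    def forward(self, x):
        skips = []
        for enc in self.enc:
            x = enc(x)                      # convolution
            skips.append(x)                 # same-level image features
            x = self.pool(x)                # max pooling (deepest pool unused)
        x = skips.pop()                     # deepest features start the decoder
        for up, dec in zip(self.up, self.dec):
            x = up(x)                                   # deconvolution
            x = torch.cat([x, skips.pop()], dim=1)      # concatenation
            x = dec(x)                                  # convolution
        return x                            # fused features -> global features
```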
In some embodiments of this application, the extraction module is further configured to: compute an attention distribution over the image features of the same level and compute a weighted average according to that distribution, obtaining the weighted-average features of the level; and splice the deconvolution features output by the deconvolution with the weighted-average features of the same level to obtain the concatenated features used for the convolution.
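A hedged sketch of such an attention-weighted skip connection is shown below; the 1x1-convolution scoring and the broadcasting of the weighted average back to the spatial grid are assumed design choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSkip(nn.Module):
    """Illustrative attention over same-level image features before splicing."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution scoring each spatial position (an assumed design).
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, deconv_feat, skip_feat):
        b, c, h, w = skip_feat.shape
        # Attention distribution over the spatial positions of the skip features.
        attn = F.softmax(self.score(skip_feat).view(b, 1, h * w), dim=-1)
        # Weighted-average feature of the same level (one value per channel),
        # broadcast back to the spatial grid so it can be concatenated.
        weighted = (attn * skip_feat.view(b, c, h * w)).sum(-1, keepdim=True)
        weighted = weighted.view(b, c, 1, 1).expand(-1, -1, h, w)
        # Splice deconvolution features with the weighted-average features.
        return torch.cat([deconv_feat, weighted], dim=1)
```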
In some embodiments of this application, the analysis module is configured to input the global features of each frame of the image to be decoded into a recurrent neural network for depth analysis, obtaining the depth corresponding to each frame output by the recurrent neural network.
In some embodiments of this application, the analysis module is configured to input the global features of each frame of the image to be decoded into a gated recurrent unit for depth analysis, obtaining the depth corresponding to each frame output by the gated recurrent unit.
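As a sketch, the depth analysis with a gated recurrent unit could look like the following; the feature and output dimensions are placeholders, not values specified by this application.

```python
import torch
import torch.nn as nn

class DepthGRU(nn.Module):
    """Sketch of depth analysis with a gated recurrent unit (GRU).

    Input:  per-frame global feature vectors, shape (batch, frames, feat_dim).
    Output: one depth map (flattened to H*W values) per frame; the sizes
            used here are illustrative assumptions.
    """
    def __init__(self, feat_dim=256, hidden=256, depth_hw=64 * 64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, depth_hw)

    def forward(self, global_feats):
        hidden_states, _ = self.gru(global_feats)   # (batch, frames, hidden)
        return self.head(hidden_states)             # (batch, frames, H*W) depths
```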
In some embodiments of this application, the rendering module is configured to input each frame of the image to be decoded, together with its global features and depth, into a convolutional neural network for rendering, obtaining the three-dimensional model corresponding to each frame output by the convolutional neural network.
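One way to sketch this rendering step is shown below: the image, its global feature map, and its depth map are stacked channel-wise and a convolutional network regresses a per-frame volumetric representation. The voxel-grid output form and the channel counts are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RenderCNN(nn.Module):
    """Illustrative convolutional renderer: image + global features + depth in,
    a per-frame volumetric representation out (a voxel grid is assumed here)."""
    def __init__(self, in_ch=3 + 32 + 1, voxel_res=32):
        super().__init__()
        self.voxel_res = voxel_res
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, voxel_res ** 3)

    def forward(self, image, global_feat, depth):
        # Stack the decoded image, its global feature map and its depth map
        # along the channel dimension before the convolutional layers.
        x = torch.cat([image, global_feat, depth], dim=1)
        x = self.body(x).flatten(1)
        occupancy = torch.sigmoid(self.head(x))
        r = self.voxel_res
        return occupancy.view(-1, r, r, r)          # one 3D model per frame
```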
In some embodiments of this application, the apparatus further includes a generation module configured to serialize the decoded multi-frame three-dimensional models in the order of the images to be decoded to which each frame of the three-dimensional model corresponds, obtaining the volumetric video; a trivial sketch of this step follows.
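The sketch assumes the decoded models are keyed by the index of their source image; the plain-list output form is an illustration only.

```python
def serialize_models(models_by_frame):
    """Order decoded 3D models by the index of their source image to form
    the volumetric video (represented here as a plain list of frames)."""
    # models_by_frame: dict mapping the frame index of the image to be decoded
    # to the 3D model decoded from it.
    return [models_by_frame[i] for i in sorted(models_by_frame)]
```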
It should be noted that, although several modules or units of the apparatus are mentioned in the detailed description above, this division is not mandatory. According to the embodiments of this application, the features and functions of two or more of the modules or units described above may be embodied in a single module or unit; conversely, the features and functions of one module or unit described above may be further divided among multiple modules or units.
In addition, an embodiment of this application further provides an electronic device, which may be a terminal or a server. Figure 4 shows a schematic structural diagram of the electronic device involved in this embodiment. Specifically:
The electronic device may include a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the structure shown in Figure 4 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Among them:
The processor 401 is the control center of the electronic device. It connects the various parts of the whole device through various interfaces and lines and, by running or executing the software programs and/or modules stored in the memory 402 and invoking the data stored in the memory 402, performs the device's functions and processes its data, thereby monitoring the electronic device as a whole. Optionally, the processor 401 may include one or more processing cores; preferably, it may integrate an application processor, which mainly handles the operating system, user interface, and applications, and a modem processor, which mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules; the processor 401 runs the software programs and modules stored in the memory 402 to execute various functional applications and process data. The memory 402 may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the applications required by at least one function (such as a sound playback function or an image playback function), while the data storage area may store data created through use of the device. In addition, the memory 402 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may further include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further includes a power supply 403 that supplies power to the components. Preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 403 may also include any components such as one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.
The electronic device may further include an input unit 404, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described here. Specifically, in this embodiment, the processor 401 of the electronic device loads the executable files corresponding to the processes of one or more computer programs into the memory 402 according to the following instructions, and runs the computer programs stored in the memory 402, thereby implementing the various functions of the foregoing embodiments of this application.
For example, the processor 401 may perform the following steps: obtain multiple frames of images to be decoded corresponding to the volumetric video; extract the global features corresponding to each frame of the image to be decoded; perform depth analysis based on the global features of each frame to obtain the depth corresponding to each frame; and perform rendering based on each frame of the image to be decoded together with its global features and depth, obtaining the decoded multi-frame three-dimensional models used to generate the volumetric video.
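Wiring the illustrative modules sketched earlier together, the overall decoding flow could look like the following. All module names and shapes are placeholders, and it is assumed that the modules are constructed with matching dimensions (e.g. the GRU's feature size equal to the fusion network's channel count and its head size equal to H*W); this is not the application's implementation.

```python
import torch

def decode_volumetric_video(frames, fusion_net, depth_gru, render_cnn):
    """End-to-end sketch of the decoding steps listed above.

    frames: tensor of images to be decoded, shape (num_frames, 3, H, W).
    Returns a list of per-frame 3D models that forms the volumetric video.
    """
    with torch.no_grad():
        # 1. Extract the global features of every frame to be decoded.
        global_feats = fusion_net(frames)                        # (F, C, H, W)

        # 2. Depth analysis on the per-frame global features.
        pooled = global_feats.mean(dim=(2, 3)).unsqueeze(0)      # (1, F, C)
        depths = depth_gru(pooled).squeeze(0)                    # (F, H*W), assumes head size == H*W
        depths = depths.view(frames.shape[0], 1, frames.shape[2], frames.shape[3])

        # 3. Render each frame with its global features and depth into a 3D model.
        models = [render_cnn(frames[i:i + 1], global_feats[i:i + 1], depths[i:i + 1])
                  for i in range(frames.shape[0])]
    return models
```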
Those of ordinary skill in the art will understand that all or part of the steps in the methods of the above embodiments may be completed by a computer program, or by a computer program controlling the relevant hardware. The computer program may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of this application further provides a storage medium storing a computer program that can be loaded by a processor to perform the steps of any of the methods provided in the embodiments of this application.
The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Since the computer program stored in the storage medium can perform the steps of any of the methods provided in the embodiments of this application, it can achieve the beneficial effects achievable by those methods; see the preceding embodiments for details, which are not repeated here.
Other embodiments of this application will readily occur to those skilled in the art after considering the specification and practicing the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein.
It should be understood that this application is not limited to the embodiments described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope.

Claims (20)

  1. A method for decoding volumetric video, wherein the method comprises:
    obtaining multiple frames of images to be decoded corresponding to the volumetric video;
    extracting global features corresponding to each frame of the image to be decoded;
    performing depth analysis based on the global features corresponding to each frame of the image to be decoded, to obtain a depth corresponding to each frame of the image to be decoded;
    performing rendering based on each frame of the image to be decoded together with the corresponding global features and the depth, to obtain decoded multi-frame three-dimensional models, the multi-frame three-dimensional models being used to generate the volumetric video.
  2. The method according to claim 1, wherein extracting the global features corresponding to each frame of the image to be decoded comprises:
    for each frame of the image to be decoded, performing feature extraction on the image to be decoded to obtain image features corresponding to the image to be decoded;
    performing feature fusion on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded.
  3. The method according to claim 2, wherein performing feature extraction on the image to be decoded to obtain the image features corresponding to the image to be decoded comprises:
    performing multi-level encoding on the image to be decoded to obtain the image features output by the encoding at each level, wherein the encoding at each level comprises convolution followed by max pooling, and the image features output by the encoding at one level are used for the encoding at the next level;
    and performing feature fusion on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded comprises:
    performing multi-level decoding on features to be fused to obtain fused features output by the decoding at each level, wherein the decoding at each level comprises deconvolution, concatenation, and convolution performed in sequence; the fused features output by the decoding at one level are used for the decoding at the next level; and the concatenation at each level comprises concatenating the deconvolution features output by the deconvolution with the image features of the same level to obtain concatenated features used for the convolution; and,
    obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding at the last level.
  4. The method according to claim 3, wherein concatenating the deconvolution features output by the deconvolution with the image features of the same level to obtain the concatenated features used for the convolution comprises:
    computing an attention distribution over the image features of the same level, and computing a weighted average according to the attention distribution to obtain weighted-average features of the same level;
    concatenating the deconvolution features output by the deconvolution with the weighted-average features of the same level to obtain the concatenated features used for the convolution.
  5. The method according to claim 1, wherein performing depth analysis based on the global features corresponding to each frame of the image to be decoded, to obtain the depth corresponding to each frame of the image to be decoded, comprises:
    inputting the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis, to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
  6. The method according to claim 5, wherein inputting the global features corresponding to each frame of the image to be decoded into the recurrent neural network for depth analysis, to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network, comprises:
    inputting the global features corresponding to each frame of the image to be decoded into a gated recurrent unit for depth analysis, to obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
  7. The method according to claim 1, wherein performing rendering based on each frame of the image to be decoded together with the corresponding global features and the depth, to obtain the decoded multi-frame three-dimensional models, comprises:
    inputting each frame of the image to be decoded together with the corresponding global features and the depth into a convolutional neural network for rendering, to obtain the three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.
  8. The method according to claim 1, wherein, after performing rendering based on each frame of the image to be decoded together with the corresponding global features and the depth to obtain the decoded multi-frame three-dimensional models, the method comprises:
    serializing the decoded multi-frame three-dimensional models in the order of the images to be decoded to which each frame of the three-dimensional model corresponds, to obtain the volumetric video.
  9. The method according to claim 1, wherein extracting the global features corresponding to each frame of the image to be decoded comprises:
    for each frame of the image to be decoded, computing a histogram corresponding to the frame by means of a histogram calculation function and using the histogram as the global features.
  10. The method according to claim 3, wherein obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding at the last level comprises:
    using the fused features output by the decoding at the last level as the global features corresponding to the image to be decoded.
  11. The method according to claim 3, wherein obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding at the last level comprises:
    performing dimensionality reduction on the fused features output by the decoding at the last level, and using the dimension-reduced features as the global features corresponding to the image to be decoded.
  12. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor of a computer, the computer is caused to perform operations comprising:
    obtaining multiple frames of images to be decoded corresponding to the volumetric video;
    extracting global features corresponding to each frame of the image to be decoded;
    performing depth analysis based on the global features corresponding to each frame of the image to be decoded, to obtain a depth corresponding to each frame of the image to be decoded;
    performing rendering based on each frame of the image to be decoded together with the corresponding global features and the depth, to obtain decoded multi-frame three-dimensional models, the multi-frame three-dimensional models being used to generate the volumetric video.
  13. The non-transitory computer-readable storage medium according to claim 12, wherein extracting the global features corresponding to each frame of the image to be decoded comprises:
    for each frame of the image to be decoded, performing feature extraction on the image to be decoded to obtain image features corresponding to the image to be decoded;
    performing feature fusion on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded.
  14. The non-transitory computer-readable storage medium according to claim 13, wherein performing feature extraction on the image to be decoded to obtain the image features corresponding to the image to be decoded comprises:
    performing multi-level encoding on the image to be decoded to obtain the image features output by the encoding at each level, wherein the encoding at each level comprises convolution followed by max pooling, and the image features output by the encoding at one level are used for the encoding at the next level;
    and performing feature fusion on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded comprises:
    performing multi-level decoding on features to be fused to obtain fused features output by the decoding at each level, wherein the decoding at each level comprises deconvolution, concatenation, and convolution performed in sequence; the fused features output by the decoding at one level are used for the decoding at the next level; and the concatenation at each level comprises concatenating the deconvolution features output by the deconvolution with the image features of the same level to obtain concatenated features used for the convolution;
    obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding at the last level.
  15. The non-transitory computer-readable storage medium according to claim 14, wherein concatenating the deconvolution features output by the deconvolution with the image features of the same level to obtain the concatenated features used for the convolution comprises:
    computing an attention distribution over the image features of the same level, and computing a weighted average according to the attention distribution to obtain weighted-average features of the same level;
    concatenating the deconvolution features output by the deconvolution with the weighted-average features of the same level to obtain the concatenated features used for the convolution.
  16. The non-transitory computer-readable storage medium according to claim 12, wherein performing depth analysis based on the global features corresponding to each frame of the image to be decoded, to obtain the depth corresponding to each frame of the image to be decoded, comprises:
    inputting the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis, to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
  17. An electronic device, comprising: a memory storing a computer program; and a processor that reads the computer program stored in the memory to perform operations comprising:
    obtaining multiple frames of images to be decoded corresponding to the volumetric video;
    extracting global features corresponding to each frame of the image to be decoded;
    performing depth analysis based on the global features corresponding to each frame of the image to be decoded, to obtain a depth corresponding to each frame of the image to be decoded;
    performing rendering based on each frame of the image to be decoded together with the corresponding global features and the depth, to obtain decoded multi-frame three-dimensional models, the multi-frame three-dimensional models being used to generate the volumetric video.
  18. The electronic device according to claim 17, wherein extracting the global features corresponding to each frame of the image to be decoded comprises:
    for each frame of the image to be decoded, performing feature extraction on the image to be decoded to obtain image features corresponding to the image to be decoded;
    performing feature fusion on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded.
  19. The electronic device according to claim 18, wherein performing feature extraction on the image to be decoded to obtain the image features corresponding to the image to be decoded comprises:
    performing multi-level encoding on the image to be decoded to obtain the image features output by the encoding at each level, wherein the encoding at each level comprises convolution followed by max pooling, and the image features output by the encoding at one level are used for the encoding at the next level;
    and performing feature fusion on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded comprises:
    performing multi-level decoding on features to be fused to obtain fused features output by the decoding at each level, wherein the decoding at each level comprises deconvolution, concatenation, and convolution performed in sequence; the fused features output by the decoding at one level are used for the decoding at the next level; and the concatenation at each level comprises concatenating the deconvolution features output by the deconvolution with the image features of the same level to obtain concatenated features used for the convolution;
    obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding at the last level.
  20. The electronic device according to claim 19, wherein concatenating the deconvolution features output by the deconvolution with the image features of the same level to obtain the concatenated features used for the convolution comprises:
    computing an attention distribution over the image features of the same level, and computing a weighted average according to the attention distribution to obtain weighted-average features of the same level;
    concatenating the deconvolution features output by the deconvolution with the weighted-average features of the same level to obtain the concatenated features used for the convolution.
PCT/CN2023/075406 2023-01-31 2023-02-10 Decoding method for volumetric video, and storage medium and electronic device WO2024159553A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310064989.1 2023-01-31
CN202310064989.1A CN116095338A (en) 2023-01-31 2023-01-31 Decoding method, device, medium, equipment and product of volume video

Publications (1)

Publication Number Publication Date
WO2024159553A1 true WO2024159553A1 (en) 2024-08-08

Family

ID=86198873

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/075406 WO2024159553A1 (en) 2023-01-31 2023-02-10 Decoding method for volumetric video, and storage medium and electronic device

Country Status (2)

Country Link
CN (1) CN116095338A (en)
WO (1) WO2024159553A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116095338A (en) * 2023-01-31 2023-05-09 珠海普罗米修斯视觉技术有限公司 Decoding method, device, medium, equipment and product of volume video

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3457688A1 (en) * 2017-09-15 2019-03-20 Thomson Licensing Methods and devices for encoding and decoding three degrees of freedom and volumetric compatible video stream
CN110874864A (en) * 2019-10-25 2020-03-10 深圳奥比中光科技有限公司 Method, device, electronic equipment and system for obtaining three-dimensional model of object
CN111325068A (en) * 2018-12-14 2020-06-23 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN112101252A (en) * 2020-09-18 2020-12-18 广州云从洪荒智能科技有限公司 Image processing method, system, device and medium based on deep learning
CN112669441A (en) * 2020-12-09 2021-04-16 北京达佳互联信息技术有限公司 Object reconstruction method and device, electronic equipment and storage medium
EP3873095A1 (en) * 2020-02-27 2021-09-01 Nokia Technologies Oy An apparatus, a method and a computer program for omnidirectional video
CN115578542A (en) * 2022-10-27 2023-01-06 珠海普罗米修斯视觉技术有限公司 Three-dimensional model processing method, device, equipment and computer readable storage medium
CN116095338A (en) * 2023-01-31 2023-05-09 珠海普罗米修斯视觉技术有限公司 Decoding method, device, medium, equipment and product of volume video


Also Published As

Publication number Publication date
CN116095338A (en) 2023-05-09

Similar Documents

Publication Publication Date Title
US20240046557A1 (en) Method, device, and non-transitory computer-readable storage medium for reconstructing a three-dimensional model
CN111542861A (en) System and method for rendering an avatar using a depth appearance model
WO2022205760A1 (en) Three-dimensional human body reconstruction method and apparatus, and device and storage medium
WO2024051445A1 (en) Image generation method and related device
WO2020247075A1 (en) Novel pose synthesis
EP3991140A1 (en) Portrait editing and synthesis
US20220156987A1 (en) Adaptive convolutions in neural networks
CN116977522A (en) Rendering method and device of three-dimensional model, computer equipment and storage medium
WO2022205755A1 (en) Texture generation method and apparatus, device, and storage medium
WO2024159553A1 (en) Decoding method for volumetric video, and storage medium and electronic device
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
WO2024027063A1 (en) Livestream method and apparatus, storage medium, electronic device and product
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
CN117274446A (en) Scene video processing method, device, equipment and storage medium
Li et al. Dynamic View Synthesis with Spatio-Temporal Feature Warping from Sparse Views
CN116132653A (en) Processing method and device of three-dimensional model, storage medium and computer equipment
CN116245989A (en) Method and device for processing volume video, storage medium and computer equipment
CN116095353A (en) Live broadcast method and device based on volume video, electronic equipment and storage medium
CN115497029A (en) Video processing method, device and computer readable storage medium
CN110689602A (en) Three-dimensional face reconstruction method, device, terminal and computer readable storage medium
WO2024124664A1 (en) Video processing method and apparatus, computer device, and computer-readable storage medium
WO2024159555A1 (en) Video processing method and apparatus, and computer readable storage medium
CN117422809B (en) Data processing method for rendering light field image
CN115035230B (en) Video rendering processing method, device and equipment and storage medium
CN115442634A (en) Image compression method, device, storage medium, electronic equipment and product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23919127

Country of ref document: EP

Kind code of ref document: A1