WO2024159553A1 - Decoding method for volumetric video, and storage medium and electronic device - Google Patents

Decoding method for volumetric video, and storage medium and electronic device

Info

Publication number
WO2024159553A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
decoded
features
processing
frame
Prior art date
Application number
PCT/CN2023/075406
Other languages
French (fr)
Chinese (zh)
Inventor
张煜
岳鑫
邵志兢
孙伟
Original Assignee
珠海普罗米修斯视觉技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 珠海普罗米修斯视觉技术有限公司 filed Critical 珠海普罗米修斯视觉技术有限公司
Publication of WO2024159553A1


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding

Definitions

  • the present application relates to the field of computer technology, and in particular to a volumetric video decoding method, a storage medium, and an electronic device.
  • Volumetric video is a model sequence of continuous three-dimensional models. Volumetric video usually includes a large number of three-dimensional models. There is usually a need to encode and decode volumetric video. At present, in the related technology, there is a solution that encodes the three-dimensional model in the volumetric video into encoded data such as vertex data and facet data, and decodes the encoded data through a large number of complex decoding calculations to play the volumetric video.
  • the decoding of volumetric video has the problem of low decoding efficiency and poor decoding effect.
  • the embodiment of the present application provides a solution that can improve the decoding efficiency of volumetric video and improve the decoding effect.
  • a method for decoding a volumetric video includes: obtaining multiple frames of images to be decoded corresponding to the volumetric video; extracting global features corresponding to each frame of the image to be decoded; performing depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; performing rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth to obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
  • the extracting of global features corresponding to each frame of the image to be decoded includes: for each frame of the image to be decoded, performing feature extraction processing on the image to be decoded to obtain image features corresponding to the image to be decoded; performing feature fusion processing on the image features corresponding to the image to be decoded to obtain global features corresponding to the image to be decoded.
  • the deconvolution features output by the deconvolution processing are spliced with the image features of the same level to obtain the spliced features for convolution processing, including: calculating the attention distribution for the image features of the same level, and calculating the weighted average based on the attention distribution to obtain the weighted average features of the same level; splicing the deconvolution features output by the deconvolution processing with the weighted average features of the same level to obtain the spliced features for convolution processing.
  • the depth analysis processing is performed based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded, including: inputting the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
  • the global features corresponding to each frame of the image to be decoded are respectively input into a recurrent neural network for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network, including: the global features corresponding to each frame of the image to be decoded are respectively input into a gated recurrent unit for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
  • rendering processing is performed based on each frame of the image to be decoded and the corresponding global features and the depth to obtain a decoded multi-frame three-dimensional model, including: inputting each frame of the image to be decoded and the corresponding global features and the depth into a convolutional neural network for rendering processing to obtain a three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.
  • the method includes: serializing the decoded multi-frame three-dimensional model in the order corresponding to the image to be decoded corresponding to each frame of the three-dimensional model to obtain the volumetric video.
  • a decoding device for volumetric video includes: an acquisition module, which is used to acquire multiple frames of images to be decoded corresponding to the volumetric video; an extraction module, which is used to extract global features corresponding to each frame of the image to be decoded; an analysis module, which is used to perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded, and obtain the depth corresponding to each frame of the image to be decoded; a rendering module, which is used to perform rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth, and obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
  • the extraction module is used to perform feature extraction processing on the image to be decoded for each frame of the image to be decoded to obtain image features corresponding to the image to be decoded; and perform feature fusion processing on the image features corresponding to the image to be decoded to obtain global features corresponding to the image to be decoded.
  • the extraction module is used to: perform multi-level encoding processing on the image to be decoded to obtain image features output by the encoding processing at each level; wherein the encoding processing at each level includes convolution processing and maximum pooling processing performed in sequence; the image features output by the encoding processing at the previous level are used for the encoding processing at the next level; the extraction module is also used to: perform multi-level decoding processing on the features to be fused to obtain fused features output by the decoding processing at each level; wherein the decoding processing at each level includes deconvolution processing, splicing processing and convolution processing performed in sequence; the fused features output by the decoding processing at the previous level are used for the decoding processing at the next level; the splicing processing at each level includes: splicing the deconvolution features output by the deconvolution processing with the image features of the same level to obtain splicing features for convolution processing; and obtaining the global features corresponding to the image to be decoded based on the fused features output by the last level of decoding processing.
  • the extraction module is also used to: calculate the attention distribution of the image features at the same level, and calculate the weighted average based on the attention distribution to obtain the weighted average features at the same level; splice the deconvolution features output by the deconvolution processing with the weighted average features at the same level to obtain the spliced features for convolution processing.
  • the analysis module is used to: input the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis and processing, and obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
  • the analysis module is used to: input the global features corresponding to each frame of the image to be decoded into the gated recurrent unit for depth analysis processing, and obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
  • the rendering module is used to: input each frame of the image to be decoded and the corresponding global features and the depth into a convolutional neural network for rendering processing, so as to obtain a three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.
  • the device further includes a generation module, which is used to: serialize the decoded multi-frame three-dimensional models in the order corresponding to the image to be decoded corresponding to each frame of the three-dimensional model to obtain the volumetric video.
  • a storage medium stores a computer program thereon, and when the computer program is executed by a processor of a computer, the computer executes the method described in the embodiment of the present application.
  • an electronic device may include: a memory storing a computer program; and a processor reading the computer program stored in the memory to execute the method described in the embodiment of the present application.
  • a computer program product or a computer program includes a computer instruction stored in a computer-readable storage medium.
  • a processor of a computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the method provided in various optional implementations described in the embodiments of the present application.
  • multiple frames of images to be decoded corresponding to the volumetric video are obtained; global features corresponding to each frame of the image to be decoded are extracted; depth analysis processing is performed based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; rendering processing is performed based on each frame of the image to be decoded and the corresponding global features and the depth to obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
  • the volumetric video is provided in the form of multiple frames of images to be decoded.
  • the global features of each frame of the image to be decoded are enhanced, and the depth is analyzed through the global features.
  • the three-dimensional model is rendered by combining the image to be decoded, the global features and the depth. The overall decoding process is efficient and the decoded three-dimensional model is highly reliable, which can effectively improve the decoding efficiency of the volumetric video and improve the decoding effect.
  • FIG. 1 shows a schematic diagram of a system to which an embodiment of the present application can be applied.
  • FIG. 2 shows a flow chart of a method for decoding volumetric video according to an embodiment of the present application.
  • FIG. 3 shows a block diagram of a volumetric video decoding device according to another embodiment of the present application.
  • FIG. 4 shows a block diagram of an electronic device according to an embodiment of the present application.
  • FIG. 1 shows a schematic diagram of a system 100 to which an embodiment of the present application can be applied.
  • the system 100 may include a server 101 and a terminal 102 .
  • Server 101 can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers. It can also be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, as well as big data and artificial intelligence platforms.
  • the terminal 102 may be any device, including but not limited to mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, VR/AR devices, smart watches, etc.
  • the server 101 or the terminal 102 may be a node device in a blockchain network or a map vehicle networking platform.
  • the server 101 or the terminal 102 may: obtain multiple frames of images to be decoded corresponding to the volumetric video; extract global features corresponding to each frame of the image to be decoded; perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; perform rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth respectively to obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
  • Fig. 2 schematically shows a flow chart of a method for decoding volumetric video according to an embodiment of the present application.
  • the method for decoding volumetric video may be executed by any device, such as the server 101 or the terminal 102 shown in Fig. 1 .
  • the volumetric video decoding method may include steps S210 to S240 .
  • Step S210 obtaining multiple frames of images to be decoded corresponding to the volumetric video; step S220, extracting global features corresponding to each frame of the image to be decoded; step S230, performing depth analysis processing based on the global features corresponding to each frame of the image to be decoded, and obtaining the depth corresponding to each frame of the image to be decoded; step S240, performing rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth, and obtaining a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
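  • For orientation only, the following is a minimal sketch (not part of the disclosure) of how steps S210 to S240 could be wired together in code; the three helper callables are hypothetical names standing in for the feature-extraction, depth-analysis and rendering networks described below.

```python
# Hypothetical sketch of steps S210-S240. The three callables are placeholders for the
# networks discussed later in this description; they are not defined by the application.
def decode_volumetric_video(frames_to_decode, extract_global_features, estimate_depth, render_model):
    models = []
    for image in frames_to_decode:                        # S210: frames to be decoded, already obtained
        feats = extract_global_features(image)            # S220: global features for this frame
        depth = estimate_depth(feats)                     # S230: depth analysis from the global features
        models.append(render_model(image, feats, depth))  # S240: render one 3D model per frame
    return models  # connected in frame order afterwards to regenerate the volumetric video
```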
  • Volumetric video is a model sequence of multi-frame three-dimensional models.
  • the three-dimensional model can be a three-dimensional model corresponding to a person, an animal, etc.
  • Volumetric video can demonstrate the behavior of an object (such as dancing) through continuous multi-frame three-dimensional models.
  • Each frame of the three-dimensional model can be reconstructed through multiple two-dimensional images from multiple perspectives.
  • the created volumetric video may be encoded in advance into corresponding multiple frames of images to be decoded.
  • the volumetric video may be encoded into corresponding multiple frames of images to be decoded in a manner that: the volumetric video may be encoded into a multi-perspective color image for reconstructing each frame of the three-dimensional model in the volumetric video; or the volumetric video may be encoded into a multi-perspective model image captured from different angles for each frame of the three-dimensional model.
  • Each frame of the image to be decoded may correspond to a frame of the three-dimensional model, and each frame of the image to be decoded may include at least one image.
  • the global features containing global information corresponding to each frame of the image to be decoded can be extracted.
  • a depth analysis process is performed to obtain the depth corresponding to each frame of the image to be decoded.
  • rendering is performed based on each frame of the image to be decoded and the global features and depth corresponding to the image to be decoded, so as to obtain a decoded multi-frame 3D model.
  • the multi-frame 3D model can be serialized to obtain the restored/decoded volumetric video.
  • the volumetric video is provided in the form of multiple frames of images to be decoded.
  • the global features of each frame of the image to be decoded are enhanced, and the depth is analyzed through the global features.
  • the three-dimensional model is rendered by combining the image to be decoded, the global features and the depth. The overall decoding process is efficient and the decoded three-dimensional model is highly reliable, which can effectively improve the decoding efficiency of the volumetric video and improve the decoding effect.
  • step S220 extracting global features corresponding to each frame of the image to be decoded, includes: for each frame of the image to be decoded, performing feature extraction processing on the image to be decoded to obtain image features corresponding to the image to be decoded; performing feature fusion processing on the image features corresponding to the image to be decoded to obtain global features corresponding to the image to be decoded.
  • the image to be decoded can be subjected to feature extraction processing through a feature extraction network (such as a convolutional network) to obtain image features corresponding to the image to be decoded. Furthermore, the image features corresponding to the image to be decoded can be subjected to feature fusion processing through a feature fusion network (such as a fully connected network) to obtain global features corresponding to the image to be decoded.
  • step S220, extracting the global features corresponding to each frame of the image to be decoded, includes: for each frame of the image to be decoded, calculating the histogram corresponding to the image to be decoded as its global feature through a histogram calculation function (such as the histogram calculation function in OpenCV).
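  • As a hedged illustration of this histogram-based alternative, a per-frame global feature could be computed with OpenCV's calcHist; the use of three color channels and 64 bins below is an assumption, not something specified by the application.

```python
import cv2
import numpy as np

def histogram_global_feature(image_bgr, bins=64):
    # One histogram per color channel, concatenated into a single global feature vector
    # (channel count and bin size are illustrative choices).
    hists = [cv2.calcHist([image_bgr], [c], None, [bins], [0, 256]) for c in range(3)]
    feature = np.concatenate(hists).ravel().astype(np.float32)
    return feature / (feature.sum() + 1e-8)  # normalized so frames of different sizes are comparable
```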
  • a multi-level encoding process is performed on the image to be decoded through a feature extraction network (encoder).
  • the feature extraction network may include a multi-layer cascaded extraction network, and each level of the extraction network can output the image features of the corresponding level through the encoding process.
  • the encoding process performed in the extraction network of each level specifically includes convolution processing and maximum pooling processing performed in sequence.
  • the encoding process performed in the extraction network of the first level may include: firstly performing convolution processing on the image to be decoded to obtain convolution features, and then performing maximum pooling processing on the convolution features to obtain the image features output by the encoding process of the first level. Further, the image features output by the encoding process of the previous level are used for the encoding process of the next level.
  • the image features of the first level are used as the input features of the extraction network of the second level.
  • the image features of the first level are firstly convolved to obtain convolution features, and then the convolution features are subjected to maximum pooling processing to obtain the image features output by the encoding process of the second level.
  • the features to be fused are subjected to multi-level decoding processing through a feature fusion network (decoder), and the feature fusion network (decoder) may include a multi-layer cascaded fusion network, and each level of the fusion network may output the fusion features of the corresponding level through decoding processing.
  • the decoding processing performed in the fusion network of each level specifically includes deconvolution processing, splicing processing and convolution processing performed in sequence.
  • the decoding processing in the fusion network of the first level may include: first performing deconvolution processing on the features to be fused to obtain deconvolution features, then performing splicing processing on the deconvolution features to obtain spliced features, and then performing convolution processing on the spliced features to obtain the fused features output by the decoding processing of the first level.
  • the fusion features output by the decoding processing of the previous level are used for the decoding processing of the next level.
  • the fusion features of the first level are used as the input features of the fusion network of the second level, and the fusion features of the first level are first subjected to deconvolution processing, splicing processing and convolution processing in sequence in the fusion network of the second level.
  • the splicing processing at each level includes: splicing the deconvolution features output by the deconvolution processing with the image features at the same level to obtain the splicing features used for the convolution processing.
  • the splicing processing at the first level includes: splicing the deconvolution features output by the deconvolution processing at the first level with the image features output by the encoding processing at the first level to obtain the splicing features used for the convolution processing at the first level.
  • the global features corresponding to the image to be decoded are obtained.
  • the fused features output by the last level of decoding processing can be used as the global features corresponding to the image to be decoded, or the fused features output by the last level of decoding processing can be reduced in dimension, and the reduced-dimensional features can be used as the global features corresponding to the image to be decoded.
  • the feature extraction network is a UNet network, which includes a feature extraction network (encoder) on the left and a feature fusion network (decoder) on the right.
  • the feature extraction network (encoder) includes 4 layers of extraction networks, and the feature fusion network (decoder) also includes 4 layers of fusion networks.
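  • The PyTorch sketch below shows one possible shape of such a four-level UNet-style encoder/decoder, with convolution plus max pooling at each encoding level and deconvolution plus same-level splicing plus convolution at each decoding level; the channel widths, activations and final output choice are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class UNetGlobalFeatures(nn.Module):
    """Illustrative 4-level encoder/decoder; channel widths and activations are assumptions."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chs:  # each encoding level: convolution, then max pooling (pooling applied in forward)
            self.encoders.append(nn.Sequential(nn.Conv2d(prev, c, 3, padding=1), nn.ReLU()))
            prev = c
        self.pool = nn.MaxPool2d(2)
        self.upsamples = nn.ModuleList()
        self.decoders = nn.ModuleList()
        for c in reversed(chs):  # each decoding level: deconvolution, splice, convolution
            self.upsamples.append(nn.ConvTranspose2d(prev, c, kernel_size=2, stride=2))
            self.decoders.append(nn.Sequential(nn.Conv2d(c * 2, c, 3, padding=1), nn.ReLU()))
            prev = c

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)        # convolution processing
            skips.append(x)   # image features of this level, kept for the same-level splice
            x = self.pool(x)  # maximum pooling processing
        for up, dec, skip in zip(self.upsamples, self.decoders, reversed(skips)):
            x = up(x)                        # deconvolution processing
            x = torch.cat([x, skip], dim=1)  # splicing with the same-level image features
            x = dec(x)                       # convolution processing
        return x  # fused features of the last decoding level, taken here as the global features

# Example (illustrative): feats = UNetGlobalFeatures()(torch.randn(1, 3, 256, 256))
# The input height and width must be divisible by 16 for the four pooling steps.
```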
  • the step of splicing the deconvolution features output by the deconvolution process with the image features at the same level to obtain the spliced features for performing the convolution process includes: calculating the attention distribution for the image features at the same level, and calculating the weighted average according to the attention distribution to obtain the weighted average features at the same level; and splicing the deconvolution features output by the deconvolution processing with the weighted average features at the same level to obtain the spliced features for convolution processing.
  • the weighted average features of the image features at the same level are further calculated through the attention mechanism, and then the deconvolution features output by the deconvolution process are spliced with the weighted average features to obtain the spliced features for convolution processing.
  • the weighted average features of the image features output by the encoding process of the first level are calculated through the attention mechanism, and then the deconvolution features output by the deconvolution process of the first level are spliced with the weighted average features of the first level to obtain the spliced features for convolution processing in the first level.
  • the extracted global features can contain more global information, further improving the decoding effect of the three-dimensional model as a whole.
  • the weighted average features of the image features at the same level are calculated through the attention mechanism, which specifically includes: calculating the attention distribution of the image features at the same level, and calculating the weighted average according to the attention distribution to obtain the weighted average features at the same level.
  • the weighted average features can be calculated through the soft attention mechanism.
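  • A hedged sketch of this attention-weighted splice, using a soft (softmax-based) attention distribution over the spatial positions of the same-level image features; the 1x1-convolution scoring head and the way the weighted average is broadcast back to the spatial grid for splicing are assumptions not fixed by the application.

```python
import torch
import torch.nn as nn

class AttentionSplice(nn.Module):
    """Splice deconvolution features with attention-weighted same-level image features.
    The scoring head and soft (softmax) attention are illustrative assumptions."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # produces one attention score per position

    def forward(self, deconv_feat, skip_feat):
        b, c, h, w = skip_feat.shape
        # Attention distribution over the spatial positions of the same-level image features.
        scores = self.score(skip_feat).view(b, 1, h * w)
        attn = torch.softmax(scores, dim=-1).view(b, 1, h, w)
        # Weighted average of the image features under that distribution.
        weighted_avg = (skip_feat * attn).sum(dim=(2, 3), keepdim=True)          # (b, c, 1, 1)
        # Broadcast back to the spatial grid so it can be spliced with the deconvolution features.
        weighted_avg = weighted_avg.expand(-1, -1, deconv_feat.shape[2], deconv_feat.shape[3])
        return torch.cat([deconv_feat, weighted_avg], dim=1)  # spliced features for the next convolution
```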
  • the depth analysis processing is performed based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded, including: inputting the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis processing, and obtaining the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
  • a recurrent neural network performs depth analysis based on the global features corresponding to each frame of the image to be decoded, and obtains the depth corresponding to each frame of the image to be decoded.
  • the sequence of global features corresponding to the image to be decoded can be input into the recurrent neural network for depth analysis, and the depth corresponding to the image to be decoded output by the recurrent neural network can be obtained.
  • the parameters in the recurrent neural network are shared at different times, and the accurate depth can be output through analysis and processing.
  • the recurrent neural network can specifically include a long short-term memory network (LSTM) or a gated recurrent unit (GRU), etc.
  • the global features corresponding to each frame of the image to be decoded are respectively input into a recurrent neural network for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network, including: the global features corresponding to each frame of the image to be decoded are respectively input into a gated recurrent unit for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
  • a gated recurrent unit is specifically used to perform depth analysis based on the global features corresponding to each frame of the image to be decoded, and the depth corresponding to each frame of the image to be decoded is obtained.
  • the sequence of global features corresponding to the image to be decoded is input into the gated recurrent unit (GRU) for depth analysis, and the depth corresponding to the image to be decoded output by the gated recurrent unit (GRU) is obtained.
  • the gated recurrent unit (GRU) is a gated recurrent neural network that can efficiently analyze and obtain reliable depth.
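  • A minimal sketch of the GRU-based depth analysis, assuming the per-frame global features are pooled into a vector before entering the gated recurrent unit; the pooling step, hidden size and one-depth-output-per-frame head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GRUDepth(nn.Module):
    """Per-frame depth analysis over the sequence of global features (illustrative sizes)."""
    def __init__(self, feat_dim=32, hidden=128, depth_dim=1):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, depth_dim)

    def forward(self, global_feats):               # (batch, frames, channels, H, W)
        pooled = global_feats.mean(dim=(3, 4))     # assumption: spatial average to one vector per frame
        out, _ = self.gru(pooled)                  # recurrent parameters are shared across time steps
        return self.head(out)                      # one depth value (or small vector) per frame
```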
  • rendering processing is performed based on each frame of the image to be decoded and the corresponding global features and the depth to obtain a decoded multi-frame three-dimensional model, including: inputting each frame of the image to be decoded and the corresponding global features and the depth into a convolutional neural network for rendering processing to obtain a three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.
  • a convolutional neural network is used to render the image to be decoded and the global features and depth corresponding to the image to be decoded to obtain a decoded three-dimensional model.
  • the first frame of the image to be decoded and the global features and depth corresponding to the first frame of the image to be decoded are input into the convolutional neural network for rendering, and the three-dimensional model corresponding to the first frame of the image to be decoded output by the convolutional neural network is obtained.
  • the three-dimensional models corresponding to other frames of the image to be decoded can be obtained, and then the decoded multi-frame three-dimensional models are obtained.
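  • The application does not fix the rendering network's architecture or the representation of the output three-dimensional model; purely as an illustration, the sketch below concatenates the image to be decoded, the (spatially matched) global features and the depth map along the channel axis and lets a small convolutional network predict a per-pixel 3D point, which is only one possible model representation.

```python
import torch
import torch.nn as nn

class RenderCNN(nn.Module):
    """Illustrative rendering network: image + global features + depth -> per-pixel 3D points."""
    def __init__(self, image_ch=3, feat_ch=32, depth_ch=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(image_ch + feat_ch + depth_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, 1),  # assumption: XYZ coordinates per pixel as the model output
        )

    def forward(self, image, global_feats, depth):
        # Assumption: all three inputs share the image resolution so they can be concatenated.
        x = torch.cat([image, global_feats, depth], dim=1)
        return self.net(x)
```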
  • the rendering process is performed based on each frame of the image to be decoded and the corresponding global features and the depth to obtain a decoded multi-frame 3D model
  • the multi-frame 3D model is used to generate the volumetric video, including: serializing the decoded multi-frame 3D model in the order corresponding to the image to be decoded corresponding to each frame of the 3D model to obtain the volumetric video.
  • a corresponding frame of the 3D model can be obtained for each frame of the image to be decoded, and all the obtained 3D models are sequentially connected in the order of the corresponding images to be decoded to obtain the decoded volumetric video.
  • Volumetric video is also called spatial video, volumetric 3D video, 6-DOF video, etc.
  • volumetric video in the aforementioned embodiments of the present application is a technology that captures information in 3D space (such as depth and color information, etc.) and generates a 3D dynamic model sequence.
  • volumetric videos add the concept of space to videos, using 3D models to better restore the real 3D world, rather than using 2D flat videos plus camera movements to simulate the sense of space of the real 3D world.
  • volumetric video is essentially a 3D model sequence, users can adjust to any viewing angle to watch it as they like, and it has a higher degree of restoration and immersion than 2D flat videos.
  • the three-dimensional model used to construct the volumetric video can be reconstructed as follows: First, color images and depth images of the object from different perspectives, as well as camera parameters corresponding to the color images, are obtained; then, a neural network model that implicitly expresses the three-dimensional model of the object is trained based on the obtained color images and their corresponding depth images and camera parameters, and isosurface extraction is performed based on the trained neural network model to achieve three-dimensional reconstruction of the object and obtain a three-dimensional model of the object.
  • the present application embodiment does not specifically limit the neural network model architecture, and can be selected by those skilled in the art according to actual needs.
  • a multilayer perceptron (MLP) without a normalization layer can be selected as the basic model for model training.
  • multiple color cameras and depth cameras can be used simultaneously to shoot the object to be three-dimensionally reconstructed from multiple perspectives, obtaining color images and corresponding depth images of the object at multiple different perspectives at the same shooting time (shots whose actual shooting times differ by no more than a time threshold are regarded as having the same shooting time): the color camera at each perspective captures a color image of the object at that perspective, and correspondingly, the depth camera at each perspective captures a depth image of the object at that perspective.
  • the object can be any object, including but not limited to living objects such as people, animals and plants, or non-living objects such as machinery, furniture, and dolls.
  • the color images of the object at different viewing angles all have corresponding depth images, that is, when shooting, the color camera and the depth camera can adopt the configuration of a camera group, and the color camera of the same viewing angle cooperates with the depth camera to synchronously shoot the same object.
  • a studio can be built, the central area of the studio is the shooting area, and around the shooting area, multiple groups of color cameras and depth cameras are paired and arranged at certain angles in the horizontal and vertical directions.
  • the color images of the object at different viewing angles and the corresponding depth images can be obtained by shooting with these color cameras and depth cameras.
  • the camera parameters of the color camera corresponding to each color image are further obtained.
  • the camera parameters include the internal and external parameters of the color camera, which can be determined by calibration.
  • the camera internal parameters are parameters related to the characteristics of the color camera itself, including but not limited to the focal length, pixels and other data of the color camera.
  • the camera external parameters are parameters of the color camera in the world coordinate system, including but not limited to the position (coordinates) of the color camera and the rotation direction of the camera.
  • the object can be reconstructed in three dimensions based on these color images and their corresponding depth images.
  • the present application trains a neural network model to realize the implicit expression of the three-dimensional model of the object, thereby realizing the three-dimensional reconstruction of the object based on the neural network model.
  • the present application uses a multilayer perceptron (MLP) without a normalization layer as the basic model and trains it in the following manner: Based on the corresponding camera parameters, the pixel points in each color image are converted into rays; multiple sampling points are sampled on the rays, and the first coordinate information of each sampling point and the SDF value of each sampling point from the pixel point are determined; the first coordinate information of the sampling point is input into the basic model to obtain the predicted SDF value and the predicted RGB color value of each sampling point output by the basic model; based on the first difference between the predicted SDF value and the SDF value, and the second difference between the predicted RGB color value and the RGB color value of the pixel point, the parameters of the basic model are adjusted until the preset stop condition is met; the basic model that meets the preset stop condition is used as the neural network model of the three-dimensional model of the implicit expression object.
  • a pixel in the color image is converted into a ray, which can be a ray passing through the pixel and perpendicular to the color image surface; then, multiple sampling points are sampled on the ray, and the sampling process of the sampling points can be performed in two steps.
  • some sampling points can be uniformly sampled first, and then multiple additional sampling points can be sampled at key positions based on the depth value of the pixel point, so that as many sampling points as possible lie near the model surface; then, the first coordinate information of each sampled point in the world coordinate system and the signed distance field (SDF) value of each sampling point are calculated according to the camera parameters and the depth value of the pixel point. The SDF value can be taken as the difference between the depth value of the pixel point and the distance from the sampling point to the imaging plane of the camera.
  • the difference is a signed value. When the difference is a positive value, it indicates that the sampling point is outside the three-dimensional model. When the difference is a negative value, it indicates that the sampling point is inside the three-dimensional model. When the difference is zero, it indicates that the sampling point is on the surface of the three-dimensional model.
  • the first coordinate information of the sampling point in the world coordinate system is further input into a basic model (the basic model is configured to map the input coordinate information into an SDF value and an RGB color value and then output), the SDF value output by the basic model is recorded as a predicted SDF value, and the RGB color value output by the basic model is recorded as a predicted RGB color value. Then, based on a first difference between the predicted SDF value and the SDF value corresponding to the sampling point, and a second difference between the predicted RGB color value and the RGB color value of the pixel corresponding to the sampling point, the parameters of the basic model are adjusted.
  • sampling points are sampled in the same manner as described above, and then the coordinate information of the sampling points in the world coordinate system is input into the basic model to obtain the corresponding predicted SDF value and predicted RGB color value, which are used to adjust the parameters of the basic model until the preset stop condition is met.
  • the preset stop condition can be configured as the number of iterations of the basic model reaches a preset number, or the preset stop condition can be configured as the convergence of the basic model.
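  • As a hedged illustration of the training described above, the sketch below uses a plain multilayer perceptron without normalization layers to map a sampled 3D point to a predicted SDF value and RGB color, and adjusts its parameters using the sum of the first difference (SDF) and the second difference (RGB); the layer sizes, the L1 form of the differences and the optimizer are assumptions.

```python
import torch
import torch.nn as nn

class ImplicitMLP(nn.Module):
    """Plain MLP (no normalization layers) mapping a 3D point to (SDF value, RGB color)."""
    def __init__(self, hidden=256, layers=6):
        super().__init__()
        dims = [3] + [hidden] * layers
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.body = nn.Sequential(*blocks)
        self.sdf_head = nn.Linear(hidden, 1)
        self.rgb_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, points):                     # (N, 3) first coordinate information of the samples
        h = self.body(points)
        return self.sdf_head(h), self.rgb_head(h)  # predicted SDF value and predicted RGB color

def train_step(model, optimizer, points, sdf_gt, rgb_gt):
    pred_sdf, pred_rgb = model(points)
    # First difference (SDF) plus second difference (RGB); L1 is an illustrative choice.
    loss = (pred_sdf - sdf_gt).abs().mean() + (pred_rgb - rgb_gt).abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```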
  • the isosurface extraction algorithm can be used to extract the surface of the three-dimensional model from the trained neural network model to obtain the three-dimensional model of the object.
  • an imaging plane of the color image is determined according to camera parameters; and a ray passing through a pixel point in the color image and perpendicular to the imaging plane is determined to be a ray corresponding to the pixel point.
  • the coordinate information of the color image in the world coordinate system can be determined according to the camera parameters of the color camera corresponding to the color image. Then, the ray passing through the pixel point in the color image and perpendicular to the imaging plane can be determined as the ray corresponding to the pixel point.
  • the second coordinate information and the rotation angle of the color camera in the world coordinate system are determined according to the camera parameters; and the imaging plane of the color image is determined according to the second coordinate information and the rotation angle.
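  • A small sketch of constructing such a ray, under the assumption that the camera parameters are given as a 3x3 intrinsic matrix K and a world-to-camera rotation R and translation t; these parameter conventions are not specified by the application and are used here only for illustration.

```python
import numpy as np

def pixel_ray(u, v, K, R, t):
    """Ray through pixel (u, v), perpendicular to the imaging plane, as described above.
    K: 3x3 intrinsics, R: 3x3 rotation, t: 3-vector translation (world-to-camera); assumed conventions."""
    # Pixel position on the (unit-depth) imaging plane, in camera coordinates.
    p_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Camera-to-world transform: X_world = R^T (X_cam - t).
    origin = R.T @ (p_cam - t)
    direction = R.T @ np.array([0.0, 0.0, 1.0])  # optical axis, i.e. the normal of the imaging plane
    return origin, direction / np.linalg.norm(direction)
```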
  • a first number of first sampling points are sampled at equal intervals on the ray; a plurality of key sampling points are determined according to the depth value of the pixel point, and a second number of second sampling points are sampled according to the key sampling points; and the first number of first sampling points and the second number of second sampling points are determined as a plurality of sampling points obtained by sampling on the ray.
  • n (i.e., the first number of) first sampling points are uniformly sampled on the ray, where n is a positive integer greater than 2; then, according to the depth value of the aforementioned pixel point, a preset number of key sampling points closest to the aforementioned pixel point are determined from the n first sampling points, or key sampling points whose distance to the aforementioned pixel point is less than a distance threshold are determined from the n first sampling points; then, m second sampling points are sampled according to the determined key sampling points, where m is a positive integer greater than 1; finally, the sampled n+m sampling points are determined as the multiple sampling points sampled on the ray.
  • sampling m more sampling points at the key sampling points can make the training effect of the model more accurate on the surface of the three-dimensional model, thereby improving the reconstruction accuracy of the three-dimensional model.
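  • The two-stage sampling can be pictured with the toy sketch below: n equally spaced first sampling points along the ray, then m second sampling points concentrated around the key samples closest to the pixel's depth value; the near/far range, the number of key samples and the spread are illustrative assumptions.

```python
import numpy as np

def sample_ray(origin, direction, pixel_depth, n=32, m=16, near=0.1, far=5.0, spread=0.05):
    """First stage: n equally spaced samples; second stage: m samples near the surface
    implied by the pixel depth. Range and spread values are illustrative assumptions."""
    t_uniform = np.linspace(near, far, n)                                   # the first number of samples
    key = t_uniform[np.argsort(np.abs(t_uniform - pixel_depth))[:4]]        # key samples closest to the depth
    t_fine = np.random.uniform(key.min() - spread, key.max() + spread, m)   # the second number of samples
    t_all = np.sort(np.concatenate([t_uniform, t_fine]))
    return origin + t_all[:, None] * direction                              # (n + m, 3) sample positions
```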
  • the depth value corresponding to the pixel is determined according to the depth image corresponding to the color image; the SDF value of each sampling point from the pixel is calculated based on the depth value; and the coordinate information of each sampling point is calculated according to the camera parameters and the depth value.
  • the distance between the shooting position of the color camera and the corresponding point on the object is determined according to the camera parameters and the depth value of the pixel point, and then the SDF value of each sampling point is calculated one by one based on the distance, and the coordinate information of each sampling point is calculated.
  • the trained basic model can predict its corresponding SDF value.
  • the predicted SDF value represents the positional relationship (inside, outside or on the surface) between the point and the three-dimensional model of the object, thereby realizing the implicit expression of the three-dimensional model of the object and obtaining a neural network model for implicitly expressing the three-dimensional model of the object.
  • isosurface extraction is performed on the above neural network model.
  • an isosurface extraction algorithm (Marching Cubes, MC) can be used to draw the surface of the three-dimensional model to obtain the three-dimensional model surface, and then the three-dimensional model of the object is obtained based on the three-dimensional model surface.
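  • One common way to perform this isosurface extraction, shown here as an assumption using scikit-image's Marching Cubes over an SDF grid sampled from the trained network (reusing the illustrative ImplicitMLP above; the grid resolution and bounds are arbitrary choices):

```python
import numpy as np
import torch
from skimage import measure

def extract_mesh(model, resolution=128, bound=1.0):
    # Sample the trained network's SDF on a regular grid and extract the zero level set.
    axis = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
    with torch.no_grad():
        sdf, _ = model(torch.from_numpy(grid).float())
    volume = sdf.numpy().reshape(resolution, resolution, resolution)
    verts, faces, normals, _ = measure.marching_cubes(volume, level=0.0)
    # Map voxel indices back to world coordinates.
    verts = verts / (resolution - 1) * 2 * bound - bound
    return verts, faces
```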
  • the 3D reconstruction scheme provided by the present application uses a neural network to implicitly model the 3D model of the object, and incorporates depth information to improve the speed and accuracy of model training.
  • the 3D model of the photographed object is continuously reconstructed in time sequence, and the 3D model of the photographed object at different times can be obtained.
  • the 3D model sequence composed of these 3D models at different times in time sequence is the volumetric video obtained by photographing the photographed object.
  • "volume video shooting" can be performed on any photographed object to obtain a volumetric video presenting specific content.
  • a volumetric video of a dancing subject can be shot to obtain a volumetric video of the subject's dance that can be viewed from any angle.
  • a volumetric video of a teaching subject can be shot to obtain a volumetric video of the subject's teaching that can be viewed from any angle, and so on.
  • the embodiment of the present application also provides a volumetric video decoding device based on the above volumetric video decoding method.
  • the meanings of the terms are the same as those in the above volumetric video decoding method, and the specific implementation details can refer to the description in the method embodiment.
  • Figure 3 shows a block diagram of a volumetric video decoding device according to an embodiment of the present application.
  • a volumetric video decoding device 300 may include: an acquisition module 310 , an extraction module 320 , an analysis module 330 , and a rendering module 340 .
  • the acquisition module 310 can be used to acquire multiple frames of images to be decoded corresponding to the volumetric video; the extraction module 320 can be used to extract global features corresponding to each frame of the image to be decoded; the analysis module 330 can be used to perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded, and obtain the depth corresponding to each frame of the image to be decoded; the rendering module 340 can be used to perform rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth, and obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
  • the extraction module is used to perform feature extraction processing on the image to be decoded for each frame of the image to be decoded to obtain image features corresponding to the image to be decoded; and perform feature fusion processing on the image features corresponding to the image to be decoded to obtain global features corresponding to the image to be decoded.
  • the extraction module is used to: perform multi-level encoding processing on the image to be decoded to obtain image features output by the encoding processing at each level; wherein the encoding processing at each level includes convolution processing and maximum pooling processing performed in sequence; the image features output by the encoding processing at the previous level are used for the encoding processing at the next level; the extraction module is also used to: perform multi-level decoding processing on the features to be fused to obtain fused features output by the decoding processing at each level; wherein the decoding processing at each level includes deconvolution processing, splicing processing and convolution processing performed in sequence; the fused features output by the decoding processing at the previous level are used for the decoding processing at the next level; the splicing processing at each level includes: splicing the deconvolution features output by the deconvolution processing with the image features of the same level to obtain splicing features for convolution processing; and obtaining the global features corresponding to the image to be decoded based on the fused features output by the last level of decoding processing.
  • the extraction module is also used to: calculate the attention distribution of the image features at the same level, and calculate the weighted average based on the attention distribution to obtain the weighted average features at the same level; splice the deconvolution features output by the deconvolution processing with the weighted average features at the same level to obtain the spliced features for convolution processing.
  • the analysis module is used to: input the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis and processing, and obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
  • the analysis module is used to: input the global features corresponding to each frame of the image to be decoded into the gated recurrent unit for depth analysis processing, and obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
  • the rendering module is used to: input each frame of the image to be decoded and the corresponding global features and the depth into a convolutional neural network for rendering processing, so as to obtain a three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.
  • the device further includes a generation module, which is used to: serialize the decoded multi-frame three-dimensional models in the order corresponding to the image to be decoded corresponding to each frame of the three-dimensional model to obtain the volumetric video.
  • an embodiment of the present application further provides an electronic device, which may be a terminal or a server, as shown in FIG. 4, which shows a schematic diagram of the structure of the electronic device involved in the embodiment of the present application, specifically:
  • the electronic device may include components such as a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, and an input unit 404.
  • the processor 401 is the control center of the electronic device. It uses various interfaces and lines to connect various parts of the entire computer device.
  • the processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface and application programs, etc., and the modem processor mainly handles wireless communications. It is understandable that the above-mentioned modem processor may not be integrated into the processor 401.
  • the memory 402 can be used to store software programs and modules.
  • the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402.
  • the memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the data storage area may store data created according to the use of the computer device, etc.
  • the memory 402 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one disk storage device, a flash memory device, or other non-volatile solid-state storage devices. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
  • the electronic device also includes a power supply 403 for supplying power to each component.
  • the power supply 403 can be logically connected to the processor 401 through a power management system, so that the power management system can manage charging, discharging, power consumption and other functions.
  • the power supply 403 can also include one or more DC or AC power supplies, recharging systems, power failure detection circuits, power converters or inverters, power status indicators and other arbitrary components.
  • the electronic device may further include an input unit 404, which may be used to receive input digital or character information and generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
  • the electronic device may further include a display unit, etc., which will not be described in detail herein.
  • the processor 401 in the electronic device will load the executable files corresponding to the processes of one or more computer programs into the memory 402 according to the following instructions, and the processor 401 will run the computer programs stored in the memory 402, thereby realizing various functions in the aforementioned embodiments of the present application.
  • the processor 401 can execute the following steps: obtain multiple frames of images to be decoded corresponding to the volumetric video; extract the global features corresponding to each frame of the image to be decoded; perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; perform rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth respectively to obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
  • an embodiment of the present application further provides a storage medium, in which a computer program is stored.
  • the computer program can be loaded by a processor to execute the steps in any method provided in the embodiment of the present application.
  • the storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Processing (AREA)

Abstract

Disclosed in the present application are a decoding method for a volumetric video, and a storage medium and an electronic device. The method comprises: acquiring a plurality of frames of images to be decoded; extracting global features of the images to be decoded; performing depth analysis processing on the basis of the global features, so as to obtain depths; and performing rendering processing on the basis of the frames of the images to be decoded, the global features and the depths, so as to obtain a multi-frame three-dimensional model for generating a volumetric video.

Description

体积视频的解码方法、存储介质、以及电子设备Volumetric video decoding method, storage medium, and electronic device 技术领域Technical Field
本申请涉及计算机技术领域,具体涉及一种体积视频的解码方法、存储介质、以及电子设备。The present application relates to the field of computer technology, and in particular to a volumetric video decoding method, a storage medium, and an electronic device.
背景技术Background Art
体积视频是连续的三维模型的模型序列,体积视频中通常包括大量的三维模型,通常存在对体积视频进行编解码的需求。目前,相关技术中,存在将体积视频中三维模型编码为顶点数据及面片数据等编码数据,通过大量复杂解码计算对编码数据进行解码来播放体积视频的方案。Volumetric video is a model sequence of continuous three-dimensional models. Volumetric video usually includes a large number of three-dimensional models. There is usually a need to encode and decode volumetric video. At present, in the related technology, there is a solution that encodes the three-dimensional model in the volumetric video into encoded data such as vertex data and facet data, and decodes the encoded data through a large number of complex decoding calculations to play the volumetric video.
技术问题Technical issues
目前的方案中,存在体积视频的解码存在解码效率较低及解码效果较差的问题。In the current solution, the decoding of volumetric video has the problem of low decoding efficiency and poor decoding effect.
技术解决方案Technical Solutions
本申请实施例提供一种方案,可以提升体积视频的解码效率且提升解码效果。The embodiment of the present application provides a solution that can improve the decoding efficiency of volumetric video and improve the decoding effect.
为解决上述技术问题,本申请实施例提供以下技术方案:
根据本申请的一个实施例,一种体积视频的解码方法,所述方法包括:获取体积视频对应的多帧待解码图像;提取每一帧所述待解码图像对应的全局特征;基于每一帧所述待解码图像对应的全局特征进行深度分析处理,得到每一帧所述待解码图像对应的深度;基于每一帧所述待解码图像及对应的所述全局特征与所述深度分别进行渲染处理,得到解码出的多帧三维模型,所述多帧三维模型用于生成所述体积视频。
To solve the above technical problems, the present application provides the following technical solutions:
According to one embodiment of the present application, a method for decoding a volumetric video includes: obtaining multiple frames of images to be decoded corresponding to the volumetric video; extracting global features corresponding to each frame of the image to be decoded; performing depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; performing rendering processing based on each frame of the image to be decoded and the corresponding global features and the depth to obtain a decoded multi-frame three-dimensional model, and the multi-frame three-dimensional model is used to generate the volumetric video.
In some embodiments of the present application, the extracting of the global features corresponding to each frame of the image to be decoded includes: for each frame of the image to be decoded, performing feature extraction processing on the image to be decoded to obtain image features corresponding to the image to be decoded; and performing feature fusion processing on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded.

In some embodiments of the present application, the feature extraction processing of the image to be decoded to obtain the image features corresponding to the image to be decoded includes: performing multi-level encoding processing on the image to be decoded to obtain the image features output by the encoding processing of each level, wherein the encoding processing of each level includes convolution processing and maximum pooling processing performed in sequence, and the image features output by the encoding processing of the previous level are used for the encoding processing of the next level. The feature fusion processing of the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded includes: performing multi-level decoding processing on the features to be fused to obtain the fused features output by the decoding processing of each level, wherein the decoding processing of each level includes deconvolution processing, splicing processing and convolution processing performed in sequence, and the fused features output by the decoding processing of the previous level are used for the decoding processing of the next level; the splicing processing of each level includes: splicing the deconvolution features output by the deconvolution processing with the image features of the same level to obtain the spliced features used for convolution processing; and obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding processing of the last level.

In some embodiments of the present application, the splicing of the deconvolution features output by the deconvolution processing with the image features of the same level to obtain the spliced features for convolution processing includes: calculating an attention distribution for the image features of the same level, and calculating a weighted average based on the attention distribution to obtain weighted average features of the same level; and splicing the deconvolution features output by the deconvolution processing with the weighted average features of the same level to obtain the spliced features for convolution processing.

In some embodiments of the present application, the performing of depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded includes: inputting the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis processing, to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
In some embodiments of the present application, the inputting of the global features corresponding to each frame of the image to be decoded into the recurrent neural network for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network includes: inputting the global features corresponding to each frame of the image to be decoded into a gated recurrent unit for depth analysis processing, to obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
In some embodiments of the present application, the performing of rendering processing based on each frame of the image to be decoded and the corresponding global features and depth to obtain the decoded multi-frame three-dimensional model includes: inputting each frame of the image to be decoded and the corresponding global features and depth into a convolutional neural network for rendering processing, to obtain the three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.
In some embodiments of the present application, after the rendering processing is performed based on each frame of the image to be decoded and the corresponding global features and depth to obtain the decoded multi-frame three-dimensional model, the method includes: serializing the decoded multi-frame three-dimensional model in the order of the images to be decoded to which the frames of the three-dimensional model correspond, to obtain the volumetric video.
According to one embodiment of the present application, a decoding device for volumetric video includes: an acquisition module, used to acquire multiple frames of images to be decoded corresponding to the volumetric video; an extraction module, used to extract global features corresponding to each frame of the image to be decoded; an analysis module, used to perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; and a rendering module, used to perform rendering processing based on each frame of the image to be decoded and the corresponding global features and depth, to obtain a decoded multi-frame three-dimensional model, where the multi-frame three-dimensional model is used to generate the volumetric video.

In some embodiments of the present application, the extraction module is used to perform, for each frame of the image to be decoded, feature extraction processing on the image to be decoded to obtain image features corresponding to the image to be decoded, and to perform feature fusion processing on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded.

In some embodiments of the present application, the extraction module is used to: perform multi-level encoding processing on the image to be decoded to obtain image features output by the encoding processing at each level, wherein the encoding processing at each level includes convolution processing and maximum pooling processing performed in sequence, and the image features output by the encoding processing at the previous level are used for the encoding processing at the next level. The extraction module is also used to: perform multi-level decoding processing on the features to be fused to obtain fused features output by the decoding processing at each level, wherein the decoding processing at each level includes deconvolution processing, splicing processing and convolution processing performed in sequence, and the fused features output by the decoding processing at the previous level are used for the decoding processing at the next level; the splicing processing at each level includes: splicing the deconvolution features output by the deconvolution processing with the image features of the same level to obtain spliced features for convolution processing; and obtaining the global features corresponding to the image to be decoded based on the fused features output by the decoding processing at the last level.

In some embodiments of the present application, the extraction module is also used to: calculate an attention distribution for the image features at the same level, and calculate a weighted average based on the attention distribution to obtain weighted average features at the same level; and splice the deconvolution features output by the deconvolution processing with the weighted average features at the same level to obtain the spliced features for convolution processing.

In some embodiments of the present application, the analysis module is used to: input the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis processing, and obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
In some embodiments of the present application, the analysis module is used to: input the global features corresponding to each frame of the image to be decoded into a gated recurrent unit for depth analysis processing, and obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
In some embodiments of the present application, the rendering module is used to: input each frame of the image to be decoded and the corresponding global features and depth into a convolutional neural network for rendering processing, to obtain the three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.

In some embodiments of the present application, the device further includes a generation module, used to: serialize the decoded multi-frame three-dimensional model in the order of the images to be decoded to which the frames of the three-dimensional model correspond, to obtain the volumetric video.

According to another embodiment of the present application, a storage medium stores a computer program thereon, and when the computer program is executed by a processor of a computer, the computer executes the method described in the embodiments of the present application.

According to another embodiment of the present application, an electronic device may include: a memory storing a computer program; and a processor reading the computer program stored in the memory to execute the method described in the embodiments of the present application.

According to another embodiment of the present application, a computer program product or computer program includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the method provided in the various optional implementations described in the embodiments of the present application.

Beneficial Effects

In the volumetric video decoding scheme of the embodiments of the present application, multiple frames of images to be decoded corresponding to the volumetric video are obtained; global features corresponding to each frame of the image to be decoded are extracted; depth analysis processing is performed based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; and rendering processing is performed based on each frame of the image to be decoded and the corresponding global features and depth, to obtain a decoded multi-frame three-dimensional model, where the multi-frame three-dimensional model is used to generate the volumetric video.
In this way, the volumetric video is provided in the form of being encoded into multiple frames of images to be decoded. Global features are extracted from each frame of the image to be decoded, the depth is analyzed from the global features, and a three-dimensional model is rendered by combining the image to be decoded, the global features, and the depth. The overall decoding process is efficient and the decoded three-dimensional model is highly reliable, which can effectively improve the decoding efficiency of volumetric video and improve the decoding effect.
Brief Description of the Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained based on these drawings without creative work.

FIG. 1 shows a schematic diagram of a system to which an embodiment of the present application can be applied.

FIG. 2 shows a flowchart of a volumetric video decoding method according to an embodiment of the present application.

FIG. 3 shows a block diagram of a volumetric video decoding device according to another embodiment of the present application.

FIG. 4 shows a block diagram of an electronic device according to an embodiment of the present application.
Embodiments of the Present Invention

The technical solutions in the embodiments of the present application will be described clearly and completely below in combination with the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those skilled in the art without creative work fall within the scope of protection of the present application.

FIG. 1 shows a schematic diagram of a system 100 to which an embodiment of the present application can be applied. As shown in FIG. 1, the system 100 may include a server 101 and a terminal 102.

The server 101 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.

The terminal 102 may be any device, including but not limited to mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, VR/AR devices, smart watches, and the like. In one embodiment, the server 101 or the terminal 102 may be a node device in a blockchain network or a map Internet-of-Vehicles platform.

In one implementation of this example, the server 101 or the terminal 102 may: obtain multiple frames of images to be decoded corresponding to the volumetric video; extract global features corresponding to each frame of the image to be decoded; perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded; and perform rendering processing based on each frame of the image to be decoded and the corresponding global features and depth, to obtain a decoded multi-frame three-dimensional model, where the multi-frame three-dimensional model is used to generate the volumetric video.
FIG. 2 schematically shows a flowchart of a volumetric video decoding method according to an embodiment of the present application. The volumetric video decoding method may be executed by any device, such as the server 101 or the terminal 102 shown in FIG. 1.

As shown in FIG. 2, the volumetric video decoding method may include steps S210 to S240.

Step S210: obtain multiple frames of images to be decoded corresponding to the volumetric video. Step S220: extract global features corresponding to each frame of the image to be decoded. Step S230: perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded. Step S240: perform rendering processing based on each frame of the image to be decoded and the corresponding global features and depth, to obtain a decoded multi-frame three-dimensional model, where the multi-frame three-dimensional model is used to generate the volumetric video.

Volumetric video is a model sequence of multi-frame three-dimensional models. A three-dimensional model may be one corresponding to a person, an animal, or the like. Through continuous multi-frame three-dimensional models, a volumetric video can present the behavior of an object (for example, dancing), and each frame of the three-dimensional model can be reconstructed from multiple two-dimensional images captured from multiple perspectives.

The created volumetric video may be encoded in advance into corresponding multiple frames of images to be decoded. The volumetric video may be encoded into the corresponding multiple frames of images to be decoded in the following manners: the volumetric video may be encoded into multi-perspective color images used for reconstructing each frame of the three-dimensional model in the volumetric video; or the volumetric video may be encoded into multi-perspective model images captured from different angles for each frame of the three-dimensional model. Each frame of the image to be decoded may correspond to one frame of the three-dimensional model, and each frame of the image to be decoded may include at least one image.

By performing feature extraction of global information on each frame of the image to be decoded, the global features containing global information corresponding to each frame of the image to be decoded can be extracted. Depth analysis processing is then performed based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded.

Finally, rendering processing is performed based on each frame of the image to be decoded and the global features and depth corresponding to the image to be decoded, to obtain a decoded multi-frame three-dimensional model. The multi-frame three-dimensional model can be serialized to obtain the restored/decoded volumetric video.
In this way, based on steps S210 to S240, the volumetric video is provided in the form of being encoded into multiple frames of images to be decoded. Global features are extracted from each frame of the image to be decoded, the depth is analyzed from the global features, and a three-dimensional model is rendered by combining the image to be decoded, the global features, and the depth. The overall decoding process is efficient and the decoded three-dimensional model is highly reliable, which can effectively improve the decoding efficiency of volumetric video and improve the decoding effect.
The following describes further specific optional embodiments of the steps performed when decoding the volumetric video in the embodiment of FIG. 2.

In one embodiment, step S220, extracting the global features corresponding to each frame of the image to be decoded, includes: for each frame of the image to be decoded, performing feature extraction processing on the image to be decoded to obtain image features corresponding to the image to be decoded; and performing feature fusion processing on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded.

The image to be decoded can be subjected to feature extraction processing through a feature extraction network (for example, a convolutional network) to obtain the image features corresponding to the image to be decoded. Further, the image features corresponding to the image to be decoded can be subjected to feature fusion processing through a feature fusion network (for example, a fully connected network) to obtain the global features corresponding to the image to be decoded.

In one embodiment, step S220, extracting the global features corresponding to each frame of the image to be decoded, includes: for each frame of the image to be decoded, calculating the histogram corresponding to the image to be decoded as the global feature through a histogram calculation function (for example, the histogram calculation function in OpenCV).
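For illustration only, the histogram variant above could be realized with OpenCV's cv2.calcHist roughly as in the sketch below; the bin count, per-channel concatenation, and normalization are assumptions made for the example rather than details taken from the disclosure.

```python
import cv2
import numpy as np

def histogram_global_feature(image_bgr: np.ndarray, bins: int = 64) -> np.ndarray:
    """Compute a per-channel color histogram and use it as a global feature vector."""
    features = []
    for channel in range(3):  # B, G, R channels
        hist = cv2.calcHist([image_bgr], [channel], None, [bins], [0, 256])
        features.append(hist.flatten())
    feature = np.concatenate(features)
    # Normalize so the feature is independent of image resolution (an assumption for this sketch).
    return feature / (feature.sum() + 1e-8)
```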
In one embodiment, the feature extraction processing of the image to be decoded to obtain the image features corresponding to the image to be decoded includes: performing multi-level encoding processing on the image to be decoded to obtain the image features output by the encoding processing of each level, wherein the encoding processing of each level includes convolution processing and maximum pooling processing performed in sequence, and the image features output by the encoding processing of the previous level are used for the encoding processing of the next level. The feature fusion processing of the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded includes: performing multi-level decoding processing on the features to be fused to obtain the fused features output by the decoding processing of each level, wherein the decoding processing of each level includes deconvolution processing, splicing processing and convolution processing performed in sequence, and the fused features output by the decoding processing of the previous level are used for the decoding processing of the next level; the splicing processing of each level includes: splicing the deconvolution features output by the deconvolution processing with the image features of the same level to obtain the spliced features used for convolution processing; and obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding processing of the last level.

In this embodiment, first, for each frame of the image to be decoded, multi-level encoding processing is performed on the image to be decoded through a feature extraction network (encoder). The feature extraction network (encoder) may include multiple cascaded levels of extraction networks, and the extraction network of each level can output the image features of the corresponding level through the encoding processing. The encoding processing performed in the extraction network of each level specifically includes convolution processing and maximum pooling processing performed in sequence. For example, the encoding processing in the extraction network of the first level may include: first performing convolution processing on the image to be decoded to obtain convolution features, and then performing maximum pooling processing on the convolution features to obtain the image features output by the encoding processing of the first level. Further, the image features output by the encoding processing of the previous level are used for the encoding processing of the next level. For example, the image features of the first level are used as the input features of the extraction network of the second level; in the extraction network of the second level, convolution processing is first performed on the image features of the first level to obtain convolution features, and then maximum pooling processing is performed on the convolution features to obtain the image features output by the encoding processing of the second level.

Further, multi-level decoding processing is performed on the features to be fused through a feature fusion network (decoder). The feature fusion network (decoder) may include multiple cascaded levels of fusion networks, and the fusion network of each level can output the fused features of the corresponding level through the decoding processing. The decoding processing performed in the fusion network of each level specifically includes deconvolution processing, splicing processing, and convolution processing performed in sequence. For example, the decoding processing in the fusion network of the first level may include: first performing deconvolution processing on the features to be fused to obtain deconvolution features, then performing splicing processing on the deconvolution features to obtain spliced features, and then performing convolution processing on the spliced features to obtain the fused features output by the decoding processing of the first level. Further, the fused features output by the decoding processing of the previous level are used for the decoding processing of the next level. For example, the fused features of the first level are used as the input features of the fusion network of the second level, and in the fusion network of the second level, deconvolution processing, splicing processing, and convolution processing are performed in sequence on the fused features of the first level. Further, the splicing processing of each level includes: splicing the deconvolution features output by the deconvolution processing with the image features of the same level to obtain the spliced features used for convolution processing. For example, the splicing processing of the first level includes: splicing the deconvolution features output by the deconvolution processing of the first level with the image features output by the encoding processing of the first level to obtain the spliced features used for convolution processing in the first level.

Finally, the global features corresponding to the image to be decoded are obtained according to the fused features output by the decoding processing of the last level. The fused features output by the decoding processing of the last level may be used directly as the global features corresponding to the image to be decoded, or the fused features output by the decoding processing of the last level may be reduced in dimension and the dimension-reduced features used as the global features corresponding to the image to be decoded.

In one implementation of this example, the feature extraction network is a UNet network. The UNet network includes a feature extraction network (encoder) on the left and a feature fusion network (decoder) on the right; the feature extraction network (encoder) includes 4 levels of extraction networks, and the feature fusion network (decoder) also includes 4 levels of fusion networks.
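The following is a minimal PyTorch-style sketch of such a 4-level encoder/decoder with skip splicing, assuming illustrative channel widths and a simple spatial average to form the final global feature; it is not the network actually used in the disclosure.

```python
import torch
import torch.nn as nn

class EncoderLevel(nn.Module):
    """One encoder level: convolution followed by maximum pooling."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feat = self.conv(x)           # image features of this level (kept for splicing)
        return feat, self.pool(feat)  # pooled output feeds the next level

class DecoderLevel(nn.Module):
    """One decoder level: deconvolution, splicing with same-level features, then convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2)  # deconvolution
        self.conv = nn.Sequential(nn.Conv2d(out_ch * 2, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        up = self.up(x)
        spliced = torch.cat([up, skip], dim=1)  # splice deconvolution output with same-level features
        return self.conv(spliced)

class GlobalFeatureUNet(nn.Module):
    """Illustrative 4-level encoder/decoder; channel widths are assumptions."""
    def __init__(self, in_ch: int = 3, widths=(32, 64, 128, 256)):
        super().__init__()
        chans = [in_ch] + list(widths)
        self.encoders = nn.ModuleList(EncoderLevel(chans[i], chans[i + 1]) for i in range(4))
        self.bottleneck = nn.Conv2d(widths[-1], widths[-1] * 2, 3, padding=1)
        rev = list(reversed(widths))
        dec_in = [widths[-1] * 2] + rev[:-1]
        self.decoders = nn.ModuleList(DecoderLevel(dec_in[i], rev[i]) for i in range(4))

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            feat, x = enc(x)
            skips.append(feat)
        x = self.bottleneck(x)
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = dec(x, skip)
        # Use the last-level fused features, spatially averaged, as the global feature.
        return x.mean(dim=(2, 3))
```

Under these assumptions, a call such as GlobalFeatureUNet()(torch.randn(1, 3, 256, 256)) would yield a (1, 32) global feature for one image to be decoded.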
In one embodiment, the splicing of the deconvolution features output by the deconvolution processing with the image features at the same level to obtain the spliced features for convolution processing includes:

calculating an attention distribution for the image features at the same level, and calculating a weighted average according to the attention distribution to obtain weighted average features at the same level; and splicing the deconvolution features output by the deconvolution processing with the weighted average features at the same level to obtain the spliced features for convolution processing.
In this embodiment, weighted average features are first calculated for the image features at the same level through an attention mechanism, and then the deconvolution features output by the deconvolution processing are spliced with the weighted average features to obtain the spliced features for convolution processing. For example, first, weighted average features are calculated through the attention mechanism for the image features output by the encoding processing of the first level; then, the deconvolution features output by the deconvolution processing of the first level are spliced with the weighted average features of the first level to obtain the spliced features used for convolution processing in the first level. In this way, the extracted global features can contain more global information, which further improves the decoding effect of the three-dimensional model as a whole.

Calculating the weighted average features of the image features at the same level through the attention mechanism specifically includes: calculating an attention distribution for the image features at the same level, and calculating a weighted average according to the attention distribution to obtain the weighted average features at the same level. Specifically, the weighted average features can be calculated through a soft attention (Soft Attention) mechanism.
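As a hedged illustration of the soft-attention step, the sketch below scores each spatial position of a level's feature map with an assumed 1x1 convolution, forms the attention distribution with a softmax, computes the weighted average, and broadcasts it back to the spatial size so it can be spliced with the deconvolution features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttentionPool(nn.Module):
    """Attention distribution over spatial positions, then a weighted average feature."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # assumed scoring layer

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        attn = F.softmax(self.score(feat).view(b, 1, h * w), dim=-1)  # attention distribution
        weighted = (feat.view(b, c, h * w) * attn).sum(dim=-1)        # weighted average, shape (b, c)
        # Broadcast back to the spatial size so it can be spliced with the deconvolution features.
        return weighted.view(b, c, 1, 1).expand(b, c, h, w)
```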
In one embodiment, the performing of depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded includes: inputting the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis processing, to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.

In this embodiment, a recurrent neural network (RNN) performs depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded. Specifically, the sequence of global features corresponding to the images to be decoded is input into the recurrent neural network for depth analysis processing, and the depth corresponding to the images to be decoded output by the recurrent neural network is obtained. The parameters of the recurrent neural network are shared at different time steps, so accurate depth can be output through the analysis processing. The recurrent neural network (RNN) may specifically include a long short-term memory network (Long Short Term Memory Network, LSTM) or a gated recurrent unit (GRU), etc.

In one embodiment, the inputting of the global features corresponding to each frame of the image to be decoded into the recurrent neural network for depth analysis processing to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network includes: inputting the global features corresponding to each frame of the image to be decoded into a gated recurrent unit for depth analysis processing, to obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.

In this embodiment, a gated recurrent unit (GRU) is specifically used to perform depth analysis processing based on the global features corresponding to each frame of the image to be decoded to obtain the depth corresponding to each frame of the image to be decoded. Specifically, the sequence of global features corresponding to the images to be decoded is input into the gated recurrent unit (GRU) for depth analysis processing, and the depth corresponding to the images to be decoded output by the gated recurrent unit (GRU) is obtained. The gated recurrent unit (GRU) is a gated recurrent neural network that can efficiently produce reliable depth through analysis.
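A minimal sketch of the GRU-based depth analysis might look as follows, assuming the per-frame global features are stacked into a sequence and the depth is regressed as a fixed-resolution map; the hidden size and output resolution are assumptions, not values from the disclosure.

```python
import torch
import torch.nn as nn

class GRUDepthHead(nn.Module):
    """Run per-frame global features through a GRU and regress a depth map per frame."""
    def __init__(self, feature_dim: int, hidden_dim: int = 256, out_hw=(64, 64)):
        super().__init__()
        self.gru = nn.GRU(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, out_hw[0] * out_hw[1])
        self.out_hw = out_hw

    def forward(self, global_features: torch.Tensor) -> torch.Tensor:
        # global_features: (batch, num_frames, feature_dim) -- one global feature per frame to decode
        hidden_states, _ = self.gru(global_features)
        depth = self.head(hidden_states)                   # (batch, num_frames, H*W)
        return depth.view(*depth.shape[:2], *self.out_hw)  # (batch, num_frames, H, W)
```

For example, features of shape (1, num_frames, 32) fed to GRUDepthHead(feature_dim=32) would yield depth maps of shape (1, num_frames, 64, 64) under these assumptions.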
In one embodiment, the performing of rendering processing based on each frame of the image to be decoded and the corresponding global features and depth to obtain the decoded multi-frame three-dimensional model includes: inputting each frame of the image to be decoded and the corresponding global features and depth into a convolutional neural network for rendering processing, to obtain the three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.

In this embodiment, a convolutional neural network is used to perform rendering processing by combining the image to be decoded and the global features and depth corresponding to the image to be decoded, to obtain the decoded three-dimensional model. For example, the first frame of the image to be decoded and the global features and depth corresponding to the first frame of the image to be decoded are input into the convolutional neural network for rendering processing, and the three-dimensional model corresponding to the first frame of the image to be decoded output by the convolutional neural network is obtained. Similarly, the three-dimensional models corresponding to the other frames of the images to be decoded can be obtained, and thus the decoded multi-frame three-dimensional model is obtained.
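The disclosure does not fix the concrete output representation of the rendering network, so the sketch below is only one hedged possibility: a small CNN that fuses the image, its depth map, and the broadcast global feature and predicts per-pixel 3D coordinates plus colors as a simple stand-in for the three-dimensional model. The channel sizes and the output format are assumptions.

```python
import torch
import torch.nn as nn

class RenderingCNN(nn.Module):
    """Fuse image, depth and global feature; output per-pixel XYZ + RGB as a simple 3D representation."""
    def __init__(self, global_dim: int, hidden: int = 64):
        super().__init__()
        in_ch = 3 + 1 + global_dim  # RGB image + depth map + broadcast global feature
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 6, 1),  # 3 channels for XYZ, 3 for color
        )

    def forward(self, image, depth, global_feature):
        # image: (B, 3, H, W); depth: (B, 1, H, W); global_feature: (B, global_dim)
        b, _, h, w = image.shape
        g = global_feature.view(b, -1, 1, 1).expand(b, global_feature.shape[1], h, w)
        out = self.net(torch.cat([image, depth, g], dim=1))
        xyz, rgb = out[:, :3], out[:, 3:]
        return xyz, rgb
```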
In one embodiment, the multi-frame three-dimensional model obtained by the rendering processing being used to generate the volumetric video includes: serializing the decoded multi-frame three-dimensional model in the order of the images to be decoded to which the frames of the three-dimensional model correspond, to obtain the volumetric video. Each frame of the image to be decoded yields one corresponding frame of the three-dimensional model, and connecting all of the obtained three-dimensional models in sequence according to the order of the corresponding images to be decoded yields the decoded volumetric video.

The volumetric video (Volumetric Video, also called volume video, spatial video, volumetric three-dimensional video, or 6-degree-of-freedom video, etc.) in the foregoing embodiments of the present application is a technology that captures information in three-dimensional space (such as depth and color information) and generates a sequence of three-dimensional dynamic models. Compared with traditional video, volumetric video adds the concept of space to video and uses three-dimensional models to better restore the real three-dimensional world, rather than using two-dimensional flat video plus camera movement to simulate the sense of space of the real three-dimensional world. Since volumetric video is essentially a sequence of three-dimensional models, users can adjust to any viewing angle they like when watching, and it provides a higher degree of restoration and immersion than two-dimensional flat video.
Optionally, in the present application, before step S210, the three-dimensional models used to construct the volumetric video (these three-dimensional models are not the three-dimensional models obtained by decoding in steps S210 to S240, but rather three-dimensional models obtained by three-dimensional reconstruction before step S210) can be reconstructed as follows:

First, color images and depth images of the photographed object from different perspectives, as well as the camera parameters corresponding to the color images, are obtained; then, based on the obtained color images and their corresponding depth images and camera parameters, a neural network model that implicitly expresses the three-dimensional model of the photographed object is trained, and isosurface extraction is performed based on the trained neural network model to achieve three-dimensional reconstruction of the photographed object and obtain the three-dimensional model of the photographed object.
It should be noted that the embodiments of the present application do not specifically limit the architecture of the neural network model, which can be selected by those skilled in the art according to actual needs. For example, a multilayer perceptron (MLP) without a normalization layer can be selected as the basic model for model training.

The three-dimensional model reconstruction method provided by the present application will be described in detail below.

First, multiple color cameras and depth cameras can be used synchronously to photograph the object to be three-dimensionally reconstructed from multiple perspectives, obtaining color images of the object at multiple different perspectives and the corresponding depth images. That is, at the same shooting moment (shooting moments whose actual difference is less than or equal to a time threshold are regarded as the same), the color camera at each perspective captures a color image of the object at the corresponding perspective, and correspondingly, the depth camera at each perspective captures a depth image of the object at the corresponding perspective. It should be noted that the object may be any object, including but not limited to living objects such as people, animals, and plants, or non-living objects such as machinery, furniture, and dolls.

In this way, the color images of the object at different perspectives all have corresponding depth images. That is, when shooting, the color cameras and depth cameras can be configured as camera groups, with the color camera and depth camera at the same perspective photographing the same object synchronously. For example, a studio can be built, with the central area of the studio as the shooting area; around the shooting area, multiple groups of color cameras and depth cameras are arranged in pairs at certain angular intervals in the horizontal and vertical directions. When the object is in the shooting area surrounded by these color cameras and depth cameras, color images of the object at different perspectives and the corresponding depth images can be captured by these cameras.

In addition, the camera parameters of the color camera corresponding to each color image are further obtained. The camera parameters include the intrinsic and extrinsic parameters of the color camera, which can be determined by calibration. The camera intrinsic parameters are parameters related to the characteristics of the color camera itself, including but not limited to the focal length and pixels of the color camera; the camera extrinsic parameters are the parameters of the color camera in the world coordinate system, including but not limited to the position (coordinates) of the color camera and the rotation direction of the camera.

As described above, after obtaining the color images of the object at multiple different perspectives at the same shooting moment and their corresponding depth images, the object can be three-dimensionally reconstructed based on these color images and their corresponding depth images. Different from the related-art approach of converting depth into a point cloud for three-dimensional reconstruction, the present application trains a neural network model to implicitly express the three-dimensional model of the object, thereby achieving three-dimensional reconstruction of the object based on the neural network model.
Optionally, the present application uses a multilayer perceptron (MLP) without a normalization layer as the basic model and trains it in the following manner:

Based on the corresponding camera parameters, the pixel points in each color image are converted into rays; multiple sampling points are sampled on each ray, and the first coordinate information of each sampling point and the SDF value of each sampling point relative to the pixel point are determined; the first coordinate information of the sampling points is input into the basic model to obtain the predicted SDF value and the predicted RGB color value of each sampling point output by the basic model; based on the first difference between the predicted SDF value and the SDF value, and the second difference between the predicted RGB color value and the RGB color value of the pixel point, the parameters of the basic model are adjusted until a preset stop condition is met; and the basic model that meets the preset stop condition is used as the neural network model that implicitly expresses the three-dimensional model of the object.
First, based on the camera parameters corresponding to the color image, a pixel point in the color image is converted into a ray, which may be a ray passing through the pixel point and perpendicular to the color image plane. Then, multiple sampling points are sampled on the ray; the sampling may be performed in two steps: some sampling points are first sampled uniformly, and then multiple additional sampling points are sampled at key positions based on the depth value of the pixel point, so as to ensure that as many sampling points as possible are sampled near the model surface. Then, according to the camera parameters and the depth value of the pixel point, the first coordinate information of each sampled point in the world coordinate system and the signed distance field (SDF) value of each sampling point are calculated, where the SDF value may be the difference between the depth value of the pixel point and the distance from the sampling point to the imaging plane of the camera. This difference is a signed value: when the difference is positive, the sampling point is outside the three-dimensional model; when the difference is negative, the sampling point is inside the three-dimensional model; and when the difference is zero, the sampling point is on the surface of the three-dimensional model. Then, after the sampling points have been sampled and the SDF value corresponding to each sampling point has been calculated, the first coordinate information of the sampling points in the world coordinate system is further input into the basic model (the basic model is configured to map the input coordinate information into an SDF value and an RGB color value and output them); the SDF value output by the basic model is recorded as the predicted SDF value, and the RGB color value output by the basic model is recorded as the predicted RGB color value. Then, the parameters of the basic model are adjusted based on the first difference between the predicted SDF value and the SDF value corresponding to the sampling point, and the second difference between the predicted RGB color value and the RGB color value of the pixel point corresponding to the sampling point.

In addition, for the other pixel points in the color image, sampling points are likewise sampled in the manner described above, and the coordinate information of the sampling points in the world coordinate system is input into the basic model to obtain the corresponding predicted SDF values and predicted RGB color values, which are used to adjust the parameters of the basic model until a preset stop condition is met. For example, the preset stop condition may be configured such that the number of iterations of the basic model reaches a preset number, or the preset stop condition may be configured as convergence of the basic model. When the iteration of the basic model meets the preset stop condition, a neural network model that can accurately and implicitly express the three-dimensional model of the object is obtained. Finally, an isosurface extraction algorithm can be used to extract the three-dimensional model surface from the neural network model, thereby obtaining the three-dimensional model of the object.
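Purely as an illustrative sketch of this kind of training loop (not the disclosed implementation), the basic model and one parameter update could look as follows; the network width, the L1 loss choice, and the loss weighting are assumptions.

```python
import torch
import torch.nn as nn

class SDFColorMLP(nn.Module):
    """MLP without normalization layers: maps a 3D coordinate to (SDF value, RGB color)."""
    def __init__(self, hidden: int = 256, layers: int = 6):
        super().__init__()
        mods, dim = [], 3
        for _ in range(layers):
            mods += [nn.Linear(dim, hidden), nn.ReLU(inplace=True)]
            dim = hidden
        self.body = nn.Sequential(*mods)
        self.sdf_head = nn.Linear(hidden, 1)
        self.rgb_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, xyz):
        h = self.body(xyz)
        return self.sdf_head(h).squeeze(-1), self.rgb_head(h)

def train_step(model, optimizer, sample_xyz, sample_sdf, pixel_rgb, color_weight=1.0):
    """One parameter update from (sampled coordinates, ground-truth SDF, pixel colors)."""
    pred_sdf, pred_rgb = model(sample_xyz)
    loss = nn.functional.l1_loss(pred_sdf, sample_sdf)                           # first difference (SDF)
    loss = loss + color_weight * nn.functional.l1_loss(pred_rgb, pixel_rgb)      # second difference (color)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```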
Optionally, in some embodiments, the imaging plane of the color image is determined according to the camera parameters, and the ray passing through a pixel point in the color image and perpendicular to the imaging plane is determined as the ray corresponding to the pixel point.

Specifically, the coordinate information of the color image in the world coordinate system, that is, the imaging plane, can be determined according to the camera parameters of the color camera corresponding to the color image. Then, the ray passing through a pixel point in the color image and perpendicular to the imaging plane can be determined as the ray corresponding to that pixel point.

Optionally, in some embodiments, the second coordinate information and the rotation angle of the color camera in the world coordinate system are determined according to the camera parameters, and the imaging plane of the color image is determined according to the second coordinate information and the rotation angle.

Optionally, in some embodiments, a first number of first sampling points are sampled at equal intervals on the ray; a plurality of key sampling points are determined according to the depth value of the pixel point, and a second number of second sampling points are sampled according to the key sampling points; and the first number of first sampling points and the second number of second sampling points are determined as the multiple sampling points sampled on the ray.

Specifically, n (that is, the first number of) first sampling points are first uniformly sampled on the ray, where n is a positive integer greater than 2; then, according to the depth value of the aforementioned pixel point, a preset number of key sampling points closest to the pixel point are determined from the n first sampling points, or key sampling points whose distance to the pixel point is less than a distance threshold are determined from the n first sampling points; then, m second sampling points are sampled according to the determined key sampling points, where m is a positive integer greater than 1; and finally, the n + m sampled points are determined as the multiple sampling points sampled on the ray. Sampling m additional points at the key sampling points makes the training effect of the model more precise near the surface of the three-dimensional model, thereby improving the reconstruction accuracy of the three-dimensional model.
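A small sketch of the two-stage sampling follows, assuming the near/far range, the counts n and m, and the window around the pixel depth are all illustrative values rather than values from the disclosure.

```python
import numpy as np

def sample_ts_on_ray(pixel_depth, near=0.1, far=5.0, n=32, m=16, window=0.05):
    """Two-stage sampling of distances along a ray: n uniform samples plus
    m extra samples concentrated around the pixel's depth (assumed surface region)."""
    t_uniform = np.linspace(near, far, n)                              # first sampling points
    t_key = pixel_depth + np.random.uniform(-window, window, size=m)   # second sampling points near the surface
    return np.sort(np.concatenate([t_uniform, t_key]))
```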
Optionally, in some embodiments, the depth value corresponding to the pixel point is determined according to the depth image corresponding to the color image; the SDF value of each sampling point relative to the pixel point is calculated based on the depth value; and the coordinate information of each sampling point is calculated according to the camera parameters and the depth value.

Specifically, after multiple sampling points have been sampled on the ray corresponding to each pixel point, for each sampling point, the distance between the shooting position of the color camera and the corresponding point on the object is determined according to the camera parameters and the depth value of the pixel point, and then the SDF value and the coordinate information of each sampling point are calculated one by one based on this distance.
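Continuing the sketch above, the sampled distances can be turned into world coordinates and approximate SDF values (pixel depth minus the sample's distance along the ray); treating the distance along the ray as the distance to the imaging plane is a simplifying assumption made only for this example.

```python
import numpy as np

def points_and_sdf(ray_origin, ray_direction, ts, pixel_depth):
    """World coordinates of samples along the ray and their SDF values:
    positive outside the surface, negative inside, zero on the surface."""
    d = ray_direction / np.linalg.norm(ray_direction)
    points = ray_origin[None, :] + ts[:, None] * d[None, :]  # first coordinate information of each sample
    sdf = pixel_depth - ts
    return points, sdf
```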
It should be noted that, once training of the base model is complete, the trained base model can predict the SDF value corresponding to the coordinate information of any given point. The predicted SDF value represents the positional relationship between that point and the three-dimensional model of the object (inside, outside, or on the surface). The object's three-dimensional model is thereby expressed implicitly, yielding a neural network model that implicitly represents it.
Finally, isosurface extraction is performed on this neural network model. For example, the Marching Cubes (MC) algorithm can be used to extract the surface of the three-dimensional model, and the object's three-dimensional model is then obtained from that surface.
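As an illustration of this last step, the sketch below runs Marching Cubes (via scikit-image's measure.marching_cubes) over a dense grid of SDF predictions. The network interface (a callable mapping (N, 3) coordinates to (N,) SDF values), the grid resolution, and the bounding box are assumptions.

```python
import numpy as np
import torch
from skimage import measure

@torch.no_grad()
def extract_mesh(sdf_net, resolution=256, bound=1.0):
    """Sketch of isosurface extraction from a trained implicit SDF network."""
    xs = np.linspace(-bound, bound, resolution, dtype=np.float32)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)  # (R,R,R,3)

    pts = torch.from_numpy(grid.reshape(-1, 3))
    sdf = sdf_net(pts).reshape(resolution, resolution, resolution).cpu().numpy()

    # Marching Cubes draws the zero level set, i.e. the model surface.
    verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)

    # Map voxel indices back to world coordinates.
    verts = verts / (resolution - 1) * 2.0 * bound - bound
    return verts, faces, normals
```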
The three-dimensional reconstruction scheme provided by this application implicitly models the object's three-dimensional model with a neural network and incorporates depth to improve both the speed and the accuracy of model training. By continuously performing three-dimensional reconstruction of a subject over time, three-dimensional models of the subject at different moments are obtained; the sequence formed by these models in chronological order is the volumetric video captured of the subject. In this way, "volumetric video shooting" can be performed on any subject to obtain a volumetric video of specific content. For example, a dancing subject can be captured to obtain a volumetric video of the dance that can be viewed from any angle, a teaching subject can be captured to obtain a volumetric video of the lesson that can be viewed from any angle, and so on.
To facilitate implementation of the volumetric video decoding method provided by the embodiments of this application, the embodiments further provide a volumetric video decoding apparatus based on the above decoding method. The terms have the same meanings as in the decoding method described above, and the specific implementation details can be found in the method embodiments. Figure 3 shows a block diagram of a volumetric video decoding apparatus according to an embodiment of this application.
As shown in Figure 3, the volumetric video decoding apparatus 300 may include an acquisition module 310, an extraction module 320, an analysis module 330, and a rendering module 340.
The acquisition module 310 may be used to obtain multiple frames of images to be decoded corresponding to the volumetric video; the extraction module 320 may be used to extract the global features corresponding to each frame of the image to be decoded; the analysis module 330 may be used to perform depth analysis based on the global features of each frame to obtain the depth corresponding to each frame; and the rendering module 340 may be used to perform rendering based on each frame of the image to be decoded together with its global features and depth, obtaining the decoded multi-frame three-dimensional models used to generate the volumetric video.
In some embodiments of this application, the extraction module is configured to: for each frame of the image to be decoded, perform feature extraction on the image to obtain its image features, and perform feature fusion on those image features to obtain the global features of the image.
In some embodiments of this application, the extraction module is configured to perform multi-level encoding on the image to be decoded, obtaining the image features output by the encoding at each level, where each level of encoding consists of convolution followed by max pooling and the image features output by one level are used by the encoding at the next level. The extraction module is further configured to perform multi-level decoding on the features to be fused, obtaining the fused features output by the decoding at each level, where each level of decoding consists of deconvolution, concatenation, and convolution performed in sequence, and the fused features output by one level are used by the decoding at the next level. The concatenation at each level splices the deconvolution features output by the deconvolution with the image features of the same level, producing the concatenated features used for the convolution. The global features of the image to be decoded are obtained from the fused features output by the last decoding level.
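This encode/decode structure resembles a U-Net. The following PyTorch sketch is one possible reading of it, with illustrative channel counts; it is not the application's exact network.

```python
import torch
import torch.nn as nn

class FeatureFusionNet(nn.Module):
    """Minimal sketch of the multi-level encode / decode scheme.

    Each encoding level: convolution followed by max pooling; each decoding
    level: deconvolution, concatenation with the same-level image features,
    then convolution. Channel sizes are illustrative assumptions.
    """
    def __init__(self, in_ch=3, chs=(32, 64, 128)):
        super().__init__()
        self.enc = nn.ModuleList()
        c_prev = in_ch
        for c in chs:
            self.enc.append(nn.Sequential(
                nn.Conv2d(c_prev, c, 3, padding=1), nn.ReLU(inplace=True)))
            c_prev = c
        self.pool = nn.MaxPool2d(2)

        self.up, self.dec = nn.ModuleList(), nn.ModuleList()
        for c in reversed(chs[:-1]):
            self.up.append(nn.ConvTranspose2d(c_prev, c, 2, stride=2))
            self.dec.append(nn.Sequential(
                nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU(inplace=True)))
            c_prev = c

    def forward(self, x):
        skips = []
        for enc in self.enc:
            x = enc(x)                      # convolution
            skips.append(x)                 # same-level image features
            x = self.pool(x)                # max pooling (deepest pool unused)
        x = skips.pop()                     # deepest features start the decoder
        for up, dec in zip(self.up, self.dec):
            x = up(x)                                   # deconvolution
            x = torch.cat([x, skips.pop()], dim=1)      # concatenation
            x = dec(x)                                  # convolution
        return x                            # fused features -> global features
```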
In some embodiments of this application, the extraction module is further configured to: compute an attention distribution over the image features of the same level and compute a weighted average according to that distribution, obtaining the weighted-average features of the level; and splice the deconvolution features output by the deconvolution with the weighted-average features of the same level to obtain the concatenated features used for the convolution.
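A hedged sketch of such an attention-weighted skip connection is shown below; the 1x1-convolution scoring and the broadcasting of the weighted average back to the spatial grid are assumed design choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSkip(nn.Module):
    """Illustrative attention over same-level image features before splicing."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution scoring each spatial position (an assumed design).
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, deconv_feat, skip_feat):
        b, c, h, w = skip_feat.shape
        # Attention distribution over the spatial positions of the skip features.
        attn = F.softmax(self.score(skip_feat).view(b, 1, h * w), dim=-1)
        # Weighted-average feature of the same level (one value per channel),
        # broadcast back to the spatial grid so it can be concatenated.
        weighted = (attn * skip_feat.view(b, c, h * w)).sum(-1, keepdim=True)
        weighted = weighted.view(b, c, 1, 1).expand(-1, -1, h, w)
        # Splice deconvolution features with the weighted-average features.
        return torch.cat([deconv_feat, weighted], dim=1)
```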
In some embodiments of this application, the analysis module is configured to input the global features of each frame of the image to be decoded into a recurrent neural network for depth analysis, obtaining the depth corresponding to each frame output by the recurrent neural network.
In some embodiments of this application, the analysis module is configured to input the global features of each frame of the image to be decoded into a gated recurrent unit for depth analysis, obtaining the depth corresponding to each frame output by the gated recurrent unit.
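As a sketch, the depth analysis with a gated recurrent unit could look like the following; the feature and output dimensions are placeholders, not values specified by this application.

```python
import torch
import torch.nn as nn

class DepthGRU(nn.Module):
    """Sketch of depth analysis with a gated recurrent unit (GRU).

    Input:  per-frame global feature vectors, shape (batch, frames, feat_dim).
    Output: one depth map (flattened to H*W values) per frame; the sizes
            used here are illustrative assumptions.
    """
    def __init__(self, feat_dim=256, hidden=256, depth_hw=64 * 64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, depth_hw)

    def forward(self, global_feats):
        hidden_states, _ = self.gru(global_feats)   # (batch, frames, hidden)
        return self.head(hidden_states)             # (batch, frames, H*W) depths
```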
In some embodiments of this application, the rendering module is configured to input each frame of the image to be decoded, together with its global features and depth, into a convolutional neural network for rendering, obtaining the three-dimensional model corresponding to each frame output by the convolutional neural network.
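One way to sketch this rendering step is shown below: the image, its global feature map, and its depth map are stacked channel-wise and a convolutional network regresses a per-frame volumetric representation. The voxel-grid output form and the channel counts are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RenderCNN(nn.Module):
    """Illustrative convolutional renderer: image + global features + depth in,
    a per-frame volumetric representation out (a voxel grid is assumed here)."""
    def __init__(self, in_ch=3 + 32 + 1, voxel_res=32):
        super().__init__()
        self.voxel_res = voxel_res
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, voxel_res ** 3)

    def forward(self, image, global_feat, depth):
        # Stack the decoded image, its global feature map and its depth map
        # along the channel dimension before the convolutional layers.
        x = torch.cat([image, global_feat, depth], dim=1)
        x = self.body(x).flatten(1)
        occupancy = torch.sigmoid(self.head(x))
        r = self.voxel_res
        return occupancy.view(-1, r, r, r)          # one 3D model per frame
```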
In some embodiments of this application, the apparatus further includes a generation module configured to serialize the decoded multi-frame three-dimensional models in the order of the images to be decoded to which each frame of the three-dimensional model corresponds, obtaining the volumetric video; a trivial sketch of this step follows.
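The sketch assumes the decoded models are keyed by the index of their source image; the plain-list output form is an illustration only.

```python
def serialize_models(models_by_frame):
    """Order decoded 3D models by the index of their source image to form
    the volumetric video (represented here as a plain list of frames)."""
    # models_by_frame: dict mapping the frame index of the image to be decoded
    # to the 3D model decoded from it.
    return [models_by_frame[i] for i in sorted(models_by_frame)]
```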
It should be noted that, although several modules or units of the apparatus are mentioned in the detailed description above, this division is not mandatory. According to the embodiments of this application, the features and functions of two or more of the modules or units described above may be embodied in a single module or unit; conversely, the features and functions of one module or unit described above may be further divided among multiple modules or units.
In addition, an embodiment of this application further provides an electronic device, which may be a terminal or a server. Figure 4 shows a schematic structural diagram of the electronic device involved in this embodiment. Specifically:
The electronic device may include a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the structure shown in Figure 4 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Among them:
The processor 401 is the control center of the electronic device. It connects the various parts of the whole device through various interfaces and lines and, by running or executing the software programs and/or modules stored in the memory 402 and invoking the data stored in the memory 402, performs the device's functions and processes its data, thereby monitoring the electronic device as a whole. Optionally, the processor 401 may include one or more processing cores; preferably, it may integrate an application processor, which mainly handles the operating system, user interface, and applications, and a modem processor, which mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules; the processor 401 runs the software programs and modules stored in the memory 402 to execute various functional applications and process data. The memory 402 may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the applications required by at least one function (such as a sound playback function or an image playback function), while the data storage area may store data created through use of the device. In addition, the memory 402 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may further include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further includes a power supply 403 that supplies power to the components. Preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 403 may also include any components such as one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.
The electronic device may further include an input unit 404, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described here. Specifically, in this embodiment, the processor 401 of the electronic device loads the executable files corresponding to the processes of one or more computer programs into the memory 402 according to the following instructions, and runs the computer programs stored in the memory 402, thereby implementing the various functions of the foregoing embodiments of this application.
For example, the processor 401 may perform the following steps: obtain multiple frames of images to be decoded corresponding to the volumetric video; extract the global features corresponding to each frame of the image to be decoded; perform depth analysis based on the global features of each frame to obtain the depth corresponding to each frame; and perform rendering based on each frame of the image to be decoded together with its global features and depth, obtaining the decoded multi-frame three-dimensional models used to generate the volumetric video.
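Wiring the illustrative modules sketched earlier together, the overall decoding flow could look like the following. All module names and shapes are placeholders, and it is assumed that the modules are constructed with matching dimensions (e.g. the GRU's feature size equal to the fusion network's channel count and its head size equal to H*W); this is not the application's implementation.

```python
import torch

def decode_volumetric_video(frames, fusion_net, depth_gru, render_cnn):
    """End-to-end sketch of the decoding steps listed above.

    frames: tensor of images to be decoded, shape (num_frames, 3, H, W).
    Returns a list of per-frame 3D models that forms the volumetric video.
    """
    with torch.no_grad():
        # 1. Extract the global features of every frame to be decoded.
        global_feats = fusion_net(frames)                        # (F, C, H, W)

        # 2. Depth analysis on the per-frame global features.
        pooled = global_feats.mean(dim=(2, 3)).unsqueeze(0)      # (1, F, C)
        depths = depth_gru(pooled).squeeze(0)                    # (F, H*W), assumes head size == H*W
        depths = depths.view(frames.shape[0], 1, frames.shape[2], frames.shape[3])

        # 3. Render each frame with its global features and depth into a 3D model.
        models = [render_cnn(frames[i:i + 1], global_feats[i:i + 1], depths[i:i + 1])
                  for i in range(frames.shape[0])]
    return models
```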
Those of ordinary skill in the art will understand that all or part of the steps in the methods of the above embodiments may be completed by a computer program, or by a computer program controlling the relevant hardware. The computer program may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of this application further provides a storage medium storing a computer program that can be loaded by a processor to perform the steps of any of the methods provided in the embodiments of this application.
The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Since the computer program stored in the storage medium can perform the steps of any of the methods provided in the embodiments of this application, it can achieve the beneficial effects achievable by those methods; see the preceding embodiments for details, which are not repeated here.
Other embodiments of this application will readily occur to those skilled in the art after considering the specification and practicing the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein.
It should be understood that this application is not limited to the embodiments described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope.

Claims (20)

  1. A method for decoding volumetric video, wherein the method comprises:
    obtaining multiple frames of images to be decoded corresponding to the volumetric video;
    extracting global features corresponding to each frame of the image to be decoded;
    performing depth analysis based on the global features corresponding to each frame of the image to be decoded, to obtain a depth corresponding to each frame of the image to be decoded;
    performing rendering based on each frame of the image to be decoded together with the corresponding global features and the depth, to obtain decoded multi-frame three-dimensional models, the multi-frame three-dimensional models being used to generate the volumetric video.
  2. The method according to claim 1, wherein extracting the global features corresponding to each frame of the image to be decoded comprises:
    for each frame of the image to be decoded, performing feature extraction on the image to be decoded to obtain image features corresponding to the image to be decoded;
    performing feature fusion on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded.
  3. The method according to claim 2, wherein performing feature extraction on the image to be decoded to obtain the image features corresponding to the image to be decoded comprises:
    performing multi-level encoding on the image to be decoded to obtain the image features output by the encoding at each level, wherein the encoding at each level comprises convolution followed by max pooling, and the image features output by the encoding at one level are used for the encoding at the next level;
    and performing feature fusion on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded comprises:
    performing multi-level decoding on features to be fused to obtain fused features output by the decoding at each level, wherein the decoding at each level comprises deconvolution, concatenation, and convolution performed in sequence; the fused features output by the decoding at one level are used for the decoding at the next level; and the concatenation at each level comprises concatenating the deconvolution features output by the deconvolution with the image features of the same level to obtain concatenated features used for the convolution; and,
    obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding at the last level.
  4. The method according to claim 3, wherein concatenating the deconvolution features output by the deconvolution with the image features of the same level to obtain the concatenated features used for the convolution comprises:
    computing an attention distribution over the image features of the same level, and computing a weighted average according to the attention distribution to obtain weighted-average features of the same level;
    concatenating the deconvolution features output by the deconvolution with the weighted-average features of the same level to obtain the concatenated features used for the convolution.
  5. The method according to claim 1, wherein performing depth analysis based on the global features corresponding to each frame of the image to be decoded, to obtain the depth corresponding to each frame of the image to be decoded, comprises:
    inputting the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis, to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
  6. The method according to claim 5, wherein inputting the global features corresponding to each frame of the image to be decoded into the recurrent neural network for depth analysis, to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network, comprises:
    inputting the global features corresponding to each frame of the image to be decoded into a gated recurrent unit for depth analysis, to obtain the depth corresponding to each frame of the image to be decoded output by the gated recurrent unit.
  7. The method according to claim 1, wherein performing rendering based on each frame of the image to be decoded together with the corresponding global features and the depth, to obtain the decoded multi-frame three-dimensional models, comprises:
    inputting each frame of the image to be decoded together with the corresponding global features and the depth into a convolutional neural network for rendering, to obtain the three-dimensional model corresponding to each frame of the image to be decoded output by the convolutional neural network.
  8. The method according to claim 1, wherein, after performing rendering based on each frame of the image to be decoded together with the corresponding global features and the depth to obtain the decoded multi-frame three-dimensional models, the method comprises:
    serializing the decoded multi-frame three-dimensional models in the order of the images to be decoded to which each frame of the three-dimensional model corresponds, to obtain the volumetric video.
  9. The method according to claim 1, wherein extracting the global features corresponding to each frame of the image to be decoded comprises:
    for each frame of the image to be decoded, computing a histogram corresponding to the frame by means of a histogram calculation function and using the histogram as the global features.
  10. The method according to claim 3, wherein obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding at the last level comprises:
    using the fused features output by the decoding at the last level as the global features corresponding to the image to be decoded.
  11. The method according to claim 3, wherein obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding at the last level comprises:
    performing dimensionality reduction on the fused features output by the decoding at the last level, and using the dimension-reduced features as the global features corresponding to the image to be decoded.
  12. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor of a computer, the computer is caused to perform operations comprising:
    obtaining multiple frames of images to be decoded corresponding to the volumetric video;
    extracting global features corresponding to each frame of the image to be decoded;
    performing depth analysis based on the global features corresponding to each frame of the image to be decoded, to obtain a depth corresponding to each frame of the image to be decoded;
    performing rendering based on each frame of the image to be decoded together with the corresponding global features and the depth, to obtain decoded multi-frame three-dimensional models, the multi-frame three-dimensional models being used to generate the volumetric video.
  13. The non-transitory computer-readable storage medium according to claim 12, wherein extracting the global features corresponding to each frame of the image to be decoded comprises:
    for each frame of the image to be decoded, performing feature extraction on the image to be decoded to obtain image features corresponding to the image to be decoded;
    performing feature fusion on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded.
  14. The non-transitory computer-readable storage medium according to claim 13, wherein performing feature extraction on the image to be decoded to obtain the image features corresponding to the image to be decoded comprises:
    performing multi-level encoding on the image to be decoded to obtain the image features output by the encoding at each level, wherein the encoding at each level comprises convolution followed by max pooling, and the image features output by the encoding at one level are used for the encoding at the next level;
    and performing feature fusion on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded comprises:
    performing multi-level decoding on features to be fused to obtain fused features output by the decoding at each level, wherein the decoding at each level comprises deconvolution, concatenation, and convolution performed in sequence; the fused features output by the decoding at one level are used for the decoding at the next level; and the concatenation at each level comprises concatenating the deconvolution features output by the deconvolution with the image features of the same level to obtain concatenated features used for the convolution;
    obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding at the last level.
  15. The non-transitory computer-readable storage medium according to claim 14, wherein concatenating the deconvolution features output by the deconvolution with the image features of the same level to obtain the concatenated features used for the convolution comprises:
    computing an attention distribution over the image features of the same level, and computing a weighted average according to the attention distribution to obtain weighted-average features of the same level;
    concatenating the deconvolution features output by the deconvolution with the weighted-average features of the same level to obtain the concatenated features used for the convolution.
  16. The non-transitory computer-readable storage medium according to claim 12, wherein performing depth analysis based on the global features corresponding to each frame of the image to be decoded, to obtain the depth corresponding to each frame of the image to be decoded, comprises:
    inputting the global features corresponding to each frame of the image to be decoded into a recurrent neural network for depth analysis, to obtain the depth corresponding to each frame of the image to be decoded output by the recurrent neural network.
  17. An electronic device, comprising: a memory storing a computer program; and a processor that reads the computer program stored in the memory to perform operations comprising:
    obtaining multiple frames of images to be decoded corresponding to the volumetric video;
    extracting global features corresponding to each frame of the image to be decoded;
    performing depth analysis based on the global features corresponding to each frame of the image to be decoded, to obtain a depth corresponding to each frame of the image to be decoded;
    performing rendering based on each frame of the image to be decoded together with the corresponding global features and the depth, to obtain decoded multi-frame three-dimensional models, the multi-frame three-dimensional models being used to generate the volumetric video.
  18. The electronic device according to claim 17, wherein extracting the global features corresponding to each frame of the image to be decoded comprises:
    for each frame of the image to be decoded, performing feature extraction on the image to be decoded to obtain image features corresponding to the image to be decoded;
    performing feature fusion on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded.
  19. The electronic device according to claim 18, wherein performing feature extraction on the image to be decoded to obtain the image features corresponding to the image to be decoded comprises:
    performing multi-level encoding on the image to be decoded to obtain the image features output by the encoding at each level, wherein the encoding at each level comprises convolution followed by max pooling, and the image features output by the encoding at one level are used for the encoding at the next level;
    and performing feature fusion on the image features corresponding to the image to be decoded to obtain the global features corresponding to the image to be decoded comprises:
    performing multi-level decoding on features to be fused to obtain fused features output by the decoding at each level, wherein the decoding at each level comprises deconvolution, concatenation, and convolution performed in sequence; the fused features output by the decoding at one level are used for the decoding at the next level; and the concatenation at each level comprises concatenating the deconvolution features output by the deconvolution with the image features of the same level to obtain concatenated features used for the convolution;
    obtaining the global features corresponding to the image to be decoded according to the fused features output by the decoding at the last level.
  20. The electronic device according to claim 19, wherein concatenating the deconvolution features output by the deconvolution with the image features of the same level to obtain the concatenated features used for the convolution comprises:
    computing an attention distribution over the image features of the same level, and computing a weighted average according to the attention distribution to obtain weighted-average features of the same level;
    concatenating the deconvolution features output by the deconvolution with the weighted-average features of the same level to obtain the concatenated features used for the convolution.
PCT/CN2023/075406 2023-01-31 2023-02-10 Decoding method for volumetric video, and storage medium and electronic device WO2024159553A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310064989.1 2023-01-31
CN202310064989.1A CN116095338A (en) 2023-01-31 2023-01-31 Decoding method, device, medium, equipment and product of volume video

Publications (1)

Publication Number Publication Date
WO2024159553A1 true WO2024159553A1 (en) 2024-08-08

Family

ID=86198873

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/075406 WO2024159553A1 (en) 2023-01-31 2023-02-10 Decoding method for volumetric video, and storage medium and electronic device

Country Status (2)

Country Link
CN (1) CN116095338A (en)
WO (1) WO2024159553A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116095338A (en) * 2023-01-31 2023-05-09 珠海普罗米修斯视觉技术有限公司 Decoding method, device, medium, equipment and product of volume video

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3457688A1 (en) * 2017-09-15 2019-03-20 Thomson Licensing Methods and devices for encoding and decoding three degrees of freedom and volumetric compatible video stream
CN110874864A (en) * 2019-10-25 2020-03-10 深圳奥比中光科技有限公司 Method, device, electronic equipment and system for obtaining three-dimensional model of object
CN111325068A (en) * 2018-12-14 2020-06-23 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN112101252A (en) * 2020-09-18 2020-12-18 广州云从洪荒智能科技有限公司 Image processing method, system, device and medium based on deep learning
CN112669441A (en) * 2020-12-09 2021-04-16 北京达佳互联信息技术有限公司 Object reconstruction method and device, electronic equipment and storage medium
EP3873095A1 (en) * 2020-02-27 2021-09-01 Nokia Technologies Oy An apparatus, a method and a computer program for omnidirectional video
CN115578542A (en) * 2022-10-27 2023-01-06 珠海普罗米修斯视觉技术有限公司 Three-dimensional model processing method, device, equipment and computer readable storage medium
CN116095338A (en) * 2023-01-31 2023-05-09 珠海普罗米修斯视觉技术有限公司 Decoding method, device, medium, equipment and product of volume video


Also Published As

Publication number Publication date
CN116095338A (en) 2023-05-09

Similar Documents

Publication Publication Date Title
US20240046557A1 (en) Method, device, and non-transitory computer-readable storage medium for reconstructing a three-dimensional model
CN111542861A (en) System and method for rendering an avatar using a depth appearance model
WO2022205760A1 (en) Three-dimensional human body reconstruction method and apparatus, and device and storage medium
WO2024051445A1 (en) Image generation method and related device
WO2020247075A1 (en) Novel pose synthesis
EP3991140A1 (en) Portrait editing and synthesis
US20220156987A1 (en) Adaptive convolutions in neural networks
CN116977522A (en) Rendering method and device of three-dimensional model, computer equipment and storage medium
WO2022205755A1 (en) Texture generation method and apparatus, device, and storage medium
WO2024159553A1 (en) Decoding method for volumetric video, and storage medium and electronic device
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
WO2024027063A1 (en) Livestream method and apparatus, storage medium, electronic device and product
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
CN117274446A (en) Scene video processing method, device, equipment and storage medium
Li et al. Dynamic View Synthesis with Spatio-Temporal Feature Warping from Sparse Views
CN116132653A (en) Processing method and device of three-dimensional model, storage medium and computer equipment
CN116245989A (en) Method and device for processing volume video, storage medium and computer equipment
CN116095353A (en) Live broadcast method and device based on volume video, electronic equipment and storage medium
CN115497029A (en) Video processing method, device and computer readable storage medium
CN110689602A (en) Three-dimensional face reconstruction method, device, terminal and computer readable storage medium
WO2024124664A1 (en) Video processing method and apparatus, computer device, and computer-readable storage medium
WO2024159555A1 (en) Video processing method and apparatus, and computer readable storage medium
CN117422809B (en) Data processing method for rendering light field image
CN115035230B (en) Video rendering processing method, device and equipment and storage medium
CN115442634A (en) Image compression method, device, storage medium, electronic equipment and product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23919127

Country of ref document: EP

Kind code of ref document: A1