CN115497029A - Video processing method, device and computer readable storage medium - Google Patents

Video processing method, device and computer readable storage medium

Info

Publication number
CN115497029A
CN115497029A (application CN202211268922.1A)
Authority
CN
China
Prior art keywords
parameters
video
dimensional model
data
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211268922.1A
Other languages
Chinese (zh)
Inventor
张煜
江宇骄
孙伟
邵志兢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Prometheus Vision Technology Co ltd
Original Assignee
Zhuhai Prometheus Vision Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Prometheus Vision Technology Co ltd
Priority to CN202211268922.1A
Publication of CN115497029A
Legal status: Pending

Classifications

    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06N3/04: Neural networks; Architecture, e.g. interconnection topology
    • G06N3/08: Neural networks; Learning methods
    • G06T7/80: Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/82: Image or video recognition or understanding using neural networks
    • G06V20/64: Three-dimensional objects
    • H04N13/275: Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals
    • H04N13/282: Image signal generators for generating image signals corresponding to three or more geometrical viewpoints, e.g. multi-view systems
    • G06T2207/10016: Video; Image sequence
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/30244: Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a video processing method, a video processing apparatus, and a computer-readable storage medium. The method includes: acquiring a multi-view synchronized video of a target object, and identifying the joint points in each video frame of the multi-view synchronized video; determining three-dimensional model parameters of the target object in each video frame according to the joint point identification results, where the three-dimensional model parameters include pose parameters, body shape parameters, vertex displacement data, and texture data; training a conditional variational autoencoder that takes the pose parameters as its variable, based on a training data set composed of a plurality of video frames and their corresponding three-dimensional model parameters; and acquiring target pose parameters, and generating a driving video based on the conditional variational autoencoder and the target pose parameters. The method can improve the effect of driving the characters in a video.

Description

Video processing method, device and computer readable storage medium
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video processing method and apparatus, and a computer-readable storage medium.
Background
With the continuous development of internet technology, daily life has become inseparable from the internet. In the internet era, with the continuous development of intelligent terminal technology and the continuous reduction of traffic costs, the form of information transmission has also changed greatly: it has gradually developed from traditional text transmission to a mode that combines text, pictures, and video. Among these, video is becoming the primary mode of information transmission because of its large information capacity, rich content, and varied presentation forms.
At present, most research on video focuses on the acquisition, transmission, and playback ends. There is little research on re-driving the characters in a video, and the current effect of re-driving characters in a video is poor.
Disclosure of Invention
The embodiments of the application provide a video processing method, a video processing apparatus, and a computer-readable storage medium, which can effectively improve the effect of driving the characters in a video.
A first aspect of the present application provides a video processing method, where the method includes:
acquiring a multi-view synchronized video of a target object, and identifying joint points in each video frame of the multi-view synchronized video;
determining three-dimensional model parameters of the target object in each video frame according to the joint point identification results, wherein the three-dimensional model parameters include pose parameters, body shape parameters, vertex displacement data, and texture data;
training a conditional variational autoencoder that takes the pose parameters as its variable, based on a training data set composed of a plurality of video frames and their corresponding three-dimensional model parameters;
and acquiring target pose parameters, and generating a driving video based on the conditional variational autoencoder and the target pose parameters.
Accordingly, a second aspect of the present application provides a video processing apparatus comprising:
an acquisition unit, which is used for acquiring a multi-view synchronized video of a target object and identifying joint points in each video frame of the multi-view synchronized video;
a determining unit, which is used for determining three-dimensional model parameters of the target object in each video frame according to the joint point identification results, wherein the three-dimensional model parameters include pose parameters, body shape parameters, vertex displacement data, and texture data;
a training unit, which is used for training a conditional variational autoencoder that takes the pose parameters as its variable, based on a training data set composed of a plurality of video frames and their corresponding three-dimensional model parameters;
and a generating unit, which is used for acquiring target pose parameters and generating a driving video based on the conditional variational autoencoder and the target pose parameters.
In some embodiments, the obtaining unit comprises:
the frame-splitting subunit is used for splitting the video corresponding to each view of the multi-view synchronized video into frames to obtain a plurality of video frames;
and the detection subunit is used for carrying out joint point detection on each video frame to obtain a joint point identification result of each video frame.
In some embodiments, a detection subunit includes:
the detection module is used for performing pose detection on each video frame to obtain two-dimensional joint point data of each video frame;
the processing module is used for triangulating the two-dimensional joint point data to obtain three-dimensional joint point data of each video frame;
and the determining module is used for determining the joint point identification result of each video frame according to the two-dimensional joint point data and the three-dimensional joint point data.
In some embodiments, the determining unit comprises:
the first fitting subunit is used for fitting a three-dimensional model of the target object according to the joint point identification result of each video frame to obtain body shape parameters and pose parameters of the three-dimensional model;
the second fitting subunit is used for performing displacement fitting on the vertices of the three-dimensional model to obtain vertex displacement data of the three-dimensional model;
and the determining subunit is used for determining texture data of the three-dimensional model according to each video frame, and determining the three-dimensional model parameters of the target object according to the body shape parameters, the pose parameters, the vertex displacement data, and the texture data.
In some embodiments, the second fitting subunit comprises:
the determining module is used for determining a target joint point corresponding to each vertex in the three-dimensional model, wherein the target joint point is the joint point closest to the vertex;
and the calculation module is used for calculating displacement data between each vertex and the corresponding target joint point to obtain vertex displacement data of the three-dimensional model.
In some embodiments, a training unit comprises:
the first input subunit is used for taking the pose parameters of the three-dimensional model as input to obtain the output model data produced by the conditional variational autoencoder;
and the adjusting subunit is used for adjusting the parameters of the conditional variational autoencoder based on the difference between the output model data and the body shape parameters, vertex displacement data, and texture data in the three-dimensional model parameters.
In some embodiments, a generation unit comprises:
the receiving subunit is used for receiving input target pose parameters;
the second input subunit is used for inputting the target pose parameters into the conditional variational autoencoder to obtain output target model data;
the rendering subunit is used for rendering the target model data to obtain multi-view synchronized images;
and the generating subunit is used for generating a driving video according to the multi-view synchronized images.
The third aspect of the present application further provides a computer-readable storage medium, which stores a plurality of instructions, where the instructions are suitable for being loaded by a processor to execute the steps in the video processing method provided by the first aspect of the present application.
A fourth aspect of the present application provides a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the video processing method provided in the first aspect of the present application when executing the computer program.
A fifth aspect of the present application provides a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps in the video processing method provided by the first aspect.
According to the video processing method provided by the embodiments of the application, a multi-view synchronized video of a target object is acquired, and joint point identification is performed on each video frame of the multi-view synchronized video; three-dimensional model parameters of the target object in each video frame are determined according to the joint point identification results, where the three-dimensional model parameters include pose parameters, body shape parameters, vertex displacement data, and texture data; a conditional variational autoencoder that takes the pose parameters as its variable is trained based on a training data set composed of a plurality of video frames and their corresponding three-dimensional model parameters; and target pose parameters are acquired, and a driving video is generated based on the conditional variational autoencoder and the target pose parameters.
In the video processing method provided by the application, the pose parameters are decoupled from the rest of the three-dimensional model by design, and a conditional variational autoencoder that takes the pose parameters as its variable is then trained. As a result, the three-dimensional model can be driven to generate a driving video simply by inputting new pose parameters into the conditional variational autoencoder. The method can improve the accuracy of three-dimensional model driving, and can therefore improve the effect of driving the characters in a video.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of a scene of video processing in the present application;
FIG. 2 is a schematic flow chart of a video processing method provided in the present application;
FIG. 3 is a schematic structural diagram of a video processing apparatus provided in the present application;
fig. 4 is a schematic structural diagram of a computer device provided in the present application.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It should be apparent that the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiments of the invention provide a video processing method, a video processing apparatus, a computer-readable storage medium, and a computer device. The video processing method can be used in a video processing apparatus. The video processing apparatus may be integrated in a computer device, which may be a terminal or a server. The terminal can be a mobile phone, a tablet computer, a notebook computer, a smart television, a wearable smart device, a personal computer (PC), a vehicle-mounted terminal, and the like. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), big data, and artificial intelligence platforms. The server may also be a node in a blockchain.
Please refer to fig. 1, which is a scene diagram of the video processing method according to the present application. As shown in the figure, a server A acquires a multi-view synchronized video of a target object and performs joint point identification on each video frame of the multi-view synchronized video; determines three-dimensional model parameters of the target object in each video frame according to the joint point identification results, where the three-dimensional model parameters include pose parameters, body shape parameters, vertex displacement data, and texture data; and trains a conditional variational autoencoder that takes the pose parameters as its variable, based on a training data set composed of a plurality of video frames and their corresponding three-dimensional model parameters. The server A then acquires target pose parameters from a terminal B and generates a driving video based on the conditional variational autoencoder and the target pose parameters. Further, the server A may transmit the generated driving video to the terminal B.
Based on the above-described implementation scenarios, detailed descriptions will be given below.
In the related art, volume video has been extensively studied and developed in recent years. Volume video (also called volumetric video, spatial video, volumetric three-dimensional video, or six-degree-of-freedom video) is a technology for generating a three-dimensional model sequence by capturing information in three-dimensional space (such as depth information and color information). Compared with traditional video, volume video adds the concept of space to video and uses three-dimensional models to restore the three-dimensional world, rather than simulating the sense of space with a two-dimensional plane video and camera movement. Because a volume video is a sequence of three-dimensional models, a user can watch it from any viewing angle according to preference, so it offers a higher degree of fidelity and immersion than a two-dimensional plane video.
In this embodiment, a three-dimensional model constituting a volume video is represented by voxels and volume textures. The three-dimensional model may represent real objects or imaginary objects including, but not limited to, real characters, real animals, and the like.
A voxel (volume pixel) is the smallest unit of digital data in a three-dimensional partition of space, conceptually similar to the smallest unit of two-dimensional space, the pixel. Pixels are used in the image data of two-dimensional computer images, while a voxel represents a three-dimensional object with a constant scalar or vector value.
A volume texture is a logical extension of the traditional two-dimensional texture. A two-dimensional texture is a simple two-dimensional image used to provide surface details (such as patterns, lines, and colors) for a three-dimensional model, while a volume texture can be regarded as a stack of two-dimensional textures used to describe data in three-dimensional space.
Alternatively, the three-dimensional model used to construct the volumetric video may be reconstructed as follows (a neural network-based multi-view stereoscopic 3D geometric reconstruction method):
acquiring color images and depth images of the shooting object, together with the intrinsic and extrinsic camera parameters at the time each color image and depth image is captured;
adopting a NeuS-based three-dimensional geometric reconstruction technique built on neural implicit surfaces: a neural network model implicitly expressing the three-dimensional model of the shooting object is trained according to the intrinsic and extrinsic camera parameters and the corresponding color images and depth images, isosurface extraction is then performed on the trained neural network model, and three-dimensional reconstruction of the shooting object is thereby achieved, yielding the three-dimensional model of the shooting object.
Specifically, any pixel point in a color image may be converted into a ray based on the intrinsic and extrinsic camera parameters corresponding to that color image; the ray is the ray passing through the pixel point and perpendicular to the color image plane. A plurality of sampling points are then sampled on the ray. The sampling may be executed in two steps: some sampling points are sampled uniformly, and further sampling points are then sampled at the key position based on the depth value of the pixel point, so that as many sampling points as possible lie near the surface of the model. Further, the coordinate value of each sampling point in the world coordinate system, the Signed Distance Field (SDF) value of each sampling point, and the sampled RGB color value may be calculated from the camera parameters and the pixel depth values. The signed distance value may be the difference between the depth value of the pixel point and the distance from the sampling point to the imaging plane of the camera, and this difference is a signed value: when the difference is positive, the sampling point lies outside the model; when the difference is negative, the sampling point lies inside the model; and when the difference is 0, the sampling point lies on the model surface.
Then a neural network model, which may be a multilayer perceptron (MLP) without a normalization layer, may be trained in a supervised manner, with the coordinate values of the sampling points as model input and the signed distance values and RGB color values of the sampling points as supervision. Sampling points can be obtained in this way for all pixel points corresponding to one set of intrinsic and extrinsic camera parameters, the sampling points are input into the neural network model, the data output by the model are rendered into a color image and a depth image, and these are compared with the color image and depth image corresponding to those camera parameters. The neural network model is adjusted iteratively based on a preset loss function and the comparison difference until the neural network model converges, giving an accurate neural network model that implicitly expresses the three-dimensional model of the shooting object. That is, for any point in space, the neural network model can determine from its coordinate values in the world coordinate system whether the point lies inside, outside, or on the surface of the model; in other words, the neural network model implicitly expresses the three-dimensional model of the shooting object. Further, the model surface can be extracted from the neural network model using an isosurface extraction algorithm, yielding the three-dimensional model of the shooting object.
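As an illustrative sketch only (not the patent's implementation), the following PyTorch-style module shows the kind of MLP described above: it maps a world coordinate to a predicted signed distance value and an RGB color and uses no normalization layers; all layer sizes and names are assumptions.
```python
# Illustrative sketch only (PyTorch assumed); layer sizes and names are not from the patent.
import torch
import torch.nn as nn

class ImplicitSurfaceMLP(nn.Module):
    """Maps a 3D world coordinate to a predicted SDF value and an RGB color,
    implicitly representing the three-dimensional model of the shooting object."""

    def __init__(self, hidden: int = 256, depth: int = 8):
        super().__init__()
        layers, in_dim = [], 3
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.Softplus(beta=100)]
            in_dim = hidden
        self.backbone = nn.Sequential(*layers)   # no normalization layers, as stated above
        self.sdf_head = nn.Linear(hidden, 1)     # signed distance to the surface
        self.rgb_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())  # color in [0, 1]

    def forward(self, xyz: torch.Tensor):
        feat = self.backbone(xyz)
        return self.sdf_head(feat), self.rgb_head(feat)
```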
The method for three-dimensional reconstruction of the volumetric video is described in detail below.
Volume video shooting may use a plurality of industrial cameras and depth cameras to shoot a target object (the shooting object) in a studio from multiple angles simultaneously to obtain shooting data. That is, at each instant, color images of the target object at multiple angles and the depth image corresponding to each color image can be captured. During shooting, the industrial cameras and depth cameras may be arranged in camera sets, with one industrial camera paired with one depth camera to shoot the target object.
In addition, in the embodiments of the present application, the camera parameters of each camera at each shooting moment may also be acquired. The camera parameters include the intrinsic and extrinsic parameters of the camera. The intrinsic parameters are related to the characteristics of the camera itself, and may specifically include data such as the focal length and pixel size of the camera; the extrinsic parameters are the parameters of the camera in the world coordinate system, and may specifically include data such as the position (coordinates) of the camera and its rotation. Camera parameters can be determined by calibration: in image measurement and machine vision applications, in order to determine the relationship between the three-dimensional geometric position of a point on the surface of an object in space and its corresponding point in the image, a geometric model of camera imaging must be established, and the parameters of this geometric model are the camera parameters. In most conditions these parameters can only be obtained through experiments and calculation, and the process of solving them (intrinsic parameters, extrinsic parameters, and distortion parameters) is called camera calibration. In image measurement or machine vision applications, calibration of the camera parameters is a very critical step; the accuracy of the calibration result and the stability of the algorithm directly affect the accuracy of the results produced with the camera. Therefore, camera calibration is a precondition for subsequent work, and improving calibration precision is a key focus of research.
After the shooting data of the target object is obtained, that is, the data obtained by shooting the volume video of the target object, including the color images and depth images of the target object at different moments and from multiple views, three-dimensional reconstruction of the target object needs to be performed based on that shooting data. In the related art, pixels are often converted into voxels based on the depth information of the pixel points in a captured image to obtain a point cloud, and three-dimensional reconstruction is then performed on the point cloud. However, as mentioned above, the reconstruction accuracy of this approach is low. An embodiment of the present application therefore provides a method of performing three-dimensional reconstruction based on a neural network model: specifically, a neural network model implicitly representing the three-dimensional model of the target object can be trained, and the three-dimensional model of the target object can then be reconstructed from that neural network model.
The neural network model may be a multilayer perceptron (MLP) that does not include a normalization layer. The neural network model may be trained using the camera parameters in the shooting data and the corresponding captured color and depth images. Specifically, the intrinsic and extrinsic parameters included in the camera parameters may be used as input to the neural network model, volume rendering may be performed on the data output by the neural network model to obtain corresponding depth images and color images, and the parameters of the neural network model are then adjusted based on the differences between the rendered depth and color images and the actual depth and color images corresponding to those camera parameters. That is, the neural network model is iteratively trained with the actual depth and color images corresponding to the camera parameters serving as supervision, until the trained neural network model is obtained.
Wherein, in some embodiments, training a neural network model that implicitly represents a three-dimensional model of the target object based on the shot data comprises:
converting pixel points in each color image into rays based on corresponding camera parameters;
sampling a plurality of sampling points on a ray, and determining first coordinate information of each sampling point and a signed distance value between each sampling point and the pixel point;
inputting the coordinate information of the sampling points into a neural network model that implicitly represents the three-dimensional model of the target object, to obtain a predicted signed distance value and a predicted color value output for each sampling point;
and adjusting the parameters of the neural network model based on a first difference between the predicted signed distance value and the signed distance value and a second difference between the predicted color value and the color value of the pixel point, to obtain the trained neural network model.
Specifically, in the embodiments of the present application, training the neural network model based on the camera parameters and the corresponding color and depth images may proceed as follows. A pixel point in a captured color image is converted into a ray based on the camera parameters. A plurality of sampling points are then sampled on the ray, and the coordinate information of each sampling point and the signed distance value between each sampling point and the pixel point are determined. The signed distance value may be the difference between the depth value of the pixel point and the distance from the sampling point to the imaging plane of the camera, and this difference is a signed value.
The signed distance value is also referred to as the Signed Distance Field (SDF) value: when the sampling point is inside the target object, its SDF value is negative; when the sampling point is outside the target object, the SDF value is positive; and when the sampling point is on the surface of the target object, the SDF value is 0. That is, the signed distance value between a sampling point and the pixel point also represents the positional relationship between the sampling point and the three-dimensional model. The coordinate information of the sampling points is then input into a neural network model that implicitly represents the three-dimensional model of the target object, to obtain the predicted signed distance values and predicted color values output by the neural network model. The neural network model is then iteratively trained with the actual color values of the pixel points in the color image and the actual depth values of the pixel points in the depth image corresponding to the camera parameters as supervision, until the model parameters of the neural network model converge, giving the trained neural network model.
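A minimal training step consistent with the supervision just described might look as follows; the tensor names are assumptions, and the per-sample ground-truth signed distances and colors are presumed to have been precomputed as above.
```python
# Minimal supervised iteration consistent with the description above; tensor names are
# assumptions, and gt_sdf / gt_rgb are presumed to be the precomputed per-sample labels.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, sample_xyz, gt_sdf, gt_rgb):
    pred_sdf, pred_rgb = model(sample_xyz)              # sample_xyz: (N, 3) world coordinates
    loss_sdf = F.l1_loss(pred_sdf.squeeze(-1), gt_sdf)  # first difference: signed distance
    loss_rgb = F.l1_loss(pred_rgb, gt_rgb)              # second difference: pixel color
    loss = loss_sdf + loss_rgb
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```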
Wherein, in some embodiments, converting the pixel points in each color image into rays based on the corresponding camera parameters comprises:
determining an imaging surface of the color image according to the camera parameters;
and determining rays which pass through the pixel points in the color image and are vertical to the imaging surface as rays corresponding to the pixel points.
In the embodiments of the present application, a specific way to convert a pixel point in a color image into a ray based on the corresponding camera parameters may be to determine the coordinate information of the image shot by the camera in the world coordinate system, i.e. to determine the imaging plane, according to the intrinsic and extrinsic parameters of the camera. The ray that passes through the pixel point in the color image and is perpendicular to the imaging plane can then be determined as the ray corresponding to that pixel point. Furthermore, every pixel point in the color image can be traversed to generate the ray corresponding to each pixel point.
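The following sketch shows one common way to build such a ray under a pinhole camera model; the intrinsic matrix K and the camera-to-world rotation R and translation t are assumed inputs, and conventions differ between systems.
```python
# Common pixel-to-ray construction under a pinhole model; K is the intrinsic matrix and
# R, t the camera-to-world rotation and translation (conventions vary between systems).
import numpy as np

def pixel_to_ray(u, v, K, R, t):
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # viewing direction in camera coordinates
    d_world = R @ d_cam                               # rotate the direction into the world frame
    origin = np.asarray(t, dtype=float)               # camera center in world coordinates
    return origin, d_world / np.linalg.norm(d_world)  # ray through the pixel, unit direction
```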
In some embodiments, determining an imaging plane of a color image from camera parameters comprises:
determining second coordinate information of the camera in a world coordinate system and a rotation angle of the camera according to the camera parameters;
and determining an imaging surface of the color image according to the second coordinate information and the rotation angle.
In the embodiments of the present application, the imaging plane of the color image is determined according to the camera parameters: specifically, the coordinate information of the camera in the world coordinate system and the rotation angle of the camera may be extracted from the camera parameters, and the coordinate data of the imaging plane in the world coordinate system may then be determined from them.
In some embodiments, sampling a plurality of sample points on a ray includes:
sampling a first number of first sampling points on a ray at equal intervals;
determining a plurality of key sampling points according to the depth values of the pixel points;
and sampling a second number of second sampling points near the key sampling point, and determining the first number of first sampling points and the second number of second sampling points as a plurality of sampled sampling points.
In the embodiments of the present application, sampling points are sampled on the ray generated from a pixel point. Specifically, n sampling points are uniformly sampled on the ray, where n is a positive integer greater than 2, and m further sampling points are then sampled at the important positions based on these n sampling points, where m is a positive integer greater than 1. The important positions are positions close to the depth of the pixel point, and the sampling points among the n sampling points that lie close to the depth of the pixel point may be called key sampling points. Then m sampling points can be sampled near the key sampling points, and the n + m sampling points obtained in this way are used as the final sampling points. Because the m sampling points are sampled near the key sampling points, the model is trained more accurately near the surface of the three-dimensional model, which further improves the reconstruction precision of the three-dimensional model.
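A simple version of this two-stage sampling, with illustrative defaults for n, m, and the width of the near-surface band, could be written as:
```python
# Two-stage sampling sketch: n uniform samples plus m samples concentrated near the
# pixel's depth; the defaults and the width of the near-surface band are illustrative.
import numpy as np

def sample_along_ray(origin, direction, pixel_depth, near, far, n=64, m=16, band=0.05):
    t_uniform = np.linspace(near, far, n)                       # n samples at equal intervals
    t_surface = pixel_depth + band * (np.random.rand(m) - 0.5)  # m samples near the surface
    t_all = np.sort(np.concatenate([t_uniform, t_surface]))     # n + m sample depths
    points = origin[None, :] + t_all[:, None] * direction[None, :]
    return t_all, points                                        # depths and 3D coordinates
```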
In some embodiments, determining the coordinate information of each sampling point and the signed distance value between each sampling point and the pixel point includes:
determining a depth value corresponding to the pixel point according to the depth image corresponding to the color image;
calculating the signed distance value between each sampling point and the pixel point based on the depth value;
and calculating the coordinate information of each sampling point according to the camera parameters and the depth values.
In the embodiments of the present application, after a plurality of sampling points are sampled on the ray corresponding to each pixel point, the distance between the camera shooting position and the pixel point can be determined from the camera extrinsic parameters and the depth information of the pixel point (read from the depth image), and the signed distance value and the coordinate information of each sampling point are then calculated one by one based on that distance.
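The per-sample supervision can then be assembled as below, following the sign convention stated earlier (positive outside, negative inside, zero on the surface); this is a sketch, not the patent's code.
```python
# Per-sample supervision sketch following the stated sign convention
# (positive outside the model, negative inside, zero on the surface).
import numpy as np

def sample_supervision(depth_image, u, v, t_all, points):
    pixel_depth = float(depth_image[v, u])  # depth value read from the depth image
    sdf = pixel_depth - t_all               # signed distance value for each sampling point
    return points, sdf                      # coordinate information and signed distances
```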
After the neural network model that implicitly represents the three-dimensional model of the target object is trained, the trained neural network model can be understood as a signed distance function: for the coordinate information of any given point, the corresponding SDF value can be determined by the neural network model, and this SDF value represents the positional relationship (inside, outside, or on the surface) between the point and the three-dimensional model, so the three-dimensional model is implicitly represented by the neural network model. Through repeated iterative training of the neural network model, a more accurate three-dimensional model can be obtained. Reconstructing from the trained neural network model yields a more accurate three-dimensional model of the target object, and therefore a volume video with clearer texture and better realism.
In the embodiments of the application, after the neural network model implicitly representing the three-dimensional model is obtained through training, only an implicit representation of the model is available, so isosurface extraction needs to be further performed on the neural network model. That is, an isosurface extraction algorithm such as marching cubes (MC) is used to extract the surface of the three-dimensional model, and the three-dimensional model of the target object is then determined from that surface.
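For illustration, the isosurface extraction step might be sketched as follows, querying the trained network on a regular grid and running marching cubes at level zero; this assumes scikit-image's measure.marching_cubes and the MLP sketched earlier, and is not part of the patent text.
```python
# Isosurface extraction sketch: query the trained MLP on a regular grid and run marching
# cubes at level 0 (assumes scikit-image >= 0.17; chunk the queries in practice).
import numpy as np
import torch
from skimage import measure

def extract_mesh(model, bound=1.0, res=128):
    xs = np.linspace(-bound, bound, res, dtype=np.float32)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1).reshape(-1, 3)
    with torch.no_grad():
        sdf, _ = model(torch.from_numpy(grid))
    volume = sdf.numpy().reshape(res, res, res)
    verts, faces, _, _ = measure.marching_cubes(volume, level=0.0)
    verts = verts / (res - 1) * 2 * bound - bound  # grid indices back to world coordinates
    return verts, faces
```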
With the three-dimensional model reconstruction method provided by the application, the three-dimensional model is modeled implicitly by a neural network; adding depth supervision improves the speed and accuracy of training the neural network model; the three-dimensional model learned by the network is rendered back into images to indirectly correct the model; and the three-dimensional model is gradually refined through continuous iteration, making it more accurate.
In this way, three-dimensional models of the shooting object at different moments can be obtained by continuously performing three-dimensional reconstruction of the shooting object in time sequence, and the sequence of these three-dimensional models arranged in time order is the volume video of the shooting object.
With the volume video shooting approach provided by the application, any shooting object can be shot to obtain a volume video of the corresponding content. For example, a dancing subject can be shot to obtain a volume video in which the dance can be viewed from any angle, and so on.
It should be noted that the following embodiments of the present application can be implemented by using the volume video captured by the above volume video capturing manner.
At present, research on volume video mostly concerns the shooting and reconstruction of the volume video; there is little research on re-driving the characters in a captured volume video, and a scheme for re-driving a character in a volume video is currently lacking. In this regard, the present application provides a video processing method for re-driving a volume video, which is described in detail below.
The embodiments of the present application are described from the perspective of a video processing apparatus, which may be integrated in a computer device. The computer device may be a terminal or a server. The terminal can be a mobile phone, a tablet computer, a notebook computer, a smart television, a wearable smart device, a personal computer (PC), a vehicle-mounted terminal, and the like. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), big data, and artificial intelligence platforms. As shown in fig. 2, which is a schematic flow chart of the video processing method provided in the present application, the method includes:
Step 101: acquiring a multi-view synchronized video of a target object, and performing joint point identification on each video frame of the multi-view synchronized video.
The target object here may be a person, an animal, or another object, such as a robot or a car, and may be one object or several objects. In this embodiment, the target object may specifically be a human subject. The multi-view synchronized video of the object may specifically be the video obtained by shooting the target object synchronously from multiple views when the volume video is produced; the volume video of the target object can then be obtained by three-dimensional reconstruction based on this multi-view synchronized video.
As described above, a two-dimensional video is an image frame stream composed of a plurality of two-dimensional image frames, whereas a volume video is a model stream composed of a plurality of three-dimensional models. Each frame of the volume video can be observed from different angles, each angle corresponding to a different image frame. That is, in the multi-view synchronized video, each moment corresponds to one image frame for each view. For example, if the multi-view synchronized video is a synchronized video captured from 100 views, then at any moment 100 image frames, obtained by shooting the target object from the different views at that moment, can be determined.
After the multi-view synchronized video of the target object is acquired, joint point identification can be performed on each video frame of the multi-view synchronized video. Joint point identification identifies the joint points of the target object, so that when the target object needs to be re-driven, it can be driven by controlling the positions of its joint points.
In some embodiments, performing joint point identification on each video frame of the multi-view synchronized video includes:
splitting the video corresponding to each view of the multi-view synchronized video into frames to obtain a plurality of video frames;
and performing joint point detection on each video frame to obtain a joint point identification result for each video frame.
Before joint point detection is performed on the multi-view synchronized video, the multi-view synchronized video can be split into frames to obtain a plurality of video frames. It should be understood that these video frames include the video frames of every view. For example, the video of one view may be split into 200 video frames; if there are 100 views, frame splitting then yields 100 × 200 video frames. After the multi-view synchronized video has been split into frames, joint point detection is performed on each video frame. For the same moment, different joint point identification results may be detected in the video frames of the same target object at different views.
In some embodiments, performing joint detection on each video frame to obtain a joint identification result of each video frame includes:
performing pose detection on each video frame to obtain two-dimensional joint point data of each video frame;
triangulating the two-dimensional joint point data to obtain three-dimensional joint point data of each video frame;
and determining the joint point identification result of each video frame according to the two-dimensional joint point data and the three-dimensional joint point data.
In order to re-drive an object in a volume video, the three-dimensional joint points of the three-dimensional model of the object in the volume video need to be determined, while current joint point detection algorithms can only detect joint points in two-dimensional images. Therefore, in the embodiments of the present application, pose detection may first be performed on each video frame to obtain two-dimensional joint point data of each video frame. The pose detection may specifically use the OpenPose algorithm, a human body pose estimation algorithm. OpenPose is a real-time, multi-person, two-dimensional pose estimation algorithm based on deep learning; it detects all the keypoints in a two-dimensional image and then groups the keypoints into different persons.
In the embodiments of the application, joint point detection can be performed on each video frame with the OpenPose algorithm to obtain the two-dimensional joint point data of each video frame. Then, triangulation can be performed on the sets of two-dimensional joint point data obtained by identifying the joint points in the video frames of different views at the same moment, yielding the three-dimensional joint point data corresponding to those video frames. It can be understood that the three-dimensional joint point data corresponding to the video frames of different views at the same moment may be the same joint point data.
In some cases, because of differences in viewing angle, two-dimensional joint points that can be observed and identified in some views cannot be observed or identified in other views, and these joint points cannot be triangulated into corresponding three-dimensional joint points. Such two-dimensional joint points may be left untriangulated, and the two-dimensional joint point data corresponding to the video frame of that view is retained. Finally, the joint point identification result of the multi-view synchronized video includes both the three-dimensional joint points and the two-dimensional joint points.
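A standard linear (DLT) triangulation of one joint from its two-dimensional detections in several calibrated views is sketched below; it is shown only to illustrate the triangulation step and is not prescribed by the patent.
```python
# Standard linear (DLT) triangulation of one joint from its 2D detections in several
# calibrated views; shown only to illustrate the triangulation step.
import numpy as np

def triangulate_joint(points_2d, projection_matrices):
    """points_2d: list of (u, v); projection_matrices: list of 3x4 matrices P = K [R | t]."""
    rows = []
    for (u, v), P in zip(points_2d, projection_matrices):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]  # homogeneous solution -> 3D joint coordinates
```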
Step 102: determining the three-dimensional model parameters of the target object in each video frame according to the joint point identification results.
After joint point identification has been performed on the multi-view synchronized video to obtain the joint point identification results, a three-dimensional model of the target object can be fitted based on the identified two-dimensional and three-dimensional joint points, where the three-dimensional model may be a parametric model. Parametric human body shape reconstruction relies on a statistically derived parametric human body model, so that only a set of low-dimensional vectors (the human body parameters) is needed to describe a human body shape. For example, the parametric human body model may define two independent low-dimensional parameter controls: body shape (shape) and body pose (pose). Given a set of body shape parameters and a set of body pose parameters in the corresponding spaces, a human body shape can be synthesized directly. The body shape space is a subspace obtained by applying principal component analysis (PCA) dimensionality reduction to a database of human bodies with the same pose but different shapes, and the body shape parameters are the coefficients of the bases of this subspace.
In the embodiments of the application, after joint point identification has yielded the three-dimensional joint point data and the two-dimensional joint point data for each frame, the parametric model of the target object can be fitted according to the identified two-dimensional and three-dimensional joint point data to obtain the pose parameters and body shape parameters of the parametric model.
In some embodiments, determining the three-dimensional model parameters of the target object in each video frame according to the joint point recognition result comprises:
fitting a three-dimensional model of the target object according to the joint point identification result of each video frame to obtain body shape parameters and pose parameters of the three-dimensional model;
performing displacement fitting on the vertices of the three-dimensional model to obtain vertex displacement data of the three-dimensional model;
and determining texture data of the three-dimensional model according to each video frame, and determining the three-dimensional model parameters of the target object according to the body shape parameters, the pose parameters, the vertex displacement data, and the texture data.
In the embodiments of the application, after the body shape parameters and the pose parameters of the three-dimensional model are determined, displacement fitting can further be performed on the vertices of the three-dimensional model so that it fits the scanned model, giving the vertex displacement data of the three-dimensional model. The texture map can then be further fitted to the parametric model. The resulting data thus include not only intrinsic attribute data of the parametric model, such as the body shape parameters, texture data, and vertex displacement data, but also the pose parameters that can be used to drive the model.
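A highly simplified fitting loop in the spirit of this step is sketched below; the parametric body model itself (here a hypothetical `body_model` callable returning joints and vertices for given pose and shape parameters) is assumed and not defined by the patent text.
```python
# Highly simplified fitting loop in the spirit of this step. `body_model` is a hypothetical
# differentiable parametric body model returning (joints, vertices) for given pose and
# shape parameters; it is not defined by the patent text.
import torch
import torch.nn.functional as F

def fit_parameters(body_model, target_joints_3d, n_joints=24, n_shape=10, iters=300):
    pose = torch.zeros(n_joints * 3, requires_grad=True)  # pose parameters (axis-angle per joint)
    shape = torch.zeros(n_shape, requires_grad=True)      # body shape parameters
    opt = torch.optim.Adam([pose, shape], lr=0.05)
    for _ in range(iters):
        joints, _verts = body_model(pose, shape)          # model joints for the current parameters
        loss = F.mse_loss(joints, target_joints_3d)       # match the triangulated joint points
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pose.detach(), shape.detach()
```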
In some embodiments, performing displacement fitting on vertices of the three-dimensional model to obtain vertex displacement data of the three-dimensional model includes:
determining a target joint point corresponding to each vertex in the three-dimensional model, wherein the target joint point is the joint point closest to the vertex;
and calculating displacement data between each vertex and the corresponding target joint point to obtain vertex displacement data of the three-dimensional model.
In the embodiments of the present application, displacement fitting is performed on the vertices of the three-dimensional model; specifically, the joint point corresponding to each vertex may be determined first. The vertices of the three-dimensional model may specifically be the vertex data of the model obtained by three-dimensional reconstruction from the acquired multi-view synchronized video of the target object, where the three-dimensional reconstruction may use the neural-network-based reconstruction method described above, and the resulting three-dimensional model contains the coordinate data of each vertex on the surface of the three-dimensional model of the target object. The coordinate data of each vertex can then be represented by the joint point parameters and the vertex displacement data of the parametric model.
Specifically, a target joint point corresponding to each vertex may be determined, where the target joint point of the target vertex may be a three-dimensional joint point closest to the target vertex in the three-dimensional model. After the target joint point of the target vertex is determined, the displacement data between the target vertex and the target joint point can be calculated, so that the displacement data of the target vertex can be obtained. Furthermore, each vertex can be traversed to obtain vertex displacement data corresponding to each vertex.
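The nearest-joint search and displacement computation can be illustrated with a KD-tree, as in the following sketch (the use of SciPy here is an assumption for illustration only):
```python
# Nearest-joint search and displacement computation; the use of SciPy's KD-tree here is
# purely for illustration.
import numpy as np
from scipy.spatial import cKDTree

def vertex_displacements(vertices, joints_3d):
    """vertices: (V, 3) surface points; joints_3d: (J, 3) three-dimensional joint points."""
    tree = cKDTree(joints_3d)
    _, nearest = tree.query(vertices)             # index of the target joint point per vertex
    displacement = vertices - joints_3d[nearest]  # vertex displacement data
    return nearest, displacement
```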
Step 103: training a conditional variational autoencoder that takes the pose parameters as its variable, based on a training data set composed of a plurality of video frames and their corresponding three-dimensional model parameters.
Wherein a data set for the target object has been constructed according to the two preceding steps, the data set comprising body shape parameters, pose parameters and texture data for each frame of a parametric model generated from a multi-view synchronized video of the target object.
After the data set is constructed, a conditional variational autoencoder that takes the pose parameters as its variable can be trained on it. A variational autoencoder (VAE), like a generative adversarial network (GAN), is used to solve the problem of data generation. In a plain autoencoder architecture, an input is required and the generated data are the same as that input. It is generally desirable to generate data that differ to some extent, which requires a random vector as input and a model able to learn the stylistic characteristics of the generated images; this is why later research produced the generative adversarial network structure, which generates a specific sample from a random vector as input. The variational autoencoder likewise takes a random sample from a specific distribution as input and can generate a corresponding image, and in this respect its objective is similar to that of a GAN. However, the variational autoencoder does not need a discriminator; instead it uses an encoder to estimate the specific distribution. During training, the discriminator and generator of a GAN, or the encoder and decoder of a variational autoencoder, all participate in training; at inference time, data can be generated using only the generator of the GAN or the decoder of the variational autoencoder.
That is, a set of real samples is converted by an encoder network into an idealized data distribution, and this distribution is then passed to a decoder network to obtain a set of generated samples. When the generated samples are sufficiently close to the real samples, an autoencoder model has been trained. The variational autoencoder applies a further variational treatment to the autoencoder, so that the output of the encoder corresponds to the mean and variance of the target distribution.
As can be seen from the above description, a VAE can generate data from given random noise, but the generation process is uncontrollable. If data must be generated under a given condition, for example generating the handwritten digit corresponding to an input label from 0 to 9, the VAE cannot meet the requirement; in that case a conditional input needs to be added to the VAE, and the corresponding model is the conditional variational autoencoder (CVAE). In the embodiments of the present application, a CVAE that takes the pose parameters as its variable may be trained on the data set constructed above.
Specifically, training the conditional variational autoencoder that takes the pose parameters as a variable, based on a training data set formed by a plurality of video frames and the corresponding three-dimensional model parameters, comprises the following steps:
taking the pose parameters of the three-dimensional model as input to obtain output model data produced by the conditional variational autoencoder;
and adjusting the parameters of the conditional variational autoencoder based on the difference between the output model data and the body shape parameters, vertex displacement data and texture data in the three-dimensional model parameters.
When the conditional variational autoencoder is trained on the constructed data set, the pose parameters of the parametric model of the target object are used as input to obtain the object surface texture and the mesh model corresponding to that pose, that is, the texture data, body shape parameters and vertex displacement data are output. The parameters of the CVAE are then adjusted based on the difference between the output data and the ground-truth data, thereby training the CVAE. Alternatively, the object surface texture and mesh model output by the CVAE can be passed through differentiable rendering to obtain rendered images under multiple views, and the multi-view images of the current frame corresponding to the pose data are then used as supervision to adjust the model parameters of the CVAE, which likewise trains the CVAE.
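A sketch of one training step under the first, direct-regression supervision scheme is given below. It assumes a pose-conditioned CVAE such as the one sketched earlier, whose reconstruction is the concatenation of body shape parameters, flattened per-vertex displacements and texture data; all tensor names, dimensions and the loss weighting are hypothetical, and the differentiable-rendering variant would instead render the decoded mesh and texture and compare against the multi-view frames:

```python
import torch
import torch.nn.functional as F

def train_step(cvae, optimizer, pose, shape_gt, disp_gt, tex_gt, kl_weight=1e-3):
    """One optimization step: the pose is the condition, the intrinsic
    parameters (shape, per-vertex displacement, texture) are the data.

    Assumes `cvae` was built with data_dim = S + 3*V + T and cond_dim = P,
    e.g. the CVAE sketched above.
    pose:     (B, P) pose parameters of the current frames (condition)
    shape_gt: (B, S) fitted body shape parameters
    disp_gt:  (B, V, 3) fitted per-vertex displacements
    tex_gt:   (B, T) texture data of the frames
    """
    optimizer.zero_grad()

    # Ground-truth intrinsic parameters, concatenated into one vector per frame
    x_gt = torch.cat([shape_gt, disp_gt.flatten(1), tex_gt], dim=1)
    recon, mu, logvar = cvae(x_gt, pose)

    # Split the reconstruction back into its shape / displacement / texture parts
    s, d = shape_gt.shape[1], disp_gt.flatten(1).shape[1]
    shape_out = recon[:, :s]
    disp_out = recon[:, s:s + d]
    tex_out = recon[:, s + d:]

    # Per-component reconstruction losses plus the KL regularizer
    recon_loss = (F.mse_loss(shape_out, shape_gt)
                  + F.mse_loss(disp_out, disp_gt.flatten(1))
                  + F.mse_loss(tex_out, tex_gt))
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
    loss = recon_loss + kl_weight * kl

    loss.backward()
    optimizer.step()
    return loss.item()
```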
Step 104, acquiring target pose parameters, and generating a driving video based on the conditional variational autoencoder and the target pose parameters.
After the CVAE is trained, target pose parameters for driving the target object can be acquired, the surface texture and mesh model are generated based on the target pose parameters, and a multi-view image or a three-dimensional model can be obtained through further rendering. When target pose parameters are acquired continuously and the three-dimensional model is generated continuously, the corresponding driving video can be generated.
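As an illustrative sketch of this driving stage, the loop below feeds a sequence of target pose parameters through the trained CVAE decoder and writes the rendered frames to a video file with OpenCV. The render_views callable is a hypothetical stand-in for the renderer (differentiable or conventional) and is not defined here:

```python
import cv2
import torch

@torch.no_grad()
def drive_video(cvae, pose_sequence, render_views, out_path="driven.mp4", fps=30):
    """Generate a driving video from a sequence of target pose parameters.

    pose_sequence: iterable of (P,) pose parameter vectors
    render_views:  hypothetical callable that turns decoded model data into an
                   RGB uint8 frame (e.g. via a mesh renderer); not defined here
    """
    writer = None
    for pose in pose_sequence:
        cond = torch.as_tensor(pose, dtype=torch.float32).unsqueeze(0)
        model_data = cvae.generate(cond)     # texture + mesh data for this pose
        frame = render_views(model_data)     # (H, W, 3) uint8 RGB image
        if writer is None:
            h, w = frame.shape[:2]
            writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                                     fps, (w, h))
        writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
    if writer is not None:
        writer.release()
```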
With this video processing method, the original data used to generate the volumetric video (the multi-view synchronous video) is split into a number of synchronized video frames, and parametric model fitting is performed on these synchronized frames to obtain the pose parameters (the variable parameters) of each frame as well as the other intrinsic model parameters (body shape parameters, vertex displacement data and texture data). A conditional variational autoencoder model is then trained with the pose parameters of each frame as the model input and the corresponding intrinsic parameters as the model output, yielding a trained conditional variational autoencoder model that is specific to the subject captured in the volumetric video. Therefore, when the volumetric video needs to be re-driven, the driving parameters, that is, a continuous stream of pose parameters, can be fed into the trained conditional variational autoencoder model, which continuously outputs the three-dimensional model corresponding to each pose parameter together with the corresponding texture data. The output three-dimensional model and texture data can then be rendered, thereby re-driving the volumetric video.
According to the above description, the video processing method provided in the embodiment of the present application acquires a multi-view synchronous video of a target object and performs joint point identification on each video frame in the multi-view synchronous video; determines the three-dimensional model parameters of the target object in each video frame according to the joint point identification results, wherein the three-dimensional model parameters comprise pose parameters, body shape parameters, vertex displacement data and texture data; trains a conditional variational autoencoder that takes the pose parameters as a variable, based on a training data set consisting of a plurality of video frames and the corresponding three-dimensional model parameters; and acquires target pose parameters and generates a driving video based on the conditional variational autoencoder and the target pose parameters.
Therefore, by decoupling the pose parameters from the rest of the three-dimensional model and then training a conditional variational autoencoder that takes the pose parameters as a variable, the video processing method provided by the present application can drive the three-dimensional model and generate a driving video simply by feeding new pose parameters into the conditional variational autoencoder. The method can improve the accuracy of three-dimensional model driving, and can thereby improve the effect of driving the characters in the video.
In order to better implement the above video processing method, embodiments of the present application also provide a video processing apparatus, which may be integrated in a terminal or a server.
For example, as shown in fig. 3, which is a schematic structural diagram of a video processing apparatus provided in an embodiment of the present application, the video processing apparatus may include an acquiring unit 201, a determining unit 202, a training unit 203, and a generating unit 204, as follows:
an acquiring unit 201, configured to acquire a multi-view synchronous video of a target object, and to perform joint point identification on each video frame in the multi-view synchronous video;
a determining unit 202, configured to determine the three-dimensional model parameters of the target object in each video frame according to the joint point identification results, where the three-dimensional model parameters include pose parameters, body shape parameters, vertex displacement data and texture data;
a training unit 203, configured to train a conditional variational autoencoder that takes the pose parameters as a variable, based on a training data set composed of a plurality of video frames and the corresponding three-dimensional model parameters;
and a generating unit 204, configured to acquire target pose parameters and generate a driving video based on the conditional variational autoencoder and the target pose parameters.
In some embodiments, the acquiring unit comprises:
the frame-cutting subunit is used for cutting frames from the video corresponding to each view of the multi-view synchronous video to obtain a plurality of video frames (see the frame-cutting sketch following this list);
and the detection subunit is used for performing joint point detection on each video frame to obtain the joint point identification result of each video frame.
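A minimal frame-cutting sketch for a single view, using OpenCV; the step parameter and function name are illustrative:

```python
import cv2

def cut_frames(video_path, step=1):
    """Split one view's video into frames, keeping every `step`-th frame."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```

The same routine would be run on the video of every view, and because the videos are synchronized, frames with the same index correspond to the same moment in time.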
In some embodiments, the detection subunit comprises:
the detection module is used for performing pose detection on each video frame to obtain two-dimensional joint point data of each video frame;
the processing module is used for triangulating the two-dimensional joint point data to obtain the three-dimensional joint point data of each video frame (see the triangulation sketch following this list);
and the determining module is used for determining the joint point identification result of each video frame according to the two-dimensional joint point data and the three-dimensional joint point data.
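For the two-view case, the triangulation mentioned above can be sketched with OpenCV's DLT-based triangulation; the projection matrices are assumed to come from the camera calibration, and handling more than two views would typically mean triangulating view pairs (or solving a joint least-squares problem) and fusing the results:

```python
import cv2
import numpy as np

def triangulate_joints(joints_2d_a, joints_2d_b, proj_a, proj_b):
    """Lift matched 2D joints from two calibrated views to 3D.

    joints_2d_a/b: (J, 2) pixel coordinates of the same joints in view A / B
    proj_a/b:      (3, 4) camera projection matrices of the two views
    returns:       (J, 3) 3D joint positions
    """
    pts_a = joints_2d_a.T.astype(np.float64)   # (2, J)
    pts_b = joints_2d_b.T.astype(np.float64)
    hom = cv2.triangulatePoints(proj_a, proj_b, pts_a, pts_b)  # (4, J) homogeneous
    return (hom[:3] / hom[3]).T                # divide out the homogeneous coordinate
```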
In some embodiments, the determining unit comprises:
the first fitting subunit is used for fitting a three-dimensional model of the target object according to the joint point identification result of each video frame to obtain the body shape parameters and pose parameters of the three-dimensional model;
the second fitting subunit is used for performing displacement fitting on the vertices of the three-dimensional model to obtain vertex displacement data of the three-dimensional model;
and the determining subunit is used for determining texture data of the three-dimensional model according to each video frame, and for determining the three-dimensional model parameters of the target object according to the body shape parameters, the pose parameters, the vertex displacement data and the texture data.
In some embodiments, the second fitting subunit comprises:
the determining module is used for determining a target joint point corresponding to each vertex in the three-dimensional model, and the target joint point is the joint point closest to the vertex;
and the calculation module is used for calculating displacement data between each vertex and the corresponding target joint point to obtain vertex displacement data of the three-dimensional model.
In some embodiments, the training unit comprises:
the first input subunit is used for taking the pose parameters of the three-dimensional model as input to obtain output model data produced by the conditional variational autoencoder;
and the adjusting subunit is used for adjusting the parameters of the conditional variational autoencoder based on the difference between the output model data and the body shape parameters, vertex displacement data and texture data in the three-dimensional model parameters.
In some embodiments, the generating unit comprises:
the receiving subunit is used for receiving input target pose parameters;
the second input subunit is used for inputting the target pose parameters into the conditional variational autoencoder to obtain output target model data;
the rendering subunit is used for rendering the target model data to obtain multi-view synchronous images;
and the generation subunit is used for generating the driving video from the multi-view synchronous images.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen from the above description, in the video processing apparatus provided in the embodiment of the present application, the acquiring unit 201 acquires a multi-view synchronous video of a target object and performs joint point identification on each video frame in the multi-view synchronous video; the determining unit 202 determines the three-dimensional model parameters of the target object in each video frame according to the joint point identification results, wherein the three-dimensional model parameters comprise pose parameters, body shape parameters, vertex displacement data and texture data; the training unit 203 trains a conditional variational autoencoder that takes the pose parameters as a variable, based on a training data set consisting of a plurality of video frames and the corresponding three-dimensional model parameters; and the generating unit 204 acquires target pose parameters and generates a driving video based on the conditional variational autoencoder and the target pose parameters.
Therefore, by decoupling the pose parameters from the rest of the three-dimensional model and then training a conditional variational autoencoder that takes the pose parameters as a variable, the solution provided by the present application can drive the three-dimensional model and generate a driving video simply by feeding new pose parameters into the conditional variational autoencoder. This can improve the accuracy of three-dimensional model driving, and can thereby improve the effect of driving the characters in the video.
An embodiment of the present application further provides a computer device, where the computer device may be a terminal or a server; fig. 4 is a schematic structural diagram of the computer device provided in the present application. Specifically:
The computer device may include components such as a processing unit 301 with one or more processing cores, a storage unit 302 with one or more computer-readable storage media, a power module 303, and an input module 304. Those skilled in the art will appreciate that the computer device configuration illustrated in fig. 4 does not constitute a limitation of the computer device, and the device may include more or fewer components than illustrated, may combine some components, or may arrange the components differently. Wherein:
the processing unit 301 is a control center of the computer device, connects various parts of the entire computer device with various interfaces and lines, and executes various functions of the computer device and processes data by running or executing software programs and/or modules stored in the storage unit 302 and calling data stored in the storage unit 302. Optionally, the processing unit 301 may include one or more processing cores; preferably, the processing unit 301 may integrate an application processor and a modem processor, wherein the application processor mainly handles an operating system, an object interface, an application program, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processing unit 301.
The storage unit 302 may be used to store software programs and modules, and the processing unit 301 executes various functional applications and data processing by running the software programs and modules stored in the storage unit 302. The storage unit 302 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, web page access, and the like), and so on; the data storage area may store data created according to the use of the computer device, and the like. Further, the storage unit 302 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the storage unit 302 may also include a memory controller to provide the processing unit 301 with access to the storage unit 302.
The computer device further includes a power module 303 for supplying power to each component. Preferably, the power module 303 may be logically connected to the processing unit 301 through a power management system, so that charging, discharging and power consumption are managed through the power management system. The power module 303 may also include one or more of a DC or AC power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other components.
The computer device may also include an input module 304, the input module 304 operable to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to object setting and function control.
Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processing unit 301 in the computer device loads an executable file corresponding to a process of one or more application programs into the storage unit 302 according to the following instructions, and the processing unit 301 runs the application programs stored in the storage unit 302, so as to implement various functions as follows:
acquiring a multi-view synchronous video of a target object, and performing joint point identification on each video frame in the multi-view synchronous video; determining three-dimensional model parameters of the target object in each video frame according to the joint point identification results, wherein the three-dimensional model parameters comprise pose parameters, body shape parameters, vertex displacement data and texture data; training a conditional variational autoencoder that takes the pose parameters as a variable, based on a training data set consisting of a plurality of video frames and the corresponding three-dimensional model parameters; and acquiring target pose parameters, and generating a driving video based on the conditional variational autoencoder and the target pose parameters.
It should be noted that, the computer device provided in the embodiment of the present application and the method in the foregoing embodiment belong to the same concept, and specific implementation of the above operations may refer to the foregoing embodiment, which is not described herein again.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the embodiment of the present invention provides a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
acquiring a multi-view synchronous video of a target object, and performing joint point identification on each video frame in the multi-view synchronous video; determining three-dimensional model parameters of the target object in each video frame according to the joint point identification results, wherein the three-dimensional model parameters comprise pose parameters, body shape parameters, vertex displacement data and texture data; training a conditional variational autoencoder that takes the pose parameters as a variable, based on a training data set consisting of a plurality of video frames and the corresponding three-dimensional model parameters; and acquiring target pose parameters, and generating a driving video based on the conditional variational autoencoder and the target pose parameters.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the computer-readable storage medium may include: a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps in any method provided by the embodiment of the present invention, the beneficial effects that can be achieved by any method provided by the embodiment of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
According to one aspect of the present application, a computer program product or computer program is provided, comprising computer instructions stored in a storage medium. A processor of a computer device reads the computer instructions from the storage medium and executes them, so that the computer device performs the method provided in the various optional implementations of the above video processing method.
The video processing method, the video processing apparatus, and the computer-readable storage medium provided by the embodiments of the present invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, and the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application range according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (11)

1. A method of video processing, the method comprising:
acquiring a multi-view synchronous video of a target object, and performing joint point identification on each video frame in the multi-view synchronous video;
determining three-dimensional model parameters of the target object in each video frame according to joint point identification results, wherein the three-dimensional model parameters comprise pose parameters, body shape parameters, vertex displacement data and texture data;
training a conditional variational autoencoder that takes the pose parameters as a variable, based on a training data set consisting of a plurality of video frames and corresponding three-dimensional model parameters;
and acquiring target pose parameters, and generating a driving video based on the conditional variational autoencoder and the target pose parameters.
2. The method of claim 1, wherein the performing joint point identification on each video frame in the multi-view synchronous video comprises:
performing frame cutting on the video corresponding to each view of the multi-view synchronous video to obtain a plurality of video frames;
and performing joint point detection on each video frame to obtain a joint point identification result of each video frame.
3. The method of claim 2, wherein the performing joint point detection on each video frame to obtain a joint point identification result of each video frame comprises:
performing pose detection on each video frame to obtain two-dimensional joint point data of each video frame;
performing triangulation on the two-dimensional joint point data to obtain three-dimensional joint point data of each video frame;
and determining the joint point identification result of each video frame according to the two-dimensional joint point data and the three-dimensional joint point data.
4. The method of claim 1, wherein the determining three-dimensional model parameters of the target object in each video frame according to joint point identification results comprises:
fitting a three-dimensional model of the target object according to the joint point identification result of each video frame to obtain body shape parameters and pose parameters of the three-dimensional model;
performing displacement fitting on the vertices of the three-dimensional model to obtain vertex displacement data of the three-dimensional model;
and determining texture data of the three-dimensional model according to each video frame, and determining the three-dimensional model parameters of the target object according to the body shape parameters, the pose parameters, the vertex displacement data and the texture data.
5. The method of claim 4, wherein the performing displacement fitting on the vertices of the three-dimensional model to obtain vertex displacement data of the three-dimensional model comprises:
determining a target joint point corresponding to each vertex in the three-dimensional model, wherein the target joint point is the joint point closest to the vertex;
and calculating displacement data between each vertex and the corresponding target joint point to obtain vertex displacement data of the three-dimensional model.
6. The method of claim 1, wherein the training a conditional variational autoencoder that takes the pose parameters as a variable, based on a training data set comprising a plurality of video frames and corresponding three-dimensional model parameters, comprises:
taking the pose parameters of the three-dimensional model as input to obtain output model data produced by the conditional variational autoencoder;
and adjusting the parameters of the conditional variational autoencoder based on the difference between the output model data and the body shape parameters, the vertex displacement data and the texture data in the three-dimensional model parameters.
7. The method of claim 1, wherein the acquiring target pose parameters and generating a driving video based on the conditional variational autoencoder and the target pose parameters comprises:
receiving input target pose parameters;
inputting the target pose parameters into the conditional variational autoencoder to obtain output target model data;
rendering the target model data to obtain a multi-view synchronous image;
and generating a driving video according to the multi-view synchronous image.
8. A video processing apparatus, characterized in that the apparatus comprises:
an acquisition unit, which is used for acquiring a multi-view synchronous video of a target object and performing joint point identification on each video frame in the multi-view synchronous video;
a determining unit, which is used for determining three-dimensional model parameters of the target object in each video frame according to joint point identification results, wherein the three-dimensional model parameters comprise pose parameters, body shape parameters, vertex displacement data and texture data;
a training unit, which is used for training a conditional variational autoencoder that takes the pose parameters as a variable based on a training data set consisting of a plurality of video frames and corresponding three-dimensional model parameters;
and a generating unit, which is used for acquiring target pose parameters and generating a driving video based on the conditional variational autoencoder and the target pose parameters.
9. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the video processing method according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps in the video processing method of any one of claims 1 to 7 when executing the computer program.
11. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the steps in the video processing method of any of claims 1 to 7.
CN202211268922.1A 2022-10-17 2022-10-17 Video processing method, device and computer readable storage medium Pending CN115497029A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211268922.1A CN115497029A (en) 2022-10-17 2022-10-17 Video processing method, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115497029A true CN115497029A (en) 2022-12-20

Family

ID=84474430

Country Status (1)

Country Link
CN (1) CN115497029A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination