CN113077383B - Model training method and model training device

Model training method and model training device

Info

Publication number
CN113077383B
CN113077383B
Authority
CN
China
Prior art keywords
model
image
human body
target frame
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110629148.1A
Other languages
Chinese (zh)
Other versions
CN113077383A (en)
Inventor
王鑫宇
刘炫鹏
杨国基
刘致远
刘云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202110629148.1A priority Critical patent/CN113077383B/en
Publication of CN113077383A publication Critical patent/CN113077383A/en
Application granted granted Critical
Publication of CN113077383B publication Critical patent/CN113077383B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T3/04
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Abstract

The embodiment of the invention discloses a model training method and a model training device, which are used to increase the posture detail information of a digital human when the digital human is generated. The method provided by the embodiment of the invention comprises the following steps: extracting human body key point parameters of each frame of image from each frame of image of video data, and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters; and inputting the human body 3D model of a target frame and the target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.

Description

Model training method and model training device
Technical Field
The invention relates to the technical field of image translation, in particular to a model training method and a model training device.
Background
Image translation refers to the conversion of one image into another, by analogy with machine translation, in which one language is converted into another.
Classical image translation models in the prior art include pix2pix, pix2pixHD, and vid2vid. pix2pix provides a unified framework to solve various image translation problems, pix2pixHD better solves the problem of high-resolution image conversion (translation) on the basis of pix2pix, and vid2vid better solves the problem of high-resolution video conversion on the basis of pix2pixHD.
A digital human is a virtual simulation of the shape and function of the human body, at different levels of fidelity, by means of information science methods. However, the image translation models in the prior art are trained only on skeleton point data, so the existing image translation models carry only skeleton point information and lack detailed human body posture information when generating the posture of a digital human.
Disclosure of Invention
The embodiment of the invention provides a model training method and a model training device, which are used to increase the posture detail information of a digital human when the digital human is generated.
A first aspect of an embodiment of the present application provides a model training method, including:
extracting human body key point parameters of each frame of image from each frame of image of video data respectively, and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
inputting a human body 3D model of a target frame and a target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
Preferably, the inputting the human body 3D model of the target frame and the target frame image into the image translation model for training includes:
and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into the image translation model for training.
Preferably, the inputting the human body 3D model of the target frame and the target frame image into the image translation model for training includes:
inputting training data of the target frame and a plurality of frames adjacent to the target frame into the image translation model for training, wherein the training data comprises the human body 3D model and human body key point parameters of the target frame, the target frame image, and the human body 3D models and human body key point parameters of the adjacent frames.
Preferably, the generation model in the first image translation model is a coding-decoding model, the coding model is trained by using a ResNet residual network architecture, and the method further includes:
modifying the ResNet architecture in the coding model to a lightweight model architecture;
and inputting first training data into the first image translation model for training, and taking the trained first image translation model as a second image translation model.
Preferably, the generation model in the second image translation model is a coding-decoding model, the decoding model in the second image translation model is trained by using a deconvolution operator, and the method further includes:
replacing the deconvolution operator with an upsampling operator;
and inputting the first training data into the second image translation model for training, and taking the trained second image translation model as a third image translation model.
Preferably, the first training data includes:
a human 3D model of the target frame and the target frame image;
or,
the human body 3D model of the target frame, the human body key point parameters and the target frame image;
or,
the human body 3D model of the target frame, the human body key point parameters, the target frame image, and the human body 3D model and the human body key point parameters of a plurality of frames adjacent to the target frame.
Preferably, the method further comprises:
inputting the generation model in the third image translation model into an acceleration framework, so that the acceleration framework parses the generation model and accelerates its image translation speed.
Preferably, the method further comprises:
when the acceleration framework is TensorRT, replacing the reflect padding operator in the image translation model with a standard padding operator.
Preferably, the image translation model includes at least one of a pix2pix model, a pix2pixHD model, and a vid2vid model.
Preferably, the lightweight model includes at least one of a MobileNet model, a ShuffleNet model, a SqueezeNet model, and an Xception model.
A second aspect of the embodiments of the present application provides a model training apparatus, including:
the system comprises a first module, a second module and a third module, wherein the first module is used for respectively extracting human body key point parameters of each frame of image from each frame of image of video data and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, and the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
and the second module is used for inputting the human body 3D model of the target frame and the target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
Preferably, the second module is specifically configured to:
and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into the image translation model for training.
Preferably, the second module is specifically configured to:
inputting training data of the target frame and a plurality of frames adjacent to the target frame into the image translation model for training, wherein the training data comprises the human body 3D model and human body key point parameters of the target frame, the target frame image, and the human body 3D models and human body key point parameters of the adjacent frames.
Preferably, the generation model in the first image translation model is a coding-decoding model, the coding model is trained by using a ResNet residual network architecture, and the apparatus further includes:
a modification module, configured to modify the ResNet architecture in the coding model into a lightweight model architecture;
and the training module is used for inputting first training data into the first image translation model for training and taking the trained first image translation model as a second image translation model.
Preferably, the generation model in the second image translation model is a coding-decoding model, and the decoding model in the second image translation model is trained by using a deconvolution operator:
the modification module is further used for replacing the deconvolution operator with an upsampling operator;
the training module is further configured to input the first training data to the second image translation model for training, and use the trained second image translation model as a third image translation model.
Preferably, the first training data includes:
a human 3D model of the target frame and the target frame image;
or,
the human body 3D model of the target frame, the human body key point parameters and the target frame image;
or,
the human body 3D model of the target frame, the human body key point parameters, the target frame image, and the human body 3D model and the human body key point parameters of a plurality of frames adjacent to the target frame.
Preferably, the apparatus further comprises:
and the acceleration module is used for inputting the generation model in the third image translation model into an acceleration framework, so that the acceleration framework parses the generation model and accelerates its image translation speed.
Preferably, the modification module is further configured to:
when the acceleration framework is TensorRT, replacing the reflect padding operator in the image translation model with a standard padding operator.
Preferably, the image translation model includes at least one of a pix2pix model, a pix2pixHD model, and a vid2vid model.
Preferably, the lightweight model includes at least one of a MobileNet model, a ShuffleNet model, a SqueezeNet model, and an Xception model.
A third aspect of embodiments of the present application provides a computer apparatus, including a processor, where the processor is configured to implement the model training method provided in the first aspect of embodiments of the present application when executing a computer program stored in a memory.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is used, when being executed by a processor, to implement the model training method provided in the first aspect of the embodiments of the present application.
According to the technical scheme, the embodiment of the invention has the following advantages:
in the embodiment of the application, human body key point parameters of each frame of image are respectively extracted from each frame of image of video data, and the human body key point parameters of each frame of image are input into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters; inputting a human body 3D model of a target frame and a target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
Because the model input data of the embodiment of the application is the 3D model of the human body, the trained first image translation model can have more posture detail information when generating the digital human posture.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a model training method in an embodiment of the present application;
FIG. 2 is a schematic diagram of a neural network architecture in an embodiment of the present application;
FIG. 3 is a schematic diagram of another embodiment of a model training method in an embodiment of the present application;
FIG. 4 is a schematic diagram of another embodiment of a model training method in the embodiment of the present application;
FIG. 5 is a diagram illustrating a residual error network architecture according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another embodiment of a model training method in an embodiment of the present application;
FIG. 7 is a schematic diagram of another embodiment of a model training method in an embodiment of the present application;
fig. 8 is a schematic diagram of an embodiment of a model training apparatus in an embodiment of the present application.
Detailed Description
The embodiment of the invention provides a model training method and a training device, which are used for increasing the posture detail information of a digital person when the digital person is generated.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method differs from the prior art, in which only skeleton point information is input when a digital human is generated, so that the generated digital human lacks posture detail information.
For convenience of understanding, the following describes a model training method and a model training apparatus in the present application, and referring to fig. 1, an embodiment of the model training method in the present application includes:
101. extracting human body key point parameters of each frame of image from each frame of image of video data respectively, and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
The SMPL-X (SMPL eXpressive) model combines the SMPL body model, the MANO hand model and the FLAME head model, and is registered to 5586 3D scans to guarantee quality. Trained on these data, the model captures the natural correlations between body, hands and face without artifacts.
Because the trained SMPL-X (SMPL eXpressive) model has complete 3D surfaces (full 3D surfaces) of the body, hands and face, the digital human generated from it carries more detailed human body posture information.
In order to realize the simulation of the human body posture in each frame of image in the video data, the human body key point parameters of each frame of image are respectively extracted from each frame of image in the video data, and the human body key point parameters of each frame of image are input into the SMPL-X model so as to generate the human body 3D model corresponding to the human body posture of each frame of image.
As a specific implementation, OpenPose can be used to detect the human body key point parameters in each frame of the video. OpenPose is an open-source library based on convolutional neural networks and supervised learning, developed on the Caffe framework. It can estimate human body actions, facial expressions, finger motions and the like, is suitable for both single-person and multi-person scenes, and has excellent robustness. For real-time two-dimensional multi-person keypoint recognition it can identify 15, 18 or 25 body/foot keypoints, 2 × 21 hand keypoints and 70 face keypoints.
For OpenPose, the input may be a picture, a video, or the video stream of a camera (such as a webcam, a Flir/Point Grey camera or an IP camera), and the output may be the original picture with the keypoints overlaid, a file storing the keypoint data, and so on.
After the human body key point parameters in each frame of image are identified, the human body key point parameters of each frame of image are input into the SMPL-X model to generate the human body 3D model corresponding to the human body posture of each frame of image. For the specific process of inputting the human body key point parameters into the SMPL-X model to obtain the corresponding human body 3D model, reference may be made to the description of the SMPLify-X method in the prior art, which is not repeated herein.
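As a rough sketch of this per-frame extraction step, the fragment below walks a video with OpenCV and delegates detection to a caller-supplied function. `detect_keypoints` is a hypothetical wrapper around OpenPose (or any equivalent detector), introduced here only for illustration, since the exact Python bindings differ between OpenPose builds.

```python
import cv2  # OpenCV, used only for frame decoding


def extract_keypoints_per_frame(video_path, detect_keypoints):
    """Run a keypoint detector on every frame of a video.

    `detect_keypoints` is assumed to take one BGR image and return
    (body, hands, face) keypoint arrays, e.g. 25 body/foot points,
    2 x 21 hand points and 70 face points in OpenPose's layout.
    """
    cap = cv2.VideoCapture(video_path)
    all_keypoints = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break  # end of video
        body, hands, face = detect_keypoints(frame)
        all_keypoints.append({"body": body, "hands": hands, "face": face})
    cap.release()
    return all_keypoints
```

The per-frame keypoint dictionaries would then be handed to the SMPLify-X fitting step to produce one SMPL-X body mesh per frame.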
102. Inputting a human body 3D model of a target frame and a target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
After the human body 3D model corresponding to the human body posture of each frame of image is obtained, the human body 3D model of a target frame and the image of the target frame are input into an image translation model for training, and the trained image translation model is taken as the first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames of images in the video data.
When a GAN network is applied in deep learning, the generation model G (generator) learns the data distribution by continuously playing a game against the discrimination model D (discriminator); when the GAN is used for image generation, G can generate a vivid image from a random number after training is completed.
The main functions of G and D are:
G is a generative network which receives a random noise z (a random number) and generates an image from this noise;
D is a discrimination network which discriminates whether a picture is "real". Its input parameter is x, where x represents a picture, and the output D(x) represents the probability that x is a real picture: an output of 1 means the picture is certainly real, and an output of 0 means it cannot be real.
In the training process, the aim of the generation model G is to generate pictures real enough to deceive the discrimination model D, while the aim of D is to distinguish the images generated by G from real images as well as possible. G and D thus form a dynamic game whose final equilibrium is a Nash equilibrium; once the Nash equilibrium between G and D is reached, the training of G is finished.
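To make the G/D game concrete, the following is a minimal training step, assuming `G`, `D` and their optimizers are already constructed and that `D` outputs raw logits; for brevity, D here scores the image alone, whereas a pix2pix-style discriminator scores the (condition, image) pair.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # expects raw logits from D


def gan_step(G, D, opt_G, opt_D, cond, real):
    # --- train D: push real images toward 1 and generated ones toward 0 ---
    pred_real = D(real)
    fake = G(cond).detach()  # detach: no gradient flows into G on D's step
    pred_fake = D(fake)
    loss_D = (bce(pred_real, torch.ones_like(pred_real))
              + bce(pred_fake, torch.zeros_like(pred_fake)))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # --- train G: try to make D score the generated image as real ---
    pred_fake = D(G(cond))
    loss_G = bce(pred_fake, torch.ones_like(pred_fake))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```

At the Nash equilibrium described above, neither update can improve its side of this game any further.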
Specifically, the process of training the generation model and the discrimination model in the image translation model is as follows:
inputting the human body 3D model of the target frame into the generation model of the image translation model to obtain a generated image of the target frame, then calculating a first loss between the target frame image and the generated image of the target frame according to a preset loss function, and performing a gradient update on the weights of the convolution layers in the generation model according to the first loss and the back propagation algorithm.
It should be noted that the target frame image and the generated image of the target frame in the embodiment of the present application are two different objects: the target frame image is the real image of the target frame, that is, the real image of the target frame in the video data, while the generated image of the target frame is the image produced by inputting the human body 3D model of the target frame into the generation model of the image translation model.
Specifically, the image translation model in the embodiment of the present application includes at least one of pix2pix, pix2pixHD and vid2vid, and the corresponding preset loss functions differ between the specific image translation models:
for pix2pix, the preset loss function includes the L1 loss between the target frame image and the generated image of the target frame, and a GAN loss that diversifies the output;
for pix2pixHD, the preset loss function includes the L1 loss between the target frame image and the generated image of the target frame, a GAN loss that diversifies the output, a feature matching loss and a content loss;
for vid2vid, the preset loss function includes the L1 loss between the target frame image and the generated image of the target frame, a GAN loss that diversifies the output, a feature matching loss, a content loss, a video loss and an optical flow loss.
Each loss function is described in detail in the prior art and is not detailed here.
In order to facilitate understanding of the gradient updating process, the generation model and the discrimination model in the GAN network are first briefly described:
the generation model and the discrimination model in the image translation model adopt a Neural Network algorithm, and a Multi-Layer Perceptron (MLP), also called an Artificial Neural Network (ANN), generally includes an input Layer, an output Layer, and a plurality of hidden layers arranged between the input Layer and the output Layer. The simplest MLP requires a hidden layer, i.e., an input layer, a hidden layer, and an output layer, to be referred to as a simple neural network.
Next, taking the neural network in fig. 2 as an example, the data transmission process is described:
1. Forward output of the neural network
In layer 0 (the input layer), we vectorize the inputs x1, x2, x3 into $X$;
between layer 0 and layer 1 (the hidden layer) there are weights w1, w2, w3, vectorized into $W^{[1]}$, where $W^{[1]}$ denotes the weights of the first layer;
between layer 0 and layer 1 there are also biases b1, b2, b3, vectorized into $b^{[1]}$, where $b^{[1]}$ denotes the biases of the first layer;
for layer 1, the calculation formulas are:
$$Z^{[1]} = W^{[1]} X + b^{[1]}$$
$$a^{[1]} = \sigma(Z^{[1]})$$
where $Z$ is a linear combination of the input values and $a$ is the value of $Z$ passed through the sigmoid activation function. For the input $X$ of the first layer, the output value is $a$, which is also the input of the next layer. The sigmoid activation function takes values in $[0, 1]$ and can be understood as a valve, much like a human neuron: a neuron does not react immediately when stimulated, but once the stimulus exceeds a threshold, it propagates the signal to the next level.
Between layer 1 and layer 2 (the output layer), similarly to between layer 0 and layer 1, the calculation formulas are:
$$Z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$$
$$a^{[2]} = \sigma(Z^{[2]})$$
$$\hat{y} = a^{[2]}$$
where $\hat{y}$ is the output value of the neural network.
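The forward pass above can be written out directly in a few lines of NumPy; a 3-3-1 layout is assumed here to match the shape of the network in fig. 2.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
X = rng.normal(size=(3, 1))                           # input vector (x1, x2, x3)
W1, b1 = rng.normal(size=(3, 3)), np.zeros((3, 1))    # layer-1 parameters
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))    # layer-2 parameters

Z1 = W1 @ X + b1        # Z[1] = W[1] X + b[1]
a1 = sigmoid(Z1)        # a[1] = sigma(Z[1])
Z2 = W2 @ a1 + b2       # Z[2] = W[2] a[1] + b[2]
yhat = sigmoid(Z2)      # network output, in (0, 1)
```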
2. Loss function
In the course of neural network training, whether the network is adequately trained is generally measured by a loss function.
In general, we choose the following function as the loss function:
$$L(y, \hat{y}) = -\big( y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \big)$$
where $y$ is the true characteristic value of the picture and $\hat{y}$ is the characteristic value of the generated picture.
When $y = 1$, the closer $\hat{y}$ is to 1, the closer $L(y, \hat{y})$ is to 0 and the better the prediction; when the loss function reaches its minimum, the generated image of the current frame produced by the generation model is closest to the original image of the current frame.
3. Back propagation algorithm
In the neural network model, the training effect of the neural network can be measured by calculating the loss function, and the parameters can then be updated by the back propagation algorithm so that the neural network model reaches the desired predicted values. The gradient descent algorithm is a method for optimizing the weights $w$ and the bias $b$.
Specifically, the gradient descent algorithm calculates the partial derivatives of the loss function and then updates w1, w2 and b with these partial derivatives.
For ease of understanding, we express the loss function $L(a, y)$ through the following equations:
$$z = w_1 x_1 + w_2 x_2 + b$$
$$a = \sigma(z)$$
$$L(a, y) = -\big( y \log a + (1 - y) \log(1 - a) \big)$$
Then we differentiate with respect to $a$ and $z$:
$$da = \frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}$$
$$dz = \frac{\partial L}{\partial z} = a - y$$
and then differentiate with respect to w1, w2 and b:
$$dw_1 = x_1 \, dz$$
$$dw_2 = x_2 \, dz$$
$$db = dz$$
The weight parameters $w$ and the bias parameter $b$ are then updated with the gradient descent algorithm:
$$w_1 := w_1 - \alpha \, dw_1$$
$$w_2 := w_2 - \alpha \, dw_2$$
$$b := b - \alpha \, db$$
where $\alpha$ denotes the learning rate, that is, the learning step length. In the actual training process, if the learning rate is too large, the parameters oscillate around the optimal solution and cannot reach it; if it is too small, many iterations may be needed to reach the optimal solution. The learning rate is therefore an important parameter to choose.
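Putting the derivatives together, one gradient descent update for this single sigmoid neuron can be sketched as follows; real models update whole weight matrices with exactly the same rule.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gradient_step(w1, w2, b, x1, x2, y, alpha=0.1):
    """One gradient descent update for the neuron and loss defined above."""
    z = w1 * x1 + w2 * x2 + b
    a = sigmoid(z)
    dz = a - y                        # dL/dz for the cross-entropy loss
    dw1, dw2, db = x1 * dz, x2 * dz, dz
    return w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db
```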
In the present application, the training process of the generation model and the discrimination model is the process of calculating the corresponding loss according to the loss function of the particular image translation model and then updating the weights of the convolution layers in the generation model and the discrimination model with the back propagation algorithm; for the specific updating process, reference may be made to the calculation of the loss function and the back propagation algorithm above.
Different from the prior art, in which only human skeleton points are used as input data of the image translation model, the embodiment of the present application uses the human body 3D model as the input data of the generation model in the image translation model. Because the human body 3D model has complete 3D curved surfaces (full 3D surfaces) of the body, hands and face, the digital human generated by the model carries more human body posture detail information.
In order to make the human body key point information more accurate when the generation model generates the digital human posture, the human body key point information of the target frame image may be added to the input data of the generation model. Referring specifically to fig. 3, another embodiment of the model training method in the embodiment of the present application includes:
301. and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into a generation model of an image translation model for training.
In the embodiment illustrated in fig. 1, the input data only includes the human body 3D model of the target frame. To prevent a large deviation in the digital human posture translated by the generation model when the human body 3D model corresponding to the target frame deviates, the human body key point parameters of the target frame may also be added to the input data of the generation model, so as to correct the human body 3D model.
Specifically, when the image translation model is trained, the human body 3D model of the target frame, the human body key point parameters and the target frame image can be input into the image translation model for training, so as to improve the accuracy of the image translation model in translating the digital human posture.
The training process of the image translation model by using the human body 3D model of the target frame, the human body key point parameters, and the target frame image may refer to the embodiment described in fig. 1, and details are not repeated here.
In addition, when the image translation model is trained, in order to ensure continuity of digital human pose generation, training data of a target frame and a plurality of frames of images adjacent to the target frame may be input during a training process, and referring to fig. 4 specifically, another embodiment of the model training method in the embodiment of the present application includes:
401. inputting training data of the target frame and a plurality of frames adjacent to the target frame into an image translation model for training, wherein the training data comprises the human body 3D model and human body key point parameters of the target frame, the target frame image, and the human body 3D models and human body key point parameters of the adjacent frames.
Different from the embodiments shown in fig. 1 and fig. 3, in order to ensure the continuity of the digital human posture when the image translation model generates it, training data of the target frame and a plurality of frames adjacent to the target frame may be input into the image translation model for training, where the training data includes the human body 3D model and human body key point parameters of the target frame, the target frame image, and the human body 3D models and human body key point parameters of the adjacent frames.
Specifically, assuming that the plurality of frames adjacent to the target frame are the 5 frames before and the 5 frames after it, the training data are the human body 3D model and human body key point parameters of the target frame, the target frame image, the human body 3D model and human body key point parameters of each of the 5 frames before the target frame, and the human body 3D model and human body key point parameters of each of the 5 frames after the target frame. When the trained image translation model generates the digital human posture of the target frame, it therefore refers not only to the image information of the 5 frames before the target frame but also to the image information of the 5 frames after it, so that the digital human posture of the target frame is more coherent with its neighbours and the image translation model is more stable when generating digital human postures.
The training process of the image translation model by using the training data of the target frame and the frames adjacent to the target frame may also refer to the embodiment shown in fig. 1, and is not described herein again.
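A sketch of how such a multi-frame training sample might be assembled is given below. The container names (`renders`, `keypoint_maps`, `frames`) and the channel-stacking layout are illustrative assumptions, not a format prescribed by this application.

```python
import torch


def build_training_sample(renders, keypoint_maps, frames, t, k=5):
    """Assemble the conditioning input and target for frame t.

    renders[i]       : rendered SMPL-X human body 3D model of frame i (CxHxW)
    keypoint_maps[i] : human body key point map of frame i (CxHxW)
    frames[i]        : real video frame i (CxHxW)
    """
    idx = range(max(t - k, 0), min(t + k, len(frames) - 1) + 1)
    cond = torch.cat(
        [torch.cat([renders[i], keypoint_maps[i]], dim=0) for i in idx],
        dim=0,                  # stack target frame and neighbours as channels
    )
    return cond, frames[t]      # (model input, supervision target)
```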
Further, on the basis of the embodiments described in fig. 1 to fig. 4, the generation model in the first image translation model is an encode-decode model, that is, an encoder-decoder model, where the encode model and the decode model may use any one of deep learning algorithms such as CNN, RNN, BiRNN and LSTM. Current deep learning algorithms tend to perform worse as the network gets deeper. Experiments show that as the number of network layers increases, the model accuracy keeps improving at first, but once the depth grows past a certain number of layers, the training accuracy and the test accuracy both drop rapidly; that is, very deep networks become more difficult to train. To reduce this error, existing deep learning algorithms can maintain model accuracy by adopting the ResNet architecture. Fig. 5 shows a schematic diagram of a residual network structure; through the residual network structure, model accuracy can be maintained even when the network is very deep.
To address this, the present application may further perform the following steps on the coding model in the first image translation model; referring to fig. 6, another embodiment of the model training method in the present application includes:
601. modifying the ResNet architecture in the coding model to a lightweight model architecture;
in order to improve the running speed of the model, the ResNet architecture of the coding model in the generation model can be modified into a lightweight model architecture, where the lightweight model architecture includes at least one of a MobileNet model, a ShuffleNet model, a SqueezeNet model and an Xception model.
For convenience of understanding, the acceleration process of the model is described below by taking the MobileNet model as an example:
the basic unit of MobileNet is a depth-level separable convolution, which is in fact a decomposable convolution operation that can be decomposed into two smaller operations: depthwise restriction and pointwise restriction. Depthwise convolution is different from standard convolution, for which the convolution kernel is used on all input channels, and Depthwise convolution uses a different convolution kernel for each input channel, that is, one convolution kernel for each input channel, so that it is said that Depthwise convolution is a depth-level operation. Instead, the poitwise convolution is simply a normal convolution, but it uses a convolution kernel of 1 × 1.
For depthwise partial convolution, firstly, depthwise convolution is adopted to carry out convolution on different input channels respectively, and then pointwise convolution is adopted to combine the outputs, so that the integral effect is almost the same as that of a standard convolution, but the calculated amount and the model parameter amount are greatly reduced.
The following is an analysis of the amount of computation of the depthwise separable convolution and the standard convolution:
Assume that the input feature map size is $D_F \times D_F \times M$ and the output feature map size is $D_F \times D_F \times N$, where $D_F$ is the width and height of the feature map (assuming the width and height of the input feature map and of the output feature map are both $D_F$), and $M$ and $N$ are the numbers of input and output channels. For a standard convolution with a $D_K \times D_K$ kernel, the amount of computation is
$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$
For the separable convolution, the amount of computation of the depthwise convolution is
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F$$
and the amount of computation of the pointwise convolution is
$$M \cdot N \cdot D_F \cdot D_F$$
so the total amount of computation of the depthwise separable convolution is
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$
The ratio of the amounts of computation of the depthwise separable convolution and the standard convolution is then
$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$
In general the value of $N$ is large, so with $3 \times 3$ convolution kernels the amount of computation of the depthwise separable convolution can be reduced to roughly $1/9$ of that of the standard convolution.
The acceleration principle of other models is described in the prior art, and will not be described herein.
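In PyTorch, the depthwise separable unit described above can be sketched as follows; the `groups=in_ch` argument of `nn.Conv2d` gives the one-kernel-per-input-channel behaviour of the depthwise step.

```python
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """MobileNet-style block: depthwise 3x3 conv, then pointwise 1x1 conv."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # one 3x3 kernel per input channel (groups = number of channels)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # 1x1 convolution that mixes the channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))
```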
602. And inputting first training data into the first image translation model for training, and taking the trained first image translation model as a second image translation model.
After the ResNet architecture in the coding model of the first image translation model is modified into a lightweight model architecture, the first image translation model is trained using the first training data, and the trained first image translation model is taken as the second image translation model.
The first training data may be a human body 3D model and a target frame image of a target frame, or a human body 3D model, a human body key point parameter, a target frame image of a target frame, and a human body 3D model and a human body key point parameter of a plurality of frames adjacent to the target frame, which are not limited herein.
The training process of the first image translation model using the first training data is similar to that described in the embodiments of fig. 1 to fig. 4, and is not repeated here.
For the embodiment shown in fig. 6, after the second image translation model is obtained, the generation model in the second image translation model also uses a coding-decoding model, that is, an encoder-decoder model. The decoding model can be regarded as the inverse process of the coding model: corresponding to the convolution operator used in the coding model, the decoding model uses deconvolution, also called transposed convolution, which is an inverse process of convolution.
In the experimental process, it is found that the deconvolution in the decoding model of the second image translation model may cause the generated image to show a grid effect. To eliminate the grid effect, the following steps may further be performed; referring to fig. 7, another embodiment of the model training method in the embodiment of the present application includes:
701. replacing the deconvolution operator with an upsampling operator;
in particular, upsampling refers to any technique that allows an image to be made higher resolution. The simplest way is resampling and interpolation: the input picture is rescaled to a desired size, pixel points of each point are calculated, and interpolation methods such as bilinear interpolation are used for interpolating other points to complete an up-sampling process.
702. And inputting the first training data into the second image translation model for training, and taking the trained second image translation model as a third image translation model.
And after the deconvolution operator in the second image translation model is replaced by the up-sampling operator, inputting the first training data into the second image translation model for training, and taking the trained second image translation model as a third image translation model.
Specifically, the first training data may be a human 3D model and a target frame image of the target frame, or a human 3D model, a human key point parameter, a target frame image of the target frame, and a human 3D model and a human key point parameter of a plurality of frames adjacent to the target frame, which is not limited herein.
The process of training the second image translation model with the first training data is similar to that described in the embodiments of fig. 1 to fig. 4, and is not repeated here.
After the third image translation model is obtained, in order to further increase the image translation speed of the generation model in the third image translation model, the generation model in the third image translation model may also be input into an acceleration framework, so that the acceleration framework parses the generation model in the third image translation model and accelerates its image translation speed.
Specifically, the acceleration framework includes, but is not limited to, TensorRT, OpenVINO, TVM and NCNN. TensorRT is a high-performance deep learning inference optimizer that provides low-latency, high-throughput deployment inference for deep learning applications. TensorRT can directly optimize a trained model: after the network model is trained, the trained model file can be dropped directly into TensorRT to accelerate its inference speed.
When the acceleration framework is TensorRT, because TensorRT does not support the reflect padding operator in the image translation model, as one optional embodiment the reflect padding operator of the generation model in the third image translation model may be replaced with a standard padding operator; as another optional embodiment, the reflect padding of the generation model in the image translation model can simply be ignored.
The acceleration principle of the acceleration framework on the network model is described in detail in the prior art, and is not described herein again.
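One common route into TensorRT is to export the generation model to ONNX first and let TensorRT parse the ONNX file. The sketch below assumes a PyTorch generation model with a single image input; the input shape and opset are illustrative, and the padding swap shown walks only top-level modules (a real model would need a recursive traversal).

```python
import torch
import torch.nn as nn

model.eval()  # `model` is the trained generation model (assumed)

# optional: swap reflect padding for standard zero padding before export,
# for TensorRT versions that do not support the reflect padding operator
for name, child in model.named_children():
    if isinstance(child, nn.ReflectionPad2d):
        setattr(model, name, nn.ZeroPad2d(child.padding))

dummy = torch.randn(1, 3, 512, 512)  # one conditioning image (shape assumed)
torch.onnx.export(model, dummy, "generator.onnx",
                  input_names=["cond"], output_names=["image"],
                  opset_version=11)
```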
It should be understood that, in various embodiments of the present invention, the sequence numbers of the above steps do not mean the execution sequence, and the execution sequence of each step should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The model training method in the present application is described above; the following describes the model training apparatus in the present application with reference to fig. 8. An embodiment of the model training apparatus in the present application includes:
the first module 801 is configured to extract human body key point parameters of each frame of image from each frame of image of the video data, and input the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to a human body posture of each frame of image, where the human body key point parameters include human body skeleton point parameters, hand key point parameters, and face key point parameters;
the second module 802 is configured to input the human body 3D model of the target frame and the target frame image into an image translation model for training, and to take the trained image translation model as the first image translation model, where the image translation model is a GAN network model, and the target frame is any one or more frames in the video data.
Preferably, the second module 802 is specifically configured to:
and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into the image translation model for training.
Preferably, the second module 802 is specifically configured to:
inputting training data of the target frame and a plurality of frames adjacent to the target frame into the image translation model for training, wherein the training data comprises the human body 3D model and human body key point parameters of the target frame, the target frame image, and the human body 3D models and human body key point parameters of the adjacent frames.
Preferably, the generation model in the first image translation model is a coding-decoding model, the coding model is trained by using a ResNet residual network architecture, and the apparatus further includes:
a modification module 803, configured to modify the ResNet architecture in the coding model into a lightweight model architecture;
the training module 804 is configured to input first training data into the first image translation model for training, and use the trained first image translation model as a second image translation model.
Preferably, the generation model in the second image translation model is a coding-decoding model, and the decoding model in the second image translation model is trained by using a deconvolution operator:
the modifying module 803 is further configured to replace the deconvolution operator with an upsampling operator;
the training module 804 is further configured to input the first training data to the second image translation model for training, and use the trained second image translation model as a third image translation model.
Preferably, the first training data includes:
a human 3D model of the target frame and the target frame image;
or,
the human body 3D model of the target frame, the human body key point parameters and the target frame image;
or,
the human body 3D model of the target frame, the human body key point parameters, the target frame image, and the human body 3D model and the human body key point parameters of a plurality of frames adjacent to the target frame.
Preferably, the apparatus further comprises:
an acceleration module 805, configured to input the generation model in the third image translation model into an acceleration framework, so that the acceleration framework parses the generation model and accelerates its image translation speed.
Preferably, the modification module 803 is further configured to:
when the acceleration framework is TensorRT, replacing the reflect padding operator in the image translation model with a standard padding operator.
Preferably, the image translation model includes at least one of a pix2pix model, a pix2pixHD model, and a vid2vid model.
Preferably, the lightweight model includes at least one of a MobileNet model, a ShuffleNet model, a SqueezeNet model, and an Xception model.
The functions of the above modules are similar to those described in the embodiments of fig. 1 to 7, and are not described herein again.
Different from the prior art, in which only human skeleton points are used as input data of the image translation model, the second module 802 in the embodiment of the present application uses the human body 3D model as the input data of the generation model in the image translation model. Because the human body 3D model has complete 3D curved surfaces (full 3D surfaces) of the body, hands and face, the digital human generated by the model carries more human body posture detail information.
Further, in the embodiment of the present application, a modification module 803 is further used to modify a RestNet architecture in the coding model in the first image translation model into a lightweight model architecture, so as to accelerate the image translation speed of the model.
Further, in the embodiment of the present application, the modification module 803 is further used to replace the deconvolution operator in the decoding model in the second image translation model with the upsampling operator, so that the grid effect in the generated picture is eliminated.
Further, in the embodiment of the present application, the acceleration module 805 inputs the generation model in the third image translation model into the acceleration framework, so as to further increase the image translation speed of the generation model in the third image translation model.
The model training apparatus in the embodiment of the present invention is described above from the perspective of the modular functional entity, and the computer apparatus in the embodiment of the present invention is described below from the perspective of hardware processing:
the computer device is used for realizing the functions of the model training device, and one embodiment of the computer device in the embodiment of the invention comprises the following components:
a processor and a memory;
the memory is used for storing the computer program, and the processor is used for realizing the following steps when executing the computer program stored in the memory:
extracting human body key point parameters of each frame of image from each frame of image of video data respectively, and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
inputting a human body 3D model of a target frame and a target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
In some embodiments of the present invention, the processor may be further configured to:
and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into the image translation model for training.
In some embodiments of the present invention, the processor may be further configured to:
inputting training data of the target frame and a plurality of frames adjacent to the target frame into the image translation model for training, wherein the training data comprises the human body 3D model and human body key point parameters of the target frame, the target frame image, and the human body 3D models and human body key point parameters of the adjacent frames.
In some embodiments of the present invention, the processor may be further configured to:
modifying the ResNet architecture in the coding model to a lightweight model architecture;
and inputting first training data into the first image translation model for training, and taking the trained first image translation model as a second image translation model.
In some embodiments of the present invention, the processor may be further configured to:
replacing the deconvolution operator with an upsampling operator;
and inputting the first training data into the second image translation model for training, and taking the trained second image translation model as a third image translation model.
In some embodiments of the present invention, the processor may be further configured to:
inputting the generation model in the third image translation model into an acceleration framework, so that the acceleration framework parses the generation model and accelerates its image translation speed.
In some embodiments of the present invention, the processor may be further configured to:
when the acceleration framework is TensorRT, replacing the reflect padding operator in the image translation model with a standard padding operator.
It is to be understood that, when the processor in the computer apparatus described above executes the computer program, the functions of each unit in the corresponding apparatus embodiments may also be implemented, and are not described herein again. Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the model training apparatus. For example, the computer program may be divided into units in the above-described model training apparatus, and each unit may realize specific functions as described in the above-described corresponding model training apparatus.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server or other computing equipment. The computer device may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that the processor and memory are merely examples of a computer apparatus and are not meant to be limiting; more or fewer components may be included, certain components may be combined, or different components may be used. For example, the computer apparatus may also include input and output devices, network access devices, buses, etc.
The processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor, or any conventional processor; the processor is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash memory card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid-state storage device.
The present invention also provides a computer-readable storage medium for implementing the functions of the model training apparatus, on which a computer program is stored which, when executed by a processor, carries out the following steps:
extracting human body key point parameters of each frame of image from each frame of image of video data respectively, and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
inputting a human body 3D model of a target frame and a target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
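As an illustration of the first step, the sketch below assumes the public smplx reference implementation and that the skeleton, hand and face keypoint parameters have already been fitted to SMPL-X pose and expression tensors; the model path and tensor sizes are assumptions:

```python
import torch
import smplx

# Assumed: the smplx package and downloaded SMPL-X model files.
body_model = smplx.create("models/", model_type="smplx", use_pca=False)

# Per-frame parameters fitted from the extracted keypoints; the zero
# tensors below only mark the expected (batch, dim) shapes.
output = body_model(
    body_pose=torch.zeros(1, 63),        # from the human body skeleton points
    left_hand_pose=torch.zeros(1, 45),   # from the hand key points
    right_hand_pose=torch.zeros(1, 45),
    expression=torch.zeros(1, 10),       # from the face key points
    return_verts=True,
)
vertices = output.vertices  # the human body 3D model for this frame
```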
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into the image translation model for training.
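One plausible way to combine the two conditioning inputs, purely as an illustration, is to rasterize the human body 3D model and stack it with per-keypoint heatmaps along the channel axis before the result enters the generator; the channel counts below are assumptions:

```python
import torch

render_rgb = torch.rand(3, 512, 512)       # rasterized human body 3D model
keypoint_maps = torch.rand(25, 512, 512)   # one heatmap per key point
condition = torch.cat([render_rgb, keypoint_maps], dim=0)  # (28, 512, 512)
```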
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
inputting training data of the target frame and a plurality of frames adjacent to the target frame into the image translation model for training, wherein the training data comprises the human body 3D model, human body key point parameters and image of the target frame, together with the human body 3D models and human body key point parameters of the adjacent frames.
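A hypothetical helper for assembling one such multi-frame training sample; the container names and the neighbourhood size k are illustrative only:

```python
def build_sample(images, renders, keypoints, t, k=2):
    # Gather the target frame plus up to k neighbours on each side,
    # each paired with its human body 3D model render and key point
    # parameters, as one training sample.
    idx = range(max(0, t - k), min(len(images), t + k + 1))
    return {
        "target_image": images[t],
        "conditions": [(renders[i], keypoints[i]) for i in idx],
    }
```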
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
modifying the ResNet architecture in the coding model to a lightweight model architecture;
and inputting first training data into the first image translation model for training, and taking the trained first image translation model as a second image translation model.
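A sketch of the backbone swap, assuming the generation model exposes its coding model as an `encoder` attribute; MobileNetV2 stands in here for any of the lightweight architectures contemplated, and `weights=None` follows the torchvision >= 0.13 API:

```python
import torch.nn as nn
from torchvision.models import mobilenet_v2

def lighten_encoder(generator: nn.Module) -> nn.Module:
    # Replace the ResNet coding model with a MobileNetV2 feature
    # extractor before retraining on the first training data.
    generator.encoder = mobilenet_v2(weights=None).features
    return generator
```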
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
replacing the deconvolution operator with an upsampling operator;
and inputting the first training data into the second image translation model for training, and taking the trained second image translation model as a third image translation model.
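The operator replacement can be sketched as follows: each transposed convolution in the decoding model gives way to nearest-neighbour upsampling followed by an ordinary convolution with matching channel counts; the kernel size is an assumption:

```python
import torch.nn as nn

def to_upsample_block(deconv: nn.ConvTranspose2d) -> nn.Sequential:
    # Nearest-neighbour upsampling followed by a plain convolution
    # doubles the spatial size just as a stride-2 deconvolution would,
    # while avoiding checkerboard artifacts and easing deployment.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(deconv.in_channels, deconv.out_channels,
                  kernel_size=3, padding=1),
    )
```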
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
inputting the generation model in the third image translation model into an acceleration framework, so that the acceleration framework parses the generation model and accelerates its image translation speed.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
when the acceleration framework is TensorRT, replacing a reflect padding operator in the image translation model with a padding operator.
It will be appreciated that the integrated units, if implemented as software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed by a processor, carries out the steps of the above method embodiments. The computer program comprises computer program code, which may be in source-code form, object-code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only one kind of logical division, and other divisions are possible in actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features equivalently replaced, without such modifications or substitutions departing from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A method of model training, the method comprising:
extracting human body key point parameters of each frame of image from each frame of image of video data respectively, and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
inputting a human body 3D model of a target frame and a target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
2. The method of claim 1, wherein the inputting the human body 3D model of the target frame and the target frame image into an image translation model for training comprises:
and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into the image translation model for training.
3. The method of claim 1, wherein the inputting the human body 3D model of the target frame and the target frame image into an image translation model for training comprises:
inputting training data of the target frame and a plurality of frames adjacent to the target frame into the image translation model for training, wherein the training data comprises the human body 3D model, human body key point parameters and image of the target frame, and the human body 3D models and human body key point parameters of the adjacent frames.
4. The method of any of claims 1 to 3, wherein the generation model in the first image translation model is a coding-decoding model, the coding model of the generation model in the first image translation model being trained using a ResNet residual network architecture, the method further comprising:
modifying the ResNet architecture in the coding model to a lightweight model architecture;
inputting first training data into the first image translation model for training, and taking the trained first image translation model as a second image translation model;
the first training data comprises:
a human 3D model of the target frame and the target frame image;
or the like, or, alternatively,
the human body 3D model of the target frame, the human body key point parameters and the target frame image;
or the like, or, alternatively,
the human body 3D model of the target frame, the human body key point parameters, the target frame image, and the human body 3D model and the human body key point parameters of a plurality of frames adjacent to the target frame.
5. The method of claim 4, wherein the generation model in the second image translation model is a coding-decoding model, and wherein the decoding model in the second image translation model is trained using a deconvolution operator, the method further comprising:
replacing the deconvolution operator with an upsampling operator;
and inputting the first training data into the second image translation model for training, and taking the trained second image translation model as a third image translation model.
6. The method of claim 5, further comprising:
inputting the generation model in the third image translation model into an acceleration framework, so that the acceleration framework analyzes the generation model in the third image translation model and accelerates the image translation speed of the generation model in the third image translation model.
7. The method of claim 6, further comprising:
when the acceleration framework is TensorRT, replacing a reflect padding operator in the image translation model with a padding operator.
8. The method of any of claims 1 to 3, wherein the image translation model comprises at least one of a pix2pix model, a pix2pixHD model, and a vid2vid model.
9. The method of claim 4, wherein the lightweight model comprises at least one of a MobileNet model, a ShuffleNet model, a SqueezeNet model, and an Xception model.
10. A model training apparatus, the apparatus comprising:
the system comprises a first module, a second module and a third module, wherein the first module is used for respectively extracting human body key point parameters of each frame of image from each frame of image of video data and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, and the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
and the second module is used for inputting the human body 3D model of the target frame and the target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
11. A computer device comprising a processor, characterized in that the processor, when executing a computer program stored in a memory, carries out the model training method of any one of claims 1 to 9.
12. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the model training method according to any one of claims 1 to 9.
CN202110629148.1A 2021-06-07 2021-06-07 Model training method and model training device Active CN113077383B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110629148.1A CN113077383B (en) 2021-06-07 2021-06-07 Model training method and model training device

Publications (2)

Publication Number Publication Date
CN113077383A CN113077383A (en) 2021-07-06
CN113077383B (en) 2021-11-02

Family

ID=76617062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110629148.1A Active CN113077383B (en) 2021-06-07 2021-06-07 Model training method and model training device

Country Status (1)

Country Link
CN (1) CN113077383B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154092B (en) * 2021-11-18 2023-04-18 网易有道信息技术(江苏)有限公司 Method for translating web pages and related product

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3579196A1 (en) * 2018-06-05 2019-12-11 Cristian Sminchisescu Human clothing transfer method, system and device
CN111680562A (en) * 2020-05-09 2020-09-18 北京中广上洋科技股份有限公司 Human body posture identification method and device based on skeleton key points, storage medium and terminal
CN112801064A (en) * 2021-04-12 2021-05-14 北京的卢深视科技有限公司 Model training method, electronic device and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163048A (en) * 2018-07-10 2019-08-23 腾讯科技(深圳)有限公司 Recognition model training method, recognition method and device for hand key points
CN109558832A (en) * 2018-11-27 2019-04-02 广州市百果园信息技术有限公司 Human body posture detection method, device, equipment and storage medium
CN111753824A (en) * 2019-03-27 2020-10-09 辉达公司 Image segmentation using neural network translation models
US10990848B1 (en) * 2019-12-27 2021-04-27 Sap Se Self-paced adversarial training for multimodal and 3D model few-shot learning
CN112233054A (en) * 2020-10-12 2021-01-15 北京航空航天大学 Human-object interaction image generation method based on relation triple
CN112418399A (en) * 2020-11-20 2021-02-26 清华大学 Method and device for training attitude estimation model and method and device for attitude estimation
CN112818898A (en) * 2021-02-20 2021-05-18 北京字跳网络技术有限公司 Model training method and device and electronic equipment
CN112884640A (en) * 2021-03-01 2021-06-01 深圳追一科技有限公司 Model training method, related device and readable storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
bmlSUP – A SMPL Unity Player; Adam O. Bebko et al.; IEEE; 2021-05-06; 573-574 *
Lightweight design of human pose estimation networks; Gao Bingkun et al.; Research and Exploration in Laboratory; 2020-01-25 (No. 01); 85-88 *
Research on pose transfer methods based on adversarial networks; Zhang Kunyao; China Masters' Theses Full-text Database, Information Science and Technology; 2021-02-15; Vol. 2021, No. 02; I138-977 *
3D human pose estimation and image synthesis based on deep learning; Chen Fei; China Masters' Theses Full-text Database, Information Science and Technology; 2021-04-15; Vol. 2021, No. 04; I138-546 *
Video-based 3D human pose estimation; Yang Bin et al.; Journal of Beijing University of Aeronautics and Astronautics; 2019 (No. 12); 122-128 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant