CN113077383B - Model training method and model training device

Model training method and model training device

Info

Publication number
CN113077383B
CN113077383B
Authority
CN
China
Prior art keywords
model
image
human body
target frame
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110629148.1A
Other languages
Chinese (zh)
Other versions
CN113077383A (en)
Inventor
王鑫宇
刘炫鹏
杨国基
刘致远
刘云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202110629148.1A priority Critical patent/CN113077383B/en
Publication of CN113077383A publication Critical patent/CN113077383A/en
Application granted granted Critical
Publication of CN113077383B publication Critical patent/CN113077383B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T3/04
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Abstract

The embodiment of the invention discloses a model training method and a model training device, which are used to increase the posture detail information of a digital human when the digital human is generated. The method provided by the embodiment of the invention comprises the following steps: extracting human body key point parameters of each frame of image from each frame of image of video data, and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters; and inputting the human body 3D model of a target frame and the target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.

Description

Model training method and model training device
Technical Field
The invention relates to the technical field of image translation, in particular to a model training method and a model training device.
Background
Image translation refers to the conversion of one image into another, by analogy with machine translation, in which one language is converted into another.
Classical image translation models in the prior art include pix2pix, pix2pixHD, and vid2vid. pix2pix provides a unified framework to solve various image translation problems, pix2pixHD better solves the problem of high-resolution image conversion (translation) on the basis of pix2pix, and vid2vid better solves the problem of high-resolution video conversion on the basis of pix2pixHD.
A digital human is a virtual simulation of the shape and function of the human body, at different levels of fidelity, by means of information science methods. However, the image translation models in the prior art are trained only on skeleton point data, so the existing image translation models carry only skeleton point information and lack detailed human body posture information when generating the posture of a digital human.
Disclosure of Invention
The embodiment of the invention provides a model training method and a model training device, which are used to increase the posture detail information of a digital human when the digital human is generated.
A first aspect of an embodiment of the present application provides a model training method, including:
extracting human body key point parameters of each frame of image from each frame of image of video data respectively, and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
inputting a human body 3D model of a target frame and a target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
Preferably, the inputting the human body 3D model of the target frame and the target frame image into the image translation model for training includes:
and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into the image translation model for training.
Preferably, the inputting the human body 3D model of the target frame and the target frame image into the image translation model for training includes:
inputting training data of the target frame and a plurality of frames adjacent to the target frame into the image translation model for training, wherein the training data comprises the human body 3D model and human body key point parameters of the target frame, the target frame image, and the human body 3D models and human body key point parameters of the adjacent frames.
Preferably, the generation model in the first image translation model is a coding-decoding model, the coding model is trained by using a ResNet residual network architecture, and the method further includes:
modifying the ResNet architecture in the coding model to a lightweight model architecture;
and inputting first training data into the first image translation model for training, and taking the trained first image translation model as a second image translation model.
Preferably, the generation model in the second image translation model is a coding-decoding model, the decoding model in the second image translation model is trained by using a deconvolution operator, and the method further includes:
replacing the deconvolution operator with an upsampling operator;
and inputting the first training data into the second image translation model for training, and taking the trained second image translation model as a third image translation model.
Preferably, the first training data includes:
a human 3D model of the target frame and the target frame image;
or,
the human body 3D model of the target frame, the human body key point parameters and the target frame image;
or,
the human body 3D model of the target frame, the human body key point parameters, the target frame image, and the human body 3D model and the human body key point parameters of a plurality of frames adjacent to the target frame.
Preferably, the method further comprises:
inputting the generation model in the third image translation model into an acceleration framework, so that the acceleration framework parses the generation model and accelerates its image translation speed.
Preferably, the method further comprises:
when the acceleration framework is TensorRT, replacing the reflect padding operator in the image translation model with a standard padding operator.
Preferably, the image translation model includes at least one of a pix2pix model, a pix2pixHD model, and a vid2vid model.
Preferably, the lightweight model includes at least one of a MobileNet model, a ShuffleNet model, a SqueezeNet model, and an Xception model.
A second aspect of the embodiments of the present application provides a model training apparatus, including:
the system comprises a first module, a second module and a third module, wherein the first module is used for respectively extracting human body key point parameters of each frame of image from each frame of image of video data and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, and the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
and the second module is used for inputting the human body 3D model of the target frame and the target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
Preferably, the second module is specifically configured to:
and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into the image translation model for training.
Preferably, the second module is specifically configured to:
inputting training data of the target frame and a plurality of frames adjacent to the target frame into the image translation model for training, wherein the training data comprises the human body 3D model and human body key point parameters of the target frame, the target frame image, and the human body 3D models and human body key point parameters of the adjacent frames.
Preferably, the generation model in the first image translation model is a coding-decoding model, the coding model is trained by using a ResNet residual network architecture, and the apparatus further includes:
a modification module, configured to modify the ResNet architecture in the coding model into a lightweight model architecture;
and the training module is used for inputting first training data into the first image translation model for training and taking the trained first image translation model as a second image translation model.
Preferably, the generation model in the second image translation model is a coding-decoding model, and the decoding model in the second image translation model is trained by using a deconvolution operator:
the modification module is further used for replacing the deconvolution operator with an upsampling operator;
the training module is further configured to input the first training data to the second image translation model for training, and use the trained second image translation model as a third image translation model.
Preferably, the first training data includes:
a human 3D model of the target frame and the target frame image;
or,
the human body 3D model of the target frame, the human body key point parameters and the target frame image;
or,
the human body 3D model of the target frame, the human body key point parameters, the target frame image, and the human body 3D model and the human body key point parameters of a plurality of frames adjacent to the target frame.
Preferably, the apparatus further comprises:
and the acceleration module is used for inputting the generation model in the third image translation model into an acceleration framework, so that the acceleration framework parses the generation model and accelerates its image translation speed.
Preferably, the modification module is further configured to:
when the acceleration framework is TensorRT, replacing the reflect padding operator in the image translation model with a standard padding operator.
Preferably, the image translation model includes at least one of a pix2pix model, a pix2pixHD model, and a vid2vid model.
Preferably, the lightweight model includes at least one of a MobileNet model, a ShuffleNet model, a SqueezeNet model, and an Xception model.
A third aspect of embodiments of the present application provides a computer apparatus, including a processor, where the processor is configured to implement the model training method provided in the first aspect of embodiments of the present application when executing a computer program stored in a memory.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is used, when being executed by a processor, to implement the model training method provided in the first aspect of the embodiments of the present application.
According to the technical scheme, the embodiment of the invention has the following advantages:
in the embodiment of the application, human body key point parameters of each frame of image are respectively extracted from each frame of image of video data, and the human body key point parameters of each frame of image are input into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters; inputting a human body 3D model of a target frame and a target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
Because the model input data of the embodiment of the application is the 3D model of the human body, the trained first image translation model can have more posture detail information when generating the digital human posture.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a model training method in an embodiment of the present application;
FIG. 2 is a schematic diagram of a neural network architecture in an embodiment of the present application;
FIG. 3 is a schematic diagram of another embodiment of a model training method in an embodiment of the present application;
FIG. 4 is a schematic diagram of another embodiment of a model training method in the embodiment of the present application;
FIG. 5 is a diagram illustrating a residual error network architecture according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another embodiment of a model training method in an embodiment of the present application;
FIG. 7 is a schematic diagram of another embodiment of a model training method in an embodiment of the present application;
fig. 8 is a schematic diagram of an embodiment of a model training apparatus in an embodiment of the present application.
Detailed Description
The embodiment of the invention provides a model training method and a training device, which are used for increasing the posture detail information of a digital person when the digital person is generated.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method differs from the prior art, in which only skeleton point information is input when a digital human is generated, so that the generated digital human lacks posture detail information.
For convenience of understanding, the following describes a model training method and a model training apparatus in the present application, and referring to fig. 1, an embodiment of the model training method in the present application includes:
101. extracting human body key point parameters of each frame of image from each frame of image of video data respectively, and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
The SMPL-X (SMPL eXpressive) model combines the SMPL body model, the MANO hand model and the FLAME head model, and is registered to 5586 3D scans to guarantee quality. Trained on these data, the model captures the natural correlations between body, hands and face without artifacts.
Because the trained SMPL-X (SMPL eXpressive) model has complete 3D surfaces (full 3D surfaces) of the body, hands and face, the digital human generated from it carries more detailed human body posture information.
In order to realize the simulation of the human body posture in each frame of image in the video data, the human body key point parameters of each frame of image are respectively extracted from each frame of image in the video data, and the human body key point parameters of each frame of image are input into the SMPL-X model so as to generate the human body 3D model corresponding to the human body posture of each frame of image.
As a specific implementation, OpenPose can be used to detect the human body key point parameters in each frame of the video. OpenPose is an open-source library based on convolutional neural networks and supervised learning, developed on the Caffe framework. It can estimate human body actions, facial expressions, finger motions and the like, is suitable for both single-person and multi-person scenes, and has excellent robustness. For real-time two-dimensional multi-person keypoint recognition it can identify 15, 18 or 25 body/foot keypoints, 2 × 21 hand keypoints and 70 face keypoints.
For OpenPose, the input may be a picture, a video, or the video stream of a camera (such as a webcam, a Flir/Point Grey camera or an IP camera), and the output may be the original picture with the keypoints overlaid, a file storing the keypoint data, and so on.
After the human body key point parameters in each frame of image are identified, the human body key point parameters of each frame of image are input into the SMPL-X model to generate the human body 3D model corresponding to the human body posture of each frame of image. For the specific process of inputting the human body key point parameters into the SMPL-X model to obtain the corresponding human body 3D model, reference may be made to the description of the SMPLify-X method in the prior art, which is not repeated herein.
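As a rough sketch of this per-frame extraction step, the fragment below walks a video with OpenCV and delegates detection to a caller-supplied function. `detect_keypoints` is a hypothetical wrapper around OpenPose (or any equivalent detector), introduced here only for illustration, since the exact Python bindings differ between OpenPose builds.

```python
import cv2  # OpenCV, used only for frame decoding


def extract_keypoints_per_frame(video_path, detect_keypoints):
    """Run a keypoint detector on every frame of a video.

    `detect_keypoints` is assumed to take one BGR image and return
    (body, hands, face) keypoint arrays, e.g. 25 body/foot points,
    2 x 21 hand points and 70 face points in OpenPose's layout.
    """
    cap = cv2.VideoCapture(video_path)
    all_keypoints = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break  # end of video
        body, hands, face = detect_keypoints(frame)
        all_keypoints.append({"body": body, "hands": hands, "face": face})
    cap.release()
    return all_keypoints
```

The per-frame keypoint dictionaries would then be handed to the SMPLify-X fitting step to produce one SMPL-X body mesh per frame.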
102. Inputting a human body 3D model of a target frame and a target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
After the human body 3D model corresponding to the human body posture of each frame of image is obtained, the human body 3D model of a target frame and the image of the target frame are input into an image translation model for training, and the trained image translation model is taken as the first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames of images in the video data.
When a GAN network is applied in deep learning, the generation model G (generator) learns the data distribution by continuously playing a game against the discrimination model D (discriminator); when the GAN is used for image generation, G can generate a vivid image from a random number after training is completed.
The main functions of G and D are:
G is a generative network which receives a random noise z (a random number) and generates an image from this noise;
D is a discrimination network which discriminates whether a picture is "real". Its input parameter is x, where x represents a picture, and the output D(x) represents the probability that x is a real picture: an output of 1 means the picture is certainly real, and an output of 0 means it cannot be real.
In the training process, the aim of the generation model G is to generate pictures real enough to deceive the discrimination model D, while the aim of D is to distinguish the images generated by G from real images as well as possible. G and D thus form a dynamic game whose final equilibrium is a Nash equilibrium; once the Nash equilibrium between G and D is reached, the training of G is finished.
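To make the G/D game concrete, the following is a minimal training step, assuming `G`, `D` and their optimizers are already constructed and that `D` outputs raw logits; for brevity, D here scores the image alone, whereas a pix2pix-style discriminator scores the (condition, image) pair.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # expects raw logits from D


def gan_step(G, D, opt_G, opt_D, cond, real):
    # --- train D: push real images toward 1 and generated ones toward 0 ---
    pred_real = D(real)
    fake = G(cond).detach()  # detach: no gradient flows into G on D's step
    pred_fake = D(fake)
    loss_D = (bce(pred_real, torch.ones_like(pred_real))
              + bce(pred_fake, torch.zeros_like(pred_fake)))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # --- train G: try to make D score the generated image as real ---
    pred_fake = D(G(cond))
    loss_G = bce(pred_fake, torch.ones_like(pred_fake))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```

At the Nash equilibrium described above, neither update can improve its side of this game any further.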
Specifically, the process of training the generation model and the discrimination model in the image translation model is as follows:
inputting the human body 3D model of the target frame into the generation model of the image translation model to obtain a generated image of the target frame, then calculating a first loss between the target frame image and the generated image of the target frame according to a preset loss function, and performing a gradient update on the weights of the convolution layers in the generation model according to the first loss and the back propagation algorithm.
It should be noted that the target frame image and the generated image of the target frame in the embodiment of the present application are two different objects: the target frame image is the real image of the target frame, that is, the real image of the target frame in the video data, while the generated image of the target frame is the image produced by inputting the human body 3D model of the target frame into the generation model of the image translation model.
Specifically, the image translation model in the embodiment of the present application includes at least one of pix2pix, pix2pixHD and vid2vid, and the corresponding preset loss functions differ between the specific image translation models:
for pix2pix, the preset loss function includes the L1 loss between the target frame image and the generated image of the target frame, and a GAN loss that diversifies the output;
for pix2pixHD, the preset loss function includes the L1 loss between the target frame image and the generated image of the target frame, a GAN loss that diversifies the output, a feature matching loss and a content loss;
for vid2vid, the preset loss function includes the L1 loss between the target frame image and the generated image of the target frame, a GAN loss that diversifies the output, a feature matching loss, a content loss, a video loss and an optical flow loss.
Each loss function is described in detail in the prior art and is not detailed here.
In order to facilitate understanding of the gradient updating process, the generation model and the discrimination model in the GAN network are first briefly described:
the generation model and the discrimination model in the image translation model adopt a Neural Network algorithm, and a Multi-Layer Perceptron (MLP), also called an Artificial Neural Network (ANN), generally includes an input Layer, an output Layer, and a plurality of hidden layers arranged between the input Layer and the output Layer. The simplest MLP requires a hidden layer, i.e., an input layer, a hidden layer, and an output layer, to be referred to as a simple neural network.
Next, taking the neural network in fig. 2 as an example, the data transmission process is described:
1. Forward output of the neural network
In layer 0 (the input layer), we vectorize the inputs x1, x2, x3 into $X$;
between layer 0 and layer 1 (the hidden layer) there are weights w1, w2, w3, vectorized into $W^{[1]}$, where $W^{[1]}$ denotes the weights of the first layer;
between layer 0 and layer 1 there are also biases b1, b2, b3, vectorized into $b^{[1]}$, where $b^{[1]}$ denotes the biases of the first layer;
for layer 1, the calculation formulas are:
$$Z^{[1]} = W^{[1]} X + b^{[1]}$$
$$a^{[1]} = \sigma(Z^{[1]})$$
where $Z$ is a linear combination of the input values and $a$ is the value of $Z$ passed through the sigmoid activation function. For the input $X$ of the first layer, the output value is $a$, which is also the input of the next layer. The sigmoid activation function takes values in $[0, 1]$ and can be understood as a valve, much like a human neuron: a neuron does not react immediately when stimulated, but once the stimulus exceeds a threshold, it propagates the signal to the next level.
Between layer 1 and layer 2 (the output layer), similarly to between layer 0 and layer 1, the calculation formulas are:
$$Z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$$
$$a^{[2]} = \sigma(Z^{[2]})$$
$$\hat{y} = a^{[2]}$$
where $\hat{y}$ is the output value of the neural network.
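The forward pass above can be written out directly in a few lines of NumPy; a 3-3-1 layout is assumed here to match the shape of the network in fig. 2.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
X = rng.normal(size=(3, 1))                           # input vector (x1, x2, x3)
W1, b1 = rng.normal(size=(3, 3)), np.zeros((3, 1))    # layer-1 parameters
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))    # layer-2 parameters

Z1 = W1 @ X + b1        # Z[1] = W[1] X + b[1]
a1 = sigmoid(Z1)        # a[1] = sigma(Z[1])
Z2 = W2 @ a1 + b2       # Z[2] = W[2] a[1] + b[2]
yhat = sigmoid(Z2)      # network output, in (0, 1)
```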
2. Loss function
In the course of neural network training, whether the network is adequately trained is generally measured by a loss function.
In general, we choose the following function as the loss function:
$$L(y, \hat{y}) = -\big( y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \big)$$
where $y$ is the true characteristic value of the picture and $\hat{y}$ is the characteristic value of the generated picture.
When $y = 1$, the closer $\hat{y}$ is to 1, the closer $L(y, \hat{y})$ is to 0 and the better the prediction; when the loss function reaches its minimum, the generated image of the current frame produced by the generation model is closest to the original image of the current frame.
3. Back propagation algorithm
In the neural network model, the training effect of the neural network can be measured by calculating the loss function, and the parameters can then be updated by the back propagation algorithm so that the neural network model reaches the desired predicted values. The gradient descent algorithm is a method for optimizing the weights $w$ and the bias $b$.
Specifically, the gradient descent algorithm calculates the partial derivatives of the loss function and then updates w1, w2 and b with these partial derivatives.
For ease of understanding, we express the loss function $L(a, y)$ through the following equations:
$$z = w_1 x_1 + w_2 x_2 + b$$
$$a = \sigma(z)$$
$$L(a, y) = -\big( y \log a + (1 - y) \log(1 - a) \big)$$
Then we differentiate with respect to $a$ and $z$:
$$da = \frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}$$
$$dz = \frac{\partial L}{\partial z} = a - y$$
and then differentiate with respect to w1, w2 and b:
$$dw_1 = x_1 \, dz$$
$$dw_2 = x_2 \, dz$$
$$db = dz$$
The weight parameters $w$ and the bias parameter $b$ are then updated with the gradient descent algorithm:
$$w_1 := w_1 - \alpha \, dw_1$$
$$w_2 := w_2 - \alpha \, dw_2$$
$$b := b - \alpha \, db$$
where $\alpha$ denotes the learning rate, that is, the learning step length. In the actual training process, if the learning rate is too large, the parameters oscillate around the optimal solution and cannot reach it; if it is too small, many iterations may be needed to reach the optimal solution. The learning rate is therefore an important parameter to choose.
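Putting the derivatives together, one gradient descent update for this single sigmoid neuron can be sketched as follows; real models update whole weight matrices with exactly the same rule.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gradient_step(w1, w2, b, x1, x2, y, alpha=0.1):
    """One gradient descent update for the neuron and loss defined above."""
    z = w1 * x1 + w2 * x2 + b
    a = sigmoid(z)
    dz = a - y                        # dL/dz for the cross-entropy loss
    dw1, dw2, db = x1 * dz, x2 * dz, dz
    return w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db
```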
In the present application, the training process of the generation model and the discrimination model is the process of calculating the corresponding loss according to the loss function of the particular image translation model and then updating the weights of the convolution layers in the generation model and the discrimination model with the back propagation algorithm; for the specific updating process, reference may be made to the calculation of the loss function and the back propagation algorithm above.
Different from the prior art, in which only human skeleton points are used as input data of the image translation model, the embodiment of the present application uses the human body 3D model as the input data of the generation model in the image translation model. Because the human body 3D model has complete 3D curved surfaces (full 3D surfaces) of the body, hands and face, the digital human generated by the model carries more human body posture detail information.
In order to make the human body key point information more accurate when the generation model generates the digital human posture, the human body key point information of the target frame image may be added to the input data of the generation model. Referring specifically to fig. 3, another embodiment of the model training method in the embodiment of the present application includes:
301. and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into a generation model of an image translation model for training.
In the embodiment illustrated in fig. 1, the input data only includes the human body 3D model of the target frame. To prevent a large deviation in the digital human posture translated by the generation model when the human body 3D model corresponding to the target frame deviates, the human body key point parameters of the target frame may also be added to the input data of the generation model, so as to correct the human body 3D model.
Specifically, when the image translation model is trained, the human body 3D model of the target frame, the human body key point parameters and the target frame image can be input into the image translation model for training, so as to improve the accuracy of the image translation model in translating the digital human posture.
The training process of the image translation model by using the human body 3D model of the target frame, the human body key point parameters, and the target frame image may refer to the embodiment described in fig. 1, and details are not repeated here.
In addition, when the image translation model is trained, in order to ensure continuity of digital human pose generation, training data of a target frame and a plurality of frames of images adjacent to the target frame may be input during a training process, and referring to fig. 4 specifically, another embodiment of the model training method in the embodiment of the present application includes:
401. inputting training data of the target frame and a plurality of frames adjacent to the target frame into an image translation model for training, wherein the training data comprises the human body 3D model and human body key point parameters of the target frame, the target frame image, and the human body 3D models and human body key point parameters of the adjacent frames.
Different from the embodiments shown in fig. 1 and fig. 3, in order to ensure the continuity of the digital human posture when the image translation model generates it, training data of the target frame and a plurality of frames adjacent to the target frame may be input into the image translation model for training, where the training data includes the human body 3D model and human body key point parameters of the target frame, the target frame image, and the human body 3D models and human body key point parameters of the adjacent frames.
Specifically, assuming that the plurality of frames adjacent to the target frame are the 5 frames before and the 5 frames after it, the training data are the human body 3D model and human body key point parameters of the target frame, the target frame image, the human body 3D model and human body key point parameters of each of the 5 frames before the target frame, and the human body 3D model and human body key point parameters of each of the 5 frames after the target frame. When the trained image translation model generates the digital human posture of the target frame, it therefore refers not only to the image information of the 5 frames before the target frame but also to the image information of the 5 frames after it, so that the digital human posture of the target frame is more coherent with its neighbours and the image translation model is more stable when generating digital human postures.
The training process of the image translation model by using the training data of the target frame and the frames adjacent to the target frame may also refer to the embodiment shown in fig. 1, and is not described herein again.
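A sketch of how such a multi-frame training sample might be assembled is given below. The container names (`renders`, `keypoint_maps`, `frames`) and the channel-stacking layout are illustrative assumptions, not a format prescribed by this application.

```python
import torch


def build_training_sample(renders, keypoint_maps, frames, t, k=5):
    """Assemble the conditioning input and target for frame t.

    renders[i]       : rendered SMPL-X human body 3D model of frame i (CxHxW)
    keypoint_maps[i] : human body key point map of frame i (CxHxW)
    frames[i]        : real video frame i (CxHxW)
    """
    idx = range(max(t - k, 0), min(t + k, len(frames) - 1) + 1)
    cond = torch.cat(
        [torch.cat([renders[i], keypoint_maps[i]], dim=0) for i in idx],
        dim=0,                  # stack target frame and neighbours as channels
    )
    return cond, frames[t]      # (model input, supervision target)
```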
Further, on the basis of the embodiments described in fig. 1 to fig. 4, the generation model in the first image translation model is an encode-decode model, that is, an encoder-decoder model, where the encode model and the decode model may use any one of deep learning algorithms such as CNN, RNN, BiRNN and LSTM. Current deep learning algorithms tend to perform worse as the network gets deeper. Experiments show that as the number of network layers increases, the model accuracy keeps improving at first, but once the depth grows past a certain number of layers, the training accuracy and the test accuracy both drop rapidly; that is, very deep networks become more difficult to train. To reduce this error, existing deep learning algorithms can maintain model accuracy by adopting the ResNet architecture. Fig. 5 shows a schematic diagram of a residual network structure; through the residual network structure, model accuracy can be maintained even when the network is very deep.
To address this, the present application may further perform the following steps on the coding model in the first image translation model; referring to fig. 6, another embodiment of the model training method in the present application includes:
601. modifying the ResNet architecture in the coding model to a lightweight model architecture;
in order to improve the running speed of the model, the ResNet architecture of the coding model in the generation model can be modified into a lightweight model architecture, where the lightweight model architecture includes at least one of a MobileNet model, a ShuffleNet model, a SqueezeNet model and an Xception model.
For convenience of understanding, the acceleration process of the model is described below by taking the MobileNet model as an example:
the basic unit of MobileNet is a depth-level separable convolution, which is in fact a decomposable convolution operation that can be decomposed into two smaller operations: depthwise restriction and pointwise restriction. Depthwise convolution is different from standard convolution, for which the convolution kernel is used on all input channels, and Depthwise convolution uses a different convolution kernel for each input channel, that is, one convolution kernel for each input channel, so that it is said that Depthwise convolution is a depth-level operation. Instead, the poitwise convolution is simply a normal convolution, but it uses a convolution kernel of 1 × 1.
For depthwise partial convolution, firstly, depthwise convolution is adopted to carry out convolution on different input channels respectively, and then pointwise convolution is adopted to combine the outputs, so that the integral effect is almost the same as that of a standard convolution, but the calculated amount and the model parameter amount are greatly reduced.
The following is an analysis of the amount of computation of the depthwise separable convolution and the standard convolution:
Assume that the input feature map size is $D_F \times D_F \times M$ and the output feature map size is $D_F \times D_F \times N$, where $D_F$ is the width and height of the feature map (assuming the width and height of the input feature map and of the output feature map are both $D_F$), and $M$ and $N$ are the numbers of input and output channels. For a standard convolution with a $D_K \times D_K$ kernel, the amount of computation is
$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$
For the separable convolution, the amount of computation of the depthwise convolution is
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F$$
and the amount of computation of the pointwise convolution is
$$M \cdot N \cdot D_F \cdot D_F$$
so the total amount of computation of the depthwise separable convolution is
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$
The ratio of the amounts of computation of the depthwise separable convolution and the standard convolution is then
$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$
In general the value of $N$ is large, so with $3 \times 3$ convolution kernels the amount of computation of the depthwise separable convolution can be reduced to roughly $1/9$ of that of the standard convolution.
The acceleration principle of other models is described in the prior art, and will not be described herein.
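In PyTorch, the depthwise separable unit described above can be sketched as follows; the `groups=in_ch` argument of `nn.Conv2d` gives the one-kernel-per-input-channel behaviour of the depthwise step.

```python
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """MobileNet-style block: depthwise 3x3 conv, then pointwise 1x1 conv."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # one 3x3 kernel per input channel (groups = number of channels)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # 1x1 convolution that mixes the channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))
```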
602. And inputting first training data into the first image translation model for training, and taking the trained first image translation model as a second image translation model.
After the ResNet architecture in the coding model of the first image translation model is modified into a lightweight model architecture, the first image translation model is trained using the first training data, and the trained first image translation model is taken as the second image translation model.
The first training data may be a human body 3D model and a target frame image of a target frame, or a human body 3D model, a human body key point parameter, a target frame image of a target frame, and a human body 3D model and a human body key point parameter of a plurality of frames adjacent to the target frame, which are not limited herein.
The training process of the first image translation model using the first training data is similar to that described in the embodiments of fig. 1 to fig. 4, and is not repeated here.
For the embodiment shown in fig. 6, after the second image translation model is obtained, the generation model in the second image translation model also uses a coding-decoding model, that is, an encoder-decoder model. The decoding model can be regarded as the inverse process of the coding model: corresponding to the convolution operator used in the coding model, the decoding model uses deconvolution, also called transposed convolution, which is an inverse process of convolution.
In the experimental process, it is found that the deconvolution in the decoding model of the second image translation model may cause the generated image to show a grid effect. To eliminate the grid effect, the following steps may further be performed; referring to fig. 7, another embodiment of the model training method in the embodiment of the present application includes:
701. replacing the deconvolution operator with an upsampling operator;
in particular, upsampling refers to any technique that allows an image to be made higher resolution. The simplest way is resampling and interpolation: the input picture is rescaled to a desired size, pixel points of each point are calculated, and interpolation methods such as bilinear interpolation are used for interpolating other points to complete an up-sampling process.
702. And inputting the first training data into the second image translation model for training, and taking the trained second image translation model as a third image translation model.
And after the deconvolution operator in the second image translation model is replaced by the up-sampling operator, inputting the first training data into the second image translation model for training, and taking the trained second image translation model as a third image translation model.
Specifically, the first training data may be a human 3D model and a target frame image of the target frame, or a human 3D model, a human key point parameter, a target frame image of the target frame, and a human 3D model and a human key point parameter of a plurality of frames adjacent to the target frame, which is not limited herein.
The process of training the second image translation model with the first training data is similar to that described in the embodiments of fig. 1 to fig. 4, and is not repeated here.
After the third image translation model is obtained, in order to further increase the image translation speed of the generation model in the third image translation model, the generation model in the third image translation model may also be input into an acceleration framework, so that the acceleration framework parses the generation model in the third image translation model and accelerates its image translation speed.
Specifically, the acceleration framework includes, but is not limited to, TensorRT, OpenVINO, TVM and NCNN. TensorRT is a high-performance deep learning inference optimizer that provides low-latency, high-throughput deployment inference for deep learning applications. TensorRT can directly optimize a trained model: after the network model is trained, the trained model file can be dropped directly into TensorRT to accelerate its inference speed.
When the acceleration framework is TensorRT, because TensorRT does not support the reflect padding operator in the image translation model, as one optional embodiment the reflect padding operator of the generation model in the third image translation model may be replaced with a standard padding operator; as another optional embodiment, the reflect padding of the generation model in the image translation model can simply be ignored.
The acceleration principle of the acceleration framework on the network model is described in detail in the prior art, and is not described herein again.
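One common route into TensorRT is to export the generation model to ONNX first and let TensorRT parse the ONNX file. The sketch below assumes a PyTorch generation model with a single image input; the input shape and opset are illustrative, and the padding swap shown walks only top-level modules (a real model would need a recursive traversal).

```python
import torch
import torch.nn as nn

model.eval()  # `model` is the trained generation model (assumed)

# optional: swap reflect padding for standard zero padding before export,
# for TensorRT versions that do not support the reflect padding operator
for name, child in model.named_children():
    if isinstance(child, nn.ReflectionPad2d):
        setattr(model, name, nn.ZeroPad2d(child.padding))

dummy = torch.randn(1, 3, 512, 512)  # one conditioning image (shape assumed)
torch.onnx.export(model, dummy, "generator.onnx",
                  input_names=["cond"], output_names=["image"],
                  opset_version=11)
```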
It should be understood that, in various embodiments of the present invention, the sequence numbers of the above steps do not mean the execution sequence, and the execution sequence of each step should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The model training method in the present application is described above; the following describes the model training apparatus in the present application with reference to fig. 8. An embodiment of the model training apparatus in the present application includes:
the first module 801 is configured to extract human body key point parameters of each frame of image from each frame of image of the video data, and input the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to a human body posture of each frame of image, where the human body key point parameters include human body skeleton point parameters, hand key point parameters, and face key point parameters;
the second module 802 is configured to input the human body 3D model of the target frame and the target frame image into an image translation model for training, and to take the trained image translation model as the first image translation model, where the image translation model is a GAN network model, and the target frame is any one or more frames in the video data.
Preferably, the second module 802 is specifically configured to:
and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into the image translation model for training.
Preferably, the second module 802 is specifically configured to:
inputting training data of the target frame and a plurality of frames adjacent to the target frame into the image translation model for training, wherein the training data comprises the human body 3D model and human body key point parameters of the target frame, the target frame image, and the human body 3D models and human body key point parameters of the adjacent frames.
Preferably, the generation model in the first image translation model is a coding-decoding model, the coding model is trained by using a ResNet residual network architecture, and the apparatus further includes:
a modification module 803, configured to modify the ResNet architecture in the coding model into a lightweight model architecture;
the training module 804 is configured to input first training data into the first image translation model for training, and use the trained first image translation model as a second image translation model.
Preferably, the generation model in the second image translation model is a coding-decoding model, and the decoding model in the second image translation model is trained by using a deconvolution operator:
the modifying module 803 is further configured to replace the deconvolution operator with an upsampling operator;
the training module 804 is further configured to input the first training data to the second image translation model for training, and use the trained second image translation model as a third image translation model.
Preferably, the first training data includes:
a human 3D model of the target frame and the target frame image;
or,
the human body 3D model of the target frame, the human body key point parameters and the target frame image;
or,
the human body 3D model of the target frame, the human body key point parameters, the target frame image, and the human body 3D model and the human body key point parameters of a plurality of frames adjacent to the target frame.
Preferably, the apparatus further comprises:
an acceleration module 805, configured to input the generation model in the third image translation model into an acceleration framework, so that the acceleration framework parses the generation model and accelerates its image translation speed.
Preferably, the modification module 803 is further configured to:
when the acceleration framework is TensorRT, replacing the reflect padding operator in the image translation model with a standard padding operator.
Preferably, the image translation model includes at least one of a pix2pix model, a pix2pixHD model, and a vid2vid model.
Preferably, the lightweight model includes at least one of a MobileNet model, a ShuffleNet model, a SqueezeNet model, and an Xception model.
The functions of the above modules are similar to those described in the embodiments of fig. 1 to 7, and are not described herein again.
Different from the prior art, in which only human skeleton points are used as input data of the image translation model, the second module 802 in the embodiment of the present application uses the human body 3D model as the input data of the generation model in the image translation model. Because the human body 3D model has complete 3D curved surfaces (full 3D surfaces) of the body, hands and face, the digital human generated by the model carries more human body posture detail information.
Further, in the embodiment of the present application, a modification module 803 is further used to modify a RestNet architecture in the coding model in the first image translation model into a lightweight model architecture, so as to accelerate the image translation speed of the model.
Further, in the embodiment of the present application, the modification module 803 is further used to replace the deconvolution operator in the decoding model in the second image translation model with the upsampling operator, so that the grid effect in the generated picture is eliminated.
Further, in the embodiment of the present application, the acceleration module 805 inputs the generation model in the third image translation model into the acceleration framework, so as to further increase the image translation speed of the generation model in the third image translation model.
The model training apparatus in the embodiment of the present invention is described above from the perspective of the modular functional entity, and the computer apparatus in the embodiment of the present invention is described below from the perspective of hardware processing:
the computer device is used for realizing the functions of the model training device, and one embodiment of the computer device in the embodiment of the invention comprises the following components:
a processor and a memory;
the memory is used for storing the computer program, and the processor is used for realizing the following steps when executing the computer program stored in the memory:
extracting human body key point parameters of each frame of image from each frame of image of video data respectively, and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
inputting a human body 3D model of a target frame and a target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
In some embodiments of the present invention, the processor may be further configured to:
and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into the image translation model for training.
In some embodiments of the present invention, the processor may be further configured to:
inputting training data of the target frame and a plurality of frames adjacent to the target frame into the image translation model for training, wherein the training data comprises the human body 3D model and human body key point parameters of the target frame, the target frame image, and the human body 3D models and human body key point parameters of the adjacent frames.
In some embodiments of the present invention, the processor may be further configured to:
modifying the ResNet architecture in the coding model to a lightweight model architecture;
and inputting first training data into the first image translation model for training, and taking the trained first image translation model as a second image translation model.
In some embodiments of the present invention, the processor may be further configured to:
replacing the deconvolution operator with an upsampling operator;
and inputting the first training data into the second image translation model for training, and taking the trained second image translation model as a third image translation model.
In some embodiments of the present invention, the processor may be further configured to:
inputting the generation model in the third image translation model into an acceleration framework, so that the acceleration framework parses the generation model and accelerates its image translation speed.
In some embodiments of the present invention, the processor may be further configured to:
when the acceleration framework is TensorRT, replacing the reflect padding operator in the image translation model with a standard padding operator.
It is to be understood that, when the processor in the computer apparatus described above executes the computer program, the functions of each unit in the corresponding apparatus embodiments may also be implemented, and are not described herein again. Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the model training apparatus. For example, the computer program may be divided into units in the above-described model training apparatus, and each unit may realize specific functions as described in the above-described corresponding model training apparatus.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server or other computing equipment. The computer device may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that the processor and memory are merely examples of a computer apparatus and are not meant to be limiting; more or fewer components may be included, certain components may be combined, or different components may be used. For example, the computer apparatus may also include input and output devices, network access devices, buses, etc.
The processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor, or any conventional processor; the processor is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash memory card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid-state storage device.
The present invention also provides a computer-readable storage medium for implementing the functions of the model training apparatus, on which a computer program is stored which, when executed by a processor, carries out the following steps:
extracting human body key point parameters of each frame of image from each frame of image of video data respectively, and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
inputting a human body 3D model of a target frame and a target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
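As an illustration of the first step, the sketch below assumes the public smplx reference implementation and that the skeleton, hand and face keypoint parameters have already been fitted to SMPL-X pose and expression tensors; the model path and tensor sizes are assumptions:

```python
import torch
import smplx

# Assumed: the smplx package and downloaded SMPL-X model files.
body_model = smplx.create("models/", model_type="smplx", use_pca=False)

# Per-frame parameters fitted from the extracted keypoints; the zero
# tensors below only mark the expected (batch, dim) shapes.
output = body_model(
    body_pose=torch.zeros(1, 63),        # from the human body skeleton points
    left_hand_pose=torch.zeros(1, 45),   # from the hand key points
    right_hand_pose=torch.zeros(1, 45),
    expression=torch.zeros(1, 10),       # from the face key points
    return_verts=True,
)
vertices = output.vertices  # the human body 3D model for this frame
```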
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into the image translation model for training.
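One plausible way to combine the two conditioning inputs, purely as an illustration, is to rasterize the human body 3D model and stack it with per-keypoint heatmaps along the channel axis before the result enters the generator; the channel counts below are assumptions:

```python
import torch

render_rgb = torch.rand(3, 512, 512)       # rasterized human body 3D model
keypoint_maps = torch.rand(25, 512, 512)   # one heatmap per key point
condition = torch.cat([render_rgb, keypoint_maps], dim=0)  # (28, 512, 512)
```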
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
inputting training data of the target frame and a plurality of frames adjacent to the target frame into the image translation model for training, wherein the training data comprises the human body 3D model, human body key point parameters and image of the target frame, together with the human body 3D models and human body key point parameters of the adjacent frames.
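A hypothetical helper for assembling one such multi-frame training sample; the container names and the neighbourhood size k are illustrative only:

```python
def build_sample(images, renders, keypoints, t, k=2):
    # Gather the target frame plus up to k neighbours on each side,
    # each paired with its human body 3D model render and key point
    # parameters, as one training sample.
    idx = range(max(0, t - k), min(len(images), t + k + 1))
    return {
        "target_image": images[t],
        "conditions": [(renders[i], keypoints[i]) for i in idx],
    }
```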
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
modifying the ResNet architecture in the coding model to a lightweight model architecture;
and inputting first training data into the first image translation model for training, and taking the trained first image translation model as a second image translation model.
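A sketch of the backbone swap, assuming the generation model exposes its coding model as an `encoder` attribute; MobileNetV2 stands in here for any of the lightweight architectures contemplated, and `weights=None` follows the torchvision >= 0.13 API:

```python
import torch.nn as nn
from torchvision.models import mobilenet_v2

def lighten_encoder(generator: nn.Module) -> nn.Module:
    # Replace the ResNet coding model with a MobileNetV2 feature
    # extractor before retraining on the first training data.
    generator.encoder = mobilenet_v2(weights=None).features
    return generator
```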
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
replacing the deconvolution operator with an upsampling operator;
and inputting the first training data into the second image translation model for training, and taking the trained second image translation model as a third image translation model.
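The operator replacement can be sketched as follows: each transposed convolution in the decoding model gives way to nearest-neighbour upsampling followed by an ordinary convolution with matching channel counts; the kernel size is an assumption:

```python
import torch.nn as nn

def to_upsample_block(deconv: nn.ConvTranspose2d) -> nn.Sequential:
    # Nearest-neighbour upsampling followed by a plain convolution
    # doubles the spatial size just as a stride-2 deconvolution would,
    # while avoiding checkerboard artifacts and easing deployment.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(deconv.in_channels, deconv.out_channels,
                  kernel_size=3, padding=1),
    )
```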
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
inputting the generation model in the third image translation model into an acceleration framework, so that the acceleration framework parses the generation model and accelerates its image translation speed.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
when the acceleration framework is TensorRT, replacing a reflect padding operator in the image translation model with a padding operator.
It will be appreciated that the integrated units, if implemented as software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed by a processor, carries out the steps of the above method embodiments. The computer program comprises computer program code, which may be in source-code form, object-code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only one kind of logical division, and other divisions are possible in actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features equivalently replaced, without such modifications or substitutions departing from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A method of model training, the method comprising:
extracting human body key point parameters of each frame of image from each frame of image of video data respectively, and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, wherein the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
inputting a human body 3D model of a target frame and a target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
2. The method of claim 1, wherein the inputting the human body 3D model of the target frame and the target frame image into an image translation model for training comprises:
and inputting the human body 3D model of the target frame, the human body key point parameters and the target frame image into the image translation model for training.
3. The method of claim 1, wherein the inputting the human body 3D model of the target frame and the target frame image into an image translation model for training comprises:
inputting training data of the target frame and a plurality of frames adjacent to the target frame into the image translation model for training, wherein the training data comprises the human body 3D model, human body key point parameters and image of the target frame, and the human body 3D models and human body key point parameters of the adjacent frames.
4. The method of any of claims 1 to 3, wherein the generation model in the first image translation model is a coding-decoding model, the coding model of the generation model in the first image translation model being trained using a ResNet residual network architecture, the method further comprising:
modifying the ResNet architecture in the coding model to a lightweight model architecture;
inputting first training data into the first image translation model for training, and taking the trained first image translation model as a second image translation model;
the first training data comprises:
a human 3D model of the target frame and the target frame image;
or the like, or, alternatively,
the human body 3D model of the target frame, the human body key point parameters and the target frame image;
or the like, or, alternatively,
the human body 3D model of the target frame, the human body key point parameters, the target frame image, and the human body 3D model and the human body key point parameters of a plurality of frames adjacent to the target frame.
5. The method of claim 4, wherein the generation model in the second image translation model is a coding-decoding model, and wherein the decoding model in the second image translation model is trained using a deconvolution operator, the method further comprising:
replacing the deconvolution operator with an upsampling operator;
and inputting the first training data into the second image translation model for training, and taking the trained second image translation model as a third image translation model.
6. The method of claim 5, further comprising:
inputting the generation model in the third image translation model into an acceleration framework, so that the acceleration framework analyzes the generation model in the third image translation model and accelerates the image translation speed of the generation model in the third image translation model.
7. The method of claim 6, further comprising:
when the acceleration framework is TensorRT, replacing a reflect padding operator in the image translation model with a padding operator.
8. The method of any of claims 1 to 3, wherein the image translation model comprises at least one of a pix2pix model, a pix2pixHD model, and a vid2vid model.
9. The method of claim 4, wherein the lightweight model comprises at least one of a MobileNet model, a ShuffleNet model, a SqueezeNet model, and an Xception model.
10. A model training apparatus, the apparatus comprising:
the system comprises a first module, a second module and a third module, wherein the first module is used for respectively extracting human body key point parameters of each frame of image from each frame of image of video data and inputting the human body key point parameters of each frame of image into an SMPL-X model to generate a human body 3D model corresponding to the human body posture of each frame of image, and the human body key point parameters comprise human body skeleton point parameters, hand key point parameters and human face key point parameters;
and the second module is used for inputting the human body 3D model of the target frame and the target frame image into an image translation model for training, and taking the trained image translation model as a first image translation model, wherein the image translation model is a GAN network model, and the target frame is any one frame or any multiple frames in the video data.
11. A computer device comprising a processor, characterized in that the processor, when executing a computer program stored in a memory, carries out the model training method of any one of claims 1 to 9.
12. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the model training method according to any one of claims 1 to 9.
CN202110629148.1A 2021-06-07 2021-06-07 Model training method and model training device Active CN113077383B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110629148.1A CN113077383B (en) 2021-06-07 2021-06-07 Model training method and model training device

Publications (2)

Publication Number Publication Date
CN113077383A CN113077383A (en) 2021-07-06
CN113077383B (en) 2021-11-02

Family

ID=76617062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110629148.1A Active CN113077383B (en) 2021-06-07 2021-06-07 Model training method and model training device

Country Status (1)

Country Link
CN (1) CN113077383B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154092B (en) * 2021-11-18 2023-04-18 网易有道信息技术(江苏)有限公司 Method for translating web pages and related product

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3579196A1 (en) * 2018-06-05 2019-12-11 Cristian Sminchisescu Human clothing transfer method, system and device
CN111680562A (en) * 2020-05-09 2020-09-18 北京中广上洋科技股份有限公司 Human body posture identification method and device based on skeleton key points, storage medium and terminal
CN112801064A (en) * 2021-04-12 2021-05-14 北京的卢深视科技有限公司 Model training method, electronic device and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163048A (en) * 2018-07-10 2019-08-23 腾讯科技(深圳)有限公司 Recognition model training method, recognition method and device for hand key points
CN109558832A (en) * 2018-11-27 2019-04-02 广州市百果园信息技术有限公司 Human body posture detection method, device, equipment and storage medium
CN111753824A (en) * 2019-03-27 2020-10-09 辉达公司 Image segmentation using neural network translation models
US10990848B1 (en) * 2019-12-27 2021-04-27 Sap Se Self-paced adversarial training for multimodal and 3D model few-shot learning
CN112233054A (en) * 2020-10-12 2021-01-15 北京航空航天大学 Human-object interaction image generation method based on relation triple
CN112418399A (en) * 2020-11-20 2021-02-26 清华大学 Method and device for training attitude estimation model and method and device for attitude estimation
CN112818898A (en) * 2021-02-20 2021-05-18 北京字跳网络技术有限公司 Model training method and device and electronic equipment
CN112884640A (en) * 2021-03-01 2021-06-01 深圳追一科技有限公司 Model training method, related device and readable storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
bmlSUP – A SMPL Unity Player; Adam O. Bebko et al.; IEEE; 2021-05-06; 573-574 *
Lightweight design of human pose estimation networks; Gao Bingkun et al.; Research and Exploration in Laboratory; 2020-01-25 (No. 01); 85-88 *
Research on pose transfer methods based on adversarial networks; Zhang Kunyao; China Masters' Theses Full-text Database, Information Science and Technology; 2021-02-15; Vol. 2021, No. 02; I138-977 *
3D human pose estimation and image synthesis based on deep learning; Chen Fei; China Masters' Theses Full-text Database, Information Science and Technology; 2021-04-15; Vol. 2021, No. 04; I138-546 *
Video-based 3D human pose estimation; Yang Bin et al.; Journal of Beijing University of Aeronautics and Astronautics; 2019 (No. 12); 122-128 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant