CN114758108A - Virtual object driving method, device, storage medium and computer equipment


Info

Publication number
CN114758108A
Authority
CN
China
Prior art keywords
target
dimensional
difference
parameter
information
Legal status
Pending
Application number
CN202210314199.XA
Other languages
Chinese (zh)
Inventor
彭国柱
张�雄
Current Assignee
Guangzhou Cubesili Information Technology Co Ltd
Original Assignee
Guangzhou Cubesili Information Technology Co Ltd
Application filed by Guangzhou Cubesili Information Technology Co Ltd
Priority application: CN202210314199.XA
Publication: CN114758108A

Classifications

    • G06T 19/20 - Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G06F 16/29 - Geographical information databases
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 - Combinations of networks
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/20081 - Training; Learning
    • G06T 2207/20084 - Artificial neural networks [ANN]

Abstract

The embodiment of the application discloses a virtual object driving method, a virtual object driving device, a storage medium and computer equipment. The method obtains a target monocular image and extracts the target image feature information corresponding to it; inputs the target image feature information into a first target preset decoder and a second target preset decoder respectively, outputting corresponding target two-dimensional heat map information and target three-dimensional skeleton direction field information; inputs the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into a target parameter regressor, outputting corresponding target joint rotation parameters; and drives the virtual object based on the target joint rotation parameters. The corresponding target joint rotation parameters can thus be output rapidly, so that the virtual object is driven to perform the same actions as the person in the target monocular image, greatly improving virtual object driving efficiency.

Description

Virtual object driving method, device, storage medium and computer equipment
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a virtual object driving method and apparatus, a storage medium, and a computer device.
Background
With the prosperity of the interactive entertainment industry and the development of computer image and vision technologies, virtual live broadcast has become increasingly popular in the live broadcast field. In the overall technical pipeline of virtual live broadcast, capturing the motion of each limb of the human body is an important link.
In the related art, generating a virtual object requires the reconstructed result to be consistent with the real person and to contain considerable detail, so complete motion capture of the target object is required.
Disclosure of Invention
The embodiment of the application provides a virtual object driving method and device, a storage medium and computer equipment, which can improve the efficiency of virtual object driving.
In order to solve the above technical problem, the embodiments of the present application provide the following technical solutions:
a virtual object driven method, comprising:
acquiring a target monocular image, and extracting target image characteristic information corresponding to the target monocular image;
respectively inputting the target image characteristic information into a first target preset decoder and a second target preset decoder, and outputting corresponding target two-dimensional heat map information and target three-dimensional skeleton direction field information;
The first target preset decoder is obtained by training with image characteristic information extracted from monocular image samples, predicted two-dimensional heat map information and label two-dimensional heat map information, and the second target preset decoder is obtained by training with three-dimensional skeleton direction field information predicted from the image characteristic information and label three-dimensional skeleton direction field information;
inputting the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into a target parameter regressor, and outputting corresponding target joint rotation parameters;
the target parameter regressor is obtained by training with joint rotation parameters predicted from the target two-dimensional heat map information and the target three-dimensional skeleton direction field information and label joint rotation parameters;
driving a virtual object based on the target joint rotation parameter.
A virtual object driving apparatus, comprising:
the extraction unit is used for acquiring a target monocular image and extracting target image characteristic information corresponding to the target monocular image;
the first output unit is used for respectively inputting the target image characteristic information into a first target preset decoder and a second target preset decoder and outputting corresponding target two-dimensional heat map information and target three-dimensional skeleton direction field information;
The first target preset decoder is obtained by training with image characteristic information extracted from a monocular image sample, predicted two-dimensional heat map information and label two-dimensional heat map information, and the second target preset decoder is obtained by training with three-dimensional skeleton direction field information predicted from the image characteristic information and label three-dimensional skeleton direction field information;
the second output unit is used for inputting the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into a target parameter regressor and outputting corresponding target joint rotation parameters;
the target parameter regressor is obtained by training with joint rotation parameters predicted from the target two-dimensional heat map information and the target three-dimensional skeleton direction field information and label joint rotation parameters;
a driving unit for driving a virtual object based on the target joint rotation parameter.
In some embodiments, the apparatus further comprises:
the acquisition unit is used for acquiring a monocular image sample and extracting image characteristic information corresponding to the monocular image sample;
the third output unit is used for inputting the image characteristic information to a first preset decoder and outputting corresponding two-dimensional heat map information;
The first calculation unit is used for calculating a first difference corresponding to the two-dimensional heat map information and the label two-dimensional heat map information;
the fourth output unit is used for inputting the image characteristic information to a second preset decoder and outputting corresponding three-dimensional skeleton direction field information;
the second calculation unit is used for calculating a second difference corresponding to the three-dimensional bone direction field information and the label three-dimensional bone direction field information;
and the first iteration adjusting unit is used for performing iteration adjustment on the network parameters of the first preset decoder and the second preset decoder according to the first difference and the second difference, returning to execute the input of the image feature information to the first preset decoder, and outputting corresponding two-dimensional heat map information until the first difference and the second difference are converged to obtain a first target preset decoder and a second target preset decoder.
In some embodiments, the first iterative adjustment unit is to:
weighting the first difference through a first weight to obtain a first target loss;
weighting the second difference through a second weight to obtain a second target loss;
constructing a third target loss based on the first target loss and the second target loss;
And iteratively adjusting the network parameters of the first preset decoder and the second preset decoder according to the third target loss, returning to execute the step of inputting the image characteristic information into the first preset decoder, and outputting corresponding two-dimensional heat map information until the third target loss is converged to obtain the first target preset decoder and the second target preset decoder.
In some embodiments, the apparatus further comprises:
the fifth output unit is used for inputting the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into a parameter regressor and outputting corresponding human body shape parameters and joint rotation parameters;
the first determining unit is used for determining three-dimensional joint points according to the human body shape parameters and the joint rotation parameters;
the projection unit is used for projecting the three-dimensional joint points to a two-dimensional image plane according to a preset function to obtain two-dimensional key points;
the third calculation unit is used for calculating a third difference corresponding to the three-dimensional joint point and the label three-dimensional joint point and a fourth difference corresponding to the two-dimensional key point and the label two-dimensional key point;
and the second iterative adjustment unit is used for iteratively adjusting the network parameters of the parameter regressor based on the third difference and the fourth difference, returning to input the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into the parameter regressor, and outputting corresponding human body shape parameters and joint rotation parameters until the third difference and the fourth difference are converged to obtain the target parameter regressor.
In some embodiments, the first determining unit is configured to:
acquiring a preset parameterized human body model;
and deforming a preset parameterized human body model according to the human body shape parameters and the joint rotation parameters to determine three-dimensional joint points.
In some embodiments, the second iterative adjustment unit is to:
calculating a fifth difference corresponding to the human body shape parameter and the label shape parameter and a sixth difference corresponding to the joint rotation parameter and the label rotation parameter;
combining the fifth difference and the sixth difference to form a seventh difference;
generating an a priori constraint loss of the joint rotation parameter;
weighting the third difference by a third weight to obtain a fourth target loss;
weighting the fourth difference by a fourth weight to obtain a fifth target loss;
weighting the seventh difference by a fifth weight to obtain a sixth target loss;
weighting the a priori constraint loss by a sixth weight to obtain a seventh target loss;
constructing an eighth target loss based on the fourth target loss, the fifth target loss, the sixth target loss, and the seventh target loss;
and iteratively adjusting the network parameters of the parameter regressor according to the eighth target loss, returning to input the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into the parameter regressor, and outputting corresponding human body shape parameters and joint rotation parameters until the eighth target loss is converged to obtain the target parameter regressor.
In some embodiments, the extraction unit is configured to:
acquiring a target monocular image;
and inputting the target monocular image into a preset image encoder, and extracting corresponding target image characteristic information.
In some embodiments, the drive unit is configured to:
determining a root joint point of the virtual object;
starting from the root joint point, driving the virtual object according to the target joint rotation parameters and a joint point driving sequence.
A computer storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the above-described virtual object driving method.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the virtual object driving method provided above when executing the computer program.
A computer program product or computer program comprising computer instructions stored in a storage medium. The computer instructions are read from the storage medium by a processor of the computer device, and the computer instructions are executed by the processor, so that the computer device executes the steps in the virtual object driving method provided above.
According to the method, the target monocular image is obtained, and the target image characteristic information corresponding to the target monocular image is extracted; respectively inputting the target image characteristic information into a first target preset decoder and a second target preset decoder, and outputting corresponding target two-dimensional heat map information and target three-dimensional skeleton direction field information; inputting the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into a target parameter regressor, and outputting corresponding target joint rotation parameters; the virtual object is driven based on the target joint rotation parameter. Therefore, the corresponding target joint rotation parameters can be rapidly output to drive the virtual object to do corresponding actions according to the actions of the characters in the target monocular image, and the virtual object driving efficiency is greatly improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of a virtual object driving system provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a virtual object driving method provided in an embodiment of the present application;
fig. 3 is another schematic flowchart of a virtual object driving method according to an embodiment of the present application;
fig. 4a is a two-dimensional human body schematic diagram of a virtual object driving method according to an embodiment of the present application;
fig. 4b is a three-dimensional human body diagram of a virtual object driving method according to an embodiment of the present application;
fig. 4c is a scene schematic diagram of a virtual object driving method provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a virtual object driving apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a virtual object driving method and device, a storage medium and computer equipment.
Referring to fig. 1, fig. 1 is a schematic view of a scene of a virtual object driving system according to an embodiment of the present application, including: the audience client A and the server B can be connected through a communication network, and the communication network comprises a wireless network and a wired network, wherein the wireless network comprises one or more of a wireless wide area network, a wireless local area network, a wireless metropolitan area network and a wireless personal area network. The network includes network entities such as routers, gateways, etc., which are not shown in the figure. The viewer client a may interact with the server B via a communication network.
The terminal may be a tablet computer, a mobile phone, a notebook computer, a desktop computer, or another terminal that has a storage unit and a microprocessor and possesses computing capability. The terminal may be equipped with the client A, such as a live broadcast client or a game client, and the client A may acquire a target monocular image of the current user through a camera and send it to the server B.
The virtual object driving system can comprise a server B, wherein the server B can store the corresponding relation between the anchor client and each live broadcast room, and after the audience client A selects the live broadcast room, the server B sends the live broadcast video stream corresponding to the anchor client to all the audience clients A belonging to the same live broadcast room according to the corresponding relation between each live broadcast room and the anchor client. The server B may be used to store the target monocular image sent by the viewer client a.
The virtual object driving system may include a virtual object driving apparatus, and the virtual object driving apparatus may be specifically integrated into a server B having a storage unit and a microprocessor, and having an arithmetic capability, in fig. 1, the server B may be configured to obtain a target monocular image and extract target image feature information corresponding to the target monocular image; inputting the target image characteristic information into a first target preset decoder and a second target preset decoder respectively, and outputting corresponding target two-dimensional heat map information and target three-dimensional skeleton direction field information; the first target preset decoder is obtained by training image characteristic information extracted through a monocular image sample, predicted two-dimensional heat map information and label two-dimensional heat map information, and the second target preset decoder is obtained by training three-dimensional skeleton direction field information predicted through the image characteristic information and label three-dimensional skeleton direction field information; inputting the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into a target parameter regressor, and outputting corresponding target joint rotation parameters; the target parameter regressor is obtained by training predicted joint rotation parameters and label key rotation parameters through target two-dimensional heat map information and target three-dimensional skeleton direction field information; the virtual object is driven based on the target joint rotation parameter.
It should be noted that the scene schematic diagram of the virtual object driving system shown in fig. 1 is merely an example, and the virtual object driving system and the scene described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not form a limitation on the technical solution provided in the embodiment of the present application, and as a person having ordinary skill in the art knows, with the evolution of the virtual object driving system and the occurrence of a new service scene, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.
The following are detailed descriptions.
In the present embodiment, description will be made from the perspective of a virtual object driving apparatus, which may be specifically integrated in a viewer client.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a virtual object driving method according to an embodiment of the present disclosure. The virtual object driving method includes:
in step 101, a target monocular image is acquired, and target image feature information corresponding to the target monocular image is extracted.
The audience client can enter the live broadcast room corresponding to any anchor client and watch the live content of the anchor in that room, such as talent shows, game live broadcasts, entertainment competitions, and the like; it can also follow the anchor, give gifts to the anchor, or send interactive messages to interact with the anchor.
Correspondingly, the anchor client can perform live broadcast using a virtual live broadcast technology: the anchor performs actions, and a virtual character is controlled to perform the same actions, realizing an engaging live broadcast. Based on this, a target monocular image collected by the anchor client in real time can be acquired. A monocular image is a single-frame image, and its format may be the Graphics Interchange Format (GIF) or Bitmap (BMP), among others; the format is not specifically limited here.
Further, the server may extract target image feature information corresponding to the target monocular image. The extraction may be performed by a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN); the target image feature information better expresses the image semantics of the target monocular image to be used subsequently.
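As an illustrative sketch only (the application does not fix a concrete backbone), a CNN encoder of this kind could be built from a torchvision ResNet-50 with its classification head removed; the input size and output shape below are assumptions:

```python
import torch
import torchvision.models as models

# Hypothetical image encoder: a pretrained ResNet-50 with the average-pool
# and fully connected layers dropped, mapping an image to a feature map.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 256, 256)   # one RGB monocular frame (batch of 1)
features = encoder(image)             # (1, 2048, 8, 8) spatial feature map
```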
In step 102, the feature information of the target image is respectively input to a first target preset decoder and a second target preset decoder, and corresponding target two-dimensional heat map information and target three-dimensional bone orientation field information are output.
The first target preset decoder and the second target preset decoder may be recurrent neural network models; it should be noted that both are trained network models.
The first target preset decoder is obtained by training with image characteristic information extracted from monocular image samples, predicted two-dimensional heat map information and label two-dimensional heat map information; the second target preset decoder is obtained by training with three-dimensional skeleton direction field information predicted from the image characteristic information and label three-dimensional skeleton direction field information. A monocular image sample is an image sample used for training, and it may carry label two-dimensional heat map information and label three-dimensional skeleton direction field information. Two-dimensional heat map information is the two-dimensional position information of key points in the image coordinate system corresponding to the monocular image sample, and three-dimensional skeleton direction field information is the bone direction information in the space coordinate system corresponding to the monocular image sample. Label two-dimensional heat map information is standard two-dimensional heat map information, and label three-dimensional skeleton direction field information is standard three-dimensional skeleton direction field information; both can be preset manually and used for calibration, so that the preset decoders undergo back-propagation training and network parameter adjustment.
In one embodiment, the mathematical expression of the two-dimensional heat map information may be:

heatmap_μ = exp(-(x - μ)² / (2σ²))

where heatmap_μ is the heat value at pixel x for a key point located at μ, exp() is the exponential function with the natural constant e as its base, and σ may be a constant.
In one embodiment, the mathematical expression of the three-dimensional bone orientation field information may be:
POF_xy = (x - y) / ‖x - y‖

where x and y are the 3D coordinates of the two end points of a bone; the POF value of every pixel inside the bone region between x and y is POF_xy.
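A minimal sketch of computing POF_xy for one bone; the end-point coordinates are illustrative, and the zero-filling convention for pixels outside bone regions is an assumption:

```python
import numpy as np

def bone_orientation(x, y):
    """Unit 3D direction POF_xy = (x - y) / ||x - y|| of one bone."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return d / np.linalg.norm(d)

# Every pixel inside the image region of the bone is assigned this vector;
# pixels outside any bone region are typically set to zero.
pof_xy = bone_orientation([0.1, 0.5, 0.2], [0.1, 0.1, 0.2])  # -> [0., 1., 0.]
```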
In some embodiments, the process of this training may be as follows:
(1) acquiring a monocular image sample, and extracting image characteristic information corresponding to the monocular image sample;
(2) inputting the image characteristic information into a first preset decoder, and outputting corresponding two-dimensional heat map information;
(3) calculating a first difference corresponding to the two-dimensional heat map information and the label two-dimensional heat map information;
(4) inputting the image characteristic information into a second preset decoder, and outputting corresponding three-dimensional skeleton direction field information;
(5) calculating a second difference corresponding to the three-dimensional bone direction field information and the label three-dimensional bone direction field information;
(6) and iteratively adjusting the network parameters of the first preset decoder and the second preset decoder according to the first difference and the second difference, returning to execute the step of inputting the image characteristic information into the first preset decoder, and outputting corresponding two-dimensional heat map information until the first difference and the second difference are converged to obtain a first target preset decoder and a second target preset decoder.
A monocular image sample is a monocular image used for training; the sample set may contain a large number of monocular images, for example 100. Each monocular image may carry preset label two-dimensional heat map information and label three-dimensional skeleton direction field information, and the image feature information of each monocular image sample is extracted.
It should be noted that the first preset decoder and the second preset decoder may be recurrent neural network models, and both are untrained network models. The first preset decoder is used for predicting two-dimensional heat map information of the monocular image, and the second preset decoder is used for predicting three-dimensional bone direction field information of the monocular image.
Because the network parameters of the first preset decoder and the second preset decoder are the configured initial network parameters, after the image feature information is input into them, both the predicted two-dimensional heat map information and the predicted three-dimensional bone direction field information are inaccurate. A first difference between the two-dimensional heat map information and the label two-dimensional heat map information can therefore be calculated: the larger the first difference, the less accurate the prediction of the two-dimensional heat map information; the smaller the first difference, the more accurate the prediction. A second difference between the three-dimensional bone direction field information and the label three-dimensional bone direction field information can likewise be calculated.
Further, the network parameters of the first preset decoder and the second preset decoder may be iteratively adjusted by gradient descent according to the first difference and the second difference, making the network parameters increasingly accurate. To continue the training, after the network parameters are adjusted, the image feature information is again input to the adjusted first preset decoder and second preset decoder, new two-dimensional heat map information and three-dimensional bone direction field information are output, and new first and second differences are calculated. As the number of training iterations increases, the first difference and the second difference become smaller and smaller until they converge. At that point the network parameter adjustment has been learned: the first preset decoder has learned how to predict accurate two-dimensional heat map information from the image feature information, and the second preset decoder has learned how to predict accurate three-dimensional bone direction field information. Training is then complete, yielding the first target preset decoder and the second target preset decoder.
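The training procedure above can be condensed into the following sketch; the encoder and decoder stand-ins, tensor shapes, learning rate, and loss weights (0.4 and 0.6 echo the example weights given in step 204 below) are placeholders rather than the application's concrete networks:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the encoder and the two preset decoders.
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU())
dec2d = nn.Conv2d(64, 18, 1)    # -> 18 key-point heat maps
dec3d = nn.Conv2d(64, 72, 1)    # -> 24 bones x 3 direction components

opt = torch.optim.Adam(list(encoder.parameters()) + list(dec2d.parameters())
                       + list(dec3d.parameters()), lr=1e-4)

for step in range(100):                        # iterate until convergence
    image = torch.randn(4, 3, 128, 128)        # stand-in monocular samples
    feats = encoder(image)                     # image feature information
    hm_pred, pof_pred = dec2d(feats), dec3d(feats)
    hm_label = torch.zeros_like(hm_pred)       # label 2D heat maps (stub)
    pof_label = torch.zeros_like(pof_pred)     # label 3D bone fields (stub)
    loss_hm = ((hm_pred - hm_label) ** 2).mean()      # first difference
    loss_pof = ((pof_pred - pof_label) ** 2).mean()   # second difference
    loss = 0.4 * loss_hm + 0.6 * loss_pof      # weighted third target loss
    opt.zero_grad()
    loss.backward()                            # back-propagation
    opt.step()                                 # gradient-descent adjustment
```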
Based on the first target preset decoder and the second target preset decoder obtained by the above training, the target image feature information can be input into them respectively, and correspondingly accurate target two-dimensional heat map information and target three-dimensional skeleton direction field information are output.
In step 103, the target two-dimensional heat map information and the target three-dimensional bone orientation field information are input into a target parameter regressor, and corresponding target joint rotation parameters are output.
In the related art, the virtual human body model may have 18 key points and 24 joint points; specifically, the 24 joint points control the virtual human body, and the target joint rotation parameter may be a 72-dimensional joint rotation parameter (24 joints in total, each with a three-dimensional rotation), which is used to control the deformation of the entire virtual human body.
The target parameter regressor may be a convolutional neural network model or a recurrent neural network model and is used for predicting joint rotation parameters from target two-dimensional heat map information and target three-dimensional skeleton direction field information. It is obtained by training with the joint rotation parameters predicted from the target two-dimensional heat map information and target three-dimensional skeleton direction field information together with label joint rotation parameters. The label joint rotation parameters are standard joint rotation parameters; they can be preset manually and used for calibration, so that the parameter regressor undergoes back-propagation training and network parameter adjustment.
In some embodiments, the process of this training may be as follows:
(1) inputting the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into a parameter regressor, and outputting corresponding human body shape parameters and joint rotation parameters;
(2) determining three-dimensional joint points according to the human body shape parameters and the joint rotation parameters;
(3) projecting the three-dimensional joint points to a two-dimensional image plane according to a preset function to obtain two-dimensional key points;
(4) calculating a third difference corresponding to the three-dimensional joint point and the label three-dimensional joint point and a fourth difference corresponding to the two-dimensional key point and the label two-dimensional key point;
(5) and iteratively adjusting the network parameters of the parameter regressor based on the third difference and the fourth difference, returning to input the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into the parameter regressor, and outputting corresponding human body shape parameters and joint rotation parameters until the third difference and the fourth difference are converged to obtain the target parameter regressor.
The target two-dimensional heat map information and the target three-dimensional skeleton direction field information can be input into an untrained parameter regressor, which outputs corresponding human body shape parameters and joint rotation parameters. The human body shape parameters control the deformation of the whole human body model, governing body dimensions such as height and weight, while the joint rotation parameters control body actions such as raising a hand or lifting a foot.
It should be noted that the parameter regressor may be a recurrent neural network model, and the parameter regressor is an untrained network model, and the parameter regressor is used for predicting the human body shape parameters and the joint rotation parameters corresponding to the target two-dimensional heat map information and the target three-dimensional bone direction field information.
Since the network parameters of the parameter regressor are the configured initial network parameters, after the target two-dimensional heat map information and the target three-dimensional skeleton direction field information are input to the parameter regressor, the predicted human body shape parameters and joint rotation parameters are inaccurate. To verify whether they are correct, three-dimensional joint points are first determined according to the human body shape parameters and the joint rotation parameters: specifically, a standard human body model in a standard posture can be deformed in body shape and motion according to the human body shape parameters and the joint rotation parameters, yielding a changed human body model, on which the 24 three-dimensional joint points are then located.
Further, the preset function may be a weak perspective projection function, and is configured to convert a three-dimensional image into a two-dimensional image, so that the three-dimensional joint point may be projected onto a two-dimensional image plane through the preset function, and a two-dimensional key point on the two-dimensional image plane is obtained.
The label three-dimensional joint points are standard three-dimensional joint points, and the label two-dimensional key points are standard two-dimensional key points. A third difference between the three-dimensional joint points and the label three-dimensional joint points can therefore be calculated: the larger the third difference, the less accurate the three-dimensional joint point prediction; the smaller, the more accurate. A fourth difference between the two-dimensional key points and the label two-dimensional key points can likewise be calculated: the larger the fourth difference, the less accurate the two-dimensional key point prediction; the smaller, the more accurate.
Further, the network parameters of the parameter regressor can be iteratively adjusted by gradient descent according to the third difference and the fourth difference, making the network parameters increasingly accurate. To continue the training, after the network parameters are adjusted, the target two-dimensional heat map information and the target three-dimensional skeleton direction field information are again input to the adjusted parameter regressor, new human body shape parameters and joint rotation parameters are output, and the above steps are repeated so that new third and fourth differences are calculated for further adjustment. As the number of training iterations increases, the third difference and the fourth difference become smaller until they converge. At that point the parameter regressor has learned how to predict accurate human body shape parameters and joint rotation parameters from the target two-dimensional heat map information and target three-dimensional skeleton direction field information, and training is complete, yielding the target parameter regressor.
The target parameter regressor obtained based on the training process can input target two-dimensional heat map information and target three-dimensional skeleton direction field information into the target parameter regressor and output corresponding accurate target joint rotation parameters and target human body shape parameters.
In step 104, the virtual object is driven based on the target joint rotation parameters.
In the related art, existing virtual object driving methods (i.e., motion capture methods) either require expensive professional equipment, which is unacceptable for ordinary users, or use ordinary monocular image input but obtain the motion information through an iterative optimization process. Such methods cannot exploit the joint rotation parameter annotations of existing three-dimensional human body data sets and are unfavorable for engineering deployment on new hardware (such as a Graphics Processing Unit (GPU)). How to train a human motion capture model that is friendly to an inference engine and exploits the annotation information as much as possible is therefore an urgent problem. The embodiment of the application provides an end-to-end model training method that directly obtains the rotation parameters of each joint from an input monocular image; the target parameter regressor obtained by this method can output the target joint rotation parameters directly.
Based on this, because the target joint rotation parameters record the motion information of each of the 24 joint points, a forward kinematics (FK) algorithm can be used. Forward kinematics means that, for each joint point of the virtual character, its motion drives the next joint point in sequence; that is, the application can control the joint points sequentially in the joint driving order based on forward kinematics, thereby driving the virtual object to perform the corresponding action.
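A schematic sketch of such root-first, sequential driving follows; the 24-entry parent table (an SMPL-style layout), the rest-pose offsets, and the axis-angle convention are assumptions, not the application's mandated rig:

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Hypothetical parent table: PARENT[i] is the parent of joint i, and
# joint 0 is the root. A real rig supplies its own kinematic tree.
PARENT = [-1, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9,
          12, 13, 14, 16, 17, 18, 19, 20, 21]

def drive_skeleton(theta, offsets):
    """theta: (24, 3) axis-angle rotations per joint; offsets: (24, 3)
    rest-pose offset of each joint from its parent. Joints are driven in
    order, each after its parent, starting from the root joint point."""
    glob_rot = [np.eye(3)] * 24
    glob_pos = np.zeros((24, 3))
    for i in range(24):                                  # joint driving order
        local = Rotation.from_rotvec(theta[i]).as_matrix()
        if PARENT[i] < 0:                                # root joint
            glob_rot[i], glob_pos[i] = local, offsets[i]
        else:
            p = PARENT[i]
            glob_rot[i] = glob_rot[p] @ local            # accumulate rotation
            glob_pos[i] = glob_pos[p] + glob_rot[p] @ offsets[i]
    return glob_pos                                      # world joint positions
```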
As can be seen from the above, in the embodiment of the application, the target monocular image is obtained, and the target image feature information corresponding to the target monocular image is extracted; respectively inputting the target image characteristic information into a first target preset decoder and a second target preset decoder, and outputting corresponding target two-dimensional heat map information and target three-dimensional skeleton direction field information; inputting the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into a target parameter regressor, and outputting corresponding target joint rotation parameters; the virtual object is driven based on the target joint rotation parameters. Therefore, the corresponding target joint rotation parameters can be rapidly output to drive the virtual object to do corresponding actions according to the actions of the characters in the target monocular image, and the virtual object driving efficiency is greatly improved.
In this embodiment, the description is given from the perspective of a virtual object driving apparatus, which may be integrated in a terminal that has a storage unit, is equipped with a microprocessor, and possesses computing capability, such as a tablet computer or a mobile phone. The terminal may run a live broadcast client, which in this embodiment may be a viewer client.
Referring to fig. 3, fig. 3 is another schematic flow chart of a virtual object driving method according to an embodiment of the present disclosure. The method flow can comprise the following steps:
in step 201, the server obtains a target monocular image, inputs the target monocular image to a preset image encoder, and extracts corresponding target image feature information.
Fig. 4c is a schematic view of a scene of a virtual object driving method provided in an embodiment of the present application, where the image 40 may be a live broadcast frame of an anchor, the anchor may make a specific action, the anchor image is collected by a camera to generate a target monocular image 41, and the target monocular image 41 includes an anchor user.
Further, the target monocular image 41 is input into a preset image encoder, which extracts the corresponding target image feature information; this information better represents the image semantics of the anchor user's action in the target monocular image for subsequent use.
In step 202, the server obtains a monocular image sample, extracts image feature information corresponding to the monocular image sample, inputs the image feature information to a first preset decoder, outputs corresponding two-dimensional heat map information, and calculates a first difference between the two-dimensional heat map information and the label two-dimensional heat map information.
The server may obtain a large number of monocular image samples in advance. The monocular image samples are monocular images used for training and may comprise a large number of images, for example 100. Each monocular image may carry preset label two-dimensional heat map information and label three-dimensional bone direction field information, and the image feature information of each monocular image in the samples is extracted.
Further, the first predetermined decoder may be a recurrent neural network model, and the first predetermined decoder is an untrained network model, and the first predetermined decoder is configured to predict two-dimensional heat map information of the monocular image.
Because the network parameters of the first preset decoder are the configured initial network parameters, after the image feature information is input into the first preset decoder, the predicted two-dimensional heat map information is inaccurate. A first difference between the two-dimensional heat map information and the label two-dimensional heat map information can therefore be calculated: the larger the first difference, the less accurate the prediction of the two-dimensional heat map information; the smaller, the more accurate. Specifically, the first difference may be calculated by the following formula:
L_heatmap = Σ_{i=1}^{I} Σ_q w_i · (h_i(q) - h_i*(q))²

where L_heatmap is the first difference, I is the number of key points, q is the pixel index, w_i is the weight of key point i, h_i(q) is the heat value of the q-th pixel in the heat map predicted for the i-th key point (i.e., the two-dimensional heat map information), and h_i*(q) is the corresponding true value (i.e., the label two-dimensional heat map information).
In step 203, the server inputs the image feature information to a second preset decoder, outputs corresponding three-dimensional bone direction field information, and calculates a second difference between the three-dimensional bone direction field information and the tag three-dimensional bone direction field information.
The second preset decoder may be a recurrent neural network model; it is an untrained network model used for predicting the three-dimensional bone direction field information of the monocular image.
Since the network parameters of the second preset decoder are the configured initial network parameters, after the image feature information is input into the second preset decoder, the predicted three-dimensional bone direction field information is inaccurate. A second difference between the three-dimensional bone direction field information and the label three-dimensional bone direction field information can therefore be calculated. Specifically, the second difference may be calculated by the following formula:
L_POF = Σ_{u=1}^{U} Σ_q w_u · ‖P_u(q) - P_u*(q)‖²

where L_POF is the second difference, U is the number of bones, w_u is the weight of bone u, P_u(q) is the POF value of the q-th pixel predicted for the u-th bone (i.e., the three-dimensional bone direction field information), and P_u*(q) is the corresponding true value (i.e., the label three-dimensional bone direction field information).
In step 204, the server weights the first difference by a first weight to obtain a first target loss, weights the second difference by a second weight to obtain a second target loss, and constructs a third target loss based on the first target loss and the second target loss.
Wherein the third target loss may be calculated by the following formula:
L = w_h · L_heatmap + w_p · L_POF

where w_h is the first weight, for example 0.4, and w_p is the second weight, for example 0.6. That is, according to the above formula, the first difference is multiplied by the first weight to obtain the first target loss, the second difference is multiplied by the second weight to obtain the second target loss, and the two target losses are added to construct the third target loss.
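A direct transcription of the three loss formulas above into PyTorch; the tensor layouts and the per-key-point and per-bone weight vectors are assumptions:

```python
import torch

def heatmap_loss(h_pred, h_true, w):
    # L_heatmap: sum over key points i and pixels q of w_i * (h_i(q) - h_i*(q))^2
    # h_pred, h_true: (I, H, W); w: per-key-point weights of shape (I,)
    per_kp = ((h_pred - h_true) ** 2).sum(dim=(1, 2))
    return (w * per_kp).sum()

def pof_loss(p_pred, p_true, w):
    # L_POF: sum over bones u and pixels q of w_u * ||P_u(q) - P_u*(q)||^2
    # p_pred, p_true: (U, 3, H, W); w: per-bone weights of shape (U,)
    per_bone = ((p_pred - p_true) ** 2).sum(dim=(1, 2, 3))
    return (w * per_bone).sum()

def third_target_loss(h_pred, h_true, p_pred, p_true, w_kp, w_bone,
                      w_h=0.4, w_p=0.6):
    # L = w_h * L_heatmap + w_p * L_POF, with the example weights from the text
    return (w_h * heatmap_loss(h_pred, h_true, w_kp) +
            w_p * pof_loss(p_pred, p_true, w_bone))
```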
In step 205, the server iteratively adjusts the network parameters of the first preset decoder and the second preset decoder according to the third target loss, and returns to execute inputting the image feature information to the first preset decoder, and outputs the corresponding two-dimensional heat map information until the third target loss converges, so as to obtain the first target preset decoder and the second target preset decoder.
To continue the training, after the network parameters of the first preset decoder and the second preset decoder are adjusted, steps 202, 203 and 204 are executed again: the image feature information is input to the adjusted first preset decoder and second preset decoder, new two-dimensional heat map information and three-dimensional bone direction field information are output, and a new third target loss is calculated. As the number of training iterations increases, the third target loss becomes smaller and smaller until it converges. At that point the network parameter adjustment has been learned: the first preset decoder has learned how to predict accurate two-dimensional heat map information from the image feature information, and the second preset decoder has learned how to predict accurate three-dimensional bone direction field information. Training is complete, yielding the first target preset decoder and the second target preset decoder.
In step 206, the server inputs the target image feature information to a first target preset decoder and a second target preset decoder, respectively, and outputs corresponding target two-dimensional heat map information and target three-dimensional bone direction field information.
Based on the first target preset decoder and the second target preset decoder obtained by the above training, the server can input the target image feature information into them respectively and output correspondingly accurate target two-dimensional heat map information and target three-dimensional skeleton direction field information.
In step 207, the server inputs the target two-dimensional heat map information and the target three-dimensional bone orientation field information into a parameter regressor, and outputs corresponding human body shape parameters and joint rotation parameters.
The target two-dimensional heat map information and the target three-dimensional skeleton direction field information can be input into an untrained parameter regressor, which outputs corresponding human body shape parameters and joint rotation parameters. The human body shape parameters control the deformation of the whole human body model, governing body dimensions such as height and weight; the joint rotation parameters control body actions such as raising a hand or lifting a foot. The human body shape parameters may be 10-dimensional, and the joint rotation parameters may be 72-dimensional (24 joints, each with a three-dimensional rotation).
It should be noted that the parameter regressor may be a recurrent neural network model, and the parameter regressor is an untrained network model, and the parameter regressor is used for predicting the human body shape parameters and the joint rotation parameters corresponding to the target two-dimensional heat map information and the target three-dimensional bone direction field information.
Because the network parameters of the parameter regressor are the configured initial network parameters, after the target two-dimensional heat map information and the target three-dimensional skeleton direction field information are input into the parameter regressor, the output human body shape parameters and joint rotation parameters are inaccurate.
In step 208, the server obtains a preset parameterized human body model, and deforms the preset parameterized human body model according to the human body shape parameters and the joint rotation parameters to determine three-dimensional joint points.
Referring to fig. 4a and fig. 4b together, fig. 4a is a two-dimensional human body schematic diagram of the virtual object driving method according to the embodiment of the present application, and fig. 4b is a three-dimensional human body schematic diagram of the same method. The virtual two-dimensional human body model 20 may have 18 two-dimensional key points. The virtual three-dimensional human body model 30 may have 24 three-dimensional joint points, each controlling a different part of the three-dimensional virtual human body model 30.
To verify whether the human body shape parameters and the joint rotation parameters are correct, the server first obtains a preset parameterized human body model. The preset parameterized human body model T̄ is a standard human body model in a standard posture, for example the human body model 30. The standard model can be deformed in body shape and motion according to the human body shape parameters and the joint rotation parameters to obtain a changed human body model, on which the 24 three-dimensional joint points are then located. Specifically, the deformation of the standard human body model in the standard posture can be realized by the following formulas:
T(β, θ) = T̄ + B_S(β) + B_P(θ)  (1)
M(β, θ) = W(T(β, θ), J(θ), θ, w)  (2)

where T̄ is the standard human body model in the standard posture; B_S(β) and B_P(θ) represent the body shape offsets caused by individual differences and by human motion respectively; T(β, θ) is the 3D body in the standard posture after shape deformation; J(θ) gives the 3D coordinates of each joint after each joint in the standard posture is rotated according to the joint rotation parameters; w is the skinning weight matrix and W is a linear skinning function; and M(β, θ) is the 3D body after shape deformation and motion deformation. The three-dimensional joint points corresponding to the 24 joint points can be obtained from formulas (1) and (2).
In step 209, the server projects the three-dimensional joint points to the two-dimensional image plane according to a preset function to obtain two-dimensional key points, and calculates a third difference between the three-dimensional joint points and the label three-dimensional joint points and a fourth difference between the two-dimensional key points and the label two-dimensional key points.
The preset function can be a weak perspective projection function, which converts a three-dimensional image into a two-dimensional image; the three-dimensional joint points can therefore be projected onto the two-dimensional image plane through the preset function, obtaining the two-dimensional key points on that plane. Specifically, the two-dimensional key points can be calculated by the following formula:
K = Π(R·J + t)
K is a two-dimensional key point, Π() is the weak perspective projection function, R and t are the global rotation and translation, which may be predicted by the parameter regressor, and J is a three-dimensional joint point.
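A minimal sketch of the weak perspective projection K = Π(R·J + t) follows; the scalar scale factor of the weak perspective camera is an assumption of the example, as are all names.

```python
# Illustrative sketch of K = Π(R·J + t), the weak perspective projection.
import numpy as np

def weak_perspective_project(joints3d, R, t, scale):
    """joints3d: (N, 3) three-dimensional joint points J
    R: (3, 3) global rotation; t: (3,) global translation;
    scale: scalar camera scale of the weak perspective model (assumed).
    Returns (N, 2) two-dimensional key points K."""
    cam = joints3d @ R.T + t      # R·J + t in camera coordinates
    return scale * cam[:, :2]     # Π(): drop the depth axis, apply uniform scale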
Further, the third difference between the three-dimensional joint points and the label three-dimensional joint points can be calculated: the larger the third difference, the less accurate the three-dimensional joint point prediction; the smaller the third difference, the more accurate the prediction. Likewise, the fourth difference between the two-dimensional key points and the label two-dimensional key points can be calculated: the larger the fourth difference, the less accurate the two-dimensional key point prediction; the smaller it is, the more accurate. Specifically, the fourth difference may be calculated by the following formula:
L_2d = Σ_{i=1}^{18} ||K(i) − K*(i)||²
L_2d is the fourth difference, K(i) is the i-th predicted two-dimensional key point, i = 1, …, 18, and K*(i) is the corresponding label two-dimensional key point.
Similarly, the third difference can be calculated by the following formula:
L_3d = Σ_{u=1}^{U} (1 − F_cos(J(x_u) − J(y_u), J*(x_u) − J*(y_u))),  U = 24
L_3d is the third difference; J(x_u) − J(y_u) is the three-dimensional bone direction calculated from the predicted three-dimensional joint points; J*(x_u) − J*(y_u) is the true three-dimensional bone direction calculated from the label three-dimensional joint points; U = 24 is the number of bones; and F_cos is the cosine similarity operator.
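A minimal sketch of the two losses above, assuming PyTorch tensors and a (24, 2) bone index table listing each bone's endpoint joints x_u and y_u; the exact aggregation (sums rather than means) is an assumption.

```python
# Illustrative sketch of the reconstructed losses above; aggregation is assumed.
import torch
import torch.nn.functional as F

def loss_2d(K, K_star):
    """Fourth difference: squared error over the 18 two-dimensional key points."""
    return ((K - K_star) ** 2).sum(dim=-1).sum()

def loss_3d(J, J_star, bones):
    """Third difference: cosine dissimilarity over the 24 bone directions.
    bones: (24, 2) long tensor of endpoint joint indices (x_u, y_u)."""
    x, y = bones[:, 0], bones[:, 1]
    d = J[x] - J[y]                    # predicted bone directions
    d_star = J_star[x] - J_star[y]     # label bone directions
    return (1.0 - F.cosine_similarity(d, d_star, dim=-1)).sum()
```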
In step 210, the server calculates a fifth difference between the human body shape parameters and the label shape parameters and a sixth difference between the joint rotation parameters and the label rotation parameters, combines the fifth difference and the sixth difference to form a seventh difference, and generates a prior constraint loss on the joint rotation parameters. It then weights the third difference by a third weight to obtain a fourth target loss, weights the fourth difference by a fourth weight to obtain a fifth target loss, weights the seventh difference by a fifth weight to obtain a sixth target loss, weights the prior constraint loss by a sixth weight to obtain a seventh target loss, and constructs an eighth target loss based on the fourth, fifth, sixth, and seventh target losses.
In order to achieve a better training effect, a fifth difference between the human body shape parameters and the preset-standard label shape parameters and a sixth difference between the joint rotation parameters and the preset-standard label rotation parameters can be introduced, and the two combined to form a seventh difference. Specifically, the seventh difference may be calculated by the following formula:
L_parm = Σ_i ||θ(i) − θ*(i)||² + Σ_i ||β(i) − β*(i)||²
L_parm is the seventh difference; θ(i) is a joint rotation parameter and θ*(i) is the corresponding label rotation parameter; β(i) is a human body shape parameter and β*(i) is the corresponding label shape parameter.
Further, to avoid reverse bending and excessively large-angle bending of the joints, a prior constraint loss on the joint rotation parameters may be generated from prior knowledge, which may be expressed by the following formula:
L_prior = prior(θ)
L_prior is the prior constraint loss, prior() is the constraint function built from prior knowledge of feasible joint rotations, and θ is the joint rotation parameter.
Wherein the eighth target loss may be calculated by the following formula:
L = w_3d·L_3d + w_2d·L_2d + w_parm·L_parm + w_prior·L_prior
the w2d,w3d,wparmAnd wpriorRespectively, a third weight, a fourth weight, a fifth weight, and a sixth weight, and L is an eighth target penalty. In this way, the eighth target loss can be calculated in combination with the above formula.
In step 211, the server iteratively adjusts the network parameters of the parameter regressor according to the eighth target loss, returns to the step of inputting the target two-dimensional heat map information and the target three-dimensional bone direction field information into the parameter regressor and outputting corresponding human body shape parameters and joint rotation parameters, until the eighth target loss converges, thereby obtaining the target parameter regressor.
To verify the effect of each adjustment, after the network parameters of the parameter regressor are adjusted, steps 207 to 210 are executed again: the target two-dimensional heat map information and the target three-dimensional bone direction field information are input into the adjusted parameter regressor, new human body shape parameters and joint rotation parameters are output, and a new eighth target loss is calculated for the next adjustment. As training proceeds, the eighth target loss becomes smaller and smaller until it converges; at that point the network parameter adjustment is complete, the parameter regressor has learned how to predict accurate human body shape parameters and joint rotation parameters from the target two-dimensional heat map information and the target three-dimensional bone direction field information, and training finishes, yielding the target parameter regressor.
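For illustration, a sketch of this training loop, under the assumption of an Adam optimizer and a simple loss-change convergence test (neither is specified by the embodiment); it reuses the regressor and loss sketches above, and smpl_forward, project, and prior_fn are hypothetical callables standing in for steps 208 and 209 and the prior constraint.

```python
# Illustrative training loop; optimizer and convergence test are assumptions.
import torch

def train_regressor(regressor, loader, smpl_forward, project, bones,
                    prior_fn, tol=1e-4, max_epochs=100):
    opt = torch.optim.Adam(regressor.parameters(), lr=1e-4)
    prev = float('inf')
    for epoch in range(max_epochs):
        total = 0.0
        for heatmaps, fields, J_star, K_star, beta_star, theta_star in loader:
            beta, theta = regressor(heatmaps, fields)   # step 207
            J = smpl_forward(beta, theta)               # step 208: deform, take joints
            K = project(J)                              # step 209: weak perspective
            loss = total_loss(loss_2d(K, K_star),
                              loss_3d(J, J_star, bones),
                              loss_parm(theta, theta_star, beta, beta_star),
                              prior_fn(theta))          # step 210: eighth target loss
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        if abs(prev - total) < tol:  # step 211: eighth target loss has converged
            break
        prev = total
    return regressor
```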
In step 212, the server inputs the target two-dimensional heat map information and the target three-dimensional bone direction field information into the target parameter regressor, outputs corresponding target joint rotation parameters, determines the root joint point of the virtual object, and, starting from the root joint point, drives the virtual object according to the target joint rotation parameters and the joint point driving sequence.
With the target parameter regressor obtained by the above training process, the target two-dimensional heat map information and the target three-dimensional skeleton direction field information can be input into it, and accurate target joint rotation parameters and target human body shape parameters are output.
Further, with continuing reference to fig. 4b, since the target joint rotation parameters record the motion information of each of the 24 joint points, a forward kinematics (FK) algorithm may be used. Forward kinematics means that when the virtual character moves, each joint point drives the next joint point in sequence; the joints of the human body form a structure that can be understood as a tree model, and the rotation order drives the virtual object according to a joint driving sequence starting from the root joint point 0 of the virtual object. The joint driving sequence may be as shown in table 1 below:
Serial number | Father node | Serial number | Father node | Serial number | Father node | Serial number | Father node
0 | none (root) | 6 | 3 | 12 | 9 | 18 | 16
1 | 0 | 7 | 4 | 13 | 9 | 19 | 17
2 | 0 | 8 | 5 | 14 | 9 | 20 | 18
3 | 0 | 9 | 6 | 15 | 12 | 21 | 19
4 | 1 | 10 | 7 | 16 | 13 | 22 | 20
5 | 2 | 11 | 8 | 17 | 14 | 23 | 21

TABLE 1
Based on table 1: the father nodes of three-dimensional joint points 1, 2, and 3 are all 0, and the father node of joint point 4 is 1, which indicates that joint point 0 first executes its action and drives joint points 1, 2, and 3, which in turn drive joint point 4, and so on; this is the joint driving sequence. That is, the present application may drive the virtual object with the forward kinematics algorithm, based on the target joint rotation parameters and in the joint point driving order, starting from the root joint point. As shown in fig. 4c, the virtual object 42 can be driven to follow the actions of the anchor, greatly enhancing the interest of the live broadcast.
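A minimal sketch of driving the joints by forward kinematics using the parent relations of Table 1; the rest-pose offsets and the 4x4 transform representation are assumptions of the example, and the parent array is copied directly from the table.

```python
# Illustrative forward-kinematics sketch using the parent table (Table 1).
import numpy as np

# Father node of each of the 24 joints; -1 marks the root joint 0.
PARENTS = [-1, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8,
           9, 9, 9, 12, 13, 14, 16, 17, 18, 19, 20, 21]

def forward_kinematics(local_rot, rest_joints):
    """local_rot:   (24, 3, 3) per-joint rotation matrices from theta
    rest_joints: (24, 3) joint positions in the standard posture.
    Returns (24, 4, 4) world transforms; in Table 1 order every father
    node precedes its children, so a single forward pass suffices."""
    world = [None] * 24
    for j in range(24):
        T = np.eye(4)
        T[:3, :3] = local_rot[j]
        parent = PARENTS[j]
        T[:3, 3] = rest_joints[j] - (rest_joints[parent] if parent >= 0 else 0)
        world[j] = T if parent < 0 else world[parent] @ T
    return np.stack(world)
```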
In view of the above, in the embodiment of the application, the target monocular image is obtained, and the target image characteristic information corresponding to the target monocular image is extracted; respectively inputting the target image characteristic information into a first target preset decoder and a second target preset decoder, and outputting corresponding target two-dimensional heat map information and target three-dimensional skeleton direction field information; inputting the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into a target parameter regressor, and outputting corresponding target joint rotation parameters; the virtual object is driven based on the target joint rotation parameters. Therefore, the corresponding target joint rotation parameters can be rapidly output to drive the virtual object to do corresponding actions according to the actions of the characters in the target monocular image, and the virtual object driving efficiency is greatly improved.
In the present embodiment, a description will be given from the perspective of a virtual object driving apparatus, which may be specifically integrated in an anchor client.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a virtual object driving apparatus according to an embodiment of the present disclosure, where the virtual object driving apparatus is applied to a server, and the virtual object driving apparatus may include an extraction unit 301, a first output unit 302, a second output unit 303, a driving unit 304, and the like.
The extracting unit 301 is configured to obtain a target monocular image and extract target image feature information corresponding to the target monocular image.
In some embodiments, the extraction unit 301 is configured to:
acquiring a target monocular image;
and inputting the target monocular image into a preset image encoder, and extracting corresponding target image characteristic information.
A first output unit 302, configured to input the target image feature information into a first target preset decoder and a second target preset decoder, respectively, and output corresponding target two-dimensional heat map information and target three-dimensional bone direction field information.
The first target preset decoder is obtained by training image characteristic information extracted through monocular image samples, predicted two-dimensional heat map information and label two-dimensional heat map information, and the second target preset decoder is obtained by training three-dimensional skeleton direction field information and label three-dimensional skeleton direction field information predicted through the image characteristic information.
A second output unit 303, configured to input the target two-dimensional heat map information and the target three-dimensional bone direction field information into a target parameter regressor, and output corresponding target joint rotation parameters.
The target parameter regressor is obtained by training predicted joint rotation parameters and label joint rotation parameters through target two-dimensional heat map information and target three-dimensional skeleton direction field information.
A driving unit 304 for driving the virtual object based on the target joint rotation parameter.
In some embodiments, the driving unit 304 is configured to:
determining a root joint point of the virtual object;
starting based on the root joint point, the virtual object is driven according to the target joint rotation parameters and the joint point driving order.
In some embodiments, the apparatus further comprises:
the acquisition unit is used for acquiring a monocular image sample and extracting image characteristic information corresponding to the monocular image sample;
the third output unit is used for inputting the image characteristic information to the first preset decoder and outputting corresponding two-dimensional heat map information;
the first calculation unit is used for calculating a first difference corresponding to the two-dimensional heat map information and the label two-dimensional heat map information;
the fourth output unit is used for inputting the image characteristic information into a second preset decoder and outputting corresponding three-dimensional skeleton direction field information;
the second calculation unit is used for calculating a second difference corresponding to the three-dimensional skeleton direction field information and the label three-dimensional skeleton direction field information;
And the first iteration adjusting unit is used for performing iteration adjustment on the network parameters of the first preset decoder and the second preset decoder according to the first difference and the second difference, returning to execute the input of the image characteristic information to the first preset decoder, and outputting corresponding two-dimensional heat map information until the first difference and the second difference are converged to obtain a first target preset decoder and a second target preset decoder.
In some embodiments, the first iterative adjustment unit is to:
weighting the first difference through a first weight to obtain a first target loss;
weighting the second difference through a second weight to obtain a second target loss;
constructing a third target loss based on the first target loss and the second target loss;
and iteratively adjusting the network parameters of the first preset decoder and the second preset decoder according to the third target loss, returning to execute the step of inputting the image characteristic information into the first preset decoder, and outputting corresponding two-dimensional heat map information until the third target loss is converged to obtain the first target preset decoder and the second target preset decoder.
In some embodiments, the apparatus further comprises:
The fifth output unit is used for inputting the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into the parameter regressor and outputting corresponding human body shape parameters and joint rotation parameters;
a first determining unit for determining a three-dimensional joint point according to the human body shape parameter and the joint rotation parameter;
the projection unit is used for projecting the three-dimensional joint points to a two-dimensional image plane according to a preset function to obtain two-dimensional key points;
the third calculation unit is used for calculating a third difference corresponding to the three-dimensional joint point and the label three-dimensional joint point and a fourth difference corresponding to the two-dimensional key point and the label two-dimensional key point;
and the second iterative adjustment unit is used for iteratively adjusting the network parameters of the parameter regressor based on the third difference and the fourth difference, returning to input the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into the parameter regressor, and outputting corresponding human body shape parameters and joint rotation parameters until the third difference and the fourth difference are converged to obtain the target parameter regressor.
In some embodiments, the first determining unit is configured to:
acquiring a preset parameterized human body model;
And deforming the preset parameterized human body model according to the human body shape parameters and the joint rotation parameters to determine three-dimensional joint points.
In some embodiments, the second iterative adjustment unit is configured to:
calculating a fifth difference corresponding to the human body shape parameter and the label shape parameter and a sixth difference corresponding to the joint rotation parameter and the label rotation parameter;
combining the fifth difference and the sixth difference to form a seventh difference;
generating an a priori constraint loss for the joint rotation parameter;
weighting the third difference through a third weight to obtain a fourth target loss;
weighting the fourth difference by a fourth weight to obtain a fifth target loss;
weighting the seventh difference by a fifth weight to obtain a sixth target loss;
weighting the prior constraint loss by a sixth weight to obtain a seventh target loss;
constructing an eighth target loss based on the fourth target loss, the fifth target loss, the sixth target loss, and the seventh target loss;
and iteratively adjusting the network parameters of the parameter regressor according to the eighth target loss, returning to input the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into the parameter regressor, and outputting corresponding human body shape parameters and joint rotation parameters until the eighth target loss is converged to obtain the target parameter regressor.
An embodiment of the present application further provides a computer device, where the computer device may be a terminal or a server, as shown in fig. 6, which illustrates a schematic structural diagram of a computer device according to an embodiment of the present invention, specifically:
the computer device may include components such as a processor 601 of one or more processing cores, memory 602 of one or more computer-readable storage media, a power supply 603, and an input unit 604. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 6 does not constitute a limitation of the computer device, and may include more or fewer components than illustrated, or some components may be combined, or a different arrangement of components. Wherein:
the processor 601 is a control center of the computer device, connects various parts of the whole computer device by various interfaces and lines, performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 602 and calling data stored in the memory 602, thereby monitoring the computer device as a whole. Alternatively, processor 601 may include one or more processing cores; preferably, the processor 601 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 601.
The memory 602 may be used to store software programs and modules, and the processor 601 executes various functional applications and data processing by operating the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 601 access to the memory 602.
The computer device further comprises a power supply 603 for supplying power to the various components, and preferably, the power supply 603 is logically connected to the processor 601 through a power management system, so that functions of managing charging, discharging, and power consumption are realized through the power management system. The power supply 603 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The computer device may also include an input unit 604, the input unit 604 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 601 in the computer device loads the executable file corresponding to the process of one or more application programs into the memory 602 according to the following instructions, and the processor 601 runs the application programs stored in the memory 602, thereby implementing various functions as follows:
acquiring a target monocular image, and extracting target image characteristic information corresponding to the target monocular image;
inputting the target image characteristic information into a first target preset decoder and a second target preset decoder respectively, and outputting corresponding target two-dimensional heat map information and target three-dimensional skeleton direction field information;
the first target preset decoder is obtained by training image characteristic information extracted through a monocular image sample, predicted two-dimensional heat map information and label two-dimensional heat map information, and the second target preset decoder is obtained by training three-dimensional skeleton direction field information predicted through the image characteristic information and label three-dimensional skeleton direction field information;
Inputting the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into a target parameter regressor, and outputting corresponding target joint rotation parameters;
the target parameter regressor is obtained by training predicted joint rotation parameters and label joint rotation parameters through target two-dimensional heat map information and target three-dimensional skeleton direction field information;
the virtual object is driven based on the target joint rotation parameter.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and parts that are not described in detail in a certain embodiment may refer to the above detailed description of the virtual object driving method, which is not described herein again.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute steps in any one of the virtual object driving methods provided in the embodiments of the present application.
According to an aspect of the application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided in the various alternative implementations provided by the above embodiments.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the computer storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the computer storage medium may execute the steps in any virtual object driving method provided in the embodiments of the present application, beneficial effects that can be achieved by any virtual object driving method provided in the embodiments of the present application can be achieved, for which details are shown in the foregoing embodiments and are not described herein again.
The foregoing detailed description is directed to a virtual object driving method, apparatus, storage medium, and computer device provided in the embodiments of the present application, and specific examples are applied herein to explain the principles and implementations of the present application, and the descriptions of the foregoing embodiments are only used to help understand the method and its core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, the specific implementation manner and the application scope may be changed, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (11)

1. A virtual object driving method, comprising:
acquiring a target monocular image, and extracting target image characteristic information corresponding to the target monocular image;
inputting the target image characteristic information into a first target preset decoder and a second target preset decoder respectively, and outputting corresponding target two-dimensional heat map information and target three-dimensional skeleton direction field information;
the first target preset decoder is obtained by training image characteristic information extracted through monocular image samples, predicted two-dimensional heat map information and label two-dimensional heat map information, and the second target preset decoder is obtained by training three-dimensional skeleton direction field information and label three-dimensional skeleton direction field information predicted through the image characteristic information;
inputting the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into a target parameter regressor, and outputting corresponding target joint rotation parameters;
the target parameter regressor is obtained by training predicted joint rotation parameters and label joint rotation parameters through target two-dimensional heat map information and target three-dimensional skeleton direction field information;
Driving a virtual object based on the target joint rotation parameter.
2. The virtual object driving method of claim 1, wherein the method further comprises:
acquiring a monocular image sample, and extracting image characteristic information corresponding to the monocular image sample;
inputting the image characteristic information into a first preset decoder, and outputting corresponding two-dimensional heat map information;
calculating a first difference corresponding to the two-dimensional heat map information and the label two-dimensional heat map information;
inputting the image characteristic information into a second preset decoder, and outputting corresponding three-dimensional bone direction field information;
calculating a second difference corresponding to the three-dimensional bone direction field information and the label three-dimensional bone direction field information;
and iteratively adjusting network parameters of the first preset decoder and the second preset decoder according to the first difference and the second difference, returning to execute the step of inputting the image characteristic information into the first preset decoder, and outputting corresponding two-dimensional heat map information until the first difference and the second difference are converged to obtain a first target preset decoder and a second target preset decoder.
3. The virtual object driving method according to claim 2, wherein iteratively adjusting network parameters of the first preset decoder and the second preset decoder according to the first difference and the second difference, and returning to perform inputting the image feature information into the first preset decoder, and outputting corresponding two-dimensional heat map information until the first difference and the second difference converge to obtain a first target preset decoder and a second target preset decoder, includes:
Weighting the first difference through a first weight to obtain a first target loss;
weighting the second difference through a second weight to obtain a second target loss;
constructing a third target loss based on the first target loss and the second target loss;
and iteratively adjusting the network parameters of the first preset decoder and the second preset decoder according to the third target loss, returning to execute the step of inputting the image characteristic information into the first preset decoder, and outputting corresponding two-dimensional heat map information until the third target loss is converged to obtain the first target preset decoder and the second target preset decoder.
4. The virtual object driving method according to claim 1, further comprising:
inputting the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into a parameter regressor, and outputting corresponding human body shape parameters and joint rotation parameters;
determining three-dimensional joint points according to the human body shape parameters and the joint rotation parameters;
projecting the three-dimensional joint points to a two-dimensional image plane according to a preset function to obtain two-dimensional key points;
calculating a third difference corresponding to the three-dimensional joint point and the label three-dimensional joint point and a fourth difference corresponding to the two-dimensional key point and the label two-dimensional key point;
And iteratively adjusting the network parameters of the parameter regressor based on the third difference and the fourth difference, returning to input the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into the parameter regressor, and outputting corresponding human body shape parameters and joint rotation parameters until the third difference and the fourth difference are converged to obtain the target parameter regressor.
5. The virtual object driving method according to claim 4, wherein the determining three-dimensional joint points according to the human body shape parameters and joint rotation parameters comprises:
acquiring a preset parameterized human body model;
and deforming a preset parameterized human body model according to the human body shape parameters and the joint rotation parameters to determine three-dimensional joint points.
6. The virtual object driving method according to claim 4, wherein the iteratively adjusting the network parameters of the parameter regressor based on the third difference and the fourth difference, and returning to input the target two-dimensional heat map information and the target three-dimensional bone orientation field information into the parameter regressor, and outputting corresponding human body shape parameters and joint rotation parameters until the third difference and the fourth difference converge to obtain the target parameter regressor, comprises:
Calculating a fifth difference corresponding to the human body shape parameter and the label shape parameter and a sixth difference corresponding to the joint rotation parameter and the label rotation parameter;
combining the fifth difference and the sixth difference to form a seventh difference;
generating an a priori constraint loss of the joint rotation parameter;
weighting the third difference through a third weight to obtain a fourth target loss;
weighting the fourth difference through a fourth weight to obtain a fifth target loss;
weighting the seventh difference by a fifth weight to obtain a sixth target loss;
weighting the prior constraint loss through a sixth weight to obtain a seventh target loss;
constructing an eighth target loss based on the fourth target loss, the fifth target loss, the sixth target loss, and the seventh target loss;
and iteratively adjusting the network parameters of the parameter regressor according to the eighth target loss, returning to input the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into the parameter regressor, and outputting corresponding human body shape parameters and joint rotation parameters until the eighth target loss is converged to obtain the target parameter regressor.
7. The virtual object driving method according to claim 1, wherein the acquiring a target monocular image and extracting target image feature information corresponding to the target monocular image comprises:
acquiring a target monocular image;
and inputting the target monocular image into a preset image encoder, and extracting corresponding target image characteristic information.
8. The virtual object driving method according to claim 1, wherein the driving a virtual object based on the target joint rotation parameter comprises:
determining a root joint point of the virtual object;
starting based on the root joint point, driving a virtual object according to the target joint rotation parameters and a joint point driving sequence.
9. A virtual object driving apparatus, comprising:
the extraction unit is used for acquiring a target monocular image and extracting target image characteristic information corresponding to the target monocular image;
the first output unit is used for respectively inputting the target image characteristic information into a first target preset decoder and a second target preset decoder and outputting corresponding target two-dimensional heat map information and target three-dimensional skeleton direction field information;
The first target preset decoder is obtained by training image characteristic information extracted through monocular image samples, predicted two-dimensional heat map information and label two-dimensional heat map information, and the second target preset decoder is obtained by training three-dimensional skeleton direction field information and label three-dimensional skeleton direction field information predicted through the image characteristic information;
the second output unit is used for inputting the target two-dimensional heat map information and the target three-dimensional skeleton direction field information into a target parameter regressor and outputting corresponding target joint rotation parameters;
the target parameter regressor is obtained by training predicted joint rotation parameters and label joint rotation parameters through target two-dimensional heat map information and target three-dimensional skeleton direction field information;
a driving unit for driving a virtual object based on the target joint rotation parameter.
10. A computer readable storage medium, storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the virtual object driving method according to any one of claims 1 to 8.
11. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps in the virtual object driving method of any of claims 1 to 8 when executing the computer program.
CN202210314199.XA 2022-03-28 2022-03-28 Virtual object driving method, device, storage medium and computer equipment Pending CN114758108A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210314199.XA | 2022-03-28 | 2022-03-28 | Virtual object driving method, device, storage medium and computer equipment

Publications (1)

Publication Number | Publication Date
CN114758108A | 2022-07-15

Family

ID=82328231

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210314199.XA (pending; published as CN114758108A) | Virtual object driving method, device, storage medium and computer equipment | 2022-03-28 | 2022-03-28

Country Status (1)

Country Link
CN (1) CN114758108A (en)

Cited By (2)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN116071527A * | 2022-12-19 | 2023-05-05 | Alipay (Hangzhou) Information Technology Co., Ltd. | Object processing method and device, storage medium and electronic equipment
CN115641647A * | 2022-12-23 | 2023-01-24 | Haima Cloud (Tianjin) Information Technology Co., Ltd. | Digital human wrist driving method and device, storage medium and electronic equipment


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination