CN117218246A - Training method and device for image generation model, electronic equipment and storage medium

Info

Publication number: CN117218246A
Application number: CN202310283088.1A
Authority: CN (China)
Prior art keywords: sample, image, training, diagram, skeleton
Legal status: Pending
Original language: Chinese (zh)
Inventor: 杨泽军
Current assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202310283088.1A

Abstract

The present application relates to the field of data processing technologies, and in particular to a training method and apparatus for an image generation model, an electronic device, and a storage medium. The method includes: acquiring a training sample set, where a training sample includes a sample reference map of a target object, a sample skeleton map and a sample depth map under a target pose, and a sample standard map; and training a pre-trained image generation model with the training sample set to output a target image generation model. In each iteration, the model parameters are adjusted based on the multi-scale global comprehensive difference loss between the output prediction standard map and the sample standard map in the training sample, combined with the local difference loss within a designated image region between the prediction standard map and the sample standard map. In this way, self-occlusion in different regions can be handled by means of the depth values at different key point positions, ensuring the generation quality of the trained target image generation model.

Description

Training method and device for image generation model, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a training method and apparatus for an image generation model, an electronic device, and a storage medium.
Background
In the related art, when planar images of a virtual object under different actions are generated, a motion migration model is usually trained so that, based on a reference image of the virtual object, a virtual object image under a target pose can be generated.
Currently, when an image corresponding to a target pose is generated with a motion migration model, the target image is generally generated based on a reference image and a two-dimensional skeleton map indicating the target pose, where the two-dimensional skeleton map only shows the positions of key points on the head and limbs.
However, when an image is generated based on an existing motion migration model, the target pose can only be indicated by two-dimensional key point coordinates, so similar actions are difficult to distinguish effectively, which reduces the generation accuracy of the target image. In addition, limbs in different regions can only be considered at a single image scale, so different limbs in the generated target image vary in definition; and because existing motion migration models only attend to detail in the face region, the limb-end pose under the target pose is restored inaccurately, making the generation quality of the target image difficult to guarantee.
Disclosure of Invention
The embodiments of the present application provide a training method and apparatus for an image generation model, an electronic device, and a storage medium, which are used to improve the generation accuracy of a target image corresponding to a target pose and to guarantee the generation quality of the target image.
In a first aspect, a training method for an image generation model is provided, including:
acquiring a training sample set; a training sample includes: a sample reference map of a target object, a sample skeleton map and a sample depth map indicating the key point positions of the target object under a target pose, and a sample standard map under the target pose; the sample skeleton map includes at least a limb-end skeleton;
performing multiple rounds of iterative training on the pre-trained image generation model by adopting the training sample set, and outputting a trained target image generation model; wherein, in a round of iterative process, the following operations are performed:
based on a sample skeleton diagram and a sample depth diagram contained in the selected training sample, performing action migration processing on the target object in the contained sample reference diagram according to the corresponding target pose to obtain a prediction standard diagram;
and adjusting model parameters in the image generation model based on the multi-scale global comprehensive difference loss between the prediction standard map and the sample standard map, combined with the local difference loss within a designated image region between the prediction standard map and the sample standard map.
In a second aspect, a training apparatus for an image generation model is provided, including:
the acquisition unit is used for acquiring a training sample set; a training sample includes: a sample reference map of a target object, a sample skeleton map and a sample depth map indicating the key point positions of the target object under a target pose, and a sample standard map under the target pose; the sample skeleton map includes at least a limb-end skeleton;
the training unit is used for performing multi-round iterative training on the pre-trained image generation model by adopting the training sample set and outputting a trained target image generation model; wherein, in a round of iterative process, the following operations are performed:
based on a sample skeleton diagram and a sample depth diagram contained in the selected training sample, performing action migration processing on the target object in the contained sample reference diagram according to the corresponding target pose to obtain a prediction standard diagram;
and adjusting model parameters in the image generation model based on the multi-scale global comprehensive difference loss between the prediction standard map and the sample standard map, combined with the local difference loss within a designated image region between the prediction standard map and the sample standard map.
Optionally, the image generation model includes: a first encoding network configured with a convolutional attention layer, a second encoding network configured with a convolutional attention layer and an image fusion layer, and a multi-scale decoding network configured with convolutional attention layers;
the training unit is configured to, when performing motion migration processing on the target object in the included sample reference map according to the corresponding target pose based on the sample skeleton map and the sample depth map included in the selected training sample to obtain a prediction standard map:
inputting a sample reference picture contained in the selected training sample into the first coding network to obtain coded reference image characteristics;
concatenating the sample skeleton map and the sample depth map contained in the training sample along the channel dimension, and inputting the result into the second coding network to obtain encoded and fused skeleton action features;
and decoding the reference image characteristic based on the skeleton action characteristic by adopting the multi-scale decoding network to obtain a prediction standard diagram after finishing action migration.
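As a rough illustration of this structure, the following PyTorch sketch shows the data flow (reference encoding, channel-dimension concatenation of the skeleton and depth maps, and decoding); the layer widths are assumptions, and the convolutional attention (CBAM) layers and multi-scale outputs of the actual model are omitted for brevity:

```python
import torch
import torch.nn as nn

class ImageGenerationModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First encoding network: encodes the 3-channel sample reference map.
        self.ref_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        # Second encoding network: encodes the skeleton map (3 channels) and
        # depth map (1 channel) concatenated along the channel dimension,
        # fusing them into one skeleton action feature.
        self.pose_encoder = nn.Sequential(
            nn.Conv2d(4, 64, 3, stride=2, padding=1), nn.ReLU())
        # Decoder: restores an image from the combined features.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, reference, skeleton, depth):
        ref_feat = self.ref_encoder(reference)
        pose_feat = self.pose_encoder(torch.cat([skeleton, depth], dim=1))
        return self.decoder(torch.cat([ref_feat, pose_feat], dim=1))
```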
Optionally, the training sample set is generated in the following manner:
obtaining a sample standard graph and a three-dimensional coordinate set of a target object under different poses, wherein one three-dimensional coordinate set comprises: three-dimensional coordinates corresponding to each key point in one pose;
processing each three-dimensional coordinate set with a preset two-dimensional re-projection technology to obtain a sample skeleton map generated from the pixel point coordinates of each key point position under an image coordinate system, and a sample depth map generated from the pixel depth values corresponding to each key point position;
and generating a training sample set based on the sample standard graph, the sample skeleton graph and the sample depth graph corresponding to the different poses.
Optionally, when obtaining the sample skeleton map generated based on the two-dimensional coordinates of each key point position under the image coordinate system, the obtaining unit is configured to:
obtaining coordinates of each pixel point after projecting the positions of each key point in the three-dimensional coordinate set to an image coordinate system;
and restoring skeleton distribution under the corresponding pose by connecting pixel points corresponding to the pixel point coordinates, so as to obtain a sample skeleton diagram with the same size as the corresponding sample standard diagram.
Optionally, when obtaining the sample depth map generated based on the pixel depth values of the key point positions, the obtaining unit is configured to:
obtaining, after each key point position in the three-dimensional coordinate set is projected to the image coordinate system, the pixel point coordinates and pixel depth values corresponding to each key point position;
and constructing an initial depth map matched with the image coordinate system, and adjusting the pixel values of the corresponding pixels in the initial depth map based on the pixel depth values, combined with the pixel value differences determined over the pixel point range to which each pixel coordinate belongs, to obtain the sample depth map.
Optionally, when the image generation model is trained as the generator in a generator-discriminator structure, after obtaining the prediction standard map, the training unit is further configured to:
adopting a preset generative adversarial loss function to obtain the corresponding adversarial loss based on the prediction standard map and the corresponding sample standard map;
and adjusting model parameters in the image generation model based on the adversarial loss and the global comprehensive difference loss between the prediction standard map and the sample standard map, combined with the local difference loss within the designated image region between the prediction standard map and the sample standard map.
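How these terms might be combined is sketched below, assuming a PyTorch generator-discriminator setup; the loss weights are illustrative assumptions, and `global_comprehensive_loss` and `local_difference_loss` are the sketches given after their respective descriptions below:

```python
import torch
import torch.nn.functional as F

def generator_step(pred, target, discriminator, region_boxes,
                   w_adv=1.0, w_global=10.0, w_local=10.0):
    logits = discriminator(pred)
    # Adversarial loss: the generator tries to make the discriminator say "real".
    adv_loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    total = (w_adv * adv_loss
             + w_global * global_comprehensive_loss(pred, target)
             + w_local * local_difference_loss(pred, target, region_boxes))
    return total
```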
Optionally, the local difference loss is determined in the following manner:
determining the positions of all target key points for positioning the sub-image areas in the prediction standard chart and the sample standard chart respectively, and cutting out the designated image areas containing a plurality of sub-image areas based on the determined positions of all target key points in the prediction standard chart and the sample standard chart respectively;
and obtaining the corresponding local difference loss based on the pixel value differences and the image feature differences within each sub-image region.
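A minimal sketch of such a local difference loss is given below, assuming the sub-image regions are supplied as crop boxes around the target key point positions and that a truncated VGG-19 serves as the feature extractor (both are assumptions; in practice pretrained weights would be loaded):

```python
import torch.nn.functional as F
import torchvision.models as models

vgg = models.vgg19(weights=None).features[:16].eval()  # assumed feature extractor

def local_difference_loss(pred, target, region_boxes):
    """region_boxes: list of (top, left, height, width) sub-image regions."""
    loss = 0.0
    for top, left, h, w in region_boxes:
        p = pred[..., top:top + h, left:left + w]
        g = target[..., top:top + h, left:left + w]
        loss = loss + F.l1_loss(p, g)            # pixel value difference
        loss = loss + F.l1_loss(vgg(p), vgg(g))  # image feature difference
    return loss
```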
Optionally, the global integrated difference loss is determined in the following manner:
obtaining global pixel value loss based on pixel value differences of all pixel points between the prediction standard diagram and the sample standard diagram, and obtaining multi-scale feature loss based on image feature differences of the prediction standard diagram and the sample standard diagram under a plurality of preset scales;
and obtaining the corresponding global comprehensive difference loss from the global pixel value loss and the multi-scale feature loss.
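Correspondingly, a hedged sketch of the global comprehensive difference loss, reusing the `vgg` extractor from the previous sketch; the set of preset scales is an assumption:

```python
import torch.nn.functional as F

def global_comprehensive_loss(pred, target, scales=(1.0, 0.5, 0.25)):
    loss = F.l1_loss(pred, target)               # global pixel value loss
    for s in scales:                             # multi-scale feature loss
        p = F.interpolate(pred, scale_factor=s, mode="bilinear", align_corners=False)
        g = F.interpolate(target, scale_factor=s, mode="bilinear", align_corners=False)
        loss = loss + F.l1_loss(vgg(p), vgg(g))  # feature difference at scale s
    return loss
```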
Optionally, the training unit performs pre-training of the image generation model in the following manner:
acquiring a designated data set, and obtaining sample depth maps corresponding to each sample skeleton map by performing monocular depth estimation processing on each sample skeleton map in the data set, wherein the data set comprises a sample standard map and a sample skeleton map of each sample object under different poses;
and constructing a pre-training sample set based on a sample standard diagram, a sample skeleton diagram and a sample depth diagram which are obtained according to the data set, performing multi-round iterative training on an initial image generation model based on the pre-training sample set, and outputting a pre-trained image generation model.
Optionally, the training unit determines the learning rate used in each iteration of the pre-trained image generation model according to any one of the following manners:
determining a learning rate value corresponding to each training period based on a preset initial learning rate by adopting a preset cosine annealing algorithm, and determining a target learning rate corresponding to the current iteration process according to the training period to which the current iteration process belongs, wherein one training period comprises at least one round of iteration process;
determining a learning rate value corresponding to each training period based on a preset initial learning rate and a learning rate decay coefficient, and determining the target learning rate for the current iteration according to the training period to which the current iteration belongs, wherein one training period includes at least one round of the iterative process.
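The two alternatives can be sketched as follows, assuming one learning-rate update per training period; the initial rate and decay coefficient values are illustrative:

```python
import math

def cosine_annealing_lr(initial_lr, period, total_periods):
    # Cosine annealing: decays from initial_lr toward 0 over the training periods.
    return initial_lr * 0.5 * (1.0 + math.cos(math.pi * period / total_periods))

def decayed_lr(initial_lr, period, decay_coefficient=0.95):
    # Fixed learning-rate decay coefficient applied once per training period.
    return initial_lr * decay_coefficient ** period
```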
Optionally, the apparatus further includes a generating unit, where the generating unit is configured to:
acquiring a reference image of a target object under a reference action, and a plane skeleton map and a plane depth map of the target object under a specified pose, wherein the plane skeleton map comprises hand skeletons;
and adopting the target image generation model, and performing action migration processing on the reference image based on the plane skeleton diagram and the plane depth diagram to obtain a target image of the target object under the designated pose.
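A hedged usage sketch of this application stage, reusing the `ImageGenerationModel` sketch above; the input tensors are assumed to be preprocessed 1xCxHxW batches:

```python
import torch

model = ImageGenerationModel().eval()
reference = torch.randn(1, 3, 256, 256)       # reference image under a reference action
plane_skeleton = torch.randn(1, 3, 256, 256)  # planar skeleton map (includes hand bones)
plane_depth = torch.randn(1, 1, 256, 256)     # planar depth map
with torch.no_grad():
    target_image = model(reference, plane_skeleton, plane_depth)
```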
In a third aspect, an electronic device is presented comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above method when executing the program.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, implements the above method.
In a fifth aspect, a computer program product is presented, comprising a computer program, which when executed by a processor, performs the above method.
The application has the following beneficial effects:
The embodiments of the present application provide a training method and apparatus for an image generation model, an electronic device, and a storage medium. By constructing training samples that include a sample skeleton map and a sample depth map, depth values at different key point positions can be introduced when training the image generation model, which provides more reference bases for image generation, allows similar poses to be effectively distinguished, and handles self-occlusion in different regions; when training performs motion migration on the sample reference map according to the sample skeleton map and the sample depth map to obtain the prediction standard map, the generation accuracy of the image is thereby improved.
In addition, since the limb-end skeleton is considered in the sample skeleton map, detailed processing of the limb-end pose can be learned during model training, so that limb-end actions are effectively restored in the generated prediction standard map, guaranteeing the image generation quality.
Moreover, during model training, both the multi-scale global comprehensive difference loss and the local difference loss are considered, so the image differences of different regions can be evaluated at different image scales, ensuring the clarity of the generated image; this also improves the training effect of the model to a certain extent, better guides the model to learn motion migration, provides a guarantee for training a target image generation model that restores action details with an accurate pose, and improves the accuracy of subsequent image generation based on that model.
Drawings
Fig. 1 is a schematic diagram of a possible application scenario in an embodiment of the present application;
FIG. 2A is a schematic diagram of a training process of an image generation model according to an embodiment of the present application;
FIG. 2B is a schematic diagram of a process for generating a training sample set according to an embodiment of the present application;
FIG. 2C is a schematic diagram of the details of the motion of a target object in one pose according to an embodiment of the application;
FIG. 2D is a schematic diagram of a process for generating a sample skeleton diagram according to an embodiment of the present application;
FIG. 2E is a sample depth illustration generated in an embodiment of the present application;
FIG. 2F is a schematic diagram of an image generation model initially constructed in an embodiment of the present application;
FIG. 2G is a schematic diagram of a model training process in accordance with an embodiment of the present application;
FIG. 3A is a schematic diagram of a process in a training phase and an application phase of a target image generation model according to an embodiment of the present application;
FIG. 3B is a schematic diagram of a single-round iterative training process in an embodiment of the present application;
FIG. 3C is a schematic diagram of the overall structure of a training target image generation model in an embodiment of the present application;
FIG. 4 is a schematic diagram of a logic structure of a training device for generating an image model according to an embodiment of the present application;
fig. 5 is a schematic diagram of a hardware composition structure of an electronic device according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a computing device according to an embodiment of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the technical solutions of the present application, but not all embodiments. All other embodiments, based on the embodiments described in the present document, which can be obtained by a person skilled in the art without any creative effort, are within the scope of protection of the technical solutions of the present application.
The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be capable of operation in sequences other than those illustrated or otherwise described.
Some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
Virtual object: refers to virtual characters created in a virtual space after three-dimensional modeling, and in a possible embodiment of the present application, a target object refers to a virtual object.
Three-dimensional coordinates: in the embodiment of the application, when the target object is a virtual object, the three-dimensional coordinates refer to XYZ coordinates under the world coordinate system established in the created virtual space; when the target object is an entity object, the three-dimensional coordinates refer to XYZ coordinates in the real-world coordinate system.
Image coordinate system: the coordinate system established in the planar image is denoted as a UV coordinate system, wherein the U value represents the pixel coordinate of the pixel point in the horizontal axis direction of the planar image, and the V value represents the pixel coordinate of the pixel point in the vertical axis direction of the planar image.
UVZ coordinates: after the image coordinate system is established in the planar image, the UV value represents the pixel coordinates in the lateral and longitudinal directions on the image, and the Z value represents the depth distance of the pixel point on the image plane relative to the origin of the camera coordinate system.
Pixel coordinates: representing the coordinates of the pixel point in the image coordinate system.
Each key point position: the positions corresponding to the key points selected to describe the bone distribution under different poses. Eye key points, nose key points, shoulder joint points, elbow joint points, wrist joint points, hip joint points, knee joint points, and ankle joint points are commonly used to position different poses; in the embodiment of the application, in order to describe local details under different poses, finger joint points are creatively introduced, so that the poses of the end regions under different poses can be described and detailed learning of these regions is introduced into the learning process, where an end region can be either or both of the hand region and the foot region.
Sample skeleton map: used to describe the bone distribution under one pose; in the embodiment of the application, it is a two-dimensional image generated by connecting the key point positions under the corresponding pose. In the technical solution provided by the application, each pose has a skeleton map and a depth map describing the action of the target object under that pose, which are called the sample skeleton map and sample depth map during training, and the planar skeleton map and planar depth map during application; in UVZ coordinates, the sample skeleton map is determined from the UV values of each key point position.
Sample depth map: the method is used for describing the pose of the target object together with the sample skeleton diagram, and the pose is the same as the corresponding sample skeleton diagram in size; in the embodiment of the application, for a sample depth map and a sample skeleton map corresponding to one pose, the sample skeleton map describes the skeleton shape and distribution condition of a target object under the pose; the pixel values of the pixel points in the sample depth map are used for representing the depth values of the pixel points from the origin of the camera coordinate system, in other words, the sample depth map describes the depth values corresponding to the key points used for positioning the skeleton, so that the distance difference between the skeletons of different positions of the target object and the origin of the camera coordinate system can be described under the pose; in UVZ coordinates, the sample depth map is determined according to the Z values corresponding to the key points.
Action migration algorithm: a deep learning algorithm that converts a target object image into a new image under a target pose, based on a two-dimensional key point skeleton of the target object image and the target pose.
Human body posture estimation algorithm: the method is a deep learning algorithm capable of realizing human body key point detection.
Monocular depth estimation algorithm: the method refers to a deep learning algorithm for estimating the depth of a pixel point based on a single view image.
Super resolution algorithm: a deep learning algorithm for converting a low resolution image to a high resolution high definition large image.
Bone redirection: the method is used for transferring the motion of one three-dimensional skeleton to another three-dimensional skeleton with different body types, for example, based on the three-dimensional coordinates of each key point of the target object A under the motion 1, transferring the motion of the target object A to the target object B with different body types, and obtaining the three-dimensional coordinates of each key point of the target object B under the motion 1.
Machine Learning (ML) is a multi-domain interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements human learning behaviors to acquire new knowledge or skills and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The following briefly describes the design concept of the embodiment of the present application:
Under the related technology, when generating the plane images under different poses of the virtual person, in a possible implementation manner, the three-dimensional key point coordinate set of the virtual person under different actions in the constructed virtual space can be acquired first; then, based on the three-dimensional key point coordinate sets under different actions, driving the virtual person to make corresponding actions; and then, carrying out cloth resolving on the virtual person to realize decoration on the virtual person, and finally, carrying out art rendering on the decorated virtual person to obtain plane images under different poses.
However, in this image generation method, each time a planar image is generated, the processes of motion driving, cloth resolving, and art rendering need to be repeatedly performed, and the cloth resolving process needs to consume a large amount of computing resources, which increases not only the generation cost of the image but also the generation time of the image, and greatly limits the generation efficiency of the image.
Furthermore, in the prior art, it is proposed to generate a virtual person image in a target posture based on a reference image of a virtual person and a two-dimensional skeleton map in the target posture by training an action migration model.
However, in the existing processing mode, effective discrimination of different actions is difficult to realize in similar actions, and the generation accuracy of the target image is reduced; moreover, only the limbs in different areas can be considered based on the same image scale, so that the definition of different limbs in the generated target image is different, and in addition, the terminal limb posture reduction under the target posture is inaccurate, so that the generation effect of the target image is difficult to ensure.
In view of this, an embodiment of the present application provides a training method and apparatus for an image generation model, an electronic device, and a storage medium. A training sample set is acquired, where a training sample includes: a sample reference map of a target object, a sample skeleton map and a sample depth map indicating the key point positions of the target object under the target pose, and a sample standard map under the target pose, the sample skeleton map including at least a limb-end skeleton. Multiple rounds of iterative training are performed on the pre-trained image generation model with the training sample set, and a trained target image generation model is output, where in one round of iteration the following operations are performed: based on the sample skeleton map and sample depth map contained in the selected training sample, motion migration processing is performed on the target object in the contained sample reference map according to the corresponding target pose to obtain a prediction standard map; and model parameters in the image generation model are adjusted based on the multi-scale global comprehensive difference loss between the prediction standard map and the sample standard map, combined with the local difference loss within a designated image region between the prediction standard map and the sample standard map.
In this way, by means of the constructed training samples including the sample skeleton map and the sample depth map, depth values at different key point positions can be introduced when training the image generation model, which provides more reference bases for image generation, allows similar poses to be effectively distinguished, and handles self-occlusion in different regions; when training performs motion migration on the sample reference map according to the sample skeleton map and the sample depth map to obtain the prediction standard map, the generation accuracy of the image is thereby improved.
In addition, since the limb-end skeleton is considered in the sample skeleton map, detailed processing of the limb-end pose can be learned during model training, so that limb-end actions are effectively restored in the generated prediction standard map, guaranteeing the image generation quality.
Moreover, during model training, both the multi-scale global comprehensive difference loss and the local difference loss are considered, so the image differences of different regions can be evaluated at different image scales, ensuring the clarity of the generated image; this also improves the training effect of the model to a certain extent, better guides the model to learn motion migration, provides a guarantee for training a target image generation model that restores action details with an accurate pose, and improves the accuracy of subsequent image generation based on that model.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and not for limitation of the present application, and that the embodiments of the present application and the features of the embodiments may be combined with each other without conflict.
Fig. 1 is a schematic diagram of a possible application scenario in an embodiment of the present application. The application scenario schematic diagram includes an image acquisition device 110, and a processing device 120.
In the embodiment of the present application, the image acquisition device 110 may, according to actual processing requirements, provide images used to generate the training sample set, or generate the training sample set itself during model training; the image types in the generated training sample set include: sample standard maps, sample depth maps, and sample skeleton maps under different poses. It also provides the reference image, and the planar skeleton map and planar depth map under a designated pose, when processing is performed based on the trained target image generation model.
In the case that the target object is a virtual object, the device specifically corresponding to the image capturing device 110 includes, but is not limited to, an electronic device with specific processing capabilities such as a desktop computer, a mobile phone, a mobile computer, a tablet computer, and the like. In the case that the target object is a physical object, the image acquisition device 110 may specifically be a device such as a depth camera having a processing function, or an electronic device capable of processing in accordance with an image provided by the depth camera.
The processing device 120 may be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms. It may also be an electronic device such as a desktop computer, a mobile computer, or a tablet computer.
In the embodiment of the present application, a connection manner of wired connection or wireless connection is adopted between the image acquisition device 110 and the processing device 120, and a communication connection is established through a communication network.
In a possible solution of the present application, the image acquisition device 110 may provide the processing device 120 with images required for training; further, according to the actual processing requirement, the processing device 120 generates a pre-training sample set and a training sample set, and after the pre-training of the initial image generation model is completed according to the pre-training sample set, the processing device 120 continues to train the image generation model according to the training sample set to obtain a trained target image generation model.
In the embodiment of the application, different target image generation models can be obtained by training according to actual processing requirements for different target objects, specifically, after the pre-training is completed according to the data set to obtain the pre-trained image generation model, fine tuning training can be performed on each target object with image generation requirements according to images of the target object under different poses to obtain the target image generation model corresponding to the target object.
For example, assuming that virtual objects a and B exist, in order to generate images in different poses for the virtual objects a and B, respectively, it is necessary to train and generate corresponding target image generation models for the virtual object a and the virtual object B, respectively.
The technical scheme provided by the application can realize the training of the image generation model in various application scenes, and the possible application scenes are described as follows:
scene one, an image generation model is generated for a virtual object.
After the processing equipment acquires a pre-training sample set generated according to the designated data set, performing multi-round iterative training on the constructed image generation model based on the pre-training sample set to obtain a pre-trained image generation model; further, adopting cloth resolving and art rendering technologies, generating a sample standard diagram, a sample skeleton diagram and a sample depth diagram under different poses aiming at the virtual object, and constructing a training sample set; and performing multiple rounds of iterative training on the pre-trained image generation model according to the training sample set, and finally obtaining the trained target image generation model.
In the embodiment of the present application, the virtual object may be a game character or animal in a game scenario, a virtual character or animal in a virtual-object live-streaming scenario, or a virtual character or animal image in works such as authorized cartoons or sticker packs. When the virtual object is processed according to the trained target image generation model, target images under the corresponding poses can be generated for the different poses configured for the virtual object, and a dynamic image of the virtual object is then presented by playing the target images in sequence.
And generating an image generation model aiming at the entity object in the second scene.
After the processing device acquires the pre-training sample set generated from the designated data set, it performs multiple rounds of iterative training on the constructed image generation model based on the pre-training sample set to obtain a pre-trained image generation model. Further, images of the entity object under different poses are captured with a depth camera; based on these images, the sample standard map, sample skeleton map, and sample depth map of the entity object under different poses can be extracted to construct the training sample set. The training sample set is then used to perform multiple rounds of iterative training on the pre-trained image generation model, finally obtaining the trained target image generation model.
It should be noted that, in the embodiment of the present application, the entity object may be a real person or an animal. Taking a real person as an example, with the person's authorization, a target image generation model can be trained for that person; when the person is processed according to the target image generation model, target images under the corresponding poses can be generated for the different poses configured for the person, and the person's dynamic image is then presented by playing the target images in sequence.
Taking the sample generation and image generation model training performed by the processing device 120 as an example, the relevant process is schematically described below with reference to the accompanying drawings:
referring to fig. 2A, which is a schematic diagram of a training flow of an image generation model according to an embodiment of the present application, a related training process is described below with reference to fig. 2A:
step 201: the processing device obtains a training sample set, wherein one training sample comprises: the system comprises a sample reference map of a target object, a sample skeleton map and a sample depth map which indicate the positions of key points of the target object under the target pose, and a sample standard map of the target pose; the sample skeleton diagram at least comprises a limb tail end skeleton.
In the embodiment of the application, before performing fine-tuning training on the pre-trained image generation model aiming at the target object, the processing device needs to acquire an adopted training sample set, wherein one training sample comprises: the system comprises a sample reference image of a target object, a sample skeleton image and a sample depth image for indicating the positions of key points of the target object in a target pose, and a sample standard image of the target object in the target pose; the sample skeleton diagram at least comprises a limb tail end skeleton.
It should be noted that, in the case that the target object is an entity object, an image captured by the depth camera may be obtained, and positions of key points for describing the pose may be predetermined; further, based on the image shot by the depth camera, UV coordinates of each key point position of the target object under the corresponding pose and a depth value Z of each key point position are determined, and a corresponding sample skeleton map and a sample depth map are generated according to the UV coordinates and the depth value Z.
In the case that the target object is a virtual object, referring to fig. 2B, which is a schematic diagram of a process of generating a training sample set in an embodiment of the present application, a process of generating a training sample set for a virtual object is described below with reference to fig. 2B:
Step 2011: the processing device acquires sample standard maps and three-dimensional coordinate sets of a target object under different poses, where one three-dimensional coordinate set includes: the three-dimensional coordinates corresponding to each key point under one pose.
In the embodiment of the application, because a corresponding target image generation model needs to be trained for each target object with an image generation requirement, when the training sample set is generated for a target object, it needs to be constructed based on images of that target object under different poses.
Specifically, for a virtual object used in business, the processing device performs cloth resolving on the virtual object in various poses in the constructed virtual space to obtain virtual object rendering images, and further derives the three-dimensional world coordinates of the selected key point positions in the virtual space under each pose.
It should be noted that, in the embodiment of the present application, the selected key point positions at least include: eye key points, nose key points, shoulder joint points, elbow joint points, wrist joint points, hip joint points, knee joint points, ankle joint points, finger joint points, and some facial key points. In addition, deriving the three-dimensional coordinates of different key point positions of the virtual object from the constructed virtual space is within ordinary skill in the art and is not specifically described in the present application.
Step 2012: the processing equipment adopts a preset two-dimensional re-projection technology to process each three-dimensional coordinate set to obtain a sample skeleton map generated based on pixel point coordinates of each key point position under an image coordinate system, and obtains a sample depth map generated based on pixel depth values corresponding to each key point position.
After obtaining the three-dimensional coordinate sets from the three-dimensional coordinates of each key point position of the target object under different poses, the processing device processes each three-dimensional coordinate set with a preset two-dimensional re-projection technology, transforming each three-dimensional coordinate in each set into pixel point coordinates and depth values under the image coordinate system.
Specifically, the two-dimensional re-projection can be carried out with the following formulas.
First, the camera intrinsic matrix and extrinsic matrix used for data conversion are calculated from the camera parameters in the virtual engine. Assume that the camera's XYZ coordinates in the world coordinate system of the virtual space, its rotation angle α about the X axis, rotation angle β about the Y axis, rotation angle γ about the Z axis, the camera focal length f, and the physical size of the photosensitive sensor have been obtained. In actual calculation, these variables need to be adjusted according to the coordinate-axis order and directions of the virtual engine.
The camera extrinsic matrix can be written, with the rotation matrix $R = R_z(\gamma)\,R_y(\beta)\,R_x(\alpha)$ composed from the rotations about the three axes and $C_e$ the translation column of the camera, as:

$$M_{ext} = \begin{bmatrix} R & C_e \end{bmatrix}$$

The camera intrinsic matrix takes the standard pinhole form:

$$K = \begin{bmatrix} f/d_x & 0 & c_x \\ 0 & f/d_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

where $f$ is the camera focal length, $d_x$ and $d_y$ are the physical side lengths of each pixel of the photosensitive sensor, and $(c_x, c_y)$ is the center pixel coordinate of the image.

After obtaining the world coordinates $(x_0, y_0, z_0)$ of a key point position, two-dimensional re-projection can be performed with the following formula to obtain the UV coordinates of the key point position on the image and the corresponding pixel depth value $Z$:

$$Z \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\, M_{ext} \begin{bmatrix} x_0 \\ y_0 \\ z_0 \\ 1 \end{bmatrix}$$

After solving based on the above formula, the three-dimensional coordinates $(x_0, y_0, z_0)$ are projected to pixel point coordinates in the image coordinate system together with the depth value at the corresponding pixel point.
In the embodiment of the present application, when performing the two-dimensional re-projection calculation, considering that the virtual objects involved in the present application exist in a virtual space, the translation term $C_e$ in the camera extrinsic parameters is solved in a manner different from the conversion used on a real-world coordinate system; the present application creatively determines $C_e$ in a form adapted to the virtual space, which better meets the conversion requirements in the virtual space, shows a very good conversion effect in practice, and improves the effectiveness of converting the world coordinate system in the virtual space into the image coordinate system.
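As a minimal illustration of the above re-projection, the following NumPy sketch projects a set of key point positions into the image plane; the rotation composition order, the illustrative intrinsic values, and the handling of $C_e$ are assumptions rather than the patent's exact implementation:

```python
import numpy as np

def reproject(points_xyz, K, R, C_e):
    """Project (N, 3) world coordinates of key point positions.

    Returns (N, 2) UV pixel coordinates and (N,) pixel depth values Z.
    """
    ext = np.hstack([R, C_e.reshape(3, 1)])            # 3x4 extrinsic [R | C_e]
    homog = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])
    proj = (K @ ext @ homog.T).T                       # rows are Z * [u, v, 1]
    z = proj[:, 2]
    uv = proj[:, :2] / z[:, None]                      # perspective division
    return uv, z

# Illustrative intrinsic matrix built from f, d_x, d_y, c_x, c_y (assumed values):
f, dx, dy, cx, cy = 35.0, 0.01, 0.01, 512.0, 512.0
K = np.array([[f / dx, 0, cx], [0, f / dy, cy], [0, 0, 1]])
```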
Further, for each three-dimensional coordinate set, the processing device generates a sample skeleton map based on the two-dimensional coordinates, under the image coordinate system, of each key point contained in that set.
Specifically, after obtaining the positions of each key point in the three-dimensional coordinate set and projecting the positions to each pixel point coordinate in the image coordinate system, the processing equipment restores the skeleton distribution under the corresponding pose by connecting the pixel points corresponding to each pixel point coordinate, and obtains a sample skeleton diagram with the same size as the corresponding sample standard diagram.
In the embodiment of the application, in the process of generating a two-dimensional sample skeleton diagram, processing equipment marks pixel points corresponding to each key point position on an image according to UV coordinates (namely pixel point coordinates under an image coordinate system) of each key point position; and then, connecting the pixel points capable of restoring skeleton distribution, correspondingly drawing each section of skeleton, and finally obtaining a sample skeleton diagram.
In the embodiment of the application, when the processing device connects each pixel point to generate the sample skeleton graph, the processing device can directly connect the relevant pixel points into line segments according to actual processing requirements to obtain each segment of skeleton; alternatively, connection lines, such as double-arc connections, may be established between the associated pixels, enabling the bone distribution to be highlighted.
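For illustration, a minimal OpenCV sketch of this drawing step is given below; the bone connection list and image size are assumptions, and a production version would also draw the face and finger bones described above:

```python
import cv2
import numpy as np

# Assumed key point index pairs forming bones (e.g. shoulder-elbow, elbow-wrist).
BONES = [(5, 7), (7, 9), (6, 8), (8, 10)]

def draw_skeleton(uv, image_hw):
    """uv: (N, 2) pixel point coordinates of the projected key point positions."""
    canvas = np.zeros((image_hw[0], image_hw[1], 3), dtype=np.uint8)
    for a, b in BONES:
        # Connect the pixel points into line segments to restore bone distribution.
        pt_a = tuple(int(c) for c in uv[a])
        pt_b = tuple(int(c) for c in uv[b])
        cv2.line(canvas, pt_a, pt_b, color=(255, 255, 255), thickness=2)
    return canvas
```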
For example, referring to fig. 2C, a schematic diagram of details of an action of a target object in one pose is shown in an embodiment of the present application. As can be seen from the content illustrated in fig. 2C, after the processing device performs cloth resolving and art rendering on the target object in the pose 1, a corresponding sample standard chart can be obtained, and at the same time, a three-dimensional coordinate set for describing the pose 1 can be derived according to the distribution situation of each key point when the target object exists in the virtual space in the form of the pose 1.
For example, referring to fig. 2D, which is a schematic diagram of a process of generating a sample skeleton diagram in the embodiment of the present application, according to the content illustrated in fig. 2D, after a three-dimensional coordinate set of a target object in a pose in a virtual space is obtained, a two-dimensional reprojection technology is adopted to reproject each key point position in the pose into an image plane, so as to obtain a distribution situation of each key point position in a corresponding image coordinate system, that is, a pixel point corresponding to each key point position in the image coordinate system can be determined; and the processing equipment restores the distribution condition of different bones under the corresponding pose by connecting the pixel points at different positions into bones, so as to obtain a corresponding sample skeleton diagram.
In this way, the plane distribution condition of each corresponding key point position after the target object in one pose is projected to an image coordinate system can be determined based on the three-dimensional coordinate set of each key point position of the target object in the pose, and the skeleton distribution in the pose can be restored by connecting each corresponding pixel point of each key point; moreover, by introducing consideration to the extremity skeleton such as the hand, the hand posture details can be effectively restored, and the posture restoring effect to the target object is improved.
The processing device generates a sample skeleton map corresponding to a target object under one pose, and simultaneously generates a corresponding sample depth map according to pixel depth values corresponding to each key point position determined by adopting a two-dimensional re-projection technology.
Specifically, after projecting each key point position in the three-dimensional coordinate set to the image coordinate system, the processing device obtains the pixel point coordinates and pixel depth values corresponding to each key point position; it then constructs an initial depth map matched with the image coordinate system and, based on the pixel depth values, combined with the pixel value differences determined over the pixel point range to which each pixel coordinate belongs, adjusts the pixel values of the corresponding pixels in the initial depth map to obtain the sample depth map.
In a possible implementation, when generating the sample depth map, the processing device may first create a black background map (i.e., all pixel values are 0) with the same size as the two-dimensional sample skeleton map, and initialize the pixel value at the pixel point corresponding to each key point position to the corresponding pixel depth value. Then, a Gaussian distribution with radius N and mean M is generated at the pixel point position corresponding to each key point position, giving a Gaussian distribution of pixel value coefficients within the corresponding pixel point range, where the values of N and M are set according to actual processing requirements (for example, N is 25 and M is 1) and the coefficient differences between different pixel points represent the pixel value differences between them. Multiplying the pixel value coefficients within the pixel point range by the pixel depth value at that pixel point position yields the pixel values at the different positions within the range, and hence the corresponding sample depth map; the unit of the pixel depth value may be meters, and the pixel values at different positions in the sample depth map represent the differentiated depth values corresponding to different pixel positions.
Optionally, considering that the value range of the pixel depth values may differ from the value range of pixel values, the pixel depth values at different pixel points may first be normalized; the normalized pixel depth value is then multiplied by the pixel value coefficient of the corresponding position, and the result is multiplied by the value range of the pixel values to obtain the pixel value of the corresponding position.
For example, assume that under pose 2 the target object's key point position 1 corresponds to pixel point 1, and the pixel depth value at pixel point 1 is 1.5 meters. Considering that the pixel depth value may take at most 10 meters, all depth values are divided by 10 for normalization, giving the processed value at pixel point 1; then a Gaussian kernel with mean 1 and a radius of 25 pixels is generated centered on pixel point 1, multiplied as a whole by the depth value of pixel point 1, and drawn into the image to obtain the depth information map of this key point.
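A hedged NumPy sketch of this depth map construction follows; the Gaussian shape (radius 25, peak coefficient 1), the 10-meter normalization from the example above, and the overlap handling are assumptions:

```python
import numpy as np

def make_depth_map(uv, z, image_hw, radius=25, max_depth=10.0):
    """uv: (N, 2) key point pixels; z: (N,) pixel depth values in meters."""
    depth_map = np.zeros(image_hw, dtype=np.float32)       # all-black background
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2 * (radius / 3.0) ** 2))  # peak value 1
    for (u, v), d in zip(uv.astype(int), z / max_depth):   # normalize depth values
        patch = kernel * d          # pixel value coefficient times depth value
        t, l = v - radius, u - radius
        t0, l0 = max(t, 0), max(l, 0)
        t1 = min(v + radius + 1, image_hw[0])
        l1 = min(u + radius + 1, image_hw[1])
        # Keep the larger value where Gaussian patches of nearby key points overlap.
        depth_map[t0:t1, l0:l1] = np.maximum(
            depth_map[t0:t1, l0:l1], patch[t0 - t:t1 - t, l0 - l:l1 - l])
    return depth_map
```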
For another example, refer to fig. 2E, which illustrates a sample depth map generated in an embodiment of the present application. As shown in fig. 2E, the processing device generates a sample skeleton map with the same size as the corresponding sample standard map, and a sample depth map with the same size as the sample skeleton map. In the generated sample depth map, the pixel values at the pixel points corresponding to each key point in the initially all-black depth map are set to the corresponding pixel depth values, so that the pixel point range can be determined with the pixel point corresponding to each key point as the center; a Gaussian distribution of pixel value coefficients is then generated according to the preset Gaussian radius and mean, and the final pixel value at each position is obtained by computing the product of the pixel value coefficient and the pixel value of the corresponding position.
In this way, according to the pixel depth values of the corresponding pixel points after the projection of each key point, not only the difference distance between each key point and the origin of the camera coordinate system can be represented, but also the relative depth difference of different pixel points can be expressed, so that the distribution situation of the key points between similar actions can be effectively expressed, and the accuracy of gesture indication is improved; in addition, by determining the pixel point positions corresponding to the key point positions and determining the pixel point ranges, the influence of the key point positions can be enlarged in the generated sample depth map, which is equivalent to enlarging the positions corresponding to the key point positions, avoiding single-pixel point identification and reducing the detection difficulty.
Step 2013: the processing device generates a training sample set based on the sample standard graph, the sample skeleton graph and the sample depth graph corresponding to different poses.
After obtaining the sample standard map, sample skeleton map, and sample depth map of the target object under different poses, the processing device first selects a sample reference map from the sample standard maps corresponding to the different poses, and then combines the sample reference map with the sample standard maps, sample skeleton maps, and sample depth maps corresponding to each pose other than the one represented by the sample reference map, yielding the individual training samples; the generated training samples together form the training sample set.
Alternatively, the processing device may take the sample standard map under each pose in turn as the sample reference map, combining it with the sample standard map, sample skeleton map, and sample depth map under every other pose to obtain the training samples.
For example, assuming there are sample standard maps, sample skeleton maps, and sample depth maps of target object 1 under poses 1-5, training samples can be generated by taking the sample standard map corresponding to pose 1 as the sample reference map and combining it with the sample standard map, sample skeleton map, and sample depth map under each of the other poses, yielding 4 training samples.
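As an illustration of this pairing scheme, the following small sketch (hypothetical names, Python) builds training samples by taking each pose's standard map as the reference and pairing it with every other pose's skeleton and depth maps:

```python
def build_training_set(poses):
    """poses: dict pose_id -> (standard_img, skeleton_map, depth_map)."""
    samples = []
    for ref_id, (ref_img, _, _) in poses.items():
        for tgt_id, (std_img, skel_map, depth_map) in poses.items():
            if tgt_id == ref_id:
                continue
            # (reference, target skeleton, target depth, target ground truth)
            samples.append((ref_img, skel_map, depth_map, std_img))
    return samples  # pose 1 as reference yields 4 samples; 5 poses yield 20 in total
```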
Therefore, a training sample set can be established in which the sample skeleton map and the sample depth map jointly indicate the keypoint positions under the target pose; moreover, the sample skeleton maps in the training sample set take the limb-end skeletons into account, which is equivalent to integrating more learnable factors into the training samples and provides a training basis for obtaining an effective image generation model.
Step 202: the processing device performs multiple rounds of iterative training on the pre-trained image generation model using the training sample set, and outputs the trained target image generation model.
In the embodiment of the present application, in order to save training time according to actual processing requirements, the processing device may first perform multiple rounds of iterative pre-training on an initial image generation model to obtain the pre-trained image generation model, then perform multiple rounds of iterative training on the pre-trained model, and output the trained target image generation model.
Referring to fig. 2F, a schematic diagram of the image generation model as initially constructed in an embodiment of the present application: the model is obtained by adjusting the algorithm and structure of an action migration model with a Neural-Texture-Extraction-Distribution (NTED) structure. As shown in fig. 2F, the constructed image generation model includes: a first encoding network configured with a convolutional attention layer, a second encoding network configured with a convolutional attention layer and an image fusion layer, and a multi-scale decoding network configured with a convolutional attention layer, wherein,
1) A first encoding network configured with a convolved attention layer.
This corresponds to the skeleton encoder (The Skeleton Encoder) of fig. 2F with a lightweight attention module (Convolutional Block Attention Module, CBAM) connected to it, where the CBAM module is also referred to as the convolutional attention layer.
2) A second encoding network configured with a convolved attention layer and an image fusion layer.
This corresponds to the reference image encoder (The Reference Encoder) in fig. 2F; the reference image encoder is connected to a CBAM and has a built-in image fusion layer, also called the depth map fusion convolution layer, which is used to fuse the planar skeleton map with the keypoint depth map.
Specifically, during training, the sample skeleton map and the sample depth map, which share the same image size, are spliced in the channel dimension and input into the second encoding network, and the fusion of the two maps is carried out inside the second encoding network.
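The channel-dimension splice can be illustrated with a short PyTorch sketch; the tensor shapes and channel counts are assumptions for illustration:

```python
import torch

skeleton = torch.randn(1, 3, 512, 256)   # sample skeleton map (RGB channels)
depth = torch.randn(1, 1, 512, 256)      # sample depth map (single channel)
fused_input = torch.cat([skeleton, depth], dim=1)  # shape (1, 4, 512, 256)
# fused_input is then fed to the second encoding network, whose first
# convolution (the "depth map fusion convolution layer") mixes the channels.
```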
3) A multi-scale decoding network configured with a convolved attention layer.
This corresponds to the portion outlined by the thick dashed box in fig. 2F. The network structure includes the target image generator (The Target Image Renderer), NTED modules, and convolution modules (Conv Blocks). The NTED modules extract the spatial texture features of the input image and map them into the feature distribution corresponding to the target pose; the Conv Blocks are modules formed by stacking convolution layers, and the constructed image generation model includes convolution modules at multiple sizes such as 16×8, 32×16, …, 512×256, 1024×512, and 2048×1024, used respectively to extract and fuse deep features from images of different sizes; tRGB uses a convolution layer to convert the deep feature matrix into an RGB image with 3 channels, and the Upsample part upsamples the image.
Continuing with fig. 2F, in the image generation model constructed by the present application, adding the CBAM layer to the first and second encoding networks helps the model extract features of target objects at different scales; in the multi-scale decoding network, adding the CBAM to the last two layers close to the image output facilitates training and improves the detail of image generation.
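For reference, a compact sketch of a CBAM-style convolutional attention layer (channel attention followed by spatial attention) is given below in PyTorch; the reduction ratio and the 7×7 spatial kernel follow common CBAM practice and are assumptions, not values fixed by this description:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(            # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention: average- and max-pool over space, shared MLP.
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: pool over channels, then a 7x7 convolution.
        attn = self.spatial(torch.cat([x.mean(1, keepdim=True),
                                       x.amax(1, keepdim=True)], dim=1))
        return x * torch.sigmoid(attn)
```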
In the embodiment of the present application, when pre-training the initial image generation model, the processing device acquires a designated data set and obtains the sample depth map corresponding to each sample skeleton map by applying monocular depth estimation to each sample skeleton map in the data set, where the data set includes a sample standard map and a sample skeleton map of each sample object under different poses; a pre-training sample set is then constructed based on the sample standard maps, sample skeleton maps, and sample depth maps obtained from the data set, multiple rounds of iterative training are performed on the initial image generation model based on the pre-training sample set, and the pre-trained image generation model is output.
Specifically, considering that the amount of data rendered for the target object is limited, a large amount of planar human body image data may be used in advance to pre-train the initial image generation model, in order to enhance the generalization of the model that implements motion migration. To this end, the present application may acquire a human pose estimation data set with a large data volume, and screen out planar images containing a single person whose area ratio exceeds a set threshold to generate training data; the selected data set may be the COCO data set or Human3.6M, etc., which the present application does not specifically limit. The images in the data set include images of persons under different poses (i.e., sample standard maps containing sample objects) and planar skeleton maps under the corresponding poses (i.e., sample skeleton maps).
In the embodiment of the present application, since the data set consists of planar images, a monocular depth estimation algorithm can be used when generating the corresponding depth maps to obtain the depth value of each annotated keypoint, thereby obtaining UVZ data for each keypoint position in each person image; a pre-training sample set is generated from the UVZ data, and training on it yields a pre-trained model with stronger generalization.
It should be noted that, in the embodiment of the present application, the monocular depth estimation algorithm is adopted because data sets with rich person poses currently provide, in essence, only two-dimensional keypoint annotations, i.e., UV coordinates; the present application, however, considers the UVZ coordinates of the keypoints, so the distance between the camera and the pixel corresponding to each keypoint of the person in the image is predicted by a monocular depth estimation algorithm, yielding a Z value for each keypoint position and thereby synthesizing the required UVZ data.
In addition, for the monocular depth estimation algorithm adopted by the present application, a training set of RGB images and depth maps acquired by an RGBD camera can be obtained and used to train a deep convolutional neural network that predicts the depth value of each pixel, so that the depth information of an RGB image can be complemented well by the monocular depth estimation function realized by that network.
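A minimal sketch of synthesizing UVZ labels from 2D keypoint annotations plus a monocular depth prediction might look as follows; `estimate_depth` stands for any monocular depth estimation network and is a hypothetical wrapper, and keypoints are assumed to lie inside the image:

```python
import numpy as np

def synthesize_uvz(image, keypoints_uv, estimate_depth):
    """keypoints_uv: (K, 2) array of pixel coordinates for K annotated keypoints."""
    depth = estimate_depth(image)   # (H, W) per-pixel depth prediction
    uvz = []
    for u, v in keypoints_uv.astype(int):
        z = float(depth[v, u])      # sample the predicted depth at the keypoint
        uvz.append((u, v, z))
    return np.asarray(uvz)          # (K, 3) UVZ labels for the pre-training set
```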
In the embodiment of the present application, by processing the acquired data set, the processing device obtains the sample standard map, sample skeleton map, and sample depth map corresponding to each sample object under the different poses in the data set; by constructing pre-training samples from the sample standard map, sample skeleton map, and sample depth map of the same sample object under different poses, a corresponding pre-training sample set is generated; multiple rounds of iterative training of the initial image generation model are then carried out on the generated pre-training sample set, yielding the pre-trained image generation model.
The model processing procedure executed during pre-training is the same as that executed for the pre-trained image generation model, so the specific processing procedure of pre-training is not repeated here.
Therefore, pre-training the constructed image generation model improves the generalization of the model, reduces the annotation requirements on training samples when training the target image generation model for a specific target object, and increases the training speed of the model.
Further, after obtaining the pre-trained image generation model, the processing device performs multiple rounds of iterative fine-tuning on it until a preset convergence condition is met, and then outputs the trained target image generation model, where the preset convergence condition may be, for example, that the number of training rounds reaches a set value.
Referring to fig. 2G, a schematic diagram of the model training process in an embodiment of the present application, a single round of iterative training is described below with reference to fig. 2G, taking the training of the pre-trained image generation model as an example:
Step 2021: based on the sample skeleton map and the sample depth map contained in the selected training sample, the processing device performs motion migration processing on the target object in the contained sample reference map according to the corresponding target pose, obtaining a prediction standard map.
Specifically, the image generation model includes: a first encoding network configured with a convolutional attention layer, a second encoding network configured with a convolutional attention layer and an image fusion layer, and a multi-scale decoding network configured with a convolutional attention layer. When executing step 2021, the processing device inputs the sample reference map contained in the selected training sample into the first encoding network to obtain encoded reference image features; splices the sample skeleton map and sample depth map contained in the training sample in the channel dimension and inputs them into the second encoding network to obtain encoded and fused skeleton action features; and then uses the multi-scale decoding network to decode the reference image features under the guidance of the skeleton action features, obtaining the prediction standard map after motion migration is completed.
In the embodiment of the present application, the processing device uses the first encoding network in the image generation model to encode the sample reference map of the target object, obtaining the reference image features corresponding to the sample reference map; meanwhile, it uses the second encoding network in the image generation model to encode and fuse the sample skeleton map and sample depth map of the target object, obtaining skeleton action features that describe the target pose; then, by means of the multi-scale decoding network, the migration of the target object from the pose of the reference image features to the target pose is guided by the skeleton action features, yielding the prediction standard map output by the model.
In this way, by means of the first encoding network, second encoding network, and multi-scale decoding network containing the CBAM, the model can learn the motion migration of the target action; taking the sample depth map as part of the model input means the two-dimensional sample skeleton map and the two-dimensional sample depth map can be input simultaneously, which effectively introduces the three-dimensional information of each keypoint position, increases the information content of the model's input data, and helps the model learn to realize motion migration more accurately.
Step 2022: the processing device adjusts the model parameters in the image generation model based on the multi-scale global comprehensive difference loss between the prediction standard map and the sample standard map, combined with the local difference loss within the specified image regions between the two maps.
When executing step 2022, the processing device calculates a model loss value based on the image difference between the prediction standard map generated by the image generation model and the corresponding sample standard map in the selected training sample, and then adjusts the model parameters of the image generation model according to the model loss value.
In some possible implementations, the processing device calculates the multi-scale global comprehensive difference loss between the prediction standard map and the sample standard map, calculates the local difference loss within the specified image regions between them, and then takes the weighted superposition of the global comprehensive difference loss and the local difference loss as the model loss value by which the model parameters are adjusted.
In other possible embodiments, when the image generation model is trained as the generator in a generative adversarial structure, after the prediction standard map is obtained, a preset generative adversarial loss function is used to obtain the corresponding adversarial loss from the prediction standard map and the corresponding sample standard map; the model parameters of the image generation model are then adjusted based on the adversarial loss and the global comprehensive difference loss between the prediction standard map and the sample standard map, combined with the local difference loss within the specified image regions.
In this way, introducing the global comprehensive difference loss and the local difference loss allows both local image differences and overall differences between images to be considered effectively, which helps the model learn to restore pose details in the generated images; additionally introducing the generative adversarial loss allows the picture generation quality to be further evaluated by means of the generator-discriminator training framework, improving the training effect of the model.
In the embodiment of the present application, when determining the local difference loss between the prediction standard map and the sample standard map, the processing device determines, in each of the two maps, the target keypoint positions used to locate the sub-image regions, and crops out, from each map, the specified image region containing the several sub-image regions based on the determined target keypoint positions; the corresponding local difference loss is then obtained from the pixel value differences and image feature differences within each sub-image region.
Specifically, when calculating the local difference loss, the processing device first selects the local regions to be considered in the sample standard map and the prediction standard map, for example the face region and the hand regions; the corresponding local regions are then located in each map according to the human body keypoints, and cropped out of the sample standard map and the prediction standard map respectively.
For example, if the preset local regions are the face image region and the hand image regions, then to crop the face image region, a rectangular box is first generated from the eye keypoints among the selected keypoint positions, based on the line connecting them, so as to delimit the face region, which is then cropped to obtain the face image regions (i.e., sub-image regions) of the sample standard map and the prediction standard map respectively; similarly, the finger joint points can be selected and the hand image regions delimited according to them, so that the hand image regions (i.e., sub-image regions) of the sample standard map and the prediction standard map are obtained by cropping.
Furthermore, according to actual processing needs, the processing device may use an L1 loss function to calculate the pixel value difference loss and the image feature difference loss of the pixels in each sub-image region, and then obtain, from the pixel value difference loss and image feature difference loss of each sub-image region, the local difference loss corresponding to the specified region containing those sub-image regions.
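A sketch of this local difference loss, assuming PyTorch and region boxes already located from keypoints, is given below; the box format, `feat_net` (any fixed feature extractor), and the weights are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def local_loss(pred, target, boxes, feat_net, w_pix=1.0, w_feat=1.0):
    """boxes: list of (top, left, h, w) regions, e.g. face and hands."""
    total = pred.new_zeros(())
    for top, left, h, w in boxes:
        p = pred[..., top:top + h, left:left + w]
        t = target[..., top:top + h, left:left + w]
        total = total + w_pix * F.l1_loss(p, t)                       # pixel value difference
        total = total + w_feat * F.l1_loss(feat_net(p), feat_net(t))  # image feature difference
    return total
```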
Therefore, introducing the local difference loss allows the differences in local regions between images to be considered effectively, guiding the model to restore pose details in the generated images during learning and training, and improving the image generation effect.
In the embodiment of the present application, when determining the global comprehensive difference loss between the prediction standard map and the sample standard map, the processing device obtains the global pixel value loss from the pixel value difference of each pixel between the two maps, and obtains the multi-scale feature loss from the image feature differences between the two maps at several preset scales; the global pixel value loss and the multi-scale feature loss are then combined to obtain the global comprehensive difference loss.
Specifically, when calculating the multi-scale feature loss, the processing device may rely on a Visual Geometry Group (VGG) network: the prediction standard map and the sample standard map are each input into the VGG network to obtain the image features it outputs at several preset scales; an L1 loss function is then used to calculate the image feature difference between the two maps at each scale, finally yielding the multi-scale feature loss.
When calculating the global pixel value loss, the processing device uses the L1 loss function to determine the corresponding global pixel value loss from the pixel value difference between the prediction standard map and the sample standard map.
Furthermore, by calculating the weighted superposition of the multi-scale feature loss and the global pixel value loss, the corresponding global comprehensive difference loss is finally obtained.
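Putting the two parts together, a sketch of the global comprehensive difference loss using VGG-19 features is shown below; the layer indices, loss weights, and the assumption that inputs are already normalized to ImageNet statistics are illustrative choices, and a recent torchvision is assumed:

```python
import torch
import torch.nn.functional as F
import torchvision

class GlobalLoss(torch.nn.Module):
    def __init__(self, layers=(3, 8, 17, 26), w_pix=1.0, w_feat=1.0):
        super().__init__()
        # Frozen VGG-19 feature stack; the chosen indices correspond to
        # relu1_2, relu2_2, relu3_4, relu4_4 (an assumed set of scales).
        self.vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.layers = set(layers)
        self.w_pix, self.w_feat = w_pix, w_feat

    def features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layers:
                feats.append(x)
        return feats

    def forward(self, pred, target):
        loss = self.w_pix * F.l1_loss(pred, target)        # global pixel value loss
        for fp, ft in zip(self.features(pred), self.features(target)):
            loss = loss + self.w_feat * F.l1_loss(fp, ft)  # multi-scale feature loss
        return loss
```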
Thus, by means of the global comprehensive difference loss, the differences between images can be considered effectively as a whole, taking into account the combined influence of differences at both the pixel value level and the image feature level, so that the model is adjusted in the direction of reducing both the pixel value difference and the image feature difference.
In the embodiment of the present application, the pre-training of the image generation model that implements the motion migration function uses existing real human image data sets; considering that the pre-trained model's fit to the specific target object may not be high enough, the pre-trained image generation model is further optimized after the training sample set is generated from the processed target object data, so as to improve the model's fit to that data. When optimizing the model, in order to prevent overfitting on the target object data from making the model forget its pre-training knowledge, the number of training periods can be controlled and the model learning rate reduced, where the number of training periods is set according to actual processing requirements and is not specifically limited by the present application.
Specifically, the processing device may determine the learning rate used in each iteration of the pre-trained image generation model in any of the following ways:
Mode 1: the learning rate is calculated using a cosine annealing algorithm.
Specifically, the processing device may use a preset cosine annealing algorithm to determine the learning rate value for each training period based on a preset initial learning rate, and determine the target learning rate for the current iteration according to the training period it belongs to, where one training period includes at least one iteration.
On this basis, the processing device determines the learning rate for each training period by cosine annealing, so that the learning rate is periodically adjusted to the value corresponding to the training period and gradually decreases as the training periods advance.
Mode 2: the learning rate is calculated based on a preset learning rate decay function.
Specifically, the processing device determines the learning rate value for each training period based on a preset initial learning rate and a learning rate decay coefficient, and determines the target learning rate for the current iteration according to the training period it belongs to, where one training period includes at least one iteration.
For example, with a learning rate decay coefficient of 0.5, the same learning rate is used for every iteration within a training period, and for two adjacent training periods, the learning rate of the former is twice that of the latter.
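Both modes can be sketched in a few lines; the initial rate, minimum rate, and decay factor below are illustrative assumptions:

```python
import math

def cosine_annealed_lr(period_idx, num_periods, lr_init=2e-4, lr_min=1e-6):
    # Mode 1: the rate follows a cosine curve from lr_init down to lr_min.
    t = period_idx / max(num_periods - 1, 1)
    return lr_min + 0.5 * (lr_init - lr_min) * (1 + math.cos(math.pi * t))

def step_decayed_lr(period_idx, lr_init=2e-4, decay=0.5):
    # Mode 2: each period's rate is the previous one's scaled by the decay
    # coefficient (decay=0.5 halves it, matching the example above).
    return lr_init * decay ** period_idx

# All iterations within one training period share that period's rate, e.g.:
# lr = cosine_annealed_lr(iteration // iters_per_period, total_periods)
```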
Therefore, adjusting the learning rate used for model training can, to a certain extent, prevent the model from overfitting on the target object data, avoid the image generation model forgetting its pre-training knowledge, and safeguard the training effect of the model.
Further, after the target image generation model has been trained from the pre-trained image generation model, the processing device can perform motion migration of the target object in an image using the target image generation model, obtaining a target image of the target object under a specified pose.
The processing device acquires a reference image of the target object under a reference action, together with a planar skeleton map and a planar depth map of the target object under the specified pose, where the planar skeleton map includes the hand skeleton; the target image generation model is then used to perform motion migration on the reference image based on the planar skeleton map and planar depth map, obtaining the target image of the target object under the specified pose.
Specifically, after obtaining the optimized target image generation model, the processing device can generate images offline according to actual processing requirements. In a specific generation process, the processing device first acquires the three-dimensional coordinate set corresponding to the keypoint positions of the target object under the specified pose.
Then, the processing device reprojects the three-dimensional coordinates of each keypoint position in the three-dimensional coordinate set to obtain UVZ coordinates, and synthesizes the corresponding planar skeleton map and planar depth map; the reference image of the target object under the reference pose, together with the planar skeleton map and planar depth map for the specified pose, are then input into the optimized target image generation model to obtain the planar image of the target object under the specified pose.
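A sketch of the reprojection step under a pinhole camera model, assuming the intrinsics K and extrinsics (R, t) are available from the virtual engine, and taking the camera-space Z as the pixel depth value (an assumption):

```python
import numpy as np

def reproject_uvz(points_3d, K, R, t):
    """points_3d: (N, 3) world coordinates; K: (3, 3); R: (3, 3); t: (3,)."""
    cam = points_3d @ R.T + t   # world -> camera coordinates
    z = cam[:, 2:3]             # pixel depth values along the optical axis
    uv_h = cam @ K.T            # apply the intrinsics
    uv = uv_h[:, :2] / z        # perspective divide -> pixel (UV) coordinates
    return np.concatenate([uv, z], axis=1)  # (N, 3) UVZ coordinates
```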
It should be noted that when a three-dimensional coordinate set under the specified pose exists for another object but not for the target object, skeleton redirection processing may be performed on the other object's coordinate set indicating the specified pose, so as to obtain the three-dimensional coordinate set of the target object under the specified pose.
In particular, when the resolution of the generated planar image is insufficient, the processing device may use a super-resolution algorithm to increase the image resolution before outputting the final image; the super-resolution and redirection algorithms are conventional in the art and are not described in detail in the present application.
In addition, in some possible implementation scenarios of the present application, a desired pose sequence indicating different poses of the target object may be obtained, where each pose in the sequence corresponds to a reference image and a three-dimensional coordinate set indicating the desired pose; the processing device can then generate the target image corresponding to each desired pose, finally producing the target image sequence corresponding to the desired pose sequence, so that a pose change video of the target object is obtained when the target images are played back continuously.
Therefore, the target object data used to train the image generation model only needs to be rendered in a single batch, after which target object image materials in various specified poses can be generated offline without additional reliance on external art resources; this effectively reduces the time and equipment cost of rendering images and solving cloth simulation, and improves long-term production efficiency. Moreover, introducing the planar depth map allows the self-occlusion problem to be handled effectively and the target object under the specified pose to be restored faithfully; taking the limb-end skeleton into account during training at least adds the ability to handle the hand-end skeleton, making the gesture in the generated image controllable; considering the local difference loss enables further optimization of local features, alleviating the problem of low image quality caused by overly small regions; and the changed model structure allows images of sizes such as 2048×1024 to be processed in the model, greatly increasing the resolution of the generated images, which can reach 1080p.
The training and application processes involved in the embodiment of the present application are illustrated below in combination with a specific application scenario, taking a virtual person as the target object:
referring to fig. 3A, which is a schematic diagram of a processing procedure in a training stage and an application stage of a target image generation model according to an embodiment of the present application, it can be known from the content illustrated in fig. 3A that, in the training stage, a processing device renders a large number of planar images of a virtual person in a virtual space, and derives coordinates of each key point corresponding to different planar images, so as to obtain a corresponding three-dimensional coordinate set, and generate a training sample set; and training the pre-trained image generation model by using the training sample set, and outputting the trained target image generation model.
Referring to fig. 3B, a schematic diagram of a single round of iterative training in an embodiment of the present application: as shown in fig. 3B, when performing one round of iterative training on the pre-trained image generation model, the processing device selects a training sample, inputs the sample reference map of the training sample into the first encoding network, splices the sample skeleton map and sample depth map of the training sample in the channel dimension and inputs them into the second encoding network, and obtains the prediction standard map output by the multi-scale decoding network.
Then, continuing with fig. 3B, when the image generation model is trained as the generator in a generative adversarial structure, the processing device inputs the prediction standard map and the sample standard map into the discriminator respectively to obtain the corresponding adversarial loss, where the calculation of the adversarial loss is conventional in the art and is not described in detail here. In addition, the processing device calculates an image pixel difference loss (also referred to as the global pixel value loss) from the image pixel differences between the prediction standard map and the sample standard map. Meanwhile, after cropping the face image region and the hand image regions out of the prediction standard map and the sample standard map respectively, the processing device obtains the corresponding local difference loss by weighting the pixel value difference loss and image feature difference loss of the face image region together with those of the hand image regions. The processing device can also input the prediction standard map and the sample standard map into a preset VGG network to obtain their multi-scale image features, and finally obtains the multi-scale feature loss by calculating the image feature difference losses over those multi-scale features.
Further, in the training process illustrated in fig. 3B, the calculated losses are weighted to obtain the model loss, and the model parameters of the image generation model are adjusted according to it.
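The weighted superposition in fig. 3B can be sketched as a single function; the weight values are illustrative assumptions:

```python
def model_loss(adv, pixel, local, feat,
               w_adv=1.0, w_pix=10.0, w_local=5.0, w_feat=10.0):
    # Weighted superposition of the adversarial, global pixel value, local
    # difference, and multi-scale feature losses used to update the generator.
    return w_adv * adv + w_pix * pixel + w_local * local + w_feat * feat
```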
Continuing with fig. 3A, in the application stage, the processing device first prepares a desired pose sequence of the virtual person, where each desired pose is associated with a three-dimensional coordinate set indicating the keypoint positions; the processing device then determines the reference image corresponding to the virtual person, and determines the corresponding planar skeleton map and planar depth map for each desired pose; using the trained target image generation model, a target image is obtained from the reference image and the planar skeleton map and planar depth map of each desired pose; finally, the processing device obtains the target image sequence corresponding to the desired pose sequence.
Referring to fig. 3C, an overall structure diagram of training the target image generation model in an embodiment of the present application: the content shown in fig. 3C is divided into a pre-training stage, a stage of constructing the training sample set during optimization, a model optimization stage, and an application stage.
In the pre-training stage, the processing device processes the data set, stores the UV coordinates of the keypoint positions in each image, and obtains the depth value of each keypoint using a monocular depth estimation algorithm; the initial image generation model is then trained on the image under the reference pose, the image under the target pose, the skeleton map, and the depth map, yielding the pre-trained image generation model.
In the stage of constructing the training sample set during optimization, the processing device preprocesses the training data: cloth simulation and art rendering are performed on the virtual person under various poses to obtain the corresponding planar images, and the camera parameters used for rendering and the three-dimensional coordinate set under each pose are stored; the three-dimensional skeleton represented by each coordinate set is then reprojected according to the camera parameters to obtain UVZ coordinates, the skeleton map is generated from the pixels at the projected UV positions, and the depth map is generated by determining the pixel depth values at those positions from the UVZ values.
In the model optimization stage, the processing equipment adopts a training sample set to optimize the pre-trained image generation model.
In the application stage, the processing device first prepares a desired pose sequence of the virtual person, where each desired pose is associated with a three-dimensional coordinate set indicating the keypoint positions; it determines the reference image corresponding to the virtual person and obtains the corresponding planar skeleton map and planar depth map from the acquired three-dimensional coordinate sets; then, based on the target image generation model fitted to the virtual person data, it generates the corresponding planar image for each desired pose in the sequence; finally, according to actual processing requirements, a super-resolution algorithm can be used to increase the resolution of each target image.
In this way, a batch of images with different actions and completed cloth simulation is first rendered from the virtual person, and the three-dimensional coordinates of each keypoint position in each image are reprojected into two dimensions using the camera parameters of the virtual engine, giving the two-dimensional coordinates and pixel depth value of each keypoint position in the image plane; the deep-convolution-based image generation model is then trained on these images, two-dimensional keypoint coordinates, and pixel depth values. Subsequently, a desired pose sequence of the virtual person can be converted into the corresponding sequence of planar images, with detail taken into account during image generation, avoiding the influence of self-occluded regions and improving both the efficiency and the accuracy of image generation.
Based on the same inventive concept, referring to fig. 4, a schematic logic structure diagram of an image generation model training apparatus according to an embodiment of the present application: the image generation model training apparatus 400 includes an acquisition unit 401 and a training unit 402, wherein,
an acquiring unit 401, configured to acquire a training sample set; a training sample includes: the system comprises a sample reference map of a target object, a sample skeleton map and a sample depth map which indicate the positions of key points of the target object under the target pose, and a sample standard map of the target pose; the sample skeleton diagram at least comprises a limb tail end skeleton;
The training unit 402 is configured to perform multiple rounds of iterative training on the pre-trained image generation model by using a training sample set, and output a trained target image generation model; wherein, in a round of iterative process, the following operations are performed:
based on a sample skeleton diagram and a sample depth diagram contained in the selected training sample, performing action migration processing on a target object in a contained sample reference diagram according to a corresponding target pose to obtain a prediction standard diagram;
and based on the multi-scale global comprehensive difference loss between the prediction standard diagram and the sample standard diagram, combining the local difference loss in the designated image area between the prediction standard diagram and the sample standard diagram, and adjusting model parameters in the image generation model.
Optionally, the image generation model includes: a first encoding network configured with a convolution attention layer, a second encoding network configured with a convolution attention layer and an image fusion layer, and a multi-scale decoding network configured with a convolution attention layer;
then, based on the sample skeleton map and the sample depth map included in the selected training sample, according to the corresponding target pose, performing motion migration processing on the target object in the included sample reference map, so as to obtain a prediction standard map, where the training unit 402 is configured to:
Inputting a sample reference picture contained in the selected training sample into a first coding network to obtain coded reference picture characteristics;
splicing a sample skeleton diagram and a sample depth diagram contained in a training sample in a channel dimension, and inputting a second coding network to obtain coded and fused skeleton action characteristics;
and decoding the reference image characteristic based on the skeleton action characteristic by adopting a multi-scale decoding network to obtain a prediction standard diagram after finishing action migration.
Optionally, the training sample set is generated in the following manner:
obtaining a sample standard graph and a three-dimensional coordinate set of a target object under different poses, wherein one three-dimensional coordinate set comprises: three-dimensional coordinates corresponding to each key point in one pose;
processing each three-dimensional coordinate set by adopting a preset two-dimensional re-projection technology to obtain a sample skeleton map generated based on pixel point coordinates of each key point position under an image coordinate system, and obtaining a sample depth map generated based on pixel depth values corresponding to each key point position;
and generating a training sample set based on the sample standard graph, the sample skeleton graph and the sample depth graph corresponding to the different poses.
Optionally, when obtaining a sample skeleton map generated based on two-dimensional coordinates of each keypoint location in the image coordinate system, the obtaining unit 401 is configured to:
obtaining the pixel point coordinates of each keypoint position in the three-dimensional coordinate set after projection into the image coordinate system;
and restoring skeleton distribution under the corresponding pose by connecting pixel points corresponding to each pixel point coordinate, so as to obtain a sample skeleton diagram with the same size as the corresponding sample standard diagram.
Optionally, when obtaining a sample depth map generated based on pixel depth values of each keypoint location, the obtaining unit 401 is configured to:
after the positions of all the key points in the three-dimensional coordinate set are projected to an image coordinate system, the coordinates of all the pixel points and the pixel depth values obtained corresponding to the positions of all the key points are obtained;
constructing an initial depth map matched with the image coordinate system, and adjusting the pixel values of the pixels in the initial depth map based on the pixel depth values, combined with the pixel value differences determined over the pixel ranges to which the pixel coordinates belong, to obtain the sample depth map.
Optionally, when the image generation model is trained as the generator in a generative adversarial structure, after obtaining the prediction standard map, the training unit 402 is further configured to:
obtain the corresponding adversarial loss based on the prediction standard map and the corresponding sample standard map, using a preset generative adversarial loss function;
and adjust the model parameters of the image generation model based on the adversarial loss and the global comprehensive difference loss between the prediction standard map and the sample standard map, combined with the local difference loss within the specified image regions between the two maps.
Alternatively, the local discrepancy loss is determined in the following way:
determining, in the prediction standard map and the sample standard map respectively, the target keypoint positions used to locate the sub-image regions, and cropping out, from each map, the specified image region containing the several sub-image regions based on the target keypoint positions determined in that map;
and obtaining corresponding local difference loss based on the pixel value difference and the image characteristic difference in each sub-image region.
Optionally, the global integrated difference loss is determined as follows:
obtaining global pixel value loss based on pixel value differences of all pixel points between a prediction standard chart and a sample standard chart, and obtaining multi-scale feature loss based on image feature differences of the prediction standard chart and the sample standard chart under a plurality of preset scales;
and weighting the global pixel value loss and the multi-scale feature loss to obtain the corresponding global comprehensive difference loss.
Optionally, the training unit 402 performs pre-training of the image generation model in the following manner:
acquiring a designated data set, and obtaining sample depth maps corresponding to each sample skeleton map by monocular depth estimation processing of each sample skeleton map in the data set, wherein the data set comprises a sample standard map and a sample skeleton map of each sample object under different poses;
and constructing a pre-training sample set based on a sample standard diagram, a sample skeleton diagram and a sample depth diagram which are obtained according to the data set, performing multi-round iterative training on an initial image generation model based on the pre-training sample set, and outputting the pre-trained image generation model.
Optionally, the training unit 402 determines the learning rate used in each iteration of the pre-trained image generation model in any of the following ways:
determining a learning rate value corresponding to each training period based on a preset initial learning rate by adopting a preset cosine annealing algorithm, and determining a target learning rate corresponding to the current iteration process according to the training period to which the current iteration process belongs, wherein one training period comprises at least one round of iteration process;
Determining a learning rate value corresponding to each training period based on a preset initial learning rate and a learning rate attenuation coefficient, and determining a target learning rate corresponding to a current iteration process according to the training period to which the current iteration process belongs, wherein one training period comprises at least one round of iteration process.
Optionally, the apparatus further comprises a generating unit 403, where the generating unit 403 is configured to:
acquiring a reference image of a target object under a reference action and a plane skeleton diagram and a plane depth diagram of the target object under a designated pose, wherein the plane skeleton diagram comprises hand skeletons;
and adopting a target image generation model, and performing action migration processing on the reference image based on the plane skeleton map and the plane depth map to obtain a target image of the target object under a specified pose.
Having described the training method and apparatus of the image generation model of the exemplary embodiment of the present application, next, an electronic device according to another exemplary embodiment of the present application is described.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the application may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
Based on the same inventive concept as the above-mentioned method embodiment, an electronic device is further provided in the embodiment of the present application, and referring to fig. 5, which is a schematic diagram of a hardware composition structure of an electronic device to which the embodiment of the present application is applied, the electronic device 500 may at least include a processor 501 and a memory 502. The memory 502 stores program code that, when executed by the processor 501, causes the processor 501 to perform the steps of any one of the image generation model training methods described above.
In some possible implementations, a computing device according to the application may include at least one processor, and at least one memory. Wherein the memory stores program code that, when executed by the processor, causes the processor to perform the steps of training the image generation model according to the various exemplary embodiments of the application described hereinabove. For example, the processor may perform the steps as shown in fig. 2A.
A computing device 600 according to such an embodiment of the application is described below with reference to fig. 6. As shown in fig. 6, computing device 600 is in the form of a general purpose computing device. Components of computing device 600 may include, but are not limited to: the at least one processing unit 601, the at least one memory unit 602, a bus 603 connecting the different system components, including the memory unit 602 and the processing unit 601.
Bus 603 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, and a local bus using any of a variety of bus architectures.
The storage unit 602 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 6021 and/or cache memory 6022, and may further include Read Only Memory (ROM) 6023.
The storage unit 602 may also include a program/utility 6025 having a set (at least one) of program modules 6024, such program modules 6024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The computing device 600 may also communicate with one or more external devices 604 (e.g., keyboard, pointing device, etc.), one or more devices that enable objects to interact with the computing device 600, and/or any devices (e.g., routers, modems, etc.) that enable the computing device 600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 605. Moreover, computing device 600 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 606. As shown, network adapter 606 communicates with other modules for computing device 600 over bus 603. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computing device 600, including, but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
Based on the same inventive concept as the above-described method embodiments, aspects of the training of an image generation model provided by the present application may also be implemented in the form of a program product comprising program code for causing an electronic device to perform the steps in the method of training an image generation model according to the various exemplary embodiments of the application described in the present specification, when the program product is run on an electronic device, e.g. the electronic device may perform the steps as shown in fig. 2A.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (15)

1. A method of training an image generation model, comprising:
acquiring a training sample set; a training sample includes: the system comprises a sample reference map of a target object, a sample skeleton map and a sample depth map which indicate the positions of key points of the target object under a target pose, and a sample standard map of the target pose; the sample skeleton diagram at least comprises a limb tail end skeleton;
performing multiple rounds of iterative training on the pre-trained image generation model by adopting the training sample set, and outputting a trained target image generation model; wherein, in a round of iterative process, the following operations are performed:
Based on a sample skeleton diagram and a sample depth diagram contained in the selected training sample, performing action migration processing on the target object in the contained sample reference diagram according to the corresponding target pose to obtain a prediction standard diagram;
and based on the multi-scale global comprehensive difference loss between the prediction standard diagram and the sample standard diagram, combining the local difference loss in the appointed image area between the prediction standard diagram and the sample standard diagram, and adjusting model parameters in the image generation model.
2. The method of claim 1, wherein the image generation model comprises: a first encoding network configured with a convolution attention layer, a second encoding network configured with a convolution attention layer and an image fusion layer, and a multi-scale decoding network configured with a convolution attention layer;
performing motion migration processing on the target object in the included sample reference image according to the corresponding target pose based on the sample skeleton image and the sample depth image included in the selected training sample to obtain a prediction standard image, wherein the method comprises the following steps:
inputting a sample reference picture contained in the selected training sample into the first coding network to obtain coded reference image characteristics;
Splicing a sample skeleton diagram and a sample depth diagram contained in the training sample in a channel dimension, and inputting the sample skeleton diagram and the sample depth diagram into the second coding network to obtain coded and fused skeleton action characteristics;
and decoding the reference image characteristic based on the skeleton action characteristic by adopting the multi-scale decoding network to obtain a prediction standard diagram after finishing action migration.
3. The method of claim 1, wherein the training sample set is generated by:
obtaining a sample standard graph and a three-dimensional coordinate set of a target object under different poses, wherein one three-dimensional coordinate set comprises: three-dimensional coordinates corresponding to each key point in one pose;
processing each three-dimensional coordinate set by adopting a preset two-dimensional re-projection technology to obtain a sample skeleton map generated based on pixel point coordinates of each key point position under an image coordinate system, and obtaining a sample depth map generated based on pixel depth values corresponding to each key point position;
and generating a training sample set based on the sample standard graph, the sample skeleton graph and the sample depth graph corresponding to the different poses.
4. A method according to claim 3, wherein said deriving a sample skeleton map generated based on two-dimensional coordinates of each keypoint location in the image coordinate system comprises:
obtaining coordinates of each pixel point after projecting the positions of each key point in the three-dimensional coordinate set to an image coordinate system;
and restoring skeleton distribution under the corresponding pose by connecting pixel points corresponding to the pixel point coordinates, so as to obtain a sample skeleton diagram with the same size as the corresponding sample standard diagram.
5. The method of claim 3, wherein obtaining the sample depth map generated from the pixel depth values of the key point positions comprises:
obtaining the pixel coordinates and the pixel depth values corresponding to each key point position after the key point positions in the three-dimensional coordinate set are projected to the image coordinate system;
and constructing an initial depth map matching the image coordinate system, and adjusting the pixel values of the corresponding pixels in the initial depth map based on the pixel depth values, in combination with the pixel value differences determined for the pixel ranges to which the pixel coordinates belong, to obtain the sample depth map.
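The claim leaves the exact adjustment rule open; one simple reading writes each key point's normalized depth into a small patch around its projected coordinate, as sketched below. The patch radius and the normalization are assumptions.

```python
import numpy as np

def build_depth_map(pixels, depths, image_size, radius=3):
    """Sparse key-point depth map, one reading of claim 5's construction."""
    h, w = image_size
    depth_map = np.zeros((h, w), dtype=np.float32)            # initial depth map
    norm = (depths - depths.min()) / (np.ptp(depths) + 1e-6)  # normalize to [0, 1]
    for (u, v), z in zip(pixels, norm):
        # write the depth value into a small patch ('pixel range') around the key point
        y0, y1 = max(v - radius, 0), min(v + radius + 1, h)
        x0, x1 = max(u - radius, 0), min(u + radius + 1, w)
        depth_map[y0:y1, x0:x1] = z
    return depth_map
```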
6. The method of claim 1, wherein, when the image generation model is trained as the generator in a generator-discriminator structure, the method further comprises, after obtaining the predicted standard map:
obtaining a corresponding adversarial loss based on the predicted standard map and the corresponding sample standard map, using a preset generative adversarial loss function;
and adjusting the model parameters of the image generation model based on the adversarial loss and the global comprehensive difference loss between the predicted standard map and the sample standard map, in combination with the local difference loss within the designated image region between the predicted standard map and the sample standard map.
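A non-saturating BCE adversarial term is one common instance of a "preset generative adversarial loss function"; the sketch below pairs it with an L1 reconstruction stand-in for the difference losses. The discriminator `disc` is an assumed component of the generator-discriminator structure, not defined by the patent.

```python
import torch
import torch.nn.functional as F

def generator_losses(disc, pred, target):
    """Adversarial term for the generator plus a reconstruction stand-in.

    `disc` is an assumed discriminator returning real/fake logits.
    """
    fake_logits = disc(pred)
    loss_adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))  # try to fool the discriminator
    loss_rec = F.l1_loss(pred, target)              # stand-in for the difference losses
    return loss_adv + loss_rec
```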
7. The method of claim 1, wherein the local difference loss is determined by:
determining, in the predicted standard map and the sample standard map respectively, the target key point positions used to locate sub-image regions, and cropping out, from each of the two maps, a designated image region containing a plurality of sub-image regions based on the determined target key point positions;
and obtaining the corresponding local difference loss based on the pixel value differences and the image feature differences within each sub-image region.
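One reading of claim 7 in code: crop fixed-size patches around the target key points in both maps and accumulate pixel and, optionally, feature differences. The patch size and the optional feature extractor `feat_net` are assumptions.

```python
import torch.nn.functional as F

def local_difference_loss(pred, target, keypoints, patch=32, feat_net=None):
    """Crop sub-image regions around target key points and compare them.

    `keypoints` is an (N, 2) tensor of pixel coordinates.
    """
    loss = pred.new_zeros(())
    h, w = pred.shape[-2:]
    for u, v in keypoints.tolist():
        y0 = max(min(int(v) - patch // 2, h - patch), 0)
        x0 = max(min(int(u) - patch // 2, w - patch), 0)
        p = pred[..., y0:y0 + patch, x0:x0 + patch]
        t = target[..., y0:y0 + patch, x0:x0 + patch]
        loss = loss + F.l1_loss(p, t)  # pixel value difference
        if feat_net is not None:
            loss = loss + F.l1_loss(feat_net(p), feat_net(t))  # feature difference
    return loss / max(len(keypoints), 1)
```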
8. The method of claim 1, wherein the global comprehensive difference loss is determined by:
obtaining a global pixel value loss based on the pixel value differences over all pixels between the predicted standard map and the sample standard map, and obtaining a multi-scale feature loss based on the image feature differences between the two maps at a plurality of preset scales;
and obtaining the corresponding global comprehensive difference loss from the global pixel value loss and the multi-scale feature loss.
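A compact stand-in for claim 8: plain L1 for the global pixel term, and downsampled L1 in place of a learned multi-scale feature difference (a VGG-style perceptual loss would be a typical choice). The scale set and weighting are assumptions.

```python
import torch.nn.functional as F

def global_comprehensive_loss(pred, target, scales=(2, 4), w_feat=1.0):
    """Global pixel loss plus a multi-scale term over preset scales."""
    loss_pix = F.l1_loss(pred, target)  # pixel value difference over all pixels
    loss_ms = sum(F.l1_loss(F.avg_pool2d(pred, s), F.avg_pool2d(target, s))
                  for s in scales)      # stand-in for learned multi-scale features
    return loss_pix + w_feat * loss_ms
```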
9. The method of any one of claims 1-8, wherein the pre-training of the image generation model is accomplished by:
acquiring a designated data set comprising sample standard maps and sample skeleton maps of each sample object under different poses, and obtaining the sample depth map corresponding to each sample skeleton map by performing monocular depth estimation on each sample skeleton map in the data set;
and constructing a pre-training sample set based on the sample standard maps, sample skeleton maps and sample depth maps obtained from the data set, performing multiple rounds of iterative training on an initial image generation model based on the pre-training sample set, and outputting the pre-trained image generation model.
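Claim 9's data preparation reduces to running a monocular depth estimator over the skeleton maps of an existing data set. In the sketch, `estimate_depth` is a stand-in for any such estimator and the data-set interface is likewise an assumption.

```python
def build_pretraining_samples(dataset, estimate_depth):
    """Construct the pre-training sample set described in claim 9.

    `dataset` yields (sample_standard_map, sample_skeleton_map) pairs;
    `estimate_depth` stands in for any monocular depth estimator.
    """
    samples = []
    for standard_map, skeleton_map in dataset:
        depth_map = estimate_depth(skeleton_map)  # monocular depth estimation
        samples.append({"standard": standard_map,
                        "skeleton": skeleton_map,
                        "depth": depth_map})
    return samples
```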
10. The method of any one of claims 1-8, wherein the learning rate used in each round of iteration of the pre-trained image generation model is determined in either of the following ways:
using a preset cosine annealing algorithm to determine a learning rate value for each training period based on a preset initial learning rate, and determining the target learning rate for the current round of iteration according to the training period to which it belongs, wherein one training period comprises at least one round of iteration; or
determining a learning rate value for each training period based on a preset initial learning rate and a learning rate decay coefficient, and determining the target learning rate for the current round of iteration according to the training period to which it belongs, wherein one training period comprises at least one round of iteration.
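Both schedules map a training period index to a target learning rate. The sketch below gives direct implementations of the two options; the initial rate, minimum rate, and decay coefficient are placeholders.

```python
import math

def cosine_annealing_lr(period, total_periods, lr_init, lr_min=0.0):
    """Option 1: cosine annealing over training periods (0-indexed)."""
    return lr_min + 0.5 * (lr_init - lr_min) * (
        1 + math.cos(math.pi * period / total_periods))

def step_decay_lr(period, lr_init, decay=0.9):
    """Option 2: exponential decay with a learning-rate attenuation coefficient."""
    return lr_init * decay ** period
```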
11. The method of any one of claims 1-8, further comprising:
acquiring a reference image of a target object under a reference action, together with a planar skeleton map and a planar depth map of the target object under a specified pose, wherein the planar skeleton map comprises a hand skeleton;
and performing motion-transfer processing on the reference image based on the planar skeleton map and the planar depth map, using the target image generation model, to obtain a target image of the target object under the specified pose.
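Inference is then a single forward pass through the trained model; the sketch assumes the same (reference, skeleton, depth) input interface as the training sketches above.

```python
import torch

@torch.no_grad()
def transfer_pose(model, reference, plane_skeleton, plane_depth):
    """Single forward pass through the trained target image generation model."""
    model.eval()
    return model(reference, plane_skeleton, plane_depth)
```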
12. A training device for an image generation model, comprising:
an acquisition unit, configured to acquire a training sample set, wherein one training sample comprises: a sample reference map of a target object, a sample skeleton map and a sample depth map indicating the key point positions of the target object under a target pose, and a sample standard map under the target pose, the sample skeleton map comprising at least a limb-end skeleton;
and a training unit, configured to perform multiple rounds of iterative training on the pre-trained image generation model using the training sample set and to output a trained target image generation model, wherein in each round of iteration the following operations are performed:
based on the sample skeleton map and the sample depth map contained in a selected training sample, performing motion-transfer processing on the target object in the contained sample reference map according to the corresponding target pose, to obtain a predicted standard map;
and adjusting model parameters of the image generation model based on the multi-scale global comprehensive difference loss between the predicted standard map and the sample standard map, in combination with the local difference loss within a designated image region between the predicted standard map and the sample standard map.
13. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1-11.
14. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-11.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-11.
CN202310283088.1A 2023-03-10 2023-03-10 Training method and device for image generation model, electronic equipment and storage medium Pending CN117218246A (en)

Priority Applications (1)

Application Number: CN202310283088.1A; Priority Date: 2023-03-10; Filing Date: 2023-03-10; Title: Training method and device for image generation model, electronic equipment and storage medium

Publications (1)

Publication Number: CN117218246A; Publication Date: 2023-12-12

Family ID: 89034021

Country Status (1)

CN: CN117218246A (en)

Similar Documents

Publication Publication Date Title
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN109636831B (en) Method for estimating three-dimensional human body posture and hand information
US10740897B2 (en) Method and device for three-dimensional feature-embedded image object component-level semantic segmentation
CN111598998B (en) Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN110378838B (en) Variable-view-angle image generation method and device, storage medium and electronic equipment
WO2022156640A1 (en) Gaze correction method and apparatus for image, electronic device, computer-readable storage medium, and computer program product
CN113706699B (en) Data processing method and device, electronic equipment and computer readable storage medium
CN110070595B (en) Single image 3D object reconstruction method based on deep learning
Tu et al. Consistent 3d hand reconstruction in video via self-supervised learning
CN113628327A (en) Head three-dimensional reconstruction method and equipment
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN115272565A (en) Head three-dimensional model reconstruction method and electronic equipment
CN116524121A (en) Monocular video three-dimensional human body reconstruction method, system, equipment and medium
WO2022148248A1 (en) Image processing model training method, image processing method and apparatus, electronic device, and computer program product
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
CN113822965A (en) Image rendering processing method, device and equipment and computer storage medium
CN116342377A (en) Self-adaptive generation method and system for camouflage target image in degraded scene
Khan et al. Towards monocular neural facial depth estimation: Past, present, and future
CN112766120B (en) Three-dimensional human body posture estimation method and system based on depth point cloud
CN116797713A (en) Three-dimensional reconstruction method and terminal equipment
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
CN113537359A (en) Training data generation method and device, computer readable medium and electronic equipment
CN114202606A (en) Image processing method, electronic device, storage medium, and computer program product
CN113763536A (en) Three-dimensional reconstruction method based on RGB image

Legal Events

Code: PB01; Title: Publication