CN112101320A - Model training method, image generation method, device, equipment and storage medium

Info

Publication number: CN112101320A
Authority: CN (China)
Prior art keywords: head, image, model, frame, face contour
Legal status: Pending
Application number: CN202011289278.7A
Other languages: Chinese (zh)
Inventors: 陈博, 高原, 刘霄
Current Assignee: Beijing Century TAL Education Technology Co Ltd
Original Assignee: Beijing Century TAL Education Technology Co Ltd
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202011289278.7A
Publication of CN112101320A

Classifications

    • G06V40/168 Feature extraction; Face representation (human faces)
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06T5/30 Erosion or dilatation, e.g. thinning
    • G06T5/77 Retouching; Inpainting; Scratch removal
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06T2207/20081 Training; Learning
    • G06T2207/30201 Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a model training method, an image generation method, an apparatus, an electronic device, and a storage medium. The specific implementation scheme is as follows: acquiring video data of a first object; extracting a head image of the first object and key points in the head image from frame images of the video data of the first object; generating a face contour map of the first object according to the key points; and training a head generation model by using the face contour map of the first object and the head image of the first object, so that the trained head generation model obtains a head generation image of the first object according to a face contour map of a second object. With the method and the device, the head image of the second object in video data can be automatically replaced with the head image of the first object without relying on manual participation, which reduces the time consumption and cost of image generation.

Description

Model training method, image generation method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a model training method, an image generation method, an apparatus, an electronic device, and a storage medium.
Background
In the process of shooting movies and teaching videos, actors and teachers may be unable to continue participating in the shooting due to certain factors. In this case, a substitute (stand-in) needs to be found for shooting. Generally, the substitute is required to look similar to the original actor, so the threshold for selecting a substitute is high. Face changing and head changing can lower this selection threshold. However, existing face and head changing methods usually rely on manual post-production adjustment with software. Such post-production adjustment is time-consuming and costly.
Disclosure of Invention
The embodiments of the application provide a model training method, an image generation method, an apparatus, an electronic device and a storage medium, which are used to solve the problems in the related art. The technical solution is as follows:
in a first aspect, an embodiment of the present application provides a model training method, including:
acquiring video data of a first object;
extracting a head image of the first object and a key point in the head image from a frame image of video data of the first object;
generating a face contour map of the first object according to the key points;
and training a head generation model by using the face contour map of the first object and the head image of the first object, so that the trained head generation model obtains a head generation image of the first object according to a face contour map of a second object.
In one embodiment, the face contour map is used to characterize the facial expression information and/or pose information of the person.
In one embodiment, the head generation model comprises a generative adversarial network.
In one embodiment, training a head generation model using a face contour map of a first object and a head image of the first object comprises:
the head generative model is trained using a time-series smoothing constraint function.
In one embodiment, the time-series smooth constraint function is constructed based on a face contour map of the first object in two consecutive frames of images, a head image of the first object in the two consecutive frames of images, and a head generation image of the first object in the two consecutive frames of images.
In one embodiment, the timing smoothing constraint function comprises:

$$\mathcal{L}_{tc}(G, D) = \mathbb{E}_{(s,x)}\big[\log D(s_{t-1}, s_t, x_{t-1}, x_t)\big] + \mathbb{E}_{s}\big[\log\big(1 - D(s_{t-1}, s_t, G(s_{t-1}), G(s_t))\big)\big]$$

wherein $G$ denotes the generator; $D$ denotes the discriminator; $\mathcal{L}_{tc}$ denotes the loss value of the time-series smoothing constraint function; $\mathbb{E}$ denotes the mathematical expectation; $s$ denotes a face contour map; $s_t$ denotes the face contour map of the first object in the current frame; $s_{t-1}$ denotes the face contour map of the first object in the previous frame; $x$ denotes a head image of the first object; $x_t$ denotes the head image of the first object in the current frame; and $x_{t-1}$ denotes the head image of the first object in the previous frame.
In one embodiment, training a head generation model using a face contour map of a first object and a head image of the first object comprises:
the head generation model is trained using at least one of an alignment loss function, a feature matching loss function, and a perceptual loss function.
In a second aspect, an embodiment of the present application provides an image generation method, including:
acquiring video data of a second object;
extracting a head image of the second object and a key point in the head image from a frame image of video data of the second object;
generating a face contour map of a second object corresponding to the frame image according to the key points;
inputting the face contour map of the second object into a head generation model to obtain a head generation image of the first object corresponding to the frame image; the head generation model is obtained by adopting any one of the model training methods.
In one embodiment, after inputting the face contour map of the second object into the head generating model and obtaining the head generating image of the first object corresponding to the frame image, the method further includes:
obtaining a mask region of a head generation image of a first object corresponding to the frame image and a mask region of a head image of a second object by using a head segmentation algorithm;
processing the frame image based on the mask area, and replacing the head generation image of the first object onto the head image of the second object;
performing erosion processing on the mask region in the replaced frame image;
and repairing the mask region in the eroded frame image by using a head fusion model.
In one embodiment, the head fusion model is a model obtained by using any one of the above-mentioned model training methods.
In one embodiment, the method further comprises:
and splicing the repaired frame images frame by frame to obtain a head replacement video for replacing the head of the second object with the head of the first object.
In a third aspect, an embodiment of the present application provides a model training apparatus, including:
a first acquisition unit configured to acquire video data of a first object;
a first extraction unit configured to extract a head image of a first object and a key point in the head image from a frame image of video data of the first object;
the first generating unit is used for generating a face contour map of the first object according to the key points;
and the training unit is used for training the head generation model by utilizing the face contour map of the first object and the head image of the first object, so that the trained head generation model obtains the head generation image of the first object according to the face contour map of the second object.
In one embodiment, the face contour map is used to characterize the facial expression information and/or pose information of the person.
In one embodiment, the head generation model comprises a generative adversarial network.
In one embodiment, the training unit is further configured to:
the head generative model is trained using a time-series smoothing constraint function.
In one embodiment, the time-series smooth constraint function is constructed based on a face contour map of the first object in two consecutive frames of images, a head image of the first object in the two consecutive frames of images, and a head generation image of the first object in the two consecutive frames of images.
In one embodiment, the timing smoothing constraint function comprises:

$$\mathcal{L}_{tc}(G, D) = \mathbb{E}_{(s,x)}\big[\log D(s_{t-1}, s_t, x_{t-1}, x_t)\big] + \mathbb{E}_{s}\big[\log\big(1 - D(s_{t-1}, s_t, G(s_{t-1}), G(s_t))\big)\big]$$

wherein $G$ denotes the generator; $D$ denotes the discriminator; $\mathcal{L}_{tc}$ denotes the loss value of the time-series smoothing constraint function; $\mathbb{E}$ denotes the mathematical expectation; $s$ denotes a face contour map; $s_t$ denotes the face contour map of the first object in the current frame; $s_{t-1}$ denotes the face contour map of the first object in the previous frame; $x$ denotes a head image of the first object; $x_t$ denotes the head image of the first object in the current frame; and $x_{t-1}$ denotes the head image of the first object in the previous frame.
In one embodiment, the training unit is further configured to:
the head generation model is trained using at least one of an alignment loss function, a feature matching loss function, and a perceptual loss function.
In a fourth aspect, an embodiment of the present application provides an image generating apparatus, including:
a second acquisition unit configured to acquire video data of a second object;
a second extraction unit configured to extract a head image of the second object and a key point in the head image from a frame image of video data of the second object;
a second generating unit, configured to generate a face contour map of a second object corresponding to the frame image according to the key points;
a third generating unit, configured to input a face contour map of the second object into the head generating model, and obtain a head generating image of the first object corresponding to the frame image; the head generation model is a model obtained by adopting any one of the model training devices.
In one embodiment, the apparatus further comprises a repair unit configured to:
obtaining a mask region of a head generation image of a first object corresponding to the frame image and a mask region of a head image of a second object by using a head segmentation algorithm;
processing the frame image based on the mask area, and replacing the head generation image of the first object onto the head image of the second object;
performing erosion processing on the mask region in the replaced frame image;
and repairing the mask region in the eroded frame image by using a head fusion model.
In one embodiment, the head fusion model is a model obtained by using the model training apparatus according to any one of the above embodiments.
In an embodiment, the apparatus further includes a splicing unit, where the splicing unit is configured to:
and splicing the repaired frame images frame by frame to obtain a head replacement video for replacing the head of the second object with the head of the first object.
In a fifth aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor. The memory and the processor communicate with each other via an internal connection path; the memory is configured to store instructions, and the processor is configured to execute the instructions stored in the memory so as to perform the method in any one of the above aspects.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the computer program runs on a computer, the method in any one of the above-mentioned aspects is executed.
The advantages or beneficial effects in the above technical solution at least include: the head image of the second object in the video data can be automatically replaced by the head image of the first object without depending on human participation, so that the time consumption and the cost of image generation are reduced.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is a flow chart of a model training method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a training flow of a model training method according to another embodiment of the present application;
FIG. 3 is a schematic test flow chart of a model training method according to another embodiment of the present application;
FIG. 4 is a flow chart of an image generation method according to another embodiment of the present application;
FIG. 5 is a flow chart of an image generation method according to another embodiment of the present application;
FIG. 6 is a schematic diagram of a training flow of a head fusion model according to another embodiment of the present application;
FIG. 7 is a schematic diagram illustrating a testing process of a head fusion model according to another embodiment of the present application;
FIG. 8 is a schematic diagram of a head replacement effect of an image generation method according to another embodiment of the present application;
FIG. 9 is a flow chart of an image generation method according to another embodiment of the present application;
FIG. 10 is a schematic diagram of a model training apparatus according to another embodiment of the present application;
FIG. 11 is a schematic structural diagram of an image generation apparatus according to another embodiment of the present application;
FIG. 12 is a schematic structural diagram of an image generation apparatus according to another embodiment of the present application;
FIG. 13 is a block diagram of an electronic device used to implement embodiments of the present application.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
FIG. 1 is a flow chart of a model training method according to an embodiment of the present application. As shown in fig. 1, the model training method may include:
step S110, acquiring video data of a first object;
step S120, extracting a head image of the first object and a key point in the head image from a frame image of the video data of the first object;
step S130, generating a face contour map of the first object according to the key points;
step S140, training a head generation model by using the face contour map of the first object and the head image of the first object, so that the trained head generation model obtains the head generation image of the first object according to the face contour map of the second object.
During the shooting of movies and teaching videos, a substitute may need to be found for shooting due to certain factors. Face changing and head changing can lower the threshold for selecting a substitute. Face and head changing can be done manually through post-production software, but such post-production adjustment is time-consuming and costly. In addition, face changing only replaces the face region of the substitute and does not change the substitute's hairstyle or face shape. In general, the similarity of appearance is strongly influenced by a person's hairstyle and face shape. Therefore, a face-changing-based method still raises the selection threshold of the substitute to a certain extent.
In view of the above problems, the present application provides a model training method in which the trained model can automatically replace the head image of a person in a video. In one example, the first object may include a source character, which may be an actor. The second object may include a target character, which may be the avatar (substitute). Given video A of the actor and video B of the avatar performance that needs to be replaced, the model trained with the model training method provided by this application automatically replaces the head image of the avatar in video B with the head image of the corresponding actor, without relying on manual participation, while keeping the expression and pose of the avatar in video B.
Still taking the replacement of the avatar's head with the actor's head as an example, in step S110, video data of the actor may be collected. The collected data is used to train the head generation model. The trained head generation model can then obtain head generation images of the actor according to the face contour map of the avatar.
In step S120, each frame image of the actor's video data is extracted first. Then, the head image of the actor and the key points in the head image are extracted from each frame image by using a face detection model and a key point detection model. A key point (landmark) set comprises several marker points drawn on the head image. The marker points can be placed at key positions such as edges, corners, contours, intersections and equal-division points. The morphology of the face can be described by means of these key points.
In an embodiment of the present application, a face detection model in the Dlib tool library may be used to extract the head image of the actor and the key points in the head image from each frame image in the actor's video data. Dlib is a C++ open source toolkit that contains machine learning algorithms. Detecting a human face with Dlib may comprise the following steps: detecting the face image with a trained classifier, extracting features of the face image, and classifying the face image according to the extracted features. In one example, the landmarks in a face detection model in the Dlib tool library may contain 68 marker points, which are sequentially: 0-16 jawline, 17-21 right eyebrow, 22-26 left eyebrow, 27-35 nose, 36-41 right eye, 42-47 left eye, 48-60 outer mouth contour, and 61-67 inner mouth contour. These key points can represent the face shape and the contours and features of the five sense organs. According to the key points, the key regions of the face, including eyebrows, eyes, nose, mouth, face contour and the like, can be located in a given face image.
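A minimal Python sketch of this extraction step is given below. It assumes the publicly available Dlib 68-point predictor file shape_predictor_68_face_landmarks.dat and an arbitrary crop margin; neither detail is fixed by this application.

```python
import cv2
import dlib
import numpy as np

# Assumed model file: the public 68-point predictor shipped with Dlib.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_head_and_keypoints(frame_bgr, margin=0.3):
    """Return a head crop and its 68 key points for the first detected face."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        return None, None
    face = faces[0]
    shape = predictor(gray, face)
    keypoints = np.array([[p.x, p.y] for p in shape.parts()])  # shape (68, 2)

    # Expand the detected face box by a margin so the crop covers the whole head.
    w, h = face.width(), face.height()
    x0 = max(int(face.left() - margin * w), 0)
    y0 = max(int(face.top() - margin * h), 0)
    x1 = min(int(face.right() + margin * w), frame_bgr.shape[1])
    y1 = min(int(face.bottom() + margin * h), frame_bgr.shape[0])
    head_crop = frame_bgr[y0:y1, x0:x1]

    # Express the key points in the coordinate system of the crop.
    keypoints -= np.array([x0, y0])
    return head_crop, keypoints
```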
In step S130, a face contour map of the actor is generated from the key points in the head image of the actor extracted in step S120. Fig. 2 is a schematic diagram of the Training process of a model training method according to another embodiment of the present application. Reference numeral 1 in Fig. 2 denotes a frame image extracted from the actor's video data, and reference numeral 2 denotes the face contour map of the actor. As shown in Fig. 2, key point detection (landmark detection) is performed on each frame image extracted from the actor's video data, and the face contour map of the actor shown by reference numeral 2 is then generated using the key points.
In step S140, a head generation model is trained using the face contour map of the actor and the head image of the actor, so that the trained head generation model can obtain the head generation image of the actor from the face contour map of the avatar. The Head generation model may also be referred to as a Head Generation Network (HGN), among others. Reference numeral 3 in fig. 2 denotes a head generation model, and reference numeral 4 denotes a head generation image of an actor. As shown in fig. 2, in the training process of the head generation model, the face contour map 2 of the actor is input into the head generation model 3, and a head generation image 4 of the actor is obtained through the head generation model 3. After the training of the head generation model is completed, the face contour map of the substitute can be input into the head generation model, and the head generation image of the actor is obtained through the head generation model.
According to the embodiment of the application, the head image of the second object in each frame of image of the video data can be automatically replaced by the head image of the first object without depending on human participation. The selection threshold of the substitute is reduced through the head image replacement, and meanwhile, the time consumption and the cost of image generation are reduced.
In one embodiment, the face contour map is used to characterize the facial expression information and/or pose information of the person. In the embodiment of the application, the key points can be marked on the outer contour of the human face and the edge of the organ. The face contour map generated by using the key points can contain head orientation and expression information, and can be used for representing expression information and/or posture information of a character. Wherein the pose information may include pose information of a head of the person. For example, the pose information may include head orientation, head angle, etc.
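A minimal sketch of turning the 68 key points into such a contour map, by drawing the jawline and organ edges as poly-lines on a blank canvas; the index grouping follows the Dlib convention quoted above, while the single-channel output and line thickness are assumptions.

```python
import cv2
import numpy as np

# Index ranges of the 68-point layout (jawline, brows, nose, eyes, mouth), as listed above.
CONTOUR_GROUPS = [
    (range(0, 17), False),   # jawline, open poly-line
    (range(17, 22), False),  # right eyebrow
    (range(22, 27), False),  # left eyebrow
    (range(27, 36), False),  # nose
    (range(36, 42), True),   # right eye, closed contour
    (range(42, 48), True),   # left eye, closed contour
    (range(48, 61), True),   # outer mouth contour
    (range(61, 68), True),   # inner mouth contour
]

def draw_face_contour_map(keypoints, size):
    """Rasterize key points into a single-channel contour map of shape (H, W)."""
    h, w = size
    canvas = np.zeros((h, w), dtype=np.uint8)
    for idx, closed in CONTOUR_GROUPS:
        pts = keypoints[list(idx)].reshape(-1, 1, 2).astype(np.int32)
        cv2.polylines(canvas, [pts], isClosed=closed, color=255, thickness=2)
    return canvas
```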
In the embodiment of the application, the trained head generation model can be used for processing the video data of the avatar, specifically, each frame of image in the video data of the avatar is processed, and the head image of the actor with the same posture and expression as the avatar is generated. In order to generate a head image having the same pose and expression as the actor during the model training process, the detected face key points of the actor are drawn into a face contour map of the actor including the head orientation and expression information. Then, the human face outline image of the actor is used as input information of a head generation model HGN, and the head generation model is trained by using the head image of the actor extracted from the frame image of the video data of the actor as supervision information.
According to the embodiment of the application, the head image of the second object in the video data can be automatically replaced with the head image of the first object while the expression and pose of the second object are maintained. Replacing the head image lowers the threshold for selecting a substitute and improves the visual effect and vividness of the generated image.
In one embodiment, the head generation model comprises a generative adversarial network.
A Generative Adversarial Network (GAN) is a deep learning model based on Convolutional Neural Networks (CNN) and adversarial learning. A convolutional neural network is a neural network that analyses image input through multiple layers of convolution. A generative adversarial network can automatically generate a brand new image according to specified rules. The framework of a generative adversarial network includes a generator and a discriminator. The generator is also called the generative model (G for short); the discriminator is also called the discriminative model (D for short). The two models play a game against each other and learn from each other, which yields a fairly good output. Taking image processing as an example, the generator is used for generating images; that is, the generator outputs an automatically generated, fake image. The discriminator receives as input the image output by the generator and then determines the authenticity probability of this image. For example, if the authenticity probability output by the discriminator is 1, the discrimination result is that the image generated by the generator is 100% real, i.e., the generated image is almost identical to a real image. If the discriminator outputs an authenticity probability of 0, the image generated by the generator cannot be a real image, i.e., the generated image differs greatly from a real image. If the output is an authenticity probability of 0.5, it is difficult to determine whether the image generated by the generator is real or not.
In the training process of the generative adversarial network, the goal of the generator G is to generate images as realistic as possible to deceive the discriminator D, while the goal of the discriminator D is to distinguish the images generated by G from the real images as well as possible. Thus, the generator G and the discriminator D constitute a dynamic game process.
As the result of this game, in the most ideal situation, the generator G can generate images realistic enough to pass for real ones. The discriminator D can then hardly determine whether an image generated by G is real or not, so the authenticity probability output by D is 0.5. Thus, through this game-learning process, the generator G of the generative adversarial network is finally obtained, and this generator G can be used to generate the head generation image of the first object.
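The adversarial game described above can be written as a short training step; the following PyTorch sketch is an assumption-laden illustration (conditional inputs, binary cross-entropy formulation, optimizer handling), not the exact network or training configuration of the application.

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_G, opt_D, s, x):
    """One adversarial update: s is the face contour map, x the real head image."""
    # --- discriminator step: separate real pairs (s, x) from fake pairs (s, G(s)) ---
    with torch.no_grad():
        fake = G(s)
    d_real = D(torch.cat([s, x], dim=1))
    d_fake = D(torch.cat([s, fake], dim=1))
    loss_D = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- generator step: fool the discriminator into rating G(s) as real ---
    fake = G(s)
    d_fake = D(torch.cat([s, fake], dim=1))
    loss_G = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_G.item(), loss_D.item()
```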
Referring again to fig. 2, reference numeral 3 in fig. 2 particularly denotes a generator G in the head generating model. As shown in fig. 2, in the training process of the head generation model, the face contour map 2 of the actor is input into the generator 3 of the head generation model, and the head generation image 4 of the actor is obtained by the generator 3.
Reference numeral 5 in fig. 2 specifically denotes a discriminator D in the head generating model. As shown in fig. 2, in the training process of the head generation model, the face contour map 2 of the actor and the head generation image 4 of the actor generated by the generator 3 are input to the discriminator 5 of the head generation model, the authenticity probability of the head generation image is discriminated by the discriminator 5, and whether the head generation image is true or false is determined (Real or Fake).
Fig. 3 is a schematic diagram of the Testing flow of a model training method according to another embodiment of the present application. As shown in Fig. 3, after the training of the head generation model is completed, the trained head generation model may be tested. Reference numeral 6 in Fig. 3 denotes a frame image extracted from the video data of the avatar, reference numeral 7 denotes the face contour map of the avatar, and reference numeral 8 denotes the head generation image of the actor generated from the face contour map of the avatar. As shown in Fig. 3, key point detection (landmark detection) is performed on each frame image, and the face contour map of the avatar shown by reference numeral 7 is then generated using the key points. The face contour map of the avatar is input into the generator 3 of the head generation model, and the head generation image 8 of the actor is obtained through the head generation model.
In one embodiment, training a head generation model using a face contour map of a first object and a head image of the first object comprises:
the head generation model is trained using at least one of an alignment loss function, a feature matching loss function, and a perceptual loss function.
In the embodiment of the application, in order to generate head images having the same pose and expression as the actor, the face contour map of the actor may be used as the input information of the head generation network HGN, the corresponding head image of the actor may be used as the supervision information, and the head generation model is trained in an adversarial learning manner. The loss functions of model training defined in the embodiments of the present application may include at least one of the following:
1) Adversarial loss function:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{(s,x)}\big[\log D(s, x)\big] + \mathbb{E}_{s}\big[\log\big(1 - D(s, G(s))\big)\big]$$

wherein $s$ represents a face contour map, in particular the face contour map of the first object (e.g. an actor); $x$ represents a real head image, in particular the head image of the first object (e.g. an actor); $D$ represents the discriminator; $G$ represents the generator; $G(s)$ represents the head generation image output by the generator; and $\mathbb{E}$ represents the mathematical expectation.
The purpose of the generator in the generative adversarial network is to make the generated image distribution fit the real image distribution as closely as possible. The purpose of the discriminator is to discriminate as well as possible whether the input image is real or generated. Through adversarial training, the discriminator D and the generator G reach the point where the generated head images match the real data distribution, that is, the distribution of real head images.
2) Feature matching loss function:

$$\mathcal{L}_{FM}(G, D_k) = \mathbb{E}_{(s,x)} \sum_{i=1}^{T} \frac{1}{N_i}\Big[\big\| D_k^{(i)}(s, x) - D_k^{(i)}(s, G(s)) \big\|_1\Big]$$

wherein $\mathcal{L}_{FM}$ represents the loss value of the feature matching loss function; $s$ represents a face contour map, in particular the face contour map of the first object (e.g. an actor); $x$ represents a real head image, in particular the head image of the first object (e.g. an actor); $D$ represents the discriminator; $G$ represents the generator; $G(s)$ represents the head generation image output by the generator; $\mathbb{E}$ represents the mathematical expectation; $k$ indexes the discriminators; $i$ represents the current layer of the discriminator network; $T$ represents the total number of layers of the discriminator network; and $N_i$ represents the number of neurons in the feature map of layer $i$.
In the above formula, the face contour map s is concatenated along the channel dimension with the real head image x, and with the head generation image G(s) output by the generator, to form (s, x) and (s, G(s)) respectively. In one example, the discriminator D comprises a 4-layer network. (s, x) and (s, G(s)) are input into the discriminator to obtain the feature maps D(s, x) and D(s, G(s)) of layers 1-4 of the discriminator network. The L1 distance between D(s, x) and D(s, G(s)) of each corresponding layer is then calculated and divided by the number N of neurons of that layer's feature map to obtain the feature matching loss value of that layer.
In yet another example, a 3-scale discriminator may be employed, with the 3-scale loss values being combined to obtain a final loss value. For example, the dimensions of the feature maps corresponding to 3 scales may be 256 × 256 pixels, 128 × 128 pixels, and 56 × 56 pixels, respectively. The output results of the 3-scale discriminators may be averaged as the final output result of the discriminator. Using discriminators of multiple scales, detailed information of the feature map can be enhanced.
The feature matching loss function computes the distance and similarity between the features of the real head image and of the head generation image at different layers of the discriminator, so as to improve the realism of the head generation image. Reducing the loss value of the feature matching loss function through model training reduces the difference between the head generation image output by the generator and the real image at the semantic feature level of the discriminator, and thus improves the authenticity of the head generation image.
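A hedged sketch of this term is shown below. It assumes each discriminator exposes its per-layer feature maps through a hypothetical `features` method returning a list; the `torch.mean` over all elements of the difference performs the per-layer normalization by the number of neurons.

```python
import torch

def feature_matching_loss(discriminators, s, x_real, x_fake):
    """L1 distance between discriminator features of (s, x_real) and (s, x_fake).

    `discriminators` is a list of multi-scale discriminators, each assumed to
    provide a `features(input)` method returning the feature map of every layer.
    """
    loss = 0.0
    for D in discriminators:
        feats_real = D.features(torch.cat([s, x_real], dim=1))
        feats_fake = D.features(torch.cat([s, x_fake], dim=1))
        for f_real, f_fake in zip(feats_real, feats_fake):
            # mean over all elements = sum of |diff| divided by N neurons of the layer
            loss = loss + torch.mean(torch.abs(f_real.detach() - f_fake))
    return loss / len(discriminators)
```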
3) Perceptual loss function:

$$\mathcal{L}_{VGG}(G) = \sum_{i=1}^{N} \frac{1}{C_i H_i W_i}\Big[\big\| F^{(i)}(x) - F^{(i)}(G(s)) \big\|_1\Big]$$

wherein $\mathcal{L}_{VGG}$ represents the loss value of the perceptual loss function; $s$ represents a face contour map, in particular the face contour map of the first object (e.g. an actor); $x$ represents a real head image, in particular the head image of the first object (e.g. an actor); $G$ represents the generator; $G(s)$ represents the head generation image output by the generator; $i$ represents the current layer of the network; $N$ represents the total number of layers of the network; $C_i$, $H_i$, $W_i$ are the number of channels, the height and the width of the feature map of layer $i$; and $F$ represents a VGG (Visual Geometry Group Network) model. The VGG model is a face recognition network. In one example, face recognition may be implemented by extracting facial features using the VGGFace2 network structure.
In the related art, the perceptual loss function is computed with a VGG model trained on the ImageNet dataset. In contrast, in the embodiment of the application, the VGG network model is retrained with face recognition data according to the characteristics of the head image generation task, so that it better matches the style of head images. In the formula, the real head image x and the head generation image G(s) output by the generator are both input into the VGG model F trained on face recognition data, and the distance between their features is computed at different layers of the face recognition network. By computing feature distances at different layers of the face recognition network, the perceptual loss function makes the texture details of the generated head image close to the real head image and improves the detail quality of the generated image.
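A sketch of the perceptual term, assuming a face-recognition backbone `F_face` (e.g. a VGG-style network fine-tuned on face data, as described above) that returns a list of intermediate feature maps when called; the layer choice and any per-layer weighting are assumptions.

```python
import torch

def perceptual_loss(F_face, x_real, x_fake):
    """Perceptual loss over the layers of a face-recognition network F_face.

    F_face is assumed to return a list of intermediate feature maps when called;
    each term is the L1 distance normalized by C*H*W of that layer.
    """
    feats_real = F_face(x_real)
    feats_fake = F_face(x_fake)
    loss = 0.0
    for f_real, f_fake in zip(feats_real, feats_fake):
        # torch.mean over all elements performs the 1/(C*H*W) normalization.
        loss = loss + torch.mean(torch.abs(f_real.detach() - f_fake))
    return loss
```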
In one embodiment, training a head generation model using a face contour map of a first object and a head image of the first object comprises:
the head generative model is trained using a time-series smoothing constraint function.
In the related art, the head generation model is trained with a generative adversarial network using a single image; that is, only one image at a time is used as the input information of the model. The drawback of this training method is that even a slight change in the key points between two consecutive images can cause the head of the person in the generated video to shake dramatically.
When shooting and producing movies and teaching videos, the desired effect is generally that any person can act as the substitute, and that inputting the face contour map of the substitute yields head generation images of the target actor with the same expression and pose as the substitute. This desired effect cannot be achieved with a single-image adversarial training scheme.
To solve the problem of video jitter, related-art video consistency methods achieve video stabilization by computing optical flow. However, such methods need extra network resources to compute the optical flow information between video frames, and either transform features directly through the optical flow or constrain the images produced by the generator with the optical flow information to achieve consistency between video frames. Computing optical flow information generally requires a large system overhead and consumes a large amount of system resources.
Different from optical flow based methods, the embodiment of the application adopts a time-series smoothing constraint: a time-series smoothing constraint function is constructed through the characteristics of the generative adversarial network, and the head generation model is trained with this function, which ensures the temporal consistency of the generated video and improves the stability of the head images in the generated video.
In one embodiment, the time-series smooth constraint function is constructed based on a face contour map of the first object in two consecutive frames of images, a head image of the first object in the two consecutive frames of images, and a head generation image of the first object in the two consecutive frames of images.
In order to ensure the temporal consistency of the generated video, the embodiment of the application introduces a time-series smoothing constraint through the characteristics of the generative adversarial network. In the model training process, the face contour maps of the first object in two consecutive frame images are used as the input information of the head generation model. The output information of the head generation model is the head generation images of the first object for the two consecutive frame images. Correspondingly, the head images of the first object in the two consecutive frame images, extracted from the video data of the first object, are used as the supervision information. In the embodiment of the application, the time-series smoothing constraint function is constructed based on the input information, the supervision information and the output information of the model for the two consecutive frame images. Training the generative adversarial network with the time-series smoothing constraint function constructed in this way reduces the jitter between two consecutive frame images and improves the stability of the head images in the generated video.
In one embodiment, the timing smoothing constraint function comprises:

$$\mathcal{L}_{tc}(G, D) = \mathbb{E}_{(s,x)}\big[\log D(s_{t-1}, s_t, x_{t-1}, x_t)\big] + \mathbb{E}_{s}\big[\log\big(1 - D(s_{t-1}, s_t, G(s_{t-1}), G(s_t))\big)\big]$$

wherein $G$ denotes the generator; $D$ denotes the discriminator; $\mathcal{L}_{tc}$ denotes the loss value of the time-series smoothing constraint function; $\mathbb{E}$ denotes the mathematical expectation; $s$ denotes a face contour map; $s_t$ denotes the face contour map of the first object in the current frame; $s_{t-1}$ denotes the face contour map of the first object in the previous frame; $x$ denotes a head image of the first object; $x_t$ denotes the head image of the first object in the current frame; and $x_{t-1}$ denotes the head image of the first object in the previous frame.
In the embodiment of the application, considering the correlation between video frames, the generator is first made to generate two consecutive frames in the model training process. For example, the current frame is the generated image of the second frame, $G(s_t)$. The generated image of the first frame, $G(s_{t-1})$, is conditioned on the face contour map of the first frame $s_{t-1}$ and a blank image. The generated image of the second frame, $G(s_t)$, is conditioned on the face contour map of the second frame $s_t$ and the image generated for the first frame $G(s_{t-1})$. The discriminator concatenates the face contour maps of the two consecutive frames with the real head images (i.e. the head images of the first object extracted from the video data), and with the head generation images, to form $(s_{t-1}, s_t, x_{t-1}, x_t)$ and $(s_{t-1}, s_t, G(s_{t-1}), G(s_t))$ as its input information. The discriminator compares the head generation images with the real head images and discriminates the authenticity and the temporal consistency of the head generation images.
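A hedged sketch of how these two-frame tuples can be assembled and scored, with the real tuple (s_prev, s_cur, x_prev, x_cur) opposed to the generated tuple (s_prev, s_cur, G(s_prev), G(s_cur)) along the channel dimension; the channel layouts, conditioning scheme and loss formulation are assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_smoothing_loss(G, D, s_prev, s_cur, x_prev, x_cur):
    """Adversarial temporal constraint over two consecutive frames."""
    blank = torch.zeros_like(x_prev)
    # First frame is conditioned on its contour map and a blank image,
    # the second frame on its contour map and the previously generated frame.
    g_prev = G(torch.cat([s_prev, blank], dim=1))
    g_cur = G(torch.cat([s_cur, g_prev], dim=1))

    real_tuple = torch.cat([s_prev, s_cur, x_prev, x_cur], dim=1)
    fake_tuple = torch.cat([s_prev, s_cur, g_prev, g_cur], dim=1)

    d_real = D(real_tuple)
    d_fake = D(fake_tuple)
    loss_D = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    loss_G = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    return loss_G, loss_D
```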
With training carried out in this way, in the model testing stage the face contour map of each frame can be extracted from the video data of the avatar performance in video-frame order. The generator processes each frame image sequentially in video-frame order. The first input of the generator is composed of a blank image and the face contour map of the first frame image. Each subsequent input is composed of the head generation image produced in the previous step and the face contour map of the current frame. In this way, temporal information is introduced into the model training process, and a temporal constraint is added between consecutive video frames, so that the consistency of the whole video sequence is achieved and the stability of the generated images is improved.
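The test-time loop described in this paragraph can be sketched as follows; tensor shapes, the blank-image shape and any preprocessing are assumptions of the sketch.

```python
import torch

@torch.no_grad()
def generate_head_sequence(G, contour_maps, head_shape):
    """contour_maps: tensors of shape (1, C_s, H, W) in frame order; head_shape: e.g. (1, 3, H, W)."""
    outputs = []
    previous = torch.zeros(head_shape)  # blank image for the first frame
    for s_cur in contour_maps:
        generated = G(torch.cat([s_cur, previous.to(s_cur.device)], dim=1))
        outputs.append(generated)
        previous = generated  # feed the last generated head into the next step
    return outputs
```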
Fig. 4 is a flowchart of an image generation method according to another embodiment of the present application. As shown in fig. 4, the image generating method may include:
step S210, acquiring video data of a second object;
step S220, extracting a head image of the second object and key points in the head image from a frame image of the video data of the second object;
step S230, generating a face contour map of a second object corresponding to the frame image according to the key points;
step S240, inputting the face contour map of the second object into the head generation model to obtain a head generation image of the first object corresponding to the frame image; the head generation model is obtained using any one of the above model training methods.
In the process of shooting movies and teaching videos, a substitute may be required to replace an actor or a teacher during shooting, and the head image of the substitute in the video is then replaced with the head image of the actor or the teacher.
Still taking the replacement of the avatar's head with the actor's head as an example, in step S210, video data of the avatar performance is acquired. The video data is processed in the subsequent steps to replace the avatar's head image in the video data with the actor's head image.
In step S220, each frame image is first extracted from the video data of the avatar. Then, a human face detection model and a key point detection model are utilized to extract a head image of the substitute and key points in the head image from each frame of image. In the embodiment of the application, a face detection model in a Dlib tool library may be used to extract a head image of an avatar and key points in the head image from each frame of image in video data of the avatar.
In step S230, a face contour map of the avatar is generated according to the key points in the head image of the avatar extracted in step S220.
In step S240, the face contour map of the avatar is input into the head generation model trained by any one of the above model training methods, so as to obtain the head generation image of the actor corresponding to each frame image in the video data of the avatar.
According to the embodiment of the application, the head image of the second object in the video data can be automatically replaced with the head image of the first object while the expression and pose of the second object are maintained. Replacing the head image lowers the threshold for selecting a substitute and improves the visual effect and vividness of the generated image. Moreover, training the generative adversarial network with the constructed time-series smoothing constraint function adds a temporal constraint between consecutive video frames and reduces the jitter between two consecutive frame images, thereby improving the stability of the generated images.
Fig. 5 is a flowchart of an image generation method according to another embodiment of the present application. As shown in fig. 5, in an embodiment, step S240, after inputting the face contour map of the second object into the head generating model, and obtaining the head generating image of the first object corresponding to the frame image, further includes:
step S250, obtaining a mask region of a head generation image of a first object corresponding to the frame image and a mask region of a head image of a second object by using a head segmentation algorithm;
step S260, processing the frame image based on the mask area, and replacing the head generation image of the first object onto the head image of the second object;
step S270, performing erosion processing on the mask region in the replaced frame image;
and step S280, repairing the mask region in the eroded frame image by using a head fusion model.
After the head generation image of the first object is obtained with the head generation model, the head image of the second object in the video data of the second object can be replaced with the head generation image of the first object. In the replaced image, the joint area between the actor's head and the avatar's body, that is, the image area where the neck of the character is located, may look inconsistent and the skin color and background may not match. Step S250 is a preprocessing operation of the head replacement step that addresses the inconsistency of the neck area and the mismatch of skin color and background between the two characters. In step S250, a head segmentation algorithm is used to obtain the Mask region of the head generation image produced by the head generation model and the mask region of the corresponding avatar head image extracted from each frame image of the video data of the avatar performance.
Masking in image processing means using a selected image, figure or object to occlude all or part of the image being processed, and controlling the image processing region or process through the occlusion. In the embodiment of the present application, the image area where the neck of the character is located is used as the mask region.
In step S260, each frame image of the video data of the avatar performance is processed based on the mask region, and the head generation image of the actor obtained by the head generation model replaces the corresponding avatar head image. In step S270, since the skin colors and backgrounds of the necks of different people may not match, erosion processing is performed on the image area where the neck is located in the replaced frame image, so as to obtain an image in which the neck area is wiped out.
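An OpenCV sketch of this mask-based replacement and erosion step is given below. The binary mask convention (255 inside the region), the way the neck band is derived from the masks, the kernel size and the wiping-to-black are all assumptions of the sketch rather than details fixed by the application.

```python
import cv2
import numpy as np

def replace_head_and_erode(frame, actor_head, actor_head_mask, neck_mask, kernel_size=9):
    """Paste the generated actor head into the frame, then erase the neck band.

    All images share the frame's resolution; masks are uint8 with 255 inside
    the region returned by the head segmentation algorithm.
    """
    # Replace the avatar's head pixels with the generated actor head.
    replaced = frame.copy()
    replaced[actor_head_mask > 0] = actor_head[actor_head_mask > 0]

    # Morphologically grow the neck mask slightly and wipe that band to black,
    # leaving a blank region for the head fusion model to repaint so that skin
    # colour and background stay consistent (assumed variant of the erosion step).
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    band = cv2.dilate(neck_mask, kernel, iterations=1)
    replaced[band > 0] = 0
    return replaced, band
```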
Fig. 6 is a schematic diagram of a training flow of a head fusion model according to another embodiment of the present application. Fig. 7 is a schematic diagram illustrating a testing process of a head fusion model according to another embodiment of the present application. Reference numeral 17 in fig. 7 denotes an image obtained by performing erosion processing on the mask region in the frame image after head replacement. The black image area where the neck is located in the image represents the mask area.
In step S280, a mask region in the frame image after the erosion processing is repaired by using a pre-trained head fusion model.
In one embodiment, the head fusion model is a model obtained by using any one of the above-mentioned model training methods. The Head fusion model may also be referred to as a Head fusion Network (HIN).
In the related art, during the training and prediction of a face inpainting network, the network can be trained on image data constructed by erasing a certain region of the face. For example, to repair a missing mouth of a face through the network, the network can be trained on image data in which the mouth region of the face is erased. In the embodiment of the present application, the same type of data may also be used to train the head fusion model.
The task of the head fusion model is to perform inpainting on a composite picture whose head comes from the actor and whose lower body comes from the avatar. In one example, the training process of the head fusion model may adopt a self-supervised learning approach, in which the model learns directly from unlabeled data without any manual annotation.
Reference numeral 11 in fig. 6 denotes an image obtained by performing erosion processing on the mask region in the frame image of the substitute video data, and reference numeral 12 denotes an image obtained by performing erosion processing on the mask region in the frame image of the actor video data. The black image area where the neck is located in the images shown by reference numerals 11 and 12 represents a mask area. Reference numeral 13 in fig. 6 denotes a generator G in the head fusion model, reference numeral 14 denotes an image of a substitute having skin color and background harmonized with each other after the image 11 is restored, reference numeral 15 denotes an image of an actor having skin color and background harmonized with each other after the image 12 is restored, and reference numeral 16 denotes a discriminator D in the head fusion model.
Referring to Fig. 6, in the model training process, the eroded head image 12 of the actor and the eroded head image 11 of the avatar are used as the input information of the head fusion model. After being processed by the generator 13 of the head fusion model, the repaired images 14 and 15 with consistent skin color and background are output. The images 14 and 15 are then input into the discriminator 16 of the head fusion model, which discriminates the authenticity probability of the repaired images and determines whether they are real or fake. In the above model training process, the head images of the actor and the avatar extracted from the video data may be used as the supervision information, so that the network captures the corresponding features of the actor and the avatar during training.
Referring to fig. 7, in the stage of testing the head fusion model, an image 17 obtained by performing erosion processing on the mask region in the frame image after head replacement is input into the generator 13 of the head fusion model, and after the processing by the generator 13 of the head fusion model, an image 18 with the repaired skin color and the background in harmony is output.
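A small sketch of this repair step: the eroded composite (image 17 in Fig. 7) is passed through the generator of the head fusion model to obtain the repaired frame. The tensor conversion and the [-1, 1] normalization are assumptions, and `fusion_generator` is a hypothetical handle to the trained generator.

```python
import torch
import numpy as np

@torch.no_grad()
def repair_frame(fusion_generator, eroded_frame_bgr):
    """Run the head fusion generator on one eroded composite frame (uint8 BGR)."""
    x = torch.from_numpy(eroded_frame_bgr).float().permute(2, 0, 1).unsqueeze(0) / 127.5 - 1.0
    y = fusion_generator(x)                                    # repaired image in [-1, 1]
    y = ((y.squeeze(0).permute(1, 2, 0) + 1.0) * 127.5).clamp(0, 255)
    return y.byte().cpu().numpy()
```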
In one embodiment, the head fusion model may be trained with a training method similar to that of the head generation model described above. For example, the head fusion model is trained using at least one of an adversarial loss function, a feature matching loss function, a perceptual loss function, and a time-series smoothing constraint function.
In another embodiment, when the image jitter produced in the restoration process is not obvious, temporal information need not be introduced in the training of the head fusion model; the head fusion model can then be trained in a manner similar to a head generation model trained without temporal information. For example, the head fusion model may be trained using at least one of an alignment loss function, a feature matching loss function, and a perceptual loss function.
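By way of illustration, the feature matching loss mentioned above might be sketched as follows, assuming a discriminator that exposes its intermediate feature maps (that interface, and the L1 distance, are assumptions of this sketch):

```python
import torch.nn.functional as F

def feature_matching_loss(disc_features_real, disc_features_fake):
    """Feature matching loss (sketch): compare intermediate discriminator
    activations for a real head image and the restored one, layer by layer.

    Both arguments are lists of feature tensors returned by a discriminator
    that exposes its intermediate layers.
    """
    loss = 0.0
    for f_real, f_fake in zip(disc_features_real, disc_features_fake):
        # Do not backpropagate into the "real" branch.
        loss = loss + F.l1_loss(f_fake, f_real.detach())
    return loss
```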
Compared with traditional restoration based on manual adjustment, the head fusion model can repair the incoordination of the image region where the neck is located while keeping the skin color consistent. This mask-based restoration method combined with the head fusion model is more automatic and intelligent, and the final image is more natural and gives a better visual experience.
In one embodiment, the method further comprises:
and splicing the repaired frame images frame by frame to obtain a head replacement video for replacing the head of the second object with the head of the first object.
For each frame image in the video data of the second object, a corresponding image in which the head is replaced with that of the first object may be generated through the above steps. The generated frame images are then spliced frame by frame to obtain a head replacement video in which the head of the second object is replaced with the head of the first object.
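By way of illustration, the frame-by-frame splicing might be done with OpenCV roughly as follows; the codec and frame rate are assumptions of this sketch:

```python
import cv2

def frames_to_video(frames, out_path="head_swap.mp4", fps=25.0):
    """Splice per-frame head-replacement results back into a video (sketch).

    frames: iterable of HxWx3 BGR uint8 arrays, all the same size,
            in playback order.
    """
    frames = list(frames)
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)   # append one repaired frame to the output video
    writer.release()
```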
Fig. 8 is a schematic diagram of a head replacement effect of an image generation method according to another embodiment of the present application. Fig. 8 shows a header replacement effect of one frame image. The left image in fig. 8 is an image of one frame in the performance video of the avatar, and the right image in fig. 8 is an image corresponding to the left image in which the head of the avatar is replaced with the head of the actor.
Fig. 9 is a flowchart of an image generation method according to another embodiment of the present application. As shown in fig. 9, an exemplary image generation method may include the steps of:
the first step is as follows: video data of an actor (source) and a avatar (target) are prepared.
Actor video data is collected and used to train the head generation network for the actor (source). Video data of the avatar performance is prepared and used as the video data in which the head is to be replaced.
The second step: extract head pictures and the corresponding key points from the video data.
Each frame image is extracted from the actor video data, and the face and the corresponding key points of each frame are obtained through face detection and key point detection algorithms. A face contour map is then generated from the key points to encode expression and pose information. The pose information may include the pose of the person's head, for example the head orientation and head angle.
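By way of illustration, the extraction of frames, key points, and face contour maps might be sketched as follows; the landmark detector is left as a placeholder because this application does not prescribe a specific detection algorithm, and the drawing parameters are assumptions of this sketch:

```python
import cv2
import numpy as np

def video_to_contour_maps(video_path, detect_face_landmarks):
    """Extract a per-frame face contour map from a performance video (sketch).

    detect_face_landmarks is a placeholder for whatever face detection +
    key point model is used (e.g. a 68-point detector); it is assumed to
    return an (N, 2) array of (x, y) landmark coordinates for the face.
    """
    cap = cv2.VideoCapture(video_path)
    contour_maps = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        points = detect_face_landmarks(frame)        # expression + pose live in these points
        canvas = np.zeros(frame.shape[:2], dtype=np.uint8)
        pts = points.astype(np.int32).reshape(-1, 1, 2)
        cv2.polylines(canvas, [pts], isClosed=False, color=255, thickness=2)
        contour_maps.append(canvas)                   # rasterized contour map for this frame
    cap.release()
    return contour_maps
```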
The third step: a head-generated image of the actor is obtained using a head-generation network.
Head generation images of the actor with the same pose and expression as the avatar are obtained by training a head generation network (HGN). The head generation network is trained with a time-series smoothing constraint function, which ensures the temporal consistency of the generated video and improves the stability of the head images in it.
The fourth step: and obtaining a mask region of the head image of the dummy and the head generation image of the actor by using a head segmentation algorithm, and corroding the mask region.
This step is a preprocessing operation for head replacement, used to resolve the incoordination of the neck and the inconsistency of skin color and background between different people. For the actor's head generation image and the corresponding avatar head image, the mask regions of the two heads are obtained by head segmentation, and the actor's head generation image is pasted onto the corresponding avatar image based on the masks. Since the necks of different people may not match, the neck region is eroded to obtain a picture in which the neck region is erased.
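By way of illustration, the mask-based head replacement and erosion preprocessing might be sketched with OpenCV as follows; the erosion radius and the choice to erase the eroded boundary ring of the whole head mask (rather than a separately localized neck region) are simplifying assumptions of this sketch:

```python
import cv2
import numpy as np

def paste_head_and_erode_neck(avatar_frame, actor_head, head_mask, erode_px=15):
    """Replace the avatar's head with the generated actor head, then erode
    the mask so the mismatched seam region is erased for later inpainting.

    head_mask: HxW uint8 mask (255 inside the head region) from the head
    segmentation step.
    """
    mask3 = cv2.merge([head_mask] * 3) > 0
    composite = np.where(mask3, actor_head, avatar_frame)   # paste actor head onto avatar frame

    kernel = np.ones((erode_px, erode_px), np.uint8)
    eroded_mask = cv2.erode(head_mask, kernel)               # shrink the head mask
    seam = cv2.subtract(head_mask, eroded_mask) > 0          # boundary ring, including the neck seam
    composite[seam] = 0                                      # erase the seam so it can be inpainted
    return composite, seam.astype(np.uint8)
```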
The fifth step: and repairing the mask region in the frame image after the corrosion treatment by using the head fusion network.
To give the finally obtained head replacement image a consistent visual effect, a head repairing network (HPN) is trained to restore the neck region based on the image processed in the fourth step.
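By way of illustration, running the trained repair network on one eroded composite frame might look roughly as follows, assuming a generator that takes and returns images normalized to [-1, 1] (an assumption of this sketch):

```python
import torch

@torch.no_grad()
def repair_frame(generator, composite_bgr):
    """Inpaint the erased neck/seam region of one composite frame (sketch)."""
    # HxWx3 uint8 -> 1x3xHxW float in [-1, 1]
    x = torch.from_numpy(composite_bgr).float().permute(2, 0, 1) / 127.5 - 1.0
    y = generator(x.unsqueeze(0)).squeeze(0)                  # restored image from the HPN generator
    y = ((y.clamp(-1, 1) + 1.0) * 127.5).permute(1, 2, 0)     # back to HxWx3 in [0, 255]
    return y.byte().cpu().numpy()
```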
The sixth step: generate the head replacement video.
Through the above five steps, a corresponding image in which the avatar's head is replaced with the actor's head can be generated for each frame of the avatar video. The generated images are then spliced frame by frame, finally realizing automatic head replacement in the video.
Fig. 10 is a schematic structural diagram of a model training apparatus according to another embodiment of the present application. As shown in fig. 10, the apparatus may include:
a first acquisition unit 100 for acquiring video data of a first object;
a first extraction unit 200 for extracting a head image of the first object and a key point in the head image from a frame image of video data of the first object;
a first generating unit 300, configured to generate a face contour map of the first object according to the key points;
a training unit 400, configured to train a head generation model using the face contour map of the first object and the head image of the first object, so that the trained head generation model obtains the head generation image of the first object according to the face contour map of the second object.
In one embodiment, the face contour map is used to characterize the facial expression information and/or pose information of the person.
In one embodiment, the head generation model comprises a generative adversarial network.
In one embodiment, the training unit 400 is further configured to:
the head generative model is trained using a time-series smoothing constraint function.
In one embodiment, the time-series smooth constraint function is constructed based on a face contour map of the first object in two consecutive frames of images, a head image of the first object in the two consecutive frames of images, and a head generation image of the first object in the two consecutive frames of images.
In one embodiment, the time-series smoothing constraint function comprises:

$$\mathcal{L}_{T}(G, D) = \mathbb{E}_{(x, y)}\big[\log D(x_{t-1}, x_{t}, y_{t-1}, y_{t})\big]$$
$$\qquad +\; \mathbb{E}_{x}\big[\log\big(1 - D\big(x_{t-1}, x_{t}, G(x_{t-1}), G(x_{t})\big)\big)\big]$$

wherein $G$ denotes the generator; $D$ denotes the discriminator; $\mathcal{L}_{T}$ denotes the loss value of the time-series smoothing constraint function; $\mathbb{E}$ denotes a mathematical expectation; $x$ denotes a face contour map; $x_{t}$ denotes the face contour map of the first object in the current frame; $x_{t-1}$ denotes the face contour map of the first object in the previous frame; $y$ denotes a head image of the first object; $y_{t}$ denotes the head image of the first object in the current frame; and $y_{t-1}$ denotes the head image of the first object in the previous frame.
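By way of illustration, one possible PyTorch realization of the time-series smoothing constraint above is sketched below; stacking the two consecutive contour maps and head images along the channel dimension, and the binary cross-entropy form, are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def time_series_smoothing_loss(G, D, x_prev, x_cur, y_prev, y_cur):
    """One reading of the time-series smoothing constraint (sketch).

    x_prev/x_cur: face contour maps of the previous/current frame,
    y_prev/y_cur: real head images of the previous/current frame,
    all NCHW tensors. D sees two consecutive contour maps together with two
    consecutive head images (real or generated) stacked along channels.
    """
    fake_prev, fake_cur = G(x_prev), G(x_cur)
    real_pack = torch.cat([x_prev, x_cur, y_prev, y_cur], dim=1)
    fake_pack = torch.cat([x_prev, x_cur, fake_prev, fake_cur], dim=1)

    # Discriminator loss: real consecutive pair -> 1, generated pair -> 0.
    d_real = D(real_pack)
    d_fake = D(fake_pack.detach())
    loss_D = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # Generator loss: make the generated consecutive pair look real.
    d_fake_for_G = D(fake_pack)
    loss_G = F.binary_cross_entropy_with_logits(d_fake_for_G, torch.ones_like(d_fake_for_G))
    return loss_G, loss_D
```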
In one embodiment, the training unit 400 is further configured to:
the head generation model is trained using at least one of an alignment loss function, a feature matching loss function, and a perceptual loss function.
Fig. 11 is a schematic structural diagram of an image generating apparatus according to another embodiment of the present application. As shown in fig. 11, the apparatus may include:
a second acquiring unit 500 for acquiring video data of a second object;
a second extraction unit 600 for extracting a head image of the second object and a key point in the head image from a frame image of video data of the second object;
a second generating unit 700 for generating a face contour map of a second object corresponding to the frame image from the key points;
a third generating unit 800, configured to input a face contour map of the second object into the head generating model, so as to obtain a head generating image of the first object corresponding to the frame image; the head generation model is a model obtained by adopting any one of the model training devices.
Fig. 12 is a schematic structural diagram of an image generating apparatus according to another embodiment of the present application. As shown in fig. 12, in an embodiment, the apparatus further includes a repair unit 900, and the repair unit 900 is configured to:
obtaining a mask region of a head generation image of a first object corresponding to the frame image and a mask region of a head image of a second object by using a head segmentation algorithm;
processing the frame image based on the mask area, and replacing the head generation image of the first object onto the head image of the second object;
eroding the mask region in the replaced frame image;
and repairing the mask region in the frame image after the erosion treatment by using the head fusion model.
In one embodiment, the head fusion model is a model obtained by using the model training apparatus according to any one of the above embodiments.
In one embodiment, the apparatus further comprises a splicing unit 950, wherein the splicing unit 950 is configured to:
and splicing the repaired frame images frame by frame to obtain a head replacement video for replacing the head of the second object with the head of the first object.
The functions of each module in each apparatus in the embodiment of the present application may refer to corresponding descriptions in the above method, and are not described herein again.
FIG. 13 is a block diagram of an electronic device used to implement embodiments of the present application. As shown in fig. 13, the electronic apparatus includes: a memory 910 and a processor 920, the memory 910 having stored therein computer programs operable on the processor 920. The processor 920, when executing the computer program, implements the model training method and the image generation method in the above-described embodiments. The number of the memory 910 and the processor 920 may be one or more.
The electronic device further includes:
and a communication interface 930 for communicating with an external device to perform data interactive transmission.
If the memory 910, the processor 920 and the communication interface 930 are implemented independently, the memory 910, the processor 920 and the communication interface 930 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 13, but this is not intended to represent only one bus or type of bus.
Optionally, in an implementation, if the memory 910, the processor 920 and the communication interface 930 are integrated on a chip, the memory 910, the processor 920 and the communication interface 930 may complete communication with each other through an internal interface.
Embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the program is executed by a processor, the computer program implements the method provided in the embodiments of the present application.
The embodiment of the present application further provides a chip, where the chip includes a processor configured to call and execute instructions stored in a memory, so that the communication device in which the chip is installed executes the method provided in the embodiment of the present application.
An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.
It should be understood that the processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and so on. A general purpose processor may be a microprocessor or any conventional processor. It is noted that the processor may be a processor supporting the Advanced RISC Machines (ARM) architecture.
Further, optionally, the memory may include a read-only memory and a random access memory, and may further include a nonvolatile random access memory. The memory may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may include a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available. For example, Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct memory bus RAM (DR RAM).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the present application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the above method embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, includes one of or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (24)

1. A method of model training, comprising:
acquiring video data of a first object;
extracting a head image of a first object and a key point in the head image from a frame image of video data of the first object;
generating a face contour map of the first object according to the key points;
and training a head generation model by using the face contour map of the first object and the head image of the first object, so that the trained head generation model obtains a head generation image of the first object according to the face contour map of the second object.
2. The method of claim 1, wherein the face contour map is used to characterize facial expression information and/or pose information of a person.
3. The method of claim 1 or 2, wherein the head generation model comprises a generative adversarial network.
4. The method of claim 3, wherein training a head generative model using the face contour map of the first object and the head image of the first object comprises:
training the head generative model using a temporal smoothing constraint function.
5. The method of claim 4, wherein the time-series smoothing constraint function is constructed based on a face contour map of the first object in two consecutive frames of images, a head image of the first object in two consecutive frames of images, and a head generation image of the first object in two consecutive frames of images.
6. The method of claim 5, wherein the time-series smoothing constraint function comprises:

$$\mathcal{L}_{T}(G, D) = \mathbb{E}_{(x, y)}\big[\log D(x_{t-1}, x_{t}, y_{t-1}, y_{t})\big]$$
$$\qquad +\; \mathbb{E}_{x}\big[\log\big(1 - D\big(x_{t-1}, x_{t}, G(x_{t-1}), G(x_{t})\big)\big)\big]$$

wherein $G$ denotes the generator; $D$ denotes the discriminator; $\mathcal{L}_{T}$ denotes the loss value of the time-series smoothing constraint function; $\mathbb{E}$ denotes a mathematical expectation; $x$ denotes a face contour map; $x_{t}$ denotes the face contour map of the first object in the current frame; $x_{t-1}$ denotes the face contour map of the first object in the previous frame; $y$ denotes a head image of the first object; $y_{t}$ denotes the head image of the first object in the current frame; and $y_{t-1}$ denotes the head image of the first object in the previous frame.
7. The method of claim 3, wherein training a head generative model using the face contour map of the first object and the head image of the first object comprises:
training the head generation model with at least one of an alignment loss function, a feature matching loss function, and a perceptual loss function.
8. An image generation method, comprising:
acquiring video data of a second object;
extracting a head image of the second object and a key point in the head image from a frame image of video data of the second object;
generating a face contour map of a second object corresponding to the frame image according to the key points;
inputting the face contour map of the second object into a head generation model to obtain a head generation image of the first object corresponding to the frame image; wherein the head generation model is a model obtained by the model training method according to any one of claims 1 to 7.
9. The method of claim 8, wherein after inputting the face contour map of the second object into the head generation model to obtain the head generation image of the first object corresponding to the frame image, further comprising:
obtaining a mask region of a head generation image of the first object and a mask region of a head image of the second object corresponding to the frame image by using a head segmentation algorithm;
processing the frame image based on the mask region, replacing the head generation image of the first object onto the head image of the second object;
eroding the mask area in the replaced frame image;
and repairing the mask region in the frame image after the erosion treatment by using the head fusion model.
10. The method according to claim 9, wherein the head fusion model is a model obtained by the model training method according to any one of claims 1 to 7.
11. The method according to claim 9 or 10, characterized in that the method further comprises:
and splicing the repaired frame images frame by frame to obtain a head replacement video for replacing the head of the second object with the head of the first object.
12. A model training apparatus, comprising:
a first acquisition unit configured to acquire video data of a first object;
a first extraction unit configured to extract a head image of a first object and a key point in the head image from a frame image of video data of the first object;
the first generating unit is used for generating a face contour map of the first object according to the key points;
and the training unit is used for training a head generation model by using the face contour map of the first object and the head image of the first object, so that the trained head generation model obtains the head generation image of the first object according to the face contour map of the second object.
13. The apparatus of claim 12, wherein the face contour map is used to represent facial expression information and/or posture information of a person.
14. The apparatus of claim 12 or 13, wherein the head generation model comprises a generative adversarial network.
15. The apparatus of claim 14, wherein the training unit is further configured to:
training the head generative model using a temporal smoothing constraint function.
16. The apparatus of claim 15, wherein the time-series smoothing constraint function is constructed based on a face contour map of the first object in two consecutive images, a head image of the first object in two consecutive images, and a head generation image of the first object in two consecutive images.
17. The apparatus of claim 16, wherein the time-series smoothing constraint function comprises:

$$\mathcal{L}_{T}(G, D) = \mathbb{E}_{(x, y)}\big[\log D(x_{t-1}, x_{t}, y_{t-1}, y_{t})\big]$$
$$\qquad +\; \mathbb{E}_{x}\big[\log\big(1 - D\big(x_{t-1}, x_{t}, G(x_{t-1}), G(x_{t})\big)\big)\big]$$

wherein $G$ denotes the generator; $D$ denotes the discriminator; $\mathcal{L}_{T}$ denotes the loss value of the time-series smoothing constraint function; $\mathbb{E}$ denotes a mathematical expectation; $x$ denotes a face contour map; $x_{t}$ denotes the face contour map of the first object in the current frame; $x_{t-1}$ denotes the face contour map of the first object in the previous frame; $y$ denotes a head image of the first object; $y_{t}$ denotes the head image of the first object in the current frame; and $y_{t-1}$ denotes the head image of the first object in the previous frame.
18. The apparatus of claim 14, wherein the training unit is further configured to:
training the head generation model with at least one of an alignment loss function, a feature matching loss function, and a perceptual loss function.
19. An image generation apparatus, comprising:
a second acquisition unit configured to acquire video data of a second object;
a second extraction unit configured to extract a head image of the second object and a key point in the head image from a frame image of video data of the second object;
a second generating unit, configured to generate a face contour map of a second object corresponding to the frame image according to the key point;
a third generating unit, configured to input the face contour map of the second object into a head generation model, so as to obtain a head generation image of the first object corresponding to the frame image; wherein the head-generated model is a model obtained by using the model training apparatus according to any one of claims 12 to 18.
20. The apparatus of claim 19, further comprising a repair unit to:
obtaining a mask region of a head generation image of the first object and a mask region of a head image of the second object corresponding to the frame image by using a head segmentation algorithm;
processing the frame image based on the mask region, replacing the head generation image of the first object onto the head image of the second object;
eroding the mask area in the replaced frame image;
and repairing the mask region in the frame image after the erosion treatment by using the head fusion model.
21. The apparatus according to claim 20, wherein the head fusion model is a model obtained by using the model training apparatus according to any one of claims 12 to 18.
22. The apparatus according to claim 20 or 21, further comprising a splicing unit for:
and splicing the repaired frame images frame by frame to obtain a head replacement video for replacing the head of the second object with the head of the first object.
23. An electronic device comprising a processor and a memory, the memory having stored therein instructions that are loaded and executed by the processor to implement the method of any of claims 1 to 11.
24. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 11.
CN202011289278.7A 2020-11-18 2020-11-18 Model training method, image generation method, device, equipment and storage medium Pending CN112101320A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011289278.7A CN112101320A (en) 2020-11-18 2020-11-18 Model training method, image generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011289278.7A CN112101320A (en) 2020-11-18 2020-11-18 Model training method, image generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112101320A true CN112101320A (en) 2020-12-18

Family

ID=73784629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011289278.7A Pending CN112101320A (en) 2020-11-18 2020-11-18 Model training method, image generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112101320A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860044A (en) * 2019-04-26 2020-10-30 北京陌陌信息技术有限公司 Face changing method, device and equipment and computer storage medium
CN111860045A (en) * 2019-04-26 2020-10-30 北京陌陌信息技术有限公司 Face changing method, device and equipment and computer storage medium
CN110349232A (en) * 2019-06-17 2019-10-18 达闼科技(北京)有限公司 Generation method, device, storage medium and the electronic equipment of image
CN111461959A (en) * 2020-02-17 2020-07-28 浙江大学 Face emotion synthesis method and device
CN111476871A (en) * 2020-04-02 2020-07-31 百度在线网络技术(北京)有限公司 Method and apparatus for generating video
CN111598977A (en) * 2020-05-21 2020-08-28 北京中科深智科技有限公司 Method and system for transferring and animating expression
CN111783647A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Training method of face fusion model, face fusion method, device and equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112669228A (en) * 2020-12-22 2021-04-16 厦门美图之家科技有限公司 Image processing method, system, mobile terminal and storage medium
CN112669228B (en) * 2020-12-22 2024-05-31 厦门美图之家科技有限公司 Image processing method, system, mobile terminal and storage medium
CN112927319A (en) * 2021-02-25 2021-06-08 北京百度网讯科技有限公司 Model training method, image processing method, device, equipment and storage medium
CN112927319B (en) * 2021-02-25 2023-10-03 北京百度网讯科技有限公司 Model training method, image processing method, device, equipment and storage medium
CN113160799A (en) * 2021-04-22 2021-07-23 北京房江湖科技有限公司 Video generation method and device, computer-readable storage medium and electronic equipment
WO2023039865A1 (en) * 2021-09-17 2023-03-23 深圳市大疆创新科技有限公司 Image processing method, video processing method, training method, device, program product, and storage medium

Similar Documents

Publication Publication Date Title
Kwon et al. Kodf: A large-scale korean deepfake detection dataset
CN112101320A (en) Model training method, image generation method, device, equipment and storage medium
Valle et al. Multi-task head pose estimation in-the-wild
CN110490896B (en) Video frame image processing method and device
JP7476428B2 (en) Image line of sight correction method, device, electronic device, computer-readable storage medium, and computer program
CN112419170B (en) Training method of shielding detection model and beautifying processing method of face image
CN109815826B (en) Method and device for generating face attribute model
CN109684925B (en) Depth image-based human face living body detection method and device
CN111310624A (en) Occlusion recognition method and device, computer equipment and storage medium
CN111354079A (en) Three-dimensional face reconstruction network training and virtual face image generation method and device
CN110490212A (en) Molybdenum target image processing arrangement, method and apparatus
CN108230291B (en) Object recognition system training method, object recognition method, device and electronic equipment
CN112446302B (en) Human body posture detection method, system, electronic equipment and storage medium
CN112529999A (en) Parameter estimation model training method, device, equipment and storage medium
EP3675034A1 (en) Image realism predictor
CN113343878A (en) High-fidelity face privacy protection method and system based on generation countermeasure network
CN109063643B (en) Facial expression pain degree identification method under condition of partial hiding of facial information
CN111539911B (en) Mouth breathing face recognition method, device and storage medium
CN113706564A (en) Meibomian gland segmentation network training method and device based on multiple supervision modes
CN115689869A (en) Video makeup migration method and system
CN114973349A (en) Face image processing method and training method of face image processing model
Zhao Just noticeable learning for unsupervised anomaly localization and detection
WO2022160773A1 (en) Pedestrian re-identification method based on virtual samples
CN115862119B (en) Attention mechanism-based face age estimation method and device
Gonzalez-Soler et al. Semi-synthetic data generation for tattoo segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination