CN110516598B - Method and apparatus for generating image - Google Patents

Method and apparatus for generating image

Info

Publication number
CN110516598B
Authority
CN
China
Prior art keywords
image
face image
face
target
facial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910797619.2A
Other languages
Chinese (zh)
Other versions
CN110516598A (en)
Inventor
胡天舒
张世昌
洪智滨
韩钧宇
刘经拓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910797619.2A priority Critical patent/CN110516598B/en
Publication of CN110516598A publication Critical patent/CN110516598A/en
Application granted granted Critical
Publication of CN110516598B publication Critical patent/CN110516598B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Embodiments of the present disclosure disclose methods and apparatus for generating images. One embodiment of the method comprises: acquiring a master image and a target face image, wherein the master image comprises a face image to be replaced and a background; determining a matching face image from a preset face image library matched with the face image to be replaced, wherein the matched preset face image library comprises face images of different facial poses of the face indicated by the face image to be replaced, and the matching face image presents that face with a facial pose consistent with the facial pose displayed by the target face image; and generating a target image by replacing the face image to be replaced with the matching face image, wherein the target image comprises a face image consistent with the matching face image and a background consistent with the master image. This embodiment speeds up the generation of an image consistent with the facial pose of the target face image.

Description

Method and apparatus for generating image
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method and an apparatus for generating an image.
Background
With the rapid development of artificial intelligence technology, interactive functions related to human faces are gradually being added to applications such as short video and live video, so that a preset face template (such as a cartoon character) can be driven by changes in the user's facial expression to produce an approximate expression.
A related method detects changes in facial key points and deforms a preset face template accordingly, so as to generate an expression, mouth shape and the like on the face template that are consistent with the user's facial pose.
Disclosure of Invention
Embodiments of the present disclosure propose methods and apparatuses for generating an image.
In a first aspect, an embodiment of the present disclosure provides a method for generating an image, the method including: acquiring a master image and a target face image, wherein the master image comprises a face image to be replaced and a background; determining a matching face image from a preset face image library matched with the face image to be replaced, wherein the matched preset face image library comprises face images of different facial poses of the face indicated by the face image to be replaced, and the matching face image presents that face with a facial pose consistent with the facial pose displayed by the target face image; and generating a target image by replacing the face image to be replaced with the matching face image, wherein the target image comprises a face image consistent with the matching face image and a background consistent with the master image.
In some embodiments, the acquiring the master image and the target face image includes: acquiring a first video shot for a first user and a second video shot for a second user; extracting a video frame comprising a face image of the first user from the first video as a master image; extracting a video frame including a face image of the second user from the second video; and extracting a face image of the second user from the video frame including the face image of the second user as a target face image. After the target image is generated by replacing the face image to be replaced with the matching face image, the method further includes: generating a target video based on the target image, wherein the facial pose of the second user displayed in the target video matches the facial pose of the first user displayed in the first video.
In some embodiments, the preset face image library is obtained by the following steps: acquiring a reference human face image library, wherein the reference human face image library comprises images displaying different facial poses of a reference human face; inputting images in a reference facial image library into a pre-trained image generation model to generate a matched reference facial image, wherein the image generation model comprises a coding network, a hidden layer network and a decoding network, and the facial pose displayed by the matched reference facial image is consistent with the facial pose displayed by the input image; and generating a preset face image library based on the matched reference face image.
In some embodiments, the hidden layer networks include a first hidden layer network and a second hidden layer network, the image generation model includes a first image generation sub-model and a second image generation sub-model, the first image generation sub-model includes an encoding network, a first hidden layer network, a second hidden layer network, and a decoding network, the second image generation sub-model includes an encoding network, a decoding network, and a target hidden layer network, and the target hidden layer network is one of the first hidden layer network and the second hidden layer network.
In some embodiments, the image generation model is trained by: acquiring a sample reference face image set and a sample face image set, wherein the sample reference face image set comprises a subset of a reference face image library; carrying out image preprocessing transformation on the sample reference facial image set and the sample facial image set to generate a sample preprocessing reference facial image set and a sample preprocessing facial image set; and respectively taking the sample preprocessing reference face image and the sample preprocessing face image as the input of a first image generation sub-model and a second image generation sub-model, respectively taking the sample preprocessing reference face image and the sample face image corresponding to the input as the expected output of the first image generation sub-model and the second image generation sub-model, and training to obtain an image generation model.
In some embodiments, the replacing the to-be-replaced face image based on the matching face image to generate the target image includes: carrying out face alignment on the matched face image and the face image to be replaced; performing triangulation based on the aligned matched face image and the face image to be replaced; replacing according to the corresponding relation of the triangular areas divided by the triangular subdivision in the aligned matched face image and the face image to be replaced to generate a quasi-target image; extracting the outline of the face image from the quasi-target image; generating a mask according to the outline of the face image; generating color distribution information of the face image according to the mask and the quasi-target image; rendering the face image according to the color distribution information to generate a target image.
In a second aspect, an embodiment of the present disclosure provides an apparatus for generating an image, the apparatus including: an acquisition unit configured to acquire a master image and a target face image, wherein the master image comprises a face image to be replaced and a background; a determining unit configured to determine a matching face image from a preset face image library matched with the face image to be replaced, wherein the matched preset face image library comprises face images of different facial poses of the face indicated by the face image to be replaced, and the matching face image presents that face with a facial pose consistent with the facial pose displayed by the target face image; and a first generating unit configured to generate a target image by replacing the face image to be replaced with the matching face image, wherein the target image comprises a face image consistent with the matching face image and a background consistent with the master image.
In some embodiments, the obtaining unit includes: an acquisition module configured to acquire a first video photographed for a first user and a second video photographed for a second user; a first extraction module configured to extract a video frame including a face image of a first user from a first video as a master image; a second extraction module configured to extract a video frame including a face image of a second user from a second video; a third extraction module configured to extract a face image of a second user as a target face image from a video frame including the face image of the second user; and the apparatus further comprises: a second generation unit configured to generate a target video based on the target image, wherein a face pose of the second user displayed in the target video matches a face pose of the first user displayed in the first video.
In some embodiments, the preset face image library is obtained by the following steps: acquiring a reference human face image library, wherein the reference human face image library comprises images displaying different facial poses of a reference human face; inputting images in a reference facial image library into a pre-trained image generation model to generate a matched reference facial image, wherein the image generation model comprises a coding network, a hidden layer network and a decoding network, and the facial pose displayed by the matched reference facial image is consistent with the facial pose displayed by the input image; and generating a preset face image library based on the matched reference face image.
In some embodiments, the hidden layer networks include a first hidden layer network and a second hidden layer network, the image generation model includes a first image generation sub-model and a second image generation sub-model, the first image generation sub-model includes an encoding network, a first hidden layer network, a second hidden layer network, and a decoding network, the second image generation sub-model includes an encoding network, a decoding network, and a target hidden layer network, and the target hidden layer network is one of the first hidden layer network and the second hidden layer network.
In some embodiments, the image generation model is trained by: acquiring a sample reference face image set and a sample face image set, wherein the sample reference face image set comprises a subset of a reference face image library; carrying out image preprocessing transformation on the sample reference facial image set and the sample facial image set to generate a sample preprocessing reference facial image set and a sample preprocessing facial image set; and respectively taking the sample preprocessing reference face image and the sample preprocessing face image as the input of a first image generation sub-model and a second image generation sub-model, respectively taking the sample preprocessing reference face image and the sample face image corresponding to the input as the expected output of the first image generation sub-model and the second image generation sub-model, and training to obtain an image generation model.
In some embodiments, the first generating unit includes: the alignment module is configured to align the matched face image with the face image to be replaced; the triangulation module is configured to triangulate based on the aligned matching face image and the face image to be replaced; the first generation module is configured to replace the triangular areas divided by triangulation according to the corresponding relation between the aligned matched face image and the face image to be replaced, and generate a quasi-target image; the fourth extraction module is configured to extract the outline of the face image from the quasi-target image; the second generation module is configured to generate a mask according to the outline of the face image; the third generation module is configured to generate color distribution information of the face image according to the mask and the quasi-target image; and the fourth generation module is configured to render the face image according to the color distribution information and generate a target image.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which when executed by a processor implements the method as described in any of the implementations of the first aspect.
According to the method and the apparatus for generating an image provided by the embodiments of the present disclosure, the master image and the target face image are first acquired. The master image comprises a face image to be replaced and a background. Then, a matching face image is determined from a preset face image library matched with the face image to be replaced. The matched preset face image library comprises face images of different facial poses of the face indicated by the face image to be replaced. The matching face image presents that face with a facial pose consistent with the facial pose displayed by the target face image. Then, the face image to be replaced is replaced with the matching face image to generate a target image. The target image comprises a face image consistent with the matching face image and a background consistent with the master image. This increases the speed of generating an image consistent with the facial pose of the target face image.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating an image according to the present disclosure;
FIG. 3 is a schematic illustration of one application scenario of a method for generating an image according to an embodiment of the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method for generating an image according to the present disclosure;
FIG. 5 is a schematic block diagram of one embodiment of an apparatus for generating an image according to the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary architecture 100 to which the method for generating an image or the apparatus for generating an image of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 interact with a server 105 via a network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a search-type application, an instant messaging tool, a mailbox client, social platform software, an image processing-type application, a video editing-type application, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting image processing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server providing various services, such as a background server providing support for image processing type applications on the terminal devices 101, 102, 103. The background server may process the received image and feed back a processing result (e.g., the processed image) to the terminal device.
The image may be directly stored locally in the server 105, and the server 105 may directly extract and process the locally stored image, and in this case, the terminal apparatuses 101, 102, and 103 and the network 104 may not be present.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the method for generating an image provided by the embodiment of the present disclosure is generally performed by the server 105, and accordingly, the apparatus for generating an image is generally disposed in the server 105. Alternatively, the method for generating the image provided by the embodiment of the present disclosure may also be directly executed by the terminal device 101, 102, 103, and accordingly, the apparatus for generating the image may also be disposed in the terminal device 101, 102, 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating an image according to the present disclosure is shown. The method for generating an image comprises the steps of:
step 201, acquiring a master image and a target face image.
In this embodiment, the execution subject of the method for generating an image (such as the server 105 shown in fig. 1) may acquire the master image and the target face image through a wired connection or a wireless connection. The master image may include a face image to be replaced and a background. The background may include the portion of the master image other than the face image to be replaced, which may be determined by using a matting technique (Image Matting). The target face image may be any face image specified in advance according to actual application requirements. The target face image may also be determined according to a rule, for example, a face image uploaded by a user terminal.
As an example, the execution subject may acquire a master image and a target face image stored locally in advance. As another example, the execution subject may also obtain a master image and a target face image transmitted by an electronic device (e.g., the terminal device shown in fig. 1) communicatively connected to the execution subject.
Step 202, determining a matched face image from a preset face image library matched with the face image to be replaced.
In this embodiment, the execution subject may first obtain a preset face image library matched with the face image to be replaced. The preset face image library may include face images of different facial poses. The facial poses may include, but are not limited to, at least one of the following: expression, mouth shape, and pose angles (Euler angles). Optionally, different faces may correspond to different preset face image libraries, so that the faces displayed by the face images in one preset face image library all correspond to the same person. The correspondence between a preset face image library and the corresponding face may take various forms, such as a correspondence table or matching of identifiers (e.g., an ID or a feature vector). The preset face image library may generally include an index constructed based on a feature representation (embedding) of each face image. Therefore, the execution subject can acquire the preset face image library matched with the face image to be replaced. It can be understood that the preset face image library matched with the face image to be replaced may include face images of different facial poses of the face indicated by the face image to be replaced.
It should be noted that, in order to improve the accuracy of matching face images and ensure the training effect, the number of images in the preset face image library is usually large. For example, it may be no less than 6000.
In this embodiment, the execution subject may further determine a matching face image from the preset face image library matched with the face image to be replaced. The matching face image presents the face indicated by the face image to be replaced with a facial pose consistent with the facial pose displayed by the target face image. As an example, the execution subject may perform face key point extraction on the target face image to generate a target face image feature vector. According to the target face image feature vector, the preset face image library matched with the face image to be replaced is searched by using various image retrieval methods. Then, the execution subject may determine the retrieved image with the best match (e.g., the smallest distance or the largest similarity) as the matching face image. The image retrieval method may be, for example, an approximate nearest neighbor (ANN) method, a multidimensional indexing method, or the like.
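As an illustrative sketch (not taken from the patent), the retrieval described above can be approximated with a landmark-based pose descriptor and a brute-force nearest-neighbour search; the function and variable names (landmark_feature, library_vectors, library_images) are assumptions for illustration only.

```python
import numpy as np

def landmark_feature(landmarks: np.ndarray) -> np.ndarray:
    """Turn detected face key points (N x 2) into a pose descriptor."""
    pts = landmarks.astype(np.float32)
    pts -= pts.mean(axis=0)                     # remove translation
    scale = np.linalg.norm(pts)
    pts = pts / scale if scale > 0 else pts     # remove scale
    return pts.ravel()

def find_matching_face(target_landmarks, library_vectors, library_images):
    """Brute-force nearest neighbour over the preset face image library.

    library_vectors: (M, D) array of descriptors indexing library_images.
    """
    query = landmark_feature(target_landmarks)
    dists = np.linalg.norm(library_vectors - query, axis=1)
    best = int(np.argmin(dists))                # smallest distance = best match
    return library_images[best]
```

For a large library, an approximate nearest neighbour index (for example a KD-tree) would replace the brute-force scan.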
In some optional implementation manners of this embodiment, the preset face image library may be obtained by:
firstly, a reference face image library is obtained.
In these implementations, the executing agent may obtain the library of reference facial images from a local or communicatively coupled electronic device (e.g., a database server). The reference face image library may include images with different facial poses of the reference face. In general, the faces displayed by the face images in the reference face image library all correspond to the same person.
It should be noted that, in order to ensure the completeness of the preset face image library, the number of face images in the reference face image library is usually large. For example, it may be no less than 6000. The reference face image library may typically include an index constructed based on a feature representation of each face image.
And secondly, inputting the images in the reference human face image library into a pre-trained image generation model to generate a matched reference human face image.
In these implementations, the image generation model described above may include an encoding network, a hidden network, and a decoding network. The face pose displayed by the matching reference face image is identical to the face pose displayed by the input image.
As an example, the image generation model described above may be an autoencoder trained in advance using a machine learning method. The image generation model can be used to represent the correspondence between the images in the reference face image library and the matching reference face images. Thus, the execution subject can input the images in the reference face image library into the pre-trained image generation model to generate the matching reference face images.
In these implementations, the hidden layer network may include a first hidden layer network and a second hidden layer network. The image generation model may include a first image generation submodel and a second image generation submodel. The first image generation submodel may include the encoding network (encoder), the first hidden layer network, the second hidden layer network, and the decoding network (decoder). The second image generation submodel includes the encoding network, the decoding network, and a target hidden layer network. The target hidden layer network may be one of the first hidden layer network and the second hidden layer network.
Alternatively, the first hidden layer network and the second hidden layer network may have the same network structure, but typically have different network parameters.
Alternatively, the first hidden layer network and the second hidden layer network may be connected in parallel between the encoding network and the decoding network.
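The following PyTorch sketch illustrates one way the structure described above could be realised: a shared encoder and decoder with two hidden layer networks connected in parallel. The fully connected layers, their sizes, and the 128 x 128 input resolution are assumptions for illustration; the patent does not fix a concrete architecture.

```python
import torch
import torch.nn as nn

class ImageGenerationModel(nn.Module):
    def __init__(self, latent_dim=512, hidden_dim=256):
        super().__init__()
        # Shared encoding and decoding networks (layer sizes are assumptions).
        self.encoder = nn.Sequential(nn.Flatten(),
                                     nn.Linear(3 * 128 * 128, latent_dim), nn.ReLU())
        self.hidden1 = nn.Linear(latent_dim, hidden_dim)  # first hidden layer network
        self.hidden2 = nn.Linear(latent_dim, hidden_dim)  # second hidden layer network
        self.decoder = nn.Sequential(nn.Linear(2 * hidden_dim, 3 * 128 * 128), nn.Sigmoid())

    def forward_first(self, x):
        # First sub-model: encoder -> both hidden networks in parallel -> concatenate -> decoder.
        z = self.encoder(x)
        h = torch.cat([self.hidden1(z), self.hidden2(z)], dim=1)
        return self.decoder(h).view(-1, 3, 128, 128)

    def forward_second(self, x, target="hidden1"):
        # Second sub-model: encoder -> one target hidden network, duplicated and
        # concatenated so the decoder input keeps the same dimension -> decoder.
        z = self.encoder(x)
        h = self.hidden1(z) if target == "hidden1" else self.hidden2(z)
        h = torch.cat([h, h], dim=1)
        return self.decoder(h).view(-1, 3, 128, 128)
```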
Optionally, based on the optional implementation manner, the image generation model may be obtained by training through the following steps:
the method comprises the steps of firstly, obtaining a sample reference face image set and a sample face image set.
In these implementations, the performing subject of the training step may obtain the set of sample reference facial images and the set of sample facial images from a locally or communicatively connected electronic device. The sample reference facial image set may be a subset of the reference facial image library. In order to improve the training effect of the model, the number of images in the sample reference facial image set and the sample facial image set is usually large. For example, the number of images in each set may be no less than 700.
It should be noted that the sample reference facial image and the sample facial image are generally identical in size, for example, 128 × 128 pixels.
And secondly, performing image preprocessing transformation on the sample reference facial image set and the sample facial image set to generate a sample preprocessing reference facial image set and a sample preprocessing facial image set.
In these implementations, the executing entity may perform image preprocessing transformation on the images in the sample reference facial image set and the sample facial image set acquired in the first step. The image preprocessing transformation may include various processing operations for fine-tuning the image. Such as Image Warping (Image Warping), adjusting brightness, contrast, etc. Thus, a sample pre-processing reference face image set and a sample pre-processing face image set respectively corresponding to the sample reference face image set and the sample face image set can be generated.
And thirdly, respectively taking the sample preprocessing reference face image and the sample preprocessing face image as the input of a first image generation sub-model and a second image generation sub-model, respectively taking the sample preprocessing reference face image and the sample face image corresponding to the input as the expected output of the first image generation sub-model and the second image generation sub-model, and training to obtain an image generation model.
Specifically, the executing agent of the training step may perform training according to the following steps:
s1, firstly, inputting the sample preprocessing reference facial image in the sample preprocessing reference facial image set to an initial coding network to obtain a first sample code; then, the sample first codes are respectively input into an initial first hidden layer network and an initial second hidden layer network, and a sample second code and a sample third code are respectively obtained; then, connecting the second sample code with the third sample code to obtain a fourth sample code; then, inputting the fourth code of the sample into an initial decoding network to obtain a first reconstructed image of the sample; next, a difference degree between the obtained sample first reconstructed image and a sample reference face image corresponding to the input sample preprocessing reference face image is calculated as a first loss value by using a preset loss function.
S2, inputting a sample preprocessing face image in the sample preprocessing face image set into the initial encoding network to obtain a sample fifth code; then, inputting the sample fifth code into an initial target hidden layer network to obtain a sample sixth code; then, duplicating and concatenating the sample sixth code to obtain a sample seventh code; then, inputting the sample seventh code into the initial decoding network to obtain a sample second reconstructed image; and next, calculating, by using the preset loss function, the degree of difference between the obtained sample second reconstructed image and the sample face image corresponding to the input sample preprocessing face image as a second loss value. The dimension of the sample seventh code is generally the same as that of the sample fourth code.
And S3, adjusting the network parameters of the initial encoding network, the initial first hidden layer network, the initial second hidden layer network and the initial decoding network based on the calculated degrees of difference, and continuing training according to steps S1 and S2. The training is finished when a preset training end condition is met. Finally, the initial image generation model composed of the trained initial encoding network, initial first hidden layer network, initial second hidden layer network and initial decoding network is determined as the image generation model.
The loss function may be, for example, a mean squared error (MSE) loss function or a structural similarity index (SSIM) loss function. Alternatively, two or more loss functions may be selected and weighted simultaneously. The first loss value and the second loss value may also be subjected to various processing, such as averaging. The preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset number; the comprehensive loss value calculated based on the first loss value and the second loss value is smaller than a preset difference threshold; the accuracy on a test set reaches a preset accuracy threshold. One possible training step combining these options is sketched below.
Therefore, the sample reference face image is input into the trained image generation model, and the sample face image consistent with the face pose of the sample reference face image can be generated through the coding network, the target hidden layer network and the decoding network. It can be understood that, when the sample face image in the sample face image set and the face image to be replaced correspond to the same person, the obtained image generation model is the image generation model used for generating the preset face image library matched with the face image to be replaced.
It is noted that the subject of execution of the training steps described above may be the same as or different from the subject of execution of the method for generating images. If the two images are the same, the executing body of the training step can store the network structure and the network parameters of the trained image generation model locally after the image generation model is obtained through training. If the two images are different, the executing agent of the training step may send the network structure and the network parameters of the trained image generation model to the executing agent of the method for generating images after the image generation model is obtained through training.
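As a minimal sketch of one training step combining S1-S3 with the loss options above, and reusing the ImageGenerationModel sketch given earlier: the optimiser, the MSE/SSIM weighting, and the simple averaging of the two loss values are assumptions rather than the patent's exact recipe, and ssim_fn stands for any externally supplied differentiable SSIM implementation (e.g. from the pytorch_msssim package).

```python
import torch
import torch.nn.functional as F

def composite_loss(recon, target, ssim_fn=None, w_mse=1.0, w_ssim=0.1):
    loss = w_mse * F.mse_loss(recon, target)
    if ssim_fn is not None:
        # SSIM is a similarity in [0, 1]; use (1 - SSIM) as a loss term.
        loss = loss + w_ssim * (1.0 - ssim_fn(recon, target))
    return loss

def train_step(model, optimizer, pre_ref, ref, pre_face, face, ssim_fn=None):
    optimizer.zero_grad()
    loss1 = composite_loss(model.forward_first(pre_ref), ref, ssim_fn)     # S1: first sub-model
    loss2 = composite_loss(model.forward_second(pre_face), face, ssim_fn)  # S2: second sub-model
    loss = (loss1 + loss2) / 2                                             # S3: e.g. average the two loss values
    loss.backward()
    optimizer.step()
    return float(loss)
```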
And step 203, replacing the face image to be replaced based on the matched face image to generate a target image.
In this embodiment, the executing entity may adopt various methods to replace the face image to be replaced with the matching face image, so as to generate the target image. The target image may include a face image consistent with the matching face image and a background consistent with the master image. As an example, the execution subject may first process the matching face image and the face image to be replaced into images of matching size (e.g., 128 × 128). Then, the execution subject may combine the matched face image with the background of the master image, thereby generating the target image.
In some optional implementation manners of this embodiment, the executing body may further perform fusion processing on the combined image by using various methods, so as to generate the target image. As an example, the execution subject may generate the target image by Alpha fusion, multiband fusion, Poisson fusion, or the like.
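A minimal sketch of the Poisson fusion option using OpenCV's seamlessClone is shown below; the mask and centre placement are assumptions for illustration, and Alpha or multiband blending could be substituted for the seamlessClone call.

```python
import cv2
import numpy as np

def fuse_face(matched_face, master_image, face_mask, center):
    """Blend the replaced face region into the master image background.

    matched_face: BGR image holding the swapped-in face (same size as master_image)
    face_mask:    uint8 mask, 255 inside the face region
    center:       (x, y) centre of the face region in the master image
    """
    return cv2.seamlessClone(matched_face, master_image, face_mask,
                             center, cv2.NORMAL_CLONE)
```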
In some optional implementations of this embodiment, the executing body may further generate the target image according to the following steps:
firstly, performing face alignment (face alignment) on the matched face image and the face image to be replaced.
In these implementations, the execution subject may perform face alignment on the matching face image and the face image to be replaced by using various face alignment algorithms. As an example, the execution subject may first detect the positions of key points (for example, 150 points may be included) of the face in the matching face image and the face image to be replaced. Then, the executing body may perform face alignment using four points, i.e., a left eye outer corner (e.g., 13), a right eye outer corner (e.g., 34), a top lip center (e.g., 60), and a chin center (e.g., 6), as references.
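As an illustration of this step, the sketch below estimates a similarity transform (rotation, scale and translation) from the four reference key points mentioned above and warps the matching face image accordingly. The landmark index layout and the use of estimateAffinePartial2D are assumptions; only the choice of reference points follows the example in the preceding paragraph.

```python
import cv2
import numpy as np

# Assumed landmark indices: left eye outer corner, right eye outer corner,
# top lip centre, chin centre.
REF_IDS = [13, 34, 60, 6]

def align_face(match_img, match_landmarks, target_landmarks, out_size):
    """Warp match_img so its reference key points land on the target's key points.

    out_size: (width, height) of the face image to be replaced.
    """
    src = np.float32([match_landmarks[i] for i in REF_IDS])
    dst = np.float32([target_landmarks[i] for i in REF_IDS])
    M, _ = cv2.estimateAffinePartial2D(src, dst)   # rotation + scale + translation
    return cv2.warpAffine(match_img, M, out_size)
```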
And secondly, performing triangulation based on the aligned matched face image and the face image to be replaced.
In these implementations, the executing agent may triangulate based on the positions of the face key points in the matching face image and the face image to be replaced determined in the first step. As an example, the relevant API (Application Programming Interface) of the Subdiv2D class of OpenCV may be called to implement triangulation of the face image. A plurality of non-overlapping triangular regions can be obtained through the subdivision.
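A minimal sketch of such a triangulation with OpenCV's Subdiv2D follows; the landmark list is assumed to come from a separate face key point detector.

```python
import cv2
import numpy as np

def triangulate(landmarks, image_shape):
    """Return an (N, 6) array; each row is (x1, y1, x2, y2, x3, y3) of one triangle."""
    h, w = image_shape[:2]
    subdiv = cv2.Subdiv2D((0, 0, w, h))
    for (x, y) in landmarks:
        # Points must lie inside the bounding rectangle passed to Subdiv2D.
        subdiv.insert((float(x), float(y)))
    return subdiv.getTriangleList()
```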
And thirdly, replacing the triangular areas divided by the triangular section according to the corresponding relation between the aligned matched face image and the face image to be replaced, and generating the quasi-target image.
In these implementations, the execution subject may replace each triangular region divided from the face image to be replaced with the corresponding triangular region of the aligned matching face image, so as to generate the quasi-target image. Therefore, a matching face image consistent with the face image to be replaced in the master image can be generated, with a high degree of realism and naturalness.
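The per-triangle replacement can be sketched as an affine warp of each source triangle into the corresponding destination triangle; applying warp_triangle to every corresponding triangle pair produced by the triangulation yields the quasi-target image. The helper below is an illustrative sketch under these assumptions, not the patent's exact procedure.

```python
import cv2
import numpy as np

def warp_triangle(src_img, dst_img, src_tri, dst_tri):
    """Copy one triangular region of src_img into dst_img.

    src_tri, dst_tri: 3x2 float32 arrays of corresponding triangle vertices.
    """
    r_src = cv2.boundingRect(np.float32([src_tri]))
    r_dst = cv2.boundingRect(np.float32([dst_tri]))
    src_local = np.float32(src_tri - r_src[:2])     # vertices relative to the patch
    dst_local = np.float32(dst_tri - r_dst[:2])
    src_patch = src_img[r_src[1]:r_src[1] + r_src[3], r_src[0]:r_src[0] + r_src[2]]
    M = cv2.getAffineTransform(src_local, dst_local)
    warped = cv2.warpAffine(src_patch, M, (r_dst[2], r_dst[3]),
                            borderMode=cv2.BORDER_REFLECT)
    mask = np.zeros((r_dst[3], r_dst[2]), dtype=np.uint8)
    cv2.fillConvexPoly(mask, np.int32(dst_local), 255)
    roi = dst_img[r_dst[1]:r_dst[1] + r_dst[3], r_dst[0]:r_dst[0] + r_dst[2]]
    roi[mask > 0] = warped[mask > 0]                # write only inside the triangle
```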
And fourthly, extracting the outline of the face image from the quasi-target image.
In these implementations, the execution body may extract the contour of the face image using various methods, for example, face key point detection or edge detection techniques.
And fifthly, generating a mask according to the contour of the face image.
And sixthly, generating color distribution information of the face image according to the mask and the quasi-target image.
In these implementations, the execution subject may first determine the color distribution of the portion of the quasi-target image other than the face image, based on the mask generated in the fifth step and the quasi-target image generated in the third step. Then, the execution subject can determine the color distribution information of the face image by using a linear color transformation method.
And seventhly, rendering the face image according to the color distribution information to generate the target image.
In these implementations, the execution subject may render the face image in the quasi-target image into a skin color in accordance with the color distribution indicated by the color distribution information. Therefore, the fusion of the human face image and the background in the generated target image can be more natural.
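Steps five to seven can be sketched as follows: a mask is filled from the face contour, colour statistics are gathered outside the mask, and the face region is re-coloured with a channel-wise linear transform. The mean/standard-deviation matching used here is one simple instance of a linear colour transformation, not necessarily the patent's exact formula.

```python
import cv2
import numpy as np

def render_face(quasi_target, face_contour):
    """Re-colour the face region of the quasi-target image to match its surroundings.

    face_contour: (N, 2) array of contour points of the face image.
    """
    mask = np.zeros(quasi_target.shape[:2], dtype=np.uint8)
    cv2.fillPoly(mask, [np.int32(face_contour)], 255)        # step five: mask

    img = quasi_target.astype(np.float32)
    face, rest = img[mask > 0], img[mask == 0]                # step six: colour statistics
    out = img.copy()
    # Step seven: channel-wise linear re-colouring of the face region.
    out[mask > 0] = (face - face.mean(axis=0)) / (face.std(axis=0) + 1e-6) \
                    * rest.std(axis=0) + rest.mean(axis=0)
    return np.clip(out, 0, 255).astype(np.uint8)
```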
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of a method for generating an image according to an embodiment of the present disclosure. In the application scenario of fig. 3, a user 301 uploads a master image 3031 and a target face image 3032 using a terminal device 302. The background server 304 receives the images 303 transmitted by the terminal device 302. Then, the background server 304 determines a matching face image 306 from the preset face image library 305 matched with the face image in the master image 3031. The matching face image 306 displays the face in the master image 3031 with a facial pose consistent with that of the target face image 3032. Then, the background server 304 replaces the face image in the master image 3031 with the matching face image 306 to generate the target image 307. Optionally, the background server 304 may further send the generated target image 307 to the terminal device 302 for display to the user 301.
At present, in the prior art, a face template is usually deformed and adjusted through face key point detection, so that the generated face image is not natural enough. The method provided by the embodiments of the present disclosure performs matching through a preset face image library, so that the quality of the matching face image can be improved by ensuring the image quality in the preset face image library, avoiding the tendency of online model-based methods to produce failed images (bad cases). In addition, the scheme described in this embodiment can be matched directly against the library without online training or model inference, so that the image generation speed can be increased and the waiting time reduced.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for generating an image is shown. The flow 400 of the method for generating an image comprises the steps of:
in step 401, a first video shot for a first user and a second video shot for a second user are obtained.
In the present embodiment, an execution subject (e.g., the server 105 shown in fig. 1) of the method for generating an image may acquire a first video taken for a first user and a second video taken for a second user from a locally or communicatively connected electronic device (e.g., the terminal device shown in fig. 1) in various ways.
Step 402, extracting a video frame including a face image of a first user from a first video as a master image.
In this embodiment, the executing entity may extract a video frame including a face image of the first user from the first video acquired in step 401 as a master image.
It should be noted that a video is essentially an image sequence arranged in chronological order, so the first video may correspond to an image sequence including face images of the first user. Here, the execution subject may select a video frame including a face image of the first user from the image sequence as the master image in various ways. For example, a random selection mode may be adopted, or a video frame in which the face image is clearer may be preferentially selected as the master image.
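One possible way to preferentially select a clearer frame is to rank candidate frames by the variance of the Laplacian, a common sharpness proxy; this particular metric is an assumption, as the patent does not specify how definition is measured.

```python
import cv2

def sharpest_frame(frames):
    """Return the frame with the highest Laplacian variance (sharpest)."""
    def sharpness(frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        return cv2.Laplacian(gray, cv2.CV_64F).var()
    return max(frames, key=sharpness)
```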
In step 403, a video frame including a face image of the second user is extracted from the second video.
In this embodiment, the execution subject may extract a video frame including a face image of the second user from the second video according to a similar procedure to the procedure of step 402.
Step 404, extracting the face image of the second user from the video frame including the face image of the second user as a target face image.
In this embodiment, the executing entity may adopt various algorithms of face recognition and face feature point extraction to extract a face image from the video frame extracted in step 403 as a target face image.
It should be noted that the description of the above-mentioned master image and target face image may be consistent with the description of step 201 in the foregoing embodiment, and will not be repeated here.
Step 405, determining a matching facial image from a preset facial image library matched with the facial image to be replaced.
And 406, replacing the face image to be replaced based on the matched face image to generate a target image.
The steps 405 and 406 are respectively consistent with the steps 202 and 203 in the foregoing embodiment, and the above description for the steps 202 and 203 also applies to the steps 405 and 406, which is not repeated herein.
Step 407, generating a target video based on the target image.
In this embodiment, the executing entity may extract a plurality of master images and target face images from the first video and the second video acquired in step 401, respectively, and generate a master image sequence and a target face image sequence. The order of the images in the master image sequence and the target face image sequence may correspond to the frame order of the video frames. Then, the execution subject may perform steps 405 to 406 on each image in the extracted master image sequence and target face image sequence, thereby generating a target image sequence. The order of the target image sequence may be consistent with the frame order of the first video or the second video. Thus, the execution subject can generate a target video. The facial pose of the second user displayed in the target video matches the facial pose of the first user displayed in the first video.
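A minimal sketch of assembling the target video frame by frame with OpenCV follows; generate_target_image stands in for steps 405-406 and is a hypothetical helper, and the codec and frame rate are assumptions.

```python
import cv2

def build_target_video(master_frames, target_face_frames, out_path, fps=25.0):
    """Run the face replacement per frame pair and write the target video."""
    h, w = master_frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for master, target_face in zip(master_frames, target_face_frames):
        writer.write(generate_target_image(master, target_face))  # steps 405-406 (hypothetical helper)
    writer.release()
```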
In some optional implementations of this embodiment, the executing body may further send the generated target video to a target device (e.g., a mobile phone, a tablet, etc.) connected in communication, so that the target device displays the target video. As an example, the first video may be a video uploaded by a user terminal (e.g., a mobile phone, a tablet computer, etc.). The second video may be a self-portrait video of the user terminal. The execution main body can also send the generated target video to the user terminal for uploading the video. Therefore, the user can drive the expression of the character in the uploaded video by using the facial expression of the user through the user terminal.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for generating an image in this embodiment embodies the steps of capturing the master image and the target face image from videos and generating the target video. Thus, the scheme described in this embodiment can drive the facial pose of the second user displayed in the video according to the facial pose of the first user. In addition, the scheme described in this embodiment can be matched directly against the library without online training or model inference, which greatly increases the image generation speed, allows it to run with low latency on computers and mobile devices, and makes it suitable for fields such as short video, live video and movie special effects.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for generating an image, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable in various electronic devices.
As shown in fig. 5, the apparatus 500 for generating an image provided by the present embodiment includes an acquisition unit 501, a determination unit 502, and a first generation unit 503. The acquiring unit 501 is configured to acquire a master image and a target face image. The master image comprises a face image to be replaced and a background. The determining unit 502 is configured to determine a matching face image from a preset face image library matched with the face image to be replaced. The matched preset face image library comprises face images of different facial poses of the face indicated by the face image to be replaced. The matching face image presents that face with a facial pose consistent with the facial pose displayed by the target face image. The first generating unit 503 is configured to generate a target image by replacing the face image to be replaced with the matching face image. The target image comprises a face image consistent with the matching face image and a background consistent with the master image.
In the present embodiment, in the apparatus 500 for generating an image: the specific processing of the obtaining unit 501, the determining unit 502, and the first generating unit 503 and the technical effects thereof can refer to the related descriptions of step 201, step 202, and step 203 in the corresponding embodiment of fig. 2, which are not described herein again.
In some optional implementations of the present embodiment, the obtaining unit 501 may include an obtaining module (not shown in the figure), a first extracting module (not shown in the figure), a second extracting module (not shown in the figure), and a third extracting module (not shown in the figure). The acquiring module may be configured to acquire a first video shot for a first user and a second video shot for a second user. The first extraction module may be configured to extract a video frame including a face image of the first user from the first video as a master image. The second extraction module may be configured to extract a video frame including a face image of the second user from the second video. The third extraction module may be configured to extract a face image of the second user from a video frame including the face image of the second user as the target face image. And the apparatus for generating an image may further include: a second generating unit (not shown in the figure) configured to generate a target video based on the target image. The facial pose of the second user displayed in the target video may match the facial pose of the first user displayed in the first video.
In some optional implementation manners of this embodiment, the preset face image library may be obtained by: acquiring a reference human face image library, wherein the reference human face image library comprises images displaying different facial poses of a reference human face; inputting images in a reference facial image library into a pre-trained image generation model to generate a matched reference facial image, wherein the image generation model comprises a coding network, a hidden layer network and a decoding network, and the facial pose displayed by the matched reference facial image is consistent with the facial pose displayed by the input image; and generating a preset face image library based on the matched reference face image.
In some optional implementations of this embodiment, the hidden layer network may include a first hidden layer network and a second hidden layer network. The image generation model may include a first image generation submodel and a second image generation submodel. The first image generation submodel may include an encoding network, a first hidden layer network, a second hidden layer network, and a decoding network. The second image generation submodel may include an encoding network, a decoding network, and a target hidden layer network. The target hidden layer network may be one of a first hidden layer network and a second hidden layer network.
In some optional implementations of the present embodiment, the image generation model may be obtained by training through the following steps: acquiring a sample reference face image set and a sample face image set, wherein the sample reference face image set comprises a subset of a reference face image library; carrying out image preprocessing transformation on the sample reference facial image set and the sample facial image set to generate a sample preprocessing reference facial image set and a sample preprocessing facial image set; and respectively taking the sample preprocessing reference face image and the sample preprocessing face image as the input of a first image generation sub-model and a second image generation sub-model, respectively taking the sample preprocessing reference face image and the sample face image corresponding to the input as the expected output of the first image generation sub-model and the second image generation sub-model, and training to obtain an image generation model.
In some optional implementations of this embodiment, the first generating unit 503 may include: an alignment module (not shown), a splitting module (not shown), a first generating module (not shown), a fourth extracting module (not shown), a second generating module (not shown), a third generating module (not shown), and a fourth generating module (not shown). The alignment module may be configured to perform face alignment on the matched face image and the face image to be replaced. The triangulation module may be configured to triangulate based on the aligned matching face image and the face image to be replaced. The first generation module may be configured to perform replacement according to a corresponding relationship between the aligned matching face image and the face image to be replaced in the triangular region divided by the triangulation, so as to generate the quasi-target image. The fourth extraction module may be configured to extract the contour of the face image from the quasi-target image. The second generating module may be configured to generate a mask according to the contour of the face image. The third generating module may be configured to generate color distribution information of the face image according to the mask and the quasi-target image. The fourth generating module may be configured to render the face image according to the color distribution information, and generate the target image.
The apparatus provided by the above embodiment of the present disclosure acquires the master image and the target face image through the acquisition unit 501. The master image comprises a face image to be replaced and a background. Then, the determining unit 502 determines a matching face image from a preset face image library matched with the face image to be replaced. The matched preset face image library comprises face images of different facial poses of the face indicated by the face image to be replaced. The matching face image presents that face with a facial pose consistent with the facial pose displayed by the target face image. Finally, the first generating unit 503 generates a target image by replacing the face image to be replaced with the matching face image. The target image comprises a face image consistent with the matching face image and a background consistent with the master image. This increases the speed of generating an image consistent with the facial pose of the target face image.
Referring now to FIG. 6, a block diagram of an electronic device (e.g., the server in FIG. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones and notebook computers, and fixed terminals such as digital TVs and desktop computers. The server shown in fig. 6 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 6, the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage device 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, and the like; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, and the like; storage devices 608 including, for example, a magnetic tape, a hard disk, and the like; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While FIG. 6 illustrates an electronic device 600 having various devices, it is to be understood that not all illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (Radio Frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a master image and a target face image, wherein the master image comprises a face image to be replaced and a background; determine a matching face image from a preset face image library matched with the face image to be replaced, wherein the preset face image library comprises face images of different face poses of the face indicated by the face image to be replaced, and the matching face image is used for representing that the face pose displayed by the face image to be replaced is consistent with the face pose displayed by the target face image; and generate a target image based on the replacement of the matching face image for the face image to be replaced, wherein the target image comprises a face image consistent with the matching face image and a background consistent with the master image.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor comprising an acquisition unit, a determination unit, and a first generation unit. In some cases, the names of these units do not constitute a limitation on the units themselves; for example, the acquisition unit may also be described as a unit for acquiring a master image including a face image to be replaced and a background, and a target face image.
The foregoing description is only a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (14)

1. A method for generating an image, comprising:
acquiring a master image and a target face image, wherein the master image comprises a face image to be replaced and a background;
determining a matched face image from a preset face image library matched with the face image to be replaced, wherein the preset face image library comprises face images of different face poses of a face indicated by the face image to be replaced, and the matched face image is used for representing that the face pose displayed by the face image to be replaced is consistent with the face pose displayed by the target face image;
and generating a target image based on the replacement of the matched face image for the face image to be replaced, wherein the target image comprises a face image consistent with the matched face image and a background consistent with the master image.
2. The method of claim 1, wherein the acquiring a master image and a target face image comprises:
acquiring a first video shot for a first user and a second video shot for a second user;
extracting a video frame comprising a face image of a first user from the first video as the master image;
extracting a video frame comprising a face image of a second user from the second video;
extracting a face image of a second user from the video frame comprising the face image of the second user as the target face image; and
after the target image is generated based on the replacement of the matching facial image to the facial image to be replaced, the method further comprises:
generating a target video based on the target image, wherein the facial pose of the second user displayed in the target video matches the facial pose of the first user displayed in the first video.
3. The method of claim 1, wherein the library of pre-set facial images is obtained by:
acquiring a reference human face image library, wherein the reference human face image library comprises images displaying different facial poses of a reference human face;
inputting images in the reference facial image library into a pre-trained image generation model to generate a matching reference facial image, wherein the image generation model comprises a coding network, a hidden layer network and a decoding network, and the facial pose displayed by the matching reference facial image is consistent with the facial pose displayed by the input image;
and generating the preset face image library based on the matching reference face image.
4. The method of claim 3, wherein the hidden layer networks include a first hidden layer network and a second hidden layer network, the image generation model includes a first image generation sub-model and a second image generation sub-model, the first image generation sub-model includes the encoding network, the first hidden layer network, the second hidden layer network, and the decoding network, the second image generation sub-model includes the encoding network, the decoding network, and a target hidden layer network, the target hidden layer network being one of the first hidden layer network and the second hidden layer network.
5. The method of claim 4, wherein the image generation model is trained by:
acquiring a sample reference facial image set and a sample facial image set, wherein the sample reference facial image set comprises a subset of the reference facial image library;
carrying out image preprocessing transformation on the sample reference facial image set and the sample facial image set to generate a sample preprocessing reference facial image set and a sample preprocessing facial image set;
and respectively taking the sample preprocessing reference face image and the sample preprocessing face image as the input of the first image generation sub-model and the second image generation sub-model, respectively taking the sample preprocessing reference face image and the sample face image corresponding to the input as the expected output of the first image generation sub-model and the second image generation sub-model, and training to obtain the image generation model.
6. The method according to one of claims 1 to 5, wherein the generating of the target image based on the replacement of the face image to be replaced by the matching face image comprises:
carrying out face alignment on the matched face image and the face image to be replaced;
performing triangulation based on the aligned matched face image and the face image to be replaced;
performing replacement according to the correspondence between the triangular regions obtained by the triangulation in the aligned matched face image and the face image to be replaced, to generate a quasi-target image;
extracting the outline of the face image from the quasi-target image;
generating a mask according to the outline of the face image;
generating color distribution information of the face image according to the mask and the quasi-target image;
rendering the face image according to the color distribution information to generate the target image.
7. An apparatus for generating an image, comprising:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire a master image and a target face image, and the master image comprises a face image to be replaced and a background;
a determining unit configured to determine a matching face image from a preset face image library matched with the face image to be replaced, wherein the preset face image library comprises face images of different face poses of the face indicated by the face image to be replaced, and the matching face image is used for representing that the face pose displayed by the face image to be replaced is consistent with the face pose displayed by the target face image;
and a first generation unit configured to generate a target image based on the replacement of the matching face image for the face image to be replaced, wherein the target image comprises a face image consistent with the matching face image and a background consistent with the master image.
8. The apparatus of claim 7, wherein the obtaining unit comprises:
an acquisition module configured to acquire a first video photographed for a first user and a second video photographed for a second user;
a first extraction module configured to extract a video frame including a face image of a first user from the first video as the master image;
a second extraction module configured to extract a video frame including a face image of a second user from the second video;
a third extraction module configured to extract a face image of a second user from the video frame including the face image of the second user as the target face image;
and the apparatus further comprises:
a second generation unit configured to generate a target video on the basis of the target image, wherein a face pose of a second user displayed in the target video matches a face pose of a first user displayed in the first video.
9. The apparatus of claim 7, wherein the preset face image library is obtained by:
acquiring a reference human face image library, wherein the reference human face image library comprises images displaying different facial poses of a reference human face;
inputting images in the reference facial image library into a pre-trained image generation model to generate a matching reference facial image, wherein the image generation model comprises a coding network, a hidden layer network and a decoding network, and the facial pose displayed by the matching reference facial image is consistent with the facial pose displayed by the input image;
and generating the preset face image library based on the matching reference face image.
10. The apparatus of claim 9, wherein the hidden layer network comprises a first hidden layer network and a second hidden layer network, the image generation model comprises a first image generation sub-model and a second image generation sub-model, the first image generation sub-model comprises the encoding network, the first hidden layer network, the second hidden layer network, and the decoding network, the second image generation sub-model comprises the encoding network, the decoding network, and a target hidden layer network, the target hidden layer network being one of the first hidden layer network and the second hidden layer network.
11. The apparatus of claim 10, wherein the image generation model is trained by:
acquiring a sample reference facial image set and a sample facial image set, wherein the sample reference facial image set comprises a subset of the reference facial image library;
carrying out image preprocessing transformation on the sample reference facial image set and the sample facial image set to generate a sample preprocessing reference facial image set and a sample preprocessing facial image set;
and respectively taking the sample preprocessing reference face image and the sample preprocessing face image as the input of the first image generation sub-model and the second image generation sub-model, respectively taking the sample preprocessing reference face image and the sample face image corresponding to the input as the expected output of the first image generation sub-model and the second image generation sub-model, and training to obtain the image generation model.
12. The apparatus according to one of claims 7-11, wherein the first generating unit comprises:
an alignment module configured to perform face alignment on the matching face image and the face image to be replaced;
a triangulation module configured to perform triangulation based on the aligned matching face image and the face image to be replaced;
a first generation module configured to perform replacement according to the correspondence between the triangular regions obtained by the triangulation in the aligned matching face image and the face image to be replaced, to generate a quasi-target image;
a fourth extraction module configured to extract a contour of a face image from the quasi-target image;
a second generation module configured to generate a mask according to the contour of the face image;
a third generation module configured to generate color distribution information of the face image according to the mask and the quasi-target image;
and a fourth generation module configured to render the face image according to the color distribution information to generate the target image.
13. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
14. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN201910797619.2A 2019-08-27 2019-08-27 Method and apparatus for generating image Active CN110516598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910797619.2A CN110516598B (en) 2019-08-27 2019-08-27 Method and apparatus for generating image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910797619.2A CN110516598B (en) 2019-08-27 2019-08-27 Method and apparatus for generating image

Publications (2)

Publication Number Publication Date
CN110516598A CN110516598A (en) 2019-11-29
CN110516598B (en) 2022-03-01

Family

ID=68627322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910797619.2A Active CN110516598B (en) 2019-08-27 2019-08-27 Method and apparatus for generating image

Country Status (1)

Country Link
CN (1) CN110516598B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113269700B (en) * 2021-04-29 2023-12-12 北京达佳互联信息技术有限公司 Video generation method, device, electronic equipment and storage medium
CN113313631B (en) * 2021-06-10 2024-05-10 北京百度网讯科技有限公司 Image rendering method and device
CN113961746B (en) * 2021-09-29 2023-11-21 北京百度网讯科技有限公司 Video generation method, device, electronic equipment and readable storage medium
CN115359166B (en) * 2022-10-20 2023-03-24 北京百度网讯科技有限公司 Image generation method and device, electronic equipment and medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108200334A (en) * 2017-12-28 2018-06-22 广东欧珀移动通信有限公司 Image capturing method, device, storage medium and electronic equipment
CN109977739A (en) * 2017-12-28 2019-07-05 广东欧珀移动通信有限公司 Image processing method, device, storage medium and electronic equipment
CN108460812A (en) * 2018-04-04 2018-08-28 北京红云智胜科技有限公司 A kind of expression packet generation system and method based on deep learning
CN108965740A (en) * 2018-07-11 2018-12-07 深圳超多维科技有限公司 A kind of real-time video is changed face method, apparatus, equipment and storage medium
CN109902632A (en) * 2019-03-02 2019-06-18 西安电子科技大学 A kind of video analysis device and video analysis method towards old man's exception
CN110110672A (en) * 2019-05-10 2019-08-09 广东工业大学 A kind of facial expression recognizing method, device and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Discriminative Pose-Free Descriptors for Face and Object Matching; Soubhik Sanyal et al.; International Conference on Computer Vision (ICCV); 2015-12-31; pp. 3837-3845 *
Face Replacement Based on Realistic 3D Head Reconstruction; Lin Yuan et al.; Journal of Tsinghua University (Science and Technology); 2012-05-31; pp. 602-606 *

Also Published As

Publication number Publication date
CN110516598A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN110503703B (en) Method and apparatus for generating image
CN107578017B (en) Method and apparatus for generating image
CN110516598B (en) Method and apparatus for generating image
CN107633218B (en) Method and apparatus for generating image
US10901740B2 (en) Synthetic depth image generation from cad data using generative adversarial neural networks for enhancement
CN108229296B (en) Face skin attribute identification method and device, electronic equipment and storage medium
CN108830235B (en) Method and apparatus for generating information
CN106682632B (en) Method and device for processing face image
CN107609506B (en) Method and apparatus for generating image
CN110517214B (en) Method and apparatus for generating image
CN111476871A (en) Method and apparatus for generating video
CN112954450B (en) Video processing method and device, electronic equipment and storage medium
CN111626956B (en) Image deblurring method and device
CN109583389B (en) Drawing recognition method and device
CN113420719A (en) Method and device for generating motion capture data, electronic equipment and storage medium
WO2022227765A1 (en) Method for generating image inpainting model, and device, medium and program product
CN113780326A (en) Image processing method and device, storage medium and electronic equipment
Sun et al. Masked lip-sync prediction by audio-visual contextual exploitation in transformers
CN114792355A (en) Virtual image generation method and device, electronic equipment and storage medium
CN111292333B (en) Method and apparatus for segmenting an image
CN112562045B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN111815748B (en) Animation processing method and device, storage medium and electronic equipment
CN113379877A (en) Face video generation method and device, electronic equipment and storage medium
CN116363641A (en) Image processing method and device and electronic equipment
CN113240780B (en) Method and device for generating animation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant