CN114049290A - Image processing method, device, equipment and storage medium - Google Patents

Image processing method, device, equipment and storage medium Download PDF

Info

Publication number
CN114049290A
CN114049290A
Authority
CN
China
Prior art keywords
image
head
synthesized
feature map
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111325365.8A
Other languages
Chinese (zh)
Inventor
束长勇
刘家铭
洪智滨
韩钧宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111325365.8A priority Critical patent/CN114049290A/en
Publication of CN114049290A publication Critical patent/CN114049290A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face

Abstract

The present disclosure provides an image processing method, device, equipment and storage medium, relates to the field of artificial intelligence, in particular to the technical fields of deep learning and computer vision, and can be applied to scenes such as face image processing and face recognition. The specific implementation scheme is as follows: acquiring a reference image and a target person head image; replacing the head of a reference person in the reference image with the head of the target person to obtain an image to be synthesized, wherein the image to be synthesized comprises a partial reference background, the head of the target person and a region to be filled between the partial reference background and the head of the target person; performing feature extraction on the reference image and the image to be synthesized to obtain a skin color sample feature map and a filling sample feature map; and generating a composite image based on the skin color sample feature map, the filling sample feature map and the image to be synthesized. By extracting the skin color sample feature map and the filling sample feature map for image synthesis, the composite image is made more natural and realistic.

Description

Image processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to the field of deep learning and computer vision technologies, which can be applied to scenes such as face image processing and face recognition, and in particular, to an image processing method, apparatus, device, storage medium, and computer program product.
Background
With the development of computing technology and artificial intelligence, fusion networks that perform skin color alignment as well as neck and background filling have been widely applied in scenes such as face image editing and fusion, for example, fusing the head portrait of one person onto the body of a specific person or into a specific scene or background.
Disclosure of Invention
The present disclosure provides an image processing method, apparatus, device, storage medium, and computer program product.
According to a first aspect of the present disclosure, there is provided an image processing method including: acquiring a reference image and a target person head image, wherein the reference image comprises a reference background and a reference person head; replacing the head of a reference person in the reference image by the head of the target person to obtain an image to be synthesized, wherein the image to be synthesized comprises a part of reference background, the head of the target person and a region to be filled between the part of reference background and the head of the target person; performing feature extraction on the reference image and the image to be synthesized to obtain a skin color sample feature map and a filling sample feature map; and generating a composite image based on the skin color sample characteristic diagram, the filling sample characteristic diagram and the image to be synthesized.
According to a second aspect of the present disclosure, there is provided an image processing apparatus comprising: an acquisition module configured to acquire a reference image and a target person head image, wherein the reference image includes a reference background and a reference person head; the replacing module is configured to replace the head of the reference person in the reference image by using the head image of the target person to obtain an image to be synthesized, wherein the image to be synthesized comprises a part of reference background, the head of the target person and a region to be filled between the part of reference background and the head of the target person; the characteristic extraction module is configured to extract the characteristics of the reference image and the image to be synthesized to obtain a skin color sample characteristic diagram and a filling sample characteristic diagram; and the image generation module is configured to generate a composite image based on the skin color sample characteristic diagram, the filling sample characteristic diagram and the image to be synthesized.
According to a third aspect of the present disclosure, there is provided an electronic apparatus comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to implement a method as described in any of the implementations of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described in any implementation manner of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described in any of the implementations of the first aspect.
According to the image processing method, the device, the equipment, the storage medium and the computer program product, firstly, the head of a target person is used for replacing the head of a reference person in a reference image to obtain an image to be synthesized, then, the reference image and the image to be synthesized are subjected to feature extraction to obtain a skin color sample feature map and a filling sample feature map, and finally, a synthetic image is generated based on the skin color sample feature map, the filling sample feature map and the image to be synthesized, and the generated synthetic image is more natural and real.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture to which the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of an image processing method of the present disclosure;
FIG. 3 is a flow diagram of another embodiment of an image processing method of the present disclosure;
FIG. 4 is a flow chart of yet another embodiment of an image processing method of the present disclosure;
fig. 5A is a schematic view of an application scenario of the image processing method of the present disclosure;
FIG. 5B is a schematic diagram of extracting a skin tone sample feature map and a fill sample feature map in the scene of FIG. 5A;
fig. 6 is a schematic configuration diagram of an example of an image processing apparatus of the present disclosure;
fig. 7 is a block diagram of an electronic device for implementing a method of image processing of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the image processing method or image processing apparatus of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user can use the terminal apparatuses 101, 102, 103 to interact with the server 105 through the network 104 to acquire image processing results and the like. Various client applications, such as an image composition application, etc., may be installed on the terminal devices 101, 102, 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the above-described electronic apparatuses, and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. This is not particularly limited herein.
The server 105 may provide various image composition based services or applications. For example, the server 105 may process the reference image and the target person head image acquired from the terminal apparatuses 101, 102, 103, and generate a processing result (e.g., generate a composite image).
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. This is not particularly limited herein.
It should be noted that the image processing method provided by the embodiment of the present disclosure is generally executed by the server 105, and accordingly, the image processing apparatus is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to fig. 2, fig. 2 shows a flowchart of an image processing method provided by an embodiment of the present disclosure, where the flowchart 200 includes the following steps:
step 201, acquiring a reference image and a target person head image.
In the present embodiment, the execution subject of the image processing method (e.g., the server 105 shown in fig. 1) may acquire the reference image and the target person head image. Wherein the reference image comprises a reference background and a reference person head. The reference image may be acquired by directly using an image sensor, for example, the image sensor may be a camera, or may be acquired from a local file storing a large number of images. For example, the reference image may be an image captured by a camera with the reference person as a target and the environment where the reference person is located as a background. The head image of the target person may be an image obtained by individually dividing a head region of a certain person from an image captured by the camera. Optionally, the reference image further comprises an exposed skin area other than the head of the reference person; illustratively, the reference image may include the neck and/or arms, etc. of the reference person in addition to the head of the reference person.
And 202, replacing the head of the reference person in the reference image by using the head image of the target person to obtain an image to be synthesized.
In this embodiment, after the execution subject acquires the reference image and the target person head image, the target person head image may be used to replace the head of the reference person in the reference image to obtain an image to be synthesized, where the image to be synthesized includes a partial reference background, the head of the target person, and a region to be filled between the partial reference background and the head of the target person. In the implementation process, considering that the head sizes and shapes of different people differ, when replacing the head of the reference person, the head of the reference person together with the background within a preset distance or shape around the head may be removed in advance, and the target person head image is added at the position of the original reference person head, thereby obtaining the image to be synthesized. Here, the partial reference background refers to a part of the reference background in the reference image, for example, the background area remaining after the background within a preset distance or shape around the head of the reference person is removed from the reference background.
And 203, extracting the features of the reference image and the image to be synthesized to obtain a skin color sample feature map and a filling sample feature map.
In this embodiment, after obtaining the image to be synthesized, the execution subject may perform feature extraction on the reference image and the image to be synthesized to obtain a skin color sample feature map and a filling sample feature map. The feature extraction may use any existing extraction method, including but not limited to the HOG (histogram of oriented gradients) extraction algorithm, scale-invariant feature transform, neural network features, and the like. The skin color sample feature map characterizes the skin color information of the reference person in the reference image, and the filling sample feature map characterizes the filling information, extracted from the reference image, for the region to be filled indicated by the image to be synthesized.
And step 204, generating a composite image based on the skin color sample characteristic diagram, the filling sample characteristic diagram and the image to be synthesized.
In this embodiment, after obtaining the skin color sample feature map and the filling sample feature map, the execution subject may generate a composite image based on the skin color sample feature map, the filling sample feature map and the image to be synthesized. The execution subject may adopt a fusion network to fuse the image to be synthesized with the acquired skin color sample feature map and filling sample feature map to obtain the composite image, where the composite image contains the head of the target person and the skin color of the target person is the same as that of the reference person.
The image processing method provided by this embodiment includes first replacing the head of a reference person in a reference image with the head of a target person to obtain an image to be synthesized, then performing feature extraction on the reference image and the image to be synthesized to obtain a skin color sample feature map and a filling sample feature map, and finally generating a synthesized image based on the skin color sample feature map, the filling sample feature map and the image to be synthesized, where the generated synthesized image is more natural and real.
With further continuing reference to fig. 3, fig. 3 illustrates a flow 300 of another embodiment of an image processing method of the present disclosure. The image processing method comprises the following steps:
and 301, acquiring a reference image and a target person head image.
In this embodiment, the specific operation of step 301 has been described in detail in step 201 in the embodiment shown in fig. 2, and is not described herein again.
Step 302, performing five sense organs segmentation on the head of the reference person in the reference image to obtain a head mask, and taking the region except the head mask in the reference image as a reference background.
In the present embodiment, the five sense organs segmentation means dividing the image into regions corresponding to the respective facial features (eyebrows, eyes, nose, mouth, ears) and then locating the complete head region from the divided regions; the head mask represents the region where the head is located. For example, in a specific implementation, the execution subject may extract the head as a whole to obtain a binarized image as the head mask; specifically, the pixel value of the region where the head is located in the reference image may be set to 255, and the pixel value of the region other than the head may be set to 0, that is, the head is represented by white and the rest is filled with black. Optionally, in a specific implementation, before performing the five sense organs segmentation, the execution subject may further translate and scale the reference image according to the size of the target person head image, so that the size of the head of the reference person in the translated and scaled reference image is comparable to the size of the head of the target person.
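For illustration only, the binarization described above can be sketched as follows in Python; the face-parsing function and its label set are hypothetical stand-ins for whichever segmentation model is used, and are not components defined by the present disclosure.

    import numpy as np

    # Assumed label indices for head-related parts produced by some face-parsing model.
    HEAD_LABELS = [1, 2, 3, 4, 5, 6, 7]

    def head_mask_from_parsing(label_map):
        """label_map: HxW integer array from a face-parsing model; returns an HxW uint8 head mask."""
        is_head = np.isin(label_map, HEAD_LABELS)
        return np.where(is_head, 255, 0).astype(np.uint8)  # head region white (255), the rest black (0)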
And step 303, expanding the head mask to obtain an expanded region, wherein the area of the expanded region is larger than that of the image of the head of the target person.
In this embodiment, the expansion manner includes, but is not limited to, expanding the outermost edge of the head outward by a preset distance with the center of the head mask as a reference, or expanding toward the other side of the head by a preset distance with one side of the head as a reference. For example, in the head mask obtained above, the execution subject may set to 255 the pixels that lie outside the 255-valued head region but within two millimeters of the outermost edge of the head, so as to obtain an expansion region with a larger area than the original head. This embodiment does not limit the expansion size, and the specific distance value above is merely illustrative.
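As a hedged sketch, the expansion can be implemented with morphological dilation, for example with OpenCV; the elliptical kernel and its size are arbitrary illustrative choices rather than values prescribed by the disclosure.

    import cv2

    def expand_head_mask(head_mask, kernel_size=15):
        """Dilate the binary head mask so that the white head region grows outward on all sides."""
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
        return cv2.dilate(head_mask, kernel)  # expansion region: strictly larger than the original head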
And step 304, determining a region which does not intersect with the expansion region in the reference background as a part of the reference background, adding the head image of the target person to the expansion region, and determining a region which does not intersect with the head of the target person in the expansion region as a region to be filled.
In this embodiment, after obtaining the expansion region, the execution subject may map the expansion region onto the position of the reference person in the reference image by comparing the reference background with the expansion region; the mapping may be performed by aligning the original head region within the expansion region with the head region in the reference image, so that the region of the reference background that does not overlap the expansion region is obtained as the partial reference background. Further, after obtaining the expansion region, the execution subject may map the head of the target person into the expansion region. Since the expansion region is obtained by expanding the head of the reference person, the mapping may be performed by aligning the facial features, for example aligning the nose of the target person with the nose of the original reference person when adding the target person head image. Finally, the part of the expansion region not covered by the head of the target person is taken as the region to be filled.
And 305, integrating the head image of the target person, part of the reference background and the region to be filled to obtain an image to be synthesized.
In this embodiment, after the execution main body determines the area to be filled and the partial reference background, the execution main body integrates the obtained partial reference background, the area to be filled, and the head image of the target person to obtain an image to be synthesized, where a part of the background of the image to be synthesized surrounds the area to be filled, and the area to be filled surrounds the head image of the target person.
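A minimal sketch of the integration in steps 304 and 305 is given below; it assumes all inputs are already aligned to the reference image coordinates, and the array and mask conventions are illustrative only.

    import numpy as np

    def build_image_to_synthesize(reference_img, target_head_img, target_head_mask, expansion_mask):
        """reference_img, target_head_img: HxWx3 arrays; masks: HxW uint8 arrays with values 0 or 255."""
        out = reference_img.copy()
        out[expansion_mask == 255] = 0                      # clear the whole expansion region
        head = target_head_mask == 255
        out[head] = target_head_img[head]                   # paste the target person head
        region_to_fill = (expansion_mask == 255) & (~head)  # ring between partial background and head
        return out, region_to_fill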
Step 306, extracting the features of the reference image and the image to be synthesized to obtain a skin color sample feature map and a filling sample feature map;
and 307, generating a synthetic image based on the skin color sample characteristic diagram, the filling sample characteristic diagram and the image to be synthesized.
In the present embodiment, the specific operations of steps 306-307 have been described in detail in steps 203 and 204 in the embodiment shown in fig. 2, and are not described herein again.
According to the image processing method provided by the embodiment, the images to be synthesized are obtained by segmenting and expanding the reference image and integrating the segmented and expanded reference image with the head image of the target person, and the images to be synthesized are constructed by utilizing the reference image, so that the synthesized images are more real and natural.
With further continuing reference to fig. 4, fig. 4 shows a flow chart in yet another embodiment of an image processing method of the present disclosure, the image processing method comprising the steps of:
step 401, acquiring a reference image and a target person head image.
And 402, replacing the head of the reference person in the reference image by the head image of the target person to obtain an image to be synthesized.
In the present embodiment, the specific operations of steps 401 and 402 have been described in detail in step 201 and 202 in the embodiment shown in fig. 2, and are not described herein again.
And step 403, extracting the features of the reference image and the features of the image to be synthesized by using the feature extraction network.
In this embodiment, the feature extraction network may, without loss of generality, use classical backbone networks such as AlexNet, ZF Net, VGGNet, Inception, ResNet, and the like. The feature extraction network adopts a dual-input dual-output structure; specifically, the reference image and the image to be synthesized serve as the two inputs, and the features of the reference image and the features of the image to be synthesized serve as the two outputs.
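One possible reading of this dual-input dual-output structure is a shared backbone applied to both images, sketched below in PyTorch; the choice of ResNet-18 and of where the trunk is cut are assumptions made for illustration, not a specification of the disclosure.

    import torch.nn as nn
    from torchvision.models import resnet18

    class DualFeatureExtractor(nn.Module):
        def __init__(self):
            super().__init__()
            backbone = resnet18(weights=None)
            # Keep the convolutional trunk, drop the average-pooling and classification head.
            self.encoder = nn.Sequential(*list(backbone.children())[:-2])

        def forward(self, reference_img, image_to_synthesize):
            # Two inputs in, two feature maps out; both branches share the same weights.
            return self.encoder(reference_img), self.encoder(image_to_synthesize)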
And step 404, extracting a skin color sample feature map and a filling sample feature map from the features of the reference image and the features of the image to be synthesized by adopting an attention mechanism.
In this embodiment, after the execution subject extracts the features of the reference image and the features of the image to be synthesized, the execution subject may extract skin color information of the reference person as a skin color sample feature map by combining the features of the reference image and the features of the image to be synthesized, and extract information of a region to be filled as a filling sample feature map by combining the features of the reference image and the features of the image to be synthesized, so as to provide richer fusion information for subsequent image synthesis.
Optionally, step 404 includes: determining the head features of the target person and the features of the region to be filled based on the features of the image to be synthesized; calculating an attention matrix from the head features of the target person and the features of the reference image to obtain a color attention feature map; multiplying the color attention feature map by the features of the reference image to obtain the skin color sample feature map; calculating an attention matrix from the features of the region to be filled and the features of the reference image to obtain a filled region attention feature map; and multiplying the filled region attention feature map by the features of the reference image to obtain the filling sample feature map.
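A hedged sketch of this attention computation follows; the flatten/softmax formulation and the scaling factor are common choices assumed here for illustration, and the exact attention form used by the disclosure may differ.

    import torch
    import torch.nn.functional as F

    def attention_readout(query_feat, ref_feat):
        """query_feat (head or region-to-fill features) and ref_feat: (B, C, H, W) tensors."""
        b, c, h, w = query_feat.shape
        q = query_feat.flatten(2).transpose(1, 2)               # (B, HW, C)
        k = ref_feat.flatten(2)                                 # (B, C, HW)
        attn = F.softmax(torch.bmm(q, k) / c ** 0.5, dim=-1)    # (B, HW, HW) attention matrix
        v = ref_feat.flatten(2).transpose(1, 2)                 # (B, HW, C)
        out = torch.bmm(attn, v).transpose(1, 2).reshape(b, c, h, w)
        return out                                              # sample feature map read from the reference

    # skin_color_sample = attention_readout(target_head_features, reference_features)
    # filling_sample    = attention_readout(region_to_fill_features, reference_features)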
And step 405, performing color processing on the head image of the target person to obtain a head gray scale image.
In this embodiment, the color processing refers to converting the head image of the target person, originally composed of three RGB channels, into a single-channel grayscale image, thereby obtaining the head grayscale map.
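For illustration, this conversion can be done with OpenCV as below; the library call is only one of many equivalent ways to obtain the single-channel map.

    import cv2

    def head_grayscale(target_head_rgb):
        """target_head_rgb: HxWx3 RGB head image; returns an HxW single-channel grayscale map."""
        return cv2.cvtColor(target_head_rgb, cv2.COLOR_RGB2GRAY)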
And step 406, synthesizing the skin color sample feature map, the filling sample feature map, the head mask, the head gray scale map and part of the reference background to obtain a synthesized image.
In this embodiment, after the execution subject extracts the skin color sample feature map and the filling sample feature map, the execution subject sends the skin color sample feature map and the filling sample feature map, the head mask, the head grayscale map, and a part of the reference background into a pre-trained fusion network for fusion, so as to obtain a composite image.
Optionally, step 406 includes: stitching the skin color sample feature map, the filling sample feature map, the head mask, the head grayscale map, and the partial reference background to obtain a stitched map; and inputting the stitched map into a pre-trained fusion network for fusion to obtain the composite image. For example, the fusion network includes, but is not limited to, a Unet network. Stitching refers to combining the information of each input along the channel dimension; for example, feature map 1 has the shape B × C1 × W × H and feature map 2 has the shape B × C2 × W × H, and stitching feature map 1 and feature map 2 along the channel dimension yields a stitched map of shape B × (C1 + C2) × W × H. This embodiment does not limit the number of channels in the feature maps; the channel counts here are merely illustrative.
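The channel-wise stitching above corresponds to a single concatenation call; the sketch below uses arbitrary example shapes that are not fixed by the disclosure.

    import torch

    feat1 = torch.randn(2, 64, 32, 32)            # B x C1 x W x H
    feat2 = torch.randn(2, 32, 32, 32)            # B x C2 x W x H
    stitched = torch.cat([feat1, feat2], dim=1)   # B x (C1 + C2) x W x H
    assert stitched.shape == (2, 96, 32, 32)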
In this embodiment, the pre-trained fusion network generates, through a generator, a human body image with the specified pose, expression and ID: G(X_Input, X_Ref) = Y, where X_Input is the image to be synthesized, X_Ref is the reference image, and Y is the composite image output after fusion. The loss function used when training the fusion network includes the following components:
(1) ID preservation loss. The intermediate features extracted by Arcface are aligned in a high-dimensional information space:
L_ID = ||Arcface(Y) - Arcface(X_GT)||_2
where X_GT is the part of the background derived from the reference image.
(2) Image feature alignment loss. The intermediate features extracted by VGG19 are aligned in the high-dimensional information space:
L_VGG = ||VGG(Y) - VGG(X_GT)||_2
(3) Discriminator feature alignment loss. The intermediate features extracted by the discriminator D are aligned in the high-dimensional information space:
L_D = ||D(Y) - D(X_GT)||_2
(4) Discriminator (adversarial) loss. Adversarial training with the discriminator reduces artifacts in the generated image:
L_GAN = E(log D(X_GT)) + E(log(1 - D(Y)))
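A hedged Python sketch of these four terms is given below; "arcface", "vgg_features", "d_features" and "d_prob" are placeholders for an ID-embedding network, a VGG19 feature extractor, the discriminator's intermediate features and its probability output, and are assumptions rather than interfaces defined by the disclosure.

    import torch

    def fusion_losses(Y, X_gt, arcface, vgg_features, d_features, d_prob, eps=1e-8):
        l_id  = torch.norm(arcface(Y) - arcface(X_gt), p=2)            # (1) ID preservation loss
        l_vgg = torch.norm(vgg_features(Y) - vgg_features(X_gt), p=2)  # (2) image feature alignment loss
        l_d   = torch.norm(d_features(Y) - d_features(X_gt), p=2)      # (3) discriminator feature alignment loss
        # (4) adversarial loss; assumes d_prob returns probabilities in (0, 1)
        l_gan = torch.mean(torch.log(d_prob(X_gt) + eps)) + torch.mean(torch.log(1 - d_prob(Y) + eps))
        return l_id, l_vgg, l_d, l_gan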
in the image processing method provided by the embodiment, the attention mechanism is adopted to extract the skin color sample characteristic diagram and the filling sample characteristic diagram, so that the skin color sample characteristic diagram has more real skin texture compared with the average color information adopted by the fusion network, the reduction of the migration quality of skin color caused by the adoption of the average color information is avoided, meanwhile, the filling sample characteristic diagram avoids the information imagined by the network from being inconsistent with the original information of the region to be filled, more complete and reliable information is provided for the fusion network, and the image synthesis mode is enriched.
In order to facilitate understanding of the technical solution of the present disclosure, a head-swapping application scenario is described in detail as an example. Referring to fig. 5A and 5B, fig. 5A illustrates an application scenario of the image processing method of the present disclosure, and fig. 5B is a schematic diagram of extracting the skin color sample feature map and the filling sample feature map in the scenario of fig. 5A. In this application scenario, the reference image 1 includes the neck of the reference person in addition to the head of the reference person and the background. In the implementation process, the execution subject inputs the reference image 1 and the image to be synthesized 2 into the feature extraction network 3, and the feature extraction network 3 outputs the features 4 of the reference image and the features 5 of the image to be synthesized. Further, the execution subject obtains the skin color sample feature map 7 and the filling sample feature map 8 from the features 4 of the reference image and the features 5 of the image to be synthesized through attention feature extraction 6. Referring to fig. 5B, the features 5 of the image to be synthesized include the head feature 5a of the target person and the feature 5b of the region to be filled; the execution subject calculates attention matrices from the head feature 5a of the target person and the feature 5b of the region to be filled together with the features 4 of the reference image, and multiplies the attention matrices by the features 4 of the reference image to obtain the skin color sample feature map 7 and the filling sample feature map 8. Referring again to fig. 5A, after acquiring the head mask 9, the partial reference background 10 and the head grayscale map 11 based on the image to be synthesized 2, the execution subject concatenates them with the skin color sample feature map 7 and the filling sample feature map 8 obtained in the previous step, and inputs the result into the pre-trained fusion network 12, which outputs the composite image 13.
In this embodiment, the composite image obtained by the above method takes the target person as the subject and the reference image as the background, and further includes the neck of the reference person from the reference image. Because the skin color of the target person in the composite image is the same as the skin color of the reference person, there is no visible difference between the head of the target person and the neck in the image, and the area where the head meets the background is more real and natural, which improves the attractiveness of the composite image.
Referring further to fig. 6, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an image processing apparatus, which corresponds to the embodiment of the image processing method shown in fig. 2. The apparatus can be applied to various electronic devices.
As shown in fig. 6, the image processing apparatus 600 of the present embodiment may include: an acquisition module 601, a replacement module 602, a feature extraction module 603, and an image generation module 604. The acquisition module 601 is configured to acquire a reference image and a target person head image, where the reference image includes a reference background and a reference person head; the replacement module 602 is configured to replace the head of the reference person in the reference image with the target person head image to obtain an image to be synthesized, where the image to be synthesized includes a partial reference background, the head of the target person, and a region to be filled between the partial reference background and the head of the target person; the feature extraction module 603 is configured to perform feature extraction on the reference image and the image to be synthesized to obtain a skin color sample feature map and a filling sample feature map; and the image generation module 604 is configured to generate a composite image based on the skin color sample feature map, the filling sample feature map and the image to be synthesized.
In the present embodiment, in the image processing apparatus 600: for the specific processing of the acquisition module 601, the replacement module 602, the feature extraction module 603 and the image generation module 604 and the technical effects thereof, reference may be made to the related descriptions of steps 201 to 204 in the embodiment corresponding to fig. 2, which are not repeated here.
In some optional implementations of this embodiment, the feature extraction module 603 includes:
a first extraction module configured to extract features of a reference image and features of an image to be synthesized using a feature extraction network;
and the second extraction module is configured to adopt an attention mechanism to extract a skin color sample feature map and a filling sample feature map from the features of the reference image and the features of the image to be synthesized.
In some optional implementations of this embodiment, the second extracting module includes:
the characteristic determination module is configured to determine the head characteristic of the target person and the characteristic of the region to be filled based on the characteristic of the image to be synthesized;
the first calculation module is configured to calculate an attention matrix by using the head characteristics of the target person and the characteristics of the reference image to obtain a color attention characteristic map;
the first multiplication module is configured to multiply the color attention feature map and the features of the reference image to obtain a skin color sample feature map;
the second calculation module is configured to calculate an attention matrix by using the characteristics of the region to be filled and the characteristics of the reference image to obtain a filled region attention characteristic map;
and the second multiplying module is configured to multiply the filled region attention feature map and the features of the reference image to obtain a filled sample feature map.
In some optional implementations of this embodiment, the replacing module 602 includes:
the segmentation module is configured to perform five-sense organ segmentation on the head of a reference person in the reference image to obtain a head mask, and the region except the head mask in the reference image is used as a reference background;
the expansion module is configured to expand the head mask to obtain an expansion area, wherein the area of the expansion area is larger than that of the image of the head of the target person;
the region determining module is configured to determine a region, which does not intersect with the expansion region, in the reference background as a partial reference background, add the head image of the target person to the expansion region, and determine a region, which does not intersect with the head of the target person, in the expansion region as a region to be filled;
and the integration module is configured to integrate the head image of the target person, part of the reference background and the region to be filled to obtain an image to be synthesized.
In some optional implementations of this embodiment, the image generating module 604 includes:
the color processing module is configured to perform color processing on the head image of the target person to obtain a head gray scale image;
and the synthesis module is configured to synthesize the skin color sample feature map, the filling sample feature map, the head mask, the head gray scale map and part of the reference background to obtain a synthesized image.
In some optional implementations of this embodiment, the synthesizing module includes:
the splicing module is configured to splice the skin color sample characteristic graph, the filling sample characteristic graph, the head mask, the head gray-scale graph and part of the reference background to obtain a spliced graph;
and the fusion module is configured to input the mosaic into a fusion network trained in advance for fusion to obtain a composite image.
In some optional implementations of the present embodiment, the reference image further includes an exposed skin area other than the head of the reference person.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 executes the respective methods and processes described above, such as the image processing method. For example, in some embodiments, the image processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the image processing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the image processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. An image processing method comprising:
acquiring a reference image and a target person head image, wherein the reference image comprises a reference background and a reference person head;
replacing the head of a reference person in the reference image with the head image of the target person to obtain an image to be synthesized, wherein the image to be synthesized comprises a part of reference background, the head of the target person and a region to be filled between the part of reference background and the head of the target person;
performing feature extraction on the reference image and the image to be synthesized to obtain a skin color sample feature map and a filling sample feature map;
and generating a composite image based on the skin color sample characteristic diagram, the filling sample characteristic diagram and the image to be synthesized.
2. The method according to claim 1, wherein the extracting features of the reference image and the image to be synthesized to obtain a skin color sample feature map and a filling sample feature map comprises:
extracting the features of the reference image and the features of the image to be synthesized by using a feature extraction network;
and adopting an attention mechanism to extract a skin color sample feature map and a filling sample feature map from the features of the reference image and the features of the image to be synthesized.
3. The method according to claim 2, wherein the extracting a skin color sample feature map and a filling sample feature map from the extracted features of the reference image and the image to be synthesized by using an attention mechanism comprises:
determining the head characteristics and the characteristics of the region to be filled of the target person based on the characteristics of the image to be synthesized;
calculating an attention matrix by using the head characteristics of the target person and the characteristics of the reference image to obtain a color attention characteristic diagram;
multiplying the color attention feature map and the features of the reference image to obtain the skin color sample feature map;
calculating an attention matrix by using the characteristics of the region to be filled and the characteristics of the reference image to obtain a filled region attention characteristic diagram;
and multiplying the attention feature map of the filling area with the features of the reference image to obtain the feature map of the filling sample.
4. The method according to any one of claims 1 to 3, wherein the replacing the head of the reference person in the reference image with the head image of the target person to obtain the image to be synthesized comprises:
performing five-sense organ segmentation on the head of the reference person in the reference image to obtain a head mask, and taking the region of the reference image except the head mask as the reference background;
expanding the head mask to obtain an expanded region, wherein the area of the expanded region is larger than that of the target person head image;
determining a region of the reference background, which does not intersect with the expansion region, as the partial reference background, adding the target person head image to the expansion region, and determining a region of the expansion region, which does not intersect with the target person head, as the region to be filled;
and integrating the head image of the target person, the partial reference background and the region to be filled to obtain the image to be synthesized.
5. The method of claim 4, wherein the generating a composite image based on the skin color exemplar feature map, the fill exemplar feature map, and the image to be composite comprises:
performing color processing on the head image of the target person to obtain a head gray scale image;
and synthesizing the skin color sample feature map, the filling sample feature map, the head mask, the head gray scale map and the partial reference background to obtain the synthesized image.
6. The method according to claim 5, wherein the synthesizing the skin color sample feature map, the filling sample feature map, the head mask, the head gray scale map and the partial reference background to obtain the synthesized image comprises:
splicing the skin color sample characteristic graph, the filling sample characteristic graph, the head mask, the head gray level graph and the partial reference background to obtain a spliced graph;
inputting the mosaic image into a pre-trained fusion network for fusion to obtain the composite image.
7. The method of claim 1, wherein the reference image further comprises an exposed area of skin other than the reference person's head.
8. An image processing apparatus comprising:
an acquisition module configured to acquire a reference image and a target person head image, wherein the reference image comprises a reference background and a reference person head;
the replacing module is configured to replace the head of the reference person in the reference image by using the head image of the target person to obtain an image to be synthesized, wherein the image to be synthesized comprises a partial reference background, the head of the target person and a region to be filled between the partial reference background and the head of the target person;
the characteristic extraction module is configured to extract the characteristics of the reference image and the image to be synthesized to obtain a skin color sample characteristic diagram and a filling sample characteristic diagram;
and the image generation module is configured to generate a composite image based on the skin color sample feature map, the filling sample feature map and the image to be synthesized.
9. The apparatus of claim 8, wherein the feature extraction module comprises:
a first extraction module configured to extract features of the reference image and features of the image to be synthesized using a feature extraction network;
and the second extraction module is configured to extract a skin color sample feature map and a filling sample feature map from the features of the reference image and the features of the image to be synthesized by adopting an attention mechanism.
10. The apparatus of claim 9, wherein the second extraction module comprises:
the characteristic determination module is configured to determine the head characteristic and the characteristic of the region to be filled of the target person based on the characteristic of the image to be synthesized;
the first calculation module is configured to calculate an attention matrix by using the head characteristics of the target person and the characteristics of the reference image to obtain a color attention characteristic map;
a first multiplying module configured to multiply the color attention feature map and the feature of the reference image to obtain the skin color sample feature map;
the second calculation module is configured to calculate an attention matrix by using the characteristics of the region to be filled and the characteristics of the reference image to obtain a filled region attention characteristic map;
and the second multiplying module is configured to multiply the filled region attention feature map and the features of the reference image to obtain the filling sample feature map.
11. The apparatus according to any one of claims 8-10, wherein the replacing module comprises:
a segmentation module configured to perform facial feature segmentation on the head of the reference person in the reference image to obtain a head mask, and to use a region of the reference image other than the head mask as the reference background;
an expansion module configured to expand the head mask to obtain an expansion region, wherein the area of the expansion region is larger than that of the head image of the target person;
a region determination module configured to determine the region of the reference background that does not intersect the expansion region as the partial reference background, add the head image of the target person to the expansion region, and determine the region of the expansion region that does not intersect the head of the target person as the region to be filled;
and an integration module configured to integrate the head image of the target person, the partial reference background and the region to be filled to obtain the image to be synthesized.
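A possible reading of the replacing module in claim 11, sketched with NumPy and OpenCV: the head mask is expanded by morphological dilation, the partial reference background and the region to be filled are derived as mask differences, and the target head is pasted into the expansion region. The mask representation, the kernel size and the assumption that a target head mask aligned with the reference image is available are all illustrative choices, not details from the patent.

```python
import cv2
import numpy as np

def build_image_to_be_synthesized(reference_img, head_mask, target_head_img, target_head_mask,
                                  kernel_size: int = 25):
    """reference_img, target_head_img: (H, W, 3) uint8; head_mask, target_head_mask: (H, W) in {0, 1}."""
    # Everything outside the reference head mask is the reference background.
    background_mask = 1 - head_mask

    # Expand the head mask so the expansion region is larger than the target head.
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    expansion_region = cv2.dilate(head_mask.astype(np.uint8), kernel)

    # Partial reference background: background that does not intersect the expansion region.
    partial_background_mask = background_mask * (1 - expansion_region)

    # Region to be filled: inside the expansion region but not covered by the target head.
    region_to_fill = expansion_region * (1 - target_head_mask)

    # Integrate: keep partial-background pixels, paste the target head, leave the fill region empty.
    image = reference_img * partial_background_mask[..., None]
    image = np.where(target_head_mask[..., None] == 1, target_head_img, image)
    return image.astype(np.uint8), partial_background_mask, region_to_fill

# Dummy example (shapes and mask positions are assumptions):
ref = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
head = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
ref_mask = np.zeros((256, 256), np.uint8); ref_mask[64:192, 64:192] = 1
tgt_mask = np.zeros((256, 256), np.uint8); tgt_mask[80:176, 80:176] = 1
img, bg_mask, fill = build_image_to_be_synthesized(ref, ref_mask, head, tgt_mask)
```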
12. The apparatus of claim 11, wherein the image generation module comprises:
a color processing module configured to perform color processing on the head image of the target person to obtain a head grayscale map;
and a synthesis module configured to synthesize the skin color sample feature map, the filling sample feature map, the head mask, the head grayscale map and the partial reference background to obtain the composite image.
13. The apparatus of claim 12, wherein the synthesis module comprises:
a splicing module configured to splice the skin color sample feature map, the filling sample feature map, the head mask, the head grayscale map and the partial reference background to obtain a spliced map;
and a fusion module configured to input the spliced map into a pre-trained fusion network for fusion to obtain the composite image.
14. The apparatus of claim 8, wherein the reference image further comprises an exposed area of skin other than the reference person's head.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202111325365.8A 2021-11-10 2021-11-10 Image processing method, device, equipment and storage medium Pending CN114049290A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111325365.8A CN114049290A (en) 2021-11-10 2021-11-10 Image processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111325365.8A CN114049290A (en) 2021-11-10 2021-11-10 Image processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114049290A true CN114049290A (en) 2022-02-15

Family

ID=80207957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111325365.8A Pending CN114049290A (en) 2021-11-10 2021-11-10 Image processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114049290A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385641A (en) * 2023-03-29 2023-07-04 北京百度网讯科技有限公司 Image processing method and device, electronic equipment and storage medium
CN116385641B (en) * 2023-03-29 2024-03-19 北京百度网讯科技有限公司 Image processing method and device, electronic equipment and storage medium
CN117291979A (en) * 2023-09-26 2023-12-26 北京鹰之眼智能健康科技有限公司 Ear hole positioning method, electronic equipment and storage medium
CN117291979B (en) * 2023-09-26 2024-04-26 北京鹰之眼智能健康科技有限公司 Ear hole positioning method, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11734851B2 (en) Face key point detection method and apparatus, storage medium, and electronic device
CN108229296B (en) Face skin attribute identification method and device, electronic equipment and storage medium
US10599914B2 (en) Method and apparatus for human face image processing
CN113327278B (en) Three-dimensional face reconstruction method, device, equipment and storage medium
CN106682632B (en) Method and device for processing face image
CN114187624B (en) Image generation method, device, electronic equipment and storage medium
CN108388889B (en) Method and device for analyzing face image
WO2022227765A1 (en) Method for generating image inpainting model, and device, medium and program product
CN112330527A (en) Image processing method, image processing apparatus, electronic device, and medium
KR20170002097A (en) Method for providing ultra light-weight data animation type based on sensitivity avatar emoticon
US20230047748A1 (en) Method of fusing image, and method of training image fusion model
CN113221771A (en) Living body face recognition method, living body face recognition device, living body face recognition equipment, storage medium and program product
CN110874575A (en) Face image processing method and related equipment
CN115147261A (en) Image processing method, device, storage medium, equipment and product
CN112884889B (en) Model training method, model training device, human head reconstruction method, human head reconstruction device, human head reconstruction equipment and storage medium
CN113379877B (en) Face video generation method and device, electronic equipment and storage medium
CN113380269B (en) Video image generation method, apparatus, device, medium, and computer program product
CN114120413A (en) Model training method, image synthesis method, device, equipment and program product
CN114049290A (en) Image processing method, device, equipment and storage medium
CN113052962A (en) Model training method, information output method, device, equipment and storage medium
CN113269719A (en) Model training method, image processing method, device, equipment and storage medium
CN116309983B (en) Training method and generating method and device of virtual character model and electronic equipment
CN116402914A (en) Method, device and product for determining stylized image generation model
US20220198828A1 (en) Method and apparatus for generating image
CN113781653B (en) Object model generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination