CN112184886A - Image processing method and device, computer equipment and storage medium - Google Patents

Image processing method and device, computer equipment and storage medium

Info

Publication number
CN112184886A
CN112184886A (application number CN202011042591.0A)
Authority
CN
China
Prior art keywords
image
sample
feature
semantic segmentation
feature map
Prior art date
Legal status
Granted
Application number
CN202011042591.0A
Other languages
Chinese (zh)
Other versions
CN112184886B (en)
Inventor
赵鑫 (Zhao Xin)
邱学侃 (Qiu Xuekan)
Current Assignee
Beijing Lexuebang Network Technology Co ltd
Original Assignee
Beijing Lexuebang Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Lexuebang Network Technology Co ltd
Priority to CN202011042591.0A
Publication of CN112184886A
Application granted
Publication of CN112184886B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/10012 Stereo images

Abstract

The present disclosure provides an image processing method, apparatus, computer device, and storage medium, wherein the method comprises: acquiring a first image comprising a first target object and a second image comprising a first target garment; performing semantic segmentation processing on the first image and the second image respectively to obtain a first semantic segmentation result of the first image and a second semantic segmentation result of the second image; performing feature extraction processing on the first image and the second image respectively to obtain a first feature map of the first image and a second feature map of the second image; performing feature fusion processing on the first feature map and the second feature map based on the first semantic segmentation result and the second semantic segmentation result to obtain a fusion feature map; and obtaining a target image based on the fusion feature map. According to the method and the apparatus, the clothing of the target object is changed by processing an image captured of the target object, without a three-dimensional reconstruction of the human body in advance, so the efficiency is high.

Description

Image processing method and device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to an image processing method and apparatus, a computer device, and a storage medium.
Background
Virtual outfit-changing (virtual try-on) technology fuses an image of a human body in a given posture with a clothing image so that, without changing the body posture, the clothing worn by the human body appears the same as the clothing in the clothing image.
Current virtual outfit-changing methods need to photograph the target object from multiple angles in advance, reconstruct a three-dimensional model of the target object, and then realize the virtual outfit change according to the three-dimensional model. Because a three-dimensional model must be built for the target object, such methods require a large amount of preliminary work and are inefficient.
Disclosure of Invention
The embodiment of the disclosure at least provides an image processing method, an image processing device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides an image processing method, including:
acquiring a first image comprising a first target object and a second image comprising a first target garment;
performing semantic segmentation processing on the first image and the second image respectively to obtain a first semantic segmentation result of the first image and a second semantic segmentation result of the second image; respectively performing feature extraction processing on the first image and the second image to obtain a first feature map of the first image and a second feature map of the second image;
performing feature fusion processing on the first feature map and the second feature map based on the first semantic segmentation result and the second semantic segmentation result to obtain a fusion feature map;
and obtaining a target image based on the fusion feature map.
In an optional implementation manner, the performing, based on the first semantic segmentation result and the second semantic segmentation result, feature fusion processing on the first feature map and the second feature map to obtain a fused feature map includes:
determining a first feature subgraph corresponding to each first part in a plurality of first parts of the target object based on the first semantic segmentation result and the first feature graph;
determining a second feature subgraph corresponding to each of a plurality of second parts of the target garment based on the second semantic segmentation result and the second feature graph;
and performing feature fusion on the first feature sub-graph and the second feature sub-graph to obtain the fusion feature graph.
In an optional embodiment, determining, based on the first semantic segmentation result and the first feature map, a first feature subgraph corresponding to each of a plurality of first portions of the target object includes:
determining a target pixel point corresponding to each first part from the first image based on the first semantic segmentation result;
determining a target feature point corresponding to the target pixel point from the first feature map based on the mapping relation between the pixel point in the first image and the feature point in the first feature map;
and determining a first feature subgraph corresponding to each first part based on the target feature points and the first feature graph.
In an optional implementation manner, performing feature fusion on the first feature sub-graph and the second feature sub-graph to obtain the fused feature graph includes:
for each first part, determining a target second part matched with each first part from a plurality of second parts;
performing first fusion processing on the first feature subgraph corresponding to the first part and the second feature subgraph corresponding to the target second part to obtain an intermediate fusion feature graph corresponding to each first part;
and performing second fusion processing on the intermediate fusion feature maps respectively corresponding to the plurality of first parts to obtain the fusion feature maps.
In an optional embodiment, obtaining the target image based on the fused feature map includes:
and decoding the fusion feature graph to obtain the target image.
In an optional implementation, performing semantic segmentation processing on the first image and the second image respectively includes:
and respectively performing semantic segmentation processing on the first image and the second image by using a pre-trained semantic segmentation model to obtain a first semantic segmentation result and a second semantic segmentation result.
In an optional implementation, the performing the feature extraction processing on the first image and the second image respectively includes:
respectively performing feature extraction processing on the first image and the second image by using a pre-trained feature extraction network to obtain a first feature map of the first image and a second feature map of the second image;
in an optional implementation manner, the obtaining a target image based on the fused feature map includes:
and decoding the fusion characteristic graph by using a pre-trained decoder to obtain the target image.
In an alternative embodiment, training the feature extraction network and the decoder includes:
acquiring a plurality of first sample images including a second target object and a plurality of second sample images including a second target garment;
performing feature extraction processing on the first sample image and the second sample image by using a feature extraction network to be trained to obtain a first sample feature map of the first sample image and a second sample feature map of the second sample image;
performing semantic segmentation processing on the first sample image and the second sample image to obtain a first sample semantic segmentation result of the first sample image and a second sample semantic segmentation result of the second sample image;
performing feature fusion processing on the first sample feature map and the second sample feature map based on the first sample semantic segmentation result and the second sample semantic segmentation result to obtain a sample fusion feature map;
decoding the sample fusion characteristic graph by using a decoder to be trained to obtain a sample generation image;
determining a model loss based on the sample generation image and the first sample image; training the feature extraction network to be trained and the decoder to be trained based on the model loss;
and obtaining the trained feature extraction network and the trained decoder through multi-round training of the feature extraction network to be trained and the decoder to be trained.
In an alternative embodiment, determining the model loss based on the sample generation image and the first sample image comprises:
taking the sample generation image as a new first sample image, and taking a first sample image corresponding to the sample generation image as a new second sample image;
obtaining a new generated image corresponding to the new first sample image based on the new first sample image and the new second sample image by using the feature extraction network to be trained and the decoder to be trained;
determining the model loss based on the new generated image and the first sample image corresponding to the sample generation image; the model loss comprises at least one of: a makeup loss, a style loss, and a face loss.
In an optional implementation manner, in a case where the model loss includes a makeup loss, the determining the model loss based on the new generated image and a first sample image corresponding to the sample generated image includes:
determining the makeup loss based on the makeup difference degree between the new generated image and the first sample image corresponding to the sample generation image.
In an optional embodiment, in a case that the model loss includes a style loss, the determining the model loss based on the new generated image and a first sample image corresponding to the sample generated image includes:
determining first style information of the new generated image based on the new generated image;
determining second style information of the first sample image corresponding to the sample generation image;
determining the style loss based on the first style information and the second style information.
In an alternative embodiment, in a case that the model loss includes a face loss, the determining the model loss based on the new generated image and a first sample image corresponding to the sample generated image includes:
and determining the face loss based on the human face difference degree between the new generated image and the first sample image corresponding to the sample generation image.
In an alternative embodiment, training the feature extraction network and the decoder further comprises:
carrying out discrimination processing on the sample generation image by using a discriminator to obtain a discrimination result of whether the sample generation image is a generation image;
and carrying out countermeasure training on a discriminator and a generator consisting of the to-be-trained feature extraction network and the to-be-trained decoder based on the discrimination result.
In a second aspect, an embodiment of the present disclosure further provides an image processing apparatus, including:
an acquisition module for acquiring a first image comprising a first target object and a second image comprising a first target garment;
the processing module is used for performing semantic segmentation processing on the first image and the second image respectively to obtain a first semantic segmentation result of the first image and a second semantic segmentation result of the second image; respectively performing feature extraction processing on the first image and the second image to obtain a first feature map of the first image and a second feature map of the second image;
the feature fusion module is used for performing feature fusion processing on the first feature map and the second feature map based on the first semantic segmentation result and the second semantic segmentation result to obtain a fusion feature map;
and the determining module is used for obtaining a target image based on the fusion characteristic graph.
In an optional implementation manner, when the feature fusion module performs feature fusion processing on the first feature map and the second feature map based on the first semantic segmentation result and the second semantic segmentation result to obtain a fused feature map, the feature fusion module is configured to:
determining a first feature subgraph corresponding to each first part in a plurality of first parts of the target object based on the first semantic segmentation result and the first feature graph;
determining a second feature subgraph corresponding to each of a plurality of second parts of the target garment based on the second semantic segmentation result and the second feature graph;
and performing feature fusion on the first feature sub-graph and the second feature sub-graph to obtain the fusion feature graph.
In an optional embodiment, when determining, based on the first semantic segmentation result and the first feature map, a first feature subgraph corresponding to each of a plurality of first locations of the target object, the feature fusion module is configured to:
determining a target pixel point corresponding to each first part from the first image based on the first semantic segmentation result;
determining a target feature point corresponding to the target pixel point from the first feature map based on the mapping relation between the pixel point in the first image and the feature point in the first feature map;
and determining a first feature subgraph corresponding to each first part based on the target feature points and the first feature graph.
In an optional implementation manner, when the feature fusion module performs feature fusion on the first feature sub-graph and the second feature sub-graph to obtain the fused feature graph, the feature fusion module is configured to:
for each first part, determining a target second part matched with each first part from a plurality of second parts;
performing first fusion processing on the first feature subgraph corresponding to the first part and the second feature subgraph corresponding to the target second part to obtain an intermediate fusion feature graph corresponding to each first part;
and performing second fusion processing on the intermediate fusion feature maps respectively corresponding to the plurality of first parts to obtain the fusion feature maps.
In an optional embodiment, when obtaining the target image based on the fused feature map, the determining module is configured to:
and decoding the fusion feature graph to obtain the target image.
In an optional embodiment, when performing semantic segmentation processing on the first image and the second image, the processing module is configured to:
and respectively performing semantic segmentation processing on the first image and the second image by using a pre-trained semantic segmentation model to obtain a first semantic segmentation result and a second semantic segmentation result.
In an optional implementation manner, when performing feature extraction processing on the first image and the second image respectively, the processing module is configured to:
respectively performing feature extraction processing on the first image and the second image by using a pre-trained feature extraction network to obtain a first feature map of the first image and a second feature map of the second image;
in an optional embodiment, when obtaining the target image based on the fused feature map, the determining module is configured to:
and decoding the fusion characteristic graph by using a pre-trained decoder to obtain the target image.
In an optional embodiment, the method further comprises: a training module for training the feature extraction network and the decoder in the following manner:
acquiring a plurality of first sample images including a second target object and a plurality of second sample images including a second target garment;
performing feature extraction processing on the first sample image and the second sample image by using a feature extraction network to be trained to obtain a first sample feature map of the first sample image and a second sample feature map of the second sample image;
performing semantic segmentation processing on the first sample image and the second sample image to obtain a first sample semantic segmentation result of the first sample image and a second sample semantic segmentation result of the second sample image;
performing feature fusion processing on the first sample feature map and the second sample feature map based on the first sample semantic segmentation result and the second sample semantic segmentation result to obtain a sample fusion feature map;
decoding the sample fusion characteristic graph by using a decoder to be trained to obtain a sample generation image;
determining a model loss based on the sample generation image and the first sample image; training the feature extraction network to be trained and the decoder to be trained based on the model loss;
and obtaining the trained feature extraction network and the trained decoder through multi-round training of the feature extraction network to be trained and the decoder to be trained.
In an alternative embodiment, the training module, when determining the model loss based on the sample generation image and the first sample image, is configured to:
taking the sample generation image as a new first sample image, and taking a first sample image corresponding to the sample generation image as a new second sample image;
obtaining a new generated image corresponding to the new first sample image based on the new first sample image and the new second sample image by using the feature extraction network to be trained and the decoder to be trained;
determining the model loss based on the new generated image and the first sample image corresponding to the sample generation image; the model loss comprises at least one of: a makeup loss, a style loss, and a face loss.
In an optional implementation manner, when the model loss includes a makeup loss, the training module is configured to, when determining the model loss based on the new generated image and a first sample image corresponding to the sample generated image:
determining the makeup loss based on the makeup difference degree between the new generated image and the first sample image corresponding to the sample generation image.
In an optional embodiment, in a case that the model loss includes a style loss, the training module, when determining the model loss based on the new generated image and a first sample image corresponding to the sample generated image, is configured to:
determining first style information of the new generated image based on the new generated image;
determining second style information of the first sample image corresponding to the sample generation image;
determining the style loss based on the first style information and the second style information.
In an alternative embodiment, when the model loss includes a face loss, the training module is configured to, when determining the model loss based on the new generated image and a first sample image corresponding to the sample generated image,:
and determining the face loss based on the human face difference degree between the new generated image and the first sample image corresponding to the sample generation image.
In an optional embodiment, the training module, when training the feature extraction network and the decoder, is further configured to:
carrying out discrimination processing on the sample generation image by using a discriminator to obtain a discrimination result of whether the sample generation image is a generation image;
and carrying out countermeasure training on a discriminator and a generator consisting of the to-be-trained feature extraction network and the to-be-trained decoder based on the discrimination result.
In a third aspect, the present disclosure further provides a computer device, comprising a processor and a memory, where the memory stores machine-readable instructions executable by the processor, and the processor is configured to execute the machine-readable instructions stored in the memory; when executed by the processor, the machine-readable instructions perform the steps in the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, the present disclosure further provides a computer-readable storage medium having a computer program stored thereon, where the computer program, when executed, performs the steps in the first aspect or any one of the possible implementation manners of the first aspect.
For the description of the effects of the image processing apparatus, the computer device, and the computer-readable storage medium, reference is made to the description of the image processing method, which is not repeated here.
According to the image processing method, the image processing device, the computer equipment and the storage medium, the semantic segmentation processing and the feature extraction processing are respectively carried out on the first image comprising the first target object and the second image comprising the first target garment, and then the first feature map of the first image and the second feature map of the second image are fused based on the result of the semantic segmentation, so that the target image is obtained based on the fused feature map fused with the relevant features of the first target garment.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive additional related drawings from them without inventive effort.
Fig. 1 shows a flowchart of an image processing method provided by an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating a specific method for performing feature fusion processing on a first feature map and a second feature map based on a first semantic segmentation result and a second semantic segmentation result to obtain a fusion feature map in the image processing method provided in the embodiment of the present disclosure;
fig. 3 is a flowchart illustrating a specific method for determining a first feature subgraph corresponding to each of a plurality of first portions of a target object based on a first semantic segmentation result and a first feature graph according to an embodiment of the present disclosure;
FIG. 4 is a flow chart illustrating a specific method for training a feature extraction network and a decoder according to an embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating a method for determining a model loss based on a sample generation image and a first sample image, provided by an embodiment of the present disclosure;
fig. 6 shows a schematic diagram of an image processing apparatus provided by an embodiment of the present disclosure;
fig. 7 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of embodiments of the present disclosure, as generally described and illustrated herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
Research shows that virtual outfit-changing technology has wide application in many fields. For example, photographing software with a beautifying function can display the user wearing different outfits in the user interface, or a teacher can wear different outfits according to the teaching content during online teaching. Current virtual outfit-changing methods need to photograph the target object to be virtually dressed from multiple angles in advance to reconstruct a three-dimensional model of the target object, and then realize the virtual outfit change of the target object according to the three-dimensional model. Because a three-dimensional model must be established for the target object in advance, a large amount of preliminary preparation work has to be completed over a long time, and the efficiency is low.
Based on this research, the present disclosure provides an image processing method in which semantic segmentation processing and feature extraction processing are performed on a first image including a first target object and a second image including a first target garment, and then the first feature map of the first image and the second feature map of the second image are fused based on the semantic segmentation results, so that a target image is obtained based on a fusion feature map into which the relevant features of the first target garment are fused. The process operates directly on images captured of the user and is therefore highly efficient.
The above-mentioned drawbacks were identified by the inventors through practical and careful study; therefore, the discovery of the above problems and the solutions proposed below to address them should both be regarded as contributions made by the inventors in the course of the present disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
To facilitate understanding of the present embodiment, first, an image processing method disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the image processing method provided in the embodiments of the present disclosure is generally a computer device with certain computing capability, and the computer device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle mounted device, a wearable device, or a server or other processing device. In some possible implementations, the image processing method may be implemented by a processor calling computer readable instructions stored in a memory.
The following describes the image processing method provided in the embodiments of the present disclosure, taking the outfit change of a teacher who gives lectures online as an example.
It should be noted that the online lectures mentioned in the embodiments of the present disclosure may include a live lecture and a recorded lecture, and the like, which is not limited thereto.
Referring to fig. 1, which is a flowchart of an image processing method provided in the embodiment of the present disclosure, the method includes steps S101 to S104, where:
s101: acquiring a first image comprising a first target object and a second image comprising a first target garment;
s102: performing semantic segmentation processing on the first image and the second image respectively to obtain a first semantic segmentation result of the first image and a second semantic segmentation result of the second image; respectively carrying out feature extraction processing on the first image and the second image to obtain a first feature map of the first image and a second feature map of the second image;
s103: performing feature fusion processing on the first feature map and the second feature map based on the first semantic segmentation result and the second semantic segmentation result to obtain a fusion feature map;
s104: and obtaining a target image based on the fusion feature map.
In the embodiments of the present disclosure, the features of the first target garment and the features of the first target object are fused into a fusion feature map based on the semantic segmentation result of the first image including the first target object and the semantic segmentation result of the second image including the first target garment, so that the target image obtained from the fusion feature map presents the features of both the first target object and the first target garment. Therefore, no three-dimensional reconstruction of the human body needs to be performed in advance; the outfit of the target object is changed directly by processing the image captured of the target object, which is highly efficient.
The following describes the details of S101 to S104.
For the above S101, the first target object is a teacher giving lessons online, and the first image may be a video frame image obtained based on a teaching video captured by a capturing device (such as a camera) when the teacher gives lessons online. In other application scenarios, the first image may be any image comprising at least one person.
For example, in a case where a higher-quality teaching video needs to be pushed to a user (such as a student) watching an online course, the first image may be obtained by extracting video frame images from the captured teaching video frame by frame, which yields a higher-quality generated video. Alternatively, in a case where the amount of calculation needs to be reduced so that the teaching video can be pushed to the user more quickly, the first image can be acquired by extracting video frame images from the captured teaching video at a preset time interval. Alternatively, the first image may be obtained at moments when the pixel points in the video frames change significantly, where a significant change includes at least one of the following: a large change in the teacher's motion amplitude, or a change in the teacher's position, during the lesson. The specific mode can be selected according to the actual situation and is not described here again.
The first target garment may differ according to the teaching content, and the teacher can select the teaching attire independently: the selection can be made before the lesson starts, and the attire can also be changed at any time during the lesson as needed. Illustratively, when a teacher teaches ancient literature, the teacher may wear ancient-style clothing, in which case the first target garment is the ancient-style clothing.
The second image comprising the first target garment is, for example, an image of a person or mannequin wearing the target garment, or an image containing only the target garment (excluding any real body or mannequin wearing the target garment).
For the above S102:
(1): the first semantic segmentation result obtained by performing semantic segmentation processing on the first image comprises indication marks of human body parts to which a plurality of pixel points in the first image belong respectively; the second semantic segmentation result obtained by performing semantic segmentation processing on the second image comprises the indication marks of the clothing regions where the plurality of pixel points are located in the second image; wherein, different clothing regions correspond to different human body parts.
The embodiment of the present disclosure provides a specific method for performing semantic segmentation processing on a first image and a second image, respectively, including: and respectively performing semantic segmentation processing on the first image and the second image by using a pre-trained semantic segmentation model to obtain a first semantic segmentation result corresponding to the first image and a second semantic segmentation result corresponding to the second image.
At this time, the semantic segmentation model includes at least one of: convolutional Neural Networks (CNN), and Fully Convolutional Networks (FCN). When performing semantic segmentation processing on the first image and the second image by using a pre-trained semantic segmentation model, taking semantic segmentation processing on the first image as an example, for example, different labels may be respectively set for different parts of a human body; for example, when the resolution of the first image is 1024 × 1024, that is, the first image includes 1024 × 1024 pixels, if the first image is divided into a background, four limbs, a head, and a trunk, and the background, the four limbs, the head, and the trunk may be respectively labeled as 0, 1, 2, and 3, the first semantic segmentation result of the first image obtained by using the pre-trained semantic segmentation model includes 1024 × 1024 labels corresponding to the 1024 × 1024 pixels in the first image. Taking a pixel point in the first image as an example, if the pixel point represents the background, the label of the position corresponding to the pixel point in the first image in the first semantic segmentation result is 0. The process of performing semantic segmentation processing on the second image by using the semantic segmentation model to obtain the second semantic segmentation result is similar to the process of performing semantic segmentation processing on the first image by using the semantic segmentation model to obtain the first semantic segmentation result, and is not repeated here.
It should be noted that, in a scene requiring precision, the first image and the second image may also be divided into more detailed portions, for example, the division of the first image includes: background, limbs, head, torso, hair, etc., the division of the second image includes: collar, sleeve, ornament, pocket, etc., without limitation.
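To make this step concrete, the following is a minimal Python (PyTorch) sketch of the semantic segmentation processing. The tiny network and the four-class label set (0 background, 1 four limbs, 2 head, 3 trunk) are illustrative assumptions rather than the model actually used; any pre-trained parsing model that outputs a per-pixel label map could take its place, and in practice a garment-region model would be applied to the second image.

```python
# Hypothetical sketch: per-pixel part labelling with a pre-trained segmentation model.
# The architecture and the 4-class label set are assumptions for illustration only.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, num_classes, 1),          # per-pixel class logits
        )

    def forward(self, x):
        return self.body(x)                         # (B, num_classes, H, W)

seg_model = TinySegNet().eval()                     # stands in for the pre-trained model
first_image = torch.rand(1, 3, 1024, 1024)          # first image (person)
second_image = torch.rand(1, 3, 1024, 1024)         # second image (garment)

with torch.no_grad():
    # First semantic segmentation result: one label per pixel, i.e. 1024 x 1024 labels.
    first_seg = seg_model(first_image).argmax(dim=1)    # (1, 1024, 1024), values in {0..3}
    # Second semantic segmentation result: garment-region labels (same toy model reused
    # here only for illustration).
    second_seg = seg_model(second_image).argmax(dim=1)
```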
(2): the embodiment of the present disclosure further provides a specific method for respectively performing feature extraction processing on a first image and a second image, including: and respectively carrying out feature extraction processing on the first image and the second image by using a pre-trained feature extraction network to obtain a first feature map of the first image and a second feature map of the second image.
Wherein the feature extraction network comprises at least one of: a convolutional neural network and a Transformer-based neural network. Taking a pre-trained CNN as the feature extraction network as an example, when the resolution of the first image is 1024 × 1024, the CNN may obtain a first feature map corresponding to the first image by convolution, and the first feature map may include, for example, 64 × 64 feature points.
The process of obtaining the second feature map of the second image is similar to the process of obtaining the first feature map of the first image, and is not repeated here.
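A companion sketch of the feature extraction processing under the same assumptions; the small stride-16 CNN below only illustrates how a 1024 × 1024 image can yield a 64 × 64 feature map and is not the network actually used.

```python
# Hypothetical feature extraction network: four stride-2 convolutions reduce a
# 1024 x 1024 input to a 64 x 64 feature map, matching the sizes in the example.
import torch
import torch.nn as nn

class TinyFeatureNet(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        layers, c_in = [], 3
        for _ in range(4):                              # 1024 -> 512 -> 256 -> 128 -> 64
            layers += [nn.Conv2d(c_in, channels, 3, stride=2, padding=1), nn.ReLU()]
            c_in = channels
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)                             # (B, channels, 64, 64)

feature_net = TinyFeatureNet()
first_feature_map = feature_net(torch.rand(1, 3, 1024, 1024))   # first feature map
second_feature_map = feature_net(torch.rand(1, 3, 1024, 1024))  # second feature map
print(first_feature_map.shape)                          # torch.Size([1, 32, 64, 64])
```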
For the above S103:
referring to fig. 2, an embodiment of the present disclosure provides a specific method for performing feature fusion processing on a first feature map and a second feature map based on a first semantic segmentation result and a second semantic segmentation result to obtain a fused feature map, including:
s201: and determining a first feature subgraph corresponding to each first part in the plurality of first parts of the target object based on the first semantic segmentation result and the first feature graph.
For example, for the first image, the target object in the first image may be divided into four limbs, a head, and a torso as the plurality of first parts according to the division of the first image.
Because the CNN performs at least one stage of convolution on the first image to obtain the first feature map, a mapping relation between feature points in the first feature map and pixel points in the first image can be established based on the convolution process; meanwhile, a mapping relation is also formed between the first feature map and the first semantic segmentation image formed by the first semantic segmentation result. Based on the mapping relationship between the first image and the first feature map and the mapping relationship between the first image and the first semantic segmentation image, a corresponding first feature subgraph can be determined for each first part.
Illustratively, referring to fig. 3, an embodiment of the present disclosure provides a specific method for determining, based on a first semantic segmentation result and a first feature map, a first feature subgraph corresponding to each of a plurality of first portions of a target object, including:
s301: determining a target pixel point corresponding to each first part from the first image based on the first semantic segmentation result;
s302: determining a target feature point corresponding to the target pixel point from the first feature map based on the mapping relation between the pixel point in the first image and the feature point in the first feature map;
s303: and determining a first feature subgraph corresponding to each first part based on the target feature points and the first feature graph.
When a plurality of first parts are determined, the target pixel points corresponding to each first part may be determined based on the labels preset when the semantic segmentation is performed on the first image using the semantic segmentation model. Taking one of the first parts as an example, when the first part is the four limbs and the corresponding label is 1, the positions of all labels 1 in the first semantic segmentation result can be determined by querying the labels of the positions corresponding to the pixel points of the first image in the first semantic segmentation result. Since each position in the first semantic segmentation result corresponds to a pixel point in the first image, the corresponding pixel points in the first image, that is, the target pixel points corresponding to the four limbs, can be determined according to the positions whose labels are 1 in the first semantic segmentation result.
Because a mapping relation exists between the first feature map and the first image, a corresponding first feature sub-map when the first part is a limb in the first feature map can be determined according to the target pixel point determined by the first image.
The first feature subgraph reflects only the corresponding first part of the first feature map. For example, in a case where the first feature map includes 64 × 64 feature points and 100 of them correspond to the four limbs, the positions of these 100 feature points in the first feature map are determined, and the 100 feature points are taken as the target feature points corresponding to the target pixel points in the first feature map.
Exemplarily, when the first feature sub-graph includes 64 × 64 feature points, the numerical value of the feature point at the position corresponding to the target feature point in the first feature sub-graph is determined as the numerical value of the corresponding feature point in the first feature graph, and the rest positions are masked by using "0" to obtain the first feature sub-graph corresponding to the four limbs.
The method for determining the first feature subgraphs corresponding to the head and the trunk is similar to the method for determining the first feature subgraph corresponding to the four limbs, and is not repeated here.
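The masking just described can be sketched as follows; mapping the pixel labels onto the feature-map grid by nearest-neighbor downsampling is one assumed realization of the pixel-to-feature-point mapping, not necessarily the one used in the disclosure.

```python
# Hypothetical sketch of extracting a first feature subgraph for one body part:
# feature points whose corresponding pixels carry the part label are kept,
# all other positions are masked with 0.
import torch
import torch.nn.functional as F

def part_feature_subgraph(feature_map: torch.Tensor,   # (B, C, 64, 64)
                          label_map: torch.Tensor,     # (B, 1024, 1024), integer labels
                          part_label: int) -> torch.Tensor:
    # Map pixel labels onto the feature-map grid (assumed nearest-neighbor mapping).
    small = F.interpolate(label_map.unsqueeze(1).float(),
                          size=feature_map.shape[-2:], mode="nearest")
    mask = (small == part_label).float()                # 1 where the part is, else 0
    return feature_map * mask                           # masked feature subgraph

# e.g. label 1 = four limbs in the illustrative label set above
limbs_subgraph = part_feature_subgraph(torch.rand(1, 32, 64, 64),
                                       torch.randint(0, 4, (1, 1024, 1024)),
                                       part_label=1)
```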
Illustratively, assume that the first image is represented as a matrix of pixel points a11, a12, ..., the first semantic segmentation image obtained by performing semantic segmentation processing on the first image is represented as a matrix of labels b11, b12, ... with the same layout, and the first feature map obtained by performing feature extraction processing on the first image is represented as

[ c11 c12 c13 ]
[ c21 c22 c23 ]

where each of the feature points c11, c12, c13, c21, c22, c23 is determined based on its corresponding pixel point(s) in the first image (the corresponding matrices are given as figures in the original and are not reproduced here).

Suppose that, in the first semantic segmentation image, b11 and b12 both characterize the head of the human body. Since b11 and b12 correspond respectively to a11 and a12, which in turn have a mapping relation with c11 and c12, the feature subgraph corresponding to the head of the human body is obtained by retaining c11 and c12 and masking the remaining positions with 0:

[ c11 c12 0 ]
[ 0   0   0 ]
it should be noted that the above example is only to illustrate the principle of obtaining a feature subgraph, and does not set any limit to the image processing method provided in the present embodiment.
Following on from S201 above, the method for performing feature fusion processing on the first feature map and the second feature map to obtain a fused feature map according to the embodiment of the present disclosure further includes:
s202: and determining a second feature subgraph corresponding to each of a plurality of second parts of the target garment based on the second semantic segmentation result and the second feature graph.
The manner of determining the second feature sub-graph is similar to the manner of determining the first feature sub-graph in S201, and is not described herein again.
Here, S201 and S202 are not required to be executed in any particular order.
S203: and performing feature fusion on the first feature subgraph and the second feature subgraph to obtain a fusion feature graph.
The method for performing feature fusion on the first feature subgraph and the second feature subgraph comprises at least one of the following steps:
splicing the first characteristic subgraph and the second characteristic subgraph in a third dimension to obtain a fused characteristic graph; for example, if the dimensions of the first feature sub-graph and the second feature sub-graph are both 64 × 64 × 1, the dimensions of the obtained fused feature graph are 64 × 64 × 2 after the first feature sub-graph and the second feature sub-graph are superimposed.
Splicing the first characteristic subgraph and the second characteristic subgraph in a second dimension to obtain a fusion characteristic graph; for example, if the dimensions of the first feature sub-graph and the second feature sub-graph are both 64 × 64 × 1, the dimensions of the obtained fused feature graph are 64 × 128 × 1 after the first feature sub-graph and the second feature sub-graph are overlapped and spliced.
For example, in the case that the second image is an image only including the ancient-style clothing, the plurality of first portions include the head, but the plurality of second portions do not include the head. In this case, when the first portion is the head, there is no second portion corresponding to it, that is, there is no corresponding second feature subgraph, and the second feature subgraph corresponding to the head may be set as a preset feature map. For example, the feature values of the feature points in the preset feature map may all be set to 0 or all be set to 1. The setting can be made according to actual requirements.
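The part-wise fusion of S203 can be sketched as follows, using channel-dimension concatenation (the third-dimension splicing option above) for the first fusion processing and an all-zero preset subgraph when a garment part has no counterpart. The summation used for the second fusion processing is an assumption, since the disclosure does not fix that operation; splicing along the second dimension would be an equally valid variant.

```python
# Hypothetical sketch of S203: concatenate each first feature subgraph with the
# matched second feature subgraph along the channel dimension, then merge the
# per-part intermediate fusion results (here simply by summation) into one map.
import torch

def fuse(first_subgraphs: dict, second_subgraphs: dict) -> torch.Tensor:
    fused_parts = []
    for part, f_sub in first_subgraphs.items():
        # Preset all-zero subgraph when no garment part matches this body part.
        g_sub = second_subgraphs.get(part, torch.zeros_like(f_sub))
        fused_parts.append(torch.cat([f_sub, g_sub], dim=1))   # first fusion: concat
    return torch.stack(fused_parts, dim=0).sum(dim=0)          # second fusion (assumed)

first_subs = {"limbs": torch.rand(1, 32, 64, 64),
              "head":  torch.rand(1, 32, 64, 64)}
second_subs = {"limbs": torch.rand(1, 32, 64, 64)}             # garment has no "head"
fused_feature_map = fuse(first_subs, second_subs)              # (1, 64, 64, 64)
```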
For the above S104, obtaining the target image based on the fusion feature map includes:
and decoding the fusion characteristic graph by using a pre-trained decoder to obtain a target image.
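A minimal sketch of such a decoder, assuming a stack of transposed convolutions that upsamples the 64 × 64 fused feature map back to image resolution; the actual decoder architecture is not specified by the disclosure.

```python
# Hypothetical decoder: four stride-2 transposed convolutions bring the fused
# 64 x 64 feature map back to a 1024 x 1024 RGB target image.
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    def __init__(self, in_channels: int = 64):
        super().__init__()
        layers, c_in = [], in_channels
        for c_out in (32, 16, 8, 3):                     # 64 -> 128 -> 256 -> 512 -> 1024
            layers += [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1), nn.ReLU()]
            c_in = c_out
        layers[-1] = nn.Sigmoid()                        # pixel values in [0, 1]
        self.body = nn.Sequential(*layers)

    def forward(self, fused):
        return self.body(fused)                          # (B, 3, 1024, 1024)

decoder = TinyDecoder()
target_image = decoder(torch.rand(1, 64, 64, 64))
print(target_image.shape)                                # torch.Size([1, 3, 1024, 1024])
```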
Referring to fig. 4, an embodiment of the present disclosure further provides a specific method for training the feature extraction network and the decoder, including:
s401: acquiring a plurality of first sample images including a second target object and a plurality of second sample images including a second target garment;
s402: performing feature extraction processing on the first sample image and the second sample image by using a feature extraction network to be trained to obtain a first sample feature map of the first sample image and a second sample feature map of the second sample image;
s403: performing semantic segmentation processing on the first sample image and the second sample image to obtain a first sample semantic segmentation result of the first sample image and a second sample semantic segmentation result of the second sample image;
s404: and performing feature fusion processing on the first sample feature map and the second sample feature map based on the first sample semantic segmentation result and the second sample semantic segmentation result to obtain a sample fusion feature map.
Illustratively, when N (N is an integer greater than 1) first sample images S1-SN including a second target object and a second sample image D including a second target garment are used for training a feature extraction network to be trained, the feature extraction network to be trained is used for performing feature extraction processing on the first sample image S1 and the second sample image D to obtain a first sample feature map Sc and a second sample feature map Dc corresponding to the first sample image S1 and the second sample image D, respectively; and performing semantic segmentation processing on the first sample image S1 and the second sample image D to obtain a first sample semantic segmentation result Sl and a second sample semantic segmentation result Dl corresponding to the first sample image S1 and the second sample image D respectively. And performing feature fusion processing on the first sample feature map Sc and the second sample feature map Dc by using the first sample semantic segmentation result Sl and the second sample semantic segmentation result Dl to obtain a sample fusion feature map Gc. The specific process is similar to the method for performing feature fusion processing on the first feature map and the second feature map based on the first semantic segmentation result and the second semantic segmentation result shown in fig. 2 to obtain a fused feature map, and details are not repeated here.
The method for training the feature extraction network to be trained by using the first sample images S2-SN and the second sample image D is similar to the above process, and is not described herein again.
S405: and decoding the sample fusion characteristic graph by using a decoder to be trained to obtain a sample generation image.
For example, the decoder decodes each of the sample fusion feature maps of the N first sample images to obtain sample generation images, indicated as Q1 to QN, corresponding to the N first sample images.
S406: determining a model loss based on the sample generation image and the first sample image; and training the feature extraction network to be trained and the decoder to be trained based on the model loss.
Referring to fig. 5, an embodiment of the present disclosure further provides a method for determining a model loss based on the sample generation image and the first sample image, including:
S501: taking the sample generation image as a new first sample image, and taking the first sample image corresponding to the sample generation image as a new second sample image;
S502: obtaining a new generated image corresponding to the new first sample image based on the new first sample image and the new second sample image by using the feature extraction network to be trained and the decoder to be trained;
Here, the specific manner of acquiring the new generated image is similar to that of acquiring the sample generation image, and is not described herein again.
S503: determining the model loss based on the new generated image and the first sample image corresponding to the sample generation image; the model loss comprises at least one of the following losses: a makeup loss, a style loss, and a face loss.
Illustratively, in the case where the sample generation images are Q1 to QN, if Qi (i ∈ [1, N]) is used as the new first sample image, then the first sample image Si corresponding to the sample generation image Qi is used as the new second sample image, and a new generated image Qnew is obtained based on Qi and Si by using the feature extraction network to be trained and the decoder to be trained.
Model loss is then determined based on this newly generated image Qnew, and Si.
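A sketch of this re-generation step; `generate` is a placeholder standing in for the whole S402 to S405 pipeline (feature extraction, segmentation-guided fusion, decoding), so its body is purely illustrative.

```python
# Hypothetical sketch of S501-S503: the sample generation image Qi becomes the new
# first sample image, the original first sample image Si becomes the new second
# sample image, and the same generator produces a new generated image Qnew.
import torch

def generate(person_image: torch.Tensor, garment_image: torch.Tensor) -> torch.Tensor:
    """Placeholder for the full pipeline: segment, extract features, fuse, decode."""
    return torch.rand_like(person_image)                 # illustrative output only

S_i = torch.rand(1, 3, 1024, 1024)                       # first sample image (person)
D = torch.rand(1, 3, 1024, 1024)                         # second sample image (garment)

Q_i = generate(S_i, D)                                   # sample generation image
Q_new = generate(Q_i, S_i)                               # S502: regenerate with roles swapped
# S503: the model loss is then computed between Q_new and S_i (see the loss sketches below).
```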
In the case where the model loss is determined using the new generated image Qnew and the first sample image Si, the method of determining the model loss includes at least one of:
(1): determining a model loss if the model loss comprises a grooming loss, comprising: the dressing loss is determined based on the degree of difference in dressing between the new generated image and the first sample image corresponding to the sample generated image.
In a possible implementation manner, feature extraction is performed on the new generated image Qnew by using the feature extraction network, so that a feature map Qnewc corresponding to the new generated image Qnew can be determined; feature extraction is performed on the first sample image Si by using the feature extraction network to determine a feature map Sic (i ∈ [1, N]) corresponding to the first sample image Si. Based on the feature map Qnewc corresponding to the new generated image Qnew and the feature map Sic corresponding to the first sample image Si, the Degree of Similarity (DOS) between the new generated image Qnew and the first sample image Si can be determined and denoted as DOSi (i ∈ [1, N]). The method for determining the similarity comprises at least one of the following: the Minkowski distance, the Manhattan distance, the Euclidean distance, and the Chebyshev distance.
Using the similarity DOSi, the makeup difference degree between the new generated image Qnew and the first sample image Si can be determined, and the makeup loss is determined accordingly.
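One hedged reading of this makeup loss, using the Euclidean distance between the two feature maps; any of the listed distances could be substituted, and the normalization is an assumption.

```python
# Hypothetical makeup loss: Euclidean distance between the feature maps of the new
# generated image Qnew and the first sample image Si (one of the listed options;
# Minkowski, Manhattan or Chebyshev distance could be used instead).
import torch

def makeup_loss(feat_qnew: torch.Tensor, feat_si: torch.Tensor) -> torch.Tensor:
    # Larger distance = larger makeup difference degree = larger loss.
    return torch.norm(feat_qnew - feat_si, p=2) / feat_qnew.numel()

loss_makeup = makeup_loss(torch.rand(1, 32, 64, 64), torch.rand(1, 32, 64, 64))
```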
(2): in the case where the model penalty comprises a style penalty, determining the model penalty comprises:
determining first style information of the new generated image based on the new generated image;
generating a first sample image corresponding to the image based on the sample, and determining second style information of the first sample image;
based on the first style information and the second style information, a style loss is determined.
The style information includes, for example, a Gray-Scale Value (GSV) of the image. When the style information is the gray value of the image, the gray value corresponding to the new generated image Qnew is determined as the first style information, and the gray value corresponding to the first sample image Si corresponding to the sample generation image is determined as the second style information. Using the first style information and the second style information, the style loss may be determined; for example, the style loss may be determined based on the difference between the first style information and the second style information.
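A sketch of the style loss with the gray-scale value as the style information, as in the example above; the luma weights used to compute the gray value are a common convention and an assumption here.

```python
# Hypothetical style loss: compare mean gray-scale values of Qnew and Si
# (gray value taken as the style information, per the example above).
import torch

def gray_value(image: torch.Tensor) -> torch.Tensor:      # image: (B, 3, H, W) in [0, 1]
    r, g, b = image[:, 0], image[:, 1], image[:, 2]
    return (0.299 * r + 0.587 * g + 0.114 * b).mean()      # assumed standard luma weights

def style_loss(q_new: torch.Tensor, s_i: torch.Tensor) -> torch.Tensor:
    return (gray_value(q_new) - gray_value(s_i)).abs()     # difference of style information

loss_style = style_loss(torch.rand(1, 3, 1024, 1024), torch.rand(1, 3, 1024, 1024))
```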
(3): in the event that the model loss comprises a face loss, determining the model loss comprising:
and determining the face loss based on the human face difference degree between the new generated image and the first sample image corresponding to the sample generation image.
The face difference degree can be used for measuring the similarity of face features. The method for determining the face difference degree comprises at least one of the following: the Euclidean distance and the cosine distance. Based on the face difference degree between the new generated image Qnew and the first sample image Si, the face loss can be determined.
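A sketch of the face loss using the cosine distance between face feature vectors; the face-feature extractor itself is not specified by the disclosure, so a pooled-image placeholder is used below.

```python
# Hypothetical face loss: cosine distance between face embeddings of Qnew and Si.
# `face_embed` is only a placeholder for a real face-feature extractor.
import torch
import torch.nn.functional as F

def face_embed(image: torch.Tensor) -> torch.Tensor:
    # Placeholder: pooled image statistics flattened into a vector.
    return F.adaptive_avg_pool2d(image, 8).flatten(1)        # (B, 3*8*8)

def face_loss(q_new: torch.Tensor, s_i: torch.Tensor) -> torch.Tensor:
    e1, e2 = face_embed(q_new), face_embed(s_i)
    return 1.0 - F.cosine_similarity(e1, e2, dim=1).mean()   # cosine distance

loss_face = face_loss(torch.rand(1, 3, 1024, 1024), torch.rand(1, 3, 1024, 1024))
```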
In connection with the above step S406, the method for training the feature extraction network and the decoder according to the embodiment of the present disclosure further includes:
s407: and obtaining the trained feature extraction network and the trained decoder through multi-round training of the feature extraction network to be trained and the decoder to be trained.
Through the process, the trained feature extraction network and the decoder can be obtained.
In addition, in order to make the generated image have a more realistic display effect, in another possible implementation, when training the feature extraction network and the decoder, the method further includes:
carrying out discrimination processing on the sample generated image by using a discriminator to obtain a discrimination result of whether the sample generated image is a generated image; and performing countermeasure training on the discriminator and a generator consisting of the feature extraction network to be trained and the decoder to be trained on the basis of the discrimination result.
The discriminator is, for example, a Markovian discriminator (PatchGAN). The discriminator is used to judge whether each of the sample generated images Q1 to QN is a generated image, that is, to judge the authenticity of each sample generated image; through this process, the quality of the sample generated images is continuously improved, so that they become closer to images obtained by real shooting.
For example, the discrimination result may be represented as 1 or 0, where "1" indicates that the sample generated image is judged to be a generated image and "0" indicates that it is judged not to be a generated image.
When the discriminator is adjusted based on the discrimination results of the sample generated images Q1 to QN, the parameters of the discriminator are optimized in the direction that produces more discrimination results of "1" for the sample generated images; that is, the discriminator becomes more accurate at determining whether a sample generated image is a generated image.
When the generator composed of the feature extraction network to be trained and the decoder to be trained is adjusted based on the discrimination results of the sample generated images Q1 to QN, the parameters of the generator are optimized in the direction that produces more discrimination results of "0"; that is, the generator produces images that are closer to images obtained by real shooting.
Through the adversarial training of the discriminator and the generator, the discrimination capability of the discriminator and the realism of the sample generated images produced by the generator can be improved at the same time, which helps to obtain a more realistic target image in use.
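The following is a minimal adversarial-training sketch under assumed components (a `generator` formed by the feature extraction network and decoder, a PatchGAN-style `discriminator`, and their optimizers); it follows the labeling convention above, where "1" means "generated image" and "0" means "not a generated image":

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def adversarial_step(generator, discriminator, g_opt, d_opt, first_sample, second_sample):
    # Sample generated image produced by the generator (feature extraction network + decoder).
    fake = generator(first_sample, second_sample)

    # Discriminator update: real images should be judged "0", generated images "1".
    d_real = discriminator(first_sample)
    d_fake = discriminator(fake.detach())
    d_loss = bce(d_real, torch.zeros_like(d_real)) + bce(d_fake, torch.ones_like(d_fake))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: try to make the discriminator judge the generated image as "0".
    g_adv = bce(discriminator(fake), torch.zeros_like(d_fake))
    g_opt.zero_grad()
    g_adv.backward()
    g_opt.step()
    return d_loss.item(), g_adv.item()
```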
It will be understood by those skilled in the art that, in the method of the present disclosure, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, an image processing apparatus corresponding to the image processing method is also provided in the embodiments of the present disclosure, and since the principle of the apparatus in the embodiments of the present disclosure for solving the problem is similar to the image processing method described above in the embodiments of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not described again.
Referring to fig. 6, a schematic diagram of an image processing apparatus provided in an embodiment of the present disclosure is shown. The apparatus includes: an acquisition module 61, a processing module 62, a feature fusion module 63 and a determination module 64; wherein:
an acquisition module 61 for acquiring a first image including a first target object and a second image including a first target garment;
a processing module 62, configured to perform semantic segmentation processing on the first image and the second image respectively to obtain a first semantic segmentation result of the first image and a second semantic segmentation result of the second image; respectively performing feature extraction processing on the first image and the second image to obtain a first feature map of the first image and a second feature map of the second image;
a feature fusion module 63, configured to perform feature fusion processing on the first feature map and the second feature map based on the first semantic segmentation result and the second semantic segmentation result to obtain a fusion feature map;
and the determining module 64 is configured to obtain a target image based on the fusion feature map.
In an optional implementation manner, when performing feature fusion processing on the first feature map and the second feature map based on the first semantic segmentation result and the second semantic segmentation result to obtain a fused feature map, the feature fusion module 63 is configured to:
determining a first feature subgraph corresponding to each first part in a plurality of first parts of the target object based on the first semantic segmentation result and the first feature graph;
determining a second feature subgraph corresponding to each of a plurality of second parts of the target garment based on the second semantic segmentation result and the second feature graph;
and performing feature fusion on the first feature sub-graph and the second feature sub-graph to obtain the fusion feature graph.
In an optional embodiment, the feature fusion module 63, when determining, based on the first semantic segmentation result and the first feature map, a first feature subgraph corresponding to each of a plurality of first parts of the target object, is configured to:
determining a target pixel point corresponding to each first part from the first image based on the first semantic segmentation result;
determining a target feature point corresponding to the target pixel point from the first feature map based on the mapping relation between the pixel point in the first image and the feature point in the first feature map;
and determining a first feature subgraph corresponding to each first part based on the target feature points and the first feature graph.
In an optional embodiment, when performing feature fusion on the first feature sub-graph and the second feature sub-graph to obtain the fused feature graph, the feature fusion module 63 is configured to:
for each first part, determining a target second part matched with each first part from a plurality of second parts;
performing first fusion processing on the first feature subgraph corresponding to the first part and the second feature subgraph corresponding to the target second part to obtain an intermediate fusion feature graph corresponding to each first part;
and performing second fusion processing on the intermediate fusion feature maps respectively corresponding to the plurality of first parts to obtain the fusion feature maps.
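For illustration, a minimal sketch of this mask-based fusion is given below; the tensor shapes, the dictionaries of per-part masks, the part-matching map, and the use of element-wise addition and summation for the first and second fusion processing are all assumptions of the example:

```python
import torch

def fuse_features(first_feat, second_feat, first_masks, second_masks, part_match):
    # first_feat / second_feat: feature maps of shape [C, H, W].
    # first_masks / second_masks: dicts mapping part names to binary masks [H, W]
    # (assumed already resized to the feature-map resolution).
    # part_match: dict mapping each first part to its matching target second part.
    intermediate = []
    for part, mask in first_masks.items():
        # First feature subgraph: keep only the feature points of this first part.
        first_sub = first_feat * mask.unsqueeze(0)
        # Second feature subgraph of the matched second part of the garment.
        second_sub = second_feat * second_masks[part_match[part]].unsqueeze(0)
        # First fusion processing: here simply element-wise addition of the two subgraphs.
        intermediate.append(first_sub + second_sub)
    # Second fusion processing: combine the per-part intermediate fusion feature maps.
    return torch.stack(intermediate, dim=0).sum(dim=0)
```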
In an alternative embodiment, the determining module 64, when obtaining the target image based on the fused feature map, is configured to:
and decoding the fusion feature graph to obtain the target image.
In an alternative embodiment, the processing module 62, when performing semantic segmentation processing on the first image and the second image respectively, is configured to:
and respectively performing semantic segmentation processing on the first image and the second image by using a pre-trained semantic segmentation model to obtain a first semantic segmentation result and a second semantic segmentation result.
In an optional implementation, when performing the feature extraction processing on the first image and the second image respectively, the processing module 62 is configured to:
respectively performing feature extraction processing on the first image and the second image by using a pre-trained feature extraction network to obtain a first feature map of the first image and a second feature map of the second image;
in an alternative embodiment, the determining module 64, when obtaining the target image based on the fused feature map, is configured to:
and decoding the fusion characteristic graph by using a pre-trained decoder to obtain the target image.
In an optional embodiment, the apparatus further comprises: a training module 65, configured to train the feature extraction network and the decoder in the following manner:
acquiring a plurality of first sample images including a second target object and a plurality of second sample images including a second target garment;
performing feature extraction processing on the first sample image and the second sample image by using a feature extraction network to be trained to obtain a first sample feature map of the first sample image and a second sample feature map of the second sample image;
performing semantic segmentation processing on the first sample image and the second sample image to obtain a first sample semantic segmentation result of the first sample image and a second sample semantic segmentation result of the second sample image;
performing feature fusion processing on the first sample feature map and the second sample feature map based on the first sample semantic segmentation result and the second sample semantic segmentation result to obtain a sample fusion feature map;
decoding the sample fusion characteristic graph by using a decoder to be trained to obtain a sample generation image;
determining a model loss based on the sample generated image and the first sample image; training the feature extraction network to be trained and the decoder to be trained based on the model loss;
and obtaining the trained feature extraction network and the trained decoder through multi-round training of the feature extraction network to be trained and the decoder to be trained.
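As an illustration of the training flow just described, here is a minimal sketch; the helper names (seg_model, fuse_fn, loss_fn) and the single-optimizer setup are assumptions made for the example, not the disclosed implementation:

```python
def train_round(feat_net, decoder, seg_model, fuse_fn, loss_fn, optimizer, loader):
    # One round of training over pairs of (first sample image, second sample image).
    for first_sample, second_sample in loader:
        feat_1 = feat_net(first_sample)       # first sample feature map
        feat_2 = feat_net(second_sample)      # second sample feature map
        seg_1 = seg_model(first_sample)       # first sample semantic segmentation result
        seg_2 = seg_model(second_sample)      # second sample semantic segmentation result
        fused = fuse_fn(feat_1, feat_2, seg_1, seg_2)   # sample fusion feature map
        generated = decoder(fused)            # sample generated image
        loss = loss_fn(generated, first_sample)         # model loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```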
In an alternative embodiment, the training module 65, when determining a model loss based on the sample generated image and the first sample image, is configured to:
taking the sample generation image as a new first sample image, and taking a first sample image corresponding to the sample generation image as a new second sample image;
obtaining a new generated image of the new first sample image based on the new first sample image and the new second sample image by using the feature extraction network to be trained and the decoder to be trained;
determining the model loss based on the new generated image and a first sample image corresponding to the sample generated image; the model loss comprises at least one of: a dressing loss, a style loss, and a face loss.
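A brief sketch of this re-generation step follows, reusing the assumed helper names from the previous sketch; the loss terms applied to the result are the dressing, style and face losses described above:

```python
def regenerate_for_loss(feat_net, decoder, seg_model, fuse_fn, generated, first_sample):
    # The sample generated image becomes the new first sample image, and the original
    # first sample image becomes the new second sample image.
    new_first, new_second = generated, first_sample
    fused = fuse_fn(feat_net(new_first), feat_net(new_second),
                    seg_model(new_first), seg_model(new_second))
    new_generated = decoder(fused)
    # new_generated is compared against first_sample to compute the model loss.
    return new_generated
```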
In an optional embodiment, in the case where the model loss includes a dressing loss, the training module 65, when determining the model loss based on the new generated image and the first sample image corresponding to the sample generated image, is configured to:
determining the dressing loss based on a dressing difference between the new generated image and a first sample image corresponding to the sample generated image.
In an alternative embodiment, in a case that the model loss includes a style loss, the training module 65 is configured to, when determining the model loss based on the new generated image and a first sample image corresponding to the sample generated image:
determining first style information of the new generated image based on the new generated image;
determining second style information of the first sample image based on the first sample image corresponding to the sample generated image;
determining the style loss based on the first style information and the second style information.
In an alternative embodiment, in a case that the model loss includes a face loss, the training module 65 is configured to, when determining the model loss based on the new generated image and a first sample image corresponding to the sample generated image:
and determining the face loss based on the degree of face difference between the new generated image and a first sample image corresponding to the sample generated image.
In an alternative embodiment, the training module 65, when training the feature extraction network and the decoder, is further configured to:
carrying out discrimination processing on the sample generation image by using a discriminator to obtain a discrimination result of whether the sample generation image is a generation image;
and performing adversarial training, based on the discrimination result, on the discriminator and a generator composed of the feature extraction network to be trained and the decoder to be trained.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
An embodiment of the present disclosure further provides a computer device, as shown in fig. 7, which is a schematic structural diagram of the computer device provided in the embodiment of the present disclosure, and includes:
a processor 71 and a memory 72; the memory 72 stores machine-readable instructions executable by the processor 71, the processor 71 being configured to execute the machine-readable instructions stored in the memory 72, the processor 71 performing the following steps when the machine-readable instructions are executed by the processor 71:
acquiring a first image comprising a first target object and a second image comprising a first target garment;
performing semantic segmentation processing on the first image and the second image respectively to obtain a first semantic segmentation result of the first image and a second semantic segmentation result of the second image; respectively performing feature extraction processing on the first image and the second image to obtain a first feature map of the first image and a second feature map of the second image;
performing feature fusion processing on the first feature map and the second feature map based on the first semantic segmentation result and the second semantic segmentation result to obtain a fusion feature map;
and obtaining a target image based on the fusion feature map.
The memory 72 includes an internal memory 721 and an external memory 722; the internal memory 721 temporarily stores operation data of the processor 71 and data exchanged with the external memory 722 (such as a hard disk), and the processor 71 exchanges data with the external memory 722 through the internal memory 721.
For the specific execution process of the instruction, reference may be made to the steps of the image processing method described in the embodiments of the present disclosure, and details are not described here.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the image processing method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The computer program product of the image processing method provided in the embodiments of the present disclosure includes a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute steps of the image processing method described in the above method embodiments, which may be referred to specifically for the above method embodiments, and are not described herein again.
The embodiments of the present disclosure also provide a computer program which, when executed by a processor, implements any one of the methods of the foregoing embodiments. The computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are merely specific embodiments of the present disclosure, used to illustrate the technical solutions of the present disclosure rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the technical field may still modify the technical solutions described in the foregoing embodiments, or easily conceive of changes, or make equivalent replacements of some of the technical features within the technical scope of the present disclosure; such modifications, changes or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present disclosure, and shall all be covered within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (16)

1. An image processing method, comprising:
acquiring a first image comprising a first target object and a second image comprising a first target garment;
performing semantic segmentation processing on the first image and the second image respectively to obtain a first semantic segmentation result of the first image and a second semantic segmentation result of the second image; respectively performing feature extraction processing on the first image and the second image to obtain a first feature map of the first image and a second feature map of the second image;
performing feature fusion processing on the first feature map and the second feature map based on the first semantic segmentation result and the second semantic segmentation result to obtain a fusion feature map;
and obtaining a target image based on the fusion feature map.
2. The image processing method according to claim 1, wherein performing feature fusion processing on the first feature map and the second feature map based on the first semantic segmentation result and the second semantic segmentation result to obtain a fused feature map comprises:
determining a first feature subgraph corresponding to each first part in a plurality of first parts of the target object based on the first semantic segmentation result and the first feature graph;
determining a second feature subgraph corresponding to each of a plurality of second parts of the target garment based on the second semantic segmentation result and the second feature graph;
and performing feature fusion on the first feature sub-graph and the second feature sub-graph to obtain the fusion feature graph.
3. The image processing method according to claim 2, wherein determining a first feature sub-graph corresponding to each of a plurality of first portions of the target object based on the first semantic segmentation result and the first feature graph comprises:
determining a target pixel point corresponding to each first part from the first image based on the first semantic segmentation result;
determining a target feature point corresponding to the target pixel point from the first feature map based on the mapping relation between the pixel point in the first image and the feature point in the first feature map;
and determining a first feature subgraph corresponding to each first part based on the target feature points and the first feature graph.
4. The image processing method according to claim 2, wherein performing feature fusion on the first feature sub-graph and the second feature sub-graph to obtain the fused feature graph comprises:
for each first part, determining a target second part matched with each first part from a plurality of second parts;
performing first fusion processing on the first feature subgraph corresponding to the first part and the second feature subgraph corresponding to the target second part to obtain an intermediate fusion feature graph corresponding to each first part;
and performing second fusion processing on the intermediate fusion feature maps respectively corresponding to the plurality of first parts to obtain the fusion feature maps.
5. The image processing method according to claim 1, wherein obtaining a target image based on the fused feature map comprises:
and decoding the fusion feature graph to obtain the target image.
6. The image processing method according to any one of claims 1 to 5, wherein performing semantic segmentation processing on the first image and the second image, respectively, comprises:
and respectively performing semantic segmentation processing on the first image and the second image by using a pre-trained semantic segmentation model to obtain a first semantic segmentation result and a second semantic segmentation result.
7. The image processing method according to any one of claims 1 to 5, wherein the performing feature extraction processing on the first image and the second image, respectively, includes:
respectively performing feature extraction processing on the first image and the second image by using a pre-trained feature extraction network to obtain a first feature map of the first image and a second feature map of the second image;
obtaining a target image based on the fusion feature map comprises:
and decoding the fusion characteristic graph by using a pre-trained decoder to obtain the target image.
8. The image processing method of claim 7, wherein training the feature extraction network and the decoder comprises:
acquiring a plurality of first sample images including a second target object and a plurality of second sample images including a second target garment;
performing feature extraction processing on the first sample image and the second sample image by using a feature extraction network to be trained to obtain a first sample feature map of the first sample image and a second sample feature map of the second sample image;
performing semantic segmentation processing on the first sample image and the second sample image to obtain a first sample semantic segmentation result of the first sample image and a second sample semantic segmentation result of the second sample image;
performing feature fusion processing on the first sample feature map and the second sample feature map based on the first sample semantic segmentation result and the second sample semantic segmentation result to obtain a sample fusion feature map;
decoding the sample fusion characteristic graph by using a decoder to be trained to obtain a sample generation image;
determining a model loss based on the sample generated image and the first sample image; training the feature extraction network to be trained and the decoder to be trained based on the model loss;
and obtaining the trained feature extraction network and the trained decoder through multi-round training of the feature extraction network to be trained and the decoder to be trained.
9. The image processing method of claim 8, wherein determining a model loss based on the sample generated image and the first sample image comprises:
taking the sample generation image as a new first sample image, and taking a first sample image corresponding to the sample generation image as a new second sample image;
obtaining a new generated image of the new first sample image based on the new first sample image and the new second sample image by using the feature extraction network to be trained and the decoder to be trained;
determining the model loss based on the new generated image and a first sample image corresponding to the sample generated image; the model loss comprises at least one of: a dressing loss, a style loss, and a face loss.
10. The image processing method according to claim 9, wherein, in the case where the model loss comprises a dressing loss, the determining the model loss based on the new generated image and a first sample image corresponding to the sample generated image comprises:
determining the dressing loss based on a dressing difference between the new generated image and a first sample image corresponding to the sample generated image.
11. The image processing method according to claim 9, wherein, in a case where the model loss includes a style loss, the determining the model loss based on the new generated image and a first sample image corresponding to the sample generated image includes:
determining first style information of the new generated image based on the new generated image;
determining second style information of the first sample image based on the first sample image corresponding to the sample generated image;
determining the style loss based on the first style information and the second style information.
12. The image processing method according to claim 9, wherein, in a case where the model loss includes a face loss, the determining the model loss based on the new generated image and a first sample image corresponding to the sample generated image includes:
and determining the face loss based on the degree of face difference between the new generated image and a first sample image corresponding to the sample generated image.
13. The image processing method of claim 8, wherein training the feature extraction network and the decoder further comprises:
carrying out discrimination processing on the sample generation image by using a discriminator to obtain a discrimination result of whether the sample generation image is a generation image;
and performing adversarial training, based on the discrimination result, on a discriminator and a generator composed of the feature extraction network to be trained and the decoder to be trained.
14. An image processing apparatus characterized by comprising:
an acquisition module for acquiring a first image comprising a first target object and a second image comprising a first target garment;
the processing module is used for performing semantic segmentation processing on the first image and the second image respectively to obtain a first semantic segmentation result of the first image and a second semantic segmentation result of the second image; respectively performing feature extraction processing on the first image and the second image to obtain a first feature map of the first image and a second feature map of the second image;
the feature fusion module is used for performing feature fusion processing on the first feature map and the second feature map based on the first semantic segmentation result and the second semantic segmentation result to obtain a fusion feature map;
and the determining module is used for obtaining a target image based on the fusion characteristic graph.
15. A computer device, comprising: a processor, a memory storing machine-readable instructions executable by the processor, the processor for executing the machine-readable instructions stored in the memory, the processor performing the steps of the image processing method of any one of claims 1 to 13 when the machine-readable instructions are executed by the processor.
16. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when executed by a computer device, performs the steps of the image processing method according to any one of claims 1 to 13.
CN202011042591.0A 2020-09-28 2020-09-28 Image processing method, device, computer equipment and storage medium Active CN112184886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011042591.0A CN112184886B (en) 2020-09-28 2020-09-28 Image processing method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011042591.0A CN112184886B (en) 2020-09-28 2020-09-28 Image processing method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112184886A true CN112184886A (en) 2021-01-05
CN112184886B CN112184886B (en) 2024-04-09

Family

ID=73945396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011042591.0A Active CN112184886B (en) 2020-09-28 2020-09-28 Image processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112184886B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377494A (en) * 2018-09-14 2019-02-22 阿里巴巴集团控股有限公司 A kind of semantic segmentation method and apparatus for image
CN111028249A (en) * 2019-12-23 2020-04-17 杭州知衣科技有限公司 Garment image segmentation method based on deep learning
CN111339918A (en) * 2020-02-24 2020-06-26 深圳市商汤科技有限公司 Image processing method, image processing device, computer equipment and storage medium
CN111696110A (en) * 2020-06-04 2020-09-22 山东大学 Scene segmentation method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG, Qian; LIU, Li; FU, Xiaodong; LIU, Lijun; HUANG, Qingsong: "Clothing Image Retrieval Combining Label Optimization and Semantic Segmentation", Journal of Computer-Aided Design & Computer Graphics, no. 09 *
BAI, Meili; WAN, Taoruan; TANG, Wen; ZHU, Xinjuan; XUE, Tao: "An Improved Self-Supervised Network Learning Method for Clothing Parsing", Basic Sciences Journal of Textile Universities, no. 04 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327250A (en) * 2021-05-28 2021-08-31 深圳前海微众银行股份有限公司 Multispectral image segmentation method and device, electronic device and storage medium
CN116029951A (en) * 2022-05-27 2023-04-28 荣耀终端有限公司 Image processing method and electronic equipment
WO2024021134A1 (en) * 2022-07-25 2024-02-01 首都师范大学 Image processing method and apparatus, computer device and storage medium

Also Published As

Publication number Publication date
CN112184886B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
Chen et al. Fsrnet: End-to-end learning face super-resolution with facial priors
CN112184886B (en) Image processing method, device, computer equipment and storage medium
TWI742690B (en) Method and apparatus for detecting a human body, computer device, and storage medium
CN109636831B (en) Method for estimating three-dimensional human body posture and hand information
CN111243093B (en) Three-dimensional face grid generation method, device, equipment and storage medium
CN111028330B (en) Three-dimensional expression base generation method, device, equipment and storage medium
CN111784821B (en) Three-dimensional model generation method and device, computer equipment and storage medium
CN106161939B (en) Photo shooting method and terminal
CN108921926A (en) A kind of end-to-end three-dimensional facial reconstruction method based on single image
CN113012282B (en) Three-dimensional human body reconstruction method, device, equipment and storage medium
CN108305223B (en) Image background blurring processing method and device
CN113657357B (en) Image processing method, image processing device, electronic equipment and storage medium
CN111339918A (en) Image processing method, image processing device, computer equipment and storage medium
CN111652987A (en) Method and device for generating AR group photo image
CN114863037A (en) Single-mobile-phone-based human body three-dimensional modeling data acquisition and reconstruction method and system
CN113822982A (en) Human body three-dimensional model construction method and device, electronic equipment and storage medium
CN112802081A (en) Depth detection method and device, electronic equipment and storage medium
CN113658242A (en) Depth estimation method, depth estimation device, computer equipment and storage medium
CN112308977A (en) Video processing method, video processing apparatus, and storage medium
CN111640167A (en) AR group photo method, AR group photo device, computer equipment and storage medium
CN114693853A (en) Object rendering method and device, electronic equipment and storage medium
KR20230098313A (en) Facial reconstruction methods, devices, computer devices and storage media
CN111639975A (en) Information pushing method and device
CN113034349B (en) Image processing method, device, electronic equipment and storage medium
CN114612614A (en) Human body model reconstruction method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant