CN118334161B

CN118334161B - Photo picture generation method and device, electronic equipment and storage medium

Info

Publication number: CN118334161B
Application number: CN202410741616.8A
Authority: CN
Inventors: 唐宇
Original assignee: Haima Cloud Tianjin Information Technology Co Ltd
Current assignee: Haima Cloud Tianjin Information Technology Co Ltd
Priority date: 2024-06-11
Filing date: 2024-06-11
Publication date: 2024-08-30
Anticipated expiration: 2044-06-11
Also published as: CN118334161A

Abstract

The application provides a method and a device for generating a portrait picture, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a face feature map, a highlight feature map and a shadow feature map, wherein the face feature map, the highlight feature map and the shadow feature map are input by a user or are generated based on characters or a first picture; and inputting the face feature map, the highlight feature map and the shadow feature map into the modified controlnet model to obtain a second picture, wherein the modified controlnet model comprises a feature input layer, a coding layer, a middle layer and a decoding layer, the feature input layer is of a newly added layer structure and is used for inputting the integration features of the face contour features, the highlight features and the shadow features, the feature input layer is externally connected with a first zero convolution layer, the middle layer is externally connected with a second zero convolution layer, the second picture is a target portrait picture, the shadow effect of the portrait picture can be improved, and the whole scheme is low in cost and high in efficiency, quality and generalization degree.

Description

Photo picture generation method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of image processing, and in particular, to a method and apparatus for generating a pictorial picture, an electronic device, and a storage medium.

Background

Taking individual portraits has been an inherent need for people to pursue aesthetics and self-expression. In the past, people have been accustomed to shooting portraits in a photo studio, but with the rapid development of AI technology, the personal portrait portraits generated by the diffion model of AIGC (AI GENERATED Content, i.e., artificial intelligence generated Content) have reached a very natural and realistic high-quality level.

Compared with the traditional photography house, the photo generated by AIGC has a plurality of advantages, including low economic cost, convenient and quick manufacture and the like. Meanwhile, the level of the photo composition and the scenery is almost the same as that of a photo studio. However, photography houses have plentiful professional lighting facilities such as light supplementing lamps, reflectors, etc., so that the effect of light and shadow of the photographed pictures is often more excellent. Therefore, how to improve the effect of the photo and the shadow of the true picture generated by AIGC by the technical means is an important subject for better landing and adapting to the market demand. The current AIGC means for improving the shadow effect mainly comprises the following steps: 1. for large models finetuning (fine tuning); 2. training a special shadow effect lora model; 3. and accessing a separate shadow beautifying processing algorithm module for processing. However, these methods have drawbacks to varying degrees. The method 1 needs high calculation force support, the training difficulty is high, and the training effect is not easy to guarantee; methods 2 and 3 tend to be tailored to specific features, and have the problem of poor generalization ability, i.e., the light and shadow effects may be good for some pictures, but poor or even worse for other pictures.

Therefore, how to provide a scheme for generating a photo picture with low cost, high efficiency, high quality and high generalization degree so as to improve the light and shadow effect of the photo picture is a technical problem to be solved.

Disclosure of Invention

Aiming at the technical problems existing in the prior art, the embodiment of the application provides a method and a device for generating a photo picture, electronic equipment and a storage medium.

In a first aspect, an embodiment of the present application provides a method for generating a pictorial picture, including:

Acquiring a face feature map, a highlight feature map and a shadow feature map, wherein the face feature map, the highlight feature map and the shadow feature map are input by a user or are generated based on characters or a first picture;

And inputting the face feature map, the highlight feature map and the shadow feature map into a modified controlnet model to obtain a second picture, wherein the modified controlnet model comprises a feature input layer, a coding layer, a middle layer and a decoding layer, the feature input layer is of a newly added layer structure and is used for inputting the integration features of the face contour features, the highlight features and the shadow features, the feature input layer is externally connected with a first zero convolution layer, the middle layer is externally connected with a second zero convolution layer, and the second picture is a target photo picture.

In a second aspect, an embodiment of the present application further provides a device for generating a pictorial picture, including:

The first generation unit is used for acquiring a face feature map, a highlight feature map and a shadow feature map, wherein the face feature map, the highlight feature map and the shadow feature map are input by a user or are generated based on characters or a first picture;

The second generating unit is configured to input the face feature map, the highlight feature map and the shadow feature map into the modified controlnet model to obtain a second picture, where the modified controlnet model includes a feature input layer, an encoding layer, an intermediate layer and a decoding layer, the feature input layer is a newly added layer structure and is used for inputting integration features of face contour features, highlight features and shadow features, the feature input layer is externally connected with a first zero convolution layer, the intermediate layer is externally connected with a second zero convolution layer, and the second picture is a target pictorial picture.

In a third aspect, an embodiment of the present application further provides a computer readable storage medium, where a computer program is stored, the computer program when executed by a processor performs the steps of the method for generating a pictorial picture according to the first aspect.

In a fourth aspect, an embodiment of the present application further provides an electronic device, including: a processor, a storage medium, and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating over the bus when the electronic device is running, the processor executing the machine-readable instructions to perform the steps of the pictorial picture generation method as described in the first aspect.

In the scheme, the controlnet model after modification is input based on the face feature image, the highlight feature image and the shadow feature image, so that a photo picture with high-quality shadow effect can be generated. In the specific implementation, the face feature map, the highlight feature map and the shadow feature map which are automatically generated by the photo picture can be used as the input of the controlnet model after modification so as to improve the photo effect of the generated photo picture, and the face feature map, the highlight feature map and the shadow feature map which are manufactured by professionals can be used as the input of the controlnet model after modification so as to generate the photo picture with special photo effect, thereby realizing a scheme for improving the photo effect of the photo picture with low cost, high efficiency, high quality and high generalization degree.

Drawings

Fig. 1 is a schematic flow chart of an embodiment of a method for generating a pictorial picture according to an embodiment of the present application;

FIG. 2 is an architecture diagram of a modified controlnet model;

fig. 3 is a schematic structural diagram of an embodiment of a device for generating a realistic picture according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described with reference to the accompanying drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for the purpose of illustration and description only and are not intended to limit the scope of the present application. In addition, it should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this disclosure, illustrates operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be implemented out of order and that steps without logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to or removed from the flow diagrams by those skilled in the art under the direction of the present disclosure.

In addition, the described embodiments are only some, but not all, embodiments of the application. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.

It should be noted that the term "comprising" will be used in embodiments of the application to indicate the presence of the features stated hereafter, but not to exclude the addition of other features.

Referring to fig. 1, a flow chart of a method for generating a photo picture according to an embodiment of the present application is shown, where the method includes:

S10, acquiring a face feature map, a highlight feature map and a shadow feature map, wherein the face feature map, the highlight feature map and the shadow feature map are input by a user or are generated based on characters or a first picture;

S11, inputting the face feature map, the highlight feature map and the shadow feature map into a modified controlnet model to obtain a second picture, wherein the modified controlnet model comprises a feature input layer, a coding layer, an intermediate layer and a decoding layer, the feature input layer is of a newly added layer structure and is used for inputting integration features of face contour features, highlight features and shadow features, the feature input layer is externally connected with a first zero convolution layer, the intermediate layer is externally connected with a second zero convolution layer, and the second picture is a target pictorial picture.

According to the method for generating the portrait picture, which is provided by the embodiment of the application, the portrait picture with the high-quality shadow effect can be generated by inputting the modified controlnet model based on the face feature map, the highlight feature map and the shadow feature map.

Fig. 2 is a schematic diagram of a modified controlnet model (in which time steps are omitted from the schematic diagram) and the prompt term prompt is omitted, and referring to fig. 2, the feature input layer may include a first convolution layer Conv (GN/SILU), a second convolution layer Conv (GN/SILU), and a stitching layer Concat, where the first convolution layer Conv (GN/SILU) and the second convolution layer Conv (GN/SILU) are connected to the stitching layer Concat, the stitching layer Concat is connected to the first zero convolution layer, the input of the first convolution layer Conv (GN/SILU) is a shadow feature map cl, the input of the second convolution layer Conv (GN/SILU) is a highlight feature map ch, the input of the stitching layer Concat is the output of the first convolution layer Conv (GN/SILU), the output of the second convolution layer Conv (GN/SILU) and the face feature map, and the stitching layer Concat is used to make a graph for the first convolution layer Conv (GN/3492) and the output of the second convolution layer (GN/SILU) is the first convolution layer 35/3742, and the output of the second convolution layer is the face feature map (GN/SILU):

The coding layers may comprise a first coding layer and a second coding layer, and the modified controlnet model may further comprise an input layer input and an output layer output;

The input layer input is connected with the first coding layer, the first coding layer is connected with the Middle layer Middle Block, the Middle layer Middle Block is connected with the decoding layer, the decoding layer is connected with the output layer output, the output of the input layer input is input into the first coding layer, the output of the first coding layer is input into the Middle layer Middle Block, the output of the Middle layer Middle Block is input into the decoding layer, and the output of the decoding layer is input into the output layer;

the second coding layer is connected with the second zero convolution layer, the result obtained by adding the output of the input layer input and the output of the first zero convolution layer is input into the second coding layer, the output of the second coding layer is input into the second zero convolution layer, and the output of the second zero convolution layer is input into the Middle layer Middle Block.

In this embodiment, it should be noted that, after cl and ch, the first convolution layer Conv (GN/SILU) and the second convolution layer Conv (GN/SILU) are respectively connected, so that the influence of the highlight and shadow may be more toward the region rather than the single pixel value. And splicing (concat) the three feature images in the channel dimension through a splicing layer, and taking the generated feature images as input. Assuming that the sizes of the three feature maps are 1×h×w (where H represents height and W represents width), respectively, the size of the feature map generated by stitching the three feature maps in the channel dimension is 3×h×w. In the architecture shown in fig. 2, the modification to the original controlnet model is actually to copy the first coding layer in the diffration model (including the first coding layer, the Middle layer Middle Block and the decoding layer in fig. 2) to obtain the second coding layer, and delete the zero convolution layer externally connected with the decoding layer, so that the feature input layer and the first zero convolution layer are added. In this embodiment, only the Middle layer Middle Block of the output portion is externally connected with one zero conv layer (i.e. the second zero convolution layer), which is different from each layer of the output portion of the original controlnet model, so that training of the modified controlnet model can be prevented from being overfitted in a small sample space. It should be noted that, in fig. 2, the first encoding layer and the second encoding layer each include N Encoder Block, the decoding layer includes N Decoder blocks, encoder Blocki (i e (1, 2, …, N)) in the first encoding layer and the second encoding layer in fig. 2, intermediate layer Middle Block, and Decoder Blocki (i e (1, 2, …, N)) in the decoding layer are structures existing in the unet network, and specific meanings thereof are not described herein.

On the basis of the foregoing method embodiment, each of the first convolution layer and the second convolution layer may include one convolution layer Conv (may be a2×2 convolution layer), one group normalization layer GN, and one activation function layer SILU.

On the basis of the foregoing method embodiment, generating a face feature map based on the first picture may include:

And carrying out face recognition on the first picture to obtain a face area, generating a line diagram of the face area, converting each pixel value of the line diagram of the face area into a value within the range of < -1 > and 1 >, and taking the picture obtained by converting the pixel value as a face feature picture.

In this embodiment, the face area includes at least a face area, but may include other areas, such as a partial neck area. The line block diagram generating the face region may employ a canny algorithm. The conversion formula for converting a pixel value into a value in the range of [ -1,1] may be pix_1= (pix/255-0.5) ×2, where pix_1 is the pixel value after conversion and pix is the pixel value before conversion.

On the basis of the foregoing method embodiment, generating a highlight feature map based on the first picture may include:

converting the first picture into a picture in a YCbCr format;

For the pixel value of each Y channel in the picture in the YCbCr format, if judging that the pixel value is smaller than the highlight threshold value, adjusting the pixel value to 0, otherwise, maintaining the pixel value unchanged;

and converting the pixel value of each Y channel in the picture in the YCbCr format into a value in the range of [ -1,1], and taking the picture obtained after converting the pixel value as a highlight feature map.

On the basis of the foregoing method embodiment, generating a shadow feature map based on the first picture may include:

converting the first picture into a picture in a YCbCr format;

For the pixel value of each Y channel in the picture in the YCbCr format, if judging that the pixel value is larger than the shadow threshold value, adjusting the pixel value to 0, otherwise, adjusting the pixel value to be 255 and the difference value of the pixel value;

and converting the pixel value of each Y channel in the picture in the YCbCr format into a value in the range of [ -1,1], and taking the picture obtained after converting the pixel value as a shadow feature map.

On the basis of the foregoing method embodiment, before the inputting the face feature map, the highlight feature map and the shadow feature map into the modified controlnet model to obtain the second picture, the method may further include:

And training the modified controlnet model by using a plurality of pictorial pictures, wherein parameters of the first coding layer, the middle layer and the decoding layer are kept unchanged in the training process of the controlnet model.

In the embodiment, in the training stage, in order to ensure the low calculation force requirement, only the photo picture with the high quality light and shadow effect and the size of 5k-10k is needed to form a data set. Meanwhile, due to the small data set, an adjustment needs to be made to the controlnet model structure (the adjusted controlnet model is the modified controlnet model) so as to ensure that training cannot be over-fitted. During training, for each true picture, three corresponding conditional feature maps need to be generated: face feature map, highlight feature map and shadow feature map. And then, keeping parameters of the first coding layer, the middle layer and the decoding layer unchanged, and training the modified controlnet model by using the face feature diagram, the highlight feature diagram and the shadow feature diagram (namely obtaining optimal parameters of a first convolution layer Conv (GN/SILU), a second convolution layer Conv (GN/SILU), a splicing layer Concat, the first zero convolution layer, the second coding layer and the second zero convolution layer). And then, a picture with obvious light and dark change effect in the face region can be generated by using the controlnet model after training. The user may generate a writeable picture using the modified controlnet model in two ways: 1. the user uses a selected photo graph, three feature graphs of cl, ch and cf are generated by the method in the embodiment, and cl, ch and cf are input into controlnet models to generate high-quality photo pictorial pictures; 2. and the user directly generates a photo picture by using the diffusion model, then generates three characteristic diagrams of cl, ch and cf by using the photo picture, and inputs the three characteristic diagrams to the controlnet model so as to generate the high-quality photo picture.

Referring to fig. 3, a schematic structural diagram of a photo image generating device according to an embodiment of the present application is shown, where the device includes:

A first generating unit 30, configured to obtain a face feature map, a highlight feature map, and a shadow feature map, where the face feature map, the highlight feature map, and the shadow feature map are input by a user, or are generated based on text or a first picture;

the second generating unit 31 is configured to input the face feature map, the highlight feature map, and the shadow feature map into the modified controlnet model to obtain a second picture, where the modified controlnet model includes a feature input layer, an encoding layer, an intermediate layer, and a decoding layer, the feature input layer is a newly added layer structure, and is configured to input integration features of the face contour feature, the highlight feature, and the shadow feature, the feature input layer is externally connected with a first zero convolution layer, the intermediate layer is externally connected with a second zero convolution layer, and the second picture is a target photo.

According to the portrait picture generation device provided by the embodiment of the application, the facial feature map, the highlight feature map and the shadow feature map are input into the modified controlnet model, so that a portrait picture with a high-quality shadow effect can be generated.

The implementation process of the device for generating the portrait picture provided by the embodiment of the application is consistent with the method for generating the portrait picture provided by the embodiment of the application, and the effect achieved by the device is the same as that of the method for generating the portrait picture provided by the embodiment of the application, and the description thereof is omitted.

As shown in fig. 4, an electronic device provided in an embodiment of the present application includes: the system comprises a processor 40, a memory 41 and a bus 42, wherein the memory 41 stores machine-readable instructions executable by the processor 40, and when the electronic device is running, the processor 40 communicates with the memory 41 through the bus 42, and the processor 40 executes the machine-readable instructions to perform the steps of the photo picture generation method.

Specifically, the above-mentioned memory 41 and the processor 40 can be general-purpose memories and processors, and are not particularly limited herein, and the above-mentioned pictorials picture generation method can be performed when the processor 40 runs a computer program stored in the memory 41.

Corresponding to the above-mentioned pictorial picture generation method, the embodiment of the present application further provides a computer readable storage medium, where a computer program is stored on the computer readable storage medium, and when the computer program is executed by a processor, the steps of the above-mentioned pictorial picture generation method are executed.

The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily appreciate variations or alternatives within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims

1. A photo picture generation method is characterized by comprising the following steps:

Inputting the face feature map, the highlight feature map and the shadow feature map into a modified controlnet model to obtain a second picture, wherein the modified controlnet model comprises a feature input layer, a coding layer, an intermediate layer and a decoding layer, the feature input layer is of a newly added layer structure and is used for inputting the integration features of the face contour features, the highlight features and the shadow features, the feature input layer is externally connected with a first zero convolution layer, the intermediate layer is externally connected with a second zero convolution layer, and the second picture is a target photo picture;

The characteristic input layer comprises a first convolution layer, a second convolution layer and a splicing layer, wherein the first convolution layer and the second convolution layer are connected with the splicing layer, the splicing layer is connected with the first zero convolution layer, the input of the first convolution layer is a shadow characteristic diagram, the input of the second convolution layer is a highlight characteristic diagram, the input of the splicing layer is the output of the first convolution layer, the output of the second convolution layer and a face characteristic diagram, the splicing layer is used for splicing the output of the first convolution layer, the output of the second convolution layer and the face characteristic diagram in a channel dimension, and the input of the first zero convolution layer is the output of the splicing layer:

The coding layer comprises a first coding layer and a second coding layer, and the modified controlnet model also comprises an input layer and an output layer;

the input layer is connected with the first coding layer, the first coding layer is connected with the middle layer, the middle layer is connected with the decoding layer, the decoding layer is connected with the output layer, the output of the input layer is input into the first coding layer, the output of the first coding layer is input into the middle layer, the output of the middle layer is input into the decoding layer, and the output of the decoding layer is input into the output layer;

the second coding layer is connected with the second zero convolution layer, the result obtained by adding the output of the input layer and the output of the first zero convolution layer is input into the second coding layer, the output of the second coding layer is input into the second zero convolution layer, and the output of the second zero convolution layer is input into the middle layer.

2. The method of claim 1, wherein the first convolution layer and the second convolution layer each comprise a convolution layer, a group normalization layer, and an activation function layer.

3. The method of claim 1 or 2, wherein generating a face feature map based on the first picture comprises:

4. The method of claim 1 or 2, wherein generating the highlight feature map based on the first picture comprises:

converting the first picture into a picture in a YCbCr format;

5. The method of claim 1 or 2, wherein generating the shadow feature map based on the first picture comprises:

converting the first picture into a picture in a YCbCr format;

6. The method of claim 1, further comprising, prior to said inputting the face feature map, the highlight feature map, and the shadow feature map into the modified controlnet model to obtain the second picture:

7. A pictorial picture generation apparatus, comprising:

The second generating unit is used for inputting the face feature map, the highlight feature map and the shadow feature map into the modified controlnet model to obtain a second picture, wherein the modified controlnet model comprises a feature input layer, a coding layer, an intermediate layer and a decoding layer, the feature input layer is of a newly added layer structure and is used for inputting the integration features of the face contour features, the highlight features and the shadow features, the feature input layer is externally connected with a first zero convolution layer, the intermediate layer is externally connected with a second zero convolution layer, and the second picture is a target pictorial picture;

8. A computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, performs the steps of the pictogram generation method as set forth in any one of claims 1 to 6.

9. An electronic device, comprising: a processor, a storage medium, and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating over the bus when the electronic device is running, the processor executing the machine-readable instructions to perform the steps of the pictorial picture generation method of any of claims 1 to 6.