CN118261807A - Image conversion method, device, electronic equipment and storage medium


Info

Publication number
CN118261807A
Authority
CN
China
Prior art keywords
image
color
original image
target
original
Legal status
Pending
Application number
CN202410266728.2A
Other languages
Chinese (zh)
Inventor
宋亦仁
刘家铭
Current Assignee
Xiaohongshu Technology Co ltd
Original Assignee
Xiaohongshu Technology Co ltd
Application filed by Xiaohongshu Technology Co ltd
Publication of CN118261807A

Abstract

The application discloses an image conversion method, an image conversion device, electronic equipment and a storage medium. The method includes the following steps: converting an original image into a first image, where the content of the first image includes at least the content of the original image; adjusting the color of the first image based on the color of the original image to obtain a second image; and adjusting the second image based on the color of the original image and an image description of the original image to obtain a first target image, where the image description is used at least to describe the content of the original image.

Description

Image conversion method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image conversion method, an image conversion device, an electronic device, and a storage medium.
Background
With the continuous development of artificial intelligence in the field of image generation, users are no longer limited to generating images from text; they can also generate images from an original image. In the process of creating an original image, multiple colors are generally used to fill it so as to express the creator's concept. However, current approaches that generate a target image based on an original image produce a corresponding matched image through image matching, and this generation mode may lose the colors drawn by the creator in the original image, so the color consistency between the finally generated target image and the original image is low and the user experience is poor.
Disclosure of Invention
The embodiments of the application provide an image conversion method, an image conversion device, electronic equipment and a storage medium, in which the colors of a second image and a first target image are adjusted according to the color features of an original image, so that the original image and the finally output image have higher color consistency and the user experience is improved.
In a first aspect, an embodiment of the present application provides an image conversion method, including:
Converting an original image into a first image, wherein the content of the first image at least comprises the content of the original image;
based on the color of the original image, adjusting the color of the first image to obtain a second image;
and adjusting the second image based on the color of the original image and the image description of the original image to obtain a first target image, wherein the image description is at least used for describing the content of the original image.
In a second aspect, an embodiment of the present application provides an image conversion apparatus, including: a transceiver unit and a processing unit;
a transceiver unit, configured to acquire an original image;
a processing unit, configured to convert the original image into a first image, where the content of the first image at least includes the content of the original image;
based on the color of the original image, adjusting the color of the first image to obtain a second image;
and adjusting the second image based on the color of the original image and the image description of the original image to obtain a first target image, wherein the image description is at least used for describing the content of the original image.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor and a memory, the processor being connected to the memory, the memory being for storing a computer program, the processor being for executing the computer program stored in the memory to cause the electronic device to perform the method as in the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program that causes a computer to perform the method as in the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method as in the first aspect.
The embodiment of the application has the following beneficial effects:
It can be seen that, in the embodiment of the present application, an original image is acquired and converted into a first image, where the content of the first image at least includes the content of the original image; the color of the first image is adjusted based on the color of the original image to obtain a second image; the color of the second image is adjusted based on the color of the original image and the image description of the original image to obtain a first target image, where the image description is at least used to describe the content of the original image; and the first target image is returned to the user side so as to be displayed on the display interface of the user side. In the process of generating the target image based on the original image, the colors of the first image and the second image are adjusted based on the colors of the original image to finally obtain the first target image, so the colors used by the target object in the original image can be retained in the finally generated first target image, which improves the color consistency between the output first target image and the original image and improves the user experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of an image conversion system according to an embodiment of the present application;
Fig. 2 is a schematic view of a scene of image conversion according to an embodiment of the present application;
Fig. 3 is a schematic flow chart of an image conversion method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an image conversion process according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an image transformation model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another image conversion process according to an embodiment of the present application;
Fig. 7 is a functional unit block diagram of an image conversion device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims and drawings are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will appreciate, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
In order to facilitate understanding of the technical scheme of the present application, related technical terms related to the present application are explained first.
Target object: the target object in the present application may be a user, a virtual person, a robot, a digital person, etc., and in the present application, the target object is described as an example of the user.
Original image: the original image may be an image drawn by the target object, an image generated by artificial intelligence technology, or a photo uploaded by the target object, which is not limited herein. The present application is described by taking an original image drawn by the target object as an example, and more specifically by taking graffiti drawn by the target object as the original image; for convenience of description, the original image and the graffiti are similar in nature and are not distinguished.
The image generation, image processing, and image conversion according to the present application all refer to converting an image on the user side (i.e., an original image) into a generated image, i.e., an AI image or artificial-intelligence image. Therefore, the image generation related to the present application may also be referred to as AI generation, and the image conversion may also be referred to as AI conversion; image generation, image processing, and image conversion are similar in nature and may not be distinguished.
Image description: the image description is used to describe the content in the original image, i.e., to describe the detailed content in the original image.
Color value of a pixel: the color value of a pixel is a numerical representation of the color that the pixel displays in an image, and is typically made up of multiple channels, each representing one color. RGB (red, green, blue) is most commonly used, where each pixel has three channels: red (R), green (G), and blue (B). The color value of each channel is typically between 0 and 255, representing the intensity of the corresponding color, and RGB represents the three colors red, green, and blue mixed in a certain ratio. In RGB, if the color values of all channels are 255, the pixel is pure white; if the color values of all channels are 0, the pixel is pure black. A weight is set for the color value of each channel in the pixel to obtain the color value of the pixel. For example, if the color values of the three channels of a pixel are (100, 220, 150) and the channel values are mixed according to a red ratio of 50%, a green ratio of 30%, and a blue ratio of 20%, the color value of the pixel is 146.
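As a minimal illustration of the weighted mixing just described (the channel values and the 50%/30%/20% weights are the example figures from this paragraph, not parameters prescribed by the application):

```python
# Minimal sketch: weighted mixing of RGB channel values into a single color value.
# The values (100, 220, 150) and the weights are illustrative only.
def mixed_color_value(r, g, b, w_r=0.5, w_g=0.3, w_b=0.2):
    return w_r * r + w_g * g + w_b * b

print(mixed_color_value(100, 220, 150))  # 146.0
```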
Mask: a tool for controlling the degree to which an image is modified and for performing local adjustment. A mask has only three colors: black, white, and gray, where black indicates a protected area, white indicates an area that is allowed to be modified, and gray is used to adjust the degree of modification.
Number of noise additions: the number of times the noise-adding operation is repeated when the image is randomly perturbed. The more noise additions, the greater the degree of random perturbation.
Noise intensity: the magnitude of the noise added each time the image is randomly perturbed. The greater the noise intensity, the greater the degree of random perturbation and the better the privacy protection effect.
Number of denoising steps: the number of times the image is denoised; each denoising step removes a certain amount of noise.
Denoising strength: a parameter setting that indicates the degree of noise reduction performed on the image in one denoising step. The denoising strength needs to be adjusted on a case-by-case basis. It should be noted that the denoising strength may also be referred to as denoising responsivity in the present application.
In the process of adding noise to and denoising an image, the number of noise additions and the number of denoising steps may be the same or different, and the noise intensity and the denoising strength may be the same or different.
BLIP-2 model: a general, computationally efficient vision-language pre-training method. The image part and the text part are encoded separately and then interact deeply. Pre-training is carried out in two stages: the first stage bootstraps vision-language representation learning from the image, and the second stage bootstraps vision-to-language generative learning.
Show and Tell model: a deep-learning-based image captioning algorithm that converts an image into natural language by mapping the image and the corresponding descriptive text into the same vector space.
Referring to fig. 1, fig. 1 is a schematic diagram of an image conversion system according to an embodiment of the application. As shown in fig. 1, the image conversion system includes: a client 101 and an image conversion device 102.
The user terminal 101 is deployed with an application program that provides an image conversion function. The target object may perform a series of operations through the display interface of the application, draw the original image on the display interface of the application or upload the original image, and the target object may send the original image to the image conversion device 102 by clicking a corresponding virtual function button on the display interface of the application, so as to request the image conversion device 102 to convert the original image into the first target image.
Correspondingly, after receiving the original image, the user terminal 101 sends the original image to the image conversion device 102, and the image conversion device 102 converts the original image into a first image, wherein the content of the first image at least comprises the content of the original image; based on the color of the original image, adjusting the color of the first image to obtain a second image; and adjusting the color of the second image based on the color of the original image and the image description of the original image to obtain a first target image, wherein the image description is at least used for describing the content of the original image. After the image conversion device 102 converts the original image into the first target image, the image conversion device 102 sends the first target image to the user terminal 101, and the user terminal 101 displays the first target image in the display screen.
In practical applications, the function of converting the original image into the image may be localized at the user side, and in this case, the user side 101 may implement the function of converting the original image into the image without interacting with the image conversion device 102. In the embodiment of the present application, description will be made mainly taking an example in which the image conversion device 102 performs image conversion.
It should be appreciated that the image conversion apparatus of the present application may be a server or another computing device, where the server may be a cloud computing server, a Content Delivery Network (CDN) server, a Network Time Protocol (NTP) server, a Domain Name System (DNS) server, or any of various other types of servers. The servers described above are merely examples and are not exhaustive; the image conversion apparatus includes, but is not limited to, the servers described above.
It can be seen that, in the embodiment of the present application, the user terminal 101 obtains an original image and sends it to the image conversion device 102 based on the requirement of the target object. After obtaining the original image, the image conversion device 102 converts the original image drawn by the target object into a first image, where the content of the first image at least includes the content of the original image; adjusts the color of the first image based on the color of the original image to obtain a second image; adjusts the color of the second image based on the color of the original image and the image description of the original image to obtain a first target image, where the image description is at least used to describe the content of the original image; and returns the first target image to the user terminal 101 so that it is displayed on the display interface of the user terminal 101. Because the colors of the first image and the second image are adjusted based on the colors of the original image in the process of generating an image from the original image, and the content of the first image at least includes the content of the original image when the original image is converted into the first image, the content and the colors used in the original image can be retained in the finally generated first target image, which improves the consistency of the first target image and the original image in color and content, thereby improving the user experience.
Referring to fig. 2, fig. 2 is a schematic view of an image conversion scenario provided by an embodiment of the present application, where the application scenario includes an initial interface, an intermediate interface, a first target interface, a second target interface, and a release interface.
The initial interface is used for the target object to draw an original image: the target object can create the original image on a drawing board using the brush and the color palette displayed in the interface, where the color palette includes multiple colors. After the target object finishes drawing the original image, it clicks a first control in the initial interface, where the first control may be a "Next" button in the initial interface; the user side responds to the click operation and enters the image conversion process. The first control may be a virtual function button, and the original image may also be an image uploaded by the target object.
The intermediate interface includes the original image, the first control, and a target style selection area, where the target style selection area includes a preset style 1, a preset style 2, and a preset style 3 displayed by the intermediate interface. Selectable styles include, but are not limited to, a Pixar style, a futuristic style, and a dopamine style. Fig. 2 only takes the first preset style, the second preset style, and the third preset style as an example, and the application does not limit the types and number of styles.
In the intermediate interface, the target object may click on a target style. As shown in fig. 2, after the target object clicks on the preset style 2, the user side responds to the click operation and generates a first target interface, where the first target interface includes a first target image generated based on the original image of the target object, the target style selection area, the first control, and a second control, and the second control may be "change one" in the first target interface. It should be noted that, when the first target image is displayed on the first target interface, the user side and the image conversion device have, in effect, completed one pass of generating an image from the original image.
Further, if the target object is not satisfied with the generated first target image, the second control can be clicked to continue generating the image. When the target object clicks the second control, the user side responds to the clicking operation to generate a second target interface, wherein the second target interface comprises a second target image generated based on an original image of the target object, a target style selection area, the first control and the second control.
Further, when the target object clicks the first control in the second target interface, indicating that the generated second target image meets the target object's expectations, the user side responds to the click operation and enters a release interface, where the release interface includes the original image, the second target image, and a third control, and the third control is a "publish note" control. The target object may also add text in the text editing area of the release interface and edit pictures in the picture editing area.
Referring to fig. 3, fig. 3 is a flowchart of an image conversion method according to an embodiment of the present application, where the method is applied to the image conversion apparatus shown in fig. 1. The method includes, but is not limited to, steps 301-303:
301: the original image is converted into a first image.
It should be noted that, before converting the original image into the first image, it may further include: the target object draws an original image on a display interface of an application program of the user side, or the target object uploads the original image on the display interface, and then the original image is sent to the image conversion device through the user side, and accordingly, the image conversion device can acquire the original image.
Illustratively, the original image is converted into a first image, where the content of the first image at least includes the content of the original image. The original image is converted into the first image based on the original image, a target style, and the image description, and the style of the first image is the target style.
Specifically, the content of the original image is identified to obtain an image description corresponding to the original image, where the image description is used to describe the content in the original image.
Optionally, the image conversion device may identify the content of the original image through a pre-trained first model and a prompt word (prompt) to obtain a first image description. That is, the prompt word is obtained, the prompt word and the original image are input into the pre-trained first model, and the content of the original image is identified to obtain the first image description. For example, the first model may be a BLIP-2 model, a Show and Tell model, or the like; the present application does not limit the first model. The prompt word may be "please describe the content in the original image", "please describe the details in the original image", or the like. Alternatively, the prompt word may be generated by a pre-trained second model.
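As a minimal sketch of how such an image description could be obtained with an off-the-shelf BLIP-2 model (the checkpoint name, file name, and prompt text below are assumptions for illustration; the embodiment does not prescribe a specific implementation):

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed public checkpoint; the embodiment only requires "a pre-trained first model".
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

original_image = Image.open("original.png").convert("RGB")
prompt = "Question: please describe the content of this image. Answer:"  # example prompt word

inputs = processor(images=original_image, text=prompt, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=60)
image_description = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(image_description)
```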
Further, the image conversion device may convert the original image into the first image through a pre-trained third model and the image description. Specifically, a target style control model corresponding to the target style is obtained, where the target style control model is used to control the style of the first image to be the target style. A pose feature control model is constructed on the basis of the stable diffusion model, where the pose feature control model is used to fuse the pose features of the original image into the denoising process, so that the dominant features of the first image include the pose features of the original image. The target style control model, the pose feature control model, and the stable diffusion model are fused to obtain the third model. The image description, the original image, and the pose features of the original image are input into the third model to generate the first image, so that the content of the first image and the pose features of that content are the same as those of the original image, and the generated first image has the target style. For example, as shown in FIG. 4, the target style used in FIG. 4 is a futuristic style: the original image is converted into a first image that includes the content of the original image, and the first image with a futuristic style is generated based on the target style selected by the target object.
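One way to approximate a third model of this kind with public building blocks is to mount a pose-conditioned ControlNet on a Stable Diffusion checkpoint and express the target style through the prompt or the base checkpoint. The sketch below is an assumption for illustration using diffusers components, not the actual models of the embodiment:

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Assumed public models standing in for the pose feature control model and the
# stable diffusion model; the embodiment's own trained models are not disclosed.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pose_image = Image.open("original_pose.png")  # hypothetical pose map extracted from the original image
prompt = "a panda with a colorful background, futuristic style"  # image description + target style (example text)

first_image = pipe(prompt=prompt, image=pose_image, num_inference_steps=30).images[0]
first_image.save("first_image.png")
```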
It can be seen that, in the embodiment of the present application, the generation of the first image from the original image is controlled by the third model, so that the content of the first image and the pose features of that content are the same as those of the original image and the generated first image has the target style. The generated first image therefore meets the expectations of the target object in terms of both content and style, improving the target object's experience.
302: And adjusting the color of the first image based on the color of the original image to obtain a second image.
It should be noted that, in the present application, the colors of the original image include: the colors corresponding to the pixels whose color values are within a preset range. Specifically, the preset range may be (0, 255); when the color value of a pixel is 0, the color corresponding to the pixel is pure black, and when the color value of a pixel is 255, the color corresponding to the pixel is pure white. That is, the colors of the original image include colors other than black and white.
Illustratively, the color of the first image is adjusted based on the color of the original image to obtain a second image. Specifically, a first area corresponding to a pixel point, of which the color value of the pixel point in the original image is within a preset range, is obtained; that is, areas corresponding to colors other than black and white in the original image are acquired. And covering the first area on the first image as a mask to obtain the second image, wherein the position of the mask covered on the first image is the same as the position of the first area in the original image.
Optionally, the method of superimposing a mask on the first image includes: dragging the original image into the first image so that the original image becomes a layer of the first image; then setting a mask on the original image based on the color values of the pixels in the original image, painting the part to be displayed in the original image black and the other areas white to obtain the mask. For example, in the present application, the first region in the original image may be painted black, and the other regions of the original image except the first region may be painted white, to obtain the mask of the original image. The mask is superimposed onto the first image, and the first image after superimposing the mask is taken as the second image. During superimposition, the black part of the mask is a protected area that is not allowed to be modified, i.e. the first region is a protected area that is not allowed to be modified, and the white part is an area that is allowed to be modified; when the mask is superimposed onto the first image, the black part of the mask is still displayed in the superimposed image, while the white part of the mask is replaced by the content at the corresponding position in the first image. Finally, the first image after superimposing the mask is taken as the second image. As shown in fig. 4, the first region in fig. 4 includes the background drawn by the target object with the color brush and the lips of the panda; this region is painted black and the other areas are painted white to obtain the mask, the mask is superimposed onto the first image, and the first image after superimposing the mask is taken as the second image, so that the colors in the first region are displayed in the second image.
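A minimal sketch of this mask construction and overlay, assuming RGB images of the same size and treating any pixel that is neither pure black nor pure white as part of the first region (file names are hypothetical):

```python
import numpy as np
from PIL import Image

original = np.array(Image.open("original.png").convert("RGB"))
first_image = np.array(Image.open("first_image.png").convert("RGB"))

# First region: pixels whose color is neither pure black (0,0,0) nor pure white (255,255,255).
is_black = np.all(original == 0, axis=-1)
is_white = np.all(original == 255, axis=-1)
first_region = ~(is_black | is_white)          # True where the creator's colors are

# Protected (black) area keeps the original colors; the rest is taken from the first image.
second_image = np.where(first_region[..., None], original, first_image)
Image.fromarray(second_image.astype(np.uint8)).save("second_image.png")
```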
It can be seen that, in the embodiment of the present application, the color of the original image is superimposed on the first image through the mask, so as to obtain the second image, so that the color of the second image is consistent with the color of the original image, and the authored content of the target object is fully reserved.
303: And adjusting the second image based on the color of the original image and the image description of the original image to obtain a first target image.
It should be noted that the adjustment of the second image, including the adjustment of the color of the second image, is based on the color of the original image and the image description of the original image, wherein the image description is used at least for describing the content of the original image.
Illustratively, adding noise to the second image at least once based on the color of the original image to obtain a first noise image; and denoising the first noise image at least once through the color of the original image and the image description of the original image to obtain the first target image.
Further, the number of times noise is added to the second image is determined based on the color of the original image: the color duty ratio of the original image is acquired; the number of noise additions N is obtained based on the color duty ratio; and noise is added to the second image N times to obtain the first noise image. Specifically, the area of the original image and the color value of each pixel in the original image are obtained; a plurality of pixels whose color values are within a preset range are determined based on the color value of each pixel, where the preset range may be (0, 255), i.e., the colors corresponding to the plurality of pixels are colors other than black and white; the area of the color patch is determined based on the plurality of pixels, and the color duty ratio of the original image is determined based on the area of the color patch and the area of the original image.
Optionally, based on the color duty ratio, the number of times of adding noise N is obtained, where the number of times of adding noise may be equal to the number of times of removing noise. The application provides a method for determining the noise adding times N, which comprises the following steps:
Specifically, the number of times of noise addition is determined based on a preset number of times of noise removal and the noise removal intensity, wherein the preset number of times of noise removal is used for representing the number of times of noise removal required from random noise to the second image, and the number of times of noise addition N can be represented by the following formula:
N=F*S
wherein N is the number of times of adding noise, F is the preset number of times of removing noise, and S is the intensity of removing noise. It can be seen from the formula that the denoising strength S and the number of times of adding noise are in a proportional relationship, namely the greater the denoising strength is, the greater the number of times of adding noise N is. The larger the denoising strength is, the smaller the similarity between the finally generated image and the original image is, so that the noise needs to be added multiple times, namely the larger the denoising strength is, the larger the number of times of adding noise N is.
In the application, two methods for determining the denoising intensity are also provided, wherein the first method is to directly take the color duty ratio as the denoising intensity S; the second method is to determine the denoising intensity S corresponding to the color duty ratio of the original image based on the mapping relationship of the color duty ratio and the denoising intensity. Specifically, the color ratio may range from 0% to 99%, and the denoising strength S may range from 0.6 to 0.85 or from 0.6 to 0.9. The implementation process of adding the N times of noise to the second image is as follows: and in each noise adding process, adding noise to the second image based on the denoising intensity S of each time until N times of noise are added, and obtaining the first noise image. In the present application, the noise adding intensity and the noise removing intensity are the same, and the noise adding number and the noise removing number are the same.
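A minimal sketch of how the color duty ratio and the number of noise additions N could be computed under these definitions; the preset denoising count F, the strength bounds, and the linear mapping from ratio to strength are assumptions consistent with the ranges mentioned above, not values fixed by the application:

```python
import numpy as np
from PIL import Image

def color_duty_ratio(original_rgb: np.ndarray) -> float:
    """Fraction of pixels whose color is neither pure black nor pure white."""
    is_black = np.all(original_rgb == 0, axis=-1)
    is_white = np.all(original_rgb == 255, axis=-1)
    colored = ~(is_black | is_white)
    return float(colored.sum()) / colored.size

def noise_addition_count(ratio: float, preset_denoise_steps: int = 50,
                         s_min: float = 0.6, s_max: float = 0.9):
    """Map the color duty ratio to a denoising strength S, then N = F * S."""
    strength = s_min + (s_max - s_min) * ratio   # one possible mapping, assumed for illustration
    return strength, int(preset_denoise_steps * strength)

original = np.array(Image.open("original.png").convert("RGB"))
S, N = noise_addition_count(color_duty_ratio(original))
print(f"denoising strength S={S:.2f}, noise additions N={N}")
```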
As shown in fig. 4, the colors of the original image include the background drawn by the target object with the color brush and the lips of the panda, and the color patch area is relatively large in the original image, that is, the color duty ratio of the original image is relatively large. Based on the method for determining the denoising strength S from the color duty ratio described above, the denoising strength S should be relatively large in the process of converting this image. The number of noise additions is determined based on the denoising strength S: the greater the denoising strength, the greater the number of noise additions required. The number of noise additions N for the second image is determined based on the denoising strength S and the preset number of denoising steps F, and N rounds of noise are added to the second image to obtain the first noise image. Further, since the number of denoising steps and the number of noise additions are equal, when denoising the first noise image, N denoising steps are correspondingly performed to obtain the first target image shown in fig. 4.
The difference between the first target image and the second target image in fig. 4 is due to the difference in denoising strength. The denoising strength corresponding to the first target image is S1 and the denoising strength corresponding to the second target image is S2, where S2 is greater than S1. Compared with the first target image, the content and details of the second target image are enriched to a greater degree: the details of the panda's face and legs are further beautified, so that the generated panda is more attractive and better fused with the target style of the image. In addition, the background in the second target image is richer than in the first target image; while the background color is kept consistent with the background in the original image, background details corresponding to the futuristic style are generated, for example a background with stereoscopic content. Therefore, when the color duty ratio of the original image is relatively large, the denoising strength S should be correspondingly increased so as to generate an image that is more attractive and better meets the requirements of the target object. It can be seen that, in the embodiment of the present application, the denoising strength is determined based on the color duty ratio of the original image: the smaller the color duty ratio, the smaller the denoising strength and the fewer the denoising steps. For example, when the colors of the original image input by the target object occupy a relatively small area, in order to reflect the colors depicted by the target object in the finally generated image, that is, to maintain color consistency, the original image needs to maintain a relatively high similarity with the finally generated image; accordingly, the denoising strength and the number of denoising steps should be reduced in the denoising process to ensure that the finally generated image has a relatively high similarity with the original image, thereby reflecting the colors of the original image in the finally generated image, i.e., maintaining color consistency. Determining the denoising strength from the color duty ratio of the original image keeps the colors of the original image and of the finally generated first target image consistent, meeting the user's expectations.
Further, the first noise image is denoised at least once through the color of the original image and the image description of the original image to obtain the first target image, wherein the denoising times can be N. Specifically, extracting text features from the image description of the original image to obtain text features; fusing the color of the original image with the first noise image to obtain a second noise image; extracting color features of the second noise image to obtain color features; encoding the first noise image, and fusing the text characteristics during encoding to obtain a first characteristic diagram; extracting the characteristics of the first characteristic map, and fusing the text characteristics and the color characteristics during characteristic extraction to obtain a second characteristic map; decoding the second feature map, and fusing the text features and the color features during decoding to obtain a target feature map; and decoding the target feature map to obtain the first target image.
Specifically, for the ith denoising in the N times of denoising of the first noise image, firstly obtaining a time step corresponding to the ith denoising, and encoding the time step to obtain a time feature corresponding to the ith denoising; and coding the target feature map obtained by the ith-1 th denoising for multiple times based on the text feature and the time feature corresponding to the ith denoising to obtain a first feature map corresponding to the ith denoising, wherein when i=1, the target feature map is a first noise image.
Further, in the process of carrying out the ith denoising, fusing the color of the original image with the first noise image corresponding to the ith denoising to obtain a second noise image corresponding to the ith denoising; performing color feature extraction on a second noise image corresponding to the ith denoising to obtain a first color feature; and encoding the first color feature to obtain a second color feature corresponding to each encoding. And carrying out zero convolution on the second color feature obtained by the last coding to obtain a second color feature corresponding to the ith denoising.
Then, based on the text feature, the time feature corresponding to the ith denoising and the second color feature, carrying out feature fusion on the first feature map corresponding to the ith denoising to obtain a second feature map corresponding to the ith denoising; and finally, decoding the second feature map corresponding to the ith denoising for multiple times based on the text feature, the time feature corresponding to the ith denoising and the second color feature to obtain a target feature map corresponding to the ith denoising. Wherein the ith denoising is any one of the N denoising steps.
And denoising a coding result obtained by the j-1 th coding according to the time characteristic corresponding to the i-th denoising aiming at the j-th coding in the multiple codes corresponding to the i-th denoising, so as to obtain a first characteristic diagram corresponding to the j-th coding. Specifically, the coding result obtained by the j-1 th coding, the text feature and the downsampling hierarchy corresponding to the j-1 th coding are subjected to denoising, so that a first feature map corresponding to the j-1 th coding is obtained. It should be understood that after performing encoding multiple times in the ith denoising process, a first feature map corresponding to the ith denoising may be obtained.
And denoising a decoding result obtained by the j-1 th decoding according to the time characteristic corresponding to the i-th denoising aiming at the j-th decoding in the multiple decoding corresponding to the i-th denoising, so as to obtain a first characteristic diagram corresponding to the j-th decoding. Specifically, the decoding result obtained by the j-1 decoding, the text feature, the up-sampling level corresponding to the j decoding and the second color feature are subjected to denoising, so that a first feature map corresponding to the j decoding is obtained. It should be understood that after decoding is performed multiple times in the ith denoising process, a feature map corresponding to the ith denoising can be obtained.
It should be understood that the decoding result corresponding to the last decoding may be regarded as the feature map corresponding to the ith denoising.
It should be appreciated that after the last denoising, a final target feature map may be obtained, and decoding the final feature map may obtain the first target image.
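The per-step flow described above can be summarized by the following structural sketch; all module definitions here are simplified stand-ins (plain convolutions and additive feature fusion) rather than the actual network of the embodiment, and the sizes are toy values:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for an encoder/middle/decoder block: one conv plus additive feature fusion."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, *conds):
        h = self.conv(x)
        for c in conds:                 # fuse text / time / color features (broadcast add)
            h = h + c
        return h

C = 8                                   # toy channel count
encoder, middle, decoder = Block(C), Block(C), Block(C)

def denoise_step(target_map, text_feat, time_feat, color_feat_2nd):
    """One of the N denoising steps: encode -> middle (fuse color) -> decode (fuse color)."""
    first_feature_map = encoder(target_map, text_feat, time_feat)
    second_feature_map = middle(first_feature_map, text_feat, time_feat, color_feat_2nd)
    return decoder(second_feature_map, text_feat, time_feat, color_feat_2nd)

x = torch.randn(1, C, 32, 32)           # first noise image (latent), toy size
txt = torch.randn(1, C, 1, 1)           # text feature, broadcastable
t = torch.randn(1, C, 1, 1)             # time feature for step i
col = torch.randn(1, C, 1, 1)           # second color feature for step i
for i in range(4):                      # N denoising steps (toy N)
    x = denoise_step(x, txt, t, col)
print(x.shape)
```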
It can be seen that in the embodiment of the application, the colors of the first noise image and the original image are fused to obtain the second noise image, the first color feature of the second noise image is extracted, the first color feature is encoded for multiple times to obtain the second color feature, and the second color feature is fused in each decoding process of the first noise image, so that the consistency of the color feature in the finally generated target feature image and the color feature in the second noise image is higher, the color consistency of the original image and the first target image is ensured to be kept higher, the generated first target image accords with the target object expectation, and the target object experience is improved.
In one embodiment of the present application, denoising the first noise image at least once based on the color of the original image and the image description of the original image to obtain the first target image is performed through the image conversion model shown in fig. 5. Specifically, the image description is input into the text encoder, the first noise image is taken as the input of the stable diffusion model, and the color features are input into the color feature control model to obtain the target feature map.
The process of constructing the image conversion model, which is applied to convert the first noise image into the target feature map in the present application, will be described first.
Specifically, a target style control model corresponding to the target style is obtained, where the target style control model is used to control the style of the first target image to be the target style. A color feature control model is constructed on the basis of the stable diffusion model, where the color feature control model is used to fuse color features into the denoising process so that the color features of the first target image include the color features of the original image; the color feature control model may be a ControlNet model. Finally, the target style control model, the color feature control model, and the stable diffusion model are fused to obtain the image conversion model. The stable diffusion model includes a plurality of network layers, where each network layer includes a plurality of encoding blocks, a plurality of decoding blocks, and an intermediate block; the color feature control model may be mounted on the intermediate block and the decoding blocks of the stable diffusion model.
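In terms of public building blocks, this composition roughly corresponds to an image-to-image Stable Diffusion pipeline with a ControlNet attached. The sketch below is an approximation using assumed diffusers components and checkpoint names, where `strength` plays the role of the denoising strength S and `controlnet_conditioning_scale` plays the role of the color feature control parameter discussed later; it is not the embodiment's own trained model:

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

# Assumed public checkpoints standing in for the color feature control model
# and the stable diffusion model.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny",
                                             torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

second_image = Image.open("second_image.png")  # masked image from step 302
color_hint = Image.open("original.png")        # conditioning image derived from the original image

first_target_image = pipe(
    prompt="a panda with a colorful background, futuristic style",  # image description + style (example)
    image=second_image,
    control_image=color_hint,
    strength=0.75,                      # denoising strength S (example value)
    controlnet_conditioning_scale=1.0,  # color feature control parameter (example value)
    num_inference_steps=50,             # preset denoising count F (example value)
).images[0]
first_target_image.save("first_target_image.png")
```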
Referring to fig. 5, fig. 5 is a schematic diagram of an image conversion model according to an embodiment of the present application.
The ith denoising process in the above-described multiple denoising is described below with the model structure of the image conversion model shown in fig. 5.
Illustratively, as shown in FIG. 5, the image conversion model includes a target style control model, a color feature control model (ControlNet), and a Stable Diffusion model (Stable Diffusion).
Among them, stable Diffusion includes a text encoder (Text Encoder), a time encoder (Time Encoder), an encoded Block (e.g., encoded Block 1 (Encoder Block 1), encoded Block 2 (Encoder Block 2), encoded Block 3 (Encoder Block 3), and encoded Block 4 (Encoder Block 4) shown in fig. 5)), an intermediate Block (e.g., middle Block shown in fig. 5), and a plurality of decoded blocks (e.g., decoded Block 1 (Decoder Block 1), decoded Block 2 (Decoder Block 2), decoded Block 3 (Decoder Block 3), and decoded Block 4 (Decoder Block 4) shown in fig. 5), wherein the plurality of encoded blocks and the plurality of decoded blocks are in one-to-one correspondence. In the application, the target style control model can be mounted in the coding block, the middle block and the decoding block on the stable diffusion model; the color feature control model may be mounted in an intermediate block and a decoding block on the stable diffusion model.
The ControlNet includes a plurality of encoding blocks, an intermediate block, and a plurality of zero convolutions; the plurality of encoding blocks are obtained by reusing the plurality of encoding blocks of Stable Diffusion, and the intermediate block of the ControlNet is likewise obtained by reusing the intermediate block of Stable Diffusion.
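A "zero convolution" here is simply a convolution whose weights and bias are initialized to zero, so the ControlNet branch initially contributes nothing and its influence is learned during training; a minimal sketch (the channel count is an arbitrary example):

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, as used at the outputs of a ControlNet branch."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

zc = zero_conv(320)
x = torch.randn(1, 320, 64, 64)
print(zc(x).abs().max())  # tensor(0.) before training: the branch starts as a no-op
```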
The target style control model shown in fig. 5 is a target style control model corresponding to a target style, and the target style control model is mounted in the encoding block, the middle block and the decoding block on the stable diffusion model.
As shown in fig. 5, in the ith denoising process, inputting the image description into a text encoder for encoding to obtain target text characteristics; and acquiring a time step Ti corresponding to the ith denoising, and inputting the time step Ti into a time encoder for encoding to obtain a time characteristic.
And then, coding the target feature map obtained by the i-1 th denoising for multiple times through a plurality of coding layers, text features, time features and a target style control model to obtain a first feature map corresponding to the i-1 th denoising. Wherein one coding block is used to implement one coding process.
Similarly, for the color features, convolution processing is performed through a zero convolution to obtain the color features of the original image. The color features of the original image are then fused with the target feature map obtained by the (i-1)-th denoising to obtain the first color feature. Next, the first color feature is input into the plurality of encoding blocks of the color feature control model for encoding, and a second color feature corresponding to each encoding is obtained. Feature extraction is performed in the intermediate block on the second color feature obtained by the last encoding to obtain the second color feature corresponding to the i-th denoising.
And then, inputting the first feature map corresponding to the ith denoising into the middle block for feature extraction, and fusing the extracted features with the second color features corresponding to the ith denoising to obtain a second feature map corresponding to the ith denoising.
And then, inputting a second feature map corresponding to the ith denoising into a plurality of decoding blocks for decoding, fusing a second color feature corresponding to the ith denoising in the feature control model in the decoding process to obtain a decoding result of each decoding, and taking the decoding result obtained by the last decoding as a target feature map corresponding to the ith denoising. The decoding process of each decoding block is similar to the j-th decoding process in the multiple decoding corresponding to the i-th denoising, and will not be described.
Referring to fig. 6, a schematic diagram of another image conversion process according to an embodiment of the present application is provided.
It should be noted that, in this embodiment of the present application, the description mainly focuses on the influence of different denoising responsivities (denoising strengths) and different color feature control parameters on the generated image.
As illustrated in fig. 6, the third image is an original image input by the target object, and the color features in the third image include jagged hair with color features, mouth with color features, and shoulder lines with color features, and since the area of the color patch in the third image is smaller, the method described in step 303 should correspond to a smaller denoising intensity when processing the third image.
The fourth image is generated by converting the original image into an image having the target style and then overlaying the color of the original image into the image by the method described in steps 301-302, and the specific generating step is similar to the method described in steps 301-302 for generating the second image, which is not described herein.
Specifically, the fourth image is generated into a third target image, a fourth target image, a fifth target image and a sixth target image through the image conversion model. In the process of generating a third target image by the fourth image through the image conversion model, the denoising intensity is set to be S3, and the color characteristic control parameter is set to be H1; in the process of generating a fourth target image by the fourth image through the image conversion model, the denoising intensity is set to S3, and the color characteristic control parameter is set to H2; the denoising intensity is set to S4 and the color feature control parameter is set to H1 in the process that the fourth image generates a fifth target image through an image conversion model; the fourth image sets the denoising intensity to S3 in the process of generating the sixth target image through the image conversion model, and the color feature control parameters are not set. Wherein S4 is greater than S3 and H2 is greater than H1.
Comparing the third target image with the fourth target image: although the image content in the third target image is more attractive than in the fourth target image, the third target image loses part of the color information in the third image, such as the "jagged hair with color features", while the "jagged hair with color features" is retained in the fourth target image. That is, under the same denoising strength, the larger the color feature control parameter, the more consistent the color features of the generated image are with the color features of the third image.
Comparing the fourth target image with the fifth target image, the color consistency between the generated fifth target image and the third image is reduced because the color feature control parameter of the fifth target image is smaller, and the similarity between the fifth target image and the third image is correspondingly reduced because of the larger denoising strength. It can be seen that, since the color patch area in the third image is small, a smaller denoising strength and a larger color feature control parameter should be used when processing the third image.
Finally, in the generation of the sixth target image, the denoising strength is set to S3 and no color feature control parameter is set, so the similarity between the sixth target image and the third image is small, and the "jagged hair with color features" and the other color features in the third image are not retained at all.
Therefore, it can be seen that, because the denoising intensity is determined by the color block area when the image is generated based on the image conversion model, so as to control the denoising times, the proper denoising intensity and denoising times can be determined for the original image, so that the generated image can be attractive and has higher color consistency with the original image.
Referring to fig. 7, fig. 7 is a block diagram illustrating functional units of an image conversion device according to an embodiment of the present application. The image conversion apparatus 700 includes: a transceiver unit 701 and a processing unit 702;
A transceiver unit 701, configured to acquire an original image;
A processing unit 702, configured to convert the original image into a first image, where the content of the first image at least includes the content of the original image;
adjust the color of the first image based on the color of the original image to obtain a second image;
and adjust the second image based on the color of the original image and the image description of the original image to obtain a first target image, where the image description is at least used to describe the content of the original image.
In one embodiment of the present application, the processing unit 702 is specifically configured to, in the aspect of adjusting the second image based on the color of the original image and the image description of the original image to obtain the first target image:
Adding noise to the second image at least once based on the color of the original image to obtain a first noise image;
And denoising the first noise image at least once through the color of the original image and the image description of the original image to obtain the first target image.
In one embodiment of the present application, the processing unit 702 is specifically configured to, based on the color of the original image, add noise to the second image at least once to obtain a first noise image:
Determining a color duty cycle of the original image;
Obtaining the noise adding times N based on the color duty ratio;
And adding the N times of noise to the second image to obtain the first noise image.
In one embodiment of the present application, the processing unit 702 is specifically configured to, in terms of performing denoising on the first noise image at least once to obtain the first target image by using the color of the original image and the image description of the original image:
Extracting text features from the image description of the original image to obtain text features;
Fusing the color of the original image with the first noise image to obtain a second noise image;
Extracting color features of the second noise image to obtain color features;
encoding the first noise image, and fusing the text characteristics during encoding to obtain a first characteristic diagram;
extracting the characteristics of the first characteristic map, and fusing the text characteristics and the color characteristics during characteristic extraction to obtain a second characteristic map;
decoding the second feature map, and fusing the text features and the color features during decoding to obtain a target feature map;
and decoding the target feature map to obtain the first target image.
In one embodiment of the present application, the processing unit 702 is specifically configured to, in determining the color duty ratio of the original image:
Acquiring the area of the original image and the color value of each pixel point in the original image;
determining a plurality of pixel points with color values within a preset range based on the color value of each pixel point;
determining an area of a color patch based on the plurality of pixel points;
and determining the color duty ratio of the original image based on the area of the color patch and the area of the original image.
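A minimal Python/NumPy sketch of the color duty ratio computation follows; the specific preset color range used here is an assumption chosen for illustration.

```python
import numpy as np

def color_duty_ratio(original_image: np.ndarray,
                     lower=(200, 0, 0), upper=(255, 80, 80)) -> float:
    """Ratio of the color-patch area to the area of the original image.

    `original_image` is an HxWx3 uint8 array; `lower`/`upper` bound the
    preset range of color values (an illustrative red range here).
    """
    lower = np.asarray(lower, dtype=np.uint8)
    upper = np.asarray(upper, dtype=np.uint8)
    in_range = np.all((original_image >= lower) & (original_image <= upper), axis=-1)
    patch_area = int(in_range.sum())                      # pixels forming the color patch
    total_area = original_image.shape[0] * original_image.shape[1]
    return patch_area / total_area
```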
In one embodiment of the present application, the color of the original image includes the color corresponding to the pixel points in the original image whose color values fall within a preset range; the processing unit 702 is specifically configured to, in the aspect of adjusting the color of the first image based on the color of the original image to obtain the second image:
acquiring a first region formed by the pixel points in the original image whose color values fall within the preset range;
and covering the first region onto the first image as a mask to obtain the second image, wherein the position at which the mask covers the first image is the same as the position of the first region in the original image.
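The masking step can be sketched as follows in Python/NumPy; the boolean mask is assumed to have been derived from the preset color range, and direct pixel replacement is an illustrative choice rather than the prescribed covering operation.

```python
import numpy as np

def cover_first_image(first_image: np.ndarray,
                      original_image: np.ndarray,
                      in_range_mask: np.ndarray) -> np.ndarray:
    """Cover the first image with the first region of the original image.

    `in_range_mask` is an HxW boolean mask marking the pixels of the original
    image whose color values fall within the preset range; the covered position
    on the first image matches the region's position in the original image
    (both images are assumed to have the same size).
    """
    second_image = first_image.copy()
    second_image[in_range_mask] = original_image[in_range_mask]
    return second_image
```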
In one embodiment of the present application, the processing unit 702 is specifically configured to, in the aspect of converting the original image into the first image:
And converting the original image into the first image based on the original image, a target style and the image description, wherein the style of the first image is the target style.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic device 800 includes a transceiver 801, a processor 802 and a memory 803, which are connected by a bus 804. The memory 803 is used to store computer programs and data, and the data stored in the memory 803 can be transferred to the processor 802.
The processor 802 is configured to read a computer program in the memory 803 to perform the following operations:
controlling the transceiver 801 to acquire an original image;
Converting an original image into a first image, wherein the content of the first image at least comprises the content of the original image;
based on the color of the original image, adjusting the color of the first image to obtain a second image;
and adjusting the second image based on the color of the original image and the image description of the original image to obtain a first target image, wherein the image description is at least used for describing the content of the original image.
Specifically, the transceiver 801 may be the transceiver unit 701 of the image conversion apparatus 700 of the embodiment of fig. 7, and the processor 802 may be the processing unit 702 of the image conversion apparatus 700 of the embodiment of fig. 7. Accordingly, the specific function of the processor 802 may refer to the specific function of the processing unit 702, and the specific function of the transceiver 801 may refer to the specific function of the transceiver unit 701.
It should be understood that the electronic device in the present application may include a smartphone (such as an Android phone, an iOS phone or a Windows Phone device), a tablet computer, a palmtop computer, a notebook computer, a mobile internet device (MID), a wearable device, and the like. The devices listed above are merely examples and are not exhaustive; in practical applications, the electronic device may further include an intelligent vehicle terminal, a computer device, and the like.
The embodiment of the present application also provides a computer-readable storage medium storing a computer program that is executed by a processor to implement part or all of the steps of any one of the image conversion methods described in the above method embodiments.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform part or all of the steps of any one of the image conversion methods described in the method embodiments above.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, because some steps may be performed in other orders or concurrently according to the present application. Further, those skilled in the art should also understand that the embodiments described in the specification are optional embodiments, and the actions and modules involved are not necessarily required by the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division of logical functions, and there may be other divisions in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units described above may be implemented either in hardware or in software program modules.
If the integrated units are implemented in the form of software program modules and sold or used as a stand-alone product, they may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a memory, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
Those of ordinary skill in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable memory, which may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The embodiments of the present application have been described in detail above, and specific examples are used herein to explain the principles and implementations of the application; the above descriptions of the embodiments are provided only to help understand the method and core ideas of the application. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope according to the ideas of the present application; in summary, the contents of this specification should not be construed as limiting the present application.

Claims (10)

1. An image conversion method, comprising:
Converting an original image into a first image, wherein the content of the first image at least comprises the content of the original image;
based on the color of the original image, adjusting the color of the first image to obtain a second image;
and adjusting the second image based on the color of the original image and the image description of the original image to obtain a first target image, wherein the image description is at least used for describing the content of the original image.
2. The method of claim 1, wherein adjusting the second image based on the color of the original image and the image description of the original image results in a first target image, comprising:
Adding noise to the second image at least once based on the color of the original image to obtain a first noise image;
And denoising the first noise image at least once through the color of the original image and the image description of the original image to obtain the first target image.
3. The method according to claim 2, wherein adding noise to the second image at least once based on the color of the original image, to obtain a first noise image, comprises:
Determining a color duty ratio of the original image;
Obtaining the noise adding times N based on the color duty ratio;
and adding N times of noise to the second image to obtain the first noise image.
4. A method according to claim 2 or 3, wherein said denoising the first noisy image at least once by the color of the original image and the image description of the original image to obtain the first target image comprises:
Extracting text features from the image description of the original image to obtain text features;
Fusing the color of the original image with the first noise image to obtain a second noise image;
Extracting color features of the second noise image to obtain color features;
encoding the first noise image and fusing the text features during encoding to obtain a first feature map;
extracting the characteristics of the first characteristic map, and fusing the text characteristics and the color characteristics during characteristic extraction to obtain a second characteristic map;
decoding the second feature map, and fusing the text features and the color features during decoding to obtain a target feature map;
and decoding the target feature map to obtain the first target image.
5. A method according to claim 3, wherein said determining the color duty ratio of the original image comprises:
Acquiring the area of the original image and the color value of each pixel point in the original image;
determining a plurality of pixel points with color values within a preset range based on the color value of each pixel point;
determining an area of a color patch based on the plurality of pixel points;
and determining the color duty ratio of the original image based on the area of the color patch and the area of the original image.
6. The method of any one of claims 1-5, wherein the color of the original image comprises: the color corresponding to pixel points in the original image whose color values fall within a preset range; and the adjusting the color of the first image based on the color of the original image to obtain a second image comprises:
acquiring a first region formed by the pixel points in the original image whose color values fall within the preset range;
and covering the first region onto the first image as a mask to obtain the second image, wherein the position at which the mask covers the first image is the same as the position of the first region in the original image.
7. The method of any of claims 1-6, wherein the converting the original image into the first image comprises:
And converting the original image into the first image based on the original image, a target style and the image description, wherein the style of the first image is the target style.
8. An image conversion apparatus, characterized in that the apparatus comprises: a transceiver unit and a processing unit;
The transceiver unit is used for acquiring an original image;
The processing unit is used for converting an original image into a first image, wherein the content of the first image at least comprises the content of the original image;
The processing unit is used for adjusting the color of the first image based on the color of the original image to obtain a second image;
The processing unit is further used for adjusting the second image based on the color of the original image and the image description of the original image to obtain a first target image, wherein the image description is at least used for describing the content of the original image.
9. An electronic device, comprising: a processor and a memory, the processor being connected to the memory, the memory being for storing a computer program, the processor being for executing the computer program stored in the memory to cause the electronic device to perform the method of any one of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program, which is executed by a processor to implement the method of any of claims 1-7.
CN202410266728.2A 2024-03-08 Image conversion method, device, electronic equipment and storage medium Pending CN118261807A (en)

Publications (1)

Publication Number: CN118261807A; Publication Date: 2024-06-28

Legal Events

PB01: Publication