CN117456026A - Image processing method and device


Info

Publication number
CN117456026A
Authority
CN
China
Prior art keywords
image
keyword
text
model
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311393531.7A
Other languages
Chinese (zh)
Inventor
关文政
王英博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202311393531.7A
Publication of CN117456026A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/5846 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiments of this specification provide an image processing method and apparatus. The image processing method includes: after a descriptive text and a theme image for image generation are obtained, converting each text keyword contained in the descriptive text into a corresponding image keyword; determining the model parameters of the sub-models corresponding to the target image libraries matched with the image keywords, and loading the model parameters into a pre-trained model to obtain an image generation model; and performing image generation through the image generation model based on the image keywords and the theme keywords determined from the theme image, so as to obtain a target image.

Description

Image processing method and device
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image processing method and apparatus.
Background
With the continuous development and popularization of the internet, AI-generated content (AIGC) technology has become widespread, and generative models are being applied rapidly. One important application of generative models is image generation based on dialogue or text input by a user. However, the amount of data required to train a generative model is usually large and the cost of labeling training data is high, which limits the development of such models. At the same time, how to apply generative models in various fields at low cost has become a focus of attention.
Disclosure of Invention
One or more embodiments of the present specification provide an image processing method, including: acquiring a descriptive text and a theme image for image generation; converting each text keyword contained in the descriptive text to obtain the image keywords corresponding to the text keywords; determining, in a parameter library, the model parameters of the sub-models corresponding to the target image libraries matched with the image keywords, and loading the model parameters into a pre-trained model to obtain an image generation model; and inputting the image keywords and the theme keywords determined based on the theme image into the image generation model for image generation, so as to obtain a target image.
One or more embodiments of the present specification provide an image processing apparatus, including: an acquisition module configured to acquire a descriptive text and a theme image for image generation; a conversion module configured to convert each text keyword contained in the descriptive text to obtain the image keywords corresponding to the text keywords; a loading module configured to determine, in a parameter library, the model parameters of the sub-models corresponding to the target image libraries matched with the image keywords, and to load the model parameters into a pre-trained model to obtain an image generation model; and an image generation module configured to input the image keywords and the theme keywords determined based on the theme image into the image generation model for image generation, so as to obtain a target image.
One or more embodiments of the present specification provide an image processing apparatus, including: a processor; and a memory configured to store computer-executable instructions that, when executed, cause the processor to: acquire a descriptive text and a theme image for image generation; convert each text keyword contained in the descriptive text to obtain the image keywords corresponding to the text keywords; determine, in a parameter library, the model parameters of the sub-models corresponding to the target image libraries matched with the image keywords, and load the model parameters into a pre-trained model to obtain an image generation model; and input the image keywords and the theme keywords determined based on the theme image into the image generation model for image generation, so as to obtain a target image.
One or more embodiments of the present specification provide a storage medium storing computer-executable instructions that, when executed by a processor, implement the following: acquiring a descriptive text and a theme image for image generation; converting each text keyword contained in the descriptive text to obtain the image keywords corresponding to the text keywords; determining, in a parameter library, the model parameters of the sub-models corresponding to the target image libraries matched with the image keywords, and loading the model parameters into a pre-trained model to obtain an image generation model; and inputting the image keywords and the theme keywords determined based on the theme image into the image generation model for image generation, so as to obtain a target image.
Drawings
In order to describe the technical solutions in one or more embodiments of the present specification or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below are only some of the embodiments described in the present specification, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an environment in which an image processing method according to one or more embodiments of the present disclosure is implemented;
FIG. 2 is a process flow diagram of an image processing method according to one or more embodiments of the present disclosure;
FIG. 3 is a schematic illustration of a subject image and a target image provided in one or more embodiments of the present disclosure;
FIG. 4 is a process flow diagram of an image processing method for a merchant document scene according to one or more embodiments of the present disclosure;
FIG. 5 is a schematic diagram of an embodiment of an image processing apparatus according to one or more embodiments of the present disclosure;
fig. 6 is a schematic structural diagram of an image processing apparatus according to one or more embodiments of the present disclosure.
Detailed Description
In order to enable a person skilled in the art to better understand the technical solutions in one or more embodiments of the present specification, the technical solutions in one or more embodiments of the present specification will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present specification. All other embodiments obtained by a person skilled in the art based on one or more embodiments of the present specification without inventive effort are intended to fall within the scope of the present disclosure.
The image processing method provided in one or more embodiments of the present disclosure may be applied to an implementation environment of an image generating system, and referring to fig. 1, the implementation environment includes at least:
a pre-training model 101 for image generation, a parameter library 102 for storing model parameters of each sub-model, and one or more image libraries 103; in addition, the implementation environment may include a language model 104;
In this implementation environment, during image generation based on an input descriptive text and theme image, each text keyword contained in the descriptive text is first converted into a corresponding image keyword; specifically, the text keywords are input into the language model 104 for conversion processing, yielding the image keywords. Next, the target image libraries matched with the image keywords are determined among the one or more image libraries 103, the model parameters of the sub-models corresponding to the target image libraries are read from the parameter library 102, and these model parameters are loaded into the pre-trained model 101 to obtain an image generation model. Finally, the image keywords and the theme keywords of the theme image are input into the image generation model for image generation, and the target image output by the image generation model is obtained.
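A minimal sketch of how the components in Fig. 1 could be wired together is given below, assuming a Python interface. It is purely illustrative: the patent defines no programming interface, so every class, method, and parameter name here (LanguageModel, GenerationModel, run_pipeline, and so on) is a hypothetical placeholder for components 101-104; the matching, extraction, and conversion logic is sketched in later sections.

```python
from typing import Dict, List, Protocol


class LanguageModel(Protocol):
    """Stand-in for the language model 104."""
    def convert(self, text_keywords: List[str]) -> List[str]: ...


class GenerationModel(Protocol):
    """Stand-in for the pre-trained model 101 (the image generation model after loading)."""
    def load_sub_model_parameters(self, params: List[dict]) -> "GenerationModel": ...
    def generate(self, image_keywords: List[str], theme_keywords: List[str]) -> object: ...


def run_pipeline(text_keywords: List[str],            # target word segments of the descriptive text
                 theme_keywords: List[str],            # theme keywords extracted from the theme image
                 language_model: LanguageModel,        # 104
                 match_libraries,                      # callable: image keywords -> matched library names
                 parameter_library: Dict[str, dict],   # 102: library name -> sub-model parameters
                 pretrained_model: GenerationModel):   # 101
    image_keywords = language_model.convert(text_keywords)        # text keywords -> image keywords
    target_library_names = match_libraries(image_keywords)        # target image libraries among 103
    sub_model_params = [parameter_library[n] for n in target_library_names]
    image_generation_model = pretrained_model.load_sub_model_parameters(sub_model_params)
    return image_generation_model.generate(image_keywords, theme_keywords)  # target image
```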
One or more embodiments of an image processing method provided in the present specification are as follows:
referring to fig. 2, the image processing method provided in the present embodiment specifically includes steps S202 to S208.
Step S202, a descriptive text and a theme image for image generation are acquired.
According to the image processing method provided in this embodiment, each text keyword contained in the descriptive text for image generation is converted into a corresponding image keyword, which improves the suitability of the keywords for image generation. The model parameters of the sub-models corresponding to the target image libraries matched with the image keywords are then determined in the parameter library and loaded into the pre-trained model to obtain an image generation model. Finally, the image keywords and the theme keywords determined based on the theme image are input into the image generation model for image generation, yielding the target image.
The descriptive text in this embodiment is the descriptive text required for generating an image; it includes the scene description text of the target service and may consist of one or more words, for example the descriptive text "a rabbit celebrating the new year". The theme image is a template image used as a reference for image generation, as shown in Fig. 3(1).
In practical applications, a merchant or a user may need to generate a target image from a piece of text and a theme image; the descriptive text and the theme image input by the merchant or the user for image generation are therefore acquired.
After the descriptive text and the theme image for image generation are acquired, the subsequently generated target image can be constrained by theme keywords determined from the theme image, so as to avoid excessive divergence of the target image and improve its suitability. In an optional implementation provided in this embodiment, the following operation is further performed after the descriptive text and the theme image are acquired:
and inputting the theme images into a neural network model to extract theme keywords, and obtaining the theme keywords.
The neural network model may be, for example, a ControlNet model, through which theme keyword extraction is performed. There may be one or more theme keywords.
Further, in order to improve the comprehensiveness and flexibility of extracting the topic keyword and meet the diversified requirements of extraction, in an optional implementation manner provided in this embodiment, the topic keyword is extracted in the following manner:
identifying the image elements in the theme image, performing action detection on the image elements, and determining action keywords based on the action detection results;
performing boundary detection on the theme image to obtain the image element boundaries in the theme image, and determining contour keywords based on the image element boundaries;
performing theme type analysis on the theme image to obtain the theme type of the theme image, and determining theme type keywords according to the theme type.
The image elements are the elements contained in the theme image; for example, the image element contained in the theme image shown in Fig. 3(1) is a rabbit. The action detection result is a detection result characterizing the action information of an image element, and an action keyword is a keyword characterizing the action of the image element required for image generation. A contour keyword is a keyword characterizing the contour of the image element required for image generation. The theme type of the theme image may be its theme style, and a theme type keyword is a keyword characterizing the theme type (that is, the theme style) required for image generation. The image element boundary is the boundary information of an image element contained in the theme image. The action keywords, contour keywords and theme type keywords here are all theme keywords.
Specifically, when determining the action keyword based on the action detection result, the action detection result may be used directly as the action keyword, or it may be transformed to obtain the action keyword; when determining the contour keyword based on the image element boundary, the element contour information corresponding to the boundary may be used as the contour keyword, or the boundary may be transformed to obtain the contour keyword; likewise, when determining the theme type keyword, the theme type may be used directly, or it may be transformed to obtain the theme type keyword.
For example, the image element identified in the theme image shown in Fig. 3(1) is a rabbit. Action detection on the rabbit yields the detection result "holding a cake", from which the action keyword "holding a lucky transfer bead" is determined. Boundary detection on the theme image yields the rabbit boundary, here the boundary of a rabbit facing to the right, from which the contour keyword "front-facing rabbit contour" is determined. Theme type analysis on the theme image yields a traditional theme type, from which the theme type keyword "concise theme type" is determined, i.e. the traditional theme type is transformed into a concise theme type.
In addition, when extracting the theme keywords, any one or any two of the three procedures above (determining action keywords from action detection, determining contour keywords from boundary detection, and determining theme type keywords from theme type analysis) may be performed instead of all three, and the one or two kinds of keywords obtained are then used as the theme keywords.
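As a rough illustration of the three extraction branches above, the sketch below derives action, contour and theme type keywords from a theme image. It assumes OpenCV for boundary detection; the action detector and theme type classifier are left as stand-in functions, since the patent only states that a neural network model (for example a ControlNet-style model) performs these steps.

```python
from typing import List

import cv2
import numpy as np


def detect_action(image: np.ndarray) -> str:
    # Stand-in for the action-detection branch; a real system would run an
    # action-recognition model here and optionally transform its result.
    return "holding an object"


def classify_theme_type(image: np.ndarray) -> str:
    # Stand-in for the theme type analysis branch (e.g. traditional -> concise).
    return "concise theme type"


def extract_theme_keywords(image: np.ndarray) -> List[str]:
    # Action keyword: determined from the action-detection result.
    action_keyword = detect_action(image)

    # Contour keyword: boundary detection on the theme image; Canny edges are
    # used here only as a simple proxy for the image element boundary.
    edges = cv2.Canny(cv2.cvtColor(image, cv2.COLOR_BGR2GRAY), 100, 200)
    contour_keyword = "front-facing contour" if edges.any() else "no clear contour"

    # Theme type keyword: determined from the theme type analysis result.
    theme_type_keyword = classify_theme_type(image)

    return [action_keyword, contour_keyword, theme_type_keyword]
```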
Step S204, converting each text keyword contained in the descriptive text to obtain each image keyword corresponding to each text keyword.
In this step, each text keyword contained in the descriptive text is converted into a corresponding image keyword, moving the keywords from the text dimension to the image dimension and making the subsequent image generation more convenient.
A text keyword in this embodiment is one of the target word segments among the plurality of word segments obtained by performing word segmentation on the descriptive text. The image keywords corresponding to the text keywords are the image keywords obtained by converting the text keywords; there may be one or more text keywords, and each text keyword has its own corresponding image keywords.
In practical applications, the text keywords contained in the descriptive text may be abstract and hard to interpret, so converting them into image keywords improves their readability, facilitates image generation, and enhances the understanding of the descriptive text. To make the conversion convenient, the text keywords are determined at a fine-grained level and then converted. In an optional implementation provided in this embodiment, the conversion of each text keyword contained in the descriptive text into its corresponding image keywords is performed as follows:
Performing word segmentation processing on the descriptive text to obtain a plurality of text word segments of the descriptive text;
and determining target word segments among the text word segments as the text keywords, and converting each text keyword into its corresponding image keyword.
Following the example above, the descriptive text is "a rabbit celebrating the new year". Word segmentation yields the text word segments "celebrating", "new year" and "rabbit"; the target word segments "new year" and "rabbit" are determined among them as the text keywords. The text keyword "rabbit" is converted into the corresponding image keyword "Chinese zodiac rabbit", and the text keyword "new year" is converted into the corresponding image keywords "new year festival" and "pasting couplets".
In addition, in order to improve the efficiency and convenience of converting the text keywords contained in the descriptive text into image keywords, a language model can be introduced and the conversion performed through it. In an optional implementation provided in this embodiment, the conversion is performed as follows:
the text keywords are input into a language model for conversion processing, and the specific keywords and/or derivative keywords corresponding to the text keywords are obtained as the image keywords.
The language model may be a neural network language model, for example an LLM (Large Language Model). What is obtained here are the specific keywords and/or derivative keywords corresponding to each of the text keywords. A specific keyword is a concrete keyword, for example one directly or indirectly related to a commodity; a derivative keyword is a keyword derived from a text keyword or from a specific keyword. For example, if the text keywords are "new year" and "rabbit", the image keyword corresponding to "rabbit" is "Chinese zodiac rabbit", which is a specific keyword, and the image keywords corresponding to "new year" are "new year festival" and "pasting couplets", of which "new year festival" is a specific keyword and "pasting couplets" is a derivative keyword.
In a specific implementation, after the text keywords contained in the descriptive text have been converted into their corresponding image keywords, the target image library matched with each image keyword can be determined so that the model parameters of the corresponding sub-models can be read. To improve the accuracy of this determination, the matching is performed at the vector level: for each image keyword, the target image library is determined among the image libraries according to the semantic similarity between the keyword vector of the image keyword and the text vectors of the image description texts in the image libraries. Specifically, in an optional implementation provided in this embodiment, the following operations are further performed after the conversion:
Vectorizing each image keyword to obtain the keyword vector of each image keyword;
and calculating semantic similarity according to the keyword vector and the text vector of the image description text in each image library, and determining each target image library matched with each image keyword in each image library according to the semantic similarity.
The keyword vector may be an embedding vector. The image libraries are one or more pre-built image libraries, each corresponding to its own image type, i.e. different image libraries store different types of images. Each image library also stores the image description texts of its images, an image description text being a text describing the content of an image. Meanwhile, each image library can be used to train its corresponding sub-model so as to obtain that sub-model's model parameters.
Specifically, the keyword embedding of each image keyword can be calculated, the cosine similarity computed between the keyword embedding of each image keyword and the text embeddings of the image description texts in each image library, and the target image library matched with each image keyword determined among the image libraries according to the cosine similarity.
More specifically, each image keyword is vectorized to obtain its keyword vector; for each image keyword, the semantic similarity between its keyword vector and the text vectors of the image description texts stored in each image library is calculated, and the target image library matched with that image keyword is determined among the image libraries according to the semantic similarity, i.e. each image keyword is matched with its own target image library.
For example, the image keywords are "Chinese zodiac rabbit", "new year festival" and "pasting couplets". The keyword embedding of each image keyword is calculated, the cosine similarity is computed between each keyword embedding and the text embeddings of the image description texts in each image library, and, according to the cosine similarity, the target image library matched with "Chinese zodiac rabbit" is determined to be the animal image library, the target image library matched with "new year festival" the festival image library, and the target image library matched with "pasting couplets" the folk culture image library.
It should be noted that each image library may store one or more image description texts. If an image library stores a single image description text, the semantic similarity is calculated between the keyword vector and the text vector of that description text, and the target image library matched with each image keyword is determined accordingly. If an image library stores multiple image description texts, the vectors of those description texts can be spliced to obtain the text vector of that image library; the semantic similarity between the keyword vector of each image keyword and the text vector of each image library is then calculated, and the target image library matched with each image keyword is determined among the image libraries according to the semantic similarity.
Further, in order to improve the convenience of determining each target image library matched with each image keyword, in an optional implementation manner provided in this embodiment, in a process of determining each target image library matched with each image keyword in each image library according to the semantic similarity, the following operations are performed:
sorting the image libraries in a descending order according to the semantic similarity;
and determining an image library with the sorting position before the preset position from the sorting result as each target image library matched with each image keyword.
Descending order means ordering from largest to smallest. The preset position may be, for example, the second position, in which case the image library ranked before the preset position is the image library in the first position of the sorting result.
Specifically, for each image keyword, the image libraries are sorted in descending order of their semantic similarity to that image keyword, and the image library ranked before the preset position in the sorting result is determined as the target image library matched with that image keyword.
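The matching logic above can be sketched with plain NumPy as follows. The embeddings are assumed to be supplied by some text-embedding model, which the patent does not specify; when an image library stores several image description texts, their vectors are combined here by averaging, one simple alternative to the splicing mentioned above.

```python
from typing import Dict, List

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def match_target_libraries(keyword_vectors: Dict[str, np.ndarray],
                           library_text_vectors: Dict[str, List[np.ndarray]],
                           top_k: int = 1) -> Dict[str, List[str]]:
    """For each image keyword, rank the image libraries by semantic similarity (descending)
    and keep the libraries ranked before the preset position (top_k)."""
    matches: Dict[str, List[str]] = {}
    for keyword, kw_vec in keyword_vectors.items():
        scores = []
        for lib_name, text_vecs in library_text_vectors.items():
            lib_vec = np.mean(np.stack(text_vecs), axis=0)   # combine the library's description-text vectors
            scores.append((lib_name, cosine_similarity(kw_vec, lib_vec)))
        scores.sort(key=lambda item: item[1], reverse=True)  # descending order
        matches[keyword] = [name for name, _ in scores[:top_k]]
    return matches
```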
In this embodiment, when the one or more image libraries are built in advance, the images may be classified according to the element types of the image elements they contain, so as to build image libraries for different element types; for example, a red-packet image library is built from images whose image elements are red packets together with their image description texts, a gold-coin image library from images whose image elements are gold coins, and a dish image library from images whose image elements are dishes. The images may also be classified according to the service type to which they belong, so as to build image libraries for different service types; for example, a sports-service image library is built from images of the sports service type and their image description texts, and a fund-service image library from images of the fund service type. The images may further be classified according to the element types of the background elements they contain; for example, a transparent-background image library is built from images with transparent backgrounds and their image description texts, and a white-background image library from images with white backgrounds and their image description texts.
Step S206, determining model parameters of the sub-models corresponding to the target image libraries matched with the image keywords in a parameter library, and loading the model parameters into a pre-trained model to obtain an image generation model.
In this step, the model parameters of the sub-models corresponding to the target image libraries matched with the image keywords are determined in the parameter library and loaded into the pre-trained model to obtain an image generation model. Loading the pre-trained model with parameters in real time in this way improves the convenience and flexibility of running the image generation model.
The target image libraries described in this embodiment are the target image libraries determined, among the pre-built image libraries, to match the image keywords, i.e. each image keyword is matched with its own target image library. The sub-model corresponding to a target image library is the sub-model in the pre-trained model that corresponds to that image library. The pre-trained model is a model trained in advance for image generation, for example a pre-trained Stable Diffusion model that generates images from input keywords; it may also be another type of AIGC (AI-generated content) model.
Optionally, the image libraries, the sub-models in the pre-training model and the model parameters of the sub-models in the parameter library are in one-to-one correspondence. For example, each image library comprises a festival image library, a dish image library and a folk culture image library, the pre-training model comprises a sub-model corresponding to the festival image library, a sub-model corresponding to the dish image library and a sub-model corresponding to the folk culture image library, and the parameter library comprises model parameters of the sub-model corresponding to the festival image library, model parameters of the sub-model corresponding to the dish image library and model parameters of the sub-model corresponding to the folk culture image library.
Each sub-model in the pre-trained model is a new processing layer added to the pre-trained model; the added processing layer is called a sub-model. Each sub-model added to the pre-trained model is trained on the corresponding image library to obtain its model parameters; for example, a first sub-model added to the pre-trained model is trained on the festival image library to obtain the model parameters of the sub-model corresponding to the festival image library, and a second sub-model is trained on the dish image library to obtain the model parameters of the sub-model corresponding to the dish image library. By adding different sub-models to the pre-trained model, different types of adjustment can be applied to the images output by the preprocessing module in the pre-trained model, improving the image processing capability.
It should be added that after the sub-models in the pre-trained model have been trained on the respective image libraries, their model parameters can be stored in the parameter library. In other words, a sub-model in the pre-trained model has no image adjustment capability before parameter loading; it gains that capability only after its model parameters have been loaded. Optionally, the model parameters of the sub-models corresponding to the image libraries are obtained by training the pre-trained model with a low-rank adaptation plugin (LoRA, Low-Rank Adaptation).
On the basis of calculating the semantic similarity between the keyword vectors and the text vectors of the image description texts in the image libraries and determining the matched target image libraries accordingly, in an optional implementation provided in this embodiment the model parameters of the sub-model corresponding to each image library are obtained as follows:
inputting the image description text in each image library and the sample theme keywords determined based on the theme image sample into a preprocessing module in the pre-training model for image preprocessing to obtain a preprocessed image;
Inputting the preprocessed images into sub-models corresponding to the image libraries in the pre-training model to perform image adjustment processing to obtain adjustment images;
and calculating a loss value based on the adjustment image and the sample image, and carrying out parameter adjustment on the sub-model corresponding to each image library based on the loss value so as to obtain model parameters of the sub-model corresponding to each image library.
The sample theme keywords can be obtained by extracting theme keywords from the theme image sample through the neural network model, in a manner similar to the theme keywords determined from the theme image; the theme image sample is the theme image used during model training. The sample image is a sample image that combines the image description text in the image library with the sample theme keywords, i.e. it conforms to both the image description text and the sample theme keywords.
Specifically, the model parameters of the sub-model corresponding to any one of the image libraries are obtained by adopting the following modes: inputting an image description text in any image library and a sample topic keyword determined based on a topic image sample into a preprocessing module in a pre-training model to perform image preprocessing to obtain a preprocessed image; inputting the preprocessed image into a sub-model corresponding to any image library in the pre-training model to perform image adjustment processing to obtain an adjustment image; and calculating a loss value based on the adjustment image and the sample image, and carrying out parameter adjustment on the sub-model corresponding to any image library based on the loss value so as to obtain the model parameters of the sub-model corresponding to any image library.
In addition, each sub-model in the model to be trained may also be trained separately from the image description text dimension and from the theme keyword dimension, in which case the following operations are performed:
inputting the image description text in each image library into a preprocessing module in a pre-training model to perform image preprocessing to obtain a second preprocessed image;
inputting the second preprocessed image into sub-models corresponding to each image library in the pre-training model to perform image adjustment processing to obtain a second adjustment image;
and calculating a second loss value based on the second adjustment image and a second sample image corresponding to the image description text in each image library, and carrying out parameter adjustment on the sub-model corresponding to each image library based on the second loss value so as to obtain a second model parameter of the sub-model corresponding to each image library.
On the basis, the sample theme keywords can be input into a preprocessing module in a pre-training model to perform image preprocessing, so as to obtain a third preprocessed image;
inputting the third preprocessed image into sub-models corresponding to each image library in the pre-training model to perform image adjustment processing to obtain a third adjustment image;
and calculating a third loss value based on the third adjustment image and the subject image sample, and carrying out parameter adjustment on the sub-model corresponding to each image library based on the third loss value so as to obtain a third model parameter of the sub-model corresponding to each image library.
Further, the model parameters of the sub-model corresponding to each image library are determined from its second model parameters and third model parameters, and the subsequent parameter loading is performed using these model parameters, as in the training sketch below.
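The following is a hedged PyTorch-style sketch of the training loop for one image library. The patent does not name a loss function or an optimizer, so mean squared error and Adam are used purely as examples; preprocess, sub_model and training_pairs stand for the preprocessing module, the added processing layer (e.g. a LoRA adapter) and the (image description text / sample theme keyword, sample image) training data respectively.

```python
import torch
import torch.nn.functional as F


def train_sub_model(preprocess, sub_model, training_pairs, epochs: int = 10, lr: float = 1e-4):
    """Fit the sub-model of one image library; only its parameters are updated."""
    optimizer = torch.optim.Adam(sub_model.parameters(), lr=lr)
    for _ in range(epochs):
        for conditions, sample_image in training_pairs:
            with torch.no_grad():
                preprocessed = preprocess(conditions)      # image description text and/or sample theme keywords
            adjusted = sub_model(preprocessed)              # image adjustment processing
            loss = F.mse_loss(adjusted, sample_image)       # loss between adjusted image and sample image (example loss)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Model parameters of the sub-model, to be stored in the parameter library.
    return {name: p.detach().clone() for name, p in sub_model.named_parameters()}
```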
In a specific implementation, after the target image libraries matched with the image keywords have been determined, the model parameters of the sub-models corresponding to those target image libraries can be read from the parameter library and loaded into the pre-trained model to obtain the image generation model; that is, the read model parameters of each sub-model are loaded into the corresponding sub-model in the pre-trained model, and the pre-trained model after loading is the image generation model.
Following the example above, the target image library matched with "Chinese zodiac rabbit" is determined to be the animal image library, the target image library matched with "new year festival" the festival image library, and the target image library matched with "pasting couplets" the folk culture image library. The model parameters of the sub-models corresponding to the animal image library, the festival image library and the folk culture image library are loaded into the three corresponding sub-models in the pre-trained model; after the three sets of model parameters are loaded, the pre-trained model has the capability of adjusting images into images of the festival, animal and folk culture types.
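Parameter loading at inference time can be sketched as below. It assumes the pre-trained model is a PyTorch module whose added sub-model layers accept the stored state dicts, so strict=False leaves the base weights untouched while only the sub-model parameters are filled in; this representation is an assumption, not something the patent prescribes.

```python
from typing import Dict, List

import torch


def load_image_generation_model(pretrained_model: torch.nn.Module,
                                parameter_library: Dict[str, Dict[str, torch.Tensor]],
                                target_libraries: List[str]) -> torch.nn.Module:
    """Read the sub-model parameters of each matched target image library and load them
    into the pre-trained model to obtain the image generation model."""
    for library_name in target_libraries:
        sub_model_state = parameter_library[library_name]       # model parameters of the corresponding sub-model
        pretrained_model.load_state_dict(sub_model_state, strict=False)
    return pretrained_model                                      # now the image generation model
```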
Step S208, inputting the image keywords and the theme keywords determined based on the theme image into the image generation model for image generation, so as to obtain the target image.
After the model parameters of the sub-models corresponding to the target image libraries matched with the image keywords have been determined in the parameter library and loaded into the pre-trained model to obtain the image generation model, this step performs image generation through the image generation model.
In this embodiment, the theme keywords determined based on the theme image are keywords characterizing the theme required for generating the image; they include theme keywords in at least one dimension, namely action keywords, contour keywords and/or theme type keywords.
In a specific implementation, in order to improve the efficiency and accuracy of image generation, the image is generated by the image generation model obtained through real-time loading. In the process of inputting the image keywords and the theme keywords determined based on the theme image into the image generation model for image generation, the following operations may be performed:
Inputting the image keywords and the theme keywords determined based on the theme images into a preprocessing module in the image generation model for image preprocessing to obtain preprocessed images;
and carrying out image adjustment processing on the preprocessed image through each sub-model in the image generation model to obtain the target image.
There may be one or more image keywords, and the image keywords may correspond one-to-one with the sub-models. The image preprocessing here includes generating a preprocessed image based on the image keywords and the theme keywords.
Specifically, when there is a single image keyword, the image keyword and the theme keywords determined based on the theme image can be input into the preprocessing module in the image generation model for image preprocessing to obtain a preprocessed image, and the preprocessed image is then input into the sub-model corresponding to the target image library matched with that image keyword for image adjustment processing, yielding the target image.
When there are multiple image keywords, the image keywords and the theme keywords determined based on the theme image can be input into the preprocessing module in the image generation model for image preprocessing to obtain a preprocessed image, and the preprocessed image is then adjusted in a serial manner by the sub-models corresponding to the target image libraries matched with the image keywords, yielding the target image. In the serial manner, the preprocessed image is input into the current sub-model in the image generation model for image adjustment processing to obtain a candidate adjusted image, and the candidate adjusted image is input into the next sub-model for further image adjustment processing, until the target image is obtained; the first sub-model receives the preprocessed image and the last sub-model outputs the target image. The adjustment order of the sub-models is not specifically limited: the execution order may be determined at random or may follow a preset execution order.
For example, the image keywords are "Chinese zodiac rabbit", "new year festival" and "pasting couplets", and the theme keywords are "holding a lucky transfer bead", "front-facing rabbit contour" and "concise theme type". The image keywords and theme keywords are input into the preprocessing module in the image generation model for image generation, yielding an image of a front-facing rabbit holding a lucky transfer bead. This image is input into the sub-model corresponding to "Chinese zodiac rabbit" for image adjustment processing, yielding a front-facing rabbit image carrying zodiac animal information; that image is input into the sub-model corresponding to "new year festival", yielding a front-facing rabbit image carrying zodiac animal information and festival information; and that image is input into the sub-model corresponding to "pasting couplets", yielding a front-facing rabbit image carrying zodiac animal, festival and folk culture information, which is the target image shown in Fig. 3(2).
In addition to adjusting the preprocessed image serially through the sub-models of the image generation model, the sub-models may also adjust the preprocessed image in parallel to obtain the target image. Specifically, the preprocessed image is input into each sub-model of the image generation model separately to obtain the intermediate images output by the sub-models, and a fusion module in the image generation model fuses the intermediate images into the target image.
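The serial and parallel adjustment modes described above can be expressed compactly as follows; the fusion step in the parallel mode is shown as simple averaging, which is only one possible choice for the fusion module.

```python
from functools import reduce
from typing import Callable, List

import numpy as np

Adjust = Callable[[np.ndarray], np.ndarray]   # one sub-model's image adjustment


def adjust_serial(preprocessed: np.ndarray, sub_models: List[Adjust]) -> np.ndarray:
    # Each sub-model receives the output of the previous one; the last output is the target image.
    return reduce(lambda image, sub_model: sub_model(image), sub_models, preprocessed)


def adjust_parallel(preprocessed: np.ndarray, sub_models: List[Adjust]) -> np.ndarray:
    # Each sub-model adjusts the preprocessed image independently; the intermediate
    # images are then fused (here: averaged) into the target image.
    intermediates = [sub_model(preprocessed) for sub_model in sub_models]
    return np.mean(np.stack(intermediates), axis=0)
```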
In addition, the image generation model may also perform image generation as follows: an image is first generated based on the image keywords, and the generated image is then corrected according to the theme keywords determined based on the theme image to obtain the target image.
In practical applications, the generated target image may show poor fusion between its background element and its main body element, an obvious sense of splitting, burrs, and similar defects, which degrade the image effect and the viewing experience of the user. In view of this, in order to enhance the image effect of the target image and improve its degree of fusion, in an optional implementation provided in this embodiment the following operations are further performed after the target image has been obtained:
extracting elements from the target image based on the element extraction information of the target image to obtain a main element and a background element;
and carrying out smoothing processing on the target image based on the main body element and the background element to obtain a smoothed image.
The element extraction information is information prompting how to perform element extraction; it may be an element extraction position and/or an element extraction text, where the element extraction position includes element extraction points and/or element extraction boxes, for example which extraction points are used for main body element extraction in the target image and which extraction boxes are used for background element extraction.
On this basis, in order to further improve the accuracy and effectiveness of element extraction, the image and the element extraction information can be converted to the feature level, and the element extraction mode determined at the feature level before the extraction itself is performed. In an optional implementation provided in this embodiment, the main body element and the background element are extracted from the target image based on its element extraction information as follows:
calculating element extraction characteristics of the element extraction information, and extracting characteristics of the target image to obtain image characteristics;
and carrying out feature fusion on the element extraction features and the image features, determining an element extraction mode according to a feature fusion result, and extracting the main body element and the background element from the target image according to the element extraction mode.
The element extraction mode comprises main body mask information and/or background mask information; the main body mask information comprises mask image information obtained by binarizing the target image according to the main body information; the background mask information includes mask image information obtained by binarizing the target image based on the background information.
Specifically, in the process of determining an element extraction mode according to the feature fusion result and extracting the main body element and the background element from the target image according to the element extraction mode, main body mask information and background mask information in the target image can be determined according to the fusion feature obtained by feature fusion, and the main body element is extracted from the target image according to the main body mask information and the background element is extracted from the target image according to the background mask information.
In addition, in the process of extracting the main body element and the background element from the target image based on its element extraction information, an image segmentation component can be introduced: the target image and its element extraction information are input into the image segmentation component for image segmentation processing to obtain the main body element and the background element. The image segmentation component may be, for example, a Segment Anything-style segmentation model.
Specifically, the target image can be encoded by an image encoder in the image segmentation component to obtain the image features, the element extraction information can be encoded by a prompt encoder in the image segmentation component to obtain the element extraction features, the image features and the element extraction features are fused by a fusion module in the image segmentation component, and the main body element and the background element are extracted from the target image according to the feature fusion result.
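The post-processing stage can be sketched as below. The segmentation component itself (e.g. a Segment Anything-style model with an image encoder and a prompt encoder) is represented by an assumed segment callable that returns a main body (subject) mask, and the smoothing shown here simply blurs the background and softens the mask boundary with OpenCV, which is one plausible reading of smoothing the target image based on the main body element and the background element.

```python
from typing import Callable

import cv2
import numpy as np


def smooth_target_image(target_image: np.ndarray,
                        segment: Callable[[np.ndarray, dict], np.ndarray],
                        extraction_info: dict) -> np.ndarray:
    """Extract the main body / background elements and return a smoothed image."""
    subject_mask = segment(target_image, extraction_info)        # 0/1 mask from the segmentation component
    subject_mask = subject_mask.astype(np.float32)

    # Soften the main body / background boundary to avoid an obvious sense of splitting.
    soft_mask = cv2.GaussianBlur(subject_mask, (15, 15), 0)[..., None]

    # Smooth (blur) the background element while keeping the main body element sharp.
    blurred_background = cv2.GaussianBlur(target_image, (9, 9), 0)
    smoothed = soft_mask * target_image + (1.0 - soft_mask) * blurred_background
    return smoothed.astype(target_image.dtype)
```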
In summary, in the image processing method provided in this embodiment, a descriptive text and a theme image for image generation are first acquired. The descriptive text is segmented into a plurality of text word segments, target word segments are determined among them as the text keywords, and each text keyword is converted into a corresponding image keyword. Each image keyword is vectorized to obtain its keyword vector, the semantic similarity between the keyword vectors and the text vectors of the image description texts in the image libraries is calculated, and the target image libraries matched with the image keywords are determined according to the semantic similarity. The model parameters of the sub-models corresponding to the target image libraries are then determined in the parameter library and loaded into the pre-trained model to obtain an image generation model, and the image keywords and the theme keywords determined based on the theme image are input into the image generation model for image generation to obtain the target image. Finally, the main body element and the background element are extracted from the target image based on its element extraction information, and the target image is smoothed based on the main body element and the background element to obtain a smoothed image. The target image is thus generated under the joint constraints of the descriptive text and the theme image, with improved suitability and image quality.
The image processing method provided in this embodiment is further described below by taking its application to a merchant document scene as an example. Referring to Fig. 4, the image processing method applied to the merchant document scene specifically includes the following steps.
Step S402, acquiring a descriptive text and a theme image input by a merchant for generating a document image.
Step S404, word segmentation processing is carried out on the descriptive text, a plurality of text word segments of the descriptive text are obtained, and target word segments are determined in the plurality of text word segments to serve as text keywords.
Step S406, inputting each text keyword into the language model for conversion processing, and obtaining the document image keywords corresponding to the text keywords.
Step S408, vectorization processing is carried out on the keywords of each document image, and keyword vectors of the keywords of each document image are obtained.
Step S410, calculating semantic similarity according to the keyword vector and the text vector of the image description text in each image library, and determining a target image library matched with each document image keyword in each image library according to the semantic similarity.
And step S412, determining model parameters of the sub-models corresponding to each target image library in the parameter library, and loading the model parameters of each sub-model to the stable diffusion model to obtain an image generation model.
Step S414, inputting the document image keywords and the theme keywords determined based on the theme image into the image generation model for image generation, and obtaining the target document image.
Step S416, element extraction features of the element extraction information of the target document image are calculated, and feature extraction is performed on the target document image to obtain image features.
Step S418, carrying out feature fusion on the element extraction features and the image features, and extracting main body elements and background elements from the target document image according to the feature fusion result.
Step S420, smoothing the target document image based on the subject element and the background element to obtain a smoothed document image.
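Step S420 can then be sketched as follows: the background element is smoothed while the subject element is kept unchanged. The choice of a Gaussian blur and the kernel size are assumptions made for illustration.

```python
# Illustrative sketch of step S420: smooth the document image based on the
# extracted subject and background elements (blur the background, keep the
# subject sharp, re-composite).
import cv2
import numpy as np

def smooth_document_image(image_bgr: np.ndarray,
                          subject_mask: np.ndarray) -> np.ndarray:
    blurred = cv2.GaussianBlur(image_bgr, (15, 15), 0)   # smooth the whole image
    keep = subject_mask[:, :, None].astype(bool)
    # Keep the subject element untouched; use the smoothed pixels elsewhere.
    return np.where(keep, image_bgr, blurred)
```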
An embodiment of an image processing apparatus provided in the present specification is as follows:
corresponding to the method embodiments described above, this specification further provides an image processing apparatus embodiment, which is described below with reference to the accompanying drawings.
Referring to fig. 5, a schematic diagram of an embodiment of an image processing apparatus provided in this embodiment is shown.
Since the apparatus embodiment corresponds to the method embodiment, its description is relatively simple; for relevant portions, reference may be made to the corresponding descriptions of the method embodiments provided above. The apparatus embodiment described below is merely illustrative.
The present embodiment provides an image processing apparatus including:
an acquisition module 502 configured to acquire a descriptive text and a subject image for image generation;
the conversion module 504 is configured to perform conversion processing on each text keyword included in the descriptive text, so as to obtain each image keyword corresponding to each text keyword;
the loading module 506 is configured to determine model parameters of the sub-model corresponding to each target image library matched with each image keyword in the parameter library, and load the model parameters into a pre-training model to obtain an image generation model;
and the image generation module 508 is configured to input the image keywords and the theme keywords determined based on the theme image into the image generation model for image generation, so as to obtain a target image.
An embodiment of an image processing device provided in the present specification is as follows:
corresponding to the image processing method described above and based on the same technical concept, one or more embodiments of the present specification further provide an image processing device for performing the image processing method provided above; fig. 6 is a schematic structural diagram of the image processing device provided by one or more embodiments of the present specification.
An image processing device provided in this embodiment includes:
as shown in fig. 6, the image processing apparatus may have a relatively large difference due to different configurations or performances, and may include one or more processors 601 and a memory 602, and one or more storage applications or data may be stored in the memory 602. Wherein the memory 602 may be transient storage or persistent storage. The application programs stored in the memory 602 may include one or more modules (not shown), each of which may include a series of computer-executable instructions in the image processing apparatus. Still further, the processor 601 may be arranged to communicate with the memory 602 and execute a series of computer executable instructions in the memory 602 on an image processing device. The image processing device may also include one or more power supplies 603, one or more wired or wireless network interfaces 604, one or more input/output interfaces 605, one or more keyboards 606, and the like.
In a particular embodiment, the image processing device includes a memory and one or more programs, where the one or more programs are stored in the memory and may include one or more modules, each module may include a series of computer-executable instructions for the image processing device, and the one or more programs are configured to be executed by the one or more processors; the one or more programs include computer-executable instructions for:
Acquiring a description text and a theme image for image generation;
converting each text keyword contained in the descriptive text to obtain each image keyword corresponding to each text keyword;
determining model parameters of sub-models corresponding to each target image library matched with each image keyword in a parameter library, and loading the model parameters into a pre-training model to obtain an image generation model;
and inputting the image keywords and the theme keywords determined based on the theme images into the image generation model to generate images, so as to obtain a target image.
An embodiment of a storage medium provided in the present specification is as follows:
in correspondence with the above-described image processing method, one or more embodiments of the present specification further provide a storage medium based on the same technical idea.
The storage medium provided in this embodiment is configured to store computer executable instructions that, when executed by a processor, implement the following flow:
acquiring a description text and a theme image for image generation;
converting each text keyword contained in the descriptive text to obtain each image keyword corresponding to each text keyword;
Determining model parameters of sub-models corresponding to each target image library matched with each image keyword in a parameter library, and loading the model parameters into a pre-training model to obtain an image generation model;
and inputting the image keywords and the theme keywords determined based on the theme images into the image generation model to generate images, so as to obtain a target image.
It should be noted that, in the present specification, an embodiment of a storage medium and an embodiment of an image processing method in the present specification are based on the same inventive concept, so that a specific implementation of the embodiment may refer to an implementation of the foregoing corresponding method, and a repetition is omitted.
In this specification, each embodiment is described in a progressive manner, the same or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus, device and storage medium embodiments are similar to the method embodiments and are therefore described relatively simply; for relevant content, reference may be made to the corresponding parts of the description of the method embodiments.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In the 1990s, an improvement in a technology could clearly be distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor or a switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement in a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs a digital system "onto" a single PLD without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can be readily obtained merely by slightly logically programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller or an embedded microcontroller; examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing the various functions may also be regarded as structures within the hardware component. Alternatively, the means for performing the various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each unit may be implemented in the same piece or pieces of software and/or hardware when implementing the embodiments of the present specification.
One skilled in the relevant art will recognize that one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable image processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable image processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable image processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable image processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
One or more embodiments of the present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing description is by way of example only and is not intended to limit the present disclosure. Various modifications and changes may occur to those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. that fall within the spirit and principles of the present document are intended to be included within the scope of the claims of the present document.

Claims (13)

1. An image processing method, comprising:
acquiring a description text and a theme image for image generation;
converting each text keyword contained in the descriptive text to obtain each image keyword corresponding to each text keyword;
determining model parameters of sub-models corresponding to each target image library matched with each image keyword in a parameter library, and loading the model parameters into a pre-training model to obtain an image generation model;
And inputting the image keywords and the theme keywords determined based on the theme images into the image generation model to generate images, so as to obtain a target image.
2. The image processing method according to claim 1, wherein after the step of acquiring the description text and the theme image for image generation is performed, and before the step of inputting each image keyword and the theme keyword determined based on the theme image into the image generation model for image generation is performed, the method further comprises:
and inputting the theme images into a neural network model to extract theme keywords, and obtaining the theme keywords.
3. The image processing method according to claim 2, wherein the extracting of theme keywords comprises:
identifying image elements in the theme image, performing action detection on the image elements, and determining action keywords based on action detection results;
performing boundary detection on the theme image to obtain an image element boundary in the theme image, and determining a contour keyword based on the image element boundary;
and performing theme type analysis on the theme image to obtain the theme type of the theme image, and determining a theme type keyword according to the theme type.
4. The image processing method according to claim 1, wherein after the step of converting each text keyword included in the descriptive text to obtain each image keyword corresponding to each text keyword is performed, and before the step of determining, in the parameter library, the model parameters of the sub-model corresponding to each target image library matched with each image keyword and loading the model parameters into the pre-training model to obtain the image generation model is performed, the method further comprises:
vectorizing the keywords of each image to obtain keyword vectors of the keywords of each image;
and calculating semantic similarity according to the keyword vector and the text vector of the image description text in each image library, and determining each target image library matched with each image keyword in each image library according to the semantic similarity.
5. The image processing method according to claim 4, wherein the model parameters of the sub-model corresponding to each image library are obtained by:
inputting the image description text in each image library and the sample theme keywords determined based on the theme image sample into a preprocessing module in the pre-training model for image preprocessing to obtain a preprocessed image;
Inputting the preprocessed images into sub-models corresponding to the image libraries in the pre-training model to perform image adjustment processing to obtain adjustment images;
and calculating a loss value based on the adjustment image and the sample image, and carrying out parameter adjustment on the sub-model corresponding to each image library based on the loss value so as to obtain model parameters of the sub-model corresponding to each image library.
6. The image processing method according to claim 1, wherein after the step of inputting each image keyword and the theme keyword determined based on the theme image into the image generation model for image generation to obtain the target image is performed, the method further comprises:
extracting elements from the target image based on the element extraction information of the target image to obtain a main element and a background element;
and carrying out smoothing processing on the target image based on the main body element and the background element to obtain a smoothed image.
7. The image processing method according to claim 6, wherein the element extraction of the target image based on the element extraction information of the target image, to obtain a subject element and a background element, comprises:
calculating element extraction characteristics of the element extraction information, and extracting characteristics of the target image to obtain image characteristics;
And carrying out feature fusion on the element extraction features and the image features, determining an element extraction mode according to a feature fusion result, and extracting the main body element and the background element from the target image according to the element extraction mode.
8. The image processing method according to claim 1, wherein the converting each text keyword included in the descriptive text to obtain each image keyword corresponding to each text keyword includes:
performing word segmentation processing on the descriptive text to obtain a plurality of text word segments of the descriptive text;
and determining target word segments from the plurality of text word segments as the text keywords, and converting each text keyword into the corresponding image keyword.
9. The image processing method according to claim 4, wherein the determining, in the image libraries, each target image library to which each image keyword matches according to the semantic similarity, comprises:
sorting the image libraries in a descending order according to the semantic similarity;
and determining, from the sorting result, the image libraries whose sorted positions are before a preset position as the target image libraries matched with each image keyword.
10. The image processing method according to claim 1, wherein the converting each text keyword included in the descriptive text to obtain each image keyword corresponding to each text keyword includes:
and inputting the text keywords into a language model for conversion processing, and obtaining the specific keywords and/or derivative keywords corresponding to the text keywords as the image keywords.
11. An image processing apparatus comprising:
the acquisition module is configured to acquire a descriptive text and a theme image for image generation;
the conversion module is configured to perform conversion processing on each text keyword contained in the descriptive text to obtain each image keyword corresponding to each text keyword;
the loading module is configured to determine model parameters of sub-models corresponding to the target image libraries matched with the image keywords in the parameter libraries, and load the model parameters into a pre-training model to obtain an image generation model;
and the image generation module is configured to input the image keywords and the theme keywords determined based on the theme image into the image generation model for image generation to obtain a target image.
12. An image processing device comprising:
a processor; and a memory configured to store computer-executable instructions that, when executed, cause the processor to:
acquiring a description text and a theme image for image generation;
converting each text keyword contained in the descriptive text to obtain each image keyword corresponding to each text keyword;
determining model parameters of sub-models corresponding to each target image library matched with each image keyword in a parameter library, and loading the model parameters into a pre-training model to obtain an image generation model;
and inputting the image keywords and the theme keywords determined based on the theme images into the image generation model to generate images, so as to obtain a target image.
13. A storage medium storing computer-executable instructions that when executed by a processor implement the following:
acquiring a description text and a theme image for image generation;
converting each text keyword contained in the descriptive text to obtain each image keyword corresponding to each text keyword;
Determining model parameters of sub-models corresponding to each target image library matched with each image keyword in a parameter library, and loading the model parameters into a pre-training model to obtain an image generation model;
and inputting the image keywords and the theme keywords determined based on the theme images into the image generation model to generate images, so as to obtain a target image.
CN202311393531.7A 2023-10-25 2023-10-25 Image processing method and device Pending CN117456026A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311393531.7A CN117456026A (en) 2023-10-25 2023-10-25 Image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311393531.7A CN117456026A (en) 2023-10-25 2023-10-25 Image processing method and device

Publications (1)

Publication Number Publication Date
CN117456026A true CN117456026A (en) 2024-01-26

Family

ID=89594185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311393531.7A Pending CN117456026A (en) 2023-10-25 2023-10-25 Image processing method and device

Country Status (1)

Country Link
CN (1) CN117456026A (en)

Similar Documents

Publication Publication Date Title
CN116188632A (en) Image generation method and device, storage medium and electronic equipment
CN116205290B (en) Knowledge distillation method and device based on intermediate feature knowledge fusion
CN112417093B (en) Model training method and device
CN117409466B (en) Three-dimensional dynamic expression generation method and device based on multi-label control
CN116630480B (en) Interactive text-driven image editing method and device and electronic equipment
CN116662657A (en) Model training and information recommending method, device, storage medium and equipment
CN115358777A (en) Advertisement putting processing method and device of virtual world
CN117456026A (en) Image processing method and device
CN117456028A (en) Method and device for generating image based on text
CN115953559B (en) Virtual object processing method and device
CN117992600B (en) Service execution method and device, storage medium and electronic equipment
CN117494697A (en) Log analysis processing method and device
CN114817469B (en) Text enhancement method, training method and training device for text enhancement model
CN115495712B (en) Digital work processing method and device
CN115017915B (en) Model training and task execution method and device
CN113343716B (en) Multilingual translation method, device, storage medium and equipment
CN115374298B (en) Index-based virtual image data processing method and device
CN115953706B (en) Virtual image processing method and device
CN117745855A (en) Image generation processing method and device
CN114996447A (en) Text hierarchy classification method and device based on center loss
CN116543759A (en) Speech recognition processing method and device
CN117743824A (en) Model training and service execution method and device, storage medium and equipment
CN116628032A (en) Training method for characterization extraction model, characterization extraction method and related products
CN117910542A (en) User conversion prediction model training method and device
CN118210980A (en) Resource matching method, equipment and medium based on basic level treatment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination