CN117745857B - Image generation model training method and device, image processing method and device - Google Patents


Info

Publication number
CN117745857B
CN117745857B
Authority
CN
China
Prior art keywords
image
text
model
network
generation
Prior art date
Legal status
Active
Application number
CN202311755560.3A
Other languages
Chinese (zh)
Other versions
CN117745857A (en)
Inventor
戎康
宋雨鑫
张琦
刘芳龙
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311755560.3A priority Critical patent/CN117745857B/en
Publication of CN117745857A publication Critical patent/CN117745857A/en
Application granted granted Critical
Publication of CN117745857B publication Critical patent/CN117745857B/en


Landscapes

  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract


The present disclosure provides an image generation model training method and device, relating to the field of artificial intelligence technology, specifically to computer vision, deep learning, large models, and other technical fields, and applicable to scenarios such as artificial-intelligence content generation. The specific implementation scheme is: obtain an image sample set; obtain a pre-built image generation network, which includes a sequentially connected image-text recognition module, large language model, and text generation image model; input an image sample selected from the image sample set into the image generation network to obtain the generated image output by the network; score the generated image with an image scoring model to obtain an evaluation value of the generated image; calculate the network loss value of the image generation network based on the evaluation value; and train the image generation network based on that network loss value to obtain a trained image generation model.

Description

Image generation model training method and device, image processing method and device
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to the technical fields of computer vision, deep learning, large models, and the like, which may be applied to scenarios such as artificial-intelligence content generation, and more particularly to an image generation model training method and apparatus, an image processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
With the advent of SD (Stable Diffusion) models, generative image models have demonstrated powerful capabilities, including more realistic scenes, richer details, and good instruction-following ability.
For widely used deep-learning text-to-image generation models, inputting a prompt into the model can generate almost any image a human can imagine.
Disclosure of Invention
The present disclosure provides an image generation model training method and apparatus, an image processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to a first aspect, an image generation model training method is provided. The method comprises the steps of: obtaining an image sample set, wherein the image sample set comprises at least one image sample; obtaining a pre-built image generation network, wherein the image generation network comprises an image-text recognition module, a large language model, and a text generation image model which are sequentially connected, the image-text recognition module is used for obtaining a recognition text based on an input image, the large language model is used for obtaining a prompt word text with multiple image description features based on the recognition text, and the text generation image model is used for obtaining a generated image based on the prompt word text; inputting an image sample selected from the image sample set into the image generation network to obtain a generated image output by the image generation network; scoring the generated image with an image scoring model to obtain an evaluation value of the generated image; calculating a network loss value of the image generation network based on the evaluation value; and training the image generation network based on the network loss value to obtain a trained image generation model.
According to a second aspect, there is provided an image processing method comprising obtaining an image to be processed, inputting the image to be processed into an image generation model generated by a method as described in any implementation manner of the first aspect, and obtaining an image generation result of the image to be processed.
According to a third aspect, there is provided an image generation model training apparatus comprising: a set acquisition unit configured to acquire an image sample set including at least one image sample; a network acquisition unit configured to acquire a pre-built image generation network, the image generation network including an image-text recognition module, a large language model, and a text generation image model connected in sequence, the image-text recognition module obtaining a recognition text based on an input image, the large language model obtaining a prompt word text having multiple image description features based on the recognition text, and the text generation image model obtaining a generated image based on the prompt word text; a sample input unit configured to input an image sample selected from the image sample set into the image generation network to obtain a generated image output by the image generation network; a scoring unit configured to score the generated image using an image scoring model to obtain an evaluation value of the generated image; a calculation unit configured to calculate a network loss value of the image generation network based on the evaluation value; and a model obtaining unit configured to train the image generation network based on the network loss value to obtain a trained image generation model.
According to a fourth aspect, there is also provided an image processing apparatus including an image acquisition unit configured to acquire an image to be processed, and a result obtaining unit configured to input the image to be processed into an image generation model generated using the apparatus described in any implementation manner of the third aspect, and output an image generation result of the image to be processed.
According to a fifth aspect there is provided an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described in any one of the implementations of the first or second aspects.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method as described in any implementation of the first or second aspect.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first or second aspects.
The image generation model training method and device provided by the embodiments of the disclosure proceed as follows: first, obtain an image sample set, wherein the image sample set comprises at least one image sample; second, obtain a pre-built image generation network, wherein the image generation network comprises an image-text recognition module, a large language model, and a text generation image model which are sequentially connected, the image-text recognition module is used for obtaining a recognition text based on an input image, the large language model is used for obtaining a prompt word text with multiple image description features based on the recognition text, and the text generation image model is used for obtaining a generated image based on the prompt word text; third, input an image sample selected from the image sample set into the image generation network to obtain a generated image output by the network; next, score the generated image with an image scoring model to obtain an evaluation value of the generated image; then, calculate a network loss value of the image generation network based on the evaluation value; and finally, train the image generation network based on the network loss value to obtain a trained image generation model. In the image generation network training process, the generated images are scored by the image scoring model and the image generation model is obtained based on the resulting evaluation values, which improves the reliability and accuracy of training the image generation model and the image generation effect of the model.
The image processing method and device provided by the embodiment of the disclosure acquire an image to be processed, input the image to be processed into an image generation model generated by an image generation model training method, and obtain an image generation result of the image to be processed. Therefore, the image generation result is generated by adopting the image generation model comprising the large language model, and the reliability and accuracy of the image generation result are improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of one embodiment of an image generation model training method according to the present disclosure;
FIG. 2 is a schematic diagram of one architecture of image generation network training in an embodiment of the present disclosure;
FIG. 3 is a flow chart of one embodiment of an image processing method according to the present disclosure;
FIG. 4 is a flow chart of another embodiment of an image processing method in an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the architecture of one embodiment of an image generation model training apparatus according to the present disclosure;
FIG. 6 is a schematic diagram of a structure of an embodiment of an image processing apparatus according to the present disclosure;
Fig. 7 is a block diagram of an electronic device used to implement an image generation model training method or an image processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The traditional style-image generation process includes selecting a reference picture and directly guiding the generation of a new picture with the original picture: transformation operations are performed on the original picture to obtain information such as an edge distribution map, rough color-value distribution, and overall composition of the picture content, and this information is input into the image generation model to comprehensively guide the generated picture. The reference picture should have the style or content characteristics desired in the generated image.
The prior scheme needs to feed the original picture into the generation model after a series of preprocessing steps, but the preprocessing cannot fully embody the content of the picture, especially abstract characteristics such as its style, so generation cannot be guided well in this form. In a content-guided scenario, the model may not accurately reproduce details of the reference picture, so the generated image deviates in content from the original; or it may follow the original picture too closely, without more divergent and extended effects. The generation results depend strongly on the quality of the user-provided text prompts and reference images: inaccurate or ambiguous inputs may lead to undesirable output.
Based on this, the present disclosure proposes an image generation model training method, fig. 1 shows a flow 100 according to one embodiment of the image generation model training method of the present disclosure, the image generation model training method comprising the steps of:
step 101, acquiring an image sample set.
In this embodiment, the execution subject of the image generation model training method may acquire the image sample set in various manners. For example, the execution subject may acquire the image sample set stored in a database server through a wired or wireless connection. For another example, the execution subject may obtain a set of image samples collected by a terminal by communicating with the terminal.
The image sample set may comprise at least one image sample. An image sample comprises a labeled image related to the generated image to be produced, and the content displayed in the labeled image covers various objects, scenes, and styles. Optionally, an image sample may also comprise image description text; when the image generation network is trained, the image in the image sample can be input into the image-text recognition module, the image description text and the recognition text spliced together, and the spliced text input into the large language model.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the image samples involved are performed with authorization and comply with relevant laws and regulations.
Step 102, a pre-built image generation network is acquired.
The image generation network comprises an image-text recognition module, a large language model and a text generation image model which are sequentially connected, wherein the image-text recognition module is used for obtaining a recognition text based on an input image, the large language model is used for obtaining a prompt word text with multiple image description characteristics based on the recognition text, and the text generation image model is used for obtaining a generated image based on the prompt word text.
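The three-stage pipeline described above can be sketched as follows. The module functions are hypothetical stand-ins (the patent does not name concrete implementations); each stage is reduced to a pure function so the data flow through the sequentially connected modules is visible end to end:

```python
# Hypothetical stand-ins for the three sequentially connected modules.
# Real implementations would be neural networks; placeholders are used
# here purely to illustrate the composition.

def recognize_image(image) -> str:
    """Image-text recognition module: input image -> recognition text."""
    return f"recognized content of {image}"

def expand_prompt(recognition_text: str) -> str:
    """Large language model: recognition text -> prompt word text with
    multiple image description features (element, composition, style)."""
    return (f"{recognition_text}, detailed elements, "
            "balanced composition, oil-painting style")

def generate_image(prompt_text: str) -> dict:
    """Text generation image model: prompt word text -> generated image."""
    return {"prompt": prompt_text, "pixels": None}  # placeholder image object

def image_generation_network(image) -> dict:
    """Sequential composition of the three modules in the network."""
    return generate_image(expand_prompt(recognize_image(image)))
```

Usage: `image_generation_network("sample.png")` carries the recognition text through the prompt expansion into the final (placeholder) image object.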
In this embodiment, the image-text recognition module is an image content extractor that converts an image into text. The module may be a multimodal image-text recognition model: an image is input into the model to obtain a recognition text, output by the model, that describes the content of the image. The recognition text is information characterizing the image in the form of text data. The multimodal image-text recognition model can be trained to support multiple languages, thus crossing language barriers.
In this embodiment, the large language model is a deep-learning-based natural language processing model that mainly learns from a large amount of text data to automatically generate sentences, paragraphs, or articles conforming to language rules. The core idea of the large language model is to use a deep neural network to learn characteristics such as the grammar and semantics of natural language, so as to predict the occurrence probability of the next word and generate new sentences according to those probabilities.
In this embodiment, a recognition text is input into the large language model to obtain a prompt word text with multiple image description features output by the model. An image description feature is text describing a feature of the image; because the prompt word text obtained by the large language model of the present disclosure has multiple image description features, the input image of the image-text recognition module can be described more comprehensively.
In this embodiment, the text generation image model is a model that generates a styled image from text. Inputting the prompt word text into the model conveys the image generation requirements to it, providing more targeted requirements for the image to be generated.
In this embodiment, the text generation image model may be an SD (Stable Diffusion) model. The recognition text generated by the image-text recognition module is detailed and accurate, and the prompt word text automatically generated by the large language model can provide new viewing angles and creativity, offering more possibilities for the image generation of the text generation image model and even inspiring the creativity of the model's users.
And step 103, inputting the image sample selected from the image sample set into an image generation network to obtain a generated image output by the image generation network.
In this embodiment, the execution subject may select an image sample from the image sample set obtained in step 101 and execute the training steps 103 to 106 to complete one iteration of training the image generation network. The present application does not limit the manner or number of image samples selected from the image sample set, nor the number of training iterations of the image generation network. For example, in one training iteration, a plurality of image samples can be selected randomly; the selected samples can be images only, or images with corresponding description text. The network loss value of the image generation network is calculated from the selected image samples, and the parameters of the image generation network are adjusted accordingly.
And 104, scoring the generated image by adopting an image scoring model to obtain an evaluation value of the generated image.
In this embodiment, the image scoring model is a pre-trained model that scores the content and presentation of an image. An image is input into the image scoring model to obtain an evaluation value output by the model; the evaluation value is a concrete representation of the richness and aesthetic quality of the image content.
In this embodiment, the image scoring model may be an image-text conversion model obtained by training a large model. A large model here refers to a deep learning or machine learning model with a large number of parameters that can be automatically adjusted during training to capture complex relationships in the input data. Such models typically have deeper network structures and more neurons to increase their representation and learning capabilities.
Specifically, as shown in fig. 2, a sample image is input into an image-text recognition module, the image-text recognition module outputs a recognition text, a large language model is based on the recognition text to obtain a prompt word text, a text generated image model is based on the prompt word text to obtain a generated image, an image scoring model obtains the generated image, the generated image is scored to obtain an evaluation value, and an image generation network is trained through the evaluation value.
Step 105, calculating a network loss value of the image generation network based on the evaluation value.
In this embodiment, during each training iteration of the image generation network, an image sample is selected from the image sample set and input into the image generation network, and the network loss value of the image generation network is calculated based on a loss function set in advance for the network and the evaluation value.
In this embodiment, the loss function of the image generation network may be a mean squared error function, i.e., the expectation of the squared difference between the network's predicted value (estimated value) and the true value. During iterative training of the image generation network, the loss function may be minimized with a gradient descent algorithm, iteratively optimizing the network parameters.
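A minimal sketch of the mean squared error and one gradient-descent update, in plain Python (the real network operates on tensors; the list-based functions here only illustrate the arithmetic):

```python
def mse_loss(predicted, target):
    """Mean squared error: the average squared difference between the
    network's predicted values and the true values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

def gradient_step(params, grads, lr=0.01):
    """One gradient-descent update: move each parameter against its
    gradient, the direction in which the loss increases fastest."""
    return [p - lr * g for p, g in zip(params, grads)]
```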
The gradient is a vector indicating the direction along which the directional derivative of the loss function at a given point is maximal; that is, the loss function changes fastest along that direction at that point, with the greatest rate of change. In deep learning, the main task of the neural network during learning is to find the optimal network parameters (weights and biases), namely the parameters at which the loss function is minimal.
In the training process of the image generation network, a loss function can be designed for the text generation image model and used to calculate a loss value; the gradient of the large language model is updated based on the loss value and the evaluation value, and the parameters of the large language model are adjusted. The parameters of the text generation image model therefore do not need to be adjusted, achieving the goal of a hot-pluggable text generation image model.
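The hot-pluggable arrangement above — adjusting only the large language model while leaving the text generation image model untouched — can be sketched as follows. In a framework such as PyTorch this would correspond to setting `requires_grad=False` on the text generation image model's parameters; the function below is a framework-free illustration:

```python
def update_network(llm_params, t2i_params, llm_grads, lr=0.01):
    """Adjust only the large language model's parameters; the text
    generation image model's parameters are returned unchanged (frozen),
    keeping that model hot-pluggable."""
    new_llm = [p - lr * g for p, g in zip(llm_params, llm_grads)]
    return new_llm, t2i_params  # t2i_params pass through untouched
```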
Optionally, in the training process of the image generating network, a loss function can be designed for the text generating image model, the loss value is calculated through the loss function of the text generating image model, and the parameters of the large language model and the text generating image model are updated based on the loss value and the evaluation value, so that the aim of adjusting the text generating image model and the large language model simultaneously is fulfilled.
In this embodiment, calculating the network loss value of the image generation network based on the evaluation value includes calculating an overall loss value of the image generation network, and dividing the overall loss value by the evaluation value to obtain the network loss value.
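This division can be expressed directly (the positivity check is an added assumption — the disclosure does not specify how a zero or negative evaluation value is handled):

```python
def network_loss(overall_loss: float, evaluation: float) -> float:
    """Network loss per this embodiment: the overall loss divided by the
    evaluation value, so higher-scored generated images yield a smaller
    loss and are reinforced during training."""
    if evaluation <= 0:  # assumed guard; not specified in the disclosure
        raise ValueError("evaluation value must be positive")
    return overall_loss / evaluation
```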
And step 106, training the image generation network based on the network loss value of the image generation network to obtain a trained image generation model.
In this embodiment, the image generation model is the image generation network after multiple iterations of training. After each parameter adjustment, whether the image generation network satisfies the training completion condition can be checked via its network loss value; once the condition is satisfied, the image generation model is obtained.
Optionally, in this embodiment, in response to the image generating network not meeting the training completion condition, the relevant parameters in the image generating network are adjusted so that the network loss value of the image generating network converges, and the training steps 103 to 106 are continuously performed based on the adjusted image generating network.
In this optional implementation manner, when the image generation network does not meet the training completion condition, relevant parameters of the image generation network are adjusted, which is helpful to help the convergence of network loss values of the image generation network.
The image generation model training method provided by the embodiment can automatically generate the prompt word text, greatly reduces the time and labor of manual input, and particularly improves the efficiency of image processing tasks for large-scale image processing tasks. And the automatically generated prompt word text can provide standardized image description, thereby being beneficial to unified communication and searching of image content.
The image generation model training method proceeds as follows: first, obtain an image sample set, wherein the image sample set comprises at least one image sample; second, obtain a pre-built image generation network, wherein the image generation network comprises an image-text recognition module, a large language model, and a text generation image model which are sequentially connected, the image-text recognition module obtains a recognition text based on an input image, the large language model obtains a prompt word text with multiple image description features based on the recognition text, and the text generation image model obtains a generated image based on the prompt word text; next, input an image sample selected from the image sample set into the image generation network to obtain a generated image output by the network, and score the generated image with an image scoring model to obtain an evaluation value of the generated image; then, calculate a network loss value of the image generation network based on the evaluation value; and finally, train the image generation network based on the network loss value to obtain a trained image generation model. In the training process of the image generation network, the generated images are scored by the image scoring model and the image generation model is obtained based on the resulting evaluation values, improving the reliability and accuracy of training the image generation model.
In some optional implementations of the disclosure, calculating the network loss value of the image generation network based on the evaluation value includes obtaining a loss function of the text generation image model, calculating a model loss value of the text generation image model based on the selected image sample and the loss function, and adjusting the model loss value based on the evaluation value to obtain the network loss value.
In this optional implementation manner, the text generation image model serves as the main adjusted network, so its loss function is obtained. Calculating the model loss value of the text generation image model based on the selected image sample and the loss function comprises: obtaining a generated image from the text generation image model based on the selected image sample; obtaining, via the loss function, a difference value between the selected image sample and the generated image; and taking the difference value as the model loss value.
The model loss value is adjusted based on the evaluation value, and the network loss value is obtained by dividing the model loss value by the evaluation value.
In this method for calculating the network loss value of the image generation network, the loss function of the text generation image model is obtained, the model loss value of the text generation image model is calculated based on the selected image sample and the loss function, and the model loss value is adjusted based on the evaluation value to obtain the network loss value. Because the model loss value comes from the text generation image model and is adjusted by the evaluation value of the image scoring model, this provides a reliable implementation for obtaining the network loss value.
In some optional implementations of the disclosure, training the image generation network based on the network loss value of the image generation network to obtain a trained image generation model includes taking the image generation network as the image generation model in response to the network loss value of the image generation network satisfying a training completion condition.
In this alternative implementation, the training completion condition includes at least one of: the number of training iterations of the image generation network reaching a predetermined iteration threshold, and the network loss value of the image generation network being less than a predetermined network loss value threshold. The predetermined iteration threshold is an empirical value derived from the network loss value of the image generation network. For example, the predetermined iteration threshold of the image generation network is 50,000 iterations, and the predetermined network loss value threshold is 0.01.
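The two-part completion condition, with the example thresholds from the disclosure, can be written as a single predicate:

```python
ITERATION_THRESHOLD = 50_000  # example value given in the disclosure
LOSS_THRESHOLD = 0.01         # example value given in the disclosure

def training_complete(iterations: int, network_loss: float) -> bool:
    """Training completes when either condition holds: the iteration
    count reaches the predetermined threshold, or the network loss value
    drops below the predetermined loss threshold."""
    return iterations >= ITERATION_THRESHOLD or network_loss < LOSS_THRESHOLD
```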
According to the method for obtaining the image generation model, when the network loss value of the image generation network meets the training completion condition, the image generation network is used as the image generation model, and a reliable implementation mode is provided for generation of the image generation model.
In some optional implementations of the present disclosure, the image generation model training method further includes adjusting parameters of the large language model based on the network loss value and continuing to train the image generation network in response to the network loss value of the image generation network not meeting the training completion condition.
In this embodiment, continuing to train comprises: continuing to select an image sample from the image sample set; inputting the selected image sample into the image generation network to obtain a generated image output by the network; scoring the generated image with the image scoring model to obtain an evaluation value of the generated image; calculating a network loss value of the image generation network based on the evaluation value; and obtaining the image generation model based on the network loss value.
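The full iterative procedure (steps 103 to 106, with the score-adjusted loss and the completion check) can be sketched as a loop. All callables are caller-supplied stand-ins; the patent prescribes no concrete interfaces:

```python
import random

def train_image_generation_network(image_samples, forward, score, model_loss,
                                   adjust, loss_threshold=0.01, max_iters=1000):
    """Iterative training: select a sample, run the network (step 103),
    score the generated image (step 104), compute the evaluation-adjusted
    loss (step 105), and adjust parameters until the completion condition
    is met (step 106)."""
    loss = float("inf")
    for iteration in range(1, max_iters + 1):
        sample = random.choice(image_samples)      # select an image sample
        generated = forward(sample)                # step 103: run the network
        evaluation = score(generated)              # step 104: score the image
        loss = model_loss(generated, sample) / evaluation  # step 105
        if loss < loss_threshold:                  # step 106: completion check
            break
        adjust(loss)                               # update the LLM parameters
    return iteration, loss
```

With stub callables whose loss shrinks on each `adjust` call, the loop terminates as soon as the adjusted loss drops below the threshold.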
According to this image generation model training method, when the network loss value of the image generation network does not meet the training completion condition, only the parameters of the large language model are adjusted while the parameters of the text generation image model remain unchanged. A pluggable text generation image model can therefore be applied to the image generation network, improving the flexibility with which the text generation image model can be deployed.
According to this image generation model training method, since the parameter quantity of the large language model is small relative to the image generation network as a whole, adjusting only the parameters of the large language model and leaving the parameters of the text generation image model unchanged when the network loss value does not meet the training completion condition reduces the amount of network training and improves the convergence of the image generation network.
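The continue-training loop described above can be sketched as follows. Every component here is an illustrative stub — the real image-text recognition module, large language model, text generation image model, and image scoring model are replaced by toy functions — so only the control flow (select a sample, generate, score, compute the loss, update only the large language model's parameters, stop on the completion condition) mirrors the text:

```python
import random

def generate_image(sample, llm_params):
    # stand-in for: image-text recognition module -> large language model
    # -> text generation image model
    return sample * llm_params

def score_image(generated):
    # stand-in for the image scoring model; higher is better, capped at 1.0
    return min(1.0, generated)

def network_loss(score):
    # loss derived from the evaluation value; a real implementation would
    # also involve the text generation image model's own loss function
    return 1.0 - score

def update_llm(llm_params, loss, lr):
    # only the large language model's parameters change;
    # the text generation image model stays frozen
    return llm_params + lr * loss

def train_image_generation_network(samples, llm_params,
                                   max_iters=1000, loss_threshold=0.01, lr=0.5):
    loss = float("inf")
    for _ in range(max_iters):
        sample = random.choice(samples)                    # select an image sample
        score = score_image(generate_image(sample, llm_params))
        loss = network_loss(score)
        if loss < loss_threshold:                          # completion condition
            break
        llm_params = update_llm(llm_params, loss, lr)      # adjust LLM only
    return llm_params, loss
```

In a real deep-learning framework, freezing the text generation image model would typically mean excluding its parameters from the optimizer while keeping the large language model's parameters trainable.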
In some alternative implementations of the present disclosure, the multiple image description features include element features, composition features, and style features.
In this alternative implementation, the element features are feature texts describing each unit in the generated image to be produced, where a unit may be a person, an object, an animal, or a scene. For example, if the recognition text includes a cock, an element feature may be "a cock made of antique tin and wood".
In this alternative implementation, the composition feature is a feature text describing the layout of each unit in the generated image to be produced. For example, if the recognition text includes a cock positioned in the middle of the image, the composition feature may be "a cock positioned at the center of a white background".
In this alternative implementation, the style features are features describing style characteristics of the generated image to be generated, for example, style features include a retro style, a mechanical style, a comic style, an oil painting style, and the like.
Alternatively, the multiple image description features may further include a color tone feature, a plot feature, and the like, where the color tone feature describes the color tone of the generated image to be produced and the plot feature describes the storyline of the generated image to be produced.
The multiple image description features provided by this alternative implementation constrain the prompt word text through the element, composition, and style features, enrich the generated prompt word text, and help ensure the reliability of the image generation model.
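Assembling a prompt word text from the three features above can be sketched as a simple join; the feature strings mirror the cock example in the text, and the separator and function name are assumptions, not disclosed details:

```python
def build_prompt(element: str, composition: str, style: str) -> str:
    """Join the element, composition, and style features into one prompt word text."""
    return ", ".join([element, composition, style])

prompt = build_prompt(
    element="a cock made of antique tin and wood",
    composition="positioned in the center of a white background",
    style="mechanical style",
)
```

Optional tone or plot features would simply extend the joined list.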
In some optional implementations of the disclosure, the image scoring model is obtained by training a multi-modal image-text recognition network. The training of the multi-modal image-text recognition network includes a first training step and a second training step: performing the first training step one or more times yields a first scoring model, and performing the second training step one or more times on the first scoring model yields the image scoring model.
The first training step includes: inputting an obtained first image sample and a description text into the multi-modal image-text recognition network to obtain an answer text output by the multi-modal image-text recognition network; splicing the first image sample, the description text, the answer text, and a scoring text to obtain first splicing information; inputting the first splicing information into the multi-modal image-text recognition network to obtain a first score output by the multi-modal image-text recognition network; calculating a loss value of the multi-modal image-text recognition network based on the first score; and obtaining the first scoring model based on the loss value of the multi-modal image-text recognition network.
In this embodiment, the first image sample includes a first image and a score of the first image, where the first image may be an image generated by an image generation model and the score of the first image may be assigned manually. The loss value of the multi-modal image-text recognition network may be calculated from the score of the first image, the first score, and the loss function of the multi-modal image-text recognition network, and the first scoring model is obtained in response to the loss value of the multi-modal image-text recognition network meeting a training completion condition.
In this embodiment, the description text is text that asks the model to describe the first image in the first image sample; for example, the description text asks the model to describe the main content of the first image. The scoring text is text that asks the model to score the first image in the first image sample; for example, the scoring text is "please score the first image".
In this embodiment, the first image sample, the description text, the answer text, and the scoring text are spliced to obtain the first splicing information, and the first splicing information is input into the multi-modal image-text recognition network so that the network synthesizes the first image, the description text, the answer text, and the scoring text in the first image sample and gives a first score to the first image.
In this embodiment, the answer text is the text with which the multi-modal image-text recognition network describes the image content of the first image after receiving the description text; the comprehensiveness of the network's description of the first image can be judged from the answer text.
In this embodiment, the first score is the score that the multi-modal image-text recognition network assigns to the first image in the first image sample. The first score may rate the aesthetic effect or the cognitive effect of the first image.
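The first splicing information can be sketched as a simple concatenation of the four parts. How the parts are actually combined (separator, ordering, how the image itself is encoded) is not specified in the disclosure, so the `[SEP]` separator and the string placeholder for the image are assumptions:

```python
def splice_first_info(first_image_repr: str, description_text: str,
                      answer_text: str, scoring_text: str) -> str:
    """Concatenate the four parts of the first splicing information."""
    parts = [first_image_repr, description_text, answer_text, scoring_text]
    return " [SEP] ".join(parts)

info = splice_first_info(
    "<first image features>",  # placeholder for the encoded first image
    "Please describe the main content of the image.",
    "A cock standing on the ground with a red comb.",
    "Please score the first image.",
)
```

The spliced string is what the multi-modal network consumes to produce the first score.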
The second training step comprises the steps of inputting the acquired second image sample and the scoring text into the first scoring model to obtain a second score output by the first scoring model, calculating a loss value of the first scoring model based on the second score, and obtaining an image scoring model based on the loss value of the first scoring model.
In this embodiment, the second image sample includes a second image and a score of the second image, where the second image may be an image generated by an image generation model and the score of the second image may be assigned manually. The loss value of the first scoring model may be calculated from the score of the second image, the second score, and the loss function of the first scoring model (which is also the loss function of the multi-modal image-text recognition network), and the image scoring model is obtained in response to the loss value of the first scoring model meeting a training completion condition.
In this embodiment, iterating the second training step multiple times yields an image scoring model whose input is an image and whose output is a score for that image, which makes the image scoring model convenient to use.
According to this method for training the multi-modal image-text recognition network, the first training step lets the network fully understand an image and its content before giving the image a score, and the second training step yields a model whose input is an image and whose output is a score, improving the reliability of obtaining the image scoring model.
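The per-sample loss shared by both training stages can be sketched as follows. The disclosure only says a loss is computed from the predicted score and the manual score; the squared-error form below is an assumption, as are the function names:

```python
def scoring_loss(predicted_score: float, manual_score: float) -> float:
    """Assumed squared-error loss between predicted and manual scores."""
    return (predicted_score - manual_score) ** 2

def stage_complete(loss: float, threshold: float = 0.01) -> bool:
    """Training completion condition for either stage (threshold is illustrative)."""
    return loss < threshold
```

Stage one predicts the score from the first splicing information; stage two predicts it from the image alone, but both would minimise a loss of this shape against the manual score.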
Optionally, the image scoring model is obtained by training a multi-modal image-text recognition network, where the training includes inputting an acquired scored image sample into the multi-modal image-text recognition network to obtain a score output by the network, and obtaining the image scoring model in response to the multi-modal image-text recognition network meeting a training completion condition.
Further, based on the image generation model training method provided by the embodiment, the disclosure also provides an embodiment of an image processing method, and the image processing method of the disclosure combines the artificial intelligence fields of computer vision, deep learning and the like.
Referring to fig. 3, a flow 300 is shown according to one embodiment of the image processing method of the present disclosure, which includes the steps of:
step 301, an image to be processed is acquired.
In this embodiment, the image to be processed may include information such as a person, an object, or a scene, and processing it with the image generation model yields an image generation result. The execution subject of the image processing method can acquire the image to be processed in various ways. For example, the execution subject may acquire a stored image to be processed from a database server over a wired or wireless connection. For another example, the execution subject may receive, in real time, an image to be processed acquired by a terminal or another device.
Step 302, inputting the image to be processed into an image generation model, and outputting an image generation result of the image to be processed.
In this embodiment, the execution subject may input the image to be processed acquired from step 301 into the image generation model, thereby obtaining an image generation result of the image to be processed. The image generation result includes a generated image, which is a new image after style and/or content conversion with respect to the image to be processed.
In this embodiment, the image generating model may be trained by using the method described in the embodiment of fig. 1, and the specific training process may be described in the embodiment of fig. 1, which is not described herein.
The image processing method provided by this embodiment of the disclosure acquires an image to be processed and inputs it into an image generation model generated by the image generation model training method of the above embodiment to obtain an image generation result for the image to be processed. The image generation model thus performs reliable image processing on the image to be processed, improving the effectiveness of the image processing.
In some embodiments of the present disclosure, the image processing method includes acquiring an image to be processed, detecting whether a size of the image to be processed is a standard size, and adjusting the image to be processed to the standard size in response to the size of the image to be processed being not the standard size.
In this embodiment, the standard size may be a size adapted to the image generation model, for example, the standard size is 448×448.
In this embodiment, the size of the image to be processed may be measured directly with a measuring tool of an image processing toolkit; when the size is not the standard size, the image is processed with a cropping tool or a scaling tool of the toolkit to obtain an image to be processed of the standard size.
According to the image processing method provided by the embodiment, when the size of the image to be processed is not the standard size, the image to be processed is adjusted to the standard size, so that the image processing steps of the image generation model can be reduced, and the reliability of the image generation result is improved.
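The size check and adjustment above can be sketched with a toy nearest-neighbour scaler over a 2-D grid of pixel values. The disclosed standard size is 448×448; a tiny grid is used here so the sketch stays readable, and whether the disclosed method crops or scales is left open, so scaling is assumed:

```python
def resize_nearest(pixels, out_w, out_h):
    """Nearest-neighbour scaling of a 2-D grid of pixel values."""
    in_h, in_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

def to_standard_size(pixels, standard=(448, 448)):
    """Adjust the grid to the standard size only when its size differs."""
    h, w = len(pixels), len(pixels[0])
    if (w, h) != standard:
        return resize_nearest(pixels, *standard)
    return pixels
```

In practice an image library's resize routine would replace `resize_nearest`; the point is that the adjustment only runs when the size differs from the standard size.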
In some optional implementation manners of the disclosure, the image generation model comprises an image-text recognition module, a large language model and a text generation image model, wherein the image to be processed is input into the image generation model, and the image generation result of the image to be processed is output.
In this embodiment, the image-text recognition module may be a multi-modal image-text recognition model obtained by training a multi-modal image-text recognition network. Specifically, the training process of the multi-modal image-text recognition model includes: obtaining an image sample from an image sample set, inputting the image sample into the multi-modal image-text recognition network to obtain a text output by the multi-modal image-text recognition network, calculating a loss value of the multi-modal image-text recognition network, and obtaining the multi-modal image-text recognition model in response to the multi-modal image-text recognition network meeting a training completion condition.
As shown in FIG. 4, the image D to be processed is input into the image-text recognition module M1 to obtain the recognition text S output by M1, the content of which is "a cock standing on the ground, with a red comb and a huge, plump tail". The recognition text S is input into the large language model M2 to obtain the prompt word text T output by M2, the content of which is "a cock made of antique tin and wood, a white background, a praise, a standing pose, a mechanical style". The prompt word text T is then input into the text generation image model M3 to obtain the generated image W output by M3.
According to this image processing method, when the image generation model includes the image-text recognition module, the large language model, and the text generation image model, the recognition text is first obtained by the image-text recognition module, the prompt word text is then obtained by the large language model, and the generated image is finally obtained by the text generation image model; because the large language model outputs a prompt word text with multiple image description features, the accuracy of the generated image is improved.
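The three-stage data flow above (image → recognition text → prompt word text → generated image) can be sketched as a pipeline of stubs; each function merely stands in for the corresponding model, and the string outputs are illustrative:

```python
def recognize(image: str) -> str:
    # stand-in for the image-text recognition module M1
    return f"recognition text for {image}"

def llm_prompt(recognition_text: str) -> str:
    # stand-in for the large language model M2, which would add
    # element, composition, and style features
    return f"prompt word text from: {recognition_text}"

def text_to_image(prompt: str) -> str:
    # stand-in for the text generation image model M3
    return f"generated image from: {prompt}"

def image_generation_model(image: str) -> str:
    """Run the sequentially connected pipeline end to end."""
    return text_to_image(llm_prompt(recognize(image)))
```

The sequential connection means each stage consumes exactly the previous stage's output, which is why the text generation image model can be swapped without retraining the whole pipeline.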
In some embodiments of the present disclosure, the image processing method further includes receiving an image processing requirement text, after obtaining the identification text, stitching the identification text and the image processing requirement text to obtain second stitching information, inputting the second stitching information into a large language model to obtain a new prompt word text output by the large language model, inputting the new prompt word text into the text to generate an image model, and obtaining a new generated image output by the text to generate the image model.
In this embodiment, the image processing requirement text may be a requirement of the generated image to be generated, which is input by the user, and the specific requirement of the user may be extracted through the image processing requirement text.
The image processing method provided by this embodiment comprises the steps of obtaining an image to be processed, receiving an image processing requirement text, inputting the image to be processed into the image-text recognition module to obtain a recognition text output by the image-text recognition module, splicing the image processing requirement text and the recognition text to obtain second splicing information, inputting the second splicing information into the large language model to obtain a new prompt word text output by the large language model, inputting the new prompt word text into the text generation image model, and obtaining a new generated image output by the text generation image model.
Optionally, the image processing method may further include outputting the new prompt word text, so that in an interactive setting the model provides immediate textual feedback and the user experience is enhanced. The model can tailor the generated text prompt to the user's preferences and historical feedback, and can produce more diverse or more uniform text based on a re-entered image processing requirement text and the business scenario.
According to the image processing method provided by this embodiment, after the image processing requirement text is received, it is spliced with the recognition text to obtain the second splicing information; the second splicing information is input into the large language model to obtain a new prompt word text, and the new prompt word text is input into the text generation image model to obtain a new generated image. The user's image processing requirement can thus be captured through the image processing requirement text, and the image generation model incorporates that requirement into the new generated image, improving the accuracy of the generated image.
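The second splicing information above can be sketched as joining the recognition text with the user's requirement text before the large language model sees it; the separator and the example strings are assumptions:

```python
def splice_second_info(recognition_text: str, requirement_text: str) -> str:
    """Combine the recognition text with the image processing requirement text."""
    return recognition_text + " [SEP] " + requirement_text

info = splice_second_info(
    "a cock standing on the ground with a red comb",
    "generate the image in an oil painting style",
)
```

Feeding this spliced string to the large language model is what lets the new prompt word text reflect both the image content and the user's requirement.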
With further reference to fig. 5, as an implementation of the method illustrated in the above figures, the present disclosure provides an embodiment of an image generation model training apparatus, which corresponds to the method embodiment illustrated in fig. 1, and which is particularly applicable in various electronic devices.
As shown in fig. 5, the image generation model training apparatus 500 provided in this embodiment includes a set acquisition unit 501, a network acquisition unit 502, a sample input unit 503, a scoring unit 504, a calculation unit 505, and a model obtaining unit 506. Wherein the above-mentioned set acquisition unit 501 may be configured to acquire an image sample set, the image sample set comprising at least one image sample. The network obtaining unit 502 may be configured to obtain a pre-constructed image generating network, where the image generating network includes a graphic recognition module, a large language model, and a text generating image model, which are sequentially connected, where the graphic recognition module obtains a recognition text based on an input image, the large language model obtains a prompt word text with multiple image description features based on the recognition text, and the text generating image model obtains a generated image based on the prompt word text. The sample input unit 503 may be configured to input the image sample selected from the image sample set into the image generation network, and obtain a generated image output from the image generation network. The scoring unit 504 may be configured to score the generated image using an image scoring model, to obtain an evaluation value of the generated image. The above-described calculation unit 505 may be configured to calculate a network loss value of the image generation network based on the evaluation value. The model obtaining unit 506 may be configured to train the image generation network based on the network loss value of the image generation network, to obtain a trained image generation model.
In this embodiment, for the specific processing of the set acquisition unit 501, the network acquisition unit 502, the sample input unit 503, the scoring unit 504, the calculation unit 505, and the model obtaining unit 506 in the image generation model training apparatus 500, and the technical effects thereof, reference may be made to the descriptions of steps 101 to 106 in the embodiment corresponding to fig. 1, which are not repeated here.
In some optional implementations of this embodiment, the computing unit 505 is further configured to obtain a loss function of the text-generated image model, calculate a model loss value of the text-generated image model based on the selected image sample and the loss function, and adjust the model loss value based on the evaluation value to obtain the network loss value.
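The computing unit's adjustment of the model loss by the evaluation value can be sketched as follows. The disclosure does not give the adjustment rule, so the weighting below — penalising low-scoring generated images with a larger network loss — is an assumption, as is the 10-point scoring scale:

```python
def network_loss_value(model_loss: float, evaluation: float,
                       max_score: float = 10.0) -> float:
    """Adjust the text generation image model's loss by the evaluation value.

    A low evaluation value inflates the loss (up to 2x); a perfect score
    leaves it unchanged. The rule and scale are illustrative assumptions.
    """
    penalty = 1.0 + (max_score - evaluation) / max_score
    return model_loss * penalty
```

Any monotone rule that increases the loss as the evaluation value falls would serve the same role of steering the network toward higher-scoring images.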
In some optional implementations of the present embodiment, the model derivation unit 506 is further configured to take the image generation network as the image generation model in response to the network loss value of the image generation network satisfying the training completion condition.
In some optional implementations of this embodiment, the apparatus 500 further includes an adjustment unit (not shown in the figure) configured to adjust parameters of the large language model based on the network loss value and control the sample input unit 503 to operate in response to the network loss value of the image generation network not satisfying the training completion condition.
In some alternative implementations of the present embodiment, the multiple image description features include elemental features, composition features, and style features.
In some alternative implementations of the embodiment, the image scoring model is obtained by training a multi-modal image-text recognition network, the multi-modal image-text recognition network is obtained by training a training unit (not shown in the figure), the training unit is configured to input the obtained first image sample and the description text into the multi-modal image-text recognition network to obtain an answer text output by the multi-modal image-text network, splice the first image sample, the description text, the answer text and the scoring text to obtain first spliced information, input the first spliced information into the multi-modal image-text recognition network to obtain a first score output by the multi-modal image-text network, calculate a loss value of the multi-modal image-text network based on the first score, obtain a first scoring model based on the loss value of the multi-modal image-text network, input the obtained second image sample and the scoring text into the first scoring model to obtain a second score output by the first scoring model, calculate the loss value of the first scoring model based on the second score, and obtain the image scoring model based on the loss value of the first scoring model.
The image generation model training device provided by the embodiment of the disclosure comprises a collection acquisition unit 501 for acquiring an image sample set, wherein the image sample set comprises at least one image sample, a network acquisition unit 502 for acquiring a pre-constructed image generation network, the image generation network comprises a picture-text recognition module, a large language model and a text generation image model, the picture-text recognition module is used for obtaining a recognition text based on an input image, the large language model is used for obtaining a prompt word text with multiple image description characteristics based on the recognition text, the text generation image model is used for obtaining a generation image based on the prompt word text, a sample input unit 503 is used for inputting the image sample selected from the image sample set into the image generation network to obtain a generation image output by the image generation network, a scoring unit 504 is used for scoring the generation image to obtain an evaluation value of the generation image, a calculation unit 505 is used for calculating a network loss value of the image generation network based on the evaluation value, and a model obtaining unit 506 is used for training the image generation network based on the network loss value of the image generation network to obtain a trained image generation model. In the training process of the image generation network, the generated images are scored through the image scoring model, and the image generation model is obtained based on the evaluation value obtained by scoring, so that the reliability and the accuracy of training of the image generation model are improved.
With further reference to fig. 6, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of an image processing apparatus, which corresponds to the method embodiment shown in fig. 3, and which is particularly applicable to various electronic devices.
As shown in fig. 6, the image processing apparatus 600 provided in this embodiment includes an image acquisition unit 601 and a result obtaining unit 602. The image acquiring unit 601 may be configured to acquire an image to be processed. The above-described result obtaining unit 602 may be configured to input the image to be processed into the image generation model generated by the apparatus as described in the above-described embodiment of fig. 5, and output the image generation result of the image to be processed.
In this embodiment, for the specific processing of the image acquisition unit 601 and the result obtaining unit 602 in the image processing apparatus 600, and the technical effects thereof, reference may be made to the descriptions of step 301 and step 302 in the embodiment corresponding to fig. 3, which are not repeated here.
In some alternative implementations of the present embodiment, the image processing apparatus 600 further includes a detection unit (not shown in the figure). Wherein the detection unit is configured to detect whether the size of the image to be processed is a standard size, and to adjust the image to be processed to the standard size in response to the size of the image to be processed being not the standard size.
In some alternative implementations of the present embodiment, the image generating model includes a text-to-text recognition module, a large language model, and a text generating image model, and the result obtaining unit 602 is further configured to input the image to be processed into the text-to-text recognition module to obtain the recognition text output by the text-to-text recognition module, input the recognition text into the large language model to obtain the prompt word text output by the large language model, and input the prompt word text into the text generating image model to obtain the generated image output by the text generating image model.
In some alternative implementations of the present embodiment, the apparatus 600 further includes a receiving unit (not shown in the figure) and a text input unit (not shown in the figure). The receiving unit may be configured to receive the image processing requirement text. The text input unit can be configured to splice the identification text and the text required by image processing after the identification text is obtained, obtain second spliced information, input the second spliced information into a large language model, obtain a new prompt word text output by the large language model, input the new prompt word text into the text generating image model, and obtain a new generated image output by the text generating image model.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user's personal information comply with the relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the electronic device 700 are connected to the I/O interface 705, including an input unit 706 such as a keyboard, mouse, etc., an output unit 707 such as various types of displays, speakers, etc., a storage unit 708 such as a magnetic disk, optical disk, etc., and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, such as an image generation model training method or an image processing method. For example, in some embodiments, the image generation model training method or the image processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the image generation model training method or the image processing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the image generation model training method or the image processing method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer or other programmable image generation model training apparatus, image processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed technical solutions can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.
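As a non-limiting illustration of the training flow summarized in this disclosure (image sample → image-text recognition module → large language model → text generation image model → image scoring model → loss), the following is a minimal sketch. Every function here is a toy stand-in introduced for illustration only, not the disclosed implementation; in particular, the specific way the evaluation value adjusts the model loss is an assumption, since the disclosure only states that the model loss value is adjusted based on the evaluation value.

```python
# Toy stand-ins for the three sequentially connected components of the
# image generation network. Real implementations would be neural models.
def recognize(image):
    # image-text recognition module: image -> recognition text (stand-in)
    return f"text describing {image}"

def llm_prompt(text):
    # large language model: recognition text -> prompt word text (stand-in)
    return f"prompt: {text}"

def text_to_image(prompt):
    # text generation image model: prompt -> generated image (stand-in)
    return f"image from [{prompt}]"

def score_image(image):
    # image scoring model: generated image -> evaluation value (stand-in)
    return 0.8

def generation_loss(sample, generated):
    # base model loss of the text generation image model (stand-in)
    return 1.0

def network_loss(sample):
    """Compute the network loss: run the sample through the network,
    score the generated image, then adjust the model loss by the
    evaluation value (one plausible adjustment: down-weight the loss
    for highly rated images)."""
    generated = text_to_image(llm_prompt(recognize(sample)))
    evaluation = score_image(generated)
    model_loss = generation_loss(sample, generated)
    return model_loss * (1.0 - evaluation)

loss = network_loss("sample_001")
```

In an actual training loop, this network loss would drive parameter updates (per the disclosure, to the large language model) until a training completion condition is met.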

Claims (23)

1. An image generation model training method, the method comprising:
obtaining an image sample set, the image sample set comprising at least one image sample;
obtaining a pre-constructed image generation network, wherein the image generation network comprises an image-text recognition module, a large language model, and a text generation image model that are connected in sequence, and wherein the image-text recognition module is configured to obtain a recognition text based on an input image;
inputting an image sample selected from the image sample set into the image generation network to obtain a generated image output by the image generation network; scoring the generated image by using an image scoring model to obtain an evaluation value of the generated image; calculating a network loss value of the image generation network based on the evaluation value; and training the image generation network based on the network loss value of the image generation network to obtain a trained image generation model;
wherein the image scoring model is obtained by training a multi-modal image-text recognition network; an input of the multi-modal image-text recognition network comprises a first image sample and an output of the multi-modal image-text recognition network comprises a first score; the first image sample comprises a first image generated by the image generation model and a score of the first image obtained by manually scoring the first image; and the score of the first image, the first score, and a loss function of the multi-modal image-text recognition network are used to calculate a loss value of the multi-modal image-text recognition network.
2. The method of claim 1, wherein the calculating a network loss value for the image generation network based on the evaluation value comprises:
acquiring a loss function of the text generated image model;
calculating a model loss value of the text generated image model based on the selected image sample and the loss function of the text generated image model; and
adjusting the model loss value based on the evaluation value to obtain the network loss value.
3. The method of claim 1 or 2, wherein the training the image generation network based on the network loss value of the image generation network to obtain a trained image generation model comprises:
in response to the network loss value of the image generation network satisfying a training completion condition, taking the image generation network as the image generation model.
4. A method according to claim 3, wherein the method further comprises:
in response to the network loss value of the image generation network not satisfying the training completion condition, adjusting parameters of the large language model based on the network loss value and continuing to train the image generation network.
5. The method of claim 1, wherein the multiple image description features include elemental features, composition features, and style features.
6. The method of claim 1, wherein the training of the multi-modal image-text recognition network comprises:
performing a first training step:
inputting the first image sample and acquired descriptive text into the multi-modal image-text recognition network to obtain an answer text output by the multi-modal image-text recognition network;
splicing the first image sample, the descriptive text, the answer text, and a scoring text to obtain first splicing information, inputting the first splicing information into the multi-modal image-text recognition network to obtain the first score, and obtaining a first scoring model based on the loss value of the multi-modal image-text recognition network;
Performing a second training step:
inputting an acquired second image sample and the scoring text into the first scoring model to obtain a second score output by the first scoring model;
calculating a loss value of the first scoring model based on the second score; and
obtaining the image scoring model based on the loss value of the first scoring model.
7. An image processing method, the method comprising:
acquiring an image to be processed;
inputting the image to be processed into an image generation model generated by the method according to any one of claims 1-6, and obtaining an image generation result of the image to be processed.
8. The method of claim 7, wherein prior to inputting the image to be processed into the image generation model, the method further comprises:
detecting whether a size of the image to be processed is a standard size; and
in response to the size of the image to be processed not being the standard size, adjusting the image to be processed to the standard size.
9. The method according to claim 7, wherein the image generation model comprises an image-text recognition module, a large language model, and a text generation image model, and wherein the inputting the image to be processed into the image generation model generated by the method according to any one of claims 1-6 to obtain the image generation result of the image to be processed comprises:
inputting the image to be processed into the image-text recognition module to obtain a recognition text output by the image-text recognition module;
inputting the recognition text into the large language model to obtain a prompt word text output by the large language model; and
inputting the prompt word text into the text generation image model to obtain a generated image output by the text generation image model.
10. The method of claim 9, the method further comprising:
receiving an image processing requirement text, wherein the requirement text describes a requirement on the image to be generated;
after the recognition text is obtained, splicing the recognition text and the image processing requirement text to obtain second splicing information;
inputting the second splicing information into the large language model to obtain a new prompt word text output by the large language model; and
inputting the new prompt word text into the text generation image model to obtain a new generated image output by the text generation image model.
11. An image generation model training apparatus, the apparatus comprising:
a set acquisition unit configured to acquire an image sample set including at least one image sample;
a network acquisition unit configured to acquire a pre-constructed image generation network, wherein the image generation network comprises an image-text recognition module, a large language model, and a text generation image model that are connected in sequence, and wherein the image-text recognition module is configured to obtain a recognition text based on an input image;
A sample input unit configured to input an image sample selected from the image sample set into the image generation network, resulting in a generated image output by the image generation network;
The scoring unit is configured to score the generated image by adopting an image scoring model to obtain an evaluation value of the generated image;
A calculation unit configured to calculate a network loss value of the image generation network based on the evaluation value;
a model obtaining unit configured to train the image generation network based on a network loss value of the image generation network, to obtain a trained image generation model;
wherein the image scoring model is obtained by training a multi-modal image-text recognition network; an input of the multi-modal image-text recognition network comprises a first image sample and an output of the multi-modal image-text recognition network comprises a first score; the first image sample comprises a first image generated by the image generation model and a score of the first image obtained by manually scoring the first image; and the score of the first image, the first score, and a loss function of the multi-modal image-text recognition network are used to calculate a loss value of the multi-modal image-text recognition network.
12. The apparatus of claim 11, wherein the computing unit is further configured to obtain a loss function of the text-generated image model, calculate a model loss value for the text-generated image model based on the selected image sample and the loss function of the text-generated image model, and adjust the model loss value based on the evaluation value to obtain the network loss value.
13. The apparatus according to claim 11 or 12, wherein the model obtaining unit is further configured to take the image generation network as the image generation model in response to the network loss value of the image generation network satisfying a training completion condition.
14. The apparatus of claim 13, wherein the apparatus further comprises an adjustment unit configured to adjust parameters of the large language model based on the network loss value and control the sample input unit to operate in response to the network loss value of the image generation network not satisfying a training completion condition.
15. The apparatus of claim 11, wherein the multiple image description features include an elemental feature, a composition feature, and a style feature.
16. The apparatus of claim 11, wherein the multi-modal image-text recognition network is trained by a training unit, and the training unit is configured to: input the first image sample and acquired descriptive text into the multi-modal image-text recognition network to obtain an answer text output by the multi-modal image-text recognition network; splice the first image sample, the descriptive text, the answer text, and a scoring text to obtain first splicing information, and input the first splicing information into the multi-modal image-text recognition network to obtain the first score; obtain a first scoring model based on a loss value of the multi-modal image-text recognition network; input an acquired second image sample and the scoring text into the first scoring model to obtain a second score output by the first scoring model; calculate a loss value of the first scoring model based on the second score; and obtain the image scoring model based on the loss value of the first scoring model.
17. An image processing apparatus, the apparatus comprising:
An image acquisition unit configured to acquire an image to be processed;
a result obtaining unit configured to input the image to be processed into an image generation model generated using the apparatus according to any one of claims 11 to 16, and output an image generation result of the image to be processed.
18. The apparatus of claim 17, further comprising a detection unit configured to detect whether the size of the image to be processed is a standard size, and in response to the size of the image to be processed not being a standard size, adjust the image to be processed to a standard size.
19. The apparatus according to claim 17, wherein the image generation model comprises an image-text recognition module, a large language model, and a text generation image model, and the result obtaining unit is further configured to: input the image to be processed into the image-text recognition module to obtain a recognition text output by the image-text recognition module; input the recognition text into the large language model to obtain a prompt word text output by the large language model; and input the prompt word text into the text generation image model to obtain a generated image output by the text generation image model.
20. The apparatus of claim 19, the apparatus further comprising:
a receiving unit configured to receive an image processing requirement text;
a text input unit configured to, after the recognition text is obtained, splice the recognition text and the image processing requirement text to obtain second splicing information, input the second splicing information into the large language model to obtain a new prompt word text output by the large language model, and input the new prompt word text into the text generation image model to obtain a new generated image output by the text generation image model.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of claims 1-10.
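The inference-time flow of claims 9 and 10 (recognition text → optional splicing with a requirement text → prompt word text → generated image) can be sketched as follows. All functions and string formats here are hypothetical stand-ins for the claimed modules, introduced solely to illustrate how the requirement text alters the prompt.

```python
# Toy stand-ins for the modules of the image generation model.
def recognize(image):
    # image-text recognition module: image -> recognition text (stand-in)
    return f"a photo of {image}"

def llm(text):
    # large language model: text -> prompt word text (stand-in)
    return f"prompt({text})"

def text_to_image(prompt):
    # text generation image model: prompt -> generated image (stand-in)
    return f"generated<{prompt}>"

def process_image(image, requirement_text=None):
    """Run the claimed pipeline. If a requirement text is supplied
    (claim 10), it is spliced onto the recognition text before the
    large language model produces the prompt word text."""
    recognition_text = recognize(image)
    if requirement_text is not None:
        # splicing step: recognition text + image processing requirement text
        recognition_text = recognition_text + " " + requirement_text
    prompt = llm(recognition_text)
    return text_to_image(prompt)
```

With a requirement text such as "in watercolor style", the spliced input yields a new prompt and hence a new generated image, matching the two branches described in the claims.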
CN202311755560.3A 2023-12-20 2023-12-20 Image generation model training method and device, image processing method and device Active CN117745857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311755560.3A CN117745857B (en) 2023-12-20 2023-12-20 Image generation model training method and device, image processing method and device


Publications (2)

Publication Number Publication Date
CN117745857A CN117745857A (en) 2024-03-22
CN117745857B (en) 2025-04-08

Family

ID=90250410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311755560.3A Active CN117745857B (en) 2023-12-20 2023-12-20 Image generation model training method and device, image processing method and device

Country Status (1)

Country Link
CN (1) CN117745857B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118379382A (en) * 2024-04-28 2024-07-23 百度在线网络技术(北京)有限公司 Scene-based image generation method, device, equipment and storage medium
CN119693769B (en) * 2024-12-04 2025-10-17 北京百度网讯科技有限公司 Image generation model construction method, image generation method and device
CN120671641A (en) * 2025-05-07 2025-09-19 国科知机(杭州)智能科技有限公司 Demonstration document generation model training method and device and electronic equipment
CN120339756A (en) * 2025-06-19 2025-07-18 阿里云飞天(杭州)云计算技术有限公司 Image processing model training method, image generation method and device

Citations (2)

Publication number Priority date Publication date Assignee Title
CN105874449A (en) * 2013-11-08 2016-08-17 谷歌公司 Systems and methods for extracting and generating images for display content
CN116188632A (en) * 2023-04-24 2023-05-30 之江实验室 Image generation method, device, storage medium and electronic equipment

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CN110136216A (en) * 2018-02-09 2019-08-16 北京三星通信技术研究有限公司 The method and terminal device that image generates
US12148119B2 (en) * 2022-01-14 2024-11-19 Adobe Inc. Utilizing a generative neural network to interactively create and modify digital images based on natural language feedback
CN115619903A (en) * 2022-07-29 2023-01-17 平安科技(深圳)有限公司 Training and synthesizing method, device, equipment and medium for text image synthesis model
CN116797868A (en) * 2023-05-23 2023-09-22 阿里巴巴(中国)有限公司 Text image generation method and diffusion generation model training method
CN116704066A (en) * 2023-06-16 2023-09-05 平安科技(深圳)有限公司 Training method, training device, training terminal and training storage medium for image generation model
CN116977489A (en) * 2023-07-03 2023-10-31 中国人民大学 A text-guided image processing method based on diffusion model
CN117252957A (en) * 2023-09-14 2023-12-19 上海焕泽信息技术有限公司 Method, device and storage medium for generating picture with accurate text according to text description



Similar Documents

Publication Publication Date Title
CN117745857B (en) Image generation model training method and device, image processing method and device
JP7432556B2 (en) Methods, devices, equipment and media for man-machine interaction
EP4050569A1 (en) Model training method and apparatus, font library establishment method and apparatus, device and storage medium
CN112466288B (en) Voice recognition method and device, electronic equipment and storage medium
CN113591918B (en) Training method of image processing model, image processing method, device and equipment
CN113378773B (en) Gesture recognition method, gesture recognition device, gesture recognition apparatus, gesture recognition storage medium, and gesture recognition program product
JP2023541752A (en) Neural network model training methods, image retrieval methods, equipment and media
CN112580666B (en) Image feature extraction method, training method, device, electronic device and medium
CN113361363A (en) Training method, device and equipment for face image recognition model and storage medium
CN114549695B (en) Image generation method, device, electronic equipment and readable storage medium
CN117171310A (en) Digital human interaction methods, devices, electronic equipment and storage media
CN113808572B (en) Speech synthesis method, apparatus, electronic device and storage medium
CN117593608B (en) Training method, device, equipment and storage medium for graphic recognition large model
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN114037052A (en) Training method and device for detection model, electronic equipment and storage medium
CN117333889A (en) Training method and device for document detection model and electronic equipment
CN117493595B (en) Image search method, device, equipment and medium based on large model
CN117174177B (en) Training method and device for protein sequence generation model and electronic equipment
CN112559715B (en) Attitude recognition methods, devices, equipment and storage media
CN113822275A (en) An image language recognition method and related equipment
CN119849621A (en) Automatic evaluation and evaluation model acquisition method and device
CN114492793A (en) Model training and sample generating method, device, equipment and storage medium
CN113378774A (en) Gesture recognition method, device, equipment, storage medium and program product
CN116524516B (en) Text structured information determination method, device, equipment and storage medium
CN114580448B (en) A sign language interpretation method, device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant