CN117746214B - Text adjustment method, device and storage medium for generating an image based on a large model - Google Patents

Publication number: CN117746214B (application CN202410173841.6A; earlier publication CN117746214A)
Authority: CN (China)
Legal status: Active (granted)
Other languages: Chinese (zh)
Inventors: 田云龙, 苏明月, 王迪, 王淼, 徐静, 牛丽, 黄媛媛
Assignees: Qingdao Haier Technology Co Ltd; Qingdao Haier Intelligent Home Appliance Technology Co Ltd; Haier Uplus Intelligent Technology Beijing Co Ltd
Classification landscape: Processing Or Creating Images (AREA)

Abstract

The application discloses a text adjustment method, a device and a storage medium for generating images based on a large model, relating to the technical field of smart homes. The text adjustment method based on the large model comprises the following steps: acquiring a first data set; inputting a plurality of original images corresponding to the first data set into a generated text model to obtain a target text corresponding to each original image, wherein the generated text model is a model determined through a preset training mechanism and used for automatically generating a preset text structure according to the images; determining difference information between the target text and the target original text corresponding to each original image; and performing text adjustment on the target original text according to the difference information to obtain a corrected text, wherein the corrected text indicates the text obtained after the target original text is adjusted. The method solves the technical problem that it cannot be determined how to modify the input text description to obtain the expected generated-image effect, and thus improves the efficiency of obtaining the expected generated image.

Description

Text adjustment method, device and storage medium for generating image based on large model
Technical Field
The application relates to the technical field of smart homes, and in particular to a text adjustment method, device and storage medium for generating images based on a large model.
Background
With advances in science and technology and in artificial intelligence, more and more algorithms for generating images from text have been proposed and are widely used in various fields. As a tool for improving working efficiency, it is very important to accurately obtain a corresponding multi-attribute target home scene from an input text in a complex home scene. For example, the text description may contain a plurality of objects, each carrying particular visual attributes, such as an embedded household appliance.
In the related art, current text-to-image technology is mainly based on three basic algorithms: the GAN (Generative Adversarial Network), the VAE (Variational Autoencoder) and the diffusion model. While each of these model algorithms can generate an image from a provided textual description, generated images that are inconsistent with the textual description are ubiquitous, especially in complex scenarios where the textual description contains multiple objects. Moreover, during the image generation process it cannot be determined how the input text description should be modified so that the desired generated-image effect can be obtained.
Therefore, no effective solution has yet been proposed for the technical problem in the related art that it cannot be determined how to modify the input text description to obtain the desired generated-image effect.
Disclosure of Invention
The embodiment of the application provides a text adjustment method, a text adjustment device and a storage medium for generating an image based on a large model, which at least solve the technical problem that in the related art, how to modify an input text description to obtain a desired generated image effect cannot be determined.
According to an embodiment of the present application, there is provided a text adjustment method for generating an image based on a large model, including: acquiring a first data set, wherein each data set in the first data set comprises an original text, and an original image corresponding to the original text; the original text is used for describing N target objects contained in the original image and attribute information corresponding to the N target objects, wherein N is a positive integer; inputting a plurality of original images corresponding to the first data set into a generated text model to obtain a target text corresponding to each original image, wherein the generated text model is a model which is determined through a preset training mechanism and is used for automatically generating a preset text structure according to the images; determining difference information between the target text and the target original text corresponding to each original image; and carrying out text adjustment on the target original text according to the difference information to obtain a corrected text.
In an exemplary embodiment, determining the difference information between the target text and the target original text corresponding to each original image includes: describing the target text and the target original text according to a preset description format respectively to obtain first description information corresponding to the target text and second description information corresponding to the target original text; the first description information comprises a first description matrix corresponding to the target text and a first attribute type matrix corresponding to the first description matrix; the second description information comprises a second description matrix corresponding to the target original text and a second attribute type matrix corresponding to the second description matrix; comparing matrix values at the same matrix position in a first attribute type matrix in the first description information and a second attribute type matrix in the second description information to obtain a target comparison result; and determining difference information between the target text and the target original text corresponding to each original image according to the target comparison result.
In an exemplary embodiment, determining difference information between the target text and the target original text corresponding to each original image according to the target comparison result includes: determining that the text expression of the attribute of the target object indicated by a matrix value is consistent between the target text and the target original text in the case that the target comparison result indicates that the matrix values at the same matrix position in the first attribute type matrix and the second attribute type matrix are the same; determining that the text expression of the attribute of the target object indicated by a matrix value is inconsistent between the target text and the target original text in the case that the target comparison result indicates that the matrix values at the same matrix position in the first attribute type matrix and the second attribute type matrix are different; and determining the difference information between the target text and the target original text corresponding to each original image according to all the matrix positions at which the text expressions are inconsistent.
In an exemplary embodiment, performing text adjustment on the target original text according to the difference information to obtain an adjusted corrected text, including: analyzing the difference information to determine P target objects with differences and target attribute information corresponding to the P target objects respectively; determining a first priority corresponding to the P target objects and a second priority corresponding to the target attribute information according to preset priority data; determining a text adjustment mode for the target original text according to the first priority and the second priority; and adjusting the target original text by using the text adjustment mode to obtain a corrected text.
In an exemplary embodiment, determining a text adjustment manner for the target original text according to the first priority and the second priority includes: determining description adjustment orders of the P target objects in text descriptions of target original texts according to the first priority; determining description weight values of the P target objects in the text description of the target original text according to the second priority; and determining a text adjustment mode of the target original text based on a preset adjustment rule, the description adjustment order and the description weight value.
In an exemplary embodiment, before determining the text adjustment manner for the target original text based on the preset adjustment rule, the description adjustment order and the description weight value, the method further includes: acquiring a preset adjustment formula corresponding to the preset adjustment rule, wherein the preset adjustment formula determines the adjusted text description G' of the target original text from the following quantities: G, representing each target object included in the target original image corresponding to the target original text together with the attribute category corresponding to each target object; A_G, the second attribute type matrix corresponding to the target original text; A_R, the first attribute type matrix corresponding to the target text; and k_{a_m}, the weight value coefficient corresponding to the a_m-th attribute. The attribute type matrix is used to indicate the attribute types contained in each target object.
In an exemplary embodiment, after determining the difference information between the target text and the target original text corresponding to each original image, the method further includes: calculating a difference value corresponding to the difference information, wherein the difference value is used for indicating the difference size between the target text of each original image and the target original text; and comparing the difference value with a preset text difference threshold value, and determining whether to replace the target original text by the target text according to a comparison result.
In an exemplary embodiment, comparing the difference value with a preset text difference threshold, and determining whether to replace the target original text with the target text according to the comparison result includes: determining to replace the target original text with the target text if the comparison result indicates that the difference value is smaller than the preset text difference threshold; and if the comparison result indicates that the difference value is greater than or equal to the preset text difference threshold value, determining that the target original text is not replaced by the target text.
In an exemplary embodiment, after determining to replace the target original text with the target text in the case that the comparison result indicates that the difference value is smaller than the preset text difference threshold, the method further includes: associating and binding the target text with the target original image corresponding to the target original text; updating the first data set according to the binding result to obtain a second data set; and training a large model for generating images from text using the second data set.
In an exemplary embodiment, after determining not to replace the target original text with the target text, if the comparison result indicates that the difference value is greater than or equal to the preset text difference threshold, the method further includes: adding a first identifier to the target original text and a target original image corresponding to the target original text, wherein the first identifier is used for indicating that the target original text is a high-accuracy text; performing text classification on the first data set according to the first identifier to obtain a first type text set and a second type text set, wherein the first type text set is a set of original texts with the first identifier, and the second type text set is a set of original texts without the first identifier; and sending prompt information for updating the second type text set to the large model training object, wherein the prompt information is used for prompting the large model training object to execute replacement operation on the second type text set in the iterative training process.
In one exemplary embodiment, training a large model for generating images from text using the second data set includes performing the ith round of training through the following steps, wherein i is a positive integer greater than or equal to 1, and the large model obtained through the 0th round of training is a general large model that has not been trained with the second data set: acquiring the training samples used by the ith round from the second data set, wherein the training samples used by the ith round comprise a sample image used by the ith round and a text corresponding to the sample image used by the ith round; inputting the text corresponding to the sample image used by the ith round into the general large model obtained through the (i-1)th round of training to obtain a generated image of the ith round of training; acquiring the value of the first loss function of the ith round of training according to the sample image used by the ith round and the generated image obtained by the ith round of training; respectively extracting features of the sample image used by the ith round and the generated image obtained by the ith round of training using a target detection model, to obtain a first object attribute feature of the sample image used by the ith round and a second object attribute feature of the generated image obtained by the ith round of training; determining the value of the second loss function of the ith round of training according to the first object attribute feature and the second object attribute feature; determining the value of the objective loss function of the ith round of training according to the value of the first loss function of the ith round of training and the value of the second loss function of the ith round of training; and ending the training in the case that the value of the objective loss function of the ith round of training satisfies a preset convergence condition, thereby obtaining a large model, trained with the second data set, for generating images from text.
According to another embodiment of the present application, there is also provided a text adjustment apparatus for generating an image based on a large model, including: an acquisition module, configured to acquire a first data set, wherein each data set in the first data set comprises an original text and an original image corresponding to the original text, the original text being used for describing N target objects contained in the original image and attribute information corresponding to the N target objects, N being a positive integer; a text module, configured to input a plurality of original images corresponding to the first data set into a generated text model to obtain a target text corresponding to each original image, wherein the generated text model is a model determined through a preset training mechanism and used for automatically generating a preset text structure according to the images; a determining module, configured to determine difference information between the target text and the target original text corresponding to each original image; and an adjustment module, configured to perform text adjustment on the target original text according to the difference information to obtain a corrected text.
According to a further aspect of embodiments of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described text adjustment method for generating an image based on a large model when run.
According to still another aspect of the embodiment of the present application, there is further provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the above text adjustment method for generating an image based on a large model through the computer program.
In the embodiment of the application, a first data set is acquired, wherein each data set in the first data set comprises an original text and an original image corresponding to the original text; the original text is used for describing N target objects contained in the original image and attribute information corresponding to the N target objects, wherein N is a positive integer; a plurality of original images corresponding to the first data set are input into a generated text model to obtain a target text corresponding to each original image, wherein the generated text model is a model determined through a preset training mechanism and used for automatically generating a preset text structure according to the images; difference information between the target text and the target original text corresponding to each original image is determined; and text adjustment is performed on the target original text according to the difference information to obtain a corrected text. On the basis of the determined correspondence between the original images and the original texts, text generation is performed on the plurality of original images through the generated text model. By comparing the difference between the original text and the generated text, it is determined whether the text description of the original image corresponding to the original text is accurate, and content that is not accurate enough in the original text is corrected according to the generated text, thereby improving the generation accuracy when the corrected text is used to generate an image, so that the expected generated-image effect is obtained by adjusting and correcting the original text.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a hardware environment of a text adjustment method for generating an image based on a large model according to an embodiment of the present application;
FIG. 2 is a flow chart of a text adjustment method for generating an image based on a large model according to an embodiment of the present application;
FIG. 3 is a flow diagram of a text adjustment method for generating images based on large models in accordance with an embodiment of the present application;
FIG. 4 is a block diagram of a text adjustment device for generating images based on a large model according to an embodiment of the present application;
FIG. 5 is a block diagram (II) of a text adjustment apparatus for generating an image based on a large model according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, the technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description of the present application and the above-described drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to one aspect of an embodiment of the present application, there is provided a text adjustment method for generating an image based on a large model. The method is widely applied in full-house intelligent digital control application scenarios such as the smart home (Smart Home), the smart home device ecology, and the intelligent house (Intelligence House) ecology. Optionally, in the present embodiment, the above text adjustment method for generating an image based on a large model may be applied in a hardware environment constituted by the terminal device 102 and the server 104 shown in FIG. 1. As shown in FIG. 1, the server 104 is connected to the terminal device 102 through a network and may be used to provide services (such as application services) for the terminal or for a client installed on the terminal; a database may be set on the server or independently of the server to provide data storage services for the server 104, and cloud computing and/or edge computing services may be configured on the server or independently of the server to provide data computing services for the server 104.
The network may include, but is not limited to, at least one of: a wired network, a wireless network. The wired network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, a local area network; the wireless network may include, but is not limited to, at least one of: Wi-Fi (Wireless Fidelity), Bluetooth. The terminal device 102 may be, but is not limited to, a PC, a mobile phone, a tablet computer, an intelligent air conditioner, an intelligent range hood, an intelligent refrigerator, an intelligent oven, an intelligent cooking range, an intelligent washing machine, an intelligent water heater, an intelligent washing device, an intelligent dish washer, an intelligent projection device, an intelligent television, an intelligent clothes hanger, an intelligent curtain, an intelligent video device, an intelligent socket, an intelligent sound box, an intelligent fresh-air device, an intelligent kitchen-and-bathroom device, an intelligent bathroom device, an intelligent sweeping robot, an intelligent window-cleaning robot, an intelligent mopping robot, an intelligent air purification device, an intelligent steam box, an intelligent microwave oven, an intelligent kitchen appliance, an intelligent purifier, an intelligent water dispenser, an intelligent door lock, and the like.
In this embodiment, a text adjustment method for generating an image based on a large model is provided, and is applied to the above-mentioned computer terminal, and fig. 2 is a flowchart of a text adjustment method for generating an image based on a large model according to an embodiment of the present application, where the flowchart includes the following steps:
Step S202, a first data set is obtained, wherein each data set in the first data set comprises an original text, and an original image corresponding to the original text; the original text is used for describing N target objects contained in the original image and attribute information corresponding to the N target objects, wherein N is a positive integer;
as an alternative example, the target object is an object, a person, an animal, or the like in the image, and may be a refrigerator, a washing machine, a cabinet, a floor, or the like.
As an optional example, in the case where the target object is a device, the attribute information corresponding to the target object includes, but is not limited to: color information, texture information, viewing angle information, pattern information, embedded information of an object.
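For illustration only, the following is a minimal sketch of how one entry of the first data set could be organized in code; the type and field names (TargetObject, DataEntry, original_text and so on) are assumptions of this sketch and are not prescribed by the application:

```python
from dataclasses import dataclass, field

@dataclass
class TargetObject:
    name: str                   # e.g. "refrigerator"
    attributes: dict            # e.g. {"color": "white", "embedded": "yes"}

@dataclass
class DataEntry:
    original_text: str          # text describing the N target objects and their attributes
    original_image_path: str    # original image corresponding to the original text
    objects: list = field(default_factory=list)  # the N target objects

entry = DataEntry(
    original_text="a white embedded refrigerator beside a black wooden cabinet",
    original_image_path="kitchen_001.jpg",
    objects=[
        TargetObject("refrigerator", {"color": "white", "embedded": "yes"}),
        TargetObject("cabinet", {"color": "black", "material": "wood"}),
    ],
)
```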
Step S204, inputting a plurality of original images corresponding to the first data set into a generated text model to obtain a target text corresponding to each original image, wherein the generated text model is a model which is determined through a preset training mechanism and is used for automatically generating a preset text structure according to the images;
Step S206, determining difference information between the target text and the target original text corresponding to each original image;
and step S208, performing text adjustment on the target original text according to the difference information to obtain a corrected text.
Through the above steps, a first data set is obtained, wherein each data set in the first data set comprises an original text and an original image corresponding to the original text; the original text is used for describing N target objects contained in the original image and attribute information corresponding to the N target objects, wherein N is a positive integer; a plurality of original images corresponding to the first data set are input into a generated text model to obtain a target text corresponding to each original image, wherein the generated text model is a model determined through a preset training mechanism and used for automatically generating a preset text structure according to the images; difference information between the target text and the target original text corresponding to each original image is determined; and text adjustment is performed on the target original text according to the difference information to obtain a corrected text. On the basis of the determined correspondence between the original images and the original texts, text generation is performed on the plurality of original images through the generated text model. By comparing the difference between the original text and the generated text, it is determined whether the text description of the original image corresponding to the original text is accurate, and content that is not accurate enough in the original text is corrected according to the generated text, thereby improving the generation accuracy when the corrected text is used to generate an image, so that the expected generated-image effect is obtained by adjusting and correcting the original text.
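To make the four steps concrete, the following is a minimal sketch of the overall flow; the word-level set difference is only a crude stand-in for the matrix-based comparison described below, and all names (adjust_texts, text_model and the dictionary keys) are assumptions of this sketch:

```python
def word_set(text: str) -> set:
    return set(text.lower().split())

def compute_difference(target_text: str, original_text: str) -> set:
    # crude stand-in for the attribute-matrix comparison of steps S206 and S11-S13
    return word_set(target_text) ^ word_set(original_text)

def adjust_texts(dataset, text_model):
    """Steps S202-S208: caption each original image with the generated text model,
    compare the caption with the original text, and derive a corrected text."""
    corrected = []
    for entry in dataset:                                      # S202: first data set
        target_text = text_model(entry["image"])               # S204: image -> target text
        diff = compute_difference(target_text, entry["text"])  # S206: difference info
        corrected.append((entry["text"], sorted(diff)))        # S208: correction hints
    return corrected
```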
In an exemplary embodiment, determining difference information between the target text and the target original text corresponding to each original image includes the following steps S11-S13:
Step S11: describing the target text and the target original text respectively according to a preset description format to obtain first description information corresponding to the target text and second description information corresponding to the target original text; the first description information comprises a first description matrix corresponding to the target text and a first attribute type matrix corresponding to the first description matrix; the second description information comprises a second description matrix corresponding to the target original text and a second attribute type matrix corresponding to the second description matrix. For example, the target original text may be expressed as: G = A_G ⊙ D_G, wherein G is the whole text description corresponding to the target original text and D_G = [G_{i,a_j}] is the second description matrix corresponding to the target original text. Each column of the second description matrix represents one target class, G_1 representing the first class of target objects, such as refrigerators, washing machines, floors or vases; each row represents an attribute category, G_{a_1} representing the first category of object attributes, such as color information, texture information, viewing-angle information, pattern information, direction information or embedded information. It should be noted that different target objects correspond to different target attributes; for example, the refrigerator has attribute information such as color information, viewing-angle information, direction information and embedded information.
It should be noted that a floor-type object, for example, has only color information and texture information, so the second attribute type matrix A_G, namely the 0-1 matrix in the above formula, is used to control the kinds of attributes contained in each target object; a_m describes the attribute categories contained in all targets, and n represents the number of classes of target objects.
Similarly, the target text may be expressed as: R = A_R ⊙ D_R, wherein R is the whole text description corresponding to the target text and D_R = [R_{i,a_j}] is the first description matrix corresponding to the target text. Each column of the first description matrix represents one target class, R_1 representing the first class of target objects, such as refrigerators, washing machines, floors or vases; each row represents an attribute category, R_{a_1} representing the first category of object attributes, such as color information, texture information, viewing-angle information, pattern information, direction information or embedded information. It should be noted that different target objects correspond to different target attributes; for example, the refrigerator has attribute information such as color information, viewing-angle information, direction information and embedded information. A_R is the first attribute type matrix.
Step S12: comparing matrix values at the same matrix position in a first attribute type matrix in the first description information and a second attribute type matrix in the second description information to obtain a target comparison result;
Optionally, the first attribute type matrix A_R is compared with the second attribute type matrix A_G position by position. If, when the first row and first column are compared, both matrices are found to be 1, the target class and target attribute are expressed identically in the two texts, so no modification is needed. If, when the first row and second column are compared, the first attribute type matrix is found to be 1 while the second attribute type matrix is 0, the target text and the target original text differ in the target-class description of the same target object, and the target-class description of the target original text needs to be corrected using the target text. If, when the second row and first column are compared, the first attribute type matrix is found to be 1 while the second attribute type matrix is 0, the target text and the target original text differ in the target-attribute description of the same target object, and the target-attribute description of the target original text needs to be corrected using the target text.
step S13: and determining difference information between the target text and the target original text corresponding to each original image according to the target comparison result.
In an exemplary embodiment, step S13 above, determining difference information between the target text and the target original text corresponding to each original image according to the target comparison result, includes the following substeps:
Step S13-01: determining that the text expression of the attribute of the target object indicated by a matrix value is consistent between the target text and the target original text in the case that the target comparison result indicates that the matrix values at the same matrix position in the first attribute type matrix and the second attribute type matrix are the same;
Step S13-02: determining that the text expression of the attribute of the target object indicated by a matrix value is inconsistent between the target text and the target original text in the case that the target comparison result indicates that the matrix values at the same matrix position in the first attribute type matrix and the second attribute type matrix are different;
Step S13-03: determining the difference information between the target text and the target original text corresponding to each original image according to all the matrix positions at which the text expressions are inconsistent.
That is, if the values at the same position of the two attribute type matrices are the same, the attribute corresponding to the target object is correctly expressed in the image and no modification or adjustment is required; if the values at the same position differ, the attribute corresponding to the target object is not correctly expressed in the image. Then, according to the positions determined not to be correctly expressed, the difference information between the target text and the target original text can be rapidly located.
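A minimal illustration of this position-by-position comparison, using NumPy as an implementation assumption (the application does not prescribe one); the example matrices reproduce the comparison described above, where the first row/second column and second row/first column disagree:

```python
import numpy as np

# 0-1 attribute type matrices: rows = attribute categories a1..am,
# columns = target objects G1..Gn
A_R = np.array([[1, 1],    # first attribute type matrix (target text)
                [1, 0]])
A_G = np.array([[1, 0],    # second attribute type matrix (target original text)
                [0, 0]])

# positions where the two texts express an attribute differently (steps S13-02/S13-03)
diff_positions = np.argwhere(A_R != A_G)
print(diff_positions)      # [[0 1] [1 0]] -> the object/attribute pairs to correct
```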
In an exemplary embodiment, performing text adjustment on the target original text according to the difference information to obtain an adjusted corrected text, including: analyzing the difference information to determine P target objects with differences and target attribute information corresponding to the P target objects respectively; determining a first priority corresponding to the P target objects and a second priority corresponding to the target attribute information according to preset priority data; wherein P is a positive integer, and P is less than or equal to N; determining a text adjustment mode for the target original text according to the first priority and the second priority; and adjusting the target original text by using the text adjustment mode to obtain a corrected text.
For example, the corresponding difference information is parsed, and the target classes and attribute categories on which the target text and the target original text differ are determined according to the positions of the differing values in the matrix; the description order of these target classes and attribute categories in the text description is then determined, and the priority is determined according to the description order, so that the priority of the target classes in the text description may be P(G_1) > P(G_2) > ... > P(G_n); then, for the attribute information in each target object G_i, adjustment is performed, determining that the priority order between the respective attributes may be P(a_1) > P(a_2) > ... > P(a_m). The weight values of the text descriptions are then adjusted according to the rules.
Optionally, the rule may be to adjust sequentially, from small to large, according to the coordinates of the positions at which the matrices differ; or the rows or columns may be sorted from large to small by the number of differing positions they contain, and the adjustment performed row by row or column by column in that order.
Optionally, when inputting text, a priority order is set for the different attributes of the different target objects, and different weight value coefficients are set according to the attribute priority level. For example, with m target attributes in total, the weight coefficient of the lowest-priority target attribute is set to 1.01, and the coefficient is gradually increased as the priority increases.
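A small sketch of this weight-coefficient scheme; the base value 1.01 comes from the example above, while the step size of 0.01 per priority level is an assumption of the sketch:

```python
def attribute_weights(m: int, base: float = 1.01, step: float = 0.01) -> list:
    """Weight coefficient per attribute: the lowest-priority attribute receives
    `base`, and the coefficient grows as the priority increases."""
    # index 0 = highest-priority attribute a1, index m-1 = lowest priority
    return [base + step * (m - 1 - i) for i in range(m)]

print(attribute_weights(4))  # [1.04, 1.03, 1.02, 1.01]
```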
In an exemplary embodiment, determining a text adjustment manner for the target original text according to the first priority and the second priority includes: determining description adjustment orders of the P target objects in text descriptions of target original texts according to the first priority; determining description weight values of the P target objects in the text description of the target original text according to the second priority; and determining a text adjustment mode of the target original text based on a preset adjustment rule, the description adjustment order and the description weight value.
In an exemplary embodiment, before determining the text adjustment manner for the target original text based on the preset adjustment rule, the description adjustment order and the description weight value, the method further includes: acquiring a preset adjustment formula corresponding to the preset adjustment rule, wherein the preset adjustment formula determines the adjusted text description G' of the target original text from the following quantities: G, representing each target object included in the target original image corresponding to the target original text together with the attribute category corresponding to each target object; A_G, the second attribute type matrix corresponding to the target original text; A_R, the first attribute type matrix corresponding to the target text; and k_{a_m}, the weight value coefficient corresponding to the a_m-th attribute. The attribute type matrix is used to indicate the attribute types contained in each target object.
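The exact functional form of the preset adjustment formula is not reproduced here, so the following sketch applies one plausible form consistent with the definitions above, G' = D_G ⊙ [A_G + K ⊙ (A_R − A_G)]; both this form and the numeric stand-in for the description matrix are assumptions of the sketch:

```python
import numpy as np

def adjust_description(D_G, A_G, A_R, k):
    """Assumed form G' = D_G * (A_G + K * (A_R - A_G)), applied elementwise.
    D_G: m x n description matrix of the target original text (numeric stand-in),
    A_G, A_R: 0-1 attribute type matrices, k: weight coefficients k_a1..k_am."""
    K = np.asarray(k, dtype=float).reshape(-1, 1)   # broadcast over object columns
    return np.asarray(D_G) * (np.asarray(A_G) + K * (np.asarray(A_R) - np.asarray(A_G)))
```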
In an exemplary embodiment, after determining the difference information between the target text and the target original text corresponding to each original image, the method further includes: calculating a difference value corresponding to the difference information, wherein the difference value is used for indicating the difference size between the target text of each original image and the target original text; and comparing the difference value with a preset text difference threshold value, and determining whether to replace the target original text by the target text according to a comparison result.
In an exemplary embodiment, comparing the difference value with a preset text difference threshold, and determining whether to replace the target original text with the target text according to the comparison result includes: determining to replace the target original text with the target text if the comparison result indicates that the difference value is smaller than the preset text difference threshold; and if the comparison result indicates that the difference value is greater than or equal to the preset text difference threshold value, determining that the target original text is not replaced by the target text.
It should be noted that the difference value is mainly used for measuring the degree of difference between the target text and the target original text. When the degree of difference is low (i.e., the difference value is lower than the preset text difference threshold), the target text and the target original text agree apart from negligible deviations, so the target text can safely replace the target original text in its correspondence with the original image without affecting subsequent use; when the degree of difference is high (i.e., the difference value is at or above the preset text difference threshold), the target text cannot be guaranteed to correspond correctly to the original image, so the target original text is retained rather than replaced.
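A sketch of the threshold decision, following the rule stated above (replacement when the difference value is below the threshold); the mismatch-ratio metric and the threshold value 0.25 are assumptions, since the application only requires some scalar difference value:

```python
import numpy as np

def difference_value(A_R, A_G) -> float:
    """Scalar difference between target text and target original text:
    here, the fraction of attribute-type entries on which they disagree."""
    return float(np.mean(np.asarray(A_R) != np.asarray(A_G)))

PRESET_TEXT_DIFFERENCE_THRESHOLD = 0.25  # illustrative value

def should_replace(A_R, A_G) -> bool:
    # replace the target original text with the target text only when
    # the difference value is smaller than the preset threshold
    return difference_value(A_R, A_G) < PRESET_TEXT_DIFFERENCE_THRESHOLD
```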
In an exemplary embodiment, after determining to replace the target original text with the target text in the case that the comparison result indicates that the difference value is smaller than the preset text difference threshold, the method further includes: associating and binding the target text with the target original image corresponding to the target original text; updating the first data set according to the binding result to obtain a second data set; and training a large model for generating images from text using the second data set.
In an exemplary embodiment, after determining not to replace the target original text with the target text, if the comparison result indicates that the difference value is greater than or equal to the preset text difference threshold, the method further includes: adding a first identifier to the target original text and a target original image corresponding to the target original text, wherein the first identifier is used for indicating that the target original text is a high-accuracy text; performing text classification on the first data set according to the first identifier to obtain a first type text set and a second type text set, wherein the first type text set is a set of original texts with the first identifier, and the second type text set is a set of original texts without the first identifier; and sending prompt information for updating the second type text set to the large model training object, wherein the prompt information is used for prompting the large model training object to execute replacement operation on the second type text set in the iterative training process.
In one exemplary embodiment, training a large model for generating images from text using the second data set includes performing the ith round of training through the following steps, wherein i is a positive integer greater than or equal to 1, and the large model obtained through the 0th round of training is a general large model that has not been trained with the second data set: acquiring the training samples used by the ith round from the second data set, wherein the training samples used by the ith round comprise a sample image used by the ith round and a text corresponding to the sample image used by the ith round; inputting the text corresponding to the sample image used by the ith round into the general large model obtained through the (i-1)th round of training to obtain a generated image of the ith round of training; acquiring the value of the first loss function of the ith round of training according to the sample image used by the ith round and the generated image obtained by the ith round of training; respectively extracting features of the sample image used by the ith round and the generated image obtained by the ith round of training using a target detection model, to obtain a first object attribute feature of the sample image used by the ith round and a second object attribute feature of the generated image obtained by the ith round of training; determining the value of the second loss function of the ith round of training according to the first object attribute feature and the second object attribute feature; determining the value of the objective loss function of the ith round of training according to the value of the first loss function of the ith round of training and the value of the second loss function of the ith round of training; and ending the training in the case that the value of the objective loss function of the ith round of training satisfies a preset convergence condition, thereby obtaining a large model, trained with the second data set, for generating images from text.
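A condensed sketch of one training round i; PyTorch, the MSE losses and the weighted sum alpha*loss1 + beta*loss2 are assumptions of the sketch, since the application does not specify the concrete loss functions or how they are combined:

```python
import torch
import torch.nn.functional as F

def train_round(gen_model, det_model, optimizer, sample_image, sample_text,
                alpha: float = 1.0, beta: float = 1.0) -> float:
    """One round of fine-tuning the text-to-image model on the second data set."""
    generated = gen_model(sample_text)          # text -> generated image of round i

    # first loss: discrepancy between the sample image and the generated image
    loss1 = F.mse_loss(generated, sample_image)

    # object attribute features from the (frozen) target detection model
    with torch.no_grad():
        feat_sample = det_model(sample_image)   # first object attribute feature
    feat_generated = det_model(generated)       # second object attribute feature
    loss2 = F.mse_loss(feat_generated, feat_sample)  # second loss: attribute level

    loss = alpha * loss1 + beta * loss2         # objective loss of round i
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```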
In order to better understand the process of the text adjustment method for generating the image based on the large model, the following describes the flow of the implementation method for text adjustment for generating the image based on the large model in combination with the alternative embodiment, but is not limited to the technical scheme of the embodiment of the present application.
As a tool for improving the working efficiency, it is very important to accurately acquire a corresponding multi-attribute target home scene through an input text in a complex home scene. For example, the text description includes furniture, embedded appliances, and decorations placed on different features. However, in the related art, only the algorithm model is used to generate the image corresponding to the provided text description, but in the complex multi-attribute target home scene, the generated image obtained according to the input text description does not necessarily meet the expected effect. Therefore, there is a problem of how to adjust the input of the text description to obtain a desired generated image effect.
As an alternative implementation manner, the present application proposes a text adjustment method for a text generated image, specifically, fig. 3 is a schematic diagram of a text adjustment method for a text generated image based on a large model according to an embodiment of the present application, as shown in fig. 3, including the following steps:
Step S302: determining a plurality of text descriptions and corresponding images of the text descriptions;
step S304: processing the plurality of text descriptions using the text-generated image model;
step S306: obtaining a plurality of target images generated by a plurality of text descriptions;
Step S308: inputting a plurality of target images into an image generation text model;
It should be noted that the current image-generation text model is trained using text-image pairs from a specific scene. Here it is trained with image-text data from complex home scenes containing multiple targets and multiple attributes, yielding an image-generation text model for this specific scene. The trained image-generation text model takes an image as input and outputs the target objects contained in the image together with the attribute information corresponding to each target object, ordered by the pixel proportion each target object occupies in the image. For example: black wooden cabinet, walnut wood-stripe floor, white embedded refrigerator, and so on.
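The ordering by pixel proportion might look like the following sketch; the (entity_phrase, pixel_count) detection format is an assumption:

```python
def order_by_pixel_share(detections, image_area: int) -> list:
    """detections: list of (entity_phrase, pixel_count) pairs from the detector.
    Returns the entity phrases ordered by their pixel proportion in the image."""
    ranked = sorted(detections, key=lambda d: d[1] / image_area, reverse=True)
    return [phrase for phrase, _ in ranked]

phrases = order_by_pixel_share(
    [("black wooden cabinet", 90000), ("white embedded refrigerator", 150000)],
    image_area=1920 * 1080,
)
print(phrases)  # ['white embedded refrigerator', 'black wooden cabinet']
```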
Alternatively, the current image-generation text model is mainly aimed at complex home scenes with multiple targets and multiple attributes, and its final output takes the form of entity phrases, such as a white embedded refrigerator, a black washing machine, a wood-stripe floor and the like. The above entity phrase form may be represented by:
G = A_G ⊙ [G_{i,a_j}]
wherein G_i represents the target object and a_j represents the attribute information corresponding to the target object; a_m describes the types of attribute information, and n represents the number of classes of target objects. The attribute information includes color information, texture information, viewing-angle information, pattern information, embedded information and the like of the target object, where the values of n and m are positive integers.
Step S310: determining the target text descriptions corresponding to the generated multiple target images;
Furthermore, the target text description adopts a description mode with the same structure as the entity-phrase form finally output by the image-generation text model, namely:
R = A_R ⊙ [R_{i,a_j}]
wherein R_i represents the target object and a_j represents the attribute information corresponding to the target object; a_m describes the types of attribute information, and n represents the number of classes of target objects. Whether to modify the training data used to train the image-generation text model is then determined from the difference relation between the text descriptions G and R, thereby improving the generation accuracy of the image-generation text model.
Step S312: calculating a difference value of the two text descriptions, namely the target text description and the text description;
Step S314: comparing the difference value between the two text descriptions with the preset threshold value;
Step S316: in the case that the difference value is smaller than the preset threshold value, determining that the text description accurately describes its corresponding image, so that the text description does not need to be adjusted, and outputting the correspondence between the original image and the text description.
Step S318: in the case that the difference value is greater than or equal to the preset threshold value, determining that there is a discrepancy between the text description and its corresponding image, and adjusting the text description to ensure accurate correspondence between the text description and the image. The two text descriptions are further input into an automatic text updating module, and the text description is replaced with the target text description so as to correspond to the original image.
It should be noted that the image-generation text model may be implemented using an object detection model. First, the target classes contained in the picture are determined, for example targets such as refrigerators, washing machines, cabinets and floors; this function may be achieved using target detection algorithms such as the Faster R-CNN (Faster Region-based Convolutional Neural Network) and YOLO (You Only Look Once) series of algorithm models. The class feature loss of multiple targets is:
Loss2 = k_1·L_{c_1} + k_2·L_{c_2} + ... + k_n·L_{c_n}
wherein L_{c_1} represents the class feature loss corresponding to target object 1, n represents the number of classes of target objects, and k_i represents the weight value of the class loss function corresponding to the different targets. After the classes of the targets contained in the picture have been identified, the targets are sent into different model structures to identify their attribute features according to the attribute characteristics of the different targets. For example, the attribute information basically includes color information, texture information, viewing-angle information, pattern information, direction information, embedded information and the like of the target object. It should be noted, however, that the target attributes contained in different target objects differ: the refrigerator, for example, has attribute information including color information, viewing-angle information, direction information and embedded information, while a floor-type target contains only color information and material information. Therefore, the model determines the attributes of each target according to the identified target classes, which have different attributes, and sends each target into a different attribute recognition network to determine the categories. The attribute class loss of multiple targets is:
Loss3 = s_1·L_{a_1} + s_2·L_{a_2} + ... + s_n·L_{a_n}
wherein L_{a_1} represents the target attribute class loss corresponding to target object 1, n represents the number of classes of target objects, and s_i represents the weight value of the attribute class loss function corresponding to the different targets. Since the different targets contained in the picture have different kinds of attribute categories, the number and category of attributes differ for each target. Assuming that the targets include m attribute types in total, the attribute types of target object 1 may be (a1, a2, a4, ..., ai) (i < m) and the attribute types of target object 2 may be (a1, a3, a4, ..., aj) (j < m); the attribute class losses corresponding to target object 1 and target object 2 are then the sums of the per-attribute loss terms over (a1, a2, a4, ..., ai) and over (a1, a3, a4, ..., aj), respectively.
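A small sketch of these weighted multi-target losses; the per-target loss values, the weights and the 0-1 attribute mask below are illustrative assumptions consistent with the notation above:

```python
import numpy as np

def weighted_class_loss(per_target_losses, k):
    """Loss2 = sum_i k_i * L_ci over the n target classes (and likewise
    Loss3 with weights s_i over the attribute class losses)."""
    return float(np.dot(k, per_target_losses))

def attribute_class_loss(per_attribute_losses, attribute_mask):
    """Per-target attribute loss: only the attribute types the target actually
    has (mask entry 1) contribute, e.g. a floor contributes only color/material."""
    return float(np.dot(attribute_mask, per_attribute_losses))

# target object 1 has attribute types (a1, a2, a4): a3 is masked out
loss_obj1 = attribute_class_loss([0.2, 0.1, 0.4, 0.3], [1, 1, 0, 1])
```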
Through the above steps, the text description of the generated image is acquired using the image-to-text method, the accuracy of the real text description is judged against the text description of the generated image, and the image generation result is optimized by adjusting the order and the weights of the real input text description according to the difference between the generated text description and the real text description. This solves the problem that, in current complex home-scene images with multiple targets and multiple attributes, the effect of the image generated from an input text description cannot be anticipated; the real input text description can be adjusted and modified according to the text description of the generated image, so that the expected generated-image effect is obtained.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the various embodiments of the present application.
FIG. 4 is a block diagram of a text adjustment device for generating images based on a large model according to an embodiment of the present application; as shown in fig. 4, includes:
An obtaining module 42, configured to obtain a first data set, where each data set in the first data set includes an original text, and an original image corresponding to the original text; the original text is used for describing N target objects contained in the original image and attribute information corresponding to the N target objects, wherein N is a positive integer;
a text module 44, configured to input a plurality of original images corresponding to the first data set into a generated text model to obtain a target text corresponding to each original image, wherein the generated text model is a model determined through a preset training mechanism and used for automatically generating a preset text structure according to the images;
A determining module 46, configured to determine difference information between the target text and the target original text corresponding to each original image;
And the adjustment module 48 is configured to perform text adjustment on the target original text according to the difference information, so as to obtain a corrected text.
With the above device, a first data set is acquired, wherein each data set in the first data set comprises an original text and an original image corresponding to the original text; the original text is used for describing N target objects contained in the original image and attribute information corresponding to the N target objects, wherein N is a positive integer; a plurality of original images corresponding to the first data set are input into a generated text model to obtain a target text corresponding to each original image, wherein the generated text model is a model determined through a preset training mechanism and used for automatically generating a preset text structure according to the images; difference information between the target text and the target original text corresponding to each original image is determined; and text adjustment is performed on the target original text according to the difference information to obtain a corrected text. On the basis of the determined correspondence between the original images and the original texts, text generation is performed on the plurality of original images through the generated text model. By comparing the difference between the original text and the generated text, it is determined whether the text description of the original image corresponding to the original text is accurate, and content that is not accurate enough in the original text is corrected according to the generated text, thereby improving the generation accuracy when the corrected text is used to generate an image, so that the expected generated-image effect is obtained by adjusting and correcting the original text.
In an exemplary embodiment, the determining module 46 is further configured to describe the target text and the target original text according to a preset description format, so as to obtain first description information corresponding to the target text and second description information corresponding to the target original text; the first description information comprises a first description matrix corresponding to the target text and a first attribute type matrix corresponding to the first description matrix; the second description information comprises a second description matrix corresponding to the target original text and a second attribute type matrix corresponding to the second description matrix; comparing matrix values at the same matrix position in a first attribute type matrix in the first description information and a second attribute type matrix in the second description information to obtain a target comparison result; and determining difference information between the target text and the target original text corresponding to each original image according to the target comparison result.
In an exemplary embodiment, the determining module 46 is further configured to determine, in a case where the target comparison result indicates that the matrix values at the same matrix position in the first attribute type matrix and the second attribute type matrix are the same, that the textual expression of the attribute of the target object indicated by those matrix values is consistent between the target text and the target original text; determine, in a case where the target comparison result indicates that the matrix values at the same matrix position in the first attribute type matrix and the second attribute type matrix are different, that the textual expression of the attribute of the target object indicated by those matrix values is inconsistent between the target text and the target original text; and determine the difference information between the target text and the target original text corresponding to each original image according to all matrix positions at which the textual expressions are inconsistent.
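For illustration, a minimal sketch of this elementwise comparison, assuming the attribute type matrices are integer-coded NumPy arrays of identical shape (the integer encoding and the layout are assumptions, not specified by the text):

import numpy as np

def diff_positions(gen_attr, orig_attr):
    """Compare two attribute type matrices elementwise and return the
    positions whose coded values disagree. Hypothetical encoding: rows
    index target objects, columns index attribute categories, and each
    cell holds an integer code for the attribute value expressed in the
    text (0 meaning the attribute is absent)."""
    if gen_attr.shape != orig_attr.shape:
        raise ValueError("matrices must cover the same objects and attributes")
    mismatch = gen_attr != orig_attr  # True wherever the two texts disagree
    return [(int(r), int(c)) for r, c in zip(*np.nonzero(mismatch))]

gen = np.array([[1, 2, 0], [3, 0, 4]])   # from the generated target text
orig = np.array([[1, 5, 0], [3, 0, 4]])  # from the target original text
print(diff_positions(gen, orig))          # [(0, 1)]: object 0, attribute 1 differ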
In an exemplary embodiment, the adjustment module 48 is further configured to parse the difference information to determine P target objects having differences and the target attribute information respectively corresponding to the P target objects; determine, according to preset priority data, a first priority corresponding to the P target objects and a second priority corresponding to the target attribute information; determine a text adjustment mode for the target original text according to the first priority and the second priority; and adjust the target original text using the text adjustment mode to obtain the corrected text.
In an exemplary embodiment, the adjustment module 48 is further configured to determine, according to the first priority, a description adjustment order of the P target objects in the text description of the target original text; determine, according to the second priority, description weight values of the P target objects in the text description of the target original text; and determine the text adjustment mode for the target original text based on a preset adjustment rule, the description adjustment order, and the description weight values.
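A minimal sketch of such a priority-driven plan, assuming the preset priority data is a pair of mappings, one from object names to ranks (the first priority) and one from attribute names to weight values (the second priority); all field names here are hypothetical:

def plan_adjustment(diff_objects, object_priority, attribute_weight):
    """Order the differing target objects by the first priority (object
    rank) and attach a description weight from the second priority
    (attribute weight); both mappings stand in for the preset priority
    data."""
    ordered = sorted(diff_objects,
                     key=lambda o: object_priority.get(o["name"], float("inf")))
    return [{**o, "weight": attribute_weight.get(o["attribute"], 1.0)}
            for o in ordered]

plan = plan_adjustment(
    [{"name": "refrigerator", "attribute": "color"},
     {"name": "table", "attribute": "material"}],
    object_priority={"table": 1, "refrigerator": 2},
    attribute_weight={"color": 0.9, "material": 0.4},
)
print(plan)  # the table (rank 1) is described before the refrigerator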
In an exemplary embodiment, the foregoing adjustment module 48 further includes an obtaining unit, configured to obtain a preset adjustment formula corresponding to the preset adjustment rule before the text adjustment mode for the target original text is determined based on the preset adjustment rule, the description adjustment order, and the description weight value. The formula itself is published as an image and is not reproduced here; it relates the following quantities (placeholder symbols added for readability): T′, the text description of the target original text after adjustment; O, each target object included in the target original image corresponding to the target original text together with the attribute category corresponding to each target object; M_orig, the second attribute type matrix corresponding to the target original text; M_gen, the first attribute type matrix corresponding to the target text; and w_am, the weight coefficient corresponding to the a_m-th attribute, where an attribute type matrix is used for indicating the attribute types contained in each target object.
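Because the published formula is available only as an image, the sketch below shows merely one plausible reading of such a weighted correction; the symbols M_orig and M_gen, the per-attribute weight coefficients, and the 0.5 cutoff are hypothetical assumptions, not the patent's definitions.

import numpy as np

def adjust_description(M_orig, M_gen, weights, cutoff=0.5):
    """One plausible weighted adjustment (an assumption, not the patent's
    formula): where the two attribute type matrices disagree and the
    attribute's weight coefficient is at least `cutoff`, take the generated
    text's coded value; otherwise keep the original text's value."""
    disagree = M_orig != M_gen
    take_gen = disagree & (weights >= cutoff)  # weights broadcast over attribute columns
    return np.where(take_gen, M_gen, M_orig)

adjusted = adjust_description(
    np.array([[1, 2], [3, 4]]),   # M_orig: original text's attribute codes
    np.array([[1, 5], [6, 4]]),   # M_gen: generated text's attribute codes
    np.array([0.2, 0.9]),         # weight coefficient per attribute column
)
print(adjusted)  # [[1 5] [3 4]]: only the high-weight column is corrected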
Optionally, in addition to the acquisition module 42, the text module 44, the determining module 46, and the adjustment module 48, the text adjustment device for generating an image based on a large model further includes a comparison module 50 and an update module 52, as shown in FIG. 5. FIG. 5 is a block diagram (II) of a text adjustment device for generating an image based on a large model according to an embodiment of the present application.
In an exemplary embodiment, the above apparatus further includes: a comparison module 50, configured to calculate a difference value corresponding to the difference information after determining the difference information between the target text and the target original text corresponding to each original image, where the difference value is used to indicate a difference size between the target text and the target original text of each original image; and comparing the difference value with a preset text difference threshold value, and determining whether to replace the target original text by the target text according to a comparison result.
In an exemplary embodiment, the comparison module 50 is further configured to determine to replace the target original text with the target text if the comparison result indicates that the difference value is less than the preset text difference threshold; and to determine not to replace the target original text with the target text if the comparison result indicates that the difference value is greater than or equal to the preset text difference threshold.
In an exemplary embodiment, the above apparatus further includes: an updating module 52, configured to, when the comparison result indicates that the difference value is smaller than the preset text difference threshold, determine that the target text is used to replace the target original text, and then perform association binding on the target text and a target original image corresponding to the target original text; updating the first data set according to the binding result to obtain a second data set; training a large model of the text-generated image using the second set of data sets.
In an exemplary embodiment, the comparison module further includes: an identification unit, configured to, after it is determined not to replace the target original text with the target text in the case where the comparison result indicates that the difference value is greater than or equal to the preset text difference threshold, add a first identifier to the target original text and the target original image corresponding to the target original text, where the first identifier is used to indicate that the target original text is a high-accuracy text; a classification unit, configured to perform text classification on the first data set according to the first identifier to obtain a first type text set and a second type text set, where the first type text set is the set of original texts carrying the first identifier and the second type text set is the set of original texts without the first identifier; and a prompting unit, configured to send prompt information for updating the second type text set to a large model training object, where the prompt information is used to prompt the large model training object to perform a replacement operation on the second type text set during iterative training.
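And a sketch of the first-identifier classification, with `high_accuracy` as a hypothetical field name standing in for the first identifier:

def classify_texts(first_dataset):
    """Split the first data set into a first type text set (entries carrying
    the first identifier) and a second type text set (entries without it),
    so the latter can be flagged for replacement during iterative training."""
    first_type = [e for e in first_dataset if e.get("high_accuracy")]
    second_type = [e for e in first_dataset if not e.get("high_accuracy")]
    return first_type, second_type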
In an exemplary embodiment, the update module 52 is further configured to perform the i-th round of training through the following steps, where i is a positive integer greater than or equal to 1, and the large model obtained through the 0-th round of training is a general large model not yet trained with the second data set: acquiring the training sample used in the i-th round from the second data set, where the training sample used in the i-th round includes a sample image used in the i-th round and a text corresponding to that sample image; inputting the text corresponding to the sample image used in the i-th round into the general large model obtained through the (i-1)-th round of training, to obtain a generated image of the i-th round of training; acquiring a value of a first loss function of the i-th round of training according to the sample image used in the i-th round and the generated image obtained in the i-th round of training; performing feature extraction, using a target detection model, on the sample image used in the i-th round and on the generated image obtained in the i-th round of training, respectively, to obtain a first object attribute feature of the sample image and a second object attribute feature of the generated image; determining a value of a second loss function of the i-th round of training according to the first object attribute feature and the second object attribute feature; determining a value of the objective loss function of the i-th round of training according to the value of the first loss function and the value of the second loss function of the i-th round; and ending the training in a case where the value of the objective loss function of the i-th round of training meets a preset convergence condition, to obtain the large model, trained using the second data set, for generating an image from text.
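The round-by-round training can be pictured with the sketch below; the mean-squared losses, the `alpha` mixing coefficient, and the stub models are assumptions standing in for the patent's first loss function, second loss function, general large model, and target detection model:

import numpy as np

rng = np.random.default_rng(0)

def generate_image(model, text):
    """Stub for the general large model of round i-1 producing an image."""
    return rng.random((8, 8))

def object_attribute_features(image):
    """Stub for the target detection model extracting object attribute features."""
    return image.mean(axis=0)

def train(second_dataset, max_rounds=10, alpha=0.5, tol=1e-3):
    model = None  # round 0: a general large model not yet trained on the data set
    for i in range(1, max_rounds + 1):
        entry = second_dataset[(i - 1) % len(second_dataset)]
        sample_image, text = entry["image"], entry["text"]
        generated = generate_image(model, text)
        loss1 = float(np.mean((sample_image - generated) ** 2))  # first loss: images
        f_sample = object_attribute_features(sample_image)
        f_generated = object_attribute_features(generated)
        loss2 = float(np.mean((f_sample - f_generated) ** 2))    # second loss: features
        objective = loss1 + alpha * loss2                        # objective loss
        # a real implementation would update `model` against `objective` here
        if objective < tol:  # preset convergence condition
            break
    return model

model = train([{"image": np.zeros((8, 8)),
                "text": "a red refrigerator in a bright kitchen"}])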
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and working together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
An embodiment of the present application also provides a storage medium including a stored program, wherein the program, when run, performs the method of any one of the above embodiments.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store program code for performing the steps of:
S1, acquiring a first data set, wherein each data set in the first data set comprises an original text, and an original image corresponding to the original text; the original text is used for describing N target objects contained in the original image and attribute information corresponding to the N target objects, wherein N is a positive integer;
s2, inputting a plurality of original images corresponding to the first data set into a generated text model to obtain a target text corresponding to each original image, wherein the generated text model is a model which is determined through a preset training mechanism and is used for automatically generating a preset text structure according to the images;
s3, determining difference information between the target text and the target original text corresponding to each original image;
And S4, performing text adjustment on the target original text according to the difference information to obtain a corrected text.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
S1, acquiring a first data set, wherein each data set in the first data set comprises an original text, and an original image corresponding to the original text; the original text is used for describing N target objects contained in the original image and attribute information corresponding to the N target objects, wherein N is a positive integer;
s2, inputting a plurality of original images corresponding to the first data set into a generated text model to obtain a target text corresponding to each original image, wherein the generated text model is a model which is determined through a preset training mechanism and is used for automatically generating a preset text structure according to the images;
s3, determining difference information between the target text and the target original text corresponding to each original image;
And S4, performing text adjustment on the target original text according to the difference information to obtain a corrected text.

Alternatively, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium that can store program code.
Alternatively, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations, which are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the present application described above may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network formed by multiple computing devices. Optionally, they may be implemented by program code executable by the computing devices, so that they may be stored in a storage device and executed by the computing devices; in some cases, the steps shown or described may be performed in a different order than described here. Alternatively, they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The foregoing is merely a preferred embodiment of the present application, and it should be noted that several improvements and modifications may be made by those of ordinary skill in the art without departing from the principles of the present application; such improvements and modifications shall also fall within the scope of protection of the present application.

Claims (12)

1. A text adjustment method for generating an image based on a large model, comprising:
Acquiring a first data set, wherein each data set in the first data set comprises an original text, and an original image corresponding to the original text; the original text is used for describing N target objects contained in the original image and attribute information corresponding to the N target objects, wherein N is a positive integer;
Inputting a plurality of original images corresponding to the first data set into a generated text model to obtain a target text corresponding to each original image, wherein the generated text model is a model which is determined through a preset training mechanism and is used for automatically generating a preset text structure according to the images;
determining difference information between the target text and the target original text corresponding to each original image;
Performing text adjustment on the target original text according to the difference information to obtain a corrected text;
wherein performing text adjustment on the target original text according to the difference information to obtain the corrected text comprises:
Analyzing the difference information to determine P target objects with differences and target attribute information corresponding to the P target objects respectively;
determining a first priority corresponding to the P target objects and a second priority corresponding to the target attribute information according to preset priority data, wherein P is a positive integer, and P is less than or equal to N;
determining a text adjustment mode for the target original text according to the first priority and the second priority;
And adjusting the target original text by using the text adjustment mode to obtain a corrected text.
2. The text adjustment method for generating an image based on a large model according to claim 1, wherein determining difference information between the target text and the target original text corresponding to each original image comprises:
Describing the target text and the target original text according to a preset description format respectively to obtain first description information corresponding to the target text and second description information corresponding to the target original text; the first description information comprises a first description matrix corresponding to the target text and a first attribute type matrix corresponding to the first description matrix; the second description information comprises a second description matrix corresponding to the target original text and a second attribute type matrix corresponding to the second description matrix;
Comparing matrix values at the same matrix position in a first attribute type matrix in the first description information and a second attribute type matrix in the second description information to obtain a target comparison result;
And determining difference information between the target text and the target original text corresponding to each original image according to the target comparison result.
3. The text adjustment method for generating an image based on a large model according to claim 2, wherein determining difference information between the target text and the target original text corresponding to each original image according to the target comparison result comprises:
determining, in a case where the target comparison result indicates that the matrix values at the same matrix position in the first attribute type matrix and the second attribute type matrix are the same, that the textual expression of the attribute of the target object indicated by those matrix values is consistent between the target text and the target original text;

determining, in a case where the target comparison result indicates that the matrix values at the same matrix position in the first attribute type matrix and the second attribute type matrix are different, that the textual expression of the attribute of the target object indicated by those matrix values is inconsistent between the target text and the target original text;

and determining the difference information between the target text and the target original text corresponding to each original image according to all matrix positions at which the textual expressions are inconsistent.
4. The text adjustment method for generating an image based on a large model according to claim 1, wherein determining a text adjustment mode for the target original text according to the first priority and the second priority comprises:
determining, according to the first priority, a description adjustment order of the P target objects in the text description of the target original text;
determining description weight values of the P target objects in the text description of the target original text according to the second priority;
And determining a text adjustment mode of the target original text based on a preset adjustment rule, the description adjustment order and the description weight value.
5. The text adjustment method for generating an image based on a large model according to claim 4, further comprising, before determining the text adjustment mode for the target original text based on the preset adjustment rule, the description adjustment order, and the description weight value:

acquiring a preset adjustment formula corresponding to the preset adjustment rule, wherein the preset adjustment formula (published as an image and not reproduced here) relates the following quantities (placeholder symbols added for readability): T′, the text description of the target original text after adjustment; O, each target object included in the target original image corresponding to the target original text together with the attribute category corresponding to each target object; M_orig, the second attribute type matrix corresponding to the target original text; M_gen, the first attribute type matrix corresponding to the target text; and w_am, the weight coefficient corresponding to the a_m-th attribute, where an attribute type matrix is used for indicating the attribute types contained in each target object.
6. The text adjustment method for generating an image based on a large model according to claim 1, wherein after determining the difference information between the target text and the target original text corresponding to each original image, the method further comprises:
Calculating a difference value corresponding to the difference information, wherein the difference value is used for indicating the difference size between the target text of each original image and the target original text;
and comparing the difference value with a preset text difference threshold value, and determining whether to replace the target original text by the target text according to a comparison result.
7. The text adjustment method for generating an image based on a large model according to claim 6, wherein comparing the difference value with a preset text difference threshold and determining whether to replace the target original text with the target text according to the comparison result comprises:
Determining to replace the target original text with the target text if the comparison result indicates that the difference value is smaller than the preset text difference threshold;
and if the comparison result indicates that the difference value is greater than or equal to the preset text difference threshold, determining not to replace the target original text with the target text.
8. The text adjustment method for generating an image based on a large model according to claim 7, wherein, in a case where the comparison result indicates that the difference value is smaller than the preset text difference threshold value, after determining to replace the target original text with the target text, the method further comprises:
performing association binding on the target text and a target original image corresponding to the target original text;
updating the first data set according to the binding result to obtain a second data set;
training a large model for generating an image from text using the second data set.
9. The text adjustment method for generating an image based on a large model according to claim 7, wherein, in a case where the comparison result indicates that the difference value is greater than or equal to the preset text difference threshold, after it is determined not to replace the target original text with the target text, the method further comprises:
Adding a first identifier to the target original text and a target original image corresponding to the target original text, wherein the first identifier is used for indicating that the target original text is a high-accuracy text;
Performing text classification on the first data set according to the first identifier to obtain a first type text set and a second type text set, wherein the first type text set is a set of original texts with the first identifier, and the second type text set is a set of original texts without the first identifier;
and sending prompt information for updating the second type text set to the large model training object, wherein the prompt information is used for prompting the large model training object to execute replacement operation on the second type text set in the iterative training process.
10. The text adjustment method for generating an image based on a large model according to claim 8, wherein training the large model for generating an image from text using the second data set comprises:
performing the i-th round of training through the following steps, wherein i is a positive integer greater than or equal to 1, and the large model obtained through the 0-th round of training is a general large model not yet trained with the second data set:

acquiring the training sample used in the i-th round from the second data set, wherein the training sample used in the i-th round comprises a sample image used in the i-th round and a text corresponding to the sample image used in the i-th round;

inputting the text corresponding to the sample image used in the i-th round into the general large model obtained through the (i-1)-th round of training, to obtain a generated image of the i-th round of training;

acquiring a value of a first loss function of the i-th round of training according to the sample image used in the i-th round and the generated image obtained in the i-th round of training; and

performing feature extraction, using a target detection model, on the sample image used in the i-th round and on the generated image obtained in the i-th round of training, respectively, to obtain a first object attribute feature of the sample image used in the i-th round and a second object attribute feature of the generated image obtained in the i-th round of training; determining a value of a second loss function of the i-th round of training according to the first object attribute feature and the second object attribute feature;

determining a value of an objective loss function of the i-th round of training according to the value of the first loss function of the i-th round of training and the value of the second loss function of the i-th round of training; and

ending the training in a case where the value of the objective loss function of the i-th round of training meets a preset convergence condition, to obtain the large model, trained using the second data set, for generating an image from text.
11. A text adjustment device for generating an image based on a large model, comprising:
an acquisition module, configured to acquire a first data set, wherein each data set in the first data set comprises an original text and an original image corresponding to the original text; the original text is used for describing N target objects contained in the original image and attribute information corresponding to the N target objects, wherein N is a positive integer;
The text module is used for inputting a plurality of original images corresponding to the first data set into a generated text model to obtain a target text corresponding to each original image, wherein the generated text model is a model which is determined through a preset training mechanism and is used for automatically generating a preset text structure according to the images;
the determining module is used for determining difference information between the target text and the target original text corresponding to each original image;
The adjustment module is used for carrying out text adjustment on the target original text according to the difference information to obtain a corrected text;
The adjustment module is further used for analyzing the difference information and determining P target objects with differences and target attribute information corresponding to the P target objects respectively;
determining a first priority corresponding to the P target objects and a second priority corresponding to the target attribute information according to preset priority data, wherein P is a positive integer, and P is less than or equal to N;
determining a text adjustment mode for the target original text according to the first priority and the second priority;
And adjusting the target original text by using the text adjustment mode to obtain a corrected text.
12. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored program, wherein the program when run performs the method of any of the preceding claims 1 to 10.
CN202410173841.6A 2024-02-07 2024-02-07 Text adjustment method, device and storage medium for generating image based on large model Active CN117746214B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410173841.6A CN117746214B (en) 2024-02-07 2024-02-07 Text adjustment method, device and storage medium for generating image based on large model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410173841.6A CN117746214B (en) 2024-02-07 2024-02-07 Text adjustment method, device and storage medium for generating image based on large model

Publications (2)

Publication Number Publication Date
CN117746214A (en) 2024-03-22
CN117746214B (en) 2024-05-24

Family

ID=90261183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410173841.6A Active CN117746214B (en) 2024-02-07 2024-02-07 Text adjustment method, device and storage medium for generating image based on large model

Country Status (1)

Country Link
CN (1) CN117746214B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020200213A1 (en) * 2019-03-31 2020-10-08 华为技术有限公司 Image generating method, neural network compression method, and related apparatus and device
WO2021203832A1 (en) * 2020-04-10 2021-10-14 杭州睿琪软件有限公司 Method and device for removing handwritten content from text image, and storage medium
CN111694826A (en) * 2020-05-29 2020-09-22 平安科技(深圳)有限公司 Data enhancement method and device based on artificial intelligence, electronic equipment and medium
JP2022058775A (en) * 2021-06-30 2022-04-12 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Target object generating method, apparatus therefor, electronic device, and storage medium
WO2023087184A1 (en) * 2021-11-17 2023-05-25 京东方科技集团股份有限公司 Image processing method and apparatus, computing device, and medium
KR20220137848A (en) * 2021-12-08 2022-10-12 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Training method of virtual image generation model and virtual image generation method
CN115269906A (en) * 2022-06-29 2022-11-01 青岛海尔科技有限公司 Image generation method, image generation device, storage medium, and electronic device
CN115346082A (en) * 2022-07-29 2022-11-15 青岛海尔科技有限公司 Image generation method, image generation device, storage medium and electronic device
CN116958324A (en) * 2023-07-24 2023-10-27 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of image generation model
CN116935169A (en) * 2023-09-13 2023-10-24 腾讯科技(深圳)有限公司 Training method for draft graph model and draft graph method
CN117333880A (en) * 2023-10-08 2024-01-02 Oppo广东移动通信有限公司 Model training method and device, image generation method and electronic equipment
CN117475038A (en) * 2023-12-28 2024-01-30 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and computer readable storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Saharia, Chitwan, et al. "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding." NeurIPS 2022, vol. 35, 2022. Full text *
Huang Hongyu; Gu Zifeng. "A text-to-image generative adversarial network based on a self-attention mechanism." Journal of Chongqing University, no. 3, 2020-03-15. Full text *
Yang Nan; Nan Lin; Zhang Dingyi; Ku Tao. "Research on image captioning based on deep learning." Infrared and Laser Engineering, no. 2, 2018-02-25. Full text *
Cao Yin, et al. "A survey of text-to-image generation research." Journal of Zhejiang University (Engineering Science), vol. 58, no. 2, 2024-01-04. Full text *

Also Published As

Publication number Publication date
CN117746214A (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN106296669B (en) A kind of image quality evaluating method and device
CN114821236A (en) Smart home environment sensing method, system, storage medium and electronic device
CN111222553A (en) Training data processing method and device of machine learning model and computer equipment
CN112132208A (en) Image conversion model generation method and device, electronic equipment and storage medium
CN112653604B (en) Scene migration method and system, electronic device and storage medium
CN117746214B (en) Text adjustment method, device and storage medium for generating image based on large model
CN112541556A (en) Model construction optimization method, device, medium, and computer program product
CN110580483A (en) indoor and outdoor user distinguishing method and device
CN116050253A (en) Dish flavor intelligent identification method, device, equipment and storage medium
CN116307879A (en) Efficient cultivation method, system and medium for penaeus monodon larvae
US20220230028A1 (en) Determination method, non-transitory computer-readable storage medium, and information processing device
CN115345225A (en) Method and device for determining recommended scene, storage medium and electronic device
CN117726908B (en) Training method and device for picture generation model, storage medium and electronic device
CN104636489B (en) The treating method and apparatus of attribute data is described
Azevedo et al. New accuracy estimators for genomic selection with application in a cassava (Manihot esculenta) breeding program
CN116910245A (en) Category determining method and device, storage medium and electronic device
CN116091708B (en) Decoration modeling method and system based on big data
CN117689980B (en) Method for constructing environment recognition model, method, device and equipment for recognizing environment
CN113762382B (en) Model training and scene recognition method, device, equipment and medium
CN115348115B (en) Attack prediction model training method, attack prediction method and system for smart home
CN117744970A (en) Personnel and post matching method and device, storage medium and electronic device
CN116701728A (en) Article grade determining method and device, storage medium and electronic device
CN117909752A (en) Content detection method and device based on GPT model, storage medium and electronic device
CN115526230A (en) Behavior habit recognition method and device, storage medium and electronic device
CN115018703A (en) Image scaling method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant