CN117830447A - Image generation, automatic question and answer and parameter generation model training method - Google Patents

Image generation, automatic question and answer and parameter generation model training method

Info

Publication number
CN117830447A
Authority
CN
China
Prior art keywords
image
sample
parameter
parameters
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311622640.1A
Other languages
Chinese (zh)
Inventor
冯玉彤
龚镖
陈狄
沈宇军
刘宇
周靖人
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Alibaba Robot Co ltd
Original Assignee
Zhejiang Alibaba Robot Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Alibaba Robot Co ltd filed Critical Zhejiang Alibaba Robot Co ltd
Priority to CN202311622640.1A priority Critical patent/CN117830447A/en
Publication of CN117830447A publication Critical patent/CN117830447A/en
Pending legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The embodiments of the specification provide an image generation method, an automatic question-answering method, and a parameter generation model training method. The image generation method comprises the following steps: acquiring an image description text; inputting the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used for describing visual characteristics of the image, and the parameter generation model is trained based on a plurality of sample image text pairs and sample parameter information carried by the sample image text pairs; and inputting the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text. The parameter generation model semantically decomposes the visual elements of the image to obtain the image generation parameters, and accurate image generation is then completed based on these parameters, so that the target image clearly expresses both the image description text and the image generation parameters, improving the interpretability and controllability of image generation.

Description

Image generation, automatic question and answer and parameter generation model training method
Technical Field
The embodiments of the specification relate to the field of computer technology, and in particular to image generation, automatic question-answering, and parameter generation model training methods.
Background
With the development of computer technology, text-to-image generation has gradually become a core technology in the field of AI-generated content (AIGC). Text-to-image technology generates images from textual descriptions and can transform and adjust them according to the user's requirements and input, allowing users to easily create artworks with unique styles; it is widely applied in the field of digital art.
At present, a common text-to-image architecture is the latent space diffusion model (also known as a latent diffusion model). However, because the latent space diffusion model operates in a latent space, its generation process is difficult to understand, and the generated image can differ greatly from the text originally input by the user, so a highly interpretable and accurate image generation method is needed.
Disclosure of Invention
In view of this, the present embodiment provides an image generation method. One or more embodiments of the present specification relate to an automatic question-answering method, a parameter generation model training method, an image generation apparatus, an automatic question-answering apparatus, a parameter generation model training apparatus, a computing device, a computer-readable storage medium, and a computer program, to solve the technical drawbacks existing in the prior art.
According to a first aspect of embodiments of the present specification, there is provided an image generation method including:
acquiring an image description text;
inputting the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used for describing visual characteristics of the image, and the parameter generation model is trained based on a plurality of sample image text pairs and sample parameter information carried by the sample image text pairs;
and inputting the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text.
According to a second aspect of embodiments of the present disclosure, there is provided a parameter generation model training method applied to cloud-side equipment, including:
obtaining a sample set, wherein the sample set comprises a plurality of sample image text pairs, the sample image text pairs comprise sample images and sample description texts, and the sample image text pairs carry sample parameter information;
inputting a plurality of sample image text pairs and prediction prompt information into a pre-training language model to obtain image prediction parameters corresponding to the plurality of sample image text pairs respectively;
and adjusting model parameters of the pre-training language model according to the image prediction parameters and the sample parameter information to obtain a parameter generation model for completing training.
According to a third aspect of embodiments of the present disclosure, there is provided an automatic question-answering method, including:
receiving an image question-answer request, wherein the image question-answer request carries an image description text;
inputting the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used for describing visual characteristics of the image, and the parameter generation model is trained based on a plurality of sample image text pairs and sample parameter information carried by the sample image text pairs;
and inputting the image generation parameters and the image description text into an image generation model to obtain a reply image corresponding to the image question-answer request.
According to a fourth aspect of embodiments of the present specification, there is provided an image generating apparatus comprising:
a first acquisition module configured to acquire an image description text;
the first input module is configured to input the image description text and generation prompt information into the parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used for describing visual characteristics of the image, and the parameter generation model is trained based on a plurality of sample image text pairs and sample parameter information carried by the plurality of sample image text pairs;
And the second input module is configured to input the image generation parameters and the image description text into the image generation model to obtain a target image corresponding to the image description text.
According to a fifth aspect of embodiments of the present specification, there is provided a parameter generation model training apparatus applied to cloud-side equipment, including:
a second acquisition module configured to acquire a sample set, wherein the sample set comprises a plurality of sample image text pairs, the sample image text pairs comprising sample images and sample description text, the sample image text pairs carrying sample parameter information;
the third input module is configured to input a plurality of sample image text pairs and prediction prompt information into the pre-training language model to obtain image prediction parameters respectively corresponding to the plurality of sample image text pairs;
and the adjusting module is configured to adjust model parameters of the pre-training language model according to the image prediction parameters and the sample parameter information to obtain a parameter generation model for completing training.
According to a sixth aspect of embodiments of the present specification, there is provided an automatic question-answering apparatus, including:
the first receiving module is configured to receive an image question-answer request, wherein the image question-answer request carries an image description text;
The fourth input module is configured to input the image description text and generation prompt information into the parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used for describing visual characteristics of the image, and the parameter generation model is trained based on a plurality of sample image text pairs and sample parameter information carried by the plurality of sample image text pairs;
and a fifth input module configured to input the image generation parameters and the image description text into the image generation model to obtain a reply image corresponding to the image question-and-answer request.
According to a seventh aspect of embodiments of the present specification, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer executable instructions that, when executed by the processor, implement the steps of the methods provided in the first, second or third aspects above.
According to an eighth aspect of embodiments of the present specification, there is provided a computer readable storage medium storing computer executable instructions which when executed by a processor implement the steps of the method provided in the first or second or third aspects above.
According to a ninth aspect of embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the method provided in the first or second or third aspect described above.
According to the image generation method provided by the embodiments of the specification, an image description text is acquired; the image description text and generation prompt information are input into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used for describing visual characteristics of the image, and the parameter generation model is trained based on a plurality of sample image text pairs and sample parameter information carried by the sample image text pairs; and the image generation parameters and the image description text are input into an image generation model to obtain a target image corresponding to the image description text. The parameter generation model semantically decomposes the visual elements of the image to obtain the image generation parameters, and accurate image generation is then completed based on these parameters, so that the target image clearly expresses both the image description text and the image generation parameters, improving the interpretability and controllability of image generation.
Drawings
FIG. 1 is an architecture diagram of an image generation system provided in one embodiment of the present description;
FIG. 2 is an architecture diagram of another image generation system provided by one embodiment of the present description;
FIG. 3 is a flow chart of an image generation method provided by one embodiment of the present description;
FIG. 4 is a process flow diagram of an image generation method according to one embodiment of the present disclosure;
FIG. 5 is a flow chart of a method of training a parametric generation model provided in one embodiment of the present disclosure;
FIG. 6 is a flow chart of another parametric generation model training method provided by one embodiment of the present disclosure;
FIG. 7 is a flow chart of an automatic question-answering method provided by one embodiment of the present disclosure;
FIG. 8 is a process flow diagram of another image generation method provided by one embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a process for generating a model of parameters according to one embodiment of the present disclosure;
FIG. 10 is an interface schematic of an image generation interface provided by one embodiment of the present disclosure;
fig. 11 is a schematic structural view of an image generating apparatus provided in an embodiment of the present specification;
FIG. 12 is a schematic diagram of a training device for generating a model according to an embodiment of the present disclosure;
Fig. 13 is a schematic structural diagram of an automatic question answering device according to one embodiment of the present disclosure;
FIG. 14 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present specification. However, this specification can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the specification is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, this information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present specification. Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining".
Furthermore, it should be noted that, user information (including, but not limited to, user equipment information, user personal information, etc.) and data (including, but not limited to, data for analysis, stored data, presented data, etc.) according to one or more embodiments of the present disclosure are information and data authorized by a user or sufficiently authorized by each party, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions, and is provided with corresponding operation entries for the user to select authorization or denial.
In one or more embodiments of the present description, a large model refers to a deep learning model with large-scale model parameters, typically hundreds of millions, billions, or even trillions of parameters. A large model is also called a foundation model: it is pre-trained on a large-scale unlabeled corpus to produce a pre-trained model with more than a hundred million parameters that can adapt to a wide range of downstream tasks with good generalization capability, such as a large-scale language model (LLM, Large Language Model) or a multi-modal pre-training model.
When a large model is actually applied, the pre-trained model can be adapted to different tasks by fine-tuning with a small number of samples. Large models are widely applicable in fields such as natural language processing (NLP, Natural Language Processing) and computer vision, in particular to computer vision tasks such as visual question answering (VQA, Visual Question Answering), image captioning (IC), and image generation, and to natural language processing tasks such as text-based emotion classification, text summarization, and machine translation. Main application scenarios of large models include digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design.
First, terms related to one or more embodiments of the present specification will be explained.
Latent space diffusion model: an artificial intelligence technique widely applied to image editing and image generation, which encodes an image into a latent (hidden) space for noise addition and denoising.
Attention mechanism: a machine learning method that assigns different weights to portions of the input data depending on their importance.
CLIP: CLIP (Contrastive Language-Image Pre-training) is a model for measuring the correlation between text and images, widely used for feature extraction from text and images.
Large-scale language model: a model that, by training on massive text data, predicts and generates various forms of language expression, such as texts, sentences, and paragraphs.
LoRA: Low-Rank Adaptation, a fine-tuning method for large language models that achieves good results by training only a small number of parameters when adapting a large model to downstream tasks.
Text-to-image technology outputs images based on the image description text input by users and has great value in numerous scenarios such as advertisement recommendation, interactive entertainment, and artistic design. The common text-to-image architecture is the latent space diffusion model. Because the latent space diffusion model operates in a latent space, its internal working mechanism and decision process are difficult to understand and explain, and the generation process is hard to follow; moreover, its understanding of the text relies on a feature extraction module (a CLIP encoder), and semantic control over visual elements cannot be exerted during the generation stage, so the finally generated result can differ greatly from the image description text originally input by the user.
In order to improve the accuracy of the image generation model, the embodiments of the specification provide an image generation method that incorporates a large-scale language model. The powerful semantic understanding and association capabilities of the large-scale language model are used to semantically decompose and recombine the visual elements of the image; the visual elements are parameterized and input into the image generation model to complete accurate image generation, giving the image generation process stronger interpretability and controllability. The large-scale language model can also align complex long texts, providing technical support for more user interaction modes. Furthermore, combining a large-scale language model with an image generation model enables extended applications such as convenient and fine-grained image editing and generation of structurally similar images.
Specifically, the embodiments of the specification provide an image generation scheme: acquiring an image description text; inputting the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used for describing visual characteristics of the image, and the parameter generation model is trained based on a plurality of sample image text pairs and sample parameter information carried by the sample image text pairs; and inputting the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text.
In the present specification, an image generation method, an automatic question-answering method, a parameter generation model training method, an image generation apparatus, an automatic question-answering apparatus, a parameter generation model training apparatus, a computing device, and a computer-readable storage medium are provided, which are described in detail one by one in the following embodiments.
Referring to fig. 1, fig. 1 illustrates an architecture diagram of an image generation system provided in one embodiment of the present description, which may include a client 100 and a server 200;
a client 100 for transmitting an image description text to a server 200;
the server 200 is configured to input the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, where the image generation parameters are used to describe visual features of the image, and the parameter generation model is trained based on a plurality of sample image text pairs and sample parameter information carried by the plurality of sample image text pairs; input the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text; and transmit the target image to the client 100;
The client 100 is further configured to receive the target image sent by the server 200.
By applying the scheme of the embodiment of the specification, the parameter generation model semantically decomposes the visual elements of the image to obtain image generation parameters, and accurate image generation is then completed based on these parameters, so that the target image clearly expresses both the image description text and the image generation parameters, improving the interpretability and controllability of image generation.
Referring to fig. 2, fig. 2 illustrates an architecture diagram of another image generation system provided in one embodiment of the present specification, where the image generation system may include a plurality of clients 100 and a server 200, where the clients 100 may include an end-side device and the server 200 may include a cloud-side device. Communication connection can be established between the plurality of clients 100 through the server 200, in the image generation scenario, the server 200 is used to provide an image generation service between the plurality of clients 100, and the plurality of clients 100 can respectively serve as a transmitting end or a receiving end, so that communication is realized through the server 200.
The user may interact with the server 200 through the client 100 to receive data transmitted from other clients 100, or transmit data to other clients 100, etc. In the image generation scenario, it may be that the user issues a data stream to the server 200 through the client 100, and the server 200 generates a target image according to the data stream and pushes the target image to other clients that establish communication.
Wherein, the client 100 and the server 200 establish a connection through a network. The network provides a medium for a communication link between client 100 and server 200. The network may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. The data transmitted by the client 100 may need to be encoded, transcoded, compressed, etc. before being distributed to the server 200.
The client 100 may be a browser, an APP (Application), a web application such as an H5 (HTML5, HyperText Markup Language version 5) application, a light application (also called an applet, a lightweight application), or a cloud application, etc. The client 100 may be developed based on a software development kit (SDK, Software Development Kit) of the corresponding service provided by the server 200, such as a real-time communication (RTC, Real Time Communication) SDK. The client 100 may be deployed in an electronic device and may need to run depending on the device or on some APP in the device. The electronic device may, for example, have a display screen and support information browsing, and may be a personal mobile terminal such as a mobile phone, tablet computer, or personal computer. Various other types of applications are also commonly deployed in electronic devices, such as human-machine conversation applications, model training applications, text processing applications, web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, and social platform software.
The server 200 may include a server that provides various services, such as a server that provides communication services for multiple clients, a server for background training that provides support for a model used on a client, a server that processes data sent by a client, and so on. It should be noted that, the server 200 may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. The server may also be a server of a distributed system or a server that incorporates a blockchain. The server may also be a cloud server for cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN, content Delivery Network), and basic cloud computing services such as big data and artificial intelligence platforms, or an intelligent cloud computing server or an intelligent cloud host with artificial intelligence technology.
It should be noted that, the image generating method provided in the embodiment of the present disclosure is generally executed by the server, but in other embodiments of the present disclosure, the client may also have a similar function to the server, so as to execute the image generating method provided in the embodiment of the present disclosure. In other embodiments, the image generating method provided in the embodiments of the present disclosure may be performed by a client and a server together.
Referring to fig. 3, fig. 3 shows a flowchart of an image generating method according to an embodiment of the present disclosure, which specifically includes the following steps:
step 302: and acquiring an image description text.
In one or more embodiments of the present disclosure, at the beginning of image generation, an image description text may be acquired, and a target image according to the actual requirement of the user is generated based on the image description text.
Specifically, the image description text characterizes the user's image generation needs. The image description text may be in different languages, such as English or Chinese. Because the image generation process introduces a large-scale language model, the image description text may be short text or long text. For example, the image description text may be "two black cats and one white dog on an orange sofa".
In practical applications, there are various ways to obtain the image description text, and the method is specifically selected according to practical situations, which is not limited in any way in the embodiment of the present disclosure. In one possible implementation manner of the present specification, image description text sent by a user through a client may be received. In another possible implementation of the present description, the image description text may be read from other data acquisition devices or databases.
Step 304: inputting the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used for describing visual characteristics of the image, and the parameter generation model is trained based on a plurality of sample image text pairs and sample parameter information carried by the plurality of sample image text pairs.
In one or more embodiments of the present disclosure, after the image description text is acquired, the image description text and the generation prompt information may further be input into a parameter generation model to obtain the image generation parameters corresponding to the image description text.
Specifically, the parameter generation model is obtained by training a pre-trained language model on a sample set. The pre-trained language model may be a large-scale language model, such as a multi-modal large model, or a language processing model trained with a first training set. The image generation parameters may be understood as parameterized image generation conditions, including but not limited to size parameters, position parameters, shape parameters, and the like. The generation prompt information may be understood as a parameter generation paradigm, also referred to as a generation condition paradigm; it is used to guide the parameter generation model in generating the image generation parameters and covers image generation condition elements such as shape, color, size, position, blur, and key points. For example, the generation prompt information may be "what elements are in the image".
It should be noted that, in the embodiment of the present disclosure, the generation conditions are unified into parameters that can be described by natural language, rather than image-space conditions, which facilitates semantic understanding by the large-scale language model.
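As a purely illustrative sketch (the embodiment does not fix a concrete schema, so all field names and values below are assumptions), the natural-language-describable generation conditions might look like this:

```python
# Hypothetical sketch of parameterized generation conditions; the field names
# and value conventions are illustrative assumptions, not the patented format.
image_description = "two black cats and one white dog on an orange sofa"

generation_prompt = (
    "List the visual elements of the image. For each element, give its "
    "color, size, position, shape and blur level."
)

# Example output of the parameter generation model: every condition is a value
# that natural language can describe, rather than an image-space condition,
# which keeps it legible to the large-scale language model.
image_generation_parameters = [
    {"object": "black cat", "count": 2, "color": "black",
     "position": (0.35, 0.55), "size": (0.20, 0.25)},
    {"object": "white dog", "count": 1, "color": "white",
     "position": (0.70, 0.60), "size": (0.25, 0.30)},
    {"object": "sofa", "count": 1, "color": "orange",
     "position": (0.50, 0.70), "size": (0.90, 0.45)},
]
```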
Step 306: inputting the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text.
In one or more embodiments of the present disclosure, the image description text is acquired, the image description text and the generation prompt information are input into the parameter generation model, and after the image generation parameters corresponding to the image description text are obtained, the image generation parameters and the image description text may further be input into the image generation model to obtain the target image corresponding to the image description text.
Specifically, the image generation model may be a latent space diffusion model, or an image generation model trained with a second training set. The image generation model is used for generating the final target image based on the image generation parameters and the image description text. The target image may be a black-and-white image or a color (RGB) image.
By applying the scheme of the embodiment of the specification, the parameter generation model semantically decomposes the visual elements of the image to obtain image generation parameters, and accurate image generation is then completed based on these parameters, so that the target image clearly expresses both the image description text and the image generation parameters, improving the interpretability and controllability of image generation.
In an optional embodiment of the present disclosure, when the image generation parameters and the image description text are input into the image generation model, the image generation parameters may be processed through a bypass structure: a parameter encoding unit is added as a bypass alongside the codec unit originally present in the image generation model, i.e., the image generation model includes a parameter encoding unit and a codec unit. Inputting the image generation parameters and the image description text into the image generation model to obtain the target image corresponding to the image description text may then include the following steps:
inputting the image generation parameters and the image description text into an image generation model, and coding the image generation parameters by a parameter coding unit to obtain parameter coding characteristics;
generating, by the codec unit, a target image corresponding to the image description text according to the parameter encoding features and the image description text.
Referring to fig. 4, fig. 4 shows a flowchart of a processing procedure of an image generation method according to an embodiment of the present disclosure. As shown in fig. 4, after the image generation parameters are generated by the parameter generation model, the image description text and the image generation parameters may be input together into the image generation model. In the image generation model, the parameter encoding unit first encodes the image generation parameters to obtain parameter encoding features, and the codec unit then generates the target image corresponding to the image description text according to the parameter encoding features and the image description text.
The codec unit includes an encoding unit and a decoding unit. After the parameter encoding features are obtained, they can be input together into the encoding unit in the latent space for parameter condition control, and the residual is passed to the decoding unit to achieve the final parameter-fused generation.
In practical application, the parameter encoding unit may directly encode the image generation parameters to obtain the parameter encoding features. Further, since the image generation parameters include parameters of different dimensions, the parameter encoding unit may encode the image generation parameters into parameter encoding features of different dimensions.
By applying the scheme of the embodiment of the specification, the image generation parameters and the image description text are input into the image generation model, and the parameter encoding unit encodes the image generation parameters to obtain parameter encoding features; the codec unit then generates the target image corresponding to the image description text according to the parameter encoding features and the image description text. The image generation parameters are thus fused into the target image generation process for control, ensuring the accuracy of the target image.
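For concreteness, the bypass structure described above can be sketched in PyTorch as follows. This is a minimal illustration under assumed tensor shapes and an assumed fusion rule; the embodiment does not disclose concrete layer choices:

```python
import torch
import torch.nn as nn

class ParameterEncoder(nn.Module):
    """Bypass branch: encodes vectorized image generation parameters into
    parameter encoding features for the codec unit. Layer choices assumed."""

    def __init__(self, param_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(param_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, params: torch.Tensor) -> torch.Tensor:
        # (batch, param_dim) -> (batch, hidden_dim) parameter encoding features
        return self.net(params)

# Assumed fusion rule in the codec unit: the parameter encoding features
# condition the encoder in the latent space, and the residual is forwarded
# to the decoder for the final parameter-fused generation:
#   h = encoder(latent, text_embedding) + param_features   # condition control
#   target_image = decoder(h, text_embedding)              # parameter fusion
```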
In an alternative embodiment of the present specification, the parameter encoding unit includes a one-dimensional parameter encoding unit, a two-dimensional parameter encoding unit, and a feature aggregation unit; the parameter encoding unit encodes the image generation parameters to obtain parameter encoding characteristics, and the method can comprise the following steps:
A one-dimensional parameter coding unit codes one-dimensional parameters in the image generation parameters to obtain one-dimensional coding characteristics;
the two-dimensional parameter coding unit codes the two-dimensional parameters in the image generation parameters to obtain two-dimensional coding characteristics;
and acquiring parameter coding features by aggregating the one-dimensional coding features and the two-dimensional coding features through a feature aggregation unit.
In particular, the image generation parameters may include parameters of different dimensions, such as one-dimensional parameters and two-dimensional parameters. Wherein the one-dimensional parameters include, but are not limited to, color parameters, blurring parameters. Two-dimensional parameters include, but are not limited to, position parameters, shape parameters. Therefore, in order to better encode the image generation parameters, the embodiment of the present specification divides the parameter encoding unit to obtain a one-dimensional parameter encoding unit for encoding one-dimensional parameters, and a two-dimensional parameter encoding unit for encoding two-dimensional parameters.
Furthermore, because the image generation parameters represent the overall generation conditions of the target image, a feature aggregation unit can be additionally added in the parameter coding unit, so that the one-dimensional coding feature and the two-dimensional coding feature are aggregated through the feature aggregation unit to obtain the parameter coding feature.
By applying the scheme of the embodiment of the specification, one-dimensional parameters in the image generation parameters are encoded by a one-dimensional parameter encoding unit to obtain one-dimensional encoding characteristics; the two-dimensional parameter coding unit codes the two-dimensional parameters in the image generation parameters to obtain two-dimensional coding characteristics; and the parameter coding features are obtained by aggregating the one-dimensional coding features and the two-dimensional coding features through the feature aggregation unit, so that the parameters with different dimensions are respectively coded, and the accuracy of the parameter coding features is ensured.
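A sketch of this split encoder might look as follows; treating one-dimensional parameters as a flat vector, two-dimensional parameters as spatial maps, and aggregating by concatenation are all assumptions for illustration:

```python
import torch
import torch.nn as nn

class ParameterEncodingUnit(nn.Module):
    """One-dimensional branch (e.g. color, blur parameters), two-dimensional
    branch (e.g. position, shape maps), and a feature aggregation unit."""

    def __init__(self, dim_1d: int, channels_2d: int, hidden: int):
        super().__init__()
        self.enc_1d = nn.Linear(dim_1d, hidden)
        self.enc_2d = nn.Conv2d(channels_2d, hidden, kernel_size=3, padding=1)
        self.aggregate = nn.Linear(2 * hidden, hidden)

    def forward(self, p_1d: torch.Tensor, p_2d: torch.Tensor) -> torch.Tensor:
        f_1d = self.enc_1d(p_1d)                    # (B, hidden)
        f_2d = self.enc_2d(p_2d).mean(dim=(2, 3))   # pool spatial maps to (B, hidden)
        return self.aggregate(torch.cat([f_1d, f_2d], dim=-1))
```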
In an optional embodiment of the present disclosure, after the image generation parameters and the image description text are input into the image generation model to obtain the target image corresponding to the image description text, image update parameters for the target image may further be received, and the target image may be edited based on the image update parameters, or an image structurally similar to the target image may be generated. That is, after the image generation parameters and the image description text are input into the image generation model to obtain the target image corresponding to the image description text, the method may further include the following steps:
acquiring image update parameters for a target image;
determining an image editing area in the target image according to the image updating parameters;
Masking the image editing area to obtain a mask generating sequence;
and inputting the mask generating sequence and the target image into an image generating model to obtain an updated target image.
Specifically, the image update parameters are used to modify or adjust the target image to obtain an updated target image. An image update parameter may be an image generation condition distinct from an existing image generation parameter; for example, the image generation parameter is "the number of black cats is 2" and the image update parameter is "the number of black cats is 3", or the image generation parameter is "the number of black cats is 2" and the image update parameter is "the number of black cats plus one". There are various ways to acquire the image update parameters for the target image, selected according to the actual situation, which the embodiment of the present specification does not limit. In one possible implementation manner, image update parameters sent by the user through the client may be received. In another possible implementation manner, the image update parameters for the target image may be read from other data acquisition devices or databases.
After the image update parameters for the target image are acquired and it is determined that they have changed relative to the image generation parameters, the image editing area in the target image can be determined. The image editing area refers to the area in the target image where parameter updating is required. When determining the image editing area based on the image update parameters, the corresponding area in the attention map output by the attention mechanism may be determined and masked to produce the mask generation sequence.
By applying the scheme of the embodiment of the specification, the image update parameters for the target image are acquired; the image editing area in the target image is determined according to the image update parameters; the image editing area is masked to obtain a mask generation sequence; and the mask generation sequence and the target image are input into the image generation model to obtain the updated target image. Because only the image editing area in the attention map is regenerated based on the mask generation sequence, areas whose parameters are not updated are unaffected, achieving identity-preserving image editing and structurally similar image generation.
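The derivation of the mask generation sequence from the attention map could be sketched as follows; the thresholding rule and the token indexing are assumptions for illustration:

```python
import numpy as np

def mask_generation_sequence(attention_map: np.ndarray,
                             edited_token_ids: list,
                             threshold: float = 0.5) -> np.ndarray:
    """attention_map: (H, W, num_tokens) cross-attention weights.
    Marks for regeneration the spatial regions that attend strongly to the
    tokens affected by the image update parameters; all other regions are
    preserved. Returns an (H, W) mask: 1 = regenerate, 0 = keep."""
    relevance = attention_map[:, :, edited_token_ids].max(axis=-1)
    return (relevance > threshold).astype(np.float32)
```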
In an alternative embodiment of the present disclosure, taking the image generation model as an example of the diffusion model, the inputting the mask generation sequence and the target image into the image generation model to obtain the updated target image may include the following steps:
for the current diffusion step of the image generation model, generating a diffusion result of the current diffusion step according to the diffusion results of the completed diffusion steps and the mask generation sequence, until an iteration stop condition is reached, so as to obtain the updated target image.
Specifically, diffusion model processing comprises a plurality of diffusion steps, each of which can be regarded as a transformation of the image; these steps help the diffusion model better simulate the image degradation process and recover a noise-free image from a noisy image. In general, the diffusion steps can be divided into two phases: a diffusion phase and a denoising phase. In the diffusion phase, the diffusion model simulates the process of image degradation by gradually adding noise to the input image; in the denoising phase, the diffusion model finds and eliminates the noise in the noisy image to restore the original noise-free image.
In the embodiment of the present disclosure, within the image generation model, editing-mode image generation may proceed from the diffusion results of each diffusion step stored from the previous generation; during editing-mode generation, the stored diffusion result and the mask generation sequence may be multiplied to obtain the diffusion result of the current diffusion step, until the iteration stop condition is reached, yielding the updated target image.
By applying the scheme of the embodiment of the specification, for the current diffusion step of the image generation model, the diffusion result of the current diffusion step is generated according to the diffusion results of the completed diffusion steps and the mask generation sequence until the iteration stop condition is reached, obtaining the updated target image and achieving identity-preserving image editing and structurally similar image generation.
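A minimal sketch of the multiply-and-blend step described above (matching the noise levels of the two diffusion results is assumed to be handled by the sampler):

```python
import torch

def masked_diffusion_step(x_edit_t: torch.Tensor,
                          x_orig_t: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """Blend the current diffusion result of the editing-mode generation with
    the stored diffusion result of the original generation: masked regions
    come from the new generation, unmasked regions are kept unchanged."""
    return mask * x_edit_t + (1.0 - mask) * x_orig_t
```

Applied at every diffusion step until the iteration stop condition is reached, this keeps unedited regions identical to the original target image, which is what makes the editing identity-preserving.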
In an optional embodiment of the present disclosure, the inputting the mask generating sequence and the target image into the image generating model, after obtaining the updated target image, may further include the following steps:
and sending the target image and the updated target image to the client so that the client displays the target image and the updated target image to the user.
It should be noted that, after the updated target image is obtained, the target image and the updated target image may be sent to the client at the same time, and the user may select an image meeting the actual requirement from the target image displayed by the client and the updated target image.
In practical applications, there are various ways of sending the target image and the updated target image to the client, selected according to the actual situation, which the embodiment of the present disclosure does not limit. In one possible implementation manner, the target image and the updated target image may be sent directly to the client, so that the client displays them to the user. In another possible implementation manner, the target image and the updated target image may be sent to the client according to the user's display requirement information. The display requirement information characterizes the user's requirements for viewing the images; it includes, but is not limited to, displaying only the target image and the updated target image, or displaying the image description information together with the target image and the updated target image, and is set according to the actual requirements of the user, which the embodiment of the present specification does not limit.
By applying the scheme of the embodiment of the specification, the target image and the updated target image are sent to the client, so that the client displays the target image and the updated target image to the user, multiple choices are provided for the user, and the user can intuitively compare the target image and the updated target image to determine whether the updated target image meets the requirement of the user.
In an optional embodiment of the present disclosure, after the sending the target image and the updated target image to the client, the method may further include the following steps:
and receiving image selection information sent by a user through the client, and adjusting model parameters of the image generation model based on the image selection information.
Specifically, the image selection information is used to identify the image selected by the user. The image selection information includes, but is not limited to, image numbers and image positions, and is specifically selected according to practical situations, which is not limited in any way in the embodiment of the present specification.
It should be noted that, the manner in which the user sends the image selection information includes, but is not limited to, voice input, touch control, text input. Based on the image selection information, the image meeting the user requirement can be determined, and the image corresponding to the image selection information can be further used as a real image to carry out parameter adjustment on the image generation model.
By applying the scheme of the embodiment of the specification, the image selection information sent by the user through the client is received, and the model parameters of the image generation model are adjusted based on the image selection information, so that the image generation model has the interaction capability with the user, the user can adjust the image generation result in an interaction mode, and the user satisfaction degree is improved.
In an optional embodiment of the present disclosure, after the inputting the image generating parameters and the image description text into the image generating model to obtain the target image corresponding to the image description text, the method may further include the following steps:
and receiving adjustment sample data sent by a user through a client, and adjusting model parameters of an image generation model according to the adjustment sample data, wherein the adjustment sample data is constructed based on a target image.
In practical application, after the image generation parameters and the image description text are input into the image generation model and the target image corresponding to the image description text is obtained, the target image can be sent to the client. The user can then work with the target image and, if not satisfied with it, send adjustment sample data for secondary training of the model. Specifically, after receiving adjustment sample data sent by the user through the client, in a first possible implementation manner, the model parameters of the image generation model may be adjusted according to the adjustment sample data; in a second possible implementation manner, the model parameters of the parameter generation model may be adjusted according to the adjustment sample data; in a third possible implementation manner, the model parameters of both the image generation model and the parameter generation model may be adjusted according to the adjustment sample data. The adjustment sample data may be obtained by the user adding labels (such as positive and negative sample labels) to the target image, or by performing parameter adjustment on the target image (such as adjusting a size parameter); the embodiment of the present disclosure does not limit the way the adjustment sample data is generated.
It should be noted that, the implementation manner of "adjusting the model parameters of the image generation model according to the adjustment sample data" is the same as the training manner of the image generation model, and the description of the embodiment of the present disclosure will not be repeated.
By applying the scheme of the embodiment of the specification, the target image is sent to the client so that the client displays the target image to the user; and receiving the adjustment sample data sent by the user through the client, and adjusting the model parameters of the image generation model according to the adjustment sample data, so that the image generation model has the interaction capability with the user, the user can adjust the image generation result in an interaction mode, and the user satisfaction degree is improved.
In an optional embodiment of the present disclosure, before inputting the image description text and the generated prompt information into the parameter generation model to obtain the image generation parameter corresponding to the image description text, the method may further include the following steps:
obtaining a sample set, wherein the sample set comprises a plurality of sample image text pairs, the sample image text pairs comprise sample images and sample description texts, and the sample image text pairs carry sample parameter information;
inputting a plurality of sample image text pairs and prediction prompt information into a pre-training language model to obtain image prediction parameters corresponding to the plurality of sample image text pairs respectively;
And adjusting model parameters of the pre-training language model according to the image prediction parameters and the sample parameter information to obtain a parameter generation model for completing training.
Specifically, the training mode of the parameter generation model is supervised training based on prompt learning, namely, each sample image text pair in the sample set carries real sample parameter information, and the sample parameter information is a generation target of the parameter generation model and is used for guiding the training process of the parameter generation model.
The prediction prompt information is used to prompt the prediction process of the pre-trained language model and is set according to the actual situation, which the embodiment of the specification does not limit. For example, the prediction prompt information may be: "You are an intelligent bounding box generation model. I will provide you with an image and descriptive text of the image; the format of each bounding box should be (object name, center-point abscissa, center-point ordinate, bounding-box width, bounding-box height)." Through the prediction prompt information, the pre-trained language model can output prediction results in a fixed data format.
It should be noted that the implementation manner of inputting the plurality of sample image text pairs and the prediction prompt information into the pre-trained language model to obtain the image prediction parameters corresponding to the sample image text pairs may refer to the implementation manner of inputting the image description text and the generation prompt information into the parameter generation model to obtain the image generation parameters, and will not be described in detail here. When adjusting the model parameters of the pre-trained language model according to the image prediction parameters and the sample parameter information, a LoRA approach can be adopted with rank = 8, achieving model adjustment without forgetting the semantic understanding capability of the base model.
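As an illustration of the LoRA adjustment with rank = 8, a sketch using the Hugging Face PEFT library might look as follows; the base checkpoint name and the target modules are placeholders, since the embodiment does not name them:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; the embodiment does not name a base model.
base_model = AutoModelForCausalLM.from_pretrained("pretrained-language-model")

lora_config = LoraConfig(
    r=8,                                  # rank = 8, as described above
    lora_alpha=16,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

# Only the low-rank adapter weights are trained, so the semantic understanding
# capability of the base model is not forgotten during adaptation.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```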
In practical applications, there are various ways of obtaining the sample set, and the sample set is specifically selected according to practical situations, which is not limited in any way in the embodiment of the present disclosure. In one possible implementation manner of the present specification, a large number of sample image text pairs carrying sample parameter information input by a user can be received to form a sample set. In another possible implementation manner of the present disclosure, a large number of sample image text pairs carrying sample parameter information may be read from other data acquisition devices or databases to form a sample set.
It is worth noting that the sample set may include at least one of an original sample subset, a generated sample subset, and a constructed sample subset. Specifically, a plurality of sample images carrying sample description texts may be acquired to form the original sample subset. The generated sample subset can be formed by producing sample description texts for sample images using the image understanding capability of a large model, with a collaborative detection segmentation model extracting the various kinds of sample parameter information. The constructed sample subset can be formed by setting parameter configuration conditions based on prior knowledge, such as output rules, object rules, and element-number rules, and constructing pseudo data accordingly, so that the parameter generation model can purposefully learn specific capabilities from it.
Referring to fig. 5, fig. 5 shows a flowchart of a method for training a parameter generation model according to an embodiment of the present disclosure, where an original sample subset, a generated sample subset, and a constructed sample subset are mixed to obtain a sample set, a plurality of sample image text pairs and prediction prompt information in the sample set are input into a pre-training language model to obtain image prediction parameters corresponding to the plurality of sample image text pairs, and model parameters of the pre-training language model are adjusted according to the image prediction parameters and the sample parameter information to obtain a parameter generation model for completing training. By constructing a sample set comprising three sample subsets, the image generation parameters output by the parameter generation model can be made more reasonable.
By applying the scheme of the embodiment of the specification, the model parameters of the pre-training language model are adjusted according to the image prediction parameters and the sample parameter information to obtain a trained parameter generation model; by continuously adjusting the model parameters of the pre-training language model, the finally obtained parameter generation model is more accurate.
In an optional embodiment of the present disclosure, taking a generated sample subset included in a sample set as an example, the acquiring a sample set may include the following steps:
Acquiring a plurality of sample images;
for a first sample image, inputting the first sample image and construction prompt information into the pre-training language model to obtain a first sample description text of the first sample image;
generating first sample parameter information of the first sample image according to the first sample image and the first sample description text;
and constructing a sample set according to the plurality of sample images, the sample description text of the plurality of sample images and the sample parameter information.
Specifically, the construction prompt information is used for prompting the pre-training language model to generate a sample description text of a sample image. The construction prompt information is specifically set according to actual situations, which is not limited in the embodiment of the present disclosure. For example, the construction prompt information may be "please describe the image".
It should be noted that, the manner of acquiring the plurality of sample images is various, and is specifically selected according to the actual situation, which is not limited in any way in the embodiment of the present disclosure. In one possible implementation of the present disclosure, a plurality of sample images input by a user may be received. In another possible implementation of the present description, a plurality of sample images may be read from other data acquisition devices or databases.
By applying the scheme of the embodiment of the specification, a plurality of sample images are acquired; for a first sample image, the first sample image and the construction prompt information are input into the pre-training language model to obtain a first sample description text of the first sample image; first sample parameter information of the first sample image is generated according to the first sample image and the first sample description text; and a sample set is constructed according to the plurality of sample images, the sample description text of the plurality of sample images and the sample parameter information. The sample description text is generated through the image understanding capability of the pre-training language model, thereby ensuring the accuracy of the sample description text.
In practical applications, there are various ways to generate the first sample parameter information of the first sample image according to the first sample image and the first sample description text, and the embodiment of the present disclosure does not limit this. In one possible implementation manner of the present disclosure, related information such as keywords and sentence structures may be obtained from the first sample description text through text mining, image analysis and the like, and this information is used as the first sample parameter information.
In another possible implementation manner of the present disclosure, extraction of multiple kinds of sample parameter information may be implemented using a cooperating detection segmentation model. That is, the generating the first sample parameter information of the first sample image according to the first sample image and the first sample description text may include the following steps:
Performing region detection on the first sample image according to the first sample description text, and determining key region position information of the first sample image;
and performing visual segmentation on the first sample image according to the position information of the key region to obtain first sample parameter information of the first sample image.
Specifically, the key region position information refers to the coordinates of the center point of a key region in the first sample image, where a key region can be understood as a bounding box. For example, if the first sample image shows a child and a dog running, the key regions in the first sample image are the regions where the child and the dog are located. The first sample parameter information includes, but is not limited to, the height and width of each key region.
When performing region detection on the first sample image according to the first sample description text, the first sample description text and the first sample image may be input into a detection segmentation model, and the key region position information of the first sample image is obtained through region detection of the detection segmentation model. When visually segmenting the first sample image according to the key region position information, the key region position information and the first sample image may be input into a visual segmentation model (SAM, Segment Anything Model) to obtain the first sample parameter information of the first sample image.
By applying the scheme of the embodiment of the specification, region detection is performed on the first sample image according to the first sample description text to determine the key region position information of the first sample image, and visual segmentation is performed on the first sample image according to the key region position information to obtain the first sample parameter information of the first sample image. Through region detection and visual segmentation, the accuracy of the first sample parameter information is improved.
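A minimal sketch of such a cooperating detection and segmentation pipeline is given below; the text-conditioned detector is a hypothetical placeholder, the SAM checkpoint path is an assumption, and only the segment-anything calls follow the published API of Meta's segment-anything package.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def extract_sample_parameters(image: np.ndarray, description: str, detector):
    """Region detection followed by visual segmentation; `detector` is a
    hypothetical text-conditioned detector returning (object name,
    [x0, y0, x1, y1]) pairs for objects mentioned in the description."""
    regions = detector(image, description)

    # Checkpoint path is an illustrative assumption.
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(image)  # HxWx3 RGB uint8 array

    params = []
    for name, box in regions:
        # Prompt SAM with the detected bounding box as the key region.
        masks, scores, _ = predictor.predict(box=np.array(box),
                                             multimask_output=False)
        x0, y0, x1, y1 = box
        params.append({"name": name,
                       "bbox": box,
                       "width": x1 - x0,
                       "height": y1 - y0,
                       "mask_area": int(masks[0].sum())})
    return params
```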
In an optional embodiment of the present disclosure, taking a sample set that includes a constructed sample subset as an example, the construction manner of the constructed sample subset is described. After generating the first sample parameter information of the first sample image according to the first sample image and the first sample description text, the method may further include the following steps:
acquiring parameter configuration conditions;
under the condition that the first sample parameter information does not meet the parameter configuration conditions, the first sample parameter information is adjusted, and adjusted first sample parameter information is obtained;
constructing a sample set from the plurality of sample images, the sample descriptive text of the plurality of sample images, and the sample parameter information, comprising:
and constructing a sample set according to the plurality of sample images, the sample description text of the plurality of sample images and the adjusted sample parameter information.
Specifically, the parameter configuration conditions include, but are not limited to, luminance being greater than a preset luminance threshold, sharpness being greater than a preset sharpness threshold, and the like, and are specifically selected according to practical situations, which are not limited in any way in the embodiments of the present disclosure.
In practical application, there are various ways of obtaining the parameter configuration conditions, and the method is specifically selected according to practical situations, which is not limited in any way in the embodiment of the present specification. In one possible implementation of the present disclosure, a parameter configuration condition input by a user may be received. In another possible implementation manner of the present specification, the parameter configuration condition may be read from other data acquisition devices or databases.
It should be noted that after the parameter configuration condition is obtained, it may further be determined whether the first sample parameter information satisfies the parameter configuration condition. In the case where the first sample parameter information satisfies the parameter configuration condition, the first sample parameter information is not adjusted. In the case where the first sample parameter information does not satisfy the parameter configuration condition, the first sample parameter information is adjusted to obtain adjusted first sample parameter information.
By applying the scheme of the embodiment of the specification, parameter configuration conditions are obtained; under the condition that the first sample parameter information does not meet the parameter configuration conditions, the first sample parameter information is adjusted, and adjusted first sample parameter information is obtained; and constructing a sample set according to the plurality of sample images, the sample description text of the plurality of sample images and the adjusted sample parameter information. And reliability monitoring is carried out on the sample parameter information through the parameter configuration conditions, so that the accuracy of the sample parameter information is improved.
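The following sketch illustrates one possible form of such reliability monitoring; the element-number limit and the in-bounds rule are illustrative assumptions rather than conditions fixed by the embodiments of the present disclosure.

```python
def enforce_parameter_configuration(params, image_w, image_h, max_elements=10):
    """Check sample parameter information against assumed configuration
    conditions and adjust it where the conditions are not satisfied."""
    # Element number rule (assumed): keep at most max_elements objects.
    adjusted = params[:max_elements]

    # Object rule (assumed): clamp each bounding box into the image frame.
    for p in adjusted:
        x0, y0, x1, y1 = p["bbox"]
        p["bbox"] = [max(0, x0), max(0, y0),
                     min(image_w, x1), min(image_h, y1)]
    return adjusted
```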
In an optional embodiment of the present disclosure, before the inputting the image generating parameters and the image description text into the image generating model to obtain the target image corresponding to the image description text, the method may further include the following steps:
obtaining a sample set, wherein the sample set comprises a plurality of sample image text pairs, the sample image text pairs comprise sample images and sample description texts, and the sample image text pairs carry sample parameter information;
inputting a plurality of sample description texts and sample parameter information corresponding to the plurality of sample description texts into an initial generation model to obtain predicted images respectively corresponding to the plurality of sample description texts;
and adjusting model parameters of the initial generation model according to the predicted image and the sample image to obtain the image generation model after training.
Specifically, the training mode of the image generation model is supervised training, that is, the sample description text in the sample set carries real sample images, and the sample images are generation targets of the image generation model and are used for guiding the training process of the image generation model.
It should be noted that, the "sample set acquisition" may refer to the sample set acquisition method in the above-mentioned parameter generation model training process. The implementation manner of inputting the plurality of sample description texts and the sample parameter information corresponding to the plurality of sample description texts into the initial generation model to obtain the prediction images respectively corresponding to the plurality of sample description texts may refer to the implementation manner of inputting the image generation parameters and the image description texts into the image generation model to obtain the target images corresponding to the image description texts, which will not be described in detail in the embodiments of the present specification.
In practical applications, when adjusting the model parameters of the initial generation model according to the predicted images and the sample images, a loss value may be calculated according to a predicted image and the corresponding sample image, and the model parameters of the initial generation model are adjusted according to the loss value until a preset stop condition is reached, thereby obtaining the trained image generation model. There are numerous functions for calculating the loss value, such as a cross entropy loss function, an L1 norm loss function, a maximum loss function, a mean square error loss function and a logarithmic loss function, which are specifically selected according to actual situations and are not limited in the embodiment of the present disclosure.
In one possible implementation manner of the present disclosure, the preset stop condition includes the loss value being less than or equal to a preset threshold. After the loss value is calculated according to the predicted image and the sample image, the loss value is compared with the preset threshold.
Specifically, if the loss value is greater than the preset threshold, the difference between the predicted image and the sample image is large and the prediction capability of the initial generation model is still poor; at this time, the model parameters of the initial generation model can be adjusted and training of the initial generation model continued. When the loss value is less than or equal to the preset threshold, the difference between the predicted image and the sample image is small, the preset stop condition is reached, and the trained image generation model is obtained.
In another possible implementation manner of the present disclosure, in addition to comparing the magnitude relation between the loss value and the preset threshold, it may also be determined whether the training of the current initial generation model is completed by combining the iteration number.
Specifically, if the loss value is greater than the preset threshold, the model parameters of the initial generation model are adjusted and training of the initial generation model is continued until a preset number of iterations is reached, at which point iteration stops and the trained image generation model is obtained. The preset threshold and the preset number of iterations are specifically selected according to actual situations, and the embodiment of the present disclosure does not limit them.
By applying the scheme of the embodiment of the specification, the model parameters of the initial generation model are adjusted according to the predicted images and the sample images to obtain a trained image generation model; by continuously adjusting the model parameters of the initial generation model, the finally obtained image generation model is more accurate.
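A minimal sketch of this stopping logic is given below, using a mean square error loss; the model, data loader and hyperparameter values are placeholders and the sketch does not limit the embodiments of the present disclosure.

```python
import torch
import torch.nn as nn

def train_image_generator(model, data_loader, threshold=0.01, max_iters=10000):
    """Stop when the loss falls to the preset threshold or when the
    preset iteration count is reached, as described above."""
    criterion = nn.MSELoss()  # one of several usable loss functions
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    step = 0
    for text_batch, param_batch, sample_images in data_loader:
        predicted = model(text_batch, param_batch)   # predicted images
        loss = criterion(predicted, sample_images)   # compare with real samples

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        step += 1
        if loss.item() <= threshold or step >= max_iters:
            break  # preset stop condition reached
    return model
```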
Referring to fig. 6, fig. 6 shows a flowchart of another method for training a parameter generating model according to an embodiment of the present disclosure, where the parameter generating model training is applied to cloud-side equipment, and specifically includes the following steps:
Step 602: a sample set is obtained, wherein the sample set comprises a plurality of sample image text pairs, the sample image text pairs comprise sample images and sample description texts, and the sample image text pairs carry sample parameter information.
Step 604: and inputting the plurality of sample image text pairs and the prediction prompt information into a pre-training language model to obtain image prediction parameters respectively corresponding to the plurality of sample image text pairs.
Step 606: and adjusting model parameters of the pre-training language model according to the image prediction parameters and the sample parameter information to obtain a trained parameter generation model.
It should be noted that, the implementation manners of step 602 to step 606 are detailed in the training manner of the parameter generation model in the above image generation method, and the embodiment of the present disclosure is not limited in this regard.
In practical application, after obtaining the trained parameter generation model, model parameters of the trained parameter generation model can be sent to the end-side device, so that a user can build the parameter generation model locally based on the model parameters to generate image generation parameters.
By applying the scheme of the embodiment of the specification, the model parameters of the pre-training language model are adjusted according to the image prediction parameters and the sample parameter information to obtain a trained parameter generation model; by continuously adjusting the model parameters of the pre-training language model, the finally obtained parameter generation model is more accurate.
Referring to fig. 7, fig. 7 shows a flowchart of an automatic question-answering method according to an embodiment of the present disclosure, which specifically includes the following steps:
step 702: and receiving an image question and answer request, wherein the image question and answer request carries image description text.
Step 704: inputting the image description text and the generated prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used for describing visual characteristics of the image, and the parameter generation model is obtained by training based on a plurality of sample image text pairs and sample parameter information carried by the plurality of sample image text pairs.
Step 706: and inputting the image generation parameters and the image description text into an image generation model to obtain a reply image corresponding to the image question-answer request.
It should be noted that, the implementation manner of step 702 to step 706 is detailed in the above steps 302 to 306, which is not limited in any way in the embodiment of the present disclosure.
By applying the scheme of the embodiment of the specification, the visual elements of the image are subjected to semantic disassembly by utilizing the parameter generation model to obtain the image generation parameters, and further accurate image generation is completed based on the image generation parameters, so that the target image can clearly express the image description text and the image generation parameters, and the interpretability and the controllability of the automatic question-answering process are improved.
Referring to fig. 8, fig. 8 shows a flowchart of a processing procedure of another image generation method according to an embodiment of the present disclosure. In the image generation process, the powerful semantic understanding and association capability of the parameter generation model is used to complete semantic disassembly and recombination of the visual elements of the image, and the parameterized visual elements are input into the image generation model to complete accurate image generation. The image generation process can be divided into two stages: a parameter generation model processing stage and an image generation model processing stage, where the parameter generation model processing stage includes fine-tuning of the pre-training language model, construction of the conditional generation sample set, and design of the generation prompt information.
as shown in fig. 8, training the pre-training language model based on a plurality of sample image text pairs (formed based on sample description text and sample image) and sample parameter information carried by the plurality of sample image text pairs to obtain a parameter generation model; acquiring an image description text; inputting the image description text and the generated prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text; and inputting the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text.
By applying the scheme of the embodiment of the specification, generation assisted by a pre-training language model is realized using a two-stage diffusion generation architecture. In the parameter generation model processing stage, the image generation parameters required by the image generation model processing stage are output by means of the emergent capability and appropriate fine-tuning of the pre-training language model; in the image generation model processing stage, image generation controlled by the image generation parameters is then performed, and an image with accurate semantics is output. The generated image can clearly express the generation prompt information or image generation parameters used in the generation process, so that a user can understand and verify the accuracy of the generated result.
Referring to fig. 9, fig. 9 is a schematic diagram illustrating a processing procedure of a parameter generation model according to an embodiment of the present disclosure. As shown in fig. 9, the image description text "two black cats and one white dog on an orange sofa" is acquired. The image description text and generation prompt information 1 ("My title is 'two black cats and one white dog on an orange sofa'. What are the elements in the image?") are input into the parameter generation model to obtain image generation parameter 1: [(black cat, 2), (white dog, 1), (sofa, 1)]. The image description text and generation prompt information 2 ("What are their bounding boxes?") are input into the parameter generation model to obtain image generation parameter 2 corresponding to the image description text: [(black cat, [30,171,212,286]), (black cat, [40,200,123,412]), (white dog, [24,543,231,332]), (sofa, [264,173,222,221])]. The image description text and generation prompt information 3 ("What is the dominant color of each element?") are input into the parameter generation model to obtain image generation parameter 3: [(black cat, [black, gray, dark gray]), …, (white dog, [white, light gray, brown]), …]. Finally, image generation parameters 1, 2 and 3 are aggregated to obtain the final image generation parameters.
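The three rounds of prompting and the final aggregation shown in fig. 9 can be sketched as follows; the ask wrapper and the returned structure are hypothetical and shown for illustration only.

```python
def generate_image_parameters(description: str, ask):
    """Aggregate the three rounds of prompting shown in fig. 9.
    `ask(prompt)` is a hypothetical wrapper that queries the trained
    parameter generation model and returns its parsed answer."""
    counts = ask(f'My title is "{description}". What are the elements in the image?')
    boxes  = ask("What are their bounding boxes?")
    colors = ask("What is the dominant color of each element?")

    # Final image generation parameters: element counts, bounding boxes
    # and dominant colors, aggregated into one structure.
    return {"elements": counts, "bounding_boxes": boxes, "dominant_colors": colors}
```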
Referring to fig. 10, fig. 10 shows an interface schematic of an image generation interface according to an embodiment of the present disclosure. The image generation interface is divided into a request input interface and a result display interface. The request input interface includes a request input box, a "determine" control, and a "cancel" control. The result display interface comprises a result display frame.
A user inputs an image generation request through the request input box displayed by a client, where the image generation request carries an image description text, and clicks the "determine" control. The server receives the image description text sent by the client, and inputs the image description text and the generation prompt information into the parameter generation model to obtain image generation parameters corresponding to the image description text, where the image generation parameters are used for describing visual features of the image, and the parameter generation model is obtained by training based on a plurality of sample image text pairs and sample parameter information carried by the plurality of sample image text pairs. The server then inputs the image generation parameters and the image description text into the image generation model to obtain a target image corresponding to the image description text, and sends the target image to the client. The client displays the target image in the result display frame.
In practical applications, the manner in which the user operates the control includes any manner such as clicking, double clicking, touch control, mouse hovering, sliding, long pressing, voice control or shaking, and the like, and the selection is specifically performed according to the practical situation, which is not limited in any way in the embodiments of the present disclosure.
Corresponding to the above-mentioned image generation method embodiment, the present disclosure further provides an image generation apparatus embodiment, and fig. 11 shows a schematic structural diagram of an image generation apparatus provided in one embodiment of the present disclosure. As shown in fig. 11, the apparatus includes:
a first acquisition module 1102 configured to acquire image description text;
a first input module 1104, configured to input the image description text and the generated prompt information into a parameter generation model, and obtain image generation parameters corresponding to the image description text, where the image generation parameters are used to describe visual features of the image, and the parameter generation model is obtained based on a plurality of sample image text pairs and sample parameter information carried by the plurality of sample image text pairs;
a second input module 1106 is configured to input the image generation parameters and the image description text into the image generation model to obtain a target image corresponding to the image description text.
Optionally, the image generation model includes a parameter encoding unit and a codec unit; the second input module 1106 is further configured to input the image generation parameters and the image description text into the image generation model, encode the image generation parameters via the parameter encoding unit to obtain parameter encoding features, and generate, via the codec unit, a target image corresponding to the image description text according to the parameter encoding features and the image description text.
Optionally, the parameter encoding unit includes a one-dimensional parameter encoding unit, a two-dimensional parameter encoding unit and a feature aggregation unit; the second input module 1106 is further configured to encode, via the one-dimensional parameter encoding unit, one-dimensional parameters in the image generation parameters to obtain one-dimensional encoding features; encode, via the two-dimensional parameter encoding unit, two-dimensional parameters in the image generation parameters to obtain two-dimensional encoding features; and aggregate, via the feature aggregation unit, the one-dimensional encoding features and the two-dimensional encoding features to obtain the parameter encoding features.
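One possible, non-limiting realization of the parameter encoding unit is sketched below; the feature dimensions, the multilayer perceptron encoders and the concatenation-based aggregation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ParameterEncoder(nn.Module):
    """Separate encoders for one-dimensional parameters (e.g. element
    counts) and two-dimensional parameters (e.g. bounding boxes),
    followed by feature aggregation."""
    def __init__(self, dim=256):
        super().__init__()
        self.one_dim = nn.Sequential(nn.Linear(1, dim), nn.ReLU(),
                                     nn.Linear(dim, dim))
        self.two_dim = nn.Sequential(nn.Linear(4, dim), nn.ReLU(),
                                     nn.Linear(dim, dim))
        self.aggregate = nn.Linear(2 * dim, dim)  # concatenation-based aggregation

    def forward(self, counts, boxes):
        f1 = self.one_dim(counts)   # (batch, n, dim) one-dimensional features
        f2 = self.two_dim(boxes)    # (batch, n, dim) two-dimensional features
        return self.aggregate(torch.cat([f1, f2], dim=-1))

encoder = ParameterEncoder()
counts = torch.tensor([[[2.0], [1.0]]])          # element counts per object
boxes = torch.tensor([[[30., 171., 212., 286.],
                       [264., 173., 222., 221.]]])
features = encoder(counts, boxes)                # parameter encoding features
```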
Optionally, the apparatus further comprises: a third acquisition module configured to acquire image update parameters for a target image; determining an image editing area in the target image according to the image updating parameters; masking the image editing area to obtain a mask generating sequence; and inputting the mask generating sequence and the target image into an image generating model to obtain an updated target image.
Optionally, the apparatus further comprises: and the first sending module is configured to send the target image and the updated target image to the client so that the client displays the target image and the updated target image to a user.
Optionally, the apparatus further comprises: and the second receiving module is configured to receive the image selection information sent by the user through the client and adjust the model parameters of the image generation model based on the image selection information.
Optionally, the apparatus further comprises: and the third receiving module is configured to receive adjustment sample data sent by a user through the client and adjust model parameters of the image generation model according to the adjustment sample data, wherein the adjustment sample data is constructed based on the target image.
Optionally, the apparatus further comprises: a parameter generation model training module configured to obtain a sample set, wherein the sample set comprises a plurality of sample image text pairs, the sample image text pairs comprise sample images and sample description text, and the sample image text pairs carry sample parameter information; inputting a plurality of sample image text pairs and prediction prompt information into a pre-training language model to obtain image prediction parameters corresponding to the plurality of sample image text pairs respectively; and adjusting model parameters of the pre-training language model according to the image prediction parameters and the sample parameter information to obtain a parameter generation model for completing training.
Optionally, the parameter generation model training module is further configured to acquire a plurality of sample images;
inputting the first sample image and the construction prompt information into a pre-training language model aiming at the first sample image to obtain a first sample description text of the first sample image; generating first sample parameter information of the first sample image according to the first sample image and the first sample description text; and constructing a sample set according to the plurality of sample images, the sample description text of the plurality of sample images and the sample parameter information.
Optionally, the parameter generation model training module is further configured to perform region detection on the first sample image according to the first sample description text, and determine key region position information of the first sample image; and performing visual segmentation on the first sample image according to the position information of the key region to obtain first sample parameter information of the first sample image.
Optionally, the apparatus further comprises: a fourth acquisition module configured to acquire parameter configuration conditions; under the condition that the first sample parameter information does not meet the parameter configuration conditions, the first sample parameter information is adjusted, and adjusted first sample parameter information is obtained; the parameter generation model training module is further configured to construct a sample set from the plurality of sample images, the sample descriptive text of the plurality of sample images, and the adjusted sample parameter information.
Optionally, the apparatus further comprises: an image generation model training module configured to obtain a sample set, wherein the sample set comprises a plurality of sample image text pairs, the sample image text pairs comprise sample images and sample description text, and the sample image text pairs carry sample parameter information; inputting a plurality of sample description texts and sample parameter information corresponding to the plurality of sample description texts into an initial generation model to obtain a prediction image corresponding to the plurality of sample description texts respectively; and adjusting model parameters of the initial generation model according to the predicted image and the sample image to obtain the image generation model after training.
By applying the scheme of the embodiment of the specification, the visual elements of the image are subjected to semantic disassembly by utilizing the parameter generation model to obtain the image generation parameters, and further accurate image generation is completed based on the image generation parameters, so that the target image can clearly express the image description text and the image generation parameters, and the interpretability and the controllability of the image generation are improved.
The above is a schematic scheme of an image generating apparatus of the present embodiment. It should be noted that, the technical solution of the image generating apparatus and the technical solution of the image generating method belong to the same concept, and details of the technical solution of the image generating apparatus, which are not described in detail, can be referred to the description of the technical solution of the image generating method.
Corresponding to the above embodiment of the method for training the parameter generating model, the present disclosure further provides an embodiment of a device for training the parameter generating model, and fig. 12 is a schematic structural diagram of a device for training the parameter generating model provided in one embodiment of the present disclosure. As shown in fig. 12, the apparatus is applied to cloud-side equipment, and includes:
a second acquisition module 1202 configured to acquire a sample set, wherein the sample set comprises a plurality of sample image text pairs, the sample image text pairs comprising a sample image and a sample description text, the sample image text pairs carrying sample parameter information;
a third input module 1204 configured to input a plurality of pairs of sample image text and prediction prompt information into the pre-training language model, to obtain image prediction parameters respectively corresponding to the plurality of pairs of sample image text;
an adjustment module 1206 configured to adjust model parameters of the pre-training language model based on the image prediction parameters and the sample parameter information to obtain a trained parameter generation model.
By applying the scheme of the embodiment of the specification, the model parameters of the pre-training language model are adjusted according to the image prediction parameters and the sample parameter information, the parameter generation model for completing training is obtained, and the model parameters of the pre-training language model are continuously adjusted, so that the finally obtained parameter generation model is more accurate.
The above is a schematic scheme of a parameter generation model training apparatus of the present embodiment. It should be noted that, the technical solution of the parameter generating model training device and the technical solution of the parameter generating model training method belong to the same concept, and details of the technical solution of the parameter generating model training device which are not described in detail can be referred to the description of the technical solution of the parameter generating model training method.
Corresponding to the above-mentioned automatic question-answering method embodiment, the present disclosure further provides an automatic question-answering device embodiment, and fig. 13 shows a schematic structural diagram of an automatic question-answering device provided in one embodiment of the present disclosure. As shown in fig. 13, the apparatus includes:
a first receiving module 1302 configured to receive an image question-answer request, wherein the image question-answer request carries an image description text;
a fourth input module 1304, configured to input the image description text and the generated prompt information into a parameter generation model, to obtain image generation parameters corresponding to the image description text, where the image generation parameters are used to describe visual features of the image, and the parameter generation model is obtained based on a plurality of sample image text pairs and sample parameter information carried by the plurality of sample image text pairs;
A fifth input module 1306 configured to input the image generation parameters and the image description text into the image generation model, and obtain a reply image corresponding to the image question-and-answer request.
By applying the scheme of the embodiment of the specification, the visual elements of the image are subjected to semantic disassembly by utilizing the parameter generation model to obtain the image generation parameters, and further accurate image generation is completed based on the image generation parameters, so that the target image can clearly express the image description text and the image generation parameters, and the interpretability and the controllability of the automatic question-answering process are improved.
The above is a schematic scheme of an automatic question answering apparatus of this embodiment. It should be noted that, the technical solution of the automatic question-answering device and the technical solution of the automatic question-answering method belong to the same concept, and details of the technical solution of the automatic question-answering device, which are not described in detail, can be referred to the description of the technical solution of the automatic question-answering method.
FIG. 14 illustrates a block diagram of a computing device provided in one embodiment of the present description. The components of computing device 1400 include, but are not limited to, a memory 1410 and a processor 1420. Processor 1420 is coupled to memory 1410 via bus 1430, and database 1450 is used to store data.
Computing device 1400 also includes an access device 1440, which access device 1440 enables computing device 1400 to communicate via one or more networks 1460. Examples of such networks include public switched telephone networks (PSTN, public Switched Telephone Network), local area networks (LAN, local Area Network), wide area networks (WAN, wide Area Network), personal area networks (PAN, personal Area Network), or combinations of communication networks such as the internet. The access device 1440 may include one or more of any type of network interface, wired or wireless (e.g., network interface card (NIC, network Interface Card)), such as an IEEE802.11 wireless local area network (WLAN, wireless Local Area Networks) wireless interface, a worldwide interoperability for microwave access (Wi-MAX, world Interoperability for Microwave Access) interface, an ethernet interface, a universal serial bus (USB, universal Serial Bus) interface, a cellular network interface, a bluetooth interface, a near-field communication (NFC, near Field Communication) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 1400, as well as other components not shown in FIG. 14, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 14 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 1400 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC, personal Computer). Computing device 1400 may also be a mobile or stationary server.
Wherein the processor 1420 is configured to execute computer-executable instructions that, when executed by the processor, perform the steps of the image generation method or the parameter generation model training method or the automatic question-answering method described above.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device belongs to the same concept as the technical solution of the image generating method, the parameter generating model training method and the automatic question-answering method, and details of the technical solution of the computing device which are not described in detail can be described by referring to the technical solution of the image generating method, the parameter generating model training method or the automatic question-answering method.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the image generation method or the parameter generation model training method or the automatic question-answering method described above.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium belongs to the same concept as the technical solution of the image generating method, the parameter generating model training method and the automatic question-answering method, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the image generating method, the parameter generating model training method or the automatic question-answering method.
An embodiment of the present specification further provides a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the above-described image generation method or parameter generation model training method or automatic question-answering method.
The above is an exemplary version of a computer program of the present embodiment. It should be noted that, the technical solution of the computer program belongs to the same concept as the technical solution of the image generating method, the parameter generating model training method and the automatic question-answering method, and details of the technical solution of the computer program which are not described in detail can be referred to the description of the technical solution of the image generating method, the parameter generating model training method or the automatic question-answering method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium can be increased or decreased appropriately according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and the full scope and equivalents thereof.

Claims (16)

1. An image generation method, comprising:
acquiring an image description text;
inputting the image description text and the generated prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used for describing visual characteristics of images, and the parameter generation model is obtained based on a plurality of sample image text pairs and sample parameter information carried by the sample image text pairs;
and inputting the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text.
2. The method of claim 1, the image generation model comprising a parametric coding unit and a codec unit;
the inputting the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text comprises the following steps:
inputting the image generation parameters and the image description text into an image generation model, and encoding the image generation parameters by the parameter encoding unit to obtain parameter encoding characteristics;
and generating a target image corresponding to the image description text according to the parameter coding characteristics and the image description text through the encoding and decoding unit.
3. The method of claim 2, the parameter encoding unit comprising a one-dimensional parameter encoding unit, a two-dimensional parameter encoding unit, and a feature aggregation unit;
the encoding the image generation parameters by the parameter encoding unit to obtain parameter encoding characteristics includes:
the one-dimensional parameter coding unit codes one-dimensional parameters in the image generation parameters to obtain one-dimensional coding characteristics;
the two-dimensional parameter coding unit codes the two-dimensional parameters in the image generation parameters to obtain two-dimensional coding characteristics;
and aggregating the one-dimensional coding feature and the two-dimensional coding feature by the feature aggregation unit to obtain a parameter coding feature.
4. The method according to claim 1, wherein the inputting the image generation parameters and the image description text into an image generation model, after obtaining the target image corresponding to the image description text, further comprises:
acquiring image update parameters for the target image;
determining an image editing area in the target image according to the image updating parameters;
masking the image editing area to obtain a mask generating sequence;
and inputting the mask generation sequence and the target image into the image generation model to obtain an updated target image.
5. The method of claim 4, wherein the inputting the mask generation sequence and the target image into the image generation model, after obtaining the updated target image, further comprises:
and sending the target image and the updated target image to a client so that the client displays the target image and the updated target image to a user.
6. The method of claim 5, after said sending the target image and the updated target image to a client, further comprising:
and receiving image selection information sent by the user through the client, and adjusting model parameters of the image generation model based on the image selection information.
7. The method according to claim 1, wherein the inputting the image generation parameters and the image description text into an image generation model, after obtaining the target image corresponding to the image description text, further comprises:
and receiving adjustment sample data sent by a user through a client, and adjusting model parameters of the image generation model according to the adjustment sample data, wherein the adjustment sample data is constructed based on the target image.
8. The method according to claim 1, wherein the inputting the image description text and the generated prompt information into a parameter generation model, before obtaining the image generation parameters corresponding to the image description text, further comprises:
obtaining a sample set, wherein the sample set comprises a plurality of sample image text pairs, the sample image text pairs comprise sample images and sample description texts, and the sample image text pairs carry sample parameter information;
inputting the plurality of sample image text pairs and the prediction prompt information into a pre-training language model to obtain image prediction parameters respectively corresponding to the plurality of sample image text pairs;
and adjusting model parameters of the pre-training language model according to the image prediction parameters and the sample parameter information to obtain a trained parameter generation model.
9. The method of claim 8, the acquiring a sample set comprising:
acquiring a plurality of sample images;
inputting the first sample image and construction prompt information into a pre-training language model aiming at the first sample image to obtain a first sample description text of the first sample image;
generating first sample parameter information of the first sample image according to the first sample image and the first sample description text;
And constructing a sample set according to the plurality of sample images, the sample description text of the plurality of sample images and the sample parameter information.
10. The method of claim 9, the generating first sample parameter information for the first sample image from the first sample image and the first sample descriptive text, comprising:
performing region detection on the first sample image according to the first sample description text, and determining key region position information of the first sample image;
and performing visual segmentation on the first sample image according to the key area position information to obtain first sample parameter information of the first sample image.
11. The method of claim 9, further comprising, after generating the first sample parameter information of the first sample image from the first sample image and the first sample descriptive text:
acquiring parameter configuration conditions;
adjusting the first sample parameter information under the condition that the first sample parameter information does not meet the parameter configuration conditions, and obtaining adjusted first sample parameter information;
the constructing a sample set according to the plurality of sample images, the sample description text of the plurality of sample images and the sample parameter information comprises the following steps:
And constructing a sample set according to the plurality of sample images, the sample description text of the plurality of sample images and the adjusted sample parameter information.
12. The method according to claim 1, wherein before the inputting the image generation parameters and the image description text into the image generation model to obtain the target image corresponding to the image description text, further comprises:
obtaining a sample set, wherein the sample set comprises a plurality of sample image text pairs, the sample image text pairs comprise sample images and sample description texts, and the sample image text pairs carry sample parameter information;
inputting a plurality of sample description texts and sample parameter information corresponding to the plurality of sample description texts into an initial generation model to obtain prediction images corresponding to the plurality of sample description texts respectively;
and adjusting model parameters of the initial generation model according to the predicted image and the sample image to obtain the image generation model after training.
13. A parameter generation model training method is applied to cloud side equipment and comprises the following steps:
obtaining a sample set, wherein the sample set comprises a plurality of sample image text pairs, the sample image text pairs comprise sample images and sample description texts, and the sample image text pairs carry sample parameter information;
Inputting the plurality of sample image text pairs and the prediction prompt information into a pre-training language model to obtain image prediction parameters respectively corresponding to the plurality of sample image text pairs;
and adjusting model parameters of the pre-training language model according to the image prediction parameters and the sample parameter information to obtain a trained parameter generation model.
14. An automatic question-answering method, comprising:
receiving an image question and answer request, wherein the image question and answer request carries an image description text;
inputting the image description text and the generated prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used for describing visual characteristics of images, and the parameter generation model is obtained based on a plurality of sample image text pairs and sample parameter information carried by the sample image text pairs;
and inputting the image generation parameters and the image description text into an image generation model to obtain a reply image corresponding to the image question-answer request.
15. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer executable instructions, the processor being configured to execute the computer executable instructions, which when executed by the processor, implement the steps of the method of any one of claims 1 to 12 or claim 13 or claim 14.
16. A computer readable storage medium storing computer executable instructions which when executed by a processor implement the steps of the method of any one of claims 1 to 12 or claim 13 or claim 14.
CN202311622640.1A 2023-11-28 2023-11-28 Image generation, automatic question and answer and parameter generation model training method Pending CN117830447A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311622640.1A CN117830447A (en) 2023-11-28 2023-11-28 Image generation, automatic question and answer and parameter generation model training method

Publications (1)

Publication Number Publication Date
CN117830447A true CN117830447A (en) 2024-04-05

Family

ID=90521793

Country Status (1)

Country Link
CN (1) CN117830447A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination