CN117830451A - Text illustration generation method, device, equipment and storage medium - Google Patents

Text illustration generation method, device, equipment and storage medium

Info

Publication number: CN117830451A
Application number: CN202311871939.0A
Authority: CN (China)
Prior art keywords: text, word, input, trained, model
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 苏朋杨
Current Assignee: Shenzhen Flash Scissor Intelligent Technology Co ltd
Original Assignee: Shenzhen Flash Scissor Intelligent Technology Co ltd
Priority date: 2023-12-29
Filing date: 2023-12-29
Publication date: 2024-04-05
Application filed by Shenzhen Flash Scissor Intelligent Technology Co ltd

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to the field of computers and discloses a text illustration generation method, device, equipment and storage medium. The method comprises the following steps: receiving a text illustration generation request and acquiring an input text according to the request; acquiring an input prompt word and performing semantic extraction on the input text through a trained large language model according to the input text and the prompt word to obtain a target text, wherein the prompt word specifies the conditions that the generated target text must satisfy; encoding the target text with an encoder of a pre-trained neural network model to obtain a text encoding; and, according to the text encoding, using a trained diffusion model to pass randomly sampled noise through a learned denoising process and generate a text illustration corresponding to the target text. The embodiment of the invention simplifies the process of matching illustrations to text, generates text illustrations automatically, and improves the efficiency of text illustration generation.

Description

Text illustration generation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computers, and in particular, to a method, an apparatus, a device, and a storage medium for generating text illustrations.
Background
The speed of content creation is an important factor in determining traffic scale. Among the various forms of multimedia information, images are more intuitive than text, carry stronger visual impact, and spread more easily, so artificial intelligence techniques that generate images from text have become a research hotspot.
The workflow of self-media video creation currently falls into two steps: first, writing the text, and second, matching illustrations to the text. Illustration matching today is essentially a manual process: the creator must understand the text, manually search the Internet for related images, and use editing software to assemble the images and text into a video. This process is time-consuming and labor-intensive, resulting in inefficient text illustration generation.
Disclosure of Invention
The invention mainly aims to solve the technical problem of low text illustration generation efficiency.
The first aspect of the present invention provides a text illustration generation method, which includes:
receiving a text illustration generation request, and acquiring an input text according to the text illustration generation request;
acquiring an input prompt word, and performing semantic extraction on the input text through a trained large language model according to the input text and the prompt word to obtain a target text, wherein the prompt word specifies the conditions that the generated target text must satisfy;
encoding the target text with an encoder of a pre-trained neural network model to obtain a text encoding;
and, according to the text encoding, using a trained diffusion model to pass randomly sampled noise through a learned denoising process and generate a text illustration corresponding to the target text.
Optionally, in a first implementation manner of the first aspect of the present invention, the generating, according to the text encoding, a text illustration corresponding to the target text by passing randomly sampled noise through a learned denoising process using a trained diffusion model includes:
mapping the text encoding to a representation space by an encoder of the pre-trained neural network model;
mapping the text encoding to an image encoding using a trained diffusion model;
and, according to the image encoding, using a trained text-image generation model to map the encoding from the representation space to the image space, conveying the semantic information of the target text, and passing randomly sampled noise through a learned denoising process to generate the text illustration corresponding to the target text.
Optionally, in a second implementation manner of the first aspect of the present invention, the mapping, by the encoder of the pre-trained neural network model, the text encoding to a representation space includes:
acquiring a plurality of text-image pairs, and encoding each text-image pair through an image encoder and a text encoder;
calculating cosine similarity of each coded text-image pair;
performing training iterations to minimize the cosine similarity between incorrect text-image pairs and maximize the cosine similarity between correct text-image pairs, obtaining a pre-trained neural network model;
the text encoding is mapped to a representation space by an encoder of the pre-trained neural network model.
Optionally, in a third implementation manner of the first aspect of the present invention, the text illustration generation method further includes:
acquiring a training text, and encoding the training text into a token sequence;
inputting the token sequence into a Transformer model to obtain final token embeddings;
and projecting the final token embeddings, concatenating them to the attention context of each layer in the diffusion model's diffusion process, and performing model training to obtain a trained text-image generation model.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the obtaining an input prompt word and performing semantic extraction on the input text through a trained large language model according to the input text and the prompt word to obtain a target text includes:
acquiring an input prompt word, predicting a first word according to the input text and the prompt word through a trained large language model, and adding the first word into a preset generated text;
predicting a next word from the input text in an autoregressive manner, and adding the next word to the preset generated text;
and cyclically predicting the next word until a target text meeting the conditions in the prompt word is generated.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the obtaining the input prompt word, predicting a first word according to the input text and the prompt word through a trained large language model, and adding the first word to a preset generated text includes:
acquiring a training data set, wherein the training data set is composed of a plurality of question-answer texts;
training the initial large language model according to the multiple question-answer texts to obtain a trained large language model;
acquiring an input prompt word, inputting the input text and the prompt word into a trained large language model, and generating a corresponding text response;
and predicting a first word according to the text response, and adding the first word into a preset generated text.
Optionally, in a sixth implementation manner of the first aspect of the present invention, before the receiving a text illustration generation request and obtaining the input text according to the text illustration generation request, the method further includes:
pushing a text input menu;
acquiring a text document, and sending the text document to the text input menu;
and analyzing the text document to obtain an input text.
The second aspect of the present invention provides a text illustration generation device, comprising: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line; the at least one processor invokes the instructions in the memory to cause the text illustration generation device to perform the text illustration generation method described above.
A third aspect of the present invention provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the text illustration generation method described above.
In the embodiment of the invention, a text illustration generation request is received, and an input text is acquired according to the request; an input prompt word is acquired, and semantic extraction is performed on the input text through a trained large language model according to the input text and the prompt word to obtain a target text, wherein the prompt word specifies the conditions that the generated target text must satisfy; the target text is encoded by an encoder of a pre-trained neural network model to obtain a text encoding; and, according to the text encoding, a trained diffusion model passes randomly sampled noise through a learned denoising process to generate a text illustration corresponding to the target text. By extracting the semantics of the input text with a trained large language model to obtain the target text, encoding the target text with the encoder of a pre-trained neural network model, and then using the trained diffusion model to pass randomly sampled noise through a learned denoising process to generate the corresponding text illustration, the invention simplifies the process of matching illustrations to text, generates text illustrations automatically, and improves the efficiency of text illustration generation.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a text illustration generation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an embodiment of a text illustration generation apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of a text illustration generation device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a text illustration generation method, a device, equipment and a storage medium.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and examples of the present disclosure are for illustrative purposes only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and its variants should be taken to be open-ended, i.e., "including, but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
For ease of understanding, a specific flow of an embodiment of the present invention is described below with reference to fig. 1, where an embodiment of the text illustration generation method according to an embodiment of the present invention includes:
s100, receiving a text illustration generation request, and acquiring an input text according to the text illustration generation request.
In this embodiment, a user prepares a text document for content creation, inputs it to a terminal, generates a text illustration generation request through the terminal, and sends the request to a server; the server loads the text illustration generation system and obtains the input text according to the request. The terminal may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or the like. The server may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
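As an illustrative sketch only (the patent does not name a web framework; Flask, the endpoint path, and the JSON field below are assumptions), the request flow of S100 could look like the following minimal HTTP service, where the terminal posts the authored text and the server extracts the input text from the illustration generation request:

```python
# Minimal sketch of S100: receiving a text illustration generation request and
# extracting the input text. Flask, the route name, and the "text" field are
# illustrative assumptions, not part of the patent.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/generate-illustration", methods=["POST"])
def handle_generation_request():
    payload = request.get_json(force=True)
    input_text = payload.get("text", "")  # authored passage sent by the terminal
    if not input_text:
        return jsonify({"error": "no input text supplied"}), 400
    # Steps S200-S400 (LLM summarization, CLIP encoding, diffusion sampling)
    # would be invoked here before returning the generated illustration.
    return jsonify({"status": "accepted", "input_text": input_text})

if __name__ == "__main__":
    app.run(port=8000)
```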
S200, acquiring the input prompt words, and extracting the semantics of the input text through a trained large language model according to the input text and the prompt words to obtain a target text.
In this embodiment, there are many large language models available; the present invention adopts the Tongyi Qianwen (Qwen) model. To make the large language model summarize the text content, a hand-written prompt word is needed that tells the model to answer according to a fixed format. For a large language model (LLM), an answer is obtained by inputting prompt words, and the format and quality of the answer are determined by those prompt words; the prompt word is therefore optimized. For example, the prompt may require that the model reply in English, keep the summary within five punctuation marks, and avoid words such as "He", because such pronouns mislead the semantic understanding during SD (Stable Diffusion) image generation. Given the prompt word, the large language model outputs a target text that satisfies the conditions in the prompt word.
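A minimal sketch of this step, assuming an instruction-tuned Qwen checkpoint served through the Hugging Face transformers library (the model name, prompt wording, and length limit are illustrative assumptions, not values fixed by the patent):

```python
# Sketch of S200: compressing the input text into an English, SD-friendly
# target text with a large language model. Model id and prompt are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-7B-Instruct"  # placeholder; any instruction-tuned LLM

PROMPT_TEMPLATE = (
    "Summarize the following passage as one short English scene description "
    "for a text-to-image model. Do not use pronouns such as 'he' or 'she'; "
    "name the subjects explicitly.\n\nPassage:\n{passage}\n\nScene:"
)

def extract_target_text(passage: str) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    inputs = tokenizer(PROMPT_TEMPLATE.format(passage=passage), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=77)  # keep prompts short
    # Keep only the newly generated tokens: the target text / image prompt.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()
```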
S300, encoding the target text through an encoder of the pre-trained neural network model to obtain the text encoding.
In this embodiment, the target text is input to the encoder of a pre-trained neural network model to obtain the text encoding. The pre-trained model may be a CLIP model, which is pre-trained on contrastive text-image pairs and whose structure consists of two parts: a text encoder (Text Encoder) and an image encoder (Image Encoder).
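A minimal sketch of S300, assuming the pre-trained neural network model is an off-the-shelf CLIP checkpoint (the model id below is an illustrative choice, not one specified by the patent):

```python
# Sketch of S300: encoding the target text with CLIP's text encoder.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

CLIP_ID = "openai/clip-vit-large-patch14"  # illustrative checkpoint
tokenizer = CLIPTokenizer.from_pretrained(CLIP_ID)
text_encoder = CLIPTextModel.from_pretrained(CLIP_ID)

def encode_target_text(target_text: str) -> torch.Tensor:
    tokens = tokenizer(target_text, padding="max_length", truncation=True,
                       max_length=77, return_tensors="pt")
    with torch.no_grad():
        # (1, 77, 768): per-token embeddings that later condition the
        # diffusion model's cross-attention layers.
        return text_encoder(**tokens).last_hidden_state
```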
S400, according to the text encoding, using a trained diffusion model to pass randomly sampled noise through a learned denoising process and generate a text illustration corresponding to the target text.
In this embodiment, diffusion models are generative models, meaning they are used to generate data resembling the training data. Fundamentally, a diffusion model works by corrupting the training data through the repeated addition of Gaussian noise and then learning to reverse this noising process to recover the data. Specifically, the noising process is treated as a parameterized Markov chain that gradually adds noise to corrupt the image, eventually (asymptotically) yielding pure Gaussian noise. The diffusion model learns to traverse this chain backwards, gradually removing the noise over a series of steps to reverse the process. After training, the diffusion model can generate a text illustration for the target text simply by passing randomly sampled noise through the learned denoising process. The present invention uses the SDXL diffusion model for text-to-image generation; the advantage of SDXL is that an illustration related to the text can be generated from only a small number of prompt words. In ordinary video production it can take tens of minutes to find a picture that matches the text, whereas SDXL text-to-image generation takes less than 10 s.
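A minimal sketch of S400 with SDXL through the diffusers library; the model id and sampling settings are common defaults rather than values specified by the patent. Internally the pipeline encodes the prompt, samples Gaussian noise, and runs the learned denoising loop:

```python
# Sketch of S400: generating the illustration from the target text with SDXL.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def generate_illustration(target_text: str, path: str = "illustration.png"):
    # Randomly sampled noise is denoised step by step under the text condition.
    image = pipe(prompt=target_text, num_inference_steps=30).images[0]
    image.save(path)
    return image
```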
In an alternative embodiment of the first aspect of the present invention, generating a text illustration corresponding to the target text by passing randomly sampled noise through a learned denoising process using a trained diffusion model according to the text encoding comprises:
mapping the text encoding to a representation space by an encoder of the pre-trained neural network model; mapping the text encoding to an image encoding using the trained diffusion model; and, according to the image encoding, using a trained text-image generation model to map the encoding from the representation space to the image space, conveying the semantic information of the target text, and passing randomly sampled noise through a learned denoising process to generate the text illustration corresponding to the target text.
In this embodiment, the encoder of the pre-trained neural network model maps the text encoding to the representation space; the trained diffusion model then maps the text encoding to the corresponding image encoding, which captures the semantic information of the prompt contained in the text encoding; finally, the text-image generation model maps from the representation space to the image space through reverse diffusion, conveying the semantic information of the text and generating the text illustration corresponding to the target text. The text-image generation model may be a GLIDE model, which adds text conditioning information on top of the ADM model architecture.
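To make the two-stage mapping concrete, here is a deliberately reduced sketch of a DALL·E-2-style "prior" that maps a CLIP text embedding to a predicted CLIP image embedding, which a diffusion decoder would then turn into pixels; the architecture (a small MLP) and the dimensions are illustrative assumptions, not the patent's design:

```python
# Conceptual sketch of the text-encoding -> image-encoding mapping (the "prior").
import torch
import torch.nn as nn

class TextToImageEmbeddingPrior(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        # Reduced to an MLP for brevity; real priors are diffusion/Transformer models.
        self.net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, text_embedding: torch.Tensor) -> torch.Tensor:
        # text_embedding: (batch, dim) CLIP text features
        # returns: (batch, dim) predicted CLIP image features for the decoder
        return self.net(text_embedding)
```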
In an alternative embodiment of the first aspect of the present invention, mapping the text encoding to the representation space by the encoder of the pre-trained neural network model comprises:
acquiring a plurality of text-image pairs, and encoding each text-image pair through an image encoder and a text encoder; calculating the cosine similarity of each encoded text-image pair; performing training iterations to minimize the cosine similarity between incorrect text-image pairs and maximize the cosine similarity between correct text-image pairs, obtaining a pre-trained neural network model; and mapping the text encoding to the representation space by the encoder of the pre-trained neural network model.
In this embodiment, for a training batch containing N text-image pairs, the text encoder and image encoder are used to extract N text features and N image features. There are N positive samples in total, i.e., texts and images that truly belong to the same pair, while the remaining N² − N text-image pairs are negative samples. Combining the N text features and N image features pairwise, the pre-trained neural network model predicts the similarity of the N² possible text-image pairs, where similarity is computed directly as the cosine similarity between the text features and the image features. The goal of pre-training the neural network model is to minimize the cosine similarity between incorrect text-image pairs and maximize the cosine similarity between correct text-image pairs.
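A minimal sketch of this contrastive objective (CLIP-style), assuming the text and image features have already been extracted by the two encoders; the temperature value is a common default, not one given by the patent:

```python
# Sketch of the pre-training objective: maximize cosine similarity on the N
# correct (diagonal) pairs and minimize it on the N^2 - N incorrect pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_features: torch.Tensor,
                          image_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    text_features = F.normalize(text_features, dim=-1)    # make dot products cosines
    image_features = F.normalize(image_features, dim=-1)
    logits = text_features @ image_features.t() / temperature  # (N, N) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = correct
    # Symmetric cross-entropy over text->image and image->text directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```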
In an optional implementation manner of the first aspect of the present invention, the text illustration generation method further includes:
acquiring a training text, and encoding the training text into a token sequence; inputting the token sequence into a Transformer model to obtain final token embeddings; and projecting the final token embeddings, concatenating them to the attention context of each layer in the diffusion model's diffusion process, and performing model training to obtain a trained text-image generation model.
In this embodiment, the training process is augmented with additional text information so that the model ultimately generates text-conditioned images. To condition on text, the training text is first obtained and encoded into a sequence of K tokens, which are input into a Transformer model. The Transformer outputs the final token embeddings (a sequence of K feature vectors); these embeddings are projected to the dimensionality of each attention layer throughout the diffusion model and then concatenated to the attention context of each layer. Model training is then performed to obtain a trained text-image generation model.
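A minimal sketch of this conditioning mechanism (GLIDE-style): the final token embeddings are projected to each attention layer's width and concatenated to that layer's keys and values so the layer can attend to the text. The module below is an illustrative stand-in for one such layer; the dimensions and names are assumptions:

```python
# Sketch: projecting K text token embeddings into an attention layer's context.
import torch
import torch.nn as nn

class TextConditionedAttention(nn.Module):
    def __init__(self, channels: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, channels)  # per-layer projection
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # x: (batch, h*w, channels) flattened image features at this layer
        # text_tokens: (batch, K, text_dim) final token embeddings from the Transformer
        context = torch.cat([x, self.text_proj(text_tokens)], dim=1)
        out, _ = self.attn(query=x, key=context, value=context)
        return x + out  # residual connection
```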
In an optional implementation manner of the first aspect of the present invention, the obtaining an input prompt word and performing semantic extraction on the input text through a trained large language model according to the input text and the prompt word to obtain the target text includes:
acquiring an input prompt word, predicting a first word according to the input text and the prompt word through a trained large language model, and adding the first word to a preset generated text; predicting the next word from the input text in an autoregressive manner, and adding the next word to the preset generated text; and cyclically predicting the next word until a target text meeting the conditions in the prompt word is generated.
In this embodiment, the target text is generally generated in an "autoregressive" manner: the large language model predicts the first word from the input text and the prompt word and adds it to the preset generated text; it then predicts the next word and appends it to the generated text; this process continues until a target text satisfying the expected length or conditions is produced.
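A minimal sketch of this autoregressive loop with greedy decoding, assuming `model` and `tokenizer` are a causal language model and its tokenizer loaded elsewhere (the stopping rule shown, end-of-sequence or a token budget, is an illustrative stand-in for the prompt-word conditions):

```python
# Sketch: predict one word at a time and append it until a stop condition holds.
import torch

def generate_autoregressively(model, tokenizer, prompt: str,
                              max_new_tokens: int = 128) -> str:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated = input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(generated).logits              # (1, seq_len, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:   # stop condition met
            break
    return tokenizer.decode(generated[0, input_ids.shape[1]:],
                            skip_special_tokens=True)
```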
In an optional implementation manner of the first aspect of the present invention, obtaining an input prompt word, predicting a first word according to the input text and the prompt word through a trained large language model, and adding the first word to a preset generated text includes:
acquiring a training data set which is composed of a plurality of question-answer texts; training the initial large language model according to the multiple question-answering texts to obtain a trained large language model; acquiring an input prompt word, inputting the input text and the prompt word into a trained large language model, and generating a corresponding text response; and predicting the first word according to the text response, and adding the first word into the preset generated text.
In this embodiment, question-and-answer texts are collected as the training data set; the multiple question-answer texts are input into an initial large language model for training, yielding a trained large language model. Once model training is complete, the model can receive an initial input text (called a "sample") and generate a relevant text response, and the model begins predicting the first word from this text response.
In an optional implementation manner of the first aspect of the present invention, before the receiving a text illustration generation request and obtaining the input text according to the request, the method further includes:
pushing a text input menu; acquiring a text document, and sending the text document to the text input menu; and parsing the text document to obtain the input text.
In this embodiment, the user creates a text document containing the authored content; the system pushes a text input menu; the text document is sent to the text input menu by selecting a button; and the system parses the text document to obtain the input text. Optionally, the user can directly enter the text content in a text box of the text input menu.
Referring to fig. 2, a second aspect of the present invention provides a text illustration generation apparatus including:
an input text acquisition module 100, configured to receive a text illustration generation request, and acquire an input text according to the request;
a target text acquisition module 200, configured to acquire an input prompt word, and perform semantic extraction on the input text according to the input text and the prompt word through a trained large language model to obtain a target text, where the prompt word specifies the conditions that the generated target text must satisfy;
a text encoding acquisition module 300, configured to encode the target text with an encoder of the pre-trained neural network model to obtain a text encoding;
a text illustration generation module 400, configured to generate a text illustration corresponding to the target text by passing randomly sampled noise through a learned denoising process according to the text encoding, using a trained diffusion model.
In an alternative embodiment of the second aspect of the present invention, the text illustration generation module 400 is further configured to map the text encoding to the representation space by an encoder of the pre-trained neural network model; map the text encoding to an image encoding using the trained diffusion model; and, according to the image encoding, use a trained text-image generation model to map the encoding from the representation space to the image space, convey the semantic information of the target text, and pass randomly sampled noise through a learned denoising process to generate the text illustration corresponding to the target text.
In an alternative embodiment of the second aspect of the present invention, the text illustration generation module 400 is further configured to acquire a plurality of text-image pairs and encode each text-image pair with the image encoder and the text encoder; calculate the cosine similarity of each encoded text-image pair; perform training iterations to minimize the cosine similarity between incorrect text-image pairs and maximize the cosine similarity between correct text-image pairs, obtaining a pre-trained neural network model; and map the text encoding to the representation space by the encoder of the pre-trained neural network model.
In an optional embodiment of the second aspect of the present invention, the text illustration generation apparatus further includes:
a model training module, configured to acquire a training text and encode the training text into a token sequence; input the token sequence into a Transformer model to obtain final token embeddings; and project the final token embeddings, concatenate them to the attention context of each layer in the diffusion model's diffusion process, and perform model training to obtain a trained text-image generation model.
In an alternative embodiment of the second aspect of the present invention, the target text acquisition module 200 is further configured to acquire an input prompt word, predict a first word according to the input text and the prompt word through a trained large language model, and add the first word to a preset generated text; predict the next word from the input text in an autoregressive manner, and add the next word to the preset generated text; and cyclically predict the next word until a target text meeting the conditions in the prompt word is generated.
In an alternative embodiment of the second aspect of the present invention, the target text acquisition module 200 is further configured to acquire a training data set composed of a plurality of question-answer texts; train the initial large language model according to the multiple question-answer texts to obtain a trained large language model; acquire an input prompt word, input the input text and the prompt word into the trained large language model, and generate a corresponding text response; and predict the first word according to the text response and add the first word to the preset generated text.
In an optional embodiment of the second aspect of the present invention, the text illustration generation apparatus further includes:
a text input module, configured to push a text input menu; acquire a text document and send the text document to the text input menu; and parse the text document to obtain the input text.
Fig. 3 is a schematic structural diagram of a text illustration generation device according to an embodiment of the present invention. The text illustration generation device 500 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may be transitory or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the text illustration generation device 500. Still further, the processor 510 may be configured to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the text illustration generation device 500.
The text illustration generation device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the device structure shown in fig. 3 does not limit the text illustration generation device, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium or a volatile computer readable storage medium, having stored therein instructions that, when executed on a computer, cause the computer to perform the steps of the text illustration generation method.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (10)

1. A text illustration generation method, characterized in that the text illustration generation method comprises:
receiving a text illustration generation request, and acquiring an input text according to the text illustration generation request;
acquiring an input prompt word, and performing semantic extraction on the input text through a trained large language model according to the input text and the prompt word to obtain a target text, wherein the prompt word specifies the conditions that the generated target text must satisfy;
encoding the target text with an encoder of a pre-trained neural network model to obtain a text encoding;
and, according to the text encoding, using a trained diffusion model to pass randomly sampled noise through a learned denoising process and generate a text illustration corresponding to the target text.
2. The text illustration generation method of claim 1, wherein the generating the text illustration corresponding to the target text by passing randomly sampled noise through a learned denoising process using a trained diffusion model according to the text encoding comprises:
mapping the text encoding to a representation space by an encoder of the pre-trained neural network model;
mapping the text encoding to an image encoding using a trained diffusion model;
and, according to the image encoding, using a trained text-image generation model to map the encoding from the representation space to the image space, conveying the semantic information of the target text, and passing randomly sampled noise through a learned denoising process to generate the text illustration corresponding to the target text.
3. The text illustration generation method of claim 2, wherein the mapping the text encoding to a representation space by an encoder of the pre-trained neural network model comprises:
acquiring a plurality of text-image pairs, and encoding each text-image pair through an image encoder and a text encoder;
calculating cosine similarity of each coded text-image pair;
performing training iterations to minimize the cosine similarity between incorrect text-image pairs and maximize the cosine similarity between correct text-image pairs, obtaining a pre-trained neural network model;
the text encoding is mapped to a representation space by an encoder of the pre-trained neural network model.
4. The text illustration generation method according to claim 2, further comprising:
acquiring a training text, and encoding the training text into a token sequence;
inputting the token sequence into a Transformer model to obtain final token embeddings;
and projecting the final token embeddings, concatenating them to the attention context of each layer in the diffusion model's diffusion process, and performing model training to obtain a trained text-image generation model.
5. The text illustration generation method of claim 1, wherein the obtaining an input prompt word and performing semantic extraction on the input text through a trained large language model according to the input text and the prompt word to obtain a target text comprises:
acquiring an input prompt word, predicting a first word according to the input text and the prompt word through a trained large language model, and adding the first word into a preset generated text;
predicting a next word from the input text in an autoregressive manner, and adding the next word to the preset generated text;
and cyclically predicting the next word until a target text meeting the conditions in the prompt word is generated.
6. The text illustration generation method of claim 5, wherein the obtaining the input prompt word, predicting a first word according to the input text and the prompt word through a trained large language model, and adding the first word to a preset generated text comprises:
acquiring a training data set, wherein the training data set is composed of a plurality of question-answer texts;
training the initial large language model according to the multiple question-answer texts to obtain a trained large language model;
acquiring an input prompt word, inputting the input text and the prompt word into a trained large language model, and generating a corresponding text response;
and predicting a first word according to the text response, and adding the first word into a preset generated text.
7. The text illustration generation method of claim 1, wherein before the receiving a text illustration generation request and obtaining the input text according to the text illustration generation request, the method further comprises:
pushing a text input menu;
acquiring a text document, and sending the text document to the text input menu;
and analyzing the text document to obtain an input text.
8. A text illustration generation apparatus, comprising:
the input text acquisition module is used for receiving a text illustration generation request and acquiring an input text according to the text illustration generation request;
the target text acquisition module is used for acquiring an input prompt word and performing semantic extraction on the input text through a trained large language model according to the input text and the prompt word to obtain a target text, wherein the prompt word specifies the conditions that the generated target text must satisfy;
the text encoding acquisition module is used for encoding the target text through an encoder of the pre-trained neural network model to obtain a text encoding;
and the text illustration generation module is used for generating the text illustration corresponding to the target text by passing randomly sampled noise through a learned denoising process using a trained diffusion model according to the text encoding.
9. A text illustration generation device, the text illustration generation device comprising: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the text illustration generation device to perform the text illustration generation method of any of claims 1-7.
10. A computer readable storage medium having a computer program stored thereon, which when executed by a processor implements the text illustration generation method according to any of claims 1-7.
CN202311871939.0A 2023-12-29 Text illustration generation method, device, equipment and storage medium (Pending) CN117830451A

Priority Applications (1)

Application Number: CN202311871939.0A — Priority Date: 2023-12-29 — Filing Date: 2023-12-29 — Title: Text illustration generation method, device, equipment and storage medium

Publications (1)

Publication Number: CN117830451A — Publication Date: 2024-04-05

Family

ID=90516843

Country Status (1)

Country: CN — CN117830451A (en)


Legal Events

Code — Title
PB01 — Publication
SE01 — Entry into force of request for substantive examination