CN117437317A - Image generation method, apparatus, electronic device, storage medium, and program product - Google Patents

Image generation method, apparatus, electronic device, storage medium, and program product

Info

Publication number
CN117437317A
Authority
CN
China
Prior art keywords
image
noise
sample
features
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311399398.6A
Other languages
Chinese (zh)
Inventor
华锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311399398.6A
Publication of CN117437317A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The application discloses an image generation method, an image generation apparatus, an electronic device, a storage medium, and a program product, which can be applied to artificial intelligence fields such as computer vision and machine learning, as well as to large-model fields such as pre-trained models. The method includes: obtaining a description text to be processed and a reference image; extracting text features and visual features from the description text to be processed and the reference image, respectively; obtaining semantic guidance features according to the text features and the visual features; adding reference noise to the reference image to obtain a noise-added image; performing noise prediction on the noise-added image under the semantic guidance features to obtain prediction noise; and repairing the noise-added image based on the prediction noise to generate a target image. Because visual features are introduced into the semantic guidance features, a vivid, figurative image can be restored as faithfully as possible in combination with the text features, improving the quality of the generated image.

Description

Image generation method, apparatus, electronic device, storage medium, and program product
Technical Field
The present invention relates to the field of computer technology, and in particular, to an image generating method, an image generating device, an electronic device, a storage medium, and a program product.
Background
In recent years, with the development of artificial intelligence technology, the technology of generating images from text prompts has matured. An image generation model, in particular a diffusion model, can perform image rendering based on text entered by a user, generating a predicted image associated with the text.
However, text has poor figurativeness: it is difficult for text to describe a thing clearly, such as a person's appearance, and in particular it is hard for text to convey a person's facial features vividly. As a result, the image generation model cannot accurately capture such details, and the images generated by the trained image generation model are of poor quality.
Disclosure of Invention
The embodiment of the application provides an image generation method, an image generation device, electronic equipment, a storage medium and a program product, which can improve the quality of generated images.
The embodiment of the application provides an image generation method, which comprises the following steps: acquiring a description text to be processed and a reference image; extracting text features and visual features from the description text to be processed and the reference image respectively; obtaining semantic guidance features according to the text features and the visual features; adding reference noise into the reference image to obtain a noise-added image; carrying out noise prediction on the noise-added image through the semantic guidance characteristics to obtain prediction noise; and repairing the noise-added image based on the prediction noise to generate a target image.
The embodiment of the application also provides an image generating device, which comprises: the acquisition unit is used for acquiring the description text to be processed and the reference image; the extraction unit is used for respectively extracting text features and visual features from the description text to be processed and the reference image; the combination unit is used for obtaining semantic guidance features according to the text features and the visual features; the noise adding unit is used for adding reference noise into the reference image to obtain a noise added image; the prediction unit is used for carrying out noise prediction on the noise-added image through the semantic guidance characteristics to obtain prediction noise; and the generating unit is used for repairing the noise-added image based on the prediction noise so as to generate a target image.
In some embodiments, the extraction unit includes a first visual extraction subunit, a second visual extraction subunit, and a third visual extraction subunit, comprising: a first visual extraction subunit for extracting initial visual features from the reference image; the second vision extraction subunit is used for acquiring the characteristic adjustment parameters; and the third visual extraction subunit is used for obtaining the visual characteristics according to the characteristic adjustment parameters and the initial visual characteristics.
In some embodiments, the image generating apparatus further comprises an adding unit, comprising: the adding unit is used for adding a placeholder at a position adjacent to the parent keyword in the description text to be processed.
in some embodiments, the combination unit includes a replacement subunit comprising: and the replacing subunit is used for replacing the feature corresponding to the placeholder in the text feature with the visual feature to obtain the semantic guidance feature.
In some embodiments, the prediction unit includes a sampling subunit, a cross-attention subunit, and an attention prediction subunit, including: the sampling subunit is used for carrying out multi-scale feature sampling on the noise-added image to obtain multi-scale sampling features; the cross attention subunit is used for carrying out cross attention processing on the semantic guidance feature and the sampling feature aiming at the sampling feature of any scale to obtain an attention feature; an attention prediction subunit, configured to obtain the prediction noise from the attention feature.
In some embodiments, the cross-attention subunit includes a first transformation subunit and a weighting subunit, including: the first transformation subunit is used for carrying out linear transformation on the semantic guidance feature aiming at the sampling feature of any scale to obtain a key vector and a value vector, and carrying out linear transformation on the sampling feature to obtain a query vector; and the weighting subunit is used for carrying out attention weighting on the value vector through the query vector and the key vector to obtain attention characteristics.
In some embodiments, the attention prediction subunit includes a second transformation subunit and a transformation prediction subunit, comprising: the second transformation subunit is used for performing linear transformation on the attention features to obtain transformed features; and the transformation prediction subunit is used for obtaining the prediction noise from the transformed features.
In some embodiments, the image generating apparatus further includes a training unit including a training acquisition subunit, a training extraction subunit, a training combining subunit, a training noise adding subunit, a training prediction subunit, and a training adjustment subunit, including: the training acquisition subunit is used for acquiring a training sample set and an image generation model to be trained, wherein the training sample set comprises at least one sample image and a sample description text corresponding to the sample image; a training extraction subunit, configured to extract a sample text feature and a sample visual feature from the sample description text and the sample image, respectively; the training combination subunit is used for obtaining sample semantic guidance characteristics according to the sample text characteristics and the sample visual characteristics; the training noise adding subunit is used for adding sample noise into the sample image to obtain a noise added sample image; the training prediction subunit is used for carrying out noise prediction on the noise-added sample image through the sample semantic guidance characteristics to obtain predicted sample noise; the training adjustment subunit is used for adjusting the model parameters to be adjusted of the image generation model to be trained according to the predicted sample noise and the loss value between the sample noise to obtain a trained image generation model, and the trained image generation model is used for generating a target image.
In some embodiments, the image generating apparatus further includes a text generating unit including a first text generating sub-unit, a second text generating sub-unit, a third text generating sub-unit, a fourth text generating sub-unit, a fifth text generating sub-unit, and a sixth text generating sub-unit, including: a first text generation subunit, configured to extract an image feature to be processed from the sample image; a second text generation subunit, configured to take a descriptive text sequence including a start tag as a text sequence to be processed; the third text generation subunit is used for carrying out attention calculation on the image characteristics to be processed and the text sequences to be processed to obtain attention weights; a fourth text generation subunit, configured to determine, according to the attention weight, a generation probability of a next word in the descriptive text sequence; a fifth text generation subunit, configured to determine, according to the generation probability of the next word, the next word in the descriptive text sequence, so as to obtain a current descriptive text sequence; and a sixth text generation subunit, configured to take the current description text sequence as the text sequence to be processed, return to the execution step to perform attention computation on the feature of the image to be processed and the text sequence to be processed, obtain attention weight, and perform the subsequent steps until an end mark is generated, and take the current description text sequence as the sample description text corresponding to the sample image.
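The autoregressive caption-generation procedure described above can be sketched as a simple greedy decoding loop. The following is a minimal toy sketch, not the apparatus itself: the module names, vocabulary size, feature shapes, and start/end token ids are all illustrative assumptions, and the randomly initialized layers stand in for a trained captioning network.

```python
# Hypothetical sketch of the sample-description generation loop (greedy decoding).
# All modules, sizes, and token ids are illustrative placeholders.
import torch

vocab_size, d = 1000, 64
BOS, EOS = 1, 2
img_feats = torch.randn(1, 49, d)             # image features to be processed (from the sample image)
embed = torch.nn.Embedding(vocab_size, d)
attn = torch.nn.MultiheadAttention(d, num_heads=4, batch_first=True)
to_vocab = torch.nn.Linear(d, vocab_size)

seq = [BOS]                                   # text sequence to be processed, starting with a start tag
for _ in range(20):
    tok = embed(torch.tensor([seq]))          # (1, t, d)
    # attention between the image features and the current text sequence
    ctx, _ = attn(query=tok, key=img_feats, value=img_feats)
    logits = to_vocab(ctx[:, -1])             # generation probability of the next word
    next_word = int(logits.argmax(-1))
    seq.append(next_word)                     # current description text sequence
    if next_word == EOS:                      # stop once an end tag is generated
        break
print(seq)                                    # token ids of the sample description text
```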
In some embodiments, the image generating apparatus further includes a parameter determining unit including a parameter acquiring subunit, a parameter splitting subunit, and a parameter determining subunit, comprising: the parameter acquiring subunit is used for acquiring noise prediction parameters, the noise prediction parameters being used for performing noise prediction on the noise-added sample image; the parameter splitting subunit is used for splitting the noise prediction parameters into fixed parameters and parameters to be adjusted; and the parameter determining subunit is used for taking the parameters to be adjusted as the model parameters to be adjusted.
In some embodiments, the parameter determination subunit includes a decomposition subunit and a determination subunit, including: the decomposition subunit is used for carrying out low-rank decomposition on the parameters to be adjusted to obtain a plurality of low-rank parameter matrixes; and the determining subunit is used for taking the plurality of low-rank parameter matrixes as the model parameters to be adjusted.
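The low-rank decomposition of the parameters to be adjusted can be illustrated with a short sketch. This is a minimal LoRA-style example under assumed shapes and a hypothetical rank; it is not the specific decomposition used by the apparatus.

```python
# Minimal sketch: a frozen weight plus trainable low-rank parameter matrices.
# Shapes and rank are illustrative assumptions.
import torch

d_out, d_in, rank = 320, 320, 4
W_fixed = torch.randn(d_out, d_in)                         # fixed parameters (not trained)
A = torch.nn.Parameter(torch.zeros(d_out, rank))           # low-rank parameter matrix A
B = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)     # low-rank parameter matrix B

def adapted_linear(x):
    # Only A and B (the model parameters to be adjusted) receive gradients.
    return x @ (W_fixed + A @ B).T

x = torch.randn(2, d_in)
print(adapted_linear(x).shape)   # torch.Size([2, 320])
```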
In some embodiments, the training adjustment subunit includes a loss calculation subunit and a loss adjustment subunit, comprising: a loss calculation subunit for calculating a first loss value between the predicted sample noise and the sample noise, and calculating a second loss value between the sample visual feature and the sample text feature; and the loss adjustment subunit is used for adjusting the model parameters to be adjusted of the image generation model to be trained by combining the first loss value and the second loss value to obtain a trained image generation model so as to use the trained image generation model for generating an image.
The embodiment of the application also provides computer equipment, which comprises a processor and a memory, wherein the memory stores a plurality of instructions; the processor loads instructions from the memory to perform steps in any of the image generation methods provided by the embodiments of the present application.
The present embodiments also provide a computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform steps in any of the image generation methods provided by the embodiments of the present application.
Embodiments of the present application also provide a computer program product comprising a plurality of instructions which, when executed by a processor, implement steps in any of the image generation methods provided by the embodiments of the present application.
The embodiment of the application can acquire the description text to be processed and the reference image; extracting text features and visual features from the description text to be processed and the reference image respectively; obtaining semantic guidance features according to the text features and the visual features; adding reference noise into the reference image to obtain a noise-added image; carrying out noise prediction on the noise-added image through the semantic guidance characteristics to obtain prediction noise; and repairing the noise-added image based on the prediction noise to generate a target image.
In the image generation process, the visual features are combined with the text features. Because the visual features carry figurative and detailed information, they add richer figurative and detail semantics to the text, strengthening the control exerted by the semantic information. Meanwhile, introducing the visual features into the semantic guidance features lets the noise prediction process attend not only to the text features but also to the figurative, detailed features in the visual features, so that a vivid image can be restored as faithfully as possible from the predicted noise, improving the quality of the generated image.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1a is a schematic view of a scenario of an image generating method provided in an embodiment of the present application;
fig. 1b is a schematic flow chart of an image generating method according to an embodiment of the present application;
FIG. 2a is a schematic structural diagram of an image generation model according to an embodiment of the present application;
FIG. 2b is a flow chart of an image generation method according to another embodiment of the present application;
fig. 2c is a schematic diagram of a sample description text obtaining process according to an embodiment of the present application;
FIG. 2d is a schematic diagram of a matrix decomposition process provided by an embodiment of the present application;
FIG. 2e is a fused-style personalized character image provided by an embodiment of the present application;
FIG. 3 is a flow chart of an image generation method according to another embodiment of the present application;
fig. 4 is a schematic structural diagram of an image generating apparatus provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The embodiment of the application provides an image generation method, an image generation device, electronic equipment, a storage medium and a program product.
It will be appreciated that the specific embodiments of the present application involve user-related data such as reference images, sample images, person images, face images, description text, and selfies. When the embodiments of the present application are applied to specific products or technologies, user permission or consent must be obtained, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
The electronic device may be a terminal, a server, or the like. The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a desktop computer, an intelligent television, a vehicle-mounted device and the like; the server may be a single server, or may be a server cluster or cloud server composed of a plurality of servers.
The image generation method may be implemented by an electronic device. The electronic equipment can acquire a description text to be processed and a reference image; respectively extracting text features and visual features from the description text to be processed and the reference image; according to the text features and the visual features, semantic guidance features are obtained; adding reference noise into the reference image to obtain a noise-added image; noise prediction is carried out on the noise-added image through semantic guidance characteristics, so that prediction noise is obtained; based on the predicted noise, the noisy image is restored to generate a target image. For example, referring to fig. 1a, in some embodiments, the electronic device may be a server, and the server may obtain the description text to be processed and the reference image from the terminal through a network, so as to implement the image generating method. The server may also send the generated target image to the terminal via the network.
The following embodiments will be described in detail. The order in which they are described is not intended to limit the preferred order of the embodiments.
Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason, and make decisions.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, pre-training model technology, operation/interaction systems, and mechatronics. A pre-trained model, also called a large model or foundation model, can be widely applied, after fine-tuning, to downstream tasks in all major directions of artificial intelligence. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to recognize and measure targets and perform other machine vision tasks, and further performs graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought important innovation to the development of computer vision: pre-trained models in the vision field, such as Swin Transformer, ViT, V-MOE, and MAE, can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing involves natural language, i.e., the language people use in daily life, and is therefore closely related to linguistics, as well as to computer science and mathematics. An important technique for model training in the artificial intelligence field, the pre-trained model, developed from Large Language Models in the NLP field. Through fine-tuning, a large language model can be widely applied to downstream tasks. Natural language processing technologies typically include text processing, semantic understanding, machine translation, question answering, knowledge graph technology, and the like.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and more. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning. The pre-trained model is the latest development of deep learning and integrates the above techniques.
Autonomous driving technology means that a vehicle drives itself without operation by a driver. It typically includes technologies such as high-precision maps, environment perception, computer vision, behavioral decision-making, path planning, and motion control. Autonomous driving covers multiple development paths, such as single-vehicle intelligence, vehicle-road coordination, and networked cloud control. Autonomous driving technology has broad application prospects; it is currently applied in fields such as logistics, public transport, taxis, and intelligent transportation, and will develop further in the future.
With the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, digital twins, virtual humans, robots, Artificial Intelligence Generated Content (AIGC), conversational interaction, smart healthcare, smart customer service, and game AI. It is believed that with the development of technology, artificial intelligence will be applied in more fields and play an increasingly important role.
A pre-trained model (PTM), also called a foundation model or large model, refers to a deep neural network (DNN) with a large number of parameters that is trained on massive unlabeled data, using the function-approximation capability of the large-parameter DNN so that the PTM extracts common features from the data. It is then adapted to downstream tasks through techniques such as fine-tuning, parameter-efficient fine-tuning (PEFT), and prompt-tuning. Therefore, a pre-trained model can achieve good results in few-shot or zero-shot scenarios. PTMs can be classified by the data modality they process into language models (ELMo, BERT, GPT), vision models (Swin Transformer, ViT, V-MOE), speech models (VALL-E), multimodal models (ViLBERT, CLIP, Flamingo, Gato), and so on, where a multimodal model builds feature representations for two or more data modalities. The pre-trained model is an important tool for producing Artificial Intelligence Generated Content (AIGC) and can also serve as a general interface connecting multiple specific task models.
Adaptive computation: the computational cost and precision of a model are adjusted automatically according to different input data, so as to improve the model's computational efficiency while maintaining its precision. Adaptive computation can flexibly adjust the computational cost and precision of the model on different inputs, achieving a better balance between efficiency and precision.
In this embodiment, an image generating method related to artificial intelligence is provided, as shown in fig. 1b, the image generating method may be executed by an electronic device, and the specific flow may be as follows:
110. and acquiring the description text to be processed and a reference image.
The description text to be processed refers to the description text used for generating an image. The description text may contain the content information of the image to be generated. For example, from the description text "a yellow puppy smiling on a lawn", a picture of "a yellow puppy standing on a green lawn with its tongue out, as if it were smiling" may be generated.
The reference image refers to an image used as a reference or guide when generating an image from the description text to be processed. By using information in the reference image to assist image generation, a more accurate and refined generation effect can be achieved.
The image generation method provided by the embodiments of the present application can be applied to various image generation scenarios, such as animal image generation tasks or character image generation tasks. Taking a character image generation task as an example, with the user's separate permission or consent, any description text entered by the user can be obtained, for example "under soft lighting, the bright eyes sparkle, and the smile reveals confidence and warmth", and any selfie uploaded by the user can be obtained as the reference image.
It should be noted that, in the embodiments of the present application, the description text to be processed may be any text, and the description text need not be directly related to the content of the reference image. For example, the reference image may show "a yellow puppy playing on the grass" while the description text to be processed may be text the user enters at will, such as "the weather is nice today". Although the contents of the reference image and the description text do not appear directly related, the semantic relationship between the two can be understood from context, and the two can be fused to generate the image.
120. And respectively extracting text features and visual features from the description text to be processed and the reference image.
Wherein the visual features refer to a vector representation of visual information of the image. Text features refer to vector representations of text semantic information. For example, the text feature and the visual feature may be extracted from the descriptive text to be processed and the reference image by a visual encoder and a text encoder, respectively.
The embodiments of the present application may be applied to an image generation model, and the image generation model may include a visual coding network formed by a visual encoder and a text coding network formed by a text encoder. After the reference image and the description text to be processed are input into the visual coding network and the text coding network respectively, they are encoded so that the reference image and the description text to be processed are represented by low-dimensional vectors; the resulting vector representation of the reference image is taken as the visual features, and the vector representation of the description text to be processed is taken as the text features. The visual encoder and the text encoder may be any model usable for text or image encoding, such as a convolutional neural network (CNN), a recurrent neural network (RNN), an autoencoder, or a CLIP (Contrastive Language-Image Pre-training) model.
In some embodiments, the reference image and the description text to be processed may be encoded using the visual encoder and the text encoder of the same pre-trained text-image model, respectively, to obtain the visual features and the text features. For example, the reference image and the description text to be processed may be encoded using the Vision Encoder and the Text Encoder of a pre-trained CLIP (Contrastive Language-Image Pre-training) model. The visual encoder is a convolutional neural network (CNN) that processes the input image to extract visual features; these features can represent the content, style, and other visual attributes of the image, and the goal of the visual encoder is to learn, without class labels, visual feature representations that match natural language descriptions. The text encoder uses a multi-layer Transformer architecture that captures the relationships and context information between words, thereby producing a richer text representation. During training of the pre-trained CLIP model, a large number of pre-training images and corresponding description texts are used for contrastive learning; by maximizing the scores of matched image-text pairs and minimizing the scores of unrelated pairs, the model learns a shared representation between vision and language, so that encoding images and texts with the visual encoder and the text encoder yields corresponding visual features and text features in the same feature space.
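One possible way to obtain such features is sketched below using a pre-trained CLIP model from the Hugging Face `transformers` package; the checkpoint name and the blank stand-in image are assumptions for illustration only, and the patent does not prescribe this library.

```python
# Hedged sketch: extracting text features and visual features with a pre-trained CLIP model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

text = "a young man in a tank top holding a basketball"   # description text to be processed
image = Image.new("RGB", (224, 224))                       # stands in for the reference image

inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # pooled embeddings in CLIP's shared feature space
    text_features = model.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])
    visual_features = model.get_image_features(pixel_values=inputs["pixel_values"])
    # token-level text features (one vector per token), useful when an individual
    # token's feature needs to be replaced later, come from the text encoder directly
    token_features = model.text_model(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"]).last_hidden_state

print(text_features.shape, visual_features.shape, token_features.shape)
```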
In some embodiments, the vector representation of the reference image may be adjusted according to feature adjustment parameters, so as to control and edit the reference image and obtain visual features that better meet the requirements of image generation. Specifically, the visual features are obtained by:
extracting initial visual features from a reference image;
acquiring characteristic adjustment parameters;
and obtaining the visual characteristics according to the characteristic adjustment parameters and the initial visual characteristics.
The feature adjustment parameter refers to a parameter for adjusting an image vector. For example, the feature adjustment parameter may be a parameter configured to adjust a particular attribute of the image, such as color or size, whereby the feature adjustment parameter may be represented as an attribute vector. The feature adjustment parameter may also be a parameter of the vector adjustment model, e.g. the feature adjustment parameter may be a parameter of the transformation (Concept Transform) model.
In some embodiments, the visual features may be obtained by processing the feature adjustment parameters and the initial visual features with one or a combination of operations such as vector multiplication, vector subtraction, or other fusion methods.
For example, the image generation model may also include an adjustment network formed by a transformation model. When the image encoder extracts features from the reference image, the reference image can be mapped to a low-dimensional vector representation (i.e., the initial visual features), and the transformation model can then combine the initial visual features with the feature adjustment parameters, for example by multiplication or subtraction, to obtain visual features that better meet the requirements of image generation.
In some implementations, the feature adjustment parameters may include linear transformation parameters and nonlinear transformation parameters. Specifically, the initial visual features extracted by the visual encoder can be subjected to linear transformation and nonlinear transformation through the linear transformation parameters and the nonlinear transformation parameters to obtain the visual features. For example, the parameters of a multi-layer feed-forward neural network (FFN) model may be used as the feature adjustment parameters. Specifically, one or more fully connected layers may be introduced to construct the FFN model, each fully connected layer including a linear transformation, a nonlinear activation function (i.e., the nonlinear transformation parameters), and optionally a regularization method (e.g., dropout); the FFN model computes a weighted sum of the nodes of the previous layer through the linear transformation parameters and then transforms it with a nonlinear activation function (e.g., a ReLU function) to introduce nonlinearity and obtain the visual features.
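A minimal sketch of such an FFN-based adjustment is shown below; the layer sizes, dropout rate, and module name are illustrative assumptions rather than the patent's actual configuration.

```python
# Illustrative sketch: adjusting the initial visual feature with a small feed-forward network.
import torch
import torch.nn as nn

class ConceptTransform(nn.Module):
    def __init__(self, dim=512, hidden=1024, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),   # linear transformation parameters
            nn.ReLU(),                # nonlinear activation (nonlinear transformation)
            nn.Dropout(p),            # optional regularization
            nn.Linear(hidden, dim),
        )

    def forward(self, initial_visual_feature):
        return self.net(initial_visual_feature)   # adjusted visual feature

initial = torch.randn(1, 512)                      # initial visual feature from the image encoder
visual_feature = ConceptTransform()(initial)
print(visual_feature.shape)
```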
130. And obtaining semantic guidance features according to the text features and the visual features.
For example, the text features and the visual features may be combined by one or more of concatenation, summation, multiplication, and the like to obtain the semantic guidance features. For instance, text feature A and visual feature B may be concatenated to obtain the semantic guidance feature AB. Because the visual features contain figurative and detailed information, they can be added to the text features as a sub-class description, supplementing the text with more figurative and detailed semantic information, thereby strengthening the control exerted by the semantic information and improving the quality of the generated image. Figurativeness here refers to the concrete forms and states in which things or phenomena exist, together with their specific manifestations and characteristics; things or phenomena that are strongly figurative are easy for people to perceive and understand directly. The visual features therefore increase the figurativeness of the text, which can improve the quality of the generated image.
In some embodiments, the visual features can be added at positions adjacent to the features that correspond to the parent keywords of the description text within the text features, so that a mapping relationship between the image information and the text information can be better established, further strengthening the control exerted by the semantic information and improving the quality of the generated image.
The parent keywords refer to keywords of general description in the description text. A parent keyword may be a keyword describing the concept or subject of the text, such as human, mammal, bird, or reptile. The parent keywords may also correspond to the task type. For example, when applied to a character image generation task, the parent keywords may be words that can denote a person, such as person, people, human, man, woman, boy, girl, gentleman, lady, child, baby, kid, teenager, adult, senior, citizen, employee, employer, customer, client, patient, student, teacher, doctor, nurse, athlete, musician, artist, player, and the like.
For example, when applied to a character image generation task, suppose the description text to be processed is "a young man in a tank top holding a basketball". The parent keywords for this task scenario can be matched one by one against the words in the description text to be processed, finding the parent keyword man. The description text is encoded to obtain text features A1A2 … Ai … Am, where Ai denotes the feature corresponding to man. The visual features B1B2 … Bi … Bm of the reference image can then be added before the feature Ai, resulting in the semantic guidance features A1A2 … B1B2 … Bi … BmAi … Am; or the visual features B1B2 … Bi … Bm of the reference image may be added after the feature Ai, resulting in the semantic guidance features A1A2 … AiB1B2 … Bi … Bm … Am.
In some embodiments, when there are multiple parent keywords in the descriptive text, visual features may be added adjacent to features corresponding to the first parent keyword of the descriptive text to be processed in the text features. The first parent keyword refers to a first parent keyword in a plurality of parent keywords according to the arrangement sequence in the description text to be processed.
In some embodiments, a placeholder can be inserted at a position adjacent to the parent keyword of the description text to be processed, so that the text features and the visual features can be combined at the feature position corresponding to the placeholder to obtain the semantic guidance features. Because the feature obtained by encoding a placeholder is usually fixed, identifying the feature corresponding to the placeholder in the text features makes it possible to determine the insertion position of the visual features quickly and accurately, and the inserted visual features will not be confused with other text features, which improves the efficiency and accuracy of the feature combination. Specifically, before obtaining the semantic guidance features according to the text features and the visual features, the method further includes:
adding placeholders at adjacent positions of parent keywords in the description text to be processed;
obtaining semantic guidance features according to the text features and the visual features, including:
and replacing the features corresponding to the placeholders in the text features with visual features to obtain semantic guidance features.
A placeholder is a symbol that occupies a position in text, using a particular symbol or marker to represent a particular location or value. A special symbol such as s may be used as a placeholder. In some implementations, the placeholder can be added before the parent keyword so that the feature corresponding to the placeholder can be quickly identified in the text features.
For example, the placeholder s may be added before the parent keyword man of the description text to be processed "a young man in a tank top holding a basketball", resulting in "a young s man in a tank top holding a basketball". The description text to be processed is encoded to obtain text features A1A2 … Aj … Am, where Aj denotes the feature corresponding to the placeholder s. The visual features B1B2 … Bi … Bm of the reference image can replace Aj, yielding the semantic guidance features A1A2 … B1B2 … Bi … Bm … Am.
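The replacement step can be illustrated with a toy tensor sketch; the sequence lengths, feature dimension, and placeholder position below are assumptions chosen only to show the operation.

```python
# Toy sketch: replace the placeholder's feature in the text features with the visual
# features to form the semantic guidance features.
import torch

m, d, k = 10, 512, 4
text_features = torch.randn(m, d)         # text features A1..Am of the encoded description text
visual_features = torch.randn(k, d)       # visual features B1..Bk of the reference image
placeholder_index = 2                     # position of the feature corresponding to the placeholder

semantic_guidance = torch.cat([
    text_features[:placeholder_index],     # features before the placeholder
    visual_features,                       # visual features inserted as a sub-class description
    text_features[placeholder_index + 1:], # features after the placeholder
], dim=0)
print(semantic_guidance.shape)             # (m - 1 + k, d)
```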
140. And adding reference noise into the reference image to obtain a noise-added image.
The reference noise refers to noise to be added to the reference image and is typically random noise. Random noise refers to noise whose value at a given moment cannot be predicted, caused by the accumulation of a large number of randomly occurring fluctuation disturbances over time. For example, the random noise may include, but is not limited to, one or more of Gaussian noise, salt-and-pepper noise, or multiplicative noise.
For example, the image generation model may generate a reference noise image of the same size as the reference image based on random noise. For Gaussian noise, random numbers can be drawn from a normal distribution with a mean of 0 and a desired standard deviation to generate the reference noise image. The reference noise image is then added to the reference image to obtain the noise-added image.
In some embodiments, noise may be added to the hidden vector of the reference image. Specifically, adding reference noise to a reference image to obtain a noisy image, including:
extracting an initial hidden vector from a reference image;
and adding the reference noise to the initial hidden vector to obtain the noise-added hidden vector of the noise-added image, so as to obtain the noise-added image.
For example, the image generation model may further include an image coding network formed by an image encoder. After the reference image is input into the image coding network, it is encoded by the image coding network to obtain a hidden vector of the reference image (i.e., the initial hidden vector), and random noise is added to the initial hidden vector to obtain the noise-added hidden vector.
In some embodiments, a pre-trained variational autoencoder (VAE) encoder may be used to extract the initial hidden vector from the reference image. The VAE encoder converts the input reference image into a latent vector representation (i.e., the hidden vector) that contains abstract features of the reference image.
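A hedged sketch of encoding the reference image with a pre-trained VAE and adding random noise to its hidden vector is given below, using the `diffusers` package as one possible implementation; the checkpoint name, image size, and noise schedule are illustrative assumptions.

```python
# Sketch: VAE encoding of the reference image plus noise addition in latent space.
import torch
from diffusers import AutoencoderKL, DDPMScheduler

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
scheduler = DDPMScheduler(num_train_timesteps=1000)

reference_image = torch.rand(1, 3, 512, 512) * 2 - 1        # reference image scaled to [-1, 1]
with torch.no_grad():
    initial_hidden = vae.encode(reference_image).latent_dist.sample()   # initial hidden vector

reference_noise = torch.randn_like(initial_hidden)           # Gaussian reference noise
t = torch.randint(0, scheduler.config.num_train_timesteps, (1,))
noisy_hidden = scheduler.add_noise(initial_hidden, reference_noise, t)  # noise-added hidden vector
print(noisy_hidden.shape)
```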
150. And carrying out noise prediction on the noise-added image through semantic guidance characteristics to obtain prediction noise.
Wherein, the semantic guidance feature refers to a feature condition for guiding a noise prediction process.
In the embodiments of the present application, the semantic guidance features can guide and constrain the noise prediction process on the noise-added image, helping the image generation model understand and restore the image more accurately. It can be appreciated that, in general, an image generation model needs to establish a relationship between text and the corresponding image in order to denoise and restore the image. Because text has poor figurativeness, it is difficult to clearly describe a thing such as a person's appearance, and in particular it is hard for text to convey a person's facial features vividly, so the image generation model cannot accurately capture the described details, and the generated image is of poor quality or does not match the description. In the embodiments of the present application, however, the visual features are added to the text features as a sub-class description, introducing the visual features into the semantic guidance features, so that the noise prediction process attends not only to the text features but also to the figurative and detailed features in the visual features, thereby better establishing the relationship between the text and the corresponding image and improving the quality of the generated image.
In some embodiments, details of the image may be captured from different scales through multi-scale feature sampling to more fully understand the image content and provide more information for noise prediction. And then, the semantic guidance features and the sampling features are subjected to cross attention processing so as to dynamically adjust the weights of the features based on an attention mechanism, thereby being beneficial to focusing attention on important feature parts and improving the accuracy and effect of noise prediction so as to improve the quality of generated images. Specifically, noise prediction is performed on the noise-added image through semantic guidance features to obtain prediction noise, including:
Performing multi-scale feature sampling on the noise-added image to obtain multi-scale sampling features;
aiming at the sampling features of any scale, carrying out cross attention processing on semantic guidance features and the sampling features to obtain attention features;
the prediction noise is derived from the attention features.
The multi-scale feature processing refers to a process of decomposing and processing images on different scales so as to obtain feature representations under corresponding scales, and fusing the feature representations. And after the noise-added image is subjected to multi-scale feature processing, the multi-scale features are obtained through fusion. For example, the noisy image may be multi-scale feature sampled using filters of different scales or pyramid sampling methods. In multi-scale features, each scale refers to a representation of different sizes obtained by scaling or sampling the raw data to different extents.
Wherein cross-attention processing refers to computing a correlation between two different input sequences or feature graphs based on an attention mechanism to capture the relationship between them. In the embodiment of the application, the two different input sequences or feature maps are semantic guidance features and multi-scale features.
For example, the image generation model may also include a noise prediction network formed by a noise prediction model. The noise-added hidden vector of the noise-added image is subjected to multi-scale feature sampling by the noise prediction model to obtain sampling features at n scales. For the sampling feature i at any scale i, the semantic guidance features and the sampling feature i can be processed with cross attention to obtain the attention feature i corresponding to scale i. The attention feature i may be used as the input feature of the next feature sampling step. The prediction noise is obtained from the attention features output by the largest-scale feature sampling (i.e., the attention features output after the last feature sampling step).
In some implementations, the multi-scale feature processing may be implemented through downsampling and upsampling, so the sampling features include downsampled features and upsampled features. For example, a Unet (U-shaped network) model is used as the noise prediction model to perform multi-scale feature sampling on the noise-added hidden vector of the noise-added image. The Unet model comprises an encoder and a decoder, with information transferred between them through skip connections. In this embodiment, the encoder contains downsampling modules and the decoder contains upsampling modules, and both are equipped with cross-attention layers. Specifically, the downsampling modules progressively reduce the input noise-added image through successive convolution layers and downsampling operations to capture high-level feature representations (i.e., downsampled features). After each downsampling operation, the downsampled features and the semantic guidance features can be processed with cross attention to obtain downsampled attention features. The decoder gradually restores the downsampled features to the size of the input noise-added image through upsampling operations and deconvolution, progressively enlarging the image feature map to obtain upsampled features. After each upsampling operation, the upsampled features and the semantic guidance features can be processed with cross attention to obtain upsampled attention features. Through the skip connections between the encoder and the decoder, the attention features of the encoder are connected with the attention features of the corresponding decoder stage, i.e., the downsampled attention features and the upsampled attention features of the same scale are connected (fused) as the decoder output features of that scale, which provides more comprehensive context information and helps the decoder restore details better. The output feature of the last decoder stage, i.e., the feature obtained by connecting (fusing) the largest-scale downsampled attention features and upsampled attention features, is taken as the prediction noise.
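One way to realize this structure is a UNet with cross-attention layers conditioned on the semantic guidance features. The sketch below uses the `diffusers` UNet2DConditionModel purely as an illustration; the block layout, channel sizes, and feature dimensions are assumptions, not the configuration of this application.

```python
# Sketch: multi-scale noise prediction conditioned on semantic guidance features.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=64, in_channels=4, out_channels=4,
    block_out_channels=(64, 128, 256),            # multi-scale sampling (illustrative sizes)
    down_block_types=("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D"),
    cross_attention_dim=512,                       # dimension of the semantic guidance features
)

noisy_hidden = torch.randn(1, 4, 64, 64)           # noise-added hidden vector
timestep = torch.tensor([10])
semantic_guidance = torch.randn(1, 16, 512)        # text features with visual features inserted

predicted_noise = unet(noisy_hidden, timestep,
                       encoder_hidden_states=semantic_guidance).sample
print(predicted_noise.shape)                       # (1, 4, 64, 64)
```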
In some embodiments, the semantic guidance features can be used for performing cross attention processing on the sampling features of each scale one by one so as to gradually inject text information into image information, so that more image details are restored as much as possible, the loss of information is reduced, and the quality of the generated image is improved. Specifically, for the sampling feature of any scale, performing cross attention processing on the semantic guidance feature and the sampling feature to obtain an attention feature, including:
aiming at the sampling characteristics of any scale, performing linear transformation on the semantic guidance characteristics to obtain key vectors and value vectors, and performing linear transformation on the sampling characteristics to obtain query vectors;
and carrying out attention weighting on the value vector through the query vector and the key vector to obtain attention characteristics.
The query vector, the key vector, and the value vector refer to the query vector Q, the key vector K, and the value vector V in an attention mechanism, where Q is used to learn the relationship between the querying element and the other elements, K is used to learn the relationship between the other elements and the querying element, and V carries the specific information of each element.
For example, taking a Unet (U-shaped network) model as the noise prediction model, in the cross-attention layer of a downsampling or upsampling module at any scale, the semantic guidance feature X_1 can be linearly transformed with a key parameter matrix W_K and a value parameter matrix W_V to obtain the key vector K = X_1 * W_K and the value vector V = X_1 * W_V, where * denotes matrix multiplication. Likewise, the sampling feature X_2 produced by the sampling module (e.g., the downsampled feature of a downsampling module or the upsampled feature of an upsampling module) can be linearly transformed with a query parameter matrix W_Q to obtain the query vector Q = X_2 * W_Q. The attention matrix is obtained by matrix-multiplying the query vector and the key vector, the attention matrix is normalized with a softmax function to obtain normalized attention weights, and the attention weights are used to compute a weighted sum of the value vectors, giving the attention feature of the sampling module. These steps can be expressed as Attention(Q, K, V) = softmax(QK^T / √d_k) · V, where d_k is the dimension of the key vectors, 1/√d_k is the scaling factor, and Attention(Q, K, V) is the attention feature.
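The cross-attention computation above can be written out directly; the sketch below uses illustrative dimensions and randomly initialized parameter matrices, so it only demonstrates the formula rather than a trained layer.

```python
# Sketch: cross attention where K and V come from the semantic guidance feature X1,
# Q comes from the sampling feature X2, and Attention(Q, K, V) = softmax(QK^T/sqrt(d_k))V.
import math
import torch

d_model, d_k = 512, 64
X1 = torch.randn(16, d_model)    # semantic guidance feature (sequence of 16 tokens)
X2 = torch.randn(64, d_model)    # sampling feature at one scale (flattened spatial positions)

W_Q = torch.randn(d_model, d_k)  # query parameter matrix
W_K = torch.randn(d_model, d_k)  # key parameter matrix
W_V = torch.randn(d_model, d_k)  # value parameter matrix

Q = X2 @ W_Q                     # query vectors from the sampling feature
K = X1 @ W_K                     # key vectors from the semantic guidance feature
V = X1 @ W_V                     # value vectors from the semantic guidance feature

attn_weights = torch.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)   # normalized attention weights
attention_feature = attn_weights @ V                              # weighted sum over the values
print(attention_feature.shape)                                    # (64, d_k)
```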
In some embodiments, the accuracy and effectiveness of noise prediction may be improved by linearly transforming the attention features, enhancing the local information, to improve the quality of the generated image. Specifically, deriving the prediction noise from the attention feature includes:
linearly transforming the attention characteristic to obtain a transformed characteristic;
The prediction noise is derived from the transformed features.
For example, taking the Unet model as the noise prediction model, the cross-attention layer of a downsampling or upsampling module at any scale may be followed by a linear transformation layer. For instance, a multi-layer feed-forward neural network (FFN) model may be used as the linear transformation layer to linearly transform the attention features of each downsampling or upsampling step and obtain transformed features; the transformed features may serve as the input features of the next sampling module, and the transformed features output by the last decoder stage (i.e., the last upsampling module) may be used as the prediction noise. Specifically, the FFN model may include multiple layers of neurons arranged hierarchically: the largest-scale connected feature can be fed into these layers, each layer computing a weighted sum X·W + B (i.e., a linear transformation, where X is the input feature, W is the weight matrix, and B is the bias vector), applying a nonlinear activation function, and passing the transformed values to the next layer, until the last layer outputs the transformed features.
160. Based on the predicted noise, the noisy image is restored to generate a target image.
For example, the image generation model may further include an image decoding network formed by an image decoder. The prediction noise may be subtracted from the noise-added hidden vector of the noise-added image to obtain a target hidden vector, and the image decoder then decodes the target hidden vector to obtain the target image.
In some embodiments, when the encoder of a pre-trained variational autoencoder (VAE) is used to extract the initial hidden vector from the reference image, the decoder of the same pre-trained VAE may be used to decode the target hidden vector to obtain the target image. The VAE decoder maps the target hidden vector from the latent space back to the data space to generate the image.
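Following the simplified single-step description above (a full diffusion sampler would iterate over many denoising steps), the restoration step could be sketched as follows; `vae_decoder` is a placeholder for the pre-trained VAE decoder, not a specific library API.

```python
import torch

@torch.no_grad()
def restore_image(noisy_latent, predicted_noise, vae_decoder):
    # Remove the predicted noise from the noise-added hidden vector to recover
    # the target hidden vector, then decode it back to image space.
    target_latent = noisy_latent - predicted_noise
    return vae_decoder(target_latent)  # maps the latent space back to the data space
```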
In some implementations, the image generation model can be trained using the sample image and the sample description text corresponding to the sample image to generate the target image using the trained image generation model. Specifically, the image generation method further includes:
acquiring a training sample set and an image generation model to be trained, wherein the training sample set comprises at least one sample image and a sample description text corresponding to the sample image;
respectively extracting sample text features and sample visual features from the sample description text and the sample image;
Obtaining sample semantic guidance features according to the sample text features and the sample visual features;
adding sample noise into the sample image to obtain a noisy sample image;
noise prediction is carried out on the noise-added sample image through the sample semantic guidance features, so that predicted sample noise is obtained;
and according to the loss value between the predicted sample noise and the sample noise, adjusting the model parameters to be adjusted of the image generation model to be trained to obtain a trained image generation model, wherein the trained image generation model is used for generating a target image.
It should be noted that the sample image, sample description text, sample text features, sample visual features, sample semantic guidance features, sample noise, noisy sample image, and predicted sample noise in the embodiments of the present application are, respectively, the image, description text, text features, visual features, semantic guidance features, noise, noisy image, and predicted noise used to train the image generation model to be trained.
The sample image is an image used to train the image generation model, and the sample description text is text that can describe what the sample image shows. In the embodiments of the present application, the sample image and the sample description text form a related image-text pair; that is, they are an image and a text that are related to each other. For example, the sample image may show a yellow puppy standing on green grass with its tongue out as if it were smiling, and the corresponding sample description text may be "a yellow puppy smiling on the grass". The sample noise may be random noise.
For example, the text encoder and the visual encoder in the image generation model to be trained may be used to extract the sample text features from the sample description text and the sample visual features from the sample image, respectively. The sample text features and the sample visual features are combined to obtain the sample semantic guidance features. Sample noise such as Gaussian noise is added to the sample image to generate the noisy sample image. The sample semantic guidance features are then used to guide and constrain the noise prediction network in the image generation model to be trained when performing noise prediction. Specifically, the noise prediction network may perform multi-scale feature sampling on the hidden vector of the noisy sample image to obtain multi-scale sample sampling features; for the sample sampling features of any scale, cross-attention processing is performed on the sample semantic guidance features and the sample sampling features to obtain sample attention features, and the predicted sample noise is obtained from the sample attention features. For the specific processing procedure and principle of each network in the image generation model to be trained, reference may be made to the corresponding content of the foregoing steps, which is not repeated here.
After obtaining the predicted sample noise, a combination of one or more of a mean square error loss function, a cross entropy loss function, and a perceptual loss function may be used to calculate a loss value between the predicted sample noise and the sample noise, and an optimization algorithm, such as a gradient descent method or an evolutionary algorithm, may be used to update the model parameters to be adjusted by minimizing the loss value. The optimization algorithm iteratively adjusts the model parameters to be adjusted based on gradient information of the objective function (loss value) about the model parameters to be adjusted until the objective of minimizing the loss value is reached. The image generation model corresponding to the adjusted model parameters is the trained image generation model.
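A hedged sketch of one training step of this procedure is given below; the component names, the simplified noising step, and the plain MSE objective are illustrative assumptions rather than the exact implementation of this application.

```python
import torch
import torch.nn.functional as F

def training_step(batch, text_encoder, visual_encoder, noise_predictor, optimizer):
    sample_image, sample_text = batch                      # one sample image and its description text
    text_feat = text_encoder(sample_text)                  # sample text features
    visual_feat = visual_encoder(sample_image)             # sample visual features
    guidance = torch.cat([text_feat, visual_feat], dim=1)  # sample semantic guidance features (combination)

    sample_noise = torch.randn_like(sample_image)          # random (e.g. Gaussian) sample noise
    noisy_image = sample_image + sample_noise              # noise-added sample image (simplified noising)

    predicted_noise = noise_predictor(noisy_image, guidance)
    loss = F.mse_loss(predicted_noise, sample_noise)       # loss between predicted sample noise and sample noise

    optimizer.zero_grad()
    loss.backward()                                        # adjust the model parameters to be adjusted
    optimizer.step()
    return loss.item()
```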
In some embodiments, the reference image may be any sample image in the training sample set, so that the quality of the generated image is improved by using the information of the training sample set.
In actual practice, the sample description text corresponding to a sample image may be generated using a pre-trained neural network model, which may include, but is not limited to, a Show and Tell model, a bottom-up and top-down attention (Up-Down) model, a stacked generative adversarial network (StackGAN) model, an Image Transformer model, and the like.
In some implementations, the relationship between the image and the text can be captured by an attention mechanism, as well as capturing long distance semantic dependencies and context information, to obtain more accurate and fluent sample descriptive text of the sample image. Specifically, a sample description text corresponding to a sample image is obtained through the following steps:
extracting image features to be processed from a sample image;
taking the sample description text sequence containing the start mark as a text sequence to be processed;
performing attention calculation on the image characteristics to be processed and the text sequences to be processed to obtain attention weights;
determining the generation probability of the next word in the sample description text sequence according to the attention weight;
Determining the next word in the sample description text sequence according to the generation probability of the next word so as to obtain a current sample description text sequence;
and taking the current sample description text sequence as the text sequence to be processed, and returning to the step of performing attention calculation on the image features to be processed and the text sequence to be processed to obtain the attention weights, together with the subsequent steps, until an end mark is generated, at which point the current sample description text sequence is taken as the sample description text corresponding to the sample image.
Wherein the start tag refers to a tag for indicating the start of generating a sample description text sequence, such as "< start >". The end mark refers to a mark for indicating the end of generating the sample description text sequence. The end mark may be task-defined or may be determined by a pre-training process for the decoder. For example, the decoder may be trained using a training data set comprising an image and a sample description text corresponding to the image, which sample description text typically contains a specific end marker, such as "< end >" or "</s >", and the decoder is instructed to stop generating words when this specific end marker appears in the sequence of sample description text generated by the decoder.
For example, an image encoder may be used to extract the image features of the sample image; that is, the image encoder encodes the sample image into a global image feature A (i.e., the image features to be processed). The image encoder may be any encoder usable for image encoding, such as a convolutional neural network (CNN), a recurrent neural network (RNN), or an autoencoder (Autoencoder). The image feature A is then decoded by a decoder, which may be a cross-attention (Cross-Attention) based decoder, to obtain the sample description text of the sample image. Specifically, the decoder takes the text sequence to be processed {<start>}, computes cross-attention between the image feature A and {<start>} to obtain attention weights, and classifies the attention weights to obtain the generation probability of the next word after <start>. The word B with the highest probability is selected from the vocabulary as the next word, giving the current sample description text sequence {<start>, B}. {<start>, B} is then used as the text sequence to be processed, and the decoder processes the image feature A and {<start>, B} to predict the word after B, and so on, until the predicted word is the end mark <end>. In the resulting sample description text sequence {<start>, B, C, …, N, <end>}, the word sequence excluding the start mark and the end mark, namely B, C, …, N, is taken as the sample description text of the sample image.
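As an illustration only, the greedy decoding loop described above could be sketched as follows; the encoder and decoder objects, their call signatures, and the vocabulary are assumed placeholders.

```python
import torch

@torch.no_grad()
def generate_caption(image, image_encoder, decoder, vocab, start_id, end_id, max_len=30):
    image_feat = image_encoder(image)            # global image feature A (image features to be processed)
    sequence = [start_id]                        # text sequence to be processed, starting with <start>
    for _ in range(max_len):
        tokens = torch.tensor(sequence).unsqueeze(0)
        logits = decoder(image_feat, tokens)     # cross-attention between image feature and text sequence
        next_id = int(logits[0, -1].argmax())    # word with the highest generation probability
        if next_id == end_id:                    # stop once the end mark is generated
            break
        sequence.append(next_id)
    return [vocab[i] for i in sequence[1:]]      # drop the start mark; the end mark was never appended
```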
In some implementations, the sample description text of a sample image can be generated using a BLIP (Bootstrapping Language-Image Pre-training) model for unified vision-language understanding and generation. The BLIP model includes an Image Encoder, a Text Encoder, an Image-grounded Text Encoder, and an Image-grounded Text Decoder. Pre-training may be performed on a number of training datasets of images and the sample description texts corresponding to the images to obtain a pre-trained BLIP model. During training, the image encoder and the text encoder encode the image and the sample description text respectively; the image-grounded text encoder judges whether the sample description text and the image express the same meaning, and the image-grounded text decoder generates the corresponding description from the picture. In this way, with the pre-trained BLIP model, the sample image may be input into the Image Encoder to extract the image features of the sample image (i.e., the image features to be processed); the image-grounded text decoder receives the image features from the image encoder as input and, starting from a start tag as the initial input, gradually generates the description sequence. At each time step it calculates attention weights from the currently generated text sequence and the image features; these weights represent the correlation between the image and the text and are used to guide the generation of the next word, until a specific end tag is generated, indicating the end of the description. Finally, the description sequence output by the image-grounded text decoder is taken as the sample description text corresponding to the sample image.
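In practice, an off-the-shelf BLIP captioner can be used for this step. The sketch below assumes the Hugging Face transformers implementation and a public image-captioning checkpoint; these names are assumptions of this illustration rather than part of the application.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed checkpoint name; any BLIP image-captioning checkpoint would do.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

sample_image = Image.open("sample.jpg").convert("RGB")
inputs = processor(images=sample_image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)  # sample description text
print(caption)
```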
In some implementations, the sample description text includes a parent description text. The parent description text is a general description of the subject of the sample image, i.e., a summary description corresponding to the task type. For example, when applied to a person image generation task, the parent description text may be a general description of the person in the sample image, such as "a young man in a tank top holding a basketball"; when applied to an animal image generation task, the parent description text may be a general description of the animal in the sample image, such as "A little yellow dog is smiling on the grass".
During the training of the image generation model to be trained, the model parameters to be adjusted may include the parameters of all or part of the network structure of the image generation model. For example, they may include the parameters of one or more of the visual coding network, the image coding network, the text coding network, the adjustment network, the noise prediction network, and the image decoding network. For instance, when training the image generation model, the parameters of the image coding network and the image decoding network, which are formed from pre-trained structures such as a pre-trained image encoder and image decoder, may be fixed and not adjusted, so that only the parameters of the networks to be trained or of some key networks are adjusted. This makes the model easier to converge and achieves a better generation effect, while reducing the training time and computing resource consumption of the whole model, thereby improving training efficiency.
In some embodiments, the model parameters to be adjusted may include at least one of parameters for extracting visual features and parameters for noise prediction, where the parameters for extracting visual features include the parameters of at least one of the visual coding network and the adjustment network, and the parameters for noise prediction include the parameters of the noise prediction network. By adjusting the network parameters used to extract visual features, the appearance and detail information added to the text is optimized, so that the trained image generation model can restore the subject's appearance and the quality of the generated image is improved. By adjusting the network parameters of the noise prediction network, the noise prediction network can better capture the noise in the image, producing more accurate restoration results and improving the quality of the generated image.
In some embodiments, the parameters of the visual coding network and the text coding network may be fixed during the training of the image generation model to be trained, i.e., the visual encoder and the text encoder are not trained. Visual and text encoders with fixed parameters provide more stable and reliable feature inputs for the model, so the model converges more easily and achieves a better generation effect, while the training time and computing resource consumption of the whole model are reduced. For example, when the visual encoder and the text encoder of a CLIP model are used in the image generation model to be trained to extract the visual features and the text features respectively, because CLIP has already performed cross-modal learning of the relationship between images and text during pre-training, the parameters of the visual encoder and the text encoder can be fixed (i.e., not adjusted) during the training of the image generation model to be trained, and a better training effect can still be obtained.
In some embodiments, in order to optimize the appearance and detail information added to the text while keeping the visual encoder fixed, the network parameters of the adjustment network used to adjust the visual features, i.e., the feature adjustment parameters, may be taken as model parameters to be adjusted; that is, the model parameters to be adjusted include the feature adjustment parameters.
In some embodiments, the model parameters to be adjusted include noise prediction parameters for noise predicting the noisy sample image, the noise prediction parameters including at least one of a cross-attention parameter for cross-attention processing the sample semantic guidance feature and the sample sampling feature, and a linear transformation parameter for linear transforming the sample attention feature. That is, the cross attention parameter is a parameter of the cross attention layer, and the linear transformation parameter is a parameter of the linear transformation layer after the cross attention layer.
In some embodiments, the model parameters to be adjusted include only the feature adjustment parameters, the cross-attention parameters, and the linear transformation parameters, so that only the parameters of a few key structures are adjusted. This makes the model easier to converge and yields a better generation effect, while reducing the training time and computing resource consumption of the whole model, thereby improving training efficiency.
In some embodiments, because the parameters involved in noise prediction are large in scale and consume a large amount of computing resources, the large number of model parameters to be adjusted in the noise prediction process can be decomposed into fewer parameters during training, which greatly reduces the number of parameters to be adjusted and improves training efficiency. Specifically, before adjusting the model parameters to be adjusted of the image generation model to be trained according to the loss value between the predicted noise and the sample noise to obtain the trained image generation model, the method further includes:
acquiring a noise prediction parameter, wherein the noise prediction parameter is used for carrying out noise prediction on the noise-added sample image;
splitting the noise prediction parameters into fixed parameters and parameters to be adjusted;
and taking the parameters to be adjusted as the parameters of the model to be adjusted.
For example, for a parameter matrix of the noise prediction parameters, let its initial parameter matrix before the model parameters to be adjusted of the image generation model are adjusted be W_0. During adjustment, the noise prediction parameters can be split and expressed as W_0 + ΔW, where ΔW denotes the updated portion of the noise prediction parameters. Since the initial parameter matrix of the noise prediction parameters is fixed, the noise prediction parameters are split into the fixed parameter W_0 and the parameter to be adjusted ΔW, with ΔW = 0 before adjustment. In this way, the fixed portion of the noise prediction parameters can be frozen during training, and only the parameter ΔW to be adjusted is trained, reducing the number of parameters to be adjusted. The adjusted image generation parameters are obtained from the fixed parameter W_0 plus the adjusted parameter ΔW', and the image generation model is updated with the adjusted image generation parameters to obtain the trained image generation model.
In some embodiments, the low-rank decomposition can be performed on the model parameters to be adjusted, so as to compress and reduce the dimension of the parameters of the model parameters to be adjusted, and reduce the data storage space and the calculation cost in the adjustment process. Specifically, taking the parameters to be adjusted as the parameters of the model to be adjusted includes:
performing low-rank decomposition on the model parameters to be adjusted to obtain a plurality of low-rank parameter matrixes;
and taking the plurality of low-rank parameter matrixes as model parameters to be adjusted.
Low-rank decomposition refers to a method of decomposing a matrix into low-rank matrices. Through low-rank decomposition, a matrix can be represented as the product of two matrices, one with a small number of columns and the other with a small number of rows, and the product of the two small matrices is a low-rank approximation of the large matrix.
For example, by low-rank decomposition the model parameter ΔW to be adjusted can be expressed as the product of two matrices B and A, i.e., W_0 + ΔW = W_0 + BA, where B is a d×r matrix, A is an r×k matrix, and r < min(d, k). Thus, if W_0 is a d×k matrix, the sizes of matrix B and matrix A are only d×r and r×k; clearly, the number of parameters in the decomposed matrices B and A is far smaller than that of the model parameter ΔW to be adjusted, since r can be far smaller than d. In theory, the smaller the rank r of the product BA, the smaller the number of model parameters to be adjusted. In this way, the input-output relation corresponding to the model parameter to be adjusted changes from h = W_0 * x + ΔW * x to h = W_0 * x + BA * x, where x denotes the input and h denotes the output.
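A minimal sketch of this low-rank splitting in the spirit of LoRA, assuming the frozen weight is an ordinary linear layer; the zero initialization of B (so that ΔW = 0 before adjustment) follows the convention described in this application, while the small Gaussian initialization of A is an illustrative choice.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    # Wraps a frozen linear layer W0 and adds a trainable low-rank update BA (ΔW = B @ A).
    def __init__(self, frozen_linear: nn.Linear, rank: int):
        super().__init__()
        self.frozen = frozen_linear
        for p in self.frozen.parameters():
            p.requires_grad = False                            # fixed parameter W0 is frozen
        d_out, d_in = frozen_linear.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # r x k matrix
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # d x r matrix, zero-initialized so ΔW = 0 at start

    def forward(self, x):
        # h = W0 * x + B * A * x : only B and A are parameters to be adjusted
        return self.frozen(x) + x @ self.A.t() @ self.B.t()
```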
In some embodiments, when the noise prediction parameters include the cross-attention parameters, the noise prediction parameters include the query parameter matrix W_Q, the key parameter matrix W_K and the value parameter matrix W_V used in the cross-attention processing. For example, they can be expressed as W_Q = W_Q0 + B_1*A_1, W_K = W_K0 + B_2*A_2 and W_V = W_V0 + B_3*A_3 respectively, where W_Q0, W_K0 and W_V0 are fixed parameters and B_1, A_1, B_2, A_2, B_3, A_3 are parameters to be adjusted. The adjusted image generation parameters are obtained as W_Q0 + B'_1*A'_1, W_K0 + B'_2*A'_2 and W_V0 + B'_3*A'_3, where B'_1, A'_1, B'_2, A'_2, B'_3, A'_3 are the adjusted parameters.
In some embodiments, when the noise prediction parameters include the linear transformation parameters, the noise prediction parameters include the weight parameters of the linear transformation. For example, the noise prediction parameters include the weight matrix used by the FFN model for the linear transformation, and they may further include the bias vector of the linear transformation. For example, the weight matrix W may be expressed as W_0 + BA, and the adjusted image generation parameter is obtained as W_0 + B'A', where B' and A' are the adjusted parameters.
In some implementations, MSELoss (mean square error loss) may be used to calculate the loss value between the predicted sample noise and the sample noise. For example, the loss function ReconstructionLoss(x_i, y_i) = (x_i - y_i)^2 calculates the difference between the predicted sample noise and the corresponding sample noise and squares it to obtain the reconstruction loss value, where ReconstructionLoss denotes the reconstruction loss, x_i denotes the sample noise of sample image i, and y_i denotes the predicted sample noise corresponding to sample image i. When there are multiple sample images, the reconstruction loss values of the individual sample images may be summed or averaged to obtain an overall reconstruction loss value, which is used as the loss value between the predicted sample noise and the sample noise.
In some implementations, in order to restore the appearance of the subject, the embodiments of the present application append visual features after the text features as the semantic guidance features. However, since the text features and the visual features come from different feature spaces, their degrees of influence are inconsistent during noise prediction. For example, if the visual features are too weak, the purpose of adding more concrete and detailed semantic information to the text is not achieved and the quality of the image generated by the image generation model is poor; if the image features are too strong, the influence of the text features is weakened and the control effect is not achieved. Therefore, during training, the model parameters to be adjusted are additionally adjusted based on a loss value between the sample visual features and the sample text features, so that the visual features approach the feature space of the text features while preserving their own semantics. This adds more concrete and detailed semantic information to the text without weakening the controlling effect of the text features, restores the subject's appearance, and improves the quality of the images generated by the model. Specifically, adjusting the model parameters to be adjusted of the image generation model to be trained according to the loss value between the predicted sample noise and the sample noise to obtain the trained image generation model includes:
calculating a first loss value between the predicted sample noise and the sample noise, and calculating a second loss value between the sample visual features and the sample text features;
and adjusting model parameters to be adjusted of the image generation model to be trained by combining the first loss value and the second loss value to obtain the trained image generation model.
In practical applications, the loss value between the sample visual features and the sample text features may be calculated using a combination of one or more of a mean square error loss function, a cross entropy loss function, and a perceptual loss function.
In some embodiments, MSELoss (mean square error loss) may be used to calculate the loss value between the sample visual features and the sample text features. For example, the overall reconstruction loss value calculated from the loss function ReconstructionLoss(x_i, y_i) = (x_i - y_i)^2 is taken as the first loss value. A second loss value is then calculated based on the difference between the sample visual features and the sample text features. Specifically, the loss function TextImageEmbeddingLoss(x_i, y_i) = (x_i - y_i)^2 computes the difference between the sample visual feature and the corresponding sample text feature and squares it to obtain the text-image embedding loss value, where TextImageEmbeddingLoss denotes the text-image embedding loss, x_i denotes the sample visual feature of sample image i, and y_i denotes the sample text feature of the sample description text corresponding to sample image i. When there are multiple sample images, the text-image embedding loss values of the individual sample images may be summed or averaged to obtain an overall embedding loss value, which is used as the second loss value between the sample visual features and the sample text features. The first loss value and the second loss value may then be weighted and summed to obtain a total loss value, where the weights may be set according to experience or training effect, e.g., both set to 1. The model parameters to be adjusted are updated by minimizing the total loss value using an optimization algorithm such as gradient descent or an evolutionary algorithm, until the objective of minimizing the loss value is reached. The image generation model corresponding to the adjusted image generation parameters is the trained image generation model.
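A hedged sketch of this combined objective; the function name and the equal weighting are assumptions of this illustration.

```python
import torch.nn.functional as F

def total_loss(predicted_noise, sample_noise, visual_feat, text_feat, w1=1.0, w2=1.0):
    # First loss: reconstruction loss between predicted sample noise and sample noise.
    reconstruction_loss = F.mse_loss(predicted_noise, sample_noise)
    # Second loss: text-image embedding loss pulling the visual features toward the text feature space.
    embedding_loss = F.mse_loss(visual_feat, text_feat)
    return w1 * reconstruction_loss + w2 * embedding_loss
```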
In some embodiments, after the trained image generation model is obtained, it may be trimmed using a small amount of stylized image for generating an image that is consistent with the style of the stylized image. Specifically, the image generation method further includes:
acquiring a stylized image and an image generation model corresponding to the adjusted generation parameters;
And adjusting the model parameters to be adjusted of the image generation model corresponding to the adjusted generation parameters through the stylized image to obtain a stylized image generation model so as to use the stylized image generation model for generating the target image.
For example, a small number of stylized images, even a single one, may be input into the trained image generation model to adjust its model parameters to be adjusted, so that the re-trained image generation model can generate, according to the input text, a personalized image that fuses the style of the stylized image. This secondary training differs from the training of the image generation model to be trained only in the training data used; the training procedure and principle are the same, so reference may be made to the corresponding content of the training process of the image generation model to be trained, which is not repeated here.
The image generation scheme provided by the embodiment of the application can be applied to various image generation scenes. For example, taking an animal image generation task or a character image generation task as an example, acquiring a description text to be processed and a reference image; respectively extracting text features and visual features from the description text to be processed and the reference image; according to the text features and the visual features, semantic guidance features are obtained; adding reference noise into the reference image to obtain a noise-added image; noise prediction is carried out on the noise-added image through semantic guidance characteristics, so that prediction noise is obtained; based on the predicted noise, the noisy image is restored to generate a target image.
As can be seen from the above, in the image generation process the embodiments of the present application combine the visual features with the text features. Since the visual features contain the appearance and detail information of the reference image, they can add more concrete and detailed semantic information to the text, thereby enhancing the controlling force of the semantic information. Meanwhile, by introducing the visual features into the semantic guidance features, the noise prediction process pays attention not only to the text features but also to the appearance and detail features in the visual features, so that an image of the subject can be restored as faithfully as possible from the predicted noise, improving the quality of the generated image.
The method described in the above embodiments will be described in further detail below.
In this embodiment, a training method of an image generation model according to an embodiment of the present application will be described in detail taking a person image generation task as an example.
The training method of the image generation model in the embodiments of the present application can be implemented through the image generation model. The image generation model shown in fig. 2a includes a visual coding network, an image coding network, a text coding network, an adjustment network, a noise prediction network, and an image decoding network, where a low-rank decomposition module is introduced into the noise prediction network. The gray parts of the figure (i.e., the adjustment network, the visual features output by the adjustment network, and the low-rank decomposition module) are the trainable parts, and the other parts are not trainable.
The image generation method provided by the embodiments of the present application can be used to train an image generation model such as a Stable Diffusion model in the field of text-to-image generation, where text-to-image generation refers to controlling the generated image with an input text in a pre-trained diffusion model.
As shown in fig. 2b, an image generating method may be executed by an electronic device, and the specific procedure is as follows:
210. and acquiring a training sample set and an image generation model to be trained.
The training sample set includes at least one sample image and the sample description text corresponding to the sample image. Where the license or consent of the users or subjects related to the data is obtained, a public task dataset and a self-collected person dataset may be acquired to construct the sample images in the training sample set. For example, the MS-Celeb-1M dataset (a million-scale face dataset) may be selected as the public person dataset; it contains 100,000 persons, each corresponding to 100 images, and in order to maintain data diversity one image may be selected per person as a sample image.
The public person dataset mainly contains face images. To increase data diversity, images are also retrieved from a self-built gallery by keyword to construct the self-collected person dataset. For example, the keyword "street snap" may be used for retrieval, and low-quality images are then filtered out according to basic rules set on factors such as image size and quality; for instance, the basic rules may keep only images larger than 512×512 with an aesthetic score greater than 6.5. Compared with the public person dataset, the resulting self-collected dataset of about 10,000 images has stronger image diversity, including full-body and half-body images in multiple scenes.
Because of the high diversity of sample images in the training dataset, it contains person images of different types and different scenes. Corresponding sample description texts can be generated for the sample images in the training set through the BLIP model, which helps the pre-trained image generation model find the mapping relationship between the sample images and their description texts more accurately. For example, as shown in the sample description text acquisition process of fig. 2c, the BLIP model may generate the sample description text "a young man in a tank top holding a basketball" for the sample image in the figure.
The detailed descriptions generated by the BLIP model for the pictures mainly contain the subject parent class. In order to better represent the uniqueness of each picture and the characteristics of the picture to be substituted later, a placeholder for the subject subclass needs to be added to the description text to obtain the final description text. For example, the subject subclass placeholder may be specified as "s". Using string matching, the placeholder "s" is inserted before the word representing the subject parent class "person". Taking the subject parent text "a young man in a tank top holding a basketball" as an example, the text with the added subject subclass is "a young s man in a tank top holding a basketball".
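A minimal sketch of this placeholder insertion by string matching; the keyword list and placeholder token below are illustrative assumptions.

```python
# Insert the subject-subclass placeholder before the first matched parent-class keyword.
PARENT_KEYWORDS = ["man", "woman", "boy", "girl", "person"]  # assumed keyword list
PLACEHOLDER = "s"

def add_placeholder(caption: str) -> str:
    words = caption.split()
    for i, word in enumerate(words):
        if word.lower().strip(".,") in PARENT_KEYWORDS:
            return " ".join(words[:i] + [PLACEHOLDER] + words[i:])
    return caption  # no parent keyword found; leave the caption unchanged

print(add_placeholder("a young man in a tank top holding a basketball"))
# -> "a young s man in a tank top holding a basketball"
```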
220. Sample text features and sample visual features are extracted from the sample descriptive text and the sample image, respectively.
For example, the pre-trained image generation model may use the visual encoder and the text encoder of a CLIP model as the visual coding network and the text coding network, respectively. The input template of the text encoder is "photo of s person", which mainly contains two parts: the subject parent class and the subject subclass. "photo of person" is the subject parent description, responsible for describing the content of the picture and specifying the category of the subject; "s" is the subject subclass, representing the specific person in the current picture.
230. And obtaining sample semantic guidance features according to the sample text features and the sample visual features.
In the related art, the sample text features are generally used directly as the condition of the UNet model in the noise prediction network to guide the model to generate images, which suffers from poor controllability and poor fidelity to the subject's appearance. In the embodiments of the present application, the visual features produced by the visual encoder replace the text feature corresponding to "s" output by the text encoder. Furthermore, to enlarge the feature space covered by the picture encoding, a Concept Transform module consisting of a multi-layer trainable FFN model can be added after the non-trainable image encoder. By introducing image features into the text features, the quality of the person in the generated image, in particular the quality of the person's face, and the correlation between the generated image and the text can be enhanced.
When the method is applied to the scene of a person image generation task, suppose the description text of a sample image is "a young man in a tank top holding a basketball". The parent keywords of the task scene can be matched one by one against the words in the description text to find the parent keyword "man". The description text is encoded to obtain text features A1A2…Ai…Am, where Ai denotes the feature corresponding to "man". The visual features B1B2…Bi…Bm of the sample image can then be added after the feature Ai, giving the semantic guidance features A1A2…AiB1B2…Bi…Bm…Am shown in fig. 2a. During training, since the adjustment network is also trained, the visual features output by the adjustment network are continuously adjusted as well (i.e., the adjustment network is trainable).
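As an illustration only, inserting the visual features after the parent-keyword feature could be sketched as follows; the tensor shapes and names are assumptions.

```python
import torch

def build_semantic_guidance(text_feats, visual_feats, keyword_index):
    # text_feats: (m, d) token features A1..Am from the text encoder
    # visual_feats: (n, d) visual features B1..Bn from the visual encoder / adjustment network
    # keyword_index: position i of the parent-keyword feature Ai (e.g. the feature for "man")
    return torch.cat(
        [text_feats[: keyword_index + 1],   # A1 .. Ai
         visual_feats,                      # B1 .. Bn inserted after Ai
         text_feats[keyword_index + 1:]],   # Ai+1 .. Am
        dim=0,
    )
```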
240. And adding sample noise into the sample image to obtain a noisy sample image.
250. And carrying out noise prediction on the noise-added sample image through sample semantic guidance characteristics to obtain predicted sample noise.
For example, the fused text and image features (i.e., semantic guidance features) can be used as noise prediction conditions of a Unet (U-shaped network) model to better guide the Unet model to generate high-quality images.
In the embodiments of the present application, a low-rank adaptation (LoRA) model may be used as the low-rank decomposition module. LoRA is a training technique for fine-tuning diffusion models: it introduces a new concept by making small modifications to the original model. In the embodiments of the present application, the cross-attention layers in the Unet model and the FFN layers after the cross-attention layers are fine-tuned through the LoRA model, so that the whole model does not need to be fine-tuned; fine-tuning only this part is enough to achieve good performance. The cross-attention layer is the layer where image and text interact.
Specifically, the LoRA model adds trainable parameters mainly at the cross-attention layers in the Unet model and the FFN layers after the cross-attention layers, and the parameter increment is turned into fewer trainable parameters through matrix decomposition, which greatly reduces the number of parameters that need to be trained for fine-tuning. The matrix decomposition process shown in fig. 2d corresponds to the formula h = W_0 * x + ΔW * x = W_0 * x + BA * x, where x denotes the input, h denotes the output, x and h are d-dimensional, W_0 denotes the original parameters of the model (i.e., the fixed parameters), and ΔW denotes the parameters of the inserted layer (i.e., the parameters to be adjusted). To reduce the parameter count of the inserted layer as much as possible and improve efficiency, the parameter ΔW to be adjusted is decomposed into the product of two low-rank parameter matrices B and A, with B initialized to 0 and A initialized from N(0, σ^2). If W_0 is a d×d matrix, the sizes of the A and B matrices can be reduced to d×r, where r can be far smaller than d. In theory, the smaller the rank r of the product BA, the smaller the number of parameters of the inserted layer.
260. And adjusting model parameters to be adjusted of the image generation model to be trained according to the predicted sample noise and the loss value between the sample noise, so as to obtain the trained image generation model.
The loss function is used to evaluate the degree to which the model's predicted values differ from the true values; in general, the better the loss function, the better the model's performance. The loss function used in the embodiments of the present application is the reconstruction loss, specifically ReconstructionLoss(x_i, y_i) = (x_i - y_i)^2, which is essentially MSELoss. Its purpose is to make the input noise x_i under the text condition and the noise y_i predicted by the model as close as possible, thereby establishing the mapping between the text and the picture.
In order to improve how faithfully the text renders the subject's appearance, the embodiments of the present application introduce visual features into the text features. Since the two kinds of features come from different feature spaces, passing them to the Unet model easily causes inconsistent degrees of influence. This mainly manifests in two extreme cases: if the image features are too weak, the goal of improving the person's appearance through the text is not achieved and the quality of the person generated by the model is poor; if the image features are too strong, the influence of the other words in the text is weakened, the control effect is not achieved, and the model tends to reproduce images from the original training data. Therefore, during the training of the image generation model to be trained, the embodiments of the present application not only compute the loss function required by the model but also add an additional loss function between the text features and the picture features, so that the picture features and the text features lie in the same feature space. Specifically, the loss function is TextImageEmbeddingLoss(x_i, y_i) = (x_i - y_i)^2, which is essentially MSELoss; its purpose is to make the picture features approach the text feature space while preserving their own semantics.
270. And adjusting model parameters to be adjusted of the trained image generation model through the stylized image to obtain the stylized image generation model so as to use the stylized image generation model for generating the target image.
In the embodiments of the present application, training data can be constructed from a large number of image-text pairs to pre-train the image generation model, which is guided by text to generate images of better quality. The model parameters are first initialized, and the model is pre-trained on the constructed training data. During training, each round saves the feature adjustment parameters and the parameters to be adjusted for subsequent fine-tuning of the model. In order to select a better pre-trained model, evaluation indexes including the loss function and the similarity between the image features and text features output by the image coding network and the text coding network are used, and a model with a low loss and high text-image similarity is selected as the base model for subsequent fine-tuning, i.e., the pre-trained image generation model.
Specifically, the model fine-tuning process may be as follows: the original model parameters of the image generation model to be trained are loaded together with the adjusted model parameters to be adjusted obtained in step 260. The foregoing steps 220-260 may then be repeated using a person image, such as a high-quality portrait, as the sample image to further fine-tune the adjusted model parameters so that the model learns the person's appearance. On the basis of the loaded image generation model, the model can be fine-tuned on a GPU A100 (graphics processor A100); embedding the person image into the pre-trained image generation model takes only about 30 seconds. After fine-tuning, the image features representing the person and the adjusted model parameters are obtained, and they can be plugged directly into a designated style model to generate a high-quality personalized person image with a fused style, as shown in fig. 2e.
As can be seen from the above, the embodiments of the present application use the encoded visual features of the sample to replace the placeholder in the original text features; after replacement, the image features are not confused with the other text features, so a mapping relationship with the picture can be established directly. When the text becomes longer, the control capability of this part is not weakened thanks to the precise mapping. The visual features thus provide more appearance information for the model and improve the quality of the persons it generates. In addition, the embodiments of the present application pre-train the image generation model on a person training dataset on the order of 100,000 images and then fine-tune it, which improves both the fine-tuning speed and the quality of the persons generated by the model. Experiments show that, with only one person picture, the scheme of the embodiments of the present application can finish fine-tuning within 30 seconds on a GPU A100.
The method described in the above embodiments will be described in further detail below.
In this embodiment, an image generation method according to an embodiment of the present application will be described in detail by taking a personalized applet head generation task as an example.
As shown in fig. 3, an image generating method may be executed by an electronic device, and the specific procedure is as follows:
310. And acquiring the to-be-processed description text input by the user and the reference image uploaded by the user.
For example, the image generation method provided in the embodiments of the present application may be implemented by an image generation applet. It should be noted that, during the training of the image generation model to be trained, the self-collected person dataset includes self-portraits uploaded by users to the applet's self-built gallery. Where the user's permission or consent is obtained, the to-be-processed description text A input by user A through the image generation applet can be acquired, and one of the self-portraits uploaded by user A to the applet's self-built gallery can be used as the reference image A. Alternatively, user A may select one of the uploaded self-portraits as the reference image A. The trained image generation model may then be invoked in response to a generation request produced by the user's image generation operation.
320. And respectively extracting text features and visual features from the description text to be processed and the reference image.
For example, the text a to be processed may be input into a text encoding network of a trained image generation model, and the text features A1A2 … Aj … Am may be extracted. The reference image A is input into a visual coding network and an adjusting network of a trained image generation model, and visual characteristics B1B2 … Bi … Bm are extracted, wherein Aj represents characteristics corresponding to placeholders s.
330. And obtaining semantic guidance features according to the text features and the visual features.
For example, the visual feature may be substituted for the feature corresponding to the placeholder "s" in the text feature to obtain the semantic guidance feature A1A2 … B1B2 … Bi … Bm … Am.
340. An initial hidden vector is extracted from the reference image.
For example, the reference image a may be input into an image coding network of a trained image generation model, and the hidden vector (i.e., the initial hidden vector) of the reference image a is extracted.
350. And adding sample noise into the initial hidden vector to obtain a noise-added hidden vector of the noise-added image.
For example, random noise may be added to the initial hidden vector to obtain a noise-added hidden vector.
360. And carrying out noise prediction on the noise-added hidden vector through semantic guidance characteristics to obtain prediction noise.
For example, the noise-added hidden vector and the semantic guidance feature may be input into a noise prediction network of the trained image generation model, and the noise-added hidden vector may be subjected to noise prediction by the semantic guidance feature-guided noise prediction network to obtain the prediction noise.
370. And repairing the noisy image based on the predicted noise to generate a personalized head portrait.
For example, the prediction noise may be subtracted from the noise-added hidden vector to obtain the target hidden vector. The image decoding network of the trained image generation model then decodes the target hidden vector to obtain the personalized avatar, i.e., the target image.
As can be seen from the above, in the personalized applet avatar generation task, the embodiments of the present application can provide more concrete information for the model through the visual features corresponding to the reference image, so as to provide users with personalized avatar customization quickly and with high quality, improving the user experience.
In order to better implement the method, the embodiment of the application also provides an image generation device.
As shown in fig. 4, the image generating apparatus may include an acquisition unit 410, an extraction unit 420, a combining unit 430, a noise adding unit 440, a prediction unit 450, and a generating unit 460, as follows:
first acquisition unit 410
And the method is used for acquiring the description text to be processed and the reference image.
(two) extraction Unit 420
For extracting text features and visual features from the descriptive text to be processed and the reference image, respectively.
In some embodiments, the extraction unit includes a first visual extraction subunit, a second visual extraction subunit, and a third visual extraction subunit, comprising:
a first visual extraction subunit for extracting initial visual features from the reference image;
the second vision extraction subunit is used for acquiring the characteristic adjustment parameters;
and the third visual extraction subunit is used for obtaining the visual characteristics according to the characteristic adjustment parameters and the initial visual characteristics.
(III) combining unit 430
The semantic guidance feature is obtained according to the text feature and the visual feature.
In some embodiments, the image generation apparatus further comprises an adding unit, comprising:
the adding unit is used for adding placeholders at adjacent positions of the parent keywords in the description text to be processed;
in some embodiments, the combination unit includes a replacement subunit comprising:
and the replacing subunit is used for replacing the feature corresponding to the placeholder in the text feature with the visual feature to obtain the semantic guidance feature.
(IV) noise adding unit 440
And the method is used for adding the reference noise into the reference image to obtain a noise-added image.
(five) prediction unit 450
The method is used for carrying out noise prediction on the noise-added image through semantic guidance characteristics to obtain prediction noise.
In some embodiments, the prediction unit includes a sampling subunit, a cross-attention subunit, and an attention prediction subunit, including:
the sampling subunit is used for carrying out multi-scale feature sampling on the noise-added image to obtain multi-scale sampling features;
the cross attention subunit is used for carrying out cross attention processing on the semantic guidance feature and the sampling feature aiming at the sampling feature of any scale to obtain an attention feature;
An attention prediction subunit for deriving a prediction noise from the attention feature.
In some embodiments, the cross-attention subunit includes a first transformation subunit and a weighting subunit, including:
the first transformation subunit is used for carrying out linear transformation on the semantic guidance feature aiming at the sampling feature of any scale to obtain a key vector and a value vector, and carrying out linear transformation on the sampling feature to obtain a query vector;
and the weighting subunit is used for carrying out attention weighting on the value vector through the query vector and the key vector to obtain attention characteristics.
In some embodiments, the attention prediction subunit includes a second transform subunit and a transform prediction subunit, comprising:
the second transformation subunit is used for carrying out linear transformation on the attention characteristic to obtain a transformed characteristic;
and a transform predictor unit for deriving a prediction noise from the transformed features.
Sixth generation unit 460
For repairing the noisy image based on the prediction noise to generate a target image.
In some embodiments, the image generating apparatus further includes a training unit including a training acquisition subunit, a training extraction subunit, a training combining subunit, a training noise adding subunit, a training prediction subunit, and a training adjustment subunit, including:
The training acquisition subunit is used for acquiring a training sample set and an image generation model to be trained, wherein the training sample set comprises at least one sample image and a sample description text corresponding to the sample image;
the training extraction subunit is used for respectively extracting sample text features and sample visual features from the sample description text and the sample image;
the training combination subunit is used for obtaining sample semantic guidance characteristics according to the sample text characteristics and the sample visual characteristics;
the training noise adding subunit is used for adding sample noise into the sample image to obtain a noise adding sample image;
the training prediction subunit is used for carrying out noise prediction on the noise-added sample image through sample semantic guide characteristics to obtain predicted sample noise;
the training adjustment subunit is used for adjusting the model parameters to be adjusted of the image generation model to be trained according to the loss value between the predicted sample noise and the sample noise to obtain a trained image generation model, and the trained image generation model is used for generating a target image.
In some embodiments, the image generating apparatus further includes a text generating unit including a first text generating sub-unit, a second text generating sub-unit, a third text generating sub-unit, a fourth text generating sub-unit, a fifth text generating sub-unit, and a sixth text generating sub-unit, including:
A first text generation subunit, configured to extract image features to be processed from a sample image;
a second text generation subunit, configured to take a descriptive text sequence including a start tag as a text sequence to be processed;
the third text generation subunit is used for carrying out attention calculation on the image characteristics to be processed and the text sequences to be processed to obtain attention weights;
a fourth text generation subunit, configured to determine, according to the attention weight, a generation probability describing a next word in the text sequence;
a fifth text generation subunit, configured to determine, according to the generation probability of the next word, the next word in the descriptive text sequence, so as to obtain a current descriptive text sequence;
and the sixth text generation subunit is used for taking the current description text sequence as a text sequence to be processed, returning to the execution step, carrying out attention calculation on the characteristics of the image to be processed and the text sequence to be processed to obtain attention weight, and carrying out the subsequent steps until an end mark is generated, wherein the current description text sequence is taken as a sample description text corresponding to the sample image.
In some embodiments, the image generating apparatus further includes a parameter determining unit including a parameter acquiring subunit, a parameter decomposing subunit, and a parameter determining subunit, including:
The parameter acquisition subunit is used for acquiring noise prediction parameters, and the noise prediction parameters are used for carrying out noise prediction on the noise-added sample image;
the parameter splitting subunit is used for splitting the noise prediction parameters into fixed parameters and parameters to be adjusted;
and the parameter determination subunit is used for taking the parameter to be adjusted as the model parameter to be adjusted.
In some embodiments, the parameter determination subunit includes a decomposition subunit and a determination subunit, including:
the decomposition subunit is used for carrying out low-rank decomposition on the parameters to be adjusted to obtain a plurality of low-rank parameter matrixes;
and the determining subunit is used for taking the plurality of low-rank parameter matrixes as model parameters to be adjusted.
In some embodiments, the training adjustment subunit includes a loss calculation subunit and a loss adjustment subunit, comprising:
a loss calculation subunit for calculating a first loss value between the predicted sample noise and the sample noise, and calculating a second loss value between the sample visual feature and the sample text feature;
and the loss adjustment subunit is used for adjusting the model parameters to be adjusted of the image generation model to be trained by combining the first loss value and the second loss value to obtain the trained image generation model.
In specific implementations, the above units may each be implemented as an independent entity, or may be combined arbitrarily and implemented as one entity or several entities; for the specific implementation of each unit, reference may be made to the foregoing method embodiments, and details are not repeated here.
As described above, the image generating apparatus of this embodiment includes an acquisition unit, an extraction unit, a combination unit, a noise adding unit, a prediction unit, and a generation unit. The acquisition unit is used for acquiring the description text to be processed and the reference image; the extraction unit is used for respectively extracting text features and visual features from the description text to be processed and the reference image; the combination unit is used for obtaining semantic guidance features according to the text features and the visual features; the noise adding unit is used for adding reference noise into the reference image to obtain a noise-added image; the prediction unit is used for carrying out noise prediction on the noise-added image through the semantic guidance features to obtain prediction noise; and the generation unit is used for repairing the noise-added image based on the prediction noise so as to generate a target image.
Therefore, the visual features can be combined with the text features in the image generation process. Because the visual features carry the appearance and detail information of the reference image, they add more concrete and detailed semantic information to the text, which strengthens the control exerted by the semantic information. Meanwhile, because the visual features are introduced into the semantic guidance features, the noise prediction process attends not only to the text features but also to the concrete and detailed information carried by the visual features, so that a vivid image can be restored as faithfully as possible from the predicted noise, improving the quality of the generated image.
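How the prediction unit can let the noise prediction attend to the semantic guidance features is sketched below: queries come from features sampled from the noise-added image, while keys and values come from the semantic guidance features. The single-head form and the dimensions are simplifying assumptions, not the exact network of this application.

```python
# Illustrative sketch only: cross-attention in which noise prediction for the
# noise-added image is guided by the semantic guidance features.
import torch
import torch.nn as nn

class GuidedCrossAttention(nn.Module):
    def __init__(self, d_img: int, d_guide: int, d_attn: int = 256):
        super().__init__()
        self.to_q = nn.Linear(d_img, d_attn)    # query from the sampled image features
        self.to_k = nn.Linear(d_guide, d_attn)  # key from the semantic guidance features
        self.to_v = nn.Linear(d_guide, d_attn)  # value from the semantic guidance features

    def forward(self, img_feats, guide_feats):
        # img_feats:   (N, d_img)   features sampled from the noise-added image
        # guide_feats: (M, d_guide) semantic guidance features (text + visual)
        q, k, v = self.to_q(img_feats), self.to_k(guide_feats), self.to_v(guide_feats)
        weights = (q @ k.T / k.shape[-1] ** 0.5).softmax(dim=-1)   # (N, M) attention weights
        return weights @ v   # attention features from which the prediction noise is derived
```

In a setup like this, the attention features would then be linearly transformed and combined across feature scales to obtain the prediction noise used to repair the noise-added image.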
An embodiment of the present application further provides an electronic device, which may be a terminal, a server, or another device.
In this embodiment, a detailed description is given by taking the electronic device being a server as an example. Fig. 5 shows a schematic structural diagram of the server according to an embodiment of the present application. Specifically:
the server may include a processor 510 with one or more processing cores, a memory 520 with one or more computer-readable storage media, a power supply 530, an input module 540, a communication module 550, and other components. Those skilled in the art will appreciate that the server structure shown in fig. 5 does not constitute a limitation on the server; the server may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Wherein:
the processor 510 is the control center of the server. It connects the various parts of the entire server using various interfaces and lines, and performs the various functions of the server and processes data by running or executing the software programs and/or modules stored in the memory 520 and calling the data stored in the memory 520. In some embodiments, the processor 510 may include one or more processing cores; in some embodiments, the processor 510 may integrate an application processor, which mainly handles the operating system, user interfaces, applications, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 510.
The memory 520 may be used to store software programs and modules, and the processor 510 performs various functional applications and data processing by running the software programs and modules stored in the memory 520. The memory 520 may mainly include a program storage area and a data storage area. The program storage area may store the operating system, the application programs required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the server, and the like. In addition, the memory 520 may include high-speed random access memory, and may further include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 520 may further include a memory controller to provide the processor 510 with access to the memory 520.
The server further includes a power supply 530 that supplies power to the various components. In some embodiments, the power supply 530 may be logically connected to the processor 510 through a power management system, so that functions such as charging, discharging, and power consumption management are performed through the power management system. The power supply 530 may further include components such as a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.
The server may also include an input module 540, which input module 540 may be used to receive entered numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The server may also include a communication module 550, and in some embodiments the communication module 550 may include a wireless module, through which the server may wirelessly transmit over short distances, thereby providing wireless broadband internet access to the user. For example, the communication module 550 may be used to assist a user in e-mail, browsing web pages, accessing streaming media, and the like.
Although not shown, the server may further include a display unit or the like, which is not described herein. In this embodiment, the processor 510 in the server loads executable files corresponding to the processes of one or more application programs into the memory 520 according to the following instructions, and the processor 510 executes the application programs stored in the memory 520, so as to implement the steps in the methods of the embodiments of the present application.
As can be seen from the above, in the image generation process, the embodiments of the present application combine the visual features with the text features. Because the visual features carry the appearance and detail information of the reference image, they add more concrete and detailed semantic information to the text, which strengthens the control exerted by the semantic information. Meanwhile, because the visual features are introduced into the semantic guidance features, the noise prediction process attends not only to the text features but also to the concrete and detailed information carried by the visual features, so that a vivid image can be restored as faithfully as possible from the predicted noise, improving the quality of the generated image.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods in the above embodiments may be completed by instructions, or by instructions controlling related hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform steps in any of the image generation methods provided by the embodiments of the present application. For example, the instructions may perform steps in methods of embodiments of the present application.
The storage medium may include: a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, and the like.
According to one aspect of the present application, a computer program product or computer program is provided, comprising instructions stored in a computer readable storage medium. The processor of the electronic device reads the instructions from the computer-readable storage medium and executes the instructions to cause the electronic device to perform the methods provided in the various alternative implementations provided in the above-described embodiments.
Because the instructions stored in the storage medium can perform the steps in any of the image generation methods provided in the embodiments of the present application, the beneficial effects that can be achieved by any of these image generation methods can also be achieved; for details, refer to the foregoing embodiments, which are not described herein again.
The image generation method, apparatus, electronic device, storage medium, and program product provided in the embodiments of the present application have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help understand the methods and core ideas of the present application. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope according to the ideas of the present application. In view of the above, the content of this description should not be construed as limiting the present application.

Claims (15)

1. An image generation method, comprising:
acquiring a description text to be processed and a reference image;
extracting text features and visual features from the description text to be processed and the reference image respectively;
obtaining semantic guidance features according to the text features and the visual features;
Adding reference noise into the reference image to obtain a noise-added image;
carrying out noise prediction on the noise-added image through the semantic guidance features to obtain prediction noise;
and repairing the noise-added image based on the prediction noise to generate a target image.
2. The image generation method according to claim 1, wherein the visual features are obtained by:
extracting initial visual features from the reference image;
acquiring characteristic adjustment parameters;
and according to the characteristic adjustment parameters and the initial visual characteristics, obtaining the visual characteristics.
3. The image generation method according to claim 1, wherein before the semantic guidance feature is obtained from the text feature and the visual feature, the method further comprises:
adding placeholders at positions adjacent to parent keywords in the description text to be processed;
the obtaining the semantic guidance feature according to the text feature and the visual feature comprises the following steps:
and replacing the features corresponding to the placeholders in the text features with the visual features to obtain the semantic guidance features.
4. The image generating method according to claim 1, wherein said performing noise prediction on the noise-added image by the semantic guidance feature to obtain prediction noise comprises:
Performing multi-scale feature sampling on the noise-added image to obtain multi-scale sampling features;
for the sampling features of any scale, carrying out cross attention processing on the semantic guidance features and the sampling features to obtain attention features;
the prediction noise is derived from the attention feature.
5. The image generating method according to claim 4, wherein the cross-attention processing is performed on the semantic guidance feature and the sampling feature for the sampling feature of any scale to obtain an attention feature, including:
for the sampling features of any scale, performing linear transformation on the semantic guidance features to obtain key vectors and value vectors, and performing linear transformation on the sampling features to obtain query vectors;
and carrying out attention weighting on the value vector through the query vector and the key vector to obtain attention characteristics.
6. The image generation method of claim 4, wherein the deriving the prediction noise from the attention feature comprises:
performing linear transformation on the attention characteristic to obtain a transformed characteristic;
The prediction noise is derived from the transformed features.
7. The image generation method according to any one of claims 1 to 6, characterized in that the method further comprises:
acquiring a training sample set and an image generation model to be trained, wherein the training sample set comprises at least one sample image and a sample description text corresponding to the sample image;
respectively extracting sample text features and sample visual features from the sample description text and the sample image;
obtaining sample semantic guidance features according to the sample text features and the sample visual features;
adding sample noise into the sample image to obtain a noisy sample image;
carrying out noise prediction on the noise-added sample image through the sample semantic guidance features to obtain predicted sample noise;
and adjusting model parameters to be adjusted of the image generation model to be trained according to the loss value between the predicted sample noise and the sample noise to obtain a trained image generation model, wherein the trained image generation model is used for generating a target image.
8. The image generation method of claim 7, wherein the sample description text is obtained by:
Extracting image features to be processed from the sample image;
taking the descriptive text sequence containing the start mark as a text sequence to be processed;
performing attention calculation on the image features to be processed and the text sequence to be processed to obtain attention weights;
determining the generation probability of the next word in the descriptive text sequence according to the attention weight;
determining the next word in the description text sequence according to the generation probability of the next word so as to obtain a current description text sequence;
and taking the current description text sequence as the text sequence to be processed, returning to the step of performing attention calculation on the image features to be processed and the text sequence to be processed to obtain attention weights, and continuing with the subsequent steps, until an end mark is generated, and then taking the current description text sequence as the sample description text corresponding to the sample image.
9. The image generation method according to claim 7, wherein before the adjusting of the model parameters to be adjusted of the image generation model to be trained according to the loss value between the predicted sample noise and the sample noise to obtain the trained image generation model, the method further comprises:
Acquiring a noise prediction parameter, wherein the noise prediction parameter is used for carrying out noise prediction on the noise-added sample image;
splitting the noise prediction parameters into fixed parameters and parameters to be adjusted;
and taking the parameters to be adjusted as the parameters of the model to be adjusted.
10. The image generating method according to claim 9, wherein said taking the parameters to be adjusted as the model parameters to be adjusted includes:
performing low-rank decomposition on the parameters to be adjusted to obtain a plurality of low-rank parameter matrixes;
and taking the plurality of low-rank parameter matrixes as the model parameters to be adjusted.
11. The image generating method according to claim 7, wherein said adjusting the model parameters to be adjusted of the image generating model to be trained according to the loss value between the predicted sample noise and the sample noise, to obtain the trained image generating model, comprises:
calculating a first loss value between the predicted sample noise and the sample noise, and calculating a second loss value between the sample visual feature and the sample text feature;
and adjusting model parameters to be adjusted of the image generation model to be trained by combining the first loss value and the second loss value to obtain the trained image generation model.
12. An image generating apparatus, comprising:
the acquisition unit is used for acquiring the description text to be processed and the reference image;
the extraction unit is used for respectively extracting text features and visual features from the description text to be processed and the reference image;
the combination unit is used for obtaining semantic guidance features according to the text features and the visual features;
the noise adding unit is used for adding reference noise into the reference image to obtain a noise added image;
the prediction unit is used for carrying out noise prediction on the noise-added image through the semantic guidance characteristics to obtain prediction noise;
and the generating unit is used for repairing the noise-added image based on the prediction noise so as to generate a target image.
13. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps in the image generation method according to any one of claims 1 to 11.
14. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps in the image generation method of any of claims 1 to 11.
15. A computer program product comprising a plurality of instructions which when executed by a processor carry out the steps of the image generation method of any of claims 1 to 11.
CN202311399398.6A 2023-10-25 2023-10-25 Image generation method, apparatus, electronic device, storage medium, and program product Pending CN117437317A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311399398.6A CN117437317A (en) 2023-10-25 2023-10-25 Image generation method, apparatus, electronic device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311399398.6A CN117437317A (en) 2023-10-25 2023-10-25 Image generation method, apparatus, electronic device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN117437317A true CN117437317A (en) 2024-01-23

Family

ID=89549292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311399398.6A Pending CN117437317A (en) 2023-10-25 2023-10-25 Image generation method, apparatus, electronic device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN117437317A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710510A (en) * 2024-02-04 2024-03-15 支付宝(杭州)信息技术有限公司 Image generation method and device

Legal Events

Date Code Title Description
PB01 Publication