CN118691923A - Image generation method and device of target theme, computer equipment and storage medium - Google Patents


Info

Publication number
CN118691923A
Authority
CN
China
Prior art keywords
image
text
target
theme
description text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310335853.XA
Other languages
Chinese (zh)
Inventor
温泉
周智毅
王逸宇
衣景龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310335853.XA
Publication of CN118691923A
Legal status: Pending

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The present application relates to a method, an apparatus, a computer device, a storage medium and a computer program product for generating an image of a target theme. The method comprises the following steps: obtaining a theme image generation model obtained by secondarily training a pre-training model with sample images, where the pre-training model generates images from text and each sample image contains at least one theme element conforming to the target theme; obtaining, based on the theme element description texts carried by the sample images for describing the theme elements, text sets each containing theme element description texts of the same type; respectively selecting theme element description texts from at least a part of the text sets and combining them to obtain a target description text containing the selected theme element description texts; and performing image generation processing according to the target description text through the theme image generation model to obtain a target image matched with the target theme. With this method, high-quality images matched with the target theme can be generated.

Description

Image generation method and device of target theme, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for generating an image of a target subject.
Background
With the development of artificial intelligence, intelligent image generation technology has emerged: images can be generated from a text description input by a user, and the generated images have fine details.
Existing text-based image generation techniques can switch to a specified theme style when generating images, but producing images of a specified theme requires a large amount of parameter tuning and trial-and-error work; the generation process suffers from unstable quality, and it is difficult to generate high-quality images that conform to the theme style.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an image generation method, apparatus, computer device, computer-readable storage medium and computer program product for a target theme that are capable of generating, with high quality, images conforming to the theme style.
In a first aspect, the present application provides a method for generating an image of a target subject. The method comprises the following steps:
Obtaining a theme image generation model obtained by performing secondary training on a pre-training model through a sample image; the pre-training model is used for generating an image according to the text; the sample image comprises at least one subject element conforming to a target subject;
Acquiring a text set containing the theme element description texts of the same type based on the theme element description texts which are carried by each sample image and used for describing the theme elements;
Respectively selecting theme element description texts from at least a part of the text sets, and combining them to obtain a target description text containing the selected theme element description texts;
and performing image generation processing according to the target description text through the theme image generation model to obtain a target image matched with the target theme.
In a second aspect, the application further provides an image generation device of the target theme. The device comprises:
the model acquisition module is used for acquiring a theme image generation model obtained by performing secondary training on the pre-training model through the sample image; the pre-training model is used for generating an image according to the text; the sample image comprises at least one subject element conforming to a target subject;
The text set determining module is used for acquiring a text set containing the same type of theme element description text based on the theme element description text which is carried by each sample image and used for describing the theme element;
the description text determining module is used for respectively selecting theme element description texts from at least a part of the text sets and combining them to obtain a target description text containing the selected theme element description texts;
and the target image generation module is used for generating a model through the theme image, and performing image generation processing according to the target description text to obtain a target image matched with the target theme.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the image generation method of the above-mentioned target subject when the processor executes the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the image generation method of the above-described target subject.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the image generation method of the above-mentioned target subject.
According to the image generation method, apparatus, computer device, storage medium and computer program product for the target theme, a theme image generation model obtained by secondarily training a pre-training model is acquired, yielding a model capable of generating images conforming to the target theme from text. In determining the text to input to the model, each sample image used to train the theme image generation model contains at least one theme element conforming to the target theme and carries theme element description text for describing that theme element, so text sets each containing theme element description texts of the same type can serve as the textual building blocks of the model input. Theme element description texts are respectively selected from at least a part of the text sets and combined into a target description text containing the selected theme element description texts, so that the theme image generation model can perform image generation processing according to the target description text and obtain a high-quality target image that differs from the sample images yet is highly matched with the target theme.
Drawings
FIG. 1 is an application environment diagram of a method of image generation of a target subject in one embodiment;
FIG. 2 is a flow diagram of a method of generating an image of a target subject in one embodiment;
FIG. 3 is a schematic diagram of image screening by comparing a target image with different candidate images in one embodiment;
FIG. 4 is a flow diagram of acquiring a sample image based on a target subject video in one embodiment;
FIG. 5 is a schematic diagram of a theme image generation model in one embodiment;
FIG. 6 is a schematic diagram of an image encoding and decoding process in a subject image generation model in one embodiment;
FIG. 7 is a schematic diagram of a denoising process in a subject image generation model in one embodiment;
FIG. 8 is a schematic structural diagram of the Unet model in the subject image generation model in one embodiment;
FIG. 9 is a diagram of a training process and an application process of a theme image generation model in one embodiment;
FIG. 10 is a schematic diagram of a layout of a segmented semantic object in one embodiment;
FIG. 11 is a schematic image generated according to a layout of semantic objects in one embodiment;
FIG. 12 is a schematic image generated based on depth map effects in one embodiment;
FIG. 13 is a schematic diagram of different sample images in one embodiment;
FIG. 14 is a schematic diagram of different generated images in one embodiment;
FIG. 15 is a block diagram of an image generating apparatus of a target subject in one embodiment;
FIG. 16 is an internal block diagram of a computer device in one embodiment;
FIG. 17 is an internal structural view of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The technical terms referred to in the present application are defined as follows:
IP (Intellectual Property): in the Internet content field, this refers to a cultural brand that can be developed across multiple dimensions. It is characterized by a recognizable core identity (which may be a character image, a specific story or a story scene) and by multiple carriers (such as novels, comics, animation, television series, movies, live-action performances, avatars, and so on).
Input text description (prompt): based on the text description, the model can generate a corresponding image. A prompt contains several elements: the painted object (the subject of the painting, generally the foreground), details of the painted object (clothing, expression, color, action, face and limb details, etc.), the background of the painting, and the shooting view angle and distance (the view angle includes front, side, back, etc.; the distance includes close-up, half-body, full-body, etc.).
The image generation method of the target theme provided by the embodiments of the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process; it may be integrated on the server 104 or located on the cloud or other servers, and it stores a pre-trained model for generating images from text. The server 104 may receive an image generation request for the target theme sent by the terminal 102, obtain sample images each containing at least one theme element conforming to the target theme, and perform secondary training on the pre-training model with the sample images to obtain a theme image generation model. It may be appreciated that, in other embodiments, the server 104 may perform secondary training on the pre-training model with sample images of a plurality of different themes to obtain a plurality of theme image generation models for different themes, and, after receiving a request from the terminal 102, directly select the theme image generation model matched with the target theme from the trained theme image generation models.
After determining the topic image generation model matched with the target topic, the server 104 may obtain a text set containing topic element description texts of the same type based on topic element description texts carried by each sample image for describing topic elements, so as to select topic element description texts from at least a part of the text set, and thus combine to obtain the target description text containing the selected topic element description texts.
In one embodiment, the target description text may be selected and combined by the server 104 based on the text set, or the text set may be fed back to the terminal 102 by the server 104, the terminal user may select the theme element description text by using the text set, combine the target description text containing the selected theme element description text and feed back to the server 104, and the server 104 performs image generation processing according to the target description text by using the theme image generation model, so as to obtain a target image matched with the target theme and feed back to the terminal 102.
In another embodiment, the server 104 may also feed back the text set and the theme image generation model to the terminal 102, so that the terminal 102 obtains the target description text based on random selection and combination of the text set, or determines the target description text in response to a description text selection operation triggered by the terminal user for the text set, and then performs image generation processing according to the target description text through the theme image generation model to obtain the target image matched with the target theme.
The terminal 102 may be, but not limited to, various desktop computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, where the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster or cloud server composed of a plurality of servers.
In one embodiment, as shown in FIG. 2, a method of generating an image of a target subject is provided, which may be applied to a computer device. The computer device may be a terminal or a server. For convenience of description, the following embodiments take application of the method to a computer device as an example, and the image generating method of the target subject specifically includes the following steps:
step 202, obtaining a subject image generation model obtained by performing secondary training on a pre-training model through a sample image.
The pre-training model is a technical method for generating images based on text: images can be generated according to the input text description, and the generated images have fine details. Text-to-image generation methods include the GAN (Generative Adversarial Network) framework, the autoregressive framework, and the Diffusion framework. The idea of the GAN framework is to jointly train an image encoder, an image decoder and a GAN discriminator, capturing local features with a CNN (Convolutional Neural Network) structure and global features with a ViT structure. The autoregressive framework encodes the image into a discrete sequence, so that both the image and the text are represented as sequences, and performs text-to-image and image-to-text generation with an autoregressive language model of the NLP (Natural Language Processing) type. The idea of the Diffusion framework is to encode the image into a hidden-variable domain with a two-dimensional structure and to carry out the corresponding diffusion process, noise-adding process and text-controlled training in the hidden-variable domain. In this embodiment, the pre-training model may be a model trained under any of the above frameworks.
The secondary training is fine tuning training for a pre-trained pre-training model, so that the pre-training model can have better data processing capacity. In some embodiments, the same pre-training model may be trained differently for different purposes, such that each model resulting from the secondary training may perform different functions. For example, for different IP topics, the same pre-training model capable of generating images according to text may be trained by using sample images of different IP topics, so that different models obtained by training can generate images of different IP topics.
The theme image generation model is a model which is obtained by training the pre-training model for the second time through a sample image conforming to the target theme and can generate an image conforming to the target theme based on the input text. It should be noted that, in order to ensure that the image generated by the theme image generation model has a higher degree of adaptation with the target theme, the input text of the theme image generation model may be a text matched with the target theme.
The sample image is an image containing at least one theme element conforming to the target theme and carrying theme element description text for describing the theme element. It will be appreciated that the number of sample images used for model training is a plurality. Each sample image can have the same image parameters, such as the same size, the same resolution, and the like, so that the sample images have the same information amount when being subjected to vector conversion in the secondary training process of the model, and meanwhile, the balance of the calculation resource amount of the model in the processing process can be realized, and the data processing efficiency is improved.
The sample images may come from the same data source or from different data sources. The data sources may include retrieval from image web pages or frame extraction from the same video. When the acquired images have different image parameters, the images may first be aligned so that they have the same image parameters. For example, when the sample images are images of a specified IP theme and the specified IP theme is a television series, image frames containing at least one theme element conforming to the target theme may be selected from the television series video, and the theme elements of each image frame may be labeled to obtain theme element description texts that describe the theme elements and match the image frames, thereby obtaining the sample images of the specified IP theme.
Theme elements conforming to the target theme refer to specific elements in which the content in the image is the target theme. Specifically, the character image, the specific story, the story scene, etc. of the target theme may be mentioned. Such as a main character in a television show, a specific scene, etc. Taking a ancient costume television play as an example, the main characters can comprise a main man angle A, a main woman angle B and the like; the specific scene may be a representative scene in a television series, such as a widely spread scene.
The theme element description text is text describing the content of the theme elements in the sample image. For example, the theme element description text for a sample image of a television series may include at least one of the main character, the character's appearance, the character's expression, the background in which the character is located, the view-angle size at which the character appears in the picture, and the like. In some embodiments, the description of the character's appearance may be divided into a description of the clothing and a description of the hairstyle. For example, the description of the clothing of main character A includes: a satin gown with a light yellow-green collar, a dark purple sleeveless silk jacket, a white cotton-linen suit, and the like; the description of the hairstyle of main character A includes: half-loose hair, tied-up hair, and the like; the expressions of the character include: calm, anger, confusion, etc.; the description of the background in which the character is located includes: a carriage, an open square, etc.; and the description of the view-angle size at which the character appears in the picture includes: big-head close-up, half-body medium shot, full-body long shot, etc.
In some specific embodiments, the computer device may respond to a training request of a theme image generation model for a target theme, obtain a sample image conforming to the target theme based on a pre-training model of the text generation image, perform secondary training of the model in real time, and obtain a theme image generation model.
In other specific embodiments, the computer device may also pre-train the topic image generation models for different topics, and the computer device searches for topic image generation models matching the target topic from the trained multiple topic image generation models in response to an acquisition request of the topic image generation models for the target topic.
Step 204, acquiring a text set containing the same type of theme element description text based on the theme element description text which is carried by each sample image and used for describing the theme elements.
Wherein each sample image carries theme element description texts for describing the theme elements. The theme elements of each sample image are described from multiple angles. For each description angle, the sample image has a corresponding theme element description text, and the theme element description texts of one description angle may constitute a text set. For example, if the theme elements of each sample image are described from six angles, namely the main character, the character's clothing, the character's hairstyle, the character's expression, the character's background, and the appearance view angle, then six different types of text sets can be formed correspondingly. The description texts in a text set are obtained by de-duplicating the theme element description texts of that angle across the sample images, and the result is used as the description texts contained in the text set of that angle.
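As an illustration of how such text sets might be assembled in practice, the following is a minimal Python sketch, assuming the theme element description texts are stored as per-image annotation dictionaries; the attribute names and values shown are hypothetical, not taken from the patent.

```python
from collections import defaultdict

# Hypothetical per-image annotations (one dict per sample image).
sample_annotations = [
    {"character": "main character A", "clothing": "satin gown with a light yellow-green collar",
     "hairstyle": "tied-up hair", "expression": "calm",
     "background": "carriage", "view": "half-body medium shot"},
    {"character": "main character B", "clothing": "white cotton-linen suit",
     "hairstyle": "half-loose hair", "expression": "anger",
     "background": "open square", "view": "big-head close-up"},
]

def build_text_sets(annotations):
    """Group description texts by attribute (description angle) and de-duplicate each group."""
    text_sets = defaultdict(set)
    for ann in annotations:
        for attribute, description in ann.items():
            text_sets[attribute].add(description)
    # Sort for reproducibility; each value is one text set of the same type.
    return {attr: sorted(texts) for attr, texts in text_sets.items()}

print(build_text_sets(sample_annotations))
```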
In other embodiments, description texts of external elements other than those of the target theme may be added to the text sets; for example, an external person is added to the text set of main characters, and the external person can then be fused with the core scene of the target IP to achieve embedding into the target IP scene. The added external element may be an element for which specific image information can be acquired through a specific channel, or an element for which corresponding image information can be extracted from an image uploaded by the user. For example, the names of famous people widely known on the Internet, or the person's name for an uploaded portrait photo, can be added to the text set of main characters.
And 206, respectively selecting the theme element description texts from at least a part of text sets, and combining to obtain a target description text containing the selected theme element description texts.
The text set used for selecting the theme element description text can be each type of constructed text set, or can be a part of constructed text set. For example, in the process of constructing text sets, 6 text sets are constructed in total, and the text sets for selecting descriptive text may be 6 text sets, 5 text sets, 4 text sets, or the like. The target description text may include a plurality of topic element description texts selected from the text set, and may also include topic element description texts selected from the text set and object element description texts of external objects. The object element description text of the external object may be added to the text set corresponding to the object element obtained in step 204 in advance, so as to be directly selected from the text set corresponding to the object element. It will be appreciated that, in other embodiments, the object element description text of the external object may also be selected from a text set that includes only the object element description text of the external object, and the specific manner of acquiring the object element description text of the external object is not limited herein.
In some specific implementations, the computer device may randomly select theme element description texts from at least a part of the text sets and combine them to obtain the target description text containing the selected theme element description texts. The computer device may also display each text set on a display interface and, in response to the user's selection of theme element description texts in a part of the text sets, combine the multiple selected description texts to obtain the target description text.
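A minimal sketch of the random selection and combination step, reusing the text sets built in the previous sketch; the function name and the comma-joined prompt format are illustrative assumptions, not the patent's exact implementation.

```python
import random

def compose_target_text(text_sets, use_attributes=None, seed=None):
    """Pick one description text from each chosen text set and join them into a prompt."""
    rng = random.Random(seed)
    chosen = use_attributes or list(text_sets)          # "at least a part of" the text sets
    picked = [rng.choice(text_sets[attr]) for attr in chosen if text_sets.get(attr)]
    return ", ".join(picked)

# Hypothetical usage with the text sets from the previous sketch:
# target_text = compose_target_text(build_text_sets(sample_annotations),
#                                   use_attributes=["character", "clothing",
#                                                   "expression", "view"], seed=0)
```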
In some specific applications, taking the case where the theme elements include main characters as an example: when the number of main characters is one, at most one object element description text is selected from each text set, to avoid conflicting description content that would affect the quality of the generated image. When the number of main characters is two or more, the number of object element description texts selected from each text set may be 1, or may be another number less than or equal to the number of main characters.
Accordingly, when the number of main characters is two or more and the number of object element description texts selected from each text set is 1, the two or more main characters in the generated image may have the same or similar appearance at the angle corresponding to that description text. When the number of main characters is N and the number of object element description texts selected from each text set is N, the N characters in the generated image may each have a different appearance at the angles corresponding to the description texts.

Step 208, performing image generation processing according to the target description text through the theme image generation model to obtain a target image matched with the target theme.
Specifically, the target description text is the input data of the theme image generation model. The theme image generation model is obtained by secondarily training the pre-training model with sample images carrying theme element description texts, and the target description text is a combination of theme element description texts selected from the text sets built from the sample images; therefore, the computer device can perform image generation processing according to the target description text through the theme image generation model to obtain a high-quality target image matched with the target theme.
In some embodiments, the theme image generation model is based on a pre-training model of the Diffusion framework, and the pre-training model of this framework is fine-tuned (finetune trained) on the sample images of the target theme. The data processing of a pre-training model based on the Diffusion framework includes a diffusion process and a denoising process, so the theme image generation model obtained by finetune training also includes a diffusion process and a denoising process. The diffusion process refers to gradually turning a data sample (the hidden variable obtained by encoding an image) into random noise through noise addition. The denoising process learns the noise of each denoising step through a complex Unet model, so that the difference between the estimated noise and the real noise is as small as possible. The input of the denoising process is white noise; through multi-step noise estimation the noise in the image becomes smaller and smaller, and finally a high-definition image is sampled.
Specifically, in addition to the target description text, the computer device may acquire a random seed, and input the target description text and the random seed into the theme image generation model. A random number generator seeded with the random seed generates a random hidden variable; the target description text is encoded by the text encoder into a text vector; the image hidden variable can be reconstructed based on the text vector and the random hidden variable; and the reconstructed image hidden variable is then decoded into an image, so that a high-quality target image matched with the target theme can be obtained.
According to the image generation method of the target theme described above, a theme image generation model obtained by secondarily training a pre-training model is acquired, yielding a model capable of generating images conforming to the target theme from text. In determining the text to input to the model, each sample image used to train the theme image generation model contains at least one theme element conforming to the target theme and carries theme element description text for describing that theme element, so text sets each containing theme element description texts of the same type can serve as the textual building blocks of the model input. Theme element description texts are respectively selected from at least a part of the text sets and combined into a target description text containing the selected theme element description texts, so that the theme image generation model can perform image generation processing according to the target description text and obtain a high-quality target image that differs from the sample images yet is highly matched with the target theme.
Next, the manner of acquiring a sample image for performing model training will be described by the following examples.
In some embodiments, the sample image may be obtained from a video, the specific process including: acquiring a target video conforming to a target theme, and performing frame extraction processing on the target video to obtain a plurality of candidate image frames with scene differences; respectively carrying out theme element identification on each candidate image frame, and determining a preferred image frame containing at least one theme element; and based on the theme elements, performing theme element description text labeling on the preferable image frame to obtain a sample image.
The target video conforming to the target theme can be a video specific to a specific IP, such as a television show, a movie, a cartoon, an animation, and the like. The target video is a video displayed around the target theme and has at least one of a specific character, a specific story and a story scene, which can be used as a theme element. Taking the example that the target video has a specific character image, the subject element of the target video may be a principal angle in the target video.
The target video is composed of a large number of video frames. To train the model effectively, the target video can be subjected to frame extraction, and a portion of well-suited video frames can be extracted as sample images for model training. In a specific embodiment, the frame rate of the video files is 30 frames/s, an episode of video is about 40 min to 120 min long, and 70k to 210k images can be extracted from each video file on average. However, because the image content changes very little between consecutive frames, only image frames with relatively large scene changes need to be extracted. With this method, generally only 2k to 4k images are extracted from a 40-min episode. This effectively screens the video frames of the target video, improves the validity of the sample images, and can therefore improve the model training effect, shorten the training time required to reach the expected effect, and reduce the data processing resources that the computer device spends on training the model.
The scene is used to characterize the primary content in the image frames. The image frames of the scene that differ may be a plurality of image frames having one or more differences in different backgrounds, different characters, different perspectives, different figures, etc. The plurality of candidate image frames with differences in scenes are screened out through frame extraction, so that the content of the image frames serving as sample images has larger differences, and the training effect of the sample images in model training is improved.
Theme element identification is a process of judging whether an image frame contains a theme element. In a specific application, an image frame of the target video may contain one theme element, two theme elements, or no theme element. In the process of filtering the image frames, the image frames that do not contain any theme element are removed, so that the filtered image frames all have at least one theme element; for example, in this embodiment, the preferred image frames are obtained after removing, from the candidate image frames, the image frames that do not contain a theme element.
Theme element description text labeling refers to the process of describing the theme elements in an image frame in words. The theme element description text is text that describes the elements appearing in an image frame. It may include the main object (the subject of the image, typically the foreground), the image background, the shooting view angle and distance, and some detailed descriptions (clothing, expression, color, action, face and limb details, etc.). The main object may be a character in the IP video and may be described by the character's name, such as the main character Zhang Xiaoming in "sitcom A". The image background may be a scene in the IP video, such as a building, a natural landscape or a story setting, for example the ancient pavilion, the ancient temple, or the famous flag-holding scene in television series A. Detailed description, clothing class: special clothing in the IP video, such as the lead actor's and actress's costumes in television series A, described in natural language as a satin gown with a light yellow-green collar, a blue cotton-linen suit, a blue-and-white checked cotton-linen suit, a dark purple sleeveless silk jacket, and a white cotton-linen suit. Detailed description, expression class: the classification of the characters' expressions in the IP video, including calm, serious, happy, anger, doubt, and the like. Based on the above description texts, one or more text descriptions are generated for each image frame, forming pairs of training data with the image frame, such as the sample images carrying theme element description texts in the above embodiments.
In some specific embodiments, for the case where the theme elements are specific characters, the description may be made for each theme element in the image frame from the perspective of a main character, character clothing, character hairstyle, character expression, character background, appearance perspective, and the like. For the case that the theme elements are specific static objects, the description can be made from the angles of object names, object backgrounds, appearance angles and the like for each theme element in the image frame.
In this embodiment, the computer device performs frame extraction processing on the target video by acquiring the target video conforming to the target theme, so as to obtain a plurality of candidate image frames with scene differences, and performs preliminary filtering on the candidate image frames, then performs theme element recognition on each candidate image frame, determines a preferred image frame containing at least one theme element, eliminates an invalid image frame not containing the theme element, and finally performs theme element description text labeling on the preferred image frame based on the theme element, so that a high-quality sample image can be obtained conveniently and efficiently. It will be appreciated that in other embodiments, the order of execution of the steps may be reorganized according to actual needs.
The following describes the frame extraction process in detail, and it will be understood that in other embodiments, the frame extraction process may be implemented in other manners, which are not limited herein. In some embodiments, the frame extracting process is performed on the target video to obtain a plurality of candidate image frames with scene differences, including:
Carrying out framing processing on the target video to obtain a plurality of image frames with the same pixel arrangement; for two image frames among the plurality of image frames, determining the sum of the absolute values of the pixel-granularity differences over all pixel arrangement positions based on the pixel-granularity difference at each same pixel arrangement position; and when the ratio between the sum of absolute values and the total pixel-granularity value of the image frame is greater than a scene change threshold, determining that the two image frames are candidate image frames with a scene difference.
Wherein the pixel arrangement refers to the relative positions of the individual pixels that make up the image frame. Because the image frames are obtained by framing from the target video, the video frames obtained by frame extraction through the same processing mode have the same image attribute aiming at the same target video, and the same image attribute comprises the same pixel arrangement. The pixel granularity is a parameter for describing specific content of each pixel point in the image frame, and may be specifically one of parameters such as RGB value, gray value, and the like. For two image frames which are comparison objects, the computer equipment performs difference processing on specific parameter values of pixel granularity of the same pixel arrangement positions of the two image frames to obtain a pixel granularity difference value of each pixel arrangement position, wherein the pixel granularity difference value can be a positive value or a negative value, and in order to avoid interference caused by numerical value offset between positive values and negative values, the computer equipment obtains the sum of absolute values by accumulating the absolute values of the pixel granularity difference values of the pixel arrangement positions so as to represent the difference between the two image frames which are comparison objects.
Whether scene difference exists between two image frames or not can be embodied specifically through the comparison result of the ratio between the sum of absolute values and the total value of pixel granularity of the image frames and the numerical value of the scene change threshold. In a specific embodiment, for convenience of illustration, the image frame is simplified into a single channel 3x3 image, as shown in fig. 3, which is a target image, a candidate image 1 and a candidate image 2, respectively, and for comparison of the target image and the candidate image 1, by calculating the pixel granularity difference value of each pixel point in the target image and the candidate image 1, the sum of absolute values of the pixel granularity difference values is 20, the total value of the pixel granularity of the target image is 44, and the ratio is 20/44=0.45. If the threshold of the scene change is set to 0.3, that is, if the ratio of the scene change exceeds 0.3, the candidate image 1 satisfies the condition, and the target image and the candidate image 1 can be determined as two images having a difference in the scene.
Similarly, as shown in fig. 3, when comparing the target image with the candidate image 2, by calculating the pixel granularity difference value of each pixel point in the target image and the candidate image 2, the sum of absolute values of the pixel granularity difference values is 4, the total value of the pixel granularity of the target image is 44, and the ratio is 4/44=0.09. The threshold for the scene change is set to 0.3, i.e. the ratio of the required scene change exceeds 0.3, the candidate image 2 does not satisfy the condition.
Further, when the comparison result between the target image and a candidate image is that the candidate image does not meet the condition, the candidate image can be discarded and the next image is compared with the target image; when a candidate image that meets the condition appears, that candidate image can be taken as the new target image and compared with the following images. By judging scene changes of the sequentially arranged image frames in this way, efficient and accurate filtering of the image frames can be achieved while effectively reducing the number of comparisons, thereby further improving the quality of the sample images.
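The following is a minimal sketch of the scene-change frame extraction described above, assuming OpenCV is available and using grayscale values as the pixel granularity; the 0.3 threshold mirrors the worked example, and the function name is illustrative.

```python
import cv2
import numpy as np

def extract_scene_change_frames(video_path, threshold=0.3):
    cap = cv2.VideoCapture(video_path)
    kept, reference = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.int64)
        if reference is None:                            # first frame becomes the comparison target
            reference = gray
            kept.append(frame)
            continue
        diff_sum = np.abs(gray - reference).sum()        # sum of absolute pixel differences
        ratio = diff_sum / max(int(reference.sum()), 1)  # vs. total pixel value of the target
        if ratio > threshold:                            # scene changed: keep frame, re-anchor target
            kept.append(frame)
            reference = gray
    cap.release()
    return kept
```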
In some embodiments, the topic element identification is performed on each candidate image frame, and determining a preferred image frame containing at least one topic element includes: respectively acquiring attribute information and content information of each candidate image frame; and discarding candidate image frames of which at least one of the attribute information and the content information does not meet the theme element identification condition, and obtaining a preferred image frame containing at least one theme element.
The attribute information of an image frame includes its attributes within the target video, such as picture sharpness and whether it is an opening-credit frame, ending-credit frame, advertisement frame or transition frame; such frames are invalid image frames unsuitable for use as sample images. Whether the attribute information of an image frame is invalid can be determined by preset attribute identifiers, or the attribute information can be determined by identifying the start and end time points of the opening credits, ending credits, advertisements and transitions in the target video and comparing the time information of the image frame with these time points. In other embodiments, the attributes of the image frames may also be determined by image recognition tools, including an opening/ending-credits recognition tool and an advertisement recognition tool. The content information refers to the content contained in the image frame, specifically whether the person in the image is a main character, whether the scene is distinctive, whether the person in the image is completely captured, and the like.
For an image frame to serve as a sample image, its attribute information and content information must both satisfy the theme element identification condition. The theme element identification condition may specifically include an attribute information screening condition and a content information screening condition. Only when the attribute information of an image frame satisfies the attribute information screening condition and its content information satisfies the content information screening condition can it be ensured that the image contains at least one theme element, and only then can the image frame be determined as a preferred image frame satisfying the theme element identification condition.
In a specific application, the IP video may contain some opening/ending-credit images, advertisement images and transition images; in addition, the storyline may contain some supporting characters, scenes without distinctive features, images with low definition, and images in which characters are incompletely captured, which are not suitable for use as training corpus. The computer device therefore obtains the attribute information and content information of each candidate image frame, screens the attribute information and content information respectively, and discards candidate image frames for which at least one of the attribute information and the content information does not satisfy the theme element identification condition, so as to ensure that the obtained preferred image frames each contain at least one theme element.
In some embodiments, the theme elements include object elements that conform to a target theme; based on the theme elements, performing theme element description text labeling on the preferable image frame to obtain a sample image, wherein the method comprises the following steps:
For each preferred image frame, locating the center point of the object element in the preferred image frame to obtain a positioning position; cropping the preferred image frame, with the positioning position as the center, according to the size condition of the sample image to obtain a cropped preferred image frame; and performing theme element description text labeling on the cropped preferred image frame to obtain a sample image.
The object element refers to a theme element existing as a core object. Such as a character at a main corner in a television show, a main avatar in an animation, etc. Center point positioning refers to a process of determining the center position of an object element in a preferred image frame for positioning. The size condition of the sample image refers to the size that needs to be satisfied by the sample image of the input model. By performing image clipping on the preferable image frame with the positioning position as the center, the preferable image frame after clipping is obtained, and effective display of the object element can be realized.
In some embodiments, an IP video frame is typically a landscape wide-screen image of 1920x1080, while the input size required for training images is smaller (e.g., 512x512 or 768x768). Directly scaling the image would cause a loss of image sharpness and leave insufficient resolution for the details of the main elements. In the above embodiment, the object element is located and an image of the target size is cropped with the object element as the center, so that the quality of the sample images can be improved while ensuring that they have the same size.
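A minimal sketch of the centered cropping described above; the function name and the 512-pixel output size are assumptions, and the crop window is clamped so it stays inside the frame.

```python
def center_crop(frame, center_xy, out_size=512):
    """Crop an out_size x out_size patch centered on the located object element."""
    h, w = frame.shape[:2]
    cx, cy = center_xy
    half = out_size // 2
    # Clamp the top-left corner so the window does not leave the frame.
    left = min(max(cx - half, 0), max(w - out_size, 0))
    top = min(max(cy - half, 0), max(h - out_size, 0))
    return frame[top:top + out_size, left:left + out_size]

# e.g. for a 1920x1080 frame with the main character located at (960, 540):
# patch = center_crop(frame, (960, 540), out_size=512)
```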
In a specific embodiment, as shown in fig. 4, the specific process of acquiring sample images may include: obtaining a video file conforming to the target theme; performing frame extraction on the video file by computing the sum of the absolute values of the pixel-granularity differences between two consecutive frames and comparing it with a set threshold; then, through theme element identification and quality judgment, filtering out opening/ending-credit, advertisement and transition images, as well as images of supporting characters, scenes without distinctive features, images with low definition and images in which characters are incompletely captured, which may appear in the storyline and are unsuitable as training corpus; next, locating the theme element in each image frame and cropping an image of the target size centered on the theme element; and finally, performing text labeling on each processed image, describing in words the theme elements appearing in the image, including the main object (the subject of the image, generally the foreground), the image background, the shooting view angle and distance, and some detailed descriptions (expression, color, face and limb details, clothing, etc.).
Next, one possible implementation of determining the target text combinations based on the text sets is described by the following embodiments.
In some embodiments, the text sets include an open text set and a closed text set. The open text set includes object element description texts describing the object elements among the theme elements, as well as object element description texts describing external object elements. An external object element refers to an object element that does not conform to the target theme. For example, taking the object element as a main character, the open text set is used to describe the main character, which may be a well-known external figure or a person for whom the user uploads a plain photo. Taking "drama A" as an example, the characters within the IP include main character A, main character B, main character C, and so on. The well-known external figure may be any famous person, such as a famous movie actor or a famous singer; it may also be the subject of a plain photo uploaded by the user, generally an ordinary (non-famous) person, such as the person in a portrait photo taken by the user with a camera.
A closed text set refers to a text set constructed based on the theme element description texts of the theme elements. In some embodiments, the element types corresponding to the closed text sets and the open text set may be different; for example, when the element type corresponding to the open text set is the object element, the element types corresponding to the closed text sets are the other theme elements except the object element. For example, the closed text sets may specifically include a clothing text set: [satin gown with a light yellow-green collar, blue cotton-linen suit, blue-and-white checked cotton-linen suit, dark purple sleeveless silk jacket, white cotton-linen suit]; a hairstyle text set: [half-loose hair, tied-up hair]; an expression text set: [calm, serious, surprise, happy, anger, doubt]; and a view-angle text set: [big-head close-up, half-body medium shot, full-body long shot].
Further, selecting the topic element description text from at least a part of the text set, and combining to obtain a target description text containing the selected topic element description text, wherein the method comprises the following steps: selecting an object element description text from the open text set; and combining the object element description text with the theme element description text respectively selected from at least a part of the closed text set to obtain a target text combination.
Wherein, selecting the object element description text from the open text set determines the object expected to appear in the generated target image. Because the open text set includes object element description texts describing the object elements among the theme elements as well as object element description texts describing external object elements, it is convenient to embed an external famous person or an ordinary person as the subject of the target image through text-to-image generation, so that the external person is fused with the core scene of the target theme, realizing embedding into the target theme scene.
In this embodiment, the theme image generation model may process the selected object element description text as the external object element by acquiring the image of the external object element as the initial feature of the hidden variable, so as to effectively fuse the external object element with the content of other target theme, and expand the application scenario of the theme image generation model.
Next, the theme image generation model is described through the following embodiments. The theme image generation model is the result of fine-tuning the pre-training model on the training corpus of a specific IP theme, and has the same model structure as the pre-training model.
In one embodiment, the model structure of the theme image generation model, i.e., the pre-training model, is first described. As shown in fig. 5, the model includes a control domain, a hidden-variable domain and an image domain. The image domain includes an image encoder and an image decoder for connecting the image domain with the hidden-variable domain. The image encoder encodes and compresses an image vector of dimensions [Channel, Height, Width] in the image domain into a hidden variable of dimensions [LatentChannel, LatentHeight, LatentWidth]. Typically the image is RGB-coded, Channel is 3, and Height and Width are the height and width of the original image; LatentChannel is the number of channels in the hidden-variable domain, for example 4, and LatentHeight and LatentWidth are scaled down from the original image by a power of 2 (e.g., 4, 8, 16, etc.): LatentHeight = Height / scale_factor, LatentWidth = Width / scale_factor. To ensure that the height and width in the hidden-variable domain are integers, an alignment operation is usually performed on the input image, for example a resize and pad operation, so that the width and height of the image are divisible by the fixed scale_factor.
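A minimal sketch of this dimension bookkeeping, assuming a padding-based alignment; the default scale_factor of 8 and 4 latent channels are common choices, not values fixed by the patent.

```python
def latent_shape(height, width, scale_factor=8, latent_channels=4):
    """Pad Height/Width up to a multiple of scale_factor and derive the latent dimensions."""
    pad_h = (scale_factor - height % scale_factor) % scale_factor
    pad_w = (scale_factor - width % scale_factor) % scale_factor
    return latent_channels, (height + pad_h) // scale_factor, (width + pad_w) // scale_factor

print(latent_shape(512, 512))   # -> (4, 64, 64)
print(latent_shape(770, 515))   # padding makes both sides divisible by 8 -> (4, 97, 65)
```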
The processing procedure of the image decoder is the inverse of that of the image encoder, and is responsible for restoring the hidden variable of the hidden-variable domain to an image vector of the image domain. For example, the image decoder may receive a hidden variable of dimensions [LatentChannel, LatentHeight, LatentWidth] from the hidden-variable domain, restore it to an image vector of [Channel, Height, Width], and generate the corresponding image.
In one embodiment, image encoding and image decoding may be implemented using a VAE (Variational Auto-Encoder) framework. As shown in fig. 6, the image encoder (Encoder) encodes the image vector of the input image into a hidden variable; specifically, it estimates the mean E(z) and variance V(z) of the Gaussian distribution of each dimension of the hidden-variable space, and then samples the Gaussian distribution based on the per-dimension mean and variance to obtain the hidden variable. The image decoder (Decoder) decodes the hidden variable back into an image vector.
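A minimal sketch of the sampling step described above (the reparameterization trick), assuming PyTorch and that the encoder outputs a per-dimension mean and log-variance; tensor shapes are illustrative.

```python
import torch

def sample_latent(mean, logvar):
    """mean, logvar: tensors of shape [latent_channels, latent_h, latent_w]."""
    std = torch.exp(0.5 * logvar)      # square root of the per-dimension variance V(z)
    eps = torch.randn_like(std)        # unit Gaussian noise
    return mean + eps * std            # sample z from N(E(z), V(z))
```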
The control domain includes a text encoder for converting natural text into text vectors of a fixed dimension. Specifically, the text encoder may adopt a Transformer structure, and the processing includes converting the natural text into an ID sequence through tokenization and vocabulary lookup by a tokenizer, and encoding the ID sequence into a sequence vector matrix of [Seq_Len, HiddenSize] through a multi-layer Transformer structure, where Seq_Len is the length of the sequence and HiddenSize is the dimension of the hidden vector. In some specific embodiments, the text encoder may be obtained from an open-source pre-trained model, including CLIPTextEncoder, BertTextEncoder, T5TextEncoder, and the like.
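A minimal sketch of the text-encoding step using an open-source CLIP text encoder (one of the options named above) via the Hugging Face transformers library; the checkpoint name and padding choice are illustrative assumptions.

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

def encode_prompt(prompt: str):
    ids = tokenizer(prompt, padding="max_length", truncation=True,
                    return_tensors="pt")              # natural text -> ID sequence
    out = text_encoder(**ids)
    return out.last_hidden_state                      # [1, Seq_Len, HiddenSize]
```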
The data processing process of the hidden variable domain comprises a diffusion process and a denoising process. The diffusion process refers to a process of gradually passing a data sample (hidden variable obtained by encoding an image through VAE) through noise addition to become random noise. The specific formula is as follows
X_t = α_t·X_{t-1} + β_t·ε_t,  ε_t ~ N(0, 1)
Wherein X_{t-1} is the hidden variable of the previous step, α_t and β_t are preset noise weights, and ε_t is randomly sampled Gaussian noise. For the initial hidden variable, the diffusion noising process is performed for 1000 steps or more; the noise weight is smaller in the initial stage and larger in the later stage, and in particular, the noise weight may follow a fixed weight schedule.
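A minimal sketch of one noising step of the diffusion process; the square-root parameterization of the weights and the linear weight schedule below are assumptions used only to make the example concrete:

```python
import torch

def diffusion_step(x_prev: torch.Tensor, alpha_t: float, beta_t: float) -> torch.Tensor:
    """One noising step: X_t = alpha_t * X_{t-1} + beta_t * eps_t, eps_t ~ N(0, 1)."""
    eps_t = torch.randn_like(x_prev)
    return alpha_t * x_prev + beta_t * eps_t

x = torch.randn(1, 4, 64, 64)                    # initial hidden variable from the VAE encoder
for t in range(1000):                            # 1000 or more steps, weights small early and larger later
    w = 0.0001 + (0.02 - 0.0001) * t / 999       # an assumed fixed weight schedule
    x = diffusion_step(x, alpha_t=(1 - w) ** 0.5, beta_t=w ** 0.5)
```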
As shown in fig. 7, the denoising process learns the noise removed at each denoising step through a complex Unet model, so that the smaller the difference between the estimated X_{t-1}' and the real X_{t-1}, the better. The input of the denoising process is white noise; through multi-step noise estimation, the noise in the image becomes smaller and smaller, and finally a high-definition image is sampled. The Unet model training loss function is the difference between the noise estimated at time t and the actual noise (for example, Loss = ||ε_t − ε_θ(X_t, t)||², where ε_θ denotes the noise estimated by the Unet).
By minimizing the difference between the estimated noise at time t and the actual noise, the difference between the estimated X_{t-1}' and the real X_{t-1} can be minimized. The original image X at time t = 0 is finally restored by denoising over t steps. The model structure of the denoising process is a Unet structure; as shown in fig. 8, the Unet structure comprises a group of downsamplers, a group of upsamplers, and a middle-layer sampler whose dimensions are kept unchanged. The downsampler may be a CNN + Transformer structure, and so may the upsampler; the middle layer has the same structure as the downsampler and upsampler, except that no dimension scaling is performed.
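A hedged sketch of the training objective of the denoising Unet; the call signature unet(x_t, t, text_emb) and the use of a mean-squared error are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def denoising_loss(unet, x_t: torch.Tensor, t: torch.Tensor,
                   text_emb: torch.Tensor, true_noise: torch.Tensor) -> torch.Tensor:
    """Train the Unet to estimate the noise added at step t, conditioned on the text vector.
    Minimizing this difference minimizes the gap between the estimated X_{t-1}' and the real X_{t-1}."""
    estimated_noise = unet(x_t, t, text_emb)      # assumed call signature
    return F.mse_loss(estimated_noise, true_noise)
```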
In the process of fine-tuning the pre-training model on the training corpus of the specific IP theme, the training framework mainly trains the text encoder of the control field and the Unet module of the denoising process. For the other two parts of the overall framework, the image encoder and the image decoder use a VAE trained in advance and only perform inference without participating in training, and the diffusion process is a fixed multi-step process of adding Gaussian noise and does not participate in training.
The application process of generating images with the theme image generation model is described next. It should be noted that the framework in the application process of the theme image generation model differs from the framework in the training process: since there is no input image, the application framework does not include the encoding process and the diffusion process based on the input image in the training framework, but directly samples random hidden variables of the hidden variable domain using random seeds.
In one embodiment, performing image generation processing according to a target description text by using a theme image generation model to obtain a target image matched with a target theme, including:
Converting the target description text into a text vector based on the target description text input from the control field; denoising under the guidance of a text vector based on white noise corresponding to a random hidden variable of the hidden variable domain so as to reconstruct an image hidden variable; decoding the hidden image variable in the image domain to obtain the target image matched with the target theme.
Specifically, as shown in fig. 9, the computer device inputs the target description text through the control field of the theme image generation model, and the text encoder of the control field converts the target description text into a text vector of the hidden variable domain to generate an image hidden variable Z_0 corresponding to the text vector; a random number generator seeded with a random seed produces Gaussian white noise as the random hidden variable of the hidden variable domain, then a multi-step denoising process is performed on the random hidden variable under the guidance of the text vector to reconstruct the image hidden variable Z_0, obtaining Z'_0, and the image decoder of the image domain produces the image vector to generate the target image. In this embodiment, the theme image generation model generates random hidden variables and denoises them under the guidance of text vectors, so that the reconstruction of the image hidden variable can be effectively realized, and rapid, high-quality image generation is achieved.
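A sketch of the application-time pipeline under stated assumptions; all component interfaces here (text_encoder, unet, scheduler, vae_decoder) are hypothetical placeholders for the modules described above:

```python
import torch

def generate(text_encoder, unet, scheduler, vae_decoder, prompt: str,
             steps: int = 50, seed: int = 0) -> torch.Tensor:
    """Encode the target description text, start from random white noise in the hidden
    variable domain, denoise step by step under text guidance, then decode in the image domain."""
    text_emb = text_encoder(prompt)
    generator = torch.Generator().manual_seed(seed)        # random seed
    z = torch.randn(1, 4, 64, 64, generator=generator)     # random hidden variable
    for t in scheduler.timesteps(steps):                   # multi-step denoising
        noise_pred = unet(z, t, text_emb)                  # noise estimated under text guidance
        z = scheduler.step(noise_pred, t, z)               # remove a portion of the noise
    return vae_decoder(z)                                  # latent -> image vector
```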
In the application process of the theme image generation model, by inputting text related to the theme IP, the theme image generation model can generate images which accord with the style of the theme IP, are consistent with the text description and have great diversity, namely, the generated images are not identical with the training data. In order to further improve the quality of the image generated by the theme image generation model, the following embodiments describe the manner of improving the image quality in the present application from different aspects, and it is understood that the generation of the image may be implemented by each embodiment alone, or may be implemented in combination with other embodiments, which is not limited herein.
In some embodiments, the computer device may enhance the quality of the generated image by way of text description weighting. Specifically, in some of these embodiments, converting target descriptive text into text vectors based on target descriptive text entered from a control field includes:
Aiming at a target description text input from a control field, acquiring weight data of a key description text in the target description text; and carrying out vector weighting processing on the key description text in the vector conversion process based on the weight data of the key description text to obtain a text vector corresponding to the target description text.
Specifically, a theme image generation model obtained by fine-tuning with the standard training process and model framework has the problem that control over some features is not strong. In order to solve this technical problem, in this embodiment the key features are vector weighted in the text vector field, so that the key features have higher weight in the image generation process compared with other texts. The specific implementation is that the weight is fused with the text in the text construction process, and corresponding parsing and weighting are performed in the text encoding process.
For example, in one specific application, the target description text is "character: principal angle A, viewing angle: (half body view: 1.1), hairstyle: hair-binding, clothing: (blue cotton-flax suit: 1.2)". By means of (topic element description text: weight), the text portion to be weighted can be determined, and different weights are applied to different texts; text without brackets may default to a weight of 1.0.
In this embodiment, in the application process of the theme image generation model, for the target description text input from the control field, the weight data of the key description text in the target description text is determined according to the text format of the target description text; then, based on the weight data, vector weighting processing is performed on the key description text while the text encoder executes the vector conversion process, so as to obtain the text vector corresponding to the target description text. This can effectively improve the expression of key features in the generated image, and further improve the image quality.
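A minimal sketch of parsing the "(topic element description text: weight)" format shown above; the regular expression and the returned (text, weight) structure are assumptions about one possible implementation:

```python
import re

def parse_weighted_prompt(prompt: str):
    """Split the target description text into (fragment, weight) pairs;
    fragments outside brackets default to a weight of 1.0."""
    pieces = []
    pattern = re.compile(r"\(([^:()]+):\s*([0-9.]+)\)")
    pos = 0
    for m in pattern.finditer(prompt):
        plain = prompt[pos:m.start()].strip(" ,")
        if plain:
            pieces.append((plain, 1.0))
        pieces.append((m.group(1).strip(), float(m.group(2))))
        pos = m.end()
    tail = prompt[pos:].strip(" ,")
    if tail:
        pieces.append((tail, 1.0))
    return pieces

print(parse_weighted_prompt(
    "character: principal angle A, viewing angle: (half body view: 1.1), "
    "hairstyle: hair-binding, clothing: (blue cotton-flax suit: 1.2)"))
```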
In some embodiments, the computer device may improve the quality of the generated image by fusing vectors from different layers of the text encoder. Specifically, in some of these embodiments, the control field includes a text encoder that converts text into vectors; the text encoder includes at least two network layers. For example, the text encoder may be a Transformer model having multiple layers.
Further, converting the target descriptive text into a text vector based on the target descriptive text input from the control field, comprising: according to each network layer in the text encoder, vector conversion processing is sequentially carried out on the input target description text, and the output characteristics of each network layer are obtained; carrying out feature fusion on the output features of the target network layer and the output features of the last network layer of the text encoder to obtain a text vector corresponding to the target description text; the target network layer is a network layer in the text encoder for improving the text control capability.
In particular, it has been found through many experiments and repeated studies that the semantic control capability of different layers of the text encoder, and the sharpness of the final image, may differ. By determining a target network layer in the text encoder that improves text control capability, and performing feature fusion between the output features of that target network layer and the output features of the last layer, where the fusion mode includes weighted interpolation fusion, the resulting text vector has better semantic control capability in the image generation process.
Taking a text encoder implemented as a Transformer model as an example, the penultimate layer of the Transformer can improve the control capability of the text. The text vectors of the penultimate layer and the last layer of the Transformer model are fused by interpolation, in the following specific manner:
text_emb=alpha*text_emb(L-1)+(1-alpha)*text_emb(L)
Wherein text_emb(L-1) is the text vector of the penultimate layer, with dimension [ seq_Len, HiddenSize ]; text_emb(L) is the text vector of the last layer, whose dimension is consistent with text_emb(L-1), so linear interpolation fusion can be performed; alpha is the weight of the linear interpolation, a decimal in [0, 1], and the higher the value, the higher the weight of the penultimate layer.
In this embodiment, by performing feature fusion on features from different layers of the text encoder, a vector with better semantic control capability can be fused into the output layer of the text encoder, so that the text vector produced by the text encoder has better semantic control capability in the image generation process, and the generated image quality can be effectively improved.
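A minimal sketch of the interpolation fusion formula above; hidden_states is assumed to be the list of per-layer outputs of the Transformer text encoder, and alpha = 0.7 is only an illustrative value:

```python
import torch

def fuse_text_layers(hidden_states, alpha: float = 0.7) -> torch.Tensor:
    """text_emb = alpha * text_emb(L-1) + (1 - alpha) * text_emb(L)."""
    text_emb_penultimate = hidden_states[-2]   # [seq_Len, HiddenSize]
    text_emb_last = hidden_states[-1]          # same dimension, so linear interpolation is valid
    return alpha * text_emb_penultimate + (1 - alpha) * text_emb_last
```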
In some embodiments, the computer device may increase the quality of the generated image by adding and weighting the negative description. Specifically, in some of these embodiments, the input text of the control field also includes negative description text. Negative description text is a text description of content that is not intended to appear in an image. For example, to improve the quality of image generation, the descriptor "low quality, low resolution" of a low quality image, etc. may be added to the negative description text.
The noise predicted in the denoising process of the theme image generation model can fuse the estimated noise of the input text and the estimated noise of the negative description text by adopting a classifier-free guidance weighting method. In model training, the unconditional branch corresponds to the case where no input text is provided; at inference, the negative description text plays the role of this unconditional input. Further, the estimated noise of the input text and the estimated noise of the negative description text can be fused by the following formula:
noise_predict=noise_uncond+scale*(noise_text-noise_uncond)
Wherein noise_predict is the fusion result, noise_uncond is the estimated noise of the negative description text, noise_text is the estimated noise of the input text, and scale controls the amplification ratio of the difference between noise_text and noise_uncond; the higher the value, the more the generated image leans toward the input text and the farther it moves from the negative description text, thereby achieving the effect of controlling the negative description.
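A direct sketch of the fusion formula above; the default scale value is illustrative:

```python
import torch

def guided_noise(noise_text: torch.Tensor, noise_uncond: torch.Tensor,
                 scale: float = 7.5) -> torch.Tensor:
    """noise_predict = noise_uncond + scale * (noise_text - noise_uncond)."""
    return noise_uncond + scale * (noise_text - noise_uncond)
```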
Further, based on white noise corresponding to random hidden variables of the hidden variable domain, denoising processing is performed under the guidance of a text vector so as to reconstruct hidden variables of an image, and the method comprises the following steps:
Taking the estimated noise of the negative description text as a fusion object in the denoising process, and fusing the estimated noise with the estimated noise of the target description text to obtain fused estimated noise; based on white noise corresponding to random hidden variables of the hidden variable domain, denoising processing is carried out according to fusion estimated noise under the guidance of a text vector so as to reconstruct the hidden variables of the image.
Specifically, when the computer device executes the denoising process through the theme image generation model, firstly identifying whether the input text of the control field comprises negative description text or not, wherein the target description text and the negative description text can be marked through different marks for identification and distinction. After the negative description text is identified, the computer equipment acquires the estimated noise of the negative description text, fuses the estimated noise of the negative description text with the estimated noise of the target description text to obtain the fused estimated noise, so that the content of the negative description text is better removed in a noise mode, the probability that the content represented by the negative description text appears in the generated image is reduced, and the quality of the generated image is improved.
In some embodiments, the computer device may control the manner in which different texts take effect at different stages of the denoising process to improve the quality of the generated image. Specifically, in some of these embodiments, the denoising process includes a first stage and a second stage that occur sequentially. The first stage may specifically be the early stage of denoising, the second stage may specifically be the late stage of denoising, and the demarcation point between the first stage and the second stage may be set based on actual needs, which is not limited herein. The target description text includes a body description text and a detail description text, wherein the body description text may be text related to the subject character and the layout, and the detail description text may be text related to the hairstyle, clothing, and the like.
In multiple experiments and repeated researches on the denoising process, it is found that the drawing layout and the main body element drawing are mainly performed in the early stage of denoising, and the local detail drawing is completed in the later stage of denoising. Based on this finding, further, decoding the image hidden variable in the image domain to obtain a target image matched with the target subject, including:
In the first stage, a mask covers the detail description text, and a main body image hidden variable of the main body description text is decoded to obtain a main body image matched with a target theme; in the second stage, the mask covers the main description text, and detail image hidden variables of the detail description text are decoded to obtain a detail image; and rendering the detail image to a corresponding area in the main image to obtain a target image matched with the target theme.
Wherein, the mask coverage means that the detail description text or the main body description text is temporarily shielded so that the subject image generation model does not pay attention to the detail description text or the main body description text covered by the mask in the denoising process.
Specifically, by controlling the subject image generation model, the computer equipment makes only the main description text related to the subject person and the layout take effect in the early generation stage corresponding to the denoising process, and invalidates the detail description text through mask coverage, so as to ensure that more attention in the early stage is paid to the main description text. In the later generation stage, the reverse processing is performed, so that the detail description text more easily controls the generation of the image: the texts controlling large areas such as the subject character and the layout are covered by masks, ensuring that more attention in the later stage is on the detail description text. The two stages cooperate with each other, so the control force of the text over generation and the image fineness of the details can be further improved.
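A simplified sketch of the two-stage control described above, where mask coverage is approximated by passing only the un-covered text to the model at each step; the stage boundary ratio and this omission-based masking are illustrative assumptions rather than the prescribed implementation:

```python
def stage_prompt(main_text: str, detail_text: str, step: int, total_steps: int,
                 boundary: float = 0.5) -> str:
    """First stage: only the main description text takes effect; second stage: only the detail text."""
    if step < int(total_steps * boundary):
        return main_text           # early stage: layout and main body elements
    return detail_text             # late stage: local details

# Example: 50-step denoising, switching the effective text at step 25.
for step in range(50):
    prompt = stage_prompt("character: principal angle A, whole-body perspective",
                          "hairstyle: hair-binding, clothing: blue cotton-flax suit",
                          step, 50)
```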
In some embodiments, the computer device may improve the quality of the generated image by semantically cutting, controlling the manner in which objects are generated for different regions. Specifically, in some of these embodiments, the image generating method of the target subject further includes: performing semantic cutting on the target description text, and determining a layout diagram of each semantic object in the image, which is obtained by cutting, based on semantic object layout configuration parameters; an impact weight of an attention impact mechanism of each semantic object in each layout area in the layout diagram is determined.
The semantic cutting refers to a process of identifying entity objects in the text and cutting the identification result to obtain semantic objects. Through semantic cutting, the objects to be drawn in different areas of the image can be preset before the image is generated, and in the generation process, the aim of generating different objects in different areas can be achieved by influencing the weight of the attention influence mechanism of different text fragments in the different areas. The semantic object layout configuration parameters are used for determining the layout of the target image for the semantic objects to obtain a semantic object layout diagram, in which different semantic objects have different influence weights. The specific value of the influence weight can be obtained from the semantic object layout configuration parameters.
Further, decoding the image hidden variable in the image domain to obtain a target image matched with the target theme, including: and decoding the image hidden variable obtained based on the influence weight in the image domain to obtain a target image for enabling each semantic object to be laid out according to the layout.
In a specific embodiment, as shown in fig. 10, the input text is: "1 girl in a white skirt with long hair, and 1 girl with a red scarf beside her, tree, sky, grass, high detail, best quality"; the following semantic object layout diagram obtained by semantic cutting can be generated through the configuration file. The white area is "1 girl in a white skirt with long hair", the black area is "1 girl with a red scarf", the green area is "tree", the blue area is "sky", and the brown area is "grass". The influence weight of "1 girl in a white skirt with long hair" in the white area is 1.0, the influence weight of "1 girl with a red scarf" in the black area is 1.4, the influence weight of "tree" in the green area is 1.2, the influence weight of "sky" in the blue area is 0.2, and the influence weight of "grass" in the brown area is 0.2. Further, based on the semantic object layout configuration parameters and the input target description text, the generated image is shown in fig. 11, so as to ensure that different semantic objects appear in the different areas cut out in the layout.
In this embodiment, the generated objects in different areas are controlled by semantic cutting, so that the accuracy of the generated position of the semantic object can be effectively improved, the reasonable layout of the semantic object in the generated image is realized, and the quality of the generated image can be further effectively improved.
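A hedged sketch of turning semantic object layout configuration parameters into per-region influence weights for the attention influence mechanism; the region shapes and the dictionary structure are assumptions made only for illustration, while the weight values follow the example above:

```python
import numpy as np

H, W = 64, 64                                   # latent resolution (illustrative)
layout = {                                       # semantic object -> (region slice, influence weight)
    "1 girl in a white skirt with long hair": (np.s_[:, : W // 2], 1.0),
    "1 girl with a red scarf":                 (np.s_[:, W // 2:], 1.4),
    "tree":                                    (np.s_[: H // 3, :], 1.2),
    "sky":                                     (np.s_[: H // 4, :], 0.2),
    "grass":                                   (np.s_[2 * H // 3:, :], 0.2),
}

def region_weight_maps(layout, h, w):
    """Build one weight map per semantic object; the weight applies only inside its layout area."""
    maps = {}
    for text, (region, weight) in layout.items():
        m = np.zeros((h, w), dtype=np.float32)
        m[region] = weight
        maps[text] = m
    return maps

weight_maps = region_weight_maps(layout, H, W)
```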
In some embodiments, the sample image further comprises a depth map. The computer device may improve the quality of the generated image by a control means based on depth cutting. Specifically, the depth cutting is to introduce a depth map of a training image in the training process, so that a black-and-white channel of the depth map can influence the position of final image generation.
In the application process of the theme image generation model, the method further comprises the following steps: and acquiring a single-channel gray scale image with the same size as the target image to be generated, wherein black and white channels of the single-channel gray scale image are respectively used for positioning a foreground part and a background part of the target image. As shown in the left diagram of FIG. 12, in the process of generating an image by the subject image generation model, a preselected alpha image is input in addition to the input target descriptive text. The alpha image is a single-channel gray scale image, and the size of the gray scale image is the same as that of the target generated image. The portion with an alpha value of 1 (corresponding to pixel value 255) is a foreground, the portion with an alpha value of 0 (corresponding to pixel value 0) is a background, and the region between 0 and 1 (corresponding to pixel values 0 to 255) is a semitransparent portion.
Correspondingly, in the image generation process, denoising processing is performed under the guidance of a text vector based on white noise corresponding to a random hidden variable of a hidden variable domain so as to reconstruct the hidden variable of the image, and the method comprises the following steps: constructing an initial hidden variable based on white noise corresponding to a random hidden variable of a hidden variable domain and a black-and-white channel of a single-channel gray scale map; denoising processing is performed based on the initial hidden variable under the guidance of the text vector so as to reconstruct the image hidden variable.
The black-and-white channel of the single-channel gray scale image has the effect of indicating whether the position of a theme element in the target description text appears in the foreground part or the background part of the image. In one specific application, the initial hidden variable is formed jointly from the alpha image channel and random noise, so that the initial hidden variable and the input target description text interact together to generate an image conforming to the target theme. Specifically, as shown in the right diagram of fig. 12, in the image generated from the input target text "pineapple standing in front of the podium", the generated target object "pineapple" is well confined to the foreground part defined in the depth map, so the quality of the generated image can be effectively improved.
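A minimal sketch of constructing the initial hidden variable from the alpha image together with random white noise; the bilinear downscaling to latent size and the mixing coefficient are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def initial_latent_from_alpha(alpha_image: torch.Tensor, latent_shape=(1, 4, 64, 64),
                              alpha_strength: float = 0.5) -> torch.Tensor:
    """alpha_image: [1, 1, Height, Width] with values in [0, 1] (1 = foreground, 0 = background)."""
    _, _, h, w = latent_shape
    alpha_small = F.interpolate(alpha_image, size=(h, w), mode="bilinear", align_corners=False)
    noise = torch.randn(latent_shape)            # random white noise of the hidden variable domain
    # Foreground/background channels bias the initial hidden variable before denoising.
    return noise + alpha_strength * (alpha_small * 2.0 - 1.0)
```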
The application also provides an application scene, which applies the image generation method of the target theme. Specifically, the application of the image generation method of the target theme in the application scene is as follows:
Specifically, the scheme is based on a pre-training model of the Diffusion frame, and finetune training is carried out on the frame based on data of a specific IP, so that the problem that the Diffusion frame cannot stably generate images of the specific IP style is solved. In the key steps of training corpus construction, model training optimization, text generation reasoning and the like, a technology adapting to a specific IP style is integrated, so that the produced model can stably and reliably generate a high-quality image conforming to the IP content style according to an input text.
The method comprises the steps of learning the core scenes and core characteristics of an IP through the model, taking external celebrities or ordinary (non-celebrity) persons as the main bodies of drawing, and embedding them into the IP core scenes by generating images from texts, so that external persons and the IP core scenes are fused and embedding into the IP scene is realized. A user may also upload an image of their own, and IP style migration of the input image is completed by taking the hidden variable code of the image as the initial condition. In addition, the method can perform corresponding training based on the limited materials of the IP, learn the core elements of the IP, generate a large number of related materials by combining the characteristics, and publish them on social media to assist in the promotion of the IP.
The model is mainly composed of 3 parts, namely a control domain, a hidden variable domain, and an image domain. The image domain includes an image encoder and an image decoder for associating the image domain with the hidden variable domain. The image encoder is used for encoding and compressing an image vector with the dimension of [ Channel, Height, Width ] in the image domain to obtain a hidden variable with the dimension of [ LatentChannel, LatentHeight, LatentWidth ]. Typically, the image is RGB coded, Channel is 3, and Height and Width represent the height and width of the original image; LatentChannel is the number of channels in the hidden variable domain, for example 4, and LatentHeight and LatentWidth are scaled down from the original image by a power of 2 (e.g., 4, 8, 16, etc.): LatentHeight = Height/scale_factor, LatentWidth = Width/scale_factor. To ensure that the height and width of the hidden variable domain are integers, an alignment operation is typically performed on the input image, for example an image resize and pad operation, so as to ensure that the width and height of the image are divisible by the fixed scale_factor.
The processing procedure of the image decoder is the inverse of that of the image encoder, and is responsible for restoring the hidden variable of the hidden variable domain into an image vector of the image domain. For example, the image decoder may receive a hidden variable of dimension [ LatentChannel, LatentHeight, LatentWidth ] from the hidden variable domain, restore it to an image vector of [ Channel, Height, Width ], and generate the corresponding image.
In one embodiment, the image encoding and the image decoding may be implemented using a VAE framework, where the VAE framework includes an image encoder and an image decoder, and the image encoder is configured to encode an image vector of an input image into an hidden variable, specifically to estimate a mean E (z) and a variance V (z) of a gaussian distribution of each dimension of a hidden variable space, and then to sample the gaussian distribution based on the mean and the variance of each dimension, so as to obtain the hidden variable. The image decoder is for decoding the hidden variable into an image vector.
The control field includes a text encoder for converting natural text into text vectors of fixed dimensions. Specifically, the text encoder may employ a Transformer structure, and the steps include converting natural text into an ID sequence by word segmentation and vocabulary lookup through a tokenizer, and encoding the ID sequence into a sequence vector matrix of [ seq_Len, HiddenSize ] by a multi-layer Transformer structure, where seq_Len is the length of the sequence and HiddenSize is the dimension of the hidden vector.
The data processing process of the hidden variable domain comprises a diffusion process and a denoising process. The diffusion process refers to a process of gradually passing a data sample (hidden variable obtained by encoding an image through VAE) through noise addition to become random noise. The specific formula is as follows
X_t = α_t·X_{t-1} + β_t·ε_t,  ε_t ~ N(0, 1)
Wherein X_{t-1} is the hidden variable of the previous step, α_t and β_t are preset noise weights, and ε_t is randomly sampled Gaussian noise. For the initial hidden variable, the diffusion noising process is performed for 1000 steps or more; the noise weight is smaller in the initial stage and larger in the later stage, and in particular, the noise weight may follow a fixed weight schedule.
The denoising process learns the noise removed at each denoising step through a complex Unet model, so that the smaller the difference between the estimated X_{t-1}' and the real X_{t-1}, the better. The input of the denoising process is white noise; through multi-step noise estimation, the noise in the image becomes smaller and smaller, and finally a high-definition image is sampled. The Unet model training loss function is the difference between the noise estimated at time t and the actual noise (for example, Loss = ||ε_t − ε_θ(X_t, t)||², where ε_θ denotes the noise estimated by the Unet).
By minimizing the difference between the estimated noise at time t and the actual noise, the difference between the estimated X_{t-1}' and the real X_{t-1} can be minimized. The original image X at time t = 0 is finally restored by denoising over t steps. The model structure of the denoising process is a Unet structure, which comprises a group of downsamplers, a group of upsamplers, and a middle-layer sampler whose dimensions are kept unchanged. The downsampler may be a CNN + Transformer structure, and so may the upsampler; the middle layer has the same structure as the downsampler and upsampler, except that no dimension scaling is performed.
In the process of fine-tuning the pre-training model on the training corpus of the specific IP theme, the training framework mainly trains the text encoder of the control field and the Unet module of the denoising process. For the other two parts of the overall framework, the image encoder and the image decoder use a VAE trained in advance and only perform inference without participating in training, and the diffusion process is a fixed multi-step process of adding Gaussian noise and does not participate in training.
In the model application process, the goal is to randomly generate an image consistent with the prompt description based on the input prompt. The computer device inputs the target description text through the control field of the theme image generation model, and the text encoder of the control field converts the target description text into a text vector of the hidden variable domain to generate an image hidden variable Z_0 corresponding to the text vector; a random number generator seeded with a random seed produces Gaussian white noise as the random hidden variable of the hidden variable domain, then a multi-step denoising process is performed on the random hidden variable under the guidance of the text vector to reconstruct the image hidden variable Z_0, obtaining Z'_0, and the image decoder of the image domain produces the image vector to generate the target image. In this embodiment, the theme image generation model generates random hidden variables and denoises them under the guidance of text vectors, so that the reconstruction of the image hidden variable can be effectively realized, and rapid, high-quality image generation is achieved.
In the model training process, the constructed training corpus is an image related to the topic IP, and the specific acquisition process comprises the following steps:
Frames are extracted from the video files. The frame rate of the video files is 30 frames/s, an episode is about 40-120 min, and 70-210 k images can be extracted from each file on average. However, since the image changes very little between successive frames of a video file, only image frames with relatively large scene changes need to be extracted. By this method, 2k-4k images are extracted from a 40-min episode. Specifically, the image frames may be compared sequentially; for two image frames that are the comparison targets, the computer device performs a difference operation on the pixel-granularity parameter values at the same pixel arrangement positions of the two image frames to obtain a pixel-granularity difference value for each pixel arrangement position, and then accumulates the absolute values of the pixel-granularity differences at the pixel arrangement positions to obtain a sum of absolute values, which characterizes the difference between the two image frames being compared.
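A minimal sketch of the frame-difference measure described above; the ratio-based comparison mirrors the scene change threshold mentioned later for the apparatus, and the threshold value 0.1 is illustrative:

```python
import numpy as np

def frame_difference(frame_a: np.ndarray, frame_b: np.ndarray) -> float:
    """Sum of absolute pixel-granularity differences at the same pixel arrangement positions."""
    return float(np.abs(frame_a.astype(np.int32) - frame_b.astype(np.int32)).sum())

def is_scene_change(frame_a: np.ndarray, frame_b: np.ndarray, threshold: float = 0.1) -> bool:
    """Treat the frames as candidates when the difference is large relative to the total pixel value."""
    total = float(frame_a.astype(np.int32).sum()) + 1e-8
    return frame_difference(frame_a, frame_b) / total > threshold
```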
In addition, the IP video may include opening and ending credits, advertisements, and transition images; moreover, the story may include supporting characters, scenes with low recognizability, images with low definition, and images in which characters are cut off incompletely. These are not suitable as training corpus and may be regarded as invalid images; they may be identified by image identification tools, including opening/ending identification, advertisement identification, and an image definition model, and the invalid images are filtered out based on these tools.
In addition, an IP video frame is generally a wide-screen landscape image of 1920x1080, while the required size of the training input image is smaller. If the image is scaled directly, this results in a loss of image sharpness and also in insufficient resolution of the details of the main elements. Therefore, it is necessary to locate the subject element and crop an image of the target size centered on the subject element.
Each image obtained after cropping is given a text label, and the text label describes the theme elements appearing in the image in words, including the main object (the main body of the image, generally the foreground), the image background, the shooting viewing angle and distance, and some detail descriptions (clothing, expression, color, action, face, limb details, and the like). The main object may be a character in the IP video, and may specifically be described by the character's name, such as main angle Zhang Xiaoming in "sitcom A". The image background may be a scene in the IP video, such as a building, a natural landscape, or a story line, for example the ancient pavilion, the ancient temple, or the famous hand-held big flag scene in TV play A. Detail description - apparel class: special clothing in the IP video, such as the clothing of the male and female main characters in TV play A, described by natural text: light yellow-green collar brocade gown, blue cotton-flax suit, blue-and-white checked cotton-flax suit, dark purple silk waistcoat gown, and white cotton-flax suit. Detail description - expression class: the classification of the expressions of the characters in the IP video, including calm, serious, happy, anger, doubt, and the like. Based on the above text descriptions, one or more text descriptions are generated for each image frame, forming training data pairs with the image frame.
As shown in fig. 13, taking "drama A" as an example, the following training data may be generated. For the left image of fig. 13, the labeled text is: [ character: principal angle A, viewing angle: big head close-up, clothing: blue cotton-flax suit, hairstyle: hair is bundled, action: sitting in the car and looking outside, expression: calm, background: carriage ]. For the right image of fig. 13, the labeled text is: [ character: principal angle A, viewing angle: whole-body perspective, clothing: blue cotton-flax suit, hairstyle: half with dishevelled hair, action: standing and looking into the distance, expression: serious, background: square space ].
Based on the text description of the IP, for each attribute feature, a closed text set can be constructed as a feature attribute vocabulary. Taking the above "drama A" as an example, the feature attribute vocabularies may be constructed as follows:
For example, the clothing text set: [ light yellow-green collar brocade gown, blue cotton-flax suit, blue-and-white checked cotton-flax suit, dark purple silk waistcoat gown, white cotton-flax suit ]; the hairstyle text set: [ half with dishevelled hair, hair binding ]; the expression text set: [ calm, serious, surprise, happy, anger, doubt ]; the viewing angle text set: [ big head close range, half body middle range, whole body distant range ].
During the application of the model, the text description portion input to the model may be obtained from the open text description as well as the closed attribute vocabularies. The open text description is generally used to describe the main character, which, in addition to the characters in the IP domain, may be an external famous person or a person represented by an ordinary photo uploaded by the user. Taking "drama A" as an example, the characters in the IP domain include principal angle A, principal angle B, principal angle C, and so on. The external famous person can be any well-known person, such as a famous movie actor or singer; the photo uploaded by the user is generally of a non-famous person, for example a person corresponding to a photo containing that person's head portrait taken by the user with a camera.
Based on the above element composition of the text input, a free input text can be constructed and used as the input of model reasoning, for example "character: principal angle A, viewing angle: middle body view, hairstyle: half with dishevelled hair, clothing: blue cotton-flax suit"; the generated image effect is shown in fig. 14.
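A minimal sketch of assembling such a free input text: the character comes from the open text description, and the other topic element description texts are each selected from the closed feature attribute vocabularies above; the joining format is an assumption:

```python
import random

closed_sets = {
    "viewing angle": ["big head close range", "half body middle range", "whole body distant range"],
    "hairstyle": ["half with dishevelled hair", "hair binding"],
    "clothing": ["light yellow-green collar brocade gown", "blue cotton-flax suit",
                 "blue-and-white checked cotton-flax suit", "dark purple silk waistcoat gown",
                 "white cotton-flax suit"],
    "expression": ["calm", "serious", "surprise", "happy", "anger", "doubt"],
}

def build_input_text(character: str) -> str:
    """Combine one description text from each closed set with the open character description."""
    parts = [f"character: {character}"]
    parts += [f"{name}: {random.choice(options)}" for name, options in closed_sets.items()]
    return ", ".join(parts)

print(build_input_text("principal angle A"))
# e.g. "character: principal angle A, viewing angle: half body middle range, hairstyle: hair binding, ..."
```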
In the model application process, in order to further improve the image generation quality and ensure the text control capability, the scheme of the application also adds a plurality of text vector coding and fine adjustment modules, and correspondingly adjusts the denoising process.
The topic image generation model obtained by fine-tuning with the standard training process and model framework has the problem of weak control over some features, so that the generated image is inconsistent with the text description. In order to solve this technical problem, in this embodiment the key features are vector weighted in the text vector field, so that the key features have higher weight in the image generation process compared with other texts. The specific implementation is that the weight is fused with the text in the text construction process, and corresponding parsing and weighting are performed in the text encoding process. In one specific application, the target description text is "character: principal angle A, viewing angle: (half body view: 1.1), hairstyle: hair-binding, clothing: (blue cotton-flax suit: 1.2)". By means of (topic element description text: weight), the text portion to be weighted can be determined, and different weights are applied to different texts; text without brackets may default to a weight of 1.0. Based on the weight data of the key description text, vector weighting processing can be performed on the key description text while the text encoder executes the vector conversion, so as to obtain the text vector corresponding to the target description text, which can effectively improve the expression of key features in the generated image and further improve the image quality.
The semantic control capability of different layers of the text encoder and the sharpness of the final image may differ. Taking a text encoder implemented as a Transformer model as an example, the penultimate layer of the Transformer can improve the control capability of the text. The text vectors of the penultimate layer and the last layer of the Transformer model are fused by interpolation, in the following specific manner:
text_emb=alpha*text_emb(L-1)+(1-alpha)*text_emb(L)
Wherein text_emb(L-1) is the text vector of the penultimate layer, with dimension [ seq_Len, HiddenSize ]; text_emb(L) is the text vector of the last layer, whose dimension is consistent with text_emb(L-1), so linear interpolation fusion can be performed; alpha is the weight of the linear interpolation, a decimal in [0, 1], and the higher the value, the higher the weight of the penultimate layer. By performing feature fusion on features from different layers of the text encoder, a vector with better semantic control capability can be fused into the output layer of the text encoder, so that the text vector produced by the text encoder has better semantic control capability in the image generation process, and the generated image quality can be effectively improved.
In the early stage of denoising, the drawing layout and the main body elements are mainly drawn, and the local details are completed in the later stage of denoising. By controlling the subject image generation model, the computer equipment makes only the main description text related to the subject person and the layout take effect in the early generation stage corresponding to the denoising process, and invalidates the detail description text through mask coverage, so as to ensure that more attention in the early stage is paid to the main description text. In the later generation stage, the reverse processing is performed, so that the detail description text more easily controls the generation of the image: the texts controlling large areas such as the subject character and the layout are covered by masks, ensuring that more attention in the later stage is on the detail description text. The two stages cooperate with each other, so the control force of the text over generation and the image fineness of the details can be further improved.
The computer device can improve the quality of the generated image by semantic cutting, controlling the objects generated in different areas. Through semantic cutting, the objects to be drawn in different areas of the image can be preset before the image is generated, and in the generation process, the aim of generating different objects in different areas can be achieved by influencing the weight of the attention influence mechanism of different text fragments in the different areas. The semantic object layout configuration parameters are used for determining the layout of the target image for the semantic objects to obtain a semantic object layout diagram, in which different semantic objects have different influence weights. For example, the input text is: "1 girl in a white skirt with long hair, and 1 girl with a red scarf beside her, tree, sky, grass, high detail, best quality"; the semantic object layout diagram obtained by semantic cutting can be generated through the configuration file. The white area is "1 girl in a white skirt with long hair", the black area is "1 girl with a red scarf", the green area is "tree", the blue area is "sky", and the brown area is "grass". The influence weight of "1 girl in a white skirt with long hair" in the white area is 1.0, the influence weight of "1 girl with a red scarf" in the black area is 1.4, the influence weight of "tree" in the green area is 1.2, the influence weight of "sky" in the blue area is 0.2, and the influence weight of "grass" in the brown area is 0.2. Further, based on the semantic object layout configuration parameters and the input target description text, the generated image is shown in fig. 11, so as to ensure that different semantic objects appear in the different areas cut out in the layout.
In addition, in the process of generating an image with the subject image generation model, a preselected alpha image may be input in addition to the input target description text. The alpha image is a single-channel gray scale image whose size is the same as that of the target generated image. The initial hidden variable is formed jointly from the alpha image channel and random noise, so that the initial hidden variable and the input target description text work together to generate an image conforming to the target subject; in the image generated from the input target text "pineapple standing in front of the podium", the generated target object "pineapple" is well confined to the foreground part defined in the depth map, so the quality of the generated image can be effectively improved. The present application is aimed at customized training for the content style and scenes of a specific IP, such as a certain video IP, so that the model after finetune training can stably and controllably generate images conforming to the IP content style according to the needs of users. In addition, a model trained according to this scheme can also be used as a basic model for other technical schemes, such as generating images from images or personalized customization.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the present application also provides an image generating apparatus for realizing the above-mentioned target subject image generating method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the image generating device for one or more target subjects provided below may refer to the limitation of the image generating method for the target subject hereinabove, and will not be repeated herein.
In one embodiment, as shown in fig. 15, there is provided an image generating apparatus of a target subject, including: a model acquisition module 1502, a text set determination module 1504, a descriptive text determination module 1506, and a target image generation module 1508, wherein:
A model obtaining module 1502, configured to obtain a theme image generation model obtained by performing secondary training on a pre-training model through a sample image; the pre-training model is used for generating an image according to the text; the sample image contains at least one theme element that conforms to the target theme.
And a text set determining module 1504, configured to obtain a text set containing the topic element description text of the same type, based on topic element description text that is carried by each sample image and is used for describing the topic element.
The description text determining module 1506 is configured to select a topic element description text from at least a portion of the text set, and combine the topic element description texts to obtain a target description text containing the selected topic element description text.
And the target image generating module 1508 is configured to perform image generating processing according to the target description text through the theme image generating model, so as to obtain a target image matched with the target theme.
In some embodiments, the image generating device of the target subject further includes a sample image obtaining module, configured to obtain a target video conforming to the target subject, and perform frame extraction processing on the target video to obtain a plurality of candidate image frames with differences in scenes; respectively carrying out theme element identification on each candidate image frame, and determining a preferred image frame containing at least one theme element; and based on the theme elements, performing theme element description text labeling on the preferred image frame to obtain a sample image.
In some embodiments, the sample image obtaining module is further configured to perform frame-splitting processing on the target video to obtain a plurality of image frames with the same pixel arrangement; determining, for two video frames of the plurality of image frames, a sum of absolute values of pixel granularity differences for each of the pixel arrangement positions based on pixel granularity differences for the same pixel arrangement position; and when the ratio between the sum of the absolute values and the total pixel granularity value of the image frames is larger than a scene change threshold value, determining the two image frames as candidate image frames with differences in scenes.
In some embodiments, the sample image obtaining module is further configured to obtain attribute information and content information of each candidate image frame respectively; and discarding candidate image frames of which at least one of the attribute information and the content information does not meet the theme element identification condition to obtain a preferred image frame containing at least one theme element.
In some embodiments, the sample image obtaining module is further configured to perform, for each of the preferred image frames, center point positioning on an object element in the preferred image frame to obtain a positioning position; according to the size condition of the sample image, taking the positioning position as the center, carrying out image cutting on the preferable image frame to obtain a cut preferable image frame; and performing theme element description text labeling on the cut preferred image frame to obtain a sample image.
In some embodiments, the text sets include an open text set and a closed text set; the open text set includes object element description text of object elements in the subject element and object element description text of external object elements;
The descriptive text determining module is further used for selecting object element descriptive text from the open text set; and combining the object element description text with topic element description texts respectively selected from at least a part of the closed text set to obtain a target text combination.
In some embodiments, the subject image generation model includes a control field, a hidden variable field, and an image field;
The target image generation module is further used for converting the target description text into a text vector based on the target description text input from the control field; denoising under the guidance of the text vector based on white noise corresponding to the random hidden variable of the hidden variable domain so as to reconstruct an image hidden variable; and decoding the image hidden variable in the image domain to obtain a target image matched with the target theme.
In some embodiments, the target image generating module is further configured to obtain, for the target description text input from the control field, weight data of key description text in the target description text; and carrying out vector weighting processing on the key description text in the vector conversion process based on the weight data of the key description text to obtain a text vector corresponding to the target description text.
In some embodiments, the control field includes a text encoder that converts text into vectors; the text encoder includes at least two network layers; the target image generation module is further used for sequentially performing vector conversion processing on the input target description text according to each network layer in the text encoder to obtain output characteristics of each network layer; performing feature fusion on the output features of the target network layer and the output features of the last network layer of the text encoder to obtain a text vector corresponding to the target description text; the target network layer is a network layer used for improving text control capability in the text encoder.
In some embodiments, the input text of the control field further comprises negative description text; the target image generation module is further used for taking the estimated noise of the negative description text as a fusion object in the denoising process, and fusing the estimated noise with the estimated noise of the target description text to obtain fused estimated noise; and denoising under the guidance of the text vector according to the fusion estimated noise based on white noise corresponding to the random hidden variable of the hidden variable domain so as to reconstruct the image hidden variable.
In some embodiments, the denoising process includes a first stage and a second stage that occur sequentially; the target description text comprises a main description text and a detail description text; the target image generating module is further configured to, in the first stage, cover the detail description text with a mask, and decode a hidden variable of a main body image of the main body description text to obtain a main body image matched with the target subject; in the second stage, the mask covers the main description text, and detail image hidden variables of the detail description text are decoded to obtain a detail image; and rendering the detail image to a corresponding area in the main image to obtain a target image matched with the target theme.
In some embodiments, the target image generating module is further configured to perform semantic cutting on the target description text, and determine a layout diagram of each semantic object obtained by cutting in the image based on the semantic object layout configuration parameters; determining the influence weight of the attention influence mechanism of each semantic object in each layout area in the layout diagram; and decoding the image hidden variable obtained based on the influence weight in the image domain to obtain a target image for enabling each semantic object to be laid out according to the layout.
In some embodiments, the sample image further comprises a depth map; the target image generation module is further used for acquiring a single-channel gray scale image with the same size as the target image to be generated, and black and white channels of the single-channel gray scale image are respectively used for positioning a foreground part and a background part of the target image; constructing an initial hidden variable based on white noise corresponding to the random hidden variable of the hidden variable domain and a black-and-white channel of the single-channel gray scale image; denoising processing is carried out based on the initial hidden variable under the guidance of the text vector so as to reconstruct the image hidden variable.
According to the image generating device of the target theme, a theme image generation model obtained by performing secondary training on the pre-training model is acquired, i.e., a model capable of generating an image conforming to the target theme from text. In the process of determining the text input to the model, at least one theme element conforming to the target theme and the theme element description texts carried by the sample images used to train the theme image generation model are utilized, so that text sets containing theme element description texts of the same type can serve as text components of the model input. Theme element description texts are selected from at least a part of the text sets respectively and combined into a target description text containing the selected theme element description texts, so that the theme image generation model can perform image generation processing according to the target description text and obtain a high-quality target image that differs from the sample images and matches the target theme.
The respective modules in the image generating apparatus of the above-described target subject may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in Fig. 16. The computer device includes a processor, a memory, an input/output (I/O) interface and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, a computer program and a database, and the internal memory provides a running environment for the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores the pre-training model and the sample images. The input/output interface of the computer device exchanges information between the processor and external devices. The communication interface of the computer device communicates with an external terminal through a network connection. When executed by the processor, the computer program implements an image generation method for a target theme.
In one embodiment, a computer device is provided, which may be a terminal whose internal structure may be as shown in Fig. 17. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides a running environment for them. The input/output interface of the computer device exchanges information between the processor and external devices. The communication interface of the computer device performs wired or wireless communication with an external terminal; the wireless mode may be implemented through Wi-Fi, a mobile cellular network, NFC (near-field communication) or other technologies. When executed by the processor, the computer program implements an image generation method for a target theme. The display unit of the computer device forms a visual picture and may be a display screen, a projection device or a virtual-reality imaging device, where the display screen may be a liquid-crystal display screen or an electronic-ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad or mouse.
It will be appreciated by those skilled in the art that the structures shown in Figs. 16 and 17 are merely block diagrams of the parts of the structure related to the solution of the present application and do not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) involved in the present application are information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
Those skilled in the art will appreciate that all or part of the procedures of the above method embodiments may be implemented by a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may include the procedures of the above method embodiments. Any reference to memory, database or other medium used in the embodiments provided by the present application may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. The volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, the RAM may take various forms such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided by the present application may include at least one of a relational database and a non-relational database; the non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processors referred to in the embodiments provided by the present application may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, or data processing logic units based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of the technical features contains no contradiction, it should be regarded as falling within the scope of this specification.
The foregoing embodiments express only a few implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the application. It should be noted that several variations and improvements can be made by those skilled in the art without departing from the concept of the present application, and all of them fall within the protection scope of the application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (17)

1. A method for generating an image of a target theme, the method comprising:
obtaining a theme image generation model obtained by performing secondary training on a pre-training model through a sample image, wherein the pre-training model is used for generating an image according to a text, and the sample image comprises at least one theme element conforming to the target theme;
acquiring text sets each containing theme element description texts of the same type, based on the theme element description text carried by each sample image for describing the theme element;
respectively selecting theme element description texts from at least a part of the text sets, and combining them to obtain a target description text containing the selected theme element description texts; and
performing image generation processing according to the target description text through the theme image generation model to obtain a target image matched with the target theme.
2. The method according to claim 1, wherein the method further comprises:
acquiring a target video conforming to the target theme, and performing frame extraction processing on the target video to obtain a plurality of candidate image frames having scene differences;
respectively performing theme element identification on each candidate image frame, and determining a preferred image frame containing at least one theme element; and
performing, based on the theme elements, theme element description text labeling on the preferred image frame to obtain a sample image.
3. The method according to claim 2, wherein the performing frame extraction processing on the target video to obtain a plurality of candidate image frames having scene differences comprises:
performing framing processing on the target video to obtain a plurality of image frames with the same pixel arrangement;
determining, for two image frames among the plurality of image frames, the sum of the absolute values of the pixel granularity differences over all pixel arrangement positions, based on the pixel granularity difference at each same pixel arrangement position; and
when the ratio between the sum of the absolute values and the total pixel granularity value of the image frame is larger than a scene change threshold, determining the two image frames as candidate image frames having a scene difference.
4. The method according to claim 2, wherein the respectively performing theme element identification on each candidate image frame and determining a preferred image frame containing at least one theme element comprises:
respectively acquiring attribute information and content information of each candidate image frame; and
discarding candidate image frames in which at least one of the attribute information and the content information does not meet the theme element identification condition, to obtain a preferred image frame containing at least one theme element.
5. The method according to claim 4, wherein the theme elements include object elements conforming to the target theme, and the performing theme element description text labeling on the preferred image frame based on the theme elements to obtain a sample image comprises:
for each preferred image frame, locating the center point of the object element in the preferred image frame to obtain a positioning position;
cropping the preferred image frame around the positioning position according to the size condition of the sample image, to obtain a cropped preferred image frame; and
performing theme element description text labeling on the cropped preferred image frame to obtain a sample image.
6. The method according to claim 1, wherein the text sets comprise an open text set and a closed text set, the open text set including object element description texts of object elements among the theme elements and object element description texts of external object elements;
and the respectively selecting theme element description texts from at least a part of the text sets and combining them to obtain a target description text containing the selected theme element description texts comprises:
selecting an object element description text from the open text set; and
combining the object element description text with the theme element description texts respectively selected from at least a part of the closed text set, to obtain the target description text.
7. The method according to any one of claims 1 to 6, wherein the theme image generation model includes a control domain, a hidden variable domain and an image domain;
and the performing image generation processing according to the target description text through the theme image generation model to obtain a target image matched with the target theme comprises:
converting the target description text into a text vector based on the target description text input from the control domain;
performing denoising processing under the guidance of the text vector based on white noise corresponding to a random hidden variable of the hidden variable domain, so as to reconstruct an image hidden variable; and
decoding the image hidden variable in the image domain to obtain the target image matched with the target theme.
8. The method according to claim 7, wherein the converting the target description text into a text vector based on the target description text input from the control domain comprises:
for the target description text input from the control domain, acquiring weight data of the key description text in the target description text; and
performing vector weighting processing on the key description text during vector conversion based on the weight data of the key description text, to obtain a text vector corresponding to the target description text.
9. The method according to claim 7, wherein the control domain comprises a text encoder that converts text into vectors, the text encoder including at least two network layers;
and the converting the target description text into a text vector based on the target description text input from the control domain comprises:
sequentially performing vector conversion processing on the input target description text through each network layer in the text encoder, to obtain the output features of each network layer; and
performing feature fusion on the output features of a target network layer and the output features of the last network layer of the text encoder, to obtain a text vector corresponding to the target description text, where the target network layer is a network layer in the text encoder used for improving text control capability.
10. The method according to claim 7, wherein the input text of the control domain further comprises a negative description text;
and the performing denoising processing under the guidance of the text vector based on white noise corresponding to the random hidden variable of the hidden variable domain to reconstruct an image hidden variable comprises:
taking the estimated noise of the negative description text as a fusion object in the denoising process, and fusing it with the estimated noise of the target description text to obtain fused estimated noise; and
performing denoising processing under the guidance of the text vector according to the fused estimated noise, based on white noise corresponding to the random hidden variable of the hidden variable domain, so as to reconstruct the image hidden variable.
11. The method according to claim 7, wherein the denoising process comprises a first stage and a second stage that occur in sequence, and the target description text comprises a main description text and a detail description text;
and the decoding the image hidden variable in the image domain to obtain a target image matched with the target theme comprises:
in the first stage, covering the detail description text with a mask, and decoding the main image hidden variable corresponding to the main description text to obtain a main image matched with the target theme;
in the second stage, covering the main description text with the mask, and decoding the detail image hidden variable corresponding to the detail description text to obtain a detail image; and
rendering the detail image into the corresponding region of the main image to obtain the target image matched with the target theme.
12. The method according to claim 7, wherein the method further comprises:
performing semantic cutting on the target description text, and determining, based on semantic object layout configuration parameters, a layout diagram that places each semantic object obtained by the cutting within the image; and
determining the influence weight of the attention influence mechanism for each semantic object in each layout region of the layout diagram;
and the decoding the image hidden variable in the image domain to obtain a target image matched with the target theme comprises:
decoding, in the image domain, the image hidden variable obtained based on the influence weights, to obtain a target image in which each semantic object is laid out according to the layout diagram.
13. The method according to claim 7, wherein the sample image further comprises a depth map, and the method further comprises:
acquiring a single-channel gray scale image of the same size as the target image to be generated, the black and white channels of the single-channel gray scale image being used to locate the foreground part and the background part of the target image respectively;
and the performing denoising processing under the guidance of the text vector based on white noise corresponding to the random hidden variable of the hidden variable domain to reconstruct an image hidden variable comprises:
constructing an initial hidden variable based on the white noise corresponding to the random hidden variable of the hidden variable domain and the black-and-white channel of the single-channel gray scale image; and
performing denoising processing based on the initial hidden variable under the guidance of the text vector, so as to reconstruct the image hidden variable.
14. An image generation apparatus for a target theme, the apparatus comprising:
a model acquisition module, configured to acquire a theme image generation model obtained by performing secondary training on a pre-training model through a sample image, wherein the pre-training model is used for generating an image according to a text, and the sample image comprises at least one theme element conforming to the target theme;
a text set determination module, configured to acquire text sets each containing theme element description texts of the same type, based on the theme element description text carried by each sample image for describing the theme element;
a description text determination module, configured to respectively select theme element description texts from at least a part of the text sets and combine them to obtain a target description text containing the selected theme element description texts; and
a target image generation module, configured to perform image generation processing according to the target description text through the theme image generation model, to obtain a target image matched with the target theme.
15. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 13 when the computer program is executed.
16. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 13.
17. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 13.
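Claims 3 and 5 above describe, respectively, a frame-difference test for scene changes and cropping of a preferred frame around a located object element. The snippet below is a hedged sketch of both under simple assumptions: the change threshold, the crop size and the use of plain NumPy arrays are illustrative choices, not requirements of this application.

```python
import numpy as np

def is_scene_change(frame_a, frame_b, threshold=0.30):
    # frame_a, frame_b: uint8 arrays of identical shape (H, W[, C]).
    a = frame_a.astype(np.int64)
    b = frame_b.astype(np.int64)
    diff_sum = np.abs(a - b).sum()          # sum of absolute pixel differences
    total = max(int(a.sum()), 1)            # total pixel value of the reference frame
    return diff_sum / total > threshold     # claim 3: ratio vs. scene change threshold

def center_crop(frame, center_xy, size):
    # Crop a size x size window around the located object center (claim 5),
    # clamped so the window stays inside the frame.
    h, w = frame.shape[:2]
    cx, cy = center_xy
    x0 = min(max(cx - size // 2, 0), max(w - size, 0))
    y0 = min(max(cy - size // 2, 0), max(h - size, 0))
    return frame[y0:y0 + size, x0:x0 + size]
```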
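Claims 8 to 10 describe weighting key phrases of the description text, fusing an intermediate text-encoder layer with its last layer, and fusing the estimated noise of a negative description text with that of the target text. A minimal sketch of these three operations follows; the callables encode_tokens and eps_model, the fusion weight and the guidance scale are assumptions, and the claims do not fix these particular formulations.

```python
import numpy as np

def weighted_text_vector(encode_tokens, tokens, key_weights):
    # Claim 8: scale the embedding of each key token by its weight data.
    vecs = encode_tokens(tokens)                              # (T, D) token embeddings
    scale = np.array([key_weights.get(t, 1.0) for t in tokens], dtype=np.float32)
    return vecs * scale[:, None]

def fused_text_features(layer_outputs, target_layer, alpha=0.5):
    # Claim 9: fuse the output of a chosen intermediate encoder layer with the
    # output of the last layer; alpha is an illustrative fusion weight.
    return alpha * layer_outputs[target_layer] + (1.0 - alpha) * layer_outputs[-1]

def fused_noise(eps_model, latent, t, cond_vec, negative_vec, guidance=7.5):
    # Claim 10: fuse the estimated noise of the negative text with that of the
    # target text, written here as a classifier-free-guidance style update.
    eps_neg = eps_model(latent, t, negative_vec)
    eps_pos = eps_model(latent, t, cond_vec)
    return eps_neg + guidance * (eps_pos - eps_neg)
```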
CN202310335853.XA 2023-03-24 2023-03-24 Image generation method and device of target theme, computer equipment and storage medium Pending CN118691923A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310335853.XA CN118691923A (en) 2023-03-24 2023-03-24 Image generation method and device of target theme, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310335853.XA CN118691923A (en) 2023-03-24 2023-03-24 Image generation method and device of target theme, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118691923A true CN118691923A (en) 2024-09-24

Family

ID=92765287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310335853.XA Pending CN118691923A (en) 2023-03-24 2023-03-24 Image generation method and device of target theme, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118691923A (en)


Legal Events

Date Code Title Description
PB01 Publication