CN118096943A - Poster generation method and device - Google Patents
Poster generation method and device
- Publication number
- CN118096943A (application number CN202410216808.7A)
- Authority
- CN
- China
- Prior art keywords
- network
- poster
- text
- layout
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/60—Editing figures and text; Combining figures or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
- G06V30/1801—Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
- G06V30/18019—Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections by matching or filtering
- G06V30/18038—Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters
- G06V30/18048—Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters with interaction between the responses of different filters, e.g. cortical complex cells
- G06V30/18057—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
Abstract
The application discloses a poster generation method and device. One embodiment of the method comprises the following steps: extracting features of the obtained target object image and the target text to obtain image features and text features; determining layout information of a plurality of visual elements in a target poster to be generated according to image features and text features through a pre-trained planning network; and generating a target poster according to the target object image and the layout information through a pre-trained rendering network. The application provides an end-to-end poster generation method based on a planning network and a rendering network, which is used for determining the layout structure of a poster through the planning network and generating a corresponding poster image according to the layout structure through the rendering network, thereby improving the generation efficiency, quality and diversity of the poster.
Description
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to the technical field of artificial intelligence, and particularly relates to a method and a device for generating a poster, a computer readable medium and electronic equipment.
Background
The article poster plays a key role in the promotion of the article. An exquisite poster should not only contain a reasonable layout of elements (e.g., elements such as substrates, text, objects, etc.), but also have a background that is harmonious with the objects. Thus, the task of generating a poster is typically done by a poster designer.
Disclosure of Invention
The embodiment of the application provides a poster generation method, a poster generation device, a computer readable medium and electronic equipment.
In a first aspect, an embodiment of the present application provides a method for generating a poster, including: extracting features of the obtained target object image and the target text to obtain image features and text features; determining layout information of a plurality of visual elements in a target poster to be generated according to image features and text features through a pre-trained planning network; and generating a target poster according to the target object image and the layout information through a pre-trained rendering network.
In some examples, determining layout information of a plurality of visual elements in the target poster to be generated according to the image features and the text features through the pretrained planning network includes: initializing layout of a plurality of visual elements according to the target object image and the target text through a planning network to obtain initialized layout information; and carrying out iterative denoising on the initialized layout information according to the image characteristics and the text characteristics through a planning network to obtain the layout information.
In some examples, the planning network includes a first fully-connected layer, a second fully-connected layer, and a plurality of serially connected planning transformer modules, and the iteratively denoising, by the planning network, the initialized layout information according to the image features and the text features to obtain the layout information includes: iteratively executing the following denoising operations corresponding to the time steps, and determining the denoised layout information corresponding to the last time step as the layout information: processing the denoised layout information obtained by the denoising operation corresponding to the previous time step through the first fully-connected layer to obtain layout characterization information, wherein the first fully-connected layer corresponding to the first time step processes the initialized layout information; obtaining processed layout information according to the layout characterization information, the time step, the image features and the text features through the plurality of planning transformer modules; and generating the denoised layout information of the current time step according to the processed layout information through the second fully-connected layer.
In some examples, the planning transformer module includes an adaptive normalization layer, a self-attention layer, and a cross-attention layer, and the obtaining, by the plurality of planning transformer modules, processed layout information according to the layout characterization information, the time step, the image features, and the text features includes: processing, through the adaptive normalization layer, the denoised layout information corresponding to the previous time step and the previous time step to obtain first processing data; combining the second processing data, obtained by the self-attention layer based on the first processing data, with the first processing data to obtain third processing data; and combining the fourth processing data, obtained by the cross-attention layer based on the third processing data, the image features and the text features, with the third processing data to obtain the processed layout information.
In some examples, the rendering network includes a layout branch, a visual branch, a control network, a stable diffusion network, and the pre-trained rendering network generates a target poster from layout information and a target item image, including: a plurality of mask images corresponding to the visual elements one by one are fused through a space fusion module in the layout branch, so that space layout representation data are obtained; geometrically transforming the target object image according to the layout information through the visual branch to obtain a repositioning image, and determining visual representation data of the repositioning image; and using the spatial layout representation data and the visual representation data as control conditions of a control network to guide the stable diffusion network to generate a target poster.
In some examples, the above-described rendering network further includes a text rendering module; and the control conditions using the spatial layout characterization data and the visual characterization data as the control network, and guiding the stable diffusion network to generate the target poster, comprising: the space layout representation data and the visual representation data are used as control conditions of a control network to guide a stable diffusion network to generate an initial poster; and rendering the target text in the initial poster according to the color of the text region of the target text in the initial poster by a text rendering module to obtain the target poster.
In some examples, the spatial fusion module includes: a convolutional network and a plurality of serially connected vision transformer modules, and the fusing, through the spatial fusion module in the layout branch, of a plurality of mask images in one-to-one correspondence with the plurality of visual elements to obtain spatial layout characterization data includes the following steps: processing each of the plurality of mask images through the convolutional network to obtain a plurality of processed mask images; dividing each processed mask image of the plurality of processed mask images into a plurality of patches, and fusing the corresponding patches across the plurality of processed mask images to obtain a plurality of fused patches; and respectively processing the fused patches through the plurality of vision transformer modules to obtain a plurality of processed patches, and generating spatial layout characterization data comprising the plurality of processed patches.
In some examples, the vision transformer module includes a multi-head self-attention layer and a linear network with layer normalization operations, and the processing of the plurality of fused patches through the plurality of vision transformer modules to obtain a plurality of processed patches includes: for each of the plurality of vision transformer modules, performing the following operations: for each fused patch of the plurality of fused patches, processing the corresponding intermediate fused patch output by the previous vision transformer module through the multi-head self-attention layer in the vision transformer module to obtain a first processed fused patch, wherein the first vision transformer module processes the fused patch directly; combining the intermediate fused patch and the first processed fused patch to obtain a second processed fused patch; processing the second processed fused patch through the linear network to obtain a third processed fused patch; and combining the second processed fused patch and the third processed fused patch to obtain the intermediate fused patch output by the vision transformer module.
In some examples, the controlling conditions using the spatial layout characterization data and the visual characterization data as the control network instruct the stable diffusion network to generate the initial poster, including: taking the spatial layout representation data and the visual representation data as control conditions of a control network, and generating a noise poster based on the random noise guiding stable diffusion network; and carrying out iterative denoising on the noise poster to obtain an initial poster.
In some examples, the feature extraction of the obtained target object image and the target text to obtain the image feature and the text feature includes: and respectively extracting features of the target object image and the target text through a visual encoder and a language encoder in a pre-trained feature extraction model to obtain image features and text features, wherein the feature extraction model is obtained based on a contrast learning task of image-text pairs, mask language modeling and matching task training of the image-text pairs.
In a second aspect, an embodiment of the present application provides a poster generating apparatus, including: the extraction unit is configured to perform feature extraction on the acquired target object image and target text to obtain image features and text features; a planning unit configured to determine layout information of a plurality of visual elements in a target poster to be generated according to image features and text features through a pre-trained planning network; and the rendering unit is configured to generate a target poster according to the target object image and the layout information through a pre-trained rendering network.
In some examples, the planning unit is further configured to: initializing layout of a plurality of visual elements according to the target object image and the target text through a planning network to obtain initialized layout information; and carrying out iterative denoising on the initialized layout information according to the image characteristics and the text characteristics through a planning network to obtain the layout information.
In some examples, the planning network includes a first fully-connected layer, a second fully-connected layer, and a plurality of serially connected planning transformer modules, and the planning unit is further configured to: iteratively executing the following denoising operations corresponding to the time steps, and determining the denoised layout information corresponding to the last time step as the layout information: processing the denoised layout information obtained by the denoising operation corresponding to the previous time step through the first fully-connected layer to obtain layout characterization information, wherein the first fully-connected layer corresponding to the first time step processes the initialized layout information; obtaining processed layout information according to the layout characterization information, the time step, the image features and the text features through the plurality of planning transformer modules; and generating the denoised layout information of the current time step according to the processed layout information through the second fully-connected layer.
In some examples, the planning transformer module includes an adaptive normalization layer, a self-attention layer, and a cross-attention layer, and the planning unit is further configured to: processing, through the adaptive normalization layer, the denoised layout information corresponding to the previous time step and the previous time step to obtain first processing data; combining the second processing data, obtained by the self-attention layer based on the first processing data, with the first processing data to obtain third processing data; and combining the fourth processing data, obtained by the cross-attention layer based on the third processing data, the image features and the text features, with the third processing data to obtain the processed layout information.
In some examples, the above-described rendering network includes a layout branch, a visual branch, a control network, a stable diffusion network, and the above-described rendering unit is further configured to: a plurality of mask images corresponding to the visual elements one by one are fused through a space fusion module in the layout branch, so that space layout representation data are obtained; geometrically transforming the target object image according to the layout information through the visual branch to obtain a repositioning image, and determining visual representation data of the repositioning image; and using the spatial layout representation data and the visual representation data as control conditions of a control network to guide the stable diffusion network to generate a target poster.
In some examples, the above-described rendering network further includes a text rendering module; and the rendering unit, further configured to: the space layout representation data and the visual representation data are used as control conditions of a control network to guide a stable diffusion network to generate an initial poster; and rendering the target text in the initial poster according to the color of the text region of the target text in the initial poster by a text rendering module to obtain the target poster.
In some examples, the spatial fusion module includes: a convolutional network and a plurality of serially connected vision transformer modules, and the rendering unit is further configured to: processing each of the plurality of mask images through the convolutional network to obtain a plurality of processed mask images; dividing each processed mask image of the plurality of processed mask images into a plurality of patches, and fusing the corresponding patches across the plurality of processed mask images to obtain a plurality of fused patches; and respectively processing the fused patches through the plurality of vision transformer modules to obtain a plurality of processed patches, and generating spatial layout characterization data comprising the plurality of processed patches.
In some examples, the vision transformer module includes a multi-head self-attention layer and a linear network with layer normalization operations, and the rendering unit is further configured to: for each of the plurality of vision transformer modules, performing the following operations: for each fused patch of the plurality of fused patches, processing the corresponding intermediate fused patch output by the previous vision transformer module through the multi-head self-attention layer in the vision transformer module to obtain a first processed fused patch, wherein the first vision transformer module processes the fused patch directly; combining the intermediate fused patch and the first processed fused patch to obtain a second processed fused patch; processing the second processed fused patch through the linear network to obtain a third processed fused patch; and combining the second processed fused patch and the third processed fused patch to obtain the intermediate fused patch output by the vision transformer module.
In some examples, the above-described rendering unit is further configured to: taking the spatial layout representation data and the visual representation data as control conditions of a control network, and generating a noise poster based on the random noise guiding stable diffusion network; and carrying out iterative denoising on the noise poster to obtain an initial poster.
In some examples, the extraction unit described above is further configured to: and respectively extracting features of the target object image and the target text through a visual encoder and a language encoder in a pre-trained feature extraction model to obtain image features and text features, wherein the feature extraction model is obtained based on a contrast learning task of image-text pairs, mask language modeling and matching task training of the image-text pairs.
In a third aspect, embodiments of the present application provide a computer readable medium having a computer program stored thereon, wherein the program when executed by a processor implements a method as described in any of the implementations of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
According to the poster generation method and device, the image features and the text features are obtained by extracting the features of the acquired target object images and the target texts; determining layout information of a plurality of visual elements in a target poster to be generated according to image features and text features through a pre-trained planning network; the target poster is generated according to the target object image and the layout information through the pre-trained rendering network, so that the end-to-end poster generation method based on the planning network and the rendering network is provided, the layout structure of the poster is determined through the planning network, the corresponding poster image is generated according to the layout structure through the rendering network, and the generation efficiency, quality and diversity of the poster are improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method of generating a poster according to the present application;
fig. 3 is a schematic diagram of a framework of a method of generating a poster according to the application;
Fig. 4 is a schematic diagram of the structure of a planning network according to the present embodiment;
Fig. 5 is a schematic structural view of a spatial fusion module according to the present embodiment;
fig. 6 is a schematic diagram of an application scenario of the generation method of the poster according to the present embodiment;
Fig. 7 is a flowchart of still another embodiment of a generation method of a poster according to the present application;
Fig. 8 is a structural view of one embodiment of a poster generating apparatus according to the present application;
FIG. 9 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
It should be noted that, in the technical solution of the present disclosure, the collection, updating, analysis, processing, use, transmission, storage, etc. of users' personal information all comply with the relevant laws and regulations, are used for legal purposes, and do not violate public order and good morals. Necessary measures are taken for users' personal information to prevent illegal access to users' personal information data, and to maintain users' personal information security, network security and national security.
Fig. 1 shows an exemplary architecture 100 to which the poster generation methods and apparatus of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The communication connection between the terminal devices 101, 102, 103 constitutes a topology network, the network 104 being the medium for providing the communication link between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The terminal devices 101, 102, 103 may interact with the server 105 through the network 104 to receive or transmit data or the like. The terminal devices 101, 102, 103 may be hardware devices or software supporting network connections for data interaction and data processing. When the terminal device 101, 102, 103 is hardware, it may be various electronic devices supporting network connection, information acquisition, interaction, display, processing, etc., including but not limited to smartphones, tablet computers, electronic book readers, laptop and desktop computers, etc. When the terminal devices 101, 102, 103 are software, they can be installed in the above-listed electronic devices. It may be implemented as a plurality of software or software modules, for example, for providing distributed services, or as a single software or software module. The present invention is not particularly limited herein.
The server 105 may be a server providing various services, for example, a background processing server that acquires the target object image and target text transmitted by a user through the terminal devices 101, 102, 103 and performs the target task of generating a target poster based on a planning network and a rendering network. Optionally, the background processing server may send the target poster back to the terminal device. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., software or software modules for providing distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should be further noted that, the method for generating the poster provided by the embodiment of the present application may be executed by a server, may be executed by a terminal device, or may be executed by the server and the terminal device in cooperation with each other. Accordingly, each part (for example, each unit) included in the poster generating apparatus may be provided in the server, may be provided in the terminal device, or may be provided in the server and the terminal device, respectively. For the terminal intelligent system, the method for generating the poster provided by the embodiment of the application is generally executed by the terminal equipment.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. When the electronic device on which the generation method of the poster is operated does not need to perform data transmission with other electronic devices, the system architecture may include only the electronic device (e.g., a server or a terminal device) on which the generation method of the poster is operated.
With continued reference to fig. 2, a flow of one embodiment of a method of generating a poster is shown; with continued reference to fig. 3, a schematic diagram of an application framework 300 of a method of generating a poster is shown. The process 200 includes the steps of:
Step 201, extracting features of the acquired target object image and target text to obtain image features and text features.
In this embodiment, the execution subject of the poster generation method (such as the terminal device or the server in fig. 1) may acquire the target object image and the target text from a remote location or from a local location through a wired network connection manner or a wireless network connection manner, and perform feature extraction on the acquired target object image and target text to obtain image features and text features.
The target object image is an object image to be presented in a target poster to be generated, for example, only the target object is included in the target object image; the target text is text to be presented in the target poster to be generated, for example, the target text is introduction text of the target object.
In this embodiment, the executing body may perform feature extraction on the target object image through the image feature extraction network to obtain an image feature; and extracting the characteristics of the target text through a text characteristic extraction network to obtain text characteristics.
The image feature extraction network and the text feature extraction network may each be a neural network model with a feature extraction function, such as a convolutional neural network, a recurrent neural network, an autoencoder, and the like; they may also be feature extraction modules in a neural network model, for example in a residual network or in BERT (Bidirectional Encoder Representations from Transformers).
In some optional implementations of this embodiment, the executing body may execute the step 201 as follows: and respectively extracting the characteristics of the target object image and the target text by a visual encoder and a language encoder in the pre-trained characteristic extraction model to obtain image characteristics and text characteristics.
The feature extraction model is obtained based on a contrast learning task of image-text pairs, mask language modeling and matching task training of the image-text.
As an example, the feature extraction model is the ALBEF model. The visual encoder in the ALBEF model consists of a 12-layer vision Transformer (ViT), and the language encoder consists of the first six layers of RoBERTa. In order to make the encoders more suitable for the e-commerce scenario, in this implementation, large-scale image-text pairs are collected from an e-commerce platform and the model is fine-tuned on the basis of the original ALBEF training objectives. Specifically, these training objectives include contrastive learning of image-text pairs, masked language modeling, and image-text matching.
The contrast learning of the image-text pair is an unsupervised learning method and is used for learning the correlation between the image and the corresponding text. This task aims to train the model to learn a good representation by gathering together related images and text samples, and separating unrelated images and text samples.
In the masked language modeling task, some words or segments in the input text provided to the model are randomly selected and masked, and the model is required to predict the masked portion from the context and the other visible words. In general, masking may be performed by replacing the masked words with a special token (e.g., [MASK]), and the model needs to predict the masked words as accurately as possible during training.
The image-text matching task refers to solving the problem of a specific task by matching images and texts in the fields of natural language processing and computer vision. This task aims to measure the correlation between images and text.
In some preferred implementations, the feature extraction model also adds a training goal, i.e., predicting the category of the item from its image and title. It should be noted that the feature extraction model is not limited to chinese scenes, and can support other languages by replacing the language encoder.
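By way of a non-limiting illustration, the dual-encoder feature extraction step may be sketched in Python as follows. A publicly available CLIP-style vision/language encoder pair is used here purely as a stand-in for the ALBEF-style model described above; the checkpoint name, pooled-feature granularity and function interface are assumptions of this sketch rather than part of the present application.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Stand-in dual encoder; the ALBEF-style model (12-layer ViT + first six RoBERTa layers)
# described above is not reproduced here, so CLIP is used only for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_features(item_image, target_text):
    """Return (image_features, text_features) to be fed to the planning network.

    item_image: a PIL.Image of the target object; target_text: the poster text.
    """
    inputs = processor(text=[target_text], images=item_image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_features = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_features = model.get_text_features(input_ids=inputs["input_ids"],
                                                attention_mask=inputs["attention_mask"])
    return image_features, text_features
```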
Step 202, determining layout information of a plurality of visual elements in a target poster to be generated according to image characteristics and text characteristics through a pre-trained planning network.
In this embodiment, the executing body may determine layout information of a plurality of visual elements in the target poster to be generated through a pre-trained planning network according to the image features and the text features. Wherein the planning network 301 is configured to characterize correspondence between image features, text features, and layout information of a plurality of visual elements in a target poster.
The plurality of visual elements in the target poster include, but are not limited to, visual elements of a background, substrate, text, article, and the like.
The execution subject may input the image features and the text features into a pre-trained planning network, thereby obtaining layout information of a plurality of visual elements in the target poster to be generated. The planning network can be trained by the following modes:
First, a training sample set is acquired. The training samples in the training sample set comprise image feature samples, text feature samples and layout information. Then, a machine learning method is adopted, an image feature sample and a text feature sample are taken as input, layout information corresponding to the input image feature sample and text feature sample is taken as expected output, and a planning network is obtained through training.
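As a rough illustration of this supervised formulation, the following Python sketch trains a toy stand-in that regresses element boxes from pooled image and text features; the network, shapes and loss are assumptions made only for demonstration (as described later, the actual planning network operates as an iterative layout-denoising model rather than a single regression).

```python
import torch
from torch import nn

# Toy stand-in: pooled image+text features -> 4 element boxes (x, y, w, h).
class ToyPlanningNetwork(nn.Module):
    def __init__(self, feat_dim=512, num_elements=4):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(),
                                  nn.Linear(256, num_elements * 4))

    def forward(self, image_feat, text_feat):
        return self.head(torch.cat([image_feat, text_feat], dim=-1))

planning_net = ToyPlanningNetwork()
optimizer = torch.optim.AdamW(planning_net.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Synthetic batch standing in for the training sample set described above.
image_feat, text_feat = torch.randn(8, 512), torch.randn(8, 512)
layout_gt = torch.rand(8, 16)            # 4 elements x (x, y, w, h), normalized

loss = loss_fn(planning_net(image_feat, text_feat), layout_gt)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```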
In some optional implementations of this embodiment, the executing body may execute the step 202 as follows:
firstly, initializing layout of a plurality of visual elements according to a target object image and a target text through a planning network to obtain initialized layout information.
As an example, by planning a network, according to a target object image and a target text, a plurality of visual elements are subjected to random initialization layout, and initialization layout information is obtained.
Secondly, through the planning network, iterative denoising is carried out on the initialized layout information according to the image features and the text features to obtain the layout information.
As an example, the planning network may employ a layout decoder through which the initialized layout information is subjected to multi-step iterative denoising according to the image features and the text features to obtain the layout information.
Specifically, first, the initialized layout information is preprocessed, e.g., smoothed, downsampled, etc., to reduce the effect of noise and prepare for subsequent iterative steps.
Then, the preprocessed initialization layout information is used as an initial estimate. This is the starting point of the iteration, and the initial estimate typically contains noise. The following iterative process is then performed:
a. Noise estimation: noise estimation is performed on the currently estimated initialized layout information; statistical methods, waveform analysis or other signal processing techniques can be used. Noise estimation is a key step in iterative denoising.
b. Denoising: based on the noise estimate, the currently estimated initialized layout information is denoised. An appropriate denoising algorithm, such as wavelet denoising or non-local means denoising, is used to reduce the noise level.
c. Updating the estimate: the estimate of the noise is updated according to the result of the denoising process. This can be achieved by comparing the difference between the denoised initialized layout information and the original input.
d. Saving intermediate results: at the end of each iteration, the current denoising result is stored. Thus, when the entire iterative process is completed, the intermediate result of each step is available.
e. Termination condition: after the T-step iteration is performed, T intermediate results are obtained. Whether the stop condition is fulfilled may be decided based on a predetermined termination criterion, e.g. reaching a predetermined number of iterations or convergence of the noise level.
f. Final processing: if the termination condition is met, the denoising result of the last step is selected as the final output. If desired, some final processing, such as smoothing, enhancement, or other post-processing steps, may be performed to achieve the final desired result.
In this implementation, the iterative denoising process of T steps allows noise to be gradually reduced in multiple stages, and retains important features of the signal. However, in the process of selecting the iteration number and the noise estimation, adjustment is required according to specific problems and application scenes so as to achieve the optimal denoising effect.
With continued reference to fig. 4, a schematic diagram of the architecture of the planning network is shown. In some alternative implementations of the present embodiment, the planning network 301 includes a first fully-connected layer 3011, a second fully-connected layer 3012, and a plurality of serially connected planning transformer modules 3013.
In this implementation manner, the execution body may execute the second step by: iteratively executing the following denoising operations corresponding to the time steps, and determining the denoised layout information corresponding to the last time step as layout information:
Step 2.1, processing the denoised layout information obtained by the denoising operation corresponding to the previous time step through the first fully-connected layer to obtain layout characterization information. Wherein, the first fully-connected layer corresponding to the first time step processes the initialized layout information.
In this implementation, T_p denoising operations are performed in total. For each time step t, the inputs to the planning network comprise three parts: the layout information Z_t at time step t, the time step t, and the extracted visual and language features; the output is the layout information Z_{t-1} at time step t-1.
For the first time step T_p, the first fully-connected layer processes the initialized layout information; for each subsequent time step t, the first fully-connected layer processes the denoised layout information obtained by the denoising operation corresponding to the previous time step, and obtains the layout characterization information e_t.
Step 2.2, obtaining the processed layout information according to the layout characterization information, the time step, the image features and the text features through the plurality of planning transformer modules.
For the first planning transformer module among the planning transformer modules, the layout characterization information obtained by the first fully-connected layer, the time step, the image features and the text features are taken as input to obtain processed layout information; for each subsequent planning transformer module, the processed layout information output by the previous planning transformer module, the time step, the image features and the text features are taken as input to obtain the processed layout information corresponding to that planning transformer module.
Step 2.3, generating the denoised layout information of the current time step according to the processed layout information through the second fully-connected layer.
In the implementation manner, the processed layout information output by the last planning transformer module is decoded through the second full-connection layer to obtain the denoised layout information of the current time step.
In this implementation, the iterative denoising process is performed in the order of time steps T_p, T_p-1, ..., 2, 1, and finally the denoised layout information of time step 1 is determined as the final layout information.
In the implementation manner, a specific structure of a planning network and a specific iterative denoising process are provided, and the accuracy of the obtained layout information is improved.
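The denoising loop itself may be sketched as follows, with the whole per-step pass (first fully-connected layer, planning transformer modules, second fully-connected layer) abstracted into a single callable; the layout shape, the random initialization and the call signature are assumptions of this sketch.

```python
import torch

def plan_layout(denoise_step, img_feat, txt_feat, num_elements=4, T_p=100):
    """Run the T_p-step iterative denoising over a randomly initialized layout.

    `denoise_step(z_t, t, img_feat, txt_feat)` stands for one full pass through the
    first fully-connected layer, the planning transformer modules and the second
    fully-connected layer for time step t.
    """
    z = torch.randn(1, num_elements, 4)              # randomly initialized layout (x, y, w, h)
    for t in reversed(range(1, T_p + 1)):            # time steps T_p, T_p-1, ..., 2, 1
        z = denoise_step(z, t, img_feat, txt_feat)   # denoised layout for this time step
    return z                                         # layout information of time step 1

# Toy usage: a pass-through step, shown only to illustrate the call pattern.
layout = plan_layout(lambda z, t, i, tx: z, img_feat=None, txt_feat=None)
```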
With continued reference to fig. 4 described above, in some alternative implementations of the present embodiment, the planning transformer module 3013 includes an adaptive normalization layer 30131, a self-attention layer 30132, and a cross-attention layer 30133.
In this implementation manner, the execution body may execute the step 2.2 by:
Step 2.21, processing the denoised layout information corresponding to the previous time step and the previous time step through the adaptive normalization layer to obtain first processing data.
As an example, the above-described execution body may obtain the first processing data with reference to the following expression:
H_t = AdaLN(e_t, t)
where H_t denotes the first processing data, AdaLN denotes the adaptive normalization layer, e_t denotes the denoised layout information corresponding to the last time step, and t denotes the last time step.
Step 2.22, combining the second processing data obtained by the self-attention layer based on the first processing data with the first processing data to obtain third processing data.
As an example, the above-described execution body may obtain the third processing data with reference to the following expression:
a_t = H_t + SA(H_t)
where a_t denotes the third processing data, SA denotes the self-attention layer, and SA(H_t) denotes the second processing data.
Step 2.23, combining the fourth processing data, obtained by the cross-attention layer based on the third processing data, the image features and the text features, with the third processing data, and passing the combined result through a feed-forward network to obtain the processed layout information.
As an example, the execution subject described above may obtain the post-processing layout information with reference to the following expression:
e_t = FF(a_t + CA(a_t, CAT(e_T, e_I)))
where FF denotes the feed-forward network, CA denotes the cross-attention layer, CAT denotes the concatenation operation, CA(a_t, CAT(e_T, e_I)) denotes the fourth processing data, and e_T and e_I denote the text feature and the image feature, respectively.
In the implementation manner, a specific structure of the planning transformer module and a specific determination process of the processed layout information are provided, the accuracy of the processed layout information is improved, and a basis is provided for obtaining accurate layout information.
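For illustration only, the data flow of one planning transformer module following the three expressions above may be sketched in PyTorch as below; the hidden width, the number of heads and the exact AdaLN conditioning are assumptions of this sketch, not limitations of the present application.

```python
import torch
from torch import nn

class PlanningTransformerBlock(nn.Module):
    """One planning transformer module: AdaLN -> self-attention -> cross-attention -> FF."""

    def __init__(self, dim=256, heads=4, max_steps=1000):
        super().__init__()
        self.t_embed = nn.Embedding(max_steps + 1, 2 * dim)   # per-time-step scale and shift
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, e_t, t, txt_feat, img_feat):
        scale, shift = self.t_embed(t).chunk(2, dim=-1)
        h = self.norm(e_t) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # H_t = AdaLN(e_t, t)
        a = h + self.self_attn(h, h, h)[0]                                  # a_t = H_t + SA(H_t)
        cond = torch.cat([txt_feat, img_feat], dim=1)                       # CAT(e_T, e_I)
        return self.ff(a + self.cross_attn(a, cond, cond)[0])               # FF(a_t + CA(a_t, .))

# Toy shapes: 4 layout tokens, 16 text tokens and 16 image tokens of width 256.
block = PlanningTransformerBlock()
out = block(torch.randn(2, 4, 256), torch.tensor([10, 10]),
            torch.randn(2, 16, 256), torch.randn(2, 16, 256))
```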
Step 203, generating a target poster according to the target object image and the layout information through a pre-trained rendering network.
In this embodiment, the execution body may generate the target poster according to the target object image and the layout information through the pre-trained rendering network. The rendering network is used for representing the corresponding relation among the target object image, the layout information and the target poster.
As an example, the execution subject described above may input the target item image and layout information into a pre-trained rendering network, thereby generating a target poster. The rendering network can be trained by the following modes:
first, a training sample set is acquired. The training samples in the training sample set comprise a layout information sample, an article image sample and a poster. Then, a machine learning method is adopted: the layout information samples and the article image samples are taken as input, the posters corresponding to the input layout information samples and article image samples are taken as expected output, and the rendering network is obtained through training.
With continued reference to fig. 3, in some alternative implementations of the present embodiment, rendering network 302 includes a layout branch 3021, a visual branch 3022, a control network 3023, and a stable diffusion network 3024.
In this implementation manner, the execution body may execute the step 203 as follows:
first, a plurality of mask images corresponding to a plurality of visual elements one by one are fused through a space fusion module in a layout branch, so that space layout representation data is obtained.
In this implementation, the coordinates of the layout information output by the planning network are converted into mask images {L_m} of the layout, where m ranges from 1 to M, and M is the number of categories of visual elements. For mask image L_m, the positions of the m-th type of visual elements are filled with 1 and the remaining positions are filled with 0.
And fusing the plurality of mask images { L m } through a space fusion module in the layout branch to obtain space layout representation data Z L.
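A small illustrative helper for converting the planned element boxes into the per-category mask images {L_m} is sketched below; the normalized (x1, y1, x2, y2) box convention is an assumption of this sketch.

```python
import torch

def layout_to_masks(boxes, labels, num_classes, height, width):
    """boxes: (N, 4) normalized (x1, y1, x2, y2); labels: (N,) category index in [0, num_classes)."""
    masks = torch.zeros(num_classes, height, width)
    for (x1, y1, x2, y2), m in zip(boxes.tolist(), labels.tolist()):
        r1, r2 = int(y1 * height), int(y2 * height)
        c1, c2 = int(x1 * width), int(x2 * width)
        masks[m, r1:r2, c1:c2] = 1.0      # positions of the m-th type of visual element set to 1
    return masks                          # all remaining positions stay 0

masks = layout_to_masks(torch.tensor([[0.1, 0.1, 0.9, 0.3], [0.2, 0.5, 0.8, 0.9]]),
                        torch.tensor([2, 0]), num_classes=4, height=64, width=64)
```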
Secondly, geometrically transforming the target object image according to the layout information through the visual branch to obtain a repositioning image, and determining visual representation data of the repositioning image.
The purpose of the visual branch is to encode visual and spatial information of the article. In the implementation mode, firstly, scaling and translation are carried out on the target object image according to layout information output by a planning network, so as to obtain a repositioning image after repositioning the target object image according to the layout information; visual representation data Z V of the repositioned image is then extracted through a six-layer convolutional network.
Thirdly, the spatial layout representation data and the visual representation data are used as control conditions of a control network to guide the stable diffusion network to generate a target poster.
In this implementation manner, the spatial layout characterization data Z_L and the visual characterization data Z_V are used as control conditions of the control network (ControlNet) to guide the stable diffusion network (Stable Diffusion, SD) to generate the target poster.
As an example, first, the spatial layout characterization data Z L and the visual characterization data Z V are input into the control network control net, and operations are performed inside the control network control net and corresponding control information is generated. Such control information may be understood as a vector guiding the poster generation process.
Then, the control information generated by the control network control net is used as the input of the stable diffusion network and is transmitted to the stable diffusion network so as to generate the target poster meeting the specific condition.
In the implementation manner, the method and the device provide a process for guiding the stable diffusion network to generate the target poster by taking the spatial layout representation data of the layout information and the visual representation data of the relocated target object image as the control conditions of the control network, thereby being beneficial to further improving the quality and diversity of the target poster.
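A skeletal sketch of the rendering-network data flow described above is given below, with every branch reduced to a placeholder module; all layer choices and dimensions here are illustrative stand-ins and do not reproduce the actual ControlNet or Stable Diffusion implementation.

```python
import torch
from torch import nn

class RenderingNetworkSketch(nn.Module):
    def __init__(self, num_classes=4, cond_dim=64):
        super().__init__()
        # Layout branch: spatial fusion over the per-category mask images -> Z_L.
        self.spatial_fusion = nn.Sequential(nn.Conv2d(num_classes, cond_dim, 3, padding=1), nn.ReLU())
        # Visual branch: encodes the repositioned item image -> Z_V (six conv layers in the text above).
        self.visual_encoder = nn.Sequential(nn.Conv2d(3, cond_dim, 3, stride=2, padding=1), nn.ReLU())
        # Placeholder stand-ins for the control network and the stable diffusion network.
        self.control_net = nn.Conv2d(2 * cond_dim, cond_dim, 1)
        self.diffusion = nn.Conv2d(cond_dim, 3, 3, padding=1)

    def forward(self, layout_masks, repositioned_item):
        z_l = self.spatial_fusion(layout_masks)                     # spatial layout characterization data
        z_v = self.visual_encoder(repositioned_item)                # visual characterization data
        z_v = nn.functional.interpolate(z_v, size=z_l.shape[-2:])   # align spatial sizes
        control = self.control_net(torch.cat([z_l, z_v], dim=1))    # control conditions / control information
        return self.diffusion(control)                              # placeholder "generated" poster

net = RenderingNetworkSketch()
poster = net(torch.rand(1, 4, 128, 128), torch.rand(1, 3, 128, 128))
```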
In some alternative implementations of the present embodiment, rendering network 302 further includes text rendering module 3025. In this implementation manner, the execution body may execute the third step by:
Firstly, the spatial layout representation data and the visual representation data are used as control conditions of a control network to guide a stable diffusion network to generate an initial poster.
As an example, first, the spatial layout characterization data Z L and the visual characterization data Z V are input into the control network control net, and operations are performed inside the control network control net and corresponding control information is generated. Such control information may be understood as a vector guiding the poster generation process. Then, the control information generated by the control network control net is input as a stable diffusion network and is transmitted to the stable diffusion network so as to generate an initial poster meeting specific conditions.
And then, rendering the target text in the initial poster according to the color of the text region of the target text in the initial poster by a text rendering module to obtain the target poster.
The text rendering module is based on heuristic rules: first, the dominant color of the text region of the target text in the initial poster is extracted; then appropriate color matching is performed according to the dominant color; finally, a font is randomly selected from a font library for rendering, obtaining the target poster.
In the implementation mode, the target text in the initial poster is rendered through the text rendering module on the basis of the initial poster, so that the text display effect in the target poster is further improved.
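The heuristic text-rendering step may be sketched as follows with the Python imaging library; the averaging-based color estimate, the complementary-color rule and the font handling are simplifying assumptions of this sketch.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def render_text(poster: Image.Image, text: str, box, font_paths):
    """box = (x1, y1, x2, y2) text region from the layout; font_paths = candidate .ttf files.

    Assumes an RGB poster image.
    """
    region = poster.crop(box).resize((8, 8))                 # cheap dominant-color estimate
    pixels = list(region.getdata())
    dominant = tuple(sum(c) // len(pixels) for c in zip(*pixels))
    text_color = tuple(255 - c for c in dominant[:3])        # naive complementary color matching
    font = ImageFont.truetype(random.choice(font_paths),
                              size=max(12, (box[3] - box[1]) // 2))
    draw = ImageDraw.Draw(poster)
    draw.text((box[0], box[1]), text, fill=text_color, font=font)
    return poster
```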
With continued reference to fig. 5, a schematic diagram of the spatial fusion module is shown. In some optional implementations of this embodiment, the spatial fusion module includes: a convolutional network 30211 and a plurality of serially connected vision transformer modules 30212.
In this implementation manner, the execution body may execute the first step by:
first, each of a plurality of mask images is processed through a convolutional network, resulting in a plurality of processed mask images.
As an example, the plurality of mask images {L_m} are encoded using a three-layer convolutional network, and the feature shape of each encoded processed mask image is C×H×W, where C denotes the number of channels, H denotes the height, and W denotes the width.
Then, each of the plurality of processed mask images is divided into a plurality of patches, and the corresponding patches across the plurality of processed mask images are fused to obtain a plurality of fused patches.
Each encoded mask image L_m is cut into a plurality of patches L_{m,j} of shape C×P×P, where j is the index of the patch and ranges from 1 to W×H/P². The execution body may obtain the j-th fused patch by splicing (CAT) the corresponding patches of the M processed mask images together with a fusion token added to the input.
And finally, the fused patches are respectively processed through the plurality of vision transformer modules to obtain a plurality of processed patches, and the spatial layout characterization data comprising the plurality of processed patches is generated.
The plurality of vision transformer modules are connected in series; the plurality of fused patches are progressively processed by the plurality of vision transformer modules, and the plurality of processed patches are output by the last vision transformer module, thereby generating the spatial layout characterization data comprising the plurality of processed patches.
In the implementation manner, a specific structure of the spatial fusion module and a specific fusion manner based on the spatial fusion module are provided, so that the spatial relationship among a plurality of visual elements in the target poster can be better explored.
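A simplified sketch of this patchify-and-fuse data flow is shown below, using standard transformer encoder layers in place of the vision transformer modules detailed later; the patch size, widths and the omission of the fusion token are assumptions of this sketch.

```python
import torch
from torch import nn

class SpatialFusionSketch(nn.Module):
    def __init__(self, num_classes=4, channels=16, patch=8, dim=128, blocks=2):
        super().__init__()
        self.encode = nn.Conv2d(1, channels, 3, padding=1)    # shared per-mask encoder (three layers above)
        self.patch = patch
        self.proj = nn.Linear(num_classes * channels * patch * patch, dim)
        self.vit = nn.ModuleList(nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
                                 for _ in range(blocks))

    def forward(self, masks):                                 # masks: (B, M, H, W)
        B, M, H, W = masks.shape
        feats = self.encode(masks.reshape(B * M, 1, H, W)).reshape(B, M, -1, H, W)
        p = self.patch
        # Split each encoded mask into P x P patches and fuse corresponding patches across the M masks.
        patches = feats.unfold(3, p, p).unfold(4, p, p)       # (B, M, C, H/p, W/p, p, p)
        patches = patches.permute(0, 3, 4, 1, 2, 5, 6).reshape(B, (H // p) * (W // p), -1)
        tokens = self.proj(patches)                           # fused patches as tokens
        for blk in self.vit:
            tokens = blk(tokens)                              # processed patches
        return tokens                                         # spatial layout characterization data

z_l = SpatialFusionSketch()(torch.rand(1, 4, 64, 64))
```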
In some alternative implementations of the present embodiment, the vision transformer module includes a multi-head self-attention layer and a linear network, with layer normalization operations introduced.
In this implementation manner, the execution subject may obtain a plurality of processed diced pieces based on a plurality of visual transducer modules in the following manner:
for each of the plurality of visual transducer modules, performing the following:
First, for each fusion cut of a plurality of fusion cuts, processing an intermediate fusion cut corresponding to the fusion module output by the last visual transducer module through a multi-head self-attention layer in the visual transducer module to obtain a first processed fusion cut. Wherein the first visual transducer module processes the fusion cut.
For the first visual transducer module in the plurality of visual transducer modules, processing each fusion cut block through the visual transducer module to obtain a plurality of fusion cut blocks and corresponding intermediate fusion cut blocks; and for each subsequent visual transducer module, processing the plurality of intermediate fusion cuts output by the last visual transducer module through the visual transducer module to obtain the plurality of intermediate fusion cuts output by the visual transducer module.
Then, the intermediate fused patch and the first processed fused patch are combined to obtain a second processed fused patch.
Inside each visual transformer module, the second processed fused patch is obtained by the following expression:

Z'_s,j = Z_(s-1),j + MSA(Z_(s-1),j)

where Z'_s,j represents the second processed fused patch corresponding to the j-th fused patch inside the s-th visual transformer module; Z_(s-1),j represents the intermediate fused patch output by the (s-1)-th visual transformer module; and MSA(Z_(s-1),j) represents the first processed fused patch obtained by the multi-head self-attention layer.
Then, the second processed fused patch is processed through the linear network to obtain a third processed fused patch.
Finally, the second processed fused patch and the third processed fused patch are combined to obtain the intermediate fused patch output by the visual transformer module.
In this implementation, the intermediate fused patch output by the visual transformer module may be obtained by the following expression:

Z_s,j = Z'_s,j + FF(Z'_s,j)

where Z_s,j represents the intermediate fused patch output by the s-th visual transformer module, FF represents the linear network, and FF(Z'_s,j) represents the third processed fused patch.
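The following sketch, given only as an illustration, shows how a stack of serially connected visual transformer modules matching the two expressions above could look in PyTorch, assuming a pre-norm arrangement in which layer normalization is applied before the multi-head self-attention layer and before the linear (feed-forward) network; the dimensions, the GELU activation and the class name are assumptions.

```python
import torch
import torch.nn as nn

class VisualTransformerModule(nn.Module):
    """Sketch of one module: Z'_s = Z_(s-1) + MSA(LN(Z_(s-1))); Z_s = Z'_s + FF(LN(Z'_s))."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):                      # z: (batch, num_patches, dim)
        h = self.norm1(z)
        attn_out, _ = self.attn(h, h, h)       # first processed fused patches (MSA output)
        z = z + attn_out                       # second processed fused patches (residual)
        return z + self.ff(self.norm2(z))      # intermediate fused patches of this module

# serially connected modules; the last one outputs the processed patches
blocks = nn.Sequential(*[VisualTransformerModule() for _ in range(4)])
print(blocks(torch.rand(2, 64, 256)).shape)    # torch.Size([2, 64, 256])
```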
In this implementation, the specific structure of the visual transformer module and the data processing inside the visual transformer module are provided, which improves the accuracy of the spatial layout characterization data and lays a foundation for the subsequent generation of a high-quality target poster.
In some optional implementations of this embodiment, the executing body may execute the initial poster generating process in the following manner:
First, the spatial layout characterization data and the visual characterization data are used as control conditions of the control network to guide the stable diffusion network to generate a noise poster based on random noise.
As an example, the spatial layout characterization data Z_L, the visual characterization data Z_V, and the random noise are first input into the control network (ControlNet), which computes the corresponding control information. Such control information may be understood as vectors guiding the poster generation process. The control information generated by the control network is then passed to the stable diffusion network as an input, so that a noise poster meeting the specified conditions is generated.
And then, carrying out iterative denoising on the noise poster to obtain an initial poster.
In this implementation, a total of T_R denoising steps are performed. For each time step t, the input of the rendering network is the poster P_t at time t, and the output is the poster P_(t-1) at time t-1.
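A rough sketch of this controlled denoising loop is given below. It is not the actual ControlNet or Stable Diffusion code: the control network and the denoising network are toy stand-ins, and the single call per time step stands in for one reverse diffusion step that maps the poster P_t to P_(t-1).

```python
import torch
import torch.nn as nn

def generate_initial_poster(control_net, denoise_step, z_layout, z_visual,
                            steps=50, shape=(1, 4, 64, 64)):
    """Sketch: run T_R denoising steps, each guided by the same control information."""
    poster = torch.randn(shape)                                      # P_T: pure random noise
    control = control_net(torch.cat([z_layout, z_visual], dim=-1))   # control information
    for t in reversed(range(1, steps + 1)):
        t_embed = torch.full((shape[0],), float(t))
        # one reverse step: maps the poster P_t at time t to P_(t-1) at time t-1
        poster = denoise_step(poster, t_embed, control)
    return poster                                                    # P_0: the initial poster

# toy stand-ins only so that the sketch runs end to end
class ToyDenoiser(nn.Module):
    def forward(self, x, t, control):
        return 0.98 * x          # placeholder for a real conditioned denoising network

initial_poster = generate_initial_poster(nn.Linear(32, 8), ToyDenoiser(),
                                         torch.rand(1, 16), torch.rand(1, 16))
print(initial_poster.shape)      # torch.Size([1, 4, 64, 64])
```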
In the implementation manner, the iterative denoising process is helpful to further improve the quality and diversity of the generated target poster.
With continued reference to fig. 6, fig. 6 is a schematic diagram 600 of an application scenario of the poster generation method according to the present embodiment. In the application scenario of fig. 6, a user 601 sends a target item image 604 and a target text 605 to a server 603 via a mobile terminal 602 for generating a target poster. After the server acquires the target object image and the target text, firstly extracting the characteristics of the acquired target object image and target text to obtain image characteristics 606 and text characteristics 607; then determining layout information of a plurality of visual elements in the target poster to be generated according to the image characteristics and the text characteristics through a pre-trained planning network 608; a target poster 610 is generated from the target item image and layout information by the pre-trained rendering network 609.
According to the method provided by the embodiment of the application, the image characteristics and the text characteristics are obtained by extracting the characteristics of the acquired target object image and the target text; determining layout information of a plurality of visual elements in a target poster to be generated according to image features and text features through a pre-trained planning network; the target poster is generated according to the target object image and the layout information through the pre-trained rendering network, so that the end-to-end poster generation method based on the planning network and the rendering network is provided, the layout structure of the poster is determined through the planning network, the corresponding poster image is generated according to the layout structure through the rendering network, and the generation efficiency, quality and diversity of the poster are improved.
With continued reference to fig. 7, there is shown a schematic flow chart 700 of yet another embodiment of a method of generating a poster according to the present application, including the steps of:
Step 701, respectively extracting the features of the target object image and the target text by a visual encoder and a language encoder in a pre-trained feature extraction model, to obtain image features and text features.
The feature extraction model is trained based on an image-text contrastive learning task, a masked language modeling task and an image-text matching task.
Step 702, initializing the layout of the plurality of visual elements according to the target object image and the target text through the planning network, and obtaining initialized layout information.
In step 703, iterative denoising is performed on the initialized layout information according to the image features and the text features through the planning network, so as to obtain the layout information.
Step 704, fusing a plurality of mask images corresponding to the plurality of visual elements one by one through a spatial fusion module in the layout branch to obtain spatial layout characterization data.
Step 705, performing geometric transformation on the target object image according to the layout information through the visual branch to obtain a repositioning image, and determining visual representation data of the repositioning image.
And step 706, using the spatial layout characterization data and the visual characterization data as control conditions of the control network to guide the stable diffusion network to generate an initial poster.
Step 707, rendering, by the text rendering module, the target text in the initial poster according to the color of the text region of the target text in the initial poster, to obtain the target poster.
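A simple illustration of this kind of post-hoc text rendering is sketched below using Pillow: the average colour of the text region in the initial poster decides whether the target text is drawn in black or white. The brightness threshold, the default font and the box handling are assumptions made for illustration and are not the patented text rendering module.

```python
from PIL import Image, ImageDraw

def render_text(poster: Image.Image, text: str, box: tuple) -> Image.Image:
    """Sketch: draw `text` inside `box`, choosing black or white from the region brightness."""
    x0, y0, x1, y1 = box
    region = poster.convert("RGB").crop(box)
    pixels = list(region.getdata())
    # average brightness of the text region decides the text colour
    brightness = sum(sum(p) / 3 for p in pixels) / max(len(pixels), 1)
    colour = (0, 0, 0) if brightness > 128 else (255, 255, 255)
    out = poster.copy()
    ImageDraw.Draw(out).text((x0, y0), text, fill=colour)   # default bitmap font
    return out

poster = render_text(Image.new("RGB", (256, 256), "white"), "Summer Sale", (40, 40, 220, 80))
poster.save("poster_with_text.png")
```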
As can be seen from this embodiment, compared with the embodiment corresponding to fig. 2, the process 700 of the method for generating a poster in this embodiment specifically illustrates the determining process of layout information, the generating process of a target poster and the rendering process of a target text, which further improves the generating efficiency, quality and diversity of the poster.
With continued reference to fig. 8, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of a poster generating apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 8, the poster generating apparatus 800 includes: an extracting unit 801 configured to perform feature extraction on the acquired target object image and target text, to obtain image features and text features; a planning unit 802 configured to determine layout information of a plurality of visual elements in a target poster to be generated according to image features and text features through a pre-trained planning network; and a rendering unit 803 configured to generate a target poster from the target item image and layout information through the pre-trained rendering network.
In some optional implementations of this embodiment, the planning unit 802 is further configured to: initializing layout of a plurality of visual elements according to the target object image and the target text through a planning network to obtain initialized layout information; and carrying out iterative denoising on the initialized layout information according to the image characteristics and the text characteristics through a planning network to obtain the layout information.
In some optional implementations of this embodiment, the planning network includes a first fully-connected layer, a second fully-connected layer, and a plurality of serially connected planning transformer modules, and the planning unit 802 is further configured to: iteratively execute the following denoising operation corresponding to each of a plurality of time steps, and determine the denoised layout information corresponding to the last time step as the layout information: process, through the first fully-connected layer, the denoised layout information obtained by the denoising operation corresponding to the previous time step to obtain layout characterization information, wherein the first fully-connected layer corresponding to the first time step processes the initialized layout information; obtain processed layout information according to the layout characterization information, the previous time step, the image features and the text features through the plurality of planning transformer modules; and generate the denoised layout information of the current time step according to the processed layout information through the second fully-connected layer.
In some optional implementations of this embodiment, the planning transformer module includes an adaptive normalization layer, a self-attention layer, and a cross-attention layer, and the planning unit 802 is further configured to: process, through the adaptive normalization layer, the denoised layout information corresponding to the previous time step together with the previous time step to obtain first processing data; combine second processing data obtained by the self-attention layer based on the first processing data with the first processing data to obtain third processing data; and combine fourth processing data obtained by the cross-attention layer based on the third processing data, the image features and the text features with the third processing data to obtain the processed layout information.
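The sketch below illustrates one possible reading of such a planning transformer module in PyTorch: an adaptive normalization layer conditioned on a time-step embedding, followed by a self-attention layer and a cross-attention layer over the concatenated image and text features, each with a residual connection. The conditioning scheme, dimensions and names are assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class PlanningTransformerModule(nn.Module):
    """Sketch: adaptive norm on the noisy layout + time step, then self- and cross-attention."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.ada = nn.Linear(dim, 2 * dim)        # predicts scale and shift from the time embedding
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, layout, t_embed, cond):
        # layout: (B, N_elements, dim); t_embed: (B, dim); cond: (B, L, dim) image + text features
        scale, shift = self.ada(t_embed).unsqueeze(1).chunk(2, dim=-1)
        h1 = self.norm(layout) * (1 + scale) + shift       # first processing data
        h2, _ = self.self_attn(h1, h1, h1)                 # second processing data (self-attention)
        h3 = h1 + h2                                       # third processing data (residual)
        h4, _ = self.cross_attn(h3, cond, cond)            # cross-attention over image/text features
        return h3 + h4                                     # processed layout information

module = PlanningTransformerModule()
out = module(torch.rand(2, 5, 128), torch.rand(2, 128), torch.rand(2, 20, 128))
print(out.shape)  # torch.Size([2, 5, 128])
```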
In some optional implementations of the present embodiment, the above-described rendering network includes a layout branch, a visual branch, a control network, a stable diffusion network, and the above-described rendering unit 803 is further configured to: a plurality of mask images corresponding to the visual elements one by one are fused through a space fusion module in the layout branch, so that space layout representation data are obtained; geometrically transforming the target object image according to the layout information through the visual branch to obtain a repositioning image, and determining visual representation data of the repositioning image; and using the spatial layout representation data and the visual representation data as control conditions of a control network to guide the stable diffusion network to generate a target poster.
In some optional implementations of this embodiment, the rendering network further includes a text rendering module; and the rendering unit 803 described above, further configured to: the space layout representation data and the visual representation data are used as control conditions of a control network to guide a stable diffusion network to generate an initial poster; and rendering the target text in the initial poster according to the color of the text region of the target text in the initial poster by a text rendering module to obtain the target poster.
In some optional implementations of this embodiment, the spatial fusion module includes: a convolutional network and a plurality of serially connected visual transformer modules, and the rendering unit 803 is further configured to: process each of the plurality of mask images through the convolutional network to obtain a plurality of processed mask images; divide each processed mask image of the plurality of processed mask images into a plurality of patches, and fuse the corresponding patches across the plurality of processed mask images to obtain a plurality of fused patches; and process the fused patches through the plurality of visual transformer modules respectively to obtain a plurality of processed patches, and generate spatial layout characterization data comprising the plurality of processed patches.
In some optional implementations of this embodiment, the visual transformer module includes a multi-head self-attention layer and a linear network with a layer normalization operation, and the rendering unit 803 is further configured to: for each of the plurality of visual transformer modules, perform the following: for each fused patch in the plurality of fused patches, process, through the multi-head self-attention layer in the visual transformer module, the intermediate fused patch corresponding to that fused patch and output by the previous visual transformer module to obtain a first processed fused patch, wherein the first visual transformer module processes the fused patch itself; combine the intermediate fused patch and the first processed fused patch to obtain a second processed fused patch; process the second processed fused patch through the linear network to obtain a third processed fused patch; and combine the second processed fused patch and the third processed fused patch to obtain the intermediate fused patch output by the visual transformer module.
In some optional implementations of this embodiment, the rendering unit 803 is further configured to: use the spatial layout characterization data and the visual characterization data as control conditions of the control network to guide the stable diffusion network to generate a noise poster based on random noise; and perform iterative denoising on the noise poster to obtain the initial poster.
In some optional implementations of this embodiment, the extracting unit 801 is further configured to: perform feature extraction on the target object image and the target text respectively through a visual encoder and a language encoder in a pre-trained feature extraction model to obtain the image features and the text features, wherein the feature extraction model is trained based on an image-text contrastive learning task, a masked language modeling task and an image-text matching task.
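As a rough illustration of such a dual-encoder feature extraction model, the sketch below uses stand-in encoders. The pre-training objectives mentioned above (image-text contrastive learning, masked language modeling, image-text matching) are not implemented here, and the layer sizes, the vocabulary size and the class name are assumptions.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Sketch of a dual-encoder feature extraction model (pre-training objectives not shown)."""
    def __init__(self, dim=256, vocab=30000):
        super().__init__()
        self.visual_encoder = nn.Sequential(          # stand-in for a pretrained image encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.token_embed = nn.Embedding(vocab, dim)   # stand-in for a pretrained text encoder
        self.language_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)

    def forward(self, image, token_ids):
        image_features = self.visual_encoder(image)                                  # (B, dim)
        text_features = self.language_encoder(self.token_embed(token_ids)).mean(1)   # (B, dim)
        return image_features, text_features

img_f, txt_f = DualEncoder()(torch.rand(1, 3, 224, 224), torch.randint(0, 30000, (1, 12)))
print(img_f.shape, txt_f.shape)   # torch.Size([1, 256]) torch.Size([1, 256])
```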
In the embodiment, an extraction unit in the poster generating device performs feature extraction on the acquired target object image and target text to obtain image features and text features; the planning unit determines layout information of a plurality of visual elements in a target poster to be generated according to the image characteristics and the text characteristics through a pre-trained planning network; the rendering unit generates the target poster according to the target object image and the layout information through the pre-trained rendering network, so that the end-to-end poster generating device based on the planning network and the rendering network is provided, the layout structure of the poster is determined through the planning network, the corresponding poster image is generated through the rendering network according to the layout structure, and the generating efficiency, quality and diversity of the poster are improved.
Referring now to FIG. 9, there is illustrated a schematic diagram of a computer system 900 suitable for use with devices (e.g., devices 101, 102, 103, 105 shown in FIG. 1) for implementing embodiments of the present application. The apparatus shown in fig. 9 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 9, the computer system 900 includes a processor (e.g., CPU, central processing unit) 901, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
The following components are connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output portion 907 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the Internet. The drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 910 so that a computer program read out therefrom is installed into the storage section 908 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. The above-described functions defined in the method of the present application are performed when the computer program is executed by the processor 901.
The computer readable medium of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the client computer, partly on the client computer, as a stand-alone software package, partly on the client computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the client computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor, for example, described as: a processor includes an extraction unit, a planning unit, and a rendering unit. The names of these units do not constitute a limitation on the unit itself in some cases, and for example, the planning unit may also be described as "a unit that determines layout information of a plurality of visual elements in a target poster to be generated from image features and text features through a pre-trained planning network".
As another aspect, the present application also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the computer device to: extracting features of the obtained target object image and the target text to obtain image features and text features; determining layout information of a plurality of visual elements in a target poster to be generated according to image features and text features through a pre-trained planning network; and generating a target poster according to the target object image and the layout information through a pre-trained rendering network.
The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the application referred to in the present application is not limited to the specific combinations of the technical features described above, but also covers other technical features formed by any combination of the technical features described above or their equivalents without departing from the inventive concept described above. Such as the above-mentioned features and the technical features disclosed in the present application (but not limited to) having similar functions are replaced with each other.
Claims (13)
1. A method of generating a poster, comprising:
Extracting features of the obtained target object image and the target text to obtain image features and text features;
Determining layout information of a plurality of visual elements in a target poster to be generated according to the image features and the text features through a pre-trained planning network;
and generating the target poster according to the target object image and the layout information through a pre-trained rendering network.
2. The method of claim 1, wherein the determining, by the pre-trained planning network, layout information of a plurality of visual elements in a target poster to be generated from the image features and the text features comprises:
initializing layout of the plurality of visual elements according to the target object image and the target text through the planning network to obtain initialized layout information;
And carrying out iterative denoising on the initialized layout information according to the image characteristics and the text characteristics through the planning network to obtain the layout information.
3. The method of claim 2, wherein the planning network comprises a first fully-connected layer, a second fully-connected layer, and a plurality of serially connected planning transformer modules, and
And performing iterative denoising on the initialized layout information according to the image features and the text features through the planning network to obtain the layout information, wherein the method comprises the following steps:
iteratively executing the following denoising operation corresponding to each of a plurality of time steps, and determining the denoised layout information corresponding to the last time step as the layout information:
processing the denoised layout information obtained by the denoising operation corresponding to the previous time step through the first fully-connected layer to obtain layout characterization information, wherein the first fully-connected layer corresponding to the first time step processes the initialized layout information;
obtaining processed layout information according to the layout characterization information, the previous time step, the image features and the text features through the plurality of planning transformer modules;
and generating denoised layout information of the current time step according to the processed layout information through the second fully-connected layer.
4. The method of claim 3, wherein the planning transformer module comprises an adaptive normalization layer, a self-attention layer, and a cross-attention layer, and
the obtaining, by the plurality of planning transformer modules, processed layout information according to the layout characterization information, the previous time step, the image features and the text features comprises:
processing the denoised layout information corresponding to the previous time step and the previous time step through the adaptive normalization layer to obtain first processing data;
combining the second processing data obtained by the self-attention layer based on the first processing data with the first processing data to obtain third processing data;
and combining fourth processing data obtained by the cross-attention layer based on the third processing data, the image features and the text features with the third processing data to obtain the processed layout information.
5. The method of claim 1, wherein the rendering network comprises a layout branch, a visual branch, a control network, a stable diffusion network, and
The generating, by the pre-trained rendering network, the target poster according to the layout information and the target object image, includes:
fusing a plurality of mask images in one-to-one correspondence with the plurality of visual elements through a spatial fusion module in the layout branch to obtain spatial layout characterization data;
geometrically transforming the target object image according to the layout information through the visual branch to obtain a repositioning image, and determining visual characterization data of the repositioning image;
and using the spatial layout characterization data and the visual characterization data as control conditions of the control network to guide the stable diffusion network to generate the target poster.
6. The method of claim 5, wherein the rendering network further comprises a text rendering module; and
The step of guiding the stable diffusion network to generate the target poster by using the spatial layout characterization data and the visual characterization data as control conditions of the control network comprises the following steps:
the spatial layout characterization data and the visual characterization data are used as control conditions of the control network to guide the stable diffusion network to generate the initial poster;
and rendering the target text in the initial poster according to the color of the text region of the target text in the initial poster by the text rendering module to obtain the target poster.
7. The method of claim 5, wherein the spatial fusion module comprises: a convolutional network and a plurality of serially connected visual transformer modules, and
The step of fusing the mask images corresponding to the visual elements one by one through the spatial fusion module in the layout branch to obtain spatial layout characterization data comprises the following steps:
Processing each mask image of the plurality of mask images through the convolutional network to obtain a plurality of processed mask images;
dividing each processed mask image of the plurality of processed mask images into a plurality of patches, and fusing the corresponding patches across the plurality of processed mask images to obtain a plurality of fused patches;
and processing the fused patches through the plurality of visual transformer modules respectively to obtain a plurality of processed patches, and generating spatial layout characterization data comprising the plurality of processed patches.
8. The method of claim 7, wherein the visual transformer module comprises a multi-head self-attention layer and a linear network with a layer normalization operation, and
the processing of the fused patches through the plurality of visual transformer modules to obtain the plurality of processed patches comprises:
for each of the plurality of visual transformer modules, performing the following:
for each fused patch in the plurality of fused patches, processing, through the multi-head self-attention layer in the visual transformer module, the intermediate fused patch corresponding to that fused patch and output by the previous visual transformer module to obtain a first processed fused patch, wherein the first visual transformer module processes the fused patch itself;
combining the intermediate fused patch and the first processed fused patch to obtain a second processed fused patch;
processing the second processed fused patch through the linear network to obtain a third processed fused patch;
and combining the second processed fused patch and the third processed fused patch to obtain the intermediate fused patch output by the visual transformer module.
9. The method of claim 6, wherein the directing the stable diffusion network to generate the initial poster with the spatial layout characterization data and the visual characterization data as control conditions of the control network comprises:
The spatial layout characterization data and the visual characterization data are used as control conditions of the control network, and the stable diffusion network is guided to generate a noise poster based on random noise;
and carrying out iterative denoising on the noise poster to obtain the initial poster.
10. The method of claim 1, wherein the feature extraction of the acquired target item image and target text to obtain image features and text features comprises:
and respectively extracting features of the target object image and the target text through a visual encoder and a language encoder in a pre-trained feature extraction model to obtain the image features and the text features, wherein the feature extraction model is trained based on an image-text contrastive learning task, a masked language modeling task and an image-text matching task.
11. A poster generation apparatus comprising:
The extraction unit is configured to perform feature extraction on the acquired target object image and target text to obtain image features and text features;
A planning unit configured to determine layout information of a plurality of visual elements in a target poster to be generated according to the image features and the text features through a pre-trained planning network;
and a rendering unit configured to generate the target poster from the target item image and the layout information through a pre-trained rendering network.
12. A computer readable medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-10.
13. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
When executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410216808.7A CN118096943A (en) | 2024-02-27 | 2024-02-27 | Poster generation method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118096943A true CN118096943A (en) | 2024-05-28 |
Family
ID=91141666
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410216808.7A Pending CN118096943A (en) | 2024-02-27 | 2024-02-27 | Poster generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118096943A (en) |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |