CN114742991A - Poster background image selection, model training, poster generation method and related device - Google Patents
- Publication number
- CN114742991A (application CN202210360219.7A)
- Authority
- CN
- China
- Prior art keywords
- text
- poster
- background image
- image
- encoder
- Prior art date
- Legal status: Pending (assumed; not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The application discloses a poster background image selection method, a visual text model training method, a poster generation method and related devices. By applying the technical scheme of the application, a preset visual text model can be trained on a large number of weakly correlated image-text pairs, and the trained visual text model can then automatically select a poster background image weakly correlated with the text information the user is interested in. A final poster image is then generated from the automatically selected background image. This solves the problem in the related art that posters can only be produced by manual design, which cannot meet the large demand for high-quality posters.
Description
Technical Field
The application relates to image data processing technology, and in particular to a poster background image selection method, a visual text model training method, a poster generation method and related devices.
Background
A poster is a medium with both artistic and functional properties and has been widely used in many commercial and non-commercial scenarios to promote and disseminate information. For example, e-commerce platforms use attractive posters to promote their merchandise.
In the related art, posters are usually produced by professional designers who process and typeset the product images and information by hand. The process includes manually selecting a poster background image, manually laying out the text in the poster, manually determining the text style on the poster, and so on.
However, such a time-consuming and subjective manual design process cannot meet the large and rapidly growing demand for high-quality posters in real-world applications, which reduces the efficiency of information dissemination. How to design a technical scheme that automatically realizes the poster generation process using a pre-trained model has therefore become a problem to be solved.
Disclosure of Invention
The embodiments of the application provide a poster background image selection method, a visual text model training method, a poster generation method and related devices, which are used to solve the problem in the related art that posters can only be generated manually, reducing the efficiency of information dissemination.
According to an aspect of an embodiment of the present application, there is provided a poster background image selection method, including:
obtaining a poster description text, the poster description text containing text information to be added to a poster;
and selecting, based on a pre-trained visual text model, a candidate background image weakly correlated with the text information from a pre-acquired candidate background image set as a poster background image.
Optionally, in another embodiment based on the above method of the present application, the pre-trained visual text model includes BriVL, which comprises a pre-trained image encoder and a pre-trained text encoder;
the selecting, based on the pre-trained visual text model, a candidate background image weakly correlated with the text information from the pre-acquired candidate background image set as the poster background image includes:
performing feature extraction on the text information with the pre-trained text encoder to obtain a text feature; and
performing feature extraction on each candidate background image with the pre-trained image encoder to obtain an image feature corresponding to each candidate background image;
and calculating the weak-correlation feature similarity between the text feature and the image feature of each candidate background image, and taking the candidate background image with the highest weak-correlation feature similarity to the text feature as the poster background image.
Optionally, in another embodiment based on the method of the present application, the pre-trained text encoder includes: the encoder of RoBERTa-Large in a Chinese pre-training model;
the performing feature extraction on the text information with the pre-trained text encoder to obtain the text feature includes:
inputting the text information into the RoBERTa-Large encoder of the Chinese pre-training model, so that the encoder performs feature extraction on the text information and outputs the corresponding text feature.
Optionally, in another embodiment based on the above method of the present application, the pre-trained image encoder includes: a pre-trained Faster R-CNN and EfficientNet;
the performing feature extraction on each candidate background image with the pre-trained image encoder to obtain the image feature corresponding to each candidate background image includes:
inputting each candidate background image into the pre-trained image encoder, so that the pre-trained Faster R-CNN performs visual object detection on each candidate background image, and EfficientNet performs feature extraction on each detected candidate background image to obtain the image feature corresponding to each candidate background image.
Optionally, in another embodiment based on the foregoing method of the present application, the calculating the weak-correlation feature similarity between the text feature and the image feature of each candidate background image, and taking the candidate background image with the highest weak-correlation feature similarity to the text feature as the poster background image, includes:
determining the weak-correlation feature similarity between the text feature and the image feature of each candidate background image based on an InfoNCE loss function constructed with a preset weakly supervised learning method and the contrastive learning method CPC (Contrastive Predictive Coding).
According to an aspect of the embodiments of the present application, there is provided a visual text model training method for selecting a poster background image, including:
acquiring a plurality of weakly correlated image-text pairs, wherein each weakly correlated image-text pair represents a weakly correlated pair of a candidate background image and historical text information;
and pre-training a preset visual text model on the plurality of weakly correlated image-text pairs to obtain a visual text model for selecting poster background images weakly correlated with text information.
Optionally, in another embodiment based on the above method of the present application, the pre-trained visual text model includes BriVL, which comprises a pre-trained image encoder and a pre-trained text encoder;
the pre-trained text encoder includes: the encoder of RoBERTa-Large in a Chinese pre-training model, and the pre-trained image encoder includes: a pre-trained Faster R-CNN and EfficientNet;
the RoBERTa-Large encoder is used to perform feature extraction on each piece of historical text information and output the corresponding text feature;
the pre-trained Faster R-CNN is used to perform visual object detection on each candidate background image, and EfficientNet is used to perform feature extraction on each detected candidate background image to obtain the image feature corresponding to each candidate background image;
and BriVL determines the weak-correlation feature similarity between the text feature and the image feature of each candidate background image based on an InfoNCE loss function constructed with a preset weakly supervised learning method and the contrastive learning method CPC.
According to an aspect of an embodiment of the present application, there is provided a poster generation method including:
determining, in the poster background image obtained by the above poster background image selection method, a text layout area corresponding to the text information;
and filling the text information into the text layout area to generate a target poster corresponding to the text information.
Optionally, in another embodiment based on the above method of the present application, the determining a text layout area in the poster background image includes:
performing a preliminary layout prediction on the poster background image with a first cascaded autoencoder to determine an initial text layout area; and
performing text layout refinement on the initial text layout area with a second cascaded autoencoder according to the text length and text attributes of the text information to obtain the text layout area.
Optionally, in another embodiment based on the foregoing method of the present application, the filling the text information into the text layout area to generate the poster corresponding to the text information includes:
extracting a text feature of the text information, and detecting the background color of the text layout area;
selecting, from a preset text style database, a target text style matching the text feature and the background color of the text layout area;
and converting the text information according to the target text style, and filling the converted text information into the text layout area to obtain the target poster corresponding to the text information.
According to another aspect of an embodiment of the present application, there is provided a poster background image selecting apparatus, including:
an acquisition module configured to acquire a poster description text, the poster description text containing text information to be added to a poster;
and a selection module configured to select, based on a pre-trained visual text model, a candidate background image weakly correlated with the text information from a pre-acquired candidate background image set as a poster background image.
According to another aspect of the embodiments of the present application, there is provided an electronic device including:
a memory for storing executable instructions; and
a display for communicating with the memory to execute the executable instructions to perform the operations of any of the poster generation methods described above.
According to a further aspect of an embodiment of the present application, there is provided a computer-readable storage medium storing computer-readable instructions that, when executed, perform the operations of any of the poster generation methods described above.
In the application, a poster description text generated by a user can be obtained, the poster description text containing the text information to be added to the poster; a candidate background image set is acquired, and a candidate background image weakly correlated with the text information is selected from it as the poster background image based on a visual text model, wherein the visual text model is trained on a plurality of sample images annotated with weakly correlated text pairs; after a text layout area is determined in the poster background image, the text information is filled into the text layout area to obtain the target poster image. By applying this technical scheme, a preset visual text model can be trained on a large number of weakly correlated image-text pairs, and the trained model can automatically select a poster background image weakly correlated with the text information the user is interested in; the final poster image is then generated from the automatically selected background image. This solves the problem in the related art that posters can only be produced by manual design, which cannot meet the large demand for high-quality posters.
The technical solution of the present application is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
The present application may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
fig. 1 is a schematic diagram of a poster background image selection method proposed in the present application;
fig. 2 is a schematic flow chart of poster background image selection proposed in the present application;
FIG. 3 is a schematic diagram of a method for training a visual text model for selecting a background image of a poster according to the present application;
FIG. 4 is a schematic illustration of a method of generating a poster as set forth in the present application;
fig. 5 is a reference diagram of an example of poster background image selection proposed in the present application;
FIG. 6 is a schematic diagram comparing the text layout method proposed in the present application with prior-art text layout methods;
FIG. 7 is a schematic flow chart of a poster text layout as set forth in the present application;
fig. 8 is a schematic overall flow chart of a poster generation method as proposed in the present application;
fig. 9 is a schematic structural diagram of an electronic device for poster background image selection proposed in the present application;
fig. 10 is a schematic structural diagram of an electronic device for poster background image selection proposed in the present application.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In addition, the technical solutions of the various embodiments of the present application may be combined with each other, provided that the combination can be realized by a person skilled in the art; when technical solutions are contradictory or a combination cannot be realized, the combination should be considered absent and outside the protection scope of the present application.
It should be noted that all directional indicators in the embodiments of the present application (such as upper, lower, left, right, front and rear) are only used to explain the relative positional relationship, motion, etc. between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indicator changes accordingly.
A method for generating a poster according to an exemplary embodiment of the present application is described below in conjunction with fig. 1-8. It should be noted that the following application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable.
The application also provides a poster background image selection method, a model training method, a poster generation method and a related device.
Fig. 1 schematically shows a flow chart of a poster generation method according to an embodiment of the present application. As shown in fig. 1, the method includes:
s101, obtaining a poster description text, wherein the poster description text contains character information for being added on the poster.
In the related art, a poster is used as a medium with artistic and functional properties, and is widely applied to many commercial and non-commercial scenes for promoting and spreading information. For example, electronic commerce platforms use attractive posters to promote their merchandise. Websites of social events such as meetings are often decorated with fine and rich posters. These high quality posters are created by embedding stylized text into a suitable background image, a process that requires a significant amount of manual editing effort and an artistic design experience that is difficult to quantify. However, such a time-consuming and subjective manual design process cannot meet the large and rapidly growing demand for high-quality posters in real-world applications, thereby reducing the efficiency of information dissemination, resulting in poor advertising effectiveness.
In one form, the process of poster generation comprises at least three steps, including:
step 1: selecting a background image of the poster;
step 2: performing text layout in the poster background image;
and step 3: the style of the text on the poster is determined and filled into the background image, thereby generating the final poster image.
Based on the above process, for step 1, in order to search for the poster background image automatically, existing classical retrieval methods typically look for a suitable image by matching the poster content text against the background image annotation text.
However, such related-art methods usually consider only single-modality features, so they cannot close the semantic gap between the visual and textual modalities and may produce heavily biased search results. That is, the selected background image does not match the direction the user is currently interested in.
In addition, regarding step 2, for text layout prediction the conventional rule-based methods in the related art generally select a text layout from a limited number of predefined layout templates, which offers poor flexibility: the content of the background image may be ignored when arranging the text, leading to unsatisfactory poster generation results.
Further, in order to solve the problem that the selected poster background image does not match the user's current interest, the present application provides a technical scheme that performs poster background image selection based on the Text2Poster pre-trained visual text model.
In summary, the present application may first obtain a large-scale pre-trained visual text model and, according to a poster description text given by a user, select from a plurality of candidate background images the candidate matching the description as the poster background image; it then iteratively arranges the text on the image through cascaded autoencoders, and finally stylizes the text through a matching-based method to synthesize the target poster.
In one approach, the present application may also optimize each module of the framework through weakly supervised and self-supervised learning strategies, reducing the reliance on labeled data. The data-driven Text2Poster framework can therefore achieve superior poster generation quality.
Specifically, the method first obtains the poster description text generated by the user, which contains the text information to be added to the poster. It can be understood that the poster description text describes the poster the user wants to generate this time, and may include a number of fields, for example title information, scene information, body information, text content, and so on.
S102, selecting, based on a pre-trained visual text model, a candidate background image weakly correlated with the text information from a pre-acquired candidate background image set as the poster background image.
In one implementation, in order to improve poster quality, when the background image is retrieved during poster generation, the embodiment of the application searches for an image that is weakly correlated with the text information (namely, the text information contained in the poster description text to be added to the poster).
Specifically, weak-correlation matching is matching with a metaphorical quality. For example, when retrieving background images for the phrase "Bob and Alice's wedding", instead of looking for images of a specific wedding scene (i.e., strongly correlated images), the application tends to find more metaphorical images, such as a picture of a white church under a blue sky that metaphorically suggests love.
This achieves the goal of selecting, from the candidate background image set, a candidate background image weakly correlated with the text information as the poster background image. The present application may utilize BriVL, a pre-trained visual text model, to select the background image from the candidates based on the text information.
Specifically, BriVL is the visual text model adopted in this application. As shown in FIG. 2, it consists of an image encoder and a text encoder, denoted f_I and f_T respectively.
The image encoder f_I first detects visual objects with a pre-trained Faster R-CNN model, and then uses an EfficientNet model as its visual backbone to extract the image feature corresponding to each candidate background image.
The text encoder f_T uses the encoder of RoBERTa-Large in a Chinese pre-training model as its text backbone. On top of the backbone outputs, BriVL stacks multiple transformer layers to derive the text feature corresponding to the text information.
It should be noted that the visual text model BriVL in the present application is trained in advance on a large number (for example, 30 million) of weakly correlated sample image-text pairs collected from the web, and can therefore satisfy the weak-correlation artistry of the poster results presented in this application.
It will be appreciated that the present application applies a weakly supervised learning strategy and an InfoNCE loss function to align the text features with the image features. InfoNCE is the loss function constructed by the contrastive learning method CPC (Contrastive Predictive Coding), where NCE refers to Noise Contrastive Estimation.
In one approach, the present application may collect a large number of high-quality images from sources such as stock image websites as the candidate background image set, so that poster background image selection can subsequently be performed based on the BriVL visual text model described above.
Specifically, the image feature of each candidate background image and the text feature of the text information are extracted; the weak-correlation feature similarity between the text feature and each image feature is then computed, and the candidate background image with the highest weak-correlation feature similarity to the text feature is taken as the poster background image.
In the application, a poster description text generated by a user can be obtained, the poster description text containing the text information to be added to the poster; a candidate background image set is acquired, and a candidate background image weakly correlated with the text information is selected from it as the poster background image based on a visual text model trained on a plurality of sample images annotated with weakly correlated text pairs; after a text layout area is determined in the poster background image, the text information is filled into the text layout area to obtain the target poster image. By applying this technical scheme, a preset visual text model can be trained on a large number of weakly correlated image-text pairs, and the trained model can automatically select a poster background image weakly correlated with the text information the user is interested in; the final poster image is then generated from the automatically selected background image. This solves the problem in the related art that posters can only be produced by manual design, which cannot meet the large demand for high-quality posters.
Optionally, in another embodiment based on the method of the present application, the pre-trained visual text model includes BriVL, which comprises a pre-trained image encoder and a pre-trained text encoder.
The selecting, based on the pre-trained visual text model, a candidate background image weakly correlated with the text information from the pre-acquired candidate background image set as the poster background image includes:
performing feature extraction on the text information with the pre-trained text encoder to obtain a text feature; and
performing feature extraction on each candidate background image with the pre-trained image encoder to obtain the image feature corresponding to each candidate background image;
and calculating the weak-correlation feature similarity between the text feature and the image feature of each candidate background image, and taking the candidate background image with the highest weak-correlation feature similarity to the text feature as the poster background image.
As illustrated above with FIG. 2, BriVL is the visual text model adopted in this application and consists of an image encoder and a text encoder, denoted f_I and f_T respectively.
The image encoder f_I first detects visual objects with a pre-trained Faster R-CNN model, and then uses an EfficientNet model as its visual backbone to extract the image feature corresponding to each candidate background image.
The text encoder f_T uses the encoder of RoBERTa-Large in a Chinese pre-training model as its text backbone. On top of the backbone outputs, BriVL stacks multiple transformer layers to derive the text feature corresponding to the text information.
Specifically, in the embodiment of the present application, after the image feature of each candidate background image and the text feature of the text information contained in the poster description text are obtained, these features can be represented as corresponding encoding vectors.
For example, the encoding vector of the text feature is computed as r_T' = f_T(∪_i T_i), where the T_i are the pieces of text information, and the encoding vector of the j-th candidate background image is computed as r_Ij = f_I(I_j).
In one approach, the embodiment of the application computes the cosine similarity between r_T' and each r_Ij, and selects the candidate background image with the highest similarity as the poster background image I.
The cosine similarity between the feature vectors may be computed as:
sim(r_T', r_Ij) = (r_T' · r_Ij) / (||r_T'|| ||r_Ij||).
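The selection step can be sketched in a few lines of PyTorch; this is a minimal illustration assuming both encoders output fixed-dimensional vectors, and the feature dimension and the random stand-in features are placeholders rather than BriVL's actual outputs:

```python
import torch
import torch.nn.functional as F

def select_background(text_feature: torch.Tensor,
                      image_features: torch.Tensor) -> int:
    """Return the index of the candidate background image whose encoding
    vector has the highest cosine similarity with the text encoding vector.

    text_feature:   shape (d,)   -- r_T' from the text encoder f_T
    image_features: shape (n, d) -- one row r_Ij per candidate, from f_I
    """
    sims = F.cosine_similarity(text_feature.unsqueeze(0), image_features, dim=1)
    return int(sims.argmax().item())

# usage with random stand-in features
r_text = torch.randn(512)
r_images = torch.randn(100, 512)
print("selected candidate:", select_background(r_text, r_images))
```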
optionally, in another embodiment based on the method of the present application, the pre-trained text encoder comprises: an encoder of RoBERTA-Large in the Chinese pre-training model;
based on the text encoder of the pre-training, so that the text encoder of the pre-training carries out feature extraction on the character information to obtain character features, the method comprises the following steps:
inputting the character information into a RoBERTA-Large encoder in a Chinese pre-training model so as to enable the encoder to extract the characteristics of the character information and output the corresponding character characteristics.
Optionally, in another embodiment based on the method described above, the pre-trained image encoder includes: a pre-trained Faster R-CNN and EfficientNet;
the performing feature extraction on each candidate background image with the pre-trained image encoder to obtain the image feature corresponding to each candidate background image includes:
inputting each candidate background image into the pre-trained image encoder, so that the pre-trained Faster R-CNN performs visual object detection on each candidate background image, and EfficientNet performs feature extraction on each detected candidate background image to obtain the image feature corresponding to each candidate background image.
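This two-stage image-encoding path can be sketched as follows; the module is a sketch under the assumption that the detector returns region crops and the backbone maps each crop to a feature vector, and the pooling and projection are illustrative rather than BriVL's exact fusion:

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Detector proposes visual-object regions; a CNN backbone embeds
    each region; region features are pooled into one image feature.
    `detector` and `backbone` are stand-ins for the pre-trained
    Faster R-CNN and EfficientNet models."""

    def __init__(self, detector, backbone, out_dim: int = 512):
        super().__init__()
        self.detector = detector  # image -> list of cropped region tensors
        self.backbone = backbone  # region crop -> feature vector
        self.proj = nn.LazyLinear(out_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        regions = self.detector(image)              # visual object detection
        feats = torch.stack([self.backbone(r) for r in regions])
        return self.proj(feats.mean(dim=0))         # pooled image feature
```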
Optionally, in another embodiment based on the foregoing method of the present application, the calculating the weak-correlation feature similarity between the text feature and the image feature of each candidate background image, and taking the candidate background image with the highest weak-correlation feature similarity to the text feature as the poster background image, includes:
determining the weak-correlation feature similarity between the text feature and the image feature of each candidate background image based on an InfoNCE loss function constructed with a preset weakly supervised learning method and the contrastive learning method CPC.
It should be noted that the visual text model BriVL in the present application is trained in advance on a large number (for example, 30 million) of weakly correlated sample image-text pairs collected from the web, and can therefore satisfy the weak-correlation artistry of the poster results presented in this application.
In one approach, during the weak-correlation matching between the image features of the candidate background images and the text feature, the candidate background image with the highest weak-correlation feature similarity to the text feature is taken as the poster background image. Specifically, the method applies a weakly supervised learning strategy and an InfoNCE loss function to align the text feature with the image feature of each candidate background image; correlation matching is performed after the features are aligned.
As for the InfoNCE loss function, the application uses the loss function constructed by the contrastive learning method CPC (Contrastive Predictive Coding) as InfoNCE, where NCE refers to Noise Contrastive Estimation.
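A minimal in-batch sketch of such an InfoNCE objective is shown below; this is a simplified variant for illustration (BriVL's actual training uses a more elaborate negative-sampling scheme, and the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(text_feats: torch.Tensor,
                  image_feats: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric in-batch InfoNCE over aligned (text, image) pairs.

    Row i of both tensors comes from the same weakly correlated
    image-text pair; every other row in the batch acts as a negative
    ("noise") sample, which is the Noise Contrastive Estimation part.
    """
    t = F.normalize(text_feats, dim=1)
    v = F.normalize(image_feats, dim=1)
    logits = t @ v.t() / temperature       # (B, B) cosine-similarity matrix
    targets = torch.arange(t.size(0), device=t.device)
    # align the text->image and image->text directions symmetrically
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```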
By applying this technical scheme, a preset visual text model can be trained on a large number of weakly correlated image-text pairs, and the trained model can automatically select a poster background image weakly correlated with the text information the user is interested in; the final poster image is then generated from the automatically selected background image. This solves the problem in the related art that posters can only be produced by manual design, which cannot meet the large demand for high-quality posters.
Fig. 3 schematically shows a flow chart of a visual text model training method for selecting poster background images according to an embodiment of the application. As shown in fig. 3, the method includes:
s201, acquiring a plurality of weakly correlated image text pairs, wherein the weakly correlated image text pairs are used for representing a group of weakly correlated candidate background images and historical text information.
S202, pre-training and training a preset visual text model based on the plurality of weakly-correlated image text pairs to obtain the visual text model for selecting the poster background image weakly correlated with the character information.
In one mode, in the process of training to obtain the visual text model, the method can use a plurality of weakly correlated image text pairs obtained in advance as training samples. It should be noted that the weakly correlated image text pairs are used to represent a group of weakly correlated candidate background images and historical text information. That is, each candidate background image is labeled with a set of text descriptions corresponding to the weak correlations.
For example, weakly correlated image text pairs may include: candidate background images of a blue sky and a cloud and corresponding marked text information 'wedding'. The character information "wedding" is historical text information corresponding to the candidate background image of the blue sky white cloud in a weak correlation mode.
Or, the weakly correlated image text pair may be: a candidate background image of a school classroom and the corresponding annotated text message "grow". The text information "yes" is historical text information corresponding to the candidate background image in the classroom weakly. It can be understood that the visual text model meeting the weak correlation artistry existing between the character information and the poster background image can be obtained by using the plurality of weak correlation image text pairs as sample training data and pre-training and training initial preset visual text models.
It should be noted that the number of weakly correlated image-text pairs is not specifically limited in the present application and may be, for example, 10 million or 30 million.
In one approach, during the pre-training of the preset visual text model on the plurality of weakly correlated image-text pairs, when the application detects that a preset training condition has been reached (e.g., a certain number of training iterations, a certain training time, or training convergence), the final visual text model is considered obtained.
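An illustrative training loop with such stopping conditions is sketched below; `model.loss(images, texts)` is a hypothetical method wrapping the contrastive objective, and the optimizer, learning rate and tolerances are assumptions:

```python
import torch

def pretrain_visual_text_model(model, loader, max_epochs=20, patience=3):
    """Pre-train until a preset condition is reached: a fixed epoch
    budget or convergence of the average training loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    best, stale = float("inf"), 0
    for epoch in range(max_epochs):           # condition 1: epoch budget
        running = 0.0
        for images, texts in loader:
            loss = model.loss(images, texts)  # e.g. the InfoNCE objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running += loss.item()
        avg = running / max(len(loader), 1)
        if avg < best - 1e-4:
            best, stale = avg, 0
        else:
            stale += 1                        # no measurable improvement
        if stale >= patience:                 # condition 2: convergence
            break
    return model
```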
In the application, a plurality of weakly correlated image-text pairs can be obtained, each representing a weakly correlated pair of a candidate background image and historical text information; the preset visual text model is then pre-trained on these pairs to obtain the visual text model for selecting poster background images weakly correlated with text information.
By applying this technical scheme, a preset visual text model can be trained on a large number of weakly correlated image-text pairs, and the trained model can automatically select a poster background image weakly correlated with the text information the user is interested in; the final poster image is then generated from the automatically selected background image. This solves the problem in the related art that posters can only be produced by manual design, which cannot meet the large demand for high-quality posters.
Optionally, in another embodiment based on the method described herein:
the pre-trained visual text model includes BriVL, which comprises a pre-trained image encoder and a pre-trained text encoder;
the pre-trained text encoder includes: the encoder of RoBERTa-Large in a Chinese pre-training model, and the pre-trained image encoder includes: a pre-trained Faster R-CNN and EfficientNet;
the RoBERTa-Large encoder is used to perform feature extraction on each piece of historical text information and output the corresponding text feature;
the pre-trained Faster R-CNN is used to perform visual object detection on each candidate background image, and EfficientNet is used to perform feature extraction on each detected candidate background image to obtain the corresponding image feature;
and BriVL determines the weak-correlation feature similarity between the text feature and the image feature of each candidate background image based on an InfoNCE loss function constructed with a preset weakly supervised learning method and the contrastive learning method CPC.
In one approach, the visual text model in the present application may be constructed based on BriVL, which consists of a pre-trained image encoder and a pre-trained text encoder.
The pre-trained image encoder first detects visual objects with the pre-trained Faster R-CNN model, and then uses the EfficientNet model as its visual backbone to extract the image feature corresponding to each candidate background image.
The pre-trained text encoder uses the RoBERTa-Large encoder of the Chinese pre-training model as its text backbone. On top of the backbone outputs, BriVL stacks multiple transformer layers to extract the feature of each piece of historical text information and output the corresponding text feature.
It should be noted that the present application applies a weakly supervised learning strategy and an InfoNCE loss function to align the text features with the image features; InfoNCE is the loss function constructed by the contrastive learning method CPC (Contrastive Predictive Coding), where NCE refers to Noise Contrastive Estimation.
Fig. 4 schematically shows a flow diagram of a poster generation method according to an embodiment of the application. As shown in fig. 4, the method includes:
s301, determining a text layout area corresponding to the character information.
S302, filling the text information into the text layout area to generate a target poster corresponding to the text information.
In one mode, the text information can be filled in a text layout area in a poster background image after a candidate background image which is weakly correlated and matched with the text information is selected as the poster background image from a pre-acquired candidate background image set based on a pre-trained visual text model, and then a final target poster is obtained. Specifically, the method comprises the following steps:
step 1: and selecting the candidate background image as the poster background image.
Wherein a background image is selected from the candidate images based on the text information using BriVL, which is one of the pre-trained visual text models.
Step 2: determining the text layout area.
Text layout prediction is performed separately for each piece of text information T_i contained in the poster description text. The prediction is denoted P = {p_i},
where p_i ∈ R^2 represents the normalized upper-left corner coordinates of the i-th piece of text information T_i on the poster background image I; together these form the text layout prediction result for the text information.
Step 3: determining the text style. The text attributes of each piece of text information (including, for example, font and color) are determined, and each piece of text information is filled into its corresponding text layout area according to the text layout prediction result, completing poster generation.
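The three steps compose into a simple pipeline; the sketch below uses hypothetical callables `retrieve`, `layout` and `stylize` for the three stages described above, and the `TextBox` fields follow the normalized upper-left-corner convention of step 2:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class TextBox:
    text: str
    x: float  # normalized upper-left corner p_i
    y: float
    w: float
    h: float

def generate_poster(texts: Sequence[str], candidates: list,
                    retrieve: Callable,   # step 1: BriVL-based selection
                    layout: Callable,     # step 2: cascaded-autoencoder layout
                    stylize: Callable):   # step 3: style matching + rendering
    background = retrieve(texts, candidates)
    boxes: List[TextBox] = [TextBox(t, *layout(background, t)) for t in texts]
    return stylize(background, boxes)
```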
Alternatively, FIG. 5 is a schematic diagram of background images found by the present application as weak-correlation matches to the text information (i.e., the text information contained in the poster description text to be added to the poster), showing the background image selected for each piece of text information.
Optionally, the poster generation method provided by the present application first requires a poster text layout method that arranges the text within the poster image so as to generate the corresponding target poster image. Specifically, the poster text layout method in the present application may include the following steps:
determining a smooth area in the poster background image corresponding to the poster description text, and selecting an available area within the smooth area using a first autoencoder;
and sampling the available area, and generating a text box corresponding to the available area according to the sampling result and the text information to be added to the poster, so as to obtain a target text layout area for writing the text information in the poster background image.
Optionally, the determining a smooth area in the poster background image corresponding to the poster description text includes:
selecting a plurality of mutually overlapping candidate boxes in the poster background image, and generating a saliency map corresponding to the poster background image using a spectral residual algorithm;
determining a candidate value for each candidate box based on the mean saliency of the saliency map within the box, the number of pixels in the candidate box, and a preset offset;
and selecting candidate boxes whose candidate values are below a preset candidate threshold as target candidate boxes, and determining the smooth area in the poster background image based on the target candidate boxes, wherein the positions of the target candidate boxes in the poster background image do not overlap.
Optionally, the selecting candidate boxes whose candidate values are below the preset candidate threshold as target candidate boxes and determining the smooth area in the poster background image based on the target candidate boxes includes:
selecting, with non-maximum suppression, candidate boxes whose candidate values are below the preset candidate threshold as target candidate boxes;
determining the candidate box areas of all target candidate boxes in the poster background image;
and converting the candidate box areas in the poster background image into a binary image, and taking the converted binary image area as the smooth area in the poster background image.
Optionally, the selecting the available area within the smooth area using the first autoencoder includes:
feeding each target candidate box corresponding to the smooth area into the encoder side of the first autoencoder, which is constructed from stacked CNNs, to obtain the encoder output;
and concatenating the encoder output with a position embedding map and feeding the result into the decoder side of the first autoencoder, which is constructed from stacked transposed CNNs, to obtain the available area serving as the initial text layout result within the smooth area.
Optionally, after the text box corresponding to the available area is generated, the method further includes:
refining the layout of the text box corresponding to each available area autoregressively with a second autoencoder to obtain the target text box corresponding to each available area, thereby forming the target text layout area for writing the text information in the poster background image.
Optionally, the first and second autoencoders are both layout predictors with a cascaded auto-encoding architecture.
Optionally, a plurality of sample images are obtained, each sample image containing text regions annotated with the corresponding text description fields;
the sample background image is extracted from each sample image, and a smooth-region detector is used to determine the sample smooth image area in the sample background image;
the text description fields, text regions, sample background images and sample smooth image areas are combined into an encoder training data set;
and the first and second autoencoders are trained independently on the encoder training data set, such that the first autoencoder predicts the probability distribution of the text layout and the second autoencoder refines the layout text boxes.
Optionally, the second autoencoder is further trained with a self-supervised learning strategy; its encoder side is constructed from stacked CNNs, and its decoder side is constructed from a 2-layer bidirectional LSTM.
Alternatively, FIG. 6 is a schematic diagram comparing the poster text layout method proposed by the present application with other prior-art text layout methods, showing the target poster images produced under each layout method.
In the application, a poster description text generated by a user can be obtained, the poster description text containing the text information to be added to the poster; a candidate background image set is acquired, and a candidate background image weakly correlated with the text information is selected from it as the poster background image based on a visual text model trained on a plurality of sample images annotated with weakly correlated text pairs; after a text layout area is determined in the poster background image, the text information is filled into the text layout area to obtain the target poster image. By applying this technical scheme, a preset visual text model can be trained on a large number of weakly correlated image-text pairs, and the trained model can automatically select a poster background image weakly correlated with the text information the user is interested in; the final poster image is then generated from the automatically selected background image. This solves the problem in the related art that posters can only be produced by manual design, which cannot meet the large demand for high-quality posters.
Optionally, in another embodiment of the above method of the present application, the determining a text layout area in the poster background image includes:
performing a preliminary layout prediction on the poster background image with the first cascaded autoencoder to determine an initial text layout area; and
performing text layout refinement on the initial text layout area with the second cascaded autoencoder according to the text length and text attributes of the text information to obtain the text layout area.
In one embodiment, after the poster background image I is obtained, the present application predicts the text layout area P of the text information with two cascaded autoencoders (i.e., the first cascaded autoencoder and the second cascaded autoencoder), specifically through the following steps:
step a: a smooth image region in the poster background image is determined.
Further, the present application first generates several mutually overlapping regions (candidate boxes) of different sizes in the background image I, denoted A = {A_i}. A saliency map corresponding to the background image I is then generated by applying the spectral residual method, denoted S.
For each candidate box A_i, a candidate value v_i is assigned by averaging the saliency map S over the box:
v_i = (1 / |A_i|) Σ_{p ∈ A_i} S(p) + ε_i,
where S(p) is the saliency value at pixel p, |A_i| is the number of pixels in A_i, and ε_i is an offset sensitive to the candidate box size.
In a preferred scheme, candidate boxes whose candidate values are below the preset candidate threshold can be selected as target candidate boxes; the candidate box areas where the target candidate boxes lie are determined in the poster background image and converted into a binary image, and the converted binary image area is taken as the smooth image region in the poster background image.
It should be noted that the preset candidate threshold may be set adaptively per image, i.e., according to the mean of the candidate values over the regions of the current background image I; for example, the threshold may be set adaptively to 1.4 × mean{v_i}. The non-maximum suppression method NMS is applied to ensure that the regions finally selected from the previously generated regions do not overlap (for example, 1000 candidate boxes may be generated initially, and those boxes may overlap).
For example, as shown in FIG. 7, the central portion of image (a) is the saliency map S of the background image I, which in practice may be displayed in a first color (e.g., blue); the rectangular regions toward the edges of image (a) form the smooth region map A identified in this step, which in practice may be displayed in a second color clearly distinct from the first (e.g., red).
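Step a can be sketched as follows; this is a minimal NumPy/SciPy sketch where the saliency computation follows the standard spectral residual recipe, while the exact form of the size-sensitive offset, the filter sizes, and the placement of the 1.4 × mean threshold are assumptions:

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(gray: np.ndarray) -> np.ndarray:
    """Saliency map S of a grayscale image: log-amplitude spectrum minus
    its local average, transformed back to the image domain."""
    f = np.fft.fft2(gray)
    log_amp, phase = np.log(np.abs(f) + 1e-8), np.angle(f)
    residual = log_amp - uniform_filter(log_amp, size=3)
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sal = gaussian_filter(sal, sigma=2.5)
    return sal / (sal.max() + 1e-8)

def candidate_value(S: np.ndarray, box, eps: float) -> float:
    """v_i = mean saliency inside box A_i plus a size-sensitive offset
    (lower value = smoother region = better for text)."""
    x, y, w, h = box
    return float(S[y:y + h, x:x + w].mean()) + eps / (w * h)

def select_smooth_boxes(S: np.ndarray, boxes, eps: float = 1.0):
    values = [candidate_value(S, b, eps) for b in boxes]
    threshold = 1.4 * float(np.mean(values))  # adaptive threshold
    # non-maximum suppression over the survivors is omitted for brevity
    return [b for b, v in zip(boxes, values) if v < threshold]
```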
Step b: determining at least one initial text layout area in the smooth image region using the first autoencoder.
In one approach, the present application uses the first cascaded autoencoder g1 to predict a probability distribution over text layouts, denoted L, based on the smooth region map A obtained in step a. For each pixel p, L(p) ∈ [0,1] represents the probability that p belongs to a text box.
The first cascaded autoencoder g1 has an auto-encoding architecture in which the encoder f1 is a stacked CNN and the decoder h1 is a stacked transposed CNN. The input to the decoder h1 is constructed by concatenating the output of the encoder f1 with a learnable position embedding map (denoted E).
For example, the white portion of image (b) in FIG. 7 shows the predicted probability distribution L of possible text layouts in the poster background image I, where brighter colors indicate higher layout probability.
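A PyTorch sketch of such a g1 is given below. The 9 × 9 kernels, 16 channels and 64-dimensional encoder output follow the implementation details later in the text; the strides, layer counts, grid size (which assumes 400 × 300 inputs) and the spatial broadcast of the encoder vector are assumptions:

```python
import torch
import torch.nn as nn

class FirstCascadedAutoencoder(nn.Module):
    """g1: stacked-CNN encoder f1 -> 64-d vector; the vector is broadcast
    spatially, concatenated with a learnable position embedding map E,
    and decoded by stacked transposed CNNs into the layout probability
    map L with L(p) in [0, 1]."""

    def __init__(self, feat_dim: int = 64, grid=(100, 75)):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 9, stride=2, padding=4), nn.ReLU(),
            nn.Conv2d(16, 16, 9, stride=2, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        self.pos_emb = nn.Parameter(torch.randn(1, 16, *grid))  # map E
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_dim + 16, 16, 9, stride=2,
                               padding=4, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 9, stride=2,
                               padding=4, output_padding=1),
            nn.Sigmoid(),
        )

    def forward(self, smooth_map: torch.Tensor) -> torch.Tensor:
        z = self.encoder(smooth_map)                   # (B, 64)
        z = z[:, :, None, None].expand(-1, -1, *self.pos_emb.shape[-2:])
        e = self.pos_emb.expand(z.size(0), -1, -1, -1)
        return self.decoder(torch.cat([z, e], dim=1))  # L, same size as input
```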
Step c: iteratively refining the at least one initial text layout area in an autoregressive manner using the second autoencoder to obtain the target text layout area.
The layout probability distribution L obtained above is only an initial screening of regions available for text layout; it must be refined so that each available region becomes an explicit text box.
That is, the layout is first initialized by sampling the upper-left corner coordinate of each text box from L. The unnormalized coordinate is taken as the upper-left corner of the i-th text box, the box is initialized, and its size is determined according to the length and attributes of the corresponding text information; finally, the position and size of each text box for writing each piece of text information into the background image I are determined, which constitutes the text layout prediction result.
In a specific implementation, the text layout refinement may be performed on the initial text layout area with the second cascaded autoencoder to obtain the final text layout area as follows:
P^(k+1) = g2(Concat(A, L), P^(k)), k = 0, ..., K-1,
where k is the iteration index, g2 is the second autoencoder, A is the smooth image region, L is the probability distribution of the initial text layout, and P^(k) is the text layout at iteration k.
It should be noted that, for the second cascaded autoencoder, the encoder may be a stacked CNN and the decoder a 2-layer bidirectional LSTM.
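The refinement loop itself is short; the sketch below treats g2 as a black-box module with the interface implied by the equation above, and the iteration count K is an assumption:

```python
import torch

def refine_layout(g2, A: torch.Tensor, L: torch.Tensor,
                  P0: torch.Tensor, K: int = 5) -> torch.Tensor:
    """Iterative text-box refinement P^(k+1) = g2(Concat(A, L), P^(k)).

    A is the smooth-region map, L the layout probability map, and P0
    holds the text boxes initialized by sampling from L."""
    context = torch.cat([A, L], dim=1)  # Concat(A, L) along channels
    P = P0
    for _ in range(K):
        P = g2(context, P)              # one autoregressive refinement pass
    return P
```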
In a preferred embodiment, both the first cascaded autoencoder used for available-region layout prediction in step b and the second cascaded autoencoder used for text layout refinement in step c may use a layout predictor with a cascaded auto-encoding architecture, so as to effectively imitate the manual image-editing process.
FIG. 8 shows the overall flow architecture of the poster generation method proposed by the present application.
Optionally, in another embodiment based on the foregoing method of the present application, the filling the text information into the text layout area to generate the poster corresponding to the text information includes:
extracting the text feature of the text information and detecting the background color of the text layout area;
selecting, from the preset text style database, a target text style matching the text feature and the background color of the text layout area;
and converting the text information according to the target text style and filling the converted text information into the text layout area to obtain the target poster corresponding to the text information.
Further, for each text information T_i ∈ T' in the poster text, the present application extracts the corresponding character feature r_i = f_T(T_i) and obtains the background color c_i = I(p_i), where p_i is the text layout area corresponding to the text information T_i.

Based on (r_i, c_i), a target text style matching the character feature and the background color can be retrieved from the preset text style database F under cosine similarity, thereby determining the color and font of each T_i and filling it into the corresponding text layout area.
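A minimal sketch of this retrieval step is shown below, assuming the style database F stores one character-feature vector and one representative color per style, and that the two cosine similarities are weighted equally (the weighting is not specified in the application).

```python
import numpy as np

def pick_style(r_i, c_i, style_feats, style_colors):
    """Return the index of the style in F whose (feature, color) pair is
    most similar to (r_i, c_i) under cosine similarity."""
    def cos(a, b):
        return (b @ a) / (np.linalg.norm(a) * np.linalg.norm(b, axis=-1) + 1e-8)
    score = cos(r_i, style_feats) + cos(c_i, style_colors)  # assumed equal weights
    return int(np.argmax(score))

# r_i: character feature of T_i; c_i: background color of its layout area p_i.
best = pick_style(np.random.rand(64), np.array([0.9, 0.9, 0.95]),
                  np.random.rand(100, 64), np.random.rand(100, 3))
```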
Specifically, the following are implementation details of two autoencoders g1 and g2, including:
A) each convolutional layer used in g1 contains 16 convolutional kernels of size 9 × 9; the encoder of g1 finally outputs a 64-dimensional feature vector.
B) each convolutional layer used in g2 contains 64 convolution kernels of size 5 × 5; the hidden dimension of the 2-layer bidirectional LSTM (decoder) of g2 is set to 200.
C) in training the two automatic encoders described above, the data may be divided into a training set and a validation set; for example, the data set D may be divided into 138,013 poster images for training and 16,000 poster images for validation. Each poster image may be resized to 300 × 400. The Adam algorithm may be used to optimize the auto-encoders g1 and g2, with a learning rate of 0.05 and a batch size of 512. On four V100 GPUs, training took 4 hours for g1 and 48 hours for g2.
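Expressed as code, the reported optimizer settings might look as follows, reusing the LayoutAutoEncoderG1 sketch above; the binary cross-entropy objective on the layout map is an assumption, since the application does not state the training loss.

```python
import torch
import torch.nn as nn

model = LayoutAutoEncoderG1()                        # from the g1 sketch above
opt = torch.optim.Adam(model.parameters(), lr=0.05)  # reported: Adam, lr 0.05
loss_fn = nn.BCELoss()                               # assumed objective

def train_step(smooth_maps: torch.Tensor, target_maps: torch.Tensor) -> float:
    """One optimization step on a batch of (A, ground-truth layout) pairs."""
    opt.zero_grad()
    loss = loss_fn(model(smooth_maps), target_maps)
    loss.backward()
    opt.step()
    return loss.item()

# Reported batch size: 512 (maps shown at the assumed 100 x 100 resolution).
loss = train_step(torch.rand(512, 1, 100, 100), torch.rand(512, 1, 100, 100))
```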
Finally, to demonstrate the validity of the Text2Poster framework proposed in the present application, the following sections present the corresponding verifications, described as follows:
1) verification for image retrieval:
First, the validity of the image retrieval method is verified. For ease of comparison, in addition to the BriVL method provided herein, the present application also evaluates (a) the search engine of unsplash.com and (b) a tag-based matching method.
Representative search results obtained by the three different retrieval methods are shown in the present application. In contrast to the other two approaches, the images retrieved using the BriVL method provided by the present application do contain metaphors corresponding to the input text. For example, for the given text "Campus Charity Sale", the unsplash.com search engine and the tag-based matching method fail to retrieve images matching the theme. Even for challenging abstract descriptions such as "See the world together" and "Dreams never stop", the BriVL method provided by the present application can still find suitable images.
In the subjective evaluation, given 50 text queries, the present application retrieves the top 5 images of each query by the different methods, and invites three volunteers to score the quality of the retrieved images from 0 (very poor) to 4 (very good). The mean and standard deviation of the scores were 2.17 ± 0.10 for the unsplash.com search engine, 1.64 ± 0.16 for the tag-based matching method, and 2.38 ± 0.13 for the BriVL method provided by the present application, which further demonstrates the superiority of the method of the present application.
2) Verification for text layout prediction:
the layout predictor proposed herein was evaluated quantitatively and qualitatively and compared to the following control group:
A) the most advanced learning-based approach, LayoutGAN++;
B) the most advanced rule-based methods IUI and desa;
C) the commercial poster generator LUBAN at https://luban.
Further, to demonstrate the usefulness of the iterative layout optimization strategy of the present application, the present application sets K to 1, 5, and 30, respectively, for the layout predictor of the present application.
The present application collects 16,000 posters from huaban.com to construct a reference data set and prepares three background image sets: Unsplash2K, Unsplash10K, and PSD1.6K. Unsplash2K and Unsplash10K contain 2,000 and 10,000 background images from Unsplash, respectively; PSD1.6K contains 1,637 background images extracted from poster files in PSD format. For each image set, the present application arranges the input text on the background images and generates posters by the various methods. Following the work on LayoutGAN++, the present application calculates the Fréchet Inception Distance (FID) between the generated posters and the reference data set. The results in Table 1 show that the method of the present application is consistently superior to the above control groups and that its performance improves as K increases, which verifies the rationality of the iterative refinement strategy of the present application.
TABLE 1
Table 1 shows the objective and subjective evaluation of the various layout prediction methods. LUBAN in C) provides only a fee-based service and therefore cannot be quantitatively evaluated on a large scale.
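For reference, the FID computation described above can be sketched with a standard implementation; torchmetrics is an assumed stand-in here, as the application does not name the implementation it used.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # Inception-v3 pooled features

# uint8 image batches; in practice, use the full reference/generated sets.
reference = torch.randint(0, 256, (16, 3, 400, 300), dtype=torch.uint8)
generated = torch.randint(0, 256, (16, 3, 400, 300), dtype=torch.uint8)
fid.update(reference, real=True)    # posters from the reference data set
fid.update(generated, real=False)   # posters produced by a layout method
print(float(fid.compute()))         # lower is better
```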
In addition, the present application has manually selected 50 text sets, each of which contains a title and several subtitles or descriptions.
For each layout prediction method, the present application first retrieves five background images for each set of text and accordingly generates 250 posters. The present application asks three volunteers to score the layout aesthetics of these generated posters from 0 (very poor) to 4 (very good). For each method, the mean and standard deviation of the scores are shown in Table 1.
Optionally, in another embodiment of the present application, as shown in fig. 9, the present application further provides a poster background image selection apparatus, which includes:
an obtaining module 401 configured to obtain a poster description text containing text information for adding to a poster;
a selecting module 402 configured to select, based on a pre-trained visual text model, a candidate background image weakly correlated and matched with the text information from a pre-acquired candidate background image set as the poster background image.
In the application, a poster description text generated by a user can be obtained, wherein the poster description text contains the character information to be added to the poster; a candidate background image set is acquired, and a candidate background image weakly correlated and matched with the character information is selected from the candidate background image set as the poster background image based on a visual text model, the visual text model being trained on a plurality of sample images annotated with weakly correlated text pairs; after a text layout area is determined in the poster background image, the character information is filled into the text layout area to obtain the target poster image. By applying the technical scheme of the application, the preset visual text model can be trained with a plurality of weakly correlated image-text pairs, and the trained visual text model can automatically select a poster background image weakly correlated with the text information of interest to the user; a final poster image is then generated based on the automatically selected poster background image. This solves the problem in the related art that posters are generated only by manual design, which cannot meet the large demand for high-quality posters.
In another embodiment of the present application, the obtaining module 401 is configured to perform the following steps:
the pre-trained visual text model comprises: BriVL comprising a pre-trained image encoder and a pre-trained text encoder;
the method for selecting the candidate background image which is weakly correlated and matched with the character information from the pre-acquired candidate background image set as the poster background image based on the pre-trained visual text model comprises the following steps:
based on the pre-trained text encoder, performing feature extraction on the character information to obtain character features; and
respectively extracting features of each candidate background image based on the pre-trained image encoder to obtain image features corresponding to the candidate background images;
and calculating weak correlation feature similarity between the character features and image features respectively corresponding to the candidate background images, and taking the candidate background image with the highest weak correlation feature similarity with the character features as the poster background image.
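A minimal sketch of this selection step, assuming both encoders emit fixed-length feature vectors and that cosine similarity serves as the weak-correlation feature similarity:

```python
import torch
import torch.nn.functional as F

def select_background(text_feat: torch.Tensor, image_feats: torch.Tensor) -> int:
    """Rank candidate backgrounds by cosine similarity between the text
    feature (D,) and each image feature (N, D); return the best index."""
    sims = F.cosine_similarity(text_feat.unsqueeze(0), image_feats, dim=-1)
    return int(sims.argmax())

# e.g. a 256-d text feature against 1,000 candidate background images
best = select_background(torch.randn(256), torch.randn(1000, 256))
```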
In another embodiment of the present application, the selecting module 402 is configured to perform the steps of:
the performing, based on the pre-trained text encoder, feature extraction on the character information to obtain character features includes:
and inputting the character information into the encoder of RoBERTa-Large in the Chinese pre-training model, so that the encoder performs feature extraction on the character information and outputs the corresponding character features.
In another embodiment of the present application, the obtaining module 401 is configured to perform the steps of:
the pre-trained image encoder comprises: pre-trained Faster R-CNN and EfficientNet;
the pre-training-based image encoder respectively performs feature extraction on each candidate background image to obtain image features respectively corresponding to each candidate background image, and the method comprises the following steps:
and inputting each candidate background image into the pre-trained image encoder so that the pre-trained Faster R-CNN performs visual object detection processing on each candidate background image, and performing feature extraction on each candidate background image subjected to the visual object detection processing based on the EfficientNet to obtain image features corresponding to each candidate background image.
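As an illustration, the sketch below uses the pre-trained detector and backbone shipped with torchvision to stand in for the Faster R-CNN / EfficientNet pair; detecting objects, cropping them, and averaging EfficientNet features over the crops is an assumed reading of this step, not the application's exact design.

```python
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
effnet = torchvision.models.efficientnet_b0(weights="DEFAULT").eval()
effnet.classifier = torch.nn.Identity()  # keep the pooled 1280-d feature

@torch.no_grad()
def encode_image(img: torch.Tensor) -> torch.Tensor:
    """img: (3, H, W) in [0, 1]. Detect visual objects, then average
    EfficientNet features over the detected object crops."""
    boxes = detector([img])[0]["boxes"].round().long()
    crops = []
    for x1, y1, x2, y2 in boxes[:10].tolist():   # cap at 10 objects (assumption)
        if x2 > x1 and y2 > y1:
            crop = img[:, y1:y2, x1:x2].unsqueeze(0)
            crops.append(torch.nn.functional.interpolate(crop, size=(224, 224)))
    if not crops:                                # fall back to the whole image
        crops = [torch.nn.functional.interpolate(img.unsqueeze(0), size=(224, 224))]
    return effnet(torch.cat(crops)).mean(dim=0)  # one feature per candidate image

feat = encode_image(torch.rand(3, 400, 300))
```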
In another embodiment of the present application, the selecting module 402 is configured to perform the steps of:
and determining the weak correlation feature similarity between the character features and the image features respectively corresponding to the candidate background images based on an InfoNCE loss function constructed by a preset weakly supervised learning method and the contrastive learning method CPC (Contrastive Predictive Coding).
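For reference, a standard symmetric InfoNCE loss over a batch of image-text pairs, as used in CPC-style contrastive learning, can be sketched as follows; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(text_feats: torch.Tensor, image_feats: torch.Tensor,
             tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched image-text pairs are positives, all
    other pairs in the batch are negatives."""
    t = F.normalize(text_feats, dim=-1)
    v = F.normalize(image_feats, dim=-1)
    logits = t @ v.T / tau                 # (B, B) scaled cosine similarities
    labels = torch.arange(t.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```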
FIG. 10 is a block diagram illustrating a logical structure of an electronic device in accordance with an exemplary embodiment. For example, the electronic device 500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
In an exemplary embodiment, there is also provided a non-transitory computer readable storage medium, such as a memory, including instructions executable by a processor of an electronic device to perform a method of generating a poster as described above, the method comprising: obtaining a poster description text which contains text information used for being added on a poster; and selecting a candidate background image which is weakly correlated and matched with the character information from a pre-acquired candidate background image set as a poster background image based on a pre-trained visual text model. Optionally, the instructions may also be executable by a processor of the electronic device to perform other steps involved in the exemplary embodiments described above. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided an application/computer program product including one or more instructions executable by a processor of an electronic device to perform a method of generating a poster as described above, the method comprising: obtaining a poster description text which contains text information used for being added on a poster; and selecting a candidate background image which is weakly correlated and matched with the character information from a pre-acquired candidate background image set as a poster background image based on a pre-trained visual text model. Optionally, the instructions may also be executable by a processor of the electronic device to perform other steps involved in the exemplary embodiments described above.
Fig. 10 is an exemplary diagram of an electronic device 500. Those skilled in the art will appreciate that fig. 10 is merely an example of the electronic device 500 and does not constitute a limitation of the electronic device 500, which may include more or fewer components than those shown, combine certain components, or use different components; for example, the electronic device 500 may also include input-output devices, network access devices, buses, and the like.
The Processor 502 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor 502 may be any conventional processor. The processor 502 is the control center of the electronic device 500, connecting the various parts of the entire electronic device 500 through various interfaces and lines.
The memory 501 may be used to store computer-readable instructions 503, and the processor 502 may implement various functions of the electronic device 500 by running or executing the computer-readable instructions or modules stored in the memory 501 and invoking data stored in the memory 501. The memory 501 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the electronic device 500. In addition, the memory 501 may include a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one magnetic disk storage device, a flash memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), or other non-volatile/volatile storage devices.
The modules integrated by the electronic device 500 may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, all or part of the flow of the methods in the embodiments described above may be implemented by instructing the related hardware through computer-readable instructions, which may be stored in a computer-readable storage medium; when executed by a processor, the computer-readable instructions may implement the steps of the method embodiments described above.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
Claims (13)
1. A poster background image selection method, comprising:
obtaining a poster description text which contains text information used for being added on a poster;
and selecting a candidate background image which is weakly correlated and matched with the character information from a pre-acquired candidate background image set as a poster background image based on a pre-trained visual text model.
2. The poster background image selection method of claim 1, wherein the pre-trained visual text model comprises: BriVL comprising a pre-trained image encoder and a pre-trained text encoder;
the selecting, based on the pre-trained visual text model, a candidate background image weakly correlated and matched with the text information from a pre-acquired candidate background image set as a poster background image includes:
based on the pre-trained text encoder, performing feature extraction on the character information to obtain character features; and
respectively extracting features of each candidate background image based on the pre-trained image encoder to obtain image features corresponding to the candidate background images;
and calculating weak correlation feature similarity between the character features and image features respectively corresponding to the candidate background images, and taking the candidate background image with the highest weak correlation feature similarity with the character features as the poster background image.
3. The poster background image selection method of claim 2, wherein the pre-trained text encoder comprises: an encoder of RoBERTa-Large in a Chinese pre-training model;
the performing, based on the pre-trained text encoder, feature extraction on the character information to obtain character features includes:
and inputting the character information into the encoder of RoBERTa-Large in the Chinese pre-training model, so that the encoder performs feature extraction on the character information and outputs the corresponding character features.
4. The poster background image selection method of claim 2, wherein the pre-trained image encoder comprises: pre-trained Faster R-CNN and EfficientNet;
the pre-training-based image encoder respectively performs feature extraction on each candidate background image to obtain image features respectively corresponding to each candidate background image, and the method comprises the following steps:
and inputting each candidate background image into the pre-trained image encoder so that the pre-trained Faster R-CNN performs visual object detection processing on each candidate background image, and performing feature extraction on each candidate background image subjected to the visual object detection processing based on the EfficientNet to obtain image features corresponding to each candidate background image.
5. The poster background image selection method of claim 2, wherein the calculating of the weak correlation feature similarity between the character features and the image features respectively corresponding to the candidate background images, and taking the candidate background image with the highest weak correlation feature similarity with the character features as the poster background image, comprises:
and determining the weak correlation feature similarity between the character features and the image features respectively corresponding to the candidate background images based on an InfoNCE loss function constructed by a preset weakly supervised learning method and the contrastive learning method CPC.
6. A visual text model training method for selecting a poster background image, comprising:
acquiring a plurality of weakly correlated image text pairs, wherein the weakly correlated image text pairs are used for representing a group of weakly correlated candidate background images and historical text information;
and pre-training a preset visual text model based on the plurality of weakly correlated image text pairs to obtain the visual text model for selecting the poster background image weakly correlated with the character information.
7. The visual text model training method for selecting a poster background image of claim 6, wherein the pre-trained visual text model comprises: BriVL comprising a pre-trained image encoder and a pre-trained text encoder;
the pre-trained text encoder includes: an encoder of RoBERTa-Large in a Chinese pre-training model, and the pre-trained image encoder includes: pre-trained Faster R-CNN and EfficientNet;
the encoder of RoBERTa-Large is used for performing feature extraction on each piece of historical text information and outputting the corresponding character features;
the pre-trained Faster R-CNN is used for performing visual object detection processing on each candidate background image, and the EfficientNet is used for performing feature extraction on each candidate background image subjected to the visual object detection processing to obtain image features corresponding to each candidate background image;
and the BriVL determines the weak correlation feature similarity between the character features and the image features respectively corresponding to the candidate background images based on an InfoNCE loss function constructed by a preset weakly supervised learning method and the contrastive learning method CPC.
8. A poster generation method, comprising:
determining a text layout area corresponding to the text information in the poster background image obtained based on the poster background image selection method of any one of claims 1 to 5;
and filling the text information in the text layout area to generate a target poster corresponding to the text information.
9. The poster generation method of claim 8, wherein determining the text layout area in the poster background image comprises:
performing preliminary layout prediction on the poster background image by a first cascaded automatic encoder to determine an initial text layout area; and
and according to the text length and the text attribute represented by the character information, performing text layout refinement processing on the initial text layout region through a second-level cascade automatic encoder to obtain the text layout region.
10. The poster generation method of claim 8, wherein the filling of the text information in the text layout area to generate the poster corresponding to the text information comprises:
extracting character features of the character information, and detecting background colors of the text layout area;
selecting a target text style matched with the character characteristics and the background color from a preset text style database based on the background color of the text layout area;
and converting the character information according to the target text style, and filling the converted character information in the text layout area to obtain a target poster corresponding to the character information.
11. A poster background image selection apparatus comprising:
the acquisition module is configured to acquire poster description text which contains text information used for being added to the poster;
and the selecting module is configured to select a candidate background image which is weakly related and matched with the character information from a pre-acquired candidate selecting background image set as a poster background image based on the pre-trained visual text model.
12. An electronic device, comprising:
a memory for storing executable instructions; and
a processor for executing the executable instructions stored in the memory to perform the operations of the poster background image selection method of any one of claims 1 to 5, the visual text model training method for selecting poster background images of claim 6 or 7, or the poster generation method of any one of claims 8 to 10.
13. A computer-readable storage medium storing computer-readable instructions that, when executed, perform the operations of the poster background image selection method of any one of claims 1 to 5, the visual text model training method for selecting poster background images of claim 6 or 7, or the poster generation method of any one of claims 8 to 10.