CN114564768A - End-to-end intelligent plane design method based on deep learning - Google Patents


Info

Publication number
CN114564768A
Authority
CN
China
Prior art keywords
text
design
image
poster
layout
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210218256.4A
Other languages
Chinese (zh)
Inventor
李金遥
刘金华
刘传蔓
黄东晋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202210218256.4A
Publication of CN114564768A
Current legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/10 Geometric CAD
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Hardware Design (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end intelligent planar design method based on deep learning. Using a collected poster data set, the original data are first screened and partitioned; the two subtasks of layout design and attribute confirmation are learned jointly, and the model is trained and tuned on the training set until its performance reaches an optimal state. Given input image and text information, the trained model is invoked: the intelligent planar design framework extracts the feature information of the image and the text, searches for a suitable layout of the design, confirms the attributes of the composition text, and automatically generates a harmonious planar design. The invention uses a unified joint-training framework to prevent differences in data distribution during training, reduces the error propagation found in pipeline models, and exploits the advantages of end-to-end training; it requires no manually defined aesthetic rules for planar design, learning them from data instead, and it does not depend on image saliency-map detection, so it generalizes better to a variety of planar design tasks.

Description

End-to-end intelligent plane design method based on deep learning
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an end-to-end intelligent planar design method based on deep learning.
Background
Planar (graphic) design, poster design for example, is widely used in daily life as an important medium of visual communication. Designers often need a great deal of time and effort to create a harmonious and aesthetically pleasing graphic design, and the task presents a high threshold for ordinary people without a professional foundation, since it demands both refined aesthetics and specialized design knowledge. With the rapid progress of computer vision technology, interest in intelligent computer-aided design has grown increasingly strong.
At present, various methods for assisting planar design can significantly reduce the time and labor consumed by basic design work; they can be divided mainly into traditional methods, rule-based methods, and data-driven methods.
Traditional methods of automatic planar design usually rely on design rules or structured data; designers often obey such rules when creating graphic designs, for example aesthetic principles of layout and harmonious color models, but these methods offer limited intelligence and a narrow range of application. Rule-based approaches use design rules to assist specific design tasks, including layout synthesis, color generation, and image thumbnailing; yet aesthetic rules are complex and difficult to define and constrain with specific rules. Ali et al., in the document "Learning cancellation network for semantic segmentation", propose a constraint-based recommendation system that can generate a magazine cover design according to the user's preferences, but the method is only applicable to specific magazine cover templates.
With the development of deep learning, data-driven methods derive specific design attributes from training images and use them to solve planar design problems. Yang et al., in "Recommendation system for automatic design of magazine covers", demonstrate the effectiveness of optimization methods that use aesthetic design principles and propose a learning-based method that generates magazine covers with the help of a large database; however, the rules remain restrictive and the results suffer from a templating problem.
Zheng et al., in "Content-aware generative modeling of graphic design layouts", propose a content-aware deep generative model for graphic design layouts that can synthesize layouts based on visual and textual features, but it is mainly applied to magazine layouts and is not suitable for generating more diverse graphic posters.
Yang et al., in the document "assistant frame for comprehensive learning of visual representation", devised a system for automatically generating digital magazine covers by summarizing a set of topic-related templates and introducing a computational framework covering the key elements of layout design; however, this is a rule-based system with limited intelligence.
Feng et al., in "A dataset and a baseline model for spatial object detection", propose an interactive design system, Scap, intended to enhance the interactivity of the automatic layout generation process: it can generate layout suggestions for a set of images and texts input by a user, but it is mainly based on saliency detection and has limitations across different planar designs.
While the above techniques have contributed to generating suitable text-image layouts, the layout rules they derive do not generalize to the generation of planar design works, and they ignore the critical impact that design-element attributes other than layout have on planar designs. Meanwhile, the mainstream intelligent solutions rely excessively on saliency detection, which greatly limits the development of intelligent planar design.
Disclosure of Invention
Aiming at four major problems in the current field of intelligent planar design: first, excessive reliance on saliency-map detection, which limits intelligent planar design; second, the lack of consideration of, and solutions for, the styles of design elements in planar designs; third, the accumulation of propagated errors caused by the prevalent use of pipeline model structures; and fourth, the shortage of structured poster data; the invention provides an end-to-end intelligent planar design method based on deep learning that comprehensively considers the given image and design elements and automatically generates harmonious and attractive planar designs.
An end-to-end intelligent planar design method based on deep learning comprises the following steps:
(1) collecting semi-structured poster data, cleaning and screening them, and storing the composition-text attributes and the background image of each poster separately, so as to provide the required training data for the intelligent planar design model;
(2) jointly training the two subtasks of layout design and attribute confirmation in the intelligent planar design model, so that the model can extract features from the multimodal (visual and textual) views of the poster data;
(3) fusing the image and text features with the image module and the text module of the model, and decoding to obtain a layout density inference map;
(4) determining the layout design from the layout density inference map using an approximate inference algorithm;
(5) integrating the global image features and the local image features, and determining the attribute category of the composition text from the classifier output (an illustrative sketch of how these steps compose is given after this list).
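For illustration only, the following minimal Python sketch shows one way the five steps above could compose at inference time. Every module and function name here (encode_image, encode_text, fuse, decode_density, search_layout, classify_attributes) is a hypothetical placeholder, not an identifier disclosed by the invention.

```python
# Hypothetical composition of the five-step pipeline described above.
def design_poster(background_image, title_text, model):
    # Steps (2)-(3): extract and fuse multimodal features, decode a density map
    image_feats = model.encode_image(background_image)
    text_feats = model.encode_text(title_text)
    fused = model.fuse(image_feats, text_feats)
    density_map = model.decode_density(fused)

    # Step (4): approximate inference of the text-box layout on the density map
    text_box = model.search_layout(density_map)

    # Step (5): attribute categories from global + local (density-weighted) views
    attributes = model.classify_attributes(fused, background_image, density_map)
    return text_box, attributes
```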
Further, step (1) is implemented as follows: semi-structured poster data are first collected from publicly available web pages and screened; the poster background images, title text sequences, relative coordinates of the text boxes, text font information and other corresponding composition-text attributes in the collected poster data set are recorded and stored; and, taking the text sequence and background image as input and the rendered poster as target, the required training data are provided for the neural network in the intelligent planar design network framework. Training with self-supervised signals from semi-structured posters overcomes the difficulty of obtaining annotations.
Further, step (2) is implemented as follows: first, the two main parts that are key to planar poster design, the visual representation and the text representation, are aggregated, taking into account not only the characteristics of the image but also the semantics of the input text; the aggregated multimodal features are then used as the input of a decoder, which outputs a density map of the same size as the original image, where each element value corresponds to a score for the underlying pixel and represents the weight with which that pixel is selected to appear in the text region; an objective function is then designed for the training process, the overall objective being to minimize the sum of the layout prediction loss and the attribute recognition loss, and parameter learning of the model is performed by second-order gradient optimization.
Further, the expression of the objective function is as follows:
$$\mathcal{L} = \mathcal{L}_{layout} + \mathcal{L}_{attr}$$

$$\mathcal{L}_{layout} = -\sum_{i,j}\left[G_{i,j}\log\sigma(M_{i,j}) + (1 - G_{i,j})\log\left(1 - \sigma(M_{i,j})\right)\right]$$

$$\mathcal{L}_{attr} = -\sum_{a \in Attributes}\log \hat{p}_a(y_a)$$

wherein: $\mathcal{L}$ is the objective function, $\mathcal{L}_{layout}$ is the layout prediction loss function, $\mathcal{L}_{attr}$ is the attribute recognition loss function, $M_{i,j}$ is the element value of pixel $(i,j)$ in the density map, $G_{i,j}$ is a binary indicator of whether pixel $(i,j)$ is located in the text region, $\hat{p}_a(y_a)$ is the probability that attribute $a$ belongs to its suitable category $y_a$, $\sigma(\cdot)$ is the sigmoid function, and $Attributes$ denotes the set of attributes.
Furthermore, the intelligent planar design model adopts an encoder-decoder architecture overall. In the image module, a convolutional neural network extracts the image features of the poster; as the network deepens, the features are encoded in more channels, and the feature map encodes a wider receptive field. In the text module, a pre-trained language model extracts context-aware text features: distributed vectors are adopted that are expected to carry the semantic information of the input text; average pooling then yields a fixed-dimension distributed representation of the entire input sequence; a multilayer perceptron converts the text representation into a vector space similar to that of the image representation; and the text is finally represented as a single vector. The image features and text features undergo channel-level feature fusion and are decoded by the decoder to obtain the layout density inference map.
Further, step (4) adopts a search-based layout design: the position and size of the composition text are predicted by searching on the feature-fused layout density inference map, the score of each pixel being modeled to indicate suitable regions for the composition text, so that the layout design predicts the position and size of the composition-text region by solving an optimization problem on the density inference map. Starting from a local maximum of the layout density inference map, the rectangular region is gradually enlarged by heuristic search to determine the text-object region; assuming that the correct position of the composition-text input is approximately centered on the local maximum and that the score of a candidate region is almost convex in the distance from each edge to the local maximum, an approximate inference algorithm exploits the locality of the density map to determine the layout design.
Further, in step (5), after the poster layout design has been determined, the continuous attributes among the composition-text attributes (including text font, color, font size, etc.) are discretized into several categories and designed separately; the composition-text attribute design model then collects features from two sources: on the one hand, the hidden image features among the poster image-text features are collected as a global summary of the overall input; on the other hand, local features are collected from a weighted local view of the original image. Since the color of text information is generally constrained by the hue of the local text region, the composition-text attribute design model extracts local-view image features with a similar convolutional encoder, integrates the global and local features of the image, produces scores with a multilayer-perceptron classifier, and thereby jointly determines the category of each composition-text attribute.
The invention uses a unified joint-training framework to prevent differences in data distribution during training, reduces the error propagation found in pipeline models, and exploits the advantages of end-to-end training; meanwhile, the invention does not require manually defined aesthetic rules for planar design but learns them from data, and it does not depend on image saliency-map detection, so it generalizes better to a variety of planar design tasks.
Compared with the prior art, the invention has the following characteristics and beneficial technical effects:
1. The invention is the first to provide an end-to-end framework for intelligent planar design, avoiding both the weakness of error propagation produced by multi-step pipelines and the difficulty of maintaining aesthetic constraints.
2. The invention designs an end-to-end network that jointly learns layout design and attribute determination and trains the network with self-supervised signals extracted from semi-structured posters, overcoming the difficulty of obtaining labels.
3. Experimental results on crawled data demonstrate the effectiveness of the framework of the invention; extensive experiments and analysis of the results show that the end-to-end intelligent planar design method based on deep learning has outstanding advantages over previous saliency-based methods.
Drawings
Fig. 1 is a schematic overall flow chart of the intelligent planar design method of the present invention.
FIG. 2 is a schematic diagram of the self-supervision signals in the planar design data.
FIG. 3 is a schematic diagram of an image and text feature fusion model.
FIG. 4 is a schematic diagram of the search-based layout design.
FIG. 5 is a schematic diagram of an attribute determination process based on fused features.
Detailed Description
In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments.
In this embodiment, the collected poster data set is used: the original poster data set is first screened and partitioned, the two subtasks of layout design and attribute confirmation are learned jointly, and training and parameter tuning are carried out on the training set so that the model performance reaches an optimal state. Given input image and text information, the trained model is invoked: the intelligent planar design framework extracts the feature information of the image and the text, searches for a suitable layout of the design, confirms the composition-text attributes, and automatically generates a harmonious planar design. As shown in fig. 1, the specific process flow of this embodiment is as follows:
(1) Self-supervised planar design data collection and preprocessing.
Semi-structured poster data are collected from publicly available web pages and screened; the poster background images, basic composition units, title text sequences, relative coordinates of the text boxes, text font information and other corresponding design-element attributes in the collected poster data set are recorded and used as annotation information and self-supervision signals for the training data. Taking the text sequence and background image as input and the rendered poster as target, the required training signal is provided for the neural network in the intelligent poster design framework; training with self-supervised signals from semi-structured posters overcomes the difficulty of obtaining annotations. The data used in this embodiment are poster data, and the underlying data mainly consist of: poster background picture, basic composition unit, caption content and text sequence, character font, font color, font size, font position, etc., as shown in fig. 2.
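For concreteness, one plausible schema for a single self-supervised training record of this kind is sketched below in Python; the field names and types are editorial assumptions rather than part of the original disclosure.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PosterSample:
    """Hypothetical layout of one semi-structured poster record."""
    background_path: str   # poster background image file
    title_text: str        # caption/title text sequence
    text_box: Tuple[float, float, float, float]  # relative coords (x1, y1, x2, y2) in [0, 1]
    font_family: str       # character font
    font_color: str        # e.g. "#ffffff"
    font_size: int         # later discretized into categories
```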
(2) Joint training of the intelligent planar design framework.
The intelligent planar design framework uses a unified neural network for layout design and attribute confirmation, so the two subtasks enjoy the benefit of joint training. The two main parts that are key to planar poster design, the visual representation and the text representation, are aggregated as follows, taking into account not only the characteristics of the image but also the semantics of the input text:

$$F = \mathrm{Concat}(H_V,\ \mathrm{REP}(U),\ \mathrm{REP}(L))$$

wherein $H_V \in \mathbb{R}^{C_V \times H \times W}$ denotes the hidden image feature map extracted by the encoder, with $C_V$, $H$ and $W$ the number of channels, height and width of the feature map, respectively; the text representation $U$ is replicated along the height and width dimensions to align with the visual representation (the replication operation is denoted REP); and a scalar feature is additionally added: the length $L$ of the input token sequence $T = (T_1, T_2, \ldots, T_n)$.
The aggregated multimodal feature $F$ is used as input to the decoder, which outputs a density map of the same size as the original image:

$$M = \mathrm{Decoder}(F), \qquad M \in \mathbb{R}^{H \times W}$$

Each element $M_{i,j}$ corresponds to pixel $I_{i,j}$ and represents the weight with which that pixel is selected to appear in the text region.
The layout-design objective $\mathcal{L}_{layout}$ and the composition-text attribute-confirmation objective $\mathcal{L}_{attr}$ are respectively given by:

$$\mathcal{L} = \mathcal{L}_{layout} + \mathcal{L}_{attr}$$

$$\mathcal{L}_{layout} = -\sum_{i,j}\left[G_{i,j}\log\sigma(M_{i,j}) + (1 - G_{i,j})\log\left(1 - \sigma(M_{i,j})\right)\right]$$

$$\mathcal{L}_{attr} = -\sum_{a \in Attributes}\log \hat{p}_a(y_a)$$

wherein $G_{i,j}$ is a binary indicator of whether pixel $(i,j)$ lies in a text region, and $\hat{p}_a(y_a)$ is the probability that attribute $a$ belongs to its suitable category $y_a$. The overall goal is to minimize the sum of the layout design loss and the attribute confirmation loss; parameter learning of the model is performed by second-order gradient optimization, and joint training helps the intelligent planar design network extract better features from the multimodal (visual and textual) views of the data.
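A minimal sketch of this joint objective is given below, assuming a PyTorch implementation with pixel-wise binary cross-entropy for the layout term and one cross-entropy term per discretized attribute; the reduction and weighting choices are assumptions, since the description specifies only that the summed loss is minimized (the second-order gradient optimization is omitted here).

```python
import torch
import torch.nn.functional as nnf

def joint_loss(density_map, text_mask, attr_logits, attr_labels):
    """Sketch of L = L_layout + L_attr (assumption: unweighted sum).

    density_map: (H, W) raw scores M from the decoder
    text_mask:   (H, W) float binary indicator G (1 inside the text box)
    attr_logits: dict mapping attribute name -> (num_classes,) logits
    attr_labels: dict mapping attribute name -> ground-truth class index
    """
    # Pixel-wise binary cross-entropy between sigma(M) and G
    layout_loss = nnf.binary_cross_entropy_with_logits(density_map, text_mask)
    # Cross-entropy over each discretized composition-text attribute
    attr_loss = sum(
        nnf.cross_entropy(attr_logits[a].unsqueeze(0),
                          torch.tensor([attr_labels[a]]))
        for a in attr_logits
    )
    return layout_loss + attr_loss
```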
(3) Image-text feature fusion and layout density inference.
The intelligent planar design model is trained on the collected poster training set and adopts an encoder-decoder architecture overall, as shown in fig. 3. In the image module, a convolutional neural network extracts the image features of the poster; as the network goes deeper, it encodes the features in more channels and uses the feature map to encode a wider receptive field:

$$H_V = \mathrm{Encoder}(I)$$

In the text module, a pre-trained language model extracts context-aware text features:

$$U = \mathrm{MLP}\big(\mathrm{AVG}(\{E_1, E_2, \ldots, E_n\})\big)$$

wherein each token $T_i$ is embedded into a token representation $E_i$; these distributed vectors are expected to carry the semantic information of the input text. Average pooling (denoted AVG) then yields a fixed-dimension distributed representation of the entire input sequence, and a multilayer perceptron (denoted MLP) converts the text representation into a vector space similar to that of the image representation, so that the final text representation is a single vector $U$.

The image features and text features undergo channel-level feature fusion and are decoded by the decoder to obtain the layout density inference map $M$.
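The channel-level aggregation F = Concat(H_V, REP(U), REP(L)) can be sketched as follows with PyTorch tensors; the broadcasting details are a straightforward reading of the REP operation, not a disclosed implementation.

```python
import torch

def fuse_features(H_V: torch.Tensor, U: torch.Tensor, L: int) -> torch.Tensor:
    """Channel-level fusion of image features, text vector, and length scalar.

    H_V: (C_V, H, W) hidden image feature map from the CNN encoder
    U:   (C_T,)     pooled text vector from the language model + MLP
    L:   length of the input token sequence (scalar feature)
    """
    _, H, W = H_V.shape
    U_rep = U[:, None, None].expand(-1, H, W)     # replicate text over H and W
    L_rep = torch.full((1, H, W), float(L))       # replicate the length scalar
    return torch.cat([H_V, U_rep, L_rep], dim=0)  # (C_V + C_T + 1, H, W)
```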
(4) Search-based layout design.
The layout design module predicts the position and size of the composition text by searching on the feature-fused layout density map; the score of each pixel is modeled to indicate suitable regions for the composition text.
The layout design predicts the position and size of the composition-text region by solving an optimization problem on the density map. Based on the output of the decoder, $\sigma(M_{i,j})$ denotes the probability that pixel $I_{i,j}$ lies inside the bounding box of the given text sequence, where $\sigma$ is the sigmoid function:

$$\sigma(M_{i,j}) = \frac{1}{1 + e^{-M_{i,j}}}$$

Given the density map $M$ and the corresponding probability matrix $\sigma(M)$, the corner coordinates of the text box are determined: the lower-left corner $(x_1, y_1)$ and the upper-right corner $(x_2, y_2)$, with $x_1 < x_2$ and $y_1 < y_2$; the prediction task is thereby converted into a constrained optimization problem. Starting from a local maximum of the layout density map, the rectangle is gradually enlarged by heuristic search to determine the text-object region. Assuming that the correct position of the composition-text input is approximately centered on the local maximum and that the score of a candidate region is almost convex in the distance from each edge to the local maximum, an approximate inference algorithm exploits the locality of the density map to determine the layout design, as shown in fig. 4.
(5) Composition-text attribute determination based on fused features.
Through the data-driven joint learning of step (2), the fused feature F contains key information not only for the layout design but also for the composition-text attributes. After the poster layout design has been determined, the continuous attributes among the composition-text attributes (including text font, color, font size, etc.) are discretized into several categories and designed separately. The composition-text attribute design model collects features mainly from two sources: on the one hand, the hidden image features among the poster image-text features, as a global summary of the overall input; on the other hand, local features collected from weighted local views of the original image. In particular, since the color of text information is generally constrained by the hue of the local text region, the model uses a similar convolutional encoder to extract local-view image features, as shown in fig. 5, merges the global and local features of the image, and produces scores with a multilayer-perceptron classifier, thereby jointly determining the category of each composition-text attribute.
The feature $F$ from the multimodal feature-extraction network describes a global view of the input, while local views of the image also have a critical effect on attribute determination; for example, the color of the text input is generally constrained by the hue of the local text region. The original image is therefore weighted as

$$I' = \sigma(M) \odot I$$

applying the probability density map as an attention weight on each pixel of the original image, where $\odot$ denotes the element-wise (position-aligned) product. A similar convolutional encoder extracts the local-view image features:

$$F_l = \mathrm{Encoder}(I')$$

The global and local features are concatenated, and an MLP classifier outputs logits, which are normalized into a probability distribution by the softmax function; $p_i$ represents the probability that the attribute belongs to the $i$-th class:

$$\mathrm{logit} = \mathrm{MLP}(F, F_l), \qquad p_i = \frac{\exp(\mathrm{logit}_i)}{\sum_k \exp(\mathrm{logit}_k)}$$
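A plausible sketch of this attribute-determination branch in PyTorch is given below; the layer shapes and sizes are illustrative assumptions, with only the overall structure (density-weighted local view, shared global feature, MLP classifier with softmax output) taken from the description.

```python
import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    """Hypothetical attribute branch: sigma(M) re-weights the image pixel-wise,
    a small convolutional encoder extracts the local-view feature F_l, and an
    MLP over the concatenation [F, F_l] emits per-class logits."""

    def __init__(self, global_dim: int, local_dim: int, num_classes: int):
        super().__init__()
        self.local_encoder = nn.Sequential(
            nn.Conv2d(3, local_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool the local view down to a vector
        )
        self.classifier = nn.Sequential(
            nn.Linear(global_dim + local_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, F_global, image, density_map):
        # I' = sigma(M) (element-wise) I: density map as per-pixel attention
        weighted = torch.sigmoid(density_map).unsqueeze(1) * image
        F_l = self.local_encoder(weighted).flatten(1)  # (N, local_dim)
        logits = self.classifier(torch.cat([F_global, F_l], dim=1))
        return torch.softmax(logits, dim=1)            # p_i for each class
```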
the performance of an end-to-end intelligent plane design framework on various mainstream models is researched, and Test loss (lower is better), Jaccard similarity (higher is better) and ACC (higher is better) indexes are used for respectively evaluating the overall quality, the layout generation quality and the attribute design quality.
As the backbone structure becomes more capable, the end-to-end intelligent planar design framework obtains better design works in terms of overall quality, and the overall performance and the effectiveness of each component improve accordingly. As shown in Table 1, the performance of the end-to-end intelligent planar design network of the invention on the layout-prediction subtask is superior to that of the saliency-detection-based method, which obtains a lower Jaccard similarity at test time than all AuPoD-series networks except AuPoD-FCN32. The results also show that the end-to-end intelligent planar design framework can benefit from the inductive bias contributed by better saliency detection.
TABLE 1 (rendered as an image in the original publication; it reports Test loss, Jaccard similarity, and ACC for each backbone variant)
The foregoing description of the embodiments is provided to enable one of ordinary skill in the art to make and use the invention. It will be readily apparent to those skilled in the art that various modifications to these embodiments may be made, and the generic principles defined herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiments; improvements and modifications made by those skilled in the art based on the disclosure of the present invention should fall within the protection scope of the present invention.

Claims (8)

1. An end-to-end intelligent planar design method based on deep learning, comprising the following steps:
(1) collecting semi-structured poster data, cleaning and screening them, and storing the composition-text attributes and the background image of each poster separately, so as to provide the required training data for the intelligent planar design model;
(2) jointly training the two subtasks of layout design and attribute confirmation in the intelligent planar design model, so that the model can extract features from the multimodal view of the poster data;
(3) fusing the image and text features with the image module and the text module of the model, and decoding to obtain a layout density inference map;
(4) determining the layout design from the layout density inference map using an approximate inference algorithm;
(5) integrating the global image features and the local image features, and determining the attribute category of the composition text from the classifier output.
2. The end-to-end intelligent planar design method of claim 1, wherein: step (1) is implemented as follows: semi-structured poster data are first collected from publicly available web pages and screened; the poster background images, title text sequences, relative coordinates of the text boxes, text font information and other corresponding composition-text attributes in the collected poster data set are recorded and stored; and, taking the text sequence and background image as input and the rendered poster as target, the required training data are provided for the neural network in the intelligent planar design network framework.
3. The end-to-end intelligent planar design method of claim 1, wherein: step (2) is implemented as follows: first, the two main parts that are key to planar poster design, the visual representation and the text representation, are aggregated, taking into account not only the characteristics of the image but also the semantics of the input text; the aggregated multimodal features are then used as the input of a decoder, which outputs a density map of the same size as the original image, where each element value corresponds to a score for the underlying pixel and represents the weight with which that pixel is selected to appear in the text region; an objective function is then designed for the training process, the overall objective being to minimize the sum of the layout prediction loss and the attribute recognition loss, and parameter learning of the model is performed by second-order gradient optimization.
4. The end-to-end intelligent planar design method of claim 3, wherein: the expression of the objective function is as follows:
$$\mathcal{L} = \mathcal{L}_{layout} + \mathcal{L}_{attr}$$

$$\mathcal{L}_{layout} = -\sum_{i,j}\left[G_{i,j}\log\sigma(M_{i,j}) + (1 - G_{i,j})\log\left(1 - \sigma(M_{i,j})\right)\right]$$

$$\mathcal{L}_{attr} = -\sum_{a \in Attributes}\log \hat{p}_a(y_a)$$

wherein: $\mathcal{L}$ is the objective function, $\mathcal{L}_{layout}$ is the layout prediction loss function, $\mathcal{L}_{attr}$ is the attribute recognition loss function, $M_{i,j}$ is the element value of pixel $(i,j)$ in the density map, $G_{i,j}$ is a binary indicator of whether pixel $(i,j)$ is located in the text region, $\hat{p}_a(y_a)$ is the probability that attribute $a$ belongs to its suitable category $y_a$, $\sigma(\cdot)$ is the sigmoid function, and $Attributes$ denotes the set of attributes.
5. The end-to-end intelligent planar design method of claim 1, wherein: the intelligent planar design model adopts an encoder-decoder architecture overall; in the image module, a convolutional neural network extracts the image features of the poster, and as the network deepens the features are encoded in more channels and the feature map encodes a wider receptive field; in the text module, a pre-trained language model extracts context-aware text features, that is, distributed vectors expected to carry the semantic information of the input text are adopted, average pooling then yields a fixed-dimension distributed representation of the entire input sequence, a multilayer perceptron converts the text representation into a vector space similar to that of the image representation, and the text is finally represented as a single vector; and the image features and text features undergo channel-level feature fusion and are decoded by the decoder to obtain the layout density inference map.
6. The end-to-end intelligent planar design method of claim 1, wherein: step (4) adopts a search-based layout design, that is, the position and size of the composition text are predicted by searching on the feature-fused layout density inference map, the score of each pixel being modeled to indicate suitable regions for the composition text, so that the layout design predicts the position and size of the composition-text region by solving an optimization problem on the density inference map; starting from a local maximum of the layout density inference map, the rectangular region is gradually enlarged by heuristic search to determine the text-object region; and, assuming that the correct position of the composition-text input is approximately centered on the local maximum and that the score of a candidate region is almost convex in the distance from each edge to the local maximum, an approximate inference algorithm exploits the locality of the density map to determine the layout design.
7. The end-to-end intelligent planar design method of claim 1, wherein: after the poster layout design has been determined, the continuous attributes among the composition-text attributes are discretized into several categories and designed separately, and the composition-text attribute design model collects features from two sources: on the one hand, the hidden image features among the poster image-text features are collected as a global summary of the overall input; on the other hand, local features are collected from a weighted local view of the original image; and, since the color of text information is generally constrained by the hue of the local text region, the composition-text attribute design model extracts local-view image features with a similar convolutional encoder, integrates the global and local features of the image, produces scores with a multilayer-perceptron classifier, and thereby jointly determines the category of each composition-text attribute.
8. The end-to-end intelligent planar design method of claim 1, wherein: the collected poster data set is used; the original poster data set is first screened and partitioned, the two subtasks of layout design and attribute confirmation are learned jointly, and training and parameter tuning are carried out on the training set so that the model performance reaches an optimal state; and, according to the input image and text information, the trained model is invoked, whereby the intelligent planar design framework extracts the feature information of the image and the text, searches for a suitable layout of the design, confirms the composition-text attributes, and automatically generates a harmonious planar design.
CN202210218256.4A 2022-03-03 2022-03-03 End-to-end intelligent plane design method based on deep learning Pending CN114564768A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210218256.4A CN114564768A (en) 2022-03-03 2022-03-03 End-to-end intelligent plane design method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210218256.4A CN114564768A (en) 2022-03-03 2022-03-03 End-to-end intelligent plane design method based on deep learning

Publications (1)

Publication Number Publication Date
CN114564768A true CN114564768A (en) 2022-05-31

Family

ID=81717527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210218256.4A Pending CN114564768A (en) 2022-03-03 2022-03-03 End-to-end intelligent plane design method based on deep learning

Country Status (1)

Country Link
CN (1) CN114564768A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998482A (en) * 2022-06-13 2022-09-02 厦门大学 Intelligent generation method of characters and artistic patterns
CN116776827A (en) * 2023-08-23 2023-09-19 山东捷瑞数字科技股份有限公司 Artificial intelligent typesetting method, device, equipment and readable storage medium
CN116776827B (en) * 2023-08-23 2023-11-21 山东捷瑞数字科技股份有限公司 Artificial intelligent typesetting method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
WO2021147726A1 (en) Information extraction method and apparatus, electronic device and storage medium
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
CN110750959A (en) Text information processing method, model training method and related device
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN114743020B (en) Food identification method combining label semantic embedding and attention fusion
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
CN113535917A (en) Intelligent question-answering method and system based on travel knowledge map
CN113779211A (en) Intelligent question-answer reasoning method and system based on natural language entity relationship
CN114564768A (en) End-to-end intelligent plane design method based on deep learning
CN110502655B (en) Method for generating image natural description sentences embedded with scene character information
CN114996488A (en) Skynet big data decision-level fusion method
CN114239585A (en) Biomedical nested named entity recognition method
CN116860978B (en) Primary school Chinese personalized learning system based on knowledge graph and large model
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN112036178A (en) Distribution network entity related semantic search method
CN115311465A (en) Image description method based on double attention models
CN114880307A (en) Structured modeling method for knowledge in open education field
Nam et al. A survey on multimodal bidirectional machine learning translation of image and natural language processing
CN117497178A (en) Knowledge-graph-based common disease auxiliary decision-making method
CN115017884A (en) Text parallel sentence pair extraction method based on image-text multi-mode gating enhancement
CN114048314A (en) Natural language steganalysis method
CN117648429A (en) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model
Ma et al. Ontology-based BERT model for automated information extraction from geological hazard reports
CN116662591A (en) Robust visual question-answering model training method based on contrast learning
CN116843175A (en) Contract term risk checking method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination