CN115511969A - Image processing and data rendering method, apparatus and medium - Google Patents

Image processing and data rendering method, apparatus and medium

Info

Publication number
CN115511969A
Authority
CN
China
Prior art keywords
image
information
target
original
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211462844.9A
Other languages
Chinese (zh)
Other versions
CN115511969B (en)
Inventor
周敏
马也
林金鹏
侯兴林
张渊猛
史斌斌
曹耘宁
许晨晨
高逸凡
蒋刚玮
王诗瑶
葛铁铮
姜宇宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202211462844.9A
Publication of CN115511969A
Application granted
Publication of CN115511969B
Legal status: Active
Anticipated expiration

Classifications

    • G06T7/73 — Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G06T5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06V10/764 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06T2207/20221 — Indexing scheme for image analysis or image enhancement; image combination; image fusion / image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present application provide an image processing and data rendering method, device, and medium. The embodiments provide a scheme that automatically generates images without relying on manually designed templates. The scheme is centered on the picture material rather than on a template: starting from a target material image, it produces a composite image of any size and satisfactory quality through base-image generation, image layout generation, on-image document (copy) generation, and visual attribute estimation and rendering. Because the image layout, the on-image document, and the visual attributes are designed independently rather than constrained by fixed positions on a template, the layout is more flexible and reasonable, composited elements can avoid the subject, subject prominence and visual fusion are improved, and the composite image looks more like an original design; the document information is more expressive; the color matching of the visual attributes is richer, so the composite image performs well both visually and in advertisement delivery. The scheme is suitable for large-scale application and has low implementation cost.

Description

Image processing and data rendering method, apparatus and medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, and a medium for image processing and data rendering.
Background
With the continuous development of the internet, e-commerce, and related technologies, more and more image data needs to be processed. In precise advertisement delivery in particular, achieving personalized, user-attracting effects is inseparable from high-quality creatives for massive numbers of commodity advertisement images. Currently, there are two main ways the industry produces commodity advertisement image creatives.
The first is fully manual creation: a designer custom-builds the advertisement creative for each specific commodity and each specific picture. This way of producing advertisement images is costly and inefficient, and is difficult to scale across resource slots with different materials, different commodities, and even different sizes.
The second is automatic stitching onto creative templates: a designer produces a template that reserves slots at fixed positions for elements such as the commodity picture and the document (copy), and when an advertisement image is produced, the commodity material is fetched and stitched onto the template in real time. This approach depends heavily on templates, the advertisement layout is fixed, the commodity picture fuses poorly with the template, and the resulting advertisement images are of low quality, e.g. the background and foreground information are poorly coupled and the image lacks a natural, harmonious feel.
Disclosure of Invention
Aspects of the present disclosure provide an image processing and data rendering method, device, and medium, which are used to provide an image generation method independent of a template, so as to improve image generation efficiency, reduce cost, and improve image quality.
An embodiment of the present application provides an image processing method, including: generating a base image according to a target material image containing a subject object, wherein the target material image has an original size and the base image has a target size; inputting the base image into an image layout model for image layout to obtain image layout information of the base image, wherein the image layout information comprises the position and category of at least one target area on the base image for carrying at least one element to be synthesized; inputting the base image, the position and category of the at least one target area, and the basic material information corresponding to the subject object into a document generation model to generate document information, so as to obtain the document information in the at least one element to be synthesized; estimating the visual attribute of the at least one element to be synthesized according to the base image and the position and category of the at least one target area, so as to obtain the visual attribute of the at least one element to be synthesized; and rendering at least the document information in the at least one element to be synthesized onto the base image according to the position and category of the at least one target area and the visual attribute of the at least one element to be synthesized, so as to obtain a target composite image.
An embodiment of the present application further provides an image processing method, including: acquiring an original image containing a subject object, the original image having an original size; sending the original image into an element detection model for on-graph element analysis to obtain original synthetic elements and attribute information thereof contained in the original image; restoring the original image according to the attribute information of the original synthetic element to obtain a restored image which does not contain the original synthetic element; and carrying out image redirection processing on the repaired image according to the size relation between the target size and the original size so as to obtain a target image with the target size.
An embodiment of the present application further provides a data rendering method, including: acquiring an object to be rendered, wherein the object to be rendered comprises at least one target area for carrying at least one element to be synthesized, and the object to be rendered is an image or a page; estimating the visual attribute of the at least one element to be synthesized according to the object to be rendered and the position and category of the at least one target area, so as to obtain the visual attribute of the at least one element to be synthesized; and rendering the at least one element to be synthesized onto the object to be rendered according to the position and category of the at least one target area and the visual attribute of the at least one element to be synthesized.
An embodiment of the present application further provides a computer device, including: a memory and a processor; wherein the memory is for storing a computer program; the processor, coupled with the memory, is configured to execute the computer program, so as to implement the steps in the methods provided by the embodiments of the present application.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps in the various methods provided by the embodiments of the present application.
The embodiment of the application provides a scheme that can automatically generate images without relying on manually designed templates. The scheme is centered on the picture material rather than on a template: only a target material image needs to be obtained, and based on the target material image, a composite image of any size and of satisfactory quality can be obtained with machine learning models through base-image generation, image layout generation, on-image document generation, and visual attribute estimation and rendering of the elements to be synthesized. Compared with template-based creative images, the technical solution of the application achieves a more flexible and reasonable image layout; the composited elements can avoid the subject, which enhances subject prominence, improves the visual fusion, and makes the composite image look more like an original design; the document information is more flexible and expressive; the color matching rendered from the visual attributes can be richer, so that the document and the subject are clearly ordered in visual priority, and the composite image performs excellently both visually and in advertisement delivery. In addition, because the method no longer depends on templates and is not limited by the number of templates, a large number of images can be synthesized at a lower implementation cost.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flowchart of an image processing method according to an exemplary embodiment of the present application;
FIG. 2 is a schematic flowchart of creative base map generation provided by an exemplary embodiment of the present application;
FIG. 3 is a diagram illustrating PS element classes on an image provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of a network architecture of a controllable text composition module, a composition policy network, and a text erasure model according to an exemplary embodiment of the present application;
FIG. 5 is a network architecture diagram of a domain alignment model provided in an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of an architecture of a Transformer network with an autoregressive structure according to an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a network architecture of a geometric alignment module provided in an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of positive and negative samples of a document image constructed in accordance with an exemplary embodiment of the present application;
FIG. 9 is a schematic structural diagram of a multi-layer Transformer-based multi-modal model constructed in accordance with exemplary embodiments of the present application;
FIG. 10 is a diagram illustrating a state of slice coding of an image according to an exemplary embodiment of the present application;
FIG. 11 is a schematic diagram of a training process of a font recognition module provided in an exemplary embodiment of the present application;
FIG. 12 is a schematic diagram of a network architecture of a visual property prediction model according to an exemplary embodiment of the present application;
FIGS. 13-15 are schematic illustrations of creative advertising images of any size generated for different categories of items provided by exemplary embodiments of the present application;
FIG. 16a is a schematic diagram of an image processing system according to an exemplary embodiment of the present application;
FIG. 16b is a schematic structural diagram of another image processing system according to an exemplary embodiment of the present disclosure;
fig. 17 is a schematic flowchart of a data rendering method according to an exemplary embodiment of the present application;
fig. 18a is a schematic structural diagram of an image processing apparatus according to an exemplary embodiment of the present application;
fig. 18b is a schematic structural diagram of another image processing apparatus according to an exemplary embodiment of the present application;
fig. 18c is a schematic structural diagram of a data rendering apparatus according to an exemplary embodiment of the present application;
fig. 19 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In view of the technical problems that existing methods for producing advertisement images depend on templates, are constrained by the number of templates and the cost of producing them, and therefore suffer from high production cost, low efficiency, and poor image quality, the embodiments of the present application provide a scheme that can automatically generate images without relying on manually designed templates. The scheme is centered on the picture material rather than on a template: only a target material image needs to be obtained, and based on the target material image, machine learning models are used to obtain a composite image of any size and of satisfactory quality through base-image generation, image layout generation, on-image document generation, and visual attribute estimation and rendering of the elements to be synthesized.
Because the method is no longer constrained by fixed positions on a template, the image size, image layout, on-image document, and visual attributes can each be designed independently, with greater flexibility and freedom. Compared with template-based creative images, the technical solution of the application yields a more flexible and reasonable image layout; the subject can be avoided when compositing elements, which enhances subject prominence, improves the visual fusion between the composited elements and the subject, and makes the composite image look more like an original design; the document information is more flexible and expressive; the color matching rendered from the visual attributes can be richer, so that the document and the subject are clearly ordered in visual priority, and the composite image performs excellently both visually and in advertisement delivery. In terms of image size, neither the target material image nor the composite image is limited: a target material image of arbitrary size can be used and a composite image of arbitrary size can be generated. In addition, because the technical solution no longer depends on templates and is not limited by the number of templates, a large number of images can be synthesized at low implementation cost.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of an image processing method according to an exemplary embodiment of the present application. As shown in fig. 1, the method includes:
s101, generating a base image according to a target material image containing a main body object, wherein the target material image has an original size, and the base image has a target size;
s102, inputting the base image into an image layout model for image layout to obtain image layout information of the base image, wherein the image layout information comprises the position and the category of at least one target area for bearing at least one element to be synthesized on the base image;
s103, inputting the base image, the position and the category of at least one target area and the basic material information corresponding to the main object into a pattern generation model to generate pattern information so as to obtain the pattern information in at least one element to be synthesized;
s104, estimating the visual attribute of at least one element to be synthesized according to the position and the category of the base image and the at least one target area to obtain the visual attribute of the at least one element to be synthesized;
s105, according to the position and the category of the at least one target area and the visual attribute of the at least one element to be synthesized, at least rendering the pattern information in the at least one element to be synthesized to the base image to obtain a target synthesized image.
In the present embodiment, the target material image is a material image whose quality is suitable for generating the base image. The base image is generated from the target material image and is the basis of the composite creative image. The target material image contains a subject object, and the base image has the same subject object as the target material image; the subject object varies with the image composition or creative scenario. The purpose of image composition is to composite other elements at appropriate positions on the base image, centered on the subject object, so as to obtain the desired target composite image. As the subject object differs, so do the elements to be composited on the base image and their positions (i.e., the layout information) and visual attributes. For example, in a creative advertisement scenario, the subject object contained in the target material image or base image may be, but is not limited to: various persons or scenic spots to be publicized, or various commodities to be promoted, such as vehicles, clothing, electronic products, beauty products, furniture, and household appliances.
In this embodiment, the target material image has an original size and the base image has a target size; the original size and the target size may be the same or different, which is not limited. The embodiment can generate a base image of any size from a given target material image of any size, and finally obtain a target composite image with the same size as the base image. In the case where the original size differs from the target size, the original size may be 1:1 and the target size 16:9, 4:3, 3:2, or the like; alternatively, the original size may be 4:3 and the target size 1:1, 16:9, 3:2, or the like.
In this embodiment, after the base image is obtained, at least one element to be synthesized may be composited onto the base image to obtain a target composite image that meets the creative requirement. The elements to be synthesized may be referred to as image-processing (PS) elements, and the PS elements may include, but are not limited to, the following categories: logo elements, document elements, substrate elements, decoration elements, and the like. A logo element may be a trademark or logo; a document element mainly refers to text information related to the subject object; a substrate element is the shading or backing that underlays a picture or text in the image; and a decoration element may be an icon, geometric figure, pattern, symbol, line, or the like.
To composite the elements to be synthesized onto the base image and obtain a target composite image that meets the creative requirement, three problems need to be solved: where on the base image each element should be composited, what the content of each element is (especially the document information), and what visual attributes the elements should have and how they are rendered. These are described separately below:
For convenience of description, the problem of where on the base image the elements to be synthesized are composited is referred to as the image layout information of the base image. That is, image layout information of the base image needs to be obtained; it includes the position information and category information of the at least one element to be synthesized that needs to be composited with the base image. In short, the image layout information describes which positions on the base image are suitable for compositing which categories of elements. In this embodiment, the base image may be input into an image layout model for image layout to obtain the image layout information of the base image. The image layout model is any model capable of performing image layout on the base image, and may be, for example, a Generative Adversarial Network (GAN) or an autoregressive Transformer, but is not limited thereto. After the image layout information of the base image is obtained, elements to be synthesized of the corresponding categories can be composited at appropriate positions on the base image under the guidance of the image layout information, which provides the conditions for obtaining a target composite image with a reasonable layout.
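As a concrete (and heavily simplified) illustration of what such a layout model produces, the Python sketch below decodes a fixed number of target areas as (category, xc, yc, w, h) tuples from a toy image encoder. The class list, network sizes, and decoding scheme are illustrative assumptions, not the model actually used in this application; a real autoregressive layout transformer would feed each emitted region back into the decoder.

```python
import torch
import torch.nn as nn
from dataclasses import dataclass
from typing import List

CATEGORIES = ["logo", "document", "substrate", "decoration"]   # PS element classes

@dataclass
class Region:
    category: str
    xc: float   # center x, normalized to [0, 1]
    yc: float
    w: float
    h: float

class ToyLayoutModel(nn.Module):
    """Toy stand-in for the layout model; emits one region per decoding step."""
    def __init__(self, d_model: int = 256, max_regions: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(                       # crude image encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model))
        self.step_embed = nn.Embedding(max_regions, d_model)  # crude substitute for feeding back tokens
        self.head_cls = nn.Linear(d_model, len(CATEGORIES))
        self.head_box = nn.Linear(d_model, 4)

    def forward(self, base_image: torch.Tensor, num_regions: int = 3) -> List[Region]:
        feat = self.encoder(base_image)                        # (1, d_model)
        regions = []
        for step in range(num_regions):
            h = feat + self.step_embed(torch.tensor([step]))
            cls_idx = self.head_cls(h).argmax(dim=-1).item()
            xc, yc, w, h_box = self.head_box(h).sigmoid()[0].tolist()
            regions.append(Region(CATEGORIES[cls_idx], xc, yc, w, h_box))
        return regions

layout = ToyLayoutModel()(torch.randn(1, 3, 256, 256))   # untrained: random but well-formed output
```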
It should be noted that the image layout information only indicates what category of element is to be synthesized; it does not determine the specific content of that element. In this embodiment, the PS elements are classified into several categories including, but not limited to: logo elements, document elements, substrate elements, decoration elements, and the like. The elements to be synthesized may include one or more of these categories, depending on the categories determined in the image layout information. Further, considering that each element to be synthesized will occupy a certain region on the base image, and that the position of that region is where the element will be composited, for convenience the region on the base image where an element is to be composited is called a target area. The target area is the image region on the base image that carries the element to be synthesized, and the position where the element is to be composited is expressed by the position and size of the target area; that is, the image layout information includes the position and category of at least one target area on the base image for carrying at least one element to be synthesized. Likewise, the category of a target area indicates the category of the element it carries. For example, a target area may be represented by a bounding box (bbox), i.e. a rectangular box expressed either by the coordinates of its upper-left and lower-right corner points or by its center coordinates (xc, yc) together with its width and height (w, h).
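For illustration, the helpers below convert between the two bounding-box conventions just mentioned (corner points versus center plus width and height); the numbers in the example are arbitrary.

```python
from typing import Tuple

def corners_to_center(x1: float, y1: float, x2: float, y2: float) -> Tuple[float, float, float, float]:
    """(top-left, bottom-right) -> (xc, yc, w, h)."""
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2.0, y1 + h / 2.0, w, h

def center_to_corners(xc: float, yc: float, w: float, h: float) -> Tuple[float, float, float, float]:
    """(xc, yc, w, h) -> (top-left, bottom-right)."""
    return xc - w / 2.0, yc - h / 2.0, xc + w / 2.0, yc + h / 2.0

# Example: a "document" target area of 300x80 px centered at (256, 600)
print(center_to_corners(256, 600, 300, 80))  # (106.0, 560.0, 406.0, 640.0)
```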
In this embodiment, the elements to be synthesized may include various categories; a document element is usually included, while the logo element, the substrate element, and the decoration element are all optional and may or may not be present, which is not limited herein. If the elements to be synthesized include logo, substrate, or decoration elements, these are relatively simple or definite to obtain; for example, the logo element is usually fixed and the substrate element is simple, and both may come directly from the basic material information of the subject object. The document information, however, is flexible and diverse, so the content problem of the elements to be synthesized mainly refers to generating the document information to be composited with the base image. Based on this, after the image layout information is obtained, various information of the base image (such as the subject object, its position, and the background color), the position and category of the target areas defined in the image layout information, and the basic material information corresponding to the subject object may be considered together to generate the document information in the at least one element to be synthesized. The basic material information corresponding to the subject object may include description information of the subject object, for example its name and various attributes, and may further include the logo information, substrate information, and decoration elements associated with the subject object, as well as an image of the subject object, and the like. Specifically, the base image, the position and category of the at least one target area in the image layout information, and the basic material information corresponding to the subject object may be input into the document generation model to generate the document information in the at least one element to be synthesized. The categories of the target areas indicate whether the at least one element to be synthesized contains document information and how many pieces of document information there are; the positions of the target areas indicate where the document information will appear on the base image, so the elements surrounding those positions can be taken into account when generating the document information, which improves how well the document information fits and fuses with the base image. Furthermore, using the basic material information of the subject object improves how well the document information fits the subject object and helps describe and express the subject object accurately.
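The sketch below shows one plausible way the inputs to such a document generation model could be assembled; the build_document_request() helper and its field names are hypothetical, not an interface defined by this application.

```python
from typing import Dict, List, Any

def build_document_request(base_image_path: str,
                           target_regions: List[Dict[str, Any]],
                           base_material: Dict[str, Any]) -> Dict[str, Any]:
    # Only regions of category "document" need generated text; their count and
    # positions condition the generator, as does the subject's material info.
    doc_regions = [r for r in target_regions if r["category"] == "document"]
    return {
        "base_image": base_image_path,
        "num_documents": len(doc_regions),
        "regions": [{"bbox": r["bbox"], "category": r["category"]} for r in doc_regions],
        "subject_name": base_material.get("name", ""),
        "subject_attributes": base_material.get("attributes", {}),
    }

request = build_document_request(
    "base.png",
    [{"category": "document", "bbox": (106, 560, 406, 640)},
     {"category": "logo", "bbox": (20, 20, 120, 60)}],
    {"name": "wireless earbuds", "attributes": {"battery": "30h"}},
)
```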
In this embodiment, after the image layout information and the document information in the at least one element to be synthesized are obtained, the visual attributes and rendering of the elements to be synthesized still need to be addressed. The visual attributes mainly solve the problem of visual adaptation between the elements to be synthesized and the elements already on the base image, in particular the subject object, so that the composition is coordinated, richly colored, and clearly ordered in visual priority, yielding a target composite image that is visually attractive to users. The visual attributes differ according to the category of the element to be synthesized. If the element is document information, the visual attributes are mainly those related to the text, for example the font, color, and font size, and further include the visual attributes of the target area that carries the document information, such as the color, shape, and line attributes of the corresponding bounding box, for example whether its color is a gradient, whether the box is stroked (outlined), and, if so, the stroke color. As another example, if the element is substrate information, the visual attributes may be those related to the substrate, for example its color, its shape, and whether the substrate color is a gradient. The visual attribute of the at least one element to be synthesized may be estimated according to the visual feature information in the base image and the position and category of the at least one target area. When estimating the visual attributes, the existing visual feature information in the base image is considered first, combined with the position of the element on the base image, so that the visual attributes of the element coordinate as much as possible with the existing visuals; the category of the element is also considered, so that visual attributes matching that category are used.
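As a toy stand-in for the learned visual-attribute estimation, the heuristic below picks a legible text color for a document region from the luminance of the base-image pixels it will cover; the thresholds and colors are illustrative assumptions.

```python
from PIL import Image

def pick_text_color(base_image: Image.Image, bbox: tuple) -> tuple:
    """Return white text on dark regions, near-black text on light regions."""
    region = base_image.convert("L").crop(bbox)       # grayscale crop of the target area
    hist = region.histogram()
    total = sum(hist) or 1
    mean_luma = sum(i * c for i, c in enumerate(hist)) / total
    return (255, 255, 255) if mean_luma < 128 else (16, 16, 16)

# Example (assumes base.png exists):
# color = pick_text_color(Image.open("base.png"), (106, 560, 406, 640))
```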
In this embodiment, after the document information in the at least one element to be synthesized, the position and category of the at least one target area, and the visual attribute of the at least one element to be synthesized are obtained, at least the document information may be rendered onto the base image according to the position and category of the at least one target area and the visual attribute of the at least one element to be synthesized, so as to obtain the target composite image. It should be noted that when the at least one element to be synthesized also contains elements of other categories, those elements likewise need to be rendered onto the base image according to their visual attributes and their positions on the base image.
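A minimal rendering sketch using Pillow follows; the font path, size, and coordinates are placeholders, and a production renderer would also handle line wrapping, substrates, and decoration elements.

```python
from PIL import Image, ImageDraw, ImageFont

def render_document(base_image: Image.Image, text: str, bbox: tuple,
                    color: tuple, font_path: str = "NotoSansCJK-Regular.ttc",
                    font_size: int = 36) -> Image.Image:
    canvas = base_image.convert("RGB").copy()
    draw = ImageDraw.Draw(canvas)
    try:
        font = ImageFont.truetype(font_path, font_size)
    except OSError:                      # fall back if the font file is not available
        font = ImageFont.load_default()
    x1, y1, _, _ = bbox                  # draw from the top-left corner of the target area
    draw.text((x1, y1), text, fill=color, font=font)
    return canvas

# composite = render_document(Image.open("base.png"), "New season, 30% off",
#                             (106, 560, 406, 640), (255, 255, 255))
```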
In this embodiment, steps S101 to S105 of the image processing method may be summarized, in flow order, as: creative base-image generation, creative layout generation, on-image document generation, and visual attribute estimation and rendering of the elements. The different steps are explained in detail below.
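To tie the steps together, the sketch below strings S101-S105 into one function; every model argument is a stand-in callable for the corresponding component described in this application, not a concrete implementation.

```python
def generate_creative(target_material, target_size,
                      base_model, layout_model, document_model,
                      attribute_model, renderer, base_material_info):
    base_image = base_model(target_material, target_size)               # S101: creative base image
    layout = layout_model(base_image)                                    # S102: positions + categories of target areas
    documents = document_model(base_image, layout, base_material_info)   # S103: on-image document (copy) text
    attributes = attribute_model(base_image, layout)                     # S104: visual attribute estimation
    return renderer(base_image, layout, documents, attributes)           # S105: render to target composite image
```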
Step S101: creative base map generation
As shown in fig. 2, step S101 may include four steps T101-T104, summarized as: image filtering, on-image element parsing, image inpainting, and image retargeting. Steps T102-T104 may be generalized as the base-image generation process. Each is described in detail below.
Step T101: image filtering
In the embodiment of the application, depending on the image composition scenario, there may be multiple raw material images; in a scenario where a creative advertisement image needs to be synthesized from commodity pictures on an e-commerce platform, there may be many commodity pictures containing the same commodity. Moreover, the content of these raw material images is complex and varied and contains considerable noise, so they are difficult to use directly as target material images. For example, the raw material images may include pure-text images, detail shots of the subject object, or images stitched together from multiple pictures; in such images the subject object is not prominent, and it is difficult to generate from them a high-quality base image with a prominent subject, an appropriate size, and a good visual appearance.
Specifically, at least one raw material image containing the subject object may be acquired; the at least one raw material image is input into an image quality classification model for quality classification, so as to obtain the quality category of each raw material image. The at least one raw material image may contain images of different quality categories, and the quality categories may be divided according to the resolution, hue, brightness, saturation, distortion, and size of the image, the way text and pictures are composed in the image (for example, whether the image is stitched from small images), and the like. According to the quality category of each raw material image, the raw material images whose image quality is suitable for serving as a base image are selected as target material images. There may be one or more target material images; each target material image can generate one base image, and each base image finally yields a corresponding target composite image.
For example, the at least one raw material image may cover a plurality of quality categories, and the raw material images suitable as base images may belong to one or more quality categories, without limitation. For example, the at least one raw material image is divided into two categories according to image quality: quality category A1 contains raw material images whose quality is suitable for serving as a base image, mainly images that are not stitched from small images, whose resolution is higher than a set resolution threshold, and whose number of characters is less than a set character-count threshold; quality category A2 contains raw material images whose quality is not suitable for serving as a base image, mainly images whose resolution is lower than the set resolution threshold (low-resolution images), images whose number of characters is greater than or equal to the set character-count threshold (images containing a large amount of text), and images stitched from small images. The raw material images belonging to quality category A1 can then be taken from the at least one raw material image as the target material images.
The image quality classification model may be any model capable of classifying the raw material images, for example a deep residual network (ResNet), such as a ResNet-50 model, where the numeral 50 following ResNet indicates the network depth, i.e. roughly 50 two-dimensional convolutional (conv2d) layers.
An example of a model architecture for the image quality classification model is provided below. The image quality classification model includes a preprocessing module and four convolution modules. The raw material image is first sent to the preprocessing module, which consists of a convolution layer (res1) and a pooling layer; the convolution layer and the pooling layer each halve the spatial size of the raw material image. For example, a raw material image of original size 3 × 800 × 1216 becomes 64 × 200 × 304 after the preprocessing module, where the three parameters in 3 × 800 × 1216 and 64 × 200 × 304 denote the number of channels, the width, and the height of the image, respectively. The four convolution modules are named res2, res3, res4, and res5; each is composed of Bottleneck blocks, with res2 containing 3 Bottleneck blocks and the remaining three modules containing 4, 6, and 3 Bottleneck blocks, respectively. The convolution modules res2, res3, and res4 perform convolution and pooling operations, and the convolution module res5 performs the classification operation; in this embodiment, res5 outputs two classes, namely the two quality categories A1 and A2.
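As a rough approximation of such a classifier, the sketch below reuses torchvision's standard ResNet-50 backbone and replaces its final fully connected layer with a two-way head for the quality categories A1/A2; the exact res1-res5 configuration described above is only approximated by the library model.

```python
import torch
import torchvision

def build_quality_classifier(num_classes: int = 2) -> torch.nn.Module:
    model = torchvision.models.resnet50(weights=None)               # res1..res5-style backbone
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)   # two quality categories A1/A2
    return model

model = build_quality_classifier()
x = torch.randn(1, 3, 1216, 800)     # one raw material image tensor (N, C, H, W)
logits = model(x)                    # shape (1, 2): scores for A1 and A2
```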
The following illustrates the training and reasoning principles of the image quality classification model:
collecting a plurality of original material graphs in advance, carrying out quality annotation on at least one original material graph, dividing the original material graphs into two types of training samples, taking the original material graph of a quality class A1 as a positive sample with image quality suitable for being used as a base image, and taking the original material graph of a quality class A2 as a negative sample with image quality not suitable for being used as the base image; and carrying out training of an image quality classification model based on the two types of pre-labeled training samples, counting the accuracy of the model classification result according to the image classification result output by the image quality classification model in the model training process, taking the accuracy as a loss function of the model training, and continuously adjusting the model parameters of the image quality classification model according to the loss function until obtaining the image quality classification model capable of accurately classifying the image into two types according to the quality.
In the model inference process, the at least one raw material image is input into the image quality classification model to obtain the quality classification result, i.e. the quality category, of each raw material image, and according to these quality categories the raw material images in quality category A1 can be selected as the target material images for generating base images. It should be noted that the raw material images used in model inference are different from those used in model training; the two sets may overlap but are not identical. In general, the raw material images used as training samples during model training are much more numerous.
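A minimal inference-and-filtering loop might look as follows; the assumption that class index 0 corresponds to category A1 is illustrative.

```python
import torch

@torch.no_grad()
def select_target_materials(model, images, a1_index: int = 0):
    """Keep only raw material images classified into quality category A1."""
    model.eval()
    kept = []
    for img in images:                                   # img: tensor of shape (3, H, W)
        pred = model(img.unsqueeze(0)).argmax(dim=1).item()
        if pred == a1_index:
            kept.append(img)
    return kept
```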
After the target material map is obtained, steps T102-T104 may be executed to implement base image generation. The generation of the base image specifically comprises:
t102 element resolution on the graph: sending the target material graph into an element detection model for on-graph element analysis to obtain original synthetic elements and attribute information thereof contained in the target material graph;
t103 image restoration: repairing the target material graph according to the attribute information of the original synthetic element to obtain a repaired material graph which does not contain the original synthetic element;
t104 image retargeting: performing image retargeting on the repaired material image according to the size relationship between the target size and the original size, so as to obtain a base image with the target size.
Step T102, element analysis on the graph:
The target material image may also contain various composite elements (i.e., PS elements); the composite elements already present in the target material image are referred to as original composite elements. On-image element parsing mainly analyzes whether original composite elements exist in the target material image and, if so, their attribute information. The original composite elements are PS elements, including but not limited to logo elements, document elements, substrate elements, and decoration elements; the description of PS elements above applies and is not repeated here. Fig. 3 illustrates the PS element categories of the embodiment of the present application with two example images, but is not limited thereto. The attribute information of an original composite element may include, but is not limited to: its category (e.g., whether it is a document, logo, or substrate), its position in the target material image, and its size. It should be noted that original composite elements are elements to be erased from the target material image, whereas elements to be synthesized are elements to be added to the base image; in terms of role the two are not the same, but both belong to PS elements. Specifically, the target material image may be sent into an element detection model for on-image element parsing, so as to obtain the original composite elements contained in the target material image and their attribute information.
The element detection model may be any model capable of performing on-image element parsing on the target material image. Optionally, the element detection model includes a feature extraction layer, an element recognition layer based on a self-attention mechanism, and an attribute labeling layer. The feature extraction layer extracts features from the target material image and may be any network capable of doing so, such as ResNet or VGG (Visual Geometry Group). The element recognition layer based on the self-attention mechanism identifies, from the feature map produced by the feature extraction layer, the feature maps corresponding to the original composite elements contained in the target material image. For convenience of description and distinction, the feature map extracted from the target material image is referred to as the first feature map, and a feature map corresponding to an original composite element extracted from the first feature map is referred to as a second feature map. The element recognition layer may be a Dynamic Head model (Unifying Object Detection Heads with Attentions), which applies multiple attention mechanisms and can recognize original composite elements along three dimensions. Scale-aware dimension: attention over feature levels perceives scale and recognizes elements of different scales in the target material image; different feature levels correspond to different scales, which enhances the scale-awareness of element parsing. Spatial-aware dimension: attention over spatial positions perceives the shapes, orientations, and the like of elements of the same category in the target material image; different spatial positions correspond to geometric transformations of the target, which enhances the detector's spatial-location awareness. Task-aware dimension: attention over output channels perceives tasks and recognizes elements under different representations (such as boxes) in the target material image; different channels correspond to different tasks, which enhances the detector's awareness of different tasks. The attribute labeling layer labels the attribute information of the original composite elements recognized by the element recognition layer and may be any network layer capable of attribute labeling.
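The sketch below illustrates the three attention dimensions in a drastically simplified form; the real Dynamic Head uses level-, spatial-, and task-aware attention with deformable convolutions and a different formulation, so this is purely a conceptual toy, not the detector used in this application.

```python
import torch
import torch.nn as nn

class TinyDynamicHead(nn.Module):
    def __init__(self, channels: int = 256, num_levels: int = 3):
        super().__init__()
        self.level_attn = nn.Parameter(torch.ones(num_levels))                   # scale-aware weight per level
        self.spatial_attn = nn.Conv2d(channels, 1, kernel_size=3, padding=1)     # spatial-aware mask
        self.task_attn = nn.Sequential(                                          # task-aware channel gating
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, features):                      # features: list of (N, C, H, W), one per level
        out = []
        level_w = torch.softmax(self.level_attn, dim=0)
        for w, f in zip(level_w, features):
            f = w * f                                  # scale-aware reweighting
            f = torch.sigmoid(self.spatial_attn(f)) * f  # spatial-aware reweighting
            f = self.task_attn(f) * f                  # task-aware channel gating
            out.append(f)
        return out

head = TinyDynamicHead()
feats = [torch.randn(1, 256, s, s) for s in (64, 32, 16)]
refined = head(feats)                                  # refined multi-level feature maps
```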
Before using the element detection model, the element detection model needs to be trained in advance. The training process of the element detection model comprises the following steps: in order to train an element detection model (also referred to as a detector for short), a large number of existing images containing PS elements are obtained as sample images, for example, about 13 million advertiser creatives with different sizes can be obtained in the field of e-commerce, and the advertiser creatives are used as sample images; then, labeling the PS elements in the sample image, for example, whether the sample image includes the PS elements and the included PS elements, such as mark elements, text elements, decoration elements and/or substrate elements, are labeled in the case that the sample image includes the PS elements, then performing model training on the initial Dynamic Head model using the sample images and the labeling results thereof, obtaining whether each sample image given by the Dynamic Head model includes the PS elements and attribute information of the PS elements in the case that the sample image includes the PS elements in the model training process, comparing the attribute information of the PS elements in the case that the sample image includes the PS elements and the identification result of the PS elements in the case that the sample image includes the PS elements to obtain a loss function of the Dynamic Head model, and continuously adjusting model parameters of the Dynamic Head model until the loss function satisfies a model convergence condition when the loss function does not satisfy the model convergence condition, thereby obtaining an element detection model capable of accurately identifying the PS elements included in the image.
After the element detection model is obtained through training, the element detection model can be used for carrying out on-graph element analysis on the target material image. The model reasoning process of the element detection model comprises the following steps: sending the target material graph into a feature extraction layer in the element detection model to perform feature extraction, so as to obtain a first feature graph corresponding to the target material graph, where for example, the first feature graph may be implemented as a feature map (feature map), and the first feature graph may include feature information corresponding to one or more original synthetic elements, and the first feature graph may be understood as an overall feature graph; sending the first feature map into an element detection model to identify synthetic elements based on an element identification layer of a self-attention mechanism so as to obtain a second feature map corresponding to original synthetic elements contained in the target material map, wherein the second feature map can be understood as a local feature map; and sending the second feature graph into an attribute labeling layer in the element detection model for attribute labeling so as to obtain attribute information such as the position, the size, the category and the like of the original synthesized element. For example, according to the position of the second feature map in the first feature map, the position of the original composite element corresponding to the second feature map in the target material map is determined, and according to the size (i.e. dimension) of the second feature map in the first feature map, the size of the original composite element corresponding to the second feature map is determined; according to the feature information in the second feature map, the category of the original composite element corresponding to the second feature map can be determined. The feature information corresponding to different synthesized elements is different, for example, if the second feature map includes the first feature information, it may be determined that the original synthesized element corresponding to the second feature map is a text element; if the second feature map contains the second feature information, the original synthesized element corresponding to the second feature map can be determined as a mark element; if the second feature map contains the third feature information, the original synthesized elements corresponding to the second feature map can be determined as substrate elements; if the second feature map includes the fourth feature information, it may be determined that the original composite element corresponding to the second feature map is a decoration element.
In this embodiment, whether the target material map includes the original synthesized element and the attribute information of the original synthesized element included in the case of including the original synthesized element can be obtained by the element detection model. For the situation that the target material graph contains the original synthesis element, step T103 needs to be executed for image restoration, and then step T104 is executed for image redirection; for the case where the original synthesized element is not included in the target material map, step T104 may be directly performed, i.e., image redirection may be performed, skipping step T103.
Step T103, image restoration:
In this embodiment, when the target material image contains original composite elements, the target material image needs to be repaired in order to obtain, from it, a high-quality base image with a prominent subject, an appropriate size, and a good visual appearance. The repair is needed because truncated and deformed original composite elements seriously degrade the visual quality of the image, increase the difficulty of image retargeting in step T104, and further limit the room for the subsequent steps of creative layout generation, document generation, and visual attribute estimation and rendering.
The repairing mainly refers to erasing original composite elements contained in the target material image, and repairing missing or damaged areas in the target material image caused by erasing operation, so that the whole image is visually and semantically coherent. In other words, the process of image restoration of the target material map can be understood as a process of reconstructing a region where the original synthesized element in the target material map is located. The original synthetic element has attribute information, and the target material graph can be repaired according to the attribute information of the original synthetic element so as to obtain a repaired material graph which does not contain the original synthetic element. For example, the region in the target material map that needs to be repaired can be determined according to the position information and the size information of the original composite element in the target material map, and then the region can be repaired.
Optionally, an embodiment of repairing the target material map according to the attribute information of the original synthesized element to obtain a repaired material map not containing the original synthesized element includes: and inputting the target material graph and the attribute information of the original synthetic element into an image restoration model, and restoring the target material graph by using the attribute information of the original synthetic element in the model so as to output a restored material graph which does not contain the original synthetic element any more. In this embodiment, the model architecture of the image restoration model is not limited. Illustratively, one model architecture of an image inpainting model includes a mask processing network and an image inpainting network.
The mask processing network is mainly used for performing mask processing on the original synthesized element in the target material image, for example, performing black-and-white processing on the original synthesized element and other elements on the target material image, for example, setting the pixel value of the original synthesized element to 0, and setting the pixel value of other elements on the target material image to 255, or setting the pixel value of the original synthesized element to 255 and setting the pixel value of other elements to 0, so that the original synthesized element presents an obvious black-and-white effect, and a black-and-white mask image is obtained; and then, synthesizing the black-white mask image and the target material image to obtain a mask material image, wherein other elements except the original synthesized elements in the target material image are reserved in the mask material image, and the area of the original synthesized elements is erased and is called as an area to be repaired. Specifically, the attribute information of the target material map and the original synthesized element may be input to a mask processing network in the image restoration model, and the target material map is masked according to the attribute information of the original synthesized element to obtain a mask material map, where the mask material map includes a region to be restored, which is obtained by masking the original synthesized element; and inputting the mask material image into an image restoration network in the image restoration model, and restoring the area to be restored according to the pixel values of the peripheral area of the area to be restored to obtain a restoration material image which does not contain the original synthetic elements.
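A minimal sketch of the mask construction step, assuming the detected element boxes are given as pixel rectangles; the 0/255 values follow the black-and-white convention described above.

```python
import numpy as np

def build_mask(image_hw: tuple, element_boxes: list) -> np.ndarray:
    """image_hw: (height, width); element_boxes: list of (x1, y1, x2, y2) pixel boxes."""
    mask = np.zeros(image_hw, dtype=np.uint8)          # 0 = pixels to keep
    for x1, y1, x2, y2 in element_boxes:
        mask[y1:y2, x1:x2] = 255                       # 255 = region to be repaired (erased element)
    return mask

mask = build_mask((1216, 800), [(106, 560, 406, 640)])
```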
The image inpainting network repairs the region to be repaired according to the pixel values of the area surrounding it. This is not simple pixel copying or direct filling: the pixel values inside the region to be repaired are regenerated from the surrounding pixels so as to be semantically consistent. Specifically, in the image inpainting network, the mask material image can be downsampled by a downsampling layer to obtain a key feature map; the key feature map is passed through at least one fast Fourier convolution (FFC) block to obtain a convolution feature map; and the convolution feature map is upsampled by an upsampling layer to obtain the repaired material image. Using fast Fourier convolution gives the network an image-wide receptive field (together with a high-receptive-field perceptual loss). The key feature map contains both global and local key features of the mask material image: the global key features retain the global visual information, while the local key features represent finer-grained visual information. In each fast Fourier convolution, the local features are processed by ordinary convolution in the spatial domain, while Fourier convolution analyzes the global features in the frequency domain, so that a high-resolution, high-quality repaired material image is recovered. An image inpainting model with this architecture may adopt the LaMa (large-mask inpainting) model, but is not limited thereto.
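The block below sketches the idea of a fast-Fourier-convolution block: a spatial convolution for the local branch and a 1x1 convolution applied to the real and imaginary parts of the 2D FFT for the global branch. It is inspired by, but much simpler than, the FFC used in LaMa, and the channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class TinyFFC(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.local_conv = nn.Conv2d(channels, channels, 3, padding=1)   # local, fine-grained branch
        self.freq_conv = nn.Conv2d(channels * 2, channels * 2, 1)       # acts on real + imaginary parts
        self.out = nn.Conv2d(channels * 2, channels, 1)                 # fuses the two branches

    def forward(self, x):                                   # x: (N, C, H, W)
        local = self.local_conv(x)
        spec = torch.fft.rfft2(x, norm="ortho")             # (N, C, H, W//2+1), complex
        spec = torch.cat([spec.real, spec.imag], dim=1)     # (N, 2C, H, W//2+1)
        spec = torch.relu(self.freq_conv(spec))
        real, imag = spec.chunk(2, dim=1)
        glob = torch.fft.irfft2(torch.complex(real, imag), s=x.shape[-2:], norm="ortho")
        return self.out(torch.cat([local, glob], dim=1))    # same spatial size as the input

block = TinyFFC()
y = block(torch.randn(1, 64, 128, 128))                     # (1, 64, 128, 128)
```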
It should be noted that the image restoration model can repair the various original synthesized elements contained in the target material image; that is, whether the original synthesized element is document information, a logo element, or a substrate element, the image restoration model can be used to repair it. In addition, an embodiment of the present application also provides a character erasing model, which is specially used for erasing the document information contained in the target material image so as to repair the document information.
The image restoration model may be used alone; alternatively, when the target material image contains only original synthesized elements of the document type, the character erasing model may be used alone. Of course, when the target material image contains multiple types of original synthesized elements at the same time, the image restoration model and the character erasing model may be used in combination. For example, the character erasing model may first be used to erase the document information contained in the target material image, and then the image restoration model may be used to repair the other original synthesized elements contained in the target material image, such as logo elements, substrate elements, or decoration elements. Alternatively, the image restoration model may first be used to repair the original synthesized elements such as logo elements, substrate elements, and decoration elements contained in the target material image, and then the character erasing model may be used to erase the document information contained in the target material image.
In this embodiment, the character erasing model may erase the document information contained in the target material image and supplement the background content of the region where the document information was erased according to the information of the other regions of the target material image, so as to obtain a target material image from which the document information has been erased. The model architecture of the character erasing model is not limited in this embodiment; any model architecture that can erase the document information contained in the target material image and supplement the background content of the erased region according to the information of the other regions of the target material image is suitable for the embodiment of the present application. For example, the character erasing model of the embodiments of the present application may be implemented based on a generative adversarial network.
Before the character erasing model is used, it is obtained through a model training process. In the embodiment of the present application, it is considered that training the character erasing model in a supervised manner would require a large number of labelled samples, i.e. pairs consisting of an original image without document information and the image obtained by adding document information to that original image; in practice, however, it is difficult to obtain original images without document information, and most of the images that can be obtained already contain various kinds of document information, which undoubtedly increases the difficulty of supervised model training. In this regard, the embodiment of the present application provides a self-supervised learning method for text erasing. The method exploits the low dependency of self-supervised learning on labelled data and constructs a controllable character synthesis module (used for synthesizing training samples), an end-to-end character erasing model (used for erasing document information), and a synthesis policy network (providing a refined feedback path from erasing back to synthesis), which together form a closed-loop system whose quality and precision are continuously improved. As shown in FIG. 4, the system includes a controllable text synthesis module, a synthesis Policy Network, and a Text Erasing model. The controllable character synthesis module synthesizes the image sample data required to train the character erasing model; the synthesis policy network provides the controllable character synthesis module with the synthesis rules required for synthesizing characters on an image (picture-synthesis document rules for short); and during model training these rules are continuously updated according to the images output by the character erasing model after the document information has been erased, which improves the precision of the picture-synthesis document rules and in turn the quality of the sample images synthesized by the controllable character synthesis module for model training.
As shown in fig. 4, the controllable text synthesis module mainly includes a Synthesis Function, which synthesizes new document information onto an image that already contains document information according to a picture-synthesis document rule, so as to obtain an image containing more document information; the two images then form an annotated training sample. For convenience of distinction and description, since the images of the present embodiment all contain document information, they are referred to as document images; a document image containing only the original document information is referred to as an original document image, the image obtained by adding new document information to the original document image is referred to as a target document image, and the document information newly added to the original document image is referred to as target document information. Based on this, the input of the synthesis function is the original document image I, which contains the original document information, such as 'suddenly declaring a price reduction'; the original document image is an unlabelled image, such as, but not limited to, a commodity image or an advertisement image on an e-commerce platform. The output of the synthesis function is the target document image I_syn obtained after the target document information has been added, and the target document image I_syn is a sample image required for training the character erasing model. The principle by which the synthesis function synthesizes a sample image for training is as follows: given a picture-synthesis document rule s, the function extracts the original document information from the input original document image, generates target document information based on the original document information according to the rule s, and then synthesizes the target document information into a non-character region of the original document image, for example by means of a character rendering code library or by copying, so as to obtain the target document image (i.e. the sample image) for model training. The target document image I_syn is also known as a Synthetic image.
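The following is a minimal sketch of what such a synthesis function does, assuming Pillow for text rendering; the rule fields (font path, font size, position, colour, text) are hypothetical names for the attributes carried by the picture-synthesis document rule s.

```python
from PIL import Image, ImageDraw, ImageFont


def synthesis_function(original: Image.Image, rule: dict) -> Image.Image:
    """Render target document text onto a non-character area of the original
    document image I according to a picture-synthesis document rule s (here a
    plain dict), producing the target document image I_syn."""
    target = original.copy()
    draw = ImageDraw.Draw(target)
    font = ImageFont.truetype(rule["font_path"], rule["font_size"])
    draw.text(rule["position"], rule["text"], fill=rule["color"], font=font)
    return target   # paired with `original`, this forms one labelled training sample
```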
On the basis of obtaining the target document image, the target document image I_syn and the original document image I can be used as a training sample for model training until the loss function of the model training meets the requirements, thereby obtaining the character erasing model. The input of the character erasing model is the target document image and the corresponding original document image, and the output is the output image I_pred obtained after the target document information in the target document image has been erased. Illustratively, the character erasing model of the present embodiment adopts a generative adversarial network (GAN) as its main network structure, so the character erasing model is divided into two parts, a generator and a discriminator. FIG. 4 shows the generator of the character erasing model; the discriminator is not illustrated. The generator processes the input target document image I_syn, erases the target document information in the target document image I_syn through a multi-layer convolutional network, supplements the erased region according to the information of its surrounding region, and obtains the final output image I_pred of the character erasing model. The output image I_pred of the character erasing model is then input into the discriminator, which judges the generator's output image I_pred against the unlabelled original document image I, thereby constructing a supervision signal that continuously guides the generator until the generator's final output image I_pred is so close to the original document image I that the discriminator can hardly distinguish them, at which point the model training is finished.
As shown in fig. 4, the generator in the character erasing model includes two parts, namely a Coarse Model for initially erasing the document information and a Refinement Model for erasing the document information a second time; with these two models, the erasure of the document information is divided into two stages, both aiming at erasing the document information to obtain an image with the document information removed. In the model training process, the input of the Coarse Model is the target document image I_syn, and its output is the first output image I_c obtained after the document information has been preliminarily erased (i.e. the coarse output); the input of the Refinement Model is the first output image I_c, and its output is the second output image I_r obtained after the document information has been erased a second time (i.e. the refined output). The second output image I_r is combined with the black-and-white mask image M_syn corresponding to the target document information to obtain the final output image I_pred of the whole generator (i.e. the composite output). The black-and-white mask image M_syn is obtained by masking the target document information in the target document image I_syn, specifically by setting the pixel values of the target document information to 255 and the pixel values of the other regions to 0; the black-and-white mask image M_syn may also be referred to as a Synthetic text mask. In the model inference process, the input of the Coarse Model is the target material image, and its output is the target material image obtained after the document information has been preliminarily erased; the input of the Refinement Model is the target material image output by the Coarse Model after the preliminary erasure, and its output is the target material image obtained after the document information has been erased a second time, which is the output result of the whole generator.
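A minimal sketch of the compositing step that produces the final generator output, assuming the mask convention stated above (target document pixels set to 255 in M_syn) and assuming that the pixels outside the text mask are taken from the synthetic input image; both are assumptions made for illustration.

```python
import numpy as np


def composite_output(i_r: np.ndarray, i_syn: np.ndarray, m_syn: np.ndarray) -> np.ndarray:
    """Combine the refinement-stage output I_r with the black-and-white mask
    M_syn of the target document text: inside the text mask, take the erased
    pixels from I_r; elsewhere, keep the input image unchanged."""
    text_region = m_syn == 255             # M_syn: target document pixels are 255
    i_pred = i_syn.copy()
    i_pred[text_region] = i_r[text_region]
    return i_pred                          # the composite output I_pred
```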
In this embodiment, the supervision loss of the model training is divided into two parts: the generative adversarial loss L_adv (adversarial loss) and the ternary erasure loss L_te (triplet loss). The generative adversarial loss refers to the loss function of the discriminator and can be expressed by, but is not limited to, the following formula:

L_{adv} = \mathbb{E}_{I \sim p_{data}(I)}\left[\log D(I)\right] + \mathbb{E}_{z}\left[\log\left(1 - D(G(z))\right)\right]

where D denotes the discriminator network, G denotes the character erasing model (the generator), L_adv denotes the generative adversarial loss, and z denotes the target document image I_syn synthesized by the controllable character synthesis module; D(I) denotes the discriminator network's true/false score for the original document image, and D(G(z)) denotes its true/false score for the image I_pred output by the generator. The left term, \mathbb{E}_{I \sim p_{data}(I)}[\log D(I)], ensures the basic judgement capability of the discriminator, and the right term, \mathbb{E}_{z}[\log(1 - D(G(z)))], ensures that the discriminator is able to distinguish fake samples. log is the logarithm operator, \mathbb{E} is the expectation operator, p_{data}(I) denotes the data distribution of the image I, and I \sim p_{data}(I) indicates that the image is sampled from the data distribution p_{data}(I).
The ternary erasure loss L_te is computed over the first output image I_c of the two-stage generator, the second output image I_r, and the original document image I, and can be expressed by, but is not limited to, the following formula:

L_{te} = \lVert I_c - I \rVert + \gamma \cdot \lVert I_r - I \rVert

where detach denotes an operation that cuts off gradient back-propagation so that the corresponding network parameters are not updated through the detached branch (for example, one stage's output can be detached where it enters the other stage's term), and γ is a weight coefficient: the larger γ is, the more weight is given to the erasure effect of the second stage of the network. ‖·‖ is an operator that takes the absolute values of the differences between the elements at corresponding positions of two matrices and then sums all the absolute values. Taking ‖I_r − I‖ as an example: the difference between the elements of the second output image I_r and the original document image I is calculated at each position, the absolute value of every difference is taken, and all the absolute values are summed to obtain the final numerical result.
In the model training process, when both the ternary erasure loss function and the generative adversarial loss function meet the convergence requirement, the model training is finished and the final character erasing model is obtained. It should be noted that the ternary erasure loss function and the generative adversarial loss function may each be required to converge separately, or the two may be fused into a fused loss function, in which case the model training is finished and the final character erasing model is obtained when the fused loss function meets the convergence requirement.
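A PyTorch-style sketch of the two supervision losses as reconstructed above; the value of gamma and the omission of the detach operation are simplifying assumptions.

```python
import torch


def adversarial_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """L_adv as reconstructed above: d_real = D(I) are the discriminator's scores
    for original document images, d_fake = D(G(z)) its scores for the generator's
    outputs. The discriminator maximises this value (equivalently, minimises its
    negative), while the generator works against the second term."""
    return torch.log(d_real).mean() + torch.log(1.0 - d_fake).mean()


def ternary_erasure_loss(i_c: torch.Tensor, i_r: torch.Tensor,
                         i_orig: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """L_te over the coarse output I_c, the refined output I_r and the original
    image I; gamma (assumed value here) weights the second, refinement stage.
    The detach described in the text (blocking gradient return between the two
    stages) is omitted for brevity."""
    coarse_term = (i_c - i_orig).abs().sum()
    refined_term = (i_r - i_orig).abs().sum()
    return coarse_term + gamma * refined_term
```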
As shown in FIG. 4, the synthesis policy network is responsible for providing the picture-synthesis document rules to the controllable character synthesis module. Its inputs are the target document image I_syn and the output image I_pred of the character erasing model, and its output is the picture-synthesis document rule s. For example, the synthesis policy network adopts an LSTM as its main network structure and sequentially outputs the attribute information of the document information that needs to be synthesized onto the picture, such as the text colour, the font size, and whether italics are required. These pieces of attribute information (e.g. e_1, e_2, e_3 in FIG. 4) form the picture-synthesis document rule s, which is provided to the controllable character synthesis module for synthesizing sample images. As shown in fig. 4, the LSTM network mainly includes a normalized exponential function layer (Softmax Layer), a Hidden Layer, and an Embedding Layer. The original document image I is fed into the hidden layer for feature extraction, so that the original document image is abstracted into another dimensional space to expose its more abstract features; the abstracted features are then fed into the Softmax layer for mapping and classification, and the classified features are fed into the embedding layer for embedding encoding to obtain the corresponding embedding vectors, which represent the attribute information of the document information to be synthesized onto the picture.
Further, as shown in FIG. 4, the quality feedback R_real of the target document image I_syn and the quality feedback R_diff of the output image I_pred of the character erasing model can also be used as inputs of the synthesis policy network, which can continuously optimize its output picture-synthesis document rule s based on the quality feedback R_real of the target document image I_syn and the quality feedback R_diff of the output image I_pred of the character erasing model. As shown in FIG. 4, the quality feedback R_real reflects the quality of the target document image I_syn and can be obtained by the discriminator by upsampling the feature map g(I_syn) several times and adding the result to the predicted text mask map M_pred; the quality feedback R_diff of the output image I_pred of the character erasing model can be obtained from the difference between the output image I_pred and the original document image I. The feature map g(I_syn) is obtained by downsampling the target document image I_syn several times in the Coarse Model; the predicted text mask map M_pred is obtained by masking the target document information in the target document image I_syn, specifically by setting the pixel values of the target document information to 0 and the pixel values of the other regions to 255.
It can be seen that, in the embodiment of the present application, original document images containing original document information (for example, existing advertising creatives) are first collected, and the controllable character synthesis module then performs synthesis processing, automatically synthesizing, under the guidance of the original document style, the sample images (i.e. the above target document images) required for training the character erasing model. After the synthesized sample images are obtained, the character erasing model fully learns the erasing capability from them and effectively generalizes this capability to real scenes; in this part, the model is optimized jointly with the ternary erasure loss and the generative adversarial loss, which improves the quality of the model. Moreover, in order to reduce the gap between the synthesized sample images and real sample images, a synthesis policy network is introduced: the text style required for the synthesized sample images is determined according to the feedback of the character erasing model on the synthesized sample images, and the synthesis policy network is guided to improve continuously and output higher-quality picture-synthesis document rules, which further improves the quality of the sample images synthesized based on these rules, reduces the gap between synthesized and real sample images, and thus further improves the model training precision.
T104 image retargeting:
In this embodiment, the original size of the target material image may not meet the size requirement of the final target composite image. For example, taking the generation of advertisement images from commodity images in the e-commerce field as an example, 90% or more of commodity images are square images with a 1:1 aspect ratio, whereas many of the advertisement images to be generated are not square and their sizes must be determined according to the size requirements of the advertisement slots on a page; since these sizes differ considerably from the square aspect ratio, image retargeting is needed to obtain advertisement images of the required size. Based on this, after the repaired material image is obtained, image retargeting can be performed on it according to the size relationship between the target size and the original size, so as to obtain a base image with the target size. Specifically, the image to be cropped can be determined on the basis of the repaired material image according to the size relationship between the target size and the original size; the image to be cropped is either the repaired material image itself or an extension of it. The image to be cropped is then input into a saliency cropping model based on image importance, which locks onto the image region where the subject object is located according to the saliency features of the image to be cropped and crops the image to the target size centred on that region, thereby obtaining a base image with the target size.
When the target size is smaller than or equal to the original size, the repaired material image can be used directly as the image to be cropped; when the target size is larger than the original size, the repaired material image is extended (outpainted) to a size not smaller than the target size, and the extended image is used as the image to be cropped. An image extension model may be used to extend the repaired material image: specifically, the repaired material image and the target size are input into the image extension model, the repaired material image is extended according to the target size to obtain an extended image, and the extended image is used as the image to be cropped.
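A minimal sketch of the retargeting decision and the saliency-centred crop, assuming NumPy; the extension function and the subject bounding box produced by the saliency cropping model are represented by placeholders.

```python
import numpy as np


def image_to_crop(repaired: np.ndarray, target_size: tuple, extend_fn) -> np.ndarray:
    """Use the repaired material image itself when the target size fits inside
    it; otherwise outpaint it (extend_fn is a placeholder for the image
    extension model) to a size not smaller than the target size."""
    th, tw = target_size
    h, w = repaired.shape[:2]
    if tw <= w and th <= h:
        return repaired
    return extend_fn(repaired, target_size)


def saliency_center_crop(image: np.ndarray, subject_box: tuple, target_size: tuple) -> np.ndarray:
    """Crop a target-size window centred on the subject object's region
    (subject_box = (x, y, w, h) as assumed output of the saliency model),
    clamped to the image borders."""
    th, tw = target_size
    h, w = image.shape[:2]
    cx = subject_box[0] + subject_box[2] // 2
    cy = subject_box[1] + subject_box[3] // 2
    x0 = int(np.clip(cx - tw // 2, 0, max(w - tw, 0)))
    y0 = int(np.clip(cy - th // 2, 0, max(h - th, 0)))
    return image[y0:y0 + th, x0:x0 + tw]
```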
In the embodiment of the present application, the model architecture of the image extension model is not limited; any model architecture that can extend the size of an image while ensuring that the extended image has good semantic consistency is suitable for the embodiment of the present application. By way of example, one architecture of the image extension model includes a preprocessing network and a generative adversarial network with an image extension function. The preprocessing network is used to determine the extension direction and extension length of the repaired material image; the generative adversarial network is used to perform semantic-continuity-based extension of the repaired material image according to the extension direction and extension length determined by the preprocessing network, so as to obtain the extended image. Based on this, the process of extending the repaired material image with the image extension model includes: inputting the repaired material image and the target size into the preprocessing network of the image extension model, and determining, in the preprocessing network, the extension direction and extension length according to the aspect ratio of the target size, where the repaired material image constitutes the known image region in the extension direction and the extension length delimits the unknown image region in the extension direction, specifically the length of that unknown region. The extension direction is an image direction; for example, it may be the height direction or the length direction. Further, after the extension length in the extension direction has been determined, the range of the unknown image region can be determined in combination with the length or height of the other image direction. Then, the repaired material image, the extension direction, and the extension length are input into the generative adversarial network of the image extension model, and pixel values for the unknown image region in the extension direction are generated adversarially, with semantic continuity as a constraint, on the basis of the pixel values and semantic information of the known image region in the extension direction, so as to obtain the extended image. Specifically, in the process of training the generative adversarial network with the image extension function, the generator generates, for each sample image to be extended, pixel values for the unknown image region of that sample image, and the discriminator performs semantic consistency discrimination on the pixel values generated by the generator against the original sample image corresponding to the sample image to be extended, until the generator can generate semantically consistent pixel values.
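A minimal sketch of how the preprocessing network's decision could be realized, assuming that the extension direction and length are derived purely from the two aspect ratios; the rounding and the single-direction extension are assumptions.

```python
def extension_plan(original_size: tuple, target_size: tuple):
    """Decide the extension (outpainting) direction and length from the aspect
    ratios: extend along the dimension that is too short once the other
    dimension is scaled to match the target aspect ratio."""
    ow, oh = original_size
    tw, th = target_size
    if tw * oh > th * ow:                       # target is proportionally wider
        extended_width = round(oh * tw / th)
        return "width", extended_width - ow     # length of the unknown region in width
    extended_height = round(ow * th / tw)       # target is proportionally taller (or equal)
    return "height", extended_height - oh       # length of the unknown region in height
```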
After the above-described series of processes, a high-quality base image can be obtained in which the subject object is prominent, the size is appropriate, and the visual appearance is pleasing. Once the base image is obtained, step S102 can be performed to generate the creative layout. Step S102, creative layout generation, is described in detail next.
Step S102: creative layout generation
In this step, the image layout information required for the target composite image needs to be generated according to the content of the base image. In the present embodiment, the image layout information may be defined as information describing the category and position of each element to be synthesized in an indefinite-length set {e1, e2, ..., eN} of elements to be synthesized, where N is an integer greater than or equal to 2. The number of elements to be synthesized differs for different base images. The categories of the elements to be synthesized are consistent with the definitions used in the on-graph element analysis during base image generation and include, but are not limited to, logo, document, substrate, and decoration. The position information of an element to be synthesized is represented by, but not limited to, the centre coordinates (xc, yc) and the width and height (w, h) of the target area carrying that element.
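A minimal sketch of a data structure for this layout representation; the field names are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ElementToSynthesize:
    category: str    # e.g. "logo", "document", "substrate", "decoration"
    xc: float        # centre x of the target area carrying the element
    yc: float        # centre y of the target area
    w: float         # width of the target area
    h: float         # height of the target area


# Image layout information: an indefinite-length set {e1, e2, ..., eN}, N >= 2.
ImageLayout = List[ElementToSynthesize]
```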
In this embodiment of the present application, an image layout model may be used to perform image layout processing on the base image to obtain the image layout information corresponding to the base image. The model architecture of the image layout model is not limited, and any model architecture capable of generating image layout information is suitable for the embodiment of the present application. Optionally, the image layout model may adopt, but is not limited to: a GAN network fusing a multi-scale CNN (convolutional neural network) and a Transformer, or an autoregressive Transformer network. The Transformer is a network structure composed of two parts, an Encoder and a Decoder, referred to as an encoder-decoder structure for short. These two networks are described in detail below.
1. GAN network fusing multi-scale CNN and Transformer
To generate a creative layout according to the content of the base image, two core problems need to be solved: the first is how to obtain the paired sample data required for model training, namely sample layout images and their corresponding sample layout information; the second is how to make full use of the content information of the base image when generating image layout information during model inference.
In response to the first problem, as shown in fig. 5, the embodiment of the present application innovatively proposes a domain alignment model, which is used to generate the paired sample data required for model training, i.e. the target layout images used as training samples and their corresponding target layout information, from existing original layout images (e.g. advertiser creative images) that contain synthetic elements and their layout information. An original layout image is an existing image that contains synthetic elements and the layout information formed by those elements; a target layout image is an image that can be used as a training sample for the image layout model and no longer contains the synthetic elements or the layout information corresponding to them; and the target layout information is the layout information corresponding to the target layout image, i.e. the label information of the target layout image. Specifically, original layout images may be collected and input into the domain alignment model, where the positions and categories of the synthetic elements in the original layout image are extracted to obtain the target layout information. The synthetic elements in the original layout image are then masked by a masking unit (the circled M in fig. 5) so as to erase them from the original layout image and obtain a masked layout image; specifically, as shown in fig. 5, the synthetic elements are masked using the original mask image and the original layout image to obtain the masked layout image. Next, the masked region in the masked layout image is repaired by a repairing unit (the inpNet unit in fig. 5) to obtain the target layout image, and finally the visual feature map of the target layout image is extracted by a visual extraction unit (the salNet unit in fig. 5). The masked region in the masked layout image is the result of masking the synthetic elements: its pixel values are 0 or 255, while the pixel values of the other regions are the real pixel values of the original layout image. Correspondingly, in the original mask image, the pixel values of the regions other than the synthetic elements are 0 or 255 (fig. 5 illustrates the case where the pixel value is 0), and the pixel values of the region corresponding to the synthetic elements are the real pixel values of the synthetic elements. Repairing the masked region means filling it with pixels according to the pixel values of the surrounding regions, so as to obtain a target layout image that is visually and semantically consistent. For the specific method of repairing the masked region, reference may be made to the repair method in the image repair step above, which is not repeated here.
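A minimal sketch of the domain alignment pipeline as a sequence of stages; the four callables are placeholders for the element extraction, the masking unit, the inpNet repairing unit, and the salNet visual extraction unit, none of which are specified here.

```python
def domain_align(original_layout_image, detect_fn, mask_fn, inpaint_fn, saliency_fn):
    """Turn an existing creative into one paired training sample:
    a target layout image without synthetic elements plus its layout label."""
    elements = detect_fn(original_layout_image)        # positions + categories of elements
    masked = mask_fn(original_layout_image, elements)  # erase (mask) the synthetic elements
    target_image = inpaint_fn(masked)                  # repair the masked regions
    visual_map = saliency_fn(target_image)             # visual (saliency) feature map
    return target_image, visual_map, elements          # elements = target layout information
```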
In this embodiment, by providing the domain alignment model, the paired sample data required for model training, that is, the target layout images and their corresponding target layout information, are generated by the domain alignment model from existing original layout images containing synthetic elements and their layout information, without relying on designers to design layout information from images. This reduces the sample cost of obtaining the paired data required for model training, and the paired data obtained through the domain alignment model are richer and do not fall into the inherent patterns that would result from a small number of designers.
After the target layout information, the target layout image, and its visual feature map, which can serve as a training sample, are obtained, they are used as the training sample to train the image layout model until the model converges or meets the end condition of model training, thereby obtaining the image layout model.
For the second problem, the embodiment of the present application proposes a content-aware (Composition-aware) layout-generation GAN network, i.e. the image layout model, which combines the multi-scale CNN and the Transformer. This GAN network makes full use of the advantages of the multi-scale CNN and the Transformer: it can effectively learn relations such as alignment and overlap among elements, and it can also model the relations among the elements, the position of the image content, the background colour, and the texture distribution. In addition, the network supports adding user constraints and can reasonably complete a user-specified partial layout, which meets the need of scenes that have fixed layout design rules in practical applications.
Specifically, as shown in fig. 5, the GAN network fusing the multi-scale CNN and the Transformer includes a generator and a discriminator, each of which comprises a multi-scale CNN network, an encoder-decoder, and a fully connected layer (FC), and generates the layout information using a content-aware (Composition-aware) technique. The multi-scale CNN network extracts feature information of the input image at different scales using convolution kernels of different sizes and finally splices the feature information of the different scales to obtain a spliced feature map; more specifically, the multi-scale CNN network downsamples the input image multiple times through multiple convolution blocks (ConvBlock) to obtain the multi-scale feature maps. The encoder is responsible for encoding the input spliced feature map to obtain the encoded information matrix required for generating image layout information and outputs it to the decoder in the generator; the decoder is responsible for generating the image layout information from the encoded information matrix and finally outputs an information sequence of indefinite length, which contains the position, category, and size of at least one target area (e.g. a bbox) for carrying at least one element to be synthesized, i.e. the image layout information. In the model inference process, the generator of the GAN network plays the main role; in the model training process, the generator is responsible for generating predicted layout information for the target layout image, and the discriminator is responsible for adversarial training against the predicted layout information generated by the generator. Specifically, the input of the encoder in the discriminator is the target layout image, and the input of the decoder in the discriminator has two branches: one is the information sequence output by the decoder in the generator (i.e. the predicted layout information), and the other is the encoded information matrix of the target layout image output by the encoder in the discriminator. The information of the two branches is fused, and the discriminator judges, based on the fused result, whether the predicted layout is inconsistent with the image or can be recognized as synthesized, until the model convergence requirement is met, at which point the model training is finished. Furthermore, when the GAN model is used to generate image layout information for a base image, the information sequence output by the decoder in the generator of the GAN network is the image layout information generated for the base image.
It should be explained here that, as shown in fig. 5, the domain alignment model is used not only in the model training phase, where it is responsible for providing the paired sample data required for model training (i.e. the target layout images and the target layout information), but also in the model inference process. During model inference, the base image input to the model first enters the domain alignment model; since the base image does not contain any synthetic elements, no masking or repair processing is needed, and only the visual extraction unit (i.e. the salNet unit in fig. 5) in the domain alignment model is required to extract its visual feature map. The base image and its visual feature map are then input into the multi-scale CNN network to extract the multi-scale feature maps, which are spliced and sent to the generator in the GAN network to generate the image layout information.
Based on the above, one detailed implementation of step S102, creative layout generation, includes: inputting the base image into the domain alignment model for visual feature extraction to obtain the visual feature map of the base image; inputting the base image and its visual feature map into the multi-scale CNN network of the image layout model to extract the multi-scale feature maps, and splicing the extracted multi-scale feature maps to obtain the spliced feature map; and feeding the spliced feature map into the generative adversarial network of encoder-decoder structure in the image layout model to generate the image layout information of the base image. Further, feeding the spliced feature map into the image layout model and generating the image layout information with the generative adversarial network of encoder-decoder structure includes: feeding the spliced feature map into the encoder in the generator of the GAN network and encoding it to obtain intermediate image features (also called the encoded information matrix); inputting the intermediate image features into the decoder in the generator and decoding them to obtain initial layout information, where the initial layout information includes the position of at least one display area; and feeding the initial layout information into the fully connected layer in the generator to label the category of the at least one display area, thereby obtaining the image layout information of the base image.
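A rough PyTorch sketch of the generator path (multi-scale CNN, encoder-decoder, fully connected heads); the channel widths, the number of scales, the use of a fixed number of learned queries instead of an indefinite-length output sequence, and the head layout are all assumptions made for brevity, not the patent's actual architecture.

```python
import torch
import torch.nn as nn


class LayoutGenerator(nn.Module):
    """Multi-scale CNN -> Transformer encoder-decoder -> FC heads for the
    positions/sizes and categories of the target areas (bboxes)."""

    def __init__(self, d_model: int = 256, n_boxes: int = 8, n_classes: int = 5):
        super().__init__()
        # Convolution kernels of different sizes extract features at different scales.
        self.scales = nn.ModuleList([
            nn.Conv2d(3, d_model // 4, kernel_size=k, stride=4, padding=k // 2)
            for k in (3, 5, 7, 9)
        ])
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.query = nn.Parameter(torch.randn(n_boxes, d_model))
        self.box_head = nn.Linear(d_model, 4)          # xc, yc, w, h of each target area
        self.cls_head = nn.Linear(d_model, n_classes)  # category of each target area

    def forward(self, image: torch.Tensor):
        # Splice the multi-scale features along the channel dimension.
        feats = [s(image).flatten(2).transpose(1, 2) for s in self.scales]
        tokens = torch.cat(feats, dim=-1)              # (B, H'*W', d_model)
        memory = self.encoder(tokens)                  # encoded information matrix
        queries = self.query.unsqueeze(0).expand(image.size(0), -1, -1)
        decoded = self.decoder(queries, memory)
        return self.box_head(decoded), self.cls_head(decoded)
```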
2. Autoregressive Transformer network
Similar to the GAN network fusing multi-scale CNN and Transformer described above, the purpose of the autoregressive Transformer network is also to model the positions of the synthetic elements in the image together with the position of the subject object in the image. In order to generate diversified layouts, the network structure of this embodiment combines a Transformer and a VAE; as shown in fig. 6, it includes a Visual Backbone network and a Transformer network, where the Transformer network includes an encoder and a decoder. In this embodiment, the Visual Backbone network may adopt, but is not limited to, a ViT (Vision Transformer) network. In fig. 6, the portion drawn with dotted lines is the inference process for image layout using the network structure shown in fig. 6; the portions drawn with solid lines and dotted lines together constitute the process of training the network structure shown in fig. 6 to obtain the network structure finally used for image layout information generation.
When image layout is performed on a base image based on the network structure shown in fig. 6, on the one hand, feature extraction is performed on the base image through the Visual Backbone network to obtain the content embedding vector (Embedding) of the base image, which comprises the visual feature vector and the position encoding vector corresponding to the base image; on the other hand, the hidden space vector z is randomly sampled according to the distribution information of the hidden space vector z corresponding to the bboxes learned during model training. The randomly sampled hidden space vector z and the content embedding vector of the base image are then input together into the decoder of the Transformer network for decoding, and after multiple autoregressive steps in the decoder a bbox sequence is obtained. The bbox sequence contains the position, size, and category of each bbox; each bbox is a target area on the base image that needs to carry an element to be synthesized, and the positions, sizes, and categories of the bboxes together form the image layout information of the base image. The hidden space vector z is a feature representation of the bboxes learned during model training; optionally, it may be sampled using, but not limited to, a KL-divergence-based sampling method. For each target area, its category and position are predicted conditioned on the categories and positions of the previously predicted target areas; for the first target area, a special token is input to mark that the first target area is to be generated. This token may be, but is not limited to, bos, and it also indicates the beginning of the output information sequence (during model inference, this information sequence is the image layout information). The autoregressive process means that the hidden space vector z is used as the first input of the decoder to predict the first bbox; the hidden space vector z is then concatenated with the embedding vector corresponding to the first bbox as the new decoder input to predict the second bbox; then the hidden space vector z is concatenated with the embedding vectors corresponding to the first and second bboxes as the new decoder input to predict the third bbox, and so on, until the whole bbox sequence is obtained.
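A minimal sketch of the autoregressive inference loop, assuming `decoder` and `embed_box` are callables wrapping the Transformer decoder and the bbox embedding respectively, and that the sequence length is capped at a fixed maximum; all of these are assumptions.

```python
import torch


def autoregressive_layout(decoder, embed_box, z: torch.Tensor,
                          content_embedding: torch.Tensor, max_boxes: int = 8):
    """Start from the randomly sampled hidden space vector z, predict one bbox
    per step, and splice its embedding onto the decoder input for the next step."""
    tokens = [z]                                        # the first decoder input is z itself
    boxes = []
    for _ in range(max_boxes):
        seq = torch.stack(tokens, dim=0).unsqueeze(0)   # (1, t, d)
        box = decoder(seq, content_embedding)           # position, size and category
        boxes.append(box)
        tokens.append(embed_box(box))                   # condition the next prediction
    return boxes                                        # the bbox sequence (image layout)
```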
In the model training process, sample images are first obtained and sample bboxes are labelled in advance, each sample bbox having attribute information such as position and category. Then, as shown by the solid and dotted lines in fig. 6, the sample image first passes through the Visual Backbone network to obtain the content embedding vector (Embedding) of the sample image, which comprises the visual feature vector and the position encoding vector of the sample image. The content embedding vector and the embedding vector of the currently labelled sample bbox Y are then fed into the Transformer network, where the Transformer encoder performs encoding to obtain the encoding vector corresponding to the current sample bbox Y; an Attention Average Pooling layer (AAP) performs attention-weighted averaging over the encoding vectors corresponding to all sample bboxes, fusing them into the hidden space vector z, and the mean and variance (i.e. the distribution information) of the hidden space vector z are learned. Then, random sampling is performed according to the learned distribution information of the hidden space vector z, and the randomly sampled hidden space vector z is fed, together with the content embedding vector of the sample image, into the Transformer decoder for decoding; in each autoregressive step the decoder predicts the category and position of one bbox. Each time, the divergence loss (KL loss) is calculated from the predicted category and position of the bbox and the category and position of the corresponding sample bbox, and the model training process ends when the divergence loss meets the requirement, thereby obtaining the final network structure for generating image layout information shown in fig. 6. For every bbox other than the first, its category and position are predicted conditioned on the categories and positions of the bboxes predicted before it; that is, the input of each autoregressive step of the decoder is the concatenation of the hidden space vector z and the embedding vectors corresponding to all previously predicted bboxes. For the purpose of distinction, during model training the output of the decoder is the predicted position and category of a bbox on the sample image, while during model inference the output of the decoder is the position and category of a target area on the base image.
It should be noted that the following considerations motivate the architecture of the autoregressive Transformer network in the embodiments of the present application:
First, autoregressive decoding has stronger expressive power than non-autoregressive decoding: the position of the (N+1)-th target region or bbox is predicted given the categories and positions of the first N target regions or bboxes, and by reasonably arranging the order of the target regions or bboxes, the network can output the positions and categories of the different target regions or bboxes sequentially in a certain order. At the same time, the task of predicting the positions and categories of the remaining target regions or bboxes, given the categories and positions of some input target regions or bboxes, is also supported naturally.
Secondly, adopting the VAE structure and the divergence loss (KL loss) effectively constrains the hidden space to a Gaussian distribution. During model inference, a randomly sampled hidden space vector z can therefore yield a good layout result, i.e. the hidden space vector z is continuous and dense.
Further, in order to make full use of the position information of the subject object in the sample image or the base image during model training or inference, the embodiment of the present application also creatively provides a Geometry Alignment module, whose structure is shown in the lower-left part of fig. 7. The geometric alignment module is mainly used to enhance the position encoding of the sample image or the base image during model training or inference. Specifically, the geometric alignment module divides the input image (the sample image during model training, the base image during model inference) into a plurality of image blocks (patches), each patch having geometric parameters such as position, length, and width; the size of a patch is not limited in this embodiment of the application and may be, for example, but not limited to, 16 × 16. Each patch is then embedded and encoded according to its geometric parameters to obtain the embedding vector of that patch, and the embedding vectors of all patches form a position encoding sequence used for position-encoding enhancement. This position encoding sequence is fed into the Transformer network as one input path of the encoder, realizing position enhancement of the subject object in the input image, so that the position of the subject object is fully considered during image layout and occlusion of the subject object by the bboxes is avoided.
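A minimal sketch of how the geometric alignment module could produce its position encoding sequence, assuming normalized (x, y, w, h) geometry per patch and leaving the learned embedding layer out.

```python
import torch


def patch_geometry_sequence(image: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split a (C, H, W) image into patch x patch blocks and emit one geometric
    token (normalized x, y, w, h) per block; a learned embedding layer would
    normally map these tokens to the model dimension."""
    _, h, w = image.shape
    boxes = []
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            boxes.append([x / w, y / h, patch / w, patch / h])
    return torch.tensor(boxes)        # the position encoding sequence (one row per patch)
```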
Based on the above, the network structure fusing the geometric alignment module is shown in fig. 7. In the network structure shown in fig. 7, the upper-left part represents the visual backbone network and its output, the lower-left part represents the geometric alignment module and its input, and the right part represents the encoder structure of the Transformer. It should be noted that the encoder and decoder have the same structure and differ only in their inputs; fig. 7 takes the encoder and its inputs as an example. Since the encoder is only used during model training, the model training process with the fused geometric alignment module is described in detail below with reference to fig. 7. As shown in fig. 7, during model training the input of the encoder in the Transformer network has three paths: one is the content embedding vector (comprising the visual feature vector and the position encoding vector) obtained by content-encoding the input image with the visual backbone network; another is the position encoding sequence obtained by the geometric alignment module segmenting the input image into patches and embedding-encoding each patch; and the third is the embedding vector obtained by embedding-encoding the attribute information, such as position, category, length, and width, of the pre-labelled sample bboxes. Further, as shown in fig. 7, in the encoder the embedding vectors of the sample bboxes are first correlated in the self-attention layer (self attention) using a self-attention mechanism to learn the positional relationships between different sample bboxes; the output of the self-attention layer is sent to a normalization layer (add & norm) for normalization and then to the cross-attention layer (cross attention), where the content embedding vector of the input image, the position encoding sequence, and the output of the normalization layer are jointly processed, realizing the fusion of the image vision, the subject object position, and the sample bboxes and yielding a fused feature vector. The fused feature vector is input into a normalization layer (add & norm) for normalization and then passes through a fully connected layer (FFN) and another normalization layer (add & norm) in sequence to obtain the hidden space vector z. Further, during model training, the content embedding vector of the input image, the position encoding sequence output by the geometric alignment module, and the hidden space vector z are sent to the decoder for decoding. In the decoder, the hidden space vector z, or the concatenation of the hidden space vector z with the embedding vectors corresponding to the already predicted bboxes, is used as the input of the self-attention layer, which likewise learns the positional relationships between the different predicted bboxes; then, the content embedding vector of the input image, the position encoding sequence, and the normalized result of the positional relationships learned by the self-attention layer are jointly processed in the cross-attention layer, realizing the fusion of the image vision, the subject object position, and the predicted bboxes, and the category and position of the next bbox are predicted based on the fused feature vector.
Correspondingly, during model inference, in the decoder the randomly sampled hidden space vector z, or the concatenation of the randomly sampled hidden space vector z with the embedding vectors corresponding to the already predicted target areas, is used as the input of the self-attention layer, which likewise learns the positional relationships between the different predicted target areas; then, the content embedding vector of the base image, the position encoding sequence, and the normalized result of the positional relationships learned by the self-attention layer are jointly processed in the cross-attention layer, realizing the fusion of the image vision, the subject object position, and the predicted target areas, and the category and position of the next target area are predicted based on the fused feature vector.
In the encoder or the decoder, whether during model training or model inference, in the self-attention layer and the cross-attention layer one piece of information may be used as the query (Q) and the other information as the key-value pairs (KV), and the correlation computation is implemented by matching the query against the key-value pairs.
In the embodiment of the application, a cross-attention mechanism is adopted in the decoder of the Transformer, and the visual embedding (Embedding) vector obtained by the visual backbone network (such as a ViT network) is explicitly modelled with position coordinates, so that the content embedding vector fed into the Transformer and the position vector of the subject object are decoupled; inner products are computed separately with the visual features and the position encodings in the content embedding vector of the input image to obtain their respective similarity matrices, from which the final similarity matrix is calculated. This structure enables the positions of the output target areas to effectively perceive the position of the subject object in the input image, so that the subject object can be avoided while its positional relationships are preserved, thereby reducing occlusion of the subject object.
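A minimal sketch of the decoupled similarity computation described above; scaling and softmax are omitted, and summing the two similarity matrices is an assumed combination rule.

```python
import torch


def decoupled_attention_scores(q_vis: torch.Tensor, q_pos: torch.Tensor,
                               k_vis: torch.Tensor, k_pos: torch.Tensor) -> torch.Tensor:
    """Compute separate similarity matrices for the visual embeddings and the
    position encodings, then combine them into the final similarity matrix."""
    sim_visual = q_vis @ k_vis.transpose(-1, -2)      # visual-feature similarities
    sim_position = q_pos @ k_pos.transpose(-1, -2)    # position-encoding similarities
    return sim_visual + sim_position                  # final attention score matrix
```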
Based on the above, another detailed implementation of step S102, creative layout generation, includes: inputting the base image into the visual backbone network of the image layout model for visual feature extraction to obtain the content embedding vector of the base image, which comprises the visual feature vector and the position encoding vector of the base image; inputting the base image into the geometric alignment module, dividing it into a plurality of image blocks, and position-encoding these image blocks to obtain the position encoding sequence; inputting the content embedding vector and the position encoding sequence into the decoder of the image layout model; and, in the decoder, performing autoregressive decoding on the content embedding vector and the position encoding sequence using the cross-attention and self-attention mechanisms to obtain the image layout information of the base image. It should be noted that the decoder input also includes a randomly sampled hidden space vector z.
After the above-described series of processing, the image layout information of the base image can be obtained. Once the base image and its image layout information are available, step S103 can be performed for pattern generation. Step S103, pattern generation, is described in detail next.
Step S103: pattern generation
In this embodiment, after the base image and its image layout information are obtained, a multi-modal document generation method may be adopted to generate appropriate document information for the target areas on the base image that are to carry document information (such a target area may be referred to simply as a text box). Multi-modal pattern generation means that the corresponding document information is adaptively generated for the target areas carrying document information on the base image by comprehensively considering multi-modal information, such as the information of the base image (e.g. the subject object, its position on the base image, and the background colour), the basic material information corresponding to the subject object (e.g. various text descriptions, tables, video information, and audio information corresponding to the subject object), the position and category of each target area, and the mutual logical relationships among the multiple target areas. Specifically, a detailed embodiment of on-image pattern generation includes: generating multi-modal description information of the subject object according to the base image and the basic material information corresponding to the subject object, where each modality of description information records part of the description information of the subject object; and inputting the multi-modal description information of the subject object together with the position and category of the at least one target area into the on-image pattern generation model to generate document information, so as to obtain the document information to be carried by each target area of the document category. A target area of the document category is a target area that needs to carry document information, and the category of each target area is used to determine whether it is a target area of the document category.
The multi-modal description information of the subject object includes, but is not limited to: attribute information of the subject object obtained from the base image, such as the position and background colour of the subject object on the base image, and various text descriptions, tables, video information, audio information, and the like corresponding to the subject object obtained from the basic material information corresponding to the subject object. Taking a commodity as the subject object as an example, the text description information may record text such as the title of the commodity and information related to the commodity (brand, style, product number, applicable season, sales channel, etc.); the table may record attribute information of the commodity itself (e.g. product parameter information including, but not limited to, material, colour, composition, and size); the video information is, for example, a video containing the commodity; and the audio information is, for example, an introductory audio of the commodity. In this embodiment, because each modality of the multi-modal description information records part of the description information of the subject object, a semantic representation of the subject object is realized.
After the multi-modal description information of the subject object is obtained, the document information is generated using the on-image pattern generation model, further in combination with the position and category of the at least one target area. In the present embodiment, when generating the document information with the above pattern generation model, the following considerations are made in order to improve the quality and rationality of the generated document information:
1. Considering that a piece of document information may be suitable for multiple target areas, if document information is generated for each target area independently, similar document information may be generated for several target areas with similar positions, resulting in duplicated documents. Therefore, the on-graph document generation model of this embodiment solves the problem of easily generating duplicate documents by comprehensively considering the logical relationship between the target areas; that is, when generating document information for a target area, the model considers both the target area itself and the related information of the other target areas adjacent to it (referred to as context information for short), for example the document information already generated for the previous target area and the category of the next target area.
Specifically, the spatial distance between the target areas may be computed from their positions or center points, and all target areas may be sorted according to these spatial distances; from the sorting result, the nearest neighboring target areas in front of and behind each target area can be determined; further, the position encodings of the other target areas adjacent to and closest to the current target area may be used as context position encodings and sent to the document generation model together with the position encoding of the current target area to generate the document information.
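As an illustrative, non-limiting sketch in Python (the helper names, the (x1, y1, x2, y2) box format and the top-to-bottom ordering heuristic are assumptions, not the patented implementation), the neighbor selection described above might look like this:

    import numpy as np

    def center(box):
        # box is (x1, y1, x2, y2); returns the center point of the target area
        x1, y1, x2, y2 = box
        return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

    def order_target_areas(boxes):
        # Sort the target areas top-to-bottom / left-to-right by their center points,
        # which approximates ordering by pairwise spatial distance for typical layouts.
        centers = [center(b) for b in boxes]
        return sorted(range(len(boxes)), key=lambda i: (centers[i][1], centers[i][0]))

    def context_neighbors(boxes):
        # For every target area return the indices of the previous and next target
        # areas in the ordering; their position encodings supply the context.
        order = order_target_areas(boxes)
        neighbors = {}
        for rank, idx in enumerate(order):
            prev_idx = order[rank - 1] if rank > 0 else None
            next_idx = order[rank + 1] if rank + 1 < len(order) else None
            neighbors[idx] = (prev_idx, next_idx)
        return neighbors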
2. Considering that a target area for carrying document information has a certain size in addition to its position, the on-graph document generation model of this embodiment considers the position and the size of the target area at the same time and controls the content, type and number of words of the document information accordingly, so that the document information fits the target area closely. In other words, a matching task is added to the on-graph document generation model of this embodiment to match the base image and the target area (i.e., the bbox) with the document information, thereby strengthening the dependency between the base image and target area on the one hand and the document information on the other.
Accordingly, in the model training process, the matching task can be trained from three aspects: a) document rationality: judging whether the sample image matches the sample document information; b) style adaptability: judging whether the sample area matches the sample document information; c) document diversity: judging whether adjacent sample areas match the sample document information. A sample area refers to an area on a sample image that carries document information during training, and can also be represented by a bounding box (bbox). During model training, positive and negative samples need to be constructed; one construction of positive and negative samples is shown in fig. 8, which includes image-document matching samples, position-document matching samples and adjacent-position-document matching samples, with the positive sample shown at the top and the negative sample at the bottom of each pair. In the image-document matching pair, the document information in the positive sample is "refreshing, hydrating, soothing and conditioning", while the document information in the negative sample is "7-layer filtered living spring water". In the position-document matching pair, the document information "refreshing, hydrating, soothing and conditioning" in the positive sample is located just above the commodity image, whereas in the negative sample the document information is placed in the bottom area of the image. In the adjacent-position-document matching pair, the first document line ("sweeps away oily shine") in the positive sample is located above the second line ("instantly clear and translucent") and the two lines are close together; in the negative sample, the first line is located below the second line and the two lines are far apart.
3. Considering that subject objects of different categories differ noticeably in the document description style they require, and in order to avoid assigning a subject object a document style that does not suit it (a mismatch colloquially described as "putting Zhang's hat on Li's head"), in this embodiment the category of the subject object may be added as one piece of the multi-modal description information, so as to help the on-graph document generation model generate document styles adapted to subject objects of different categories; meanwhile, the matching task can also alleviate, to a certain extent, the problem of mismatched documents when matching the document information with the target area. For example, in the case where the subject object is a commodity, the category information of the subject object may be the category information of the commodity.
In the embodiments of the present application, the model architecture of the on-graph document generation model is not limited, and any model architecture having the above functions and capable of generating document information for the target areas is applicable. As an example, the embodiment of the present application provides a multi-modal model structure based on multi-layer Transformers, shown in fig. 9. The model architecture shown in fig. 9 supports, as model inputs, multi-modal description information such as the base image, the position of the current target area, the positions of the other target areas adjacent to and closest to the current target area in front of and behind it (i.e., the position of the previous target area and the position of the next target area), the category of the subject object (e.g., the category of a product), the name of the subject object, the attribute pairs of the subject object, and the document token to be predicted. Each piece of multi-modal description information is embedded (Embedding) to obtain a corresponding embedded vector, and the embedded vectors are then fed into the multi-layer-Transformer-based multi-modal model, which generates the document information corresponding to the current target area in an auto-regressive manner. As shown in fig. 9, the category of the subject object (such as the category of a product), the name of the subject object, the attributes of the subject object and the document token to be predicted may be embedded, for example with word embedding (Word Embedding), position embedding (Positional Embedding) and modality embedding (Modal Embedding), to obtain corresponding embedded vectors; correspondingly, spatial embedding (Spatial Embedding) and linearization (Linear) processing may be performed on the position of the current target area, the position of the previous target area and the position of the next target area to obtain corresponding spatial embedded vectors; and for the base image, visual embedding (Visual Embedding) and spatial embedding (Spatial Embedding) may be performed to obtain the corresponding visual embedded vector and spatial embedded vector. Here, embedding refers to representing an object with a low-dimensional vector, where the object may be the base image, the position of the current target area, the position of the previous target area, the position of the next target area, the category of the subject object, the name of the subject object, the attribute pairs of the subject object, the document token to be predicted, and so on.
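The following is a highly simplified PyTorch sketch of such a multi-layer-Transformer document generator; the dimensions, segment ids, vocabulary handling and the greedy decoding loop are illustrative assumptions rather than the patented model, and a causal mask over the generated document tokens is omitted for brevity:

    import torch
    import torch.nn as nn

    class MultiModalCopyGenerator(nn.Module):
        def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, d_model)   # word embedding for text tokens
            self.pos_emb = nn.Embedding(512, d_model)            # position embedding
            self.seg_emb = nn.Embedding(8, d_model)               # one id per modality segment
            self.spatial_emb = nn.Linear(4, d_model)               # (x1, y1, x2, y2) -> spatial embedding
            self.visual_proj = nn.Linear(2048, d_model)            # e.g. ResNet-50 patch features -> visual embedding
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def encode_inputs(self, patch_feats, patch_boxes, cur_box, prev_box, next_box,
                          text_tokens, text_segments):
            # visual tokens: visual embedding + spatial embedding of each image patch
            vis = self.visual_proj(patch_feats) + self.spatial_emb(patch_boxes)
            # spatial tokens for the current / previous / next target areas
            boxes = torch.stack([cur_box, prev_box, next_box], dim=1)
            box_tok = self.spatial_emb(boxes)
            # text tokens: category, name, attribute pairs and already generated document tokens
            pos = torch.arange(text_tokens.size(1), device=text_tokens.device)
            txt = self.word_emb(text_tokens) + self.pos_emb(pos) + self.seg_emb(text_segments)
            return torch.cat([vis, box_tok, txt], dim=1)

        def forward(self, *inputs):
            h = self.backbone(self.encode_inputs(*inputs))
            return self.lm_head(h)   # logits; only the document positions are supervised

    @torch.no_grad()
    def generate_copy(model, static_inputs, bos_id, eos_id, max_len=20):
        # auto-regressive generation of the document for the current target area
        tokens = [bos_id]
        for _ in range(max_len):
            text = torch.tensor([tokens])
            seg = torch.full_like(text, 3)       # assume segment id 3 marks the document to predict
            logits = model(*static_inputs, text, seg)
            next_id = int(logits[0, -1].argmax())
            if next_id == eos_id:
                break
            tokens.append(next_id)
        return tokens[1:]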
Further, when the position of a target area is embedded as described above, in order to improve the efficiency and convenience of encoding, the position coordinates of the target area may be discretized. As shown in fig. 9, the x-coordinates and y-coordinates of the target area are embedded separately, where each target area is represented by coordinates (x1, y1, x2, y2), (x1, y1) being the coordinates of its upper left corner and (x2, y2) the coordinates of its lower right corner. Further, to facilitate the embedded encoding of the base image, in the embodiment of the present application the entire base image is divided into a fixed number of image patches (patch) in both the horizontal and vertical directions, and the horizontal and vertical indices of the patches in which the target area is located are used as the position coordinates of the target area. As shown in fig. 10, the document information on the base image is first masked by a mask operation to obtain a base image with a mask region; next, the masked base image is processed by a CNN network and the entire base image is divided into 5 × 5 patches in the horizontal and vertical directions. In the example, one target area spans the patches in the 1st to 2nd columns and the 2nd to 4th rows, so the positions of its top-left patch and bottom-right patch can be taken as the coordinates of the target area, i.e., (x1, y1, x2, y2) = (1, 2, 2, 4), and the x-coordinates and y-coordinates of the target area are then spatially encoded (Spatial Encoding) according to the position coordinates (1, 2, 2, 4). Fig. 10 takes dividing the entire base image into 5 × 5 patches as an example; the number of patches is not limited to this. For the embedded encoding of the base image, a residual network such as the ResNet-50 model may be employed, but is not limited thereto.
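A small sketch of the coordinate discretization described above (the grid size, 1-based indexing and rounding convention are assumptions) could be:

    def box_to_patch_coords(box, img_w, img_h, grid=5):
        # Map a pixel-space box (x1, y1, x2, y2) to the 1-based column/row indices of
        # the patches containing its top-left and bottom-right corners on a grid x grid split.
        x1, y1, x2, y2 = box
        px1 = min(int(x1 / img_w * grid) + 1, grid)        # column of the top-left patch
        py1 = min(int(y1 / img_h * grid) + 1, grid)        # row of the top-left patch
        px2 = min(int((x2 - 1) / img_w * grid) + 1, grid)  # column of the bottom-right patch
        py2 = min(int((y2 - 1) / img_h * grid) + 1, grid)  # row of the bottom-right patch
        return (px1, py1, px2, py2)   # e.g. (1, 2, 2, 4) for the example in fig. 10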
After the above series of processes, the document information to be synthesized with the base image is obtained. Once the base image, the image layout information of the base image and the document information to be synthesized with the base image are available, step S104, element visual attribute estimation and rendering, may be performed. Step S104 is described in detail next.
Step S104: element visual attribute estimation and rendering
In this embodiment, after the base image, the image layout information of the base image, and the corresponding document information are obtained, the visual attribute corresponding to each element to be synthesized including the document information may be predicted, and the element to be synthesized is rendered into the corresponding target area on the base image according to the visual attribute, so as to obtain the target synthesized image. In the embodiment of the application, a visual attribute pre-estimation model is trained in advance, and the visual attribute estimation model is utilized to estimate the visual attribute of the element to be synthesized.
In this embodiment, a visual attribute recognition module is provided, and training samples for training the visual attribute estimation model are automatically generated by the visual attribute recognition module. The training samples required for training the visual attribute estimation model need to be sample images carrying synthesized elements and visual attribute information, and the visual attribute information on the sample images needs to be labeled (label); for example, a sample image has label data such as the position of the document area, whether edge tracing is used and whether gradient color is used.
In practical applications, some visual attribute information on a sample image is easy to label manually, but some visual attribute information, such as specific color RGB values, fonts of a document, and the like, is difficult to judge by naked eyes, so that it is difficult to label manually. The visual attribute identification module of the embodiment can solve the problem of labeling the visual attribute information in the sample image, and is particularly used for solving the problem of labeling the attributes of fonts and colors in the visual attribute information. In this embodiment, the visual attribute recognition module includes at least a font recognition module for performing font recognition and a color recognition module for performing color recognition.
The font recognition module can be implemented using a neural network model, and before it is used, model training needs to be performed to obtain a font recognition model. As shown in fig. 11, the distribution of document information on a large number (e.g., 80,000) of real images (e.g., advertisement creative images) is first counted, including but not limited to the height of the document information, the word-count frequency and the character occurrence frequency; then, according to these statistics, document information is synthesized onto images without documents to obtain synthesized images, and the font recognition capability of the font recognition module is trained on the synthesized images. Further, in order to reduce the distribution difference between the synthesized images and the real images in the color space, in this embodiment the real images and the synthesized images are uniformly converted to grayscale, which reduces the distribution difference of the document information in the color space and emphasizes the font information of the documents. Finally, the trained font recognition module is used to predict the document fonts in the sample images used for training the visual attribute estimation model.
In this embodiment, the model architecture of the font recognition module is not limited; for example, ResNet-50 may be used as the main model architecture, but it is not limited thereto. In addition, during the training of the font recognition module, the following three methods may be adopted to improve the accuracy of font recognition.
Mode 1: in order to prevent blurred font edges caused by image scaling, during model training the input synthesized images are directly padded to the same size instead of being scaled or cropped;
mode 2: to alleviate the problem of model overfitting on the synthesized image, the final full connected layer of ResNet-50 is replaced by Full Convolutional Network (FCN) for classification; the FCN uses a convolutional neural network to realize the transformation from image pixels to pixel classes, specifically, transforms the height and width of the intermediate layer feature map back to the size of the input synthetic image through a transposed convolution (transposed convolution) layer, so that the prediction result and the input synthetic image are in one-to-one correspondence in the spatial dimension (height and width): giving a position on a space dimension, and outputting the channel dimension, namely predicting the class of a pixel corresponding to the position;
Mode 3: a label smoothing method is used to further improve the font recognition effect. Specifically, the pre-trained encoder (Encoder) of a font transfer model is used to embed each font and obtain the font's embedded (Embedding) vector; the similarity between fonts is computed from these embedded vectors and, after passing through a Softmax module, replaces the original one-hot label when computing the classification loss. (A sketch of modes 2 and 3 is given after this list.)
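The sketch below illustrates modes 2 and 3 under stated assumptions (a recent torchvision, a pooled logit vector and a KL-divergence form of the smoothed classification loss); it is not the exact training code of this embodiment:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import resnet50

    def build_fcn_font_classifier(num_fonts):
        # Mode 2: drop the final pooling / fully connected layers of ResNet-50 and
        # classify every spatial position with a 1x1 convolution (a transposed-convolution
        # upsampling stage, as described above, could be appended to restore input resolution).
        backbone = resnet50(weights=None)
        features = nn.Sequential(*list(backbone.children())[:-2])   # keep conv feature maps
        head = nn.Conv2d(2048, num_fonts, kernel_size=1)
        return nn.Sequential(features, head)   # output: per-position font logits

    def soft_font_targets(font_ids, font_embeddings, temperature=0.1):
        # Mode 3: label smoothing driven by font similarity. Instead of a one-hot label,
        # the target is a softmax over the cosine similarity between the ground-truth
        # font embedding and the embeddings of all fonts.
        emb = F.normalize(font_embeddings, dim=-1)   # (num_fonts, d)
        sim = emb[font_ids] @ emb.t()                 # (batch, num_fonts)
        return F.softmax(sim / temperature, dim=-1)

    def font_classification_loss(logits, font_ids, font_embeddings):
        # logits: (batch, num_fonts) after spatially pooling the FCN output
        targets = soft_font_targets(font_ids, font_embeddings)
        return F.kl_div(F.log_softmax(logits, dim=-1), targets, reduction='batchmean')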
The color recognition module is mainly used to label the RGB values of the document information, solving the problem that these RGB values are difficult to label manually. In this embodiment, the color recognition module identifies and labels colors mainly by means of image processing. For a given document and its position on the image, a text segmentation model (e.g., Rethinking Text Segmentation) is used to obtain the document region, and the corresponding colors are then extracted. Furthermore, considering that factors such as pixel extraction errors, multi-color gradients and edge tracing can interfere with extracting the dominant color, after the document region is obtained the RGB color values of its pixels can be converted into the LAB space, the color values of these pixels can be clustered in the LAB space, and the center of the largest cluster can be taken as the document color. In the LAB space, "L" denotes lightness, while "A" and "B" denote the two color-opponent dimensions (roughly green-red and blue-yellow).
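A minimal sketch of this color extraction, assuming scikit-learn and scikit-image are available and that the text segmentation model has already produced a boolean mask of the document pixels:

    import numpy as np
    from sklearn.cluster import KMeans
    from skimage.color import rgb2lab, lab2rgb

    def dominant_copy_color(image_rgb, text_mask, n_clusters=5):
        # image_rgb: HxWx3 float array in [0, 1]; text_mask: HxW boolean mask of the
        # document pixels. Cluster the document pixels in LAB space and return the
        # center of the largest cluster as the document color.
        lab = rgb2lab(image_rgb)
        pixels = lab[text_mask]                                    # (N, 3) LAB values of document pixels
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(pixels)
        counts = np.bincount(km.labels_, minlength=n_clusters)
        center_lab = km.cluster_centers_[counts.argmax()]
        # convert the winning LAB center back to an RGB triple in [0, 255]
        rgb = lab2rgb(center_lab.reshape(1, 1, 3)).reshape(3)
        return (rgb * 255).round().astype(int)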
In this embodiment, the font recognition module and the color recognition module in the visual attribute recognition module are used to label the font and color information of the text areas in the sample images, and, combined with some manually labeled visual attribute information, a large number of sample images with visual attribute label data are obtained. The labeled visual attribute information includes but is not limited to: font color, font style, substrate color, gradient color and edge-tracing color; each text or substrate element in a document area has corresponding visual attribute label data. Model training is then performed with the sample images carrying visual attribute label data to obtain the visual attribute estimation model. Since visual attributes such as color and font follow a long-tailed category distribution, in the embodiment of the present application Focal Loss is used as the loss function for model training, and the label data of the visual attribute information is soft-encoded (Soft Encoding).
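As an illustration of combining Focal Loss with soft-encoded labels, one common soft-label variant (an assumption; the exact loss form used in this embodiment is not specified) is:

    import torch
    import torch.nn.functional as F

    def soft_focal_loss(logits, soft_targets, gamma=2.0):
        # logits: (batch, num_classes); soft_targets: (batch, num_classes) soft-encoded labels.
        # Confident, easy predictions are down-weighted by (1 - p)^gamma, which helps
        # with the long-tailed color / font distribution.
        log_p = F.log_softmax(logits, dim=-1)
        p = log_p.exp()
        loss = -(soft_targets * (1.0 - p).pow(gamma) * log_p).sum(dim=-1)
        return loss.mean()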
After the visual attribute estimation model is obtained, the base image and the position and category of the at least one target area can be input into the visual attribute estimation model to estimate the visual attribute of the at least one element to be synthesized, so as to obtain the visual attribute of the at least one element to be synthesized.
Further optionally, in order to reduce the influence of complex backgrounds on color estimation, in the embodiment of the present application the base image may be quantized before element visual attribute estimation is performed, so as to obtain a quantized base image. Specifically, quantizing the base image includes: converting the base image from the RGB space to the LAB space and clustering the pixels of the base image in the LAB space to obtain several cluster groups; re-assigning each pixel in each cluster group to the pixel value of the corresponding cluster center to obtain a re-assigned base image; and converting the re-assigned base image from the LAB space back to the RGB space to obtain the quantized base image. In the LAB space, color is decoupled into L and AB, which are processed and predicted separately. Specifically, using the LAB color space, the AB gamut at L = 50 is divided at equal intervals into 313 classes and the L lightness range is divided at equal intervals into 11 classes, and the model predicts the two separately when predicting colors.
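A minimal sketch of this quantization step (the number of clusters is an assumption):

    from sklearn.cluster import KMeans
    from skimage.color import rgb2lab, lab2rgb

    def quantize_base_image(image_rgb, n_clusters=16):
        # image_rgb: HxWx3 float array in [0, 1]. Cluster all pixels in LAB space,
        # re-assign every pixel to its cluster center, then convert back to RGB.
        h, w, _ = image_rgb.shape
        lab = rgb2lab(image_rgb).reshape(-1, 3)
        km = KMeans(n_clusters=n_clusters, n_init=4).fit(lab)
        quant_lab = km.cluster_centers_[km.labels_].reshape(h, w, 3)
        return lab2rgb(quant_lab)   # quantized base image, float RGB in [0, 1]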
Further optionally, in order to prevent the visual information of the document substrate in the base image from leaking, the document substrate may be masked and then used as model input. The color-space conversion and the masking of the document substrate in the base image are simple data processing steps and do not require a neural network model.
After the quantized base image is obtained, the quantized base image and the position and category of the at least one target area may be input into the visual attribute estimation model to estimate the visual attribute of the at least one element to be synthesized. In the embodiment of the present application, the model architecture of the visual attribute estimation model is not limited. For example, as shown in fig. 12, the visual attribute estimation model may adopt an Encoder-Decoder structure, where the Encoder performs embedded encoding on the input quantized base image to obtain the visual embedded vector of the base image, and the Decoder estimates the visual attributes of the elements from the visual embedded vector of the base image and the position embedded vectors of the at least one target area. Specifically, the quantized base image and the coordinates, length-width attributes and category attributes of each target area on the base image are input into the visual attribute estimation model; the quantized base image is fed to the Encoder, which performs embedded encoding of visual information on it to obtain the visual embedded vector (also called the image visual information) of the base image; the visual embedded vector of the base image, the at least one element to be synthesized, and the coordinates, length-width attributes and category attributes of each target area on the base image are then input into the Decoder, which estimates the visual attribute of the element to be synthesized corresponding to each target area from this information. In fig. 12, XY denotes the coordinates of a target area on the base image, WH denotes the length-width attribute of the target area, Cls denotes the category attribute of the target area (or of the element to be synthesized), and PE denotes the position embedded vector of the base image.
Specifically, in the Encoder, the quantized base image may be encoded into a patch sequence of length N in a manner similar to ViT, which resembles the encoding shown in fig. 10 and is not repeated here; this patch sequence of length N (i.e., 1-N in fig. 12) serves as the input to the Decoder. In the Decoder, the position and category of each target area are taken as the processing objects, and a self-attention mechanism (Self-Attention) and a cross-attention mechanism (Cross-Attention) are combined to decode the visual embedded vector and the at least one element to be synthesized, so as to obtain the visual attribute of each element to be synthesized. Specifically, the position and category attributes of each target area are used as queries (Query); information interaction between different elements to be synthesized is performed through Self-Attention to obtain a first similarity, and information interaction between the elements to be synthesized and the visual embedded vector is performed through Cross-Attention to obtain a second similarity. Each Decoder layer performs these attention operations, and finally the visual attribute of each element to be synthesized is determined according to the first similarity and the second similarity, i.e., corresponding visual attribute information is output for each Query.
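The sketch below shows one way such a Decoder layer could combine Self-Attention among the target-area queries with Cross-Attention to the image patch sequence; layer sizes, the attribute heads and other details are assumptions:

    import torch.nn as nn

    class AttributeDecoderLayer(nn.Module):
        # One decoder layer: Self-Attention between the queries built from the target
        # areas (information exchange between elements to be synthesized), followed by
        # Cross-Attention from the queries to the patch sequence of the quantized base image.
        def __init__(self, d_model=256, n_heads=8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))
            self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

        def forward(self, queries, image_tokens):
            # queries: (B, num_target_areas, d) built from position (XY, WH) and category (Cls)
            # image_tokens: (B, N, d) patch sequence produced by the Encoder
            q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
            q = self.norm2(q + self.cross_attn(q, image_tokens, image_tokens)[0])
            return self.norm3(q + self.ffn(q))

    class VisualAttributeHead(nn.Module):
        # Maps each decoded query to predicted visual attributes; following the color
        # quantization above, AB chroma (313 classes) and L lightness (11 classes) are
        # predicted separately, plus a font class (the attribute set is an assumption).
        def __init__(self, d_model=256, n_ab=313, n_l=11, n_fonts=50):
            super().__init__()
            self.ab_head = nn.Linear(d_model, n_ab)
            self.l_head = nn.Linear(d_model, n_l)
            self.font_head = nn.Linear(d_model, n_fonts)

        def forward(self, decoded):
            return self.ab_head(decoded), self.l_head(decoded), self.font_head(decoded)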
After the visual attribute information of each element to be synthesized is obtained, the document information in the at least one element to be synthesized can be rendered onto the base image according to the position and category of the target areas and the estimated visual attribute information of each element to be synthesized, so as to obtain the target synthesized image. Specifically, the position and category of the target areas and the estimated visual attribute information of each element to be synthesized can be managed from bottom to top according to the hierarchical relationship in the table below, and the rendering layer finally draws a rasterized graphic for each element to be synthesized on the base image to obtain the target synthesized image. Optionally, an existing or self-developed graphics library may be used to draw the rasterized graphics corresponding to the elements to be synthesized on the base image, for example Pygame or Skia; Pygame is a cross-platform Python library, and Skia is a 2D vector graphics processing library that provides efficient and compact representations of fonts, coordinate transformations and bitmaps.
Table 1 below includes, in order from bottom to top: a common attribute layer, a private attribute layer, an entity layer, a rule layer and a rendering layer. The common attribute layer is responsible for managing attribute information common to all elements to be synthesized, such as position, length, width, foreground color and gradient color; the private attribute layer is responsible for managing attribute information specific to each kind of element to be synthesized, such as text, font, substrate style, shop name and logo image; the entity layer is responsible for managing and maintaining the concrete elements to be synthesized, such as document information, substrate elements and logo elements; the rule layer is responsible for describing the rendering rules of the elements to be synthesized, such as text readability, substrate shape and position rationality; and the rendering layer is responsible for drawing the rasterized graphics with the graphics library according to the information and rules of the lower layers.
Table 1 (layers listed from bottom to top)
Common attribute layer: manages attribute information common to all elements to be synthesized, such as position, length, width, foreground color and gradient color
Private attribute layer: manages attribute information specific to each kind of element to be synthesized, such as text, font, substrate style, shop name and logo image
Entity layer: manages and maintains the concrete elements to be synthesized, such as document information, substrate elements and logo elements
Rule layer: describes the rendering rules of the elements to be synthesized, such as text readability, substrate shape and position rationality
Rendering layer: draws the rasterized graphics with the graphics library according to the information and rules of the lower layers
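As a usage illustration of the rendering layer, a minimal Pillow-based sketch that draws one document element with its estimated attributes is given below (Pygame or Skia could be used instead; the function name, arguments and font-fitting loop are assumptions):

    from PIL import Image, ImageDraw, ImageFont

    def render_copy(base_image_path, box, text, font_path, font_color, substrate_color=None):
        # box: (x1, y1, x2, y2) target area; font_path, font_color and the optional
        # substrate (background plate) color come from the visual attribute estimation
        # model. All names and arguments here are illustrative.
        img = Image.open(base_image_path).convert('RGB')
        draw = ImageDraw.Draw(img)
        x1, y1, x2, y2 = box
        if substrate_color is not None:
            draw.rectangle(box, fill=tuple(substrate_color))   # draw the substrate first
        # pick the largest font size whose rendered text still fits the target area width
        size = max(y2 - y1, 8)
        font = ImageFont.truetype(font_path, size)
        while size > 8 and draw.textlength(text, font=font) > (x2 - x1):
            size -= 2
            font = ImageFont.truetype(font_path, size)
        draw.text((x1, y1), text, font=font, fill=tuple(font_color))
        return img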
In summary, the embodiments of the present application provide a method for automatically generating a synthesized image of any specified size according to the content of a base image and the base material information of a subject object, which belongs to the field of automatic creative image generation. The whole process comprises four main steps: creative base image generation, creative layout generation, on-graph document generation, and element visual attribute estimation and rendering.
In the base image generation step, a method is provided that completes image redirection by integrating image classification, PS element detection, inpainting/outpainting, saliency detection and cropping, and the word-erasing effect during inpainting is improved in a self-supervised manner combined with reinforcement learning;
in the creative layout generation step, the domain difference between creative images and ordinary images is eliminated by means such as saliency detection and inpainting, which alleviates the difficulty of acquiring data for image layout generation; the relationship between image content and layout is modeled through a Transformer structure, and a geometric alignment module is provided to improve the modeling effect;
in the on-graph document generation step, a new multi-modal document generation network structure is provided, which adaptively generates document content by comprehensively considering information such as the image (e.g., the commodity subject, its position and the background color), the commodity text, the positions of the text boxes and the logical relationships among multiple boxes;
in the attribute estimation and rendering step, an attribute data set of vector elements such as fonts and substrates is constructed by combining methods such as self-supervised font recognition, text segmentation and color extraction, and manual labeling; a multi-task attribute estimation model is built, and the estimation effect is effectively improved through color quantization decoupling, label smoothing and the like.
With the method provided by the embodiments of the present application, the problem that manually produced creative images are hard to apply in batches can be solved. In addition, the method does not depend on manually designed templates: it can perform creative layout automatically, improve the degree of fusion between the subject object and the documents, layout and so on, and improve the quality of the creative images. Moreover, by combining the layout, document and visual-information estimation experience learned by the machine learning models, the method can carry out creative image generation comprehensively, from the base image to the layout, the documents and the estimation and rendering of visual attributes, improving the richness and completeness of automatic creative images.
The method provided by the embodiments of the present application can be applied to various application scenarios with image generation requirements, for example the generation of creative advertisement images in the e-commerce field. In the e-commerce field, an example of generating a creative advertisement image based on commodity pictures is given below: the original material maps may be commodity pictures collected on an e-commerce platform, the target material map may be a high-quality commodity picture selected from them, and the base image is generated based on this commodity picture, the subject object in the base image being a commodity object. After the base image containing the commodity object is obtained, the creative advertisement image of the commodity object can be obtained through creative layout generation, on-graph document generation, and element visual attribute estimation and rendering. The creative advertisement image may be placed on various pages provided by an e-commerce APP, such as the home page, the product detail page, the product list page and the shopping cart page. Of course, the creative advertisement image may also be delivered to other internet platforms, which is not limited here.
The method provided by the embodiments of the present application can be applied to any commodity category, generating creative advertisement images for commodities of any category, and supports generating creative advertisement images of various size specifications. As shown in fig. 13-15, creative advertisement images of two size specifications are generated for each item, for example with aspect ratios such as 3:4 and 16:9.
It should be noted that the method of the above embodiments of the present application may be executed by a computer device. The computer device may be a terminal device such as a mobile phone, a tablet computer or a notebook computer, a traditional server, or a cloud device such as a cloud server, a server array, a virtual machine or a container. Of course, the method of the above embodiments may also be completed by a terminal device and a server device cooperating with each other. Specifically, as shown in fig. 16a, the terminal device 16a is responsible for acquiring at least one original material map and uploading it to the server device 16b; the server device 16b is responsible for executing steps S101-S105, i.e., sequentially performing creative base image generation, creative layout generation, on-graph document generation, and element visual attribute estimation and rendering to obtain the target synthesized image. Then, as shown in fig. 16a, the server device 16b may publish the target synthesized image to other platforms, such as an advertisement platform or a social platform, autonomously or according to image acquisition requests from those platforms; alternatively, the server device 16b sends the target synthesized image to the terminal device 16a, so that the terminal device 16a displays it. In the e-commerce field, the target synthesized image is a creative advertisement image; as shown in fig. 16a, the server device 16b embeds the creative advertisement image into an advertisement slot area in a page of the e-commerce APP and sends the page to the terminal device 16a, so that the terminal device 16a displays the page of the e-commerce APP and presents the creative advertisement image on it.
Fig. 16b is a flowchart illustrating an image processing method according to an exemplary embodiment of the present application. As shown in fig. 16b, the method comprises:
161. An original image containing a subject object is acquired, the original image having an original size.
162. The original image is fed into an element detection model for on-graph element analysis to obtain the original synthesized elements contained in the original image and their attribute information.
163. The original image is repaired according to the attribute information of the original synthesized elements to obtain a repaired image that does not contain the original synthesized elements.
164. Image redirection processing is performed on the repaired image according to the size relationship between a target size and the original size to obtain a target image with the target size. (A schematic sketch of this flow is given after these steps.)
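A schematic sketch of steps 161-164, where the three callables stand in for the element detection model, the image inpainting model and the redirection logic (all names and interfaces are assumptions, and the images are assumed to be PIL images):

    def generate_target_image(original_image, target_size, element_detector, inpainter, retargeter):
        # Schematic pipeline for steps 161-164.
        elements = element_detector(original_image)      # step 162: original synthesized elements + attributes
        repaired = inpainter(original_image, elements)   # step 163: repaired image without those elements
        if tuple(target_size) == repaired.size:          # step 164: only retarget when the sizes differ
            return repaired
        return retargeter(repaired, target_size)         # extend and/or saliency-crop to the target size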
The image processing method provided by the embodiment of the application can be applied to any application scene needing to process the PS elements and the image size on the graph. Specifically, an original image containing a subject object having an original size, for example, 1:1, 4:3, or the like, may be acquired. Optionally, the original image including the subject object may be directly acquired from the local storage space, or the original image including the subject object sent by the cloud or other device may be received. Further alternatively, in the e-commerce field, the original image may be an advertisement image having a satisfactory image quality selected from at least one advertisement image including the commodity object. As for a detailed implementation of selecting an advertisement image with satisfactory image quality from advertisement images including commodity objects, reference may be made to a detailed implementation of "selecting a target material map from at least one original material map" in the foregoing embodiment, where an original image in the present embodiment corresponds to a target material map in the foregoing embodiment, and an advertisement image in the present embodiment corresponds to an original material map in the foregoing embodiment.
After the original image containing the subject object is obtained, the original image may be sent to an element detection model for on-map element analysis, so as to obtain an original synthesized element contained in the original image and attribute information thereof. For a detailed implementation of the element detection model and the element analysis performed on the graph by the element detection model, reference may be made to the foregoing embodiments, which are not described herein again.
After the original synthesized elements and the attribute information thereof contained in the original image are obtained, the original image can be restored according to the attribute information of the original synthesized elements to obtain a restored image not containing the original synthesized elements. For the original synthesized element and the attribute information thereof, and the process of repairing the original image according to the attribute information of the original synthesized element, reference may be made to the detailed implementation process of repairing the target material map in the foregoing method embodiment, which is not described herein again.
After obtaining the restored image, if a target image having a target size, which may not be the same as the original size, is desired, the restored image may be subjected to image reorientation processing according to a size relationship between the target size and the original size to obtain the target image having the target size. Of course, if the target size is the same as the original size, the restored image may be directly taken as the target image. For a detailed implementation of the image redirection processing on the repaired image, reference may be made to the detailed implementation of the image redirection processing on the repaired material map in the foregoing embodiment, and details are not described here again.
In this embodiment, generation of a target image of any size can be completed from an original image containing a subject object; that is, image redirection (image retargeting) can be performed over the whole image by using models such as PS element detection, image inpainting/font erasing, image extension (outpainting) and saliency detection, which improves the efficiency and quality of image processing, meets various requirements on image size, and offers high flexibility.
Fig. 17 is a flowchart illustrating a data rendering method according to an exemplary embodiment of the present application. As shown in fig. 17, the method includes:
171. An object to be rendered is acquired, where the object to be rendered includes at least one target area for carrying at least one element to be synthesized, and the object to be rendered is an image or a page.
172. The visual attribute of the at least one element to be synthesized is estimated according to the object to be rendered and the position and category of the at least one target area, to obtain the visual attribute of the at least one element to be synthesized.
173. The at least one element to be synthesized is rendered onto the object to be rendered according to the position and category of the at least one target area and the visual attribute of the at least one element to be synthesized, to obtain a target object.
The data rendering method provided by the embodiment of the application can be applied to any application scene needing to synthesize other elements on the existing element bearing object. For example, the method can be applied to an application scene for synthesizing page elements such as controls, images and texts on various application pages, in the scene, the object to be rendered is a page, and the element to be synthesized can be a page element such as a control, an image and a text which needs to be rendered on the page. For another example, it is possible to apply to an image composition scene in which an object to be rendered is an image and new image elements are synthesized on various images, and the elements to be synthesized may be document information, logo elements, substrate elements, decoration elements, or the like that need to be synthesized.
Specifically, an object to be rendered may be acquired. Optionally, the object to be rendered may be directly acquired from the local storage space, or the object to be rendered sent by the cloud or other devices may be received. The object to be rendered comprises at least one target area, each target area is used for bearing one element to be synthesized, each target area has a position and a category attribute, the position of each target area represents the position of the element to be synthesized, which needs to be borne by the target area, on the object to be rendered, and the category of each target area represents the category of the element to be synthesized, which needs to be borne by the target area. Taking the object to be rendered as a page as an example, the category of the element to be synthesized can be a control, a text, an image, a link and the like; taking the example that the object to be rendered is an image, the category of the element to be synthesized is a case, logo, substrate or decorative element, etc.
In addition, each element to be synthesized can be obtained in advance or in real time during rendering. When an element to be synthesized contains text or document information, the document generation model can be used to generate the text or document information, whether it is obtained in advance or in real time during rendering. Specifically, the object to be rendered, the position and category of the at least one target area, and the base material information corresponding to the subject object contained in the object to be rendered may be input into the document generation model to generate document information, so as to obtain the document information in the at least one element to be synthesized. The process of generating document information through the document generation model may refer to the description of the foregoing embodiments and is not repeated here. When the object to be rendered is a page, the subject object contained in the object to be rendered may be determined first; the subject object may be a commodity material image, key information or the first information in the page, etc., without limitation, and the image-text information, audio and video information related to the subject object in the database may be used as the base material information of the subject object.
The object to be rendered includes at least one target area, and the position and category of each target area can be regarded as layout information of the object to be rendered. The layout information may be designed manually or may be automatically generated according to a layout model. Specifically, the object to be rendered may be input into the layout model for generating the layout information. The architecture, training, and reasoning process of the layout model in this embodiment can be referred to the image layout model in the foregoing embodiment, and are not described herein again.
After obtaining the at least one element to be synthesized, the visual attribute of the at least one element to be synthesized may be estimated according to the position and the category of the object to be rendered and the at least one target region, so as to obtain the visual attribute of the at least one element to be synthesized. And further, rendering the at least one element to be synthesized on the object to be rendered according to the position and the category of the at least one target area and the visual attribute of the at least one element to be synthesized to obtain the target object. For a detailed implementation of performing the visual attribute estimation and a detailed implementation of rendering at least one element to be synthesized onto an object to be rendered, reference may be made to the foregoing embodiments, which are not described herein again.
In this embodiment, the visual attribute of the at least one element to be synthesized is estimated according to the object to be rendered and the position and category of the at least one target area, and the elements to be synthesized are rendered onto the object to be rendered based on the obtained visual attributes, which improves the visual fusion between the elements to be synthesized and the object to be rendered and makes the rendered target object look more natural. In addition, in the attribute estimation and rendering step, an attribute data set of vector elements such as fonts and substrates is constructed by combining methods such as self-supervised font recognition, text segmentation and color extraction, and manual labeling; a multi-task attribute estimation model is built, and the estimation effect is effectively improved through color quantization decoupling, label smoothing and the like.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps S101 to S105 may be device a; for another example, the execution subjects of steps S101 and S102 may be device a, and the execution subjects of steps S103 to S105 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations appearing in a specific order are included, but it should be clearly understood that these operations may be executed out of the order they appear herein or in parallel, and the order of the operations such as S101, S102, etc. is merely used to distinguish various operations, and the order itself does not represent any execution order. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
Fig. 18a is a schematic structural diagram of an image processing apparatus according to an exemplary embodiment of the present application. As shown in fig. 18a, the apparatus comprises: an image generation module 181a, a layout generation module 182a, a pattern generation module 183a, a vision estimation module 184a, and a rendering module 185a.
An image generating module 181a for generating a base image from a target material map containing a subject object, the target material map having an original size, the base image having a target size;
a layout generating module 182a, configured to input the base image into an image layout model for image layout to obtain image layout information of the base image, where the image layout information includes a position and a category of at least one target area on the base image for bearing at least one element to be synthesized;
the pattern generating module 183a is configured to input the base image, the position and the category of the at least one target area, and the base material information corresponding to the main object into a pattern generating model to generate pattern information, so as to obtain pattern information in the at least one element to be synthesized;
the visual estimation module 184a is configured to estimate a visual attribute of at least one element to be synthesized according to the position and the category of the base image and the at least one target region, so as to obtain the visual attribute of the at least one element to be synthesized;
and a rendering module 185a, configured to render at least the pattern information in the at least one element to be synthesized onto the base image according to the position and the category of the at least one target area and the visual attribute of the at least one element to be synthesized, so as to obtain a target synthesized image.
For detailed functional description of the functional modules, reference may be made to all relevant contents of the steps related to the embodiment of the method shown in fig. 1, and details are not repeated here.
The image processing apparatus provided in this embodiment is configured to execute the steps in the image processing method provided in the embodiment shown in fig. 1, and therefore can achieve the same effects as those of the method described above.
Fig. 18b is a schematic structural diagram of another image processing apparatus according to an exemplary embodiment of the present application. As shown in fig. 18b, the apparatus comprises: an image acquisition module 181b, an on-map analysis module 182b, an image inpainting module 183b, and a redirection module 184b.
An image acquisition module 181b for acquiring an original image containing a subject object, the original image having an original size;
the on-map analysis module 182b is configured to send the original image to an element detection model for on-map element analysis to obtain an original synthesized element and attribute information thereof included in the original image;
the image restoration module 183b is configured to restore the original image according to the attribute information of the original synthesized element, so as to obtain a restored image not including the original synthesized element;
and a redirection module 184b, configured to perform image redirection processing on the repaired image according to a size relationship between the target size and the original size, so as to obtain a target image with the target size.
For detailed functional description of the functional modules, reference may be made to all relevant contents of the steps involved in the embodiment of the method shown in fig. 16b, and details are not repeated here.
The image processing apparatus provided in this embodiment is used to execute the steps in the image processing method provided in the embodiment shown in fig. 16b, and therefore the same effects as those of the method can be achieved.
Fig. 18c is a schematic structural diagram of another data rendering apparatus according to an exemplary embodiment of the present application. As shown in fig. 18c, the apparatus comprises: an acquisition module 181c, a vision estimation module 182c, and a rendering module 183c.
An obtaining module 181c, configured to obtain an object to be rendered, where the object to be rendered includes at least one target area for bearing at least one element to be synthesized, and the object to be rendered is an image or a page;
a visual estimation module 182c, configured to perform, according to the object to be rendered and the position and the category of the at least one target area, estimation of a visual attribute of the at least one element to be synthesized, to obtain a visual attribute of the at least one element to be synthesized;
a rendering module 183c for rendering the at least one element to be composited onto the object to be rendered according to the position, the category of the at least one target region and the visual attribute of the at least one element to be composited.
For detailed functional description of the functional modules, reference may be made to all relevant contents of the steps involved in the embodiment of the method shown in fig. 17, and details are not repeated here.
The data rendering apparatus provided in this embodiment is configured to execute the steps in the data rendering method provided in the embodiment shown in fig. 17, and therefore can achieve the same effects as the method described above.
Fig. 19 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application. As shown in fig. 19, the computer apparatus includes at least: a memory 191 and a processor 192.
The memory 191 is used to store computer programs and may be configured to store various other data to support operations on the computer device. Examples of such data include instructions for any application or method operating on the computer device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory 191 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 192 coupled to the memory 191 for executing computer programs in the memory 191 for: generating a base image according to a target material image containing a main body object, wherein the target material image has an original size, and the base image has a target size; inputting the base image into an image layout model for image layout to obtain image layout information of the base image, wherein the image layout information comprises the position and the category of at least one target area used for bearing at least one element to be synthesized on the base image; inputting the base image, the position and the category of at least one target area and the basic material information corresponding to the main object into a document generation model to generate document information so as to obtain the document information in at least one element to be synthesized; estimating the visual attribute of at least one element to be synthesized according to the position and the category of the base image and at least one target area to obtain the visual attribute of at least one element to be synthesized; and at least rendering the file information in the at least one element to be synthesized onto the base image according to the position and the category of the at least one target area and the visual attribute of the at least one element to be synthesized to obtain a target synthesized image.
In an alternative embodiment, processor 192 is further configured to: before generating a base image according to a target material image containing a main object, acquiring at least one original material image containing the main object; inputting at least one original material image into an image quality classification model for quality classification to obtain the quality classification of each original material image; and selecting the raw material map with the image quality suitable for serving as the base image from the raw material maps as the target material map according to the quality category of each raw material map.
In an alternative embodiment, the processor 192, when generating the base image from the target material map containing the subject object, is specifically configured to: sending the target material graph into an element detection model for on-graph element analysis to obtain original synthetic elements and attribute information thereof contained in the target material graph; repairing the target material graph according to the attribute information of the original synthetic element to obtain a repaired material graph which does not contain the original synthetic element; and carrying out image redirection processing on the repaired material image according to the size relation between the target size and the original size so as to obtain a base image with the target size.
In an optional embodiment, when the processor 192 sends the target material map to the element detection model for on-map element analysis to obtain the original synthesized element and the attribute information thereof included in the target material map, it is specifically configured to: sending the target material graph into a feature extraction layer in the element detection model for feature extraction to obtain a first feature graph corresponding to the target material graph; sending the first feature map into an element detection model to identify synthetic elements based on an element identification layer of a self-attention mechanism so as to obtain a second feature map corresponding to original synthetic elements contained in the target material map; and sending the second feature graph to an attribute labeling layer in the element detection model for attribute labeling so as to obtain the position, the size and the category of the original synthesized element.
In an optional embodiment, when the processor 192 repairs the target material map according to the attribute information of the original synthesized element to obtain a repaired material map not including the original synthesized element, the processor is specifically configured to: inputting the attribute information of the target material image and the original synthetic element into a mask processing network in the image restoration model, and performing mask processing on the target material image according to the attribute information of the original synthetic element to obtain a mask material image, wherein the mask material image comprises an area to be restored, which is obtained by performing mask processing on the original synthetic element; and inputting the mask material image into an image restoration network in the image restoration model, and restoring the area to be restored according to the pixel values of the peripheral area of the area to be restored to obtain a restoration material image which does not contain the original synthetic elements.
In an alternative embodiment, processor 192 is further configured to: erasing the copy information contained in the target material image by using a text erasing model implemented based on a generative adversarial network, and supplementing the background content of the area where the copy information was erased according to the information of other areas of the target material image, so as to obtain the target material image with the copy information erased.
In an alternative embodiment, processor 192 is further configured to: extracting original copy information from an original copy image, generating target copy information based on the original copy information according to the picture copy-synthesis rules given by a synthesis policy network, and synthesizing the target copy information into a non-text area of the original copy image to obtain a target copy image; and performing model training on the generative adversarial network by using the target copy image and the original copy image as training samples, until both the ternary erasure loss function and the generative adversarial loss function meet the requirements, to obtain the text erasing model, wherein the ternary erasure loss function is generated from the original copy image and the two-stage output images of the generator in the generative adversarial network.
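Purely as a hedged sketch of how such a ternary erasure loss could be assembled from the original copy image and the generator's two-stage outputs; the choice of L1 terms, the masked-region term and the weights are assumptions, since the embodiment does not spell them out.

    import torch.nn.functional as F

    # Assumed form: one term per generator stage plus an erased-region term.
    def ternary_erasure_loss(original, coarse_out, refined_out, text_mask, w=(1.0, 2.0, 4.0)):
        l_coarse = F.l1_loss(coarse_out, original)                            # stage-1 reconstruction
        l_refined = F.l1_loss(refined_out, original)                          # stage-2 reconstruction
        l_region = F.l1_loss(refined_out * text_mask, original * text_mask)   # emphasis on the erased region
        return w[0] * l_coarse + w[1] * l_refined + w[2] * l_region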
In an alternative embodiment, processor 192 is further configured to: inputting the target copy image and the output image of the generator during model training into the synthesis policy network, so that the synthesis policy network updates the picture copy-synthesis rules; the output image of the generator is the image obtained after erasing the target copy information from the target copy image.
In an optional embodiment, when the processor 192 performs image retargeting on the repaired material image according to the size relationship between the target size and the original size to obtain a base image with the target size, it is specifically configured to: determining an image to be cropped on the basis of the repaired material image according to the size relationship between the target size and the original size, wherein the image to be cropped is the repaired material image or an extended image of the repaired material image; and inputting the image to be cropped into a saliency cropping model based on image importance, locating the image area where the subject object is located according to the saliency features of the image to be cropped, and cropping the image to be cropped, centered on the image area where the subject object is located, according to the target size to obtain a base image with the target size.
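The sketch below shows only the saliency-centred cropping arithmetic, assuming a saliency map is already available; how that map is produced (the saliency cropping model itself) is not shown.

    import numpy as np

    # Cut a target-size window centred on the subject area indicated by a saliency map.
    def saliency_crop(image, saliency, target_w, target_h):
        h, w = saliency.shape
        ys, xs = np.nonzero(saliency > saliency.mean())        # rough support of the subject object
        cy, cx = int(ys.mean()), int(xs.mean())                # centre of the subject area
        x0 = int(np.clip(cx - target_w // 2, 0, max(w - target_w, 0)))
        y0 = int(np.clip(cy - target_h // 2, 0, max(h - target_h, 0)))
        return image[y0:y0 + target_h, x0:x0 + target_w]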
In an alternative embodiment, when determining the image to be cropped on the basis of the repaired material image according to the size relationship between the target size and the original size, the processor 192 is specifically configured to: in the case that the target size is larger than the original size, inputting the repaired material image and the target size into an image extension model, performing image extension on the repaired material image according to the target size to obtain an extended image, and taking the extended image as the image to be cropped; and in the case that the target size is smaller than or equal to the original size, directly taking the repaired material image as the image to be cropped.
In an alternative embodiment, when the processor 192 inputs the repaired material image and the target size into the image extension model and performs image extension on the repaired material image according to the target size to obtain an extended image, it is specifically configured to: inputting the repaired material image and the target size into a preprocessing network in the image extension model, and determining an extension direction and an extension length according to the aspect ratio of the target size, wherein the repaired material image constitutes the known image area in the extension direction, and the extension length delimits the unknown image area in the extension direction; and inputting the repaired material image, the extension direction and the extension length into a generative adversarial network in the image extension model, and adversarially generating the pixel values in the unknown image area in the extension direction, with semantic continuity as a constraint, on the basis of the pixel values and semantic information of the known image area in the extension direction, to obtain the extended image.
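The preprocessing step can be made concrete with a small sketch; the fit-then-extend heuristic below is an assumption (the embodiment only states that direction and length follow from the aspect ratio), and the generative adversarial network that fills the unknown area is not shown.

    # Derive an extension direction and length from the original and target sizes.
    def extension_plan(orig_w, orig_h, target_w, target_h):
        scale = min(target_w / orig_w, target_h / orig_h)     # fit the known image area inside the target
        new_w, new_h = round(orig_w * scale), round(orig_h * scale)
        if new_w < target_w:
            return "horizontal", target_w - new_w             # width of the unknown image area
        if new_h < target_h:
            return "vertical", target_h - new_h               # height of the unknown image area
        return None, 0                                        # no extension needed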
In an optional embodiment, when inputting the base image into the image layout model for image layout to obtain the image layout information of the base image, the processor 192 is specifically configured to: inputting the base image into a domain alignment model for visual feature map extraction to obtain the visual feature map of the base image; inputting the base image and its visual feature map into a multi-scale CNN network in the image layout model to extract multi-scale feature maps, and stitching the extracted multi-scale feature maps to obtain a stitched feature map; and sending the stitched feature map into the image layout model, and generating image layout information by using a generative adversarial network with an encoder-decoder structure, to obtain the image layout information of the base image.
In an alternative embodiment, the processor 192, when sending the stitched feature map into the image layout model and generating image layout information by using a generative adversarial network with an encoder-decoder structure to obtain the image layout information of the base image, is specifically configured to: sending the stitched feature map into an encoder in the generator of the generative adversarial network, and encoding the stitched feature map to obtain intermediate image features; inputting the intermediate image features into a decoder in the generator, and decoding the intermediate image features to obtain initial layout information, wherein the initial layout information comprises the position of at least one display area; and sending the initial layout information to a fully connected layer in the generator, and labeling the category of the at least one display area, to obtain the image layout information.
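A hedged sketch of such a generator is shown below: an encoder compresses the stitched feature map into intermediate image features, a decoder emits display-area positions, and a fully connected head labels each area. Channel sizes, the number of areas and the number of categories are illustrative assumptions.

    import torch
    import torch.nn as nn

    class LayoutGenerator(nn.Module):
        def __init__(self, in_ch=256, num_areas=4, num_classes=5):
            super().__init__()
            self.num_areas, self.num_classes = num_areas, num_classes
            self.encoder = nn.Sequential(
                nn.Conv2d(in_ch, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())                     # intermediate image features
            self.decoder = nn.Sequential(
                nn.Linear(64, 128), nn.ReLU(),
                nn.Linear(128, num_areas * 4))                             # (x, y, w, h) per display area
            self.classifier = nn.Linear(num_areas * 4, num_areas * num_classes)

        def forward(self, stitched_feature_map):                           # (B, in_ch, H, W)
            z = self.encoder(stitched_feature_map)                         # (B, 64)
            boxes = torch.sigmoid(self.decoder(z))                         # initial layout: positions in [0, 1]
            logits = self.classifier(boxes)                                # category scores per display area
            b = boxes.size(0)
            return boxes.view(b, self.num_areas, 4), logits.view(b, self.num_areas, self.num_classes)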
In an alternative embodiment, processor 192 is further configured to: inputting an original layout image into the domain alignment model, and extracting the position and the category of the synthesized elements in the original layout image to obtain target layout information; performing mask processing on the synthesized elements in the original layout image to obtain a masked layout image, repairing the masked region in the masked layout image to obtain a target layout image, and extracting a visual feature map of the target layout image; and performing model training on an initial network model formed by combining a multi-scale CNN network with a generative adversarial network having an encoder-decoder structure, using the target layout information, the target layout image and its visual feature map as training samples, to obtain the image layout model.
In an optional embodiment, when the base image is input into the image layout model for image layout to obtain the image layout information of the base image, the processor 192 is specifically configured to: inputting a base image into a visual backbone network in an image layout model to perform visual feature extraction to obtain a content embedded vector of the base image; inputting a base image into a domain alignment module in an image layout model, dividing the base image into a plurality of image blocks and carrying out position coding on the image blocks to obtain a position coding sequence; and inputting the content embedded vectors and the position coding sequences into a decoder in the image layout model, and performing autoregressive decoding processing on the content embedded vectors and the position coding sequences by adopting a cross attention mechanism and a self attention mechanism in the decoder to obtain the image layout information of the substrate image.
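A compact sketch of this alternative layout model follows; for brevity the visual backbone and the domain alignment module are collapsed into a single patch-embedding step, and the token vocabulary (discretised coordinates and category ids) and all sizes are assumptions.

    import torch
    import torch.nn as nn

    class AutoregressiveLayoutModel(nn.Module):
        def __init__(self, d_model=256, patch=16, vocab_size=256, max_len=40):
            super().__init__()
            self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
            self.pos_embed = nn.Parameter(torch.zeros(1, 1024, d_model))      # position codes for image blocks
            self.token_embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=6)
            self.head = nn.Linear(d_model, vocab_size)
            self.max_len = max_len

        @torch.no_grad()
        def generate(self, image, bos_id=0):                                  # image: (B, 3, H, W)
            mem = self.patch_embed(image).flatten(2).transpose(1, 2)          # content embedding vectors
            mem = mem + self.pos_embed[:, : mem.size(1)]                      # add position coding sequence
            tokens = torch.full((image.size(0), 1), bos_id,
                                dtype=torch.long, device=image.device)
            for _ in range(self.max_len):                                     # autoregressive decoding
                h = self.decoder(self.token_embed(tokens), mem)               # self- and cross-attention
                nxt = self.head(h[:, -1]).argmax(-1, keepdim=True)
                tokens = torch.cat([tokens, nxt], dim=1)
            return tokens[:, 1:]                                              # layout token sequence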
In an optional embodiment, when the base image, the position and the category of the at least one target area, and the basic material information corresponding to the subject object are input into the copy generation model to generate copy information, so as to obtain the copy information in the at least one element to be synthesized, the processor 192 is specifically configured to: generating multi-modal description information of the subject object according to the base image and the basic material information corresponding to the subject object, wherein each modality of description information records part of the description of the subject object; and inputting the multi-modal description information of the subject object and the position and the category of the at least one target area into an on-image copy generation model to generate copy information, so as to obtain the copy information to be carried by the target area of each copy category.
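As a rough illustration of how the inputs to on-image copy generation can be assembled, the sketch below turns the multi-modal description and each target area's category into a conditioning prompt; the dictionary layouts and prompt wording are assumptions, and the copy generation model itself is not shown.

    # Build one copy-generation request per target area (all field names hypothetical).
    def build_copy_requests(descriptions, layout):
        # descriptions: e.g. {"visual": "...", "title": "...", "selling_points": "..."}
        context = "; ".join(f"{k}: {v}" for k, v in descriptions.items())
        requests = []
        for area in layout:                          # area: {"position": (x, y, w, h), "category": "headline"}
            prompt = f"Write {area['category']} copy for a product described by: {context}"
            requests.append({"position": area["position"],
                             "category": area["category"],
                             "prompt": prompt})
        return requests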
In an optional embodiment, the processor 192, when estimating the visual attribute of the at least one element to be synthesized according to the base image and the position and the category of the at least one target area to obtain the visual attribute of the at least one element to be synthesized, is specifically configured to: quantizing the base image to obtain a quantized base image; inputting the quantized base image into an encoder in a visual attribute estimation model, and encoding the quantized base image to obtain image visual information; and sending the position and the category of the at least one target area, the at least one element to be synthesized and the image visual information to a decoder in the visual attribute estimation model, and decoding the image visual information and the at least one element to be synthesized, taking the position and the category of each target area as the processing object and combining a self-attention mechanism and a cross-attention mechanism, to obtain the visual attribute of each element to be synthesized.
In an alternative embodiment, when the processor 192 quantizes the base image to obtain a quantized base image, it is specifically configured to: converting the base image from RGB space to LAB space, and clustering pixel points in the base image in the LAB space to obtain a plurality of cluster groups; reassigning each pixel point in each cluster group to be the pixel value of the corresponding cluster center so as to obtain a reassigned base image; and converting the reassigned base image from the LAB space to the RGB space again to obtain a quantized base image.
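This quantization step maps directly onto standard tooling; the sketch below uses OpenCV and scikit-learn, and the number of clusters k is an assumption, since the embodiment does not fix it.

    import numpy as np
    import cv2
    from sklearn.cluster import KMeans

    # RGB -> LAB, cluster pixels, snap each pixel to its cluster centre, LAB -> RGB.
    def quantize_base_image(rgb_uint8, k=8):
        lab = cv2.cvtColor(rgb_uint8, cv2.COLOR_RGB2LAB)
        pixels = lab.reshape(-1, 3).astype(np.float32)
        km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pixels)
        snapped = km.cluster_centers_[km.labels_]          # reassign each pixel to its cluster centre
        lab_q = snapped.reshape(lab.shape).astype(np.uint8)
        return cv2.cvtColor(lab_q, cv2.COLOR_LAB2RGB)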
In an optional embodiment, when the quantized base image is input into the encoder in the visual attribute estimation model and encoded to obtain the image visual information, the processor 192 is specifically configured to: inputting the quantized base image into the encoder in the visual attribute estimation model, dividing the base image into a plurality of image blocks, and performing visual feature encoding on the plurality of image blocks to obtain a visual feature sequence formed by the plurality of image blocks.
In an alternative embodiment, the processor 192, when decoding the image visual information and the at least one element to be synthesized by taking the position and the category of each target region as the processing object and combining the self-attention mechanism and the cross-attention mechanism to obtain the visual attribute of each element to be synthesized, is specifically configured to: taking the position and the category of each target area as processing objects, and performing information interaction among the at least one element to be synthesized by adopting a self-attention mechanism to obtain a first similarity; performing information interaction between the image visual information and the at least one element to be synthesized by adopting a cross-attention mechanism to obtain a second similarity; and determining the visual attribute of each element to be synthesized according to the first similarity and the second similarity.
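A hedged sketch of one such decoder block is given below: self-attention lets the elements to be synthesized exchange information with one another (first similarity), cross-attention lets them attend to the image visual sequence (second similarity), and a head maps the result to visual attributes such as a colour index. All dimensions are illustrative assumptions.

    import torch.nn as nn

    class AttributeDecoderBlock(nn.Module):
        def __init__(self, d_model=256, num_attrs=16):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
            self.head = nn.Linear(d_model, num_attrs)

        def forward(self, element_tokens, image_tokens):
            # element_tokens: (B, E, d) built from each target area's position and category
            # image_tokens:   (B, N, d) visual feature sequence from the encoder
            x, _ = self.self_attn(element_tokens, element_tokens, element_tokens)   # first similarity
            x, _ = self.cross_attn(x, image_tokens, image_tokens)                   # second similarity
            return self.head(x)                            # per-element visual attributes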
Further, as shown in fig. 19, the computer apparatus further includes: communication components 193, display 194, power components 195, audio components 196, and the like. Only some of the components are shown schematically in fig. 19, which does not mean that the computer device includes only the components shown in fig. 19. In addition, the components within the dashed box in fig. 19 are optional rather than mandatory, depending on the product form of the computer device. The computer device of this embodiment may be implemented as a terminal device such as a desktop computer, a notebook computer, a smart phone, or an IoT device, or as a server device such as a conventional server, a cloud server, or a server array. If the computer device of this embodiment is implemented as a terminal device such as a desktop computer, a notebook computer or a smart phone, it may include the components within the dashed box in fig. 19; if it is implemented as a server device such as a conventional server, a cloud server or a server array, the components within the dashed box in fig. 19 may be omitted.
Accordingly, the present application also provides a computer readable storage medium storing a computer program, and when the computer program is executed by a processor, the processor is enabled to implement the steps in the method embodiment shown in fig. 1.
In addition to the foregoing computer device, an embodiment of the present application further provides a computer device whose structure is the same as or similar to that of the computer device shown in fig. 19 and is not repeated here; the main difference lies in the functions performed by the processor when executing the computer program stored in the memory. Specifically, the processor of the computer device of this embodiment executes the computer program stored in the memory to perform the following operations: acquiring an original image containing a subject object, the original image having an original size; sending the original image into an element detection model for on-image element analysis to obtain the original synthesized elements contained in the original image and their attribute information; repairing the original image according to the attribute information of the original synthesized elements to obtain a repaired image which does not contain the original synthesized elements; and performing image retargeting on the repaired image according to the size relationship between the target size and the original size, so as to obtain a target image with the target size.
Accordingly, the present application also provides a computer readable storage medium storing a computer program, and when the computer program is executed by a processor, the processor is enabled to implement the steps in the method embodiment shown in fig. 16 b.
In addition to the foregoing computer device, an embodiment of the present application further provides a computer device whose structure is the same as or similar to that of the computer device shown in fig. 19 and is not repeated here; the main difference lies in the functions performed by the processor when executing the computer program stored in the memory. Specifically, the processor of the computer device of this embodiment executes the computer program stored in the memory to perform the following operations: acquiring an object to be rendered, wherein the object to be rendered comprises at least one target area for bearing at least one element to be synthesized, and the object to be rendered is an image or a page; estimating the visual attribute of the at least one element to be synthesized according to the object to be rendered and the position and the category of the at least one target area, to obtain the visual attribute of the at least one element to be synthesized; and rendering the at least one element to be synthesized onto the object to be rendered according to the position and the category of the at least one target area and the visual attribute of the at least one element to be synthesized.
Accordingly, the present application also provides a computer readable storage medium storing a computer program, and when the computer program is executed by a processor, the processor is enabled to implement the steps in the method embodiment shown in fig. 17.
The communication component in the above embodiments is configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device in which the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further comprises a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The display in the above embodiments includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The power supply assembly of the above embodiments provides power to various components of the device in which the power supply assembly is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
The audio component in the above embodiments may be configured to output and/or input an audio signal. For example, the audio component includes a Microphone (MIC) configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, Phase-change Random Access Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in the process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (24)

1. An image processing method, comprising:
generating a base image according to a target material image containing a subject object, wherein the target material image has an original size, and the base image has a target size;
inputting the base image into an image layout model for image layout to obtain image layout information of the base image, wherein the image layout information comprises the position and the category of at least one target area used for bearing at least one element to be synthesized on the base image;
inputting the base image, the position and the category of the at least one target area and the basic material information corresponding to the subject object into a copy generation model to generate copy information so as to obtain the copy information in the at least one element to be synthesized;
estimating the visual attribute of the at least one element to be synthesized according to the base image and the position and the category of the at least one target area to obtain the visual attribute of the at least one element to be synthesized;
and at least rendering the copy information in the at least one element to be synthesized onto the base image according to the position and the category of the at least one target area and the visual attribute of the at least one element to be synthesized so as to obtain a target composite image.
2. The method of claim 1, further comprising, prior to generating the base image from the target material image containing the subject object:
acquiring at least one original material image containing the subject object; inputting the at least one original material image into an image quality classification model for quality classification to obtain the quality category of each original material image;
and selecting, from the original material images according to their quality categories, an original material image whose image quality is suitable for serving as the base image, as the target material image.
3. The method of claim 1, wherein generating a base image from a target material image containing a subject object comprises:
sending the target material image into an element detection model for on-image element analysis to obtain the original synthesized elements contained in the target material image and their attribute information;
repairing the target material image according to the attribute information of the original synthesized elements to obtain a repaired material image which does not contain the original synthesized elements;
and performing image retargeting on the repaired material image according to the size relationship between the target size and the original size so as to obtain a base image with the target size.
4. The method of claim 3, wherein sending the target material image into an element detection model for on-image element analysis to obtain the original synthesized elements contained in the target material image and their attribute information comprises:
sending the target material image into a feature extraction layer in the element detection model for feature extraction to obtain a first feature map corresponding to the target material image;
sending the first feature map into an element recognition layer, based on a self-attention mechanism, in the element detection model to recognize synthesized elements so as to obtain a second feature map corresponding to the original synthesized elements contained in the target material image;
and sending the second feature map to an attribute labeling layer in the element detection model for attribute labeling so as to obtain the position, size and category of the original synthesized elements.
5. The method according to claim 3, wherein repairing the target material image according to the attribute information of the original synthesized elements to obtain a repaired material image not containing the original synthesized elements comprises:
inputting the target material image and the attribute information of the original synthesized elements into a mask processing network in an image restoration model, and performing mask processing on the target material image according to the attribute information of the original synthesized elements to obtain a masked material image, wherein the masked material image comprises an area to be repaired obtained by masking the original synthesized elements;
and inputting the masked material image into an image restoration network in the image restoration model, and repairing the area to be repaired according to the pixel values of the area surrounding the area to be repaired to obtain a repaired material image which does not contain the original synthesized elements.
6. The method according to claim 5, wherein, in the case that the original synthesized elements contain copy information, before inputting the target material image and the attribute information of the original synthesized elements into the mask processing network in the image restoration model, the method further comprises:
erasing the copy information contained in the target material image by using a text erasing model implemented based on a generative adversarial network, and supplementing the background content of the area where the copy information was erased according to the information of other areas of the target material image, so as to obtain the target material image with the copy information erased.
7. The method of claim 6, further comprising:
extracting original copy information from an original copy image, generating target copy information based on the original copy information according to the picture copy-synthesis rules given by a synthesis policy network, and synthesizing the target copy information into a non-text area of the original copy image to obtain a target copy image; and performing model training on the generative adversarial network by using the target copy image and the original copy image as training samples until the ternary erasure loss function and the generative adversarial loss function both meet the requirements, to obtain the text erasing model, wherein the ternary erasure loss function is a loss function generated according to the original copy image and the output images of the generator in the generative adversarial network at two stages.
8. The method of claim 7, further comprising:
inputting the target copy image and the output image of the generator during model training into the synthesis policy network, so that the synthesis policy network updates the picture copy-synthesis rules; the output image of the generator is the image obtained after erasing the target copy information from the target copy image.
9. The method according to claim 3, wherein performing image retargeting on the repaired material image according to the size relationship between the target size and the original size to obtain a base image with the target size comprises:
determining an image to be cropped on the basis of the repaired material image according to the size relationship between the target size and the original size, wherein the image to be cropped is the repaired material image or an extended image of the repaired material image;
inputting the image to be cropped into a saliency cropping model based on image importance, locating the image area where the subject object is located according to the saliency features of the image to be cropped, and cropping the image to be cropped, centered on the image area where the subject object is located, according to the target size to obtain a base image with the target size.
10. The method according to claim 9, wherein determining an image to be cropped on the basis of the repaired material image according to the size relationship between the target size and the original size comprises:
in the case that the target size is larger than the original size, inputting the repaired material image and the target size into an image extension model, performing image extension on the repaired material image according to the target size to obtain an extended image, and taking the extended image as the image to be cropped;
and in the case that the target size is smaller than or equal to the original size, directly taking the repaired material image as the image to be cropped.
11. The method according to claim 10, wherein inputting the repaired material image and the target size into an image extension model and performing image extension on the repaired material image according to the target size to obtain an extended image comprises:
inputting the repaired material image and the target size into a preprocessing network in the image extension model, and determining an extension direction and an extension length according to the aspect ratio of the target size, wherein the repaired material image constitutes the known image area in the extension direction, and the extension length delimits the unknown image area in the extension direction;
and inputting the repaired material image, the extension direction and the extension length into a generative adversarial network in the image extension model, and adversarially generating the pixel values in the unknown image area in the extension direction, with semantic continuity as a constraint, on the basis of the pixel values and semantic information of the known image area in the extension direction, to obtain the extended image.
12. The method according to any one of claims 1-11, wherein inputting the base image into an image layout model for image layout to obtain image layout information of the base image comprises:
inputting the base image into a domain alignment model for visual feature map extraction to obtain the visual feature map of the base image;
inputting the base image and its visual feature map into a multi-scale CNN network in the image layout model to extract multi-scale feature maps, and stitching the extracted multi-scale feature maps to obtain a stitched feature map;
and sending the stitched feature map into the image layout model, and generating image layout information by using a generative adversarial network with an encoder-decoder structure, to obtain the image layout information of the base image.
13. The method of claim 12, wherein feeding the stitched feature map into the image layout model and using a generative adversarial network with an encoder-decoder structure to generate the image layout information of the base image comprises:
sending the stitched feature map into an encoder in the generator of the generative adversarial network, and encoding the stitched feature map to obtain intermediate image features;
inputting the intermediate image features into a decoder in the generator, and decoding the intermediate image features to obtain initial layout information, wherein the initial layout information comprises the position of at least one display area;
and sending the initial layout information to a fully connected layer in the generator, and labeling the category of the at least one display area to obtain the image layout information.
14. The method of claim 12, further comprising:
inputting an original layout image into a domain alignment model, and extracting the position and the category of the synthesized elements in the original layout image to obtain target layout information; performing mask processing on the synthesized elements in the original layout image to obtain a masked layout image, repairing the masked region in the masked layout image to obtain a target layout image, and extracting a visual feature map of the target layout image;
and performing model training on an initial network model formed by combining a multi-scale CNN network with a generative adversarial network having an encoder-decoder structure, using the target layout information, the target layout image and its visual feature map as training samples, to obtain the image layout model.
15. The method according to any one of claims 1-11, wherein inputting the base image into an image layout model for image layout to obtain image layout information of the base image comprises:
inputting the base image into a visual backbone network in an image layout model for visual feature extraction to obtain a content embedding vector of the base image;
inputting the base image into a domain alignment module in the image layout model, dividing the base image into a plurality of image blocks and position-coding the plurality of image blocks to obtain a position coding sequence;
and inputting the content embedding vector and the position coding sequence into a decoder in the image layout model, and performing autoregressive decoding on the content embedding vector and the position coding sequence in the decoder by using a cross-attention mechanism and a self-attention mechanism to obtain the image layout information of the base image.
16. The method according to any one of claims 1 to 11, wherein inputting the base image, the position and the category of the at least one target area, and the basic material information corresponding to the subject object into a copy generation model to generate copy information, so as to obtain the copy information in the at least one element to be synthesized, comprises:
generating multi-modal description information of the subject object according to the base image and the basic material information corresponding to the subject object, wherein each modality of description information records part of the description of the subject object;
and inputting the multi-modal description information of the subject object and the position and the category of the at least one target area into an on-image copy generation model to generate copy information, so as to obtain the copy information to be carried by the target area of each copy category.
17. The method according to any one of claims 1 to 11, wherein estimating the visual attribute of the at least one element to be synthesized according to the base image and the position and the category of the at least one target area to obtain the visual attribute of the at least one element to be synthesized comprises:
inputting the base image into an encoder in a visual attribute estimation model, and encoding the base image to obtain image visual information;
and sending the position and the category of the at least one target area, the at least one element to be synthesized and the image visual information to a decoder in the visual attribute estimation model, and decoding the image visual information and the at least one element to be synthesized by taking the position and the category of each target area as a processing object and combining a self-attention mechanism and a cross-attention mechanism, to obtain the visual attribute of each element to be synthesized.
18. The method of claim 17, wherein before inputting the base image into the encoder in the visual attribute estimation model and encoding the base image to obtain the image visual information, the method further comprises:
converting the base image from an RGB space to an LAB space, and clustering pixel points in the base image in the LAB space to obtain a plurality of clustering groups;
reassigning each pixel point in each cluster group to a pixel value corresponding to the cluster center to obtain a reassigned base image;
and converting the reassigned base image from the LAB space to the RGB space again to obtain a quantized base image.
19. The method of claim 18, wherein inputting the base image into the encoder in the visual attribute estimation model and encoding the base image to obtain the image visual information comprises:
inputting the quantized base image into the encoder in the visual attribute estimation model, dividing the base image into a plurality of image blocks, and performing visual feature encoding on the image blocks to obtain a visual feature sequence formed by the image blocks.
20. The method according to claim 17, wherein decoding the image visual information and the at least one element to be synthesized by using the position and the category of each target region as a processing object in combination with a self-attention mechanism and a cross-attention mechanism to obtain a visual attribute of each element to be synthesized comprises:
taking the position and the category of each target area as processing objects, and performing information interaction among the at least one element to be synthesized by adopting a self-attention mechanism to obtain a first similarity;
performing information interaction between the image visual information and the at least one element to be synthesized by adopting a cross-attention mechanism to obtain a second similarity;
and determining the visual attribute of each element to be synthesized according to the first similarity and the second similarity.
21. An image processing method, characterized by comprising:
acquiring an original image containing a subject object, the original image having an original size;
sending the original image into an element detection model for on-image element analysis to obtain the original synthesized elements contained in the original image and their attribute information;
repairing the original image according to the attribute information of the original synthesized elements to obtain a repaired image which does not contain the original synthesized elements;
and performing image retargeting on the repaired image according to the size relationship between the target size and the original size so as to obtain a target image with the target size.
22. A method of data rendering, comprising:
acquiring an object to be rendered, wherein the object to be rendered comprises at least one target area for bearing at least one element to be synthesized, and the object to be rendered is an image or a page;
estimating the visual attribute of the at least one element to be synthesized according to the object to be rendered and the position and the category of the at least one target area to obtain the visual attribute of the at least one element to be synthesized;
rendering the at least one element to be composited onto the object to be rendered according to the position and the category of the at least one target area and the visual attribute of the at least one element to be composited.
23. A computer device, comprising: a memory and a processor; wherein the memory is for storing a computer program; the processor, coupled to the memory, configured to execute the computer program for implementing the steps of the method of any one of claims 1-22.
24. A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, causes the processor to carry out the steps of the method of any one of claims 1-22.
CN202211462844.9A 2022-11-22 2022-11-22 Image processing and data rendering method, apparatus and medium Active CN115511969B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211462844.9A CN115511969B (en) 2022-11-22 2022-11-22 Image processing and data rendering method, apparatus and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211462844.9A CN115511969B (en) 2022-11-22 2022-11-22 Image processing and data rendering method, apparatus and medium

Publications (2)

Publication Number Publication Date
CN115511969A true CN115511969A (en) 2022-12-23
CN115511969B CN115511969B (en) 2023-03-31

Family

ID=84513584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211462844.9A Active CN115511969B (en) 2022-11-22 2022-11-22 Image processing and data rendering method, apparatus and medium

Country Status (1)

Country Link
CN (1) CN115511969B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116597039A (en) * 2023-05-22 2023-08-15 阿里巴巴(中国)有限公司 Image generation method and server
CN116681980A (en) * 2023-07-31 2023-09-01 北京建筑大学 Deep learning-based large-deletion-rate image restoration method, device and storage medium
CN117095084A (en) * 2023-10-19 2023-11-21 中国科学技术大学 Text style image generation method, system, equipment and storage medium
CN117236297A (en) * 2023-11-13 2023-12-15 深圳市上融科技有限公司 PDF (Portable document Format) -based page crossing signature realization method
TWI828575B (en) * 2023-04-14 2024-01-01 欣時代數位科技有限公司 Scenery generation system and control method
CN117333495A (en) * 2023-12-01 2024-01-02 浙江口碑网络技术有限公司 Image detection method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9761204B1 (en) * 2014-09-30 2017-09-12 Cadence Design Systems, Inc. System and method for accelerated graphic rendering of design layout having variously sized geometric objects
CN110889883A (en) * 2019-11-29 2020-03-17 焦点科技股份有限公司 Self-adaptive intelligent banner advertisement picture generation method and system
WO2020155297A1 (en) * 2019-02-01 2020-08-06 网宿科技股份有限公司 Method for generating video mask information, bullet screen anti-occlusion method, server and client
CN113934890A (en) * 2021-12-16 2022-01-14 之江实验室 Method and system for automatically generating scene video by characters
CN114998145A (en) * 2022-06-07 2022-09-02 湖南大学 Low-illumination image enhancement method based on multi-scale and context learning network
CN115294150A (en) * 2022-06-22 2022-11-04 华为技术有限公司 Image processing method and terminal equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9761204B1 (en) * 2014-09-30 2017-09-12 Cadence Design Systems, Inc. System and method for accelerated graphic rendering of design layout having variously sized geometric objects
WO2020155297A1 (en) * 2019-02-01 2020-08-06 网宿科技股份有限公司 Method for generating video mask information, bullet screen anti-occlusion method, server and client
CN110889883A (en) * 2019-11-29 2020-03-17 焦点科技股份有限公司 Self-adaptive intelligent banner advertisement picture generation method and system
CN113934890A (en) * 2021-12-16 2022-01-14 之江实验室 Method and system for automatically generating scene video by characters
CN114998145A (en) * 2022-06-07 2022-09-02 湖南大学 Low-illumination image enhancement method based on multi-scale and context learning network
CN115294150A (en) * 2022-06-22 2022-11-04 华为技术有限公司 Image processing method and terminal equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GANGWEI JIANG et al.: ""Self-Supervised Text Erasing with Controllable Image Synthesis"", 《HTTPS://ARXIV.ORG/ABS/2204.12743V1》 *
YE MA et al.: ""Boosting Image Outpainting with Semantic Layout Prediction"", 《HTTPS://ARXIV.ORG/ABS/2110.09267》 *
ALIMAMA TECH (阿里妈妈技术): ""Achieving 'Template Freedom'? Alimama's Fully Automatic Template-Free Image-and-Text Creative Generation"", 《HTTPS://ZHUANLAN.ZHIHU.COM/P/541440143》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI828575B (en) * 2023-04-14 2024-01-01 欣時代數位科技有限公司 Scenery generation system and control method
CN116597039A (en) * 2023-05-22 2023-08-15 阿里巴巴(中国)有限公司 Image generation method and server
CN116597039B (en) * 2023-05-22 2023-12-26 阿里巴巴(中国)有限公司 Image generation method and server
CN116681980A (en) * 2023-07-31 2023-09-01 北京建筑大学 Deep learning-based large-deletion-rate image restoration method, device and storage medium
CN116681980B (en) * 2023-07-31 2023-10-20 北京建筑大学 Deep learning-based large-deletion-rate image restoration method, device and storage medium
CN117095084A (en) * 2023-10-19 2023-11-21 中国科学技术大学 Text style image generation method, system, equipment and storage medium
CN117236297A (en) * 2023-11-13 2023-12-15 深圳市上融科技有限公司 PDF (Portable document Format) -based page crossing signature realization method
CN117236297B (en) * 2023-11-13 2024-02-27 深圳市上融科技有限公司 PDF (Portable document Format) -based page crossing signature realization method
CN117333495A (en) * 2023-12-01 2024-01-02 浙江口碑网络技术有限公司 Image detection method, device, equipment and storage medium
CN117333495B (en) * 2023-12-01 2024-03-19 浙江口碑网络技术有限公司 Image detection method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN115511969B (en) 2023-03-31

Similar Documents

Publication Publication Date Title
CN115511969B (en) Image processing and data rendering method, apparatus and medium
Wang et al. A survey on face data augmentation for the training of deep neural networks
Ntavelis et al. Sesame: Semantic editing of scenes by adding, manipulating or erasing objects
CN112733044B (en) Recommended image processing method, apparatus, device and computer-readable storage medium
Sushko et al. OASIS: only adversarial supervision for semantic image synthesis
CN117058271A (en) Method and computing device for generating commodity main graph background
Despois et al. AgingMapGAN (AMGAN): High-resolution controllable face aging with spatially-aware conditional GANs
CN112132106A (en) Image augmentation processing method, device and equipment based on artificial intelligence and storage medium
CN114972847A (en) Image processing method and device
Fujii et al. RGB-D image inpainting using generative adversarial network with a late fusion approach
US11868790B2 (en) One-to-many automatic content generation
US20210407153A1 (en) High-resolution controllable face aging with spatially-aware conditional gans
Huang et al. Translucent image recoloring through homography estimation
Ueno et al. Continuous and gradual style changes of graphic designs with generative model
CN116363363A (en) Unsupervised domain adaptive semantic segmentation method, device, equipment and readable storage medium
Lian et al. Cvfont: Synthesizing chinese vector fonts via deep layout inferring
Ke et al. Subject-aware image outpainting
Šoberl Mixed reality and deep learning: Augmenting visual information using generative adversarial networks
US20240127457A1 (en) Layout-aware background generating system and method
Günther et al. Style adaptive semantic image editing with transformers
Cao et al. An improved defocusing adaptive style transfer method based on a stroke pyramid
Zhai Research and Implementation of Colour Optimal Matching Model for Art Design Based on Bayesian Decision‐Making
Zhang et al. Say cheese: Personal photography layout recommendation using 3d aesthetics estimation
Cong et al. Automatic Controllable Colorization via Imagination
Chu Application of animation products via multimodal information and semantic analogy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant