CN117911580A - Design synthesis using image coordination - Google Patents

Design synthesis using image coordination

Info

Publication number
CN117911580A
Authority
CN
China
Prior art keywords
image
text
color
modified
region
Prior art date
Legal status
Pending
Application number
CN202310957863.7A
Other languages
Chinese (zh)
Inventor
I·米罗尼卡
M·鲁帕斯库
A·V·科斯汀
C-C·布佐尤
A·达拉比
Current Assignee
Adobe Inc
Original Assignee
Adobe Systems Inc
Priority date
Filing date
Publication date
Application filed by Adobe Systems Inc filed Critical Adobe Systems Inc
Publication of CN117911580A


Landscapes

  • Character Input (AREA)

Abstract

Embodiments of the present disclosure relate to design synthesis using image coordination. Systems and methods for image editing, and in particular for reconciling background images with text, are provided. Embodiments of the present disclosure obtain an image including text and an area overlapping the text. In some aspects, the text includes a first color. Then, the embodiment selects a second color that contrasts with the first color, and generates a modified image including text and a modified region using a machine learning model having the image and the second color as inputs. The modified image is conditionally generated to include a second color in the region corresponding to the text.

Description

Design synthesis using image coordination
Cross Reference to Related Applications
Under 35 U.S.C. § 119, the present U.S. non-provisional application claims priority to U.S. Provisional Patent Application No. 63/379,813, filed with the United States Patent and Trademark Office on October 17, 2022, the disclosure of which is incorporated herein by reference in its entirety.
Background
The following relates generally to image editing and, more particularly, to reconciling images and text. Compositing involves combining text and graphic elements into a single image. This technology is widely used in graphic design, advertising and digital media. The text and images are layered together to create the final composition. The aim of the composition is to convey information or tell stories through visual elements, creating an attractive visual experience for the audience. Color correction, blending, and masking techniques are commonly used to achieve the desired look and feel and to seamlessly display text and images together.
Several factors need to be considered in creating a composition with text and background images. The text and the background image should have sufficient contrast with each other to ensure that the text is easy to read. The text should be clear and legible and should not interfere with the background image. In some cases, graphic designers will apply various effects to text to provide adequate contrast and legibility, such as using distance, perspective, colors from the light and warm spectra, and the like.
Disclosure of Invention
The present disclosure describes systems and methods for altering the image underlying text to increase the contrast and readability of the text. Embodiments receive an image and text, apply preprocessing to the image, and generate a new image including contrasting colors within the text region. Embodiments include a generative machine learning model, such as a Stable Diffusion model, configured to produce an image that is similar to the original image except for regions corresponding to the text. For example, the generative machine learning model may be configured to receive the preprocessed image as a condition for the generative diffusion process.
A method, apparatus, non-transitory computer readable medium, and system for reconciling text and background images are described. One or more aspects of the method, apparatus, non-transitory computer-readable medium, and system include: obtaining an image comprising text and an area overlapping the text, wherein the text comprises a first color; selecting a second color that contrasts with the first color; and generating a modified image using the machine learning model having the image and the second color as inputs, the modified image including text and a modified region, wherein the modified region overlaps the text and includes the second color.
An apparatus, system, and method for reconciling text and background images are described. One or more aspects of the apparatus, system, and method include a non-transitory computer-readable medium storing code comprising instructions executable by a processor to: obtaining an image comprising text and an area overlapping the text, wherein the text comprises a first color; selecting a second color that contrasts with the first color; and generating a modified image using the machine learning model having the image and the second color as inputs, the modified image including text and a modified region, wherein the modified region overlaps the text and includes the second color.
An apparatus, system, and method for reconciling text and background images are described. One or more aspects of the apparatus, system, and method include: a processor; a memory comprising instructions executable by a processor to perform operations comprising: obtaining an image and text overlapping the image, wherein the text comprises a first color; selecting a second color that contrasts with the first color; and generating a background image of the text based on the second color using the machine learning model, wherein the background image includes the second color in a region corresponding to the text.
Drawings
Fig. 1 illustrates an example of an image editing system in accordance with aspects of the present disclosure.
Fig. 2 illustrates an example of an image editing apparatus according to aspects of the present disclosure.
Fig. 3 illustrates an example of a method of a first stage in a first algorithm for image coordination in accordance with aspects of the present disclosure.
Fig. 4 illustrates an example of a method for branch A in the second stage of a first algorithm in accordance with aspects of the present disclosure.
Fig. 5 illustrates an example of a method for branch B in the second stage of a first algorithm in accordance with aspects of the present disclosure.
Fig. 6 illustrates an example of a method for a second algorithm for image coordination in accordance with aspects of the present disclosure.
FIG. 7 illustrates an example of a method for providing designs to a user in accordance with aspects of the present disclosure.
Fig. 8 illustrates an example of a method for image editing in accordance with aspects of the present disclosure.
Fig. 9 illustrates an example of a guided latent diffusion model in accordance with aspects of the present disclosure.
Fig. 10 illustrates an example of a U-net architecture according to aspects of the present disclosure.
Fig. 11 illustrates an example of a diffusion process in accordance with aspects of the present disclosure.
FIG. 12 illustrates an example of a method for training a diffusion model in accordance with aspects of the present disclosure.
Fig. 13 illustrates an example of a computing device in accordance with aspects of the present disclosure.
Detailed Description
Graphic design is a discipline that involves the use of visual elements (such as images and text) to convey ideas and information. One of the core aspects of graphic design is to combine images with text to create a visually attractive effective combination. This may involve layering the image, adjusting colors, adjusting the size and location of text, and selecting the correct layout to convey the desired information.
In combining images with text, graphic designers must consider many factors, including the context of the image, the purpose of the design, the target audience, and the medium in which the design is presented. For example, the graphical design of a web site may require different considerations than the design of a printed publication.
One of the key challenges in combining images and text is finding the correct balance between these two elements. The graphic designer must select the correct layout, adjust the size and location of the text, and select images that complement the text to create a harmonious combination.
In addition, the designer must consider legibility, readability, and accessibility issues when creating a design that combines images and text. This may involve adjusting the contrast between the text and the background, selecting legible fonts, and ensuring that the text is available to disabled persons.
In some cases, graphic designers may apply various effects to text to provide adequate contrast and legibility. These effects may include shading, lighting, adding solid background to the text, and changing the color of the text. However, these variations are destructive to the design features of the text. Furthermore, making changes to the text that add significant areas (such as adding a solid background) may obscure the image underlying the text.
Furthermore, more complex techniques, such as placing the text in the perspective of the image or changing the color of the text based on a color temperature comparison with the background, may require the underlying image to have a particular perspective or color temperature. In some cases, the underlying image is not compatible with these techniques.
Embodiments of the present disclosure include an image editing device configured to edit an image underlying text rather than the text itself. In this way, the design features of the text are preserved throughout the synthesis process.
Some embodiments are configured to pre-process an image by extracting a color that contrasts with the text, performing panoptic segmentation to identify objects in the image that overlap the text, and coloring those objects with the contrasting color. Some embodiments then add Gaussian noise, including the contrasting color, in the text overlap region. Embodiments then use this altered image as a condition for a generative machine learning model to generate a new modified image that largely remains similar to the original background image but contains the contrasting color in the text overlap region, thereby providing improved contrast.
The color contrast between two colors refers to the visual difference in hue, saturation, and brightness of the two colors. For example, dark colors and light colors may have a high contrast, e.g., dark colors and light colors have a large degree of difference in hue, saturation, and brightness. The high contrast between the two colors can highlight the text and is easily recognizable in the background. A low contrast between the two colors means that the difference between the two colors is less pronounced, which may make the text more difficult to read, and less noticeable in the background.
In some cases, the second color is said to contrast with the first color when the contrast between the two colors is sufficiently large that they are visually distinct, so that each stands out and is easily recognizable against the other.
According to some embodiments, the HSV (hue-saturation-value) color space is used for panoptic segmentation, but the disclosure is not necessarily limited thereto. Unlike the RGB color model, which represents colors as a combination of red, green, and blue light intensities, the HSV color model represents colors as a combination of hue, saturation, and value (brightness). Using the HSV color space provides an intuitive way to process color information because it separates chrominance information (hue and saturation) from brightness information (value). This separation of chrominance and brightness allows hue, saturation, and value to be adjusted independently and makes it easier to perform image processing tasks such as color segmentation.
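As a brief illustration of this separation (not part of the patent text), the following Python sketch uses the standard-library colorsys module to change only the value (brightness) channel of a color while leaving its hue and saturation untouched:

```python
# Minimal illustration of why HSV separates chroma from brightness: the value
# channel can be adjusted without touching hue or saturation.
import colorsys

def lighten(rgb, factor=1.5):
    """Scale only the V (brightness) channel of an RGB color given as floats in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    v = min(1.0, v * factor)              # adjust brightness independently
    return colorsys.hsv_to_rgb(h, s, v)   # hue and saturation are unchanged

print(lighten((0.4, 0.2, 0.6)))  # same purple hue, brighter
```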
In some cases, objects cannot be reliably identified using panoptic segmentation, and therefore cannot be accurately colored before Gaussian noise is applied. For example, the image may not include distinguishable objects. In these cases, embodiments are configured to extract "superpixels" from the original image, i.e., blocks of the original image whose average color falls outside a predetermined range in a color space such as HSV. These superpixels are then applied to the text overlap region, colored Gaussian noise is further applied, and the resulting altered image is provided as a condition to a generative machine learning model to generate a new background image.
Some embodiments replace the underlying image entirely by using a combination of white Gaussian noise and colored Gaussian noise, where the colored noise includes a color that contrasts with the text. The term "colored" in colored noise refers to noise that includes a relatively high amount of a particular color, rather than random gray-scale noise. For example, some embodiments place colored Gaussian noise in the region that will overlap with the text in the final composition and generate one or more images that include the contrasting color in the region that overlaps with the text. The one or more images may be used as design variants for a given text.
Thus, embodiments improve the graphic design process by providing an image coordination method that does not disrupt the design features of the input text. This allows a graphic designer to adhere to a design language, such as that specified by a particular brand, while producing a composite design with legible text.
An image editing system is described with reference to fig. 1-2. A method of generating a graphical design using an image editing system is described with reference to fig. 3-8. The generated machine learning model used by the image editing system is described with reference to fig. 9-12. A computing device that may be used to implement the image editing system is described with reference to fig. 13.
Image editing system
An apparatus for reconciling text and background images is described. One or more aspects of the apparatus include a processor; a memory comprising instructions executable by a processor to perform operations comprising: obtaining an image and text overlapping the image, wherein the text comprises a first color; selecting a second color that contrasts with the first color; and generating a background image for the text based on the second color using the machine learning model, wherein the background image includes the second color in an area corresponding to the text.
Some examples of the apparatus, system, and method further include a segmentation component configured to segment the image to identify one or more objects. Some examples also include a noise component configured to add noise to the image in the region corresponding to the text.
Some examples of the apparatus, system, and method further include a superpixel component configured to extract a plurality of superpixels from a region corresponding to text. Some examples also include a combining component configured to combine the image and the background image to obtain a combined image. In some aspects, the machine learning model includes a generative diffusion model.
Fig. 1 illustrates an example of an image editing system in accordance with aspects of the present disclosure. The illustrated example includes an image editing apparatus 100, a database 105, a network 110, and a user interface 115.
In one example, a user provides a design including text and an image to the image editing apparatus 100 via the user interface 115. The system then generates a noisy image based on the image and text as input to the machine learning model. The machine learning model uses the noisy image as a condition to generate a new background image. For example, the noisy image may include noise concentrated in one or more areas beneath the text, and the machine learning model may reproduce the non-noisy portion of the image with minimal change while introducing contrast in the noisy portion. In some cases, one or more components or aspects of the image editing apparatus 100, such as model parameters, reference images, etc., are stored on the database 105, and such information is exchanged between the image editing apparatus 100 and the database 105 via the network 110. The image editing apparatus 100 then provides the newly generated image to the user through the user interface 115.
In some examples, one or more components of image editing apparatus 100 are implemented on a server. The server provides one or more functions to users linked through one or more of the various networks 110. In some cases, the server comprises a single microprocessor board that includes a microprocessor responsible for controlling all aspects of the server. In some cases, the server uses microprocessors and protocols to exchange data with other devices/users on one or more of the networks 110 via hypertext transfer protocol (HTTP) and Simple Mail Transfer Protocol (SMTP), although other protocols such as File Transfer Protocol (FTP) and Simple Network Management Protocol (SNMP) may also be used. In some cases, the server is configured to send and receive hypertext markup language (HTML) formatted files (e.g., for displaying web pages). In various embodiments, the server comprises a general purpose computing device, a personal computer, a laptop computer, a mainframe computer, a supercomputer, or any other suitable processing apparatus.
The data used by the image editing system includes generative machine learning models, training data, cached images, fonts, design elements, and the like. In some cases, database 105 includes a server that stores data and manages payment for data and content. A database is an organized collection of data. For example, a database stores data in a specified format known as a schema. The database may be structured as a single database, a distributed database, multiple distributed databases, or an emergency backup database. In some cases, a database controller may manage the storage and processing of data in database 105. In some cases, a user interacts with the database controller. In other cases, the database controller may operate automatically without user interaction.
The network 110 facilitates the transfer of information between the user, the database 105, and the image editing apparatus 100. Network 110 may be referred to as a "cloud". A cloud is a computer network configured to provide on-demand availability of computer system resources (e.g., data storage and computing power). In some examples, the cloud provides resources without active management by the user. The term "cloud" is sometimes used to describe data centers available to many users over the internet. Some large cloud networks have functions distributed over multiple locations from central servers. A server is designated an edge server if it has a direct or close connection to a user. In some cases, a cloud is limited to a single organization. In other examples, the cloud is available to many organizations. In one example, a cloud includes a multi-layer communications network comprising multiple edge routers and core routers. In another example, a cloud is based on a local collection of switches in a single physical location.
According to some aspects, the image editing apparatus 100 obtains an image and text overlapping the image, wherein the text includes a first color. In some examples, the image editing apparatus 100 superimposes text on the modified image to obtain the composite image. In some cases, overlaying text on the modified image includes combining the text and the modified image into a single image by taking an original text image and overlaying it on the modified image, thereby producing a composite image in which both the text and the background are visible. For example, the overlay process may create a mask for the text and mix it with the modified image so that the text appears seamlessly integrated with the background. The result is a new image in which text is superimposed on the modified image.
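As an illustration of this compositing step (a minimal sketch, assuming the rendered text is available as a transparent RGBA layer of the same size as the background; the file names are placeholders), alpha compositing with Pillow looks like this:

```python
# Hedged sketch: alpha-composite a rendered text layer over the modified
# background so the text's design features are left untouched.
from PIL import Image

background = Image.open("modified_background.png").convert("RGBA")
text_layer = Image.open("text_layer.png").convert("RGBA")   # transparent except for the text

composite = Image.alpha_composite(background, text_layer)   # text alpha drives the blend
composite.convert("RGB").save("final_design.png")
```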
According to some aspects, the image editing apparatus 100 includes a non-transitory computer readable medium storing code configured to perform the methods described herein. The image editing apparatus 100 is an example of or includes aspects of the respective elements described with reference to fig. 2.
Fig. 2 illustrates an example of an image editing apparatus 200 according to aspects of the present disclosure. The illustrated example includes an image editing apparatus 200, a contrast color extractor 205, a segmentation component 210, a superpixel component 215, a noise component 220, a masking component 225, a machine learning model 230, and a combining component 235. The image editing apparatus 200 is an example of or includes aspects of the respective elements described with reference to fig. 1.
Embodiments of the image editing apparatus 200 include several components. The term 'component' is used to partition the functionality, implemented by a processor and executable instructions included in a computing device (such as the computing device described with reference to fig. 13), used to implement the image editing apparatus 200. The partitioning may be implemented physically, for example, by using separate circuits or processors for each component, or may be implemented logically via the architecture of the processor-executable code.
One or more components of the image editing apparatus 200 use a trained model. In one example, at least the machine learning model 230 includes a trained model, but the disclosure is not necessarily limited thereto. The machine learning model may include an Artificial Neural Network (ANN). An ANN is a hardware or software component that includes a number of connected nodes (i.e., artificial neurons) that loosely correspond to the neurons in the human brain. Each connection, or edge, transmits a signal from one node to another (like a physical synapse in the brain). When a node receives a signal, it processes the signal and then transmits the processed signal to other connected nodes. In some cases, the signals between nodes comprise real numbers, and the output of each node is computed as a function of the sum of its inputs. In some examples, nodes may determine their output using other mathematical algorithms (e.g., selecting the maximum from the inputs as the output) or any other suitable algorithm for activating the node. Each node and edge is associated with one or more node weights that determine how the signal is processed and transmitted.
During the training process, these weights are adjusted to improve the accuracy of the results (i.e., by minimizing the loss function corresponding to the difference between the current result and the target result). The weight of the edge increases or decreases the strength of the signal transmitted between the nodes. In some cases, the node has a threshold below which signals are not transmitted at all. In some examples, nodes are aggregated into a layer. Different layers perform different transformations on their inputs. The initial layer is called the input layer and the last layer is called the output layer. In some cases, the signal may pass through one or more layers multiple times.
In some embodiments, the machine learning model 230 includes a Convolutional Neural Network (CNN). A CNN is a class of neural network commonly used in computer vision or image classification systems. In some cases, a CNN may enable processing of digital images with minimal pre-processing. A CNN may be characterized by the use of convolutional (or cross-correlation) hidden layers. These layers apply a convolution operation to the input before passing the result to the next layer. Each convolutional node may process data for a limited field of input (i.e., a receptive field). During a forward pass of the CNN, filters at each layer may be convolved across the input volume, computing the dot product between the filter and the input. During training, the filters may be modified so that they activate when they detect a particular feature in the input.
According to some aspects, the machine learning model 230 generates a modified image for the text based on the second color, wherein the modified image includes the second color in the region corresponding to the text. The area corresponding to the text may be an image area containing the text. For example, the region may be obtained by cropping the image based on the location and size of the text. In some aspects, the machine learning model 230 includes a generative diffusion model. Additional details regarding the generative diffusion model will be provided with reference to fig. 9-12.
The contrast color extractor 205 is a component or body of instructions configured to extract, from the input image, a color that contrasts with the input text. According to some aspects, the contrast color extractor 205 selects a second color that contrasts with the first color. In some examples, the contrast color extractor 205 generates a palette based on the image region overlapping the text, wherein the second color is selected from the palette. A method for extracting a contrasting color is described with reference to fig. 3.
The segmentation component 210 is configured to perform panoptic segmentation on the input image. Panoptic segmentation includes semantic segmentation and instance segmentation, and is considered a "unified segmentation" approach. The purpose of panoptic segmentation is to extract, label, and classify objects in an image.
According to some aspects, segmentation component 210 segments an image to identify one or more objects that overlap text, i.e., one or more objects in a text region. In some examples, segmentation component 210 applies a contrast color to one or more objects to obtain a first modified image, wherein the modified image is generated based on the first modified image.
In some examples, segmentation component 210 calculates a probability score for one or more objects that indicates a likelihood that the one or more objects are present. In some examples, the segmentation component 210 determines a low probability of the presence of one or more objects based on the probability score. In this case, the embodiment may continue to process the image according to branch B in the second stage of the first algorithm for preprocessing, as described with reference to fig. 5.
According to some aspects, the superpixel component 215 extracts a set of superpixels from the image region overlapping the text based on the low probability determination, wherein the intermediate processed image includes the set of superpixels. Superpixels are blocks of the original image whose average color lies outside a predetermined range in a color space (e.g., HSV). In one example, when the system cannot confidently color objects within the area overlapping the text, the system may instead paste a texture composed of superpixels into the area, and then add noise to the area to generate a noisy image as input to the machine learning model 230. Additional details regarding this process will be provided with reference to fig. 5.
The noise component 220 is configured to generate noise information in, for example, a pixel space. Noise component 220 may generate noise according to, for example, a gaussian function. According to some aspects, the noise component 220 adds noise to the image in the region corresponding to the text to obtain a noisy image, wherein the modified image is generated based on the noisy image. In some aspects, at least a portion of the noise comprises colored noise corresponding to a second color.
According to some aspects, the masking component 225 generates a mask that indicates the area corresponding to the text, wherein noise is added to the image based on the mask. The masking component 225 may use the size, shape, orientation, location, or other information from the text to generate the mask. In some cases, masking component 225 receives noise information from noise component 220 prior to generating the mask.
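A minimal sketch of such a masking step is shown below, assuming a single axis-aligned text bounding box, NumPy/SciPy for the implementation, and illustrative padding and feathering values:

```python
# Turn text geometry into a soft region mask that later controls where noise is added.
import numpy as np
from scipy.ndimage import gaussian_filter

def text_region_mask(height, width, bbox, padding=16, feather=8.0):
    """bbox = (x0, y0, x1, y1) in pixels; returns a float mask in [0, 1]."""
    x0, y0, x1, y1 = bbox
    mask = np.zeros((height, width), dtype=np.float32)
    mask[max(0, y0 - padding):min(height, y1 + padding),
         max(0, x0 - padding):min(width, x1 + padding)] = 1.0
    return gaussian_filter(mask, sigma=feather)   # feather the edges of the region

mask = text_region_mask(512, 512, bbox=(96, 200, 416, 280))
```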
In at least one embodiment, a subset region of the input image, rather than the entire image, is processed for input to the machine learning model 230. In this case, the machine learning model 230 performs "inpainting" by generating a modified image that is smaller than the input image and has the dimensions of the subset region. According to some aspects, the combining component 235 combines the image and the modified image to obtain a combined image.
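A combining step of this kind can be as simple as pasting the regenerated crop back at its original location; the sketch below (hypothetical helper, Pillow-based) illustrates the idea:

```python
# Paste the regenerated (inpainted) region back into the original image.
from PIL import Image

def combine(original: Image.Image, modified_region: Image.Image, box):
    """box = (x0, y0): upper-left corner where the regenerated crop belongs."""
    combined = original.copy()
    combined.paste(modified_region, box)
    return combined
```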
One or more embodiments of the above system include a non-transitory computer readable medium storing code comprising instructions executable by a processor to obtain an image and text overlaying the image, wherein the text comprises a first color; selecting a second color from the image area overlapping the text, wherein the second color contrasts with the first color; and generating a modified image of the text based on the second color using the machine learning model, wherein the modified image includes the second color in a region corresponding to the text.
Some examples of the non-transitory computer-readable medium further include code executable to segment the image to identify one or more objects that overlap with the text. Some examples also include code executable to apply a second color to the one or more objects to obtain a first modified image, wherein the modified image is generated based on the first modified image.
Some examples of the non-transitory computer-readable medium further include code executable to calculate a probability score for the one or more objects, the probability score indicating a likelihood of presence of the one or more objects. Some examples also include code executable to determine a low probability of presence of one or more objects based on the probability score. Some examples also include code executable to extract a plurality of superpixels from a region of an image overlapping text based on the determination, wherein the first modified image includes the plurality of superpixels.
Some examples of the non-transitory computer-readable medium further include executable code to add noise to the image in the region corresponding to the text to obtain a noisy image, wherein the modified image is generated based on the noisy image.
Some examples also include code executable to combine the image and the modified image to obtain a combined image. Some examples also include code executable to superimpose text on the modified image to obtain a composite image.
Generating a design
A method for reconciling text and background images is described. One or more aspects of the method include obtaining an image and text overlaying the image, wherein the text includes a first color; selecting a second color that contrasts with the first color; and generating a modified image of the text based on the second color using the machine learning model, wherein the modified image includes the second color in a region corresponding to the text.
Some examples of the method, apparatus, non-transitory computer-readable medium, and system further include generating a palette based on the image region overlapping the text, wherein the second color is selected from the palette. Some examples of the method, apparatus, non-transitory computer-readable medium, and system further include segmenting the image to identify one or more objects that overlap with the text. Some examples also include applying a second color to the one or more objects to obtain a first modified image, wherein the modified image is generated based on the first modified image.
Some examples also include adding noise to the image in the region corresponding to the text to obtain a noisy image, wherein the modified image is generated based on the noisy image. Some examples also include generating a mask indicating an area corresponding to the text, wherein noise is added to the image based on the mask. In some aspects, at least a portion of the noise comprises colored noise corresponding to a second color.
Some examples of the method, apparatus, non-transitory computer readable medium, and system further include combining the image and the modified image to obtain a combined image. For example, the modified image may correspond to a subset area of the image and may be combined with the image to obtain a combined image that is used as the modified image in the final design. Some example systems also include overlaying text on the modified image to obtain a composite image.
The present disclosure provides two main algorithms that the system is configured to execute on an input design, but it is understood that the sub-steps may be combined in different ways to generate variations of the algorithms described herein. Fig. 3 illustrates an example of a method 300 for a first stage in a first algorithm for image coordination in accordance with aspects of the present disclosure. The first phase may be referred to as a "preprocessing" phase or portion of the algorithm. In some examples, these operations are performed by a system comprising a processor that executes a set of codes to control the functional elements of a device. Additionally or alternatively, one or more processes are performed using dedicated hardware. In general, these operations are performed according to the methods and processes described in accordance with aspects of the present disclosure. In some cases, the operations described herein consist of various sub-steps, or are performed in conjunction with other operations.
In operation 305, the system obtains an image with blurred text. For example, a user may provide images and blurred text to the system via a user interface such as a Graphical User Interface (GUI). The GUI may be a component of an illustration program or a web application (web-app).
At operation 310, the system identifies the region(s) of the input image corresponding to the text. The region(s) include the area covered by the text and may also include padding that extends beyond the margins of the text. The region corresponding to the text may be an image region containing the text. For example, the region may be obtained by cropping the image based on the location and size of the text. In some cases, the region corresponding to the text also includes portions of the image that are not under or covered by the text.
At operation 315, the system creates a palette of dominant colors from the region(s). The operation of this step refers to, or may be performed by, the contrast color extractor described with reference to fig. 2. In some embodiments, the colors are extracted using a clustering algorithm in the Lab space of the image. Lab space is a type of color space in which 'L' is the spectrum of lightness values, 'a' is the spectrum of red/green values, and 'b' is the spectrum of blue/yellow values. The clustering algorithm groups similar colors according to their lightness, red-green, and blue-yellow values in the Lab color space. Once the colors are grouped into clusters, the dominant colors are extracted from the clusters.
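The sketch below illustrates one way such a palette could be built (an assumption, not the patent's exact implementation), using scikit-image for the Lab conversion and scikit-learn's KMeans for the clustering:

```python
# Cluster the pixels under the text in Lab space and return the cluster centers,
# ordered from most to least dominant, as an RGB palette.
import numpy as np
from skimage.color import rgb2lab, lab2rgb
from sklearn.cluster import KMeans

def dominant_palette(region_rgb, n_colors=5):
    """region_rgb: HxWx3 uint8 crop of the image under the text."""
    lab = rgb2lab(region_rgb).reshape(-1, 3)
    km = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(lab)
    order = np.argsort(-np.bincount(km.labels_))       # biggest clusters first
    centers_lab = km.cluster_centers_[order]
    return lab2rgb(centers_lab[np.newaxis, :, :])[0]   # RGB floats in [0, 1]
```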
In some embodiments, the dominant colors in the region(s) are extracted and ordered according to a contrast ratio R. Equation 1 provides an example of R:

R = (L_1 + 0.05) / (L_2 + 0.05)        (1)

where L_1 is the relative luminance of the lighter of the foreground and background colors and L_2 is the relative luminance of the darker of the foreground and background colors. Relative luminance is the luminance of a color measured relative to the luminance of a reference color. In one example, L_1 is the luminance of the lighter color in the image relative to the luminance of the reference color, and L_2 is the luminance of the darker color in the image relative to the luminance of the reference color.
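For illustration, the snippet below computes this ratio using the WCAG 2.x definition of relative luminance (an assumption; the patent does not fix a particular luminance formula):

```python
# Relative luminance of an sRGB color and the contrast ratio of Equation 1.
def relative_luminance(rgb):
    """rgb as floats in [0, 1] (sRGB)."""
    def linearize(c):
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(color_a, color_b):
    la, lb = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)

print(contrast_ratio((1.0, 1.0, 1.0), (0.0, 0.0, 0.0)))  # 21.0 for black on white
```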
At operation 320, the system selects the color C having the highest contrast relative to the text color. The color C may be selected from the palette generated through the above-described process. If no color in the region(s) meets a threshold contrast ratio R, the embodiment may select a contrasting color directly from the Lab space.
At operation 325, the system performs panoptic segmentation on the image. The purpose of panoptic segmentation is to extract, label, and classify objects in the image. In some cases, the panoptic segmentation produces one or more additional images that are segmentation masks. In some cases, operation 325 includes performing simultaneous and unified segmentation of the background (e.g., surroundings such as sky, grass, or sidewalk) and objects (e.g., instances such as a person, car, or building). For example, operation 325 performs simultaneous and unified segmentation of the background and the objects. In some cases, operation 325 includes analyzing the image and generating a label map that partitions the image into a plurality of regions, each region having a corresponding category label.
At operation 330, the system determines a probability score for one or more objects in the image that overlap the region(s). The probability scores are generated by the panoptic segmentation operation and indicate the confidence of the segmentation for each object. The probability score may be lower for objects with blurred edges, objects with colors similar to background elements, and the like. In some cases, the result of the determination changes the logical path of the algorithm in its second stage.
At operation 335, the system determines that the sum of the probability scores exceeds a threshold and proceeds to path A, described with reference to fig. 4. If the system determines that the sum of the probability scores does not meet the threshold, the system proceeds to path B at operation 340. Path B is described with reference to fig. 5.
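The branching decision itself reduces to a simple check over the per-object confidence scores; the sketch below is illustrative only (the segmentation model, the score aggregation, and the threshold value are all assumptions):

```python
# Decide between branch A (recolor segmented objects) and branch B (contrast
# superpixels) from the panoptic segmentation scores of objects under the text.
SCORE_THRESHOLD = 1.0  # assumed value for illustration

def choose_branch(segments):
    """segments: iterable of (mask, probability_score) for objects in the text region."""
    total = sum(score for _, score in segments)
    return "A" if total > SCORE_THRESHOLD else "B"
```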
Fig. 4 illustrates an example of a method 400 for branch A in the second stage of the first algorithm in accordance with aspects of the present disclosure. In some examples, these operations are performed by a system comprising a processor that executes a set of codes to control the functional elements of a device. Additionally or alternatively, one or more processes are performed using dedicated hardware. In general, these operations are performed according to the methods and processes described in accordance with aspects of the present disclosure. In some cases, the operations described herein consist of various sub-steps, or are performed in conjunction with other operations.
At operation 405, the system colors the objects overlapping the region(s) with color C based on the segmentation masks. The operation of this step refers to, or may be performed by, the segmentation component described with reference to fig. 2. In one example, the system iterates through the objects in the region(s) identified by the panoptic segmentation and determines whether each object has an incompatible dominant color. For each object having an incompatible dominant color, the system may then apply the color C, which contrasts with the text color, to that object. Note that coloring objects in this way may cause the image to look unnatural; however, this is not the final output of the algorithm.
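A minimal sketch of this recoloring (the blend weight is an assumption; the patent does not specify one) is:

```python
# Push every segmented object's pixels toward the contrasting color C.
import numpy as np

def recolor_objects(image, masks, color_c, blend=0.8):
    """image: HxWx3 floats in [0, 1]; masks: list of HxW boolean arrays; color_c: RGB floats."""
    out = image.copy()
    c = np.asarray(color_c, dtype=np.float32)
    for mask in masks:
        out[mask] = (1.0 - blend) * out[mask] + blend * c
    return out
```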
At operation 410, the system generates a Gaussian mask corresponding to the region(s), blurred by Gaussian noise. The operation of this step refers to, or may be performed by, the masking component described with reference to fig. 2. In an example, the region(s) may be determined by one or more bounding boxes of the text. The system adds Gaussian noise to the bounding boxes to create a mask similar to that shown in the figure.
At operation 415, the system adds color C to the Gaussian mask to create a noisy image. The operation of this step refers to, or may be performed by, the noise component described with reference to fig. 2. In some embodiments, the operation includes applying color C to the Gaussian mask and combining the colored Gaussian mask with the image to create a noisy image, as illustrated next to operation 415. Some embodiments also add fractal noise in areas beneath the text region(s), or on surfaces or edges originally present in these areas. The fractal noise causes the generative machine learning model to add fine detail in these areas.
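Operations 410-415 can be sketched as blending color-C-tinted Gaussian noise into the image under a soft region mask (noise strength and blending are illustrative assumptions):

```python
# Create the noisy conditioning image: colored Gaussian noise inside the text-region mask.
import numpy as np

def add_colored_noise(image, mask, color_c, noise_sigma=0.25, rng=None):
    """image: HxWx3 floats in [0, 1]; mask: HxW floats in [0, 1]; color_c: RGB floats."""
    rng = rng or np.random.default_rng(0)
    noise = rng.normal(0.0, noise_sigma, size=image.shape)       # white Gaussian noise
    colored = np.clip(np.asarray(color_c) + noise, 0.0, 1.0)     # noise centered on color C
    m = mask[..., None]
    return (1.0 - m) * image + m * colored                       # noisy only under the text
```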
At operation 420, the system generates a new modified image using the noisy image as a condition for the generative diffusion model and combines the original text with the new modified image. In some cases, this completes the first algorithm and the final design is provided to the user through the user interface. The generative diffusion model may be the machine learning model described with reference to fig. 2. Additional details regarding the generative diffusion model will be provided with reference to fig. 9-12.
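The patent's generative model is described with reference to fig. 9-12; purely as an illustration of the conditioning pattern (not the patent's implementation), an off-the-shelf image-to-image diffusion pipeline from the diffusers library can take the noisy image as its starting point and rebuild detail around it:

```python
# Illustration only: condition an image-to-image Stable Diffusion pipeline on the
# noisy image produced in operation 415. Checkpoint id and prompt are placeholders.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-checkpoint-id",   # placeholder: any compatible SD checkpoint
    torch_dtype=torch.float16,
).to("cuda")

noisy_image = Image.open("noisy_image.png").convert("RGB")
result = pipe(
    prompt="clean background, soft gradient",   # placeholder guidance text
    image=noisy_image,
    strength=0.6,            # how far the model may depart from the conditioning image
    guidance_scale=7.5,
).images[0]
result.save("modified_image.png")
```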
As described with reference to fig. 3, the image editing system may determine that the panoptic segmentation was unsuccessful based on the aggregation of the probability scores. In this case, for the second stage of the first algorithm, the system proceeds to branch B.
Fig. 5 illustrates an example of a method 500 for branch B in the second stage of the first algorithm in accordance with aspects of the present disclosure. In some examples, these operations are performed by a system comprising a processor that executes a set of codes to control the functional elements of a device. Additionally or alternatively, one or more processes are performed using dedicated hardware. In general, these operations are performed according to the methods and processes described in accordance with aspects of the present disclosure. In some cases, the operations described herein consist of various sub-steps, or are performed in conjunction with other operations.
At operation 505, the system extracts contrasting superpixels, including color C, from the input image. The operation of this step refers to, or may be performed by, the superpixel component described with reference to fig. 2. Superpixels are blocks or sub-regions of the original image that have an average color outside a certain range in HSV space. The range may be determined by the color of the input text and may include three separate ranges for hue, saturation, and value. In some cases, when no superpixel can be extracted within the range, the range may be narrowed and another search for superpixels performed. The search may be performed using, for example, a sliding-window process that examines sub-regions of a predetermined size and computes their average color. In some cases, the system may create superpixels with a chosen color. For example, if, after narrowing the range, the system still does not find a suitable superpixel, the system may create its own superpixel filled with the color C.
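One possible form of this search (window size and distance threshold are assumptions for illustration) is a sliding window over the HSV image:

```python
# Collect patches whose mean HSV color is far enough from the text color to serve
# as contrasting superpixels.
import numpy as np
from skimage.color import rgb2hsv

def contrast_superpixels(image_rgb, text_hsv, size=32, stride=32, min_dist=0.35):
    """image_rgb: HxWx3 floats in [0, 1]; text_hsv: (h, s, v) of the text color in [0, 1]."""
    hsv = rgb2hsv(image_rgb)
    patches = []
    for y in range(0, hsv.shape[0] - size + 1, stride):
        for x in range(0, hsv.shape[1] - size + 1, stride):
            patch_mean = hsv[y:y + size, x:x + size].mean(axis=(0, 1))
            if np.linalg.norm(patch_mean - np.asarray(text_hsv)) > min_dist:
                patches.append(image_rgb[y:y + size, x:x + size])
    return patches   # later tiled into a texture and pasted into the text region(s)
```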
At operation 510, the system combines (e.g., tessellates) the superpixels into a texture and pastes the texture into the region(s). The operation of this step may also refer to, or be performed by, the superpixel component described with reference to fig. 2. After this operation, the intermediate image may appear similar to the image shown next to operation 510 in fig. 5.
At operation 515, the system generates a Gaussian mask corresponding to the region(s), blurred by Gaussian noise. The operation of this step refers to, or may be performed by, the masking component described with reference to fig. 2. In an example, the region(s) may be determined by one or more bounding boxes of the text. The system adds Gaussian noise to the bounding boxes to create a mask similar to that shown in the figure.
At operation 520, the system blurs the texture using the Gaussian mask to create a noisy image. This step may include combining the Gaussian mask with the intermediate image.
At operation 525, the system generates a new modified image using the noisy image as a condition for the generative diffusion model and combines the original text with the new modified image. In some cases, this completes the first algorithm and the final design is provided to the user through the user interface. The generative diffusion model may be the machine learning model described with reference to fig. 2. Additional details regarding the generative diffusion model will be provided with reference to fig. 9-12.
In some cases, the user wishes to generate an entirely new image, i.e., an image without objects or features from a previous image, to be used as the modified image in a design with text. In this case, embodiments are further configured to execute a second algorithm that uses pure noise and colored noise, rather than a noisy image derived from a starting image, as the basis for generating the new background.
Fig. 6 illustrates an example of a method 600 for a second algorithm for image coordination in accordance with aspects of the present disclosure. In some examples, these operations are performed by a system comprising a processor that executes a set of codes to control the functional elements of a device. Additionally or alternatively, one or more processes are performed using dedicated hardware. In general, these operations are performed according to the methods and processes described in accordance with aspects of the present disclosure. In some cases, the operations described herein consist of various sub-steps, or are performed in conjunction with other operations.
In operation 605, the system receives a design with text. For example, a user may provide a design to the system via a user interface, such as a Graphical User Interface (GUI). In some cases, the design includes a starting image. In some cases, the design does not include a starting image, for example, as shown in fig. 6.
At operation 610, the system generates a noise-only image and adds, in the text region(s), additional noise that includes a color contrasting with the text. The operation of this step refers to, or may be performed by, the noise component described with reference to fig. 2. For example, the system may create an image with a pre-configured aspect ratio and resolution consisting of pure white noise. The system may then determine a color that contrasts with the color of the text, for example, based on the method described with reference to fig. 3. The system may then add additional noise including the contrasting color in the text region, such as within the bounding box of the text.
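A minimal sketch of this noise-only canvas (resolution, noise levels, bounding box, and color are illustrative values, not taken from the patent) is:

```python
# Pure white Gaussian noise everywhere, plus color-C-tinted noise inside the text box.
import numpy as np

rng = np.random.default_rng(0)
H, W = 512, 768
canvas = rng.normal(0.5, 0.3, size=(H, W, 3))                 # white Gaussian noise

x0, y0, x1, y1 = 128, 200, 640, 312                           # assumed text bounding box
color_c = np.array([0.95, 0.85, 0.10])                        # color contrasting with the text
canvas[y0:y1, x0:x1] = 0.5 * canvas[y0:y1, x0:x1] + 0.5 * (
    color_c + rng.normal(0.0, 0.15, size=(y1 - y0, x1 - x0, 3))
)
noisy_image = np.clip(canvas, 0.0, 1.0)                       # condition for the diffusion model
```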
At operation 615, the system generates a modified image using the noisy image as a condition for the generative diffusion model. The generative diffusion model may be the machine learning model described with reference to fig. 2. Additional details regarding the generative diffusion model will be provided with reference to fig. 9-12.
At operation 620, the system combines the original text with the modified image to generate a final design. In some cases, this step completes the second algorithm. The system may then present the final design to the user via the user interface.
FIG. 7 represents a use case of an embodiment described herein and shows an example of a method 700 for providing a design to a user in accordance with aspects of the present disclosure. In some examples, these operations are performed by a system comprising a processor that executes a set of codes to control the functional elements of a device. Additionally or alternatively, one or more processes are performed using dedicated hardware. In general, these operations are performed according to the methods and processes described in accordance with aspects of the present disclosure. In some cases, the operations described herein consist of various sub-steps, or are performed in conjunction with other operations.
In operation 705, a user provides an image having text. For example, a user may provide images and text to the system via a user interface, such as a Graphical User Interface (GUI). The GUI may be a component of an illustration program or a web application.
At operation 710, the system identifies an area containing the text. This information may be known or cached in the bounding box of the text.
At operation 715, the system applies noise and a color that contrasts with text in the region. For example, the system may apply noise using the object coloring method described with reference to fig. 4 or the super pixel method described with reference to fig. 5.
At operation 720, the system generates a modified image having contrasting colors in the region. The system may perform this operation using a generative machine learning model. Additional details regarding the generative diffusion model will be provided with reference to fig. 9-12.
Fig. 8 illustrates an example of a method 800 for image editing in accordance with aspects of the present disclosure. The process of increasing compatibility between the underlying image and text is sometimes referred to as "image coordination (image harmonization)". In some examples, operations are performed by a system comprising a processor executing a set of codes to control the functional elements of an apparatus. Additionally or alternatively, one or more processes are performed using dedicated hardware. In general, these operations are performed according to the methods and processes described in accordance with aspects of the present disclosure. In some cases, the operations described herein consist of various sub-steps, or are performed in conjunction with other operations.
At operation 805, the system obtains an image and text overlaying the image, wherein the text includes a first color. In some cases, the operation of this step refers to, or may be performed by, the image editing apparatus described with reference to fig. 1 and 2. For example, a user may provide images and text to the system via a user interface, such as a Graphical User Interface (GUI). The GUI may be a component of an illustration program or a web application.
At operation 810, the system selects a second color that contrasts with the first color. In some cases, the operation of this step involves, or may be performed by, a contrast color extractor as described with reference to fig. 2. The system may determine the second color according to the method described with reference to fig. 3. For example, the system may use a color from the image to determine a second color that has sufficient contrast with the color of the text, as measured by the ratio of the brightness between the two colors. The brightness may be determined after conversion from one color space such as HSV or RGB space to Lab space.
At operation 815, the system generates a modified image for the text based on the second color using the machine learning model, wherein the modified image includes the second color in the region corresponding to the text. The machine learning model may be a generative machine learning model, such as a Stable Diffusion model, and additional details thereof will be provided with reference to fig. 9-12.
Generative machine learning model
Fig. 9 illustrates an example of a guided latent diffusion model 900 according to aspects of the present disclosure. The guided latent diffusion model 900 depicted in fig. 9 is an example of, or includes aspects of, the machine learning model described with reference to fig. 2. The illustrated example includes a guided latent diffusion model 900, an original image 905, a pixel space 910, an image encoder 915, original image features 920, a latent space 925, a forward diffusion process 930, noise features 935, a reverse diffusion process 940, denoised image features 945, an image decoder 950, an output image 955, a guidance prompt 960, a multimodal encoder 965, guidance features 970, and a guidance space 975. The original image 905, the forward diffusion process 930, and the reverse diffusion process 940 are examples of, or include aspects of, the corresponding elements described with reference to fig. 11.
Diffusion models are a class of generative neural networks that can be trained to generate new data having characteristics similar to those in the training data. In particular, a diffusion model may be used to generate new images. Diffusion models may be used for various image generation tasks including image super-resolution, image generation with perceptual metrics, conditional generation (e.g., text-guided generation), image restoration, and image processing.
Types of diffusion models include Denoising Diffusion Probabilistic Models (DDPMs) and Denoising Diffusion Implicit Models (DDIMs). In a DDPM, the generative process includes reversing a stochastic Markov diffusion process. A DDIM, on the other hand, uses a deterministic process, so that the same input produces the same output. Diffusion models may also be characterized by whether noise is added to the image itself or to image features generated by an encoder (i.e., latent diffusion).
Diffusion models work by iteratively adding noise to the data during a forward process and then learning to recover the data by denoising it during a reverse process. For example, during training, the guided latent diffusion model 900 may take an original image 905 in a pixel space 910 as input and apply an image encoder 915 to convert the original image 905 into original image features 920 in a latent space 925. The forward diffusion process 930 then gradually adds noise to the original image features 920 to obtain noise features 935 (also in the latent space 925) at various noise levels.
Next, the reverse diffusion process 940 (e.g., a U-Net ANN) gradually removes the noise from the noise features 935 at the various noise levels to obtain denoised image features 945 in the latent space 925. In some examples, the denoised image features 945 are compared to the original image features 920 at each of the various noise levels, and the parameters of the reverse diffusion process 940 of the diffusion model are updated based on the comparison. Finally, the image decoder 950 decodes the denoised image features 945 to obtain, in the pixel space 910, an approximation of the noise present in the input image. The output image 955 is then obtained by subtracting a controllable portion of the predicted noise from the input image. In some cases, an output image 955 is created at each of the various noise levels. The output image 955 may be compared to the original image 905 to train the reverse diffusion process 940.
In some cases, the image encoder 915 and the image decoder 950 are pre-trained prior to training the reverse diffusion process 940. In some examples, the image encoder 915 and the image decoder 950 are trained jointly, or are pre-trained and then fine-tuned in conjunction with the reverse diffusion process 940.
The reverse diffusion process 940 may also be guided based on a guidance prompt 960, such as an image, a layout, a segmentation map, and the like. The guidance prompt 960 may be encoded using the multimodal encoder 965 to obtain guidance features 970 in a guidance space 975. The guidance features 970 may be combined with the noise features 935 at one or more layers of the reverse diffusion process 940 to ensure that the output image 955 includes the content described by the guidance prompt 960. For example, when a new modified image is generated, the new modified image will contain features from the original image because the noisy image provided as the guidance prompt 960 is based on the original image. The guidance features 970 may be combined with the noise features 935 using cross-attention blocks within the reverse diffusion process 940.
Fig. 10 illustrates an example of a U-Net 1000 architecture in accordance with aspects of the disclosure. The illustrated example includes a U-Net 1000, input features 1005, an initial neural network layer 1010, intermediate features 1015, a downsampling layer 1020, downsampled features 1025, an upsampling process 1030, upsampled features 1035, a skip connection 1040, a final neural network layer 1045, and output features 1050. The U-Net 1000 depicted in fig. 10 is an example of, or includes aspects of, the architecture used within the reverse diffusion process described with reference to fig. 9.
In some examples, the diffusion model is based on a neural network architecture known as a U-Net. The U-Net 1000 takes input features 1005 having an initial resolution and an initial number of channels and processes the input features 1005 using an initial neural network layer 1010 (e.g., a convolutional layer) to produce intermediate features 1015. The intermediate features 1015 are then downsampled using the downsampling layer 1020 such that the downsampled features 1025 have a resolution lower than the initial resolution and a number of channels greater than the initial number of channels.
This process is repeated a number of times, and then the process is reversed. That is, the downsampled features 1025 are upsampled using the upsampling process 1030 to obtain upsampled features 1035. The upsampled features 1035 may be combined with intermediate features 1015 having the same resolution and number of channels via the skip connection 1040. These inputs are processed using the final neural network layer 1045 to produce the output features 1050. In some cases, the output features 1050 have the same resolution as the initial resolution and the same number of channels as the initial number of channels.
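A toy example of this down/up-sampling path with a skip connection is sketched below; it uses a single downsampling stage and omits timestep conditioning, so it is a minimal illustration under those simplifying assumptions rather than the U-Net 1000 itself.

```python
import torch
from torch import nn
import torch.nn.functional as F

class TinyUNet(nn.Module):
    def __init__(self, in_ch: int = 4, base_ch: int = 64):
        super().__init__()
        self.initial = nn.Conv2d(in_ch, base_ch, 3, padding=1)                # initial layer 1010
        self.down = nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1)   # downsampling layer 1020
        self.up = nn.ConvTranspose2d(base_ch * 2, base_ch, 2, stride=2)       # upsampling process 1030
        self.final = nn.Conv2d(base_ch * 2, in_ch, 3, padding=1)              # final layer 1045

    def forward(self, x):
        mid = F.silu(self.initial(x))          # intermediate features 1015
        down = F.silu(self.down(mid))          # lower resolution, more channels
        up = F.silu(self.up(down))             # back to the initial resolution
        skip = torch.cat([up, mid], dim=1)     # skip connection 1040
        return self.final(skip)                # output features 1050

out = TinyUNet()(torch.randn(1, 4, 32, 32))    # same spatial size and channel count as the input
```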
In some cases, the U-Net 1000 takes additional input features to produce a conditionally generated output. For example, the additional input features may include a vector representation of an input prompt. The input prompt may be a text prompt or a noisy image as described above with reference to figs. 3-6. The additional input features may be combined with the intermediate features 1015 at one or more layers within the neural network. For example, a cross-attention module may be used to combine the additional input features with the intermediate features 1015.
Fig. 11 illustrates an example of a diffusion process 1100 in accordance with aspects of the present disclosure. As described above with reference to fig. 9, the diffusion model may include a forward diffusion process 1105 for adding noise to an image (or to features in latent space) and a back diffusion process 1110 for denoising the image (or features) to obtain a denoised image. The forward diffusion process 1105 may be denoted $q(x_t \mid x_{t-1})$ and the back diffusion process 1110 may be denoted $p(x_{t-1} \mid x_t)$. In some cases, the forward diffusion process 1105 is used during training to generate images with successively greater noise, and a neural network is trained to perform the back diffusion process 1110 (i.e., to successively remove the noise).
In an example forward process for a latent diffusion model, the model maps an observed variable $x_0$ (in pixel space or latent space) to intermediate variables $x_1, \ldots, x_T$ using a Markov chain. The Markov chain gradually adds Gaussian noise to the data to obtain the approximate posterior $q(x_{1:T} \mid x_0)$ as the latent variables are passed through a neural network such as a U-Net, where $x_1, \ldots, x_T$ have the same dimensionality as $x_0$.
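For concreteness, the forward transitions can be written in the standard DDPM form (stated here as common background, with the variance schedule $\beta_t$ as a notational assumption rather than a quotation from this disclosure):

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big), \qquad q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big),$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$; the closed form allows $x_t$ to be sampled directly from $x_0$ at any noise level.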
The neural network may be trained to perform the reverse process. During the back diffusion process 1110, the model begins with noisy data $x_T$, such as the noisy image 1115, and denoises the data to obtain $p(x_{t-1} \mid x_t)$. At each step $t-1$, the back diffusion process 1110 takes $x_t$, such as the first intermediate image 1120, and $t$ as input, where $t$ represents a step in the sequence of transitions associated with different noise levels. The back diffusion process 1110 iteratively outputs $x_{t-1}$, such as the second intermediate image 1125, until $x_T$ is recovered back to $x_0$, the original image 1130. The reverse process can be expressed as:
$$p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big) \qquad (1)$$
The joint probability of a sequence of samples in the Markov chain can be written as the product of conditional probabilities and the marginal probability:

$$p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$

where $p(x_T) = \mathcal{N}(x_T;\ 0,\ \mathbf{I})$ is the pure-noise distribution, because the reverse process takes the result of the forward process (i.e., a sample of pure noise) as its input, and $\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$ represents the sequence of Gaussian transitions corresponding to the sequence in which Gaussian noise was added to the sample.
At inference time, observed data $x_0$ in pixel space is mapped into latent space as the input, and the generated data $\tilde{x}$ is mapped back from latent space into pixel space as the output. In some examples, $x_0$ represents an original input image with low image quality, the latent variables $x_1, \ldots, x_T$ represent noisy images, and $\tilde{x}$ represents a generated image with high image quality.
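In the common noise-prediction parameterization (standard in the diffusion literature, not quoted from this disclosure), the learned mean in equation (1) is expressed through a predicted noise term $\epsilon_\theta$, and one reverse step draws

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right), \qquad x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z, \quad z \sim \mathcal{N}(0, \mathbf{I}),$$

where $\sigma_t$ is a fixed or learned variance; a DDIM-style sampler replaces this stochastic step with a deterministic update.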
Fig. 12 illustrates an example of a method 1200 for training a diffusion model in accordance with aspects of the present disclosure. The method 1200 represents an example of training a back diffusion process as described above with reference to fig. 11. In some examples, these operations are performed by a system including a processor executing a set of code to control the functional elements of an apparatus, such as the image editing apparatus described with reference to fig. 2.
Additionally or alternatively, one or more processes of method 1200 may be performed using dedicated hardware. In general, these operations are performed according to the methods and processes described in accordance with aspects of the present disclosure. In some cases, the operations described herein consist of various sub-steps, or are performed in conjunction with other operations.
At operation 1205, the user initializes an untrained model. Initialization may include defining the architecture of the model and establishing initial values for the model parameters. In some cases, initialization may include defining hyperparameters such as the number of layers, the resolution and number of channels of each layer block, the locations of skip connections, and so on.
At operation 1210, the system adds noise to a training image using a forward diffusion process in N stages. In some cases, the forward diffusion process is a fixed process in which Gaussian noise is successively added to the image. In a latent diffusion model, Gaussian noise may instead be successively added to features in latent space.
At operation 1215, starting with stage N, the system predicts the image or image features at stage n-1 using the back diffusion process at each stage n. For example, the back diffusion process may predict the noise that was added by the forward diffusion process, and the predicted noise may be removed from the image to obtain the predicted image. In some cases, the original image is predicted at each stage of the training process.
At operation 1220, the system compares the predicted image (or image features) at stage n-1 with an actual image (or image features), such as the image at stage n-1 or the original input image. For example, given observed data x, the diffusion model may be trained to minimize a variational upper bound on the negative log-likelihood $-\log p_\theta(x)$ of the training data.
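In practice, this variational bound is often replaced by a simplified noise-prediction objective (a standard surrogate in the diffusion literature, given here as background rather than as the claimed method):

$$\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, \mathbf{I}),\ t}\left[\big\lVert \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\rVert^2\right].$$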
At operation 1225, the system updates the parameters of the model based on the comparison. For example, gradient descent may be used to update the parameters of the U-Net. The time-dependent parameters of the Gaussian transitions may also be learned.
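A compact training-step sketch corresponding to operations 1210 through 1225, implementing the simplified objective above, is shown below; the `unet` network, the pre-encoded `latents`, and the `alphas_cumprod` schedule are assumptions of the sketch, not elements of the claimed method.

```python
import torch
import torch.nn.functional as F

def train_step(unet, optimizer, latents, alphas_cumprod):
    """latents: (B, C, H, W) encoded training images; alphas_cumprod: (T,)."""
    B, T = latents.shape[0], alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,))                              # random noise level per sample
    a_t = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(latents)
    noisy = a_t.sqrt() * latents + (1 - a_t).sqrt() * noise    # forward diffusion (operation 1210)
    pred = unet(noisy, t)                                      # predict the added noise (operation 1215)
    loss = F.mse_loss(pred, noise)                             # compare prediction and target (operation 1220)
    optimizer.zero_grad()
    loss.backward()                                            # gradient-based parameter update (operation 1225)
    optimizer.step()
    return loss.item()
```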
Fig. 13 illustrates an example of a computing device 1300 configured to reconcile images with text in accordance with aspects of the present disclosure. The illustrated example includes the computing device 1300, processor(s) 1305, a memory subsystem 1310, a communication interface 1315, an I/O interface 1320, user interface component(s) 1325, and a channel 1330.
In some embodiments, computing device 1300 is an example of or includes aspects of image editing apparatus 100 of fig. 1. In some embodiments, computing device 1300 includes one or more processors 1305 that may execute instructions stored in memory subsystem 1310 to obtain an image and text overlapping the image, wherein the text includes a first color; select a second color that contrasts with the first color; and generate a modified image for the text based on the second color using the machine learning model, wherein the modified image includes the second color in a region corresponding to the text.
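As one hypothetical way to realize the color-selection step described above (the disclosure does not mandate this particular metric), candidate colors can be ranked by a WCAG-style contrast ratio against the first color; the palette values below are illustrative.

```python
def relative_luminance(rgb):
    def channel(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(rgb1, rgb2):
    l1, l2 = sorted((relative_luminance(rgb1), relative_luminance(rgb2)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

def select_second_color(first_color, palette):
    # Pick the palette entry with the highest contrast against the text color.
    return max(palette, key=lambda c: contrast_ratio(first_color, c))

# Example: white text over a palette extracted from the region behind it.
palette = [(30, 30, 30), (200, 180, 160), (90, 60, 40)]
print(select_second_color((255, 255, 255), palette))   # picks the darkest candidate
```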
According to some aspects, the computing device 1300 includes one or more processors 1305. In some cases, the processor is a smart hardware device (e.g., a general purpose processing component, a Digital Signal Processor (DSP), a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a microcontroller, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a programmable logic device, discrete gate or transistor logic components, discrete hardware components, or a combination thereof). In some cases, the processor is configured to operate the memory array using the memory controller. In other cases, the memory controller is integrated into the processor. In some cases, the processor is configured to execute computer-readable instructions stored in the memory to perform various functions. In some embodiments, the processor includes dedicated components for modem processing, baseband processing, digital signal processing, or transmission processing.
According to some aspects, the memory subsystem 1310 includes one or more memory devices. Examples of memory devices include random access memory (RAM), read-only memory (ROM), solid-state memory, and hard disk drives. In some examples, the memory is used to store computer-readable, computer-executable software including instructions that, when executed, cause a processor to perform the various functions described herein. In some cases, the memory includes, among other things, a basic input/output system (BIOS) that controls basic hardware or software operations, such as interactions with peripheral components or devices. In some cases, a memory controller operates the memory cells. For example, the memory controller may include a row decoder, a column decoder, or both. In some cases, memory cells within the memory store information in the form of a logical state.
According to some aspects, the communication interface 1315 operates at a boundary between communicating entities (such as the computing device 1300, one or more user devices, a cloud, and one or more databases) and the channel 1330, and may record and process communications. In some cases, the communication interface 1315 is provided as part of a processing system coupled to a transceiver (e.g., a transmitter and/or a receiver). In some examples, the transceiver is configured to transmit (or send) and receive signals for the communication device via an antenna.
According to some aspects, the I/O interface 1320 is controlled by an I/O controller to manage input and output signals of the computing device 1300. In some cases, the I/O interface 1320 manages peripheral devices that are not integrated into the computing device 1300. In some cases, the I/O interface 1320 represents a physical connection or port to an external peripheral device. In some cases, the I/O controller uses an operating system known to those skilled in the art. In some cases, the I/O controller represents or interacts with a modem, a keyboard, a mouse, a touch screen, or a similar device. In some cases, the I/O controller is implemented as a component of the processor. In some cases, a user interacts with the device via the I/O interface 1320 or via hardware components controlled by the I/O controller.
According to some aspects, user interface component(s) 1325 enable a user to interact with computing device 1300. In some cases, the user interface component(s) 1325 include an audio device such as an external speaker system, an external display device such as a display screen, an input device (e.g., a remote control device that interfaces with the user interface directly or through an I/O controller), or a combination thereof. In some cases, the user interface component(s) 1325 include a GUI.
The description and drawings described herein represent example configurations and do not represent all implementations that are within the scope of the claims. For example, operations and steps may be rearranged, combined, or otherwise modified. Furthermore, structures and devices may be shown in block diagram form in order to represent relationships between components and to avoid obscuring the concepts described. Similar components or features may have the same name but may have different reference numbers corresponding to different figures.
Some modifications of the present disclosure will be apparent to those skilled in the art, and the principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The described methods may be implemented or performed by a device that comprises a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor, a conventional processor, a controller, a microcontroller, or a state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration). Thus, the functions described herein may be implemented in hardware or software and may be performed by a processor, firmware, or any combination thereof. If implemented in software for execution by a processor, the functions may be stored on a computer-readable medium in the form of instructions or code.
Computer-readable media include non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. Non-transitory storage media may be any available media that can be accessed by a computer. For example, the non-transitory computer-readable medium may include Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), compact Disk (CD) or other optical disk storage, magnetic disk storage, or any other non-transitory medium for carrying or storing data or code.
Further, the connection components may be properly termed a computer-readable medium. For example, if the code or data is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, or microwave signals, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies are included in the definition of medium. Combinations of the media are also included within the scope of computer-readable media.
In this disclosure and the appended claims, the word "or" indicates an inclusive list such that, for example, a list of X, Y, or Z means X or Y or Z or XY or XZ or YZ or XYZ. Furthermore, the phrase "based on" is not intended to represent a closed set of conditions. For example, a step described as "based on condition A" may be based on both condition A and condition B. In other words, the phrase "based on" should be construed to mean "based, at least in part, on." Furthermore, the word "a" or "an" indicates "at least one."

Claims (20)

1. A method, comprising:
obtaining an image comprising text and an area overlapping the text, wherein the text comprises a first color;
Selecting a second color that contrasts with the first color; and
A modified image is generated using a machine learning model having the image and the second color as inputs, the modified image including the text and a modified region, wherein the modified region overlaps the text and includes the second color.
2. The method of claim 1, further comprising:
segmenting the region to identify one or more objects that overlap the text; and
The second color is applied to the one or more objects to obtain a first modified region, wherein the modified image is generated based on the first modified region.
3. The method of claim 1, further comprising:
Noise is added to the region overlapping the text to obtain a noisy image, wherein the modified image is generated based on the noisy image.
4. A method according to claim 3, further comprising:
A mask is generated indicating the region overlapping the text, wherein the noise is added to the image based on the mask.
5. A method according to claim 3, wherein:
at least a portion of the noise includes colored noise corresponding to the second color.
6. The method of claim 1, further comprising:
The image and the modified image are combined to obtain a combined image.
7. The method of claim 1, further comprising:
the text is superimposed on the modified image to obtain a composite image.
8. The method of claim 1, further comprising:
A palette is generated based on the region overlapping the text, wherein the second color is selected from the palette.
9. A non-transitory computer-readable medium storing code comprising instructions executable by a processor to:
obtaining an image comprising text and an area overlapping the text, wherein the text comprises a first color;
Selecting a second color from the region overlapping the text, wherein the second color contrasts with the first color; and
A modified image is generated using a machine learning model having the image and the second color as inputs, the modified image including the text and a modified region, wherein the modified region overlaps the text and includes the second color.
10. The non-transitory computer-readable medium of claim 9, wherein the code further comprises instructions executable by the processor to:
segmenting the image to identify one or more objects in the region that overlap the text; and
The second color is applied to the one or more objects to obtain a first modified region, wherein the modified image is generated based on the first modified region.
11. The non-transitory computer-readable medium of claim 10, wherein the code further comprises instructions executable by the processor to:
Calculating a probability score for the one or more objects, the probability score indicating a likelihood of presence of the one or more objects;
determining a low probability of the presence for the one or more objects based on the probability score; and
Based on the determination, a plurality of superpixels is extracted from the region overlapping the text, wherein the first modified region includes the plurality of superpixels.
12. The non-transitory computer-readable medium of claim 9, wherein the code further comprises instructions executable by the processor to:
Noise is added to the image in the region overlapping the text to obtain a noisy image, wherein the modified image is generated based on the noisy image.
13. The non-transitory computer-readable medium of claim 9, wherein the code further comprises instructions executable by the processor to:
The image and the modified image are combined to obtain a combined image.
14. The non-transitory computer-readable medium of claim 9, wherein the code further comprises instructions executable by the processor to:
the text is superimposed on the modified image to obtain a composite image.
15. An apparatus for image editing, comprising:
A processor;
a memory comprising instructions executable by the processor to perform operations comprising:
obtaining an image and text overlapping the image, wherein the text comprises a first color;
Selecting a second color that contrasts with the first color; and
A background image for the text is generated based on the second color using a machine learning model, wherein the background image includes the second color in a region corresponding to the text.
16. The apparatus of claim 15, further comprising:
a segmentation component configured to segment the image to identify one or more objects.
17. The apparatus of claim 15, further comprising:
A noise component configured to add noise to the image in the region corresponding to the text.
18. The apparatus of claim 15, further comprising:
A superpixel component configured to extract a plurality of superpixels from the region corresponding to the text.
19. The apparatus of claim 15, further comprising:
a combining component configured to combine the image and the background image to obtain a combined image.
20. The apparatus of claim 15, wherein:
the machine learning model includes a generative diffusion model.
CN202310957863.7A 2022-10-17 2023-08-01 Design synthesis using image coordination Pending CN117911580A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US63/379,813 2022-10-17
US202318334610A 2023-06-14 2023-06-14
US18/334,610 2023-06-14

Publications (1)

Publication Number Publication Date
CN117911580A true CN117911580A (en) 2024-04-19

Family

ID=90689998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310957863.7A Pending CN117911580A (en) 2022-10-17 2023-08-01 Design synthesis using image coordination

Country Status (1)

Country Link
CN (1) CN117911580A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination