WO2023178579A1

WO2023178579A1 - Method and system for multimodal based image searching and synthesis

Info

Publication number: WO2023178579A1
Application number: PCT/CN2022/082630
Authority: WO
Inventors: Changqing Zou; Himanshu Arora; MingXue Wang
Original assignee: Huawei Technologies Co.,Ltd.
Priority date: 2022-03-24
Filing date: 2022-03-24
Publication date: 2023-09-28

Abstract

Method, system and computer program product for processing an input image of a scene including a plurality of objects including at least one sketch object, including generating, based on an input image, respective object level-representations for each of the plurality of objects; generating, based on the object level-representations, a set of constrained-correlated representations, each constrained-correlated representation representing a respective object of the plurality of objects, the constrained-correlated representation for each object including information about an appearance of the object and positional relationships between the object and other objects of the plurality of objects; and generating, based on the constrained-correlated representations and positional information derived from the input image, a set of respective freely-correlated object representations, each freely-correlated representation representing a respective object of the plurality of objects, the freely correlated representation for each object including information about a location of the object within the input image.

Description

METHOD AND SYSTEM FOR MULTIMODAL BASED IMAGE SEARCHING AND SYNTHESIS

TECHNICAL FIELD

This disclosure relates to method and systems for composing and searching images based on multimodal composition images.

BACKGROUND

Recent works have explored both image search and image generation guided by freehand sketches, enabling an intuitive way to communicate visual intent. Sketch based image retrieval systems have been discussed in published papers such as: [Reference Document: Fang Wang, Le Kang, and Yi Li. Sketch-based 3d shape retrieval using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR) , 2015 IEEE Conference on, pages 1875–1883. IEEE, 2015. ] , [Reference Document: Yonggang Qi, Yi-Zhe Song, Honggang Zhang, and Jun Liu. Sketch-based image retrieval via siamese convolutional neural network. In Image Processing (ICIP) , 2016 IEEE International Conference on, pages 2460–2464. IEEE, 2016] and [Reference Document: Tu Bui, Leo Ribeiro, Moacir Ponti, and John Collomosse. Compact descriptors for sketch-based image retrieval using a triplet loss convolutional neural network. Computer Vision and Image Understanding, 2017. ]

In sketch-based image retrieval, images are retrieved by using sketches to find similar images from an image set. In most cases, this task has been accomplished directly with techniques like contrastive learning or triplet network learning based mappings between sketches and edge maps and have focused in single mode searching (e.g., search is performed based on a single sketched object) .

Scene Graphs have also been used for searching. Scene Graphs refer to a representation of images which indicate the objects included in the image and the relationship between those objects. Scene Graphs have been widely used in past to perform text-based image retrieval [Reference Document: Justin Johnson, Ranjay Krishna, Michael Stark, Li Jia Li, David A. Shamma, Michael S. Bernstein, and Fei Fei Li. Image retrieval using scene graphs. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 07-12-June, pages 3668–3678. IEEE, jun 2015. ] and very recently have been use to generate semantic layout of images [Reference Document: Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image Generation from Scene Graphs. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1219–1228. IEEE, jun 2018. ] , [Reference Document: Oron Ashual and Lior Wolf. Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE International Conference on Computer Vision, volume 2019-Octob, pages 4560–4568, sep 2019. ]

Conditional image generation has been performed using a conditional form of Generative Adversarial Networks (GANs) [Reference Document: Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv: 1411.1784, 2014. ] These networks can use various label priors or image priors as a conditionality while generating images. Recent works like Pix2PixHD [Reference Document: Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. ] and SPADE [Reference Document: Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. ] are known for their high definition synthesis. Conditional generation provides control on generation.

Image synthesis using sketch inputs is a difficult problem given the randomness that comes in a hand-made sketch. Most works so far like Sketchy-GAN [Reference Document: Wengling Chen and James Hays. Sketchygan: Towards diverse and realistic sketch to image synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , June 2018. ] and Contextual-GAN [Reference Document: Yongyi Lu, Shangzhe Wu, Yu-Wing Tai, and Chi-Keung Tang. Image generation from sketch constraint using contextual GAN. In The European Conference on Computer Vision (ECCV) , September 2018. ] have worked on generating images based on single object images.

Visual Search is an important task given the huge amount of visual data being generated every minute. Performing visual search based on freehand sketches is an important variation of visual search. The existing works predominantly focus on performing search operations on the basis of a single dominant object irrespective of its location in the image. Existing solutions do not adequately support search and image generation tasks that are based on multiple input objects that can include both sketched objects and objects of other modality types. There is a need for method and system for multimodal based image searching and synthesis.

SUMMARY

Methods and systems are disclosed for searching and generating images based on free-hand sketches of scene compositions which can describe the appearance and relative position of multiple objects.

According to a first example aspect of the disclosure is computer-implemented method for processing an input image of a scene including a plurality of objects including at least one sketch object. The method includes generating, based on an input image, respective object level-representations for each of the plurality of objects; generating, based on the object level-representations and relationship information derived from the input image, a set of constrained-correlated representations, each constrained-correlated representation representing a respective object of the plurality of objects, the constrained-correlated representation for each object including information about an appearance of the object and positional relationships between the object and other objects of the plurality of objects; and generating, based on the constrained-correlated representations and positional information derived from the input image, a set of respective freely-correlated object representations, each freely-correlated representation representing a respective object of the plurality of objects, the freely correlated representation for each object including information about a location of the object within the input image.

Including information about an appearance of the object and positional relationships between the object and other objects of the plurality of objects in the constrained-correlated representation for the object can improve performance of retrieval and synthesis operations in at least some example scenarios. Furthermore, the inclusion of multiple objects including at least one sketch object in the input image of a scene allows for multiple objects and sketches to be considered in the representations.

In some aspects, the method includes generating a semantic layout for the input image based on the set of freely correlated object representations.

In one or more of the preceding aspects, the method includes generating a synthetic image based on the semantic layout.

In one or more of the preceding aspects, the method includes generating, based on the constrained-correlated representations and the positional information derived from the input image, a scene representation that embeds information about the scene.

In one or more of the preceding aspects, the method includes comprising searching for and retrieving images from a database of images based on the scene representation.

In one or more of the preceding aspects, the scene representation and the freely-correlated object representations are both generated using a common machine learning (ML) transformer network.

In one or more of the preceding aspects, the scene representation is included in a single vector, and each of the freely-correlated object representations are respective vectors.

In one or more of the preceding aspects, generating the set of constrained-correlated representations comprises: generating a scene graph for the input image using the object level-representations for respective nodes and the relationship information derived from the input image for defining connecting edges between the respective nodes; and processing the scene graph using a graphical convolution network (GCN) to learn the respective constrained-correlated representations.

In one or more of the preceding aspects, the plurality of objects include at least one non-sketch object.

In one or more of the preceding aspects, generating the respective object level-representations comprises: encoding, using a first encoder network, an intermediate sketch object representation for the at least one sketch object; encoding, using a second encoder network, an intermediate non-sketch object representation for the at least one non-sketch object; and encoding, using a common encoder network, the intermediate sketch object representation to provide the object level-representation for the at least one sketch object and the intermediate non-sketch object representation to provide the object level-representation for the at least one non-sketch object.

In one or more of the preceding aspects, the at least one non-sketch object is selected from a group that includes: a photographic object; a photo-real object; a cartoon object; and a text object.

In one or more of the preceding aspects, the at least one non-sketch object is a photographic object.

In one or more of the preceding aspects, the first encoder network, second encoder network, and common encoder network each comprises respective sets of network layers of a machine learning (ML) model.

In one or more of the preceding aspects, the method includes assembling the input image as a composite image that includes the plurality of objects based on inputs received through a user input interface, the inputs including a user indication of at least one crop from photographic image for inclusion as a photographic object in the plurality of objects.

In one or more of the preceding aspects, the method includes generating the positional information derived from the input image by computing a positional encoding for each object of the plurality of objects based on a grid location of that object based on a grid overlaid on the input image.

According to a second aspect of the disclosure, a system is disclosed that includes one or more processors and one or more memories storing instructions executable by the one or more processors. The instructions, when executed, causing the system to perform a method according to any one of the preceding aspects.

According to a third aspect of the disclosure, a computer readable memory is disclosed that stores instructions for configuring a computer system to perform a method according to any one of the preceding aspects.

According to a fourth aspect of the disclosure is a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any one of the preceding aspects.

BRIEF DESCRIPTION OF THE DRAWINGS

Reference will now be made, by way of example, to the accompanying drawings which show example embodiments of the present disclosure, and in which:

FIG. 1 is an illustrative example of a composition image;

FIG. 2 is a block diagram illustrating a system for multimodal based image searching and synthesis, according to example embodiments;

FIG. 3 is a block diagram of an object level representation (OLR) generator of the system of FIG. 2;

FIG. 4 is a block diagram of a constrained-correlated representation (CCR) generator of the system of FIG. 2;

FIG. 5 is a block diagram of a freely-correlated representation (FCR) generator of the system of FIG. 2;

FIG. 6 is a simplified illustration of a positional encoding grid overlaid on an input composition image;

FIG. 7 is an example of an input image including a sketch object and a resulting semantic layout generated by the system of FIG. 2;

FIG. 8 illustrates an example of an input image including a sketch object and a resulting set of search result images;

FIG. 9 illustrates iterative use of the system of FIG. 2 to generate a set of synthesized images; and

FIG. 10 is a block diagram illustrating an example of a computer system for implementing methods and systems disclosed herein, according to example embodiments.

DETAILED DESCRIPTION

Systems and methods are disclosed for searching and generating images using free-hand sketches.

For illustrative purposes, FIG. 1 illustrates an example of a composition image 150 that depicts a scene that includes a plurality of

objects

152, 154, 160, 164, 168 (collectively referred to as objects 151) . The illustrated objects 151 are represented using different types of illustrative modes such as hand-drawn, photographic and text illustrative modes, such that the composition image 150 is a multimodal image. For example, sketch object 152 is a hand-drawn sketch of an arrow. Sketch object 514 is a hand-drawn sketch of a star. Photographic object crop 156 (shown as a line drawing in FIG. 1 for ease of illustration) , includes foreground photographic object 160 (left arrow) against a photographic background 162 (shown as vertical lines) . Photographic object crop 158 (shown as a line drawing in FIG. 1 for ease of illustration) , includes foreground photographic object 164 (star) against a photographic background 166 (shown as a set of slanted lines) . A text object 168 is also illustrated in FIG. 1.

As used in this disclosure:

“sketch object” can refer to a set of digital data that encodes a hand-drawn sketch representation of a physical or abstract object.

“photographic object” can refer to a set of digital data that encodes a photograph of a real world object that has been captured using an image sensing device such as a visual light camera; a photographic object may be a foreground element in a photographic image (or crop from a photographic image) that can also include a photographic background.

“scene” can refer to a collection of one or more objects that each have a respective position and orientation within the scene.

“text object” can refer to a set of digital data that encodes text characters.

“image” refers a set of digital data that encodes the objects included in a scene and can be rendered by a rendering device to depict the scene.

It will be appreciated that sketch objects, photographic objects, and text objects represent a subset of possible illustrative modes that can be used to represent objects 151 in a scene. Further illustrative modes can, for example, include augmented, photo-real, and cartoon. By way of example, augmented objects can include objects that combine data of photographic objects and data from other illustrative modes. Photo-real objects (which may or may not be a form of augmented objects) are highly realistic artificially generated representations (for example, computer generated graphic image (CGI) objects) that can approach or achieve a level of object representation of photographic objects. Cartoon and anime objects can fall on a spectrum between sketch objects and photographic objects.

In illustrated examples, systems and methods are described in the context of two illustrative modes, namely sketch objects and photographic objects. However, the method and systems can be adapted to process other illustrative modes.

Multimodal images such as composition image 150 can be generated using known software solutions and can be stored using a known data format. For example, image 150 can be a raster-graphics file such as a Portable Network Graphics (PNG) file that is capable of supporting different illustrative modes within a common scene. By way of example, a user can create the scene depicted in image 150 by using a graphics application that allows a user to use input device (s) (for example a mouse, keyboard, touchscreen/stylus combination, and/or microphone) to: hand-

sketch objects

152 and 154 into the scene; crop photographic object crops 156, 158 from source images and drop them into desired locations in the scene; and/or add a text box at a desired location with text object 168.

FIG. 2 illustrates an example of a system 200 for multimodal based image searching and synthesis according to an example aspect of the disclosure. As will be explained in greater detail below, the system 200 can include the following modules: a modality-based object detector 202, an object level representation (OLR) generator 212, a constrained-correlated representation (CCR) generator 216, a freely-correlated representation (FCR) generator 220, a semantic layout generator 223 and search space embedding generator 226. As used in this disclosure, terms such as module and generator can refer to a combination of a hardware processing circuit and machine-readable instructions (software and/or firmware) executable on the hardware processing circuit. A hardware processing circuit can include any or some combination of a microprocessor, a core of a multi-core microprocessor, a microcontroller, a programmable integrated circuit, a programmable gate array, a digital signal processor, or another hardware processing circuit.

As described in detail below, the system 200 is configured to process an input image, for example composition image 150, to generate cross-modal representations (for example freely correlated representations 222 and scene representations 224) that can be used for image searching and image generation purposes.

As shown in FIG. 2, input composition image 150 is first processed by a modality based object detector 202. Modality based object detector 202 is configured to select, according to illustrative mode, each of the objects 151 included in the composition image and extract the set of data for a region of the composition image 150 that corresponds to the object 151. For example, modality based object detector 202 detects all sketch objects (e.g., sketch objects 152 and 154) represented in composition image 150 and extracts the set of data corresponding to each identified sketch object to provide a respective sketch object crop 204. Each sketch object crop 204 is a dataset that includes the set of data (for example pixel data) from composition image 150 for a region (for example a region bounded by a rectangular bounding box 153) that encompasses the sketch object. Additionally, modality based object detector 202 detects all photographic objects (e.g., photographic objects 160 and 164) represented in composition image 150 and extracts the set of data corresponding to each identified photographic object to provide a respective photographic object crop 206. Each photographic object crop 206 is a data set that includes the set of data (for example pixel data) from composition image 150 for a region (for example a region bounded by a rectangular bounding box that corresponds to photographic image crop 156 in the case of photographic object 160) that encompasses photographic object.

As noted above, the present description will focus on the processing of two illustrative modes, namely sketch objects and photographic objects, although crops for other illustrative modes (for example text object 168) can similarly be extracted and respectively processed by the system 200 using methods similar to those described herein in respect of the sketch and photographic illustrative modes.

In addition to providing respective sketch object crops 204 and photographic object crops 206, modality based object detector can also generate positional data 208 that is used by downstream modules such as CCR generator 216 and FCR generator 220. Positional data 208 can indicate the locations of the composition image 150 that the respective object crops 204, 206 have been taken from. For example, positional data 208 can indicate bounding box coordinates within the composition image 150 that correspond to each of the respective sketch object crops 204 and respective photographic object crops 206. In some examples, rather than bounding box information, the positional data 208 that is provided to downstream modules may include the original composition image 150 itself which can then be processed by the downstream modules themselves to extract positional information.

In some examples (for example in the case of a composition image represented in a PNG format) the information required by modality based object detector 202 to generate sketch object crops 204, photographic object crops 206 and positional data 208 may be readily available or inferable from the data included in composition image 150 and related metadata, in which case modality based object detector 202 may be a rule-based processing module. In other scenarios, modality based object detector 202 may employ a machine learning (ML) based model that has bene trained to perform a modality-based object detection and mode classification task.

Once extracted, the sketch object crops 204 and photographic object crops 206 from input composition image 150 are processed by OLR generator 212 to independently encode each object within the composition image to a respective object level representation (OLR) 214 that uses a common feature format regardless of the object’s illustrative mode. OLR generator 212 is shown in greater detail in FIG. 3.

As shown in FIG. 3, the OLR generator 212 includes a respective illustrative mode specific encoder for the input object crops, for example a sketch object encoder 312 for processing sketch object crops 204, a photographic object encoder 314 for processing photographic object crops 206, and (optionally) other modal object encoders 315 for encoding other illustration mode crops (e.g., text objects) .

In example embodiments, OLR generator 212 can be implemented using one or more multiple neural network (NN) layer machine learning (ML) based models. In an example embodiment, sketch object encoder 312 can be implemented using a set of network layers (e.g., a first encoder network) that are trained to approximate a function f _s (. ) that encodes each sketch object (e.g., as represented in an input sketch object crop 204) to an intermediate sketch object representation (SOR) . Photographic object encoder 312 can be implemented using network layers (e.g., a second encoder network) that are trained to approximate a function f _p (. ) that encodes each photographic object (e.g., as represented in an input photographic object crop 206) to an intermediate sketch object representation (POR) .

The intermediate SORs and PORs are provided as respective inputs to common networks layers (e.g., a common encoder network) that approximate a common encoding function f _c (. ) that encodes each SOR and POR to a respective OLR 214.

By way of example, sketch object encoding function f _s (. ) and photographic encoding function f _p (. ) can each be implemented using a known neural network architecture for image processing such as MobileNet [Reference Document: Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv: 1704.04861, 2017. ] . A deep metric learning approach can be adopted using a heterogeneous triplet architecture (i.e., anchor branch (a) , positive branch (p) and negative branch (n) , with no shared weights between the a, p and n branches) . Common encoding function f _c (. ) can, for example, be implemented using two shared fully connected (fc) layers that collectively generate a multi-dimensional (for example a 128 element) feature tensor for each OLR 214. As used herein, “tensor” can refer to an ordered set of values, where the location of each value in the set has a meaning. Examples of tensors include vectors, matrices and maps.

In an example embodiment, the ML model layers that implement sketch object encoding function f _s (. ) and photographic encoding function f _p (. ) can be pre-trained on ImageNet [Reference Document: Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) , 115 (3) : 211–252, 2015. ] , and fine-tuned for cross-domain learning using a combined categorical cross-entropy (CCE) and triplet loss, with the triplet loss defined as:

Where m is the margin (for example m=0.5) and |. | is the L ₂ norm.

In some examples, an ML model used to implement OLR generator 212 is trained using rasterized sketches (a) and objects cropped from a photographic image dataset with corresponding (p) and differing (n) object classes. To aid convergence, an equal-weighted CCE loss CCE (. ) is applied to a further fc layer f _e (. ) (e.g., sketch object classifier 320 and photographic object classifier 322) appended to the ML model network:

Where

are the one-hot vectors encoding the semantic class of each input.

Accordingly, OLR generator 212 processes a plurality of objects 151 that are represented using different illustrative modes, including sketch objects, to generate a respective OLR 214 (i.e., a multi-dimensional feature tensor) for each of the objects 151 included in a scene depicted in the input composition image 150. Each OLR 214 uses a common embedding format that includes appearance and pose information for the object 151 that it represents, regardless of the illustrative mode of the object.

Referring to FIG. 4, CCR generator 212 receives, as inputs, the OLRs 214 generated in respect of input composition image 150 and the positional data 208 (e.g., bounding box coordinates) . CCR generator 212 outputs a set of constrained-correlated representations (CCR) 218 of each object 151. Each CCR 218 represents a respective object 151 as a multi-dimensional embedding that encodes information about the pose and appearance of its respective object and pair-wise positional relationships of the respective object 151 with the other objects from image composition 150.

CCR generator 212 includes (i) a scene graph generator 401 to generate a scene graph SG for the input composition 150 based on the OLRs 214 and positional data 214, and (ii) a graph convolution network (GCN) 401 that learns, based on the scene graph SG, a respective multi-dimensional embedding (i.e., CCR 218) for each object.

As known in the art, graphs are data structures that allow efficient storage of relational information, and a GCN can be used to encode pair-wise object relationships represented in a scene graph SG into a continuous representation. A scene graph SG describes objects and their spatial relationships within the scene. CCR generator 216 can include a scene graph generator 401 that organizes OLRs 214 and positional data 208 into a scene graph SG. Each OLR 214 corresponds to an initial node embedding for a respective node. The positional data 208 is used to generate edges that encode pair-wise relationships between the nodes. The scene graph SG can be denoted as a graph G = (V, E) where V = {o ₁, ..., o _i, ..., o _K} is the set of nodes representing the K objects 151 of the composition image 150 and (o _i, r _ij, o _j) ∈ E is the set of edges encoding the discrete relationships r _ij∈ R. In the illustrated embodiment, scene graph generator can transform the positional data 208 into discrete relationships r _ij∈ R, where R is a set of discrete positional relationships, including for example relative positional classification categories such as: “left of” , “right of” , “above” , “below” , “contains” , and “inside of” .

By way of illustrative example, the GCN 402 of FIG. 4 can be denoted as GCN g _r(. ) , and each OLR 214 denoted as an input node o _i. A learnable embedding f _r (r _ij) is used to map relationships R into dense tensors (for example, dense vectors) (see [Reference Document: Oron Ashual and Lior Wolf. Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE International Conference on Computer Vision, volume 2019-Octob, pages 4560–4568, sep 2019. ] ) . By way of non-limiting example, the GCN g _r (. ) can include a sequence of 6 layers g _r ^k (. ) , where k denotes the index of each layer, . using a network architecture such as that disclosed in [Reference Document: Justin Johnson, Ranjay Krishna, Michael Stark, Li Jia Li, David A. Shamma, Mi-chael S. Bernstein, and Fei Fei Li. Image retrieval using scene graphs. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 07-12-June, pages 3668–3678. IEEE, jun 2015. ] Edges E are processed by a fully connected (fc) layer g _r ^k (. ) as follows:

Each fc layer g _r ^k (. ) updates a relationship vector (e.g., a dense vector) for each pair-wise relationship to

and builds the set

of object vectors which are gathered and averaged (per object in set V) by h and processed by another fc layer (k+1) to compute an updated object vector

for each object. The CCR 218 generated for an object 151 embeds information about the pose and appearance of the object 151 as well the positional relationship of the object 151 with other objects included in the input composition image 150.

Accordingly, the CCR generator 216 is configured to learn a set of CCRs 218 to represent the scene graph SG by optimizing edge distances based on the positional relationships of objects 151. The layers of the GCN 402 generate object vectors for each object 151, with the same layers being shared for all objects regardless of the illustrative modes of the objects. These layers are trained end-to-end to learn the CCRs 218. Each CCR 218 is a multidimensional vector that embeds pose and appearance information for a respective object 151 as well as information about the object’s positional relationships with other objects from the composition image 150. The CGN 402 encodes pairwise object relationships in the scene into a continuous relationship.

Relationship information about object pairs that are constrained by a direct pairwise relationship (e.g., “left of” , “right of” , “above” , “below” , “contains” , and “inside of” ) in the scene graph SG will be represented in the CCRs 218 generated by CCR generator 218. However, information about unconstrained relationships, for example relationships based on relative locations within the image rather than direct inter-object relationships, may not be represented in the CCRs. 218. With reference to FIG. 5, the CCRs 218 generated by CCR generator 216 are provided to FCR generator 220. One purpose of the FCR generator 220 is to provide a final representation tensor (i.e., a freely-correlated representation FCR 222) for each object 151 that is learned based on the connection of that object 151 to all other objects in the image composition 150 even if there is not a specific connection between the objects in the scene graph SG.

To accomplish this purpose, the FCR generator 220 includes a transformer network 502 that enables free correlation between all objects 151. The resulting FCRs 222 encode relationships between each object 151 and all other objects in a weighted manner. The transformer network 502 includes a plurality of stacked attention layers 506 that are capable of interlacing multiple object instances based on their respective features. Along with the CCRs 218 for each object 151, positional encoding for each object 151 is also fed as inputs to the transformer 502.

Referring to FIG. 6, in some example embodiments, FCR generator 220 includes an object positional encoding module 504 to compute a respective position vector for each object 151 that represents the object’s location within the scene depicted in the composition image 150. By way of illustrative example, FIG. 6 shows a grid-based positional encoding scenario in which the scene of composition image 150 is divided into a 2 by 2 grid. In the illustrated scenario, the positional sketch object 152 would be assigned a position vector corresponding to grid location 0, positional sketch object 154 would be assigned a position vector corresponding to grid location 2, photographic sketch object 160 would be assigned a position vector corresponding to grid location 1, and photographic sketch object 162 would be assigned a position vector corresponding to grid location 3. In an example embodiment, the scene is divided into a 5 by 5 grid and a position index of p∈0, 1, ..., 24 is assigned to each object 151 based on where the center of the object 151 falls in the grid. The position index assigned to the object 151 is then mapped by predefined function f _pe (p) to a position vector for the object 151.

In an example embodiment, the transformer 502, denoted as t (. ) =Z, includes 3 layers, each having 16 attention heads. The CCRs 218 are input to the transformer 502, along with an empty vector

as an input matrix. In each layer t ⁱ (. ) , attention weights are computed that relate each vector to all other vectors in the sequence:

where multiplications are matrix-based, s is the softmax function,

are fc layers, Q is a matrix made by stacking the vector

and the respective vectors of CCRs 218, and E are the positional encodings generated by function f _pe (p) , stacked in the same fashion as Q. A resulting output matrix includes a respective representation (e.g., a vector FCR 222) for each object 151, as well as an overall scene representation (SR) 224 that is included in the same matrix position that was used for empty vector

Referring again to FIG. 2, FCRs 222 are provided as inputs to a semantic layout generator 223 that is configured to generate a semantic scene layout 228 representation of the input image 130. In the example of FIG. 2, the semantic layout generator 223 includes two machine learning based models, namely a mask generator 224 that generates a mask for each object 151 based on the object’s respective FCR 22, and a box generator 225 that generates a respective bounding box for each object 151 to provide a layout for the semantic layout 228. By way of example, the box generator 225, which can, for example, be a 2-layer Multi-layer Perceptron (MLP) network, can be trained using Generalized intersection-over-Union (GIOO) and mean-squared error (MSE) (for example, see Reference Document: [Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , June 2019. ] ) The mask generator 224 can be trained within a general adversarial network (GAN) (for example, see Reference Document: [Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, 2014. ] ) , using a conditional setup (for example, see Reference Document: [Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv: 1411.1784, 2014] ) , with least squares generative adversarial network (LSGAN) losses (for example, see Reference Document: [Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, and Zhen Wang. Least squares generative adversarial networks, 2016. cite arxiv: 1611.04076. ] ) , together with MSE and feature matching to regularize the adversarial training.

Taking b as a ground truth bounding box,

as the FCR 222 for a photographic object from input composition image 151 and

as the FCR 222 for a sketch object from input composition image 151, bounding box generation losses can be denoted as:

for each object 151 on each image, applied to both

and

with λ ₁= λ ₂=10. Using a similar notation, taking the ground truth masks as m _g, and the object labels as y∈C, where C is the set of object classes, the loss for the mask generator 222 is:

with D as the discriminator and also applied to both

and

with weights λ ₃=10, λ ₄= 0.25, and λ ₅= 10. (The values of weights λ are hyper-parameters, with the illustrated values representing one possible example) . The loss l _FM (. ) is the feature matching loss, the L1 difference in the activations of D (D (. ) ) . The D itself is trained with the opposite loss, in classic GAN fashion. Those losses back-propagate through all representation levels.

The generated masks and bounding boxes are then combined by a combing operation C into a semantic layout 228. In some examples, objects are placed ordered by their size, so that bigger objects are behind smaller ones (for example, see reference document: [Oron Ashual and Lior Wolf. Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE International Conference on Computer Vision, volume 2019-Octob, pages 4560–4568, sep 2019. ] ) . In some examples, the object-only layout may then be combined with a background layout (which may for example be selected from search results) .

By way of example, FIG. 7 shows an input image 150A that includes a sketch object of a bicycle, and a resulting semantic layout 228A generated by semantic layout generator 223. In the example of FIG. 7, a pre-selected background specified by a user has been combined with the sketch-object to generate semantic layout 226.

The final semantic layout 226 can then be passed to a semantic image generator 232 to generate a synthetic composition image. (for example, see reference document: [Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. ] )

Referring again to FIG. 2, the scene representation 224 generated by FCR generator 220 is a single vector latent space that can be used by a composition image retrieval module 230 as a search embedding for searching an image search space (e.g., a database of photographic images and other illustrative mode images) . The scene representation 224 embeds information about the similarity of the different illustrative mode objects (e.g., similarities of sketch objects and photographic objects) that are included in an input composition image 150. The metric properties of the scene representation 224 can be encouraged during training of transformer 502 through an adaptation of the contrastive loss. For example, during contrastive training, half of the negatives are randomly sampled from other images of scenes in a training set. The other half of the negatives are synthesized by swapping objects of different class into the positions of the objects in the positive image. By including such a class-swapped version of the image as a negative, the ML model used for transformer 502 is encouraged to discriminate change in object class to a greater degree than changes in object positions. The contrastive loss, adapted to incorporate these synthetic negatives, is:

where D is a function that computes the Euclidean distance between all vectors on one set and all vectors on another set, x _s and x _i are respectively sketch object and photographic object components of the scene representation 224, and Y is a label indicating if a certain pair of vectors are from the same scene or not. Finally, x _sn denotes the scene representation 224 of the synthesized scenes with swapped objects.

In system 200, cross-modal search embedding is used to measure similarity of visual compositions that include sketch objects and images. The search embedding is generated via a hybrid graph convolutional network (GCN) and transformer network architecture. Using this hybrid architecture, object correlations are learned including information about object type, appearance and arrangements. The correlations can then be combined with the help of a semantic layout generator which generates and combines object masks. Resulting images can be combined with sketches in a combined search and synthesis workflow, which allows the search system to iterate over the resulting images.

In some examples, training of the ML models of system 200 is performed in three stages. First the OLR generator 212 is pre-trained independent of the rest of the system 200, using the dual triplet and categorical cross-entropy loss. In a second stage, the ML models for entire system 200 are trained end-to-end. In a third stage, the ML models are fine tuned by mixing samples from different training datasets (e.g., QuickDrawCOCO-92c and SketchyCOCO samples) .

In example embodiments, the multimodal based image searching and synthesis system 200 can be used iteratively to generate a final image. An example use scenario will now be described with reference to FIGs. 8 and 9.

FIG. 8 illustrates an example of a scene-based search. A user has used a drawing application to create a PNG format image 150C that includes hand drawn sketches (i.e., two sketch objects) of two horse-like animals. Image 150C is provided as input to system 200 as a basis for a search request for images of similar scenes. In response, system 200 generates a scene representation 224 for the image 150C that is then used by composition image retrieval module 230 as a search embedding for searching an image database. Composition image retrieval module 230 returns a set of photographic images that each include scenes of multiple horse-like animals, namely sample images 800A to 800E, which are presented to the user via a computer system user interface.

FIG. 9 illustrates multiple examples, illustrated as columns 801A to 801D, of synthetic image generation based on input image 150C and the search results (image samples 800A to 800E) . In a first example, (column 801A) the user uses a user interface input device to indicate selection of sample photographic image 800D, and more particularly, indicates that the user wants to use the background from such image underlay a synthesized image. In such case, the system 200 generates a semantic layout 228 representation of the two sketch objects included in input image 150C and the background of sample image 800D (i.e. foreground objects of sample image 800D are ignored) . The resulting semantic layout 228 is provided to synthetic image generator 232 that outputs a respective synthesized image 802A.

In a second example, (column 801B) , using an input device, the user performs a snip and copy action on a portion of the sample photographic image 800D, and pastes the resulting photographic object crop into a revised input image 150D. As can be seen in FIG. 9, the input image 150D is a multimodal image that includes the two sketch horse-like objects and a photographic object crop that includes a photographic object that represents zebra. In the example of column 801B, the user again selects the background from photographic image 800D as an background underlay. In such case, the system 200 generates a semantic layout 228 representation of the two sketch objects included in input image 150C, the photographic object (e.g., zebra) cropped from sample image 800D and the background of sample image 800D. The resulting semantic layout 228 is provided to synthetic image generator 232 that outputs a respective synthesized image 802B.

A third second example, (column 801C) , is similar to that of the second example (column 801D) , except that the user has chosen to flip the photographic object crop to a mirror image, providing a revised input composite image 150E. The resulting semantic layout 228 is provided to synthetic image generator 232 that outputs a respective synthesized image 802C.

In the fourth example, (column 801D) , the user has chosen to change the relative positions of the sketch objects and the photographic objects, providing a revised input composite image 150F. The resulting semantic layout 228 is provided to synthetic image generator 232 that outputs a respective synthesized image 802D.

Referring to FIG. 10, a block diagram of a computer system 100 that can be used to implement systems of methods of the present disclosure, including system 200, is shown. Although an example embodiment of the computer system 100 is shown and discussed below, other embodiments may be used to implement examples disclosed herein, which may include components different from those shown. Although FIG. 10 shows a single instance of each component, there may be multiple instances of each component shown.

The computer system 100 includes a hardware processing circuit that comprises one or more processors 106, such as a central processing unit, a microprocessor, an application-specific integrated circuit (ASIC) , a field-programmable gate array (FPGA) , a dedicated logic circuitry, a tensor processing unit, a neural processing unit, a dedicated artificial intelligence processing unit, or combinations thereof. The computer system 100 may also include one or more input/output (I/O) interfaces 104. The computer system 100 includes one or more network interfaces 108 for wired or wireless communication with a network (e.g., an intranet, the Internet, a peer-to-peer (P2P) network, a wide area network (WAN) and/or a local area network (LAN) ) or other node. The network interface (s) 108 may include wired links (e.g., Ethernet cable) and/or wireless links (e.g., one or more antennas) for intra-network and/or inter-network communications.

The computer system 100 includes one or more memories 118, which may include volatile and non-volatile memories and electronic storage elements (e.g., a flash memory, a random access memory (RAM) , read-only memory (ROM) , hard drive) . The non-transitory memory (ies) 118 may store instructions for execution by the processor (s) 106, such as to carry out examples described in the present disclosure. The memory (ies) 118 may store, in a non-volatile format, other non-volatile software instructions, such as for implementing an operating system and other applications/functions. The software instructions may for example include multimodal based image searching and synthesis instructions 200I that when executed by the one or more processor (s) 106, configure the computer system 100 to implement the system 200.

In the present disclosure, use of the term “a, ” “an” , or “the” is intended to include the plural forms as well, unless the context clearly indicates otherwise. Also, the term “includes, ” “including, ” “comprises, ” “comprising, ” “have, ” or “having” when used in this disclosure specifies the presence of the stated elements, but do not preclude the presence or addition of other elements.

The present disclosure has presented illustrative example embodiments. It will be understood by those skilled in the art that adaptations and modifications of the described embodiments can be made without departing from the scope of the disclosure. y

The content of all published documents referenced in this disclosure are incorporated herein by reference.

Claims

A computer-implemented method for processing an input image of a scene including a plurality of objects including at least one sketch object, comprising:

generating, based on an input image, respective object level-representations for each of the plurality of objects;

generating, based on the object level-representations and relationship information derived from the input image, a set of constrained-correlated representations, each constrained-correlated representation representing a respective object of the plurality of objects, the constrained-correlated representation for each object including information about an appearance of the object and positional relationships between the object and other objects of the plurality of objects; and

generating, based on the constrained-correlated representations and positional information derived from the input image, a set of respective freely-correlated object representations, each freely-correlated representation representing a respective object of the plurality of objects, the freely correlated representation for each object including information about a location of the object within the input image.
The method of claim 1 comprising generating a semantic layout for the input image based on the set of freely correlated object representations.
The method of claim 2 comprising generating a synthetic image based on the semantic layout.
The method of claim 1, 2, or 3 comprising generating, based on the constrained-correlated representations and the positional information derived from the input image, a scene representation that embeds information about the scene.
The method of claim 4 further comprising searching for and retrieving images from a database of images based on the scene representation.
The method of claim 4 or claim 5 wherein the scene representation and the freely-correlated object representations are both generated using a common machine learning (ML) transformer network.
The method of claim 4, 5, or 6 wherein the scene representation is included in a single vector, and each of the freely-correlated object representations are respective vectors.
The method of anyone of claims 1 to 7 wherein generating the set of constrained-correlated representations comprises:

generating a scene graph for the input image using the object level-representations for respective nodes and the relationship information derived from the input image for defining connecting edges between the respective nodes; and

processing the scene graph using a graphical convolution network (GCN) to learn the respective constrained-correlated representations.
The method of any one of claims 1 to 8 wherein the plurality of objects include at least one non-sketch object.
The method of claim 9 wherein generating the respective object level-representations comprises:

encoding, using a first encoder network, an intermediate sketch object representation for the at least one sketch object;

encoding, using a second encoder network, an intermediate non-sketch object representation for the at least one non-sketch object; and

encoding, using a common encoder network, the intermediate sketch object representation to provide the object level-representation for the at least one sketch object and the intermediate non-sketch object representation to provide the object level-representation for the at least one non-sketch object.
The method of claim 10 wherein the at least one non-sketch object is selected from a group that includes: a photographic object; a photo-real object; a cartoon object; and a text object.
The method of claim 10 wherein the at least one non-sketch object is a photographic object.
The method of any one of claims 10 to 12 wherein the first encoder network, second encoder network, and common encoder network each comprises respective sets of network layers of a machine learning (ML) model.
The method of any one of claims 1 to 13 comprising assembling the input image as a composite image that includes the plurality of objects based on inputs received through a user input interface, the inputs including a user indication of at least one crop from photographic image for inclusion as a photographic object in the plurality of objects.
The method of any one of claims 1 to 14 comprising generating the positional information derived from the input image by computing a positional encoding for each object of the plurality of objects based on a grid location of that object based on a grid overlaid on the input image.
A system comprising:

one or more processors;

one or more memories storing instructions executable by the one or more processors, the instructions, when executed, causing the system to perform a method according to any one of claims 1 to 15.
A computer readable memory storing instructions for configuring a computer system to perform a method according to any one of claims 1 to 15.
A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any one of claims 1 to 15.