WO2021042084A1 - Systems and methods for retrieving images using natural language description - Google Patents

Systems and methods for retrieving images using natural language description

Info

Publication number
WO2021042084A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
scene
graph
query
objects
Prior art date
Application number
PCT/US2020/053795
Other languages
French (fr)
Inventor
Ning Yan
Original Assignee
Futurewei Technologies, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Futurewei Technologies, Inc. filed Critical Futurewei Technologies, Inc.
Priority to CN202080101465.XA priority Critical patent/CN115885275A/en
Priority to EP20856170.4A priority patent/EP4154174A4/en
Publication of WO2021042084A1 publication Critical patent/WO2021042084A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5854 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using shape and object relationship
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90332 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2323 Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/768 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/196 Recognition using electronic means using sequential comparisons of the image signals with a plurality of references
    • G06V30/1983 Syntactic or structural pattern recognition, e.g. symbolic string recognition
    • G06V30/1988 Graph matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Definitions

  • This specification generally relates to image processing and searching for images in an image gallery.
  • Implementations of the present disclosure are generally directed to image processing and image gallery queries. More particularly, implementations of the present disclosure are directed to utilizing a machine-learned model to process a repository of images to extract, from each image, objects and relationship features defining relationships between the objects. The extracted objects and relationship features are used to build a scene graph for each of the images, where objects form the nodes and relationship features form the edges between nodes.
  • a searchable index of scene graphs for the repository of images can be generated from the scene graphs.
  • a query for an image can be provided by a user, where the query includes a natural language description of a visual relationship between objects included in an image of interest.
  • a query graph can be generated from the query, where the query graph can be matched to one or more scene graphs in the searchable index of scene graphs. Images corresponding to the one or more matching scene graphs can be provided in response to the query for the image.
  • operations can include obtaining multiple images and generating, for each image in the plurality of images, a scene graph for the image.
  • Generating the scene graph for the image includes identifying, by a machine-learned model, objects in the image, and extracting, by the machine-learned model, a relationship feature defining a relationship between a first object and a second, different object of the objects in the image.
  • the machine-learned model generates, from the objects and the relationship feature, the scene graph for the image that includes a set of nodes and a set of edges that interconnect a subset of nodes in the set of nodes, where the first object is represented by a first node from the set of nodes, the second object is represented by a second node from the set of nodes, and the relationship feature is an edge connecting the first node to the second node.
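  • The scene graph structure just described can be illustrated with a short sketch (not part of the original disclosure): objects become nodes and relationship features become labeled edges between node indices; the class, field, and method names below are assumptions chosen for clarity.

```python
# Minimal sketch (not from the disclosure) of a scene graph as object nodes
# plus labeled edges; class, field, and method names are illustrative only.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneGraph:
    image_id: str                                    # reference back to the source image
    nodes: List[str] = field(default_factory=list)   # detected objects
    edges: List[Tuple[int, str, int]] = field(default_factory=list)  # (node, relationship, node)

    def add_object(self, label: str) -> int:
        """Add an object node and return its index."""
        self.nodes.append(label)
        return len(self.nodes) - 1

    def add_relationship(self, first: int, relationship: str, second: int) -> None:
        """Connect two object nodes with a relationship-feature edge."""
        self.edges.append((first, relationship, second))


# Example: an image of a boy holding a ball and wearing a hat.
graph = SceneGraph(image_id="img_0001.jpg")
boy, ball, hat = graph.add_object("boy"), graph.add_object("ball"), graph.add_object("hat")
graph.add_relationship(boy, "holding", ball)
graph.add_relationship(boy, "wearing", hat)
print(graph.edges)  # [(0, 'holding', 1), (0, 'wearing', 2)]
```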
  • a natural language query request for an image in the plurality of images is received, where the natural language query request includes terms specifying two or more particular objects and a relationship between the two or more particular objects.
  • a query graph is generated for the natural language query request, a set of scene graphs of the scene graphs matching the query graph are identified, and a set of images corresponding to the set of scene graphs are provided for display on a user device.
  • the methods can further include generating, by the data processing apparatus and from the scene graphs, a scene graph index, where identifying the set of scene graphs of the plurality of scene graphs matching the query graph comprises searching the scene graph index.
  • the methods can further include ranking the set of scene graphs matching the query graph, including for each scene graph matching the query graph, assigning a confidence score, and providing a subset of scene graphs each including at least a threshold score.
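  • As a rough illustration of this ranking step (an assumption-laden sketch, not the disclosed implementation), the following keeps only scene graphs whose confidence score meets a threshold and orders them from closest to most distant match; the scores and the threshold value are made up.

```python
# Sketch of thresholding and ranking matched scene graphs by confidence score.
# The scores and the 0.5 threshold are illustrative, not from the disclosure.
from typing import Dict, List, Tuple


def rank_matches(scores: Dict[str, float], threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (scene_graph_id, confidence) pairs meeting the threshold, best first."""
    kept = [(graph_id, score) for graph_id, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


print(rank_matches({"img_0001": 0.92, "img_0002": 0.41, "img_0003": 0.77}))
# [('img_0001', 0.92), ('img_0003', 0.77)]
```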
  • the natural language query request can be a voice query from a user, where generating the query graph includes parsing the voice query into a set of terms.
  • identifying the objects in the image can include generating, by the machine-learned model, a set of bounding boxes, each bounding box encompassing an object in the image, and identifying, by the machine-learned model, the object within the bounding box.
  • the present disclosure also provides a non-transitory computer-readable media coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
  • the present disclosure further provides a system for implementing the methods provided herein.
  • the system includes one or more processors, and a non-transitory computer-readable media device coupled to the one or more processors having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
  • an advantage of this technology is that it can facilitate efficient and accurate discovery of images using natural language descriptions of visual relationships between objects depicted in the images, and may reduce a number of queries entered by a user in order to find a particular image of interest. This in turn reduces the number of computer resources required to execute multiple queries until the appropriate image has been identified.
  • the system can provide a more intuitive interface for end users to find images of interest by using natural language and visual relationship descriptions to search through scene graphs generated from images. Searching through an index of scene graphs can accelerate a querying process, where the query is performed over the scene graphs generated for the images rather than the images themselves, thus reducing the need to iterate and/or search through the images.
  • Deep neural networks and a machine-learned model can be utilized to map images into scene graphs that represent underlying visual relationships.
  • the machine-learned model can be pre-trained using a repository of training images and can be further refined for a particular image gallery of a user to increase accuracy of the determined visual relationships.
  • the system can be used to facilitate discovery of images from various sources, e.g., photographs taken by a user, generated photos, downloaded photos, or the like, as well as images stored in various locations, e.g., on local storage of a user device or a cloud-based server.
  • FIG. 1 depicts an example operating environment of a visual relationship system.
  • FIG. 2A depicts a block diagram of an example embodiment of the visual relationship system.
  • FIG. 2B depicts a block diagram of an example architecture of the visual relationship model.
  • FIG. 3 depicts a block diagram of another example embodiment of the visual relationship system.
  • FIG. 4 depicts a block diagram of example objects and visual relationships determined by the visual relationship system.
  • FIG. 5 is a flow diagram of an example process performed by the visual relationship system for processing images and querying for images.
  • FIG. 6 shows an example of a computing system in which the microprocessor architecture disclosed herein may be implemented.
  • FIG. 7 illustrates a schematic diagram of a general-purpose network component or computer system.
  • Implementations of the present disclosure are generally directed to image processing and image gallery queries. More particularly, implementations of the present disclosure are directed to utilizing a machine-learned model to process a repository of images to extract, from each image, objects and relationship features defining relationships between the objects. The extracted objects and relationship features are used to build a scene graph for each of the images, where objects form the nodes and relationship features form the edges between nodes.
  • a searchable index of scene graphs for the repository of images can be generated from the scene graphs.
  • a query for an image can be utilized to generate a query graph, where the query graph can be matched to one or more scene graphs in the searchable index of scene graphs. Images corresponding to the one or more matching scene graphs can be provided in response to the query for the image.
  • a natural language query including multiple terms that are descriptive of a visual relationship between objects can be provided by a user. Queries can be provided as text queries or voice queries, e.g., through an assistant application on a user device, in which case speech-to-text processing and natural language processing can be applied to the query.
  • a query graph can be generated from the multiple terms of the query, and such a query graph identifies objects and relationship features between the identified objects, as defined by the terms of the query.
  • a search of the index of scene graphs to find matches between the query graph and scene graphs can be performed. As part of this matching, a confidence score between each matched scene graph and the query graph can be assigned and utilized to rank the matched scene graphs.
  • a set of images corresponding to the matched scene graphs can be provided in response to the query, e.g., for display on a user device.
  • an artificial intelligence (AI)-enabled processor chip can be enabled with natural language understanding and integrated with a processor, e.g., a central processing unit (CPU) or a graphics processing unit (GPU), in a “smart” mobile device.
  • the AI-enabled processor chip enabled with natural language understanding can be utilized to receive a natural language voice query and generate, from the natural language voice query, a query graph for the voice query.
  • the AI-chip can be used to accelerate object detection and relationship feature extraction using pre-trained machine-learned models stored locally on the user device and/or on a cloud-based server.
  • FIG. 1 depicts an example operating environment 100 of a visual relationship system 102.
  • Visual relationship system 102 can be hosted on a local device, e.g., user device 104, one or more local servers, a cloud-based service, or a combination thereof.
  • a portion or all of the processes described herein can be hosted on a cloud-based server 103.
  • Visual relationship system 102 can be in data communication with a network 105, where the network 105 can be configured to enable exchange of electronic communication between devices connected to the network 105.
  • visual relationship system 102 is hosted on a cloud-based server 103 where user device 104 can communicate with the visual relationship system 102 via the network 105.
  • the network 105 may include, for example, one or more of the Internet, Wide Area Networks (WANs), Local Area Networks (LANs), analog or digital wired and wireless telephone networks e.g., a public switched telephone network (PSTN), Integrated Services Digital Network (ISDN), a cellular network, and Digital Subscriber Line (DSL), radio, television, cable, satellite, or any other delivery or tunneling mechanism for carrying data.
  • the network may include multiple networks or subnetworks, each of which may include, for example, a wired or wireless data pathway.
  • the network may include a circuit-switched network, a packet-switched data network, or any other network able to carry electronic communications e.g., data or voice communications.
  • the network may include networks based on the Internet protocol (IP), asynchronous transfer mode (ATM), the PSTN, packet-switched networks based on IP, X.25, or Frame Relay, or other comparable technologies and may support voice using, for example, VoIP, or other comparable protocols used for voice communications.
  • the network may include one or more networks that include wireless data channels and wireless voice channels.
  • the network may be a wireless network, a broadband network, or a combination of networks including a wireless network and a broadband network.
  • the network 105 can be accessed over a wired and/or a wireless communications link.
  • mobile computing devices such as smartphones, can utilize a cellular network to access the network 105.
  • User device 104 can host and display an application 110 including an application environment.
  • a user device 104 is a mobile device that hosts one or more native applications, e.g., application 110, that includes an application interface 112, e.g., a graphical user interface, through which a user may interact with the visual relationship system 102.
  • User device 104 can be any appropriate type of computing device such as a desktop computer, a laptop computer, a handheld computer, a tablet computer, a personal digital assistant (PDA), a cellular telephone, a network appliance, a camera, a smart phone, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, an email device, a game console, or an appropriate combination of any two or more of these devices or other data processing devices.
  • the user device 104 may also perform other unrelated functions, such as placing personal telephone calls, playing music, playing video, displaying pictures, browsing the Internet, maintaining an electronic calendar, etc.
  • Application 110 refers to a software/firmware program running on the corresponding mobile device that enables the user interface and features described throughout, and is a system through which the visual relationship system 102 may communicate with the user on user device 104.
  • the user device 104 may load or install the application 110 based on data received over a network or data received from local media.
  • the application 110 runs on mobile device platforms.
  • the user device 104 may receive the data from the visual relationship system 102 through the network 105 and/or the user device 104 may host a portion or all of the visual relationship system 102 on the user device 104.
  • the visual relationship system 102 includes a speech-to-text converter 106 and visual relationship model 108. Though described herein with reference to a speech-to-text converter 106 and visual relationship model 108, the operations described can be performed by more or fewer sub-components.
  • Visual relationship model 108 can be a machine-learned model and can be built using multiple sub-models each implementing machine learning to perform the operations described herein. Further details of the visual relationship model 108 are described with reference to FIG. 2A, 3, and 4.
  • Visual relationship system 102 can obtain, as input, images 114 from an image database 116 including a repository of images 114.
  • Image database 116 can be locally stored on user device 104 and/or stored on cloud-based server 103, where the visual relationship system 102 may access image database 116 via network 105.
  • Image database 116 can include, for example, a user’s collection of photographs captured using a camera on a mobile phone.
  • image database 116 can include a collection of photographs captured by multiple user devices and stored in a remote location, e.g., a cloud server.
  • the visual relationship system 102 can generate, using the visual relationship model 108, scene graphs for a scene graph database 118 as output.
  • Scene graph database 118 can be locally stored on user device 104 and/or stored on cloud-based server 103, where the visual relationship system 102 may access the scene graph database 118 via network 105.
  • Scene graph database 118 can include scene graphs generated for at least a subset of the images 114 in the image database 116. Further details of the generation of scene graphs are described with reference to FIG. 2A.
  • Visual relationship system 102 can receive, from a user on user device 104, a query 120 through application interface 112 as input.
  • Query 120 can be a voice query provided by a user of user device 104 through the application interface 112.
  • Query 120 can be a text-based query entered by a user into application interface 112.
  • Application interface 112 can include a search feature 122 where a user can select to enter a query 120, e.g., a voice query.
  • a user can enter a voice query using an assistant function of the user device 104, which can be activated, e.g., by pressing the microphone button 124 in search feature 122.
  • a user can enter a text query in the text field of the search feature 122.
  • Query 120 can be a natural language query including terms descriptive of a visual relationship between objects that may be included in one or more images 114.
  • a natural language query can include terms that are part of a user’s normal vocabulary and need not include any special syntax or formatting.
  • the natural language query can be entered in various forms, for example, as a statement, a question, or a simple list of keywords.
  • a natural language query is “I want to find a boy holding a ball.”
  • a natural language query is “Where is the photograph of a dog running on the beach?”
  • a natural language query is “Boy holding ball. Boy on beach.”
  • the speech-to-text converter 106 can receive the user’s voice query and parse the user’s voice query into text using voice-to-text techniques and natural language processing.
  • the parsed query can be provided by the speech-to-text converter 106 to the visual relationship model 108 as input.
  • the visual relationship system 102 can provide one or more images 114 responsive to the query 120 as output to the user device 104, for display in the application interface 112 of the application 110.
  • a user can select to enter a query 120, e.g., a text-based query.
  • a user can type a textual query into search feature 122.
  • Query 120 can be a natural language query including terms descriptive of a visual relationship included in one or more images 114.
  • the visual relationship system 102 can receive the textual query as input and utilize natural language processing, e.g., as a function of the AI-based chip, to parse the textual query.
  • the visual relationship system 102 can provide one or more images 114 responsive to the query 120 as output to the user device 104, for display in the application interface 112 of the application 110. Further details of the processes of the visual relationship system 102 are described with reference to FIGS. 2A and 3.
  • FIG. 2A depicts a block diagram 200 of an example embodiment of the visual relationship system 102, and in particular the visual relationship model 108, that generates scene graphs from input images 114.
  • the visual relationship model 108 can be a machine-learned model which may be in turn built utilizing multiple sub-models to perform the actions described herein.
  • Visual relationship model 108 can include deep neural network model(s) where images 114 in the image database 116 are mapped into scene graphs 202 representing the underlying visual relationships.
  • An example architecture for the visual relationship model 108 is described with reference to FIG. 2B below, however, the actions performed by the visual relationship model 108 can be implemented generally to perform the actions described with reference to feature/object extraction 208 and scene graph generation 214.
  • visual relationship model 108, which is part of the visual relationship system 102, can receive images 114 from image database 116 as input and generate a respective scene graph 202 for each image 114 as output for storage in a scene graph database 118.
  • a scene graph 202 is generated for each of a subset of the images 114 in the image database 116, e.g., a subset of the total number of images in the image database 116.
  • a scene graph 202 includes a set of nodes 204 and a set of edges 206 that interconnect a subset of nodes in the set of nodes.
  • Each scene graph 202 can define a set of objects that are represented by respective nodes 204, e.g., where a first object is represented by a first node from the set of nodes, and a second object is represented by a second node from the set of nodes.
  • the first node and the second node can be connected by an edge representing a relationship feature that defines a relationship between the two objects.
  • the visual relationship model 108 can be implemented using one or more deep neural networks.
  • visual relationship model 108 includes machine learning models that are based on one or more pre-trained models which are trained using generic data, e.g., a generic image repository, or user-specific data, e.g., a user’s image library, to generate a scene graph for each image input to the model.
  • the pre-trained models can then be further fine-tuned based on an image database 116, e.g., a user’s image gallery of images or videos.
  • the fine-tuning process can be conducted either on the user device 104 and/or on a cloud-based server 103 depending on, for example, a location of the images 114, and the processing capacity of the user device 104.
  • the initial training can be performed by a machine learning model that is stored in the cloud-based server 103, or another networked location, and then, after completion of training, the model can be provided to a user device 104 for storage and further fine-tuning.
  • the initial training and any subsequent fine tuning may be performed on the user device 104.
  • the initial training and any subsequent fine tuning may be performed on the cloud-based server 103, or another networked location.
  • the visual relationship model 108 can process an obtained image 114 to perform feature/object extraction 208, which in turn can be used to generate a scene graph for the image.
  • feature/object extraction 208 can include identifying, by the visual relationship model 108, objects in the image 114.
  • Identifying objects in the image 114 can include applying bounding boxes 210 to the image 114, where each bounding box 210 encompasses an object appearing in the image 114. For example, multiple bounding boxes 210 can be applied to an image of a boy holding a ball, where a first bounding box can encompass the boy and a second bounding box can encompass the ball. Partial objects can appear in image 114, e.g., a portion of a ball, where a bounding box can be applied to the portion of the object appearing in the image 114. Identifying objects in the image 114 can be performed using various object detection models, for example Mask R-CNN or YOLO.
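  • One possible way to obtain such bounding boxes is sketched below using torchvision's pretrained Mask R-CNN; the library choice, the weights argument, the image file name, and the 0.8 score cutoff are assumptions, since the disclosure names Mask R-CNN and YOLO only generically.

```python
# Sketch of extracting bounding boxes with torchvision's pretrained Mask R-CNN.
# The library choice, weights argument, file name, and 0.8 cutoff are assumptions;
# the disclosure names Mask R-CNN / YOLO only generically.
import torch
import torchvision
from PIL import Image
from torchvision.transforms.functional import to_tensor

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("boy_with_ball.jpg").convert("RGB"))  # hypothetical file
with torch.no_grad():
    detections = model([image])[0]  # dict with "boxes", "labels", "scores", "masks"

# Keep confident detections; each box encompasses one object in the image.
for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    if score >= 0.8:
        print(int(label), [round(v, 1) for v in box.tolist()], round(float(score), 2))
```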
  • identifying objects in the image 114 can be performed using a machine-learned model architecture that can perform object detection and scene graph prediction/generation in a concurrent process.
  • Feature/object extraction 208 can additionally include extracting, by the visual relationship model 108, relationship features 212 defining relationships between objects of the multiple objects in the image 114.
  • each relationship feature 212 defines a relationship between a first object and a second, different object.
  • a relationship feature 212 can be “holding,” where the relationship feature 212 defines a relationship between a first object “boy” and a second object “ball,” to define a visual relationship of “boy” “holding” “ball.”
  • Relationships can be determined by the visual relationship model 108, for example, based in part on proximity/spatial distances between objects, known relationships between categories of objects, user-defined relationships between particular objects and/or categories of objects, or the like.
  • a machine-learned model can be utilized to predict the relationship between detected object pairs.
  • the model may be a single-pass model that completes both object detection and relationship identification at the same time.
  • feature/object extraction to identify objects and define relationships between objects can be performed using a one-pass model where the machine-learned model completes both an object detection process and a relationship identification inference process in a single pass.
  • the visual relationship model 108 is a machine-learned model implemented as a single pass model, which can predict a scene graph for an input image 114 in a single pass.
  • An example architecture 250 for a machine-learned single-pass model is depicted in FIG. 2B.
  • a dual-branch technique can be utilized to perform object detection and relationship feature extraction, e.g., as described with reference to the feature/object extraction 208.
  • Architecture 250 can include ResNet50, HRNet, or another similar convolutional neural network to obtain an image 114 and generate a multi-scale output representing features extracted/generated at multiple scales, e.g., 256x256, 128x128, 64x64, etc.
  • the multiple scale output can be provided as input to a feature pyramid network (FPN)-style structure for processing the multiple scale output.
  • two FPNs can be used to perform object detection and relationship feature extraction, respectively, e.g., as described with reference to feature/object extraction 208; however, more or fewer FPNs can be utilized in the architecture 250.
  • the multiple output relationship prediction tensors of each BiFPN can be utilized as input for multiple convolution and batch normalization layers for predicting the scene graph for the input image.
  • the output of the architecture 250 includes a scene graph, e.g., scene graph 202 generated from input image 114.
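  • The backbone-plus-feature-pyramid idea (though not the disclosed dual-branch/BiFPN design itself) can be sketched with off-the-shelf components; the use of torchvision, the ResNet50 layer names, and the channel sizes below are assumptions.

```python
# Sketch of a backbone producing multi-scale features that an FPN fuses.
# Uses torchvision utilities as assumed tooling; this is not the disclosed
# dual-branch/BiFPN architecture, only the multi-scale feature idea.
from collections import OrderedDict

import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import FeaturePyramidNetwork

backbone = create_feature_extractor(
    resnet50(weights=None),
    return_nodes={"layer1": "c2", "layer2": "c3", "layer3": "c4", "layer4": "c5"},
)
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048], out_channels=256)
backbone.eval()
fpn.eval()

image_batch = torch.randn(1, 3, 512, 512)             # placeholder input image
with torch.no_grad():
    multi_scale = OrderedDict(backbone(image_batch))  # feature maps at several scales
    pyramid = fpn(multi_scale)                        # 256-channel map per scale
for name, feat in pyramid.items():
    print(name, tuple(feat.shape))
```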
  • Visual relationship model 108 predicts a scene graph from the objects extracted within bounding boxes 210 and the relationship features 212, via scene graph generation 214.
  • a scene graph 202 for the image 114 is generated from the objects and relationship features for the image 114, where each object is a node 204 and each relationship feature is an edge 206 connecting at least two nodes 204 together.
  • the scene graph 202 can include each identified object as a node and relationship features between at least two objects as an edge connecting the nodes.
  • a first node can be connected to multiple other different nodes, where each connection is an edge defining a relationship feature between the first node and a second different node of the multiple other nodes.
  • a first node can be “boy,” a second node can be “ball,” and a third node “hat.”
  • the first node and second node can be connected by an edge representing relationship feature “holding,” e.g., “boy holding ball,” and the first node and third node can be connected by an edge representing relationship feature “wearing,” e.g., “boy wearing hat.”
  • a first node may be connected to multiple other different nodes by a same type of relationship feature, where each connection is represented by a separate edge.
  • a first node can be “boy” and a second node can be “ball” and a third node can be “book.”
  • the relationship feature can be “holding” between the first and second nodes, e.g., “boy holding ball,” and can also be “holding” between the first and third nodes, e.g., “boy holding book.”
  • the scene graph 202 can include the three nodes, e.g., “boy” “ball” “book”, and the two edges, e.g., “holding” and “holding”.
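  • A sketch of this “boy holding ball” / “boy holding book” example using the networkx library (a tooling choice not named in the disclosure) appears below; a multigraph allows two separately stored edges that carry the same “holding” label.

```python
# Sketch of the example above with networkx (a tooling assumption): a
# MultiDiGraph stores the two "holding" edges separately, each with its label.
import networkx as nx

scene_graph = nx.MultiDiGraph(image_id="img_0001.jpg")
scene_graph.add_nodes_from(["boy", "ball", "book"])
scene_graph.add_edge("boy", "ball", relationship="holding")
scene_graph.add_edge("boy", "book", relationship="holding")

for subject, obj, attrs in scene_graph.edges(data=True):
    print(subject, attrs["relationship"], obj)
# boy holding ball
# boy holding book
```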
  • the scene graph 202 for the image 114 is stored in scene graph database 118, and includes a reference to the image 114.
  • a scene graph index 216 can be built from the stored scene graphs 202 in the scene graph database 118, which may facilitate matching stored scene graphs 202 to queries using graph indexing techniques.
  • the scene graph index can be a lookup table that identifies each image and its corresponding scene graph, as depicted in FIG. 2A.
  • One example graph indexing technique is gIndex (Graph Indexing: A Frequent Structure-based Approach). More generally, graph indexing techniques based on paths and/or techniques based on structures can be utilized. Reverse indexing techniques may be utilized for scene graph indexing, depending in part on a size of the scene graphs that are generated.
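  • A much-simplified sketch of such an index (far simpler than gIndex or path-based indexing) is shown below: an inverted lookup from (subject, relationship, object) triples to image identifiers; the data and function names are illustrative.

```python
# Simplified sketch of a scene graph index: an inverted lookup from
# (subject, relationship, object) triples to image identifiers. This is far
# simpler than gIndex or path-based indexing and only illustrates the idea.
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def build_index(scene_graphs: Dict[str, List[Triple]]) -> Dict[Triple, Set[str]]:
    """Map each triple to the set of images whose scene graph contains it."""
    index: Dict[Triple, Set[str]] = defaultdict(set)
    for image_id, triples in scene_graphs.items():
        for triple in triples:
            index[triple].add(image_id)
    return index


index = build_index({
    "img_0001.jpg": [("boy", "holding", "ball"), ("boy", "wearing", "hat")],
    "img_0002.jpg": [("dog", "running on", "beach")],
})
print(index[("boy", "holding", "ball")])  # {'img_0001.jpg'}
```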
  • a user can provide a query 120 to the visual relationship system 102, e.g., as a voice query or a text query, via application interface 112.
  • Visual relationship system 102 can process the voice query 120 using a speech-to-text converter 106 to generate a parsed query.
  • speech-to-text converter 106 can transcribe a voice query 120 into textual commands using voice-to-text neural network models, e.g., ALBERT, or another similar neural network model.
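  • As one assumed way to perform this transcription step, the sketch below uses the third-party SpeechRecognition package; the disclosure does not name this library or the audio file used, so both are stand-ins for the speech-to-text converter 106.

```python
# One assumed way to turn a recorded voice query into text, using the
# third-party SpeechRecognition package (not named in the disclosure; it is
# stand-in tooling for the speech-to-text converter 106).
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("voice_query.wav") as source:   # hypothetical recording
    audio = recognizer.record(source)

try:
    query_text = recognizer.recognize_google(audio)  # e.g. "I want a boy holding a ball"
    print(query_text)
except sr.UnknownValueError:
    print("Could not transcribe the voice query.")
```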
  • FIG. 3 depicts a block diagram 300 of another example embodiment of the visual relationship system, where the visual relationship model 108 is utilized to identify scene graphs that match a user input query.
  • a query 302 including terms descriptive of a visual relationship can be provided to the visual relationship system 102.
  • query 302 is a textual query that is generated by the speech-to-text converter 106 from a query 120 received by the visual relationship system 102 from a user on a user device 104.
  • Visual relationship system 102 can receive the query 302 as input and perform feature/object extraction 304 on the query 302 to determine terms of the query 302 defining objects 306 and relationship features 308.
  • Visual relationship system 102 can extract objects 306 and relationship features 308 from the input query 302, for example, by using natural language processing to parse the terms of the query and identify objects/relationship features.
  • natural language processing techniques, e.g., the Python spaCy toolkit, can be used to process the query to extract objects and relationships.
  • a query 302 is “I want a boy holding a ball” where the object-terms are determined as “boy” and “ball” and relationship feature-terms are determined as “holding.”
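  • A hedged sketch of this extraction using spaCy (the toolkit mentioned above) follows; the simple subject-verb-object heuristic and the model name are illustrative assumptions, not the disclosed parser, and handle only straightforward phrasings.

```python
# Sketch of extracting object terms and a relationship term from a query with
# spaCy. The subject-verb-object heuristic is illustrative (it handles simple
# phrasings only) and assumes the en_core_web_sm model is installed.
import spacy

nlp = spacy.load("en_core_web_sm")


def extract_triples(query: str):
    """Return (subject, relationship, object) triples found in the query."""
    doc = nlp(query)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "pobj", "attr")]
            for subj in subjects:
                for obj in objects:
                    triples.append((subj.lemma_, token.lemma_, obj.lemma_))
    return triples


print(extract_triples("A boy is holding a ball"))  # e.g. [('boy', 'hold', 'ball')]
```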
  • the visual relationship model 108 can utilize the extracted objects 306 and relationship features 308 that are defined in the terms of the query 302 to perform query graph generation 310.
  • a query graph 312 can be generated where objects 306 and relationship features 308 extracted from the terms of the query 302 are utilized as nodes 314 and edges 316 between nodes, respectively.
  • a query graph 312 can include a first node “boy” and a second node “ball” with an edge “holding” connecting the first and second nodes 314.
  • the visual relationship system 102 can perform scene graph matching 318 between query graph 312 and scene graphs 202 from scene graph database 118.
  • the matching, which is further described below, between query graph 312 and scene graphs 202 from scene graph database 118 includes searching a scene graph index 216 to retrieve relevant images 114 responsive to query 120.
  • a set of scene graphs 202 that match the query graph 312 are selected from the scene graphs 202 in the scene graph database 118.
  • visual relationship system 102 can utilize one or more relevance models to perform the scene graph matching 318.
  • Scene graphs 202 can be assigned confidence scores, where scene graphs 202 meeting a threshold confidence score to the query graph 312 can be identified.
  • the set of identified scene graphs 202 meeting the threshold confidence score can be ranked, where a first scene graph 202 having a higher confidence score with respect to the query graph 312, e.g., a closer match, can be ranked higher than a second scene graph 202 having a lower confidence score, e.g., a more distant match.
  • Scene graph matching can be exact matching of words, e.g., where a same set of a first node and a second node are connected by a same edge in both the scene graph and the query graph.
  • a scene graph can include a “boy-holding-ball” node1-edge-node2 relationship and the query graph can also include the “boy-holding-ball” relationship.
  • Scene graph matching can alternatively be proximate matching or fuzzy matching, for example, where one or more of the nodes or one or more of the edges between nodes are different between the scene graph and the query graph.
  • Proximate matching can be matching of words based on a semantic distance of the words based on word embedding, e.g., using word2vec or the like.
  • a query graph can include “boy-holding-ball” and an identified scene graph can include “boy-throwing-ball,” where “holding” and “throwing” are determined, e.g., by a pre-generated lexicon, to be within a threshold of matching.
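  • A sketch of such proximate matching of edge labels via embedding similarity is shown below; the use of spaCy word vectors (rather than word2vec specifically), the en_core_web_md model, and the 0.6 threshold are assumptions.

```python
# Sketch of proximate (fuzzy) edge matching via word-embedding similarity.
# spaCy vectors stand in for word2vec here; en_core_web_md (a model with
# vectors) and the 0.6 threshold are assumptions.
import spacy

nlp = spacy.load("en_core_web_md")


def edges_match(query_edge: str, scene_edge: str, threshold: float = 0.6) -> bool:
    """Exact match, or embedding similarity at or above the threshold."""
    if query_edge == scene_edge:
        return True
    return nlp(query_edge).similarity(nlp(scene_edge)) >= threshold


print(edges_match("holding", "holding"))   # exact match -> True
print(edges_match("holding", "throwing"))  # decided by embedding similarity
print(edges_match("holding", "beach"))     # unrelated terms score low
```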
  • Images 114 corresponding to the set of identified scene graphs 202 can be provided for display on the user device, e.g., in application interface 112.
  • the images 114 corresponding to the set of identified scene graphs 202 can be displayed according to a ranking, where an image 114 corresponding to a scene graph 202 with a higher confidence score can be presented at a more prominent location, e.g., at the top of a display, than an image 114 corresponding to a scene graph 202 with a lower confidence score.
  • a set of top-ranked images are provided for display on the user device, e.g., a set of the top 10 ranked images.
  • a user can provide feedback to the visual relationship system 102 to request a range of images to provide in response to a query request, e.g., between 0-25 images.
  • a user may request up to 15 images to be returned in response to a query request.
  • a number of images returned for display on the user device can depend on a pre-defined parameter, e.g., set by the application 110. The number of images displayed may depend on a device screen size, where the number of images is set by the available display space for thumbnail previews of the images.
  • FIG. 4 depicts a block diagram 400 of example objects and visual relationships that are determined/extracted by the visual relationship system.
  • a photograph 402 depicts a woman sitting in a chair that is next to a table, where the table has a book on top of the table.
  • the visual relationship model 108 can receive photograph 402 and determine a set of bounding boxes 404, each bounding box encompassing an object or a portion of an object that appears within the photograph 402.
  • bounding boxes 404 for photograph 402 identify objects 405 including a person, e.g., a woman, a dress, a chair, a book, and a table within the photograph 402.
  • Relationship features 406 can include natural language terms.
  • Relationship features 406 for the photograph 402 can include, for example, “next to,” “on,” and “wearing.”
  • a visual relationship can be defined as “table next to chair” where “table” and “chair” are objects 405 and “next to” is a relationship feature 406 between the objects 405.
  • the extracted objects 405 and relationship features 406 can be utilized by the visual relationship model 108 to generate a scene graph, e.g., scene graph 202, for the photograph 402, e.g., image 114.
  • text descriptive of the semantics of an image 114 can be utilized instead of a scene graph 202 generated for the image 114.
  • text describing the objects and relationship features within the image 114 can be associated with the image 114.
  • an image including a boy holding a ball can be associated, e.g., tagged or otherwise assigned to, terms including “boy”, “holding”, “ball”, “boy holding a ball”, and “boy holding ball”.
  • a neural network model can map an image into text descriptions, for example, using image captioning techniques.
  • a semantic language search can be performed of the descriptive texts for each image of the image database 116.
  • FIG. 5 is a flow diagram of an example process 500 of the visual relationship system 102 for processing images and querying for images.
  • Operations of process 500 are described below as being performed by the visual relationship system described and depicted in Figures 1-3. Operations of the process 500 are described below for illustration purposes only. Operations of the process 500 can be performed by any appropriate device or system, e.g., any appropriate data processing apparatus, such as, e.g., the visual relationship system or the user device 104. Operations of the process 500 can also be implemented as instructions stored on a non-transitory computer readable medium. Execution of the instructions causes one or more data processing apparatus to perform operations of the process 500.
  • Images are obtained (502).
  • Images 114 from an image database 116 can be obtained by the visual relationship system 102.
  • an image 114 is obtained by the visual relationship system 102 when the image is captured and/or saved into the image database 116.
  • images 114 from the image database 116 can be periodically obtained by the visual relationship system 102 for processing, e.g., when the user device 104 is connected to power, when a memory use of the user device 104 is below a threshold activity, etc.
  • images 114 are stored locally on the user device 104, e.g., in the memory of a mobile phone. Images 114 can additionally or alternatively be stored on a cloud-based server 103, which is in data communication with user device 104 via a network 105. Images 114 can be, for example, documents including visual representations, e.g., photographs captured by a camera of the user device 104. In general, documents can be processed by the visual relationship system 102 including, for example, documents in portable document format (PDF), graphics interchange format (GIF), portable network graphics (PNG), joint photographic experts group (JPEG), or another format for visual-based documents.
  • each image 114 can be received by the visual relationship model 108 as input and a scene graph 202 can be generated for the image 114.
  • objects are identified in the image (504).
  • Visual relationship model 108 can receive image 114 and perform feature/object extraction 208 on the image 114.
  • Object extraction can include applying bounding boxes 210 to the image, where each bounding box 210 encompasses an object or encompasses a portion of an object appearing within the image.
  • bounding boxes 404 can each define an object 405, e.g., table, woman, book, etc., that appear in the photograph 402.
  • Object detection models, e.g., Mask R-CNN, YOLO, or single shot detector (SSD), can be utilized to identify objects in the image.
  • a relationship feature is extracted from the image, where the relationship feature defines a relationship between a first object and a second, different object in the image (506).
  • a relationship feature can be extracted by the visual relationship model 108, e.g., using deep neural networks, and defines a relationship between at least two objects that appear within the image. Extraction of relationship features from the image can be built into the visual relationship model as a part of an end-to-end output.
  • the relationship feature can include one or more terms defining the relationship between a first object and a second object.
  • a relationship feature 406 can include a term or set of terms, for example, “next to” and “wearing” where the terms define how a first object relates to a second object.
  • a scene graph is generated from the identified objects and the extracted relationship features (508).
  • a scene graph e.g., scene graph 202 depicted in FIG. 2A and illustrated in FIG. 4, is generated by the visual relationship model 108, where each object is defined as a node 204 and each relationship feature as an edge 206 connecting two nodes 204 in the scene graph 202.
  • a first node 204 can be connected to a second node via a first edge 206 and connected to a third node via a second edge 206.
  • the generated scene graph 202 is stored in a scene graph database 118, e.g., locally on the user device 104 and/or on a cloud-based server 103 in data communication with the user device 104 via network 105.
  • Each generated scene graph 202 can include a reference to the particular image 114 from which it is generated, e.g., an identifier referencing the image 114 or a storage location of the image 114 in image database 116.
  • the scene graph database 118 can be indexed to generate a scene graph index 216, which may be utilized for searching the scene graph database 118 for a particular set of scene graphs 202.
  • a natural language query request that requests one or more images from the image database 116 is received, where the natural language query request specifies two or more objects and one or more relationships between the two or more objects (510).
  • the natural language query request can include a set of terms descriptive of one or more objects and one or more relationships between the objects in an image that the user is interested in viewing.
  • the natural language query request can be “I want to find a woman sitting on a chair,” where the objects are “woman” and “chair” and the relationship between the objects is “sitting on.”
  • a natural language query request can be “Find the photo of me hiking on Mount St. Helens,” where the objects are “me [the user]” and “Mount St. Helens,” and the relationship between the objects is “hiking on.”
  • a speech-to-text converter receives a voice query and converts it into a text-based query that can be provided to the visual relationship model 108.
  • Speech-to-text converter 106 can be a part of the visual relationship system 102, or can be a function of a digital assistant or another application 110 located on the user device 104.
  • Visual relationship system 102 can receive the textual query from a speech-to-text converter 106, e.g., query 302, and perform feature/object extraction, e.g., feature/object extraction 304, to extract objects and relationship features, e.g., objects 306 and relationship features 308 included in the query.
  • a query graph is generated for the natural language query request (512).
  • Query graph generation, e.g., query graph generation 310, can be performed using the objects and relationship features extracted from the query to produce a query graph, e.g., query graph 312, in which the objects are nodes and the relationship features are edges connecting the nodes.
  • a set of scene graphs matching the query graph are identified from the multiple scene graphs (514).
  • Scene graph matching, e.g., scene graph matching 318, identifies a set of scene graphs 202 that match the query graph 312, for example, by searching a scene graph index 216 for scene graphs 202 that match, e.g., as an exact match or a proximate/fuzzy match, the query graph 312.
  • each scene graph 202 in the scene graph database 118 can be assigned a confidence score with respect to the query graph 312, e.g., a trueness of the match, and only those scene graphs with a confidence score that satisfies, e.g., meets or exceeds, a threshold confidence score are included in the set of scene graphs.
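  • One simple way such a confidence score could be computed (an illustrative assumption, not the disclosed relevance model) is the fraction of query-graph triples present in a scene graph, filtered by a threshold, as sketched below.

```python
# Hedged sketch of assigning a confidence score to each scene graph: here the
# score is simply the fraction of query triples present in the scene graph
# (exact matching for brevity; proximate matching could be substituted).
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def confidence(query_triples: List[Triple], scene_triples: List[Triple]) -> float:
    if not query_triples:
        return 0.0
    hits = sum(1 for triple in query_triples if triple in scene_triples)
    return hits / len(query_triples)


def matching_graphs(query_triples: List[Triple],
                    scene_graphs: Dict[str, List[Triple]],
                    threshold: float = 0.5) -> Dict[str, float]:
    """Return scene graphs whose confidence score meets or exceeds the threshold."""
    scores = {image_id: confidence(query_triples, triples)
              for image_id, triples in scene_graphs.items()}
    return {image_id: score for image_id, score in scores.items() if score >= threshold}


print(matching_graphs(
    [("boy", "holding", "ball")],
    {"img_0001.jpg": [("boy", "holding", "ball"), ("boy", "wearing", "hat")],
     "img_0002.jpg": [("dog", "running on", "beach")]},
))
# {'img_0001.jpg': 1.0}
```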
  • a set of images corresponding to the set of scene graphs are provided for display on a user device (516).
  • a set of images corresponding to the set of scene graphs e.g., images 114 corresponding to scene graphs 202, can be identified from the image database 116.
  • Each scene graph of the set of scene graphs can include a reference to a particular image from which the scene graph was generated, e.g., a reference to a storage location, a unique identifier, or the like.
  • the set of images can be identified in the image database 116 and provided for display by the visual relationship system 102 to an application 110, e.g., a photo gallery application, on the user device 104.
  • the set of images can be displayed in an application interface of an application, e.g., application interface 112 of application 110, on the user device 104.
  • the set of images can be presented for display with respect to a ranking for each image with respect to each other image in the set of images.
  • a first image corresponding to a scene graph having a higher confidence score can be presented in a more prominent position in the application interface 112, e.g., at the top of the displayed results, than a second image corresponding to a scene graph having a lower confidence score.
  • FIG. 6 shows an example of a computing system in which the microprocessor architecture disclosed herein may be implemented.
  • the computing system 600 includes at least one processor 602, which could be a single central processing unit (CPU) or an arrangement of multiple processor cores of a multi-core architecture.
  • the processor 602 includes a pipeline 604, an instruction cache 606, and a data cache 608 (and other circuitry, not shown).
  • the processor 602 is connected to a processor bus 610, which enables communication with an external memory system 612 and an input/output (I/O) bridge 614.
  • the I/O bridge 614 enables communication over an I/O bus 616, with various different I/O devices 618A-618D (e.g., disk controller, network interface, display adapter, and/or user input devices such as a keyboard or mouse).
  • the external memory system 612 is part of a hierarchical memory system that includes multi-level caches, including the first level (L1) instruction cache 606 and data cache 608, and any number of higher level (L2, L3, . . . ) caches within the external memory system 612.
  • Other circuitry (not shown) in the processor 602 supporting the caches 606 and 608 includes a translation lookaside buffer (TLB) and various other circuitry for handling a miss in the TLB or the caches 606 and 608.
  • the TLB is used to translate an address of an instruction being fetched or data being referenced from a virtual address to a physical address, and to determine whether a copy of that address is in the instruction cache 606 or data cache 608, respectively.
  • the external memory system 612 also includes a main memory interface 620, which is connected to any number of memory modules (not shown) serving as main memory (e.g., Dynamic Random Access Memory modules).
  • FIG. 7 illustrates a schematic diagram of a general-purpose network component or computer system.
  • the general-purpose network component or computer system includes a processor 702 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 704, and memory, such as ROM 706 and RAM 708, input/output (I/O) devices 710, and a network 712, such as the Internet or any other well-known type of network, which may include network connectivity devices, such as a network interface.
  • the processor 702 is not so limited and may comprise multiple processors.
  • the processor 702 may be implemented as one or more CPU chips, cores (e.g., a multi-core processor), FPGAs, ASICs, and/or DSPs, and/or may be part of one or more ASICs.
  • the processor 702 may be configured to implement any of the schemes described herein.
  • the processor 702 may be implemented using hardware, software, or both.
  • the secondary storage 704 typically comprises one or more disk drives or tape drives and is used for non-volatile storage of data and as an overflow data storage device if the RAM 708 is not large enough to hold all working data.
  • the secondary storage 704 may be used to store programs that are loaded into the RAM 708 when such programs are selected for execution.
  • the ROM 706 is used to store instructions and perhaps data that are read during program execution.
  • the ROM 706 is a non-volatile memory device that typically has a small memory capacity relative to the larger memory capacity of the secondary storage 704.
  • the RAM 708 is used to store volatile data and perhaps to store instructions. Access to both the ROM 706 and the RAM 708 is typically faster than to the secondary storage 704.
  • At least one of the secondary storage 704 or RAM 708 may be configured to store routing tables, forwarding tables, or other tables or information disclosed herein.
  • the technology described herein can be implemented using hardware, firmware, software, or a combination of these.
  • the software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein.
  • the processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media.
  • computer readable media may comprise computer readable storage media and communication media.
  • Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CDROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
  • a computer readable medium or media does (do) not include propagated, modulated or transitory signals.
  • Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
  • some or all of the software can be replaced by dedicated hardware logic components.
  • illustrative types of hardware logic components include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc.
  • the one or more processors can be in communication with one or more computer readable media/storage devices, peripherals and/or communication interfaces.
  • each process associated with the disclosed technology may be performed continuously and by one or more computing devices.
  • Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.
  • Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Discrete Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

Implementations are directed to methods, systems, and computer-readable media for obtaining images and generating, for each image in the images, a scene graph for the image. Generating the scene graph for the image includes identifying objects in the image, and extracting a relationship feature defining a relationship between a first object and a second, different object of the objects in the image. The scene graph for the image is generated that includes a set of nodes and a set of edges. A natural language query request for an image is received, including terms defining a relationship between two or more particular objects. A query graph is generated for the natural language query request, and a set of images corresponding to a set of scene graphs matching the query graph is provided for display on a user device.

Description

SYSTEMS AND METHODS FOR RETREIVING IMAGES USING NATURAL
LANGUAGE DESCRIPTION
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims priority to U.S. Application No. 63/032,569, filed May 30, 2020, the disclosure of which is incorporated by reference in its entirety.
TECHNICAL FIELD
[0002] This specification generally relates to image processing and searching for images in an image gallery.
BACKGROUND
[0003] Searching for particular images within image galleries containing large numbers of images can be time consuming and can result in search results containing images that are unresponsive or irrelevant to a search query submitted by a user.
SUMMARY
[0004] Implementations of the present disclosure are generally directed to image processing and image gallery queries. More particularly, implementations of the present disclosure are directed to utilizing a machine-learned model to process a repository of images to extract, from each image, objects and relationship features defining relationships between the objects. The extracted objects and relationship features are used to build a scene graph for each of the images, where objects form the nodes and relationship features form the edges between nodes. A searchable index of scene graphs for the repository of images can be generated from the scene graphs. A query for an image can be provided by a user, where the query includes a natural language description of a visual relationship between objects included in an image of interest. A query graph can be generated from the query, where the query graph can be matched to one or more scene graphs in the searchable index of scene graphs. Images corresponding to the one or more matching scene graphs can be provided in response to the query for the image.
[0005] In some implementations, operations can include obtaining a plurality of images and generating, for each image in the plurality of images, a scene graph for the image. Generating the scene graph for the image includes identifying, by a machine-learned model, objects in the image, and extracting, by the machine-learned model, a relationship feature defining a relationship between a first object and a second, different object of the objects in the image. The machine-learned model generates, from the objects and the relationship feature, the scene graph for the image that includes a set of nodes and a set of edges that interconnect a subset of nodes in the set of nodes, where the first object is represented by a first node from the set of nodes, the second object is represented by a second node from the set of nodes, and the relationship feature is an edge connecting the first node to the second node. A natural language query request for an image in the plurality of images is received, where the natural language query request includes terms specifying two or more particular objects and a relationship between the two or more particular objects. A query graph is generated for the natural language query request, a set of scene graphs of the scene graphs matching the query graph are identified, and a set of images corresponding to the set of scene graphs are provided for display on a user device.
[0006] Other implementations of this aspect include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices.
[0007] These and other implementations can each optionally include one or more of the following features. In some implementations, the methods can further include generating, by the data processing apparatus and from the scene graphs, a scene graph index, where identifying the set of scene graphs of the plurality of scene graphs matching the query graph comprises searching the scene graph index.
[0008] In some implementations, the methods can further include ranking the set of scene graphs matching the query graph, including for each scene graph matching the query graph, assigning a confidence score, and providing a subset of scene graphs each including at least a threshold score.
[0009] In some implementations, the natural language query request can be a voice query from a user, where generating the query graph includes parsing the voice query into a set of terms.
[0010] In some implementations, identifying the objects in the image can include generating, by the machine-learned model, a set of bounding boxes, each bounding box encompassing an object in the image, and identifying, by the machine-learned model, the object within the bounding box.
[0011] The present disclosure also provides a non-transitory computer-readable media coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
[0012] The present disclosure further provides a system for implementing the methods provided herein. The system includes one or more processors, and a non-transitory computer-readable media device coupled to the one or more processors having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
[0013] Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages. For example, an advantage of this technology is that it can facilitate efficient and accurate discovery of images using natural language descriptions of visual relationships between objects depicted in the images, and may reduce a number of queries entered by a user in order to find a particular image of interest. This in turn reduces the number of computer resources required to execute multiple queries until the appropriate image has been identified.
[0014] The system can provide a more intuitive interface for end users to find images of interest by using natural language and visual relationship descriptions to search through scene graphs generated from images. Searching through an index of scene graphs can accelerate a querying process, where the query is performed over the scene graphs generated for the images rather than the images themselves, thus reducing the need to iterate and/or search through the images. Deep neural networks and a machine-learned model can be utilized to map images into scene graphs that represent underlying visual relationships. The machine-learned model can be pre-trained using a repository of training images and can be further refined for a particular image gallery of a user to increase accuracy of the determined visual relationships.
[0015] The system can be used to facilitate discovery of images from various sources, e.g., photographs taken by a user, generated photos, downloaded photos, or the like, as well as images stored in various locations, e.g., on local storage of a user device or a cloud-based server.
[0016] It is appreciated that methods in accordance with the present disclosure can include any combination of the aspects and features described herein. That is, methods in accordance with the present disclosure are not limited to the combinations of aspects and features specifically described herein, but also include any combination of the aspects and features provided.
[0017] The details of one or more implementations of the present disclosure are set forth in the accompanying drawings and the description below. Other features and advantages of the present disclosure will be apparent from the description and drawings, and from the claims.
BRIEF DESCRIPTION OF DRAWINGS
[0018] FIG. 1 depicts an example operating environment of a visual relationship system.
[0019] FIG. 2A depicts a block diagram of an example embodiment of the visual relationship system.
[0020] FIG. 2B depicts a block diagram of an example architecture of the visual relationship model.
[0021] FIG. 3 depicts a block diagram of another example embodiment of the visual relationship system.
[0022] FIG. 4 depicts a block diagram of example objects and visual relationships determined by the visual relationship system.
[0023] FIG. 5 is a flow diagram of an example process performed by the visual relationship system for processing images and querying for images.
[0024] FIG. 6 shows an example of a computing system in which the microprocessor architecture disclosed herein may be implemented.
[0025] FIG. 7 illustrates a schematic diagram of a general-purpose network component or computer system.
DETAILED DESCRIPTION
Overview
[0026] Implementations of the present disclosure are generally directed to image processing and image gallery queries. More particularly, implementations of the present disclosure are directed to utilizing a machine-learned model to process a repository of images to extract, from each image, objects and relationship features defining relationships between the objects. The extracted objects and relationship features are used to build a scene graph for each of the images, where objects form the nodes and relationship features form the edges between nodes. A searchable index of scene graphs for the repository of images can be generated from the scene graphs. A query for an image can be utilized to generate a query graph, where the query graph can be matched to one or more scene graphs in the searchable index of scene graphs. Images corresponding to the one or more matching scene graphs can be provided in response to the query for the image.
[0027] A natural language query including multiple terms that are descriptive of a visual relationship between objects can be provided by a user. Queries can be provided as text queries or voice queries, e.g., through an assistant application on a user device, in which case speech-to-text processing and natural language processing can be applied to the query. A query graph can be generated from the multiple terms of the query, and such a query graph identifies objects and relationship features between the identified objects, as defined by the terms of the query.
[0028] A search of the index of scene graphs to find matches between the query graph and scene graphs can be performed. As part of this matching, a confidence score between each matched scene graph and the query graph can be assigned and utilized to rank the matched scene graphs. A set of images corresponding to the matched scene graphs can be provided in response to the query, e.g., for display on a user device.
[0029] In some implementations, an artificial intelligence (AI)-enabled processor chip can be enabled with natural language understanding and integrated with a processor, e.g., a central processing unit (CPU) or a graphics processing unit (GPU), in a “smart” mobile device. The AI-enabled processor chip enabled with natural language understanding can be utilized to receive a natural language voice query and generate, from the natural language voice query, a query graph for the voice query. The AI chip can be used to accelerate object detection and relationship feature extraction using pre-trained machine-learned models stored locally on the user device and/or on a cloud-based server.
Example Operating Environment
[0030] FIG. 1 depicts an example operating environment 100 of a visual relationship system 102. Visual relationship system 102 can be hosted on a local device, e.g., user device 104, one or more local servers, a cloud-based service, or a combination thereof. In some implementations, a portion or all of the processes described herein can be hosted on a cloud-based server 103.
[0031] Visual relationship system 102 can be in data communication with a network 105, where the network 105 can be configured to enable exchange of electronic communication between devices connected to the network 105. In some implementations, visual relationship system 102 is hosted on a cloud-based server 103 where user device 104 can communicate with the visual relationship system 102 via the network 105.
[0032] The network 105 may include, for example, one or more of the Internet, Wide Area Networks (WANs), Local Area Networks (LANs), analog or digital wired and wireless telephone networks e.g., a public switched telephone network (PSTN), Integrated Services Digital Network (ISDN), a cellular network, and Digital Subscriber Line (DSL), radio, television, cable, satellite, or any other delivery or tunneling mechanism for carrying data. The network may include multiple networks or subnetworks, each of which may include, for example, a wired or wireless data pathway. The network may include a circuit-switched network, a packet-switched data network, or any other network able to carry electronic communications e.g., data or voice communications. For example, the network may include networks based on the Internet protocol (IP), asynchronous transfer mode (ATM), the PSTN, packet-switched networks based on IP, X.25, or Frame Relay, or other comparable technologies and may support voice using, for example, VoIP, or other comparable protocols used for voice communications. The network may include one or more networks that include wireless data channels and wireless voice channels. The network may be a wireless network, a broadband network, or a combination of networks including a wireless network and a broadband network. In some implementations, the network 105 can be accessed over a wired and/or a wireless communications link. For example, mobile computing devices, such as smartphones, can utilize a cellular network to access the network 105.
[0033] User device 104 can host and display an application 110 including an application environment. For example, a user device 104 is a mobile device that hosts one or more native applications, e.g., application 110, that includes an application interface 112, e.g., a graphical user interface, through which a user may interact with the visual relationship system 102. User device 104 can include any appropriate type of computing device such as a desktop computer, a laptop computer, a handheld computer, a tablet computer, a personal digital assistant (PDA), a cellular telephone, a network appliance, a camera, a smart phone, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, an email device, a game console, or an appropriate combination of any two or more of these devices or other data processing devices. In addition to performing functions related to the visual relationship system 102, the user device 104 may also perform other unrelated functions, such as placing personal telephone calls, playing music, playing video, displaying pictures, browsing the Internet, maintaining an electronic calendar, etc.
[0034] Application 110 refers to a software/firmware program running on the corresponding mobile device that enables the user interface and features described throughout, and is a system through which the visual relationship system 102 may communicate with the user on user device 104. The user device 104 may load or install the application 110 based on data received over a network or data received from local media. The application 110 runs on mobile device platforms. The user device 104 may receive the data from the visual relationship system 102 through the network 105 and/or the user device 104 may host a portion or all of the visual relationship system 102 on the user device 104.
[0035] The visual relationship system 102 includes a speech-to-text converter 106 and visual relationship model 108. Though described herein with reference to a speech-to-text converter 106 and visual relationship model 108, the operations described can be performed by more or fewer sub-components. Visual relationship model 108 can be a machine-learned model and can be built using multiple sub-models each implementing machine learning to perform the operations described herein. Further details of the visual relationship model 108 are described with reference to FIG. 2A, 3, and 4.
[0036] Visual relationship system 102 can obtain, as input, images 114 from an image database 116 including a repository of images 114. Image database 116 can be locally stored on user device 104 and/or stored on cloud-based server 103, where the visual relationship system 102 may access image database 116 via network 105. Image database 116 can include, for example, a user’s collection of photographs captured using a camera on a mobile phone. As another example, image database 116 can include a collection of photographs captured by multiple user devices and stored in a remote location, e.g., a cloud server.
[0037] The visual relationship system 102 can generate, using the visual relationship model 108, scene graphs for a scene graph database 118 as output. Scene graph database 118 can be locally stored on user device 104 and/or stored on cloud-based server 103, where the visual relationship system 102 may access the scene graph database 118 via network 105. Scene graph database 118 can include scene graphs generated for at least a subset of the images 114 in the image database 116. Further details of the generation of scene graphs are described with reference to FIG. 2A.
[0038] Visual relationship system 102 can receive, from a user on user device 104, a query 120 through application interface 112 as input. Query 120 can be a voice query provided by a user of user device 104 through the application interface 112. Query 120 can be a text-based query entered by a user into application interface 112.
[0039] Application interface 112 can include a search feature 122 where a user can select to enter a query 120, e.g., a voice query. In one example, a user can enter a voice query using an assistant function of the user device 104, which can be activated, e.g., by pressing the microphone button 124 in search feature 122. In another example, a user can enter a text query in the text field of the search feature 122.
[0040] Query 120 can be a natural language query including terms descriptive of a visual relationship between objects that may be included in one or more images 114. A natural language query can include terms that are part of a user’s normal vocabulary and not include any special syntax or formatting. The natural language query can be entered in various forms, for example, as a statement, a question, or a simple list of keywords. In one example, a natural language query is “I want to find a boy holding a ball.” In another example, a natural language query is “Where is the photograph of a dog running on the beach?” In yet another example, a natural language query is “Boy holding ball. Boy on beach.”
[0041] The speech-to-text converter 106 can receive the user’s voice query and parse the user’s voice query into text using voice-to-text techniques and natural language processing. The parsed query can be provided by the speech-to-text converter 106 to the visual relationship model 108 as input. In response to user-input query 120, the visual relationship system 102 can provide one or more images 114 responsive to the query 120 as output to the user device 104, for display in the application interface 112 of the application 110.
[0042] In some implementations, a user can select to enter a query 120, e.g., a text-based query. For example, a user can type a textual query into search feature 122. Query 120 can be a natural language query including terms descriptive of a visual relationship included in one or more images 114. The visual relationship system 102 can receive the textual query as input and utilize natural language processing, e.g., as a function of the AI-based chip, to parse the textual query. In response to the user-input query 120, the visual relationship system 102 can provide one or more images 114 responsive to the query 120 as output to the user device 104, for display in the application interface 112 of the application 110. Further details of the processes of the visual relationship system 102 are described with reference to FIGS. 2A and 3.
[0043] FIG. 2A depicts a block diagram 200 of an example embodiment of the visual relationship system 102, and in particular the visual relationship model 108, that generates scene graphs from input images 114.
[0044] The visual relationship model 108 can be a machine-learned model which may be in turn built utilizing multiple sub-models to perform the actions described herein. Visual relationship model 108 can include deep neural network model(s) where images 114 in the image database 116 are mapped into scene graphs 202 representing the underlying visual relationships. An example architecture for the visual relationship model 108 is described with reference to FIG. 2B below, however, the actions performed by the visual relationship model 108 can be implemented generally to perform the actions described with reference to feature/object extraction 208 and scene graph generation 214.
[0045] As depicted in FIG. 2A, and as briefly described with reference to FIG. 1, visual relationship model 108, which is part of the visual relationship system 102, can receive images 114 from image database 116 as input and generate a respective scene graph 202 for each image 114 as output for storage in a scene graph database 118. In some implementations, a scene graph 202 is generated for each of a subset of the images 114 in the image database 116, e.g., a subset of the total number of images in the image database 116.
[0046] A scene graph 202 includes a set of nodes 204 and a set of edges 206 that interconnect a subset of nodes in the set of nodes. Each scene graph 202 can define a set of objects that are represented by respective nodes 204, e.g., where a first object is represented by a first node from the set of nodes, and a second object is represented by a second node from the set of nodes. The first node and the second node can be connected by an edge representing a relationship feature that defines a relationship between the two objects.
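The node-and-edge structure described above can be represented in memory as a small set of typed records. The following Python sketch is illustrative only and is not taken from the disclosure; the class names, fields, and example objects are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SceneGraphNode:
    """A detected object, e.g., "boy" or "ball"."""
    node_id: int
    label: str

@dataclass
class SceneGraphEdge:
    """A relationship feature connecting two nodes, e.g., "holding"."""
    subject_id: int   # first node (e.g., "boy")
    predicate: str    # relationship feature (e.g., "holding")
    object_id: int    # second node (e.g., "ball")

@dataclass
class SceneGraph:
    """Scene graph for a single image, keyed by a reference to that image."""
    image_ref: str
    nodes: List[SceneGraphNode] = field(default_factory=list)
    edges: List[SceneGraphEdge] = field(default_factory=list)

    def triples(self):
        """Yield (subject label, predicate, object label) triples for indexing and matching."""
        by_id = {n.node_id: n.label for n in self.nodes}
        for e in self.edges:
            yield (by_id[e.subject_id], e.predicate, by_id[e.object_id])

# Example: "boy holding ball" and "boy wearing hat" in one image.
g = SceneGraph(image_ref="IMG_0001.jpg",
               nodes=[SceneGraphNode(0, "boy"), SceneGraphNode(1, "ball"), SceneGraphNode(2, "hat")],
               edges=[SceneGraphEdge(0, "holding", 1), SceneGraphEdge(0, "wearing", 2)])
print(list(g.triples()))  # [('boy', 'holding', 'ball'), ('boy', 'wearing', 'hat')]
```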
[0047] The visual relationship model 108 can be implemented using one or more deep neural networks. In some implementations, visual relationship model 108 includes machine learning models that are based on one or more pre-trained models which are trained using generic data, e.g., a generic image repository, or user-specific data, e.g., a user's image library, to generate a scene graph for each image input into the model. The pre-trained models can then be further fine-tuned based on an image database 116, e.g., a user's image gallery of images or videos. The fine-tuning process can be conducted either on the user device 104 and/or on a cloud-based server 103 depending on, for example, a location of the images 114, and the processing capacity of the user device 104. Thus, in some implementations, the initial training can be performed by a machine learning model that is stored in the cloud-based server 103, or another networked location, and then, after completion of training, can be provided for storage and further fine-tuning to a user device 104. Alternatively, the initial training and any subsequent fine-tuning may be performed on the user device 104. Alternatively, the initial training and any subsequent fine-tuning may be performed on the cloud-based server 103, or another networked location.
[0048] In some implementations, after the visual relationship model has been initially trained and/or fine-tuned, the visual relationship model 108 can process an obtained image 114 to perform feature/object extraction 208, which in turn can be used to generate a scene graph for the image. In one example, a user's image gallery on a mobile device or a cloud-based image gallery including a set of photographs can be analyzed by the visual relationship model 108 to generate respective scene graphs 202 that are descriptive of the visual relationships within the images 114.
[0049] Feature/object extraction 208 can include identifying, by the visual relationship model 108, objects in the image 114. Identifying objects in the image 114 can include applying bounding boxes 210 to the image 114, where each bounding box 210 encompasses an object appearing in the image 114. For example, multiple bounding boxes 210 can be applied to an image of a boy holding a ball, where a first bounding box can encompass the boy and a second bounding box can encompass the ball. Partial objects can appear in image 114, e.g., a portion of a ball, where a bounding box can be applied to the portion of the object appearing in the image 114. Identifying objects in the image 114 can be performed using various object detection models, for example Mask R-CNN or YOLO. In some embodiments, identifying objects in the image 114 can be performed using a machine-learned model architecture that can perform object detection and scene graph prediction/generation in a concurrent process. For example, a feature pyramid network (FPN) can be utilized to aggregate multi-scale information that is derived from a ResNet50 backbone that is applied to an input image 114.
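As a concrete illustration of the bounding-box step, the sketch below runs an off-the-shelf torchvision Mask R-CNN detector over an image and keeps high-confidence boxes and labels. It is a stand-in for the detector described above, not the disclosure's implementation; the model choice, image file name, torchvision version assumption, and score threshold are all illustrative.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pretrained detector as a stand-in for the object-detection branch; assumes torchvision >= 0.13.
weights = torchvision.models.detection.MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=weights)
model.eval()

image = Image.open("IMG_0001.jpg").convert("RGB")  # hypothetical image path
with torch.no_grad():
    prediction = model([to_tensor(image)])[0]

# Keep detections above a confidence threshold; each box/label pair is a candidate scene graph node.
categories = weights.meta["categories"]
for box, label, score in zip(prediction["boxes"], prediction["labels"], prediction["scores"]):
    if float(score) >= 0.7:
        print(categories[int(label)], [round(v, 1) for v in box.tolist()], round(float(score), 2))
```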
[0050] Feature/object extraction 208 can additionally include extracting, by the visual relationship model 108, relationship features 212 defining relationships between objects of the multiple objects in the image 114. In some implementations, each relationship feature 212 defines a relationship between a first object and a second, different object. For example, a relationship feature 212 can be “holding,” where the relationship feature 212 defines a relationship between a first object “boy” and a second object “ball,” to define a visual relationship of “boy” “holding” “ball.” Relationships can be determined by the visual relationship model 108, for example, based in part on proximity/spatial distances between objects, known relationships between categories of objects, user-defined relationships between particular objects and/or categories of objects, or the like.
[0051] In some implementations, a machine-learned model can be utilized to predict the relationship between detected object pairs. For example, the model may be a single-pass model that completes both object detection and relationship identification at the same time. In other words, feature/object extraction to identify objects and define relationships between objects can be performed using a one-pass model where the machine-learned model completes both an object detection process and a relationship identification inference process in a single pass.
[0052] In some implementations, the visual relationship model 108 is a machine-learned model implemented as a single-pass model, which can predict a scene graph for an input image 114 in a single pass. An example architecture 250 for a machine-learned single-pass model is depicted in FIG. 2B.
[0053] As depicted in the architecture 250, a dual-branch technique can be utilized to perform object detection and relationship feature extraction, e.g., as described with reference to the feature/object extraction 208. Architecture 250 can include ResNet50, HRNet, or another similar convolutional neural network to obtain an image 114 and generate a multiple scale output representing features extracted/generated from multiple scalings of an original output, e.g., 256x256, 128x128, 64x64, etc. The multiple scale output can be provided as input to a feature pyramid network (FPN)-style structure for processing the multiple scale output. In the example depicted in FIG. 2B, two FPNs (each individually referred to as an FPN or a BiFPN) can be used to perform object detection and relationship feature extraction, respectively, e.g., as described with reference to feature/object extraction 208; however, more or fewer FPNs can be utilized in the architecture 250. The multiple output relationship prediction tensors of each BiFPN can be utilized as input for multiple convolution and batch normalization layers for predicting the scene graph for the input image. The output of the architecture 250 includes a scene graph, e.g., scene graph 202 generated from input image 114.
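To make the dual-branch, single-pass idea concrete, the following PyTorch skeleton sketches a network with a shared backbone feeding an object head and a relationship head. It is a toy stand-in rather than architecture 250 itself: the plain convolutional stacks in place of ResNet50/HRNet and BiFPN branches, the layer sizes, and the output tensor shapes are all assumptions.

```python
import torch
import torch.nn as nn

class SinglePassSceneGraphNet(nn.Module):
    """Toy single-pass model: shared backbone, then two branches (objects, relationships)."""

    def __init__(self, num_object_classes: int = 80, num_predicates: int = 50):
        super().__init__()
        # Stand-in backbone (the disclosure describes ResNet50 or HRNet feeding FPN/BiFPN branches).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
        )
        # Branch 1: per-location object class scores (simplified detection head).
        self.object_head = nn.Conv2d(128, num_object_classes, kernel_size=1)
        # Branch 2: per-location relationship (predicate) scores.
        self.relation_head = nn.Conv2d(128, num_predicates, kernel_size=1)

    def forward(self, images: torch.Tensor):
        features = self.backbone(images)
        return {"objects": self.object_head(features),
                "relationships": self.relation_head(features)}

# A single forward pass yields both object and relationship predictions.
net = SinglePassSceneGraphNet()
out = net(torch.randn(1, 3, 256, 256))
print(out["objects"].shape, out["relationships"].shape)
```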
[0054] Visual relationship model 108 predicts a scene graph from the extracted objects in bounding boxes 210 and the relationship features 212, via scene graph generation 214. A scene graph 202 for the image 114 is generated from the objects and relationship features for the image 114, where each object is a node 204 and each relationship feature is an edge 206 connecting at least two nodes 204 together. The scene graph 202 can include each identified object as a node and relationship features between at least two objects as an edge connecting the nodes. A first node can be connected to multiple other different nodes, where each connection is an edge defining a relationship feature between the first node and a second different node of the multiple other nodes. For example, a first node can be “boy,” a second node can be “ball,” and a third node “hat.” The first node and second node can be connected by an edge representing relationship feature “holding,” e.g., “boy holding ball,” and the first node and third node can be connected by an edge representing relationship feature “wearing,” e.g., “boy wearing hat.”
[0055] In some implementations, a first node may be connected to multiple other different nodes by a same type of relationship feature, where each connection is represented by a separate edge. For example, a boy can be holding a ball and a book in an image. A first node can be “boy” and a second node can be “ball” and a third node can be “book.” The relationship feature can be “holding” between the first and second nodes, e.g., “boy holding ball,” and can also be “holding” between the first and third nodes, e.g., “boy holding book.” The scene graph 202 can include the three nodes, e.g., “boy” “ball” “book”, and the two edges, e.g., “holding” and “holding”.
[0056] The scene graph 202 for the image 114 is stored in scene graph database 118, and includes a reference to the image 114. A scene graph index 216 can be built from the stored scene graphs 202 in the scene graph database 118, which may facilitate matching stored scene graphs 202 to queries using graph indexing techniques. As one example, the scene graph index can be a lookup table that identifies each image and its corresponding scene graph, as depicted in FIG. 2A.
[0057] Various graph indexing techniques can be utilized, for example, Graph Indexing: A Frequent Structure-based Approach (gIndex). More generally, graph indexing techniques based on paths and/or techniques based on structures can be utilized. Reverse indexing techniques may be utilized for scene graph indexing, depending in part on a size of the scene graphs that are generated.
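One simple realization of a scene graph index, shown below, is an inverted index from (subject, predicate, object) triples to image references; the structure-based techniques named above (e.g., gIndex) would replace this lookup with frequent-substructure keys. The function name and dictionary layout are illustrative assumptions, and the sketch reuses the SceneGraph objects from the earlier sketch.

```python
from collections import defaultdict

def build_scene_graph_index(scene_graphs):
    """Map each (subject, predicate, object) triple to the images whose scene graphs contain it."""
    index = defaultdict(set)
    for graph in scene_graphs:          # e.g., SceneGraph objects from the earlier sketch
        for triple in graph.triples():
            index[triple].add(graph.image_ref)
    return index

# Query-time lookup: exact-match candidate images for one query triple.
# index = build_scene_graph_index(all_scene_graphs)
# candidates = index.get(("boy", "holding", "ball"), set())
```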
[0058] Referring back to FIG. 1, a user can provide a query 120 to the visual relationship system 102, e.g., as a voice query or a text query, via application interface 112. Visual relationship system 102 can process the voice query 120 using a speech-to-text converter 106 to generate a parsed query. In some implementations, speech-to-text converter 106 can transcribe a voice query 120 into textual commands using voice-to-text neural network models, e.g., ALBERT, or another similar neural network model.
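For completeness, a minimal speech-to-text step could look like the sketch below, which uses the open-source SpeechRecognition package as a stand-in for the neural speech-to-text models referenced above; the package, recognizer backend, and audio file name are assumptions and are not part of the disclosure.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("voice_query.wav") as source:   # hypothetical recorded voice query
    audio = recognizer.record(source)

# Transcribe the voice query into text for downstream query graph generation.
# Note: recognize_google() sends the audio to a hosted service and requires network access.
query_text = recognizer.recognize_google(audio)
print(query_text)  # e.g., "I want to find a boy holding a ball"
```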
[0059] FIG. 3 depicts a block diagram 300 of another example embodiment of the visual relationship system, where the visual relationship model 108 is utilized to identify scene graphs that match a user input query.
[0060] A query 302 including terms descriptive of a visual relationship can be provided to the visual relationship system 102. In some implementations, query 302 is a textual query that is generated by the speech-to-text converter 106 from a query 120 received by the visual relationship system 102 from a user on a user device 104.
[0061] Visual relationship system 102 can receive the query 302 as input and perform feature/object extraction 304 on the query 302 to determine terms of the query 302 defining objects 306 and relationship features 308. Visual relationship system 102 can extract objects 306 and relationship features 308 from the input query 302, for example, by using natural language processing to parse the terms of the query and identify objects/relationship features. In one example, natural language processing techniques, e.g., the Python Spacy toolkit, can be used to process the query to extract objects and relationships. In one example, a query 302 is “I want a boy holding a ball” where the object-terms are determined as “boy” and “ball” and relationship feature-terms are determined as “holding.”
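A lightweight way to pull object terms and relationship terms out of such a query with the Python spaCy toolkit is sketched below; the dependency labels consulted and the handling of participial phrases are heuristic assumptions rather than the disclosure's parser.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English pipeline is installed

def extract_query_triple(query_text):
    """Heuristically extract one (subject, relationship, object) triple from a natural language query."""
    doc = nlp(query_text)
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        if not subjects and token.dep_ in ("acl", "relcl"):
            subjects = [token.head]            # e.g., "a boy holding a ball" -> subject "boy"
        objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
        for prep in (c for c in token.children if c.dep_ == "prep"):
            objects.extend(c for c in prep.children if c.dep_ == "pobj")   # e.g., "sitting on a chair"
        if subjects and objects:
            return subjects[0].lemma_, token.lemma_, objects[0].lemma_
    return None

# Expected to yield roughly ("boy", "hold", "ball"), depending on the parser version.
print(extract_query_triple("I want to find a boy holding a ball"))
```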
[0062] The visual relationship model 108 can utilize the extracted objects 306 and relationship features 308 that are defined in the terms of the query 302 to perform query graph generation 310. A query graph 312 can be generated where objects 306 and relationship features 308 extracted from the terms of the query 302 are utilized as nodes 314 and edges 316 between nodes, respectively. Continuing the example provided above, a query graph 312 can include a first node “boy” and a second node “ball” with an edge “holding” connecting the first and second nodes 314.
[0063] The visual relationship system 102 can perform scene graph matching 318 between query graph 312 and scene graphs 202 from scene graph database 118. In some implementations, the matching, which is further described below, between query graph 312 and scene graphs 202 from scene graph database 118 includes searching a scene graph index 216 to retrieve relevant images 114 responsive to query 120. A set of scene graphs 202 that match the query graph 312 are selected from the scene graphs 202 in the scene graph database 118.
[0064] In some implementations, visual relationship system 102 can utilize one or more relevance models to perform the scene graph matching 318. Scene graphs 202 can be assigned confidence scores, where scene graphs 202 meeting a threshold confidence score with respect to the query graph 312 can be identified. The set of identified scene graphs 202 meeting the threshold confidence score can be ranked, where a first scene graph 202 having a higher confidence score with respect to the query graph 312, e.g., a closer match, can be ranked higher than a second scene graph 202 having a lower confidence score, e.g., a more distant match. Scene graph matching can be exact matching of words, e.g., where a same set of a first node and a second node are connected by a same edge in both the scene graph and the query graph. For example, a scene graph can include a “boy-holding-ball” node1-edge-node2 relationship and the query graph can also include the “boy-holding-ball” relationship. Scene graph matching can alternatively be proximate matching or fuzzy matching, for example, where one or more of the nodes or one or more of the edges between nodes are different between the scene graph and the query graph. Proximate matching can be matching of words based on a semantic distance of the words based on word embedding, e.g., using word2vec or the like. For example, a query graph can include “boy-holding-ball” and an identified scene graph can include “boy-throwing-ball,” where “holding” and “throwing” are determined, e.g., by a pre-generated lexicon, to be within a threshold of matching.
[0065] Images 114 corresponding to the set of identified scene graphs 202 can be provided for display on the user device, e.g., in application interface 112. The images 114 corresponding to the set of identified scene graphs 202 can be displayed according to a ranking, where an image 114 corresponding to a scene graph 202 with a higher confidence score can be presented at a more prominent location, e.g., at the top of a display, than an image 114 corresponding to a scene graph 202 with a lower confidence score.
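The matching and scoring described in the preceding paragraphs can be approximated as triple-to-triple comparison, as in the sketch below. A tiny hand-written similarity table stands in for word2vec-style embedding distances or a pre-generated lexicon, and the weights and threshold are arbitrary assumptions; the sketch again reuses the earlier SceneGraph objects.

```python
# Stand-in for embedding- or lexicon-based predicate similarity.
PREDICATE_SIMILARITY = {("holding", "throwing"): 0.7,
                        ("holding", "carrying"): 0.8}

def predicate_similarity(p, q):
    if p == q:
        return 1.0
    return PREDICATE_SIMILARITY.get((p, q)) or PREDICATE_SIMILARITY.get((q, p)) or 0.0

def triple_confidence(query_triple, scene_triple):
    """Score one scene graph triple against one query triple (objects exact, predicate fuzzy)."""
    qs, qp, qo = query_triple
    ss, sp, so = scene_triple
    object_score = (qs == ss) + (qo == so)              # 0, 1, or 2 matching objects
    return (object_score / 2.0) * 0.6 + predicate_similarity(qp, sp) * 0.4

def match_scene_graph(query_triple, scene_graph, threshold=0.7):
    """Return the best confidence of a scene graph for the query, or None if below threshold."""
    best = max((triple_confidence(query_triple, t) for t in scene_graph.triples()), default=0.0)
    return best if best >= threshold else None

# Ranking: keep scene graphs whose confidence meets the threshold, highest first.
# matches = sorted(((g, match_scene_graph(("boy", "holding", "ball"), g)) for g in scene_graphs),
#                  key=lambda pair: pair[1] or 0.0, reverse=True)
```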
[0066] In some implementations, a set of top-ranked images are provided for display on the user device, e.g., a set of the top 10 ranked images. A user can provide feedback to the visual relationship system 102 to request a range of images to provide in response to a query request, e.g., between 0-25 images. In one example, a user may request up to 15 images to be returned in response to a query request. In some implementations, a number of images returned for display on the user device can depend on a pre-defined parameter, e.g., set by the application 110. The number of images displayed may depend on a device screen size, where the number of images is set by the available display space for thumbnail previews of the images.
[0067] As described with reference to FIG. 2A, the visual relationship model 108 can perform feature/object extraction 208 on an image 114. FIG. 4 depicts a block diagram 400 of example objects and visual relationships that are determined/extracted by the visual relationship system.
[0068] As depicted in FIG. 4, a photograph 402, e.g., image 114, depicts a woman sitting in a chair that is next to a table, where the table has a book on top of the table. The visual relationship model 108 can receive photograph 402 and determine a set of bounding boxes 404, each bounding box encompassing an object or a portion of an object that appears within the photograph 402. For example, bounding boxes 404 for photograph 402 identify objects 405 including a person, e.g., a woman, a dress, a chair, a book, and a table within the photograph 402.
[0069] Each of the identified objects 405 is encompassed by a bounding box and can be associated, e.g., linked, with one or more of the other identified objects 405 using a relationship feature, e.g., from among a set of relationship features 406, where each of the relationship features 406 describes a relationship between a pair of objects. Relationship features 406 can include natural language terms. Relationship features 406 for the photograph 402 can include, for example, “next to,” “on,” and “wearing.” In one example, a visual relationship can be defined as “table next to chair” where “table” and “chair” are objects 405 and “next to” is a relationship feature 406 between the objects 405.
[0070] An example of a scene graph is depicted in FIG. 4, showing multiple objects as nodes that are connected by relationship features as edges. An object, e.g., “woman”, can be connected to multiple other objects, e.g., “chair,” “dress,” and “table”, via respective relationship features 406, e.g., “on,” “wearing,” and “next to.” The extracted objects 405 and relationship features 406 can be utilized by the visual relationship model 108 to generate a scene graph, e.g., scene graph 202, for the photograph 402, e.g., image 114.
[0071] In some implementations, text descriptive of the semantics of an image 114 can be utilized instead of a scene graph 202 generated for the image 114. In other words, text describing the objects and relationship features within the image 114 can be associated with the image 114. For example, an image including a boy holding a ball can be associated, e.g., tagged or otherwise assigned to, terms including “boy”, “holding”, “ball”, “boy holding a ball”, and “boy holding ball”. In some implementations, a neural network model can map an image into text descriptions, for example, using image captioning techniques. A semantic language search can be performed of the descriptive texts for each image of the image database 116.
Example Process of the Visual Relationship System
[0072] FIG. 5 is a flow diagram of an example process 500 of the visual relationship system 102 for processing images and querying for images. Operations of process 500 are described below as being performed by the visual relationship system described and depicted in Figures 1-3. Operations of the process 500 are described below for illustration purposes only. Operations of the process 500 can be performed by any appropriate device or system, e.g., any appropriate data processing apparatus, such as, e.g., the visual relationship system or the user device 104. Operations of the process 500 can also be implemented as instructions stored on a non-transitory computer readable medium. Execution of the instructions causes one or more data processing apparatus to perform operations of the process 500.
[0073] Images are obtained (502). Images 114 from an image database 116 can be obtained by the visual relationship system 102. In some implementations, an image 114 is obtained by the visual relationship system 102 when the image is captured and/or saved into the image database 116. In some implementations, images 114 from the image database 116 can be periodically obtained by the visual relationship system 102 for processing, e.g., when the user device 104 is connected to power, when memory use of the user device 104 is below a threshold, etc.
[0074] In some implementations, images 114 are stored locally on the user device 104, e.g., in the memory of a mobile phone. Images 114 can additionally or alternatively be stored on a cloud-based server 103, which is in data communication with user device 104 via a network 105. Images 114 can be, for example, documents including visual representations, e.g., photographs captured by a camera of the user device 104. In general, documents can be processed by the visual relationship system 102 including, for example, documents in portable document format (PDF), graphics interchange format (GIF), portable network graphics (PNG), joint photographic experts group (JPEG), or another format for visual-based documents.
[0075] In some implementations, the operations described below with reference to steps 504 through 508 can be performed on each image of a repository of images in an image database 116. Alternatively, the operations described below with reference to steps 504-508 can be performed on each image in a subset of images taken from the repository of images. As described above with reference to FIG. 2A, each image 114 can be received by the visual relationship model 108 as input and a scene graph 202 can be generated for the image 114.
[0076] For each image, objects are identified in the image (504). Visual relationship model 108 can receive image 114 and perform feature/object extraction 208 on the image 114. Object extraction can include applying bounding boxes 210 to the image, where each bounding box 210 encompasses an object or encompasses a portion of an object appearing within the image. As depicted in FIG. 4, bounding boxes 404 can each define an object 405, e.g., table, woman, book, etc., that appears in the photograph 402. Object detection models, e.g., Mask R-CNN, YOLO, single shot detector (SSD), can be utilized to identify objects in the image.
[0077] Referring back to FIG. 5, for each image, a relationship feature is extracted from the image, where the relationship feature defines a relationship between a first object and a second, different object in the image (506). A relationship feature can be extracted by the visual relationship model 108, e.g., using deep neural networks, and defines a relationship between at least two objects that appear within the image. Extraction of relationship features from the image can be built into the visual relationship model as a part of an end-to-end output. The relationship feature can include one or more terms defining the relationship between a first object and a second object. As depicted in FIG. 4, a relationship feature 406 can include a term or set of terms, for example, “next to” and “wearing” where the terms define how a first object relates to a second object.
[0078] Referring back to FIG. 5, for each image, a scene graph is generated from the identified objects and the extracted relationship features (508). A scene graph, e.g., scene graph 202 depicted in FIG. 2A and illustrated in FIG. 4, is generated by the visual relationship model 108, where each object is defined as a node 204 and each relationship feature as an edge 206 connecting two nodes 204 in the scene graph 202. In some implementations, a first node 204 can be connected to a second node via a first edge 206 and connected to a third node via a second edge 206.
[0079] The generated scene graph 202 is stored in a scene graph database 118, e.g., locally on the user device 104 and/or on a cloud-based server 103 in data communication with the user device 104 via network 105. Each generated scene graph 202 can include a reference to the particular image 114 from which it is generated, e.g., an identifier referencing the image 114 or a storage location of the image 114 in image database 116. The scene graph database 118 can be indexed to generate a scene graph index 216, which may be utilized for searching the scene graph database 118 for a particular set of scene graphs 202.
[0080] Referring back to FIG. 5, a natural language query request that requests one or more images from the image database 116 is received, where the natural language query request specifies two or more objects and one or more relationships between the two or more objects (510). A natural language query request, e.g., query 120, can be a voice query provided by a user of a user device 104, for example, through an application interface 112 of an application 110 and/or through a digital assistant on the user device 104. The natural language query request can include a set of terms descriptive of one or more objects and one or more relationships between the objects in an image that the user is interested in viewing. For example, the natural language query request can be “I want to find a woman sitting on a chair,” where the objects are “woman” and “chair” and the relationship between the objects is “sitting on.” In another example, a natural language query request can be “Find the photo of me hiking on Mount St. Helens,” where the objects are “me [the user]” and “Mount St. Helens,” and the relationship between the objects is “hiking on.”
[0081] In some implementations, a speech-to-text converter, e.g., speech-to-text converter 106, receives a voice query and converts it into a text-based query that can be provided to the visual relationship model 108. Speech-to-text converter 106 can be a part of the visual relationship system 102, or can be a function of a digital assistant or another application 110 located on the user device 104.
[0082] Visual relationship system 102 can receive the textual query from a speech-to-text converter 106, e.g., query 302, and perform feature/object extraction, e.g., feature/object extraction 304, to extract objects and relationship features, e.g., objects 306 and relationship features 308 included in the query.
[0083] Referring now to FIG. 5, a query graph is generated for the natural language query request (512). Query graph generation, e.g., query graph generation 310, can be performed by the visual relationship system 102 using the extracted objects and relationship features from the user-provided query. A query graph, e.g., query graph 312, can be generated that includes a graph-based representation of the query 302, in which each object 306 is represented by a node 314 and each relationship feature 308 is represented by an edge 316 connecting a first node to a second node.
[0084] Referring back to FIG. 5, a set of scene graphs matching the query graph are identified from the multiple scene graphs (514). Scene graph matching, e.g., scene graph matching 318, can be performed by the visual relationship system 102 in which the query graph 312 is compared to the scene graphs 202 in the scene graph database 118. As described with reference to FIG. 3, a set of scene graphs 202 from among the scene graphs that match the query graph 312 are identified, for example, by searching a scene graph index 216 for scene graphs 202 that match the query graph 312, e.g., as an exact match or a proximate/fuzzy match. In some implementations, based on the matching, each scene graph 202 in the scene graph database 118 can be assigned a confidence score with respect to the query graph 312, e.g., a closeness of the match, and only those scene graphs with a confidence score that satisfies, e.g., meets or exceeds, a threshold confidence score are included in the set of scene graphs.
[0085] Referring now to FIG. 5, a set of images corresponding to the set of scene graphs are provided for display on a user device (516). A set of images corresponding to the set of scene graphs, e.g., images 114 corresponding to scene graphs 202, can be identified from the image database 116. Each scene graph of the set of scene graphs can include a reference to a particular image from which the scene graph was generated, e.g., a reference to a storage location, a unique identifier, or the like. The set of images can be identified in the image database 116 and provided for display by the visual relationship system 102 to an application 110, e.g., a photo gallery application, on the user device 104.
[0086] The set of images can be displayed in an application interface of an application, e.g., application interface 112 of application 110, on the user device 104. In some implementations, the set of images can be presented for display with respect to a ranking for each image with respect to each other image in the set of images. In one example, a first image corresponding to a scene graph having a higher confidence score can be presented in a more prominent position in the application interface 112, e.g., at the top of the displayed results, than a second image corresponding to a scene graph having a lower confidence score.
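As an illustration of this ranking and display step, the short sketch below orders matched images by confidence and trims the list to a display budget; the function name, the budget value, and the (image_ref, confidence) pairing are assumptions rather than the disclosure's interface.

```python
def select_images_for_display(matches, max_results=10):
    """matches: iterable of (image_ref, confidence) pairs for scene graphs that met the threshold."""
    ranked = sorted(matches, key=lambda pair: pair[1], reverse=True)
    # Highest-confidence images come first, so they occupy the most prominent display positions.
    return [image_ref for image_ref, _ in ranked[:max_results]]

# Example: select_images_for_display([("IMG_0001.jpg", 0.92), ("IMG_0042.jpg", 0.74)], max_results=10)
```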
[0087] FIG. 6 shows an example of a computing system in which the microprocessor architecture disclosed herein may be implemented. The computing system 600 includes at least one processor 602, which could be a single central processing unit (CPU) or an arrangement of multiple processor cores of a multi-core architecture. In the depicted example, the processor 602 includes a pipeline 604, an instruction cache 606, and a data cache 608 (and other circuitry, not shown). The processor 602 is connected to a processor bus 610, which enables communication with an external memory system 612 and an input/output (I/O) bridge 614. The I/O bridge 614 enables communication over an I/O bus 616, with various different I/O devices 618A-618D (e.g., disk controller, network interface, display adapter, and/or user input devices such as a keyboard or mouse).
[0088] The external memory system 612 is part of a hierarchical memory system that includes multi-level caches, including the first level (L1) instruction cache 606 and data cache 608, and any number of higher level (L2, L3, ...) caches within the external memory system 612. Other circuitry (not shown) in the processor 602 supporting the caches 606 and 608 includes a translation lookaside buffer (TLB) and various other circuitry for handling a miss in the TLB or the caches 606 and 608. For example, the TLB is used to translate an address of an instruction being fetched or data being referenced from a virtual address to a physical address, and to determine whether a copy of that address is in the instruction cache 606 or data cache 608, respectively. If so, that instruction or data can be obtained from the L1 cache. If not, that miss is handled by miss circuitry so that it may be executed from the external memory system 612. It is appreciated that the division between which level caches are within the processor 602 and which are in the external memory system 612 can differ in various examples. For example, an L1 cache and an L2 cache may both be internal and an L3 (and higher) cache could be external. The external memory system 612 also includes a main memory interface 620, which is connected to any number of memory modules (not shown) serving as main memory (e.g., Dynamic Random Access Memory modules).
[0089] FIG. 7 illustrates a schematic diagram of a general-purpose network component or computer system. The general-purpose network component or computer system includes a processor 702 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 704, and memory, such as ROM 706 and RAM 708, input/output (I/O) devices 710, and a network 712, such as the Internet or any other well-known type of network, that may include network connectivity devices, such as a network interface. Although illustrated as a single processor, the processor 702 is not so limited and may comprise multiple processors. The processor 702 may be implemented as one or more CPU chips, cores (e.g., a multi-core processor), FPGAs, ASICs, and/or DSPs, and/or may be part of one or more ASICs. The processor 702 may be configured to implement any of the schemes described herein. The processor 702 may be implemented using hardware, software, or both.
[0092] The secondary storage 704 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an over-flow data storage device if the RAM 708 is not large enough to hold all working data. The secondary storage 704 may be used to store programs that are loaded into the RAM 708 when such programs are selected for execution. The ROM 706 is used to store instructions and perhaps data that are read during program execution. The ROM 706 is a non-volatile memory device that typically has a small memory capacity relative to the larger memory capacity of the secondary storage 704. The RAM 708 is used to store volatile data and perhaps to store instructions. Access to both the ROM 706 and the RAM 708 is typically faster than to the secondary storage 704. At least one of the secondary storage 704 or RAM 708 may be configured to store routing tables, forwarding tables, or other tables or information disclosed herein.
[0093] It is understood that by programming and/or loading executable instructions onto the node 700, at least one of the processor 720 or the memory 722 are changed, transforming the node 700 in part into a particular machine or apparatus, e.g., a router, having the novel functionality taught by the present disclosure. Similarly, it is understood that by programming and/or loading executable instructions onto the node 700, at least one of the processor 702, the ROM 706, and the RAM 708 are changed, transforming the node 700 in part into a particular machine or apparatus, e.g., a router, having the novel functionality taught by the present disclosure. It is fundamental to the electrical engineering and software engineering arts that functionality that can be implemented by loading executable software into a computer can be converted to a hardware implementation by well-known design rules. Decisions between implementing a concept in software versus hardware typically hinge on considerations of stability of the design and numbers of units to be produced rather than any issues involved in translating from the software domain to the hardware domain. Generally, a design that is still subject to frequent change may be preferred to be implemented in software, because re-spinning a hardware implementation is more expensive than re-spinning a software design.
[0094] Generally, a design that is stable that will be produced in large volume may be preferred to be implemented in hardware, for example in an ASIC, because for large production runs the hardware implementation may be less expensive than the software implementation. Often a design may be developed and tested in a software form and later transformed, by well-known design rules, to an equivalent hardware implementation in an application specific integrated circuit that hardwires the instructions of the software. In the same manner as a machine controlled by a new ASIC is a particular machine or apparatus, likewise a computer that has been programmed and/or loaded with executable instructions may be viewed as a particular machine or apparatus.
[0095] The technology described herein can be implemented using hardware, firmware, software, or a combination of these. The software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein. The processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer readable storage media and communication media. Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CDROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. A computer readable medium or media does (do) not include propagated, modulated or transitory signals.
[0096] Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
[0097] In alternative embodiments, some or all of the software can be replaced by dedicated hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc. In one embodiment, software (stored on a storage device) implementing one or more embodiments is used to program one or more processors. The one or more processors can be in communication with one or more computer readable media/ storage devices, peripherals and/or communication interfaces.
[0098] It is understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. Indeed, the subject matter is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be clear to those of ordinary skill in the art that the present subject matter may be practiced without such specific details.
[0099] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
[00100] The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.
[00101] For purposes of this disclosure, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.

[00102] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
[00103] While this specification contains many specifics, these should not be construed as limitations on the scope of the disclosure or of what may be claimed, but rather as descriptions of features specific to particular implementations. Certain features that are described in this specification in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation may also be implemented in multiple implementations separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.

[00104] Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems may generally be integrated together in a single software product or packaged into multiple software products.

[00105] A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. For example, various forms of the flows shown above may be used, with steps re-ordered, added, or removed. Accordingly, other implementations are within the scope of the following claims.
[00106] What is claimed is:


CLAIMS:
1. A computer-implemented method comprising:
    generating, by a data processing apparatus and for each image in a plurality of images, a scene graph for the image, wherein generating the scene graph for the image comprises:
        identifying, by a machine-learned model, a plurality of objects in the image;
        extracting, by the machine-learned model, a relationship feature defining a relationship between a first object and a second, different object of the plurality of objects in the image; and
        generating, by the machine-learned model and from the plurality of objects and the relationship feature, the scene graph for the image that includes a set of nodes and a set of edges that interconnect a subset of nodes in the set of nodes, wherein the first object is represented by a first node from the set of nodes, the second object is represented by a second node from the set of nodes, and the relationship feature is an edge connecting the first node to the second node;
    receiving, by the data processing apparatus, a natural language query request for an image in the plurality of images, wherein the natural language query request comprises a plurality of terms specifying two or more particular objects and a relationship between the two or more particular objects;
    generating, by the data processing apparatus, a query graph for the natural language query request;
    identifying, by the data processing apparatus and from a plurality of scene graphs generated for the plurality of images, a set of scene graphs of the plurality of scene graphs matching the query graph; and
    providing, by the data processing apparatus and for display on a user device, a set of images corresponding to the set of scene graphs.
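For illustration only, the following is a minimal Python sketch of the data structures implied by claim 1: a scene graph whose nodes are detected objects and whose edges are relationship triples, and a query graph matched against it by triple containment. The object labels, relationship phrases, image identifiers, and the containment-based matching rule are assumptions chosen for clarity, not the claimed machine-learned implementation.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

@dataclass
class SceneGraph:
    """Nodes are detected objects; each edge is a (subject, predicate, object)
    relationship triple, e.g. ("dog", "sitting on", "couch")."""
    image_id: str
    nodes: Set[str] = field(default_factory=set)
    edges: Set[Tuple[str, str, str]] = field(default_factory=set)

    def add_relationship(self, subject: str, predicate: str, obj: str) -> None:
        # The subject and object become nodes; the predicate labels the edge.
        self.nodes.update({subject, obj})
        self.edges.add((subject, predicate, obj))

def matches(scene: SceneGraph, query_edges: Set[Tuple[str, str, str]]) -> bool:
    """A scene graph matches when every triple of the query graph appears in it."""
    return query_edges.issubset(scene.edges)

# Two images whose scene graphs were produced upstream, and a query graph
# built from the request "a dog sitting on a couch".
dog_scene = SceneGraph("img_001")
dog_scene.add_relationship("dog", "sitting on", "couch")
dog_scene.add_relationship("couch", "next to", "window")

cat_scene = SceneGraph("img_002")
cat_scene.add_relationship("cat", "sitting on", "couch")

query_graph = {("dog", "sitting on", "couch")}
print([s.image_id for s in (dog_scene, cat_scene) if matches(s, query_graph)])
# -> ['img_001']
```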
2. The method of claim 1, further comprising: generating, by the data processing apparatus and from the plurality of scene graphs, a scene graph index, wherein identifying the set of scene graphs of the plurality of scene graphs matching the query graph comprises searching the scene graph index.
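The scene graph index recited in claim 2 can be pictured as an inverted index from relationship triples to image identifiers, so that query evaluation inspects only candidate images rather than every scene graph. The sketch below is a simplified illustration using plain dictionaries and set intersection, not the indexing scheme actually used; the triples and image identifiers are placeholders.

```python
from collections import defaultdict

def build_index(scene_graphs):
    """scene_graphs maps image_id -> set of (subject, predicate, object) triples.
    The index maps each triple to the images whose scene graph contains it."""
    index = defaultdict(set)
    for image_id, triples in scene_graphs.items():
        for triple in triples:
            index[triple].add(image_id)
    return index

def search(index, query_triples):
    """Intersect the posting sets of all query triples; only images whose
    scene graph contains every query triple survive."""
    postings = [index.get(t, set()) for t in query_triples]
    return set.intersection(*postings) if postings else set()

scene_graphs = {
    "img_001": {("dog", "sitting on", "couch"), ("couch", "next to", "window")},
    "img_002": {("cat", "sitting on", "couch")},
}
index = build_index(scene_graphs)
print(search(index, {("dog", "sitting on", "couch")}))  # -> {'img_001'}
```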
3. The method of claim 1, further comprising: ranking the set of scene graphs matching the query graph, comprising: for each scene graph matching the query graph, assigning a confidence score; and providing a subset of scene graphs each including at least a threshold score.
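The ranking of claim 3 can be illustrated by filtering matched scene graphs against a confidence threshold and sorting the survivors. How the confidence score is computed is not specified here; the fixed 0.5 threshold and the (image_id, score) pairs below are assumptions used only for the example.

```python
def rank_matches(scored_matches, threshold=0.5):
    """scored_matches: iterable of (image_id, confidence) pairs, where the
    confidence might combine scores for the matched objects and relationships.
    Keep matches at or above the threshold, best first."""
    kept = [(image_id, score) for image_id, score in scored_matches
            if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

print(rank_matches([("img_001", 0.92), ("img_007", 0.41), ("img_003", 0.66)]))
# -> [('img_001', 0.92), ('img_003', 0.66)]
```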
4. The method of claim 1, wherein the natural language query request is a voice query from a user, and wherein generating the query graph comprises parsing the voice query into a set of terms.
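As a rough illustration of claim 4, a transcribed voice query can be parsed into terms and a single relationship triple with a toy keyword-based parser. A real system would rely on speech recognition plus proper natural-language parsing; the relation vocabulary and stop-word list below are invented for the example and deliberately minimal.

```python
RELATION_PHRASES = ("sitting on", "standing on", "next to", "holding", "riding")
STOP_WORDS = {"a", "an", "the", "photo", "picture", "of", "show", "me", "with"}

def parse_query(transcript: str):
    """Toy parser: find a known relation phrase and treat the nearest content
    words on either side as the subject and object of one query triple."""
    text = transcript.lower()
    for phrase in RELATION_PHRASES:
        if phrase in text:
            left, right = text.split(phrase, 1)
            subjects = [w for w in left.split() if w not in STOP_WORDS]
            objects_ = [w for w in right.split() if w not in STOP_WORDS]
            if subjects and objects_:
                return {(subjects[-1], phrase, objects_[0])}
    return set()

print(parse_query("show me a photo of a dog sitting on the couch"))
# -> {('dog', 'sitting on', 'couch')}
```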
5. The method of claim 1, wherein identifying the plurality of objects in the image comprises: generating, by the machine-learned model, a set of bounding boxes, each bounding box encompassing an object in the image; and identifying, by the machine-learned model, the object within the bounding box.
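The detection step of claim 5 can be sketched with an off-the-shelf object detector that returns bounding boxes, class labels, and scores. The choice of torchvision's pretrained Faster R-CNN (recent torchvision assumed), the 0.8 score cutoff, and the file name "kitchen.jpg" are assumptions for illustration; the patent does not name a particular model.

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

# Pretrained detector standing in for the machine-learned model of claim 5.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

# "kitchen.jpg" is a hypothetical file name used only for illustration.
image = convert_image_dtype(read_image("kitchen.jpg"), torch.float)

with torch.no_grad():
    detections = model([image])[0]  # dict with 'boxes', 'labels', 'scores'

for box, label, score in zip(detections["boxes"], detections["labels"],
                             detections["scores"]):
    if score >= 0.8:  # keep only confident detections
        # box is [x1, y1, x2, y2]; label is a COCO class index.
        print(int(label), [round(v, 1) for v in box.tolist()])
```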
6. One or more non-transitory computer-readable media coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
    generating, for each image in a plurality of images, a scene graph for the image, wherein generating the scene graph for the image comprises:
        identifying, by a machine-learned model, a plurality of objects in the image;
        extracting, by the machine-learned model, a relationship feature defining a relationship between a first object and a second, different object of the plurality of objects in the image; and
        generating, by the machine-learned model and from the plurality of objects and the relationship feature, the scene graph for the image that includes a set of nodes and a set of edges that interconnect a subset of nodes in the set of nodes, wherein the first object is represented by a first node from the set of nodes, the second object is represented by a second node from the set of nodes, and the relationship feature is an edge connecting the first node to the second node;
    receiving a natural language query request for an image in the plurality of images, wherein the natural language query request comprises a plurality of terms specifying two or more particular objects and a relationship between the two or more particular objects;
    generating a query graph for the natural language query request;
    identifying, from a plurality of scene graphs generated for the plurality of images, a set of scene graphs of the plurality of scene graphs matching the query graph; and
    providing, for display on a user device, a set of images corresponding to the set of scene graphs.
7. The computer-readable media of claim 6, further comprising: generating, from the plurality of scene graphs, a scene graph index, wherein identifying the set of scene graphs of the plurality of scene graphs matching the query graph comprises searching the scene graph index.
8. The computer-readable media of claim 6, further comprising: ranking the set of scene graphs matching the query graph, comprising: for each scene graph matching the query graph, assigning a confidence score; and providing a subset of scene graphs each including at least a threshold score.
9. The computer-readable media of claim 6, wherein the natural language query request is a voice query from a user, and wherein generating the query graph comprises parsing the voice query into a set of terms.
10. The computer-readable media of claim 6, wherein identifying the plurality of objects in the image comprises: generating, by the machine-learned model, a set of bounding boxes, each bounding box encompassing an object in the image; and identifying, by the machine-learned model, the object within the bounding box.
11. A system, comprising:
    one or more processors; and
    a computer-readable media device coupled to the one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
        generating, for each image in a plurality of images, a scene graph for the image, wherein generating the scene graph for the image comprises:
            identifying, by a machine-learned model, a plurality of objects in the image;
            extracting, by the machine-learned model, a relationship feature defining a relationship between a first object and a second, different object of the plurality of objects in the image; and
            generating, by the machine-learned model and from the plurality of objects and the relationship feature, the scene graph for the image that includes a set of nodes and a set of edges that interconnect a subset of nodes in the set of nodes, wherein the first object is represented by a first node from the set of nodes, the second object is represented by a second node from the set of nodes, and the relationship feature is an edge connecting the first node to the second node;
        receiving a natural language query request for an image in the plurality of images, wherein the natural language query request comprises a plurality of terms specifying two or more particular objects and a relationship between the two or more particular objects;
        generating a query graph for the natural language query request;
        identifying, from a plurality of scene graphs generated for the plurality of images, a set of scene graphs of the plurality of scene graphs matching the query graph; and
        providing, for display on a user device, a set of images corresponding to the set of scene graphs.
12. The system of claim 11, further comprising: generating, from the plurality of scene graphs, a scene graph index, wherein identifying the set of scene graphs of the plurality of scene graphs matching the query graph comprises searching the scene graph index.
13. The system of claim 11, further comprising: ranking the set of scene graphs matching the query graph, comprising: for each scene graph matching the query graph, assigning a confidence score; and providing a subset of scene graphs each including at least a threshold score.
14. The system of claim 11, wherein the natural language query request is a voice query from a user, and wherein generating the query graph comprises parsing the voice query into a set of terms.
15. The system of claim 11, wherein identifying the plurality of objects in the image comprises: generating, by the machine-learned model, a set of bounding boxes, each bounding box encompassing an object in the image; and identifying, by the machine-learned model, the object within the bounding box.
PCT/US2020/053795 2020-05-30 2020-10-01 Systems and methods for retreiving images using natural language description WO2021042084A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080101465.XA CN115885275A (en) 2020-05-30 2020-10-01 System and method for retrieving images using natural language descriptions
EP20856170.4A EP4154174A4 (en) 2020-05-30 2020-10-01 Systems and methods for retreiving images using natural language description

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063032569P 2020-05-30 2020-05-30
US63/032,569 2020-05-30

Publications (1)

Publication Number Publication Date
WO2021042084A1 true WO2021042084A1 (en) 2021-03-04

Family

ID=74685268

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/053795 WO2021042084A1 (en) 2020-05-30 2020-10-01 Systems and methods for retreiving images using natural language description

Country Status (3)

Country Link
EP (1) EP4154174A4 (en)
CN (1) CN115885275A (en)
WO (1) WO2021042084A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060036564A1 (en) * 2004-04-30 2006-02-16 International Business Machines Corporation System and method for graph indexing
US20150331929A1 (en) * 2014-05-16 2015-11-19 Microsoft Corporation Natural language image search
US20170177715A1 (en) * 2015-12-21 2017-06-22 Adobe Systems Incorporated Natural Language System Question Classifier, Semantic Representations, and Logical Form Templates
US20190370587A1 (en) * 2018-05-29 2019-12-05 Sri International Attention-based explanations for artificial intelligence behavior

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4154174A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836933A (en) * 2021-07-27 2021-12-24 腾讯科技(深圳)有限公司 Method and device for generating graphic mark, electronic equipment and storage medium
US20230089148A1 (en) * 2021-09-17 2023-03-23 Robert Bosch Gmbh Systems and methods for interactive image scene graph pattern search and analysis

Also Published As

Publication number Publication date
EP4154174A4 (en) 2024-02-21
CN115885275A (en) 2023-03-31
EP4154174A1 (en) 2023-03-29

Similar Documents

Publication Publication Date Title
JP6266080B2 (en) Method and system for evaluating matching between content item and image based on similarity score
WO2018049960A1 (en) Method and apparatus for matching resource for text information
JP6423845B2 (en) Method and system for dynamically ranking images to be matched with content in response to a search query
US20230086735A1 (en) Systems and methods for retrieving videos using natural language description
JP4173774B2 (en) System and method for automatic retrieval of example sentences based on weighted edit distance
WO2019052403A1 (en) Training method for image-text matching model, bidirectional search method, and related apparatus
US8577882B2 (en) Method and system for searching multilingual documents
US20170109435A1 (en) Apparatus and method for searching for information
JP2017157192A (en) Method of matching between image and content item based on key word
US10482146B2 (en) Systems and methods for automatic customization of content filtering
WO2021189951A1 (en) Text search method and apparatus, and computer device and storage medium
CN110569496A (en) Entity linking method, device and storage medium
JP6363682B2 (en) Method for selecting an image that matches content based on the metadata of the image and content
JP7203981B2 (en) Similarity model creation method, device, electronic device, storage medium and program for searching geographic location
JP6165955B1 (en) Method and system for matching images and content using whitelist and blacklist in response to search query
US10152540B2 (en) Linking thumbnail of image to web page
CN110619050A (en) Intention recognition method and equipment
JP2023516209A (en) METHOD, APPARATUS, APPARATUS AND COMPUTER-READABLE STORAGE MEDIUM FOR SEARCHING CONTENT
WO2021042084A1 (en) Systems and methods for retreiving images using natural language description
GB2569858A (en) Constructing content based on multi-sentence compression of source content
CN111859013A (en) Data processing method, device, terminal and storage medium
US20140280084A1 (en) Using structured data for search result deduplication
CN109145261B (en) Method and device for generating label
KR101602342B1 (en) Method and system for providing information conforming to the intention of natural language query
CN113505196B (en) Text retrieval method and device based on parts of speech, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20856170

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020856170

Country of ref document: EP

Effective date: 20221222