CN115885275A - System and method for retrieving images using natural language descriptions - Google Patents


Info

Publication number
CN115885275A
Authority
CN
China
Prior art keywords
image
graph
query
scene
objects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080101465.XA
Other languages
Chinese (zh)
Inventor
宁颜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN115885275A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5854 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using shape and object relationship
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90332 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2323 Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/768 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/196 Recognition using electronic means using sequential comparisons of the image signals with a plurality of references
    • G06V30/1983 Syntactic or structural pattern recognition, e.g. symbolic string recognition
    • G06V30/1988 Graph matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Abstract

Implementations relate to methods, systems, and computer-readable media that obtain images and, for each of the images, generate a scene graph for the image. Generating the scene graph for the image includes: identifying an object in the image; and extracting a relationship feature defining a relationship between a first object and a second, different object of the objects in the image. A scene graph is generated for the image, the scene graph including a set of nodes and a set of edges. A natural language query request is received for an image, including terms defining a relationship between two or more particular objects. A query graph is generated for a natural language query request and a set of images corresponding to a set of scene graphs matching the query graph are provided for display on a user device.

Description

System and method for retrieving images using natural language descriptions
Cross Reference to Related Applications
This application claims priority to U.S. Application No. 63/032,569, filed on May 30, 2020, the entire disclosure of which is incorporated herein by reference.
Technical Field
This specification relates generally to image processing and searching for images in an image library.
Background
Searching for a particular image within an image library containing a large number of images can be time consuming and can result in search results containing images that are not responsive to or relevant to a search query submitted by a user.
Disclosure of Invention
Implementations of the present disclosure generally relate to image processing and image library querying. More particularly, implementations of the present disclosure relate to processing a repository of images with a machine learning model to extract objects from each image and relational features that define relationships between the objects. The extracted objects and relational features are used to construct a scene graph for each of the images, in which the objects form nodes and the relational features form edges between the nodes. A searchable scenegraph index for the image repository may be generated from the scenegraphs. A user may provide a query for an image, where the query includes a natural language description of visual relationships between objects included in the image of interest. A query graph may be generated from the query, where the query graph may be matched against one or more scenegraphs in accordance with a searchable scenegraph index. In response to a query for images, images corresponding to one or more matching scene graphs may be provided.
In some implementations, the operations may include: obtaining a plurality of images; and generating a scene graph for each image of the plurality of images. Generating the scene graph for an image includes: identifying, by a machine learning model, a plurality of objects in the image; and extracting, by the machine learning model, a relationship feature that defines a relationship between a first object and a second, different object of the plurality of objects in the image. The machine learning model generates a scene graph for the image from the objects and relationship features, the scene graph including a set of nodes and a set of edges interconnecting a subset of the nodes, wherein the first object is represented by a first node, the second object is represented by a second node, and the relationship feature is an edge connecting the first node and the second node. A natural language query request is received for an image of the plurality of images, wherein the natural language query request includes a plurality of terms specifying two or more particular objects and a relationship between the two or more particular objects. A query graph is generated for the natural language query request, a set of scene graphs matching the query graph is identified, and a set of images corresponding to the set of scene graphs is provided for display on a user device.
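For illustration only, the following is a minimal sketch of these operations; the callables detect_objects, extract_relations, parse_query, and graph_matches are hypothetical stand-ins for the machine learning and matching components described in the detailed description, not part of the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    image_id: str
    nodes: list = field(default_factory=list)   # object labels, e.g. ["boy", "ball"]
    edges: list = field(default_factory=list)   # (subject, predicate, object) triplets

def build_scene_graph(image_id, image, detect_objects, extract_relations):
    """Build one scene graph; detect_objects/extract_relations stand in for the model."""
    objects = detect_objects(image)                  # e.g. ["boy", "ball"]
    relations = extract_relations(image, objects)    # e.g. [("boy", "holding", "ball")]
    return SceneGraph(image_id, nodes=objects, edges=relations)

def answer_query(query_text, scene_graphs, parse_query, graph_matches):
    """Turn a natural language request into a query graph and return matching image ids."""
    query_graph = parse_query(query_text)            # objects + relationship terms
    return [sg.image_id for sg in scene_graphs if graph_matches(sg, query_graph)]
```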
Other implementations of this aspect include corresponding systems, apparatus, and computer programs, encoded on computer storage devices, configured to perform the actions of the methods.
These and other aspects may each optionally include one or more of the following features. In some implementations, the method can further include generating, by the data processing device, a scene graph index from the scene graphs, wherein identifying a set of scene graphs of the plurality of scene graphs that matches the query graph includes searching the scene graph index.
In some implementations, the method can further include ranking the set of scene graphs that match the query graph, including: assigning a confidence score to each scene graph that matches the query graph; and providing a subset of scene graphs that each have at least a threshold confidence score.
In some implementations, the natural language query request can be a voice query from a user, and generating the query graph can include parsing the voice query into a set of terms.
In some implementations, identifying the objects in the image can include: generating, by the machine learning model, a set of bounding boxes, each bounding box enclosing an object in the image; and identifying, by the machine learning model, the objects within the bounding boxes.
The present disclosure also provides a non-transitory computer-readable medium coupled to one or more processors and having instructions stored thereon, which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
The present disclosure also provides a system for implementing the methods provided herein. The system includes: one or more processors; and a non-transitory computer-readable storage device coupled to the one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. For example, an advantage of this technique is that it can facilitate efficient and accurate image discovery using natural language descriptions of visual relationships between objects depicted in an image, and can reduce the number of queries a user must input to find a particular image of interest. This in turn reduces the computing resources required to perform multiple queries before a suitable image is identified.
The system may provide a more intuitive interface for end users to find images of interest by using natural language and visual relationship descriptions to search scene graphs generated from the images. Searching the scene graph index may speed up query processing, as the query is performed against the scene graphs generated for the images rather than the images themselves, reducing the need to iterate over and/or search the images. Deep neural networks and machine learning models can be utilized to map images into scene graphs representing potential visual relationships. The machine learning model may be pre-trained using a repository of training images and may be further refined on the user's particular image library to improve the accuracy of the determined visual relationships.
The system may be used to facilitate discovery of images from various sources, such as photographs taken by a user, generated photographs, downloaded photographs, and the like, as well as images stored in various locations, such as on a local storage of a user device or on a cloud-based server.
It should be understood that a method according to the present disclosure may include any combination of the aspects and features described herein. That is, methods according to the present disclosure are not limited to the combinations of aspects and features specifically described herein, but may include any combination of aspects and features provided.
The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other features and advantages of the disclosure will be apparent from the description and drawings, and from the claims.
Drawings
FIG. 1 depicts an example operating environment of a visual relationship system.
FIG. 2A depicts a block diagram of an example implementation of a visual relationship system.
FIG. 2B depicts a block diagram of an example architecture of a visual relationship model.
FIG. 3 depicts a block diagram of another example implementation of a visual relationship system.
FIG. 4 depicts a block diagram of an example object and visual relationship determined by a visual relationship system.
FIG. 5 is a flow diagram of an example process performed by the visual relationship system for processing images and image queries.
FIG. 6 illustrates an example of a computing system in which the microprocessor architecture disclosed herein may be implemented.
FIG. 7 shows a schematic diagram of a general network element or computer system.
Detailed Description
Overview
Implementations of the present disclosure generally relate to image processing and image library querying. More particularly, implementations of the present disclosure relate to processing a repository of images with a machine learning model to extract objects from each image and relational features defining relationships between the objects. The extracted object and relationship features are used to construct a scene graph for each of the images, in which the objects form nodes and the relationship features form edges between the nodes. A searchable scenegraph index for the image repository may be generated from the scenegraphs. A query graph may be generated with a query for an image, where the query graph may be matched against one or more scene graphs in accordance with a searchable scene graph index. In response to a query for images, images corresponding to one or more matching scene graphs may be provided.
A user may provide a natural language query that includes a plurality of terms describing visual relationships between objects. The query may be provided as a text query or a voice query, for example, through an assistant application on the user device, in which case speech-to-text processing and natural language processing may be applied to the query. A query graph may be generated from the terms of the query, where the query graph identifies objects and the relationship features between the identified objects, as defined by the terms of the query.
A search of the scene graph index may be performed to find matches between the query graph and the scene graphs. As part of this matching, a confidence score may be assigned between each matched scene graph and the query graph, and the matched scene graphs may be ranked by confidence score. In response to the query, a set of images corresponding to the matched scene graphs may be provided, for example, for display on the user device.
In some implementations, an Artificial Intelligence (AI)-enabled processor chip with natural language understanding may be integrated, alongside a processor such as a Central Processing Unit (CPU) or Graphics Processing Unit (GPU), in a "smart" mobile device. The AI-enabled processor chip with natural language understanding can be used to receive a natural language voice query and generate a query graph from the voice query. The AI chip may also be used to accelerate object detection and relationship feature extraction using pre-trained machine learning models stored locally on the user device and/or on a cloud-based server.
Example Operating Environment
FIG. 1 depicts an example operating environment 100 of a visual relationship system 102. The visual relationship system 102 may be hosted on a local device, such as the user device 104, one or more local servers, a cloud-based service, or a combination thereof. In some implementations, some or all of the processing described herein may be hosted on cloud-based server 103.
The visual relationship system 102 may be in data communication with a network 105, where the network 105 may be configured to enable the exchange of electronic communications between devices connected to the network 105. In some implementations, the visual relationship system 102 is hosted on a cloud-based server 103, where the user device 104 can communicate with the visual relationship system 102 via the network 105.
The network 105 may include, for example, one or more of the following: the Internet, wide area networks (WANs), local area networks (LANs), analog or digital wired and wireless telephone networks such as the public switched telephone network (PSTN), integrated services digital network (ISDN), cellular networks, and digital subscriber lines (DSL), radio, television, cable, satellite, or any other transport or tunneling mechanism for carrying data. The network may include multiple networks or subnetworks, each of which may include, for example, wired or wireless data paths. The network may comprise a circuit-switched network, a packet-switched data network, or any other network capable of carrying electronic communications such as data communications or voice communications. For example, the network may include networks based on Internet Protocol (IP) or Asynchronous Transfer Mode (ATM), the PSTN, packet-switched networks based on IP, X.25, or frame relay, or other similar technologies, and may support voice communications using, for example, Voice over Internet Protocol (VoIP) or other similar protocols. The network may include one or more networks including wireless data channels and wireless voice channels. The network may be a wireless network, a broadband network, or a combination of networks including both wireless and broadband networks. In some implementations, the network 105 may be accessed over wired and/or wireless communication links. For example, a mobile computing device, such as a smartphone, may utilize a cellular network to access the network 105.
The user device 104 may host and display an application 110, including an application environment. For example, the user device 104 is a mobile device hosting one or more native applications, such as application 110, the application 110 including an application interface 112, such as a graphical user interface, through which a user may interact with the visual relationship system 102. The user device 104 comprises any suitable type of computing device, such as a desktop computer, a laptop computer, a handheld computer, a tablet computer, a Personal Digital Assistant (PDA), a cellular telephone, a network appliance, a camera device, a smartphone, an Enhanced General Packet Radio Service (EGPRS) mobile phone, a media player, a navigation device, an email device, a game console, or a suitable combination of any two or more of these or other data processing devices. In addition to performing functions related to the visual relationship system 102, the user device 104 may also perform other unrelated functions, such as placing a personal telephone call, playing music, playing video, displaying pictures, browsing the internet, maintaining an electronic calendar, and so forth.
The application 110 refers to a software/firmware program running on a respective mobile device that implements the user interfaces and features described throughout, and is the system by which the visual relationship system 102 can communicate with a user on the user device 104. The user device 104 may load or install the application 110 based on data received over a network or data received from local media. The application 110 runs on a mobile device platform. The user device 104 may receive data from the visual relationship system 102 over the network 105 and/or the user device 104 may host a portion or all of the visual relationship system 102 on the user device 104.
The visual relationship system 102 includes a speech to text converter 106 and a visual relationship model 108. Although described herein with reference to the speech-to-text converter 106 and the visual relationship model 108, the operations described may be performed by more or fewer subcomponents. The visual relationship model 108 may be a machine learning model and may be constructed using a plurality of sub-models, each implementing machine learning to perform the operations described herein. Additional details of the visual relationship model 108 are described with reference to fig. 2A, 3, and 4.
The visual relationship system 102 may obtain the images 114 as input from an image database 116 that includes a repository of images 114. The image database 116 may be stored locally on the user device 104 and/or on the cloud-based server 103, where the visual relationship system 102 may access the image database 116 via the network 105. The image database 116 may include, for example, a collection of photographs taken by a user using a camera on a mobile phone. As another example, the image database 116 may include a collection of photographs captured by multiple user devices and stored in remote locations, such as cloud servers.
The visual relationship system 102 may generate as output a scene graph for the scene graph database 118 using the visual relationship model 108. The scene graph database 118 may be stored locally on the user device 104 and/or on the cloud-based server 103, wherein the visual relationship system 102 may access the scene graph database 118 via the network 105. The scene graph database 118 may include scene graphs generated for at least a subset of the images 114 in the image database 116. Additional details of scene graph generation are described with reference to FIG. 2A.
The visual relationship system 102 may receive as input a query 120 from a user on the user device 104 through the application interface 112. The query 120 may be a voice query provided by a user of the user device 104 through the application interface 112. The query 120 may be a text-based query entered into the application interface 112 by a user.
The application interface 112 may include a search feature 122 in which a user may choose to input a query 120, such as a voice query. In one example, a user may use an assistant function of the user device 104 to enter a voice query, which may be activated, for example, by pressing a microphone button 124 in the search feature 122. In another example, a user may enter a text query in a text field of the search feature 122.
The query 120 may be a natural language query that includes terms describing visual relationships between objects that may be included in the one or more images 114. A natural language query may include terms that are part of the user's normal vocabulary and do not include any special syntax or format. Natural language queries may be entered in various forms, for example, as declarative sentences, questions, or simple lists of keywords. In one example, the natural language query is "I want to find a boy holding a ball." In another example, the natural language query is "Where is the photo of a dog running at the seaside?" In yet another example, the natural language query is "Boy holding a ball. The boy is at the seaside."
The speech-to-text converter 106 may receive a user's speech query and parse the user's speech query into text using speech-to-text techniques and natural language processing. The parsed query may be provided by the speech-to-text converter 106 to the visual relationship model 108 as input. In response to a user inputting a query 120, the visual relationship system 102 may provide one or more images 114 responsive to the query 120 as output to the user device 104 for display in the application interface 112 of the application 110.
In some implementations, the user can select to enter a query 120, such as a text-based query. For example, the user may type a text query into the search feature 122. The query 120 may be a natural language query that includes terms that describe visual relationships included in the one or more images 114. The visual relationship system 102 can receive a text query as input and parse the text query using natural language processing (e.g., as a function of an AI-based chip). In response to a user inputting a query 120, the visual relationship system 102 may provide one or more images 114 responsive to the query 120 as output to the user device 104 for display in the application interface 112 of the application 110. Additional details of the processing of the visual relationship system 102 are described with reference to fig. 2A and 3.
Fig. 2A depicts a block diagram 200 of an example implementation of the visual relationship system 102, and in particular the visual relationship model 108, that generates a scene graph from the input image 114.
The visual relationship model 108 may be a machine learning model, which in turn may be constructed with a plurality of sub-models to perform the actions described herein. The visual relationship model 108 may include a deep neural network model in which the images 114 in the image database 116 are mapped to scene graphs 202 representing potential visual relationships. An example architecture of the visual relationship model 108 is described below with reference to FIG. 2B; in general, however, the visual relationship model 108 may be implemented with any architecture that performs the actions described with reference to feature/object extraction 208 and scene graph generation 214.
As depicted in FIG. 2A and as briefly described with reference to FIG. 1, the visual relationship model 108, as part of the visual relationship system 102, may receive as input the images 114 from the image database 116 and generate as output, for each image 114, a respective scene graph 202 for storage in the scene graph database 118. In some implementations, a scene graph 202 is generated for each image in a subset of the images 114 in the image database 116, e.g., a subset of the total number of images in the image database 116.
The scene graph 202 includes a set of nodes 204 and a set of edges 206 that interconnect a subset of the nodes in the set of nodes. Each scene graph 202 may define a set of objects represented by respective nodes 204, e.g., where a first object is represented by a first node in the set of nodes and a second object is represented by a second node in the set of nodes. The first node and the second node may be connected by an edge representing a relational feature defining a relationship between the two objects.
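As a concrete illustration of this node/edge structure (the example data is assumed, not taken from the disclosure), a scene graph could be represented with a general-purpose graph library such as networkx:

```python
import networkx as nx

# One scene graph per image; the graph attribute keeps the reference to the image.
scene_graph = nx.MultiDiGraph(image_id="img_0001.jpg")

scene_graph.add_node("boy")    # first object -> first node
scene_graph.add_node("ball")   # second object -> second node
scene_graph.add_edge("boy", "ball", predicate="holding")   # relationship feature -> edge

# The same node may participate in several relationships:
scene_graph.add_node("hat")
scene_graph.add_edge("boy", "hat", predicate="wearing")
```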
The visual relationship model 108 may be implemented using one or more deep neural networks. In some implementations, the visual relationship model 108 includes a machine learning model based on one or more pre-trained models, trained using generic data, e.g., a generic image repository, or user-specific data, e.g., a user's image library, to generate a scene graph for each image provided to the model. The pre-trained models may then be further fine-tuned on an image database 116, such as a user's image or video gallery. The fine-tuning process may be performed on the user device 104 and/or the cloud-based server 103 depending on, for example, the location of the images 114 and the processing power of the user device 104. Thus, in some implementations, initial training may be performed on the cloud-based server 103 or another networked location, and the initially trained model may then be provided to the user device 104 for storage and further fine-tuning. Alternatively, the initial training and any subsequent fine-tuning may be performed on the user device 104, or the initial training and any subsequent fine-tuning may be performed on the cloud-based server 103 or another networked location.
In some implementations, after the visual relationship model 108 has been initially trained and/or fine-tuned, the visual relationship model 108 may process the obtained images 114 to perform feature/object extraction 208, the results of which may in turn be used to generate a scene graph for each image. In one example, the visual relationship model 108 may analyze a user image library or a cloud-based image library including a set of photographs on a mobile device to generate corresponding scene graphs 202 describing visual relationships within the images 114.
Feature/object extraction 208 may include identifying objects in the image 114 by the visual relationship model 108. Identifying the objects in the image 114 may include applying bounding boxes 210 to the image 114, wherein each bounding box 210 encloses an object appearing in the image 114. For example, a plurality of bounding boxes 210 may be applied to an image of a boy holding a ball, where a first bounding box may enclose the boy and a second bounding box may enclose the ball. A partial object, such as a partially visible ball, may appear in the image 114, wherein a bounding box may be applied to the partial object appearing in the image 114. Identifying objects in the image 114 may be performed using various object detection models, such as Mask R-CNN (Mask Region-based Convolutional Neural Network) or YOLO (You Only Look Once). In some implementations, identifying objects in the image 114 may be performed using a machine learning model architecture that performs object detection and scene graph prediction/generation in parallel. For example, a Feature Pyramid Network (FPN) can be utilized to aggregate multi-scale information derived from a ResNet50 backbone applied to the input image 114.
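As one possible sketch of the bounding-box step, the following uses a pretrained Mask R-CNN from torchvision (assuming torchvision 0.13 or later) in place of the detector described above; the score threshold and the output format are illustrative assumptions.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Pretrained Mask R-CNN with a ResNet50-FPN backbone, used here purely as an example detector.
detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_objects(pil_image, score_threshold=0.7):
    """Return (label_id, [x1, y1, x2, y2]) pairs for confident detections in one image."""
    with torch.no_grad():
        prediction = detector([to_tensor(pil_image)])[0]
    return [
        (int(label), box.tolist())
        for label, box, score in zip(prediction["labels"], prediction["boxes"], prediction["scores"])
        if float(score) >= score_threshold
    ]
```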
The feature/object extraction 208 may additionally include extracting, by the visual relationship model 108, relationship features 212, the relationship features 212 defining relationships between objects of the plurality of objects in the image 114. In some implementations, each relationship feature 212 defines a relationship between a first object and a second, different object. For example, the relationship feature 212 may be "holding," wherein the relationship feature 212 defines a relationship between a first object "boy" and a second object "ball" to define the visual relationship "boy holding ball." The relationships may be determined by the visual relationship model 108, for example, based in part on proximity/spatial distances between objects, known relationships between classes of objects, user-defined relationships between particular objects and/or classes of objects, and so forth.
In some implementations, a machine learning model can be used to predict relationships between pairs of detected objects. For example, the model may be a single-pass model that performs both object detection and relationship identification simultaneously. In other words, feature/object extraction may be performed using a single-pass model for identifying objects and defining relationships between objects, where the machine learning model completes both the object detection processing and the relationship recognition inference in a single pass.
In some implementations, the visual relationship model 108 is a machine learning model implemented as a single-pass model that can predict a scene graph for the input image 114 in a single pass. An example architecture 250 of such a single-pass machine learning model is depicted in FIG. 2B.
As depicted in architecture 250, object detection and relationship feature extraction may be performed using a two-branch technique, e.g., as described with reference to feature/object extraction 208. The architecture 250 may include a ResNet50, an HRNet (High-Resolution Net), or another similar convolutional neural network that receives the image 114 and generates multi-scale outputs representing features extracted/generated at multiple scales of the original input, e.g., 256x256, 128x128, 64x64, etc. The multi-scale outputs may be provided as input to Feature Pyramid Network (FPN)-style structures for processing. In the example depicted in FIG. 2B, two FPNs (each individually referred to as an FPN or BiFPN) may be used to perform object detection and relationship feature extraction, respectively, e.g., as described with reference to feature/object extraction 208; however, more or fewer FPNs may be used in the architecture 250. The output relationship prediction tensors of each BiFPN may be used as inputs to multiple convolution and batch normalization layers to predict a scene graph for the input image. The output of the architecture 250 is a scene graph, such as the scene graph 202 generated from the input image 114.
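For orientation only, the following is a rough PyTorch sketch of such a two-branch layout: a shared multi-scale backbone feeding separate FPN-style branches for object and relationship prediction. The layer choices, channel sizes, and prediction heads are assumptions made for illustration; plain FPNs are used in place of BiFPNs, and the sketch does not reproduce the architecture 250 exactly.

```python
import torch
from torch import nn
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import FeaturePyramidNetwork

class TwoBranchSceneGraphNet(nn.Module):
    def __init__(self, num_object_classes=100, num_predicates=50):
        super().__init__()
        # Multi-scale features (C2..C5) from a ResNet50 backbone.
        self.backbone = create_feature_extractor(
            resnet50(weights="DEFAULT"),
            return_nodes={"layer1": "c2", "layer2": "c3", "layer3": "c4", "layer4": "c5"},
        )
        channels = [256, 512, 1024, 2048]
        self.object_fpn = FeaturePyramidNetwork(channels, out_channels=256)     # object branch
        self.relation_fpn = FeaturePyramidNetwork(channels, out_channels=256)   # relationship branch
        self.object_head = nn.Conv2d(256, num_object_classes, kernel_size=1)
        self.relation_head = nn.Conv2d(256, num_predicates, kernel_size=1)

    def forward(self, images):
        features = self.backbone(images)
        obj_maps = self.object_fpn(features)
        rel_maps = self.relation_fpn(features)
        # Per-location object class scores and predicate scores from the finest pyramid level.
        return self.object_head(obj_maps["c2"]), self.relation_head(rel_maps["c2"])
```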
The visual relationship model 108 predicts a scene graph via scene graph generation 214 based on the objects extracted from the bounding boxes 210 and the relationship features 212. A scene graph 202 for the image 114 is generated from the object and relationship features of the image 114, where in the scene graph 202 each object is a node 204 and each relationship feature is an edge 206 connecting at least two nodes 204 together. The scene graph 202 may include each identified object as a node and each relationship feature between at least two objects as an edge connecting the corresponding nodes. A first node may be connected to a plurality of other, different nodes, where each connection is an edge defining a relationship feature between the first node and a second, different node among the plurality of other nodes. For example, a first node may be "boy", a second node may be "ball", and a third node may be "hat". The first node and the second node may be connected by an edge representing the relationship feature "holding", e.g., "boy holding the ball", while the first node and the third node may be connected by an edge representing the relationship feature "wearing", e.g., "boy wearing the hat".
In some implementations, a first node may be connected to a plurality of other, different nodes by the same type of relationship feature, where each connection is represented by a separate edge. For example, a boy may hold a ball and a book in the image. The first node may be "boy", the second node may be "ball", and the third node may be "book". The relationship feature between the first node and the second node may be "holding", e.g., "boy holding a ball", and the relationship feature between the first node and the third node may also be "holding", e.g., "boy holding a book". The scene graph 202 may then include three nodes, e.g., "boy", "ball", and "book", and two edges, both labeled "holding".
A scene graph 202 of the image 114 is stored in the scene graph database 118 and includes a reference to the image 114. A scene graph index 216 may be built from the stored scene graphs 202 in the scene graph database 118, which may facilitate matching the stored scene graphs 202 against queries using graph indexing techniques. As one example, the scene graph index may be a lookup table that identifies each image and its corresponding scene graph, as depicted in FIG. 2A.
Various graph indexing techniques may be utilized, for example, gIndex, a frequent-structure-based graph indexing method. More generally, path-based graph indexing techniques and/or structure-based techniques may be utilized. An inverted indexing technique may also be used for the scene graph index, depending in part on the size of the generated scene graphs.
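As a simplified illustration of the inverted-index option (the layout below is an assumption, not the gIndex scheme), each scene graph edge can be mapped to the images whose scene graphs contain it:

```python
from collections import defaultdict

def build_edge_index(scene_graphs):
    """scene_graphs: iterable of (image_id, [(subject, predicate, object), ...])."""
    index = defaultdict(set)
    for image_id, triplets in scene_graphs:
        for triplet in triplets:
            index[triplet].add(image_id)
    return index

def lookup(index, query_triplets):
    """Return image ids whose scene graphs contain every edge of the query graph."""
    candidate_sets = [index.get(t, set()) for t in query_triplets]
    return set.intersection(*candidate_sets) if candidate_sets else set()
```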
Referring back to FIG. 1, a user may provide a query 120 to the visual relationship system 102 via the application interface 112, for example, as a voice query or a text query. The visual relationship system 102 may process the voice query 120 using the speech-to-text converter 106 to generate a parsed query. In some implementations, the speech-to-text converter 106 can translate the speech query 120 into a text command using a speech-to-text neural network model, such as ALBERT or another similar neural network model.
FIG. 3 depicts a block diagram 300 of another example implementation of the visual relationship system in which the visual relationship model 108 is used to identify scene graphs that match a user-entered query.
The query 302 including terms describing visual relationships may be provided to the visual relationship system 102. In some implementations, the query 302 is a text query generated by the speech-to-text converter 106 from the query 120 received by the visual relationship system 102 from the user on the user device 104.
The visual relationship system 102 may receive the query 302 as input and perform feature/object extraction 304 on the query 302 to determine terms in the query 302 that define objects 306 and relationship features 308. The visual relationship system 102 may extract the objects 306 and the relationship features 308 from the input query 302, for example, by using natural language processing to parse the terms of the query and identify the objects/relationship features. In one example, a natural language processing technique such as the Python spaCy toolkit may be used to process the query to extract objects and relationships. In one example, the query 302 is "I want to find a boy holding a ball", where the object terms are determined to be "boy" and "ball" and the relationship feature term is determined to be "holding".
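A minimal sketch of such term extraction with spaCy is shown below (it assumes the en_core_web_sm model is installed). The heuristic of treating nouns as object terms and verbs/prepositions as relationship terms is an illustrative simplification, not the disclosed parser.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_query_terms(query_text):
    """Return (object_terms, relation_terms) extracted from a natural language query."""
    doc = nlp(query_text)
    objects = [tok.lemma_ for tok in doc if tok.pos_ in ("NOUN", "PROPN")]
    relations = [tok.lemma_ for tok in doc if tok.pos_ in ("VERB", "ADP")]
    return objects, relations

# extract_query_terms("I want to find a boy holding a ball") yields object terms
# ["boy", "ball"]; the relation terms include "hold" along with noise such as
# "want" and "find" that a fuller parser would filter out.
```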
The visual relationship model 108 may perform query graph generation 310 using the extracted objects 306 and the relationship features 308 defined in the terms of the query 302. A query graph 312 may be generated in which the objects 306 and relationship features 308 extracted from the terms of the query 302 serve as nodes 314 and edges 316 between the nodes, respectively. Continuing with the example provided above, the query graph 312 may include a first node "boy" and a second node "ball", with the edge "holding" connecting the first node to the second node.
The visual relationship system 102 may perform a scene graph match 318 between the query graph 312 and the scene graphs 202 from the scene graph database 118. In some implementations, the matching between the query graph 312 and the scene graphs 202 from the scene graph database 118 includes searching the scene graph index 216 to retrieve relevant images 114 responsive to the query 120; the matching is described further below. A set of scene graphs 202 that match the query graph 312 is selected from the scene graphs 202 in the scene graph database 118.
In some implementations, the visual relationship system 102 can utilize one or more relevance models to perform the scene graph matching 318. Each scene graph 202 may be assigned a confidence score, wherein scene graphs 202 that satisfy a threshold confidence score with respect to the query graph 312 may be identified. The set of identified scene graphs 202 that meet the threshold confidence score may be ranked, where a first scene graph 202 with a higher confidence score for the query graph 312, e.g., a closer match, may be ranked higher than a second scene graph 202 with a lower confidence score, e.g., a more distant match. A scene graph match may be an exact match of terms, e.g., the same pair of first and second nodes in both the scene graph and the query graph is connected by the same edge. For example, the scene graph may include a "boy-holding-ball" node1-edge-node2 relationship, and the query graph may also include a "boy-holding-ball" relationship. A scene graph match may alternatively be an approximate match or a fuzzy match, e.g., where one or more nodes or one or more edges between nodes differ between the scene graph and the query graph. An approximate match may be based on a semantic distance between terms computed from word embeddings, e.g., using word2vec (word to vector). For example, the query graph may include "boy-holding-ball" and an identified scene graph may include "boy-throwing-ball", where "holding" and "throwing" are determined to be within a matching threshold, e.g., via a pre-generated dictionary.
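The sketch below illustrates one way such approximate matching and confidence scoring could be composed. The averaging scheme and the threshold value are assumptions made for illustration, and similarity stands in for any word-embedding comparison, e.g., word2vec cosine similarity.

```python
def edge_score(query_edge, scene_edge, similarity):
    """Average term similarity across subject, predicate, and object of two edges."""
    s1, p1, o1 = query_edge
    s2, p2, o2 = scene_edge
    return (similarity(s1, s2) + similarity(p1, p2) + similarity(o1, o2)) / 3.0

def graph_confidence(query_graph, scene_graph, similarity):
    """Confidence score in [0, 1]: 1.0 for an exact match, lower for fuzzy matches."""
    if not query_graph:
        return 0.0
    best = [
        max((edge_score(q, s, similarity) for s in scene_graph), default=0.0)
        for q in query_graph
    ]
    return sum(best) / len(best)

def rank_matches(query_graph, scene_graphs, similarity, threshold=0.6):
    """scene_graphs: iterable of (image_id, edge_triplets); returns ranked (score, image_id)."""
    scored = [(graph_confidence(query_graph, edges, similarity), image_id)
              for image_id, edges in scene_graphs]
    return sorted((item for item in scored if item[0] >= threshold), reverse=True)
```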
The images 114 corresponding to the set of identified scene graphs 202 may be provided for display on the user device, for example, in the application interface 112. The images 114 corresponding to the set of identified scene graphs 202 may be displayed according to a ranking, wherein the images 114 corresponding to scene graphs 202 with higher confidence scores may be presented at a more prominent location, e.g., at the top of the display, than the images 114 corresponding to scene graphs 202 with lower confidence scores.
In some implementations, a set of top-ranked images, such as the top 10 ranked images, is provided for display on the user device. The user may provide feedback to the visual relationship system 102 specifying how many images should be provided in response to a query request, for example between 0 and 25 images. In one example, a user may request that up to 15 images be returned in response to a query request. In some implementations, the number of images returned for display on the user device may depend on predefined parameters, such as predefined parameters set by the application 110. The number of images displayed may also depend on the device screen size, where the number of images is set by the available display space for thumbnail previews of the images.
As described with reference to fig. 2A, the visual relationship model 108 may perform feature/object extraction 208 on the image 114. Fig. 4 depicts a block diagram 400 of example objects and visual relationships determined/extracted by a visual relationship system.
As depicted in FIG. 4, a photograph 402, such as an image 114, depicts a woman sitting in a chair next to a table with a book on top of the table. The visual relationship model 108 may receive the photograph 402 and determine a set of bounding boxes 404, each bounding box enclosing an object or portion of an object appearing within the photograph 402. For example, the bounding boxes 404 of the photograph 402 identify objects 405 including the person within the photograph 402, e.g., the woman, as well as clothing, a chair, a book, and a table.
Each of the identified objects 405 is enclosed by a bounding box and may be associated, e.g., linked, with one or more other identified objects 405 using, e.g., a relationship feature from a set of relationship features 406, where each relationship feature in the relationship features 406 describes a relationship between a pair of objects. The relationship features 406 may include natural language terms. The relationship features 406 of the photograph 402 may include, for example, "beside," "on top of," and "wearing." In one example, a visual relationship may be defined as "table beside chair," where "table" and "chair" are objects 405 and "beside" is the relationship feature 406 between the objects 405.
An example of a scene graph is depicted in FIG. 4, which shows a plurality of objects as nodes connected by relationship features as edges. An object such as "woman" may be connected to a plurality of other objects, such as "chair," "clothing," and "table," via corresponding relationship features 406, such as "on," "wearing," and "beside." The visual relationship model 108 may utilize the extracted objects 405 and the relationship features 406 to generate a scene graph, such as the scene graph 202, for the photograph 402, e.g., an image 114.
In some implementations, the scene graph 202 generated for the image 114 may be replaced with text describing the semantics of the image 114. In other words, text describing the objects and relationship features within the image 114 may be associated with the image 114. For example, an image including a boy holding a ball may be associated with, e.g., labeled or otherwise assigned, terms including "boy", "holding", "ball", and "boy holding a ball". In some implementations, a neural network model can map images to textual descriptions, for example, using image captioning techniques. A semantic language search may then be performed on the descriptive text of each image in the image database 116.
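A brief sketch of this text-based alternative follows; caption and embed are hypothetical stand-ins for an image-captioning model and a sentence-embedding model, and cosine similarity is used as one possible semantic match measure.

```python
import numpy as np

def index_captions(images, caption, embed):
    """images: iterable of (image_id, image). Returns (image_id, text, vector) entries."""
    entries = []
    for image_id, img in images:
        text = caption(img)           # descriptive text associated with the image
        entries.append((image_id, text, np.asarray(embed(text), dtype=float)))
    return entries

def semantic_search(query_text, caption_index, embed, top_k=10):
    """Rank indexed images by cosine similarity between query and caption embeddings."""
    q = np.asarray(embed(query_text), dtype=float)
    scored = [
        (float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec))), image_id)
        for image_id, _, vec in caption_index
    ]
    return sorted(scored, reverse=True)[:top_k]
```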
Example Process of a Visual Relationship System
FIG. 5 is a flow diagram of an example process 500 of the visual relationship system 102 for processing images and image queries. The operations of process 500 are described below as being performed by the visual relationship system described and depicted in FIGS. 1-3. The operations of process 500 are described below for illustrative purposes only. The operations of process 500 may be performed by any suitable apparatus or system, for example, any suitable data processing apparatus such as the visual relationship system 102 or the user device 104. The operations of process 500 may also be implemented as instructions stored on a non-transitory computer-readable medium. Execution of the instructions causes one or more data processing devices to perform the operations of process 500.
An image is obtained (502). The visual relationship system 102 may obtain images 114 from an image database 116. In some implementations, the visual relationship system 102 obtains the image 114 as it is captured and/or saved to the image database 116. In some implementations, the visual relationship system 102 can periodically obtain the images 114 from the image database 116 for processing, e.g., when the user device 104 is connected to a power source, when memory usage of the user device 104 is below an activity threshold, etc.
In some implementations, the images 114 are stored locally on the user device 104, such as in a memory of a mobile phone. The images 114 may additionally or alternatively be stored on a cloud-based server 103 in data communication with the user device 104 via the network 105. An image 114 may be, for example, a document including a visual representation, such as a photograph captured by a camera of the user device 104. In general, the visual relationship system 102 may process documents including, for example, Portable Document Format (PDF) documents, Graphics Interchange Format (GIF) documents, Portable Network Graphics (PNG) documents, Joint Photographic Experts Group (JPEG) documents, or documents in another vision-based format.
In some implementations, the operations described below with reference to steps 504-508 may be performed on each image in the repository of images in the image database 116. Alternatively, the operations described below with reference to steps 504-508 may be performed on each image in the subset of images acquired from the image repository. As described above with reference to fig. 2A, each image 114 may be received as input by the visual relationship model 108, and a scene graph 202 may be generated for the images 114.
For each image, objects in the image are identified (504). The visual relationship model 108 may receive the image 114 and perform feature/object extraction 208 on the image 114. Object extraction may include applying bounding boxes 210 to the image, where each bounding box 210 encloses an object or a portion of an object appearing within the image. As depicted in FIG. 4, the bounding boxes 404 may each define an object 405, such as a table, a woman, a book, etc., that appears in the photograph 402. Objects in the image may be identified using an object detection model such as Mask R-CNN, YOLO, or a Single Shot Detector (SSD).
Referring back to FIG. 5, for each image, a relationship feature is extracted from the image, wherein the relationship feature defines a relationship between a first object and a second, different object in the image (506). The relationship features may be extracted by the visual relationship model 108, for example, using a deep neural network, and define a relationship between at least two objects appearing within the image. Extracting the relationship features from the image may be built into the visual relationship model as part of its end-to-end output. A relationship feature may include one or more terms that define a relationship between the first object and the second object. As depicted in FIG. 4, the relationship features 406 may include a term or a group of terms, such as "next to" and "wearing," where the terms define how a first object is related to a second object.
Referring back to fig. 5, for each image, a scene graph is generated from the identified objects and the extracted relational features (508). The visual relationship model 108 generates a scene graph, such as the scene graph 202 depicted in fig. 2A and shown in fig. 4, in which scene graph 202 each object is defined as a node 204 and each relationship feature is defined as an edge 206 connecting two nodes 204. In some implementations, the first node 204 may be connected to the second node via the first edge 206 and to the third node via the second edge 206.
The generated scene graph 202 is stored in the scene graph database 118, e.g., locally on the user device 104 and/or on the cloud-based server 103, the cloud-based server 103 being in data communication with the user device 104 via the network 105. Each generated scene graph 202 may include a reference to the particular image 114 from which the scene graph was generated, e.g., an identifier that references the image 114 or a storage location of the image 114 in the image database 116. The scene graph database 118 may be indexed to generate a scene graph index 216, and the scene graph index 216 may be used to search the scene graph database 118 for a particular set of scene graphs 202.
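As an illustration of storing scene graphs with references back to their source images (the storage layout below is an assumption, not the disclosed schema for the scene graph database 118), each edge could be persisted as a row keyed by an image reference:

```python
import sqlite3

def create_scene_graph_store(path="scene_graphs.db"):
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS scene_graph_edges (
               image_ref TEXT,   -- identifier or storage location of the source image
               subject   TEXT,
               predicate TEXT,
               object    TEXT
           )"""
    )
    return conn

def store_scene_graph(conn, image_ref, triplets):
    """Persist one scene graph as (subject, predicate, object) rows tied to its image."""
    conn.executemany(
        "INSERT INTO scene_graph_edges VALUES (?, ?, ?, ?)",
        [(image_ref, s, p, o) for s, p, o in triplets],
    )
    conn.commit()
```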
Referring back to FIG. 5, a natural language query request is received requesting one or more images from the image database 116, wherein the natural language query request specifies two or more objects and one or more relationships between the two or more objects (510). The natural language query request, such as the query 120, may be a voice query provided by a user of the user device 104, such as through the application interface 112 of the application 110 and/or through a digital assistant on the user device 104. The natural language query request may include a set of terms that describe one or more objects in the image that the user is interested in viewing and one or more relationships between the objects. For example, the natural language query request may be "I want to find a woman sitting on a chair", where the objects are "woman" and "chair" and the relationship between the objects is "sitting on". In another example, the natural language query request may be "find photos of me hiking on Mount St. Helens", where the objects are "I [the user]" and "Mount St. Helens", and the relationship between the objects is "hiking on".
In some implementations, a speech-to-text converter, such as speech-to-text converter 106, receives a speech query and converts the speech query into a text-based query, which can be provided to visual relationship model 108. The speech-to-text converter 106 may be part of the visual relationship system 102 or may be a function of a digital assistant or another application 110 located on the user device 104.
Visual relationship system 102 may receive a text query, such as query 302, from speech-to-text converter 106 and perform feature/object extraction, such as feature/object extraction 304, to extract objects and relationship features, such as object 306 and relationship feature 308, included in the query.
Referring now to FIG. 5, a query graph is generated for a natural language query request (512). Query graph generation, such as query graph generation 310, may be performed by the visual relationship system 102 using object and relationship features extracted from a user-provided query. A query graph, such as query graph 312, may be generated that includes a graph-based representation of query 302, where each object 306 is represented by a node 314 and each relational feature 308 is represented by an edge 316 connecting a first node to a second node.
Referring back to FIG. 5, a set of scene graphs that match the query graph is identified from the plurality of scene graphs (514). Scene graph matching, such as the scene graph matching 318, may be performed by the visual relationship system 102, where the query graph 312 is compared to the scene graphs 202 in the scene graph database 118. As described with reference to FIG. 3, a set of scene graphs 202 matching the query graph 312 is identified, e.g., by searching the scene graph index 216 to find scene graphs 202 that match the query graph 312, e.g., exact matches or approximate/fuzzy matches. In some implementations, based on the matching, each scene graph 202 in the scene graph database 118 may be assigned a confidence score with respect to the query graph 312, e.g., reflecting the closeness of the match, and only those scene graphs having confidence scores that meet (e.g., meet or exceed) a threshold confidence score are included in the set of scene graphs.
Referring now to FIG. 5, a set of images corresponding to the set of scene graphs is provided for display on the user device (516). A set of images, such as images 114 corresponding to the scene graphs 202, may be identified from the image database 116. Each scene graph in the set of scene graphs may include a reference to the particular image from which the scene graph was generated, e.g., a reference to a storage location, a unique identifier, etc. The set of images may be identified in the image database 116 and provided by the visual relationship system 102 to an application 110, such as a photo library application, on the user device 104 for display.
The set of images may be displayed in an application interface of an application on the user device 104, such as the application interface 112 of the application 110. In some implementations, the set of images can be displayed according to a ranking of each image relative to the other images in the set. In one example, a first image corresponding to a scene graph with a higher confidence score may be presented in the application interface 112 at a more prominent location, such as at the top of the display, than a second image corresponding to a scene graph with a lower confidence score.
FIG. 6 illustrates an example of a computing system in which the microprocessor architecture disclosed herein may be implemented. The computing system 600 includes at least one processor 602, which may be a single Central Processing Unit (CPU) or an arrangement of multiple processor cores of a multi-core architecture. In the depicted example, processor 602 includes a pipeline 604, an instruction cache 606, and a data cache 608 (as well as other circuitry, not shown). The processor 602 is connected to a processor bus 610, the processor bus 610 being capable of communicating with an external memory system 612 and an Input/Output (I/O) bridge 614. The I/O bridge 614 is capable of communicating with various I/O devices 618A-618D (e.g., disk controllers, network interfaces, display adapters, and/or user input devices such as a keyboard or mouse) via an I/O bus 616.
The external memory system 612 is part of a hierarchical memory system that includes multiple levels of cache, including the first level (L1) instruction cache 606 and data cache 608, as well as any number of higher level (L2, L3, ...) caches within the external memory system 612. Other circuitry (not shown) in the processor 602 supporting the caches 606 and 608 includes a Translation Lookaside Buffer (TLB) and various other circuitry for handling misses in the TLB or the caches 606 and 608. For example, the TLB is used to translate the address of a fetched instruction or referenced data from a virtual address to a physical address and to determine whether a copy of that address is in the instruction cache 606 or the data cache 608, respectively. If a copy of the address is determined to be in the instruction cache 606 or the data cache 608, the instruction or data may be fetched from the L1 cache. If a copy of the address is determined not to be in the instruction cache 606 or the data cache 608, the miss is handled by miss circuitry so that the access may be completed from the external memory system 612. It should be appreciated that the division between which cache levels are within the processor 602 and which are within the external memory system 612 may differ in various examples. For example, an L1 cache and an L2 cache may both be internal, while an L3 (and higher level) cache may be external. The external memory system 612 also includes a main memory interface 620 that connects to any number of memory modules (not shown) serving as main memory (e.g., dynamic random access memory modules).
FIG. 7 shows a schematic diagram of a general-purpose network element or computer system. The general-purpose network element or computer system includes a processor 702 (which may be referred to as a central processing unit or CPU) that is in communication with memory devices including secondary storage 704, read-only memory (ROM) 706, random access memory (RAM) 708, input/output (I/O) devices 710, and a network 712, such as the Internet or any other well-known type of network, which may include network connection devices such as a network interface. Although illustrated as a single processor, the processor 702 is not limited thereto and may comprise multiple processors. The processor 702 may be implemented as one or more CPU chips or cores (e.g., a multi-core processor), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or digital signal processors (DSPs), and/or may be part of one or more ASICs. The processor 702 may be configured to implement any of the aspects described herein. The processor 702 may be implemented using hardware, software, or both hardware and software.
The secondary storage 704 typically comprises one or more disk drives or tape drives and is used for non-volatile storage of data and as an overflow data store in the event that the RAM 708 is not large enough to hold all working data. The secondary storage 704 may be used to store programs that are loaded into the RAM 708 when such programs are selected for execution. The ROM 706 is used to store instructions and perhaps data that are read during program execution. The ROM 706 is a non-volatile memory device that typically has a small memory capacity relative to the larger memory capacity of the secondary storage 704. The RAM 708 is used to store volatile data and perhaps to store instructions. Access to both the ROM 706 and the RAM 708 is typically faster than access to the secondary storage 704. At least one of the secondary storage 704 or the RAM 708 may be configured to store routing tables, forwarding tables, or other tables or information disclosed herein.
It should be appreciated that, by programming and/or loading executable instructions onto the node 700, at least one of the processor 702, the ROM 706, or the RAM 708 is altered, transforming the node 700 in part into a particular machine or device, such as a router, having the novel functionality taught by the present disclosure. It is fundamental to the fields of electrical engineering and software engineering that functionality which can be implemented by loading executable software into a computer can be converted into a hardware implementation by well-known design rules. The decision between implementing a concept in software or in hardware generally turns on considerations of the stability of the design and the number of units to be produced, rather than on any issues involved in translating from the software domain to the hardware domain. In general, a design that is still subject to frequent change may preferably be implemented in software, because re-developing a hardware implementation is more expensive than re-developing a software design.
Generally, a stable design that is to be mass produced may preferably be implemented in hardware, e.g., in an ASIC, because for large production runs a hardware implementation may be less expensive than a software implementation. Typically, a design is developed and tested in software form and later converted, by well-known design rules, into an equivalent hardware implementation in an application-specific integrated circuit that hardwires the software instructions. In the same manner that a machine controlled by a new ASIC is a particular machine or device, a computer that has been programmed and/or loaded with executable instructions may likewise be considered a particular machine or device.
The techniques described herein may be implemented using hardware, firmware, software, or a combination of these. The software used is stored on one or more of the processor-readable storage devices described above to program one or more of the processors to perform the functions described herein. Processor-readable storage can include computer-readable media such as volatile and non-volatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer-readable storage media and communication media. Computer-readable storage media may be implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Examples of computer-readable storage media include RAM, ROM, EEPROM (electrically erasable programmable read-only memory), flash memory or other memory technology, CD-ROM (compact disc read-only memory), digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. A computer-readable medium does not include propagated, modulated, or transient signals.
Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
In alternative embodiments, some or all of the software may be replaced by dedicated hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), complex programmable logic devices (CPLDs), special-purpose computers, and the like. In one embodiment, one or more processors are programmed with software (stored on a storage device) to implement one or more embodiments. The one or more processors may communicate with one or more computer-readable media/storage devices, peripheral devices, and/or communication interfaces.
It should be understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the subject matter to those skilled in the art. Indeed, the present subject matter is intended to cover alternatives, modifications, and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. It will be apparent, however, to one of ordinary skill in the art that the present subject matter may be practiced without these specific details.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the present disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various modifications as are suited to the particular use contemplated.
For purposes of this disclosure, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in the process may be performed by the same or different computing device as used in the other steps, and each step need not necessarily be performed by a single computing device.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
While this specification contains many specifics, these should not be construed as limitations on the scope of the disclosure or of what may be claimed, but rather as descriptions of features specific to particular implementations. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. For example, various forms of the flows shown above may be used, with steps reordered, added, or removed. Accordingly, other implementations are within the scope of the following claims.
The protection sought herein is as set forth in the claims below.

Claims (15)

1. A computer-implemented method, comprising:
generating, by a data processing apparatus, for each image of a plurality of images, a scene graph for the image, wherein generating the scene graph for the image comprises:
identifying, by a machine learning model, a plurality of objects in the image;
extracting, by the machine learning model, a relationship feature that defines a relationship between a first object and a second, different object of a plurality of objects in the image; and
generating, by the machine learning model, a scene graph for the image from the plurality of objects and the relationship feature, the scene graph including a set of nodes and a set of edges interconnecting a subset of nodes in the set of nodes, wherein the first object is represented by a first node in the set of nodes, the second object is represented by a second node in the set of nodes, and the relationship feature is an edge connecting the first node with the second node;
receiving, by the data processing apparatus, a natural language query request for an image of the plurality of images, wherein the natural language query request includes a plurality of terms specifying two or more particular objects and relationships between the two or more particular objects;
generating, by the data processing apparatus, a query graph for the natural language query request;
identifying, by the data processing apparatus, from a plurality of scene graphs generated for the plurality of images, a set of scene graphs of the plurality of scene graphs that match the query graph; and
providing, by the data processing apparatus, a set of images corresponding to the set of scene graphs for display on a user device.
2. The method of claim 1, further comprising:
generating, by the data processing apparatus, a scene graph index from the plurality of scene graphs,
wherein identifying a set of scene graphs of the plurality of scene graphs that match the query graph comprises searching the scene graph index.
3. The method of claim 1, further comprising:
ranking the set of scene graphs matching the query graph, including:
assigning a confidence score for each scene graph that matches the query graph; and
providing a subset of the scene graphs each having at least a threshold score.
4. The method of claim 1, wherein the natural language query request is a voice query from a user, and wherein generating the query graph comprises parsing the voice query into a set of terms.
5. The method of claim 1, wherein identifying a plurality of objects in the image comprises:
generating, by the machine learning model, a set of bounding boxes, each bounding box enclosing an object in the image; and
identifying, by the machine learning model, an object within the bounding box.
6. One or more non-transitory computer-readable media coupled to one or more processors and having instructions stored thereon, which when executed by the one or more processors, cause the one or more processors to perform operations comprising:
generating, for each image of a plurality of images, a scene graph for the image, wherein generating the scene graph for the image comprises:
identifying, by a machine learning model, a plurality of objects in the image;
extracting, by the machine learning model, a relationship feature that defines a relationship between a first object and a second, different object of a plurality of objects in the image; and
generating, by the machine learning model, a scene graph for the image from the plurality of objects and the relationship feature, the scene graph including a set of nodes and a set of edges interconnecting a subset of nodes in the set of nodes, wherein the first object is represented by a first node in the set of nodes, the second object is represented by a second node in the set of nodes, and the relationship feature is an edge connecting the first node and the second node;
receiving a natural language query request for an image of the plurality of images, wherein the natural language query request includes a plurality of terms specifying two or more particular objects and relationships between the two or more particular objects;
generating a query graph for the natural language query request;
identifying, from a plurality of scene graphs generated for the plurality of images, a set of scene graphs of the plurality of scene graphs that matches the query graph; and
providing a set of images corresponding to the set of scene graphs for display on a user device.
7. The computer-readable medium of claim 6, further comprising:
generating a scene graph index from the plurality of scene graphs,
wherein identifying a set of scene graphs of the plurality of scene graphs that match the query graph comprises searching the scene graph index.
8. The computer-readable medium of claim 6, further comprising:
ranking the set of scene graphs matching the query graph, including:
assigning a confidence score for each scene graph that matches the query graph; and
providing a subset of the scene graphs each having at least a threshold score.
9. The computer-readable medium of claim 6, wherein the natural language query request is a voice query from a user, and wherein generating the query graph comprises parsing the voice query into a set of terms.
10. The computer-readable medium of claim 6, wherein identifying the plurality of objects in the image comprises:
generating, by the machine learning model, a set of bounding boxes, each bounding box enclosing an object in the image; and
identifying, by the machine learning model, an object within the bounding box.
11. A system, comprising:
one or more processors; and
a computer-readable medium device coupled to the one or more processors and having instructions stored thereon, which when executed by the one or more processors, cause the one or more processors to perform operations comprising:
generating, for each image of a plurality of images, a scene graph for the image, wherein generating the scene graph for the image comprises:
identifying, by a machine learning model, a plurality of objects in the image;
extracting, by the machine learning model, a relationship feature that defines a relationship between a first object and a second, different object of a plurality of objects in the image; and
generating, by the machine learning model, a scene graph for the image from the plurality of objects and the relationship feature, the scene graph including a set of nodes and a set of edges interconnecting a subset of nodes in the set of nodes, wherein the first object is represented by a first node in the set of nodes, the second object is represented by a second node in the set of nodes, and the relationship feature is an edge connecting the first node and the second node;
receiving a natural language query request for an image of the plurality of images, wherein the natural language query request includes a plurality of terms specifying two or more particular objects and relationships between the two or more particular objects;
generating a query graph for the natural language query request;
identifying, from a plurality of scene graphs generated for the plurality of images, a set of scene graphs of the plurality of scene graphs that matches the query graph; and
providing a set of images corresponding to the set of scene graphs for display on a user device.
12. The system of claim 11, further comprising:
generating a scene graph index from the plurality of scene graphs,
wherein identifying a set of scene graphs of the plurality of scene graphs that match the query graph comprises searching the scene graph index.
13. The system of claim 11, further comprising:
ranking the set of scene graphs matching the query graph, including:
assigning a confidence score for each scene graph that matches the query graph; and
providing a subset of the scene graphs each having at least a threshold score.
14. The system of claim 11, wherein the natural language query request is a voice query from a user, and wherein generating the query graph comprises parsing the voice query into a set of terms.
15. The system of claim 11, wherein identifying the plurality of objects in the image comprises:
generating, by the machine learning model, a set of bounding boxes, each bounding box enclosing an object in the image; and
identifying, by the machine learning model, an object within the bounding box.
CN202080101465.XA 2020-05-30 2020-10-01 System and method for retrieving images using natural language descriptions Pending CN115885275A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202063032569P 2020-05-30 2020-05-30
US63/032,569 2020-05-30
PCT/US2020/053795 WO2021042084A1 (en) 2020-05-30 2020-10-01 Systems and methods for retreiving images using natural language description

Publications (1)

Publication Number Publication Date
CN115885275A (en) 2023-03-31

Family

ID=74685268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080101465.XA Pending CN115885275A (en) 2020-05-30 2020-10-01 System and method for retrieving images using natural language descriptions

Country Status (3)

Country Link
EP (1) EP4154174A4 (en)
CN (1) CN115885275A (en)
WO (1) WO2021042084A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230089148A1 (en) * 2021-09-17 2023-03-23 Robert Bosch Gmbh Systems and methods for interactive image scene graph pattern search and analysis

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7974978B2 (en) * 2004-04-30 2011-07-05 International Business Machines Corporation System and method for graph indexing
US11222044B2 (en) * 2014-05-16 2022-01-11 Microsoft Technology Licensing, Llc Natural language image search
US10262062B2 (en) * 2015-12-21 2019-04-16 Adobe Inc. Natural language system question classifier, semantic representations, and logical form templates
US10909401B2 (en) * 2018-05-29 2021-02-02 Sri International Attention-based explanations for artificial intelligence behavior

Also Published As

Publication number Publication date
EP4154174A4 (en) 2024-02-21
EP4154174A1 (en) 2023-03-29
WO2021042084A1 (en) 2021-03-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination