CN113614711A - Embedding-based image search retrieval - Google Patents

Embedding-based image search retrieval

Info

Publication number
CN113614711A
Authority
CN
China
Prior art keywords
image
embedding
landing page
image search
search query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080006089.6A
Other languages
Chinese (zh)
Inventor
S.K.巴苏
W.尹
W.范
D.格拉斯纳
S.蒂鲁马拉雷迪
T.R.斯特罗曼
S.弗马
M.A.帕塔克
S.卡兰杰卡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Publication of CN113614711A

Classifications

    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G06F16/532 Query formulation, e.g. graphical querying
    • G06F16/538 Presentation of query results
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/587 Retrieval characterised by using metadata, using geographical or spatial information, e.g. location
    • G06N3/02 Neural networks
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for retrieving image search results using an embedding neural network model. In one aspect, an image search query is received. A respective pairwise numerical embedding is determined for each of a plurality of image-landing page pairs, where each pairwise numerical embedding is a numerical representation in an embedding space. An image search query embedding neural network processes features of the image search query and generates a query numerical embedding, which is a numerical representation of the image search query in the same embedding space. A subset of the image-landing page pairs whose pairwise numerical embeddings in the embedding space are closest to the query numerical embedding of the image search query is identified as first candidate image search results.

Description

Embedding-based image search retrieval
Background
This description relates generally to retrieving image search results.
Online search engines typically retrieve candidate resources (e.g., images) in response to a received search query to present search results that identify resources that are responsive to the search query. Search engines typically retrieve search results through a term-based retrieval system that identifies search results based on keywords of a search query. The search engine may retrieve resources based on various factors.
Some conventional image search engines, i.e., search engines configured to identify images on landing pages (e.g., on web pages on the internet), generate separate signals from i) features of the images and ii) features of the landing pages in response to received search queries, and then combine the separate signals according to the same fixed weighting scheme for each received search query.
Disclosure of Invention
This specification describes techniques for retrieving image search results in response to an image search query using a trained embedded neural network model.
In one aspect, a method includes: receiving an image search query; determining a respective pairwise numerical embedding for each of a plurality of image-landing page pairs, each image-landing page pair comprising a respective image and a respective landing page for the respective image, wherein each pairwise numerical embedding is a numerical representation in an embedding space; processing features of the image search query using an image search query embedding neural network to generate a query numerical embedding of the image search query, wherein the query numerical embedding is a numerical representation in the same embedding space; and identifying, as first candidate image search results for the image search query, image search results that identify a subset of the image-landing page pairs having pairwise numerical embeddings in the embedding space that are closest to the query numerical embedding of the image search query. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. For a system of one or more computers, being configured to perform certain operations or actions means that the system has installed thereon software, firmware, hardware, or a combination thereof that, in operation, causes the system to perform the operations or actions. By one or more computer programs configured to perform certain operations or actions, it is meant that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.
The subject matter described in this specification can be implemented in particular embodiments to realize one or more of the following advantages. As described herein, retrieving image search results by evaluating proximity in an embedding space defined by embeddings generated by a trained embedded neural network model allows images that are responsive to an image search query to be provided in response to that query. That is, the images provided in response to the image search query are responsive to the image search query. Unlike conventional methods of retrieving resources, the embedded neural network model receives an input comprising features of the image search query and features of the landing page and image identified by a given image search result, and generates an embedded representation of the image search result in the same embedding space as the embedded representation generated for the received query. Such embedded representations may model more general semantic relationships between features. Thus, the distance in the embedding space reflects the similarity of one point to another, and any query or search result can be represented as a point in the embedding space. This may allow for efficient retrieval of relevant image search results. Retrieval in the embedding space is computationally efficient because fast algorithms can be used to find the nearest neighbors or near-nearest neighbors in the embedding space. In some implementations, the distance in the embedding space can be used for ranking. For example, given a query and a set of image-landing page pairs, the image-landing page pairs can be ordered and ranked by their corresponding distance to the query in the embedding space. Furthermore, by utilizing an embedding-based retrieval system as well as a term-based retrieval system, the system can retrieve relevant candidate search results that do not exactly match all terms of the search query, which is beneficial for long or ambiguous search queries.
Having query and image-landing page pairs in the same embedding space may enable features that require identification of relationships between different queries and different landing pages. For example, the features may include one or more of the following: obtaining related queries based on the query, obtaining related documents based on the document, obtaining related queries based on the document, or obtaining related documents based on the query. These features can be supported by the same embedded neural network model without the need for separate indexing and retrieval systems used in conventional approaches.
In some implementations, embeddings of queries and image-landing page pairs in different languages can be learned simultaneously in the same embedding space. The distance in the embedding space can be used to link landing pages with similar content in different languages. The distance in the embedding space can likewise be used to recognize that queries in different languages have similar content. These connections, which are provided by the images, can be captured by the embedded neural network model. The same or similar images may exist on landing pages in different languages, and the embedded neural network model can take advantage of this language-independent similarity in the embedding space to help identify such connections.
The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Drawings
FIG. 1A is a block diagram of an example search system.
FIG. 1B illustrates an example of identifying an image-landing page pair as a candidate image search result for an image search query.
FIG. 2 illustrates an example architecture of an embedded neural network for generating candidate image search results from image-landing page pairs and image search queries.
FIG. 3 is a flow diagram of an example process for generating image search results from an image search query.
FIG. 4 is a flow diagram of an example process for training an embedded neural network.
Like reference numbers and designations in the various drawings indicate like elements.
Detailed Description
FIG. 1A illustrates an example image search system 114. Image search system 114 is an example of an information retrieval system in which the systems, components, and techniques described below may be implemented.
The user 102 may interact with the image search system 114 through the user device 104. For example, the user device 104 may be a computer coupled to the image search system 114 via a data communication network 112 (e.g., a Local Area Network (LAN) or Wide Area Network (WAN), such as the internet, or a combination of networks). In some cases, the image search system 114 may be implemented on the user device 104, for example, if the user installs an application that performs a search on the user device 104. The user device 104 will typically include a memory (e.g., Random Access Memory (RAM) 106) for storing instructions and data and a processor 108 for executing stored instructions. The memory may include both read-only and writable memory.
The image search system 114 is configured to search a collection of images. Typically, the images in the collection are images found on web pages on the internet or a private network (e.g., an intranet). The web page on which the image is found (i.e., the web page in which the image is included) will be referred to as the landing page for the image in this specification.
The user 102 may submit a search query 110 to the image search system 114 using the user device 104. When the user 102 submits the search query 110, the search query 110 is sent to the image search system 114 over the network 112.
When the image search system 114 receives the search query 110, a search engine 130 within the image search system 114 identifies image-landing page pairs that satisfy the search query 110 and responds to the query 110 by generating search results 128, each search result 128 identifying a corresponding image-landing page pair that satisfies the search query 110. Each image-landing page pair includes an image and a landing page on which the image is found. For example, an image search result may include a lower-resolution version or a crop of the image and data identifying the landing page, e.g., a resource locator of the landing page, a title of the landing page, or other identifying information. The image search system 114 sends the search results 128 to the user device 104 over the network 112 for presentation to the user 102, i.e., in a form that can be presented to the user 102.
Search engine 130 may include an indexing engine 132, a ranking engine 134, and a retrieval engine 135. The indexing engine 132 indexes the image-landing page pairs and adds the indexed image-landing page pairs to the index database 122. That is, the index database 122 includes data identifying images and corresponding landing pages for each image.
The index database 122 also associates image-landing page pairs with (i) features of the images (i.e., features that characterize the images) and (ii) features of the landing pages (i.e., features that characterize the landing pages). Examples of features of the images and landing pages are described in more detail below.
The retrieval engine 135 identifies candidate image-landing page pairs for the search query 110. The candidate image-landing page pairs comprise a subset of the available image-landing page pairs, i.e., a subset of pairs identified in the index database 122.
In particular, as part of identifying candidate image search results, the retrieval engine 135 may map each of the search query 110 and image-landing page pairs to the same embedding space by using the trained embedded neural network model 136. The distance between the embedding of the image-landing page pair in the embedding space and the embedding of the search query 110 may reflect the relevance of the image-landing page pair to the search query 110. The retrieval engine 135 identifies a subset of the available image-landing page pairs in the embedding space that are closest to the search query as candidate image search results. The candidate image search results may later be ranked by the ranking engine 134.
For each image-landing page pair, the retrieval engine 135 determines a pairwise numerical embedding, which is a numerical representation of the image-landing page pair in an embedding space. In some implementations, the system can access an index database 122 that associates image-landing page pairs with corresponding previously generated pairwise numerical embeddings. In some other implementations, the system can process features of each image-landing page pair using a trained embedded neural network to generate a respective pairwise numerical embedding of the image-landing page pair at query time.
In some implementations, the retrieval engine 135 can include two or more retrieval systems that each generate a set of candidate image-landing page pairs. For example, in addition to the embedding-based retrieval system discussed above, the retrieval engine 135 may include a term-based retrieval system that identifies image-landing page pairs based on keywords. The retrieval engine 135 may combine the candidates from the embedding-based retrieval system and the candidates from the term-based retrieval system to generate a final set of candidate image-landing page pairs. By utilizing an embedding-based retrieval system in addition to a term-based retrieval system, the retrieval engine 135 may retrieve relevant results that do not exactly match all terms of the query. This advantage is useful for long or ambiguous queries.
The ranking engine 134 generates respective ranking scores for the candidate image-landing page pairs. The ranking engine 134 may generate relevance scores based on scores stored in the index database 122 or relevance scores computed at query time, and then rank the candidate image-landing page pairs based on the respective ranking scores. The relevance score for a given image-landing page pair reflects the relevance of the image-landing page pair to the received search query 110, the quality of the given image-landing page pair, or both.
The embedded neural network model 136 may be any of a variety of embedding neural network models. For example, the embedded neural network model 136 may be a deep machine learning model, e.g., a neural network that includes multiple layers of nonlinear operations.
The retrieval of candidate image-landing page pairs using an embedded neural network model will be described in more detail below with reference to fig. 2 and 3.
To train the embedded neural network model 136 so that the embedded neural network model 136 can be used to accurately generate embedded representations of image-landing page pairs and search queries in the embedding space, the image search system 114 includes a training engine 160. The training engine 160 trains the embedded neural network model 136 on training data generated using image-landing page pairs that have been associated with truth values or known search queries. Training the machine learning model will be described in more detail below with reference to fig. 4.
FIG. 1B illustrates an example of identifying an image-landing page pair as a candidate image search result for an image search query. In the example of FIG. 1B, a user submits an image search query 170 ("conifer"). The system generates image query features 172 based on the image search query 170 submitted by the user. An example of query features 172 is described below with reference to FIG. 2.
The system also generates or obtains landing page features 174 for the landing page and image features 176 for the image that are part of a particular image-landing page pair identified in the index database. Examples of landing page features 174 and image features 176 are described below with reference to FIG. 2. The system then provides the landing page features 174 and the image features 176 as inputs to the pairwise embedding neural network 178. The system also provides the query features 172 as input to the image search query embedding neural network 180.
The pairwise embedding neural network 178 receives an input that includes features of the landing page and features of the image and generates a pairwise numerical embedding of the image-landing page pair. The pairwise numerical embedding is a numerical representation of the image-landing page pair in the embedding space.
The image search query embedding neural network 180 receives an input that includes features of the image search query and generates a query value embedding of the image search query. The query value embedding is a numerical representation of the image search query in the same embedding space as the pairwise numerical embedding of the image-landing page pair.
The system then determines 186 whether the pairwise value embedding 182 is close enough in the embedding space to the query value embedding 184. For example, the system may identify the K candidate image-landing page pairs in the index that have a pair-wise numerical embedding closest to the query numerical embedding. If the system determines that the pairwise value embedding 182 is close enough to the query value embedding 184, the system identifies 188 the image-landing page pair as a candidate image search result. The candidate image search results may be later processed by the ranking engine 134.
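The closeness test in FIG. 1B amounts to a nearest-neighbor search over the indexed pairwise embeddings. The sketch below shows one simple way such a top-K selection could be done with a dot-product score; it is only an illustration under the assumption that the pairwise embeddings are precomputed and stacked as rows of a matrix, and all names and sizes are hypothetical rather than taken from the patent.

```python
import numpy as np

def top_k_candidates(query_embedding, pair_embeddings, k=100):
    """Return indices of the K image-landing page pairs whose pairwise
    embeddings score highest against the query embedding (dot product)."""
    scores = pair_embeddings @ query_embedding          # one score per indexed pair
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]           # unordered top-K indices
    return top[np.argsort(-scores[top])]                # ordered best-first

# Illustrative usage with random 128-dimensional embeddings.
rng = np.random.default_rng(0)
pair_embeddings = rng.normal(size=(10_000, 128))        # one row per image-landing page pair
query_embedding = rng.normal(size=128)
candidates = top_k_candidates(query_embedding, pair_embeddings, k=10)
```

In practice, as noted above, fast approximate nearest-neighbor algorithms can replace the brute-force scan shown here.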
FIG. 2 illustrates an example architecture of an embedded neural network 200 for generating candidate image search results from image-landing page pairs and image search queries. For each image-landing page pair and image search query, the embedded neural network 200 takes as input the query features 202, the image features 206, and the landing page features 208, and may generate an output that can help the system identify whether the image-landing page pair is a candidate image search result. The embedded neural network 200 includes two sub-neural networks: the image search queries embedded neural networks 204 and pairs of embedded neural networks 210.
The image search query embedding neural network 204 takes the query features 202 as input and generates a query value embedded representation 184 of the search query. The query features 202 may include a plurality of features, such as location features, text features, and the like. The location features may characterize a location at which the image search query was submitted. The text features may include single words or double words of the image search query.
In general, the image search query embedding neural network 204 may be a deep neural network that includes a plurality of embedding sub-networks, one for each of a plurality of query features. Each embedding sub-network may generate an embedded representation of an instance of the respective feature. For example, a location embedding sub-network may generate an embedded representation of a location feature, and a text embedding sub-network may generate an embedded representation of a query single word or double word. For example, each single word or double word in a text feature may be represented as a separate token. Single word or double word embeddings may be computed using a lookup table. The lookup table is in effect an embedding weight matrix, and using it is a shortcut to matrix multiplication that improves efficiency. The lookup table may be trained in the same way as the parameters of a weight matrix. The output of the lookup table may be a one-dimensional numeric vector. For example, the word "cat" may be represented as token 543. The embedding of the word "cat" is then the values in row 543 of the lookup table, e.g., an embedding vector of dimension or length 5. After computing the embedding for each token, the numerical embedding representation of the text feature may be the average of all token embeddings.
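As a concrete illustration of the lookup-table mechanism described above, the following sketch embeds a text feature by looking up each token's row and averaging; the vocabulary, table values, and dimension 5 are toy values invented for illustration, not taken from the patent.

```python
import numpy as np

vocab = {"red": 0, "cap": 1, "conifer": 2}                   # toy token ids
lookup_table = np.array([[1.0, 4.0, 6.0, 7.0, 9.0],          # one length-5 row per token
                         [2.0, 0.0, 3.0, 1.0, 5.0],
                         [0.5, 0.5, 2.0, 4.0, 1.0]])

def embed_text(tokens):
    """Look up each token's embedding row and average them into one vector."""
    rows = [lookup_table[vocab[t]] for t in tokens if t in vocab]
    if not rows:                                              # no known tokens
        return np.zeros(lookup_table.shape[1])
    return np.mean(rows, axis=0)

text_embedding = embed_text(["red", "cap"])                   # shape (5,)
```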
The output of each embedded subnetwork may be a vector of values. For example, the numeric vector may be a length 128 vector with floating point numbers.
Each embedding subnetwork is pre-trained to generate an embedded vector of query features of a particular type. The trained sub-network may map different query features of a particular type into a common space. For example, a text embedding subnetwork may map different kinds of query text into a common space by generating corresponding embedding vectors. A query text [ red cap ] may be mapped to a numeric vector [0.1, -0.2, 0.0, …, -0.3, 0.2] as a length 128 vector. These embeddings can model more general semantic relationships and can be effectively used in image search systems.
The outputs of the embedding sub-networks are merged together by an operation such as concatenation or addition to generate an embedded representation of the image search query. For example, assuming that the output of the location embedding sub-network is a length-128 vector and the output of the text embedding sub-network is also a length-128 vector, these outputs may be concatenated together to generate a length-256 vector that summarizes the embedded representations of the text features and location features of the image search query.
In some implementations, the merged features are processed through one or more fully connected layers that further extract features from the merged features to generate a final query value embedding 184 of the image search query.
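A minimal sketch of such a query embedding network in Keras is shown below, assuming only two query features (text n-gram tokens and a location id). The vocabulary sizes, the 128 dimension, the layer choices, and all names are illustrative assumptions, and masking of padding tokens is omitted for brevity.

```python
import tensorflow as tf

# Illustrative sizes; the patent does not specify vocabularies or dimensions.
TEXT_VOCAB, LOC_VOCAB, DIM = 50_000, 10_000, 128

text_ids = tf.keras.Input(shape=(None,), dtype="int32", name="query_ngrams")
loc_id = tf.keras.Input(shape=(), dtype="int32", name="query_location")

# Text sub-network: look up each token embedding and average them.
text_emb = tf.keras.layers.Embedding(TEXT_VOCAB, DIM)(text_ids)
text_emb = tf.keras.layers.GlobalAveragePooling1D()(text_emb)

# Location sub-network: a single categorical feature.
loc_emb = tf.keras.layers.Embedding(LOC_VOCAB, DIM)(loc_id)

# Merge sub-network outputs and refine with fully connected layers.
merged = tf.keras.layers.Concatenate()([text_emb, loc_emb])
query_embedding = tf.keras.layers.Dense(DIM, activation="relu")(merged)
query_embedding = tf.keras.layers.Dense(DIM)(query_embedding)

query_network = tf.keras.Model([text_ids, loc_id], query_embedding, name="query_network")
```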
The paired embedding neural network 210 takes as input the image features 206 and landing page features 208 and generates paired numerical value embedding 182 of image-landing page pairs. The image features 206 and landing page features 208 may be from the index database 122 or other data maintained by the system that associates images and landing pages with corresponding features.
The image features 206 may include one or more of pixel data of an image or an embedding of an image characterizing content of the image. For example, the image feature may include all or a portion of pixels of the image that are capable of representing original content information of the image. As another example, the image features 206 may include embedded vectors that represent the content of the image. These embedded vectors representing the image may be derived by processing the image through another embedded neural network. Alternatively, the embedded vector may be generated by other image processing techniques for feature extraction. Example feature extraction techniques include edge, corner, ridge, and blob detection.
In some implementations, the embedded vectors of image content may be pre-generated and stored in an index database. Thus, the embedded representation of the content of the image can be obtained directly by accessing the index database without the need to compute it in the embedded neural network 200.
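The sketch below shows one way such an image-content embedding could be derived, e.g., offline when building the index database, by running the image through a separate pre-trained image network. The patent does not name a particular network, so MobileNetV2 is used here purely as a stand-in.

```python
import tensorflow as tf

# An illustrative image encoder; any pre-trained image network could play this role.
image_encoder = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", weights="imagenet")

def image_content_embedding(pixels):
    """pixels: a float tensor of shape (height, width, 3) with values in [0, 255]."""
    pixels = tf.image.resize(pixels, (224, 224))
    pixels = tf.keras.applications.mobilenet_v2.preprocess_input(pixels)
    return image_encoder(pixels[tf.newaxis, ...])[0]          # a 1280-dimensional vector
```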
The image features 206 may also include data identifying a domain of the image, and/or text from a Uniform Resource Locator (URL) of the image, such as a single word or double word. The textual features of the image and the textual features from the search query both include a single word or a double word. Thus, they can both be mapped to the same embedding space later by the embedding neural network 200. The corresponding embedded representations of the relevant text features are closer to each other in the embedding space than the corresponding embedded representations of the text features that are less relevant or irrelevant.
The landing page features 208 may include one or more of text from a title of the landing page, salient terms appearing on the landing page, text from a URL of the landing page, and data identifying a domain of the landing page. Further examples of features extracted from the landing page include the date the page was first crawled or updated, data characterizing the author of the landing page, the language of the landing page, keywords representing the content of the landing page, features of links to the image and the landing page (such as the anchor text or source page of a link), features describing the context of the image in the landing page, and so forth.
The landing page features 208 may also include features extracted from the landing page that describe the context of the images in the landing page. Examples of features extracted from the landing page that describe the context of the image in the landing page include data characterizing the location of the image in the landing page, the saliency of the image on the landing page, textual descriptions of the image on the landing page, and so forth. The location of the image in the landing page can be accurately located using a pixel-based geometric location in the horizontal and vertical dimensions, a length (e.g., in inches) based on the user device in the horizontal and vertical dimensions, an XPATH-like identifier based on HTML/XML DOM, a CSS-based selector, and the like. The relative sizes of the images displayed on the generic device and the specific user device may be used to measure the saliency of the images on the landing page. The textual description of the image on the landing page may include a substitute text label for the image, text surrounding the image, and the like.
Similar to the image search query embedded neural network 204, the pair-wise embedded neural network 210 may be a deep neural network that includes a plurality of embedded sub-networks for each of a plurality of image-landing page pair features. Each embedding subnetwork may generate an embedded representation of an instance of the respective feature. For example, the domain embedding subnetwork may generate an embedded representation of the page domain features, and the text embedding subnetwork may generate an embedding of the text data of the image URL. The output of each embedded subnetwork may be a vector of values. For example, the numeric vector may be a length 128 vector with floating point numbers.
Similar to the image search query embedding neural network 204, the outputs of the embedding sub-networks are merged together by an operation such as concatenation or addition to generate an embedded representation of the image-landing page pair. For example, the outputs from the multiple embedding sub-networks may be multiple embedding vectors, each of length 128, for the page title single/double words, the page salient terms, the page URL single/double words, the image URL single/double words, the image domain, and so on. In some implementations, a length-128 embedding vector of the image content may be obtained from the index database. The N embedding vectors may be concatenated together to generate a length-128×N vector that summarizes the embedded representation of the features of the image-landing page pair. Similar to the image search query embedding neural network 204, in some embodiments, the merged features are processed through one or more fully connected layers that further extract features from the merged features to generate the final pairwise value embedding 182 of the image-landing page pair. The pairwise value embedding 182 and the query value embedding 184 are in the same embedding space.
In some embodiments, the outputs of the embedding sub-networks may be only partially merged. Rather than merging the outputs of the embedding sub-networks into a single embedded representation of the image-landing page pair, the outputs of the embedding sub-networks may be merged into two or more embedded representations of the image-landing page pair. Thus, the respective final pairwise value embedding 182 may include two or more embedded representations in the same embedding space as the query value embedding 184.
In some implementations, the image search query embedded neural network 204 and the pair of embedded neural networks 210 share at least some parameters. For example, two or more of the subnetworks (such as query text embedding subnetwork, landing page title embedding subnetwork, landing page salient term embedding subnetwork, landing page URL embedding subnetwork, image URL embedding network, etc.) may share parameters because these features are extracted from the same vocabulary. Sharing a parameter by two neural networks means that the two neural networks are constrained to have the same value for each parameter shared.
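The sketch below shows how such parameter sharing could look in Keras: a single text-embedding layer object is reused for query n-grams, landing page title n-grams, and image URL n-grams, so the two networks are constrained to the same values for those parameters. Only text features are shown, and all sizes and names are illustrative assumptions rather than the patent's implementation.

```python
import tensorflow as tf

DIM, TEXT_VOCAB = 128, 50_000                                  # illustrative sizes

# One Embedding layer object, reused wherever features come from the same vocabulary,
# so those parameters are shared between the query network and the pairwise network.
shared_text_embedding = tf.keras.layers.Embedding(TEXT_VOCAB, DIM)

def text_feature(name):
    ids = tf.keras.Input(shape=(None,), dtype="int32", name=name)
    emb = tf.keras.layers.GlobalAveragePooling1D()(shared_text_embedding(ids))
    return ids, emb

query_ids, query_emb = text_feature("query_ngrams")
title_ids, title_emb = text_feature("page_title_ngrams")
url_ids, url_emb = text_feature("image_url_ngrams")

query_network = tf.keras.Model(query_ids, tf.keras.layers.Dense(DIM)(query_emb))
pair_merged = tf.keras.layers.Concatenate()([title_emb, url_emb])
pair_network = tf.keras.Model([title_ids, url_ids], tf.keras.layers.Dense(DIM)(pair_merged))
```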
In some implementations, the image search query embedded neural network 204 and the pair-wise embedded neural network 210 can be trained jointly to facilitate training shared parameters between these networks. More details regarding training the embedded neural network will be described in more detail below with reference to FIG. 4.
The prediction layer 212 compares the pairwise value embedding 182 with the query value embedding 184 in the same embedding space. In some implementations, the prediction layer 212 may output a distance value that measures the proximity of the pairwise value embedding 182 and the query value embedding 184. For example, the prediction layer 212 may compute a dot product between the pairwise value embedding 182 and the query value embedding 184.
The output from the prediction layer 212 is used differently during training of the embedded neural network 200 and during image searching. During an image search, the retrieval engine 135 may identify candidate image search results for the search query based on the output from the prediction layer 212, which measures the proximity of the embedded representation of an image-landing page pair to the embedded representation of the search query. When training the embedded neural network 200, the training engine 160 may jointly train the pairwise embedding neural network and the image search query embedding neural network to minimize a loss function that depends on the output (e.g., the dot product) from the prediction layer 212.
FIG. 3 is a flow diagram of an example process 300 for generating image search results from an image search query. For convenience, process 300 will be described as being performed by a system of one or more computers located at one or more locations. For example, an image search system (e.g., image search system 114 of FIG. 1A) suitably programmed in accordance with the subject specification can perform process 300.
An image search system receives an image search query from a user device (302). In some cases, the image search query is submitted through a dedicated image search interface (i.e., a user interface for submitting image search queries) provided by the image search system. In other cases, a search query is submitted via a general Internet search interface, and image search results, as well as other types of search results, i.e., search results that identify other types of content available on the Internet, are displayed in response to the image search query.
Upon receiving the image search query, the image search system identifies an initial image-landing page pair (304). For example, the system may identify an initial image-landing page pair from image pairs indexed in a search engine index database based on signals measuring the quality of the image pair, the relevance of the image pair to the search query, or both.
For each image-landing page pair, the system determines a corresponding pair-wise numerical value embedding (306) that is a numerical representation of the image-landing page pair in an embedding space. In some implementations, the system can access an index database that associates image-landing page pairs with corresponding pairwise numerical value embeddings that have been previously generated using pairwise embedding neural networks. This may save image search time because the pairwise numerical value embedding has been previously calculated and stored.
In some other implementations, the system can process features of each image-landing page pair using a pairwise embedding neural network to generate a respective pairwise numerical embedding of the image-landing page pair. The features of each image-landing page pair may include features of the image and features of the landing page. These features may come from an index database or other data maintained by the system that associates images and landing pages with corresponding features. These features may be represented categorically or discretely. Furthermore, additional relevant features may be created from pre-existing features. For example, the system may create a relationship between two or more features through a combination of addition, multiplication, or other mathematical operations.
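As a small illustration of creating additional features from pre-existing ones, the sketch below crosses two hypothetical per-pair features by multiplication and addition; the feature names and values are invented for illustration only.

```python
import numpy as np

# Two hypothetical pre-existing features for three image-landing page pairs.
image_saliency = np.array([0.9, 0.2, 0.6])     # how prominent the image is on its page
page_quality = np.array([0.8, 0.5, 0.4])       # a per-landing-page quality score

crossed = image_saliency * page_quality         # multiplicative combination as a new feature
summed = image_saliency + page_quality          # additive combination as another
pair_features = np.stack([image_saliency, page_quality, crossed, summed], axis=1)
```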
The system obtains features of the image search query (308) and processes the features of the image search query using the image search query embedded neural network (310). The image search query embedding neural network may generate a query value embedding of the image search query. The generated query numerical embedding is a numerical representation of the image search query in the same embedding space as the paired numerical representation of the image-landing page pair.
The system identifies a subset of initial image-landing page pairs as first candidate image search results (312). The subset of initial image-landing page pairs has a pair-wise numerical embedding in the embedding space that is closest to the query numerical embedding of the image search query. For example, a nearest neighbor search may be used to select the first K image-landing page pairs of the initial image-landing page pair having an embedded representation that is closest to the embedded representation of the search query.
Feature embeddings can model more general semantic relationships between features. The closeness of the numerical embeddings may be trained to measure the relevance of candidate image search results to an image search query. In some implementations, the closeness of the numerical embeddings may be trained to measure the likelihood that a user submitting the search query will interact with a search result. Numerical embeddings that are closer to each other indicate candidate image search results that users submitting the search query will find more relevant and are more likely to interact with. Training the embedded neural network to generate numerical embeddings is described below with reference to FIG. 4.
The first candidate image search results typically include many fewer candidates than the initial image search results. For example, the number of first candidate image search results may be limited to on the order of one hundred results or fewer. This is far fewer than the initial image search results, which may number in the thousands or millions.
In some implementations, after obtaining the first candidate image search results, the system then generates a plurality of second candidate image search results that include at least some of the first candidate image search results. For example, the system may obtain other candidates that are retrieved by a keyword-based (term-based) retrieval system. The system may merge the term-based candidates and the embedding-based candidates and send the merged candidates for a second round of relevance scoring. After the second round of relevance scoring, the second candidate image search results may be selected from the embedding-based first candidate image search results and the term-based candidate image search results.
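A simple way to merge the two candidate sets before the second round of relevance scoring is sketched below; the candidate representation (a dict with image and page URLs) is an assumption made for illustration.

```python
def merge_candidates(embedding_based, term_based):
    """Merge embedding-based and term-based candidates, de-duplicating
    image-landing page pairs by their (image_url, page_url) key."""
    merged = {}
    for candidate in list(embedding_based) + list(term_based):
        key = (candidate["image_url"], candidate["page_url"])
        merged.setdefault(key, candidate)
    return list(merged.values())
```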
The system ranks the plurality of second candidate image search results using a ranking engine (314). The ranking engine may generate relevance scores based on scores stored in the index database or scores computed at query time, and rank the plurality of second image-landing page pairs based on the respective ranking scores. The relevance score for a candidate image-landing page pair reflects the relevance of the image-landing page pair to the received search query, the quality of the given image-landing page pair, or both. The system ranks the image search results based on the relevance scores of the respective image-landing page pairs.
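One possible way to combine a stored score with a query-time score and rank the candidates is sketched below; the patent does not fix how the two scores are combined, so the simple sum used here is an assumption, as are the names.

```python
def rank_candidates(candidates, stored_scores, query_time_scorer):
    """Order candidates by a relevance score combining an indexed score
    (if present) with a score computed at query time."""
    def relevance(candidate):
        key = (candidate["image_url"], candidate["page_url"])
        return stored_scores.get(key, 0.0) + query_time_scorer(candidate)
    return sorted(candidates, key=relevance, reverse=True)
```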
The system generates an image search result presentation showing image search results ordered according to the ranking (316), and provides the image search result presentation for presentation (318) by sending the search result presentation over a network to a user device from which the image search query was received in a form that can be presented to the user.
FIG. 4 is a flow diagram of an example process 400 for training an embedded neural network. For convenience, process 400 will be described as being performed by a system of one or more computers located at one or more locations. For example, an image search system (e.g., image search system 114 of FIG. 1A) suitably programmed in accordance with the subject specification can perform process 400.
The system receives a set of training image search queries and, for each training image search query, receives training image search results for the query (402). Each training image search result may be identified as either a positive training example or a negative training example. In some implementations, when a user interacts with a search result that identifies a training image-landing page pair after submitting a training image search query, the system identifies the training image search query and the training image-landing page pair as a positive training example.
For each of the training image search queries, the system generates training examples using features of the image search query (404). For each of the training image search results, the system generates training examples using the features of the image-landing page pairs (408). For each training pair, the system identifies (i) features of the image search query, (ii) features of the image, and (iii) features of the landing page. Extracting, generating, and selecting features may occur prior to training or using other embedded models. Examples of features are described above with reference to fig. 2.
The system trains the pairwise embedding neural network (410) and the image search query embedding neural network (406) jointly on the training examples. The system jointly trains the two neural networks to minimize a loss function that depends on the dot product between (i) the query numerical embedding of the training image search query and (ii) the pairwise numerical embedding of the training image-landing page pair. For example, the loss function may cause the dot product to be higher when the training image search query and the training image-landing page pair have been identified as a positive training example than when they have been identified as a negative training example.
In some implementations, the image search query embedding neural network may be pre-trained on other embedded representation tasks. For example, the image search query embedding neural network may be implemented with a lookup table having predetermined or trained parameters. The numerical representation of a training image search query may be computed by indexing into the lookup table using the token representations of the training image search query. In some embodiments, the pairwise embedding neural network may likewise be pre-trained on other embedded representation tasks.
In some implementations, the paired embedded neural network and the image search query embedded neural network may share at least some parameters. For example, the paired embedded neural network and the image search query embedded neural network may share parameters corresponding to any features extracted from the same vocabulary. The shared neural network parameters may be efficiently trained by the joint training method described above.
In some embodiments, the system may implement the loss function using any of a variety of available loss functions when training the embedded neural network model, in order to make efficient use of the large amount of available data. Examples of loss functions that may be used to train the model include softmax with cross-entropy loss, sampled softmax loss (Jean, Sébastien, et al., "On Using Very Large Target Vocabulary for Neural Machine Translation," arXiv preprint arXiv:1412.2007, 2014), a contrastive loss function, or a combination of two or more thereof.
In some embodiments, the system may train the embedded neural network model in several stages, and the system may implement a different kind of loss function at each stage of the training process. For example, the system may use a softmax loss function in a first stage and may use a contrastive loss function or an asymmetrically scaled sigmoid loss function in a subsequent stage. In some embodiments, in one or more stages subsequent to the first stage, hard examples (e.g., training examples with large loss values in one or more previous training stages) may be used during training to increase the convergence speed of the training process or to improve the performance of the final trained model.
For example, the system receives a set of 4096 training image search queries and, for each training image search query query_i, receives the selected image search result SelectedImage_i, i.e., the selected image-landing page pair. Here, the index i = 1, 2, …, 4096. For each training image search query query_i, the system generates a positive training example (query_i, SelectedImage_i) and generates 4095 negative training examples (query_i, SelectedImage_j), where i ≠ j. During training, for each positive or negative training example, the embedded neural network may output a dot product normalized to the range [0, 1] by the softmax function. The system may then compute the softmax loss for each training image search query query_i using the normalized dot products computed from its one positive training example and its corresponding 4095 negative training examples. Because the number of training image search queries is very large, the sampled softmax loss considers only a subset of the training examples to compute the loss, rather than computing the softmax loss over all 4096 training image search queries. The total loss is the sum of the losses computed for each of the 4096 training image search queries.
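A compact way to express this loss is to score every query in a batch against every selected image-landing page pair in the same batch, treating the diagonal as the positive examples; the sketch below does this with plain NumPy for a small batch (the 4096-query setup above is just a larger batch). Everything here is an illustrative reconstruction, not the patent's code.

```python
import numpy as np

def in_batch_softmax_loss(query_embs, pair_embs):
    """Softmax loss over dot products for a batch of B (query, selected pair)
    examples: row i's positive is column i; the other B-1 columns act as negatives."""
    logits = query_embs @ pair_embs.T                         # (B, B) dot products
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                       # average over the B queries

rng = np.random.default_rng(0)
loss = in_batch_softmax_loss(rng.normal(size=(8, 128)), rng.normal(size=(8, 128)))
```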
The system trains the embedded neural network by minimizing a loss function. For example, the system may train the embedded neural network model to determine training values for the weights of the neural network from initial values of the weights by repeatedly performing a neural network training process to calculate gradients of the loss function with respect to the weights (e.g., using back propagation) and determine updates of the weights from the gradients (e.g., using update rules corresponding to the neural network training process).
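A sketch of one such training step is shown below, assuming the two embedding networks are Keras models like the earlier sketches; TensorFlow's gradient tape plays the role of backpropagation, and the optimizer choice is an assumption.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)      # illustrative optimizer

def train_step(query_network, pair_network, query_features, pair_features):
    """One joint update of both networks: compute an in-batch softmax loss on the
    dot products, backpropagate, and apply the gradient updates."""
    with tf.GradientTape() as tape:
        q = query_network(query_features, training=True)       # (B, D) query embeddings
        p = pair_network(pair_features, training=True)         # (B, D) pairwise embeddings
        logits = tf.matmul(q, p, transpose_b=True)              # (B, B) dot products
        labels = tf.range(tf.shape(logits)[0])                  # positives on the diagonal
        loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
            labels, logits, from_logits=True))
    variables = query_network.trainable_variables + pair_network.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```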
The term "configured" is used herein in connection with system and computer program components. For a system consisting of one or more computers, being configured to perform certain operations or actions means that the system has installed thereon software, firmware, hardware, or a combination thereof that, in operation, causes the system to perform the operations or actions. By one or more computer programs configured to perform certain operations or actions, it is meant that the one or more programs include instructions, which when executed by a data processing apparatus, cause the apparatus to perform the operations or actions.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal (e.g., a machine-generated electrical, optical, or electromagnetic signal) that is generated to encode information for transmission to suitable receiving apparatus for execution by the data processing apparatus.
The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be or include special purpose logic circuitry, e.g., a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for the computer program, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program, which can also be referred to or described as a program, software application, app, module, software module, script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, such as one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
In this specification, the term "database" is used broadly to refer to any collection of data: the data need not be structured in any particular way, or at all, and it may be stored on a storage device in one or more locations. Thus, for example, an index database may include multiple sets of data, each of which may be organized and accessed differently.
Similarly, in this specification, the term "engine" is used broadly to refer to a software-based system, subsystem, or process programmed to perform one or more particular functions. Typically, the engine will be implemented as one or more software modules or components installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines may be installed and run on the same computer or computers.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and in particular by, special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
A computer suitable for executing a computer program may be based on a general purpose or special purpose microprocessor or both, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such a device. Further, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD ROM and DVD-ROM disks.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device (e.g., a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other types of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. Further, the computer may interact with the user by sending and receiving documents to and from the device used by the user; for example, by sending a web page to a web browser on the user device in response to a request received from the web browser. In addition, the computer may interact with the user by sending a text message or other form of message to a personal device (e.g., a smartphone running a messaging application) and, in turn, receiving a response message from the user.
The data processing apparatus for implementing the machine learning model may also comprise, for example, a dedicated hardware accelerator unit for processing common and computationally intensive parts of the machine learning training or production, i.e. inference, workload.
The machine learning model may be implemented and deployed using a machine learning framework (e.g., a TensorFlow framework, a Microsoft Cognitive Toolkit framework, an Apache Singa framework, or an Apache MXNet framework).
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification), or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN) and a Wide Area Network (WAN), e.g., the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, the server sends data (e.g., HTML pages) to the user device, for example, for the purpose of displaying data to and receiving user input from a user interacting with the device acting as a client. Data generated on the user device (e.g., the results of the user interaction) may be received at the server from the device.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings and referred to in the claims as being performed in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims (18)

1. A method, comprising:
receiving an image search query;
determining a respective pair-wise numerical embedding for each of a plurality of image-landing page pairs, each image-landing page pair comprising a respective image and a respective landing page for the respective image, wherein each pair-wise numerical embedding is a numerical representation in an embedding space;
processing features of the image search query using an image search query embedding neural network to generate a query numerical embedding of the image search query, wherein the query numerical embedding is a numerical representation in the same embedding space; and
identifying, as first candidate image search results for the image search query, image search results that identify a subset of the image-landing page pairs whose pair-wise numerical embeddings are closest in the embedding space to the query numerical embedding of the image search query.
2. The method of claim 1, further comprising:
ranking a plurality of second candidate image search results that include at least some of the first candidate image search results;
generating an image search result presentation showing second candidate image search results ordered according to the ranking; and
providing the image search result presentation for presentation on a user device.
3. The method of any preceding claim, wherein determining a respective pair-wise numerical embedding for each of a plurality of image-landing page pairs comprises:
accessing an index database that associates the image-landing page pairs with respective pair-wise numerical embeddings that have been generated for the image-landing page pairs using a pair-wise embedding neural network.
4. The method of any of claims 1 or 2, wherein determining a respective pair-wise numerical embedding for each of a plurality of image-landing page pairs comprises:
processing features of each image-landing page pair using a pair-wise embedding neural network to generate the respective pair-wise numerical embedding for the image-landing page pair.
5. The method of any of claims 3 or 4, wherein the pair-wise embedding neural network and the image search query embedding neural network have been jointly trained.
6. The method of claim 5, wherein the pair-wise embedding neural network and the image search query embedding neural network have been jointly trained to minimize a loss function that depends on a dot product between (i) a query numerical embedding of a training image search query and (ii) a pair-wise numerical embedding of a training image-landing page pair.
7. The method of claim 6, wherein the loss function causes the dot product between (i) a query numerical embedding of a training image search query and (ii) a pair-wise numerical embedding of a training image-landing page pair to be higher when the training image search query and the training image-landing page pair have been identified as a positive training example than when the training image search query and the training image-landing page pair have been identified as a negative training example.
8. The method of claim 7, further comprising:
identifying the training image search query and the training image-landing page pair as a positive training example when a user interacted with a search result identifying the training image-landing page pair after submitting the training image search query.
9. The method of any of claims 3-8, wherein the pair-wise embedding neural network and the image search query embedding neural network share at least some parameters.
10. The method of claim 9, wherein the pair-wise embedding neural network and the image search query embedding neural network share parameters corresponding to features extracted from the same vocabulary.
11. The method of any preceding claim, wherein the features of the image search query comprise data characterizing a location at which the image search query was submitted.
12. The method of any preceding claim, wherein the features of the image search query comprise text of the image search query.
13. The method of any preceding claim, wherein the features of each image-landing page pair comprise a combination of features of the landing page and features of the image.
14. The method of claim 13, wherein the features of the landing page include one or more of text from a title of the landing page, salient terms appearing on the landing page, text from a URL of the landing page, or data identifying a domain of the landing page.
15. The method of any of claims 13 or 14, wherein the features of the image comprise one or more of pixel data of the image or an embedding of the image.
16. The method of any of claims 13-15, wherein the features of the image include one or more of data identifying a domain of the image or text from a URL of the image.
17. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the respective operations of the method of any preceding claim.
18. One or more non-transitory computer-readable storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the respective operations of the method of any one of claims 1-16.
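For illustration only, and not as part of the claims, the nearest-neighbor retrieval of claim 1 and a dot-product-based training loss of the kind recited in claims 6 and 7 could be sketched as follows. The function names, the top-k cutoff, and the specific hinge/margin form of the loss are assumptions; the claims only require that the loss depend on the dot product and score positive examples above negative ones.

```python
# Illustrative sketch only; names and the hinge form of the loss are assumptions.
import tensorflow as tf

def retrieve_top_k(query_embedding: tf.Tensor,
                   pair_embeddings: tf.Tensor,
                   k: int = 100) -> tf.Tensor:
    """Returns indices of the k image-landing page pairs whose pair-wise
    embeddings score highest against the query embedding by dot product."""
    # scores[i] = dot(query_embedding, pair_embeddings[i])
    scores = tf.linalg.matvec(pair_embeddings, query_embedding)
    return tf.math.top_k(scores, k=k).indices

def dot_product_margin_loss(query_emb: tf.Tensor,
                            positive_pair_emb: tf.Tensor,
                            negative_pair_emb: tf.Tensor,
                            margin: float = 0.2) -> tf.Tensor:
    """One possible loss that depends on the dot products and pushes the
    positive dot product above the negative one."""
    pos_score = tf.reduce_sum(query_emb * positive_pair_emb, axis=-1)
    neg_score = tf.reduce_sum(query_emb * negative_pair_emb, axis=-1)
    return tf.reduce_mean(tf.maximum(0.0, margin - pos_score + neg_score))
```

In practice, the pair-wise numerical embeddings would typically be precomputed and stored in an index database, as in claim 3, so that only the query numerical embedding needs to be computed at query time.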
CN202080006089.6A 2020-02-28 2020-02-28 Embedded based image search retrieval Pending CN113614711A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2020/020459 WO2021173158A1 (en) 2020-02-28 2020-02-28 Embedding-based retrieval for image search

Publications (1)

Publication Number Publication Date
CN113614711A (en) 2021-11-05

Family

ID=70009428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080006089.6A Pending CN113614711A (en) 2020-02-28 2020-02-28 Embedded based image search retrieval

Country Status (4)

Country Link
US (2) US11782998B2 (en)
EP (1) EP3891624A1 (en)
CN (1) CN113614711A (en)
WO (1) WO2021173158A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210286851A1 (en) * 2020-03-11 2021-09-16 Microsoft Technology Licensing, Llc Guided query recommendations
US11636291B1 (en) * 2020-04-06 2023-04-25 Amazon Technologies, Inc. Content similarity determination
US11682060B2 (en) * 2021-01-30 2023-06-20 Walmart Apollo, Llc Methods and apparatuses for providing search results using embedding-based retrieval
US11893385B2 (en) 2021-02-17 2024-02-06 Open Weaver Inc. Methods and systems for automated software natural language documentation
US11836202B2 (en) * 2021-02-24 2023-12-05 Open Weaver Inc. Methods and systems for dynamic search listing ranking of software components
US11947530B2 (en) 2021-02-24 2024-04-02 Open Weaver Inc. Methods and systems to automatically generate search queries from software documents to validate software component search engines
US11960492B2 (en) 2021-02-24 2024-04-16 Open Weaver Inc. Methods and systems for display of search item scores and related information for easier search result selection
US11921763B2 (en) 2021-02-24 2024-03-05 Open Weaver Inc. Methods and systems to parse a software component search query to enable multi entity search
US11836069B2 (en) 2021-02-24 2023-12-05 Open Weaver Inc. Methods and systems for assessing functional validation of software components comparing source code and feature documentation
US11853745B2 (en) 2021-02-26 2023-12-26 Open Weaver Inc. Methods and systems for automated open source software reuse scoring

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9158857B2 (en) * 2012-06-05 2015-10-13 Google Inc. Identifying landing pages for images
US8995716B1 (en) * 2012-07-12 2015-03-31 Google Inc. Image search results by seasonal time period
US10831820B2 (en) * 2013-05-01 2020-11-10 Cloudsight, Inc. Content based image management and selection
EP3300002A1 (en) * 2016-09-22 2018-03-28 Styria medijski servisi d.o.o. Method for determining the similarity of digital images
US11074289B2 (en) 2018-01-31 2021-07-27 Microsoft Technology Licensing, Llc Multi-modal visual search pipeline for web scale images
US11809822B2 (en) * 2020-02-27 2023-11-07 Adobe Inc. Joint visual-semantic embedding and grounding via multi-task training for image searching

Also Published As

Publication number Publication date
EP3891624A1 (en) 2021-10-13
US20230409653A1 (en) 2023-12-21
WO2021173158A1 (en) 2021-09-02
US11782998B2 (en) 2023-10-10
US20220012297A1 (en) 2022-01-13

Similar Documents

Publication Publication Date Title
US11782998B2 (en) Embedding based retrieval for image search
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
EP3896581A1 (en) Learning to rank with cross-modal graph convolutions
US10713317B2 (en) Conversational agent for search
US20240078258A1 (en) Training Image and Text Embedding Models
US9177046B2 (en) Refining image relevance models
US9053115B1 (en) Query image search
US9286548B2 (en) Accurate text classification through selective use of image data
US7962500B2 (en) Digital image retrieval by aggregating search results based on visual annotations
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
US20160378863A1 (en) Selecting representative video frames for videos
US8832096B1 (en) Query-dependent image similarity
US20230205813A1 (en) Training Image and Text Embedding Models
US8527564B2 (en) Image object retrieval based on aggregation of visual annotations
US10019513B1 (en) Weighted answer terms for scoring answer passages
CN112823345A (en) Ranking image search results using machine learning models
RU2731658C2 (en) Method and system of selection for ranking search results using machine learning algorithm
US9218366B1 (en) Query image model
EP3485394B1 (en) Contextual based image search results
RU2733481C2 (en) Method and system for generating feature for ranging document
US11379527B2 (en) Sibling search queries
CN112084307A (en) Data processing method and device, server and computer readable storage medium
CN111813888A (en) Training target model
CN113010771B (en) Training method and device for personalized semantic vector model in search engine
KR20120038418A (en) Searching methods and devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination