US8359282B2: Supervised semantic indexing and its extensions (Google Patents)
Publication number: US8359282B2
Authority: US
Grant status: Grant
Prior art keywords: words, vector, weight, block, query
Legal status: Active
Classifications
- G—Physics; G06—Computing, Calculating, Counting; G06F—Electric digital data processing; G06F17/30—Information retrieval; G06F17/3061—Information retrieval of unstructured textual data
- G06F17/30663: Selection or weighting of terms from queries, including natural language queries (under G06F17/30634—Querying; G06F17/30657—Query processing)
- G06F17/30616: Selection or weighting of terms for indexing (under G06F17/30613—Indexing)
Description
This application claims priority to provisional application Ser. No. 61/143,942 filed on Jan. 12, 2009, incorporated herein by reference.
This case is related to application Ser. No. 12/562,802 filed concurrently herewith.
1. Technical Field
The present invention relates to information retrieval technology and, more particularly, to systems and methods for text document matching and ranking.
2. Description of the Related Art
Ranking text documents given a text-based query is one of the key tasks in information retrieval. Classical vector space models use weighted word counts and cosine similarity. This type of model often performs remarkably well, but suffers from the fact that only exact matches of words between query and target texts contribute to the similarity score. It also lacks the ability to adapt to specific datasets, since no learning is involved.
Latent Semantic Indexing (LSI), and related methods such as probabilistic Latent Semantic Indexing (pLSA) and Latent Dirichlet Allocation (LDA), choose a low-dimensional feature representation of "latent concepts", so that words are no longer independent. A support vector machine with hand-coded features based on the title, body, search engine rankings and the URL has also been implemented, as well as neural network methods based on training of a similar set of features. Other methods learned the weights of orthogonal vector space models on Wikipedia links and showed improvements over the OKAPI method. The same authors also used a class of models for matching images to text. Several authors have proposed interesting nonlinear versions of (unsupervised) LSI using neural networks and showed they outperform LSI or pLSA. However, we note that their method is rather slow, so the dictionary size is limited.
A system and method for determining a similarity between a document and a query includes providing a frequently used dictionary and an infrequently used dictionary in storage memory. For each word or gram in the infrequently used dictionary, n words or grams are correlated from the frequently used dictionary based on a first score. Features for a vector of the infrequently used words or grams are replaced with features from a vector of the correlated words or grams from the frequently used dictionary when the features from a vector of the correlated words or grams meet a threshold value. A similarity score is determined between weight vectors of a query and one or more documents in a corpus by employing the features from the vector of the correlated words or grams that met the threshold value.
A system for determining a similarity between a document and a query includes a memory configured to store a frequently used dictionary and an infrequently used dictionary. A processing device is configured to execute a program to correlate n words or grams from the frequently used dictionary based on a first score for each word or gram in the infrequently used dictionary. The program is further configured to replace features for a vector of the infrequently used words or grams with features from a vector of the correlated words or grams from the frequently used dictionary when the features from a vector of the correlated words or grams meet a threshold value. The processing device is further configured to determine a similarity score between weight vectors of a query and one or more documents in a corpus by employing the features from the vector of the correlated words or grams that met the threshold value.
These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:
In accordance with the present principles, a class of models is provided that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), the present models account for correlations between words (e.g., synonymy, polysemy). However, unlike LSI, the present models are trained with a supervised signal (for example, an expert's judgment on whether a document is relevant to a query, or links among documents) directly on the task of interest, which is one reason for superior results. As the query and target texts are modeled separately, the present approach can easily be generalized to other retrieval tasks as well, such as cross-language retrieval. An empirical study is provided on a retrieval task based on Wikipedia documents, where we obtain state-of-the-art performance using the present method.
Embodiments described herein may be entirely hardware, entirely software or including both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer-readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be a magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, an optical disk, etc.
Supervised Semantic Indexing: We illustratively define a similarity function between a query and a document as:

f(q,d) = q^{T}Wd

where q and d are feature vectors of the query and document, respectively. Each feature in a feature vector could be an individual word; its weight can be, for instance, a TF-IDF value. W is a weight matrix and T denotes the transpose. We can see that w_{ij} models the correlation between the i-th feature of query q and the j-th feature of document d.
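As an illustrative sketch (not part of the patent; the function name and toy dictionary are assumptions), this similarity can be computed directly:

```python
import numpy as np

# Illustrative sketch of the SSI similarity f(q, d) = q^T W d, assuming
# dense TF-IDF feature vectors over a toy dictionary. The names here
# (similarity, D) are assumptions, not from the patent.
def similarity(q, W, d):
    """Score a query/document pair as f(q, d) = q^T W d."""
    return float(q @ W @ d)

D = 4                                   # toy dictionary size
q = np.array([1.0, 0.0, 1.0, 0.0])     # query TF-IDF vector
d = np.array([0.0, 1.0, 1.0, 0.0])     # document TF-IDF vector

# With W = I this reduces to the plain dot product (cosine similarity
# once q and d are L2-normalized), i.e., no learned word correlations.
print(similarity(q, np.eye(D), d))      # 1.0 (one overlapping word)
```

A dense learned W lets non-matching word pairs (i, j) contribute through w_{ij}, which is exactly what the identity matrix cannot do.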
Sparse W matrix: We consider both memory and speed. Firstly, this method so far assumes that W fits in memory. For example, if the dictionary size D=30,000, then this needs, e.g., 3.4 GB of RAM (assuming floats). The vectors q and d are sparse, so computing the score of a single query-document pair involves mn computations of w_{ij}q_{i}d_{j}, where q and d have m and n nonzero terms, respectively. We have found this is reasonable for training, but may be an issue at test time. Alternatively, one can compute v=q^{T}W once, and then compute vd for each document. This is the same speed as a classical vector space model where the query contains D terms, assuming W is dense.
Sparse W matrices: If W were itself a sparse matrix, then computation of f(x) would be considerably faster. If the query has m nonzero terms, and any given column of W has p nonzero terms, then the method is at most mp times slower than a classical vector space model. We can enforce sparsity of W using standard feature selection algorithms; we hence generalize the known "Recursive Feature Elimination" algorithm, yielding a simple, intuitive method:
1. First, we train the model with a dense matrix W as before.
2. For each column i of W find the k active elements with the smallest values of w_{ij}. Constrain these elements to equal zero (make them inactive).
3. Train the model with the constrained W matrix.
4. If W contains more than p nonzero terms in each column go back to 2.
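The four steps above might be sketched as follows (illustrative code; the retraining step is stubbed out as a no-op that simply re-applies the sparsity constraints, whereas a real implementation would re-run the margin-ranking training under those constraints):

```python
import numpy as np

# Sketch of the recursive-feature-elimination style sparsification of W
# (steps 1-4 above). train_model is a stand-in: here it just re-applies
# the sparsity mask; in practice it would retrain W with masked entries
# constrained to stay zero.
def sparsify_columns(W, k, p, train_model=lambda W, mask: W * mask):
    D = W.shape[1]
    mask = np.ones_like(W)
    while np.count_nonzero(W * mask, axis=0).max() > p:
        for i in range(D):
            col = W[:, i] * mask[:, i]
            if np.count_nonzero(col) <= p:
                continue                 # column is already sparse enough
            active = np.flatnonzero(mask[:, i])
            # step 2: the k active elements with smallest |w_ji| go inactive
            smallest = active[np.argsort(np.abs(col[active]))[:k]]
            mask[smallest, i] = 0.0
        W = train_model(W, mask)         # step 3: retrain constrained W
    return W * mask
```

Each pass removes k entries per over-full column, so the loop terminates once every column has at most p nonzero terms.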
Low rank W matrices: An alternative efficient scheme is to constrain W in the following way:
W = U^{T}V + I
This induces a low-dimensional "latent concept" space in a similar way to LSI. However, it differs in several ways: most importantly, it is trained with a supervised signal. Further, U and V differ, so it does not assume that a query and a target document should be embedded in the same way, and the addition of the identity term (I) means this model automatically learns the tradeoff between using the low-dimensional space and a classical vector space model. In terms of efficiency, however, it is the same: its speed depends on the dimensionality of U and V.
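As a sketch of why this parameterization is efficient (illustrative code; all names and toy sizes are assumptions), the score can be computed without ever forming the D×D matrix W:

```python
import numpy as np

# Sketch of scoring under the low-rank constraint W = U^T V + I.
# Since f(q, d) = q^T (U^T V + I) d = (Uq).(Vd) + q.d, the D x D
# matrix W is never materialized; the cost is O(nD) per pair for an
# n-dimensional latent concept space.
def score_low_rank(q, d, U, V):
    return float((U @ q) @ (V @ d) + q @ d)

rng = np.random.default_rng(0)
D, n = 200, 10                          # toy dictionary / embedding sizes
U = rng.normal(size=(n, D))
V = rng.normal(size=(n, D))
q = rng.normal(size=D)
d = rng.normal(size=D)
s = score_low_rank(q, d, U, V)
```

The identity term survives as the plain dot product q.d, so the model falls back on exact word matching where the latent space has nothing to add.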
We also highlight several variants:
1. W=I: if q and d are normalized TF-IDF vectors, this is equivalent to using the standard cosine similarity with no learning (and no synonymy or polysemy).
2. W=D, where D is a diagonal matrix: one learns a re-weighting of TF-IDF features using labeled data.
3. W=U^{T}U+I: we constrain the model to be symmetric; the query and target document are treated in the same way.
Polynomial Features and Low Rank Approximation: A ranking function may include f(q,d)=q^{T}Wd, where q and d are vectors to be ranked, W is a weight matrix and T is the transpose function. If one were to concatenate the vectors q and d, v=(q,d), and then use a polynomial mapping Φ(•), f(q,d)=w^{T}Φ(v), i.e., take all pairwise terms, one ends up with a ranking function f(q,d)=q^{T}W_{1}q+q^{T}W_{2}d+d^{T}W_{3}d with three "W" matrices to learn. The first term q^{T}W_{1}q is useless since it is the same for all target documents. The term d^{T}W_{3}d is not affected by the query at all; it corresponds to ranking a document's "importance" in a static way, which can still be useful. However, in the present case this static ranking is learned using a supervised signal.
Furthermore, in general, we could consider Φ(•) to be a polynomial of an order greater than 2. Although this can be computationally challenging, it can again be sped up using a low rank approximation. For example, suppose we would like the ranking function f(q,d)=Σ_{i,j,k}w_{ijk}q_{i}d_{j}d_{k}. We can do that with the low rank approximation:

f(q,d) = Σ_{i}(Uq)_{i}(Vd)_{i}(Yd)_{i}

This corresponds to embedding the query into an n-dimensional space using the mapping U, and embedding the target document twice with two separate functions V and Y, also into n-dimensional spaces. The product of all three is then taken for each dimension of the embedding space. If there are only two embedding spaces U and V, this reduces to the original ranking function. This can be generalized to more than three embeddings (i.e., degree greater than 3).
Convolutions or Sliding Windows: In general, if we have an embedding t(q_{i}), where q_{i} is the i-th sliding window in a text, we could embed the document with:

f(q,d) = (Σ_{i}t(q_{i}))^{T}·(Σ_{j}t(d_{j}))

or a normalized version thereof.
One could choose a function t which is a nonlinear neural network based on word embeddings, one could use a multi-layer network with sigmoid functions, or one could apply a similar process using polynomials as described above, but instead apply a polynomial map to n-grams (i.e., within a sliding window). For example, for 2-grams containing two words w_{1} and w_{2}, one might want to consider the following mapping: if g(w_{i}) is the embedding of the word w_{i}, one uses the mapping t(d_{i}) = Σ_{j}g(w_{1})_{j}g(w_{2})_{j}. This can be generalized to n-grams, and basically consists of using only some of the polynomial terms. Alternatively, one could hash n-grams.
Transductive embedding: Each document could have its own embedding vector V_{i} (this could be done for advertisements too, e.g., for Yahoo). Note that this embedding may not be a function; in other words, it is not necessary to find a function V=f(v). We can add one of the following constraints:
1. V_{d+}^{T}V_{w} > V_{d−}^{T}V_{w}: train the word embedding and the document embedding together, where word w is in d^{+} but not in d^{−}.
2. V_{d+}^{T}f(d^{+}) > V_{d+}^{T}f(d^{−}): train the functional embedding together with the transductive embedding. Again, note that V_{d}^{T} ≠ f(d); instead, they have a nonlinear relationship.
Reducing the dimensionality of infrequent words: One can introduce capacity control (and lower the memory footprint) by making the representation of each word a variable rather than a fixed dimension. The idea is that infrequent words should have fewer dimensions because they cannot be trained as well anyway. To make this work, a linear layer maps from a reduced dimension to the full dimension in the model. If a particular word w has dim(w) dimensions, then the first dim(w) rows of a d by d matrix M are used to linearly map it to d dimensions. The matrix M is learned along with the representations of each word.
Hashing: Random Hashing of Words: One could embed a document using the average of B "bin" vectors rather than using words. One randomly assigns any given word to n (e.g., n=2 or 3) bins. In this way, rare words are still used, but with a smaller number of features, yielding lower-capacity models with a smaller memory footprint.
Hashing n-grams: One could also use this trick to, e.g., hash n-grams, which otherwise would have no way of fitting in memory. To use fewer features, one could hash a 3-gram by making it the sum of three 2-grams, e.g., "the cat sat"→g(the,cat)+g(cat,sat)+g(the,sat), where g(•) is the embedding.
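A minimal sketch of this 2-gram-sum trick (illustrative; the embedding g(•) here is a toy deterministic random projection, which the text does not specify):

```python
import zlib
import numpy as np

# Illustrative sketch of hashing a 3-gram as the sum of the embeddings
# of its three word pairs, per the "the cat sat" example above. g() is
# an assumption: a toy random projection seeded deterministically from
# the word pair, standing in for a learned 2-gram embedding.
def g(w1, w2, dim=8):
    seed = zlib.crc32(f"{w1}|{w2}".encode())     # stable per word pair
    return np.random.default_rng(seed).normal(size=dim)

def embed_3gram(w1, w2, w3):
    # "the cat sat" -> g(the, cat) + g(cat, sat) + g(the, sat)
    return g(w1, w2) + g(w2, w3) + g(w1, w3)

v = embed_3gram("the", "cat", "sat")
```

Because the 3-gram is composed from its 2-grams, only the (much smaller) 2-gram table needs to fit in memory.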
Hashing with Prior Knowledge: One does not have to use random hashing functions. One could find the synonyms of rare words, e.g., use a web search with a DICE score (a known measure) to pick out the n most related words that have a frequency >F, e.g., F=30,000. We then bin that word into those n bins. In this way, rare words are represented by their synonyms, resulting in lower-capacity models that should be smarter than random hashing. In general, if one has prior knowledge, one might always be able to do better than random hashing.
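The binning of rare words by co-occurrence might be sketched as follows (illustrative code; the DICE-style score follows the definition used later in this text, co-occurrences divided by the sum of occurrences, and the tiny corpus and parameter values are assumptions):

```python
from collections import Counter
from itertools import combinations

# Sketch of correlated feature hashing: each rare word is binned into
# the n_bins frequent words it co-occurs with most, scored by
# co-occurrences / (sum of occurrences). The corpus format (a list of
# token lists) is an illustrative assumption.
def correlated_hash(corpus, n_frequent, n_bins):
    occ, cooc = Counter(), Counter()
    for doc in corpus:
        words = sorted(set(doc))                 # document-level counts
        occ.update(words)
        cooc.update(frozenset(p) for p in combinations(words, 2))
    frequent = [w for w, _ in occ.most_common(n_frequent)]
    mapping = {}
    for w in occ:
        if w in frequent:
            continue
        dice = {f: cooc[frozenset((w, f))] / (occ[w] + occ[f])
                for f in frequent}
        # bin the rare word into its n_bins most related frequent words
        mapping[w] = sorted(frequent, key=dice.get, reverse=True)[:n_bins]
    return mapping

corpus = [["cat", "feline", "pet"], ["cat", "feline"],
          ["cat", "pet"], ["cat", "dog"]]
bins = correlated_hash(corpus, n_frequent=2, n_bins=2)
```

A rare word's feature weight would then be spread over its bins rather than occupying its own dictionary slot.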
Referring now to the drawings in which like numerals represent the same or similar elements and initially to
Block 11 represents a document d, and block 12 represents a query q. In block 13, a TF-IDF vector is built for document d and, in block 14, a TF-IDF vector is built for query q. In blocks 15 and 16, TF-IDF vectors Td and Tq are generated. In block 17, a similarity score is computed between q and d. As shown, the similarity between the query and the document is described as:
f(q,d) = q′Wd = Σ_{i,j}w_{ij}q_{i}d_{j}  (1)
The dimensions of q and d equal the vocabulary size D (q′ is the transpose of q). For illustrative purposes, in Wikipedia, D>2.5 million, so the matrix W can be intractably large. Ways to control the size of W are described herein.
Training: Suppose we are given a set of tuples R (labeled data), where each tuple contains a query q, a relevant document d^{+} and an irrelevant (or lower ranked) document d^{−}. We choose W such that q′Wd^{+}>q′Wd^{−}, expressing that d^{+} should be ranked higher than d^{−}. For that purpose, we employ the known margin ranking loss, and minimize:

Σ_{(q,d^{+},d^{−})∈R} max(0, 1 − q′Wd^{+} + q′Wd^{−})

This optimization problem is solved through stochastic gradient descent: iteratively, one picks a random tuple and makes a gradient step for that tuple:

W ← W + λq(d^{+} − d^{−})′, if 1 − q′Wd^{+} + q′Wd^{−} > 0.
One could exploit the sparsity of q and d when calculating these updates. To train the present model, we choose the (fixed) learning rate λ which minimizes the training error. Stochastic training is highly scalable and is easy to implement for the model. The present method includes a margin ranking perceptron with a particular choice of features. It thus involves a convex optimization problem and is hence related to a ranking SVM, except we have a highly scalable optimizer.
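A minimal sketch of this training loop (illustrative; dense vectors are used for clarity, whereas a practical implementation would exploit the sparsity of q and d):

```python
import numpy as np

# Sketch of the stochastic gradient training described above: for each
# tuple (q, d+, d-) that violates the margin, step W by
# lam * q (d+ - d-)^T. Function and variable names are assumptions.
def train_ssi(tuples, D, lam=0.1, epochs=10):
    W = np.eye(D)                       # start from cosine similarity
    for _ in range(epochs):
        for q, d_pos, d_neg in tuples:
            if 1.0 - q @ W @ d_pos + q @ W @ d_neg > 0:
                W += lam * np.outer(q, d_pos - d_neg)
    return W

# Toy data: the query shares no exact word with either document, so
# plain cosine cannot rank them; training learns the correlation.
q = np.array([1.0, 0.0, 0.0])
d_pos = np.array([0.0, 1.0, 0.0])       # labeled relevant
d_neg = np.array([0.0, 0.0, 1.0])       # labeled irrelevant
W = train_ssi([(q, d_pos, d_neg)], D=3)
```

After training, w_{01} has grown positive and w_{02} negative, so the relevant document outranks the irrelevant one despite zero exact-word overlap.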
The vector space models for text in accordance with the present principles employ a weight matrix W, which may be sparse (block 17). If W were itself a sparse matrix, then computation of f(•) would be considerably faster, not to mention the significant memory savings. If the query has m nonzero terms, and any given row of W has p nonzero terms, then there are at most mp terms in v=q′W (compared to mD terms for a dense W, and m terms for a classical vector space cosine similarity).
The simplest way to do this is to apply a regularizer such as the L1-norm of W, e.g., minimizing:

Σ_{q∈Q} L(W; q, Y) + γ‖W‖_{1}

where L(•) is a function that computes the loss of the current model given a query q in Q and a labeling Y(q,d) of the relevance of the documents d with respect to q, and γ controls the strength of the regularization. However, in general, any sparsity-promoting regularizer or feature selection technique could be used.
Training W: We enforce sparsity through feature selection. Here, a "projection" method can be seen as a generalization of the Recursive Feature Elimination (RFE) feature selection method. This method may include the following:
1) Train the model with a dense matrix W as before.
2) For each row i of W, find the k active elements with the smallest values of w_{ij}. Constrain these elements to equal zero, i.e., make them inactive.
3) Train the model with the constrained W matrix.
4) If W contains more than p nonzero terms in each row, go back to 2).
This method is motivated by the removal of the smallest weights being linked to the smallest change in the objective function and has been shown to work well in several experimental setups. Overall, we found this scheme to be simple and efficient, while yielding good results. Advantageously, feature selection may be applied to supervised semantic indexing (SSI).
Referring to
f(q,d) = q′Wd = q′U′Vd + q′d  (4)
The first term on the right-hand side (q′U′Vd) can be viewed as a dot product of q and d after transformation by U and V, respectively. This term contains synonymy/polysemy information and thus allows fuzzy matching. The second term (q′d) is a traditional vector space model similarity, which uses exact word matching. By dropping the second term in Eq. (4), we get:
f(q,d)=q′U′Vd (5).
This form is useful for heterogeneous systems where query features and document features differ, e.g., cross-language retrieval. If we model the query and document in the same way, Eq. (2) becomes:

W = U′U + I  (6).

Eq. (6) has fewer parameters to train than Eq. (2), and thus provides faster training and is less likely to overfit, which can yield better performance.
Training: When the W matrix is constrained, e.g., W=U′V+I, training is performed in a similar way as before, but in this case by making gradient steps to optimize the parameters U and V:
U ← U + λV(d^{+} − d^{−})q′, if 1 − f(q,d^{+}) + f(q,d^{−}) > 0
V ← V + λUq(d^{+} − d^{−})′, if 1 − f(q,d^{+}) + f(q,d^{−}) > 0
Note that this is no longer a convex optimization problem. In our experiments, we initialized the matrices U and V randomly using a normal distribution with mean zero and standard deviation one. In block 27, the classical vector space model similarity (q′d) and the learned embedding similarity (q′U′Vd) are combined to give better performance. U and V can model the query and documents in different ways; thus, U and V are well suited to heterogeneous systems, such as systems that perform cross-language retrieval (e.g., the query is in Japanese, but the documents are in English).
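These update rules might be sketched as follows (illustrative code; the initialization follows the zero-mean, unit-variance normal mentioned above, and the toy data assumes the query shares no exact word with either document, so the ranking must come from the learned embeddings):

```python
import numpy as np

# Sketch of training W = U'V + I via the gradient steps above: both U
# and V are updated from their current values whenever the margin
# 1 - f(q, d+) + f(q, d-) is violated. Non-convex, so initialization
# matters; we use normal(0, 1) as stated in the text.
def train_uv(tuples, D, n, lam=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n, D))
    V = rng.normal(size=(n, D))
    f = lambda q, d: (U @ q) @ (V @ d) + q @ d
    for _ in range(epochs):
        for q, dp, dn in tuples:
            if 1.0 - f(q, dp) + f(q, dn) > 0:
                gU = lam * np.outer(V @ (dp - dn), q)
                gV = lam * np.outer(U @ q, dp - dn)
                U, V = U + gU, V + gV   # joint step from the old U, V
    return U, V

q = np.array([1.0, 0.0, 0.0])
dp = np.array([0.0, 1.0, 0.0])          # relevant document
dn = np.array([0.0, 0.0, 1.0])          # irrelevant document
U, V = train_uv([(q, dp, dn)], D=3, n=2)
```

Updates stop once the margin is met, at which point the relevant document is scored at least one unit above the irrelevant one.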
Referring to
To further reduce the number of features, correlated feature hashing may be introduced. With this method, we limit the features to the most frequent ones (e.g., the most frequent 30,000 words). Any other word, say w, is "hashed" into the 5 words (or at least two) that co-occur with w most often. The co-occurrence can be measured by a "DICE" score in block 33, which is the number of co-occurrences of two words divided by the sum of the occurrences of the two words. The same method can be applied to the features of n-grams: n-grams can be hashed into the word features with which they most often co-occur.
Training: Correlated feature hashing can be viewed as a preprocessing stage (e.g., block 201 of
Correlated feature hashing includes providing a frequently used dictionary 31 and an infrequently used dictionary 32 in storage memory. For each word or gram in the infrequently used dictionary 32, n words or grams are correlated from the frequently used dictionary 31 based on a first score (e.g., a DICE score). In block 36, features of a vector of the infrequently used words or grams are replaced with features from a vector of the correlated words or grams from the frequently used dictionary 31 when the features from the vector of the correlated words or grams meet a threshold value (e.g., 0.2). A similarity score between weight vectors of a query and one or more documents is achieved by employing the features from the vector of the correlated words or grams that met the threshold value.
Referring to
In query-document ranking, we allow document terms to appear more than once, but limit query terms to appear only once. Also, we only consider terms involving both query terms and document terms. For example, taking k=3, we have:
f(q,d) = Σ_{i,j,k}w_{ijk}q_{i}d_{j}d_{k}  (8)
Again, to have better scalability, it can be approximated with a low-dimensional approximation:

f(q,d) = Σ_{i} h(q)_{i} Π_{m=1}^{k−1} g_{m}(d)_{i}  (9)

where h embeds the query and g_{1}, . . . , g_{k−1} embed the document.
The process of Eq. (9) is depicted in
If we set k=3 and the embedding functions to be linear, we have:

f(q,d) = Σ_{i}(Uq)_{i}(Vd)_{i}(Yd)_{i}  (10).
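A sketch of Eq. (10) (illustrative; the toy sizes are assumptions), which also shows that it implicitly represents the full tensor w_{ijk} = Σ_{m}U_{mi}V_{mj}Y_{mk} without materializing it:

```python
import numpy as np

# Sketch of Eq. (10): the degree-3 low-rank ranking function. The query
# is embedded once (U) and the document twice (V and Y); the score sums
# the component-wise product over the n embedding dimensions, avoiding
# the D x D x D tensor w_ijk entirely.
def score_degree3(q, d, U, V, Y):
    return float(np.sum((U @ q) * (V @ d) * (Y @ d)))

rng = np.random.default_rng(1)
D, n = 12, 3                            # toy dictionary / embedding sizes
U, V, Y = (rng.normal(size=(n, D)) for _ in range(3))
q, d = rng.normal(size=D), rng.normal(size=D)
s = score_degree3(q, d, U, V, Y)
```

The cost is O(nD) per pair instead of O(D^3), which is what makes degree-3 interactions tractable.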
Training: Again, we could adopt the Stochastic Gradient Descent (SGD) method. The gradient can be easily determined using partial derivatives. For the formulation in Eq. (8): w_{ijk} ← w_{ijk} + λ(d_{j}^{+}d_{k}^{+} − d_{j}^{−}d_{k}^{−})q_{i}, if 1−f(q,d^{+})+f(q,d^{−})>0. For the formulation in Eq. (10):

U ← U + λ(Vd^{+}∘Yd^{+} − Vd^{−}∘Yd^{−})q′, if 1−f(q,d^{+})+f(q,d^{−})>0;
V ← V + λ(Uq∘Yd^{+} − Uq∘Yd^{−})(d^{+}−d^{−})′, if 1−f(q,d^{+})+f(q,d^{−})>0;
Y ← Y + λ(Uq∘Vd^{+} − Uq∘Vd^{−})(d^{+}−d^{−})′, if 1−f(q,d^{+})+f(q,d^{−})>0;

where the operator ∘ is the component-wise product of vectors, i.e., (A∘B)_{i}=A_{i}B_{i}. Advantageously, Eq. (8) provides significant improvements in document ranking, and Eq. (9) can be viewed as a generalized dot product that retains the document ranking capabilities of Eq. (8) in a more efficient form.
Half-transductive Ranking: Eq. (5) can be viewed as the dot product of two functions: f(q,d)=φ_{1}(q)·φ_{2}(d)=(Uq)′(Vd) (10), where the two (linear) functional mappings are φ_{1}(•) and φ_{2}(•), with the parameters being features in the original feature space. This method is also called functional embedding, because new samples can be put into the embedding space using the function φ(•). Another important category of learning methods is transductive learning. This method assigns a vector v_{i}∈R^{m} to each object y_{i}∈Y that will be learned using a supervised or unsupervised signal, and does not involve any feature representation of y_{i}. Instead, the embedding only uses information about the relative distances between y_{i}, y_{j}∈Y. In this sense, it can be said to be a nonlinear method. Ranking is then typically achieved by measuring the distances in the embedding space (Euclidean distance or dot product similarity). This method provides a point-to-point correspondence between the input space and the intrinsic space in which the data lie. Such methods have multiplied in recent years and are most commonly rooted either in factor analysis (e.g., principal component analysis) or multidimensional scaling and their variants, e.g., kernel PCA, Isomap, Locally Linear Embedding and Laplacian Eigenmaps. In many cases, transductive learning gives outstanding performance for clustering applications. The drawback of this method is that the query has to be one of the objects y_{i}∈Y for the approach to make sense, as only these objects are embedded. Out-of-sample extension methods have to be adopted for new samples.
In accordance with the present principles, we can combine the functional embedding and the transductive embedding using the following half-transductive model:

f(q,d_{i}) = M(φ(q), v_{i})  (11)

where φ(q) is the functional embedding for new queries, while v_{i} is the transductive embedding, which gives better performance. Specifically, if we require φ(q) to be a linear function, Eq. (11) becomes:

f(q,d_{i}) = M(Wq, v_{i})  (12).
Training: We can adopt a gradient-descent based optimization method. For Eq. (11), the derivatives of the functions M and φ can be calculated and used in adjusting the parameters. For the transductive part v_{i}, a technique such as a lookup table can be used.
Advantageously, a new framework for half-transductive learning is provided. Half of the formulation is functional, which handles the embedding of new samples, while the other half is transductive, which provides better performance for existing databases.
Sliding window based embedding function: In general, if we have an embedding t(q_{i}), where q_{i} is the i-th sliding window in a text, we could embed the document with f(q,d) = (Σ_{i}t(q_{i}))^{T}·(Σ_{j}t(d_{j})), or its normalized version.

One could choose a function t which is a nonlinear neural network based on word embeddings, one could use a multi-layer network with sigmoid functions, or one could apply a similar process as in the section on polynomials above (but instead apply a polynomial map to n-grams, within a sliding window). For example, for 2-grams containing two words w_{1} and w_{2}, one might want to consider the following mapping: if g(w_{i}) is the embedding of the word w_{i}, one uses the mapping t(d_{i}) = Σ_{j}g(w_{1})_{j}g(w_{2})_{j}. This can be generalized to n-grams, and basically includes using only some of the polynomial terms.
Reducing the dimensionality of infrequent words: One can introduce capacity control (and lower the memory footprint) by making the representation of each word a variable rather than a fixed dimension. The idea is that infrequent words should have fewer dimensions because they cannot be trained as well anyway. To make this work, a linear layer maps from a reduced dimension to the full dimension in the model. If a particular word w has dim(w) dimensions, then the first dim(w) rows of a d×d matrix M are used to linearly map the word to d dimensions. The matrix M is learned along with the representations of each word.
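This mapping might be sketched as follows (illustrative code; the per-word dimension choices and all names are assumptions):

```python
import numpy as np

# Sketch of the variable-dimension representation above: a word stored
# with only dim(w) < d trainable values is mapped into the full
# d-dimensional space by the first dim(w) rows of a shared d x d
# matrix M, which would be learned jointly with the word vectors.
def to_full_dim(short_vec, M):
    k = len(short_vec)                  # dim(w) for this word
    return short_vec @ M[:k, :]         # first dim(w) rows of M

d = 8
rng = np.random.default_rng(0)
M = rng.normal(size=(d, d))
rare_word = rng.normal(size=3)          # infrequent word: 3 dimensions
freq_word = rng.normal(size=d)          # frequent word: full d dimensions
e_rare = to_full_dim(rare_word, M)
e_freq = to_full_dim(freq_word, M)
```

Both words end up in the same d-dimensional space, but the rare word only contributes (and stores) 3 trainable parameters of its own.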
Referring to
System 100 may interact with local servers 108 and databases 110 or remote servers 108 and databases 110 through a network 106. These databases 110 and servers 108 may include the documents or items to be searched using a user or machine generated query. It should be understood that the program 114 may be remotely disposed from system 100, e.g., located on a server or the like. Training of models may be performed on system 100 or may be performed on other platforms to create models 115 for implementing the present methods.
Referring to
In block 202, a weight vector for each of a plurality of documents in a corpus of documents is built and stored in memory. In block 203, a weight vector is built for a query input into a document system. The weight vectors may include Term Frequency Inverse Document Frequency (TF-IDF) weight vectors, although other feature vectors are also contemplated.
In one embodiment, the weight vectors for the plurality of documents are mapped to multiple embedding spaces to handle models having polynomial functions of n-th order in block 204. In another embodiment, functional embedding and transductive embedding are combined to transform weight vectors to model both new queries and existing databases with improved performance in block 205. In block 206, a sliding window may be employed while embedding words or grams to consider other words or grams in proximity thereof.
In one simplifying embodiment, in block 208, building a weight vector for each of a plurality of documents includes performing a linear embedding to transform the weight vector to embedding space using a matrix U, and in block 209, building a weight vector for a query includes performing a linear embedding to transform the weight vector to embedding space using a matrix V.
In block 210, a weight matrix (W) is generated which distinguishes between relevant documents and lower ranked documents by comparing document/query tuples using a gradient step approach. Other comparison techniques may also be employed.
In block 211, the weight matrix generation may include assigning infrequently occurring words to a reduced dimension representation, and mapping the reduced dimension representation to a full dimension representation when the infrequently occurring words are encountered. The weight matrix may be sparsified through feature selection in block 212. Sparsification may include finding the active elements in the weight matrix with the smallest weight values in block 213 by searching the matrix row by row. In block 214, these active elements with the smallest weight values are constrained to equal zero. In block 215, a model is trained with the weight matrix, and, in block 216, if the weight matrix includes more than p nonzero terms, the method returns to the finding step (block 213). The result is a sparsely populated weight matrix, which improves computation time and reduces memory and computational overhead.
In block 220, a similarity score is determined between weight vectors of the query and documents in a corpus by determining a product of a document weight vector, a query weight vector and the weight matrix. In the simplified embodiment, determining the similarity score includes taking a product between a transformed document weight vector (block 208) and a transformed query weight vector (block 209) to simplify computations.
It should be understood that blocks 204, 205, 206, 211 and 212 are all alternative options for determining the similarity score f(q,d). Although these techniques can be combined, they may be employed independently or combined in any order.
Having described preferred embodiments of systems and methods for supervised semantic indexing and its extensions (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments disclosed which are within the scope and spirit of the invention as outlined by the appended claims. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.
Claims (14)
Priority Applications (2)
- US 61/143,942 (provisional): priority date 2009-01-12, filed 2009-01-12
- US 12/562,840 (US8359282B2): priority date 2009-01-12, filed 2009-09-18, "Supervised semantic indexing and its extensions"

Applications Claiming Priority (1)
- US 12/562,840 (US8359282B2): priority date 2009-01-12, filed 2009-09-18, "Supervised semantic indexing and its extensions"
Publications (2)
Publication Number  Publication Date 

US20100185659A1 true US20100185659A1 (en)  20100722 
US8359282B2 true US8359282B2 (en)  20130122 
Family
ID=42319750
Family Applications (2)
Application Number  Title  Priority Date  Filing Date 

US12562840 Active 20310623 US8359282B2 (en)  20090112  20090918  Supervised semantic indexing and its extensions 
US12562802 Active 20310306 US8341095B2 (en)  20090112  20090918  Supervised semantic indexing and its extensions 
Family Applications After (1)
Application Number  Title  Priority Date  Filing Date 

US12562802 Active 20310306 US8341095B2 (en)  20090112  20090918  Supervised semantic indexing and its extensions 
Country Status (1)
Country  Link 

US (2)  US8359282B2 (en) 
Cited By (4)
Publication number  Priority date  Publication date  Assignee  Title 

US20150242387A1 (en) *  20140224  20150827  Nuance Communications, Inc.  Automated text annotation for construction of natural language understanding grammars 
US9477654B2 (en)  20140401  20161025  Microsoft Corporation  Convolutional latent semantic models and their applications 
US9519859B2 (en)  20130906  20161213  Microsoft Technology Licensing, Llc  Deep structured semantic model produced using clickthrough data 
US9535960B2 (en)  20140414  20170103  Microsoft Corporation  Contextsensitive search using a deep learning model 
Families Citing this family (51)
Publication number  Priority date  Publication date  Assignee  Title 

US9195640B1 (en) *  20090112  20151124  Sri International  Method and system for finding content having a desired similarity 
JP5119288B2 (en) *  20100315  20130116  シャープ株式会社  The mobile terminal device, an information output system, information output method, program and recording medium 
US9298818B1 (en) *  20100528  20160329  Sri International  Method and apparatus for performing semanticbased data analysis 
US20120078918A1 (en) *  20100928  20120329  Siemens Corporation  Information Relation Generation 
US8819019B2 (en) *  20101118  20140826  Qualcomm Incorporated  Systems and methods for robust pattern classification 
US9529908B2 (en)  20101122  20161227  Microsoft Technology Licensing, Llc  Tiering of posting lists in search engine index 
US8478704B2 (en)  20101122  20130702  Microsoft Corporation  Decomposable ranking for efficient precomputing that selects preliminary ranking features comprising static ranking features and dynamic atomisolated components 
US9424351B2 (en)  20101122  20160823  Microsoft Technology Licensing, Llc  Hybriddistribution model for search engine indexes 
US8713024B2 (en)  20101122  20140429  Microsoft Corporation  Efficient forward ranking in a search engine 
US8620907B2 (en) *  20101122  20131231  Microsoft Corporation  Matching funnel for large document index 
US20120253792A1 (en) *  20110330  20121004  Nec Laboratories America, Inc.  Sentiment Classification Based on Supervised Latent NGram Analysis 
US8892488B2 (en) *  20110601  20141118  Nec Laboratories America, Inc.  Document classification with weighted supervised ngram embedding 
US8515964B2 (en) *  20110725  20130820  Yahoo! Inc.  Method and system for fast similarity computation in high dimensional space 
CN102360372B (en) *  20111009  20130130  北京航空航天大学  Crosslanguage document similarity detection method 
US8799269B2 (en) *  20120103  20140805  International Business Machines Corporation  Optimizing map/reduce searches by using synthetic events 
JP2015004996A (en) *  20120214  20150108  インターナショナル・ビジネス・マシーンズ・コーポレーションＩｎｔｅｒｎａｔｉｏｎａｌ Ｂｕｓｉｎｅｓｓ Ｍａｃｈｉｎｅｓ Ｃｏｒｐｏｒａｔｉｏｎ  Apparatus for clustering plural documents 
JP5890539B2 (en) *  20120222  20160322  ノキア テクノロジーズ オーユー  Access to the service based on the prediction 
CN103309886B (en) *  20120313  20170510  阿里巴巴集团控股有限公司  A structured information search method and apparatus based trading platform 
US8898165B2 (en)  20120702  20141125  International Business Machines Corporation  Identification of null sets in a contextbased electronic document search 
US8903813B2 (en)  20120702  20141202  International Business Machines Corporation  Contextbased electronic document search using a synthetic event 
US9460200B2 (en)  20120702  20161004  International Business Machines Corporation  Activity recommendation based on a contextbased electronic files search 
US9262499B2 (en)  20120808  20160216  International Business Machines Corporation  Contextbased graphical database 
US8676857B1 (en)  20120823  20140318  International Business Machines Corporation  Contextbased search for a data store related to a graph node 
US8959119B2 (en)  20120827  20150217  International Business Machines Corporation  Contextbased graphrelational intersect derived database 
US9619580B2 (en)  20120911  20170411  International Business Machines Corporation  Generation of synthetic context objects 
US8620958B1 (en)  20120911  20131231  International Business Machines Corporation  Dimensionally constrained synthetic context objects database 
US9251237B2 (en)  20120911  20160202  International Business Machines Corporation  Userspecific synthetic context object matching 
US9223846B2 (en)  20120918  20151229  International Business Machines Corporation  Contextbased navigation through a database 
US8782777B2 (en)  20120927  20140715  International Business Machines Corporation  Use of synthetic contextbased objects to secure data stores 
US9020954B2 (en)  20120928  20150428  International Business Machines Corporation  Ranking supervised hashing 
US9741138B2 (en)  20121010  20170822  International Business Machines Corporation  Node cluster relationships in a graph database 
US8931109B2 (en)  20121119  20150106  International Business Machines Corporation  Contextbased security screening for accessing data 
US9286379B2 (en) *  20121126  20160315  WalMart Stores, Inc.  Document quality measurement 
US9229932B2 (en)  20130102  20160105  International Business Machines Corporation  Conformed dimensional data gravity wells 
US8914413B2 (en)  20130102  20141216  International Business Machines Corporation  Contextbased data gravity wells 
US8983981B2 (en)  20130102  20150317  International Business Machines Corporation  Conformed dimensional and contextbased data gravity wells 
US9069752B2 (en)  20130131  20150630  International Business Machines Corporation  Measuring and displaying facets in contextbased conformed dimensional data gravity wells 
US8856946B2 (en)  20130131  20141007  International Business Machines Corporation  Security filter for contextbased data gravity wells 
US9053102B2 (en)  20130131  20150609  International Business Machines Corporation  Generation of synthetic context frameworks for dimensionally constrained hierarchical synthetic contextbased objects 
US20140236577A1 (en) *  20130215  20140821  Nec Laboratories America, Inc.  Semantic Representations of Rare Words in a Neural Probabilistic Language Model 
US9292506B2 (en)  20130228  20160322  International Business Machines Corporation  Dynamic generation of demonstrative aids for a meeting 
US9110722B2 (en)  20130228  20150818  International Business Machines Corporation  Data processing work allocation 
US9600777B2 (en) *  20130311  20170321  Georges Harik  Configuring and optimizing computational structure for a machine learning application using a tuple of vectors 
US20140279748A1 (en) *  20130315  20140918  Georges Harik  Method and program structure for machine learning 
US9348794B2 (en)  20130517  20160524  International Business Machines Corporation  Population of contextbased data gravity wells 
US9195608B2 (en)  20130517  20151124  International Business Machines Corporation  Stored data analysis 
US9262485B2 (en) *  20130813  20160216  International Business Machines Corporation  Identifying a sketching matrix used by a linear sketch 
JP2015166962A (en) *  20140304  20150924  日本電気株式会社  Information processing device, learning method, and program 
JPWO2015145981A1 (en) *  20140328  20170413  日本電気株式会社  Multilingual document similarity learning device, multilingual document similarity determination device, multilingual document similarity learning method, multilingual document similarity determination method, and multilingual document similarity learning program 
US9330104B2 (en)  20140430  20160503  International Business Machines Corporation  Indexing and searching heterogenous data entities 
WO2017180475A1 (en) *  20160415  20171019  3M Innovative Properties Company  Query optimizer for combined structured and unstructured data records 
Citations (2)
Publication number  Priority date  Publication date  Assignee  Title 

US6137911A (en) *  19970616  20001024  The Dialog Corporation Plc  Text classification system and method 
US20100082511A1 (en) *  20080930  20100401  Microsoft Corporation  Joint ranking model for multilingual web search 
Family Cites Families (2)
Publication number  Priority date  Publication date  Assignee  Title 

US8195669B2 (en) *  20080922  20120605  Microsoft Corporation  Optimizing ranking of documents using continuous conditional random fields 
US8166032B2 (en) *  20090409  20120424  MarketChorus, Inc.  System and method for sentimentbased text classification and relevancy ranking 
Patent Citations (2)
Publication number  Priority date  Publication date  Assignee  Title 

US6137911A (en) *  19970616  20001024  The Dialog Corporation Plc  Text classification system and method 
US20100082511A1 (en) *  20080930  20100401  Microsoft Corporation  Joint ranking model for multilingual web search 
NonPatent Citations (19)
Title 

Balasubramanian et al. The Isomap algorithm and topological stability. www.sciencemag.org, Science. vol. 295 . Jan. 2002. (3 Pages) http://www.sciencemag.org/cgi/reprint/295/5552/7a.pdf. 
Belkin et al. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Advances in Neural Information Processing Systems 14 . Dec. 2002, pp. 128. 
Bengio et al. Outofsample extensions for LLE, isomap, MDS, eigenmaps, and spectral clustering. Advances in Neural Information Processing Systems 16. 2004. (8 Pages). 
Blei et al. Latent Dirichlet Allocation. Journal of Machine Learning Research. vol. 3. Jan. 2003, pp. 9931022. 
Burges et al. Learning to rank using gradient descent. In ICML 2005, Proceedings of the 22nd International Conference on Machine Learning. Aug. 2005. pp. 8995. 
Collins et al. New ranking algorithms for parsing and tagging: kernels over discrete structures, and the voted perceptron. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics. Jul. 2002. pp. 263270. 
Collobert et al. Fast Semantic Extraction Using a Novel Neural Network Architecture. 45th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference. ACL. Jun. 2007. (8 pages). 
Deerwester et al. Indexing by latent semantic analysis. Journal of the American Society for Information Science, vol. 41. 1990. pp. 391407. 
Grangier et al. A discriminative kernelbased approach to rank Images from text queries. IEEE Transactions on Pattern Analysis and Machine Intelligence. PAMI. vol. 30, Issue: 8. Aug. 2008. pp. 114. 
Grangier et al. Inferring document similarity from hyperlinks. In CIKM '05. ACM. Oct. 2005. pp. 359360. 
Herbrich et al. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers. Mar. 2000. pp. 115132. 
Hofmann. Probabilistic latent semantic Indexing. Proceedings of the TwentySecond Annual International SIGIR Conference on Research and Development in Information Retrieval. In SIGIR. Aug. 1999. pp. 5057. 
Joachims. Optimizing search engines using clickthrough data. In ACM SIGKDD '02. International Conference on Knowledge Discovery and Data Mining. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. 2002. pp. 133142. (10 Pages). 
Keller et al. A Neural network for Text Representation. In International Conference on Artificial Neural Networks. Sep. 2005. (6 Pages). 
Roweis et al. Nonlinear dimensionality reduction by locally linear embedding. www.sciencemag.org. Science. vol. 290. Dec. 2000. pp. 23232326. 
Salakhutdinov et al. Semantic Hashing. In proceedings of the SIGIR Workshop on Information Retrieval and Applications of Graphical Models. vol. 41 No. 2. Dec. 2007. (8 Pages). 
Scholkopf et al. Kernel principal component analysis. Advances in kernel methods: support vector learning. 1999. (6 Pages). 
Shi et al. Hash kernels. In Twelfth International Conference on Artificial Intelligence and Statistics. Apr. 2009. (8 Pages). 
Sun et al. Supervised latent semantic Indexing for document categorization. IEEE Computer Society. In ICDM 2004. ICDM '04: Proceedings of the Fourth IEEE International Conference on Data Mining. Nov. 2004. pp. 535538. 
Cited By (5)
Publication number  Priority date  Publication date  Assignee  Title 

US9519859B2 (en)  20130906  20161213  Microsoft Technology Licensing, Llc  Deep structured semantic model produced using clickthrough data 
US20150242387A1 (en) *  20140224  20150827  Nuance Communications, Inc.  Automated text annotation for construction of natural language understanding grammars 
US9524289B2 (en) *  20140224  20161220  Nuance Communications, Inc.  Automated text annotation for construction of natural language understanding grammars 
US9477654B2 (en)  20140401  20161025  Microsoft Corporation  Convolutional latent semantic models and their applications 
US9535960B2 (en)  20140414  20170103  Microsoft Corporation  Contextsensitive search using a deep learning model 
Also Published As
Publication number  Publication date  Type 

US20100185659A1 (en)  20100722  application 
US8341095B2 (en)  20121225  grant 
US20100179933A1 (en)  20100715  application 
Similar Documents
Publication  Publication Date  Title 

Berry et al.  Using linear algebra for intelligent information retrieval  
Sarawagi  Information extraction  
Khan et al.  A review of machine learning algorithms for textdocuments classification  
Bollegala et al.  A web search enginebased approach to measure semantic similarity between words  
Song et al.  Automatic tag recommendation algorithms for social recommender systems  
Lee et al.  Information gain and divergencebased feature selection for machine learningbased text categorization  
Markov et al.  Data mining the Web: uncovering patterns in Web content, structure, and usage  
Peng et al.  Information extraction from research papers using conditional random fields  
Huang  Similarity measures for text document clustering  
Kowalski  Information retrieval systems: theory and implementation  
Aggarwal et al.  A survey of text clustering algorithms  
Ando et al.  A framework for learning predictive structures from multiple tasks and unlabeled data  
Ramakrishnan et al.  Data mining: From serendipity to science  
US20090006382A1 (en)  System and method for measuring the quality of document sets  
Hotho et al.  A brief survey of text mining.  
Giunchiglia et al.  Lightweight ontologies  
US20050080776A1 (en)  Internet searching using semantic disambiguation and expansion  
US20070143101A1 (en)  Class description generation for clustering and categorization  
US20100185691A1 (en)  Scalable semistructured named entity detection  
Boley et al.  Partitioningbased clustering for web document categorization  
US20120278321A1 (en)  Visualization of concepts within a collection of information  
Mecca et al.  A new algorithm for clustering search results  
US20090254512A1 (en)  Ad matching by augmenting a search query with knowledge obtained through search engine results  
Liu et al.  Robust and scalable graphbased semisupervised learning  
US20060259481A1 (en)  Method of analyzing documents 
Legal Events
Date  Code  Title  Description 

AS  Assignment 
Owner name: NEC LABORATORIES AMERICA, INC., NEW JERSEY Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BAI, BING;WESTON, JASON;COLLOBERT, RONAN;AND OTHERS;SIGNING DATES FROM 20090911 TO 20090916;REEL/FRAME:023255/0416 

AS  Assignment 
Owner name: NEC CORPORATION, JAPAN Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NEC LABORATORIES AMERICA, INC.;REEL/FRAME:031998/0667 Effective date: 20140113 

FPAY  Fee payment 
Year of fee payment: 4 

AS  Assignment 
Owner name: NEC CORPORATION, JAPAN Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE REMOVE 8538896 AND ADD 8583896 PREVIOUSLY RECORDED ON REEL 031998 FRAME 0667. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNOR:NEC LABORATORIES AMERICA, INC.;REEL/FRAME:042754/0703 Effective date: 20140113 