US6847966B1 - Method and system for optimally searching a document database using a representative semantic space - Google Patents

Method and system for optimally searching a document database using a representative semantic space

Info

Publication number
US6847966B1
Authority
US
United States
Prior art keywords
document
term
matrix
documents
pseudo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime, expires
Application number
US10/131,888
Inventor
Matthew S. Sommer
Kevin B. Thompson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KLDiscovery Ontrack LLC
Original Assignee
Engenium Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Engenium Corp
Priority to US10/131,888 (US6847966B1)
Assigned to ENGENIUM CORPORATION: ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SOMMER, MATTHEW S., THOMPSON, KEVIN B.
Priority to US11/041,799 (US7483892B1)
Application granted
Publication of US6847966B1
Assigned to KROLL ONTRACK INC.: ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ENGENIUM CORPORATION
Assigned to LEHMAN COMMERCIAL PAPER INC.: PATENT SECURITY AGREEMENT. Assignors: KROLL ONTRACK INC.
Assigned to GOLDMAN SACHS BANK USA: ASSIGNMENT AND ASSUMPTION OF ALL LIEN AND SECURITY INTERESTS IN PATENTS. Assignors: LEHMAN COMMERCIAL PAPER INC.
Assigned to KROLL ONTRACK INC., HIRERIGHT, INC., US INVESTIGATIONS SERVICES, LLC: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT COLLATERAL. Assignors: GOLDMAN SACHS BANK USA
Assigned to GOLDMAN SACHS BANK USA: NOTICE AND CONFIRMATION OF GRANT OF SECURITY INTEREST IN PATENTS - TERM. Assignors: KROLL ONTRACK, INC.
Assigned to WILMINGTON TRUST, N.A.: NOTICE AND CONFIRMATION OF GRANT OF SECURITY INTEREST IN PATENTS - FIRST LIEN NOTE. Assignors: KROLL ONTRACK, INC.
Assigned to WILMINGTON TRUST, N.A.: NOTICE AND CONFIRMATION OF GRANT OF SECURITY INTEREST IN PATENTS - SECOND LIEN NOTE. Assignors: KROLL ONTRACK, INC.
Assigned to WILMINGTON TRUST, N.A.: NOTICE AND CONFIRMATION OF GRANT OF SECURITY INTEREST IN PATENTS - THIRD LIEN NOTE. Assignors: KROLL ONTRACK, INC.
Assigned to KROLL ONTRACK, LLC: CONVERSION. Assignors: KROLL ONTRACK INC.
Assigned to CERBERUS BUSINESS FINANCE, LLC, AS COLLATERAL AGENT: NOTICE AND CONFIRMATION OF GRANT OF SECURITY INTEREST IN PATENTS. Assignors: HIRERIGHT, LLC, KROLL ONTRACK, LLC
Assigned to ROYAL BANK OF CANADA, AS COLLATERAL AGENT: SECURITY INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KROLL ONTRACK, LLC, LDISCOVERY TX, LLC, LDISCOVERY, LLC
Assigned to KROLL ONTRACK, LLC: PARTIAL TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS. Assignors: CERBERUS BUSINESS FINANCE, LLC
Assigned to ROYAL BANK OF CANADA, AS COLLATERAL AGENT: SECURITY INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KROLL ONTRACK, LLC, LDISCOVERY TX, LLC, LDISCOVERY, LLC
Assigned to KROLL ONTRACK, INC.: RELEASE BY SECURED PARTY (SEE DOCUMENT FOR DETAILS). Assignors: GOLDMAN SACHS BANK USA, AS COLLATERAL AGENT
Assigned to KROLL ONTRACK, INC.: TERMINATION AND RELEASE OF SECURITY IN PATENTS-THIRD LIEN. Assignors: WILMINGTON TRUST, N.A.
Assigned to KROLL ONTRACK, INC.: TERMINATION AND RELEASE OF SECURITY IN PATENTS-SECOND LIEN. Assignors: WILMINGTON TRUST, N.A.
Assigned to KROLL ONTRACK, INC.: TERMINATION AND RELEASE OF SECURITY IN PATENTS-FIRST LIEN. Assignors: WILMINGTON TRUST, N.A.
Assigned to LDISCOVERY TX, LLC, LDISCOVERY, LLC, KROLL ONTRACK, LLC: TERMINATION AND RELEASE OF INTELLECTUAL PROPERTY SECURITY AGREEMENT. Assignors: ROYAL BANK OF CANADA, AS SECOND LIEN COLLATERAL AGENT
Assigned to WILMINGTON TRUST, NATIONAL ASSOCIATION, AS COLLATERAL AGENT: SECURITY INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KL DISCOVERY ONTRACK, LLC (F/K/A KROLL ONTRACK, LLC), LDISCOVERY TX, LLC, LDISCOVERY, LLC
Assigned to LDISCOVERY, LLC, LDISCOVERY TX, LLC, KROLL ONTRACK, LLC: RELEASE BY SECURED PARTY (SEE DOCUMENT FOR DETAILS). Assignors: ROYAL BANK OF CANADA
Assigned to KLDiscovery Ontrack, LLC: CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: KROLL ONTRACK, LLC
Adjusted expiration
Expired - Lifetime (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99935 Query augmenting and refining, e.g. inexact access
    • Y10S707/99936 Pattern matching access

Definitions

  • the present invention relates generally to data processing and more specifically to document searching.
  • the present invention more particularly relates to using a representational semantic space for comparing the similarity of document representations.
  • Term searching involves a user querying a corpus of documents containing information for a specific term or word. A resultant solution set of documents is then identified as containing the search term.
  • Single term searching is extremely fast and efficient because little computational effort is involved in the query process, but it can result in a relatively large solution set of unranked documents being returned.
  • Many of the documents may not be relevant to the user because, although the term occurs in the solution document, it is out of context with the user's intended meaning. This often prompts the user to perform a secondary search for a more relevant solution set of documents. Even if all of the resultant documents use the search term in the proper context, the user has no way of judging which documents in the solution set are more relevant than others.
  • a user must have some knowledge of the subject matter of the search topic for the search to be efficient (e.g., a less relevant word in a query could produce a solution set of less relevant documents and cause more relevant documents in the solution set to be ranked lower than less relevant documents).
  • Searching for additional affixes of a root or base word is called “word stemming.” Often word stemming is incorporated in a search tool as an automated function, but some tools require the user to identify the manner in which a query term should be stemmed, usually by manually inserting term “truncators” or “wildcards” into the search query. Word stemming merely increases the possibility that a document will be listed as relevant, thereby further increasing the size of the solution set without addressing the other shortcomings of term searching.
  • a logical improvement to single term searching is multiple, simultaneous term searching using “Boolean term operators,” or simply, Boolean operators.
  • a Boolean retrieval query passes through a corpus of documents by linking search terms together with Boolean operators such as AND, OR and NOT.
  • the solution set of documents is smaller than that of a single-term search, and all returned documents are normally ranked equally with respect to relevance.
  • the Boolean method of term searching might be the most widely used information retrieval process and is often used in search engines available on the Internet because it is fast, uncomplicated and easy to implement in a remote online environment.
  • the Boolean search method carries with it many of the shortcomings of term searching.
  • the user has to have some knowledge of the search topic for the search to be efficient in order to avoid relevant documents being ranked as non-relevant and vice versa. Furthermore, since the returned documents are not ranked, the user may be tempted to reduce the size of the solution set by including more Boolean linked search terms. However, increasing the number of search terms in the query narrows the scope of the search and thereby increases the risk that a relevant document is missed. Still again, all documents in the solution set are ranked equally.
  • Boolean searching variants that allow for term frequency based document ranking have a number of drawbacks, the most obvious of which is that longer documents have a higher probability of being ranked higher in the solution set without a corresponding increase in relevance. Additionally, because the occurrence of the term is not context related, higher ranking documents are not necessarily more relevant to the user. Moreover, if the user does not have an understanding of the subject matter being searched, the combination of Boolean logic and term weighting may exclude the most relevant documents from the solution and simultaneously underrank the most relevant documents to the subject matter. Additionally, certain techniques are language dependent, for instance, word stemming and thesaurus use.
  • a term-by-document matrix is compiled from a corpus of documents representative of a particular subject matter that represents the frequency of occurrence of each term per document.
  • a weighted term dictionary is created using a global weighting algorithm and then applied to the term-by-document matrix forming a weighted term-by-document matrix.
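The two steps above (compiling a term-by-document matrix, then applying a weighted term dictionary) can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not name the global weighting algorithm, so an IDF-style weight is assumed here, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def term_by_document(docs):
    """Return (terms, A) where A[i][j] is the count of term i in document j."""
    counts = [Counter(text.lower().split()) for text in docs]
    terms = sorted(set().union(*counts))
    A = [[c[t] for c in counts] for t in terms]
    return terms, A

def global_weights(terms, A):
    """Weighted term dictionary G. ASSUMPTION: IDF-style weight log(d / df)."""
    d = len(A[0])
    return {t: math.log(d / sum(1 for x in row if x > 0))
            for t, row in zip(terms, A)}

def apply_weights(terms, A, G):
    """Scale every row of A by its term's global weight."""
    return [[G[t] * x for x in row] for t, row in zip(terms, A)]

# Hypothetical three-document corpus.
docs = ["court ruling appeal", "appeal court judge", "river flood warning"]
terms, A = term_by_document(docs)
G = global_weights(terms, A)
W = apply_weights(terms, A, G)  # weighted term-by-document matrix
```

With this assumed weighting, a term that occurs in every document receives a weight of zero, which is one reason a different global weighting scheme might be preferred in practice.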
  • a term vector matrix and a singular value concept matrix are computed by singular value decomposition of the weighted term-by-document matrix. The k largest singular concept values are kept while all others are set to zero, thereby reducing the concept dimensions in the term vector matrix and the singular value concept matrix.
  • the reduced term vector matrix, reduced singular value concept matrix and weighted term-document dictionary can be used to project pseudo-document vectors representing documents not appearing in the original information corpus, in a representative semantic space. The similarities of those documents can be ascertained from the position of their respective pseudo-document vectors in the representative semantic space.
  • a collection of documents is identified as having a similar subject matter as the original information corpus.
  • Pseudo-document vectors for documents in the collection are computed by the product of the reduced term vector matrix, the inverse of the singular value concept matrix and a transposed column term vector consisting of the global term weight of each term co-occurring in the weighted term dictionary and the document.
  • the similarity of two documents can be found by the dot product of the documents' pseudo-document vectors.
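The projection described above can be written as d = q^T T S^-1, where q holds the global weight of each dictionary term that also occurs in the document. The sketch below assumes NumPy; the term list, weights and toy matrix are hypothetical, and terms absent from the dictionary are simply ignored.

```python
import numpy as np

def pseudo_document_vector(doc_terms, term_index, weights, T, S):
    """Project a document into the k-dimensional space as q^T T S^-1."""
    q = np.zeros(T.shape[0])
    for term in doc_terms:
        i = term_index.get(term)
        if i is not None:          # terms outside the dictionary are ignored
            q[i] = weights[i]
    return q @ T @ np.linalg.inv(S)

# Hypothetical concept database built from a tiny representative corpus.
terms = ["appeal", "court", "flood", "judge", "river"]
term_index = {t: i for i, t in enumerate(terms)}
weights = np.array([1.0, 1.0, 1.5, 2.0, 1.5])   # assumed global weights
counts = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]], float)
W = weights[:, None] * counts
T0, s0, _ = np.linalg.svd(W, full_matrices=False)
k = 2
T, S = T0[:, :k], np.diag(s0[:k])

# Documents that were NOT part of the representative corpus.
legal_a = pseudo_document_vector(["court", "judge", "verdict"], term_index, weights, T, S)
legal_b = pseudo_document_vector(["appeal", "court"], term_index, weights, T, S)
water = pseudo_document_vector(["river", "flood"], term_index, weights, T, S)

sim_related = float(legal_a @ legal_b)    # dot product of pseudo-document vectors
sim_unrelated = float(legal_a @ water)
```

No new decomposition is needed for the three projected documents; only the stored T, S and weights are used.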
  • a pseudo-query document vector is computed for a query document by finding the product of the reduced term vector matrix, the inverse of the singular value concept matrix and a transposed column term vector consisting of the global term weight of each term co-occurring in the weighted term dictionary and the query document.
  • the similarity of the query to any document in the collection can be found by the dot product of the pseudo-query document vector and that document's pseudo-document vector.
  • a similarity index for all documents in the collection can be created by compiling all pseudo-document vectors representing documents in the collection into a reduced term-by-pseudo-document matrix and then finding the product of the reduced term-by-pseudo-document matrix and its transpose.
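A similarity index of this kind is a matrix of all pairwise dot products. In the sketch below the pseudo-document vectors are stacked as rows, which is an orientation chosen for illustration; the vector values are hypothetical.

```python
import numpy as np

def similarity_index(pseudo_vectors):
    """Stack pseudo-document vectors as rows and return all pairwise dot products."""
    P = np.vstack(pseudo_vectors)
    return P @ P.T

# Hypothetical 2-dimensional pseudo-document vectors for three documents.
vectors = [np.array([0.9, 0.1]), np.array([0.8, 0.3]), np.array([0.0, 1.0])]
index = similarity_index(vectors)
# index[i, j] is the similarity between document i and document j.
```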
  • documents not relevant for a comparison with a query document are eliminated from a comparison by first creating an inverted index, which maps each term in the collection of documents to the documents in the collection in which that term appears.
  • a term search is accomplished using the Boolean query to search the inverted index to obtain a filtered subset S of matching documents.
  • the similarity index for the subset S of matching documents can be created by compiling all pseudo-document vectors representing documents in the filtered subset S and the pseudo-query document vector into a reduced term-by-pseudo-document matrix and then finding the product of the reduced term-by-pseudo-document matrix and its transpose.
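The filtering step can be illustrated as follows: an inverted index answers a Boolean query, and only the surviving subset S is compared against the query vector. This is a sketch under stated assumptions; the AND-only query, the function names and the pseudo-document vectors are all hypothetical.

```python
import numpy as np

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents in which it appears."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def boolean_and(index, terms):
    """Filtered subset S: ids of documents that contain every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def rank_subset(subset, vectors, query_vector):
    """Similarity of the query to each document in S, highest first."""
    ids = sorted(subset)
    if not ids:
        return []
    # Row 0 is the pseudo-query vector; remaining rows are documents in S.
    M = np.vstack([query_vector] + [vectors[i] for i in ids])
    gram = M @ M.T                      # product of the matrix and its transpose
    scores = gram[0, 1:]
    order = np.argsort(-scores)
    return [(ids[i], float(scores[i])) for i in order]

docs = {1: "court appeal ruling", 2: "court judge appeal",
        3: "river flood warning", 4: "court tennis match"}
# Hypothetical pseudo-document vectors, assumed to be precomputed.
vectors = {1: np.array([0.9, 0.1]), 2: np.array([0.7, 0.2]),
           3: np.array([0.0, 1.0]), 4: np.array([0.2, 0.1])}

index = build_inverted_index(docs)
subset = boolean_and(index, ["court"])
ranked = rank_subset(subset, vectors, np.array([1.0, 0.0]))
```

Document 3 never enters the matrix product, which is the point of filtering first: the cost of the similarity computation depends on the size of S rather than on the size of the whole collection.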
  • documents not relevant for a comparison can be first eliminated from the comparison by creating a field index, which maps the value of a document field to the corresponding documents in the collection containing the field with that value.
  • a field search is accomplished using the field query to search the field index to obtain a filtered subset S of matching documents.
  • the similarity index for the subset S of matching documents can be created by compiling all pseudo-document vectors representing documents in the filtered subset S into a reduced term-by-pseudo-document matrix and then finding the product of the reduced term-by-pseudo-document matrix and its transpose.
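A field index can be sketched in the same spirit as the inverted index, keyed on field values instead of terms. The record layout and field names below are hypothetical.

```python
from collections import defaultdict

def build_field_index(records, field):
    """Map each value of `field` to the ids of documents carrying that value."""
    index = defaultdict(set)
    for doc_id, fields in records.items():
        if field in fields:
            index[fields[field]].add(doc_id)
    return index

# Hypothetical document fields.
records = {
    1: {"author": "smith", "year": 2001},
    2: {"author": "jones", "year": 2001},
    3: {"author": "smith"},
}
by_year = build_field_index(records, "year")
subset = by_year.get(2001, set())   # filtered subset S for the field query year = 2001
```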
  • FIG. 1 is a flowchart representing a high-level process for using LSA for comparing the similarity of data objects in accordance with the prior art.
  • FIG. 2 is a flowchart depicting a prior art process for determining the similarity of documents and terms in a semantics base using latent semantic indexing.
  • FIG. 3 is a diagram representing a typical singular value decomposition of matrix A.
  • FIG. 4 is a diagram representing a reduced singular value decomposition of matrix A.
  • FIG. 5 is a flowchart depicting a high-level process for representative semantic analysis in accordance with an exemplary embodiment of the present invention.
  • FIG. 6 is a flowchart representing a process for creating a concept database in accordance with an exemplary embodiment of the present invention.
  • FIG. 7A is a logical diagram illustrating the compilation of a term-by-document matrix and a weighted term dictionary in accordance with an exemplary embodiment of the present invention.
  • FIG. 7B is a logical diagram depicting the steps necessary for converting the term-by-document matrix and weighted term dictionary G into the components of the concept database necessary for performing representative semantic analysis in accordance with an exemplary embodiment of the present invention.
  • FIG. 8 is a flowchart depicting a process for creating a document database in accordance with an exemplary embodiment of the present invention.
  • FIG. 9 is a flowchart depicting a process for creating a query pseudo-document vector in accordance with an exemplary embodiment of the present invention.
  • FIG. 10 is a flowchart depicting an optimized process for finding the similarity between documents in a document database in accordance with an exemplary embodiment of the present invention.
  • Synonymy refers to the notion that multiple terms may accurately describe a single object or concept.
  • polysemy refers to the notion that most terms have multiple meanings. Therefore, for a user to completely understand the subject matter of a corpus of information, the user must be aware of each and all synonymic uses of a term, while simultaneously recognizing the occurrence of a term outside the realm of the context of the subject matter.
  • Latent Semantic Analysis provides a means for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text.
  • Information content is defined by more than just its mere terms but is also characterized by the context in which a specific term is given.
  • LSA assumes that there is a latent semantic structure in word usage that is partially hidden or obscured by the variability of word choice.
  • An important concept is that the similarity of meaning of terms to each other is based on the context in which that term appears and the context in which that term does not appear.
  • LSA determines similarity of meaning of terms by context in which the terms are used and represents the terms and passages by statistical analysis of a large information corpus.
  • LSA represents the terms taken from an original corpus of information or any subset of terms contained in the original corpus, as individual points in a high dimensional semantic space.
  • a semantic space is a mathematical representation of a large body of terms, and every term, combination of terms or sub-combinations of a term can be represented as a unique high-dimensional vector at a point in semantic space.
  • the similarity of two terms can be understood by comparing the terms' vectors.
  • Quantitative similarity values are defined by the cosine of the angle between the term vectors representing the terms in the semantic space. More importantly, this comparison is valid only in the defined semantic space for the corpus of information.
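The cosine measure mentioned above is the dot product of two vectors divided by the product of their lengths. A minimal sketch, assuming NumPy; returning 0.0 for a zero-length vector is a convention chosen here, not something the text specifies.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0 or nv == 0:
        return 0.0   # ASSUMPTION: a zero vector is treated as dissimilar to everything
    return float(np.dot(u, v) / (nu * nv))

same_direction = cosine(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
```

Because the cosine ignores vector length, two terms that point in the same direction in the semantic space are scored as maximally similar regardless of how frequently each occurs.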
  • LSA utilizes concepts rather than terms contained in the documents for analyzing the latent semantic structure of documents.
  • LSA allows for information retrieval on the basis of concepts contained in the document with respect to other documents in the corpus rather than terms in the documents of the corpus.
  • Latent Semantic Indexing is the application of LSA principles in a statistical information retrieval methodology.
  • the application of LSI is advantageous over traditional search techniques, such as Boolean, because statistically representative information concepts based on the latent structure of the information relieve the user from having to be an expert on the mass of information in the corpus. Specifically, the user need not know or understand the concepts used by the original authors of the information in the corpus to accurately retrieve similar information from the corpus of information.
  • LSI is able to infer the structure of relationships between articles and words that define context and meaning by the relationship between structure and concept.
  • FIG. 1 is a flow chart representing a high-level process for using LSI for comparing the similarity of data objects in accordance with the prior art.
  • the process begins by generating a term-by-data object matrix (A) that represents the frequency of term occurrence in the data objects from an information corpus (step 102).
  • Matrix A has a rank of m, where m ≤ (minimum of the number of terms and the number of data objects).
  • An individual entry in matrix A, say a_ij, represents the frequency of the term i in data object j.
  • the occurrence frequency of a selected term in a data object can be found in term-by-data object matrix A at the intersection of a data object column j and term row i.
  • term-by-document matrix A is decomposed using a singular value decomposition (SVD) into two orthogonal matrices T_0 and D_0^T containing term and data object vectors, respectively, and a diagonal matrix S_0 containing the “singular concept values” (step 104).
  • In FIG. 3, a diagram representing a typical singular value decomposition of matrix A is depicted.
  • matrix A is a t × d dimensional matrix composed of document columns and term rows and is decomposed into the three smaller matrices identified above, T_0, S_0 and D_0^T.
  • the terms matrix T_0, sometimes referred to as the left vector matrix, has dimensions of t × m.
  • the documents matrix D_0^T, sometimes referred to as the right vector matrix, has dimensions of m × d.
  • the singular value matrix S_0 has the dimensions of m × m.
  • a matrix A has rank m if A has exactly m non-zero singular values.
  • Term vector matrix T_0 represents term-by-concept relationships inherent in the information corpus, while document vector matrix D_0 represents concept-by-data object relationships in the original information corpus.
  • the dimensional size of matrix S_0 is reduced from m × m to k × k for some value of k. The dimensional representation can be better viewed with respect to FIG. 4.
  • the reduction of concepts is performed by keeping only the k largest values in matrix S_0 where, of course, m is greater than k, and setting all other values in matrix S_0 to 0.
  • matrices T_0, S_0 and D_0^T are reduced to forms T, S, and D^T, having dimensions of t × k, k × k and k × d, respectively, as can be seen in FIG. 4.
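The decomposition and reduction described above can be checked numerically. The sketch below assumes NumPy and a hypothetical 5 × 3 term-by-document matrix. Note that `numpy.linalg.svd` with `full_matrices=False` returns min(t, d) singular values, which equals the rank m only when the matrix has full rank, as this toy matrix does.

```python
import numpy as np

# Hypothetical term-by-document matrix A with t = 5 terms and d = 3 documents.
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

# Full decomposition A = T_0 S_0 D_0^T; singular values come back in descending order.
T0, s0, D0t = np.linalg.svd(A, full_matrices=False)
m = len(s0)

# Reduction, view 1: keep the k largest singular values and set the rest to 0.
k = 2
s_zeroed = s0.copy()
s_zeroed[k:] = 0.0
A_hat_zeroed = T0 @ np.diag(s_zeroed) @ D0t

# Reduction, view 2: drop the zeroed rows and columns to get T, S and D^T.
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]
A_hat = T @ S @ Dt   # same reduced matrix as A_hat_zeroed
```

Both views give the same reduced matrix; the truncated form simply avoids storing and multiplying entries that are known to be zero.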
  • a pseudo-data object is generated from the selected terms used to construct matrix Â and is then inserted in matrix Â (step 106).
  • the object here is to take a completely novel query, find a representative point for it in the semantic space, and then look at its cosine with respect to other terms or data objects in the semantic space.
  • the new pseudo-data object must be in a form that is very much like the data objects of the matrices A and Â, in that a pseudo-object must appear as a vector of terms.
  • a term vector A_q must be used for the derivation of a pseudo-data object vector A_q used similarly with a row of A_n for comparing data objects.
  • the similarity between query pseudo-data object vector A_q and the other terms and data objects in the semantic space can be found by using a similarity metric (step 108).
  • the prior art uses the cosine between query pseudo-data object D_q and the document vectors represented in reduced term-by-data object matrix Â that represent data objects in the original information corpus for that metric.
  • a cosine value of “1.0” would indicate that query pseudo-data object D_q was superimposed on another document vector in the semantic space.
  • In FIG. 2, a flowchart illustrating a prior art process for determining the similarity of documents and terms in a semantics base using latent semantic indexing is depicted.
  • the process depicted in FIG. 2 is a lower-level representation of the latent semantic indexing process depicted in FIG. 1 .
  • the process begins with the preparation of a term frequency document matrix A for an information corpus (step 202).
  • the term frequency document matrix A can be a tabular listing of document columns and term rows wherein the occurrence frequency of each term for a document is listed in the respective document column term row intersection.
  • a term frequency matrix can be thought of as the set of document vectors for documents contained in the information corpus.
  • a document can be represented as a vector of terms (or keywords) that are meaningful words (i.e., words that convey content).
  • a compact list of terms in the document may be formulated by first removing stop words or non-content words such as “are, in, is, of, the, etc.” Then, terms that have identical underlying roots, such as “troubling, troublesome, troublemaker, trouble, etc.,” should be grouped to the unique word “trouble.” Synonyms, such as “sole, only, unique,” should also be grouped, and concept phrases should be grouped as single terms, such as “Information Highway,” “Information Technology” and “Data Mining.” Given the reduced set of terms, the number of times (frequency) each term occurs in a document is tabulated for the document.
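The term-preparation steps above (stop-word removal, grouping by root, grouping synonyms, treating concept phrases as single terms, then counting) can be sketched with simple lookup tables. The tables below are hypothetical and only cover the examples given in the text; choosing “sole” as the representative synonym is an arbitrary assumption, and a real system would use a proper stemmer and thesaurus.

```python
from collections import Counter

STOP_WORDS = {"are", "in", "is", "of", "the"}
ROOTS = {"troubling": "trouble", "troublesome": "trouble", "troublemaker": "trouble"}
SYNONYMS = {"only": "sole", "unique": "sole"}          # representative term is assumed
PHRASES = {("information", "highway"): "information_highway",
           ("information", "technology"): "information_technology",
           ("data", "mining"): "data_mining"}

def extract_terms(text):
    """Return a Counter of normalized term frequencies for one document."""
    words = [w.strip('.,;:"!?').lower() for w in text.split()]
    words = [w for w in words if w]
    # Merge two-word concept phrases into single terms.
    merged, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    terms = []
    for w in merged:
        if w in STOP_WORDS:
            continue
        w = ROOTS.get(w, w)       # group words sharing a root
        w = SYNONYMS.get(w, w)    # group synonyms
        terms.append(w)
    return Counter(terms)

freq = extract_terms(
    "Data mining is the only troubling trend in information technology; "
    "the trouble is unique."
)
```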
  • term frequency document matrix A has a rank of m (the value m being less than or equal to the smaller of the total number of documents or terms in the term-by-document matrix A).
  • the process proceeds by preprocessing the term frequency document matrix A by using a singular value decomposition (SVD) algorithm (step 204).
  • the T_0 matrix (having the dimensions t × m) describes the term-by-concept attributes of the semantics base.
  • the S_0 matrix describes the concept-by-concept attributes of the semantics base.
  • the D_0 matrix (having the dimensions d × m) describes the concept-by-document attributes of the semantics base.
  • the SVD of matrix A is represented in FIG. 3 as described above.
  • the k largest values in the matrix S_0 are kept, thereby reducing the m × m matrix S_0 to a k × k matrix, matrix S, and matrices T_0 and D_0 are reduced accordingly (step 206).
  • the concept dimension is reduced for all three matrices, forming T, S, and D^T, having dimensions of t × k, k × k and k × d, respectively, as can be seen in FIG. 4.
  • once reduced term-by-document matrix Â has been constructed from matrices T, S and D^T, the semantic space defined by matrix Â can be examined for a number of similarity relationships between terms or documents in the information corpus.
  • a pseudo-query may be created and inserted into term-by-document matrix Â (step 210).
  • Returning to step 210, if, on the other hand, the user does not intend to issue a pseudo-query, the user can compare each term or document within term-by-document matrix Â. Therefore, a check is made to determine whether a term similarity matrix is to be made for term comparisons (step 216). If a term comparison is to be made, a table is constructed exhibiting the relative similarity of each term to every other term in term-by-document matrix Â by ÂÂ^T or TS^2T^T. The resultant term-by-term similarity matrix gives the relative similarities between every term and every other term in matrix Â. Again, computing the term-by-term matrices is a very exhaustive process and requires intensive processing power.
  • Similarities between individual terms can be found by merely finding the dot product of the two row term vectors, i.e., row term vector T_i and row term vector T_j, corresponding to respective term t_i and term t_j being compared. For instance, similarity between the terms i and j can be found by finding the dot product between the two term row vectors, row vector T_i and row vector T_j, which exist in matrix Â corresponding to terms t_i and t_j, respectively. This product is actually a cosine between the term vectors T_i and T_j; therefore, as the value of the product approaches “1.0,” the two terms lie over one another in the semantic space and are close in semantic similarity (step 218).
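The identity between the full term-by-term table and its factored form, and the shortcut of comparing just two term rows, can be verified on a toy matrix. This sketch assumes NumPy and a hypothetical 5 × 3 matrix. A raw dot product equals the cosine only when both vectors have unit length, so an explicitly normalized version is included as well.

```python
import numpy as np

A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
T0, s0, D0t = np.linalg.svd(A, full_matrices=False)
k = 2
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]
A_hat = T @ S @ Dt

# Full term-by-term similarity table, computed two equivalent ways.
term_sim_full = A_hat @ A_hat.T          # t x t, built from the reduced matrix
term_sim_factored = T @ S @ S @ T.T      # T S^2 T^T

def term_similarity(i, j):
    """Dot product of the scaled row term vectors for terms i and j."""
    return float((T[i] @ S) @ (T[j] @ S))

def term_cosine(i, j):
    """Same comparison, normalized so that the result is a true cosine."""
    ti, tj = T[i] @ S, T[j] @ S
    return float(ti @ tj / (np.linalg.norm(ti) * np.linalg.norm(tj)))
```

Comparing a single pair of terms only touches two k-dimensional rows, whereas the full table requires a t × t product.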
  • Returning to step 216, if the similarity between terms is not to be determined, then the user may instead use term-by-document matrix Â to determine the similarity between any two documents (i.e., document d_i and document d_j).
  • the process again reverts to step 214 and proceeds exactly as described in the case of the pseudo-query. That is, a document-to-document table of similarities can be created by Â^TÂ or DS^2D^T, which defines the similarity of all documents in matrix A.
  • finding the similarity between any document d_i and any other document d_j is merely a matter of following the row for one document and the column for the other and finding the value at the intersection point in the document similarity table for Â.
  • computing Â^TÂ is a very intensive process that requires extraordinary processing power and is highly dependent upon the size and complexity of the original document corpus.
  • Finding the similarity between a few documents within the semantics space defined by matrix Â may be accomplished in a manner similar to that for two terms, that is, by finding the dot product between two column vectors, vector D_i and vector D_j, that define the respective subject document d_i and document d_j.
  • the similarity between the documents may be found by the dot product of the two column document vectors corresponding to the respective documents (i.e., column document vector D_i and column document vector D_j, which are taken from matrix Â). Again, this product results in the cosine between column document vector D_i and column document vector D_j. Therefore, as the product approaches “1.0,” the documents are more similar.
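On the document side, the similarity table and its cosine-normalized form can be sketched as follows, again assuming NumPy and a hypothetical toy matrix. Each document is represented by a row of DS, so the table DS^2D^T is the matrix of pairwise dot products of those rows.

```python
import numpy as np

A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
_, s0, D0t = np.linalg.svd(A, full_matrices=False)
k = 2
S = np.diag(s0[:k])
D = D0t[:k, :].T                      # d x k document vector matrix

doc_vectors = D @ S                   # row j represents document j
table = doc_vectors @ doc_vectors.T   # document similarity table, D S^2 D^T

# Normalizing by vector lengths turns the dot products into cosines.
norms = np.linalg.norm(doc_vectors, axis=1)
cosines = table / np.outer(norms, norms)

# Looking up one pair is a single table access.
sim_0_1 = float(cosines[0, 1])
sim_0_2 = float(cosines[0, 2])
```

In this toy example documents 0 and 1 share vocabulary and document 2 does not, so the first pair scores higher than the second.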
  • the LSI process is extremely processing intensive and requires a new singular value decomposition to be performed each time a different document corpus is selected to be ranked, searched or documents therein otherwise compared.
  • SVD of the term-frequency document matrix into left and right vector matrices and a singular values matrix is a bottleneck of the LSI process.
  • SVD of the term-frequency document matrix simply cannot be performed efficiently for traditional LSI processes and negatively impacts the LSI process performance. Processing efficiency is inversely related to the size and complexity of the original document corpus; however, searching for similarities in large and complicated document sets is the major benefit of LSI.
  • LSI is most beneficial in ranking documents in a larger and more complicated collection of documents.
  • the reduced reconstructed term-by-document matrix A merely presents concepts as a semantic space that is in better form for searching similarities between terms and documents.
  • LSI is extremely processing intensive and may require frequent decompositions of a dynamic information corpus or one being searched with a variety of query documents. Therefore, in an effort to reduce the amount of preprocessing steps needed for a similarity search of either terms or documents in a semantic space, the present invention utilizes representative semantic analysis for preprocessing a semantic space.
  • N is the number of terms plus documents, and k is the number of dimensions in the semantic space.
  • N expands rapidly with the number and complexity of documents in the information corpus (i.e., as the quantity of terms and documents increase).
  • the SVD algorithm is extremely unfriendly for decomposing a large information corpus.
  • each addition or deletion of a document in the information corpus requires that the corpus be reprocessed in order to accurately reflect the documents that may be recalled. Therefore, the SVD algorithm is unfeasible for a dynamic collection of documents that comprise an information corpus. Again, performing an SVD is much too time consuming to do on a regular basis.
  • documents in document collections are characterized by using a concept database.
  • a concept database is a representative intelligence useful for the creation of a representative semantic space for comparing documents.
  • the representative semantic space is not created from the documents in a collection to be compared, as in the prior art, but is used merely for projecting documents for making a comparison.
  • the concept database consists of term and singular vector matrices which are products of the decomposition of a corpus of documents.
  • the corpus of documents is that collection of documents selected as being representative of a similarity criteria related to document comparisons for the document collections.
  • the concept database provides a mechanism for the creation of a representational semantic space for the placement of pseudo-document vectors that represent related similarity criteria, without having to decompose a reduced term-by-document matrix Â for the entire collection of documents. Therefore, the process of searching a collection of documents is condensed to the steps of creating a concept database from which a document database is created using a collection of documents to be searched and searching the document database. Searching the document database may be accomplished similar to that process described above using term or document vector matching.
  • searching the document database includes a two-part query: term searching and matching (pseudo) vectors, either document or term pseudo vectors.
  • the present invention may be better understood by paying particular attention to the flowchart in FIG. 5 , which depicts a high-level process for representative semantic analysis in accordance with an exemplary embodiment of the present invention.
  • the process begins by creating a representative intelligence for defining a representative semantic space based on representative latent semantic similarities from a representative document corpus (step 502 ).
  • the representative intelligence is referred to herein as a concept database.
  • a representative corpus of documents is compiled based on similarity criteria and used as a sample of the intelligence of the subject documents and databases to be compared.
  • a representative semantic space defined by similarity criteria for the documents making up a corpus is at least partially based on latent semantic similarities inherent in those representative documents.
  • similar documents, databases with similar documents and even query documents that are similar to the documents in the corpus can be projected into the representative semantic space and compared based, at least partially, on their latent semantic similarities.
  • the concept database is constructed from a corpus of representative documents (e.g., document d 1 to document d d ), wherein the corpus contains exactly d number of documents. While the corpus might include some of the documents to be compared, optimally the concept database is created from a corpus of documents selected for reliably representing similarity criteria which form the basis for document comparisons. The actual documents to be compared may not have been identified at the time the document corpus is compiled and the concept database is created therefrom. Similarity criteria refer to the parametric attributes that define comparison metrics applicable to a representative semantic space. In other words, similarity criteria refer to the document attributes that will be used for making document comparisons. For example, similarity criteria may be as uncomplicated as specifying a document type, or alternatively may be as complicated as defining a hierarchy of attributes and parametric values used to characterize document similarity.
  • the concept database consists of two matrices, S and T, found by decomposing a term-by-document matrix for the document corpus and a dictionary of weighted terms in the corpus.
  • any collection of documents related by the similarity criteria used to compile the original document corpus can be compared without SVD to that collection. Instead, comparing a collection of documents merely involves creating a document database for documents in a collection to be compared.
  • the concept database is used to create a document database of, inter alia, pseudo-document vectors for a group of documents to be searched (step 504 ).
  • the creation of the present document database for the representative semantic analysis represents a significant departure from prior art LSI processing. It should be understood that the document database, while serving a similar function as the document matrix D, is not a document matrix D because the collection of documents to be compared is never decomposed.
  • LSI process decomposes a term-by-document matrix (derived from the collection of documents to be compared) into a term-by-concept matrix T, a concept-by-concept matrix S, and a concept-by-document, document matrix D.
  • the document matrix D is the most important matrix of the three matrices and is used primarily for comparing documents according to the prior art.
  • the collection must be reprocessed using, for example SVD, and a new document matrix D created for comparisons.
  • the document matrix D is not used in the present representative latent semantic analysis search. Instead, only the term vector matrix T and the singular value matrix S are used to build pseudo-document vectors for collections of documents to be compared which are included in the document database.
  • a second departure from the traditional LSI processing is more conceptual and relates to the definition of the semantic space used for making document comparisons.
  • the document matrix D of the prior art LSI processing is decomposed from the actual documents to be compared; therefore, the LSI semantic space defined by the document matrix D is created from actual concepts existing in the collection of documents to be compared.
  • the semantic space defined by the document database is not created from the document collection, but is instead created by application of the concept database to the document collection. Therefore, the semantic space used for comparing documents in the document database is defined by the similarity criteria used to compile the document corpus rather than the documents in the collection.
  • the semantic space used for comparing documents in the present invention is merely a representational semantic space since the document corpus is compiled with documents representing a similarity criteria.
  • a document database is a dynamic collection of documents, which presents an interface that allows documents to be added, deleted and compared simultaneously.
  • Each document stored in the document database consists of one or more fields containing the text of the document and associated meta-data. The data in each of these fields are stored in the document database, as well as a field index corresponding to each field.
  • an inverted index is maintained, which maps all terms appearing in every document to the documents in the document database. Unlike the weighted term dictionary in the concept database, the inverted index contains all distinct terms occurring in the documents, including stop words.
  • the document database contains an additional database with a set of pseudo-document vectors, each of which represents a unique document in the document database.
  • documents are added to a document database for comparisons without decomposing the document database prior to making document comparisons.
  • documents may be deleted from the document database without decomposing the document database prior to making document comparisons.
  • a query document d q may be easily added to the document database for making a comparison and then deleted prior to making any other document comparisons. Therefore, using the concept database a query pseudo-document vector can be created for a query document (step 506 ).
  • a term-by-document matrix Â can be produced and from that, a table of similarities between documents can also be produced, as discussed above, from the product $D S^{2} D^{T}$ for ranking documents in the document database (step 508 ). If query document d q is used, it is inserted in the document database and the documents in the document database are ranked with respect to the query pseudo-document vector in accordance with the product $D S^{2} D^{T}$.
  • concept database 730 comprises two matrices, matrix T 722 and matrix S 724 , and a dictionary of weighted terms G 710 .
  • the process begins with the selection of a corpus of documents having similar subject matters (step 602 ).
  • the corpus of documents 704 may be large or small, but a larger corpus may yield better results due to a larger variety of term concepts being represented within the corpus.
  • Building a concept database begins with selecting an appropriate document corpus, such as from document collection 702 .
  • Document corpus 704 is a representative sample of documents which is based on a similarity criteria to be used for a search and contains exactly d number of documents.
  • Similarity criteria refer to the parametric attributes that define similarities in a representative semantic space with regard to documents. Similarity criteria may be as uncomplicated as the particular document type, or alternatively, may be a complicated hierarchy of attributes and parametric values associated with document similarity. For instance, a corpus of documents might be compiled based on similarity criteria involving a particular document type, such as a United States issued patent. Other predetermined document attributes that might be defined for compiling the document corpus include a broad statement of the subject matter of the issued patent or the technology area, such as the subject matter of computer-implemented technologies. Other document attributes might further refine the subject matter, such as one of three-dimensional graphics, information technology, or data compression. An exemplary document corpus based on the similarity criteria for U.S. issued patent document types related to information technology would comprise only U.S. issued patents related to information technology.
  • the similarity criteria may be narrowed even further by specifying other parametric attributes, for example, such as a category of information technology, such as database recovery, interface engines or search engines.
  • an exemplary document corpus might comprise only U.S. issued patents related to search engines, which is a category of information technology.
  • a representative semantic space defined by such similarity criteria would be useful for comparing U.S. issued patents related to search engines or for searching databases of U.S. issued patents for documents similar to U.S. issued patents directed to search engines or, alternatively, even searching more general databases for documents similar to U.S. issued patents directed to search engines.
  • databases to be searched contain documents on Object Oriented Programming (OOP)
  • document corpus 704 is a representative sample of d number of documents, that sample may be compiled from different document sources without varying from the intended scope of the present invention.
  • document corpus 704 is optionally compiled directly from document collection 702 , which is the subject of an initial document comparison.
  • the representative semantic space is defined by the similarity criteria of the document collection, similar to traditional LSI processing.
  • document collection 702 may be condensed from c number of documents to a smaller number, d, of representative documents for corpus 704 . Reducing the number of documents in document corpus 704 greatly accelerates the representative semantic analysis process without sacrificing precision in the document comparison.
  • One method of condensing document collection 702 is to sort the collection using a random selection algorithm (note that the process arrow from document collection 702 to document corpus 704 is dashed, indicating that the step is optional). Another method is to manually sort document collection 702 for the first d number of documents that accurately represent a predetermined similarity criteria.
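The random selection approach can be sketched as follows. This is only an illustration under stated assumptions: documents are represented by hypothetical string identifiers, and the helper name `condense_collection` and its `seed` parameter are introduced here for reproducibility and are not part of the described process.

```python
import random

def condense_collection(collection, d, seed=None):
    """Randomly pick d representative documents from a larger collection."""
    if d >= len(collection):
        return list(collection)  # nothing to condense
    rng = random.Random(seed)
    return rng.sample(list(collection), d)

# Hypothetical document collection of c = 100 documents, reduced to d = 10.
collection_702 = [f"doc-{n}" for n in range(100)]
corpus_704 = condense_collection(collection_702, d=10, seed=0)
```

Sampling without replacement guarantees that no document appears twice in the resulting corpus.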
  • documents in the document corpus need not be identical to the documents that will be compared or searched, but instead are used only for creating a concept database for defining a representative semantic space in which pseudo-document vectors for documents in a collection to be compared are projected.
  • the documents in the corpus may be subsequently vectorized and included in a document database for making document comparisons, it is not necessary to do so.
  • the primary purpose for compiling the document corpus 704 is for the realization of a representative semantic space which is inherently based on representative latent semantic similarities attributable to the d number of representative documents in the corpus.
  • the documents in the corpus are of no further value in defining the representative semantic space and are discarded.
  • the document corpus is merely a representative sample of documents that conform to a predefined similarity criteria, the actual documents used for compiling the corpus are relatively unimportant.
  • the documents themselves merely represent the predefined similarity criteria with which the semantic space is defined.
  • document corpus 704 is compiled for documents that are not included in the document collection to be compared.
  • Representative documents are compiled from any source based on predetermined similarity criteria.
  • document corpus 704 may be compiled rather expeditiously for less complicated similarity criteria such as document type.
  • documents directed to media broadcasts may be compiled into a document corpus by merely collecting media documents of the media type to be compared. If too many equally relevant documents are gathered, then a sorting algorithm may be employed as described above for randomly selecting documents based on the optimal size of the corpus. More complicated similarity criteria may require an operator to manually sift through relevant documents to identify the most representative of a predetermined similarity criteria.
  • a client may best understand which documents are representative of similarity criteria without fully understanding how to articulate exactly what parametric attributes are desired.
  • the client may compile a valid document corpus of representative documents from which document comparisons may be based.
  • a corpus of documents may be compiled by a client who has no understanding whatsoever of the present process.
  • the representative semantic analysis process of the present invention is largely a mathematical process, of which compiling a document corpus is the primary non-mathematical step. Compiling the document corpus is the primary step where understanding the similarity criteria plays a role. Therefore, should a corpus of documents be provided beforehand, the representative semantic analysis process can make document comparisons essentially automatic. Moreover, because the documents in a document corpus and in a document database to be compared are implemented in software document codes, the only requirement is that the document codes be equivalent. Therefore, accurate document comparisons may be made where the subject documents are in a language which is foreign to the operator so long as a document corpus has been provided and the foreign language documents are all implemented in a matching document code.
  • Document database 702 is considered to be the corpus of the subject matter to be searched and may be reduced in size by applying a sorting algorithm.
  • the sorting algorithm for instance, a random selection algorithm, reduces the number of documents in the document database by randomly selecting documents to populate document corpus 704 .
  • the optimum value for d is greater than 5,000 documents, where each document contains between 1-Kilobyte and 64-Kilobytes of text.
  • selecting a document database which contains an extraordinary number of documents not closely related to the similarity criteria to be searched will result in the reduction of the number of contextual concepts represented in term-by-document matrix A that are important for discovering similarities between documents within a given subject matter.
  • the accuracy of the search results will be lowered due to the extraordinary number of documents in document database which are not closely related to the similarity criteria.
  • a dictionary which comprises all t number of terms occurring in the d number of documents in the corpus (i.e., dictionary 706 ) (step 604 ). Indexing each and every term is not necessary and may lower the overall efficiency of the representative semantic analysis process because some terms have little or no meaning with respect to the similarity criteria and yet have extremely high occurrence frequency. For instance, stop words (e.g., “a,” “and” and “the”) might all be eliminated from the dictionary of terms.
  • a term occurrence algorithm can be applied to the documents of document corpus 704 for creating term dictionary 706 .
  • the term occurrence algorithm may have a number of forms.
  • the term occurrence algorithm may also implement other specialized rules for forming dictionary 706 . It may be the case that a particular term has a contradictory or misleading meaning which would tend to skew the concept analysis of the term.
  • the term “object” is an extremely important concept when trying to describe object oriented programming concepts. However, if the collection of documents to be examined is a collection of documents which represents issued patents, the term “object” also conveys a meaning of the purpose of the patent or subpart of the patent. Thus, in that context, the term “object” specifically relates to the subject matter of patentability rather than the subject matter of object oriented programming.
  • the term occurrence algorithm may exclude terms from the term dictionary that do not occur more than a predetermined number of times in the corpus. Normally, terms that do not occur more than a threshold number of times occupy a space in the term dictionary. Since each term in the term dictionary occupies a row in the term-by-document matrix A, low occurrence terms also have a row in matrix A as does any other term. However, terms with an extremely low number of occurrences in the corpus cannot participate in the formation of stronger concepts due to their particularly low frequency of occurrence. Therefore, in accordance with another exemplary embodiment, the occurrence algorithm allows for an artificial threshold to be predetermined for term occurrences in the corpus.
  • Terms that do not occur the predetermined number of times in the corpus are excluded from the dictionary, and therefore are also excluded from the term-by-document matrix A. Excluding terms with extremely low occurrences optimizes the decomposition process by eliminating rows in the term-by-document matrix A devoted to terms that do not participate in forming stronger concepts, while not affecting the quality or character of the decomposed matrices.
  • the predetermined threshold amount is from one to four occurrences in the corpus, but as a practical matter has been set at three for various sized corpuses with exceptional results.
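A term occurrence algorithm of the kind described above might look like the following sketch. It assumes a simple lowercase tokenizer, a short illustrative stop-word list, and a threshold interpreted as "at least three occurrences in the corpus"; the text leaves open whether the threshold is inclusive, so that interpretation is an assumption. The names `tokenize`, `build_term_dictionary` and `excluded_terms` are hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is"}  # illustrative list only

def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_term_dictionary(corpus, min_occurrences=3, excluded_terms=frozenset()):
    """Return the sorted dictionary terms that pass all exclusion rules."""
    counts = Counter()
    for text in corpus:
        counts.update(tokenize(text))
    return sorted(
        term for term, n in counts.items()
        if n >= min_occurrences           # drop low-occurrence terms
        and term not in STOP_WORDS        # drop stop words
        and term not in excluded_terms    # drop misleading terms such as "object"
    )

corpus = [
    "the search engine ranks the document",
    "a search engine and a document index",
    "the object of the search is a document",
    "an index of terms",
]
dictionary = build_term_dictionary(corpus, min_occurrences=3)
```

The `excluded_terms` argument gives a place to apply specialized rules, for example removing a term whose meaning in the corpus would skew the concept analysis.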
  • dictionary 706 will contain t number of terms that are also contained in the documents in document corpus 704 (i.e., term t 1 through term t t , and each of terms t 1 through t t may have some semantic relationship to the similarity criteria used to compile the corpus).
  • a term-by-document matrix 708 is constructed from document corpus 704 and term dictionary 706 (step 606 ).
  • term-by-document matrix 708 for the representative semantic analysis process of the present invention is fundamentally different from the term-by-document matrix created for LSI processing.
  • Documents in the representative term-by-document are selected to represent a similarity criteria related to document comparisons, while traditionally documents for a LSI term-by-document matrix were those used in the actual document comparisons.
  • Term-by-document matrix 708 is populated with local term weights L for the terms in the term dictionary 706 (e.g., term t 1 through term t t ) such that L i represents the frequency of occurrence of each unique dictionary term t i in a unique document d j in which the term occurs. Because term-by-document matrix 708 is populated with document information from document corpus 704 and that information is representative of similarity criteria (and not necessarily the document to be compared), term-by-document matrix 708 is also referred to as a representative term-by-document matrix.
  • the entries of representative term-by-document matrix 708 are typically arranged such that the occurrence frequency of a term t i in a document d j is found in the entry at the intersection of the term t i row and document d j column.
  • representative term-by-document matrix 708 is a tabular listing of term occurrences by document and thus the indices in a matrix are document-local term weights L i (or simply local term weights).
  • the representative term-by-document matrix is a collection of document column vectors, wherein each column is a vector representing a unique document d j in the corpus 704 and each index is a coefficient representing occurrences for a dictionary term in the document.
  • the term-by-document matrix is a collection of term row vectors, wherein each row is a vector representing occurrences of dictionary term t i by document.
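The row and column views of the representative term-by-document matrix can be illustrated with a small sketch. It assumes documents have already been tokenized and that local term weights are raw occurrence counts, as described above; the function name and the sample data are hypothetical.

```python
from collections import Counter

def build_term_by_document_matrix(corpus_tokens, dictionary):
    """Rows are dictionary terms, columns are documents, entries are raw counts."""
    counts_per_doc = [Counter(tokens) for tokens in corpus_tokens]
    return [[counts[term] for counts in counts_per_doc] for term in dictionary]

dictionary_706 = ["document", "engine", "search"]
corpus_tokens = [
    ["search", "engine", "search"],       # document d 1
    ["document", "search"],               # document d 2
    ["engine", "document", "document"],   # document d 3
]
matrix_708 = build_term_by_document_matrix(corpus_tokens, dictionary_706)

# Row view: occurrences of the term "search" by document.
search_row = matrix_708[dictionary_706.index("search")]
# Column view: the document vector for the first document.
first_document_column = [row[0] for row in matrix_708]
```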
  • a weighted term dictionary G 710 is constructed by calculating the global term weight for each term in the term-by-document matrix (step 608 ).
  • Terms in dictionary 706 are then weighted based on information theory and statistical practices (i.e., log entropy and local and global weighting schemes) converting term dictionary 706 into weighted term dictionary G 710 .
  • terms occurring in term dictionary 706 are globally weighted based on a term's cumulative occurrences in the d number of documents (e.g., document d 1 through document d d ) of document corpus 704 .
  • Term dictionary 706 is modified to hold the global term weights G i to form weighted term dictionary G 710 .
  • Global term weight G i is determined by the occurrences of each term t i in documents d 1 through d d of term-by-document matrix 708 (or similarly in document corpus 704 ).
  • a weighted term-by-document matrix A 720 is then created from the weighted term dictionary G 710 and the representative term-by-document matrix 708 by applying the term weighting function to the representative term-by-document matrix 708 (step 610 ).
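The weighting step can be sketched as below. The text names log entropy weighting without giving a formula, so this sketch assumes the commonly used entropy form for the global weight, G_i = 1 + Σ_j p_ij log(p_ij) / log(d) with p_ij equal to the term's count in document j divided by its total count, and it keeps the raw occurrence count as the local weight. Both choices, and the function names, are assumptions rather than the formula of the specification.

```python
import math

def global_entropy_weights(matrix):
    """One global weight G_i per term row, using the common entropy formula."""
    d = len(matrix[0])
    weights = []
    for row in matrix:
        total = sum(row)
        entropy = 0.0
        for tf in row:
            if tf > 0:
                p = tf / total
                entropy += p * math.log(p)
        # log(d) is zero for a single-document corpus, so guard the division
        weights.append(1.0 + entropy / math.log(d) if d > 1 and total > 0 else 1.0)
    return weights

def apply_weights(matrix, weights):
    """Scale every local weight in a term row by that term's global weight."""
    return [[tf * g for tf in row] for row, g in zip(matrix, weights)]

matrix_708 = [
    [3, 0, 0],   # term concentrated in one document
    [1, 1, 1],   # term spread evenly over all documents
]
G_710 = global_entropy_weights(matrix_708)
A_720 = apply_weights(matrix_708, G_710)
```

Under this formula a term concentrated in a single document keeps a weight of 1, while a term spread evenly across the corpus is weighted down toward 0, reflecting how little it discriminates between documents.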
  • a singular values matrix S 0 and a corresponding left singular term vector matrix T 0 are decomposed from weighted term-by-document matrix A 720 using any one of a number of decomposition algorithms such as a singular value decomposition (step 612 ).
  • matrices S 0 , T 0 and a third matrix, document matrix D 0 are decomposed from weighted term-by-document matrix A 720 as described above with respect to FIGS. 3 and 4 .
  • concept database 730 is for defining a representative semantic space in which related documents may be compared, the document matrix D 0 is of no use and is immediately discarded as it is not needed for the representative semantic analysis.
  • the dimensional size of singular value matrix S 0 is reduced by selecting the k largest singular values 725 which represent the concepts in the representative semantic space (step 614 ).
  • the value of k is normally set between 250 and 400, having an optimum value of approximately 300.
  • the size of the term vector matrix, T 0 is also reduced to k columns in accordance with the reduction, thereby forming reduced term vector matrix T.
  • the singular values in matrix S 722 are ordered by size for the m concepts from weighted term-by-document matrix A 720 . Only the k highest ranking concepts 725 are used for defining the representative semantic space, which are identified in matrix S 724 . A reduced term-by-document matrix Â is then realized by keeping only the k largest concepts from matrix S 724 .
  • the decomposition products, matrices S and T, are the means for defining a representative semantic space for comparing documents by the similarity of their latent semantic concepts.
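The decomposition and reduction steps can be sketched with NumPy's `numpy.linalg.svd`, which returns singular values in descending order. This is a minimal illustration, not an optimized SVD for a large corpus: the sample matrix is tiny, k is 2 rather than roughly 300, and the function name `decompose_and_reduce` is hypothetical.

```python
import numpy as np

def decompose_and_reduce(A, k):
    """Return the reduced term matrix T (t x k) and the k largest singular values."""
    # The third factor corresponds to the document matrix, which is not kept.
    T0, s0, _D0_t = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s0))
    T = T0[:, :k]   # first k left singular (term) vectors
    S = s0[:k]      # k largest singular values, already sorted
    return T, S

# Hypothetical weighted term-by-document matrix: 5 terms by 4 documents.
A_720 = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 3.0],
    [1.0, 1.0, 1.0, 1.0],
])
T, S = decompose_and_reduce(A_720, k=2)
```

Only `T` and `S` are returned because the document factor plays no part in defining the representative semantic space.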
  • Accurate S and T matrices are more important for representative semantic analysis than standard LSI searching because the representative semantic space is representative of the concepts in the collections of documents to be searched.
  • the present invention relates to an optimization over traditional LSI searching methods, so selecting an efficient SVD process is significant for achieving overall optimized results.
  • the present invention does not rely on SVD for finding a document matrix D because documents in document matrix D are not compared to one another. Moreover, the document vectors defined in the row vectors of the document matrix D are not properly defined in the domain of the representative semantic space defined by concept database 730 .
  • Concept database 730 defines a representative semantic space which is then used for comparing one document to another as described further below, whether or not the document was a part of the document corpus used for the SVD. Each document in a document database to be compared is placed in a document vector form similar to that illustrated above with respect to weighted term-by-document matrix A 720 .
  • weighted term dictionary G 710 , which includes the global weights of each term with respect to document corpus 704 , is also included in concept database 730 .
  • concept database 730 is realized from the reduced singular values matrix S 724 , the reduced term vector matrix T 722 and the weighted term dictionary G 710 , each derived from the similarity criteria used to compile representative corpus 704 (step 616 ). The process then ends.
  • FIG. 8 is a flowchart depicting a process for creating a document database in accordance with an exemplary embodiment of the present invention. This process is represented as step 504 of the representative semantic analysis process depicted in FIG. 5 .
  • the document database consists of a set of documents represented by the document field data, field indices, inverted indices and pseudo-document vectors that represent the documents in the document database.
  • Associated with the compressed set of documents is an index of unique internal doc-IDs mapped to a user-defined document field and associated field index that maps the user-defined field to the unique internal doc-ID.
  • the process begins by identifying a collection of documents to be searched (step 802 ). For best document comparison results, the similarity criteria of the collection of documents should be related to that used to compile the representative document corpus. The more askew the similarity criteria of the two databases, the less reliable the similarity results are between documents or terms because the comparison of document vectors must be performed in an appropriate semantic space. More precisely, pseudo-document vectors for the collection of documents are compared in the representative semantic space which produces an indication of the similarity between documents. Document vectors that are not created from SVD of the original information corpus are called ‘pseudo-document vectors’ because they present themselves as vectors of terms rather than an SVD product.
  • LSI ensures that the semantic space used for comparing document vectors is the appropriate space for making comparisons by always using document vectors from the document vector matrix D returned from the decomposed corpus of documents. In doing so, the multidimensional semantic space created by LSI is defined around the semantic concepts in the information corpus. All document vectors representing documents in the information corpus will map to that semantic space and therefore similarity metrics can be properly applied to the document vectors.
  • the present invention avoids the task of continually computing a SVD of new collections of documents by preprocessing a representative semantic space. The representative semantic space can then be used for comparing the similarity of documents having comparable subject matters to the subject matters of the information corpus used to create the concept database.
  • an inverted index is constructed that maps each term from the collection of documents to the documents in which the term appears (step 804 ).
  • the inverted index contains all terms occurring in the collection of documents including all stop words, unlike the weighted term-by-document matrix 720 in FIG. 7B that excludes stop words.
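An inverted index of this kind can be sketched in a few lines. The sketch assumes documents are keyed by a hypothetical integer doc-ID and tokenized with a simple lowercase rule; stop words are deliberately kept, in line with the description above.

```python
import re
from collections import defaultdict

def build_inverted_index(documents):
    """Map every term, stop words included, to the set of doc-IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return dict(index)

documents = {
    1: "The search engine ranks the document",
    2: "A document index",
    3: "The index of terms",
}
inverted_index = build_inverted_index(documents)
```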
  • A d is a column vector formed from a term vector created from terms co-occurring in the weighted term dictionary G and in the document d j to be represented as a pseudo-document vector. Also note that at this point, all co-occurring terms in term vector A d are weighted locally by a term's occurrence in document d j , local weight L i , and then scaled by the global weight of the term taken from term dictionary G, global weight G i .
  • Matrices T and S are the term vector and singular value matrices respectively taken from the concept database.
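The construction of a pseudo-document vector can be sketched as a projection of the weighted term vector through T and the inverse of S. The sketch assumes the same form that the specification gives for the query vector, that is, the transposed term vector multiplied by T and by the inverse of S. The tiny T, S and dictionary values, and the dictionary layout mapping each term to a row index and a global weight, are hypothetical.

```python
def pseudo_document_vector(local_counts, G, T, S):
    """Project one document into the k-dimensional space: A_d^T T S^-1.

    local_counts: dict term -> occurrences in the document (local weight)
    G: dict term -> (row index in T, global weight)
    T: t x k term vector matrix as a list of rows
    S: the k singular values (the diagonal of matrix S)
    """
    k = len(S)
    projected = [0.0] * k
    for term, count in local_counts.items():
        if term not in G:
            continue  # only terms co-occurring in the weighted dictionary contribute
        row, g = G[term]
        a = count * g  # locally weighted, then scaled by the global weight
        for c in range(k):
            projected[c] += a * T[row][c]
    return [projected[c] / S[c] for c in range(k)]

G = {"search": (0, 1.0), "engine": (1, 0.5), "document": (2, 0.8)}
T = [[0.6, 0.0], [0.8, 0.0], [0.0, 1.0]]
S = [2.0, 1.0]
D_d = pseudo_document_vector({"search": 2, "engine": 1, "patent": 4}, G, T, S)
```

The term "patent" is ignored in this example because it does not occur in the weighted term dictionary, so no decomposition of the new document is needed to place it in the space.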
  • a reduced term-by-document matrix Â can be compiled from the pseudo-document vectors as discussed above with respect to previous embodiments. From there, the product $\hat{A}^{T}\hat{A}$ or $D S^{2} D^{T}$ defines an index of similarities from document-to-document in the collection of documents to be searched.
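The document-to-document index of similarities can be computed directly from the pseudo-document vectors. The sketch below assumes D holds one pseudo-document vector per row and that S is given as the list of singular values on its diagonal, so that the product of D, S squared and the transpose of D reduces to dot products of singular-value-scaled rows. The function name and the sample values are hypothetical.

```python
def similarity_table(D, S):
    """Return D S^2 D^T for pseudo-document vectors D (one row per document)."""
    scaled = [[x * s for x, s in zip(row, S)] for row in D]  # rows of D S
    return [[sum(a * b for a, b in zip(u, v)) for v in scaled] for u in scaled]

D = [[0.8, 0.0], [0.4, 0.1], [0.0, 1.0]]
S = [2.0, 1.0]
table = similarity_table(D, S)
```

Entry `table[i][j]` is the unnormalized similarity between documents i and j; dividing by the lengths of the two scaled rows would turn it into a cosine.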
  • any document d i of the c number of documents in the document collection may be compared to any other document d j in the document collection.
  • a subset, S 1 comprising s1 number of documents of the c number of documents in the document collection may first be sorted using the inverted index and then the documents in subset S 1 compared using pseudo-document vectors as described above.
  • the inverted index is searched using Boolean logic, or any conventional search technique, to reduce the quantity of the pseudo-document vectors comprising the subset S 1 to be compared, from the c number originally in the document collection to s1 more relevant documents in subset S 1 .
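The two-stage search can be illustrated with a sketch that first narrows the collection with a Boolean AND query over the inverted index and then keeps only the pseudo-document vectors of the surviving subset. The index contents, vectors and the helper name `boolean_and` are hypothetical, and AND is just one of the Boolean operators that could be used.

```python
def boolean_and(inverted_index, terms):
    """Return doc-IDs containing every query term (an AND query)."""
    postings = [inverted_index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)

inverted_index = {
    "search": {1, 2, 4},
    "engine": {1, 4, 5},
    "patent": {2, 3, 4},
}
pseudo_vectors = {
    1: [0.8, 0.0], 2: [0.4, 0.1], 3: [0.0, 1.0], 4: [0.7, 0.2], 5: [0.1, 0.9],
}

# Reduce the c = 5 documents to the subset that matches the Boolean query.
subset_S1 = boolean_and(inverted_index, ["search", "engine"])
vectors_to_compare = {doc_id: pseudo_vectors[doc_id] for doc_id in sorted(subset_S1)}
```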
  • FIG. 9 is a flowchart depicting a process for creating a query pseudo-document vector in accordance with an exemplary embodiment of the present invention. Recall that the purpose for placing the query pseudo-document in the representative semantic space with other pseudo-documents is to find the most similar document as represented within that semantic space.
  • the query pseudo-document vector must be constructed using the same parametric attributes as those used for constructing the semantic space itself (i.e., the similarity criteria of the original corpus of documents). Thus, it would be unlikely to expect accurate search results from a query document which was outside the similarity criteria of the original corpus of documents.
  • the first constraint similar to building the document database is that the query document pertains to a similar subject matter as both the original corpus of documents and the newly-created document database.
  • the process begins by identifying a representative query document or alternatively, a set of query terms (step 902 ). Again, this document or these terms must pertain to the original subject matter of the original corpus of documents. Finally, the query is represented as a pseudo-document vector D q similar to that described above with respect to the documents in the collection of documents (step 904 ). Notice that when creating the pseudo-document vector, there is no need to create an inverted index for the query because the inverted index created in step 804 of FIG. 8 is used solely for reducing the number of documents to be presented in the semantic space. However, if the query document is to become a part of the document database for additional document comparisons with other query documents, then it might be prudent to create the inverted index for the query document.
  • the query document need be represented merely as D q = X^T T S^-1 .
  • X is a column vector formed from the term vector of the terms which co-occur in the weighted term dictionary G and in the query document d q , and X^T is its transpose.
  • co-occurring terms' local weights from the query document d q are scaled by the global weights G i taken from the term dictionary G.
  • Matrices T and S are taken directly from the concept database and represent the term vector matrix T and the singular value matrix S.
  • a query is composed of two distinct parts: a Boolean query and matching pseudo-document vectors.
  • the Boolean query allows the combination of field value constraints, including the existence of certain key terms as well as precise field values and ranges and acts as a filter for selecting a subset S of documents from the document database which match the query condition.
  • documents d 1 through d s in the subset S are compared to each other by matching corresponding pseudo-document vectors D 1 through D s , thereby ranking documents d 1 through d s in subset S based on the similarity of pseudo-document vectors projected in the representative semantic space.
  • Document comparisons are made by taking the inner product of pseudo-document vectors.
  • a resulting set of values for matching any of pseudo-document vectors D 1 through D s to all other pseudo-document vectors D 1 through D s represents a document comparison based on the similarity criteria used for creating the concept database.
  • the resulting set of values is sorted from largest to smallest. Larger inner products correspond to a greater semantic similarity between documents.
  • the resulting set of matching documents is the set of documents in subset S, ranked by how closely they match each other.
  • a query pseudo-document vector D q is derived from a search or query document d q which typifies similarity criteria for a comparison or search document.
  • Query document d q represents particular parametric attributes for the similarity criteria used to create the concept database. Once created, query document d q and query pseudo-document vector D q are included in the document database as any other document in the collection. The aspects of the present invention discussed above are further described below with respect to the flowchart on FIG. 10 .
  • FIG. 10 is a flowchart depicting an optimized process for finding the similarity between documents in a document database in accordance with an exemplary embodiment of the present invention.
  • the process depicted in FIG. 10 is performed subsequent to creating a document database as represented by step 504 in FIG. 5 (and further described in FIG. 8 ).
  • One intent of the presently described process is to further optimize the representative semantic analysis search process by reducing the size of the document database to be used for document ranking. However, reducing the number of documents to be compared in the document database is optional and may not be performed. It should be understood that if documents are removed from the document database prior to matching the pseudo-document vectors, the advantage of using the present representative semantic analysis is not realized for the removed documents.
  • the process begins by finding a subset S 1 (subset S 1 comprising document d 1 through document d s1 and having exactly s1 number of documents) from an inverted index which maps each term in the collection of documents to the documents in the collection using a Boolean query (step 1002 ).
  • Steps 1002 and 1004 represent optimizations of the representative semantic analysis searching process and are represented as dashed process boxes. In doing so, documents in the document database can be either included or excluded from further consideration by using selected terms in the Boolean query.
  • a second subset, subset S 2 is obtained from the original document database, or alternatively from documents d 1 to d s1 in subset S 1 , by using a field value query.
  • Subset S 2 comprises document d 1 through document d s2 and has exactly s2 number of documents.
  • field values associated with documents in the collection are used for including or eliminating a document in the document database.
  • one field may be presented for the date of the document, for instance, 1999.
  • the user may limit the search of documents to a period which the user feels is relevant.
  • a field query value may be set greater than 1998, or “>1998.” In that case, documents having field year values of “1999,” “2000” or “2001” would be included for further consideration. Thus, a document with a date field value of “1999” would be included in the document database for further representative semantic analysis.
  • Query pseudo-document vector D q contains an additional database with a set of pseudo-document vectors, each of which represents a unique document in the document database.
  • X is a term vector representing occurrences of terms in the weighted term dictionary G for the document d j .
  • the process then continues by matching the query document to a subset of documents in the document database by comparing the pseudo-document vector D q with pseudo-document vectors D 1 through D s in the subset S (step 1008 ).
  • S is a subset of documents in the document database which have been identified for further consideration by one of a Boolean search query or a field value query.
  • comparing documents involves finding the dot product between any two pseudo-document vectors, e.g., pseudo-document vector D i and pseudo-document vector D j , that define the subject document d i and document d j , respectively.
  • this product results in the cosine of the pseudo-document vectors being compared, i.e., pseudo-document vector D i and pseudo-document vector D j .
  • As the product approaches “1.0,” subject documents d i and d j are more similar.
  • the pseudo-document vector (any of pseudo-document vector D 1 through pseudo-document vector D s ) in which the resultant dot product with pseudo-document vector D q is closest to a value of “1.0” corresponds to the most similar document d j to query document d q , based on the similarity criteria used for defining the concept database.
  • documents d 1 through d s are ranked by document similarity by listing the cosine values (dot products) between the pseudo-document vector D j for document d j , and all other pseudo-document vectors associated with documents to be compared, such as pseudo-document vectors D 1 to D s for documents d 1 to d s in a collection of documents.
  • Ranking can be performed as an iterative process in which each document in the collection is compared to every other document in the document collection by matching the documents' corresponding pseudo-document vectors. The process then ends.
  • Boolean query could take the form of: “body contains ‘oop’” and “salary>‘60000’”, returning a subset of documents matching these criteria.
  • a ranked list of results is produced by computing the cosine distance between the pseudo-document for the query text and the stored pseudo-documents in the document database.
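The ranking step described in the excerpts above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `cosine` and `rank_subset`, the document identifiers, and the two-dimensional vectors are all hypothetical. The description treats the dot product of pseudo-document vectors as a cosine, which presumes unit-length vectors; the sketch normalizes explicitly so that it also works for vectors of arbitrary length.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_subset(query_vector, vectors, subset):
    """Score every document id in `subset` against the query, largest cosine first."""
    scored = [(doc_id, cosine(query_vector, vectors[doc_id])) for doc_id in subset]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical pseudo-document vectors in a k=2 semantic space.
vectors = {"d1": [1.0, 0.0], "d2": [1.0, 1.0], "d3": [0.0, 1.0]}
D_q = [2.0, 0.0]  # hypothetical query pseudo-document vector

ranking = rank_subset(D_q, vectors, ["d1", "d2", "d3"])
print([doc_id for doc_id, _ in ranking])  # ['d1', 'd2', 'd3']
```

Restricting `subset` to the documents that survive a Boolean or field value query reproduces the optional pre-filtering described above; passing every document id ranks the whole collection.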

Abstract

A term-by-document matrix is compiled from a corpus of documents representative of a particular subject matter that represents the frequency of occurrence of each term per document. A weighted term dictionary is created using a global weighting algorithm and then applied to the term-by-document matrix forming a weighted term-by-document matrix. A term vector matrix and a singular value concept matrix are computed by singular value decomposition of the weighted term-document index. The k largest singular concept values are kept and all others are set to zero thereby reducing the concept dimensions in the term vector matrix and a singular value concept matrix. The reduced term vector matrix, reduced singular value concept matrix and weighted term-document dictionary can be used to project pseudo-document vectors representing documents not appearing in the original document corpus in a representative semantic space. The similarities of those documents can be ascertained from the position of their respective pseudo-document vectors in the representative semantic space.

Description

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to data processing and more specifically to document searching. The present invention more particularly relates to using a representational semantic space for comparing the similarity of document representations.
2. Description of Related Art
Information is being created and made available to the public at a greater rate than ever before, due for the most part, to advances in information technology and electronic communications systems. There is so much information that it is simply not possible for an individual person to read it all, much less remember its content, context or the semantic concepts that are associated with it. Still, many times an enterprise relies on certain information items reaching certain enterprise members. Even before the recent advances in information technology and electronic communications systems, it was understood that there was a need for efficient and accurate information retrieval. The result was myriad information retrieval processes and strategies that worked relatively well under a narrowly-defined set of searching conditions.
Probably the simplest of all information retrieval techniques is word or term searching. Term searching involves a user querying a corpus of documents containing information for a specific term or word. A resultant solution set of documents is then identified as containing the search term. Single term searching is extremely fast and efficient because little computational effort is involved in the query process, but it can result in a relatively large solution set of unranked documents being returned. Many of the documents may not be relevant to the user because, although the term occurs in the solution document, it is out of context with the user's intended meaning. This often prompts the user to perform a secondary search for a more relevant solution set of documents. Even if all of the resultant documents use the search term in the proper context, the user has no way of judging which documents in the solution set are more relevant than others. Additionally, a user must have some knowledge of the subject matter of the search topic for the search to be efficient (e.g., a less relevant word in a query could produce a solution set of less relevant documents and cause more relevant documents in the solution set to be ranked lower than less relevant documents).
Another pitfall of term searching is that traditional term searching overlooks many potential relevant documents because of case and inflectional prefixes and suffixes added to a root word as that word is used in a document, even though those words are used in context with the user's intent (e.g., “walking,” “walks,” “walker” have in common the root “walk” and the affixes -ing, -s, and -er). Searching for additional affixes of a root or base word is called “word stemming.” Often word stemming is incorporated in a search tool as an automated function but some others require the user to identify the manner in which a query term should be stemmed, usually by manually inserting term “truncaters” or “wildcards” into the search query. Word stemming merely increases the possibility that a document be listed as relevant, thereby further increasing the size of the solution set without addressing the other shortcomings of term searching.
A logical improvement to single term searching is multiple, simultaneous term searching using “Boolean term operators,” or simply, Boolean operators. A Boolean retrieval query passes through a corpus of documents by linking search terms together with Boolean operators such as AND, OR and NOT. The solution set of documents is smaller than that of single term searches and all returned documents are normally ranked equally with respect to relevance. The Boolean method of term searching might be the most widely used information retrieval process and is often used in search engines available on the Internet because it is fast, uncomplicated and easy to implement in a remote online environment. However, the Boolean search method carries with it many of the shortcomings of term searching. The user has to have some knowledge of the search topic for the search to be efficient in order to avoid relevant documents being ranked as non-relevant and vice versa. Furthermore, since the returned documents are not ranked, the user may be tempted to reduce the size of the solution set by including more Boolean linked search terms. However, increasing the number of search terms in the query narrows the scope of the search and thereby increases the risk that a relevant document is missed. Still again, all documents in the solution set are ranked equally.
Other information retrieval processes were devised that extended and refined the Boolean term searching method in an attempt to rank the resultant documents, although not necessarily to reduce the size of the solution set. One such effort is term weighting the query terms and/or term weighting the occurrence of terms in the solution set of documents by frequency. Expanded term weighting operations make ranking of documents possible by assigning weights to search terms used in the query. Documents that are returned with higher ranking search terms are themselves ranked higher in relevance. However, a more useful variation of term weighting is by occurrence frequency in the resultant document database, thereby allowing the documents in the solution set to be ranked. A higher term occurrence frequency in a resultant document is indicative of relevance. Boolean searching variants that allow for term frequency based document ranking have a number of drawbacks, the most obvious of which is that longer documents have a higher probability of being ranked higher in the solution set without a corresponding increase in relevance. Additionally, because the occurrence of the term is not context related, higher ranking documents are not necessarily more relevant to the user. Moreover, if the user does not have an understanding of the subject matter being searched, the combination of Boolean logic and term weighting may exclude the most relevant documents from the solution and simultaneously under-rank the most relevant documents to the subject matter. Additionally, certain techniques are language dependent, for instance, word stemming and thesauri.
Information retrieval methods have been devised that combine Boolean logic with other techniques such as content-based navigation, where shared terms from previously attained documents are used to refine and expand the query. Additionally, Boolean operators have been replaced with fuzzy operators that recognize more than simple true and false values and weighted query expansion has been accomplished using a thesaurus. A thesaurus or a dictionary is a common way to expand queries and can be used to broaden the meaning of a term, to narrow it down, or simply to find related terms. The main problem with using a thesaurus is that terms have different meanings, depending upon the subject. Thesauri are therefore often used in databases within a special field, such as pharmacological and biomedical databases, where they can be constructed manually. While each of the above described improvements provides some benefit to the user, the solution set of documents does not optimally convey the right information to a user if the user does not have an understanding of the subject matter being searched.
BRIEF SUMMARY OF THE INVENTION
A term-by-document matrix is compiled from a corpus of documents representative of a particular subject matter that represents the frequency of occurrence of each term per document. A weighted term dictionary is created using a global weighting algorithm and then applied to the term-by-document matrix forming a weighted term-by-document matrix. A term vector matrix and a singular value concept matrix are computed by singular value decomposition of the weighted term-document index. The k largest singular concept values are kept while all others are set to zero thereby reducing the concept dimensions in the term vector matrix and a singular value concept matrix. The reduced term vector matrix, reduced singular value concept matrix and weighted term-document dictionary can be used to project pseudo-document vectors representing documents not appearing in the original information corpus, in a representative semantic space. The similarities of those documents can be ascertained from the position of their respective pseudo-document vectors in the representative semantic space.
In accordance with one exemplary embodiment, a collection of documents is identified as having a similar subject matter as the original information corpus. Pseudo-document vectors for documents in the collection are computed by the product of the reduced term vector matrix, the inverse of the singular value concept matrix and a transposed column term vector consisting of the global term weight of each term co-occurring in the weighted term dictionary and the document. The similarity of two documents can be found by the dot product of the documents' pseudo-document vectors.
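The projection described in this embodiment can be illustrated with a short sketch, assuming the reduced term vector matrix T and the k retained singular values are already available. Everything concrete here is hypothetical: the three-term dictionary, the global weights, the 3×2 matrix `T`, and the values in `singular_values` are made up for illustration and are not the output of any real decomposition. The term vector is built as local count times global weight, restricted to terms that occur in both the dictionary and the document.

```python
def weighted_term_vector(local_counts, global_weights, dictionary):
    """Weight each dictionary term by its count in the document times its global weight.

    Terms of the document that are absent from the dictionary are ignored.
    """
    return [local_counts.get(term, 0) * global_weights[term] for term in dictionary]

def pseudo_document_vector(x, T, singular_values):
    """Project term vector x (length t) into the k-dimensional space: D = x^T T S^-1.

    T is a t-by-k list of rows; singular_values holds the diagonal of S.
    """
    k = len(singular_values)
    return [
        sum(x[i] * T[i][c] for i in range(len(x))) / singular_values[c]
        for c in range(k)
    ]

# Hypothetical concept database for a three-term dictionary and k = 2 concepts.
dictionary = ["index", "matrix", "query"]
global_weights = {"index": 1.0, "matrix": 0.5, "query": 2.0}
T = [[1, 0], [0, 1], [0, 0]]      # made-up term vectors with orthonormal columns
singular_values = [2.0, 1.0]      # made-up diagonal of S

# Term counts of a new document; "unknown" is not in the dictionary and is dropped.
local_counts = {"index": 2, "matrix": 4, "unknown": 7}
x = weighted_term_vector(local_counts, global_weights, dictionary)
D_j = pseudo_document_vector(x, T, singular_values)
print(x)    # [2.0, 2.0, 0.0]
print(D_j)  # [1.0, 2.0]
```

Because S is diagonal, multiplying by its inverse reduces to dividing each concept coordinate by the corresponding singular value, which is why no general matrix inversion appears in the sketch.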
In accordance with another exemplary embodiment, a pseudo-query document vector is computed for each document in the collection by finding the product of the reduced term vector matrix, the inverse of the singular value concept matrix and a transposed column term vector consisting of the global term weight of each term co-occurring in the weighted term dictionary and the terms in each document. The similarity of the query to any document in the collection can be found by the dot product of the documents' pseudo-document vectors.
In accordance with still another exemplary embodiment, a similarity index for all documents in the collection can be created by compiling all pseudo-document vectors representing documents in the collection into a reduced term-by-pseudo-document matrix and then finding the product of the reduced term-by-pseudo-document matrix and its transpose.
In accordance with still another exemplary embodiment, documents not relevant for a comparison with a query document are eliminated from a comparison by first creating an inverted index, which maps each term in the collection of documents to the documents in the collection in which that term appears. A term search is accomplished using the Boolean query to search the inverted index to obtain a filtered subset S of matching documents. The similarity index for the subset S of matching documents can be created by compiling all pseudo-document vectors representing documents in the filtered subset S and the pseudo-query document vector into a reduced term-by-pseudo-document matrix and then finding the product of the reduced term-by-pseudo-document matrix and its transpose.
In accordance with still another exemplary embodiment, documents not relevant for a comparison can be first eliminated from the comparison by creating a field index, which maps the value of a document field to the corresponding documents in the collection containing the field with that value. A field search is accomplished using the field query to search the field index to obtain a filtered subset S of matching documents. The similarity index for the subset S of matching documents can be created by compiling all pseudo-document vectors representing documents in the filtered subset S into a reduced term-by-pseudo-document matrix and then finding the product of the reduced term-by-pseudo-document matrix and its transpose.
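An inverted index with a Boolean filter of the kind this embodiment describes might look like the sketch below. It assumes whitespace tokenization and lowercase matching, neither of which is specified in the text, and the names `build_inverted_index` and `boolean_filter` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of ids of the documents in which it appears."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_filter(index, all_ids, must=(), any_of=(), must_not=()):
    """Return ids matching every `must` term (AND), at least one `any_of` term (OR,
    when given), and none of the `must_not` terms (NOT)."""
    result = set(all_ids)
    for term in must:
        result &= index.get(term, set())
    if any_of:
        result &= set().union(*(index.get(term, set()) for term in any_of))
    for term in must_not:
        result -= index.get(term, set())
    return result

documents = {
    1: "semantic search of documents",
    2: "boolean search engine",
    3: "semantic space",
}
index = build_inverted_index(documents)
subset_S = boolean_filter(index, documents, must=["search"], must_not=["boolean"])
print(subset_S)  # {1}
```

Only the pseudo-document vectors of the ids in `subset_S` would then need to be compared, which is the reduction in work this embodiment aims for.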
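A field index and field query can be sketched in the same spirit. The record layout, the `year` field, the operator table `OPS`, and the function names are assumptions made for illustration; the text states only that a field index maps a field value to the documents having that value.

```python
import operator
from collections import defaultdict

# Comparison operators a field query may use (an illustrative set).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}

def build_field_index(records, field):
    """Map each value of `field` to the set of document ids carrying that value."""
    index = defaultdict(set)
    for doc_id, fields in records.items():
        if field in fields:
            index[fields[field]].add(doc_id)
    return index

def field_query(index, op, value):
    """Return ids of all documents whose indexed field value satisfies `op value`."""
    compare = OPS[op]
    matches = set()
    for field_value, doc_ids in index.items():
        if compare(field_value, value):
            matches |= doc_ids
    return matches

records = {
    "d1": {"year": 1997},
    "d2": {"year": 1999},
    "d3": {"year": 2001},
    "d4": {},  # no year field, so it never matches a year query
}
year_index = build_field_index(records, "year")
subset_S = field_query(year_index, ">", 1998)
print(sorted(subset_S))  # ['d2', 'd3']
```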
BRIEF DESCRIPTION OF THE DRAWINGS
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as an exemplary mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
FIG. 1 is a flowchart representing a high-level process for using LSA for comparing the similarity of data objects in accordance with the prior art;
FIG. 2 is a flowchart depicting a prior art process for determining the similarity of documents and terms in a semantics base using latent semantic indexing;
FIG. 3 is a diagram representing a typical singular value decomposition of matrix A;
FIG. 4 is a diagram representing a reduced singular value decomposition of matrix A;
FIG. 5 is a flowchart depicting a high-level process for representative semantic analysis in accordance with an exemplary embodiment of the present invention;
FIG. 6 is a flowchart representing a process for creating a concept database in accordance with an exemplary embodiment of the present invention;
FIG. 7A is a logical diagram illustrating the compilation of a term-by-document matrix and a weighted term dictionary in accordance with an exemplary embodiment of the present invention;
FIG. 7B is a logical diagram depicting the steps necessary for converting the term-by-document matrix and weighted term dictionary G into the components of the concept database necessary for performing representative semantic analysis in accordance with an exemplary embodiment of the present invention;
FIG. 8 is a flowchart depicting a process for creating a document database in accordance with an exemplary embodiment of the present invention;
FIG. 9 is a flowchart depicting a process for creating a query pseudo-document vector in accordance with an exemplary embodiment of the present invention; and
FIG. 10 is a flowchart depicting an optimized process for finding the similarity between documents in a document database in accordance with an exemplary embodiment of the present invention.
Other features of the present invention will be apparent from the accompanying drawings and from the detailed description which follows.
DETAILED DESCRIPTION OF THE INVENTION
As discussed above, term searching using Boolean logic does not always offer optimal retrieval results because traditional Boolean information retrieval techniques require the user to understand the subject matter and comprehend how the terms are used in the corpus of information to be searched. Search strategies must be adjusted to reflect contextual usage of the terms in the corpus of information.
Human understanding of terms might be based on both the synonymic and polysemic nature of term meanings. Synonymy refers to the notion that multiple terms may accurately describe a single object or concept, while polysemy refers to the notion that most terms have multiple meanings. Therefore, for a user to completely understand the subject matter of a corpus of information, the user must be aware of each and all synonymic uses of a term, while simultaneously recognizing when the occurrence of a term falls outside the context of the subject matter.
As alluded to above, using term expansion and thesauri techniques reduces the likelihood that a relevant document will be overlooked, expands queries and broadens the meaning of the terms. However, the use of term expansion and thesauri techniques may result in a large volume of documents being returned due to the polysemic nature of the query terms. Additionally, these differing meanings may ultimately depend on the subject matter, thus the user's familiarity with the subject matter remains necessary. These techniques help improve recall, but they also retrieve documents irrelevant to the original query, degrading precision.
The ambiguity inherent in natural language, polysemy, is generally seen as the outgrowth of adapting terms from one context to another, sometimes with respect to unrelated environments. The prior art does not suggest an adequate remedy for the search precision being degraded by terms having multiple meanings. The solutions fall in two categories, controlling language in the corpus and structuring the corpus to account for term meanings. Controlling term usage in an information corpus is almost out of the question because of the burden placed on authors, not to mention the exclusion of very important legacy information. Additionally, certain terms may not adequately describe the subject matter of a document. However, limited success is possible via tightly controlled abstract and summary drafting, which are appended to documents in the corpus and often searched separately. The problem with structured information is that the hierarchy may have synonymous branches causing authors and users to index and search for documents in different branches.
Latent Semantic Analysis (LSA) provides a means for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. Information content is defined by more than just its mere terms but is also characterized by the context in which a specific term is given. There is an underlying latent semantic structure in word usage that is partially hidden or obscured by the variability of word choice. An important concept is that the similarity of meaning of terms to each other is based on the context in which that term appears and the context in which that term does not appear. LSA determines similarity of meaning of terms by context in which the terms are used and represents the terms and passages by statistical analysis of a large information corpus.
LSA represents the terms taken from an original corpus of information or any subset of terms contained in the original corpus, as individual points in a high dimensional semantic space. A semantic space is a mathematical representation of a large body of terms, and every term, combination of terms or sub-combinations of a term can be represented as a unique high-dimensional vector at a point in semantic space. The similarity of two terms can be understood by comparing the terms' vectors. Quantitative similarity values are defined by the cosine of the angle between the term vectors representing the terms in the semantic space. More importantly, this comparison is valid only in the defined semantic space for the corpus of information.
LSA utilizes concepts rather than terms contained in the documents for analyzing the latent semantic structure of documents. Thus, LSA allows for information retrieval on the basis of concepts contained in the document with respect to other documents in the corpus rather than terms in the documents of the corpus.
Latent Semantic Indexing (LSI) is the application of LSA principles in a statistical information retrieval methodology. The application of LSI is advantageous over traditional search techniques, such as Boolean, because statistically representative information concepts based on the latent structure of the information relieves the user from being an expert on the mass of information in the corpus. Specifically, the user need not know or understand the concepts used by the original authors of the information in the corpus to accurately retrieve similar information from the corpus of information.
Artisans have understood that, even though intelligible information contains a bewildering variety of terms, associative possibilities, and multiple meanings per term and term use, certain inherent structures common to all languages can be identified. LSI focuses upon that inherent structure of language, specifically the relationship between documents and terms, to register recurring patterns of semantic structure. Co-occurrences of terms are given higher weights than singular occurrences because the object of semantic indexing is to recognize which terms reappear most often in which parts of a document's structure. It has been demonstrated that there are a few constantly recurring structural patterns within sentences and these patterns occur at the level in which terms and their articles interrelate. The recurrence of these patterns is “latent” because they are inherent in grammatical structure and “semantic” because they define linguistic contextual meaning. In other words, through the pattern of co-occurrences of words, LSI is able to infer the structure of relationships between articles and words that define context and meaning by the relationship between structure and concept.
FIG. 1 is a flow chart representing a high-level process for using LSI for comparing the similarity of data objects in accordance with the prior art. The process begins by generating a term-by-data object matrix (A) that represents the frequency of term occurrence in the data objects from an information corpus (step 102).
Each term in the information corpus is represented by a row in matrix A, and similarly, every data object is represented by a column in matrix A. Matrix A has a rank of m, where m ≤ min(number of terms, number of data objects). An individual entry in matrix A, say a_ij, represents the frequency of the term i in data object j. Thus, the occurrence frequency of a selected term in a data object can be found in term-by-data object matrix A at the intersection of a data object column j and term row i.
Once term-by-document matrix A has been constructed, it is decomposed using a singular value decomposition (SVD) into two orthogonal matrices T0 and D0^T containing term and data object vectors, respectively, and a diagonal matrix S0 containing the “singular concept values” (step 104).
The SVD of a real h×l matrix A is a factorization which, in the general case, is represented by the equation
A = U S V^T  (1)
where the matrices in this factorization have the following properties: U [h×m] and V [l×m] are orthogonal matrices. The columns u_i of U=[u_1, . . . , u_m] are the left singular vectors, and the columns v_i of V=[v_1, . . . , v_m] are the right singular vectors. Additionally, S [m×m]=diag(σ_1, . . . , σ_m) is a real, non-negative, diagonal matrix. Its diagonal contains the so-called singular values σ_i, where σ_1 ≥ . . . ≥ σ_m ≥ 0.
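As a concrete illustration of the row and column layout just described, the following sketch compiles a small term-by-data object matrix of raw occurrence counts. The tokenization (lowercasing and splitting on whitespace), the function name, and the three sample documents are assumptions chosen for brevity.

```python
from collections import Counter

def term_by_document_matrix(documents):
    """Return (terms, matrix), where matrix[i][j] counts terms[i] in documents[j]."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

docs = ["the cat sat", "the dog sat on the cat", "dogs bark"]
terms, A = term_by_document_matrix(docs)

print(terms)                    # ['bark', 'cat', 'dog', 'dogs', 'on', 'sat', 'the']
print(A[terms.index("the")])    # [1, 2, 0]
```

Each inner list is one term row and each position within it is one document column, so `A[i][j]` is the entry at the intersection of term row i and data object column j.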
Turning now to FIG. 3, a diagram representing a typical singular value decomposition of matrix A is depicted. Notice here that matrix A is a t×d dimensional matrix composed of document columns and term rows and is decomposed into the three smaller matrices identified above, T0, S0 and D0 T. The terms matrix T0, sometimes referred to as the left vector matrix, has dimensions of t×m, while the documents matrix, D0 T, sometimes referred to as the right vector matrix, has dimensions of m×d. The singular value matrix, S0, on the other hand, has the dimensions of m×m.
The singular value decomposition is useful for the numerical determination of the rank of a matrix. In exact arithmetic, a matrix A has rank m if A has exactly m non-zero singular values. Matrix notation used herein refers to matrix Y^T or Y′ as the transpose of matrix Y; for a matrix Y with orthonormal columns, Y^T Y = I, where I is an identity matrix. Term vector matrix T0 represents term-by-concept relationships inherent in the information corpus, while document vector matrix D0 represents concept-by-data object relationships in the original information corpus. While multiplying T0 times S0 times D0^T yields the original matrix A (T0 S0 D0^T = A), the number of dimensions derived from the SVD is usually reduced from m to k, where k < m and k typically ranges from 50 to 350. This dimensional reduction is performed because the concepts associated with the smallest singular values represent weak relationships that are not worth keeping. The reduced-dimension subspace is normally derived by selecting the sub-matrices that correspond to the largest singular values and setting all other values to zero. This concept is diagrammatically illustrated with respect to FIG. 3 and FIG. 4.
Thus, the dimensional size of matrix S0 is reduced from m×m to k×k. This dimensional representation can be better viewed with respect to FIG. 4. Returning to the diagram of FIG. 3, the reduction of concepts is performed by keeping only the k largest values in matrix S0, where m is greater than k, and setting all other values in matrix S0 to 0. In doing so, matrices T0, S0 and D0^T are reduced to forms T, S, and D^T, having dimensions of t×k, k×k and k×d, respectively, as can be seen in FIG. 4.
Once the concept reduction has been accomplished, multiplying T times S times D^T yields a reduced term-by-data object matrix Â (T S D^T = Â). Matrix Â is slightly different from matrix A, but for the purposes of LSI, Â ≅ A. Reduced term-by-data object matrix Â is a mathematical representation of the semantic space for the original information corpus. So, the product of the orthogonal sub-matrices, T and D, and the largest singular values S forms a semantic subspace, reduced term-by-data object matrix Â, which approximates the original full term-by-data object space.
Next, a pseudo-data object is generated from the selected terms used to construct matrix Â and is then inserted in matrix Â (step 106). The object here is to take a completely novel query, find a representative point for it in the semantic space, and then look at its cosine with respect to other terms or data objects in the semantic space. The new pseudo-data object must be in a form that is very much like the data objects of the matrices A and Â, in that a pseudo-data object must appear as a vector of terms. Thus, in order to compare a query or pseudo-data object, q, to other data objects, a term vector Aq must be used to derive a pseudo-data object vector Dq, which is then used like a row of D for comparing data objects. The derivation is based on the fact that a real document Ai should yield document vector Di, where A=Â. It follows then that since:
$$A_q = T S D_q^{T} \qquad (2)$$
    • because:
      $$T^{T} T = I$$
    • such that:
      $$D_q^{T} = S^{-1} T^{T} A_q \qquad (3)$$
    • or
      $$D_q = A_q^{T} T S^{-1} \qquad (4)$$
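The projection of a query into the semantic space can be sketched directly from the last relation above, in which the pseudo-data object vector is the query term vector multiplied by T and by the inverse of S. The following is a minimal pure-Python illustration; the function name `fold_in` and the small matrices are hypothetical values chosen so that the columns of T are orthonormal, and they are not taken from the source.

```python
def fold_in(a_q, T, s):
    """Project a term-frequency vector a_q (length t) into the k-dimensional
    semantic space, computing D_q = a_q^T T S^-1.

    T is a t x k matrix given as a list of rows; s holds the k singular
    values on the diagonal of S (all assumed non-zero)."""
    k = len(s)
    return [
        sum(a_q[i] * T[i][c] for i in range(len(a_q))) / s[c]
        for c in range(k)
    ]

# Hypothetical reduced term matrix (t = 3 terms, k = 2 concepts)
# with orthonormal columns, and hypothetical singular values.
T = [[0.6, 0.0],
     [0.8, 0.0],
     [0.0, 1.0]]
s = [2.0, 1.0]

a_q = [1, 1, 2]          # term frequencies of the query
d_q = fold_in(a_q, T, s)
print(d_q)
```

Because S is diagonal, multiplying by its inverse reduces to dividing each concept coordinate by the corresponding singular value, which is why `s` is passed as a plain list rather than a full matrix.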
Finally, the similarity between query pseudo-data object vector Dq and the other terms and data objects in the semantic space can be found by using a similarity metric (step 108). For that metric, the prior art uses the cosine between query pseudo-data object vector Dq and the document vectors in reduced term-by-data object matrix Â that represent data objects in the original information corpus. A cosine value of “1.0” would indicate that query pseudo-data object vector Dq was superimposed on another document vector in the semantic space.
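The cosine metric mentioned above can be sketched as follows. This is a generic illustration of cosine similarity between two vectors of equal length; the function name `cosine` and the choice to return 0.0 for a zero-length vector are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Two vectors pointing in the same direction give a value at or very near 1.0 regardless of their lengths, which corresponds to the "superimposed" case described above.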
With regard to FIG. 2, a flowchart illustrating a prior art process for determining the similarity of documents and terms in a semantics base using latent semantic indexing is depicted. The process depicted in FIG. 2 is a lower-level representation of the latent semantic indexing process depicted in FIG. 1. The process begins with the preparation of a term frequency document matrix A for an information corpus (step 202). As discussed before, the term frequency document matrix A can be a tabular listing of document columns and term rows wherein the occurrence frequency of each term for a document is listed at the intersection of the respective document column and term row. Essentially, a term frequency matrix can be thought of as the set of document vectors for documents contained in the information corpus. A document can be represented as a vector of terms (or keywords) that are meaningful words (i.e., words that convey content). Given a document, a compact list of terms in the document may be formulated by first removing stop words or non-content words such as “are, in, is, of, the, etc.” Then, terms that have identical underlying roots, such as “troubling, troublesome, troublemaker, trouble, etc.,” should be grouped under the single word “trouble.” Synonyms, such as “sole, only, unique,” should also be grouped, and concept phrases, such as “Information Highway,” “Information Technology” and “Data Mining,” should be grouped as single terms. Given the reduced set of terms, the number of times (frequency) each term occurs in a document is tabulated for the document.
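The term reduction described above can be sketched with simple lookup tables. This is a simplified illustration only: the tables `STOP_WORDS`, `ROOTS`, `SYNONYMS` and `PHRASES` are hypothetical, hand-written stand-ins for a real stop list, stemmer, thesaurus and phrase list, punctuation handling is omitted, and phrases are merged before stop words are removed so that two-word phrases remain adjacent.

```python
from collections import Counter

# Hypothetical lookup tables built from the examples in the text.
STOP_WORDS = {"are", "in", "is", "of", "the"}
ROOTS = {"troubling": "trouble", "troublesome": "trouble", "troublemaker": "trouble"}
SYNONYMS = {"only": "sole", "unique": "sole"}
PHRASES = {("data", "mining"): "data_mining",
           ("information", "technology"): "information_technology"}

def term_frequencies(text):
    """Return a Counter of reduced terms for one document."""
    words = text.lower().split()
    # Merge two-word concept phrases into single terms.
    merged, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    # Drop stop words, then map words to their root and synonym group.
    terms = []
    for word in merged:
        if word in STOP_WORDS:
            continue
        word = ROOTS.get(word, word)
        word = SYNONYMS.get(word, word)
        terms.append(word)
    return Counter(terms)

freq = term_frequencies("The troubling troublemaker is the only trouble in data mining")
print(freq)
```

The resulting counts form one document column of the term frequency document matrix A.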
Again, term frequency document matrix A has a rank of m (the value m being less than or equal to the smaller of the total number of documents or terms in the term-by-document matrix A). The process proceeds by preprocessing the term frequency document matrix A using a singular value decomposition (SVD) algorithm (step 204). By applying the SVD to matrix A, the three sub-matrices produced are T0, S0, and D0^T. As previously discussed, the T0 matrix (having the dimensions t×m) describes the term-by-concept attributes of the semantics base, the S0 matrix (having the dimensions m×m) describes the concept-by-concept attributes of the semantics base and the D0 matrix (having the dimensions d×m) describes the concept-by-document attributes of the semantics base. Diagrammatically, the SVD of matrix A is represented in FIG. 3 as described above.
Next, the k largest values in the matrix S0 are kept, thereby reducing the m×m matrix S0 to a k×k matrix S, and matrices T0 and D0 are reduced correspondingly (step 206). Thus, the concept dimension is reduced for all three matrices, forming T, S, and D^T, having dimensions of t×k, k×k and k×d, respectively, as can be seen in FIG. 4.
When matrices T, S and D^T are recombined, a new, reduced term-by-document matrix Â is created. Because only the weakest concepts have been eliminated from the singular value matrix, matrix Â approximates the previous term-by-document matrix A (step 208).
Once term-by-document matrix Â has been constructed from matrices T, S and D^T, the semantic space defined by matrix Â can be examined for a number of similarity relationships between terms or documents in the information corpus. However, in addition to merely examining similarities between existing documents in the existing term frequency document matrix A, a pseudo-query may be created and inserted into term-by-document matrix Â (step 210). If the user chooses to create a pseudo-query to make a comparison with all documents d1 to dn existing in the semantic space defined by term-by-document matrix Â, a pseudo-document vector Dq must be created for a query document dq, where Dq = Aq^T T S^-1 and Aq is a document vector created from the term frequency for each term in dq that also occurs in matrix A (step 212). In other words, what the user must actually create is a document column for query document dq in which the occurrence frequency of each term listed in the original term-by-document matrix A is given in the respective cell. When pseudo-document vector Dq has been assembled, query document dq is, in effect, inserted in the information corpus. Then, a similarity matrix can be computed for the pseudo-document vector Dq versus all other documents in term-by-document matrix Â. This is possible because Â^T Â = D S² D^T. The similarity of all documents in matrix Â can be defined, including the similarity to pseudo-document vector Dq. Still, on an individual document-by-document basis, for any two documents di and dj of the form defined by matrix Â, the similarity can be compared by finding the dot product of the two column vectors which define the documents, Di and Dj. Therefore, finding the similarity between the pseudo-document vector Dq and any other document di in matrix Â is simply a matter of finding the dot product of column vector Dq and any other column document vector listed in matrix Â, such as column document vector Di (step 214).
Returning again to step 210, if, on the other hand, the user does not intend to issue a pseudo-query, the user can compare each term or document within term-by-document matrix Â. Therefore, a check is made to determine whether a term similarity matrix is to be made for term comparisons (step 216). If a term comparison is to be made, a table is constructed exhibiting the relative similarity of each term to every other term in term-by-document matrix Â by computing Â Â^T = T S² T^T. The resultant term-by-term similarity matrix gives the relative similarities between every term and every other term in matrix Â. Again, computing the term-by-term matrix is a very exhaustive process and requires intensive processing power. Similarities between individual terms can be found by merely finding the dot product of the two row term vectors, i.e., row term vector Ti and row term vector Tj, which exist in matrix Â and correspond to the respective terms ti and tj being compared. This product is treated as a cosine between the term vectors Ti and Tj; therefore, as the value of the product approaches “1.0,” the two terms lie nearly over one another in the semantic space and are close in semantic similarity (step 218).
Returning to step 216, if the similarity between terms is not to be determined, then the user may instead use term-by-document matrix Â to determine the similarity between any two documents (i.e., document di and document dj). The process again reverts to step 214 and proceeds exactly as described in the case of the pseudo-query. That is, a document-to-document table of similarities can be created by computing Â^T Â = D S² D^T, which defines the similarity of all documents in matrix Â. Therefore, finding the similarity between any document di and any other document dj is merely a matter of locating the row for one document and the column for the other and reading the value at their intersection in the document similarity table for Â. Again, computing Â^T Â is a very intensive process that requires extraordinary processing power and is highly dependent upon the size and complexity of the original document corpus. Finding the similarity between a few documents within the semantic space defined by matrix Â may be accomplished in a manner similar to that described for two terms, that is, by finding the dot product between the two column document vectors, vector Di and vector Dj, taken from matrix Â, that define the respective documents di and dj. Again, this product is treated as the cosine between column document vector Di and column document vector Dj. Therefore, as the product approaches “1.0,” the documents are more similar.
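The term-by-term and document-by-document similarity tables used in FIG. 2 follow directly from the reduced factorization. The short derivation below is a sketch that assumes the columns of T and D are orthonormal (so that T^T T = I and D^T D = I) and that S is diagonal, as stated for the SVD above; it introduces no symbols beyond those already defined.

```latex
\begin{aligned}
\hat{A}\hat{A}^{T} &= (T S D^{T})(T S D^{T})^{T}
                    = T S\,(D^{T} D)\,S^{T} T^{T}
                    = T S^{2} T^{T}
                    && \text{(term-by-term)} \\
\hat{A}^{T}\hat{A} &= (T S D^{T})^{T}(T S D^{T})
                    = D S^{T}\,(T^{T} T)\,S D^{T}
                    = D S^{2} D^{T}
                    && \text{(document-by-document)}
\end{aligned}
```

Entry (i, j) of the first product is the dot product of rows i and j of TS, and entry (i, j) of the second is the dot product of rows i and j of DS, which is why an individual comparison needs only two vectors rather than the full table.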
As discussed above, the LSI process is extremely processing intensive and requires a new singular value decomposition to be performed each time a different document corpus is selected to be ranked, searched or documents therein otherwise compared. SVD of the term-frequency document matrix into left and right vector matrices and a singular values matrix is a bottleneck of the LSI process. SVD of the term-frequency document matrix simply cannot be performed efficiently for traditional LSI processes and negatively impacts the LSI process performance. Processing efficiency is inversely related to the size and complexity of the original document corpus; however, searching for similarities in large and complicated document sets is the major benefit of LSI. In other words, while it might be possible to reduce the size and complexity of a collection of documents, thereby realizing a corresponding reduction in complexity in computing and decomposing the term-frequency matrix, the benefit of the LSI process would be lost because other search processes may perform equally well on such collections.
LSI is most beneficial in ranking documents in a larger and more complicated collection of documents. However, once the term-by-document matrix A has been decomposed by SVD, the reduced, reconstructed term-by-document matrix Â merely presents concepts as a semantic space that is in better form for searching similarities between terms and documents.
In short, LSI is extremely processing intensive and may require frequent decompositions of a dynamic information corpus or one being searched with a variety of query documents. Therefore, in an effort to reduce the number of preprocessing steps needed for a similarity search of either terms or documents in a semantic space, the present invention utilizes representative semantic analysis for preprocessing a semantic space.
One problem with LSI is that it is based on document occurrence. Decomposing a term-by-document matrix of a large information corpus can take days or even weeks because the time complexity is quadratic in the number of documents in the corpus to be processed. The SVD algorithm is O(N²k³), where N is the number of terms plus documents, and k is the number of dimensions in the semantic space. Typically, k will be held small, ranging anywhere from 50 to 350. However, N expands rapidly with the number and complexity of documents in the information corpus (i.e., as the quantity of terms and documents increases). Thus, the SVD algorithm is extremely unfriendly for decomposing a large information corpus.
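Taking the stated bound of order N²k³ at face value, a short worked calculation shows how quickly the cost grows. The scaling factors below are illustrative assumptions, not measurements from the source.

```latex
\begin{aligned}
\frac{\mathrm{cost}(2N,\,k)}{\mathrm{cost}(N,\,k)}  &= \frac{(2N)^{2}\,k^{3}}{N^{2}\,k^{3}}  = 4, \\
\frac{\mathrm{cost}(10N,\,k)}{\mathrm{cost}(N,\,k)} &= \frac{(10N)^{2}\,k^{3}}{N^{2}\,k^{3}} = 100, \\
\frac{\mathrm{cost}(N,\,2k)}{\mathrm{cost}(N,\,k)}  &= \frac{N^{2}\,(2k)^{3}}{N^{2}\,k^{3}}  = 8.
\end{aligned}
```

Under this bound, a decomposition that takes one hour would take on the order of 100 hours, or about four days, after a tenfold growth in N with k held fixed.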
Moreover, each addition or deletion of a document in the information corpus requires that the corpus be reprocessed in order to accurately reflect the documents that may be recalled. Therefore, the SVD algorithm is unfeasible for a dynamic collection of documents that comprise an information corpus. Again, performing an SVD is much too time consuming to do on a regular basis.
With respect to the present invention, multiple collections of documents may be ranked, searched and/or otherwise compared for sets of matching documents without having to first generate and decompose a reduced term-by-document matrix Â for each entire collection of documents to be compared. Instead, documents in document collections are characterized by using a concept database. A concept database is a representative intelligence useful for the creation of a representative semantic space for comparing documents. The representative semantic space is not created from the documents in a collection to be compared, as in the prior art, but is used merely for projecting documents for making a comparison. The concept database consists of term vector and singular value matrices, which are products of the decomposition of a corpus of documents. The corpus of documents is that collection of documents selected as being representative of the similarity criteria related to document comparisons for the document collections. Thus, the concept database provides a mechanism for the creation of a representational semantic space for the placement of pseudo-document vectors that represent related similarity criteria, without having to decompose a reduced term-by-document matrix Â for the entire collection of documents. Therefore, the process of searching a collection of documents is condensed to the steps of creating a concept database, using the concept database to create a document database from the collection of documents to be searched, and searching the document database. Searching the document database may be accomplished in a manner similar to the process described above, using term or document vector matching. However, searching might be further optimized by relying on term or Boolean searching to further reduce the size of the document database prior to matching (pseudo) document vectors.
Therefore, in accordance with one exemplary embodiment of the present invention, searching the document database includes a two-part query: term searching and matching (pseudo) vectors, either document or term pseudo vectors.
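The two-part query can be sketched as a Boolean term filter followed by vector matching on the surviving candidates. This is a minimal illustration under stated assumptions: the function name `two_part_query`, the AND semantics of the term filter, the use of cosine for the matching step, and the sample data are all hypothetical and are not specified by the source.

```python
import math

def cosine(u, v):
    """Cosine between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def two_part_query(query_terms, query_vector, inverted_index, vectors):
    """Part 1: keep only documents containing every query term.
    Part 2: rank the remaining documents by cosine to the query vector."""
    candidates = set(vectors)
    for term in query_terms:
        candidates &= inverted_index.get(term, set())
    return sorted(candidates,
                  key=lambda doc_id: cosine(query_vector, vectors[doc_id]),
                  reverse=True)

# Hypothetical inverted index (term -> document ids) and pseudo-document vectors.
inverted_index = {"search": {"d1", "d2", "d3"}, "engine": {"d1", "d3"}}
vectors = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0]}

ranked = two_part_query(["search", "engine"], [1.0, 0.2], inverted_index, vectors)
print(ranked)
```

The term filter removes d2 before any vector arithmetic is done, so the comparatively expensive matching step runs only on the reduced candidate set.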
The present invention may be better understood by paying particular attention to the flowchart depicted in FIG. 5 depicting a high-level process for representative semantic analysis in accordance with an exemplary embodiment of the present invention. The process begins by creating a representative intelligence for defining a representative semantic space based on representative latent semantic similarities from a representative document corpus (step 502). The representative intelligence is referred to herein as a concept database. Initially, a representative corpus of documents is compiled based on similarity criteria and used as a sample of the intelligence of the subject documents and databases to be compared. Although not immediately apparent, a representative semantic space defined by similarity criteria for the documents making up a corpus is at least partially based on latent semantic similarities inherent in those representative documents. Thus, similar documents, databases with similar documents and even query documents that are similar to the documents in the corpus can be projected into the representative semantic space and compared based, at least partially, on their latent semantic similarities.
The concept database is constructed from a corpus of representative documents (e.g., document d1 to document dd), wherein the corpus contains exactly d documents. While the corpus might include some of the documents to be compared, optimally the concept database is created from a corpus of documents selected for reliably representing similarity criteria which form the basis for document comparisons. The actual documents to be compared may not have been identified at the time the document corpus is compiled and the concept database is created therefrom. Similarity criteria refer to the parametric attributes that define comparison metrics applicable to a representative semantic space. In other words, similarity criteria refer to the document attributes that will be used for making document comparisons. For example, similarity criteria may be as uncomplicated as specifying a document type, or alternatively may be as complicated as defining a hierarchy of attributes and parametric values used to characterize document similarity.
The concept database consists of two matrices, S and T, found by decomposing a term-by-document matrix for the document corpus, together with a dictionary of weighted terms in the corpus.
By using the contents of the concept database, any collection of documents related by the similarity criteria used to compile the original document corpus can be compared without applying SVD to that collection. Instead, comparing a collection of documents merely involves creating a document database for documents in the collection to be compared. Returning to the flowchart depicted in FIG. 5, the concept database is used to create a document database of, inter alia, pseudo-document vectors for a group of documents to be searched (step 504). The creation of the present document database for the representative semantic analysis represents a significant departure from prior art LSI processing. It should be understood that the document database, while serving a similar function as the document matrix D, is not a document matrix D because the collection of documents to be compared is never decomposed. Recall that typically the prior art LSI process decomposes a term-by-document matrix (derived from the collection of documents to be compared) into a term-by-concept matrix T, a concept-by-concept matrix S, and a concept-by-document matrix D. The document matrix D is the most important of the three matrices and is used primarily for comparing documents according to the prior art. Thus, each time a new document is added to the collection of documents, the collection must be reprocessed using, for example, SVD, and a new document matrix D created for comparisons. By contrast, the document matrix D is not used in the present representative latent semantic analysis search. Instead, only the term vector matrix T and the singular value matrix S are used to build the pseudo-document vectors, which are included in the document database, for collections of documents to be compared.
In other words, one departure from prior art LSI is that the document matrix D produced from the singular value decomposition of the corpus is never used to make document comparisons. Therefore, the database of documents to be compared need not be arranged in a term-by-document matrix for singular value decomposition, and then decomposed as is typically performed with LSI processing. This represents a significant saving in processing power and optimizes the searching process without any substantial reduction in accuracy for defining the similarity between terms and documents in the semantic space.
A second departure from the traditional LSI processing is more conceptual and relates to the definition of the semantic space used for making document comparisons. The document matrix D of the prior art LSI processing is decomposed from the actual documents to be compared; therefore, the LSI semantic space defined by the document matrix D is created from actual concepts existing in the collection of documents to be compared. By contrast, the semantic space defined by the document database is not created from the document collection, but is instead created by application of the concept database to the document collection. Therefore, the semantic space used for comparing documents in the document database is defined by the similarity criteria used to compile the document corpus rather than the documents in the collection. In other words, the semantic space used for comparing documents in the present invention is merely a representational semantic space, since the document corpus is compiled with documents representing the similarity criteria.
According to exemplary embodiments of the present invention, a document database is a dynamic collection of documents, which presents an interface that allows documents to be added, deleted and compared simultaneously. Each document stored in the document database consists of one or more fields containing the text of the document and associated meta-data. The data in each of these fields are stored in the document database, as well as a field index corresponding to each field. In addition, for each field containing textual data, an inverted index is maintained, which maps all terms appearing in every document to the documents in the document database. Unlike the weighted term dictionary in the concept database, the inverted index contains all distinct terms occurring in the documents, including stop words. Finally, the document database contains an additional database with a set of pseudo-document vectors, each of which represents a unique document in the document database.
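The structure of such a document database can be sketched as a small class that keeps the stored fields, an inverted index and the pseudo-document vectors in step as documents are added and deleted. This is a simplified, single-threaded illustration: the class name `DocumentDatabase`, its method names and the `project` callable (standing in for projection through the concept database) are hypothetical, and the per-field index mentioned above is omitted for brevity.

```python
from collections import defaultdict

class DocumentDatabase:
    def __init__(self, project):
        # project: hypothetical callable mapping a list of tokens to a
        # pseudo-document vector using the concept database.
        self.project = project
        self.fields = {}                        # doc id -> text and meta-data
        self.inverted_index = defaultdict(set)  # term -> ids of docs containing it
        self.vectors = {}                       # doc id -> pseudo-document vector

    def add(self, doc_id, text, meta=None):
        tokens = text.lower().split()
        self.fields[doc_id] = {"text": text, "meta": meta or {}}
        for term in set(tokens):                # all distinct terms, stop words included
            self.inverted_index[term].add(doc_id)
        self.vectors[doc_id] = self.project(tokens)

    def delete(self, doc_id):
        record = self.fields.pop(doc_id)
        for term in set(record["text"].lower().split()):
            self.inverted_index[term].discard(doc_id)
            if not self.inverted_index[term]:   # drop terms with no remaining documents
                del self.inverted_index[term]
        del self.vectors[doc_id]

# Usage with a placeholder projection (vector = [number of tokens]).
db = DocumentDatabase(project=lambda tokens: [float(len(tokens))])
db.add("d1", "the search engine")
db.add("q", "search the web")
db.delete("q")
print(sorted(db.inverted_index))
```

Neither `add` nor `delete` touches any other document's vector, which reflects the point that documents can enter and leave the database without a new decomposition.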
Thus, in stark contrast with the prior art LSI processing, documents are added to a document database for comparisons without decomposing the document database prior to making document comparisons. Additionally, documents may be deleted from the document database without decomposing the document database prior to making document comparisons. Thus, a query document dq may be easily added to the document database for making a comparison and then deleted prior to making any other document comparisons. Therefore, using the concept database, a query pseudo-document vector can be created for a query document (step 506).
Once the document database of pseudo-document vectors has been compiled, a term-by-document matrix Â can be produced and from that, a table of similarities between documents can also be produced, as discussed above, from the product D S² D^T for ranking documents in the document database (step 508). If query document dq is used, it is inserted in the document database and the documents in the document database are ranked with respect to the query pseudo-document vector in accordance with the product D S² D^T.
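The ranking step can be sketched by computing the table of document similarities from the pseudo-document vectors and the singular values, then sorting one row of that table. This is a minimal pure-Python illustration; the function names `similarity_table` and `rank_against`, and the sample vectors and singular values, are hypothetical.

```python
def similarity_table(D, s):
    """Return the matrix D S^2 D^T.

    D is a list of pseudo-document vectors (one row of length k per document)
    and s holds the k singular values. Entry [i][j] equals the dot product
    of rows i and j of D after each coordinate is scaled by its singular value."""
    scaled = [[x * sv for x, sv in zip(row, s)] for row in D]
    return [[sum(a * b for a, b in zip(u, v)) for v in scaled] for u in scaled]

def rank_against(index, table):
    """Indices of all other documents, most similar to document `index` first."""
    scores = table[index]
    others = [j for j in range(len(scores)) if j != index]
    return sorted(others, key=lambda j: scores[j], reverse=True)

# Hypothetical pseudo-document vectors; row 0 plays the role of the query.
D = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0]]
s = [2.0, 1.0]

table = similarity_table(D, s)
print(rank_against(0, table))
```

These entries are raw dot products, so longer vectors score higher; dividing each entry by the lengths of the two scaled vectors would turn them into cosines, which is one possible design choice and not something the source prescribes.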
The process for building a concept database will be particularly described with respect to the flowchart depicted in FIG. 6 and the logical diagrams illustrated in FIGS. 7A and 7B in accordance with an exemplary embodiment of the present invention. As mentioned above, concept database 730 comprises two matrices, matrix T 722 and matrix S 724, and a dictionary of weighted terms G 710.
The process begins with the selection of an appropriate corpus of documents having similar subject matter, such as from document collection 702 (step 602). The corpus of documents 704 may be large or small, but a larger corpus may yield better results due to a larger variety of term concepts being represented within the corpus. Document corpus 704 is a representative sample of documents that is based on the similarity criteria to be used for a search and contains exactly d documents.
Similarity criteria refer to the parametric attributes that define similarities in a representative semantic space with regard to documents. Similarity criteria may be as uncomplicated as the particular document type, or alternatively, may be a complicated hierarchy of attributes and parametric values associated with document similarity. For instance, a corpus of documents might be compiled based on similarity criteria involving a particular document type, such as a United States issued patent. Other predetermined document attributes that might be defined for compiling the document corpus include a broad statement of the subject matter of the issued patent or the technology area, such as the subject matter of computer-implemented technologies. Other document attributes might further refine the subject matter, such as one of three-dimensional graphics, information technology, or data compression. An exemplary document corpus based on the similarity criteria for U.S. issued patent document types related to information technology would comprise only U.S. issued patents related to information technology. The similarity criteria may be narrowed even further by specifying other parametric attributes, for example, a category of information technology such as database recovery, interface engines or search engines. In that case, an exemplary document corpus might comprise only U.S. issued patents related to search engines, which is a category of information technology. A representative semantic space defined by such similarity criteria would be useful for comparing U.S. issued patents related to search engines, for searching databases of U.S. issued patents for documents similar to U.S. issued patents directed to search engines or, alternatively, even for searching more general databases for documents similar to U.S. issued patents directed to search engines.
Therefore, if databases to be searched contain documents on Object Oriented Programming (OOP), then the representative document corpus should be compiled on the basis of similar subject matter criteria. That is, the document corpus should be composed of documents pertaining to the subject of OOP.
While document corpus 704 is a representative sample of d documents, that sample may be compiled from different document sources without varying from the intended scope of the present invention. In accordance with one exemplary embodiment, document corpus 704 is optionally compiled directly from document collection 702, which is the subject of an initial document comparison. In that case, the representative semantic space is defined by the similarity criteria of the document collection, similar to traditional LSI processing. However, since the present method utilizes only a representative semantic space, document collection 702 may be condensed from c documents to a smaller number, d, of representative documents for corpus 704. Reducing the number of documents in document corpus 704 greatly accelerates the representative semantic analysis process without sacrificing precision in the document comparison. One method of condensing document collection 702 is to sort the collection using a random selection algorithm (note that the process arrow from document collection 702 to document corpus 704 is dashed, indicating that the step is optional). Another method is to manually sort document collection 702 for the first d documents that accurately represent predetermined similarity criteria. However, as mentioned above, documents in the document corpus need not be identical to the documents that will be compared or searched, but instead are used only for creating a concept database for defining a representative semantic space in which pseudo-document vectors for documents in a collection to be compared are projected.
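Condensing a collection of c documents to a corpus of d documents by random selection can be sketched with the standard library. This is a minimal illustration; the function name `sample_corpus`, the optional `seed` parameter (added only to make the example repeatable) and the sample collection are hypothetical.

```python
import random

def sample_corpus(collection, d, seed=None):
    """Randomly select d distinct documents from the collection.

    If the collection holds d documents or fewer, all of them are returned."""
    documents = list(collection)
    if d >= len(documents):
        return documents
    rng = random.Random(seed)
    return rng.sample(documents, d)

# Hypothetical collection of c = 100 document identifiers, condensed to d = 10.
collection = [f"doc{i}" for i in range(100)]
corpus = sample_corpus(collection, 10, seed=0)
print(corpus)
```

Sampling without replacement guarantees that no document appears twice in the corpus, so the corpus contains exactly d distinct documents whenever the collection is larger than d.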
Thus, while in accordance with some exemplary embodiments of the present invention, the documents in the corpus may be subsequently vectorized and included in a document database for making document comparisons, it is not necessary to do so. In fact, the primary purpose for compiling the document corpus 704 is for the realization of a representative semantic space which is inherently based on representative latent semantic similarities attributable to the d representative documents in the corpus. Once document corpus 704 is compiled, the documents in the corpus are of no further value in defining the representative semantic space and are discarded. Moreover, because the document corpus is merely a representative sample of documents that conform to predefined similarity criteria, the actual documents used for compiling the corpus are relatively unimportant. The documents themselves merely represent the predefined similarity criteria with which the semantic space is defined.
Therefore, in accordance with other exemplary embodiments of the present invention, document corpus 704 is compiled from documents that are not included in the document collection to be compared. Representative documents are compiled from any source based on predetermined similarity criteria. In practice, document corpus 704 may be compiled rather expeditiously for less complicated similarity criteria such as document type. For example, documents directed to media broadcasts may be compiled into a document corpus by merely collecting media documents of the media type to be compared. If too many equally relevant documents are gathered, then a sorting algorithm may be employed as described above for randomly selecting documents based on the optimal size of the corpus. More complicated similarity criteria may require an operator to manually sift through relevant documents to identify those most representative of the predetermined similarity criteria. However, in accordance with an exemplary embodiment, a client, rather than a skilled operator, may best understand which documents are representative of similarity criteria without fully understanding how to articulate exactly what parametric attributes are desired. In that case, the client may compile a valid document corpus of representative documents on which document comparisons may be based. In such cases, a corpus of documents may be compiled by a client who has no understanding whatsoever of the present process.
It should be understood that the representative semantic analysis process of the present invention is largely a mathematical process, of which compiling a document corpus is the primary non-mathematical step. Compiling the document corpus is the primary step where understanding the similarity criteria plays a role. Therefore, should a corpus of documents be provided beforehand, the representative semantic analysis process can make document comparisons essentially automatic. Moreover, because the documents in a document corpus and in a document database to be compared are implemented in software document codes, the only requirement is that the document codes be equivalent. Therefore, accurate document comparisons may be made where the subject documents are in a language which is foreign to the operator so long as a document corpus has been provided and the foreign language documents are all implemented in a matching document code.
Document database 702 is considered to be the corpus of the subject matter to be searched and may be reduced in size by applying a sorting algorithm. The sorting algorithm, for instance, a random selection algorithm, reduces the number of documents in the document database by randomly selecting documents to populate document corpus 704. Here, it is important to understand that having more documents in document database 702 has the advantage of presenting terms within the documents in a variety of semantic contexts. However, it has also been discussed that too many documents in the information corpus do not increase the accuracy of the search results by a distinguishable amount. Using smaller databases tends to lessen the number of contextual concepts that can be represented by the term-by-document matrix. In practice, it has been discovered that the optimum value for d (the quantity of similarity criteria related documents in an information corpus) is greater than 5,000 documents, where each document contains between 1 kilobyte and 64 kilobytes of text. Similarly, selecting a document database which contains an extraordinary number of documents not closely related to the similarity criteria to be searched will result in the reduction of the number of contextual concepts represented in term-by-document matrix A that are important for discovering similarities between documents within a given subject matter. Thus, it may be expected that the accuracy of the search results will be lowered due to the extraordinary number of documents in the document database which are not closely related to the similarity criteria.
Returning to the process depicted in FIG. 6, a dictionary is created which comprises all t number of terms occurring in the d number of documents in the corpus (i.e., dictionary 706) (step 604). Indexing each and every term is not necessary and may lower the overall efficiency of the representative semantic analysis process because some terms have little or no meaning with respect to the similarity criteria and yet have extremely high occurrence frequency. For instance, stop words (e.g., “a,” “and” and “the”) might all be eliminated from the dictionary of terms. In practice, once a document corpus 704 has been assimilated, a term occurrence algorithm can be applied to the documents of document corpus 704 for creating term dictionary 706. The term occurrence algorithm may have a number of forms. For example, it may optionally eliminate stop words and/or may attempt to reduce terms to their root form. Additionally, the term occurrence algorithm may also implement other specialized rules for forming dictionary 706. It may be the case that a particular term has a contradictory or misleading meaning which would tend to skew the concept analysis of the term. With respect to a document comparison directed to OOP, the term “object” is an extremely important concept when trying to describe object oriented programming concepts. However, if the collection of documents to be examined is a collection of documents which represents issued patents, the term “object” also conveys a meaning of the purpose of the patent or subpart of the patent. Thus, in that context, the term “object” specifically relates to the subject matter of patentability rather than the subject matter of object oriented programming.
Alternatively, the term occurrence algorithm may exclude terms from the term dictionary that do not occur more than a predetermined number of times in the corpus. Normally, terms that do not occur more than a threshold number of times occupy a space in the term dictionary. Since each term in the term dictionary occupies a row in the term-by-document matrix A, low occurrence terms also have a row in matrix A as does any other term. However, terms with an extremely low number of occurrences in the corpus cannot participate in the formation of stronger concepts due to their particularly low occurrence frequencies. Therefore, in accordance with another exemplary embodiment, the occurrence algorithm allows for an artificial threshold to be predetermined for term occurrences in the corpus. Terms that do not occur the predetermined number of times in the corpus are excluded from the dictionary, and therefore are also excluded from the term-by-document matrix A. Excluding terms with extremely low occurrences optimizes the decomposition process by eliminating rows in the term-by-document matrix A devoted to terms that do not participate in forming stronger concepts, while not affecting the quality or character of the decomposed matrices. The predetermined threshold amount is from one to four occurrences in the corpus, but as a practical matter has been set at three for various sized corpuses with exceptional results. Ultimately, dictionary 706 will contain t number of terms that are also contained in the documents in document corpus 704 (i.e., term t1 through term tt, and each of terms t1 through tt may have some semantic relationship to the similarity criteria used to compile the corpus).
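The dictionary-building step described above can be sketched in a few lines. The following is a minimal illustration, not the term occurrence algorithm of the invention: the function name `build_dictionary`, the tokenizing pattern, and the three-word stop list are assumptions made for the example, and the threshold is read here as "at least three occurrences in the corpus".

```python
import re
from collections import Counter

# Assumed minimal stop list, taken from the examples given in the text.
STOP_WORDS = {"a", "and", "the"}


def build_dictionary(corpus, min_occurrences=3, stop_words=STOP_WORDS):
    """Return the sorted terms that occur at least min_occurrences times in the corpus."""
    counts = Counter()
    for text in corpus:
        # Hypothetical tokenizer: lowercase runs of letters, digits, '+' and '#'.
        for token in re.findall(r"[a-z0-9+#]+", text.lower()):
            if token not in stop_words:
                counts[token] += 1
    return sorted(term for term, n in counts.items() if n >= min_occurrences)


corpus = ["the object and the class", "object class method", "an object class"]
dictionary = build_dictionary(corpus)
```

Each surviving term corresponds to one row of the term-by-document matrix A, so raising the threshold directly reduces the number of rows that must be decomposed.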
Next, a term-by-document matrix 708 is constructed from document corpus 704 and term dictionary 706 (step 606). Here it is understood that term-by-document matrix 708 for the representative semantic analysis process of the present invention is fundamentally different from the term-by-document matrix created for LSI processing. Documents in the representative term-by-document matrix are selected to represent a similarity criteria related to document comparisons, while traditionally documents for an LSI term-by-document matrix were those used in the actual document comparisons.
Term-by-document matrix 708 is populated with local term weights L for the terms in the term dictionary 706 (e.g., term t1 through term tt) such that Li represents the frequency of occurrence of each unique dictionary term ti in a unique document dj in which the term occurs. Because term-by-document matrix 708 is populated with document information from document corpus 704 and that information is representative of similarity criteria (and not necessarily the document to be compared), term-by-document matrix 708 is also referred to as a representative term-by-document matrix. Normally, documents represent columns and terms represent rows; however, one of ordinary skill in the art would understand that the term-by-document matrix can be presented in any one of a number of presentation formats provided that the format is maintained throughout the remainder of the concept database building process. Dictionary terms and document nomenclatures are typically arranged in the representative term-by-document matrix 708 such that the occurrence frequency of a term ti in a document dj is found in the entry at the intersection of the term ti row and document dj column. Essentially, representative term-by-document matrix 708 is a tabular listing of term occurrences by document and thus the entries in the matrix are document-local term weights Li (or simply local term weights). When examined by document, the representative term-by-document matrix is a collection of document column vectors, wherein each column is a vector representing a unique document dj in the corpus 704 and each entry is a coefficient representing occurrences for a dictionary term in the document. In a similar manner, when examined by term, the term-by-document matrix is a collection of term row vectors, wherein each row is a vector representing occurrences of dictionary term ti by document.
From the term-by-document matrix, a weighted term dictionary G 710 is constructed by calculating the global term weight for each term in the term-by-document matrix (step 608). Terms in dictionary 706 are then weighted based on information theory and statistical practices (i.e., log entropy and local and global weighting schemes), converting term dictionary 706 into weighted term dictionary G 710. In practice, terms occurring in term dictionary 706 are globally weighted based on a term's cumulative occurrences in the d number of documents (e.g., document d1 through document dd) of document corpus 704. Term dictionary 706 is modified to hold the global term weights Gi to form weighted term dictionary G 710. Global term weight Gi is determined by the occurrences of each term ti in documents d1 through dd of term-by-document matrix 708 (or similarly in document corpus 704).
A weighted term-by-document matrix A 720 is then created from the weighted term dictionary G 710 and the representative term-by-document matrix 708 by applying the term weighting function to the representative term-by-document matrix 708 (step 610).
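The row and column layout described above can be illustrated with a small sketch. This is an assumed, simplified construction (the name `term_by_document` and the pre-tokenized input are hypothetical), using raw occurrence counts as the local term weights, with terms as rows and documents as columns.

```python
def term_by_document(corpus_tokens, dictionary):
    """Build a count matrix: one row per dictionary term, one column per document."""
    row_of = {term: i for i, term in enumerate(dictionary)}
    matrix = [[0] * len(corpus_tokens) for _ in dictionary]
    for j, tokens in enumerate(corpus_tokens):
        for token in tokens:
            i = row_of.get(token)
            if i is not None:  # tokens outside the dictionary are ignored
                matrix[i][j] += 1
    return matrix


dictionary = ["class", "object"]
corpus_tokens = [["object", "object", "class"], ["class", "method"]]
counts = term_by_document(corpus_tokens, dictionary)
```

Reading `counts` by column gives a document vector, and reading it by row gives a term vector, matching the two views of the matrix described in the text.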
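The weighting steps 608 and 610 can be sketched as follows. The text names log entropy weighting without restating the formula at this point, so the sketch assumes the commonly used log-entropy scheme: local weight log(1 + f) for an occurrence count f, and global weight 1 + Σj p·log(p) / log(d) with p the share of a term's occurrences falling in document j. The function names are hypothetical.

```python
import math


def global_weights(counts):
    """Assumed log-entropy global weight G_i for each term row of a count matrix."""
    d = len(counts[0])
    weights = []
    for row in counts:
        total = sum(row)
        entropy = 0.0
        for f in row:
            if f > 0:
                p = f / total
                entropy += p * math.log(p)
        if d > 1:
            weights.append(1.0 + entropy / math.log(d))
        else:
            weights.append(1.0)  # a single-document corpus carries no entropy information
    return weights


def weighted_matrix(counts, g):
    """Scale the local weight log(1 + f) of every entry by its term's global weight."""
    return [[g_i * math.log(1 + f) for f in row] for row, g_i in zip(counts, g)]


counts = [[1, 1], [2, 0]]      # rows: terms, columns: documents
g = global_weights(counts)     # evenly spread term -> 0, concentrated term -> 1
a = weighted_matrix(counts, g)
```

Under this assumed scheme a term spread evenly over all documents receives a global weight near zero, while a term concentrated in few documents keeps a weight near one.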
Having constructed the weighted term-by-document matrix A 720, a singular values matrix S0 and a corresponding left singular term vector matrix T0 are decomposed from weighted term-by-document matrix A 720 using any one of a number of decomposition algorithms such as a singular value decomposition (step 612). Actually, matrices S0, T0 and a third matrix, document matrix D0, are decomposed from weighted term-by-document matrix A 720 as described above with respect to FIGS. 3 and 4. However, as the purpose of constructing concept database 730 is for defining a representative semantic space in which related documents may be compared, the document matrix D0 is of no use and is immediately discarded as it is not needed for the representative semantic analysis.
While the present invention is described using the singular value decomposition algorithm for decomposing the weighted term-by-document matrix A 720, other decomposition algorithms are equally well known and yield acceptable results. One of ordinary skill in the art would immediately recognize that other decomposition algorithms are available, such as semi-discrete decompositions and ULV decompositions.
The dimensional size of singular value matrix S0 is reduced by selecting the k largest singular values 725 which represent the concepts in the representative semantic space (step 614). In accordance with an exemplary embodiment of the present invention, the value of k is normally set between 250 and 400, having an optimum value of approximately 300.
In addition to reducing the size of singular value matrix S0 to reduced singular value matrix S, the size of the term vector matrix T0 is also reduced to k columns in accordance with the reduction, thereby forming reduced term vector matrix T. The singular values in matrix S 722 are ordered by size for the m concepts from weighted term-by-document matrix A 720. Only the k highest ranking concepts 725 are used for defining the representative semantic space, which are identified in matrix S 724. A reduced term-by-document matrix Â is then realized by keeping only the k largest concepts from matrix S 724. This is accomplished expeditiously by merely setting the smaller concept values in the S matrix to zero (i.e., concepts ranked lower than k, between the kth and mth values). Then, the S and T matrices are truncated to the k concepts; matrix S becomes a k×k matrix with k non-zero entries and matrix T becomes an m×k matrix. As a practical matter, the matrix A is decomposed for the k highest singular values and the decomposition is then terminated because the remaining singular values, through the mth, will be truncated in the concept reduction.
The decomposition products, matrices S and T, are the means for defining a representative semantic space for comparing documents by the similarity of their latent semantic concepts. Accurate S and T matrices are more important for representative semantic analysis than for standard LSI searching because the representative semantic space is representative of the concepts in the collections of documents to be searched. The present invention relates to an optimization over traditional LSI searching methods, so selecting an efficient SVD process is significant for achieving overall optimized results.
The present invention does not rely on SVD for finding a document matrix D because documents in document matrix D are not compared to one another. Moreover, the document vectors defined in the row vectors of the document matrix D are not properly defined in the domain of the representative semantic space defined by concept database 730. Concept database 730 defines a representative semantic space which is then used for comparing one document to another as described further below, whether or not the document was a part of the document corpus used for the SVD. Each document in a document database to be compared is placed in a document vector form similar to that illustrated above with respect to weighted term-by-document matrix A 720. Therefore, in addition to needing the frequency occurrence values of terms in term dictionary 710, which occur in the document to be searched, the global weights Gi for terms' occurrences in document corpus 704 are also needed. Thus, in addition to the left singular term vectors T and the k largest singular values S, term dictionary 710 is also included in concept database 730, which includes the global weights of each term with respect to document corpus 704. Thus, concept database 730 is realized from the reduced singular values matrix S 724, the reduced term vector matrix T 722 and the weighted term dictionary G 710, each derived from the similarity criteria used to compile representative corpus 704 (step 616). The process then ends.
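The reduction described above can be summarized in matrix form. The following is a sketch in standard SVD notation; the symbols σ for the singular values and D for the reduced document matrix are notation assumed here for illustration and are not reference numerals from the figures.

```latex
\begin{aligned}
A &= T_0 \, S_0 \, D_0^{T}, &\quad S_0 &= \mathrm{diag}(\sigma_1, \ldots, \sigma_m), \quad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_m \ge 0 \\
\hat{A} &= T \, S \, D^{T}, &\quad S &= \mathrm{diag}(\sigma_1, \ldots, \sigma_k), \quad k \le m
\end{aligned}
```

Here T holds the first k columns of T0, and the representative semantic space is carried by T and S alone.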
FIG. 8 is a flowchart depicting a process for creating a document database in accordance with an exemplary embodiment of the present invention. This process is represented as step 504 representing semantic analysis processes depicted in FIG. 5. In practice, the document database consists of a set of documents represented by the document field data, field indices, inverted indices and pseudo-document vectors that represent the documents in the document database. Associated with the compressed set of documents is an index of unique internal doc-IDs mapped to a user-defined document field and associated field index that maps the user-defined field to the unique internal doc-ID.
The process begins by identifying a collection of documents to be searched (step 802). For best document comparison results, the similarity criteria of the collection of documents should be related to that used to compile the representative document corpus. The more askew the similarity criteria of the two databases, the less reliable the similarity results are between documents or terms because the comparison of document vectors must be performed in an appropriate semantic space. More precisely, pseudo-document vectors for the collection of documents are compared in the representative semantic space which produces an indication of the similarity between documents. Document vectors that are not created from SVD of the original information corpus are called ‘pseudo-document vectors’ because they present themselves as vectors of terms rather than an SVD product.
LSI ensures that the semantic space used for comparing document vectors is the appropriate space for making comparisons by always using document vectors from the document vector matrix D returned from the decomposed corpus of documents. In doing so, the multidimensional semantic space created by LSI is defined around the semantic concepts in the information corpus. All document vectors representing documents in the information corpus will map to that semantic space and therefore similarity metrics can be properly applied to the document vectors. The present invention avoids the task of continually computing an SVD of new collections of documents by preprocessing a representative semantic space. The representative semantic space can then be used for comparing the similarity of documents having comparable subject matters to the subject matters of the information corpus used to create the concept database. The subject matters constraint must be rigorously enforced because latent semantic concepts inherent in the information corpus are a result, at least partially, of the subject matters of the corpus. So, for a similarity metric to properly apply to pseudo-document vectors, the multidimensional semantic space must be appropriate for mapping the vectors. The suitability of the semantic space to the pseudo-document vectors is ensured by limiting the realm of the semantic space to similar semantic concepts (i.e., by constraining the similarity criteria).
Returning to the process depicted in FIG. 8, an inverted index is constructed that maps each term from the collection of documents to the documents in which the term appears (step 804). The inverted index contains all terms occurring in the collection of documents including all stop words, unlike the weighted term-by-document matrix 720 in FIG. 7B that excludes stop words.
Once the inverted index has been constructed, each document in the collection is represented as a pseudo-document vector Dj, where $D_j = A_d^{T} T S^{-1}$ (step 806). Here, Ad is a column vector formed from a term vector created from terms co-occurring in the weighted term dictionary G and in the document dj to be represented as a pseudo-document vector. Also note that at this point, all co-occurring terms in term vector Ad are weighted locally by a term's occurrence in document dj, local weight Li, and then scaled by the global weight of the term taken from term dictionary G, global weight Gi. Matrices T and S are the term vector and singular value matrices, respectively, taken from the concept database. Once each document in the collection of documents to be searched is represented as a pseudo-document vector Dj, a reduced term-by-document matrix Â can be compiled from the pseudo-document vectors as discussed above with respect to previous embodiments. From there, the product $\hat{A}^{T}\hat{A}$, or equivalently $D S^{2} D^{T}$, defines an index of similarities from document-to-document in the collection of documents to be searched.
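The projection of a weighted term vector into the representative semantic space can be sketched as below. This is an illustrative, assumed implementation of multiplying a term vector by T and by the inverse of the diagonal matrix S; the function name and the tiny three-term, two-concept matrices are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def pseudo_document_vector(a_d, term_matrix, singular_values):
    """Project a weighted term vector (length t) onto k concepts: a_d^T * T * S^-1."""
    k = len(singular_values)
    return [
        sum(a_d[i] * term_matrix[i][c] for i in range(len(a_d))) / singular_values[c]
        for c in range(k)
    ]


# Hypothetical concept database: 3 dictionary terms, k = 2 concepts.
T = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
S = [2.0, 0.5]                  # diagonal of the reduced singular value matrix

a_d = [4.0, 1.0, 7.0]           # weighted term vector of one document
D_j = pseudo_document_vector(a_d, T, S)
```

Because S is diagonal, multiplying by its inverse reduces to dividing each concept coordinate by the corresponding singular value, so no general matrix inversion is needed.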
By using the inverted index and the pseudo-document vector in the document database, any document di, of the c number of documents in the document collection may be compared to any other document dj in the document collection. Moreover, a subset, S1, comprising s1 number of documents of the c number of documents in the document collection may first be sorted using the inverted index and then the documents in subset S1 compared using pseudo-document vectors as described above. The inverted index is searched using Boolean logic, or any conventional search technique, to reduce the quantity of the pseudo-document vectors comprising the subset S1 to be compared, from the c number originally in the document collection to s1 more relevant documents in subset S1. Thus, more accurate results may be achieved. For example, narrowing a search between a set of dates, or constraining the set of results to those containing specific keywords eliminates documents not meeting the search criteria and those documents will not be ranked. In cases where a purely conceptual match is not desired, the added ability to require that a document have certain words is extremely beneficial because only documents containing the specified words are ranked by their vector similarity. An added benefit is that the reduction in the number of similarity comparisons to be made as a result of performing field and/or keyword queries reduces the computational requirements.
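The narrowing of the collection to subset S1 can be sketched with a simple inverted index. This is a minimal sketch under assumptions: whitespace tokenization, a conjunctive (AND) query only, and hypothetical function names and document ids.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map every term (stop words included) to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def boolean_and(index, required_terms):
    """Return ids of documents that contain every required term."""
    postings = [index.get(term.lower(), set()) for term in required_terms]
    if not postings:
        return set()  # design choice for this sketch: an empty query selects nothing
    return set.intersection(*postings)


documents = {
    "d1": "c++ oop design",
    "d2": "java oop",
    "d3": "signal processing",
}
index = build_inverted_index(documents)
subset_s1 = boolean_and(index, ["oop"])
```

Only the pseudo-document vectors of the documents in `subset_s1` then need to be compared, which is where the reduction in similarity computations comes from.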
Of course, in many instances, the purpose of performing the document search is not for comparing one document within a collection of documents to another document within the same collection, but instead to find the most similar document in the collection to a query document. One of ordinary skill in the art would readily understand the query document may be representative of documents that are desired or mainstayed and be merely a series of query terms used to convey a central idea which is paramount to the search strategy. FIG. 9 is a flowchart depicting a process for creating a query pseudo-document vector in accordance with an exemplary embodiment of the present invention. Recall that the purpose for placing the query pseudo-document in the representative semantic space with other pseudo-documents is to find the most similar document as represented within that semantic space. Therefore, the query pseudo-document vector must be constructed using the same parametric attributes as those used for constructing the semantic space itself (i.e., the similarity criteria of the original corpus of documents). Thus, it would be unlikely to expect accurate search results from a query document which was outside the similarity criteria of the original corpus of documents. Hence, the first constraint, similar to building the document database, is that the query document pertains to a similar subject matter as both the original corpus of documents and the newly-created document database.
The process begins by identifying a representative query document or alternatively, a set of query terms (step 902). Again, this document or these terms must pertain to the original subject matter of the original corpus of documents. Finally, the query is represented as a pseudo-document vector Dq similar to that described above with respect to the documents in the collection of documents (step 904). Notice that when creating the pseudo-document vector, there is no need to create an inverted index for the query because the inverted index created in step 804 of FIG. 8 is used solely for reducing the number of documents to be presented in the semantic space. However, if the query document is to become a part of the document database for additional document comparisons with other query documents, then it might be prudent to create the inverted index for the query document.
Alternatively, the query document need be represented merely as $D_q = X^{T} T S^{-1}$. Again, X is a column vector formed from the term vector of the terms which co-occur in the weighted term dictionary G and in the query document dq. Further, co-occurring terms' local weights from the query document dq are scaled by the global weights Gi taken from the term dictionary G. Matrices T and S are taken directly from the concept database and represent the term vector matrix and the singular value matrix, respectively.
Similar documents may be found in the document database by using an exemplary query that combines a Boolean search and a reduced dimensional vector search using pseudo-document vectors. Thus, in accordance with an exemplary embodiment of the present invention, a query is composed of two distinct parts: a Boolean query and matching pseudo-document vectors. With regard to the first part, the Boolean query allows the combination of field value constraints, including the existence of certain key terms as well as precise field values and ranges, and acts as a filter for selecting a subset S of documents from the document database which match the query condition. With regard to the second part, documents d1 through ds in the subset S are compared to each other by matching corresponding pseudo-document vectors D1 through Ds, thereby ranking documents d1 through ds in subset S based on the similarity of pseudo-document vectors projected in the representative semantic space.
Document comparisons are made by taking the inner product of pseudo-document vectors. A resulting set of values for matching any of pseudo-document vectors D1 through Ds to all other pseudo-document vectors D1 through Ds represents a document comparison based on the similarity criteria used for creating the concept database. The resulting set of values is sorted from largest to smallest. Larger inner products correspond to a greater semantic similarity between documents. The resulting set of matching documents is the set of documents in subset S, ranked by how closely they match each other. Alternatively, a query pseudo-document vector Dq is derived from a search or query document dq which typifies similarity criteria for a comparison or search document. Query document dq represents particular parametric attributes for the similarity criteria used to create the concept database. Once created, query document dq and query pseudo-document vector Dq are included in the document database as any other document in the collection. The aspects of the present invention discussed above are further described below with respect to the flowchart in FIG. 10.
FIG. 10 is a flowchart depicting an optimized process for finding the similarity between documents in a document database in accordance with an exemplary embodiment of the present invention. The process depicted in FIG. 10 is performed subsequent to creating a document database as represented by step 504 in FIG. 5 (and further described in FIG. 8). One intent of the presently described process is to further optimize the representative semantic analysis search process by reducing the size of the document database to be used for document ranking. However, reducing the number of documents to be compared in the document database is optional and may not be performed. It should be understood that if documents are removed from the document database prior to matching the pseudo-document vectors, the advantage of using the present representative semantic analysis is not realized for the removed documents.
The process begins by finding a subset S1 (subset S1 comprising document d1 through document ds1 and having exactly s1 number of documents) from an inverted index which maps each term in the collection of documents to the documents in the collection using a Boolean query (step 1002). Steps 1002 and 1004 represent optimizations of the representative semantic analysis searching process and are represented as dashed process boxes. In doing so, documents in the document database can be either included or excluded from further consideration by using selected terms in the Boolean query. Next, a second subset, subset S2, is obtained from the original document database, or alternatively from documents d1 to ds1 in subset S1, by using a field value query. Subset S2 comprises document d1 through document ds2 and has exactly s2 number of documents. Here, rather than using Boolean logic to search the actual terms in the documents, field values associated with documents in the collection are used for including or eliminating a document in the document database. For example, with respect to a database of documents on object-oriented processing, one field may be presented for the date of the document, for instance, 1999. In an effort to reduce the number of documents for consideration in the representative semantic analysis, the user may limit the search of documents to a period which the user feels is relevant. To illustrate, a field query value may be set greater than 1998, or “>1998.” In that case, documents having field year values of “1999,” “2000” or “2001” would be included for further consideration. Thus, a document with a date field value of “1999” would be included in the document database for further representative semantic analysis.
If a query document is to be used as a basis for the document comparison, the process continues by constructing a query pseudo-document vector Dq from a search sample document dq as discussed above with respect to step 506 in FIG. 5 (and further described in FIG. 9). Query pseudo-document vector Dq is included in the document database, which contains an additional database with a set of pseudo-document vectors, each of which represents a unique document in the document database. A pseudo-document vector Di is derived from the text of a document di in the document database and the concept database (T and S matrices and the weighted dictionary of terms) by using the formula $D = X^{T} T S^{-1}$, where X is a term vector representing occurrences of terms in the weighted term dictionary G for the document di. Each time a document dj is added to the document database, a pseudo-document vector Dj is created for the document, the document field data is stored, and the associated field indices and the field inverted indices are updated accordingly.
The process then continues by matching the query document to a subset of documents in the document database by comparing the pseudo-document vector Dq with pseudo-document vectors D1 through Ds in the subset S (step 1008). S is a subset of documents in the document database which have been identified for further consideration by one of a Boolean search query or a field value query. As described elsewhere above, comparing documents involves finding the dot product between any two pseudo-document vectors, e.g., pseudo-document vector Di and pseudo-document vector Dj, that define the subject document di and document dj, respectively. Again, this product results in the cosine of the pseudo-document vectors being compared, i.e., pseudo-document vector Di and pseudo-document vector Dj. As the product approaches “1.0,” subject documents di and dj are more similar. Thus, the pseudo-document vector (any of pseudo-document vector D1 through pseudo-document vector Ds) in which the resultant dot product with pseudo-document vector Dq is closest to a value of “1.0” corresponds to the most similar document dj to query document dq, based on the similarity criteria used for defining the concept database.
Alternatively, documents d1 through ds are ranked by document similarity by listing the cosine values (dot products) between the pseudo-document vector Dj for document dj, and all other pseudo-document vectors associated with documents to be compared, such as pseudo-document vectors D1 to Ds for documents d1 to ds in a collection of documents. Ranking can be performed as an iterative process in which each document in the collection is compared to every other document in the document collection by matching the documents' corresponding pseudo-document vectors. The process then ends.
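The field value query of the example above can be sketched as a simple filter over stored field data. The record layout, the field name `year`, the document ids, and the function name `field_query` are assumptions made for illustration only.

```python
def field_query(records, field, predicate):
    """Return ids of documents whose stored field value satisfies the predicate."""
    return [
        doc_id
        for doc_id, fields in records.items()
        if field in fields and predicate(fields[field])
    ]


# Hypothetical field data: document id -> field values.
records = {
    "d1": {"year": 1997},
    "d2": {"year": 1999},
    "d3": {"year": 2001},
    "d4": {},               # no date field stored for this document
}

# Field query ">1998": keep only documents dated after 1998.
subset_s2 = field_query(records, "year", lambda year: year > 1998)
```

Documents without the queried field are excluded in this sketch; whether such documents should instead pass the filter is a design choice the text does not settle.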
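The matching and ranking step can be sketched as follows. This is an assumed, minimal implementation in which the vectors are normalized explicitly so that the dot product equals the cosine; the two-dimensional vectors and document ids are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two pseudo-document vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query_vector, document_vectors):
    """Sort (doc_id, score) pairs from most to least similar to the query."""
    scored = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in document_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical pseudo-document vectors in a k = 2 semantic space.
document_vectors = {"d1": [0.0, 3.0], "d2": [1.0, 1.0], "d3": [2.0, 0.0]}
ranking = rank([1.0, 0.0], document_vectors)
```

The first entry of `ranking` is the document whose score is closest to 1.0, that is, the most similar document to the query under the similarity criteria of the concept database.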
Presented below is pseudo code for implementing an exemplary search of a document database based on hiring criteria.
Example: Assuming a concept database consisting of a weighted term dictionary and a reduced term matrix built from a repository of résumés, and a document database of indexed, searchable job descriptions:
Query: “I am a c++ programmer with experience in object oriented analysis and design looking for senior level development position” with required terms “oop” and fielded requirement “Salary>$60,000”
Assuming the document database index has the following example fielded structure:
    • <key>Job0001</key>
    • <job-title>job title here</job-title>
    • <salary>45000</salary>
    • <body>this is the body of the job description</body>
The Boolean query could take the form of: “body contains ‘oop’” and “salary>‘60000’”, returning a subset of documents matching these criteria. Then, a ranked list of results is produced by computing the cosine between the pseudo-document for the query text and the stored pseudo-documents in the document database:
Results:
    • 1. <key>Job6531</key>
    • <job-title>Senior Level Software Engineer</job-title>
    • <salary>75000</salary>
    • <body>Acme Software is looking for a senior software engineer for design and development of signal processing applications. Candidate should be proficient in c++, with experience in OOAD and OOP.</body>
score: 0.78
    • 2. <key>Job1345</key>
    • <job-title>Sr. Software Developer</job-title>
    • <salary>62000</salary>
    • <body>Startup looking for programmer with four (4)+ years of hands-on C++ and OOP programming experience and a minimum of five (5) years overall software engineering experience to develop text messaging system.</body>
score: 0.76
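The two-stage search in this example (Boolean and fielded filtering, then ranking by pseudo-document similarity) might be sketched as follows. The record bodies are abbreviated, `Job2000` and all vectors are invented for illustration, and the vectors are assumed to be stored at unit length, so the scores will not match those listed in the results.

```python
def matches_filters(record, required_term, min_salary):
    """Boolean stage: body must contain the term and salary must exceed the minimum.

    A plain substring test is used for brevity; it would also match words
    such as "loop", so a real index would match whole terms.
    """
    return required_term.lower() in record["body"].lower() and record["salary"] > min_salary

def search(records, vectors, query_vec, required_term, min_salary):
    """Filter records, then rank the survivors by dot product with the query vector.

    Vectors are assumed to be unit length, so the dot product equals the cosine.
    """
    hits = [r for r in records if matches_filters(r, required_term, min_salary)]
    scored = [
        (r["key"], sum(a * b for a, b in zip(query_vec, vectors[r["key"]])))
        for r in hits
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

records = [
    {"key": "Job0001", "salary": 45000, "body": "Junior OOP developer wanted"},
    {"key": "Job6531", "salary": 75000, "body": "Proficient in c++, OOAD and OOP"},
    {"key": "Job1345", "salary": 62000, "body": "Hands-on C++ and OOP programming"},
    {"key": "Job2000", "salary": 90000, "body": "Sales manager for retail chain"},
]
# Hypothetical unit-length pseudo-document vectors (2 concepts), not real LSA output.
vectors = {
    "Job0001": [0.6, 0.8],
    "Job6531": [0.8, 0.6],
    "Job1345": [0.6, 0.8],
    "Job2000": [0.0, 1.0],
}
results = search(records, vectors, [1.0, 0.0], "oop", 60000)
```

Filtering first keeps the similarity computation limited to the subset S rather than the whole document database.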
The description of the present invention has been presented for purposes of illustration and description but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (100)

1. A method for comparing documents for similarity by representative latent semantic analysis comprising:
generating a weighted term-by-document matrix for a group of selected documents, said group of selected documents being representative of a similarity criteria;
decomposing said weighted term-by-document matrix into a term matrix of terms occurring in said group of selected documents and a reduced concepts matrix of concepts, at least a portion of said concepts indicative of latent semantics in at least one of said selected documents;
generating a first pseudo-document vector utilizing a first document and said reduced concepts matrix;
generating a second pseudo-document vector utilizing a second document and said reduced concepts matrix;
comparing said first pseudo-document vector to said second pseudo-document vector; and
determining similarity of said first document and said second document based on the comparison of said first pseudo-document vector to said second pseudo-document vector.
2. The method recited in claim 1 above wherein generating the first pseudo-document vector further comprises utilizing the term matrix.
3. The method recited in claim 1 above wherein generating the second pseudo-document vector further comprises utilizing the term matrix.
4. The method recited in claim 1 above wherein generating the first and the second pseudo-document vectors comprise utilizing the term matrix.
5. The method recited in claim 1 above wherein the similarity criteria is one of document language, document type, subject matter and a category of a subject matter.
6. The method recited in claim 1 above wherein generating the weighted term-by-document matrix further comprises:
compiling the group of selected documents representative of a similarity criteria;
identifying terms occurring in the group of documents;
finding a global term weight for each identified term based on a frequency of occurrences for said each term in the group of documents;
finding a local term weight for each identified term based on a frequency of occurrences for said each term in each document in the group of documents;
organizing said local term weights in a term-by-document matrix for the identified terms; and
creating a weighted term-by-document matrix by applying a global term weight to a corresponding local term weight for each of the respective identified terms in the term-by-document matrix.
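The steps of claim 6, together with the stop-word elimination of claim 7, can be illustrated with a short sketch. The claims do not fix a particular weighting function, so the log-entropy scheme used here (local weight log2(1 + tf), global weight 1 + entropy / log2(n)) is an assumption, as are the stop-word list and the function names.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "with"}  # illustrative only

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def weighted_term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] weights term i in document j.

    Assumed scheme: local weight log2(1 + tf), global weight 1 + entropy / log2(n).
    """
    counts = [Counter(tokenize(doc)) for doc in documents]
    terms = sorted(set().union(*counts))
    n = len(documents)
    matrix = []
    for term in terms:
        tf = [c[term] for c in counts]
        gf = sum(tf)  # frequency of the term across the whole group
        entropy = sum((f / gf) * math.log2(f / gf) for f in tf if f > 0)
        global_w = 1.0 + entropy / math.log2(n) if n > 1 else 1.0
        # Apply the global weight to each local weight in the term's row.
        matrix.append([global_w * math.log2(1 + f) for f in tf])
    return terms, matrix

terms, matrix = weighted_term_document_matrix(["c++ programmer", "java programmer"])
```

Under this assumed scheme a term spread evenly over every document (here "programmer") receives a global weight of zero, while a term concentrated in one document keeps its full local weight.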
7. The method recited in claim 6 above, wherein identifying terms occurring in the group of documents further comprises eliminating stop words from the identified terms.
8. The method recited in claim 7 above wherein said weighted term-by-document matrix is comprised of a product of said term matrix, said reduced concepts matrix S and the transpose of an other matrix, wherein said term matrix is a t row by k column matrix, said reduced concepts matrix is a k row by k column matrix, and said other matrix is a v row by k column matrix.
9. The method recited in claim 8 above, wherein generating a first pseudo-document vector D further comprises:
finding the product of a transpose of a term vector Ad, said term matrix T, and an inverse of said reduced concepts matrix S, such that $D = A_d^{T}\, T\, S^{-1}$, wherein said term vector Ad is formed from identified terms occurring in said first document, said identified occurring terms being weighted locally based on frequency of occurrence in said first document and scaled based on the global weight of each said identified occurring term.
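The product recited in claim 9 (transpose of the term vector, times the term matrix, times the inverse of the reduced concepts matrix) can be sketched in a few lines. The sketch assumes S is diagonal with nonzero entries, as it is when produced by a singular value decomposition, and therefore stores only its diagonal; the function name `fold_in` is hypothetical.

```python
def fold_in(term_vector, term_matrix, singular_values):
    """Project a weighted term vector into concept space: D = A_d^T T S^-1.

    term_vector     : length-t list of weighted term frequencies (A_d)
    term_matrix     : t rows by k columns (T)
    singular_values : length-k diagonal of the reduced concepts matrix (S)
    """
    t = len(term_matrix)
    if len(term_vector) != t:
        raise ValueError("term vector and term matrix must cover the same terms")
    k = len(singular_values)
    projected = [
        sum(term_vector[i] * term_matrix[i][j] for i in range(t))
        for j in range(k)
    ]
    # S is diagonal, so multiplying by its inverse divides each component by s_j.
    return [projected[j] / singular_values[j] for j in range(k)]

# Toy values: 3 terms, 2 concepts.
T = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
S = [2.0, 4.0]
D = fold_in([2.0, 4.0, 7.0], T, S)
```

Because only T and S are needed, a document outside the original group can be projected without repeating the decomposition.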
10. The method recited in claim 9 above further comprises:
compiling a collection of documents;
generating a pseudo-document vector for each of the documents in the collection of documents, each of said pseudo-document vectors being formed utilizing a document from the collection of documents, said reduced concepts matrix and said term matrix;
comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents; and
determining similarity of each of the documents in the collection of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for the documents in the collection of documents.
11. The method recited in claim 10 further comprises:
constructing an index of term-to-document mappings, each of said mappings correlate a term occurring in the collection of documents to documents in which said term occurs; and
finding a subset of documents in the collection of documents in which a query term occurs by searching the inverted index for the query term.
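An index of term-to-document mappings of the kind recited in claim 11 might look like the following sketch. Whitespace tokenization and the names `build_inverted_index` and `find_subset` are simplifying assumptions.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def find_subset(index, query_term):
    """Return the ids of documents containing the query term (empty set if none)."""
    return set(index.get(query_term.lower(), set()))

docs = {"d1": "C++ and OOP", "d2": "Java developer", "d3": "oop design"}
index = build_inverted_index(docs)
subset_ids = find_subset(index, "OOP")
```

Only the documents returned by the lookup then need their pseudo-document vectors compared against the query.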
12. The method recited in claim 11 further comprises:
generating a pseudo-document vector for each document in the subset of documents utilizing a document from the subset, said term matrix, and said reduced concepts matrix;
comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the subset of documents; and
determining document similarity of each of the documents in the subset of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the subset of documents.
13. The method recited in claim 12 above, wherein the first document is a list of terms.
14. The method recited in claim 12 above, wherein the first document is a query document.
15. The method recited in claim 10 further comprises:
adding a document to the collection of documents; and
generating a pseudo-document vector for the added document, said added pseudo-document vector being formed utilizing said added document, said reduced concepts matrix and said term matrix.
16. The method recited in claim 15 above further comprises:
comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents; and
determining document similarity of each of the documents in the collection of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents.
17. The method recited in claim 10 further comprises:
deleting a document from the collection of documents; and
eliminating a pseudo-document vector for the deleted document.
18. The method recited in claim 17 above further comprises:
comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents; and
determining document similarity of each of the documents in the collection of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents.
19. The method recited in claim 1 above, wherein decomposing said weighted term-by-document matrix further comprises:
establishing a predetermined quantity of strong concepts;
deriving the predetermined quantity of strong concepts from said weighted term-by-document matrix; and
ceasing decomposition.
20. The method recited in claim 1 above, wherein decomposing said weighted term-by-document matrix further comprises applying a singular value decomposition algorithm to said weighted term-by-document matrix.
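Claims 19 and 20 describe deriving a predetermined number of strong concepts and then ceasing decomposition. The sketch below illustrates that idea by swapping in power iteration with deflation on the Gram matrix; this is a stand-in chosen for brevity, not an algorithm named in the claims, and `top_k_concepts` is a hypothetical name. It returns the singular values (the diagonal of the reduced concepts matrix) and the columns of the reduced term matrix.

```python
import math

def top_k_concepts(matrix, k, iterations=500):
    """Return (singular_values, term_columns) for the k strongest concepts.

    matrix is a t-by-d weighted term-by-document matrix (list of rows).
    term_columns[j] is the j-th column of the reduced term matrix (length t).
    """
    rows, cols = len(matrix), len(matrix[0])
    # Gram matrix A^T A (d x d); its eigenvalues are the squared singular values.
    gram = [
        [sum(matrix[r][i] * matrix[r][j] for r in range(rows)) for j in range(cols)]
        for i in range(cols)
    ]
    singular_values, term_columns = [], []
    for _concept in range(k):
        # Fixed start vector keeps the sketch deterministic; a start that is
        # orthogonal to the dominant direction would need a different seed.
        v = [1.0 / math.sqrt(cols)] * cols
        for _step in range(iterations):
            w = [sum(gram[i][j] * v[j] for j in range(cols)) for i in range(cols)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        gv = [sum(gram[i][j] * v[j] for j in range(cols)) for i in range(cols)]
        eigenvalue = sum(v[i] * gv[i] for i in range(cols))
        sigma = math.sqrt(max(eigenvalue, 0.0))
        # Term-side vector u = A v / sigma; the document-side vector v is discarded.
        if sigma > 0.0:
            u = [sum(matrix[r][c] * v[c] for c in range(cols)) / sigma for r in range(rows)]
        else:
            u = [0.0] * rows
        singular_values.append(sigma)
        term_columns.append(u)
        # Deflate so the next pass converges to the next strongest concept.
        for i in range(cols):
            for j in range(cols):
                gram[i][j] -= eigenvalue * v[i] * v[j]
    return singular_values, term_columns  # decomposition stops after k concepts

values, columns = top_k_concepts([[3.0, 0.0], [0.0, 2.0], [0.0, 0.0]], 2)
```

Stopping after k concepts avoids computing the weaker concepts at all, rather than computing a full decomposition and truncating it afterwards.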
21. A method for comparing documents for similarity by representative latent semantic analysis comprising:
selecting documents based on a similarity criteria;
generating a weighted term-by-document matrix for said selected documents;
decomposing said weighted term-by-document matrix into a reduced concepts matrix, a term matrix, and a document matrix;
discarding said document matrix;
generating a first pseudo-document vector utilizing a first document, said reduced concepts matrix and said term matrix;
generating a second pseudo-document vector utilizing a second document, said reduced concepts matrix and said term matrix;
comparing said first pseudo-document vector to said second pseudo-document vector; and
determining similarity of said first document and said second document based on the comparison of said first pseudo-document vector to said second pseudo-document vector.
22. The method recited in claim 21 above, wherein decomposing said weighted term-by-document matrix further comprises applying a singular value decomposition algorithm to said weighted term-by-document matrix.
23. The method recited in claim 21 above, wherein decomposing said weighted term-by-document matrix further comprises:
establishing a predetermined quantity of strong concepts;
deriving the predetermined quantity of strong concepts from said weighted term-by-document matrix; and
ceasing decomposition.
24. The method recited in claim 23 above, wherein the predetermined quantity of strong concepts is k number of concepts.
25. The method recited in claim 24 above wherein said weighted term-by-document matrix is comprised of a product of said term matrix, said reduced concepts matrix, and the transpose of an other matrix, wherein said term matrix is a t row by k column matrix, said reduced concepts matrix is a k row by k column matrix, and said other matrix is a v row by k column matrix.
26. The method recited in claim 21 above wherein the similarity criteria is one of document language, document type, subject matter and subject matter category.
27. The method recited in claim 21 above wherein said selected documents exclude said first document and further exclude said second document.
28. The method recited in claim 21 above wherein generating the weighted term-by-document matrix further comprises:
identifying terms occurring in the selected documents;
finding a global term weight for each identified term based on a frequency of occurrences for said each identified term in the selected documents;
finding a local term weight for each identified term based on a frequency of occurrences in each of the plurality of documents in the selected documents;
organizing said local term weights in a term-by-document matrix for the identified terms; and
creating a weighted term-by-document matrix by applying a global term weight to a corresponding local term weight for each of the respective identified terms in the term-by-document matrix.
29. The method recited in claim 28 above, wherein identifying terms occurring in the group of documents further comprises eliminating stop words from the identified terms.
30. The method recited in claim 29 above wherein said weighted term-by-document matrix is comprised of a product of said term matrix, said reduced concepts matrix, and the transpose of a document matrix, wherein said term matrix is a t row by k column matrix, said reduced concepts matrix is a k row by k column matrix, and said document matrix is a d row by k column matrix.
31. The method recited in claim 30 above, wherein generating a first pseudo-document vector D1 further comprises:
finding the product of a transpose of a term vector Ad1, said term matrix T, and an inverse of said reduced concepts matrix S, such that $D_1 = A_{d1}^{T}\, T\, S^{-1}$, wherein said term vector Ad1 is formed from identified terms occurring in said first document, wherein said identified occurring terms are weighted locally based on frequency of occurrence in said first document and scaled based on the global weight of each said identified occurring term.
32. A method for comparing documents for similarity by representative latent semantic analysis comprising:
creating a concept database comprising:
compiling a corpus from a plurality of documents representative of a similarity criteria;
identifying terms occurring in the corpus;
finding a global term weight for each identified term based on a frequency of occurrences in the corpus;
finding a local term weight for each identified term based on a frequency of occurrences in each of the plurality of documents in the corpus;
organizing said local term weights in a term-by-document matrix for the identified terms;
creating a weighted term-by-document matrix by applying global term weights to corresponding local term weights for each of the respective identified terms in the term-by-document matrix;
decomposing the weighted term-by-document matrix into at least a term matrix and a concepts matrix;
forming a reduced concepts matrix by eliminating indices in said concept matrix identifying weak concepts; and
forming a reduced term matrix by eliminating indices in said term matrix corresponding to the indices eliminated from the concept matrix;
creating a document database from a collection of documents and said concept database comprising:
compiling said collection of documents;
generating a plurality of pseudo-document vectors utilizing said collection of documents and each of said reduced concepts matrix and said term matrix of said concept database; and
determining document similarity between documents in the collection of documents comprising:
comparing pseudo-document vectors from the document database for respective documents in the collection of documents; and
determining similarity of said respective documents based on the comparison of the respective pseudo-document vectors from the document database.
33. The method recited in claim 32 above wherein the weighted term-by-document matrix is comprised of a product of said term matrix, said concepts matrix, and the transpose of a document matrix, wherein said term matrix is a t row by m column matrix, said concepts matrix is an m row by m column matrix, and said document matrix is a d row by m column matrix.
34. The method recited in claim 33 above wherein forming a reduced concepts matrix by eliminating indices in said concept matrix identifying weak concepts further comprises:
selecting the k strongest concepts;
eliminating (k+1)th through mth column indices of said concepts matrix; and
eliminating (k+1)th through mth row indices of said concepts matrix, thereby truncating said m row by m column concepts matrix to a k row by k column reduced concepts matrix.
35. The method recited in claim 34 above wherein forming a reduced term matrix by eliminating indices in said concept matrix identifying weak concepts further comprises:
eliminating (k+1)th through mth column indices of said term matrix, thereby truncating said t row by m column term matrix to a t row by k column reduced term matrix.
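The dimensions recited in claims 33 through 35 can be summarized in matrix notation. The symbol A for the weighted term-by-document matrix and the dimension subscripts are added here for illustration; T, S and D denote the term, concepts and document matrices, and the document matrix is shown truncated to d by k as in claim 30.

```latex
A_{t \times d} = T_{t \times m}\, S_{m \times m}\, D_{d \times m}^{T}
\quad\longrightarrow\quad
A_{t \times d} \approx T_{t \times k}\, S_{k \times k}\, D_{d \times k}^{T},
\qquad k < m
```

Dropping the (k+1)th through mth rows and columns of S, and the matching columns of T, keeps only the k strongest concepts.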
36. The method recited in claim 33 above wherein forming a reduced concepts matrix by eliminating indices in said concept matrix identifying weak concepts further comprises:
establishing a predetermined quantity of strong concepts;
deriving the predetermined quantity of strong concepts from said weighted term-by-document matrix; and
ceasing decomposition.
37. The method recited in claim 36 above, wherein generating a plurality of pseudo-document vectors further comprises:
finding the product of a transpose of a term vector Adi, said term matrix T, and an inverse of said reduced concepts matrix S, for each document di in the collection of documents such that $D_i = A_{di}^{T}\, T\, S^{-1}$, said term vector Adi being formed from identified terms occurring in an ith document, wherein each of said identified occurring terms is weighted locally based on occurrences in said each document di of the collection of documents and scaled based on the global weight of each said identified occurring term.
38. A method for comparing documents for similarity by representative latent semantic analysis comprising:
defining a representative semantic space;
creating a first pseudo-document vector for a first document using at least a portion of the definition for the representative semantic space;
creating a second pseudo-document vector for a second document using at least a portion of the definition for the representative semantic space;
comparing the first pseudo-document vector to the second pseudo-document vector; and
determining similarity of the first document and the second document based on the comparison of the first pseudo-document vector to the second pseudo-document vector.
39. The method recited in claim 38 above, wherein defining a representative semantic space further comprises:
determining a similarity criteria;
selecting parameters for said similarity criteria;
compiling a corpus of documents representative of said similarity criteria based on said selected parameters, wherein said corpus of documents excludes said first and said second documents; and
deriving concepts from the corpus of documents representative of said similarity criteria.
40. The method recited in claim 39 above, wherein deriving concepts from the corpus of documents further comprises:
identifying terms occurring in the corpus of documents;
finding a global term weight for each identified term based on a frequency of occurrences for said each term in the corpus of documents;
finding a local term weight for each identified term based on a frequency of occurrences in each of the plurality of documents in the corpus of documents;
organizing said local term weights in a term-by-document matrix for the identified terms;
creating a weighted term-by-document matrix by applying a global term weight to a corresponding local term weight for each of the respective identified terms in the term-by-document matrix; and
decomposing the weighted term-by-document matrix into concepts representative of said similarity criteria.
41. The method recited in claim 40 above wherein said concepts representative of said similarity criteria further comprise a concept database including a concept index and a term index.
42. The method recited in claim 41 above, wherein said concept index and term index being organized as a concept matrix and a term matrix respectively.
43. The method recited in claim 42 above, wherein said concept matrix and said term matrix being products of the decomposition of said weighted term-by-document matrix, said decomposition comprises:
establishing a predetermined quantity of strongest concepts;
decomposing said weighted term-by-document matrix into said term matrix, said concept matrix and a document matrix;
deriving the predetermined quantity of strong concepts from said weighted term-by-document matrix; and
ceasing decomposition.
44. The method recited in claim 43 above, wherein creating a first pseudo-document vector for a first document using at least a portion of the definition for the representative semantic space further comprises:
creating a first document vector for the first document based on the co-occurrence of terms in the first document and the corpus of documents, the local term weight of each co-occurring term based on a frequency of occurrences in said first document and a global term weight for each co-occurring term based on a frequency of occurrences for said each term in the corpus of documents; and
transposing said first document vector to a first pseudo-document vector using a product of said concept matrix and said term matrix.
45. A method for comparing resumès for similarity by representative latent semantic analysis comprising:
generating a weighted term-by-resumè matrix for a group of resumès;
decomposing said weighted term-by-resumè matrix into a term matrix of terms occurring in said group of resumès and a reduced concepts matrix of concepts, at least a portion of said concepts indicative of latent semantics in at least one of said resumès;
identifying a collection of resumès, wherein at least one resumè in the collection of resumès is not included in said group of resumès;
generating a plurality of pseudo-resumè vectors utilizing said collection of resumès and said reduced concepts matrix;
generating a query pseudo-document vector utilizing a query document and said reduced concepts matrix;
comparing said query pseudo-document vector to said plurality of pseudo-resumè vectors; and
ranking said plurality of resumès based on similarity between said query pseudo-document vector and said plurality of pseudo-resumè vectors.
46. The method recited in claim 45 above wherein generating the pseudo-document vector further comprises utilizing the term matrix.
47. The method recited in claim 45 above wherein generating the plurality of pseudo-resumè vectors further comprises utilizing the term matrix.
48. The method recited in claim 45 above wherein generating the query pseudo-document vector and the plurality of pseudo-resumè vectors comprise utilizing the term matrix.
49. The method recited in claim 45 above wherein said query document is a resumè.
50. The method recited in claim 45 above wherein said query document is a list of query terms.
51. The method recited in claim 45 above wherein generating the weighted term-by-resumè matrix for a group of resumès further comprises:
selecting the group of resumès;
identifying terms occurring in the group of resumès;
finding a global term weight for each term occurring in the group of resumès based on a frequency of occurrences in the group of resumès;
finding a local term weight for each term identified as occurring in the group of resumès based on a frequency of occurrences in each of the selected resumès of the group of resumès;
creating a term-by-document matrix of local term weights for the terms identified as occurring in the group of resumès; and
creating a weighted term-by-document matrix by applying a global term weight to corresponding local term weight for each of the respective terms in the term-by-document matrix.
52. The method recited in claim 51 further comprises:
adding a resumè to the collection of resumès; and
generating a pseudo-resumè vector for said added resumè, said added pseudo-resumè vector being formed by utilizing said added resumè, said reduced concepts matrix and said term matrix.
53. The method recited in claim 52 above further comprises:
comparing said query pseudo-document vector to each of the pseudo-resumè vectors for each of the resumès in the collection of resumès; and
determining similarity of each of the resumès in the collection of resumès to said query document based on the comparison of said query pseudo-resumè vector to each of the pseudo-resumè vectors for each of the resumès in the collection of resumès.
54. The method recited in claim 45 further comprises:
deleting a resumè from the collection of resumès; and
eliminating a pseudo-resumè vector for the deleted resumè.
55. The method recited in claim 54 above further comprises:
comparing said query pseudo-document vector to each of the pseudo-resumè vectors for each of the resumès in the collection of resumès; and
determining similarity of each of the resumès in the collection of resumès to said query document based on the comparison of said query pseudo-document vector to each of the pseudo-resumè vectors for each of the resumès in the collection of resumès.
56. The method recited in claim 45 further comprises:
constructing an index of term-to-resumè mappings, each of said mappings correlate a term occurring in the collection of resumès to resumès in which said each term occurs; and
finding a subset of resumès in the collection of resumès in which a query term occurs by searching the inverted index for the query term.
57. The method recited in claim 56 further comprises:
generating a pseudo-resumè vector for each resumè in the subset of resumès utilizing a resumè from the subset, said term matrix, and said reduced concepts matrix;
comparing said first pseudo-resumè vector to each of the pseudo-resumè vectors for each of the resumès in the subset of resumès; and
determining resumè similarity of each of the resumès in the subset of resumès to said first resumè based on the comparison of said first pseudo-resumè vector to each of the pseudo-resumè vectors for each of the resumès in the subset of resumès.
58. A method for comparing patent disclosures for similarity by representative latent semantic analysis comprising:
generating a weighted term-by-patent matrix for a group of patent disclosures;
decomposing said weighted term-by-patent disclosure matrix into a term matrix of terms occurring in said group of patent disclosures and a reduced concepts matrix of concepts, at least a portion of said concepts indicative of latent semantics in at least one of said patent disclosures;
generating a plurality of pseudo-patent vectors utilizing said reduced concepts matrix and a plurality of patent disclosures;
generating a query pseudo-document vector utilizing a query document and said reduced concepts matrix;
comparing said query pseudo-document vector to said plurality of pseudo-patent disclosure vectors; and
ranking said plurality of patent disclosures based on similarity between said query pseudo-document vector to said plurality of pseudo-patent vectors.
59. The method recited in claim 58 above wherein generating the pseudo-document vector further comprises utilizing the term matrix.
60. The method recited in claim 58 above wherein generating the plurality of pseudo-patent vectors further comprises utilizing the term matrix.
61. The method recited in claim 58 above wherein generating the query pseudo-document vector and the plurality of pseudo-patent vectors comprise utilizing the term matrix.
62. The method recited in claim 58 above wherein said query document is a patent disclosure.
63. The method recited in claim 58 above wherein said query document is a patent application.
64. The method recited in claim 58 above wherein said query document is a list of query terms.
65. The method recited in claim 58 above wherein generating the weighted term-by-patent matrix for a group of patent disclosures further comprises:
selecting the group of patent disclosures;
identifying terms occurring in the group of patent disclosures;
finding a global term weight for each term occurring in the group of patent disclosures based on a frequency of occurrences in the group of patent disclosures;
finding a local term weight for each term identified as occurring in the group of patent disclosures based on a frequency of occurrences in each of the selected patent disclosures of the group of patent disclosures;
creating a term-by-document matrix of local term weights for the terms identified as occurring in the group of patent disclosures; and
creating a weighted term-by-document matrix by applying a global term weight to corresponding local term weight for each of the respective terms in the term-by-document matrix.
66. The method recited in claim 65 further comprises:
adding a patent disclosure to the collection of patent disclosures; and
generating a pseudo-patent vector for said added patent disclosure, said added pseudo-patent vector being formed utilizing said added patent disclosure, said reduced concepts matrix and said term matrix.
67. The method recited in claim 66 above further comprises:
comparing said query pseudo-document vector to each of the pseudo-patent vectors for each of the patent disclosures in the collection of patent disclosures; and
determining similarity of each of the patent disclosures in the collection of patent disclosures to said query document based on the comparison of said query pseudo-document vector to each of the pseudo-patent vectors for each of the patent disclosures in the collection of patent disclosures.
68. The method recited in claim 58 further comprises:
deleting a patent from the collection of patent disclosures; and
eliminating a pseudo-patent vector for the deleted patent.
69. The method recited in claim 68 above further comprises:
comparing said query pseudo-document vector to each of the pseudo-patent vectors for each of the patent disclosures in the collection of patent disclosures; and
determining similarity of each of the patent disclosures in the collection of patent disclosures to said query document based on the comparison of said query pseudo-document vector to each of the pseudo-patent vectors for each of the patent disclosures in the collection of patent disclosures.
70. The method recited in claim 58 further comprises:
constructing an index of term-to-patent mappings, each of said mappings correlating a term occurring in the collection of patent disclosures to patent disclosures in which said each term occurs; and
finding a subset of patent disclosures in the collection of patent disclosures in which a query term occurs by searching the inverted index for the query term.
71. The method recited in claim 70 further comprises:
generating a pseudo-patent vector for each patent in the subset of patent disclosures utilizing a patent from the subset, said term matrix, and said reduced concepts matrix;
comparing said first pseudo-patent vector to each of the pseudo-patent vectors for each of the patent disclosures in the subset of patent disclosures; and
determining similarity of each of the patent disclosures in the subset of patent disclosures to said first patent based on the comparison of said first pseudo-patent vector to each of the pseudo-patent vectors for each of the patent disclosures in the subset of patent disclosures.
72. A system for comparing documents for similarity by representative latent semantic analysis comprising:
a network containing a plurality of network elements;
a server connected to said network, said server comprising:
an interface for interfacing with the network;
a memory, said memory containing instructions for:
identifying terms occurring in the plurality of documents;
determining a global term weight for each identified term;
determining a local term weight for each identified term;
organizing said local term weights in a term-by-document matrix for the identified terms;
creating a weighted term-by-document matrix for global term weights and local term weight for each of the respective identified terms in the weighted term-by-document matrix;
decomposing the weighted term-by-document matrix into at least a term matrix and a reduced concepts matrix;
generating a pseudo-document vector utilizing a document, said reduced concepts matrix, and said term matrix;
comparing pseudo-document vectors; and
determining similarity of respective documents based on the comparison of their respective pseudo-document vectors; and
a processor for executing instructions from memory.
73. The system recited in claim 72 above, wherein said server further comprises:
a client connected to said network, said client passing the plurality of documents to said server.
74. The system recited in claim 72 above further comprises:
a client connected to said network, said client passing a document to said server, said client indicating to said server to generate a pseudo-document vector for said document.
75. The system recited in claim 72 above further comprises:
a client connected to said network, said client passing a plurality of documents to said server, said client indicating for said server to generate a plurality of pseudo-document vectors for respective said plurality of documents.
76. The system recited in claim 72 above, wherein said server further comprises:
a local interface for passing the plurality of documents to said server.
77. The system recited in claim 72 above, wherein said server further comprises:
a local interface for passing a document to said server, said client indicating to said server to generate a pseudo-document vector for said document.
78. The system recited in claim 72 above, wherein said server further comprises:
a local interface for passing a plurality of documents to said server, said client indicating for said server to generate a plurality of pseudo-document vectors for respective said plurality of documents.
79. A search engine method comparing documents for similarity by representative latent semantic analysis comprising:
receiving a corpus of documents;
creating a concept database from said corpus of documents, said concept database including at least a term matrix and a concepts matrix;
receiving a collection of documents, wherein at least one document in said collection of documents is not in said corpus of documents;
creating a document database from said collection of documents and said concept database, said document database including at least one pseudo-document vector for each of at least two documents in said collection of documents, said at least one pseudo-document vector being created utilizing said each of at least two documents in said collection of documents and each of said reduced concepts matrix and said term matrix of said concept database; and
determining document similarity between at least said each of at least two documents in said collection of documents using said pseudo-document vectors in said document database.
80. The system recited in claim 79 above, wherein creating a concept database from said corpus of documents further comprises:
identifying terms occurring in the corpus of documents;
finding a global term weight for each identified term based on a frequency of occurrences for said each term in the corpus of documents;
finding a local term weight for each identified term based on a frequency of occurrences in each of the plurality of documents in the corpus of documents;
organizing said local term weights in a term-by-document matrix for the identified terms;
creating a weighted term-by-document matrix by applying a global term weight to a corresponding local term weight for each of the respective identified terms in the term-by-document matrix;
determining a number of stronger concepts; and
decomposing the weighted term-by-document matrix into at least a term matrix and a concepts matrix, said concepts matrix having indices for stronger concepts.
81. A data processing system implemented program product for performing a method for comparing documents for similarity by representative latent semantic analysis, the program product comprising:
instructions for generating a weighted term-by-document matrix for a group of selected documents, said group of selected documents being representative of a similarity criteria;
instructions for decomposing said weighted term-by-document matrix into a term matrix of terms occurring in said group of selected documents and a reduced concepts matrix of concepts, at least a portion of said concepts indicative of latent semantics in at least one of said selected documents;
instructions for generating a first pseudo-document vector utilizing a first document and said reduced concepts matrix;
instructions for generating a second pseudo-document vector utilizing a second document and said reduced concepts matrix;
comparing said first pseudo-document vector to said second pseudo-document vector; and
instructions for determining similarity of said first document and said second document based on the comparison of said first pseudo-document vector to said second pseudo-document vector.
82. The program product recited in claim 81 above wherein said instructions for generating the first pseudo-document vector further comprises instructions for utilizing the term matrix.
83. The program product recited in claim 81 above wherein said instructions for generating the second pseudo-document vector further comprises instructions for utilizing the term matrix.
84. The program product recited in claim 81 above wherein said instructions for generating the first and the second pseudo-document vectors comprise instructions for utilizing the term matrix.
85. The program product recited in claim 81 above wherein the similarity criteria is one of document language, document type, subject matter and a category of a subject matter.
86. The program product recited in claim 81 above wherein said instructions for generating the weighted term-by-document matrix further comprises:
instructions for compiling the group of selected documents representative of a similarity criteria;
instructions for identifying terms occurring in the group of documents;
instructions for finding a global term weight for each identified term based on a frequency of occurrences for said each term in the group of documents;
instructions for finding a local term weight for each identified term based on a frequency of occurrences for said each term in each document in the group of documents;
instructions for organizing said local term weights in a term-by-document matrix for the identified terms; and
instructions for creating a weighted term-by-document matrix by applying a global term weight to a corresponding local term weight for each of the respective identified terms in the term-by-document matrix.
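The weighting steps recited in claim 86 can be illustrated with a minimal sketch. The claim does not fix the weighting formulas, so the choices here are assumptions made only for illustration: the local weight is taken as log(1 + term frequency), the global weight as an inverse-document-frequency style value, and the names `tokenize` and `weighted_term_document_matrix` are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lower-case and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def weighted_term_document_matrix(docs):
    """Build a weighted term-by-document matrix (rows = terms, columns = documents)."""
    counts = [Counter(tokenize(d)) for d in docs]   # per-document term frequencies
    terms = sorted(set().union(*counts))            # identified terms
    n = len(docs)
    global_w = {}
    for t in terms:
        doc_freq = sum(1 for c in counts if t in c)
        # Assumed global weight: IDF-style, larger for terms found in fewer documents.
        global_w[t] = math.log(n / doc_freq) + 1.0
    # Assumed local weight log(1 + tf), scaled by the term's global weight.
    matrix = [[math.log(1 + c[t]) * global_w[t] for c in counts] for t in terms]
    return terms, global_w, matrix

docs = ["cat sat", "cat ran", "dogs ran"]
terms, global_w, matrix = weighted_term_document_matrix(docs)
```

Each row of `matrix` corresponds to one identified term and each column to one document, so a term absent from a document contributes a zero entry.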
87. The program product recited in claim 86 above, wherein said instructions for identifying terms occurring in the group of documents further comprises instructions for eliminating stop words from the identified terms.
88. The program product recited in claim 87 above wherein said weighted term-by-document matrix is comprised of a product of said term matrix, said reduced concepts matrix and the transpose of an other matrix, wherein said term matrix is a t row by k column matrix, said reduced concepts matrix is a k row by k column matrix, and said other matrix is a v row by k column matrix.
89. The program product recited in claim 88 above, wherein said instructions for generating a first pseudo-document vector D further comprises:
instructions for finding the product of a transpose of a term vector $A_d$, said term matrix $T$, and an inverse of said reduced concepts matrix $S$, such that $D = A_d^{T} T S^{-1}$, and said term vector $A_d$ being formed from identified terms occurring in said first document, wherein said identified occurring terms are weighted locally based on frequency of occurrence in said first document and scaled by the global weight of said each of said identified occurring term.
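The product recited in claim 89 (the transposed term vector, times the term matrix T, times the inverse of the reduced concepts matrix S) can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: S is assumed to be diagonal and is passed as a list of its k diagonal values, T is a list of t rows of k values, and the function name `pseudo_document_vector` is hypothetical.

```python
def pseudo_document_vector(term_vector, term_matrix, concept_values):
    """Project a weighted term vector (length t) into the k-dimensional concept space.

    term_matrix is t rows by k columns; concept_values holds the k diagonal
    entries of the reduced concepts matrix S (assumed diagonal and non-zero).
    """
    k = len(concept_values)
    # Term vector transposed, times T: one dot product per concept column.
    projected = [
        sum(term_vector[i] * term_matrix[i][j] for i in range(len(term_vector)))
        for j in range(k)
    ]
    # Multiplying by the inverse of a diagonal S divides each component by its entry.
    return [projected[j] / concept_values[j] for j in range(k)]

T = [[1, 0], [0, 1], [0, 0]]   # toy 3-term, 2-concept term matrix
S = [2, 0.5]                   # toy diagonal of the reduced concepts matrix
D = pseudo_document_vector([4, 1, 7], T, S)
```

Because only T and S are needed, a document that was never part of the decomposed group can still be mapped into the same concept space.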
90. The program product recited in claim 89 above further comprises:
instructions for compiling a collection of documents;
instructions for generating a pseudo-document vector for each of the documents in the collection of documents, each of said pseudo-document vectors being formed utilizing a document from the collection of documents di, said reduced concepts matrix S and said term matrix T;
instructions for comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents; and
instructions for determining document similarity of each of the documents in the collection of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for the documents in the collection of documents.
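The comparison and similarity steps recited in claim 90 can be sketched as a ranking over stored pseudo-document vectors. The claim does not name a comparison measure; cosine similarity is assumed here because it is commonly used with latent semantic analysis, and the names `cosine` and `rank_by_similarity` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_similarity(first_vector, collection_vectors):
    """Return (document id, score) pairs, most similar to first_vector first."""
    scored = [(doc_id, cosine(first_vector, vec))
              for doc_id, vec in collection_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

collection = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
ranking = rank_by_similarity([1.0, 0.1], collection)
```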
91. The program product recited in claim 90 further comprises:
instructions for constructing an index of term-to-document mappings, each of said mappings correlates a term occurring in the collection of documents to documents in which said each term occurs; and
instructions for finding a subset of documents in the collection of documents in which a query term occurs by searching the inverted index for the query term.
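The term-to-document index recited in claim 91 can be sketched as an inverted index used to narrow the collection before any vector comparison. The function names are hypothetical, and returning the union of the per-term document sets for a multi-term query is an assumption; the claim itself speaks only of a query term.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index

def candidate_documents(index, query_terms):
    """Return ids of documents containing at least one of the query terms."""
    hits = set()
    for term in query_terms:
        # .get avoids creating empty entries for terms that are not indexed.
        hits |= index.get(term.lower(), set())
    return hits

docs = {
    "d1": "latent semantic analysis",
    "d2": "semantic search engine",
    "d3": "matrix decomposition",
}
index = build_inverted_index(docs)
subset = candidate_documents(index, ["semantic"])
```

Only the documents in `subset` would then need pseudo-document vectors compared against the query, which is the narrowing that claim 92 builds on.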
92. The program product recited in claim 91 further comprises:
instructions for generating a pseudo-document vector for each document in the subset of documents utilizing a document from the subset, said term matrix, and said reduced concepts matrix;
instructions for comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the subset of documents; and
instructions for determining document similarity of each of the documents in the subset of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the subset of documents.
93. The program product recited in claim 92 above, wherein the first document is a list of terms.
94. The program product recited in claim 92 above, wherein the first document is a query document.
95. The program product recited in claim 90 further comprises:
instructions for adding a document to the collection of documents; and
instructions for generating a pseudo-document vector for the added document, said added pseudo-document vector being formed utilizing said added document, said reduced concepts matrix and said term matrix.
96. The program product recited in claim 95 above further comprises:
instructions for comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents; and
instructions for determining document similarity of each of the documents in the collection of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents.
97. The program product recited in claim 90 further comprises:
instructions for deleting a document from the collection of documents; and
instructions for eliminating a pseudo-document vector for the deleted document.
98. The program product recited in claim 97 above further comprises:
instructions for comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents; and
instructions for determining document similarity of each of the documents in the collection of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents.
99. The program product recited in claim 81 above, wherein said instructions for decomposing said weighted term-by-document matrix further comprises:
instructions for establishing a predetermined quantity of strong concepts;
instructions for deriving the predetermined quantity of strong concepts from said weighted term-by-document matrix; and
instructions for ceasing decomposition.
100. The program product recited in claim 81 above, wherein said instructions for decomposing said weighted term-by-document matrix further comprises instructions for applying a singular value decomposition algorithm to said weighted term-by-document matrix.
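Claims 99 and 100 describe deriving a predetermined quantity of strong concepts and then ceasing decomposition. A minimal sketch of that idea follows. It uses power iteration with deflation as a stand-in for a truncated singular value decomposition, which is an assumption chosen to keep the example dependency-free and is not stated in the claims. The function name `top_k_concepts`, the iteration count, and the seeded random starting vector are likewise hypothetical.

```python
import math
import random

def top_k_concepts(A, k, iters=200, seed=0):
    """Return (term_matrix, concept_values) for the k strongest concepts of A.

    A is a t-by-d weighted term-by-document matrix (list of rows). term_matrix
    is t rows by up to k columns; concept_values are the matching singular values.
    """
    rng = random.Random(seed)
    t, d = len(A), len(A[0])
    R = [row[:] for row in A]            # residual, deflated as concepts are found
    columns, sigmas = [], []
    for _ in range(k):                   # predetermined quantity of concepts
        u = [rng.random() + 0.1 for _ in range(t)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
        for _ in range(iters):           # power iteration on R times R-transpose
            w = [sum(R[i][j] * u[i] for i in range(t)) for j in range(d)]
            z = [sum(R[i][j] * w[j] for j in range(d)) for i in range(t)]
            norm = math.sqrt(sum(x * x for x in z))
            if norm == 0.0:
                break
            u = [x / norm for x in z]
        w = [sum(R[i][j] * u[i] for i in range(t)) for j in range(d)]
        sigma = math.sqrt(sum(x * x for x in w))
        if sigma == 0.0:
            break                        # nothing left to decompose
        v = [x / sigma for x in w]
        for i in range(t):               # deflate: remove the concept just found
            for j in range(d):
                R[i][j] -= sigma * u[i] * v[j]
        columns.append(u)
        sigmas.append(sigma)
    term_matrix = [[col[i] for col in columns] for i in range(t)]
    return term_matrix, sigmas           # decomposition ceases after k concepts

T, S = top_k_concepts([[3.0, 0.0], [0.0, 1.0]], k=1)
```

In practice a library routine for truncated SVD would normally replace this loop; the point of the sketch is only that work stops once the requested number of concepts has been extracted.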
US10/131,888 2002-04-24 2002-04-24 Method and system for optimally searching a document database using a representative semantic space Expired - Lifetime US6847966B1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US10/131,888 US6847966B1 (en) 2002-04-24 2002-04-24 Method and system for optimally searching a document database using a representative semantic space
US11/041,799 US7483892B1 (en) 2002-04-24 2005-01-24 Method and system for optimally searching a document database using a representative semantic space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/131,888 US6847966B1 (en) 2002-04-24 2002-04-24 Method and system for optimally searching a document database using a representative semantic space

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US11/041,799 Continuation US7483892B1 (en) 2002-04-24 2005-01-24 Method and system for optimally searching a document database using a representative semantic space

Publications (1)

Publication Number Publication Date
US6847966B1 (en) 2005-01-25

Family

ID=34061474

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/131,888 Expired - Lifetime US6847966B1 (en) 2002-04-24 2002-04-24 Method and system for optimally searching a document database using a representative semantic space
US11/041,799 Expired - Lifetime US7483892B1 (en) 2002-04-24 2005-01-24 Method and system for optimally searching a document database using a representative semantic space

Family Applications After (1)

Application Number Title Priority Date Filing Date
US11/041,799 Expired - Lifetime US7483892B1 (en) 2002-04-24 2005-01-24 Method and system for optimally searching a document database using a representative semantic space

Country Status (1)

Country Link
US (2) US6847966B1 (en)

Cited By (367)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030154068A1 (en) * 2002-02-12 2003-08-14 Naoyuki Tokuda Computer-assisted memory translation scheme based on template automaton and latent semantic index principle
US20030229470A1 (en) * 2002-06-10 2003-12-11 Nenad Pejic System and method for analyzing patent-related information
US20040049505A1 (en) * 2002-09-11 2004-03-11 Kelly Pennock Textual on-line analytical processing method and system
US20040133574A1 (en) * 2003-01-07 2004-07-08 Science Applications International Corporaton Vector space method for secure information sharing
US20040139105A1 (en) * 2002-11-27 2004-07-15 Trepess David William Information storage and retrieval
US20040163044A1 (en) * 2003-02-14 2004-08-19 Nahava Inc. Method and apparatus for information factoring
US20040199546A1 (en) * 2000-01-27 2004-10-07 Manning & Napier Information Services, Llc Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
US20040243560A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including an annotation inverted file system facilitating indexing and searching
US20040243554A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis
US20040243556A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and including a document common analysis system (CAS)
US20040243557A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a weighted and (WAND)
US20040243645A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations
US20050027678A1 (en) * 2003-07-30 2005-02-03 International Business Machines Corporation Computer executable dimension reduction and retrieval engine
US20050027691A1 (en) * 2003-07-28 2005-02-03 Sergey Brin System and method for providing a user interface with search query broadening
WO2005017698A2 (en) * 2003-08-11 2005-02-24 Educational Testing Service Cooccurrence and constructions
US20050071365A1 (en) * 2003-09-26 2005-03-31 Jiang-Liang Hou Method for keyword correlation analysis
US20050080657A1 (en) * 2003-10-10 2005-04-14 Unicru, Inc. Matching job candidate information
US20050080656A1 (en) * 2003-10-10 2005-04-14 Unicru, Inc. Conceptualization of job candidate information
US20050149516A1 (en) * 2002-04-25 2005-07-07 Wolf Peter P. Method and system for retrieving documents with spoken queries
US20050149510A1 (en) * 2004-01-07 2005-07-07 Uri Shafrir Concept mining and concept discovery-semantic search tool for large digital databases
US20050165600A1 (en) * 2004-01-27 2005-07-28 Kas Kasravi System and method for comparative analysis of textual documents
US20050171948A1 (en) * 2002-12-11 2005-08-04 Knight William C. System and method for identifying critical features in an ordered scale space within a multi-dimensional feature space
US20050216516A1 (en) * 2000-05-02 2005-09-29 Textwise Llc Advertisement placement method and system using semantic analysis
US20050240394A1 (en) * 2004-04-22 2005-10-27 Hewlett-Packard Development Company, L.P. Method, system or memory storing a computer program for document processing
US20060015486A1 (en) * 2004-07-13 2006-01-19 International Business Machines Corporation Document data retrieval and reporting
US20060020593A1 (en) * 2004-06-25 2006-01-26 Mark Ramsaier Dynamic search processor
US20060036640A1 (en) * 2004-08-03 2006-02-16 Sony Corporation Information processing apparatus, information processing method, and program
US20060106767A1 (en) * 2004-11-12 2006-05-18 Fuji Xerox Co., Ltd. System and method for identifying query-relevant keywords in documents with latent semantic analysis
US20060179051A1 (en) * 2005-02-09 2006-08-10 Battelle Memorial Institute Methods and apparatus for steering the analyses of collections of documents
US20060218140A1 (en) * 2005-02-09 2006-09-28 Battelle Memorial Institute Method and apparatus for labeling in steered visual analysis of collections of documents
US20060217959A1 (en) * 2005-03-25 2006-09-28 Fuji Xerox Co., Ltd. Translation processing method, document processing device and storage medium storing program
US20070005344A1 (en) * 2005-07-01 2007-01-04 Xerox Corporation Concept matching system
US20070005343A1 (en) * 2005-07-01 2007-01-04 Xerox Corporation Concept matching
US20070050356A1 (en) * 2005-08-23 2007-03-01 Amadio William J Query construction for semantic topic indexes derived by non-negative matrix factorization
US20070067281A1 (en) * 2005-09-16 2007-03-22 Irina Matveeva Generalized latent semantic analysis
US20070073625A1 (en) * 2005-09-27 2007-03-29 Shelton Robert H System and method of licensing intellectual property assets
US20070083906A1 (en) * 2005-09-23 2007-04-12 Bharat Welingkar Content-based navigation and launching on mobile devices
US20070112755A1 (en) * 2005-11-15 2007-05-17 Thompson Kevin B Information exploration systems and method
US20070124316A1 (en) * 2005-11-29 2007-05-31 Chan John Y M Attribute selection for collaborative groupware documents using a multi-dimensional matrix
US20070164882A1 (en) * 2006-01-13 2007-07-19 Monro Donald M Identification of text
US20070168345A1 (en) * 2006-01-17 2007-07-19 Andrew Gibbs System and method of identifying subject matter experts
US20070174296A1 (en) * 2006-01-17 2007-07-26 Andrew Gibbs Method and system for distributing a database and computer program within a network
US20070185871A1 (en) * 2006-02-08 2007-08-09 Telenor Asa Document similarity scoring and ranking method, device and computer program product
US20070239707A1 (en) * 2006-04-03 2007-10-11 Collins John B Method of searching text to find relevant content
US20070250502A1 (en) * 2006-04-24 2007-10-25 Telenor Asa Method and device for efficiently ranking documents in a similarity graph
US7296009B1 (en) * 1999-07-02 2007-11-13 Telstra Corporation Limited Search system
US20070271250A1 (en) * 2005-10-19 2007-11-22 Monro Donald M Basis selection for coding and decoding of data
US20070282809A1 (en) * 2006-06-06 2007-12-06 Orland Hoeber Method and apparatus for concept-based visual
US20070288421A1 (en) * 2006-06-09 2007-12-13 Microsoft Corporation Efficient evaluation of object finder queries
US20070294232A1 (en) * 2006-06-15 2007-12-20 Andrew Gibbs System and method for analyzing patent value
US7353164B1 (en) * 2002-09-13 2008-04-01 Apple Inc. Representation of orthography in a continuous vector space
US20080082489A1 (en) * 2006-09-28 2008-04-03 International Business Machines Corporation Row Identifier List Processing Management
US20080133521A1 (en) * 2006-11-30 2008-06-05 D&S Consultants, Inc. Method and System for Image Recognition Using a Similarity Inverse Matrix
US20080129520A1 (en) * 2006-12-01 2008-06-05 Apple Computer, Inc. Electronic device with enhanced audio feedback
US20080137367A1 (en) * 2006-12-12 2008-06-12 Samsung Electronics Co., Ltd. Optical sheet and method for fabricating the same
US20080163039A1 (en) * 2006-12-29 2008-07-03 Ryan Thomas A Invariant Referencing in Digital Works
US20080168073A1 (en) * 2005-01-19 2008-07-10 Siegel Hilliard B Providing Annotations of a Digital Work
US20080195962A1 (en) * 2007-02-12 2008-08-14 Lin Daniel J Method and System for Remotely Controlling The Display of Photos in a Digital Picture Frame
US20080208840A1 (en) * 2007-02-22 2008-08-28 Microsoft Corporation Diverse Topic Phrase Extraction
US20080222146A1 (en) * 2006-05-26 2008-09-11 International Business Machines Corporation System and method for creation, representation, and delivery of document corpus entity co-occurrence information
US20080243828A1 (en) * 2007-03-29 2008-10-02 Reztlaff James R Search and Indexing on a User Device
US20080288489A1 (en) * 2005-11-02 2008-11-20 Jeong-Jin Kim Method for Searching Patent Document by Applying Degree of Similarity and System Thereof
US20080293450A1 (en) * 2007-05-21 2008-11-27 Ryan Thomas A Consumption of Items via a User Device
US20080294597A1 (en) * 2006-09-19 2008-11-27 Exalead Computer-implemented method, computer program product and system for creating an index of a subset of data
US7483892B1 (en) * 2002-04-24 2009-01-27 Kroll Ontrack, Inc. Method and system for optimally searching a document database using a representative semantic space
US20090070323A1 (en) * 2007-09-12 2009-03-12 Nishith Parikh Inference of query relationships
US7508325B2 (en) 2006-09-06 2009-03-24 Intellectual Ventures Holding 35 Llc Matching pursuits subband coding of data
US20090089314A1 (en) * 2007-09-28 2009-04-02 Cory Hicks Duplicate item detection system and method
US20090089273A1 (en) * 2007-09-27 2009-04-02 Cory Hicks System for detecting associations between items
US20090089058A1 (en) * 2007-10-02 2009-04-02 Jerome Bellegarda Part-of-speech tagging using latent analogy
EP2045731A1 (en) * 2007-10-05 2009-04-08 Fujitsu Limited Automatic generation of ontologies using word affinities
US20090094208A1 (en) * 2007-10-05 2009-04-09 Fujitsu Limited Automatically Generating A Hierarchy Of Terms
US20090119281A1 (en) * 2007-11-03 2009-05-07 Andrew Chien-Chung Wang Granular knowledge based search engine
US20090157401A1 (en) * 1999-11-12 2009-06-18 Bennett Ian M Semantic Decoding of User Queries
US20090164441A1 (en) * 2007-12-20 2009-06-25 Adam Cheyer Method and apparatus for searching using an active ontology
US20090177300A1 (en) * 2008-01-03 2009-07-09 Apple Inc. Methods and apparatus for altering audio output signals
US20090182730A1 (en) * 2008-01-14 2009-07-16 Infosys Technologies Ltd. Method for semantic based storage and retrieval of information
US20090187446A1 (en) * 2000-06-12 2009-07-23 Dewar Katrina L Computer-implemented system for human resources management
US7586424B2 (en) 2006-06-05 2009-09-08 Donald Martin Monro Data coding using an exponent and a residual
US20090240682A1 (en) * 2008-03-22 2009-09-24 International Business Machines Corporation Graph search system and method for querying loosely integrated data
US20090248682A1 (en) * 2008-04-01 2009-10-01 Certona Corporation System and method for personalized search
US20090254345A1 (en) * 2008-04-05 2009-10-08 Christopher Brian Fleizach Intelligent Text-to-Speech Conversion
US20090300006A1 (en) * 2008-05-29 2009-12-03 Accenture Global Services Gmbh Techniques for computing similarity measurements between segments representative of documents
US20090300011A1 (en) * 2007-08-09 2009-12-03 Kazutoyo Takata Contents retrieval device
US20100010982A1 (en) * 2008-07-09 2010-01-14 Broder Andrei Z Web content characterization based on semantic folksonomies associated with user generated content
US20100049708A1 (en) * 2003-07-25 2010-02-25 Kenji Kawai System And Method For Scoring Concepts In A Document Set
US20100048256A1 (en) * 2005-09-30 2010-02-25 Brian Huppi Automated Response To And Sensing Of User Activity In Portable Devices
US20100063818A1 (en) * 2008-09-05 2010-03-11 Apple Inc. Multi-tiered voice feedback in an electronic device
US20100064218A1 (en) * 2008-09-09 2010-03-11 Apple Inc. Audio user interface
US20100082349A1 (en) * 2008-09-29 2010-04-01 Apple Inc. Systems and methods for selective text to speech synthesis
US20100085221A1 (en) * 2008-10-06 2010-04-08 Donald Martin Monro Mode switched adaptive combinatorial coding/decoding for electrical computers and digital data processing systems
US20100085219A1 (en) * 2008-10-06 2010-04-08 Donald Martin Monro Combinatorial coding/decoding with specified occurrences for electrical computers and digital data processing systems
US20100085218A1 (en) * 2008-10-06 2010-04-08 Donald Martin Monro Combinatorial coding/decoding with specified occurrences for electrical computers and digital data processing systems
US20100085224A1 (en) * 2008-10-06 2010-04-08 Donald Martin Monro Adaptive combinatorial coding/decoding with specified occurrences for electrical computers and digital data processing systems
US20100094879A1 (en) * 2007-03-30 2010-04-15 Stuart Donnelly Method of detecting and responding to changes in the online community's interests in real time
US20100094840A1 (en) * 2007-03-30 2010-04-15 Stuart Donnelly Method of searching text to find relevant content and presenting advertisements to users
US7702509B2 (en) 2002-09-13 2010-04-20 Apple Inc. Unsupervised data-driven pronunciation modeling
US7707214B2 (en) 2007-02-21 2010-04-27 Donald Martin Monro Hierarchical update scheme for extremum location with indirect addressing
US7707213B2 (en) 2007-02-21 2010-04-27 Donald Martin Monro Hierarchical update scheme for extremum location
US20100131569A1 (en) * 2008-11-21 2010-05-27 Robert Marc Jamison Method & apparatus for identifying a secondary concept in a collection of documents
US20100185685A1 (en) * 2009-01-13 2010-07-22 Chew Peter A Technique for Information Retrieval Using Enhanced Latent Semantic Analysis
US7770091B2 (en) 2006-06-19 2010-08-03 Monro Donald M Data compression for use in communication systems
US20100211534A1 (en) * 2009-02-13 2010-08-19 Fujitsu Limited Efficient computation of ontology affinity matrices
US7783079B2 (en) 2006-04-07 2010-08-24 Monro Donald M Motion assisted data enhancement
USD622722S1 (en) 2009-01-27 2010-08-31 Amazon Technologies, Inc. Electronic reader device
USD624074S1 (en) 2009-05-04 2010-09-21 Amazon Technologies, Inc. Electronic reader device
US7801898B1 (en) * 2003-12-30 2010-09-21 Google Inc. Methods and systems for compressing indices
US20100262454A1 (en) * 2009-04-09 2010-10-14 SquawkSpot, Inc. System and method for sentiment-based text classification and relevancy ranking
US20100280985A1 (en) * 2008-01-14 2010-11-04 Aptima, Inc. Method and system to predict the likelihood of topics
US20100287177A1 (en) * 2009-05-06 2010-11-11 Foundationip, Llc Method, System, and Apparatus for Searching an Electronic Document Collection
US20100287148A1 (en) * 2009-05-08 2010-11-11 Cpa Global Patent Research Limited Method, System, and Apparatus for Targeted Searching of Multi-Sectional Documents within an Electronic Document Collection
WO2010134885A1 (en) * 2009-05-20 2010-11-25 Farhan Sarwar Predicting the correctness of eyewitness' statements with semantic evaluation method (sem)
US20100312547A1 (en) * 2009-06-05 2010-12-09 Apple Inc. Contextual voice commands
US20110004475A1 (en) * 2009-07-02 2011-01-06 Bellegarda Jerome R Methods and apparatuses for automatic speech recognition
US20110016118A1 (en) * 2009-07-20 2011-01-20 Lexisnexis Method and apparatus for determining relevant search results using a matrix framework
CN101079026B (en) * 2007-07-02 2011-01-26 蒙圣光 Text similarity, acceptation similarity calculating method and system and application system
US20110029529A1 (en) * 2009-07-28 2011-02-03 Knight William C System And Method For Providing A Classification Suggestion For Concepts
US20110066612A1 (en) * 2009-09-17 2011-03-17 Foundationip, Llc Method, System, and Apparatus for Delivering Query Results from an Electronic Document Collection
US20110072014A1 (en) * 2004-08-10 2011-03-24 Foundationip, Llc Patent mapping
WO2011035389A1 (en) * 2009-09-26 2011-03-31 Hamish Ogilvy Document analysis and association system and method
US20110082839A1 (en) * 2009-10-02 2011-04-07 Foundationip, Llc Generating intellectual property intelligence using a patent search engine
USD636771S1 (en) 2009-01-27 2011-04-26 Amazon Technologies, Inc. Control pad for an electronic device
US20110112825A1 (en) * 2009-11-12 2011-05-12 Jerome Bellegarda Sentiment prediction from textual data
US20110119250A1 (en) * 2009-11-16 2011-05-19 Cpa Global Patent Research Limited Forward Progress Search Platform
US7974488B2 (en) 2006-10-05 2011-07-05 Intellectual Ventures Holding 35 Llc Matching pursuits basis selection
US20110166856A1 (en) * 2010-01-06 2011-07-07 Apple Inc. Noise profile determination for voice-related feature
US20110173210A1 (en) * 2010-01-08 2011-07-14 Microsoft Corporation Identifying a topic-relevant subject
US20110184828A1 (en) * 2005-01-19 2011-07-28 Amazon Technologies, Inc. Method and system for providing annotations of a digital work
US20110208511A1 (en) * 2008-11-04 2011-08-25 Saplo Ab Method and system for analyzing text
US20110219000A1 (en) * 2008-11-26 2011-09-08 Yukitaka Kusumura Search apparatus, search method, and recording medium storing program
US20110258196A1 (en) * 2008-12-30 2011-10-20 Skjalg Lepsoy Method and system of content recommendation
US20120011139A1 (en) * 2010-07-12 2012-01-12 International Business Machines Corporation Unified numerical and semantic analytics system for decision support
US8155453B2 (en) 2004-02-13 2012-04-10 Fti Technology Llc System and method for displaying groups of cluster spines
US20120130993A1 (en) * 2005-07-27 2012-05-24 Schwegman Lundberg & Woessner, P.A. Patent mapping
US20120150531A1 (en) * 2010-12-08 2012-06-14 At&T Intellectual Property I, L.P. System and method for learning latent representations for natural language tasks
US20120259846A1 (en) * 2011-03-29 2012-10-11 Rafsky Lawrence C Method for selecting a subset of content sources from a collection of content sources
US8352449B1 (en) 2006-03-29 2013-01-08 Amazon Technologies, Inc. Reader device content indexing
US20130013612A1 (en) * 2011-07-07 2013-01-10 Software Ag Techniques for comparing and clustering documents
US8378979B2 (en) 2009-01-27 2013-02-19 Amazon Technologies, Inc. Electronic device with haptic feedback
US8386239B2 (en) 2010-01-25 2013-02-26 Holovisions LLC Multi-stage text morphing
US20130054577A1 (en) * 2011-08-23 2013-02-28 Pierre Schwob Knowledge matrix utilizing systematic contextual links
US8402395B2 (en) 2005-01-26 2013-03-19 FTI Technology, LLC System and method for providing a dynamic user interface for a dense three-dimensional scene with a plurality of compasses
US8417772B2 (en) 2007-02-12 2013-04-09 Amazon Technologies, Inc. Method and system for transferring content from the web to mobile devices
US8423889B1 (en) 2008-06-05 2013-04-16 Amazon Technologies, Inc. Device specific presentation control for electronic book reader devices
US20130158979A1 (en) * 2011-12-14 2013-06-20 Purediscovery Corporation System and Method for Identifying Phrases in Text
US20130159313A1 (en) * 2011-12-14 2013-06-20 Purediscovery Corporation Multi-Concept Latent Semantic Analysis Queries
CN103189881A (en) * 2010-08-17 2013-07-03 西格拉姆申德勒有限公司 The FSTP expert system
US20130204877A1 (en) * 2012-02-08 2013-08-08 International Business Machines Corporation Attribution using semantic analyisis
US8533195B2 (en) * 2011-06-27 2013-09-10 Microsoft Corporation Regularized latent semantic indexing for topic modeling
US8538794B2 (en) 2007-06-18 2013-09-17 Reuven A. Marko Method and apparatus for management of the creation of a patent portfolio
US8571535B1 (en) 2007-02-12 2013-10-29 Amazon Technologies, Inc. Method and system for a hosted mobile management service architecture
US8583418B2 (en) 2008-09-29 2013-11-12 Apple Inc. Systems and methods of detecting language and natural language strings for text to speech synthesis
US8612446B2 (en) 2009-08-24 2013-12-17 Fti Consulting, Inc. System and method for generating a reference set for use during document review
US8612411B1 (en) * 2003-12-31 2013-12-17 Google Inc. Clustering documents using citation patterns
US8620662B2 (en) 2007-11-20 2013-12-31 Apple Inc. Context-aware unit selection
US8620842B1 (en) 2013-03-15 2013-12-31 Gordon Villy Cormack Systems and methods for classifying electronic information using advanced active learning techniques
JP2014500528A (en) * 2010-06-03 2014-01-09 トムソン ライセンシング Enhancement of meaning using TOP-K processing
US8645137B2 (en) 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
US8660849B2 (en) 2010-01-18 2014-02-25 Apple Inc. Prioritizing selection criteria by automated assistant
US20140059035A1 (en) * 2012-08-24 2014-02-27 iCONECT Development, LLC Process for generating a composite search document used in computer-based information searching
US8670985B2 (en) 2010-01-13 2014-03-11 Apple Inc. Devices and methods for identifying a prompt corresponding to a voice input in a sequence of prompts
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US8688446B2 (en) 2008-02-22 2014-04-01 Apple Inc. Providing text input using speech data and non-speech data
US8701048B2 (en) 2005-01-26 2014-04-15 Fti Technology Llc System and method for providing a user-adjustable display of clusters and text
US8706472B2 (en) 2011-08-11 2014-04-22 Apple Inc. Method for disambiguating multiple readings in language conversion
US8713021B2 (en) 2010-07-07 2014-04-29 Apple Inc. Unsupervised document clustering using latent semantic density analysis
US8719006B2 (en) 2010-08-27 2014-05-06 Apple Inc. Combined statistical and rule-based part-of-speech tagging for text-to-speech synthesis
US8718047B2 (en) 2001-10-22 2014-05-06 Apple Inc. Text to speech conversion of text messages from mobile communication devices
US8719014B2 (en) 2010-09-27 2014-05-06 Apple Inc. Electronic device with text error correction based on voice recognition data
US8725565B1 (en) 2006-09-29 2014-05-13 Amazon Technologies, Inc. Expedited acquisition of a digital item following a sample presentation of the item
CN103838789A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Text similarity computing method
US8751238B2 (en) 2009-03-09 2014-06-10 Apple Inc. Systems and methods for determining the language to use for speech generated by a text to speech engine
US20140164388A1 (en) * 2012-12-10 2014-06-12 Microsoft Corporation Query and index over documents
US8762156B2 (en) 2011-09-28 2014-06-24 Apple Inc. Speech recognition repair using contextual information
US8775442B2 (en) 2012-05-15 2014-07-08 Apple Inc. Semantic search using a single-source semantic model
US8781836B2 (en) 2011-02-22 2014-07-15 Apple Inc. Hearing assistance system for providing consistent human speech
US8793575B1 (en) 2007-03-29 2014-07-29 Amazon Technologies, Inc. Progress indication for a digital work
US8812294B2 (en) 2011-06-21 2014-08-19 Apple Inc. Translating phrases from one language into another using an order-based set of declarative rules
US8832584B1 (en) 2009-03-31 2014-09-09 Amazon Technologies, Inc. Questions on highlighted passages
US8862252B2 (en) 2009-01-30 2014-10-14 Apple Inc. Audio user interface for displayless electronic device
CN104252465A (en) * 2013-06-26 2014-12-31 南宁明江智能科技有限公司 Method and device utilizing representative vectors to filter information
US8935167B2 (en) 2012-09-25 2015-01-13 Apple Inc. Exemplar-based latent perceptual modeling for automatic speech recognition
US8977584B2 (en) 2010-01-25 2015-03-10 Newvaluexchange Global Ai Llp Apparatuses, methods and systems for a digital conversation management platform
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US9020924B2 (en) * 2005-05-04 2015-04-28 Google Inc. Suggesting and refining user input based on original user input
US20150127650A1 (en) * 2013-11-04 2015-05-07 Ayasdi, Inc. Systems and methods for metric data smoothing
US20150169746A1 (en) * 2008-07-24 2015-06-18 Hamid Hatami-Hanza Ontological Subjects Of A Universe And Knowledge Processing Thereof
US9087032B1 (en) 2009-01-26 2015-07-21 Amazon Technologies, Inc. Aggregation of highlights
US9158741B1 (en) 2011-10-28 2015-10-13 Amazon Technologies, Inc. Indicators for navigating digital works
US9201927B1 (en) * 2009-01-07 2015-12-01 Guangsheng Zhang System and methods for quantitative assessment of information in natural language contents and for determining relevance using association data
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US9280610B2 (en) 2012-05-14 2016-03-08 Apple Inc. Crowd sourcing information to fulfill user requests
US9300784B2 (en) 2013-06-13 2016-03-29 Apple Inc. System and method for emergency calls initiated by voice command
US9311043B2 (en) 2010-01-13 2016-04-12 Apple Inc. Adaptive audio feedback system and method
US20160110186A1 (en) * 2011-09-29 2016-04-21 Accenture Global Services Limited Systems and methods for finding project-related information by clustering applications into related concept categories
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9361290B2 (en) 2014-01-18 2016-06-07 Christopher Bayan Bruss System and methodology for assessing and predicting linguistic and non-linguistic events and for providing decision support
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
US9367608B1 (en) * 2009-01-07 2016-06-14 Guangsheng Zhang System and methods for searching objects and providing answers to queries using association data
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US9495322B1 (en) 2010-09-21 2016-11-15 Amazon Technologies, Inc. Cover display
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US20160378796A1 (en) * 2015-06-23 2016-12-29 Microsoft Technology Licensing, Llc Match fix-up to remove matching documents
US9535906B2 (en) 2008-07-31 2017-01-03 Apple Inc. Mobile device having human language translation capability with positional feedback
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
US9564089B2 (en) 2009-09-28 2017-02-07 Amazon Technologies, Inc. Last screen rendering for electronic book reader
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
US9620104B2 (en) 2013-06-07 2017-04-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US9633674B2 (en) 2013-06-07 2017-04-25 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US9672533B1 (en) 2006-09-29 2017-06-06 Amazon Technologies, Inc. Acquisition of an item based on a catalog presentation of items
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US9697822B1 (en) 2013-03-15 2017-07-04 Apple Inc. System and method for updating an adaptive speech recognition model
US9711141B2 (en) 2014-12-09 2017-07-18 Apple Inc. Disambiguating heteronyms in speech synthesis
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US9733821B2 (en) 2013-03-14 2017-08-15 Apple Inc. Voice control to diagnose inadvertent activation of accessibility features
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US9798393B2 (en) 2011-08-29 2017-10-24 Apple Inc. Text correction processing
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
US9904726B2 (en) 2011-05-04 2018-02-27 Black Hills IP Holdings, LLC. Apparatus and method for automated and assisted patent claim mapping and expense planning
US9922642B2 (en) 2013-03-15 2018-03-20 Apple Inc. Training an at least partial voice command system
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9946706B2 (en) 2008-06-07 2018-04-17 Apple Inc. Automatic language identification for dynamic text processing
JP2018063596A (en) * 2016-10-13 2018-04-19 富士通株式会社 Document comparison program, document comparison method, and document comparison device
US9959870B2 (en) 2008-12-11 2018-05-01 Apple Inc. Speech recognition involving a mobile device
US9966065B2 (en) 2014-05-30 2018-05-08 Apple Inc. Multi-command single utterance input method
US9966068B2 (en) 2013-06-08 2018-05-08 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US9977779B2 (en) 2013-03-14 2018-05-22 Apple Inc. Automatic supplementation of word correction dictionaries
US10019994B2 (en) 2012-06-08 2018-07-10 Apple Inc. Systems and methods for recognizing textual identifiers within a plurality of words
US10019512B2 (en) 2011-05-27 2018-07-10 International Business Machines Corporation Automated self-service user support based on ontology analysis
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US10078631B2 (en) 2014-05-30 2018-09-18 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
US10078487B2 (en) 2013-03-15 2018-09-18 Apple Inc. Context-sensitive handling of interruptions
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US10185542B2 (en) 2013-06-09 2019-01-22 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10199051B2 (en) 2013-02-07 2019-02-05 Apple Inc. Voice trigger for a digital assistant
RU2679960C2 (en) * 2013-10-10 2019-02-14 Общество С Ограниченной Ответственностью "Яндекс" Method and system of database for locating documents
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10229117B2 (en) 2015-06-19 2019-03-12 Gordon V. Cormack Systems and methods for conducting a highly autonomous technology-assisted review classification
US10229143B2 (en) 2015-06-23 2019-03-12 Microsoft Technology Licensing, Llc Storage and retrieval of data from a bit vector search index
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US10242071B2 (en) 2015-06-23 2019-03-26 Microsoft Technology Licensing, Llc Preliminary ranker for scoring matching documents
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US10255566B2 (en) 2011-06-03 2019-04-09 Apple Inc. Generating and processing task items that represent tasks to perform
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US10289433B2 (en) 2014-05-30 2019-05-14 Apple Inc. Domain specific language for encoding assistant dialog
US10296160B2 (en) 2013-12-06 2019-05-21 Apple Inc. Method for extracting salient dialog usage from live data
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US20190179901A1 (en) * 2017-12-07 2019-06-13 Fujitsu Limited Non-transitory computer readable recording medium, specifying method, and information processing apparatus
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10354010B2 (en) * 2015-04-24 2019-07-16 Nec Corporation Information processing system, an information processing method and a computer readable storage medium
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US10372714B2 (en) * 2016-02-05 2019-08-06 International Business Machines Corporation Automated determination of document utility for a document corpus
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US10467215B2 (en) 2015-06-23 2019-11-05 Microsoft Technology Licensing, Llc Matching documents using a bit vector search index
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US20190362710A1 (en) * 2018-05-23 2019-11-28 Bank Of America Corporation Quantum technology for use with extracting intents from linguistics
US10496753B2 (en) 2010-01-18 2019-12-03 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10515147B2 (en) 2010-12-22 2019-12-24 Apple Inc. Using statistical language models for contextual lookup
US10521466B2 (en) 2016-06-11 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10546273B2 (en) 2008-10-23 2020-01-28 Black Hills Ip Holdings, Llc Patent mapping
US10553209B2 (en) 2010-01-18 2020-02-04 Apple Inc. Systems and methods for hands-free notification summaries
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US10565198B2 (en) 2015-06-23 2020-02-18 Microsoft Technology Licensing, Llc Bit vector search index using shards
US10572476B2 (en) 2013-03-14 2020-02-25 Apple Inc. Refining a search based on schedule items
US10579662B2 (en) 2013-04-23 2020-03-03 Black Hills Ip Holdings, Llc Patent claim scope evaluator
US10592095B2 (en) 2014-05-23 2020-03-17 Apple Inc. Instantaneous speaking of content on touch devices
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10614082B2 (en) 2011-10-03 2020-04-07 Black Hills Ip Holdings, Llc Patent mapping
US10614519B2 (en) 2007-12-14 2020-04-07 Consumerinfo.Com, Inc. Card registry systems and methods
US10621657B2 (en) 2008-11-05 2020-04-14 Consumerinfo.Com, Inc. Systems and methods of credit information reporting
US20200117738A1 (en) * 2018-10-11 2020-04-16 International Business Machines Corporation Semantic concept discovery over event databases
US10628448B1 (en) 2013-11-20 2020-04-21 Consumerinfo.Com, Inc. Systems and user interfaces for dynamic access of multiple remote databases and synchronization of data based on user rules
US10642999B2 (en) 2011-09-16 2020-05-05 Consumerinfo.Com, Inc. Systems and methods of identity protection and management
US10642574B2 (en) 2013-03-14 2020-05-05 Apple Inc. Device, method, and graphical user interface for outputting captions
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US10657498B2 (en) 2017-02-17 2020-05-19 Walmart Apollo, Llc Automated resume screening
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US10671749B2 (en) 2018-09-05 2020-06-02 Consumerinfo.Com, Inc. Authenticated access and aggregation database platform
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10672399B2 (en) 2011-06-03 2020-06-02 Apple Inc. Switching between text data and audio data based on a mapping
US10679605B2 (en) 2010-01-18 2020-06-09 Apple Inc. Hands-free list-reading by intelligent automated assistant
US10685398B1 (en) 2013-04-23 2020-06-16 Consumerinfo.Com, Inc. Presenting credit score information
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10698977B1 (en) 2014-12-31 2020-06-30 Guangsheng Zhang System and methods for processing fuzzy expressions in search engines and for information extraction
US10705794B2 (en) 2010-01-18 2020-07-07 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US10733164B2 (en) 2015-06-23 2020-08-04 Microsoft Technology Licensing, Llc Updating a bit vector search index
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10762293B2 (en) 2010-12-22 2020-09-01 Apple Inc. Using parts-of-speech tagging and named entity recognition for spelling correction
US10791216B2 (en) 2013-08-06 2020-09-29 Apple Inc. Auto-activating smart responses based on activities from remote devices
US10789041B2 (en) 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10798197B2 (en) 2011-07-08 2020-10-06 Consumerinfo.Com, Inc. Lifescore
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10810693B2 (en) 2005-05-27 2020-10-20 Black Hills Ip Holdings, Llc Method and apparatus for cross-referencing important IP relationships
WO2020232092A1 (en) * 2019-05-15 2020-11-19 RELX Inc. Systems and methods for generating a low-dimensional space representing similarities between patents
US10860657B2 (en) 2011-10-03 2020-12-08 Black Hills Ip Holdings, Llc Patent mapping
US10915707B2 (en) * 2017-10-20 2021-02-09 MachineVantage, Inc. Word replaceability through word vectors
US10929925B1 (en) 2013-03-14 2021-02-23 Consumerinfo.Com, Inc. System and methods for credit dispute processing, resolution, and reporting
US10963959B2 (en) 2012-11-30 2021-03-30 Consumerinfo.Com, Inc. Presentation of credit score factors
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US11012491B1 (en) 2012-11-12 2021-05-18 Consumerinfo.Com, Inc. Aggregating user web browsing data
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11068546B2 (en) 2016-06-02 2021-07-20 Nuix North America Inc. Computer-implemented system and method for analyzing clusters of coded documents
US11113759B1 (en) 2013-03-14 2021-09-07 Consumerinfo.Com, Inc. Account vulnerability alerts
US11151899B2 (en) 2013-03-15 2021-10-19 Apple Inc. User training by intelligent digital assistant
US11157872B2 (en) 2008-06-26 2021-10-26 Experian Marketing Solutions, Llc Systems and methods for providing an integrated identifier
US20210383076A1 (en) * 2018-05-11 2021-12-09 Abbtype Ltd Method of abbreviated typing and compression of texts written in languages using alphabetic scripts
US11200620B2 (en) 2011-10-13 2021-12-14 Consumerinfo.Com, Inc. Debt services candidate locator
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services
US11238656B1 (en) 2019-02-22 2022-02-01 Consumerinfo.Com, Inc. System and method for an augmented reality experience via an artificial intelligence bot
US11315179B1 (en) 2018-11-16 2022-04-26 Consumerinfo.Com, Inc. Methods and apparatuses for customized card recommendations
US11356430B1 (en) 2012-05-07 2022-06-07 Consumerinfo.Com, Inc. Storage and maintenance of personal data
US11392568B2 (en) 2015-06-23 2022-07-19 Microsoft Technology Licensing, Llc Reducing matching documents for a search query
US11461862B2 (en) 2012-08-20 2022-10-04 Black Hills Ip Holdings, Llc Analytics generation for patent portfolio management
US20220374422A1 (en) * 2019-10-23 2022-11-24 Soppra Corporation Information output device, information output method, and information output program
US20220405416A1 (en) * 2021-06-14 2022-12-22 International Business Machines Corporation Data query against an encrypted database
US20230051059A1 (en) * 2021-08-11 2023-02-16 Sap Se Relationship analysis using vector representations of database tables
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
US20230136726A1 (en) * 2021-10-29 2023-05-04 Peter A. Chew Identifying Fringe Beliefs from Text

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8312034B2 (en) 2005-06-24 2012-11-13 Purediscovery Corporation Concept bridge and method of operating the same
DE102005051617B4 (en) * 2005-10-27 2009-10-15 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Automatic, computer-based similarity calculation system for quantifying the similarity of textual expressions
US9275129B2 (en) 2006-01-23 2016-03-01 Symantec Corporation Methods and systems to efficiently find similar and near-duplicate emails and files
WO2007149216A2 (en) * 2006-06-21 2007-12-27 Information Extraction Systems An apparatus, system and method for developing tools to process natural language text
US8340957B2 (en) * 2006-08-31 2012-12-25 Waggener Edstrom Worldwide, Inc. Media content assessment and control systems
US20080288292A1 (en) * 2007-05-15 2008-11-20 Siemens Medical Solutions Usa, Inc. System and Method for Large Scale Code Classification for Medical Patient Records
US8793264B2 (en) * 2007-07-18 2014-07-29 Hewlett-Packard Development Company, L. P. Determining a subset of documents from which a particular document was derived
US7801884B2 (en) * 2007-09-19 2010-09-21 Accenture Global Services Gmbh Data mapping document design system
US7801908B2 (en) * 2007-09-19 2010-09-21 Accenture Global Services Gmbh Data mapping design tool
US7917496B2 (en) * 2007-12-14 2011-03-29 Yahoo! Inc. Method and apparatus for discovering and classifying polysemous word instances in web documents
US7957957B2 (en) * 2007-12-14 2011-06-07 Yahoo! Inc. Method and apparatus for discovering and classifying polysemous word instances in web documents
US8135715B2 (en) * 2007-12-14 2012-03-13 Yahoo! Inc. Method and apparatus for discovering and classifying polysemous word instances in web documents
US8176072B2 (en) * 2009-07-28 2012-05-08 Vulcan Technologies Llc Method and system for tag suggestion in a tag-associated data-object storage system
US8306983B2 (en) * 2009-10-26 2012-11-06 Agilex Technologies, Inc. Semantic space configuration
US9147039B2 (en) * 2010-09-15 2015-09-29 Epic Systems Corporation Hybrid query system for electronic medical records
EP2625655A4 (en) 2010-10-06 2014-04-16 Planet Data Solutions System and method for indexing electronic discovery data
US8719257B2 (en) * 2011-02-16 2014-05-06 Symantec Corporation Methods and systems for automatically generating semantic/concept searches
US9158841B2 (en) * 2011-06-15 2015-10-13 The University Of Memphis Research Foundation Methods of evaluating semantic differences, methods of identifying related sets of items in semantic spaces, and systems and computer program products for implementing the same
EP2595065B1 (en) 2011-11-15 2019-08-14 Kairos Future Group AB Categorizing data sets
US9256649B2 (en) * 2012-01-10 2016-02-09 Ut-Battelle Llc Method and system of filtering and recommending documents
US10108526B2 (en) * 2012-11-27 2018-10-23 Purdue Research Foundation Bug localization using version history
WO2016043609A1 (en) * 2014-09-18 2016-03-24 Empire Technology Development Llc Three-dimensional latent semantic analysis
US10332123B2 (en) 2015-08-27 2019-06-25 Oracle International Corporation Knowledge base search and retrieval based on document similarity
US10296637B2 (en) * 2016-08-23 2019-05-21 Stroz Friedberg, LLC System and method for query expansion using knowledge base and statistical methods in electronic search
US10444945B1 (en) 2016-10-10 2019-10-15 United Services Automobile Association Systems and methods for ingesting and parsing datasets generated from disparate data sources
US10783314B2 (en) * 2018-06-29 2020-09-22 Adobe Inc. Emphasizing key points in a speech file and structuring an associated transcription
US11023682B2 (en) * 2018-09-30 2021-06-01 International Business Machines Corporation Vector representation based on context
US11468240B2 (en) 2019-12-16 2022-10-11 Raytheon Company System and method for using machine learning supporting natural language processing analysis
US11836886B2 (en) 2021-04-15 2023-12-05 MetaConsumer, Inc. Systems and methods for capturing and processing user consumption of information
US11688035B2 (en) 2021-04-15 2023-06-27 MetaConsumer, Inc. Systems and methods for capturing user consumption of information
WO2023081684A1 (en) * 2021-11-02 2023-05-11 MetaConsumer, Inc. Systems and methods for capturing and processing user consumption of information

Citations (13)

Publication number Priority date Publication date Assignee Title
US4839853A (en) 1988-09-15 1989-06-13 Bell Communications Research, Inc. Computer information retrieval using latent semantic structure
US5301109A (en) * 1990-06-11 1994-04-05 Bell Communications Research, Inc. Computerized cross-language document retrieval using latent semantic indexing
US5642518A (en) 1993-06-18 1997-06-24 Hitachi, Ltd. Keyword assigning method and system therefor
US5675819A (en) 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
US5873056A (en) 1993-10-12 1999-02-16 The Syracuse University Natural language processing system for semantic vector representation which accounts for lexical ambiguity
US5895464A (en) 1997-04-30 1999-04-20 Eastman Kodak Company Computer program product and a method for using natural language for the description, search and retrieval of multi-media objects
US5950189A (en) 1997-01-02 1999-09-07 At&T Corp Retrieval system and method
US5953718A (en) 1997-11-12 1999-09-14 Oracle Corporation Research mode for a knowledge base search and retrieval system
US5983237A (en) 1996-03-29 1999-11-09 Virage, Inc. Visual dictionary
US6101492A (en) 1998-07-02 2000-08-08 Lucent Technologies Inc. Methods and apparatus for information indexing and retrieval as well as query expansion using morpho-syntactic analysis
US6138116A (en) 1996-08-01 2000-10-24 Canon Kabushiki Kaisha Method and apparatus for retrieving data
US6356864B1 (en) * 1997-07-25 2002-03-12 University Technology Corporation Methods for analysis and evaluation of the semantic content of a writing based on vector length
US6615208B1 (en) * 2000-09-01 2003-09-02 Telcordia Technologies, Inc. Automatic recommendation of products using latent semantic indexing of content

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US6847966B1 (en) * 2002-04-24 2005-01-25 Engenium Corporation Method and system for optimally searching a document database using a representative semantic space

Non-Patent Citations (8)

Title
Author Unknown, "Indexing and Clustering," http://www.media.mit.edu/~emnett/research/slides/sld011.htm; Date Unknown.
Author Unknown, "Latent Semantic Indexing (LSI)," Date Unknown.
Author Unknown, "Latent Semantic Indexing," Date Unknown.
Author Unknown, Slide Presentation (Slides 1-12), http://www.cs.rip,edu/~sidbel/4962/class10, Date Unknown.
Berry, Michael, "Latent Semantic Indexing," Date Unknown.
Deerwester et al., "Indexing by Latent Semantic Analysis," Date Unknown.
Foltz, Peter W., "Using Latent Semantic Indexing for Information Filtering," Date Unknown.
Kolda, T.G. et al., "A Semidiscrete Matrix Decomposition for Latent Semantic Indexing in Information Retrieval," Date Unknown.

Cited By (695)

Publication number Priority date Publication date Assignee Title
US7870118B2 (en) 1999-07-02 2011-01-11 Telstra Corporation Limited Search system
US20080133508A1 (en) * 1999-07-02 2008-06-05 Telstra Corporation Limited Search System
US7296009B1 (en) * 1999-07-02 2007-11-13 Telstra Corporation Limited Search system
US20090157401A1 (en) * 1999-11-12 2009-06-18 Bennett Ian M Semantic Decoding of User Queries
US8229734B2 (en) * 1999-11-12 2012-07-24 Phoenix Solutions, Inc. Semantic decoding of user queries
US7444356B2 (en) * 2000-01-27 2008-10-28 Manning & Napier Information Services, Llc. Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
US20080281814A1 (en) * 2000-01-27 2008-11-13 Manning & Napier Information Services, Llc Construction of trainable semantic vectors and clustering, classification, and searching using a trainable semantic vector
US20040199546A1 (en) * 2000-01-27 2004-10-07 Manning & Napier Information Services, Llc Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
US7406456B2 (en) 2000-01-27 2008-07-29 Manning & Napier Information Services, Llc. Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
US8024331B2 (en) 2000-01-27 2011-09-20 Manning & Napier Information Services, Llc. Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
US9646614B2 (en) 2000-03-16 2017-05-09 Apple Inc. Fast, language-independent method for user authentication by voice
US8645137B2 (en) 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
US7912868B2 (en) 2000-05-02 2011-03-22 Textwise Llc Advertisement placement method and system using semantic analysis
US20050216516A1 (en) * 2000-05-02 2005-09-29 Textwise Llc Advertisement placement method and system using semantic analysis
US8086558B2 (en) 2000-06-12 2011-12-27 Previsor, Inc. Computer-implemented system for human resources management
US20100042574A1 (en) * 2000-06-12 2010-02-18 Dewar Katrina L Computer-implemented system for human resources management
US20090187446A1 (en) * 2000-06-12 2009-07-23 Dewar Katrina L Computer-implemented system for human resources management
US8718047B2 (en) 2001-10-22 2014-05-06 Apple Inc. Text to speech conversion of text messages from mobile communication devices
US7124073B2 (en) * 2002-02-12 2006-10-17 Sunflare Co., Ltd Computer-assisted memory translation scheme based on template automaton and latent semantic index principle
US20030154068A1 (en) * 2002-02-12 2003-08-14 Naoyuki Tokuda Computer-assisted memory translation scheme based on template automaton and latent semantic index principle
US7483892B1 (en) * 2002-04-24 2009-01-27 Kroll Ontrack, Inc. Method and system for optimally searching a document database using a representative semantic space
US7542966B2 (en) 2002-04-25 2009-06-02 Mitsubishi Electric Research Laboratories, Inc. Method and system for retrieving documents with spoken queries
US20050149516A1 (en) * 2002-04-25 2005-07-07 Wolf Peter P. Method and system for retrieving documents with spoken queries
US20030229470A1 (en) * 2002-06-10 2003-12-11 Nenad Pejic System and method for analyzing patent-related information
US20040049505A1 (en) * 2002-09-11 2004-03-11 Kelly Pennock Textual on-line analytical processing method and system
US7702509B2 (en) 2002-09-13 2010-04-20 Apple Inc. Unsupervised data-driven pronunciation modeling
US7353164B1 (en) * 2002-09-13 2008-04-01 Apple Inc. Representation of orthography in a continuous vector space
US7668853B2 (en) * 2002-11-27 2010-02-23 Sony United Kingdom Limited Information storage and retrieval
US20040139105A1 (en) * 2002-11-27 2004-07-15 Trepess David William Information storage and retrieval
US20050171948A1 (en) * 2002-12-11 2005-08-04 Knight William C. System and method for identifying critical features in an ordered scale space within a multi-dimensional feature space
US20090328226A1 (en) * 2003-01-07 2009-12-31 Content Analyst Company, LLC Vector Space Method for Secure Information Sharing
US20040133574A1 (en) * 2003-01-07 2004-07-08 Science Applications International Corporation Vector space method for secure information sharing
US8024344B2 (en) * 2003-01-07 2011-09-20 Content Analyst Company, Llc Vector space method for secure information sharing
US20040163044A1 (en) * 2003-02-14 2004-08-19 Nahava Inc. Method and apparatus for information factoring
US8280903B2 (en) 2003-05-30 2012-10-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a Weighted AND (WAND)
US7139752B2 (en) 2003-05-30 2006-11-21 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations
US7146361B2 (en) 2003-05-30 2006-12-05 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a Weighted AND (WAND)
US20040243645A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations
US20040243556A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and including a document common analysis system (CAS)
US7512602B2 (en) 2003-05-30 2009-03-31 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a weighted and (WAND)
US20090222441A1 (en) * 2003-05-30 2009-09-03 International Business Machines Corporation System, Method and Computer Program Product for Performing Unstructured Information Management and Automatic Text Analysis, Including a Search Operator Functioning as a Weighted And (WAND)
US20040243560A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including an annotation inverted file system facilitating indexing and searching
US20040243554A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis
US20070112763A1 (en) * 2003-05-30 2007-05-17 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a weighted and (WAND)
US20040243557A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a weighted and (WAND)
US20100049708A1 (en) * 2003-07-25 2010-02-25 Kenji Kawai System And Method For Scoring Concepts In A Document Set
US8626761B2 (en) * 2003-07-25 2014-01-07 Fti Technology Llc System and method for scoring concepts in a document set
US20050027691A1 (en) * 2003-07-28 2005-02-03 Sergey Brin System and method for providing a user interface with search query broadening
US20050027678A1 (en) * 2003-07-30 2005-02-03 International Business Machines Corporation Computer executable dimension reduction and retrieval engine
US8147250B2 (en) 2003-08-11 2012-04-03 Educational Testing Service Cooccurrence and constructions
US20080183463A1 (en) * 2003-08-11 2008-07-31 Paul Deane Cooccurrence and constructions
WO2005017698A3 (en) * 2003-08-11 2005-09-01 Educational Testing Service Cooccurrence and constructions
WO2005017698A2 (en) * 2003-08-11 2005-02-24 Educational Testing Service Cooccurrence and constructions
US20050049867A1 (en) * 2003-08-11 2005-03-03 Paul Deane Cooccurrence and constructions
US7373102B2 (en) * 2003-08-11 2008-05-13 Educational Testing Service Cooccurrence and constructions
US20050071365A1 (en) * 2003-09-26 2005-03-31 Jiang-Liang Hou Method for keyword correlation analysis
US20050080656A1 (en) * 2003-10-10 2005-04-14 Unicru, Inc. Conceptualization of job candidate information
US7555441B2 (en) 2003-10-10 2009-06-30 Kronos Talent Management Inc. Conceptualization of job candidate information
US20050080657A1 (en) * 2003-10-10 2005-04-14 Unicru, Inc. Matching job candidate information
US7801898B1 (en) * 2003-12-30 2010-09-21 Google Inc. Methods and systems for compressing indices
US8060516B2 (en) 2003-12-30 2011-11-15 Google Inc. Methods and systems for compressing indices
US8549000B2 (en) 2003-12-30 2013-10-01 Google Inc. Methods and systems for compressing indices
US20110066623A1 (en) * 2003-12-30 2011-03-17 Google Inc. Methods and Systems for Compressing Indices
US8612411B1 (en) * 2003-12-31 2013-12-17 Google Inc. Clustering documents using citation patterns
US20050149510A1 (en) * 2004-01-07 2005-07-07 Uri Shafrir Concept mining and concept discovery-semantic search tool for large digital databases
US8868405B2 (en) * 2004-01-27 2014-10-21 Hewlett-Packard Development Company, L. P. System and method for comparative analysis of textual documents
US20050165600A1 (en) * 2004-01-27 2005-07-28 Kas Kasravi System and method for comparative analysis of textual documents
US8369627B2 (en) 2004-02-13 2013-02-05 Fti Technology Llc System and method for generating groups of cluster spines for display
US9619909B2 (en) 2004-02-13 2017-04-11 Fti Technology Llc Computer-implemented system and method for generating and placing cluster groups
US9245367B2 (en) 2004-02-13 2016-01-26 FTI Technology, LLC Computer-implemented system and method for building cluster spine groups
US9342909B2 (en) 2004-02-13 2016-05-17 FTI Technology, LLC Computer-implemented system and method for grafting cluster spines
US9384573B2 (en) 2004-02-13 2016-07-05 Fti Technology Llc Computer-implemented system and method for placing groups of document clusters into a display
US9082232B2 (en) 2004-02-13 2015-07-14 FTI Technology, LLC System and method for displaying cluster spine groups
US9495779B1 (en) 2004-02-13 2016-11-15 Fti Technology Llc Computer-implemented system and method for placing groups of cluster spines into a display
US8942488B2 (en) 2004-02-13 2015-01-27 FTI Technology, LLC System and method for placing spine groups within a display
US8792733B2 (en) 2004-02-13 2014-07-29 Fti Technology Llc Computer-implemented system and method for organizing cluster groups within a display
US9858693B2 (en) 2004-02-13 2018-01-02 Fti Technology Llc System and method for placing candidate spines into a display with the aid of a digital computer
US8312019B2 (en) 2004-02-13 2012-11-13 FTI Technology, LLC System and method for generating cluster spines
US8639044B2 (en) 2004-02-13 2014-01-28 Fti Technology Llc Computer-implemented system and method for placing cluster groupings into a display
US8155453B2 (en) 2004-02-13 2012-04-10 Fti Technology Llc System and method for displaying groups of cluster spines
US9984484B2 (en) 2004-02-13 2018-05-29 Fti Consulting Technology Llc Computer-implemented system and method for cluster spine group arrangement
US20050240394A1 (en) * 2004-04-22 2005-10-27 Hewlett-Packard Development Company, L.P. Method, system or memory storing a computer program for document processing
US7565361B2 (en) * 2004-04-22 2009-07-21 Hewlett-Packard Development Company, L.P. Method and system for lexical mapping between document sets having a common topic
US20090292697A1 (en) * 2004-04-22 2009-11-26 Hiromi Oda Method and system for lexical mapping between document sets having a common topic
US8065306B2 (en) * 2004-04-22 2011-11-22 Hewlett-Packard Development Company, L. P. Method and system for lexical mapping between document sets having a common topic
US20060020593A1 (en) * 2004-06-25 2006-01-26 Mark Ramsaier Dynamic search processor
US7571383B2 (en) * 2004-07-13 2009-08-04 International Business Machines Corporation Document data retrieval and reporting
US20060015486A1 (en) * 2004-07-13 2006-01-19 International Business Machines Corporation Document data retrieval and reporting
US20060036640A1 (en) * 2004-08-03 2006-02-16 Sony Corporation Information processing apparatus, information processing method, and program
US9697577B2 (en) 2004-08-10 2017-07-04 Lucid Patent Llc Patent mapping
US20110072014A1 (en) * 2004-08-10 2011-03-24 Foundationip, Llc Patent mapping
US11080807B2 (en) 2004-08-10 2021-08-03 Lucid Patent Llc Patent mapping
US11776084B2 (en) 2004-08-10 2023-10-03 Lucid Patent Llc Patent mapping
US7440947B2 (en) * 2004-11-12 2008-10-21 Fuji Xerox Co., Ltd. System and method for identifying query-relevant keywords in documents with latent semantic analysis
US20060106767A1 (en) * 2004-11-12 2006-05-18 Fuji Xerox Co., Ltd. System and method for identifying query-relevant keywords in documents with latent semantic analysis
US20080168073A1 (en) * 2005-01-19 2008-07-10 Siegel Hilliard B Providing Annotations of a Digital Work
US9275052B2 (en) 2005-01-19 2016-03-01 Amazon Technologies, Inc. Providing annotations of a digital work
US8131647B2 (en) 2005-01-19 2012-03-06 Amazon Technologies, Inc. Method and system for providing annotations of a digital work
US10853560B2 (en) 2005-01-19 2020-12-01 Amazon Technologies, Inc. Providing annotations of a digital work
US20110184828A1 (en) * 2005-01-19 2011-07-28 Amazon Technologies, Inc. Method and system for providing annotations of a digital work
US9208592B2 (en) 2005-01-26 2015-12-08 FTI Technology, LLC Computer-implemented system and method for providing a display of clusters
US8701048B2 (en) 2005-01-26 2014-04-15 Fti Technology Llc System and method for providing a user-adjustable display of clusters and text
US8402395B2 (en) 2005-01-26 2013-03-19 FTI Technology, LLC System and method for providing a dynamic user interface for a dense three-dimensional scene with a plurality of compasses
US9176642B2 (en) 2005-01-26 2015-11-03 FTI Technology, LLC Computer-implemented system and method for displaying clusters via a dynamic user interface
US20060218140A1 (en) * 2005-02-09 2006-09-28 Battelle Memorial Institute Method and apparatus for labeling in steered visual analysis of collections of documents
US20060179051A1 (en) * 2005-02-09 2006-08-10 Battelle Memorial Institute Methods and apparatus for steering the analyses of collections of documents
WO2006090600A1 (en) * 2005-02-25 2006-08-31 Mitsubishi Denki Kabushiki Kaisha Computer implemented method for indexing and retrieving documents stored in a database and system for indexing and retrieving documents
US20060217959A1 (en) * 2005-03-25 2006-09-28 Fuji Xerox Co., Ltd. Translation processing method, document processing device and storage medium storing program
US9411906B2 (en) 2005-05-04 2016-08-09 Google Inc. Suggesting and refining user input based on original user input
US9020924B2 (en) * 2005-05-04 2015-04-28 Google Inc. Suggesting and refining user input based on original user input
US11798111B2 (en) 2005-05-27 2023-10-24 Black Hills Ip Holdings, Llc Method and apparatus for cross-referencing important IP relationships
US10810693B2 (en) 2005-05-27 2020-10-20 Black Hills Ip Holdings, Llc Method and apparatus for cross-referencing important IP relationships
US7689411B2 (en) 2005-07-01 2010-03-30 Xerox Corporation Concept matching
US20070005344A1 (en) * 2005-07-01 2007-01-04 Xerox Corporation Concept matching system
US20070005343A1 (en) * 2005-07-01 2007-01-04 Xerox Corporation Concept matching
US7809551B2 (en) * 2005-07-01 2010-10-05 Xerox Corporation Concept matching system
US9659071B2 (en) 2005-07-27 2017-05-23 Schwegman Lundberg & Woessner, P.A. Patent mapping
US20120130993A1 (en) * 2005-07-27 2012-05-24 Schwegman Lundberg & Woessner, P.A. Patent mapping
US9201956B2 (en) * 2005-07-27 2015-12-01 Schwegman Lundberg & Woessner, P.A. Patent mapping
US20070050356A1 (en) * 2005-08-23 2007-03-01 Amadio William J Query construction for semantic topic indexes derived by non-negative matrix factorization
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
US9501741B2 (en) 2005-09-08 2016-11-22 Apple Inc. Method and apparatus for building an intelligent automated assistant
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US8312021B2 (en) * 2005-09-16 2012-11-13 Palo Alto Research Center Incorporated Generalized latent semantic analysis
US20070067281A1 (en) * 2005-09-16 2007-03-22 Irina Matveeva Generalized latent semantic analysis
US20070083906A1 (en) * 2005-09-23 2007-04-12 Bharat Welingkar Content-based navigation and launching on mobile devices
US7783993B2 (en) 2005-09-23 2010-08-24 Palm, Inc. Content-based navigation and launching on mobile devices
US20070073625A1 (en) * 2005-09-27 2007-03-29 Shelton Robert H System and method of licensing intellectual property assets
US9958987B2 (en) 2005-09-30 2018-05-01 Apple Inc. Automated response to and sensing of user activity in portable devices
US9389729B2 (en) 2005-09-30 2016-07-12 Apple Inc. Automated response to and sensing of user activity in portable devices
US20100048256A1 (en) * 2005-09-30 2010-02-25 Brian Huppi Automated Response To And Sensing Of User Activity In Portable Devices
US8614431B2 (en) 2005-09-30 2013-12-24 Apple Inc. Automated response to and sensing of user activity in portable devices
US9619079B2 (en) 2005-09-30 2017-04-11 Apple Inc. Automated response to and sensing of user activity in portable devices
US20070271250A1 (en) * 2005-10-19 2007-11-22 Monro Donald M Basis selection for coding and decoding of data
US20080288489A1 (en) * 2005-11-02 2008-11-20 Jeong-Jin Kim Method for Searching Patent Document by Applying Degree of Similarity and System Thereof
US7676463B2 (en) * 2005-11-15 2010-03-09 Kroll Ontrack, Inc. Information exploration systems and method
US20070112755A1 (en) * 2005-11-15 2007-05-17 Thompson Kevin B Information exploration systems and method
WO2007059225A2 (en) * 2005-11-15 2007-05-24 Engenium Corporation Information exploration systems and methods
WO2007059225A3 (en) * 2005-11-15 2009-05-07 Engenium Corp Information exploration systems and methods
US20070124316A1 (en) * 2005-11-29 2007-05-31 Chan John Y M Attribute selection for collaborative groupware documents using a multi-dimensional matrix
US8674855B2 (en) 2006-01-13 2014-03-18 Essex Pa, L.L.C. Identification of text
US20070164882A1 (en) * 2006-01-13 2007-07-19 Monro Donald M Identification of text
US20070174296A1 (en) * 2006-01-17 2007-07-26 Andrew Gibbs Method and system for distributing a database and computer program within a network
US20070168345A1 (en) * 2006-01-17 2007-07-19 Andrew Gibbs System and method of identifying subject matter experts
US20070185871A1 (en) * 2006-02-08 2007-08-09 Telenor Asa Document similarity scoring and ranking method, device and computer program product
US7689559B2 (en) 2006-02-08 2010-03-30 Telenor Asa Document similarity scoring and ranking method, device and computer program product
WO2007091896A1 (en) * 2006-02-08 2007-08-16 Telenor Asa Document similarity scoring and ranking method, device and computer program product
US7844595B2 (en) 2006-02-08 2010-11-30 Telenor Asa Document similarity scoring and ranking method, device and computer program product
US20100131515A1 (en) * 2006-02-08 2010-05-27 Telenor Asa Document similarity scoring and ranking method, device and computer program product
US8352449B1 (en) 2006-03-29 2013-01-08 Amazon Technologies, Inc. Reader device content indexing
US8019754B2 (en) 2006-04-03 2011-09-13 Needlebot Incorporated Method of searching text to find relevant content
US20070239707A1 (en) * 2006-04-03 2007-10-11 Collins John B Method of searching text to find relevant content
US7783079B2 (en) 2006-04-07 2010-08-24 Monro Donald M Motion assisted data enhancement
US7752198B2 (en) * 2006-04-24 2010-07-06 Telenor Asa Method and device for efficiently ranking documents in a similarity graph
US20070250502A1 (en) * 2006-04-24 2007-10-25 Telenor Asa Method and device for efficiently ranking documents in a similarity graph
US20080222146A1 (en) * 2006-05-26 2008-09-11 International Business Machines Corporation System and method for creation, representation, and delivery of document corpus entity co-occurrence information
US7586424B2 (en) 2006-06-05 2009-09-08 Donald Martin Monro Data coding using an exponent and a residual
US7809717B1 (en) * 2006-06-06 2010-10-05 University Of Regina Method and apparatus for concept-based visual presentation of search results
US20070282809A1 (en) * 2006-06-06 2007-12-06 Orland Hoeber Method and apparatus for concept-based visual
US20070288421A1 (en) * 2006-06-09 2007-12-13 Microsoft Corporation Efficient evaluation of object finder queries
US7730060B2 (en) * 2006-06-09 2010-06-01 Microsoft Corporation Efficient evaluation of object finder queries
US20070294232A1 (en) * 2006-06-15 2007-12-20 Andrew Gibbs System and method for analyzing patent value
US7770091B2 (en) 2006-06-19 2010-08-03 Monro Donald M Data compression for use in communication systems
US7508325B2 (en) 2006-09-06 2009-03-24 Intellectual Ventures Holding 35 Llc Matching pursuits subband coding of data
US9117447B2 (en) 2006-09-08 2015-08-25 Apple Inc. Using event alert text as input to an automated assistant
US8942986B2 (en) 2006-09-08 2015-01-27 Apple Inc. Determining user intent based on ontologies of domains
US8930191B2 (en) 2006-09-08 2015-01-06 Apple Inc. Paraphrasing of user requests and results by automated digital assistant
US8010501B2 (en) * 2006-09-19 2011-08-30 Exalead Computer-implemented method, computer program product and system for creating an index of a subset of data
US20080294597A1 (en) * 2006-09-19 2008-11-27 Exalead Computer-implemented method, computer program product and system for creating an index of a subset of data
US7895185B2 (en) * 2006-09-28 2011-02-22 International Business Machines Corporation Row-identifier list processing management
US20080082489A1 (en) * 2006-09-28 2008-04-03 International Business Machines Corporation Row Identifier List Processing Management
US8725565B1 (en) 2006-09-29 2014-05-13 Amazon Technologies, Inc. Expedited acquisition of a digital item following a sample presentation of the item
US9672533B1 (en) 2006-09-29 2017-06-06 Amazon Technologies, Inc. Acquisition of an item based on a catalog presentation of items
US9292873B1 (en) 2006-09-29 2016-03-22 Amazon Technologies, Inc. Expedited acquisition of a digital item following a sample presentation of the item
US7974488B2 (en) 2006-10-05 2011-07-05 Intellectual Ventures Holding 35 Llc Matching pursuits basis selection
US8184921B2 (en) 2006-10-05 2012-05-22 Intellectual Ventures Holding 35 Llc Matching pursuits basis selection
US20080133521A1 (en) * 2006-11-30 2008-06-05 D&S Consultants, Inc. Method and System for Image Recognition Using a Similarity Inverse Matrix
US7921120B2 (en) * 2006-11-30 2011-04-05 D&S Consultants Method and system for image recognition using a similarity inverse matrix
US20080129520A1 (en) * 2006-12-01 2008-06-05 Apple Computer, Inc. Electronic device with enhanced audio feedback
US20080137367A1 (en) * 2006-12-12 2008-06-12 Samsung Electronics Co., Ltd. Optical sheet and method for fabricating the same
US9116657B1 (en) 2006-12-29 2015-08-25 Amazon Technologies, Inc. Invariant referencing in digital works
US7865817B2 (en) 2006-12-29 2011-01-04 Amazon Technologies, Inc. Invariant referencing in digital works
US20080163039A1 (en) * 2006-12-29 2008-07-03 Ryan Thomas A Invariant Referencing in Digital Works
US8417772B2 (en) 2007-02-12 2013-04-09 Amazon Technologies, Inc. Method and system for transferring content from the web to mobile devices
US9313296B1 (en) 2007-02-12 2016-04-12 Amazon Technologies, Inc. Method and system for a hosted mobile management service architecture
US20080195962A1 (en) * 2007-02-12 2008-08-14 Lin Daniel J Method and System for Remotely Controlling The Display of Photos in a Digital Picture Frame
US8571535B1 (en) 2007-02-12 2013-10-29 Amazon Technologies, Inc. Method and system for a hosted mobile management service architecture
US9219797B2 (en) 2007-02-12 2015-12-22 Amazon Technologies, Inc. Method and system for a hosted mobile management service architecture
US7707214B2 (en) 2007-02-21 2010-04-27 Donald Martin Monro Hierarchical update scheme for extremum location with indirect addressing
US7707213B2 (en) 2007-02-21 2010-04-27 Donald Martin Monro Hierarchical update scheme for extremum location
US20080208840A1 (en) * 2007-02-22 2008-08-28 Microsoft Corporation Diverse Topic Phrase Extraction
US8280877B2 (en) * 2007-02-22 2012-10-02 Microsoft Corporation Diverse topic phrase extraction
US8954444B1 (en) 2007-03-29 2015-02-10 Amazon Technologies, Inc. Search and indexing on a user device
WO2008121589A3 (en) * 2007-03-29 2008-12-31 Amazon Tech Inc Searching and indexing on a user device
US9665529B1 (en) 2007-03-29 2017-05-30 Amazon Technologies, Inc. Relative progress and event indicators
WO2008121589A2 (en) * 2007-03-29 2008-10-09 Amazon Technologies, Inc. Searching and indexing on a user device
US20080243828A1 (en) * 2007-03-29 2008-10-02 Reztlaff James R Search and Indexing on a User Device
US8793575B1 (en) 2007-03-29 2014-07-29 Amazon Technologies, Inc. Progress indication for a digital work
US7716224B2 (en) * 2007-03-29 2010-05-11 Amazon Technologies, Inc. Search and indexing on a user device
US8275773B2 (en) 2007-03-30 2012-09-25 Stuart Donnelly Method of searching text to find relevant content
US20100094879A1 (en) * 2007-03-30 2010-04-15 Stuart Donnelly Method of detecting and responding to changes in the online community's interests in real time
US20100094840A1 (en) * 2007-03-30 2010-04-15 Stuart Donnelly Method of searching text to find relevant content and presenting advertisements to users
US8271476B2 (en) 2007-03-30 2012-09-18 Stuart Donnelly Method of searching text to find user community changes of interest and drug side effect upsurges, and presenting advertisements to users
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US10568032B2 (en) 2007-04-03 2020-02-18 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US7853900B2 (en) 2007-05-21 2010-12-14 Amazon Technologies, Inc. Animations
US8965807B1 (en) 2007-05-21 2015-02-24 Amazon Technologies, Inc. Selecting and providing items in a media consumption system
US9178744B1 (en) 2007-05-21 2015-11-03 Amazon Technologies, Inc. Delivery of items for consumption by a user device
US8656040B1 (en) 2007-05-21 2014-02-18 Amazon Technologies, Inc. Providing user-supplied items to a user device
US8341513B1 (en) 2007-05-21 2012-12-25 Amazon.Com Inc. Incremental updates of items
US9888005B1 (en) 2007-05-21 2018-02-06 Amazon Technologies, Inc. Delivery of items for consumption by a user device
US8700005B1 (en) 2007-05-21 2014-04-15 Amazon Technologies, Inc. Notification of a user device to perform an action
US8990215B1 (en) 2007-05-21 2015-03-24 Amazon Technologies, Inc. Obtaining and verifying search indices
US8234282B2 (en) 2007-05-21 2012-07-31 Amazon Technologies, Inc. Managing status of search index generation
US8266173B1 (en) * 2007-05-21 2012-09-11 Amazon Technologies, Inc. Search results generation and sorting
US8341210B1 (en) 2007-05-21 2012-12-25 Amazon Technologies, Inc. Delivery of items for consumption by a user device
US20080293450A1 (en) * 2007-05-21 2008-11-27 Ryan Thomas A Consumption of Items via a User Device
US20080294674A1 (en) * 2007-05-21 2008-11-27 Reztlaff Ii James R Managing Status of Search Index Generation
US20080295039A1 (en) * 2007-05-21 2008-11-27 Laurent An Minh Nguyen Animations
US7921309B1 (en) 2007-05-21 2011-04-05 Amazon Technologies Systems and methods for determining and managing the power remaining in a handheld electronic device
US9479591B1 (en) 2007-05-21 2016-10-25 Amazon Technologies, Inc. Providing user-supplied items to a user device
US9568984B1 (en) 2007-05-21 2017-02-14 Amazon Technologies, Inc. Administrative tasks in a media consumption system
US8538794B2 (en) 2007-06-18 2013-09-17 Reuven A. Marko Method and apparatus for management of the creation of a patent portfolio
CN101079026B (en) * 2007-07-02 2011-01-26 蒙圣光 Text similarity, acceptation similarity calculating method and system and application system
US7831610B2 (en) * 2007-08-09 2010-11-09 Panasonic Corporation Contents retrieval device for retrieving contents that user wishes to view from among a plurality of contents
US20090300011A1 (en) * 2007-08-09 2009-12-03 Kazutoyo Takata Contents retrieval device
US20090070323A1 (en) * 2007-09-12 2009-03-12 Nishith Parikh Inference of query relationships
US8086620B2 (en) 2007-09-12 2011-12-27 Ebay Inc. Inference of query relationships
US10055484B2 (en) 2007-09-12 2018-08-21 Ebay Inc. Inference of query relationships based on retrieved attributes
US8655868B2 (en) * 2007-09-12 2014-02-18 Ebay Inc. Inference of query relationships based on retrieved attributes
US20090070299A1 (en) * 2007-09-12 2009-03-12 Nishith Parikh Inference of query relationships based on retrieved attributes
US9330201B2 (en) 2007-09-12 2016-05-03 Ebay Inc. Inference of query relationships based on retrieved attributes
US9311046B1 (en) 2007-09-27 2016-04-12 Amazon Technologies, Inc. System for detecting associations between items
US20090089273A1 (en) * 2007-09-27 2009-04-02 Cory Hicks System for detecting associations between items
US7779040B2 (en) * 2007-09-27 2010-08-17 Amazon Technologies, Inc. System for detecting associations between items
US7827186B2 (en) 2007-09-28 2010-11-02 Amazon Technologies, Inc. Duplicate item detection system and method
US20090089314A1 (en) * 2007-09-28 2009-04-02 Cory Hicks Duplicate item detection system and method
US9053089B2 (en) 2007-10-02 2015-06-09 Apple Inc. Part-of-speech tagging using latent analogy
US20090089058A1 (en) * 2007-10-02 2009-04-02 Jerome Bellegarda Part-of-speech tagging using latent analogy
US8171029B2 (en) 2007-10-05 2012-05-01 Fujitsu Limited Automatic generation of ontologies using word affinities
US20090094208A1 (en) * 2007-10-05 2009-04-09 Fujitsu Limited Automatically Generating A Hierarchy Of Terms
EP2045731A1 (en) * 2007-10-05 2009-04-08 Fujitsu Limited Automatic generation of ontologies using word affinities
US20090094262A1 (en) * 2007-10-05 2009-04-09 Fujitsu Limited Automatic Generation Of Ontologies Using Word Affinities
US8332439B2 (en) 2007-10-05 2012-12-11 Fujitsu Limited Automatically generating a hierarchy of terms
CN101430695B (en) * 2007-10-05 2012-06-06 富士通株式会社 System and method for computing difference affinities of word
US20090119281A1 (en) * 2007-11-03 2009-05-07 Andrew Chien-Chung Wang Granular knowledge based search engine
US8620662B2 (en) 2007-11-20 2013-12-31 Apple Inc. Context-aware unit selection
US10614519B2 (en) 2007-12-14 2020-04-07 Consumerinfo.Com, Inc. Card registry systems and methods
US10878499B2 (en) 2007-12-14 2020-12-29 Consumerinfo.Com, Inc. Card registry systems and methods
US11379916B1 (en) 2007-12-14 2022-07-05 Consumerinfo.Com, Inc. Card registry systems and methods
US11023513B2 (en) 2007-12-20 2021-06-01 Apple Inc. Method and apparatus for searching using an active ontology
US10002189B2 (en) 2007-12-20 2018-06-19 Apple Inc. Method and apparatus for searching using an active ontology
US20090164441A1 (en) * 2007-12-20 2009-06-25 Adam Cheyer Method and apparatus for searching using an active ontology
US20090177300A1 (en) * 2008-01-03 2009-07-09 Apple Inc. Methods and apparatus for altering audio output signals
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US9165254B2 (en) 2008-01-14 2015-10-20 Aptima, Inc. Method and system to predict the likelihood of topics
US20100280985A1 (en) * 2008-01-14 2010-11-04 Aptima, Inc. Method and system to predict the likelihood of topics
US20090182730A1 (en) * 2008-01-14 2009-07-16 Infosys Technologies Ltd. Method for semantic based storage and retrieval of information
US7870133B2 (en) * 2008-01-14 2011-01-11 Infosys Technologies Ltd. Method for semantic based storage and retrieval of information
US9361886B2 (en) 2008-02-22 2016-06-07 Apple Inc. Providing text input using speech data and non-speech data
US8688446B2 (en) 2008-02-22 2014-04-01 Apple Inc. Providing text input using speech data and non-speech data
US8326847B2 (en) * 2008-03-22 2012-12-04 International Business Machines Corporation Graph search system and method for querying loosely integrated data
US20090240682A1 (en) * 2008-03-22 2009-09-24 International Business Machines Corporation Graph search system and method for querying loosely integrated data
US20090248682A1 (en) * 2008-04-01 2009-10-01 Certona Corporation System and method for personalized search
US8903811B2 (en) * 2008-04-01 2014-12-02 Certona Corporation System and method for personalized search
US9626955B2 (en) 2008-04-05 2017-04-18 Apple Inc. Intelligent text-to-speech conversion
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US20090254345A1 (en) * 2008-04-05 2009-10-08 Christopher Brian Fleizach Intelligent Text-to-Speech Conversion
US20090300006A1 (en) * 2008-05-29 2009-12-03 Accenture Global Services Gmbh Techniques for computing similarity measurements between segments representative of documents
US8166049B2 (en) * 2008-05-29 2012-04-24 Accenture Global Services Limited Techniques for computing similarity measurements between segments representative of documents
US8423889B1 (en) 2008-06-05 2013-04-16 Amazon Technologies, Inc. Device specific presentation control for electronic book reader devices
US9946706B2 (en) 2008-06-07 2018-04-17 Apple Inc. Automatic language identification for dynamic text processing
US11157872B2 (en) 2008-06-26 2021-10-26 Experian Marketing Solutions, Llc Systems and methods for providing an integrated identifier
US11769112B2 (en) 2008-06-26 2023-09-26 Experian Marketing Solutions, Llc Systems and methods for providing an integrated identifier
US20100010982A1 (en) * 2008-07-09 2010-01-14 Broder Andrei Z Web content characterization based on semantic folksonomies associated with user generated content
US9679030B2 (en) * 2008-07-24 2017-06-13 Hamid Hatami-Hanza Ontological subjects of a universe and knowledge processing thereof
US20150169746A1 (en) * 2008-07-24 2015-06-18 Hamid Hatami-Hanza Ontological Subjects Of A Universe And Knowledge Processing Thereof
US9535906B2 (en) 2008-07-31 2017-01-03 Apple Inc. Mobile device having human language translation capability with positional feedback
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US9691383B2 (en) 2008-09-05 2017-06-27 Apple Inc. Multi-tiered voice feedback in an electronic device
US8768702B2 (en) 2008-09-05 2014-07-01 Apple Inc. Multi-tiered voice feedback in an electronic device
US20100063818A1 (en) * 2008-09-05 2010-03-11 Apple Inc. Multi-tiered voice feedback in an electronic device
US8898568B2 (en) 2008-09-09 2014-11-25 Apple Inc. Audio user interface
US20100064218A1 (en) * 2008-09-09 2010-03-11 Apple Inc. Audio user interface
US8583418B2 (en) 2008-09-29 2013-11-12 Apple Inc. Systems and methods of detecting language and natural language strings for text to speech synthesis
US20100082349A1 (en) * 2008-09-29 2010-04-01 Apple Inc. Systems and methods for selective text to speech synthesis
US8712776B2 (en) 2008-09-29 2014-04-29 Apple Inc. Systems and methods for selective text to speech synthesis
US8762469B2 (en) 2008-10-02 2014-06-24 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US8713119B2 (en) 2008-10-02 2014-04-29 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10643611B2 (en) 2008-10-02 2020-05-05 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11348582B2 (en) 2008-10-02 2022-05-31 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US9412392B2 (en) 2008-10-02 2016-08-09 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US20100085224A1 (en) * 2008-10-06 2010-04-08 Donald Martin Monro Adaptive combinatorial coding/decoding with specified occurrences for electrical computers and digital data processing systems
US20100085218A1 (en) * 2008-10-06 2010-04-08 Donald Martin Monro Combinatorial coding/decoding with specified occurrences for electrical computers and digital data processing systems
US7791513B2 (en) 2008-10-06 2010-09-07 Donald Martin Monro Adaptive combinatorial coding/decoding with specified occurrences for electrical computers and digital data processing systems
US20100085221A1 (en) * 2008-10-06 2010-04-08 Donald Martin Monro Mode switched adaptive combinatorial coding/decoding for electrical computers and digital data processing systems
US7864086B2 (en) 2008-10-06 2011-01-04 Donald Martin Monro Mode switched adaptive combinatorial coding/decoding for electrical computers and digital data processing systems
US7786907B2 (en) 2008-10-06 2010-08-31 Donald Martin Monro Combinatorial coding/decoding with specified occurrences for electrical computers and digital data processing systems
US20100085219A1 (en) * 2008-10-06 2010-04-08 Donald Martin Monro Combinatorial coding/decoding with specified occurrences for electrical computers and digital data processing systems
US7786903B2 (en) 2008-10-06 2010-08-31 Donald Martin Monro Combinatorial coding/decoding with specified occurrences for electrical computers and digital data processing systems
US11301810B2 (en) 2008-10-23 2022-04-12 Black Hills Ip Holdings, Llc Patent mapping
US10546273B2 (en) 2008-10-23 2020-01-28 Black Hills Ip Holdings, Llc Patent mapping
EP2353108A4 (en) * 2008-11-04 2018-01-03 Strossle International AB Method and system for analyzing text
US8788261B2 (en) * 2008-11-04 2014-07-22 Saplo Ab Method and system for analyzing text
US20110208511A1 (en) * 2008-11-04 2011-08-25 Saplo Ab Method and system for analyzing text
US9292491B2 (en) * 2008-11-04 2016-03-22 Strossle International Ab Method and system for analyzing text
US20140309989A1 (en) * 2008-11-04 2014-10-16 Saplo Ab Method and system for analyzing text
US10621657B2 (en) 2008-11-05 2020-04-14 Consumerinfo.Com, Inc. Systems and methods of credit information reporting
US20110131228A1 (en) * 2008-11-21 2011-06-02 Emptoris, Inc. Method & apparatus for identifying a secondary concept in a collection of documents
US20100131569A1 (en) * 2008-11-21 2010-05-27 Robert Marc Jamison Method & apparatus for identifying a secondary concept in a collection of documents
US20110219000A1 (en) * 2008-11-26 2011-09-08 Yukitaka Kusumura Search apparatus, search method, and recording medium storing program
US8892574B2 (en) * 2008-11-26 2014-11-18 Nec Corporation Search apparatus, search method, and non-transitory computer readable medium storing program that input a query representing a subset of a document set stored to a document database and output a keyword that often appears in the subset
US9959870B2 (en) 2008-12-11 2018-05-01 Apple Inc. Speech recognition involving a mobile device
US20110258196A1 (en) * 2008-12-30 2011-10-20 Skjalg Lepsoy Method and system of content recommendation
US9311391B2 (en) * 2008-12-30 2016-04-12 Telecom Italia S.P.A. Method and system of content recommendation
US9201927B1 (en) * 2009-01-07 2015-12-01 Guangsheng Zhang System and methods for quantitative assessment of information in natural language contents and for determining relevance using association data
US9367608B1 (en) * 2009-01-07 2016-06-14 Guangsheng Zhang System and methods for searching objects and providing answers to queries using association data
US20100185685A1 (en) * 2009-01-13 2010-07-22 Chew Peter A Technique for Information Retrieval Using Enhanced Latent Semantic Analysis
US8290961B2 (en) * 2009-01-13 2012-10-16 Sandia Corporation Technique for information retrieval using enhanced latent semantic analysis generating rank approximation matrix by factorizing the weighted morpheme-by-document matrix
US9087032B1 (en) 2009-01-26 2015-07-21 Amazon Technologies, Inc. Aggregation of highlights
USD636771S1 (en) 2009-01-27 2011-04-26 Amazon Technologies, Inc. Control pad for an electronic device
US8378979B2 (en) 2009-01-27 2013-02-19 Amazon Technologies, Inc. Electronic device with haptic feedback
USD622722S1 (en) 2009-01-27 2010-08-31 Amazon Technologies, Inc. Electronic reader device
US8862252B2 (en) 2009-01-30 2014-10-14 Apple Inc. Audio user interface for displayless electronic device
US20100211534A1 (en) * 2009-02-13 2010-08-19 Fujitsu Limited Efficient computation of ontology affinity matrices
US8554696B2 (en) 2009-02-13 2013-10-08 Fujitsu Limited Efficient computation of ontology affinity matrices
US8751238B2 (en) 2009-03-09 2014-06-10 Apple Inc. Systems and methods for determining the language to use for speech generated by a text to speech engine
US8832584B1 (en) 2009-03-31 2014-09-09 Amazon Technologies, Inc. Questions on highlighted passages
US8166032B2 (en) * 2009-04-09 2012-04-24 MarketChorus, Inc. System and method for sentiment-based text classification and relevancy ranking
US20100262454A1 (en) * 2009-04-09 2010-10-14 SquawkSpot, Inc. System and method for sentiment-based text classification and relevancy ranking
USD624074S1 (en) 2009-05-04 2010-09-21 Amazon Technologies, Inc. Electronic reader device
US20100287177A1 (en) * 2009-05-06 2010-11-11 Foundationip, Llc Method, System, and Apparatus for Searching an Electronic Document Collection
US20100287148A1 (en) * 2009-05-08 2010-11-11 Cpa Global Patent Research Limited Method, System, and Apparatus for Targeted Searching of Multi-Sectional Documents within an Electronic Document Collection
WO2010134885A1 (en) * 2009-05-20 2010-11-25 Farhan Sarwar Predicting the correctness of eyewitness' statements with semantic evaluation method (sem)
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US10475446B2 (en) 2009-06-05 2019-11-12 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US10540976B2 (en) 2009-06-05 2020-01-21 Apple Inc. Contextual voice commands
US20100312547A1 (en) * 2009-06-05 2010-12-09 Apple Inc. Contextual voice commands
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
US10795541B2 (en) 2009-06-05 2020-10-06 Apple Inc. Intelligent organization of tasks items
US20110004475A1 (en) * 2009-07-02 2011-01-06 Bellegarda Jerome R Methods and apparatuses for automatic speech recognition
US9431006B2 (en) 2009-07-02 2016-08-30 Apple Inc. Methods and apparatuses for automatic speech recognition
US10283110B2 (en) 2009-07-02 2019-05-07 Apple Inc. Methods and apparatuses for automatic speech recognition
US20110016118A1 (en) * 2009-07-20 2011-01-20 Lexisnexis Method and apparatus for determining relevant search results using a matrix framework
US8478749B2 (en) * 2009-07-20 2013-07-02 Lexisnexis, A Division Of Reed Elsevier Inc. Method and apparatus for determining relevant search results using a matrix framework
US9679049B2 (en) 2009-07-28 2017-06-13 Fti Consulting, Inc. System and method for providing visual suggestions for document classification via injection
US9898526B2 (en) 2009-07-28 2018-02-20 Fti Consulting, Inc. Computer-implemented system and method for inclusion-based electronically stored information item cluster visual representation
US8909647B2 (en) 2009-07-28 2014-12-09 Fti Consulting, Inc. System and method for providing classification suggestions using document injection
US8515957B2 (en) 2009-07-28 2013-08-20 Fti Consulting, Inc. System and method for displaying relationships between electronically stored information to provide classification suggestions via injection
US8713018B2 (en) 2009-07-28 2014-04-29 Fti Consulting, Inc. System and method for displaying relationships between electronically stored information to provide classification suggestions via inclusion
US9165062B2 (en) 2009-07-28 2015-10-20 Fti Consulting, Inc. Computer-implemented system and method for visual document classification
US8572084B2 (en) 2009-07-28 2013-10-29 Fti Consulting, Inc. System and method for displaying relationships between electronically stored information to provide classification suggestions via nearest neighbor
US9336303B2 (en) 2009-07-28 2016-05-10 Fti Consulting, Inc. Computer-implemented system and method for providing visual suggestions for cluster classification
US10083396B2 (en) 2009-07-28 2018-09-25 Fti Consulting, Inc. Computer-implemented system and method for assigning concept classification suggestions
US9477751B2 (en) 2009-07-28 2016-10-25 Fti Consulting, Inc. System and method for displaying relationships between concepts to provide classification suggestions via injection
US9064008B2 (en) 2009-07-28 2015-06-23 Fti Consulting, Inc. Computer-implemented system and method for displaying visual classification suggestions for concepts
US20110029529A1 (en) * 2009-07-28 2011-02-03 Knight William C System And Method For Providing A Classification Suggestion For Concepts
US8700627B2 (en) 2009-07-28 2014-04-15 Fti Consulting, Inc. System and method for displaying relationships between concepts to provide classification suggestions via inclusion
US9542483B2 (en) 2009-07-28 2017-01-10 Fti Consulting, Inc. Computer-implemented system and method for visually suggesting classification for inclusion-based cluster spines
US8635223B2 (en) 2009-07-28 2014-01-21 Fti Consulting, Inc. System and method for providing a classification suggestion for electronically stored information
US8645378B2 (en) 2009-07-28 2014-02-04 Fti Consulting, Inc. System and method for displaying relationships between concepts to provide classification suggestions via nearest neighbor
US8515958B2 (en) 2009-07-28 2013-08-20 Fti Consulting, Inc. System and method for providing a classification suggestion for concepts
US8612446B2 (en) 2009-08-24 2013-12-17 Fti Consulting, Inc. System and method for generating a reference set for use during document review
US9275344B2 (en) 2009-08-24 2016-03-01 Fti Consulting, Inc. Computer-implemented system and method for generating a reference set via seed documents
US10332007B2 (en) 2009-08-24 2019-06-25 Nuix North America Inc. Computer-implemented system and method for generating document training sets
US9336496B2 (en) 2009-08-24 2016-05-10 Fti Consulting, Inc. Computer-implemented system and method for generating a reference set via clustering
US9489446B2 (en) 2009-08-24 2016-11-08 Fti Consulting, Inc. Computer-implemented system and method for generating a training set for use during document review
US20110066612A1 (en) * 2009-09-17 2011-03-17 Foundationip, Llc Method, System, and Apparatus for Delivering Query Results from an Electronic Document Collection
US8364679B2 (en) 2009-09-17 2013-01-29 Cpa Global Patent Research Limited Method, system, and apparatus for delivering query results from an electronic document collection
CN102597991A (en) * 2009-09-26 2012-07-18 哈米什·奥格尔维 Document analysis and association system and method
WO2011035389A1 (en) * 2009-09-26 2011-03-31 Hamish Ogilvy Document analysis and association system and method
AU2010300096B2 (en) * 2009-09-26 2012-10-04 Sajari Pty Ltd Document analysis and association system and method
US8666994B2 (en) 2009-09-26 2014-03-04 Sajari Pty Ltd Document analysis and association system and method
US9564089B2 (en) 2009-09-28 2017-02-07 Amazon Technologies, Inc. Last screen rendering for electronic book reader
US20110082839A1 (en) * 2009-10-02 2011-04-07 Foundationip, Llc Generating intellectual property intelligence using a patent search engine
US20110112825A1 (en) * 2009-11-12 2011-05-12 Jerome Bellegarda Sentiment prediction from textual data
US8682649B2 (en) 2009-11-12 2014-03-25 Apple Inc. Sentiment prediction from textual data
US20110119250A1 (en) * 2009-11-16 2011-05-19 Cpa Global Patent Research Limited Forward Progress Search Platform
US8600743B2 (en) 2010-01-06 2013-12-03 Apple Inc. Noise profile determination for voice-related feature
US20110166856A1 (en) * 2010-01-06 2011-07-07 Apple Inc. Noise profile determination for voice-related feature
US8954434B2 (en) * 2010-01-08 2015-02-10 Microsoft Corporation Enhancing a document with supplemental information from another document
US20110173210A1 (en) * 2010-01-08 2011-07-14 Microsoft Corporation Identifying a topic-relevant subject
US8670985B2 (en) 2010-01-13 2014-03-11 Apple Inc. Devices and methods for identifying a prompt corresponding to a voice input in a sequence of prompts
US9311043B2 (en) 2010-01-13 2016-04-12 Apple Inc. Adaptive audio feedback system and method
US8670979B2 (en) 2010-01-18 2014-03-11 Apple Inc. Active input elicitation by intelligent automated assistant
US10496753B2 (en) 2010-01-18 2019-12-03 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US8706503B2 (en) 2010-01-18 2014-04-22 Apple Inc. Intent deduction based on previous user interactions with voice assistant
US10553209B2 (en) 2010-01-18 2020-02-04 Apple Inc. Systems and methods for hands-free notification summaries
US8903716B2 (en) 2010-01-18 2014-12-02 Apple Inc. Personalized vocabulary for digital assistant
US8892446B2 (en) 2010-01-18 2014-11-18 Apple Inc. Service orchestration for intelligent automated assistant
US8731942B2 (en) 2010-01-18 2014-05-20 Apple Inc. Maintaining context information between user interactions with a voice assistant
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US8660849B2 (en) 2010-01-18 2014-02-25 Apple Inc. Prioritizing selection criteria by automated assistant
US8799000B2 (en) 2010-01-18 2014-08-05 Apple Inc. Disambiguation based on active input elicitation by intelligent automated assistant
US10705794B2 (en) 2010-01-18 2020-07-07 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US9548050B2 (en) 2010-01-18 2017-01-17 Apple Inc. Intelligent automated assistant
US10679605B2 (en) 2010-01-18 2020-06-09 Apple Inc. Hands-free list-reading by intelligent automated assistant
US8386239B2 (en) 2010-01-25 2013-02-26 Holovisions LLC Multi-stage text morphing
US9431028B2 (en) 2010-01-25 2016-08-30 Newvaluexchange Ltd Apparatuses, methods and systems for a digital conversation management platform
US8543381B2 (en) 2010-01-25 2013-09-24 Holovisions LLC Morphing text by splicing end-compatible segments
US9424861B2 (en) 2010-01-25 2016-08-23 Newvaluexchange Ltd Apparatuses, methods and systems for a digital conversation management platform
US9424862B2 (en) 2010-01-25 2016-08-23 Newvaluexchange Ltd Apparatuses, methods and systems for a digital conversation management platform
US8977584B2 (en) 2010-01-25 2015-03-10 Newvaluexchange Global Ai Llp Apparatuses, methods and systems for a digital conversation management platform
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US9190062B2 (en) 2010-02-25 2015-11-17 Apple Inc. User profiling for voice input processing
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US9633660B2 (en) 2010-02-25 2017-04-25 Apple Inc. User profiling for voice input processing
JP2014500528A (en) * 2010-06-03 2014-01-09 トムソン ライセンシング Enhancement of meaning using TOP-K processing
US8713021B2 (en) 2010-07-07 2014-04-29 Apple Inc. Unsupervised document clustering using latent semantic density analysis
US20120011139A1 (en) * 2010-07-12 2012-01-12 International Business Machines Corporation Unified numerical and semantic analytics system for decision support
US8538915B2 (en) * 2010-07-12 2013-09-17 International Business Machines Corporation Unified numerical and semantic analytics system for decision support
CN103189881B (en) * 2010-08-17 2019-06-18 西格拉姆申德勒有限公司 FSTP expert system
CN103189881A (en) * 2010-08-17 2013-07-03 西格拉姆申德勒有限公司 The FSTP expert system
US8719006B2 (en) 2010-08-27 2014-05-06 Apple Inc. Combined statistical and rule-based part-of-speech tagging for text-to-speech synthesis
US9495322B1 (en) 2010-09-21 2016-11-15 Amazon Technologies, Inc. Cover display
US9075783B2 (en) 2010-09-27 2015-07-07 Apple Inc. Electronic device with text error correction based on voice recognition data
US8719014B2 (en) 2010-09-27 2014-05-06 Apple Inc. Electronic device with text error correction based on voice recognition data
US20120150531A1 (en) * 2010-12-08 2012-06-14 At&T Intellectual Property I, L.P. System and method for learning latent representations for natural language tasks
US9135241B2 (en) * 2010-12-08 2015-09-15 At&T Intellectual Property I, L.P. System and method for learning latent representations for natural language tasks
US9720907B2 (en) 2010-12-08 2017-08-01 Nuance Communications, Inc. System and method for learning latent representations for natural language tasks
US10762293B2 (en) 2010-12-22 2020-09-01 Apple Inc. Using parts-of-speech tagging and named entity recognition for spelling correction
US10515147B2 (en) 2010-12-22 2019-12-24 Apple Inc. Using statistical language models for contextual lookup
US8781836B2 (en) 2011-02-22 2014-07-15 Apple Inc. Hearing assistance system for providing consistent human speech
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US10102359B2 (en) 2011-03-21 2018-10-16 Apple Inc. Device access using voice authentication
US20120259846A1 (en) * 2011-03-29 2012-10-11 Rafsky Lawrence C Method for selecting a subset of content sources from a collection of content sources
US8838584B2 (en) * 2011-03-29 2014-09-16 Acquire Media Ventures, Inc. Method for selecting a subset of content sources from a collection of content sources
US9904726B2 (en) 2011-05-04 2018-02-27 Black Hills IP Holdings, LLC. Apparatus and method for automated and assisted patent claim mapping and expense planning
US11714839B2 (en) 2011-05-04 2023-08-01 Black Hills Ip Holdings, Llc Apparatus and method for automated and assisted patent claim mapping and expense planning
US10885078B2 (en) 2011-05-04 2021-01-05 Black Hills Ip Holdings, Llc Apparatus and method for automated and assisted patent claim mapping and expense planning
US10037377B2 (en) 2011-05-27 2018-07-31 International Business Machines Corporation Automated self-service user support based on ontology analysis
US10162885B2 (en) 2011-05-27 2018-12-25 International Business Machines Corporation Automated self-service user support based on ontology analysis
US10019512B2 (en) 2011-05-27 2018-07-10 International Business Machines Corporation Automated self-service user support based on ontology analysis
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
US10672399B2 (en) 2011-06-03 2020-06-02 Apple Inc. Switching between text data and audio data based on a mapping
US10706373B2 (en) 2011-06-03 2020-07-07 Apple Inc. Performing actions associated with task items that represent tasks to perform
US10255566B2 (en) 2011-06-03 2019-04-09 Apple Inc. Generating and processing task items that represent tasks to perform
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US8812294B2 (en) 2011-06-21 2014-08-19 Apple Inc. Translating phrases from one language into another using an order-based set of declarative rules
US8533195B2 (en) * 2011-06-27 2013-09-10 Microsoft Corporation Regularized latent semantic indexing for topic modeling
US20130013612A1 (en) * 2011-07-07 2013-01-10 Software Ag Techniques for comparing and clustering documents
US8983963B2 (en) * 2011-07-07 2015-03-17 Software Ag Techniques for comparing and clustering documents
US11665253B1 (en) 2011-07-08 2023-05-30 Consumerinfo.Com, Inc. LifeScore
US10798197B2 (en) 2011-07-08 2020-10-06 Consumerinfo.Com, Inc. Lifescore
US8706472B2 (en) 2011-08-11 2014-04-22 Apple Inc. Method for disambiguating multiple readings in language conversion
US20130054577A1 (en) * 2011-08-23 2013-02-28 Pierre Schwob Knowledge matrix utilizing systematic contextual links
US8700612B2 (en) * 2011-08-23 2014-04-15 Contextu, Inc. Knowledge matrix utilizing systematic contextual links
US9798393B2 (en) 2011-08-29 2017-10-24 Apple Inc. Text correction processing
US11790112B1 (en) 2011-09-16 2023-10-17 Consumerinfo.Com, Inc. Systems and methods of identity protection and management
US11087022B2 (en) 2011-09-16 2021-08-10 Consumerinfo.Com, Inc. Systems and methods of identity protection and management
US10642999B2 (en) 2011-09-16 2020-05-05 Consumerinfo.Com, Inc. Systems and methods of identity protection and management
US8762156B2 (en) 2011-09-28 2014-06-24 Apple Inc. Speech recognition repair using contextual information
US9804838B2 (en) * 2011-09-29 2017-10-31 Accenture Global Services Limited Systems and methods for finding project-related information by clustering applications into related concept categories
US20160110186A1 (en) * 2011-09-29 2016-04-21 Accenture Global Services Limited Systems and methods for finding project-related information by clustering applications into related concept categories
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US11775538B2 (en) 2011-10-03 2023-10-03 Black Hills Ip Holdings, Llc Systems, methods and user interfaces in a patent management system
US10614082B2 (en) 2011-10-03 2020-04-07 Black Hills Ip Holdings, Llc Patent mapping
US11048709B2 (en) 2011-10-03 2021-06-29 Black Hills Ip Holdings, Llc Patent mapping
US11714819B2 (en) 2011-10-03 2023-08-01 Black Hills Ip Holdings, Llc Patent mapping
US11256706B2 (en) 2011-10-03 2022-02-22 Black Hills Ip Holdings, Llc System and method for patent and prior art analysis
US11803560B2 (en) 2011-10-03 2023-10-31 Black Hills Ip Holdings, Llc Patent claim mapping
US11797546B2 (en) 2011-10-03 2023-10-24 Black Hills Ip Holdings, Llc Patent mapping
US11360988B2 (en) 2011-10-03 2022-06-14 Black Hills Ip Holdings, Llc Systems, methods and user interfaces in a patent management system
US11372864B2 (en) * 2011-10-03 2022-06-28 Black Hills Ip Holdings, Llc Patent mapping
US11789954B2 (en) 2011-10-03 2023-10-17 Black Hills Ip Holdings, Llc System and method for patent and prior art analysis
US10860657B2 (en) 2011-10-03 2020-12-08 Black Hills Ip Holdings, Llc Patent mapping
US11200620B2 (en) 2011-10-13 2021-12-14 Consumerinfo.Com, Inc. Debt services candidate locator
US9158741B1 (en) 2011-10-28 2015-10-13 Amazon Technologies, Inc. Indicators for navigating digital works
US8949111B2 (en) * 2011-12-14 2015-02-03 Brainspace Corporation System and method for identifying phrases in text
US20130158979A1 (en) * 2011-12-14 2013-06-20 Purediscovery Corporation System and Method for Identifying Phrases in Text
US20130159313A1 (en) * 2011-12-14 2013-06-20 Purediscovery Corporation Multi-Concept Latent Semantic Analysis Queries
US9015160B2 (en) * 2011-12-14 2015-04-21 Brainspace Corporation Multi-concept latent semantic analysis queries
US9026535B2 (en) * 2011-12-14 2015-05-05 Brainspace Corporation Multi-concept latent semantic analysis queries
US9734130B2 (en) * 2012-02-08 2017-08-15 International Business Machines Corporation Attribution using semantic analysis
US9141605B2 (en) * 2012-02-08 2015-09-22 International Business Machines Corporation Attribution using semantic analysis
US20130204877A1 (en) * 2012-02-08 2013-08-08 International Business Machines Corporation Attribution using semantic analyisis
US20150286613A1 (en) * 2012-02-08 2015-10-08 International Business Machines Corporation Attribution using semantic analysis
US20150019209A1 (en) * 2012-02-08 2015-01-15 International Business Machines Corporation Attribution using semantic analysis
US10839134B2 (en) * 2012-02-08 2020-11-17 International Business Machines Corporation Attribution using semantic analysis
US9104660B2 (en) * 2012-02-08 2015-08-11 International Business Machines Corporation Attribution using semantic analysis
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US11356430B1 (en) 2012-05-07 2022-06-07 Consumerinfo.Com, Inc. Storage and maintenance of personal data
US9280610B2 (en) 2012-05-14 2016-03-08 Apple Inc. Crowd sourcing information to fulfill user requests
US9953088B2 (en) 2012-05-14 2018-04-24 Apple Inc. Crowd sourcing information to fulfill user requests
US8775442B2 (en) 2012-05-15 2014-07-08 Apple Inc. Semantic search using a single-source semantic model
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
US10019994B2 (en) 2012-06-08 2018-07-10 Apple Inc. Systems and methods for recognizing textual identifiers within a plurality of words
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US11461862B2 (en) 2012-08-20 2022-10-04 Black Hills Ip Holdings, Llc Analytics generation for patent portfolio management
US20140059035A1 (en) * 2012-08-24 2014-02-27 iCONECT Development, LLC Process for generating a composite search document used in computer-based information searching
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
US8935167B2 (en) 2012-09-25 2015-01-13 Apple Inc. Exemplar-based latent perceptual modeling for automatic speech recognition
US11863310B1 (en) 2012-11-12 2024-01-02 Consumerinfo.Com, Inc. Aggregating user web browsing data
US11012491B1 (en) 2012-11-12 2021-05-18 Consumerinfo.Com, Inc. Aggregating user web browsing data
CN103838789A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Text similarity computing method
US10963959B2 (en) 2012-11-30 2021-03-30 Consumerinfo.Com, Inc. Presentation of credit score factors
US11651426B1 (en) 2012-11-30 2023-05-16 Consumerinfo.Com, Inc. Credit score goals and alerts systems and methods
US11308551B1 (en) 2012-11-30 2022-04-19 Consumerinfo.Com, Inc. Credit data analysis
US9208254B2 (en) * 2012-12-10 2015-12-08 Microsoft Technology Licensing, Llc Query and index over documents
US20140164388A1 (en) * 2012-12-10 2014-06-12 Microsoft Corporation Query and index over documents
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US10199051B2 (en) 2013-02-07 2019-02-05 Apple Inc. Voice trigger for a digital assistant
US10572476B2 (en) 2013-03-14 2020-02-25 Apple Inc. Refining a search based on schedule items
US9733821B2 (en) 2013-03-14 2017-08-15 Apple Inc. Voice control to diagnose inadvertent activation of accessibility features
US11113759B1 (en) 2013-03-14 2021-09-07 Consumerinfo.Com, Inc. Account vulnerability alerts
US10929925B1 (en) 2013-03-14 2021-02-23 Consumerinfo.Com, Inc. System and methods for credit dispute processing, resolution, and reporting
US10642574B2 (en) 2013-03-14 2020-05-05 Apple Inc. Device, method, and graphical user interface for outputting captions
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11514519B1 (en) 2013-03-14 2022-11-29 Consumerinfo.Com, Inc. System and methods for credit dispute processing, resolution, and reporting
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
US11769200B1 (en) 2013-03-14 2023-09-26 Consumerinfo.Com, Inc. Account vulnerability alerts
US9977779B2 (en) 2013-03-14 2018-05-22 Apple Inc. Automatic supplementation of word correction dictionaries
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
US9922642B2 (en) 2013-03-15 2018-03-20 Apple Inc. Training an at least partial voice command system
US8620842B1 (en) 2013-03-15 2013-12-31 Gordon Villy Cormack Systems and methods for classifying electronic information using advanced active learning techniques
US11080340B2 (en) 2013-03-15 2021-08-03 Gordon Villy Cormack Systems and methods for classifying electronic information using advanced active learning techniques
US10078487B2 (en) 2013-03-15 2018-09-18 Apple Inc. Context-sensitive handling of interruptions
US11151899B2 (en) 2013-03-15 2021-10-19 Apple Inc. User training by intelligent digital assistant
US9697822B1 (en) 2013-03-15 2017-07-04 Apple Inc. System and method for updating an adaptive speech recognition model
US9678957B2 (en) 2013-03-15 2017-06-13 Gordon Villy Cormack Systems and methods for classifying electronic information using advanced active learning techniques
US8713023B1 (en) * 2013-03-15 2014-04-29 Gordon Villy Cormack Systems and methods for classifying electronic information using advanced active learning techniques
US8838606B1 (en) 2013-03-15 2014-09-16 Gordon Villy Cormack Systems and methods for classifying electronic information using advanced active learning techniques
US9122681B2 (en) 2013-03-15 2015-09-01 Gordon Villy Cormack Systems and methods for classifying electronic information using advanced active learning techniques
US10685398B1 (en) 2013-04-23 2020-06-16 Consumerinfo.Com, Inc. Presenting credit score information
US10579662B2 (en) 2013-04-23 2020-03-03 Black Hills Ip Holdings, Llc Patent claim scope evaluator
US11354344B2 (en) 2013-04-23 2022-06-07 Black Hills Ip Holdings, Llc Patent claim scope evaluator
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
US9620104B2 (en) 2013-06-07 2017-04-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9633674B2 (en) 2013-06-07 2017-04-25 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
US9966068B2 (en) 2013-06-08 2018-05-08 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10657961B2 (en) 2013-06-08 2020-05-19 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10185542B2 (en) 2013-06-09 2019-01-22 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
US9300784B2 (en) 2013-06-13 2016-03-29 Apple Inc. System and method for emergency calls initiated by voice command
CN104252465A (en) * 2013-06-26 2014-12-31 南宁明江智能科技有限公司 Method and device utilizing representative vectors to filter information
CN104252465B (en) * 2013-06-26 2018-10-12 南宁明江智能科技有限公司 A kind of method and apparatus filtering information using representation vector
US10791216B2 (en) 2013-08-06 2020-09-29 Apple Inc. Auto-activating smart responses based on activities from remote devices
RU2679960C2 (en) * 2013-10-10 2019-02-14 Общество С Ограниченной Ответственностью "Яндекс" Method and system of database for locating documents
US20150127650A1 (en) * 2013-11-04 2015-05-07 Ayasdi, Inc. Systems and methods for metric data smoothing
US10114823B2 (en) * 2013-11-04 2018-10-30 Ayasdi, Inc. Systems and methods for metric data smoothing
US10678868B2 (en) 2013-11-04 2020-06-09 Ayasdi Ai Llc Systems and methods for metric data smoothing
US10628448B1 (en) 2013-11-20 2020-04-21 Consumerinfo.Com, Inc. Systems and user interfaces for dynamic access of multiple remote databases and synchronization of data based on user rules
US11461364B1 (en) 2013-11-20 2022-10-04 Consumerinfo.Com, Inc. Systems and user interfaces for dynamic access of multiple remote databases and synchronization of data based on user rules
US10296160B2 (en) 2013-12-06 2019-05-21 Apple Inc. Method for extracting salient dialog usage from live data
US9361290B2 (en) 2014-01-18 2016-06-07 Christopher Bayan Bruss System and methodology for assessing and predicting linguistic and non-linguistic events and for providing decision support
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US10592095B2 (en) 2014-05-23 2020-03-17 Apple Inc. Instantaneous speaking of content on touch devices
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US10078631B2 (en) 2014-05-30 2018-09-18 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US9966065B2 (en) 2014-05-30 2018-05-08 Apple Inc. Multi-command single utterance input method
US10169329B2 (en) 2014-05-30 2019-01-01 Apple Inc. Exemplar-based natural language processing
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
US10289433B2 (en) 2014-05-30 2019-05-14 Apple Inc. Domain specific language for encoding assistant dialog
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9668024B2 (en) 2014-06-30 2017-05-30 Apple Inc. Intelligent automated assistant for TV user interactions
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10789041B2 (en) 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US11556230B2 (en) 2014-12-02 2023-01-17 Apple Inc. Data detection
US9711141B2 (en) 2014-12-09 2017-07-18 Apple Inc. Disambiguating heteronyms in speech synthesis
US10698977B1 (en) 2014-12-31 2020-06-30 Guangsheng Zhang System and methods for processing fuzzy expressions in search engines and for information extraction
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US10354010B2 (en) * 2015-04-24 2019-07-16 Nec Corporation Information processing system, an information processing method and a computer readable storage medium
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US10445374B2 (en) 2015-06-19 2019-10-15 Gordon V. Cormack Systems and methods for conducting and terminating a technology-assisted review
US10671675B2 (en) 2015-06-19 2020-06-02 Gordon V. Cormack Systems and methods for a scalable continuous active learning approach to information classification
US10242001B2 (en) 2015-06-19 2019-03-26 Gordon V. Cormack Systems and methods for conducting and terminating a technology-assisted review
US10229117B2 (en) 2015-06-19 2019-03-12 Gordon V. Cormack Systems and methods for conducting a highly autonomous technology-assisted review classification
US10353961B2 (en) 2015-06-19 2019-07-16 Gordon V. Cormack Systems and methods for conducting and terminating a technology-assisted review
US10242071B2 (en) 2015-06-23 2019-03-26 Microsoft Technology Licensing, Llc Preliminary ranker for scoring matching documents
US11392568B2 (en) 2015-06-23 2022-07-19 Microsoft Technology Licensing, Llc Reducing matching documents for a search query
US10565198B2 (en) 2015-06-23 2020-02-18 Microsoft Technology Licensing, Llc Bit vector search index using shards
US20160378796A1 (en) * 2015-06-23 2016-12-29 Microsoft Technology Licensing, Llc Match fix-up to remove matching documents
US10229143B2 (en) 2015-06-23 2019-03-12 Microsoft Technology Licensing, Llc Storage and retrieval of data from a bit vector search index
US11281639B2 (en) * 2015-06-23 2022-03-22 Microsoft Technology Licensing, Llc Match fix-up to remove matching documents
US10733164B2 (en) 2015-06-23 2020-08-04 Microsoft Technology Licensing, Llc Updating a bit vector search index
US10467215B2 (en) 2015-06-23 2019-11-05 Microsoft Technology Licensing, Llc Matching documents using a bit vector search index
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US11550794B2 (en) 2016-02-05 2023-01-10 International Business Machines Corporation Automated determination of document utility for a document corpus
US10372714B2 (en) * 2016-02-05 2019-08-06 International Business Machines Corporation Automated determination of document utility for a document corpus
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US11068546B2 (en) 2016-06-02 2021-07-20 Nuix North America Inc. Computer-implemented system and method for analyzing clusters of coded documents
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US10049663B2 (en) 2016-06-08 2018-08-14 Apple Inc. Intelligent automated assistant for media exploration
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10521466B2 (en) 2016-06-11 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10553215B2 (en) 2016-09-23 2020-02-04 Apple Inc. Intelligent automated assistant
JP2018063596A (en) * 2016-10-13 2018-04-19 富士通株式会社 Document comparison program, document comparison method, and document comparison device
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10657498B2 (en) 2017-02-17 2020-05-19 Walmart Apollo, Llc Automated resume screening
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services
US10915707B2 (en) * 2017-10-20 2021-02-09 MachineVantage, Inc. Word replaceability through word vectors
US20190179901A1 (en) * 2017-12-07 2019-06-13 Fujitsu Limited Non-transitory computer readable recording medium, specifying method, and information processing apparatus
US11720760B2 (en) * 2018-05-11 2023-08-08 Abbtype Ltd Method of abbreviated typing and compression of texts written in languages using alphabetic scripts
US20210383076A1 (en) * 2018-05-11 2021-12-09 Abbtype Ltd Method of abbreviated typing and compression of texts written in languages using alphabetic scripts
US20190362710A1 (en) * 2018-05-23 2019-11-28 Bank Of America Corporation Quantum technology for use with extracting intents from linguistics
US10825448B2 (en) * 2018-05-23 2020-11-03 Bank Of America Corporation Quantum technology for use with extracting intents from linguistics
US10665228B2 (en) * 2018-05-23 2020-05-26 Bank Of America Corporation Quantum technology for use with extracting intents from linguistics
US10880313B2 (en) 2018-09-05 2020-12-29 Consumerinfo.Com, Inc. Database platform for realtime updating of user data from third party sources
US11399029B2 (en) 2018-09-05 2022-07-26 Consumerinfo.Com, Inc. Database platform for realtime updating of user data from third party sources
US10671749B2 (en) 2018-09-05 2020-06-02 Consumerinfo.Com, Inc. Authenticated access and aggregation database platform
US11265324B2 (en) 2018-09-05 2022-03-01 Consumerinfo.Com, Inc. User permissions for access to secure data at third-party
US11074266B2 (en) * 2018-10-11 2021-07-27 International Business Machines Corporation Semantic concept discovery over event databases
US20200117738A1 (en) * 2018-10-11 2020-04-16 International Business Machines Corporation Semantic concept discovery over event databases
US11315179B1 (en) 2018-11-16 2022-04-26 Consumerinfo.Com, Inc. Methods and apparatuses for customized card recommendations
US11238656B1 (en) 2019-02-22 2022-02-01 Consumerinfo.Com, Inc. System and method for an augmented reality experience via an artificial intelligence bot
US11842454B1 (en) 2019-02-22 2023-12-12 Consumerinfo.Com, Inc. System and method for an augmented reality experience via an artificial intelligence bot
WO2020232092A1 (en) * 2019-05-15 2020-11-19 RELX Inc. Systems and methods for generating a low-dimensional space representing similarities between patents
US11200448B2 (en) 2019-05-15 2021-12-14 RELX Inc. Systems and methods for generating a low-dimensional space representing similarities between patents
US11599536B2 (en) * 2019-10-23 2023-03-07 Soppra Corporation Information output device, information output method, and information output program
US20220374422A1 (en) * 2019-10-23 2022-11-24 Soppra Corporation Information output device, information output method, and information output program
US20220405416A1 (en) * 2021-06-14 2022-12-22 International Business Machines Corporation Data query against an encrypted database
US11893128B2 (en) * 2021-06-14 2024-02-06 International Business Machines Corporation Data query against an encrypted database
US11620271B2 (en) * 2021-08-11 2023-04-04 Sap Se Relationship analysis using vector representations of database tables
US20230051059A1 (en) * 2021-08-11 2023-02-16 Sap Se Relationship analysis using vector representations of database tables
US11907195B2 (en) 2021-08-11 2024-02-20 Sap Se Relationship analysis using vector representations of database tables
US20230136726A1 (en) * 2021-10-29 2023-05-04 Peter A. Chew Identifying Fringe Beliefs from Text

Also Published As

Publication number Publication date
US7483892B1 (en) 2009-01-27

Similar Documents

Publication Publication Date Title
US6847966B1 (en) Method and system for optimally searching a document database using a representative semantic space
US8024331B2 (en) Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
Turtle Text retrieval in the legal world
US7076484B2 (en) Automated research engine
US20090119281A1 (en) Granular knowledge based search engine
EP0965089B1 (en) Information retrieval utilizing semantic representation of text
EP0597630B1 (en) Method for resolution of natural-language queries against full-text databases
Turney Extraction of keyphrases from text: evaluation of four algorithms
US20070185831A1 (en) Information retrieval
EP1419461A2 (en) Method and system for enhanced data searching
Strzalkowski Natural language processing in large-scale text retrieval tasks
Husain Critical concepts and techniques for information retrieval system
Milić-Frayling Text processing and information retrieval
Bhari et al. An Approach for Improving Similarity Measure Using Fuzzy Logic
Sharma et al. Improved stemming approach used for text processing in information retrieval system
Gyorodi et al. Full-text search engine using mySQL
Banerjee et al. Word image based latent semantic indexing for conceptual querying in document image databases
Yee Retrieving semantically relevant documents using Latent Semantic Indexing
Park GlossOnt: A Concept-focused Ontology Building Tool.
Martin-Bautista et al. Text mining using fuzzy association rules
Mamchich Models and algorithms of information retrieval in a multilingual environment on the basis of thematic and dynamic text corpora
Tokunaga et al. Paraphrasing Japanese noun phrases using character-based indexing
Kowalski et al. Automatic indexing
Goehler et al. Smart internet search engine through 6W
Jenkins Automatic classification and metadata generation for world-wide web resources

Legal Events

Date Code Title Description
AS Assignment

Owner name: ENGENIUM CORPORATION, TEXAS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SOMMER, MATTHEW S.;THOMPSON, KEVIN B.;REEL/FRAME:012836/0648

Effective date: 20020423

STCF Information on status: patent grant

Free format text: PATENTED CASE

AS Assignment

Owner name: KROLL ONTRACK INC., MINNESOTA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ENGENIUM CORPORATION;REEL/FRAME:019754/0927

Effective date: 20070806

FPAY Fee payment

Year of fee payment: 4

AS Assignment

Owner name: LEHMAN COMMERCIAL PAPER INC., NEW YORK

Free format text: PATENT SECURITY AGREEMENT;ASSIGNOR:KROLL ONTRACK INC.;REEL/FRAME:026883/0916

Effective date: 20100803

FEPP Fee payment procedure

Free format text: PAT HOLDER NO LONGER CLAIMS SMALL ENTITY STATUS, ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: STOL); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

REFU Refund

Free format text: REFUND - PAYMENT OF MAINTENANCE FEE, 8TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: R2552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

AS Assignment

Owner name: GOLDMAN SACHS BANK USA, NEW JERSEY

Free format text: ASSIGNMENT AND ASSUMPTION OF ALL LIEN AND SECURITY INTERESTS IN PATENTS;ASSIGNOR:LEHMAN COMMERCIAL PAPER INC.;REEL/FRAME:027579/0784

Effective date: 20120119

FPAY Fee payment

Year of fee payment: 8

AS Assignment

Owner name: KROLL ONTRACK INC., MINNESOTA

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT COLLATERAL;ASSIGNOR:GOLDMAN SACHS BANK USA;REEL/FRAME:033358/0818

Effective date: 20140703

Owner name: US INVESTIGATIONS SERVICES, LLC, VIRGINIA

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT COLLATERAL;ASSIGNOR:GOLDMAN SACHS BANK USA;REEL/FRAME:033358/0818

Effective date: 20140703

Owner name: HIRERIGHT, INC., CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT COLLATERAL;ASSIGNOR:GOLDMAN SACHS BANK USA;REEL/FRAME:033358/0818

Effective date: 20140703

AS Assignment

Owner name: WILMINGTON TRUST, N.A., MINNESOTA

Free format text: NOTICE AND CONFIRMATION OF GRANT OF SECURITY INTEREST IN PATENTS - SECOND LIEN NOTE;ASSIGNOR:KROLL ONTRACK, INC.;REEL/FRAME:033511/0248

Effective date: 20140703

Owner name: WILMINGTON TRUST, N.A., MINNESOTA

Free format text: NOTICE AND CONFIRMATION OF GRANT OF SECURITY INTEREST IN PATENTS - THIRD LIEN NOTE;ASSIGNOR:KROLL ONTRACK, INC.;REEL/FRAME:033511/0255

Effective date: 20140703

Owner name: GOLDMAN SACHS BANK USA, NEW YORK

Free format text: NOTICE AND CONFIRMATION OF GRANT OF SECURITY INTEREST IN PATENTS - TERM;ASSIGNOR:KROLL ONTRACK, INC.;REEL/FRAME:033511/0222

Effective date: 20140703

Owner name: WILMINGTON TRUST, N.A., MINNESOTA

Free format text: NOTICE AND CONFIRMATION OF GRANT OF SECURITY INTEREST IN PATENTS - FIRST LIEN NOTE;ASSIGNOR:KROLL ONTRACK, INC.;REEL/FRAME:033511/0229

Effective date: 20140703

AS Assignment

Owner name: KROLL ONTRACK, LLC, MINNESOTA

Free format text: CONVERSION;ASSIGNOR:KROLL ONTRACK INC.;REEL/FRAME:036407/0547

Effective date: 20150819

AS Assignment

Owner name: CERBERUS BUSINESS FINANCE, LLC, AS COLLATERAL AGEN

Free format text: NOTICE AND CONFIRMATION OF GRANT OF SECURITY INTEREST IN PATENTS;ASSIGNORS:HIRERIGHT, LLC;KROLL ONTRACK, LLC;REEL/FRAME:036515/0121

Effective date: 20150831

REMI Maintenance fee reminder mailed

FPAY Fee payment

Year of fee payment: 12

SULP Surcharge for late payment

Year of fee payment: 11

AS Assignment

Owner name: KROLL ONTRACK, LLC, MINNESOTA

Free format text: PARTIAL TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS;ASSIGNOR:CERBERUS BUSINESS FINANCE, LLC;REEL/FRAME:040876/0298

Effective date: 20161206

Owner name: ROYAL BANK OF CANADA, AS COLLATERAL AGENT, CANADA

Free format text: SECURITY INTEREST;ASSIGNORS:LDISCOVERY, LLC;LDISCOVERY TX, LLC;KROLL ONTRACK, LLC;REEL/FRAME:040959/0223

Effective date: 20161209

Owner name: ROYAL BANK OF CANADA, AS COLLATERAL AGENT, CANADA

Free format text: SECURITY INTEREST;ASSIGNORS:LDISCOVERY, LLC;LDISCOVERY TX, LLC;KROLL ONTRACK, LLC;REEL/FRAME:040960/0728

Effective date: 20161209

AS Assignment

Owner name: KROLL ONTRACK, INC., MINNESOTA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:GOLDMAN SACHS BANK USA, AS COLLATERAL AGENT;REEL/FRAME:040739/0641

Effective date: 20161209

AS Assignment

Owner name: KROLL ONTRACK, INC., MINNESOTA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS-SECOND LIEN;ASSIGNOR:WILMINGTON TRUST, N.A.;REEL/FRAME:041128/0877

Effective date: 20161209

Owner name: KROLL ONTRACK, INC., MINNESOTA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS-THIRD LIEN;ASSIGNOR:WILMINGTON TRUST, N.A.;REEL/FRAME:041128/0581

Effective date: 20161209

Owner name: KROLL ONTRACK, INC., MINNESOTA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS-FIRST LIEN;ASSIGNOR:WILMINGTON TRUST, N.A.;REEL/FRAME:041129/0290

Effective date: 20161209

AS Assignment

Owner name: LDISCOVERY, LLC, VIRGINIA

Free format text: TERMINATION AND RELEASE OF INTELLECTUAL PROPERTY SECURITY AGREEMENT;ASSIGNOR:ROYAL BANK OF CANADA, AS SECOND LIEN COLLATERAL AGENT;REEL/FRAME:051396/0423

Effective date: 20191219

Owner name: LDISCOVERY TX, LLC, VIRGINIA

Free format text: TERMINATION AND RELEASE OF INTELLECTUAL PROPERTY SECURITY AGREEMENT;ASSIGNOR:ROYAL BANK OF CANADA, AS SECOND LIEN COLLATERAL AGENT;REEL/FRAME:051396/0423

Effective date: 20191219

Owner name: KROLL ONTRACK, LLC, VIRGINIA

Free format text: TERMINATION AND RELEASE OF INTELLECTUAL PROPERTY SECURITY AGREEMENT;ASSIGNOR:ROYAL BANK OF CANADA, AS SECOND LIEN COLLATERAL AGENT;REEL/FRAME:051396/0423

Effective date: 20191219

AS Assignment

Owner name: LDISCOVERY TX, LLC, VIRGINIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:ROYAL BANK OF CANADA;REEL/FRAME:055184/0970

Effective date: 20210208

Owner name: WILMINGTON TRUST, NATIONAL ASSOCIATION, AS COLLATERAL AGENT, MINNESOTA

Free format text: SECURITY INTEREST;ASSIGNORS:LDISCOVERY, LLC;LDISCOVERY TX, LLC;KL DISCOVERY ONTRACK, LLC (F/K/A KROLL ONTRACK, LLC);REEL/FRAME:055183/0853

Effective date: 20210208

Owner name: LDISCOVERY, LLC, VIRGINIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:ROYAL BANK OF CANADA;REEL/FRAME:055184/0970

Effective date: 20210208

Owner name: KROLL ONTRACK, LLC, VIRGINIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:ROYAL BANK OF CANADA;REEL/FRAME:055184/0970

Effective date: 20210208

AS Assignment

Owner name: KLDISCOVERY ONTRACK, LLC, VIRGINIA

Free format text: CHANGE OF NAME;ASSIGNOR:KROLL ONTRACK, LLC;REEL/FRAME:057968/0923

Effective date: 20190123