US20030069873A1 - Multiple engine information retrieval and visualization system - Google Patents

Multiple engine information retrieval and visualization system

Info

Publication number
US20030069873A1
Authority
US
Grant status
Application
Patent type
Prior art keywords
documents
document
retrieval
search
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US09195773
Other versions
US6574632B2 (en)
Inventor
Kevin L. Fox
Ophir Frieder
Margaret M. Knepper
Robert A. Killam
Joseph M. Nemethy
Gregory J. Cusick
Eric J. Snowberg
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Technology Licensing Corp
Original Assignee
Harris Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor ; File system structures therefor
    • G06F17/3061Information retrieval; Database structures therefor ; File system structures therefor of unstructured textual data
    • G06F17/30634Querying
    • G06F17/30696Presentation or visualization of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor ; File system structures therefor
    • G06F17/30861Retrieval from the Internet, e.g. browsers
    • G06F17/30864Retrieval from the Internet, e.g. browsers by querying, e.g. search engines or meta-search engines, crawling techniques, push systems
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/912Applications of a database
    • Y10S707/917Text
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/953Organization of data
    • Y10S707/959Network
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • Y10S707/99933Query processing, i.e. searching
    • Y10S707/99935Query augmenting and refining, e.g. inexact access
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99941Database schema or data structure
    • Y10S707/99943Generating database or data structure, e.g. via user interface

Abstract

An information retrieval and visualization system utilizes multiple search engines for retrieving documents from a document database based upon user input queries. Search engines include an n-gram search engine and a vector space model search engine using a neural network training algorithm. Each search engine produces a common mathematical representation of each retrieved document. The retrieved documents are then combined and ranked. Mathematical representations for each respective document are mapped onto a display. Information displayed includes a three-dimensional display of keywords from the user input query. The three-dimensional visualization capability, based upon the mathematical representation of information within the information retrieval and visualization system, provides users with an intuitive understanding, so that relevance feedback/query refinement techniques can be better utilized, resulting in higher retrieval accuracy (precision).

Description

    FIELD OF THE INVENTION
  • [0001]
    The present invention relates to the field of information retrieval systems, and, more particularly, to computer based information retrieval and visualization systems.
  • BACKGROUND OF THE INVENTION
  • [0002]
    The advent of the World-Wide-Web has increased the importance of information retrieval. Instead of visiting the local library to find information on a particular topic, a person can search the Web to find the desired information. Thus, the relative number of manual versus computer-assisted searches for information has shifted dramatically. This has increased the need for automated information retrieval for relatively large document collections.
  • [0003]
    Information retrieval systems search and retrieve data from a collection of documents in response to user input queries. Ever increasing volumes of data are rendering traditional information retrieval systems ineffective in production environments. As data volumes continue to grow, it becomes increasingly difficult to develop search engines that support search and retrieval with non-prohibitive search times. These larger data collections create the need to formulate accurate queries, as well as the need to intuitively present the results to the user to increase retrieval efficiency of the desired information.
  • [0004]
    Currently, users retrieve distributed information from the Web via the use of search engines. Many search engines exist, such as, for example, Excite, Infoseek, Yahoo, Alta Vista, Sony Search Engine and Lycos. Private document collections may also be searched using these search engines. A common goal of each search engine is to yield a highly accurate set of results to satisfy the information desired. Two accuracy measures often used to evaluate information retrieval systems are recall and precision. Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents available collection-wide. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. In many interactive applications, however, users require only a few highly relevant documents to form a general assessment of the topic, as opposed to detailed knowledge obtained by reading many related documents.
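    The two accuracy measures defined above can be illustrated with a minimal sketch. The function name and the document identifiers are hypothetical and are used only for illustration; they do not come from the retrieval system described here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document IDs.

    precision = relevant documents retrieved / total documents retrieved
    recall    = relevant documents retrieved / total relevant documents
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant in the collection,
# 2 of the retrieved documents are relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d2", "d4", "d7", "d8", "d9"])
print(p, r)
```

    In this example two of the four retrieved documents are relevant, and two of the five relevant documents were found.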
  • [0005]
    Time constraints and interest level typically limit the user to reviewing the top documents before determining if the results of a query are accurate and satisfactory. In such cases, retrieval times and precision accuracy are at a premium, with recall potentially being less important. A recent user study conducted by Excite Corporation demonstrated that less than five percent of the users looked beyond the first screen of documents returned in response to their queries. Other studies conducted on a wide range of operational environments have shown that the average number of terms provided by the user as an input query is often less than two and rarely greater than four. Therefore, high precision with efficient search times may typically be more critical than high recall.
  • [0006]
    In spite of the respective strengths for each of the various search engines, there is no one best search engine for all applications. Accordingly, results from multiple search engines or from multiple runs have been combined to yield better overall results. By combining the results of multiple search engines, an information retrieval system is able to capitalize on the advantages of one search engine with the intention of masking the weaknesses of another search engine. A discussion of combining the results of an individual search engine using different fusion rules is disclosed, for example, by Kantor in Information Retrieval Techniques, volume 29, chapter 2, pages 53-90 (1994). However, the article discloses that it is not simple to obtain better results using multiple engines as compared to only a single search engine.
  • [0007]
    An article by Cavnar, titled “Using an N-Gram Based Document Representation with a Vector Processing Retrieval Model,” discloses the use of n-gram technology and a vector space model in a single information retrieval system. The two search retrieval techniques are combined such that the vector processing model is used for documents and queries, and the n-gram frequencies are used as the basis for the vector element values instead of the traditional term frequencies. The information retrieval system disclosed by Cavnar is a hybrid between an n-gram search engine and a vector space model search engine.
  • [0008]
    In an article by Shaw and Fox, titled “Combination of Multiple Searches,” a method of combining the results from various divergent search schemes and document collections is disclosed. In particular, the results from vector and P-norm queries were considered in estimating the similarity for each document in an individual collection. P-norm extends boolean queries and natural language vector queries. The results for each collection are merged to create a single final set of documents to be presented to the user. By summing the similarity values obtained, the article describes better overall accuracy than using a single similarity value.
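    The idea of merging runs by summing similarity values can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the cited article: the function name, the run names, and the scores are hypothetical, and the scores are assumed to be on comparable scales already.

```python
from collections import defaultdict


def fuse_by_sum(result_lists):
    """Merge several {doc_id: similarity} result sets by summing similarities.

    Documents found by more than one search accumulate a higher total.
    Returns (doc_id, total) pairs sorted by descending total, with ties
    broken by doc_id so the ordering is deterministic.
    """
    totals = defaultdict(float)
    for scores in result_lists:
        for doc_id, score in scores.items():
            totals[doc_id] += score
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical similarity values from two different query types.
vector_run = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
pnorm_run = {"d2": 0.7, "d3": 0.2, "d4": 0.5}

fused = fuse_by_sum([vector_run, pnorm_run])
for doc_id, total in fused:
    print(doc_id, round(total, 2))
```

    Here document d2 rises to the top because both runs retrieved it, even though neither run ranked it first on its own.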
  • [0009]
    Once the information has been retrieved, user understanding of the information is critical. As previously stated, time constraints and interest level limit the user to reviewing the top documents before determining if the results of a query are accurate and satisfactory. Therefore, presentation of the retrieved information in an easily recognizable manner to the user is important. For example, presenting data to the user in a multi-dimensional format is disclosed in U.S. Pat. No. 5,649,193 to Sumita et al. Detection results are presented in a multi-dimensional display format by setting the viewpoints to axes. The detection command serves as the origin, and the detected documents are displayed with respect to each axis using their distances from the origin for each viewpoint as coordinates.
  • [0010]
    Despite the continuing development of search engines and result visualization techniques, there is still a need to quickly and efficiently search large document collections and present the results in a meaningful manner to the user.
  • SUMMARY OF THE INVENTION
  • [0011]
    In view of the foregoing background, it is therefore an object of the present invention to provide an information retrieval and visualization system and related method for efficiently retrieving documents from a document database and for visually displaying the search results in a format readily comprehended and meaningful to the user.
  • [0012]
    These and other objects, features and advantages in accordance with the present invention are provided by an information retrieval system for selectively retrieving documents from a document database using multiple search engines and a three-dimensional visualization approach. More particularly, the system comprises an input interface for accepting at least one user search query, and a plurality of search engines for retrieving documents from the document database based upon at least one user search query. Each of the search engines advantageously produces a common mathematical representation of each retrieved document. The system further comprises a display and visualization display means for mapping respective mathematical representations of the retrieved documents onto the display.
  • [0013]
    At least one search engine produces a document context vector representation and an axis context vector representation of each retrieved document. The document context vector representation is the sum of all the words in a document after reducing low content words, and is used to compare documents and queries. The axis context vector representation is a sum of the words in each axis after reducing low content words, and is used for building a query for a document cluster. The axis context vector is also used by the visualization means to map onto the display.
  • [0014]
    The present invention thereby provides a three-dimensional display of keywords, for example, from the user input query via the visualization display means. Displaying documents in a three-dimensional space enables a user to see document clusters, the relationships of documents to each other, and also aids in new document identification. Documents near identified relevant documents can be easily reviewed for topic relevance. Advantageously, the user is able to manipulate the dimensional view via the input interface to gain new views of document relationships. Changing the document dimensionality allows the information to be viewed for different aspects of the topics to aid in further identification of relevant documents.
  • [0015]
    The plurality of search engines may comprise an n-gram search engine and a vector space model (VSM) search engine. The n-gram search engine comprises n-gram training means for least frequency training of the training documents. Similarly, the VSM search engine comprises VSM training means for processing training documents and further comprises a neural network.
  • [0016]
    The present invention provides precision in retrieving documents from a document database by providing users with multiple input interaction modes, by fusing results obtained from multiple information retrieval search engines, each supporting a different retrieval strategy, and by supporting relevance feedback mechanisms. The multiple engine information retrieval and visualization system allows users to build and tailor a query as they further define the topic of interest, moving from a generic search to specific topic areas through query inputs. Users can increase or decrease the system precision, affecting the number of documents that are retrieved as relevant. The weights on the retrieval engines can be modified to favor different engines based on the query types.
  • [0017]
    A method aspect of the invention is for selectively retrieving documents from a document database using an information retrieval system comprising a plurality of search engines. The method preferably comprises the steps of generating at least one user search query and retrieving documents from the document database based upon the user search query. Each search engine searches the document database and produces a common mathematical representation of each retrieved document. The respective mathematical representations of the retrieved documents are mapped onto a display. The method further preferably comprises the steps of producing a document context vector representation of each retrieved document, and producing an axis context vector representation of each retrieved document. The step of mapping preferably comprises the step of mapping the axis context vector representations of the retrieved documents onto the display.
  • [0018]
    Another method aspect of the invention is for selectively retrieving documents from a document database. The method preferably comprises the steps of defining a dictionary, randomly assigning a context vector to each word in the dictionary, training the dictionary words, assigning axis representation to each dictionary word, receiving at least one user search query, and searching a document database based upon the user search query. The dictionary comprises a plurality of words related to a topic to be searched. Advantageously, each dictionary word is assigned a context vector representation. These context vector representations are then used to create context vectors for representation of any document in a collection of documents, and for representation of any search query. If more documents are added to the collection, document representations do not have to be recalculated because a context vector representation of a document is not dependent on term frequency across the entire document collection.
  • [0019]
    In particular, training the dictionary words comprises the steps of receiving a training document, creating context vectors for each word in the training document, and converging the context vectors toward each other for the context vectors representing words appearing close to one another based upon contextual usage. Assigning axis representation comprises the step of assigning each dictionary word to an axis having the largest component. The method further preferably comprises the steps of displaying a mathematical representation of the retrieved documents from the document database corresponding to the search query.
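    Two of the steps named above, randomly assigning a context vector to each dictionary word and assigning each word to the axis having the largest component, can be sketched as follows. This is a minimal illustration under assumptions: the dictionary words, the dimension, the Gaussian initialization, and the reading of "largest component" as the largest signed value are all hypothetical choices. The training step that converges vectors of co-occurring words is omitted because its update rule is not given in this passage.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random vector and scale it to unit length (assumed initialization)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def assign_axis(vector):
    """Return the index of the axis holding the largest component.

    Assumption: "largest" is taken as the largest signed value,
    not the largest magnitude.
    """
    return max(range(len(vector)), key=lambda i: vector[i])


# Hypothetical dictionary for a topic and a hypothetical dimension.
dictionary = ["budget", "music", "trial", "bomb"]
rng = random.Random(0)
context_vectors = {word: random_unit_vector(5, rng) for word in dictionary}

# After training (omitted here), each word would be assigned to an axis.
axes = {word: assign_axis(vec) for word, vec in context_vectors.items()}
print(axes)
```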
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • [0020]
    FIG. 1 is a block diagram of the information retrieval and visualization system according to the present invention.
  • [0021]
    FIG. 2a is a schematic display view of an example vector in 3-dimensional space with the words “budget” and “music” assigned to an axis according to the present invention.
  • [0022]
    FIG. 2b is a graph of a distribution of the words through the first thirty axes for the example illustrated in FIG. 2a.
  • [0023]
    FIG. 3 shows comparative graphs illustrating a percentage of words in the first thirty axes before and after document reduction according to the present invention.
  • [0024]
    FIG. 4 is a sample display of a document being reduced according to the present invention.
  • [0025]
    FIG. 5 shows graphs of comparative word retrieval recall and precision results according to the present invention.
  • [0026]
    FIG. 6 shows graphs of comparative word retrieval recall and precision results according to the present invention.
  • [0027]
    FIG. 7 shows graphs illustrating examples of multiple query retrieval according to the present invention.
  • [0028]
    FIG. 8 shows graphs illustrating recall and precision of an example search using the VSM document axis according to the present invention.
  • [0029]
    FIG. 9 shows graphs illustrating selection of document retrieval engine penalty according to the present invention.
  • [0030]
    FIG. 10 is a diagram illustrating an assignment of the list location penalty according to the present invention.
  • [0031]
    FIGS. 11a-11d are display screens for example searches according to the present invention.
  • [0032]
    FIG. 12a is a display screen showing an example of the 3-dimensional viewer containing documents retrieved for the McVeigh trial topic using the query keywords: McVeigh, trial, Oklahoma City, and bomb, according to the present invention.
  • [0033]
    FIG. 12b is a display screen showing spheres drawn around the keywords as shown in FIG. 12a.
  • [0034]
    FIG. 13 is a display screen showing the clustering of documents retrieved as in FIG. 12a.
  • [0035]
    FIG. 14a is a display screen showing zooming in on the word “trial” with the text turned on as shown in FIG. 12a.
  • [0036]
    FIG. 14b is a display screen showing different aspects of the retrieved document set as shown in FIG. 12a.
  • [0037]
    FIG. 15 is a graph showing precision scores from the TREC-6 Ad Hoc Manual Track Competition according to the present invention.
  • [0038]
    FIG. 16 shows graphs showing results using a modified ranking algorithm according to the present invention.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • [0039]
    The present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which preferred embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Like numbers refer to like elements throughout.
  • [0040]
    Referring initially to FIG. 1 the system architecture of a multiple engine information retrieval and visualization system 10 according to the present invention is now described. For convenience, the multiple engine information retrieval and visualization system 10 will be referred to as the retrieval system 10. The retrieval system 10 selectively retrieves documents from a document database. System features and benefits are listed in Table 1 for the retrieval system 10.
    TABLE 1
    System Features
    FEATURES | BENEFITS
    Fusion of multiple retrieval engines | Improved retrieval performance over independent search engines
    An architecture that supports the addition of new search engines | Flexible search strategies
    Search by keyword or document example | Tailored queries
    Multiple query capability for a topic | Emphasize best search engine and queries for a topic
    Document reduction | Improved performance and smaller document footprint
    Partitioning of document corpus based on similarity | Reduces search space
    Search screening | Remove topical but irrelevant documents
    Manual Relevance Feedback | Query refinement
    Web-based user interface | Familiar style facilitates quick review of retrieved documents and selection of example documents for use as additional queries
    3-D Visualization of retrieved document sets | Facilitates intuitive identification of additional relevant documents
  • [0041]
    The retrieval system 10 includes an input interface 12 for accepting at least one user search query. The input interface 12 also allows users to build queries for a topic of interest, execute the queries, examine retrieved documents, and build additional queries or refine existing ones. A plurality of search engines are used for retrieving documents from the document database based upon the user search query, wherein each search engine produces a common mathematical representation of each retrieved document. The plurality of search engines includes an n-gram search engine 14 and a Vector Space Model (VSM) search engine 16, which, in turn, includes a neural network training portion 18. The search engines query a document corpus 20 to retrieve relevant documents. Results of the retrieval engines 14, 16 are fused together and ranked.
  • [0042]
    In one embodiment, the fusing together and ranking of the retrieved documents is performed by a ranking portion 22 of the computer system. The retrieval system 10 further comprises visualization display means 24 for mapping respective mathematical representations of the retrieved documents onto a display. The visualization display means 24 allow users to explore various aspects of the retrieved documents, and look for additional relevant documents by redefining the topic query corpus 26 via the input interface 12.
  • [0043]
    As previously discussed in the background section, information retrieval is the dissemination of information in response to user input queries. The information to be retrieved is typically stored in documents of various forms. Query formats range from a single set of words or phrases, to a Boolean logical expression that combines sets of words and phrases, to a complete natural language sentence or paragraph. A full document may also be used as an input query. A user begins by defining a topic of interest, then proceeds to define one or more queries for that topic. User queries for the retrieval system 10 can thus take the form of keywords or phrases, an example document, and even document clusters.
  • [0044]
    The retrieval system 10 utilizes an interactive multi-pass approach. It is not assumed that the information will be found immediately, and thus the user needs to interactively refine the search query. The retrieval system 10 allows the user to review the documents and select the documents most relevant to the topic. Relevant documents can be used as queries to further refine the topic. The user can then quickly query over the data with the additional queries.
  • [0045]
    By combining multiple separate and independent retrieval technologies, the various strengths of each approach can be leveraged to develop a more robust information retrieval system. The retrieval system 10 illustrated in FIG. 1 uses an n-gram search engine 14 and a vector space model (VSM) search engine 16. These two search engines represent an embodiment of the present invention, and one skilled in the art will readily realize that other search engines can be used in place of, or in addition to, the n-gram 14 and the VSM 16 search engines. Other search engines that can be incorporated into the retrieval system 10 include, but are not limited to, those using the following retrieval strategies: probabilistic retrieval, inference networks, boolean indexing, latent semantic indexing, genetic algorithms and fuzzy set retrieval. All of these retrieval strategies, including n-gram 14 and VSM 16, are well known to one skilled in the art. For illustrative purposes, a comparison of the strengths and weaknesses of the n-gram 14 and the VSM 16 search engines is provided in Table 2.
    TABLE 2
    Retrieval Engine Strengths and Weaknesses
             | n-gram | VSM
    Strength | unique terms (e.g. proper nouns); mis-spelled words; short documents (e.g. e-mail) | example documents used as input; document meaning
    Weakness | long documents | unique terms (e.g. proper nouns - terms that did not appear in the training corpus and hence do not appear in the VSM's dictionary)
  • [0046]
    With an n-gram search engine 14, an input query is partitioned into n-grams to form an n-gram query 30. An n-gram is a consecutive sequence of n characters, with n being a positive integer. The premise of an n-gram search engine 14 is to separate terms into word fragments of size n, then design algorithms that use these fragments to determine whether or not a match exists. For example, the first 15 tri-grams (3-grams) in the phrase “information retrieval” are listed below:
    inf nfo for orm rma
    mat ati tio ion on_
    n_r _re ret etr tri
    (listed in order of occurrence, where "_" denotes the space character)
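    The partitioning of a phrase into tri-grams can be reproduced with a short sketch. The function name is hypothetical, and lower-casing the text and writing the space character as "_" are assumptions made only so that the output is easy to read.

```python
def char_ngrams(text, n=3, space="_"):
    """Slide an n-character window over the text, one character at a time."""
    # Assumption: lower-case the text and show spaces as "_" for readability.
    normalized = text.lower().replace(" ", space)
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


trigrams = char_ngrams("information retrieval")
print(trigrams[:15])
# ['inf', 'nfo', 'for', 'orm', 'rma', 'mat', 'ati', 'tio', 'ion',
#  'on_', 'n_r', '_re', 'ret', 'etr', 'tri']
```

    A phrase of 21 characters yields 19 overlapping tri-grams; only the first 15 are shown.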
  • [0047]
    The frequency of occurrence of n-grams can be used to distinguish/characterize the language of a document (3-grams), and as a means of gauging the topical similarity of documents (5-grams). The retrieval system 10 employs an n-gram filter based on work with Least Frequent Tri-grams (LFT), as represented in FIG. 1 by the n-grams least frequency training block 32. The retrieval system 10 moves an n-character sliding window over a document while recording the frequency of occurrence of different combinations of n-characters. A least frequency table 34 is built from a corpus of training documents 36, representative of the document collection. Relevant documents are rapidly identified by looking for the occurrence in the document of the least-frequent n-gram of a search string, such as a keyword. If the least frequently occurring n-gram of a search term is not present in a document, then the term itself is not in the document. If the least frequently occurring n-gram is present, the search continues for the entire string.
  • [0048]
    In one embodiment of the retrieval system 10, a 3-character sliding window (3-grams) is used. For illustrative purposes, a tri-gram frequency analysis is performed on a representative sample of documents. From this analysis, a table of tri-gram probabilities is developed for assigning an occurrence probability p(t) to each of the unique tri-grams. The occurrence probability is expressed as:
  • p(t) = (number of documents in which tri-gram t occurred) / (total number of documents analyzed)
  • [0049]
    The developed table is used to determine the LFT for any given query. The frequencies are not evenly distributed across the tri-grams: when the training documents are written in English, certain tri-grams (TER, ING, THE) occur very frequently, while others (JXF, DBP, GGG) occur only under highly unusual circumstances.
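    The table of occurrence probabilities and its use as a least-frequent-tri-gram filter can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names and training documents are hypothetical, text is lower-cased by assumption, and a tri-gram never seen in training is assumed to have probability zero.

```python
def trigrams(text):
    """Return the overlapping 3-character windows of the lower-cased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_probability_table(training_docs):
    """p(t) = documents containing tri-gram t / total documents analyzed."""
    counts = {}
    for doc in training_docs:
        for t in set(trigrams(doc)):  # count each tri-gram once per document
            counts[t] = counts.get(t, 0) + 1
    total = len(training_docs)
    return {t: c / total for t, c in counts.items()}


def least_frequent_trigram(term, table):
    """Pick the tri-gram of the term with the lowest occurrence probability."""
    # Assumption: tri-grams unseen in training get probability 0.0.
    return min(trigrams(term), key=lambda t: table.get(t, 0.0))


def contains_term(doc, term, table):
    """Filter on the least frequent tri-gram, then confirm the whole string."""
    doc = doc.lower()
    term = term.lower()
    if len(term) >= 3 and least_frequent_trigram(term, table) not in doc:
        return False  # the rarest fragment is absent, so the term is absent
    return term in doc


# Hypothetical training corpus.
training = ["the trial began", "the bomb trial", "the budget"]
table = build_probability_table(training)
print(table["the"], least_frequent_trigram("bomb", table))
print(contains_term("The trial resumed", "trial", table))
print(contains_term("the budget vote", "trial", table))
```

    The filter rejects most documents with a single substring check on a rare fragment, and the full string comparison runs only on the remaining candidates.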
  • [0050]
    The n-gram retrieval search engine 14 counts the number of occurrences of the string. Since typical early searches are keywords, the n-gram search engine 14 acts as a filter to quickly identify candidate documents to be reviewed early in the retrieval process. It is especially useful for phrases not in the dictionary/vocabulary 44 of the VSM search engine 16. The identified relevant documents are then used on the next pass through the retrieval system 10. Additionally, if the keyword phrase of interest is unique enough, the documents containing the keyword phrase are used to create a document subset. Further queries are performed on just the document subset, which increases the search speed.
  • [0051]
    The retrieval system 10 also comprises a Vector Space Model (VSM) search engine 16 to represent documents in an n-dimensional vector space. The strengths and weaknesses of the VSM 16 are listed above in Table 2. In particular, a context vector model is implemented for the retrieval system 10. Words appearing in the document training corpus 40 are represented as vectors in the n-dimensional vector space, $\omega \in \mathbb{R}^n$. The word vectors, $\omega \in \mathbb{R}^n$, are normalized to unit length so that they all lie on the unit hyper-sphere and
    $$\sum_i \omega_i^2 = 1$$
  • [0052]
    The similarity of two words is measured by computing the cosine similarity measure of the associated vectors, $\omega, v \in \mathbb{R}^n$:
    $$\frac{\langle \omega, v \rangle}{\|\omega\|_2 \, \|v\|_2} = \frac{\omega \cdot v}{|\omega| \, |v|}$$
  • [0053]
    This similarity measure is the cosine of the angle between the two vectors. The higher the value of the cosine (i.e., the closer to +1), the smaller the angle between the two vectors. Since all of the word vectors are on the unit hyper-sphere, $\|\omega\|_2 = \|\omega\| = 1$ for all $\omega$, the cosine similarity measure reduces to $\frac{(\omega, v)}{\|\omega\|_2 \|v\|_2} = \frac{\omega \cdot v}{(1)(1)} = \omega \cdot v = \sum_i \omega_i v_i$
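    The reduction above can be checked numerically. The following is a minimal sketch, not the patented implementation; the function names and the two sample vectors are illustrative assumptions.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (assumes the vector is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(w, v):
    """General form: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(w, v))
    norm_w = math.sqrt(sum(a * a for a in w))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_w * norm_v)

def unit_similarity(w, v):
    """For unit-length vectors the cosine reduces to the plain dot product."""
    return sum(a * b for a, b in zip(w, v))

# Hypothetical word vectors, normalized onto the unit hyper-sphere.
w = normalize([3.0, 4.0, 0.0])
v = normalize([4.0, 3.0, 0.0])
general = cosine_similarity(w, v)
reduced = unit_similarity(w, v)
```

    Both forms agree on normalized vectors, so only the dot product needs to be computed at query time.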
  • [0054]
    A vector for each document, $\omega \in \mathbb{R}^n$, is constructed based on the terms in a document. A query is considered to be like a document, so a document and a query can be compared by comparing their respective vectors in the vector space. Documents whose content, as measured by the terms in the document, correspond most closely to the content of the query are judged to be the most relevant. The documents are retrieved through keyword, word clusters (series of words), and example document queries mapped into the n-dimensional vector space. The documents whose respective vectors are a minimal distance from the query's vector are retrieved.
  • [0055]
    Many Vector Space Models count the frequency of occurrence of words and phrases to build the document queries 42. Frequency counts are done on the individual document files and for the document corpus 20. As new data are entered, the frequency counts must be updated or recomputed. Queries 42 are built upon the highest frequency counts for documents, necessitating more computation time. The retrieval system 10 creates an entirely mathematical representation of a document, and builds queries 42 from that representation. The mathematical representation allows consistent grouping of the words so that they can be compared.
  • [0056]
    Using only a mathematical representation offers several advantages. One advantage is that additional documents can be added to the document corpus 20 without having to recalculate word occurrence frequencies. Other advantages include reduced documents, small vectors, a small index, and the minimal calculations required to build positive and negative queries, i.e., no document recalculation is required. In addition, the representation is independent of document size, and the similarity equations are simplified.
  • [0057]
    Keywords, keyword phrases, single documents, and document clusters are provided as input to the retrieval system's 10 VSM component as queries. Queries 42 constructed by the VSM 16 can be broadly or narrowly focused, depending on the keywords, phrases, and example documents used in the queries. The document's score is obtained by computing the distance between the vectors representing the query and the document. Scores for relevant documents typically range from approximately 0.45 to 1. The closer to 1, the better the document matches the search query.
  • [0058]
    Experiments have shown that the VSM's 16 strongest performance results from the use of example documents and document clusters. As passes are completed, top query results are reviewed and identified as relevant or irrelevant. Relevant documents from the query are input to the next pass of the VSM search engine 16.
  • [0059]
    A neural network (NN) training portion 18 is used within the retrieval system 10 to train the word vectors, $\omega \in \mathbb{R}^n$, in the VSM search engine 16. The NN training algorithm 18 is based on the training rule for Kohonen's Self-Organizing Map. This unsupervised learning algorithm organizes a high-dimensional vector space based on features within the training data so that items with similar usage are clustered together. Heavier weights are placed on words in closer proximity. The neural network 18 also accounts for training that has already taken place by adjusting highly trained words less. The training algorithm is described as follows:
  • [0060]
    1. Initialize context vectors for each word in the dictionary. The values in the vector are chosen at random, and
  • [0061]
    2. For N training epochs
  • [0062]
    3. For each training document
  • [0063]
    a) Set word_index to 0
  • [0064]
    b) Set neighbor_word_index to word_index +1
  • [0065]
    c) Adjust context vector for word(word_index) and word(neighbor_word_index). This accounts for proximity of words in the document, weighting more heavily words that are closer in proximity. This also accounts for training that has already taken place by adjusting highly trained words less. The algorithm is given as
  • [0066]
    i) $d = w_1 - w_2$, where $w_1$ = context vector for word(word_index), and $w_2$ = context vector for word(neighbor_word_index)
  • [0067]
    ii) $w_1(k+1) = w_1(k) - \mu_w k_1 d$, where $\mu_w$ = learning rate for word neighbor adjustments, and $k_1 = (\text{w1\_num\_updates} \cdot (\text{neighbor\_word\_index} - \text{word\_index}))^{-1}$
  • [0068]
    iii) $w_2(k+1) = w_2(k) + \mu_w k_2 d$, where $k_2 = (\text{w2\_num\_updates} \cdot (\text{neighbor\_word\_index} - \text{word\_index}))^{-1}$
  • [0069]
    iv) Renormalize w1 and w2
  • [0070]
    d) if (neighbor_word_index−word_index)<max_neighbor_words
  • [0071]
    i) Increment neighbor_word_index
  • [0072]
    ii) Go to 3c, else if not done with all words in document
  • [0073]
    iii) Increment word_index
  • [0074]
    iv) Go to 3b
  • [0075]
    e) Calculate context vector for document
  • [0076]
    f) For every word in the document, adjust the word's context vector so that it is closer to the document's context vector. This steers words and the document that contains them towards a cluster of similar meaning in the vector space.
  • [0077]
    i) d=w−v, where w is the context vector for the word, and v is the context vector for the entire document
  • [0078]
    ii) $w(k+1) = w(k) - \mu_d d$, where $\mu_d$ is the learning rate for word-to-document adjustment ($\mu_d \ll \mu_w$)
  • [0079]
    iii) Renormalize w
  • [0080]
    iv) Note that early in the training, $\mu_w k_i$ should be much larger than $\mu_d$, since the document's context vector is very random until some word training has been done. Eventually, $\mu_d$ may dominate $\mu_w k_i$ since $k_i$ shrinks rapidly as word training continues.
  • [0081]
    4. Get next document and go to 3
  • [0082]
    5. Finish training epoch
  • [0083]
    a) Increment epoch_count
  • [0084]
    b) Reduce μw. This ensures that as training nears completion, updates are small even for words that have not been trained much.
  • [0085]
    c) If this is not the last epoch, go to 2, else done
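    The training steps above can be sketched in code. This is a simplified illustration under stated assumptions, not the patented implementation: the initial vectors are normalized, update counters start at 1 so that the factors k1 and k2 are always defined, the document context vector is taken as the normalized sum of its word vectors, and the decay factor, learning rates, and sample corpus are all hypothetical.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def train(documents, dim=8, epochs=5, mu_w=0.2, mu_d=0.01,
          max_neighbor_words=3, decay=0.8, seed=0):
    rng = random.Random(seed)
    vocab = sorted({word for doc in documents for word in doc})
    # Step 1: random context vector per dictionary word (normalization assumed).
    vectors = {w: normalize([rng.uniform(-1, 1) for _ in range(dim)]) for w in vocab}
    updates = {w: 1 for w in vocab}  # assumed to start at 1 to avoid division by zero
    for _ in range(epochs):                                # step 2
        for doc in documents:                              # steps 3 and 4
            for i in range(len(doc)):                      # word_index
                last = min(i + max_neighbor_words, len(doc) - 1)
                for j in range(i + 1, last + 1):           # neighbor_word_index
                    a, b = doc[i], doc[j]
                    if a == b:
                        continue                           # d would be zero
                    w1, w2 = vectors[a], vectors[b]
                    d = [x - y for x, y in zip(w1, w2)]    # step 3c-i
                    k1 = 1.0 / (updates[a] * (j - i))      # closer words weigh more,
                    k2 = 1.0 / (updates[b] * (j - i))      # trained words move less
                    vectors[a] = normalize([x - mu_w * k1 * t for x, t in zip(w1, d)])
                    vectors[b] = normalize([y + mu_w * k2 * t for y, t in zip(w2, d)])
                    updates[a] += 1
                    updates[b] += 1
            # Step 3e: document context vector (assumed: normalized sum of word vectors).
            doc_vec = normalize([sum(vectors[w][n] for w in doc) for n in range(dim)])
            # Step 3f: pull each word slightly toward the document vector.
            for w in sorted(set(doc)):
                d = [x - y for x, y in zip(vectors[w], doc_vec)]
                vectors[w] = normalize([x - mu_d * t for x, t in zip(vectors[w], d)])
        mu_w *= decay                                      # step 5b: reduce mu_w
    return vectors

# Hypothetical two-document training corpus.
corpus = [["budget", "tax", "deficit", "budget", "tax"],
          ["music", "concert", "band", "music", "concert"]]
vectors = train(corpus)
```

    Every update is followed by renormalization, so all word vectors stay on the unit hyper-sphere throughout training.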
  • [0086]
    Application of the neural network 18 training rule causes the vectors for words with similar meaning, as defined by similar usage in the document corpus 20, to converge towards each other. Upon completion of training, words are assigned to the closest axis in the n-dimensional vector space. An axis represents the direction of each vector in the n-dimensional space. Words are assigned to the closest axis by taking the highest cosine similarity among the axes. FIG. 2a shows an example vector in a 3-dimensional space, with the words “budget” and “music” being assigned to an axis. FIG. 2b shows a fairly even distribution of words through the first thirty axes.
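    A minimal sketch of the axis assignment follows, assuming that the axes are the coordinate axes of the vector space. Under that assumption, the cosine similarity between a word vector and axis i is the i-th component divided by the vector's norm, so the closest axis is simply the index of the largest component. The sample vectors are hypothetical.

```python
def closest_axis(word_vector):
    """Return the index of the coordinate axis with the highest cosine similarity.

    Dividing by the vector norm does not change which component is largest,
    so the largest component identifies the closest axis.
    """
    return max(range(len(word_vector)), key=lambda i: word_vector[i])

# Hypothetical trained word vectors in a 3-dimensional space.
word_vectors = {
    "budget": [0.2, 0.9, 0.1],
    "music": [0.1, 0.3, 0.8],
}
word_axis = {word: closest_axis(vec) for word, vec in word_vectors.items()}
```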
  • [0087]
    A document is cleaned by removing stop-words, performing stemming, and inserting compound words. A document is further reduced by examining the word axis representation. Document cleaning and reduction is performed in portion 46, as illustrated in FIG. 1. The number of words in each axis is also counted. The axes containing the highest percentage of words are retained until over 70% of the document is represented. In lieu of 70%, other percentage levels are acceptable. FIG. 3 shows the percentage of words in the first 30 document axes before reduction and after reduction. FIG. 4 shows text of a document that has been reduced from 58 axes to 26 axes.
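    The reduction step can be sketched as follows. This is an illustrative reading of the paragraph above, assuming a precomputed mapping from each word to its assigned axis; the function name, the mapping, and the sample document are hypothetical.

```python
from collections import Counter

def reduce_document(words, word_axis, coverage=0.70):
    """Keep the most-used axes until more than `coverage` of the words are represented."""
    counts = Counter(word_axis[w] for w in words)
    kept_axes, covered = set(), 0
    for axis, count in counts.most_common():
        if covered / len(words) > coverage:
            break  # enough of the document is already represented
        kept_axes.add(axis)
        covered += count
    reduced = [w for w in words if word_axis[w] in kept_axes]
    return reduced, kept_axes

# Hypothetical cleaned document: 5 words on axis 0, 3 on axis 1, 2 on axis 2.
word_axis = {"a": 0, "b": 0, "c": 1, "d": 2}
document = ["a", "b", "a", "b", "a", "c", "c", "c", "d", "d"]
reduced, kept_axes = reduce_document(document, word_axis)
```

    In this example axis 0 covers 50% of the words, adding axis 1 brings coverage to 80%, and axis 2 is dropped.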
  • [0088]
    Reducing the words in a document increases the speed of the document representation and improves the query matching. Reduction increases the match because it removes terms that lower the values of the higher axes used to match other documents. Tests have been performed to determine what is a sufficient amount of document reduction without removing too many words. Reduction beyond 70% begins to remove some of the unique words of the document. Therefore, documents reduced up to 70% give the best performance.
  • [0089]
    An example of test results performed on over 2000 web news stories is shown in FIG. 5. The topic was “find awards/honors given to people and things, i.e., television shows”. The search started with the set of words: honor, award, mvp, noble prize, and hall of fame. The search was performed on documents which contain 100%, 90%, 80%, 70%, and 60% of the original document. FIG. 5 shows the results using the keyword search. The top 10 relevant documents found in the reduction were then used in the second pass to further define the query. FIG. 6 shows the results of the keyword and document example search.
  • [0090]
    Information for a document is stored in two ways: a document context vector, and an axis context vector. The document context vector, $x \in \mathbb{R}^n$, is the sum of all the words that are in the document after clean up and reduction. It is used for building a single document query, and to compare documents and queries. A document's axis context vector is the sum of the words in each axis vector after document clean up and reduction. The axis context vector is used for building a query for a document cluster and the 3-D display. A query, whether it consists of keywords, keyword clusters, example documents, or document clusters, is considered to be like a document. A query is represented by an n-dimensional context vector, $y \in \mathbb{R}^n$. Therefore, a query can be compared within the document corpus 20.
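    One possible reading of the two stored representations is sketched below. It assumes that the axis context vector is kept as one summed vector per axis, built from the words assigned to that axis; the text does not spell this out, and the word vectors and axis assignments here are hypothetical.

```python
def document_context_vector(words, vectors):
    """Sum the context vectors of all words kept in the document."""
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[w][i] for w in words) for i in range(dim)]

def axis_context_vectors(words, vectors, word_axis):
    """Group the words by assigned axis and sum each group separately (assumed layout)."""
    groups = {}
    for w in words:
        groups.setdefault(word_axis[w], []).append(w)
    return {axis: document_context_vector(ws, vectors) for axis, ws in groups.items()}

# Hypothetical word vectors and axis assignments.
vectors = {"award": [0.0, 1.0, 0.0], "honor": [0.0, 0.8, 0.6], "trial": [1.0, 0.0, 0.0]}
word_axis = {"award": 1, "honor": 1, "trial": 0}
doc = ["award", "honor", "trial", "award"]
doc_vec = document_context_vector(doc, vectors)
axis_vecs = axis_context_vectors(doc, vectors, word_axis)
```

    With this layout, the per-axis vectors add up to the document context vector.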
  • [0091]
    Positive queries are combinations of words and documents. Single word and single document queries use the entire word/document as the query. When multiple documents are used to build the query, the document axes with the highest usage are used to build the query. Table 3 shows three documents with an example vector size of 6 to build a positive query. In this example, the query is built using axes 1, 3, and 5 since they contain the highest axis usage among the three relevant documents. The default is to use the axis used by all the documents, and the next highest used axes. The user is allowed to lower or raise the number of axes used to build the query.
    TABLE 3
    Positive Query Example
    Axis Doc 1 Doc 2 Doc 3
    1 x x x
    2 x
    3 x x x
    4 x
    5 x x
    6 x
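    The default axis selection for a positive query can be sketched as follows. The per-document axis sets below are hypothetical; they were chosen only so that the per-axis usage counts match Table 3 (axes 1 and 3 used three times, axis 5 twice, axes 2, 4, and 6 once).

```python
from collections import Counter

def positive_query_axes(doc_axes):
    """Select the axes used by all documents plus the next most-used axes."""
    usage = Counter(axis for axes in doc_axes for axis in axes)
    n_docs = len(doc_axes)
    shared = {axis for axis, count in usage.items() if count == n_docs}
    lower_counts = [count for count in usage.values() if count < n_docs]
    next_highest = max(lower_counts) if lower_counts else None
    runner_up = {axis for axis, count in usage.items() if count == next_highest}
    return sorted(shared | runner_up)

# Hypothetical axis usage for three relevant documents.
relevant_docs = [{1, 2, 3, 5}, {1, 3, 4, 5}, {1, 3, 6}]
selected = positive_query_axes(relevant_docs)
```

    A user-adjustable axis count, as described in the text, could replace the fixed "next highest" rule.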
  • [0092]
    When building a multiple document query, the documents should be similar. That is, a cosine measure of similarity should be greater than 0.6. Documents with a similarity lower than 0.6 are not very similar, and it is not beneficial to combine them into one query. FIG. 7 shows examples of multiple query retrieval. The multiple query done on documents with a measure of similarity greater than 0.6 retrieves more relevant documents in the top 25 retrieved documents. The documents retrieved are mainly the documents retrieved by the individual queries, with the multiple query identifying one new relevant document in the top 25. The multiple queries can be used to help further identify relevant documents, through the repetition of a document across several queries. Multiple document queries with dissimilar documents identify more irrelevant documents. These queries correspond to a cosine measure of similarity less than 0.6.
  • [0093]
    Building a negative query is very similar to building a positive query. Instead of looking for the most frequently used axes, the least frequently used axes are examined. That is, the least frequently used axes in the specified relevant documents relative to the axis used by the bad documents are examined. Table 4 shows a negative query being built. In this example, the least frequently used axes 2, 4, and 6 are used with respect to the good documents to build the negative query. As with building the positive query, the user can also raise or lower the number of axes used in building the negative query.
    TABLE 4
    Negative Query Example
    Relevant Documents Bad
    Axis Doc 1 Doc 2 Doc 3 Doc 4
    1 x x x x
    2 x x
    3 x x x
    4 x x
    5 x x
    6 x x
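    A negative query can be sketched in the same style. The code below is one interpretation of the paragraph above: among the axes used by the bad document, it selects those least used by the relevant documents. The axis sets are hypothetical and were chosen to match the per-axis counts in Table 4.

```python
from collections import Counter

def negative_query_axes(relevant_axes, bad_axes, count=3):
    """Pick the bad document's axes that are least used by the relevant documents."""
    usage = Counter(axis for axes in relevant_axes for axis in axes)
    # Counter returns 0 for axes the relevant documents never use.
    ranked = sorted(bad_axes, key=lambda axis: (usage[axis], axis))
    return sorted(ranked[:count])

# Hypothetical axis usage: three relevant documents and one bad document.
relevant_docs = [{1, 2, 3, 5}, {1, 3, 4, 5}, {1, 3, 6}]
bad_doc = {1, 2, 4, 6}
negative_axes = negative_query_axes(relevant_docs, bad_doc)
```

    The `count` parameter stands in for the user's ability to raise or lower the number of axes.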
  • [0094]
    The retrieval system 10 initiates each retrieval engine to calculate a document's score for each query. The retrieval engines maintain only the high level scores. The retrieval system 10 standardizes the scores from each retrieval engine to range from 0 to 1. The user can adjust the lowest acceptable score and retrieval engine weight to affect the score results so as to favor or disfavor a particular retrieval engine. A ranking processor 22 uses an algorithm to fuse the results of the retrieval engines and ranks the documents based on the number of times the document was selected, highest score, lowest score, average score, location in the query list and number of retrieval engines locating the document. Irrelevant documents and queries can be removed. Each topic is made up of multiple queries from each of the retrieval components. The scores are scaled by query and for the entire topic, i.e., all the queries. Each set of scores is a separate entry into the ranking algorithm.
  • [0095]
    The query 42 for the VSM 16 contains the entire vector and the axis vectors. Scores are obtained by taking the cosine similarity measure of the query vector with the entire vector of each of the documents in the document corpus 20. The closer to one the better the match, where only high positive scores are kept. If none of the document scores equals the value 1, then the documents are scaled based on the highest score for the query. This is a quick method of increasing the scores on potentially relevant VSM documents, thus allowing fusing of the VSM highest results with the n-gram result scores.
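    The scaling described above can be sketched as follows, assuming that "scaled based on the highest score" means dividing each kept score by the top score for the query. The function name, the `floor` parameter, and the sample scores are illustrative.

```python
def scale_vsm_scores(scores, floor=0.0):
    """Keep scores above `floor` and scale them so the best document scores 1."""
    kept = {doc: s for doc, s in scores.items() if s > floor}
    if not kept:
        return {}
    top = max(kept.values())
    # If the top score is already 1, the division leaves every score unchanged.
    return {doc: s / top for doc, s in kept.items()}

# Hypothetical cosine scores for one query.
raw_scores = {"doc_a": 0.8, "doc_b": 0.4, "doc_c": -0.1}
scaled = scale_vsm_scores(raw_scores)
```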
  • [0096]
    A cosine similarity measure of the query vector against the corresponding axis of the documents in the document corpus 20 can also be applied. In this case, applying a cosine similarity measure is also useful for word queries. However, the axis vector for a document typically contains a large number of axes, so a large number of documents unrelated to the topic are retrieved. The example search of the 2000 web news stories, which retrieved documents describing “awards/honors given to people and things receiving awards”, shows the effect of using the document axis queries, as shown in Table 5 and FIG. 8. Using the VSM document axis, the precision and recall scores are lower, and 16% more irrelevant documents were retrieved. This causes more work to be performed in locating relevant documents.
    TABLE 5
    Example of Using VSM Document Axis
                                             n-gram, VSM,     n-gram, VSM, VSM word axis,
                                             VSM word axis    VSM document axis
    Number relevant documents located        12               17
    Number relevant documents in the corpus  17               17
    Number documents retrieved               42               705
  • [0097]
    Referring back to the n-gram retrieval engine 14, the number of occurrences of the least frequent term (n-gram) is counted. Since most early searches on the topic are keyword(s), the filter quickly identifies candidate documents to be reviewed early in the retrieval process. The filter is especially useful for keywords or phrases which may not have appeared in the document corpus 40 used to train the VSM 16 component of the retrieval system 10. The identified relevant documents are used on the next pass through the retrieval system 10. A tri-gram was selected due to the speed of processing and small amount of storage for the least frequency table corresponding to block 34 in FIG. 1.
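    One possible reading of the tri-gram filter is sketched below: the query's tri-gram that is rarest in the corpus is selected, and its occurrences are counted in each document. This interpretation, the function names, and the sample corpus are assumptions; the query is assumed to contain at least three characters.

```python
from collections import Counter

def trigrams(text):
    """Return all overlapping 3-character substrings of the lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

def least_frequent_trigram(query, corpus_counts):
    """Pick the query tri-gram that occurs least often in the corpus."""
    return min(trigrams(query), key=lambda gram: corpus_counts[gram])

def filter_score(query, document, corpus_counts):
    """Count occurrences of the query's rarest tri-gram in one document."""
    rare = least_frequent_trigram(query, corpus_counts)
    return Counter(trigrams(document))[rare]

# Hypothetical two-document corpus and its tri-gram frequency table.
corpus = ["the mvp award went to the team captain",
          "the award ceremony honored the mvp"]
corpus_counts = Counter(gram for doc in corpus for gram in trigrams(doc))
score = filter_score("mvp", corpus[0], corpus_counts)
```

    Because the filter works on character sequences rather than dictionary words, it can match terms that never appeared in the VSM training corpus.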
  • [0098]
    The n-gram frequency count can vary widely. The n-gram scores need to be in the range from 0 to 1 to correspond with the standardized range. The documents having a large number of the specified tri-grams (3-grams) appearing in one document were examined. The documents were divided into three groups: few matches, high matches, and scaled documents. The few matches were removed from the scaling calculation. The few-matches group consists of a large number of documents (greater than 50) with a couple of matches (approximately ranging from 1-20 matches) per document. High matches have a large number of matches in a single document. These documents need to be examined and are set to a value of 1. The remaining documents are scaled between 0 and 1. The scaling helps to identify documents that have the most 3-gram matches and should be reviewed. As a further note, taking the mean or dividing by the largest number does not provide a good representation of the n-gram documents. Furthermore, looking at a document that has only a few matching occurrences is not desirable.
  • [0099]
    A statistical method is used to locate the clusters of interest to be scaled. The mean and standard deviation are calculated without the largest n-gram frequency value. If the largest value fits within three standard deviations of the mean, then the number is used as the scaling factor. If the largest value does not fit, it is considered to be outside the cluster range. When the process is repeated, it takes out the next largest value. Accordingly, this process is repeated until the largest removed frequency value falls within the range of the third standard deviation. Numbers larger than the selected scaling number are set to one. An example of calculating the scaling value is shown in Table 6.
    TABLE 6
    Scaling Example
                                   Original Data   Removed Largest #   Removed Next Largest #
    n-Gram Frequency Occurrence    1               1                   1
                                   2               2                   2
                                   1               1                   1
                                   3               3                   2
                                   12              2                   2
                                   2               2                   1
                                   2               1                   1
                                   1               1
                                   1
    Mean                           2.77            1.625               1.42
    Standard Deviation             3.52            0.744               0.534
    3 Standard Deviations          13.33           3.857               3.03
    Comment (Original Data): If used largest value: 12/12 = 1, 3/12 = 0.25, 2/12 = 0.16, 1/12 = 0.08. It is doubtful that anything other than the document with the value of one would be reviewed.
    Comment (Removed Largest #): Remove the largest #, 12. The # 12 does not fall within three standard deviations of the remaining values.
    Comment (Removed Next Largest #): Remove the next largest #, 3. The # 3 does fall within three standard deviations. Use 3 as the scaling factor. All numbers larger than 3 are set to 1: 12 → 1, 3/3 = 1, 2/3 = 0.66, 1/3 = 0.33. The documents with values 1 to 0.66 would probably be reviewed.
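    The scaling procedure of Table 6 can be reproduced with a short sketch. It assumes the sample standard deviation (which matches the values in the table), treats "within three standard deviations" as being at most the mean plus three standard deviations of the remaining values, and omits the separate removal of the few-matches group. The function names are illustrative, and the input is assumed to be non-empty.

```python
import statistics

def scaling_factor(frequencies):
    """Remove the largest values until one falls within mean + 3 standard deviations."""
    remaining = sorted(frequencies)
    while len(remaining) > 2:
        largest = remaining.pop()  # take out the current largest value
        limit = statistics.mean(remaining) + 3 * statistics.stdev(remaining)
        if largest <= limit:
            return largest         # inside the cluster: use it as the scaling factor
    return remaining[-1]           # too few values left to test further

def scale(frequencies):
    """Scale counts to the range 0..1; counts above the factor are set to 1."""
    factor = scaling_factor(frequencies)
    return [min(f / factor, 1.0) for f in frequencies]

# n-gram frequency occurrences from the "Original Data" column of Table 6.
data = [1, 2, 1, 3, 12, 2, 2, 1, 1]
factor = scaling_factor(data)
scaled = scale(data)
```

    For the Table 6 data, the value 12 is rejected (the limit is about 3.857) and the value 3 is accepted (the limit is about 3.03), so counts of 3 and 12 both scale to 1.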
  • [0100]
    Table 7 shows the number of files used to calculate the n-gram scaling factor applied to data provided for a TREC-6 conference. TREC is an acronym for Text REtrieval Conference. This conference is a workshop series that encourages research in information retrieval from large text applications by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results.
    TABLE 7
    Number of Files Used to Calculate the n-gram Scaling Factor
                                                 Query #301   Query #337   Query #350
    # Files                                      79,777       550          3,271
    Range of files dropped out (# matches/file)  1-18         1-3          1-10
    # of files dropped out                       79,155       434          2,547
    Average                                      60           10           48
    # Documents Scaled                           522          73           538
    Range of files where documents = 1           60-829       4-10         48-7,789
    # Documents = 1                              100          43           186
  • [0101]
    The user specifies the lowest acceptable score for each of the retrieval engines. This helps eliminate the lower scoring documents. This also controls the number of documents shown to the user. The higher the number, the fewer the documents shown to the user. Additionally, the documents should reveal better matches to the query. Each retrieval engine is assigned a specific percentage by the user. The document scores for the retrieval engine are reduced by the specified percentage. Depending upon the query type, different retrieval engines can be emphasized. For example, potentially misspelled words may put more emphasis on the n-gram search engine 14. Document example queries place more emphasis on the VSM search engine 16. An algorithm ranks the document scores from the different retrieval engines. The algorithm rates the following items: number of times document identified per query and per retrieval engine, maximum and minimum score, average score and penalty points. An algorithm performing these functions is well known to one skilled in the art and will not be described in further detail herein. The ranking is thus determined by a score.
  • [0102]
    Each item is ranked except for the penalty points. A higher number results in a lower score. The individual items are totaled and the lowest final score indicates the best document. Penalty points are assigned to a document based on the retrieval engine and the document location in the ranked list. A penalty is added to a document for each retrieval engine not identifying it as relevant. Multiple engines retrieving the document is a strong indication of a relevant document. The score must be above the supplied value after it is calculated according to predetermined scaling and retrieval engine weight. If all the engines retrieve the document, then the penalty score is 0. Referring to FIG. 9, the value 200 was chosen because it supplied enough of a penalty that it moved a document's rank below documents that appeared in more than one method.
  • [0103]
    The algorithm allows each document to receive a penalty point for its location in each query list. This is intended to reward the documents that are located close to the top of the list in the individual queries, by assigning fewer penalty points. FIG. 10 further illustrates the penalty point assignment. Once the penalties are assigned for each query, the document's final list location based on all the queries results is calculated. The final location is based on the scores of the number of times the document is identified, maximum, minimum, average score, number of retrieval engines locating the document, and the list location penalty. In addition, location of the individual queries can also improve the results.
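    The two kinds of penalty can be illustrated with a rough sketch. The text does not give the exact formulas, so the following are assumptions: 200 points are added for each retrieval engine that did not retrieve the document, and the location penalty is the zero-based position of the document in each query list that contains it. The data layout and names are hypothetical.

```python
ENGINE_PENALTY = 200  # value the text cites with reference to FIG. 9

def penalty_points(doc_id, query_lists):
    """query_lists maps (engine, query) to a ranked list of document ids."""
    engines = {engine for engine, _ in query_lists}
    found_by = {engine for (engine, _), docs in query_lists.items() if doc_id in docs}
    # Assumed: one fixed penalty per engine that missed the document.
    engine_penalty = ENGINE_PENALTY * len(engines - found_by)
    # Assumed: fewer points for documents nearer the top of each list.
    location_penalty = sum(docs.index(doc_id)
                           for docs in query_lists.values() if doc_id in docs)
    return engine_penalty + location_penalty

# Hypothetical ranked results for two engines and two queries.
query_lists = {
    ("ngram", "words-1"): ["d1", "d2", "d3"],
    ("vsm", "words-1"): ["d2", "d1"],
    ("vsm", "doc-1"): ["d2", "d4"],
}
ranking = sorted(["d1", "d2", "d3", "d4"], key=lambda d: penalty_points(d, query_lists))
```

    Documents found by both engines receive no engine penalty and therefore sort ahead of documents found by only one engine.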
  • [0104]
    High precision in the retrieval system 10 is derived by providing users multiple input interaction modes, fusing results obtained from multiple information retrieval search engines 14 and 16, each supporting a different retrieval strategy, and by supporting relevance feedback mechanisms. The basic premise of relevance feedback is to implement information retrieval in multiple passes. The user refines the query in each pass based on results of previous queries. Typically, the user indicates which of the documents presented in response to an initial query are relevant, and new terms are added to the query based on this selection. Additionally, existing terms in the query can be re-weighted based on user feedback.
  • [0105]
    Primary user interaction with the retrieval system 10 is through a web-browser-based user interface 12. Users can build and tailor queries as the topic of interest is further defined, moving from a generic search to specific topic areas through query inputs. Queries may consist of a single keyword, multiple keywords (or phrases), keyword clusters, an example document, and document clusters. In addition, a user can increase or decrease the system precision, affecting the number of documents that will be rated as relevant. The weights on the retrieval engines can be modified to favor different engines based on the type of query.
  • [0106]
    A set of documents retrieved for a particular topic may exhibit a variety of aspects. This can be seen, for example, in a search retrieving information about the Oklahoma City bombing of 1995. There are relevant articles about the bomb, damage from the bomb blast, rescue work, the victims, suspects, the Timothy McVeigh trial, and the Terry Nichols trial, just to name a few. In particular, the retrieval system 10 is used to search for documents relevant to the trial of Timothy McVeigh for the Oklahoma City bombing. The document corpus consists of over 2000 news stories from the CNN web site on a variety of topics. In this case, a user begins by creating a new topic of interest: McVeigh Trial. Since this is a new topic, there are no queries associated with the topic. So the user creates a query by entering a few keywords: “McVeigh”, “trial”, “bombing”, and “Oklahoma City”, as shown in FIG. 11a. This initial query is labeled “words-1” and added to the list of queries for the McVeigh trial topic, as shown in FIG. 11b. The user has the retrieval system 10 execute this query.
  • [0107]
    As illustrated in FIG. 11c, a ranked list of documents, complete with score, is returned to the user. Clicking on the document retrieves the text through HTML hyperlinks. This enables a user to determine the relevance, from his point of view, of a document to the topic. Top documents can be reviewed, and both relevant and irrelevant documents are identified and marked as such. Irrelevant documents are filtered from subsequent queries. Removal of the higher-scoring irrelevant documents allows lower scoring documents to be accepted on the final result list. Documents can also be marked for use as examples in additional queries for the topic. Such stories are then added to the list of queries, as shown in FIG. 11d.
  • [0108]
    In the retrieval system 10, visualization display means comprises an n-dimensional document visualization display for enhancing user understanding of the retrieved document set. This tool supports multiple levels of data abstraction, clustered document presentation, data thresholding, and a variety of user interaction paradigms. The n-dimensional document visualization display enables the user to view different aspects of the document's topic. The visualization display means displays a similarity measure of the documents. The information retrieval system 10 is thus able to reduce the display down to the most important aspects of the document.
  • [0109]
    The Oklahoma City bombing stories have a number of different aspects: the bomb, building damage, the victims, victim's families, the Timothy McVeigh trial, etc. Displaying documents in a 3-dimensional space enables a user to see document clusters, the relationships of documents to each other, and also aids in the location of additional documents that may be relevant to a query. Documents near identified relevant documents (through queries) can be easily reviewed for topic relevance. The user is able to manipulate the dimensional view to gain new views of document relationships. Changing the document's dimensionality allows the information to be viewed for different topic aspects to aid in further identification of relevant documents.
  • [0110]
    FIG. 12a shows an example of the 3-dimensional viewer containing documents retrieved for the McVeigh trial topic using the query keywords: McVeigh, trial, Oklahoma City, and bomb. Document locations are represented in space by a box. Additionally in this view, documents determined to be relevant by the information retrieval system 10 display the document name next to the box. Clustering of documents can be observed in several areas. A first cluster contains the McVeigh trial stories, plus additional stories related to the topic but not identified by the retrieval system 10. A second cluster contains the Bosnia stories dealing with bombing, which are near the keyword “bomb”. A third cluster contains the O. J. Simpson trial stories, and appears near the word “trial”.
  • [0111]
    Referring to FIG. 12a, each document in the retrieved document corpus may be represented mathematically in the 3-dimensional space by a colored cube. For example, a red cube could represent a query request, in this case the words “trial”, “bombing”, “McVeigh” and “Oklahoma City”. Yellow text could be used to indicate the relevant documents found through text queries submitted to the retrieval system 10. Additional colors may be used to indicate document clusters, i.e., documents related in some aspect. The colors selected to represent a query request, relevant documents and document clusters are not limited to red and yellow. These colors are for illustrative purposes, wherein other colors are acceptable for indicating such information to a user. Therefore, using different colors to represent different aspects of the retrieved documents on the display would allow the user to more quickly identify the relevant information to be retrieved.
  • [0112]
    The information retrieval system's 10 3-dimensional view enables a user to quickly identify document clusters around queries by plotting spheres around each query. FIG. 12b shows spheres drawn around the keywords in the “words-1” query. A user can “fly” into the sphere to explore the documents clustered around a query. FIG. 13 shows the clustering of documents in several areas. As previously stated, Cluster 1 corresponds to the McVeigh trial stories, plus additional stories related to the topic not identified by the information retrieval system 10. Cluster 2 corresponds to Bosnia stories dealing with bombing, which are near the keyword “bomb”. Cluster 3 corresponds to the O. J. Simpson trial stories, and appears near the word “trial”. The clusters in FIG. 13, found in the 3-dimensional keyword view of higher ranking documents in the corpus, thus include the McVeigh trial stories, the Bosnia bombing stories, and the O. J. Simpson trial stories.
  • [0113]
    FIG. 14a shows a close-up focused on the word “trial”, and the text has been turned on so that the file names of documents represented by the boxes are also displayed. It is noted that the O. J. Simpson trial articles have clustered near the word trial. A user can double click on any of the boxes to bring up the associated document. FIG. 14b shows another perspective on the whole corpus of retrieved documents using different dimensions. Again, it is noted that the relevant stories appear to separate from the other stories using the retrieval system 10. To look at some of the documents which have not been identified, the user can double click on any of the boxes to bring up the associated document, and make an inspection to determine whether or not it is relevant to the Timothy McVeigh trial.
  • [0114]
    The information retrieval system 10 has been evaluated using the benchmark data and query sets provided by the National Institute of Standards and Technology as part of the 6th Text REtrieval Conference (TREC-6). The results obtained demonstrated high precision for limited-sized retrieval sets. As previously stated, the Text REtrieval Conference (TREC) workshop series encourages research in information retrieval from large text applications by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. TREC has become the major experimental effort in the field of information retrieval. The information retrieval system 10 of the present invention competed in the Manual Ad Hoc Retrieval task for the large data set. During TREC, the n-gram search engine 14 was used as a high speed filter to provide document examples. The VSM search engine 16 was used for final retrieval and document scoring.
  • [0115]
    There are two primary measures employed in the TREC competition: precision and recall. Precision is the percentage of documents from the retrieved set that are relevant to the query. Recall is the percentage of relevant documents retrieved from the total number of relevant documents available within the collection. Time constraints and interest level limit the typical user of an information retrieval system to reviewing the top documents before determining if the results of a query are accurate and satisfactory. The user needs precise, representative documents at the top of the list to make this determination. Accordingly, the information retrieval system 10 emphasizes precision (accuracy) and speed while retrieving relevant documents high on the list. The information retrieval system 10 permits the user to build and tailor the query as he or she further defines the topic. The system also permits movement from a generic search to a specific topic area through query inputs.
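    The two measures can be written out directly. The sketch below applies them to the counts in the first data column of Table 5 (12 relevant documents located, 42 documents retrieved, 17 relevant documents in the corpus) and computes precision for the second column (17 located, 705 retrieved); the function names are illustrative.

```python
def precision(relevant_retrieved, retrieved):
    """Fraction of the retrieved documents that are relevant."""
    return relevant_retrieved / retrieved

def recall(relevant_retrieved, relevant_in_corpus):
    """Fraction of all relevant documents in the collection that were retrieved."""
    return relevant_retrieved / relevant_in_corpus

# Counts taken from Table 5.
word_axis_precision = precision(12, 42)
word_axis_recall = recall(12, 17)
document_axis_precision = precision(17, 705)
```

    Multiplying each fraction by 100 gives the percentages referred to in the text.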
  • [0116]
    In comparing precision results using the present invention with other TREC teams that retrieved a similar number of documents, the information retrieval system 10 maintains a high level of precision for the top 5, 10, 20 and 30 documents retrieved. As FIG. 15 illustrates, precision for the Applicants' invention (which is assigned to Harris Corporation) is higher than that of the other TREC teams that retrieved more relevant documents. This fits well with the fact that time constraints and interest level on the part of an information retrieval system user often limit the user to reviewing the top documents before the user determines if the results of a particular query were accurate and satisfactory. During TREC, the documents were scored for the entire topic, i.e., all the queries were combined into the ranking. A modified algorithm included individual query list locations in the overall scoring, which improved the results, as shown in FIG. 16.
  • CONCLUSION
  • [0117]
    The information retrieval system 10 is an efficient, high-precision information retrieval and visualization system. The information retrieval system 10 allows interactive formation and refinement of queries, and fuses results from multiple retrieval engines to leverage the strengths of each one. In addition, the information retrieval system 10 allows for efficient maintenance, i.e., making it easy to add new documents. The information retrieval system 10 also allows for multiple dictionaries and vocabularies, thus allowing a user to develop role-based dictionaries and/or vocabularies for searching specific databases. The information retrieval system 10 provides a user interface for user interaction as well as a 3-dimensional presentation of the retrieved documents for more efficiently exploring the documents retrieved in response to a user's search query.
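As an illustration of fusing results from several retrieval engines, the following minimal sketch combines per-engine scores using a user-assigned weighting percentage for each engine. The min-max normalization, the function name `fuse_results`, and the sample scores are assumptions made for the example. The text does not specify the fusion formula, so this should not be read as the system's actual algorithm.

```python
def fuse_results(engine_scores, engine_weights):
    """Combine per-engine document scores into one ranked list.

    engine_scores:  {engine_name: {doc_id: score}}
    engine_weights: {engine_name: weighting percentage, e.g. 40 and 60}
    """
    total_weight = sum(engine_weights.values())
    if total_weight <= 0:
        raise ValueError("at least one engine needs a positive weight")

    fused = {}
    for engine, scores in engine_scores.items():
        if not scores:
            continue
        weight = engine_weights.get(engine, 0) / total_weight
        # Assumption: min-max normalize each engine's scores so that engines
        # with different score scales contribute comparably.
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        for doc_id, score in scores.items():
            norm = (score - lo) / span if span else 1.0
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * norm

    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores from an n-gram engine and a VSM engine.
scores = {
    "ngram": {"d1": 12.0, "d2": 4.0, "d3": 8.0},
    "vsm": {"d2": 0.9, "d3": 0.5, "d4": 0.1},
}
weights = {"ngram": 40, "vsm": 60}

ranking = fuse_results(scores, weights)
print([doc_id for doc_id, _ in ranking])
```

A document found by only one engine still receives a fused score, so the combined list can surface documents that a single engine would have missed.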
  • [0118]
    Many modifications and other embodiments of the invention will come to the mind of one skilled in the art having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed, and that the modifications and embodiments are intended to be included within the scope of the dependent claims.

Claims (79)

    That which is claimed is:
  1. 1. An information retrieval system for selectively retrieving documents from a document database, the system comprising:
    an input interface for accepting at least one user search query;
    a plurality of search engines for retrieving documents from the document database based upon the at least one user search query, each of said search engines producing a common mathematical representation of each retrieved document;
    a display; and
    visualization display means for mapping respective mathematical representations of the retrieved documents onto said display.
  2. 2. An information retrieval system according to claim 1, wherein at least one of said search engines produces a document context vector representation and an axis context vector representation of each retrieved document; and wherein said visualization means maps the axis context vector representations of the retrieved documents onto said display.
  3. 3. An information retrieval system according to claim 2, wherein the document context vector is based upon a sum of all words in a document after reducing low content words.
  4. 4. An information retrieval system according to claim 2, wherein the axis context vector is based upon a sum of all words in each axis after reducing low content words.
  5. 5. An information retrieval system according to claim 1, wherein said visualization display means further comprises keyword display means for providing a three-dimensional display of keywords from the at least one user search query.
  6. 6. An information retrieval system according to claim 5, wherein said visualization display means further comprises cluster display means for displaying retrieved documents in clusters surrounding respective user search queries.
  7. 7. An information retrieval system according to claim 6, wherein the retrieved documents are displayed in a color different from a color used for displaying the keywords from the at least one user search query.
  8. 8. An information retrieval system according to claim 6, wherein said visualization display means further comprises means for viewing different aspects of the retrieved documents on the three-dimensional display.
  9. 9. An information retrieval system according to claim 1, wherein said visualization display means provides a list of retrieved documents, and wherein each retrieved document has an assigned score indicating relevance to the search query with respect to the other retrieved documents.
  10. 10. An information retrieval system according to claim 1, wherein a retrieved document mapped onto said display is selectable via said input interface, with the text of the selected document being displayed on said display.
  11. 11. An information retrieval system according to claim 1, further comprising fusing means for combining and ranking retrieved documents from all of said search engines.
  12. 12. An information retrieval system according to claim 1, wherein said plurality of search engines comprises an n-gram search engine.
  13. 13. An information retrieval system according to claim 12, wherein said n-gram search engine comprises n-gram training means for least frequency training of training documents.
  14. 14. An information retrieval system according to claim 1, wherein said plurality of search engines comprises a vector space model (VSM) search engine.
  15. 15. An information retrieval system according to claim 14, wherein said VSM search engine comprises VSM training means for processing training documents, and comprising a neural network.
  16. 16. An information retrieval system according to claim 1, wherein each said search engine comprises a plurality of user selectable dictionaries for determining mathematical representations of documents.
  17. 17. An information retrieval system according to claim 1, wherein said input interface comprises relevance feedback means for accepting relevance feedback from the user.
  18. 18. An information retrieval system according to claim 17, wherein said relevance feedback means comprises means for selecting one or more retrieved documents as a next search query.
  19. 19. An information retrieval system according to claim 1, wherein said input interface comprises means for permitting a user to assign a weighting percentage to each search engine.
  20. 20. An information retrieval system according to claim 1, wherein the at least one user search query comprises at least one keyword.
  21. 21. An information retrieval system according to claim 1, wherein the at least one user search query comprises at least one document.
  22. 22. An information retrieval system according to claim 1, wherein the at least one user search query comprises at least one document cluster.
  23. 23. An information retrieval system for selectively retrieving documents from a document database, the system comprising:
    an input interface for accepting at least one user search query;
    an n-gram search engine for retrieving documents from the document database based upon the at least one user search query, said n-gram search engine producing a common mathematical representation of each retrieved document;
    a vector space model (VSM) search engine for retrieving documents from the document database based upon the at least one user search query, said VSM search engine producing a common mathematical representation of each retrieved document;
    a display; and
    visualization display means for mapping respective mathematical representations of the retrieved documents onto said display.
  24. 24. An information retrieval system according to claim 23, wherein said VSM search engine produces a document context vector representation and an axis context vector representation of each retrieved document; and wherein said visualization means maps the axis context vector representations of the retrieved documents onto said display.
  25. 25. An information retrieval system according to claim 23, wherein the document context vector is based upon a sum of all words in a document after reducing low content words.
  26. 26. An information retrieval system according to claim 23, wherein the axis context vector is based upon a sum of all words in each axis after reducing low content words.
  27. 27. An information retrieval system according to claim 23, wherein said visualization display means further comprises keyword display means for providing a three-dimensional display of keywords from the at least one user input query.
  28. 28. An information retrieval system according to claim 27, wherein said visualization display means further comprises cluster display means for displaying retrieved documents in clusters surrounding respective user search queries.
  29. 29. An information retrieval system according to claim 28, wherein the retrieved documents are displayed in a color different from a color used for displaying the keywords from the at least one user search query.
  30. 30. An information retrieval system according to claim 28, wherein said visualization display means further comprises means for viewing different aspects of the retrieved documents on the three-dimensional display.
  31. 31. An information retrieval system according to claim 23, wherein said visualization display means provides a list of retrieved documents, and wherein each retrieved document has an assigned score indicating relevance to the search query with respect to the other retrieved documents.
  32. 32. An information retrieval system according to claim 23, wherein a retrieved document mapped onto said display is selectable via said input interface, with the text of the selected document being displayed on said display.
  33. 33. An information retrieval system according to claim 23, further comprising fusing means for combining and ranking retrieved documents from said n-gram search engine and said VSM search engine.
  34. 34. An information retrieval system according to claim 23, wherein said n-gram search engine comprises n-gram training means for least frequency training of training documents.
  35. 35. An information retrieval system according to claim 23, wherein said VSM search engine comprises VSM training means for processing training documents, and comprising a neural network.
  36. 36. An information retrieval system according to claim 23, wherein each search engine comprises a plurality of user selectable dictionaries for determining mathematical representations of documents.
  37. 37. An information retrieval system according to claim 23, wherein said input interface comprises relevance feedback means for accepting relevance feedback from the user.
  38. 38. An information retrieval system according to claim 37, wherein said relevance feedback means comprises means for selecting one or more retrieved documents as a next search query.
  39. 39. An information retrieval system according to claim 23, wherein said input interface comprises means for permitting a user to assign a weighting percentage to each search engine.
  40. 40. An information retrieval system according to claim 23, wherein the at least one user search query comprises at least one keyword.
  41. 41. An information retrieval system according to claim 23, wherein the at least one user search query comprises at least one document.
  42. 42. An information retrieval system according to claim 23, wherein the at least one user search query comprises at least one document cluster.
  43. 43. A method for selectively retrieving documents from a document database using an information retrieval system comprising a plurality of search engines, the method comprising the steps of:
    generating at least one user search query;
    retrieving documents from the document database based upon the user search query, with each search engine searching the document database;
    producing a common mathematical representation of each document retrieved by the respective search engines; and
    mapping respective mathematical representations of the retrieved documents onto a display.
  44. 44. A method according to claim 43, wherein the step of producing comprises the steps of:
    producing a document context vector representation of each retrieved document; and
    producing an axis context vector representation of each retrieved document.
  45. 45. A method according to claim 44, wherein the document context vector is based upon a sum of all words in a document after reducing low content words.
  46. 46. A method according to claim 44, wherein the axis context vector is based upon a sum of all words in each axis after reducing low content words.
  47. 47. A method according to claim 44, wherein the step of mapping comprises the step of mapping the axis context vector representations of the retrieved documents onto the display.
  48. 48. A method according to claim 44, wherein the step of mapping comprises the step of comparing the document context vectors with the at least one user search query.
  49. 49. A method according to claim 43, wherein the step of mapping further comprises the step of displaying keywords from the at least one user input query onto a three dimensional display.
  50. 50. A method according to claim 49, wherein the step of mapping comprises the step of displaying retrieved documents in clusters surrounding respective user search queries.
  51. 51. A method according to claim 50, wherein the step of displaying comprises the step of displaying retrieved documents in a color different from a color used for displaying the keywords from the at least one user search query.
  52. 52. A method according to claim 50, wherein the step of displaying comprises the step of displaying different aspects of the retrieved documents on the three-dimensional display.
  53. 53. A method according to claim 43, further comprising the step of providing a list of retrieved documents, and wherein each retrieved document has an assigned score indicating relevance to the search query with respect to the other retrieved documents.
  54. 54. A method according to claim 43, further comprising the steps of:
    receiving an input for selecting a retrieved document mapped onto the display; and
    displaying the text of the selected document on the display.
  55. 55. A method according to claim 43, further comprising the step of combining and ranking retrieved documents from all of the search engines.
  56. 56. A method according to claim 43, wherein the plurality of search engines comprises an n-gram search engine, and further comprising the step of least frequency training of training documents.
  57. 57. A method according to claim 43, wherein the plurality of search engines comprises a vector space model (VSM) search engine, and further comprising the step of processing training documents.
  58. 58. A method according to claim 43 further comprising the step of providing relevance feedback means for accepting relevance feedback from a user.
  59. 59. A method according to claim 58, wherein the step of providing relevance feedback means comprises the step of selecting one or more retrieved documents as a next search query.
  60. 60. A method according to claim 43, further comprising the step of assigning a weighting percentage to each search engine.
  61. 61. A method according to claim 43, wherein the step of generating comprises the step of generating at least one keyword.
  62. 62. A method according to claim 43, wherein the step of generating comprises the step of generating at least one document.
  63. 63. A method according to claim 43, wherein the step of generating comprises the step of generating at least one document cluster.
  64. 64. A method for selectively retrieving documents from a document database, the method comprising the steps of:
    defining a dictionary comprising a plurality of words related to a topic to be searched;
    randomly assigning a context vector to each word in the dictionary;
    training the dictionary words;
    assigning axis representation to each dictionary word;
    receiving at least one user search query; and
    searching a document database based upon the at least one user search query.
  65. 65. A method according to claim 64, wherein the step of training comprises the steps of:
    creating context vectors for each word in at least one training document; and
    converging the context vectors toward each other for the context vectors representing words appearing close to one another based upon contextual usage.
  66. 66. A method according to claim 64, wherein the step of assigning comprises the step of assigning each dictionary word to an axis having a largest component of a respective context vector.
  67. 67. A method according to claim 64, further comprising the step of displaying a mathematical representation of the retrieved documents from the document database corresponding to the at least one user search query.
  68. 68. A method according to claim 67, wherein the step of displaying comprises displaying three-dimensional keywords from the at least one user search query in three-dimensions.
  69. 69. A method according to claim 68, wherein the step of displaying comprises displaying retrieved documents in clusters surrounding respective keywords.
  70. 70. A method according to claim 69, wherein the step of displaying comprises displaying retrieved documents in a color different from a color used for displaying keywords from the at least one user search query.
  71. 71. A method according to claim 67, wherein the step of displaying comprises displaying different aspects of the retrieved documents on the three-dimensional display.
  72. 72. A method according to claim 64, further comprising the step of generating a list of retrieved documents, each having an assigned score indicating relevance to the at least one user search query with respect to other retrieved documents.
  73. 73. A method according to claim 67, further comprising the steps of:
    receiving an input for selecting a retrieved document; and
    displaying the text of the selected document.
  74. 74. A method according to claim 67, where the step of displaying comprises the step of multiplying an axis representation of the search query by an axis representation of a retrieved document.
  75. 75. A method according to claim 64, where the step of assigning comprises the steps of:
    calculating a cosine angle between the axis representation of each dictionary word with each context vector; and
    assigning each dictionary word to a respective context vector corresponding to an axis having the highest calculated cosine angle.
  76. 76. A method according to claim 64, further comprising the step of receiving relevance feedback from a user.
  77. 77. A method according to claim 76, further comprising the step of selecting one or more retrieved documents as a next search query.
  78. 78. A method according to claim 64, wherein the step of searching comprises the step of using an n-gram search engine for searching the document database.
  79. 79. A method according to claim 64, wherein the step of searching comprises the step of using a vector space model (VSM) for searching the document database.
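The dictionary-based steps recited in claims 64 to 66, together with the document context vector of claims 3 and 45, can be illustrated with a minimal sketch. The update rule, window size, learning rate, choice of the largest signed component, and all names below are assumptions made for the example. The claims do not fix these details, and the described system may implement them differently (for instance with a neural network, as in claim 15).

```python
import random


def init_context_vectors(dictionary, dims, seed=0):
    """Randomly assign a context vector to each dictionary word."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1.0, 1.0) for _ in range(dims)] for w in dictionary}


def train(vectors, tokens, window=2, rate=0.1):
    """Move the vectors of words that appear near each other toward one another."""
    for i, word in enumerate(tokens):
        if word not in vectors:
            continue
        for other in tokens[max(0, i - window): i + window + 1]:
            if other == word or other not in vectors:
                continue
            v, o = vectors[word], vectors[other]
            # Step a fraction `rate` of the way from this word toward its neighbor.
            vectors[word] = [a + rate * (b - a) for a, b in zip(v, o)]


def assign_axis(vector):
    """Index of the axis holding the largest (signed) component of the vector."""
    return max(range(len(vector)), key=lambda i: vector[i])


def document_vector(tokens, vectors, stop_words):
    """Sum the context vectors of a document's words, skipping low-content words."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for word in tokens:
        if word in vectors and word not in stop_words:
            total = [t + c for t, c in zip(total, vectors[word])]
    return total


dictionary = ["search", "engine", "query", "document"]  # hypothetical topic dictionary
vectors = init_context_vectors(dictionary, dims=3)
train(vectors, "search engine query document search engine".split())
axes = {word: assign_axis(vec) for word, vec in vectors.items()}
doc_vec = document_vector("the search engine".split(), vectors, stop_words={"the"})
print(axes, doc_vec)
```

Because the initial vectors are random, the axis each word lands on depends on the seed; only the relative positions of the trained vectors carry meaning.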
US09195773 1998-11-18 1998-11-18 Multiple engine information retrieval and visualization system Granted US20030069873A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US09195773 US6574632B2 (en) 1998-11-18 1998-11-18 Multiple engine information retrieval and visualization system

Publications (1)

Publication Number Publication Date
US20030069873A1 (en) 2003-04-10

Family

ID=22722748

Family Applications (3)

Application Number Title Priority Date Filing Date
US09195773 Granted US20030069873A1 (en) 1998-11-18 1998-11-18 Multiple engine information retrieval and visualization system
US09195773 Expired - Fee Related US6574632B2 (en) 1998-11-18 1998-11-18 Multiple engine information retrieval and visualization system
US10356958 Expired - Fee Related US6701318B2 (en) 1998-11-18 2003-02-03 Multiple engine information retrieval and visualization system

Family Applications After (2)

Application Number Title Priority Date Filing Date
US09195773 Expired - Fee Related US6574632B2 (en) 1998-11-18 1998-11-18 Multiple engine information retrieval and visualization system
US10356958 Expired - Fee Related US6701318B2 (en) 1998-11-18 2003-02-03 Multiple engine information retrieval and visualization system

Country Status (1)

Country Link
US (3) US20030069873A1 (en)

Cited By (50)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030208485A1 (en) * 2002-05-03 2003-11-06 Castellanos Maria G. Method and system for filtering content in a discovered topic
US20040148562A1 (en) * 2001-03-23 2004-07-29 Hardy Hofer Methods for the arrangement of a document in a document inventory
US20040230574A1 (en) * 2000-01-31 2004-11-18 Overture Services, Inc Method and system for generating a set of search terms
US20050055362A1 (en) * 2000-03-22 2005-03-10 Gunter Schmidt Method for finding objects
US20050060289A1 (en) * 2003-09-12 2005-03-17 Mark Keenan A method for relaxing multiple constraints in search and calculation and then displaying results
US20050060287A1 (en) * 2003-05-16 2005-03-17 Hellman Ziv Z. System and method for automatic clustering, sub-clustering and cluster hierarchization of search results in cross-referenced databases using articulation nodes
US20050165805A1 (en) * 2001-06-29 2005-07-28 International Business Machines Corporation Method and system for spatial information retrieval for hyperlinked documents
US20060010117A1 (en) * 2004-07-06 2006-01-12 Icosystem Corporation Methods and systems for interactive search
US20060036588A1 (en) * 2000-02-22 2006-02-16 Metacarta, Inc. Searching by using spatial document and spatial keyword document indexes
US20060173817A1 (en) * 2004-12-29 2006-08-03 Chowdhury Abdur R Search fusion
US20060218100A1 (en) * 1999-06-30 2006-09-28 Silverbrook Research Pty Ltd Method of collecting a copyright fee for a document requested via an interactive surface
US20070067212A1 (en) * 2005-09-21 2007-03-22 Eric Bonabeau System and method for aiding product design and quantifying acceptance
US20070214158A1 (en) * 2006-03-08 2007-09-13 Yakov Kamen Method and apparatus for conducting a robust search
US7283997B1 (en) * 2003-05-14 2007-10-16 Apple Inc. System and method for ranking the relevance of documents retrieved by a query
US20070298866A1 (en) * 2006-06-26 2007-12-27 Paolo Gaudiano Methods and systems for interactive customization of avatars and other animate or inanimate items in video games
US20080059456A1 (en) * 2004-12-29 2008-03-06 Aol Llc, A Delaware Limited Liability Company (Formerly Known As America Online, Inc.) Domain Expert Search
US20080140374A1 (en) * 2003-08-01 2008-06-12 Icosystem Corporation Methods and Systems for Applying Genetic Operators to Determine System Conditions
US20080147644A1 (en) * 2000-05-31 2008-06-19 Yariv Aridor Information search using knowledge agents
US20080172368A1 (en) * 2004-12-29 2008-07-17 Aol Llc Query routing
US20080235192A1 (en) * 2007-03-19 2008-09-25 Mitsuhisa Kanaya Information retrieval system and information retrieval method
US7475072B1 (en) * 2005-09-26 2009-01-06 Quintura, Inc. Context-based search visualization and context management using neural networks
WO2009058625A1 (en) * 2007-11-01 2009-05-07 Ut-Battelle, Llc Dynamic reduction of dimensions of a document vector in a document search and retrieval system
US20090198668A1 (en) * 2008-01-31 2009-08-06 Business Objects, S.A. Apparatus and method for displaying documents relevant to the content of a website
US20090199158A1 (en) * 2008-01-31 2009-08-06 Business Objects, S.A. Apparatus and method for building a component to display documents relevant to the content of a website
US20090222444A1 (en) * 2004-07-01 2009-09-03 Aol Llc Query disambiguation
US20090248669A1 (en) * 2008-04-01 2009-10-01 Nitin Mangesh Shetti Method and system for organizing information
US20090327279A1 (en) * 2008-06-25 2009-12-31 International Business Machines Corporation Apparatus and method for supporting document data search
US20100083131A1 (en) * 2008-09-19 2010-04-01 Nokia Corporation Method, Apparatus and Computer Program Product for Providing Relevance Indication
US20100153357A1 (en) * 2003-06-27 2010-06-17 At&T Intellectual Property I, L.P. Rank-based estimate of relevance values
US20100185572A1 (en) * 2003-08-27 2010-07-22 Icosystem Corporation Methods and systems for multi-participant interactive evolutionary computing
US20100199186A1 (en) * 2003-04-04 2010-08-05 Eric Bonabeau Methods and systems for interactive evolutionary computing (iec)
US20100211558A1 (en) * 2004-07-06 2010-08-19 Icosystem Corporation Methods and apparatus for interactive searching techniques
US20100317446A1 (en) * 2008-02-05 2010-12-16 Konami Digital Entertainment Co., Ltd. Information processing device, information processing device control method, program, and information storage medium
US20110047111A1 (en) * 2005-09-26 2011-02-24 Quintura, Inc. Use of neural networks for annotating search results
US20110047145A1 (en) * 2007-02-19 2011-02-24 Quintura, Inc. Search engine graphical interface using maps of search terms and images
US20110179002A1 (en) * 2010-01-19 2011-07-21 Dell Products L.P. System and Method for a Vector-Space Search Engine
US20110282865A1 (en) * 2010-05-17 2011-11-17 Microsoft Corporation Geometric mechanism for privacy-preserving answers
US8180754B1 (en) 2008-04-01 2012-05-15 Dranias Development Llc Semantic neural network for aggregating query searches
US20130311450A1 (en) * 2003-01-25 2013-11-21 Karthik Ramani Methods, systems, and data structures for performing searches on three dimensional objects
US8645409B1 (en) * 2008-04-02 2014-02-04 Google Inc. Contextual search term evaluation
US20140129494A1 (en) * 2012-11-08 2014-05-08 Georges Harik Searching text via function learning
US20140149415A1 (en) * 2003-09-05 2014-05-29 Google Inc. System and method for providing search query refinements
US20140250376A1 (en) * 2013-03-04 2014-09-04 Microsoft Corporation Summarizing and navigating data using counting grids
WO2014120851A3 (en) * 2013-02-04 2015-02-26 TextWise Company, LLC Method and system for visualizing documents
US9058395B2 (en) 2003-05-30 2015-06-16 Microsoft Technology Licensing, Llc Resolving queries based on automatic determination of requestor geographic location
US20150199339A1 (en) * 2014-01-14 2015-07-16 Xerox Corporation Semantic refining of cross-lingual information retrieval results
US20160267546A1 (en) * 2005-08-04 2016-09-15 Time Warner Cable Enterprises Llc Method and apparatus for context-specific content delivery
US9510058B2 (en) 2007-04-30 2016-11-29 Google Inc. Program guide user interface
US9703871B1 (en) * 2010-07-30 2017-07-11 Google Inc. Generating query refinements using query components
US20170270097A1 (en) * 2016-03-17 2017-09-21 Yahoo Japan Corporation Determination apparatus and determination method

Families Citing this family (227)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6963920B1 (en) * 1993-11-19 2005-11-08 Rose Blush Software Llc Intellectual asset protocol for defining data exchange rules and formats for universal intellectual asset documents, and systems, methods, and computer program products related to same
US6792412B1 (en) * 1999-02-02 2004-09-14 Alan Sullivan Neural network system and method for controlling information output based on user feedback
US6326988B1 (en) * 1999-06-08 2001-12-04 Monkey Media, Inc. Method, apparatus and article of manufacture for displaying content in a multi-dimensional topic space
US6556992B1 (en) 1999-09-14 2003-04-29 Patent Ratings, Llc Method and system for rating patents and other intangible assets
US20090259506A1 (en) * 1999-09-14 2009-10-15 Barney Jonathan A Method and system for rating patents and other intangible assets
US7725307B2 (en) * 1999-11-12 2010-05-25 Phoenix Solutions, Inc. Query engine for processing voice based queries including semantic decoding
EP1204032A4 (en) * 1999-12-21 2008-06-11 Matsushita Electric Ind Co Ltd Vector index creating method, similar vector searching method, and devices for them
US7191223B1 (en) * 2000-01-11 2007-03-13 The Relegence Corporation System and method for real-time alerts
US6999957B1 (en) * 2000-01-11 2006-02-14 The Relegence Corporation System and method for real-time searching
US6883135B1 (en) 2000-01-28 2005-04-19 Microsoft Corporation Proxy server using a statistical model
US7333983B2 (en) * 2000-02-03 2008-02-19 Hitachi, Ltd. Method of and an apparatus for retrieving and delivering documents and a recording media on which a program for retrieving and delivering documents are stored
ES2208164T3 (en) * 2000-02-23 2004-06-16 Ser Solutions, Inc Method and apparatus for processing electronic documents.
US8645137B2 (en) 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
US6895552B1 (en) * 2000-05-31 2005-05-17 Ricoh Co., Ltd. Method and an apparatus for visual summarization of documents
JP3499808B2 (en) * 2000-06-29 2004-02-23 本田技研工業株式会社 Electronic document classification system
US20020078091A1 (en) * 2000-07-25 2002-06-20 Sonny Vu Automatic summarization of a document
US7007008B2 (en) 2000-08-08 2006-02-28 America Online, Inc. Category searching
US7359951B2 (en) * 2000-08-08 2008-04-15 Aol Llc, A Delaware Limited Liability Company Displaying search results
US7225180B2 (en) 2000-08-08 2007-05-29 Aol Llc Filtering search results
US7047229B2 (en) 2000-08-08 2006-05-16 America Online, Inc. Searching content on web pages
US7080073B1 (en) 2000-08-18 2006-07-18 Firstrain, Inc. Method and apparatus for focused crawling
US7103838B1 (en) 2000-08-18 2006-09-05 Firstrain, Inc. Method and apparatus for extracting relevant data
US6915294B1 (en) * 2000-08-18 2005-07-05 Firstrain, Inc. Method and apparatus for searching network resources
US7158989B2 (en) * 2000-10-27 2007-01-02 Buc International Corporation Limit engine database management system
US7099860B1 (en) * 2000-10-30 2006-08-29 Microsoft Corporation Image retrieval systems and methods with semantic and feature based relevance feedback
US6970860B1 (en) * 2000-10-30 2005-11-29 Microsoft Corporation Semi-automatic annotation of multimedia objects
KR100422710B1 (en) * 2000-11-25 2004-03-12 엘지전자 주식회사 Multimedia query and retrieval system using multi-weighted feature
WO2002048905A1 (en) * 2000-12-15 2002-06-20 80-20 Software Pty. Limited Method of document searching
US7174453B2 (en) 2000-12-29 2007-02-06 America Online, Inc. Message screening system
US6727922B2 (en) * 2001-02-21 2004-04-27 International Business Machines Corporation GUI for representing entity matches utilizing graphical transitions performed directly on the matching object
US7117434B2 (en) 2001-06-29 2006-10-03 International Business Machines Corporation Graphical web browsing interface for spatial data navigation and method of navigating data blocks
US7188141B2 (en) * 2001-06-29 2007-03-06 International Business Machines Corporation Method and system for collaborative web research
JP3907161B2 (en) * 2001-06-29 2007-04-18 International Business Machines Corporation Keyword search method, keyword search terminal, computer program
US20030014405A1 (en) * 2001-07-09 2003-01-16 Jacob Shapiro Search engine designed for handling long queries
US20030066025A1 (en) * 2001-07-13 2003-04-03 Garner Harold R. Method and system for information retrieval
DK1288792T3 (en) * 2001-08-27 2012-04-02 Bdgb Entpr Software Sarl A method of automatically indexing documents
US7606819B2 (en) * 2001-10-15 2009-10-20 Maya-Systems Inc. Multi-dimensional locating system and method
US7680817B2 (en) * 2001-10-15 2010-03-16 Maya-Systems Inc. Multi-dimensional locating system and method
CA2601154C (en) * 2007-07-07 2016-09-13 Mathieu Audet Method and system for distinguising elements of information along a plurality of axes on a basis of a commonality
US20080058106A1 (en) 2002-10-07 2008-03-06 Maya-Systems Inc. Multi-dimensional locating game system and method
US20030084219A1 (en) * 2001-10-26 2003-05-01 Maxxan Systems, Inc. System, apparatus and method for address forwarding for a computer network
US7085846B2 (en) * 2001-12-31 2006-08-01 Maxxan Systems, Incorporated Buffer to buffer credit flow control for computer network
US7145914B2 (en) 2001-12-31 2006-12-05 Maxxan Systems, Incorporated System and method for controlling data paths of a network processor subsystem
US7356461B1 (en) * 2002-01-14 2008-04-08 Nstein Technologies Inc. Text categorization method and apparatus
US7379970B1 (en) 2002-04-05 2008-05-27 Ciphermax, Inc. Method and system for reduced distributed event handling in a network environment
US20030195956A1 (en) * 2002-04-15 2003-10-16 Maxxan Systems, Inc. System and method for allocating unique zone membership
US20030200330A1 (en) * 2002-04-22 2003-10-23 Maxxan Systems, Inc. System and method for load-sharing computer network switch
US20030202510A1 (en) * 2002-04-26 2003-10-30 Maxxan Systems, Inc. System and method for scalable switch fabric for computer network
JP4116329B2 (en) * 2002-05-27 2008-07-09 Hitachi, Ltd. Document information display system, document information display method, and document search method
US7024408B2 (en) * 2002-07-03 2006-04-04 Word Data Corp. Text-classification code, system and method
CA2395905A1 (en) * 2002-07-26 2004-01-26 Teraxion Inc. Multi-grating tunable chromatic dispersion compensator
US20040030766A1 (en) * 2002-08-12 2004-02-12 Michael Witkowski Method and apparatus for switch fabric configuration
WO2004044896A3 (en) * 2002-11-13 2004-07-08 Kenneth Nadav Method and system for using query information to enhance categorization and navigation within the whole knowledge base
JP3974511B2 (en) * 2002-12-19 2007-09-12 International Business Machines Corporation Computer system for generating a data structure for information retrieval, method therefor, computer-executable program for generating a data structure for information retrieval, computer-readable storage medium storing the program, information retrieval system, and graphical user interface system
US7640336B1 (en) 2002-12-30 2009-12-29 Aol Llc Supervising user interaction with online services
EP1443427A1 (en) * 2003-01-29 2004-08-04 Hewlett-Packard Company, A Delaware Corporation Maintenance of information retrieval systems using global metrics
US8055669B1 (en) * 2003-03-03 2011-11-08 Google Inc. Search queries improved based on query semantic information
US20040177081A1 (en) * 2003-03-03 2004-09-09 Scott Dresden Neural-based internet search engine with fuzzy and learning processes implemented at multiple levels
US20040254957A1 (en) * 2003-06-13 2004-12-16 Nokia Corporation Method and a system for modeling user preferences
GB0315505D0 (en) * 2003-07-02 2003-08-06 Sony Uk Ltd Information retrieval
JP4349875B2 (en) * 2003-09-19 2009-10-21 Ricoh Co., Ltd. Document filtering apparatus, document filtering method, and document filtering program
US7698345B2 (en) 2003-10-21 2010-04-13 The Nielsen Company (Us), Llc Methods and apparatus for fusing databases
US20050114198A1 (en) * 2003-11-24 2005-05-26 Ross Koningstein Using concepts for ad targeting
US20070300142A1 (en) * 2005-04-01 2007-12-27 King Martin T Contextual dynamic advertising based upon captured rendered text
CN1658234B (en) * 2004-02-18 2010-05-26 International Business Machines Corporation Method and device for generating hierarchy visual structure of semantic network
US8005835B2 (en) * 2004-03-15 2011-08-23 Yahoo! Inc. Search systems and methods with integration of aggregate user annotations
US8595146B1 (en) 2004-03-15 2013-11-26 Aol Inc. Social networking permissions
US7584221B2 (en) 2004-03-18 2009-09-01 Microsoft Corporation Field weighting in text searching
US8131702B1 (en) 2004-03-31 2012-03-06 Google Inc. Systems and methods for browsing historical content
US7311666B2 (en) * 2004-07-10 2007-12-25 Trigeminal Solutions, Inc. Apparatus for collecting information
US7395260B2 (en) * 2004-08-04 2008-07-01 International Business Machines Corporation Method for providing graphical representations of search results in multiple related histograms
US8261196B2 (en) * 2004-08-04 2012-09-04 International Business Machines Corporation Method for displaying usage metrics as part of search results
US20060036451A1 (en) * 2004-08-10 2006-02-16 Lundberg Steven W Patent mapping
EP1779263A1 (en) * 2004-08-13 2007-05-02 Swiss Reinsurance Company Speech and textual analysis device and corresponding method
US7540051B2 (en) * 2004-08-20 2009-06-02 Spatial Systems, Inc. Mapping web sites based on significance of contact and category
CN101073077A (en) * 2004-09-10 2007-11-14 色杰斯提卡股份有限公司 User creating and rating of attachments for conducting a search directed by a hierarchy-free set of topics, and a user interface therefor
US7606793B2 (en) * 2004-09-27 2009-10-20 Microsoft Corporation System and method for scoping searches using index keys
US7739277B2 (en) * 2004-09-30 2010-06-15 Microsoft Corporation System and method for incorporating anchor text into ranking search results
US7827181B2 (en) * 2004-09-30 2010-11-02 Microsoft Corporation Click distance determination
JP4639734B2 (en) * 2004-09-30 2011-02-23 Fuji Xerox Co., Ltd. Slide content processing apparatus and program
US7761448B2 (en) 2004-09-30 2010-07-20 Microsoft Corporation System and method for ranking search results using click distance
US7801887B2 (en) * 2004-10-27 2010-09-21 Harris Corporation Method for re-ranking documents retrieved from a document database
US7603353B2 (en) * 2004-10-27 2009-10-13 Harris Corporation Method for re-ranking documents retrieved from a multi-lingual document database
US7814105B2 (en) * 2004-10-27 2010-10-12 Harris Corporation Method for domain identification of documents in a document database
US20060112079A1 (en) * 2004-11-23 2006-05-25 International Business Machines Corporation System and method for generating personalized web pages
US7716198B2 (en) * 2004-12-21 2010-05-11 Microsoft Corporation Ranking search results using feature extraction
JP2010536102A (en) * 2007-08-08 2010-11-25 Baynote, Inc. Method and apparatus for context-based content recommendation
EP1844406A1 (en) * 2005-02-02 2007-10-17 Sdn Ag Search engine based self-teaching system
US20060218140A1 (en) * 2005-02-09 2006-09-28 Battelle Memorial Institute Method and apparatus for labeling in steered visual analysis of collections of documents
US20060179051A1 (en) * 2005-02-09 2006-08-10 Battelle Memorial Institute Methods and apparatus for steering the analyses of collections of documents
US7792833B2 (en) * 2005-03-03 2010-09-07 Microsoft Corporation Ranking search results using language types
US20060200460A1 (en) * 2005-03-03 2006-09-07 Microsoft Corporation System and method for ranking search results using file types
JP4314204B2 (en) * 2005-03-11 2009-08-12 Toshiba Corporation Document management method, system and program
US7519580B2 (en) * 2005-04-19 2009-04-14 International Business Machines Corporation Search criteria control system and method
US8020110B2 (en) * 2005-05-26 2011-09-13 Weisermazars Llp Methods for defining queries, generating query results and displaying same
US20070005588A1 (en) * 2005-07-01 2007-01-04 Microsoft Corporation Determining relevance using queries as surrogate content
US7984039B2 (en) * 2005-07-14 2011-07-19 International Business Machines Corporation Merging of results in distributed information retrieval
WO2007014341A3 (en) 2005-07-27 2007-12-21 Janal M Kalis Patent mapping
US7599917B2 (en) * 2005-08-15 2009-10-06 Microsoft Corporation Ranking search results using biased click distance
EP1755051A1 (en) * 2005-08-15 2007-02-21 Mitsubishi Electric Information Technology Centre Europe B.V. Method and apparatus for accessing data using a symbolic representation space
US7546280B1 (en) * 2005-08-30 2009-06-09 Quintura, Inc. Use of neural networks for keyword generation
US7949581B2 (en) * 2005-09-07 2011-05-24 Patentratings, Llc Method of determining an obsolescence rate of a technology
US20070214119A1 (en) * 2006-03-07 2007-09-13 Microsoft Corporation Searching within a Site of a Search Result
WO2007109890A1 (en) * 2006-03-29 2007-10-04 Mathieu Audet Multi-dimensional locating system and method
US7907755B1 (en) * 2006-05-10 2011-03-15 Aol Inc. Detecting facial similarity based on human perception of facial similarity
US7783085B2 (en) * 2006-05-10 2010-08-24 Aol Inc. Using relevance feedback in face recognition
US8150827B2 (en) * 2006-06-07 2012-04-03 Renew Data Corp. Methods for enhancing efficiency and cost effectiveness of first pass review of documents
US7809704B2 (en) * 2006-06-15 2010-10-05 Microsoft Corporation Combining spectral and probabilistic clustering
US7831928B1 (en) * 2006-06-22 2010-11-09 Digg, Inc. Content visualization
US8452767B2 (en) * 2006-09-15 2013-05-28 Battelle Memorial Institute Text analysis devices, articles of manufacture, and text analysis methods
US8996993B2 (en) * 2006-09-15 2015-03-31 Battelle Memorial Institute Text analysis devices, articles of manufacture, and text analysis methods
KR100878535B1 (en) * 2006-10-26 2009-01-13 Samsung Electronics Co., Ltd. Apparatus and method for searching for contents
JP4856196B2 (en) * 2007-01-30 2012-01-18 Fujitsu Limited Setting checking information collecting method, setting check information collection system and configuration checking information collection program
US7933904B2 (en) * 2007-04-10 2011-04-26 Nelson Cliff File search engine and computerized method of tagging files with vectors
US20080281581A1 (en) * 2007-05-07 2008-11-13 Sparta, Inc. Method of identifying documents with similar properties utilizing principal component analysis
US9218412B2 (en) * 2007-05-10 2015-12-22 Microsoft Technology Licensing, Llc Searching a database of listings
US8826123B2 (en) * 2007-05-25 2014-09-02 9224-5489 Quebec Inc. Timescale for presenting information
US9396254B1 (en) * 2007-07-20 2016-07-19 Hewlett-Packard Development Company, L.P. Generation of representative document components
US20120166414A1 (en) * 2008-08-11 2012-06-28 Ultra Unilimited Corporation (dba Publish) Systems and methods for relevance scoring
US8601392B2 (en) 2007-08-22 2013-12-03 9224-5489 Quebec Inc. Timeline for presenting information
US20090055368A1 (en) * 2007-08-24 2009-02-26 Gaurav Rewari Content classification and extraction apparatus, systems, and methods
US20090055242A1 (en) * 2007-08-24 2009-02-26 Gaurav Rewari Content identification and classification apparatus, systems, and methods
US8041773B2 (en) 2007-09-24 2011-10-18 The Research Foundation Of State University Of New York Automatic clustering for self-organizing grids
US7716228B2 (en) * 2007-09-25 2010-05-11 Firstrain, Inc. Content quality apparatus, systems, and methods
US7840569B2 (en) 2007-10-18 2010-11-23 Microsoft Corporation Enterprise relevancy ranking using a neural network
US9348912B2 (en) * 2007-10-18 2016-05-24 Microsoft Technology Licensing, Llc Document length as a static relevance feature for ranking search results
US20090106221A1 (en) * 2007-10-18 2009-04-23 Microsoft Corporation Ranking and Providing Search Results Based In Part On A Number Of Click-Through Features
US8332411B2 (en) * 2007-10-19 2012-12-11 Microsoft Corporation Boosting a ranker for improved ranking accuracy
US7779019B2 (en) * 2007-10-19 2010-08-17 Microsoft Corporation Linear combination of rankers
US7895225B1 (en) * 2007-12-06 2011-02-22 Amazon Technologies, Inc. Identifying potential duplicates of a document in a document corpus
US8346953B1 (en) 2007-12-18 2013-01-01 AOL, Inc. Methods and systems for restricting electronic content access based on guardian control decisions
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
CA2657835C (en) 2008-03-07 2017-09-19 Mathieu Audet Documents discrimination system and method thereof
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US8812493B2 (en) 2008-04-11 2014-08-19 Microsoft Corporation Search results ranking using editing distance and document information
US8032469B2 (en) * 2008-05-06 2011-10-04 Microsoft Corporation Recommending similar content identified with a neural network
KR100987330B1 (en) * 2008-05-21 2010-10-13 Sungkyunkwan University Foundation for Corporate Collaboration A system and method for generating multi-concept networks based on user's web usage data
US20090313286A1 (en) * 2008-06-17 2009-12-17 Microsoft Corporation Generating training data from click logs
US8171031B2 (en) * 2008-06-27 2012-05-01 Microsoft Corporation Index optimization for ranking using a linear model
US8161036B2 (en) * 2008-06-27 2012-04-17 Microsoft Corporation Index optimization for ranking using a linear model
US8918383B2 (en) * 2008-07-09 2014-12-23 International Business Machines Corporation Vector space lightweight directory access protocol data search
US20100030549A1 (en) 2008-07-31 2010-02-04 Lee Michael M Mobile device having human language translation capability with positional feedback
US8607155B2 (en) 2008-09-12 2013-12-10 9224-5489 Quebec Inc. Method of managing groups of arrays of documents
US20100094823A1 (en) * 2008-10-14 2010-04-15 Mathieu Lemaire Enhanced linear presentation of search results based on search result metadata
US20110213771A1 (en) * 2008-11-18 2011-09-01 Kyota Kanno Hybrid search system, hybrid search method, and hybrid search program
US8396865B1 (en) * 2008-12-10 2013-03-12 Google Inc. Sharing search engine relevance data between corpora
US8224839B2 (en) * 2009-04-07 2012-07-17 Microsoft Corporation Search query extension
GB0907664D0 (en) 2009-05-05 2009-07-22 Aurix Ltd User interface
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US8321398B2 (en) * 2009-07-01 2012-11-27 Thomson Reuters (Markets) Llc Method and system for determining relevance of terms in text documents
US9298722B2 (en) 2009-07-16 2016-03-29 Novell, Inc. Optimal sequential (de)compression of digital data
CN101996191B (en) 2009-08-14 2013-08-07 Peking University Method and system for searching for two-dimensional cross-media element
US9152883B2 (en) * 2009-11-02 2015-10-06 Harry Urbschat System and method for increasing the accuracy of optical character recognition (OCR)
US9213756B2 (en) * 2009-11-02 2015-12-15 Harry Urbschat System and method of using dynamic variance networks
US9158833B2 (en) * 2009-11-02 2015-10-13 Harry Urbschat System and method for obtaining document information
US8954893B2 (en) * 2009-11-06 2015-02-10 Hewlett-Packard Development Company, L.P. Visually representing a hierarchy of category nodes
US20110161340A1 (en) * 2009-12-31 2011-06-30 Honeywell International Inc. Long-term query refinement system
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US8782734B2 (en) * 2010-03-10 2014-07-15 Novell, Inc. Semantic controls on data storage and access
US8339094B2 (en) * 2010-03-11 2012-12-25 GM Global Technology Operations LLC Methods, systems and apparatus for overmodulation of a five-phase machine
US9760634B1 (en) 2010-03-23 2017-09-12 Firstrain, Inc. Models for classifying documents
US8463789B1 (en) 2010-03-23 2013-06-11 Firstrain, Inc. Event detection
US8832103B2 (en) * 2010-04-13 2014-09-09 Novell, Inc. Relevancy filter for new data based on underlying files
US9129300B2 (en) * 2010-04-21 2015-09-08 Yahoo! Inc. Using external sources for sponsored search AD selection
JP5083367B2 (en) * 2010-04-27 2012-11-28 Casio Computer Co., Ltd. Search apparatus, search method, and computer program
CA2836700C (en) 2010-05-25 2017-05-30 Mark F. Mclellan Active search results page ranking technology
US8738635B2 (en) 2010-06-01 2014-05-27 Microsoft Corporation Detection of junk in search result ranking
US8713021B2 (en) * 2010-07-07 2014-04-29 Apple Inc. Unsupervised document clustering using latent semantic density analysis
US8825649B2 (en) * 2010-07-21 2014-09-02 Microsoft Corporation Smart defaults for data visualizations
US20120047172A1 (en) * 2010-08-23 2012-02-23 Google Inc. Parallel document mining
US9384216B2 (en) 2010-11-16 2016-07-05 Microsoft Technology Licensing, Llc Browsing related image search result sets
US9058093B2 (en) 2011-02-01 2015-06-16 9224-5489 Quebec Inc. Active element
US9177828B2 (en) 2011-02-10 2015-11-03 Micron Technology, Inc. External gettering method and device
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US8355922B1 (en) * 2011-04-07 2013-01-15 Symantec Corporation Systems and methods for prioritizing review of items
US8620933B2 (en) 2011-04-11 2013-12-31 Google Inc. Illustrating cross channel conversion paths
US8510326B2 (en) 2011-04-11 2013-08-13 Google Inc. Priority dimensional data conversion path reporting
US8655907B2 (en) 2011-07-18 2014-02-18 Google Inc. Multi-channel conversion path position reporting
US8959450B2 (en) 2011-08-22 2015-02-17 Google Inc. Path explorer visualization
US8994660B2 (en) 2011-08-29 2015-03-31 Apple Inc. Text correction processing
CA2790799A1 (en) 2011-09-25 2013-03-25 Mathieu Audet Method and apparatus of navigating information element axes
US8782042B1 (en) 2011-10-14 2014-07-15 Firstrain, Inc. Method and system for identifying entities
US9495462B2 (en) 2012-01-27 2016-11-15 Microsoft Technology Licensing, Llc Re-ranking search results
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US8880525B2 (en) * 2012-04-02 2014-11-04 Xerox Corporation Full and semi-batch clustering
US9015201B2 (en) * 2012-04-24 2015-04-21 Honeywell International Inc. Discriminative classification using index-based ranking of large multimedia archives
US9519693B2 (en) 2012-06-11 2016-12-13 9224-5489 Quebec Inc. Method and apparatus for displaying data element axes
US9292505B1 (en) 2012-06-12 2016-03-22 Firstrain, Inc. Graphical user interface for recurring searches
US9646080B2 (en) 2012-06-12 2017-05-09 9224-5489 Quebec Inc. Multi-functions axis-based interface
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US8793258B2 (en) * 2012-07-31 2014-07-29 Hewlett-Packard Development Company, L.P. Predicting sharing on a social network
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
US9536223B2 (en) * 2012-10-24 2017-01-03 Twitter, Inc. Gathering, selecting and graphing n-grams
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
CN105027197A (en) 2013-03-15 2015-11-04 Apple Inc. Training an at least partial voice command system
WO2014144579A1 (en) 2013-03-15 2014-09-18 Apple Inc. System and method for updating an adaptive speech recognition model
WO2014197336A1 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
WO2014197334A3 (en) 2013-06-07 2015-01-29 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
JP2016521948A (en) 2013-06-13 2016-07-25 Apple Inc. System and method for emergency call initiated by voice command
US9519859B2 (en) 2013-09-06 2016-12-13 Microsoft Technology Licensing, Llc Deep structured semantic model produced using click-through data
US8744840B1 (en) 2013-10-11 2014-06-03 Realfusion LLC Method and system for n-dimentional, language agnostic, entity, meaning, place, time, and words mapping
US9477654B2 (en) 2014-04-01 2016-10-25 Microsoft Corporation Convolutional latent semantic models and their applications
US9535960B2 (en) 2014-04-14 2017-01-03 Microsoft Corporation Context-sensitive search using a deep learning model
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US9606986B2 (en) 2014-09-29 2017-03-28 Apple Inc. Integrated word N-gram and class M-gram language models
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US9711141B2 (en) 2014-12-09 2017-07-18 Apple Inc. Disambiguating heteronyms in speech synthesis
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4599692A (en) 1984-01-16 1986-07-08 Itt Corporation Probabilistic learning element employing context drive searching
US5062143A (en) 1990-02-23 1991-10-29 Harris Corporation Trigram-based method of language identification
US5325298A (en) 1990-11-07 1994-06-28 Hnc, Inc. Methods for generating or revising context vectors for a plurality of word stems
EP0615201B1 (en) 1993-03-12 2001-01-10 Kabushiki Kaisha Toshiba Document detection system using detection result presentation for facilitating user's comprehension
JPH0756933A (en) 1993-06-24 1995-03-03 Xerox Corp Document retrieval method
US5724567A (en) 1994-04-25 1998-03-03 Apple Computer, Inc. System for directing relevance-ranked data objects to computer users
US5675819A (en) 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
US5706497A (en) 1994-08-15 1998-01-06 Nec Research Institute, Inc. Document retrieval using fuzzy-logic inference
US5717913A (en) 1995-01-03 1998-02-10 University Of Central Florida Method for detecting and extracting text data using database schemas
US5724571A (en) 1995-07-07 1998-03-03 Sun Microsystems, Inc. Method and apparatus for generating query responses in a computer-based document retrieval system
US5713016A (en) 1995-09-05 1998-01-27 Electronic Data Systems Corporation Process and system for determining relevance
US5745893A (en) 1995-11-30 1998-04-28 Electronic Data Systems Corporation Process and system for arrangement of documents
US6026397A (en) * 1996-05-22 2000-02-15 Electronic Data Systems Corporation Data analysis system and method
US5765150A (en) 1996-08-09 1998-06-09 Digital Equipment Corporation Method for statistically projecting the ranking of information
EP0827063B1 (en) * 1996-08-28 2002-11-13 Philips Electronics N.V. Method and system for selecting an information item
US6038561A (en) * 1996-10-15 2000-03-14 Manning & Napier Information Services Management and analysis of document information text
US5987446A (en) * 1996-11-12 1999-11-16 U.S. West, Inc. Searching large collections of text using multiple search engines concurrently
US5774888A (en) * 1996-12-30 1998-06-30 Intel Corporation Method for characterizing a document set using evaluation surrogates
US6041331A (en) * 1997-04-01 2000-03-21 Manning And Napier Information Services, Llc Automatic extraction and graphic visualization system and method
US5835905A (en) * 1997-04-09 1998-11-10 Xerox Corporation System for predicting documents relevant to focus documents by spreading activation through network representations of a linked collection of documents
US6269362B1 (en) * 1997-12-19 2001-07-31 Alta Vista Company System and method for monitoring web pages by comparing generated abstracts
US6216134B1 (en) * 1998-06-25 2001-04-10 Microsoft Corporation Method and system for visualization of clusters and classifications

Cited By (92)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060218100A1 (en) * 1999-06-30 2006-09-28 Silverbrook Research Pty Ltd Method of collecting a copyright fee for a document requested via an interactive surface
US7266551B2 (en) * 2000-01-31 2007-09-04 Overture Services, Inc. Method and system for generating a set of search terms
US20040230574A1 (en) * 2000-01-31 2004-11-18 Overture Services, Inc Method and system for generating a set of search terms
US20080228754A1 (en) * 2000-02-22 2008-09-18 Metacarta, Inc. Query method involving more than one corpus of documents
US9201972B2 (en) 2000-02-22 2015-12-01 Nokia Technologies Oy Spatial indexing of documents
US7953732B2 (en) 2000-02-22 2011-05-31 Nokia Corporation Searching by using spatial document and spatial keyword document indexes
US7908280B2 (en) 2000-02-22 2011-03-15 Nokia Corporation Query method involving more than one corpus of documents
US20080115076A1 (en) * 2000-02-22 2008-05-15 Metacarta, Inc. Query parser method
US20060036588A1 (en) * 2000-02-22 2006-02-16 Metacarta, Inc. Searching by using spatial document and spatial keyword document indexes
US20080228729A1 (en) * 2000-02-22 2008-09-18 Metacarta, Inc. Spatial indexing of documents
US20080126343A1 (en) * 2000-02-22 2008-05-29 Metacarta, Inc. Method for defining the georelevance of documents
US20080228728A1 (en) * 2000-02-22 2008-09-18 Metacarta, Inc. Geospatial search method that provides for collaboration
US20080114736A1 (en) * 2000-02-22 2008-05-15 Metacarta, Inc. Method of inferring spatial meaning to text
US20080109713A1 (en) * 2000-02-22 2008-05-08 Metacarta, Inc. Method involving electronic notes and spatial domains
US7523115B2 (en) * 2000-03-22 2009-04-21 Definiens Ag Method for finding objects
US20050055362A1 (en) * 2000-03-22 2005-03-10 Gunter Schmidt Method for finding objects
US20080147644A1 (en) * 2000-05-31 2008-06-19 Yariv Aridor Information search using knowledge agents
US7809708B2 (en) * 2000-05-31 2010-10-05 International Business Machines Corporation Information search using knowledge agents
US20040148562A1 (en) * 2001-03-23 2004-07-29 Hardy Hofer Methods for the arrangement of a document in a document inventory
US20050165805A1 (en) * 2001-06-29 2005-07-28 International Business Machines Corporation Method and system for spatial information retrieval for hyperlinked documents
US20030208485A1 (en) * 2002-05-03 2003-11-06 Castellanos Maria G. Method and system for filtering content in a discovered topic
US7146359B2 (en) * 2002-05-03 2006-12-05 Hewlett-Packard Development Company, L.P. Method and system for filtering content in a discovered topic
US20130311450A1 (en) * 2003-01-25 2013-11-21 Karthik Ramani Methods, systems, and data structures for performing searches on three dimensional objects
US9348877B2 (en) * 2003-01-25 2016-05-24 Purdue Research Foundation Methods, systems, and data structures for performing searches on three dimensional objects
US20100199186A1 (en) * 2003-04-04 2010-08-05 Eric Bonabeau Methods and systems for interactive evolutionary computing (iec)
US8117139B2 (en) 2003-04-04 2012-02-14 Icosystem Corporation Methods and systems for interactive evolutionary computing (IEC)
US7283997B1 (en) * 2003-05-14 2007-10-16 Apple Inc. System and method for ranking the relevance of documents retrieved by a query
US20050060287A1 (en) * 2003-05-16 2005-03-17 Hellman Ziv Z. System and method for automatic clustering, sub-clustering and cluster hierarchization of search results in cross-referenced databases using articulation nodes
US9058395B2 (en) 2003-05-30 2015-06-16 Microsoft Technology Licensing, Llc Resolving queries based on automatic determination of requestor geographic location
US8078606B2 (en) * 2003-06-27 2011-12-13 At&T Intellectual Property I, L.P. Rank-based estimate of relevance values
US20100153357A1 (en) * 2003-06-27 2010-06-17 At&T Intellectual Property I, L.P. Rank-based estimate of relevance values
US7882048B2 (en) 2003-08-01 2011-02-01 Icosystem Corporation Methods and systems for applying genetic operators to determine system conditions
US8117140B2 (en) 2003-08-01 2012-02-14 Icosystem Corporation Methods and systems for applying genetic operators to determine systems conditions
US20080140374A1 (en) * 2003-08-01 2008-06-12 Icosystem Corporation Methods and Systems for Applying Genetic Operators to Determine System Conditions
US20100185572A1 (en) * 2003-08-27 2010-07-22 Icosystem Corporation Methods and systems for multi-participant interactive evolutionary computing
US7966272B2 (en) 2003-08-27 2011-06-21 Icosystem Corporation Methods and systems for multi-participant interactive evolutionary computing
US9552388B2 (en) * 2003-09-05 2017-01-24 Google Inc. System and method for providing search query refinements
US20140149415A1 (en) * 2003-09-05 2014-05-29 Google Inc. System and method for providing search query refinements
US20050060289A1 (en) * 2003-09-12 2005-03-17 Mark Keenan A method for relaxing multiple constraints in search and calculation and then displaying results
US8768908B2 (en) 2004-07-01 2014-07-01 Facebook, Inc. Query disambiguation
US9183250B2 (en) 2004-07-01 2015-11-10 Facebook, Inc. Query disambiguation
US20090222444A1 (en) * 2004-07-01 2009-09-03 Aol Llc Query disambiguation
US20100211558A1 (en) * 2004-07-06 2010-08-19 Icosystem Corporation Methods and apparatus for interactive searching techniques
US20060010117A1 (en) * 2004-07-06 2006-01-12 Icosystem Corporation Methods and systems for interactive search
US8005813B2 (en) 2004-12-29 2011-08-23 Aol Inc. Domain expert search
US8135737B2 (en) 2004-12-29 2012-03-13 Aol Inc. Query routing
US8521713B2 (en) 2004-12-29 2013-08-27 Microsoft Corporation Domain expert search
US20060173817A1 (en) * 2004-12-29 2006-08-03 Chowdhury Abdur R Search fusion
US20080172368A1 (en) * 2004-12-29 2008-07-17 Aol Llc Query routing
US20080059456A1 (en) * 2004-12-29 2008-03-06 Aol Llc, A Delaware Limited Liability Company (Formerly Known As America Online, Inc.) Domain Expert Search
US7818314B2 (en) * 2004-12-29 2010-10-19 Aol Inc. Search fusion
US20160267546A1 (en) * 2005-08-04 2016-09-15 Time Warner Cable Enterprises Llc Method and apparatus for context-specific content delivery
US20070067212A1 (en) * 2005-09-21 2007-03-22 Eric Bonabeau System and method for aiding product design and quantifying acceptance
US8423323B2 (en) 2005-09-21 2013-04-16 Icosystem Corporation System and method for aiding product design and quantifying acceptance
US8229948B1 (en) 2005-09-26 2012-07-24 Dranias Development Llc Context-based search query visualization and search query context management using neural networks
US20110047111A1 (en) * 2005-09-26 2011-02-24 Quintura, Inc. Use of neural networks for annotating search results
US7475072B1 (en) * 2005-09-26 2009-01-06 Quintura, Inc. Context-based search visualization and context management using neural networks
US8533130B2 (en) 2005-09-26 2013-09-10 Dranias Development Llc Use of neural networks for annotating search results
US8078557B1 (en) 2005-09-26 2011-12-13 Dranias Development Llc Use of neural networks for keyword generation
US20070214158A1 (en) * 2006-03-08 2007-09-13 Yakov Kamen Method and apparatus for conducting a robust search
US20070298866A1 (en) * 2006-06-26 2007-12-27 Paolo Gaudiano Methods and systems for interactive customization of avatars and other animate or inanimate items in video games
US20110047145A1 (en) * 2007-02-19 2011-02-24 Quintura, Inc. Search engine graphical interface using maps of search terms and images
US8533185B2 (en) 2007-02-19 2013-09-10 Dranias Development Llc Search engine graphical interface using maps of search terms and images
US20080235192A1 (en) * 2007-03-19 2008-09-25 Mitsuhisa Kanaya Information retrieval system and information retrieval method
US9510058B2 (en) 2007-04-30 2016-11-29 Google Inc. Program guide user interface
WO2009058625A1 (en) * 2007-11-01 2009-05-07 Ut-Battelle, Llc Dynamic reduction of dimensions of a document vector in a document search and retrieval system
US7937389B2 (en) 2007-11-01 2011-05-03 Ut-Battelle, Llc Dynamic reduction of dimensions of a document vector in a document search and retrieval system
US20090119343A1 (en) * 2007-11-01 2009-05-07 Yu Jiao Dynamic reduction of dimensions of a document vector in a document search and retrieval system
US20090198668A1 (en) * 2008-01-31 2009-08-06 Business Objects, S.A. Apparatus and method for displaying documents relevant to the content of a website
US20090199158A1 (en) * 2008-01-31 2009-08-06 Business Objects, S.A. Apparatus and method for building a component to display documents relevant to the content of a website
US8615733B2 (en) 2008-01-31 2013-12-24 SAP France S.A. Building a component to display documents relevant to the content of a website
US8260772B2 (en) * 2008-01-31 2012-09-04 SAP France S.A. Apparatus and method for displaying documents relevant to the content of a website
US20100317446A1 (en) * 2008-02-05 2010-12-16 Konami Digital Entertainment Co., Ltd. Information processing device, information processing device control method, program, and information storage medium
US20090248669A1 (en) * 2008-04-01 2009-10-01 Nitin Mangesh Shetti Method and system for organizing information
WO2009123866A3 (en) * 2008-04-01 2010-03-25 Iac Search & Media, Inc. Method and system for organizing information
US8180754B1 (en) 2008-04-01 2012-05-15 Dranias Development Llc Semantic neural network for aggregating query searches
WO2009123866A2 (en) * 2008-04-01 2009-10-08 Iac Search & Media, Inc. Method and system for organizing information
US8645409B1 (en) * 2008-04-02 2014-02-04 Google Inc. Contextual search term evaluation
US20090327279A1 (en) * 2008-06-25 2009-12-31 International Business Machines Corporation Apparatus and method for supporting document data search
US8200672B2 (en) * 2008-06-25 2012-06-12 International Business Machines Corporation Supporting document data search
US9317599B2 (en) * 2008-09-19 2016-04-19 Nokia Technologies Oy Method, apparatus and computer program product for providing relevance indication
US20100083131A1 (en) * 2008-09-19 2010-04-01 Nokia Corporation Method, Apparatus and Computer Program Product for Providing Relevance Indication
US20110179002A1 (en) * 2010-01-19 2011-07-21 Dell Products L.P. System and Method for a Vector-Space Search Engine
US8661047B2 (en) * 2010-05-17 2014-02-25 Microsoft Corporation Geometric mechanism for privacy-preserving answers
US20110282865A1 (en) * 2010-05-17 2011-11-17 Microsoft Corporation Geometric mechanism for privacy-preserving answers
US9703871B1 (en) * 2010-07-30 2017-07-11 Google Inc. Generating query refinements using query components
US20140129494A1 (en) * 2012-11-08 2014-05-08 Georges Harik Searching text via function learning
WO2014120851A3 (en) * 2013-02-04 2015-02-26 TextWise Company, LLC Method and system for visualizing documents
US9418145B2 (en) 2013-02-04 2016-08-16 TextWise Company, LLC Method and system for visualizing documents
US20140250376A1 (en) * 2013-03-04 2014-09-04 Microsoft Corporation Summarizing and navigating data using counting grids
US20150199339A1 (en) * 2014-01-14 2015-07-16 Xerox Corporation Semantic refining of cross-lingual information retrieval results
US20170270097A1 (en) * 2016-03-17 2017-09-21 Yahoo Japan Corporation Determination apparatus and determination method

Also Published As

Publication number Publication date Type
US6574632B2 (en) 2003-06-03 grant
US6701318B2 (en) 2004-03-02 grant
US20030130998A1 (en) 2003-07-10 application

Similar Documents

Publication Publication Date Title
Leuski Evaluating document clustering for interactive information retrieval
Sebastiani A tutorial on automated text categorisation
Salton et al. Extended Boolean information retrieval
Salton et al. Advanced feedback methods in information retrieval
Mukherjea et al. Amore: A world wide web image retrieval engine
US6772170B2 (en) System and method for interpreting document contents
Harman Relevance feedback revisited
US7051017B2 (en) Inverse inference engine for high performance web search
US6397205B1 (en) Document categorization and evaluation via cross-entrophy
US6847966B1 (en) Method and system for optimally searching a document database using a representative semantic space
US6704729B1 (en) Retrieval of relevant information categories
US5794178A (en) Visualization of information using graphical representations of context vector based relationships and attributes
US5634051A (en) Information management system
Harmandas et al. Image retrieval by hypertext links
US6766316B2 (en) Method and system of ranking and clustering for document indexing and retrieval
US6006221A (en) Multilingual document retrieval system and method using semantic vector matching
US5911140A (en) Method of ordering document clusters given some knowledge of user interests
Dumais Latent semantic indexing (LSI) and TREC-2
US6366908B1 (en) Keyfact-based text retrieval system, keyfact-based text index method, and retrieval method
US6665661B1 (en) System and method for use in text analysis of documents and records
US6236987B1 (en) Dynamic content organization in information retrieval systems
Jennings et al. A user model neural network for a personal news service
US6519586B2 (en) Method and apparatus for automatic construction of faceted terminological feedback for document retrieval
Finkelstein et al. Placing search in context: The concept revisited
US7346629B2 (en) Systems and methods for search processing using superunits

Legal Events

Date Code Title Description
AS Assignment
Owner name: TECHNOLOGY LICENSING CORPORATION, NEVADA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: HARRIS CORPORATION; REEL/FRAME: 035681/0476
Effective date: 2015-05-18