EP1836555A4 - Search engine methods and systems for generating relevant search results and advertisements - Google Patents
Search engine methods and systems for generating relevant search results and advertisements
- Publication number
- EP1836555A4 EP05778301A
- Authority
- EP
- European Patent Office
- Prior art keywords
- topic
- topics
- relevant
- significant
- search
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Definitions
- the present invention relates to search engines, and more particularly, to search engine methods and systems that provide relevant advertisements associated with search results.
- Illustrative supervised classification technologies include semantic networks and neural networks. While supervised systems generally derive classifications more attuned to what a human would generate, they often require substantial training and tuning by expert operators and, in addition, often rely for their results on data that is more consistent or homogeneous than is often possible to obtain in practice. Hybrid systems attempt to fuse the benefits of manual classification methods with the speed and processing capabilities employed by unsupervised and supervised systems. In known hybrid systems, human operators are used to derive "rules of thumb" which drive the underlying classification engines.
- the boss would like the individual to send a copy of the email and the references back to him as soon as possible. Also, he would like the individual to check for additional references to see if the conclusions in the memo need to be updated.
- the boss requires that the project be completed within fifteen minutes.
- the worker is not disorganized, but as is common, does not have total recall of how the information was gathered or where the email is stored. After thirty minutes, the worker finally finds the email. But, the worker still needs to search for additional information as requested by his boss. The end result is that because no efficient search mechanism existed the worker has missed his boss' deadline.
- search engine ads try to identify promising content to be associated with. Unfortunately, these are often not very relevant either. For example, you entered "plasma injectors" and you get several ads for plasma televisions. Individuals have learned that keyword ads are not usually very useful, so individuals often completely ignore keyword ads.
- search methods and systems that can efficiently generate search results that are relevant to the particular user's interest and also display advertisements that are relevant to a particular user's interest that accompany the search results.
- the present invention provides search engine methods and systems for generating relevant search results and displaying relevant advertisements with those search results.
- the invention provides methods and systems for modeling and storing data in neutral forms, and then applying topification techniques to the data to generate search results that are relevant to a particular user's search request.
- the invention also applies topification and relevancy methods to associate ads that are relevant to a user with the search results for display.
- the systems and methods also analyze the user's inputs using an extensive natural language processing (NLP) scheme and artificial intelligence algorithms. As a result the system is capable of distinguishing contexts, generating highly relevant results, and relevancy ranking of results is customizable.
- an interactive drill down with a user during searching inquiries produces a semantic topic network and allows the user to do real time personalization of the semantics.
- the present invention can be used to search for information in a plethora of environments, including enterprise systems and across the Internet.
- the present invention provides for pinpoint search-based advertising, improves search efficiency and provides a flexible search engine system.
- FIG. 1 shows, in flowchart form, a method to identify topics in a corpus of data in accordance with one embodiment of the invention.
- FIG. 2 shows, in flowchart form, a method to generate a domain specific word list in accordance with one embodiment of the invention.
- FIG. 3 shows, in flowchart form, a method to identify topics in a corpus of data in accordance with one embodiment of the invention.
- FIG. 4 shows, in flowchart form, a method to measure actual usage of significant words in a corpus of data in accordance with one embodiment of the invention.
- FIG. 5 shows, in flowchart form, a topic refinement process in accordance with one embodiment of the invention.
- FIG. 6 shows, in flowchart form, a topic identification method in accordance with one embodiment of the invention.
- FIG. 7 shows, in flowchart form, one method in accordance with the invention to identify those topics for display during a user query operation.
- FIG. 8 shows, in block diagram form, a system in accordance with one embodiment of the invention.
- FIG. 9 is a diagram that shows enterprise information sources.
- FIG. 10 is a semantic construct table, according to an embodiment of the invention.
- FIG. 11 is a topification table, according to an embodiment of the invention.
- FIG. 12 is a diagram of a topic hierarchy, according to an embodiment of the invention.
- FIG. 13 is a flowchart of a method for displaying advertisements based on search results of data items within a set of information, according to an embodiment of the invention.
- FIG. 14 is a flowchart of a method for displaying advertisements based on search results of data items within a set of information using the relevancy of search results, according to an embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
- a collection of topics is determined for a first corpus of data, wherein the topics are domain specific, based on a statistical analysis of the first data corpus, and substantially automatically generated.
- the topics may be associated with each "segment" of a second corpus of data, wherein a segment is a user-defined quantum of information.
- Example segments include, but are not limited to, sentences, paragraphs, headings (e.g., chapter headings, titles of manuscripts, titles of brochures and the like), chapters and complete documents.
- Data comprising the data corpus may be unstructured (e.g., text) or structured (e.g., spreadsheets and database tables). In yet another embodiment of the invention, topics may be used during user query operations to return a result set based on a user's query input.
- one method in accordance with the invention uses domain specific word list 100 as a starting point from which to analyze data 105 (block 110) to generate domain specific topic list 115.
- topic list 115 entries may be associated with each segment of data 105 (block 120) and stored in database 125 where it may be queried by user 135 through user interface 130.
- Word list 100 may comprise a list of words or word combinations that are meaningful to the domain from which data 105 is drawn. For example, if data 105 represents medical documents then word list 100 may be those words that are meaningful to the medical field or those subfields within the field of medicine relevant to data 105.
- Data 105 may be substantially any form of data, structured or unstructured.
- data 105 comprises unstructured text files such as medical abstracts and/or articles.
- data 105 comprises books, newspapers, magazine content or a combination of these sources.
- data 105 comprises structured data such as design documents and spreadsheets describing an oil refinery process.
- data 105 comprises content tagged image data, video data and/or audio data.
- data 105 comprises a combination of structured and unstructured data.
- Acts in accordance with block 110 use word list 100 entries to statistically analyze data 105 on a segment-by-segment basis.
- a segment may be defined as a sentence and/or heading and/or title. In another embodiment, a segment may be defined as a paragraph and/or heading and/or title.
- a segment may be defined as a chapter and/or heading and/or title.
- a segment may be defined as a complete document and/or heading and/or title.
- Other definitions may be appropriate for certain types of data and, while different from those enumerated here, would be obvious to one of ordinary skill in the art. For example, headings and titles may be excluded from consideration.
- data 105 comprises the text of approximately 12 million abstracts from the Medline® data collection. These abstracts include approximately 2.8 million unique words, representing approximately 40 Gigabytes of raw data.
- MEDLINE® Medical Literature, Analysis, and Retrieval System Online
- NLM National Library of Medicine's
- the database contains bibliographic citations and author abstracts from more than 4,600 biomedical journals published in the United States and 70 other countries.
- Medline® is searchable at no cost from the NLM's web site at http://www.nlm.nih.gov.
- word list 100 may be generated by first compiling a preliminary list of domain specific words 200 and then pruning from that list those entries that do not significantly and/or uniquely identify concepts or topics within the target domain (block 205).
- Preliminary list 200 may, for example, be comprised of words from a dictionary, thesaurus, glossary, domain specific word list or a combination of these sources.
- the Internet may be used to obtain preliminary word lists for virtually any field.
- Words removed in accordance with block 205 may include standard STOP words as illustrated in Table 2. (One of ordinary skill in the art will recognize that other STOP words may be used.)
- a general domain word list may be created that comprises those words commonly used in English (or another language), including those that are specific to a number of different domains.
- This "general word list” may be used to prune words from a preliminary domain specific word list.
- some common words removed as a result of the general word list pruning just described may be added back into preliminary word list 200 because, while used across a number of domains, they have a particular importance in the particular domain.
- Example Stop Words: a, about, affect, after, again, all, along, also, although, among, an, and, another, any, anything, are, as, at, be, became, because, been, before, both, but, by, can, difference, each, even, ever, every, everyone, for, from, great, had, has.
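The pruning of block 205 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `prune_word_list`, its parameters, and the sample words are assumptions made for the example.

```python
def prune_word_list(preliminary, stop_words, general_words, add_back=()):
    """Return the domain words left after pruning (hypothetical helper).

    STOP words are always removed. General-language words are removed
    unless they are listed in add_back, which models words that are
    common across domains but still important in this one.
    """
    removed = set(stop_words) | (set(general_words) - set(add_back))
    return sorted(w for w in set(preliminary) if w not in removed)


# Illustrative data only; not taken from the actual word lists.
preliminary = ["a", "about", "enzyme", "heart", "kidney", "great", "zygote"]
stop_words = {"a", "about", "great"}
general_words = {"heart", "house"}

word_list = prune_word_list(preliminary, stop_words, general_words,
                            add_back={"heart"})
print(word_list)  # ['enzyme', 'heart', 'kidney', 'zygote']
```

Here "heart" survives pruning only because it was explicitly added back.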
- preliminary word list 200 was derived from the Unified Medical Language System Semantic Network (see http://www.nlm.nih.gov/databases/leased.html#umls) and included 4,000,000 unique single-word entries. Of these, roughly 3,945,000 were removed in accordance with block 205. Accordingly, word list 100 comprised approximately 55,000 one-word entries.
- Example word list 200 entries for the medical domain include: abdomen, biotherapy, chlorided, distichiasis, enzyme, enzymes, freckle, gustatory, immune, kyphoplasty, laryngectomy, malabsorption, nebulize, obstetrics, pancytopenia, quadriparesis, retinae, sideeffect, tonsils, unguium, vermicular, womb, xerostomia, yersinia, and zygote.
- word list 100 provides an initial estimation of domain specific concepts/topics. Analysis in accordance with the invention beneficially expands the semantic breadth of word list 100, however, by identifying word collections (e.g., pairs and triplets) as topics (i.e., topic list 115). Once topics are identified, each segment in data 105 may be associated with those topics (block 120) that exist in that segment. Accordingly, if a corpus of data comprises information from a plurality of domains, analysis in accordance with FIG. 1 may be run multiple times, each time with a different word list 100.
- FIG. 3 illustrates one method in accordance with the invention to identify topics (block 110 of FIG. 1).
- A result of this initial step is preliminary topic list 305.
- an expected value for each entry in preliminary topic list 305 is computed (block 310) and compared with the actual usage value determined during block 300 (block 315). If the measured actual usage of a preliminary topic list entry is significantly greater than the computed expected value of the entry (the "yes" prong of block 315), that entry is added to topic list 115 (block 320).
- For the data set identified in Tables 1 and 3, 10 of the 35 Gigabytes were used to generate topic list 115.
- topic list 115 comprised approximately 506,000 entries. In one embodiment, each of these entries is a double-word entry.
- Illustrative topics identified for Medline® abstract content in accordance with the invention include: adenine nucleotide, heart disease, left ventricular, atria ventricles, heart failure, muscle, heart rate, fatty acids, loss bone, patient case, bone marrow, and arterial hypertension.
- one method to measure the actual usage of significant words in data 105 is to determine three statistics for each entry in word list 100: S1 (block 400); S2 (block 405); and S3 (block 410).
- statistics S1, S2 and S3 measure the actual frequency of usage of various words and word combinations in data 105 at the granularity of the user-defined segment. More specifically:
- Statistic S1 (block 400) is a segment-level frequency count for each entry in word list 100.
- S1 for word-i is the number of unique paragraphs in data 105 in which word-i is found.
- An S1 value may also be computed for non-word list 100 words if they are identified as part of a word combination as described below with respect to statistic S2.
- Statistic S2 (block 405) is a segment-level frequency count for each significant word combination in data 105. Those word combinations having a non-zero S2 value may be identified as preliminary topics 305. In one embodiment, a "significant word combination" comprises any two entries in word list 100 that are in the same segment.
- a "significant word combination” comprises any two entries in word list 100 that are in the same segment and contiguous
- a "significant word combination” comprises any two entries in word list 100 that are in the same segment and contiguous or separated only by one or more STOP words
- a "significant word combination" comprises any two words that are in the same segment and contiguous or separated only by one or more STOP words where at least one of the words in the word combination is in word list 100.
- a "significant word combination" comprises any two or more words that are in the same segment and separated by 'N' or fewer specified other words: N may be zero or more; and the specified words are typically STOP words.
- word combinations comprising non-word list 100 words may be ignored if they appear in fewer than a specified number of segments in data 105 (e.g., fewer than 10 segments).
- S2 for word-combination-i is the number of unique paragraphs in data 105 in which word-combination-i is found.
- Statistic S3 (block 410) indicates the number of unique word combinations (identified by having non-zero S2 values, for example) each word in word list 100 was found in.
- word-z's S3 value is 3.
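The three statistics can be sketched in a few lines. This is an illustrative sketch only, using the simplest embodiment described above (a significant word combination is any two word list entries found in the same segment). The function name `usage_statistics`, the pre-tokenized input format, and the sample segments are assumptions.

```python
from collections import Counter
from itertools import combinations


def usage_statistics(segments, word_list):
    """Compute segment-level statistics S1, S2 and S3 (hypothetical helper).

    segments:  list of token lists, one list per user-defined segment.
    word_list: set of significant (domain specific) words.
    """
    s1 = Counter()  # word -> number of segments containing the word
    s2 = Counter()  # (word_a, word_b) -> number of segments containing both
    for tokens in segments:
        # Count each word at most once per segment; sort so that a pair
        # is always stored in the same order.
        present = sorted(set(tokens) & word_list)
        s1.update(present)
        s2.update(combinations(present, 2))

    s3 = Counter()  # word -> number of distinct combinations it belongs to
    for word_a, word_b in s2:
        s3[word_a] += 1
        s3[word_b] += 1
    return s1, s2, s3


# Illustrative data only.
segments = [
    ["heart", "failure", "in", "the", "kidney"],
    ["heart", "failure", "and", "heart", "rate"],
    ["kidney", "transplantation"],
]
word_list = {"heart", "failure", "kidney", "rate", "transplantation"}

s1, s2, s3 = usage_statistics(segments, word_list)
print(s1["heart"], s2[("failure", "heart")], s3["heart"])  # 2 2 3
```

The contiguity and STOP-word variants of a "significant word combination" would change only how `present` pairs are formed, not how the counts are accumulated.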
- One method to compute the expected usage of significant words in data 105 is to calculate the expected value for each preliminary topic list 305 entry based only on its overall frequency of use in data 105.
- the expected value for each word pair in preliminary topic list 305 may be computed as follows:
- S1(word-i) and S1(word-j) represent the S1 statistic values for word-i and word-j, respectively
- N represents the total number of segments in the data corpus being analyzed.
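The expression itself is not reproduced in the text above. One reading consistent with the stated definitions, assuming that word-i and word-j occur independently of one another, is that the expected number of segments containing both words is S1(word-i) × S1(word-j) / N. The sketch below encodes that assumption; the function name `expected_pair_usage` is hypothetical.

```python
def expected_pair_usage(s1_i, s1_j, n_segments):
    """Expected number of segments containing both words.

    Assumption (not confirmed by the text): the two words occur
    independently, so the expected co-occurrence count is
    N * (S1_i / N) * (S1_j / N) = S1_i * S1_j / N.
    """
    if n_segments <= 0:
        raise ValueError("n_segments must be positive")
    return s1_i * s1_j / n_segments


# A word found in 200 segments and one found in 50 segments, in a corpus
# of 10,000 segments, would be expected to co-occur in one segment.
print(expected_pair_usage(200, 50, 10_000))  # 1.0
```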
- the test (block 315) of whether a topic's measured usage (block 300) is significantly greater than the topic's expected usage (block 310) is based on a constant multiplier. For example, if the measured usage of preliminary topic list entry-i is twice entry-i's expected usage, preliminary topic list entry-i may be added to topic list 115 in accordance with block 320. In another embodiment of the invention, if the measured usage of preliminary topic list entry-i is greater than a threshold value (e.g., 10) across all segments, then that preliminary topic list entry is selected as a topic.
- a different multiplier may be used (e.g., 1.5 or 3). Additionally, conventional statistical tests of significance may be used.
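The two selection tests described for block 315 can be sketched as simple predicates. The function names and default values below are illustrative assumptions; the multiplier of 2 and the threshold of 10 are the examples given in the text.

```python
def passes_multiplier_test(actual, expected, multiplier=2.0):
    """True when measured usage is at least `multiplier` times expected usage."""
    return actual >= multiplier * expected


def passes_threshold_test(actual, threshold=10):
    """True when measured usage across all segments exceeds a fixed threshold."""
    return actual > threshold


# A pair seen in 12 segments where 4.5 co-occurrences were expected.
print(passes_multiplier_test(12, 4.5))       # True  (12 >= 9.0)
print(passes_multiplier_test(12, 4.5, 3.0))  # False (12 < 13.5)
print(passes_threshold_test(12))             # True  (12 > 10)
```

A conventional statistical significance test could replace either predicate without changing the surrounding flow.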
- topic list 115 may be refined in accordance with FIG. 5. (For convenience, this refinement process will be described in terms of two-word topics. One of ordinary skill in the art will recognize that the technique is equally applicable to topics having more than two words.)
- a first two word topic is selected (block 500). If both words comprising the topic are found in word list 100 (the "yes" prong of block 505), the two word topic is retained (block 510). If both words comprising the topic are not found in word list 100 (the "no" prong of block 505), but the S3 value for that word which is in word list 100 is not significantly less than the S3 value for the other word (the "yes" prong of block 515), the two word topic is retained (block 510).
- the test for significance is based on whether the "high" S3 value is in the upper one-third of all S3 values and the "low" S3 value is in the lower one-third of all S3 values.
- the test for significance in accordance with block 515 may be based on quartiles, quintiles or Bayesian tests. Refinement processes such as that outlined in FIG. 5 acknowledge word associations within data, while ignoring individual words that are so prevalent alone (high S3 value) as to offer substantially no differentiation as to content.
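One possible reading of the FIG. 5 refinement, using the upper-third/lower-third test, is sketched below. It is an interpretation rather than the patented procedure: the helper names, the rank-based cut points, and the handling of topics with no word list entry are all assumptions.

```python
def tertile_cuts(values):
    """Return (low_cut, high_cut) for a collection of S3 values.

    Values <= low_cut are treated as the lower third and values
    >= high_cut as the upper third (simple rank-based cut points).
    """
    ordered = sorted(values)
    n = len(ordered)
    low_cut = ordered[max(n // 3 - 1, 0)]
    high_cut = ordered[min(n - n // 3, n - 1)]
    return low_cut, high_cut


def retain_topic(topic, word_list, s3, low_cut, high_cut):
    """Decide whether a two-word topic is kept (blocks 505 to 515)."""
    listed = [w for w in topic if w in word_list]
    if len(listed) == 2:
        return True  # both words are in the word list: retain
    if not listed:
        return False  # assumption: this case is not addressed by the text
    known = listed[0]
    other = topic[1] if topic[0] == known else topic[0]
    # "Significantly less": the listed word's S3 is in the lower third
    # while the other word's S3 is in the upper third.
    significantly_less = s3[known] <= low_cut and s3[other] >= high_cut
    return not significantly_less


# Illustrative S3 values only.
word_list = {"renal", "bone", "marrow"}
s3 = {"renal": 5, "bone": 50, "marrow": 45, "effects": 900, "loss": 40, "case": 60}
low_cut, high_cut = tertile_cuts(s3.values())

print(retain_topic(("bone", "marrow"), word_list, s3, low_cut, high_cut))    # True
print(retain_topic(("effects", "renal"), word_list, s3, low_cut, high_cut))  # False
print(retain_topic(("loss", "bone"), word_list, s3, low_cut, high_cut))      # True
```

Quartile, quintile or Bayesian tests would replace only the `significantly_less` expression.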
- each segment in data 105 may be associated with those topics which exist within it (block 120) and stored in database 125.
- Topics may be associated with a data segment in any desired fashion. For example, topics found in a segment may be stored as metadata for the segment. In addition, stored topics may be indexed for improved retrieval performance during subsequent lookup operations.
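A minimal sketch of storing topics as segment metadata together with an inverted index for lookup follows. The in-memory dictionaries stand in for database 125, and the function name `index_topics` and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def index_topics(segments, topics):
    """Associate topics with segments and build a topic -> segments index.

    segments: dict mapping a segment id to its list of tokens.
    topics:   iterable of topics, each a tuple of words.
    A topic is considered present when all of its words occur in the segment.
    """
    metadata = {}             # segment id -> topics found in that segment
    index = defaultdict(set)  # topic -> ids of segments containing it
    for seg_id, tokens in segments.items():
        present = set(tokens)
        found = [t for t in topics if all(w in present for w in t)]
        metadata[seg_id] = found
        for topic in found:
            index[topic].add(seg_id)
    return metadata, dict(index)


# Illustrative data only.
segments = {
    "s1": ["heart", "failure", "kidney"],
    "s2": ["bone", "marrow"],
}
topics = [("heart", "failure"), ("bone", "marrow"), ("renal", "artery")]

metadata, index = index_topics(segments, topics)
print(metadata["s1"])             # [('heart', 'failure')]
print(index[("bone", "marrow")])  # {'s2'}
```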
- Empirical studies show that the large majority of user queries are "under-defined.” That is, the query itself does not identify any particular subject matter with sufficient specificity to allow a search engine to return the user's desired data in a result set (i.e., that collection of results presented to the user) that is acceptably small.
- a typical user query may be a single word such as, for example, "kidney.”
- prior art search techniques generally return large result sets—often containing thousands, or tens of thousands, of "hits.” Such large result sets are almost never useful to a user as they do not have the time to go through every entry to find that one having the information they seek.
- topics associated with data segments in accordance with the invention may be used to facilitate data retrieval operations as shown in FIG. 6.
- When a user query is received (block 600) it may be used to generate an initial result set (block 605) in a conventional manner. For example, a literal text search of the query term may identify 100,000 documents (or objects stored in database 125) that contain the search term. From this initial result set, a subset may be selected for analysis in accordance with topics (block 610). In one embodiment, the subset is a randomly chosen 1% of the initial result set. In another embodiment, the subset is a randomly chosen 1,000 entries from the initial result set. In yet another embodiment, a specified number of entries are selected from the initial result set (chosen in any manner desired).
- While the number of entries in the result subset may be chosen in substantially any manner desired, it is preferable to select at least a number that provides "coverage" (in a statistical sense) for the initial result set. In other words, it is desirable that the selected subset mirror the initial result set in terms of topics. With an appropriately chosen result subset, the most relevant topics associated with those results may be identified (block 615) and displayed to the user (block 620).
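Selecting the result subset of block 610 can be sketched with the standard library. The function name `sample_result_subset` and its parameters are assumptions; the 1% fraction is the example given in the text.

```python
import random


def sample_result_subset(result_ids, fraction=0.01, min_size=1, seed=None):
    """Randomly choose a fraction of the initial result set (hypothetical helper).

    result_ids: list of identifiers in the initial result set.
    min_size:   floor on the subset size so small result sets are not emptied.
    seed:       optional seed, useful for reproducible tests.
    """
    rng = random.Random(seed)
    size = max(min_size, round(len(result_ids) * fraction))
    size = min(size, len(result_ids))
    return rng.sample(result_ids, size)


initial_results = list(range(100_000))
subset = sample_result_subset(initial_results, fraction=0.01, seed=42)
print(len(subset))  # 1000
```

Whether such a subset gives adequate statistical coverage of the topics in the full result set depends on the data and is not checked by this sketch.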
- FIG. 7 shows one method in accordance with the invention to identify those topics for display (block 615). Initially, all unique topics associated with the result subset are identified (block 700), and those topics that appear in more than a specified fraction of the result subset are removed (block 705). For example, those topics appearing in 80% or more of the segments comprising the result subset may be ignored for the purposes of this analysis. (A percentage higher or lower than this may be selected without altering the salient characteristics of the process.) Next, that topic which appears in the most result subset entries is selected for display (block 710). If more than one topic ties for having the most coverage, one may be selected for display in any manner desired.
- the specified threshold of block 715 is 20%, although a percentage higher or lower than this may be selected without altering the salient characteristics of the process.
- result subset entries remain un-chosen (the "yes" prong of block 735)
- that topic having the next highest coverage is selected (block 740).
- the process of blocks 735 and 740 is repeated until all remaining result subset entries are selected for display (the "no" prong of block 735).
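The parts of FIG. 7 that are described above (blocks 700, 705, 710, 735 and 740) can be sketched as a coverage-ordered selection. The role of the 20% threshold of block 715 is not recoverable from the text, so it is omitted here. The function name, the tie-breaking rule, and the use of overall (rather than remaining) coverage to rank topics are assumptions.

```python
def select_display_topics(subset_topics, max_fraction=0.8):
    """Pick topics to display for a result subset (partial sketch of FIG. 7).

    subset_topics: dict mapping a result entry id to the set of topics
                   associated with that entry.
    """
    total = len(subset_topics)

    # Block 700: collect every unique topic and the entries it covers.
    coverage = {}
    for entry_id, topics in subset_topics.items():
        for topic in topics:
            coverage.setdefault(topic, set()).add(entry_id)

    # Block 705: ignore topics found in max_fraction or more of the subset.
    candidates = {t: ids for t, ids in coverage.items()
                  if len(ids) < max_fraction * total}
    coverable = set().union(*candidates.values()) if candidates else set()

    # Blocks 710, 735, 740: take topics in order of decreasing coverage
    # until every coverable entry is reached (ties broken alphabetically).
    selected, covered = [], set()
    for topic in sorted(candidates, key=lambda t: (-len(candidates[t]), t)):
        if covered == coverable:
            break
        selected.append(topic)
        covered |= candidates[topic]
    return selected


# Illustrative data only.
subset_topics = {
    1: {"kidney", "renal failure"},
    2: {"kidney", "renal failure"},
    3: {"kidney", "renal failure", "renal artery"},
    4: {"kidney", "blood flow"},
    5: {"kidney", "renal artery", "blood flow"},
}
print(select_display_topics(subset_topics))  # ['renal failure', 'blood flow']
```

In this example "kidney" appears in every entry and is therefore ignored, which mirrors the idea that a topic present almost everywhere does not help the user narrow the search.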
- the topics identified in accordance with FIG. 7 may be displayed to the user (block 620 in FIG. 6).
- data retrieval operations in accordance with the invention return one or more topics which the user may select to pursue or refine their initial search.
- a specified number of search result entries may be displayed in conjunction with the displayed topics. By selecting one or more of the displayed topics, a user may be presented with those data corresponding to the selected topics.
- Topics may, for example, be combined through Boolean "and” and/or “or” operators.
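Combining selected topics with Boolean operators reduces to set intersection and union over a topic index. The sketch below assumes a hypothetical index that maps each topic to the set of entry ids containing it; the function name and data are illustrative.

```python
def combine_topics(index, topics, operator="and"):
    """Return the entries matching the selected topics.

    index:    dict mapping a topic to the set of entry ids containing it.
    operator: "and" keeps entries containing every topic,
              "or" keeps entries containing at least one topic.
    """
    sets = [index.get(topic, set()) for topic in topics]
    if not sets:
        return set()
    if operator == "and":
        return set.intersection(*sets)
    if operator == "or":
        return set.union(*sets)
    raise ValueError(f"unsupported operator: {operator!r}")


# Illustrative index only.
index = {
    "renal function": {1, 2, 3, 4},
    "glomerular filtration": {2, 3, 9},
}
selected = ["renal function", "glomerular filtration"]
print(sorted(combine_topics(index, selected, "and")))  # [2, 3]
print(sorted(combine_topics(index, selected, "or")))   # [1, 2, 3, 4, 9]
```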
- the user may be presented with another list of topics based on the "new" result set in a manner described above.
- search operations in accordance with the invention respond to user queries by presenting a series of likely topics that most closely reflect the subjects that their initial search query relates to. Subsequent selection of a topic by the user, in effect, supplies additional search information which is used to refine the search.
- renal function topic identified a total of 6,853 entries divided among the following topics: effects renal, kidney transplantation, renal parenchyma, glomerular filtration, loss renal, blood flow, histological examination, renal artery, creatinine clearance, intensive care, and renal failure. Selection of the "glomerular filtration" topic from this list identified a total of 1,400 entries. Thus, in two steps the number of "hits" through which a person must search was reduced from approximately 148,000 to 1,500, a reduction of nearly two orders of magnitude.
- retrieval operations in accordance with FIG. 6 may not be needed for all queries. For example, if a user query includes multiple search words or a quoted phrase that, using literal text-based search techniques, returns a relatively small result set (e.g., 50 hits or fewer), the presentation of this relatively small result set may be made immediately without resort to the topic-based approach of FIG. 6. What size of initial result set triggers use of a topic-based retrieval operation in accordance with the invention is a matter of design choice. In one embodiment, all initial result sets having more than 50 hits use a method in accordance with FIG. 6. In another embodiment, only initial result sets having more than 200 results trigger use of a method in accordance with FIG. 6.
- FIGS. 1 through 7 may be performed by a programmable control device executing instructions organized into one or more program modules 800.
- programmable control device comprises computer system 805 that includes central processing unit 810, storage 815, network interface card 820 for coupling computer system 805 to network 825, display unit 830, keyboard 835 and mouse 840.
- a programmable control device may be a multiprocessor computer system or a custom designed state machine. Custom designed state machines may be embodied in a hardware device such as a printed circuit board comprising discrete logic, integrated circuits, or specially designed Application Specific Integrated Circuits (ASICs).
- Storage devices such as device 815, suitable for tangibly embodying program module(s) 800 include all forms of non-volatile memory including, but not limited to: semiconductor memory devices such as Electrically Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read Only Memory (EEPROM), and flash devices; magnetic disks (fixed, floppy, and removable); other magnetic media such as tape; and optical media such as CD-ROM disks.
- the present invention, using sophisticated natural language processing and interactive artificial intelligence (AI) algorithms based on automated classification, can generate search results that are highly relevant to a user's search request, a property referred to as "true relevance."
- the present invention can provide true relevance within search results for both an end user and an advertiser.
- FIG. 9 provides a diagram that shows enterprise information sources.
- Enterprise information can include data warehouses, multiple databases, and document systems.
- Server and PC information can include reports, presentations and data generated by the worker or his colleagues.
- Internet information can include a wealth of information, including business websites and business news.
- the first step in addressing the information dilemma is to provide real-time aggregation of information where the context (e.g. title, to, from, name, product, etc.) is identified and maintained. This must be done without requiring normalization of the data. Or, in other words, the information must be imported "as is" without having to reformat or transform the information into some common form. Examples of methods for aggregating the data are taught in commonly owned U.S. patent number 5,842,213, entitled Method for Modeling, Storing and Transferring Data in Neutral Form, issued Nov. 24, 1998 to Odom et al, and U.S.
- the proposed aggregation addresses the issue of practically pooling diverse information.
- the second step relates to the search problem, or put another way, finding the needed information - the proverbial needle in the haystack.
- True relevancy is the missing ingredient in search.
- the industry is looking for ways to produce better results for the user. This is particularly true when the user is searching for specific content as opposed to general information from an omnibus website. The emphasis is on trying to find a way to easily determine which information is relevant to the user.
- the present invention uses sophisticated natural language processing and interactive artificial intelligence (AI) algorithms based on automated classification to provide true relevance in an efficient manner.
- the present invention supports zero latency. When new information is added there is no re-indexing required. Because the meta data is so extensive, the addition of new information becomes only a simple adjustment to the meta data.
- the present invention also supports full automation. Automated crawling of the target information is common in the industry, but implementation of NLP and taxonomy classification has been a manual or training process.
- the present invention has fully automated implementations of crawling, NLP, classification, and loading. In the system, the automated implementation of semantics is accomplished by using existing thesaurus data sets which are accessed in a single query to evaluate all the possible variations. This often involves 20 or more variations for each word the user enters into the search query.
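Expanding each query word into its semantic variations can be sketched as a thesaurus lookup. The dictionary-based thesaurus, the function name `expand_query`, and the sample entries are assumptions for illustration; the actual thesaurus data sets and query mechanism are not described in the text.

```python
def expand_query(query, thesaurus):
    """Map each query word to the sorted list of variations to search for.

    thesaurus: dict mapping a lowercase word to an iterable of variations
               (synonyms, stems, alternate spellings and so on).
    """
    return {
        word: sorted({word, *thesaurus.get(word, ())})
        for word in query.lower().split()
    }


# Illustrative thesaurus only.
thesaurus = {"kidney": ["renal", "kidneys"]}
print(expand_query("Kidney function", thesaurus))
# {'kidney': ['kidney', 'kidneys', 'renal'], 'function': ['function']}
```

All variations for a word can then be submitted together, which matches the idea of evaluating the possible variations in a single query.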
- Semantics data coupled with the identification of phrases form the NLP methodology used in the present invention.
- An example method is disclosed in commonly owned pending U.S. patent application 10/086,026, filed Feb. 26, 2002.
- the automated methodology disclosed has been developed to extract subject descriptions from the content. This methodology can be referred to as "Topification.”
- the present invention has an automated procedure for the definition of semantics. Additionally, the application interface can provide the user with the capability to personalize semantics in real-time. The handling of the semantics in the query process has been integrated into a search engine. This provides superior performance and allows the semantics to be independent of (orthogonal to) the data. With this implementation it is possible to do many semantic variations without the performance constraints.
- FIG. 10 provides a semantic construct table, according to an embodiment of the invention.
- the construct table can be used to explain the scope of the present invention's implementation.
- the table shows and explains semantic constructs for stems, synonyms, concepts, names, misspellings, language and phrases.
Topification
- Taxonomies were developed by a biologist in the 1800's to classify plants and animals. Plants and animals are real entities: a rabbit vs. a cow or a rose vs. a sunflower. These are groups of objects that are easily understood and identified by the concrete differences in their attributes. Taxonomies have been adapted for use in classifying information. Categories of subject matter replace what in the original methodology were entities (i.e. plants and animals). Documents have differences, but these differences can often be abstract and/or very subtle. This usually means the differences are qualitative and require significant manual effort to create and maintain.
- Topification is a solution to the classification problem in electronic information. Topification uses topics to categorize documents and document content.
- FIG. 3 provides a topification table that provides definitions, concepts, rules and tenets (collectively known as an ontology), according to an embodiment of the invention.
- The topification table shows that understanding topics (second order concepts) is much easier than understanding categories (third order concepts). This is validated when manual effort, training exercises, or example metadata sets are used to "define" the "meaning" of the category.
- Topics form a network that has an implied hierarchy.
- FIG. 4 shows a hierarchy that illustrates the relationship between a hypothetical set of topics and documents, according to an embodiment of the invention. Any given document contains a set of topics. In the hierarchy, solid lines represent paths from topics to the documents they are contained in. For example, Topic A is found in Documents 1, 2, 3 and 5 (as well as approximately 20,000 additional documents). Topic B is found in Documents 3, 4 and 5 (as well as approximately 2,000 additional documents).
- The diagram's bands indicate the (relative) number of documents that contain a given topic. So, Topic A at the top of the diagram is contained in more documents than any other topic. Topic B is found in fewer documents than Topic A, but in more documents than Topics C, D, E, F, G, or H.
- The implied hierarchy is a result of the frequency with which a topic occurs in the document set. A topic that appears in many documents is less specific and, therefore, higher in the hierarchy than a topic that appears in just a few documents.
- Topic A is related to any topic that occurs with it in a document. For example, Topic A and C both are found in Document 2. Topic A is found in more documents than Topic C, so Topic A is an implied parent of Topic C as expressed by the line connecting both.
- Topic networking characteristics become apparent when studying paths to Document 4.
- Topic A is not found in Document 4, but both Topic B and Topic D are found in other documents with Topic A.
- Topics B, C, D, E, F, G, and H are viable topic results since they are found in common documents along with Topic A. Notice that even though Document 4 does not contain Topic A, it is on a path from Topic B or Topic D. So picking Topic B and then Topic D would lead to the display of Document 4 as a relevant search result.
- Topic D has two implied parents: Topics A and B. This means coverage in the topic selection process is extensive because there are multiple paths to relevant results. Taxonomies do not have this networking property. There is only one parent for each child in taxonomy.
- Topification coupled with natural language processing produces a multi-path semantic network to the searcher's desired result.
- A taxonomy has one and only one path to a set of results, which may or may not include all the relevant documents.
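- The implied hierarchy described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `implied_parents` and the miniature document set are hypothetical (FIG. 4 is not reproduced here), and the rule used is simply "a co-occurring topic that appears in more documents is an implied parent."

```python
from collections import defaultdict


def implied_parents(doc_topics):
    """Map each topic to its implied parents.

    Assumed rule: topic P is an implied parent of topic C when both occur in
    at least one common document and P occurs in more documents than C.
    """
    # Document frequency: number of documents that contain each topic.
    doc_freq = defaultdict(int)
    for topics in doc_topics.values():
        for topic in set(topics):
            doc_freq[topic] += 1

    parents = defaultdict(set)
    for topics in doc_topics.values():
        for child in topics:
            for other in topics:
                if doc_freq[other] > doc_freq[child]:
                    parents[child].add(other)
    return dict(parents)


# Hypothetical miniature of the FIG. 4 scenario.
docs = {
    "doc1": {"A"},
    "doc2": {"A", "C"},
    "doc3": {"A", "B"},
    "doc4": {"B", "D"},
    "doc5": {"A", "B", "D"},
}
for topic, found in sorted(implied_parents(docs).items()):
    print(topic, "<-", sorted(found))
```

- In this toy data, Topic D ends up with two implied parents (A and B), which is the multi-path property that a single-parent taxonomy lacks.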
- The present invention can handle millions of topics. Using our previous example, let's assume that the present invention has defined 4 million topics. Then on average each topic will provide a granularity of 10 documents. In practice there is a range, which is typically less than 100. With a single distinct search word entered by the user, it is not unusual to produce a set of results that is less than 20.
- The system uses artificial intelligence (AI) to evaluate the query entries made by the user to develop a list of topics that will provide paths to all of the potential solution sets.
- The AI routines re-evaluate the constraints to provide a new list of topics.
- The system evaluates all the potential solutions to the user's constraints and provides the user with knowledge of what is relevant to the current search.
- The searcher, in turn, by clicking on relevant topics, provides the system with information about what is relevant and what is not. It is typical to take only 3 or 4 clicks to arrive at a handful of relevant results.
- This is True Relevance in the sense that, through the interaction, the user has defined what is relevant for the search at hand.
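- The click-driven narrowing can be pictured with a small Python sketch. It is an assumption-laden illustration: the `refine` function, the document set, and the rule "keep documents that contain every selected topic, then offer the remaining co-occurring topics" are hypothetical and stand in for the AI routines, which the text does not specify.

```python
def refine(doc_topics, selected):
    """Return (matching documents, topics still available to click).

    Assumed behaviour: a document stays in the result set only if it
    contains every topic the searcher has selected so far.
    """
    matching = {doc: topics for doc, topics in doc_topics.items()
                if selected <= topics}
    # Topics that co-occur with the selection and could narrow it further.
    remaining = set().union(*matching.values()) - selected
    return sorted(matching), sorted(remaining)


# Hypothetical document-to-topic assignments.
docs = {
    "doc1": {"A"},
    "doc2": {"A", "C"},
    "doc3": {"A", "B"},
    "doc4": {"B", "D"},
    "doc5": {"A", "B", "D"},
}
print(refine(docs, {"B"}))
print(refine(docs, {"B", "D"}))
```

- Each additional click shrinks the result set, and the list of remaining topics tells the searcher which refinements are still possible.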
- The AI routines only work effectively if they are integrated with the semantics (stems, synonyms, phrases, etc.) and reasonable granularity.
- The present invention provides a way for the user to express the domain of interest. Since relevancy is expressed through a "known" set of topics, marketers can determine the set of topics that apply to their products. Relevancy for a single semantically enabled topic is more than a factor of two greater than for two single words, and relevancy increases exponentially with each additional topic added by the user. If a combination of topics and constraint words is used, then advertisements that qualify will be relevant in almost all cases.
- The relevancy ranking is customizable. Options for relevancy ranking include any or all of the following, but are not limited to this list:
- search constraints are more relevant (e.g. title, first paragraph, first page, author, etc.)
- An appropriate relevancy ranking can significantly reduce the resource requirements if the user uses more relevant results as a basis for refining the search.
- In summary, the user can express the domain of interest, relevancy is defined by combinations of millions of topics, relevancy for a single topic is at least twice that for two single words, and relevancy increases exponentially with each additional topic added by a user.
- FIG. 13 provides method 1300 for displaying advertisements based on search results of data items within a set of information when a user enters a search constraint, in accordance with an embodiment of the invention.
- Method 1300, and method 1400 presented in FIG. 14, provide example implementations for displaying relevant advertisements with search results based on the above methods and concepts disclosed for topification and pinpoint advertisements.
- Method 1300 begins in step 1310.
- In step 1310, a search to generate the search results is conducted within a set of information.
- The search results include a set of data items contained within the set of information.
- The set of information can include, but is not limited to, one or more of information located within an enterprise network, information located within a server, information located within a personal computer, information located on the Internet, or information contained within email messages or email attachments.
- The data items can include, but are not limited to, one or more of text documents, graphic documents, audio files, video files, multimedia documents, email messages, email attachments, or Internet web pages.
- The search includes identifying topics in a data corpus having a plurality of segments that is representative of the set of information.
- Identifying topics includes determining a segment-level actual usage value for one or more word combinations, computing a segment-level expected usage value for each of the one or more word combinations, and designating a word combination as a topic if the segment-level actual usage value of the word combination is substantially greater than the segment-level expected usage value of the word combination.
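- The actual-versus-expected comparison can be sketched as follows. The text does not give the formulas, so everything specific here is an assumption: segments are plain strings, word combinations are adjacent word pairs, "actual usage" is the number of segments containing the pair, "expected usage" is what independence of the two words would predict, and "substantially greater" is modelled by a hypothetical `ratio` parameter.

```python
from collections import Counter


def identify_topics(segments, ratio=2.0):
    """Designate word pairs as topics when actual usage exceeds expectation.

    Assumptions (not from the source): expected usage of a pair is
    n * P(word1) * P(word2), with P estimated from segment counts, and a
    pair qualifies when actual > ratio * expected.
    """
    n = len(segments)
    word_segments = Counter()  # number of segments containing each word
    pair_segments = Counter()  # number of segments containing each pair
    for segment in segments:
        words = segment.lower().split()
        word_segments.update(set(words))
        pair_segments.update(set(zip(words, words[1:])))

    topics = []
    for (first, second), actual in pair_segments.items():
        expected = n * (word_segments[first] / n) * (word_segments[second] / n)
        if actual > ratio * expected:
            topics.append(f"{first} {second}")
    return sorted(topics)


segments = [
    "spear fishing gear",
    "spear fishing trip",
    "the trip was long",
    "the gear was old",
    "the boat was new",
    "the day was long",
]
print(identify_topics(segments))
```

- A pair such as "was long" occurs twice but is built from common words, so its expected usage is high and it is not designated a topic; "spear fishing" occurs just as often but far more than chance would predict.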
- The search then associates topics with each data item included within the set of information. In an embodiment, the association of topics with each data item can be completed prior to conducting a search.
- The search can determine that a data item should be included in the search results when a topic entered by the user matches or is similar to a topic associated with the data item.
- A topic entered by a user matches a topic associated with the data item when the topics are the same, for example, the user enters "spear fishing" and the topic is "spear fishing."
- A topic is similar to the term or phrase entered by the user when the topics are the same except for minor spelling errors or capitalization.
- The topic can also be similar to the user constraint when the terms are semantically similar.
- The topic can also be similar to the user constraint when a portion of the user constraint matches a portion of the topic, for example, one word in the topic matches one word in the user constraint.
- A topic can include one or more words for this purpose.
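- The match and similarity tests listed above might be approximated as in the sketch below. The function name `compare_topics` and the `min_ratio` cut-off are hypothetical; `difflib.SequenceMatcher` from the Python standard library is used as a stand-in for "minor spelling errors," and semantic similarity (synonyms, stems) is deliberately left out because it would need a semantics table.

```python
from difflib import SequenceMatcher


def compare_topics(user_entry, topic, min_ratio=0.85):
    """Classify a user entry against a topic as 'match', 'similar' or None."""
    entered, stored = user_entry.strip(), topic.strip()
    if entered == stored:
        return "match"
    a, b = entered.lower(), stored.lower()
    if a == b:
        return "similar"  # differs only in capitalization
    if SequenceMatcher(None, a, b).ratio() >= min_ratio:
        return "similar"  # assumed proxy for a minor spelling error
    if set(a.split()) & set(b.split()):
        return "similar"  # a portion (one word) of the entry matches
    return None


for entry in ["spear fishing", "Spear Fishing", "spearfishing",
              "deep sea fishing", "golf clubs"]:
    print(entry, "->", compare_topics(entry, "spear fishing"))
```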
- When topics include two or more words, the effectiveness of the search is significantly improved. In this case a topic includes a word combination of two or more substantially contiguous words.
- The two or more words can be considered substantially contiguous if they are separated only by zero or more words selected from a predetermined list of words.
- In one approach, the predetermined list of words comprises STOP words. In another approach, at least one word in each of the word combinations making up the topics is selected from a predetermined list of words in which the predetermined list of words includes a list of domain specific words. For example, a predetermined list of words associated with the domain of baseball might include bat, glove, baseball, etc.
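- A minimal sketch of "substantially contiguous" word combinations follows. The stop-word list, the function name `word_combinations`, and the restriction to two-word combinations are assumptions for illustration; the optional `domain_words` argument models the alternative approach in which at least one word must come from a domain-specific list.

```python
# Assumed, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def word_combinations(text, stop_words=STOP_WORDS, domain_words=None):
    """Return pairs of words separated only by zero or more stop words."""
    pairs = []
    previous = None  # most recent non-stop word seen
    for word in text.lower().split():
        if word in stop_words:
            continue  # stop words do not break contiguity
        if previous is not None:
            if (domain_words is None
                    or previous in domain_words
                    or word in domain_words):
                pairs.append((previous, word))
        previous = word
    return pairs


print(word_combinations("Fishing for the spear in the sea"))
print(word_combinations("old team of the new bat",
                        domain_words={"bat", "glove", "baseball"}))
```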
- Determining a set of significant topics includes first counting the frequency of occurrence of each topic within the search results. So, for example, if the topic was "spear fishing" and there were 100 data items in the search results, a count would be made of all the occurrences of "spear fishing" in the 100 data items. Once a count is completed for each topic, the topics are hierarchically ranked based on the frequency of occurrence of the topic. So, for example, the topic occurring most frequently would be ranked 1, the topic occurring second most frequently would be ranked 2, and so on.
- A topic is then identified as among the set of significant topics when its frequency of occurrence ranks above a significant topic threshold.
- The significant topic threshold is the number of topics to be included in the set of significant topics. In one embodiment, the significant topic threshold is ten.
- The significant topic threshold can be adjusted based on the particular needs and factors associated with a search.
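- The frequency-of-occurrence ranking can be sketched with `collections.Counter`. The function name, the plain substring counting, and the sample data are assumptions; the default threshold of ten follows the embodiment mentioned above.

```python
from collections import Counter


def significant_topics_by_frequency(search_results, topics, threshold=10):
    """Rank topics by total occurrences across all result texts."""
    counts = Counter()
    for text in search_results:
        lowered = text.lower()
        for topic in topics:
            # Every occurrence counts, even repeats within one data item.
            counts[topic] += lowered.count(topic.lower())
    ranked = [topic for topic, count in counts.most_common() if count > 0]
    return ranked[:threshold]


results = [
    "Spear fishing and spear fishing gear",
    "Reef diving and spear fishing",
    "Reef diving trips",
]
topics = ["reef diving", "spear fishing", "ice fishing"]
print(significant_topics_by_frequency(results, topics, threshold=2))
```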
- In another embodiment, determining the set of significant topics from the search results includes determining a data item count for each topic.
- The topic data item count is the number of data items within the search results that the topic appears in. Rather than counting the total frequency of occurrences of a topic, as in the previous embodiment, only the number of data items that a topic occurs in is counted. Thus, whether a topic occurred ten times or only once in a particular data item, the data item count would be one.
- The topics are hierarchically ranked based on the data item count of the topic. For example, the topic with the highest data item count is given a ranking of 1, the topic with the second highest data item count is given a ranking of 2, and so on.
- A topic is then identified to be included among the set of significant topics when it ranks above the significant topic threshold.
- The significant topic threshold is the number of topics to be included in the set of significant topics.
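- The data item count variant differs only in what is counted. In the sketch below (hypothetical names and data), a topic contributes at most one count per data item, however often it is repeated there.

```python
from collections import Counter


def significant_topics_by_item_count(search_results, topics, threshold=10):
    """Rank topics by the number of data items they appear in."""
    counts = Counter()
    for text in search_results:
        lowered = text.lower()
        for topic in topics:
            if topic.lower() in lowered:
                counts[topic] += 1  # one count per data item
    return [topic for topic, _ in counts.most_common(threshold)]


results = [
    "spear fishing, spear fishing, spear fishing",
    "reef diving",
    "reef diving gear",
]
topics = ["spear fishing", "reef diving", "ice fishing"]
print(significant_topics_by_item_count(results, topics))
```

- Here "spear fishing" occurs three times but in a single data item, so it ranks below "reef diving", which appears in two data items.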
- In a further embodiment, the most specific topics are included in the set of significant topics. In this case a preliminary set of most significant topics from the search results is determined. Note that either approach of using the frequency of occurrence or data item count can be used to determine the preliminary set of most significant topics and also to identify which topics are most specific.
- For each topic in the preliminary set, the topic's frequency of occurrence (or data item count, depending on the approach) within the set of information is determined.
- The most specific topics within the preliminary set of most significant topics are determined as those that have the lowest frequency of occurrence within the set of information. For example, the topic with the lowest frequency of occurrence within the set of information is given a ranking of 1, the topic with the second lowest frequency of occurrence within the set of information is given a ranking of 2, and so on.
- A topic is identified as among the most specific topics when its frequency of occurrence ranks above the specific topic threshold.
- The specific topic threshold is the number of topics to be included in the most specific topics.
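- Selecting the most specific topics amounts to re-ranking the preliminary set in ascending order of corpus-wide frequency. The sketch below assumes the corpus frequencies are already available as a dictionary; the names and numbers are illustrative only.

```python
def most_specific_topics(preliminary_topics, corpus_frequency,
                         specific_threshold=3):
    """Keep the topics that are rarest across the whole set of information.

    corpus_frequency maps each topic to its frequency of occurrence (or
    data item count) within the full set of information, not just the
    search results.
    """
    ranked = sorted(preliminary_topics, key=lambda t: corpus_frequency[t])
    return ranked[:specific_threshold]


preliminary = ["fishing", "spear fishing", "reef spear fishing"]
frequency = {"fishing": 20000, "spear fishing": 2000, "reef spear fishing": 40}
print(most_specific_topics(preliminary, frequency, specific_threshold=2))
```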
- Relevant advertisements related to the set of significant topics are identified.
- In one embodiment, identifying relevant advertisements that are related to the set of significant topics includes selecting an advertisement as relevant when a topic associated with the advertisement matches one of the topics within the set of significant topics.
- In another embodiment, identifying relevant advertisements that are related to the set of significant topics includes selecting an advertisement as relevant when a topic associated with the advertisement matches the top ranked topic within the set of significant topics.
- In a further embodiment, identifying relevant advertisements that are related to the set of significant topics includes selecting an advertisement as relevant when a topic associated with the advertisement is similar to a topic within the set of significant topics.
- A topic associated with an advertisement matches one of the topics within the set of significant topics if the topics are the same. For example, the topic associated with an advertisement is "spear fishing" and a topic within the set of significant topics is "spear fishing." Topics are similar when the topics are the same except for minor spelling errors or capitalization. Topics can also be similar when the terms are semantically similar.
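- The first two selection rules can be sketched as below. The advertisement inventory, the function name `relevant_ads`, and the `top_only` flag are hypothetical; comparison is case-insensitive, which covers capitalization differences but not spelling or semantic similarity.

```python
def relevant_ads(ads, significant_topics, top_only=False):
    """Select ads whose topics match the significant topics.

    ads maps an advertisement name to the topics associated with it.
    significant_topics is ordered by rank (rank 1 first). With
    top_only=True, only the top ranked topic is considered.
    """
    wanted = significant_topics[:1] if top_only else significant_topics
    wanted = {topic.lower() for topic in wanted}
    return [name for name, ad_topics in ads.items()
            if wanted & {topic.lower() for topic in ad_topics}]


ads = {
    "Spear guns sale": ["Spear Fishing"],
    "Dive shop": ["reef diving", "scuba"],
    "Golf resort": ["golf"],
}
ranked_topics = ["spear fishing", "reef diving"]
print(relevant_ads(ads, ranked_topics))
print(relevant_ads(ads, ranked_topics, top_only=True))
```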
- A set of relevant advertisements is displayed.
- The maximum number of advertisements to display is determined. Once the maximum number of advertisements is determined, the relevant advertisements that have the highest relevant advertisement display quotient are displayed, up to the maximum number of advertisements.
- The relevant advertisement display quotient is a function of one or more of a relationship between the search constraint of the user and topics associated with relevant advertisements, a relationship between the set of significant topics and topics associated with relevant advertisements, existing click-throughs by a user to relevant advertisements, and premium financial payments by an advertiser to promote display of their advertisement.
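- The text names the inputs to the display quotient but not the function itself. The sketch below assumes one possible form, a weighted sum of four signals that have each been normalised to the range 0 to 1; the `AdSignals` class, the weights, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdSignals:
    name: str
    constraint_similarity: float  # user constraint vs. ad topics, 0..1
    topic_similarity: float       # significant topics vs. ad topics, 0..1
    click_through: float          # normalised prior click-throughs, 0..1
    premium: float                # normalised advertiser payment, 0..1


def display_quotient(ad, weights=(0.4, 0.3, 0.2, 0.1)):
    """Assumed weighted sum; the source does not specify the function."""
    signals = (ad.constraint_similarity, ad.topic_similarity,
               ad.click_through, ad.premium)
    return sum(w * s for w, s in zip(weights, signals))


def ads_to_display(ads, max_ads):
    """Return the names of the max_ads ads with the highest quotient."""
    ranked = sorted(ads, key=display_quotient, reverse=True)
    return [ad.name for ad in ranked[:max_ads]]


candidates = [
    AdSignals("golf clubs", 0.0, 0.0, 0.0, 1.0),
    AdSignals("boat rentals", 0.2, 0.5, 0.1, 1.0),
    AdSignals("spear guns", 1.0, 1.0, 0.5, 0.0),
]
print(ads_to_display(candidates, max_ads=2))
```

- Changing the weights shifts the balance between topical relevance and paid placement, which is the kind of tuning the "one or more of" wording leaves open.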
- In one embodiment, relevant advertisements that are displayed are randomly selected from the set of relevant advertisements that were determined in step 1340.
- In another embodiment, relevant advertisements that are displayed are relevant advertisements determined in step 1340 in which the advertisers have paid the largest financial premium for placement of their advertisements.
- In a further embodiment, relevant advertisements that are displayed are relevant advertisements determined in step 1340 in which the topics associated with the advertisement are most similar to the user's constraint terms.
- FIG. 14 provides method 1400 for displaying advertisements based on search results from data items within a set of information when a user enters a search constraint, according to an embodiment of the invention.
- Method 1400 is similar to method 1300, except that search results are ranked by relevancy before a set of significant topics is determined. Using the relevancy factors associated with search results that were discussed above can further improve the relevancy of advertisements that will be displayed alongside search results.
- Method 1400 begins in step 1410.
- A search is conducted to generate search results. This step is the same as step 1310 above.
- The search results are ranked by relevancy. This step was not present in method 1300.
- Ranking the search results by relevancy includes providing a relevancy rank for each data item in the search results based on one or more of what component of the search result contains the search constraint (e.g., the component was the title of the data item), a proximity of search text (e.g., all search constraints are located near to one another within a data item), and a level of semantics that had to be applied to the search result (e.g., the closer the terms that match the user constraint, the more relevant the search result).
- The relevancy ranking can also be based on the popularity of the website search result and previous click-throughs by the user to the website search result. Those search results with the highest relevancy ranking are determined to be included in the set of most relevant search results.
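- A relevancy ranking built from the three factors named above (component, proximity, level of semantics) might look like the following sketch. The scoring formula, the component weights, and the field names are assumptions; the source only lists the factors, and popularity and click-through signals are omitted here.

```python
# Assumed weights: a hit in the title counts more than one in the body.
COMPONENT_WEIGHT = {"title": 3.0, "first_paragraph": 2.0, "body": 1.0}


def relevancy_score(result):
    """Score one search result; higher means more relevant.

    result is a dict with:
      component      - where the search constraint was found
      positions      - word offsets of the constraint terms in the item
      semantic_level - 0 for an exact term match, larger when stems or
                       synonyms had to be applied
    """
    positions = result["positions"]
    span = max(positions) - min(positions) if len(positions) > 1 else 0
    proximity = 1.0 / (1 + span)                      # closer terms score higher
    semantic = 1.0 / (1 + result["semantic_level"])   # exact matches score higher
    return COMPONENT_WEIGHT.get(result["component"], 1.0) + proximity + semantic


def most_relevant(results, keep):
    """Return the ids of the `keep` highest scoring results."""
    ranked = sorted(results, key=relevancy_score, reverse=True)
    return [r["id"] for r in ranked[:keep]]


results = [
    {"id": "r2", "component": "body", "positions": [5, 40], "semantic_level": 2},
    {"id": "r1", "component": "title", "positions": [0, 1], "semantic_level": 0},
    {"id": "r3", "component": "first_paragraph", "positions": [2, 3], "semantic_level": 1},
]
print(most_relevant(results, keep=2))
```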
- A set of significant topics is determined from the most relevant search results. This step is the same as step 1320 above, except that the set of significant topics is determined from the set of most relevant search results in step 1430, whereas the set of significant topics was determined from all search results in step 1320.
- In step 1440, relevant advertisements related to the set of significant topics are identified.
- In step 1450, the most relevant topics are displayed.
- Steps 1440 and 1450 are the same as steps 1330 and 1340, respectively.
- In step 1460, method 1400 ends.
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Accounting & Taxation (AREA)
- Development Economics (AREA)
- Finance (AREA)
- Strategic Management (AREA)
- General Physics & Mathematics (AREA)
- Entrepreneurship & Innovation (AREA)
- General Business, Economics & Management (AREA)
- Marketing (AREA)
- Economics (AREA)
- Game Theory and Decision Science (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US59240404P | 2004-08-02 | 2004-08-02 | |
PCT/US2005/027406 WO2006017495A2 (en) | 2004-08-02 | 2005-08-02 | Search engine methods and systems for generating relevant search results and advertisements |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1836555A2 EP1836555A2 (en) | 2007-09-26 |
EP1836555A4 true EP1836555A4 (en) | 2009-04-22 |
Family
ID=35839854
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP05778301A Withdrawn EP1836555A4 (en) | 2004-08-02 | 2005-08-02 | Search engine methods and systems for generating relevant search results and advertisements |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP1836555A4 (en) |
WO (1) | WO2006017495A2 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10783149B2 (en) | 2017-08-02 | 2020-09-22 | Microsoft Technology Licensing, Llc | Dynamic productivity content rendering based upon user interaction patterns |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030167252A1 (en) * | 2002-02-26 | 2003-09-04 | Pliant Technologies, Inc. | Topic identification and use thereof in information retrieval systems |
US20040093327A1 (en) * | 2002-09-24 | 2004-05-13 | Darrell Anderson | Serving advertisements based on content |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5278980A (en) * | 1991-08-16 | 1994-01-11 | Xerox Corporation | Iterative technique for phrase query formation and an information retrieval system employing same |
US5625748A (en) * | 1994-04-18 | 1997-04-29 | Bbn Corporation | Topic discriminator using posterior probability or confidence scores |
GB9625284D0 (en) * | 1996-12-04 | 1997-01-22 | Canon Kk | A data processing method and apparatus for identifying a classification to which data belongs |
US20030018659A1 (en) * | 2001-03-14 | 2003-01-23 | Lingomotors, Inc. | Category-based selections in an information access environment |
US20050114198A1 (en) * | 2003-11-24 | 2005-05-26 | Ross Koningstein | Using concepts for ad targeting |
-
2005
- 2005-08-02 WO PCT/US2005/027406 patent/WO2006017495A2/en active Application Filing
- 2005-08-02 EP EP05778301A patent/EP1836555A4/en not_active Withdrawn
Also Published As
Publication number | Publication date |
---|---|
WO2006017495A2 (en) | 2006-02-16 |
EP1836555A2 (en) | 2007-09-26 |
WO2006017495A3 (en) | 2007-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20060004732A1 (en) | Search engine methods and systems for generating relevant search results and advertisements | |
US7716207B2 (en) | Search engine methods and systems for displaying relevant topics | |
US7340466B2 (en) | Topic identification and use thereof in information retrieval systems | |
US7617199B2 (en) | Characterizing context-sensitive search results as non-spam | |
Medelyan | Human-competitive automatic topic indexing | |
US7634466B2 (en) | Realtime indexing and search in large, rapidly changing document collections | |
Kalashnikov et al. | Web people search via connection analysis | |
Biancalana et al. | Social semantic query expansion | |
Gupta et al. | Frequent item-set mining and clustering based ranked biomedical text summarization | |
US20110289081A1 (en) | Response relevance determination for a computerized information search and indexing method, software and device | |
Spangler et al. | Exploratory analytics on patent data sets using the SIMPLE platform | |
Rahman | Search engines going beyond keyword search: a survey | |
Kavuluru et al. | An up-to-date knowledge-based literature search and exploration framework for focused bioscience domains | |
Guo et al. | Complex-query web image search with concept-based relevance estimation | |
Spangler et al. | Simple: Interactive analytics on patent data | |
Durao et al. | Expanding user’s query with tag-neighbors for effective medical information retrieval | |
WO2006017495A2 (en) | Search engine methods and systems for generating relevant search results and advertisements | |
Shaila et al. | TAG term weight-based N gram Thesaurus generation for query expansion in information retrieval application | |
Acharya et al. | The process of information extraction through natural language processing | |
WO2007103096A2 (en) | Search engine methods and systems for displaying relevant topics | |
Alli | Result Page Generation for Web Searching: Emerging Research and Opportunities: Emerging Research and Opportunities | |
Briscoe et al. | Intelligent information access from scientific papers | |
Heenan | A Review of Academic Research on Information Retrieval | |
Durao et al. | Medical Information Retrieval Enhanced with User’s Query Expanded with Tag-Neighbors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20070619 |
|
AK | Designated contracting states |
Kind code of ref document: A2 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR |
|
AX | Request for extension of the european patent |
Extension state: AL BA HR MK YU |
|
DAX | Request for extension of the european patent (deleted) | ||
A4 | Supplementary search report drawn up and despatched |
Effective date: 20090325 |
|
RIC1 | Information provided on ipc code assigned before grant |
Ipc: G06F 17/30 20060101ALI20090319BHEP Ipc: G06Q 30/00 20060101AFI20090319BHEP |
|
17Q | First examination report despatched |
Effective date: 20090814 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20100225 |