CN100568172C - Systems and methods for interactive search query refinement - Google Patents

Systems and methods for interactive search query refinement

Info

Publication number
CN100568172C
Authority
CN
China
Prior art keywords
candidate term
document
term
ranked
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CNB2004800140270A
Other languages
Chinese (zh)
Other versions
CN1795432A (en
Inventor
Peter G. Anick
Alastair Gourlay
John Joseph Thrall
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Altaba Inc
Original Assignee
Yahoo Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yahoo Inc
Publication of CN1795432A
Application granted
Publication of CN100568172C
Anticipated expiration
Expired - Lifetime

Abstract

A received query is processed to generate an initial group of ranked documents corresponding to the received query. Each document in all or a portion of the initial group of ranked documents is associated with a respective set of ranked candidate terms, such that each candidate term in the respective set is embedded within the document. Each respective set of ranked candidate terms is identified at a time prior to the processing of the received query. In accordance with a selection function, a subset of the candidate terms in one or more of the respective sets is selected. In response to the received query, the initial group of ranked documents and the subset of candidate terms are presented.

Description

Systems and methods for interactive search query refinement
Technical field
This application claims priority to U.S. Patent Application Serial No. 60/456,905, entitled "Systems and Methods For Interactive Search Query Refinement," filed March 21, 2003 (attorney docket 10130-044-888), which is hereby incorporated by reference in its entirety. The present invention relates to the field of search engines, such as search engines for locating documents in a database or documents stored on servers coupled to the Internet or an intranet. More specifically, the invention relates to systems and methods for assisting search engine users in refining their search queries so as to locate documents of interest to them.
Background
Developing a search expression that both conveys a user's information need and matches the way that need is expressed in the vocabulary of the target documents has long been recognized as a difficult cognitive task for users of text search engines. Most search engine users begin their search for a document with a query containing only one or two words, and are then disappointed when they do not find the document or documents they want within the first ten or so results produced by the search engine. While user satisfaction can be improved, at least for some searches, by improving the way results are ranked, very broad search queries cannot satisfy the more specific information needs of many different search engine users. One way to help a user refine a query expression is to offer term suggestions, just as a librarian might do in a face-to-face conversation with an information seeker. Doing this automatically, however, is quite different, because the system must "guess" which of the hundreds of terms that may be conceptually related to a query are most likely to be relevant to the user conducting the search. Common methods for selecting related terms include consulting an online thesaurus or a database of previously logged queries (which can be searched to find previous queries that contain one or more words of the current query). A weakness of such methods is that there is no guarantee that the related terms so generated actually reflect the subject matter or vocabulary used in the corpus of documents itself. For this reason, alternative methods that attempt to glean related terms dynamically from the actual results of a query have received much interest.
Existing methods that use search results to generate refinement suggestions include term relevance feedback (e.g., Velez et al., "Fast and Effective Query Refinement," in Proceedings of SIGIR '97, pp. 6-15), Hyperindex (Bruza and Dennis, "Query Reformulation on the Internet: Empirical Data and the Hyperindex Search Engine," in Proceedings of RIAO '97, pp. 500-509), Paraphrase (Anick and Tipirneni, "The Paraphrase Search Assistant: Terminological Feedback for Iterative Information Seeking," in Proceedings of SIGIR '99, pp. 153-159), and clustering (Zamir and Etzioni, "Web Document Clustering: A Feasibility Demonstration," in Proceedings of SIGIR '98, pp. 46-54). Most relevance feedback methods are designed for partial-match search engines and typically involve broadening a query expression by adding multiple weighted terms derived from computations over a subset of retrieved documents that the user has explicitly tagged as relevant or not relevant. Hyperindex runs a syntactic analyzer over the snippets returned by the search engine to extract noun phrases that contain the query term. Paraphrase extracts noun phrases from the result set documents and chooses feedback terms for display based on lexical dispersion. Clustering methods attempt to cluster result set snippets and derive representative query terms from the terms appearing within each cluster. While many of these methods are functional, they are poorly suited to very large web search engines, either because of runtime performance or because of the relevance of the feedback terms generated. There remains a need in the art for effective methods for assisting a user in identifying related terms to improve a search.
To better understand the limitations of the prior art, consider further Velez et al., "Fast and Effective Query Refinement," in Proceedings of SIGIR '97, pp. 6-15. Velez et al. provide a system and method for query refinement in which automatically suggested terms are added to an initial query in order to refine it. The authors build upon the generic query refinement program DM. As presented by Velez et al., DM has the following steps:
Let:
C = the document corpus
q = the user query
r = the number of matching documents to consider
$W_{fcn}(S)$ = an algorithm-specific weighting of the term set S
Then:
1. Compute the set of documents $D(q) \subseteq C$ that match the query q.
2. Select the subset $D_r(q)$ of the top r matching documents.
3. Compute the set of terms T(q) from the documents $D_r(q)$, such that $T(q) = \{\, t \mid \exists\, d \in D_r(q) : t \in d \,\}$,
where d is a document and t is a term.
4. Compute the subset S of the n terms from T(q) that have the highest weight $W_{fcn}(S)$.
5. Present S to the user as a set of term suggestions.
As Velez et al. note, this method is unsatisfactory because it has a very long runtime. In other words, when the document database (corpus) is very large, computing the set S of term suggestions with DM takes an unacceptable amount of time.
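The DM steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of Velez et al.: the function name `dm_suggest`, the whitespace tokenisation, the "contains every query word" matching rule, the use of corpus order as rank, and the document-frequency weight standing in for $W_{fcn}$ are all assumptions made for the example.

```python
from collections import Counter

def dm_suggest(corpus, query, r, n):
    """Toy sketch of the generic DM query refinement steps (assumed details)."""
    query_terms = query.lower().split()
    # Step 1: D(q), the documents that contain every query term.
    matches = [doc for doc in corpus
               if all(t in doc.lower().split() for t in query_terms)]
    # Step 2: D_r(q), the top r matches (corpus order stands in for rank).
    top = matches[:r]
    # Step 3: T(q), every term that occurs in some document of D_r(q).
    # Assumed weight: the number of top documents that contain the term.
    doc_freq = Counter()
    for doc in top:
        doc_freq.update(set(doc.lower().split()))
    # Assumption: do not suggest words the user already typed.
    for t in query_terms:
        doc_freq.pop(t, None)
    # Step 4: the n highest-weighted terms, ties broken alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    # Step 5: S, the suggestions shown to the user.
    return [term for term, _ in ranked[:n]]
```

Every step after the first runs over the text of the matching documents at query time, which is why the cost grows with the size of the corpus and of the result set.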
Velez et al. seek to improve the speed of DM by precomputing a substantial amount of the work that DM performs dynamically. In this precomputation phase, Velez et al. generate a data structure that maps each single-word term t in the corpus to the respective set of terms m that the DM algorithm would suggest for the single-term query t. Then, at runtime, an arbitrary query is received from the user. The query typically comprises a set of terms. In response to the query, Velez et al. collect the respective term set m corresponding to each term in the query, merge these sets into a single set, and return that set to the user as suggestions for an improved search. Consider, for example, the case in which the user enters the query "space shuttle". In this instance, Velez et al. would obtain the precomputed term set m for the word "space" and the precomputed term set m for the word "shuttle", and merge them to derive a set of suggested terms for the query "space shuttle".
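The runtime half of this precomputation scheme amounts to a dictionary lookup per query word followed by a set union. The sketch below illustrates the idea only; the table `TERM_SETS`, its contents, and the function name `suggest_for_query` are hypothetical and do not come from Velez et al.

```python
# Hypothetical precomputed table: single term t -> suggestion set m.
TERM_SETS = {
    "space": {"nasa", "station"},
    "shuttle": {"bus", "nasa"},
}

def suggest_for_query(query, term_sets):
    """Merge the precomputed suggestion sets of each word in the query."""
    merged = set()
    for term in query.lower().split():
        # Terms with no precomputed set contribute nothing.
        merged |= term_sets.get(term, set())
    return merged
```

With this toy table, the query "space shuttle" yields "nasa", "station" and "bus". The last suggestion comes from "shuttle" looked up in isolation, even though it has little to do with space shuttles.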
Although this approach improves runtime performance by precomputing a subset of term relationships offline, the method of Velez et al. has drawbacks. First, there is a context problem. The method relies on the assumption that the term set m associated with a given term t is the same regardless of whether t appears alone or as part of a multi-term query. This assumption does not always hold. A term appearing within a multi-term phrase can, in some cases, express a meaning completely different from that of the term appearing alone. Because of this assumption, the method may produce inappropriate search term suggestions in some cases, or may miss other suggestions that would be more relevant in the context of the whole query. Second, when the corpus (document database) changes, the method of Velez et al. requires recomputation of the term sets m associated with the terms t in the corpus, because each term set m depends on the contents of a plurality of files in the corpus, which may include files recently added to the corpus.
Xu and Croft, in Proceedings of SIGIR '97, pp. 4-11, describe another method in which sets of terms related to a given concept are precomputed before a search query, which may comprise several concepts (search terms), is received. As with Velez et al., the method of Xu and Croft relies on the construction of static cross-document data structures and statistics, which requires extensive recomputation of the terms related to a concept as the corpus changes over time. Accordingly, the computational demands of Xu and Croft are unsatisfactory for large, dynamic document databases.
Given the above background, it would be desirable to help users refine their search queries into more narrowly defined queries, so as to produce search results that are closer to what they want.
Summary of the invention
The present invention provides improved methods for refining a search query that is designed to retrieve documents from a document index. The invention is advantageous because it does not rely on cross-document data structures or global statistics that must be recomputed each time the corpus is updated. In addition, the invention requires significantly fewer I/O resources at query time (runtime), because, compared with known methods, fewer results need to be fetched at runtime in order to produce a short list of relevant suggestions that contains a mixture of phrases, single-word terms, and specializations (phrases containing a query term). In the present invention, each document in the document index is processed at some time prior to the query, for example during generation of the document index. In this processing, each document in the document index is examined to determine whether it contains any terms suitable for inclusion in a set of ranked candidate terms for the document. When the document contains such terms, the document index entry for the document is configured to include the set of terms associated with the document. This set is referred to as the set of ranked candidate terms.
When a query is received, an initial group of documents is retrieved from the document index. The initial group of documents is ranked by relevance to the query. The "initial group" of documents may be a subset of the complete set of documents identified as potentially relevant to the query. In one embodiment, the number of documents in the initial group is the lesser of (i) the number of all documents identified as potentially relevant to the query and (ii) a parameter value that is typically between 20 and 200 (e.g., 50). A weighting function is then applied to each candidate term that appears in any set of ranked candidate terms associated with a document in the initial group of ranked documents. The top-ranked candidate terms are presented, together with the initial group of ranked documents, in response to the query. User selection of one of the presented candidate terms causes that term to be added to the original search query.
One aspect of the present invention provides a method for refining a received query when searching a document index. The received query is processed to generate an initial group of ranked documents corresponding to the received query. Each document in all or a portion of the initial group of ranked documents is associated with a respective set of ranked candidate terms. The precomputed set of ranked candidate terms associated with each document is included in the entry for that document in the document index. Each candidate term may be a word or a phrase. Furthermore, in a preferred embodiment, the respective sets of candidate terms are constructed at some time prior to the processing of the received query. The method continues with the selection of a subset of the candidate terms that are in one or more of the respective sets of ranked candidate terms. A selection function is used to select this subset of candidate terms. Then, in response to the received query, the initial group of ranked documents and the subset of candidate terms are presented. In some embodiments, the processing, selecting, and presenting steps are repeated using a revised query that includes the originally received query and a candidate term from the subset of candidate terms.
In some embodiments, the set of candidate terms associated with a document is constructed by comparing terms in the document against a master list of candidate terms. When a term is in the master list of candidate terms, the term is added, as a candidate term, to the set of candidate terms associated with the document. In some embodiments, the master list of candidate terms includes more than 10,000,000 candidate terms. The comparison is repeated until a maximum number of terms in the document has been considered or until a threshold number of unique terms has been considered. A weighting and/or selection function is then applied to the set of candidate terms to produce the set of ranked candidate terms. Typically, this weighting and/or selection function ranks the candidate terms and then applies a cutoff in which only the highest-ranked terms are retained. In some embodiments, the master list of candidate terms is optimized for a particular language (e.g., English, Spanish, French, German, Portuguese, Italian, Russian, Chinese, or Japanese). In some embodiments, each document in all or a portion of the initial group of ranked documents is in the same language as the language for which the master list of candidate terms has been optimized.
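The index-time construction described above can be illustrated with a minimal Python sketch. Several details are assumptions for the example rather than part of the described method: the document arrives as an already extracted list of words and phrases, in-document frequency serves as the weighting function, and the names `build_ranked_candidates`, `max_terms` and `keep` are hypothetical.

```python
from collections import Counter

def build_ranked_candidates(doc_terms, master_list, max_terms=1000, keep=20):
    """Return the ranked candidate terms for one document (assumed weighting)."""
    counts = Counter()
    for considered, term in enumerate(doc_terms):
        # Stop once the maximum number of document terms has been considered.
        if considered >= max_terms:
            break
        # Only terms found in the master list become candidate terms.
        if term in master_list:
            counts[term] += 1
    # Assumed weighting: rank by frequency, ties broken alphabetically.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    # Cutoff: keep only the highest-ranked terms.
    return ranked[:keep]
```

Because the function looks only at one document and the master list, the result can be stored in that document's index entry and does not need to be recomputed when other documents are added to the corpus.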
In some embodiments, each document in the document index is classified at some time prior to query processing (e.g., during initial document indexing). In some embodiments, there are two possible classes: a first, family-friendly class and a second, non-family-friendly class. A designation of the class of the document is included in the document index.
In some embodiments, a single-word candidate term in a set of ranked candidate terms is discarded when it is a subset (substring) of a more complex term in the same set. In addition, the more complex term is credited with the number of instances in which the simpler term appears in all or an upper portion of the document associated with the set of ranked candidate terms. The discarding and crediting steps are repeated until no single-word candidate term remains that is a subset of a more complex candidate term in the set. The same procedure can also be applied to multi-word candidate terms that are subsets of more complex terms.
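The discard-and-credit step can be sketched as follows. This is a hedged illustration: the input is assumed to be a mapping from candidate term to its instance count in the document, containment is tested at the word level, and when several longer terms contain the same word the sketch credits the one with the highest count. That tie-breaking rule and the function name `fold_subsumed_terms` are assumptions.

```python
def fold_subsumed_terms(counts):
    """Discard single-word terms contained in a longer term and credit the latter."""
    result = dict(counts)
    for term in sorted(counts):
        if " " in term:
            continue  # this sketch folds single-word terms only
        # Longer candidate terms that contain this word.
        hosts = [t for t in result if t != term and term in t.split()]
        if hosts:
            # Assumption: credit the containing term with the highest count.
            host = max(hosts, key=lambda t: (result[t], t))
            result[host] += result.pop(term)
    return result
```

For example, a document whose candidates are "shuttle" (4 instances), "space shuttle" (3) and "nasa" (2) ends up with "space shuttle" credited with 7 instances and "shuttle" removed.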
In some embodiments, a candidate term in a set of ranked candidate terms that is an orthographic variant or inflectional variant of a second term in the set is discarded. In addition, the second term is credited with the number of instances in which the orthographic or inflectional variant appears in all or an upper portion of the document associated with the set of ranked candidate terms. The discarding and crediting steps are repeated until no term remains that is an orthographic or inflectional variant of another term in the set. In some instances, the second term is rewritten in the candidate set as a combined term that includes both (or several) orthographic or inflectional variants, with the variant that appears most often in all or an upper portion of the associated document appearing first in the combined term. In some embodiments, when a combined term is selected for inclusion in the subset of candidate terms that is presented, only the first portion of the combined term is presented to the user.
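A small sketch may help make the variant merging concrete. The variant test used here (ignore case, hyphens, spaces and a trailing "s") is a deliberately crude assumption, as are the tuple representation of a combined term and the names `normalise`, `merge_variants` and `display_form`. A real system would use proper morphological and orthographic analysis.

```python
def normalise(term):
    """Toy variant key (assumption): ignore case, hyphens, spaces and a final 's'."""
    key = term.lower().replace("-", "").replace(" ", "")
    return key[:-1] if key.endswith("s") else key

def merge_variants(counts):
    """Group variant terms into combined terms and sum their instance counts."""
    groups = {}
    for term, n in counts.items():
        groups.setdefault(normalise(term), []).append((term, n))
    merged = {}
    for variants in groups.values():
        # The most frequent variant comes first in the combined term.
        variants.sort(key=lambda tn: (-tn[1], tn[0]))
        combined = tuple(t for t, _ in variants)
        merged[combined] = sum(n for _, n in variants)
    return merged

def display_form(combined):
    """Only the first portion of a combined term is shown to the user."""
    return combined[0]
```

Keeping the discarded variants inside the combined term preserves them for matching, while the user sees only the most common spelling.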
Some embodiments of the present invention provide various selection functions that are used to select the subset of candidate terms to be presented in response to a query. In some embodiments, the selection function makes use of information found in the sets of candidate terms associated with the top-ranked documents in the initial group of ranked documents. This information includes two forms of ranking. First, the documents are ranked. Second, each candidate term in each set of ranked candidate terms associated with a document in the initial group of ranked documents is ranked.
In one embodiment, the selection function comprises applying a weighting function to each candidate term in each respective set of ranked candidate terms that is associated with a top-ranked document in the initial group of ranked documents. As used herein, a top-ranked document is a document whose rank is numerically less than some threshold rank (e.g., 50; in other words, the top-ranked documents are the first 50 documents in the initial group of ranked documents returned for the query). Consider, for example, the case in which the initial group of ranked documents includes 100 documents and the threshold rank is 50. The first 50 documents are then considered top-ranked documents. The candidate terms that receive the highest weight are included in the subset of candidate terms presented with the query results. In some embodiments, the weight applied to a candidate term by the weighting function is determined in accordance with the number of sets of candidate terms associated with top-ranked documents in which the candidate term appears, the average position of the candidate term in each such set of ranked candidate terms, whether a term in the received query is in the candidate term, the number of characters in the candidate term, or the average rank position of the top-ranked documents whose respective sets of candidate terms include the term. In some embodiments, the weight applied to a candidate term by the weighting function is determined in accordance with any combination, or any weighted subset, of TermCount, TermPosition, ResultPosition, TermLength, and QueryInclusion, where
TermCount is the number of sets of ranked candidate terms that both (i) contain the candidate term and (ii) are respectively associated with a top-ranked document,
TermPosition is a function (e.g., an average) of the position of the candidate term in those sets of ranked candidate terms that both (i) contain the candidate term and (ii) are respectively associated with a top-ranked document,
ResultPosition is a function (e.g., an average) of the rank of those top-ranked documents that are associated with a set of ranked candidate terms containing the candidate term,
TermLength is the number of characters in the candidate term (the complexity of the candidate term), and
QueryInclusion is a value that indicates whether a term in the received query is in the candidate term.
In some embodiments, the weight applied to a candidate term by the weighting function is determined in accordance with the formula:
$$\mathrm{TermCount} + \mathrm{TermPosition} + \mathrm{ResultPosition} + \mathrm{TermLength} + \mathrm{QueryInclusion}$$
In some embodiments, each of TermCount, TermPosition, ResultPosition, TermLength, and QueryInclusion is independently weighted. In some embodiments, the weight applied to a candidate term by the weighting function is determined in accordance with the formula:
$$
\begin{aligned}
&(\mathrm{TermCount} \cdot w_1) \\
&+ (\mathrm{TermPosition} \cdot (w_2 + \mathrm{RefinementDepth} \cdot w_2')) \\
&+ (\mathrm{ResultPosition} \cdot w_3) \\
&+ (\mathrm{TermLength} \cdot (w_4 + \mathrm{RefinementDepth} \cdot w_4')) \\
&+ (\mathrm{QueryInclusion} \cdot (w_5 + \mathrm{RefinementDepth} \cdot w_5'))
\end{aligned}
$$
where $w_1$, $w_2$, $w_3$, $w_4$, $w_5$, $w_2'$, $w_4'$, and $w_5'$ are independent weights, and RefinementDepth is the number of times the processing has been performed for the received query.
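The five features and the weighted formula can be put together in a short Python sketch. The formula follows the text, but everything else is an assumption made for illustration: positions and ranks are 1-based, TermPosition and ResultPosition are averages, QueryInclusion is 0 or 1 based on shared words, and the weight values used in the example (including negative weights for the two position features, so that earlier positions score higher) are invented.

```python
def features(term, top_docs, query):
    """Compute the five features of one candidate term.

    top_docs[i] is the ranked candidate list of the document at rank i + 1.
    The term is assumed to occur in at least one of these lists.
    """
    positions, ranks = [], []
    for rank, candidates in enumerate(top_docs, start=1):
        if term in candidates:
            positions.append(candidates.index(term) + 1)
            ranks.append(rank)
    query_words = set(query.lower().split())
    return {
        "TermCount": len(ranks),
        "TermPosition": sum(positions) / len(positions),
        "ResultPosition": sum(ranks) / len(ranks),
        "TermLength": len(term),
        "QueryInclusion": int(bool(query_words & set(term.lower().split()))),
    }

def weight(f, w, depth=0):
    """Weighted formula from the text; depth plays the role of RefinementDepth."""
    return (f["TermCount"] * w["w1"]
            + f["TermPosition"] * (w["w2"] + depth * w["w2p"])
            + f["ResultPosition"] * w["w3"]
            + f["TermLength"] * (w["w4"] + depth * w["w4p"])
            + f["QueryInclusion"] * (w["w5"] + depth * w["w5p"]))

def select_terms(top_docs, query, w, depth=0, k=10):
    """Return the k highest-weighted candidate terms across the top documents."""
    terms = {t for candidates in top_docs for t in candidates}
    return sorted(
        terms,
        key=lambda t: (-weight(features(t, top_docs, query), w, depth), t),
    )[:k]

# Illustrative weights only; the source gives no values.
EXAMPLE_WEIGHTS = {"w1": 10, "w2": -1, "w2p": 0, "w3": -1,
                   "w4": 0.1, "w4p": 0, "w5": 2, "w5p": 0}
```

All inputs are the per-document candidate lists already stored in the index, so selection needs no access to document text and no corpus-wide statistics.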
In some embodiments, the selection function includes determining, for each document in the initial group of ranked documents, the class of the document. Then, when a threshold percentage of the documents in the group belong to a first class (e.g., the family-friendly class), the sets of ranked candidate terms belonging to documents that are members of the second class (e.g., the non-family-friendly class) are not used to form the subset of candidate terms.
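The class-based filter can be sketched as a single function. The class labels, the 80% threshold and the name `candidate_sets_to_use` are assumptions for the example; the text specifies only that some threshold percentage is used.

```python
def candidate_sets_to_use(top_docs, threshold=0.8):
    """Pick which candidate sets feed the selection function.

    top_docs is a list of (document_class, ranked_candidates) pairs.
    """
    if not top_docs:
        return []
    friendly = sum(1 for cls, _ in top_docs if cls == "family_friendly")
    if friendly / len(top_docs) >= threshold:
        # Mostly family-friendly results: ignore the other class's terms.
        return [cands for cls, cands in top_docs if cls == "family_friendly"]
    # Otherwise every candidate set is eligible.
    return [cands for _, cands in top_docs]
```

The effect is that a query whose results are overwhelmingly family-friendly does not surface refinement suggestions drawn from the few results that are not.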
Another aspect of the present invention provides a computer program product for use in conjunction with a computer system. The computer program product comprises a computer-readable storage medium and a computer program mechanism embedded therein. The computer program mechanism comprises a query refinement suggestion engine for refining a received query when searching a document index. The engine comprises instructions for processing the received query so as to generate an initial group of ranked documents corresponding to the received query. Each document in all or a portion of the initial group of ranked documents is associated with a respective set of ranked candidate terms, and the precomputed set of ranked candidate terms associated with each document is included in the entry for that document in the document index. Each respective set of ranked candidate terms is identified at a time prior to the processing of the received query. The engine further comprises instructions for selecting, in accordance with a selection function, a subset of candidate terms that are in one or more of the respective sets of candidate terms. In addition, the engine comprises instructions for presenting, in response to the received query, the initial group of ranked documents and the subset of candidate terms.
Another aspect of the present invention provides a document index data structure comprising a plurality of uniform resource locators (URLs). Each URL designates a respective document. Each document in all or a portion of the documents designated by the plurality of URLs is associated with a respective set of ranked candidate terms. The precomputed set of ranked candidate terms associated with each document is included in the entry for that document in the document index. In addition, these candidate terms are ranked by a weighting function. In some embodiments, a respective set of ranked candidate terms is created by the following steps:
(A) comparing terms in the document associated with the respective set of ranked candidate terms against a master list of candidate terms, wherein, when a term is in the master list of candidate terms, the term is added to the respective set of ranked candidate terms as a candidate term;
(B) repeating the comparing until a maximum number of terms in the document has been considered; and
(C) ranking the candidate terms in accordance with a weighting function, thereby forming the ranked candidate terms.
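One way to picture the document index data structure is as a mapping from URL to an entry that carries the precomputed candidate terms. The sketch below is an assumption about layout, not the patent's storage format: the class names `IndexEntry` and `DocumentIndex`, the field names, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class IndexEntry:
    """One document's entry in the index (field names are assumptions)."""
    url: str
    document_class: str = "family_friendly"
    ranked_candidates: list = field(default_factory=list)

class DocumentIndex:
    """Toy index keyed by URL."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        self._entries[entry.url] = entry

    def candidates_for(self, urls):
        """Fetch the stored candidate lists for a ranked list of result URLs."""
        return [self._entries[u].ranked_candidates
                for u in urls if u in self._entries]
```

A usage example under the same assumptions:

```python
index = DocumentIndex()
index.add(IndexEntry("http://example.com/a", ranked_candidates=["nasa", "launch"]))
index.add(IndexEntry("http://example.com/b", ranked_candidates=["challenger disaster"]))
index.candidates_for(["http://example.com/b", "http://example.com/a"])
```

Since each entry is self-contained, adding or removing a document touches only that document's entry.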
Brief description of the drawings
The aforementioned features and advantages of the invention, as well as additional features and advantages thereof, will be more clearly understood hereinafter from the detailed description of preferred embodiments of the invention, taken in conjunction with the drawings.
Fig. 1 illustrates a client computer submitting a query to a search engine.
Fig. 2 illustrates a search results page, produced in accordance with an embodiment of the present invention, that includes query refinement suggestions.
Fig. 3 is a block diagram of a search engine server.
Fig. 4 is a block diagram of a search engine index.
Fig. 5 is a flow chart of a document indexing method.
Fig. 6 is a flow chart of a process for handling a query submitted by a user.
Like reference numerals refer to corresponding parts throughout the drawings.
Detailed description
In a typical embodiment, the present invention efficiently generates a small set (10 to 20) of query refinement suggestions (a subset of candidate terms) that are likely to be highly relevant to the user's query and that reflect the vocabulary of the target documents.
As shown in Fig. 1, a search query is submitted by a client computer 100 to a search engine server 110. Upon receiving the search query, search engine server 110 identifies documents in document index 120 that are relevant to the search query. In addition, search engine server 110 ranks the relevant documents, for example by their degree of relevance to the search query or by other ranking factors. A description of this group of ranked documents (the search results) is then returned to client computer 100 as the group of ranked documents. In the present invention, additional information, in the form of a subset of candidate terms (search refinement suggestions), is returned to the client computer together with the initial group of ranked documents.
Before turning to how server 110 generates the subset of candidate terms, a screen shot of the search results and search refinement suggestions returned by an embodiment of search engine server 110 is provided in Fig. 2, in order to better illustrate the advantages of the present invention. In Fig. 2, a user has provided an initial query (received query) 132. When the find button 134 is pressed, query 132 is sent from client computer 100 to search engine server 110. Upon receiving query 132, search engine server 110 processes it and sends search results and search refinement suggestions back to client computer 100 in the form of an initial group of ranked documents and a subset of candidate terms. The subset of candidate terms is displayed in panel 140 of interface 180. Specifically, each term 136 in the subset of candidate terms is displayed in area 140 together with a tag 138. Concurrently, a listing of the search results (the top-ranked documents in the initial group of ranked documents) is displayed in panel 142. The systems and methods of the present invention are directed to the identification of terms 136 that can refine, change, or improve initial query 132. When the user presses a tag 138, the term 136 corresponding to tag 138 is added to initial query 132, and the whole process is repeated with the new query. When the user presses another tag 139, the term 136 corresponding to tag 139 replaces initial query 132, and the search engine server processes term 136 as a new query. In embodiments not shown, one or more additional tags corresponding to each term 136 can be added to panel 140. In one example, there is a tag for adding the corresponding term 136 to an exclusion list. For example, when the initial query is "A" and the user presses the exclusion tag for term "B", the new query becomes "A" and not "B".
In addition to the subset of terms displayed in panel 140, the initial group of ranked documents is displayed in panel 142. To conserve bandwidth between computer 100 and server 110, in typical embodiments the initial group of ranked documents usually includes only indicia of each document in the group, so that the user can determine the nature of each document. Such indicia are still referred to herein as the initial group of ranked documents.
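The three tag behaviours described above (add the term, replace the query, exclude the term) reduce to three small query rewrites. The sketch below is illustrative only: the function names and the query syntax (quoted phrases and an `AND NOT` operator) are assumptions, since the text does not specify how the revised query is written.

```python
def add_term(query, term):
    """Tag 138 behaviour: append the suggested term to the current query."""
    return f'{query} "{term}"'

def replace_query(query, term):
    """Tag 139 behaviour: the suggested term becomes the whole new query."""
    return f'"{term}"'

def exclude_term(query, term):
    """Exclusion tag behaviour: keep the query but rule the term out."""
    return f'{query} AND NOT "{term}"'
```

Each rewritten query is then submitted to the search engine exactly like an initial query, so the same processing, selection and presentation steps are repeated.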
The overview of system and method for the present invention is disclosed.From this overview as can be known, lot of advantages of the present invention and feature are conspicuous.Novel algorithm of the present invention provides the tabulation of the recommended items 136 that can be used to improve initial query automatically to the user.For example, in Fig. 2, initial query 132 is " space shuttle ".In response to this initial query, embodiments of the invention provide a candidate item subclass that comprises 136 (for example " Challenger Disaster ").Item " ChallengerDisaster " is added to initial query or provides probably the closer inquiry of match user interest with the way that " Challenger Disaster " substitutes initial query for the user.By using new candidate item subclass, the user can set up improved inquiry under the situation of the document (or its index) in need not to analyze the initial sets of orderly document.Therefore, use the present invention, need no longer to determine why initial query produces too many (or very little) result not directly related with user's information requirement.
Now that an overview of the present invention and its advantages has been given, a more detailed description of the systems and methods of the invention is disclosed below. To this end, Fig. 3 shows a search engine server 110 according to one embodiment of the present invention. In a preferred embodiment, search engine server 110 is implemented using one or more computer systems 300 (as shown in Fig. 3). Those skilled in the art will appreciate that search engines designed to handle large volumes of queries may use a computer architecture more complex than the computer system shown in Fig. 3. For example, a set of front-end servers can be used to receive queries and distribute them among a set of back-end servers that actually process the queries. In such a system, the system 300 shown in Fig. 3 would be one of the back-end servers.
Computer system 300 typically has a user interface 304 (including a display 306 and a keyboard 308), one or more processing units (CPUs) 302, a network or other communication interface 310, memory 314, and one or more buses 312 for interconnecting these components. Memory 314 can include high-speed random access memory and can also include non-volatile memory, such as one or more magnetic disk storage devices (not shown). Memory 314 can include mass storage located remotely from the central processing unit(s) 302. Memory 314 preferably stores:
● an operating system 316, which includes procedures for handling various basic system services and for performing hardware-dependent tasks;
● a network communication module 318, which is used to connect system 300 to various client computers 100 (Fig. 1), and possibly to other servers or computers, via one or more communication networks such as the Internet, other wide area networks, local area networks (for example, a local wireless network can connect client computers 100 to computer 300), metropolitan area networks, and so on;
● a query processor 320, for receiving a query from a client computer 100;
● a search engine 332, for searching document index 352 to find documents related to the query, and for forming the initial set of ranked documents related to the query; and
● a query refinement suggestion engine 324, for implementing many aspects of the present invention.
Query refinement suggestion engine 324 can include executable procedures, sub-modules, tables, and other data structures. In one embodiment, refinement suggestion engine 324 includes:
● a selection function 326, for identifying the subset of candidate items to be presented together with the initial set of ranked documents; and
● a result formatting module 328, for formatting the subset of candidate items and the initial set of ranked documents for presentation.
The method of the present invention begins with actions taken by document indexer 344 before query 132 is received by query processor 320. Document indexer 344 builds document index 352 using web crawling and indexing techniques. In addition to this conventional function, however, document indexer 344 includes novel program modules that further process the documents in document index 352. For example, document indexer 344 includes a "candidate item set builder" 346. In a preferred embodiment, builder 346 examines each document in document index 352. In other embodiments, only documents that satisfy predetermined criteria (for example, documents containing text in one of a predetermined set of languages) are examined by builder 346.
For each document examined, builder 346 determines whether the document contains any candidate items embedded in it. There are many different ways in which builder 346 can accomplish this task, and all such ways are included within the scope of the present invention. In one embodiment, the task is accomplished by matching items from the document against a master list 342 of candidate items. Master list 342 contains all possible candidate items. In some embodiments, list 342 is a Unix-style text file containing the list of valid candidate items. A representative format for list 342 is one candidate item per line, where each candidate item in list 342 is unique, is UTF-8 encoded, and has had all commas, tabs, end-of-line characters, and @ symbols removed. In some embodiments, the master list is restricted to nouns and noun phrases (the kinds of items most likely to be useful as query terms), with any noun phrases of limited query refinement value explicitly removed.
In typical embodiments, only a first portion of each document in document index 352 is examined for candidate items. For example, in some cases builder 346 examines only the first 100,000 bytes of each document in document index 352. In some embodiments, builder 346 examines a document in document index 352 until a maximum number of items in the document (e.g. 100, 1000, 5000, etc.) has been considered. In some embodiments, the search for candidate items in a document stops once a threshold number of unique items in the document (e.g. 1000 items) has been found in master list 342.
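The stopping conditions described above can be illustrated with a small sketch. It assumes, purely for illustration, that items are single whitespace-separated words and that the master list fits in an in-memory set; the function name collect_candidates and its parameter names are hypothetical.

```python
def collect_candidates(text, master_list, max_bytes=100_000,
                       max_terms=5000, max_matches=1000):
    """Scan only the top portion of a document for master-list items.

    Assumption for this sketch: each item is a single word. The three
    limits mirror the byte, item-count, and unique-match cutoffs in the text.
    """
    # Keep only the first max_bytes bytes of the UTF-8 encoded document.
    head = text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
    found, seen = [], set()
    for n, word in enumerate(head.split()):
        # Stop once enough items were considered or enough matches found.
        if n >= max_terms or len(found) >= max_matches:
            break
        if word in master_list and word not in seen:
            seen.add(word)
            found.append(word)  # preserve order of first appearance
    return found
```

The order-preserving list is a design choice of the sketch; it keeps the first position of each item available for later ranking.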
Some embodiments of the present invention provide more than one master list 342 of candidate items. Each master list 342 is optimized for a different language. For example, a first list 342 is optimized for English and a second list 342 is optimized for Spanish. Thus, the English list 342 contains informative items found in English documents, and the Spanish list 342 contains informative items found in Spanish documents. Similarly, some embodiments of the present invention include lists optimized for French, German, Portuguese, Italian, Russian, Chinese, or Japanese. In some embodiments of the present invention, lists 342 are optimized for other types of categories. For example, in some embodiments a list 342 is optimized to contain scientific items, fashion items, engineering items, or travel items. In a preferred embodiment, however, each master list 342 contains as many informative items as possible. In practice, a master list 342 can contain more than 10,000,000 items, and typically contains far more than 1,000,000 items. Each of these items can be a word or a phrase. For clarity, a representative phrase is "Challenger Disaster".
Methods for determining the primary language used in a document are well known in the art. Therefore, in some embodiments of the present invention, builder 346 uses such a method to (i) determine the language of the document being examined and then (ii) use the master list 342 that is optimized for the same language as the document.
Where one or more candidate items in master list 342 are embedded in the top portion (e.g. the first 100 kilobytes) of a document in index 352, the net result of builder 346 examining the document is the identification of these items. As builder 346 identifies these items, they are added in ordered form to a data structure associated with the document. This data structure is referred to as a candidate item set. After index 352 has been examined by builder 346, each document in index 352 that has candidate items embedded in its top portion is associated with a respective candidate item set containing those items. Thus, for example, if there are two documents in index 352 that contain candidate items (A and B), a first candidate item set is associated with document A and a second candidate item set is associated with document B. The first set includes each candidate item embedded in the top portion of document A, and the second set includes each candidate item embedded in the top portion of document B. In practice, each candidate item set is internally ordered so that it constitutes a ranked set of its candidate items, as disclosed in more detail below.
Fig. 4 shows how the examination of a document 402 in document index 352 by builder 346 results in a modification of document index 352. Before builder 346 examines the documents in index 352, each document 402 in index 352 includes the uniform resource locator (URL) 406 of document 402 and a set of feature values 408. Feature values 408 include metadata associated with the document, and include values used to assist the search engine when ranking documents identified as potentially related to a query. Feature values can include an indication of the file format of the document, the length of the document, the number of known inbound links to the document (from other documents), the title of the document (e.g. for display when the document is selected as responsive to a query), and so on. After document 402 has been examined by builder 346 (Fig. 3), a candidate item set 410 is associated with document 402.
In some embodiments of the present invention, the method used to match items in a document against the candidate items in list 342 ensures that the most complex possible candidate item in list 342 is matched. For example, consider the case in which "A B" is embedded in a document in index 352, where A and B are each words. Further, assume that list 342 contains "A", "B", and "A B". In this case, the item "A B" in the document is matched with "A B" in list 342, and not with "A" or "B". There are many possible ways of achieving such matching, and all such matching schemes are within the scope of the present invention. One such matching method uses a "greedy left-to-right algorithm" with the following logic:
For each sentence in the document of the form "A B C D ...", examine as follows:
Is "A" a prefix of a candidate item in list 342?
  o Yes: is "A B" a prefix of a candidate item in list 342?
    ■ Yes: is "A B C" a prefix of a candidate item in list 342?
      ● Yes -> continue through the sentence in the same manner
      ● No: add "A B" to the candidate item set 410 associated with the document, and move to C to consider "C D E F ..."
    ■ No: add "A" to the candidate item set 410 associated with the document, and move to B to consider "B C D E ..."
  o No: move to B and begin considering "B C D E ..."
Such an algorithm ensures that the most complex item in list 342 is matched with an item in the document. Here, a "sentence" is any arbitrary portion of the document, such as a line, or the part between two phrase boundaries or other breakpoints in the document, and "A B C D ..." are the individual words in the sentence. In a related approach, when a first candidate item in candidate item set 410 is a subset of a second candidate item in the same set, builder 346 discards the first candidate item.
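The greedy left-to-right logic can be sketched in a few lines. This is an illustrative variant, not the disclosed implementation: it extends a phrase while it remains a prefix of some master-list item, and, as an added assumption, records the longest extension that is itself a complete master-list item. The names build_prefixes and greedy_match are hypothetical, and matching is case-sensitive for simplicity.

```python
def build_prefixes(master_list):
    """Return every word-level prefix of every item in the master list."""
    prefixes = set()
    for item in master_list:
        words = item.split()
        for i in range(1, len(words) + 1):
            prefixes.add(" ".join(words[:i]))
    return prefixes


def greedy_match(sentence, master_list):
    """Greedy left-to-right, longest-match extraction of candidate items."""
    prefixes = build_prefixes(master_list)
    words = sentence.split()
    matches, i = [], 0
    while i < len(words):
        longest_end = None
        j = i + 1
        # Extend the phrase while it is still a prefix of some item.
        while j <= len(words) and " ".join(words[i:j]) in prefixes:
            if " ".join(words[i:j]) in master_list:
                longest_end = j  # longest complete item seen so far
            j += 1
        if longest_end is None:
            i += 1               # no item starts here; move to next word
        else:
            matches.append(" ".join(words[i:longest_end]))
            i = longest_end      # resume after the matched item
    return matches
```

With a master list containing "A", "B", and "A B", the sentence "A B" yields the single item "A B", which is the behavior the text describes. For a master list with millions of items, a trie would likely be a more practical prefix structure than a flat set.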
In some embodiments of the present invention, the number of times each candidate item in the ordered set 410 appears in all, or in the top portion (e.g. the first 100 kilobytes), of the document associated with set 410 is tracked. For example, if candidate item "A" in set 410 appears 12 times in the top portion of the document associated with set 410, an indication that item "A" appears 12 times in the document is recorded, and this indication is used in a weighting scheme designed to determine which candidate items are retained in the final set of ranked candidate items.
In some embodiments, the indication of the number of times an item appears in a document is upweighted when the item appears within a first threshold number of words of the associated document. For example, consider the case in which the first threshold is 15 words. Further, in this example, candidate item "A" appears exactly twice. The first occurrence of "A" is before the 15-word boundary, and the second occurrence of "A" is after the 15-word boundary. In the weighting scheme used in this example, items appearing within the first 15 words receive twice the weight. Thus, in the candidate item set 410 associated with the document, candidate item "A" is listed with an indication that the item appears (2*1+1) times, that is, 3 times, in the top portion of the document. Those skilled in the art will appreciate that more complex forms of the first threshold are possible. For example, the weight applied to a candidate item count can be a function of the position of the candidate item in the document. This function can be a linear function (or a non-linear function, or a piecewise linear function) whose maximum is at the top of the document and whose minimum is at the end of the document. Alternatively, weights can be applied in the form of baskets, in which there is a large weight at the beginning of the document (first basket), a lower weight in a second part of the document (second basket), a still lower weight in a third part of the document (third basket), and so on.
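The worked example above (a 15-word threshold with double weight for early occurrences) can be reproduced with a short sketch. It assumes single-word items and a pre-tokenized document; the function name weighted_count and its parameters are hypothetical.

```python
def weighted_count(words, item, threshold=15, early_weight=2):
    """Count occurrences of item, upweighting those in the first
    `threshold` words. Assumes single-word items for simplicity."""
    total = 0
    for position, word in enumerate(words):
        if word == item:
            # Occurrences before the word boundary count early_weight times.
            total += early_weight if position < threshold else 1
    return total
```

For a document in which "A" appears once in the first 15 words and once later, the function returns 2*1+1 = 3, matching the example. A basket scheme could be obtained by replacing the two-way test with a lookup from position ranges to weights.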
In embodiments in which (i) an indication of the number of times each candidate item appears in the associated document is tracked, and (ii) builder 346 discards a first candidate item because it is a subset of a second candidate item in the ordered candidate item set 410, the second candidate item is credited with the number of times builder 346 identified the first candidate item in the document.
In addition to builder 346, indexer 344 includes a redundancy filter 348. Filter 348 is designed to remove orthographic (spelling) variants and inflectional variants that may appear in a candidate item set. An orthographic variant of an item is an alternative correct spelling of the item. An inflectional variant of an item has an alternative suffix or accented form of the item. In some embodiments, orthographic variants and/or inflectional variants are stored in a variant list 360 (Fig. 3). The job of redundancy filter 348 is then to ensure that no pair of candidate items in candidate item set 410 appears in variant list 360. When a pair of candidate items in candidate item set 410 is in variant list 360, filter 348 discards one item of the pair from set 410. In some embodiments, the first item of the pair is effectively discarded from set 410 and the second item of the pair is retained. In some embodiments, however, the second item is modified so that it is combined with the discarded first item. For example, if A and B are inflectional or orthographic variants, one of them (say A) is discarded and the other (B) is retained. In addition, item B is rewritten as "A, B". This feature is advantageous because it preserves useful information about the underlying document that can be used by higher-level modules of the present invention (e.g. query refinement suggestion engine 324). Typically, where orthographic or inflectional variant candidate items have been merged in this way, engine 324 presents only the first item of the merged entry. For example, for the rewritten item "A, B", only "A" is included in the subset of candidate items presented in panel 140. Typically, the item that is discarded from a pair appearing in list 360 is the one that appears less frequently in the associated document.
In some embodiments, candidate items that differ only in the presence or absence of certain noise words (e.g. a, the, who, what, where) are folded together in the same manner as candidate items that are orthographic or inflectional variants. Likewise, in some embodiments, when two items in a given candidate item set differ only in the presence or absence of punctuation, the two items are folded together in the same manner as orthographic or inflectional variants. In some embodiments, each phrase in a candidate item set is converted to the same case (e.g. lowercase). An exception to this rule is that an item consisting of a word of six or fewer uppercase characters is not converted to lowercase, because such an item may be an acronym.
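One way to picture the folding of noise-word, punctuation, and case differences is a normalization key: two candidate items that normalize to the same key are treated as redundant. The sketch below is an assumption-laden illustration; the helper names normalize_case and fold_key are hypothetical, the noise word set is just the examples from the text, and the acronym rule is applied per word.

```python
import string

# Example noise words taken from the text; a real list would be longer.
NOISE_WORDS = {"a", "the", "who", "what", "where"}


def normalize_case(word):
    """Lowercase a word unless it looks like an acronym
    (all uppercase, six or fewer characters)."""
    return word if (word.isupper() and len(word) <= 6) else word.lower()


def fold_key(item):
    """Return a normalized key; items sharing a key are folded together."""
    # Remove punctuation, then normalize case word by word.
    stripped = item.translate(str.maketrans("", "", string.punctuation))
    words = [normalize_case(w) for w in stripped.split()]
    # Drop noise words so their presence or absence does not matter.
    return " ".join(w for w in words if w.lower() not in NOISE_WORDS)
```

For instance, "The Challenger Disaster" and "Challenger Disaster!" both map to the key "challenger disaster", while "NASA" keeps its uppercase form.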
In embodiments in which (i) an indication of the number of times each candidate item appears in the associated document is tracked, and (ii) filter 348 discards a first candidate item in a candidate item set because it is an orthographic or inflectional variant of a second candidate item in the set, the second candidate item is credited with the number of times builder 346 identified the first candidate item in the document. In other words, when two candidate items differ only in that one of them contains a word that is an inflectional or orthographic variant of the corresponding word in the other, one of the candidate items is discarded. An example of this situation arises with the candidate items "tow truck" and "tow trucks". In this example, the only difference between the two candidate items is "truck" in the first item versus "trucks" in the second.
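The crediting of counts when a variant is discarded can be sketched as follows. The sketch assumes that occurrence counts are held in a dictionary and that the variant list is a sequence of item pairs; the function name fold_variants is hypothetical. Keeping the more frequent item follows the statement that the discarded item is typically the less frequent one.

```python
def fold_variants(counts, variant_pairs):
    """Merge variant pairs, crediting the kept item with the
    occurrence count of the discarded one."""
    folded = dict(counts)  # do not mutate the caller's dictionary
    for a, b in variant_pairs:
        if a in folded and b in folded:
            # Keep the item that appears more often in the document.
            keep, drop = (a, b) if folded[a] >= folded[b] else (b, a)
            folded[keep] += folded.pop(drop)
    return folded
```

With counts of 3 for "tow truck" and 1 for "tow trucks", the result keeps "tow truck" with a count of 4.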
Many details of document indexer 344 have been disclosed. At this stage it is instructive to refer to the flowchart of Fig. 5, which discloses the steps taken by some embodiments of indexer 344. After completing all or part of its other indexing tasks (for example, conventional indexing of the words in documents found by a web crawler), indexer 344 passes control to builder 346, which selects an indexed document (Fig. 5, step 502).
In step 504, an item in the document is compared with master list 342 of candidate items. If the item is in master list 342 (506, yes), the item is added to the candidate item set 410 associated with the document (510). Note that step 504 can involve a more complex matching scheme, such as the "greedy left-to-right algorithm" described above.
In some embodiments, the documents being compared are web pages. Some determination must therefore be made about which words are valid words suitable for comparison with master list 342. In one approach, a document that is a web page is parsed to find the text to be used for phrase extraction. In one embodiment, phrase matching in step 504 is performed using all "visible" text plus the meta page description, and such phrases do not include HTML code, JavaScript, and the like. To obtain valid phrases, "phrase boundaries" in the web page (e.g. table tags) are preserved, so that expressions extracted from the document for comparison with list 342 do not span a phrase boundary. Other examples of phrase boundaries used in some embodiments of the present invention include, but are not limited to, punctuation such as "." and "?", blank lines, and the like.
In some embodiments of the present invention, master list 342 is a very large collection of items gathered from a number of different sources. Therefore, additional filtering can be performed in step 504 to ensure that only informative candidate items are selected for inclusion in the candidate item set. In some embodiments, items in the document are processed before they are compared with the items in master list 342. For example, in some embodiments punctuation is removed from an item before it is compared with list 342. In some embodiments, punctuation characters are replaced with spaces before the comparison. In some embodiments, a list 354 of noise items is stored in memory 314. Representative noise items include, but are not limited to, words such as "a", "the", "who", "what", and "where". Thus, in embodiments in which noise item list 354 is stored in memory 314, comparison step 504 first determines whether the item to be compared with master list 342 is in noise item list 354. If it is, the item is ignored and is not compared with list 342. In some embodiments, only items containing at least a minimum threshold number of characters are compared in step 504. For example, in some embodiments only items containing at least four characters are compared in step 504.
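The pre-comparison filtering can be sketched as a single helper that returns either a cleaned item or nothing. The sketch follows the embodiment in which punctuation is replaced with spaces; the name prepare_item, the noise item set, and the use of a regular expression are assumptions made for illustration.

```python
import re

# Example noise items from the text, standing in for noise item list 354.
NOISE_ITEMS = {"a", "the", "who", "what", "where"}


def prepare_item(raw, min_chars=4):
    """Clean an item before comparing it with the master list.

    Returns the cleaned item, or None if the item should be ignored.
    """
    # Replace punctuation characters with spaces, then collapse whitespace.
    cleaned = " ".join(re.sub(r"[^\w\s]", " ", raw).split())
    if cleaned.lower() in NOISE_ITEMS:
        return None          # noise items are never compared
    if len(cleaned) < min_chars:
        return None          # too short to be informative
    return cleaned
```

Only items for which prepare_item returns a value would go on to the comparison of step 504.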
Regardless of the outcome of decision 506, a decision 508 is made as to whether builder 346 should compare any further items in the document with master list 342. A number of different conditions that can be used to determine the outcome of decision 508 have been disclosed (for example, a maximum item cutoff, a maximum unique item cutoff, a maximum number of candidate items already in set 410, and so on).
A number of optional steps follow in the flowchart of Fig. 5. In optional step 512, redundant items in the candidate item set associated with the document are folded together. In optional step 514, the documents in index 352 are classified (for example, into a first class and a second class).
There are many different ways in which classification step 514 can be carried out, and all such ways are included within the scope of the present invention. For example, in some embodiments each document 402 is classified into a first class or a second class. In a preferred embodiment, the first class is a family-friendly class and the second class is a non-family-friendly class. When a document 402 contains sexually explicit, offensive, or violent language, it is assigned to the second class. Otherwise, it is assigned to the first class. In some embodiments, a classifier module 350 (Fig. 3) is used to perform such classification. Typically, classifier module 350 works by determining whether a document is likely to contain sexually explicit, offensive, or violent content. If so, the document is designated as non-family-friendly. This designation is stored in feature values 408 (Fig. 4), which correspond to the document associated with the set 410 being classified.
At this stage, there are typically a large number of candidate items in the candidate item set. For example, in an embodiment in which up to 1000 candidate items can be added to a candidate item set, the set contains 1000 items at this stage. Regardless of how many candidate items each candidate item set contains, they have not yet been ranked. Therefore, in step 516 the candidate items are ranked, and the N highest-ranked candidate items are allowed to remain in the candidate item set while all other candidate items are deleted, so that only the N (e.g. 20) most representative items remain in the ranked set. The net effect of step 516 is thus to produce a set of ranked candidate items from the candidate item set, in which only the top items (e.g. the top 20) are retained.
The scoring function used to rank the candidate items can take one or more of the following parameters: the number of times each item appears in the document, whether the item appears in a predetermined early portion of the document, the first position of the item in the document, and the number of characters in the item. Based on these parameters, each candidate item is assigned a rank, and only the N highest-ranked items are retained in the set of ranked candidate items. The others are deleted from the set. Limiting the number of candidate items associated with each document helps prevent the document index from growing too large, and reduces the number of items that must be considered at query time, when processing speed is critical. The set of ranked candidate items for a document can be associated with the document by storing, in the document's entry in the index (see 410, Fig. 4), a set of strings (optionally compressed) or index values representing the candidate items, where each index value points to an item in master list 342 of candidate items. Associated values, such as the item score used in the ranking process and/or the first position of the item in the document, can be stored with each candidate item (or pointer to a candidate item) in the document's entry in document index 352. In a preferred embodiment, however, these additional values are not stored in document index 352.
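A scoring function built from the four parameters named above, followed by truncation to the top N items, might look like the sketch below. The text does not give a formula, so the particular combination and coefficients are invented for illustration, as are the names ItemStats, score, and top_n.

```python
from dataclasses import dataclass


@dataclass
class ItemStats:
    item: str
    count: int            # occurrences in the document
    first_position: int   # word offset of the first occurrence


def score(stats, early_cutoff=15):
    """Combine the four parameters into one score.
    The coefficients are illustrative assumptions, not disclosed values."""
    s = float(stats.count)
    if stats.first_position < early_cutoff:
        s += 1.0                              # appears in the early portion
    s += 1.0 / (1 + stats.first_position)     # earlier first position is better
    s += 0.1 * len(stats.item)                # longer items score slightly higher
    return s


def top_n(all_stats, n=20):
    """Rank the items by score and keep only the n highest."""
    ranked = sorted(all_stats, key=score, reverse=True)
    return [s.item for s in ranked[:n]]
```

The returned list is already in rank order, which matters later because some weighting functions use an item's position within its set.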
The process by which sets 410 of ranked candidate items are associated with the documents in document index 352 has been described. Attention now turns to Fig. 6, which illustrates the manner in which, according to one embodiment of the present invention, such sets 410 are used to build the subset of candidate items for presentation. In step 602, query processor 320 receives a query. In step 604, the query is processed, thereby obtaining an initial set of ranked documents from document index 352. It will be appreciated that in some embodiments the initial set of ranked documents may include only index entries for the documents, and need not include the documents themselves. Such an index entry, however, includes the uniform resource locator (URL) of each document in the initial set. Each document can therefore be retrieved from the Internet (or some other form of network) if subsequently requested by the user. In some embodiments, the initial set of documents is stored as search results 340 in memory 314 of server 300 (Fig. 3). Referring again to Fig. 6, search results 340 are used to create (606) a list of suggested query refinements (the subset of candidate items).
The method of creating the list of suggested query refinements (the subset of candidate items) depends on whether the query is a family-friendly search. In optional step 608, for each top-ranked document in search results 340 (the initial set of ranked documents), for example the first 50 documents, the classification of the document is determined. When a threshold percentage of the top-ranked documents in search results 340 belong to the first class (the family-friendly class), every set 410 of candidate items that is associated with a top-ranked document not belonging to the first class is excluded from all subsequent steps of Fig. 6. In some embodiments, classes other than the family-friendly class are used to classify documents during indexing (Fig. 5). In such embodiments, these classes can be used in step 608 to determine which sets of ranked candidate items are used to build the subset of candidate items. In an exemplary embodiment, only the classifications of the M top-ranked documents (e.g. the 10 top-ranked documents of search results 340) are used to make the determination in step 608. For example, if at least 8 of the 10 top-ranked documents are classified as family-friendly, candidate items from non-family-friendly documents are excluded from the sets of ranked candidate items used to build the list of suggested query refinements.
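The "at least 8 of the 10 top-ranked documents" rule can be sketched as a filter over the candidate item sets. The dictionary layout of a document record and the names usable_sets, m, and min_friendly are assumptions made for the example.

```python
def usable_sets(top_docs, m=10, min_friendly=8):
    """Return the candidate item sets to use for building suggestions.

    top_docs is assumed to be a rank-ordered list of dicts with a
    'family_friendly' flag and an 'items' list (hypothetical layout).
    """
    head = top_docs[:m]
    friendly = sum(1 for doc in head if doc["family_friendly"])
    if friendly >= min_friendly:
        # Treat the query as family-friendly: drop sets from other documents.
        return [doc["items"] for doc in top_docs if doc["family_friendly"]]
    return [doc["items"] for doc in top_docs]
```

When fewer than min_friendly of the first m documents are family-friendly, no sets are excluded.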
In step 610, a subset is selected of the candidate items that are in one or more of the respective sets of ranked candidate items associated with the documents in search results 340. In one embodiment, this selection function includes applying a weighting function to each candidate item in each respective set 410 of ranked candidate items that is associated with a top-ranked document in the initial set of ranked documents (search results 340). Each top-ranked document in the initial set of ranked documents has a rank that is numerically less than a threshold rank. In some embodiments, the top-ranked documents are the T top-ranked documents, where T is a predetermined number such as 50 (preferably in the range of 5 to 200, and most preferably in the range of 20 to 100). Only the top-ranked documents are considered in step 610, in order to maximize the chance that relevant items are collected into the subset of candidate items presented to the user. In various embodiments, only the 5, 10, 15, 20, 50, or 100 top-ranked documents are considered. The candidate items that receive the highest weights are included in the subset of candidate items. In some embodiments, the number of items in the subset of candidate items is limited to a number less than 25.
In some embodiments, when the initial set of search results 340 contains fewer than a cutoff number of documents, no subset of candidate items is built and no subset of candidate items is presented to the user. For example, in one embodiment, if there are fewer than 35 documents in the initial set of search results 340, no subset of candidate items is built.
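Putting the top-T restriction, the size limit, and the document cutoff together gives a sketch like the one below. As a stand-in weighting function it counts how many top-ranked documents have the item in their set, which is one of the weighting options the text goes on to describe. The name select_subset and its parameter names are hypothetical.

```python
from collections import Counter


def select_subset(ranked_sets, top_t=50, max_items=24, min_docs=35):
    """Select suggested items from the candidate sets of the top documents.

    ranked_sets: one list of ranked candidate items per document,
    in document rank order (an assumed layout for this sketch).
    """
    if len(ranked_sets) < min_docs:
        return []                      # too few results: no suggestions
    weights = Counter()
    for items in ranked_sets[:top_t]:  # only the T top-ranked documents
        for item in dict.fromkeys(items):  # count each item once per set
            weights[item] += 1
    # Keep the highest-weighted items, limited to fewer than 25.
    return [item for item, _ in weights.most_common(max_items)]
```

Any of the other weighting functions could be substituted for the simple count without changing the surrounding selection logic.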
The present invention provides a number of different weighting functions for scoring the candidate items in each respective set 410 associated with the top-ranked documents of search results 340. These different weighting functions are used in various embodiments of selection function 326 of engine 324 (Fig. 3).
In some embodiments, the weight applied to a candidate item by function 324 (the weighting function) is determined according to the number of sets of ranked candidate items that both (i) include the candidate item and (ii) are respectively associated with a top-ranked document. For example, consider the case in which there are 50 top-ranked documents and the candidate item "Space Shuttle" appears in three of the sets of ranked candidate items respectively associated with a top-ranked document. In this case, a weight of 3 is applied to the candidate item "Space Shuttle".
In some embodiments, the weight applied to a candidate item by selection function 326 is determined according to a function (e.g. the average) of the position of the candidate item in those sets of ranked candidate items that both (i) include the candidate item and (ii) are respectively associated with a top-ranked document. Some embodiments consider both the sets that include the item and the sets that do not. A set that does not include the item is assigned a numerical value for averaging that indicates that the item is not in the set. This weighting factor exploits the fact that each set of ranked candidate items is in effect a ranked, ordered list. Thus, if the candidate item "Space Shuttle" appears at the top of the ranked list in several of the candidate item sets respectively associated with a top-ranked document, it receives a relatively high weight in this weighting scheme. Conversely, if "Space Shuttle" is among the last items in each set of ranked candidate items in which it appears, it receives a relatively low weight in this weighting scheme.
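The average-position weighting, including the variant that assigns a fixed value to sets lacking the item, can be sketched as follows. Positions are zero-based and a lower average means a better weight; the name average_position and the default penalty of 20 are assumptions for the example.

```python
def average_position(item, ranked_sets, missing_position=20):
    """Average position of item across the ranked candidate sets.

    Sets that do not contain the item contribute missing_position,
    an assumed stand-in value indicating absence.
    """
    positions = [
        items.index(item) if item in items else missing_position
        for items in ranked_sets
    ]
    return sum(positions) / len(positions)
```

An item that sits near the top of many sets gets a low average and would therefore be favored over an item that is absent from most sets or appears near the end of them.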
In some embodiments, the weight applied to a candidate term by function 324 is determined by whether a term in the received query is in the candidate term. For example, if the query term is "Shuttle" and the candidate term is "Space Shuttle", the candidate term is given full weight; otherwise, the candidate term is given no weight.
In some embodiments, the weight applied to a candidate term by function 324 (the weighting function) is determined by the number of characters in the candidate term. For example, the candidate term "Space Shuttle" receives more weight than the candidate term "Dogs".
In some embodiments, the weight applied to a candidate term by function 324 is determined by a function (e.g., the average) of the ranks of those top-ranked documents that are associated with sets of ranked candidate terms that include the candidate term. This weighting scheme exploits the ranking that search engine 322 applies to the initial set of search results. Under this scheme, candidate terms from sets 410 associated with higher-ranked documents are given priority over candidate terms associated with lower-ranked documents. For example, consider the case in which the candidate term "Space Shuttle" appears in the sets of ranked candidate terms associated with documents 2, 4, and 6 among the top-ranked documents in the initial group of ranked documents. Under this weighting scheme, "Space Shuttle" receives a weight that is a function of the value 4. Now suppose instead that "Space Shuttle" appears in the sets of ranked candidate terms associated with documents 10, 20, and 30 among the top-ranked documents in the initial group of ranked documents. Under this weighting scheme, "Space Shuttle" then receives a weight that is a function of the value 20. The value 4 produces a better weight than the value 20 (that is, it up-weights the candidate term more). In some embodiments, this weighting function also takes into account the sets that do not include the candidate term. Such sets are assigned a numerical value that is used in the average.
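The two averages in this example (4 and 20) can be reproduced with a short sketch. The dictionary keyed by document rank and the function name `result_position` are hypothetical choices for illustration, and the other terms in each set are placeholders.

```python
def result_position(candidate, sets_by_rank):
    """Average rank of the documents whose candidate-term set contains `candidate`.

    `sets_by_rank` maps a document's rank in the initial results to its set
    of candidate terms (a hypothetical layout for this sketch).
    """
    ranks = [rank for rank, terms in sets_by_rank.items() if candidate in terms]
    return sum(ranks) / len(ranks) if ranks else None


high = {2: {"Space Shuttle"}, 4: {"Space Shuttle", "NASA"}, 6: {"Space Shuttle"}, 7: {"orbit"}}
low = {10: {"Space Shuttle"}, 20: {"Space Shuttle"}, 30: {"Space Shuttle"}, 31: {"orbit"}}

print(result_position("Space Shuttle", high))  # 4.0
print(result_position("Space Shuttle", low))   # 20.0
```

How the average rank is converted to a weight is not specified in the text; it only states that the lower value (4) yields the better weight.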
In some embodiments, the rank of the document in which a term first appears as a candidate term is used by the weighting function.
The specific weighting factors used by the various embodiments of selection function 326 have been outlined above in order to introduce them. In preferred embodiments, however, several of these factors are combined to produce the desired result. Some preferred embodiments of selection function 326 follow.
In some embodiments, the weight applied to a candidate term by function 324 is determined by any combination (or any weighted combination) of TermCount, TermPosition, ResultPosition, TermLength, and QueryInclusion, where:
TermCount is the number of sets of ranked candidate terms that both (i) include the candidate term and (ii) are respectively associated with a top-ranked document,
TermPosition is a function (e.g., the average) of the position of the candidate term in those sets of ranked candidate terms that both (i) include the candidate term and (ii) are respectively associated with a top-ranked document,
ResultPosition is a function (e.g., the average) of the ranks of those top-ranked documents that are associated with sets of ranked candidate terms that include the candidate term,
TermLength is the number of characters in the candidate term (the complexity of the candidate term), and
QueryInclusion is a value that indicates whether a term in the received query is in the candidate term.
As used herein, applying QueryInclusion (e.g., when QueryInclusion is a nonzero value such as 1) means that a candidate term is up-weighted when a term in the received query is in the candidate term. Not applying QueryInclusion (e.g., when QueryInclusion is set equal to 0) means that a candidate term is not up-weighted when no term in the received query is in the candidate term. In some embodiments, noise terms (e.g., a, the, who, what, where, and so on) are not considered when evaluating a candidate term. Thus, if the query includes the noise word "for" and the candidate term also includes the word "for", no credit is given to the candidate term and QueryInclusion is not up-weighted.
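The noise-word rule can be illustrated with a small sketch. It assumes whitespace tokenization, case-insensitive whole-word matching, and a noise list built only from the words mentioned in the passage; the names `NOISE_TERMS` and `query_inclusion` are hypothetical.

```python
# Noise words taken from the examples in the text; a real list would be longer.
NOISE_TERMS = {"a", "the", "who", "what", "where", "for"}


def query_inclusion(query, candidate):
    """Return 1 if a non-noise query word appears in the candidate term, else 0."""
    query_words = {w for w in query.lower().split() if w not in NOISE_TERMS}
    candidate_words = set(candidate.lower().split())
    return 1 if query_words & candidate_words else 0


print(query_inclusion("Shuttle", "Space Shuttle"))      # 1
print(query_inclusion("tickets for sale", "for rent"))  # 0: only the noise word "for" is shared
```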
In some embodiments, the weight applied to a candidate term by function 324 is determined according to the formula:
TermCount + TermPosition + ResultPosition + TermLength + QueryInclusion
where TermCount, TermPosition, ResultPosition, TermLength, and QueryInclusion are as defined above. In some embodiments, each of TermCount, TermPosition, ResultPosition, TermLength, and QueryInclusion is independently weighted.
In some embodiments, the weight applied to a candidate term by function 324 is determined according to the formula:
(TermCount * w_1) +
(TermPosition * (w_2 + (RefinementDepth * w_2'))) +
(ResultPosition * w_3) +
(TermLength * (w_4 + (RefinementDepth * w_4'))) +
(QueryInclusion * (w_5 + (RefinementDepth * w_5')))
where w_1, w_2, w_3, w_4, w_5, w_2', w_4', and w_5' are independent weights, and RefinementDepth is the number of times that processing has been performed on the received query. In other words, RefinementDepth is the number of times that steps 602 through 612 have been repeated through execution of optional step 614, in which the user adds a term from the subset of candidate terms to the original search query. In one embodiment,
w_1 = 100
w_2 = 15
w_2' = 15
w_3 = 1
w_4 = 1
w_4' = 0
w_5 = 100, and
w_5' = 50.
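The combined formula and the weights listed for this embodiment can be worked through in a short sketch. The `Factors` container, the key names in `WEIGHTS`, and the sample factor values are hypothetical; the text does not state how the raw factor values are scaled before they enter the formula, so the inputs here are simply taken as given numbers.

```python
from dataclasses import dataclass


@dataclass
class Factors:
    term_count: float
    term_position: float
    result_position: float
    term_length: float
    query_inclusion: float


# Weights from the embodiment in the text; "p" marks the primed weights.
WEIGHTS = {"w1": 100, "w2": 15, "w2p": 15, "w3": 1, "w4": 1, "w4p": 0, "w5": 100, "w5p": 50}


def combined_weight(f, refinement_depth, w=WEIGHTS):
    """Evaluate the five-factor formula for one candidate term."""
    return (
        f.term_count * w["w1"]
        + f.term_position * (w["w2"] + refinement_depth * w["w2p"])
        + f.result_position * w["w3"]
        + f.term_length * (w["w4"] + refinement_depth * w["w4p"])
        + f.query_inclusion * (w["w5"] + refinement_depth * w["w5p"])
    )


# Illustrative inputs: 3 sets, position 2, result position 4, 13 characters, query term present.
f = Factors(term_count=3, term_position=2, result_position=4, term_length=13, query_inclusion=1)
print(combined_weight(f, refinement_depth=0))  # 300 + 30 + 4 + 13 + 100 = 447
print(combined_weight(f, refinement_depth=1))  # 300 + 60 + 4 + 13 + 150 = 527
```

With these weights, each additional refinement step increases the influence of TermPosition and QueryInclusion, while TermLength is unaffected because w_4' is 0.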
In some embodiments of the present invention, selection function 610 removes certain candidate terms from the sets of ranked candidate terms. For example, in some embodiments, candidate terms in the sets of ranked candidate terms that differ only by certain prefixes or suffixes are folded together. For instance, in some embodiments, a list of prefixes and a list of suffixes are stored in memory 314. If two candidate terms differ only in that one of them contains a word that differs from the corresponding word in the other candidate term by a prefix at the beginning of the word or a suffix at the end of the word, the two candidate terms are folded together. In some embodiments, there are three classes of prefixes (and three analogous classes of suffixes). If a candidate term includes a prefix that belongs to the first class, the term is discarded. If a candidate term includes a prefix that belongs to the second class, the prefix is stripped. If a candidate term includes a prefix that belongs to the third class, an evaluation is performed. In this evaluation, each set of ranked candidate terms associated with a top-ranked document is searched for an instance of the same term without the prefix. If no such instance is found, the prefix is not stripped. If such an instance is found, the prefix is stripped. Such prefix (and suffix) processing is useful in many situations. For example, consider the case in which the candidate term is "the cars". Ordinarily, "the" is regarded as a prefix that should be stripped. However, the candidate term may refer to the well-known musical group that is commonly called "The Cars". Therefore, a search is performed to determine whether the term "cars", without the prefix "the", is found in any of the other sets of ranked candidate terms associated with top-ranked documents. If no such instance appears, the prefix is not stripped. Note that, in this example, a prefix as used here can be a preceding affix (e.g., un-, non-, and so on) or a preceding word or phrase (e.g., the, of, to go, and so on).
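The three-class prefix handling can be sketched as follows. The assignment of particular words to each class is an assumption made purely for illustration (the text names "the" only as a prefix that needs the extra check), and the sketch covers whole-word prefixes only, not affixes such as un- or non-. The function name `fold_prefix` is hypothetical.

```python
# Hypothetical class membership, chosen only to exercise each branch.
DISCARD_PREFIXES = {"of"}  # class 1: discard the whole candidate term
STRIP_PREFIXES = {"a"}     # class 2: always strip the prefix
CHECK_PREFIXES = {"the"}   # class 3: strip only if the bare term occurs elsewhere


def fold_prefix(candidate, other_sets):
    """Return the folded term, the unchanged term, or None if it is discarded."""
    words = candidate.split()
    if len(words) < 2:
        return candidate
    prefix, rest = words[0].lower(), " ".join(words[1:])
    if prefix in DISCARD_PREFIXES:
        return None
    if prefix in STRIP_PREFIXES:
        return rest
    if prefix in CHECK_PREFIXES:
        # Search the other ranked candidate-term sets for the term without the prefix.
        found = any(rest.lower() == t.lower() for terms in other_sets for t in terms)
        return rest if found else candidate
    return candidate


print(fold_prefix("the cars", [["cars", "engines"]]))         # cars
print(fold_prefix("the cars", [["new wave"], ["rock bands"]]))  # the cars
```

In the second call no other set contains "cars" on its own, so the prefix is kept, which matches the musical-group example in the passage.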
In step 612, the subset of candidate terms is presented to the user. In optional step 614, the user selects a term 136 (Fig. 2) from the subset of candidate terms, and the processing (step 604), selecting (step 606), and presenting (step 612) are repeated with a revised query, where the revised query includes the original (received) query and the candidate term 136 selected from the subset of candidate terms displayed in panel 140 (Fig. 2). As noted above, in some embodiments the user can select a term 136 in order to add it to the previously submitted query, to replace the previously submitted query, or to use it as an exclusion term with the previously submitted query.
All references cited herein are incorporated by reference in their entirety and for all purposes, to the same extent as if each individual publication, patent, or patent application were specifically and individually indicated to be incorporated by reference in its entirety for all purposes.
The present invention may be implemented as a computer program product that comprises a computer program mechanism embedded in a computer-readable storage medium. For example, the computer program product may contain the program modules shown in Fig. 3. These program modules may be stored on a CD-ROM, a magnetic disk storage product, or any other computer-readable data or program storage product. The software modules in the computer program product may also be distributed electronically, via the Internet or otherwise, by transmission of a computer data signal (in which the software modules are embedded) on a carrier wave.
As will be apparent to those skilled in the art, many modifications and variations of the present invention can be made without departing from its spirit and scope. The specific embodiments described herein are offered by way of example only. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, and thereby to enable those skilled in the art to best utilize the invention and its various embodiments, with such modifications as are suited to the particular use contemplated. The invention is limited only by the terms of the appended claims, along with the full scope of equivalents to which those claims are entitled.

Claims (56)

1. A method for refining a received query when searching a document index, comprising:
processing the received query so as to generate an initial group of ranked documents corresponding to the received query, wherein each document in all or a portion of the documents in the initial group of ranked documents is associated with a respective precomputed set of ranked candidate terms, and the precomputed set of ranked candidate terms is included in the index in the entry for the document;
selecting, in accordance with a selection function, a subset of the candidate terms that are in one or more of the respective sets of ranked candidate terms; and
presenting, in response to the received query, the initial group of ranked documents and the subset of candidate terms.
2. The method of claim 1, wherein, for all or a portion of the top-ranked documents in the initial group of ranked documents, the respective set of ranked candidate terms associated with the document is identified by:
(A) comparing a term in the document to a master list of candidate terms, wherein, when the term is in the master list of candidate terms, the term is added to a set of candidate terms;
(B) repeating the comparing a plurality of times; and
(C) ranking the candidate terms in the set of candidate terms, thereby forming the respective set of ranked candidate terms.
3. The method of claim 2, wherein, for all or a portion of each top-ranked document in the initial group of ranked documents, a classification of the document is included in the respective set of ranked candidate terms associated with the document, and wherein the classification comprises a first classification or a second classification.
4. The method of claim 2, wherein, for all or a portion of the top-ranked documents in the initial group of ranked documents, the number of times a candidate term is identified by instances of the comparing (A) is used by the ranking (C) to rank the candidate term in the set of ranked candidate terms.
5. The method of claim 4, wherein the ranking (C) also uses the first position of the candidate term in the respective associated document to rank the candidate term.
6. The method of claim 4, wherein identifying the respective set of ranked candidate terms further comprises:
(i) discarding a first candidate term when the first candidate term is a subset of a second candidate term in the set of candidate terms;
(ii) crediting the second candidate term with the number of times the first candidate term was identified, by instances of the comparing (A), in the document associated with the set of candidate terms; and
(iii) repeating the discarding (i) and the crediting (ii) until there is no first candidate term that is a subset of a second candidate term in the set of candidate terms.
7. The method of claim 4, wherein identifying the respective set of ranked candidate terms further comprises:
(i) discarding a first candidate term when the first candidate term is an orthographic or inflectional variant of a second candidate term in the set of candidate terms;
(ii) crediting the second candidate term with the number of times the first candidate term was identified, by instances of the comparing (A), in the document associated with the set of candidate terms; and
(iii) repeating the discarding (i) and the crediting (ii) until there is no first candidate term that is an orthographic or inflectional variant of a second candidate term in the set of candidate terms.
8. The method of claim 7, wherein the crediting (ii) of the second candidate term further comprises:
rewriting the second candidate term as a combined term that includes the first candidate term and the second candidate term, wherein whichever of the first candidate term and the second candidate term was identified most often by instances of the comparing (A) appears at the beginning of the combined term.
9. The method of claim 8, wherein only the term that appears at the beginning of the combined term is used in the presenting.
10. The method of claim 2, wherein, for all or a portion of the top-ranked documents in the initial group of ranked documents, the respective set of ranked candidate terms associated with the document includes, for each candidate term in the respective set, the first position of the candidate term in the document.
11. The method of claim 1, wherein each respective set of ranked candidate terms is identified at a time prior to the processing of the received query.
12. The method of claim 2, wherein the identifying further comprises:
(i) terminating the comparing (A) and the repeating (B) when the number of times terms in the document have been compared to the master list of candidate terms in the comparing (A) reaches a threshold number.
13. The method of claim 2, wherein the master list of candidate terms is optimized for a specific language.
14. The method of claim 13, wherein the specific language is English, Spanish, French, German, Portuguese, Italian, Russian, Chinese, or Japanese.
15. The method of claim 13, wherein all or a portion of the top-ranked documents in the initial group of ranked documents are in the same language for which the master list of candidate terms has been optimized.
16. The method of claim 2, wherein each term in the master list of candidate terms is a word or a phrase.
17. The method of claim 2, wherein the master list of candidate terms comprises more than 1,000,000 terms.
18. The method of claim 1, further comprising repeating the processing, selecting, and presenting using a revised query, wherein the revised query includes the received query and a candidate term from the subset of candidate terms.
19. The method of claim 1, wherein the selection function comprises:
(i) applying a weighting function to each candidate term in each respective set of ranked candidate terms that is associated with a top-ranked document in the initial group of ranked documents, wherein each top-ranked document in the initial group of ranked documents has a rank that is numerically less than a threshold rank; and
(ii) selecting, for the subset of candidate terms, those candidate terms that receive the highest weights.
20. The method of claim 19, wherein the weight applied to a candidate term by the weighting function is determined by the number of sets of ranked candidate terms that both (i) include the candidate term and (ii) are respectively associated with a top-ranked document.
21. The method of claim 19, wherein the weight applied to a candidate term by the weighting function is determined by the average position of the candidate term in those sets of ranked candidate terms that both (i) include the candidate term and (ii) are respectively associated with a top-ranked document.
22. The method of claim 19, wherein the weight applied to a candidate term by the weighting function is determined by whether a term in the received query is in the candidate term.
23. The method of claim 19, wherein the weight applied to a candidate term by the weighting function is determined by the number of characters in the candidate term.
24. The method of claim 19, wherein the weight applied to a candidate term by the weighting function is determined by the average rank of those top-ranked documents that are associated with sets of ranked candidate terms that include the candidate term.
25. The method of claim 19, wherein the weight applied to a candidate term by the weighting function is determined by any combination of TermCount, TermPosition, ResultPosition, TermLength, and QueryInclusion, wherein
TermCount is the number of sets of ranked candidate terms that both (i) include the candidate term and (ii) are respectively associated with a top-ranked document,
TermPosition is a function of the rank position of the candidate term in those sets of ranked candidate terms that both (i) include the candidate term and (ii) are respectively associated with a top-ranked document,
ResultPosition is a function of the ranks of those top-ranked documents that are associated with sets of ranked candidate terms that include the candidate term,
TermLength is the number of characters in the candidate term, and
QueryInclusion is a value that indicates whether a term in the received query is in the candidate term.
26. The method of claim 25, wherein the weight applied to a candidate term by the weighting function is determined according to the formula:
TermCount + TermPosition + ResultPosition + TermLength + QueryInclusion.
27. The method of claim 26, wherein each of TermCount, TermPosition, ResultPosition, TermLength, and QueryInclusion is independently weighted.
28. The method of claim 25, further comprising optionally repeating the processing, selecting, and presenting using a revised query, wherein the revised query includes the received query and a candidate term from the subset of candidate terms.
29. The method of claim 28, wherein the weight applied to a candidate term by the weighting function is determined according to the formula:
(TermCount * w_1) +
(TermPosition * (w_2 + (RefinementDepth * w_2'))) +
(ResultPosition * w_3) +
(TermLength * (w_4 + (RefinementDepth * w_4'))) +
(QueryInclusion * (w_5 + (RefinementDepth * w_5')))
wherein w_1, w_2, w_3, w_4, w_5, w_2', w_4', and w_5' are independent weights, and RefinementDepth is the number of times the processing has been performed on the received query.
30. The method of claim 3, wherein the selection function comprises:
determining, for each respective set of ranked candidate terms that is associated with a top-ranked document in the initial group of ranked documents, the classification of the associated top-ranked document; and
when a threshold percentage of the associated top-ranked documents evaluated in the determining belong to the first classification, not using any set of candidate terms that is associated with a document belonging to the second classification to form the subset of candidate terms.
31. a device that is used for the inquiry that refinement receives when the searching documents index, this device comprises:
Be used to handle the described inquiry that receives, so that generate device corresponding to the initial group of ranked documents of the inquiry that receives, wherein each document in all or part of document in described initial group of ranked documents all is associated with the set of separately the classified candidate item that precomputes respectively, and the set of the described classified candidate item that precomputes is included in the described index in the clauses and subclauses at described document;
Be used for according to selection function, select to be in the device of the candidate item subclass in one or more set of set of described candidate item separately; And
Be used for presenting the device of the subclass of initial group of ranked documents and described candidate item in response to the inquiry that receives.
32. The apparatus of claim 31, further comprising, for all or a portion of the top-ranked documents in the initial group of ranked documents, means for identifying the respective set of ranked candidate terms associated with the document, the means for identifying further comprising:
(A) means for comparing a term in the document to a master list of candidate terms, wherein, when the term is in the master list of candidate terms, the term is added as a candidate term to the respective set of ranked candidate terms associated with the document; and
(B) means for re-executing the means for comparing until a maximum number of terms in the document has been considered.
33. The apparatus of claim 32, wherein, for all or a portion of the top-ranked documents in the initial group of ranked documents, a classification of the candidate terms in the respective set of ranked candidate terms associated with the document is included in the respective set of ranked candidate terms associated with the document, and wherein the classification comprises a first classification or a second classification.
34. The apparatus of claim 32, wherein, for all or a portion of the top-ranked documents in the initial group of ranked documents, the number of times a candidate term is identified by instances of the means for comparing (A) is included in the respective set of ranked candidate terms associated with the document.
35. The apparatus of claim 34, wherein, when the candidate term is identified within a first threshold number of words in the document, the number of times the candidate term is identified by instances of the means for comparing (A) is up-weighted.
36. The apparatus of claim 34, wherein the means for identifying the respective set of ranked candidate terms further comprises:
(C) means for discarding a first candidate term when the first candidate term is a subset of a second candidate term in the set of candidate terms;
(D) means for crediting the second candidate term with the number of times the first candidate term was identified, by instances of the means for comparing (A), in the document associated with the set of candidate terms; and
(E) means for repeating the means for discarding (C) and the means for crediting (D) until there is no first candidate term that is a subset of a second candidate term in the set of candidate terms.
37. The apparatus of claim 34, wherein the means for identifying the respective set of ranked candidate terms further comprises:
(C) means for discarding a first candidate term when the first candidate term is an orthographic or inflectional variant of a second candidate term in the set of candidate terms;
(D) means for crediting the second candidate term with the number of times the first candidate term was identified, by instances of the means for comparing (A), in the document associated with the set of candidate terms; and
(E) means for repeating the means for discarding (C) and the means for crediting (D) until there is no first candidate term that is an orthographic or inflectional variant of a second candidate term in the set of candidate terms.
38. The apparatus of claim 37, wherein the means for crediting (D) the second candidate term further comprises:
means for rewriting the second candidate term as a combined term that includes the first candidate term and the second candidate term, wherein whichever of the first candidate term and the second candidate term was identified most often by instances of the means for comparing (A) appears at the beginning of the combined term.
39. The apparatus of claim 38, wherein only the term that appears at the beginning of the combined term is used by the means for presenting.
40. The apparatus of claim 32, wherein, for all or a portion of the top-ranked documents in the initial group of ranked documents, the respective set of ranked candidate terms associated with the document includes, for each candidate term in the respective set, the average position of the candidate term in the document.
41. The apparatus of claim 40, wherein the average position of the candidate term in the document is determined from the average of the positions of each instance of the candidate term identified during instances of the means for comparing (A).
42. The apparatus of claim 32, wherein the means for identifying the respective set of ranked candidate terms associated with the document further comprises:
(C) means for terminating the means for comparing (A) and the means for re-executing (B) when the number of times terms in the document have been compared to the master list of candidate terms by the means for comparing (A) reaches a threshold number.
43. The apparatus of claim 32, wherein the master list of candidate terms is optimized for a specific language.
44. The apparatus of claim 43, wherein each document in all or a portion of the documents in the initial group of ranked documents is in the same language for which the master list of candidate terms has been optimized.
45. The apparatus of claim 32, wherein each term in the master list of candidate terms is a word or a phrase.
46. The apparatus of claim 31, further comprising means for repeating, using a revised query, the means for processing, the means for selecting, and the means for presenting, wherein the revised query includes the received query and a candidate term from the subset of candidate terms.
47. The apparatus of claim 31, wherein the means for selecting comprises:
(i) means for applying a weighting function to each candidate term in each respective set of ranked candidate terms that is associated with a top-ranked document in the initial group of ranked documents, wherein each top-ranked document in the initial group of ranked documents has a rank that is numerically less than a threshold rank; and
(ii) means for selecting, for the subset of candidate terms, those candidate terms that receive the highest weights.
48. The apparatus of claim 47, wherein the weight applied to a candidate term by the weighting function is determined by the number of times the candidate term appears at the top of the top-ranked documents.
49. The apparatus of claim 47, wherein the weight applied to a candidate term by the weighting function is determined by the position of the candidate term in the top-ranked documents in which it appears.
50. The apparatus of claim 47, wherein the weight applied to a candidate term by the weighting function is determined by whether a term in the received query is in the candidate term.
51. The apparatus of claim 47, wherein the weight applied to a candidate term by the weighting function is determined by the number of characters in the candidate term.
52. The apparatus of claim 47, wherein the weight applied to a candidate term by the weighting function is determined by the position, in the initial group of ranked documents, of the documents that include the candidate term.
53. The device as claimed in claim 47, wherein the weight applied to a candidate item by the weighting function is determined according to any combination of TermCount, TermPosition, ResultPosition, TermLength and QueryInclusion, wherein
TermCount is the number of times the candidate item appears in the top portion of each top document,
TermPosition is a function of the position of the candidate item in each top document in which it appears,
ResultPosition is a function of the position, in the initial group of ranked documents, of the top documents that contain the candidate item,
TermLength is the number of characters in the candidate item, and
QueryInclusion is non-zero when an item of the received query is in the candidate item, and zero when an item of the received query is not in the candidate item.
54. The device as claimed in claim 53, wherein the weight applied to a candidate item by the weighting function is determined according to the following formula:
TermCount + TermPosition + ResultPosition + TermLength + QueryInclusion.
55. The device as claimed in claim 54, wherein each of TermCount, TermPosition, ResultPosition, TermLength and QueryInclusion is independently weighted.
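The weighting and selection recited in claims 47 to 55 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `select_candidates`, the coefficient values in `COEFFS`, the reciprocal decay used for TermPosition and ResultPosition, and the choice to count one occurrence per top document for TermCount are all assumptions made for illustration; the claims only require that these quantities be functions of position, rank, count, length and query overlap.

```python
from collections import defaultdict

# Hypothetical coefficients: claim 55 only says each component is weighted independently.
COEFFS = {
    "TermCount": 1.0,
    "TermPosition": 0.5,
    "ResultPosition": 0.5,
    "TermLength": 0.1,
    "QueryInclusion": 2.0,
}


def select_candidates(query, ranked_docs, threshold_rank=3, subset_size=2, coeffs=COEFFS):
    """Return (subset, weights) for the candidate items of the top documents.

    ranked_docs is ordered by rank (index 0 is rank 1). Each entry is the
    ordered list of candidate items already associated with that document.
    """
    query_items = set(query.lower().split())
    parts = defaultdict(lambda: defaultdict(float))

    for rank, candidates in enumerate(ranked_docs, start=1):
        # Top documents are those whose rank is numerically below the threshold.
        if rank >= threshold_rank:
            break
        for position, item in enumerate(candidates):
            p = parts[item]
            p["TermCount"] += 1                        # assumed: one count per top document
            p["TermPosition"] += 1.0 / (1 + position)  # assumed decay with position in the set
            p["ResultPosition"] += 1.0 / rank          # assumed decay with document rank
            p["TermLength"] = len(item)                # number of characters
            overlap = query_items & set(item.lower().split())
            p["QueryInclusion"] = 1.0 if overlap else 0.0

    # Claims 54 and 55: a sum of the five components, each with its own coefficient.
    weights = {
        item: sum(coeffs[name] * value for name, value in p.items())
        for item, p in parts.items()
    }
    # Highest weight first; ties broken alphabetically so the result is deterministic.
    ordered = sorted(weights, key=lambda item: (-weights[item], item))
    return ordered[:subset_size], weights
```

With the query "jaguar" and three ranked documents whose candidate lists are `["jaguar car", "jaguar dealer"]`, `["jaguar car", "big cat"]` and `["panthera onca"]`, a threshold rank of 3 restricts scoring to the first two documents, so "panthera onca" never receives a weight.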
56. The device as claimed in claim 33, wherein the means for selecting comprises:
means for determining, for each respective set of classified candidate items associated with a top document of the initial group of ranked documents, the category of that respective set of classified candidate items; and
wherein, when a threshold percentage of the sets of candidate items evaluated in the determining belong to the first category, all sets of candidate items that belong to the second category are not used to form the subset of candidate items.
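The category-based exclusion of claim 56 can be sketched in Python as follows. This is an illustrative reading only: the function name `sets_for_selection`, the `classify` callable, the category labels "first" and "second", and the use of "greater than or equal to" for the threshold comparison are assumptions, since this section does not say how the categories are defined or how the threshold is compared.

```python
def sets_for_selection(candidate_sets, classify, first="first", second="second",
                       threshold_pct=0.5):
    """Return the candidate item sets that may contribute to the subset.

    classify is a hypothetical callable mapping a candidate item set to its
    category label. If the share of evaluated sets in the first category
    reaches threshold_pct, every set in the second category is excluded.
    """
    labelled = [(classify(s), s) for s in candidate_sets]
    if not labelled:
        return []

    share_first = sum(1 for label, _ in labelled if label == first) / len(labelled)
    if share_first >= threshold_pct:
        # Enough sets belong to the first category: drop all second-category sets.
        return [s for label, s in labelled if label != second]
    # Otherwise every evaluated set remains eligible.
    return [s for _, s in labelled]
```

The sets returned here would then be passed to whatever selection function forms the subset of candidate items.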
CNB2004800140270A 2003-03-21 2004-03-22 The system and method that is used for interactive search query refinement Expired - Lifetime CN100568172C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US45690503P 2003-03-21 2003-03-21
US60/456,905 2003-03-21
US10/424,180 2003-04-25

Publications (2)

Publication Number Publication Date
CN1795432A CN1795432A (en) 2006-06-28
CN100568172C CN100568172C (en) 2009-12-09

Family

ID=36806170

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004800140270A Expired - Lifetime CN100568172C (en) 2003-03-21 2004-03-22 The system and method that is used for interactive search query refinement

Country Status (1)

Country Link
CN (1) CN100568172C (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120303355A1 (en) * 2011-05-27 2012-11-29 Robert Bosch Gmbh Method and System for Text Message Normalization Based on Character Transformation and Web Data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fast and Effective Query Refinement. Bienvenido Vélez et al. In Proceedings of the 20th International Conference on Research and Development in Information Retrieval. 1997 *

Also Published As

Publication number Publication date
CN1795432A (en) 2006-06-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20090306

Address after: California, USA

Applicant after: YAHOO! Inc.

Address before: California, USA

Applicant before: OVERTURE SERVICES, Inc.

ASS Succession or assignment of patent right

Owner name: YAHOO! INC.

Free format text: FORMER OWNER: OVERTURE SERVICES, INC.

Effective date: 20090306

C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: FEIYANG MANAGEMENT CO., LTD.

Free format text: FORMER OWNER: YAHOO! INC.

Effective date: 20150331

TR01 Transfer of patent right

Effective date of registration: 20150331

Address after: Tortola, British Virgin Islands

Patentee after: Yahoo! Inc.

Address before: California, USA

Patentee before: YAHOO! Inc.

CX01 Expiry of patent term

Granted publication date: 20091209