CN102682073B - Selection of atoms for search engine retrieval - Google Patents

Selection of atoms for search engine retrieval

Info

Publication number
CN102682073B
CN102682073B CN201210060934.5A CN201210060934A CN102682073B CN 102682073 B CN102682073 B CN 102682073B CN 201210060934 A CN201210060934 A CN 201210060934A CN 102682073 B CN102682073 B CN 102682073B
Authority
CN
China
Prior art keywords
atom
file
search
measure information
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210060934.5A
Other languages
Chinese (zh)
Other versions
CN102682073A (en)
Inventor
K. M. Risvik
M. Hopcroft
J. G. Bennett
K. Kalyanaraman
T. Chilimbi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US13/045,278 (US9342582B2)
Application filed by Microsoft Technology Licensing LLC
Publication of CN102682073A
Application granted
Publication of CN102682073B
Active
Anticipated expiration

Abstract

Methods are provided for populating search indexes with atoms identified in documents. Documents that are to be indexed are identified, and for each document, atoms are identified and are categorized as unigrams, n-grams, and n-tuples. A list of atom/document pairs is generated such that an information metric can be computed for each pair. An information metric represents a ranking of the atom in relation to the particular document. Based on the information metric, some atom/document pairs are discarded and others are indexed.

Description

Selection of atoms for search engine retrieval
Background
The amount of information and content available on the Internet continues to grow rapidly. Given the vast amount of information, search engines have been developed to facilitate searching for electronic documents. In particular, a user can search for information and documents by entering a search query comprising one or more terms that may be of interest to the user. After receiving a search query from a user, the search engine identifies documents and/or web pages that are relevant based on the search query. Because of its utility, web searching, that is, the process of finding relevant web pages and documents for user-issued search queries, has arguably become the most popular service on the Internet today.
Search engines operate by crawling documents and indexing information regarding the documents in a search index. When a search query is received, the search engine uses the search index to identify documents relevant to the search query. Using a search index in this manner allows information to be retrieved quickly for a query. Without a search index, a search engine would need to search the entire corpus of documents to find relevant results, which would take an unacceptable amount of time.
As the Internet continues to grow, the number of searchable documents that may be crawled and indexed in a search index has become extremely large. As a result, it is not feasible for search engines to index information regarding all web documents. For instance, an inordinate amount of hardware storage would be required. Additionally, the processing time required to retrieve results from an extremely large index would be unacceptable. Nonetheless, search engines strive to index as many documents as feasible to provide search results for any query, while remaining cost-effective and able to provide relevant results within a time that is acceptable to end users.
Summary
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Embodiments of the present invention relate to populating one or more search indexes with atoms that have been identified from a plurality of documents. Atoms may be unigrams, n-grams, or n-tuples. A list of atom/document pairs is generated so that each atom can be identified, for example by a document identifier, as coming from a particular document. For each atom/document pair, an information metric is computed that approximates how relevant the atom is to the particular document. Many factors are used to compute the information metric, such as the frequency with which the atom occurs in the document, the proximity in the document of the terms that make up the atom, the relatedness of the terms, and whether an inspection of query logs shows that the terms have been linked together. In some instances, a machine-learning tool is employed to compute the information metric. Atom/document pairs whose information metric meets or exceeds a particular threshold are indexed in a search index; those that do not are discarded and therefore are not indexed.
Brief description of the drawings
The present invention is described in detail below with reference to the attached drawings, wherein:
Fig. 1 is a block diagram of an exemplary computing environment suitable for use in implementing embodiments of the present invention;
Fig. 2 is a diagram illustrating a smart funnel, in accordance with an embodiment of the present invention, for reducing candidate documents to obtain a set of ranked documents;
Fig. 3 is a block diagram of an exemplary system in which embodiments of the present invention may be employed;
Fig. 4 is a flow diagram showing a method, in accordance with an embodiment of the present invention, for a staged process that returns search results in response to a search query;
Fig. 5 is a flow diagram showing a method, in accordance with an embodiment of the present invention, for generating a search index during a pre-computation/indexing stage;
Fig. 6 is a flow diagram showing a method, in accordance with an embodiment of the present invention, for identifying an initial set of matching documents during a matching stage;
Fig. 7 is a flow diagram showing a method, in accordance with an embodiment of the present invention, for pruning documents from the initial set of matching documents during a pruning stage;
Fig. 8 illustrates an exemplary system in which embodiments of the present invention may be employed;
Figs. 9A, 9B, and 9C illustrate examples of entries in a unigram search index, an n-gram search index, and an n-tuple search index, respectively, in accordance with embodiments of the present invention;
Fig. 10 is a flow diagram showing a method, in accordance with an embodiment of the present invention, for populating one or more search indexes with atoms identified in a plurality of documents;
Fig. 11 is a flow diagram showing a method, in accordance with an embodiment of the present invention, for populating one or more search indexes with atoms identified in a plurality of documents; and
Fig. 12 is a flow diagram showing a method, in accordance with an embodiment of the present invention, for populating one or more search indexes with atoms identified in a plurality of documents.
Detailed description
The subject matter of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms "step" and/or "block" may be used herein to connote different elements of the methods employed, the terms should not be interpreted as implying any particular order among or between the various steps disclosed herein unless and except when the order of individual steps is explicitly described.
Embodiments of the present invention provide an indexing and searching process that allows a large number of documents to be indexed and searched in a cost-effective manner while meeting strict latency constraints. In accordance with embodiments of the present invention, a process is employed that evaluates and prunes candidate documents in multiple stages. Conceptually, the process looks like a funnel: candidate documents are evaluated and pruned as the analysis becomes more complex from stage to stage. As the process continues through the stages, more expensive computations are applied and the number of candidate documents may be reduced by multiple orders of magnitude. Different strategies are applied at each stage to allow search results to be returned from a large number of documents quickly and efficiently. Additionally, the strategy used at each stage may be designed to complement the strategies used at the other stages, making the process more efficient.
The search index employed by embodiments of the present invention indexes higher-order primitives, or "atoms", from documents, as opposed to simply indexing single terms. As used herein, an "atom" may refer to a variety of units of a query or a document. These units may include, for example, a term, an n-gram, an n-tuple, a k-near n-tuple, and so on. A term maps down to a single symbol or word as defined by the particular tokenizer technology being used. In one embodiment, a term is a single character. In another embodiment, a term is a single word or a grouping of words. An n-gram is a sequence of "n" consecutive or almost-consecutive terms that may be extracted from a document. An n-gram is said to be "tight" if it corresponds to a run of consecutive terms, and "loose" if it contains the terms in the order in which they appear in the document but the terms are not necessarily consecutive. Loose n-grams are typically used to represent a class of equivalent phrases that differ by insignificant words (for example, "if it rains I'll get wet" and "if it rains then I'll get wet"). An n-tuple, as used herein, is a set of "n" terms that co-occur in a document, independent of order. Further, a k-near n-tuple, as used herein, is a set of "n" terms that co-occur within a window of "k" terms in a document. An atom is thus generally defined as a generalization of all of the above. Implementations of embodiments of the present invention may use different varieties of atoms, but as used herein, "atom" generically describes each of the above varieties.
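The atom varieties defined above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the whitespace tokenizer and the function names (`tokenize`, `tight_ngrams`, `n_tuples`, `k_near_n_tuples`) are hypothetical and are not taken from the patent, and loose n-grams are omitted for brevity.

```python
from itertools import combinations

def tokenize(text):
    """Naive tokenizer (assumption): lowercase, split on whitespace."""
    return text.lower().split()

def tight_ngrams(terms, n):
    """All runs of n consecutive terms, in document order."""
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]

def n_tuples(terms, n):
    """All unordered sets of n distinct terms that co-occur in the document."""
    return {frozenset(c) for c in combinations(sorted(set(terms)), n)}

def k_near_n_tuples(terms, n, k):
    """Unordered sets of n distinct terms co-occurring within a window of k terms."""
    found = set()
    for start in range(max(1, len(terms) - k + 1)):
        window = sorted(set(terms[start:start + k]))
        found.update(frozenset(c) for c in combinations(window, n))
    return found

terms = tokenize("the quick brown fox jumps")
unigrams = tight_ngrams(terms, 1)             # single terms
bigrams = tight_ngrams(terms, 2)              # tight 2-grams
pairs_any = n_tuples(terms, 2)                # 2-tuples anywhere in the document
pairs_near = k_near_n_tuples(terms, n=2, k=3) # 2-tuples within a 3-term window
```

In this sketch the pair {"the", "jumps"} is a 2-tuple of the document but not a 3-near 2-tuple, because the two terms never fall inside the same three-term window.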
When the search index is built, each document is analyzed to identify the atoms in the document and to generate a pre-computed score or rank for each atom that represents the atom's importance or relevance in the context of the document. The search index stores information regarding the pre-computed scores generated for document/atom pairs, which are used during the funnel process.
Fig. 2 illustrates the multiple stages of a funnel process in accordance with one embodiment of the present invention. The stages shown in Fig. 2 are performed after a search query is received and include an L0 matching stage 202, an L1 temporary ranking stage 204, and an L2 final ranking stage 206. As represented in Fig. 2, the number of candidate documents is reduced as the process progresses.
When a search query is received, the search query is analyzed to identify atoms. During the L0 matching stage 202, the atoms are used to query the search index and identify an initial set of matching documents that contain the atoms from the search query. As shown in Fig. 2, this may reduce the number of candidate documents from all documents indexed in the search index to those documents that match the atoms from the search query.
In the L1 temporary ranking stage 204, a simplified scoring function is used to compute a preliminary score for the candidate documents retained from the L0 matching stage 202. The simplified scoring function operates on, among other things, the pre-computed scores stored in the search index for document/atom pairs. In some embodiments, the simplified scoring function may serve as an approximation of the final ranking algorithm that will ultimately be used to rank documents. However, the simplified scoring function is less expensive to compute than the final ranking algorithm, which allows a large number of candidate documents to be processed quickly. Candidate documents are pruned based on the preliminary scores. For instance, only the top N documents having the highest preliminary scores may be retained.
In the L2 final ranking stage 206, the candidate documents retained from the L1 temporary ranking stage 204 are evaluated using a final ranking algorithm. Compared with the simplified scoring function used during the L1 temporary ranking stage 204, the final ranking algorithm is a more expensive operation that uses a larger number of ranking features. However, the final ranking algorithm is applied to a much smaller number of candidate documents. The final ranking algorithm provides a set of ranked documents, and search results are provided in response to the original search query based on the set of ranked documents.
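The three stages can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the toy index layout, the choice of a plain sum as the simplified scoring function, and the stand-in `EXPENSIVE_SCORES` table used as the final ranker are all assumptions made for the example.

```python
import heapq

# Hypothetical toy index: atom -> {doc_id: pre-computed score}.
INDEX = {
    "search": {"d1": 0.9, "d2": 0.4, "d3": 0.7},
    "engine": {"d1": 0.8, "d2": 0.6, "d3": 0.1, "d4": 0.5},
}

def l0_match(atoms, index):
    """L0: keep only documents that contain every query atom."""
    postings = [set(index.get(atom, {})) for atom in atoms]
    return set.intersection(*postings) if postings else set()

def l1_prune(atoms, candidates, index, top_n):
    """L1: cheap preliminary score (sum of pre-computed scores), keep the top N."""
    def preliminary(doc):
        return sum(index[atom][doc] for atom in atoms)
    return heapq.nlargest(top_n, candidates, key=preliminary)

def l2_rank(candidates, full_ranker):
    """L2: apply the expensive ranker to the few surviving documents."""
    return sorted(candidates, key=full_ranker, reverse=True)

def funnel(atoms, index, top_n, full_ranker):
    matched = l0_match(atoms, index)
    pruned = l1_prune(atoms, matched, index, top_n)
    return l2_rank(pruned, full_ranker)

# Stand-in for a costly, feature-rich ranker (assumption for illustration).
EXPENSIVE_SCORES = {"d1": 0.5, "d2": 0.95, "d3": 0.2}
results = funnel(["search", "engine"], INDEX, top_n=2,
                 full_ranker=lambda doc: EXPENSIVE_SCORES[doc])
```

Here L0 drops d4 (it lacks "search"), L1 drops d3 (lowest preliminary score), and L2 reorders the two survivors, so the expensive ranker is only ever called on two documents.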
Accordingly, in one aspect, an embodiment of the present invention is directed to one or more computer storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform a method. The method includes receiving a search query and reformulating the search query to identify one or more atoms. The method also includes identifying an initial set of documents from a search index based on the one or more atoms. The method further includes computing a preliminary score for each document in the initial set of documents using a simplified scoring function and the pre-computed scores stored in the search index for document/atom pairs for the one or more atoms and the initial set of documents. The method also includes selecting a pruned set of documents from the initial set of documents based on the preliminary scores. The method further includes computing a ranking score for each document in the pruned set of documents using a full ranking algorithm to provide a set of ranked documents. The method still further includes providing search results for presentation to an end user based on the set of ranked documents.
In another embodiment, an aspect of the present invention is directed to a computerized system including at least one processor and one or more computer storage media. The system includes a query reformulation component that analyzes a received search query to identify one or more atoms based on terms contained in the received search query and generates a reformulated query. The system also includes a document matching component that queries a search index using the reformulated query to identify an initial set of matching documents. The system further includes a document pruning component that computes a preliminary score for each document in the initial set of matching documents using a simplified scoring function and identifies a pruned set of documents based on the preliminary scores. The system still further includes a final document ranking component that computes a ranking score for each document in the pruned set of documents using a full ranking algorithm.
A further embodiment of the present invention is directed to a method for providing search results in response to a search query using a staged process. The method includes receiving a search query and identifying one or more atoms from the search query. The method also includes identifying an initial set of documents containing the one or more atoms, computing a preliminary score for each document in the initial set of documents using a simplified scoring function, and selecting a subset of the documents for further processing based on the preliminary scores. The method further includes computing a ranking score for each document in the subset of documents using a final ranking algorithm. The method still further includes providing a set of search results based on the ranking scores.
In addition to the embodiments described above, methods are also described herein for identifying relevant atoms from documents and indexing atom/document pairs. For instance, atoms, which may be classified as unigrams, n-grams, or n-tuples, are identified in or extracted from a document. An information metric is computed for each atom/document pair. The computation of the information metric may be based on many factors and may even be performed by a machine-learning tool that learns how to compute the information metric. A threshold is used to discard, based on the information metric, those atom/document pairs that are deemed less relevant or less useful in resolving queries than other atom/document pairs. Those deemed most relevant are indexed in a search index for use when search queries are received in the future.
According to a first aspect of the present invention, a method is provided for populating one or more search indexes with atoms identified in a plurality of documents. The method includes identifying a set of documents to be indexed in a search index and, for each document in the set of documents, identifying a plurality of atoms, the plurality of atoms comprising one or more unigrams, one or more n-grams, and one or more n-tuples. The method also includes generating a list of atom/document pairs based on the identified set of documents and the plurality of atoms, and computing an information metric for each atom/document pair, wherein the information metric represents a ranking of the atom in relation to the particular document. Additionally, the method includes selecting, based on the information metric of each atom/document pair, a subset of the atom/document pairs that are most relevant to the particular document from which the atom was identified. The method further includes populating the search index with the subset of atom/document pairs for the particular document.
According to a second aspect of the present invention, one or more computer storage media are provided that store computer-useable instructions that, when used by a computing device, cause the computing device to perform a method for populating one or more search indexes with atoms identified in a plurality of documents. The method includes identifying a plurality of atoms from a first document of a plurality of documents to be indexed, classifying each of the plurality of atoms as one or more of a unigram, an n-gram, or an n-tuple, and computing an information metric for each of the plurality of atoms in relation to the first document. Further, the method includes determining whether the information metric of each of the plurality of atoms meets a predetermined threshold. The atoms that meet the predetermined threshold are those most relevant to the first document. The method also includes discarding the atoms that do not meet the predetermined threshold and incorporating the atoms that meet the predetermined threshold, in association with the first document, into one or more search indexes.
According to a third aspect of the present invention, one or more computer storage media are provided that store computer-useable instructions that, when used by a computing device, cause the computing device to perform a method for populating one or more search indexes with atoms identified in a plurality of documents. The method includes extracting a plurality of atoms from a document, the plurality of atoms comprising one or more unigrams, one or more n-grams, and one or more n-tuples, and, for each of the plurality of atoms, computing an information metric that represents a ranking of the particular atom in relation to the document. The computation of the information metric is based on one or more of the following: the frequency of the atom in the document, the proximity in the document of two or more terms of the atom, the relatedness of two or more terms of the atom, or whether two or more terms of the atom have previously been linked together, as evidenced by an inspection of query logs. The method further includes determining an information metric threshold. Atom/document pairs whose information metric meets or exceeds the information metric threshold are to be indexed. In addition, the method includes discarding a portion of the atom/document pairs based on the information metric. The information metric of each discarded atom/document pair is below the information metric threshold. One or more search indexes are populated by indexing the atom/document pairs whose information metric meets or exceeds the information metric threshold, wherein unigrams, n-grams, and n-tuples are indexed separately. The one or more search indexes are accessed to identify documents relevant to the atoms in a query.
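The thresholding step described in these aspects can be sketched as follows. The text does not specify how the factors are combined, so the weighted sum in `information_metric`, the weights, the threshold value, and the sample pairs below are all hypothetical choices made for illustration; the text notes that a machine-learning tool may compute the metric instead.

```python
from collections import defaultdict

def information_metric(frequency, proximity, relatedness, in_query_log,
                       weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted sum of four factors, each expected in [0, 1] (assumed form)."""
    factors = (frequency, proximity, relatedness, 1.0 if in_query_log else 0.0)
    return sum(w * f for w, f in zip(weights, factors))

def populate_indexes(pairs, threshold):
    """pairs: iterable of (atom, kind, doc_id, metric).

    Unigrams, n-grams, and n-tuples go into separate indexes; pairs whose
    metric is below the threshold are discarded rather than indexed.
    """
    indexes = {"unigram": defaultdict(dict),
               "ngram": defaultdict(dict),
               "ntuple": defaultdict(dict)}
    discarded = []
    for atom, kind, doc_id, metric in pairs:
        if metric >= threshold:      # meets or exceeds the threshold: index it
            indexes[kind][atom][doc_id] = metric
        else:                        # below the threshold: discard
            discarded.append((atom, doc_id))
    return indexes, discarded

pairs = [
    ("rain", "unigram", "doc1", information_metric(0.9, 1.0, 1.0, True)),
    (("get", "wet"), "ngram", "doc1", information_metric(0.5, 1.0, 0.5, False)),
    (frozenset({"rain", "umbrella"}), "ntuple", "doc1",
     information_metric(0.1, 0.2, 0.3, False)),
]
indexes, discarded = populate_indexes(pairs, threshold=0.4)
```

With these sample values the unigram and the n-gram are indexed, while the n-tuple falls below the threshold and is discarded, so it consumes no index space.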
Having described an overview of embodiments of the present invention, an exemplary operating environment in which embodiments of the present invention may be implemented is described below in order to provide a general context for various aspects of the present invention. Referring initially to Fig. 1 in particular, an exemplary operating environment for implementing embodiments of the present invention is shown and designated generally as computing device 100. Computing device 100 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing device 100 be interpreted as having any dependency or requirement relating to any one or combination of the components illustrated.
The invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal digital assistant or other handheld device. Generally, program modules, including routines, programs, objects, components, data structures, and the like, refer to code that performs particular tasks or implements particular abstract data types. The invention may be practiced in a variety of system configurations, including handheld devices, consumer electronics, general-purpose computers, more specialized computing devices, and so on. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
With reference to Fig. 1, computing device 100 includes a bus 110 that directly or indirectly couples the following devices: memory 112, one or more processors 114, one or more presentation components 116, input/output (I/O) ports 118, input/output components 120, and an illustrative power supply 122. Bus 110 represents what may be one or more buses (such as an address bus, data bus, or combination thereof). Although the various blocks of Fig. 1 are shown with lines for the sake of clarity, in reality, delineating the various components is not so clear, and metaphorically, the lines would more accurately be grey and fuzzy. For example, one may consider a presentation component such as a display device to be an I/O component. Also, processors have memory. The inventors recognize that such is the nature of the art, and reiterate that the diagram of Fig. 1 is merely illustrative of an exemplary computing device that can be used in connection with one or more embodiments of the present invention. No distinction is made between such categories as "workstation", "server", "laptop", "handheld device", and so on, as all are contemplated within the scope of Fig. 1 and are referred to as a "computing device".
Computing device 100 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computing device 100 and include both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by computing device 100. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
Memory 112 includes computer storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, and so on. Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O components 120. Presentation component(s) 116 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, and so on.
I/O ports 118 allow computing device 100 to be logically coupled to other devices, including I/O components 120, some of which may be built in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, and so on.
Referring now to Fig. 3, a block diagram is provided illustrating an exemplary system 300 in which embodiments of the present invention may be employed. It should be understood that this and other arrangements described herein are set forth only as examples. Other arrangements and elements (for example, machines, interfaces, functions, orders, and groupings of functions) can be used in addition to or instead of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, and in any suitable combination and location. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software. For instance, various functions may be carried out by a processor executing instructions stored in memory.
Among other components not shown, the system 300 may include a user device 302, a content server 304, and a search engine server 306. Each of the components shown in Fig. 3 may be any type of computing device, such as the computing device 100 described with reference to Fig. 1. The components may communicate with each other via a network 308, which may include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet. It should be understood that any number of user devices, content servers, and search engine servers may be employed within the system 300 within the scope of the present invention. Each may comprise a single device or multiple devices cooperating in a distributed environment. For instance, the search engine server 306 may comprise multiple devices arranged in a distributed environment that collectively provide the functionality of the search engine server 306 described herein. Additionally, other components not shown may also be included within the system 300.
The search engine server 306 generally operates to receive search queries from user devices, such as the user device 302, and to provide search results in response to the search queries. The search engine server 306 includes, among other things, an indexing component 310, a user interface component 312, a query reformulation component 314, a document matching component 316, a document pruning component 318, and a final document ranking component 320.
The indexing component 310 operates to index data regarding documents maintained by content servers, such as the content server 304. For instance, a crawling component (not shown) may be employed to crawl content servers and access information regarding the documents maintained by the content servers. The indexing component 310 then indexes data regarding the crawled documents in the search index 322. In embodiments, the indexing component 310 indexes the atoms found in documents and, for each document in which an atom is found, scoring information that indicates the importance of the atom in the context of the document. Any number of algorithms may be used to compute the score for an atom found in a document. By way of example only, the score may be based on term frequency/inverse document frequency (TF/IDF) functions, as known in the art. For instance, the BM25F ranking function may be employed. The scores generated for document/atom pairs are stored in the search index 322 as pre-computed scores.
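As a rough illustration of a pre-computed document/atom score, the sketch below uses plain TF-IDF. It is an assumption-laden simplification: it is not BM25F (which additionally weights document fields and normalizes for length), and the function name `tf_idf_scores` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(docs):
    """docs: {doc_id: list of atoms}. Returns {(doc_id, atom): score}."""
    n_docs = len(docs)
    # Document frequency: in how many documents each atom appears.
    doc_freq = Counter(atom for atoms in docs.values() for atom in set(atoms))
    scores = {}
    for doc_id, atoms in docs.items():
        counts = Counter(atoms)
        for atom, count in counts.items():
            tf = count / len(atoms)                   # term frequency
            idf = math.log(n_docs / doc_freq[atom])   # inverse document frequency
            scores[(doc_id, atom)] = tf * idf
    return scores

docs = {
    "d1": ["search", "engine", "search"],
    "d2": ["search", "index"],
}
scores = tf_idf_scores(docs)
```

An atom that appears in every document (here "search") receives a score of zero under this formulation, which reflects the intuition that it does little to distinguish one document from another.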
In embodiments, the indexing component 310 analyzes each document to identify terms, n-grams, and n-tuples and to determine which of these atoms should be indexed for the document. During processing of the documents to be indexed, statistics about query distribution, term distribution, and/or the simplified scoring function to be used during the funnel process may be used to statistically select the best set of atoms to represent the document. These selected atoms are indexed in the search index 322 with the pre-computed scores, which allows documents to be pruned efficiently early in the funnel process.
Although not required, in some embodiments of the present invention the search index 322 may include both a reverse index (ordered by atom) and a forward index (ordered by document). The reverse index may include a number of posting lists, each posting list being directed to an atom and listing the documents that contain that atom, together with the pre-computed score for each document/atom pair. As will be described in further detail below, the reverse index and the forward index may be used at different stages of the funnel process.
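The two index layouts can be sketched with ordinary Python dictionaries. The in-memory structures and the function name `build_indexes` are assumptions for illustration only; the text does not describe the storage format at this level of detail.

```python
from collections import defaultdict

def build_indexes(scored_pairs):
    """scored_pairs: iterable of (doc_id, atom, pre-computed score)."""
    reverse = defaultdict(list)   # atom -> posting list of (doc_id, score)
    forward = defaultdict(dict)   # doc_id -> {atom: score}
    for doc_id, atom, score in scored_pairs:
        reverse[atom].append((doc_id, score))
        forward[doc_id][atom] = score
    for postings in reverse.values():
        postings.sort()           # keep each posting list ordered by doc_id
    return dict(reverse), dict(forward)

reverse_index, forward_index = build_indexes([
    ("d2", "search", 0.4),
    ("d1", "search", 0.9),
    ("d1", "engine", 0.8),
])
```

The reverse index answers "which documents contain this atom?", which suits matching; the forward index answers "which atoms and scores does this document have?", which suits scoring a document once it is already a candidate.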
User interface component 312 provides an interface to user devices, such as user device 302, that allows a user to submit search queries to search engine server 306 and receive search results from it. User device 302 can be any type of computing device employed by a user to submit search queries and receive search results. By way of example only and not limitation, user device 302 can be a desktop computer, a laptop computer, a tablet computer, a mobile device, or another type of computing device. User device 302 can include an application that allows the user to enter a search query and submit it to search engine server 306 to retrieve search results. For example, user device 302 can include a web browser that includes a search input box or that allows the user to access a search page to submit a search query. Other mechanisms for submitting search queries to a search engine are contemplated within the scope of embodiments of the present invention.
When a search query is received via user interface component 312, query reformulation component 314 operates to rewrite the query. The query is rewritten from its free-text form into a form that facilitates querying search index 322, based on how data is indexed there. In embodiments, the terms of the search query are analyzed to identify atoms that can be used to query search index 322. The atoms can be identified using techniques similar to those used to identify atoms in documents when the documents were indexed in search index 322. For example, atoms can be identified based on statistics of terms and query distribution information. Query reformulation component 314 can provide a set of conjunctions of atoms and cascading variants of these atoms.
Document matching component 316 uses the rewritten query to query search index 322 and identify a set of matching documents. For example, the rewritten query can include two or more atoms, and document matching component 316 can take the intersection of the posting lists for those atoms to provide an initial set of matching documents.
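The intersection step can be sketched as follows, assuming each posting list is a list of (document id, score) tuples. The function name `intersect_postings` and the sample data are hypothetical.

```python
def intersect_postings(posting_lists):
    """Return the document ids that appear in every posting list.

    Each posting list is a list of (doc_id, score) tuples.
    """
    if not posting_lists:
        return set()
    # Start from the shortest list so the candidate set shrinks quickly.
    ordered = sorted(posting_lists, key=len)
    matches = {doc_id for doc_id, _ in ordered[0]}
    for postings in ordered[1:]:
        matches &= {doc_id for doc_id, _ in postings}
    return matches

reverse = {
    "search": [("d1", 0.9), ("d3", 0.2)],
    "index": [("d1", 0.7), ("d2", 0.4), ("d3", 0.5)],
}
initial_matches = intersect_postings([reverse["search"], reverse["index"]])
```

Here only documents d1 and d3 contain both atoms, so they form the initial set of matching documents.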
Document pruning component 318 operates by pruning documents from the initial set of matching documents. This can include computing a preliminary score for each document in the initial set using the precomputed scores for document/atom pairs stored in search index 322. The preliminary score can be based on a simplified scoring function that is tuned for performance and recall. In some embodiments, the simplified scoring function used to generate the preliminary score is built from the full ranking algorithm that is subsequently used to provide the final set of ranked documents. As such, the simplified scoring function serves as an approximation of the final ranking algorithm. For example, the approach described in U.S. patent application No. (not yet assigned) (attorney docket MFCP.157122), entitled "DECOMPOSABLE RANKING FOR EFFICIENT PRECOMPUTING", can be used to build the simplified scoring function. In some embodiments, the simplified scoring function contains a subset of the ranking features of the final ranking algorithm.
Document pruning component 318 can employ a number of different approaches to prune the initial set of documents. In some embodiments, document pruning component 318 retains a predetermined number of matches from the initial set and removes the other documents from consideration (that is, it keeps the top N matches). For example, document pruning component 318 can retain the thousand documents with the highest preliminary scores. The number of matches retained can be based on the fidelity confidence of the simplified scoring function used to generate the preliminary scores. The fidelity confidence represents the ability of the simplified scoring function to provide a set of documents that matches the set that would be provided by the full ranking algorithm. For example, it may take on average 1,200 documents from the simplified scoring function to capture the top 1,000 documents that the final ranking algorithm would provide. In other embodiments, instead of retaining a predetermined number of documents, document pruning component 318 retains the documents whose preliminary score is above a certain threshold.
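Both pruning approaches can be sketched briefly. The simplified scoring function below simply sums the precomputed scores of the query atoms, which is an assumption made for illustration and not the scoring function of any particular embodiment; the function names are likewise hypothetical.

```python
import heapq

def preliminary_score(doc_id, query_atoms, forward):
    # Assumed simplified scoring function: sum of the precomputed scores.
    return sum(forward[doc_id].get(atom, 0.0) for atom in query_atoms)

def prune_top_n(doc_ids, query_atoms, forward, n):
    """Retain the n documents with the highest preliminary scores."""
    return heapq.nlargest(
        n, doc_ids, key=lambda d: preliminary_score(d, query_atoms, forward)
    )

def prune_by_threshold(doc_ids, query_atoms, forward, threshold):
    """Retain documents whose preliminary score is above the threshold."""
    return [
        d for d in doc_ids
        if preliminary_score(d, query_atoms, forward) > threshold
    ]

forward = {
    "d1": {"search": 0.9, "index": 0.7},
    "d2": {"index": 0.4},
    "d3": {"search": 0.2, "index": 0.5},
}
atoms = ["search", "index"]
top_two = prune_top_n(["d1", "d2", "d3"], atoms, forward, 2)
above = prune_by_threshold(["d1", "d2", "d3"], atoms, forward, 0.5)
```

In practice the value of N would be chosen larger than the number of results wanted, to allow for the difference between the simplified scoring function and the full ranking algorithm.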
In some embodiments, document matching component 316 and document pruning component 318 can be tightly coupled, so that matching and pruning are combined into a single process that runs over multiple iterations. For example, preliminary scores can be computed as matching documents are identified and used to remove documents that would very likely be discarded by the full ranking algorithm.
In some embodiments, a search index using tiered posting lists (such as that described in U.S. patent application No. (not yet assigned) (attorney docket MFCP.157121), entitled "TIERING OF POSTING LISTS IN SEARCH ENGINE INDEX") can be employed to facilitate this matching/pruning process. Each posting list is associated with a given atom and includes tiers ordered by the precomputed scores assigned to documents, which represent the relevance of the given atom in the context of each document. Within each tier, the postings can be ordered internally by document. Using such a search index, document matching component 316 retrieves an initial set of documents using the first tier (with the highest precomputed scores), and the initial set is pruned using the simplified scoring function. If a sufficient number of documents is provided, the matching/pruning process can end. Alternatively, if a sufficient number of documents is not provided, matching and pruning can be repeated on lower-level tiers until a sufficient number of documents remains.
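The iterative matching/pruning loop over tiers might be sketched as below. The tier layout, the summed-score pruning rule, and all names are assumptions made for this sketch and are not taken from the referenced application.

```python
def tiered_match_and_prune(tiered_index, query_atoms, wanted, min_score):
    """Match and prune tier by tier until enough documents remain.

    tiered_index maps an atom to a list of tiers, highest-scoring tier
    first; each tier is a {doc_id: score} dict (hypothetical layout).
    """
    kept = {}
    depth = max(len(tiered_index[atom]) for atom in query_atoms)
    for level in range(depth):
        # For each atom, expose every tier down to the current level.
        visible = []
        for atom in query_atoms:
            merged = {}
            for tier in tiered_index[atom][: level + 1]:
                merged.update(tier)
            visible.append(merged)
        # Match: documents that contain all query atoms.
        matched = set(visible[0]).intersection(*visible[1:])
        # Prune: keep documents whose summed score clears the threshold.
        totals = {d: sum(v[d] for v in visible) for d in matched}
        kept = {d: s for d, s in totals.items() if s >= min_score}
        if len(kept) >= wanted:
            break  # enough documents; lower tiers are never touched
    return sorted(kept, key=kept.get, reverse=True)[:wanted]

tiered_index = {
    "search": [{"d1": 0.9}, {"d2": 0.3, "d3": 0.2}],
    "index": [{"d1": 0.8, "d2": 0.7}, {"d3": 0.1}],
}
result = tiered_match_and_prune(tiered_index, ["search", "index"], 2, 0.5)
```

With two documents wanted, the first tier yields only d1, so the loop descends one tier and then also retains d2; d3 matches but is pruned by the threshold.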
The set of documents retained by the matching and pruning performed by document matching component 316 and document pruning component 318 is evaluated by final document ranking component 320 to provide the final set of ranked documents. Final document ranking component 320 employs a full ranking algorithm that can operate on the original search query and the set of documents retained by matching and pruning. The full ranking algorithm uses more ranking features and more data from the documents than the simplified scoring function used during pruning. As such, the full ranking algorithm is a more expensive operation that requires more processing and takes longer to compute. However, because the set of candidate documents has been pruned, the full ranking algorithm runs on a smaller set of documents.
Final document ranking component 320 provides the final set of ranked documents to user interface component 312. User interface component 312 then communicates search results that include at least a portion of the final set of ranked documents to user device 302. For example, user interface component 312 can generate or otherwise provide a search engine results page (SERP) listing search results based on the final set of ranked documents.
Turning next to Fig. 4, a flow chart is provided that illustrates an overall method 400 for using a staged process to return search results for a search query, according to an embodiment of the present invention. The staged process begins with a precompute/index stage, as shown at block 402. This stage is an offline stage; that is, it is performed separately from any received search query. In precompute/index stage 402, documents are crawled and data about the documents is indexed in a search index. A process for indexing document data during precompute/index stage 402 according to one embodiment is discussed in further detail below with reference to Fig. 5.
The stages shown in Fig. 4 after precompute/index stage 402 are online stages, in which a search query is received and search results are returned in response. The first online stage is a matching stage, as shown at block 404. During matching stage 404, a search query is received and rewritten, and the rewritten query is used to identify matching documents from the search index. A process for identifying matching documents during matching stage 404 according to one embodiment is discussed in further detail below with reference to Fig. 6.
The next stage after matching is a pruning stage, as shown at block 406. Pruning stage 406 takes the initial set of documents from matching stage 404 and determines a preliminary score for each document using a simplified scoring function. Based on the preliminary scores, documents are pruned from the initial set. A process for pruning documents from the initial set of matching documents according to one embodiment is discussed in further detail below with reference to Fig. 7.
In some embodiments, matching stage 404 and pruning stage 406 can be interleaved. In particular, pruning can be performed as matching documents are identified, so that candidates whose preliminary scores indicate they would very likely be discarded by the final ranking algorithm are dropped early and receive no further consideration.
The set of candidate documents retained after matching stage 404 and pruning stage 406 is evaluated further during a final ranking stage, as shown at block 408. During final ranking stage 408, a full ranking algorithm is used to determine a final score for each retained document. In some embodiments, the full ranking algorithm is performed over the original search query and the data for each retained document. The full ranking algorithm can employ a number of different ranking features to determine the final set of ranked documents. Search results are provided in response to the search query based on the final set of ranked documents, as shown at block 410.
Turning now to Fig. 5, a flow chart is provided that illustrates a method 500 for precomputing scores for document/atom pairs and indexing data, according to an embodiment of the present invention. Initially, as shown at block 502, a document is accessed. For example, a crawler can be employed to crawl the document and retrieve its data. At block 504, the document is processed to identify the atoms it contains. As noted above, this processing includes analyzing the text of the document to identify terms, n-grams, and n-tuples and determining which of these atoms should be indexed for the document. Statistics about query distribution, term distribution, and/or the simplified scoring function used during the funnel process can be used to statistically select the set of atoms that best represents the document.
As shown at block 506, a score is generated for each atom identified in the document. The score represents the importance of the atom in the context of the document. Any number of algorithms can be used to compute the score of an atom found in a document. By way of example only, the score can be based on the term frequency-inverse document frequency (TF/IDF) function known in the art. For example, the BM25F ranking function can be used.
As shown at block 508, the data is indexed in the search index. This can include storing information about the atoms found in the document and the score for each document/atom pair. These scores are the precomputed scores that can be used during the funnel process. In some embodiments, a posting list is created for each atom. Each posting list can include a list of the documents containing the atom and an indication of the precomputed score for each document/atom pair.
Referring next to Fig. 6, a flow chart is provided that illustrates a method 600 for retrieving an initial set of matching documents during the matching stage, according to an embodiment of the present invention. As shown at block 602, a search query is initially received. The search query can contain one or more search terms entered by a user employing a user device.
As shown at block 604, the received search query is rewritten. In particular, the terms of the search query are analyzed to identify one or more atoms that can be used to query the search index. This analysis can be similar to the analysis used to identify atoms in documents when document data was indexed. For example, statistics of terms and search queries can be used to identify the atoms in the search query. The rewritten query can include a set of conjunctions of atoms and their cascading variants.
As shown at block 606, the rewritten query is used to identify a set of matching documents from the search index. In particular, the atoms identified from the original query are used to query the search index and identify matching documents. As indicated above, the search index can include posting lists for the various atoms identified in the indexed documents. The posting lists corresponding to the atoms identified by the rewritten query can be located and used to identify matching documents. For example, the intersection of the posting lists for multiple atoms from the rewritten query can provide the initial set of matching documents.
Turning to Fig. 7, a flow chart is provided that illustrates a method 700 for pruning documents from the initial set of matching documents during the pruning stage, according to an embodiment of the present invention. As shown at block 702, a preliminary score is computed for each document using the precomputed scores stored in the search index. This can include retrieving the precomputed score for each atom of the document and using those scores in a simplified scoring function to generate the document's preliminary score. The simplified scoring function can be built so that it provides an estimate of the final score that the full ranking algorithm would produce. For example, the simplified scoring function can include a subset of the features used by the full ranking algorithm. In some embodiments, the simplified scoring function is defined using a process such as that described in U.S. patent application No. (not yet assigned) (attorney docket MFCP.157122), entitled "DECOMPOSABLE RANKING FOR EFFICIENT PRECOMPUTING".
As shown at block 704, documents are pruned from the initial set of matching documents based on the preliminary scores. In some embodiments, the top N documents are retained; that is, the N documents with the highest preliminary scores are kept for further processing. The number of documents retained can be based on the fidelity of the simplified scoring function used to compute the preliminary scores. That fidelity represents the ability of the simplified scoring function to provide a ranked set of documents similar to the one the final ranking algorithm would provide. If the correlation between the simplified scoring function and the final ranking algorithm, including the error in the simplified scoring function, is known, that knowledge can be used to determine how many documents to retain from the pruning stage. For example, if 1,000 search results are to be provided and it is known that, on average over all queries, the top 1,200 documents from the simplified scoring function include the top 1,000 documents from the final ranking algorithm, then the top 1,200 documents are retained from the pruning stage.
In some embodiments of the present invention, the funnel process can employ a search index that includes a reverse index and a forward index. The reverse index is ordered by atom, which facilitates fast retrieval of data during the matching and pruning stages of the funnel process. In particular, when a search query is received and atoms are identified from it, the posting lists in the reverse index that correspond to those atoms can be accessed quickly and used to identify matching documents and to retrieve the precomputed scores used by the simplified scoring function. The forward index is ordered by document, which facilitates the final ranking stage of the funnel process. In particular, the matching and pruning stages yield a pruned set of documents that is relatively small. The forward index stores document data, which is retrieved for the documents in the pruned set and used by the final ranking algorithm to provide the final set of ranked documents. In some embodiments, the forward index can be structured as described in U.S. patent application No. (not yet assigned) (attorney docket MFCP.157165), entitled "EFFICIENT FORWARD RANKING IN A SEARCH ENGINE". Additionally, in some embodiments, a hybrid distribution model can be used for the reverse and forward indexes, as described in U.S. patent application No. (not yet assigned) (attorney docket MFCP.157166), entitled "HYBRID DISTRIBUTION MODEL FOR SEARCH ENGINE INDEXES", which is incorporated herein by reference in its entirety.
Turning now to Fig. 8, an exemplary system is illustrated in which embodiments of the present invention can be employed. While some embodiments of the present invention, as discussed herein, are directed to a funnel process that evaluates and prunes document candidates over multiple stages, other embodiments are directed to identifying the most useful and most relevant atoms in a document and indexing those atoms in a search index in association with that document. Atoms can take several forms, including terms or unigrams, n-grams, and n-tuples. While traditionally only single terms have been indexed, as discussed further below, some types of atoms contain multiple terms, so combinations of terms can be indexed together. As used herein, a unigram maps to a single symbol or word, as defined by the particular tokenizer technology being used. A unigram can thus be a single word found in a document. An n-gram is a sequence of "n" consecutive or almost consecutive terms extracted from a document. An n-gram can be tight or loose. An n-gram is said to be tight if it corresponds to a run of consecutive terms. A loose n-gram contains terms in the order in which they appear in the document, but the terms are not required to be consecutive. Loose n-grams are typically used to represent a class of equivalent phrases that differ by insignificant words (for example, "if it rains I'll get wet" and "if it rains then I'll get wet"). A bigram, for example, is an n-gram of two terms, with "n" equal to 2. Similarly, a trigram is an n-gram of three terms, with "n" equal to 3. An n-tuple, as used herein, is a set of "n" terms that co-occur in a document, independent of order. The atoms identified in documents are indexed in one or more search indexes. In one embodiment, unigrams, n-grams, and n-tuples each have their own index.
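Because an n-tuple is order independent, one simple way to enumerate candidate n-tuples is to take every unordered combination of the distinct terms in a document. The following is a sketch under that assumption; the function name `extract_ntuples` and the lowercasing step are hypothetical, and a practical system would prune most of these candidates rather than index them all.

```python
from itertools import combinations

def extract_ntuples(terms, n=2):
    """Return the order-independent n-tuples of distinct co-occurring terms."""
    distinct = sorted({term.lower() for term in terms})
    # frozenset makes each n-tuple independent of term order.
    return [frozenset(combo) for combo in combinations(distinct, n)]

pairs = extract_ntuples(["Southern", "Sweden", "Approach"])
```

The three distinct terms produce three candidate 2-tuples, and the same sets result regardless of the order in which the terms appear in the document.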
Returning to Fig. 8, it should be understood that this and other arrangements described herein are set forth only as examples. Other arrangements and elements (for example, machines, interfaces, functions, orders, and groupings of functions) can be used in addition to or instead of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that can be implemented as discrete or distributed components or in conjunction with other components, and in any suitable combination and location. Various functions described herein as being performed by one or more entities can be carried out by hardware, firmware, and/or software. For instance, various functions can be carried out by a processor executing instructions stored in memory.
Among other components not shown, system 800 can include user device 802, index server 804, search index generator 808, and search index 818. Each of the components shown in Fig. 8 can be any type of computing device, such as computing device 100 described with reference to Fig. 1. The components can communicate with each other via network 806, which can include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet. It should be understood that any number of user devices, index servers, search index generators, and search indexes can be employed within system 800 within the scope of the present invention. Each can comprise a single device or multiple devices cooperating in a distributed environment. For instance, index server 804 can comprise multiple devices arranged in a distributed environment that collectively provide the functionality of index server 804 described herein. Similarly, as described herein, there can be multiple search indexes. These can be stored in search index 818 or in separate locations. Additionally, other components not shown can also be included within system 800.
Index server 804 generally operates to receive search queries from user devices, such as user device 802, and to provide search results in response by searching one or more search indexes. Search index generator 808 includes, among other components, atom identification component 810, information metric computing component 812, atom pruning component 814, and search index component 816. Generally, search index generator 808 is responsible for generating a search index, or populating an existing one, with the atom/document pairs determined to be most useful or most relevant for future queries. Atom identification component 810 is generally responsible for examining documents and extracting individual terms from them. In addition, atom identification component 810 identifies those atoms that are n-grams and n-tuples. For example, atom identification component 810 may identify n-grams by determining the positions of terms relative to one another. As mentioned, the terms that make up an n-tuple are position independent and can therefore be located anywhere in the document. The description of Figs. 9A, 9B, and 9C below further explains n-grams and n-tuples.
Information metric computing component 812 computes information metrics. The atoms identified in a document can be selected as most relevant or most useful to that document based on the information metric. Generally, an information metric is a ranking of an atom in relation to the particular document from which it was identified or parsed. The information metric estimates how useful the atom is in resolving queries in general. In one embodiment, information metric computing component 812 uses an algorithm to compute the information metric for each atom/document pair. A number of factors can be used in conjunction with the algorithm to compute the information metric. By way of example only, these factors can include the frequency of the atom, the information scores of the separate terms if the atom is an n-gram or n-tuple, the number of times the terms of the atom occur individually in the document compared with the number of times they occur together, and whether the atom or the terms that make it up appear in a query log. This last factor provides evidence that the terms of the atom are associated in some way, since they have previously been searched together. If each term of an atom occurs many times in the document but the terms appear close to one another far fewer times than that, this may mean that the terms merely happen to be in the same document, with no deeper meaning. If the terms occur closer together than would be expected by chance, their co-occurrence becomes more meaningful.
Atom pruning component 814 is responsible for pruning the number of atom/document pairs for each document, so that atoms unlikely to be relevant or important to a particular document are not indexed and do not take up excess storage. For a document that has 400 distinct terms, and thus 400 entries in a search index, there would be 80,000 pairs of terms for this single document if bigrams were also identified. That number grows much larger if trigrams and n-tuples are indexed as well. Not only is the number of atom/document pairs enormous, but the position of each term can optionally be stored in the search index, which takes up storage just as the atom/document pairs themselves do. As mentioned, an algorithm computes the information metric based on a number of factors, some of which are listed above, and the metric is later used to determine whether a particular atom/document pair is indexed. This determination is based on a threshold. The threshold can be set by examining previous runs, such as those of the previous day, or by initial experimentation. There are many ways to compute the threshold, and those mentioned here are provided only as examples. The threshold is therefore typically a predetermined value. Based on the threshold, atom pruning component 814 examines the information metric of each atom/document pair and decides whether that pair is indexed or discarded.
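The figure of 80,000 pairs is roughly what results from counting unordered pairs of 400 distinct terms. This reading is an assumption, since the text does not say how the figure was derived. A short calculation also shows how quickly the count grows for triples:

```python
import math

distinct_terms = 400
# Unordered pairs of distinct terms: 400 * 399 / 2
candidate_pairs = math.comb(distinct_terms, 2)
# Unordered triples of distinct terms: 400 * 399 * 398 / 6
candidate_triples = math.comb(distinct_terms, 3)
```

The pair count comes to 79,800 and the triple count to 10,586,800, which illustrates why pruning is needed before multi-term atoms are indexed.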
Once atom pruning component 814 has pruned the number of atom/document pairs as described above, search index component 816 can generate a search index or add entries to an existing search index. In one example, a search index is generated during the process described above, and its entries can then be merged into an existing search index, such as a main index. In one embodiment, search index 818 stores multiple search indexes. As mentioned previously, there can be a separate search index for each type of atom, including a unigram index, an n-gram index, and an n-tuple index. The unigram index is a mapping from a given term to a list of document identification/rank records. In one embodiment, the pruning process is not applied to unigrams, because the number of unigrams is generally manageable, so they need not be pruned, or at least need not be pruned as heavily as n-grams and n-tuples. The n-gram index contains the n-grams identified in a document by a sliding-window algorithm for a fixed "n". For example, for the term stream t1t2t3t4t5 with n=2, the n-gram atoms are (t1t2), (t2t3), (t3t4), and (t4t5). A stream of five terms with n=2 thus yields four atoms. These atoms are indexed and stored with (DocID, rank) records, where the rank approximates the rank of the two consecutive terms in the document.
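The sliding-window identification of tight n-grams for a fixed "n" can be sketched in a few lines. The function name `sliding_ngrams` is hypothetical, and the sketch covers only tight n-grams, not loose ones.

```python
def sliding_ngrams(terms, n):
    """Return the tight n-grams of `terms` found by a sliding window of width n."""
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]

stream = ["t1", "t2", "t3", "t4", "t5"]
bigrams = sliding_ngrams(stream, 2)
```

For the five-term stream with n=2 this yields the four atoms (t1, t2), (t2, t3), (t3, t4), and (t4, t5), matching the example above; a stream shorter than n yields no atoms.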
In some embodiments, the rank or information metric is not stored in the index but is instead used only to determine which atoms are indexed and which are discarded. The n-tuple index is similar to the n-gram index described above, except that exponentially more n-tuples are identified from a document, because the positions of the terms of an n-tuple are considered irrelevant. As such, n-tuples are generally pruned more heavily than n-grams and unigrams. Further, in some cases n-grams and n-tuples can be duplicates of one another, in which case the duplicates are discarded during pruning. Once identified and indexed, in one embodiment the atoms (unigrams, n-grams, and n-tuples) are stored in a dictionary using a priority hash index, as described in U.S. patent application No. 12/980582 (attorney docket MFCP.157119), entitled "PRIORITY HASH INDEX", which is incorporated herein by reference in its entirety.
Figs. 9A, 9B, and 9C illustrate examples of entries in a unigram search index, an n-gram search index, and an n-tuple search index, respectively, according to embodiments of the present invention. The examples of Figs. 9A, 9B, and 9C use the sample string "Holistic Approach in Southern Sweden". Fig. 9A illustrates the unigrams 900 identified from the string. As shown, five unigrams are identified, each consisting of a single word. Fig. 9B illustrates the n-grams 910 identified from the sample string. Because the terms of an n-gram are close or adjacent to one another, seven n-grams are identified, more than the number of unigrams. Fig. 9C illustrates the n-tuples 920 identified from the sample string. As shown, considerably more n-tuples are identified than unigrams or n-grams, because the terms of an n-tuple can be paired or otherwise grouped even when they are not adjacent or close to one another in the document. Thirteen n-tuples are identified from the five-word sample string. Figs. 9A, 9B, and 9C are provided to illustrate how much larger the number of n-tuples typically is than the number of unigrams or n-grams.
Referring to Figure 10, a flow chart is shown of a method 1000 for populating one or more search indexes with atoms identified in a plurality of documents, according to an embodiment of the present invention. Initially, at step 1010, a set of documents to be indexed in a search index is identified. Documents are generally indexed so that, when a search query is received, the most relevant documents can easily be found for the user by accessing the search index. At step 1012, atoms are identified in each document. As mentioned, an atom can be one or more of a unigram, an n-gram, or an n-tuple. A unigram is typically a single symbol or word, whereas an n-gram with "n" greater than one is multiple words or symbols arranged adjacent or close to one another in the document. For example, an n-gram can be a sequence of consecutive or almost consecutive terms extracted from a particular document, where "n" is the number of those terms. An n-tuple is multiple words or symbols that co-occur in the same document but are not required to be adjacent or close to one another. In one example, the terms of an n-tuple are not close to one another at all, such as when they appear in different parts of the document. Additionally, n-tuples are order independent.
At step 1014, a list of atom/document pairs is generated. An atom/document pair is an atom identified in a document together with the document identification of the document from which the atom was identified. At step 1016, an information metric is computed for each atom/document pair. The information metric represents a ranking of the atom in relation to the particular document, such as how relevant the atom is to the document when resolving search queries. In one embodiment, machine-learning tools are used to compute the information metric for each atom and also to select the most relevant atom/document pairs, which are those whose atoms are determined, based on the information metric and other factors, to be most relevant to the documents from which they were identified. The information metric can be computed with an algorithm that takes a number of factors into account. By way of example only, these factors can include the frequency in any corpus of the one or more terms that make up the atom, the information scores of the terms, how many times the one or more terms occur independently in the document compared with how many times they occur together, and whether and how often the atom appears in a query log. Other factors can also be used and are contemplated to be within the scope of the present invention.
At step 1018, a subset of the atom/document pairs is selected as being most relevant to the particular document. The selection at step 1018 is based on the information metric computed for each atom/document pair. Typically, a threshold is determined such that information metrics above the threshold are considered relevant and those below it are considered not relevant, or at least less relevant. In one embodiment, selecting the subset of atom/document pairs includes using a pruning algorithm to prune or limit the number of atom/document pairs to a smaller number, so that atom/document pairs that are less relevant than others are discarded and therefore not indexed. At step 1020, the search index is populated with the subset of atom/document pairs for the particular document. As mentioned, all atom/document pairs may initially be indexed in a separate index, after which only those selected as most relevant are populated or indexed in the main search index. Additionally, as mentioned, there may be more than one search index; thus, in one embodiment, unigrams are indexed in a unigram index, n-grams in an n-gram index, and n-tuples in an n-tuple index.
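Steps 1018 and 1020 can be sketched as a threshold filter that feeds one index per atom type. The function name `populate_indexes`, the tuple layout, and the threshold value are assumptions chosen for illustration.

```python
from collections import defaultdict

def populate_indexes(scored_pairs, threshold):
    """scored_pairs: iterable of (atom_type, atom, doc_id, metric)."""
    indexes = {"unigram": defaultdict(list),
               "ngram": defaultdict(list),
               "ntuple": defaultdict(list)}
    for atom_type, atom, doc_id, metric in scored_pairs:
        if metric >= threshold:  # keep only sufficiently relevant pairs
            indexes[atom_type][atom].append((doc_id, metric))
        # pairs below the threshold are discarded, i.e. never indexed
    return indexes

pairs = [
    ("unigram", "oslo", "doc1", 2.0),
    ("ngram", ("cheap", "flights"), "doc1", 1.6),
    ("ntuple", frozenset({"flights", "hotels"}), "doc1", 0.2),
]
indexes = populate_indexes(pairs, threshold=0.5)
```

Storing the metric next to the document identifier keeps the precomputed value available at query time.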
In one embodiment, a search query is received. The search query may be rewritten as at least one of a unigram, an n-gram, an n-tuple, or a combination thereof. The search index in which the atoms have been indexed is then accessed to determine the documents most relevant to the rewritten search query.
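A minimal sketch of query rewriting and lookup follows. It assumes a single index mapping each atom to `(doc_id, metric)` postings, rewrites the query into unigrams and adjacent bigrams only, and ranks documents by the sum of stored metrics; the names `rewrite_query` and `search` and the summing rule are illustrative assumptions rather than the method the patent prescribes.

```python
def rewrite_query(query):
    """Rewrite a query into unigram and adjacent-bigram atoms."""
    words = query.lower().split()
    bigrams = [tuple(words[i:i + 2]) for i in range(len(words) - 1)]
    return words + bigrams

def search(query, index):
    """index maps atom -> list of (doc_id, metric); sum metrics per document."""
    totals = {}
    for atom in rewrite_query(query):
        for doc_id, metric in index.get(atom, []):
            totals[doc_id] = totals.get(doc_id, 0.0) + metric
    return sorted(totals, key=totals.get, reverse=True)

index = {
    "oslo": [("doc1", 2.0), ("doc2", 1.0)],
    ("cheap", "flights"): [("doc2", 1.5)],
}
results = search("cheap flights oslo", index)
```

In this example `doc2` outranks `doc1` because it matches both the unigram `oslo` and the bigram `("cheap", "flights")`.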
Figure 11 is a flow chart illustrating a method 1100, according to an embodiment of the present invention, for populating one or more search indexes with atoms identified in a plurality of documents. Initially, at step 1110, atoms are identified from a first document. At step 1112, each of these atoms is categorized as a unigram, an n-gram, an n-tuple, or a combination thereof. At step 1114, an information metric is computed for each identified atom. As mentioned, the information metric represents a ranking of the atom/document pair in terms of how useful it is in resolving queries generally. Factors used in computing the information metric include, but are not limited to, the frequency of the atom in the first document, the proximity of the positions of two or more words of the atom in the first document, the relatedness of the words of the atom, and whether the words of the atom have previously been associated with one another, as evidenced by examining query logs. At step 1116, it is determined whether the information metric of each atom meets a predetermined threshold. Atoms that meet the threshold are considered, or known, to be those most relevant to the first document. In one embodiment the threshold may be arbitrary, while in another it may be based purely on a quantity, such as how many atoms are to be indexed. In yet another embodiment, the threshold is based on previous attempts made at finding atoms relevant to a particular document.
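The quantity-based threshold mentioned above can be sketched as follows: given a budget of atoms to index, the threshold is set to the metric of the last atom that fits. The function name `threshold_for_budget` and the sample metrics are hypothetical.

```python
def threshold_for_budget(metrics, budget):
    """Return the metric value that admits roughly `budget` atoms."""
    ranked = sorted(metrics, reverse=True)
    if budget <= 0 or not ranked:
        return float("inf")  # index nothing
    return ranked[min(budget, len(ranked)) - 1]

metrics = {"oslo": 2.0, "cheap": 1.2, "to": 0.1, "and": 0.05}
cutoff = threshold_for_budget(metrics.values(), budget=2)
kept = {atom for atom, m in metrics.items() if m >= cutoff}
```

When several atoms tie at the cutoff value, slightly more than `budget` atoms pass the threshold, which is why the docstring says "roughly".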
At step 1118, atoms that do not meet the predetermined threshold are discarded. Atoms that meet the threshold are incorporated into one or more search indexes, as shown at step 1120. In one embodiment, the one or more search indexes include a unigram index, an n-gram index, and an n-tuple index. In one embodiment, as previously mentioned, all unigrams identified in a document being indexed may be incorporated into the search index and are therefore not pruned. Additionally, in one embodiment the same process applies to n-grams. Alternatively, n-grams may be pruned to some extent, but not as much as n-tuples. As such, a greater percentage of n-tuples may be discarded than of n-grams and unigrams. In addition, some n-tuples may also be identified as n-grams, so the duplicates may be discarded during pruning.
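One way to picture type-dependent pruning with duplicate removal is a per-type threshold table, sketched below. The threshold values, the function name `prune_by_type`, and the rule that an n-tuple counts as a duplicate only when the matching n-gram is itself kept are all assumptions for illustration.

```python
def prune_by_type(scored_atoms, thresholds):
    """scored_atoms: list of (atom_type, words, metric); words is a tuple."""
    # Word sets of the n-grams that survive pruning.
    kept_ngrams = {frozenset(words) for kind, words, metric in scored_atoms
                   if kind == "ngram" and metric >= thresholds["ngram"]}
    kept = []
    for kind, words, metric in scored_atoms:
        # An n-tuple whose words already form a kept n-gram is a duplicate.
        if kind == "ntuple" and frozenset(words) in kept_ngrams:
            continue
        if metric >= thresholds[kind]:
            kept.append((kind, words, metric))
    return kept

# Hypothetical thresholds: keep every unigram, prune n-tuples hardest.
thresholds = {"unigram": 0.0, "ngram": 0.3, "ntuple": 0.8}
atoms = [
    ("unigram", ("to",), 0.05),
    ("ngram", ("cheap", "flights"), 0.6),
    ("ntuple", ("flights", "cheap"), 0.9),   # duplicate of the n-gram above
    ("ntuple", ("oslo", "hotels"), 0.6),     # below the n-tuple threshold
    ("ntuple", ("oslo", "flights"), 0.85),
]
kept = prune_by_type(atoms, thresholds)
```

A zero threshold for unigrams reproduces the embodiment in which no unigram is pruned, while the higher n-tuple threshold discards a larger share of n-tuples.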
Another embodiment of Figure 11 includes identifying atoms from a second document. Each of these atoms is categorized as a unigram, an n-gram, an n-tuple, or a combination thereof. An information metric is computed for each atom in relation to the second document. Some of the atoms may be the same as or similar to those identified from the first document but may have a different information metric because they were identified from a different document. It is then determined whether the information metric for each atom meets the predetermined threshold. Those that do are considered most relevant to the second document; those that do not are discarded. The atoms that meet the threshold are incorporated into the search index.
Figure 12 is a flow chart illustrating a method 1200, according to an embodiment of the present invention, for populating one or more search indexes with atoms identified in a plurality of documents. At step 1210, atoms are extracted from a document. The atoms may be categorized as unigrams, n-grams, or n-tuples. At step 1212, an information metric is computed for each atom. The information metric represents a ranking of the particular atom in relation to the document. Additionally, the computation of the information metric may be based on, for example, the frequency of the atom in the document, the proximity of the words of the atom in the document, the relatedness of the words of the atom, and whether the words of the atom have previously been associated with one another, as evidenced by examining query logs. Other factors may also be used and are contemplated to be within the scope of the present invention.
At step 1214, an information metric threshold is determined such that atom/document pairs whose information metric meets or exceeds the threshold are indexed. At step 1216, a portion of the atom/document pairs is discarded based on the information metric, for example when the information metric does not meet the threshold. At step 1218, one or more search indexes are populated with the atom/document pairs whose information metric meets or exceeds the threshold. In one embodiment, unigrams, n-grams, and n-tuples are indexed separately. At step 1220, the search index is accessed to identify documents relevant to atoms in a received search query.
As can be understood, embodiments of the present invention provide for computing an information metric for each atom/document pair and using the information metric to determine which atom/document pairs are indexed and which are discarded. The present invention has been described in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which the present invention pertains without departing from its scope.
From the foregoing, it will be seen that this invention is well adapted to attain all the ends and objects set forth above, together with other advantages that are obvious and inherent to the system and method. It will be understood that certain features and subcombinations are of utility and may be employed without reference to other features and subcombinations. This is contemplated by and is within the scope of the claims.

Claims (11)

1. A method for populating one or more search indexes with atoms identified in a plurality of documents, the method comprising: identifying a set of documents to be indexed in a search index;
for each document in the set of documents, identifying a plurality of atoms, the plurality of atoms including one or more unigrams, one or more n-grams, and one or more n-tuples;
generating a list of atom/document pairs based on the identified set of documents and the plurality of atoms;
computing an information metric for each atom/document pair, wherein the information metric represents a precomputed ranking of the atom in relation to a particular document, for use during a search query;
based on the information metric of each atom/document pair, selecting a subset of the atom/document pairs that are most relevant to the particular document from which the atoms were identified; and
populating the search index with the subset of atom/document pairs for the particular document,
wherein documents relevant to the search query are identified from the search index based on a pruning algorithm, the pruning algorithm computing a preliminary score for each document in the set of documents so as to select a subset of the set of documents based on the preliminary scores, wherein the preliminary score is computed using the information metric precomputed for each atom/document pair and a simplified scoring function, the simplified scoring function approximating a final ranking algorithm employed in identifying the relevant documents.
2. The method of claim 1, wherein the search index comprises one or more search indexes, the one or more search indexes including a unigram index, an n-gram index, and an n-tuple index.
3. The method of claim 1, wherein selecting the subset of atom/document pairs most relevant to the particular document further comprises: using a pruning algorithm to prune the number of atom/document pairs to a smaller number, such that atom/document pairs that are less relevant than other atom/document pairs are not indexed.
4. The method of claim 1, wherein a machine-learning tool is used to compute the information metric for the atom/document pairs and to select the subset of atom/document pairs most relevant to the particular document from which the atoms were identified.
5. The method of claim 1, further comprising:
receiving a search query;
rewriting the search query as at least one of one or more unigrams, one or more n-grams, or one or more n-tuples; and
using the rewritten search query, accessing the search index to determine the documents most relevant to the search query.
6. A method for populating one or more search indexes with atoms identified in a plurality of documents, the method comprising:
identifying a plurality of atoms from a first document of the plurality of documents to be indexed;
categorizing each atom of the plurality of atoms as one or more of a unigram, an n-gram, or an n-tuple;
computing, for each atom of the plurality of atoms, an information metric in relation to the first document, wherein the information metric represents a precomputed ranking of the atom for use during a search query;
determining whether the information metric of each atom of the plurality of atoms meets a predetermined threshold, wherein the atoms that meet the predetermined threshold are those most relevant to the first document;
discarding the atoms that do not meet the predetermined threshold;
incorporating the atoms that meet the predetermined threshold for the first document into one or more search indexes,
wherein identifying the first document from the one or more search indexes as relevant to the search query is based on a pruning algorithm, the pruning algorithm computing a preliminary score for the first document so as to select the first document based on the preliminary score, the first document being selected from a set of documents indexed in the one or more search indexes, wherein the preliminary score is computed using the information metric precomputed for each atom/document pair and a simplified scoring function, the simplified scoring function approximating a final ranking algorithm employed in identifying the first document as relevant to the search query.
7. The method of claim 6, wherein the information metric of a first atom identified in the first document represents a ranking of the first atom with respect to how useful the first atom is, for the first document, in resolving a search query that contains the first atom.
8. The method of claim 6, wherein computing the information metric for each atom of the plurality of atoms is based on one or more of the following: the frequency of the atom in the first document, the proximity of two or more words of the atom in the first document, the relatedness of two or more words of the atom, or whether two or more words of the atom have previously been associated with one another, as evidenced by examining query logs.
9. The method of claim 6, further comprising:
identifying a plurality of atoms from a second document;
categorizing each atom of the plurality of atoms as one or more of a unigram, an n-gram, or an n-tuple;
computing, for each atom of the plurality of atoms, an information metric in relation to the second document;
determining whether the information metric of each atom of the plurality of atoms meets the predetermined threshold, wherein the atoms that meet the predetermined threshold are those most relevant to the second document;
discarding the atoms that do not meet the predetermined threshold; and
incorporating the atoms that meet the predetermined threshold for the second document into one or more search indexes.
10. A method for populating one or more search indexes with atoms identified in a plurality of documents, the method comprising: extracting a plurality of atoms from a document, the plurality of atoms including one or more unigrams, one or more n-grams, and one or more n-tuples;
computing an information metric for each atom of the plurality of atoms, the information metric representing a precomputed ranking of the particular atom in relation to the document, for use in a search query, wherein the computation of the information metric is based on one or more of the following: the frequency of the atom in the document, the proximity of two or more words of the atom in the document, the relatedness of two or more words of the atom, or whether two or more words of the atom have previously been associated with one another, as evidenced by examining query logs;
determining an information metric threshold, wherein atom/document pairs whose information metric meets or exceeds the information metric threshold are indexed;
discarding a portion of the atom/document pairs based on the information metric, wherein the information metric of each discarded atom/document pair is below the information metric threshold;
populating the one or more search indexes by indexing the atom/document pairs whose information metric meets or exceeds the information metric threshold, wherein unigrams, n-grams, and n-tuples are each indexed separately in different search indexes; and
accessing the one or more search indexes to identify documents relevant to atoms in a query,
wherein identifying the relevant documents is based at least in part on a pruning algorithm, the pruning algorithm computing a preliminary score for the documents of the atom/document pairs, wherein the preliminary score is computed using the information metric and a simplified scoring function, the simplified scoring function approximating a final ranking algorithm employed in identifying the relevant documents.
11. An apparatus for populating one or more search indexes with atoms identified in a plurality of documents, comprising modules for performing the method of any one of claims 1-10.
CN201210060934.5A 2011-03-10 2012-03-09 Selection of atoms for search engine retrieval Active CN102682073B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US13/045,278 US9342582B2 (en) 2010-11-22 2011-03-10 Selection of atoms for search engine retrieval
US13/045278 2011-03-10

Publications (2)

Publication Number Publication Date
CN102682073A CN102682073A (en) 2012-09-19
CN102682073B true CN102682073B (en) 2017-04-12

Family

ID=46814001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210060934.5A Active CN102682073B (en) 2011-03-10 2012-03-09 Selection of atoms for search engine retrieval

Country Status (1)

Country Link
CN (1) CN102682073B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103853832B (en) * 2014-03-11 2017-07-28 上海爱数信息技术股份有限公司 Customizable data grasping means in a kind of text retrieval system
CN114724639B (en) * 2022-06-10 2022-09-16 国家超级计算天津中心 Preprocessing acceleration method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN102682073A (en) 2012-09-19

Similar Documents

Publication Publication Date Title
CN106663124B (en) Generating and using knowledge-enhanced models
CN103838833B (en) Text retrieval system based on correlation word semantic analysis
US8073877B2 (en) Scalable semi-structured named entity detection
US11138285B2 (en) Intent encoder trained using search logs
US8037068B2 (en) Searching through content which is accessible through web-based forms
US20100191758A1 (en) System and method for improved search relevance using proximity boosting
JP2009543255A (en) Map hierarchical and sequential document trees to identify parallel data
CN111881334A (en) Keyword-to-enterprise retrieval method based on semi-supervised learning
US9569525B2 (en) Techniques for entity-level technology recommendation
US11727058B2 (en) Unsupervised automatic taxonomy graph construction using search queries
WO2023108980A1 (en) Information push method and device based on text adversarial sample
CN115470338A (en) Multi-scene intelligent question and answer method and system based on multi-way recall
US20090327269A1 (en) Pattern generation
Dadure et al. Embedding and generalization of formula with context in the retrieval of mathematical information
CN102682073B (en) Selection of atoms for search engine retrieval
KR20120038418A (en) Searching methods and devices
CN110851584A (en) Accurate recommendation system and method for legal provision
US11687514B2 (en) Multimodal table encoding for information retrieval systems
US8027957B2 (en) Grammar compression
CN112214511A (en) API recommendation method based on WTP-WCD algorithm
Rani et al. Telugu text summarization using LSTM deep learning
CN116992874B (en) Text quotation auditing and tracing method, system, device and storage medium
Hahm et al. Investigation into the existence of the indexer effect in key phrase extraction
CN113868387A (en) Word2vec medical similar problem retrieval method based on improved tf-idf weighting
CN116050396A (en) Sensitive information identification method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1172124

Country of ref document: HK

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: MICROSOFT TECHNOLOGY LICENSING LLC

Free format text: FORMER OWNER: MICROSOFT CORP.

Effective date: 20150623

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20150623

Address after: Washington State

Applicant after: Microsoft Technology Licensing LLC

Address before: Washington State

Applicant before: Microsoft Corp.

GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: GR

Ref document number: 1172124

Country of ref document: HK