Detailed Description
The subject matter of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms "step" and/or "block" may be used herein to connote different elements of the methods employed, the terms should not be interpreted as implying any particular order among or between the various steps disclosed herein unless and except when the order of individual steps is explicitly described.
Embodiments of the present invention provide an indexing and searching process that allows a large collection of documents to be indexed and searched in a cost-effective manner while meeting strict latency constraints. In accordance with embodiments of the present invention, a process is employed that evaluates and prunes candidate documents in multiple stages. Conceptually, the process resembles a funnel: candidate documents are evaluated and pruned as the analysis becomes more complex from stage to stage. As the process continues through the stages, more expensive computations are applied and the number of candidate documents may be reduced by multiple orders of magnitude. Different strategies are applied at each stage to allow search results to be returned from a large collection of documents quickly and efficiently. Additionally, the strategy used at each stage may be designed to complement the strategies used at the other stages, making the process more efficient.
The search index employed by embodiments of the present invention indexes higher-order primitives, or "atoms," from documents, as opposed to simply indexing single terms. As used herein, an "atom" may refer to a variety of units of a query or a document. These units may include, for example, a term, an n-gram, an n-tuple, a k-near n-tuple, etc. A term maps down to a single symbol or word as defined by the particular tokenizer technology being used. In one embodiment, a term is a single character. In another embodiment, a term is a single word or a grouping of words. An n-gram is a sequence of "n" consecutive or almost consecutive terms that may be extracted from a document. An n-gram is said to be "tight" if it corresponds to a run of consecutive terms, and "loose" if it contains terms in the order in which they appear in the document but the terms are not necessarily consecutive. Loose n-grams are typically used to represent a class of equivalent phrases that differ by insignificant words (for example, "if it rains, I will get drenched" and "I will get drenched if it rains"). An n-tuple, as used herein, is a set of "n" terms that co-occur (order independent) in a document. Further, a k-near n-tuple, as used herein, refers to a set of "n" terms that co-occur within a window of "k" terms in a document. Thus, an atom is generally defined as a generalization of all of the above. Implementations of embodiments of the present invention may use different varieties of atoms, but as used herein, the term "atom" generally describes each of the varieties described above.
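The atom varieties defined above can be illustrated with a minimal Python sketch. It is not taken from the described embodiments: the whitespace tokenizer, the function names, and the `max_span` bound on loose n-grams are all assumptions made for illustration.

```python
from itertools import combinations


def tokenize(text):
    """Whitespace tokenizer (assumption; stands in for the actual tokenizer)."""
    return text.lower().split()


def tight_ngrams(tokens, n):
    """Runs of n consecutive terms."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def loose_ngrams(tokens, n, max_span):
    """n terms in document order, not necessarily consecutive.

    max_span (assumption) bounds how far apart the first and last term may be.
    """
    out = set()
    for start in range(len(tokens)):
        window = tokens[start:start + max_span]
        for idx in combinations(range(len(window)), n):
            if idx[0] == 0:  # anchor each n-gram at the window start
                out.add(tuple(window[i] for i in idx))
    return out


def n_tuples(tokens, n):
    """Order-independent sets of n distinct terms co-occurring in the document."""
    return {frozenset(c) for c in combinations(sorted(set(tokens)), n)}


def k_near_n_tuples(tokens, n, k):
    """Order-independent sets of n distinct terms within a window of k terms."""
    out = set()
    for start in range(len(tokens)):
        window = sorted(set(tokens[start:start + k]))
        out.update(frozenset(c) for c in combinations(window, n))
    return out
```

For the text "new york city hall", the pair {new, hall} is a 2-tuple of the document but not a 3-near 2-tuple, because the two terms never fall inside the same window of three terms.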
When the search index is built, each document is analyzed to identify the atoms in the document and to generate, for each atom, a pre-computed score or rank that represents the atom's importance or relevance to the context of the document. The search index stores information regarding the pre-computed scores generated for document/atom pairs, which are used during the funnel process.
Fig. 2 illustrates the multiple stages of a funnel process in accordance with one embodiment of the present invention. The stages of the process shown in Fig. 2 are performed after a search query is received and include an L0 matching stage 202, an L1 preliminary ranking stage 204, and an L2 final ranking stage 206. As represented in Fig. 2, the number of candidate documents is reduced as the process progresses.
When a search query is received, the search query is analyzed to identify atoms. During the L0 matching stage 202, the atoms are used to query the search index and identify an initial set of matching documents that contain the atoms from the search query. As shown in Fig. 2, this may reduce the number of candidate documents from all documents indexed in the search index to those documents that match the atoms from the search query.
In the L1 preliminary ranking stage 204, a simplified scoring function is used to compute a preliminary score for each candidate document retained from the L0 matching stage 202. The simplified scoring function operates on, among other things, the pre-computed scores stored in the search index for document/atom pairs. In some embodiments, the simplified scoring function may serve as an approximation of the final ranking algorithm that will ultimately be used to rank documents. However, the simplified scoring function is a less expensive operation than the final ranking algorithm, which allows a large number of candidate documents to be processed quickly. Candidate documents are pruned based on the preliminary score. For instance, only the top N documents having the highest preliminary scores may be retained.
In the L2 final ranking stage 206, the candidate documents retained from the L1 preliminary ranking stage 204 are evaluated using a final ranking algorithm. Compared with the simplified scoring function used during the L1 preliminary ranking stage 204, the final ranking algorithm is a more expensive operation with a larger number of ranking features. However, the final ranking algorithm is applied to a much smaller number of candidate documents. The final ranking algorithm provides a set of ranked documents, and search results are provided in response to the original search query based on that set of ranked documents.
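The three stages can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the described implementation: the index is modeled as a dictionary mapping each atom to `{doc_id: precomputed_score}`, the simplified scoring function is assumed to be a plain sum of pre-computed scores, and `full_ranker` is a hypothetical callable standing in for the final ranking algorithm.

```python
import heapq


def l0_match(index, query_atoms):
    """L0: documents that contain every query atom."""
    postings = [set(index.get(atom, {})) for atom in query_atoms]
    return set.intersection(*postings) if postings else set()


def l1_preliminary(index, query_atoms, candidates, top_n):
    """L1: cheap score (assumed: sum of pre-computed scores); keep the top N."""
    def score(doc_id):
        return sum(index[atom][doc_id] for atom in query_atoms)
    return heapq.nlargest(top_n, candidates, key=score)


def l2_final(query, candidates, full_ranker):
    """L2: expensive ranking, applied only to the pruned candidates."""
    return sorted(candidates, key=lambda d: full_ranker(query, d), reverse=True)


def funnel_search(index, query, query_atoms, top_n, full_ranker):
    matched = l0_match(index, query_atoms)
    pruned = l1_preliminary(index, query_atoms, matched, top_n)
    return l2_final(query, pruned, full_ranker)
```

The point of the design is that `full_ranker` is called at most `top_n` times per query, regardless of how many documents the L0 stage matches.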
Accordingly, in one aspect, an embodiment of the present invention is directed to one or more computer storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform a method. The method includes receiving a search query and reformulating the search query to identify one or more atoms. The method also includes identifying an initial set of documents from a search index based on the one or more atoms. The method further includes computing a preliminary score for each document in the initial set of documents using a simplified scoring function and the pre-computed scores stored in the search index for document/atom pairs for the one or more atoms and the initial set of documents. The method also includes selecting a pruned set of documents from the initial set of documents based on the preliminary scores. The method further includes computing a ranking score for each document in the pruned set of documents using a full ranking algorithm to provide a set of ranked documents. The method still further includes providing search results for presentation to an end user based on the set of ranked documents.
In another embodiment, an aspect of the invention is directed to a computerized system including at least one processor and one or more computer storage media. The system includes a query reformulation component that analyzes a received search query to identify one or more atoms based on terms contained in the received search query and generates a reformulated query. The system also includes a document matching component that queries a search index using the reformulated query to identify an initial set of matching documents. The system further includes a document pruning component that computes a preliminary score for each document in the initial set of matching documents using a simplified scoring function and identifies a pruned set of documents based on the preliminary scores. The system still further includes a final document ranking component that computes a ranking score for each document in the pruned set of documents using a full ranking algorithm.
A further embodiment of the present invention is directed to a method for providing search results in response to a search query using a staged process. The method includes receiving a search query and identifying one or more atoms from the search query. The method also includes identifying an initial set of documents containing the one or more atoms, computing a preliminary score for each document in the initial set of documents using a simplified scoring function, and selecting a subset of the documents for further processing based on the preliminary scores. The method further includes computing a ranking score for each document in the subset of documents using a final ranking algorithm. The method still further includes providing a set of search results based on the ranking scores.
In addition to the embodiments described above, methods are also described herein for identifying relevant atoms from documents and indexing atom/document pairs. For instance, atoms (which may be categorized as unigrams, n-grams, or n-tuples) are identified in or extracted from a document. An information measure is computed for each atom/document pair. The computation of the information measure may be based on a number of factors, and may even be performed by a machine-learning tool that learns how to compute the information measure. A threshold is used to discard, based on the information measure, those atom/document pairs that are considered less relevant or useful for resolving queries than other atom/document pairs. Those considered most relevant are indexed in a search index for use when search queries are received in the future.
In accordance with a first aspect of the invention, a method is provided for populating one or more search indexes with atoms identified in a plurality of documents. The method includes identifying a set of documents to be indexed in a search index and, for each document in the set, identifying a plurality of atoms, the plurality of atoms including one or more unigrams, one or more n-grams, and one or more n-tuples. Additionally, the method includes generating a list of atom/document pairs based on the identified set of documents and the plurality of atoms, and computing an information measure for each atom/document pair, wherein the information measure represents a ranking of the atom in relation to a particular document. Further, the method includes selecting, based on the information measure of each atom/document pair, a subset of the atom/document pairs that are most relevant to the particular document in which the atoms were identified. The method additionally includes populating the search index with the subset of atom/document pairs for the particular document.
In accordance with a second aspect of the invention, one or more computer storage media storing computer-useable instructions are provided that, when used by a computing device, cause the computing device to perform a method for populating one or more search indexes with atoms identified in a plurality of documents. The method includes identifying a plurality of atoms from a first document of the plurality of documents to be indexed, categorizing each of the plurality of atoms as one or more of a unigram, an n-gram, or an n-tuple, and computing an information measure for each of the plurality of atoms in relation to the first document. Further, the method includes determining whether the information measure of each of the plurality of atoms meets a predetermined threshold. The atoms that meet the predetermined threshold are those most relevant to the first document. The method also includes discarding the atoms that do not meet the predetermined threshold and incorporating the atoms that meet the predetermined threshold, in association with the first document, into the one or more search indexes.
In accordance with a third aspect of the invention, one or more computer storage media storing computer-useable instructions are provided that, when used by a computing device, cause the computing device to perform a method for populating one or more search indexes with atoms identified in a plurality of documents. The method includes extracting a plurality of atoms from a document, the plurality of atoms including one or more unigrams, one or more n-grams, and one or more n-tuples, and, for each of the plurality of atoms, computing an information measure that represents a ranking of the particular atom in relation to the document. The computation of the information measure is based on one or more of the following: the frequency in the document of two or more terms of the atom, the proximity in the document of the two or more terms of the atom, the relevance of the two or more terms of the atom, or whether the two or more terms of the atom have previously been linked together, as evidenced by inspection of a query log. The method further includes determining an information measure threshold. Atom/document pairs whose information measure meets or exceeds the information measure threshold are indexed. Additionally, the method includes discarding a portion of the atom/document pairs based on the information measure. The information measure of each discarded atom/document pair is below the information measure threshold. The one or more search indexes are populated by indexing the atom/document pairs whose information measure meets or exceeds the information measure threshold, wherein unigrams, n-grams, and n-tuples are indexed separately. The one or more search indexes are accessed to identify documents relevant to the atoms in a query.
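The threshold-based selection described in these aspects can be sketched as follows. This is an illustrative Python sketch only: the way the information measure is computed is left to a caller-supplied `measure` callable (hypothetical), and the three index names are assumptions chosen to reflect the statement that unigrams, n-grams, and n-tuples are indexed separately.

```python
from collections import defaultdict


def populate_indexes(doc_atoms, measure, threshold):
    """Index only atom/document pairs whose information measure meets the threshold.

    doc_atoms: {doc_id: [(atom_type, atom), ...]} with atom_type in
               {"unigram", "ngram", "ntuple"} (names are assumptions).
    measure:   callable(doc_id, atom_type, atom) -> float (hypothetical).
    """
    indexes = {
        "unigram": defaultdict(dict),
        "ngram": defaultdict(dict),
        "ntuple": defaultdict(dict),
    }
    for doc_id, atoms in doc_atoms.items():
        for atom_type, atom in atoms:
            value = measure(doc_id, atom_type, atom)
            if value >= threshold:
                # Kept pair: store the measure alongside the posting.
                indexes[atom_type][atom][doc_id] = value
            # Pairs below the threshold are discarded (never indexed).
    return indexes
```

Discarding pairs at indexing time keeps the posting lists short, which is what later allows the matching stage to stay cheap.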
Having described an overview of embodiments of the present invention, an exemplary operating environment in which embodiments of the present invention may be implemented is described below in order to provide a general context for various aspects of the present invention. Referring initially to Fig. 1 in particular, an exemplary operating environment for implementing embodiments of the present invention is shown and designated generally as computing device 100. Computing device 100 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should computing device 100 be interpreted as having any dependency or requirement relating to any one or combination of the components illustrated.
The invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal digital assistant or other handheld device. Generally, program modules, including routines, programs, objects, components, data structures, etc., refer to code that performs particular tasks or implements particular abstract data types. The invention may be practiced in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, more specialized computing devices, etc. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
With reference to Fig. 1, computing device 100 includes a bus 110 that directly or indirectly couples the following devices: memory 112, one or more processors 114, one or more presentation components 116, input/output (I/O) ports 118, input/output components 120, and an illustrative power supply 122. Bus 110 represents what may be one or more busses (such as an address bus, data bus, or combination thereof). Although the various blocks of Fig. 1 are shown with lines for the sake of clarity, in reality, delineating the various components is not so clear, and metaphorically, the lines would more accurately be grey and fuzzy. For example, one may consider a presentation component such as a display device to be an I/O component. Also, processors have memory. The inventors recognize that such is the nature of the art, and reiterate that the diagram of Fig. 1 is merely illustrative of an exemplary computing device that can be used in connection with one or more embodiments of the present invention. No distinction is made between such categories as "workstation," "server," "laptop," "hand-held device," etc., as all are contemplated within the scope of Fig. 1 and are referred to as a "computing device."
Computing device 100 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computing device 100 and include both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by computing device 100. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
Memory 112 includes computer storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, etc. Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O components 120. Presentation component(s) 116 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, etc.
I/O ports 118 allow computing device 100 to be logically coupled to other devices including I/O components 120, some of which may be built in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc.
Referring now to Fig. 3, a block diagram is provided illustrating an exemplary system 300 in which embodiments of the present invention may be employed. It should be understood that this and other arrangements described herein are set forth only as examples. Other arrangements and elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used in addition to or instead of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, and in any suitable combination and location. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software. For instance, various functions may be carried out by a processor executing instructions stored in memory.
Among other components not shown, system 300 may include a user device 302, a content server 304, and a search engine server 306. Each of the components shown in Fig. 3 may be any type of computing device, such as computing device 100 described with reference to Fig. 1, for example. The components may communicate with each other via a network 308, which may include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet. It should be understood that any number of user devices, content servers, and search engine servers may be employed within system 300 within the scope of the present invention. Each may comprise a single device or multiple devices cooperating in a distributed environment. For instance, search engine server 306 may comprise multiple devices arranged in a distributed environment that collectively provide the functionality of the search engine server 306 described herein. Additionally, other components not shown may also be included within system 300.
Search engine server 306 generally operates to receive search queries from user devices, such as user device 302, and to provide search results in response to the search queries. Search engine server 306 includes, among other things, an indexing component 310, a user interface component 312, a query reformulation component 314, a document matching component 316, a document pruning component 318, and a final document ranking component 320.
Indexing component 310 operates to index data regarding documents maintained by content servers, such as content server 304. For instance, a crawling component (not shown) may be employed to crawl content servers and access information regarding the documents maintained by the content servers. Indexing component 310 then indexes data regarding the crawled documents in search index 322. In embodiments, indexing component 310 indexes the atoms found in documents and, for each document in which an atom is found, scoring information that indicates the importance of the atom in the context of the document. Any number of algorithms may be employed to compute a score for an atom found in a document. By way of example only, the score may be based on the term frequency-inverse document frequency (TF/IDF) functions known in the art. For instance, the BM25F ranking function may be employed. The scores generated for document/atom pairs are stored in search index 322 as pre-computed scores.
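As an illustration of a pre-computed score, the following Python sketch computes a basic TF-IDF value for every document/term pair. It uses the plain textbook form (relative term frequency multiplied by the logarithm of the inverse document frequency), not BM25F, and it treats single terms as the only atoms; both choices are simplifying assumptions made here for illustration.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Compute a TF-IDF score for each (doc_id, term) pair.

    docs: {doc_id: list of tokens}
    Returns {(doc_id, term): score}, suitable for storing as pre-computed scores.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            scores[(doc_id, term)] = tf * idf
    return scores
```

A term that occurs in every document receives an IDF of zero under this form, so it contributes nothing to the score of any document.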
In embodiments, indexing component 310 analyzes each document to identify terms, n-grams, and n-tuples and to determine which of these atoms should be indexed for the document. During processing of the documents to be indexed, statistics about query distribution, term distribution, and/or the simplified scoring function to be used during the funnel process may be used to statistically select the best set of atoms to represent the document. These selected atoms are indexed in search index 322 with the pre-computed scores, which allows documents to be pruned efficiently early in the funnel process.
Although not required, in some embodiments of the present invention, search index 322 may include both a reverse index (ordered by atom) and a forward index (ordered by document). The reverse index may include a number of posting lists, each posting list being directed to an atom and listing the documents containing that atom, along with the pre-computed score for each document/atom pair. As will be described in further detail below, the reverse index and the forward index may be used at different stages of the funnel process.
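The two index layouts can be sketched in a few lines of Python. The data shapes below (a posting list as a list of `(doc_id, score)` pairs sorted by document identifier, and the forward index as a per-document dictionary) are assumptions chosen for illustration; the text does not prescribe a storage format.

```python
from collections import defaultdict


def build_indexes(scored_pairs):
    """Build a reverse index and a forward index from scored document/atom pairs.

    scored_pairs: iterable of (doc_id, atom, precomputed_score).
    """
    reverse = defaultdict(list)  # atom -> posting list [(doc_id, score), ...]
    forward = defaultdict(dict)  # doc_id -> {atom: score}
    for doc_id, atom, score in scored_pairs:
        reverse[atom].append((doc_id, score))
        forward[doc_id][atom] = score
    for postings in reverse.values():
        postings.sort()          # order each posting list by doc_id
    return dict(reverse), dict(forward)
```

The reverse index answers "which documents contain this atom?", which suits matching, while the forward index answers "which atoms and scores belong to this document?", which suits per-document scoring.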
User interface component 312 provides an interface to user devices, such as user device 302, that allows users to submit search queries to search engine server 306 and to receive search results from search engine server 306. User device 302 may be any type of computing device employed by a user to submit search queries and receive search results. By way of example only and not limitation, user device 302 may be a desktop computer, a laptop computer, a tablet computer, a mobile device, or another type of computing device. User device 302 may include an application that allows a user to enter a search query and submit the search query to search engine server 306 to retrieve search results. For instance, user device 302 may include a web browser that includes a search input box or allows a user to access a search page to submit a search query. Other mechanisms for submitting search queries to search engines are contemplated to be within the scope of embodiments of the present invention.
When a search query is received via user interface component 312, query reformulation component 314 operates to reformulate the query. The query is reformulated from its free-text form into a format that facilitates querying search index 322 based on how data is indexed in search index 322. In embodiments, the terms of the search query are analyzed to identify atoms that may be used to query search index 322. The atoms may be identified using techniques similar to those used to identify atoms in documents when the documents are indexed in search index 322. For instance, atoms may be identified based on statistics of terms and query distribution information. Query reformulation component 314 may provide a set of conjunctions of atoms and cascading variants of these atoms.
Document matching component 316 employs the reformulated query to query search index 322 and identify a set of matching documents. For instance, the reformulated query may include two or more atoms, and document matching component 316 may retrieve the intersection of the posting lists of these atoms to provide an initial set of matching documents.
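Intersecting posting lists is a standard operation; one common way to do it, sketched here in Python, is a two-pointer merge over lists kept sorted by document identifier. The assumption that posting lists are sorted lists of comparable identifiers, and the heuristic of starting with the shortest list, are illustrative choices rather than details from the text.

```python
def intersect(p1, p2):
    """Two-pointer intersection of two posting lists sorted by doc id."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def match_all(posting_lists):
    """Documents present in every posting list; shortest lists are merged first."""
    if not posting_lists:
        return []
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
        if not result:
            break  # no document can match once the running result is empty
    return result
```

For example, `match_all([[1, 3, 5, 7, 9], [3, 4, 5, 9], [5, 9, 10]])` yields `[5, 9]`.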
Document pruning component 318 operates by pruning documents from the initial set of matching documents. This may include computing a preliminary score for each document in the initial set of matching documents using the pre-computed scores for document/atom pairs stored in search index 322. The preliminary score may be based on a simplified scoring function that is tuned for performance and recall. In some embodiments, the simplified scoring function used to generate the preliminary score is built based on the full ranking algorithm that is subsequently used to provide the final set of ranked documents. As such, the simplified scoring function serves as an approximation of the final ranking algorithm. For instance, an approach such as that described in U.S. Patent Application No. (not yet assigned) (Attorney Docket No. MFCP.157122), entitled "DECOMPOSABLE RANKING FOR EFFICIENT PRECOMPUTING," may be employed to build the simplified scoring function. In some embodiments, the simplified scoring function contains a subset of the ranking features from the final ranking algorithm.
A number of different approaches may be employed by document pruning component 318 to prune the initial set of documents. In some embodiments, document pruning component 318 may retain a predetermined number of matches from the initial set of documents and remove the other documents from further consideration (i.e., keep the top N matches). For instance, document pruning component 318 may retain the one thousand documents having the highest preliminary scores. The number of matches retained by document pruning component 318 may be based on the fidelity confidence of the simplified scoring function used to generate the preliminary scores. The fidelity confidence represents the ability of the simplified scoring function to provide a set of documents that matches the set of documents that would be provided by the full ranking algorithm. For instance, it may take, on average, 1200 documents from the simplified scoring function to obtain the top 1000 documents that would be provided by the final ranking algorithm. In other embodiments, instead of retaining a predetermined number of documents, document pruning component 318 may retain the documents having a preliminary score above a particular threshold.
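Both pruning policies described above, keeping a fixed number of top matches and keeping everything above a score threshold, can be sketched briefly in Python. The function names and the representation of preliminary scores as a `{doc_id: score}` dictionary are assumptions for illustration.

```python
import heapq


def prune_top_n(prelim, keep_n):
    """Retain the keep_n documents with the highest preliminary scores.

    To recover the top 1000 documents of the final ranking, keep_n might be
    set somewhat higher (the text gives 1200 as an example) to allow for the
    simplified scoring function being only an approximation.
    """
    return heapq.nlargest(keep_n, prelim, key=prelim.get)


def prune_by_threshold(prelim, threshold):
    """Retain the documents whose preliminary score is above the threshold."""
    return [doc_id for doc_id, score in prelim.items() if score > threshold]
```

A fixed `keep_n` bounds the cost of the final ranking stage, whereas a threshold lets the size of the retained set vary with the query.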
In some embodiments, document matching component 316 and document pruning component 318 may be tightly coupled such that document matching and pruning are combined into a single process that may be iterated multiple times. For instance, preliminary scores may be computed as matching documents are identified and used to remove documents that would very likely be discarded by the full ranking algorithm.
In some embodiments, a search index employing tiered posting lists (such as that described in U.S. Patent Application No. (not yet assigned) (Attorney Docket No. MFCP.157121), entitled "TIERING OF POSTING LISTS IN SEARCH ENGINE INDEX") may be employed to facilitate this matching/pruning process. Each posting list would be associated with a given atom and would include tiers ordered by the pre-computed scores assigned to documents, which represent the relevance of the given atom to the context of each document. Within each tier, the postings may be sorted internally by document. Using such a search index, document matching component 316 would use the first tier (having the highest pre-computed scores) to obtain an initial set of documents, and the initial set of documents would be pruned using the simplified scoring function. If a sufficient number of documents is provided, the matching/pruning process may end. Alternatively, if a sufficient number of documents is not provided, matching and pruning may be performed iteratively on lower-level tiers until a sufficient number of documents remains.
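The iterative matching/pruning loop over tiers might look like the following Python sketch. It is one possible reading, with several assumptions: each atom maps to a list of tiers (highest pre-computed scores first), lower tiers are merged cumulatively into what has already been read so that a document whose atoms sit in different tiers can still match, the simplified score is a plain sum, and pruning is a minimum-score cut. None of these details are specified in the text or in the referenced application.

```python
def tiered_match_and_prune(tiered_index, query_atoms, min_score, enough):
    """Match and prune tier by tier until enough documents are retained.

    tiered_index: {atom: [tier0, tier1, ...]}, each tier a {doc_id: score}
                  dict, tier0 holding the highest pre-computed scores.
    """
    depth = max((len(tiered_index.get(a, [])) for a in query_atoms), default=0)
    seen = {a: {} for a in query_atoms}  # postings read so far, per atom
    kept = {}
    for level in range(depth):
        # Read one more tier of every atom's posting list.
        for atom in seen:
            tiers = tiered_index.get(atom, [])
            if level < len(tiers):
                seen[atom].update(tiers[level])
        # Match: documents containing every query atom in the tiers read so far.
        matched = set.intersection(*(set(p) for p in seen.values()))
        # Prune: keep documents whose simplified score clears the cut.
        kept = {}
        for doc_id in matched:
            score = sum(seen[atom][doc_id] for atom in seen)
            if score >= min_score:
                kept[doc_id] = score
        if len(kept) >= enough:
            break  # sufficient documents; lower tiers are never read
    return kept
```

When the top tier alone yields enough documents, the lower tiers, which typically hold the bulk of the postings, are never touched.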
The set of documents retained from the matching and pruning process provided by document matching component 316 and document pruning component 318 is evaluated by final document ranking component 320 to provide a final set of ranked documents. Final document ranking component 320 employs a full ranking algorithm that may operate on the original search query and the set of documents retained from matching and pruning. The full ranking algorithm employs more ranking features and more data from the documents than the simplified scoring function used during the pruning process. As such, the full ranking algorithm is a more expensive operation that requires more processing and takes longer to compute. However, because the set of candidate documents has been pruned, the full ranking algorithm is performed on a smaller set of documents.
Final document ranking component 320 provides a final set of ranked documents, which is indicated to user interface component 312. User interface component 312 then communicates search results that include at least a portion of the final set of ranked documents to user device 302. For instance, user interface component 312 may generate or otherwise provide a search engine results page (SERP) listing search results based on the final set of ranked documents.
Turning next to Fig. 4, a flow diagram is provided that illustrates an overall method 400 for employing a staged process to return search results for a search query in accordance with an embodiment of the present invention. The staged process begins with a pre-computation/indexing stage, as shown at block 402. This stage is an offline stage; that is, it is performed separately from any received search query. In the pre-computation/indexing stage 402, documents are crawled and data regarding the documents is indexed in a search index. A process for indexing document data during the pre-computation/indexing stage 402 in accordance with one embodiment is discussed in further detail below with reference to Fig. 5.
The stages shown in Fig. 4 after the pre-computation/indexing stage 402 make up an online phase, in which a search query is received and search results are returned in response. The first stage of the online phase is the matching stage, as shown at block 404. During the matching stage 404, a search query is received and reformulated, and the reformulated query is used to identify matching documents from the search index. A process for identifying matching documents during the matching stage 404 in accordance with one embodiment is discussed in further detail below with reference to Fig. 6.
The next stage after matching is the pruning stage, as shown at block 406. The pruning stage 406 takes the initial set of documents from the matching stage 404 and determines a preliminary score for each document using a simplified scoring function. Based on the preliminary scores, documents are pruned from the initial set of documents. A process for pruning documents from the initial set of matching documents in accordance with one embodiment is discussed in further detail below with reference to Fig. 7.
In some embodiments, the matching stage 404 and the pruning stage 406 may be interleaved. In particular, pruning may be performed as matching documents are identified, so that candidates whose preliminary score indicates that they would very likely be rejected by the final ranking algorithm are discarded early and receive no further consideration.
The set of candidate documents retained after the matching stage 404 and the pruning stage 406 is further evaluated during a final ranking stage, as shown at block 408. During the final ranking stage 408, a full ranking algorithm is used to determine a final score for each retained document. In some embodiments, the full ranking algorithm may operate on the original search query and on data for each of the retained documents. The full ranking algorithm may use a number of different ranking features to determine the final set of ranked documents. Search results are provided in response to the search query based on the final set of ranked documents, as shown at block 410.
Turning now to Fig. 5, a flow diagram is provided that illustrates a method 500 for pre-computing scores for document/atom pairs and indexing data in accordance with an embodiment of the present invention. Initially, as shown at block 502, a document is accessed. For example, a crawler may be employed to crawl a document and retrieve document data. At block 504, the document is processed to identify the atoms it contains. As noted above, this processing may include analyzing the text of the document to identify terms, n-grams and n-tuples, and determining which of these atoms should be indexed for the document. Statistics regarding query distribution and term distribution, and/or the simplified scoring function to be used during the funnel process, may be used to statistically select the best set of atoms to represent the document.
As shown at block 506, a score is generated for each atom identified in the document. The score represents the importance of the atom in the context of the document. Any number of algorithms may be employed to calculate the score of an atom found in a document. By way of example only, the score may be based on term frequency-inverse document frequency (TF/IDF) functions well known in the art. For example, the BM25F ranking function may be used.
As shown at block 508, data is indexed in the search index. This may include storing information regarding the atoms found in the document and the score for each document/atom pair. These scores are pre-computed scores that may be used during the funnel process. In some embodiments, a posting list is created for each atom. Each posting list may include a list of the documents that contain the atom and an indication of the pre-computed score for each document/atom pair.
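By way of illustration only, per-atom lists with pre-computed scores might be built as sketched below. The sketch uses a plain TF-IDF weighting as a stand-in; it is not BM25F, and the function name `build_postings` and the in-memory dictionary layout are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def build_postings(docs):
    """docs: doc_id -> list of atoms. Returns atom -> {doc_id: score}."""
    # Document frequency: in how many documents does each atom occur?
    df = Counter(atom for atoms in docs.values() for atom in set(atoms))
    n_docs = len(docs)

    postings = defaultdict(dict)
    for doc_id, atoms in docs.items():
        for atom, tf in Counter(atoms).items():
            # Illustrative TF-IDF weight, computed once at indexing time.
            idf = math.log(1 + n_docs / df[atom])
            postings[atom][doc_id] = tf * idf
    return dict(postings)


example_docs = {1: ["a", "b", "a"], 2: ["b"]}
example_postings = build_postings(example_docs)
```

Because the scores are stored alongside the document identifiers, the online stages can read them directly instead of recomputing them for each query.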
Referring next to Fig. 6, a flow diagram is provided that illustrates a method 600 for obtaining an initial set of matching documents during the matching stage in accordance with an embodiment of the present invention. As shown at block 602, a search query is initially received. The search query may contain one or more search terms entered by a user using a user device.
As shown at block 604, the received search query is reformulated. In particular, the terms of the search query are analyzed to identify one or more atoms that can be used to query the search index. This analysis may be similar to the analysis used to identify atoms in documents when document data is indexed. For example, statistics regarding terms and search queries may be used to identify the atoms in the search query. The reformulated query may comprise a conjunction of atoms and cascading variants of them.
As shown at block 606, the reformulated query is used to identify a set of matching documents from the search index. In particular, the atoms identified from the original query are used to query the search index and identify matching documents. As indicated above, the search index may include posting lists for the various atoms identified in the indexed documents. The posting lists corresponding to the atoms identified by the reformulated query may be located and used to identify matching documents. For example, the intersection of the posting lists for multiple atoms from the reformulated query may provide the initial set of matching documents.
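By way of illustration only, intersecting the lists for several atoms may be sketched as follows. The sketch assumes that each list holds document identifiers in ascending order, which allows a linear merge; the names `intersect_sorted` and `match` are hypothetical.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of document identifiers."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def match(query_atoms, postings):
    """Return the documents that appear in the list of every query atom."""
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((postings.get(atom, []) for atom in query_atoms), key=len)
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect_sorted(result, other)
    return result


atom_lists = {
    "holistic": [1, 3, 5, 8],
    "approach": [3, 4, 5],
    "sweden": [2, 3, 5, 9],
}
matching_docs = match(["holistic", "approach", "sweden"], atom_lists)
```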
Turning to Fig. 7, a flow diagram is provided that illustrates a method 700 for pruning documents from an initial set of matching documents during the pruning stage in accordance with an embodiment of the present invention. As shown at block 702, a preliminary score is computed for each document using the pre-computed scores stored in the search index. This may include retrieving the pre-computed score for each atom of a document and using those pre-computed scores in a simplified scoring function to generate the preliminary score for the document. The simplified scoring function may be built so that it provides an estimate of the final score that would be provided by the full ranking algorithm. For example, the simplified scoring function may include a subset of the features used by the full ranking algorithm. In some embodiments, the simplified scoring function is defined using a process such as that described in U.S. Patent Application No. (not yet assigned) (Attorney Docket No. MFCP.157122), entitled "DECOMPOSABLE RANKING FOR EFFICIENT PRECOMPUTING".
As shown at block 704, documents are pruned from the initial set of matching documents based on the preliminary scores. In some embodiments, the top N documents are retained; that is, the N documents with the highest preliminary scores are retained for further processing. The number of documents retained may be based on the fidelity of the simplified scoring function used to compute the preliminary scores. The fidelity of the simplified scoring function represents its ability to provide a ranked set of documents similar to the one that would be provided by the final ranking algorithm. If the correlation between the simplified scoring function and the final ranking algorithm, including the error in the simplified scoring function, is known, that knowledge may be used to determine the number of documents to retain from the pruning stage. For example, if it is desired to provide 1000 search results and it is known that, on average over all queries, the top 1200 documents from the simplified scoring function will include the top 1000 documents from the final ranking algorithm, then the top 1200 documents will be retained from the pruning stage.
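By way of illustration only, retaining the top N documents, and measuring for a single query how deep the simplified ordering must be cut to capture the top k documents of the final ordering, may be sketched as follows. The names `prune_top_n` and `needed_depth` are hypothetical, and the sketch assumes that every document in the final ordering also appears in the simplified ordering. Averaging `needed_depth` over many queries is one possible way to arrive at a figure such as 1200 for k = 1000.

```python
import heapq


def prune_top_n(prelim_scores, n):
    """Keep the n documents with the highest preliminary scores."""
    return heapq.nlargest(n, prelim_scores, key=prelim_scores.get)


def needed_depth(simplified_order, final_order, k):
    """Smallest N such that the top N of simplified_order contains
    every one of the top k documents of final_order."""
    position = {doc: i for i, doc in enumerate(simplified_order)}
    return max(position[doc] for doc in final_order[:k]) + 1


# Toy data: the preliminary score of document d is simply d.
prelim = {doc: doc for doc in range(5000)}
kept = prune_top_n(prelim, 1200)

depth = needed_depth(["a", "b", "c", "d", "e"], ["b", "d", "a", "c", "e"], k=2)
```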
In some embodiments of the present invention, the funnel process may employ a search index that includes a reverse index and a forward index. The reverse index is ordered by atom. This facilitates fast retrieval of data during the matching and pruning stages of the funnel process. In particular, when a search query is received and atoms are identified from it, the posting lists in the reverse index that correspond to those atoms can be accessed quickly and used to identify matching documents and to retrieve the pre-computed scores used by the simplified scoring function. The forward index is ordered by document. This facilitates the final ranking stage of the funnel process. In particular, the matching and pruning stages yield a pruned set of documents, which will be relatively small. The forward index stores document data that is retrieved for the documents in the pruned set and used by the final ranking algorithm to provide the final set of ranked documents. In some embodiments, the forward index may be structured as described in U.S. Patent Application No. (not yet assigned) (Attorney Docket No. MFCP.157165), entitled "EFFICIENT FORWARD RANKING IN A SEARCH ENGINE". Additionally, in some embodiments, a hybrid distribution model may be employed for the reverse and forward indexes, such as that described in U.S. Patent Application No. (not yet assigned) (Attorney Docket No. MFCP.157166), entitled "HYBRID DISTRIBUTION MODEL FOR SEARCH ENGINE INDEXES", which is incorporated herein by reference in its entirety.
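By way of illustration only, the two orderings may be sketched as a pair of in-memory mappings. The function name `build_indexes`, the use of a raw term count as the pre-computed score, and the fields stored per document are all assumptions made for the example and do not reflect the structures described in the referenced applications.

```python
from collections import defaultdict


def build_indexes(docs):
    """docs: doc_id -> text. Returns (reverse, forward)."""
    reverse = defaultdict(dict)  # atom   -> {doc_id: pre-computed score}
    forward = {}                 # doc_id -> data needed by the final ranker
    for doc_id, text in docs.items():
        terms = text.lower().split()
        forward[doc_id] = {"terms": terms, "length": len(terms)}
        for term in terms:
            # Placeholder score: raw count of the term in the document.
            reverse[term][doc_id] = reverse[term].get(doc_id, 0) + 1
    return dict(reverse), forward


reverse_index, forward_index = build_indexes({1: "red car red", 2: "blue car"})
```

Matching and pruning read only `reverse_index`, one atom at a time, while the final ranking stage looks up each surviving document once in `forward_index`.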
Turning now to Fig. 8, an exemplary system is illustrated in which embodiments of the present invention may be employed. While some embodiments of the present invention, as discussed herein, are directed to the funnel process of evaluating and pruning document candidates in multiple stages, other embodiments are directed to identifying the most useful and most relevant atoms in documents and indexing those atoms in a search index in relation to the specific documents. Atoms may take many forms, including terms or unigrams, n-grams, and n-tuples. While typically only individual terms are indexed, some types of atoms contain multiple terms, as discussed below, so that combinations of terms can be indexed together. As used herein, a unigram maps to a single token or word as defined by the particular tokenizer technology being used. A unigram may thus be a single word found in a document. An n-gram is a sequence of "n" consecutive or almost consecutive terms extracted from a document. An n-gram may be tight or loose. An n-gram is said to be tight if it corresponds to a run of consecutive terms. A loose n-gram contains terms in the order in which they appear in the document, but the terms are not necessarily consecutive. Loose n-grams are typically used to represent a class of equivalent phrases that differ only by insignificant words (for example, "if it rains I'll get wet" and "if it rains then I'll get wet"). A bigram, for example, is an n-gram of two terms, with "n" equal to 2. Similarly, a trigram is an n-gram of three terms, with "n" equal to 3. An n-tuple, as used herein, is a set of "n" terms that co-occur in a document, independent of order. Atoms identified in a document are indexed in one or more search indexes. In one embodiment, there are separate indexes for unigrams, n-grams and n-tuples.
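By way of illustration only, the difference between tight sequences, loose sequences and order-independent tuples may be sketched as follows. The sketch treats a loose sequence as one that keeps document order while skipping at most `max_gap` terms, so that `max_gap=0` yields tight sequences. The function names and the `max_gap` parameter are assumptions made for the example, and the brute-force enumeration is intended for short inputs only.

```python
from itertools import combinations


def loose_ngrams(terms, n, max_gap=1):
    """Sequences of n terms in document order that skip at most max_gap
    terms in total; max_gap=0 gives tight (strictly consecutive) sequences."""
    found = set()
    for idx in combinations(range(len(terms)), n):
        if idx[-1] - idx[0] + 1 <= n + max_gap:
            found.add(tuple(terms[i] for i in idx))
    return found


def n_tuples(terms, n):
    """Order-independent sets of n distinct terms that co-occur."""
    return {frozenset(c) for c in combinations(sorted(set(terms)), n)}


phrase_a = "if it rains i'll get wet".split()
phrase_b = "if it rains then i'll get wet".split()

tight_a = loose_ngrams(phrase_a, 2, max_gap=0)
tight_b = loose_ngrams(phrase_b, 2, max_gap=0)
loose_b = loose_ngrams(phrase_b, 2, max_gap=1)
tuples_a = n_tuples(phrase_a, 2)
```

Here the pair ("rains", "i'll") is tight in the first phrase, is not tight in the second phrase because "then" intervenes, and is recovered from the second phrase once one skipped term is allowed.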
Returning to Fig. 8, it should be understood that this and other arrangements described herein are set forth only as examples. Other arrangements and elements (for example, machines, interfaces, functions, orders and groupings of functions) can be used in addition to or instead of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, and in any suitable combination and location. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware and/or software. For example, various functions may be carried out by a processor executing instructions stored in memory.
Among other components not shown, system 800 may include a user device 802, an index server 804, a search index generator 808 and a search index 818. Each of the components shown in Fig. 8 may be any type of computing device, such as, for example, computing device 100 described with reference to Fig. 1. The components may communicate with one another via a network 806, which may include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet. It should be understood that any number of user devices, index servers, search index generators and search indexes may be employed in system 800 within the scope of the present invention. Each may comprise a single device or multiple devices cooperating in a distributed environment. For example, index server 804 may comprise multiple devices arranged in a distributed environment that collectively provide the functionality of index server 804 described herein. Similarly, as described herein, there may be multiple search indexes. These may be stored in search index 818 or may be stored in separate locations. Additionally, other components not shown may also be included in system 800.
Index server 804 is generally operable to receive search queries from user devices, such as user device 802, and to provide search results in response to the search queries by searching one or more search indexes. Search index generator 808 includes, among other things, an atom identification component 810, an information metric computing component 812, an atom pruning component 814 and a search index component 816. Generally, search index generator 808 is responsible for generating a search index, or populating an existing search index, with the atom/document pairs that are determined to be the most useful or most relevant for future queries. Atom identification component 810 is generally responsible for reviewing documents and extracting individual terms from them. Additionally, atom identification component 810 identifies those atoms that are n-grams and n-tuples. For example, atom identification component 810 may be able to identify n-grams by determining the location of each term relative to the others. As mentioned, the terms that make up an n-tuple are location independent and may therefore be located anywhere in the document. The description of Figs. 9A, 9B and 9C below further illustrates n-grams and n-tuples.
Information metric computing component 812 computes an information metric. Atoms identified in a document may be selected as the most relevant or most useful to that particular document based on the information metric. Generally, the information metric is a ranking of an atom in relation to the particular document from which the atom was identified or parsed. The information metric estimates the usefulness of the atom in resolving general queries. In one embodiment, information metric computing component 812 uses an algorithm to compute the information metric for each atom/document pair. Many factors may be used in conjunction with the algorithm to compute the information metric. By way of example only, these factors may include the frequency of the atom in the document; where the atom is an n-gram or n-tuple made up of individual terms, the fraction of information carried by the atom; the number of times the terms occur individually in the document compared with the number of times they occur together; and whether the atom, or the terms that make it up, appears in a query log. The last factor is evidence that the terms of the atom are associated in some way and have previously been searched for. If the terms of an atom each occur many times in the document but rarely occur close to one another, this may mean that the terms merely happen to appear in the same document, with no deeper relationship. Conversely, co-occurrence becomes more significant when the terms appear closer to one another than would be expected by chance.
Atom pruning component 814 is responsible for pruning the number of atom/document pairs for each document, so that atoms that are unlikely to be relevant or important for a particular document are not indexed and do not take up excess storage space. For a document that has 400 distinct terms, and therefore 400 entries in the search index, identifying bigrams in the document as well would yield on the order of 80,000 term pairs for that single document. If trigrams and n-tuples are also indexed, that number grows even larger. Not only does the number of atom/document pairs become very large, but the position of each term may optionally be stored in the search index, which consumes storage space just as the atom/document pairs themselves do. As mentioned, an algorithm computes the information metric based on many factors, some of which are listed above, and the information metric is later used to determine whether a particular atom/document pair is to be indexed.
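By way of illustration only, the figure of about 80,000 is consistent with counting unordered pairs drawn from 400 distinct terms, which gives 400 × 399 / 2 = 79,800. Reading the figure this way is an assumption; the short sketch below, with the hypothetical helper `candidate_atoms`, shows how quickly the candidate count grows when three-term combinations are considered as well.

```python
from math import comb


def candidate_atoms(distinct_terms, n):
    """Number of unordered n-term combinations of the distinct terms."""
    return comb(distinct_terms, n)


pairs = candidate_atoms(400, 2)    # two-term combinations
triples = candidate_atoms(400, 3)  # three-term combinations
```

With 400 distinct terms there are 79,800 possible pairs and 10,586,800 possible triples, which is why indexing every candidate for every document is impractical.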
This determination is based on a threshold. The threshold is set by looking at previous runs, for example over the past day, and also by initial experimentation. There are many ways to compute a threshold, and those described above are provided for exemplary purposes only. The threshold is thus typically a predetermined value. Based on the threshold, atom pruning component 814 examines the information metric for each atom/document pair and decides whether each pair is to be indexed or discarded.
Once atom pruning component 814 has pruned the number of atom/document pairs as described above, search index component 816 may generate a search index or add entries to an existing search index. In one example, a search index may be generated during the process described above, and the entries in that search index may then be merged into an existing search index, such as a master index. Search index 818, in one embodiment, stores multiple search indexes. Also, as previously mentioned, there may be separate search indexes for the various types of atoms, including a unigram index, an n-gram index and an n-tuple index. A unigram index is a mapping from a given term to a list of document identification/rank records. In one embodiment, the pruning process is not applied to unigrams, because the number of unigrams is typically manageable; unigrams therefore may not be pruned at all, or at least need not be pruned as heavily as n-grams and n-tuples. An n-gram index includes the n-grams identified in a document by a sliding window algorithm for a fixed "n". For example, for the term stream t1 t2 t3 t4 t5 with n = 2, the n-gram atoms are (t1 t2), (t2 t3), (t3 t4) and (t4 t5). Thus, a string of five terms with n = 2 yields four atoms. These atoms are indexed and stored with (DocID, rank) records, where "rank" is an approximation of the rank of the two consecutive terms in the document.
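By way of illustration only, a sliding window over the term stream t1 t2 t3 t4 t5 with n = 2 may be sketched as follows, with each resulting atom stored alongside a (DocID, rank) record. The function `index_ngrams` and the constant rank value passed to it are hypothetical placeholders.

```python
def index_ngrams(index, doc_id, terms, n, rank_of):
    """Slide a window of width n over terms and record (doc_id, rank)
    for every atom produced."""
    for i in range(len(terms) - n + 1):
        atom = tuple(terms[i:i + n])
        index.setdefault(atom, []).append((doc_id, rank_of(atom)))
    return index


stream = ["t1", "t2", "t3", "t4", "t5"]
# The rank function is a stand-in that assigns every atom the same rank.
ngram_index = index_ngrams({}, "doc1", stream, 2, lambda atom: 1.0)
```

A stream of k terms yields k - n + 1 atoms, so five terms with n = 2 produce four entries.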
In some embodiments, the rank or information metric is not stored in the index, but is instead used only to determine which atoms are to be indexed and which are to be discarded. An n-tuple index is similar to the n-gram index described herein, except that exponentially more n-tuples are identified from a document, because the locations of the terms of an n-tuple are considered irrelevant. As such, n-tuples are generally pruned more heavily than n-grams and unigrams. Further, in some cases n-grams and n-tuples may duplicate one another, and the duplicates are discarded during the pruning process. Once identified and indexed, in one embodiment, the atoms (unigrams, n-grams and n-tuples) are stored in a dictionary using a priority hash index, such as that described in U.S. Patent Application No. 12/980582 (Attorney Docket No. MFCP.157119), entitled "PRIORITY HASH INDEX", which is incorporated herein by reference in its entirety.
Figs. 9A, 9B and 9C illustrate examples of entries in a unigram search index, an n-gram search index and an n-tuple search index, respectively, in accordance with embodiments of the present invention. The embodiments of Figs. 9A, 9B and 9C use the exemplary string of terms "Holistic Approach in Southern Sweden". Fig. 9A illustrates the unigrams 900 identified from the string of terms. As shown, five unigrams are identified, each consisting of an individual term. Fig. 9B illustrates the n-grams 910 identified from the exemplary string. Because n-grams consist of terms that are close to or adjacent to one another, seven n-grams are identified, which is more than the number of unigrams identified. Fig. 9C illustrates the n-tuples 920 identified from the exemplary string. As shown, many more n-tuples are identified than unigrams or n-grams, because the terms of an n-tuple can be paired or otherwise matched together even if they are not adjacent or close to one another in the document. Thirteen n-tuples are identified from the exemplary string of five terms. Figs. 9A, 9B and 9C are shown to illustrate how the quantity of n-tuples is generally much greater than the quantity of unigrams or n-grams.
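By way of illustration only, the counts for the exemplary string can be approximated with the sketch below. It assumes that the seven n-grams are the four consecutive two-term sequences plus the three consecutive three-term sequences, which is one reading that reproduces the stated count. The thirteen n-tuples of Fig. 9C are not enumerated in the text, so the sketch counts only unordered two-term combinations, and its result is not expected to equal thirteen. The function name `count_atoms` is hypothetical.

```python
from itertools import combinations


def count_atoms(text):
    """Count unigrams, tight 2- and 3-term sequences, and unordered pairs."""
    terms = text.split()

    def windows(n):
        return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]

    return {
        "unigrams": len(terms),
        "ngrams": len(windows(2)) + len(windows(3)),
        "pairs": len(list(combinations(terms, 2))),
    }


counts = count_atoms("Holistic Approach in Southern Sweden")
```

Even when only two-term combinations are counted, the order-independent atoms outnumber both the unigrams and the consecutive sequences.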
Referring to Fig. 10, a flow diagram is illustrated of a method 1000, in accordance with an embodiment of the present invention, for populating one or more search indexes with atoms identified in a plurality of documents. Initially, a set of documents to be indexed in a search index is identified at step 1010. Documents are generally indexed so that, when a search query is received, the most relevant documents can easily be found for the user by accessing the search index. At step 1012, atoms are identified in each document. As mentioned, atoms may be one or more of unigrams, n-grams or n-tuples. A unigram is typically a single token or word, while an n-gram, when "n" is greater than one, is multiple words or tokens that are adjacent to or in close proximity to one another in the document. For example, an n-gram may be a sequence of consecutive or almost consecutive terms extracted from a particular document, where "n" is the number of consecutive or almost consecutive terms. An n-tuple is multiple words or tokens that co-occur in the same document but are not necessarily adjacent to or in close proximity to one another in the document. In one example, the terms that make up an n-tuple may not be close to one another at all, such as when they appear in different parts of the document. Additionally, an n-tuple is order independent.
At step 1014, a list of atom/document pairs is generated. An atom/document pair is an atom identified in a document together with a document identification corresponding to the document from which the atom was identified. An information metric is computed for each atom/document pair at step 1016. The information metric represents a ranking of the atom in relation to the particular document, such as how relevant the atom is, relative to the document, in resolving search queries. In one embodiment, a machine-learning tool is used both to compute the information metric for each atom and to select the atom/document pairs that are determined, based on the information metric and other factors, to be the most relevant to the documents from which the atoms were identified. The computation of the information metric may use an algorithm that takes a variety of factors into account. By way of example only, these factors may include the frequency in any corpus of the one or more terms that make up the atom, the fraction of information in the terms, the spacing between the terms in the document, how many times the one or more terms appear independently compared with how many times they appear together, and whether the atom appears in a query log and with what frequency. Other factors may also be used and are contemplated to be within the scope of the present invention.
A subset of the atom/document pairs is selected at step 1018 as being the most relevant to the particular document. The selection at step 1018 is based on the information metric computed for the atom/document pairs. Typically, a threshold is determined such that information metrics above the threshold are considered relevant, and those below it are considered not relevant, or at least not as relevant. In one embodiment, selecting the subset of atom/document pairs includes employing a pruning algorithm to prune or limit the atom/document pairs to a smaller quantity, so that atom/document pairs that are less relevant than others are discarded and thus not indexed. At step 1020, a search index is populated with the subset of atom/document pairs for the particular document. As mentioned, all atom/document pairs may initially be indexed in a separate index, and then only those selected as the most relevant are populated or indexed in the master search index. Further, as mentioned, there may be more than one search index, so that in one embodiment unigrams are indexed in a unigram index, n-grams are indexed in an n-gram index, and n-tuples are indexed in an n-tuple index.
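By way of illustration only, selecting atom/document pairs against a threshold and populating a separate index per atom type may be sketched as follows. The function `populate`, the tuple layout of the input, the three index names and the metric values are hypothetical; the information metric itself is assumed to have been computed elsewhere.

```python
from collections import defaultdict


def populate(pairs, threshold):
    """pairs: iterable of (atom, kind, doc_id, metric), where kind is one
    of "unigram", "ngram" or "ntuple". Returns (indexes, discarded)."""
    indexes = {
        "unigram": defaultdict(list),
        "ngram": defaultdict(list),
        "ntuple": defaultdict(list),
    }
    discarded = []
    for atom, kind, doc_id, metric in pairs:
        if metric >= threshold:
            indexes[kind][atom].append(doc_id)  # pair is indexed
        else:
            discarded.append((atom, doc_id))    # pair is pruned
    return indexes, discarded


candidate_pairs = [
    ("sweden", "unigram", "d1", 0.9),
    (("southern", "sweden"), "ngram", "d1", 0.7),
    (frozenset({"holistic", "sweden"}), "ntuple", "d1", 0.2),
]
indexes, discarded = populate(candidate_pairs, threshold=0.5)
```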
In one embodiment, a search query is received. The search query may be reformulated as at least one of a unigram, an n-gram, an n-tuple, or a combination thereof. The search index in which the atoms have been indexed is accessed to determine the most relevant documents for the reformulated search query.
Fig. 11 is a flow diagram illustrating a method 1100, in accordance with an embodiment of the present invention, for populating one or more search indexes with atoms identified in a plurality of documents. Initially, at step 1110, atoms are identified from a first document. Each of these atoms is categorized as a unigram, an n-gram, an n-tuple, or a combination thereof at step 1112. At step 1114, an information metric is computed for each of the identified atoms. As mentioned, the information metric represents a ranking of the atom/document pair in terms of its usefulness in resolving general queries. Factors used in computing the information metric include, but are not limited to, the frequency of the atom in the first document, the proximity of the locations of two or more terms of the atom in the first document, the association among the terms of the atom, and whether the terms of the atom have previously been associated with one another, as evidenced by reviewing a query log. At step 1116, it is determined whether the information metric of each atom meets a predetermined threshold. The atoms that meet the threshold are thought to be, or are known to be, those most relevant to the first document. In one embodiment the threshold may be arbitrary, and in another embodiment it may be based purely on numbers, such as how many atoms are to be indexed. In yet another embodiment, the threshold is based on previous attempts made with respect to atoms found to be relevant to a particular document.
At step 1118, the atoms that do not meet the predetermined threshold are discarded. The atoms that do meet the threshold are incorporated into one or more search indexes, as shown at step 1120. In one embodiment, the one or more search indexes include a unigram index, an n-gram index and an n-tuple index. In one embodiment, as previously mentioned, all unigrams identified in the document being indexed may be incorporated into the search index and are thus not pruned. Further, in one embodiment, the same applies to n-grams. Alternatively, n-grams may be pruned to a certain extent, but not as heavily as n-tuples. As such, a greater percentage of n-tuples may be discarded than of n-grams and unigrams. In addition, some n-tuples may also be identified as n-grams, so the duplicates may be discarded during the pruning process.
Another embodiment of Fig. 11 includes the identification of atoms from a second document. Each of these atoms is categorized as a unigram, an n-gram, an n-tuple, or a combination thereof. An information metric is computed for each atom in relation to the second document. Some of the atoms may be the same as or similar to those identified from the first document, but may have different information metrics because they were identified from a different document. It is then determined whether the information metric for each atom meets a predetermined threshold. Those atoms that meet the threshold are considered the most relevant to the second document and are incorporated into the search index. Those that do not meet the threshold are discarded.
Fig. 12 is a flow diagram illustrating a method 1200, in accordance with an embodiment of the present invention, for populating one or more search indexes with atoms identified in a plurality of documents. At step 1210, atoms are extracted from documents. The atoms may be categorized as unigrams, n-grams or n-tuples. For each atom, an information metric is computed at step 1212. The information metric represents a ranking of a particular atom in relation to a document. Further, the computation of the information metric may be based on, for example, the frequency of the atom in the document, the proximity of the terms of the atom in the document, the relevance of the terms of the atom, and whether the terms of the atom have previously been associated with one another, as evidenced by reviewing a query log. Other factors may also be used and are contemplated to be within the scope of the present invention.
At step 1214, an information metric threshold is determined, such that atom/document pairs whose information metric meets or exceeds the threshold are indexed. At step 1216, a portion of the atom/document pairs is discarded based on the information metric, for example where the information metric does not meet the threshold. At step 1218, the one or more search indexes are populated with the atom/document pairs whose information metric meets or exceeds the information metric threshold. In one embodiment, unigrams, n-grams and n-tuples are indexed separately. At step 1220, the search index is accessed to identify documents relevant to the atoms in a received search query.
As can be understood, embodiments of the present invention provide for computing an information metric for each atom/document pair and for using the information metric to determine which atom/document pairs are to be indexed and which are to be discarded. The present invention has been described in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which the present invention pertains without departing from its scope.
From the foregoing, it will be seen that this invention is well adapted to attain all the ends and objects set forth above, together with other advantages that are obvious and inherent to the system and method. It will be understood that certain features and subcombinations are of utility and may be employed without reference to other features and subcombinations. This is contemplated by and is within the scope of the claims.