CN102243631A

CN102243631A - Super key distributed searching method

Info

Publication number: CN102243631A
Application number: CN201010171392XA
Authority: CN
Inventors: 吴春尧
Original assignee: Individual
Current assignee: Individual
Priority date: 2010-05-13
Filing date: 2010-05-13
Publication date: 2011-11-16

Abstract

The invention discloses a super key distributed searching method, an information retrieval method that intelligently matches against the full text of a document. The full text is used as the search input, and intelligent processing is fused into the search engine framework at multiple stages. The system processes the whole text to obtain representative keywords; a super key comprises the keywords themselves, their disambiguation results, their weights, and the relations among them. The intelligent processing uses the following modules: a full-text matching algorithm merging module (FMM), a multi-feature extraction module (FE) and a global feature extraction module (WFE). The FMM is distributed in the basic server and the advanced server; the FE is distributed in the inverted-index extraction process of the basic server and the query analysis process of the advanced server; and the WFE is distributed in the index module. The system keeps the advantages of distributed search, such as high data throughput, high reliability and concurrent service, adds no complexity to searching, and improves search relevance.

Description

Superkey distributed search method

Technical field

The present invention proposes an information retrieval method for full-text search of text. The superkey distributed search method is especially suited to retrieval over massive data: it searches by intelligently matching the full text, preserves the distributed nature of search, and improves search accuracy.

Background technology

At present, keyword search as used by search engines is the main application of information retrieval: keyword retrieval and ranking techniques carry out the distributed search and present the most likely matches in ranked order. Combining intelligent techniques with search, that is, making search engines intelligent, has broadly taken the following forms:

1. Classify and cluster the data in advance, present these categories with the results, and improve search quality through user interaction;

2. Use search logs to mine latent associations between keywords and thereby expand the query keywords;

3. Use intelligent agent technology to make the search intelligent;

4. Incorporate intelligent techniques into crawler processing.

These methods are all based on keyword search and cannot serve a user who needs to search with a whole text (the input is an entire text, for example using a résumé to look for a job position). Methods for making keyword search engines intelligent cannot solve this problem. What is needed is an intelligent method that does not destroy the search engine's distributed, large-data-volume processing capability. The present invention addresses exactly this problem.

Summary of the invention

The superkey distributed search method is an information retrieval method that intelligently matches against the full text of a document, providing fast, highly relevant search over massive data. It is proposed in contrast to keyword search. Keyword search only allows a limited number of keywords to be searched; a full text cannot be submitted to the search engine, and if a long text is entered, it is truncated so that only a limited leading string is kept. In superkey search the whole text is the search input, so the search engine obtains more useful information and returns results that better match what the user entered. The method makes corresponding improvements to the search engine framework while preserving the engine's distributed, large-data-volume processing, because intelligent processing is fused into each stage of the framework. Specifically, superkey search formalizes the full-text information: it uses the whole text to obtain representative keywords, which are disambiguated using their context. A superkey comprises the keywords themselves, the keyword disambiguation results, the keyword importance (expressed as weights), and the relations between keywords. The disambiguation result of a keyword is a pair consisting of the keyword and its meaning string; a keyword relation is a keyword pair together with a relation name. Search of this kind is called superkey search.
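The composition of a superkey described above can be pictured as a small data structure. The following is only an illustrative sketch: the class names `Keyword` and `SuperKey`, their field names, and the sample values are assumptions introduced here and do not appear in the original disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class Keyword:
    """One representative keyword (field names are illustrative assumptions)."""
    term: str      # the keyword itself
    sense: str     # disambiguation result: the meaning string paired with the term
    weight: float  # importance of the keyword


@dataclass
class SuperKey:
    """A whole text reduced to keywords plus named relations between them."""
    keywords: Dict[str, Keyword] = field(default_factory=dict)
    # (keyword, keyword) pair -> relation name
    relations: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def add_keyword(self, term: str, sense: str, weight: float) -> None:
        self.keywords[term] = Keyword(term, sense, weight)

    def relate(self, a: str, b: str, name: str) -> None:
        if a not in self.keywords or b not in self.keywords:
            raise KeyError("both keywords must be added before relating them")
        self.relations[(a, b)] = name


# Hypothetical superkey for a resume-like text.
resume = SuperKey()
resume.add_keyword("Java", "programming language", 0.9)
resume.add_keyword("engineer", "job title", 0.7)
resume.relate("Java", "engineer", "skill-of")
```

Keeping the sense string next to each term is what lets two texts that share a word in different meanings be told apart at matching time.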

Using a whole text as the search input requires making full use of intelligent techniques. The superkey distributed search method proposed by the present invention extends the keyword search engine framework: it lets the search engine keep the distributed, concurrent character of keyword search while also incorporating a variety of current machine learning algorithms. The basic process is as follows:

1. The system performs feature extraction on the input text to obtain the superkey set; this is the key difference from existing search engines.

2. Search results are ranked according to the features extracted above, and the ranking algorithm is a set of machine learning algorithms. That is, various existing machine learning algorithms can be plugged into this search framework.

3. Feature extraction from the input text is based on text topic extraction methods. In the training stage, text pre-classification is used; classification allows feature weights to be estimated more accurately.

The superkey distributed search framework distributes the machine recognition algorithms for text across the search engine. In other words, it turns the complete text recognition process into a search engine; seen from the other side, it makes each stage of a distributed search engine intelligent. It is a framework that tightly couples intelligent text processing with the search engine. The search engine thus keeps its distributed advantages, such as high concurrency, while matching more accurately, which addresses three major problems of search engines: completeness, accuracy, and relevance.

The retrieval module of the keyword distributed search engine framework consists of five parts: the index-building program (INDEXER), the basic retrieval service (BS), the information retrieval service (DI), the advanced search service (AS), and the retrieval module client (CLIENT). See Fig. 1.

The implementation of this framework is briefly described below:

In the figure, a two-way arrow indicates that a stable network connection is established between the two parties for data exchange. A one-way arrow indicates the direction of data transmission.

The data exchange process is as follows:

1. INDEXER builds index databases from the documents and related information.

2. Because the resources of a single machine are limited, multiple index databases need to be built and distributed across different machines.

3. Each index database corresponds to one group of BS/DI services on its own machine. BS provides the information relevant to ranking, and DI provides the remaining information needed for display.

4. CLIENT sends a query request to AS. AS accesses the relevant group of BS as needed to obtain all ranking-related information; this process is the query analysis. AS then aggregates and merges the information returned by each BS to get the final ranking, determines the specific entries to display according to the current display position, accesses the relevant DI to obtain all the information to be displayed, and returns it to CLIENT.
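The exchange in step 4 is a scatter-gather pattern: AS fans the query out to every BS, merges the partial rankings, and only then asks DI for the display data of the entries on the current page. The sketch below is a single-process illustration under stated assumptions; the function names `basic_search` and `advanced_search`, the in-memory shard layout, and the additive scoring are hypothetical and stand in for the networked services of the disclosure.

```python
def basic_search(shard, query_terms):
    """Hypothetical BS: return (score, doc_id) pairs from one index shard."""
    hits = []
    for doc_id, term_weights in shard.items():
        # Assumed scoring: sum of the stored weights of the matching terms.
        score = sum(term_weights.get(t, 0.0) for t in query_terms)
        if score > 0:
            hits.append((score, doc_id))
    return hits


def advanced_search(shards, doc_info, query_terms, offset=0, page_size=2):
    """Hypothetical AS: query every BS, merge, then fetch display info for one page."""
    merged = [hit for shard in shards for hit in basic_search(shard, query_terms)]
    # Final ranking: highest score first, doc_id as a deterministic tie-breaker.
    ranked = sorted(merged, key=lambda h: (-h[0], h[1]))
    page = ranked[offset:offset + page_size]
    # Stand-in for DI: display information is looked up only for the visible entries.
    return [(doc_id, score, doc_info[doc_id]) for score, doc_id in page]


# Two index shards, as if hosted on two machines.
shards = [
    {"d1": {"java": 0.5, "sql": 0.2}, "d2": {"python": 0.9}},
    {"d3": {"java": 0.8}},
]
doc_info = {"d1": "Backend developer", "d2": "Data analyst", "d3": "Java engineer"}
results = advanced_search(shards, doc_info, ["java", "sql"])
```

Splitting ranking data (BS) from display data (DI) means the merge step moves only scores and identifiers; the heavier display payload is fetched for a single page of results.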

According to practical application requirements, the retrieval module has the following technical points:

1. The index-building server (Indexer) interacts with storage, works over the complete data set, and computes the superkey weights for the index. Index data is partitioned based on MD5.

2. BS obtains data from Indexer, is independent of the storage layer, and is responsible for operations such as word segmentation, querying, posting-list (zipper) merging, and ranking.

3. Likewise, DI obtains data from Indexer, is independent of the storage layer, and is responsible for summary extraction, computing display information, and so on. It needs a resident cache.
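The MD5-based partitioning of index data in point 1 can be sketched as hashing a document key and taking the result modulo the number of index databases. The disclosure does not say what is hashed or how the digest is mapped to a partition, so the key choice, the modulo mapping, and the names `shard_for` and `partition` below are assumptions.

```python
import hashlib
from collections import defaultdict


def shard_for(doc_key: str, num_shards: int) -> int:
    """Map a document key to a shard index using its MD5 digest (assumed scheme)."""
    digest = hashlib.md5(doc_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


def partition(doc_keys, num_shards):
    """Group document keys by the shard they are assigned to."""
    shards = defaultdict(list)
    for key in doc_keys:
        shards[shard_for(key, num_shards)].append(key)
    return dict(shards)


layout = partition(["resume-001", "resume-002", "job-17", "job-42"], num_shards=3)
```

Because the assignment depends only on the key, any server can recompute which index database holds a document without consulting a central directory.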

While largely keeping the framework of the distributed search engine, the superkey distributed search framework adds a full-text matching algorithm merging module (FMM), a multi-feature extraction and scoring module (FE), and a global feature extraction module (WFE).

Intelligent processing generally comprises three processes: training, feature extraction, and matching/recognition. Training processes a corpus (or data) to obtain the parameters of each feature and compute global feature scores. Feature extraction performs word segmentation, word frequency statistics, and superkey feature computation on a particular text, then combines the global and local feature scores. Matching/recognition traverses the features extracted from the texts in the data to estimate the most likely result. Intelligent recognition is ordinarily one complete process; the present invention splits this process sensibly and distributes the pieces across the modules of the search framework. See Fig. 2.

Description of drawings

The present invention is further described below with reference to the drawings and embodiments.

Fig. 1 shows the relationships in the keyword search framework;

Fig. 2 shows the relationships in the superkey search framework.

Embodiment

1. The global feature extraction module is built on multi-granularity word segmentation and pre-classification. Multi-granularity segmentation uses a statistics-based method for disambiguating ambiguous strings. In the basic ambiguous-string algorithm, ambiguous strings are obtained by comparing the output of the forward maximum matching algorithm with the correct segmentation; at recognition time the ambiguous-string library is scanned first and, on a hit, the result stored in the library is output as the correct segmentation. This system improves on that algorithm. In use it was found that ambiguous strings are themselves ambiguous: in different language contexts the correct segmentation differs. Statistical ambiguous-string disambiguation applies a statistical model to the context before and after the ambiguous string to resolve its ambiguity, which solves the ambiguity problem in ambiguous-string segmentation. The method still has the computational complexity of forward maximum matching, so search stays fast while accuracy is preserved. Pre-classification uses the χ² statistic to estimate global features; the formula is as follows:

\chi^{2}(w, c) = \frac{|D| \, (A_{1} A_{4} - A_{2} A_{3})^{2}}{(A_{1} + A_{3})(A_{2} + A_{4})(A_{1} + A_{2})(A_{3} + A_{4})}

where A_{1} is the number of class-c texts in the training set that contain term w, A_{2} is the number of non-class-c texts that contain w, A_{3} is the number of class-c texts that do not contain w, A_{4} is the number of non-class-c texts that do not contain w, and |D| is the total number of texts in the training set.
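As a concrete check of the formula, the following sketch computes the statistic from the four document counts. The function name `chi_square` and the zero-denominator convention are assumptions made for illustration; |D| is taken to be the sum of the four counts, which follows from their definitions.

```python
def chi_square(a1: int, a2: int, a3: int, a4: int) -> float:
    """Chi-square score of term w for class c from the four document counts.

    a1: class-c texts containing w      a2: non-class-c texts containing w
    a3: class-c texts not containing w  a4: non-class-c texts not containing w
    """
    total = a1 + a2 + a3 + a4  # |D|, the number of texts in the training set
    denominator = (a1 + a3) * (a2 + a4) * (a1 + a2) * (a3 + a4)
    if denominator == 0:
        # Assumed convention: a degenerate table carries no evidence.
        return 0.0
    return total * (a1 * a4 - a2 * a3) ** 2 / denominator


# A term that appears in every class-c text and in no other text.
perfect = chi_square(10, 0, 0, 10)
# A term spread evenly across classes.
uninformative = chi_square(5, 5, 5, 5)
```

A term that perfectly separates the class reaches the maximum value |D|, while an evenly spread term scores zero.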

This formula gives only the CHI value of a word with respect to a single class; its CHI value with respect to the whole text collection is a combination of its CHI values over all classes. There are usually two ways of combining them; the following form weights each class by its probability P(c_{i}):

\chi^{2}(w) = \sum_{i=1}^{|c|} P(c_{i}) \, \chi^{2}(w, c_{i})
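The probability-weighted combination can be sketched as follows, taking the per-class CHI values as already computed. The function name `global_chi_square`, the dictionary inputs, and the sample numbers are illustrative assumptions.

```python
def global_chi_square(class_priors, class_chi):
    """Combine per-class CHI values of one term, weighting each class by P(c_i)."""
    return sum(class_priors[c] * class_chi[c] for c in class_priors)


# Hypothetical two-class example: P(c) per class and the term's CHI value per class.
priors = {"jobs": 0.25, "resumes": 0.75}
chi_by_class = {"jobs": 8.0, "resumes": 4.0}
global_score = global_chi_square(priors, chi_by_class)  # 0.25 * 8.0 + 0.75 * 4.0
```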

The global feature extraction module is embedded in the index module (INDEXER) of the framework; global feature extraction is completed while the index is being built, so complexity does not increase.
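The segmentation step described in item 1 starts from forward maximum matching with an ambiguous-string library consulted first. The sketch below shows only that baseline lookup; the context-based statistical model the invention adds on top is not specified in enough detail to reproduce. The function name, the `max_len` parameter, the vocabulary, and the library entry are illustrative assumptions.

```python
def forward_max_match(text, vocab, ambiguity_lib=None, max_len=4):
    """Segment text greedily, preferring stored cuts for known ambiguous strings."""
    ambiguity_lib = ambiguity_lib or {}
    tokens, i = [], 0
    while i < len(text):
        # Scan the ambiguous-string library first; on a hit, emit the stored cut.
        hit = next((s for s in ambiguity_lib if text.startswith(s, i)), None)
        if hit is not None:
            tokens.extend(ambiguity_lib[hit])
            i += len(hit)
            continue
        # Otherwise take the longest vocabulary word starting at position i,
        # falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


vocab = {"研究", "研究生", "生命", "起源"}
# Plain forward maximum matching picks the longer word "研究生" and mis-segments.
plain = forward_max_match("研究生命起源", vocab)
# With the ambiguous string stored in the library, the stored cut is used.
fixed = forward_max_match("研究生命起源", vocab, {"研究生命": ["研究", "生命"]})
```

The library lookup fixes one cut per ambiguous string, which is exactly the limitation the text describes: when the right cut depends on context, a single stored answer is not enough.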

2. The full-text matching algorithm merging module uses the cosine similarity algorithm:

SIM(d_{i}, d_{j}) = \frac{\sum_{k=1}^{m} W_{ik} W_{jk}}{\sqrt{\left(\sum_{k=1}^{m} W_{ik}^{2}\right) \left(\sum_{k=1}^{m} W_{jk}^{2}\right)}}

d_{i} and d_{j} are two vectors: d_{i} is the feature vector of the input to be matched, and d_{j} is the feature vector of one entry in the full candidate set. W denotes the weights of the vector components.

The full-text matching algorithm merging module is embedded in BS and AS. Because it builds on the inverted-index design of the existing framework, it does not increase the complexity of the algorithm.
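A minimal sketch of the cosine similarity used for matching, with each feature vector held as a sparse mapping from feature to weight. The function name `cosine_similarity`, the sparse-dictionary representation, and the zero-vector convention are assumptions for illustration.

```python
import math


def cosine_similarity(w_i, w_j):
    """Cosine of the angle between two sparse weight vectors (feature -> weight)."""
    # Only features present in both vectors contribute to the numerator.
    dot = sum(weight * w_j.get(feature, 0.0) for feature, weight in w_i.items())
    sq_i = sum(weight * weight for weight in w_i.values())
    sq_j = sum(weight * weight for weight in w_j.values())
    if sq_i == 0 or sq_j == 0:
        # Assumed convention: an empty vector matches nothing.
        return 0.0
    return dot / math.sqrt(sq_i * sq_j)


query = {"java": 3.0, "sql": 4.0}
same_direction = cosine_similarity(query, {"java": 3.0, "sql": 4.0})
no_overlap = cosine_similarity(query, {"python": 1.0})
```

Because only shared features contribute to the numerator, the candidates worth scoring are those reachable through the inverted lists of the query's features, which is consistent with the statement that the inverted-index design avoids extra complexity.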

3. The multi-feature extraction and scoring module is an important module and is invoked from two places. One is the indexer (INDEXER): during index building it obtains the features of every document and writes them, together with the inverted list, into the storage structure of the index database. The other is the processing of the user's input text, i.e. query analysis, which likewise extracts multiple features and uses them for search and merging.

The present invention brings the following beneficial effects to search:

The framework retains the characteristics of a distributed system: 1. throughput; 2. reliability (no single point of failure); 3. data reconstruction for interrupted services; 4. redundancy and backup; 5. ease of use of the infrastructure. Search accuracy is greatly improved: 1. either a whole text or keywords may be entered; 2. the system performs intelligent full-text extraction on the input and obtains an accurate feature representation; 3. through intelligent matching, the system returns results that match, or more closely match, the input.

Claims

The superkey distributed search method is an information retrieval method that intelligently matches against the full text of a document.

1. The training and recognition processes of the intelligent processing method are split up and placed in the individual modules of the distributed search, so that the method has the large-data-volume processing, high speed, and high-concurrency service characteristics of distributed search, together with the characteristics of distributed intelligence.
2. The system input is an entire article. The system analyzes the article to obtain a superkey. A superkey uses the whole text information to obtain representative keywords, which are disambiguated using their context; it comprises the keywords themselves, the keyword disambiguation results, the keyword importance (expressed as weights), and the relations between keywords. The disambiguation result of a keyword is a pair consisting of the keyword and its meaning string; a keyword relation is a keyword pair together with a relation name.
3. The system divides intelligent processing into three modules: the full-text matching algorithm merging module (FMM), the multi-feature extraction and scoring module (FE), and the global feature extraction module (WFE), which correspond respectively to the recognition, feature extraction, and training processes of an intelligent algorithm. They are configured in the distributed search as follows: FMM is distributed in the basic server and the advanced server; FE is distributed in the inverted-index extraction process of the basic server and the query analysis process of the advanced server; and WFE is distributed in the index module.
4. The method is applied to job search (looking for a position): the background information of a résumé is extracted, and the individual's background information is used to search for matching positions. It is applied to résumé search (looking for talent): the requirement information of a position is extracted, and that requirement information is used to search for matching résumés.