CN102243631A - Super key distributed searching method - Google Patents

Info

Publication number
CN102243631A
Authority
CN
China
Prior art keywords
distributed
search
key word
module
searching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201010171392XA
Other languages
Chinese (zh)
Inventor
吴春尧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201010171392XA
Publication of CN102243631A
Legal status: Pending

Abstract

The invention discloses a superkey distributed search method, an information retrieval method that intelligently matches against the full text of a document. The full text is used as the search input, and intelligent processing is fused into each stage of the search engine framework. The system processes the whole text to obtain representative keywords; a superkey comprises the keywords themselves, the keyword disambiguation results, the keyword weights, and the relations among the keywords. The intelligent processing uses three modules: a full-text matching algorithm merging module (FMM), a multi-feature extraction module (FE), and a global feature extraction module (WFE). The FMM is distributed across the basic server and the advanced server; the FE is distributed across the inverted-index extraction process of the basic server and the query analysis process of the advanced server; and the WFE is placed in the index module. The system keeps the advantages of distributed search, such as high data throughput, high reliability, and concurrent service, adds no additional search complexity, and improves the relevance of search results.

Description

Superkey distributed search method
Technical field
The present invention proposes an information retrieval method for full-text search of text, the superkey distributed search method. It is especially suited to retrieval over massive data sets: it searches by intelligent matching of the full text, preserves the distributed nature of search, and improves search accuracy.
Background
At present, keyword search is the main form of information retrieval used by search engines. Through keyword retrieval and ranking techniques, the engine carries out distributed search and presents the most likely matches in ranked order. Combining intelligent techniques with search, i.e. making search engines intelligent, basically takes the following forms:
1. Classify and cluster the data in advance, present these categories alongside the results, and improve search quality through user interaction;
2. Mine search logs for latent relationships between keywords in order to expand the query keywords;
3. Use intelligent agent technology to make search intelligent;
4. Build intelligent techniques into the crawler.
These methods are all based on keyword search and cannot serve a user who needs whole-text search, where the input is an entire text (for example, using a resume to look for a position). Methods for making keyword search engines intelligent cannot solve this problem. What is needed is an intelligent method that does not destroy the search engine's distributed, large-data processing capability. The present invention addresses exactly this problem.
Summary of the invention
The superkey distributed search method is an information retrieval method that intelligently matches the full text of a document, providing fast, highly relevant search over massive data. It is defined in contrast to keyword search: a keyword search method allows only a limited number of keywords and cannot take a full text as the query; if a long text is entered into the search engine, it is truncated and only a limited leading string is kept. In superkey search the whole text is the search input, so the search engine obtains more useful information and returns results that better match the user's input. The method makes corresponding improvements to the search engine framework while keeping the engine's distributed, large-data processing characteristics, because intelligent processing is fused into each stage of the framework. Specifically, superkey search formalizes the full-text information: it uses the whole text to obtain representative keywords and disambiguates each keyword using its context. A superkey comprises the keywords themselves, the keyword disambiguation results, the keyword importance (expressed as a weight), and the relations between keywords. A keyword disambiguation result is a pair consisting of the keyword and its meaning string; a keyword relation is a keyword pair together with a relation name. This kind of search is called superkey search.
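The superkey described above can be pictured as a small data structure. The following is a minimal sketch, assuming Python 3.9 or later; the class and field names (`KeywordEntry`, `Superkey`, `sense`, `top`) and the sample values are illustrative assumptions and do not come from the patent text.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordEntry:
    term: str      # the keyword itself
    sense: str     # disambiguation result: the meaning string paired with the term
    weight: float  # importance of the keyword within the input text


@dataclass
class Superkey:
    keywords: list[KeywordEntry] = field(default_factory=list)
    # Each relation maps a keyword pair to a relation name.
    relations: dict[tuple[str, str], str] = field(default_factory=dict)

    def top(self, n: int) -> list[str]:
        """Return the n highest-weighted terms."""
        ranked = sorted(self.keywords, key=lambda k: k.weight, reverse=True)
        return [k.term for k in ranked[:n]]


# Hypothetical superkey extracted from a resume.
resume_key = Superkey(
    keywords=[
        KeywordEntry("search", "information retrieval", 0.6),
        KeywordEntry("java", "programming language", 0.9),
    ],
    relations={("java", "search"): "used_for"},
)
```

Keeping the disambiguation result and the weight next to each term means a downstream matcher can treat the superkey as a weighted feature vector without re-reading the input text.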
Taking a whole text as the search input requires making full use of intelligent techniques. The superkey distributed search method proposed here extends the keyword search engine framework: the search engine keeps the distributed, concurrent characteristics of keyword search and can also incorporate a variety of current machine learning algorithms. The basic process is as follows:
1. The system performs feature extraction on the input text to obtain the superkey set; this is the key difference from existing search engines.
2. Search results are ranked according to the features extracted above, and the ranking algorithm is a set of machine learning algorithms. That is, various existing machine learning algorithms can be plugged into this search framework.
3. Feature extraction from the input text is based on text topic extraction methods. The training stage uses text pre-classification, which allows feature weights to be estimated more accurately.
The superkey distributed search framework distributes the machine recognition algorithm for text. In other words, it turns the complete text recognition process into a search engine; seen from another angle, it makes each stage of a distributed search engine intelligent. It is a framework that tightly couples intelligent text processing with a search engine. The search engine thereby keeps distributed advantages such as high concurrency while gaining matching accuracy, which addresses several major search engine problems: completeness, accuracy, and relevance.
The retrieval module of the keyword distributed search engine framework consists of five parts: the index building program (INDEXER), the basic retrieval service (BS), the information retrieval service (DI), the advanced search service (AS), and the retrieval client (CLIENT). See Fig. 1.
The implementation of this framework is briefly as follows:
In the figure, a bidirectional arrow indicates that a stable network connection is established between the two components for data exchange. A unidirectional arrow indicates the direction of data transfer.
The data exchange process is as follows:
1. INDEXER builds the index database from the documents and related information.
2. Because the resources of a single machine are limited, multiple index databases must be built and distributed across different machines.
3. Each index database corresponds to one BS/DI service group on its machine. BS provides the information related to ranking, and DI provides the remaining information needed for display.
4. CLIENT sends a query request to AS. AS visits the relevant group of BS instances as needed and obtains all ranking-related information; this process is query analysis. AS then merges the information returned by each BS to produce the final ranking, determines the specific entries to display from the current display position, visits the relevant DI instances to obtain all the information needed for display, and returns it to CLIENT.
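The request flow in step 4 is a scatter-gather pattern, which can be sketched as follows. This is only an illustration under simplifying assumptions: each BS and DI is modeled as an in-process Python callable rather than a network service, and the names `advanced_search`, `make_bs`, `make_di`, `offset`, and `limit` are hypothetical.

```python
def advanced_search(query, basic_servers, detail_servers, offset=0, limit=10):
    """Play the role of AS: query every BS, merge, then fetch display data from DI."""
    hits = []
    # Scatter: each BS returns (doc_id, score) pairs for its own index shard.
    for shard, bs in enumerate(basic_servers):
        for doc_id, score in bs(query):
            hits.append((score, doc_id, shard))
    # Gather: merge all shard results into one ranking (highest score first).
    ranked = sorted(hits, key=lambda h: (-h[0], h[1]))
    # Only the entries at the current display position need display information.
    page = ranked[offset:offset + limit]
    return [detail_servers[shard](doc_id) for _, doc_id, shard in page]


def make_bs(scores):
    """Toy BS: ignores the query and returns fixed ranking scores."""
    return lambda query: list(scores.items())


def make_di(titles):
    """Toy DI: returns the display record for one document."""
    return lambda doc_id: {"id": doc_id, "title": titles[doc_id]}


bs_list = [make_bs({"d1": 0.4, "d2": 0.9}), make_bs({"d3": 0.7})]
di_list = [
    make_di({"d1": "Tester", "d2": "Search engineer"}),
    make_di({"d3": "Data analyst"}),
]
results = advanced_search("resume text", bs_list, di_list, limit=2)
```

Because DI is contacted only for the entries on the requested page, the cost of fetching display information stays proportional to the page size rather than to the number of matches.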
According to practical application requirements, the retrieval module has the following technical points:
1. The index building server (Indexer) interacts with storage, works over the full data set, and computes the index superkey weights. Index data is partitioned based on MD5.
2. The basic retrieval service obtains data from Indexer, independently of the storage layer, and is responsible for operations such as word segmentation, querying, posting-list merging, and ranking.
3. Likewise, the information retrieval service obtains data from Indexer, independently of the storage layer, and is responsible for summary extraction, computation of display information, and so on. It needs a resident cache.
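The text does not say how the MD5-based partitioning of index data is carried out. One common approach, shown here purely as an assumption, is to hash a document identifier with MD5 and take the result modulo the number of index databases; the function names `shard_for` and `partition` are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard number using its MD5 digest."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a shard number.
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(doc_ids, num_shards):
    """Group document ids by the shard they are assigned to."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(shards)


layout = partition([f"doc-{n}" for n in range(100)], num_shards=4)
```

Hashing makes the assignment deterministic, so any component that knows the number of shards can locate a document's index database without a lookup table.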
The superkey distributed search framework essentially keeps the framework of the distributed search engine while adding a full-text matching algorithm merging module (FMM), a multi-feature extraction and scoring module (FE), and a global feature extraction module (WFE).
Intelligent processing generally comprises three processes: training, feature extraction, and match recognition. Training processes a corpus (or data) to obtain the parameters of each feature and computes the global feature scores. Feature extraction performs word segmentation, word frequency statistics, and superkey feature computation on a particular text, then combines the global and local feature scores. Match recognition traverses the data with the features extracted from the text and estimates the most likely results. Ordinarily, intelligent recognition is a single complete process; the present invention cuts this process apart in a reasonable way and distributes it across the modules of the search framework. See Fig. 2.
Description of the drawings
The present invention is further described below in conjunction with the drawings and embodiments.
Fig. 1 shows the keyword search framework;
Fig. 2 shows the superkey search framework.
Embodiment
1. The global feature extraction module is built on multi-granularity word segmentation and pre-classification. Multi-granularity segmentation uses a statistics-based method for disambiguating ambiguous strings. In the basic ambiguous-string algorithm, ambiguous strings are found by comparing the output of the forward maximum matching algorithm with the correct segmentation; at recognition time the ambiguous-string library is scanned first, and on a hit the result stored in the library is output as the correct segmentation. The present system improves this algorithm: in practice the ambiguous strings were found to be ambiguous themselves, with different correct segmentations in different language contexts. Statistical ambiguous-string disambiguation therefore applies a statistical model to the context before and after the ambiguous string to resolve its ambiguity, which solves the ambiguity problem in ambiguous-string segmentation. The method has the same computational complexity as the forward maximum matching algorithm, so search stays fast while remaining accurate. Pre-classification uses the χ² statistic to estimate global features, with the following formula:
$$ \chi^2(w, c) = \frac{|D| \, (A_1 \times A_4 - A_2 \times A_3)^2}{(A_1 + A_3)(A_2 + A_4)(A_1 + A_2)(A_3 + A_4)} $$
Here A_1 is the number of class-c texts in the training set that contain term w, A_2 is the number of non-class-c texts that contain w, A_3 is the number of class-c texts that do not contain w, A_4 is the number of non-class-c texts that do not contain w, and |D| is the total number of texts in the training set.
This formula gives only the CHI value of a word with respect to a single class; its CHI value with respect to the whole text collection is a combination of its CHI values over all classes. There are usually two ways of combining them; the class-probability-weighted sum is:
$$ \chi^2(w) = \sum_{i=1}^{|c|} P(c_i) \, \chi^2(w, c_i) $$
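The two χ² formulas translate directly into code. The sketch below assumes that the four counts A_1 to A_4 partition the training set, so |D| is their sum; the function names `chi_square` and `global_chi_square` and the toy counts are illustrative assumptions.

```python
def chi_square(a1, a2, a3, a4):
    """CHI value of a term w for one class c.

    a1: class-c texts containing w      a2: non-class-c texts containing w
    a3: class-c texts without w         a4: non-class-c texts without w
    """
    total = a1 + a2 + a3 + a4  # |D|, assuming the four counts cover the training set
    denom = (a1 + a3) * (a2 + a4) * (a1 + a2) * (a3 + a4)
    if denom == 0:
        return 0.0  # degenerate table: the statistic is undefined, treated as 0 here
    return total * (a1 * a4 - a2 * a3) ** 2 / denom


def global_chi_square(counts_by_class, class_probs):
    """Weighted sum of per-class CHI values, with weights P(c_i)."""
    return sum(
        class_probs[c] * chi_square(*counts_by_class[c]) for c in class_probs
    )


# Toy example: w appears in every "job" text and in no "news" text.
counts = {"job": (10, 0, 0, 10), "news": (0, 10, 10, 0)}
probs = {"job": 0.5, "news": 0.5}
score = global_chi_square(counts, probs)
```

A term that is spread evenly across classes scores 0 (for example `chi_square(5, 5, 5, 5)`), while a term concentrated in one class scores high, which is why the value is usable as a global feature weight.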
In the framework of the present invention, the global feature extraction module is embedded in the index module (INDEXER); global feature extraction is completed while the index is being built, so complexity does not increase.
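The baseline segmentation procedure described in item 1 (scan the ambiguous-string library first, otherwise apply forward maximum matching) can be sketched as below. The sketch covers only the baseline lookup, not the statistical context model the patent adds on top; the function name `segment`, the sample vocabulary, and the sample library entry are assumptions.

```python
def segment(text, vocab, ambiguity_table=None):
    """Forward maximum matching with an ambiguous-string lookup in front."""
    ambiguity_table = ambiguity_table or {}
    max_len = max((len(w) for w in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Scan the ambiguous-string library first; on a hit, emit the stored result.
        hit = next(
            (s for s in ambiguity_table if s and text.startswith(s, i)), None
        )
        if hit is not None:
            tokens.extend(ambiguity_table[hit])
            i += len(hit)
            continue
        # Otherwise take the longest vocabulary word starting at position i,
        # falling back to a single character when nothing matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


vocab = {"研究", "研究生", "生命", "起源"}
plain = segment("研究生命起源", vocab)
fixed = segment("研究生命起源", vocab, {"研究生命": ["研究", "生命"]})
```

Without the library, greedy matching takes "研究生" first and leaves a stray "命"; with the library entry, the stored segmentation is used instead. A fixed library entry cannot adapt to context, which is the limitation the statistical method in the text is meant to address.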
2. The full-text matching algorithm merging module uses the cosine similarity algorithm:
$$ \mathrm{SIM}(d_i, d_j) = \frac{\sum_{k=1}^{m} W_{ik} \, W_{jk}}{\sqrt{\left(\sum_{k=1}^{m} W_{ik}^2\right)\left(\sum_{k=1}^{m} W_{jk}^2\right)}} $$
d_i and d_j are two vectors: d_i is the input feature vector to be matched, and d_j is the feature vector of one entry in the full collection. W_{ik} is the weight of the k-th of the m features in d_i.
The full-text matching algorithm merging module is embedded in BS and AS. Because it builds on the inverted-index design of the existing framework, it does not increase algorithmic complexity.
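The cosine similarity between two weighted feature vectors can be computed as in the following sketch. It assumes each vector is stored sparsely as a dictionary from feature to weight, which suits an inverted-index setting; the function name `cosine_similarity` and the sample vectors are illustrative.

```python
import math


def cosine_similarity(wi, wj):
    """Cosine of the angle between two sparse weight vectors."""
    # Only features present in both vectors contribute to the dot product.
    dot = sum(w * wj[t] for t, w in wi.items() if t in wj)
    norm_i = math.sqrt(sum(w * w for w in wi.values()))
    norm_j = math.sqrt(sum(w * w for w in wj.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_i * norm_j)


query_vec = {"java": 0.9, "search": 0.6}
doc_vec = {"java": 0.8, "index": 0.3}
similarity = cosine_similarity(query_vec, doc_vec)
```

Because the dot product touches only shared features, the per-document contributions can be accumulated while walking posting lists, which is consistent with the claim that the module adds no extra complexity to the inverted-index design.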
3. The multi-feature extraction and scoring module is an important module and is invoked from two places. One is the index builder (INDEXER): during index building it obtains the features of every document and writes them into the index database's storage structure along with the inverted list. The other is the processing of the user's input text, i.e. query analysis, which likewise extracts the multiple features and uses them for search and merging.
The present invention has the following beneficial effects on search:
The framework keeps the characteristics of distributed search: 1. throughput; 2. reliability (no single point of failure); 3. data rebuilding without interrupting service; 4. redundancy and backup; 5. ease of use of the infrastructure. Search accuracy is greatly improved: 1. either a whole text or keywords may be entered; 2. the system performs intelligent full-text extraction on the input to obtain an accurate feature representation; 3. through intelligent matching, the system returns results that better fit the input.

Claims (4)

  1. The superkey distributed search method, an information retrieval method that intelligently matches the full text of a document, wherein the training and recognition processes of the intelligent processing method are split apart and placed in the modules of a distributed search system, so that the system has the large-data processing, high speed, and high-concurrency service characteristics of distributed search together with the characteristics of distributed intelligence.
  2. The system input is an entire article. The system analyzes the article to obtain a superkey. A superkey is obtained by using the whole text to extract representative keywords and disambiguating them by context; it comprises the keywords themselves, the keyword disambiguation results, the keyword importance (expressed as a weight), and the relations between keywords. A keyword disambiguation result is a pair consisting of the keyword and its meaning string; a keyword relation is a keyword pair together with a relation name.
  3. The system divides intelligent processing into three modules: the full-text matching algorithm merging module (FMM), the multi-feature extraction and scoring module (FE), and the global feature extraction module (WFE), corresponding respectively to the recognition, feature extraction, and training processes of the intelligent algorithm. They are configured in distributed search as follows: FMM is distributed across the basic server and the advanced server; FE is distributed across the inverted-index extraction process of the basic server and the query analysis process of the advanced server; WFE is placed in the index module.
  4. Applied to position search (job hunting), the method obtains the background information of a resume and uses that personal background information to search for matching positions; applied to resume search (talent hunting), it obtains the requirement information of a position and uses that requirement information to search for matching resumes.
CN201010171392XA 2010-05-13 2010-05-13 Super key distributed searching method Pending CN102243631A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010171392XA CN102243631A (en) 2010-05-13 2010-05-13 Super key distributed searching method

Publications (1)

Publication Number Publication Date
CN102243631A true CN102243631A (en) 2011-11-16

Family

ID=44961694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010171392XA Pending CN102243631A (en) 2010-05-13 2010-05-13 Super key distributed searching method

Country Status (1)

Country Link
CN (1) CN102243631A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103634420A (en) * 2013-11-22 2014-03-12 北京极客优才科技有限公司 Resume e-mail screening system and method
CN103634420B (en) * 2013-11-22 2017-07-28 谢小雪 resume mail screening system and method
CN106844698A (en) * 2017-01-26 2017-06-13 成都市亚丁胡杨科技股份有限公司 A kind of digital cloud service platform
CN107844960A (en) * 2017-11-22 2018-03-27 辅投帮(武汉)科技有限公司 A kind of investment analysis tools of automatic intelligent analysis report of business plan
CN107844960B (en) * 2017-11-22 2020-12-01 辅投帮(武汉)科技有限公司 Investment analysis tool for automatically and intelligently analyzing business plan
CN110442619A (en) * 2019-07-29 2019-11-12 新华三大数据技术有限公司 Search result ordering method, device, electronic equipment and storage medium
CN110442619B (en) * 2019-07-29 2022-02-11 新华三大数据技术有限公司 Search result ordering method and device, electronic equipment and storage medium
CN112307762A (en) * 2020-12-24 2021-02-02 完美世界(北京)软件科技发展有限公司 Search result sorting method and device, storage medium and electronic device
CN112307762B (en) * 2020-12-24 2021-04-30 完美世界(北京)软件科技发展有限公司 Search result sorting method and device, storage medium and electronic device

Similar Documents

Publication Publication Date Title
CN104376406B (en) A kind of enterprise innovation resource management and analysis method based on big data
CN104239513B (en) A kind of semantic retrieving method of domain-oriented data
CN101655857B (en) Method for mining data in construction regulation field based on associative regulation mining technology
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
CN106156272A (en) A kind of information retrieval method based on multi-source semantic analysis
CN102043851A (en) Multiple-document automatic abstracting method based on frequent itemset
CN106951498A (en) Text clustering method
CN102184262A (en) Web-based text classification mining system and web-based text classification mining method
CN103838833A (en) Full-text retrieval system based on semantic analysis of relevant words
CN106202514A (en) Accident based on Agent is across the search method of media information and system
Bin et al. Web mining research
CA2788435A1 (en) Method and system for conducting legal research using clustering analytics
CN103778206A (en) Method for providing network service resources
CN102243631A (en) Super key distributed searching method
CN103761286B (en) A kind of Service Source search method based on user interest
Costache et al. Categorization based relevance feedback search engine for earth observation images repositories
Liu et al. Clustering-based topical Web crawling using CFu-tree guided by link-context
CN110377690A (en) A kind of information acquisition method and system based on long-range Relation extraction
CN103970838A (en) Society image tag ordering method based on compressed domains
CN101661492B (en) High-dimensional space hypersphere covering method for human motion capture data retrieval
Hwang et al. A befitting image data crawling and annotating system with cnn based transfer learning
Liu et al. A query suggestion method based on random walk and topic concepts
Ajitha et al. EFFECTIVE FEATURE EXTRACTION FOR DOCUMENT CLUSTERING TO ENHANCE SEARCH ENGINE USING XML.
CN106156259A (en) A kind of user behavior information displaying method and system
Ma et al. QA4PRF: A question answering based framework for pseudo relevance feedback

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
DD01 Delivery of document by public notice

Addressee: Wu Chunyao

Document name: Notification of Publication of the Application for Invention

DD01 Delivery of document by public notice

Addressee: Wu Chunyao

Document name: Notification that Application Deemed to be Withdrawn

C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20111116