CN104142968A - Solr technology based distributed searching method and system

Info

Publication number: CN104142968A
Application number: CN201310577657.XA
Authority: CN
Inventors: 吴含前; 姚莉; 王存哲; 李露
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2013-11-19
Filing date: 2013-11-19
Publication date: 2014-11-12

Abstract

The invention discloses a distributed search method and system based on Solr technology. The method comprises the following steps: 1) when an offline client system registers and files electronic documents, the documents are first classified automatically using a naive Bayes algorithm; 2) after classification, the documents are indexed in a distributed manner using a consistent hashing algorithm, according to the category each document belongs to; and 3) after the index files are built, a user enters a query statement to search the electronic documents. The system uses the distributed mode of the open-source search tool Solr: the query request is distributed to the distributed nodes, each node responds to the search request, and the results are merged, de-duplicated, sorted and returned to the user, thereby realizing distributed vertical search. In this way, the accuracy of automatic classification of electronic documents is improved and the stability of the system is enhanced.

Description

A distributed search method and system based on Solr technology

Technical field

The present invention relates to the field of information retrieval, and in particular to a distributed search method and system based on Solr technology.

Background technology

With the rapid development of Internet technology, the volume of online data has grown sharply, and this mass of data has had a considerable impact on the search quality of general-purpose search engines. It has become difficult to find the information one needs on the web accurately and quickly. There are three main reasons: first, online information is complex and unordered, and different websites may carry duplicate information, so the results returned by a search engine contain information noise; second, it is very difficult to judge a user's real search intention from the query terms alone; third, a search engine's crawler cannot crawl all the information on the Internet, nor capture network information in real time. There is therefore an urgent need for search engines dedicated to a particular field or topic.

Summary of the invention

The main technical problem solved by the present invention is to provide a distributed search method and system based on Solr technology that improves the accuracy of automatic classification of electronic files, enhances the stability of the system, and merges, de-duplicates and automatically groups search results, thereby realizing vertical search and making search more focused, specific and in-depth.

To solve the above technical problem, the technical solution adopted by the present invention is to provide a distributed search method based on Solr technology, comprising the following steps:

1) when the offline client system registers and files an electronic file, the electronic file is first classified automatically using a naive Bayes algorithm;

2) after the electronic file is classified, it is indexed in a distributed manner using a consistent hashing algorithm according to the category it belongs to; the indexed content comprises the important metadata of the electronic file and the relevant metadata of the electronic documents that the electronic file contains;

3) after the index files are built, the user enters a query statement to search the electronic files;

wherein step 3) specifically comprises: using the distributed mode of the open-source search tool Solr, the query request is distributed to the distributed nodes, each distributed node responds to the search request, and the results are then merged, de-duplicated, sorted and returned to the user.
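
The merge, de-duplication and sorting of step 3) can be illustrated with a minimal Python sketch. It is not the patented implementation, and in practice Solr performs this merge itself in distributed mode. The hit fields `id` and `score` and the function name `merge_results` are assumptions made for illustration.

```python
from typing import Dict, List


def merge_results(per_node_hits: List[List[Dict]], top_k: int = 10) -> List[Dict]:
    """Merge hit lists from several nodes, drop duplicate ids, sort by score."""
    best: Dict[str, Dict] = {}
    for hits in per_node_hits:
        for hit in hits:
            doc_id = hit["id"]
            # Keep the highest-scoring copy when a document is returned twice.
            if doc_id not in best or hit["score"] > best[doc_id]["score"]:
                best[doc_id] = hit
    # Highest score first; ties are broken by id so the order is deterministic.
    ranked = sorted(best.values(), key=lambda h: (-h["score"], h["id"]))
    return ranked[:top_k]
```

Each inner list stands for the response of one distributed node; the function returns a single ranked list in which every document id appears at most once.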

In a preferred embodiment of the present invention, when the electronic file is automatically classified in step 1), a coordinating factor is used to dynamically adjust the emphasis of the automatic classification; the value of the coordinating factor lies between 0 and 1.

In a preferred embodiment of the present invention, the value of the coordinating factor is 0.5.

In a preferred embodiment of the present invention, the naive Bayes algorithm in step 1) specifically comprises the following steps:

1.1) selection and processing of the dictionary: the indexing tool of the search engine is used to index the documents of each category in the dictionary separately;

1.2) extraction of the feature words of the document to be classified: a component of the search engine is used to extract the abstract and keyword information of the document, and the extracted keywords are then de-duplicated to obtain the feature words;

1.3) Bayesian calculation is performed on the extracted feature words using the Bayesian formula and the sample files of the dictionary, giving the probability that the document to be classified belongs to each category; the probability values are then compared and the maximum is taken, which identifies the category of the document to be classified.

In a preferred embodiment of the present invention, the Bayesian formula in step 1.3) is:

Class(d) = argmax P(c|d);

Wherein, d: the document;

c: a category;

Class(d): the category to which the document belongs;

P(c|d): the probability that document d belongs to category c;

argmax P(c|d): the category c for which P(c|d) is largest;

The value of P(c|d) is given by the following formula:

P(c|d) = λP(c) + (1-λ)bayes(c|d);

Wherein, P(c): the probability of category c within the given set of categories, taken as P(c) = 1/n, where n is the number of categories;

λ: the coordinating factor;

bayes(c|d): the probability, obtained with the Bayesian formula, that document d belongs to category c.
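
The formulas above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: add-one (Laplace) smoothing, the flat per-category word lists in `samples`, and the optional `prior` argument are all assumptions introduced here.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple


def bayes_posteriors(words: List[str], samples: Dict[str, List[str]]) -> Dict[str, float]:
    """Return bayes(c|d) for each category, normalised so the values sum to 1."""
    vocab = {w for sample_words in samples.values() for w in sample_words}
    log_scores: Dict[str, float] = {}
    for category, sample_words in samples.items():
        counts = Counter(sample_words)
        total = len(sample_words)
        # Uniform class prior 1/n; add-one smoothing is an assumption.
        score = -math.log(len(samples))
        for w in words:
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        log_scores[category] = score
    peak = max(log_scores.values())
    weights = {c: math.exp(s - peak) for c, s in log_scores.items()}
    norm = sum(weights.values())
    return {c: v / norm for c, v in weights.items()}


def classify(words: List[str], samples: Dict[str, List[str]], lam: float = 0.5,
             prior: Optional[Dict[str, float]] = None) -> Tuple[str, Dict[str, float]]:
    """Blend P(c) and bayes(c|d) with coordinating factor lam, then take argmax."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    n = len(samples)
    prior = prior or {c: 1.0 / n for c in samples}  # P(c) = 1/n by default
    posteriors = bayes_posteriors(words, samples)
    blended = {c: lam * prior[c] + (1.0 - lam) * posteriors[c] for c in samples}
    return max(blended, key=blended.get), blended
```

With the uniform prior P(c) = 1/n and λ < 1, the blend rescales the scores without changing their order. The optional `prior` argument is a hypothetical hook for a non-uniform prior, for example one concentrated on a category declared elsewhere.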

The present invention also provides a distributed search system, the system comprising:

an automatic classifier, for automatically classifying electronic files;

a distributed index and search device, which uses the replication mode and the distributed mode of Solr: the index files of the distributed nodes are backed up through the replication mode, and distributed search is carried out through the distributed mode.

In a preferred embodiment of the present invention, the system further comprises an intelligent prompt device for giving intelligent prompts on query statements, a grouping statistics device for automatically grouping and counting search results, and a search result permission filtering device.

The beneficial effects of the present invention are as follows: electronic files are automatically classified using a naive Bayes algorithm, and a coordinating factor is introduced to dynamically adjust the emphasis of the automatic classification, which improves the accuracy of automatic classification of electronic files; electronic files are indexed in a distributed manner using a consistent hashing algorithm, which enhances the stability of the system; and by using the distributed mode of Solr, the distributed nodes are optimized and the search results are merged, de-duplicated and automatically grouped, realizing vertical search and making search more focused, specific and in-depth.

Brief description of the drawings

Fig. 1 is a schematic flow diagram of the distributed search method and system based on Solr technology of the present invention;

Fig. 2 is the distributed index state diagram of the distributed search method based on Solr technology of the present invention;

Fig. 3 is the distributed search flowchart of the distributed search method based on Solr technology of the present invention;

Fig. 4 is the software architecture diagram of the distributed search system of the present invention;

Fig. 5 is the class interface design diagram of the automatic classifier of the distributed search system of the present invention;

Fig. 6 is the class interface design diagram of the distributed indexer of the distributed search system of the present invention;

Fig. 7 is the search intelligent prompt interface of the distributed search system of the present invention;

Fig. 8 is the advanced search interface of the distributed search system of the present invention;

Fig. 9 is the search results interface of the distributed search system of the present invention;

The reference numerals in the drawings are as follows: 1, indexer; 2, searcher.

Embodiment

The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more readily understood by those skilled in the art and the scope of protection of the invention can be defined more clearly.

Referring to Fig. 1 to Fig. 9, an embodiment of the present invention comprises:

A distributed search system, the system comprising:

1) an automatic classifier, for automatically classifying electronic files;

When the ERMS offline client system registers and files an electronic file, the file is automatically classified to prepare for the subsequent distributed indexing. Because the documents contained in an electronic file may not match the subject described by the file metadata, the final type of the electronic file cannot be determined solely from the file type defined in the ERMS offline client system. The automatic classifier in this embodiment therefore uses a coordinating factor whose value is set by the user; through it the user determines the relative weight given to the category defined in the ERMS offline client system and to the Bayesian classification. The default value of the coordinating factor is 0.5.

The Bayesian formula is:

Class(d) = argmax P(c|d);

Wherein, d: the document;

c: a category;

Class(d): the category to which the document belongs;

P(c|d): the probability that document d belongs to category c;

The value of P(c|d) is given by the following formula:

P(c|d) = λP(c) + (1-λ)bayes(c|d);

λ: the coordinating factor, with a value between 0 and 1;

It follows from the above formula that when λ = 1 the Bayesian algorithm is not used at all and the electronic file is classified entirely according to the type configured for it in the current ERMS offline client system; conversely, when λ = 0 the electronic file is reclassified entirely according to the Bayesian classification algorithm.

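Because document d can be represented as a set of m mutually independent feature values, d = (w1, w2, ..., wm),

bayes(c|d) can be computed with the Bayesian algorithm, that is:

bayes(c|d) = P(c)P(d|c) / P(d) = P(c) ∏ P(wi|c) / P(d), with the product taken over i = 1, ..., m;

After the coordinating factor is introduced, it must still hold that ∑ P(ci|d) = 1, with the sum taken over all categories c1, c2, ..., cn. The proof is as follows:

1) P(c1|d) = λP(c1) + (1-λ)bayes(c1|d);

2) P(c2|d) = λP(c2) + (1-λ)bayes(c2|d);

3) P(c3|d) = λP(c3) + (1-λ)bayes(c3|d);

……

n) P(cn|d) = λP(cn) + (1-λ)bayes(cn|d);

Adding the above n expressions gives: ∑ P(ci|d) = λ ∑ P(ci) + (1-λ) ∑ bayes(ci|d);

Since ∑ P(ci) = 1 and ∑ bayes(ci|d) = 1, it follows that ∑ P(ci|d) = λ + (1-λ) = 1, which completes the proof.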
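
The property being proved, namely that the blended values P(c|d) = λP(c) + (1-λ)bayes(c|d) still sum to 1 over all categories, can also be checked numerically. The sketch below is illustrative only; the category names and probability values are made up for the example.

```python
from typing import Dict


def blend(prior: Dict[str, float], bayes: Dict[str, float], lam: float) -> Dict[str, float]:
    """Return P(c|d) = lam * P(c) + (1 - lam) * bayes(c|d) for each category c."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return {c: lam * prior[c] + (1.0 - lam) * bayes[c] for c in prior}


# Hypothetical example values: a uniform prior over four categories and
# a Bayesian posterior that already sums to 1.
prior = {"contract": 0.25, "report": 0.25, "invoice": 0.25, "memo": 0.25}
bayes = {"contract": 0.6, "report": 0.2, "invoice": 0.15, "memo": 0.05}

for lam in (0.0, 0.25, 0.5, 1.0):
    total = sum(blend(prior, bayes, lam).values())
    # The blended values remain a probability distribution for every lam.
    assert abs(total - 1.0) < 1e-9
```

The check passes for any λ in [0, 1] because both inputs sum to 1, which mirrors the term-by-term addition used in the proof.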

2) a distributed index and search device, shown as indexer 1 and searcher 2 in Fig. 1, which uses the replication mode and the distributed mode of Solr: the index files of the distributed nodes are backed up through the replication mode, and distributed search is carried out through the distributed mode;

Because the number of electronic files managed in the ERMS offline client system grows exponentially, the size of the index files inevitably grows exponentially as well, and once the index files exceed a certain size threshold, search speed and efficiency drop sharply. Therefore, so that the system can handle search over massive, PE-level volumes of electronic files, this embodiment adopts a distributed strategy: the index files are stored and backed up in a distributed manner based on the consistent hashing algorithm and the replication mode of Solr, and distributed search is carried out based on the distributed mode of Solr and its facet-based vertical search feature. In addition, memcached and a heartbeat strategy are used to store and monitor the state of the distributed nodes. Fig. 2 is the distributed index state diagram.
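
How a consistent hashing algorithm can map a file category to an index node is sketched below. This is a generic consistent-hash ring, not the patent's own code; the class name `ConsistentHashRing`, the use of MD5, the virtual-node count and the node names are all assumptions.

```python
import bisect
import hashlib
from typing import Dict, List


class ConsistentHashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes: List[str], replicas: int = 100) -> None:
        self._replicas = replicas
        self._keys: List[int] = []        # sorted hash points on the ring
        self._owners: Dict[int, str] = {}  # hash point -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self._replicas):
            point = self._hash(f"{node}#{i}")
            self._owners[point] = node
            bisect.insort(self._keys, point)

    def remove_node(self, node: str) -> None:
        for i in range(self._replicas):
            point = self._hash(f"{node}#{i}")
            if self._owners.get(point) == node:
                del self._owners[point]
                self._keys.remove(point)

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's hash point."""
        if not self._keys:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._owners[self._keys[idx]]
```

When a node is removed, only the categories that were mapped to it move to another node; all other assignments stay where they were, which is the property that makes this scheme attractive for a growing or failing cluster.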

The distributed search in this embodiment is implemented mainly with the shard (distributed) mode of Solr. The user enters query terms; the surviving distributed master node or temporary master node is then obtained from the distributed cache, and the request is distributed to the surviving distributed nodes, which respond to it. The master server is responsible for aggregating the query results of the distributed nodes and then returns the final query result to the user. Fig. 3 is the flowchart of the distributed search.
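
A distributed Solr request of this kind is typically expressed with the `shards` request parameter, which lists the nodes to query. The sketch below only builds such a request URL and does not contact any server; the host names, the core name `files` and the function name `build_shard_query` are hypothetical.

```python
from typing import List
from urllib.parse import urlencode


def build_shard_query(coordinator: str, shards: List[str], query: str, rows: int = 10) -> str:
    """Build a Solr select URL that fans the query out to the given shards."""
    if not shards:
        raise ValueError("at least one surviving shard is required")
    params = {
        "q": query,
        "shards": ",".join(shards),  # comma-separated host:port/solr/core list
        "rows": rows,
        "wt": "json",
    }
    return f"{coordinator}/select?{urlencode(params)}"


# Hypothetical nodes; in the described system the surviving nodes would be
# read from the distributed cache before the request is built.
url = build_shard_query(
    "http://node1:8983/solr/files",
    ["node1:8983/solr/files", "node2:8983/solr/files"],
    "title:contract",
)
```

The node that receives this request acts as the coordinator: it forwards the query to each listed shard and aggregates the responses before answering.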

3) an intelligent prompt device, for giving intelligent prompts on the user's query statement;

4) a grouping statistics device, for automatically grouping and counting search results;

5) a search result permission filtering device.
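
The roles of devices 3) to 5) can be illustrated with a small Python sketch. It is only a stand-in for the behaviour described, not the system's implementation; the field names `category` and `allowed_roles`, the query history list and all function names are assumptions.

```python
from collections import Counter
from typing import Dict, List, Set


def suggest(prefix: str, history: List[str], limit: int = 5) -> List[str]:
    """Intelligent prompt: past queries starting with the prefix, most frequent first."""
    matches = Counter(q for q in history if q.startswith(prefix))
    ranked = sorted(matches.items(), key=lambda kv: (-kv[1], kv[0]))
    return [q for q, _ in ranked[:limit]]


def group_counts(hits: List[Dict], field: str) -> Dict[str, int]:
    """Grouping statistics: number of hits per value of the given field."""
    return dict(Counter(h[field] for h in hits if field in h))


def filter_by_permission(hits: List[Dict], user_roles: Set[str]) -> List[Dict]:
    """Permission filtering: keep hits whose allowed roles overlap the user's roles."""
    return [h for h in hits if user_roles & set(h.get("allowed_roles", []))]
```

In a Solr deployment the grouping statistics would more likely come from faceting on the server side; the local counting here only shows what the result looks like.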

The system provides automatic classification of data, distributed indexing, distributed search, intelligent prompting, grouping statistics and permission filtering of search results. Distributed indexing and search mainly use the replication mode and the distributed mode of Solr: the index files of the distributed nodes are backed up through the replication mode, and distributed search is carried out through the distributed mode.

A distributed search method based on Solr technology comprises the following steps:

Wherein, the naive Bayes algorithm specifically comprises the following steps:

1.1) selection and processing of the dictionary: relatively authoritative samples should first be selected as the basis for classification; the samples in this embodiment come from the Sogou dictionary (standard edition). Because this dictionary is large, comparing a document against every sample file in the dictionary would take at least 10 seconds; to speed up indexing, this embodiment uses Lucene's indexing tool IndexWriter to index the documents of each category in the dictionary separately;

1.2) extraction of the feature words of the document to be classified: the feature words are extracted with the Tika component of the Lucene search engine. To speed up indexing, only the abstract and keyword information of the document is extracted, and the extracted keywords are then de-duplicated;

The system uses search engine technology to design and implement a distributed vertical search engine on top of the ERMS system. When the ERMS offline client system registers and files an electronic file, the file is first classified automatically using a naive Bayes algorithm. After classification, the file is indexed in a distributed manner using a consistent hashing algorithm according to the category it belongs to; the indexed content comprises the important metadata of the electronic file and the relevant metadata of the electronic documents that the electronic file contains. After the index files are built, the user can search the electronic files by entering a query statement. This is implemented with the distributed (shard) mode of the open-source search tool Solr: the query request is distributed to the distributed nodes, each distributed node responds to the search request, and the results are merged, de-duplicated, sorted and returned to the user. For system stability, the system is optimized in two respects. The first is the handling of highly concurrent requests: the distributed nodes are optimized and load balancing is introduced, so that the system can process user requests quickly under high concurrency. The second is disaster tolerance: a master-slave architecture is adopted, and the index files on the distributed nodes are backed up periodically based on the observer pattern. When a distributed node fails, its backup node takes over its role and responds to indexing and search requests.
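
The failover behaviour described above, in which a backup node takes over when a distributed node fails, can be sketched as a simple heartbeat check. This is an assumption-laden illustration: the function name `pick_serving_node`, the timestamp table and the 15-second timeout are hypothetical and not taken from the source.

```python
from typing import Dict, Optional


def pick_serving_node(master: str, backup: str, last_heartbeat: Dict[str, float],
                      now: float, timeout: float = 15.0) -> Optional[str]:
    """Return the master if its heartbeat is fresh, else the backup, else None."""

    def alive(node: str) -> bool:
        seen = last_heartbeat.get(node)
        # A node counts as alive only if it reported within the timeout window.
        return seen is not None and now - seen <= timeout

    if alive(master):
        return master
    if alive(backup):
        return backup
    return None
```

In the described system the heartbeat table would live in the distributed cache, so that any node handling a request can decide which node should currently serve it.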

The present invention discloses a distributed search method and system based on Solr technology. Electronic files are automatically classified using a naive Bayes algorithm, and a coordinating factor is introduced to dynamically adjust the emphasis of the automatic classification, which improves the accuracy of automatic classification of electronic files; electronic files are indexed in a distributed manner using a consistent hashing algorithm, which enhances the stability of the system; and by using the distributed mode of Solr, the distributed nodes are optimized and the search results are merged, de-duplicated and automatically grouped, realizing vertical search and making search more focused, specific and in-depth.

The foregoing describes only embodiments of the present invention and does not thereby limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A distributed search method based on Solr technology, characterized in that it comprises the following steps:

2. The distributed search method based on Solr technology according to claim 1, characterized in that, when the electronic file is automatically classified in step 1), a coordinating factor is used to dynamically adjust the emphasis of the automatic classification, the value of the coordinating factor lying between 0 and 1.

3. The distributed search method based on Solr technology according to claim 2, characterized in that the value of the coordinating factor is 0.5.

4. The distributed search method based on Solr technology according to claim 1, characterized in that the naive Bayes algorithm in step 1) specifically comprises the following steps:

5. The distributed search method based on Solr technology according to claim 4, characterized in that the Bayesian formula in step 1.3) is:

Class(d) = argmax P(c|d);

Wherein, d: the document;

c: a category;

Class(d): the category to which the document belongs;

P(c|d): the probability that document d belongs to category c;

The value of P(c|d) is given by the following formula:

P(c|d) = λP(c) + (1-λ)bayes(c|d);

λ: the coordinating factor;

6. A distributed search system, characterized in that the system comprises:

an automatic classifier, for automatically classifying electronic files;

7. The distributed search system according to claim 6, characterized in that the system further comprises an intelligent prompt device for giving intelligent prompts on query statements, a grouping statistics device for automatically grouping and counting search results, and a search result permission filtering device.