CN105787097A - Distributed index establishment method and system based on text clustering - Google Patents

Distributed index establishment method and system based on text clustering

Info

Publication number
CN105787097A
Authority
CN
China
Prior art keywords
text
index
distributed
vocabulary
cluster
Prior art date
Legal status
Pending
Application number
CN201610154682.0A
Other languages
Chinese (zh)
Inventor
林格
邓现
Current Assignee
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by National Sun Yat Sen University filed Critical National Sun Yat Sen University
Priority to CN201610154682.0A priority Critical patent/CN105787097A/en
Publication of CN105787097A publication Critical patent/CN105787097A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 Indexing structures
    • G06F 16/2272 Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a distributed index establishment method and system based on text clustering. The method includes the following steps: unstructured texts are formatted and word-segmented in a preprocessing stage, and the preprocessing results are stored on the original distributed nodes; the preprocessing results are filtered and subjected to feature extraction to obtain processed text lexical feature vectors; the text lexical feature vectors are clustered with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors; each of the K clusters is distributed onto one or more distributed nodes; and full-text indexes are built with an index engine for the K clusters distributed on the one or more nodes, yielding K full-text indexes. The embodiments build a distributed index for retrieval, give the user a fast way of indexing and searching, and improve the user experience.

Description

Distributed index construction method and system based on text clustering
Technical field
The present invention relates to the technical field of search index construction, and in particular to a distributed index construction method and system based on text clustering.
Background art
Traditional structured information management generally retrieves information with index technology. In a distributed network environment, however, the scale of knowledge grows quickly and the size of the index file increases dramatically with it: the index can no longer be stored in a centralized fashion, and retrieval efficiency is severely degraded by the huge index database. A document-partition-based indexing method has been proposed for this situation, but it divides the collection randomly; since the resulting subsets are equivalent to one another, every sub-index still has to be searched at query time, so the cost of retrieval remains very large.
Text clustering rests on the clustering hypothesis: objects of the same class are highly similar, while objects of different classes differ considerably. It is an unsupervised machine-learning method. Unlike text classification, clustering needs neither a training process nor documents manually labelled with classes in advance; different texts condense into different categories automatically, which gives it a certain flexibility and a high degree of automated processing capability.
Distributed computing technology mainly comprises two basic functions, distributed storage and parallel computation. Distributed storage provides a transparent, consistent file-access system and physically stores massive data in a distributed manner. Parallel computation scatters massive input data over multiple nodes, each computing node works in parallel, and the results of all nodes are finally merged into the final result.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art. The invention provides a distributed index construction method and system based on text clustering, which build a distributed index for retrieval, give the user a fast way of indexing, and improve the user experience.
To solve the above problems, the present invention proposes a distributed index construction method based on text clustering, the method comprising:
formatting and word-segmentation preprocessing of unstructured texts, with the preprocessing results stored on distributed nodes;
filtering and feature extraction on the preprocessing results to obtain processed text lexical feature vectors;
clustering the text lexical feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors;
distributing each of the K clusters onto one or more distributed nodes;
building full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes, to obtain K full-text indexes.
Preferably, formatting and word-segmentation preprocessing of the unstructured texts and storing the preprocessing results on distributed nodes includes:
performing format unification on the unstructured texts of different formats on each distributed node to obtain first texts of consistent format;
performing word segmentation on the first texts and extracting keywords from the segmentation results to obtain the keyword vocabulary of each first text;
storing the keyword vocabulary on the distributed nodes as "key = text number, value = text vocabulary" combinations.
Preferably, filtering and feature extraction on the preprocessing results to obtain the processed text feature vectors includes:
processing the texts stored on the distributed nodes in a parallelized manner to obtain the term frequency of each term in the texts;
comparing the term frequencies with a first threshold and keeping the terms whose term frequency exceeds the first threshold;
computing the TF-IDF values of those terms, comparing them with a second threshold, and keeping as second terms the terms whose TF-IDF value exceeds the second threshold;
extracting features from the second terms and assigning weights to them to obtain the feature vectors of the second terms.
Preferably, clustering the text feature vectors with the Canopy-Kmeans clustering algorithm includes:
performing a preliminary clustering of the text lexical feature vectors with Canopy clustering to obtain preliminary clusters of text lexical feature vectors centered on Canopy centers;
performing Kmeans clustering on the preliminary clusters of text lexical feature vectors to obtain the K clusters of text lexical feature vectors.
Preferably, building full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes includes:
processing the clusters on each distributed node with the index engine to build the full-text index of each cluster;
merging the full-text indexes of the clusters on all distributed nodes to obtain K full-text indexes.
Correspondingly, the present invention also provides a distributed index construction system based on text clustering, the system comprising:
a preprocessing module, configured to format and word-segment unstructured texts and store the preprocessing results on distributed nodes;
a filtering and feature-extraction module, configured to filter the preprocessing results and perform feature extraction to obtain processed text lexical feature vectors;
a clustering module, configured to cluster the text lexical feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors;
a cluster-distribution module, configured to distribute each of the K clusters onto one or more distributed nodes;
an index-construction module, configured to build full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes, to obtain K full-text indexes.
Preferably, the preprocessing module includes:
a format-unification processing unit, configured to unify the format of the unstructured texts of different formats on each distributed node and obtain first texts of consistent format;
a word-segmentation and keyword-extraction unit, configured to word-segment the first texts and extract keywords from the segmentation results to obtain the keyword vocabulary of each first text;
a storage unit, configured to store the keyword vocabulary on the distributed nodes as "key = text number, value = text vocabulary" combinations.
Preferably, the filtering and feature-extraction module includes:
a parallelized-computation unit, configured to process the texts stored on the distributed nodes in a parallelized manner and obtain the term frequency of each term in the texts;
a first comparison unit, configured to compare the term frequencies with a first threshold and keep the terms whose term frequency exceeds the first threshold;
a second comparison unit, configured to compute the TF-IDF values of those terms, compare them with a second threshold, and keep as second terms the terms whose TF-IDF value exceeds the second threshold;
a feature-extraction unit, configured to extract features from the second terms and assign weights to them to obtain the feature vectors of the second terms.
Preferably, the clustering module includes:
a first clustering unit, configured to perform a preliminary clustering of the text lexical feature vectors with Canopy clustering and obtain preliminary clusters of text lexical feature vectors centered on Canopy centers;
a second clustering unit, configured to perform Kmeans clustering on the preliminary clusters of text lexical feature vectors and obtain the K clusters of text lexical feature vectors.
Preferably, the index-construction module includes:
a node-index-construction unit, configured to process the clusters on each distributed node with the index engine and build the full-text index of each cluster;
an index-merging unit, configured to merge the full-text indexes of the clusters on all distributed nodes and obtain K full-text indexes.
In the implementation of the present invention, texts are formatted, word-segmented, filtered, subjected to feature extraction and clustered, and full-text indexes are built on the results. This constructs a distributed index for retrieval, gives the user a fast way of indexing, and improves the user experience.
Brief description of the drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the distributed index construction method based on text clustering according to an embodiment of the present invention;
Fig. 2 is a flow diagram of the preprocessing step according to an embodiment of the present invention;
Fig. 3 is a flow diagram of the text feature vector acquisition step according to an embodiment of the present invention;
Fig. 4 is a structural diagram of the distributed index construction system based on text clustering according to an embodiment of the present invention;
Fig. 5 is a structural diagram of the preprocessing module according to an embodiment of the present invention;
Fig. 6 is a structural diagram of the filtering and feature-extraction module according to an embodiment of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a flow diagram of the distributed index construction method based on text clustering according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
S11: formatting and word-segmentation preprocessing of unstructured texts, with the preprocessing results stored on distributed nodes;
S12: filtering and feature extraction on the preprocessing results to obtain processed text lexical feature vectors;
S13: clustering the text lexical feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors;
S14: distributing each of the K clusters onto one or more distributed nodes;
S15: building full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes, to obtain K full-text indexes.
S11 is described further:
The texts in the database have inconsistent structures. The unstructured texts are first formatted into structured texts of uniform format, the texts are then word-segmented, and the segmentation results are stored on the distributed nodes.
Further, Fig. 2 is a flow diagram of the preprocessing step according to an embodiment of the present invention. As shown in Fig. 2, this step includes:
S111: performing format unification on the unstructured texts of different formats on each distributed node to obtain first texts of consistent format;
S112: performing word segmentation on the first texts and extracting keywords from the segmentation results to obtain the keyword vocabulary of each first text;
S113: storing the keyword vocabulary on the distributed nodes as "key = text number, value = text vocabulary" combinations.
S111 is described further:
The unstructured texts of various formats distributed over the nodes are unified in format, which yields first texts of uniform format.
S112 is described further:
The first texts are word-segmented, the terms separated out of each first text are extracted, and the extracted terms are taken as keywords, yielding the keyword vocabulary of the first text.
S113 is described further:
For each text, the keyword vocabulary extracted from it is combined with the text as a "key = text number, value = text vocabulary" pair, and the key/value combinations are stored on the distributed nodes. A minimal sketch of these three sub-steps is given below.
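As an illustration of sub-steps S111 to S113, the following minimal Python sketch unifies the format of raw texts, segments them, and stores the result as key/value pairs. The whitespace tokenizer and the in-memory dictionary are placeholders of the sketch; the patent does not fix a concrete word segmenter or distributed key/value store.
    # Minimal sketch of S111-S113. The tokenizer and the in-memory "store"
    # stand in for a real Chinese word segmenter and a distributed
    # key/value store, which the embodiment leaves unspecified.
    def normalize_format(raw_text: str) -> str:
        """S111: collapse whitespace so texts of different formats look alike."""
        return " ".join(raw_text.split())

    def segment(text: str) -> list[str]:
        """S112: placeholder word segmentation (whitespace split)."""
        return text.lower().split()

    def preprocess(docs: dict[str, str]) -> dict[str, list[str]]:
        """S113: return {key = text number: value = text vocabulary} pairs."""
        kv_store = {}
        for doc_id, raw in docs.items():
            first_text = normalize_format(raw)
            kv_store[doc_id] = segment(first_text)
        return kv_store

    # preprocess({"d1": "Distributed  index construction"})
    # -> {"d1": ["distributed", "index", "construction"]}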
S12 is described further:
Using the texts and vocabulary produced by the previous steps, the term frequency of each term in a text is computed and compared with the first threshold, and the terms whose term frequency exceeds the first threshold are kept. The TF-IDF values of those terms are then computed and compared with the second threshold, and the terms whose TF-IDF value exceeds the second threshold are kept as second terms. The remaining second terms are weighted according to their TF-IDF values, and the feature vectors of the second terms are extracted.
Further, Fig. 3 is a flow diagram of the text feature vector acquisition step according to an embodiment of the present invention. As shown in Fig. 3, this step includes:
S121: processing the texts stored on the distributed nodes in a parallelized manner to obtain the term frequency of each term in the texts;
S122: comparing each term frequency with the first threshold; if it is greater, jumping to S123, otherwise removing the corresponding term;
S123: keeping the terms whose term frequency exceeds the first threshold and computing their TF-IDF values;
S124: comparing each TF-IDF value with the second threshold; if it is greater, jumping to S125, otherwise removing the corresponding term;
S125: keeping as second terms the terms whose TF-IDF value exceeds the second threshold;
S126: extracting features from the second terms and assigning weights to them to obtain the feature vectors of the second terms.
S121 is described further:
Term frequency is the frequency with which a term occurs in a text. The frequency of term t in text d is tf(t, d) = count(t in d) / count(d), that is, the number of occurrences of the term divided by the total number of terms in the text. The texts are processed with this formula to obtain the term frequency of every term in each text.
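A small Python sketch of this term-frequency computation, under the assumption that each preprocessed text is already available as a list of tokens:
    from collections import Counter

    def term_frequencies(tokens: list[str]) -> dict[str, float]:
        """S121: tf(t, d) = count(t in d) / total number of terms in d."""
        total = len(tokens)
        return {term: n / total for term, n in Counter(tokens).items()}

    # term_frequencies(["index", "index", "cluster"])
    # -> {"index": 0.666..., "cluster": 0.333...}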
S122 is described further:
The more often a term occurs in a text, the more central it is to that text, while terms with a low term frequency have little power to represent the text. A first threshold is therefore set, each term frequency is compared with it, terms whose frequency is below the first threshold are removed, and terms whose frequency exceeds the first threshold are kept. The first threshold is chosen according to the actual situation; in this embodiment it is set to 0.01.
S123 is described further:
For the terms that remain after the comparison with the first threshold, their TF-IDF values are computed.
S124 is described further:
Each TF-IDF value is compared with the second threshold; terms whose TF-IDF value is below the second threshold are removed and terms whose TF-IDF value exceeds the second threshold are kept. The second threshold is chosen according to the actual situation; in this embodiment it is set to 0.01.
S125 is described further:
The terms whose TF-IDF value exceeds the second threshold are stored as second terms.
S126 is described further:
The TF-IDF weight w of term t in text d is computed as:
w(t, d) = TF(t, d) × log(1 / DF(t));
where DF(t) is the document frequency, i.e. the proportion of texts containing term t, DF(t) = n(t) / n, the number of texts containing t divided by the total number of texts, and TF(t, d) is the term frequency of t in text d.
If a term occurs frequently in one text but rarely in the other texts, it can be considered to have good class-discriminating power and to be well suited to representing the text, and the feature vector can then be extracted.
The texts are represented with the vector space model. For a text d(t1, t2, …, tn) containing n feature terms, each feature term tk is given the TF-IDF weight wk, which expresses the importance of that feature in the text; the text can therefore be represented by the feature vector d(w1, w2, …, wn), where wk is the TF-IDF weight of feature term tk, and the corresponding term weight is assigned according to this TF-IDF weight. The sketch below combines the thresholding of S122/S124 with this weighting.
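The following sketch puts steps S122 to S126 together: terms are filtered by the two 0.01 thresholds of this embodiment and the surviving second terms are weighted by w(t, d) = tf(t, d) × log(1/DF(t)). It is a single-machine approximation of what the embodiment runs in parallel across nodes.
    import math
    from collections import Counter

    def tfidf_vectors(corpus: dict[str, list[str]],
                      tf_threshold: float = 0.01,
                      tfidf_threshold: float = 0.01) -> dict[str, dict[str, float]]:
        """Build feature vectors d(w1, ..., wn) per text (S122-S126)."""
        n_docs = len(corpus)
        doc_freq = Counter()                      # n(t): number of texts containing t
        for tokens in corpus.values():
            doc_freq.update(set(tokens))

        vectors = {}
        for doc_id, tokens in corpus.items():
            total = len(tokens)
            vec = {}
            for term, count in Counter(tokens).items():
                tf = count / total
                if tf <= tf_threshold:            # S122: first threshold
                    continue
                df = doc_freq[term] / n_docs      # DF(t) = n(t) / n
                w = tf * math.log(1.0 / df)       # S126: TF-IDF weight
                if w > tfidf_threshold:           # S124: second threshold
                    vec[term] = w
            vectors[doc_id] = vec
        return vectors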
S13 is described further:
First, a preliminary clustering of the text lexical feature vectors is performed with Canopy clustering, which yields preliminary clusters of text lexical feature vectors centered on Canopy centers. Kmeans clustering is then applied to these preliminary clusters, which yields the K clusters of text lexical feature vectors.
Further, the Canopy clustering algorithm is simple, fast and reasonably accurate. When processing massive high-dimensional data, and especially when the data volume is huge, using Canopy clustering as a preliminary step can effectively improve efficiency. The Canopy clustering algorithm is as follows (a code sketch follows the list):
(1) Initialize the set of feature vectors as a list and choose two distance thresholds T1 and T2.
(2) Randomly take an object d from the list as a Canopy center, label it c, and delete d from the list.
(3) Compute the distance between every object d_i in the list and c; if the distance is less than T1, add the object to Canopy c; if the distance is less than T2, delete the point from the list, i.e. the object can no longer become a Canopy center.
(4) Add the resulting c to the canopy list.
(5) Repeat steps 2, 3 and 4 until the list is empty; the canopy list is then the final Canopy clustering result.
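A minimal Python sketch of these five Canopy steps. The distance function is passed in as a parameter (the cosine distance defined next); T1 and T2 with T1 > T2 are the loose and tight thresholds of step (1).
    import random

    def canopy_cluster(vectors, T1, T2, distance):
        """Canopy pre-clustering, steps (1)-(5)."""
        assert T1 > T2
        remaining = list(vectors)
        canopies = []
        while remaining:
            # step (2): pick a random object as the next Canopy centre
            center = remaining.pop(random.randrange(len(remaining)))
            members, still_remaining = [center], []
            for v in remaining:                   # step (3)
                d = distance(center, v)
                if d < T1:
                    members.append(v)             # joins this canopy
                if d >= T2:
                    still_remaining.append(v)     # may still seed a new canopy
            remaining = still_remaining
            canopies.append((center, members))    # step (4)
        return canopies                           # step (5): the canopy list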
Considering the high dimensionality of the text lexical feature vectors, the cosine distance is used as the distance measure.
Specifically, the cosine distance between feature vector A and feature vector B is computed as:
cosine_distance(A, B) = 1 − (Σ_{i=1..n} a_i·b_i) / ( √(Σ_{i=1..n} a_i²) × √(Σ_{i=1..n} b_i²) );
where feature vector A is expressed as A = (a1, a2, …, an), feature vector B is expressed as B = (b1, b2, …, bn), and i = 1, 2, …, n.
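For the sparse term-to-weight vectors produced above, the cosine distance of this formula can be written as the following small function:
    import math

    def cosine_distance(a: dict[str, float], b: dict[str, float]) -> float:
        """1 - (sum a_i*b_i) / (sqrt(sum a_i^2) * sqrt(sum b_i^2))."""
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 1.0                 # treat an empty vector as maximally distant
        return 1.0 - dot / (norm_a * norm_b)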
The preliminary clustering result is then clustered again with the Kmeans clustering algorithm. The basic idea of Kmeans clustering is to group the objects in the space around k center objects: the objects nearest to each center are assigned to one class, and through successive iterations the value of each cluster centroid is recomputed and updated until the cluster centroids no longer change.
For the embodiment of the present invention, the original Kmeans clustering algorithm is modified as follows:
(1) The result of the Canopy clustering algorithm is taken as the input of the Kmeans clustering algorithm, i.e. the Canopy centers produced by the Canopy clustering algorithm serve as the initial centroids of the Kmeans algorithm, and every feature vector has already been assigned to a corresponding centroid.
(2) For every feature vector, the distance from the feature vector to every centroid is computed and the vector is assigned to the nearest cluster centroid, the distance still being the cosine distance used in the Canopy clustering algorithm.
(3) Every cluster is averaged again to obtain a new cluster centroid.
(4) The variance error E of all data objects with respect to their corresponding cluster centroids is computed; if E is greater than the threshold, steps 2 and 3 are repeated, otherwise the clustering ends.
E is computed as:
E = (1/n) Σ_x ‖x − u_{k(x)}‖²;
where x is the text vector of a document, k(x) denotes the cluster containing vector x, u_{k(x)} is the centroid vector of that cluster, and n is the number of document vectors. A sketch of this modified Kmeans iteration is given below.
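A single-node sketch of this modified Kmeans loop. The `mean` function that averages a set of vectors into a centroid and the error threshold are assumptions of the sketch (the patent fixes neither), and the cosine distance stands in for the Euclidean norm of the E formula.
    def kmeans_from_canopies(vectors, canopy_centers, distance, mean,
                             e_threshold=1e-4, max_iter=100):
        """Modified Kmeans, steps (1)-(4), seeded with the Canopy centres."""
        centroids = list(canopy_centers)                       # step (1)
        clusters = [[] for _ in centroids]
        for _ in range(max_iter):
            clusters = [[] for _ in centroids]
            for v in vectors:                                  # step (2)
                i = min(range(len(centroids)),
                        key=lambda j: distance(v, centroids[j]))
                clusters[i].append(v)
            centroids = [mean(c) if c else centroids[i]        # step (3)
                         for i, c in enumerate(clusters)]
            n = sum(len(c) for c in clusters)
            E = sum(distance(v, centroids[i]) ** 2             # step (4)
                    for i, c in enumerate(clusters) for v in c) / n
            if E <= e_threshold:
                break
        return centroids, clusters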
Parallel optimization design: the same Kmeans clustering algorithm is first run locally on each node. For the vectors on each node, the distance from each vector to every global centroid is computed locally and the vector is assigned to the nearest global centroid, giving the global clustering. The local clusters on a node are averaged to obtain local centroids and a local variance error. The local centroids and local error variances of all nodes are then combined into global centroids and a total error variance E, and E decides whether the iteration continues or the clustering ends, finally yielding the K clusters and their centroids.
The global centroid is computed as:
v_i = (v_i[1]·m_1 + … + v_i[j]·m_j + … + v_i[s]·m_s) / (m_1 + … + m_s)
where v_i is the computed global centroid vector of the i-th cluster;
v_i[j] is the local centroid vector of that cluster on the j-th distributed node, and m_j is the number of document vectors of that cluster on the node. The global variance error E is computed as E = (E_1·n_1 + … + E_j·n_j + … + E_t·n_t) / (n_1 + … + n_t), where E_j is the variance error of the j-th node, n_j is the number of vectors on that node, and t is the total number of nodes.
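The following sketch merges per-node results into global centroids and a global variance error according to the two formulas above. The layout of the per-node result records is an assumption of the sketch, not something the patent prescribes.
    def merge_local_clusterings(local_results):
        """Combine per-node centroids, counts and errors into global values."""
        global_centroids, member_totals = {}, {}
        for node in local_results:
            # assumed node layout: {"centroids": {cid: {term: weight}},
            #                       "counts": {cid: m_j}, "E": E_j, "n": n_j}
            for cid, centroid in node["centroids"].items():
                m = node["counts"][cid]
                acc = global_centroids.setdefault(cid, {})
                for term, w in centroid.items():
                    acc[term] = acc.get(term, 0.0) + w * m     # v_i[j] * m_j
                member_totals[cid] = member_totals.get(cid, 0) + m
        for cid, acc in global_centroids.items():
            total = member_totals[cid] or 1
            for term in acc:
                acc[term] /= total                             # / (m_1 + ... + m_s)
        n_total = sum(node["n"] for node in local_results)
        global_E = sum(node["E"] * node["n"] for node in local_results) / n_total
        return global_centroids, global_E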
S14 is described further:
Each of the K clusters obtained in the above steps is distributed onto one or more distributed nodes.
S15 is described further:
The clusters on each distributed node are processed with the index engine to build the full-text index of each cluster; the full-text indexes of the clusters on all distributed nodes are merged to obtain K full-text indexes.
Further, a full-text index is built, with the chosen index engine, for the clusters on each distributed node, and the cluster indexes of the same cluster on all nodes are merged, which yields the K global per-cluster full-text indexes.
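The embodiment leaves the choice of index engine open. As an illustration of the per-node build and the per-cluster merge, the following sketch uses a toy in-memory inverted index in place of a real engine:
    from collections import defaultdict

    def build_node_index(node_docs: dict[str, list[str]]) -> dict[str, set[str]]:
        """Toy full-text index for the documents of one cluster on one node."""
        index = defaultdict(set)
        for doc_id, tokens in node_docs.items():
            for term in tokens:
                index[term].add(doc_id)
        return index

    def merge_cluster_indexes(node_indexes):
        """Merge the partial indexes of the same cluster from every node."""
        merged = defaultdict(set)
        for idx in node_indexes:
            for term, postings in idx.items():
                merged[term] |= postings
        return dict(merged)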
The following is the process by which, in the embodiment of the present invention, a user performs retrieval with search keywords:
The input query string is word-segmented and keywords are extracted; the similarity between the query and the sub-collections is then computed according to an index selection algorithm, and the indexes that satisfy a given condition are selected.
An index selection algorithm based on the search space is provided and described as follows:
Define the search space inside the system as P = {p1, p2, …, pi}, where pi denotes one historical query record; the cluster index repository is S = {S1, S2, …, Sj}; rel(q | Sj) denotes the relevance between index repository Sj and the current query q.
The algorithm steps are:
(1) Compute the relevance rel(pi | Sj) between each index repository and the historical query pi. If Sj does not appear in the result set of pi, then rel(pi | Sj) = 0; otherwise rel(pi | Sj) is computed as:
rel(pi | Sj) = (Σ_{doc ∈ top T} rel(pi | doc)) / T;
where rel(pi | doc) is the relevance between the historical query and a document: rel(pi | doc) = 1 when the document belongs to cluster Sj, otherwise rel(pi | doc) = 0. T is a predefined value, the number of documents at the top of the scoring list that are taken into account; in the embodiment of the present invention T is set to 20, i.e. the documents ranked in the top 20 by relevance are selected.
(2) Select the k most similar historical queries: the similarity sim(q | pi) between the input query q and each historical query is computed with the cosine distance measure, and the k queries with the highest similarity are selected; the k value giving the best effect can be obtained by experiment.
(3) From the relevance information of the similar queries, compute the relevance rel(q | Sj) between the current query q and index repository Sj, sort by rel(q | Sj), and select the more relevant index repositories.
The relevance rel(q | Sj) between the current query q and search repository Sj is computed as:
rel(q | Sj) = Σ_{i=1..k} rel(pi | Sj) × sim(q | pi);
where rel(pi | Sj) denotes the relevance between index repository Sj and historical query pi, sim(q | pi) denotes the similarity between the current query q and historical query pi, and k denotes the k historical queries most similar to the current query q.
(4) After the query is processed, the system collects the user's feedback information, such as the links the user actually clicked, and finally adds this query to the search space and updates the query repository, completing one query. A sketch of steps (1) to (3) is given below.
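A Python sketch of steps (1) to (3) of this index selection algorithm. The shapes of the history and cluster-membership structures, and the use of one minus the cosine distance as sim(q | pi), are assumptions made for the sketch.
    def index_relevance(history, cluster_membership, T=20):
        """Step (1): rel(p_i | S_j) = (1/T) * sum over top-T docs of rel(p_i | doc)."""
        rel = {}
        cluster_ids = set(cluster_membership.values())
        for query, ranked_docs in history.items():
            for cid in cluster_ids:
                hits = sum(1 for doc in ranked_docs[:T]
                           if cluster_membership.get(doc) == cid)
                rel[(query, cid)] = hits / T
        return rel

    def select_indexes(q_vec, history_vecs, rel, similarity, k=5, top=3):
        """Steps (2)-(3): rel(q | S_j) = sum_k rel(p_i | S_j) * sim(q | p_i)."""
        nearest = sorted(((similarity(q_vec, v), p) for p, v in history_vecs.items()),
                         reverse=True)[:k]
        scores = {}
        for sim, p in nearest:
            for (query, cid), r in rel.items():
                if query == p:
                    scores[cid] = scores.get(cid, 0.0) + r * sim
        return sorted(scores, key=scores.get, reverse=True)[:top]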
Retrieval is then carried out on the qualifying indexes. Using global information such as document frequencies, the scores of the retrieval results from each index are computed, merged and sorted, the final retrieval result is obtained, and the retrieval of the query is complete. The score Score(q, d) of retrieval result d for query q is:
Score(q, d) = coord(q, d) × queryNorm(q) × Σ_{t ∈ q} ( TF(t, d) × IDF(t)² × t.getBoost() × norm(t, d) );
where t is each keyword extracted from query q; TF(t, d) is the term frequency of t in document d; IDF(t) is the inverse document frequency; t.getBoost() is the importance of the keyword set in the query input; norm(t, d) is the weight and length factor of the document field set when the index was built; coord(q, d) is a scoring factor such that the more query terms a document matches, the higher its matching degree; and queryNorm(q) normalizes the query so that different queries become directly comparable.
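This is the classic TF-IDF practical scoring function popularized by Lucene-style engines. A sketch with commonly used definitions of the helper factors follows; those helper definitions (coord as matched over total query terms, queryNorm as an inverse square root, norm as 1/sqrt of document length) are assumptions of the sketch rather than values fixed by the patent.
    import math

    def score(query_terms, doc_tf, idf, boosts=None, doc_len=1):
        """Score(q, d) = coord(q, d) * queryNorm(q) *
                         sum_t TF(t, d) * IDF(t)^2 * boost(t) * norm(t, d)."""
        boosts = boosts or {}
        matched = [t for t in query_terms if doc_tf.get(t, 0) > 0]
        if not matched:
            return 0.0
        coord = len(matched) / len(query_terms)
        query_norm = 1.0 / math.sqrt(
            sum((idf.get(t, 0.0) * boosts.get(t, 1.0)) ** 2
                for t in query_terms) or 1.0)
        length_norm = 1.0 / math.sqrt(doc_len)
        total = sum(doc_tf[t] * idf.get(t, 0.0) ** 2 * boosts.get(t, 1.0) * length_norm
                    for t in matched)
        return coord * query_norm * total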
Correspondingly, Fig. 4 is a structural diagram of the distributed index construction system based on text clustering according to an embodiment of the present invention. As shown in Fig. 4, the system includes:
a preprocessing module 11, configured to format and word-segment unstructured texts and store the preprocessing results on distributed nodes;
a filtering and feature-extraction module 12, configured to filter the preprocessing results and perform feature extraction to obtain processed text lexical feature vectors;
a clustering module 13, configured to cluster the text lexical feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors;
a cluster-distribution module 14, configured to distribute each of the K clusters onto one or more distributed nodes;
an index-construction module 15, configured to build full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes, to obtain K full-text indexes.
Preferably, Fig. 5 is a structural diagram of the preprocessing module according to an embodiment of the present invention. As shown in Fig. 5, the preprocessing module 11 includes:
a format-unification processing unit 111, configured to unify the format of the unstructured texts of different formats on each distributed node and obtain first texts of consistent format;
a word-segmentation and keyword-extraction unit 112, configured to word-segment the first texts and extract keywords from the segmentation results to obtain the keyword vocabulary of each first text;
a storage unit 113, configured to store the keyword vocabulary on the distributed nodes as "key = text number, value = text vocabulary" combinations.
Preferably, Fig. 6 is a structural diagram of the filtering and feature-extraction module according to an embodiment of the present invention. As shown in Fig. 6, the filtering and feature-extraction module 12 includes:
a parallelized-computation unit 121, configured to process the texts stored on the distributed nodes in a parallelized manner and obtain the term frequency of each term in the texts;
a first comparison unit 122, configured to compare the term frequencies with a first threshold and keep the terms whose term frequency exceeds the first threshold;
a second comparison unit 123, configured to compute the TF-IDF values of those terms, compare them with a second threshold, and keep as second terms the terms whose TF-IDF value exceeds the second threshold;
a feature-extraction unit 124, configured to extract features from the second terms and assign weights to them to obtain the feature vectors of the second terms.
Preferably, the clustering module 13 includes:
a first clustering unit, configured to perform a preliminary clustering of the text lexical feature vectors with Canopy clustering and obtain preliminary clusters of text lexical feature vectors centered on Canopy centers;
a second clustering unit, configured to perform Kmeans clustering on the preliminary clusters of text lexical feature vectors and obtain the K clusters of text lexical feature vectors.
Preferably, the index-construction module 15 includes:
a node-index-construction unit, configured to process the clusters on each distributed node with the index engine and build the full-text index of each cluster;
an index-merging unit, configured to merge the full-text indexes of the clusters on all distributed nodes and obtain K full-text indexes.
Specifically, for the working principles of the functional modules of the system of the embodiment of the present invention, reference may be made to the related description of the method embodiment, which is not repeated here.
In the implementation of the present invention, texts are formatted, word-segmented, filtered, subjected to feature extraction and clustered, and full-text indexes are built on the results. This constructs a distributed index for retrieval, gives the user a fast way of indexing, and improves the user experience.
A person of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments may be completed by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium; the storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.
In addition, the distributed index construction method and system based on text clustering provided by the embodiments of the present invention have been described in detail above. Specific examples have been used herein to set forth the principle and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for a person of ordinary skill in the art, the specific implementation and the scope of application may change according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A distributed index construction method based on text clustering, characterized in that the method comprises:
formatting and word-segmentation preprocessing of unstructured texts, with the preprocessing results stored on distributed nodes;
filtering and feature extraction on the preprocessing results to obtain processed text lexical feature vectors;
clustering the text lexical feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors;
distributing each of the K clusters onto one or more distributed nodes;
building full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes, to obtain K full-text indexes.
2. The distributed index construction method according to claim 1, characterized in that the formatting and word-segmentation preprocessing of unstructured texts and storing the preprocessing results on distributed nodes comprises:
performing format unification on the unstructured texts of different formats on each distributed node to obtain first texts of consistent format;
performing word segmentation on the first texts and extracting keywords from the segmentation results to obtain the keyword vocabulary of each first text;
storing the keyword vocabulary on the distributed nodes as "key = text number, value = text vocabulary" combinations.
3. The distributed index construction method according to claim 1, characterized in that the filtering and feature extraction on the preprocessing results to obtain the processed text feature vectors comprises:
processing the texts stored on the distributed nodes in a parallelized manner to obtain the term frequency of each term in the texts;
comparing the term frequencies with a first threshold and keeping the terms whose term frequency exceeds the first threshold;
computing the TF-IDF values of those terms, comparing them with a second threshold, and keeping as second terms the terms whose TF-IDF value exceeds the second threshold;
extracting features from the second terms and assigning weights to them to obtain the feature vectors of the second terms.
4. The distributed index construction method according to claim 1, characterized in that the clustering of the text feature vectors with the Canopy-Kmeans clustering algorithm comprises:
performing a preliminary clustering of the text lexical feature vectors with Canopy clustering to obtain preliminary clusters of text lexical feature vectors centered on Canopy centers;
performing Kmeans clustering on the preliminary clusters of text lexical feature vectors to obtain the K clusters of text lexical feature vectors.
5. The distributed index construction method according to claim 1, characterized in that the building of full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes comprises:
processing the clusters on each distributed node with the index engine to build the full-text index of each cluster;
merging the full-text indexes of the clusters on all distributed nodes to obtain K full-text indexes.
6. A distributed index construction system based on text clustering, characterized in that the system comprises:
a preprocessing module, configured to format and word-segment unstructured texts and store the preprocessing results on distributed nodes;
a filtering and feature-extraction module, configured to filter the preprocessing results and perform feature extraction to obtain processed text lexical feature vectors;
a clustering module, configured to cluster the text lexical feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors;
a cluster-distribution module, configured to distribute each of the K clusters onto one or more distributed nodes;
an index-construction module, configured to build full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes, to obtain K full-text indexes.
7. The distributed index construction system according to claim 6, characterized in that the preprocessing module comprises:
a format-unification processing unit, configured to unify the format of the unstructured texts of different formats on each distributed node and obtain first texts of consistent format;
a word-segmentation and keyword-extraction unit, configured to word-segment the first texts and extract keywords from the segmentation results to obtain the keyword vocabulary of each first text;
a storage unit, configured to store the keyword vocabulary on the distributed nodes as "key = text number, value = text vocabulary" combinations.
8. The distributed index construction system according to claim 6, characterized in that the filtering and feature-extraction module comprises:
a parallelized-computation unit, configured to process the texts stored on the distributed nodes in a parallelized manner and obtain the term frequency of each term in the texts;
a first comparison unit, configured to compare the term frequencies with a first threshold and keep the terms whose term frequency exceeds the first threshold;
a second comparison unit, configured to compute the TF-IDF values of those terms, compare them with a second threshold, and keep as second terms the terms whose TF-IDF value exceeds the second threshold;
a feature-extraction unit, configured to extract features from the second terms and assign weights to them to obtain the feature vectors of the second terms.
9. The distributed index construction system according to claim 6, characterized in that the clustering module comprises:
a first clustering unit, configured to perform a preliminary clustering of the text lexical feature vectors with Canopy clustering and obtain preliminary clusters of text lexical feature vectors centered on Canopy centers;
a second clustering unit, configured to perform Kmeans clustering on the preliminary clusters of text lexical feature vectors and obtain the K clusters of text lexical feature vectors.
10. The distributed index construction system according to claim 6, characterized in that the index-construction module comprises:
a node-index-construction unit, configured to process the clusters on each distributed node with the index engine and build the full-text index of each cluster;
an index-merging unit, configured to merge the full-text indexes of the clusters on all distributed nodes and obtain K full-text indexes.
CN201610154682.0A 2016-03-16 2016-03-16 Distributed index establishment method and system based on text clustering Pending CN105787097A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610154682.0A CN105787097A (en) 2016-03-16 2016-03-16 Distributed index establishment method and system based on text clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610154682.0A CN105787097A (en) 2016-03-16 2016-03-16 Distributed index establishment method and system based on text clustering

Publications (1)

Publication Number Publication Date
CN105787097A (en) 2016-07-20

Family

ID=56394027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610154682.0A Pending CN105787097A (en) 2016-03-16 2016-03-16 Distributed index establishment method and system based on text clustering

Country Status (1)

Country Link
CN (1) CN105787097A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484813A (en) * 2016-09-23 2017-03-08 广东港鑫科技有限公司 A kind of big data analysis system and method
CN106886613A (en) * 2017-05-03 2017-06-23 成都云数未来信息科学有限公司 A kind of Text Clustering Method of parallelization
CN108062306A (en) * 2017-12-29 2018-05-22 国信优易数据有限公司 A kind of index system establishment system and method for business environment evaluation
CN108172304A (en) * 2017-12-18 2018-06-15 广州七乐康药业连锁有限公司 A kind of medical information visible processing method and system based on user's medical treatment feedback
CN110674243A (en) * 2019-07-02 2020-01-10 厦门耐特源码信息科技有限公司 Corpus index construction method based on dynamic K-means algorithm
CN110956213A (en) * 2019-11-29 2020-04-03 珠海大横琴科技发展有限公司 Method and device for generating remote sensing image feature library and method and device for retrieving remote sensing image
CN113641870A (en) * 2021-10-18 2021-11-12 北京微播易科技股份有限公司 Vector index construction method, vector retrieval method and system corresponding to vector index construction method and vector retrieval method
CN115203378A (en) * 2022-09-09 2022-10-18 北京澜舟科技有限公司 Retrieval enhancement method, system and storage medium based on pre-training language model
CN116340991A (en) * 2023-02-02 2023-06-27 魔萌动漫文化传播(深圳)有限公司 Big data management method and device for IP gallery material resources and electronic equipment
CN116910186A (en) * 2023-09-12 2023-10-20 南京信息工程大学 Text index model construction method, index method, system and terminal

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591978A (en) * 2012-01-05 2012-07-18 复旦大学 Distributed text copy detection system
CN102831253A (en) * 2012-09-25 2012-12-19 北京科东电力控制系统有限责任公司 Distributed full-text retrieval system
CN103577418A (en) * 2012-07-24 2014-02-12 北京拓尔思信息技术股份有限公司 Massive document distribution searching duplication removing system and method
US9058321B2 (en) * 2008-05-16 2015-06-16 Enpluz, LLC Support for international search terms—translate as you crawl
CN105069101A (en) * 2015-08-07 2015-11-18 桂林电子科技大学 Distributed index construction and search method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9058321B2 (en) * 2008-05-16 2015-06-16 Enpluz, LLC Support for international search terms—translate as you crawl
CN102591978A (en) * 2012-01-05 2012-07-18 复旦大学 Distributed text copy detection system
CN103577418A (en) * 2012-07-24 2014-02-12 北京拓尔思信息技术股份有限公司 Massive document distribution searching duplication removing system and method
CN102831253A (en) * 2012-09-25 2012-12-19 北京科东电力控制系统有限责任公司 Distributed full-text retrieval system
CN105069101A (en) * 2015-08-07 2015-11-18 桂林电子科技大学 Distributed index construction and search method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
冯汝伟 (Feng Ruwei): "分布式环境下基于文本聚类的海量非结构化知识管理" (Massive unstructured knowledge management based on text clustering in a distributed environment), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484813A (en) * 2016-09-23 2017-03-08 广东港鑫科技有限公司 A kind of big data analysis system and method
CN106886613A (en) * 2017-05-03 2017-06-23 成都云数未来信息科学有限公司 A kind of Text Clustering Method of parallelization
CN106886613B (en) * 2017-05-03 2020-06-26 成都云数未来信息科学有限公司 Parallelized text clustering method
CN108172304A (en) * 2017-12-18 2018-06-15 广州七乐康药业连锁有限公司 A kind of medical information visible processing method and system based on user's medical treatment feedback
CN108172304B (en) * 2017-12-18 2021-04-02 广州七乐康药业连锁有限公司 Medical information visualization processing method and system based on user medical feedback
CN108062306A (en) * 2017-12-29 2018-05-22 国信优易数据有限公司 A kind of index system establishment system and method for business environment evaluation
CN110674243A (en) * 2019-07-02 2020-01-10 厦门耐特源码信息科技有限公司 Corpus index construction method based on dynamic K-means algorithm
CN110956213A (en) * 2019-11-29 2020-04-03 珠海大横琴科技发展有限公司 Method and device for generating remote sensing image feature library and method and device for retrieving remote sensing image
CN113641870A (en) * 2021-10-18 2021-11-12 北京微播易科技股份有限公司 Vector index construction method, vector retrieval method and system corresponding to vector index construction method and vector retrieval method
CN113641870B (en) * 2021-10-18 2022-02-11 北京微播易科技股份有限公司 Vector index construction method, vector retrieval method and system corresponding to vector index construction method and vector retrieval method
CN115203378A (en) * 2022-09-09 2022-10-18 北京澜舟科技有限公司 Retrieval enhancement method, system and storage medium based on pre-training language model
CN115203378B (en) * 2022-09-09 2023-01-24 北京澜舟科技有限公司 Retrieval enhancement method, system and storage medium based on pre-training language model
CN116340991A (en) * 2023-02-02 2023-06-27 魔萌动漫文化传播(深圳)有限公司 Big data management method and device for IP gallery material resources and electronic equipment
CN116340991B (en) * 2023-02-02 2023-11-07 魔萌动漫文化传播(深圳)有限公司 Big data management method and device for IP gallery material resources and electronic equipment
CN116910186A (en) * 2023-09-12 2023-10-20 南京信息工程大学 Text index model construction method, index method, system and terminal
CN116910186B (en) * 2023-09-12 2023-11-21 南京信息工程大学 Text index model construction method, index method, system and terminal

Similar Documents

Publication Publication Date Title
CN105787097A (en) Distributed index establishment method and system based on text clustering
Bhattacharjee et al. A survey of density based clustering algorithms
CN107066553B (en) Short text classification method based on convolutional neural network and random forest
US8341159B2 (en) Creating taxonomies and training data for document categorization
CN103500208B (en) Deep layer data processing method and system in conjunction with knowledge base
CN104239513B (en) A kind of semantic retrieving method of domain-oriented data
CN108132927B (en) Keyword extraction method for combining graph structure and node association
Sambasivam et al. Advanced data clustering methods of mining Web documents.
CN103970729A (en) Multi-subject extracting method based on semantic categories
CN110059181A (en) Short text stamp methods, system, device towards extensive classification system
CN103838833A (en) Full-text retrieval system based on semantic analysis of relevant words
CN103049569A (en) Text similarity matching method on basis of vector space model
CN107291895B (en) Quick hierarchical document query method
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN107122382A (en) A kind of patent classification method based on specification
CN109376352A (en) A kind of patent text modeling method based on word2vec and semantic similarity
CN106446162A (en) Orient field self body intelligence library article search method
CN106484797A (en) Accident summary abstracting method based on sparse study
Roul et al. Web document clustering and ranking using tf-idf based apriori approach
CN103761286B (en) A kind of Service Source search method based on user interest
CN112597285A (en) Man-machine interaction method and system based on knowledge graph
CN106599072A (en) Text clustering method and device
Fahad et al. Review on semantic document clustering
Jayanthi et al. Clustering approach for classification of research articles based on keyword search
Adinugroho et al. Optimizing K-means text document clustering using latent semantic indexing and pillar algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160720