CN105787097A - Distributed index establishment method and system based on text clustering - Google Patents

Distributed index establishment method and system based on text clustering

Info

Publication number
CN105787097A
Authority
CN
China
Prior art keywords
text
index
distributed
vocabulary
cluster
Prior art date
Legal status
Pending
Application number
CN201610154682.0A
Other languages
Chinese (zh)
Inventor
林格
邓现
Current Assignee
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by National Sun Yat Sen University filed Critical National Sun Yat Sen University
Priority to CN201610154682.0A priority Critical patent/CN105787097A/en
Publication of CN105787097A publication Critical patent/CN105787097A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 Indexing structures
    • G06F 16/2272 Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a distributed index establishment method and system based on text clustering. The method includes the following steps: unstructured texts are formatted and word-segmented in a preprocessing stage, and the preprocessing results are stored on the original distributed nodes; the preprocessing results are filtered and subjected to feature extraction to obtain processed text lexical feature vectors; the text lexical feature vectors are clustered with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors; each of the K clusters is distributed onto one or more distributed nodes; and full-text indexes are built with an index engine for the K clusters distributed on the one or more nodes, yielding K full-text indexes. The embodiments build a distributed index for retrieval, give the user a fast way of indexing and searching, and improve the user experience.

Description

Distributed index construction method and system based on text clustering
Technical field
The present invention relates to the technical field of search index construction, and in particular to a distributed index construction method and system based on text clustering.
Background art
Traditional structured information management generally retrieves information with index technology. In a distributed network environment, however, the scale of knowledge grows quickly and the size of the index file increases dramatically with it: the index can no longer be stored in a centralized fashion, and retrieval efficiency is severely degraded by the huge index database. A document-partition-based indexing method has been proposed for this situation, but it divides the collection randomly; since the resulting subsets are equivalent to one another, every sub-index still has to be searched at query time, so the cost of retrieval remains very large.
Text clustering rests on the clustering hypothesis: objects of the same class are highly similar, while objects of different classes differ considerably. It is an unsupervised machine-learning method. Unlike text classification, clustering needs neither a training process nor documents manually labelled with classes in advance; different texts condense into different categories automatically, which gives it a certain flexibility and a high degree of automated processing capability.
Distributed computing technology mainly comprises two basic functions, distributed storage and parallel computation. Distributed storage provides a transparent, consistent file-access system and physically stores massive data in a distributed manner. Parallel computation scatters massive input data over multiple nodes, each computing node works in parallel, and the results of all nodes are finally merged into the final result.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art. The invention provides a distributed index construction method and system based on text clustering, which build a distributed index for retrieval, give the user a fast way of indexing, and improve the user experience.
To solve the above problems, the present invention proposes a distributed index construction method based on text clustering, the method comprising:
formatting and word-segmentation preprocessing of unstructured texts, with the preprocessing results stored on distributed nodes;
filtering and feature extraction on the preprocessing results to obtain processed text lexical feature vectors;
clustering the text lexical feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors;
distributing each of the K clusters onto one or more distributed nodes;
building full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes, to obtain K full-text indexes.
Preferably, formatting and word-segmentation preprocessing of the unstructured texts and storing the preprocessing results on distributed nodes includes:
performing format unification on the unstructured texts of different formats on each distributed node to obtain first texts of consistent format;
performing word segmentation on the first texts and extracting keywords from the segmentation results to obtain the keyword vocabulary of each first text;
storing the keyword vocabulary on the distributed nodes as "key = text number, value = text vocabulary" combinations.
Preferably, filtering and feature extraction on the preprocessing results to obtain the processed text feature vectors includes:
processing the texts stored on the distributed nodes in a parallelized manner to obtain the term frequency of each term in the texts;
comparing the term frequencies with a first threshold and keeping the terms whose term frequency exceeds the first threshold;
computing the TF-IDF values of those terms, comparing them with a second threshold, and keeping as second terms the terms whose TF-IDF value exceeds the second threshold;
extracting features from the second terms and assigning weights to them to obtain the feature vectors of the second terms.
Preferably, clustering the text feature vectors with the Canopy-Kmeans clustering algorithm includes:
performing a preliminary clustering of the text lexical feature vectors with Canopy clustering to obtain preliminary clusters of text lexical feature vectors centered on Canopy centers;
performing Kmeans clustering on the preliminary clusters of text lexical feature vectors to obtain the K clusters of text lexical feature vectors.
Preferably, building full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes includes:
processing the clusters on each distributed node with the index engine to build the full-text index of each cluster;
merging the full-text indexes of the clusters on all distributed nodes to obtain K full-text indexes.
Correspondingly, the present invention also provides a distributed index construction system based on text clustering, the system comprising:
a preprocessing module, configured to format and word-segment unstructured texts and store the preprocessing results on distributed nodes;
a filtering and feature-extraction module, configured to filter the preprocessing results and perform feature extraction to obtain processed text lexical feature vectors;
a clustering module, configured to cluster the text lexical feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors;
a cluster-distribution module, configured to distribute each of the K clusters onto one or more distributed nodes;
an index-construction module, configured to build full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes, to obtain K full-text indexes.
Preferably, the preprocessing module includes:
a format-unification processing unit, configured to unify the format of the unstructured texts of different formats on each distributed node and obtain first texts of consistent format;
a word-segmentation and keyword-extraction unit, configured to word-segment the first texts and extract keywords from the segmentation results to obtain the keyword vocabulary of each first text;
a storage unit, configured to store the keyword vocabulary on the distributed nodes as "key = text number, value = text vocabulary" combinations.
Preferably, the filtering and feature-extraction module includes:
a parallelized-computation unit, configured to process the texts stored on the distributed nodes in a parallelized manner and obtain the term frequency of each term in the texts;
a first comparison unit, configured to compare the term frequencies with a first threshold and keep the terms whose term frequency exceeds the first threshold;
a second comparison unit, configured to compute the TF-IDF values of those terms, compare them with a second threshold, and keep as second terms the terms whose TF-IDF value exceeds the second threshold;
a feature-extraction unit, configured to extract features from the second terms and assign weights to them to obtain the feature vectors of the second terms.
Preferably, the clustering module includes:
a first clustering unit, configured to perform a preliminary clustering of the text lexical feature vectors with Canopy clustering and obtain preliminary clusters of text lexical feature vectors centered on Canopy centers;
a second clustering unit, configured to perform Kmeans clustering on the preliminary clusters of text lexical feature vectors and obtain the K clusters of text lexical feature vectors.
Preferably, the index-construction module includes:
a node-index-construction unit, configured to process the clusters on each distributed node with the index engine and build the full-text index of each cluster;
an index-merging unit, configured to merge the full-text indexes of the clusters on all distributed nodes and obtain K full-text indexes.
In the implementation of the present invention, texts are formatted, word-segmented, filtered, subjected to feature extraction and clustered, and full-text indexes are built on the results. This constructs a distributed index for retrieval, gives the user a fast way of indexing, and improves the user experience.
Brief description of the drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the distributed index construction method based on text clustering according to an embodiment of the present invention;
Fig. 2 is a flow diagram of the preprocessing step according to an embodiment of the present invention;
Fig. 3 is a flow diagram of the text feature vector acquisition step according to an embodiment of the present invention;
Fig. 4 is a structural diagram of the distributed index construction system based on text clustering according to an embodiment of the present invention;
Fig. 5 is a structural diagram of the preprocessing module according to an embodiment of the present invention;
Fig. 6 is a structural diagram of the filtering and feature-extraction module according to an embodiment of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a flow diagram of the distributed index construction method based on text clustering according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
S11: formatting and word-segmentation preprocessing of unstructured texts, with the preprocessing results stored on distributed nodes;
S12: filtering and feature extraction on the preprocessing results to obtain processed text lexical feature vectors;
S13: clustering the text lexical feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors;
S14: distributing each of the K clusters onto one or more distributed nodes;
S15: building full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes, to obtain K full-text indexes.
S11 is described further:
The texts in the database have inconsistent structures. The unstructured texts are first formatted into structured texts of uniform format, the texts are then word-segmented, and the segmentation results are stored on the distributed nodes.
Further, Fig. 2 is a flow diagram of the preprocessing step according to an embodiment of the present invention. As shown in Fig. 2, this step includes:
S111: performing format unification on the unstructured texts of different formats on each distributed node to obtain first texts of consistent format;
S112: performing word segmentation on the first texts and extracting keywords from the segmentation results to obtain the keyword vocabulary of each first text;
S113: storing the keyword vocabulary on the distributed nodes as "key = text number, value = text vocabulary" combinations.
S111 is described further:
The unstructured texts of various formats distributed over the nodes are unified in format, which yields first texts of uniform format.
S112 is described further:
The first texts are word-segmented, the terms separated out of each first text are extracted, and the extracted terms are taken as keywords, yielding the keyword vocabulary of the first text.
S113 is described further:
For each text, the keyword vocabulary extracted from it is combined with the text as a "key = text number, value = text vocabulary" pair, and the key/value combinations are stored on the distributed nodes. A minimal sketch of these three sub-steps is given below.
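As an illustration of sub-steps S111 to S113, the following minimal Python sketch unifies the format of raw texts, segments them, and stores the result as key/value pairs. The whitespace tokenizer and the in-memory dictionary are placeholders of the sketch; the patent does not fix a concrete word segmenter or distributed key/value store.
    # Minimal sketch of S111-S113. The tokenizer and the in-memory "store"
    # stand in for a real Chinese word segmenter and a distributed
    # key/value store, which the embodiment leaves unspecified.
    def normalize_format(raw_text: str) -> str:
        """S111: collapse whitespace so texts of different formats look alike."""
        return " ".join(raw_text.split())

    def segment(text: str) -> list[str]:
        """S112: placeholder word segmentation (whitespace split)."""
        return text.lower().split()

    def preprocess(docs: dict[str, str]) -> dict[str, list[str]]:
        """S113: return {key = text number: value = text vocabulary} pairs."""
        kv_store = {}
        for doc_id, raw in docs.items():
            first_text = normalize_format(raw)
            kv_store[doc_id] = segment(first_text)
        return kv_store

    # preprocess({"d1": "Distributed  index construction"})
    # -> {"d1": ["distributed", "index", "construction"]}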
S12 is described further:
Using the texts and vocabulary produced by the previous steps, the term frequency of each term in a text is computed and compared with the first threshold, and the terms whose term frequency exceeds the first threshold are kept. The TF-IDF values of those terms are then computed and compared with the second threshold, and the terms whose TF-IDF value exceeds the second threshold are kept as second terms. The remaining second terms are weighted according to their TF-IDF values, and the feature vectors of the second terms are extracted.
Further, Fig. 3 is a flow diagram of the text feature vector acquisition step according to an embodiment of the present invention. As shown in Fig. 3, this step includes:
S121: processing the texts stored on the distributed nodes in a parallelized manner to obtain the term frequency of each term in the texts;
S122: comparing each term frequency with the first threshold; if it is greater, jumping to S123, otherwise removing the corresponding term;
S123: keeping the terms whose term frequency exceeds the first threshold and computing their TF-IDF values;
S124: comparing each TF-IDF value with the second threshold; if it is greater, jumping to S125, otherwise removing the corresponding term;
S125: keeping as second terms the terms whose TF-IDF value exceeds the second threshold;
S126: extracting features from the second terms and assigning weights to them to obtain the feature vectors of the second terms.
S121 is described further:
Term frequency is the frequency with which a term occurs in a text. The frequency of term t in text d is tf(t, d) = count(t in d) / count(d), that is, the number of occurrences of the term divided by the total number of terms in the text. The texts are processed with this formula to obtain the term frequency of every term in each text.
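A small Python sketch of this term-frequency computation, under the assumption that each preprocessed text is already available as a list of tokens:
    from collections import Counter

    def term_frequencies(tokens: list[str]) -> dict[str, float]:
        """S121: tf(t, d) = count(t in d) / total number of terms in d."""
        total = len(tokens)
        return {term: n / total for term, n in Counter(tokens).items()}

    # term_frequencies(["index", "index", "cluster"])
    # -> {"index": 0.666..., "cluster": 0.333...}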
S122 is described further:
The more often a term occurs in a text, the more central it is to that text, while terms with a low term frequency have little power to represent the text. A first threshold is therefore set, each term frequency is compared with it, terms whose frequency is below the first threshold are removed, and terms whose frequency exceeds the first threshold are kept. The first threshold is chosen according to the actual situation; in this embodiment it is set to 0.01.
S123 is described further:
For the terms that remain after the comparison with the first threshold, their TF-IDF values are computed.
S124 is described further:
Each TF-IDF value is compared with the second threshold; terms whose TF-IDF value is below the second threshold are removed and terms whose TF-IDF value exceeds the second threshold are kept. The second threshold is chosen according to the actual situation; in this embodiment it is set to 0.01.
S125 is described further:
The terms whose TF-IDF value exceeds the second threshold are stored as second terms.
S126 is described further:
The TF-IDF weight w of term t in text d is computed as:
w(t, d) = TF(t, d) × log(1 / DF(t));
where DF(t) is the document frequency, i.e. the proportion of texts containing term t, DF(t) = n(t) / n, the number of texts containing t divided by the total number of texts, and TF(t, d) is the term frequency of t in text d.
If a term occurs frequently in one text but rarely in the other texts, it can be considered to have good class-discriminating power and to be well suited to representing the text, and the feature vector can then be extracted.
The texts are represented with the vector space model. For a text d(t1, t2, …, tn) containing n feature terms, each feature term tk is given the TF-IDF weight wk, which expresses the importance of that feature in the text; the text can therefore be represented by the feature vector d(w1, w2, …, wn), where wk is the TF-IDF weight of feature term tk, and the corresponding term weight is assigned according to this TF-IDF weight. The sketch below combines the thresholding of S122/S124 with this weighting.
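The following sketch puts steps S122 to S126 together: terms are filtered by the two 0.01 thresholds of this embodiment and the surviving second terms are weighted by w(t, d) = tf(t, d) × log(1/DF(t)). It is a single-machine approximation of what the embodiment runs in parallel across nodes.
    import math
    from collections import Counter

    def tfidf_vectors(corpus: dict[str, list[str]],
                      tf_threshold: float = 0.01,
                      tfidf_threshold: float = 0.01) -> dict[str, dict[str, float]]:
        """Build feature vectors d(w1, ..., wn) per text (S122-S126)."""
        n_docs = len(corpus)
        doc_freq = Counter()                      # n(t): number of texts containing t
        for tokens in corpus.values():
            doc_freq.update(set(tokens))

        vectors = {}
        for doc_id, tokens in corpus.items():
            total = len(tokens)
            vec = {}
            for term, count in Counter(tokens).items():
                tf = count / total
                if tf <= tf_threshold:            # S122: first threshold
                    continue
                df = doc_freq[term] / n_docs      # DF(t) = n(t) / n
                w = tf * math.log(1.0 / df)       # S126: TF-IDF weight
                if w > tfidf_threshold:           # S124: second threshold
                    vec[term] = w
            vectors[doc_id] = vec
        return vectors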
S13 is described further:
First, a preliminary clustering of the text lexical feature vectors is performed with Canopy clustering, which yields preliminary clusters of text lexical feature vectors centered on Canopy centers. Kmeans clustering is then applied to these preliminary clusters, which yields the K clusters of text lexical feature vectors.
Further, the Canopy clustering algorithm is simple, fast and reasonably accurate. When processing massive high-dimensional data, and especially when the data volume is huge, using Canopy clustering as a preliminary step can effectively improve efficiency. The Canopy clustering algorithm is as follows (a code sketch follows the list):
(1) Initialize the set of feature vectors as a list and choose two distance thresholds T1 and T2.
(2) Randomly take an object d from the list as a Canopy center, label it c, and delete d from the list.
(3) Compute the distance between every object d_i in the list and c; if the distance is less than T1, add the object to Canopy c; if the distance is less than T2, delete the point from the list, i.e. the object can no longer become a Canopy center.
(4) Add the resulting c to the canopy list.
(5) Repeat steps 2, 3 and 4 until the list is empty; the canopy list is then the final Canopy clustering result.
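A minimal Python sketch of these five Canopy steps. The distance function is passed in as a parameter (the cosine distance defined next); T1 and T2 with T1 > T2 are the loose and tight thresholds of step (1).
    import random

    def canopy_cluster(vectors, T1, T2, distance):
        """Canopy pre-clustering, steps (1)-(5)."""
        assert T1 > T2
        remaining = list(vectors)
        canopies = []
        while remaining:
            # step (2): pick a random object as the next Canopy centre
            center = remaining.pop(random.randrange(len(remaining)))
            members, still_remaining = [center], []
            for v in remaining:                   # step (3)
                d = distance(center, v)
                if d < T1:
                    members.append(v)             # joins this canopy
                if d >= T2:
                    still_remaining.append(v)     # may still seed a new canopy
            remaining = still_remaining
            canopies.append((center, members))    # step (4)
        return canopies                           # step (5): the canopy list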
Considering the high dimensionality of the text lexical feature vectors, the cosine distance is used as the distance measure.
Specifically, the cosine distance between feature vector A and feature vector B is computed as:
cosine_distance(A, B) = 1 − (Σ_{i=1..n} a_i·b_i) / ( √(Σ_{i=1..n} a_i²) × √(Σ_{i=1..n} b_i²) );
where feature vector A is expressed as A = (a1, a2, …, an), feature vector B is expressed as B = (b1, b2, …, bn), and i = 1, 2, …, n.
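For the sparse term-to-weight vectors produced above, the cosine distance of this formula can be written as the following small function:
    import math

    def cosine_distance(a: dict[str, float], b: dict[str, float]) -> float:
        """1 - (sum a_i*b_i) / (sqrt(sum a_i^2) * sqrt(sum b_i^2))."""
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 1.0                 # treat an empty vector as maximally distant
        return 1.0 - dot / (norm_a * norm_b)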
The preliminary clustering result is then clustered again with the Kmeans clustering algorithm. The basic idea of Kmeans clustering is to group the objects in the space around k center objects: the objects nearest to each center are assigned to one class, and through successive iterations the value of each cluster centroid is recomputed and updated until the cluster centroids no longer change.
For the embodiment of the present invention, the original Kmeans clustering algorithm is modified as follows:
(1) The result of the Canopy clustering algorithm is taken as the input of the Kmeans clustering algorithm, i.e. the Canopy centers produced by the Canopy clustering algorithm serve as the initial centroids of the Kmeans algorithm, and every feature vector has already been assigned to a corresponding centroid.
(2) For every feature vector, the distance from the feature vector to every centroid is computed and the vector is assigned to the nearest cluster centroid, the distance still being the cosine distance used in the Canopy clustering algorithm.
(3) Every cluster is averaged again to obtain a new cluster centroid.
(4) The variance error E of all data objects with respect to their corresponding cluster centroids is computed; if E is greater than the threshold, steps 2 and 3 are repeated, otherwise the clustering ends.
E is computed as:
E = (1/n) Σ_x ‖x − u_{k(x)}‖²;
where x is the text vector of a document, k(x) denotes the cluster containing vector x, u_{k(x)} is the centroid vector of that cluster, and n is the number of document vectors. A sketch of this modified Kmeans iteration is given below.
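A single-node sketch of this modified Kmeans loop. The `mean` function that averages a set of vectors into a centroid and the error threshold are assumptions of the sketch (the patent fixes neither), and the cosine distance stands in for the Euclidean norm of the E formula.
    def kmeans_from_canopies(vectors, canopy_centers, distance, mean,
                             e_threshold=1e-4, max_iter=100):
        """Modified Kmeans, steps (1)-(4), seeded with the Canopy centres."""
        centroids = list(canopy_centers)                       # step (1)
        clusters = [[] for _ in centroids]
        for _ in range(max_iter):
            clusters = [[] for _ in centroids]
            for v in vectors:                                  # step (2)
                i = min(range(len(centroids)),
                        key=lambda j: distance(v, centroids[j]))
                clusters[i].append(v)
            centroids = [mean(c) if c else centroids[i]        # step (3)
                         for i, c in enumerate(clusters)]
            n = sum(len(c) for c in clusters)
            E = sum(distance(v, centroids[i]) ** 2             # step (4)
                    for i, c in enumerate(clusters) for v in c) / n
            if E <= e_threshold:
                break
        return centroids, clusters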
Parallel optimization design: the same Kmeans clustering algorithm is first run locally on each node. For the vectors on each node, the distance from each vector to every global centroid is computed locally and the vector is assigned to the nearest global centroid, giving the global clustering. The local clusters on a node are averaged to obtain local centroids and a local variance error. The local centroids and local error variances of all nodes are then combined into global centroids and a total error variance E, and E decides whether the iteration continues or the clustering ends, finally yielding the K clusters and their centroids.
The global centroid is computed as:
v_i = (v_i[1]·m_1 + … + v_i[j]·m_j + … + v_i[s]·m_s) / (m_1 + … + m_s)
where v_i is the computed global centroid vector of the i-th cluster;
v_i[j] is the local centroid vector of that cluster on the j-th distributed node, and m_j is the number of document vectors of that cluster on the node. The global variance error E is computed as E = (E_1·n_1 + … + E_j·n_j + … + E_t·n_t) / (n_1 + … + n_t), where E_j is the variance error of the j-th node, n_j is the number of vectors on that node, and t is the total number of nodes.
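The following sketch merges per-node results into global centroids and a global variance error according to the two formulas above. The layout of the per-node result records is an assumption of the sketch, not something the patent prescribes.
    def merge_local_clusterings(local_results):
        """Combine per-node centroids, counts and errors into global values."""
        global_centroids, member_totals = {}, {}
        for node in local_results:
            # assumed node layout: {"centroids": {cid: {term: weight}},
            #                       "counts": {cid: m_j}, "E": E_j, "n": n_j}
            for cid, centroid in node["centroids"].items():
                m = node["counts"][cid]
                acc = global_centroids.setdefault(cid, {})
                for term, w in centroid.items():
                    acc[term] = acc.get(term, 0.0) + w * m     # v_i[j] * m_j
                member_totals[cid] = member_totals.get(cid, 0) + m
        for cid, acc in global_centroids.items():
            total = member_totals[cid] or 1
            for term in acc:
                acc[term] /= total                             # / (m_1 + ... + m_s)
        n_total = sum(node["n"] for node in local_results)
        global_E = sum(node["E"] * node["n"] for node in local_results) / n_total
        return global_centroids, global_E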
S14 is described further:
Each of the K clusters obtained in the above steps is distributed onto one or more distributed nodes.
S15 is described further:
The clusters on each distributed node are processed with the index engine to build the full-text index of each cluster; the full-text indexes of the clusters on all distributed nodes are merged to obtain K full-text indexes.
Further, a full-text index is built, with the chosen index engine, for the clusters on each distributed node, and the cluster indexes of the same cluster on all nodes are merged, which yields the K global per-cluster full-text indexes.
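The embodiment leaves the choice of index engine open. As an illustration of the per-node build and the per-cluster merge, the following sketch uses a toy in-memory inverted index in place of a real engine:
    from collections import defaultdict

    def build_node_index(node_docs: dict[str, list[str]]) -> dict[str, set[str]]:
        """Toy full-text index for the documents of one cluster on one node."""
        index = defaultdict(set)
        for doc_id, tokens in node_docs.items():
            for term in tokens:
                index[term].add(doc_id)
        return index

    def merge_cluster_indexes(node_indexes):
        """Merge the partial indexes of the same cluster from every node."""
        merged = defaultdict(set)
        for idx in node_indexes:
            for term, postings in idx.items():
                merged[term] |= postings
        return dict(merged)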
The following is the process by which, in the embodiment of the present invention, a user performs retrieval with search keywords:
The input query string is word-segmented and keywords are extracted; the similarity between the query and the sub-collections is then computed according to an index selection algorithm, and the indexes that satisfy a given condition are selected.
An index selection algorithm based on the search space is provided and described as follows:
Define the search space inside the system as P = {p1, p2, …, pi}, where pi denotes one historical query record; the cluster index repository is S = {S1, S2, …, Sj}; rel(q | Sj) denotes the relevance between index repository Sj and the current query q.
The algorithm steps are:
(1) Compute the relevance rel(pi | Sj) between each index repository and the historical query pi. If Sj does not appear in the result set of pi, then rel(pi | Sj) = 0; otherwise rel(pi | Sj) is computed as:
rel(pi | Sj) = (Σ_{doc ∈ top T} rel(pi | doc)) / T;
where rel(pi | doc) is the relevance between the historical query and a document: rel(pi | doc) = 1 when the document belongs to cluster Sj, otherwise rel(pi | doc) = 0. T is a predefined value, the number of documents at the top of the scoring list that are taken into account; in the embodiment of the present invention T is set to 20, i.e. the documents ranked in the top 20 by relevance are selected.
(2) Select the k most similar historical queries: the similarity sim(q | pi) between the input query q and each historical query is computed with the cosine distance measure, and the k queries with the highest similarity are selected; the k value giving the best effect can be obtained by experiment.
(3) From the relevance information of the similar queries, compute the relevance rel(q | Sj) between the current query q and index repository Sj, sort by rel(q | Sj), and select the more relevant index repositories.
The relevance rel(q | Sj) between the current query q and search repository Sj is computed as:
rel(q | Sj) = Σ_{i=1..k} rel(pi | Sj) × sim(q | pi);
where rel(pi | Sj) denotes the relevance between index repository Sj and historical query pi, sim(q | pi) denotes the similarity between the current query q and historical query pi, and k denotes the k historical queries most similar to the current query q.
(4) After the query is processed, the system collects the user's feedback information, such as the links the user actually clicked, and finally adds this query to the search space and updates the query repository, completing one query. A sketch of steps (1) to (3) is given below.
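A Python sketch of steps (1) to (3) of this index selection algorithm. The shapes of the history and cluster-membership structures, and the use of one minus the cosine distance as sim(q | pi), are assumptions made for the sketch.
    def index_relevance(history, cluster_membership, T=20):
        """Step (1): rel(p_i | S_j) = (1/T) * sum over top-T docs of rel(p_i | doc)."""
        rel = {}
        cluster_ids = set(cluster_membership.values())
        for query, ranked_docs in history.items():
            for cid in cluster_ids:
                hits = sum(1 for doc in ranked_docs[:T]
                           if cluster_membership.get(doc) == cid)
                rel[(query, cid)] = hits / T
        return rel

    def select_indexes(q_vec, history_vecs, rel, similarity, k=5, top=3):
        """Steps (2)-(3): rel(q | S_j) = sum_k rel(p_i | S_j) * sim(q | p_i)."""
        nearest = sorted(((similarity(q_vec, v), p) for p, v in history_vecs.items()),
                         reverse=True)[:k]
        scores = {}
        for sim, p in nearest:
            for (query, cid), r in rel.items():
                if query == p:
                    scores[cid] = scores.get(cid, 0.0) + r * sim
        return sorted(scores, key=scores.get, reverse=True)[:top]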
Retrieval is then carried out on the qualifying indexes. Using global information such as document frequencies, the scores of the retrieval results from each index are computed, merged and sorted, the final retrieval result is obtained, and the retrieval of the query is complete. The score Score(q, d) of retrieval result d for query q is:
Score(q, d) = coord(q, d) × queryNorm(q) × Σ_{t ∈ q} ( TF(t, d) × IDF(t)² × t.getBoost() × norm(t, d) );
where t is each keyword extracted from query q; TF(t, d) is the term frequency of t in document d; IDF(t) is the inverse document frequency; t.getBoost() is the importance of the keyword set in the query input; norm(t, d) is the weight and length factor of the document field set when the index was built; coord(q, d) is a scoring factor such that the more query terms a document matches, the higher its matching degree; and queryNorm(q) normalizes the query so that different queries become directly comparable.
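This is the classic TF-IDF practical scoring function popularized by Lucene-style engines. A sketch with commonly used definitions of the helper factors follows; those helper definitions (coord as matched over total query terms, queryNorm as an inverse square root, norm as 1/sqrt of document length) are assumptions of the sketch rather than values fixed by the patent.
    import math

    def score(query_terms, doc_tf, idf, boosts=None, doc_len=1):
        """Score(q, d) = coord(q, d) * queryNorm(q) *
                         sum_t TF(t, d) * IDF(t)^2 * boost(t) * norm(t, d)."""
        boosts = boosts or {}
        matched = [t for t in query_terms if doc_tf.get(t, 0) > 0]
        if not matched:
            return 0.0
        coord = len(matched) / len(query_terms)
        query_norm = 1.0 / math.sqrt(
            sum((idf.get(t, 0.0) * boosts.get(t, 1.0)) ** 2
                for t in query_terms) or 1.0)
        length_norm = 1.0 / math.sqrt(doc_len)
        total = sum(doc_tf[t] * idf.get(t, 0.0) ** 2 * boosts.get(t, 1.0) * length_norm
                    for t in matched)
        return coord * query_norm * total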
Correspondingly, Fig. 4 is a structural diagram of the distributed index construction system based on text clustering according to an embodiment of the present invention. As shown in Fig. 4, the system includes:
a preprocessing module 11, configured to format and word-segment unstructured texts and store the preprocessing results on distributed nodes;
a filtering and feature-extraction module 12, configured to filter the preprocessing results and perform feature extraction to obtain processed text lexical feature vectors;
a clustering module 13, configured to cluster the text lexical feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors;
a cluster-distribution module 14, configured to distribute each of the K clusters onto one or more distributed nodes;
an index-construction module 15, configured to build full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes, to obtain K full-text indexes.
Preferably, Fig. 5 is a structural diagram of the preprocessing module according to an embodiment of the present invention. As shown in Fig. 5, the preprocessing module 11 includes:
a format-unification processing unit 111, configured to unify the format of the unstructured texts of different formats on each distributed node and obtain first texts of consistent format;
a word-segmentation and keyword-extraction unit 112, configured to word-segment the first texts and extract keywords from the segmentation results to obtain the keyword vocabulary of each first text;
a storage unit 113, configured to store the keyword vocabulary on the distributed nodes as "key = text number, value = text vocabulary" combinations.
Preferably, Fig. 6 is a structural diagram of the filtering and feature-extraction module according to an embodiment of the present invention. As shown in Fig. 6, the filtering and feature-extraction module 12 includes:
a parallelized-computation unit 121, configured to process the texts stored on the distributed nodes in a parallelized manner and obtain the term frequency of each term in the texts;
a first comparison unit 122, configured to compare the term frequencies with a first threshold and keep the terms whose term frequency exceeds the first threshold;
a second comparison unit 123, configured to compute the TF-IDF values of those terms, compare them with a second threshold, and keep as second terms the terms whose TF-IDF value exceeds the second threshold;
a feature-extraction unit 124, configured to extract features from the second terms and assign weights to them to obtain the feature vectors of the second terms.
Preferably, the clustering module 13 includes:
a first clustering unit, configured to perform a preliminary clustering of the text lexical feature vectors with Canopy clustering and obtain preliminary clusters of text lexical feature vectors centered on Canopy centers;
a second clustering unit, configured to perform Kmeans clustering on the preliminary clusters of text lexical feature vectors and obtain the K clusters of text lexical feature vectors.
Preferably, the index-construction module 15 includes:
a node-index-construction unit, configured to process the clusters on each distributed node with the index engine and build the full-text index of each cluster;
an index-merging unit, configured to merge the full-text indexes of the clusters on all distributed nodes and obtain K full-text indexes.
Specifically, for the working principles of the functional modules of the system of the embodiment of the present invention, reference may be made to the related description of the method embodiment, which is not repeated here.
In the implementation of the present invention, texts are formatted, word-segmented, filtered, subjected to feature extraction and clustered, and full-text indexes are built on the results. This constructs a distributed index for retrieval, gives the user a fast way of indexing, and improves the user experience.
A person of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments may be completed by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium; the storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.
In addition, the distributed index construction method and system based on text clustering provided by the embodiments of the present invention have been described in detail above. Specific examples have been used herein to set forth the principle and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for a person of ordinary skill in the art, the specific implementation and the scope of application may change according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A distributed index construction method based on text clustering, characterized in that the method comprises:
formatting and word-segmentation preprocessing of unstructured texts, with the preprocessing results stored on distributed nodes;
filtering and feature extraction on the preprocessing results to obtain processed text lexical feature vectors;
clustering the text lexical feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors;
distributing each of the K clusters onto one or more distributed nodes;
building full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes, to obtain K full-text indexes.
2. The distributed index construction method according to claim 1, characterized in that the formatting and word-segmentation preprocessing of unstructured texts and storing the preprocessing results on distributed nodes comprises:
performing format unification on the unstructured texts of different formats on each distributed node to obtain first texts of consistent format;
performing word segmentation on the first texts and extracting keywords from the segmentation results to obtain the keyword vocabulary of each first text;
storing the keyword vocabulary on the distributed nodes as "key = text number, value = text vocabulary" combinations.
3. The distributed index construction method according to claim 1, characterized in that the filtering and feature extraction on the preprocessing results to obtain the processed text feature vectors comprises:
processing the texts stored on the distributed nodes in a parallelized manner to obtain the term frequency of each term in the texts;
comparing the term frequencies with a first threshold and keeping the terms whose term frequency exceeds the first threshold;
computing the TF-IDF values of those terms, comparing them with a second threshold, and keeping as second terms the terms whose TF-IDF value exceeds the second threshold;
extracting features from the second terms and assigning weights to them to obtain the feature vectors of the second terms.
4. The distributed index construction method according to claim 1, characterized in that the clustering of the text feature vectors with the Canopy-Kmeans clustering algorithm comprises:
performing a preliminary clustering of the text lexical feature vectors with Canopy clustering to obtain preliminary clusters of text lexical feature vectors centered on Canopy centers;
performing Kmeans clustering on the preliminary clusters of text lexical feature vectors to obtain the K clusters of text lexical feature vectors.
5. The distributed index construction method according to claim 1, characterized in that the building of full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes comprises:
processing the clusters on each distributed node with the index engine to build the full-text index of each cluster;
merging the full-text indexes of the clusters on all distributed nodes to obtain K full-text indexes.
6. A distributed index construction system based on text clustering, characterized in that the system comprises:
a preprocessing module, configured to format and word-segment unstructured texts and store the preprocessing results on distributed nodes;
a filtering and feature-extraction module, configured to filter the preprocessing results and perform feature extraction to obtain processed text lexical feature vectors;
a clustering module, configured to cluster the text lexical feature vectors with the Canopy-Kmeans clustering algorithm to obtain K clusters of text lexical feature vectors;
a cluster-distribution module, configured to distribute each of the K clusters onto one or more distributed nodes;
an index-construction module, configured to build full-text indexes, with an index engine, for the K clusters distributed on the one or more distributed nodes, to obtain K full-text indexes.
7. The distributed index construction system according to claim 6, characterized in that the preprocessing module comprises:
a format-unification processing unit, configured to unify the format of the unstructured texts of different formats on each distributed node and obtain first texts of consistent format;
a word-segmentation and keyword-extraction unit, configured to word-segment the first texts and extract keywords from the segmentation results to obtain the keyword vocabulary of each first text;
a storage unit, configured to store the keyword vocabulary on the distributed nodes as "key = text number, value = text vocabulary" combinations.
8. The distributed index construction system according to claim 6, characterized in that the filtering and feature-extraction module comprises:
a parallelized-computation unit, configured to process the texts stored on the distributed nodes in a parallelized manner and obtain the term frequency of each term in the texts;
a first comparison unit, configured to compare the term frequencies with a first threshold and keep the terms whose term frequency exceeds the first threshold;
a second comparison unit, configured to compute the TF-IDF values of those terms, compare them with a second threshold, and keep as second terms the terms whose TF-IDF value exceeds the second threshold;
a feature-extraction unit, configured to extract features from the second terms and assign weights to them to obtain the feature vectors of the second terms.
9. The distributed index construction system according to claim 6, characterized in that the clustering module comprises:
a first clustering unit, configured to perform a preliminary clustering of the text lexical feature vectors with Canopy clustering and obtain preliminary clusters of text lexical feature vectors centered on Canopy centers;
a second clustering unit, configured to perform Kmeans clustering on the preliminary clusters of text lexical feature vectors and obtain the K clusters of text lexical feature vectors.
10. The distributed index construction system according to claim 6, characterized in that the index-construction module comprises:
a node-index-construction unit, configured to process the clusters on each distributed node with the index engine and build the full-text index of each cluster;
an index-merging unit, configured to merge the full-text indexes of the clusters on all distributed nodes and obtain K full-text indexes.
CN201610154682.0A 2016-03-16 2016-03-16 Distributed index establishment method and system based on text clustering Pending CN105787097A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610154682.0A CN105787097A (en) 2016-03-16 2016-03-16 Distributed index establishment method and system based on text clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610154682.0A CN105787097A (en) 2016-03-16 2016-03-16 Distributed index establishment method and system based on text clustering

Publications (1)

Publication Number Publication Date
CN105787097A (en) 2016-07-20

Family

ID=56394027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610154682.0A Pending CN105787097A (en) 2016-03-16 2016-03-16 Distributed index establishment method and system based on text clustering

Country Status (1)

Country Link
CN (1) CN105787097A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484813A (en) * 2016-09-23 2017-03-08 广东港鑫科技有限公司 A kind of big data analysis system and method
CN106886613A (en) * 2017-05-03 2017-06-23 成都云数未来信息科学有限公司 A kind of Text Clustering Method of parallelization
CN108062306A (en) * 2017-12-29 2018-05-22 国信优易数据有限公司 A kind of index system establishment system and method for business environment evaluation
CN108172304A (en) * 2017-12-18 2018-06-15 广州七乐康药业连锁有限公司 A kind of medical information visible processing method and system based on user's medical treatment feedback
CN110674243A (en) * 2019-07-02 2020-01-10 厦门耐特源码信息科技有限公司 Corpus index construction method based on dynamic K-means algorithm
CN110956213A (en) * 2019-11-29 2020-04-03 珠海大横琴科技发展有限公司 Method and device for generating remote sensing image feature library and method and device for retrieving remote sensing image
CN113641870A (en) * 2021-10-18 2021-11-12 北京微播易科技股份有限公司 Vector index construction method, vector retrieval method and system corresponding to vector index construction method and vector retrieval method
CN115203378A (en) * 2022-09-09 2022-10-18 北京澜舟科技有限公司 Retrieval enhancement method, system and storage medium based on pre-training language model
CN116340991A (en) * 2023-02-02 2023-06-27 魔萌动漫文化传播(深圳)有限公司 Big data management method and device for IP gallery material resources and electronic equipment
CN116910186A (en) * 2023-09-12 2023-10-20 南京信息工程大学 Text index model construction method, index method, system and terminal

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591978A (en) * 2012-01-05 2012-07-18 复旦大学 Distributed text copy detection system
CN102831253A (en) * 2012-09-25 2012-12-19 北京科东电力控制系统有限责任公司 Distributed full-text retrieval system
CN103577418A (en) * 2012-07-24 2014-02-12 北京拓尔思信息技术股份有限公司 Massive document distribution searching duplication removing system and method
US9058321B2 (en) * 2008-05-16 2015-06-16 Enpluz, LLC Support for international search terms—translate as you crawl
CN105069101A (en) * 2015-08-07 2015-11-18 桂林电子科技大学 Distributed index construction and search method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9058321B2 (en) * 2008-05-16 2015-06-16 Enpluz, LLC Support for international search terms—translate as you crawl
CN102591978A (en) * 2012-01-05 2012-07-18 复旦大学 Distributed text copy detection system
CN103577418A (en) * 2012-07-24 2014-02-12 北京拓尔思信息技术股份有限公司 Massive document distribution searching duplication removing system and method
CN102831253A (en) * 2012-09-25 2012-12-19 北京科东电力控制系统有限责任公司 Distributed full-text retrieval system
CN105069101A (en) * 2015-08-07 2015-11-18 桂林电子科技大学 Distributed index construction and search method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
冯汝伟 (Feng Ruwei): "分布式环境下基于文本聚类的海量非结构化知识管理" (Massive unstructured knowledge management based on text clustering in a distributed environment), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484813A (en) * 2016-09-23 2017-03-08 广东港鑫科技有限公司 A kind of big data analysis system and method
CN106886613A (en) * 2017-05-03 2017-06-23 成都云数未来信息科学有限公司 A kind of Text Clustering Method of parallelization
CN106886613B (en) * 2017-05-03 2020-06-26 成都云数未来信息科学有限公司 Parallelized text clustering method
CN108172304A (en) * 2017-12-18 2018-06-15 广州七乐康药业连锁有限公司 A kind of medical information visible processing method and system based on user's medical treatment feedback
CN108172304B (en) * 2017-12-18 2021-04-02 广州七乐康药业连锁有限公司 Medical information visualization processing method and system based on user medical feedback
CN108062306A (en) * 2017-12-29 2018-05-22 国信优易数据有限公司 A kind of index system establishment system and method for business environment evaluation
CN110674243A (en) * 2019-07-02 2020-01-10 厦门耐特源码信息科技有限公司 Corpus index construction method based on dynamic K-means algorithm
CN110956213A (en) * 2019-11-29 2020-04-03 珠海大横琴科技发展有限公司 Method and device for generating remote sensing image feature library and method and device for retrieving remote sensing image
CN113641870A (en) * 2021-10-18 2021-11-12 北京微播易科技股份有限公司 Vector index construction method, vector retrieval method and system corresponding to vector index construction method and vector retrieval method
CN113641870B (en) * 2021-10-18 2022-02-11 北京微播易科技股份有限公司 Vector index construction method, vector retrieval method and system corresponding to vector index construction method and vector retrieval method
CN115203378A (en) * 2022-09-09 2022-10-18 北京澜舟科技有限公司 Retrieval enhancement method, system and storage medium based on pre-training language model
CN115203378B (en) * 2022-09-09 2023-01-24 北京澜舟科技有限公司 Retrieval enhancement method, system and storage medium based on pre-training language model
CN116340991A (en) * 2023-02-02 2023-06-27 魔萌动漫文化传播(深圳)有限公司 Big data management method and device for IP gallery material resources and electronic equipment
CN116340991B (en) * 2023-02-02 2023-11-07 魔萌动漫文化传播(深圳)有限公司 Big data management method and device for IP gallery material resources and electronic equipment
CN116910186A (en) * 2023-09-12 2023-10-20 南京信息工程大学 Text index model construction method, index method, system and terminal
CN116910186B (en) * 2023-09-12 2023-11-21 南京信息工程大学 Text index model construction method, index method, system and terminal

Similar Documents

Publication Publication Date Title
CN105787097A (en) Distributed index establishment method and system based on text clustering
Bhattacharjee et al. A survey of density based clustering algorithms
CN107066553B (en) Short text classification method based on convolutional neural network and random forest
US8341159B2 (en) Creating taxonomies and training data for document categorization
CN103500208B (en) Deep layer data processing method and system in conjunction with knowledge base
CN104239513B (en) A kind of semantic retrieving method of domain-oriented data
CN108132927B (en) Keyword extraction method for combining graph structure and node association
Sambasivam et al. Advanced data clustering methods of mining Web documents.
CN103970729A (en) Multi-subject extracting method based on semantic categories
CN110059181A (en) Short text stamp methods, system, device towards extensive classification system
CN103838833A (en) Full-text retrieval system based on semantic analysis of relevant words
CN103049569A (en) Text similarity matching method on basis of vector space model
CN107291895B (en) Quick hierarchical document query method
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN107122382A (en) A kind of patent classification method based on specification
CN109376352A (en) A kind of patent text modeling method based on word2vec and semantic similarity
CN106446162A (en) Orient field self body intelligence library article search method
CN106484797A (en) Accident summary abstracting method based on sparse study
Roul et al. Web document clustering and ranking using tf-idf based apriori approach
CN103761286B (en) A kind of Service Source search method based on user interest
CN112597285A (en) Man-machine interaction method and system based on knowledge graph
CN106599072A (en) Text clustering method and device
Fahad et al. Review on semantic document clustering
Jayanthi et al. Clustering approach for classification of research articles based on keyword search
Adinugroho et al. Optimizing K-means text document clustering using latent semantic indexing and pillar algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160720