CN106971005A - Distributed parallel text clustering method based on MapReduce in a cloud computing environment - Google Patents

Distributed parallel text clustering method based on MapReduce in a cloud computing environment Download PDF

Info

Publication number
CN106971005A
CN106971005A CN201710286671.2A
Authority
CN
China
Prior art keywords
text
cluster
similarity
minimum
feature words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201710286671.2A
Other languages
Chinese (zh)
Inventor
沈晔
周天和
李思剑
任培荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yang Fan Technology Co Ltd
Original Assignee
Hangzhou Yang Fan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yang Fan Technology Co Ltd filed Critical Hangzhou Yang Fan Technology Co Ltd
Priority to CN201710286671.2A priority Critical patent/CN106971005A/en
Publication of CN106971005A publication Critical patent/CN106971005A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a distributed parallel text clustering method based on MapReduce in a cloud computing environment. First, a text similarity computation method is proposed using the vector space model. Second, based on selecting the two initial sub-cluster centers by searching for a "mutual minimum-similarity text pair", a bisecting K-means clustering algorithm is proposed that finds optimal cluster centroids with a single partition. Finally, a parallel clustering method for the large-scale texts produced by cloud computing applications is designed on the MapReduce framework. Experiments on the Hadoop platform with real text data show that, across different data scales and numbers of compute nodes, the parallel clustering model achieves a clear efficiency advantage and good scalability while obtaining comparable clustering quality.

Description

Distributed parallel text clustering method based on MapReduce in a cloud computing environment
Technical field
The present invention relates to the field of cloud computing technology, and in particular to a distributed parallel text clustering method based on MapReduce in a cloud computing environment.
Background technology
Text mining extends data mining research to text data. Taking text data as the research object and applying data mining techniques, it is the process of discovering implicit, potentially valuable knowledge such as the structure, models, and patterns of textual information, and it combines research results from data mining, machine learning, natural language processing, information retrieval, information management, and other fields. The rapid growth of text data carried by Internet applications and the pressing demand for business analysis have made text mining ever more important and urgent. In particular, text clustering research — which discovers a reasonable cluster partition of a given text collection without requiring a training set or predefined categories — is one of the important research directions in the field of text mining.
With the large-scale development of Internet applications (microblogs, e-commerce, and search engines), efficiently mining the large-scale texts these applications generate has become a great challenge for data mining research and practice. Distributed parallel computing offers powerful computing capability for large-scale data and is simple and convenient to implement, so distributed text mining, which introduces distributed parallel computing into text mining, has become a research hotspot in recent years. The rise of cloud computing provides more frameworks for distributed parallel computing. Among them, the MapReduce framework proposed by Google lets users improve computational efficiency by defining Map and Reduce tasks that distribute large-scale computation across multiple compute nodes; the emergence of the open-source, cloud-oriented Hadoop platform further facilitates implementing distributed parallel computing models based on MapReduce, and researchers have developed the Mahout library for machine learning and data mining algorithms.
Summary of the invention
To overcome the above shortcomings, the present invention aims to provide a distributed parallel text clustering method based on MapReduce in a cloud computing environment. The method first proposes a text similarity computation method using the vector space model; on this basis, it proposes a bisecting K-means clustering algorithm that finds optimal cluster centroids with a single partition; then, a parallel clustering method for the large-scale texts of cloud computing applications is designed on the MapReduce framework. The invention targets large-scale text mining applications on cloud computing platforms and improves the efficiency of text clustering.
The present invention achieves the above purpose through the following technical scheme: a distributed parallel text clustering method based on MapReduce in a cloud computing environment, comprising the following steps:
(1) represent text features using the vector space model and, combined with the text similarity computation method, obtain a text similarity computation model;
(2) select the initial two sub-cluster centers with the text clustering algorithm based on searching for a "mutual minimum-similarity text pair", and complete clustering with a bisecting K-means that finds optimal cluster centroids with a single partition, forming the text cluster partition;
(3) based on the MapReduce framework, perform distributed parallel computation with three MapReduce tasks, responsible respectively for searching for the "mutual minimum-similarity text pair", assigning texts to the two clusters, and the final K-means text clustering, until the cluster partition no longer changes, then output the clustering result.
Preferably, the text similarity computation method is as follows:
Given texts d_i and d_j, let TA(d_i,d_j) = {ta_1, ta_2, ..., ta_t, ..., ta_h} denote the union of the feature words contained in d_i and d_j, where h is the number of feature words in the union; let TS(d_i,d_j) = {ts_1, ts_2, ..., ts_k, ..., ts_l} denote the intersection of the feature words contained in d_i and d_j, where l is the number of feature words in the intersection. The similarity of d_i and d_j on each feature word ts_k in TS is defined as
sim(d_i, d_j, ts_k) = min(w_{i,k}, w_{j,k}) / max(w_{i,k}, w_{j,k})
where w_{i,k} is the weight of ts_k in d_i. The similarity SIM(d_i,d_j) of texts d_i and d_j is defined as
SIM(d_i, d_j) = ( Σ_{ts_k ∈ TS(d_i,d_j)} sim(d_i, d_j, ts_k) ) / |TA(d_i, d_j)|
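The similarity measure above can be sketched in a few lines of Python. This is an illustration only, not part of the patent; representing each text as a {feature word: TF-IDF weight} dict, and the function names, are assumptions:

```python
# Sketch of the patent's similarity measure: per-word similarity is the ratio
# of the smaller weight to the larger one; text similarity is the sum of
# per-word similarities over the intersection TS, divided by |TA| (the union).

def sim_word(wi: float, wj: float) -> float:
    """Similarity of two texts on one common feature word."""
    return min(wi, wj) / max(wi, wj)

def SIM(di: dict, dj: dict) -> float:
    """Text similarity per the patent's definition."""
    ts = di.keys() & dj.keys()   # common feature words TS(di, dj)
    ta = di.keys() | dj.keys()   # union of feature words TA(di, dj)
    if not ta:
        return 0.0
    return sum(sim_word(di[t], dj[t]) for t in ts) / len(ta)

d1 = {"cloud": 0.5, "cluster": 0.2, "text": 0.1}
d2 = {"cloud": 0.25, "cluster": 0.2, "mapreduce": 0.4}
print(SIM(d1, d2))  # (0.5 + 1.0) / 4 = 0.375
```

Note how words appearing in only one of the two texts contribute nothing to the numerator but enlarge the denominator, penalizing texts with little vocabulary overlap.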
Preferably, the method of representing text features with the vector space model is: given a text collection D = {d_1, d_2, ..., d_i, ..., d_N}, each text d_i is represented in the vector space model as
d_i = (<t_1, w_i1>, <t_2, w_i2>, ..., <t_j, w_ij>, ..., <t_m, w_im>)
where T = {t_1, t_2, ..., t_j, ..., t_m} denotes the set of all feature words contained in all texts of the collection, and W_i = {w_i1, w_i2, ..., w_ij, ..., w_im} denotes the weight vector of text d_i over all feature words, computed with the TF-IDF method as follows:
w_ij = tf_ij × idf_j = (n_ij / n_i) · log2(N / (N_j + 1) + 1)
where tf_ij is the frequency of feature word t_j in text d_i, n_ij is the number of occurrences of t_j in d_i, and n_i is the total number of occurrences of all feature words in d_i; idf_j is the inverse document frequency of t_j in the whole text set, measuring how widely the feature word appears; N is the total number of texts in the collection, and N_j is the number of distinct texts containing t_j.
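The TF-IDF weighting above can be sketched as follows (illustrative only; the function name and argument names are assumptions, and the formula is the patent's variant with the smoothed logarithm):

```python
import math

# w_ij = (n_ij / n_i) * log2(N / (N_j + 1) + 1), where n_ij is the count of
# feature word t_j in text d_i, n_i the total word count of d_i, N the number
# of texts in the collection, and N_j the number of texts containing t_j.

def tfidf_weight(n_ij: int, n_i: int, N: int, N_j: int) -> float:
    tf = n_ij / n_i                        # term frequency within the text
    idf = math.log2(N / (N_j + 1) + 1)     # smoothed inverse document frequency
    return tf * idf

# A word appearing 3 times in a 100-word text, in 9 of 1000 texts:
print(tfidf_weight(3, 100, 1000, 9))  # 0.03 * log2(101), roughly 0.1997
```

The "+1" terms keep the logarithm finite and positive even when a word occurs in every text, a common smoothing choice.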
Preferably, the "mutual minimum-similarity text pair" is defined as follows: given a text cluster C = {d_1, d_2, ..., d_i, ..., d_nC}, a "mutual minimum-similarity text pair" is two texts d_i, d_j in cluster C that satisfy
SIM(d_i, d_j) = min_{d_k ∈ C} SIM(d_i, d_k) = min_{d_k ∈ C} SIM(d_j, d_k)
that is, d_i is the text in the cluster least similar to d_j, and at the same time d_j is the text in the cluster least similar to d_i.
Preferably, the text clustering algorithm based on the "mutual minimum-similarity text pair" search proceeds as follows:
Input: text collection D = {d_1, d_2, ..., d_i, ..., d_N};
Parameter: the number of clusters K;
Output: the cluster partition S = {S_1, S_2, ..., S_k, ..., S_K} of D;
(1) initialize, taking the set D of all texts as the initial cluster: S = {S_0}, S_0 ← D;
(2) select from S the cluster S_m with the smallest cluster similarity mean-square MS as the cluster to be split;
(3) find the initial two sub-cluster centers of S_m, the text pair d_x, d_y, with the "mutual minimum-similarity text pair" search algorithm;
(4) assign all texts of the cluster to be split, S_m = {d_m1, d_m2, ..., d_mi, ..., d_mn}, to clusters S_x and S_y by the maximum-similarity principle:
d_mi ∈ S_x if SIM(d_mi, d_x) ≥ SIM(d_mi, d_y); otherwise d_mi ∈ S_y;
add S_x and S_y to the partition S, and delete S_m from S;
(5) if the number of clusters in S is less than K, return to step (2); if it equals K, go to step (6);
(6) taking the centroids of the K clusters in S as initial cluster centers, cluster all texts with spherical K-means to obtain the text cluster partition S, using the text similarity computation method throughout the clustering process.
Preferably, the cluster similarity mean-square MS is defined as follows: for a text cluster C = {d_1, d_2, ..., d_i, ..., d_nC} containing n_C texts, MS(C) is the mean of the squared similarities between all texts and the cluster centroid d_e:
MS(C) = ( Σ_{d_i ∈ C} SIM(d_i, d_e)² ) / n_C
" each other minimum similarity degree text to " searching algorithm is as follows preferably, described:
Input:Text clusternCFor the quantity of text cluster C Chinese versions;Output:" each other Minimum similarity degree text to " dx,dy
(i) text d is randomly selected in text cluster CiIt is assigned to dx,dx←di
(ii) search and text d in text cluster CxThe minimum text d of similarityy, i.e.,
(iii) search and text d in text cluster CyThe minimum text d of similarityk, i.e.,
(iv) following two conditions are judged:
If (a) dk=dxOr SIM (dx,dy)=SIM (dk,dy), then algorithm terminates, and exports dx,dyTo be " minimum each other similar Spend text to ", i.e. text cluster C initial cluster center;
If (b) dk≠dxAnd SIM (dx, dy) ≠ SIM (dk, dy), then assignment dx←dy, dy← dk, redirects execution step (iii) re-search for.
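The greedy search in steps (i)-(iv) can be sketched as follows. This is illustrative only; the similarity function is passed in as a parameter, and in the demo texts are simple indices into an assumed similarity table:

```python
import random

# Greedy search for a "mutual minimum-similarity text pair": repeatedly hop
# from the current candidate to its least-similar text until the hop returns
# to where it came from (or finds an equally dissimilar text), i.e. the pair
# is stable.

def find_mutual_min_pair(cluster, SIM):
    dx = random.choice(cluster)                                   # step (i)
    dy = min((d for d in cluster if d != dx),
             key=lambda d: SIM(dx, d))                            # step (ii)
    while True:
        dk = min((d for d in cluster if d != dy),
                 key=lambda d: SIM(dy, d))                        # step (iii)
        if dk == dx or SIM(dx, dy) == SIM(dk, dy):                # step (iv)(a)
            return dx, dy
        dx, dy = dy, dk                                           # step (iv)(b)

# Toy demo: texts 0 and 2 are each other's least-similar texts.
sims = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.5}
toy_sim = lambda a, b: sims.get((min(a, b), max(a, b)), 1.0)
print(find_mutual_min_pair([0, 1, 2], toy_sim))  # (0, 2) or (2, 0)
```

Whatever the random starting text, the hops converge on the pair {0, 2} here, matching the termination conditions of step (iv).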
The beneficial effects of the present invention are: the invention targets large-scale text mining applications on cloud computing platforms and improves the efficiency of text clustering.
Brief description of the drawings
Fig. 1 is a schematic diagram of the main flow of the clustering method of the present invention.
Embodiment
The present invention is described further below with reference to a specific embodiment, but the protection scope of the present invention is not limited thereto:
Embodiment: as shown in Fig. 1, a distributed parallel text clustering method based on MapReduce in a cloud computing environment comprises the following steps:
(1) represent text features using the vector space model and, combined with the text similarity computation method, obtain the text similarity computation model;
Definition 1 represents text features with the vector space model: given a text collection D = {d_1, d_2, ..., d_i, ..., d_N}, each text d_i is represented in the vector space model as d_i = (<t_1, w_i1>, <t_2, w_i2>, ..., <t_j, w_ij>, ..., <t_m, w_im>). Here T = {t_1, t_2, ..., t_j, ..., t_m} denotes the set of all feature words contained in all texts of the collection, and W_i = {w_i1, w_i2, ..., w_ij, ..., w_im} denotes the weight vector of text d_i over all feature words, computed with the TF-IDF method:
w_ij = tf_ij × idf_j = (n_ij / n_i) · log2(N / (N_j + 1) + 1)     (1)
In this formula, tf_ij is the frequency of feature word t_j in text d_i, n_ij is the number of occurrences of t_j in d_i, and n_i is the total number of occurrences of all feature words in d_i; idf_j is the inverse document frequency of t_j in the whole text set, measuring how widely the feature word appears; N is the total number of texts in the collection, and N_j is the number of distinct texts containing t_j. Clearly, the more frequently a feature word occurs in a particular document, the stronger its ability to distinguish the content attributes of that text (TF); the wider the range over which it occurs in the text set, the weaker its ability to distinguish text content (IDF).
Definition 2 defines text similarity as follows: given texts d_i and d_j, TA(d_i,d_j) = {ta_1, ta_2, ..., ta_t, ..., ta_h} denotes the union of the feature words they contain, where h is the number of feature words in the union; TS(d_i,d_j) = {ts_1, ts_2, ..., ts_k, ..., ts_l} denotes the intersection of the feature words they contain, where l is the number of feature words in the intersection. The similarity of d_i and d_j on each feature word ts_k in TS is defined as
sim(d_i, d_j, ts_k) = min(w_{i,k}, w_{j,k}) / max(w_{i,k}, w_{j,k})     (2)
where w_{i,k} is the weight of ts_k in d_i. The similarity of d_i and d_j is defined as
SIM(d_i, d_j) = ( Σ_{ts_k ∈ TS(d_i,d_j)} sim(d_i, d_j, ts_k) ) / |TA(d_i, d_j)|     (3)
That is, the similarity of two texts is the ratio of the sum of their similarities on all common feature words to the number of all feature words the two texts contain. Compared with the classical cosine similarity (W_i·W_j / (|W_i|·|W_j|)), formula (3) likewise builds its numerator from the common feature words of the two texts, and its denominator likewise accounts for the remaining feature words of each text beyond the common ones. The difference is that formula (3) computes the similarity of each common feature word precisely and separately, rather than computing an overall similarity directly from the vector inner product as the cosine of the included angle does.
(2) select the initial two sub-cluster centers with the text clustering algorithm based on searching for a "mutual minimum-similarity text pair", and complete clustering with a bisecting K-means that finds optimal cluster centroids with a single partition, forming the text cluster partition;
Definition 3 defines the cluster similarity mean-square: for a text cluster C = {d_1, d_2, ..., d_i, ..., d_nC} containing n_C texts, the cluster similarity mean-square MS(C) is the mean of the squared similarities between all texts and the cluster centroid:
MS(C) = ( Σ_{d_i ∈ C} SIM(d_i, d_e)² ) / n_C     (4)
where d_e is the feature vector of the cluster centroid, i.e. d_e = (<t_1, w_e1>, <t_2, w_e2>, ..., <t_j, w_ej>, ..., <t_m, w_em>), with each w_ej taken as the mean of the weights of feature word t_j over all texts in the cluster:
w_ej = ( Σ_{d_i ∈ C} w_ij ) / n_C     (5)
In an embodiment of the present invention, after selecting a cluster to split, the original bisecting K-means method randomly selects initial cluster centers in the spirit of K-means, performs a two-way partition, and searches for the optimal partition over many iterations. Here, a "mutual minimum-similarity text pair" of a cluster refers to two texts d_i, d_j in a text cluster C = {d_1, d_2, ..., d_i, ..., d_nC} that satisfy
SIM(d_i, d_j) = min_{d_k ∈ C} SIM(d_i, d_k) = min_{d_k ∈ C} SIM(d_j, d_k)     (6)
that is, d_i is the text in the cluster least similar to d_j, and at the same time d_j is the text in the cluster least similar to d_i.
The present invention proposes to determine the initial two sub-cluster centers from a "mutual minimum-similarity text pair" found in the cluster. Clearly, a text cluster may contain more than one pair satisfying formula (6). The greedy search algorithm for a "mutual minimum-similarity text pair" of a cluster is therefore given as follows.
Preparation algorithm 1: "mutual minimum-similarity text pair" search algorithm.
Input: text cluster C = {d_1, d_2, ..., d_i, ..., d_nC}, where n_C is the number of texts in C.
Output: the "mutual minimum-similarity text pair" d_x, d_y.
The algorithm steps are as follows:
Step 1: randomly select a text d_i in cluster C and assign it to d_x: d_x ← d_i.
Step 2: search cluster C for the text d_y with minimum similarity to d_x, i.e.
SIM(d_x, d_y) = min_{d_j ∈ C} SIM(d_x, d_j).
Step 3: search cluster C for the text d_k with minimum similarity to d_y, i.e.
SIM(d_y, d_k) = min_{d_j ∈ C} SIM(d_y, d_j).
Step 4: check the following two conditions:
4.1 if d_k = d_x or SIM(d_x, d_y) = SIM(d_k, d_y), the algorithm terminates and outputs d_x, d_y as the "mutual minimum-similarity text pair", i.e. the initial cluster centers of C;
4.2 if d_k ≠ d_x and SIM(d_x, d_y) ≠ SIM(d_k, d_y), assign d_x ← d_y and d_y ← d_k, and return to step 3 to search again.
Step 5: end.
In an embodiment of the present invention, the text clustering algorithm based on the "mutual minimum-similarity text pair" search combines the proposed initial-center selection method with the idea of bisecting K-means. Its steps are given as follows.
Preparation algorithm 2: text clustering algorithm based on the "mutual minimum-similarity text pair" search.
Input: text collection D = {d_1, d_2, ..., d_i, ..., d_N}.
Parameter: the number of clusters K.
Output: the cluster partition S = {S_1, S_2, ..., S_k, ..., S_K} of D.
The algorithm steps are as follows:
Step 1: initialize. Take the set D of all texts as the initial cluster: S = {S_0}, S_0 ← D.
Step 2: select from S, according to formula (4), the cluster S_m with the smallest cluster similarity mean-square MS as the cluster to be split.
Step 3: find the initial two sub-cluster centers of S_m, the text pair d_x, d_y, with preparation algorithm 1.
Step 4: assign all texts of the cluster to be split, S_m = {d_m1, d_m2, ..., d_mi, ..., d_mn}, to clusters S_x and S_y by the maximum-similarity principle:
d_mi ∈ S_x if SIM(d_mi, d_x) ≥ SIM(d_mi, d_y); otherwise d_mi ∈ S_y.
Add S_x and S_y to the partition S, and delete S_m from S.
Step 5: if the number of clusters in S is less than K, return to step 2; if it equals K, go to step 6.
Step 6: taking the centroids of the K clusters in S as initial cluster centers, cluster all texts with spherical K-means to obtain the text cluster partition S, using the text similarity computation method of Definition 2 throughout the clustering process.
Step 7: end.
As the steps of preparation algorithm 2 show, the proposed text clustering algorithm distributes all objects in a single pass (step 4) after searching for the initial two sub-cluster centers, obtaining the cluster partition without the repeated iterative optimization of the original bisecting K-means algorithm.
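Preparation algorithm 2 can be condensed into the following self-contained Python sketch. This is illustrative only: texts are assumed to be {feature word: weight} dicts, all names are assumptions, and the final spherical K-means refinement of step 6 is omitted for brevity:

```python
import random

# Bisecting clustering driven by the "mutual minimum-similarity text pair".

def SIM(di, dj):
    ts, ta = di.keys() & dj.keys(), di.keys() | dj.keys()
    return sum(min(di[t], dj[t]) / max(di[t], dj[t]) for t in ts) / len(ta) if ta else 0.0

def centroid(cluster):
    words = set().union(*(d.keys() for d in cluster))
    return {t: sum(d.get(t, 0.0) for d in cluster) / len(cluster) for t in words}

def MS(cluster):
    de = centroid(cluster)
    return sum(SIM(d, de) ** 2 for d in cluster) / len(cluster)

def find_mutual_min_pair(cluster):
    dx = random.choice(cluster)
    dy = min((d for d in cluster if d is not dx), key=lambda d: SIM(dx, d))
    while True:
        dk = min((d for d in cluster if d is not dy), key=lambda d: SIM(dy, d))
        if dk is dx or SIM(dx, dy) == SIM(dk, dy):
            return dx, dy
        dx, dy = dy, dk

def bisecting_cluster(texts, K):
    S = [list(texts)]                             # step 1: one initial cluster
    while len(S) < K:                             # step 5: repeat until K clusters
        Sm = min(S, key=MS)                       # step 2: loosest cluster splits
        dx, dy = find_mutual_min_pair(Sm)         # step 3: initial centers
        Sx = [d for d in Sm if SIM(d, dx) >= SIM(d, dy)]  # step 4: one-pass
        Sy = [d for d in Sm if SIM(d, dx) < SIM(d, dy)]   # assignment
        S.remove(Sm)
        S.extend([Sx, Sy])
    return S
```

Note that each split is a single pass over the cluster's texts, matching the one-pass character of step 4 noted above.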
(3) based on the MapReduce framework, perform distributed parallel computation with three MapReduce tasks, responsible respectively for searching for the "mutual minimum-similarity text pair", assigning texts to the two clusters, and the final K-means text clustering, until the cluster partition no longer changes, then output the clustering result.
Parallel text clustering model based on MapReduce: although the improved algorithm may raise clustering efficiency compared with the original bisecting clustering, the efficiency gain obtained by improving the clustering algorithm itself falls far short of the needs of clustering large-scale, massive texts in practical cloud computing applications; parallelizing the text clustering process with the MapReduce framework in a cloud computing environment therefore further improves text clustering efficiency substantially.
In an embodiment of the present invention, the parallel text clustering process based on MapReduce tasks designs three MapReduce tasks to carry out distributed parallel computation during text clustering, responsible respectively for searching for the "mutual minimum-similarity text pair", assigning texts to the two clusters, and the final K-means text clustering.
Specifically, step one: find the initial cluster centers of the original text cluster according to the "mutual minimum-similarity text pair" of preparation algorithm 1. The Map task picks a text d_x, computes the similarity between d_x and the remaining texts of the selected cluster S_m according to Definition 2, searches for the text d_y with minimum similarity to d_x, and then for the text d_k with minimum similarity to d_y; the Reduce task assigns d_y to d_x and d_k to d_y, and Map is reused to search for the minimum-similarity text of the new d_y, until the "mutual minimum-similarity text pair" <d_x, d_y> is found. The MapReduce process can be expressed as:
Map: <d_x, List<t_j, w_xj>>, <S_m, List<d_mi>> → <d_x&d_y, SIM(d_x, d_y)>
Repeat
Map: <d_y, List<t_j, w_yj>>, <S_m, List<d_mi>> → <d_y&d_k, SIM(d_k, d_y)>
Reduce: d_x ← d_y, d_y ← d_k
End until d_k = d_x or SIM(d_k, d_y) = SIM(d_x, d_y)
Step two: following steps 1 to 3 of preparation algorithm 2, distribute all texts of the cluster to be split into two clusters. The Map task, given the searched initial cluster centers d_x, d_y, computes the similarity between every text in cluster S_m and the centers d_x, d_y according to Definition 2, and assigns each text to one of the two clusters S_x and S_y by the maximum-similarity principle: <S_k, List<d_i, List<t_j, w_ij>>> (S_k = S_x or S_y). The Reduce task computes the centroid vector d_ek and the cluster similarity mean-square MS(S_k) of the two clusters according to Definition 3, i.e. <S_k, d_ek, MS(S_k)>. The MapReduce process can be expressed as:
Map: <S_m, List<d_mi>, d_x, d_y> → <S_k, List<d_i>>
Reduce: <S_k, List<d_i>> → <S_k, d_ek, MS(S_k)>
These two MapReduce tasks are repeated until the number of clusters equals the specified K.
Step three: following steps 4 to 7 of preparation algorithm 2, run K-means clustering from the centroids of the K clusters. The Map task reads in the whole text collection D and the centroid vectors d_ek of the K clusters, performs K-means clustering, and forms the text cluster partition, i.e. <S_k, List<d_ki>>; the rest of the process matches the text assignment of the previous MapReduce task:
Repeat:
Map: D, List<d_ek> → <S_k, List<d_ki>>
Reduce: <S_k, List<d_i>> → <S_k, d_ek, MS(S_k)>
Until the cluster partition no longer changes.
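The map/reduce structure of the second task can be simulated in plain Python as a sketch only (this is not Hadoop code and not part of the patent; all names are assumptions): the mapper emits <cluster-id, text> pairs by the maximum-similarity rule, a sort stands in for the shuffle phase, and the reducer recomputes each cluster's centroid:

```python
from itertools import groupby

def SIM(di, dj):
    ts, ta = di.keys() & dj.keys(), di.keys() | dj.keys()
    return sum(min(di[t], dj[t]) / max(di[t], dj[t]) for t in ts) / len(ta) if ta else 0.0

def mapper(text, dx, dy):
    # Emit <cluster-id, text>: "Sx" if the text is at least as similar to dx.
    key = "Sx" if SIM(text, dx) >= SIM(text, dy) else "Sy"
    yield key, text

def reducer(key, texts):
    # Emit <cluster-id, centroid>: mean weight per feature word.
    words = set().union(*(d.keys() for d in texts))
    de = {t: sum(d.get(t, 0.0) for d in texts) / len(texts) for t in words}
    return key, de

def run_task(texts, dx, dy):
    pairs = sorted((kv for d in texts for kv in mapper(d, dx, dy)),
                   key=lambda kv: kv[0])            # stand-in for shuffle/sort
    return {k: reducer(k, [v for _, v in group])[1]
            for k, group in groupby(pairs, key=lambda kv: kv[0])}
```

On a real cluster the mapper and reducer would run on different nodes over partitions of S_m; the simulation only mirrors the key/value flow <S_k, List<d_i>> → <S_k, d_ek> described above (MS(S_k) is omitted here for brevity).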
The above are specific embodiments of the present invention and the technical principles applied; any change made under the conception of the present invention, so long as the function it produces does not exceed the spirit covered by the specification and drawings, shall fall within the protection scope of the present invention.

Claims (7)

1. A distributed parallel text clustering method based on MapReduce in a cloud computing environment, characterized by comprising the following steps:
(1) represent text features using the vector space model and, combined with the text similarity computation method, obtain a text similarity computation model;
(2) select the initial two sub-cluster centers with the text clustering algorithm based on searching for a "mutual minimum-similarity text pair", and complete clustering with a bisecting K-means that finds optimal cluster centroids with a single partition, forming the text cluster partition;
(3) based on the MapReduce framework, perform distributed parallel computation with three MapReduce tasks, responsible respectively for searching for the "mutual minimum-similarity text pair", assigning texts to the two clusters, and the final K-means text clustering, until the cluster partition no longer changes, then output the clustering result.
2. The distributed parallel text clustering method based on MapReduce in a cloud computing environment according to claim 1, characterized in that the text similarity computation method is as follows: given texts d_i and d_j, TA(d_i,d_j) = {ta_1, ta_2, ..., ta_t, ..., ta_h} denotes the union of the feature words they contain, where h is the number of feature words in the union; TS(d_i,d_j) = {ts_1, ts_2, ..., ts_k, ..., ts_l} denotes the intersection of the feature words they contain, where l is the number of feature words in the intersection; then the similarity of d_i and d_j on each feature word ts_k in TS is defined as
sim(d_i, d_j, ts_k) = min(w_{i,k}, w_{j,k}) / max(w_{i,k}, w_{j,k});
where w_{i,k} is the weight of ts_k in d_i; the similarity of texts d_i and d_j is defined as
SIM(d_i, d_j) = ( Σ_{ts_k ∈ TS(d_i,d_j)} sim(d_i, d_j, ts_k) ) / |TA(d_i, d_j)|.
3. The distributed parallel text clustering method based on MapReduce in a cloud computing environment according to claim 1, characterized in that the method of representing text features with the vector space model is: given a text collection D = {d_1, d_2, ..., d_i, ..., d_N}, each text d_i is represented in the vector space model as
d_i = (<t_1, w_i1>, <t_2, w_i2>, ..., <t_j, w_ij>, ..., <t_m, w_im>)
where T = {t_1, t_2, ..., t_j, ..., t_m} denotes the set of all feature words contained in all texts of the collection, and W_i = {w_i1, w_i2, ..., w_ij, ..., w_im} denotes the weight vector of text d_i over all feature words, computed with the TF-IDF method as follows:
w_ij = tf_ij × idf_j = (n_ij / n_i) · log2(N / (N_j + 1) + 1)
where tf_ij is the frequency of feature word t_j in text d_i, n_ij is the number of occurrences of t_j in d_i, and n_i is the total number of occurrences of all feature words in d_i; idf_j is the inverse document frequency of t_j in the whole text set, measuring how widely the feature word appears; N is the total number of texts in the collection, and N_j is the number of distinct texts containing t_j.
4. The distributed parallel text clustering method based on MapReduce in a cloud computing environment according to claim 2, characterized in that the "mutual minimum-similarity text pair" is defined as: given a text cluster C = {d_1, d_2, ..., d_i, ..., d_nC}, a "mutual minimum-similarity text pair" is two texts d_i, d_j in cluster C that satisfy
SIM(d_i, d_j) = min_{d_k ∈ C} SIM(d_i, d_k) = min_{d_k ∈ C} SIM(d_j, d_k)
that is, d_i is the text in the cluster least similar to d_j, and at the same time d_j is the text in the cluster least similar to d_i.
5. The distributed parallel text clustering method based on MapReduce in a cloud computing environment according to claim 1, characterized in that the text clustering algorithm based on the "mutual minimum-similarity text pair" search proceeds as follows:
Input: text collection D = {d_1, d_2, ..., d_i, ..., d_N};
Parameter: the number of clusters K;
Output: the cluster partition S = {S_1, S_2, ..., S_k, ..., S_K} of D;
(1) initialize, taking the set D of all texts as the initial cluster: S = {S_0}, S_0 ← D;
(2) select from S the cluster S_m with the smallest cluster similarity mean-square MS as the cluster to be split;
(3) find the initial two sub-cluster centers of the cluster S_m to be split, the text pair d_x, d_y, with the "mutual minimum-similarity text pair" search algorithm;
(4) assign all texts of the cluster to be split, S_m = {d_m1, d_m2, ..., d_mi, ..., d_mn}, to clusters S_x and S_y by the maximum-similarity principle, as follows:
d_mi ∈ S_x if SIM(d_mi, d_x) ≥ SIM(d_mi, d_y); otherwise d_mi ∈ S_y;
add S_x and S_y to the partition S, and delete S_m from S;
(5) if the number of clusters in S is less than K, return to step (2); if it equals K, go to step (6);
(6) taking the centroids of the K clusters in S as initial cluster centers, cluster all texts with spherical K-means to obtain the text cluster partition S, using the text similarity computation method throughout the clustering process.
6. The distributed parallel text clustering method based on MapReduce in a cloud computing environment according to claim 5, characterized in that the cluster similarity mean-square MS is defined as follows: for a text cluster C = {d_1, d_2, ..., d_i, ..., d_nC} containing n_C texts, MS(C) is the mean of the squared similarities between all texts and the cluster centroid d_e:
MS(C) = ( Σ_{d_i ∈ C} SIM(d_i, d_e)² ) / n_C.
7. The distributed parallel text clustering method based on MapReduce in a cloud computing environment according to claim 5, characterized in that the "mutual minimum-similarity text pair" search algorithm is as follows:
Input: text cluster C = {d_1, d_2, ..., d_i, ..., d_nC}, where n_C is the number of texts in C; Output: the "mutual minimum-similarity text pair" d_x, d_y;
(i) randomly select a text d_i in cluster C and assign it to d_x: d_x ← d_i;
(ii) search cluster C for the text d_y with minimum similarity to d_x, i.e.
SIM(d_x, d_y) = min_{d_j ∈ C} SIM(d_x, d_j);
(iii) search cluster C for the text d_k with minimum similarity to d_y, i.e.
SIM(d_y, d_k) = min_{d_j ∈ C} SIM(d_y, d_j);
(iv) check the following two conditions:
(a) if d_k = d_x or SIM(d_x, d_y) = SIM(d_k, d_y), the algorithm terminates and outputs d_x, d_y as the "mutual minimum-similarity text pair", i.e. the initial cluster centers of C;
(b) if d_k ≠ d_x and SIM(d_x, d_y) ≠ SIM(d_k, d_y), assign d_x ← d_y and d_y ← d_k, and jump back to step (iii) to search again.
CN201710286671.2A 2017-04-27 2017-04-27 Distributed parallel text clustering method based on MapReduce in a cloud computing environment Withdrawn CN106971005A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710286671.2A CN106971005A (en) 2017-04-27 2017-04-27 Distributed parallel text clustering method based on MapReduce in a cloud computing environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710286671.2A CN106971005A (en) 2017-04-27 2017-04-27 Distributed parallel text clustering method based on MapReduce in a cloud computing environment

Publications (1)

Publication Number Publication Date
CN106971005A true CN106971005A (en) 2017-07-21

Family

ID=59332688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710286671.2A CN106971005A (en) 2017-04-27 2017-04-27 Distributed parallel text clustering method based on MapReduce in a cloud computing environment

Country Status (1)

Country Link
CN (1) CN106971005A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020042793A1 (en) * 2000-08-23 2002-04-11 Jun-Hyeog Choi Method of order-ranking document clusters using entropy data and bayesian self-organizing feature maps
CN105426426A (en) * 2015-11-04 2016-03-23 北京工业大学 KNN text classification method based on improved K-Medoids

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU Sen et al.: "MapReduce-based parallelization of large-scale text clustering", Journal of University of Science and Technology Beijing *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052485A (en) * 2017-12-15 2018-05-18 东软集团股份有限公司 Distributed computing method and device for vector similarity, storage medium and node
CN108052485B (en) * 2017-12-15 2021-05-07 东软集团股份有限公司 Distributed computing method and device for vector similarity, storage medium and node
CN112463958A (en) * 2020-09-29 2021-03-09 上海海事大学 Method for rapidly clustering massive texts based on MapReduce framework
CN112463958B (en) * 2020-09-29 2022-07-15 上海海事大学 Method for rapidly clustering massive texts based on MapReduce framework
CN112784046A (en) * 2021-01-20 2021-05-11 北京百度网讯科技有限公司 Text clustering method, device and equipment and storage medium
CN112784046B (en) * 2021-01-20 2024-05-28 北京百度网讯科技有限公司 Text clustering method, device, equipment and storage medium
CN116503031A (en) * 2023-06-29 2023-07-28 中国人民解放军国防科技大学 Personnel similarity calculation method, device, equipment and medium based on resume analysis
CN116503031B (en) * 2023-06-29 2023-09-08 中国人民解放军国防科技大学 Personnel similarity calculation method, device, equipment and medium based on resume analysis

Similar Documents

Publication Publication Date Title
CN110674407B (en) Hybrid recommendation method based on graph convolution neural network
CN104699772B (en) A kind of big data file classification method based on cloud computing
Yin et al. Incomplete multi-view clustering via subspace learning
CN103279556B (en) Iteration Text Clustering Method based on self adaptation sub-space learning
CN107292186A (en) A kind of model training method and device based on random forest
Eluri et al. A comparative study of various clustering techniques on big data sets using Apache Mahout
CN105426529A (en) Image retrieval method and system based on user search intention positioning
CN104699698A (en) Graph query processing method based on massive data
Suganthi et al. Instance selection and feature extraction using cuttlefish optimization algorithm and principal component analysis using decision tree
CN109784405A (en) Cross-module state search method and system based on pseudo label study and semantic consistency
Zhou et al. An effective ensemble pruning algorithm based on frequent patterns
CN106601235A (en) Semi-supervision multitask characteristic selecting speech recognition method
CN106971005A (en) Distributed parallel Text Clustering Method based on MapReduce under a kind of cloud computing environment
CN109739984A (en) A kind of parallel KNN network public-opinion sorting algorithm of improvement based on Hadoop platform
CN106203508A (en) A kind of image classification method based on Hadoop platform
Lee et al. A hybrid system for imbalanced data mining
Zhang et al. Fast exemplar-based clustering by gravity enrichment between data objects
CN113692591A (en) Node disambiguation
Gupta et al. Comparison of algorithms for document clustering
Mei et al. Proximity-based k-partitions clustering with ranking for document categorization and analysis
Gabryel A bag-of-features algorithm for applications using a NoSQL database
CN103324942B (en) A kind of image classification method, Apparatus and system
Leger Wmixnet: Software for clustering the nodes of binary and valued graphs using the stochastic block model
Park et al. Multi-attributed graph matching with multi-layer random walks
CN107480199B (en) Query reconstruction method, device, equipment and storage medium of database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20170721