CN106971005A - Distributed parallel text clustering method based on MapReduce in a cloud computing environment - Google Patents

Distributed parallel text clustering method based on MapReduce in a cloud computing environment Download PDF

Info

Publication number
CN106971005A
CN106971005A CN201710286671.2A
Authority
CN
China
Prior art keywords
text
cluster
similarity
minimum
feature words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201710286671.2A
Other languages
Chinese (zh)
Inventor
沈晔
周天和
李思剑
任培荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yang Fan Technology Co Ltd
Original Assignee
Hangzhou Yang Fan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yang Fan Technology Co Ltd filed Critical Hangzhou Yang Fan Technology Co Ltd
Priority to CN201710286671.2A priority Critical patent/CN106971005A/en
Publication of CN106971005A publication Critical patent/CN106971005A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a distributed parallel text clustering method based on MapReduce in a cloud computing environment. First, a text similarity computation method is proposed using the vector space model. Second, based on selecting the two initial sub-cluster centers by searching for a "mutual minimum-similarity text pair", a bisecting K-means clustering algorithm is proposed that finds optimal cluster centroids with a single partition. Finally, a parallel clustering method for the large-scale texts produced by cloud computing applications is designed on the MapReduce framework. Experiments on the Hadoop platform with real text data show that, across different data scales and numbers of compute nodes, the parallel clustering model achieves a clear efficiency advantage and good scalability while obtaining comparable clustering quality.

Description

Distributed parallel text clustering method based on MapReduce in a cloud computing environment
Technical field
The present invention relates to the field of cloud computing technology, and in particular to a distributed parallel text clustering method based on MapReduce in a cloud computing environment.
Background technology
Text mining extends data mining research to text data. Taking text data as the research object and applying data mining techniques, it is the process of discovering implicit, potentially valuable knowledge such as the structure, models, and patterns of textual information, and it combines research results from data mining, machine learning, natural language processing, information retrieval, information management, and other fields. The rapid growth of text data carried by Internet applications and the pressing demand for business analysis have made text mining ever more important and urgent. In particular, text clustering research — which discovers a reasonable cluster partition of a given text collection without requiring a training set or predefined categories — is one of the important research directions in the field of text mining.
With the large-scale development of Internet applications (microblogs, e-commerce, and search engines), efficiently mining the large-scale texts these applications generate has become a great challenge for data mining research and practice. Distributed parallel computing offers powerful computing capability for large-scale data and is simple and convenient to implement, so distributed text mining, which introduces distributed parallel computing into text mining, has become a research hotspot in recent years. The rise of cloud computing provides more frameworks for distributed parallel computing. Among them, the MapReduce framework proposed by Google lets users improve computational efficiency by defining Map and Reduce tasks that distribute large-scale computation across multiple compute nodes; the emergence of the open-source, cloud-oriented Hadoop platform further facilitates implementing distributed parallel computing models based on MapReduce, and researchers have developed the Mahout library for machine learning and data mining algorithms.
Summary of the invention
To overcome the above shortcomings, the present invention aims to provide a distributed parallel text clustering method based on MapReduce in a cloud computing environment. The method first proposes a text similarity computation method using the vector space model; on this basis, it proposes a bisecting K-means clustering algorithm that finds optimal cluster centroids with a single partition; then, a parallel clustering method for the large-scale texts of cloud computing applications is designed on the MapReduce framework. The invention targets large-scale text mining applications on cloud computing platforms and improves the efficiency of text clustering.
The present invention achieves the above purpose through the following technical scheme: a distributed parallel text clustering method based on MapReduce in a cloud computing environment, comprising the following steps:
(1) represent text features using the vector space model and, combined with the text similarity computation method, obtain a text similarity computation model;
(2) select the initial two sub-cluster centers with the text clustering algorithm based on searching for a "mutual minimum-similarity text pair", and complete clustering with a bisecting K-means that finds optimal cluster centroids with a single partition, forming the text cluster partition;
(3) based on the MapReduce framework, perform distributed parallel computation with three MapReduce tasks, responsible respectively for searching for the "mutual minimum-similarity text pair", assigning texts to the two clusters, and the final K-means text clustering, until the cluster partition no longer changes, then output the clustering result.
Preferably, the text similarity computation method is as follows:
Given texts d_i and d_j, let TA(d_i,d_j) = {ta_1, ta_2, ..., ta_t, ..., ta_h} denote the union of the feature words contained in d_i and d_j, where h is the number of feature words in the union; let TS(d_i,d_j) = {ts_1, ts_2, ..., ts_k, ..., ts_l} denote the intersection of the feature words contained in d_i and d_j, where l is the number of feature words in the intersection. The similarity of d_i and d_j on each feature word ts_k in TS is defined as
sim(d_i, d_j, ts_k) = min(w_{i,k}, w_{j,k}) / max(w_{i,k}, w_{j,k})
where w_{i,k} is the weight of ts_k in d_i. The similarity SIM(d_i,d_j) of texts d_i and d_j is defined as
SIM(d_i, d_j) = ( Σ_{ts_k ∈ TS(d_i,d_j)} sim(d_i, d_j, ts_k) ) / |TA(d_i, d_j)|
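The similarity measure above can be sketched in a few lines of Python. This is an illustration only, not part of the patent; representing each text as a {feature word: TF-IDF weight} dict, and the function names, are assumptions:

```python
# Sketch of the patent's similarity measure: per-word similarity is the ratio
# of the smaller weight to the larger one; text similarity is the sum of
# per-word similarities over the intersection TS, divided by |TA| (the union).

def sim_word(wi: float, wj: float) -> float:
    """Similarity of two texts on one common feature word."""
    return min(wi, wj) / max(wi, wj)

def SIM(di: dict, dj: dict) -> float:
    """Text similarity per the patent's definition."""
    ts = di.keys() & dj.keys()   # common feature words TS(di, dj)
    ta = di.keys() | dj.keys()   # union of feature words TA(di, dj)
    if not ta:
        return 0.0
    return sum(sim_word(di[t], dj[t]) for t in ts) / len(ta)

d1 = {"cloud": 0.5, "cluster": 0.2, "text": 0.1}
d2 = {"cloud": 0.25, "cluster": 0.2, "mapreduce": 0.4}
print(SIM(d1, d2))  # (0.5 + 1.0) / 4 = 0.375
```

Note how words appearing in only one of the two texts contribute nothing to the numerator but enlarge the denominator, penalizing texts with little vocabulary overlap.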
Preferably, the method of representing text features with the vector space model is: given a text collection D = {d_1, d_2, ..., d_i, ..., d_N}, each text d_i is represented in the vector space model as
d_i = (<t_1, w_i1>, <t_2, w_i2>, ..., <t_j, w_ij>, ..., <t_m, w_im>)
where T = {t_1, t_2, ..., t_j, ..., t_m} denotes the set of all feature words contained in all texts of the collection, and W_i = {w_i1, w_i2, ..., w_ij, ..., w_im} denotes the weight vector of text d_i over all feature words, computed with the TF-IDF method as follows:
w_ij = tf_ij × idf_j = (n_ij / n_i) · log2(N / (N_j + 1) + 1)
where tf_ij is the frequency of feature word t_j in text d_i, n_ij is the number of occurrences of t_j in d_i, and n_i is the total number of occurrences of all feature words in d_i; idf_j is the inverse document frequency of t_j in the whole text set, measuring how widely the feature word appears; N is the total number of texts in the collection, and N_j is the number of distinct texts containing t_j.
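The TF-IDF weighting above can be sketched as follows (illustrative only; the function name and argument names are assumptions, and the formula is the patent's variant with the smoothed logarithm):

```python
import math

# w_ij = (n_ij / n_i) * log2(N / (N_j + 1) + 1), where n_ij is the count of
# feature word t_j in text d_i, n_i the total word count of d_i, N the number
# of texts in the collection, and N_j the number of texts containing t_j.

def tfidf_weight(n_ij: int, n_i: int, N: int, N_j: int) -> float:
    tf = n_ij / n_i                        # term frequency within the text
    idf = math.log2(N / (N_j + 1) + 1)     # smoothed inverse document frequency
    return tf * idf

# A word appearing 3 times in a 100-word text, in 9 of 1000 texts:
print(tfidf_weight(3, 100, 1000, 9))  # 0.03 * log2(101), roughly 0.1997
```

The "+1" terms keep the logarithm finite and positive even when a word occurs in every text, a common smoothing choice.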
Preferably, the "mutual minimum-similarity text pair" is defined as follows: given a text cluster C = {d_1, d_2, ..., d_i, ..., d_nC}, a "mutual minimum-similarity text pair" is two texts d_i, d_j in cluster C that satisfy
SIM(d_i, d_j) = min_{d_k ∈ C} SIM(d_i, d_k) = min_{d_k ∈ C} SIM(d_j, d_k)
that is, d_i is the text in the cluster least similar to d_j, and at the same time d_j is the text in the cluster least similar to d_i.
Preferably, the text clustering algorithm based on the "mutual minimum-similarity text pair" search proceeds as follows:
Input: text collection D = {d_1, d_2, ..., d_i, ..., d_N};
Parameter: the number of clusters K;
Output: the cluster partition S = {S_1, S_2, ..., S_k, ..., S_K} of D;
(1) initialize, taking the set D of all texts as the initial cluster: S = {S_0}, S_0 ← D;
(2) select from S the cluster S_m with the smallest cluster similarity mean-square MS as the cluster to be split;
(3) find the initial two sub-cluster centers of S_m, the text pair d_x, d_y, with the "mutual minimum-similarity text pair" search algorithm;
(4) assign all texts of the cluster to be split, S_m = {d_m1, d_m2, ..., d_mi, ..., d_mn}, to clusters S_x and S_y by the maximum-similarity principle:
d_mi ∈ S_x if SIM(d_mi, d_x) ≥ SIM(d_mi, d_y); otherwise d_mi ∈ S_y;
add S_x and S_y to the partition S, and delete S_m from S;
(5) if the number of clusters in S is less than K, return to step (2); if it equals K, go to step (6);
(6) taking the centroids of the K clusters in S as initial cluster centers, cluster all texts with spherical K-means to obtain the text cluster partition S, using the text similarity computation method throughout the clustering process.
Preferably, the cluster similarity mean-square MS is defined as follows: for a text cluster C = {d_1, d_2, ..., d_i, ..., d_nC} containing n_C texts, MS(C) is the mean of the squared similarities between all texts and the cluster centroid d_e:
MS(C) = ( Σ_{d_i ∈ C} SIM(d_i, d_e)² ) / n_C
" each other minimum similarity degree text to " searching algorithm is as follows preferably, described:
Input:Text clusternCFor the quantity of text cluster C Chinese versions;Output:" each other Minimum similarity degree text to " dx,dy
(i) text d is randomly selected in text cluster CiIt is assigned to dx,dx←di
(ii) search and text d in text cluster CxThe minimum text d of similarityy, i.e.,
(iii) search and text d in text cluster CyThe minimum text d of similarityk, i.e.,
(iv) following two conditions are judged:
If (a) dk=dxOr SIM (dx,dy)=SIM (dk,dy), then algorithm terminates, and exports dx,dyTo be " minimum each other similar Spend text to ", i.e. text cluster C initial cluster center;
If (b) dk≠dxAnd SIM (dx, dy) ≠ SIM (dk, dy), then assignment dx←dy, dy← dk, redirects execution step (iii) re-search for.
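The greedy search in steps (i)-(iv) can be sketched as follows. This is illustrative only; the similarity function is passed in as a parameter, and in the demo texts are simple indices into an assumed similarity table:

```python
import random

# Greedy search for a "mutual minimum-similarity text pair": repeatedly hop
# from the current candidate to its least-similar text until the hop returns
# to where it came from (or finds an equally dissimilar text), i.e. the pair
# is stable.

def find_mutual_min_pair(cluster, SIM):
    dx = random.choice(cluster)                                   # step (i)
    dy = min((d for d in cluster if d != dx),
             key=lambda d: SIM(dx, d))                            # step (ii)
    while True:
        dk = min((d for d in cluster if d != dy),
                 key=lambda d: SIM(dy, d))                        # step (iii)
        if dk == dx or SIM(dx, dy) == SIM(dk, dy):                # step (iv)(a)
            return dx, dy
        dx, dy = dy, dk                                           # step (iv)(b)

# Toy demo: texts 0 and 2 are each other's least-similar texts.
sims = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.5}
toy_sim = lambda a, b: sims.get((min(a, b), max(a, b)), 1.0)
print(find_mutual_min_pair([0, 1, 2], toy_sim))  # (0, 2) or (2, 0)
```

Whatever the random starting text, the hops converge on the pair {0, 2} here, matching the termination conditions of step (iv).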
The beneficial effects of the present invention are: the invention targets large-scale text mining applications on cloud computing platforms and improves the efficiency of text clustering.
Brief description of the drawings
Fig. 1 is a schematic diagram of the main flow of the clustering method of the present invention.
Embodiment
The present invention is described further below with reference to a specific embodiment, but the protection scope of the present invention is not limited thereto:
Embodiment: as shown in Fig. 1, a distributed parallel text clustering method based on MapReduce in a cloud computing environment comprises the following steps:
(1) represent text features using the vector space model and, combined with the text similarity computation method, obtain the text similarity computation model;
Definition 1 represents text features with the vector space model: given a text collection D = {d_1, d_2, ..., d_i, ..., d_N}, each text d_i is represented in the vector space model as d_i = (<t_1, w_i1>, <t_2, w_i2>, ..., <t_j, w_ij>, ..., <t_m, w_im>). Here T = {t_1, t_2, ..., t_j, ..., t_m} denotes the set of all feature words contained in all texts of the collection, and W_i = {w_i1, w_i2, ..., w_ij, ..., w_im} denotes the weight vector of text d_i over all feature words, computed with the TF-IDF method:
w_ij = tf_ij × idf_j = (n_ij / n_i) · log2(N / (N_j + 1) + 1)     (1)
In this formula, tf_ij is the frequency of feature word t_j in text d_i, n_ij is the number of occurrences of t_j in d_i, and n_i is the total number of occurrences of all feature words in d_i; idf_j is the inverse document frequency of t_j in the whole text set, measuring how widely the feature word appears; N is the total number of texts in the collection, and N_j is the number of distinct texts containing t_j. Clearly, the more frequently a feature word occurs in a particular document, the stronger its ability to distinguish the content attributes of that text (TF); the wider the range over which it occurs in the text set, the weaker its ability to distinguish text content (IDF).
Definition 2 defines text similarity as follows: given texts d_i and d_j, TA(d_i,d_j) = {ta_1, ta_2, ..., ta_t, ..., ta_h} denotes the union of the feature words they contain, where h is the number of feature words in the union; TS(d_i,d_j) = {ts_1, ts_2, ..., ts_k, ..., ts_l} denotes the intersection of the feature words they contain, where l is the number of feature words in the intersection. The similarity of d_i and d_j on each feature word ts_k in TS is defined as
sim(d_i, d_j, ts_k) = min(w_{i,k}, w_{j,k}) / max(w_{i,k}, w_{j,k})     (2)
where w_{i,k} is the weight of ts_k in d_i. The similarity of d_i and d_j is defined as
SIM(d_i, d_j) = ( Σ_{ts_k ∈ TS(d_i,d_j)} sim(d_i, d_j, ts_k) ) / |TA(d_i, d_j)|     (3)
That is, the similarity of two texts is the ratio of the sum of their similarities on all common feature words to the number of all feature words the two texts contain. Compared with the classical cosine similarity (W_i·W_j / (|W_i|·|W_j|)), formula (3) likewise builds its numerator from the common feature words of the two texts, and its denominator likewise accounts for the remaining feature words of each text beyond the common ones. The difference is that formula (3) computes the similarity of each common feature word precisely and separately, rather than computing an overall similarity directly from the vector inner product as the cosine of the included angle does.
(2) select the initial two sub-cluster centers with the text clustering algorithm based on searching for a "mutual minimum-similarity text pair", and complete clustering with a bisecting K-means that finds optimal cluster centroids with a single partition, forming the text cluster partition;
Definition 3 defines the cluster similarity mean-square: for a text cluster C = {d_1, d_2, ..., d_i, ..., d_nC} containing n_C texts, the cluster similarity mean-square MS(C) is the mean of the squared similarities between all texts and the cluster centroid:
MS(C) = ( Σ_{d_i ∈ C} SIM(d_i, d_e)² ) / n_C     (4)
where d_e is the feature vector of the cluster centroid, i.e. d_e = (<t_1, w_e1>, <t_2, w_e2>, ..., <t_j, w_ej>, ..., <t_m, w_em>), with each w_ej taken as the mean of the weights of feature word t_j over all texts in the cluster:
w_ej = ( Σ_{d_i ∈ C} w_ij ) / n_C     (5)
In an embodiment of the present invention, after selecting a cluster to split, the original bisecting K-means method randomly selects initial cluster centers in the spirit of K-means, performs a two-way partition, and searches for the optimal partition over many iterations. Here, a "mutual minimum-similarity text pair" of a cluster refers to two texts d_i, d_j in a text cluster C = {d_1, d_2, ..., d_i, ..., d_nC} that satisfy
SIM(d_i, d_j) = min_{d_k ∈ C} SIM(d_i, d_k) = min_{d_k ∈ C} SIM(d_j, d_k)     (6)
that is, d_i is the text in the cluster least similar to d_j, and at the same time d_j is the text in the cluster least similar to d_i.
The present invention proposes to determine the initial two sub-cluster centers from a "mutual minimum-similarity text pair" found in the cluster. Clearly, a text cluster may contain more than one pair satisfying formula (6). The greedy search algorithm for a "mutual minimum-similarity text pair" of a cluster is therefore given as follows.
Preparation algorithm 1: "mutual minimum-similarity text pair" search algorithm.
Input: text cluster C = {d_1, d_2, ..., d_i, ..., d_nC}, where n_C is the number of texts in C.
Output: the "mutual minimum-similarity text pair" d_x, d_y.
The algorithm steps are as follows:
Step 1: randomly select a text d_i in cluster C and assign it to d_x: d_x ← d_i.
Step 2: search cluster C for the text d_y with minimum similarity to d_x, i.e.
SIM(d_x, d_y) = min_{d_j ∈ C} SIM(d_x, d_j).
Step 3: search cluster C for the text d_k with minimum similarity to d_y, i.e.
SIM(d_y, d_k) = min_{d_j ∈ C} SIM(d_y, d_j).
Step 4: check the following two conditions:
4.1 if d_k = d_x or SIM(d_x, d_y) = SIM(d_k, d_y), the algorithm terminates and outputs d_x, d_y as the "mutual minimum-similarity text pair", i.e. the initial cluster centers of C;
4.2 if d_k ≠ d_x and SIM(d_x, d_y) ≠ SIM(d_k, d_y), assign d_x ← d_y and d_y ← d_k, and return to step 3 to search again.
Step 5: end.
In an embodiment of the present invention, the text clustering algorithm based on the "mutual minimum-similarity text pair" search combines the proposed initial-center selection method with the idea of bisecting K-means. Its steps are given as follows.
Preparation algorithm 2: text clustering algorithm based on the "mutual minimum-similarity text pair" search.
Input: text collection D = {d_1, d_2, ..., d_i, ..., d_N}.
Parameter: the number of clusters K.
Output: the cluster partition S = {S_1, S_2, ..., S_k, ..., S_K} of D.
The algorithm steps are as follows:
Step 1: initialize. Take the set D of all texts as the initial cluster: S = {S_0}, S_0 ← D.
Step 2: select from S, according to formula (4), the cluster S_m with the smallest cluster similarity mean-square MS as the cluster to be split.
Step 3: find the initial two sub-cluster centers of S_m, the text pair d_x, d_y, with preparation algorithm 1.
Step 4: assign all texts of the cluster to be split, S_m = {d_m1, d_m2, ..., d_mi, ..., d_mn}, to clusters S_x and S_y by the maximum-similarity principle:
d_mi ∈ S_x if SIM(d_mi, d_x) ≥ SIM(d_mi, d_y); otherwise d_mi ∈ S_y.
Add S_x and S_y to the partition S, and delete S_m from S.
Step 5: if the number of clusters in S is less than K, return to step 2; if it equals K, go to step 6.
Step 6: taking the centroids of the K clusters in S as initial cluster centers, cluster all texts with spherical K-means to obtain the text cluster partition S, using the text similarity computation method of Definition 2 throughout the clustering process.
Step 7: end.
As the steps of preparation algorithm 2 show, the proposed text clustering algorithm distributes all objects in a single pass (step 4) after searching for the initial two sub-cluster centers, obtaining the cluster partition without the repeated iterative optimization of the original bisecting K-means algorithm.
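Preparation algorithm 2 can be condensed into the following self-contained Python sketch. This is illustrative only: texts are assumed to be {feature word: weight} dicts, all names are assumptions, and the final spherical K-means refinement of step 6 is omitted for brevity:

```python
import random

# Bisecting clustering driven by the "mutual minimum-similarity text pair".

def SIM(di, dj):
    ts, ta = di.keys() & dj.keys(), di.keys() | dj.keys()
    return sum(min(di[t], dj[t]) / max(di[t], dj[t]) for t in ts) / len(ta) if ta else 0.0

def centroid(cluster):
    words = set().union(*(d.keys() for d in cluster))
    return {t: sum(d.get(t, 0.0) for d in cluster) / len(cluster) for t in words}

def MS(cluster):
    de = centroid(cluster)
    return sum(SIM(d, de) ** 2 for d in cluster) / len(cluster)

def find_mutual_min_pair(cluster):
    dx = random.choice(cluster)
    dy = min((d for d in cluster if d is not dx), key=lambda d: SIM(dx, d))
    while True:
        dk = min((d for d in cluster if d is not dy), key=lambda d: SIM(dy, d))
        if dk is dx or SIM(dx, dy) == SIM(dk, dy):
            return dx, dy
        dx, dy = dy, dk

def bisecting_cluster(texts, K):
    S = [list(texts)]                             # step 1: one initial cluster
    while len(S) < K:                             # step 5: repeat until K clusters
        Sm = min(S, key=MS)                       # step 2: loosest cluster splits
        dx, dy = find_mutual_min_pair(Sm)         # step 3: initial centers
        Sx = [d for d in Sm if SIM(d, dx) >= SIM(d, dy)]  # step 4: one-pass
        Sy = [d for d in Sm if SIM(d, dx) < SIM(d, dy)]   # assignment
        S.remove(Sm)
        S.extend([Sx, Sy])
    return S
```

Note that each split is a single pass over the cluster's texts, matching the one-pass character of step 4 noted above.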
(3) based on the MapReduce framework, perform distributed parallel computation with three MapReduce tasks, responsible respectively for searching for the "mutual minimum-similarity text pair", assigning texts to the two clusters, and the final K-means text clustering, until the cluster partition no longer changes, then output the clustering result.
Parallel text clustering model based on MapReduce: although the improved algorithm may raise clustering efficiency compared with the original bisecting clustering, the efficiency gain obtained by improving the clustering algorithm itself falls far short of the needs of clustering large-scale, massive texts in practical cloud computing applications; parallelizing the text clustering process with the MapReduce framework in a cloud computing environment therefore further improves text clustering efficiency substantially.
In an embodiment of the present invention, the parallel text clustering process based on MapReduce tasks designs three MapReduce tasks to carry out distributed parallel computation during text clustering, responsible respectively for searching for the "mutual minimum-similarity text pair", assigning texts to the two clusters, and the final K-means text clustering.
Specifically, step one: find the initial cluster centers of the original text cluster according to the "mutual minimum-similarity text pair" of preparation algorithm 1. The Map task picks a text d_x, computes the similarity between d_x and the remaining texts of the selected cluster S_m according to Definition 2, searches for the text d_y with minimum similarity to d_x, and then for the text d_k with minimum similarity to d_y; the Reduce task assigns d_y to d_x and d_k to d_y, and Map is reused to search for the minimum-similarity text of the new d_y, until the "mutual minimum-similarity text pair" <d_x, d_y> is found. The MapReduce process can be expressed as:
Map: <d_x, List<t_j, w_xj>>, <S_m, List<d_mi>> → <d_x&d_y, SIM(d_x, d_y)>
Repeat
Map: <d_y, List<t_j, w_yj>>, <S_m, List<d_mi>> → <d_y&d_k, SIM(d_k, d_y)>
Reduce: d_x ← d_y, d_y ← d_k
End until d_k = d_x or SIM(d_k, d_y) = SIM(d_x, d_y)
Step two: following steps 1 to 3 of preparation algorithm 2, distribute all texts of the cluster to be split into two clusters. The Map task, given the searched initial cluster centers d_x, d_y, computes the similarity between every text in cluster S_m and the centers d_x, d_y according to Definition 2, and assigns each text to one of the two clusters S_x and S_y by the maximum-similarity principle: <S_k, List<d_i, List<t_j, w_ij>>> (S_k = S_x or S_y). The Reduce task computes the centroid vector d_ek and the cluster similarity mean-square MS(S_k) of the two clusters according to Definition 3, i.e. <S_k, d_ek, MS(S_k)>. The MapReduce process can be expressed as:
Map: <S_m, List<d_mi>, d_x, d_y> → <S_k, List<d_i>>
Reduce: <S_k, List<d_i>> → <S_k, d_ek, MS(S_k)>
These two MapReduce tasks are repeated until the number of clusters equals the specified K.
Step three: following steps 4 to 7 of preparation algorithm 2, run K-means clustering from the centroids of the K clusters. The Map task reads in the whole text collection D and the centroid vectors d_ek of the K clusters, performs K-means clustering, and forms the text cluster partition, i.e. <S_k, List<d_ki>>; the rest of the process matches the text assignment of the previous MapReduce task:
Repeat:
Map: D, List<d_ek> → <S_k, List<d_ki>>
Reduce: <S_k, List<d_i>> → <S_k, d_ek, MS(S_k)>
Until the cluster partition no longer changes.
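The map/reduce structure of the second task can be simulated in plain Python as a sketch only (this is not Hadoop code and not part of the patent; all names are assumptions): the mapper emits <cluster-id, text> pairs by the maximum-similarity rule, a sort stands in for the shuffle phase, and the reducer recomputes each cluster's centroid:

```python
from itertools import groupby

def SIM(di, dj):
    ts, ta = di.keys() & dj.keys(), di.keys() | dj.keys()
    return sum(min(di[t], dj[t]) / max(di[t], dj[t]) for t in ts) / len(ta) if ta else 0.0

def mapper(text, dx, dy):
    # Emit <cluster-id, text>: "Sx" if the text is at least as similar to dx.
    key = "Sx" if SIM(text, dx) >= SIM(text, dy) else "Sy"
    yield key, text

def reducer(key, texts):
    # Emit <cluster-id, centroid>: mean weight per feature word.
    words = set().union(*(d.keys() for d in texts))
    de = {t: sum(d.get(t, 0.0) for d in texts) / len(texts) for t in words}
    return key, de

def run_task(texts, dx, dy):
    pairs = sorted((kv for d in texts for kv in mapper(d, dx, dy)),
                   key=lambda kv: kv[0])            # stand-in for shuffle/sort
    return {k: reducer(k, [v for _, v in group])[1]
            for k, group in groupby(pairs, key=lambda kv: kv[0])}
```

On a real cluster the mapper and reducer would run on different nodes over partitions of S_m; the simulation only mirrors the key/value flow <S_k, List<d_i>> → <S_k, d_ek> described above (MS(S_k) is omitted here for brevity).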
The above are specific embodiments of the present invention and the technical principles applied; any change made under the conception of the present invention, so long as the function it produces does not exceed the spirit covered by the specification and drawings, shall fall within the protection scope of the present invention.

Claims (7)

1. A distributed parallel text clustering method based on MapReduce in a cloud computing environment, characterized by comprising the following steps:
(1) represent text features using the vector space model and, combined with the text similarity computation method, obtain a text similarity computation model;
(2) select the initial two sub-cluster centers with the text clustering algorithm based on searching for a "mutual minimum-similarity text pair", and complete clustering with a bisecting K-means that finds optimal cluster centroids with a single partition, forming the text cluster partition;
(3) based on the MapReduce framework, perform distributed parallel computation with three MapReduce tasks, responsible respectively for searching for the "mutual minimum-similarity text pair", assigning texts to the two clusters, and the final K-means text clustering, until the cluster partition no longer changes, then output the clustering result.
2. The distributed parallel text clustering method based on MapReduce in a cloud computing environment according to claim 1, characterized in that the text similarity computation method is as follows: given texts d_i and d_j, TA(d_i,d_j) = {ta_1, ta_2, ..., ta_t, ..., ta_h} denotes the union of the feature words they contain, where h is the number of feature words in the union; TS(d_i,d_j) = {ts_1, ts_2, ..., ts_k, ..., ts_l} denotes the intersection of the feature words they contain, where l is the number of feature words in the intersection; then the similarity of d_i and d_j on each feature word ts_k in TS is defined as
sim(d_i, d_j, ts_k) = min(w_{i,k}, w_{j,k}) / max(w_{i,k}, w_{j,k});
where w_{i,k} is the weight of ts_k in d_i; the similarity of texts d_i and d_j is defined as
SIM(d_i, d_j) = ( Σ_{ts_k ∈ TS(d_i,d_j)} sim(d_i, d_j, ts_k) ) / |TA(d_i, d_j)|.
3. The distributed parallel text clustering method based on MapReduce in a cloud computing environment according to claim 1, characterized in that the method of representing text features with the vector space model is: given a text collection D = {d_1, d_2, ..., d_i, ..., d_N}, each text d_i is represented in the vector space model as
d_i = (<t_1, w_i1>, <t_2, w_i2>, ..., <t_j, w_ij>, ..., <t_m, w_im>)
where T = {t_1, t_2, ..., t_j, ..., t_m} denotes the set of all feature words contained in all texts of the collection, and W_i = {w_i1, w_i2, ..., w_ij, ..., w_im} denotes the weight vector of text d_i over all feature words, computed with the TF-IDF method as follows:
w_ij = tf_ij × idf_j = (n_ij / n_i) · log2(N / (N_j + 1) + 1)
where tf_ij is the frequency of feature word t_j in text d_i, n_ij is the number of occurrences of t_j in d_i, and n_i is the total number of occurrences of all feature words in d_i; idf_j is the inverse document frequency of t_j in the whole text set, measuring how widely the feature word appears; N is the total number of texts in the collection, and N_j is the number of distinct texts containing t_j.
4. The distributed parallel text clustering method based on MapReduce in a cloud computing environment according to claim 2, characterized in that the "mutual minimum-similarity text pair" is defined as: given a text cluster C = {d_1, d_2, ..., d_i, ..., d_nC}, a "mutual minimum-similarity text pair" is two texts d_i, d_j in cluster C that satisfy
SIM(d_i, d_j) = min_{d_k ∈ C} SIM(d_i, d_k) = min_{d_k ∈ C} SIM(d_j, d_k)
that is, d_i is the text in the cluster least similar to d_j, and at the same time d_j is the text in the cluster least similar to d_i.
5. The distributed parallel text clustering method based on MapReduce in a cloud computing environment according to claim 1, characterized in that the text clustering algorithm based on the "mutual minimum-similarity text pair" search proceeds as follows:
Input: text collection D = {d_1, d_2, ..., d_i, ..., d_N};
Parameter: the number of clusters K;
Output: the cluster partition S = {S_1, S_2, ..., S_k, ..., S_K} of D;
(1) initialize, taking the set D of all texts as the initial cluster: S = {S_0}, S_0 ← D;
(2) select from S the cluster S_m with the smallest cluster similarity mean-square MS as the cluster to be split;
(3) find the initial two sub-cluster centers of the cluster S_m to be split, the text pair d_x, d_y, with the "mutual minimum-similarity text pair" search algorithm;
(4) assign all texts of the cluster to be split, S_m = {d_m1, d_m2, ..., d_mi, ..., d_mn}, to clusters S_x and S_y by the maximum-similarity principle, as follows:
d_mi ∈ S_x if SIM(d_mi, d_x) ≥ SIM(d_mi, d_y); otherwise d_mi ∈ S_y;
add S_x and S_y to the partition S, and delete S_m from S;
(5) if the number of clusters in S is less than K, return to step (2); if it equals K, go to step (6);
(6) taking the centroids of the K clusters in S as initial cluster centers, cluster all texts with spherical K-means to obtain the text cluster partition S, using the text similarity computation method throughout the clustering process.
6. The distributed parallel text clustering method based on MapReduce in a cloud computing environment according to claim 5, characterized in that the cluster similarity mean-square MS is defined as follows: for a text cluster C = {d_1, d_2, ..., d_i, ..., d_nC} containing n_C texts, MS(C) is the mean of the squared similarities between all texts and the cluster centroid d_e:
MS(C) = ( Σ_{d_i ∈ C} SIM(d_i, d_e)² ) / n_C.
7. The distributed parallel text clustering method based on MapReduce in a cloud computing environment according to claim 5, characterized in that the "mutual minimum-similarity text pair" search algorithm is as follows:
Input: text cluster C = {d_1, d_2, ..., d_i, ..., d_nC}, where n_C is the number of texts in C; Output: the "mutual minimum-similarity text pair" d_x, d_y;
(i) randomly select a text d_i in cluster C and assign it to d_x: d_x ← d_i;
(ii) search cluster C for the text d_y with minimum similarity to d_x, i.e.
SIM(d_x, d_y) = min_{d_j ∈ C} SIM(d_x, d_j);
(iii) search cluster C for the text d_k with minimum similarity to d_y, i.e.
SIM(d_y, d_k) = min_{d_j ∈ C} SIM(d_y, d_j);
(iv) check the following two conditions:
(a) if d_k = d_x or SIM(d_x, d_y) = SIM(d_k, d_y), the algorithm terminates and outputs d_x, d_y as the "mutual minimum-similarity text pair", i.e. the initial cluster centers of C;
(b) if d_k ≠ d_x and SIM(d_x, d_y) ≠ SIM(d_k, d_y), assign d_x ← d_y and d_y ← d_k, and jump back to step (iii) to search again.
CN201710286671.2A 2017-04-27 2017-04-27 Distributed parallel text clustering method based on MapReduce in a cloud computing environment Withdrawn CN106971005A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710286671.2A CN106971005A (en) 2017-04-27 2017-04-27 Distributed parallel text clustering method based on MapReduce in a cloud computing environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710286671.2A CN106971005A (en) 2017-04-27 2017-04-27 Distributed parallel text clustering method based on MapReduce in a cloud computing environment

Publications (1)

Publication Number Publication Date
CN106971005A true CN106971005A (en) 2017-07-21

Family

ID=59332688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710286671.2A CN106971005A (en) 2017-04-27 2017-04-27 Distributed parallel text clustering method based on MapReduce in a cloud computing environment

Country Status (1)

Country Link
CN (1) CN106971005A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020042793A1 (en) * 2000-08-23 2002-04-11 Jun-Hyeog Choi Method of order-ranking document clusters using entropy data and bayesian self-organizing feature maps
CN105426426A (en) * 2015-11-04 2016-03-23 北京工业大学 KNN text classification method based on improved K-Medoids

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU Sen et al.: "MapReduce-based parallelization of large-scale text clustering", Journal of University of Science and Technology Beijing *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052485A (en) * 2017-12-15 2018-05-18 东软集团股份有限公司 Distributed computing method and device for vector similarity, storage medium and node
CN108052485B (en) * 2017-12-15 2021-05-07 东软集团股份有限公司 Distributed computing method and device for vector similarity, storage medium and node
CN112463958A (en) * 2020-09-29 2021-03-09 上海海事大学 Method for rapidly clustering massive texts based on MapReduce framework
CN112463958B (en) * 2020-09-29 2022-07-15 上海海事大学 Method for rapidly clustering massive texts based on MapReduce framework
CN112784046A (en) * 2021-01-20 2021-05-11 北京百度网讯科技有限公司 Text clustering method, device and equipment and storage medium
CN112784046B (en) * 2021-01-20 2024-05-28 北京百度网讯科技有限公司 Text clustering method, device, equipment and storage medium
CN116503031A (en) * 2023-06-29 2023-07-28 中国人民解放军国防科技大学 Personnel similarity calculation method, device, equipment and medium based on resume analysis
CN116503031B (en) * 2023-06-29 2023-09-08 中国人民解放军国防科技大学 Personnel similarity calculation method, device, equipment and medium based on resume analysis

Similar Documents

Publication Publication Date Title
CN110674407B (en) Hybrid recommendation method based on graph convolution neural network
CN104699772B (en) A kind of big data file classification method based on cloud computing
Yin et al. Incomplete multi-view clustering via subspace learning
CN103279556B (en) Iteration Text Clustering Method based on self adaptation sub-space learning
CN107292186A (en) A kind of model training method and device based on random forest
Eluri et al. A comparative study of various clustering techniques on big data sets using Apache Mahout
CN105426529A (en) Image retrieval method and system based on user search intention positioning
CN104699698A (en) Graph query processing method based on massive data
Suganthi et al. Instance selection and feature extraction using cuttlefish optimization algorithm and principal component analysis using decision tree
CN109784405A (en) Cross-module state search method and system based on pseudo label study and semantic consistency
Zhou et al. An effective ensemble pruning algorithm based on frequent patterns
CN106601235A (en) Semi-supervision multitask characteristic selecting speech recognition method
CN106971005A (en) Distributed parallel Text Clustering Method based on MapReduce under a kind of cloud computing environment
CN109739984A (en) A kind of parallel KNN network public-opinion sorting algorithm of improvement based on Hadoop platform
CN106203508A (en) A kind of image classification method based on Hadoop platform
Lee et al. A hybrid system for imbalanced data mining
Zhang et al. Fast exemplar-based clustering by gravity enrichment between data objects
CN113692591A (en) Node disambiguation
Gupta et al. Comparison of algorithms for document clustering
Mei et al. Proximity-based k-partitions clustering with ranking for document categorization and analysis
Gabryel A bag-of-features algorithm for applications using a NoSQL database
CN103324942B (en) A kind of image classification method, Apparatus and system
Leger Wmixnet: Software for clustering the nodes of binary and valued graphs using the stochastic block model
Park et al. Multi-attributed graph matching with multi-layer random walks
CN107480199B (en) Query reconstruction method, device, equipment and storage medium of database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20170721