CN106971005A - MapReduce-based distributed parallel text clustering method in a cloud computing environment - Google Patents
- Publication number
- CN106971005A CN106971005A CN201710286671.2A CN201710286671A CN106971005A CN 106971005 A CN106971005 A CN 106971005A CN 201710286671 A CN201710286671 A CN 201710286671A CN 106971005 A CN106971005 A CN 106971005A
- Authority
- CN
- China
- Prior art keywords
- text
- cluster
- similarity
- minimum
- feature words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
Abstract
The present invention relates to a MapReduce-based distributed parallel text clustering method in a cloud computing environment. First, a text similarity computation method is proposed using the vector space model. Second, based on searching for a "mutual minimum-similarity text pair" to select the two initial sub-cluster centers, a bisecting K-means clustering algorithm is proposed that optimizes cluster centroids with a single division pass. Finally, a parallel clustering method for the large-scale texts of cloud computing applications is designed on the MapReduce framework. Experiments on the Hadoop platform with real text data, across different data scales and numbers of compute nodes, show that the parallel clustering model achieves suitable clustering quality while offering a clear efficiency advantage and good scalability.
Description
Technical field
The present invention relates to the field of cloud computing technology, and in particular to a MapReduce-based distributed parallel text clustering method in a cloud computing environment.
Background art
Text mining extends data mining research to textual data: taking text data as the research object, it applies data mining techniques to uncover implicit, potentially valuable knowledge such as the structure, models, and patterns of textual information, drawing on results from data mining, machine learning, natural language processing, information retrieval, information management, and other fields. The rapid growth of text data carried by Internet applications and the pressing needs of business analytics make text mining increasingly important and urgent. Within it, text clustering — finding a reasonable cluster division of a given text collection without a training set or predefined categories — is one of the important research directions in the field of text mining.
With the large-scale development of Internet applications (microblogs, e-commerce, and search engines), efficiently mining the large-scale texts these applications generate has become a major challenge facing data mining research and applications. Distributed parallel computing offers powerful computing capability on large-scale data and is simple and convenient to implement, so distributed text mining, which introduces distributed parallel computing into text mining, has been a research hotspot in recent years. The rise of cloud computing provides more frameworks for distributed parallel computing. The MapReduce framework proposed by Google lets users improve computational efficiency by defining Map and Reduce tasks that distribute large-scale computation across multiple compute nodes; the emergence of the open-source, cloud-oriented Hadoop platform makes MapReduce-based distributed parallel computing models convenient to implement, and the Mahout library has been developed for machine learning and data mining algorithms.
Summary of the invention
To overcome the above shortcomings, the present invention aims to provide a MapReduce-based distributed parallel text clustering method for cloud computing environments. The method first proposes a text similarity computation method based on the vector space model; on this basis, it proposes a bisecting K-means clustering algorithm that optimizes cluster centroids with a single division pass; then, a parallel clustering method for the large-scale texts of cloud computing applications is designed on the MapReduce framework. Targeting large-scale text mining applications on cloud computing platforms, the present invention improves the efficiency of text clustering.
The present invention achieves the above purpose through the following technical scheme: a MapReduce-based distributed parallel text clustering method in a cloud computing environment, comprising the following steps:
(1) represent text features with the vector space model and, combined with the text similarity computation method, obtain a text similarity computation model;
(2) select the initial two sub-cluster centers with the text clustering algorithm based on searching for a "mutual minimum-similarity text pair", and complete a bisecting K-means clustering that optimizes cluster centroids with a single division pass, forming a text cluster division;
(3) based on the MapReduce framework, perform distributed parallel computation with three MapReduce tasks, responsible respectively for searching for "mutual minimum-similarity text pairs", assigning texts to two clusters, and the final K-means text clustering, until the cluster division no longer changes, then output the clustering result.
Preferably, the text similarity computation method is as follows: given texts d_i, d_j, TA(d_i, d_j) = {ta_1, ta_2, ..., ta_t, ..., ta_h} denotes the union of the feature words they contain, with h the number of feature words in the union; TS(d_i, d_j) = {ts_1, ts_2, ..., ts_k, ..., ts_l} denotes the intersection of the feature words they contain, with l the number of feature words in the intersection. The similarity sim(d_i, d_j, ts_k) of d_i and d_j on each feature word ts_k in TS is then defined per word, and the similarity of texts d_i and d_j is defined as

SIM(d_i, d_j) = (Σ_{k=1}^{l} sim(d_i, d_j, ts_k)) / h
Preferably, the method of representing text features with the vector space model is: given a text collection D = {d_1, d_2, ..., d_i, ..., d_N}, where d_i denotes each text vector, each text is represented in the vector space model as

d_i = (<t_1, w_i1>, <t_2, w_i2>, ..., <t_j, w_ij>, ..., <t_m, w_im>)

where T = {t_1, t_2, ..., t_j, ..., t_m} denotes the set of all feature words contained in all texts of the collection, and W_i = {w_i1, w_i2, ..., w_ij, ..., w_im} denotes the weight vector of text d_i over all feature words, computed by the TF-IDF method as follows:

w_ij = tf_ij × idf_j = (n_ij / n_i) × log(N / N_j)

where tf_ij is the frequency of feature word t_j in text d_i, n_ij is the number of occurrences of t_j in d_i, and n_i is the total number of occurrences of all feature words contained in d_i; idf_j is the inverse document frequency of t_j over the whole collection, measuring the scope of its occurrence; N is the total number of texts in the collection, and N_j denotes the number of distinct texts containing t_j.
Preferably, the "mutual minimum-similarity text pair" is defined as follows: given a text cluster C, a "mutual minimum-similarity text pair" consists of two texts d_i, d_j in C satisfying the following condition: d_i is the text in the cluster with the minimum similarity to d_j, while d_j is the text in the cluster with the minimum similarity to d_i.
Preferably, the method for the Text Clustering Algorithm based on " each other minimum similarity degree text to " search is as follows:
Input:Text collection D={ d1,d2,…,di,…,dN};
Parameter:The number of clusters K of cluster;
Output:Text collection D cluster divides S={ S1,S2,…,Sk,…,SK};
(1) initialized, regard the set D that all texts are constituted as initial cluster:S={ S0, S0←D;
(2) the minimum cluster S of text cluster similarity side MS are selected from SmIt is used as cluster to be divided;
(3) the initial two sub-clusterings center for finding cluster Sm divide using " each other minimum similarity degree text to " searching algorithm is literary
This is to dx, dy;
(4) by all text S of cluster to be sortedm={ dm1, dm2..., dmi..., dmnDistributed according to the maximum principle of similarity
To cluster SxAnd SyIn, it is shown below:
By SxAnd SyIt is added to cluster to divide in S, and by SmDeleted from S;
(5) if S Chinese version clusters number is less than K, return and perform step (2);If S Chinese version clusters number is equal to K, step is performed
(6);
(6) barycenter of K using in S cluster obtains text to all texts as initial cluster center using sphere K-means clusters
Cluster divides S, wherein, Text similarity computing method is used in cluster process.
Preferably, the text cluster similarity square MS is defined as follows: for a text cluster C of n_C texts, the cluster similarity square MS(C) is defined as the average of the squared similarities between all texts and the cluster centroid:

MS(C) = (1/n_C) Σ_{i=1}^{n_C} SIM(d_i, d_e)²
Preferably, the "mutual minimum-similarity text pair" search algorithm is as follows:
Input: text cluster C, with n_C the number of texts in C; Output: "mutual minimum-similarity text pair" d_x, d_y;
(i) randomly select a text d_i in cluster C and assign it to d_x: d_x ← d_i;
(ii) search cluster C for the text d_y with minimum similarity to d_x;
(iii) search cluster C for the text d_k with minimum similarity to d_y;
(iv) judge the following two conditions:
(a) if d_k = d_x or SIM(d_x, d_y) = SIM(d_k, d_y), the algorithm terminates and outputs d_x, d_y as the "mutual minimum-similarity text pair", i.e. the initial cluster centers of C;
(b) if d_k ≠ d_x and SIM(d_x, d_y) ≠ SIM(d_k, d_y), assign d_x ← d_y, d_y ← d_k, and jump back to step (iii) to search again.
The beneficial effects of the present invention are: targeting large-scale text mining applications on cloud computing platforms, the present invention improves the efficiency of text clustering.
Brief description of the drawings
Fig. 1 is a schematic diagram of the main flow of the clustering method of the present invention.
Embodiment
The present invention is described further below with reference to a specific embodiment, but the protection scope of the present invention is not limited thereto.
Embodiment: as shown in Fig. 1, a MapReduce-based distributed parallel text clustering method in a cloud computing environment comprises the following steps:
(1) represent text features with the vector space model and, combined with the text similarity computation method, obtain a text similarity computation model;
Definition 1 (vector space model representation of text features): given a text collection D = {d_1, d_2, ..., d_i, ..., d_N}, where d_i denotes each text vector, each text is represented in the vector space model as d_i = (<t_1, w_i1>, <t_2, w_i2>, ..., <t_j, w_ij>, ..., <t_m, w_im>). Here T = {t_1, t_2, ..., t_j, ..., t_m} denotes the set of all feature words contained in all texts of the collection, and W_i = {w_i1, w_i2, ..., w_ij, ..., w_im} denotes the weight vector of text d_i over all feature words, computed by the TF-IDF method:

w_ij = tf_ij × idf_j = (n_ij / n_i) × log(N / N_j)

In the formula, tf_ij is the frequency with which feature word t_j appears in text d_i, n_ij is the number of occurrences of t_j in d_i, and n_i is the total number of occurrences of all feature words contained in d_i; idf_j is the inverse document frequency of t_j over the whole collection, measuring the scope of its occurrence; N is the total number of texts in the collection, and N_j denotes the number of distinct texts containing t_j. Obviously, the more frequently a feature word appears in a given document, the stronger its ability to distinguish the content attributes of that text (TF); the wider its scope of occurrence across the collection, the weaker its ability to distinguish text content (IDF).
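The TF-IDF weighting of definition 1 can be sketched as a short routine. This is a minimal sketch, assuming texts arrive as lists of feature words; the dict-based sparse vectors are an illustrative choice, not part of the patent:

```python
import math

def tfidf_vectors(docs):
    """Compute sparse TF-IDF weight vectors W_i for tokenized texts.

    docs: list of texts, each a list of its feature words.
    Returns one dict {feature word t_j: weight w_ij} per text."""
    N = len(docs)
    df = {}                      # N_j: number of distinct texts containing t_j
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for doc in docs:
        n_i = len(doc)           # total feature-word occurrences in d_i
        counts = {}              # n_ij: occurrences of t_j in d_i
        for t in doc:
            counts[t] = counts.get(t, 0) + 1
        # w_ij = tf_ij * idf_j = (n_ij / n_i) * log(N / N_j)
        vectors.append({t: (n_ij / n_i) * math.log(N / df[t])
                        for t, n_ij in counts.items()})
    return vectors
```

A word appearing in every text gets idf = log(N/N) = 0 and thus weight 0, matching the IDF intuition above.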
Definition 2 (text similarity): given texts d_i, d_j, TA(d_i, d_j) = {ta_1, ta_2, ..., ta_t, ..., ta_h} denotes the union of the feature words they contain, with h the number of feature words in the union; TS(d_i, d_j) = {ts_1, ts_2, ..., ts_k, ..., ts_l} denotes the intersection of the feature words they contain, with l the number of feature words in the intersection. The similarity sim(d_i, d_j, ts_k) of d_i and d_j on each feature word ts_k in TS is defined per word, and the similarity of d_i and d_j is defined as

SIM(d_i, d_j) = (Σ_{k=1}^{l} sim(d_i, d_j, ts_k)) / h

That is, the similarity of two texts is the ratio of the sum of their per-word similarities over all common feature words to the number of all feature words the two texts contain. Compared with the classical cosine similarity (W_i·W_j / (|W_i| * |W_j|)), formula (3) likewise builds its numerator from the common feature words of the two texts, and its denominator likewise accounts for the remaining feature words of each text beyond the common ones. The difference is that formula (3) proposed by the present invention computes the similarity of each common feature word exactly and separately, rather than computing the overall similarity directly through the vector inner product as the cosine of the included angle does.
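Definition 2 can be sketched as follows. The per-word formula sim(d_i, d_j, ts_k) is not reproduced above, so the min/max weight ratio used here is an assumed placeholder with the same [0, 1] range; the TA/TS/h bookkeeping follows the definition:

```python
def word_sim(w_ik, w_jk):
    # Per-word similarity sim(d_i, d_j, ts_k): the exact formula is not
    # reproduced above, so the min/max weight ratio here is only an
    # ASSUMED placeholder with the same [0, 1] range.
    return min(w_ik, w_jk) / max(w_ik, w_jk)

def text_sim(vi, vj):
    """SIM(d_i, d_j): per-word similarities summed over the intersection
    TS, divided by h = |TA|, the size of the feature-word union."""
    ta = set(vi) | set(vj)       # TA: union of feature words, h = len(ta)
    ts = set(vi) & set(vj)       # TS: common feature words
    if not ta:
        return 0.0
    return sum(word_sim(vi[t], vj[t]) for t in ts) / len(ta)
```

Identical vectors score 1.0; texts with no common feature words score 0.0, since TS is empty while TA is not.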
(2) select the initial two sub-cluster centers with the text clustering algorithm based on searching for a "mutual minimum-similarity text pair", and complete a bisecting K-means clustering that optimizes cluster centroids with a single division pass, forming a text cluster division;
Definition 3 (text cluster similarity square): for a text cluster C containing n_C texts, the cluster similarity square MS(C) is defined as the average of the squared similarities between all texts and the cluster centroid:

MS(C) = (1/n_C) Σ_{i=1}^{n_C} SIM(d_i, d_e)²

where d_e is the feature vector of the cluster centroid, i.e. d_e = (<t_1, w_e1>, <t_2, w_e2>, ..., <t_j, w_ej>, ..., <t_m, w_em>).
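Definition 3 can be sketched as follows. The component-wise mean centroid is an assumed reading of d_e (its weight formula is not reproduced above), and `sim` is passed in as a parameter for illustration:

```python
def centroid(vectors):
    """d_e: component-wise mean of the member vectors (an assumed, common
    reading of the cluster centroid's feature vector)."""
    terms = set().union(*[set(v) for v in vectors])
    n = len(vectors)
    return {t: sum(v.get(t, 0.0) for v in vectors) / n for t in terms}

def cluster_ms(vectors, sim):
    """MS(C): mean of the squared similarities SIM(d_i, d_e)^2 between
    each member text and the cluster centroid d_e."""
    d_e = centroid(vectors)
    return sum(sim(v, d_e) ** 2 for v in vectors) / len(vectors)
```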
In an embodiment of the present invention, the original bisecting K-means method, after selecting a cluster to split, follows the K-means idea of randomly selecting initial cluster centers, bisecting the cluster, and iterating repeatedly in search of the optimal division. Here, a "mutual minimum-similarity text pair" of a text cluster C is defined as two texts d_i, d_j in C satisfying the following condition: d_i is the text in the cluster with minimum similarity to d_j, while d_j is the text in the cluster with minimum similarity to d_i.
The present invention proposes to determine the initial two sub-cluster centers from a "mutual minimum-similarity text pair" found by search. Clearly, a text cluster may contain more than one "mutual minimum-similarity text pair" satisfying formula (6). A greedy search algorithm for a cluster's "mutual minimum-similarity text pair" is therefore given as follows.
Preparation algorithm 1: "mutual minimum-similarity text pair" search algorithm.
Input: text cluster C, with n_C the number of texts in C.
Output: "mutual minimum-similarity text pair" d_x, d_y.
The algorithm steps are as follows:
Step 1: randomly select a text d_i in cluster C and assign it to d_x: d_x ← d_i.
Step 2: search cluster C for the text d_y with minimum similarity to d_x.
Step 3: search cluster C for the text d_k with minimum similarity to d_y.
Step 4: judge the following two conditions:
4.1 if d_k = d_x or SIM(d_x, d_y) = SIM(d_k, d_y), the algorithm terminates and outputs d_x, d_y as the "mutual minimum-similarity text pair", i.e. the initial cluster centers of C;
4.2 if d_k ≠ d_x and SIM(d_x, d_y) ≠ SIM(d_k, d_y), assign d_x ← d_y, d_y ← d_k, and return to step 3 to search again.
Step 5: end.
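Preparation algorithm 1 can be sketched as follows. Texts are abstract objects here and `sim` is any similarity function; the random start mirrors step 1:

```python
import random

def find_mutual_min_pair(cluster, sim):
    """Greedy search for a 'mutual minimum-similarity text pair' (d_x, d_y)
    in a cluster, following preparation algorithm 1.

    cluster: list of texts (any objects); sim(u, v): their similarity."""
    d_x = random.choice(cluster)                        # step 1
    d_y = min((d for d in cluster if d is not d_x),
              key=lambda d: sim(d_x, d))                # step 2
    while True:
        d_k = min((d for d in cluster if d is not d_y),
                  key=lambda d: sim(d_y, d))            # step 3
        if d_k is d_x or sim(d_x, d_y) == sim(d_k, d_y):
            return d_x, d_y                             # step 4.1: mutual pair
        d_x, d_y = d_y, d_k                             # step 4.2: advance, retry
```

Because each swap strictly decreases (or leaves equal, triggering termination) the current pair similarity, the loop terminates on any finite cluster.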
In an embodiment of the present invention, the text clustering algorithm based on "mutual minimum-similarity text pair" search combines the proposed initial cluster center selection method with the idea of bisecting K-means. Its steps are given as follows:
Preparation algorithm 2: text clustering algorithm based on "mutual minimum-similarity text pair" search.
Input: text collection D = {d_1, d_2, ..., d_i, ..., d_N}.
Parameter: number of clusters K.
Output: cluster division S = {S_1, S_2, ..., S_k, ..., S_K} of D.
The algorithm steps are as follows:
Step 1: initialize, taking the set D of all texts as the initial cluster: S = {S_0}, S_0 ← D.
Step 2: select from S, according to formula (4), the cluster S_m with the minimum text cluster similarity square MS as the cluster to be divided.
Step 3: find the initial two sub-cluster centers of S_m, the text pair d_x, d_y, with the proposed preparation algorithm 1.
Step 4: assign all texts of the cluster to be divided, S_m = {d_m1, d_m2, ..., d_mi, ..., d_mn}, to clusters S_x and S_y by the maximum-similarity principle; add S_x and S_y to the cluster division S, and delete S_m from S.
Step 5: if the number of text clusters in S is less than K, return to step 2; if it equals K, proceed to step 6.
Step 6: with the centroids of the K clusters in S as initial cluster centers, cluster all texts with spherical K-means to obtain the text cluster division S, using the text similarity computation method of definition 2 in the clustering process.
Step 7: end.
As the steps of preparation algorithm 2 show, the text clustering algorithm proposed by the present invention distributes all objects once (step 4) after searching for the initial two sub-cluster centers to obtain the cluster division, without the repeated iterative optimization of the original bisecting K-means algorithm.
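Steps 1 to 5 of preparation algorithm 2 can be sketched as follows (the final spherical K-means refinement of step 6 is omitted). `find_pair` and `cluster_ms` stand in for preparation algorithm 1 and definition 3 and are passed in as parameters for illustration:

```python
def bisecting_cluster(texts, K, sim, find_pair, cluster_ms):
    """Sketch of preparation algorithm 2, steps 1-5: repeatedly split the
    cluster with the smallest similarity square MS into two with a single
    assignment pass, until K clusters remain."""
    S = [list(texts)]                             # step 1: one initial cluster
    while len(S) < K:                             # step 5: loop until K clusters
        S.sort(key=lambda c: cluster_ms(c, sim))  # step 2: smallest MS first
        S_m = S.pop(0)                            # cluster to be divided
        d_x, d_y = find_pair(S_m, sim)            # step 3: initial bi-centers
        S_x, S_y = [], []
        for d in S_m:                             # step 4: one assignment pass
            (S_x if sim(d, d_x) >= sim(d, d_y) else S_y).append(d)
        S += [S_x, S_y]
    return S
```

Note there is no inner iteration: each split is a single pass, matching the remark above about avoiding repeated iterative optimization.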
(3) Based on the MapReduce framework, distributed parallel computation is performed with three MapReduce tasks, responsible respectively for searching for "mutual minimum-similarity text pairs", assigning texts to two clusters, and the final K-means text clustering, until the cluster division no longer changes, whereupon the clustering result is output.
Although the improved algorithm can raise clustering efficiency compared with the original bisecting clustering, the efficiency gained from improving the clustering algorithm itself falls far short of the needs of large-scale text clustering in practical cloud computing applications. Parallelizing the text clustering process with the MapReduce framework in a cloud computing environment therefore further and substantially improves text clustering efficiency.
In an embodiment of the present invention, three MapReduce tasks are designed to perform distributed parallel computation during the text clustering process, responsible respectively for searching for "mutual minimum-similarity text pairs", assigning texts to two clusters, and the final K-means text clustering.
Specifically, step one: find the initial cluster centers, the "mutual minimum-similarity text pair", according to preparation algorithm 1. Map selects a text d_x, computes according to definition 2 the similarity of d_x to the remaining texts of the selected cluster S_m, searches for the text d_y with minimum similarity to d_x, and then for the text d_k with minimum similarity to d_y; Reduce assigns d_y to d_x and d_k to d_y, and Map is re-used to search for the minimum-similarity text of d_y, until the "mutual minimum-similarity text pair" <d_x, d_y> is found. The MapReduce process can be expressed as:

Map: <d_x, List<t_j, w_xj>>, <S_m, List<d_mi>> → <d_x & d_y, SIM(d_x, d_y)>
Repeat
  Map: <d_y, List<t_j, w_yj>>, <S_m, List<d_mi>> → <d_y & d_k, SIM(d_k, d_y)>
  Reduce: d_x ← d_y, d_y ← d_k
Until d_k = d_x or SIM(d_k, d_y) = SIM(d_x, d_y)
Step two: according to steps 1 to 3 of preparation algorithm 2, distribute all texts of the cluster to be divided into two clusters. Map, given the searched initial cluster centers d_x, d_y, computes according to definition 2 the similarity of every text in cluster S_m to the centers d_x, d_y, and assigns each text by the maximum-similarity principle to one of the two clusters S_x and S_y, emitting <S_k, List<d_i, List<t_j, w_ij>>> (S_k = S_x or S_y). Reduce computes, according to definition 3, the centroid vector d_ek of each of the two clusters and the text cluster similarity square MS(S_k), i.e. <S_k, d_ek, MS(S_k)>. The MapReduce process can be expressed as:

Map: <S_m, List<d_mi>, d_x, d_y> → <S_k, List<d_i>>
Reduce: <S_k, List<d_i>> → <S_k, d_ek, MS(S_k)>

The above two MapReduce tasks are repeated until the number of clusters reaches the specified K.
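A single-process sketch of the second MapReduce task's map and reduce functions is given below. The key names and dict-based sparse vectors are illustrative assumptions; on Hadoop these would be Mapper/Reducer implementations fed through the shuffle:

```python
def map_assign(text_id, vector, centers, sim):
    """Map of the second task: emit (cluster_key, (text_id, vector)) for
    whichever candidate center d_x or d_y is most similar to the text."""
    k = max(centers, key=lambda c: sim(vector, centers[c]))
    return k, (text_id, vector)

def reduce_assign(cluster_key, members, sim):
    """Reduce of the second task: recompute the centroid d_ek and the
    cluster similarity square MS(S_k) for the texts assigned to this key."""
    vecs = [v for _, v in members]
    terms = set().union(*[set(v) for v in vecs])
    d_ek = {t: sum(v.get(t, 0.0) for v in vecs) / len(vecs) for t in terms}
    ms = sum(sim(v, d_ek) ** 2 for v in vecs) / len(vecs)
    return cluster_key, d_ek, ms
```

The map side only needs the two centers and one text at a time, so the assignment pass parallelizes trivially across splits of S_m.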
Step three: according to steps 4 to 7 of preparation algorithm 2, perform K-means clustering from the centroids of the K clusters. Map reads in the whole text collection D and the centroid vectors d_ek of the K clusters and performs K-means clustering, forming the text cluster division <S_k, List<d_ki>>; the process resembles the text assignment of the previous MapReduce task:

Repeat:
  Map: D, List<d_ek> → <S_k, List<d_ki>>
  Reduce: <S_k, List<d_i>> → <S_k, d_ek, MS(S_k)>
Until the cluster division no longer changes.
The above is a specific embodiment of the present invention and the technical principles it employs. Any change conceived under the present invention, so long as its function does not depart from the spirit covered by the specification and drawings, shall fall within the protection scope of the present invention.
Claims (7)
1. A MapReduce-based distributed parallel text clustering method in a cloud computing environment, characterized by comprising the following steps:
(1) representing text features with the vector space model and, combined with the text similarity computation method, obtaining a text similarity computation model;
(2) selecting the initial two sub-cluster centers with the text clustering algorithm based on searching for a "mutual minimum-similarity text pair", and completing a bisecting K-means clustering that optimizes cluster centroids with a single division pass, forming a text cluster division;
(3) based on the MapReduce framework, performing distributed parallel computation with three MapReduce tasks, responsible respectively for searching for "mutual minimum-similarity text pairs", assigning texts to two clusters, and the final K-means text clustering, until the cluster division no longer changes, then outputting the clustering result.
2. The MapReduce-based distributed parallel text clustering method in a cloud computing environment according to claim 1, characterized in that the text similarity computation method is as follows: given texts d_i, d_j, TA(d_i, d_j) = {ta_1, ta_2, ..., ta_t, ..., ta_h} denotes the union of the feature words they contain, with h the number of feature words in the union; TS(d_i, d_j) = {ts_1, ts_2, ..., ts_k, ..., ts_l} denotes the intersection of the feature words they contain, with l the number of feature words in the intersection; the similarity sim(d_i, d_j, ts_k) of d_i and d_j on each feature word ts_k in TS is then defined per word, and the similarity of d_i and d_j is defined as

SIM(d_i, d_j) = (Σ_{k=1}^{l} sim(d_i, d_j, ts_k)) / h
3. The MapReduce-based distributed parallel text clustering method in a cloud computing environment according to claim 1, characterized in that the method of representing text features with the vector space model is: given a text collection D = {d_1, d_2, ..., d_i, ..., d_N}, where d_i denotes each text vector, each text is represented in the vector space model as

d_i = (<t_1, w_i1>, <t_2, w_i2>, ..., <t_j, w_ij>, ..., <t_m, w_im>)

where T = {t_1, t_2, ..., t_j, ..., t_m} denotes the set of all feature words contained in all texts of the collection, and W_i = {w_i1, w_i2, ..., w_ij, ..., w_im} denotes the weight vector of text d_i over all feature words, computed by the TF-IDF method as follows:

w_ij = tf_ij × idf_j = (n_ij / n_i) × log(N / N_j)

where tf_ij is the frequency of feature word t_j in text d_i, n_ij is the number of occurrences of t_j in d_i, and n_i is the total number of occurrences of all feature words contained in d_i; idf_j is the inverse document frequency of t_j over the whole collection, measuring the scope of its occurrence; N is the total number of texts in the collection, and N_j denotes the number of distinct texts containing t_j.
4. The MapReduce-based distributed parallel text clustering method in a cloud computing environment according to claim 2, characterized in that the "mutual minimum-similarity text pair" is defined as: given a text cluster C, a "mutual minimum-similarity text pair" consists of two texts d_i, d_j in C satisfying the following condition: d_i is the text in the cluster with the minimum similarity to d_j, while d_j is the text in the cluster with the minimum similarity to d_i.
5. The MapReduce-based distributed parallel text clustering method in a cloud computing environment according to claim 1, characterized in that the text clustering algorithm based on searching for a "mutual minimum-similarity text pair" is as follows:
Input: text collection D = {d_1, d_2, ..., d_i, ..., d_N};
Parameter: number of clusters K;
Output: cluster division S = {S_1, S_2, ..., S_k, ..., S_K} of D;
(1) initialize, taking the set D of all texts as the initial cluster: S = {S_0}, S_0 ← D;
(2) select from S the cluster S_m with the minimum text cluster similarity square MS as the cluster to be divided;
(3) find the initial two sub-cluster centers of S_m, the text pair d_x, d_y, with the "mutual minimum-similarity text pair" search algorithm;
(4) assign all texts of the cluster to be divided, S_m = {d_m1, d_m2, ..., d_mi, ..., d_mn}, to clusters S_x and S_y by the maximum-similarity principle, as shown in the formula below; add S_x and S_y to the cluster division S, and delete S_m from S;
(5) if the number of text clusters in S is less than K, return to step (2); if it equals K, proceed to step (6);
(6) with the centroids of the K clusters in S as initial cluster centers, cluster all texts with spherical K-means to obtain the text cluster division S, using the text similarity computation method in the clustering process.
6. The MapReduce-based distributed parallel text clustering method in a cloud computing environment according to claim 5, characterized in that the text cluster similarity square MS is defined as follows: for a text cluster C of n_C texts, the cluster similarity square MS(C) is defined as the average of the squared similarities between all texts and the cluster centroid:

MS(C) = (1/n_C) Σ_{i=1}^{n_C} SIM(d_i, d_e)²
7. The MapReduce-based distributed parallel text clustering method in a cloud computing environment according to claim 5, characterized in that the "mutual minimum-similarity text pair" search algorithm is as follows:
Input: text cluster C, with n_C the number of texts in C; Output: "mutual minimum-similarity text pair" d_x, d_y;
(i) randomly select a text d_i in cluster C and assign it to d_x: d_x ← d_i;
(ii) search cluster C for the text d_y with minimum similarity to d_x;
(iii) search cluster C for the text d_k with minimum similarity to d_y;
(iv) judge the following two conditions:
(a) if d_k = d_x or SIM(d_x, d_y) = SIM(d_k, d_y), the algorithm terminates and outputs d_x, d_y as the "mutual minimum-similarity text pair", i.e. the initial cluster centers of C;
(b) if d_k ≠ d_x and SIM(d_x, d_y) ≠ SIM(d_k, d_y), assign d_x ← d_y, d_y ← d_k, and jump back to step (iii) to search again.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710286671.2A CN106971005A (en) | 2017-04-27 | 2017-04-27 | Distributed parallel Text Clustering Method based on MapReduce under a kind of cloud computing environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106971005A true CN106971005A (en) | 2017-07-21 |
Family
ID=59332688
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710286671.2A Withdrawn CN106971005A (en) | 2017-04-27 | 2017-04-27 | Distributed parallel Text Clustering Method based on MapReduce under a kind of cloud computing environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106971005A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020042793A1 (en) * | 2000-08-23 | 2002-04-11 | Jun-Hyeog Choi | Method of order-ranking document clusters using entropy data and bayesian self-organizing feature maps |
CN105426426A (en) * | 2015-11-04 | 2016-03-23 | 北京工业大学 | KNN text classification method based on improved K-Medoids |
Non-Patent Citations (1)
Title |
---|
武森 等: "基于MapReduce的大规模文本聚类并行化", 《北京科技大学学报》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108052485A (en) * | 2017-12-15 | 2018-05-18 | 东软集团股份有限公司 | The distributed computing method and device of vector similarity, storage medium and node |
CN108052485B (en) * | 2017-12-15 | 2021-05-07 | 东软集团股份有限公司 | Distributed computing method and device for vector similarity, storage medium and node |
CN112463958A (en) * | 2020-09-29 | 2021-03-09 | 上海海事大学 | Method for rapidly clustering massive texts based on MapReduce framework |
CN112463958B (en) * | 2020-09-29 | 2022-07-15 | 上海海事大学 | Method for rapidly clustering massive texts based on MapReduce framework |
CN112784046A (en) * | 2021-01-20 | 2021-05-11 | 北京百度网讯科技有限公司 | Text clustering method, device and equipment and storage medium |
CN112784046B (en) * | 2021-01-20 | 2024-05-28 | 北京百度网讯科技有限公司 | Text clustering method, device, equipment and storage medium |
CN116503031A (en) * | 2023-06-29 | 2023-07-28 | 中国人民解放军国防科技大学 | Personnel similarity calculation method, device, equipment and medium based on resume analysis |
CN116503031B (en) * | 2023-06-29 | 2023-09-08 | 中国人民解放军国防科技大学 | Personnel similarity calculation method, device, equipment and medium based on resume analysis |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110674407B (en) | Hybrid recommendation method based on graph convolution neural network | |
CN104699772B (en) | A kind of big data file classification method based on cloud computing | |
Yin et al. | Incomplete multi-view clustering via subspace learning | |
CN103279556B (en) | Iteration Text Clustering Method based on self adaptation sub-space learning | |
CN107292186A (en) | A kind of model training method and device based on random forest | |
Eluri et al. | A comparative study of various clustering techniques on big data sets using Apache Mahout | |
CN105426529A (en) | Image retrieval method and system based on user search intention positioning | |
CN104699698A (en) | Graph query processing method based on massive data | |
Suganthi et al. | Instance selection and feature extraction using cuttlefish optimization algorithm and principal component analysis using decision tree | |
CN109784405A (en) | Cross-module state search method and system based on pseudo label study and semantic consistency | |
Zhou et al. | An effective ensemble pruning algorithm based on frequent patterns | |
CN106601235A (en) | Semi-supervision multitask characteristic selecting speech recognition method | |
CN106971005A (en) | Distributed parallel Text Clustering Method based on MapReduce under a kind of cloud computing environment | |
CN109739984A (en) | A kind of parallel KNN network public-opinion sorting algorithm of improvement based on Hadoop platform | |
CN106203508A (en) | A kind of image classification method based on Hadoop platform | |
Lee et al. | A hybrid system for imbalanced data mining | |
Zhang et al. | Fast exemplar-based clustering by gravity enrichment between data objects | |
CN113692591A (en) | Node disambiguation | |
Gupta et al. | Comparison of algorithms for document clustering | |
Mei et al. | Proximity-based k-partitions clustering with ranking for document categorization and analysis | |
Gabryel | A bag-of-features algorithm for applications using a NoSQL database | |
CN103324942B (en) | A kind of image classification method, Apparatus and system | |
Leger | Wmixnet: Software for clustering the nodes of binary and valued graphs using the stochastic block model | |
Park et al. | Multi-attributed graph matching with multi-layer random walks | |
CN107480199B (en) | Query reconstruction method, device, equipment and storage medium of database |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 2017-07-21 |