CN111597335A - K-means initial clustering center determination method for microblog comment text

Info

Publication number: CN111597335A
Application number: CN202010364885.9A
Authority: CN
Inventors: 翟智昆; 周成成; 许海涛; 周贤伟
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2020-04-30
Filing date: 2020-04-30
Publication date: 2020-08-28
Anticipated expiration: 2040-04-30
Also published as: CN111597335B

Abstract

The invention provides a method for determining K-means initial cluster centers for microblog comment text, which can quickly and accurately determine the optimal initial cluster centers of the microblog comment text. The method comprises the following steps: S1, selecting the comments that have received likes from the microblog comment vector set, adding them to a core comment cluster, and sorting the comments in the core comment cluster by number of likes in descending order; S2, selecting the first comment in the sorted core comment cluster as the center comment, and calculating the distance between the center comment and each comment in the core comment cluster; S3, deleting from the core comment cluster the comments whose distance from the center comment is less than a set distance threshold, and adding the center comment to the initial cluster center set; and S4, repeating S2 and S3 until the number of comments in the core comment cluster is 0, at which point the comments in the initial cluster center set are the final initial cluster centers of the microblog comment vector set. The invention relates to the technical field of network communication.

Description

K-means initial clustering center determination method for microblog comment text

Technical Field

The invention relates to the technical field of network communication, in particular to a method for determining a K-means initial clustering center for microblog comment texts.

Background

With the continuous development of information technology, the number of Internet users keeps growing. According to the 44th Statistical Report on Internet Development in China, released by the China Internet Network Information Center in August 2019, the number of Internet users in China reached 854 million by June 2019, an increase of 25.98 million over the end of 2018, and the Internet penetration rate reached 61.2%, an increase of 1.6 percentage points over the end of 2018. With the development of network communication technology and the growth in the number of Internet users, the Internet has become a major place for publishing, disseminating, and acquiring information in daily life. The volume of information generated and disseminated on the Internet every day is huge, and news and social events of all kinds appear frequently, so the volume of public opinion information generated on the Internet every day is also huge. This not only has a great influence on the virtual network but also has a certain impact on real life. Compared with traditional media, microblogs have a lower barrier to entry and spread information faster; in particular, during the spread of sudden events, microblogs become a main channel of message transmission. Microblogs have therefore become a main place where online public opinion is generated, and microblog public opinion deserves attention in public opinion monitoring.

K-means is the most classical and widely used partitional clustering algorithm and is often used to cluster online public opinion. However, the method has certain limitations. For example, the result depends on how the initial cluster centers are selected: if they are selected improperly (for example, an isolated point is selected), the final clustering result often falls into a local optimum.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for determining K-means initial cluster centers for microblog comment text, so as to solve the prior-art problem that improperly selected initial cluster centers cause the clustering result to fall easily into a local optimum.

In order to solve the technical problem, an embodiment of the present invention provides a method for determining a K-means initial clustering center for a microblog comment text, including:

S1, selecting the comments that have received likes from the microblog comment vector set, adding them to the core comment cluster, and sorting the comments in the core comment cluster by number of likes in descending order;

S2, selecting the first comment in the sorted core comment cluster as the center comment, and calculating the distance between the center comment and each comment in the core comment cluster;

S3, deleting from the core comment cluster the comments whose distance from the center comment is less than a set distance threshold, and adding the center comment to the initial cluster center set;

S4, repeating S2 and S3 until the number of comments in the core comment cluster is 0, at which point the comments in the initial cluster center set are the final initial cluster centers of the microblog comment vector set.

Further, before selecting the liked comments from the microblog comment vector set, adding them to the core comment cluster, and sorting the comments in the core comment cluster by number of likes in descending order, the method further comprises:

acquiring original microblog comment data reader, preprocessing the acquired original microblog comment data reader, and outputting a microblog comment set data;

vectorizing the microblog comment set data, and outputting a microblog comment vector set data_tfidf.

Further, acquiring the original microblog comment data reader, preprocessing the acquired original microblog comment data reader, and outputting the microblog comment set data comprises:

acquiring the original microblog comment data $\text{reader} = \{r_1, \ldots, r_i, \ldots, r_n\}$, $r_i = (p_i, \text{likes}_i)$, where n represents the number of original microblog comments acquired, $r_i$ represents the i-th comment in the original microblog comment data, $p_i$ represents the original comment text of $r_i$, and $\text{likes}_i$ represents the number of likes of $r_i$;

preprocessing the acquired original microblog comment data to obtain the microblog comment set $\text{data} = \{d_1, \ldots, d_i, \ldots, d_n\}$, $d_i = ((\text{word}_1, \ldots, \text{word}_j, \ldots, \text{word}_m), \text{likes}_i)$, where $d_i$ represents the preprocessed i-th comment, $\text{word}_j$ represents the j-th segmented word of $d_i$, $\text{likes}_i$ represents the number of likes of $d_i$, m represents the number of segmented words of $d_i$, and each segmented word in $d_i$ is a feature word;

wherein the preprocessing comprises: symbol processing, Chinese word segmentation, and stop-word removal.

Further, vectorizing the microblog comment set data and outputting the microblog comment vector set data_tfidf comprises:

determining all feature words in the microblog comment set data, the non-repeated feature words forming a set T;

determining the TF-IDF value of the i-th feature word $t_i$ in the set T, expressed as:

$\text{tfidf}(t_i) = \text{TF}(t_i) \times \text{IDF}(t_i)$

wherein $\text{tfidf}(t_i)$ represents the TF-IDF value of the i-th feature word $t_i$, TF-IDF denotes term frequency-inverse document frequency, $\text{TF}(t_i)$ represents the term frequency of $t_i$, and $\text{IDF}(t_i)$ represents the inverse document frequency of $t_i$;

sorting the feature words in the set T by TF-IDF value in descending order to generate a base vector $\text{jvsm} = (\text{t\_tfidf}_1, \ldots, \text{t\_tfidf}_i, \ldots, \text{t\_tfidf}_{jnt})$, $\text{t\_tfidf}_i = (t_i, \text{tfidf}(t_i))$, wherein $t_i$ represents the feature word of the i-th component $\text{t\_tfidf}_i$ of the base vector, $\text{tfidf}(t_i)$ represents the TF-IDF value of that component, and jnt represents the number of distinct feature words in the set T, which is also the dimension of jvsm;

deleting from the base vector jvsm the feature words whose $\text{tfidf}(t_i)$ is greater than the feature threshold FEA to obtain a standard feature vector $\text{vsm} = (\text{t\_tfidf}_1, \ldots, \text{t\_tfidf}_i, \ldots, \text{t\_tfidf}_{nt})$, wherein $\text{t\_tfidf}_i = (t_i, \text{tfidf}(t_i))$, $t_i$ represents the feature word of the i-th component $\text{t\_tfidf}_i$ of the standard feature vector, $\text{tfidf}(t_i)$ represents the TF-IDF value of that component, and nt represents the dimension of vsm;

vectorizing the microblog comment set data according to vsm to generate a vector space model data_vsm of the microblog comment set data, wherein $\text{data\_vsm} = \{\text{dv\_nolikes}_1, \ldots, \text{dv\_nolikes}_i, \ldots, \text{dv\_nolikes}_n\}$, $\text{dv\_nolikes}_i = (tv_1, \ldots, tv_j, \ldots, tv_{nt})$, $tv_j = (t_j, \text{tfidf}_j)$, $t_j$ in $tv_j$ represents the j-th feature word of the i-th comment vector $\text{dv\_nolikes}_i$ in the vector space model data_vsm, $\text{tfidf}_j$ in $tv_j$ represents the TF-IDF value of that j-th feature word, and nt represents the number of feature words of $\text{dv\_nolikes}_i$, that is, the dimension of $\text{dv\_nolikes}_i$;

adding the number of likes to each comment vector in the vector space model data_vsm to generate the microblog comment vector set data_tfidf of the microblog comment set data, wherein $\text{data\_tfidf} = \{dv_1, \ldots, dv_i, \ldots, dv_n\}$, $dv_i = ((tv_1, \ldots, tv_j, \ldots, tv_{nt}), \text{likes}_i)$, $tv_j = (t_j, \text{tfidf}_j)$, $t_j$ represents the j-th feature word of the i-th comment $dv_i$ in the microblog comment vector set data_tfidf, $\text{likes}_i$ represents the number of likes of $dv_i$, and nt represents the number of feature words of $dv_i$.
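The chain from jvsm to data_tfidf described above can be sketched roughly as follows. This is only an illustrative sketch: the function names `build_vsm` and `vectorize` are hypothetical, and because the per-comment TF-IDF formula is not reproduced in this text, the per-comment weight is supplied by the caller through the hypothetical `weight` callable.

```python
def build_vsm(tfidf_scores, fea):
    """Build (jvsm, vsm) from a dict mapping feature word -> tfidf(t_i).

    Hypothetical helper: only the sort-then-threshold order follows the text.
    """
    # Base vector jvsm: (word, tfidf) pairs sorted by TF-IDF value, largest first.
    jvsm = sorted(tfidf_scores.items(), key=lambda item: item[1], reverse=True)
    # Standard feature vector vsm: as the text states, words whose TF-IDF value
    # is greater than the threshold FEA are deleted.
    vsm = [(t, value) for t, value in jvsm if value <= fea]
    return jvsm, vsm


def vectorize(data, vsm, weight):
    """Turn data = [((word_1, ..., word_m), likes_i), ...] into (data_vsm, data_tfidf).

    `weight(words, t)` is an assumed, caller-supplied per-comment weight for word t.
    """
    data_vsm, data_tfidf = [], []
    for words, likes in data:
        # One (t_j, tfidf_j) pair per feature word of vsm, in vsm order.
        dv_nolikes = tuple((t, weight(words, t)) for t, _ in vsm)
        data_vsm.append(dv_nolikes)
        # data_tfidf carries the same vector plus the like count.
        data_tfidf.append((dv_nolikes, likes))
    return data_vsm, data_tfidf
```

For instance, `vectorize(data, vsm, lambda words, t: words.count(t))` would use raw in-comment counts as a stand-in weight; any real weighting would replace that lambda.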

Further, the term frequency $\text{TF}(t_i)$ of the i-th feature word $t_i$ is expressed as:

wherein words represents the total number of feature words in the set T_allwords formed by all feature words, and $\text{word}(t_i)$ represents the number of occurrences of the i-th feature word $t_i$ of the set T of non-repeated feature words.
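The formula itself is not reproduced in this text. Going by the two quantities defined above, a plausible reading is the ratio word(t_i) / words; the sketch below computes term frequencies under that assumption, with the hypothetical function name `term_frequencies`.

```python
from collections import Counter


def term_frequencies(data):
    """Return {t_i: TF(t_i)} for data = [((word_1, ..., word_m), likes_i), ...].

    Assumption (not confirmed by the source): TF(t_i) = word(t_i) / words.
    """
    # T_allwords: every feature word of every comment, repetitions included.
    t_allwords = [w for words, _likes in data for w in words]
    words_total = len(t_allwords)   # "words" in the text
    counts = Counter(t_allwords)    # word(t_i) for each distinct t_i in T
    return {t: counts[t] / words_total for t in counts}
```

With two comments `("a", "b")` and `("a",)`, T_allwords has three entries, so under this reading TF("a") would be 2/3 and TF("b") would be 1/3.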

Further, the TF-IDF value of the j-th feature word of the i-th comment vector $\text{dv\_nolikes}_i$ in the vector space model data_vsm is:

Further, the distance in S2 is expressed as:

wherein D represents the distance, $\text{core\_clusters}_i = (\text{cc\_tfidf}_1, \ldots, \text{cc\_tfidf}_j, \ldots, \text{cc\_tfidf}_{nt})$, and $\text{cc\_tfidf}_j$ represents the TF-IDF value of the j-th feature word of the i-th comment $\text{core\_clusters}_i$ in the core comment cluster;

$\text{comment} = (\text{c\_tfidf}_1, \ldots, \text{c\_tfidf}_j, \ldots, \text{c\_tfidf}_{nt})$, where $\text{c\_tfidf}_j$ represents the TF-IDF value of the j-th feature word of the center comment, and nt represents the dimension of the comment.
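The distance formula is not reproduced in this text. Purely as an illustration, and as an assumption not confirmed by the source, a Euclidean distance over the two nt-dimensional vectors defined above would read:

```latex
D\left(\text{core\_clusters}_i,\ \text{comment}\right)
  = \sqrt{\sum_{j=1}^{nt} \left(\text{cc\_tfidf}_j - \text{c\_tfidf}_j\right)^{2}}
```

Any other distance over the same two vectors (for example, a cosine-based one) would fit the surrounding definitions equally well; only the threshold comparison in S3 depends on the choice.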

Further, after S4, the method further includes:

clustering the microblog comment vector set with the K-means clustering algorithm, starting from the finally obtained initial cluster centers, to generate a clustering result clusters and a cluster center set clusters_centers;

determining whether the cluster centers changed between two successive clustering passes; if so, running the K-means clustering algorithm again from the clustering result clusters and the cluster center set clusters_centers until the cluster centers no longer change, and then outputting the clustering result clusters.

The technical scheme of the invention has the following beneficial effects:

In this scheme, the comments that have received likes are selected from the microblog comment vector set and added to the core comment cluster, and the comments in the core comment cluster are sorted by number of likes in descending order; the first comment in the sorted core comment cluster is selected as the center comment, and the distance between the center comment and each comment in the core comment cluster is calculated; the comments in the core comment cluster whose distance from the center comment is less than a set distance threshold are deleted, and the center comment is added to the initial cluster center set; the distance calculation, deletion, and addition are repeated until the number of comments in the core comment cluster is 0, at which point the comments in the initial cluster center set are the final initial cluster centers of the microblog comment vector set. In this way, the optimal initial cluster centers of the microblog comment text can be determined quickly and accurately, and the problem of the clustering result falling into a local optimum because of improperly selected initial cluster centers is avoided.

Drawings

Fig. 1 is a schematic flow chart of a method for determining a K-means initial clustering center for microblog comment texts according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of microblog data preprocessing according to an embodiment of the present invention;

Fig. 3 is a schematic flow chart of microblog data vectorization according to an embodiment of the present invention;

Fig. 4 is a detailed flow chart of the method for determining K-means initial cluster centers for microblog comment text according to an embodiment of the present invention;

Fig. 5 is a flow chart of the K-means clustering algorithm for microblog comment text according to an embodiment of the present invention.

Detailed Description

To make the technical problems to be solved, the technical solutions, and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.

The invention provides a method for determining K-means initial cluster centers for microblog comment text, addressing the existing problem that improperly selected initial cluster centers cause the clustering result to fall easily into a local optimum.

As shown in Fig. 1, the method for determining K-means initial cluster centers for microblog comment text according to the embodiment of the present invention comprises steps S1 to S4 described above.

In the method according to the embodiment of the invention, the comments that have received likes are selected from the microblog comment vector set and added to the core comment cluster, and the comments in the core comment cluster are sorted by number of likes in descending order; the first comment in the sorted core comment cluster is selected as the center comment, and the distance between the center comment and each comment in the core comment cluster is calculated; the comments in the core comment cluster whose distance from the center comment is less than a set distance threshold are deleted, and the center comment is added to the initial cluster center set; the distance calculation, deletion, and addition are repeated until the number of comments in the core comment cluster is 0, at which point the comments in the initial cluster center set are the final initial cluster centers of the microblog comment vector set. In this way, the optimal initial cluster centers of the microblog comment text can be determined quickly and accurately, and the problem of the clustering result falling into a local optimum because of improperly selected initial cluster centers is avoided.

In a specific implementation of the foregoing method, further, before the liked comments are selected from the microblog comment vector set, added to the core comment cluster, and sorted by number of likes in descending order, the method further comprises acquiring and preprocessing the original microblog comment data and vectorizing the resulting microblog comment set, as described above.

In a specific implementation of the foregoing method, further, acquiring the original microblog comment data reader, preprocessing the acquired original microblog comment data reader, and outputting the microblog comment set data comprises:

preprocessing the acquired original microblog comment data to obtain the microblog comment set $\text{data} = \{d_1, \ldots, d_i, \ldots, d_n\}$, $d_i = ((\text{word}_1, \ldots, \text{word}_j, \ldots, \text{word}_m), \text{likes}_i)$, where $d_i$ represents the preprocessed i-th comment, $\text{word}_j$ represents the j-th segmented word of $d_i$, $\text{likes}_i$ represents the number of likes of $d_i$, m represents the number of segmented words of $d_i$, and each segmented word in $d_i$ is a feature word;

wherein the preprocessing comprises: symbol processing, Chinese word segmentation, and stop-word removal, as shown in Fig. 2.

In this embodiment, symbol processing refers to removing information that is meaningless for text clustering, such as "@username" or "#topic#" in the microblog comment information.

In this embodiment, Chinese word segmentation divides the text information into separate words. The microblog comment information may be segmented using word segmentation software or a word segmentation component of a programming language.

In this embodiment, stop words are words that have grammatical meaning in the complete context but no practical value in the processing of text information, such as some modal particles, conjunctions, and auxiliary words ("reduce", "and", "you"). Stop words are removed from the segmented microblog comment information by means of a preset stop-word list.
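The three preprocessing stages can be sketched as below. This is a minimal sketch under stated assumptions: the names `symbol_processing`, `preprocess`, and `STOP_WORDS` are hypothetical, the stop-word list is a toy example, mentions are assumed to end at whitespace, and the default segmenter is plain whitespace splitting standing in for a real Chinese word segmenter, which the caller would pass in as `segment`.

```python
import re

# Toy stop-word list for illustration only; a real list would be loaded from a file.
STOP_WORDS = {"and", "the", "of"}


def symbol_processing(text):
    """Remove '#topic#' tags and '@username' mentions from a comment."""
    text = re.sub(r"#[^#]*#", " ", text)  # '#topic#'
    text = re.sub(r"@\S+", " ", text)     # '@username', assumed to end at whitespace
    return text.strip()


def preprocess(reader, segment=str.split, stop_words=STOP_WORDS):
    """Map reader = [(p_i, likes_i), ...] to data = [((word_1, ..., word_m), likes_i), ...]."""
    data = []
    for text, likes in reader:
        # Stage 1: symbol processing; stage 2: word segmentation.
        words = segment(symbol_processing(text))
        # Stage 3: stop-word removal against the preset list.
        words = tuple(w for w in words if w not in stop_words)
        data.append((words, likes))
    return data
```

The like count is carried through untouched so that later steps can still sort comments by it.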

In this embodiment, each segmented word $\text{word}_j$ in the microblog comment set data is a feature word. For example, taking the segmented words of the preprocessed i-th comment $d_i$ as feature words, the set T formed by the non-repeated feature words is {article, patent, microblog}, and all the feature words (including repetitions) form the set T_allwords = {article, patent, microblog}.

In a specific implementation of the foregoing method, further, as shown in Fig. 3, vectorizing the microblog comment set data and outputting the microblog comment vector set data_tfidf comprises:

determining the TF-IDF value of the i-th feature word $t_i$ in the set T, expressed as:

$\text{tfidf}(t_i) = \text{TF}(t_i) \times \text{IDF}(t_i)$

vectorizing the microblog comment set data according to vsm to generate a vector space model data_vsm of the microblog comment set data, wherein $\text{data\_vsm} = \{\text{dv\_nolikes}_1, \ldots, \text{dv\_nolikes}_i, \ldots, \text{dv\_nolikes}_n\}$, $\text{dv\_nolikes}_i = (tv_1, \ldots, tv_j, \ldots, tv_{nt})$, $tv_j = (t_j, \text{tfidf}_j)$, $t_j$ in $tv_j$ represents the j-th feature word of the i-th comment vector $\text{dv\_nolikes}_i$ in the vector space model data_vsm, $\text{tfidf}_j$ in $tv_j$ represents the TF-IDF value of that j-th feature word, and nt represents the number of feature words of $\text{dv\_nolikes}_i$, that is, the dimension of $\text{dv\_nolikes}_i$;

In this embodiment, vectorizing the microblog comment set data reduces the dimensionality of the feature words in the microblog comment vector set, which improves later clustering accuracy and avoids inaccurate initial cluster centers caused by a very sparse vector space model data_vsm generated from the microblog comment set data.

In a specific implementation of the foregoing method, further, the term frequency $\text{TF}(t_i)$ of the i-th feature word $t_i$ is expressed as:

wherein words represents the total number of feature words in the set T_allwords formed by all feature words (including repetitions), and $\text{word}(t_i)$ represents the number of occurrences of the i-th feature word $t_i$ of the set T of non-repeated feature words.

In a specific implementation of the foregoing method, further, the TF-IDF value of the j-th feature word of the i-th comment vector $\text{dv\_nolikes}_i$ in the vector space model data_vsm is:

As shown in Fig. 4, the method for determining K-means initial cluster centers for microblog comment text may specifically comprise the following steps:

C1, selecting the comments whose number of likes is greater than 0 from the microblog comment vector set data_tfidf, adding them to the core comment cluster core_clusters, and then sorting the comments in core_clusters by number of likes in descending order;

C2, selecting the first comment core_clusters[0] in core_clusters as the center comment comment, and calculating the distance D between comment and each comment in the core comment cluster, wherein the distance D is expressed as:

wherein $\text{core\_clusters}_i = (\text{cc\_tfidf}_1, \ldots, \text{cc\_tfidf}_j, \ldots, \text{cc\_tfidf}_{nt})$, and $\text{cc\_tfidf}_j$ represents the TF-IDF value of the j-th feature word of the i-th comment $\text{core\_clusters}_i$ in the core comment cluster; $\text{comment} = (\text{c\_tfidf}_1, \ldots, \text{c\_tfidf}_j, \ldots, \text{c\_tfidf}_{nt})$, where $\text{c\_tfidf}_j$ represents the TF-IDF value of the j-th feature word of the center comment; and nt represents the dimension of the comment;

C3, deleting from the sorted core_clusters the comments whose distance D from comment is less than the set distance threshold K, and adding comment to the initial cluster center set centers;

C4, repeating C2 and C3 until the number of comments in core_clusters is 0, at which point the iteration ends and the comments in centers serve as the initial cluster centers for the acquired original microblog comment data reader, where $\text{centers} = \{c_1, \ldots, c_i, \ldots, c_k\}$ and k represents the number of initial cluster centers.
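Steps C1 to C4 can be sketched as follows. This is a hedged sketch, not the patent's reference implementation: the function names are hypothetical, each comment is simplified to a pair (vector of TF-IDF values, likes) without the feature words themselves, and the distance D is assumed to be Euclidean because the formula is not reproduced in this text.

```python
import math


def euclidean(u, v):
    """Assumed distance D between two TF-IDF vectors of equal dimension nt."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def initial_centers(data_tfidf, k_threshold, distance=euclidean):
    """Select initial cluster centers from data_tfidf = [(vector, likes), ...]."""
    # C1: keep comments with more than 0 likes, sorted by likes, largest first.
    core_clusters = sorted(
        (c for c in data_tfidf if c[1] > 0), key=lambda c: c[1], reverse=True
    )
    centers = []
    while core_clusters:  # C4: repeat until core_clusters is empty
        # C2: the first remaining comment becomes the center comment.
        comment = core_clusters[0]
        # C3: drop every comment closer than the threshold K to the center.
        # The center itself (distance 0) is dropped explicitly via the slice,
        # which also guarantees termination for any threshold value.
        core_clusters = [
            c for c in core_clusters[1:]
            if distance(c[0], comment[0]) >= k_threshold
        ]
        centers.append(comment[0])
    return centers
```

Because each pass removes at least the center comment, the loop runs at most once per liked comment, and the number of centers k falls out of the threshold K rather than being fixed in advance.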

As shown in Fig. 5, after S4, clustering the microblog comment vector set data_tfidf with the K-means clustering algorithm may specifically comprise the following steps:

clustering the microblog comment vector set data_tfidf with the K-means clustering algorithm, starting from the finally obtained initial cluster centers centers, to generate a clustering result clusters and a cluster center set clusters_centers;

determining whether the cluster centers changed between two successive clustering passes; if so, running the K-means clustering algorithm again from the clustering result clusters and the cluster center set clusters_centers until the cluster centers no longer change, and then outputting the clustering result clusters, thereby obtaining the optimal clustering result.
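The loop described above, seeded with the selected centers, can be sketched in plain Python as below. The assumptions are stated plainly: the function name `kmeans` and the `max_iter` guard are hypothetical additions, vectors are bare tuples of TF-IDF values, and Euclidean distance is used for assignment, as is conventional for K-means.

```python
import math


def kmeans(vectors, centers, max_iter=100):
    """Run K-means from given initial centers; return (clusters, clusters_centers)."""
    clusters_centers = [tuple(c) for c in centers]
    clusters = [[] for _ in clusters_centers]
    for _ in range(max_iter):  # max_iter is an assumed safety guard
        # Assignment step: put each vector in the cluster of its nearest center.
        clusters = [[] for _ in clusters_centers]
        for v in vectors:
            nearest = min(
                range(len(clusters_centers)),
                key=lambda i: math.dist(v, clusters_centers[i]),
            )
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            for members, old in zip(clusters, clusters_centers)
        ]
        # Stop once the centers are unchanged between two successive passes.
        if new_centers == clusters_centers:
            break
        clusters_centers = new_centers
    return clusters, clusters_centers
```

The exact-equality stopping test mirrors the "centers no longer change" condition in the text; a tolerance-based test would be a reasonable variation for noisy floating-point data.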

The foregoing is a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.

Claims

1. A method for determining K-means initial cluster centers for microblog comment text, characterized by comprising the following steps:

2. The method for determining K-means initial cluster centers for microblog comment text according to claim 1, wherein before the liked comments are selected from the microblog comment vector set, added to the core comment cluster, and sorted by number of likes in descending order, the method further comprises:

acquiring original microblog comment data reader, preprocessing the acquired original microblog comment data reader, and outputting a microblog comment set data;

3. The method for determining K-means initial cluster centers for microblog comment text according to claim 2, wherein acquiring the original microblog comment data reader, preprocessing the acquired original microblog comment data reader, and outputting the microblog comment set data comprises:

acquiring the original microblog comment data $\text{reader} = \{r_1, \ldots, r_i, \ldots, r_n\}$, $r_i = (p_i, \text{likes}_i)$, wherein n represents the number of original microblog comments acquired, $r_i$ represents the i-th comment in the original microblog comment data, $p_i$ represents the original comment text of $r_i$, and $\text{likes}_i$ represents the number of likes of $r_i$;

preprocessing the acquired original microblog comment data to obtain the microblog comment set $\text{data} = \{d_1, \ldots, d_i, \ldots, d_n\}$, $d_i = ((\text{word}_1, \ldots, \text{word}_j, \ldots, \text{word}_m), \text{likes}_i)$, wherein $d_i$ represents the preprocessed i-th comment, $\text{word}_j$ represents the j-th segmented word of $d_i$, $\text{likes}_i$ represents the number of likes of $d_i$, m represents the number of segmented words of $d_i$, and each segmented word in $d_i$ is a feature word;

4. The method for determining K-means initial cluster centers for microblog comment text according to claim 3, wherein vectorizing the microblog comment set data and outputting the microblog comment vector set data_tfidf comprises:

$\text{tfidf}(t_i) = \text{TF}(t_i) \times \text{IDF}(t_i)$

sorting the feature words in the set T by TF-IDF value in descending order to generate a base vector $\text{jvsm} = (\text{t\_tfidf}_1, \ldots, \text{t\_tfidf}_i, \ldots, \text{t\_tfidf}_{jnt})$, $\text{t\_tfidf}_i = (t_i, \text{tfidf}(t_i))$, wherein $t_i$ represents the feature word of the i-th component $\text{t\_tfidf}_i$ of the base vector, $\text{tfidf}(t_i)$ represents the TF-IDF value of that component, and jnt represents the number of distinct feature words in the set T, which is also the dimension of jvsm;

deleting from the base vector jvsm the feature words whose $\text{tfidf}(t_i)$ is greater than the feature threshold FEA to obtain a standard feature vector $\text{vsm} = (\text{t\_tfidf}_1, \ldots, \text{t\_tfidf}_i, \ldots, \text{t\_tfidf}_{nt})$, wherein $\text{t\_tfidf}_i = (t_i, \text{tfidf}(t_i))$, $t_i$ represents the feature word of the i-th component $\text{t\_tfidf}_i$ of the standard feature vector, $\text{tfidf}(t_i)$ represents the TF-IDF value of that component, and nt represents the dimension of vsm;

adding the number of likes to each comment vector in the vector space model data_vsm to generate the microblog comment vector set data_tfidf of the microblog comment set data, wherein $\text{data\_tfidf} = \{dv_1, \ldots, dv_i, \ldots, dv_n\}$, $dv_i = ((tv_1, \ldots, tv_j, \ldots, tv_{nt}), \text{likes}_i)$, $tv_j = (t_j, \text{tfidf}_j)$, $t_j$ represents the j-th feature word of the i-th comment $dv_i$ in the microblog comment vector set data_tfidf, $\text{likes}_i$ represents the number of likes of $dv_i$, and nt represents the number of feature words of $dv_i$.

5. The method for determining K-means initial cluster centers for microblog comment text according to claim 4, wherein the term frequency $\text{TF}(t_i)$ of the i-th feature word $t_i$ is expressed as:

wherein words represents the total number of feature words in the set T_allwords formed by all feature words, and $\text{word}(t_i)$ represents the number of occurrences of the i-th feature word $t_i$ of the set T of non-repeated feature words.

6. The method for determining K-means initial cluster centers for microblog comment text according to claim 4, wherein the TF-IDF value of the j-th feature word of the i-th comment vector $\text{dv\_nolikes}_i$ in the vector space model data_vsm is:

7. The method for determining K-means initial cluster centers for microblog comment text according to claim 1, wherein the distance in S2 is expressed as:

wherein D represents the distance, $\text{core\_clusters}_i = (\text{cc\_tfidf}_1, \ldots, \text{cc\_tfidf}_j, \ldots, \text{cc\_tfidf}_{nt})$, and $\text{cc\_tfidf}_j$ represents the TF-IDF value of the j-th feature word of the i-th comment $\text{core\_clusters}_i$ in the core comment cluster; $\text{comment} = (\text{c\_tfidf}_1, \ldots, \text{c\_tfidf}_j, \ldots, \text{c\_tfidf}_{nt})$, where $\text{c\_tfidf}_j$ represents the TF-IDF value of the j-th feature word of the center comment; and nt represents the dimension of the comment.

8. The method for determining K-means initial cluster centers for microblog comment text according to claim 1, wherein after S4, the method further comprises:

clustering the microblog comment vector set with the K-means clustering algorithm, starting from the finally obtained initial cluster centers, to generate a clustering result clusters and a cluster center set clusters_centers;

determining whether the cluster centers changed between two successive clustering passes; if so, running the K-means clustering algorithm again from the clustering result clusters and the cluster center set clusters_centers until the cluster centers no longer change, and then outputting the clustering result clusters.