CN111597335A - K-means initial clustering center determination method for microblog comment text - Google Patents

Publication number
CN111597335A
CN111597335A CN202010364885.9A CN202010364885A CN111597335A CN 111597335 A CN111597335 A CN 111597335A CN 202010364885 A CN202010364885 A CN 202010364885A CN 111597335 A CN111597335 A CN 111597335A
Authority
CN
China
Prior art keywords
comment
tfidf
microblog
cluster
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010364885.9A
Other languages
Chinese (zh)
Other versions
CN111597335B (en)
Inventor
翟智昆
周成成
许海涛
周贤伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB
Priority to CN202010364885.9A
Publication of CN111597335A
Application granted
Publication of CN111597335B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9536Search customisation based on social or collaborative filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for determining K-means initial cluster centers for microblog comment text, which can quickly and accurately determine the optimal initial cluster centers of the microblog comment text. The method comprises the following steps: S1, selecting the liked comments from the microblog comment vector set, adding them to the core comment cluster, and sorting the comments in the core comment cluster in descending order of their number of likes; S2, selecting the first comment in the sorted core comment cluster as the center comment, and calculating the distance between the center comment and each comment in the core comment cluster; S3, deleting from the core comment cluster the comments whose distance to the center comment is less than a set distance threshold, and adding the center comment to the initial cluster center set; and S4, returning to S2 and S3 until the number of comments in the core comment cluster is 0, at which point the comments in the initial cluster center set are the final initial cluster centers of the microblog comment vector set. The invention relates to the technical field of network communication.

Description

K-means initial clustering center determination method for microblog comment text
Technical Field
The invention relates to the technical field of network communication, in particular to a method for determining a K-means initial clustering center for microblog comment texts.
Background
With the continuous development of information technology, the number of Internet users keeps growing. According to the 44th Statistical Report on Internet Development in China, released by the China Internet Network Information Center in August 2019, the number of Internet users in China reached 854 million by June 2019, an increase of 25.98 million over the end of 2018, and the Internet penetration rate reached 61.2 percent, an increase of 1.6 percentage points over the end of 2018. With the development of network communication technology and the growth in the number of Internet users, the Internet has become a major place for publishing, disseminating and acquiring information in daily life. The volume of information generated and disseminated on the Internet every day is huge, and news and social events of all kinds appear frequently, so the amount of public opinion information generated on the Internet every day is also huge. This not only has a great influence on the virtual network but also intrudes into real life to a certain extent. Compared with traditional media, microblogs have a lower entry threshold and a higher information transmission speed, and during the spread of sudden events in particular, microblogs become a main channel of message transmission. Microblogs have therefore become a main source of online public opinion, and microblog public opinion deserves attention in public opinion monitoring.
K-means is the most classical and widely used partitional clustering algorithm, and is often used for clustering online public opinion. However, the method has certain limitations. For example, its result depends on how the initial cluster centers are selected: if the selection is improper (for example, an isolated point is selected), the final clustering result often falls into a local optimum.
Disclosure of Invention
The invention aims to solve the technical problem of providing a method for determining a K-means initial clustering center of a microblog comment text, and aims to solve the problem that the initial clustering center is improperly selected in the prior art, so that a clustering result is easy to fall into local optimum.
In order to solve the technical problem, an embodiment of the present invention provides a method for determining a K-means initial clustering center for a microblog comment text, including:
S1, selecting the liked comments from the microblog comment vector set, adding them to the core comment cluster, and sorting the comments in the core comment cluster in descending order of their number of likes;
S2, selecting the first comment in the sorted core comment cluster as the center comment, and calculating the distance between the center comment and each comment in the core comment cluster;
S3, deleting from the core comment cluster the comments whose distance to the center comment is less than a set distance threshold, and adding the center comment to the initial cluster center set;
and S4, returning to S2 and S3 until the number of comments in the core comment cluster is 0, at which point the comments in the initial cluster center set are the final initial cluster centers of the microblog comment vector set.
Further, before selecting the liked comments from the microblog comment vector set, adding them to the core comment cluster, and sorting the comments in the core comment cluster in descending order of their number of likes, the method further comprises:
acquiring the original microblog comment data reader, preprocessing the acquired original microblog comment data reader, and outputting the microblog comment set data;
vectorizing the microblog comment set data, and outputting the microblog comment vector set data_tfidf.
Further, acquiring the original microblog comment data reader, preprocessing the acquired original microblog comment data reader, and outputting the microblog comment set data comprises:
acquiring the original microblog comment data $\mathit{reader} = \{r_1, \ldots, r_i, \ldots, r_n\}$, $r_i = (p_i, \mathit{likes}_i)$, where $n$ represents the number of original microblog comments acquired, $r_i$ represents the i-th comment in the original microblog comment data, $p_i$ represents the original comment text of $r_i$, and $\mathit{likes}_i$ represents the number of likes of $r_i$;
preprocessing the acquired original microblog comment data to obtain the microblog comment set $\mathit{data} = \{d_1, \ldots, d_i, \ldots, d_n\}$, $d_i = ((\mathit{word}_1, \ldots, \mathit{word}_j, \ldots, \mathit{word}_m), \mathit{likes}_i)$, where $d_i$ represents the preprocessed i-th comment, $\mathit{word}_j$ represents the j-th participle of $d_i$, $\mathit{likes}_i$ represents the number of likes of $d_i$, $m$ represents the number of participles of $d_i$, and each participle of $d_i$ is a feature word;
wherein the preprocessing comprises: symbol processing, Chinese word segmentation and stop-word removal.
Further, vectorizing the microblog comment set data and outputting the microblog comment vector set data_tfidf comprises:
determining all feature words in the microblog comment set data, the non-repeated feature words forming a set $T$;
determining the TF-IDF value of the i-th feature word $t_i$ in the set $T$, expressed as:
$$\mathit{tfidf}(t_i) = TF(t_i) \times IDF(t_i)$$
where $\mathit{tfidf}(t_i)$ represents the TF-IDF value of the i-th feature word $t_i$, TF-IDF stands for term frequency-inverse document frequency, $TF(t_i)$ represents the term frequency of $t_i$, and $IDF(t_i)$ represents the inverse document frequency of $t_i$;
sorting the feature words in the set $T$ in descending order of their TF-IDF values to generate a base vector $\mathit{jvsm} = (\mathit{t\_tfidf}_1, \ldots, \mathit{t\_tfidf}_i, \ldots, \mathit{t\_tfidf}_{jnt})$, $\mathit{t\_tfidf}_i = (t_i, \mathit{tfidf}(t_i))$, where $t_i$ represents the feature word of the i-th component $\mathit{t\_tfidf}_i$ of the base vector, $\mathit{tfidf}(t_i)$ represents the TF-IDF value of that component, and $jnt$ represents the number of distinct feature words in the set $T$, which is also the dimension of $\mathit{jvsm}$;
deleting from the base vector $\mathit{jvsm}$ the feature words whose $\mathit{tfidf}(t_i)$ is larger than the feature threshold FEA to obtain the standard feature vector $\mathit{vsm} = (\mathit{t\_tfidf}_1, \ldots, \mathit{t\_tfidf}_i, \ldots, \mathit{t\_tfidf}_{nt})$, where $\mathit{t\_tfidf}_i = (t_i, \mathit{tfidf}(t_i))$, $t_i$ represents the feature word of the i-th component $\mathit{t\_tfidf}_i$ of the standard feature vector, $\mathit{tfidf}(t_i)$ represents the TF-IDF value of that component, and $nt$ represents the dimension of $\mathit{vsm}$;
vectorizing the microblog comment set data according to $\mathit{vsm}$ to generate the vector space model $\mathit{data\_vsm} = \{\mathit{dv\_nolikes}_1, \ldots, \mathit{dv\_nolikes}_i, \ldots, \mathit{dv\_nolikes}_n\}$, $\mathit{dv\_nolikes}_i = (tv_1, \ldots, tv_j, \ldots, tv_{nt})$, $tv_j = (t_j, \mathit{tfidf}_j)$, where $t_j$ represents the j-th feature word of the i-th comment vector $\mathit{dv\_nolikes}_i$ in the vector space model data_vsm, $\mathit{tfidf}_j$ represents the TF-IDF value of that feature word, and $nt$ represents the number of feature words of $\mathit{dv\_nolikes}_i$, that is, the dimension of $\mathit{dv\_nolikes}_i$;
adding the number of likes to each comment vector in the vector space model data_vsm to generate the microblog comment vector set $\mathit{data\_tfidf} = \{dv_1, \ldots, dv_i, \ldots, dv_n\}$, $dv_i = ((tv_1, \ldots, tv_j, \ldots, tv_{nt}), \mathit{likes}_i)$, $tv_j = (t_j, \mathit{tfidf}_j)$, where $t_j$ represents the j-th feature word of the i-th comment $dv_i$ in the microblog comment vector set data_tfidf, $\mathit{likes}_i$ represents the number of likes of $dv_i$, and $nt$ represents the number of feature words of $dv_i$.
Further, the term frequency $TF(t_i)$ of the i-th feature word $t_i$ is expressed as:
$$TF(t_i) = \frac{\mathit{word}(t_i)}{\mathit{words}}$$
where $\mathit{words}$ represents the total number of feature words in the set T_allwords formed by all feature words (including repetitions), and $\mathit{word}(t_i)$ represents the number of occurrences of the i-th feature word $t_i$ of the set $T$ of non-repeated feature words.
Further, the TF-IDF value of the j-th feature word of the i-th comment vector $\mathit{dv\_nolikes}_i$ in the vector space model data_vsm is:
Figure BDA0002476412400000042
further, the distance in S2 is represented as:
Figure BDA0002476412400000043
where $D$ represents the distance; $\mathit{core\_clusters}_i = (\mathit{cc\_tfidf}_1, \ldots, \mathit{cc\_tfidf}_j, \ldots, \mathit{cc\_tfidf}_{nt})$, where $\mathit{cc\_tfidf}_j$ represents the TF-IDF value of the j-th feature word of the i-th comment $\mathit{core\_clusters}_i$ in the core comment cluster;
$\mathit{comment} = (\mathit{c\_tfidf}_1, \ldots, \mathit{c\_tfidf}_j, \ldots, \mathit{c\_tfidf}_{nt})$, where $\mathit{c\_tfidf}_j$ represents the TF-IDF value of the j-th feature word of the center comment; $nt$ represents the dimension of the comment.
Further, after S4, the method further includes:
clustering the microblog comment vector set with the K-means clustering algorithm, starting from the finally obtained initial cluster centers, to generate the clustering results clusters and the cluster centers clusters_centers;
and judging whether the cluster centers clusters_centers changed between two successive clustering passes; if so, running the K-means clustering algorithm again from the clustering results clusters and the cluster centers clusters_centers until the cluster centers no longer change, and then outputting the clustering results clusters.
The technical scheme of the invention has the following beneficial effects:
In this scheme, the liked comments are selected from the microblog comment vector set and added to the core comment cluster, and the comments in the core comment cluster are sorted in descending order of their number of likes; the first comment in the sorted core comment cluster is selected as the center comment, and the distance between the center comment and each comment in the core comment cluster is calculated; the comments in the core comment cluster whose distance to the center comment is less than a set distance threshold are deleted, and the center comment is added to the initial cluster center set; the selection of a center comment, the distance calculation and the deletion are then repeated until the number of comments in the core comment cluster is 0, at which point the comments in the initial cluster center set are the final initial cluster centers of the microblog comment vector set. In this way, the optimal initial cluster centers of the microblog comment text can be determined quickly and accurately, and the problem of the clustering result falling into a local optimum because of improperly selected initial cluster centers is avoided.
Drawings
Fig. 1 is a schematic flow chart of a method for determining K-means initial cluster centers for microblog comment text according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of microblog data preprocessing according to an embodiment of the present invention;
Fig. 3 is a schematic flow chart of microblog data vectorization according to an embodiment of the present invention;
Fig. 4 is a detailed flow chart of a method for determining K-means initial cluster centers for microblog comment text according to an embodiment of the present invention;
Fig. 5 is a flow chart of the K-means clustering algorithm for microblog comment text according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages to be solved by the present invention clearer, the following detailed description is made with reference to the accompanying drawings and specific embodiments.
The invention provides a method for determining a K-means initial clustering center of a microblog comment text, aiming at the problem that the conventional initial clustering center is not properly selected and a clustering result is easy to fall into local optimum.
As shown in fig. 1, the method for determining a K-means initial clustering center for a microblog comment text according to the embodiment of the present invention includes:
S1, selecting the liked comments from the microblog comment vector set, adding them to the core comment cluster, and sorting the comments in the core comment cluster in descending order of their number of likes;
S2, selecting the first comment in the sorted core comment cluster as the center comment, and calculating the distance between the center comment and each comment in the core comment cluster;
S3, deleting from the core comment cluster the comments whose distance to the center comment is less than a set distance threshold, and adding the center comment to the initial cluster center set;
and S4, returning to S2 and S3 until the number of comments in the core comment cluster is 0, at which point the comments in the initial cluster center set are the final initial cluster centers of the microblog comment vector set.
In the method for determining K-means initial cluster centers for microblog comment text according to the embodiment of the invention, the liked comments are selected from the microblog comment vector set and added to the core comment cluster, and the comments in the core comment cluster are sorted in descending order of their number of likes; the first comment in the sorted core comment cluster is selected as the center comment, and the distance between the center comment and each comment in the core comment cluster is calculated; the comments in the core comment cluster whose distance to the center comment is less than a set distance threshold are deleted, and the center comment is added to the initial cluster center set; the selection of a center comment, the distance calculation and the deletion are then repeated until the number of comments in the core comment cluster is 0, at which point the comments in the initial cluster center set are the final initial cluster centers of the microblog comment vector set. In this way, the optimal initial cluster centers of the microblog comment text can be determined quickly and accurately, and the problem of the clustering result falling into a local optimum because of improperly selected initial cluster centers is avoided.
In a specific implementation of the foregoing method for determining K-means initial cluster centers for microblog comment text, further, before selecting the liked comments from the microblog comment vector set, adding them to the core comment cluster, and sorting the comments in the core comment cluster in descending order of their number of likes, the method further comprises:
acquiring the original microblog comment data reader, preprocessing the acquired original microblog comment data reader, and outputting the microblog comment set data;
vectorizing the microblog comment set data, and outputting the microblog comment vector set data_tfidf.
In a specific implementation of the foregoing method for determining K-means initial cluster centers for microblog comment text, further, acquiring the original microblog comment data reader, preprocessing the acquired original microblog comment data reader, and outputting the microblog comment set data comprises:
acquiring the original microblog comment data $\mathit{reader} = \{r_1, \ldots, r_i, \ldots, r_n\}$, $r_i = (p_i, \mathit{likes}_i)$, where $n$ represents the number of original microblog comments acquired, $r_i$ represents the i-th comment in the original microblog comment data, $p_i$ represents the original comment text of $r_i$, and $\mathit{likes}_i$ represents the number of likes of $r_i$;
preprocessing the acquired original microblog comment data to obtain the microblog comment set $\mathit{data} = \{d_1, \ldots, d_i, \ldots, d_n\}$, $d_i = ((\mathit{word}_1, \ldots, \mathit{word}_j, \ldots, \mathit{word}_m), \mathit{likes}_i)$, where $d_i$ represents the preprocessed i-th comment, $\mathit{word}_j$ represents the j-th participle of $d_i$, $\mathit{likes}_i$ represents the number of likes of $d_i$, $m$ represents the number of participles of $d_i$, and each participle of $d_i$ is a feature word;
wherein the preprocessing comprises: symbol processing, Chinese word segmentation, and stop-word removal, as shown in Fig. 2.
In this embodiment, symbol processing refers to removing information that is meaningless for text clustering, such as "@username" or "#topic#" in the microblog comment text.
In this embodiment, Chinese word segmentation divides the text into separate words. The microblog comment text may be segmented using word segmentation software or a word segmentation component of a programming language.
In this embodiment, stop words are words that have grammatical meaning in the complete context but no practical value in text processing, such as some modal particles, conjunctions and auxiliary words ("reduce", "and", "you") and the like. Stop words are removed from the segmented microblog comment text using a preset stop-word list.
In this embodiment, each participle $\mathit{word}_j$ in the microblog comment set data is a feature word. For example, every participle of the preprocessed i-th comment $d_i$ is a feature word; the set formed by the non-repeated feature words is T = {article, patent, microblog}, and all the feature words (including repetitions) form the set T_allwords = {article, patent, microblog}.
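The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the regular expressions, the tiny stop-word list and the whitespace-based segment function are all assumptions made here. A real system would call a Chinese word segmentation component in place of segment, and the "@username" pattern assumes the mention is followed by whitespace.

```python
import re

# Hypothetical stop-word list; the source only says a preset list is used.
STOP_WORDS = {"的", "了", "和", "呢"}


def strip_symbols(text):
    """Symbol processing: drop '#topic#' tags and '@username' mentions."""
    text = re.sub(r"#[^#]*#", " ", text)  # '#topic#'
    text = re.sub(r"@\S+", " ", text)     # '@username' up to the next whitespace
    return text.strip()


def segment(text):
    """Stand-in segmenter (assumption): splits on whitespace.

    Replace with a real Chinese word segmentation component in practice.
    """
    return text.split()


def preprocess(reader):
    """Turn reader = [(p_i, likes_i), ...] into data = [((word_1, ...), likes_i), ...]."""
    data = []
    for text, likes in reader:
        words = [w for w in segment(strip_symbols(text)) if w not in STOP_WORDS]
        data.append((tuple(words), likes))
    return data
```

For instance, preprocess([("@alice #news# patent 的 microblog", 3)]) keeps only the participles "patent" and "microblog" together with the like count 3.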
In a specific implementation of the foregoing method for determining K-means initial cluster centers for microblog comment text, further, as shown in Fig. 3, vectorizing the microblog comment set data and outputting the microblog comment vector set data_tfidf comprises:
determining all feature words in the microblog comment set data, the non-repeated feature words forming a set $T$;
determining the TF-IDF value of the i-th feature word $t_i$ in the set $T$, expressed as:
$$\mathit{tfidf}(t_i) = TF(t_i) \times IDF(t_i)$$
where $\mathit{tfidf}(t_i)$ represents the TF-IDF value of the i-th feature word $t_i$, TF-IDF stands for term frequency-inverse document frequency, $TF(t_i)$ represents the term frequency of $t_i$, and $IDF(t_i)$ represents the inverse document frequency of $t_i$;
sorting the feature words in the set $T$ in descending order of their TF-IDF values to generate a base vector $\mathit{jvsm} = (\mathit{t\_tfidf}_1, \ldots, \mathit{t\_tfidf}_i, \ldots, \mathit{t\_tfidf}_{jnt})$, $\mathit{t\_tfidf}_i = (t_i, \mathit{tfidf}(t_i))$, where $t_i$ represents the feature word of the i-th component $\mathit{t\_tfidf}_i$ of the base vector, $\mathit{tfidf}(t_i)$ represents the TF-IDF value of that component, and $jnt$ represents the number of distinct feature words in the set $T$, which is also the dimension of $\mathit{jvsm}$;
deleting from the base vector $\mathit{jvsm}$ the feature words whose $\mathit{tfidf}(t_i)$ is larger than the feature threshold FEA to obtain the standard feature vector $\mathit{vsm} = (\mathit{t\_tfidf}_1, \ldots, \mathit{t\_tfidf}_i, \ldots, \mathit{t\_tfidf}_{nt})$, where $\mathit{t\_tfidf}_i = (t_i, \mathit{tfidf}(t_i))$, $t_i$ represents the feature word of the i-th component $\mathit{t\_tfidf}_i$ of the standard feature vector, $\mathit{tfidf}(t_i)$ represents the TF-IDF value of that component, and $nt$ represents the dimension of $\mathit{vsm}$;
vectorizing the microblog comment set data according to $\mathit{vsm}$ to generate the vector space model $\mathit{data\_vsm} = \{\mathit{dv\_nolikes}_1, \ldots, \mathit{dv\_nolikes}_i, \ldots, \mathit{dv\_nolikes}_n\}$, $\mathit{dv\_nolikes}_i = (tv_1, \ldots, tv_j, \ldots, tv_{nt})$, $tv_j = (t_j, \mathit{tfidf}_j)$, where $t_j$ represents the j-th feature word of the i-th comment vector $\mathit{dv\_nolikes}_i$ in the vector space model data_vsm, $\mathit{tfidf}_j$ represents the TF-IDF value of that feature word, and $nt$ represents the number of feature words of $\mathit{dv\_nolikes}_i$, that is, the dimension of $\mathit{dv\_nolikes}_i$;
adding the number of likes to each comment vector in the vector space model data_vsm to generate the microblog comment vector set $\mathit{data\_tfidf} = \{dv_1, \ldots, dv_i, \ldots, dv_n\}$, $dv_i = ((tv_1, \ldots, tv_j, \ldots, tv_{nt}), \mathit{likes}_i)$, $tv_j = (t_j, \mathit{tfidf}_j)$, where $t_j$ represents the j-th feature word of the i-th comment $dv_i$ in the microblog comment vector set data_tfidf, $\mathit{likes}_i$ represents the number of likes of $dv_i$, and $nt$ represents the number of feature words of $dv_i$.
In this embodiment, vectorizing the microblog comment set data reduces the dimensionality of the feature words in the microblog comment vector set, which improves clustering accuracy at the later stage and solves the problem of inaccurate initial cluster centers caused by a very sparse vector space model data_vsm generated from the microblog comment set data.
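The vectorization pipeline can be sketched roughly as below. Several pieces are assumptions, because the corresponding formulas appear only as images in the source: the corpus-level term frequency is taken as occurrences divided by the total number of feature words, IDF is taken as the common log(N / document frequency), and the per-comment weight is taken as the in-comment term frequency times that IDF. The function names idf_table, build_vsm and vectorize are hypothetical.

```python
import math
from collections import Counter


def idf_table(data):
    """Assumed IDF: log(number of comments / number of comments containing the word)."""
    n_docs = len(data)
    doc_freq = Counter(w for words, _ in data for w in set(words))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def build_vsm(data, fea):
    """Return the feature words of vsm: sorted by TF-IDF, values above FEA deleted."""
    all_words = [w for words, _ in data for w in words]  # all feature words, with repeats
    total = len(all_words)
    idf = idf_table(data)
    jvsm = sorted(
        ((t, count / total * idf[t]) for t, count in Counter(all_words).items()),
        key=lambda pair: pair[1],
        reverse=True,  # descending TF-IDF, as in the text
    )
    # The text deletes feature words whose TF-IDF is larger than FEA.
    return [t for t, value in jvsm if value <= fea]


def vectorize(data, terms):
    """Build data_tfidf: one (vector, likes) pair per comment, one weight per term."""
    idf = idf_table(data)
    data_tfidf = []
    for words, likes in data:
        counts = Counter(words)
        length = len(words) or 1  # avoid division by zero for empty comments
        vector = [counts[t] / length * idf[t] for t in terms]
        data_tfidf.append((vector, likes))
    return data_tfidf
```

Keeping the like count next to each vector, rather than inside it, mirrors the structure $dv_i = ((tv_1, \ldots, tv_{nt}), \mathit{likes}_i)$ and keeps likes out of the distance computation.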
In a specific implementation of the foregoing method for determining K-means initial cluster centers for microblog comment text, further, the term frequency $TF(t_i)$ of the i-th feature word $t_i$ is expressed as:
$$TF(t_i) = \frac{\mathit{word}(t_i)}{\mathit{words}}$$
where $\mathit{words}$ represents the total number of feature words in the set T_allwords formed by all feature words (including repetitions), and $\mathit{word}(t_i)$ represents the number of occurrences of the i-th feature word $t_i$ of the set $T$ of non-repeated feature words.
In a specific implementation of the foregoing method for determining K-means initial cluster centers for microblog comment text, further, the TF-IDF value of the j-th feature word of the i-th comment vector $\mathit{dv\_nolikes}_i$ in the vector space model data_vsm is:
Figure BDA0002476412400000082
As shown in Fig. 4, the method for determining K-means initial cluster centers for microblog comment text may specifically include the following steps:
C1, selecting the comments with more than 0 likes from the microblog comment vector set data_tfidf, adding them to the core comment cluster core_clusters, and then sorting the comments in core_clusters in descending order of their number of likes;
C2, selecting the first comment core_clusters[0] in core_clusters as the center comment comment, and calculating the distance D between comment and each comment in the core comment cluster, where the distance D is expressed as:
Figure BDA0002476412400000091
where $\mathit{core\_clusters}_i = (\mathit{cc\_tfidf}_1, \ldots, \mathit{cc\_tfidf}_j, \ldots, \mathit{cc\_tfidf}_{nt})$, and $\mathit{cc\_tfidf}_j$ represents the TF-IDF value of the j-th feature word of the i-th comment $\mathit{core\_clusters}_i$ in the core comment cluster; $\mathit{comment} = (\mathit{c\_tfidf}_1, \ldots, \mathit{c\_tfidf}_j, \ldots, \mathit{c\_tfidf}_{nt})$, and $\mathit{c\_tfidf}_j$ represents the TF-IDF value of the j-th feature word of the center comment; $nt$ represents the dimension of the comment.
C3, deleting from the sorted core_clusters the comments whose distance D to comment is smaller than the set distance threshold K, and adding comment to the initial cluster center set centers;
C4, returning to C2 and C3 until the number of comments in core_clusters is 0, at which point the iteration ends and the comments in centers can be used as the initial cluster centers of the acquired original microblog comment data reader, where $\mathit{centers} = \{c_1, \ldots, c_i, \ldots, c_k\}$ and $k$ represents the number of initial cluster centers.
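Steps C1 to C4 can be sketched as the loop below. The distance formula is an image in the source, so Euclidean distance over the TF-IDF components is used here as an assumption; the function names and the input layout (a list of (vector, likes) pairs) are also illustrative choices, not taken from the source.

```python
import math


def euclidean(u, v):
    """Assumed distance D: Euclidean distance between two TF-IDF vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def initial_centers(data_tfidf, k_threshold):
    """Pick initial cluster centers from liked comments (steps C1 to C4).

    data_tfidf: list of (vector, likes) pairs.
    k_threshold: the distance threshold K; must be positive so that the
    center comment (distance 0 to itself) is always removed and the loop ends.
    """
    if k_threshold <= 0:
        raise ValueError("distance threshold K must be positive")

    # C1: keep comments with more than 0 likes, most liked first.
    core_clusters = sorted(
        (dv for dv in data_tfidf if dv[1] > 0),
        key=lambda dv: dv[1],
        reverse=True,
    )

    centers = []
    while core_clusters:  # C4: repeat until the core comment cluster is empty
        comment = core_clusters[0][0]  # C2: most liked remaining comment
        centers.append(comment)
        # C3: delete every comment closer than K to the center comment.
        core_clusters = [
            dv for dv in core_clusters if euclidean(dv[0], comment) >= k_threshold
        ]
    return centers
```

With this reading, the number of centers k is not fixed in advance: it falls out of the threshold K, since a larger K removes more neighbours per pass and leaves fewer centers.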
As shown in Fig. 5, after S4, clustering the microblog comment vector set data_tfidf with the K-means clustering algorithm may specifically include the following steps:
clustering the microblog comment vector set data_tfidf with the K-means clustering algorithm, starting from the finally obtained initial cluster centers centers, to generate the clustering results clusters and the cluster centers clusters_centers;
and judging whether the cluster centers clusters_centers changed between two successive clustering passes; if so, running the K-means clustering algorithm again from the clustering results clusters and the cluster centers clusters_centers until the cluster centers no longer change, and then outputting the clustering results clusters, thereby obtaining the optimal clustering result.
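The clustering stage described above amounts to standard K-means seeded with the selected centers and stopped when the centers no longer move. A minimal sketch follows, again assuming Euclidean distance; the max_iter guard and the handling of empty clusters (keeping the old center) are additions made here for safety and are not part of the source.

```python
import math


def euclidean(u, v):
    """Assumed distance: Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kmeans(vectors, centers, max_iter=100):
    """Run K-means from the given initial centers until the centers stop changing.

    Returns (clusters, clusters_centers). max_iter is a safety guard (assumption).
    """
    clusters_centers = [list(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: put each vector in the cluster of its nearest center.
        clusters = [[] for _ in clusters_centers]
        for v in vectors:
            nearest = min(
                range(len(clusters_centers)),
                key=lambda i: euclidean(v, clusters_centers[i]),
            )
            clusters[nearest].append(v)

        # Update step: each center becomes the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else old
            for members, old in zip(clusters, clusters_centers)
        ]

        if new_centers == clusters_centers:  # centers unchanged: converged
            break
        clusters_centers = new_centers
    return clusters, clusters_centers
```

A typical call passes only the vector part of each comment, for example kmeans([vec for vec, _ in data_tfidf], centers), because the like counts are used for seeding but not for the distance computation.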
The foregoing is a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements shall also fall within the protection scope of the present invention.

Claims (8)

1. A method for determining K-means initial cluster centers for microblog comment text, characterized by comprising the following steps:
S1, selecting the comments that have received likes from the microblog comment vector set, adding them to a core comment cluster, and sorting the comments in the core comment cluster by number of likes in descending order;
S2, selecting the first comment in the sorted core comment cluster as the center comment, and calculating the distance between the center comment and each comment in the core comment cluster;
S3, deleting from the core comment cluster the comments whose distance to the center comment is less than a set distance threshold, and adding the center comment to the initial cluster center set;
and S4, returning to S2 and S3 and repeating until the number of comments in the core comment cluster is 0, whereupon the comments in the initial cluster center set are the final initial cluster centers of the microblog comment vector set.
2. The method for determining K-means initial cluster centers for microblog comment text according to claim 1, wherein before the liked comments selected from the microblog comment vector set are added to the core comment cluster and the comments in the core comment cluster are sorted by number of likes in descending order, the method further comprises:
acquiring original microblog comment data reader, preprocessing the acquired original microblog comment data reader, and outputting a microblog comment set data;
vectorizing the microblog comment set data, and outputting a microblog comment vector set data_tfidf.
3. The method for determining K-means initial cluster centers for microblog comment text according to claim 2, wherein acquiring the original microblog comment data reader, preprocessing the acquired original microblog comment data reader, and outputting the microblog comment set data comprises:
acquiring the original microblog comment data reader = {r_1, ..., r_i, ..., r_n}, r_i = (p_i, likes_i), where n denotes the number of acquired original microblog comments, r_i denotes the i-th comment in the original microblog comment data, p_i denotes the original comment text of the i-th comment r_i, and likes_i denotes the number of likes of the i-th comment r_i;
preprocessing the acquired original microblog comment data to obtain the microblog comment set data = {d_1, ..., d_i, ..., d_n}, d_i = ((word_1, ..., word_j, ..., word_m), likes_i), where d_i denotes the i-th comment after preprocessing, word_j denotes the j-th segmented word of the i-th comment d_i, likes_i denotes the number of likes of d_i, m denotes the number of segmented words in d_i, and each segmented word in d_i is a feature word;
wherein the preprocessing comprises: symbol processing, Chinese word segmentation, and stop-word removal.
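The three preprocessing stages named in this claim can be sketched as below. This is a hedged illustration, not the claimed implementation: `preprocess` is a hypothetical name, the symbol rule (replace every character that is neither a word character nor whitespace) is an assumption, and the default whitespace `segment` function is only a stand-in; for real microblog text a Chinese word segmenter would be passed in instead.

```python
import re


def preprocess(reader, segment=str.split, stop_words=frozenset()):
    """Turn (text, likes) pairs into ((word_1, ..., word_m), likes) pairs.

    segment and stop_words are caller-supplied; the defaults are placeholders.
    """
    data = []
    for text, likes in reader:
        # Symbol processing: replace punctuation and other symbols with spaces.
        cleaned = re.sub(r"[^\w\s]", " ", text)
        # Word segmentation followed by stop-word removal.
        words = tuple(w for w in segment(cleaned) if w not in stop_words)
        data.append((words, likes))
    return data
```

Keeping the like count attached to each comment through preprocessing is what later allows the core comment cluster to be sorted by likes.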
4. The method for determining K-means initial cluster centers for microblog comment text according to claim 3, wherein vectorizing the microblog comment set data and outputting the microblog comment vector set data_tfidf comprises:
determining all feature words in the microblog comment set data, and forming a set T from the non-repeated feature words;
determining the TF-IDF value of the i-th feature word t_i in the set T, expressed as:
tfidf(t_i) = TF(t_i) × IDF(t_i)
where tfidf(t_i) denotes the TF-IDF value of the i-th feature word t_i, TF-IDF denotes term frequency-inverse document frequency, TF(t_i) denotes the term frequency of the i-th feature word t_i, and IDF(t_i) denotes the inverse document frequency of the i-th feature word t_i;
sorting the feature words in the set T by TF-IDF value in descending order to generate a base vector jvsm, where jvsm = (t_tfidf_1, ..., t_tfidf_i, ..., t_tfidf_jnt), t_tfidf_i = (t_i, tfidf(t_i)), t_i denotes the feature word of the i-th vector t_tfidf_i in the base vector space, tfidf(t_i) denotes the TF-IDF value of the i-th vector t_tfidf_i in the base vector space, and jnt denotes the number of distinct feature words in the set T and also the dimension of jvsm;
deleting from the base vector jvsm the feature words whose tfidf(t_i) is greater than the feature threshold FEA to obtain the standard feature vector vsm: vsm = (t_tfidf_1, ..., t_tfidf_i, ..., t_tfidf_nt), where t_tfidf_i = (t_i, tfidf(t_i)), t_i denotes the feature word of the i-th vector t_tfidf_i in the standard feature vector space, tfidf(t_i) denotes the TF-IDF value of the i-th vector t_tfidf_i in the standard feature vector space, and nt denotes the dimension of vsm;
vectorizing the microblog comment set data according to vsm to generate a vector space model data_vsm of the microblog comment set data, where data_vsm = {dv_nolikes_1, ..., dv_nolikes_i, ..., dv_nolikes_n}, dv_nolikes_i = (tv_1, ..., tv_j, ..., tv_nt), tv_j = (t_j, tfidf_j), t_j in tv_j denotes the j-th feature word of the i-th comment vector dv_nolikes_i in the vector space model data_vsm, tfidf_j in tv_j denotes the TF-IDF value of the j-th feature word of the i-th comment vector dv_nolikes_i, and nt denotes the number of feature words of the i-th comment vector dv_nolikes_i, which is also the dimension of dv_nolikes_i;
adding the number of likes to each comment vector in the vector space model data_vsm to generate the microblog comment vector set data_tfidf of the microblog comment set data, where data_tfidf = {dv_1, ..., dv_i, ..., dv_n}, dv_i = ((tv_1, ..., tv_j, ..., tv_nt), likes_i), tv_j = (t_j, tfidf_j), t_j denotes the j-th feature word of the i-th comment dv_i in the microblog comment vector set data_tfidf, likes_i denotes the number of likes of the i-th comment dv_i, and nt denotes the number of feature words of the i-th comment dv_i.
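The steps of this claim (sort by TF-IDF, filter by FEA, vectorize, attach likes) can be sketched as follows. This is only an illustration under loud assumptions: `build_vsm` and `vectorize` are hypothetical names, the per-word TF-IDF scores are taken as given input, the filter follows the claim wording as written (values greater than FEA are deleted), and the per-comment weight is a stand-in (occurrence count times the word's TF-IDF value) because the formula of claim 6 is only available as an image. The sketch also stores bare numeric weights rather than the (t_j, tfidf_j) pairs of the claim.

```python
def build_vsm(tfidf, fea):
    """Return the ordered (feature word, TF-IDF) pairs that make up vsm.

    tfidf: dict mapping each feature word in T to its TF-IDF value.
    """
    # jvsm: all feature words sorted by TF-IDF value, largest first.
    jvsm = sorted(tfidf.items(), key=lambda kv: kv[1], reverse=True)
    # Per the claim wording, words whose value exceeds FEA are deleted.
    return [(t, v) for t, v in jvsm if v <= fea]


def vectorize(data, vsm):
    """Map ((word_1, ..., word_m), likes) pairs to (vector, likes) pairs."""
    data_tfidf = []
    for words, likes in data:
        # Stand-in weight (assumption): count of t_j in the comment times tfidf(t_j).
        vec = tuple(words.count(t) * v for t, v in vsm)
        data_tfidf.append((vec, likes))
    return data_tfidf
```

Every comment vector has the same dimension nt, the length of vsm, which is what makes the distance between a center comment and the other comments well defined.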
5. The method for determining K-means initial cluster centers for microblog comment text according to claim 4, wherein the term frequency TF(t_i) of the i-th feature word t_i is expressed as:
Figure FDA0002476412390000031
where words denotes the total number of feature words in the set of all feature words T_Allwords, and word(t_i) denotes the number of occurrences of the i-th feature word t_i of the set T composed of non-repeated feature words.
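The two quantities defined here suggest a simple ratio, which the sketch below computes. Reading TF(t_i) as word(t_i) divided by words is an assumption, since the formula itself is only available as an image in this text; `term_frequencies` is a hypothetical name.

```python
from collections import Counter


def term_frequencies(data):
    """Corpus-level term frequency for each distinct feature word.

    data: iterable of ((word_1, ..., word_m), likes) pairs.
    Assumes TF(t_i) = word(t_i) / words, as suggested by the definitions.
    """
    # T_Allwords: every feature word occurrence across all comments.
    all_words = [w for words, _likes in data for w in words]
    counts = Counter(all_words)  # word(t_i) for each distinct t_i
    total = len(all_words)  # words
    return {t: n / total for t, n in counts.items()}
```

For example, with two comments containing the words (a, b) and (a), the word a occurs twice out of three occurrences in total, so its term frequency under this reading is 2/3.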
6. The method for determining K-means initial cluster centers for microblog comment text according to claim 4, wherein the TF-IDF value of the j-th feature word of the i-th comment vector dv_nolikes_i in the vector space model data_vsm is:
Figure FDA0002476412390000032
7. The method for determining K-means initial cluster centers for microblog comment text according to claim 1, wherein the distance in S2 is expressed as:
Figure FDA0002476412390000033
where D denotes the distance; core_clusters_i = (cc_tfidf_1, ..., cc_tfidf_j, ..., cc_tfidf_nt), and cc_tfidf_j denotes the TF-IDF value of the j-th feature word of the i-th comment core_clusters_i in the core comment cluster; the center comment is the vector (c_tfidf_1, ..., c_tfidf_j, ..., c_tfidf_nt), where c_tfidf_j denotes the TF-IDF value of the j-th feature word of the center comment; and nt denotes the dimension of a comment vector.
8. The method for determining K-means initial cluster centers for microblog comment text according to claim 1, wherein after S4, the method further comprises:
clustering the microblog comment vector set with the K-means clustering algorithm, starting from the finally obtained initial cluster centers, to generate the clustering result clusters and the cluster centers clusters_centers;
and determining whether the cluster centers clusters_centers changed between two consecutive iterations; if so, running the K-means clustering algorithm again from the clustering result clusters and the cluster centers clusters_centers until the cluster centers no longer change, and then outputting the clustering result clusters.
CN202010364885.9A 2020-04-30 2020-04-30 K-means initial cluster center determining method for microblog comment text Active CN111597335B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010364885.9A CN111597335B (en) 2020-04-30 2020-04-30 K-means initial cluster center determining method for microblog comment text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010364885.9A CN111597335B (en) 2020-04-30 2020-04-30 K-means initial cluster center determining method for microblog comment text

Publications (2)

Publication Number Publication Date
CN111597335A true CN111597335A (en) 2020-08-28
CN111597335B CN111597335B (en) 2023-07-14

Family

ID=72182418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010364885.9A Active CN111597335B (en) 2020-04-30 2020-04-30 K-means initial cluster center determining method for microblog comment text

Country Status (1)

Country Link
CN (1) CN111597335B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013015971A (en) * 2011-07-01 2013-01-24 Kddi Corp Representative comment extraction method and program
CN104281606A (en) * 2013-07-08 2015-01-14 腾讯科技(北京)有限公司 Method and device for displaying microblog comments
CN104484343A (en) * 2014-11-26 2015-04-01 无锡清华信息科学与技术国家实验室物联网技术中心 Topic detection and tracking method for microblog
CN108268470A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 A kind of comment text classification extracting method based on the cluster that develops

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
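Fan Jiajian (范佳健): "Cluster Analysis of Microblog Comment Information" (微博评论信息的聚类分析), China Excellent Master's Theses Database (中国硕士学位优秀论文数据库) *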

Also Published As

Publication number Publication date
CN111597335B (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN109165294B (en) Short text classification method based on Bayesian classification
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN111414479B (en) Label extraction method based on short text clustering technology
CN107944480B (en) Enterprise industry classification method
CN112084335A (en) Social media user account classification method based on information fusion
CN110134799B (en) BM25 algorithm-based text corpus construction and optimization method
CN108596637B (en) Automatic E-commerce service problem discovery system
CN113780007A (en) Corpus screening method, intention recognition model optimization method, equipment and storage medium
CN111754208A (en) Automatic screening method for recruitment resumes
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
TW202111569A (en) Text classification method with high scalability and multi-tag and apparatus thereof also providing a method and a device for constructing topic classification templates
CN110767211B (en) Voice synthesis broadcasting system based on text content data cleaning
TWI734085B (en) Dialogue system using intention detection ensemble learning and method thereof
CN110569351A (en) Network media news classification method based on restrictive user preference
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN107832307B (en) Chinese word segmentation method based on undirected graph and single-layer neural network
CN111309911B (en) Case topic discovery method for judicial field
CN113869054A (en) Deep learning-based electric power field project feature identification method
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
CN116881451A (en) Text classification method based on machine learning
CN108804422B (en) Scientific and technological paper text modeling method
CN111597335A (en) K-means initial clustering center determination method for microblog comment text
CN115391522A (en) Text topic modeling method and system based on social platform metadata
Guo Social network rumor recognition based on enhanced naive bayes
Chen et al. Understanding emojis for financial sentiment analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant