CN106021388A

CN106021388A - Classifying method of WeChat official accounts based on LDA topic clustering

Info

Publication number: CN106021388A
Application number: CN201610312725.3A
Authority: CN
Inventors: 郭泽豪; 王振宇; 李风环; 戴瑾如
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2016-05-11
Filing date: 2016-05-11
Publication date: 2016-10-12

Abstract

The invention provides a method for classifying WeChat official accounts based on LDA topic clustering. The method comprises the following steps: acquiring the articles pushed by each active WeChat official account; segmenting each acquired article into words with a word segmentation tool, filtering stop words, and computing the term frequency-inverse document frequency (TF-IDF) of the remaining words; selecting the remaining words whose TF-IDF value is greater than a threshold as the feature words of the article; performing latent topic discovery on the feature words of all articles pushed by active accounts using a document topic generation model, constructing article-topic feature vectors, and reducing the dimension of the article-topic feature vectors using principal component analysis; clustering the dimension-reduced article-topic feature vectors using the Level-Panel algorithm to obtain the clusters and the articles in each cluster; and determining the category of each WeChat official account from the cluster information of the articles it pushes. With this method, the category of a WeChat official account can be determined accurately, so that an advertiser can conveniently select the right WeChat official account for advertising.

Description

Classification method for WeChat official accounts based on LDA topic clustering

Technical field

The present invention relates to the field of text mining, and in particular to a classification method for WeChat official accounts based on LDA topic clustering.

Background art

WeChat official accounts were launched in August 2012, and by August 2015 their number had exceeded 10 million. Official accounts have penetrated deep into long-tail markets in a "small but beautiful" form. Accounts with many subscribers and high read counts monetize through advertising or e-commerce partnerships, with advertising being one of the most important monetization methods. Gradually, official accounts will cover every area of daily life, including news, sports, education, and automobiles. For advertisers, placing advertisements with official accounts in the appropriate field maximizes profit.

As a mainstream online-to-offline WeChat interactive marketing model, official accounts let merchants communicate and interact with specific groups on the WeChat platform through text, pictures, voice, and video. Official accounts include service accounts, subscription accounts, and enterprise accounts. Subscription accounts mainly provide information to users and attract followers by pushing articles, and the pushed articles usually match the account's function. Correctly identifying the function of a subscription account from the content of its pushed articles can effectively help advertisers choose suitable subscription accounts for advertising.

In June 2014, the Sogou search engine launched a search engine for WeChat official accounts. It lets users search for articles or official accounts by keyword and also subscribe to official accounts on the Sogou platform, making official accounts visible on the "external network" for the first time.

In summary, because WeChat official accounts are numerous and cover a wide range of fields, advertisers find it hard to choose a suitable account for advertising. The function of an official account corresponds to the articles it pushes. The Sogou search engine's access to official account data makes it possible to acquire and analyze that data. However, there is currently no efficient and highly accurate method for classifying WeChat official accounts.

Summary of the invention

To overcome the shortcomings and deficiencies of the prior art, the present invention provides a classification method for WeChat official accounts based on LDA topic clustering that classifies official accounts more efficiently and accurately.

To solve the above technical problem, the present invention provides the following technical solution: a classification method for WeChat official accounts based on LDA topic clustering, comprising the following steps:

S1. For each active WeChat official account, obtain the articles pushed by that account;

S2. Segment each obtained article into words with a word segmentation tool, filter stop words, and compute the term frequency-inverse document frequency of the remaining words;

S3. Select the remaining words whose term frequency-inverse document frequency value is greater than the threshold θ as the feature words of the article;

S4. Select the number of topics K, perform latent topic discovery on the feature words of all articles pushed by active official accounts using a document topic generation model, and construct the article-topic feature vectors;

S5. Reduce the dimension of the article-topic feature vectors using principal component analysis;

S6. Cluster the dimension-reduced article-topic feature vectors using the Level-Panel algorithm to obtain the clusters and the articles in each cluster;

S7. Determine the category of each WeChat official account from the cluster information of the articles it pushes.

Further, an active WeChat official account is a verified WeChat official account that pushes more than 3 articles every month.

Further, the word segmentation tool is a word segmentation tool based on the Chinese word segmentation system of the Chinese Academy of Sciences.

Further, the term frequency is the frequency with which a given word occurs in an article; the inverse document frequency is a measure of the general importance of a word and is the inverse of the document frequency.
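The feature-word selection of steps S2 and S3 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the articles are already segmented and stop-word filtered, and it assumes the common log-scaled form of the inverse document frequency, which the text above does not spell out. The function name `select_feature_words` and the sample data are hypothetical.

```python
import math
from collections import Counter

def select_feature_words(docs, theta=0.025):
    """docs: list of token lists (already segmented, stop words removed)."""
    n_docs = len(docs)
    # Document frequency: number of articles that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    features = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        selected = set()
        for word, count in counts.items():
            tf = count / total                  # term frequency within the article
            idf = math.log(n_docs / df[word])   # assumption: log-scaled IDF
            if tf * idf > theta:                # keep words above the threshold
                selected.add(word)
        features.append(selected)
    return features

docs = [
    ["car", "engine", "car", "news"],
    ["football", "match", "news"],
    ["car", "price", "news"],
]
features = select_feature_words(docs)
```

In this toy corpus the word "news" occurs in every article, so its IDF is zero and it is dropped from every article's feature set.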

Further, the article-topic feature vector takes the form p^{(i)} = (p_1^{(i)}, p_2^{(i)}, \ldots, p_n^{(i)}), where i denotes the i-th article, p_k^{(i)} is the probability of the k-th topic for the i-th article, and n is the number of topics.
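To illustrate how such a vector of topic probabilities can arise, the sketch below fits a small LDA model by collapsed Gibbs sampling and returns one probability vector per article. The source does not say which inference procedure is used, so the sampler, the hyperparameters `alpha` and `beta`, the iteration count, and the function name `lda_article_topic` are all assumptions made for illustration.

```python
import random

def lda_article_topic(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling; returns one topic-probability vector per article."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]            # counts per (article, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # counts per (topic, word)
    topic_total = [0] * n_topics                           # words assigned to each topic
    assign = []
    for d, doc in enumerate(docs):                         # random initial assignment
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for pos, w in enumerate(doc):
                k, wi = assign[d][pos], word_id[w]
                # Remove the current assignment before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][wi] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wi] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][pos] = k
                doc_topic[d][k] += 1
                topic_word[k][wi] += 1
                topic_total[k] += 1
    # Smoothed estimate of p_k^{(i)}; each vector sums to 1.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]

feature_docs = [["car", "engine", "car"], ["football", "match"], ["car", "price"]]
vectors = lda_article_topic(feature_docs, n_topics=2, n_iter=20)
```

The toy call uses 2 topics only to keep the example small; the value of K stated in the text is 100.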

Further, the detailed procedure of step S5 is:

(5a) Mean normalization: from all article-topic feature vectors, compute the mean \mu_j and standard deviation \sigma_j of each topic feature value, where j = 1, 2, ..., n and n is the number of topics; let

p_j^{(i)} = \frac{p_j^{(i)} - \mu_j}{\sigma_j}

to normalize all article-topic feature vectors, where i = 1, 2, ..., m and m is the number of documents;

(5b) Compute the covariance matrix as follows:

Cov = \frac{1}{m} \sum_{i=1}^{m} p^{(i)} (p^{(i)})^{T}

where p^{(i)} is the normalized topic feature vector of the i-th document and (p^{(i)})^{T} is the transpose of p^{(i)};

(5c) Singular value decomposition, yielding the matrices U, S, and V, computed as follows:

[U, S, V] = svd(Cov);

(5d) Select a suitable reduced dimension g according to the matrix S, computed as follows:

1 - \frac{\sum_{i=1}^{g} S_{ii}}{\sum_{i=1}^{m} S_{ii}} \le 0.01

where the minimum value g_min of g is taken; g_min is the most suitable dimension g that keeps the feature loss within [0, 0.01];

(5e) Take the first g column vectors of the matrix U to obtain the m × g matrix U_reduce, and use U_reduce to compute the dimension-reduced article feature vectors as follows:

z^{(i)} = p^{(i)} \times U_{reduce}^{T}

where z^{(i)} is the dimension-reduced article feature vector and U_{reduce}^{T} is the transpose of the matrix U_reduce.
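The dimension selection of step (5d) can be sketched as below. This is an illustrative helper, not the patent's code: it assumes the diagonal of S is supplied as a plain list sorted in descending order (as SVD routines typically return it), and it sums over all supplied values in the denominator. The name `choose_dimension` and the sample values are hypothetical.

```python
def choose_dimension(singular_values, max_loss=0.01):
    """Return the smallest g for which 1 - sum(S[:g]) / sum(S) <= max_loss."""
    total = sum(singular_values)
    retained = 0.0
    for g, value in enumerate(singular_values, start=1):
        retained += value
        if 1 - retained / total <= max_loss:
            return g
    return len(singular_values)

# Hypothetical diagonal of S, largest value first.
s_diag = [5.0, 3.0, 1.5, 0.45, 0.05]
g_min = choose_dimension(s_diag)
```

With these sample values the first three components still lose 5% of the total, while the first four lose 0.5%, so the helper returns 4.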

Further, the Level-Panel algorithm in step S6 is a text clustering method based on the vector space model, with the following steps:

According to the article-topic feature vectors, an article set D = {d_1, d_2, ..., d_m} is given, where d_i is the dimension-reduced article-topic feature vector of the i-th article;

(6a) Treat every article in the article set D as a cluster C_i = {d_i} containing a single member, where i = 1, 2, ..., m;

(6b) Choose any one of the single-member clusters C_k as the starting point of the clustering;

(6c) Among the samples not yet clustered, find the points whose distance to the cluster C_k is less than or equal to the threshold θ, that is, any C_i with similarity sim(C_k, C_i) >= θ, and merge it with C_k to form the new cluster C_k = C_k ∪ C_i;

(6d) Repeat step (6c) until the distance between every unclustered sample and C_k is greater than the threshold θ; one cluster has then been formed;

(6e) Repeat step (6b) until all single-member clusters C_i have taken part in the clustering.
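Steps (6a) to (6e) can be sketched as the greedy threshold clustering below. Several details are assumptions, since the text does not fix them: the distance is taken to be Euclidean, the distance from a point to a cluster is taken to be the distance to its nearest member, and the starting cluster is simply the first unclustered article. The name `level_panel` and the sample points are hypothetical.

```python
import math

def level_panel(vectors, theta):
    """Greedy threshold clustering; returns clusters as lists of article indices."""
    unclustered = list(range(len(vectors)))   # (6a) every article is its own cluster
    clusters = []
    while unclustered:                        # (6e) until every article is clustered
        cluster = [unclustered.pop(0)]        # (6b) pick a starting cluster
        grew = True
        while grew:                           # (6d) repeat until nothing is close enough
            grew = False
            for idx in list(unclustered):
                # Assumption: point-to-cluster distance = distance to nearest member.
                dist = min(math.dist(vectors[idx], vectors[m]) for m in cluster)
                if dist <= theta:             # (6c) merge into the current cluster
                    cluster.append(idx)
                    unclustered.remove(idx)
                    grew = True
        clusters.append(cluster)
    return clusters

points = [(0.0, 0.0), (0.0, 0.02), (1.0, 1.0), (0.0, 0.04), (1.0, 1.02)]
clusters = level_panel(points, theta=0.025)
```

Because membership is decided against the nearest member, article 3 joins the first cluster through article 1 even though it is farther than θ from article 0.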

Further, the threshold θ is set to θ = 0.025, and the number of topics K is set to K = 100.

With the above technical solution, the present invention has at least the following beneficial effects:

1. Building on the term frequency-inverse document frequency (TF-IDF) features computed for the articles pushed by WeChat official accounts, the present invention filters out words whose TF-IDF value is below the threshold, retaining the main features of each article and avoiding interference from secondary features.

2. A document topic generation model (LDA) performs latent topic discovery on the feature words of the articles to obtain article-topic feature vectors, which describe the articles at the semantic level while reducing computational cost.

3. Principal component analysis (PCA) reduces the dimension of the article-topic feature vectors and uncovers the correlations between topics, thereby arriving at a more suitable number of topics.

4. The Level-Panel algorithm clusters the dimension-reduced article-topic feature vectors; the Level-Panel algorithm is based on the vector space model and performs very well on text clustering.

5. The present invention classifies WeChat official accounts effectively and accurately, helps advertisers select suitable WeChat official accounts for advertising, and has good practicality.

Brief description of the drawings

Fig. 1 is a flowchart of the classification method for WeChat official accounts based on LDA topic clustering of the present invention;

Fig. 2 is a flowchart of the principal component analysis in the classification method for WeChat official accounts based on LDA topic clustering of the present invention;

Fig. 3 is a flowchart of the Level-Panel algorithm in the classification method for WeChat official accounts based on LDA topic clustering of the present invention.

Detailed description

It should be noted that, provided they do not conflict, the embodiments of this application and the features in the embodiments may be combined with one another. The application is described in further detail below with reference to the accompanying drawings and a specific embodiment.

Embodiment

Fig. 1 is a flowchart of the classification method for WeChat official accounts based on LDA topic clustering disclosed in this embodiment and of each corresponding step. As shown in Fig. 1, the method comprises the following steps:

S1. For each active WeChat official account, obtain the articles pushed by that account;

S2. Segment each article into words with a word segmentation tool, filter stop words, and compute the term frequency-inverse document frequency (TF-IDF) of its words;

S3. Select the words whose TF-IDF value is greater than the threshold θ (θ = 0.025) as the feature words of the article;

S4. Select the number of topics K (K = 100), perform latent topic discovery on the feature words of all articles pushed by active official accounts using a document topic generation model (LDA), and construct the article-topic feature vectors;

S5. Reduce the dimension of the article-topic feature vectors using principal component analysis (PCA);

S6. Cluster the dimension-reduced article-topic feature vectors using the Level-Panel algorithm to obtain the clusters and the articles in each cluster;

S7. Determine the category of each WeChat official account from the cluster information of the articles it pushes.

The above method computes the term frequency-inverse document frequency (TF-IDF) features of the articles pushed by WeChat official accounts, builds a word feature vector for every article, and normalizes the word feature vectors with a Gaussian function. It uses a document topic generation model (LDA) to perform latent topic discovery on the feature words of the articles, obtains the word-topic probability distribution, and builds the article-topic feature vectors from the word-topic distribution probabilities, which describes the articles at the semantic level while reducing computational cost. It uses principal component analysis (PCA) to reduce the dimension of the article-topic feature vectors and uncover the correlations between topics, thereby arriving at a more suitable number of topics. It uses the Level-Panel algorithm to cluster the dimension-reduced article-topic feature vectors, and it determines the category of each WeChat official account from the information about the clusters to which its pushed articles belong. The present invention can accurately determine the category of a WeChat official account, making it easy for advertisers to select the right WeChat official account for advertising.
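The final step, deriving an account's category from the clusters of its articles, can be sketched as follows. The text does not state the exact decision rule, so this sketch assumes a simple majority vote: an account is assigned the cluster that contains most of its articles. The function name `classify_accounts`, the account names, and the cluster ids are hypothetical.

```python
from collections import Counter

def classify_accounts(article_account, article_cluster):
    """Map each account to the cluster holding most of its articles (assumed majority rule)."""
    votes = {}
    for article, account in article_account.items():
        votes.setdefault(account, Counter())[article_cluster[article]] += 1
    return {account: counter.most_common(1)[0][0] for account, counter in votes.items()}

# Hypothetical data: which account pushed each article, and each article's cluster id.
article_account = {"a1": "auto_daily", "a2": "auto_daily", "a3": "auto_daily", "a4": "sports_now"}
article_cluster = {"a1": 0, "a2": 0, "a3": 1, "a4": 1}
labels = classify_accounts(article_account, article_cluster)
```

Other rules, such as requiring a minimum share of articles in the winning cluster, would fit the same interface.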

Here, an active WeChat official account is a verified WeChat official account that pushes more than 3 articles every month.
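The activity criterion can be expressed as a small filter. The record layout used here (keys `name`, `verified`, and `monthly_counts`) is a hypothetical structure chosen for illustration; the source does not describe how account data is stored.

```python
def is_active(account):
    """True if the account is verified and pushed more than 3 articles in every listed month."""
    counts = account["monthly_counts"]
    return account["verified"] and bool(counts) and all(c > 3 for c in counts)

# Hypothetical account records.
accounts = [
    {"name": "auto_daily", "verified": True, "monthly_counts": [12, 9, 15]},
    {"name": "quiet_blog", "verified": True, "monthly_counts": [4, 2, 6]},
    {"name": "unverified_news", "verified": False, "monthly_counts": [30, 28, 31]},
]
active = [a["name"] for a in accounts if is_active(a)]
```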

Further, the word segmentation tool is a word segmentation tool (Ansj) based on the Chinese word segmentation system of the Chinese Academy of Sciences.

Further, the term frequency is the frequency with which each word occurs in an article, and the inverse document frequency is a measure of the general importance of a word: if a word occurs frequently in other documents, its inverse document frequency is low; otherwise its inverse document frequency is high. The inverse document frequency is the inverse of the document frequency.

Further, as shown in Fig. 2, the detailed procedure by which step S5 reduces the dimension of the article-topic feature vectors using principal component analysis (PCA) is:

(5a) The first step is mean normalization: from all article-topic feature vectors, compute the mean \mu_j and standard deviation \sigma_j of each topic feature value, where j = 1, 2, ..., n and n is the number of topics; let

p_j^{(i)} = \frac{p_j^{(i)} - \mu_j}{\sigma_j}

to normalize all article-topic feature vectors, where i = 1, 2, ..., m and m is the number of documents;

(5b) The second step is to compute the covariance matrix Cov as follows:

Cov = \frac{1}{m} \sum_{i=1}^{m} p^{(i)} (p^{(i)})^{T}

(5c) The third step is singular value decomposition, yielding the matrices U, S, and V, computed as follows:

[U, S, V] = svd(Cov);

(5d) The fourth step is to select a suitable reduced dimension g according to the matrix S, computed as follows:

1 - \frac{\sum_{i=1}^{g} S_{ii}}{\sum_{i=1}^{m} S_{ii}} \le 0.01

The minimum value g_min of g is taken, where g_min is the most suitable dimension g that keeps the feature loss within the range [0, 0.01];

(5e) The fifth step takes the first g column vectors of the matrix U to obtain the m × g matrix U_reduce, and uses U_reduce to compute the dimension-reduced article feature vectors as follows:

z^{(i)} = p^{(i)} \times U_{reduce}^{T},

where U_{reduce}^{T} is the transpose of the matrix U_reduce.
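Steps (5a) and (5b) above can be sketched in plain Python as follows. The sketch makes a few assumptions that the text leaves open: each article-topic vector is treated as a length-n column vector, so that the sum of outer products is an n × n matrix; the standard deviation is the population form (dividing by m); and a topic dimension with zero variance is mapped to 0 instead of dividing by zero. The function names and sample vectors are hypothetical.

```python
import math

def normalize(vectors):
    """Step (5a): z-score every topic dimension across all articles."""
    m, n = len(vectors), len(vectors[0])
    mu = [sum(v[j] for v in vectors) / m for j in range(n)]
    sigma = [math.sqrt(sum((v[j] - mu[j]) ** 2 for v in vectors) / m) for j in range(n)]
    # Assumption: a constant dimension (sigma = 0) becomes 0.0.
    return [[(v[j] - mu[j]) / sigma[j] if sigma[j] else 0.0 for j in range(n)] for v in vectors]

def covariance(normalized):
    """Step (5b): Cov = (1/m) * sum over articles of the outer product p p^T (n x n)."""
    m, n = len(normalized), len(normalized[0])
    return [[sum(p[a] * p[b] for p in normalized) / m for b in range(n)] for a in range(n)]

# Hypothetical article-topic vectors for three articles and two topics.
vectors = [[0.7, 0.3], [0.5, 0.5], [0.3, 0.7]]
cov = covariance(normalize(vectors))
```

In this example the two topic probabilities are perfectly anti-correlated, so the normalized covariance has 1 on the diagonal and -1 off the diagonal, which is the kind of topic correlation that PCA then removes.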

Further, as shown in Fig. 3, the Level-Panel algorithm is a text clustering method based on the vector space model, with the following steps:

According to the article-topic feature vectors, an article set D = {d_1, d_2, ..., d_m} is given, where d_i is the topic vector of the i-th article after dimension reduction;

(6a) Treat every article in D as a cluster C_i = {d_i} containing a single member, where i = 1, 2, ..., m;

(6c) Among the samples not yet clustered, find the points whose distance to C_k is less than the threshold θ, that is, any C_i with similarity sim(C_k, C_i) >= θ, and merge it with C_k to form the new cluster C_k = C_k ∪ C_i;

The above method classifies WeChat official accounts effectively and accurately and has good usability.

Although an embodiment of the present invention has been shown and described, a person of ordinary skill in the art will understand that various equivalent changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the claims and their equivalents.

Claims

1. A classification method for WeChat official accounts based on LDA topic clustering, characterized in that the method comprises the following steps:

S2. Segment each obtained article into words with a word segmentation tool, filter stop words, and compute the term frequency-inverse document frequency of the remaining words;

S4. Select the number of topics K, perform latent topic discovery on the feature words of all articles pushed by active official accounts using a document topic generation model, and construct the article-topic feature vectors;

S6. Cluster the dimension-reduced article-topic feature vectors using the Level-Panel algorithm to obtain the clusters and the articles in each cluster;

2. The classification method for WeChat official accounts based on LDA topic clustering according to claim 1, characterized in that an active WeChat official account is a verified WeChat official account that pushes more than 3 articles every month.

3. The classification method for WeChat official accounts based on LDA topic clustering according to claim 1, characterized in that the word segmentation tool is a word segmentation tool based on the Chinese word segmentation system of the Chinese Academy of Sciences.

4. The classification method for WeChat official accounts based on LDA topic clustering according to claim 1, characterized in that the term frequency is the frequency with which a given word occurs in an article, and the inverse document frequency is a measure of the general importance of a word and is the inverse of the document frequency.

5. The classification method for WeChat official accounts based on LDA topic clustering according to claim 1, characterized in that the article-topic feature vector takes the form p^{(i)} = (p_1^{(i)}, p_2^{(i)}, \ldots, p_n^{(i)}), where i denotes the i-th article, p_k^{(i)} is the probability of the k-th topic for the i-th article, and n is the number of topics.

6. The classification method for WeChat official accounts based on LDA topic clustering according to claim 1, characterized in that the detailed procedure of step S5 is:

(5a) Mean normalization: from all article-topic feature vectors, compute the mean \mu_j and standard deviation \sigma_j of each topic feature value, where j = 1, 2, ..., n and n is the number of topics; let

p_j^{(i)} = \frac{p_j^{(i)} - \mu_j}{\sigma_j}

to normalize all article-topic feature vectors, where i = 1, 2, ..., m and m is the number of documents;

(5b) Compute the covariance matrix as follows:

Cov = \frac{1}{m} \sum_{i=1}^{m} p^{(i)} (p^{(i)})^{T}

[U, S, V] = svd(Cov);

1 - \frac{\sum_{i=1}^{g} S_{ii}}{\sum_{i=1}^{m} S_{ii}} \le 0.01

(5e) Take the first g column vectors of the matrix U to obtain the m × g matrix U_reduce, and use U_reduce to compute the dimension-reduced article feature vectors as follows:

z^{(i)} = p^{(i)} \times U_{reduce}^{T}

7. The classification method for WeChat official accounts based on LDA topic clustering according to claim 1, characterized in that the Level-Panel algorithm in step S6 is a text clustering method based on the vector space model, with the following steps:

According to the article-topic feature vectors, an article set D = {d_1, d_2, ..., d_m} is given, where d_i is the dimension-reduced article-topic feature vector of the i-th article;

(6a) Treat every article in the article set D as a cluster C_i = {d_i} containing a single member, where i = 1, 2, ..., m;

(6c) Among the samples not yet clustered, find the points whose distance to the cluster C_k is less than or equal to the threshold θ, that is, any C_i with similarity sim(C_k, C_i) >= θ, and merge it with C_k to form the new cluster C_k = C_k ∪ C_i;

8. The classification method for WeChat official accounts based on LDA topic clustering according to claim 1, characterized in that the threshold θ is set to θ = 0.025 and the number of topics K is set to K = 100.