CN103942340A

CN103942340A - Microblog user interest recognizing method based on text mining

Info

Publication number: CN103942340A
Application number: CN201410195244.XA
Authority: CN
Inventors: 屈鸿; 王晓斌; 李�浩; 方正; 袁建
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2014-05-09
Filing date: 2014-05-09
Publication date: 2014-07-23

Abstract

The invention discloses a microblog user interest recognizing method based on text mining, and belongs to the field of text mining and natural language processing. The method includes the steps of collecting the newest topical microblog text data of a microblog text set and microblog text data of a designated user, standardizing the collected microblog text data, recognizing the newest microblog words and renewing a new word dictionary for the standardized topical microblog text data through the microblog new word recognition method, conducting Chinese character word separation on the standardized microblog text data of the designated user through the new word dictionary word separation method to achieve text vector expression, clustering the microblog text data, expressed through text vectors, of the designated user, recombining original microblog text data, extracting new text set features through a topic model, presetting topic dictionaries, calculating the weight of each topic dictionary based on the new text set features to obtain the final topic, and enabling the final topic to serve as the microblog user interest recognition, thereby improving accuracy of feature extraction.

Description

A microblog user interest recognition method based on text mining

Technical field

A microblog user interest recognition method based on text mining. Text clustering uses an improved K-Means algorithm to cluster short texts, and topic modeling combines VSM with the LDA model to extract text feature words. The invention belongs to the fields of text mining, natural language processing, and machine learning.

Background technology

Text feature extraction is a key step in text mining: the similarity between texts is computed from the extracted features and applied to text classification, clustering, and so on. With the widespread use of microblogs, text mining techniques are widely applied to microblog text; by analyzing microblog text, current hot topics can be mined, topics can be tracked, and so on.

Topic models perform fairly well for text feature extraction. A topic model regards a text as a set of topics following some probability distribution, and each topic in turn as a set of terms following some probability distribution, which extends the text from the two-dimensional "text-word" space to the three-dimensional "text-topic-word" space. A topic model can effectively obtain the features of a text and discover its latent semantics, namely its topics. When a topic model is applied to microblog short texts, however, the topics of the short texts are unclear and the data are sparse, so the topics of microblog short texts cannot be found accurately.

A clustering algorithm can recombine the set of microblog short texts into a new set of long texts, which makes the topics of the new text set clearer and reduces data sparsity. K-Means is a typical distance-based clustering algorithm. It works as follows: randomly select K samples as the centers of K classes, compute the distance from every other sample to each center, and assign each sample to the class of the nearest center. The center of each class is then updated, and this step is iterated until the centers no longer change between two successive iterations.

LDA (Latent Dirichlet Allocation) is one of the better topic models. It regards a text as composed of a series of topics following a multinomial distribution, and each topic in turn as composed of a series of words following a Dirichlet distribution. The idea of the LDA model is to sample according to probability distributions: draw a topic from the topic distribution, then draw a word from the word distribution under that topic. This step is iterated until all words in the text have been drawn, yielding the final result: a "text-topic" probability matrix and a "topic-word" probability matrix. From these two matrices the topics of the original text are extracted. LDA has also been improved for the characteristics of microblog short texts, for example in the MB-LDA model, which takes the association between microblog authors and texts into account to assist topic mining. That model is derived using Gibbs sampling and can mine not only the topics of microblogs but also the topics that contacts follow.
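The two-stage sampling idea described above (draw a topic, then draw a word under that topic) can be illustrated with a toy sketch. The function name `generate_document` and the distributions `doc_topic` and `topic_word` are hypothetical values chosen for illustration; they are not taken from the invention, and the sketch shows only the generative step, not parameter inference.

```python
import random


def generate_document(doc_topic, topic_word, length, rng):
    """Generate `length` (topic, word) pairs by two-stage sampling."""
    topics = list(doc_topic)
    pairs = []
    for _ in range(length):
        # Stage 1: draw a topic from the document's topic distribution.
        topic = rng.choices(topics, weights=[doc_topic[t] for t in topics])[0]
        # Stage 2: draw a word from the word distribution of that topic.
        vocab = list(topic_word[topic])
        word = rng.choices(vocab, weights=[topic_word[topic][w] for w in vocab])[0]
        pairs.append((topic, word))
    return pairs


# Hypothetical distributions, for illustration only.
doc_topic = {"game": 0.7, "film": 0.3}
topic_word = {
    "game": {"level": 0.5, "quest": 0.5},
    "film": {"actor": 0.6, "scene": 0.4},
}
sample = generate_document(doc_topic, topic_word, 5, random.Random(0))
```

Fitting an LDA model runs this process in reverse: given only the words, it estimates the two probability matrices.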

Chinese word segmentation refers to cutting a sequence of Chinese characters into individual words.

N-gram segmentation means that each word obtained by segmentation consists of N characters. Current well-performing Chinese automatic word segmenters, such as the ICTCLAS word segmentation system of the Chinese Academy of Sciences, can improve segmentation quality by adding a user-supplied dictionary.
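As a minimal sketch of N-gram segmentation, the helper below (the name `ngram_cut` is hypothetical) returns every run of N consecutive characters in a string. It illustrates only the cutting step and does not involve ICTCLAS or any dictionary.

```python
def ngram_cut(text, n):
    """Return every contiguous run of n characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = ngram_cut("微博用户", 2)   # 2-gram cut of a four-character string
too_short = ngram_cut("微", 2)       # a string shorter than n yields no candidates
```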

Text representation expresses text in a concise, uniform, structured form that learning algorithms and classifiers can recognize. The most commonly used text representation model is the vector space model (VSM), in which feature weights are computed for each text so that each text is uniquely represented by a feature vector. Each value of the feature vector is obtained by computing a TF-IDF value.

Existing microblog user interest recognition methods still have many shortcomings, specifically:

One, microblog short text feature extraction does not add effective new words, which leads to a high loss rate in the results obtained.

Two, existing techniques generally analyze massive amounts of microblog text, or use microblog functions, to mine hot topics, events, and so on. They do not propose a reasonable method for analyzing a single user's microblog text in relation to that user's interests, even though a user's microblog text is an important information source for recognizing user interest.

Three, because microblog short texts are unstructured and sparse, the accuracy of feature extraction is low.

Summary of the invention

To address the deficiencies of the prior art, the present invention provides a microblog user interest recognition method based on text mining, which makes it possible to observe a user's recent interests and habits directly from the microblogs that the user publishes.

To achieve the above object, the technical solution adopted by the invention is:

A microblog user interest recognition method based on text mining, characterized by the following steps:

(1) Collect the latest topic microblog text data of the microblog text set and the designated user microblog text data;

(2) Standardize the collected topic microblog text data and designated user microblog text data;

(3) Apply the microblog new word identification method to the standardized topic microblog text data to identify the latest microblog new words, and update the new word dictionary;

(4) Segment the standardized designated user microblog text data into Chinese words using the new word dictionary segmentation method, and compute the TF-IDF value of each term obtained by segmentation to obtain a text vector representation;

(5) Cluster the designated user microblog text data expressed as text vectors, and recombine the original designated user microblog text data from step (1) to obtain a new text set and the number of clusters;

(6) Use LDA topic modeling to extract the feature words of the new text set obtained after clustering;

(7) Given topic dictionaries, compute the weight of each topic dictionary based on the feature words of the new text set to obtain the final topics, which serve as the recognized microblog user interests.

Preferably, in step (3), the steps of the microblog new word identification method are:

(31) Collect the standardized topic microblog text data;

(32) Preprocess the topic microblog text data;

(33) Perform multi-gram segmentation on the preprocessed topic microblog text data;

(34) Filter the words obtained by multi-gram segmentation: old word filtering, word frequency filtering, adjacent string filtering, and mutual information filtering.

Preferably, in step (34), the mutual information value is calculated by the formula:

I (A, B) = \log_{2} \frac{p (A, B)}{p (A) p (B)},

where A and B each denote a word in the text (the topic microblog text data), p(A, B) is the probability that words A and B occur together, p(A) is the probability that word A occurs alone, p(B) is the probability that word B occurs alone, and I(A, B) is the mutual information value between word A and word B.
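The formula can be evaluated directly once the probabilities are estimated. The sketch below assumes they are estimated as relative frequencies from raw counts over a common total; that estimation choice and the function name `mutual_information` are assumptions, since the text does not specify how the probabilities are obtained.

```python
import math


def mutual_information(count_ab, count_a, count_b, total):
    """I(A, B) = log2(p(A, B) / (p(A) * p(B))) with count-based estimates."""
    if count_ab == 0:
        return float("-inf")  # A and B never co-occur
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log2(p_ab / (p_a * p_b))


# A and B each occur 20 times in 1000 positions and co-occur 10 times.
score = mutual_information(10, 20, 20, 1000)
```

A large positive value indicates that A and B occur together far more often than chance would predict, which is the evidence used to treat AB as a single new word.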

Preferably, in step (5), the steps of the clustering are:

(51) Convert the designated user microblog text data into a V-dimensional text vector representation, where V is the mean length of the N texts (designated user microblog text data), and select K data points as the centers of the K cluster classes using the initial center selection method;

(52) Use the Euclidean distance to compute the distance between each data point in the designated user microblog text data and the K centers, obtaining the clusters; the distances are denoted d_{ij} (i = 1 to N, j = 1 to K), where d_{ij} is the distance from the i-th data point to the j-th center, N is the number of data points, i indexes the data points, and j indexes the cluster centers;

(53) Recompute the cluster centers of the obtained clusters: let min be the minimum of d_{ij} over j, that is, select the center j nearest to the i-th data point, and set a threshold c; if min > c, make i a new center; otherwise document i belongs to the class of center j;

(54) Update the center of each class by recomputing the center of each cluster;

(55) Repeat steps (52), (53) and (54) until convergence, where the convergence condition is that the centers do not change between two successive iterations.
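Steps (52) to (55) can be sketched as follows. This is one possible reading rather than a definitive implementation: the function names, the `max_iter` safeguard, and the choice to recompute each center as the mean of its members are assumptions not stated in the text.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def threshold_kmeans(points, centers, c, max_iter=100):
    """K-Means variant in which a point farther than c from every center
    becomes a new center, so the number of clusters can grow."""
    centers = [list(ctr) for ctr in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            # (52) distance from the point to every current center
            dists = [euclidean(p, ctr) for ctr in centers]
            j = min(range(len(centers)), key=dists.__getitem__)
            if dists[j] > c:
                # (53) too far from all centers: open a new cluster
                centers.append(list(p))
                clusters.append([p])
            else:
                clusters[j].append(p)
        # (54) recompute each center as the mean of its members (assumption)
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else ctr
            for members, ctr in zip(clusters, centers)
        ]
        # (55) stop when the centers no longer change
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
centers, clusters = threshold_kmeans(points, [(0, 0), (10, 10)], c=5)
```

In this example the point (50, 50) lies farther than c from both initial centers, so a third cluster is created during the first pass.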

Preferably, in step (51), the steps of the initial center selection method are as follows:

(511) Randomly select one data point from the N data points and denote it center;

(512) Compute the distance dis(center, m) (m = 1 to N) from each of the other N-1 data points to center, and accumulate all distances: sum{dis(center, m)};

(513) Randomly select a value r = random(sum{dis(center, m)}) and compute r = r - dis(center, m); if r < 0, the m-th data point is marked as a center, where random(sum{dis(center, m)}) denotes choosing a value at random from 0 to sum{dis(center, m)};

(514) Repeat steps (511) and (512) until K centers have been selected.
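The selection procedure resembles the seeding step of K-Means++, and the sketch below follows that interpretation. Two points are assumptions: the distance used for each candidate is its distance to the nearest center chosen so far (the text writes only dis(center, m)), and the fallback for the floating-point edge case is added for robustness. The function name `choose_initial_centers` is hypothetical.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def choose_initial_centers(points, k, rng):
    """Pick k initial centers; distant points are more likely to be chosen."""
    centers = [rng.choice(points)]                      # (511) first center at random
    while len(centers) < k:
        # (512) distance of every point to its nearest chosen center
        dists = [min(euclidean(p, ctr) for ctr in centers) for p in points]
        r = rng.uniform(0, sum(dists))                  # (513) random value in [0, sum]
        for p, d in zip(points, dists):
            r -= d
            if r < 0:
                centers.append(p)
                break
        else:
            # r landed exactly on the total: fall back to the farthest point
            centers.append(points[dists.index(max(dists))])
    return centers


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
initial = choose_initial_centers(points, 3, random.Random(1))
```

Because an already chosen point has distance 0, subtracting it can never push r below zero, so the same point is not selected twice unless the data contain duplicates.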

Preferably, in step (6), the steps of extracting the feature words of the new text set are:

(61) For the new text set obtained after clustering, compute the TF-IDF value of the words in each text of the new text set to obtain new text vectors;

(62) Model the new text set with the LDA model, setting the parameter values and varying the initial parameter values several times, and obtain the "topic-word" distribution and the "document-topic" distribution by sampling;

(63) Extract the feature words using the final feature word extraction method.

Preferably, in step (63), the steps of the final feature word extraction method are as follows:

(631) For the new text set, select from the "document-topic" distribution the topic with the largest weight as the key topic keyTopic;

(632) Select the "topic-word" distribution corresponding to keyTopic;

(633) From the word distribution corresponding to keyTopic, take the three words with the largest proportions; if a topic is extracted more than once, record the number of times it was extracted as keyCount;

(634) Repeat steps (631), (632) and (633) until the new text set has been traversed, obtaining all feature words.
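Steps (631) to (634) can be sketched as below, assuming the "document-topic" distribution is given as one list of topic weights per document and the "topic-word" distribution as one word-to-probability mapping per topic. These data layouts, the function name `extract_feature_words`, and the example values are assumptions for illustration.

```python
from collections import Counter


def extract_feature_words(doc_topic, topic_word, top_n=3):
    """For each document take its heaviest topic (keyTopic) and that
    topic's top_n words; count how often each topic is selected."""
    key_count = Counter()
    feature_words = {}
    for topic_weights in doc_topic:
        # (631) topic with the largest weight for this document
        key_topic = max(range(len(topic_weights)), key=topic_weights.__getitem__)
        key_count[key_topic] += 1
        if key_topic not in feature_words:
            # (632)-(633) top words of the key topic's word distribution
            ranked = sorted(topic_word[key_topic].items(),
                            key=lambda kv: kv[1], reverse=True)
            feature_words[key_topic] = [word for word, _ in ranked[:top_n]]
    return feature_words, key_count


# Hypothetical distributions for three documents and two topics.
doc_topic = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
topic_word = [
    {"game": 0.4, "level": 0.3, "quest": 0.2, "map": 0.1},
    {"film": 0.5, "actor": 0.3, "scene": 0.15, "plot": 0.05},
]
features, key_count = extract_feature_words(doc_topic, topic_word)
```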

Preferably, in step (4) and step (61), the TF-IDF value is calculated as follows:

w_{ij} = {tf}_{ij} \times {idf}_{j} = {tf}_{ij} \times \log (\frac{N}{n_{j}})

where w_{ij} is the TF-IDF value of word j in document i (a designated user microblog text, or a new text obtained by recombining the designated user microblog text data); tf_{ij} is the frequency with which word j occurs in document i; n_{j} is the number of documents in the document set (the designated user microblog text data set, or the new text set obtained by recombination) in which word j occurs; N is the total number of texts in that document set; and idf_{j} is the logarithm of the ratio of N to n_{j}.
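A minimal sketch of the formula over already segmented documents follows. Two details are assumptions because the text leaves them open: tf_{ij} is taken as the relative frequency of the word within the document, and the logarithm is the natural logarithm. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: w_ij} mapping per document, w_ij = tf_ij * log(N / n_j)."""
    n_docs = len(docs)
    # n_j: number of documents that contain word j
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


docs = [["game", "game", "film"], ["film", "music"]]
vectors = tf_idf(docs)
```

A word that appears in every document, such as "film" here, receives weight 0 because log(N / n_j) = log(1) = 0.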

Preferably, in step (7), the steps of the microblog user interest recognition are:

(71) S topic dictionaries are given;

(72) From the feature words extracted by modeling the new text set with the LDA model, count the number N_{i} of feature words contained in each topic dictionary (N_{i} is an integer); a word that matches no dictionary is labeled as an additional category;

(73) Each feature word carries a weight; compute the weight of each topic dictionary using the following formula:

w_{i} = \frac{\sum_{j = 1}^{N_{i}} weight (Term [j])}{N_{i}},

where Term[j] is the j-th word, weight(Term[j]) is the weight of word j, N_{i} is the number of feature words contained in topic dictionary i, and w_{i} is the weight of topic dictionary i;

(74) Sort the topic dictionaries by weight; a topic dictionary i with weight w_{i} > η (η > 0, where η is a preset threshold) is selected as an interest topic word of the user; otherwise, topic dictionary i is deleted;

(75) If a selected interest topic word is a sentiment word, compute the polarity of the sentiment word, calculated as follows:

(76) The non-sentiment topic words obtained and the computed sentiment polarities are taken as the interest topic words.
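Steps (72) to (74) can be sketched as follows. The function names, the example dictionaries, and the feature word weights are hypothetical; the sentiment polarity step (75) is not covered because its formula is not given in the text.

```python
def topic_weights(feature_weights, dictionaries):
    """w_i = (sum of weights of the feature words found in dictionary i) / N_i."""
    weights = {}
    matched = set()
    for topic, vocab in dictionaries.items():
        hits = [word for word in feature_weights if word in vocab]
        matched.update(hits)
        if hits:  # N_i > 0
            weights[topic] = sum(feature_weights[w] for w in hits) / len(hits)
    # (72) words that match no dictionary form the additional category
    unmatched = [word for word in feature_weights if word not in matched]
    return weights, unmatched


def select_interests(weights, eta):
    """(74) Sort topics by weight and keep those with w_i > eta."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [topic for topic, w in ranked if w > eta]


# Hypothetical feature words with weights, and hypothetical dictionaries.
feature_weights = {"World of Warcraft": 0.9, "quest": 0.5, "Avatar": 0.4, "noodles": 0.2}
dictionaries = {
    "game": {"World of Warcraft", "quest"},
    "film": {"Avatar"},
    "music": {"piano"},
}
weights, unmatched = topic_weights(feature_weights, dictionaries)
interests = select_interests(weights, eta=0.5)
```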

Compared with the prior art, the advantages of the present invention are:

One, in microblog new word identification, the microblog data used are the latest microblog topic data. Most new words appear in topics, so this reduces the volume of raw data and improves preprocessing efficiency. The identification process uses four layers of filtering (raw data filtering, word frequency filtering, adjacent string filtering, and mutual information filtering), and each additional layer improves the accuracy of new word identification.

Two, microblog short text feature extraction uses a K-Means clustering process, adds K-Means++ style selection of the initial centers, and lets the value of K change during clustering, which suits the uncertain topics of microblog short texts and makes the clustering result more effective.

Three, microblog short text feature extraction combines VSM with LDA, taking into account both word frequency and the latent topics in the text; the model training corpus is the new text set obtained after clustering, so the results are better than those of feature extraction with the LDA model alone.

Four, the extracted feature words are used to compute the word frequency of the given topic dictionaries, moving from the language of the text to words that describe the user's interests and habits. With explicit topics available, this is more accurate than existing techniques that describe user interest through unsupervised clustering.

Five, in text representation the chosen vector dimension is the mean length of all texts rather than the usual microblog length limit (140 characters); the dimension is reasonable and the representation is effective.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the present invention;

Fig. 2 is a schematic diagram of the framework of the present invention;

Fig. 3 is a schematic flowchart of microblog text data collection in the present invention;

Fig. 4 is a schematic flowchart of microblog text feature extraction in the present invention;

Fig. 5 is a schematic flowchart of the microblog new word identification method in the present invention;

Fig. 6 is a schematic flowchart of microblog short text feature extraction in the present invention;

Fig. 7 is a schematic flowchart of feature word extraction by LDA modeling in the present invention;

Fig. 8 is a schematic flowchart of the feature-word-based microblog user interest recognition method in the present invention;

Fig. 9 is a schematic diagram of the data flow in the implementation of the microblog user interest recognition method in the present invention.

Embodiment

The invention is further described below with reference to the drawings and specific embodiments.

Referring to Figs. 1, 2 and 9, a microblog user interest recognition method based on text mining comprises the following steps:

(1) Collect the latest topic microblog text data of the microblog text set and the designated user microblog text data.

As shown in Fig. 3, data collection captures two parts of data from the network: topic microblog text data and designated user microblog text data. For topic microblog text data, a URL is taken from the URL list and a connection to it is established; once connected, the web page is fetched, the fetched page is analyzed, and an index is built after the analysis. In addition, the links in the page are extracted during page analysis for fetching subsequent pages; each fetch takes a URL from the links extracted the previous time and fetches a new page. For designated user microblog text data, OAuth authentication is obtained according to the user ID, then a code is obtained, then the microblog data are downloaded, the body text is extracted from the downloaded data, and an index is built. The indexes built from the topic microblog text data and the designated user microblog text data together form the index library. Taking Sina Weibo as an example, permission can currently be obtained through its open API to capture microblog data.

(2) Standardize the collected topic microblog text data and designated user microblog text data.

Standardization converts the punctuation marks, emoticons, and special symbols in the topic microblog text data and the designated user microblog text data: punctuation marks and emoticons are converted into corresponding textual descriptions, special symbols such as "@" are deleted, and stop words and noise words are deleted.
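A minimal sketch of such a standardization pass is shown below. The symbol-to-text mapping, the noise word list, and the function name `standardize` are hypothetical placeholders; a real system would supply its own tables.

```python
import re


def standardize(text, symbol_text, noise_words):
    """Replace symbols with textual descriptions, delete '@' and noise words."""
    for symbol, description in symbol_text.items():
        text = text.replace(symbol, description)   # emoticons, punctuation
    text = text.replace("@", "")                   # special symbol deleted
    for word in noise_words:
        text = text.replace(word, "")              # stop words, noise words
    return re.sub(r"\s+", " ", text).strip()       # tidy leftover whitespace


cleaned = standardize(
    "@friend great movie [haha]  http",
    symbol_text={"[haha]": "laugh"},   # hypothetical emoticon description
    noise_words=["http"],              # hypothetical noise word
)
```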

(3) Apply the microblog new word identification method to the standardized topic microblog text data to identify the latest microblog new words, and update the new word dictionary.

As shown in Fig. 5, the steps of the microblog new word identification method are:

(31) Collect the standardized topic microblog text data;

(32) Preprocess the topic microblog text data, including converting symbols to text and deleting noise words. Suppose there are M microblog posts (A[M], strings) and two strings AllStr and Str. Traverse A: if A[i] contains "##", extract the string tempA between the "#" signs and set Str += tempA; the remaining text tempA2 is handled separately: AllStr += tempA2.

(33) Perform multi-gram segmentation on the preprocessed topic microblog text data: AllStr and Str are cut into AllWord[K1] and StrWord[K2] respectively, where K1 and K2 are the numbers of words after cutting; the words that occur in both AllWord and StrWord are then selected and denoted NewWord[K].

(34) Filter the words obtained by multi-gram segmentation: old word filtering, word frequency filtering, adjacent string filtering, and mutual information filtering. Old word filtering: the words cut into AllWord are matched one by one against a given old word dictionary, and any word found in it is deleted. Word frequency filtering: each word in NewWord is counted over the original document set (the topic microblog text data), and words that do not reach the threshold are deleted. Adjacent string filtering: if two words A and B both exist, the combined word AB also exists, and all three have the same word frequency, AB is regarded as a new word; if A and B have different word frequencies, the word with the lower frequency is deleted. Mutual information filtering: the mutual information value of two words A and B is computed by the formula I (A, B) = \log_{2} \frac{p (A, B)}{p (A) p (B)} and compared with a given threshold; if it does not reach the threshold, AB is not regarded as a new word and is deleted. In the formula, A and B each denote a word in the text (the topic microblog text data), p(A, B) is the probability that words A and B occur together, p(A) is the probability that word A occurs alone, p(B) is the probability that word B occurs alone, and I(A, B) is the mutual information value between word A and word B.
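The preprocessing in step (32) and the word frequency filter in step (34) can be sketched as follows. This is one reading of the description: topic text is taken to be whatever sits between a pair of "#" signs, the remaining text of every post goes to the other string, and occurrences are counted with a plain substring count. The function names are hypothetical.

```python
import re


def split_topic_text(posts):
    """Return (all_str, topic_str): text outside and inside #...# pairs."""
    topic_parts, other_parts = [], []
    for post in posts:
        topic_parts.extend(re.findall(r"#([^#]+)#", post))   # tempA
        other_parts.append(re.sub(r"#[^#]+#", "", post))     # tempA2
    return "".join(other_parts), "".join(topic_parts)


def frequency_filter(candidates, corpus, min_count):
    """Keep candidate words that occur at least min_count times in corpus."""
    return [word for word in candidates if corpus.count(word) >= min_count]


posts = ["#abc#hello abc", "plain text", "#xyz#abc again"]
all_str, topic_str = split_topic_text(posts)
kept = frequency_filter(["abc", "xyz"], all_str, min_count=2)
```

The remaining filters would be applied to `kept` in the same cascading fashion, each layer removing more candidates.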

(4) Segment the standardized designated user microblog text data into Chinese words using the new word dictionary segmentation method, that is, the ICTCLAS word segmentation system with the new word dictionary imported, and compute the TF-IDF value of each term obtained by segmentation to obtain a text vector representation.

As shown in Fig. 4, the figure is divided into two parts: new word identification on the topic microblog text data and feature extraction from the designated user microblog text data. The right side of Fig. 4 is new word identification on the topic microblog text data: first the topic microblog text data are extracted; they are then preprocessed, including conversion of symbols to text and deletion of noise words; the preprocessed microblog text is cut by multi-gram (N-gram) segmentation; and after cutting, old word filtering, word frequency filtering, adjacent string filtering, and mutual information filtering are applied, the four layers of filtering identifying the final microblog new words. The left side of Fig. 4 is feature extraction from the designated user microblog text data: first the designated user microblog text data are obtained and standardized; the processed text is segmented into Chinese words using the segmentation method with the new word dictionary; the TF-IDF value of each word in the segmentation result is computed to obtain the text vector representation; text feature words are extracted from the text data expressed as text vectors; and the extracted feature words are stored in a database.

(5) Cluster the designated user microblog text data expressed as text vectors, and recombine the original designated user microblog text data from step (1) to obtain a new text set and the number of clusters; the number of clusters equals the number of texts in the new text set.

Fig. 6 shows the detailed flow of the microblog short text feature extraction method. First the texts are represented as vectors using VSM, and the K-Means based clustering algorithm is used to cluster the designated user microblog text data. The original designated user microblog text data are then recombined according to the clustering result to obtain a new text set, the TF-IDF values are recomputed, and a new text vector representation is produced. Finally the text is modeled with the LDA model, and feature words are extracted using the final feature word extraction method.

The steps of the clustering are:

(51) Convert the designated user microblog text data into a V-dimensional text vector representation, where V is the mean length of the N designated user microblog texts, and select K data points as the centers of the K cluster classes using the initial center selection method;

(53) Recompute the cluster centers of the obtained clusters: let min be the minimum of d_{ij} over j, that is, select the center j nearest to the i-th data point, and set a threshold c; if min > c, make i a new center; otherwise document i belongs to the class of center j.

The steps of the initial center selection method are as follows:

(511) Randomly select one data point from the N data points and denote it center;

(513) Randomly select a value r = random(sum{dis(center, m)}) and compute r = r - dis(center, m); if r < 0, the m-th data point is marked as a center, where random(sum{dis(center, m)}) denotes choosing a value at random from 0 to sum{dis(center, m)};

(514) Repeat steps (511) and (512) until K centers have been selected.

(6) Use the LDA topic model to extract the feature words of the new text set obtained after clustering: the new text set obtained after clustering is modeled with the LDA model and its features are extracted.

Fig. 7 describes the flow of LDA text modeling, detailed as follows:

The steps of extracting the feature words of the new text set are:

(63) Extract the feature words using the final feature word extraction method.

The steps of the final feature word extraction method are as follows:

(632) Select the "topic-word" distribution corresponding to keyTopic;

(634) Repeat steps (631), (632) and (633) until the new text set has been traversed, obtaining all feature words.

In step (4) and step (61), the TF-IDF value is calculated as follows:

w_{ij} = {tf}_{ij} \times {idf}_{j} = {tf}_{ij} \times \log (\frac{N}{n_{j}})

(7) Based on an online dictionary database, several topic dictionaries are given: for example, for the topic "game" the dictionary contains words such as "Zhu Xian" and "World of Warcraft", and for the topic "film" the dictionary contains words such as "The Grandmaster" and "Avatar". Based on the feature words of the new text set, compute the weight of each topic dictionary to obtain the final topics, which serve as the recognized microblog user interests;

Fig. 8 shows the flow of the feature-word-based microblog user interest recognition method, detailed as follows:

(71) S topic dictionaries are given;

w_{i} = \frac{\sum_{j = 1}^{N_{i}} weight (Term [j])}{N_{i}},

In a concrete implementation of the microblog user interest recognition method based on text mining, after a user's authorization is obtained, the user's microblog data are acquired and the text is analyzed, and finally a description of the user's interests is recognized.

As shown in Fig. 2, the whole flow can be divided into three layers. The first layer is the data collection layer, which collects microblog data in two parts: the first part collects topic microblog text data for new word identification, and the second part collects designated user microblog text data for feature extraction. The second layer is the text analysis layer, which performs text feature extraction. The third layer is the application layer, which performs user behavior recognition.

Fig. 9 shows the data flow in the overall implementation of the designated user interest recognition method; the specific steps are:

1. Obtain the 200 most recent microblog posts Docs of user_ID through the Sina Weibo API, standardize them, identify the latest microblog new words using the microblog new word identification method, and update the new word dictionary;

2. Extract the designated user's microblog short text data, standardize it, and segment it into Chinese words using the ICTCLAS word segmentation system with the new word dictionary imported, obtaining the term document data set Words;

3. Represent each designated user microblog text as a vector using the VSM model, computing the weights of the segmented terms with TF-IDF, to obtain the "text-word" vector representation DW_Vectors;

4. Taking DW_Vectors as the data, cluster with the K-Means++ based clustering method to obtain the number of topics K and a new text set NewDocs of K texts, and obtain the new text set vectors NewDW_Vectors;

5. Model the new text set with the LDA model, using iterative Gibbs sampling, to obtain two matrices matrix_DT (text-topic) and matrix_TW (topic-word);

6. Using the final feature word extraction method and the corresponding document labels, obtain L new words Terms;

7. Given S topic dictionaries, compute the topic dictionary weights w_{i} and select the topics with w_{i} > η (η > 0).

The present invention has been illustrated by the above embodiment, but it should be understood that the embodiment serves only for example and illustration and is not intended to limit the invention to the scope of the described embodiment. Those skilled in the art will further appreciate that the invention is not limited to the above embodiment and that further variations and modifications can be made according to its teaching, all of which fall within the scope claimed by the invention. The scope of protection of the invention is defined by the appended claims and their equivalents.

Claims

1. A microblog user interest recognition method based on text mining, characterized by the following steps:

2. The microblog user interest recognition method based on text mining according to claim 1, characterized in that, in step (3), the steps of the microblog new word identification method are:

(31) Collect the standardized topic microblog text data;

(32) Preprocess the topic microblog text data;

3. The microblog user interest recognition method based on text mining according to claim 2, characterized in that, in step (34), the mutual information value is calculated by the formula:

I (A, B) = \log_{2} \frac{p (A, B)}{p (A) p (B)},

4. The microblog user interest recognition method based on text mining according to claim 1, characterized in that, in step (5), the steps of the clustering are:

5. The microblog user interest recognition method based on text mining according to claim 4, characterized in that, in step (51), the steps of the initial center selection method are as follows:

(511) Randomly select one data point from the N data points and denote it center;

(514) Repeat steps (511) and (512) until K centers have been selected.

6. The microblog user interest recognition method based on text mining according to claim 1, characterized in that, in step (6), the steps of extracting the feature words of the new text set are:

(63) Extract the feature words using the final feature word extraction method.

7. The microblog user interest recognition method based on text mining according to claim 6, characterized in that, in step (63), the steps of the final feature word extraction method are as follows:

(632) Select the "topic-word" distribution corresponding to keyTopic;

8. The microblog user interest recognition method based on text mining according to claim 1 or 6, characterized in that, in step (4) and step (61), the TF-IDF value is calculated as follows:

w_{ij} = {tf}_{ij} \times {idf}_{j} = {tf}_{ij} \times \log (\frac{N}{n_{j}})

9. The microblog user interest recognition method based on text mining according to claim 1, characterized in that, in step (7), the steps of the microblog user interest recognition are:

(71) S topic dictionaries are given;

w_{i} = \frac{\sum_{j = 1}^{N_{i}} weight (Term [j])}{N_{i}},