CN103942340A - Microblog user interest recognizing method based on text mining - Google Patents

Microblog user interest recognizing method based on text mining

Info

Publication number
CN103942340A
Authority
CN
China
Prior art keywords
text
word
microblogging
text data
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410195244.XA
Other languages
Chinese (zh)
Inventor
屈鸿
王晓斌
李�浩
方正
袁建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201410195244.XA
Publication of CN103942340A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention discloses a microblog user interest recognition method based on text mining, belonging to the field of text mining and natural language processing. The method comprises: collecting the latest topical microblog text data of a microblog text set and the microblog text data of a designated user; standardizing the collected microblog text data; applying a microblog new word recognition method to the standardized topical microblog text data to recognize the latest microblog words and update a new word dictionary; performing Chinese word segmentation on the designated user's standardized microblog text data with a segmentation method that uses the new word dictionary, yielding a text vector representation; clustering the designated user's vectorized microblog text data and recombining the original microblog text data; extracting the features of the new text set with a topic model; and, given preset topic dictionaries, calculating the weight of each topic dictionary from the new text set features to obtain the final topics, which serve as the recognized interests of the microblog user. The accuracy of feature extraction is thereby improved.

Description

A microblog user interest recognition method based on text mining
Technical field
A microblog user interest recognition method based on text mining, involving text clustering (short-text clustering with an improved K-Means algorithm) and topic modeling (extraction of text feature words by combining VSM with the LDA model). It belongs to the fields of text mining, natural language processing, and machine learning.
Background art
Text feature extraction is a key step in text mining: the similarity between texts is computed from the extracted features and then used for text classification, clustering, and so on. With the widespread use of microblogs, text mining techniques are widely applied to microblog text; analyzing microblog text makes it possible to discover current hot topics, track topics, and so on.
Topic models work well for text feature extraction. A topic model regards a text as a set of topics following some probability distribution, and each topic in turn as a set of terms following some probability distribution, which expands the text from the two-dimensional "text-word" space to the three-dimensional "text-topic-word" space. A topic model can effectively capture the features of a text and discover its latent semantics, namely its topics. When a topic model is applied to microblog short texts, however, the topic of a short text is imprecise and the data are sparse, so the topics of microblog short texts cannot be found accurately.
A clustering algorithm can recombine the microblog short text set into a new set of long texts, which makes the topics of the new text set clearer and reduces data sparsity. K-Means is a typical distance-based clustering algorithm. It works as follows: randomly select K samples as the centers of K classes, compute the distance from every other sample to each center, and assign each sample to the class of its nearest center. Then update the center of each class after assignment, and iterate this step until the centers no longer change between two successive iterations.
LDA (Latent Dirichlet Allocation) is one of the better topic models. It regards a text as composed of a series of topics following a multinomial distribution, and each topic in turn as composed of a series of words following a Dirichlet distribution. The idea of the LDA model is to sample according to probability distributions: draw a topic from the topic distribution, then draw a word from the word distribution of that topic. This step is iterated until all the words in the text have been drawn, giving the final result: a "text-topic" probability matrix and a "topic-word" probability matrix. The topics of the original text are extracted from these two matrices. LDA has also been adapted to the characteristics of microblog short texts; the MB-LDA model, for example, takes the relationship between microblog authors and texts into account to assist microblog topic mining. Deriving the model with Gibbs sampling makes it possible to mine not only the topics of microblogs but also the topics that contacts follow.
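The sampling idea described above (draw a topic, then draw a word from that topic) can be illustrated with a minimal Python sketch. It shows only the generative direction for one document and assumes the two probability tables are already known; inference of those tables (for example by Gibbs sampling) is not shown. The function names `sample_index` and `generate_document` and the toy data are hypothetical.

```python
import random

def sample_index(probabilities, rng):
    """Draw an index according to a discrete probability distribution."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if r < cumulative:
            return index
    return len(probabilities) - 1  # guard against floating-point round-off

def generate_document(doc_topic, topic_word, vocabulary, length, rng):
    """Generate `length` words: pick a topic, then a word from that topic."""
    words = []
    for _ in range(length):
        topic = sample_index(doc_topic, rng)          # "text-topic" row
        word = sample_index(topic_word[topic], rng)   # "topic-word" row
        words.append(vocabulary[word])
    return words

# Toy example (assumed data): two topics over a four-word vocabulary.
vocabulary = ["game", "level", "film", "actor"]
topic_word = [[0.6, 0.4, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
document = generate_document([0.7, 0.3], topic_word, vocabulary, 8, random.Random(0))
```

A document whose topic row is concentrated on the first topic will mostly contain "game" and "level", which is the intuition behind reading topics back out of the two matrices.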
Chinese word segmentation means cutting a sequence of Chinese characters into individual words.
N-gram cutting means that each word obtained by segmentation consists of N characters. Current high-quality Chinese word segmentation systems, such as the ICTCLAS system of the Chinese Academy of Sciences, can improve segmentation by adding a user-supplied dictionary.
Text representation turns a text into a concise, uniform, structured form that learning algorithms and classifiers can recognize. The most commonly used text representation model is the vector space model (VSM), which computes feature weights for each text so that each text is uniquely represented by a feature vector. Each value of the feature vector is obtained by computing a TF-IDF value.
Existing microblog user interest recognition methods still have many shortcomings, specifically:
First, feature extraction from microblog short texts does not add effective new words, so much of the result is lost.
Second, existing techniques generally analyze massive amounts of microblog text, or use microblog functions, to mine hot topics, events, and so on; they do not provide a sound method for analyzing a single user's microblog text in relation to that user's interests, even though a user's microblog text is an important source of information for identifying user interest.
Third, because microblog short texts are unstructured and sparse, the accuracy of feature extraction is low.
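As a small illustration of N-gram cutting, the Python sketch below slides a window of N characters over a string. The sliding-window reading and the function name `cut_ngrams` are assumptions; the text only states that each resulting word has N characters.

```python
def cut_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def cut_multi_grams(text, max_n):
    """Collect the n-grams for n = 2 .. max_n (multi-gram cutting)."""
    words = []
    for n in range(2, max_n + 1):
        words.extend(cut_ngrams(text, n))
    return words

bigrams = cut_ngrams("微博新词", 2)
```

Candidate words produced this way are plentiful and mostly meaningless, which is why later filtering steps are needed.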
Summary of the invention
To address the shortcomings of the prior art, the present invention provides a microblog user interest recognition method based on text mining that can directly reveal a user's recent interests and habits from the microblogs that user has posted.
To achieve the above object, the technical solution adopted by the present invention is:
A microblog user interest recognition method based on text mining, characterized by the following steps:
(1) collect the latest topical microblog text data of a microblog text set and the microblog text data of a designated user;
(2) standardize the collected topical microblog text data and designated user's microblog text data;
(3) apply the microblog new word recognition method to the standardized topical microblog text data to recognize the latest microblog new words, and update the new word dictionary;
(4) perform Chinese word segmentation on the standardized designated user's microblog text data using the segmentation method with the new word dictionary, and compute the TF-IDF value of each term obtained by segmentation to obtain a text vector representation;
(5) cluster the designated user's microblog text data in its text vector representation, and recombine the original designated user's microblog text data from step (1) to obtain a new text set and the number of clusters;
(6) apply LDA topic modeling to extract the feature words of the new text set obtained after clustering;
(7) given topic dictionaries, compute the weight of each topic dictionary from the feature words of the new text set to obtain the final topics, which serve as the recognized interests of the microblog user.
Preferably, in step (3), the steps of the microblog new word recognition method are:
(31) collect the standardized topical microblog text data;
(32) preprocess the topical microblog text data;
(33) perform multi-gram cutting on the preprocessed topical microblog text data;
(34) apply word filtering, old-word filtering, word-frequency filtering, adjacent-string filtering, and mutual-information filtering to the words obtained by multi-gram cutting.
Preferably, in step (34), the mutual information value is calculated by the formula:
$$I(A, B) = \log_2 \frac{P(A, B)}{P(A)\,P(B)}$$
where A and B each denote a word in the text (the topical microblog text data), P(A, B) is the probability that words A and B occur together, P(A) is the probability that word A occurs on its own, P(B) is the probability that word B occurs on its own, and I(A, B) is the mutual information value between word A and word B.
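The mutual information formula can be evaluated directly from counts. The Python sketch below assumes the probabilities are estimated as relative frequencies over `total` observed positions; that estimator, the function name, and the sample counts are assumptions, not part of the method as stated.

```python
import math

def mutual_information(count_ab, count_a, count_b, total):
    """I(A, B) = log2( P(A, B) / (P(A) * P(B)) ), with probabilities
    estimated as counts divided by the total number of observations."""
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log2(p_ab / (p_a * p_b))

# A and B each occur 10 times in 100 positions and always occur together.
strongly_bound = mutual_information(10, 10, 10, 100)
# A and B co-occur exactly as often as chance predicts.
independent = mutual_information(1, 10, 10, 100)
```

A large positive value indicates that A and B occur together far more often than chance, which supports treating AB as one word; a value near zero indicates independence.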
Preferably, in step (5), the steps of the clustering are:
(51) convert the designated user's microblog text data into V-dimensional text vectors, where V is the mean length of the N texts (the designated user's microblog text data), and select K data points as the centers of the K cluster classes using the initial center selection method;
(52) compute the Euclidean distance between each data point in the designated user's microblog text data and the K centers to obtain clusters; the distances are denoted $d_{ij}$ (i = 1 to N, j = 1 to K), where $d_{ij}$ is the distance from the i-th data point to the j-th center and N is the number of data points;
(53) recompute the center of each cluster obtained. Let min be the smallest of the values $d_{ij}$ for the i-th data point, that is, the distance to its nearest center j, and set a threshold c: if min > c, make data point i a new center; otherwise document i belongs to the class of center j;
(54) update the center of each class by recomputing the center of each cluster;
(55) repeat steps (52), (53), and (54) until convergence; the convergence condition is that the centers do not change between two successive iterations.
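The clustering steps above can be sketched in Python as follows. This is one possible reading, not a definitive implementation: the threshold test is applied while points are assigned, clusters that end up empty are dropped, and `max_iter` is an added safety limit. The function names and sample data are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    return [sum(values) / len(points) for values in zip(*points)]

def threshold_kmeans(points, centers, c, max_iter=100):
    """K-Means variant in which a point farther than c from every
    center becomes a new center, so K can grow during clustering."""
    centers = [list(center) for center in centers]
    labels = []
    for _ in range(max_iter):
        labels = []
        for point in points:
            distances = [euclidean(point, center) for center in centers]
            j = min(range(len(centers)), key=distances.__getitem__)
            if distances[j] > c:             # step (53): open a new cluster
                centers.append(list(point))
                j = len(centers) - 1
            labels.append(j)
        # Step (54): recompute centers, dropping clusters with no members.
        members = {}
        for label, point in zip(labels, points):
            members.setdefault(label, []).append(point)
        used = sorted(members)
        new_centers = [mean_point(members[j]) for j in used]
        labels = [used.index(label) for label in labels]
        if new_centers == centers:           # step (55): centers unchanged
            break
        centers = new_centers
    return labels, centers

points = [[0, 0], [0, 1], [10, 10], [10, 11], [50, 50]]
labels, centers = threshold_kmeans(points, [[0, 0], [10, 10]], c=5)
```

In this example the point [50, 50] is farther than c from both starting centers, so it opens a third cluster and the final number of clusters differs from the initial K.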
Preferably, in step (51), the steps of the initial center selection method are:
(511) randomly select one of the N data points and denote it center;
(512) compute the distance dis(center, m) (m = 1 to N) from each of the other N-1 data points to center, and accumulate all the distances: sum{dis(center, m)};
(513) randomly select a value r = random(sum{dis(center, m)}) and compute r = r - dis(center, m); if r < 0, the m-th data point is designated a center. Here random(sum{dis(center, m)}) denotes a value chosen at random between 0 and sum{dis(center, m)};
(514) repeat steps (511) and (512) until K centers have been selected.
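The selection in steps (511) to (513) is a roulette-wheel draw in which a point's chance of becoming a center grows with its distance. A minimal Python sketch follows. It assumes that, once more than one center exists, the distance used for a point is the distance to its nearest already-chosen center, and it uses plain (not squared) distances as in the steps above. The function name is hypothetical.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def choose_initial_centers(points, k, rng):
    """Pick k initial centers; distant points are more likely to be chosen."""
    centers = [points[rng.randrange(len(points))]]        # step (511)
    while len(centers) < k:
        # Step (512): distance of every point to its nearest chosen center.
        distances = [min(euclidean(p, c) for c in centers) for p in points]
        r = rng.uniform(0, sum(distances))                # step (513)
        for point, d in zip(points, distances):
            r -= d
            if r < 0:
                centers.append(point)
                break
        else:
            # r never dropped below zero (round-off or all distances zero):
            # fall back to the farthest point.
            centers.append(points[distances.index(max(distances))])
    return centers

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
initial = choose_initial_centers(points, 3, random.Random(1))
```

Because an already-chosen point has distance zero, subtracting its distance can never push r below zero, so the same point is not selected twice while unchosen points remain.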
Preferably, in step (6), the steps for extracting the feature words of the new text set are:
(61) for the new text set obtained after clustering, compute the TF-IDF value of the words in each text of the new text set to obtain new text vectors;
(62) model the new text set with the LDA model: set the parameter values, vary the initial parameter values several times, and sample to obtain the "topic-word" distribution and the "document-topic" distribution;
(63) extract the feature words using the final feature word extraction method.
Preferably, in step (63), the steps of the final feature word extraction method are:
(631) for a text in the new text set, select from the "document-topic" distribution the topic with the largest weight as the key topic keyTopic;
(632) select the "topic-word" distribution corresponding to keyTopic;
(633) from the word distribution corresponding to keyTopic, take the three words with the largest weights; if a topic is selected more than once, record the number of times keyCount it has been selected;
(634) repeat steps (631), (632), and (633) until the whole new text set has been traversed, giving all the feature words.
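Steps (631) to (634) can be sketched in Python as below, given the two distributions as nested lists. The data layout, the function name, and the choice to use the topic-word probability as each feature word's weight are assumptions made for illustration.

```python
from collections import Counter

def extract_feature_words(doc_topic, topic_word, vocabulary, top_n=3):
    """For each document take its heaviest topic (keyTopic), then the
    top_n heaviest words of that topic. Returns the feature words with
    weights and how often each topic was chosen (keyCount)."""
    key_count = Counter()
    feature_words = {}
    for topic_weights in doc_topic:
        key_topic = max(range(len(topic_weights)), key=topic_weights.__getitem__)
        key_count[key_topic] += 1
        word_weights = topic_word[key_topic]
        top = sorted(range(len(word_weights)),
                     key=word_weights.__getitem__, reverse=True)[:top_n]
        for index in top:
            word = vocabulary[index]
            # Keep the larger weight if a word is reached through two topics.
            feature_words[word] = max(feature_words.get(word, 0.0),
                                      word_weights[index])
    return feature_words, key_count

vocabulary = ["game", "play", "fun", "film", "actor"]
doc_topic = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]
topic_word = [[0.5, 0.3, 0.15, 0.05, 0.0], [0.0, 0.05, 0.15, 0.3, 0.5]]
feature_words, key_count = extract_feature_words(doc_topic, topic_word, vocabulary)
```

Here topic 0 is the key topic of two documents and topic 1 of one, so keyCount records 2 and 1 respectively.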
Preferably, in step (4) and step (61), the TF-IDF value is calculated by the formula:
$$w_{ij} = tf_{ij} \times idf_j = tf_{ij} \times \log\left(\frac{N}{n_j}\right)$$
where $w_{ij}$ is the TF-IDF value of word j in document i, $tf_{ij}$ is the frequency with which word j occurs in document i, $n_j$ is the number of documents in the document set that contain word j, N is the total number of documents, and $idf_j$ is the logarithm of the ratio of N to $n_j$. Here a document is either a designated user's microblog text or a new text obtained by recombining the designated user's microblog text data, and the document set is correspondingly the designated user's microblog text data set or the new text set obtained by recombination.
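The TF-IDF formula above can be computed with a few lines of Python. The sketch assumes that documents are already segmented into word lists, that the term frequency is the relative frequency within the document, and that the logarithm is natural; none of these choices is fixed by the text. The function name is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of word lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # n_j: number of documents that contain word j.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

docs = [["game", "fun", "game"], ["movie", "fun"], ["game", "movie", "music"]]
vectors = tfidf_vectors(docs)
```

A word that appears in every document gets weight zero, because the ratio of N to n_j is 1 and its logarithm is 0.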
Preferably, in step (7), the steps of microblog user interest recognition are:
(71) S topic dictionaries are given;
(72) from the feature words extracted by LDA modeling of the new text set, count the number $N_i$ of feature words contained in each topic dictionary ($N_i$ is an integer); a word that matches no dictionary is labeled as an additional category;
(73) each feature word carries a weight; compute the weight of each topic dictionary by the formula:
$$w_i = \frac{\sum_{j=1}^{N_i} \mathrm{weight}(\mathrm{Term}[j])}{N_i}$$
where Term[j] is the j-th word, weight(Term[j]) is the weight of word j, $N_i$ is the number of feature words contained in topic dictionary i, and $w_i$ is the weight of topic dictionary i;
(74) sort the topic dictionaries by weight; a topic dictionary i with weight $w_i > \eta$ ($\eta > 0$ is a set threshold) is selected as an interest topic word of the user; otherwise topic dictionary i is deleted;
(75) if a selected interest topic word is an emotion word, calculate the polarity of the emotion word, as follows:
(76) the non-emotion topic words obtained and the calculated emotion polarities are defined as the interest topic words.
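Steps (71) to (74) amount to averaging the weights of the feature words that fall into each topic dictionary and keeping the dictionaries above the threshold. A minimal Python sketch follows; the dictionary contents, the feature word weights, and the function name are illustrative assumptions, and the emotion polarity calculation of step (75) is not covered.

```python
def rank_topic_dictionaries(feature_words, dictionaries, eta):
    """feature_words: {word: weight}; dictionaries: {topic: set of words}.
    Returns the selected topics (heaviest first), all topic weights,
    and the feature words that match no dictionary."""
    weights = {}
    for topic, words in dictionaries.items():
        matched = [w for w in feature_words if w in words]
        if matched:  # N_i > 0
            weights[topic] = sum(feature_words[w] for w in matched) / len(matched)
    unmatched = [w for w in feature_words
                 if not any(w in words for words in dictionaries.values())]
    selected = sorted((t for t, w in weights.items() if w > eta),
                      key=weights.get, reverse=True)
    return selected, weights, unmatched

feature_words = {"World of Warcraft": 0.6, "Avatar": 0.2, "coffee": 0.1}
dictionaries = {
    "game": {"World of Warcraft", "Zhu Xian"},
    "film": {"Avatar", "The Grandmaster"},
}
selected, weights, unmatched = rank_topic_dictionaries(feature_words, dictionaries, eta=0.3)
```

With a threshold of 0.3 only the "game" dictionary is kept, and "coffee" is reported as belonging to the additional category because it matches no dictionary.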
Compared with the prior art, the advantages of the present invention are:
First, microblog new word recognition uses the latest microblog topic data. Most new words appear in topics, so this reduces the volume of raw data and improves preprocessing efficiency. Recognition uses four layers of filtering (raw data filtering, word-frequency filtering, adjacent-string filtering, and mutual-information filtering), and each added layer improves new word recognition accuracy.
Second, feature extraction from microblog short texts uses a K-Means clustering process with K-Means++ initialization and changes the value of K during clustering, which suits the indefinite topics of microblog short texts and makes the clustering result more effective.
Third, feature extraction from microblog short texts combines VSM and LDA, considering both word frequency and latent topics in the text. The model is trained on the new text set obtained after clustering, so the results are better than those of extraction with LDA alone.
Fourth, the extracted feature words are used to compute the word frequency of the given topic dictionaries, mapping the language of the text onto words that describe the user's interests and habits. Because the topics are explicit, this is more accurate than the user interest descriptions that existing techniques obtain by unsupervised clustering.
Fifth, in text representation the vector dimension is the mean length of all the texts rather than the usual microblog length (140 characters); this reasonable dimension makes an effective representation easy to achieve.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a schematic diagram of the framework of the present invention;
Fig. 3 is a flow chart of microblog text data collection in the present invention;
Fig. 4 is a flow chart of microblog text feature extraction in the present invention;
Fig. 5 is a flow chart of the microblog new word recognition method in the present invention;
Fig. 6 is a flow chart of microblog short text feature extraction in the present invention;
Fig. 7 is a flow chart of feature word extraction by LDA modeling in the present invention;
Fig. 8 is a flow chart of the feature-word-based microblog user interest recognition method in the present invention;
Fig. 9 is a data flow diagram of the implementation of the microblog user interest recognition method in the present invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and specific embodiments.
Referring to Figs. 1, 2, and 9, a microblog user interest recognition method based on text mining comprises the following steps:
(1) Collect the latest topical microblog text data of a microblog text set and the microblog text data of a designated user.
As shown in Fig. 3, data collection captures two kinds of data from the network: topical microblog text data and the designated user's microblog text data. For topical microblog text data, a URL is taken from the URL list, a connection to that URL is established, and the web page is fetched; the fetched page is then analyzed, and an index is built after analysis. In addition, the links in the page are extracted during analysis for fetching subsequent pages: each fetch takes a URL from the links extracted the previous time. For the designated user's microblog text data, OAuth authentication is obtained for the user ID, then a code is obtained, the microblog data are downloaded, the body text is extracted from the downloaded data, and an index is built. The indexes built from the topical microblog text data and from the designated user's microblog text data together form the index database. Taking Sina Weibo as an example, permission can currently be obtained through its open API to capture microblog data.
(2) Standardize the collected topical microblog text data and designated user's microblog text data.
Standardization converts the punctuation marks, emoticons, and special symbols in the topical microblog text data and the designated user's microblog text data: punctuation marks and emoticons are converted into corresponding textual descriptions, special symbols such as "@" are deleted, and stop words and noise words are deleted.
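The standardization described above can be sketched in Python as a sequence of simple string operations. The symbol table, the stop word list, and the function name are hypothetical, and whitespace tokenization of an English sample is used only for readability; real microblog text would go through the Chinese word segmenter instead.

```python
# Hypothetical lookup tables; a real system would load much larger ones.
SYMBOL_TEXT = {":)": " happy ", ":(": " sad ", "!": " exclamation "}
STOP_WORDS = {"the", "a", "of"}

def normalize(text):
    """Replace symbols with textual descriptions, delete "@",
    and drop stop words."""
    for symbol, description in SYMBOL_TEXT.items():
        text = text.replace(symbol, description)
    text = text.replace("@", " ")          # delete the special symbol "@"
    tokens = text.split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = normalize("@Tom the game is fun :)")
```

The emoticon is kept as the word "happy" rather than discarded, so that it can still contribute to later feature extraction.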
(3) Apply the microblog new word recognition method to the standardized topical microblog text data to recognize the latest microblog new words, and update the new word dictionary.
As shown in Fig. 5, the steps of the microblog new word recognition method are:
(31) collect the standardized topical microblog text data;
(32) preprocess the topical microblog text data, including converting symbols to text and deleting noise words. Given M microblogs (A[M], strings) and two strings AllStr and Str, traverse A: if A[i] contains a pair of "#" marks, extract the string tempA between them and append it, Str += tempA; the remaining text tempA2 is appended separately, AllStr += tempA2;
(33) perform multi-gram cutting on the preprocessed topical microblog text data. AllStr and Str are cut into AllWord[K1] and StrWord[K2] respectively, where K1 and K2 are the numbers of words after cutting; the words that occur in both AllWord and StrWord are then selected and denoted NewWord[K];
(34) apply word filtering, old-word filtering, word-frequency filtering, adjacent-string filtering, and mutual-information filtering to the words obtained by multi-gram cutting. Old-word filtering: match the words cut into AllWord one by one against a given dictionary of old words, and delete any word found there. Word-frequency filtering: count each word of NewWord in the original document set (the topical microblog text data), and delete the words that do not reach the threshold. Adjacent-string filtering: if two words A and B exist, the combined word AB also exists, and all three have the same word frequency, AB is taken to be a new word; if A and B have different word frequencies, the word with the lower frequency is deleted. Mutual-information filtering: compute the mutual information value of two words A and B by the formula $I(A, B) = \log_2 \frac{P(A, B)}{P(A)\,P(B)}$ and, given a threshold, delete AB as not being a new word if the value does not reach the threshold. In the formula, A and B each denote a word in the text (the topical microblog text data), P(A, B) is the probability that words A and B occur together, P(A) is the probability that word A occurs on its own, P(B) is the probability that word B occurs on its own, and I(A, B) is the mutual information value between word A and word B.
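The preprocessing in step (32) and the old-word and word-frequency filters in step (34) can be sketched in Python as follows. The regular expression assumes topics are delimited by a pair of "#" marks; the function names and the sample data are hypothetical, and the adjacent-string and mutual-information filters are not shown.

```python
import re
from collections import Counter

TOPIC_PATTERN = re.compile(r"#([^#]+)#")

def split_topic_text(microblogs):
    """Separate the text between paired "#" marks (Str in step 32)
    from the remaining text (AllStr in step 32)."""
    topic_parts, other_parts = [], []
    for text in microblogs:
        topic_parts.extend(TOPIC_PATTERN.findall(text))
        other_parts.append(TOPIC_PATTERN.sub(" ", text).strip())
    return topic_parts, other_parts

def filter_candidates(candidates, old_words, min_count):
    """Old-word filter and word-frequency filter over candidate words."""
    counts = Counter(candidates)
    return sorted(word for word, n in counts.items()
                  if word not in old_words and n >= min_count)

topics, others = split_topic_text(["#WorldCup# great match", "no topic here"])
kept = filter_candidates(["ab", "ab", "cd", "ef", "ef"], old_words={"ef"}, min_count=2)
```

In the example, "cd" is dropped for being too rare and "ef" for already being in the old-word dictionary, leaving "ab" as the only candidate.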
(4) Perform Chinese word segmentation on the standardized designated user's microblog text data using the ICTCLAS segmentation system with the new word dictionary imported (that is, the segmentation method with the new word dictionary), and compute the TF-IDF value of each term obtained by segmentation to obtain a text vector representation.
As shown in Fig. 4, the figure has two parts: new word recognition on the topical microblog text data and feature extraction from the designated user's microblog text data. The right side of Fig. 4 shows new word recognition on the topical microblog text data: first the topical microblog text data are extracted; they are preprocessed, including conversion of symbols to text and deletion of noise words; the preprocessed microblog text undergoes multi-gram (N-gram) cutting; after cutting, old-word filtering, word-frequency filtering, adjacent-string filtering, and mutual-information filtering are applied, and these four layers of filtering identify the final microblog new words. The left side of Fig. 4 shows feature extraction from the designated user's microblog text data: first the designated user's microblog text data are obtained and standardized; the processed text is segmented with the segmentation method that uses the new word dictionary; the TF-IDF value of each word in the segmentation result is computed to obtain the text vector representation; text feature words are extracted from the vectorized text data; and the extracted feature words are stored in a database.
(5) Cluster the designated user's microblog text data in its text vector representation, and recombine the original designated user's microblog text data from step (1) to obtain a new text set and the number of clusters; the number of clusters equals the number of texts in the new text set.
As shown in Fig. 6, the microblog short text feature extraction method proceeds as follows: first, the texts are represented as vectors with VSM; a K-Means-based clustering algorithm then clusters the designated user's microblog text data; the original designated user's microblog text data are recombined according to the clustering result to obtain a new text set, for which TF-IDF values are recomputed to give a new text vector representation; finally, the texts are modeled with LDA and feature words are extracted using the final feature word extraction method.
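The recombination of short texts into longer ones according to the clustering result can be sketched in Python as below. Joining the members of a cluster with a space, in their original order, is an assumption, as is the function name.

```python
def recombine_by_cluster(texts, labels):
    """Concatenate the texts that share a cluster label into one long
    text per cluster, ordered by label."""
    groups = {}
    for text, label in zip(texts, labels):
        groups.setdefault(label, []).append(text)
    return [" ".join(groups[label]) for label in sorted(groups)]

new_docs = recombine_by_cluster(
    ["played a new game", "watched a film", "game night again"],
    [0, 1, 0],
)
```

The result has exactly one text per cluster, which matches the statement that the number of clusters equals the number of texts in the new text set.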
The steps of the clustering are:
(51) convert the designated user's microblog text data into V-dimensional text vectors, where V is the mean length of the N designated user's microblog texts, and select K data points as the centers of the K cluster classes using the initial center selection method;
(52) compute the Euclidean distance between each data point in the designated user's microblog text data and the K centers to obtain clusters; the distances are denoted $d_{ij}$ (i = 1 to N, j = 1 to K), where $d_{ij}$ is the distance from the i-th data point to the j-th center and N is the number of data points;
(53) recompute the center of each cluster obtained. Let min be the smallest of the values $d_{ij}$ for the i-th data point, that is, the distance to its nearest center j, and set a threshold c: if min > c, make data point i a new center; otherwise document i belongs to the class of center j;
(54) update the center of each class by recomputing the center of each cluster;
(55) repeat steps (52), (53), and (54) until convergence; the convergence condition is that the centers do not change between two successive iterations.
The steps of the initial center selection method are:
(511) randomly select one of the N data points and denote it center;
(512) compute the distance dis(center, m) (m = 1 to N) from each of the other N-1 data points to center, and accumulate all the distances: sum{dis(center, m)};
(513) randomly select a value r = random(sum{dis(center, m)}) and compute r = r - dis(center, m); if r < 0, the m-th data point is designated a center. Here random(sum{dis(center, m)}) denotes a value chosen at random between 0 and sum{dis(center, m)};
(514) repeat steps (511) and (512) until K centers have been selected.
(6) Apply the LDA topic model to extract the feature words of the new text set obtained after clustering: the new text set obtained after clustering is modeled with LDA and its features are extracted.
Fig. 7 shows the flow of LDA text modeling, described in detail as follows:
The steps for extracting the feature words of the new text set are:
(61) for the new text set obtained after clustering, compute the TF-IDF value of the words in each text of the new text set to obtain new text vectors;
(62) model the new text set with the LDA model: set the parameter values, vary the initial parameter values several times, and sample to obtain the "topic-word" distribution and the "document-topic" distribution;
(63) extract the feature words using the final feature word extraction method.
The steps of the final feature word extraction method are:
(631) for a text in the new text set, select from the "document-topic" distribution the topic with the largest weight as the key topic keyTopic;
(632) select the "topic-word" distribution corresponding to keyTopic;
(633) from the word distribution corresponding to keyTopic, take the three words with the largest weights; if a topic is selected more than once, record the number of times keyCount it has been selected;
(634) repeat steps (631), (632), and (633) until the whole new text set has been traversed, giving all the feature words.
In step (4) and step (61), the TF-IDF value is calculated by the formula:
$$w_{ij} = tf_{ij} \times idf_j = tf_{ij} \times \log\left(\frac{N}{n_j}\right)$$
where $w_{ij}$ is the TF-IDF value of word j in document i, $tf_{ij}$ is the frequency with which word j occurs in document i, $n_j$ is the number of documents in the document set that contain word j, N is the total number of documents, and $idf_j$ is the logarithm of the ratio of N to $n_j$. Here a document is either a designated user's microblog text or a new text obtained by recombining the designated user's microblog text data, and the document set is correspondingly the designated user's microblog text data set or the new text set obtained by recombination.
(7) according to Web-Based Dictionary database, given multiple subject dictionaries, as theme as game, and its dictionary content is the vocabulary such as " putting to death celestial being ", " World of Warcraft ", as themes as film, dictionary content is the vocabulary such as " generation master ", " A Fanda ".Text set Feature Words based on new, calculates each subject dictionary weight, obtains final theme, identifies as microblog users interest;
As shown in Figure 8, the microblog users interest recognition methods flow process based on Feature Words, specifically describes as follows:
(71) a given S subject dictionary;
(72) Feature Words according to LDA model, new text set modeling being extracted, calculates the Feature Words number N that each subject dictionary comprises i(N ifor integer), if any one dictionary of word mismatch is labeled as additional category;
(73) each Feature Words carries weights, calculates the weight size of each subject dictionary, and computing formula is as follows:
w i = &Sigma; j = 1 N i weight ( Term [ j ] ) N i ,
where Term[j] is the j-th word, weight(Term[j]) is the weight of word j, $N_i$ is the number of feature words contained in subject dictionary i, and $w_i$ is the weight of subject dictionary i;
(74) sort the subject dictionaries by weight; a subject dictionary i whose weight satisfies $w_i > \eta$ (where $\eta > 0$ is a set threshold) is selected as an interest theme word of the user; otherwise, subject dictionary i is discarded;
(75) if a selected interest theme word is an emotion word, calculate the polarity of the emotion word, as follows:
(76) the non-emotion theme words obtained and the sentiment polarities calculated are taken as the interest theme words.
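Steps (71) to (74) can be sketched as follows. This is a minimal sketch, not the patented implementation: the function names, the toy feature weights and the dictionary contents are hypothetical, and a dictionary that matches no feature word is simply skipped to avoid dividing by $N_i = 0$, a case the text does not address. Step (75) is omitted because its formula is not given.

```python
def dictionary_weights(feature_weights, dictionaries):
    """feature_weights: {word: weight}; dictionaries: {theme: set of words}.
    Returns ({theme: w_i}, words that match no dictionary)."""
    scores = {}
    matched = set()
    for theme, vocab in dictionaries.items():
        hits = [w for w in feature_weights if w in vocab]  # the N_i feature words
        matched.update(hits)
        if hits:  # skip dictionaries with N_i == 0 (assumption)
            scores[theme] = sum(feature_weights[w] for w in hits) / len(hits)
    other = sorted(set(feature_weights) - matched)  # the "additional category"
    return scores, other

def select_interests(scores, eta):
    """Rank themes by weight and keep those with w_i > eta."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [theme for theme, weight in ranked if weight > eta]

# Hypothetical feature words with weights, and two hypothetical dictionaries.
features = {"World of Warcraft": 0.6, "raid": 0.4, "Avatar": 0.2, "weather": 0.1}
dictionaries = {"game": {"World of Warcraft", "raid"}, "film": {"Avatar"}}
scores, other = dictionary_weights(features, dictionaries)
interests = select_interests(scores, eta=0.3)
```

With these toy values the "game" dictionary scores (0.6 + 0.4) / 2 = 0.5 and is kept, while "film" scores 0.2 and falls below the threshold.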
In the microblog user interest recognition method based on text mining, the concrete implementation process is as follows: after a user's authorization is obtained, the user's microblog data are acquired and analyzed as text, and finally the user's interest description is identified.
As shown in Figure 2, the whole flow is divided into three layers. The first layer is the data collection layer, which gathers microblog data in two parts: the first part gathers topic microblog text data, used for new word identification; the second part gathers the designated user's microblog text data, used for feature extraction. The second layer is the text analysis layer, which extracts text features. The third layer is the application layer, which performs user behavior identification.
Figure 9 shows the data flow in the overall implementation of the designated-user interest recognition method. The concrete steps are:
1. Obtain the 200 most recent microblog posts Docs of user_ID from the Sina microblog API and standardize them; identify the latest microblog new words according to the microblog new word identification method, and update the new-word dictionary;
2. Extract the designated user's microblog short text data and standardize it; perform Chinese word segmentation with the ICTCLAS word segmentation system, into which the new-word dictionary has been imported, to obtain the term document data set Words;
3. Represent each designated-user microblog text as a vector with the VSM model, weighting the terms produced by Chinese word segmentation with TF-IDF, to obtain the "text-word" vector representation DW_Vectors;
4. Taking DW_Vectors as the data, cluster with the K-Means++-based clustering method to obtain the number of themes K and a new text set NewDocs of K texts, and from it the new text set vectors NewDW_Vectors;
5. Model the new text set with the LDA model, iterating with Gibbs sampling, to obtain two matrices, matrix_DT (text-theme) and matrix_TW (theme-word);
6. According to the final feature word extraction method and the corresponding document labels, obtain L new words Terms;
7. Given S subject dictionaries, calculate the subject dictionary weights $w_i$ and select the themes with $w_i > \eta$ ($\eta > 0$).
The present invention has been illustrated by the above embodiment, but it should be understood that the embodiment serves only the purpose of example and illustration and is not intended to limit the invention to the scope of the described embodiment. In addition, those skilled in the art will understand that the invention is not limited to the above embodiment; further variations and modifications can be made according to the teaching of the invention, and all of them fall within the scope claimed by the invention. The scope of protection of the invention is defined by the appended claims and their equivalents.

Claims (9)

1. A microblog user interest recognition method based on text mining, characterized by comprising the following steps:
(1) collect the latest topic microblog text data and the designated user's microblog text data;
(2) standardize the collected topic microblog text data and designated-user microblog text data;
(3) apply the microblog new word identification method to the standardized topic microblog text data to identify the latest microblog new words, and update the new-word dictionary;
(4) perform Chinese word segmentation on the standardized designated-user microblog text data with a segmentation method that uses the new-word dictionary, and calculate the TF-IDF value of each term obtained by the segmentation to obtain the text vector representation;
(5) cluster the designated-user microblog text data in its text vector representation, and regroup the original designated-user microblog text data of step (1) accordingly, to obtain a new text set and the number of clusters;
(6) apply LDA theme modeling to extract the feature words of the new text set obtained after clustering;
(7) given subject dictionaries, calculate the weight of each subject dictionary based on the feature words of the new text set to obtain the final themes, which are identified as the microblog user's interests.
2. The microblog user interest recognition method based on text mining according to claim 1, characterized in that, in step (3), the microblog new word identification method comprises the steps of:
(31) gather the standardized topic microblog text data;
(32) preprocess the topic microblog text data;
(33) segment the preprocessed topic microblog text data into multi-gram words (n-grams);
(34) apply word filtering, old-word filtering, word-frequency filtering, adjacent-string filtering and mutual-information filtering to the segmented n-grams.
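Part of steps (33) and (34) can be sketched as follows. The sketch assumes character n-grams of length 2 to `max_n` and implements only the old-word and word-frequency filters; the other filters of step (34) are left out. The function name, its parameters and the toy strings are all hypothetical.

```python
from collections import Counter

def candidate_new_words(texts, known_words, max_n=3, min_freq=2):
    """Count character n-grams (2 <= n <= max_n) and keep frequent unknown ones."""
    counts = Counter()
    for text in texts:
        for n in range(2, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1  # step (33): cut n-grams
    # step (34), partially: word-frequency filter and old-word filter
    return {g: c for g, c in counts.items()
            if c >= min_freq and g not in known_words}

# Toy strings standing in for preprocessed topic microblog texts.
cands = candidate_new_words(["abcd", "abce"], known_words={"bc"})
```

Here "bc" is dropped because it is already in the dictionary, and n-grams seen only once are dropped by the frequency filter.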
3. The microblog user interest recognition method based on text mining according to claim 2, characterized in that, in step (34), the mutual information value is calculated by the formula:
$$I(A,B) = \log_2 \frac{p(A,B)}{p(A)\,p(B)},$$
where A and B each denote a word in the text (topic microblog text data), p(A, B) is the probability that words A and B occur together, p(A) is the probability that word A occurs on its own, p(B) is the probability that word B occurs on its own, and I is the mutual information value between word A and word B.
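The mutual information formula can be illustrated with a short sketch. It assumes that "occurring together" means being adjacent in a token sequence and that all probabilities are estimated by relative frequency; neither detail is stated in the claim, and the function name and sample tokens are hypothetical.

```python
import math
from collections import Counter

def pointwise_mi(tokens):
    """Return {(A, B): I(A, B)} for every adjacent pair in a token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi           # p(A, B): relative frequency of the pair
        p_a = unigrams[a] / n_uni     # p(A)
        p_b = unigrams[b] / n_uni     # p(B)
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

scores = pointwise_mi(list("abab"))
```

A pair whose parts occur together more often than chance would predict gets a higher value, so a filter can keep candidates whose mutual information exceeds a chosen cut-off.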
4. The microblog user interest recognition method based on text mining according to claim 1, characterized in that, in step (5), the clustering comprises the steps of:
(51) convert the designated-user microblog text data into V-dimensional text vectors, where V is the mean length of the N texts (designated-user microblog text data), and select K data points as the centers of the K cluster classes by the initial-center selection method;
(52) use the Euclidean distance to calculate the distance between each data point in the designated-user microblog text data and the K centers, obtaining the clusters; the distances are denoted $d_{ij}$ ($i = 1, \dots, N$; $j = 1, \dots, K$), where $d_{ij}$ is the distance from the i-th data point to the j-th center and N is the number of data points;
(53) recalculate the cluster center of each obtained cluster: let min be the minimum of the $d_{ij}$, that is, select the center j nearest to the i-th data point; set a threshold c; if min > c, make the i-th data point a new center; otherwise, document i belongs to the class of center j;
(54) update the center of each class by recalculating the center of each cluster;
(55) repeat steps (52), (53) and (54) until convergence, the convergence condition being that two successive iterations leave the centers unchanged.
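One way to read steps (52) to (55) is sketched below: a K-Means loop in which a point farther than c from every center opens a new cluster, so the number of clusters can grow beyond the initial K. The function names, the `max_iter` guard and the toy points are assumptions made for illustration; an empty cluster keeps its previous center, a case the claim does not address.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def threshold_kmeans(points, initial_centers, c, max_iter=100):
    centers = [tuple(p) for p in initial_centers]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            dists = [math.dist(p, ctr) for ctr in centers]   # step (52)
            j = min(range(len(centers)), key=dists.__getitem__)
            if dists[j] > c:                                  # step (53): too far
                centers.append(tuple(p))                      # open a new cluster
                clusters.append([p])
            else:
                clusters[j].append(p)
        # step (54): recompute each center; an empty cluster keeps its center
        updated = [mean(m) if m else ctr for m, ctr in zip(clusters, centers)]
        converged = updated == centers                        # step (55)
        centers = updated
        if converged:
            break
    return centers, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
centers, clusters = threshold_kmeans(points, [(0, 0), (10, 10)], c=5.0)
```

In this toy run the outlier (50, 50) is farther than c from both starting centers, so it becomes a third center and the final cluster count is 3.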
5. The microblog user interest recognition method based on text mining according to claim 4, characterized in that, in step (51), the initial-center selection method comprises the following steps:
(511) select one data point at random from the N data points and denote it center;
(512) calculate the distance dis(center, m) (m = 1, ..., N) from each of the other N-1 data points to center, and accumulate all the distances: sum{dis(center, m)};
(513) choose a random value r = random(sum{dis(center, m)}) and compute r = r - dis(center, m) successively; if r < 0, the m-th data point is taken as a center, where random(sum{dis(center, m)}) denotes a value chosen at random from the range 0 to sum{dis(center, m)};
(514) repeat steps (511) and (512) until K centers have been selected.
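The seeding procedure can be sketched as a roulette-wheel selection. The sketch rests on two assumptions that go beyond the claim text: the distance used for each point is the distance to its nearest already-chosen center, and the plain distance is used as in the claim rather than the squared distance of the published k-means++ method. The function name and the `rng` parameter are hypothetical.

```python
import math
import random

def choose_initial_centers(points, k, rng=None):
    rng = rng or random.Random()
    centers = [rng.choice(points)]                  # step (511)
    while len(centers) < k:
        # step (512): distance of each point to its nearest chosen center
        dists = [min(math.dist(p, ctr) for ctr in centers) for p in points]
        total = sum(dists)
        if total == 0:      # all points coincide with centers; nothing to add
            break
        r = rng.uniform(0, total)                   # step (513)
        for p, d in zip(points, dists):
            r -= d
            if r < 0:       # the point where the running value drops below 0
                centers.append(p)
                break
        else:
            # guard for r == total: fall back to the farthest point
            centers.append(points[dists.index(max(dists))])
    return centers

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
seeds = choose_initial_centers(points, 3, random.Random(0))
```

Because a point that is already a center contributes distance 0, it can never be drawn again, and points far from the existing centers are the most likely to be chosen.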
6. The microblog user interest recognition method based on text mining according to claim 1, characterized in that, in step (6), the steps of extracting the feature words of the new text set are:
(61) for the new text set obtained after clustering, calculate the TF-IDF value of the words in each new text to obtain new text vectors;
(62) model the new text set with the LDA model, giving the parameter values and repeatedly varying the initial parameter values, and sample to obtain the "theme-word" distribution and the "document-theme" distribution;
(63) extract the feature words with the final feature word extraction method.
7. The microblog user interest recognition method based on text mining according to claim 6, characterized in that, in step (63), the final feature word extraction method comprises the following steps:
(631) for a text in the new text set, select from its "document-theme" distribution the theme with the largest weight as the key theme keyTopic;
(632) select the "theme-word" distribution corresponding to keyTopic;
(633) from the word distribution of keyTopic, take the three words with the largest proportions; if a theme is extracted more than once, record the number of times keyCount it was extracted;
(634) repeat steps (631), (632) and (633) until the whole new text set has been traversed, obtaining all the feature words.
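Steps (631) to (634) can be sketched as follows, assuming the two LDA outputs are available as plain nested lists (one row of theme weights per text, one row of word weights per theme). The function name, the argument layout and the toy matrices are hypothetical.

```python
from collections import Counter

def extract_feature_words(doc_topic, topic_word, vocab, top_n=3):
    """Return ({theme: [(word, weight), ...]}, keyCount per theme)."""
    key_count = Counter()
    features = {}
    for topic_weights in doc_topic:                       # step (634): every text
        # step (631): theme with the largest weight in this text
        key_topic = max(range(len(topic_weights)), key=topic_weights.__getitem__)
        key_count[key_topic] += 1                         # times the theme was picked
        if key_topic not in features:
            word_weights = topic_word[key_topic]          # step (632)
            top = sorted(range(len(vocab)), key=word_weights.__getitem__,
                         reverse=True)[:top_n]            # step (633): top words
            features[key_topic] = [(vocab[i], word_weights[i]) for i in top]
    return features, key_count

# Toy distributions: 3 texts, 2 themes, 5 words.
vocab = ["raid", "guild", "loot", "actor", "scene"]
topic_word = [[0.4, 0.3, 0.2, 0.05, 0.05], [0.05, 0.05, 0.1, 0.5, 0.3]]
doc_topic = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
features, key_count = extract_feature_words(doc_topic, topic_word, vocab)
```

The returned weights are the "theme-word" proportions; how keyCount enters the final feature word weight is not specified in the claim, so the sketch only records it.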
8. The microblog user interest recognition method based on text mining according to claim 1 or 6, characterized in that, in step (4) and step (61), the TF-IDF value is calculated by the following formula:
$$w_{ij} = tf_{ij} \times idf_j = tf_{ij} \times \log\left(\frac{N}{n_j}\right)$$
where $w_{ij}$ is the TF-IDF value of word j in document i, a document being either a designated-user microblog text or a new text obtained by regrouping the designated user's microblog texts; $tf_{ij}$ is the frequency with which word j occurs in document i; $n_j$ is the number of documents in the document set (the designated user's microblog text set, or the new text set obtained by regrouping it) that contain word j; N is the total number of texts in that set; and $idf_j$ is the logarithm of the ratio of N to $n_j$.
9. The microblog user interest recognition method based on text mining according to claim 1, characterized in that, in step (7), the microblog user interest identification comprises the steps of:
(71) S subject dictionaries are given;
(72) for the feature words extracted by modeling the new text set with the LDA model, count the number $N_i$ of feature words that each subject dictionary contains ($N_i$ is an integer); a word that matches none of the dictionaries is labeled as an additional category;
(73) each feature word carries a weight; the weight of each subject dictionary is calculated by the following formula:
$$w_i = \frac{\sum_{j=1}^{N_i} \mathrm{weight}(\mathrm{Term}[j])}{N_i},$$
where Term[j] is the j-th word, weight(Term[j]) is the weight of word j, $N_i$ is the number of feature words contained in subject dictionary i, and $w_i$ is the weight of subject dictionary i;
(74) sort the subject dictionaries by weight; a subject dictionary i whose weight satisfies $w_i > \eta$ (where $\eta > 0$ is a set threshold) is selected as an interest theme word of the user; otherwise, subject dictionary i is discarded;
(75) if a selected interest theme word is an emotion word, calculate the polarity of the emotion word, as follows:
(76) the non-emotion theme words obtained and the sentiment polarities calculated are taken as the interest theme words.
CN201410195244.XA 2014-05-09 2014-05-09 Microblog user interest recognizing method based on text mining Pending CN103942340A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410195244.XA CN103942340A (en) 2014-05-09 2014-05-09 Microblog user interest recognizing method based on text mining

Publications (1)

Publication Number Publication Date
CN103942340A true CN103942340A (en) 2014-07-23

Family

ID=51190008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410195244.XA Pending CN103942340A (en) 2014-05-09 2014-05-09 Microblog user interest recognizing method based on text mining

Country Status (1)

Country Link
CN (1) CN103942340A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6012058A (en) * 1998-03-17 2000-01-04 Microsoft Corporation Scalable system for K-means clustering of large databases
CN102663046A (en) * 2012-03-29 2012-09-12 中国科学院自动化研究所 Sentiment analysis method oriented to micro-blog short text

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DAVID ARTHUR ET AL.: "k-means++:The Advantages of Careful Seeding", 《PROCEEDINGS OF THE EIGHTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS》 *
孙励: "基于微博的热点话题发现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
谢丽星等: "基于层次结构的多策略中文微博情感分析和特征抽取", 《中文信息学报》 *
黄波: "基于向量空间模型和LDA模型相结合的微博客话题发现算法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (87)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199846A (en) * 2014-08-08 2014-12-10 杭州电子科技大学 Comment subject term clustering method based on Wikipedia
CN104199846B (en) * 2014-08-08 2017-09-19 杭州电子科技大学 Comment key phrases clustering method based on wikipedia
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
CN104239436B (en) * 2014-08-27 2018-01-02 南京邮电大学 It is a kind of that method is found based on the network hotspot event of text classification and cluster analysis
CN107077640B (en) * 2014-09-03 2021-07-06 邓白氏公司 System and process for analyzing, qualifying, and ingesting unstructured data sources via empirical attribution
CN107077640A (en) * 2014-09-03 2017-08-18 邓白氏公司 Analyzed via experience ownership, it is qualification and intake unstructured data sources system and processing
CN105573995A (en) * 2014-10-09 2016-05-11 中国银联股份有限公司 Interest identification method, interest identification equipment and data analysis method
CN105573995B (en) * 2014-10-09 2019-03-15 中国银联股份有限公司 A kind of interest recognition methods, equipment and data analysing method
CN104462286A (en) * 2014-11-27 2015-03-25 重庆邮电大学 Microblog topic finding method based on modified LDA
CN105786791A (en) * 2014-12-23 2016-07-20 深圳市腾讯计算机系统有限公司 Data topic acquisition method and apparatus
CN105786791B (en) * 2014-12-23 2019-07-05 深圳市腾讯计算机系统有限公司 Data subject acquisition methods and device
CN104536951B (en) * 2014-12-29 2017-04-12 北京牡丹电子集团有限责任公司数字电视技术中心 Microblog text normalizing, word segmenting and part-speech tagging method and system
CN104536951A (en) * 2014-12-29 2015-04-22 北京牡丹电子集团有限责任公司数字电视技术中心 Microblog text normalizing, word segmenting and part-speech tagging method and system
CN104598532A (en) * 2014-12-29 2015-05-06 中国联合网络通信有限公司广东省分公司 Information processing method and device
CN104573070A (en) * 2015-01-26 2015-04-29 清华大学 Text clustering method special for mixed length text sets
CN104573070B (en) * 2015-01-26 2018-06-15 清华大学 A kind of Text Clustering Method for mixing length text set
CN104657349A (en) * 2015-02-11 2015-05-27 厦门美柚信息科技有限公司 Forum post feature identifying method and device
CN104657349B (en) * 2015-02-11 2018-07-31 厦门美柚信息科技有限公司 A kind of forum postings characteristic recognition method and device
CN104778209A (en) * 2015-03-13 2015-07-15 国家计算机网络与信息安全管理中心 Opinion mining method for ten-million-scale news comments
CN104778209B (en) * 2015-03-13 2018-04-27 国家计算机网络与信息安全管理中心 A kind of opining mining method for millions scale news analysis
CN106156091A (en) * 2015-04-01 2016-11-23 富士通株式会社 The method and apparatus describing the author of short text
CN104850647A (en) * 2015-05-28 2015-08-19 国家计算机网络与信息安全管理中心 Microblog group discovering method and microblog group discovering device
CN105095196B (en) * 2015-07-24 2017-11-14 北京京东尚科信息技术有限公司 The method and apparatus of new word discovery in text
CN105095196A (en) * 2015-07-24 2015-11-25 北京京东尚科信息技术有限公司 Method and device for finding new word in text
CN105224955A (en) * 2015-10-16 2016-01-06 武汉邮电科学研究院 Based on the method for microblogging large data acquisition network service state
CN105447206B (en) * 2016-01-05 2017-04-05 深圳市中易科技有限责任公司 New comment object identifying method and system based on word2vec algorithms
CN105447206A (en) * 2016-01-05 2016-03-30 深圳市中易科技有限责任公司 New comment object identifying method and system based on word2vec algorithm
CN106021388A (en) * 2016-05-11 2016-10-12 华南理工大学 Classifying method of WeChat official accounts based on LDA topic clustering
CN106055538B (en) * 2016-05-26 2019-03-08 达而观信息科技(上海)有限公司 The automatic abstracting method of the text label that topic model and semantic analysis combine
CN106055538A (en) * 2016-05-26 2016-10-26 达而观信息科技(上海)有限公司 Automatic extraction method for text labels in combination with theme model and semantic analyses
CN106294314A (en) * 2016-07-19 2017-01-04 北京奇艺世纪科技有限公司 Topics Crawling method and device
CN106202574A (en) * 2016-08-19 2016-12-07 清华大学 The appraisal procedure recommended towards microblog topic and device
CN106776539A (en) * 2016-11-09 2017-05-31 武汉泰迪智慧科技有限公司 A kind of various dimensions short text feature extracting method and system
CN108108346B (en) * 2016-11-25 2021-12-24 广东亿迅科技有限公司 Method and device for extracting theme characteristic words of document
CN108108346A (en) * 2016-11-25 2018-06-01 广东亿迅科技有限公司 The theme feature word abstracting method and device of document
CN106649730B (en) * 2016-12-23 2021-08-10 中山大学 User clustering and short text clustering method based on social network short text stream
CN106649730A (en) * 2016-12-23 2017-05-10 中山大学 User clustering and short text clustering method based on social network short text stream
CN106649853A (en) * 2016-12-30 2017-05-10 儒安科技有限公司 Short text clustering method based on deep learning
CN106874448A (en) * 2017-02-10 2017-06-20 中国农业大学 A kind of method and apparatus that earthquake descriptor is excavated from microblogging
CN106874448B (en) * 2017-02-10 2020-03-06 中国农业大学 Method and device for mining earthquake subject term from microblog
CN108304371B (en) * 2017-07-14 2021-07-13 腾讯科技(深圳)有限公司 Method and device for mining hot content, computer equipment and storage medium
CN108304371A (en) * 2017-07-14 2018-07-20 腾讯科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium that Hot Contents excavate
CN107590172B (en) * 2017-07-17 2020-06-05 北京捷通华声科技股份有限公司 Core content mining method and device for large-scale voice data
CN107590172A (en) * 2017-07-17 2018-01-16 北京捷通华声科技股份有限公司 A kind of the core content method for digging and equipment of extensive speech data
CN107220241A (en) * 2017-07-17 2017-09-29 广州特道信息科技有限公司 The user feeling analysis method and device of social networks
CN107463552A (en) * 2017-07-20 2017-12-12 北京奇艺世纪科技有限公司 A kind of method and apparatus for generating video subject title
CN107729455A (en) * 2017-09-25 2018-02-23 山东科技大学 A kind of social network opinion leader sort algorithm based on multidimensional characteristic analysis
CN107798113B (en) * 2017-11-02 2021-11-12 东南大学 Document data classification method based on cluster analysis
CN107798113A (en) * 2017-11-02 2018-03-13 东南大学 A kind of document data sorting technique based on cluster analysis
CN107766576A (en) * 2017-11-15 2018-03-06 北京航空航天大学 A kind of extracting method of microblog users interest characteristics
CN107832467A (en) * 2017-11-29 2018-03-23 北京工业大学 A kind of microblog topic detecting method based on improved Single pass clustering algorithms
CN109948040A (en) * 2017-12-04 2019-06-28 北京京东尚科信息技术有限公司 Storage, recommended method and the system of object information, equipment and storage medium
CN108182174A (en) * 2017-12-27 2018-06-19 掌阅科技股份有限公司 New words extraction method, electronic equipment and computer storage media
CN108399227A (en) * 2018-02-12 2018-08-14 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of automatic labeling
CN108399227B (en) * 2018-02-12 2020-09-01 平安科技(深圳)有限公司 Automatic labeling method and device, computer equipment and storage medium
CN110390092A (en) * 2018-04-18 2019-10-29 腾讯科技(深圳)有限公司 Document subject matter determines method and relevant device
CN108897769A (en) * 2018-05-29 2018-11-27 武汉大学 Network implementations text classification data set extension method is fought based on production
US10803253B2 (en) 2018-06-30 2020-10-13 Wipro Limited Method and device for extracting point of interest from natural language sentences
CN109086375A (en) * 2018-07-24 2018-12-25 武汉大学 A kind of short text subject extraction method based on term vector enhancing
CN109086375B (en) * 2018-07-24 2021-10-22 武汉大学 Short text topic extraction method based on word vector enhancement
CN110858217A (en) * 2018-08-23 2020-03-03 北大方正集团有限公司 Method and device for detecting microblog sensitive topics and readable storage medium
CN109325117A (en) * 2018-08-24 2019-02-12 北京信息科技大学 Social security events detection method in a kind of microblogging of multiple features fusion
CN109325117B (en) * 2018-08-24 2022-10-11 北京信息科技大学 Multi-feature fusion social security event detection method in microblog
CN109492092A (en) * 2018-09-29 2019-03-19 北明智通(北京)科技有限公司 Document classification method and system based on LDA topic model
CN111309898A (en) * 2018-11-26 2020-06-19 中移(杭州)信息技术有限公司 Text mining method and device for new word discovery
CN109766437A (en) * 2018-12-07 2019-05-17 中科恒运股份有限公司 A kind of Text Clustering Method, text cluster device and terminal device
CN109857857A (en) * 2019-01-17 2019-06-07 中国人民解放军国防科技大学 Method for detecting drift of user reading interest topic
CN109857857B (en) * 2019-01-17 2020-11-20 中国人民解放军国防科技大学 Method for detecting drift of user reading interest topic
CN111723349A (en) * 2019-03-18 2020-09-29 顺丰科技有限公司 User identification method, device, equipment and storage medium
CN110096704B (en) * 2019-04-29 2023-05-05 扬州大学 Dynamic theme discovery method for short text stream
CN110096704A (en) * 2019-04-29 2019-08-06 扬州大学 A kind of Dynamic Theme discovery algorithm of short text stream
CN112115698A (en) * 2019-06-19 2020-12-22 国际商业机器公司 Techniques for generating topic models
CN110738047A (en) * 2019-09-03 2020-01-31 华中科技大学 Microblog user interest mining method and system based on image-text data and time effect
WO2021087676A1 (en) * 2019-11-04 2021-05-14 Beijing Didi Infinity Technology And Development Co., Ltd. System, method, and storage medium for selecting learning materials
CN111310453B (en) * 2019-11-05 2023-04-25 上海金融期货信息技术有限公司 User theme vectorization representation method and system based on deep learning
CN111310453A (en) * 2019-11-05 2020-06-19 上海金融期货信息技术有限公司 User theme vectorization representation method and system based on deep learning
CN111339247A (en) * 2020-02-11 2020-06-26 安徽理工大学 Microblog subtopic user comment emotional tendency analysis method
CN111339247B (en) * 2020-02-11 2022-10-28 安徽理工大学 Microblog subtopic user comment emotional tendency analysis method
CN111460137A (en) * 2020-05-20 2020-07-28 南京大学 Micro-service focus identification method, device and medium based on topic model
CN111460137B (en) * 2020-05-20 2023-10-17 南京大学 Method, equipment and medium for identifying micro-service focus based on topic model
WO2022028249A1 (en) * 2020-08-05 2022-02-10 华中师范大学 Learning interest discovery method for online learning community
CN112101039A (en) * 2020-08-05 2020-12-18 华中师范大学 Learning interest discovery method for online learning community
CN112084306A (en) * 2020-09-10 2020-12-15 北京天融信网络安全技术有限公司 Sensitive word mining method and device, storage medium and electronic equipment
CN112084306B (en) * 2020-09-10 2023-08-29 北京天融信网络安全技术有限公司 Keyword mining method and device, storage medium and electronic equipment
CN112182228A (en) * 2020-10-26 2021-01-05 城云科技(中国)有限公司 Method and device for mining and summarizing short text hot topic
CN113010643A (en) * 2021-03-22 2021-06-22 平安科技(深圳)有限公司 Method, device and equipment for processing vocabulary in field of Buddhism and storage medium
CN113010643B (en) * 2021-03-22 2023-07-21 平安科技(深圳)有限公司 Method, device, equipment and storage medium for processing vocabulary in Buddha field

Similar Documents

Publication Publication Date Title
CN103942340A (en) Microblog user interest recognizing method based on text mining
CN102411563B (en) Method, device and system for identifying target words
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN104199972B (en) Named entity relation extraction and construction method based on deep learning
CN104598535B (en) Event extraction method based on maximum entropy
CN104778209B (en) Opinion mining method for million-scale news analysis
CN103729402B (en) Method for establishing mapping knowledge domain based on book catalogue
CN103605665A (en) Keyword based evaluation expert intelligent search and recommendation method
CN103218444B (en) Semantics-based method for Tibetan web page text classification
CN103235774B (en) Method for extracting feature words from science and technology project application forms
CN106951438A (en) Event extraction system and method for the open domain
CN104484343A (en) Topic detection and tracking method for microblog
CN104281653A (en) Viewpoint mining method for ten million microblog texts
CN103631859A (en) Intelligent review expert recommending method for science and technology projects
CN104408093A (en) News event element extracting method and device
CN103324626B (en) Method for building a multi-granularity dictionary, word segmentation method, and device thereof
CN103246644B (en) Method and device for processing Internet public opinion information
CN103678670A (en) Micro-blog hot word and hot topic mining system and method
CN102270212A (en) User interest feature extraction method based on hidden semi-Markov model
CN103605794A (en) Website classifying method
CN103049569A (en) Text similarity matching method on basis of vector space model
CN104462053A (en) Inner-text personal pronoun anaphora resolution method based on semantic features
CN103778243A (en) Domain term extraction method
CN108038205A (en) Viewpoint analysis prototype system for Chinese microblogs
CN103279478A (en) Method for extracting features based on distributed mutual information documents

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140723
