CN106649730B - User clustering and short text clustering method based on social network short text stream

Info

Publication number: CN106649730B
Application number: CN201611206373.XA
Authority: CN
Inventors: 沈鸿; 邱章成
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2016-12-23
Filing date: 2016-12-23
Publication date: 2021-08-10
Anticipated expiration: 2036-12-23
Also published as: CN106649730A

Abstract

The invention addresses the problems of "word sense drift" and "short text sparsity" in current semantic-based user clustering and short text clustering methods, which also fail to consider social factors. A user clustering and text clustering method based on topic modeling of social network short text streams is provided, comprising the following steps: S1, obtaining the corpus; S2, preprocessing the corpus; S3, topic modeling of the short text data stream in the social network; S4, derivation and sampling; S5, clustering the users; and S6, clustering the short texts. The invention jointly considers three factors that influence topic modeling, namely "word sense drift", "short text sparsity" and the "social network". It solves the loss of social semantic information when users and texts are clustered from short text streams in a social network, and greatly improves the precision of existing clustering algorithms.

Description

User clustering and short text clustering method based on social network short text stream

Technical Field

The invention relates to the technical field of computers, in particular to a method and a system for clustering users and short texts based on short text streams of a social network.

Background

With the popularization of the mobile internet and the rapid development of social networks, data from hundreds of millions of users has accumulated on social networks, and how to analyze the short texts published by users in order to cluster users and short texts has become a very important topic. However, existing methods do not effectively perform dynamic user clustering based on the latent semantic information of short text data streams in a social network. The invention therefore provides an effective dynamic clustering method that clusters users and short texts through latent semantic analysis of the short text data stream in a social network.

The invention is significantly different from the following short text processing patents in principle and design and application scenarios.

Publication number CN104850617A provides a short text processing method and device. The method obtains a first short text set and preprocesses it; based on the preprocessed set, it trains an LDA topic model to obtain the topic probability distribution of each short text in the set, clusters the topic probability distributions, and determines the topic category of each short text in the set. However, the LDA used in that invention is not suitable for short text data streams in a social network for three reasons: (1) the time factor is not considered; (2) social factors are not considered; (3) the expression habits of the user are not considered. It therefore cannot truly capture the topic characteristics of users in a social network.

Publication number CN101477563 provides a method, system and data processing apparatus for clustering short texts. Specifically: step 1, take all short texts in the short text set as one category; step 2, select a category from all current categories and search for a core word in the selected category; step 3, if a core word is found, divide the selected category into two categories according to whether the core word is contained, and execute step 2; step 4, if no core word is found, record and delete the selected category, select one of the remaining categories, and execute step 2 until no category is left. The recorded categories are the clustering result.

Publication number CN105468713A, cited here as a method for online analysis of short text information clustering in network traffic, discloses a multi-model fusion short text classification method comprising a learning method and a classification method. The learning method comprises the following steps: performing word segmentation and filtering on the short text training data to obtain a word set; calculating an IDF value for each word; calculating the TF-IDF values of all words and constructing text vectors in the vector space model (VSM); and performing text learning based on the vector space model to construct an ontology tree model, a keyword overlap model, a naive Bayes model and a support vector machine model. The classification method comprises the following steps: performing word segmentation and filtering on the short text to be classified; generating a text vector based on the vector space model; applying the ontology tree model, the keyword overlap model, the naive Bayes model and the support vector machine model separately to obtain single-model classification results; and fusing the single-model classification results to obtain the final classification result.

Publication number CN104915386A provides a short text clustering method based on deep semantic feature learning. It selects a training text, reduces the dimension of the original features of the training text under a local-information-preserving constraint using a feature dimension reduction method, and binarizes the resulting low-dimensional real-valued vectors. It acquires word features from the training text, looks up the word vector corresponding to each word feature in a table, and feeds the word vectors to a convolutional neural network as input to learn deep semantic representation features. The output nodes of the convolutional neural network fit the dimension-reduced binary codes through multiple logistic regressions. Error back-propagation training is performed on the fitting residual between the binary features output by the convolutional neural network and the binary features obtained from dimension reduction of the original features. Finally, the updated convolutional neural network model maps the training text to deep semantic features, and the K-Means clustering algorithm produces the clustering result of the short texts.

Disclosure of Invention

The invention aims to solve the problems of "word sense drift" and "short text sparsity" in current semantic-based user clustering and short text clustering methods, which do not consider social factors. It performs topic modeling on the short text stream in a social network by adding, to the traditional topic generation model, the influence of the user's word expression habits and of the closeness distribution of friend relationships on the user's topic distribution. Optionally, English is used as the training corpus in this description.

In order to achieve the purpose, the invention adopts the following technical scheme:

A user clustering and text clustering method based on topic modeling of social network short text streams is shown in FIG. 1 and comprises the following steps:

S1, obtaining the corpus. The corpus of a social network platform is obtained through a crawler or through an API opened by the social network platform company, or user corpora are collected through a self-built social network system;

S2, preprocessing the corpus. This includes word segmentation, stop-word removal, stem extraction and entity extraction;

S3, topic modeling of the short text data stream in the social network. Topic modeling is performed on the texts in the corpus, taking into account the social relationships among text authors, the "word sense drift" problem and the short text sparsity problem, so as to extract the topic of each text;

S4, derivation and sampling. The joint probability distribution of the topics is derived from the established probabilistic graphical model and used as the joint probability distribution for Gibbs sampling; when sampling converges, the topic distributions of the users and the texts are counted;

S5, clustering the users. The obtained user topics are taken as the features of the users in the corpus, and K-Means clustering is executed to obtain the user clustering result;

S6, clustering the short texts. The obtained short text topics are taken as the features of the short texts, and K-Means clustering is executed to obtain the short text clustering result.

Preferably, in step S1, the Streaming API published by Twitter is used to obtain the English corpus, and the Sina API is used to obtain the Chinese corpus. Optionally, the invention takes the English corpus as the example processing object.

Preferably, in step S2, for the Chinese corpus, the invention segments the short texts using the ICTCLAS segmentation method, a "longest word segmentation" method. For the English corpus, the invention removes stop words using the Lemur stop-word list and extracts stems using the Porter method among the stemming methods of NLTK.

Optionally, in step S3, the invention employs a self-designed topic model. In both text topic modeling and user topic modeling, word sense drift, short text sparsity and the friend circle in the social network are considered together, and the topic model is redesigned to suit topic modeling in this application environment. In addition, the method integrates user topic modeling and text topic modeling into one model, which is efficient and convenient and serves both purposes at once.

● For the priors α_{t,u} and β_{t,e,z}, the invention uses a Dirichlet distribution.

● For the topic distributions θ_{t,u} and φ_{t,e,z}, the invention uses a multinomial distribution, which is conjugate to the Dirichlet distribution.
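The conjugacy between the Dirichlet and multinomial distributions can be written out with the standard identity below. This is a generic worked relation, not the sampling formula of the invention; the count vector n, the total N and the number of topics K are notation introduced here for illustration only.

```latex
\theta \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K), \qquad
n = (n_1, \dots, n_K) \mid \theta \sim \mathrm{Mult}(N, \theta)
\;\Longrightarrow\;
\theta \mid n \sim \mathrm{Dir}(\alpha_1 + n_1, \dots, \alpha_K + n_K)

p(z_{\mathrm{new}} = k \mid n) = \frac{n_k + \alpha_k}{\sum_{j=1}^{K} (n_j + \alpha_j)}
```

Because the posterior stays in the Dirichlet family, the multinomial parameters can be integrated out and only counts need to be tracked, which is what keeps the derivation of the joint distribution for Gibbs sampling tractable.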

After the model is built, the joint probability distribution of the topics is derived from the topic model by the Gibbs sampling method, and the topic generation process is sampled repeatedly with the sampling formula to obtain the final topic distribution. The sampling formula is as follows:

Preferably, in steps S5 and S6, the invention clusters the obtained topics with K-Means. Step S4 differs from S3 in that, for new users, the topic distribution depends on the short texts published and the user interactions during the current time period.

Drawings

FIG. 1 is a system flow diagram of the present invention;

FIG. 2 is a probabilistic graphical model of the topic modeling of the present invention;

FIG. 3 is a process of clustering existing users according to the present invention;

FIG. 4 is a flow of the present invention clustering new users;

FIG. 5 is a flow of clustering short texts published by existing users according to the present invention;

FIG. 6 is a process of clustering short texts published by a new user according to the present invention.

Detailed Description

The invention will be further elucidated with reference to the drawing and a specific embodiment. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

The invention provides a new text topic modeling method that addresses word sense drift and short text sparsity in the short text data streams published by users in a social network, so as to obtain the text topics and user topics in each time period.

Obtaining the corpus: short texts covering a long time period T are acquired. A smaller time value t is then selected to divide the long period into T/t time intervals (t can be one year, one quarter, one month, one week or one day). Since texts and text authors in a social network are socially associated, analyzing the topic of a text in isolation loses that social association and yields no practically useful text features. And because the text topics in two consecutive time periods are correlated, a dynamic social probabilistic graphical model is proposed for topic modeling of the process by which users publish short texts. Optionally, the user information may be obtained from a self-built social network system, or a web crawler may be implemented to crawl user information and published data from an existing social networking site. The server obtains the user information through a network, including but not limited to the user id, the user's short texts, the ids of the user's friends, and the publication timestamp of each short text. The corpus is divided into a Chinese corpus and an English corpus according to language. For the English corpus, Twitter data is obtained through the Streaming API of Twitter and comprises a timestamp, a user id, all friends of the user id and the tweet content published by the user. For the Chinese corpus, user data of a microblog is captured by a crawler or, optionally, system user data is acquired from a self-built social system.

Optionally, as an example, for the English corpus the invention uses the Streaming API of Twitter to obtain the text data published by social users; the data carries the timestamp, the user id and the tweet content. An existing corpus can also be used, provided that each record in the corpus contains the text, the user id and the timestamp, and that the friend relationships between user ids are available.

User data covering a duration T is collected. The short text data published by users is stored on the server SVR in the format of user id, text content and timestamp. Then, according to the timestamps, the period T, counted from the earliest time, is divided into T/t small intervals of length t, and all short texts are arranged into short text sets by interval. The data format of a short text is a triple <userid, text, timestamp>, where userid is the id of the user, text is the content of the short text, and timestamp is the time at which the short text was published. The friend list of the user at that time must also be acquired. Supposing the user has n friends, the list is represented in the format [f1, f2, …, fn].
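The triple format and the division into intervals can be sketched as follows. This is a minimal illustration, assuming timestamps are numeric seconds; the names `ShortText`, `split_into_intervals` and the sample records are hypothetical and do not come from the invention.

```python
from collections import defaultdict
from typing import NamedTuple


class ShortText(NamedTuple):
    """The <userid, text, timestamp> triple described in the text."""
    userid: str
    text: str
    timestamp: float  # assumed unit: seconds


def split_into_intervals(records, interval_seconds):
    """Group records into consecutive intervals of length t,
    counted from the earliest timestamp in the collection."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if not records:
        return {}
    start = min(r.timestamp for r in records)
    buckets = defaultdict(list)
    for r in records:
        index = int((r.timestamp - start) // interval_seconds)
        buckets[index].append(r)
    return dict(buckets)


DAY = 24 * 60 * 60  # t = one day, one of the options listed in the text

records = [
    ShortText("u1", "hello world", 0),
    ShortText("u2", "topic models", DAY - 1),
    ShortText("u1", "short text stream", DAY),
    ShortText("u3", "social network", 3 * DAY + 5),
]

# Friend lists in the [f1, f2, ..., fn] format, keyed by user id.
friends = {"u1": ["u2", "u3"], "u2": ["u1"], "u3": ["u1"]}

intervals = split_into_intervals(records, DAY)
```

Intervals that contain no short text simply do not appear as keys, so downstream steps can iterate over the populated periods only.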

Preprocessing the corpus: to improve the efficiency of short text processing, the short texts need to be preprocessed. For English: stop words are removed and the stem of each word is extracted; optionally, the Porter method of NLTK is used for stemming. For Chinese: stop words (such as possessive particles) are removed and the text is then segmented; optionally, the short text is segmented with a "longest segmentation" method such as the ICTCLAS segmentation method. The resulting data has the following format: [user id, user friend ids, short text word set, timestamp].
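The English preprocessing path can be sketched as below. This is only an illustration under stated assumptions: the stop-word set is a tiny placeholder rather than the Lemur list, the tokenizer is a simple regular expression, and the helper names `preprocess` and `make_record` are hypothetical. NLTK's `PorterStemmer` is used when the library is installed; otherwise the sketch falls back to leaving words unstemmed.

```python
import re

# Tiny illustrative stop-word set; the text names the Lemur stop-word list instead.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}


def identity(word):
    """Fallback 'stemmer' that returns the word unchanged."""
    return word


def preprocess(text, stop_words=STOP_WORDS, stem=identity):
    """Lowercase, tokenize, drop stop words, then stem each remaining word."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(tok) for tok in tokens if tok not in stop_words]


def make_record(user_id, friend_ids, text, timestamp, stem=identity):
    """Build [user id, user friend ids, short text word set, timestamp]."""
    return [user_id, list(friend_ids), preprocess(text, stem=stem), timestamp]


try:
    # Porter stemming via NLTK, as suggested in the text.
    from nltk.stem import PorterStemmer
    default_stem = PorterStemmer().stem
except ImportError:
    # Assumption for this sketch: run without stemming if NLTK is unavailable.
    default_stem = identity

record = make_record(
    "u1", ["u2", "u3"], "The clustering of short texts is hard",
    1482451200, stem=default_stem,
)
```

Passing the stemmer in as a parameter keeps the record-building step independent of any particular stemming library.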

Topic modeling based on short text streams in social networks: step S3 considers the influence on topic modeling of the social relationships of short text authors in the social network, of "word sense drift" and of short text sparsity, so as to establish a topic model suited to the characteristics of social network short texts and extract the user topics and text topics. It comprises the following steps:

S301, for the social relationships of short text authors in the social network, a friend relationship closeness distribution is introduced to measure the degree to which friends influence each other's topics.

S302, for the word sense drift problem of short text semantics in the social network, the topic-word distribution is regarded as an expression habit and divided into 3 types: the user's own expression habits, the expression habits of the user's friends, and the common expression habits of the rest of the social network.

S303, for the sparsity problem of short texts in the social network, when the topic model is sampled, the topics of all words in a text are unified into the topic of the short text. This matches user behavior when publishing a short text: because the number of words is limited, all words serve the single topic of the short text.

A topic generation model, DSM, based on chains of time and friend relationships is built for the short text publication process. The dynamic social topic generation model, proposed for the first time by the invention, is shown in FIG. 2:

The model shows the generation process of short text topics and user topics in three time periods t-1, t and t+1. Because of the sparsity of short texts, and unlike the traditional LDA model, only one topic z is assigned to each short text (traditional LDA allows each word in a text to have a different topic). All words w in the short text are assigned to topic z. The generation of each word w depends on the selected expression habit e and the corresponding topic-word distribution.

Each short text is published by a user, and its topic reflects the latent semantics and expressive intent of that user. In a social network, each user has a certain number of friends and, under their influence, may publish texts whose style and intent differ from the user's past ones, which is why social factors are considered. The closeness between user q and the set of friends {f1, …, fn} is represented by a multinomial distribution. In the model diagram, gray circles represent observed quantities and white circles represent hidden (latent) variables. A rectangular box represents repeated sampling of the variables inside it, and an arrow indicates that the variable it points to depends on the variable at its tail.
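One possible reading of this generative process is sketched below. It is not the model of FIG. 2: the exact dependencies, the time chain between periods and the parameter names are not recoverable from the text, so everything here (`draw`, `generate_short_text`, treating closeness as a distribution over the user and the friends, and the toy probabilities) is an assumption made for illustration.

```python
import random


def draw(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]


def generate_short_text(user, closeness, theta, habit_probs, phi, length, rng):
    # Assumed step: closeness decides whose topic distribution drives this text
    # (the user or one of the friends).
    source = draw(closeness[user], rng)
    # One topic z for the whole short text (the sparsity treatment of S303).
    z = draw(theta[source], rng)
    words = []
    for _ in range(length):
        # Each word picks an expression habit e, then a word from phi[e][z].
        e = draw(habit_probs, rng)
        words.append(draw(phi[e][z], rng))
    return z, words


# Toy parameters, all hypothetical.
closeness = {"q": {"q": 0.6, "f1": 0.4}}
theta = {"q": {"sports": 0.8, "music": 0.2}, "f1": {"sports": 0.1, "music": 0.9}}
habit_probs = {"own": 0.5, "friend": 0.3, "global": 0.2}
phi = {
    "own": {"sports": {"goal": 0.7, "match": 0.3}, "music": {"song": 0.6, "gig": 0.4}},
    "friend": {"sports": {"game": 0.5, "team": 0.5}, "music": {"album": 0.5, "band": 0.5}},
    "global": {"sports": {"sport": 1.0}, "music": {"music": 1.0}},
}

topic, words = generate_short_text("q", closeness, theta, habit_probs, phi, 5, random.Random(7))
```

The sketch keeps the two properties stated in the text: every word of a short text shares one topic, and each word is drawn from one of three habit-specific topic-word distributions.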

Topic generation: after the probabilistic graphical generation model is established, Gibbs sampling is optionally used to sample topics for each short text in the short text set, each word in the short text and each user. Sampling continues until the probability distribution expressed by the graphical model converges, and the topic distributions of each user and each short text are then counted.
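Since the sampling formula of the invention is not reproduced in this text, the following sketch substitutes a simpler, well-known technique: collapsed Gibbs sampling for a Dirichlet-multinomial mixture in which each short text has exactly one topic. It omits the time chain, the friend closeness and the expression habits, so it only illustrates the "one topic per short text" sampling loop and the final counting step. The function names, hyperparameter values and toy documents are assumptions.

```python
import random
from collections import Counter


def gibbs_one_topic_per_text(docs, num_topics, alpha=0.1, beta=0.1,
                             iterations=30, seed=0):
    """Collapsed Gibbs sampling for a Dirichlet-multinomial mixture
    (a stand-in technique, not the patent's sampling formula)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    z = [rng.randrange(num_topics) for _ in docs]          # topic of each text
    docs_in_topic = [0] * num_topics                       # texts per topic
    words_in_topic = [0] * num_topics                      # word tokens per topic
    word_counts = [Counter() for _ in range(num_topics)]   # per-topic word counts

    def update(d, k, sign):
        docs_in_topic[k] += sign
        words_in_topic[k] += sign * len(docs[d])
        for w in docs[d]:
            word_counts[k][w] += sign

    for d, k in enumerate(z):
        update(d, k, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            update(d, z[d], -1)  # remove the text from its current topic
            weights = []
            for k in range(num_topics):
                p = docs_in_topic[k] + alpha
                seen = Counter()
                for i, w in enumerate(doc):
                    p *= (word_counts[k][w] + beta + seen[w]) / (
                        words_in_topic[k] + vocab_size * beta + i)
                    seen[w] += 1
                weights.append(p)
            z[d] = rng.choices(range(num_topics), weights=weights, k=1)[0]
            update(d, z[d], +1)
    return z, docs_in_topic, word_counts


def user_topic_distribution(z, authors, num_topics, alpha=0.1):
    """Count each user's topic assignments and normalize (smoothed by alpha)."""
    counts = {}
    for topic, user in zip(z, authors):
        counts.setdefault(user, [0] * num_topics)[topic] += 1
    return {u: [(c + alpha) / (sum(cs) + num_topics * alpha) for c in cs]
            for u, cs in counts.items()}


docs = [["goal", "match", "team"], ["team", "goal"],
        ["song", "album"], ["album", "band", "song"]]
authors = ["u1", "u1", "u2", "u2"]
assignments, topic_sizes, _ = gibbs_one_topic_per_text(docs, num_topics=2, seed=1)
user_features = user_topic_distribution(assignments, authors, num_topics=2)
```

The resulting per-user vectors play the role of the user topic features that are clustered in the next step; a fixed iteration count is used here in place of a convergence check.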

Feature clustering: after the topic features of the users and the short texts are obtained, the users and the short texts can be clustered separately, giving a clustering result that conforms to human semantic understanding. Optionally, the K-Means clustering algorithm is used. User clusters and short text clusters in period t are thus obtained. The K-Means method first selects k cluster centers; then, for every other data point (here a topic probability distribution, which is an n-dimensional vector), the cosine distance to each cluster center is calculated and the point is assigned to the cluster whose center is closest in cosine distance. The center of each cluster is then recalculated to obtain k new centers. These steps are repeated until the k centers no longer change.
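The K-Means loop described above can be sketched in plain Python as follows. This is a minimal illustration, assuming the initial centers are supplied by the caller and that an empty cluster keeps its previous center; `cosine_distance`, `kmeans_cosine` and the sample vectors are hypothetical names and data.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def kmeans_cosine(points, centers, max_iter=100):
    """Assign each point to the closest center by cosine distance, recompute
    each center as the mean of its members, and stop when centers are stable."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centers)), key=lambda k: cosine_distance(p, centers[k]))
            for p in points
        ]
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # assumption: keep an empty cluster's center
        if new_centers == centers:  # the k centers no longer change
            break
        centers = new_centers
    return labels, centers


# Topic distributions of four users over two topics (toy values).
user_topics = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centers = kmeans_cosine(user_topics, centers=[user_topics[0], user_topics[2]])
```

The same function applies unchanged to short text topic vectors, since both users and short texts are represented as topic probability distributions.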

In particular, for new users or new texts that first appear in period t, the topic-word distribution of period t is adopted. For such a user or text, the topic distribution of the user in period t and the topic distribution of the short texts published by the user in period t are analyzed, and each is then assigned to the closest of the clusters calculated for period t.

User clustering falls into the following 2 cases according to when the user arrived.

For users already in the system (as in FIG. 3):

Step S5101: collect the short text data streams published by each user in periods 1 to t;

Step S5102: perform dynamic social topic analysis on the short texts published by existing users in periods 1 to t and take the result as the corresponding user features;

Step S5103: cluster with K-Means according to the topic features of each user in each time period.

For newly arrived users (users who first arrive during period t, as in FIG. 4):

Step S5201: collect the short text data published by the new user in period t;

Step S5202: perform topic analysis on the short text data of that period to obtain the topic features of the new user;

Step S5203: compare the topic features of the user with the calculated clusters and assign the user to the cluster at the closest Euclidean distance.
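The assignment of a newly arrived user can be sketched as a nearest-center lookup. This is a minimal illustration; the function name `assign_to_nearest_cluster` and the example centers are assumptions, and the centers are taken to be those already computed for the existing users.

```python
import math


def assign_to_nearest_cluster(features, centers):
    """Return the index of the cluster center closest in Euclidean distance."""
    distances = [math.dist(features, center) for center in centers]
    return min(range(len(centers)), key=distances.__getitem__)


# Centers previously computed for existing users (toy values).
existing_centers = [[0.85, 0.15], [0.15, 0.85]]

# Topic features of a new user extracted from period t.
new_user_cluster = assign_to_nearest_cluster([0.7, 0.3], existing_centers)
```

No re-clustering is needed for the new user; the existing clusters are reused as-is.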

Similarly, short text clustering falls into the following 2 cases according to the user.

For short texts published by an existing user (see FIG. 5):

Step S6101: collect the short text data streams published by each user in periods 1 to t;

Step S6102: perform dynamic social topic analysis on the short texts published by existing users in periods 1 to t and take the result as the corresponding short text features;

Step S6103: cluster with K-Means according to the topic features of each short text in each time period.

For short texts published by a newly arrived user (see FIG. 6):

Step S6201: collect the short text data published by the new user in period t;

Step S6202: perform topic analysis on the short text data of that period to obtain the topic features of the new short text;

Step S6203: compare the topic features obtained with the calculated clusters and assign the user's short text to the cluster at the closest Euclidean distance.

The foregoing embodiments and examples are merely preferred embodiments and examples of the present patent and are not to be construed as limiting its embodiments. Other variations and modifications will be apparent to persons skilled in the art based on the foregoing description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims

1. The user clustering and short text clustering method based on the social network short text stream is characterized by comprising the following steps of:

S1, obtaining the corpus, namely obtaining the corpus of a social network platform through a crawler or through an API opened by the social network platform company, or collecting user corpora through a self-built social network system;

collecting user data with a time length of T, storing the short text data published by users, and storing it on a server SVR; then, according to the timestamps, dividing the period T, counted from the earliest time, into T/t small intervals, and arranging all short texts into short text sets by interval, wherein t is a year, a quarter, a month, a week or a day;

setting a short text data format expressed as a triple <userid, text, timestamp>, wherein userid is the id of the user, text is the content of the short text, and timestamp is the time at which the short text was published; and acquiring the friend list of the user at the time the short text was published, expressed in the format [f1, f2, …, fn] on the assumption that the user has n friends;

S2, preprocessing the corpus, including word segmentation, stop-word removal, stem extraction and entity extraction;

performing word segmentation, stem extraction, entity extraction and stop-word removal on the obtained data, dividing the time period T of the collected data set into T/t sections according to the selected time interval t, and processing each section of data separately;

S3, topic modeling of the short text data stream in the social network, namely performing topic modeling on the texts in the corpus with regard to the social relationships among text authors, the "word sense drift" problem and the short text sparsity problem in the corpus, so as to extract the topic of each text;

S301, for the social relationships of short text authors in the social network, introducing a friend relationship closeness distribution for measuring the degree to which friends influence each other's topics;

S302, for the word sense drift problem of short text semantics in the social network, regarding the topic-word distribution as an expression habit and dividing it into 3 types, namely: the user's own expression habits, the expression habits of the user's friends, and the common expression habits of the rest of the social network;

S303, for the sparsity problem of short texts in the social network, unifying the topics of all words in a text into the topic of the short text when sampling the topic model;

S4, derivation and sampling, namely deriving the joint probability distribution of the topics from the established probabilistic graphical model, taking it as the joint probability distribution for Gibbs sampling, and counting the topic distributions of the users and the texts when sampling converges;

setting the prior distributions in the topic model as Dirichlet distributions and the topic distributions as multinomial distributions, and simplifying the derivation of the joint probability distribution through the conjugate relation between the Dirichlet distribution and the multinomial distribution;

S5, clustering the users, namely taking the obtained user topics as the features of the users in the corpus and executing K-Means clustering to obtain the user clustering result;

S6, clustering the short texts, namely taking the obtained short text topics as the features of the short texts and executing K-Means clustering to obtain the short text clustering result;

in steps S5 and S6, the obtained user features and text features are each clustered with the K-Means algorithm to obtain the user clusters and text clusters of the period; for a newly arrived user, after the features of the user are extracted, the user is assigned to the cluster at the closest Euclidean distance.

2. The method of claim 1, wherein the corpus includes a Chinese corpus and an English corpus; in step S1, the Streaming API published by Twitter is used to obtain the English corpus, and the Sina API is used to obtain the Chinese corpus.

3. The method for clustering users and short texts based on the social network short text stream as claimed in claim 1, wherein in step S2, for the Chinese corpus, the ICTCLAS method, a "longest word segmentation" method, is used to segment the short texts; for the English corpus, stop words are removed using the Lemur stop-word list, and stems are extracted using the Porter method among the stemming methods of NLTK.