CN105183909B

CN105183909B - social network user interest predicting method based on Gaussian mixture model

Info

Publication number: CN105183909B
Application number: CN201510646248.XA
Authority: CN
Inventors: 郑相涵; 赖太平; 郭文忠
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2015-10-09
Filing date: 2015-10-09
Publication date: 2017-04-12
Anticipated expiration: 2035-10-09
Also published as: CN105183909A

Abstract

The invention relates to a social network user interest prediction method based on a Gaussian mixture model. The method comprises the following steps: 1, acquiring user data from a social network; 2, extracting features from the acquired user data to generate a series of feature vectors; 3, building a prediction model using the Gaussian mixture model; 4, optimizing the parameters using the EM algorithm and calculating the prediction result. By adopting the Gaussian mixture model, the method achieves higher prediction accuracy, reduces the time required, and effectively predicts the short-term interests of users.

Description

Social network user interest prediction method based on Gaussian mixture model

Technical Field

The invention relates to the technical field of social network information analysis, in particular to a social network user interest prediction method based on a Gaussian mixture model.

Background

The rapid diffusion of information and the convenience of social networks encourage large numbers of users to share their daily activities, exchange opinions, and build friendships with others. One report estimated that the number of social network users worldwide would reach 2.33 billion by 2017. Effective feature learning and interest prediction are therefore of great significance both to users (e.g., for finding other users with similar interests) and to service providers (e.g., for analyzing user behavior in a set of application scenarios to make personalized recommendations).

However, given the characteristics of social data (e.g., large volume, diversity, and value), it is difficult to predict user interests with high accuracy while keeping computational complexity and latency within acceptable ranges. Furthermore, the short-term interests in a user's interest profile may change dynamically (e.g., under the influence of friends). Therefore, a social network user interest prediction method based on a Gaussian mixture model is provided that can effectively predict the short-term interests of users.

Disclosure of Invention

In view of this, the present invention provides a social network user interest prediction method based on a Gaussian mixture model, so as to achieve higher prediction accuracy, reduce the time required, and effectively predict the short-term interests of users.

The invention is realized by adopting the following scheme: a social network user interest prediction method based on a Gaussian mixture model comprises the following steps:

step S1: obtaining user data from a social network;

step S2: extracting features from the acquired user data to generate a series of feature vectors;

step S3: adopting a Gaussian mixture model to construct a prediction model;

step S4: and optimizing parameters by adopting an EM algorithm and calculating a prediction result.

Further, the step S1 is specifically: microblog information published or forwarded by p microblog users is acquired as training data, microblog information published or forwarded by q microblog users is acquired as test data, and r hot microblog categories and s hot microblogs in each hot microblog category are acquired.

Further, the step S2 is specifically: the hot microblogs are preprocessed, including word segmentation, word frequency statistics and deduplication, to obtain t hot keywords that serve as the interest feature values of the hot microblog categories, thereby generating r t-dimensional hot microblog feature vectors; meanwhile, taking each microblog user as a unit, the training data and the test data are preprocessed, including Chinese word segmentation, stop word processing and word frequency statistics; and, according to the r t-dimensional hot microblog feature vectors, the t interest feature values corresponding to each user are extracted from the microblog information published or forwarded by that user and converted into the feature vector of that microblog user.
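As an illustration of step S2, the following minimal sketch builds keyword-count feature vectors. It assumes one reading of the text that the source does not spell out: a single shared vocabulary of t hot keywords, with each category vector and each user vector holding the word frequencies of those keywords. The function names `build_vocabulary` and `to_vector` and the sample tokens are hypothetical, and the input is taken to be already segmented.

```python
from collections import Counter

def build_vocabulary(category_tokens, t):
    """Return the t most frequent distinct keywords across all hot categories."""
    totals = Counter()
    for tokens in category_tokens.values():
        totals.update(tokens)
    # Counter keys are unique, so a keyword shared by several categories
    # is kept only once (the deduplication step).
    return [word for word, _ in totals.most_common(t)]

def to_vector(tokens, vocabulary):
    """Count how often each vocabulary keyword occurs in a token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical, already segmented hot microblogs for r = 2 categories.
hot = {
    "sports": ["match", "goal", "team", "goal"],
    "tech": ["phone", "chip", "team", "phone"],
}
vocabulary = build_vocabulary(hot, t=3)
category_vectors = {name: to_vector(tokens, vocabulary) for name, tokens in hot.items()}

# Feature vector of one user, built from the tokens of that user's posts.
user_vector = to_vector(["goal", "phone", "phone", "lunch"], vocabulary)
```

Words outside the vocabulary (here "lunch") contribute nothing, so every user is mapped to a vector of the same dimension t regardless of how much they post.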

Preferably, the Chinese word segmentation method is as follows: a Chinese word segmentation system, combined with a user-defined dictionary, is used to segment the microblog text; the stop word processing method is as follows: useless information is filtered out by HashMap fast-index table lookup, so as to reduce noise in the microblog information.
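The stop word lookup can be sketched as below. The source names a HashMap; this sketch assumes a Python `set`, which is likewise hash-based, as a stand-in. The stop word list and the sample tokens are hypothetical, and segmentation is assumed to have been done already, since the source does not name the segmentation system.

```python
# Hypothetical stop word list; a real one would be loaded from a file.
STOP_WORDS = {"的", "了", "是", "在"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not stop words.

    Membership tests on a set are hash lookups, so each token is
    checked in roughly constant time.
    """
    return [token for token in tokens if token not in stop_words]

# Tokens as they might come out of a Chinese word segmenter.
tokens = ["今天", "的", "比赛", "是", "精彩", "了"]
filtered = remove_stop_words(tokens)
```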

Further, the Gaussian mixture model in step S3 is defined as a linear superposition of Gaussians, as shown in formula (1):

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{1}$$

where each Gaussian density $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a mixture component with mean $\mu_k$ and covariance $\Sigma_k$, and $\pi_k$ is its mixing coefficient. Integrating both sides of formula (1) with respect to $x$, and noting that $p(x)$ and each individual Gaussian component are normalized, yields formula (2):

$$\sum_{k=1}^{K} \pi_k = 1 \tag{2}$$

Since $p(x) \ge 0$ is required and $\mathcal{N}(x \mid \mu_k, \Sigma_k) \ge 0$, we take $\pi_k \ge 0$ for all $k$;

In conjunction with equation (2), equation (3) is obtained:

0≤π_k≤1 (3)

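Therefore, the mixing coefficients satisfy the requirements to be probabilities, and the marginal density obtained from the sum and product rules is as shown in formula (4):

$$p(x) = \sum_{k=1}^{K} p(k) \, p(x \mid k) \tag{4}$$

Formula (4) corresponds to formula (1), where $\pi_k = p(k)$ is the prior probability of the k-th component and the density $\mathcal{N}(x \mid \mu_k, \Sigma_k) = p(x \mid k)$ is the probability of $x$ conditioned on $k$. Therefore, according to Bayes' theorem, formula (5) is obtained:

$$p(k \mid x) = \frac{p(k) \, p(x \mid k)}{\sum_{l=1}^{K} p(l) \, p(x \mid l)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} \pi_l \, \mathcal{N}(x \mid \mu_l, \Sigma_l)} \tag{5}$$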

Assume that the feature vector data set to be predicted is $\{x_1, \ldots, x_N\}$. The data set is represented as an N × D matrix $X$ whose n-th row is $x_n^T$; the corresponding latent random variables are represented as an N × K matrix $Z$ whose rows are $z_n^T$.

The shape of the Gaussian mixture distribution is then governed by the parameters $\pi$, $\mu$ and $\Sigma$, where $\pi \equiv \{\pi_1, \ldots, \pi_K\}$, $\mu \equiv \{\mu_1, \ldots, \mu_K\}$ and $\Sigma \equiv \{\Sigma_1, \ldots, \Sigma_K\}$; under maximum likelihood estimation, formula (1) is converted into the log-likelihood function of formula (6):

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{6}$$

where $X = \{x_1, \ldots, x_N\}$.
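To make the mixture density of formula (1) and the Bayes posterior of formula (5) concrete, here is a minimal sketch. It assumes scalar data, so each covariance reduces to a variance; the method itself works on D-dimensional feature vectors. The function names and parameter values are illustrative only.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian N(x | mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, weights, means, variances):
    """Formula (1): weighted sum of the component densities."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def posterior(x, weights, means, variances):
    """Formula (5): p(k | x) for every component k, via Bayes' theorem."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)  # equals the mixture density p(x)
    return [value / total for value in joint]

# Illustrative two-component mixture; the weights sum to 1 as formula (2) requires.
weights, means, variances = [0.3, 0.7], [0.0, 4.0], [1.0, 1.0]
p = mixture_density(1.0, weights, means, variances)
post = posterior(1.0, weights, means, variances)
```

Although the second component has the larger prior weight, the point x = 1.0 lies much closer to the first mean, so the first entry of `post` dominates.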

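Further, the step S4 specifically includes the following steps:

step S41: initializing the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$ for the EM algorithm, and evaluating the initial value of the log-likelihood function;

step S42: estimating the latent class variables using the following formula (7):

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \tag{7}$$

step S43: updating the parameters using the following formula (8), formula (9), formula (10), and formula (11):

$$\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n \tag{8}$$

$$\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T \tag{9}$$

$$\pi_k^{new} = \frac{N_k}{N} \tag{10}$$

wherein

$$N_k = \sum_{n=1}^{N} \gamma(z_{nk}) \tag{11}$$

step S44: evaluating the value of the log-likelihood function using the following formula (12):

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{12}$$

If the value of formula (12) does not satisfy the convergence criterion, the process returns to step S42.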
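Steps S41 to S44 can be sketched as the following loop. This is a simplified illustration under stated assumptions, not the patented implementation: it uses scalar data (variances instead of covariance matrices), initializes the means at evenly spaced quantiles, applies a small variance floor, and stops when the log-likelihood gain falls below `tol`. None of these choices is specified in the source, and the synthetic data is for demonstration only.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian N(x | mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """Sum over all samples of the log mixture density."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def fit_gmm(data, k, max_iter=200, tol=1e-6):
    n = len(data)
    ordered = sorted(data)
    # S41: initialize parameters (assumed scheme: quantile means, shared
    # variance, equal weights) and evaluate the initial log-likelihood.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    centre = sum(data) / n
    variances = [sum((x - centre) ** 2 for x in data) / n] * k
    weights = [1.0 / k] * k
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(max_iter):
        # S42: responsibility of each component for each sample.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([value / total for value in joint])
        # S43: re-estimate means, variances and mixing coefficients.
        for j in range(k):
            n_j = sum(row[j] for row in resp)
            means[j] = sum(row[j] * x for row, x in zip(resp, data)) / n_j
            spread = sum(row[j] * (x - means[j]) ** 2
                         for row, x in zip(resp, data)) / n_j
            variances[j] = max(spread, 1e-6)  # assumed floor against collapse
            weights[j] = n_j / n
        # S44: re-evaluate the log-likelihood and test for convergence.
        history.append(log_likelihood(data, weights, means, variances))
        if history[-1] - history[-2] < tol:
            break
    return weights, means, variances, history

# Synthetic demonstration data: two well-separated clusters.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(6.0, 1.0) for _ in range(100)]
weights, means, variances, history = fit_gmm(data, k=2)
```

The returned `history` holds the log-likelihood after each pass, which makes it easy to check that the value does not decrease from one iteration to the next, as expected of EM.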

Compared with the prior art, the method adopts the Gaussian mixture model, achieves higher accuracy in predicting the interests of social network users, reduces the time required, and effectively predicts the short-term interests of users.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a system framework diagram of interest prediction in the present invention.

Detailed Description

The invention is further explained below with reference to the drawings and the embodiments.

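This embodiment provides a social network user interest prediction method based on a Gaussian mixture model, as shown in FIG. 1 and FIG. 2, including the following steps:

step S1: obtaining user data from a social network;

step S2: extracting features from the acquired user data to generate a series of feature vectors;

step S3: adopting a Gaussian mixture model to construct a prediction model;

step S4: optimizing parameters by adopting an EM algorithm and calculating a prediction result.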

In this embodiment, the step S1 specifically includes: microblog information published or forwarded by p microblog users is acquired as training data, microblog information published or forwarded by q microblog users is acquired as test data, and r hot microblog categories and s hot microblogs in each hot microblog category are acquired.

In this embodiment, the step S2 specifically includes: the hot microblogs are preprocessed, including word segmentation, word frequency statistics and deduplication, to obtain t hot keywords that serve as the interest feature values of the hot microblog categories, thereby generating r t-dimensional hot microblog feature vectors; meanwhile, taking each microblog user as a unit, the training data and the test data are preprocessed, including Chinese word segmentation, stop word processing and word frequency statistics; and, according to the r t-dimensional hot microblog feature vectors, the t interest feature values corresponding to each user are extracted from the microblog information published or forwarded by that user and converted into the feature vector of that microblog user.

In this embodiment, preferably, the Chinese word segmentation method is as follows: a Chinese word segmentation system, combined with a user-defined dictionary, is used to segment the microblog text; the stop word processing method is as follows: useless information is filtered out by HashMap fast-index table lookup, so as to reduce noise in the microblog information.

In this embodiment, deduplication is performed because different categories may contain the same keyword; the deduplication function is necessary to reduce redundant manual processing.

In this embodiment, the gaussian mixture model in step S3 is defined as a linearly superimposed gaussian model, as shown in formula (1):

In conjunction with equation (2), equation (3) is obtained:

0≤π_k≤1 (3)

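The shape of the Gaussian mixture distribution is governed by the parameters $\pi$, $\mu$ and $\Sigma$, where $\pi \equiv \{\pi_1, \ldots, \pi_K\}$, $\mu \equiv \{\mu_1, \ldots, \mu_K\}$ and $\Sigma \equiv \{\Sigma_1, \ldots, \Sigma_K\}$; under maximum likelihood estimation, formula (1) is converted into the log-likelihood function of formula (6):

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{6}$$

where $X = \{x_1, \ldots, x_N\}$.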

In this embodiment, the step S4 specifically includes the following steps:

wherein,

The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.

Claims

1. A social network user interest prediction method based on a Gaussian mixture model, characterized by comprising the following steps:

step S1: obtaining user data from a social network;

step S2: extracting features from the acquired user data to generate a series of feature vectors;

step S3: adopting a Gaussian mixture model to construct a prediction model;

step S4: optimizing parameters by adopting an EM algorithm and calculating a prediction result;

the step S1 specifically includes: acquiring microblog information issued or forwarded by p microblog users as training data, acquiring microblog information issued or forwarded by q microblog users as test data, and acquiring r hot microblog categories and s hot microblogs in each hot microblog category;

the step S2 specifically includes: preprocessing the hot microblog, wherein the preprocessing comprises word segmentation, word frequency statistics and duplicate removal, t hot keywords can be obtained and used as interest characteristic values of hot microblog classes, and therefore r t-dimensional hot microblog characteristic vectors are generated; meanwhile, with microblog users as units, preprocessing the training data and the test data, including Chinese word segmentation, stop word processing and word frequency statistics; extracting t interest characteristic values corresponding to the user from microblog information published or forwarded by the microblog user according to the r t-dimensional hot microblog characteristic vectors, and converting the t interest characteristic values into the characteristic vectors of the microblog user;

the Gaussian mixture model in step S3 is defined as a linear superposition of Gaussians, as shown in formula (1):

In conjunction with equation (2), equation (3) is obtained:

0≤π_k≤1 (3)

wherein $X = \{x_1, \ldots, x_N\}$;

The step S4 specifically includes the following steps:

step S41: initializing the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$ for the EM algorithm, and evaluating the initial value of the log-likelihood function;

step S43: the parameter update is performed by using the following formula (8), formula (9), formula (10), and formula (11):

wherein,

2. The social network user interest prediction method based on the Gaussian mixture model according to claim 1, characterized in that: the Chinese word segmentation uses a Chinese word segmentation system, combined with a user-defined dictionary, to segment the microblog text; and the stop word processing filters out useless information by HashMap fast-index table lookup, so as to reduce noise in the microblog information.