CN105183909B - social network user interest predicting method based on Gaussian mixture model - Google Patents
- Publication number
- CN105183909B
- Authority
- CN
- China
- Prior art keywords
- microblog
- formula
- user
- gaussian mixture
- hot
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Economics (AREA)
- Primary Health Care (AREA)
- Strategic Management (AREA)
- Tourism & Hospitality (AREA)
- Human Resources & Organizations (AREA)
- General Business, Economics & Management (AREA)
- Marketing (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to a method for predicting the interests of social network users based on a Gaussian mixture model. The method comprises the following steps: (1) user data are acquired from a social network; (2) feature vectors are extracted from the acquired user data, generating a series of feature vectors; (3) a prediction model is built using the Gaussian mixture model; (4) the parameters are optimized using the EM algorithm and a prediction result is calculated. By adopting the Gaussian mixture model, the method achieves higher prediction accuracy, shortens the time required, and effectively predicts the short-term interests of a user.
Description
Technical Field
The invention relates to the technical field of social network information analysis, in particular to a social network user interest prediction method based on a Gaussian mixture model.
Background
The rapid diffusion of information and the convenience of social networks enable large numbers of users to share their daily activities, exchange opinions, and build friendships with others. One report estimated that the number of social network users worldwide would reach 2.33 billion by 2017. Effective feature learning and interest prediction are therefore of great significance both to users (e.g., finding users with similar interests) and to service providers (e.g., analyzing user behavior across a set of application scenarios to make personalized recommendations).
However, given the characteristics of social data (e.g., large volume, diversity, data value), it is difficult to predict user interests with high accuracy while keeping computational complexity and latency within acceptable ranges. Furthermore, the short-term interests in a user's interest profile may change dynamically (e.g., under the influence of friends). A social network user interest prediction method based on a Gaussian mixture model is therefore provided, which can effectively predict the short-term interests of a user.
Disclosure of Invention
In view of this, the present invention provides a social network user interest prediction method based on a Gaussian mixture model, so as to achieve higher prediction accuracy, shorten the time required, and effectively predict the short-term interests of the user.
The invention is realized by the following scheme: a social network user interest prediction method based on a Gaussian mixture model, comprising the following steps:
step S1: obtaining user data from a social network;
step S2: extracting feature vectors from the acquired user data to generate a series of feature vectors;
step S3: constructing a prediction model using a Gaussian mixture model;
step S4: optimizing the parameters using the EM algorithm and calculating a prediction result.
Further, step S1 is specifically: microblog posts published or forwarded by p microblog users are acquired as training data, microblog posts published or forwarded by q microblog users are acquired as test data, and r hot microblog categories, with s hot microblogs in each category, are acquired.
Further, step S2 is specifically: the hot microblogs are preprocessed by word segmentation, word frequency statistics and deduplication, yielding t hot keywords that serve as the interest feature values of the hot microblog categories, so that r t-dimensional hot microblog feature vectors are generated; meanwhile, the training data and the test data are preprocessed per microblog user by Chinese word segmentation, stop-word processing and word frequency statistics; then, according to the r t-dimensional hot microblog feature vectors, the t interest feature values corresponding to each user are extracted from the microblog posts published or forwarded by that user and converted into the feature vector of that user.
Preferably, Chinese word segmentation is performed as follows: a Chinese word segmentation system, combined with a user-defined dictionary, segments the microblog text; stop-word processing is performed as follows: useless information is filtered out using a fast HashMap index lookup to reduce the noise in the microblog information.
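The per-user preprocessing in step S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the posts are assumed to be already segmented into tokens, a Python `set` stands in for the HashMap stop-word table, and the names `HOT_KEYWORDS`, `STOP_WORDS` and `user_feature_vector` are hypothetical.

```python
from collections import Counter

# Hypothetical inputs: t = 4 hot keywords and a small stop-word table.
HOT_KEYWORDS = ["football", "movie", "stock", "travel"]
STOP_WORDS = {"the", "a", "of", "is"}  # hash-based lookup, like a HashMap


def user_feature_vector(posts, hot_keywords=HOT_KEYWORDS, stop_words=STOP_WORDS):
    """Return a t-dimensional word-frequency vector for one user.

    `posts` is a list of already-segmented posts (each a list of tokens).
    """
    counts = Counter()
    for tokens in posts:
        # Stop-word filtering: drop tokens found in the hash table.
        counts.update(tok for tok in tokens if tok not in stop_words)
    # Keep only the frequencies of the t hot keywords, in a fixed order.
    return [counts[kw] for kw in hot_keywords]


posts = [
    ["the", "football", "match", "is", "great"],
    ["a", "movie", "about", "football"],
]
vector = user_feature_vector(posts)
print(vector)  # [2, 1, 0, 0]
```

Keeping the keyword order fixed means every user is mapped into the same t-dimensional space, which is what the later mixture model requires.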
Further, the Gaussian mixture model in step S3 is defined as a linear superposition of Gaussians, as shown in formula (1):
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{1}$$
wherein each Gaussian density $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a mixture component with mean $\mu_k$ and covariance $\Sigma_k$, and $\pi_k$ is the mixing coefficient; integrating both sides of formula (1) with respect to $x$, and noting that $p(x)$ and each individual Gaussian component are normalized, yields formula (2):
$$\sum_{k=1}^{K} \pi_k = 1 \tag{2}$$
since $p(x) \ge 0$ is required and $\mathcal{N}(x \mid \mu_k, \Sigma_k) \ge 0$, it is sufficient that $\pi_k \ge 0$;
In conjunction with formula (2), formula (3) is obtained:
$$0 \le \pi_k \le 1 \tag{3}$$
therefore, the mixing coefficients satisfy the requirements for being probabilities, and the marginal density obtained from the sum and product rules is as shown in formula (4):
$$p(x) = \sum_{k=1}^{K} p(k) \, p(x \mid k) \tag{4}$$
formula (4) is equivalent to formula (1), where $\pi_k = p(k)$ is the prior probability of the k-th component and the density $\mathcal{N}(x \mid \mu_k, \Sigma_k) = p(x \mid k)$ is the probability of $x$ conditioned on $k$; therefore, according to Bayes' theorem, the following formula (5) is obtained:
$$p(k \mid x) = \frac{p(k) \, p(x \mid k)}{\sum_{l=1}^{K} p(l) \, p(x \mid l)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} \pi_l \, \mathcal{N}(x \mid \mu_l, \Sigma_l)} \tag{5}$$
assume that the feature vector data set to be predicted is $\{x_1, \dots, x_N\}$; the data set is represented as an N × D matrix $X$ whose n-th row is $x_n^{T}$; the corresponding latent random variables are represented by an N × K matrix $Z$ with rows $z_n^{T}$;
the shape of the Gaussian mixture distribution is governed by the parameters $\pi$, $\mu$ and $\Sigma$, where $\pi \equiv \{\pi_1, \dots, \pi_K\}$, $\mu \equiv \{\mu_1, \dots, \mu_K\}$, $\Sigma \equiv \{\Sigma_1, \dots, \Sigma_K\}$; for maximum likelihood estimation, formula (1) is converted into the log-likelihood of formula (6):
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{6}$$
wherein $X = \{x_1, \dots, x_N\}$.
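To make the mixture density of formula (1) and the log-likelihood of formula (6) concrete, the sketch below evaluates both for a small example. It is an illustration only and makes a simplifying assumption not found in the source: the data are one-dimensional, so each covariance reduces to a scalar variance. All function names and numeric values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """Formula (1): weighted sum of the component densities."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def log_likelihood(data, weights, means, variances):
    """Formula (6): sum over samples of the log mixture density."""
    return sum(math.log(mixture_pdf(x, weights, means, variances)) for x in data)


weights = [0.5, 0.5]   # mixing coefficients; they must sum to 1, formula (2)
means = [0.0, 4.0]
variances = [1.0, 1.0]
data = [0.1, -0.3, 3.9, 4.2]

assert abs(sum(weights) - 1.0) < 1e-12
print(log_likelihood(data, weights, means, variances))
```

A larger log-likelihood means the current parameters explain the data better, which is the quantity the EM iterations of step S4 increase.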
Further, the step S4 specifically includes the following steps:
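step S41: initializing the means $\mu_k$, the covariances $\Sigma_k$ and the mixing coefficients $\pi_k$ for the EM algorithm, and evaluating the initial value of the log-likelihood function;
step S42: estimating the latent class variables using the following formula (7):
$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \tag{7}$$
step S43: updating the parameters using the following formulas (8), (9), (10) and (11):
$$\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n \tag{8}$$
$$\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{T} \tag{9}$$
$$\pi_k^{\text{new}} = \frac{N_k}{N} \tag{10}$$
wherein,
$$N_k = \sum_{n=1}^{N} \gamma(z_{nk}) \tag{11}$$
step S44: evaluating the log-likelihood function using the following formula (12):
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{12}$$
If formula (12) does not satisfy the convergence criterion, the method returns to step S42.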
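Steps S41 to S44 can be sketched as a short loop. The code below is a hedged illustration rather than the patented implementation: it assumes one-dimensional data (scalar variances instead of covariance matrices), uses a change in log-likelihood smaller than `tol` as the convergence criterion, and adds a small variance floor to avoid degenerate components. The function name `em_gmm_1d`, its parameters and the sample data are all hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, pis, mus, variances):
    return sum(
        math.log(sum(p * gaussian_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)))
        for x in data
    )


def em_gmm_1d(data, mus, variances, pis, tol=1e-8, max_iter=200):
    """Fit a 1-D Gaussian mixture with EM; returns (pis, mus, variances, ll)."""
    n = len(data)
    k = len(mus)
    ll = log_likelihood(data, pis, mus, variances)  # step S41: initial value
    for _ in range(max_iter):
        # Step S42 (E-step): responsibilities gamma[n][j], formula (7).
        gamma = []
        for x in data:
            weighted = [p * gaussian_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)]
            total = sum(weighted)
            gamma.append([w / total for w in weighted])
        # Step S43 (M-step): formulas (8) to (11).
        nk = [sum(g[j] for g in gamma) for j in range(k)]
        mus = [sum(g[j] * x for g, x in zip(gamma, data)) / nk[j] for j in range(k)]
        variances = [
            # Variance floor of 1e-6 is an added safeguard, not part of the source.
            max(sum(g[j] * (x - mus[j]) ** 2 for g, x in zip(gamma, data)) / nk[j], 1e-6)
            for j in range(k)
        ]
        pis = [nk[j] / n for j in range(k)]
        # Step S44: re-evaluate the log-likelihood, formula (12).
        new_ll = log_likelihood(data, pis, mus, variances)
        converged = abs(new_ll - ll) < tol
        ll = new_ll
        if converged:
            break
    return pis, mus, variances, ll


data = [-0.2, 0.1, 0.3, -0.1, 3.8, 4.1, 4.3, 3.9]
pis, mus, variances, ll = em_gmm_1d(
    data, mus=[0.0, 1.0], variances=[1.0, 1.0], pis=[0.5, 0.5]
)
print(pis, mus, variances, ll)
```

On this toy data the two fitted means settle near the two visible clusters; real feature vectors are t-dimensional and would need full covariance matrices as in formula (9).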
Compared with the prior art, the method adopts a Gaussian mixture model, achieves higher accuracy in predicting the interests of social network users, shortens the time required, and effectively predicts the short-term interests of users.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a system framework diagram of interest prediction in the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
This embodiment provides a social network user interest prediction method based on a Gaussian mixture model, as shown in FIG. 1 and FIG. 2, comprising the following steps:
step S1: obtaining user data from a social network;
step S2: extracting feature vectors from the acquired user data to generate a series of feature vectors;
step S3: constructing a prediction model using a Gaussian mixture model;
step S4: optimizing the parameters using the EM algorithm and calculating a prediction result.
In this embodiment, step S1 specifically includes: microblog posts published or forwarded by p microblog users are acquired as training data, microblog posts published or forwarded by q microblog users are acquired as test data, and r hot microblog categories, with s hot microblogs in each category, are acquired.
In this embodiment, step S2 specifically includes: the hot microblogs are preprocessed by word segmentation, word frequency statistics and deduplication, yielding t hot keywords that serve as the interest feature values of the hot microblog categories, so that r t-dimensional hot microblog feature vectors are generated; meanwhile, the training data and the test data are preprocessed per microblog user by Chinese word segmentation, stop-word processing and word frequency statistics; then, according to the r t-dimensional hot microblog feature vectors, the t interest feature values corresponding to each user are extracted from the microblog posts published or forwarded by that user and converted into the feature vector of that user.
In this embodiment, preferably, Chinese word segmentation is performed as follows: a Chinese word segmentation system, combined with a user-defined dictionary, segments the microblog text; stop-word processing is performed as follows: useless information is filtered out using a fast HashMap index lookup to reduce the noise in the microblog information.
In this embodiment, deduplication is performed because different categories may contain the same keyword; the deduplication function is needed to avoid redundant manual processing.
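One way to picture the deduplication step is the sketch below. It assumes, as an interpretation rather than a statement of the source, that keywords are merged across categories and that each keyword keeps the position of its first appearance; the function name `dedupe_hot_keywords` and the sample categories are hypothetical.

```python
def dedupe_hot_keywords(keywords_by_category):
    """Merge per-category keyword lists, keeping each keyword once.

    Order of first appearance is preserved so that the feature
    dimensions stay stable between runs.
    """
    seen = set()
    merged = []
    for keywords in keywords_by_category.values():
        for kw in keywords:
            if kw not in seen:
                seen.add(kw)
                merged.append(kw)
    return merged


keywords_by_category = {
    "sports": ["football", "league", "final"],
    "entertainment": ["movie", "final", "award"],
}
hot_keywords = dedupe_hot_keywords(keywords_by_category)
print(hot_keywords)  # ['football', 'league', 'final', 'movie', 'award']
```

Here "final" occurs in both categories but contributes a single feature dimension.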
In this embodiment, the Gaussian mixture model in step S3 is defined as a linear superposition of Gaussians, as shown in formula (1):
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{1}$$
wherein each Gaussian density $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a mixture component with mean $\mu_k$ and covariance $\Sigma_k$, and $\pi_k$ is the mixing coefficient; integrating both sides of formula (1) with respect to $x$, and noting that $p(x)$ and each individual Gaussian component are normalized, yields formula (2):
$$\sum_{k=1}^{K} \pi_k = 1 \tag{2}$$
since $p(x) \ge 0$ is required and $\mathcal{N}(x \mid \mu_k, \Sigma_k) \ge 0$, it is sufficient that $\pi_k \ge 0$;
In conjunction with formula (2), formula (3) is obtained:
$$0 \le \pi_k \le 1 \tag{3}$$
therefore, the mixing coefficients satisfy the requirements for being probabilities, and the marginal density obtained from the sum and product rules is as shown in formula (4):
$$p(x) = \sum_{k=1}^{K} p(k) \, p(x \mid k) \tag{4}$$
formula (4) is equivalent to formula (1), where $\pi_k = p(k)$ is the prior probability of the k-th component and the density $\mathcal{N}(x \mid \mu_k, \Sigma_k) = p(x \mid k)$ is the probability of $x$ conditioned on $k$; therefore, according to Bayes' theorem, the following formula (5) is obtained:
$$p(k \mid x) = \frac{p(k) \, p(x \mid k)}{\sum_{l=1}^{K} p(l) \, p(x \mid l)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} \pi_l \, \mathcal{N}(x \mid \mu_l, \Sigma_l)} \tag{5}$$
assume that the feature vector data set to be predicted is $\{x_1, \dots, x_N\}$; the data set is represented as an N × D matrix $X$ whose n-th row is $x_n^{T}$; the corresponding latent random variables are represented by an N × K matrix $Z$ with rows $z_n^{T}$;
the shape of the Gaussian mixture distribution is governed by the parameters $\pi$, $\mu$ and $\Sigma$, where $\pi \equiv \{\pi_1, \dots, \pi_K\}$, $\mu \equiv \{\mu_1, \dots, \mu_K\}$, $\Sigma \equiv \{\Sigma_1, \dots, \Sigma_K\}$; for maximum likelihood estimation, formula (1) is converted into the log-likelihood of formula (6):
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{6}$$
wherein $X = \{x_1, \dots, x_N\}$.
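The Bayes step of formula (5) can be checked numerically with the sketch below. It again assumes one-dimensional data for brevity, and it treats the component with the largest posterior as the predicted interest class, which is an illustrative reading and not something the source states. The name `posterior` and all numeric values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior(x, priors, means, variances):
    """Formula (5): p(k | x) for every component k via Bayes' theorem."""
    joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
    evidence = sum(joint)  # marginal density p(x), formula (4)
    return [j / evidence for j in joint]


priors = [0.3, 0.7]
means = [0.0, 2.0]
variances = [1.0, 1.0]

# x = 1 is equidistant from both means, so the likelihoods cancel
# and the posterior equals the prior (up to rounding).
gamma = posterior(1.0, priors, means, variances)
print(gamma)

# Illustrative assumption: pick the most probable component as the prediction.
predicted = max(range(len(gamma)), key=gamma.__getitem__)
print(predicted)
```

Moving `x` toward 0.0 shifts probability mass to the first component even though its prior is smaller, which shows how the data can override the mixing coefficients.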
In this embodiment, the step S4 specifically includes the following steps:
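step S41: initializing the means $\mu_k$, the covariances $\Sigma_k$ and the mixing coefficients $\pi_k$ for the EM algorithm, and evaluating the initial value of the log-likelihood function;
step S42: estimating the latent class variables using the following formula (7):
$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \tag{7}$$
step S43: updating the parameters using the following formulas (8), (9), (10) and (11):
$$\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n \tag{8}$$
$$\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{T} \tag{9}$$
$$\pi_k^{\text{new}} = \frac{N_k}{N} \tag{10}$$
wherein,
$$N_k = \sum_{n=1}^{N} \gamma(z_{nk}) \tag{11}$$
step S44: evaluating the log-likelihood function using the following formula (12):
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{12}$$
If formula (12) does not satisfy the convergence criterion, the method returns to step S42.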
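As a sketch of where the mean update of formula (8) comes from (a standard derivation for Gaussian mixtures, supplied here for illustration and not taken from the patent text), the log-likelihood of formula (12) can be differentiated with respect to $\mu_k$ and set to zero, assuming each $\Sigma_k$ is non-singular:

```latex
\begin{align*}
0 &= \frac{\partial}{\partial \mu_k} \sum_{n=1}^{N} \ln \Big\{ \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j) \Big\} \\
  &= \sum_{n=1}^{N} \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \, \Sigma_k^{-1} (x_n - \mu_k) \\
  &= \Sigma_k^{-1} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k)
\end{align*}
% Multiplying by \Sigma_k (assumed non-singular) and solving for \mu_k:
\[
\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n ,
\qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}) .
\]
```

Because the responsibilities $\gamma(z_{nk})$ themselves depend on $\mu_k$, this is not a closed-form solution, which is why steps S42 and S43 are alternated until the log-likelihood stops changing.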
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.
Claims (2)
1. A social network user interest prediction method based on a Gaussian mixture model, characterized by comprising the following steps:
step S1: obtaining user data from a social network;
step S2: extracting feature vectors from the acquired user data to generate a series of feature vectors;
step S3: constructing a prediction model using a Gaussian mixture model;
step S4: optimizing the parameters using the EM algorithm and calculating a prediction result;
the step S1 specifically includes: acquiring microblog posts published or forwarded by p microblog users as training data, acquiring microblog posts published or forwarded by q microblog users as test data, and acquiring r hot microblog categories with s hot microblogs in each category;
the step S2 specifically includes: preprocessing the hot microblogs by word segmentation, word frequency statistics and deduplication, yielding t hot keywords that serve as the interest feature values of the hot microblog categories, so that r t-dimensional hot microblog feature vectors are generated; meanwhile, preprocessing the training data and the test data per microblog user by Chinese word segmentation, stop-word processing and word frequency statistics; and, according to the r t-dimensional hot microblog feature vectors, extracting the t interest feature values corresponding to each user from the microblog posts published or forwarded by that user and converting them into the feature vector of that user;
the Gaussian mixture model in step S3 is defined as a linear superposition of Gaussians, as shown in formula (1):
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{1}$$
wherein each Gaussian density $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a mixture component with mean $\mu_k$ and covariance $\Sigma_k$, and $\pi_k$ is the mixing coefficient; integrating both sides of formula (1) with respect to $x$, and noting that $p(x)$ and each individual Gaussian component are normalized, yields formula (2):
$$\sum_{k=1}^{K} \pi_k = 1 \tag{2}$$
since $p(x) \ge 0$ is required and $\mathcal{N}(x \mid \mu_k, \Sigma_k) \ge 0$, it is sufficient that $\pi_k \ge 0$;
in conjunction with formula (2), formula (3) is obtained:
$$0 \le \pi_k \le 1 \tag{3}$$
therefore, the mixing coefficients satisfy the requirements for being probabilities, and the marginal density obtained from the sum and product rules is as shown in formula (4):
$$p(x) = \sum_{k=1}^{K} p(k) \, p(x \mid k) \tag{4}$$
formula (4) is equivalent to formula (1), where $\pi_k = p(k)$ is the prior probability of the k-th component and the density $\mathcal{N}(x \mid \mu_k, \Sigma_k) = p(x \mid k)$ is the probability of $x$ conditioned on $k$; therefore, according to Bayes' theorem, the following formula (5) is obtained:
$$p(k \mid x) = \frac{p(k) \, p(x \mid k)}{\sum_{l=1}^{K} p(l) \, p(x \mid l)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} \pi_l \, \mathcal{N}(x \mid \mu_l, \Sigma_l)} \tag{5}$$
assume that the feature vector data set to be predicted is $\{x_1, \dots, x_N\}$; the data set is represented as an N × D matrix $X$ whose n-th row is $x_n^{T}$; the corresponding latent random variables are represented by an N × K matrix $Z$ with rows $z_n^{T}$;
the shape of the Gaussian mixture distribution is governed by the parameters $\pi$, $\mu$ and $\Sigma$, where $\pi \equiv \{\pi_1, \dots, \pi_K\}$, $\mu \equiv \{\mu_1, \dots, \mu_K\}$, $\Sigma \equiv \{\Sigma_1, \dots, \Sigma_K\}$; for maximum likelihood estimation, formula (1) is converted into the log-likelihood of formula (6):
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{6}$$
wherein $X = \{x_1, \dots, x_N\}$;
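the step S4 specifically includes the following steps:
step S41: initializing the means $\mu_k$, the covariances $\Sigma_k$ and the mixing coefficients $\pi_k$ for the EM algorithm, and evaluating the initial value of the log-likelihood function;
step S42: estimating the latent class variables using the following formula (7):
$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \tag{7}$$
step S43: updating the parameters using the following formulas (8), (9), (10) and (11):
$$\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n \tag{8}$$
$$\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{T} \tag{9}$$
$$\pi_k^{\text{new}} = \frac{N_k}{N} \tag{10}$$
wherein,
$$N_k = \sum_{n=1}^{N} \gamma(z_{nk}) \tag{11}$$
step S44: evaluating the log-likelihood function using the following formula (12):
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{12}$$
if formula (12) does not satisfy the convergence criterion, the method returns to step S42.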
2. The social network user interest prediction method based on a Gaussian mixture model according to claim 1, characterized in that: the Chinese word segmentation is performed as follows: a Chinese word segmentation system, combined with a user-defined dictionary, segments the microblog text; and the stop-word processing is performed as follows: useless information is filtered out using a fast HashMap index lookup to reduce the noise in the microblog information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510646248.XA CN105183909B (en) | 2015-10-09 | 2015-10-09 | social network user interest predicting method based on Gaussian mixture model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105183909A | 2015-12-23 |
CN105183909B | 2017-04-12 |
Family
ID=54905990
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510646248.XA Active CN105183909B (en) | 2015-10-09 | 2015-10-09 | social network user interest predicting method based on Gaussian mixture model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105183909B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220233A (en) * | 2017-05-09 | 2017-09-29 | 北京理工大学 | A kind of user knowledge demand model construction method based on gauss hybrid models |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105786711A (en) * | 2016-03-25 | 2016-07-20 | 广州华多网络科技有限公司 | Data analysis method and device |
CN109949938B (en) * | 2017-12-20 | 2024-04-26 | 北京亚信数据有限公司 | Method and device for standardizing medical non-standard names |
CN110869953B (en) * | 2018-02-06 | 2024-09-24 | 北京嘀嘀无限科技发展有限公司 | System and method for recommending traffic travel service |
CN110119827A (en) * | 2018-02-06 | 2019-08-13 | 北京嘀嘀无限科技发展有限公司 | With the prediction technique and device of vehicle type |
CN108182339B (en) * | 2018-03-20 | 2021-08-13 | 北京工业大学 | Window state prediction method and system based on Gaussian distribution |
CN109190040B (en) * | 2018-08-31 | 2021-05-28 | 合肥工业大学 | Collaborative evolution-based personalized recommendation method and device |
CN111241821B (en) * | 2018-11-28 | 2023-04-28 | 杭州海康威视数字技术股份有限公司 | Method and device for determining behavior characteristics of user |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104077412A (en) * | 2014-07-14 | 2014-10-01 | 福州大学 | Micro-blog user interest prediction method based on multiple Markov chains |
CN104636496A (en) * | 2015-03-04 | 2015-05-20 | 重庆理工大学 | Hybrid clustering recommendation method based on Gaussian distribution and distance similarity |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140358630A1 (en) * | 2013-05-31 | 2014-12-04 | Thomson Licensing | Apparatus and process for conducting social media analytics |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220233A (en) * | 2017-05-09 | 2017-09-29 | 北京理工大学 | A kind of user knowledge demand model construction method based on gauss hybrid models |
CN107220233B (en) * | 2017-05-09 | 2020-06-16 | 北京理工大学 | User knowledge demand model construction method based on Gaussian mixture model |
Also Published As
Publication number | Publication date |
---|---|
CN105183909A (en) | 2015-12-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105183909B (en) | social network user interest predicting method based on Gaussian mixture model | |
Schelldorfer et al. | Estimation for high‐dimensional linear mixed‐effects models using ℓ1‐penalization | |
Brooks et al. | Nonparametric convergence assessment for MCMC model selection | |
US9875294B2 (en) | Method and apparatus for classifying object based on social networking service, and storage medium | |
CN113259325B (en) | Network security situation prediction method for optimizing Bi-LSTM based on sparrow search algorithm | |
CN114239464B (en) | Circuit yield prediction method and system based on Bayesian filter and resampling | |
CN108734287A (en) | Compression method and device, terminal, the storage medium of deep neural network model | |
Cong et al. | Fast and effective model order selection method to determine the number of sources in a linear transformation model | |
Noughabi et al. | On the entropy estimators | |
CN116187563A (en) | Sea surface temperature space-time intelligent prediction method based on fusion improvement variation modal decomposition | |
CN115345293A (en) | Training method and device of text processing model based on differential privacy | |
Ding et al. | Full‐reference image quality assessment using statistical local correlation | |
Zitouni et al. | Asymptotic properties of the estimator for a finite mixture of exponential dispersion models | |
CN109217844B (en) | Hyper-parameter optimization method based on pre-training random Fourier feature kernel LMS | |
Sevilla et al. | Bayesian topology inference on partially known networks from input-output pairs | |
JP2016520220A (en) | Hidden attribute model estimation device, method and program | |
Wiencierz et al. | Restricted likelihood ratio testing in linear mixed models with general error covariance structure | |
Madukaife et al. | Estimation of Shannon differential entropy: An extensive comparative review | |
Debbabi et al. | A new unsupervised threshold determination for hybrid models | |
Hansen et al. | Bayesian compressed sensing with unknown measurement noise level | |
Lei et al. | A weighted K-SVD-based double sparse representations approach for wireless channels using the modified Takenaka-Malmquist basis | |
Burnaev et al. | Adaptive design of experiments for sobol indices estimation based on quadratic metamodel | |
Lee | Generalized Bernoulli process: simulation, estimation, and application | |
CN114842236B (en) | Image classification method, image classification device, computer readable storage medium and electronic device | |
Li et al. | Goodness-of-fit tests of a parametric density functions: Monte Carlo simulation studies |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||