CN105183833B - Microblog text recommendation method and device based on user model - Google Patents

Microblog text recommendation method and device based on user model

Publication number
CN105183833B
Authority
CN
China
Prior art keywords
microblog
user
candidate
target user
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510548344.0A
Other languages
Chinese (zh)
Other versions
CN105183833A (en)
Inventor
喻梅
徐天一
王建荣
于健
缑小路
郭佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201510548344.0A
Publication of CN105183833A
Application granted
Publication of CN105183833B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a microblog text recommendation method and device based on a user model. The method comprises the following steps: acquiring microblog data, forming microblog documents, and preprocessing them; establishing a target user topic model with the LDA topic model and calculating the matching degree between each candidate microblog and the target user topic model; establishing a target user keyword vector model based on the TF-IDF algorithm and calculating the matching degree between each candidate microblog and the target user keyword vector model; and combining the two matching degrees by weighted averaging to obtain the matching degree between the candidate microblog and the target user model, which serves as the candidate microblog's score, and ranking the candidate microblogs by score. The device comprises an acquisition and preprocessing module, a first calculation module, a second calculation module and a sorting module. The invention can find microblogs that are likely to interest the target user and recommend them to that user, which strengthens the connections among users and helps keep the microblog platform active.

Description

Microblog text recommendation method and device based on user model
Technical Field
The invention relates to the fields of data mining, natural language processing and information retrieval, and in particular to a user-model-based microblog text recommendation method (MCRA) and a corresponding recommendation device.
Background
Many methods currently exist for modeling microblog users for personalized recommendation. They fall roughly into two categories according to what they analyze: the relationships among microblog users, or the text content that users publish. Relationship-based methods analyze a user's relationships in the social network, the user's position in a community and the user's influence in that community, and rank users by influence in order to recommend users. Content-based methods process and analyze the microblog content a user publishes in order to build a personalized model of the user, and then recommend the users or content most similar to that model. The core of the content-based approach is user content modeling.
The traditional statistical method, the Term Frequency-Inverse Document Frequency (TF-IDF) model, and topic modeling are the methods commonly used for user content modeling. However, TF-IDF cannot reflect the user's interest in latent topics.
Topic modeling techniques mainly include Latent Semantic Analysis (LSA), probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). The LSA model maps documents from a sparse, high-dimensional word space to a low-dimensional vector space, in which synonyms corresponding to the same or similar topics can be captured. However, LSA does not provide a probabilistic model of term occurrence counts. The PLSA model follows a similar idea but introduces probabilistic relations between classes (topics) and words, and its parameters can be obtained by maximum likelihood estimation with the Expectation Maximization (EM) algorithm. PLSA does not provide a suitable probabilistic model at the document level, so it is not a complete generative model: random sampling is only possible once the document is given.
To address this deficiency of PLSA, researchers proposed the Latent Dirichlet Allocation (LDA) model. LDA introduces two probability distributions, the document-topic distribution and the topic-term distribution: a document is regarded as a probabilistic mixture of several topics, and a topic as a probability distribution over terms, which matches the way documents are generated. The LDA topic model reflects the topics a user cares about well, but it cannot avoid the modeling inaccuracy caused by the character limit on microblogs. Using the user topic model alone therefore does not give the best recommendation results.
Disclosure of Invention
The invention provides a user-model-based microblog text recommendation method and device that can find, among the massive volume of microblogs published by other users, those likely to interest a target user and recommend them to that user, thereby strengthening the connections among users and improving the vitality of the microblog platform, as described below:
a microblog text recommendation method based on a user model comprises the following steps:
acquiring microblog data, forming a microblog document, and preprocessing the microblog document;
establishing a target user topic model according to the LDA topic model, and calculating the matching degree of the candidate microblog and the target user topic model;
establishing a target user keyword vector model based on a TF-IDF algorithm, and calculating the matching degree of the candidate microblog and the target user keyword vector model;
and combining the two matching degrees by weighted averaging to obtain the matching degree between the candidate microblog and the target user model, which serves as the score of the candidate microblog, and ranking the candidate microblogs by score.
The step of calculating the score of the candidate microblogs and ranking them specifically comprises the following steps:
after the scores Score(W, u) of the candidate microblogs are obtained, ranking the candidate microblogs by score, constructing the initial microblog recommendation list $L_0$ of the target user, and performing redundancy processing on $L_0$;
and outputting the recommendation list after the redundancy processing.
A user model-based microblog text recommendation device, the device comprising:
the acquisition and preprocessing module is used for acquiring microblog data, forming a microblog document and preprocessing the microblog document;
the first calculation module is used for establishing a target user topic model according to the LDA topic model and calculating the matching degree of the candidate microblog and the target user topic model;
the second calculation module is used for establishing a target user keyword vector model based on a TF-IDF algorithm and calculating the matching degree of the candidate microblog and the target user keyword vector model;
and the sorting module is used for combining the two matching degrees by weighted averaging to obtain the matching degree between the candidate microblog and the target user model, which serves as the score of the candidate microblog, and for ranking the candidate microblogs by score.
Wherein the sorting module further comprises:
a redundancy processing submodule, configured to, after the scores Score(W, u) of the candidate microblogs are obtained, rank the candidate microblogs by score, construct the initial microblog recommendation list $L_0$ of the target user, and perform redundancy processing on $L_0$;
and an output submodule, configured to output the recommendation list after the redundancy processing.
The technical scheme provided by the invention has the beneficial effects that:
(1) For short-text recommendation, a target user model is built by combining the LDA topic model method with the TF-IDF modeling method, which exploits the strengths of both methods and yields more accurate user modeling; a method for calculating the matching degree between a candidate microblog and the user model is also provided.
(2) Based on the characteristics of microblog texts, a weighted scoring standard for candidate microblogs is provided; adjusting the weights controls the share each modeling method contributes to the score. The candidate microblogs are scored and TOP-N recommendation is performed, giving a more accurate microblog text recommendation algorithm.
Drawings
FIG. 1 is a flow chart of a microblog text recommendation method based on a user model;
FIG. 2 is a flow chart of the MCRA algorithm;
FIG. 3 is a graph showing how AP changes with β when α is 0.0001;
FIG. 4 is a graph showing a comparison of F values for MCRA, LDA and TF-IDF;
FIG. 5 is a diagram illustrating a comparison of AP values for MCRA and TF-IDF algorithms;
FIG. 6 is a schematic diagram of a microblog text recommendation device based on a user model;
FIG. 7 is a schematic diagram of a sorting module.
In the drawings, the components represented by the respective reference numerals are listed below:
1: an acquisition and preprocessing module; 2: a first calculation module;
3: a second calculation module; 4: a sorting module;
41: a redundancy processing submodule; 42: an output submodule.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
A microblog text recommendation method based on a user model is disclosed, and referring to FIG. 1, the microblog text recommendation method comprises the following steps:
101: acquiring microblog data, forming a microblog document, and preprocessing the microblog document;
for example, the embodiment of the invention takes the Sina microblog as the research object, selects a Sina microblog user as the target user, and recommends content to that user. The microblogs published and forwarded by the target user and by the users who have a follow relationship with the target user form the research scope of the embodiment; this content is assumed to be content the target user likes and is used to analyze the target user's interests. The microblog data published and forwarded by these users is captured and assembled into the microblog documents on which the models are built.
Each microblog document is preprocessed, including word segmentation, vectorization and dimension reduction, and a training set and a test set (the set of candidate microblogs) are selected for the experiments. These operations are well known to those skilled in the art and are not described in detail here.
102: establishing a target user topic model according to the LDA topic model, and calculating the matching degree of the candidate microblog and the target user topic model;
103: establishing a target user keyword vector model based on a TF-IDF algorithm, and calculating the matching degree of the candidate microblog and the target user keyword vector model;
the target user model includes: a target user topic model and a target user keyword vector model. And when the matching degree of the candidate microblog and the target user model is calculated, the matching degree of the candidate microblog, the target user topic model and the target user keyword vector model is calculated respectively.
104: combining the two matching degrees by weighted averaging to obtain the matching degree between the candidate microblog and the target user model, which serves as the score of the candidate microblog, and ranking the candidate microblogs by score.
In a specific implementation, the embodiment of the invention models the target user's topics from the message content the target user has published. A list of candidate microblogs is obtained, each candidate is scored according to how well its topics match the target user, and the candidates are ranked by score for recommendation.
In summary, through steps 101 to 104 the embodiment of the invention improves the accuracy of microblog text recommendation, so that the microblogs the target user is really interested in appear nearer the top of the recommendation list.
Example 2
The scheme of Example 1 is described in detail below with reference to specific calculation formulas, examples and FIG. 2. The MCRA algorithm is divided into two sub-algorithms, the Target User Modeling Algorithm (TUMA) and the text recommendation algorithm (CRA), as described below:
201: acquiring experimental data;
that is, the microblog texts published and forwarded by the target user and by the users who have a follow relationship with the target user are captured to construct the experimental microblog documents. To capture the experimental data, a crawler is written against the Sina microblog open Application Programming Interface (API), the target user and the related users are selected, and a microblog document is formed for each user. In a specific implementation, other software may be used to capture the experimental data; the embodiment of the invention does not limit this.
202: preprocessing data;
first, the Institute of Computing Technology Chinese Lexical Analysis System (ICTCLAS), which is based on a Hidden Markov Model (HMM), is used to segment all documents into words; the noun terms in each microblog are then extracted to represent that microblog, and the Vector Space Model (VSM) is used to reduce the dimensionality of the microblog representation. Finally, from the processed microblog texts, those of the target user and of the users related to the target user are selected as the modeling data set. In a specific implementation, other word segmentation software may also be used; the embodiment of the invention does not limit this.
203: in the modeling data set, training the constructed microblog text vector set by using an LDA (Latent Dirichlet Allocation) model;
according to the characteristics of microblog users and microblog texts, all microblogs published by one user are treated as a single document, the microblog documents of multiple users are used for training, and the LDA model is solved by Gibbs sampling (Griffiths T L, Steyvers M. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 2004, 101(Suppl 1): 5228-5235). The algorithm is described as follows:
first, some special symbols used in the LDA model are explained as follows:
D: the document set, $D = \{d_1, d_2, \ldots, d_i\}$, where $d_i$ is the i-th document.
T: the set of topics, $T = \{t_1, t_2, \ldots, t_i\}$, where $t_i$ is the i-th topic.
W: the set of terms, $W = \{w_1, w_2, \ldots, w_i\}$, where $w_i$ is the i-th term.
V: the vocabulary, $V = \{v_1, v_2, \ldots, v_i\}$, the set of all distinct terms in the corpus, where $v_i$ is the i-th term.
u: the target user.
(1) The input is the users' document set $D = \{d_1, d_2, \ldots, d_i\}$. Initial values are set for the Dirichlet distribution parameters, namely a parameter α reflecting how strongly texts are associated with topics and a parameter β reflecting the density of terms within topics, and the number of iterations is set to $N_{iteration}$.
(2) A topic is chosen at random for every term in the document set D, and the initial counts for the iteration are computed: the number of occurrences of topic k in document m, the total number of topic assignments in document m, $n_m$, the number of times term t is assigned to topic k, and the total number of terms assigned to topic k, $n_k$. [Symbols and formulas for these counts: images in the original publication]
(3) In the document set D, for any term t in document m that currently belongs to topic k, a new topic is sampled for t and the values in (2) are updated. [Sampling formula: image in the original publication]
(4) Step (3) is repeated until the Markov chain converges to the maximum likelihood probability;
(5) The document-topic probability distribution θ and the topic-term probability distribution Φ are output.
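Steps (1) to (5) can be illustrated with a small collapsed Gibbs sampler. This is a minimal sketch, not the patent's implementation: the sampling formula in step (3) is an image in the source, so the update below uses the standard form from Griffiths and Steyvers as an assumption. The function name `lda_gibbs`, the integer term ids and the fixed number of sweeps (in place of a convergence test) are also illustrative choices.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha, beta, n_iter, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer term ids."""
    rng = random.Random(seed)
    n_mk = [[0] * n_topics for _ in docs]               # topic k count in document m
    n_kt = [[0] * vocab_size for _ in range(n_topics)]  # term t count under topic k
    n_k = [0] * n_topics                                # total terms under topic k
    z = []                                              # topic of every term occurrence
    # Step (2): assign a random topic to every term occurrence.
    for m, doc in enumerate(docs):
        assignments = []
        for t in doc:
            k = rng.randrange(n_topics)
            assignments.append(k)
            n_mk[m][k] += 1
            n_kt[k][t] += 1
            n_k[k] += 1
        z.append(assignments)
    # Steps (3)-(4): resample every assignment for a fixed number of sweeps.
    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]
                n_mk[m][k] -= 1
                n_kt[k][t] -= 1
                n_k[k] -= 1
                # Assumed standard update: p(k) ~ (n_kt + beta) / (n_k + V*beta) * (n_mk + alpha)
                weights = [
                    (n_kt[j][t] + beta) / (n_k[j] + vocab_size * beta) * (n_mk[m][j] + alpha)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[m][i] = k
                n_mk[m][k] += 1
                n_kt[k][t] += 1
                n_k[k] += 1
    # Step (5): smoothed estimates of theta (document-topic) and phi (topic-term).
    theta = [[(n_mk[m][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
             for m, doc in enumerate(docs)]
    phi = [[(n_kt[k][t] + beta) / (n_k[k] + vocab_size * beta) for t in range(vocab_size)]
           for k in range(n_topics)]
    return theta, phi
```

When each user's microblogs form one document, the row of `theta` for the target user's document plays the role of P(T | u).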
After the LDA model is solved by Gibbs sampling, the user-topic probability relation P(T | u) of the target user is obtained from the document-topic distribution θ output by the solver.
The output topic-term distribution Φ contains, for every topic, the probability of each term being generated by that topic, which makes it a very large matrix. When the terms of a topic are ranked by probability, the lower-ranked terms have very small generation probabilities, so to save computing time the embodiment of the invention keeps only the top 20 terms of each topic to construct the topic-term probability relation matrix P(V | T) of the corpus. The embodiment of the invention does not limit the number selected.
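The truncation of Φ to the most probable terms per topic can be sketched as follows. The function name `top_terms` and the choice to store each topic as a term-to-probability dictionary are assumptions made for illustration; the source only states that the top 20 terms are kept.

```python
def top_terms(phi, vocab, n_top=20):
    """Keep the n_top most probable terms of each topic as {term: probability}."""
    result = []
    for row in phi:
        # Rank term indices of this topic by descending probability.
        ranked = sorted(range(len(row)), key=lambda t: row[t], reverse=True)[:n_top]
        result.append({vocab[t]: row[t] for t in ranked})
    return result
```

For example, `top_terms([[0.1, 0.6, 0.3]], ["a", "b", "c"], n_top=2)` keeps only `"b"` and `"c"` for the single topic.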
204: the microblog documents of the target user and of the users who have a follow relationship with the target user together form the training set for the user keyword vector model;
205: once the training set is selected, the content published by each user in it is processed into a microblog document, the term weights of each user document are calculated with the TF-IDF algorithm, and the term weights in the target user's microblog document are sorted to obtain the user keyword vector model K(u);
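Steps 204 and 205 can be sketched with a plain TF-IDF computation in which each user's microblogs form one document. The source does not state which TF-IDF variant is used, so the relative term frequency and the `log(N / df)` inverse document frequency below are assumptions, as are the function name `keyword_vector` and the dictionary input format.

```python
import math
from collections import Counter

def keyword_vector(user_docs, target):
    """TF-IDF weights of the target user's terms, sorted from highest to lowest.

    user_docs maps each user to the list of terms in that user's document.
    """
    n_docs = len(user_docs)
    df = Counter()                      # number of user documents containing each term
    for terms in user_docs.values():
        df.update(set(terms))
    tf = Counter(user_docs[target])     # raw term counts in the target user's document
    total = sum(tf.values())
    weights = {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
    return dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True))
```

Terms that occur in every user's document receive weight 0 under this variant, so they sink to the bottom of K(u).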
206: inputting the user-topic probability relation P(T | u) of the target user, the target user keyword vector model K(u) and the candidate microblog document D;
207: for each microblog W in the candidate microblog document D, calculating the interest probability $P(W \mid u)$ of user u in the candidate microblog W according to formula (1), and calculating $K_u(W)$ according to formula (2);
$$P(W \mid u) = \max\{P(w_1 \mid u), P(w_2 \mid u), \ldots, P(w_i \mid u), \ldots, P(w_m \mid u)\} \tag{1}$$
where $W = \{w_1, w_2, \ldots, w_m\}$ and $P(w_i \mid u)$ ($i = 1, 2, \ldots, m$) is the probability that user u is interested in term $w_i$ of the candidate microblog W.
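Formula (1) can be sketched in a few lines. The source does not spell out how the per-term probability $P(w_i \mid u)$ is obtained, so the sketch assumes it is the sum over topics of $P(w_i \mid t)\,P(t \mid u)$, using the truncated P(V | T) and P(T | u) from step 203; the function names and data layout are hypothetical.

```python
def term_interest(term, p_topic_given_user, p_term_given_topic):
    """Assumed form: P(w|u) = sum over topics t of P(w|t) * P(t|u)."""
    return sum(p_t * p_term_given_topic[t].get(term, 0.0)
               for t, p_t in enumerate(p_topic_given_user))

def topic_match(microblog_terms, p_topic_given_user, p_term_given_topic):
    """Formula (1): P(W|u) is the largest per-term interest probability."""
    return max((term_interest(w, p_topic_given_user, p_term_given_topic)
                for w in microblog_terms), default=0.0)
```

Terms outside the retained top terms of every topic contribute 0, so a microblog with no matching term scores 0 under this assumption.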
When terms in the candidate microblog W match terms published by the target user, the matching degree between W and the keyword vector model of target user u is defined as the maximum, over the terms $w_i$ in W, of their weights in the keyword vector model K(u); otherwise, the candidate microblog takes the lowest term weight in the target user's keyword vector model.
[Formula (2): image in the original publication]
where $K_u(w_n)$ is the score of user u for term $w_n$ in the candidate microblog; $w(v_i, u)$ is the score of target user u for term $v_i$ in the user's own document; $K_u(W)$ is the score of user u for the candidate microblog; and $V_{targetusr}$ is the set of terms published by the target user.
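The rule described for formula (2) can be sketched as follows, based only on the prose definition since the formula itself is an image in the source. The function name `keyword_match` and the handling of an empty keyword model (returning 0) are assumptions.

```python
def keyword_match(microblog_terms, keyword_weights):
    """K_u(W): best weight among matched terms, else the lowest weight in K(u)."""
    if not keyword_weights:
        return 0.0  # assumption: an empty keyword model matches nothing
    floor = min(keyword_weights.values())
    # Unmatched terms fall back to the lowest weight in the user's keyword model.
    return max((keyword_weights.get(w, floor) for w in microblog_terms), default=floor)
```

With `K = {"cat": 0.5, "dog": 0.1}`, a microblog containing `"cat"` scores 0.5, while one with no matching term scores 0.1.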
208: calculating Score(W, u) according to formula (3);
[Formula (3): image in the original publication]
Formula (3) takes the weighted average of the matching degree between the target user topic model and the candidate microblog and the matching degree between the target user keyword vector model and the candidate microblog. λ and μ adjust the weights of these two matching degrees respectively.
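A weighted average of the two matching degrees might look like the sketch below. Formula (3) is an image in the source, so the exact form is an assumption: here the weighted sum is divided by λ + μ, which leaves the ranking unchanged compared with the unnormalized sum. The function name `score` and the parameter names `lam` and `mu` are illustrative.

```python
def score(p_topic, k_keyword, lam, mu):
    """Assumed form of formula (3): (lam * P(W|u) + mu * K_u(W)) / (lam + mu)."""
    return (lam * p_topic + mu * k_keyword) / (lam + mu)
```

Raising `mu` relative to `lam` shifts the score toward the keyword vector match, and the reverse shifts it toward the topic match.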
209: after the scores Score(W, u) of the candidate microblogs are obtained, ranking the candidate microblogs by score to construct the initial microblog recommendation list $L_0$ of the target user, and performing redundancy processing on $L_0$;
the initial recommendation list may contain the same microblog forwarded by several people, so duplicates must be removed during recommendation. After redundancy removal, the list is used to provide personalized (TOP-N) recommendations to the user.
210: outputting the recommendation list L after the redundancy processing.
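Steps 209 and 210 can be sketched as a sort followed by duplicate removal and truncation to N items. The source does not say how forwarded copies of the same microblog are recognized, so the `original_id` field identifying the source microblog of a repost is a hypothetical input, as is the function name `recommend`.

```python
def recommend(scored, n):
    """Return the top-n microblog ids after removing reposts of the same original.

    scored: iterable of (microblog_id, original_id, score) tuples.
    """
    ranked = sorted(scored, key=lambda item: item[2], reverse=True)  # initial list L0
    seen, result = set(), []
    for microblog_id, original_id, _ in ranked:
        if len(result) >= n:
            break
        if original_id in seen:   # a repost of a microblog that is already listed
            continue
        seen.add(original_id)
        result.append(microblog_id)
    return result
```

Because duplicates are dropped in score order, the highest-scoring copy of each microblog is the one that stays in the list.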
The embodiment of the invention selects a target user and models that user's topics from the message content the user has published. Candidate microblogs are scored according to how well their topics match the target user and ranked by score for recommendation.
The embodiment of the invention first provides a user topic modeling method that combines LDA and TF-IDF to obtain a better modeling effect; second, it provides a new similarity calculation method for use with the target user keyword vector model; finally, it provides an improved method for combining the topic matching degree and the keyword vector matching degree between a candidate microblog and the target user.
With the MCRA algorithm, the user's explicit behaviors on the microblog platform can be used to model the user's topics and analyze the target user's interests accurately. The MCRA algorithm combines the ideas of the LDA and TF-IDF algorithms to build the target user model, so that correct results move toward the top of the recommendation list without degrading the precision, recall and other indexes of the original algorithms; that is, the microblogs the target user is really interested in are ranked nearer the top of the recommendation list.
Example 3
After the algorithm is designed and implemented, an evaluation method is designed to measure its performance. Precision, recall, F value and average precision (AP) are used as evaluation criteria to assess the effectiveness and correctness of the algorithm, and the experimental results are analyzed.
The number of topics in the experiment is set to 150, and TOP-N recommendation is performed with the number N of recommended microblogs set to 10, 20, 30, 40, 50, 60, 70 and 80 in turn. To check the effect of the MCRA algorithm, user modeling recommendation based on the LDA model and the currently common TF-IDF-based user modeling recommendation method of John Hannon et al. are taken as comparison algorithms, and the three algorithms are compared on precision, recall, F value and average precision.
(1) Precision is calculated as shown in formula (4).
[Formula (4): image in the original publication]
where $L_{all} = \{W_0, W_1, \ldots, W_i, \ldots, W_N\}$, the $W_i$ denote different microblogs, and $L_{right}$ is the content in the recommendation list that matches the target user's interests, which in the experiment is the content published by the target user.
(2) Recall is calculated as shown in formula (5).
[Formula (5): image in the original publication]
where $L_{targetusr}$ is the set of microblogs published by the target user.
(3) The F value is calculated as shown in formula (6).
[Formula (6): image in the original publication]
(4) The average precision (AP) measures how well the system ranks relevant documents: the higher the relevant documents are ranked in the retrieved results, the higher the AP value. If the system returns no relevant documents, the AP is 0. It is calculated as shown in formula (7).
[Formula (7): image in the original publication]
where N is the total number of microblogs published by the target user, that is, the number of relevant microblogs, $r_i$ is the i-th relevant document retrieved, and $R_i$ is the rank of the i-th relevant microblog in the recommendation list.
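The four measures can be sketched as below. Formulas (4) to (7) are images in the source, so the code uses the usual definitions as an assumption: precision as hits over list length, recall as hits over the number of relevant microblogs, the F value as the harmonic mean of the two, and AP as the sum of i / R_i over the relevant microblogs found, divided by N. The function names are illustrative.

```python
def precision_recall_f(recommended, relevant):
    """Precision, recall and F value of a recommendation list against a relevant set."""
    hits = sum(1 for item in recommended if item in relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f_value = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_value

def average_precision(recommended, relevant):
    """AP: sum of i / R_i over the hits, divided by the number of relevant items N."""
    total, hits = 0.0, 0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank   # i-th relevant item found at rank R_i
    return total / len(relevant) if relevant else 0.0
```

For the list `["a", "b", "c", "d"]` with relevant set `{"a", "c", "e"}`, the hits are at ranks 1 and 3, giving a precision of 0.5 and a recall of 2/3.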
On the average precision index, the MCRA algorithm shows a stronger upward trend. In all 8 groups of experiments, the average precision of the MCRA algorithm is not lower than that of the TF-IDF-based microblog text recommendation algorithm. Overall, however, the two are close, differing by no more than 3.1%.
The F values of the MCRA algorithm, the TF-IDF-based microblog text recommendation algorithm and the LDA-based microblog text recommendation algorithm for different numbers of recommended microblogs are plotted as a line chart, as shown in FIG. 3. On the F value index, both the MCRA algorithm and the TF-IDF-based algorithm are higher than the LDA-based algorithm, and the two perform similarly to each other. Both therefore achieve a better F value than the LDA-based microblog text recommendation algorithm.
Because the MCRA algorithm and the TF-IDF-based microblog text recommendation algorithm are far better than the LDA-based algorithm on average precision, only the average precision results of the MCRA algorithm and the TF-IDF-based algorithm are plotted in a bar chart, as shown in FIG. 4. The average precision of the proposed algorithm can exceed that of the microblog text recommendation algorithm based on TF-IDF modeling alone.
In this experiment, the number of topics is set to 20, the number of recommended microblogs is set to 20, α is set to 0.0001, and β takes values from 0.4 to 1.9; the results are shown in FIG. 5.
In FIG. 5, the experimental results show that the recommendation system designed in the embodiment of the invention achieves the best recommendation effect when α = 0.0001 and β = 0.5. The analysis is as follows: when β is lower than 0.4, the matching degree $P(W \mid u)$ between the candidate microblog and the target user topic model carries the larger weight in the score; when β is greater than 1.9, the matching degree $K_u(W)$ between the candidate microblog and the user keyword vector model carries the larger weight in the score.
Therefore, in the experiments the MCRA parameter setting α = 0.0001, β = 0.5 is the most reasonable. With the number of topics set to 150, the precision of the three algorithms for different numbers of recommended microblogs is compared in Table 1.
TABLE 1 Precision of the three algorithms for different numbers of recommended microblogs, with 150 topics
[Table 1: image in the original publication]
As can be seen from Table 1, in all 8 groups of experiments the precision of the MCRA algorithm is higher than that of the LDA-based microblog text recommendation algorithm, by 6% at the lowest and 24% at the highest. In 5 of the groups, the precision of the MCRA algorithm is not lower than that of the TF-IDF-based microblog text recommendation algorithm. With 150 topics, the recall, F value and average precision of the three algorithms for different numbers of recommended microblogs are compared in Tables 2, 3 and 4 respectively.
TABLE 2 recall comparison of three algorithms with 150 themes and different recommended numbers
Figure BDA0000793470100000101
TABLE 3F-value comparison of three algorithms with 150 subject numbers and varying recommendation numbers
Figure BDA0000793470100000102
As can be seen from Table 3, in all 8 groups of experiments the F-value of the MCRA algorithm is higher than that of the LDA-based microblog text recommendation algorithm, by at least 10% and at most 25.7%.
TABLE 4 Average accuracy comparison of the three algorithms with 150 topics and varying numbers of recommendations
Figure BDA0000793470100000103
As can be seen from Table 4, in all 8 groups of experiments the average accuracy of the MCRA algorithm is higher than that of the LDA-based microblog text recommendation algorithm, by at least 6% and at most 18%, as the number of recommended microblogs increases.
Example 4
A microblog text recommendation device based on a user model, referring to FIG. 6, comprises:
an acquisition and preprocessing module 1, configured to acquire microblog data, form microblog documents, and preprocess the microblog documents;
a first calculation module 2, configured to establish a target user topic model according to the LDA topic model and to calculate the matching degree between a candidate microblog and the target user topic model;
a second calculation module 3, configured to establish a target user keyword vector model based on the TF-IDF algorithm and to calculate the matching degree between the candidate microblog and the target user keyword vector model;
and a sorting module 4, configured to combine the two matching degrees by a weighted average method to obtain the matching degree between the candidate microblog and the target user model as the score of the candidate microblog, and to sort the candidate microblogs by score.
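The keyword side of the device (second calculation module 3) can be sketched in Python as follows. This is only an illustration under stated assumptions: each user's microblogs are merged into one token list, IDF is taken as log(N / df), and the matching degree is taken as the sum of the keyword weights found in the candidate. The names `tfidf_keywords`, `keyword_match` and `top_k` are hypothetical and do not come from the patent.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_keywords(user_docs: Dict[str, List[str]], target: str, top_k: int = 5) -> Dict[str, float]:
    """Return the top_k TF-IDF weighted terms of the target user's document.

    user_docs maps each user in the training set to the tokens of that user's
    merged microblog document (already word-segmented).
    """
    n_docs = len(user_docs)
    # Document frequency: in how many user documents each term occurs.
    df = Counter()
    for tokens in user_docs.values():
        df.update(set(tokens))
    tf = Counter(user_docs[target])
    total = sum(tf.values())
    weights = {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
    # Sort by descending weight; ties are broken alphabetically for determinism.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:top_k])


def keyword_match(candidate_tokens: List[str], keywords: Dict[str, float]) -> float:
    """Assumed matching degree: sum of the weights of keywords present in the candidate."""
    present = set(candidate_tokens)
    return sum(w for term, w in keywords.items() if term in present)
```

A term that occurs in every user's document receives a weight of zero under this IDF variant, so only terms that distinguish the target user from the related users end up in the keyword vector.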
Referring to FIG. 7, the sorting module 4 further includes:
a redundancy processing submodule 41, configured to, after the scores Score(W, u) of the candidate microblogs are obtained, sort the candidate microblogs by score, construct an initial microblog recommendation list L₀ of the target user, and perform redundancy processing on the initial microblog recommendation list L₀;
and an output submodule 42, configured to output the recommendation list after the redundancy processing.
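The flow through the sorting module and its two submodules can be sketched in Python as below. Two points are assumptions made purely for illustration: the score is formed as a linear combination with two hypothetical weights `w_topic` and `w_keyword` (how α and β enter Score(W, u) is not restated in this passage), and redundancy processing is approximated by dropping candidates whose Jaccard token overlap with an already kept microblog reaches a hypothetical threshold `max_sim`.

```python
from typing import Dict, List


def jaccard(a: List[str], b: List[str]) -> float:
    """Jaccard similarity of two token lists (0.0 if both are empty)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def recommend(
    candidates: Dict[str, List[str]],   # microblog id -> tokens
    topic_match: Dict[str, float],      # id -> matching degree with the topic model
    keyword_match: Dict[str, float],    # id -> matching degree with the keyword vector
    w_topic: float,
    w_keyword: float,
    top_n: int,
    max_sim: float = 0.8,
) -> List[str]:
    # Assumed scoring rule: weighted combination of the two matching degrees.
    scores = {
        cid: w_topic * topic_match[cid] + w_keyword * keyword_match[cid]
        for cid in candidates
    }
    # Initial recommendation list: candidates in descending score order.
    initial = sorted(candidates, key=lambda cid: (-scores[cid], cid))
    # Assumed redundancy processing: skip near-duplicates of microblogs already kept.
    final: List[str] = []
    for cid in initial:
        if all(jaccard(candidates[cid], candidates[kept]) < max_sim for kept in final):
            final.append(cid)
        if len(final) == top_n:
            break
    return final
```

Filtering after sorting means that, of two near-identical microblogs, the higher-scoring one is always the one that survives in the output list.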
In the embodiments of the present invention, unless the model of a device is specifically described, the models of the devices are not limited, as long as the devices can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of a preferred embodiment, and that the above embodiments of the present invention are provided for description only and do not indicate the relative merits of the embodiments.
The above description covers only preferred embodiments of the present invention and is not to be construed as limiting the invention; any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention are intended to be included within its scope of protection.

Claims (2)

1. A microblog text recommendation method based on a user model, characterized by comprising the following steps:
acquiring microblog data to form microblog documents, performing word segmentation, vectorization and dimension reduction on the microblog documents, and selecting, from the processed microblog texts, the microblog texts of the target user and of users related to the target user as a modeling data set;
in the modeling data set, training the constructed microblog text vector set according to an LDA topic model to obtain a user-topic probability relation of the target user;
forming a training set for training a user keyword vector model from the microblog documents of the target user together with the microblog documents of users who have a follow relationship with the target user;
processing the content published by each user in the training set into a microblog document, calculating the term weights of each user document using the TF-IDF algorithm, sorting the term weights in the microblog document of the target user, and obtaining the user keyword vector model;
inputting the user-topic probability relation of the target user, the keyword vector model of the target user and the candidate microblog documents; calculating, for each microblog in the candidate microblog documents, the probability that the user is interested in the candidate microblog and the keyword vector model;
calculating scores of the candidate microblogs according to the probability of interest and the keyword vector model, and sorting the candidate microblogs by score;
wherein the step of calculating scores of the candidate microblogs according to the probability of interest and the keyword vector model and sorting the candidate microblogs by score specifically comprises:
after the scores of the candidate microblogs are obtained, sorting the candidate microblogs by score, constructing an initial microblog recommendation list of the target user, and performing redundancy processing on the initial microblog recommendation list;
and outputting the recommendation list after the redundancy processing.
2. A microblog text recommendation device based on a user model, characterized by comprising:
an acquisition and preprocessing module, configured to acquire microblog data, form microblog documents, and perform word segmentation, vectorization and dimension reduction on the microblog documents;
a first calculation module, configured to train, in the modeling data set, the constructed microblog text vector set according to the LDA topic model to obtain a user-topic probability relation of the target user;
a second calculation module, configured to form a training set for training a user keyword vector model from the microblog documents of the target user together with the microblog documents of users who have a follow relationship with the target user; to process the content published by each user in the training set into a microblog document, calculate the term weights of each user document using the TF-IDF algorithm, sort the term weights in the microblog document of the target user, and obtain the user keyword vector model;
a sorting module, configured to input the user-topic probability relation of the target user, the keyword vector model of the target user and the candidate microblog documents; to calculate, for each microblog in the candidate microblog documents, the probability that the user is interested in the candidate microblog and the keyword vector model; and to calculate scores of the candidate microblogs according to the probability of interest and the keyword vector model and sort the candidate microblogs by score;
wherein the sorting module further comprises:
a redundancy processing submodule, configured to, after the scores of the candidate microblogs are obtained, sort the candidate microblogs by score, construct an initial microblog recommendation list of the target user, and perform redundancy processing on the initial microblog recommendation list;
and an output submodule, configured to output the recommendation list after the redundancy processing.
CN201510548344.0A 2015-08-31 2015-08-31 Microblog text recommendation method and device based on user model Active CN105183833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510548344.0A CN105183833B (en) 2015-08-31 2015-08-31 Microblog text recommendation method and device based on user model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510548344.0A CN105183833B (en) 2015-08-31 2015-08-31 Microblog text recommendation method and device based on user model

Publications (2)

Publication Number Publication Date
CN105183833A CN105183833A (en) 2015-12-23
CN105183833B true CN105183833B (en) 2020-05-19

Family

ID=54905915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510548344.0A Active CN105183833B (en) 2015-08-31 2015-08-31 Microblog text recommendation method and device based on user model

Country Status (1)

Country Link
CN (1) CN105183833B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202294B (en) * 2016-07-01 2020-09-11 北京奇虎科技有限公司 Related news computing method and device based on keyword and topic model fusion
CN107025310A (en) * 2017-05-17 2017-08-08 长春嘉诚信息技术股份有限公司 A kind of automatic news in real time recommends method
CN107291912A (en) * 2017-06-26 2017-10-24 三螺旋大数据科技(昆山)有限公司 Investor recommends method and apparatus
CN107491417B (en) * 2017-07-06 2021-06-22 复旦大学 Document generation method based on specific division under topic model
CN107391692B (en) * 2017-07-26 2023-04-07 腾讯科技(北京)有限公司 Recommendation effect evaluation method and device
CN107766576A (en) * 2017-11-15 2018-03-06 北京航空航天大学 A kind of extracting method of microblog users interest characteristics
CN108460153A (en) * 2018-03-27 2018-08-28 广西师范大学 A kind of social media friend recommendation method of mixing blog article and customer relationship
CN108733824B (en) * 2018-05-22 2020-07-03 合肥工业大学 Interactive theme modeling method and device considering expert knowledge
CN110020189A (en) * 2018-06-29 2019-07-16 武汉掌游科技有限公司 A kind of article recommended method based on Chinese Similarity measures
CN109034389A (en) * 2018-08-02 2018-12-18 黄晓鸣 Man-machine interactive modification method, device, equipment and the medium of information recommendation system
CN111191108A (en) * 2018-10-26 2020-05-22 上海交通大学 Software crowdsourcing project recommendation method and system based on reinforcement learning
CN109885748A (en) * 2019-02-22 2019-06-14 新疆大学 Optimization recommended method based on meaning of one's words feature
CN110096867B (en) * 2019-05-13 2021-10-08 南开大学 Permission recommendation method and system for Android application function
CN110489665B (en) * 2019-08-16 2023-11-14 北京信息科技大学 Microblog personalized recommendation method based on scene modeling and convolutional neural network
CN111159565B (en) * 2019-12-31 2023-08-25 第四范式(北京)技术有限公司 Method, device and equipment for constructing recommendation model based on multi-objective optimization
CN111310060B (en) * 2020-05-13 2020-10-09 腾讯科技(深圳)有限公司 Recommendation method and device, electronic equipment and computer-readable storage medium
CN112487303B (en) * 2020-11-26 2022-04-22 杭州电子科技大学 Topic recommendation method based on social network user attributes

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049440A (en) * 2011-10-11 2013-04-17 腾讯科技(深圳)有限公司 Recommendation processing method and processing system for related articles
CN103064863A (en) * 2011-10-24 2013-04-24 北京百度网讯科技有限公司 Method and equipment of providing recommend information
CN103823906A (en) * 2014-03-19 2014-05-28 北京邮电大学 Multi-dimension searching sequencing optimization algorithm and tool based on microblog data

Also Published As

Publication number Publication date
CN105183833A (en) 2015-12-23

Similar Documents

Publication Publication Date Title
CN105183833B (en) Microblog text recommendation method and device based on user model
Saad et al. Twitter sentiment analysis based on ordinal regression
CN110162593B (en) Search result processing and similarity model training method and device
US10885073B2 (en) Association strengths and value significances of ontological subjects of networks and compositions
Mohammed et al. Lsa & lda topic modeling classification: Comparison study on e-books
Nie et al. Data-driven answer selection in community QA systems
US8401980B2 (en) Methods for determining context of compositions of ontological subjects and the applications thereof using value significance measures (VSMS), co-occurrences, and frequency of occurrences of the ontological subjects
CN110532378B (en) Short text aspect extraction method based on topic model
CN112989802A (en) Barrage keyword extraction method, device, equipment and medium
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN110705247A (en) Text similarity calculation method based on χ²-C
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
Wei et al. Sentiment classification of Chinese Weibo based on extended sentiment dictionary and organisational structure of comments
Wei et al. Learning from context: a mutual reinforcement model for Chinese microblog opinion retrieval
Jia et al. A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth
Hashemzadeh et al. Improving keyword extraction in multilingual texts.
Kinariwala et al. Short text topic modelling using local and global word-context semantic correlation
Wei et al. Online education recommendation model based on user behavior data analysis
Yajian et al. A short text classification algorithm based on semantic extension
Mehendale et al. Cyber bullying detection for hindi-english language using machine learning
US20220058464A1 (en) Information processing apparatus and non-transitory computer readable medium
CN104615685B (en) A kind of temperature evaluation method of network-oriented topic
AlMahmoud et al. The effect of clustering algorithms on question answering
Mahalakshmi et al. Twitter sentiment analysis using conditional generative adversarial network
Chen et al. Learning the chinese sentence representation with LSTM autoencoder

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant