CN105869058B - A kind of method that multilayer latent variable model user portrait extracts - Google Patents
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
Abstract
A method for extracting user portraits with a multilayer latent variable model, related to the fields of data mining and recommender systems. The invention extracts user portraits from a social curation network using data of two modalities of user behavior: the text description information of collection entries and forwarding chains. An LDA model is introduced to obtain each user's implicit topic distribution from the text description information, and the interest distribution of each topic is derived from the implicit topic distributions; combining a user's implicit topic distribution with the topic interest distributions yields the user's interest distribution. User social communities are discovered with the multilayer latent variable model, and user recommendation results are obtained by sorting Jensen-Shannon divergence in ascending order. The invention performs user social community discovery using two different modalities, user text information and forwarding-chain behavior, to realize user recommendation.
Description
Technical Field
The invention relates to the fields of data mining and recommender systems, and in particular to the research and implementation of a multilayer latent variable model user portrait extraction method.
Background
Social media refers to a series of web applications, built on the technological and ideological foundations of Web 2.0, that allow users to create and exchange content themselves. Since 2009, professional social curation networks (e.g., Pinterest, Snip.it, Zoopit, Huaban, etc.) have formally emerged. The term "social curation" refers collectively to people's activities of collecting, organizing, and sharing information over the network. Traditional social networks are user-centric, whereas social curation networks are content-centric. A social curation network is driven by user interests: users can create content themselves and, by linking and copying, gather content of interest from other websites into their own directories, organizing their personal collections according to their interests. The network thus provides users with very convenient functions for linking, collecting, organizing, and sharing web resources. Other users can comment on, like, and re-collect a user's collections, and the convenient publishing function lets users express themselves with one click and share their viewpoints easily. Pinterest was one of the first websites to integrate one-click content collection and organization into content management sets (called "Boards"). Its rapid growth fully demonstrates the appeal of such social curation networks to mass users. From 2012 onwards, more than ten Pinterest-like social curation networks, such as Huaban, Meilishuo, and Mogujie, have appeared in China.
Extracting personalized user portraits from social networks is a key technology. Many approaches to user portrait extraction exist for social networks, but most focus on predicting user attribute information such as demographic characteristics, gender, age, or religion. The predictive models are typically built on feature sets derived from user-generated content (often natural-language text), rather than forming a systematic model for recommendation or community discovery. User portrait extraction based on latent variable models has been applied to text content analysis, traditional recommender systems, social recommender systems, and social network analysis.
Social curation networks contain several kinds of information: collection entries, collection books, collection categories, the short text information of collection entries generated by users, forwarding-chain information, follow information, and so on. A user is composed of a series of collection-entry sets arranged into collection books, and a single collection entry carries rich information, such as its description, information about the collection book it belongs to, and the original entry it was forwarded from. This multimodal information poses challenges for user modeling. How to build a latent variable model over the description text of collection entries and users' forwarding-chain behavior, discover latent user communities, and recommend users whose interests are similar to a target user's is the focus of the invention.
Disclosure of Invention
For social curation networks, and in view of the varied modalities of user data, a user portrait extraction method based on multimodal latent variable modeling is studied and used to realize community discovery and user recommendation.
To address this problem, the invention provides a two-layer Bayesian latent variable model for describing users. The method comprises the following steps:
A. Establish a word lexicon and a stop-word lexicon, and perform word segmentation on the text description information of all of the user's collection entries with the word segmentation tool ICTCLAS.
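As an illustration (not the patented implementation), the stop-word filtering that follows ICTCLAS segmentation can be sketched in Python; the token list and stop-word set below are hypothetical stand-ins for ICTCLAS output and the stop-word lexicon:

```python
def remove_stop_words(tokens, stop_words):
    """Keep only tokens that are not in the stop-word lexicon."""
    return [t for t in tokens if t not in stop_words]

# Toy data mirroring the segmentation example later in the document.
stop_words = {"must", "which"}
tokens = ["Taobao", "commodity", "description", "must",
          "possess", "which", "qualifications"]
print(remove_stop_words(tokens, stop_words))
```

The filtered token lists are what the first-layer LDA model would consume.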
B. In a social curation network, the current user's collection entries can be collected by other users. When a user re-collects a collection entry, all of its own information except comments, likes, and the like is copied into the user's own collection book.
Collect all forwarding-chain data of the target user set's collection entries; according to the forwarder information recorded in each collection entry, data can be obtained from each collection entry back to the original one. Starting from the current collection entry, data is crawled upward to its parent; guided by where each entry was forwarded from, the trace continues back to the original collection entry. Each node passed in the trace is a copy of the original collection entry, and together the nodes form a chain-shaped path diagram called a forwarding chain. Each forwarding chain consists of a set of collection entries, and each node on the chain is represented by the user ID of the creator of the corresponding forwarded collection entry.
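The parent-by-parent trace described above can be sketched as follows; the `parent` and `creator` mappings are hypothetical stand-ins for the forwarder data stored in each collection entry:

```python
def forwarding_chain(entry_id, parent, creator):
    """Follow parent pointers from the current entry back to the original
    entry, collecting the creator user ID of each node on the way."""
    chain = []
    node = entry_id
    while node is not None:
        chain.append(creator[node])
        node = parent.get(node)  # None once the original entry is reached
    return chain

# Hypothetical three-hop chain; user IDs borrowed from the example data.
parent = {"p3": "p2", "p2": "p1", "p1": None}
creator = {"p3": 271691, "p2": 256217, "p1": 31028}
print(forwarding_chain("p3", parent, creator))
```

The resulting list of user IDs is exactly the node sequence of one forwarding chain.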
C. The multilayer latent variable model is based on a popular Bayesian latent variable model, Latent Dirichlet Allocation (LDA). A user interest model is extracted from the set formed by all the text information and forwarding-chain data of the target user set (users are represented by user IDs and referred to as "pinners" herein).
Further, the step C specifically includes:
C1. The first-layer model extracts the implicit topics of the text description information of the user's collection entries.
C2. Calculate the probability that each collection entry belongs to each topic.
C3. The second-layer model calculates the interest distribution of each implicit topic.
C4. Calculate the interest distribution of each user.
Further, the step C1 includes:
C11. LDA is a popular Bayesian latent variable model, widely applied in machine learning and natural language processing. Its basic idea is that a document is a mixture of multiple topics, and a topic is a probability distribution over words. LDA rests on the "bag of words" assumption: the order of words in a document can be ignored. Applying the LDA model to the user text information associates users and words through implicit topics, generating a user-topic-word three-layer Bayesian model.
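The "bag of words" assumption means two documents with the same word counts are indistinguishable to LDA, whatever the word order; a minimal illustration:

```python
from collections import Counter

def bag_of_words(tokens):
    """Reduce a document to word counts, discarding word order."""
    return Counter(tokens)

# Same words in different order -> identical bag-of-words representation.
print(bag_of_words(["topic", "model", "topic"])
      == bag_of_words(["model", "topic", "topic"]))
```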
C12. Perplexity is used as the metric for the LDA model; the smaller the perplexity value, the better the model. The first-layer LDA model is measured with perplexity, computed as

Perplexity(U_test) = exp( - Σ_{u∈U_test} log p(w_u) / Σ_{u∈U_test} N_u ),

and the best number of topics N_T1 is selected for use in the next step.

Here U_test is the test user set, U_t is the total number of users in the test set, w_u is the word set of the description information of user u's collection entries, p(w_u) is the probability of generating that word set under the user model, and N_u is the total number of words in it. K is the total number of topics, and N_m is the word set of the collection-entry description information of all users.
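A hedged sketch of this perplexity criterion: the per-user log-likelihoods log p(w_u) would come from the trained first-layer LDA model; the numbers below are hypothetical held-out statistics.

```python
import math

def perplexity(log_likelihoods, word_counts):
    """Perplexity of a held-out test set:
    exp(-sum of log-likelihoods / total word count). Lower is better."""
    return math.exp(-sum(log_likelihoods) / sum(word_counts))

# Hypothetical log p(w_u) and N_u for three test users.
print(perplexity([-120.0, -95.0, -140.0], [40, 30, 50]))
```

In the method, this value would be computed for several candidate topic counts and the one where it plateaus chosen as N_T1.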
The step C2 includes:
C21. The description information of a collection entry, whether created by the user or forwarded, consists of a group of words and can be represented as w_pin = {w_1, w_2, ..., w_i, ..., w_N}, where w_i is the i-th word of the collection-entry description information and N is the total number of words in the set. Each topic in the topic set is a probability distribution over words, where z_k is the k-th topic and N_T1 is the best number of topics. p(z_k | pin) denotes the probability that a collection entry pin belongs to topic z_k.
C22. According to the result of the first-layer LDA model, calculate the probability p(z_k | pin) that each collection entry pin belongs to topic z_k. Following the Pareto principle, 0.2 is chosen as the threshold: if p(z_k | pin) > 0.2, the collection entry belongs to that topic. This yields collection-entry sets for the N_T1 topics.
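The 0.2-threshold assignment can be sketched as follows; note that an entry may exceed the threshold for several topics and thus join several topic sets (the probabilities are hypothetical):

```python
def topics_of_entry(topic_probs, threshold=0.2):
    """Return the indices of topics whose membership probability
    p(z_k | pin) exceeds the threshold."""
    return [k for k, p in enumerate(topic_probs) if p > threshold]

# Hypothetical per-topic probabilities for one collection entry.
print(topics_of_entry([0.55, 0.25, 0.05, 0.15]))  # entry joins topics 0 and 1
```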
The step C3 includes:
C31. A collection entry typically propagates from one user to another and may be forwarded onward by further users; a forwarding chain records the entry's propagation through the social curation network. Each collection entry on the forwarding chain is represented by a user ID.
C32. Since LDA rests on the "bag of words" assumption, each user ID is treated as an independent word. For the N_T1 implicit topics, a topic-interest-user ID three-layer Bayesian model is then obtained through the LDA model, yielding the interest distribution of the implicit description-information topics.
The step C4 includes:
C41. Combining the user implicit topic distribution with the topic interest distributions, the user's interest distribution is obtained through matrix multiplication. A user can thus be seen as a mixture of interests, where an interest is a probability distribution over user IDs.
D. User recommendation based on multi-layer latent variable model
The step D comprises the following steps:
D1. The similarity between user u_1 and user u_2 is the same as that between u_2 and u_1. Jensen-Shannon divergence is a method of measuring the distance (and hence similarity) between probability distributions. Unlike Kullback-Leibler divergence, it is balanced and symmetric in its arguments, i.e., the result is independent of the argument order. The smaller the Jensen-Shannon divergence, the greater the similarity. The Jensen-Shannon divergence between the target user and every user on the forwarding chains is computed as the similarity between the two users P and Q:

JSD(P ‖ Q) = (1/2) D_KL(P ‖ M) + (1/2) D_KL(Q ‖ M), where M = (1/2)(P + Q) and D_KL is the Kullback-Leibler divergence.
The set of users on all collection-entry forwarding chains is called the other-user set U_rec = {u_R1, u_R2, ..., u_Ri, ..., u_RN}, where u_Ri is the user ID of the Ri-th node on the forwarding chains and RN is the number of nodes of the collection-entry sets on the forwarding chains. For the target user set, the Jensen-Shannon divergence between each target user u_i and each of the RN users in U_rec is calculated as the similarity between the two users; the result can be expressed as D_JS(u_i, u_Ri), the Jensen-Shannon divergence value of target user u_i with respect to user u_Ri.
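A minimal sketch of the Jensen-Shannon divergence between two interest distributions, illustrating its symmetry (the distributions are toy examples):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: 0.5*KL(P||M) + 0.5*KL(Q||M), M = (P+Q)/2.
    Symmetric in P and Q; smaller means the users are more similar."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(jsd(p, q))                               # positive for distinct P, Q
print(abs(jsd(p, q) - jsd(q, p)) < 1e-12)      # symmetric in its arguments
```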
D2. For each target user u_i, the computed Jensen-Shannon divergence values are sorted in ascending order; the smaller the value, the more similar the users' interests. The Top-N users are recommended to the target user.
Description of the drawings:
fig. 1 is a schematic diagram of a forwarding chain according to the present example.
FIG. 2 is a schematic diagram of a multi-layer latent variable model framework according to the present example.
FIG. 3 is a diagram of the perplexity results of this example.
FIG. 4 is a diagram illustrating the results of a community discovery process according to this example.
FIG. 5 is a diagram of the MAP results of this example.
The specific implementation mode is as follows:
the technical solution of the present invention will be described in more detail with reference to the accompanying drawings and examples.
The embodiment is performed for real data of a certain social curation network, in the example, 100 target users are real users in the network and respectively come from three classifications, where nos. 1 to 35 belong to classification one, nos. 36 to 75 belong to classification two, nos. 76 to 100 belong to classification three, and the total includes 633337 collection entries and forwarding chains corresponding to the collection entries.
A. A new word lexicon is established, containing about 300,000 common and popular keywords. A stop-word lexicon is established containing 1,433 stop words, i.e., words with no specific meaning in a sentence.
The description information of one collection entry in this example is "Which qualifications must a Taobao commodity description possess?"; word segmentation yields {Taobao, commodity, description, possess, qualifications}, and the stop words {must, which} are removed.
B. Read in the forwarding-chain data of the target users' collection entries. Each piece of forwarding-chain data is marked by the user ID (serial number) of the creating user of each node's collection entry on the chain and is represented as R = {p_1, p_2, ..., p_n}. The last two nodes p_{n-1} and p_n of each forwarding chain are removed. The forwarding chain of one target-user collection entry in this example can be represented by user IDs as {38450, 115078, 86804, 60952, 310115, 86588, 269584, 280741, 298423, 15278, 31028, 256217, 271691}; after removing nodes p_{n-1} and p_n, the forwarding chain is {86804, 60952, 310115, 86588, 269584, 280741, 298423, 15278, 31028, 256217, 271691}.
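A sketch of this chain-trimming step. Note the example lists the chain so that the removed pair p_{n-1}, p_n appears first; the sketch follows the example's listing order (an assumption, since the listing direction is ambiguous in this text):

```python
def trim_chain(chain):
    """Drop the pair of removed nodes, which in the example's listing
    order are the first two user IDs of the chain."""
    return chain[2:]

chain = [38450, 115078, 86804, 60952, 310115]
print(trim_chain(chain))
```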
C. And extracting the user interest model based on the text information and the forwarding chain data.
The step C specifically comprises the following steps:
C1. The first-layer model computes the implicit topics of the text description information of the users' collection entries, and the optimal number of topics is selected.
C2. Calculate the probability that each collection entry belongs to each topic.
C3. The second-layer model calculates the interest distribution of each implicit topic.
C4. Calculate the interest distribution of each user.
The step C1 specifically includes:
C11. LDA modeling is performed on the description information of the 100 target users to obtain a user-topic-word three-layer Bayesian model, associating users and words through topics:

p(w | u) = p(w | t) p(t | u) = φ_t θ_u

where p(w | u) represents the word distribution of the 100-target-user set, p(w | t) and φ_t represent the topic distributions over the word set, and p(t | u) and θ_u represent the probability distribution of topics occurring in the user set.
C12. In the experiment, 10% of the data set is selected as the test set, i.e., 10 users. Perplexity is computed for increasing numbers of topics; the values tend to plateau once the number of topics N_T1 exceeds 30, indicating that N_T1 ≥ 30 is optimal in model quality and computational complexity. N_T1 = 30 is therefore set, and each of the 100 target users u_i in this example obtains a probability distribution over the 30 topics.
The step C2 specifically includes:
C21. From the result of the first-layer user-topic-word three-layer Bayesian model, the probability p(z_k | pin) that each collection entry pin belongs to topic z_k is calculated. The description information of one collection entry pin is segmented into the words {Taobao, commodity, description, possess, qualifications}; their probabilities under topic z_k are {0.805, 0.456, 0.771, 0.002, 0.002}, respectively, from which p(z_k | pin) is obtained.
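The exact aggregation formula for p(z_k | pin) is an equation image lost from this text; purely as a hedged illustration, averaging the per-word topic probabilities of the entry's description would look like this:

```python
def entry_topic_prob(word_probs):
    """Hypothetical aggregation: mean of the per-word topic probabilities.
    The patent's actual formula is not recoverable from this extraction."""
    return sum(word_probs) / len(word_probs)

# Per-word probabilities under topic z_k from the example.
word_probs = [0.805, 0.456, 0.771, 0.002, 0.002]
print(round(entry_topic_prob(word_probs), 4))
```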
C22. Calculate the probabilities that all collection entries of the 100-target-user set belong to the N_T1 = 30 topics. If p(z_k | pin) > 0.2, the collection entry is considered to belong to that topic, yielding collection-entry sets for the N_T1 = 30 topics.
The step C3 includes:
C31. Each collection entry on a forwarding chain is represented by a user ID. Each of the N_T1 = 30 topics is represented as a set of related forwarding chains, or equivalently as the set of user IDs of the nodes on those chains.
C32. For the N_T1 = 30 topics, a topic-interest-user ID three-layer Bayesian model is obtained through the LDA model, giving the interest distribution of the implicit description-information topics and associating topics with user IDs through interests:

p(uid | t) = p(uid | int) p(int | t) = φ_int θ_t

where p(uid | t) represents the user-ID distribution of the N_T1 = 30 topics, p(uid | int) and φ_int represent the interest distributions over the user-ID set, and p(int | t) and θ_t represent the probability distribution of interests occurring in the topic set.
The step C4 includes:
C41. Combining the user implicit topic distribution with the topic interest distributions, the implicit topics of the 100-target-user set can be expressed as a matrix of 100 users × 30 topics, and the interest distributions as a matrix of 30 topics × interests. Matrix multiplication yields the matrix of the 100 users' interests, i.e., the user interest probability distribution. Equivalently, a user is a mixture of interests, and an interest is a probability distribution over user IDs.
p(int | u) = p(int | t) p(t | u) = θ_t θ_u

where p(int | u) represents the probability distribution of interests across the 100 users.
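The matrix multiplication p(int | u) = p(int | t) p(t | u) can be sketched with toy dimensions (one user, two topics, two interests, instead of 100 × 30; all numbers hypothetical):

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

user_topic = [[0.7, 0.3]]          # p(t | u): one user over two topics
topic_interest = [[0.6, 0.4],      # p(int | t): two topics over two interests
                  [0.2, 0.8]]
print(matmul(user_topic, topic_interest))  # p(int | u) for the user
```

Each row of the product is again a probability distribution, since both factors have rows summing to 1.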
One advantage of the multimodal latent variable model is that user communities can be discovered jointly from the description information and the forwarding chains; compared with the classic LDA model algorithm, it describes users' social communities better.
D. User recommendation based on multi-layer latent variable model
The step D comprises the following steps:
D1. For the 100 users in the target user set, the Jensen-Shannon divergence between each target user u_i and each of the RN users in the other-user set U_rec is calculated as the similarity between the two users; the result can be expressed as D_JS(u_i, u_Ri), the Jensen-Shannon divergence value of target user u_i with respect to user u_Ri.
D2. For each target user u_i, the computed Jensen-Shannon divergence values are sorted in ascending order; the smaller the value, the more similar the interest distributions. The Top-N users are recommended to the target user. In this example, the algorithm is compared with five baseline algorithms: (1) a recommendation algorithm based on the multimodal latent variable model and Kullback-Leibler divergence; (2) a recommendation algorithm based on the LDA model and Jensen-Shannon divergence; (3) a recommendation algorithm based on the LDA model and Kullback-Leibler divergence; (4) the URSRP algorithm; and (5) a popularity recommendation algorithm based on user-ID occurrence frequency. On the mean average precision (MAP) metric, the recommendation effect is clearly improved, with MLLDA-JSD outperforming all the baseline algorithms. It is also verified that Jensen-Shannon divergence measures similarity between different users better than Kullback-Leibler divergence.
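A sketch of the average-precision component of the MAP metric used in this comparison; the ranking and relevance labels below are hypothetical, and MAP is the mean of this value over all target users:

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked Top-N list: mean of the precision
    values at each rank where a relevant item appears."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / max(len(relevant), 1)

# Hypothetical Top-4 recommendation for one target user.
ranked = [101, 102, 103, 104]
relevant = {101, 103}
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 2
```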
Claims (1)
1. A method for extracting a user portrait of a multi-layer latent variable model is characterized by comprising the following steps:
A. establishing a word lexicon and a stop-word lexicon, and performing word segmentation on the text description information of all of the user's collection entries with the word segmentation tool ICTCLAS;
B. in a social curation network, the current user's collection entries can be collected by other users; when a user re-collects a collection entry, all of its own information is copied into the user's own collection book;
collecting all forwarding-chain data of the target user set's collection entries, and obtaining data from each collection entry back to the original one according to the forwarder information recorded in the collection entries; crawling data upward to the parent starting from the current collection entry; guided by where each entry was forwarded from, tracing back to the position of the original collection entry; each node in the trace is a copy of the original collection entry, and the nodes form a chain-shaped path diagram called a forwarding chain; each forwarding chain consists of a set containing several collection entries; each node on the forwarding chain is represented by the user ID of the creator of the corresponding forwarded collection entry;
C. extracting a user interest model from a set formed by all text information and forwarding chain data of a target user set;
the step C specifically comprises the following steps:
c1, calculating the implicit theme of the text description information of the user collection item;
c2, calculating the probability that each collection item belongs to each topic;
c3, calculating interest distribution of the implied subject by the second-layer model;
c4, calculating the interest distribution of the user;
D. user recommendations based on a multi-layer latent variable model;
the step C1 includes:
C11, applying the LDA model to the user text information, associating users and words through implicit topics to generate a user-topic-word three-layer Bayesian model;
C12, measuring the first-layer LDA model with perplexity, computed as

Perplexity(U_test) = exp( - Σ_{u∈U_test} log p(w_u) / Σ_{u∈U_test} N_u ),

and selecting the best number of topics N_T1 for use in the next step;

where U_test is the test user set, U_t is the total number of users in the test set, w_u is the word set of the description information of user u's collection entries, p(w_u) is the probability of generating that word set under the user model, and N_u is the total number of words in it; K is the total number of topics, and N_m is the word set of the collection-entry description information of all users;
the step C2 includes:
C21, the description information of a collection entry, whether created by the user or forwarded, consists of a group of words, denoted as w_pin = {w_1, w_2, ..., w_i, ..., w_N}, where w_i is the i-th word of the collection-entry description information and N is the total number of words in the set; each topic in the topic set is a probability distribution over words, where z_k is the k-th topic and N_T1 is the best number of topics; p(z_k | pin) denotes the probability that a collection entry pin belongs to topic z_k;
C22, calculating, according to the result of the first-layer LDA model, the probability p(z_k | pin) that each collection entry pin belongs to topic z_k; 0.2 is selected as the threshold, and if p(z_k | pin) > 0.2, the collection entry belongs to that topic, yielding collection-entry sets for the N_T1 best topics;
the step C3 includes:
c31, each collection item is represented by a user ID on the forwarding chain;
c32, obtaining a three-layer Bayesian model of a subject-interest-user ID through an LDA model for the implied subject to obtain the interest distribution of the implied description information subject;
the step C4 includes:
and C41, combining the user implicit theme distribution and the interest distribution of the theme, and obtaining the interest distribution of the user through matrix multiplication.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610250016.7A CN105869058B (en) | 2016-04-21 | 2016-04-21 | A kind of method that multilayer latent variable model user portrait extracts |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610250016.7A CN105869058B (en) | 2016-04-21 | 2016-04-21 | A kind of method that multilayer latent variable model user portrait extracts |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105869058A CN105869058A (en) | 2016-08-17 |
CN105869058B true CN105869058B (en) | 2019-10-29 |
Family
ID=56632428
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610250016.7A Expired - Fee Related CN105869058B (en) | 2016-04-21 | 2016-04-21 | A kind of method that multilayer latent variable model user portrait extracts |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105869058B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN107357835B * | 2017-06-22 | 2020-11-03 | University of Electronic Science and Technology of China | Interest prediction mining method and system based on topic model and forgetting rule |
- CN108876643A * | 2018-05-24 | 2018-11-23 | Beijing University of Technology | A multimodal representation method for collection entries (pins) on a social curation network |
- CN110209875B * | 2018-07-03 | 2022-09-06 | Tencent Technology (Shenzhen) Co., Ltd. | User content portrait determination method, access object recommendation method and related device |
- CN112836507B * | 2021-01-13 | 2022-12-09 | Harbin Engineering University | Method for extracting domain text theme |
- CN116383521B * | 2023-05-19 | 2023-08-29 | Suzhou Inspur Intelligent Technology Co., Ltd. | Subject word mining method and device, computer equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN103064917A * | 2012-12-20 | 2013-04-24 | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences | Specific-tendency high-influence user group discovering method orienting microblog |
- CN103500340A * | 2013-09-13 | 2014-01-08 | Nanjing University of Posts and Telecommunications | Human body behavior identification method based on thematic knowledge transfer |
- CN103886067A * | 2014-03-20 | 2014-06-25 | Zhejiang University | Method for recommending books through label implied topic |
- CN104991956A * | 2015-07-21 | 2015-10-21 | PLA Information Engineering University | Microblog transmission group division and account activeness evaluation method based on theme possibility model |
- CN105069003A * | 2015-06-15 | 2015-11-18 | Beijing University of Technology | User focus object recommendation calculation method based on forward chain similarity |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6981040B1 (en) * | 1999-12-28 | 2005-12-27 | Utopy, Inc. | Automatic, personalized online information and product services |
-
2016
- 2016-04-21 CN CN201610250016.7A patent/CN105869058B/en not_active Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
CN105869058A (en) | 2016-08-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Giannoulakis et al. | Evaluating the descriptive power of Instagram hashtags | |
Lu et al. | GCAN: Graph-aware co-attention networks for explainable fake news detection on social media | |
Kumar et al. | Sentiment analysis of multimodal twitter data | |
CN105869058B (en) | A kind of method that multilayer latent variable model user portrait extracts | |
Sawant et al. | Automatic image semantic interpretation using social action and tagging data | |
Sun et al. | Leveraging content and connections for scientific article recommendation in social computing contexts | |
Pan et al. | Social media-based user embedding: A literature review | |
Cao et al. | BASS: A bootstrapping approach for aligning heterogenous social networks | |
Koggalahewa et al. | An unsupervised method for social network spammer detection based on user information interests | |
Faralli et al. | Automatic acquisition of a taxonomy of microblogs users’ interests | |
Shevade et al. | Modeling personal and social network context for event annotation in images | |
Kumar | Social Media Analytics for Stance Mining A Multi-Modal Approach with Weak Supervision. | |
Han et al. | Photos don't have me, but how do you know me? Analyzing and predicting users on Instagram | |
Peng et al. | HARSAM: A hybrid model for recommendation supported by self-attention mechanism | |
Lydiri et al. | A performant deep learning model for sentiment analysis of climate change | |
Sheeba et al. | A fuzzy logic based on sentiment classification | |
Chalehchaleh et al. | BRaG: a hybrid multi-feature framework for fake news detection on social media | |
Potha et al. | Dynamic ensemble selection for author verification | |
Zhao et al. | User-sentiment topic model: refining user's topics with sentiment information | |
Xia et al. | Content-irrelevant tag cleansing via bi-layer clustering and peer cooperation | |
Su et al. | Labeling faces with names based on the name semantic network | |
Liu et al. | Macro-scale mobile app market analysis using customized hierarchical categorization | |
Hannah et al. | A classification-based summarisation model for summarising text documents | |
Abulaish et al. | A layered approach for summarization and context learning from microblogging data | |
Sharma et al. | Determining the popularity of political parties using twitter sentiment analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20191029
CF01 | Termination of patent right due to non-payment of annual fee |