CN109359180B - User portrait generation method and device, electronic equipment and computer readable medium


Info

Publication number
CN109359180B
Authority
CN
China
Prior art keywords
item
document
user
article
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811099279.8A
Other languages
Chinese (zh)
Other versions
CN109359180A (en)
Inventor
蔡业首
汤煌
张小鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201811099279.8A priority Critical patent/CN109359180B/en
Publication of CN109359180A publication Critical patent/CN109359180A/en
Application granted granted Critical
Publication of CN109359180B publication Critical patent/CN109359180B/en

Abstract

The disclosure relates to a user portrait generation method and device, an electronic device, and a computer readable medium. The method comprises the following steps: acquiring behavior information of a user and/or item description information corresponding to the behaviors; generating an item set from the behavior information; generating a description document from the item description information; inputting the item set and/or the description document into a probability graph model, so as to calculate a document topic vector and/or an item topic vector through the probability graph model; and generating a user portrait of the user from the document topic vector and/or the item topic vector. The user portrait generation method and device, the electronic device, and the computer readable medium can improve the coverage of item description information in the user portrait and improve the user portrait precision.

Description

User portrait generation method and device, electronic equipment and computer readable medium
Technical Field
The present disclosure relates to the field of computer information processing, and in particular, to a user portrait generation method, apparatus, electronic device, and computer readable medium.
Background
The user portrait, also called a user role, is an effective tool for delineating target users and connecting user needs with design direction, and is widely applied in various fields. In practice, when generating a user portrait, the attributes, behaviors, and expectations of a user are often summarized with the simplest, most everyday words to serve as a virtual representation of the actual user, and subsequent user-interest mining is expected to be based on this portrait. In user interest mining, a user portrait is currently constructed mainly from item description information and user behavior information: first, the items purchased or downloaded by the user are clustered or classified by using the item description information; then, the category information obtained from the item descriptions is mapped to the user level according to the behavior information of the user.
However, the current user portrait mining method has some problems in practical application. First, some items lack description information. For example, when the items are application programs, applications with description information account for only about 60% of the total number of applications; mining with the above method therefore inevitably discards the 40% of applications without description information, together with the behaviors of users on those applications. Discarding this item information and behavior information tends to reduce the coverage of users and makes the user portrait insufficiently accurate.
Therefore, a new user representation generation method, apparatus, electronic device and computer readable medium are needed.
The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of the above, the present disclosure provides a user portrait generation method, device, electronic device and computer readable medium, which can improve the coverage rate of the item description information in the user portrait and improve the user portrait precision.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the disclosure, a method for generating a user portrait is provided, the method comprising: acquiring behavior information of a user, wherein the behavior information comprises item operation information and item description information; generating an item set from the item operation information; generating a description document from the item description information; inputting the item set and/or the description document into a probability graph model, so as to calculate a document topic vector and/or an item topic vector through backward inference of the probability graph model; and generating a user portrait of the user from the document topic vector and/or the item topic vector.
According to an aspect of the present disclosure, a user portrait generation apparatus is provided, the apparatus comprising: an information module for acquiring behavior information of a user, wherein the behavior information comprises item operation information and item description information; an item set module for generating an item set from the item operation information; a description document module for generating a description document from the item description information; a vector module for inputting the item set and/or the description document into a probability graph model, so as to calculate a document topic vector and/or an item topic vector through backward inference of the probability graph model; and a user portrait module for generating a user portrait of the user from the document topic vector and/or the item topic vector.
According to an aspect of the present disclosure, an electronic device is provided, the electronic device including: one or more processors; storage means for storing one or more programs; when executed by one or more processors, cause the one or more processors to implement a method as above.
According to an aspect of the disclosure, a computer-readable medium is proposed, on which a computer program is stored, which program, when being executed by a processor, carries out the method as above.
According to the user portrait generation method, the device, the electronic equipment and the computer readable medium, the coverage rate of the article description information in the user portrait can be improved, and the user portrait precision is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are merely some embodiments of the present disclosure, and other drawings may be derived from those drawings by those of ordinary skill in the art without inventive effort.
FIG. 1 is a system block diagram illustrating a user representation generation method and apparatus in accordance with an exemplary embodiment.
FIG. 2 is a diagram illustrating an application scenario of a user representation generation method and apparatus according to an exemplary embodiment.
FIG. 3 is a flow diagram illustrating a user representation generation method in accordance with another illustrative embodiment.
FIG. 4 is a diagram illustrating a user representation generation method in accordance with an exemplary embodiment.
FIG. 5 is a diagram illustrating a user representation generation method in accordance with an exemplary embodiment.
FIG. 6 is a flow diagram illustrating a user representation generation method in accordance with another illustrative embodiment.
FIG. 7 is a flow diagram illustrating a user representation generation method in accordance with another illustrative embodiment.
FIG. 8 is a flowchart illustrating a user representation generation method in accordance with another exemplary embodiment.
FIG. 9 is a block diagram illustrating a user representation generation apparatus in accordance with an exemplary embodiment.
FIG. 10 is a block diagram illustrating a user representation generation apparatus in accordance with another exemplary embodiment.
FIG. 11 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 12 is a schematic diagram illustrating a computer-readable storage medium according to an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present disclosure and are, therefore, not intended to limit the scope of the present disclosure.
The inventor of the present application has found that, as described above, the user portrait method in the prior art reduces the coverage of users by interest features and makes the user portrait insufficiently accurate. These defects can be addressed by behavior-based LDA (Latent Dirichlet Allocation), whose main idea is to regard each user as a "document" whose content is the list of items on which the user generated behaviors. Based on this assumption, the user's behaviors can be utilized to cluster the items.
Behavior-based LDA can solve the problem of clustering items that lack descriptions. In an actual scenario, however, constructing a user portrait through such LDA abandons the text information related to the items and relies completely on the behavior information of the user, so the semantic relevance among the items within a cluster obtained by this method is weak. For example, in a user group clustering scenario where a user portrait is constructed by LDA, users in the same region are often divided into the same cluster.
In view of the above, the inventor of the present application provides a user portrait generation method and device. A multi-input latent Dirichlet allocation model is established, and a probability graph model is obtained through training of the multi-input latent Dirichlet allocation model. The probability graph model can simultaneously accept multiple inputs of behaviors and texts, and can simultaneously obtain topic vectors of multiple layers, such as users, items, and keywords, in the same topic space. A user portrait generated through the topic vectors of these multiple layers can more comprehensively reflect user characteristics and more accurately describe the personal information of the user.
The content of the present application will be described in detail below with the aid of specific examples:
FIG. 1 is a system block diagram illustrating a user representation generation method and apparatus in accordance with an exemplary embodiment.
As shown in fig. 1, the system architecture 1000 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may perform usual network operations on the terminal devices 101, 102, 103, and the terminal devices 101, 102, 103 may interact with the server 105 via the network 104 to receive or send messages, etc. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The terminal devices 101, 102, 103 may send the behavior information of the user to the server 105 in real time so that the server 105 generates the user portrait in real time. The terminal devices 101, 102, 103 may also periodically send user behavior information to the server 105 in the form of timed tasks, so that the server 105 periodically generates and/or updates the user portrait. The terminal devices 101, 102, and 103 may further store the behavior information of the user in a log, for example, and the server 105 may actively pull the behavior information of the user from the terminal devices 101, 102, and 103, which is not limited in this application.
The server 105 may be a server that provides various services, such as a background server that analyzes user behavior. The server 105 may analyze the received user behavior information and generate a processing result (a user image, a user tag, and the like).
The server 105 may, for example, obtain behavior information of the user and/or item description information corresponding to the behavior; server 105 may generate a collection of items, for example, from item operation information; the server 105 may generate a description document, for example, from the item description information; server 105 may, for example, input the set of items and/or the descriptive document into a probabilistic graphical model to determine a document topic vector and/or an item topic vector; server 105 may generate a user representation of the user, for example, from a document theme vector and/or an item theme vector.
The server 105 may also generate the probabilistic graph model from a multi-input potential dirichlet distribution model whose inputs are a plurality of data sets, e.g., according to behavioral information of one or more users and/or corresponding item description information.
The server 105 may be a physical server or may be composed of a plurality of servers. It should be noted that the user portrait generation method provided by the embodiments of the present disclosure may be executed by the server 105, and accordingly, the user portrait generation apparatus may be disposed in the server 105; the responding end that serves the user's normal network operations and use requests is generally located in the terminal devices 101, 102, and 103.
FIG. 2 is a diagram illustrating an application scenario of a user representation generation method and apparatus according to an exemplary embodiment. As shown in fig. 2, various users perform normal use operations in the terminal device, and the use operations may include web browsing, purchasing an application in an application store and downloading the application, performing a predetermined function using the application, and the like. The behavior information of the user may include the time of the behavior, the frequency of the behavior, and the like. The terminal device may send the user's behavioral information to a predetermined server, which determines the user's portrait of the user based on the analysis.
The terminal device can send the behavior information of the user to the server in real time, so that the server can generate the user portrait in real time. The terminal device can also send the behavior information of the user to the server periodically through a timed task, so that the server can generate and/or update the user portrait periodically. The terminal device may further store the behavior information of the user in a log, and the server may actively pull the behavior information of the user from the terminal device, which is not limited in this application.
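As a non-limiting illustration of the kind of record such a log might carry, the following minimal Python sketch shows one possible behavior record; the field names are assumptions made for the sketch, not part of the disclosure.

```python
import time

# A hypothetical behavior record sent from the terminal to the server.
# The field names are illustrative assumptions and are not defined by the disclosure.
behavior_record = {
    "user_id": "u_001",
    "item_id": "app_123",           # the item (e.g. an application) the behavior acts on
    "action": "download",           # e.g. browse / purchase / download / use
    "timestamp": int(time.time()),  # time of the behavior
    "count": 3,                     # frequency of the behavior in the reporting window
}

# The terminal may push such records in real time, upload them as a timed
# (e.g. daily) task, or append them to a local log that the server pulls.
```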
FIG. 3 is a flow diagram illustrating a user representation generation method in accordance with another illustrative embodiment. The user representation generation method 30 includes steps S302 to S310.
As shown in fig. 3, in S302, behavior information of the user and/or item description information corresponding to the behavior is obtained. The behavior information of the user may include: the user performs a normal use operation in the terminal device, and the use operation may include web browsing, purchasing an application in an application store and downloading the application, completing a predetermined function using the application, and the like. The behavior information of the user may further include time of the behavior, frequency of the behavior, and the like.
In one embodiment, the item may be, for example, an application in an application store, and may also be, for example, a physical item purchased by the user in an online mall, which is not limited in this application. For example, when the item is application software in an application store, the item description information corresponding to the item is the introduction information of the application, such as the interest field, the player range, and the application usage scenarios related to the application. When the item is a physical item in an online mall, the item description information corresponding to the item may be information such as the category, the function, and the price range of the item.
In S304, an item set is generated from the item operation information. This may include the following steps: extracting item operation information from an item interaction log; and generating the item set from the items subjected to a predetermined operation in the item interaction log. For example, item purchase information is extracted from an item purchase log, and the item set is generated from the purchased items in the item purchase information. The item set may also be generated, for example, by extracting application download information from an application download log and using the downloaded applications in the application download information.
In one embodiment, the item set may sequentially list all the items purchased by the user, and may, for example, store each purchased item together with the number of times it was purchased.
In one embodiment, a time window is set, the items purchased by the user within the time window are extracted, and the item set is generated from the user's items within the time window.
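A minimal sketch of this step, under the assumption of a simple in-memory purchase-log format (the field names are illustrative only):

```python
from collections import Counter

def build_item_set(purchase_log, user_id, start_ts, end_ts):
    """Collect the items a user purchased inside the time window [start_ts, end_ts],
    together with the number of times each item was purchased."""
    return Counter(
        entry["item_id"]
        for entry in purchase_log
        if entry["user_id"] == user_id and start_ts <= entry["timestamp"] <= end_ts
    )

# Toy purchase log:
log = [
    {"user_id": "A", "item_id": "B", "timestamp": 100},
    {"user_id": "A", "item_id": "B", "timestamp": 150},
    {"user_id": "A", "item_id": "C", "timestamp": 200},
]
print(build_item_set(log, "A", 0, 300))  # Counter({'B': 2, 'C': 1})
```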
In S306, a description document is generated from the item description information. This may include the following steps: acquiring a plurality of pieces of description information corresponding to a plurality of items; splicing the plurality of pieces of description information to generate the item description information; and performing word segmentation processing on the item description information to generate the description document.
The word segmentation process may be Chinese word segmentation for Chinese characters, which refers to segmenting a sequence of Chinese characters into individual words, i.e., recombining a continuous character sequence into a word sequence according to certain rules. When analyzing the item description information in the embodiments of the present application, word segmentation algorithms can be divided into three types: word segmentation based on character string matching, word segmentation based on understanding, and word segmentation based on statistics. According to whether they are combined with part-of-speech tagging, they can also be divided into pure word segmentation methods and integrated methods combining segmentation and tagging.
Of course, when the item description information is in English or another language, the item description information may also be segmented by the word segmentation methods corresponding to that language; the specific word segmentation method does not affect the processing steps of the subsequent embodiments of the present disclosure. In the present disclosure, the item description information may be segmented by one or more of the above word segmentation methods, and the present disclosure is not limited thereto.
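As a sketch of the splicing and word segmentation steps (using the open-source jieba segmenter for Chinese purely as one possible choice; any of the segmentation methods above could be substituted):

```python
import jieba  # one commonly used Chinese word-segmentation library

def build_description_document(descriptions):
    """Splice the description texts of a user's items and segment them into words."""
    spliced = "，".join(descriptions)            # concatenate the item descriptions
    return [w for w in jieba.lcut(spliced)       # segment the spliced text
            if w.strip() and w not in "，。、"]   # drop whitespace and punctuation

descriptions = ["XX品牌篮球鞋", "YY品牌矿泉水一箱", "蓝色和黑色签字笔各一盒"]
print(build_description_document(descriptions))
```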
In S308, the item set and/or the description document are input into a probability graph model, so as to calculate a document topic vector and/or an item topic vector through the probability graph model. This may include the following steps: determining key items (one or more, or a predetermined number) from the item set and keywords (one or more, or a predetermined number) from the description document; constructing a probability function at least from the keywords and the key items; and solving the probability function to determine the document topic vector and the item topic vector.
In one embodiment, determining key items from the set of items, determining keywords from the descriptive document comprises: determining keywords according to the description documents and the document theme matrix; and determining key items through the item set and the item subject matrix according to the key words.
In one embodiment, determining keywords from the descriptive document and the document topic matrix comprises: and extracting the keywords according to the probability distribution of each topic in the document topic matrix.
In one embodiment, determining key items from the item set and the item topic matrix according to the keywords comprises: determining extracted topic words according to the keywords; and extracting the key items from the item set according to the probability distribution of each item under the extracted topic words in the item topic matrix.
In one embodiment, for each user u, the description information corresponding to the items purchased by the user is spliced and word-segmented to serve as the user's interest description document d_u, and the set of items purchased by the user is defined as t_u, so that each user may be expressed in the form of a triplet (u, d_u, t_u).
Through prior model training, the model parameters of the probability graph model can be obtained: the document topic matrix Λ and the item topic matrix Φ. The number of model topics is set to K, and the probability function corresponding to the probability graph model is solved to determine the document topic vector T_d and the item topic vector T_t. Each topic word and each item are respectively expressed as K-dimensional topic distribution vectors in the Λ matrix and the Φ matrix. The process of training the probability graph model will be described in detail in the embodiment corresponding to fig. 6; the specific process of solving the probability graph model will be described in detail in the embodiment corresponding to fig. 7.
In an embodiment, the probabilistic graph model is generated by a multi-input latent dirichlet allocation model, input information of the probabilistic graph model may receive other user characteristic inputs in addition to the article operation information and the article description information, and a structure of the multi-input latent dirichlet allocation model may be adjusted according to an amount of the input information, which is not limited in the present application.
In S310, a user portrait of the user is generated from the document topic vector and/or the item topic vector. A user interest vector is generated from the document topic vector T_d and the item topic vector T_t, and the user portrait can be described by the user interest vector.
The user interest vector may finally be generated using the following formula:
T_u = η*T_d + (1-η)*T_t, where 0 < η < 1
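A minimal sketch of this combination, assuming the two K-dimensional topic vectors have already been inferred:

```python
import numpy as np

def user_interest_vector(t_d, t_t, eta=0.5):
    """Combine the document topic vector T_d and the item topic vector T_t into
    the user interest vector T_u = eta*T_d + (1-eta)*T_t, with 0 < eta < 1."""
    assert 0.0 < eta < 1.0
    return eta * np.asarray(t_d) + (1.0 - eta) * np.asarray(t_t)

# Example with K = 4 topics (toy values):
t_d = np.array([0.1, 0.6, 0.2, 0.1])
t_t = np.array([0.3, 0.3, 0.3, 0.1])
print(user_interest_vector(t_d, t_t, eta=0.7))  # approximately [0.16 0.51 0.23 0.10]
```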
According to the user portrait generation method of the present disclosure, the item set and the description document are input into the probability graph model to determine the document topic vector and the item topic vector, so that the characteristics of the user can be described from two aspects: the information of the items themselves and the behavior information of the user. This improves the coverage of item description information in the user portrait and improves the user portrait precision.
It should be clearly understood that this disclosure describes how to make and use particular examples, but the principles of this disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
In one embodiment, the probabilistic graphical model may be generated by historical user behavior information and a multi-input potential dirichlet distribution model. The basic theory of the multi-input latent dirichlet distribution model will be described below:
FIGS. 4 and 5 are diagrams illustrating a user portrait generation method according to an exemplary embodiment. In this application, the single-input latent Dirichlet allocation may be an LDA topic model; by expanding its model structure, for example by defining two or more types of input information, the multi-input latent Dirichlet allocation model can be generated. The description documents input to the latent Dirichlet allocation are denoted D, a word in a certain piece of item description information i is represented by w, and K represents the number of document topics in the item description information. K may be, for example, a value preset by the user; its value range may be (1, 100), and K is a positive integer.
Fig. 4 shows the probability graph model formed by single-input latent Dirichlet allocation; its parameters are described as follows:
1. α and β are model hyper-parameters of Dirichlet distributions, used to generate θ and φ respectively. The Dirichlet distribution is a family of continuous multivariate probability distributions and a multivariate generalization of the beta distribution, where the beta distribution is a family of continuous probability distributions defined on the interval (0, 1). The Dirichlet distribution is often used as the prior probability in Bayesian statistics; when its dimension tends to infinity, it becomes a Dirichlet process;
2. θ ~ Dir(α) is the probability distribution of each piece of item description information over the K topics;
3. φ ~ Dir(β) is the probability distribution over keywords under each topic, embodied as a K x |V| matrix, where |V| is the number of keywords;
4. the Dirichlet distribution and the Multinomial distribution are conjugate;
5. α controls the mean of θ and the sparsity of the Multinomial distributions.
The single-input latent Dirichlet allocation assumes that the item description information in a certain description document D is composed of K topics. Under each topic k there is a probability distribution vector of dimension |V| that generates the individual keywords, where |V| is the total number of keywords. For example, under a music topic, words such as piano, violin, and performer have higher probabilities, while under a sports topic, words such as football, basketball, and Yao Ming have higher probabilities.
When a person writes a document, a K-dimensional topic distribution vector θ is first drawn from a Dirichlet distribution with parameter α. For each word w in the document, the author draws a topic k from a Multinomial distribution with parameter θ, and then draws a keyword w according to the probabilities of keywords appearing under topic k so as to express that topic. If the document contains N keywords, this keyword extraction process is repeated N times, finally generating the whole document.
Formula (1) is the joint probability distribution function of a document for single-input Latent Dirichlet Allocation (LDA); integrating out the hidden variables θ and z and multiplying the probabilities of all documents yields formula (2), where p(W|α,β) represents the total probability of occurrence of the document set. According to maximum likelihood estimation, the model parameters of the single-input latent Dirichlet allocation are solved by maximizing p(W|α,β). The hidden variables z and θ in LDA can be solved using Gibbs sampling or the EM algorithm, and then the probability distribution matrix φ of the keywords under each topic is obtained. The EM algorithm is the Expectation Maximization algorithm, an iterative algorithm used in statistics to find maximum likelihood estimates of parameters in probability models that depend on unobservable hidden variables.
p(θ,z,w|α,β) = p(θ|α)·∏_{n=1}^{N} p(z_n|θ)·p(w_n|z_n,β)   (1)
p(W|α,β) = ∏_{d=1}^{D} ∫ p(θ_d|α)·(∏_{n=1}^{N_d} Σ_{z_dn} p(z_dn|θ_d)·p(w_dn|z_dn,β)) dθ_d   (2)
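For reference, the single-input LDA described by formulas (1) and (2) can be fitted with an off-the-shelf implementation; the sketch below uses scikit-learn only to illustrate the roles of θ and φ, and does not implement the multi-input extension of the present disclosure.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "basketball shoes sneakers sport court",
    "mineral water bottle drink",
    "pen notebook sign pen stationery",
]

K = 2                                                  # number of topics
counts = CountVectorizer().fit_transform(docs)         # document-keyword count matrix
lda = LatentDirichletAllocation(n_components=K,
                                doc_topic_prior=0.1,   # alpha
                                topic_word_prior=0.01, # beta
                                random_state=0).fit(counts)

theta = lda.transform(counts)  # per-document topic distributions (theta)
phi = lda.components_          # K x |V| keyword-topic weights (unnormalized phi)
print(theta.shape, phi.shape)  # (3, 2) (2, 13)
```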
Corresponding to a user portrait service scene, in order to obtain user interest expression, single-input potential Dirichlet distribution can be converted into a multi-input potential Dirichlet distribution model through an additional input mode, and a user portrait is determined from a user side and an article side respectively.
First, an item collection and description document is generated for a user:
1) The description document contains all the description information of the objects on which the user generated behaviors.
2) The item set is simply the IDs of all the items on which the user generated behaviors.
Then, starting from the item side, each item obtains a K-dimensional topic vector based on the multi-input latent Dirichlet allocation model, and the topic vectors of the items on which the user generated behaviors are superposed to express the user.
According to the above model analysis, as shown in fig. 5, in the embodiment of the present application, in order to enable the user behavior information and the item description information to simultaneously affect the construction of the user portrait, the single-input latent Dirichlet allocation is extended. Fig. 5 shows the probability graph model of the extended BC-LDA (Behavior Content topic model, i.e., the multi-input latent Dirichlet allocation model). The improved topic model can simultaneously accept behavior and text inputs. For example, a document topic matrix corresponding to the first document topic vector can be generated through the first-layer model structure of the multi-input latent Dirichlet allocation model; an item topic matrix corresponding to the first item topic vector can be generated through the second-layer model structure of the multi-input latent Dirichlet allocation model; and the probability graph model is generated from the document topic matrix and the item topic matrix. After training, the topic vectors of the user, the items, and the keywords in the same topic space are obtained.
The parameter description of the upper half of fig. 5 is consistent with the single-input latent dirichlet distribution, and the parameter description of the lower half is given below:
1. t is an item on which the user generates a behavior;
2. each item t on which the user generates a behavior corresponds to a topic c;
3. Φ ~ Dir(γ) is the probability distribution of items under each topic, a K x |I| matrix, where |I| is the number of items.
By splicing the descriptions of the items on which the user generated behaviors, the description document of the user can be generated; this process is the same as the scheme from the user side. On this basis, BC-LDA assumes that the user generates behaviors on items through the following process (see the sketch after formula (3) below):
1. like LDA, one z is generated for each keyword w on the user description text;
2. before the user acts on an item, a topic c is first extracted with equal probability from the topic vector Z_u of the user's description text;
3. an item t is extracted from the item probability distribution vector corresponding to c, generating a behavior;
4. if the user has acted on M items, steps 2-3 are repeated M times.
The probability map model of BC-LDA can be formally expressed as the following probability function:
p(W,T,Z,C,Θ|α,β,γ)=p(Z|Θ)p(Θ|α)p(W|Z,β)p(T|C,γ)p(C|Z) (3)
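The generative story described above can be summarized as the following sketch; the sampling code (with toy sizes) only illustrates the assumed generative process, not the training algorithm associated with formula (3).

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, I = 5, 100, 50          # topics, keyword vocabulary size, item vocabulary size
alpha, beta, gamma = 0.1, 0.01, 0.01

word_topics = rng.dirichlet([beta] * V, size=K)   # K x |V| keyword distributions (rows of Lambda)
item_topics = rng.dirichlet([gamma] * I, size=K)  # K x |I| item distributions (rows of Phi)

def generate_user(n_words=20, n_items=5):
    theta = rng.dirichlet([alpha] * K)                    # user topic distribution
    z = rng.choice(K, size=n_words, p=theta)              # topic of each keyword
    words = [rng.choice(V, p=word_topics[k]) for k in z]  # keywords of the description document
    c = rng.choice(z, size=n_items)                       # item topics drawn uniformly from Z_u
    items = [rng.choice(I, p=item_topics[k]) for k in c]  # items on which behaviors are generated
    return words, items

print(generate_user())
```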
FIG. 6 is a flow diagram illustrating a user representation generation method in accordance with another illustrative embodiment. The process 60 of the user representation generation method exemplarily describes a process of "generating the probability map model by using behavior information of historical users and a multi-input latent dirichlet distribution model".
In S602, an item set group is generated from the one or more behavior information.
In S604, a descriptive document set is generated from the one or more item description information.
In S606, the item set group and the description document group are input into a multi-input latent dirichlet distribution model to obtain a first document topic vector and a first item topic vector.
In one embodiment, a document topic matrix corresponding to the first document topic vector is generated by a first-level model structure of a multi-input latent dirichlet distribution model; an item topic matrix corresponding to the first item topic vector may also be generated, for example, by a second-tier model structure of a multi-input latent dirichlet distribution model; and generating the probability graph model according to the document theme matrix and the article theme matrix.
In S608, the first document topic vector and the first item topic vector are iteratively sampled and calculated by Gibbs sampling. Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm used in statistics to approximately draw a sample sequence from a multivariate probability distribution when direct sampling is difficult. The sequence can be used to approximate the joint distribution, the marginal distributions of some variables, or to compute integrals.
In S610, when the iterative sampling calculation satisfies the condition, the probability map model is generated by a current multi-input latent dirichlet allocation model. The method comprises the following steps: acquiring a document theme matrix and an article theme matrix of a current multi-input potential Dirichlet distribution model and the number of model themes; and generating the probability map model through the document theme matrix, the item theme matrix, the number of model themes and a model structure of a multi-input potential Dirichlet distribution model.
For example, when the first threshold satisfies the condition during the iterative sampling calculation, the relevant parameters of the multi-input latent Dirichlet allocation model are determined and used as the probability graph model. In one embodiment, the first threshold may be perplexity, which measures how well a probability distribution or probability model predicts a sample; it can also be used to compare probability models, such as those defined by the document topic matrix Λ and the item topic distribution matrix Φ in the embodiments of the present application. In the embodiments of the present application, a probability model with low perplexity is considered to predict the user behavior data better and to yield better prediction of samples.
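Perplexity can be computed from the average log-likelihood of the observed tokens; the following is a minimal sketch, assuming the model's predicted probability for each observed keyword or item is available:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity over observed tokens (keywords and items): exp(-mean log-likelihood).
    Lower perplexity means the model predicts the observed behavior data better."""
    token_probs = np.asarray(token_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(token_probs))))

# Example: probabilities the current model assigns to a few held-out tokens.
print(perplexity([0.05, 0.10, 0.02, 0.08]))  # ~18.8
```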
In one embodiment, the input of the multi-input latent dirichlet distribution model may be, for example, two data sets, respectively, an item set group generated by item operation information in the user behavior information and a description document group generated by item description information as described above. The probability map model determined from the two data sets may be a two-layer dirichlet distribution structure, and the specific structure may be as shown in fig. 5.
In order to perform subsequent calculation on the probability map model with the double-layer dirichlet distribution structure, the document theme matrix, the article theme matrix and the model theme number are required to be used as calculation parameters of the probability map model with the double-layer dirichlet distribution structure. In the Gibbs sampling calculation, the document theme matrix, the article theme matrix and the model theme quantity when the perplexity meets the condition can be used as parameters of a probability graph model of a double-layer Dirichlet distribution structure.
In the subsequent calculation process, the behavior information of the current user is input into a probability graph model of a double-layer Dirichlet distribution structure with the determined document theme matrix, the determined article theme matrix and the determined number of model themes, so that a document theme vector and an article theme vector corresponding to the current user are obtained.
FIG. 7 is a flow diagram illustrating a user representation generation method in accordance with another illustrative embodiment. The user representation generation method flow 70 illustratively describes the process of inputting the item collection and the description document into a probabilistic graphical model to determine a document subject vector and an item subject vector.
In S702, a key item is determined according to the item set, and a keyword is determined according to the descriptive document.
In one embodiment, determining the key items (one or more, or a predetermined number) and the keywords (one or more, or a predetermined number) from the item set and the description document comprises: determining the keywords from the description document and the document topic matrix Λ according to the topic number k; and determining the key items from the item set and the item topic matrix Φ according to the topic number.
The k keywords may be determined, for example, by the probability of each keyword occurring in the document topic matrix. For example, if k is 50 and there are 100 candidate keywords, the occurrence probabilities of the 100 candidate keywords in the document topic matrix are calculated in turn, the candidates are ranked by these probabilities, and the 50 candidates with the highest probabilities are selected as the keywords used in the calculation process.
The selection process of the k key items is the same as that of the k key words, and the details are not repeated herein.
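A sketch of this selection, assuming the topic matrices are available as numpy arrays (Λ for keywords, Φ for items; the same function serves both):

```python
import numpy as np

def top_k_terms(topic_matrix, topic_id, vocab, k=50):
    """Return the k terms with the highest probability under the given topic.
    topic_matrix is K x |V| for keywords (Lambda) or K x |I| for items (Phi)."""
    row = topic_matrix[topic_id]
    top_idx = np.argsort(row)[::-1][:k]   # indices sorted by descending probability
    return [vocab[i] for i in top_idx]

# Toy example: 2 topics over a 4-keyword vocabulary.
Lambda = np.array([[0.50, 0.30, 0.15, 0.05],
                   [0.05, 0.10, 0.25, 0.60]])
vocab = ["basketball", "shoes", "water", "pen"]
print(top_k_terms(Lambda, topic_id=0, vocab=vocab, k=2))  # ['basketball', 'shoes']
```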
In S704, a probability function is constructed at least from the keywords and the key items. When solving for the document topic matrix Λ and the item topic distribution matrix Φ in the above multi-input latent Dirichlet allocation model, the two groups of hidden variables assumed by the model, namely the topic z of each keyword and the topic c of each item, are iteratively sampled by Gibbs sampling; when the first threshold satisfies the condition during the iterative sampling calculation, the probability graph model is generated from the current multi-input latent Dirichlet allocation model.
In S706, the keywords and their corresponding distribution functions, and the key items and their corresponding distribution functions, are substituted into the probability function.
On the basis of formula (3), the distribution function of each factor is substituted, and the probability formulas for sampling the keyword topics and the item topics of user u are derived as follows, respectively:
p(Z_j = k | ·) ∝ (N_{ku\j} + M_{ku} + α) · (N^{w_j}_{k\j} + β) / (Σ_v N^{v}_{k\j} + |V|·β)   (4)
p(c_i = k | ·) ∝ (N_{ku} + M_{ku} + α) · (M^{t_i}_{k\i} + γ) / (M_{k\i} + |I|·γ)   (5)
description of the parameters:
· Z_j: the topic of the j-th keyword in user u's description document.
· c_i: the topic of the i-th item in the list of items on which user u generated behaviors.
· |V|: the total number of keywords in the keyword dictionary.
· |I|: the total number of items in the item dictionary.
· N_{ku\j}: the number of keywords belonging to the k-th topic in user u's description document, with the j-th keyword removed.
· N^{w_j}_{k\j}: among the keyword positions corresponding to w_j, the count assigned to the k-th topic, with the topic assignment of the current w_j removed.
· N_u: the total number of keywords in user u's description document.
· M_{ku}: the total number of items belonging to the k-th topic in the list of items on which user u generated behaviors.
· M^{t_i}_{k\i}: among the item positions corresponding to t_i, the count assigned to the k-th topic, with the topic assignment of the current t_i removed.
· M_{k\i}: the total number of items belonging to the k-th topic, with the current sample i removed.
In S708, the probability function is solved to determine a document theme vector and an item theme vector.
In one embodiment, solving the probability function to determine the document topic vector and the item topic vector comprises: performing iterative sampling calculation on the document topic vector T_d and the item topic vector T_t in the probability function by Gibbs sampling; and determining the document topic vector and the item topic vector when the iterative sampling calculation converges.
From the sampling process described by formulas (4) and (5), it can be seen that the Gibbs samples of the item topics and the word topics influence each other. The probability that word j in user u's description document belongs to topic k is determined by the number M_{ku} of items belonging to topic k in the user's item list: the more items belong to topic k, the greater the probability that word j belongs to k. Similarly, the probability that item i in the user's item list belongs to topic k is influenced by the number N_{kd} of keywords belonging to topic k in the user's description document: the more words belong to topic k, the greater the probability that item i belongs to k.
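A sketch of one coupled sampling step follows. Formulas (4) and (5) are rendered as images in the original publication; the proportional forms used here follow the count definitions above and are therefore an assumption, so this code should be read as an illustration of the coupling between the two samplers rather than as the exact update of the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_word_topic(N_ku, M_ku, N_kw, N_k, alpha, beta, V):
    """Resample the topic of one keyword occurrence (form assumed for formula (4)).
    N_ku: K-vector of user u's keyword-topic counts (current occurrence removed);
    M_ku: K-vector of user u's item-topic counts;
    N_kw: K-vector of counts of this keyword under each topic (current occurrence removed);
    N_k : K-vector of total keyword counts per topic (current occurrence removed)."""
    p = (N_ku + M_ku + alpha) * (N_kw + beta) / (N_k + V * beta)
    p /= p.sum()
    return rng.choice(len(p), p=p)

def sample_item_topic(N_ku, M_ku, M_kt, M_k, alpha, gamma, I):
    """Resample the topic of one item occurrence (form assumed for formula (5));
    the keyword and item counts swap roles, which is what couples the two samplers."""
    p = (N_ku + M_ku + alpha) * (M_kt + gamma) / (M_k + I * gamma)
    p /= p.sum()
    return rng.choice(len(p), p=p)

# Toy example with K = 3 topics:
N_ku = np.array([4.0, 1.0, 0.0]); M_ku = np.array([2.0, 0.0, 1.0])
N_kw = np.array([3.0, 0.0, 1.0]); N_k  = np.array([40.0, 30.0, 25.0])
print(sample_word_topic(N_ku, M_ku, N_kw, N_k, alpha=0.1, beta=0.01, V=1000))
```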
FIG. 8 is a flowchart illustrating a user portrait generation method according to another exemplary embodiment. The user portrait generation method 80 illustratively describes the use of the multi-input latent Dirichlet allocation model in user portrait construction and in real-time user portrait determination.
In S802, item description information is acquired.
In S804, a user purchase log is acquired.
In S806, user triplet data is generated.
In S808, the BC-LDA model is trained.
In S810, a BC-LDA model is determined.
In S812, the user triplet data is stored in a database.
In S814, a user representation is constructed.
In S816, the user portrait is stored in the user portrait database.
Specifically, assume that a user portrait is generated by BC-LDA in a shopping mall. First, the log of the user's item purchases and the description information of the items are obtained from the background. For each user u, the description information corresponding to the items purchased by the user is spliced and word-segmented to serve as the user's interest description document d_u; the set of items purchased by the user is defined as t_u; and each user is expressed as a triplet (u, d_u, t_u).
And setting the number of the model themes as K, and training through the user triple set to obtain a document theme matrix Lambda and an article theme matrix phi. Each keyword and item is expressed as a K-dimensional topic distribution vector in the Λ matrix and the Φ matrix, respectively.
After the parameters Λ and Φ of the multi-input potential dirichlet distribution model are obtained, the multi-input potential dirichlet distribution model under the parameters is determined to be the probability map model in the embodiment. And inputting the user triple into the model, and obtaining a user document theme distribution vector and a user purchased article set theme vector by utilizing Gibbs sampling so as to finally generate a user interest vector.
In one embodiment, user A purchases a plurality of products through an online store, which may be, for example, sporting goods B, daily goods C, and learning supplies D. An item set document can be generated from the products purchased by user A; the item set document may be {B, C, D}, and a description document can be generated from the description information of B, C, and D.
The generation process of the description document may be, for example: obtaining the online description information of B, C, and D, which may include, for example, B: {XX brand basketball shoes}, C: {one box of YY brand mineral water}, and D: {one box each of blue and black sign pens}. The description information may include, for example, the usage, characteristics, appearance, color, and price of the product.
The description information of B, C, and D is spliced to generate the item description information: {XX brand basketball shoes, one box of YY brand mineral water, one box each of blue and black sign pens}; word segmentation is then performed on the item description information to generate the description document: {XX, basketball shoes, YY, mineral water, one box, blue, black, sign pen, one box}.
The item set {B, C, D} and the generated description document {XX, basketball shoes, YY, mineral water, one box, blue, black, sign pen, one box} are input into the probability graph model described above, whose parameters (such as the determined document topic matrix, the determined item topic matrix, and the determined number of model topics) have already been obtained, and the document topic vector and the item topic vector are calculated by backward inference of the probability graph model. The portrait of user A is then determined from the document topic vector and the item topic vector.
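Putting the example together as a sketch (the topic vectors below are toy values standing in for the result of the Gibbs-based backward inference, since the trained model itself is not reproduced here):

```python
import numpy as np

# User A's triplet (u, d_u, t_u) built from the purchases above.
u = "A"
t_u = ["B", "C", "D"]                                      # item set
d_u = ["XX", "basketball shoes", "YY", "mineral water", "one box",
       "blue", "black", "sign pen", "one box"]             # segmented description document
triplet = (u, d_u, t_u)

# Suppose backward inference against the trained model yields these K = 3 topic vectors (toy values):
T_d = np.array([0.2, 0.5, 0.3])   # document topic vector
T_t = np.array([0.4, 0.4, 0.2])   # item topic vector

eta = 0.6
T_u = eta * T_d + (1 - eta) * T_t  # user interest vector describing user A's portrait
print(T_u)                         # [0.28 0.46 0.26]
```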
According to the user portrait generation method described above, the coverage of users by interest features can be improved, and the user portrait accuracy can be improved.
Using the user portrait generation method of the present disclosure, the installed-app lists of 60,000 users were extracted, containing 8,677 apps, of which 5,003 apps have description information; the number of topics was set to 80. During the calculation over the item operation information and the item description information through the probability graph model, the two kinds of information mutually verify each other, so that topic words can be determined even for applications whose descriptions are missing. The table below lists example words and apps under some of the topics.
(Table of example keywords and apps under several topics; rendered as images in the original publication.)
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments may be implemented as a computer program executed by a CPU. When the computer program is executed by the CPU, it performs the functions defined by the above methods provided by the present disclosure. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic disk, an optical disk, or the like.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
FIG. 9 is a block diagram illustrating a user portrait generation apparatus according to an exemplary embodiment. The user portrait generation apparatus 90 comprises: an information module 902, an item set module 904, a description document module 906, a vector module 908, and a user portrait module 910.
The information module 902 is configured to obtain behavior information of a user, where the behavior information includes article operation information and article description information; the behavior information of the user may include: the user performs a normal use operation in the terminal device, and the use operation may include web browsing, purchasing an application in an application store and downloading the application, completing a predetermined function using the application, and the like. The behavior information of the user may further include time of the behavior, frequency of the behavior, and the like.
The item set module 904 is configured to generate an item set from the item operation information. This may include the following steps: extracting item operation information from the item purchase log; and generating the item set from the purchased items in the item operation information.
The description document module 906 is configured to generate the description document from the item description information. This may include the following steps: acquiring a plurality of pieces of description information corresponding to a plurality of items; splicing the plurality of pieces of description information to generate the item description information; and performing word segmentation processing on the item description information to generate the description document.
The vector module 908 is configured to input the item set and the description document into the probability graph model, so as to calculate the document topic vector and the item topic vector through backward inference of the probability graph model. This may include the following steps: determining keywords (one or more, or a predetermined number) and key items (one or more, or a predetermined number) from the item set and the description document; constructing a probability function from the keywords and the key items based on the probability graph model; and solving the probability function to determine the document topic vector and the item topic vector.
The user portrait module 910 is configured to generate a user portrait of the user from the document topic vector and the item topic vector. A user interest vector is finally generated from the document topic vector T_d and the item topic vector T_t, and the user interest vector can be used to describe the user portrait.
According to the user portrait generation apparatus of the present disclosure, the item set and the description document are input into the probability graph model to determine the document topic vector and the item topic vector, so that the characteristics of the user can be described from two aspects: the information of the items themselves and the behavior information of the user. This improves the coverage of item description information in the user portrait and improves the user portrait precision.
FIG. 10 is a block diagram illustrating a user portrait generation apparatus according to another exemplary embodiment. The user portrait generation apparatus 100 further comprises, in addition to the modules of the user portrait generation apparatus 90, a model training module 1002.
The model training module 1002 is configured to generate the probabilistic graph model through behavior information of a historical user and a multi-input latent dirichlet distribution model.
FIG. 11 is a block diagram illustrating an electronic device in accordance with an example embodiment.
An electronic device 1100 according to this embodiment of the disclosure is described below with reference to fig. 11. The electronic device 1100 shown in fig. 11 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 11, electronic device 1100 is embodied in the form of a general purpose computing device. The components of the electronic device 1100 may include, but are not limited to: at least one processing unit 1110, at least one memory unit 1120, a bus 1130 connecting the various system components including the memory unit 1120 and the processing unit 1110, a display unit 1140, and the like.
The storage unit 1120 stores program code executable by the processing unit 1110, so that the processing unit 1110 performs the steps according to various exemplary embodiments of the present disclosure described in the user portrait generation method section of this specification. For example, the processing unit 1110 may perform the steps shown in fig. 3, 6, 7, and 8.
The storage unit 1120 may include a readable medium in the form of a volatile storage unit, such as a random access memory unit (RAM) 11201 and/or a cache storage unit 11202, and may further include a read-only memory unit (ROM) 11203.
The storage unit 1120 may also include a program/utility 11204 having a set (at least one) of program modules 11205, such program modules 11205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 1130 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
The electronic device 1100 may also communicate with one or more external devices 1100' (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1100, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1100 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 1150. Also, the electronic device 1100 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 1160. The network adapter 1160 may communicate with other modules of the electronic device 1100 via the bus 1130. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, or the like) or on a network, and which includes several instructions for enabling a computing device (which may be a personal computer, a server, or a network device, or the like) to execute the above method according to the embodiments of the present disclosure.
Fig. 12 schematically illustrates a computer-readable storage medium in an exemplary embodiment of the disclosure.
Referring to fig. 12, a program product 1200 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following functions: acquiring behavior information of a user, where the behavior information includes item operation information and item description information; generating an item set from the item operation information; generating a description document from the item description information; inputting the item set and the description document into a probability graph model to determine a document theme vector and an item theme vector; and generating a user representation of the user from the document theme vector and the item theme vector.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be correspondingly located in one or more apparatuses different from those of the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, or the like) or on a network, and which includes several instructions for enabling a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, or the like) to execute the method according to the embodiments of the present disclosure.
Exemplary embodiments of the present disclosure are specifically illustrated and described above. It is to be understood that the present disclosure is not limited to the precise arrangements or instrumentalities described herein; on the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (13)

1. A user representation generation method, comprising:
acquiring behavior information of a user and item description information corresponding to the behavior;
generating an item set through the behavior information;
generating a description document through the item description information;
inputting the item set and the description document into a probability graph model so as to calculate a document theme vector and an item theme vector through the probability graph model; and
generating a user representation of the user from the document theme vector and the item theme vector;
wherein inputting the item set and the description document into the probability graph model so as to calculate the document theme vector and the item theme vector through the probability graph model comprises:
determining key items according to the item set, and determining key words according to the description document;
constructing a probability function at least from the key words and the key items;
substituting the key words and their corresponding distribution functions, and the key items and their corresponding distribution functions, into the probability function; and
solving the probability function to determine the document theme vector and the item theme vector.
2. The method of claim 1, further comprising:
generating the probability graph model from the behavior information of one or more users and the corresponding item description information through a multi-input latent Dirichlet distribution model, wherein the input of the model is a plurality of data sets.
3. The method of claim 2, wherein generating the probability graph model from the behavior information of one or more users and the corresponding item description information through the multi-input latent Dirichlet distribution model comprises:
generating an item set group from the behavior information of the one or more users;
generating a description document group from the one or more pieces of item description information; and
training the multi-input latent Dirichlet distribution model with the item set group and the description document group to generate the probability graph model.
4. The method of claim 3, wherein training the multi-input latent Dirichlet distribution model with the item set group and the description document group to generate the probability graph model comprises:
inputting the item set group and the description document group into the multi-input latent Dirichlet distribution model to obtain a first document theme vector and a first item theme vector;
performing iterative sampling calculation on the first document theme vector and the first item theme vector through Gibbs sampling; and
when the iterative sampling calculation meets a condition, generating the probability graph model from the current multi-input latent Dirichlet distribution model.
5. The method of claim 4, wherein generating the probability graph model from the current multi-input latent Dirichlet distribution model comprises:
generating a document theme matrix corresponding to the first document theme vector through a first-level model structure of the multi-input latent Dirichlet distribution model;
generating an item theme matrix corresponding to the first item theme vector through a second-level model structure of the multi-input latent Dirichlet distribution model; and
generating the probability graph model according to the document theme matrix and the item theme matrix.
6. The method of claim 1, wherein determining the key items according to the item set and determining the key words according to the description document comprises:
determining the key words according to the description document and the document theme matrix; and
determining the key items from the item set and the item theme matrix according to the key words.
7. The method of claim 6, wherein determining the key words according to the description document and the document theme matrix comprises:
extracting the key words according to the probability distribution of each topic in the document theme matrix.
8. The method of claim 6, wherein determining the key items from the item set and the item theme matrix according to the key words comprises:
determining extracted topic terms according to the key words; and
extracting the key items from the item set according to the probability distribution of each item in the item theme matrix under the extracted topic terms.
9. The method of claim 1, wherein solving the probability function to determine the document theme vector and the item theme vector comprises:
performing iterative sampling calculation on the document theme vector and the item theme vector in the probability function through Gibbs sampling; and
determining the document theme vector and the item theme vector when the iterative sampling calculation converges.
10. The method of claim 1, wherein generating the description document through the item description information comprises:
acquiring a plurality of pieces of description information corresponding to a plurality of items;
splicing the pieces of description information to generate the item description information; and
performing word segmentation processing on the item description information to generate the description document.
11. A user representation generation apparatus, comprising:
an information module, configured to acquire behavior information of a user and item description information corresponding to the behavior;
an item set module, configured to generate an item set through item operation information in the behavior information;
a description document module, configured to generate a description document through the item description information;
a vector module, configured to input the item set and the description document into a probability graph model so as to calculate a document theme vector and an item theme vector through the probability graph model; and
a user portrait module, configured to generate a user portrait of the user through the document theme vector and the item theme vector;
wherein the vector module is further configured to: determine key items according to the item set, and determine key words according to the description document; construct a probability function at least from the key words and the key items; substitute the key words and their corresponding distribution functions, and the key items and their corresponding distribution functions, into the probability function; and solve the probability function to determine the document theme vector and the item theme vector.
12. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 10.
13. A computer-readable medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 10.
CN201811099279.8A 2018-09-20 2018-09-20 User portrait generation method and device, electronic equipment and computer readable medium Active CN109359180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811099279.8A CN109359180B (en) 2018-09-20 2018-09-20 User portrait generation method and device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811099279.8A CN109359180B (en) 2018-09-20 2018-09-20 User portrait generation method and device, electronic equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN109359180A CN109359180A (en) 2019-02-19
CN109359180B true CN109359180B (en) 2021-03-02

Family

ID=65351004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811099279.8A Active CN109359180B (en) 2018-09-20 2018-09-20 User portrait generation method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN109359180B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110503487A (en) * 2019-08-28 2019-11-26 上海丙晟科技有限公司 One kind getting through method based on the convergent shopping center data in path
CN111177538B (en) * 2019-12-13 2023-05-05 杭州顺网科技股份有限公司 User interest label construction method based on unsupervised weight calculation
CN111881190B (en) * 2020-08-05 2021-10-08 厦门南讯股份有限公司 Key data mining system based on customer portrait
CN114971744B (en) * 2022-07-07 2022-11-15 北京淇瑀信息科技有限公司 User portrait determination method and device based on sparse matrix

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8666927B2 (en) * 2011-04-19 2014-03-04 Yahoo! Inc. System and method for mining tags using social endorsement networks
CN106919997B (en) * 2015-12-28 2020-12-22 航天信息股份有限公司 LDA-based user consumption prediction method for electronic commerce
CN105893609B (en) * 2016-04-26 2019-09-24 南通大学 A kind of mobile APP recommended method based on weighted blend
CN106055617A (en) * 2016-05-26 2016-10-26 乐视控股(北京)有限公司 Data pushing method and device
CN106940705B (en) * 2016-12-20 2021-01-22 上海掌门科技有限公司 Method and equipment for constructing user portrait
CN107578292B (en) * 2017-09-19 2020-10-16 上海财经大学 User portrait construction system
CN108062375B (en) * 2017-12-12 2021-12-10 百度在线网络技术(北京)有限公司 User portrait processing method and device, terminal and storage medium

Also Published As

Publication number Publication date
CN109359180A (en) 2019-02-19

Similar Documents

Publication Publication Date Title
Arras et al. Explaining and interpreting LSTMs
Zhang et al. Daily-aware personalized recommendation based on feature-level time series analysis
CN109359180B (en) User portrait generation method and device, electronic equipment and computer readable medium
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
US10579940B2 (en) Joint embedding of corpus pairs for domain mapping
TW201822098A (en) Computer device and method for predicting market demand of commodities
CN110162594B (en) Viewpoint generation method and device for text data and electronic equipment
Burdisso et al. τ-SS3: A text classifier with dynamic n-grams for early risk detection over text streams
KR20210023452A (en) Apparatus and method for review analysis per attribute
CN112069320B (en) Span-based fine-grained sentiment analysis method
Demchuk et al. Commercial Content Distribution System Based on Neural Network and Machine Learning.
CN114077661A (en) Information processing apparatus, information processing method, and computer readable medium
CN114969520A (en) Commodity recommendation method, system and equipment based on label information and commodity attributes
CN111461757B (en) Information processing method and device, computer storage medium and electronic equipment
Asadabadi et al. Enhancing the analysis of online product reviews to support product improvement: integrating text mining with quality function deployment
CN111241273A (en) Text data classification method and device, electronic equipment and computer readable medium
CN113032676B (en) Recommendation method and system based on micro-feedback
CN114240552A (en) Product recommendation method, device, equipment and medium based on deep clustering algorithm
Liao et al. Topic-based integrator matching for pull request
Sharma et al. Demographic profile building for cold start in recommender system: A social media fusion approach
Manalu et al. Deep learning performance in sentiment analysis
CN115329207B (en) Intelligent sales information recommendation method and system
Atzberger et al. Large-Scale Evaluation of Topic Models and Dimensionality Reduction Methods for 2D Text Spatialization
Hamad et al. Sentiment analysis of restaurant reviews in social media using naïve bayes
Akkarapatty et al. Dimensionality reduction techniques for text mining

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant