CN111858702A - User behavior data acquisition and weighting method for dynamic portrait

Info

Publication number: CN111858702A
Application number: CN202010597643.4A
Authority: CN
Inventors: 朱欣娟; 赵璟博; 罗云川; 吴哲; 高岭
Original assignee: National Public Cultural Development Center Of Ministry Of Culture And Tourism; Xian Polytechnic University
Current assignee: National Public Cultural Development Center Of Ministry Of Culture And Tourism; Xian Polytechnic University
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2020-10-30
Anticipated expiration: 2040-06-28
Also published as: CN111858702B

Abstract

The invention discloses a user behavior data acquisition and weighting method for dynamic portraits, implemented in the following steps: 1. dividing users into autonomous-publishing users and non-autonomous-publishing users, dividing time into time slices of length T, and acquiring user behavior data from four time slices (the current slice and the three preceding it), N items in total; 2. assigning different weight coefficients to the content of different user behaviors, normalizing the weight coefficients of the N content items of the same user, and assigning the normalized weight coefficients to the content items; 3. classifying the N items obtained in step 1 to obtain N 42-dimensional label vectors, combining the weight coefficient of each content item with its label vector, and selecting the top three labels as the user's interests. The method can improve the accuracy of real-time user interest prediction.

Description

User behavior data acquisition and weighting method for dynamic portrait

Technical Field

The invention belongs to the technical field of big-data user behavior analysis and mining, and relates to a user behavior data acquisition and weighting method for dynamic portraits.

Background

In the era of the mobile internet, refined operation has gradually become an important competitive strength for enterprises, and the concept of the "user portrait" has emerged accordingly. Building a user portrait is the process of cleaning, clustering, analyzing and mining the massive behavior data that users generate in the big-data era, abstracting that behavior data into labels, and using the labels to give the user a concrete image. Establishing user portraits helps enterprises provide better-targeted services to their users.

In the Web 2.0 era, content on the network is produced mainly by users, and every user can generate content. Websites such as CSDN and Wikipedia popularize knowledge and answer questions for internet users. Users generate many kinds of behavior data on these websites every day, and their interest preferences can be predicted by analyzing that data. CSDN is a forum dedicated to popularizing knowledge in the computer field, and its users generate a large amount of varied behavior data each day: for example, a user publishes a blog, forwards a blog, collects (bookmarks) a blog, likes a blog, browses a blog, or follows another user. These behavior data reflect the user's different interests, and how to build a dynamic portrait of the user from them has become a recent research focus in the computer field.

Several user-profiling methods exist, but they currently share two problems. First, user behaviors include publishing, forwarding, collecting, browsing, liking and following, and user behavior content falls into different types according to these behaviors; existing methods do not highlight the different contributions that different types of behavior by different types of users make to the user portrait. Second, each user has individual characteristics: behavior data arise with different periods and frequencies, so the amount of behavior and content data a user generates differs between time periods. Analyzing all of a user's historical data makes the data volume too large, reduces analysis efficiency, and fails to reflect real-time changes in the user's attention. Making portrait technology exploit the periodic characteristics of a user's personalized behavior content data while reflecting dynamically changing interests is a challenge for current research.

Disclosure of Invention

The invention aims to provide a user behavior data acquisition and weighting method for a dynamic portrait, which solves the problem that different contributions of different types of behaviors of different types of users to the user portrait are not highlighted in the prior art.

The invention adopts the technical scheme that a user behavior data acquisition and weighting method for dynamic portraits is implemented according to the following steps:

step 1, dividing users into autonomous-publishing users and non-autonomous-publishing users, dividing time into time slices of length T, and acquiring the user behavior data of the current time slice and the three time slices preceding it, N items in total;

step 2, assigning different weight coefficients to the content of different user behaviors, normalizing the weight coefficients of the N content items of the same user, and assigning the normalized weight coefficients to the content items;

step 3, classifying the N items obtained in step 1 to obtain N 42-dimensional label vectors, obtaining the weighted label vector of each content item from its weight coefficient and its label vector, summing the N weighted label vectors to obtain a single label vector, and selecting the top three labels as the user's interests.

The invention is also characterized in that:

In step 1, the user behaviors of an autonomous-publishing user include publishing, forwarding, collecting, browsing, liking and following; the user behaviors of a non-autonomous-publishing user include forwarding, collecting, browsing, liking and following.

The step 1 of collecting user behavior data is implemented according to the following steps:

step 1.1, obtaining a personalized time decay function from the Ebbinghaus memory curve, the function determining the weight coefficient of the user behavior data acquired in a given time slice;

step 1.2, acquiring data in fixed proportions for the different user behaviors of autonomous-publishing users and non-autonomous-publishing users, N items in total across the current time slice and the three time slices preceding it;

step 1.3, calculating the quantity of data to be acquired for each user behavior in each time slice according to the weight coefficient formula of step 1.1.

Step 1.1 is specifically carried out according to the following steps:

step 1.1.1, fitting the Ebbinghaus memory curve with a power function, the fitted function being formula (1):

$$L_t = 32.03\,(t_c - t_0)^{-0.1236} \qquad (1)$$

where $L_t$ is the percentage of memory retained, $t_0$ is the time at which the user memorized the content, $t_c$ is the time at which retention is measured, and time $t$ is measured in days;

step 1.1.2, adjusting formula (1) to obtain the personalized time decay function of formula (2):

$$L_i = 32.03\,\left[\left(i - \tfrac{1}{2}\right)k\right]^{-0.1236}, \quad i = 1, 2, 3, 4 \qquad (2)$$

The current time period is defined as the 1st time slice, and the 2nd, 3rd and 4th time slices follow in order going back in time. $L_i$ is the weight coefficient of the user behavior data acquired in the i-th time slice, $k = T'/5$, the current time point is taken as time 0, and $T'$ is the time span, nearest to the current time point, over which the user consecutively produced 5 publishing, forwarding or collecting behaviors (in any combination).
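The decay function of formula (2) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation. The helper `recent_activity_span_days` encodes one assumed reading of T' (the number of days from the current time back to the fifth most recent publish, forward or collect event), and all function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

# Constants of the fitted power function in formulas (1) and (2).
SCALE = 32.03
EXPONENT = -0.1236


def recent_activity_span_days(event_times, now, needed=5):
    """Assumed reading of T': days from `now` back to the 5th most recent
    publish/forward/collect event. Returns None if there are too few events."""
    past = sorted((t for t in event_times if t <= now), reverse=True)
    if len(past) < needed:
        return None
    return (now - past[needed - 1]).total_seconds() / 86400.0


def decay_weights(t_prime_days, slices=4):
    """Formula (2): L_i = 32.03 * ((i - 1/2) * k) ** -0.1236, with k = T'/5."""
    if t_prime_days <= 0:
        raise ValueError("T' must be positive")
    k = t_prime_days / 5.0
    return [SCALE * ((i - 0.5) * k) ** EXPONENT for i in range(1, slices + 1)]
```

For a user whose five most recent qualifying events span 10 days, `decay_weights(10.0)` returns four coefficients that decrease from the current slice backward.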

In step 1.2, for an autonomous-publishing user, N/2 publishing behaviors, N/6 forwarding behaviors, N/6 collecting behaviors, N/12 browsing behaviors and N/12 liking behaviors are acquired across the current time slice and the three time slices preceding it; for a non-autonomous-publishing user, N/3 forwarding behaviors, N/3 collecting behaviors, N/6 browsing behaviors and N/6 liking behaviors are acquired across the four time slices.

When fewer than N/2 publishing behavior items are available, half of the shortfall is acquired from the forwarding behavior and half from the collecting behavior to make up the publishing behavior data.
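The proportions and the shortfall rule above can be expressed as a small Python sketch. The function name, the dictionary keys and the `published_available` parameter are hypothetical; the fractions themselves are the ones stated in the text. Exact fractions are used so that the shares are not distorted by floating-point rounding.

```python
from fractions import Fraction

# Shares of the N sampled items per behavior, as stated in the text.
# "collect" denotes the bookmark/favorite behavior.
AUTONOMOUS_SHARES = {
    "publish": Fraction(1, 2),
    "forward": Fraction(1, 6),
    "collect": Fraction(1, 6),
    "browse": Fraction(1, 12),
    "like": Fraction(1, 12),
}
NON_AUTONOMOUS_SHARES = {
    "forward": Fraction(1, 3),
    "collect": Fraction(1, 3),
    "browse": Fraction(1, 6),
    "like": Fraction(1, 6),
}


def behavior_quotas(n, autonomous, published_available=None):
    """Return the number of items to sample per behavior for one user."""
    shares = AUTONOMOUS_SHARES if autonomous else NON_AUTONOMOUS_SHARES
    quotas = {name: share * n for name, share in shares.items()}
    if autonomous and published_available is not None:
        shortfall = quotas["publish"] - published_available
        if shortfall > 0:
            # Half of the missing publish items come from forwards,
            # the other half from collections.
            quotas["publish"] = Fraction(published_available)
            quotas["forward"] += shortfall / 2
            quotas["collect"] += shortfall / 2
    return quotas
```

With N = 120 and only 40 published items available, the publish quota drops to 40 while the forward and collect quotas each rise from 20 to 30, so the total stays at 120.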

Step 2 specifically assigns a weight coefficient of 5 to published content, 2.5 to forwarded content, 2.5 to collected content, 0.5 to browsed content, and between 0.5 and 2.5 to liked content.

The weight coefficient of liked content is assigned as follows: when the liked content is content that an account the user follows has browsed, published or liked, the weight coefficient is 0.5; when the liked content is content that an account the user follows has forwarded or collected, the weight coefficient is 0.7; and when the liked content has no association with any account the user follows, the weight coefficient is 2.5.
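The weight table can be written down directly as a lookup. The sketch below is illustrative only: the numeric values come from the text, while the dictionary keys, the `like_relation` parameter and the function name are hypothetical labels introduced here.

```python
# Fixed weights for behaviors whose coefficient does not depend on context.
BASE_WEIGHTS = {"publish": 5.0, "forward": 2.5, "collect": 2.5, "browse": 0.5}

# Weight of a like, keyed by how the liked content relates to the
# accounts the user follows.
LIKE_WEIGHTS = {
    "followed_browsed_published_or_liked": 0.5,
    "followed_forwarded_or_collected": 0.7,
    "unrelated_to_followed": 2.5,
}


def behavior_weight(behavior, like_relation=None):
    """Return the weight coefficient for one behavior on one content item."""
    if behavior == "like":
        if like_relation not in LIKE_WEIGHTS:
            raise ValueError("a like needs one of: " + ", ".join(LIKE_WEIGHTS))
        return LIKE_WEIGHTS[like_relation]
    return BASE_WEIGHTS[behavior]
```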

Step 3 specifically feeds the N items into a Bi-LSTM + Attention model to obtain N 42-dimensional label vectors. Each dimension of a label vector holds a probability value, and the 42 probability values sum to 1; the probability value of each dimension is the proportion of the corresponding interest field for the user. The weight coefficient of each content item is multiplied by the label vector of that item to obtain its weighted label vector, giving N weighted label vectors in total. The N weighted label vectors are summed to obtain a single label vector, and the 3 labels with the highest probability values are selected as the user's interests.
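The aggregation part of step 3 can be sketched as follows, assuming the label vectors have already been produced by a classifier (the Bi-LSTM + Attention model itself is not reproduced here). The function name and the `labels` list are hypothetical; in the method described, `labels` would hold the 42 interest-field names.

```python
def top_interests(label_vectors, weights, labels, top_k=3):
    """Weight each label vector, sum them, and return the top_k labels.

    label_vectors: one probability vector per content item.
    weights: one normalized weight coefficient per content item.
    labels: names of the interest fields, one per vector dimension.
    """
    if len(label_vectors) != len(weights):
        raise ValueError("need exactly one weight per label vector")
    dims = len(labels)
    combined = [0.0] * dims
    for vector, weight in zip(label_vectors, weights):
        if len(vector) != dims:
            raise ValueError("label vector has the wrong dimension")
        for j, probability in enumerate(vector):
            combined[j] += weight * probability
    ranked = sorted(range(dims), key=lambda j: combined[j], reverse=True)
    return [labels[j] for j in ranked[:top_k]], combined
```

Because each label vector sums to 1, the combined vector also sums to 1 whenever the weights do, so its entries can still be read as interest proportions.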

The beneficial effects of the invention are as follows. Addressing the need to analyze and process big-data user behavior data, the invention provides a method for weighting and dynamically acquiring the different behavior data of different types of users. To a certain extent, it solves the problem of predicting user interests in real time when different types of data influence the portrait result differently and the data volume is massive.

Drawings

FIG. 1 is a flow chart of the user behavior data acquisition and weighting method for dynamic portraits of the present invention;

FIG. 2 is the Ebbinghaus memory curve used in the user behavior data acquisition and weighting method for dynamic portraits of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The invention discloses a user behavior data acquisition and weighting method for dynamic portraits. As shown in FIG. 1, the method is described taking the behavior of blog users as an example; it is equally applicable to the behavior of users of CSDN, Wikipedia, short-video platforms and the like. It is implemented in the following steps:

Step 1: users are divided into users who autonomously publish blogs and users who do not. The behaviors of users who do not autonomously publish blogs include browsing, forwarding, collecting and liking blogs and following other users; the behaviors of users who autonomously publish blogs include publishing, forwarding, collecting, liking and browsing blogs and following other users. Different types of behavior data are acquired for different types of users, and different weight coefficients are assigned according to how much each type of data contributes to the portrait result, which lays a foundation for accurately describing users' interest characteristics. In addition, the amount of user content behavior data sampled is reduced and the sampling procedure is optimized, so that users' dynamic interest characteristics are reflected in real time and the efficiency of portrait analysis and mining is improved.

Time is divided into time slices of length T (T is 30 to 90 days). The user behavior data of the current time slice and the three preceding time slices are dynamically acquired, then analyzed and mined to predict the user's interests in the current time slice. Because users generate a large amount of behavior data on the Internet every day, building a portrait from all of a user's behavior data cannot meet real-time requirements. The dynamic sampling technique of the invention reduces the amount of user content behavior data sampled, optimizes the sampling procedure, and reflects the user's dynamic interest characteristics accurately and in real time.

Dynamic sampling is implemented as follows. First, time is divided into time slices of length T, and the user behavior data of the current time slice and the 3 preceding time slices are dynamically acquired, N blogs in total across the 4 time slices.

Because the goal is to predict the user's interests in the current time period, the number of blogs acquired in the four time periods follows this rule: the most blogs are acquired in the current time period, and the number decreases with each step further back in time. N blogs are acquired across the four time periods in total. A personalized time decay function is derived from the Ebbinghaus memory curve, and this function determines the specific number of blogs acquired in each time period.
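Assigning each behavior record to one of the four time slices might look like the sketch below. It assumes that slice i covers records whose age lies in the interval from (i - 1)·T up to, but not including, i·T days before the current time; the text does not fix the boundary convention, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def slice_index(event_time, now, slice_days, slices=4):
    """Return 1 for the current slice, 2..4 for earlier slices, else None."""
    if event_time > now:
        return None
    elapsed_days = (now - event_time).total_seconds() / 86400.0
    index = int(elapsed_days // slice_days) + 1
    return index if index <= slices else None


def group_by_slice(events, now, slice_days, slices=4):
    """Group (event_time, item) pairs by time slice, dropping older records."""
    groups = {i: [] for i in range(1, slices + 1)}
    for event_time, item in events:
        i = slice_index(event_time, now, slice_days, slices)
        if i is not None:
            groups[i].append(item)
    return groups
```

With T = 30 days, a record that is 5 days old falls into slice 1 and a record that is 120 days old is discarded, which is how the method bounds the amount of history it has to analyze.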

The Ebbinghaus forgetting curve is shown in FIG. 2. The curve is fitted with a power function, as shown in formula (1):

$$L_t = 32.03\,(t_c - t_0)^{-0.1236} \qquad (1)$$

where $L_t$ is the percentage of memory retained, $t_0$ is the time at which the user memorized the content, $t_c$ is the time at which retention is measured, and time $t$ is measured in days.

The method uses time slices of length T, but according to the Ebbinghaus memory curve the retained memory changes little after 10 days. The weight coefficients for the number of blogs acquired in the four time slices, if obtained directly from formula (1), would therefore differ very little and could not satisfy the acquisition rule (the most blogs in the current time period, decreasing with each step further back in time). Formula (1) is therefore adjusted to give the personalized time decay function of formula (2):

$$L_i = 32.03\,\left[\left(i - \tfrac{1}{2}\right)k\right]^{-0.1236}, \quad i = 1, 2, 3, 4 \qquad (2)$$

The current time period is defined as the 1st time period, and the 2nd, 3rd and 4th time periods follow in order going back in time. $L_i$ is the weight coefficient for the number of blogs acquired in the i-th time period, $k = T'/5$, the current time point is taken as time 0, and $T'$ is the time span, nearest to the current time point, over which the user consecutively produced 5 behaviors of publishing, forwarding or collecting blogs.
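As a worked illustration of formula (2), assume a hypothetical user whose five most recent publishing, forwarding or collecting behaviors span T' = 10 days, so that k = 2. The values below are rounded to two decimals:

```latex
\begin{aligned}
k   &= T'/5 = 10/5 = 2 \\
L_1 &= 32.03\,\bigl[(1 - \tfrac{1}{2}) \cdot 2\bigr]^{-0.1236} = 32.03 \cdot 1^{-0.1236} = 32.03 \\
L_2 &= 32.03\,\bigl[(2 - \tfrac{1}{2}) \cdot 2\bigr]^{-0.1236} = 32.03 \cdot 3^{-0.1236} \approx 27.96 \\
L_3 &= 32.03\,\bigl[(3 - \tfrac{1}{2}) \cdot 2\bigr]^{-0.1236} = 32.03 \cdot 5^{-0.1236} \approx 26.25 \\
L_4 &= 32.03\,\bigl[(4 - \tfrac{1}{2}) \cdot 2\bigr]^{-0.1236} = 32.03 \cdot 7^{-0.1236} \approx 25.18
\end{aligned}
```

The coefficients decrease from the current time period backward, as the acquisition rule requires.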

Since users are divided into those who autonomously publish blogs and those who do not, the dynamic blog acquisition algorithm is designed for the two cases separately.

① For a user who autonomously publishes blogs, there are five types of blogs, and the number acquired differs by behavior type: N/2 published blogs, N/6 forwarded blogs, N/6 collected blogs, N/12 browsed blogs and N/12 liked blogs, N blogs in total across the four time periods.

The weight coefficient $L_i$ for the number of blogs acquired in each time period is calculated according to formula (2). The number of blogs acquired in each time period is $L_i N$, made up of $L_i(N/2)$ published blogs, $L_i(N/6)$ forwarded blogs, $L_i(N/6)$ collected blogs, $L_i(N/12)$ browsed blogs and $L_i(N/12)$ liked blogs. Because of how users behave, published blogs are typically fewer than forwarded, collected, liked or browsed blogs. If the user has published fewer than N/2 blogs, the shortfall is made up from the user's forwarded and collected blogs, each supplying half of the missing quantity.

② For a user who does not publish blogs, there are 4 types of blogs: N/3 forwarded blogs, N/3 collected blogs, N/6 browsed blogs and N/6 liked blogs are acquired, N blogs in total across the four time periods.

The weight coefficient $L_i$ for the number of blogs acquired in each time period is calculated according to formula (2). The number of blogs acquired in each time period is $L_i N$, made up of $L_i(N/3)$ forwarded blogs, $L_i(N/3)$ collected blogs, $L_i(N/6)$ browsed blogs and $L_i(N/6)$ liked blogs.

If a calculated number of blogs is not an integer, it is rounded to an integer.
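Turning the coefficients into per-period counts can be sketched as below. The text does not say how the $L_i$ are scaled before multiplying by the quota, so this sketch assumes they are first divided by their sum so that the four counts add up to roughly the quota; round-half-up is likewise an assumed reading of the rounding rule. Both function names are hypothetical.

```python
import math


def round_half_up(x):
    """Round a non-negative number to the nearest integer, halves upward."""
    return int(math.floor(x + 0.5))


def per_slice_counts(quota, slice_weights):
    """Split one behavior's quota across time slices in proportion to L_i.

    Assumption: the weights are rescaled to sum to 1 before use.
    """
    total = sum(slice_weights)
    if total <= 0:
        raise ValueError("slice weights must sum to a positive value")
    return [round_half_up(quota * w / total) for w in slice_weights]
```

Two caveats follow from these assumptions. Rounding can leave the counts one or two items above or below the quota. Also, because formula (2) is a pure power of k, dividing by the sum cancels k, so the shares are the same for every T'; reading $L_i$ directly as a percentage ($L_i/100$) keeps the dependence on k, but the counts then no longer sum exactly to N.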

Step 2: the user's behavior data include publishing blogs, forwarding blogs, collecting blogs, liking blogs, browsing blogs and following other users. Different weight coefficients are assigned to different user behaviors, mainly to highlight the different influence that different types of behavior data have on the portrait result, so that the result is more accurate.

First, for a single blog: if the user published the blog, the weight coefficient is 5. For a blog the user liked, three cases are distinguished: if the liked blog was browsed, published or liked by an account the user follows, the weight coefficient is 0.5; if the liked blog was forwarded or collected by an account the user follows, the weight coefficient is 0.7; and if the liked blog has no association with any account the user follows, the weight coefficient is 2.5. If the user forwarded the blog, the weight coefficient is 2.5; if the user collected the blog, the weight coefficient is 2.5; if the user browsed the blog, the weight coefficient is 0.5. The blog's weight coefficient is then calculated: several of the user's behaviors may apply to the same blog, and the weight coefficient of each behavior that occurs is accumulated into a total weight coefficient. For example, if a user browses, forwards and collects a blog, its weight coefficient is 0.5 (browsing) + 2.5 (forwarding) + 2.5 (collecting) = 5.5. Finally, the weight coefficients of the N blogs are normalized, and the normalized weight coefficient is assigned to each blog.
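The accumulation and normalization can be sketched as follows. The text does not specify the normalization formula, so dividing each blog's total by the sum over all N blogs is an assumption here, chosen so that the normalized weights sum to 1. The function name is hypothetical.

```python
def normalized_blog_weights(per_blog_behavior_weights):
    """Sum the behavior weights of each blog, then normalize across blogs.

    per_blog_behavior_weights: one list per blog holding the weight of every
    behavior the user performed on that blog.
    Assumption: normalization divides each total by the sum of all totals.
    """
    totals = [sum(weights) for weights in per_blog_behavior_weights]
    grand_total = sum(totals)
    if grand_total <= 0:
        raise ValueError("at least one blog must have a positive weight")
    return totals, [t / grand_total for t in totals]
```

For three blogs where the first was browsed, forwarded and collected (0.5 + 2.5 + 2.5), the second was published (5) and the third was only browsed (0.5), the totals are 5.5, 5.0 and 0.5, and the first blog receives half of the normalized weight.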

Step 3: the blogs acquired in step 1 are processed to obtain three of the user's interests. The task is to derive the user's three interests by analyzing the blogs related to the user. There are 42 interest fields in total, so there are 42 interest labels. Each blog is classified to obtain a 42-dimensional label vector; each dimension of the vector holds a probability value, and the 42 probability values sum to 1. The probability value of each dimension is the proportion of the corresponding interest field for the user. The procedure is as follows. First, N blogs are acquired by the method of step 1. Next, the weight coefficient of each blog is calculated by the method of step 2, and the weight coefficients of the N blogs are normalized for later use. Finally, the N blogs are fed into a Bi-LSTM + Attention model to obtain N 42-dimensional label vectors. The weight coefficient of each blog is multiplied by the label vector of that blog to obtain its weighted label vector, giving N weighted label vectors in total. The N weighted label vectors are summed to obtain a single label vector, and the 3 labels with the highest probability values are selected as the user's interests.
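The aggregation in step 3 can be written compactly as follows. The symbols $\tilde{w}_n$ (normalized weight of blog n), $\mathbf{p}_n$ (its label vector) and $\mathbf{v}$ (the combined vector) are notation introduced here for illustration, and the last line assumes that normalization makes the weights sum to 1:

```latex
\begin{aligned}
\mathbf{v} &= \sum_{n=1}^{N} \tilde{w}_n\,\mathbf{p}_n,
  \qquad \mathbf{p}_n \in \mathbb{R}^{42},\quad \sum_{j=1}^{42} p_{n,j} = 1 \\
\sum_{j=1}^{42} v_j &= \sum_{n=1}^{N} \tilde{w}_n \sum_{j=1}^{42} p_{n,j}
  = \sum_{n=1}^{N} \tilde{w}_n = 1
\end{aligned}
```

Under that assumption the combined vector is itself a probability distribution over the 42 interest fields, and the three largest components $v_j$ give the user's interests.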

The invention provides a user behavior data acquisition and weighting method for dynamic portraits. On the one hand, it makes full use of the user's various behavior data and assigns different weights to the content of different behaviors. On the other hand, it acquires and extracts the data to be analyzed based on a personalized Ebbinghaus memory curve for each user, which improves analysis efficiency and lays a foundation for generating the labels of a dynamic user portrait.

Claims

1. A user behavior data acquisition and weighting method for a dynamic portrait is characterized by comprising the following steps:

step 3, classifying the N items obtained in step 1 to obtain N 42-dimensional label vectors, obtaining the weighted label vector of each content item from its weight coefficient and its label vector, summing the N weighted label vectors to obtain a single label vector, and selecting the top three labels as the user's interests.

2. The method as claimed in claim 1, wherein in step 1 the user behaviors of an autonomous-publishing user include publishing, forwarding, collecting, browsing, liking and following; the user behaviors of a non-autonomous-publishing user include forwarding, collecting, browsing, liking and following.

3. A method for collecting and weighting user behavior data of a dynamic representation as claimed in claim 2, wherein the step 1 of collecting user behavior data is implemented by the steps of:

step 1.1, obtaining a personalized time decay function from the Ebbinghaus memory curve, the function determining the weight coefficient of the user behavior data acquired in a given time slice;

4. A method as claimed in claim 3, wherein said step 1.1 is implemented by the following steps:

$$L_t = 32.03\,(t_c - t_0)^{-0.1236} \qquad (1)$$

$$L_i = 32.03\,\left[\left(i - \tfrac{1}{2}\right)k\right]^{-0.1236}, \quad i = 1, 2, 3, 4 \qquad (2)$$

The current time period is defined as the 1st time slice, and the 2nd, 3rd and 4th time slices follow in order going back in time. $L_i$ is the weight coefficient of the user behavior data acquired in the i-th time slice, $k = T'/5$, the current time point is taken as time 0, and $T'$ is the time span, nearest to the current time point, over which the user consecutively produced 5 publishing, forwarding or collecting behaviors (in any combination).

5. The method as claimed in claim 3, wherein in step 1.2, for an autonomous-publishing user, N/2 publishing behaviors, N/6 forwarding behaviors, N/6 collecting behaviors, N/12 browsing behaviors and N/12 liking behaviors are acquired across the current time slice and the three time slices preceding it; and for a non-autonomous-publishing user, N/3 forwarding behaviors, N/3 collecting behaviors, N/6 browsing behaviors and N/6 liking behaviors are acquired across the four time slices.

6. The method as claimed in claim 5, wherein when fewer than N/2 publishing behavior items are available, half of the shortfall is acquired from the forwarding behavior and half from the collecting behavior to make up the publishing behavior data.

7. The method as claimed in claim 1, wherein step 2 specifically assigns a weight coefficient of 5 to published content, 2.5 to forwarded content, 2.5 to collected content, 0.5 to browsed content, and between 0.5 and 2.5 to liked content.

8. The method as claimed in claim 7, wherein the weight coefficient of liked content is assigned as follows: when the liked content is content that an account the user follows has browsed, published or liked, the weight coefficient is 0.5; when the liked content is content that an account the user follows has forwarded or collected, the weight coefficient is 0.7; and when the liked content has no association with any account the user follows, the weight coefficient is 2.5.

9. The method as claimed in claim 1, wherein step 3 specifically feeds the N items into a Bi-LSTM + Attention model to obtain N 42-dimensional label vectors, each dimension of a label vector holding a probability value, the 42 probability values summing to 1, and the probability value of each dimension being the proportion of the corresponding interest field for the user; the weight coefficient of each content item is multiplied by the label vector of that item to obtain its weighted label vector, giving N weighted label vectors in total; the N weighted label vectors are summed to obtain a single label vector; and the 3 labels with the highest probability values are selected as the user's interests.