CN113946569A - User portrait construction method

Info

Publication number: CN113946569A
Application number: CN202110987465.0A
Authority: CN
Inventors: 陈凡
Original assignee: Wuhan Krypton Cell Network Technology Co ltd
Current assignee: Wuhan Krypton Cell Network Technology Co ltd
Priority date: 2021-08-26
Filing date: 2021-08-26
Publication date: 2022-01-18

Abstract

The invention discloses a user portrait construction method comprising the following steps: acquiring a large amount of user behavior data; establishing a fact label library from the collected behavior data; training a label model on a plurality of fact label libraries using logistic regression; matching the similarity between the user and the label model library by behavior weight to construct a user portrait; and continuously correcting and adjusting the user portrait using a time decay factor. The beneficial effects of the invention are as follows: a mathematical model based on Newton's law of cooling captures the way the correlation between a user's historical behavior and current behavior weakens as time passes, and once a time decay function has been established, the user's label attributes are corrected continuously; and the supervised learning method is based on logistic regression, which classifies by likelihood, so that the maximum correlation of the data can be obtained.

Description

User portrait construction method

Technical Field

The invention relates to the technical field of software management, in particular to a user portrait construction method.

Background

At present, the data industry is growing explosively. By collecting data on dimensions such as social attributes, consumption habits and preference characteristics, the characteristic attributes of a user or product are described; feature analysis and statistics are then carried out on these characteristics to mine potentially valuable information, so that an information overview of the user is abstracted. This overview can be regarded as the foundation of enterprise big data applications and is a precondition for targeted advertisement delivery and personalized recommendation.

1. For example, a Chinese patent discloses a user portrait construction method, a device, an electronic device and a readable storage medium (application number: CN 201911291414.3). The method first obtains a preset application scene corresponding to the user portrait to be constructed and generates at least one dimension label according to that scene, where a dimension label indicates the user information required by a given application scene. User information corresponding to each dimension label is then acquired through a plurality of preset information acquisition channels. Finally, the user portrait is constructed from the user information. By acquiring user information comprehensively through multiple preset channels and portraying the user accordingly, the method improves the accuracy of the user portrait.

2. For example, another Chinese patent discloses a portrait construction method based on data self-learning (application number: CN202110476312.X). The method defines an algorithm and then publishes and authorizes the corresponding entity algorithm permissions; defines labels for an entity and binds each label to its algorithm; groups a plurality of labels under an entity and specifies the label list combination for each group; binds the entity to data sets and specifies the association conditions among the data sets; and constructs an entity portrait task. This way of constructing a portrait expresses the relationship between entity and portrait more intuitively, controls label generation and portrait construction more finely, and allows the algorithm implementation to be adjusted more flexibly through dynamic adjustment of threshold parameters and input parameters, so that the algorithm can be reused. In addition, the accuracy of the labels can be fed back dynamically through secondary correlation analysis of groups and labels, which provides a basis for adjusting algorithm parameters.

The prior art makes more or less use of labeled user information for iteration and correction, but it still suffers from the following problems:

In the disclosed technology, user information is collected on a large scale to build a label library, and users are grouped to construct user portraits, but the handling of hot and cold labels is omitted. The strength of some user labels rises or falls as the user's preferences and maturity change, so user labels should be learned continuously in order to be corrected automatically.

In the disclosed prior art, the weighted classification algorithm is not adjusted in a timely manner. In such weighting, the importance of a word is directly proportional to the number of times it appears in an article and inversely proportional to the number of times it appears in the whole document set. The relation between labels and users can, to a certain extent, reflect the relation between the labels themselves. The present invention classifies based on the weights of a correlation coefficient matrix, which greatly improves the direct correlation between labels; the larger the number of users and labels, the more evident the correlation between each pair of labels becomes.

Therefore, it is necessary to provide a user profile construction method for the above problems.

Disclosure of Invention

In view of the above-mentioned shortcomings in the prior art, the present invention provides a user portrait construction method to solve the above-mentioned problems.

A user portrait construction method comprises the following steps:

S1, acquiring user behavior data;

S2, establishing a fact label library according to the collected behavior data;

S3, training a label model through the fact label library by using logistic regression;

S4, matching the similarity of the user and the label model library through behavior weight to construct a user portrait;

S5, continuously correcting and adjusting the user portrait by using the time decay factor.

In step S1, user behavior buried points (event tracking points) are embedded in the software in advance, and the behavior granularity is subdivided according to the number of times and the duration recorded at each buried point.

The step of acquiring the user behavior data in step S1 includes:

(1) collecting the operation habits and behavior paths of mobile phone users through client-side buried points;

(2) sending the buried point data to the cloud (cloud server) without the user perceiving it;

(3) the cloud server receiving the buried point data and persisting it in an analytical database (such as ClickHouse).

Wherein, in step S2, the fact label library is built from the data collected in step S1.

Wherein the step of establishing the fact label library comprises the following steps:

(1) building a label library (hereinafter the dw library) from the persisted buried point data;

(2) then cleaning the buried point data (removing misoperation data, meaningless data and violation data);

(3) selecting data features and generating a decision tree according to a decision tree regression algorithm.

Wherein the step of training the label model in step S3 is:

(1) machine learning is used so that a computer learns to handle the problem as a person would and gives an answer;

(2) the labels can be model-trained using logistic regression with a linear support vector machine; the approach is binary classification, which suits this problem well and is a form of supervised learning in machine learning (ML);

(3) through logistic regression learning, the fine-grained labels are trained into a label model that can be matched in step S4.

Wherein the step of constructing the user portrait in step S4 is:

(1) first, users are grouped. When labels are used mainly by the business side, a push is rarely based on a single label; more often several labels must be combined to match the business definition of a crowd. Grouping users is therefore equivalent to creating crowd templates with which to push to crowds in different scenes;

(2) in the process of constructing the portrait, users with certain attributes are selected as data samples, and their data features are extracted to train the model;

(3) once the user data features have been identified, they are matched against the label model: for a given data set, a dividing line can be found in the sample space that separates the two classes of samples and lies as far as possible from the nearest training data points.

Wherein the step of correcting and adjusting the user portrait in step S5 is: a time decay factor is estimated and used to adjust the user portrait. The time decay coefficient differs between labels; some labels are not influenced by time at all, and for them the decay factor need not be considered in the calculation.

Wherein the buried points comprise clicking, browsing and quitting.

The data to be cleaned comprises misoperation data, meaningless data and violation data.

Compared with the prior art, the invention has the beneficial effects that:

1. classification is based on TF-IDF weighting, so that the relationship between the user and a label T is closer;

2. a mathematical model based on Newton's law of cooling captures the way the correlation between a user's historical behavior and current behavior weakens as time passes; a time decay function is established, and the user's label attributes are corrected continuously;

3. the supervised learning method is based on logistic regression, which classifies by likelihood, so that the maximum correlation of the data can be obtained.

Drawings

Fig. 1 is a flowchart of the user portrait construction method of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

The embodiments of the invention will be described in detail below with reference to the drawings, but the invention can be implemented in many different ways as defined and covered by the claims.

As shown in fig. 1, a method for constructing a user portrait includes the steps of:

S1, acquiring user behavior data;

S2, establishing a fact label library according to the collected behavior data;

In step S1, user behavior buried points (event tracking points) are embedded in the software in advance, and the behavior granularity is subdivided according to the number of times and the duration recorded at each buried point. For example, when a page is viewed, a 5-second view and a 30-second view should be placed in different categories.
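The subdivision by count and duration can be sketched as follows. This is a minimal illustration, not the patented implementation: the bucket boundaries and bucket names are hypothetical, chosen only so that a 5-second and a 30-second view land in different categories as the text requires.

```python
from collections import Counter

# Hypothetical bucket boundaries in seconds; the source only says that
# 5-second and 30-second views should fall into different categories.
DURATION_BUCKETS = [
    (0, 10, "glance"),
    (10, 60, "read"),
    (60, float("inf"), "deep_read"),
]


def bucket_for(duration_s):
    """Return the name of the duration bucket that contains duration_s."""
    for low, high, name in DURATION_BUCKETS:
        if low <= duration_s < high:
            return name
    raise ValueError(f"negative duration: {duration_s}")


def behavior_granularity(events):
    """Count events per (event type, duration bucket) pair.

    events is an iterable of (event_type, duration_seconds) tuples.
    """
    return Counter((etype, bucket_for(dur)) for etype, dur in events)


events = [("browse", 5), ("browse", 30), ("browse", 32), ("click", 1)]
counts = behavior_granularity(events)
```

Each key of `counts` is one fine-grained behavior (for example a long page view), and its value is the number of times that behavior was observed.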

The step of acquiring the user behavior data in step S1 includes:

Wherein, in step S2, the fact label library is built from the data collected in step S1. For example, behavior data with very long or very short durations may be filtered out: behavior lasting 1 second may be classified as a user mistouch through various model judgments, and behavior lasting longer than a certain threshold may be classified as meaningless and excluded from model calculation.

(2) Then cleaning the buried point data (removing misoperation data, meaningless data and violation data);

(3) selecting data features and generating a decision tree according to a decision tree regression algorithm.
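The duration-based cleaning rule described above can be sketched as a simple filter. The source speaks of "various model judgments"; the fixed thresholds below are a simplifying assumption, and the upper threshold value in particular is hypothetical because the text leaves it unspecified.

```python
# Assumed thresholds in seconds; the source gives 1 second for mistouches
# and leaves the upper threshold unspecified.
MISTOUCH_MAX_S = 1.0
MEANINGLESS_MIN_S = 3600.0


def classify_event(duration_s):
    """Label one buried point event by its duration."""
    if duration_s <= MISTOUCH_MAX_S:
        return "mistouch"
    if duration_s > MEANINGLESS_MIN_S:
        return "meaningless"
    return "valid"


def clean(events):
    """Keep only the events that should take part in model calculation."""
    return [e for e in events if classify_event(e["duration_s"]) == "valid"]


raw = [
    {"event": "browse", "duration_s": 1.0},
    {"event": "browse", "duration_s": 30.0},
    {"event": "browse", "duration_s": 7200.0},
]
cleaned = clean(raw)
```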

Wherein the step of training the label model in step S3 is:

Wherein the step of constructing the user portrait in step S4 is:

(1) first, users are grouped. When labels are used mainly by the business side, a push is rarely based on a single label; more often several labels must be combined to match the business definition of a crowd. Grouping users is therefore equivalent to creating crowd templates with which to push to crowds in different scenes.

Wherein the buried points comprise clicking, browsing and quitting.

Compared with the prior art, the invention has the beneficial effects that:

1. classification is based on TF-IDF weighting, so that the relationship between the user and a label T is closer;

2. a mathematical model based on Newton's law of cooling captures the way the correlation between a user's historical behavior and current behavior weakens as time passes; once a time decay function has been established, the user's label attributes are corrected continuously;

3. the supervised learning method is based on logistic regression, which classifies by probability, so that the maximum correlation of the data can be obtained.

TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique for information retrieval and data mining. TF is Term Frequency (Term Frequency) and IDF is Inverse text Frequency index (Inverse Document Frequency).

TF-IDF is a statistical method to evaluate the importance of a word to one of a set of documents or a corpus. The importance of a word increases in proportion to the number of times it appears in a document, but at the same time decreases in inverse proportion to the frequency with which it appears in the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the degree of relevance between a document and a user query. In addition to TF-IDF, search engines on the internet use a ranking method based on link analysis to determine the order in which documents appear in search results.

The main idea of TF-IDF is: if a word or phrase appears in an article with a high frequency TF and rarely appears in other articles, the word or phrase is considered to have a good classification capability and is suitable for classification.
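To make the weighting concrete, the sketch below applies TF-IDF to labels rather than words by treating each user as a "document" and each label as a "term". This mapping, the function name `tf_idf` and the sample data are assumptions for illustration; the source does not spell out how the algorithm is applied to labels.

```python
import math
from collections import Counter


def tf_idf(user_labels):
    """Compute TF-IDF weights with users as documents and labels as terms.

    user_labels maps a user id to the list of labels observed for that user.
    Returns {user: {label: weight}}.
    """
    n_users = len(user_labels)
    # Document frequency: number of users carrying each label at least once.
    df = Counter(label for labels in user_labels.values() for label in set(labels))
    weights = {}
    for user, labels in user_labels.items():
        counts = Counter(labels)
        total = len(labels)
        weights[user] = {
            label: (c / total) * math.log(n_users / df[label])
            for label, c in counts.items()
        }
    return weights


sample = {
    "u1": ["sports", "sports", "news"],
    "u2": ["news", "music"],
    "u3": ["news"],
}
w = tf_idf(sample)
```

In this sample the label "news" is carried by every user, so its inverse document frequency is log(1) = 0 and it contributes nothing to distinguishing users, while "sports" receives a positive weight for u1.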

The method greatly improves the accuracy and flexibility of the algorithm, in particular by handling hot and cold labels. The application uses recorded information which, compared with user information such as web browsing records, social network relations and news or advertisement click records, is static and stable, small in volume and rich in information, so the user portrait information constructed by the method can be defined and identified more accurately. Further, the application label information comprises application installation label information and/or application active label information, so that the user is labeled from different dimensions and more accurate user portrait information is constructed. Further, providing application installation label information and/or application active label information based on application themes yields richer, more differentiated label information and allows applications to be classified better.

The working process is as follows:

A method of constructing a user portrait comprises the following steps:

S1, acquiring a large amount of user behavior data.

S2, building a fact label library according to the collected behavior data.

S3, training a label model through a plurality of fact label libraries by using logistic regression.

S4, matching the similarity of the user and the label model library through the behavior weight to construct the user portrait.

The specific steps of acquiring the user behavior data in step S1 are as follows:

(1) collecting the operation habits and behavior paths of mobile phone users through client-side buried points;

(3) the cloud server receiving the buried point data and persisting it in an analytical database (such as ClickHouse);

Regarding the analytical database: in transaction processing (OLTP) scenarios, such as adding to a shopping cart, placing an order or paying in e-commerce, a large number of in-place insert, update and delete operations are required. Data analysis (OLAP) scenarios differ: data is usually imported in batches, after which it is explored flexibly along any dimension, examined with BI tools, turned into reports, and so on. Once the data has been written, analysts mine and analyze it from many angles until they discover information such as business value and business trends. This is a process of trial and error, constant adjustment and continuous optimization, in which data is read far more often than it is written. The underlying database must therefore be designed specifically for this access pattern rather than blindly adopting the architecture of a conventional database.
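The write-once, batch-import pattern described above can be illustrated with a small sketch of how a server might buffer buried point events before persisting them. Everything here is hypothetical: the `TrackEvent` fields, the `BatchWriter` class and the in-memory list standing in for the analytical database are illustrative names, not part of the source or of any database client API.

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class TrackEvent:
    """One buried point record (field names are assumptions)."""
    user_id: str
    event: str        # e.g. "click", "browse", "quit"
    page: str
    duration_s: float
    ts: float


class BatchWriter:
    """Buffer events and hand them to a sink in batches (write once, read many)."""

    def __init__(self, sink, batch_size=3):
        self.sink = sink              # callable that receives a list of JSON rows
        self.batch_size = batch_size
        self.buffer = []

    def add(self, event):
        self.buffer.append(json.dumps(asdict(event)))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()


stored_batches = []  # stands in for the analytical database
writer = BatchWriter(stored_batches.append, batch_size=2)
for name in ("click", "browse", "quit"):
    writer.add(TrackEvent("userID1", name, "/home", 5.0, time.time()))
writer.flush()
```

Batching keeps the number of writes low, which suits a store that is optimized for large appends and frequent analytical reads.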

Wherein the step of building a fact label library according to the collected behavior data in step S2 comprises:

(1) building a label library (hereinafter the dw library) from the persisted buried point data;

(2) then cleaning the buried point data (removing misoperation data, meaningless data and violation data);

(3) selecting data features and generating a decision tree according to a decision tree regression algorithm.

Decision tree: constructing a decision tree involves three parts: feature selection, tree generation and pruning. Feature selection: the feature that maximizes information gain is selected; that is, the more certain the classification becomes after splitting on a feature, the better that feature is. Tree generation: algorithms such as ID3 and C4.5 build the tree iteratively; note that the tree at this stage tends to be over-fitted, because each selection is only a locally optimal solution. Pruning: pruning prevents over-fitting; according to a global cost function, if removing a branch makes the cost function smaller, that branch is pruned.
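The feature selection criterion can be sketched as follows, assuming categorical features and the entropy-based information gain used by ID3. The feature names and sample rows are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of labels minus the weighted entropy after splitting on feature."""
    n = len(rows)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """Pick the feature with the largest information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


rows = [
    {"duration": "long", "event": "browse"},
    {"duration": "long", "event": "click"},
    {"duration": "short", "event": "browse"},
    {"duration": "short", "event": "click"},
]
labels = ["interested", "interested", "not", "not"]
chosen = best_feature(rows, labels, ["duration", "event"])
```

Here splitting on `duration` separates the two classes completely, so it has the larger gain and would become the root of the tree.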

Wherein the step of training the label model in step S3 is: because rule-based judgment or manual classification cannot handle users with missing data or users who fall outside the rules, machine learning is needed so that a computer learns to handle the problem as a person would and gives an answer. The labels can be model-trained using logistic regression with a linear support vector machine; the approach is binary classification, which suits this problem well and is a form of supervised learning in machine learning (ML). Through logistic regression learning, the fine-grained labels are trained into a label model to be matched in step S4.
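A minimal sketch of binary label training follows. It uses plain logistic regression fitted by batch gradient descent, which is an assumption: the source does not specify the optimizer, the features, or how the linear support vector machine is combined with logistic regression. The two features and their values are invented for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Probability that a user with features x carries the label."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical features: [normalized view count, normalized view duration].
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7]]
y = [0, 0, 1, 1]  # 1 = user carries the label, 0 = user does not
w, b = train_logistic(X, y)
p_active = predict_proba([0.85, 0.8], w, b)
p_inactive = predict_proba([0.15, 0.15], w, b)
```

One such binary model would be trained per fine-grained label, and the resulting probability can be thresholded to decide whether a user receives that label.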

To keep the model complete, training can adopt a conservative definition: as long as a user has a record under any label, the user is counted as belonging to that class. Because the target Y of a sample does not necessarily have a value for every product, the value may also be null.

Since the most effective way to improve the model's KS statistic is to extend the data dimensions, that is, feature engineering, the problem of multiple data sources for X inevitably arises.

Regarding the linear regression algorithm: the purpose of regression is to predict a numerical target value. The most direct way is to write a formula that computes the target value from the inputs; this formula is called the regression equation, and the process of finding the regression coefficients in it is regression.

Linear regression means that each input term is multiplied by a constant and the results are added together to obtain the output.

One problem with linear regression is that it is prone to under-fitting, because it seeks the unbiased estimate with minimum mean square error. To reduce the mean square error of the prediction, some bias can be introduced into the estimate; one such method is locally weighted linear regression (LWLR). If the data has more features than sample points, that is, the input data matrix X is not of full rank, problems arise when the matrix is inverted. To solve this, ridge regression, the lasso method or forward stepwise regression may be used.
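For reference, the standard closed-form estimators behind these remarks can be written as follows. The notation is an assumption, since the source gives no formulas: $X$ is the input matrix with one sample per row, $y$ the target vector, $W(x)$ a diagonal weight matrix centered on the query point $x$, $\tau$ a bandwidth and $\lambda > 0$ a regularization strength.

```latex
% Ordinary least squares: requires X^T X to be invertible (X of full column rank)
\hat{w}_{\mathrm{OLS}} = (X^{\top} X)^{-1} X^{\top} y

% Locally weighted linear regression: samples near the query point x count more
\hat{w}(x) = \bigl(X^{\top} W(x)\, X\bigr)^{-1} X^{\top} W(x)\, y,
\qquad
W_{ii}(x) = \exp\!\left(-\frac{\lVert x_i - x \rVert^{2}}{2\tau^{2}}\right)

% Ridge regression: adding \lambda I makes the matrix invertible even when
% there are more features than samples
\hat{w}_{\mathrm{ridge}} = (X^{\top} X + \lambda I)^{-1} X^{\top} y
```

The ridge form shows why it addresses the rank problem: $X^{\top} X + \lambda I$ is positive definite for any $\lambda > 0$, so the inverse always exists.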

Wherein the step of constructing the user portrait in step S4 is:

First, users are grouped. When labels are used mainly by the business side, a push is rarely based on a single label; more often several labels must be combined to match the business definition of a crowd. Grouping users is therefore equivalent to creating crowd templates with which to push to crowds in different scenes.

In the process of constructing the portrait, users with certain attributes are used as data samples, and their data features are extracted to train the model. Once the user data features have been identified, they are matched against the label model: for a given data set, a dividing line can be found in the sample space that separates the two classes of samples and lies as far as possible from the nearest training data points.
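Matching a user against crowd templates by behavior weight can be sketched as a similarity search. The source does not name a similarity measure, so cosine similarity is used here as an assumption; the template names, weights and threshold are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {label: weight}."""
    dot = sum(w * b.get(label, 0.0) for label, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_templates(user_weights, templates, threshold=0.5):
    """Return (template name, score) pairs at or above threshold, best first."""
    scored = [(name, cosine_similarity(user_weights, t)) for name, t in templates.items()]
    kept = [s for s in scored if s[1] >= threshold]
    return sorted(kept, key=lambda s: s[1], reverse=True)


user = {"sports": 0.9, "news": 0.1}           # behavior weights per label
templates = {
    "sports_fans": {"sports": 1.0},           # crowd template: one or more labels
    "music_lovers": {"music": 1.0},
}
matches = match_templates(user, templates)
```

A template may combine several labels, which mirrors the point above that a push usually targets a combination of labels rather than a single one.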

Finally, in step S5, the user portrait is corrected and adjusted as follows: a time decay factor is estimated and used to adjust the user portrait. The heat of some label models in the library gradually cools as time passes. For example, a piece of news may be at its highest "temperature" today, but over time it gradually falls to the same "temperature" as ordinary news. The time decay coefficient differs between labels; some labels are not influenced by time at all, and for them the decay factor need not be considered in the calculation.

Time decay factor: the time decay factor represents the gradual cooling of a label's heat over time. It is derived from Newton's law of cooling, which can be written as:

$$T'(t) = -k\,\bigl(T(t) - T_{\mathrm{room}}\bigr)$$

where $T(t)$ is the current temperature; $T'(t)$ is the rate of temperature change of the object; $T_{\mathrm{room}}$ is the room temperature; and $k$ is the cooling coefficient, which depends on the convective heat transfer coefficient $h$ of the object.

The law states that the cooling rate of an object is proportional to the difference between its current temperature and room temperature. For the news domain, a piece of news may be at its highest "temperature" today, but over time it gradually falls to the same "temperature" as ordinary news.

Solving Newton's law of cooling with the room temperature taken as the zero baseline ($T_{\mathrm{room}} = 0$) gives the following equation:

$$T(t) = T(t_0)\, e^{-k\,(t - t_0)}$$

where $T(t)$ is the current temperature; $T(t_0)$ is the original temperature; $k$ is the cooling coefficient; and $t - t_0$ is the interval time.

In words: current temperature = original temperature × exp(-cooling coefficient × interval time). Applied to labels, this means: current weight = original weight × exp(-cooling coefficient × interval time).

For example, set the preference weight to 1 on the day of the user's action and to 0.2 on the tenth day, that is, after an interval of 9 days. Substituting these known values into the formula and taking the logarithm yields the cooling coefficient, and hence the time decay factor.
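The calculation can be sketched as follows, using a 9-day interval as in the example above. The function names are illustrative; the only formula used is the exponential decay of weight with the cooling coefficient.

```python
import math


def cooling_coefficient(w_start, w_end, interval_days):
    """Solve w_end = w_start * exp(-k * interval_days) for k."""
    return math.log(w_start / w_end) / interval_days


def decayed_weight(w_start, k, elapsed_days):
    """Weight of a label after elapsed_days under exponential decay."""
    return w_start * math.exp(-k * elapsed_days)


# Weight 1 on the day of the action, 0.2 after an interval of 9 days.
k = cooling_coefficient(1.0, 0.2, 9)
weight_after_3_days = decayed_weight(1.0, k, 3)

# A label that is not influenced by time simply uses k = 0.
stable_weight = decayed_weight(1.0, 0.0, 365)
```

With these inputs the coefficient equals ln(5) / 9, and a different coefficient can be fitted per label in the same way by choosing a different target weight or interval.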

Example:

userID1 is used as the basic unit for identifying a user. Users are required to fill in basic information at registration, such as sex, age, region, school and interest tags. Because this information is entered by the users themselves, its authenticity varies from user to user; it should therefore be treated as basic data to be supplemented and corrected, where the correction may be performed by a reviewer, by decision tree judgment, and so on.

Two types of correction method are provided:

Frequentist correction: the parameters are regarded as fixed values that exist objectively, although they are unknown. The parameter values can therefore be estimated by optimizing a likelihood function or a similar criterion.

Bayesian correction: the parameters are regarded as unobserved random variables that may themselves follow a distribution. A prior distribution can therefore be assumed for the parameters, and their posterior distribution computed from the observed data.

Once the data has been obtained, the model class is already determined, but the actual parameters are still unknown. Since the observed sample has already occurred, a set of parameters is estimated that makes the probability of the observed result as large as possible (the optimization goal). Because the samples in a set are treated as a whole, their probabilities are multiplied together (the multiplication principle) to obtain the objective function, and the estimation correction is then carried out on this basis.
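In symbols, the two views can be summarized as follows. The notation is an assumption, since the source states the idea only in words: $x_1, \dots, x_n$ are the observed samples, assumed independent, and $\theta$ denotes the model parameters.

```latex
% Frequentist view: multiply the sample probabilities to get the likelihood,
% then choose the parameters that maximize it (equivalently, its logarithm)
L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta),
\qquad
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)

% Bayesian view: combine a prior p(\theta) with the likelihood to get the posterior
p(\theta \mid x_1, \dots, x_n) \propto p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)
```

Taking the logarithm turns the product into a sum without changing the maximizer, which is why the log-likelihood is normally the function that is optimized in practice.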

Labels are collected and divided according to the user behavior paths reported by the client. In this embodiment, the client reports buried point data by communicating with the cloud server, and the cloud server writes the data into a ClickHouse columnar storage server for persistence; simple meaningless data (for example, values outside the normal business range) is discarded at write time and is not stored.

Logistic regression is applied to the classified label data to keep refining its granularity. The basic information and behaviors of userID1 are then learned, and similarity labels are computed collaboratively with users who show the same behaviors. A weight model is introduced to correct the labels continuously; because the time decay factors of different labels differ, they should be adjusted dynamically. In this way the portrait of each user is produced.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A user portrait construction method is characterized in that: the method comprises the following steps:

S1, acquiring user behavior data;

S2, establishing a fact label library according to the collected behavior data;

2. A user portrait construction method as claimed in claim 1, wherein: in step S1, user behavior buried points are embedded in the software in advance, and the behavior granularity is subdivided according to the number of times and the duration recorded at each buried point.

3. A user portrait construction method as claimed in claim 1, wherein: the step of acquiring the user behavior data in step S1 includes:

(1) collecting the operation habits and behavior paths of mobile phone users through client-side buried points;

(2) sending the buried point data to the cloud without the user perceiving it;

(3) the cloud receiving the buried point data and persisting it in an analytical database.

4. A user portrait construction method as claimed in claim 1, wherein: in step S2, the fact label library is built from the data collected in step S1.

5. A user portrait construction method as claimed in claim 1, wherein: the step of establishing the fact label library comprises the following steps:

(1) building a label library from the persisted buried point data;

(2) then cleaning the buried point data;

6. A user portrait construction method as claimed in claim 1, wherein: the step of training the label model in step S3 is:

(2) performing model training on the labels by using logistic regression of a linear support vector machine;

7. A user portrait construction method as claimed in claim 1, wherein: the step of constructing the user portrait in step S4 is:

(1) grouping users;

(3) once the user data features have been identified, matching them against the label model: for a given data set, a dividing line can be found in the sample space that separates the two classes of samples and lies as far as possible from the nearest training data points.

8. A user portrait construction method as claimed in claim 1, wherein: in step S5, the step of correcting and adjusting the user portrait is: a time decay factor is estimated and used to adjust the user portrait.

9. A user portrait construction method as claimed in claim 2, wherein: the buried points comprise clicking, browsing and quitting.

10. A user portrait construction method as claimed in claim 1, wherein: the data to be cleaned comprises misoperation data, meaningless data and violation data.