CN110555717A

CN110555717A - method for mining potential purchased goods and categories of users based on user behavior characteristics

Info

Publication number: CN110555717A
Application number: CN201910687675.0A
Authority: CN
Inventors: 程锐; 张艳青; 杨漫瑶
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2019-07-29
Filing date: 2019-07-29
Publication date: 2019-12-10

Abstract

The invention belongs to the field of user behavior analysis and data mining and relates to a method for mining the commodities and categories a user is likely to purchase, based on user behavior characteristics. The method comprises: encoding the preprocessed data and applying feature engineering to obtain user behavior feature data; analyzing and classifying the sample data into positive and negative samples, and dynamically undersampling them to generate a plurality of sample subsets used as positive and negative training data; training decision tree models on the positive and negative sample data to obtain a plurality of single prediction models, and fusing the single prediction models by stacking to generate a plurality of fused prediction models; and predicting the user's potential purchased commodities and categories with the plurality of fused prediction models, then processing and analyzing their prediction results to obtain weighted potential purchased commodities and categories. The invention can help merchants discover users with high purchase intention and improve the consumption conversion rate of marketed users.

Description

Method for mining potential purchased goods and categories of users based on user behavior characteristics

Technical Field

The invention belongs to the field of user behavior analysis and data mining, and relates to a method for mining potential purchased commodities and categories of users based on user behavior characteristics.

Background

E-commerce is developing rapidly, and in a market flooded with homogeneous commodities, active marketing lets merchants stand out, attract users and genuinely improve the consumption conversion rate of marketed users. Active marketing is conventionally carried out through advertising and media promotion, but these traditional approaches come down to generating traffic, and the resulting user conversion is largely a matter of chance. A more effective method is therefore needed to improve the consumption conversion rate. The key is to identify target users accurately and to push them information on the commodities they are most likely to purchase; identifying target users and target commodities is a problem of mining and prediction.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a method for mining potential target commodity classes purchased by a user based on user behavior characteristics.

The invention is realized by adopting the following technical scheme:

The method for mining the potential purchased goods and categories of the user based on the user behavior characteristics comprises the following steps:

Data cleaning and data preprocessing are carried out, and preprocessed data are obtained;

Encoding the preprocessed data; extracting basic features, statistical features, time interval features and calculated features; evaluating the importance of the extracted features by a filtering method so as to identify important features and redundant features; and, because some user behaviors influence the result more strongly the closer they are to the prediction time, introducing a time decay theory in the feature importance evaluation and weighting the data features accordingly, thereby obtaining user behavior feature data;

Carrying out positive and negative sample analysis and classification on sample data, and generating a plurality of sample subsets by carrying out dynamic undersampling on the positive and negative samples to be used as positive and negative sample data of training;

Training decision tree models on the positive and negative sample data to obtain a plurality of single prediction models, and fusing the single prediction models by stacking to generate a plurality of fused prediction models;

Predicting with the fused prediction models, comparing the prediction results with an expected value, and feeding the comparison back to the decision tree models for parameter adjustment and retraining until optimal model parameters are obtained;

Predicting the user's potential purchased commodities and categories with the plurality of fused prediction models, and processing and analyzing their prediction results to obtain weighted potential purchased commodities and categories of the user.

Further, the filtering method scores each feature against an index using a correlation coefficient method, the score representing the importance of the feature, and then ranks the features by score.

Preferably, chi-square filtering is used: the chi-square statistic between each non-negative feature and the label is calculated, the features are ranked by this statistic from high to low, and the K highest-scoring features are selected, thereby removing features that are most likely independent of the label and irrelevant to the classification task.

Preferably, the sample data comprises behavior feature data of a given user toward a given commodity over a period of time, commodity feature data, and the user's own feature data.

Further, the dynamic undersampling extracts, by a given method, only a part of the samples from the class that has too many samples, so as to correct the imbalance in proportion between that class and the other samples.

Preferably, a certain number of subsamples are randomly drawn from the negative samples and combined with the positive samples to form a new sample set.

Preferably, the decision tree models employed are the RF and GBDT algorithms.

Preferably, the process of generating the fused prediction model from the plurality of single prediction models by stacking comprises:

After the positive and negative samples are processed, n training sets train_x, …, train_y and a test set test are generated;

(1) Selecting an untrained decision tree model;

(2) Taking n-1 parts of the training set as the small training sets s_train_x, …, s_train_y and the remaining part as the small test set s_test, the test set test being unchanged;

(3) Training the decision tree model on s_train_x, …, s_train_y, predicting s_test with the trained model to obtain the corresponding s_pred, and predicting test to obtain y_pred;

(4) Selecting another part of the training set as the small test set s_test_x and training the decision tree model on the other n-1 parts;

(5) Repeating steps (2), (3) and (4) n times to obtain n s_pred results and n y_pred results;

(6) Using the n s_pred results as train_X and the original train_Y as train_Y to train a fusion prediction model, giving model G; using the average of the n y_pred results as the new test_X; and feeding test_X into model G to obtain a prediction result;

In the second layer, a further layer of stacking is performed by combining the first layer's output training set train_X, train_Y and test set test_X with other feature sets, and the above steps are repeated to generate the final fused prediction model.

Preferably, the desired value is set using F1-score.

Preferably, the input of the fused prediction model comprises the user's behavior features toward a given commodity within a fixed time before the prediction date, together with the commodity and its category, and the prediction output is whether the user will purchase the commodity.

Compared with the prior art, the invention has the following beneficial effects:

(1) For the extracted data features, feature importance is evaluated by a filtering method (each feature is scored against an index using a correlation coefficient method, the score representing the importance of the feature, and the features are then ranked by score) so as to identify important features and redundant features. Because some user behaviors influence the result more strongly the closer they are to the prediction time, a time decay theory is introduced in the evaluation (for some behaviors, such as clicks, the farther the behavior is from the prediction day, the lower its weight) and the data features are weighted accordingly, which makes the feature importance evaluation more accurate.

(2) A certain number of subsamples are randomly drawn from the negative samples and combined with the positive samples into a new sample set for training; repeating this generation process yields a plurality of training subsets, each subset is trained independently, and the resulting models are then fused into the final prediction model.

(3) Training and prediction use several classifier models (decision tree models); the single decision tree models are fused by stacking to generate a plurality of fused prediction models, and the results of these fused models are combined by raising or lowering their weights, which makes the weights of the prediction results more informative and improves commodity recommendation for users.

(4) The large volume of imbalanced user behavior data is dynamically partitioned into subsets, which prevents classifier failure caused by imbalanced samples; the resulting sample subsets are used for multi-model training, which improves data utilization and the training effect.

(5) The prediction model can improve the prediction precision of future purchasing behaviors of the user, help the merchant to discover the user with high potential purchasing intention, guide the merchant to push more accurate commodity information and preferential information to the user, and improve the user consumption conversion rate of marketing.

Drawings

FIG. 1 is a flowchart of a method for mining the categories of target commodities potentially purchased by a user based on user behavior characteristics according to an embodiment of the present invention.

FIG. 2 is a block diagram of user behavior characterization in accordance with an embodiment of the present invention.

FIG. 3 is a diagram illustrating a model training and parameter tuning process according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail below with reference to specific embodiments, but the embodiments of the present invention are not limited thereto.

E-commerce is now deeply embedded in everyday life, and an increasing share of user behavior takes place online. Clicks, page views, purchases, comments and other online behaviors may appear unrelated, yet they conceal very important information. The invention extracts user behavior features relevant to purchasing through feature engineering, uses machine learning to understand user behavior, and predicts a user's future purchase intention and target commodities from the user's historical behavior data. Based on the prediction results, merchants can market commodities and push offers to target users more precisely, raising the consumption conversion rate of promoted users and obtaining the greatest effect for the least investment.

According to the method, the user behavior characteristics are extracted according to the business characteristics and the characteristic engineering method, a user behavior model is constructed by combining the characteristics of commodity data, and finally, the characteristic data are dynamically divided according to the balance of the user behavior characteristic data to generate a plurality of data sets for model training and prediction. The training and prediction of data are processed by adopting various machine learning classifier models, after the result of each model is predicted, the prediction results are fused, meanwhile, the results appearing in a plurality of prediction models are weighted, and the weight of the result appearing in a single prediction model is reduced.

The method for mining the potential purchase target commodity class of the user based on the user behavior characteristics, as shown in fig. 1, includes:

S1: Clean and preprocess the data to obtain preprocessed data.

Abnormal data in the collected user data and commodity data are cleaned (including null value and abnormal value handling and removal of abnormal user behavior data), and the data are standardized according to the specified data requirements.

The user data comprises user basic attribute data, user behavior data and user comment data. In this embodiment, the user basic attribute data includes gender, age and registration time; the user behavior data includes operation time, operation type, object, product and product type; the user comment data includes the number of comments, the number of positive comments, the number of negative comments and the time of the last comment.

The commodity data includes commodity code, name, category code, category and commodity attributes.

Abnormal values in the user data and commodity data are handled using the variance.
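The text does not state how the variance is used to treat abnormal values. By way of illustration only, the following minimal sketch assumes the common rule of flagging values that lie more than k standard deviations from the mean; the threshold k = 3, the function name `flag_outliers` and the sample data are assumptions, not part of the disclosed method.

```python
from statistics import mean, pstdev

def flag_outliers(values, k=3.0):
    """Return a list of booleans, True where a value lies more than
    k population standard deviations from the mean (assumed rule)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing can be flagged.
        return [False] * len(values)
    return [abs(v - mu) > k * sigma for v in values]

# Hypothetical daily click counts for one user; the last value is suspicious.
daily_clicks = [12, 9, 11, 10, 13, 8, 10, 11, 9, 12, 10, 500]
flags = flag_outliers(daily_clicks, k=3.0)
cleaned = [v for v, bad in zip(daily_clicks, flags) if not bad]
```

A single extreme value inflates the standard deviation it is measured against, so on small samples a lower k or a robust statistic may be preferable.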

Data preprocessing as shown in fig. 2, includes: data format standardization, null value and illegal value statistics, and consistency detection of user, product and behavior data.

S2: Encode the preprocessed data; extract basic features, statistical features, time interval features and calculated features; evaluate the importance of the extracted features by a filtering method so as to identify important features and redundant features; and, because some user behaviors influence the result more strongly the closer they are to the prediction time, introduce a time decay theory and weight the data features to obtain user behavior feature data.

As shown in FIG. 2, step S2 can be understood as feature engineering, which includes: encoding the preprocessed data so that it can be used for decision tree training; extracting basic features of users and of commodities; aggregating the basic user behavior data along several dimensions, mainly time, commodity category, behavior type, user attribute and commodity attribute, to generate statistical features, time interval features and calculated features; associating and fusing features through the relationships among users, commodities and behaviors; and classifying the behaviors and weighting behaviors such as clicking and collecting according to the time decay theory, so that the farther a behavior is from the prediction time, the lower its feature weight.

The filtering method of the invention scores each feature against an index using a correlation coefficient method, the score representing the importance of the feature, and then ranks the features by score. In this embodiment, chi-square filtering is used: the chi-square statistic between each non-negative feature and the label is calculated, the features are ranked by this statistic from high to low, and the K highest-scoring features are selected, thereby removing features that are most likely independent of the label and irrelevant to the classification task.
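The chi-square filter can be sketched as follows. This is an illustrative pure-Python version, assuming each feature is a non-negative count and that the expected per-class total of a feature is the class frequency multiplied by the feature's overall total; the function names and the toy data are hypothetical.

```python
def chi2_scores(X, y):
    """Chi-square statistic of each non-negative feature against the class label.

    X: list of rows of non-negative feature values (e.g. counts).
    y: list of class labels, one per row.
    """
    n_features = len(X[0])
    classes = sorted(set(y))
    class_prob = {c: y.count(c) / len(y) for c in classes}
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)
        stat = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob[c] * total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
        scores.append(stat)
    return scores

def select_top_k(X, y, k):
    """Indices of the k features with the highest chi-square score."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# Toy data: columns are [clicks, cart_adds, page_views]; label 1 = purchased.
X = [[9, 3, 5], [8, 2, 5], [1, 0, 5], [0, 0, 5], [1, 1, 5], [7, 3, 5]]
y = [1, 1, 0, 0, 0, 1]
scores = chi2_scores(X, y)
top2 = select_top_k(X, y, k=2)
```

In the toy data the third column is identical for buyers and non-buyers, so its statistic is zero and it is the feature dropped when K = 2.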

The invention introduces a time decay theory to weight the data features: certain behaviors, such as clicks, are weighted so that the farther the behavior is from the prediction day, the lower its weight, which makes the feature importance evaluation more accurate.
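The text does not specify the decay function. As one possible reading, the sketch below assumes an exponential decay with a configurable half-life; the half-life of 3 days, the function names and the dates are assumptions chosen only to make the example concrete.

```python
from datetime import date

def decay_weight(event_day, prediction_day, half_life_days=3.0):
    """Weight in (0, 1]: 1.0 on the prediction day, halved every half_life_days."""
    age = (prediction_day - event_day).days
    if age < 0:
        raise ValueError("event must not be later than the prediction day")
    return 0.5 ** (age / half_life_days)

def decayed_count(event_days, prediction_day, half_life_days=3.0):
    """Time-decayed count of one behavior type (e.g. clicks)."""
    return sum(decay_weight(d, prediction_day, half_life_days) for d in event_days)

prediction_day = date(2019, 7, 29)
clicks = [date(2019, 7, 29), date(2019, 7, 26), date(2019, 7, 23)]
weighted_clicks = decayed_count(clicks, prediction_day)
```

Here three clicks at ages 0, 3 and 6 days contribute weights 1, 0.5 and 0.25, so the decayed count is lower than the raw count of 3 and recent activity dominates.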

In this embodiment, the extraction process of the basic feature, the statistical feature, the time interval feature, and the calculation feature includes:

Firstly, one-hot encoding is applied to basic features such as type. One-hot encoding, also known as one-bit-effective encoding, uses an N-bit register to represent N states; each state has its own dedicated register bit, and exactly one bit is active at any time. One-hot encoding is convenient to design and easy to implement; it can also encode non-continuous (categorical) features, and no decoding operation is needed.

Secondly, various user behaviors are counted over different periods, for example the number of clicks, the number of collections and repeated-behavior statistics within 7 days.

Thirdly, interval features of certain behaviors are extracted, for example the interval between the user's most recent browsing of a commodity and the browsing before it.

Fourthly, calculated features are derived, such as the purchase conversion rate after the user adds items to the shopping cart within one week; the purchase conversion rates of the user's browse count, add-to-cart count and follow count; and the time-decay weighting of user behavior statistics.
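The four kinds of features listed above can be illustrated with a small sketch. The event layout (a list of timestamp and behavior-type pairs for one user and one commodity), the behavior names and all function names are hypothetical; the embodiment does not prescribe a data structure.

```python
from datetime import datetime, timedelta

def one_hot(value, categories):
    """Basic feature: a list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def count_in_window(events, behavior, end, days=7):
    """Statistical feature: events of one behavior type in the `days` before `end`."""
    start = end - timedelta(days=days)
    return sum(1 for t, b in events if b == behavior and start <= t < end)

def last_interval_hours(events, behavior):
    """Interval feature: hours between the two most recent events, or None."""
    times = sorted(t for t, b in events if b == behavior)
    if len(times) < 2:
        return None
    return (times[-1] - times[-2]).total_seconds() / 3600.0

def conversion_rate(events, source, target, end, days=7):
    """Calculated feature: target events per source event in the window."""
    n_source = count_in_window(events, source, end, days)
    n_target = count_in_window(events, target, end, days)
    return n_target / n_source if n_source else 0.0

end = datetime(2019, 7, 29)
events = [
    (datetime(2019, 7, 20, 10), "browse"),   # outside the 7-day window
    (datetime(2019, 7, 25, 9), "browse"),
    (datetime(2019, 7, 27, 21), "browse"),
    (datetime(2019, 7, 26, 12), "cart"),
    (datetime(2019, 7, 27, 8), "cart"),
    (datetime(2019, 7, 28, 8), "purchase"),
]
behavior_types = ["click", "browse", "collect", "cart", "purchase"]
features = {
    "type_onehot": one_hot("collect", behavior_types),
    "browse_7d": count_in_window(events, "browse", end),
    "browse_gap_h": last_interval_hours(events, "browse"),
    "cart_to_purchase_7d": conversion_rate(events, "cart", "purchase", end),
}
```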

S3: Analyze and classify the sample data into positive and negative samples, and dynamically undersample them to generate a plurality of sample subsets used as positive and negative training data.

The sample data of the invention comprises behavior feature data of a given user toward a given commodity over a period of time, commodity feature data, the user's own feature data, and the like.

The sample data is marked by combining actual purchase data of a user, user characteristic data and the like to generate positive and negative sample data, the unbalance condition of the positive and negative samples is automatically analyzed, the sample data is subjected to proper undersampling processing according to the preset positive and negative proportion, and a plurality of positive and negative sample subsets are generated. If the user has purchased the item within the forecast date, the sample data is a positive sample, otherwise it is a negative sample.

Sample data may be imbalanced between positive and negative samples, and using such data directly biases the training result. The invention therefore applies dynamic undersampling to the positive and negative samples to generate a plurality of sample subsets, each of which is trained independently. Dynamic undersampling here means extracting, by a given method, only a part of the samples from the class that has too many samples, so as to correct the imbalance in proportion between that class and the other samples. In this embodiment, a certain number of subsamples are randomly drawn from the negative samples and combined with the positive samples into a new sample set for training, and a plurality of classifiers are trained by repeating this subset generation process.
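The repeated random undersampling described in this embodiment might look like the following sketch. The parameter `neg_per_pos` stands in for the preset positive-to-negative ratio, and the function name, seed and sample identifiers are illustrative assumptions.

```python
import random

def undersample_subsets(positives, negatives, n_subsets, neg_per_pos=1.0, seed=0):
    """Build n_subsets training subsets, each holding all positives plus a
    random draw of negatives sized neg_per_pos * len(positives)."""
    rng = random.Random(seed)
    n_neg = min(len(negatives), int(round(neg_per_pos * len(positives))))
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(negatives, n_neg)  # without replacement within a subset
        subset = [(x, 1) for x in positives] + [(x, 0) for x in sampled]
        rng.shuffle(subset)
        subsets.append(subset)
    return subsets

# Hypothetical imbalanced data: 5 purchases against 200 non-purchases.
positives = [f"pos_{i}" for i in range(5)]
negatives = [f"neg_{i}" for i in range(200)]
subsets = undersample_subsets(positives, negatives, n_subsets=4, neg_per_pos=2.0)
```

Because each subset draws a different random part of the negatives, training one classifier per subset uses more of the negative data than a single undersampled set would.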

S4: Train decision tree models on the positive and negative sample data to obtain a plurality of single prediction models, and fuse the single prediction models by stacking to generate a plurality of fused prediction models.

In this embodiment, given the purpose of the prediction, the task can be abstracted as a binary classification problem. The decision tree models adopted are the RF and GBDT algorithms, which are trained on the sample data sets to generate a plurality of prediction models; the models are then fused by stacking to generate a plurality of fused prediction models.

After the decision tree models are stable, LR may be used to weight the prediction results and improve the accuracy of the fused model.

The process of generating the fused prediction model from the plurality of single prediction models by stacking is as follows:

After the positive and negative samples are processed, n training sets train_x, …, train_y and a test set test are generated.

(1) Select an untrained decision tree model, such as random forest (RF) or gradient boosting decision tree (GBDT).

(2) Take n-1 parts of the training set as the small training sets s_train_x, …, s_train_y and the remaining part as the small test set s_test; the test set test is unchanged.

(3) Train the RF or GBDT model on s_train_x, …, s_train_y, predict s_test with the trained model to obtain the corresponding s_pred, and predict test to obtain y_pred.

(4) Select another part of the training set as the small test set s_test_x and train the RF or GBDT model on the other n-1 parts.

(5) Repeat steps (2), (3) and (4) n times to obtain n s_pred results and n y_pred results.

(6) Use the n s_pred results as train_X and the original train_Y as train_Y to train a fusion prediction model, giving model G; use the average of the n y_pred results as the new test_X; and feed test_X into model G to obtain a prediction result.
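The first stacking layer described above can be sketched as an out-of-fold loop. To keep the example self-contained, `MeanRateModel` is a deliberately trivial placeholder and is not RF or GBDT; any object with `fit` and `predict` methods could be substituted. All class, function and variable names other than s_pred, y_pred, train_X, train_Y and test_X are assumptions.

```python
class MeanRateModel:
    """Placeholder base learner (NOT RF/GBDT): predicts the positive rate of
    the training rows that share the same first feature value."""

    def fit(self, X, y):
        self.default = sum(y) / len(y)
        groups = {}
        for row, label in zip(X, y):
            groups.setdefault(row[0], []).append(label)
        self.rate = {k: sum(v) / len(v) for k, v in groups.items()}
        return self

    def predict(self, X):
        return [self.rate.get(row[0], self.default) for row in X]


def out_of_fold_stack(make_model, folds, test_X):
    """folds: list of n (X, y) parts. Returns (train_X, train_Y, test_X) for layer two."""
    n = len(folds)
    s_preds, y_preds, train_Y = [], [], []
    for i in range(n):
        # Part i is the small test set; the other n-1 parts form the small training set.
        s_train_X = [row for j, (X, _) in enumerate(folds) if j != i for row in X]
        s_train_y = [lab for j, (_, y) in enumerate(folds) if j != i for lab in y]
        s_test_X, s_test_y = folds[i]
        model = make_model().fit(s_train_X, s_train_y)
        s_preds.extend(model.predict(s_test_X))  # s_pred for the held-out part
        train_Y.extend(s_test_y)                 # labels aligned with s_pred
        y_preds.append(model.predict(test_X))    # y_pred on the fixed test set
    # New test feature: element-wise average of the n y_pred vectors.
    test_meta = [sum(col) / n for col in zip(*y_preds)]
    return s_preds, train_Y, test_meta


# Toy data with a single binary feature; three parts of two rows each.
folds = [
    ([[1], [0]], [1, 0]),
    ([[1], [0]], [1, 0]),
    ([[1], [0]], [0, 0]),
]
test_X = [[1], [0]]
train_X_meta, train_Y_meta, test_X_meta = out_of_fold_stack(MeanRateModel, folds, test_X)
```

Model G would then be fitted on `train_X_meta` and `train_Y_meta` and applied to `test_X_meta`. Each row's s_pred comes from a model that never saw that row, which is what lets the second layer learn from the base predictions without leakage.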

S5: Predict with the prediction model, compare the result with an expected value, and feed the comparison back to the decision tree model for parameter adjustment and retraining until optimal model parameters are obtained.

The prediction model is used to predict on historical data, and the prediction is compared with the historical outcome and judged against a preset coverage. If the result does not meet the preset value, the model is recorded, the decision tree model parameters are adjusted, and training and prediction are repeated until the coverage exceeds the preset value or the number of training rounds reaches its preset limit. In this embodiment, the model training and parameter tuning process is shown in FIG. 3.
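The feedback loop of step S5 can be sketched as follows. The scorer passed as `train_and_score` is a hypothetical callable standing in for "retrain the model with these parameters, predict on historical data and compute the score"; the parameter grid and the scores in `fake_scores` are invented placeholders, not results.

```python
def tune_until_target(param_grid, train_and_score, target, max_rounds):
    """Try parameter settings in order until the score exceeds `target` or
    `max_rounds` trainings have run. Returns (best_params, best_score, history)."""
    history = []
    best_params, best_score = None, float("-inf")
    for round_no, params in enumerate(param_grid, start=1):
        if round_no > max_rounds:
            break
        score = train_and_score(params)   # retrain and evaluate on historical data
        history.append((params, score))   # record every model that was tried
        if score > best_score:
            best_params, best_score = params, score
        if score > target:
            break
    return best_params, best_score, history

# Placeholder scores keyed by a hypothetical max_depth parameter.
fake_scores = {3: 0.41, 5: 0.58, 7: 0.66, 9: 0.63}
grid = [{"max_depth": d} for d in (3, 5, 7, 9)]
best, score, history = tune_until_target(
    grid, lambda p: fake_scores[p["max_depth"]], target=0.65, max_rounds=10
)
```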

The expected value is related to the model, and in this embodiment, the expected value is set by F1-score:

(1) Precision: the percentage of true positives among all samples identified as positive, that is, the probability that a sample predicted to be positive is actually positive. Precision is calculated as follows:

$$P = \frac{TP}{TP + FP}$$

where TP (True Positive) is a sample judged to be positive that is in fact positive, and FP (False Positive) is a sample judged to be positive that is in fact negative.

(2) Recall: also called the recall ratio, the proportion of the actual positive samples in the test set that are correctly identified as positive. Recall is calculated as follows:

$$R = \frac{TP}{TP + FN}$$

where FN (False Negative) is a sample judged to be negative that is in fact positive.

(3) F-score: precision and recall often pull in opposite directions, and the F-score takes both into account as their weighted harmonic mean. F1-score is the case β = 1 of the general F-score formula, in which precision and recall are equally important; when β > 1, recall carries more weight. The general F-score formula and the F1-score formula are as follows:

$$F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}$$

$$F_1 = \frac{2\, P\, R}{P + R}$$

In these formulas, P is precision, R is recall, and β is the weight that balances precision and recall in the F-score. Three cases arise:

If β = 1, precision and recall are equally important;

If β < 1, precision is more important than recall;

If β > 1, recall is more important than precision.
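The three metrics can be computed directly from the confusion counts. The sketch below uses the standard definitions of precision, recall and F-score; the counts TP = 30, FP = 10 and FN = 20 are made-up numbers for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision P = TP / (TP + FP) and recall R = TP / (TP + FN)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def f_beta(p, r, beta=1.0):
    """General F-score; beta = 1 gives the F1-score."""
    b2 = beta * beta
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom else 0.0

p, r = precision_recall(tp=30, fp=10, fn=20)   # P = 0.75, R = 0.6
f1 = f_beta(p, r)
f2 = f_beta(p, r, beta=2.0)     # recall weighted more heavily
f_half = f_beta(p, r, beta=0.5) # precision weighted more heavily
```

Since precision exceeds recall in this example, the score rises when β < 1 and falls when β > 1, which matches the three cases listed above.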

S6: Predict the user's potential purchased commodities and categories with the plurality of fused prediction models, and process and analyze their prediction results to obtain weighted potential purchased commodities and categories of the user.

In this embodiment, the input data of the fused prediction model includes the user's behavior features toward a given commodity within a fixed time before the prediction date, the commodity, its category, and the like; the prediction output is whether the user will purchase the commodity.

Intersection and difference set operations are performed on the prediction results of the fused prediction models; results in the intersection receive a high recommendation weight and results in the difference sets a low one. If a commodity or category is predicted as a purchase by several fused prediction models, its weight is increased; if it appears in the output of only one fused prediction model, its weight is reduced. The final result is the set of the user's potential purchased commodities and categories, each with a weight.
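The text does not give a formula for the weights. One simple reading, assumed here, is to weight each predicted (user, commodity) pair by the fraction of fused models that predict it, so that pairs in the intersection score 1.0 and pairs found by a single model score lowest. The function name and identifiers are hypothetical.

```python
from collections import Counter

def weight_predictions(model_outputs):
    """model_outputs: list of sets of (user, item) pairs predicted as purchases,
    one set per fused model. Returns {pair: weight in (0, 1]}."""
    votes = Counter(pair for output in model_outputs for pair in output)
    n_models = len(model_outputs)
    return {pair: count / n_models for pair, count in votes.items()}

outputs = [
    {("u1", "sku_a"), ("u2", "sku_b")},
    {("u1", "sku_a"), ("u3", "sku_c")},
    {("u1", "sku_a"), ("u2", "sku_b")},
]
weights = weight_predictions(outputs)
# Highest-weight recommendations first.
ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
```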

The method of the present invention is described below by taking the example of mining the behavior characteristics of users of a certain website to find potential purchased goods and categories.

Firstly, collecting basic information of a user of a website, wherein the basic information comprises age, gender, registration time and the like; collecting user behavior data, including user online behavior data of clicking, browsing, purchasing, commenting, collecting, paying attention to, canceling attention to and the like; collecting commodity data: including name, category, commodity attributes, etc.; and converting the related data into a format required by the method of the invention, and cleaning abnormal data.

Secondly, the basic data are encoded so that they can be used for machine learning training; basic features of users and of commodities are extracted; the basic user behavior data are aggregated along several dimensions, mainly time, commodity category, behavior type, user attribute and commodity attribute, to generate statistical features, time interval features and calculated features; features are associated and fused through the relationships among users, commodities and behaviors; and the behaviors are classified, with behaviors such as clicking and collecting weighted according to the time decay theory so that the farther a behavior is from the prediction time, the lower its feature weight.

Chi-square filtering is then used: the chi-square statistic between each non-negative feature and the label is calculated, the features are ranked by this statistic from high to low, and the K highest-scoring features are selected, thereby removing features that are most likely independent of the label and irrelevant to the classification task.

Thirdly, the samples are labeled using the users' actual purchase data and user feature data to generate positive and negative sample data; the imbalance between positive and negative samples is analyzed automatically; and the sample data are undersampled according to a preset positive-to-negative ratio to generate a plurality of positive and negative sample subsets.

Fourthly, the sample data sets are trained with the RF (random forest) and GBDT (gradient boosting decision tree) algorithms to generate a plurality of prediction models, and the models are fused by stacking to generate a plurality of fused prediction models.

Fifthly, the fused prediction model is used to predict on historical data, and the prediction is compared with the historical outcome and judged against a preset coverage. If the result does not meet the preset value, the model is recorded and the procedure returns to the fourth step to adjust the model parameters, retraining and predicting until the coverage exceeds the preset value or the number of training rounds reaches its preset limit.

Sixthly, the results of the plurality of fused prediction models are intersected: a result that appears in several fused prediction models has its weight increased, and a result that appears in only one has its weight reduced, giving weighted prediction results that better guide business development.

The invention discloses a method for mining the categories of target commodities potentially purchased by a user based on user behavior characteristics. Taking the user behavior data of an e-commerce business as the entry point, the method collects data according to the characteristics of the e-commerce business, standardizes the user data and extracts user behavior features in accordance with feature engineering theory, constructs a user behavior model, and trains on and analyzes the user behavior data with machine learning algorithms, outputting the prediction model that best 'understands' user behavior. This model is used to predict the user's purchase intention over a future period and the commodities and categories most likely to be purchased.

To understand and predict user behavior effectively, the behavior must first be analyzed and refined to obtain useful data and eliminate invalid noise, yielding valuable data that can represent the user. Once the important data features of the user are obtained, the data undergo correlation analysis, and data modeling is guided by the goal of predicting the user's intention to purchase commodities and categories, so that the user behavior feature data are converted into a user behavior model usable for prediction; this correlation and analysis can be realized with machine learning. The extracted sample data are modeled with different machine learning algorithms to predict users' purchasing behavior, and the predicted results are fed back to the models to guide parameter adjustment and repeated training. Continuous optimization yields a fused prediction model that can predict users' future purchasing behavior, help merchants find users with high purchase intention, guide merchants in pushing more accurate commodity and offer information to users, and improve the consumption conversion rate of marketed users.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent and is included in the scope of protection of the present invention.

Claims

1. A method for mining the potential purchased goods and categories of users based on user behavior characteristics, characterized by comprising the following steps:

2. The method for mining the potential purchased goods and categories of the users based on the behavior characteristics of the users as claimed in claim 1, wherein the filtering method is to score each characteristic according to the index by using a correlation coefficient method, the score represents the importance of the characteristic, and then rank the characteristics according to the score.

3. The method of claim 2, wherein chi-square filtering is used to calculate the chi-square statistic between each non-negative feature and the label, the features are ranked by chi-square statistic from high to low, and the K highest-scoring features are selected, thereby removing features that are most likely independent of the label and irrelevant to the classification purpose.

4. The method for mining the potential purchased commodities and categories of users based on the behavior characteristics of the users as claimed in claim 1, wherein the sample data includes the behavior characteristic data of a certain user for a certain commodity in a period of time, the commodities, and the characteristic data of the user.

5. The method for mining the potential purchased goods and categories of users based on the behavior characteristics of the users as claimed in claim 1, wherein the dynamic undersampling process is to extract a part of samples by a certain method aiming at the part of samples with too large number of samples so as to harmonize the proportion imbalance between the part of samples and other samples.

6. The method for mining the potential purchase goods and categories of the users based on the behavior characteristics of the users as claimed in claim 5, wherein a certain number of subsamples are extracted from the negative samples and combined with the positive samples into a new sample set by means of random extraction.

7. The method for mining the potential purchased goods and categories of users based on the user behavior characteristics as claimed in claim 1, wherein the decision tree models adopted are the RF and GBDT algorithms.

8. The method for mining the potential purchased goods and categories of users based on the user behavior characteristics according to claim 1, wherein the process of generating the fusion prediction model by the plurality of single prediction models in a stacking manner comprises the following steps:

Firstly, selecting an untrained decision tree model;

9. The method for mining the potential purchase goods and categories of users based on the user behavior characteristics as claimed in claim 1, wherein the expected value is set using F1-score.

10. The method of claim 1, wherein the input of the fused prediction model comprises the user's behavior characteristics toward a certain commodity within a fixed time before the prediction date, the commodity and its category, and the prediction output is whether the user will purchase the commodity.