CN112085525A - User network purchasing behavior prediction research method based on hybrid model

Info

Publication number: CN112085525A
Application number: CN202010918871.7A
Authority: CN
Inventors: 陈曦; 丁石丑
Original assignee: Changsha University of Science and Technology
Current assignee: Changsha University of Science and Technology
Priority date: 2020-09-04
Filing date: 2020-09-04
Publication date: 2020-12-15

Abstract

The invention discloses a hybrid-model-based method for predicting users' online purchasing behavior, comprising the following steps: extracting behavior data and commodity information data from an online shopping platform, constructing a feature sample set, and processing the data; extracting the features with high weights as category features and converting them into numerical features; importing the numerical features into an xgboost model for training and cross-validation to obtain an optimal xgboost model, and taking the predicted value of each leaf node of that model as a new numerical feature, the new numerical features being recombinations of correlated features that were not extracted; one-hot encoding the new numerical features and splicing them with the original category features to obtain reconstructed features; importing the reconstructed features into an LR (logistic regression) model for training to obtain an optimal LR model; and using the optimal LR model to predict whether the user will purchase a specified commodity on the next day. The method improves the accuracy of predicting users' purchasing behavior.

Description

User network purchasing behavior prediction research method based on hybrid model

Technical Field

The invention relates to the technical field of machine learning, in particular to a user online purchasing behavior prediction research method based on a hybrid model.

Background

In the current era of explosive growth in internet data, people receive a large amount of information every day, but not all of it is interesting or valuable. Recommendation systems are one approach to this information-overload problem: by analyzing user behavior they discover users' personalized needs, recommend commodities to the corresponding users, and help users find commodities that they want but would otherwise struggle to find. Machine learning is a core component of recommendation systems. Among the machine learning algorithms used in practice, the Boosting family is widely applied; Boosting trains its base classifiers serially, so each base classifier depends on the previous ones. The main algorithms are GBDT (gradient boosting decision tree) and xgboost (extreme gradient boosting). When machine learning is used to recommend commodities and predict users' online purchasing behavior, the data and features determine the upper limit of what can be learned, and the model and algorithm only approach that limit, so feature extraction and data preprocessing are critical steps.

The data sets mainly come from large e-commerce platforms and suffer from problems such as being large and incomplete, overly primitive features, discrete single features, noise, and uneven class distribution. Feature engineering aims to find as many potentially correlated combined features as possible, but correlated cross features may still be missed, which degrades the model and biases the prediction results. Traditional recommendation algorithms are mainly collaborative filtering algorithms, which mine a user's historical behavior data to discover the user's preferences and predict products the user may like. However, traditional collaborative filtering has difficulty handling data sparsity and scalability, and must compute the similarity between each newly added user and all other users, so improving the computation speed, that is, the training time, of an online recommendation system is a major challenge. In addition, conventional recommendation systems typically train on a feature set with a single model; existing tree boosting algorithms in common use are still limited to data sets with millions of samples, and although there has been some discussion of how to parallelize such algorithms, there has been little discussion of how to jointly optimize the system.

Disclosure of Invention

(I) Technical problem to be solved

Based on the above problems, the invention provides a hybrid-model-based method for predicting users' online purchasing behavior, which improves prediction accuracy by optimizing the processing of features and data and by using a fusion model consisting of an xgboost model and an LR model.

(II) technical scheme

Based on the technical problem, the invention provides a user online purchasing behavior prediction research method based on a hybrid model, which comprises the following steps:

S1, extracting behavior data and commodity information data from an online shopping platform, constructing a feature sample set, and performing data processing, wherein the features of the feature sample set comprise user ID, date, behavior, number of days, behavior type, behavior count, position space identifier and commodity position information;

S2, extracting the features with large weights as category features and removing the remaining features, wherein the category features comprise user ID, behavior type, date and total behavior count, the removed features comprise the position space identifier of the commodity and the commodity position information, and the category features are converted into numerical features;

S3, importing the numerical features into an xgboost model for training and cross-validation to obtain an optimal xgboost model, wherein the predicted value of each leaf node of the optimal xgboost model is used as a new numerical feature, the new numerical feature corresponds to a recombination of correlated features that were not extracted, and the new numerical features comprise the ranking of a user-commodity pair's behaviors within the user-category pair and the ranking of a user-commodity pair's behaviors among all commodities of the user;

S4, one-hot encoding the new numerical features and splicing them with the original category features to obtain reconstructed features;

S5, importing the reconstructed features into an LR model for training to obtain an optimal LR model;

S6, predicting, by using the optimal LR model, whether the user will purchase the specified commodity on the next day.

Further, the features in step S1 are constructed from the three basic dimensions of user, commodity and commodity category and their combinations. The combined features include u_b_count_in_n and u_bi_count_in_n, which respectively represent the total number of behaviors of the user in the n days before the investigation day and the count of each type of behavior of the user in the n days before the investigation day, where u denotes the user dimension, b denotes behaviors as a whole, bi denotes the i-th behavior type, and n denotes the number of days before the investigation day.

Further, the data in step S1 are the behavior data and commodity information of 20,000 users of an online shopping platform, with a period of seven days and a window length of seven days.

Further, the data processing in step S1 includes normalization, in which the data are processed row by row of the feature matrix.

Further, the data processing in step S1 also includes feature data balancing, that is, the negative samples of the feature data are clustered by K-means, down-sampled, and then combined with the positive samples to obtain relatively balanced feature data.

Further, the behaviors include commodity browsing, commodity collection, adding to cart and commodity purchase.

Further, the xgboost model in step S3 treats missing values as a sparse matrix and does not consider the values of missing entries when splitting nodes.

Further, the step S3 includes the following steps:

S3.1, dividing the numerical features into a training set and a validation set;

S3.2, importing the training set into the xgboost model for training;

S3.3, customizing an xgboost parameter search function so that the loss reduction after each split is maximized, obtaining an optimal xgboost model;

S3.4, applying the validation set to the optimal xgboost model for cross-validation, returning the optimal number of iterations and the optimal xgboost model, and taking the predicted value of each leaf node of the optimal xgboost model as a new numerical feature.

The invention also discloses a server, comprising:

at least one processor; and at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor to invoke the hybrid model-based consumer network purchasing behavior prediction research method of any one of claims 1 to 8.

The present invention also discloses a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the hybrid model-based consumer network purchasing behavior prediction research method of any one of claims 1 to 8.

(III) advantageous effects

The technical scheme of the invention has the following advantages:

(1) the invention obtains new numerical features through the xgboost model, compensating for important features missed during feature extraction. Feature engineering constructs as many cross features, that is, potentially correlated features, as possible; these are spliced with the original category features, so the reconstructed features are more comprehensive. Training the LR model on them gives better predictions, improving the accuracy of the machine learning model on the feature side;

(2) the invention clusters and down-samples the negative samples so that, together with the positive samples, they form a more balanced data set, which solves the imbalance between positive and negative samples; missing values are treated as a sparse matrix, which solves the missing-value problem; and all numerical features of the model are converted to one-hot form, improving the accuracy of the machine learning model on the data side;

(3) the invention is a hybrid model formed by an xgboost model and an LR model. Multiple weak classifiers are combined into a strong classifier that transforms the feature data into a new training set, integrating the advantages of both models: the xgboost model identifies effective single features and cross features and contributes its nonlinear fitting ability, which indirectly strengthens the nonlinear learning ability of the LR model, while the very-large-scale feature throughput of the LR model helps to obtain more accurate prediction results.

Drawings

The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and not to be construed as limiting the invention in any way, and in which:

FIG. 1 is a schematic flow chart of a method for predicting and researching a network purchasing behavior of a user based on a hybrid model according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating a new numerical feature obtained by training a CART tree of an xgboost model according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of fusion of an xgboost model and an LR model according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

A method for predicting and researching online purchasing behavior of a user based on a hybrid model is disclosed, as shown in FIG. 1, and comprises the following steps:

S1, extracting behavior data and commodity information data from an online shopping platform, constructing a feature sample set, and performing data processing, wherein the features of the feature sample set comprise single features and joint features, the single features comprise behaviors, number of days, behavior types, behavior counts, position space identifiers, commodity position information and the like, and the joint features are combinations of the single features;

S1.1, extracting a data set and constructing a feature sample set: taking the behavior data and commodity information of 20,000 users of an online shopping platform, with a period of seven days and a window length of seven days, features are constructed from the three basic dimensions of user, commodity and commodity category and their combinations, 106 features in total;

E-commerce data sets are usually very large and irregular, which considerably affects model accuracy. To reduce the influence of the feature data on model training, potentially relevant attributes are mined as far as possible, missing values are processed, features are transformed, and the results are spliced into a final, relatively complete feature set.

The raw data set consists of the behavior data of 20,000 users of an online shopping platform and commodity information on the order of millions of items. Because the influence of a user's behavior on purchasing weakens over time, and analysis shows that behavior from more than a week earlier has little influence on whether the user purchases on the investigation day, only feature data within seven days are considered. The data come from an e-commerce platform that is mainly characterized by online purchase with offline consumption, so purchasing behavior is expected to show some periodicity, and the period is preliminarily set to seven days.

For the current business requirement, features are constructed from the three basic dimensions of user, commodity and commodity category and their combinations, 106 features in total, including single features and joint features. The single features include user ID, date, behavior, number of days, behavior type, behavior count, position space identifier, commodity position information and the like; the joint features are combinations of the single features, classified by category as exemplified in Table 1 below. For example, u_bi_count_in_n is the count of each type of behavior of the user in the n days before the investigation day, where u denotes the user dimension, b denotes behaviors as a whole, and bi denotes the i-th behavior, that is, b1/b2/b3/b4 each correspond to one behavior (commodity browsing, commodity collection, adding to cart and commodity purchase), and n denotes the number of days before the investigation day.

TABLE 1
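The count features named above can be illustrated with a minimal sketch. The log rows, the year, and the helper name `count_features` are hypothetical and only serve to show how u_b_count_in_n and u_bi_count_in_n could be derived from a raw behavior log; the source does not give its own implementation.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical behavior log: (user_id, behavior_type, day).
# Behavior types 1..4 stand for b1..b4; all rows are made-up illustration data.
LOG = [
    ("u1", 1, date(2014, 11, 25)),
    ("u1", 1, date(2014, 11, 27)),
    ("u1", 4, date(2014, 11, 27)),
    ("u1", 1, date(2014, 11, 20)),  # falls outside a 3-day window
    ("u2", 2, date(2014, 11, 26)),
]

def count_features(log, user, investigation_day, n):
    """Return (u_b_count_in_n, {i: u_bi_count_in_n}) for one user."""
    start = investigation_day - timedelta(days=n)
    per_type = Counter(
        b for (u, b, d) in log
        if u == user and start <= d < investigation_day
    )
    total = sum(per_type.values())
    return total, {i: per_type.get(i, 0) for i in (1, 2, 3, 4)}

total, by_type = count_features(LOG, "u1", date(2014, 11, 28), n=3)
```

Here `total` plays the role of u_b_count_in_3 and `by_type[i]` the role of u_bi_count_in_3 for the chosen user.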

S1.2, carrying out normalization processing on the feature data of the feature sample set:

Features with different measurement scales need to be normalized. Here sklearn.preprocessing.StandardScaler() is used. Simply put, normalization processes the data row by row of the feature matrix so that sample vectors share a uniform scale when similarity is computed by dot product or another kernel function; that is, every sample vector is converted into a unit vector.
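Two different operations are easy to confuse here. In scikit-learn, `StandardScaler` standardizes each column (feature) to zero mean and unit variance, whereas rescaling each row (sample) to a unit vector is what `Normalizer` does. The sketch below writes both out in plain Python so the difference is visible; the function names and the small matrix are illustrative assumptions, and which of the two the authors actually applied is not settled by the text.

```python
import math

def l2_normalize_rows(matrix):
    """Scale each row (sample) to unit Euclidean length; zero rows stay zero."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm if norm else 0.0 for v in row])
    return out

def standardize_columns(matrix):
    """Give each column (feature) zero mean and unit (population) variance."""
    cols = list(zip(*matrix))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has std 0; fall back to 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in matrix]

X = [[3.0, 4.0], [6.0, 8.0], [0.0, 5.0]]
unit_rows = l2_normalize_rows(X)   # every sample becomes a unit vector
z_cols = standardize_columns(X)    # every feature becomes zero-mean, unit-variance
```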

S1.3, balancing the feature data: the negative samples of the feature data are clustered by K-means, down-sampled (under-sampled), and then combined with the positive samples to obtain relatively balanced feature data:

The ratio of positive to negative samples (purchase versus no purchase) in the constructed feature data is about 1:1200, so the data are severely imbalanced and model training can easily fail. The problem is therefore handled with down-sampling (also called under-sampling) and an evaluation criterion based on the F1 score (F1_score); the samples are balanced by reducing the number of samples of the majority class. To avoid the insufficient coverage of the feature space that purely random sampling would give, the negative samples are first clustered with k-means and a subsample is drawn from each cluster, giving a representative sample of negatives; the down-sampled negatives in the training set are then combined with the positive samples to form a balanced data set.
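The cluster-wise down-sampling step can be sketched as follows. The sketch assumes the cluster labels have already been produced by K-means (they are hard-coded here), and the proportional allocation with at least one sample per cluster is an illustrative choice, not something the text specifies.

```python
import random
from collections import defaultdict

def downsample_by_cluster(negatives, cluster_ids, n_keep, seed=0):
    """Draw about n_keep negatives, spread proportionally across clusters."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for sample, cid in zip(negatives, cluster_ids):
        clusters[cid].append(sample)
    kept = []
    for cid in sorted(clusters):
        members = clusters[cid]
        # Proportional share of the budget, with at least one sample per cluster.
        share = max(1, round(n_keep * len(members) / len(negatives)))
        kept.extend(rng.sample(members, min(share, len(members))))
    return kept

positives = ["p0", "p1"]
negatives = [f"n{i}" for i in range(12)]
cluster_ids = [0] * 6 + [1] * 3 + [2] * 3   # assumed K-means output
balanced = positives + downsample_by_cluster(negatives, cluster_ids, n_keep=4)
```

Because every cluster contributes at least one sample, sparse regions of the negative feature space stay represented even under aggressive down-sampling.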

S2, extracting the features with large weights as category features and removing the remaining features, wherein the category features comprise user ID, behavior type, date and total behavior count, the removed features comprise the position space identifier of the commodity and the commodity position information, and the category features are converted into numerical features.

S3, importing the numerical features into an xgboost model for training and cross-validation to obtain an optimal xgboost model, wherein the predicted value of each leaf node of the optimal xgboost model is used as a new numerical feature, and the new numerical feature is a recombination of the numerical features in S2;

S3.1, dividing the numerical features into a training set and a validation set:

The numerical features are divided according to the investigation period. For example, the training set comprises part 1-train, 11.22-11.27 > 11.28, and part 2-train, 11.29-12.04 > 12.05; the validation set comprises part 3-test, 12.13-12.18 > 12.19. Each period covers seven days, and the investigation day is a Friday, that is, 11.28, 12.05 and 12.19; the feature data of the days leading up to each investigation day are examined. The training and validation sets should avoid shopping festivals such as "Double 11" and "618", because data from those periods are outliers and do not help accurate prediction.
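The period arithmetic behind these splits can be checked with a small sketch. The year is a hypothetical placeholder (the text gives only month and day), and `feature_window` is an illustrative helper name.

```python
from datetime import date, timedelta

def feature_window(investigation_day, n_feature_days=6):
    """Return (first_day, last_day) of the feature window before the label day."""
    last = investigation_day - timedelta(days=1)
    first = investigation_day - timedelta(days=n_feature_days)
    return first, last

YEAR = 2014  # hypothetical; the text gives only month.day
splits = {
    "part 1-train": date(YEAR, 11, 28),
    "part 2-train": date(YEAR, 12, 5),
    "part 3-test": date(YEAR, 12, 19),
}
windows = {name: feature_window(day) for name, day in splits.items()}
```

Each seven-day period is thus six days of features followed by the investigation day whose purchases serve as the label.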

S3.2, importing the training set into an xgboost model for training:

E-commerce data sets generally contain a large number of missing values. The xgboost model handles missing values differently from other tree models: it treats them as entries of a sparse matrix and ignores them when searching for split points. When a node is split, the samples with missing values are assigned in turn to the left subtree and to the right subtree, the loss is computed for each case, and the better direction is chosen. If no missing values appeared during training but appear at prediction time, the samples with missing values are assigned to the right subtree by default.
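A minimal sketch of this default-direction choice is given below. It compares the two placements using the usual structure score G²/(H+λ) built from sums of first and second derivatives; the constant factor 1/2 and the penalty γ are omitted because they do not change which direction wins. The function names and numbers are illustrative assumptions.

```python
def score(G, H, lam):
    """Structure score G^2 / (H + lambda) of one node."""
    return G * G / (H + lam)

def default_direction(left, right, missing, lam=1.0):
    """Pick the side for missing-value samples that gives the larger gain.

    Each argument is a (sum_of_g, sum_of_h) pair for that group of samples.
    """
    (GL, HL), (GR, HR), (GM, HM) = left, right, missing
    parent = score(GL + GR + GM, HL + HR + HM, lam)
    gain_left = score(GL + GM, HL + HM, lam) + score(GR, HR, lam) - parent
    gain_right = score(GL, HL, lam) + score(GR + GM, HR + HM, lam) - parent
    # Ties fall to the right, matching the default stated in the text.
    return ("left", gain_left) if gain_left > gain_right else ("right", gain_right)

direction, gain = default_direction(
    left=(-4.0, 4.0), right=(6.0, 6.0), missing=(-2.0, 2.0)
)
```

In this example the missing-value samples have gradients of the same sign as the left group, so sending them left yields the larger gain.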

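S3.3, customizing an xgboost parameter search function so that the loss reduction after each split is maximized, obtaining an optimal xgboost model:

S3.3.1, Derivation of the xgboost objective function

xgboost is a scalable tree boosting machine learning system. It is designed as a highly scalable end-to-end tree boosting system and introduces a distributed approximate algorithm for split finding, together with a learned default direction for missing values. The basic form of the objective function at round t is:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{1}$$

where $\hat{y}_i^{(t-1)}$ is the prediction of the first t-1 learners of the ensemble for sample i; $f_t(x_i)$ is the prediction of the current learner for the sample; and $\Omega(f_t)$ is the regularization term of the t-th learner, with

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{2}$$

Note that γ and λ are hyperparameters defined by the xgboost model, and their values can be set when using xgboost. The larger γ is, the more a tree with a simple structure is preferred, because trees with more leaf nodes are penalized more heavily. A larger λ likewise favors a tree with a simple structure. In this way the structure of the tree is constrained, the variance of the model is reduced, and overfitting is prevented. The loss function used is MSE (mean squared error), so the objective function can be written as:

$$Obj^{(t)} = \sum_{i=1}^{n} \left(y_i - \left(\hat{y}_i^{(t-1)} + f_t(x_i)\right)\right)^2 + \Omega(f_t) \tag{3}$$

Applying a second-order Taylor expansion to the objective gives:

$$Obj^{(t)} \simeq \sum_{i=1}^{n}\left[l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) \tag{4}$$

where

$$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right), \qquad h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$$

are the first and second derivatives of the loss function with respect to the prediction of the previous trees, and $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ is known at round t, so it is a constant that can be dropped.

Equation (4) can therefore be written as:

$$Obj^{(t)} \simeq \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{5}$$

Defining $I_j = \{\, i \mid q(x_i) = j \,\}$ as the set of samples that fall on leaf node j, and using $f_t(x_i) = w_{q(x_i)}$, the objective becomes:

$$Obj^{(t)} \simeq \sum_{j=1}^{T}\left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T \tag{6}$$

For a given structure (represented by q(x)) of the t-th CART tree, all $g_i$ and $h_i$ are known, and the values $w_j$ of the leaf nodes in equation (6) are independent of each other. The optimal value of each leaf node and the corresponding objective value can therefore be found:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda} \tag{7}$$

$$Obj^{*} = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{8}$$

where $G_j = \sum_{i \in I_j} g_i$, $H_j = \sum_{i \in I_j} h_i$, and T is the number of leaf nodes;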
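As a worked instance of the closed-form leaf value $w_j^{*} = -G_j/(H_j+\lambda)$, the sketch below evaluates it for the squared-error loss $l = (y - \hat{y})^2$, for which $g_i = 2(\hat{y}_i^{(t-1)} - y_i)$ and $h_i = 2$. The sample values and function names are illustrative assumptions.

```python
def mse_grad_hess(y_true, y_prev):
    """g_i and h_i for the squared-error loss l = (y - y_hat)^2."""
    g = [2.0 * (p - y) for y, p in zip(y_true, y_prev)]
    h = [2.0] * len(y_true)
    return g, h

def leaf_weight(g, h, lam):
    """Optimal leaf value w* = -G / (H + lambda)."""
    return -sum(g) / (sum(h) + lam)

def leaf_objective(g, h, lam, gamma):
    """Contribution of one leaf to Obj*: -G^2 / (2 (H + lambda)) + gamma."""
    G, H = sum(g), sum(h)
    return -0.5 * G * G / (H + lam) + gamma

# Three samples that all fall on the same leaf; previous prediction is 0.5.
y_true = [1.0, 0.0, 1.0]
y_prev = [0.5, 0.5, 0.5]
g, h = mse_grad_hess(y_true, y_prev)      # g = [-1, 1, -1], h = [2, 2, 2]
w = leaf_weight(g, h, lam=1.0)            # -(-1) / (6 + 1)
obj = leaf_objective(g, h, lam=1.0, gamma=0.1)
```

The leaf moves the prediction upward (w > 0) because two of the three samples are under-predicted, and λ shrinks the step relative to the unregularized mean residual.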

S3.3.2, Node splitting

In general, it is impossible to enumerate all possible tree structures and then choose the best one, so a greedy algorithm is used instead: start from a single node and iteratively split it to add new nodes to the tree. The loss reduction after splitting a node is:

$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma \tag{9}$$

where $\frac{G_L^2}{H_L + \lambda}$ is the score of the left subtree, $\frac{G_R^2}{H_R + \lambda}$ is the score of the right subtree, $\frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$ is the score of the node when it is not split, and γ is the complexity penalty introduced by the newly added leaf node. While scanning the candidate split points, the statistics are updated as $G_L \leftarrow G_L + g_j$, $H_L \leftarrow H_L + h_j$, $G_R \leftarrow G - G_L$, $H_R \leftarrow H - H_L$. Equation (9) is used to evaluate split candidates; the goal is to find the feature and the corresponding value that maximize the loss reduction after the split. Besides controlling the complexity of the tree, γ also acts as a threshold: a split is chosen only when the loss reduction it brings is larger than γ, which realizes pre-pruning.
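The greedy search over split candidates on a single feature can be sketched as follows. This is an exact enumeration over sorted feature values under the gain criterion described above; the function name and toy data are illustrative, and real implementations add approximate and histogram-based variants that are not shown.

```python
def best_split(values, g, h, lam=1.0, gamma=0.0):
    """Greedy exact search for the threshold with the largest gain on one feature."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    G, H = sum(g), sum(h)
    GL = HL = 0.0
    best_gain, best_threshold = 0.0, None
    for pos, i in enumerate(order[:-1]):
        # Move sample i from the right side to the left side.
        GL += g[i]
        HL += h[i]
        GR, HR = G - GL, H - HL
        nxt = values[order[pos + 1]]
        if nxt == values[i]:
            continue  # cannot split between equal feature values
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:  # a gain <= 0 means the split is pruned
            best_gain, best_threshold = gain, (values[i] + nxt) / 2
    return best_threshold, best_gain

values = [1.0, 2.0, 3.0, 4.0]
g = [-1.0, -1.0, 1.0, 1.0]
h = [2.0, 2.0, 2.0, 2.0]
threshold, gain = best_split(values, g, h, lam=1.0, gamma=0.0)
pruned = best_split(values, g, h, lam=1.0, gamma=1.0)
```

With γ = 0 the split between 2 and 3 separates the negative and positive gradients cleanly; with γ = 1 no candidate clears the threshold, so no split is made.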

S3.4, applying the validation set to the optimal xgboost model for cross-validation, returning the optimal number of iterations and the optimal xgboost model, the predicted value of each leaf node of the optimal xgboost model being used as a new numerical feature:

During training, the custom parameter search function yields the optimal model parameters; the validation set is then used to test the xgboost model with those parameters, giving the optimal number of iterations and the optimal model. The new numerical features are the predicted values of the leaf nodes of the optimal xgboost model, as shown in FIG. 2 and FIG. 3.
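The text does not spell out the custom search function, so the following is only a sketch of one plausible shape for it: for each parameter candidate, track the validation loss per boosting round, stop once it has not improved for a few rounds, and keep the candidate with the lowest loss. The candidate names, loss values and the `patience` setting are all hypothetical.

```python
def best_iteration(val_losses, patience=2):
    """Return (best_round, best_loss), stopping after `patience` rounds without improvement."""
    best_round, best_loss = 0, float("inf")
    for rnd, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_round, best_loss = rnd, loss
        elif rnd - best_round >= patience:
            break
    return best_round, best_loss

# Hypothetical validation losses per boosting round for two parameter candidates.
candidates = {
    "max_depth=3": [0.60, 0.52, 0.49, 0.50, 0.51, 0.40],
    "max_depth=5": [0.58, 0.50, 0.47, 0.46, 0.48, 0.49],
}
results = {name: best_iteration(losses) for name, losses in candidates.items()}
best_params = min(results, key=lambda name: results[name][1])
```

Note that the first candidate stops at round 5 and never sees the lower loss at round 6; this is the usual trade-off of early stopping between training time and the risk of stopping too soon.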

The predicted value of each leaf node of the optimal xgboost model is taken as a new numerical feature, which is a recombined joint feature of the numerical features described in S2. For example, from the feature u_b_count_in_n (i = 1/2/3/4, n = 3/6/10) in the table above, which represents the total number of behaviors of the user in the n days before the investigation day, the new numerical features obtained through the xgboost model include ui_b_count_rank_in_n_in_uc, the rank of the behaviors of a user ID-commodity ID pair within the user ID-commodity category pair, which reflects the user's behavior preference for each commodity within a commodity category, and ui_b_count_rank_in_n_in_u, the rank of the behaviors of a user ID-commodity ID pair among all commodities of the user, which reflects the user's behavior preference for that commodity, as shown in Table 2, where ui indicates that the behavior belongs to a user ID-commodity ID pair. The purpose of the new numerical features obtained by xgboost is to recover joint features that were not extracted in step S2 but have a large influence on the prediction model.

TABLE 2
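To make the meaning of the two rank features concrete, the sketch below computes them directly from behavior counts. This is an explicit calculation for illustration only; in the method described here the equivalent information is obtained through the xgboost model rather than hand-coded. The counts, the item-to-category map and the dense-ranking convention are assumptions.

```python
from collections import defaultdict

def rank_within_group(counts, group_of):
    """Dense-rank each (user, item) count inside its group; rank 1 = most behaviors."""
    groups = defaultdict(list)
    for key, cnt in counts.items():
        groups[group_of(key)].append(cnt)
    ranks = {}
    for key, cnt in counts.items():
        distinct = sorted(set(groups[group_of(key)]), reverse=True)
        ranks[key] = distinct.index(cnt) + 1
    return ranks

# Hypothetical behavior counts per (user, item) and an item -> category map.
counts = {("u1", "i1"): 5, ("u1", "i2"): 2, ("u1", "i3"): 7, ("u2", "i1"): 1}
category = {"i1": "c1", "i2": "c1", "i3": "c2"}

# Rank among all commodities of the user (role of ui_b_count_rank_in_n_in_u).
rank_in_u = rank_within_group(counts, lambda k: k[0])
# Rank within the user-category pair (role of ui_b_count_rank_in_n_in_uc).
rank_in_uc = rank_within_group(counts, lambda k: (k[0], category[k[1]]))
```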

S4, one-hot encoding the new numerical features and splicing them with the original category features to obtain reconstructed features: after the new numerical features are one-hot encoded, they are spliced in code with the original category features corresponding to the initially constructed numerical features.
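A minimal sketch of the encode-and-splice step follows. It assumes each sample has already been mapped to one leaf per tree (with the xgboost Python package, leaf indices can be requested through the `pred_leaf=True` option of `Booster.predict`); the leaf indices, leaf counts and original feature values below are made up.

```python
def one_hot_leaves(leaf_rows, leaves_per_tree):
    """One-hot encode per-tree leaf indices into one flat 0/1 vector per sample."""
    encoded = []
    for row in leaf_rows:
        vec = []
        for leaf, n_leaves in zip(row, leaves_per_tree):
            block = [0] * n_leaves
            block[leaf] = 1
            vec.extend(block)
        encoded.append(vec)
    return encoded

# Hypothetical leaf indices from a 2-tree model (3 and 2 leaves respectively).
leaf_rows = [[0, 1], [2, 0]]
original = [[1.0, 0.5], [0.0, 2.0]]  # stand-in for the original category features

# Splice: original features first, then the one-hot leaf blocks.
reconstructed = [o + e for o, e in zip(original, one_hot_leaves(leaf_rows, [3, 2]))]
```

Each tree contributes exactly one active bit per sample, so the reconstructed vector is sparse and grows with the total number of leaves rather than with the number of raw features.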

S5, importing the reconstructed features into an LR model for training to obtain an optimal LR model, wherein the LR model is a logistic regression model;

This step fuses the xgboost model with the LR model, using the new numerical features together with the original category features as the data set of the LR model, as shown in FIG. 3.
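The LR stage can be illustrated with a tiny logistic regression trained by batch gradient descent on reconstructed features. This is a didactic stand-in under stated assumptions (made-up data, fixed learning rate and epoch count, no regularization), not the training setup used by the authors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [gw + err * xj for gw, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * gw / len(X) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict_proba(w, b, x):
    """Predicted probability that the user purchases."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical reconstructed features: [original feature, leaf one-hot (2 leaves)].
X = [[0.2, 1, 0], [0.1, 1, 0], [0.9, 0, 1], [0.8, 0, 1]]
y = [0, 0, 1, 1]
w, b = train_lr(X, y)
will_buy = [predict_proba(w, b, x) >= 0.5 for x in X]
```

Because the leaf indicators already encode a nonlinear partition of the input space, the linear model only has to learn one weight per leaf, which is how the tree stage lends nonlinearity to the LR stage.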

The LR model is a linear model with very-large-scale feature throughput, good real-time performance and low computational complexity, but poor nonlinear learning ability. The xgboost model has good nonlinear fitting ability and can treat missing values as a sparse matrix; its computational efficiency is better optimized than that of a general GBDT model, but its computational complexity is high, it cannot process large-scale samples at high throughput, and its real-time performance is poor. The feature data set of user purchasing behavior is large and has many missing values. Therefore, the features with higher weights are selected as numerical features to build the optimal xgboost model, whose predictions give the new numerical features; these are combined with the original category features to obtain the reconstructed features, which are used to build the LR model. Fusing the xgboost model and the LR model into a hybrid model exploits the advantages of both, so the resulting model for predicting users' online purchasing behavior can handle large-scale features, has better nonlinear learning ability and real-time performance, and is not affected by missing values. Its AUC (a model evaluation metric) is higher than that of a single model, and its F1-score (the harmonic mean of precision and recall) is also higher.
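Since the F1-score is the evaluation criterion named here, a short sketch of its computation for the positive (purchase) class may help. The labels and predictions are made up; on heavily imbalanced data the last line shows why F1 is more informative than plain accuracy.

```python
def f1_score(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
score = f1_score(y_true, y_pred)

# A model that never predicts a purchase is 50% accurate here but has F1 = 0.
never_buy = f1_score(y_true, [0] * len(y_true))
```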

In summary, the method for predicting and researching the online purchasing behavior of the user based on the hybrid model has the following advantages:

(3) the invention is a hybrid model formed by an xgboost model and an LR model. Multiple weak classifiers are combined into a strong classifier that transforms the feature data into a new training set, integrating the advantages of both models: the xgboost model identifies effective single features and cross features and contributes its nonlinear fitting ability, which indirectly strengthens the nonlinear learning ability of the LR model, and combining this with the very-large-scale feature throughput of the LR model helps to obtain more accurate prediction results;

(4) the invention uses an xgboost parameter search function to retain the optimal trained model, which addresses the problem of finding the optimal split point in a traditional gradient boosting decision tree, helps to optimize the predictions of the xgboost model, and benefits the prediction of user purchases;

(5) the invention customizes the xgboost parameter search function and omits the manual parameter-tuning step, thereby improving training efficiency.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims

1. A user online purchasing behavior prediction research method based on a hybrid model is characterized by comprising the following steps:

2. The method as claimed in claim 1, wherein the features in step S1 are constructed from the three basic dimensions of user, commodity and commodity category and their combinations, and the combined features include u_b_count_in_n and u_bi_count_in_n, which respectively represent the total number of behaviors of the user in the n days before the investigation day and the count of each type of behavior of the user in the n days before the investigation day, where u denotes the user dimension, b denotes behaviors as a whole, bi denotes the i-th behavior type, and n denotes the number of days before the investigation day.

3. The method as claimed in claim 1, wherein the data in step S1 are the behavior data and commodity information of 20,000 users of an online shopping platform, with a period of seven days and a window length of seven days.

4. The method as claimed in claim 1, wherein the data processing in step S1 includes normalization, in which the data are processed row by row of the feature matrix.

5. The method as claimed in claim 4, wherein the data processing in step S1 further includes feature data balancing, that is, the negative samples of the feature data are clustered by K-means, down-sampled, and then combined with the positive samples to obtain relatively balanced feature data.

6. The method as claimed in claim 1, wherein the behaviors include commodity browsing, commodity collection, adding to cart and commodity purchase.

7. The method as claimed in claim 1, wherein the xgboost model in step S3 treats the missing values as a sparse matrix, and does not consider the values of the missing values during node splitting.

8. The method for researching on online purchasing behavior prediction of users based on hybrid model as claimed in claim 1, wherein said step S3 includes the steps of:

S3.2, importing the training set into an xgboost model for training;

9. A server, comprising:

10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the hybrid model-based consumer network purchasing behavior prediction research method of any one of claims 1 to 8.