CN112085525A - User network purchasing behavior prediction research method based on hybrid model - Google Patents

Publication number
CN112085525A
Authority
CN
China
Prior art keywords
model
user
feature
optimal
features
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010918871.7A
Other languages
Chinese (zh)
Inventor
陈曦
丁石丑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Application filed by Changsha University of Science and Technology
Priority to CN202010918871.7A
Publication of CN112085525A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0202 Market predictions or forecasting for commercial activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Accounting & Taxation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Economics (AREA)
  • Software Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a hybrid-model-based method for predicting users' online purchasing behavior, comprising the following steps: extracting behavior data and commodity information data from an online shopping platform, constructing a feature sample set, and processing the data; extracting the features with large weights as category features and converting the category features into numerical features; importing the numerical features into an xgboost model for training and cross-validation to obtain an optimal xgboost model, and taking the predicted value of each leaf node of the optimal xgboost model as a new numerical feature, where the new numerical features correspond to recombined, correlated features that were not extracted; one-hot encoding the new numerical features and splicing them with the original category features to obtain reconstructed features; importing the reconstructed features into an LR model for training to obtain an optimal LR model; and using the optimal LR model to predict whether the user will purchase a specified commodity on the next day. The method improves the accuracy of user purchasing behavior prediction.

Description

User network purchasing behavior prediction research method based on hybrid model
Technical Field
The invention relates to the technical field of machine learning, in particular to a user online purchasing behavior prediction research method based on a hybrid model.
Background
In the current era of explosive growth of internet data, people receive a large amount of information every day, but not all of it is interesting or valuable. Recommendation systems are one approach to this information overload problem: they discover users' personalized needs by analyzing user behavior, recommend commodities to the corresponding users in a personalized manner, and help users find commodities they want but have difficulty finding. Machine learning is a primary component of recommendation systems and is widely applied in them. Among the machine learning algorithms used in practice, the Boosting family is relatively widely applied; Boosting trains its base classifiers serially, so the base classifiers depend on one another. The main algorithms are GBDT (gradient boosting decision tree) and xgboost (extreme gradient boosting). When machine learning is used to recommend for and predict users' online purchasing behavior, the data and features determine the upper limit of what machine learning can achieve, and models and algorithms only approach that limit, so feature extraction and data preprocessing are very important links in recommendation and prediction.
The data sets mainly come from large e-commerce platforms and suffer from problems such as being large and incomplete, overly primitive features, discrete single features, noise, and uneven class distribution. The aim is to find as many potentially correlated combined features as possible, but relevant cross features may be missed, which ultimately degrades the model and biases the prediction results. The traditional recommendation algorithm is mainly collaborative filtering, which mines a user's historical behavior data to discover the user's preferences and predicts products the user may like in order to recommend them. However, traditional collaborative filtering has difficulty handling data sparsity and scalability, and it must compute the similarity between each newly added user and all other users, so improving the computation speed (that is, the training time) of an online recommendation system is a major challenge. In addition, conventional recommendation systems typically train on a feature set with a single model; existing, commonly used tree boosting algorithms are still limited to data sets on the order of millions of examples, and although there has been some discussion of how to parallelize such algorithms, there has been no discussion of how to jointly optimize the system.
Disclosure of Invention
(I) technical problem to be solved
In view of the above problems, the invention provides a hybrid-model-based method for predicting users' online purchasing behavior, which improves prediction accuracy by optimizing the processing of features and data and by using a fusion model composed of an xgboost model and an LR model.
(II) technical scheme
Based on the above technical problem, the invention provides a hybrid-model-based method for predicting users' online purchasing behavior, comprising the following steps:
S1, extracting behavior data and commodity information data from an online shopping platform, constructing a feature sample set, and processing the data, wherein the features of the feature sample set comprise user ID, date, behavior, days, behavior type, behavior count, position space identifier and commodity position information;
S2, extracting the features with large weights as category features and removing the remaining features, wherein the category features comprise user ID, behavior type, date and total behavior count, the removed features comprise the position space identifier of the commodity and the commodity position information, and the category features are converted into numerical features;
S3, importing the numerical features into an xgboost model for training and cross-validation to obtain an optimal xgboost model, wherein the predicted value of each leaf node of the optimal xgboost model is used as a new numerical feature, the new numerical features correspond to recombined, correlated features that were not extracted, and the new numerical features comprise the ranking of the behaviors of a user-commodity pair within the user-category pair and the ranking of the behaviors of a user-commodity pair among all commodities of the user;
S4, one-hot encoding the new numerical features and splicing them with the original category features to obtain reconstructed features;
S5, importing the reconstructed features into an LR model for training to obtain an optimal LR model;
S6, using the optimal LR model to predict whether the user will purchase the specified commodity on the following day.
Further, the features described in step S1 are constructed from the three basic dimensions of user, commodity and commodity category and their combinations. The combined features include u_b_count_in_n and u_bi_count_in_n, which respectively denote the total number of behaviors of the user in the n days before the investigation day and the count of each behavior of the user in the n days before the investigation day, where u denotes the user, b denotes behaviors, bi denotes the i-th behavior, and n denotes the number of days before the investigation day.
Further, the data in step S1 is the behavior data and commodity information of 20,000 users of the online shopping platform, taken with a period of seven days and a window length of seven days.
Further, the data processing described in step S1 includes a normalization process of processing data in accordance with the rows of the feature matrix.
Further, the data processing in step S1 further includes feature data equalization processing, that is, negative samples of the feature data are clustered by K-means, and then are merged with positive samples by down-sampling to obtain relatively equalized feature data.
Further, the behaviors include commodity browsing, commodity collection, adding a commodity to the shopping cart, and commodity purchase.
Further, the xgboost model described in step S3 treats the missing value as a sparse matrix, and does not consider the value of the missing value when the node is split.
Further, the step S3 includes the following steps:
S3.1, dividing the numerical features into a training set and a validation set;
S3.2, importing the training set into the xgboost model for training;
S3.3, customizing an xgboost parameter search function so that the loss reduction after node splitting is maximized, thereby obtaining an optimal xgboost model;
S3.4, substituting the validation set into the optimal xgboost model for cross-validation, returning the optimal number of iterations and the optimal xgboost model, and taking the predicted value of each leaf node of the optimal xgboost model as a new numerical feature.
The invention also discloses a server, comprising:
at least one processor; and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor to invoke the hybrid model-based consumer network purchasing behavior prediction research method of any one of claims 1 to 8.
The present invention also discloses a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the hybrid model-based consumer network purchasing behavior prediction research method of any one of claims 1 to 8.
(III) advantageous effects
The technical scheme of the invention has the following advantages:
(1) The method obtains new numerical features through the xgboost model, which compensates for important features missed during feature extraction. As many cross features (that is, potentially correlated features) as possible are constructed through feature engineering and then spliced with the original category features. The reconstructed features are more comprehensive and more important, and training the LR model on them yields better predictions, so the accuracy of the machine learning model is improved on the feature side.
(2) The negative samples are clustered and down-sampled and then combined with the positive samples into a more balanced data set, which resolves the imbalance between positive and negative samples; missing values are treated as a sparse matrix, which resolves the missing-value problem; and all numerical features of the model are converted to one-hot form, so the accuracy of the machine learning model is improved on the data side.
(3) The invention is a hybrid model composed of an xgboost model and an LR model. Multiple weak classifiers are combined into a strong classifier that transforms the feature data into a new training set, which integrates the advantages of the two models: the xgboost model discovers effective single features and cross features and contributes its nonlinear fitting capability, thereby indirectly strengthening the nonlinear learning capability of LR, while the ability of the LR model to handle ultra-large-scale features helps to obtain more accurate prediction results.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and not to be construed as limiting the invention in any way, and in which:
FIG. 1 is a schematic flow chart of a method for predicting and researching a network purchasing behavior of a user based on a hybrid model according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a new numerical feature obtained by training a CART tree of an xgboost model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of fusion of an xgboost model and an LR model according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
A method for predicting and researching online purchasing behavior of a user based on a hybrid model is disclosed, as shown in FIG. 1, and comprises the following steps:
S1, extracting behavior data and commodity information data from an online shopping platform, constructing a feature sample set, and processing the data, wherein the features of the feature sample set comprise single features and joint features; the single features comprise behaviors, days, behavior types, behavior counts, position space identifiers, commodity position information and the like, and the joint features are combinations of the single features;
S1.1, extracting the data set and constructing the feature sample set: behavior data and commodity information of 20,000 users of an online shopping platform are taken with a period of seven days and a window length of seven days; features are constructed from the three basic dimensions of user, commodity and commodity category and their combinations, giving 106 features in total;
The data sets of e-commerce websites are usually very large and irregular, which considerably affects model accuracy. To reduce the influence of the feature data on model training, potentially relevant attributes are mined as far as possible, missing values are processed, features are converted, and the results are spliced into a final, relatively complete feature set.
The raw data set consists of the behavior data of 20,000 users of an online shopping platform and commodity information at the scale of millions. Because the influence of a user's behavior on purchasing weakens over time, analysis shows that behavior from more than one week earlier has little influence on whether the user purchases on the investigation day, so only feature data within seven days is considered. The data comes from an e-commerce platform whose business is mainly online purchase with offline consumption; purchasing behavior is expected to show a certain periodicity, and the period is preliminarily set to seven days.
For the current business requirement, features are constructed from the three basic dimensions of user, commodity and commodity category and their combinations. A total of 106 features are constructed, comprising single features and joint features. The single features comprise user ID, date, behavior, days, behavior type, behavior count, position space identifier, commodity position information and the like; the joint features are combinations of the single features and are grouped by category as exemplified in Table 1 below. For example, u_bi_count_in_n is the count of behavior bi of the user in the n days before the investigation day, where u denotes the user, b denotes behaviors, bi (b1/b2/b3/b4) denotes one of the four behaviors, namely commodity browsing, commodity collection, adding a commodity to the shopping cart and commodity purchase, and n denotes the number of days before the investigation day.
TABLE 1
[Table 1 is provided as an image in the original publication; its contents are not reproduced in this text.]
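The window-count features named above (u_b_count_in_n and u_bi_count_in_n) can be illustrated with a minimal sketch. The log records, the year 2014, the numeric behavior codes 1 to 4 and the helper name `user_behavior_counts` are all assumptions made for illustration; the source does not specify a data layout.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical behavior log: (user_id, item_id, behavior_code, day).
# Codes 1..4 stand for b1..b4; the year is an assumption.
LOG = [
    ("u1", "i1", 1, date(2014, 11, 25)),
    ("u1", "i1", 3, date(2014, 11, 26)),
    ("u1", "i2", 1, date(2014, 11, 27)),
    ("u1", "i2", 1, date(2014, 11, 20)),  # falls outside a 6-day window
    ("u2", "i1", 4, date(2014, 11, 27)),
]

def user_behavior_counts(log, investigation_day, n):
    """Count each user's behaviors in the n days before the investigation day.

    Returns {user: {"u_b1_count_in_n": ..., ..., "u_b_count_in_n": total}}.
    """
    start = investigation_day - timedelta(days=n)
    per_type = Counter()
    for user, _item, btype, day in log:
        if start <= day < investigation_day:
            per_type[(user, btype)] += 1
    features = {}
    for (user, btype), count in per_type.items():
        row = features.setdefault(
            user, {f"u_b{i}_count_in_{n}": 0 for i in range(1, 5)}
        )
        row[f"u_b{btype}_count_in_{n}"] = count
    for row in features.values():
        # The total over all behavior types is the u_b_count_in_n feature.
        row[f"u_b_count_in_{n}"] = sum(row.values())
    return features

feats = user_behavior_counts(LOG, date(2014, 11, 28), n=6)
```

Here `feats["u1"]` holds one count per behavior type plus the total, which is the shape of a user-dimension row before it is joined with commodity and category features.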
S1.2, carrying out normalization processing on the feature data of the feature sample set:
Features with different measurement scales need to be normalized, using sklearn.preprocessing (StandardScaler()). Normalization here means processing the data by rows of the feature matrix so that sample vectors share a uniform standard when similarity is computed by dot product or another kernel function, that is, every sample vector is converted into a unit vector. (In scikit-learn, this row-wise scaling to unit norm is what Normalizer does; StandardScaler instead standardizes each feature column to zero mean and unit variance.)
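The difference between the two scikit-learn transformers relevant to this step can be seen on a small matrix. This is only an illustrative sketch with made-up numbers; which of the two the authors actually applied is not certain from the text.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy feature matrix: 3 samples (rows) x 2 features (columns).
X = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

# Row-wise: each sample vector is rescaled to unit L2 norm.
row_unit = Normalizer(norm="l2").fit_transform(X)

# Column-wise: each feature is shifted to zero mean and scaled to unit variance.
col_std = StandardScaler().fit_transform(X)
```

After row-wise normalization the first and third samples become identical, because they point in the same direction; column-wise standardization keeps them distinct.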
S1.3, carrying out equalization processing on the characteristic data: clustering negative samples of the feature data through K-means, and then combining the negative samples and the positive samples through downsampling (under-sampling) to obtain relatively balanced feature data:
The ratio of positive to negative samples (purchase versus no purchase) in the feature data obtained by feature construction is about 1:1200, so the data is severely imbalanced and model training can easily fail. This problem is therefore handled with undersampling (also called down-sampling) together with an evaluation criterion based on the F1 score (F1_score): the samples are balanced by reducing the number of samples of the majority class. To avoid the insufficient coverage of the feature space that purely random sampling can cause, the negative samples are first clustered with K-means, a subsample is then drawn from each cluster so that the sampled negatives cover the space comprehensively, and the down-sampled negatives are finally combined with the positive samples to form a balanced data set.
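A minimal sketch of cluster-based undersampling follows. The synthetic data, the number of clusters, the proportional per-cluster quota and the function name `cluster_undersample` are assumptions; the source does not state how many clusters or samples per cluster were used.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_neg, n_keep, n_clusters=5, seed=0):
    """Return indices of roughly n_keep negatives, spread over K-means clusters."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X_neg)
    keep = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if len(members) == 0:
            continue
        # Assumed quota rule: proportional to cluster size, at least one row.
        quota = max(1, round(n_keep * len(members) / len(X_neg)))
        keep.extend(rng.choice(members, size=min(quota, len(members)), replace=False))
    return np.sort(np.array(keep))

# Synthetic stand-in data: 10 positives and 1200 negatives with 4 features.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=2.0, size=(10, 4))
X_neg = rng.normal(loc=0.0, size=(1200, 4))

idx = cluster_undersample(X_neg, n_keep=30)
X_bal = np.vstack([X_pos, X_neg[idx]])
y_bal = np.r_[np.ones(len(X_pos)), np.zeros(len(idx))]
```

Sampling inside each cluster, rather than from the pooled negatives, keeps small regions of the feature space represented in the reduced set.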
And S2, extracting the features with large weights as category features, removing residual features, wherein the category features comprise user IDs, behavior types, dates and behavior quantity totals, the removed residual features comprise position space identifiers of commodities and commodity position information, and the category features are converted into numerical features.
S3, importing the numerical features into an xgboost model for training and cross-validation to obtain an optimal xgboost model, wherein the predicted value of each leaf node of the optimal xgboost model is used as a new numerical feature, and the new numerical features are recombinations of the numerical features in S2;
S3.1, dividing the numerical features into a training set and a validation set;
The numerical features are divided into a training set and a validation set by investigation period. Suppose the training set comprises part 1 (train): features from 11.22 to 11.27, label day 11.28; and part 2 (train): features from 11.29 to 12.04, label day 12.05; and the validation set comprises part 3 (test): features from 12.13 to 12.18, label day 12.19. Each period covers seven days of data, and the investigation day is a Friday; that is, 11.28, 12.05 and 12.19 are the investigation days, and the feature data within the seven days of each period is examined. The training and validation sets should avoid shopping festivals such as "Double 11" and "618", because data from such periods are outliers and hinder accurate prediction.
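The date arithmetic behind these periods can be sketched as follows. The year 2014 is an assumption (the source gives only month and day; 2014 is chosen here because the three investigation days then fall on Fridays), and the helper name `feature_window` is hypothetical.

```python
from datetime import date, timedelta

def feature_window(investigation_day, period_days=7):
    """Return (first_day, last_day) of the feature window before the label day.

    A seven-day period is six days of features followed by the investigation day.
    """
    first = investigation_day - timedelta(days=period_days - 1)
    last = investigation_day - timedelta(days=1)
    return first, last

YEAR = 2014  # assumption, not stated in the source
splits = {
    "part1_train": date(YEAR, 11, 28),
    "part2_train": date(YEAR, 12, 5),
    "part3_valid": date(YEAR, 12, 19),
}
windows = {name: feature_window(day) for name, day in splits.items()}
```

Splitting by period rather than at random keeps every validation label later in time than the training labels, which mirrors how the model would be used.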
S3.2, importing the training set into an xgboost model for training:
The data sets of e-commerce websites generally contain many missing values. The xgboost model handles missing values differently from other tree models: xgboost treats the missing values as a sparse matrix and does not consider their values when a node is split. Instead, the samples with missing values are assigned to the left subtree and to the right subtree in turn, the loss is computed for each case, and the better direction is chosen. If no missing values occur during training, samples with missing values at prediction time are assigned to the right subtree by default.
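The try-both-directions rule described above can be sketched in plain NumPy. This is not the xgboost implementation; it is a simplified illustration for one feature and one fixed threshold, using the usual split gain with assumed values lam=1 and gamma=0, squared-error gradients, and a tie rule that falls back to the right subtree as the text states.

```python
import numpy as np

def gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Loss reduction of a split; larger is better."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

def default_direction(x, g, h, threshold, lam=1.0):
    """Send the missing rows left, then right, and keep the better gain."""
    missing = np.isnan(x)
    left = ~missing & (x < threshold)
    right = ~missing & (x >= threshold)
    G_miss, H_miss = g[missing].sum(), h[missing].sum()
    gain_left = gain(g[left].sum() + G_miss, h[left].sum() + H_miss,
                     g[right].sum(), h[right].sum(), lam)
    gain_right = gain(g[left].sum(), h[left].sum(),
                      g[right].sum() + G_miss, h[right].sum() + H_miss, lam)
    # Ties (for example, no missing rows at all) default to the right subtree.
    return ("left", gain_left) if gain_left > gain_right else ("right", gain_right)

# Toy data: the two rows with a missing feature value are both purchases (y = 1).
x = np.array([1.0, 2.0, 8.0, 9.0, np.nan, np.nan])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
g = 2.0 * (0.0 - y)          # squared-error gradient at a previous prediction of 0
h = np.full_like(y, 2.0)     # squared-error second derivative
direction, best_gain = default_direction(x, g, h, threshold=5.0)
```

In this toy case the missing rows resemble the high-value rows, so sending them right gives the larger gain.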
S3.3, customizing an xgboost parameter search function so that the loss reduction after node splitting is maximized, thereby obtaining an optimal xgboost model:
S3.3.1, derivation of the xgboost objective function
xgboost is a scalable machine learning system for tree boosting. Its main design is a highly scalable end-to-end tree boosting system, and it introduces a distributed approximate tree search algorithm that learns the default direction for missing values. The basic form of the initial objective function of the algorithm is:
$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{1}$$
where $\hat{y}_i^{(t-1)}$ is the predicted value of the first t-1 ensemble learners for sample i; $f_t(x_i)$ is the predicted value of the current learner for the sample; and $\Omega(f_t)$ is the regularization term of the t-th learner, with
$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{2}$$
Note the γ (gamma) and λ (lambda) here, which are parameters defined by the xgboost model itself and can be set when using xgboost. The larger γ is, the stronger the preference for a tree with a simple structure, because trees with more leaf nodes are penalized more heavily. A larger λ likewise favors a simpler tree. In this way the structure of the tree is constrained, the variance of the model is reduced, and overfitting is prevented. The loss function used here is MSE (mean squared error), so the objective function can be written as:
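$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} \left(y_i - \left(\hat{y}_i^{(t-1)} + f_t(x_i)\right)\right)^2 + \Omega(f_t) \tag{3}$$
Applying a second-order Taylor expansion to the objective function gives:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{4}$$
where
$$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right), \qquad h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$$
are the first and second derivatives of the loss function with respect to the prediction of the previous trees, respectively, and
$$l\left(y_i, \hat{y}_i^{(t-1)}\right)$$
is known at the t-th round, so it is a constant that can be dropped.
Formula (4) can therefore be written as:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{5}$$
Defining
$$I_j = \{\, i \mid q(x_i) = j \,\}$$
as the set of samples in leaf node j, and using $f_t(x_i) = w_{q(x_i)}$, formula (5) becomes:
$$\mathrm{Obj}^{(t)} = \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T \tag{6}$$
For a fixed structure of the t-th CART tree (represented by q(x)), all $g_i$ and $h_i$ are known, and the values $w_j$ of the leaf nodes in equation (6) are independent of each other. The optimal value of each leaf node and the corresponding objective function value can therefore be found:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda} \tag{7}$$
$$\mathrm{Obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{8}$$
where
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$
and T is the number of leaf nodes;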
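As a worked check of the closed-form results for a fixed tree structure, namely the leaf value w_j* = -G_j / (H_j + λ) and the objective -1/2 Σ_j G_j² / (H_j + λ) + γT, the following sketch evaluates them for squared-error loss, where g_i = 2(ŷ_i − y_i) and h_i = 2. The toy labels, the leaf assignment and the values λ = 1 and γ = 0.1 are assumptions for illustration.

```python
import numpy as np

def mse_grad_hess(y, y_prev):
    """First and second derivatives of (y - y_hat)^2 at the previous prediction."""
    g = 2.0 * (y_prev - y)
    h = np.full_like(y, 2.0, dtype=float)
    return g, h

def leaf_weights_and_objective(g, h, leaf_of, n_leaves, lam=1.0, gamma=0.0):
    """Optimal leaf values and objective for a fixed assignment of samples to leaves."""
    G = np.bincount(leaf_of, weights=g, minlength=n_leaves)  # sum of g_i per leaf
    H = np.bincount(leaf_of, weights=h, minlength=n_leaves)  # sum of h_i per leaf
    w = -G / (H + lam)
    obj = -0.5 * np.sum(G * G / (H + lam)) + gamma * n_leaves
    return w, obj

y = np.array([1.0, 1.0, 0.0, 0.0])
y_prev = np.zeros(4)               # prediction of the earlier trees
leaf_of = np.array([0, 0, 1, 1])   # q(x_i): leaf index of each sample
g, h = mse_grad_hess(y, y_prev)
w, obj = leaf_weights_and_objective(g, h, leaf_of, n_leaves=2, lam=1.0, gamma=0.1)
```

Leaf 0 holds the two samples with label 1, so G = -4 and H = 4 and its value is 4/5 = 0.8; λ shrinks it below the unregularized mean residual of 1.0.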
S3.3.2, node splitting
In general, it is impossible to enumerate all possible tree structures and choose the best one, so a greedy algorithm is used instead: starting from a single node, the tree is grown by iteratively splitting nodes. The loss reduction after a node is split is:
$$\mathrm{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda} \right] - \gamma \tag{9}$$
where $\frac{G_L^2}{H_L+\lambda}$ is the score of the left subtree, $\frac{G_R^2}{H_R+\lambda}$ is the score of the right subtree, $\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}$ is the score of the node when it is not split, and γ is the complexity penalty introduced by the newly added leaf node. While the candidate split points are scanned, the statistics are accumulated as $G_L \leftarrow G_L + g_j$, $H_L \leftarrow H_L + h_j$, $G_R \leftarrow G - G_L$, $H_R \leftarrow H - H_L$. Equation (9) is used to evaluate split candidates; the goal is to find the feature and the corresponding split value that maximize the loss reduction after splitting. Besides controlling the complexity of the tree, γ also acts as a threshold: a split is made only when the gain from splitting is larger than γ, which provides a pre-pruning effect.
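The greedy scan over split candidates can be sketched for a single feature as follows. This is a simplified illustration of exact greedy split finding, not the xgboost source; it uses the split gain 1/2 [G_L²/(H_L+λ) + G_R²/(H_R+λ) − G²/(H+λ)] − γ, and the toy data and the values of λ and γ are assumptions.

```python
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan the sorted values of one feature; return (threshold, gain) or None."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best = None
    for j in range(len(x) - 1):
        GL += g[j]                      # G_L <- G_L + g_j
        HL += h[j]                      # H_L <- H_L + h_j
        GR, HR = G - GL, H - HL         # G_R <- G - G_L, H_R <- H - H_L
        if x[j] == x[j + 1]:
            continue                    # no threshold fits between equal values
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)) - gamma
        # Pre-pruning: only splits whose gain net of gamma is positive are kept.
        if gain > 0 and (best is None or gain > best[1]):
            best = (0.5 * (x[j] + x[j + 1]), gain)
    return best

x = np.array([1.0, 2.0, 8.0, 9.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
g = 2.0 * (0.0 - y)        # squared-error gradient at a previous prediction of 0
h = np.full_like(y, 2.0)

split = best_split(x, g, h)               # gamma = 0: the best split is kept
pruned = best_split(x, g, h, gamma=1.0)   # gamma = 1: no split clears the threshold
```

With gamma = 0 the scan picks the midpoint between 2 and 8; raising gamma to 1.0 exceeds every candidate's raw gain, so the node stays a leaf.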
S3.4, substituting the validation set into the optimal xgboost model for cross-validation and returning the optimal number of iterations and the optimal xgboost model, wherein the predicted value of each leaf node of the optimal xgboost model is used as a new numerical feature:
During training, a customized parameter search function is used to obtain the optimal model parameters; the validation set is then used to test the xgboost model trained on the training set with those parameters, which yields the optimal number of iterations and the optimal model. The new numerical features are the predicted values of the leaf nodes of the optimal xgboost model, as shown in FIG. 2 and FIG. 3.
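A parameter search that also returns the best number of iterations might look like the sketch below. It deliberately swaps in scikit-learn's GradientBoostingClassifier for xgboost so that the example depends only on scikit-learn; the synthetic data, the candidate grid, the F1 criterion and the function name `search` are assumptions, not the authors' search function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Synthetic, mildly imbalanced data standing in for the real feature set.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.8, 0.2],
                           random_state=0)
X_tr, y_tr, X_va, y_va = X[:400], y[:400], X[400:], y[400:]

def search(param_grid, max_trees=60):
    """Fit one model per candidate; score every boosting stage on the validation set."""
    best = None
    for params in param_grid:
        model = GradientBoostingClassifier(n_estimators=max_trees, random_state=0,
                                           **params)
        model.fit(X_tr, y_tr)
        # staged_predict yields the predictions after 1, 2, ..., max_trees trees.
        scores = [f1_score(y_va, pred, zero_division=0)
                  for pred in model.staged_predict(X_va)]
        n_iter = int(np.argmax(scores)) + 1
        if best is None or scores[n_iter - 1] > best["f1"]:
            best = {"params": params, "n_iter": n_iter, "f1": scores[n_iter - 1]}
    return best

grid = [{"max_depth": d, "learning_rate": lr} for d in (2, 3) for lr in (0.05, 0.2)]
best = search(grid)
```

The returned `n_iter` plays the role of the optimal number of iterations, and `params` identifies the configuration to refit as the optimal model.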
The predicted value of each leaf node of the optimal xgboost model is taken as a new numerical feature, which is a recombined joint feature of the numerical features described in S2. For example, from the feature u_b_count_in_n (i = 1/2/3/4, n = 3/6/10) in the table above, which represents the total number of behaviors of the user in the n days before the investigation day, the xgboost model obtains new numerical features including ui_b_count_rank_in_n_in_uc, the ranking of the behavior count of a user ID-commodity ID pair within the corresponding user ID-commodity category pair, which reflects the user's behavioral preference for each commodity within the category, and ui_b_count_rank_in_n_in_u, the ranking of the behavior count of a user ID-commodity ID pair among all commodities of the user, which reflects the user's behavioral preference for the commodity, as shown in Table 2. Here ui indicates that the behavior type belongs to the user ID. The new numerical features obtained through xgboost recover joint features that were not extracted in step S2 but have a large influence on the prediction model.
TABLE 2
[Table 2 is provided as an image in the original publication; its contents are not reproduced in this text.]
S4, one-hot encoding the new numerical features and splicing them with the original category features to obtain reconstructed features: after the new numerical features are one-hot encoded, they are concatenated in code with the original category features that correspond to the initially constructed numerical features, which completes the reconstruction.
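The splicing in S4 can be sketched with NumPy alone. The sketch assumes that each sample's new feature is identified by the leaf it falls into in each tree, so that one-hot encoding is applied per tree to the leaf identifier; the leaf indices, the original feature values and the helper name `one_hot_leaves` are made up for illustration.

```python
import numpy as np

def one_hot_leaves(leaf_idx):
    """One-hot encode an (n_samples, n_trees) matrix of leaf identifiers, tree by tree."""
    blocks = []
    for t in range(leaf_idx.shape[1]):
        leaves, codes = np.unique(leaf_idx[:, t], return_inverse=True)
        blocks.append(np.eye(len(leaves))[codes])  # one column per distinct leaf
    return np.hstack(blocks)

# Hypothetical leaves reached by 3 samples in 2 trees.
leaf_idx = np.array([[1, 4],
                     [2, 4],
                     [1, 6]])
# Hypothetical original features of the same 3 samples.
original = np.array([[0.5, 1.0],
                     [0.1, 0.0],
                     [0.9, 1.0]])

# Reconstructed features: one-hot leaf columns followed by the original columns.
reconstructed = np.hstack([one_hot_leaves(leaf_idx), original])
```

Each tree contributes as many columns as it has distinct leaves, so the width of the reconstructed matrix grows with the number of trees and leaves rather than with the number of raw features.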
S5, importing the reconstructed features into an LR model for training to obtain an optimal LR model, wherein the LR model is a logistic regression model;
This step fuses the xgboost model with the LR model: the new numerical features and the original category features together form the data set of the LR model, as shown in FIG. 3.
S6, using the optimal LR model to predict whether the user will purchase the specified commodity on the following day.
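Steps S3 to S6 can be put together in an end-to-end sketch. To keep the example dependent only on scikit-learn and SciPy, it uses GradientBoostingClassifier in place of xgboost, and synthetic data in place of the platform data; the model sizes and the 0.5 decision threshold are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, y_tr, X_te, y_te = X[:350], y[:350], X[350:], y[350:]

# S3 (stand-in): boosted trees; apply() gives the leaf reached in every tree.
gbdt = GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=0)
gbdt.fit(X_tr, y_tr)
leaves_tr = gbdt.apply(X_tr)[:, :, 0]   # shape (n_samples, n_trees)
leaves_te = gbdt.apply(X_te)[:, :, 0]

# S4: one-hot encode the leaves and splice them with the original features.
enc = OneHotEncoder(handle_unknown="ignore").fit(leaves_tr)
Z_tr = hstack([enc.transform(leaves_tr), csr_matrix(X_tr)]).tocsr()
Z_te = hstack([enc.transform(leaves_te), csr_matrix(X_te)]).tocsr()

# S5: logistic regression on the reconstructed features.
lr = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)

# S6: purchase probability and a yes/no prediction per sample.
buy_prob = lr.predict_proba(Z_te)[:, 1]
will_buy = buy_prob >= 0.5
```

In practice the tree model and the LR model are often fitted on separate portions of the training data to limit overfitting of the leaf features; the sketch skips that for brevity.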
The LR model is a linear model with low computational complexity, the ability to handle ultra-large-scale features, and good real-time performance, but its nonlinear learning capability is poor. The xgboost model has good nonlinear fitting capability, can treat missing values as a sparse matrix, and is computationally more efficient than a general GBDT model, but its computational complexity is high, it cannot handle large-scale samples, and its real-time performance is poor. The feature data set of user purchasing behavior is large and contains many missing values. Therefore, the features with higher weights are selected as numerical features to build the optimal xgboost model, whose predictions give the new numerical features; these are combined with the original category features with lower weights to obtain the reconstructed features, which are used to build the LR model. The xgboost model and the LR model are thus fused into a hybrid model that exploits the advantages of both: the resulting prediction model for users' online purchasing behavior can handle large-scale features, has better nonlinear learning capability and better real-time performance, and is not affected by missing values. Its AUC (area under the ROC curve, a model evaluation index) is higher than that of a single model, and its F1-score (the harmonic mean of precision and recall) is also higher.
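A short worked example shows why F1 rather than accuracy is the useful criterion at a class ratio near 1:1200. The counts are illustrative only, and the helper name `f1_from_counts` is hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, computed from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One buyer among 1201 samples, and a model that never predicts a purchase.
accuracy_always_no = 1200 / 1201                  # about 0.999, looks excellent
f1_always_no = f1_from_counts(tp=0, fp=0, fn=1)   # 0.0, exposes the failure

# A model that finds 8 of 10 buyers with 2 false alarms.
f1_useful = f1_from_counts(tp=8, fp=2, fn=2)
```

The never-buy model scores almost perfect accuracy but an F1 of zero, while the second model, with precision and recall both at 0.8, reaches an F1 of 0.8.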
In summary, the hybrid-model-based method for predicting users' online purchasing behavior has the following advantages:
(1) the invention obtains new numerical features through the xgboost model, compensating for important features missed during feature extraction; through feature engineering it constructs as many cross features (potentially correlated features) as possible, which are then spliced with the original category features; the reconstructed features are more comprehensive and more informative, and feeding them into the LR model for training yields better predictions, improving the accuracy of the machine learning model at the feature level;
(2) the invention clusters the negative samples and downsamples them so that, together with the positive samples, they form a more balanced data set, which addresses the imbalance between positive and negative samples; missing values are treated as a sparse matrix, which addresses the missing-value problem; and all numerical features of the model are converted to one-hot form, improving the accuracy of the machine learning model at the data level;
(3) the invention is a hybrid model composed of an xgboost model and an LR model; it combines multiple weak classifiers into a strong classifier that transforms the feature data into a new training set, and it integrates the advantages of both models: the xgboost model identifies effective single features and cross features and contributes its nonlinear fitting capacity, thereby indirectly strengthening the nonlinear learning capacity of the LR model, while the LR model contributes its ultra-large-scale feature handling capacity, which helps to obtain more accurate predictions;
(4) the invention uses an xgboost parameter search function to retain the optimal trained model, which addresses the problem of finding the optimal split point in a traditional gradient boosting decision tree, helps to optimize the predicted values of the xgboost model, and benefits the prediction of user purchases;
(5) by customizing the xgboost parameter search function, the invention omits the separate parameter-tuning step, thereby improving training efficiency.
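The sample balancing described in advantage (2) can be sketched as follows. The source states only that the negative samples are clustered with K-means and then downsampled; the per-cluster proportional quota, the number of clusters, the seed, and the toy points used here are assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns the cluster index of every point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def downsample_negatives(negatives, n_keep, k=2, seed=0):
    """Keep about n_keep negatives, drawn from each cluster in proportion to its size."""
    rng = random.Random(seed)
    assignment = kmeans(negatives, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, a in zip(negatives, assignment) if a == c]
        quota = round(n_keep * len(members) / len(negatives))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept

# Toy data (assumed): 8 negatives in two well-separated groups, 2 positives.
negatives = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
]
positives = [[2.0, 2.0], [3.0, 3.0]]

kept_negatives = downsample_negatives(negatives, n_keep=2 * len(positives))
balanced = positives + kept_negatives
```

Sampling from every cluster, rather than from the negatives as a whole, is one way to keep the reduced negative set representative of the different kinds of non-purchasing behavior.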
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A user online purchasing behavior prediction research method based on a hybrid model, characterized by comprising the following steps:
S1, extracting behavior data and commodity information data from the online shopping platform, constructing a feature sample set, and performing data processing, wherein the features of the feature sample set comprise user ID, date, behavior, days, behavior type, behavior count, position space identifier and commodity position information;
S2, extracting the features with large weights as category features and removing the remaining features, wherein the category features comprise user ID, behavior type, date and total behavior count, the removed features comprise the position space identifier of the commodity and the commodity position information, and the category features are converted into numerical features;
S3, importing the numerical features into an xgboost model for training and cross-validation to obtain an optimal xgboost model, wherein the predicted value of each leaf node of the optimal xgboost model is used as a new numerical feature, the new numerical feature being a numerical feature corresponding to recombined, correlated features that were not extracted, and the new numerical features comprising the rank of the behaviors of the user-commodity pair within the user-category pair and the rank of the behaviors of the user-commodity pair among all commodities of the user;
S4, one-hot encoding the new numerical features and splicing them with the original category features to obtain reconstructed features;
S5, importing the reconstructed features into an LR model for training to obtain an optimal LR model;
and S6, using the optimal LR model to predict whether the user will purchase the specified commodity on the following day.
2. The method as claimed in claim 1, wherein the features in step S1 are constructed from the three basic dimensions of user, commodity and commodity category and from their combinations, and the combined features include u_b_count_in_n and u_bi_count_in_n, which respectively represent the total number of behaviors of the user in the n days before the investigation day and the behaviors of the user in the n days before the investigation day, where u represents the category, b represents the behaviors, bi represents the behaviors, and n represents the number of days before the investigation day.
3. The method as claimed in claim 1, wherein the data in step S1 are the behavior data and commodity information of 20,000 users of the online shopping platform, with a period of seven days and a window length of seven days.
4. The method as claimed in claim 1, wherein the data processing in step S1 includes normalization, the normalization being applied to the data row by row over the feature matrix.
5. The method as claimed in claim 4, wherein the data processing in step S1 further includes balancing of the feature data, that is, the negative samples of the feature data are clustered by K-means, then downsampled and combined with the positive samples to obtain relatively balanced feature data.
6. The method as claimed in claim 1, wherein the behavior includes browsing, collecting, adding to cart and purchasing.
7. The method as claimed in claim 1, wherein the xgboost model in step S3 treats the missing values as a sparse matrix, and does not consider the values of the missing values during node splitting.
8. The method as claimed in claim 1, wherein said step S3 includes the steps of:
S3.1, dividing the numerical features into a training set and a validation set;
S3.2, importing the training set into an xgboost model for training;
S3.3, customizing an xgboost parameter search function so that the value of the loss function after splitting is maximized, and obtaining an optimal xgboost model;
and S3.4, substituting the validation set into the optimal xgboost model to perform cross-validation, returning the optimal number of iterations and the optimal xgboost model, and taking the predicted value of each leaf node of the optimal xgboost model as a new numerical feature.
9. A server, comprising:
at least one processor; and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor to invoke the user online purchasing behavior prediction research method based on a hybrid model of any one of claims 1 to 8.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the user online purchasing behavior prediction research method based on a hybrid model of any one of claims 1 to 8.
CN202010918871.7A 2020-09-04 2020-09-04 User network purchasing behavior prediction research method based on hybrid model Pending CN112085525A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010918871.7A CN112085525A (en) 2020-09-04 2020-09-04 User network purchasing behavior prediction research method based on hybrid model

Publications (1)

Publication Number Publication Date
CN112085525A true CN112085525A (en) 2020-12-15

Family

ID=73731421

Country Status (1)

Country Link
CN (1) CN112085525A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107944913A (en) * 2017-11-21 2018-04-20 重庆邮电大学 High potential user's purchase intention Forecasting Methodology based on big data user behavior analysis
CN109741113A (en) * 2019-01-10 2019-05-10 博拉网络股份有限公司 A kind of user's purchase intention prediction technique based on big data
CN109886349A (en) * 2019-02-28 2019-06-14 成都新希望金融信息有限公司 A kind of user classification method based on multi-model fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MING_H: "特征组合之 XGBoost + LR", 《HTTPS://WWW.CNBLOGS.COM/MING-H/P/10897948.HTML》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784787A (en) * 2021-01-29 2021-05-11 南京智数云信息科技有限公司 Device, system and method for analyzing and predicting user behavior based on deep learning
CN112784787B (en) * 2021-01-29 2022-06-17 南京智数云信息科技有限公司 Device, system and method for analyzing and predicting user behavior based on deep learning
CN112990284A (en) * 2021-03-04 2021-06-18 安徽大学 Individual trip behavior prediction method, system and terminal based on XGboost algorithm
CN112990284B (en) * 2021-03-04 2022-11-22 安徽大学 Individual trip behavior prediction method, system and terminal based on XGboost algorithm
CN113095390A (en) * 2021-04-02 2021-07-09 东北大学 Walking stick motion analysis system and method based on cloud database and improved ensemble learning
CN113190696A (en) * 2021-05-12 2021-07-30 百果园技术(新加坡)有限公司 Training method of user screening model, user pushing method and related devices
CN113191821A (en) * 2021-05-20 2021-07-30 北京大米科技有限公司 Data processing method and device
CN113344254A (en) * 2021-05-20 2021-09-03 山西省交通新技术发展有限公司 Method for predicting traffic flow of expressway service area based on LSTM-LightGBM-KNN

Similar Documents

Publication Publication Date Title
CN111222332B (en) Commodity recommendation method combining attention network and user emotion
CN111797321B (en) Personalized knowledge recommendation method and system for different scenes
CN112085525A (en) User network purchasing behavior prediction research method based on hybrid model
CN107944986B (en) Method, system and equipment for recommending O2O commodities
EP4322031A1 (en) Recommendation method, recommendation model training method, and related product
CN112231583B (en) E-commerce recommendation method based on dynamic interest group identification and generation of confrontation network
CN114780831A (en) Sequence recommendation method and system based on Transformer
CN113744032B (en) Book recommendation method, related device, equipment and storage medium
CN110598120A (en) Behavior data based financing recommendation method, device and equipment
CN111429161B (en) Feature extraction method, feature extraction device, storage medium and electronic equipment
CN110727872A (en) Method and device for mining ambiguous selection behavior based on implicit feedback
CN116304299A (en) Personalized recommendation method integrating user interest evolution and gradient promotion algorithm
Li Accurate digital marketing communication based on intelligent data analysis
CN114861050A (en) Feature fusion recommendation method and system based on neural network
CN117216281A (en) Knowledge graph-based user interest diffusion recommendation method and system
CN112819024A (en) Model processing method, user data processing method and device and computer equipment
CN114840745A (en) Personalized recommendation method and system based on graph feature learning and deep semantic matching model
CN114548296A (en) Graph convolution recommendation method based on self-adaptive framework and related device
CN116340643B (en) Object recommendation adjustment method and device, storage medium and electronic equipment
CN116071128A (en) Multitask recommendation method based on multi-behavioral feature extraction and self-supervision learning
CN110956528B (en) Recommendation method and system for e-commerce platform
Wang et al. Jointly modeling intra-and inter-transaction dependencies with hierarchical attentive transaction embeddings for next-item recommendation
CN115618079A (en) Session recommendation method, device, electronic equipment and storage medium
CN112464106B (en) Object recommendation method and device
CN115344794A (en) Scenic spot recommendation method based on knowledge map semantic embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201215