CN112990284B - Individual trip behavior prediction method, system and terminal based on XGboost algorithm - Google Patents

Info

Publication number
CN112990284B
CN112990284B CN202110239454.4A CN202110239454A CN112990284B CN 112990284 B CN112990284 B CN 112990284B CN 202110239454 A CN202110239454 A CN 202110239454A CN 112990284 B CN112990284 B CN 112990284B
Authority
CN
China
Prior art keywords
data
user
prediction
tree
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110239454.4A
Other languages
Chinese (zh)
Other versions
CN112990284A (en
Inventor
张红伟
崔逊龙
戚晓东
谢国豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202110239454.4A priority Critical patent/CN112990284B/en
Publication of CN112990284A publication Critical patent/CN112990284A/en
Application granted granted Critical
Publication of CN112990284B publication Critical patent/CN112990284B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of big-data model prediction, and in particular relates to an individual trip behavior prediction method, system and terminal based on the XGBoost algorithm. The method comprises the following steps: S1: acquiring historical data characterizing user behavior; S2: preprocessing the acquired historical data to obtain a sample data set; S3: constructing a prediction model of individual travel behavior based on the XGBoost algorithm; S4: setting the hyper-parameters of the prediction model, tuning them by stratified three-fold cross-validation, and training the prediction model on the training data set until it meets the evaluation-index requirements; S5: feeding the prediction data set to the trained prediction model as input samples, and obtaining from the model output the prediction result for the user's travel behavior decision. The method, system or terminal addresses the problem that individual trip behavior decisions cannot be accurately predicted in the prior art.

Description

Individual trip behavior prediction method, system and terminal based on XGboost algorithm
Technical Field
The invention belongs to the technical field of big data model prediction, and particularly relates to an individual trip behavior prediction method, system and terminal based on an XGboost algorithm.
Background
In recent years, with the rapid development of the mobile internet and information technology, the importance of data has become increasingly prominent. In building up the digital economy, the flow of data helps drive the flow of technology, capital, talent and materials, and plays an important role in raising social productivity and promoting innovation. For example, data can be used to predict the behavior or needs of an individual, which is widely applied in commercial promotion and advertisement distribution. In principle, people's travel behavior can also be predicted from big data, which would support the handling of some public administration affairs.
In certain scenarios, predicting the travel behavior of individuals in a crowd, so that other social management affairs can be planned in advance on a scientific basis, has become an important problem for the relevant management departments. Such prediction is of great significance for reducing the impact of crowd gathering on industries such as transportation, catering and tourism, and for preventing safety incidents caused by excessive crowding. At present, however, there is no good method that uses existing data to accurately predict individual trip behavior decisions.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an individual travel behavior prediction method, an individual travel behavior prediction system and an individual travel behavior prediction terminal based on an XGboost algorithm.
In order to achieve the purpose, the invention is realized by the following technical scheme:
an individual trip behavior prediction method based on an XGboost algorithm comprises the following steps:
S1: acquiring historical data characterizing user behavior, wherein the historical data comprises basic user information and, for the most recent three months, user call behavior data, user internet behavior data and user trajectory behavior data;
S2: preprocessing the acquired historical data to obtain a sample data set, using part of the sample data set as a pre-training data set and the rest as a prediction data set; wherein the pre-training data set comprises a training set and a test set;
S3: constructing a prediction model of individual travel behavior based on the XGBoost algorithm; the construction method of the prediction model comprises the following steps:
S31: constructing the XGBoost classifier from CART regression trees; trees are added one at a time by repeated feature splitting, and each added tree learns a new function that fits the residual of the previous prediction;
S32: obtaining, for each CART tree, the score of the leaf node into which the sample falls, and summing the scores over all trees to obtain the predicted value of the sample;
S33: constructing the objective function of the algorithm model, wherein the objective function comprises a loss function part and a regularization term part;
S34: optimizing the objective function by additive training, optimizing the CART trees one at a time so that each new tree minimizes the objective function on top of the trees already built, thereby completing the construction of the loss function part;
S35: completing the definition of the regularization term in the objective function by redefining the CART tree, thereby determining the function of the regularization term part;
S36: solving for the optimal value of each leaf node in the CART tree and the corresponding value of the objective function;
S4: setting the hyper-parameters of the prediction model, tuning them by stratified three-fold cross-validation, and training the prediction model on the pre-training data set until it meets the evaluation-index requirements;
S5: feeding the prediction data set to the trained prediction model as input samples, and obtaining from the model output the prediction result for the user's travel behavior decision.
Further, the sources of the historical data acquired in step S1 are open government data and real user data collected by communication operators.
Further, in step S2, the preprocessing of the historical data includes the following steps:
S21: processing the table of basic user information:
S211: finding abnormal values in the data variables, and setting abnormal values smaller than 0 in the time-on-network column to null;
S212: identifying collinearity among the feature variables from their correlation heat map, and deleting the columns whose collinearity value exceeds 0.9;
S213: converting text information in the data to numeric values; terminal operating system types are uniformly divided into Android and iOS, represented by 0 and 1 respectively;
S214: computing the Pearson correlation coefficient between each feature variable and the target variable, and deleting columns whose coefficient is 0; deleting the training-irrelevant columns for terminal model, terminal brand and first use time of the terminal;
S215: aggregating the feature variables by user ID, and filling null values with the mean or the mode;
S22: processing the table of user call behavior:
S221: deleting the training-irrelevant columns for the peer number and the call start time;
S222: aggregating the feature variables by user ID using the sum or the mode, and converting abnormal values in the home long-distance area code and peer-number long-distance area code columns to numeric values; filling missing values with the mode;
S23: processing the table of user internet behavior:
S231: deleting the training-irrelevant columns for application name and data date; one-hot encoding the application category column;
S232: summing the access counts and access traffic by user ID, and aggregating all one-hot-encoded application category features by user ID using the mode;
S233: handling the abnormal values found in the feature variables by capping at three times the interquartile range (IQR);
S234: filling null values in each feature variable with the mode;
S24: processing the table of user trajectory behavior:
S241: using the data warehouse tool Hive, marking by user ID whether the current longitude and latitude falls within either of two scenic spots in the province rated 4A or above, and deriving from this three columns for every user: total stay time, scenic-spot stay time and non-scenic-spot stay time;
S242: finding abnormal values in the feature variables, and handling them by capping at three times the interquartile range;
S243: filling null values in the feature variables with the mode;
In the above preprocessing steps, abnormal values of each feature variable are identified by drawing a box plot of the variable.
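The capping and filling steps (S233/S234 and S242/S243) can be illustrated with a minimal sketch. It assumes that "capping at three times the interquartile range" means clipping values to the interval [Q1 - 3*IQR, Q3 + 3*IQR]; the patent does not state whether both tails are clipped, and the function names and sample data here are hypothetical.

```python
import statistics

def cap_outliers_iqr(values, k=3.0):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; None (null) entries pass through."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v if v is None else min(max(v, low), high) for v in values]

def fill_with_mode(values):
    """Replace None entries with the most frequent observed value."""
    present = [v for v in values if v is not None]
    mode = statistics.mode(present)
    return [mode if v is None else v for v in values]

# Hypothetical per-user access counts with one null and one extreme value.
visits = [3, 4, 4, 5, None, 6, 500]
cleaned = fill_with_mode(cap_outliers_iqr(visits))
print(cleaned)
```

Capping is applied before filling so that the extreme value cannot influence the statistic used for imputation; with the mode this rarely matters, but it would if the mean were used instead.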
In step S31, the CART regression tree is assumed to be a binary tree, and its expression is:
$$R_1(j,s) = \{x \mid x^{(j)} \le s\} \quad \text{and} \quad R_2(j,s) = \{x \mid x^{(j)} > s\}$$
in the above formula, $R_1(j,s)$ and $R_2(j,s)$ represent the left subtree and the right subtree respectively, j represents the jth feature in the data, and s represents the split point;
in the binary decision tree, if a node is split on the jth feature of the data, a sample whose feature value is less than or equal to s is assigned to the left subtree, and a sample whose feature value is greater than s is assigned to the right subtree.
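The split rule can be sketched in a few lines. The function name and the sample tuples are illustrative assumptions; only the rule itself (feature j, threshold s, "less than or equal" goes left) comes from the text.

```python
def split_region(samples, j, s):
    """Partition samples into R1 (x[j] <= s, left subtree) and R2 (x[j] > s, right subtree)."""
    left = [x for x in samples if x[j] <= s]
    right = [x for x in samples if x[j] > s]
    return left, right

# Hypothetical samples with two features each.
samples = [(1.0, 20.0), (2.5, 35.0), (4.0, 10.0), (0.5, 50.0)]
left, right = split_region(samples, j=0, s=2.5)
print(left, right)
```

A sample whose feature value equals the split point s lands in the left region, so every sample belongs to exactly one subtree.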
Further, in step S32, the predicted value of a sample is calculated from the tree scores by the following function:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$
in the above formula, $\hat{y}_i$ denotes the predicted value of the ith sample, K denotes the number of trees, $\mathcal{F}$ denotes the set of all CART trees, $f_k$ denotes a specific CART tree, and $f_k(x_i)$ is the score of the leaf node that the sample reaches in the kth tree.
Further, in steps S33-S35, the expression of the objective function is:
$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
in the above formula, l represents the empirical loss function of the tree model, $y_i$ represents the true value of the ith sample, n represents the number of samples, and $\Omega$ represents the regularization term of a regression tree;
wherein the first sum in the above formula is the loss function part and the second sum is the regularization term part;
after second-order expansion and removal of constant terms, the expression of the loss function part at the tth iteration is:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
in the above formula, $g_i$ and $h_i$ are the first-order and second-order partial derivatives of the loss function with respect to the previous prediction for the ith sample;
the regularization term expression is:
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2$$
in the above formula, $\gamma$ and $\lambda$ represent trade-off factors, $\omega_j$ represents the output value of the jth leaf node, and T represents the number of leaf nodes.
Further, in step S36, the optimal value of each leaf node is calculated by the following formula:
$$\omega_j^{*} = -\frac{G_j}{H_j + \lambda}$$
in the above formula, $G_j$ and $H_j$ represent, respectively, the sum of the first-order partial derivatives and the sum of the second-order partial derivatives of the samples contained in leaf node j; both are constants once the tree structure is fixed;
the value of the objective function at this point is calculated by the following formula:
$$\mathrm{Obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
in the above formula, T represents the number of leaf nodes.
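As a numeric illustration of step S36, the sketch below accumulates G_j and H_j per leaf and evaluates the standard XGBoost closed forms w_j = -G_j / (H_j + lambda) and Obj = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T. The function name, the leaf assignment list and the gradient values are hypothetical.

```python
def leaf_weights_and_objective(grads, hessians, leaf_of, num_leaves, lam=1.0, gamma=0.0):
    """Return the optimal leaf values and the objective value for a fixed tree structure."""
    G = [0.0] * num_leaves  # per-leaf sums of first-order derivatives
    H = [0.0] * num_leaves  # per-leaf sums of second-order derivatives
    for g, h, j in zip(grads, hessians, leaf_of):
        G[j] += g
        H[j] += h
    weights = [-G[j] / (H[j] + lam) for j in range(num_leaves)]
    objective = -0.5 * sum(G[j] ** 2 / (H[j] + lam) for j in range(num_leaves))
    objective += gamma * num_leaves
    return weights, objective

# Three samples: the first two fall into leaf 0, the third into leaf 1.
grads = [-2.0, -4.0, 6.0]
hessians = [2.0, 2.0, 2.0]
leaf_of = [0, 0, 1]
weights, objective = leaf_weights_and_objective(
    grads, hessians, leaf_of, num_leaves=2, lam=1.0, gamma=0.5
)
print(weights, objective)
```

Here G = [-6, 6] and H = [4, 2], so the leaf values are 1.2 and -2.0. Raising lam shrinks both values toward zero, and raising gamma penalizes each additional leaf.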
Further, in step S4, the hyper-parameters to be set in the prediction model include the iteration model type, the loss function type, the learning rate, the tree depth, the L1 regularization parameter and the number of iterations;
the evaluation indexes of the prediction model comprise the precision P, the recall R and the F1 value, which are calculated as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times P \times R}{P + R}$$
in the above formulas, TP represents the number of actually positive samples predicted as positive, FP represents the number of actually negative samples predicted as positive, and FN represents the number of actually positive samples predicted as negative.
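The three evaluation indexes can be computed directly from the TP, FP and FN counts. The sketch below assumes binary labels with 1 as the positive class (for example, "the user travels"); the function name and the label vectors are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

With TP = 3, FP = 1 and FN = 1, all three values come out to 0.75. Returning 0.0 when a denominator is zero is a convention chosen for this sketch, not something the patent specifies.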
The invention also comprises an individual travel behavior prediction system based on the XGboost algorithm, wherein the system adopts the individual travel behavior prediction method based on the XGboost algorithm to realize the result prediction of the individual travel behavior; the prediction system comprises:
the data acquisition module, which acquires historical data characterizing the user's recent behavior, wherein the historical data comprises basic user information and, for the most recent three months, user call behavior data, user internet behavior data and user trajectory behavior data; the collected historical data is output to the preprocessing module;
the preprocessing module, which preprocesses the historical data acquired by the data acquisition module to obtain the required sample data set, and outputs the sample data set to the prediction module; and
the prediction module, which, based on the constructed prediction model, trains the prediction model on the pre-training data set in the sample data set, and takes the prediction data set in the sample data set as input to obtain output containing the user travel behavior prediction result.
The invention also comprises an individual travel behavior prediction terminal based on the XGboost algorithm, wherein the terminal comprises a memory, a processor and a computer program which is stored on the memory and can be operated on the processor, and the processor executes the individual travel behavior prediction method based on the XGboost algorithm.
The individual trip behavior prediction method, the individual trip behavior prediction system and the individual trip behavior prediction terminal based on the XGboost algorithm have the following beneficial effects:
1. The required prediction model is constructed based on the XGBoost algorithm, and the user's future travel in the province is predicted from real user big data; the raw user data used for prediction is preprocessed and the key attributes needed by the model are extracted, so that the prediction results are accurate and reliable, and the prediction model performs well on recall and F1 value.
2. The prediction model adopted in the invention is a weighted regression model; it selects features by itself and does not require feature normalization, so the method is easy to operate and reduces the workload and computational burden of data processing.
3. The preprocessing takes abnormal values and missing values in the raw data into account and reduces their influence on the prediction result; attribute selection considers most of the feature variables obtainable from big data; the construction of the prediction model is optimized and improved, and some of the model parameters are tuned. Together, these measures safeguard the accuracy and reliability of the prediction results.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of the individual travel behavior prediction method based on the XGBoost algorithm in embodiment 1 of the present invention;
FIG. 2 is a box plot of the time-on-network feature in embodiment 2 of the present invention;
FIG. 3 is a heat map of the user basic information table in embodiment 2 of the present invention;
FIG. 4 is a table of partial data from the user internet behavior data table in embodiment 2 of the present invention;
FIG. 5 is a box plot of the access count features in the user internet behavior data table in embodiment 2 of the present invention;
FIG. 6 is a table of partial data from the user trajectory behavior data table in embodiment 2 of the present invention;
FIG. 7 is a table of partial data from the derived user trajectory behavior data table in embodiment 2 of the present invention;
FIG. 8 is a box plot of the total stay time in the user trajectory behavior data in embodiment 2 of the present invention;
FIG. 9 is a graph of the F1 value against the number of iterations in the test results of the individual travel behavior prediction method based on the XGBoost algorithm in embodiment 2 of the present invention;
FIG. 10 is a schematic block diagram of the individual trip behavior prediction system based on the XGBoost algorithm in embodiment 3 of the present invention;
Reference numerals: 1. data acquisition module; 2. preprocessing module; 3. prediction module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
Example 1
As shown in fig. 1, the present embodiment provides an individual trip behavior prediction method based on an XGBoost algorithm, and the method includes the following steps:
s1: acquiring historical data for representing user behavior characteristics, wherein the historical data comprises user basic information, user call behavior data of nearly three months, user internet surfing behavior data of nearly three months and user track behavior data of nearly three months; the acquired historical data sources are government open data and real user data collected by communication operators.
S2: preprocessing the acquired historical data to obtain a sample data set, using part of the sample data set as a pre-training data set, and using the rest of the sample data set as a prediction data set; wherein the pre-training data set comprises a training set and a test set; the preprocessing process of the historical data comprises the following steps:
S21: processing the table of basic user information:
S211: finding abnormal values in the data variables, and setting abnormal values smaller than 0 in the time-on-network column to null;
S212: identifying collinearity among the feature variables from their correlation heat map, and deleting the columns whose collinearity value exceeds 0.9;
S213: converting text information in the data to numeric values; terminal operating system types are uniformly divided into Android and iOS, represented by 0 and 1 respectively;
S214: computing the Pearson correlation coefficient between each feature variable and the target variable, and deleting columns whose coefficient is 0; deleting the training-irrelevant columns for terminal model, terminal brand and first use time of the terminal;
S215: aggregating the feature variables by user ID, and filling null values with the mean or the mode;
S22: processing the table of user call behavior:
S221: deleting the training-irrelevant columns for the peer number and the call start time;
S222: aggregating the feature variables by user ID using the sum or the mode, and converting abnormal values in the home long-distance area code and peer-number long-distance area code columns to numeric values; filling missing values with the mode;
S23: processing the table of user internet behavior:
S231: deleting the training-irrelevant columns for application name and data date; one-hot encoding the application category column;
S232: summing the access counts and access traffic by user ID, and aggregating all one-hot-encoded application category features by user ID using the mode;
S233: handling the abnormal values found in the feature variables by capping at three times the interquartile range (IQR);
S234: filling null values in each feature variable with the mode;
S24: processing the table of user trajectory behavior:
S241: using the data warehouse tool Hive, marking by user ID whether the current longitude and latitude falls within either of two scenic spots in the province rated 4A or above, and deriving from this three columns for every user: total stay time, scenic-spot stay time and non-scenic-spot stay time;
S242: finding abnormal values in the feature variables, and handling them by capping at three times the interquartile range;
S243: filling null values in the feature variables with the mode;
In the above preprocessing steps, abnormal values of each feature variable are identified by drawing a box plot of the variable.
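The collinearity filter of step S212 can be sketched without any plotting: compute pairwise Pearson coefficients and drop a column when it correlates with an already kept column above 0.9 in absolute value. Keeping the first column of each collinear pair is an assumption of this sketch, since the patent only says that columns exceeding 0.9 are deleted; the column names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def drop_collinear(columns, threshold=0.9):
    """Keep a column only if |r| <= threshold against every column kept so far."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept

columns = {
    "monthly_fee": [10.0, 20.0, 30.0, 40.0],
    "monthly_fee_x2": [20.0, 40.0, 60.0, 80.0],  # perfectly collinear with monthly_fee
    "online_months": [5.0, 1.0, 4.0, 2.0],
}
kept = drop_collinear(columns)
print(list(kept))
```

The same `pearson` helper could serve step S214 by comparing each feature with the target variable instead of with other features.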
S3: constructing a prediction model of individual travel behavior based on the XGBoost algorithm. XGBoost is an improved learning algorithm built on gradient boosting and decision trees. Its principle is to combine a large number of weak classifiers iteratively into a strong classifier so as to classify accurately. XGBoost is a classical Boosting method; in this embodiment, the weak learners used by XGBoost are CART regression trees.
The construction method of the prediction model comprises the following steps:
S31: the CART regression tree is assumed to be a binary tree whose nodes are split repeatedly on features. If a node is split on the jth feature of the data, a sample whose feature value is less than or equal to s is assigned to the left subtree, and a sample whose feature value is greater than s is assigned to the right subtree. The expression of the CART regression tree is:
$$R_1(j,s) = \{x \mid x^{(j)} \le s\} \quad \text{and} \quad R_2(j,s) = \{x \mid x^{(j)} > s\}$$
in the above formula, $R_1(j,s)$ and $R_2(j,s)$ represent the left subtree and the right subtree respectively, j represents the jth feature in the data, and s represents the split point;
S32: the idea of the XGBoost algorithm is to keep adding trees, growing each tree by repeated feature splitting; each added tree learns a new function that fits the residual of the previous prediction. After K trees have been trained, to predict the score of a sample, the sample falls, according to its features, into one leaf node of each tree; each leaf node has a score, and the scores from all trees are summed to give the predicted value of the sample:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$
in the above formula, $\hat{y}_i$ represents the predicted value of the ith sample, K represents the number of trees, $\mathcal{F}$ represents the set of all CART trees, $f_k$ represents a specific CART tree, and $f_k(x_i)$ is the score of the leaf node that the sample reaches in the kth tree.
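The additive prediction of step S32 can be sketched with depth-one trees (stumps): each tree maps a sample to a leaf score, and the prediction is the sum over trees. The stump constructor, thresholds and leaf scores below are hypothetical values chosen for illustration.

```python
def make_stump(j, s, left_score, right_score):
    """Build a one-split tree: score left_score if x[j] <= s, else right_score."""
    def tree(x):
        return left_score if x[j] <= s else right_score
    return tree

def predict(trees, x):
    """Predicted value of a sample: the sum of its leaf scores over all trees."""
    return sum(tree(x) for tree in trees)

trees = [
    make_stump(j=0, s=2.5, left_score=0.4, right_score=-0.3),
    make_stump(j=1, s=30.0, left_score=-0.1, right_score=0.25),
]
score = predict(trees, (1.0, 45.0))
print(score)
```

The sample reaches the left leaf of the first tree (0.4) and the right leaf of the second (0.25), so its predicted value is 0.65. For a classification task such as a travel decision, this raw sum would typically be passed through a sigmoid to obtain a probability, which is an assumption beyond what the text states.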
S33: an objective function is set to judge whether the parameters of the algorithm are optimal. In this embodiment, the XGBoost objective function is defined as:
$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
in the above formula, l represents the empirical loss function of the tree model, $y_i$ represents the true value of the ith sample, n represents the number of samples, and $\Omega$ represents the regularization term of a regression tree;
the formula has two parts: the first sum is the loss function, which measures the difference between the predicted score and the true score, and the second sum is the regularization term.
S34: once the model is defined it must be trained on data, and the optimal parameters are found by minimizing the objective function. In this embodiment the objective function is optimized step by step using additive training, as follows:
S341: the first CART tree is optimized first, then the second, and so on until the Kth tree has been optimized; the prediction is built up as follows:
$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$$
in the above formula, $\hat{y}_i^{(t)}$ represents the predicted score of sample i after the tth iteration, $\hat{y}_i^{(t-1)}$ represents the predicted score given by the first t-1 trees, and $f_t(x_i)$ represents the functional form of the tth tree;
S342: each optimization step yields an optimal CART tree $f_t(x_i)$, which minimizes the objective function on the basis of the first t-1 trees, i.e. it satisfies the following formula:
$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{constant}$$
in the above formula, constant is the complexity of the first t-1 trees.
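S343: when the loss function is taken to be the mean squared error (MSE), the above expression becomes:
$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} \left(y_i - \left(\hat{y}_i^{(t-1)} + f_t(x_i)\right)\right)^2 + \Omega(f_t) + \mathrm{constant}$$
For a general loss function, a second-order Taylor expansion is applied, and the expression further becomes:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \mathrm{constant}$$
in the above formula:
$$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$$
$$h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$$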
S344: the goal of XGBoost is to minimize the objective function, so after the constant terms are removed, the expression of the constructed loss function is:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
in the above formula, $g_i$ and $h_i$ are the first-order and second-order partial derivatives of the loss function with respect to the previous prediction for the ith sample.
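For concreteness, the derivatives g_i and h_i can be written out for two common losses. The squared error case matches the MSE example in S343; the logistic case is an assumption added here because a travel decision is a binary outcome, and the patent does not name the loss it actually uses. Function names are hypothetical.

```python
import math

def squared_error_grad_hess(y, y_hat):
    """g and h for l = (y - y_hat)^2, differentiated with respect to y_hat."""
    return 2.0 * (y_hat - y), 2.0

def logistic_grad_hess(y, margin):
    """g and h for the log loss, differentiated with respect to the raw score (margin)."""
    p = 1.0 / (1.0 + math.exp(-margin))  # predicted probability of the positive class
    return p - y, p * (1.0 - p)

print(squared_error_grad_hess(3.0, 2.5))
print(logistic_grad_hess(1, 0.0))
```

For the squared error the second derivative is the constant 2, which is why the MSE objective is exactly quadratic in f_t and needs no Taylor approximation. For the log loss both derivatives depend on the current prediction and must be recomputed at every iteration.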
S35: the regularization term in XGBoost is defined by redefining the CART tree;
the CART tree is defined by the following expression:
$$f_t(x) = \omega_{q(x)}, \quad \omega \in \mathbb{R}^{T}, \quad q: \mathbb{R}^{d} \to \{1, 2, \dots, T\}$$
in the above formula, T represents the number of leaf nodes in the tree, $\omega$ is the T-dimensional vector formed by the leaf-node values, q(x) is the mapping that assigns a sample to a leaf node, and $\omega_{q(x)}$ represents the tree's predicted value for the sample.
Under the above definition, the regularization term of XGBoost is defined as:
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2$$
in the above formula, $\gamma$ and $\lambda$ represent trade-off factors, $\omega_j$ represents the output value of the jth leaf node, and T represents the number of leaf nodes;
when the XGBoost model is applied, the values of $\gamma$ and $\lambda$ can be set manually; the larger they are, the simpler the resulting model.
S36: under the new definition, the objective function is further transformed into:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i \omega_{q(x_i)} + \frac{1}{2} h_i \omega_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2$$
in the above formula, $\omega_{q(x_i)}$ represents the tree's predicted value for the ith sample;
simplifying the expression gives:
$$\mathrm{Obj}^{(t)} = \sum_{j=1}^{T} \left[ G_j \omega_j + \frac{1}{2} \left(H_j + \lambda\right) \omega_j^2 \right] + \gamma T$$
in the above formula, $G_j$ and $H_j$ represent, respectively, the sum of the first-order partial derivatives and the sum of the second-order partial derivatives of the samples contained in leaf node j, and both are constants; $\gamma$ and $\lambda$ represent trade-off factors, $\omega_j$ represents the output value of the jth leaf node, and T represents the number of leaf nodes;
after the structure of the t CART tree is determined, G j And H j Are all definite, so that each leaf can be found separately by the following formulaOptimal value of child node:
Figure GDA0003889870110000114
In the above formula, $G_j$ and $H_j$ respectively represent the sum of the first-order partial derivatives and the sum of the second-order partial derivatives of the samples contained in leaf node j, and both are constants;
The value of the objective function at this point is then calculated:
Figure GDA0003889870110000121
in the above equation, T represents the number of leaf nodes.
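The leaf-value and objective-value computations of this step can be sketched in a few lines. The sketch assumes the closed forms of the standard XGBoost derivation, $\omega_j^{*} = -G_j / (H_j + \lambda)$ and $Obj^{*} = -\frac{1}{2}\sum_{j} G_j^{2} / (H_j + \lambda) + \gamma T$, since the formula images are not reproduced here; the function names and the toy numbers are illustrative only.

```python
from collections import defaultdict


def leaf_statistics(g, h, leaf_of_sample):
    """Sum the first- and second-order derivatives per leaf: G_j and H_j."""
    G = defaultdict(float)
    H = defaultdict(float)
    for g_i, h_i, j in zip(g, h, leaf_of_sample):
        G[j] += g_i
        H[j] += h_i
    return dict(G), dict(H)


def optimal_leaf_weights(G, H, lam):
    """Assumed closed form: w_j* = -G_j / (H_j + lambda)."""
    return {j: -G[j] / (H[j] + lam) for j in G}


def structure_score(G, H, lam, gamma):
    """Assumed closed form: Obj* = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T."""
    T = len(G)  # number of leaf nodes
    return -0.5 * sum(G[j] ** 2 / (H[j] + lam) for j in G) + gamma * T


# Toy example (made-up numbers): four samples falling into two leaves.
g = [0.5, -1.5, 2.0, 1.0]      # first-order derivatives g_i
h = [1.0, 1.0, 1.0, 1.0]       # second-order derivatives h_i
leaf_of_sample = [1, 1, 2, 2]  # q(x_i): leaf index assigned to each sample

G, H = leaf_statistics(g, h, leaf_of_sample)
weights = optimal_leaf_weights(G, H, lam=1.0)
score = structure_score(G, H, lam=1.0, gamma=0.1)
```

Because $G_j$ and $H_j$ are fixed once the tree structure is fixed, each leaf value is obtained independently of the others, which is why the computation reduces to one division per leaf.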
S4: Setting the hyper-parameters of the prediction model, adjusting the hyper-parameters of the prediction model by means of a stratified three-fold cross-validator, and training the prediction model with the pre-training data set until the prediction model meets the requirements of the evaluation indexes;
The hyper-parameters to be set in the prediction model comprise the iteration model type, the loss function type, the learning rate, the tree depth, the L1 regularization parameter, and the number of iterations;
The evaluation indexes of the prediction model comprise the precision P, the recall R and the F1 value, which are calculated by the following formulas:
Figure GDA0003889870110000122
Figure GDA0003889870110000123
Figure GDA0003889870110000124
In the above formulas, TP represents the number of samples that are actually positive and are predicted as positive, FP represents the number of samples that are actually negative but are predicted as positive, and FN represents the number of samples that are actually positive but are predicted as negative.
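The three evaluation indexes can be computed directly from the TP, FP and FN counts. The sketch below assumes the usual definitions P = TP/(TP+FP), R = TP/(TP+FN) and F1 = 2PR/(P+R), which agree with the symbol descriptions above; the function name and the counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1) computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts for illustration only.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
```

The guards against a zero denominator are a design choice of this sketch so that a model predicting no positives returns 0.0 instead of raising an error.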
S5: The prediction data set is used as the input samples of the trained prediction model, and the prediction result for the user travel behavior decision is obtained from the output of the prediction model.
Example 2
This embodiment performs simulation tests on the basis of the individual travel behavior prediction method based on the XGBoost algorithm described in Example 1 (in other embodiments, simulation tests may be omitted, or other experimental schemes may be used to determine the relevant parameters and the prediction performance for individual travel behavior).
1. Data source
In this embodiment, real operating data of users of an operator in Anhui are used. Data for 10000 users are provided, of which the data of 7000 users are used for training and the data of 3000 users are used for prediction; the training set and the test set are divided automatically from the 7000 users.
Data files and their description:
In this embodiment, the data files used by the prediction model are as follows:
(1).DataPlus_Public_UserInfo_travel.csv
Description: this data packet contains the basic information of 10000 users; each user has one row of records per month, and every user has records.
(2).DataPlus_Public_Comm_travel.csv
Description: this data packet contains the call data of 10000 users; each user has multiple rows of records, and some users may have no records.
(3).DataPlus_Public_Net_travel.csv
Description: this data packet contains the internet access data of 10000 users; each user has multiple rows of records, and some users may have no records.
(4).Dataplus_Travel_Train_Trail.csv
Description: this data packet contains the trajectory information of 10000 users for the preceding 3 months (including two columns: whether the user appeared at a scenic spot, and the scenic spot name); each user has multiple rows of records, and some users may have no records. The data dictionary for this packet is shown in Table 1:
table 1: dataplus _ Travel _ Train _ tail.csv data dictionary
user_id User identification Sampling&Field desensitization
come_time Time of entry Particle size to minute
leave_time Time of departure Particle size to minute
longitude Longitude (WGS 84) Field desensitization, retention of the last 3 decimal places
latitude Latitude (WGS 84) Field desensitization, reserving the last 3 decimal places
poi_tag Whether or not to go above provincial 4A scenic spot 0: and (5) if not: is that
poi_name Provincial 4A above scenic spot name
(5).Dataplus_Travel_Train_User.csv
Description: this data packet contains the intra-provincial travel information of 7000 users for the subsequent 10 days; each user has one row of records, and every user has records. The data dictionary for this packet is shown in Table 2:
Table 2: Data dictionary of Dataplus_Travel_Train_User.csv
Field | Meaning | Remarks
user_id | User identification | Sampling & field desensitization
in_flag | Intra-provincial travel | 0: no trip; 1: trip
2. Preprocessing process of raw data
2.1 Processing of the user basic information table
Each record in the original user basic information data contains features such as user ID, customer age, attribution city, and attribution display. Each user has one record per month, for a total of 30000 rows by 40 columns.
In this embodiment, a box plot of the data set is drawn as shown in fig. 2. It can be seen that some variables contain outliers; as a first step, the outliers smaller than 0 in the time-on-network column are set to null.
A heat map of all feature variables of the data set is drawn as shown in fig. 3. It can be seen that the collinearity among some of the feature variables is high. When the data is processed, among the columns whose collinearity value exceeds 0.9 only one column is retained and the others are deleted, which reduces the dimension of the matrix.
The text information in the data set is converted into numerical values: the mobile terminal operating system column is divided into the Android system and the iOS system, represented by 0 and 1 respectively. Based on the Pearson correlation coefficients between all feature variables and the target variable, the columns with a coefficient of 0 are deleted, as are the three columns irrelevant to training: the terminal model, the terminal brand, and the first use time of the terminal. Finally, the feature variable information is aggregated by user ID and its null values are filled with the mean or the mode; after processing, a complete DataFrame of 7000 rows by 30 columns is obtained.
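The collinearity filter described above (keep one column out of any group whose pairwise correlation exceeds 0.9) can be sketched in plain Python. The column names and values are hypothetical, and the greedy keep-the-first rule is an assumption; the text does not state which column of a collinear group is retained.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_collinear(columns, threshold=0.9):
    """Keep a column only if |r| <= threshold against every column already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical feature columns (names and values are illustrative).
features = {
    "monthly_fee": [10.0, 20.0, 30.0, 40.0],
    "monthly_fee_with_tax": [11.0, 22.0, 33.0, 44.0],  # collinear with monthly_fee
    "age": [25.0, 52.0, 31.0, 40.0],
}
reduced = drop_collinear(features)
```

Here `monthly_fee_with_tax` is dropped because it is a scaled copy of `monthly_fee`, while `age` is kept because its correlation with the retained column is well below the threshold.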
2.2 Processing of the user call behavior table
Each record in the original user call behavior data contains features such as user ID, peer number code, and call duration. Each user has multiple records, for a total of 3703119 rows by 10 columns covering 9879 users.
In this embodiment, the two columns irrelevant to training, namely the peer number and the call start time, are deleted, and the feature variable information is aggregated by user ID using the sum or the mode. Outliers in the long-distance area code columns of the local-end and peer-end numbers are converted into numerical values, and missing values are filled with the mode. Finally, a complete DataFrame of 7000 rows by 8 columns is obtained.
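The per-user aggregation (sum for numeric columns, mode for categorical columns, mode filling for missing values) can be illustrated as follows. The record layout and the field names `total_duration` and `call_type` are hypothetical and serve only to show the pattern.

```python
from collections import defaultdict
from statistics import mode


def aggregate_calls(records):
    """Collapse many call records per user into one row:
    sum the durations and take the mode of the call type."""
    durations = defaultdict(float)
    call_types = defaultdict(list)
    for user_id, duration, call_type in records:
        durations[user_id] += duration
        if call_type is not None:
            call_types[user_id].append(call_type)
    # Global mode, used to fill users whose categorical values are all missing.
    all_types = [t for types in call_types.values() for t in types]
    fill_value = mode(all_types)
    return {
        uid: {
            "total_duration": durations[uid],
            "call_type": mode(call_types[uid]) if call_types[uid] else fill_value,
        }
        for uid in durations
    }


# Hypothetical records: (user_id, call_duration_seconds, call_type).
records = [
    ("u1", 60, "local"), ("u1", 30, "local"), ("u1", 120, "long_distance"),
    ("u2", 45, None),
    ("u3", 10, "local"),
]
table = aggregate_calls(records)
```

After aggregation every user occupies exactly one row, which is what allows the four source tables to be joined on the user ID.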
2.3 Processing of the user internet behavior table
The original user internet behavior data is shown in fig. 4. Each record contains features such as user ID, application name, and application classification. Each user has multiple records, for a total of 2246083 rows by 6 columns covering 8879 users.
In this embodiment, the two columns irrelevant to training, namely the application name and the data date, are deleted, and the application classification column is expanded into 23 columns by one-hot encoding. The access count and access traffic feature variables are aggregated by user ID using the sum, and all one-hot-encoded application classification feature variables are aggregated using the mode. The box plot of the feature data is drawn as shown in fig. 5; an outlier with a maximum access count of 6000000 is found, and it is handled by capping at three times the interquartile range. Finally, the remaining null values are filled with the mode, and a complete DataFrame of 7000 rows by 26 columns is obtained after processing.
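The outlier capping can be sketched as below. It assumes the common reading of the method, namely clipping values to the interval [Q1 - 3·IQR, Q3 + 3·IQR]; the exact fences and quartile convention used in the experiment are not stated, and the sample values are invented.

```python
from statistics import quantiles


def cap_outliers_iqr(values, k=3.0):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k=3 follows the 'three times' wording."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Illustrative access counts with one extreme value, echoing the 6000000 outlier.
access_counts = [10, 12, 15, 18, 20, 22, 25, 6000000]
capped = cap_outliers_iqr(access_counts)
```

Capping keeps the affected user in the data set and only limits the influence of the extreme value, which matters here because each user contributes a single row after aggregation.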
2.4 Processing of the user trajectory behavior table
The original user trajectory behavior data is shown in fig. 6. Each record contains features such as user ID, time of entry, and time of departure. Each user has multiple records, for a total of 68332348 rows by 7 columns covering 9933 users.
Because the trajectory behavior data set is very large, the data warehouse tool Hive is used: based on the user ID column and the column indicating whether the location is a provincial scenic spot rated 4A or above (poi_tag = 1), each record's longitude and latitude is marked as a scenic spot or not. As shown in fig. 7, the total dwell time, the scenic-spot dwell time and the non-scenic-spot dwell time of every user can then be derived.
A box plot of the feature variables in the data is drawn as shown in fig. 8. An outlier is found whose total dwell time reaches 26490441.0 minutes, far exceeding three months; it is handled by capping at three times the interquartile range. Finally, the remaining null values are filled with the mode, and a complete DataFrame of 7000 rows by 4 columns is obtained after processing.
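The derivation of the three dwell-time columns can be sketched in Python as a stand-in for the Hive processing, which is not reproduced in the text. The timestamp layout in `TIME_FORMAT` is an assumption (the data dictionary only states minute granularity), and the records are invented.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed layout; only minute granularity is stated


def dwell_minutes(records):
    """Return {user_id: [total, scenic, non_scenic]} dwell time in minutes."""
    totals = defaultdict(lambda: [0.0, 0.0, 0.0])
    for user_id, come_time, leave_time, poi_tag in records:
        start = datetime.strptime(come_time, TIME_FORMAT)
        end = datetime.strptime(leave_time, TIME_FORMAT)
        minutes = (end - start).total_seconds() / 60.0
        row = totals[user_id]
        row[0] += minutes                            # total dwell time
        row[1 if poi_tag == 1 else 2] += minutes     # scenic vs. non-scenic
    return dict(totals)


# Hypothetical trajectory records: (user_id, come_time, leave_time, poi_tag).
records = [
    ("u1", "2020-05-01 09:00", "2020-05-01 10:30", 1),
    ("u1", "2020-05-01 11:00", "2020-05-01 11:20", 0),
    ("u2", "2020-05-02 08:00", "2020-05-02 08:45", 0),
]
dwell = dwell_minutes(records)
```

Together with the user ID, the three derived values give the four columns of the final trajectory table.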
3. Model parameter determination
To make the experimental results more general, this embodiment divides the data set into a 20% test set and an 80% training set. The hyper-parameters of the XGBoost algorithm are adjusted using a stratified three-fold cross-validator so that the model is more stable. The specific parameter settings are shown in Table 3:
Table 3: Hyper-parameter settings of the XGBoost model
Figure GDA0003889870110000151
Figure GDA0003889870110000161
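The three-fold scheme used for hyper-parameter adjustment is read here as stratified three-fold cross-validation (the technique that scikit-learn exposes as `StratifiedKFold`); that reading is an assumption based on the wording of the text. A minimal dependency-free sketch of such a splitter, with an illustrative function name and toy labels:

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=3, seed=0):
    """Yield (train_idx, valid_idx) pairs whose class ratios mirror the full set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        valid = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, valid


# Toy labels: six "no trip" (0) users and three "trip" (1) users.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]
splits = list(stratified_k_fold(labels, k=3))
```

Each candidate hyper-parameter setting would be trained on the two training folds and scored on the held-out fold, and the scores averaged over the three splits; stratification keeps the share of travelling users the same in every fold.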
4. Comparison of results of predictive tests
In this embodiment, four prediction models, namely XGBoost, GBDT, LR, and the GBDT+LR fusion model, are used in a comparison test of the prediction results. The evaluation indexes of each model after training are shown in Table 4 below:
Table 4: Comparison of the evaluation indexes of this embodiment with other algorithm models
Model | Precision P | Recall R | F1 value
XGBoost | 0.8809 | 0.9429 | 0.9108
LR | 0.5515 | 0.8002 | 0.6530
GBDT | 0.8378 | 0.9239 | 0.8787
GBDT+LR | 0.8394 | 0.9248 | 0.8800
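As a quick consistency check on the table, the F1 values can be recomputed from the reported P and R. The check assumes F1 is the harmonic mean 2PR/(P+R); the variable names are illustrative, and the numbers are copied from the table above.

```python
# (P, R, F1) as reported in the comparison table.
reported = {
    "XGBoost": (0.8809, 0.9429, 0.9108),
    "LR": (0.5515, 0.8002, 0.6530),
    "GBDT": (0.8378, 0.9239, 0.8787),
    "GBDT+LR": (0.8394, 0.9248, 0.8800),
}


def f1_from(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Largest difference between the recomputed and the reported F1 value.
max_gap = max(abs(f1_from(p, r) - f1) for p, r, f1 in reported.values())
```

The recomputed values agree with the reported ones to within rounding of the four-decimal figures.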
Analysis of the test results shows that, compared with the three control models (GBDT, LR, and the GBDT+LR fusion model), the model provided by this embodiment for predicting users' future intra-provincial travel based on real user big data is superior on every evaluation index in the experiment, including precision, recall and F1 value. The prediction method provided by this embodiment can therefore be considered to address the problem of predicting individual travel behavior in the prior art, and the prediction results obtained by the method have higher accuracy and reliability.
The curve of the F1 value of the prediction model against the number of iterations is plotted in fig. 9. The curve shows that the F1 value of the XGBoost model improves as the number of iterations increases and reaches its maximum at around 1100 iterations.
Example 3
As shown in fig. 10, this embodiment provides an individual travel behavior prediction system based on the XGBoost algorithm. The system adopts the individual travel behavior prediction method based on the XGBoost algorithm of Example 1 to predict individual travel behavior; the prediction system comprises:
a data acquisition module, configured to acquire historical data representing recent behavior characteristics of the user, wherein the historical data comprises user basic information, user call behavior data for the last three months, user internet behavior data for the last three months, and user trajectory behavior data for the last three months; the acquired historical data is output to a preprocessing module;
a preprocessing module, configured to preprocess the historical data acquired by the data acquisition module to obtain the required sample data set; the sample data set is output to a behavior prediction module; and
a prediction module, configured to train the constructed prediction model using the training data set in the sample data set, and to obtain, using the prediction data set in the sample data set as input, an output containing the user travel behavior prediction result.
Example 4
This embodiment provides an individual travel behavior prediction terminal based on the XGBoost algorithm, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the individual travel behavior prediction method based on the XGBoost algorithm of Example 1.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the technical solutions described in the foregoing embodiments. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. An individual travel behavior prediction method based on the XGBoost algorithm, characterized by comprising the following steps:
S1: acquiring historical data representing user behavior characteristics, wherein the historical data comprises user basic information, user call behavior data for the last three months, user internet behavior data for the last three months, and user trajectory behavior data for the last three months;
S2: preprocessing the acquired historical data to obtain a sample data set, using part of the sample data set as a pre-training data set and the rest of the sample data set as a prediction data set, wherein the pre-training data set comprises a training set and a test set;
the preprocessing of the historical data comprises the following steps:
S21: processing of the user basic information table:
S211: finding outliers in the data variables, and setting the outliers smaller than 0 in the time-on-network column to null;
S212: finding the collinearity among the feature variables through a heat map of the feature variables, and deleting all columns in the data whose collinearity value exceeds 0.9;
S213: converting text information in the data into numerical values, wherein the terminal operating system types are uniformly divided into the Android system and the iOS system, represented by 0 and 1 respectively;
S214: deleting columns whose Pearson correlation coefficient is 0 according to the Pearson correlation coefficients between all feature variables and the target variable; and deleting the columns irrelevant to training, namely the terminal model, the terminal brand and the first use time of the terminal;
S215: aggregating the feature variable information by user ID, and filling the null values in the feature variable information with the mean or the mode;
S22: processing of the user call behavior table:
S221: deleting the columns irrelevant to training, namely the peer number and the call start time;
S222: aggregating the feature variables by user ID using the sum or the mode, converting outliers in the long-distance area code columns of the local-end and peer-end numbers into numerical values, and filling missing values with the mode;
S23: processing of the user internet behavior table:
S231: deleting the columns irrelevant to training, namely the application name and the data date; and applying one-hot encoding to the application classification column;
S232: aggregating the access count and access traffic feature variables by user ID using the sum, and aggregating all one-hot-encoded application classification feature variables using the mode;
S233: handling the outliers found in the feature variables by capping at three times the interquartile range;
S234: filling the null values in each feature variable with the mode;
S24: processing of the user trajectory behavior table:
S241: using the data warehouse tool Hive to mark, according to the user ID column and the column indicating whether the location is a provincial scenic spot rated 4A or above, whether the current longitude and latitude is a scenic spot, and deriving on this basis three columns of data: the total dwell time, the scenic-spot dwell time and the non-scenic-spot dwell time of every user;
S242: finding outliers in the feature variables, and handling them by capping at three times the interquartile range;
S243: filling the null values in the feature variables with the mode;
in the above data preprocessing steps, the outliers of all feature variables are found by drawing box plots of the relevant variables;
S3: constructing a prediction model of individual travel behavior based on the XGBoost algorithm; the construction method of the prediction model comprises the following steps:
S31: constructing the XGBoost classifier from CART regression trees; trees are added by continuously performing feature splitting, each added tree learning a new function that fits the residual of the previous prediction;
S32: accumulating the scores of the child nodes within a given CART tree to obtain the total score of that tree, and accumulating the scores of all trees to obtain the predicted value of the sample;
S33: constructing the objective function of the algorithm model, the objective function comprising a loss function part and a regularization term part;
S34: optimizing the objective function step by step using additive training, optimizing each CART tree in turn and minimizing the objective function on the basis of the trees already optimized, thereby completing the construction of the loss function part;
S35: completing the definition of the regularization term in the objective function by redefining the CART tree, thereby determining the function of the regularization term part;
S36: using the above functions to obtain the optimal value of each leaf node in the CART tree and the value of the objective function at that point;
S4: setting the hyper-parameters of the prediction model, adjusting the hyper-parameters of the prediction model by means of a stratified three-fold cross-validator, and training the prediction model with the pre-training data set until the prediction model meets the requirements of the evaluation indexes;
S5: using the prediction data set as the input samples of the trained prediction model, and obtaining the prediction result for the user travel behavior decision from the output of the prediction model.
2. The individual travel behavior prediction method based on the XGBoost algorithm of claim 1, characterized in that: the historical data sources acquired in the step S1 are government open data and real user data collected by a communication carrier.
3. The individual travel behavior prediction method based on the XGBoost algorithm of claim 1, characterized in that: in step S31, the CART regression tree is an assumed binary tree, and its expression is:
$R_1(m,s) = \{x \mid x^{(m)} \le s\}$ and $R_2(m,s) = \{x \mid x^{(m)} > s\}$
in the above formula, $R_1(m,s)$ and $R_2(m,s)$ respectively represent the left subtree and the right subtree, m represents the m-th feature in the data, and s represents the cut point;
in the decision binary tree, if a tree node is split on the m-th feature in the data, a sample is assigned to the left subtree when its feature value is less than or equal to s, and to the right subtree when its feature value is greater than s.
4. The individual travel behavior prediction method based on the XGBoost algorithm of claim 1, characterized in that: in step S32, the score of a certain tree is calculated by using the following function:
Figure FDA0003850313020000031
in the above formula,
Figure FDA0003850313020000032
represents the predicted value of the i-th sample, K represents the number of trees, F represents the set of all CART trees, f represents a specific CART tree, and $f_k(x_i)$ represents the score obtained by the sample at the leaf node of the k-th tree.
5. The individual travel behavior prediction method based on the XGBoost algorithm of claim 1, characterized in that: in the steps S33 to S35, the expression of the objective function is:
Figure FDA0003850313020000033
in the above formula, l represents the empirical loss function of the tree model, $y_i$ represents the true value of the i-th sample, and Ω represents the regularization term of the regression tree;
wherein the left part of the above formula is the loss function and the right part is the regularization term;
the function expression of the loss function is:
Figure FDA0003850313020000034
in the above formula, $g_i$ represents the first-order partial derivative of the loss function for the i-th sample, and $h_i$ represents the corresponding second-order partial derivative;
the regularization term expression is:
Figure FDA0003850313020000035
in the above formula, γ and λ represent trade-off factors, $\omega_j$ represents the average output value of the j-th leaf node, and T represents the number of leaf nodes.
6. The individual travel behavior prediction method based on the XGBoost algorithm of claim 1, characterized in that: in step S36, the optimal value of each leaf node is calculated by using the following formula:
Figure FDA0003850313020000041
in the above formula, $G_j$ and $H_j$ respectively represent the sum of the first-order partial derivatives and the sum of the second-order partial derivatives of the samples contained in leaf node j, and both are constants;
the value of the objective function at this time is calculated by the following equation:
Figure FDA0003850313020000042
in the above equation, T represents the number of leaf nodes.
7. The individual travel behavior prediction method based on the XGBoost algorithm of claim 1, characterized in that: in step S4, the hyper-parameters to be set in the prediction model include the iteration model type, the loss function type, the learning rate, the tree depth, the L1 regularization parameter, and the number of iterations;
the evaluation indexes of the prediction model comprise the precision P, the recall R and the F1 value, which are calculated by the following formulas:
Figure FDA0003850313020000043
Figure FDA0003850313020000044
Figure FDA0003850313020000045
in the above formulas, TP represents the number of samples that are actually positive and are predicted as positive, FP represents the number of samples that are actually negative but are predicted as positive, and FN represents the number of samples that are actually positive but are predicted as negative.
8. An individual travel behavior prediction system based on the XGBoost algorithm, characterized in that the system adopts the individual travel behavior prediction method based on the XGBoost algorithm according to any one of claims 1-7 to predict individual travel behavior; the prediction system comprises:
a data acquisition module, configured to acquire historical data representing recent behavior characteristics of the user, wherein the historical data comprises user basic information, user call behavior data for the last three months, user internet behavior data for the last three months, and user trajectory behavior data for the last three months; the historical data is output to a preprocessing module;
a preprocessing module, configured to preprocess the historical data acquired by the data acquisition module to obtain the required sample data set; the sample data set is output to a behavior prediction module; and
a prediction module, configured to train the constructed prediction model using the pre-training data set in the sample data set, and to obtain, using the prediction data set in the sample data set as input, an output containing the user travel behavior prediction result.
9. An individual travel behavior prediction terminal based on the XGBoost algorithm, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that: the processor executes the individual travel behavior prediction method based on the XGBoost algorithm according to any one of claims 1 to 7.
CN202110239454.4A 2021-03-04 2021-03-04 Individual trip behavior prediction method, system and terminal based on XGboost algorithm Active CN112990284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110239454.4A CN112990284B (en) 2021-03-04 2021-03-04 Individual trip behavior prediction method, system and terminal based on XGboost algorithm

Publications (2)

Publication Number Publication Date
CN112990284A CN112990284A (en) 2021-06-18
CN112990284B true CN112990284B (en) 2022-11-22

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469469A (en) * 2021-09-02 2021-10-01 杭州华网信息技术有限公司 Student physical ability score prediction method based on sectional loss function
CN113762805A (en) * 2021-09-23 2021-12-07 国网湖南省电力有限公司 Mountain forest fire early warning method applied to power transmission line
CN113837383B (en) * 2021-10-18 2023-06-23 中国联合网络通信集团有限公司 Model training method and device, electronic equipment and storage medium
CN114034375B (en) * 2021-10-26 2024-06-11 三峡大学 Ultra-high voltage transmission line noise measurement system and method
CN114035468B (en) * 2021-11-08 2024-05-28 山东理工大学 Method and system for predictively monitoring overhaul flow of fan based on XGBoost algorithm
CN114235740A (en) * 2021-11-12 2022-03-25 华南理工大学 XGboost model-based waste plastic spectrum identification method
CN114253242B (en) * 2021-12-21 2023-12-26 上海纽酷信息科技有限公司 VPN-based cloud equipment data acquisition system for Internet of things
CN114898818B (en) * 2022-04-06 2024-06-18 中国石油大学(北京) Mixed crude oil condensation point prediction model training method, device and application method
CN114692515B (en) * 2022-06-01 2022-09-02 中材邦业(杭州)智能技术有限公司 Soft measurement method for clinker free calcium content based on time lag XGBOOST model

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766911A (en) * 2018-12-04 2019-05-17 深圳先进技术研究院 A kind of behavior prediction method
CN109783686A (en) * 2019-01-21 2019-05-21 广州虎牙信息科技有限公司 Behavioral data processing method, device, terminal device and storage medium
CN110084630A (en) * 2019-03-05 2019-08-02 浙江工业大学之江学院 The user's tourism trip intention and type prediction method of decision tree are promoted based on gradient
CN110472817A (en) * 2019-07-03 2019-11-19 西北大学 A kind of XGBoost of combination deep neural network integrates credit evaluation system and its method
CN110647929A (en) * 2019-09-19 2020-01-03 京东城市(北京)数字科技有限公司 Method for predicting travel destination and method for training classifier
CN111047425A (en) * 2019-11-25 2020-04-21 中国联合网络通信集团有限公司 Behavior prediction method and device
CN111079968A (en) * 2018-10-22 2020-04-28 昆山炫生活信息技术股份有限公司 Scenic spot playing track prediction system based on multi-feature fusion
AU2020100709A4 (en) * 2020-05-05 2020-06-11 Bao, Yuhang Mr A method of prediction model based on random forest algorithm
CN111367575A (en) * 2018-12-06 2020-07-03 北京嘀嘀无限科技发展有限公司 User behavior prediction method and device, electronic equipment and storage medium
CN111999649A (en) * 2020-08-20 2020-11-27 浙江工业大学 XGboost algorithm-based lithium battery residual life prediction method
CN112085525A (en) * 2020-09-04 2020-12-15 长沙理工大学 User network purchasing behavior prediction research method based on hybrid model
CN112232892A (en) * 2020-12-14 2021-01-15 南京华苏科技有限公司 Method for mining accessible users based on satisfaction of mobile operators

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9396253B2 (en) * 2013-09-27 2016-07-19 International Business Machines Corporation Activity based analytics

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079968A (en) * 2018-10-22 2020-04-28 昆山炫生活信息技术股份有限公司 Scenic spot playing track prediction system based on multi-feature fusion
CN109766911A (en) * 2018-12-04 2019-05-17 深圳先进技术研究院 A kind of behavior prediction method
WO2020114302A1 (en) * 2018-12-04 2020-06-11 深圳先进技术研究院 Behavior prediction method
CN111367575A (en) * 2018-12-06 2020-07-03 北京嘀嘀无限科技发展有限公司 User behavior prediction method and device, electronic equipment and storage medium
CN109783686A (en) * 2019-01-21 2019-05-21 广州虎牙信息科技有限公司 Behavioral data processing method, device, terminal device and storage medium
CN110084630A (en) * 2019-03-05 2019-08-02 浙江工业大学之江学院 The user's tourism trip intention and type prediction method of decision tree are promoted based on gradient
CN110472817A (en) * 2019-07-03 2019-11-19 西北大学 A kind of XGBoost of combination deep neural network integrates credit evaluation system and its method
CN110647929A (en) * 2019-09-19 2020-01-03 京东城市(北京)数字科技有限公司 Method for predicting travel destination and method for training classifier
CN111047425A (en) * 2019-11-25 2020-04-21 中国联合网络通信集团有限公司 Behavior prediction method and device
AU2020100709A4 (en) * 2020-05-05 2020-06-11 Bao, Yuhang Mr A method of prediction model based on random forest algorithm
CN111999649A (en) * 2020-08-20 2020-11-27 浙江工业大学 XGboost algorithm-based lithium battery residual life prediction method
CN112085525A (en) * 2020-09-04 2020-12-15 长沙理工大学 User network purchasing behavior prediction research method based on hybrid model
CN112232892A (en) * 2020-12-14 2021-01-15 南京华苏科技有限公司 Method for mining accessible users based on satisfaction of mobile operators

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
An Efficient Predictive Analysis Model of Customer Purchase Behavior using Random Forest and XGBoost Algorithm;Subhatav Dhali;《2020 IEEE 1st International Conference for Convergence in Engineering (ICCE)》;20201221;416-421 *
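Research on User Consumption Behavior Prediction Based on Improved XGBoost Algorithm;Wang XingFen et al.;《2018 IEEE International Conference on Big Data (Big Data)》;20190124;4169-4175 *
Research on Spark-based big data analysis of e-commerce user behavior;周伟坤;《China Master's Theses Full-text Database, Information Science and Technology》;20200215;Vol. 2020 (No. 2);I138-1028 *
Research on the analysis and prediction of online learning behavior;吴文凯;《China Master's Theses Full-text Database, Social Sciences II》;20210215;Vol. 2021 (No. 2);H127-291 *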

Also Published As

Publication number Publication date
CN112990284A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
CN112990284B (en) Individual trip behavior prediction method, system and terminal based on XGboost algorithm
Zhang et al. A feature selection and multi-model fusion-based approach of predicting air quality
CN113919448B (en) Method for analyzing influence factors of carbon dioxide concentration prediction at any time-space position
CN111027629B (en) Power distribution network fault power failure rate prediction method and system based on improved random forest
CN105718490A (en) Method and device for updating classifying model
Osojnik et al. Tree-based methods for online multi-target regression
CN111611488B (en) Information recommendation method and device based on artificial intelligence and electronic equipment
JP2004157814A (en) Decision tree generating method and model structure generating device
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
KR102009284B1 (en) Training apparatus for training dynamic recurrent neural networks to predict performance time of last activity in business process
CN114169502A (en) Rainfall prediction method and device based on neural network and computer equipment
CN114818353A (en) Train control vehicle-mounted equipment fault prediction method based on fault characteristic relation map
CN115063035A (en) Customer evaluation method, system, equipment and storage medium based on neural network
CN113743453A (en) Population quantity prediction method based on random forest
CN113177644A (en) Automatic modeling system based on word embedding and depth time sequence model
CN117494760A (en) Semantic tag-rich data augmentation method based on ultra-large-scale language model
Banerjee et al. Predictive analysis of taxi fare using machine learning
CN113537607B (en) Power failure prediction method
CN115293827A (en) Novel model interpretability analysis method for assisting fine operation of enterprise
Hou et al. Prediction of learners' academic performance using factorization machine and decision tree
CN111078882A (en) Text emotion measuring method and device
Zhou et al. Bank Customer Classification Algorithm Based on Improved Decision Tree
Rasaizadi et al. Stacking ensemble learning process to predict rural road traffic flow
Ansary Machine Learning for Predicting the Stock Price Direction with Trading Indicators
CN116882711B (en) Government affair hall business handling optimization method and system based on big data analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant