CN112990284B

CN112990284B - Individual travel behavior prediction method, system and terminal based on XGBoost algorithm

Info

Publication number: CN112990284B
Application number: CN202110239454.4A
Authority: CN
Inventors: 张红伟; 崔逊龙; 戚晓东; 谢国豪
Original assignee: Anhui University
Current assignee: Anhui University
Priority date: 2021-03-04
Filing date: 2021-03-04
Publication date: 2022-11-22
Anticipated expiration: 2041-03-04
Also published as: CN112990284A

Abstract

The invention belongs to the technical field of big-data model prediction, and in particular relates to a method, system and terminal for predicting individual travel behavior based on the XGBoost algorithm. The method includes the following steps. S1: obtain historical data that characterizes user behavior. S2: preprocess the acquired historical data to obtain a sample data set. S3: construct a prediction model of individual travel behavior based on the XGBoost algorithm. S4: set the hyperparameters of the prediction model, tune them with a stratified three-fold cross-validator, and train the prediction model on the training data set until it meets the requirements of the evaluation indices. S5: use the prediction data set as input samples, run the trained prediction model on them, and obtain from the model output the prediction result for the user's travel behavior decision. The method, system or terminal addresses the problem that the prior art cannot accurately predict individual travel behavior decisions.

Description

A method, system and terminal for predicting individual travel behavior based on XGBoost algorithm

Technical Field

The invention belongs to the technical field of big-data model prediction, and in particular relates to a method, system and terminal for predicting individual travel behavior based on the XGBoost algorithm.

Background Art

In recent years, with the rapid development of the mobile Internet and information technology, the importance of data has become increasingly prominent. In building a leading digital economy, data flow drives the flow of technology, capital, talent and materials, and plays an important role in raising social productivity and promoting innovation. For example, data can help predict the behavior or needs of individuals, which is already widely applied in commercial promotion and advertisement distribution. In principle, big data should also make it possible to predict the travel behavior of individuals and thereby assist in handling certain public administration matters.

In some specific scenarios, predicting the travel behavior of individuals in a population, so that other social management matters can be planned in advance on the basis of the prediction, has become an important problem for the relevant administrative departments. This is of great significance for reducing the impact of crowd gathering on transportation, catering, tourism and other industries, and for preventing safety incidents caused by excessive crowding. However, there is currently no satisfactory method that uses existing data to predict individual travel behavior decisions accurately.

Summary of the Invention

In view of the problems in the prior art, the present invention provides a method, system and terminal for predicting individual travel behavior based on the XGBoost algorithm, which can solve the problem that the prior art cannot accurately predict individual travel behavior decisions.

To achieve the above object, the present invention is realized through the following technical solutions:

A method for predicting individual travel behavior based on the XGBoost algorithm, the method comprising the following steps:

S1: Obtain historical data that characterizes user behavior. The historical data includes basic user information, user call behavior data for the last three months, user Internet behavior data for the last three months, and user trajectory behavior data for the last three months;

S2: Preprocess the acquired historical data to obtain a sample data set, and use part of the sample data set as a pre-training data set and the rest as a prediction data set, wherein the pre-training data set includes a training set and a test set;

S3: Construct a prediction model of individual travel behavior based on the XGBoost algorithm. The construction of the prediction model comprises the following steps:

S31: Use CART regression trees to construct the XGBoost classifier; trees are added by repeatedly performing feature splits, so that each new tree learns a new function that fits the residual of the previous prediction;

S32: Add up the scores of the leaf nodes reached in a given CART tree to obtain the score of that tree, then add up the scores of all trees to obtain the predicted value of the sample;

S33: Construct the objective function of the model, the objective function comprising a loss function part and a regularization term part;

S34: Optimize the objective function stage by stage using additive training, optimizing each CART tree in turn so that each newly added tree minimizes the objective function given the trees already built, thereby completing the construction of the loss function part;

S35: Complete the definition of the regularization term in the objective function by redefining the CART tree, and determine the function of the regularization term part;

S36: Use the above functions to obtain the optimal value of each leaf node in the CART tree and the current value of the objective function;

S4: Set the hyperparameters of the prediction model, tune them with a stratified three-fold cross-validator, and train the prediction model on the pre-training data set until the prediction model meets the requirements of the evaluation indices;

S5: Use the prediction data set as input samples, run the trained prediction model on them, and obtain from the model output the prediction result for the user's travel behavior decision.
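The stratified three-fold splitting used for hyperparameter tuning in step S4 can be sketched as follows. This is a minimal illustration only, assuming that the "stratified three-fold" device means ordinary stratified k-fold cross-validation with k = 3; the function name `stratified_kfold` and the toy labels are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def stratified_kfold(labels, k=3):
    """Yield (train_idx, valid_idx) pairs with class ratios roughly preserved."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the samples of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        valid = sorted(folds[i])
        train = sorted(idx for j in range(k) if j != i for idx in folds[j])
        yield train, valid


# Toy labels (1 = travels, 0 = does not travel); illustrative values only.
labels = [1, 0, 0, 1, 0, 0, 1, 0, 0]
splits = list(stratified_kfold(labels, k=3))
for train_idx, valid_idx in splits:
    print("train:", train_idx, "valid:", valid_idx)
```

Each candidate hyperparameter setting would then be trained on the `train` indices and scored on the `valid` indices of every fold, and the setting with the best average score retained.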

Further, the historical data acquired in step S1 comes from open government data and from real user data collected by telecommunications operators.

Further, in step S2, the preprocessing of the historical data includes the following steps:

S21: Processing of the basic user information table:

S211: Identify outliers in the data variables, and set outliers below 0 in the time-on-network column to null;

S212: Identify collinearity between feature variables from a heat map of the feature variables, and delete all columns whose collinearity value exceeds 0.9;

S213: Convert text information in the data to numeric form; terminal operating system types are uniformly divided into Android and iOS, represented by 0 and 1 respectively;

S214: Based on the Pearson correlation coefficients between all feature variables and the target variable, delete the columns whose Pearson correlation coefficient is 0; also delete the columns for terminal model, terminal brand and terminal first-use time, which are irrelevant to training;

S215: Aggregate the feature variable information by user ID, and fill null values with the mean or the mode;

S22: Processing of the user call behavior table:

S221: Delete the columns for peer number ID and call start time, which are irrelevant to training;

S222: Aggregate the feature variables by user ID using the sum or the mode; convert abnormal values in the home office column and the peer long-distance area code column to numeric form; fill missing values with the mode;

S23: Processing of the user Internet behavior table:

S231: Delete the columns for application name and data date, which are irrelevant to training, and apply one-hot encoding to the application category column;

S232: Sum the number of visits and the visit traffic by user ID, and aggregate all feature variables produced by one-hot encoding of the application category using the mode;

S233: Treat outliers found in the feature variables with the capping method based on three times the interquartile range;

S234: Fill null values in each feature variable with the mode;

S24: Processing of the user trajectory behavior table:

S241: Using the data warehouse tool Hive, mark whether the current longitude and latitude correspond to a scenic spot, based on the user ID column and the column indicating whether the location is a scenic spot rated 4A or above within the province, and from this derive three columns for every user: total stay time, stay time in scenic spots, and stay time outside scenic spots;

S242: Identify outliers in the feature variables and treat them with the capping method based on three times the interquartile range;

S243: Fill null values in the feature variables with the mode;

wherein, in the above preprocessing steps, the outliers of each feature variable are identified by drawing a box plot of the variable concerned.
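The outlier capping and mode filling used in steps S233 to S234 and S242 to S243 can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the text does not say how the quartiles are computed, so the default quartile convention of Python's `statistics.quantiles` is assumed, and the function names and sample values are hypothetical.

```python
import statistics


def cap_outliers_iqr(values, k=3.0):
    """Clip non-missing values to [Q1 - k*IQR, Q3 + k*IQR]; None stays None."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)  # assumed quartile convention
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [None if v is None else min(max(v, low), high) for v in values]


def fill_with_mode(values):
    """Replace missing entries (None) with the most frequent observed value."""
    present = [v for v in values if v is not None]
    mode = statistics.mode(present)
    return [mode if v is None else v for v in values]


# Made-up visit counts for one feature column; 400 is an obvious outlier.
visits = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7, None, 400]
capped = cap_outliers_iqr(visits)
cleaned = fill_with_mode(capped)
print(cleaned)
```

Capping keeps the affected rows in the data set and only limits the influence of extreme values, which suits per-user aggregates where every user must retain a record.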

In step S31, the CART regression tree is assumed to be a binary tree, and its expression is:

$$R_1(j,s)=\{x \mid x^{(j)} \le s\} \quad \text{and} \quad R_2(j,s)=\{x \mid x^{(j)} > s\}$$

In the above formula, $R_1(j,s)$ and $R_2(j,s)$ denote the left subtree and the right subtree respectively, $j$ denotes the j-th feature in the data, and $s$ denotes the split point.

In the binary decision tree, if a tree node is split on the j-th feature of the data, a sample is assigned to the left subtree when its feature value is less than or equal to $s$, and to the right subtree when its feature value is greater than $s$.
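The partition defined by $R_1(j,s)$ and $R_2(j,s)$ can be illustrated with a short sketch. The function name `split_region` and the sample tuples are hypothetical and serve only to show how one split on feature $j$ at threshold $s$ divides the samples.

```python
def split_region(samples, j, s):
    """Partition samples on feature j: left gets x[j] <= s, right gets x[j] > s."""
    left = [x for x in samples if x[j] <= s]
    right = [x for x in samples if x[j] > s]
    return left, right


# Each sample is a tuple of feature values; the numbers are made up.
samples = [(12.0, 1), (30.5, 0), (7.2, 1), (30.5, 1)]
left, right = split_region(samples, j=0, s=12.0)
print("left:", left)
print("right:", right)
```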

Further, in step S32, the predicted value obtained from the tree scores is calculated with the following function:

$$\hat{y}_i=\sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

In the above formula, $\hat{y}_i$ denotes the predicted value of the i-th sample, $K$ denotes the number of trees, $F$ denotes the set of all CART trees, $f$ denotes one specific CART tree, and $f_k(x_i)$ is the score the sample obtains at a leaf node of the k-th tree.
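The additive prediction $\hat{y}_i=\sum_k f_k(x_i)$ can be sketched as follows. The two hand-written trees, their thresholds and the feature names `scenic_minutes` and `call_count` are hypothetical stand-ins for trained CART trees.

```python
def predict(trees, x):
    """Sum the leaf scores that each tree assigns to sample x."""
    return sum(tree(x) for tree in trees)


def tree_1(x):
    # Hypothetical one-split tree on time spent in scenic spots.
    return 0.5 if x["scenic_minutes"] > 300 else -0.25


def tree_2(x):
    # Hypothetical one-split tree on number of calls.
    return 0.25 if x["call_count"] <= 20 else -0.25


sample = {"scenic_minutes": 480, "call_count": 35}
score = predict([tree_1, tree_2], sample)
print(score)
```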

Further, in steps S33 to S35, the expression of the objective function is:

$$Obj=\sum_{i=1}^{n} l(y_i,\hat{y}_i)+\sum_{k=1}^{K}\Omega(f_k)$$

In the above formula, $l$ denotes the empirical loss function of the tree model, $y_i$ denotes the true value of the i-th sample, $n$ denotes the number of samples, and $\Omega$ denotes the regularization term of the regression tree;

wherein the left-hand term of the expression is the loss function and the right-hand term is the regularization term.

The expression of the loss function is:

$$\tilde{L}^{(t)}=\sum_{i=1}^{n}\left[g_i f_t(x_i)+\frac{1}{2}h_i f_t^{2}(x_i)\right]$$

In the above formula, $f_t$ is the tree added at the t-th iteration, $g_i$ denotes the first-order partial derivative of the loss for the i-th sample, and $h_i$ denotes the second-order partial derivative of the loss for the i-th sample.

The expression of the regularization term is:

$$\Omega(f)=\gamma T+\frac{1}{2}\lambda\sum_{j=1}^{T}\omega_j^{2}$$

In the above formula, $\gamma$ and $\lambda$ denote trade-off factors, $\omega_j$ denotes the output mean of the j-th leaf node, and $T$ denotes the number of leaf nodes.

Further, in step S36, the optimal value of each leaf node is calculated with the following formula:

$$\omega_j^{*}=-\frac{G_j}{H_j+\lambda}$$

In the above formula, $G_j$ and $H_j$ denote the sums of the first-order and second-order partial derivatives, respectively, of the samples contained in leaf node $j$; both are constants.

The value of the objective function at this point is calculated with the following formula:

$$Obj^{*}=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$$

In the above formula, $T$ denotes the number of leaf nodes.
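The two closed-form expressions of step S36 can be evaluated directly once $G_j$ and $H_j$ are known for each leaf. The sketch below assumes these sums are already available; the function names and the numeric values are hypothetical.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf value: w_j* = -G_j / (H_j + lambda)."""
    return -G / (H + lam)


def structure_score(leaf_stats, lam, gamma):
    """Objective at the optimum: -1/2 * sum(G_j^2 / (H_j + lambda)) + gamma * T."""
    T = len(leaf_stats)
    return -0.5 * sum(G * G / (H + lam) for G, H in leaf_stats) + gamma * T


# (G_j, H_j) for a tree with two leaves; made-up values.
leaf_stats = [(-4.0, 3.0), (2.0, 1.0)]
lam, gamma = 1.0, 0.5

weights = [leaf_weight(G, H, lam) for G, H in leaf_stats]
obj = structure_score(leaf_stats, lam, gamma)
print(weights, obj)
```

A lower value of the objective indicates a better tree structure, so the same score can be used to compare candidate trees with different leaf assignments.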

Further, in step S4, the hyperparameters to be set in the prediction model include the booster (iterative model) type, the loss function type, the learning rate, the tree depth, the L1 regularization parameter, and the number of iterations.

The evaluation indices of the prediction model include the precision P, the recall R and the F1 value, calculated as follows:

$$P=\frac{TP}{TP+FP}, \quad R=\frac{TP}{TP+FN}, \quad F1=\frac{2PR}{P+R}$$

In the above formulas, TP is the number of samples that are actually positive and are predicted as positive, FP is the number of samples that are actually negative but are predicted as positive, and FN is the number of samples that are actually positive but are predicted as negative.
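The three evaluation indices can be computed from predicted and true labels as in the following sketch. The function name and the label vectors are hypothetical; returning 0.0 when a denominator is zero is a convention assumed here, not something stated in the text.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (P, R, F1) for binary labels, where 1 marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed zero-division rule
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 1 = user travels, 0 = user does not travel.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
P, R, F1 = precision_recall_f1(y_true, y_pred)
print(P, R, F1)
```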

The present invention also includes a system for predicting individual travel behavior based on the XGBoost algorithm. The system uses the XGBoost-based individual travel behavior prediction method described above to predict the outcome of individual travel behavior. The prediction system includes:

a data acquisition module, which obtains historical data that characterizes the user's recent behavior, the historical data including basic user information, user call behavior data for the last three months, user Internet behavior data for the last three months, and user trajectory behavior data for the last three months; the collected historical data is output to the preprocessing module;

a preprocessing module, which preprocesses the historical data obtained by the data acquisition module to obtain the required sample data set; the sample data set is output to the prediction module; and

a prediction module, which, based on the constructed prediction model, trains the prediction model on the pre-training data set in the sample data set, takes the prediction data set in the sample data set as input, and obtains an output containing the prediction result for the user's travel behavior.

The present invention also includes a terminal for predicting individual travel behavior based on the XGBoost algorithm. The terminal includes a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor executes the XGBoost-based individual travel behavior prediction method described above.

The method, system and terminal for predicting individual travel behavior based on the XGBoost algorithm provided by the present invention have the following beneficial effects:

1. The invention builds the required prediction model on the XGBoost algorithm and uses real big data about users to predict their future travel within the province. The raw user data used for prediction is preprocessed to extract the key attributes the model needs, so the resulting predictions have high precision and reliability, and the prediction model also performs well on recall and F1 value.

2. The prediction model used in the invention is a weighted regression model. In application it does not require feature normalization, can select features automatically, and can accommodate a variety of loss functions. It is therefore easier to operate and reduces the workload and computational burden of data processing.

3. The preprocessing in the invention takes outliers and missing values in the raw data into account and reduces their influence on the prediction results. Attribute selection considers the great majority of feature variables obtainable from big data, and the construction of the prediction model has been refined and some of its parameters optimized. All of this supports the accuracy and reliability of the prediction results.

Brief Description of the Drawings

The accompanying drawings provide a further understanding of the present invention and form part of the specification. Together with the embodiments of the invention they serve to explain the invention and do not limit it. In the drawings:

Fig. 1 is a flowchart of the XGBoost-based individual travel behavior prediction method in Embodiment 1 of the present invention;

Fig. 2 is a box plot of the time-on-network column in Embodiment 2 of the present invention;

Fig. 3 is a heat map of the basic user information table in Embodiment 2 of the present invention;

Fig. 4 is a table showing part of the data in the user Internet behavior data table in Embodiment 2 of the present invention;

Fig. 5 is a box plot of the number-of-visits variables in the user Internet behavior data table in Embodiment 2 of the present invention;

Fig. 6 is a table showing part of the data in the user trajectory behavior data table in Embodiment 2 of the present invention;

Fig. 7 is a table showing part of the data in the derived user trajectory behavior data table in Embodiment 2 of the present invention;

Fig. 8 is a box plot of the total stay time in the user trajectory behavior data in Embodiment 2 of the present invention;

Fig. 9 is a curve of the F1 value against the number of iterations in the test results of the XGBoost-based individual travel behavior prediction method in Embodiment 2 of the present invention;

Fig. 10 is a schematic module diagram of the XGBoost-based individual travel behavior prediction system in Embodiment 3 of the present invention.

Reference numerals in the figures: 1, data acquisition module; 2, preprocessing module; 3, prediction module.

Detailed Description of the Embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.

Embodiment 1

As shown in Fig. 1, this embodiment provides a method for predicting individual travel behavior based on the XGBoost algorithm, the method comprising the following steps:

S1: Obtain historical data that characterizes user behavior. The historical data includes basic user information, user call behavior data for the last three months, user Internet behavior data for the last three months, and user trajectory behavior data for the last three months. The historical data comes from open government data and from real user data collected by telecommunications operators.

S2: Preprocess the acquired historical data to obtain a sample data set, and use part of the sample data set as a pre-training data set and the rest as a prediction data set, wherein the pre-training data set includes a training set and a test set. The preprocessing of the historical data includes the steps described above (S21 to S24).

In these preprocessing steps, the outliers of each feature variable are identified by drawing a box plot of the variable concerned.

S3: Construct a prediction model of individual travel behavior based on the XGBoost algorithm. XGBoost is an improved learning algorithm based on gradient boosting and decision trees. Its principle is to use iteration to combine a large number of weak classifiers into a strong classifier and thereby achieve accurate classification. XGBoost is a classic Boosting method; the aim of Boosting is to construct a strong classifier by integrating many weak classifiers. In this embodiment, XGBoost uses CART regression trees.

The construction of the prediction model comprises the following steps:

S31: The CART regression tree is assumed to be a binary tree whose features can be split repeatedly. If a tree node is split on the j-th feature of the data, a sample is assigned to the left subtree when its feature value is less than or equal to $s$, and to the right subtree when its feature value is greater than $s$. The expression of the CART regression tree is:

$$R_1(j,s)=\{x \mid x^{(j)} \le s\} \quad \text{and} \quad R_2(j,s)=\{x \mid x^{(j)} > s\}$$

S32: The idea of the XGBoost algorithm is to keep adding trees, growing each tree by repeated feature splits. Each added tree learns a new function that fits the residual of the previous prediction. After K trees have been trained, to predict the score of a sample, the sample's features lead it to one leaf node in each tree, each with a score; adding up the corresponding scores from all trees gives the predicted value of the sample. The predicted value is determined by:

$$\hat{y}_i=\sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

In the above formula, $\hat{y}_i$ denotes the predicted value of the i-th sample, $K$ denotes the number of trees, $F$ denotes the set of all CART trees, and $f_k(x_i)$ is the score the sample obtains at a leaf node of the k-th tree.

S33: An objective function is normally set to determine whether each parameter of the algorithm is optimal. In this embodiment, the XGBoost objective function is defined as:

$$Obj=\sum_{i=1}^{n} l(y_i,\hat{y}_i)+\sum_{k=1}^{K}\Omega(f_k)$$

The expression has two parts: the left-hand term is the loss function and the right-hand term is the regularization term. The loss function measures the difference between the predicted score and the true score.

S34: Once the model is established it must be trained on the data, finding the best parameters by minimizing the objective function. In this embodiment the objective function is optimized stage by stage using additive training, as follows:

S341: First optimize the first CART tree, then the second, and so on until all K trees have been optimized. The prediction is updated as follows:

$$\hat{y}_i^{(t)}=\hat{y}_i^{(t-1)}+f_t(x_i)$$

In the above formula, $\hat{y}_i^{(t)}$ denotes the predicted score of sample $i$ after the t-th iteration, $\hat{y}_i^{(t-1)}$ denotes the predicted score given by the first $t-1$ trees, and $f_t(x_i)$ denotes the functional form of the t-th tree;

S342: The last step of the optimization yields an optimal CART tree $f_t(x_i)$, which minimizes the objective function on the basis of $f_{t-1}(x_i)$, that is, it satisfies:

$$Obj^{(t)}=\sum_{i=1}^{n} l\left(y_i,\hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega(f_t)+constant$$

In the above formula, $constant$ is the complexity of the first $t-1$ trees.

S343: When the loss function used is the MSE, the above expression becomes:

$$Obj^{(t)}=\sum_{i=1}^{n}\left[2\left(\hat{y}_i^{(t-1)}-y_i\right)f_t(x_i)+f_t^{2}(x_i)\right]+\Omega(f_t)+constant$$

For a general loss function, a second-order Taylor expansion is applied, and the expression further becomes:

$$Obj^{(t)}\approx\sum_{i=1}^{n}\left[l\left(y_i,\hat{y}_i^{(t-1)}\right)+g_i f_t(x_i)+\frac{1}{2}h_i f_t^{2}(x_i)\right]+\Omega(f_t)+constant$$

In the above formula:

$$g_i=\partial_{\hat{y}^{(t-1)}}\, l\left(y_i,\hat{y}_i^{(t-1)}\right), \quad h_i=\partial^{2}_{\hat{y}^{(t-1)}}\, l\left(y_i,\hat{y}_i^{(t-1)}\right)$$

S344: The goal of XGBoost is to minimize the objective function, so the constant terms are removed, and the expression of the constructed loss function is:

$$\tilde{L}^{(t)}=\sum_{i=1}^{n}\left[g_i f_t(x_i)+\frac{1}{2}h_i f_t^{2}(x_i)\right]$$

In the above formula, $g_i$ denotes the first-order partial derivative of the loss for the i-th sample, and $h_i$ denotes the second-order partial derivative of the loss for the i-th sample.
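The quantities $g_i$ and $h_i$ depend only on the chosen loss function and the previous prediction. The following sketch shows them for two common losses. The squared-error case corresponds to the MSE case of S343 with $l=(y-\hat{y})^2$; the logistic case is an assumption added for illustration, since the text does not state which loss is used for the binary travel decision.

```python
import math


def squared_error_grad_hess(y, y_hat):
    """l = (y - y_hat)^2, so g = 2 * (y_hat - y) and h = 2."""
    return 2.0 * (y_hat - y), 2.0


def logistic_grad_hess(y, margin):
    """Log loss on a raw score: g = p - y and h = p * (1 - p), p = sigmoid(margin)."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p - y, p * (1.0 - p)


# Previous-round predictions for a single positive sample; made-up values.
g_mse, h_mse = squared_error_grad_hess(y=1.0, y_hat=0.25)
g_log, h_log = logistic_grad_hess(y=1, margin=0.0)
print(g_mse, h_mse, g_log, h_log)
```

Because only $g_i$ and $h_i$ enter the expanded objective, switching the loss function changes these two helpers and nothing else in the tree-building procedure.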

S35: Determine the definition of the regularization term in XGBoost by redefining the CART tree.

The CART tree is defined by the following expression:

$$f_t(x)=\omega_{q(x)}, \quad \omega\in R^{T}, \quad q: R^{d}\to\{1,2,\dots,T\}$$

In the above formula, $T$ denotes the number of leaf nodes in a tree; the leaf values form a T-dimensional vector $\omega$; $q(x)$ is a mapping that assigns a sample to a leaf node; and $\omega_{q(x)}$ denotes the tree's predicted value for the sample.

Under this definition, the regularization term of XGBoost is defined as:

$$\Omega(f)=\gamma T+\frac{1}{2}\lambda\sum_{j=1}^{T}\omega_j^{2}$$

In the above formula, $\gamma$ and $\lambda$ denote trade-off factors, $\omega_j$ denotes the output mean of the j-th leaf node, and $T$ denotes the number of leaf nodes.

When applying the XGBoost model, the values of $\gamma$ and $\lambda$ can be set manually; the larger $\gamma$ and $\lambda$ are, the simpler the model.

S36: Under the new definition, the objective function is further transformed into:

$$Obj^{(t)}\approx\sum_{i=1}^{n}\left[g_i\,\omega_{q(x_i)}+\frac{1}{2}h_i\,\omega_{q(x_i)}^{2}\right]+\gamma T+\frac{1}{2}\lambda\sum_{j=1}^{T}\omega_j^{2}$$

In the above formula, $\omega_{q(x_i)}$ denotes the tree's predicted value for the i-th sample.

Simplifying the above expression gives:

$$Obj^{(t)}\approx\sum_{j=1}^{T}\left[G_j\omega_j+\frac{1}{2}\left(H_j+\lambda\right)\omega_j^{2}\right]+\gamma T, \quad G_j=\sum_{i\in I_j} g_i, \quad H_j=\sum_{i\in I_j} h_i$$

In the above formula, $I_j=\{i \mid q(x_i)=j\}$ is the set of samples contained in leaf node $j$; $G_j$ and $H_j$ denote the sums of the first-order and second-order partial derivatives, respectively, of those samples, and both are constants; $\gamma$ and $\lambda$ denote trade-off factors; $\omega_j$ denotes the output mean of the j-th leaf node; and $T$ denotes the number of leaf nodes.

Once the structure of the t-th CART tree is fixed, $G_j$ and $H_j$ are both determined, so the optimal value of each leaf node can be obtained from:

$$\omega_j^{*}=-\frac{G_j}{H_j+\lambda}$$

and the value of the objective function at this point from:

$$Obj^{*}=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$$

The hyperparameters to be set in the prediction model include the booster (iterative model) type, the loss function type, the learning rate, the tree depth, the L1 regularization parameter, and the number of iterations.

S5: Use the prediction data set as input samples, run the trained prediction model on them, and obtain from the model output the prediction result for the user's travel behavior decision.

Embodiment 2

This embodiment provides an XGBoost-based individual travel behavior prediction method as in Embodiment 1, and carries out a simulation experiment on the basis of that method. (In other embodiments the simulation experiment may be omitted, or other experimental schemes may be used to determine the relevant parameters and the predictive performance for individual travel behavior.)

I. Data source

This embodiment uses real operating data of users of an operator in Anhui. Data for 10,000 users is provided, of which the data for 7,000 users is used for training and the data for 3,000 users is used for prediction; the training set and test set are split from the 7,000 users by the experimenter.

Data files and their descriptions:

In this embodiment, the data files for the prediction model are as follows:

(1). DataPlus_Public_UserInfo_travel.csv

Description: This file contains the basic information of the 10,000 users, one row per user per month; every user has records.

(2). DataPlus_Public_Comm_travel.csv

Description: This file contains the call data of the 10,000 users, with multiple rows per user; some users may have no records.

(3). DataPlus_Public_Net_travel.csv

Description: This file contains the Internet usage data of the 10,000 users, with multiple rows per user; some users may have no records.

(4). Dataplus_Travel_Train_Trail.csv

Description: This file contains the trajectory information of the 10,000 users for the preceding three months (including two columns: whether the user appeared in a scenic spot, and the scenic spot name), with multiple rows per user; some users may have no records. The data dictionary of this file is shown in Table 1:

Table 1: Data dictionary of Dataplus_Travel_Train_Trail.csv

| Field | Meaning | Notes |
|---|---|---|
| user_id | User ID | Sampled; field desensitized |
| come_time | Entry time | Minute granularity |
| leave_time | Departure time | Minute granularity |
| longitude | Longitude (WGS84) | Field desensitized; 3 decimal places retained |
| latitude | Latitude (WGS84) | Field desensitized; 3 decimal places retained |
| poi_tag | Whether the location is an in-province scenic spot rated 4A or above | 0: no; 1: yes |
| poi_name | Name of the in-province scenic spot rated 4A or above | |

(5) Dataplus_Travel_Train_User.csv

Description: this file contains the in-province travel information of the 7,000 training users for the following 10 days, with one row per user; every user has a record. The data dictionary of this file is shown in Table 2:

Table 2: Data dictionary of Dataplus_Travel_Train_User.csv

| Field | Meaning | Notes |
|---|---|---|
| user_id | User ID | Sampled; field desensitized |
| in_flag | In-province travel outcome | 0: no trip; 1: trip |

2. Preprocessing of the raw data

2.1 Processing of the user basic information table

Each record in the raw user basic information data contains features such as user ID, customer age, home city, and attribution display. Each user has one record per month, giving 30,000 rows × 40 columns in total.

In this embodiment, box plots of the data set are drawn, as shown in Figure 2. They show that some variables contain outliers; as a first step, values below 0 in the time-on-network column are set to null.

A heat map of all feature variables of the data set is drawn, as shown in Figure 3. It shows that some feature variables are highly collinear. During processing, for each group of columns whose collinearity value exceeds 0.9, only one column is kept and the others are deleted, which reduces the dimension of the feature matrix.
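The collinearity filter described above can be sketched as follows. This is a minimal illustration, assuming that the "collinearity value" is the absolute Pearson correlation between two columns and that the first column of each correlated group is the one kept; the column names and values are hypothetical and are not fields of the actual data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / (sx * sy)


def drop_collinear(columns, threshold=0.9):
    """Return the column names to keep.

    A column is kept only if its absolute correlation with every column
    already kept does not exceed the threshold, so one column survives
    from each highly correlated group.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical columns: the second is an exact multiple of the first.
columns = {
    "monthly_fee": [10.0, 20.0, 30.0, 40.0],
    "monthly_fee_with_tax": [11.0, 22.0, 33.0, 44.0],
    "age": [30.0, 25.0, 41.0, 28.0],
}
kept = drop_collinear(columns)
print(kept)
```

Here monthly_fee_with_tax is dropped because it is perfectly correlated with monthly_fee, while age (correlation of roughly 0.19 with monthly_fee) is kept.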

Text information in the data set is converted to numerical values: the mobile terminal operating system column is divided into the two major systems, Android and iOS, represented by 0 and 1 respectively. Based on the Pearson correlation coefficients between all feature variables and the target variable, columns whose Pearson correlation coefficient is 0 are deleted, as are the three columns that are irrelevant to training: terminal model, terminal brand, and terminal first-use time. Finally, the feature variable information is aggregated by user ID, and null values are filled with the mean or the mode. The result is a complete DataFrame of 7,000 rows × 30 columns.
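A minimal sketch of the encoding and null-filling steps is given below. The helper name fill_missing, the sample values, and the choice of which strategy applies to which column are assumptions made for illustration; the source only states that Android and iOS are coded 0 and 1 and that nulls are filled with the mean or the mode.

```python
from statistics import mean, mode

# Operating system coding stated in the text: Android -> 0, iOS -> 1.
OS_CODES = {"Android": 0, "iOS": 1}


def fill_missing(values, strategy):
    """Replace None entries with the mean or the mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]


# Hypothetical raw columns with missing entries.
raw_os = ["Android", "iOS", None, "Android"]
raw_age = [30, None, 40, 50]

os_encoded = [OS_CODES.get(v) for v in raw_os]   # None stays None for now
os_filled = fill_missing(os_encoded, "mode")     # categorical code: use the mode
age_filled = fill_missing(raw_age, "mean")       # numeric column: use the mean
print(os_filled, age_filled)
```

Using the mode for the coded operating system column keeps the value a valid category code, whereas a mean could produce a value such as 0.33 that corresponds to no category.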

2.2 Processing of the user call behavior table

Each record in the raw user call behavior data contains features such as user ID, peer number code, and call duration. Each user has multiple records, giving 3,703,119 rows × 10 columns in total, covering 9,879 users.

In this example, the two columns that are irrelevant to training, peer number code and call start time, are deleted. The feature variable information is aggregated by user ID using the sum or the mode. Outliers in the home office column and the peer long-distance area code column are converted to numerical values, and missing values are filled with the mode. The result is a complete DataFrame of 7,000 rows × 8 columns.

2.3 Processing of the user internet behavior table

The raw user internet behavior data are shown in Figure 4. Each record contains features such as user ID, application name, and application category. Each user has multiple records, giving 2,246,083 rows × 6 columns in total, covering 8,879 users.

In this embodiment, the two columns that are irrelevant to training, application name and data date, are deleted, and the application category column is one-hot encoded into 23 columns. The access count and access traffic columns are summed by user ID, and all one-hot-encoded application category columns are aggregated by user ID using the mode. A box plot of the feature data, shown in Figure 5, reveals outliers, with the highest access count reaching 6,000,000; these are handled with the triple interquartile range capping method. Finally, remaining null values are filled with the mode. The result is a complete DataFrame of 7,000 rows × 26 columns.
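The capping step can be sketched as follows. The source does not spell out the fences, so this sketch assumes the common reading of a triple interquartile range cap: values are clipped to the interval [Q1 - 3·IQR, Q3 + 3·IQR]. The quartile method ("inclusive") and the sample access counts are likewise assumptions for illustration.

```python
from statistics import quantiles


def cap_outliers(values, k=3.0):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k=3 gives triple-IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


# Hypothetical access counts with one extreme value.
visits = [10, 20, 30, 40, 50, 60, 70, 80, 6_000_000]
capped = cap_outliers(visits)
print(capped)
```

For this sample Q1 = 30 and Q3 = 70, so the upper fence is 70 + 3 × 40 = 190 and the extreme value is replaced by 190, while the other values are unchanged. Capping keeps the record in the data set instead of discarding the whole row.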

2.4 Processing of the user trajectory behavior table

The raw user trajectory behavior data are shown in Figure 6. Each record contains features such as user ID, entry time, and departure time. Each user has multiple records, giving 68,332,348 rows × 7 columns in total, covering 9,933 users.

Because the trajectory behavior data set is too large, the data warehouse tool Hive is used to mark, from the user ID column and the poi_tag column (poi_tag = 1 meaning an in-province scenic spot rated 4A or above), whether the current longitude and latitude is a scenic spot. As shown in Figure 7, the total stay time, the scenic spot stay time, and the non-scenic spot stay time of every user can then be derived.
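The derivation of the three stay-time features can be illustrated with the sketch below. The source performs this step in Hive; the Python version here is only an equivalent small-scale illustration, and the timestamp format, the function name stay_minutes, and the sample records are assumptions. The field names come_time, leave_time and poi_tag follow Table 1.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format (minute granularity)


def stay_minutes(records):
    """Aggregate stay time per user.

    records: iterable of (user_id, come_time, leave_time, poi_tag) tuples.
    Returns {user_id: {"total": ..., "scenic": ..., "non_scenic": ...}} in minutes.
    """
    totals = defaultdict(lambda: {"total": 0.0, "scenic": 0.0, "non_scenic": 0.0})
    for user_id, come, leave, poi_tag in records:
        delta = datetime.strptime(leave, FMT) - datetime.strptime(come, FMT)
        minutes = delta.total_seconds() / 60
        row = totals[user_id]
        row["total"] += minutes
        # poi_tag == 1 marks a scenic spot rated 4A or above in the province.
        row["scenic" if poi_tag == 1 else "non_scenic"] += minutes
    return dict(totals)


# Hypothetical trajectory records.
records = [
    ("u1", "2020-01-01 09:00", "2020-01-01 10:30", 1),
    ("u1", "2020-01-01 11:00", "2020-01-01 11:20", 0),
    ("u2", "2020-01-02 08:00", "2020-01-02 08:45", 0),
]
stays = stay_minutes(records)
print(stays)
```

For these records user u1 has 110 minutes in total, of which 90 are at a scenic spot, and user u2 has 45 minutes, none of them at a scenic spot.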

A box plot of the feature variables in the data is drawn, as shown in Figure 8. It reveals outliers, with the highest total stay time reaching 26,490,441.0 minutes, far more than three months (about 130,000 minutes); these are handled with the triple interquartile range capping method. Finally, remaining null values are filled with the mode. The result is a complete DataFrame of 7,000 rows × 4 columns.

3. Model parameter determination

To make the experimental results more general, this embodiment splits the data set into a 20% test set and an 80% training set. The hyperparameters of the XGBoost model are tuned with stratified three-fold cross-validation, which makes the model more stable. The specific parameter settings are shown in Table 3:
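Assuming the tuning step above refers to stratified three-fold cross-validation, in which every fold keeps roughly the same ratio of travelling to non-travelling users (in_flag) as the full training set, the fold construction can be sketched in plain Python as below. The function name, the seed, and the toy labels are illustrative; in practice a library splitter such as scikit-learn's StratifiedKFold with n_splits=3 would normally be used.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=3, seed=0):
    """Yield (train_idx, valid_idx) pairs that preserve class ratios per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        valid = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, valid


# Toy labels: 6 users with in_flag = 1 and 3 users with in_flag = 0.
labels = [1] * 6 + [0] * 3
splits = list(stratified_k_fold(labels))
for train, valid in splits:
    print(train, valid)
```

Each candidate hyperparameter setting would be trained on the two training folds and scored on the held-out fold, and the three scores averaged; every validation fold here contains two positive users and one negative user, matching the 2:1 ratio of the full set.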

Table 3: Hyperparameter settings of the XGBoost model

4. Comparison of prediction test results

In this example, four prediction models, XGBoost, GBDT, LR, and fused GBDT+LR, are used in a controlled comparison of prediction results. The evaluation indexes of each model after training are shown in Table 4 below:

Table 4: Comparison of evaluation indexes between this embodiment and other algorithm models

| Model | Precision P | Recall R | F1 value |
|---|---|---|---|
| XGBoost | 0.8809 | 0.9429 | 0.9108 |
| LR | 0.5515 | 0.8002 | 0.6530 |
| GBDT | 0.8378 | 0.9239 | 0.8787 |
| GBDT+LR | 0.8394 | 0.9248 | 0.8800 |

Analysis of the above test results shows that the model provided by this embodiment, which predicts a user's future in-province travel from real user big data, outperforms the three control models (GBDT, LR, and fused GBDT+LR) on every evaluation index: precision, recall, and F1 value. It can therefore be considered that the prediction method provided by this embodiment does solve the prior-art problem of predicting individual travel behavior, and that the predictions it produces have high accuracy and reliability.
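The three evaluation indexes follow the usual definitions, P = TP / (TP + FP), R = TP / (TP + FN) and F1 = 2PR / (P + R). As a consistency check, the sketch below recomputes each F1 value from the precision and recall reported in the comparison table; the function names are illustrative, and the confusion counts in the usage line are made-up numbers.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_from(precision, recall)


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the comparison table.
reported = {
    "XGBoost": (0.8809, 0.9429, 0.9108),
    "LR": (0.5515, 0.8002, 0.6530),
    "GBDT": (0.8378, 0.9239, 0.8787),
    "GBDT+LR": (0.8394, 0.9248, 0.8800),
}
recomputed = {name: round(f1_from(p, r), 4) for name, (p, r, _) in reported.items()}
print(recomputed)

# Made-up counts, only to show the count-based entry point.
print(precision_recall_f1(tp=8, fp=2, fn=2))
```

The recomputed F1 values agree with the reported ones to four decimal places, which confirms that the column labelled P is precision rather than overall accuracy.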

The F1 value of the prediction model in this embodiment is plotted against the number of iterations, as shown in Figure 9. The curve shows that the F1 value of the XGBoost model improves as the number of iterations increases and reaches its maximum at around 1,100 iterations.

Example 3

As shown in Figure 10, this embodiment provides an individual travel behavior prediction system based on the XGBoost algorithm. The system uses the XGBoost-based individual travel behavior prediction method of Example 1 to predict individual travel behavior. The prediction system includes:

a prediction module, which trains the constructed prediction model with the training data set in the sample data set, and then takes the prediction data set in the sample data set as input to obtain an output containing the predicted user travel behavior.

Example 4

This embodiment provides an individual travel behavior prediction terminal based on the XGBoost algorithm. The terminal includes a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor executes the XGBoost-based individual travel behavior prediction method of Example 1.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An individual travel behavior prediction method based on the XGBoost algorithm, characterized by comprising the following steps:

S1: acquiring historical data characterizing user behavior, the historical data comprising basic user information, user call behavior data for the most recent three months, user internet behavior data for the most recent three months, and user trajectory behavior data for the most recent three months;

S2: preprocessing the acquired historical data to obtain a sample data set, using part of the sample data set as a pre-training data set and the remainder as a prediction data set, wherein the pre-training data set comprises a training set and a test set;

the preprocessing of the historical data comprises the following steps:

S21: processing the user basic information table:

S211: finding outliers in the data variables, and setting to null the outliers smaller than 0 in the time-on-network column;

S212: identifying collinearity among the feature variables from the heat map of the feature variables, and deleting all columns in the data whose collinearity value exceeds 0.9;

S213: converting text information in the data to numerical values, dividing the terminal operating system type uniformly into Android and iOS, represented by 0 and 1 respectively;

S214: deleting, according to the Pearson correlation coefficients between all feature variables and the target variable, the columns whose Pearson correlation coefficient is 0; and deleting the columns for terminal model, terminal brand, and terminal first-use time, which are irrelevant to training;

S215: aggregating the feature variable information by user ID, and filling null values with the mean or the mode;

S22: processing the user call behavior table:

S221: deleting the columns for peer number code and call start time, which are irrelevant to training;

S222: aggregating the feature variables by user ID using the sum or the mode, converting outliers in the home office column and the peer long-distance area code column to numerical values, and filling missing values with the mode;

S23: processing the user internet behavior table:

S231: deleting the columns for application name and data date, which are irrelevant to training, and applying one-hot encoding to the application category column;

S232: summing the access count and access traffic feature variables by user ID, and aggregating all one-hot-encoded application category feature variables using the mode;

S233: handling the outliers found in the feature variables with the triple interquartile range capping method;

S234: filling null values in each feature variable with the mode;

S24: processing the user trajectory behavior table:

S241: using the data warehouse tool Hive to mark, according to the user ID column and the column indicating whether the location is an in-province scenic spot rated 4A or above, whether the current longitude and latitude is a scenic spot, and deriving from this three columns of data: the total stay time, the scenic spot stay time, and the non-scenic spot stay time of all users;

S242: finding outliers in the feature variables and handling them with the triple interquartile range capping method;

S243: filling null values in the feature variables with the mode;

in the above data preprocessing steps, the outliers of all feature variables are found by drawing box plots of the relevant variables;

S3: constructing a prediction model of individual travel behavior based on the XGBoost algorithm, the construction method of the prediction model comprising the following steps:

S31: constructing the XGBoost classifier from CART regression trees; trees are added by repeatedly performing feature splitting, each added tree learning a new function that fits the prediction residual of the previous round;

S32: summing the scores of the child nodes of a given CART tree to obtain the score of that tree, and summing the scores of all trees to obtain the predicted value of the sample;

S33: constructing the objective function of the algorithm model, the objective function comprising a loss function part and a regularization term part;

S34: optimizing the objective function step by step using additive training, optimizing each CART tree in turn, minimizing the objective function on the basis of the optimal tree, and completing the construction of the loss function part;

S35: completing the definition of the regularization term in the objective function by redefining the CART tree, thereby determining the function of the regularization term part;

S36: using the above functions to obtain the optimal value of each leaf node in the CART tree and the value of the objective function at that point;

S4: setting the hyperparameters of the prediction model, tuning the hyperparameters with a stratified three-fold cross-validator, and training the prediction model on the pre-training data set until the prediction model meets the requirements of the evaluation indexes;

S5: taking the prediction data set as the input samples, running the trained prediction model on them, and obtaining the predicted travel behavior decision of each user from the output of the prediction model.

2. The individual travel behavior prediction method based on the XGBoost algorithm according to claim 1, characterized in that: the historical data acquired in step S1 come from open government data and from real user data collected by a telecom operator.

3. The individual travel behavior prediction method based on the XGBoost algorithm according to claim 1, characterized in that: in step S31, the CART regression tree is assumed to be a binary tree, expressed as:

$$R_1(m,s) = \{x \mid x^{(m)} \le s\} \quad \text{and} \quad R_2(m,s) = \{x \mid x^{(m)} > s\}$$

in the above formula, R_1(m, s) and R_2(m, s) respectively represent the left subtree and the right subtree, m represents the m-th feature in the data, and s represents the cut point;

in the binary decision tree, if a tree node is split on the m-th feature in the data, a sample is assigned to the left subtree when its feature value is less than or equal to s, and to the right subtree when its feature value is greater than s.

4. The individual travel behavior prediction method based on the XGBoost algorithm according to claim 1, characterized in that: in step S32, the predicted value of a sample is calculated from the tree scores using the following function:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

in the above formula, \hat{y}_i represents the predicted value of the i-th sample, K represents the number of trees, F represents the set of all CART trees, f represents a specific CART tree, and f_k(x_i) is the score that the sample obtains at its leaf node in the k-th tree.

5. The individual travel behavior prediction method based on the XGBoost algorithm according to claim 1, characterized in that: in steps S33 to S35, the expression of the objective function is:

$$Obj = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

in the above formula, l represents the empirical loss function of the tree model, y_i represents the true value of the i-th sample, and Ω represents the regression tree regularization term;

wherein the left-hand term of the above formula is the loss function and the right-hand term is the regularization term;

the loss function, expanded to second order when the t-th tree f_t is added, is expressed as:

$$\sum_{i} l\left(y_i, \hat{y}_i^{(t)}\right) \approx \sum_{i} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right]$$

in the above formula, g_i represents the first-order partial derivative of the loss for the i-th sample, and h_i represents the second-order partial derivative of the loss for the i-th sample;

the regularization term is expressed as:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^{2}$$

in the above formula, γ and λ represent trade-off factors, ω_j represents the output value of the j-th leaf node, and T represents the number of leaf nodes.

6. The individual travel behavior prediction method based on the XGBoost algorithm according to claim 1, characterized in that: in step S36, the optimal value of each leaf node is calculated using the following formula:

$$\omega_j^{*} = -\frac{G_j}{H_j + \lambda}$$

in the above formula, G_j and H_j respectively represent the sum of the first-order partial derivatives and the sum of the second-order partial derivatives of the samples contained in leaf node j, both of which are constants;

the value of the objective function at this point is calculated by the following formula:

$$Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T$$

in the above formula, T represents the number of leaf nodes.

7. The individual travel behavior prediction method based on the XGBoost algorithm according to claim 1, characterized in that: in step S4, the hyperparameters to be set in the prediction model comprise the iterative model type, the loss function type, the learning rate, the tree depth, the L1 regularization parameter, and the number of iterations;

the evaluation indexes of the prediction model comprise the precision P, the recall R, and the F1 value, which are calculated as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2PR}{P + R}$$

in the above formulas, TP represents the number of actual positive samples predicted as positive, FP represents the number of actual negative samples predicted as positive, and FN represents the number of actual positive samples predicted as negative.

8. An individual travel behavior prediction system based on the XGBoost algorithm, characterized in that the system uses the individual travel behavior prediction method based on the XGBoost algorithm according to any one of claims 1 to 7 to predict individual travel behavior; the prediction system comprises:

a data acquisition module for acquiring historical data characterizing the recent behavior of users, the historical data comprising basic user information, user call behavior data for the most recent three months, user internet behavior data for the most recent three months, and user trajectory behavior data for the most recent three months, and for outputting the historical data to the preprocessing module;

a preprocessing module for preprocessing the historical data acquired by the data acquisition module to obtain the required sample data set, and for outputting the sample data set to the prediction module; and

a prediction module for training the constructed prediction model with the pre-training data set in the sample data set, and for obtaining, with the prediction data set in the sample data set as input, an output containing the predicted user travel behavior.

9. An individual travel behavior prediction terminal based on the XGBoost algorithm, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: the processor executes the individual travel behavior prediction method based on the XGBoost algorithm according to any one of claims 1 to 7.