CN108648023A

CN108648023A - A merchant customer flow prediction method fusing historical mean and boosting tree

Info

Publication number: CN108648023A
Application number: CN201810485114.8A
Authority: CN
Inventors: 白智远; 吕品; 温从威; 杨锦浩; 陈智
Original assignee: Shanghai Dianji University
Current assignee: Shanghai Dianji University
Priority date: 2018-05-18
Filing date: 2018-05-18
Publication date: 2018-10-12

Abstract

The present invention relates to a merchant customer flow prediction method that fuses a historical mean with boosting trees, comprising the following steps: preprocessing the complete merchant behavior data for a given time period; constructing features from the preprocessed data; building a customer flow prediction model based on the historical mean and boosting trees; and predicting customer flow. The invention proposes a customer flow prediction model for Internet merchants in which the historical mean is fused with boosting trees. In essence, the model is a weighted sum of the boosting tree models and the historical mean model, with the weight coefficients obtained from a calculation formula. The invention considers not only how to improve the prediction accuracy of the model but also the dependence of customer flow prediction on time, and it gives a comparative analysis of the prediction results of different models.

Description

A Merchant Customer Flow Prediction Method Combining Historical Mean and Boosting Tree

Technical Field

The invention relates to a customer flow prediction model that fuses a historical mean with boosting trees, and belongs to the fields of intelligent information processing and machine learning.

Background Art

The development of mobile location-based services has led to a sharp increase in the "online and offline" transaction data of Internet merchants. Compared with the traditional retail industry, Internet merchants pay more attention to user consumption in their marketing, and they work to give users a better consumption experience through product detail pages, customer service, convenient mobile payment, and similar means. For example, some business intelligence service platforms can provide sales forecasts for each merchant. Based on the forecast results, merchants can build trust with users, attract more loyal users, optimize operational decisions, reduce costs, and improve the user experience.

Existing sales forecasting techniques generally rely on historical data and simply apply time-weighted series methods. In real life, however, users' consumption behavior is often affected by factors such as holidays and weather. In such cases, existing techniques cannot predict a merchant's customer flow in time, so prediction accuracy may be unsatisfactory and the predicted customer flow may deviate substantially from the merchant's actual customer flow.

Summary of the Invention

The purpose of the present invention is to provide a method that predicts customer flow more accurately.

To achieve this purpose, the technical solution of the present invention is a merchant customer flow prediction method fusing a historical mean with boosting trees, characterized in that it comprises the following steps:

Step 1. Preprocess the complete merchant behavior data for a given time period, where the complete merchant behavior data include merchant characteristic data, user payment behavior data, and user browsing behavior data;

Step 2. Construct features from the preprocessed data, adding holiday data and weather feature data;

Step 3. Build a customer flow prediction model based on the historical mean and boosting trees, including the following steps:

Step 301. Construct two learning models for XGBoost and GBDT respectively, and tune the tree depth, learning rate, and number of iterations of the two learning models; when determining the learning rate and the maximum tree depth of the XGBoost learning model, use the cv function built into XGBoost;

Step 302. Train the XGBoost learning model and the GBDT learning model with the data obtained in step 2, set a forecast date, and compute the average customer flow and the sales increment over a period preceding the forecast date;

Step 4. Take the correlation matrix of the historical sales over a past time period as the input of the customer flow prediction model trained in step 3, and take the sales over a future time period and the weight coefficient Credit used to fuse the XGBoost learning model and the GBDT learning model as the output:

In the formula, the first symbol (an image in the original publication, not reproduced in this text) is the average sales over a past time period, and Fus_last is the sales over that period. The average sales and the sales values for the past period obtained from the XGBoost learning model, the GBDT learning model, and the historical mean model are each substituted into the Credit formula to obtain the corresponding weight coefficients. Finally, the results of the 2 trained XGBoost learning models and the 2 trained GBDT learning models are fused with the historical mean model in proportion to the corresponding weight coefficients, giving the predicted customer flow for a future time period.

Preferably, the preprocessing in step 1 includes the following steps:

Step 101. Remove from the complete merchant behavior data the data for the first 7 days after a merchant opens and the data for the 3 days before and after a sales interruption, and divide the remaining data into a training set and a test set;

Step 102. Remove duplicate data from the training set and the test set, and normalize the deduplicated data with a rule-based method, thereby eliminating abnormal data caused by a single user making a large number of purchases within a short time;

Abnormal data caused by special time points and by unpredictable large fluctuations are removed with a model pre-training method: an underfitting algorithm is used to pre-train the customer flow prediction model, and the data whose residuals are 10% and 25% are removed.
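The date-based trimming in step 101 can be sketched as follows. This is only an illustration under stated assumptions: the patent does not define how a sales interruption is detected, so the sketch treats any calendar day with zero or missing sales as an interruption, and the function name `trim_unreliable_days` and its parameters are hypothetical.

```python
from datetime import date, timedelta


def trim_unreliable_days(daily_sales, warmup_days=7, gap_margin=3):
    """Drop the start-up period and the days around sales interruptions.

    daily_sales maps a date to that day's sales count.
    ASSUMPTION: a day with zero or missing sales counts as an interruption.
    """
    if not daily_sales:
        return {}
    first, last = min(daily_sales), max(daily_sales)
    n_days = (last - first).days + 1
    calendar = [first + timedelta(days=i) for i in range(n_days)]

    # The first `warmup_days` calendar days after opening are excluded.
    excluded = set(calendar[:warmup_days])

    # Each interruption day is excluded together with `gap_margin` days on either side.
    for day in calendar:
        if daily_sales.get(day, 0) == 0:
            for offset in range(-gap_margin, gap_margin + 1):
                excluded.add(day + timedelta(days=offset))

    return {d: v for d, v in daily_sales.items() if d not in excluded}


# Usage: 20 days of sales with a single interruption on the 13th day.
start = date(2016, 1, 1)
sales = {start + timedelta(days=i): 10 for i in range(20)}
sales[start + timedelta(days=12)] = 0
kept = trim_unreliable_days(sales)
print(sorted(kept))
```

Splitting the remaining days into training and test sets, deduplication, and the rule-based normalization of step 102 are left out, since the text does not give their rules.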

Preferably, step 2 includes the following steps:

Step 201. Collect weather data for provinces and cities across the country;

Step 202. Convert the weather conditions into two simple indicators, a precipitation index and a sunny index, and generate a human body comfort index as an important feature for training the customer flow prediction model;

Step 203. Collect holiday data for the current time period, labeling working days as 0, weekends as 1, and holidays as 2.
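The labeling in step 203 can be sketched as a small encoding function. The holiday set passed in below is a placeholder, and giving holidays precedence over weekends is an assumption, since the text does not say how a holiday that falls on a weekend is labeled.

```python
from datetime import date

WORKDAY, WEEKEND, HOLIDAY = 0, 1, 2


def day_type(day, holidays):
    """Return 0 for a working day, 1 for a weekend, 2 for a holiday.

    ASSUMPTION: a holiday that falls on a weekend is labeled as a holiday.
    """
    if day in holidays:
        return HOLIDAY
    if day.weekday() >= 5:  # Monday is 0, so 5 and 6 are Saturday and Sunday
        return WEEKEND
    return WORKDAY


# Usage with an illustrative holiday set.
example_holidays = {date(2016, 10, 3)}
print(day_type(date(2016, 10, 3), example_holidays))   # holiday
print(day_type(date(2016, 10, 15), example_holidays))  # Saturday
print(day_type(date(2016, 10, 12), example_holidays))  # Wednesday
```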

The invention proposes a customer flow prediction model for Internet merchants in which the historical mean is fused with boosting trees. In essence, the model is a weighted sum of the boosting tree models and the historical mean model, fused in proportion to the weight coefficients obtained from the calculation formula. The invention considers not only how to improve the prediction accuracy of the model but also the dependence of customer flow prediction on time, and it gives a comparative analysis of the prediction results of different models.

Brief Description of the Drawings

Figure 1 shows the predictions of the fusion model of historical mean and boosting tree;

Figure 2 shows the predictions of the time series weighted regression model.

Detailed Description

The present invention is further described below with reference to specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope. It should also be understood that, after reading the teachings of the invention, those skilled in the art may make various changes or modifications to it, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.

The present invention provides a merchant customer flow prediction method fusing a historical mean with boosting trees, comprising the following steps:

Step 1: Preprocess the complete merchant behavior data.

The data used in the present invention come from the Tianchi big data platform and comprise the complete merchant behavior data from July 1 of one year to October 31 of the following year, including "merchant characteristics" data, "user payment behavior" data, and "user browsing behavior" data. Training a model directly on the raw data not only introduces errors but also consumes substantial computing resources, so the outliers in the raw data set are handled by removal, deduplication, and normalization. On the one hand, a merchant needs some start-up time between joining the platform and seeing sales grow, and sales may be interrupted for some period; therefore the data for the first 7 days after a merchant opens and for the 3 days before and after a sales interruption are not used as training data. On the other hand, the raw data contain cases where a single user makes a large number of purchases within a short time; to eliminate the effect of such abnormal consumption on prediction, a rule-based method is used to normalize the raw data. In addition, the raw data contain special time points and unpredictable large fluctuations, such as major holidays (for example the Mid-Autumn Festival and National Day), business closures, and bulk purchases by a single user during a merchant's promotions. For these outliers, which rule-based methods handle poorly, the invention uses a model pre-training method: an underfitting algorithm is first used to pre-train the customer flow prediction model, and the data whose residuals are 10% and 25% are removed from the raw data. Since the prediction target is a merchant's daily sales, the data used for training after preprocessing are the merchant's total sales aggregated by hour.
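The pre-training cleanup can be illustrated with a deliberately simple model. The text does not name the underfitting algorithm or say exactly how the 10% and 25% figures are applied, so the sketch below makes two assumptions: a straight-line least-squares fit stands in for the underfit model, and a point is dropped when its relative residual exceeds a threshold (0.25 here). The function names are hypothetical.

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ... as a stand-in underfit model."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, y_mean - slope * x_mean


def drop_large_residuals(ys, max_rel_residual=0.25):
    """Keep (index, value) pairs whose relative residual is within the threshold.

    ASSUMPTION: the percentage is read as a bound on |y - fitted| / |fitted|.
    """
    slope, intercept = fit_line(ys)
    kept = []
    for x, y in enumerate(ys):
        fitted = slope * x + intercept
        rel = abs(y - fitted) / max(abs(fitted), 1e-9)
        if rel <= max_rel_residual:
            kept.append((x, y))
    return kept


# Usage: a flat series with one promotion-like spike at index 5.
series = [100, 100, 100, 100, 100, 300, 100, 100, 100, 100]
cleaned = drop_large_residuals(series)
print([x for x, _ in cleaned])
```

Because the fitted model is intentionally too simple to follow the spike, the spike shows up as a large residual and is removed, while ordinary days stay in the training data.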

Step 2: Construct features from the preprocessed data.

To improve the accuracy of model prediction, the invention collects weather data for provinces and cities across the country, together with holiday weather data, as a supplement to the raw data. From the additionally collected temperature, humidity, air pressure, and other data, the weather conditions are converted, based on experience, into two simple indicators: a precipitation index and a sunny index. Because human perception of meteorological parameters is not linear, a human body comfort index (Comfort Index of Human Body, SSD) is generated as an important feature for model training. The features and labels finally used for model training and prediction are shown in Table 1.

Table 1. Features used for model training and prediction

Step 3: Build the customer flow prediction model based on the historical mean and boosting trees.

To obtain a highly accurate customer flow prediction model, the invention uses a two-stage training method. The first stage of training uses the XGBoost (eXtreme Gradient Boosting) and GBDT (Gradient Boosting Decision Tree) models. The model training parameters are shown in Table 2 and Table 3. Each kind of model is trained with 2 sets of parameters, giving 4 models in total.

Table 2. Parameter settings of the XGBoost algorithm

XGBoost | No. 1 | No. 2
Objective function | Linear regression model | Linear regression model
Maximum tree depth | 3 | 5
Learning rate | 0.1 | 0.03
Number of boosted trees | 500 | 1600
L1 regularization parameter | 0 | 1
L2 regularization parameter | 1 | 0

Table 3. Parameter settings of the GBDT algorithm

GBDT | Maximum tree depth | Learning rate | Number of boosted trees | Training sampling ratio
No. 1 | 3 | 0.1 | 500 | 0.95
No. 2 | 5 | 0.1 | 500 | 0.95
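For orientation, the settings in Tables 2 and 3 can be written as parameter dictionaries, together with a helper around the cv function mentioned in step 301. Several points are assumptions rather than statements from the patent: the XGBoost keys follow the native parameter names (`eta`, `alpha`, `lambda`), the "linear regression model" objective is mapped to `reg:squarederror` (the current name of the former `reg:linear`), the GBDT keys follow scikit-learn's `GradientBoostingRegressor` naming because the text does not name a GBDT library, and the fold count and early-stopping patience are arbitrary.

```python
# Table 2 as XGBoost-style parameters (key names are an assumption, see above).
XGB_PARAMS = {
    "no1": {"objective": "reg:squarederror", "max_depth": 3, "eta": 0.1, "alpha": 0, "lambda": 1},
    "no2": {"objective": "reg:squarederror", "max_depth": 5, "eta": 0.03, "alpha": 1, "lambda": 0},
}
XGB_ROUNDS = {"no1": 500, "no2": 1600}  # number of boosted trees

# Table 3 using scikit-learn style names (the GBDT library is an assumption).
GBDT_PARAMS = {
    "no1": {"max_depth": 3, "learning_rate": 0.1, "n_estimators": 500, "subsample": 0.95},
    "no2": {"max_depth": 5, "learning_rate": 0.1, "n_estimators": 500, "subsample": 0.95},
}


def suggest_num_rounds(params, dtrain, max_rounds, nfold=5, patience=50):
    """Use xgboost.cv with early stopping to suggest a number of boosting rounds.

    dtrain is an xgboost.DMatrix. Assumes pandas is installed, so that
    xgboost.cv returns a DataFrame with one row per retained boosting round.
    """
    import xgboost as xgb  # imported lazily so the dictionaries above work without it

    history = xgb.cv(
        params,
        dtrain,
        num_boost_round=max_rounds,
        nfold=nfold,
        metrics="rmse",
        early_stopping_rounds=patience,
        seed=0,
    )
    return len(history)
```

A call such as `suggest_num_rounds(XGB_PARAMS["no2"], dtrain, max_rounds=2000)` would be repeated for each candidate learning rate and depth, which is one plausible way to arrive at settings like those of model No. 2.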

The invention tunes the tree depth, learning rate, and number of iterations of the XGBoost and GBDT algorithms. In model No. 1 of the XGBoost algorithm, the learning rate is generally left at its default value of 0.1 and the maximum tree depth at its default value of 3. For different problems, however, the ideal learning rate may vary within certain ranges, and the deeper the tree, the more closely it fits the data. Therefore, when determining the learning rate and the maximum tree depth of XGBoost model No. 2, the invention uses the cv function built into XGBoost. The cv function applies cross-validation in every boosting round and, for a given set of algorithm parameters, returns the ideal number of decision trees. Based on the more precise results from the cv function, the learning rate of model No. 2 is set to 0.03 and the maximum tree depth to 5.

The second stage of training uses the historical mean model. Taking the forecast date as the reference, the historical mean model first averages the sales of the 21 days before the forecast date to obtain the average daily sales. It then computes the median and mean of the sales for each week and obtains the weekly sales increment by linear fitting.
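The historical mean model described above can be sketched in a few lines. The text does not say whether the linear fit is applied to the weekly means or the weekly medians, so fitting the weekly means against the week index is an assumption, and the function name and returned keys are hypothetical.

```python
from statistics import mean, median


def historical_mean_model(last_21_days):
    """Summarize the 21 daily sales values that precede the forecast date.

    Values are ordered oldest first. ASSUMPTION: the weekly increment is the
    least-squares slope of the three weekly means against week index 0, 1, 2.
    """
    if len(last_21_days) != 21:
        raise ValueError("expected exactly 21 daily values")

    weeks = [last_21_days[i:i + 7] for i in range(0, 21, 7)]
    weekly_means = [mean(w) for w in weeks]
    weekly_medians = [median(w) for w in weeks]

    # Least-squares slope for x = 0, 1, 2: the x mean is 1 and sum((x - 1)^2) is 2.
    y_mean = mean(weekly_means)
    slope = sum((x - 1) * (y - y_mean) for x, y in enumerate(weekly_means)) / 2

    return {
        "daily_average": mean(last_21_days),
        "weekly_means": weekly_means,
        "weekly_medians": weekly_medians,
        "weekly_increment": slope,
    }


# Usage: sales that grow by 2 per day from one week to the next.
summary = historical_mean_model([10] * 7 + [12] * 7 + [14] * 7)
print(summary["daily_average"], summary["weekly_increment"])
```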

Step 4: Perform multi-model weighted fusion of the trained learners to predict merchant customer flow. The correlation matrix of the historical sales over the past 21 days is the input; the sales for the next two weeks and the weight coefficients for fusing the historical mean model with the first-stage models are the output. The fusion ratio of the mean model is at most 0.75. The fusion weight coefficient Credit is calculated by the formula:

In the formula, the first symbol (an image in the original publication, not reproduced in this text) is the average sales over the past three weeks, and Fus_last is the sales over the past three weeks. The historical mean model takes the forecast date as the reference, derives information such as the average customer flow and the sales increment over a period preceding the forecast date, and is then fused in proportion to its weight coefficient to predict the customer flow for the next 14 days. Substituting the three-week average sales and sales values obtained from XGBoost, GBDT, and the historical mean model into the weight coefficient formula gives the corresponding weight coefficients 0.47, 0.34, and 0.19. Finally, the results of the 2 trained XGBoost models and the 2 trained GBDT models are fused with the historical mean model in the proportions 0.47, 0.34, and 0.19, giving the predicted customer flow for the next 14 days.
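The final fusion is a weighted sum, which can be sketched as below using the three coefficients reported in the text. Two points are assumptions: the coefficients are assigned to XGBoost, GBDT, and the historical mean model in the order they are listed, and the two variants of each boosting model are averaged before weighting, since the text does not spell out how four boosted models map onto three weights.

```python
# Weight coefficients reported in the text (assignment order is an assumption).
WEIGHTS = {"xgboost": 0.47, "gbdt": 0.34, "historical_mean": 0.19}


def fuse(xgb_preds, gbdt_preds, hist_pred, weights=WEIGHTS):
    """Weighted sum of model predictions for one merchant and one day.

    xgb_preds and gbdt_preds hold the predictions of the variants of each
    model family. ASSUMPTION: variants are averaged before weighting.
    """
    xgb_avg = sum(xgb_preds) / len(xgb_preds)
    gbdt_avg = sum(gbdt_preds) / len(gbdt_preds)
    return (
        weights["xgboost"] * xgb_avg
        + weights["gbdt"] * gbdt_avg
        + weights["historical_mean"] * hist_pred
    )


# Usage: two XGBoost variants, two GBDT variants, one historical-mean estimate.
fused = fuse([210.0, 190.0], [105.0, 95.0], 0.0)
print(round(fused, 2))
```

Because the three coefficients sum to 1, the fused value always lies between the smallest and largest of the three family-level predictions.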

After the algorithm parameters were optimized, the model was evaluated on the test set samples; the run results and accuracy tests are shown in Table 4.

Table 4. Accuracy test of the fusion model of historical mean and boosting tree

In the experiments, the performance of the proposed model was evaluated with a custom evaluation function in XGBoost. When the evaluation function is called, the validation set and the predictions on the validation set are passed in as arguments, and a floating-point evaluation value, fevalerror, is returned. The larger the value of fevalerror, the lower the prediction accuracy of the model; the smaller the value, the higher the accuracy. The results show that as the training set grows, the computation time increases, the fevalerror value gradually decreases, and the accuracy gradually increases. The fusion model of historical mean and boosting tree therefore offers relatively high prediction accuracy and relatively fast computation.

Because a time series reflects how an entity's attributes behave in time order, a time series weighted regression algorithm was also implemented. Analyzing the prediction results of the two algorithms gives the customer flow trends of the top 500 Internet merchants over the next 14 days, shown in Figure 1 and Figure 2. The horizontal axis is the merchant ID and the vertical axis is the predicted customer flow. The analysis of the customer flow trends shows that:

1) Variables related to browsing actions contribute most to the model, because browsing is the main form of user interaction and carries far richer information than the other features;

2) Some merchants may sell highly rated products, and returning customers give these merchants steadily rising customer flow;

3) For most merchants the total 14-day customer flow has already exceeded 5,000, and for a few it even reaches about 25,000. This is very likely the result of recent promotions by the merchants, such as coupons of various values, cash red envelopes, and discounts on purchases above a certain amount distributed through the platform. Even so, how merchants adjust their operating strategies to attract more customer flow remains crucial.

Claims

1. A merchant customer flow prediction method fusing a historical mean with boosting trees, characterized in that it comprises the following steps:

Step 1. Preprocess the complete merchant behavior data for a given time period, where the complete merchant behavior data include merchant characteristic data, user payment behavior data, and user browsing behavior data;

Step 2. Construct features from the preprocessed data, adding holiday data and weather feature data;

Step 3. Build a customer flow prediction model based on the historical mean and boosting trees, including the following steps:

Step 301. Construct two learning models for XGBoost and GBDT respectively, and tune the tree depth, learning rate, and number of iterations of the two learning models; when determining the learning rate and the maximum tree depth of the XGBoost learning model, use the cv function built into XGBoost;

Step 302. Train the XGBoost learning model and the GBDT learning model with the data obtained in step 2, set a forecast date, and compute the average customer flow and the sales increment over a period preceding the forecast date;

Step 4. Take the correlation matrix of the historical sales over a past time period as the input of the customer flow prediction model trained in step 3, and take the sales over a future time period and the weight coefficient Credit used to fuse the XGBoost learning model and the GBDT learning model as the output:

In the formula, the first symbol (an image in the original publication, not reproduced in this text) is the average sales over a past time period, and Fus_last is the sales over that period. The average sales and the sales values for the past period obtained from the XGBoost learning model, the GBDT learning model, and the historical mean model are each substituted into the Credit formula to obtain the corresponding weight coefficients. Finally, the results of the 2 trained XGBoost learning models and the 2 trained GBDT learning models are fused with the historical mean model in proportion to the corresponding weight coefficients, giving the predicted customer flow for a future time period.

2. The merchant customer flow prediction method fusing a historical mean with boosting trees according to claim 1, characterized in that the preprocessing in step 1 includes the following steps:

Step 101. Remove from the complete merchant behavior data the data for the first 7 days after a merchant opens and the data for the 3 days before and after a sales interruption, and divide the remaining data into a training set and a test set;

Step 102. Remove duplicate data from the training set and the test set, and normalize the deduplicated data with a rule-based method, thereby eliminating abnormal data caused by a single user making a large number of purchases within a short time;

Abnormal data caused by special time points and by unpredictable large fluctuations are removed with a model pre-training method: an underfitting algorithm is used to pre-train the customer flow prediction model, and the data whose residuals are 10% and 25% are removed.

3. The merchant customer flow prediction method fusing a historical mean with boosting trees according to claim 1, characterized in that step 2 includes the following steps:

Step 201. Collect weather data for provinces and cities across the country;

Step 202. Convert the weather conditions into two simple indicators, a precipitation index and a sunny index, and generate a human body comfort index as an important feature for training the customer flow prediction model;

Step 203. Collect holiday data for the current time period, labeling working days as 0, weekends as 1, and holidays as 2.