CN111178633A - Method and device for predicting scenic spot passenger flow based on random forest algorithm

Info

Publication number: CN111178633A
Application number: CN201911406473.0A
Authority: CN
Inventors: 皮慧婷; 洪学海; 杨勇; 陈鑫
Original assignee: Institute Of Big Data Cloud Computing Center Of Chinese Academy Shangrao
Current assignee: Institute Of Big Data Cloud Computing Center Of Chinese Academy Shangrao
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2020-05-19

Abstract

The embodiments of the application disclose a method and a device for predicting scenic spot passenger flow based on a random forest algorithm. The method comprises the following steps: establishing a parameter-optimized random forest algorithm model, tuning the model parameters according to the goodness of fit and the average standard error, and searching for the optimal random forest algorithm model based on a grid search algorithm; and inputting the feature data set into the optimal random forest algorithm model to obtain the predicted future daily passenger flow of the scenic spot. The method and the device provided by the embodiments of the application can effectively improve the accuracy of the prediction model, ensure the timeliness of prediction, and facilitate the management of cities and scenic spots.

Description

Method and device for predicting scenic spot passenger flow based on random forest algorithm

Technical Field

The application relates to the technical field of artificial intelligence, in particular to a method and a device for predicting scenic spot passenger flow based on a random forest algorithm.

Background

With the rapid development of the tourism industry in China, travel has become a normal part of life. The passenger flow of many scenic spots in China changes markedly with the seasons, which is typical of seasonal tourism. Whenever passenger flow peaks, cities and tourist attractions experience varying degrees of congestion and disorder, and even conflicts among crowds, which endanger the personal and property safety of tourists and cause great difficulty for the management of cities and scenic spots. A method for predicting the future passenger flow of a scenic spot is therefore urgently needed, so that scenic spot managers can take effective precautionary measures in advance according to the predicted passenger flow and the actual reception capacity of the scenic spot, thereby avoiding congestion, conflict and even personal injury and property damage, and ensuring the service quality and safety of the scenic spot. With the development of tourism big data, current scenic spot passenger flow prediction faces two main problems. First, as historical data keeps growing, model training time also keeps growing, and the timeliness of prediction is difficult to ensure. Second, prediction accuracy depends on the features and the prediction model, and unsuitable features or training models lead to unsatisfactory accuracy. Many existing methods for scenic spot passenger flow prediction provide some help for scenic spot management decisions, but the timeliness and accuracy of their prediction models are difficult to improve.

Disclosure of Invention

The embodiment of the application mainly aims to provide a method and a device for predicting scenic spot passenger flow based on a random forest algorithm, so that the accuracy of predicting scenic spot passenger flow is improved, and the timeliness of prediction is guaranteed.

In a first aspect, a method for predicting scenic spot passenger flow based on a random forest algorithm is provided, which comprises the following steps: first, establishing a parameter-optimized random forest algorithm model, tuning the model parameters according to the goodness of fit and the average standard error, and searching for the optimal random forest algorithm model based on a grid search algorithm; then, inputting the feature data set into the optimal random forest algorithm model to obtain the predicted future daily passenger flow of the scenic spot, so that the scenic spot can conveniently formulate management measures in advance.

In one possible design, the feature data set is obtained in a manner that includes: collecting historical passenger flow and characteristic factors of the scenic spot to obtain historical original data, wherein the characteristic factors comprise weather, holidays, highest air temperature, lowest air temperature and seasons; completing the tourist data values missing from the historical original data by using the Lagrange interpolation method or the mean of contemporaneous historical data, depending on the type of missing value; quantizing the discrete qualitative features through high-dimensional mapping; and performing standardization and normalization.

In one possible design, the establishing a parameter-optimized random forest algorithm model includes: defining a training set and a test set, dividing the training set and the test set of the whole data set according to a proportion, and establishing an input matrix and an output matrix of the training set/the test set; calculating an R square value, a mean square error and an average absolute error; and debugging the model parameters according to the goodness of fit and the average standard error, and searching an optimal random forest model based on a grid search algorithm.

In one possible design, finding the optimal random forest model based on the grid search algorithm specifically includes: determining the ranges of the number k of decision trees and the number m_try of candidate split attributes, setting a step size, and establishing a two-dimensional grid on the k and m_try coordinate system, wherein the grid nodes are the corresponding (k, m_try) parameter pairs; constructing a random forest for each parameter pair on the grid nodes, and estimating the classification error using the out-of-bag data; and selecting the parameters k and m_try with the smallest classification error, and if the classification error or the step size meets the requirement, outputting the optimal parameters and the classification error.

In one possible design, the calculating the R-squared value, the mean-squared error, and the mean absolute error includes:

R-squared value

Mean square error

Mean absolute error

Wherein y represents the daily passenger flow volume, and y_i represents the historical passenger flow on day i.
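For reference, the conventional definitions of these three metrics are given below. The symbols \hat{y}_i (predicted passenger flow on day i), \bar{y} (mean of the observed passenger flows) and n (number of samples) are notation assumed here for illustration and may differ from the notation of the original formulas.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
```

Under these definitions, an R-squared value closer to 1 indicates a better fit, while smaller MSE and MAE values indicate smaller prediction errors.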

In a second aspect, an embodiment of the present application further provides an apparatus, including:

the characteristic data set unit is used for processing historical original data consisting of historical passenger flow and characteristic factors of scenic spots to obtain a characteristic data set;

and the random forest algorithm model is used for outputting the predicted future daily passenger flow of the scenic spot according to the input feature data set.

In one possible design, the feature data set unit includes:

an acquisition unit, used for collecting historical passenger flow and characteristic factors of scenic spots to obtain historical original data, wherein the characteristic factors comprise weather, holidays, maximum temperature, minimum temperature and seasons;

the first processing unit is used for completing the tourist data values missing from the historical original data by using a Lagrange interpolation method and a contemporaneous historical data mean value respectively aiming at different types of missing values;

the second processing unit is used for quantizing the discrete qualitative characteristics through high-dimensional mapping;

and the third processing unit is used for carrying out standardization and normalization processing.

In one possible design, the random forest algorithm model is generated by: defining a training set and a test set, dividing the training set and the test set of the whole data set according to a proportion, and establishing an input matrix and an output matrix of the training set/the test set; calculating an R square value, a mean square error and an average absolute error; and debugging the model parameters according to the goodness of fit and the average standard error, and searching an optimal random forest model based on a grid search algorithm.

In one possible design, the finding of the optimal random forest model by the random forest algorithm model based on the grid search algorithm specifically includes: determining the ranges of the number k of decision trees and the number m_try of candidate split attributes, setting a step size, and establishing a two-dimensional grid on the k and m_try coordinate system, wherein the grid nodes are the corresponding (k, m_try) parameter pairs; constructing a random forest for each parameter pair on the grid nodes, and estimating the classification error using the out-of-bag data; and selecting the parameters k and m_try with the smallest classification error, and if the classification error or the step size meets the requirement, outputting the optimal parameters and the classification error.

In one possible design, the calculating the R-square value, the mean square error, and the mean absolute error by the random forest algorithm model includes:

R-squared value

Mean square error

Mean absolute error

In a third aspect, the present application provides a computer program product, where the computer program product includes a computer program stored in a computer-readable storage medium, and the computer program is loaded by a controller to implement the method of any possible implementation manner of the first aspect.

In a fourth aspect, the present application provides a non-transitory computer-readable storage medium for storing a computer program, which is loaded by a processor to execute the instructions of the method of any possible implementation manner of the first aspect.

In a fifth aspect, embodiments of the present application provide a chip, where the chip includes a programmable logic circuit and/or program instructions, and when the chip is executed, the chip is configured to implement the method according to any possible implementation manner of the first aspect.

In summary, the method and the device provided by the embodiments of the application can effectively improve the accuracy of the prediction model and ensure the timeliness of prediction. First, the parameter-optimized random forest algorithm is suitable for discrete data, can extract the rules contained in the data, and requires no prior knowledge; compared with methods such as neural networks it is lower in cost and convenient and efficient to use, so the timeliness of scenic spot daily passenger flow prediction is ensured as long as the latest prediction sample data is acquired in real time. Second, the parameter-optimized random forest algorithm tolerates missing values, abnormal values and noise well and is not prone to overfitting, and the prediction accuracy is further improved by combining it with the analysis of the factors that influence scenic spot passenger flow.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a frame diagram of a random forest algorithm provided in an embodiment of the present application;

FIG. 2 is a flow chart of a method for predicting scenic spot passenger flow based on a random forest algorithm disclosed in the embodiment of the present application;

FIG. 3 is a flowchart of a method for finding an optimal random forest model based on a grid search algorithm, disclosed in an embodiment of the present application;

FIG. 4 is a schematic block diagram of a device for predicting scenic spot passenger flow based on a random forest algorithm, disclosed in the embodiment of the present application;

FIG. 5 is a schematic block diagram of a feature data set unit disclosed in an embodiment of the present application;

FIG. 6 shows an application example of predicting scenic spot passenger flow based on a random forest algorithm according to an embodiment of the present application.

Detailed Description

The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.

The Random Forest (RF) algorithm is a classification and regression algorithm that integrates multiple decision trees through the idea of ensemble learning. In RF, decision trees are used as the base learners. The Bagging (bootstrap aggregating) method is first used to generate training data sets that differ from one another, and each decision tree is then constructed using the random subspace method: a subset of attributes is randomly selected from all attributes, and at each split of the tree the optimal attribute is chosen from that subset. For classification, the final class is the one that receives the most votes from the leaf nodes reached by the sample point; for regression analysis (including data prediction), the final prediction is the mean of the values of the leaf nodes reached by the sample point. The 'double random' idea makes RF less prone to overfitting and creates diversity among the sub-classifiers, so RF performs well. The framework of the random forest algorithm is shown in Fig. 1.
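The two sources of randomness and the two aggregation rules described above can be illustrated with a minimal standard-library sketch. It does not build decision trees; it only shows bootstrap sampling, out-of-bag rows, random attribute selection, and how per-tree outputs would be combined. All function names are hypothetical and chosen for illustration.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_indices(n_samples, rng):
    """Bagging: draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def out_of_bag_indices(n_samples, in_bag):
    """Rows never drawn into the bootstrap sample (usable for error estimation)."""
    drawn = set(in_bag)
    return [i for i in range(n_samples) if i not in drawn]


def random_feature_subset(n_features, m_try, rng):
    """Random subspace: pick m_try candidate attributes for one split."""
    return sorted(rng.sample(range(n_features), m_try))


def aggregate_classification(tree_votes):
    """Classification: the class with the most votes wins."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Regression: the prediction is the mean of the per-tree outputs."""
    return mean(tree_outputs)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag = bootstrap_indices(10, rng)
oob = out_of_bag_indices(10, in_bag)
features = random_feature_subset(n_features=6, m_try=2, rng=rng)
```

For daily passenger flow prediction the regression rule applies, so the forest output would be the mean of the values returned by the individual trees.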

The random forest algorithm can effectively improve the accuracy of the prediction model and ensure the timeliness of prediction. First, the parameter-optimized random forest algorithm is suitable for discrete data, can extract the rules contained in the data, and requires no prior knowledge; compared with methods such as neural networks it is lower in cost and convenient and efficient to use, so the timeliness of scenic spot daily passenger flow prediction is ensured as long as the latest prediction sample data is acquired in real time. Second, the parameter-optimized random forest algorithm tolerates missing values, abnormal values and noise well and is not prone to overfitting, and the prediction accuracy is further improved by combining it with the analysis of the factors that influence scenic spot passenger flow.

Referring to fig. 2, a flowchart of a method for predicting scenic spot passenger flow based on a random forest algorithm disclosed in the embodiment of the present application includes the following steps:

Step 201: establishing a parameter-optimized random forest algorithm model, tuning the model parameters according to the goodness of fit and the average standard error, and searching for the optimal random forest algorithm model based on a grid search algorithm.

Step 202: inputting the feature data set into the optimal random forest algorithm model to obtain the predicted future daily passenger flow of the scenic spot.

In one embodiment, for step 201, establishing a random forest algorithm model with optimized parameters includes the following steps:

a. Establishing the model: a training set and a test set are defined, the whole data set is divided into the training set and the test set according to a proportion, and the input matrix and the output matrix of the training set/test set are established.

The training matrix is:

Input matrix:

Output matrix:

where f_i(x) is the historical passenger flow; m_i is the date; [w_1, ..., w_n] represents the weather sequence, which has 3 columns of data, each column representing one group of weather conditions (weather, highest air temperature and lowest air temperature); h_1 is the holiday indicator (weekday 0, ordinary weekend 1, holiday 2); and f_i(t) is the passenger flow to be predicted. The data set is then trained with the random forest algorithm to establish the model.
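A minimal sketch of how input rows and a proportional training/test split could be assembled is shown below. The column order, the 0.8 split ratio, the chronological split, the holiday set and the passenger counts are all assumptions made for illustration; only the holiday coding (weekday 0, ordinary weekend 1, holiday 2) comes from the description.

```python
from datetime import date, timedelta

# Hypothetical holiday list; a real system would load the official calendar.
PUBLIC_HOLIDAYS = {date(2019, 10, 1), date(2019, 10, 2)}


def holiday_code(day):
    """Weekday 0, ordinary weekend 1, holiday 2 (coding taken from the description)."""
    if day in PUBLIC_HOLIDAYS:
        return 2
    return 1 if day.weekday() >= 5 else 0


def build_row(day, weather_code, t_max, t_min):
    """One input row: date, three weather columns, holiday code (assumed layout)."""
    return [day.toordinal(), weather_code, t_max, t_min, holiday_code(day)]


def split_by_ratio(inputs, outputs, train_ratio=0.8):
    """Split time-ordered samples into a training part and a test part."""
    cut = int(len(inputs) * train_ratio)
    return (inputs[:cut], outputs[:cut]), (inputs[cut:], outputs[cut:])


start = date(2019, 9, 27)
days = [start + timedelta(days=i) for i in range(10)]
inputs = [build_row(d, weather_code=0, t_max=25.0, t_min=15.0) for d in days]
outputs = [1000 + 100 * i for i in range(10)]  # made-up passenger counts
(train_x, train_y), (test_x, test_y) = split_by_ratio(inputs, outputs)
```

Keeping the samples in time order before splitting means the test set contains the most recent days, which is one reasonable choice for a forecasting task; a random split is another option the description does not rule out.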

b. Determining an evaluation system: and calculating the R square value, the mean square error and the mean absolute error of the prediction model.

c. Optimizing a model: and debugging the model parameters according to the goodness of fit and the average standard error, and searching for an optimal random forest model based on a grid search algorithm.

In one embodiment, step 202 includes the following steps for the feature data set:

a. Data collection: the historical passenger flow and characteristic factors of the scenic spot are collected, wherein the characteristic factors mainly comprise weather, holidays, highest air temperature, lowest air temperature and seasons.

The collected influence factor data mainly comprises five dimensions of weather, holidays, highest air temperature, lowest air temperature and seasons, wherein the weather mainly refers to the influence of air temperature and rainfall on the amount of tourists; the holidays comprise weekends and national legal holidays, and the influence weights of different holidays on the scenic area tourist capacity are different; many scenic spots are sensitive to the influence of season changes, and the number of tourists is different in different seasons; the air temperature refers to the influence of the highest and lowest air temperature in the day in the scenic spot on the trip of the tourists.
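One possible record layout for a single day of collected data is sketched below. The class name, field names and the month-to-season mapping are hypothetical; the description names the five factors but does not define season boundaries, so northern-hemisphere meteorological seasons are assumed here.

```python
from dataclasses import dataclass
from datetime import date

# Assumed mapping: northern-hemisphere meteorological seasons.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


@dataclass
class DailyRecord:
    """One day of historical original data: passenger flow plus the factors."""
    day: date
    visitors: int   # historical passenger flow
    weather: str    # qualitative weather label
    holiday: int    # 0 weekday, 1 ordinary weekend, 2 holiday
    t_max: float    # highest air temperature of the day
    t_min: float    # lowest air temperature of the day

    @property
    def season(self) -> str:
        """Season derived from the calendar month (assumed mapping above)."""
        return SEASON_BY_MONTH[self.day.month]


# Illustrative values only, not real scenic spot data.
record = DailyRecord(date(2019, 10, 1), 52000, "sunny", 2, 26.0, 17.0)
```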

b. Missing value processing: the original passenger flow data provided by scenic spots is prone to missing values, which fall into two cases. For a small number of scattered missing values, the Lagrange interpolation method is used to complete them; for missing values covering a large continuous span, the missing tourist data is supplemented with the mean of the contemporaneous historical data.
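The first case, filling scattered gaps by Lagrange interpolation, can be sketched as follows. The helper names and the choice of up to two known neighbours on each side of a gap are assumptions; the description does not specify how many neighbouring points are used.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange polynomial through the points (xs[i], ys[i]) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def fill_scattered(series, window=2):
    """Fill isolated None entries from up to `window` known neighbours per side."""
    filled = list(series)
    for idx, value in enumerate(series):
        if value is not None:
            continue
        left = [i for i in range(idx - 1, -1, -1) if series[i] is not None][:window]
        right = [i for i in range(idx + 1, len(series)) if series[i] is not None][:window]
        xs = sorted(left + right)
        if len(xs) >= 2:  # need at least two points to interpolate
            filled[idx] = lagrange_interpolate(xs, [series[i] for i in xs], idx)
    return filled


# Made-up daily counts with one missing day.
filled = fill_scattered([100, 110, None, 130, 140])
```

Using only a few nearby points keeps the polynomial degree low, which avoids the oscillation that high-degree interpolating polynomials can show.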

c. Data preprocessing: the collected data set contains discrete qualitative features, and the statistical units and scales of the input and output variables are inconsistent, which can cause a large loss of prediction precision. Therefore, before data modeling, the discrete qualitative features are quantized through high-dimensional mapping, and the input data is then standardized and normalized so that the scenic spot passenger flow data and the data in the influencing factor matrix are all positive values, which helps improve the prediction precision.
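The description does not say which high-dimensional mapping is used. One common reading is one-hot encoding, in which each category of a qualitative feature becomes its own 0/1 column. The sketch below illustrates that interpretation only; the function names and weather labels are hypothetical.

```python
def fit_categories(values):
    """Collect the distinct categories in a stable (sorted) order."""
    return sorted(set(values))


def one_hot(value, categories):
    """Map one qualitative value to a 0/1 vector with one position per category."""
    return [1 if value == c else 0 for c in categories]


weather = ["sunny", "rain", "cloudy", "sunny"]  # made-up labels
categories = fit_categories(weather)            # ['cloudy', 'rain', 'sunny']
encoded = [one_hot(w, categories) for w in weather]
```

Each weather label is thereby replaced by a three-element vector, so the model never sees an artificial ordering among the categories.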

The standardization formula is shown in (1), where E(X) represents the mean and Var(X) represents the variance:

$$X^{*} = \frac{X - E(X)}{\sqrt{Var(X)}} \qquad (1)$$
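A minimal sketch of the two scaling steps follows: standardization subtracts the mean and divides by the square root of the variance, and min-max normalization rescales a column to the range [0, 1]. The use of the population variance and of min-max scaling for the normalization step are assumptions, since the description does not specify either.

```python
from math import sqrt
from statistics import fmean, pvariance


def standardize(column):
    """Subtract the mean and divide by the standard deviation (population variance)."""
    mu = fmean(column)
    sigma = sqrt(pvariance(column, mu))
    if sigma == 0:
        return [0.0 for _ in column]  # constant column carries no information
    return [(x - mu) / sigma for x in column]


def min_max(column):
    """Rescale a column linearly so its values lie in [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


z_scores = standardize([1.0, 2.0, 3.0])
scaled = min_max([10, 20, 30])
```

Standardized values can be negative, so a rescaling such as min-max is one way to obtain the non-negative inputs mentioned in the preprocessing step.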

As shown in Fig. 3, finding the optimal random forest model based on the grid search algorithm specifically includes:

(1) Determining the ranges of the number k of decision trees and the number m_try of candidate split attributes, setting a step size, and establishing a two-dimensional grid on the k and m_try coordinate system, wherein the grid nodes are the corresponding (k, m_try) parameter pairs.

(2) Constructing a random forest for each parameter pair on the grid nodes, and estimating the classification error using the out-of-bag data.

(3) Selecting the parameters k and m_try with the smallest classification error. If the classification error or the step size meets the requirement, the optimal parameters and the classification error are output; otherwise, the step size is reduced and the above steps are repeated to continue the search.
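The three steps above can be sketched as a coarse-to-fine search. In this sketch `error_fn` is a hypothetical stand-in for building a forest with the given (k, m_try) pair and returning its out-of-bag error; the halving of the step size, the shrinking of the search window around the best node, and the toy error surface are all assumptions made for illustration.

```python
def grid_search(error_fn, k_range, m_range, k_step, m_step, tol=0.0, min_step=1):
    """Return (best_k, best_m_try, best_error) from a refining grid search."""
    k_lo, k_hi = k_range
    m_lo, m_hi = m_range
    while True:
        # Steps (1) and (2): evaluate every node of the current grid.
        errors = {
            (k, m): error_fn(k, m)
            for k in range(k_lo, k_hi + 1, k_step)
            for m in range(m_lo, m_hi + 1, m_step)
        }
        # Step (3): keep the node with the smallest error.
        best_k, best_m = min(errors, key=errors.get)
        best_err = errors[(best_k, best_m)]
        if best_err <= tol or (k_step <= min_step and m_step <= min_step):
            return best_k, best_m, best_err
        # Otherwise narrow the window around the best node and reduce the step.
        k_lo, k_hi = max(k_range[0], best_k - k_step), min(k_range[1], best_k + k_step)
        m_lo, m_hi = max(m_range[0], best_m - m_step), min(m_range[1], best_m + m_step)
        k_step, m_step = max(min_step, k_step // 2), max(min_step, m_step // 2)


def toy_oob_error(k, m_try):
    """Made-up error surface with its minimum at k=130, m_try=3."""
    return (k - 130) ** 2 / 1e4 + (m_try - 3) ** 2 / 10


best_k, best_m, best_err = grid_search(toy_oob_error, (50, 250), (1, 7), k_step=50, m_step=2)
```

In a real system the error function would train one forest per grid node, so caching the errors per node (as the dictionary does here) avoids retraining when the best node is read back.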

In one embodiment, calculating the R-squared value, the mean-squared error, and the mean absolute error comprises:

R-squared value

Mean square error

Mean absolute error
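The three evaluation metrics can be computed with a few lines of standard-library code. This sketch uses the conventional definitions of the R-squared value, mean square error and mean absolute error; the function names and sample values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """1 minus the ratio of the residual sum of squares to the total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


observed = [100, 200, 300]   # made-up daily passenger counts
predicted = [110, 190, 300]  # made-up model outputs
r2 = r_squared(observed, predicted)
mse = mean_squared_error(observed, predicted)
mae = mean_absolute_error(observed, predicted)
```

The R-squared value measures goodness of fit, while the two error metrics express the size of the prediction error in squared and in original units respectively.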

As shown in fig. 4, an embodiment of the present application further provides an apparatus, including:

the feature data set unit 401, used for processing historical original data composed of the historical passenger flow and characteristic factors of the scenic spot to obtain a feature data set;

And the random forest algorithm model 402 is used for outputting the predicted future daily passenger flow of the scenic spot according to the input feature data set.

In one embodiment, the establishing of the parameter optimized random forest algorithm model comprises the following steps:

The training matrix is:

Input matrix:

Output matrix:

As shown in fig. 5, in an embodiment, the feature data set unit specifically includes:

the acquisition unit 501 is used for collecting historical passenger flow and characteristic factors of scenic spots to obtain historical original data, wherein the characteristic factors comprise weather, holidays, maximum temperature, minimum temperature and seasons;

a first processing unit 502, configured to complement, for different types of missing values, guest data values missing from the historical original data by using a lagrangian interpolation method and a contemporaneous historical data mean value, respectively;

a second processing unit 503, configured to quantize the discrete qualitative features through high-dimensional mapping;

a third processing unit 504, configured to perform standardization and normalization processing.

For the acquisition unit 501, for the feature data set, the collected influence factor data mainly includes five dimensions of weather, holidays, highest air temperature, lowest air temperature and seasons, and the weather mainly refers to the influence of air temperature and rainfall on the amount of tourists; the holidays comprise weekends and national legal holidays, and the influence weights of different holidays on the scenic area tourist capacity are different; many scenic spots are sensitive to the influence of season changes, and the number of tourists is different in different seasons; the air temperature refers to the influence of the highest and lowest air temperature in the day in the scenic spot on the trip of the tourists.

For the first processing unit 502, the original passenger flow data provided by scenic spots is prone to missing values, which fall into two cases. For a small number of scattered missing values, the Lagrange interpolation method is used to complete them; for missing values covering a large continuous span, the missing tourist data is supplemented with the mean of the contemporaneous historical data.
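The second case, filling a large gap with the mean of contemporaneous historical data, can be sketched as follows. Here "contemporaneous" is interpreted as the same calendar day in other years, which is an assumption; the function names and the counts are hypothetical.

```python
from datetime import date
from statistics import mean


def same_period_mean(history, day):
    """Mean of the known values on the same month and day in other years, or None."""
    values = [
        v for d, v in history.items()
        if (d.month, d.day) == (day.month, day.day)
        and d.year != day.year
        and v is not None
    ]
    return mean(values) if values else None


def fill_block(history, missing_days):
    """Return a copy of history with each missing day replaced by its same-period mean."""
    filled = dict(history)
    for day in missing_days:
        estimate = same_period_mean(history, day)
        if estimate is not None:
            filled[day] = estimate
    return filled


# Made-up counts: 1 July is missing in 2019 but known in 2017 and 2018.
history = {date(2017, 7, 1): 1000, date(2018, 7, 1): 1200, date(2019, 7, 1): None}
completed = fill_block(history, [date(2019, 7, 1)])
```

Days with no usable history are left unchanged, so the caller can decide how to handle them.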

For the second processing unit 503, the collected data set contains discrete qualitative features, and the statistical units and scales of the input and output variables are inconsistent, which can cause a large loss of prediction precision. Therefore, before data modeling, the discrete qualitative features are quantized through high-dimensional mapping, and the input data is then standardized and normalized so that the scenic spot passenger flow data and the data in the influencing factor matrix are all positive values, which helps improve the prediction precision.

(1) Determining the ranges of the number k of decision trees and the number m_try of candidate split attributes, setting a step size, and establishing a two-dimensional grid on the k and m_try coordinate system, wherein the grid nodes are the corresponding (k, m_try) parameter pairs.

R-squared value

Mean square error

Mean absolute error

In a third aspect, the present application provides a computer program product, where the computer program product includes a computer program stored in a computer-readable storage medium, and the computer program is loaded by a controller to implement the method shown in fig. 2.

In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium for storing a computer program, which is loaded by a processor to execute the instructions of the method of fig. 2.

In a fifth aspect, embodiments of the present application provide a chip comprising programmable logic circuits and/or program instructions, which when run, implement the method of fig. 2.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for predicting scenic spot passenger flow based on a random forest algorithm is characterized by comprising the following steps:

establishing a parameter-optimized random forest algorithm model, debugging model parameters according to goodness of fit and average standard error, and searching an optimal random forest algorithm model based on a grid search algorithm;

and inputting the characteristic data set into the optimal random forest algorithm model to obtain the future daily passenger flow of the forecast scenic spot.

2. The method of claim 1, wherein the feature data set is obtained in a manner comprising:

collecting historical passenger flow and characteristic factors of scenic spots to obtain historical original data, wherein the characteristic factors comprise weather, holidays, highest air temperature, lowest air temperature and seasons;

respectively completing tourist data values missing from the historical original data by using a Lagrange interpolation method and a contemporaneous historical data mean value aiming at different types of missing values;

quantizing the discrete qualitative characteristics through high-dimensional mapping;

standardization and normalization processes are performed.

3. The method of claim 1, wherein the establishing a parameter-optimized random forest algorithm model comprises:

defining a training set and a test set, dividing the training set and the test set of the whole data set according to a proportion, and establishing an input matrix and an output matrix of the training set/the test set;

calculating an R square value, a mean square error and an average absolute error;

and debugging the model parameters according to the goodness of fit and the average standard error, and searching an optimal random forest model based on a grid search algorithm.

4. The method according to claim 1, wherein the finding of the optimal random forest model based on the grid search algorithm specifically comprises:

determining the ranges of the number k of decision trees and the number m_try of candidate split attributes, setting a step size, and establishing a two-dimensional grid on the k and m_try coordinate system, wherein the grid nodes are the corresponding (k, m_try) parameter pairs;

constructing a random forest for each parameter pair on the grid nodes, and estimating the classification error using the out-of-bag data;

selecting the parameters k and m_try with the smallest classification error, and if the classification error or the step size meets the requirement, outputting the optimal parameters and the classification error.

5. The method of claim 1, wherein calculating the R-squared, mean-squared and mean-absolute-error comprises:

R-squared value

Mean square error

Mean absolute error

6. An apparatus, comprising:

7. The apparatus of claim 6, wherein the feature data set unit comprises:

8. The apparatus of claim 6, wherein the random forest algorithm model is generated by:

9. The apparatus as claimed in claim 8, wherein the finding of the optimal random forest model based on the grid search algorithm by the random forest algorithm model specifically comprises:

10. The apparatus of claim 9, wherein the random forest algorithm model calculating an R-squared value, a mean squared error, and a mean absolute error comprises:

R-squared value

Mean square error

Mean absolute error