CN111178633A - Method and device for predicting scenic spot passenger flow based on random forest algorithm - Google Patents
Method and device for predicting scenic spot passenger flow based on random forest algorithm
- Publication number
- CN111178633A; CN201911406473.0A
- Authority
- CN
- China
- Prior art keywords
- random forest
- error
- passenger flow
- data
- historical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/14—Travel agencies
Abstract
The embodiments of the application disclose a method and a device for predicting scenic spot passenger flow based on a random forest algorithm. The method comprises: establishing a parameter-optimized random forest algorithm model, tuning the model parameters according to the goodness of fit and the average standard error, and searching for an optimal random forest algorithm model based on a grid search algorithm; and inputting a feature data set into the optimal random forest algorithm model to obtain the predicted future daily passenger flow of the scenic spot. The method and the device can effectively improve the accuracy of the prediction model, ensure the timeliness of prediction, and facilitate the management of cities and scenic spots.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a device for predicting scenic spot passenger flow based on a random forest algorithm.
Background
With the rapid development of the tourism industry in China, travel has become a normal part of life. The passenger flow of many scenic spots in China changes markedly with the seasons, so tourism there is typically seasonal. Whenever passenger flow peaks, cities and tourist attractions experience congestion and disorder to varying degrees, and even crowd conflicts, which endanger the personal and property safety of tourists and cause great difficulty for the management of cities and scenic spots. A method for predicting the future passenger flow of a scenic spot is therefore urgently needed, so that scenic spot managers can take effective precautions in advance according to the predicted passenger flow and the actual reception capacity of the scenic spot, avoiding congestion, conflicts, and even personal injury and property damage, and ensuring the service quality and safety of the scenic spot. With the development of tourism big data, current scenic spot passenger flow prediction mainly faces two problems. First, as historical data keeps growing, model training time keeps growing as well, which makes the timeliness of prediction difficult to ensure. Second, prediction accuracy depends on both the features and the prediction model, and poorly chosen features or training models lead to unsatisfactory accuracy. Many existing methods for scenic spot passenger flow prediction provide some help for scenic spot management decisions, but the timeliness and accuracy of their prediction models are difficult to improve.
Disclosure of Invention
The embodiment of the application mainly aims to provide a method and a device for predicting scenic spot passenger flow based on a random forest algorithm, so that the accuracy of predicting scenic spot passenger flow is improved, and the timeliness of prediction is guaranteed.
In a first aspect, a method for predicting scenic spot passenger flow based on a random forest algorithm is provided, comprising the following steps: first, establishing a parameter-optimized random forest algorithm model, tuning the model parameters according to the goodness of fit and the average standard error, and searching for an optimal random forest algorithm model based on a grid search algorithm; then, inputting the feature data set into the optimal random forest algorithm model to obtain the predicted future daily passenger flow of the scenic spot, so that the scenic spot can plan management measures in advance.
In one possible design, the feature data set is obtained by: collecting the historical passenger flow and characteristic factors of the scenic spot to obtain historical raw data, wherein the characteristic factors comprise weather, holidays, highest air temperature, lowest air temperature and season; completing the tourist data values missing from the historical raw data using Lagrange interpolation or the mean of contemporaneous historical data, depending on the type of missing value; quantizing the discrete qualitative features through high-dimensional mapping; and performing standardization and normalization.
In one possible design, establishing the parameter-optimized random forest algorithm model includes: defining a training set and a test set by dividing the whole data set in proportion, and establishing an input matrix and an output matrix for each of the training set and the test set; calculating the R-squared value, the mean square error and the mean absolute error; and tuning the model parameters according to the goodness of fit and the average standard error, and searching for an optimal random forest model based on a grid search algorithm.
In one possible design, searching for the optimal random forest model based on the grid search algorithm specifically includes: determining the ranges of the number of decision trees k and the number of candidate split attributes m_try, setting a step size, and establishing a two-dimensional grid on the k and m_try coordinate system, wherein each grid node is a corresponding parameter pair (k, m_try); constructing a random forest for each parameter pair on the grid nodes, and estimating the classification error using the out-of-bag data; and selecting the parameters k and m_try with the smallest classification error, and, if the classification error or the step size meets the requirement, outputting the optimal parameters and the classification error.
In one possible design, calculating the R-squared value, the mean square error and the mean absolute error includes:
wherein y represents the daily passenger flow and y_i represents the historical passenger flow on day i.
In a second aspect, an embodiment of the present application further provides an apparatus, including:
a feature data set unit, configured to process historical raw data consisting of the historical passenger flow and characteristic factors of the scenic spot to obtain a feature data set;
and a random forest algorithm model, configured to output the predicted future daily passenger flow of the scenic spot according to the input feature data set.
In one possible design, the feature data set unit includes:
an acquisition unit, configured to collect the historical passenger flow and characteristic factors of the scenic spot to obtain historical raw data, wherein the characteristic factors comprise weather, holidays, highest air temperature, lowest air temperature and season;
a first processing unit, configured to complete the tourist data values missing from the historical raw data using Lagrange interpolation or the mean of contemporaneous historical data, depending on the type of missing value;
a second processing unit, configured to quantize the discrete qualitative features through high-dimensional mapping;
and a third processing unit, configured to perform standardization and normalization.
In one possible design, the random forest algorithm model is generated by: defining a training set and a test set by dividing the whole data set in proportion, and establishing an input matrix and an output matrix for each of the training set and the test set; calculating the R-squared value, the mean square error and the mean absolute error; and tuning the model parameters according to the goodness of fit and the average standard error, and searching for an optimal random forest model based on a grid search algorithm.
In one possible design, searching for the optimal random forest model based on the grid search algorithm specifically includes: determining the ranges of the number of decision trees k and the number of candidate split attributes m_try, setting a step size, and establishing a two-dimensional grid on the k and m_try coordinate system, wherein each grid node is a corresponding parameter pair (k, m_try); constructing a random forest for each parameter pair on the grid nodes, and estimating the classification error using the out-of-bag data; and selecting the parameters k and m_try with the smallest classification error, and, if the classification error or the step size meets the requirement, outputting the optimal parameters and the classification error.
In one possible design, calculating the R-squared value, the mean square error and the mean absolute error by the random forest algorithm model includes:
wherein y represents the daily passenger flow and y_i represents the historical passenger flow on day i.
In a third aspect, the present application provides a computer program product, where the computer program product includes a computer program stored in a computer-readable storage medium, and the computer program is loaded by a controller to implement the method of any possible implementation manner of the first aspect.
In a fourth aspect, the present application provides a non-transitory computer-readable storage medium for storing a computer program, which is loaded by a processor to execute the instructions of the method of any possible implementation manner of the first aspect.
In a fifth aspect, embodiments of the present application provide a chip, where the chip includes a programmable logic circuit and/or program instructions, and when the chip runs, it implements the method according to any possible implementation manner of the first aspect.
In summary, the method and the device provided by the embodiments of the application can effectively improve the accuracy of the prediction model and ensure the timeliness of prediction. First, the parameter-optimized random forest algorithm is suitable for discrete data, can extract the rules contained in the data without prior knowledge, and is cheaper and more convenient to use than methods such as neural networks, so the timeliness of daily passenger flow prediction for the scenic spot is ensured provided that the latest prediction sample data is acquired in real time. Second, the parameter-optimized random forest algorithm tolerates missing values, abnormal values and noise well and is not prone to overfitting, and combining it with an analysis of the factors influencing scenic spot passenger flow further improves prediction accuracy.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a framework diagram of the random forest algorithm provided in an embodiment of the present application;
FIG. 2 is a flow chart of a method for predicting scenic spot passenger flow based on a random forest algorithm disclosed in the embodiment of the present application;
FIG. 3 is a flowchart of a method for finding an optimal random forest model based on a grid search algorithm, disclosed in an embodiment of the present application;
FIG. 4 is a schematic block diagram of a device for predicting scenic spot passenger flow based on a random forest algorithm, disclosed in the embodiment of the present application;
FIG. 5 is a schematic block diagram of a feature data set unit disclosed in an embodiment of the present application;
FIG. 6 shows an application example of predicting scenic spot passenger flow based on a random forest algorithm according to an embodiment of the present application.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings. These embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made on the basis of these embodiments fall within the scope of the present invention.
The random forest (RF) algorithm is a classification and regression algorithm that combines multiple decision trees through ensemble learning. RF uses decision trees as base classifiers. The Bagging (bootstrap aggregating) method generates training data sets that differ from one another, and each decision tree is built with the random subspace method: a subset of attributes is randomly selected from all attributes, and at each split the best attribute is chosen from that subset. For classification, the final class is the one that receives the most votes from the leaf nodes the sample point reaches; for regression analysis (including data prediction), the final prediction is the mean of the values of the leaf nodes the sample point reaches. This "double randomness" keeps RF from overfitting easily and creates diversity among the sub-classifiers, which gives RF its excellent performance. The framework of the random forest algorithm is shown in FIG. 1.
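The regression case described above can be sketched in a few lines. This is a deliberately small illustration under stated assumptions, not the implementation of the application: each tree is reduced to a depth-one stump, the function names (`fit_stump`, `fit_forest`, `predict`) are hypothetical, and the attribute subset is drawn once per tree, which for a single-split stump coincides with drawing it at every split.

```python
import random
from statistics import mean

def fit_stump(rows, targets, feature_ids):
    """Find the best (feature, threshold) split among the candidate features."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            if not left or not right:
                continue
            lm, rm = mean(left), mean(right)
            sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    if best is None:  # no valid split: fall back to a constant prediction
        m = mean(targets)
        return (0, float("inf"), m, m)
    return best[1:]

def fit_forest(rows, targets, k=25, m_try=1, seed=0):
    """Train k stumps, each on a bootstrap sample and a random attribute subset."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]       # bootstrap sample (Bagging)
        feats = rng.sample(range(n_features), m_try)     # random attribute subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Regression output: the mean of the leaf values reached in every tree."""
    return mean(lm if row[f] <= t else rm for f, t, lm, rm in forest)

# Toy data (made up): column 0 drives the target, column 1 is noise.
rows = [[i, i % 2] for i in range(20)]
targets = [100 if i < 10 else 300 for i in range(20)]
forest = fit_forest(rows, targets, k=30, m_try=2, seed=1)
low, high = predict(forest, [2, 0]), predict(forest, [17, 1])
```

Here `k` and `m_try` play the roles of the number of decision trees and the number of candidate split attributes that the grid search later tunes.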
The random forest algorithm can effectively improve the accuracy of the prediction model and ensure the timeliness of prediction. First, the parameter-optimized random forest algorithm is suitable for discrete data, can extract the rules contained in the data without prior knowledge, and is cheaper and more convenient to use than methods such as neural networks, so the timeliness of daily passenger flow prediction for the scenic spot is ensured provided that the latest prediction sample data is acquired in real time. Second, the parameter-optimized random forest algorithm tolerates missing values, abnormal values and noise well and is not prone to overfitting, and combining it with an analysis of the factors influencing scenic spot passenger flow further improves prediction accuracy.
Referring to FIG. 2, the method for predicting scenic spot passenger flow based on a random forest algorithm disclosed in the embodiment of the present application includes the following steps:
Step 201: establishing a parameter-optimized random forest algorithm model, tuning the model parameters according to the goodness of fit and the average standard error, and searching for an optimal random forest algorithm model based on a grid search algorithm.
Step 202: inputting the feature data set into the optimal random forest algorithm model to obtain the predicted future daily passenger flow of the scenic spot.
In one embodiment, for step 201, establishing a random forest algorithm model with optimized parameters includes the following steps:
a. Establishing a model: define a training set and a test set by dividing the whole data set in proportion, and establish an input matrix and an output matrix for each of the training set and the test set,
where f_i(x) is the historical passenger flow; m_i is the date; [w_1, ..., w_n] represents the weather sequence, 3 columns of data, in which each column represents one group of weather conditions: weather, highest air temperature and lowest air temperature; h_1 is the holiday indicator (weekday 0, ordinary weekend 1, holiday 2); and f_i(t) is the passenger flow to be predicted. A random forest algorithm is then used to train on the data set and establish the model.
b. Determining an evaluation system: calculate the R-squared value, the mean square error and the mean absolute error of the prediction model.
c. Optimizing the model: tune the model parameters according to the goodness of fit and the average standard error, and search for an optimal random forest model based on a grid search algorithm.
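Step a above can be sketched as follows. This is a minimal illustration under assumptions: the record field names (`weather`, `t_max`, `t_min`, `holiday`, `flow`), the 80/20 ratio and the chronological split are all hypothetical, since the text only states that the data set is divided in proportion.

```python
def build_matrices(records, train_ratio=0.8):
    """Split ordered daily records in proportion and build input/output matrices."""
    feature_keys = ("weather", "t_max", "t_min", "holiday")  # assumed column layout
    split = int(len(records) * train_ratio)

    def to_xy(part):
        X = [[rec[key] for key in feature_keys] for rec in part]  # input matrix
        y = [rec["flow"] for rec in part]                         # output matrix
        return X, y

    return to_xy(records[:split]), to_xy(records[split:])

# Made-up records: one dict per day.
records = [
    {"date": d, "weather": 0, "t_max": 20 + d, "t_min": 10,
     "holiday": d % 3, "flow": 1000 + d}
    for d in range(10)
]
(train_X, train_y), (test_X, test_y) = build_matrices(records, 0.8)
```

Keeping the test set at the end of the series is one design choice among several; a random split in the same proportion would also match the description.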
In one embodiment, the feature data set used in step 202 is obtained through the following steps:
a. Data collection: collect the historical passenger flow and characteristic factors of the scenic spot, wherein the characteristic factors mainly comprise weather, holidays, highest air temperature, lowest air temperature and season.
The collected influence factor data mainly covers five dimensions: weather, holidays, highest air temperature, lowest air temperature and season. Weather mainly refers to the influence of air temperature and rainfall on the number of tourists. Holidays comprise weekends and national statutory holidays, and different holidays carry different weights in their influence on the number of tourists in the scenic spot. Many scenic spots are sensitive to seasonal change, and the number of tourists differs from season to season. Air temperature refers to the influence of the highest and lowest air temperature of the day in the scenic spot on tourists' travel.
b. Missing value processing: the raw passenger flow data provided by scenic spots tends to have some missing passenger flow values, which fall into two cases. A small number of scattered missing values are completed using Lagrange interpolation; large contiguous runs of missing values are filled with the mean of the contemporaneous historical data.
c. Data preprocessing: the collected data set contains discrete qualitative features, and the statistical units and scales of the input and output variables are inconsistent, which can cause a large loss of prediction precision. Before data modeling, the discrete qualitative features are therefore quantized through high-dimensional mapping, and the input data is then standardized and normalized so that the scenic spot passenger flow data and the data in the influencing factor matrix are all positive values, which helps improve the prediction precision.
The standardization formula is shown in (1), where E(X) represents the mean and Var(X) represents the variance:
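The preprocessing in step c can be sketched as follows, under explicit assumptions. "High-dimensional mapping" is read here as one-hot encoding, and the formula in terms of E(X) and Var(X) is read as the usual z-score, (X - E(X)) / sqrt(Var(X)); neither reading is confirmed by the text. Min-max scaling is added as one plausible normalization step because it yields values in [0, 1], in line with the stated aim of avoiding negative values. All function names are hypothetical.

```python
from statistics import fmean, pvariance

def one_hot(values):
    """Map a discrete qualitative feature to 0/1 indicator columns."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

def standardize(xs):
    """z-score: subtract the mean E(X), divide by sqrt(Var(X))."""
    mu = fmean(xs)
    sd = pvariance(xs) ** 0.5
    if sd == 0:  # constant column: nothing to scale
        return [0.0 for _ in xs]
    return [(x - mu) / sd for x in xs]

def min_max(xs):
    """Rescale to [0, 1] so that no value is negative."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]

# Made-up inputs for illustration.
weather_cols, weather_cats = one_hot(["sunny", "rain", "sunny"])
z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
scaled_flow = min_max([10, 20, 30])
```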
As shown in FIG. 3, searching for the optimal random forest model based on the grid search algorithm specifically includes:
(1) Determine the ranges of the number of decision trees k and the number of candidate split attributes m_try, set a step size, and establish a two-dimensional grid on the k and m_try coordinate system; each grid node is a corresponding parameter pair (k, m_try).
(2) Construct a random forest for each parameter pair on the grid nodes, and estimate the classification error using the out-of-bag data.
(3) Select the parameters k and m_try with the smallest classification error. If the classification error or the step size meets the requirement, output the optimal parameters and the classification error; otherwise, reduce the step size, repeat the above steps and continue searching.
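Steps (1) to (3) can be sketched as a coarse-to-fine search. This is an illustration under assumptions, not the procedure of the application in detail: `error_fn` is a hypothetical stand-in for training a forest with the given (k, m_try) and returning its out-of-bag error, and halving the step while re-centring the grid on the current best node is one possible way to "reduce the step size".

```python
def grid_search(error_fn, k_range, m_range, k_step, m_step, tol=0.0, min_step=1):
    """Coarse-to-fine search over (k, m_try); returns (best_pair, best_error)."""
    (k_lo, k_hi), (m_lo, m_hi) = k_range, m_range
    while True:
        # Step (1): grid nodes are (k, m_try) pairs at the current step size.
        grid = [(k, m) for k in range(k_lo, k_hi + 1, k_step)
                for m in range(m_lo, m_hi + 1, m_step)]
        # Steps (2)-(3): evaluate every node and keep the smallest error.
        best = min(grid, key=lambda pair: error_fn(*pair))
        err = error_fn(*best)
        if err <= tol or (k_step <= min_step and m_step <= min_step):
            return best, err
        # Otherwise narrow the window around the best node and shrink the step.
        k_lo, k_hi = max(k_range[0], best[0] - k_step), min(k_range[1], best[0] + k_step)
        m_lo, m_hi = max(m_range[0], best[1] - m_step), min(m_range[1], best[1] + m_step)
        k_step, m_step = max(min_step, k_step // 2), max(min_step, m_step // 2)

# Made-up error surface with its minimum at k = 130, m_try = 3.
def toy_error(k, m_try):
    return (k - 130) ** 2 / 1e4 + (m_try - 3) ** 2

best_pair, best_err = grid_search(toy_error, (50, 500), (1, 5), k_step=100, m_step=2)
```

In practice `error_fn` would be expensive, so caching its results per node would avoid retraining forests for pairs that reappear after the grid is refined.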
In one embodiment, calculating the R-squared value, the mean square error and the mean absolute error comprises:
wherein y represents the daily passenger flow and y_i represents the historical passenger flow on day i.
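The three evaluation measures can be computed as in the sketch below. It uses the standard textbook definitions of the R-squared value, the mean square error and the mean absolute error, which are assumed here to be the ones intended; the function name `evaluate` and the sample numbers are made up.

```python
def evaluate(actual, predicted):
    """Return R-squared, mean square error and mean absolute error."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "mse": ss_res / n,
        "mae": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
    }

# Made-up daily passenger flows and predictions.
scores = evaluate([100, 200, 300], [110, 190, 300])
```

An R-squared value close to 1 together with small error values indicates a good fit on the data being evaluated.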
As shown in fig. 4, an embodiment of the present application further provides an apparatus, including:
a feature data set unit 401, configured to process historical raw data consisting of the historical passenger flow and characteristic factors of the scenic spot to obtain a feature data set;
and a random forest algorithm model 402, configured to output the predicted future daily passenger flow of the scenic spot according to the input feature data set.
In one embodiment, the establishing of the parameter optimized random forest algorithm model comprises the following steps:
a. Establishing a model: define a training set and a test set by dividing the whole data set in proportion, and establish an input matrix and an output matrix for each of the training set and the test set,
where f_i(x) is the historical passenger flow; m_i is the date; [w_1, ..., w_n] represents the weather sequence, 3 columns of data, in which each column represents one group of weather conditions: weather, highest air temperature and lowest air temperature; h_1 is the holiday indicator (weekday 0, ordinary weekend 1, holiday 2); and f_i(t) is the passenger flow to be predicted. A random forest algorithm is then used to train on the data set and establish the model.
b. Determining an evaluation system: calculate the R-squared value, the mean square error and the mean absolute error of the prediction model.
c. Optimizing the model: tune the model parameters according to the goodness of fit and the average standard error, and search for an optimal random forest model based on a grid search algorithm.
As shown in fig. 5, in an embodiment, the feature data set unit specifically includes:
an acquisition unit 501, configured to collect the historical passenger flow and characteristic factors of the scenic spot to obtain historical raw data, wherein the characteristic factors comprise weather, holidays, highest air temperature, lowest air temperature and season;
a first processing unit 502, configured to complete the tourist data values missing from the historical raw data using Lagrange interpolation or the mean of contemporaneous historical data, depending on the type of missing value;
a second processing unit 503, configured to quantize the discrete qualitative features through high-dimensional mapping;
and a third processing unit 504, configured to perform standardization and normalization.
For the acquisition unit 501, the collected influence factor data mainly covers five dimensions: weather, holidays, highest air temperature, lowest air temperature and season. Weather mainly refers to the influence of air temperature and rainfall on the number of tourists. Holidays comprise weekends and national statutory holidays, and different holidays carry different weights in their influence on the number of tourists in the scenic spot. Many scenic spots are sensitive to seasonal change, and the number of tourists differs from season to season. Air temperature refers to the influence of the highest and lowest air temperature of the day in the scenic spot on tourists' travel.
For the first processing unit 502, the raw passenger flow data provided by scenic spots tends to have some missing passenger flow values, which fall into two cases. A small number of scattered missing values are completed using Lagrange interpolation; large contiguous runs of missing values are filled with the mean of the contemporaneous historical data.
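The two completion strategies can be sketched as follows. This is a minimal illustration under assumptions: the function names, the two-neighbour window and the layout of `history_by_year` (one list of daily flows per earlier year, aligned by calendar day) are hypothetical and not specified in the text.

```python
def lagrange_fill(xs, ys, x):
    """Evaluate the Lagrange polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def fill_scattered(series, window=2):
    """Fill isolated None entries from up to `window` known neighbours per side."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is None:
            left = [j for j in range(i - 1, -1, -1) if series[j] is not None][:window]
            right = [j for j in range(i + 1, len(series)) if series[j] is not None][:window]
            xs = sorted(left + right)
            filled[i] = lagrange_fill(xs, [series[j] for j in xs], i)
    return filled

def same_period_mean(history_by_year, day_index):
    """Mean of the same calendar day across earlier years, skipping missing entries."""
    vals = [year[day_index] for year in history_by_year if year[day_index] is not None]
    return sum(vals) / len(vals)

# Made-up series: the missing third value lies on the curve (i + 1) ** 2.
completed = fill_scattered([1, 4, None, 16, 25])
gap_value = same_period_mean([[100, None], [120, 80], [140, 90]], 1)
```

Lagrange interpolation over a small local window keeps the polynomial degree low; fitting one polynomial through a long series tends to oscillate, which is one reason to reserve it for scattered gaps.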
For the second processing unit 503, the collected data set contains discrete qualitative features, and the statistical units and scales of the input and output variables are inconsistent, which can cause a large loss of prediction precision. Before data modeling, the discrete qualitative features are therefore quantized through high-dimensional mapping, and the input data is then standardized and normalized so that the scenic spot passenger flow data and the data in the influencing factor matrix are all positive values, which helps improve the prediction precision.
The standardization formula is shown in (1), where E(X) represents the mean and Var(X) represents the variance:
As shown in FIG. 3, searching for the optimal random forest model based on the grid search algorithm specifically includes:
(1) Determine the ranges of the number of decision trees k and the number of candidate split attributes m_try, set a step size, and establish a two-dimensional grid on the k and m_try coordinate system; each grid node is a corresponding parameter pair (k, m_try).
(2) Construct a random forest for each parameter pair on the grid nodes, and estimate the classification error using the out-of-bag data.
(3) Select the parameters k and m_try with the smallest classification error. If the classification error or the step size meets the requirement, output the optimal parameters and the classification error; otherwise, reduce the step size, repeat the above steps and continue searching.
In one embodiment, calculating the R-squared value, the mean square error and the mean absolute error comprises:
wherein y represents the daily passenger flow and y_i represents the historical passenger flow on day i.
In a third aspect, the present application provides a computer program product, where the computer program product includes a computer program stored in a computer-readable storage medium, and the computer program is loaded by a controller to implement the method shown in Fig. 2.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium for storing a computer program, which is loaded by a processor to execute the instructions of the method of Fig. 2.
In a fifth aspect, embodiments of the present application provide a chip comprising programmable logic circuits and/or program instructions which, when run, implement the method of Fig. 2.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units is only one logical division, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in an electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can easily conceive of within the technical scope disclosed by the present invention shall be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method for predicting scenic spot passenger flow based on a random forest algorithm is characterized by comprising the following steps:
establishing a parameter-optimized random forest algorithm model, tuning the model parameters according to goodness of fit and average standard error, and searching for an optimal random forest algorithm model based on a grid search algorithm;
and inputting the feature data set into the optimal random forest algorithm model to obtain the predicted future daily passenger flow of the scenic spot.
2. The method of claim 1, wherein the feature data set is obtained in a manner comprising:
collecting historical passenger flow and characteristic factors of scenic spots to obtain historical original data, wherein the characteristic factors comprise weather, holidays, highest air temperature, lowest air temperature and seasons;
completing the tourist data values missing from the historical original data, using a Lagrange interpolation method and the mean of historical data for the same period, respectively, for different types of missing values;
quantizing the discrete qualitative characteristics through high-dimensional mapping;
performing standardization and normalization processing.
3. The method of claim 1, wherein the establishing a parameter-optimized random forest algorithm model comprises:
defining a training set and a test set, dividing the training set and the test set of the whole data set according to a proportion, and establishing an input matrix and an output matrix of the training set/the test set;
calculating an R square value, a mean square error and an average absolute error;
and tuning the model parameters according to the goodness of fit and the average standard error, and searching for an optimal random forest model based on a grid search algorithm.
4. The method according to claim 1, wherein the finding of the optimal random forest model based on the grid search algorithm specifically comprises:
determining the ranges of the number of decision trees k and the number of candidate split attributes m_try, setting the step sizes, and establishing a two-dimensional grid on the k and m_try coordinate system, wherein each grid node is a corresponding (k, m_try) parameter pair;
constructing a random forest for each parameter pair on the grid nodes, and estimating a classification error using the out-of-bag data;
selecting the parameters k and m_try with the smallest classification error, and if the classification error or the step size meets the requirement, outputting the optimal parameters and the classification error.
6. An apparatus, comprising:
the characteristic data set unit is used for processing historical original data consisting of historical passenger flow and characteristic factors of scenic spots to obtain a characteristic data set;
and the random forest algorithm model is used for outputting the predicted future daily passenger flow of the scenic spot according to the input feature data set.
7. The apparatus of claim 6, wherein the feature data set unit comprises:
the acquisition unit is used for collecting historical passenger flow and characteristic factors of scenic spots to obtain historical original data, wherein the characteristic factors comprise weather, holidays, highest air temperature, lowest air temperature and seasons;
the first processing unit is used for completing the tourist data values missing from the historical original data, using a Lagrange interpolation method and the mean of historical data for the same period, respectively, for different types of missing values;
the second processing unit is used for quantizing the discrete qualitative characteristics through high-dimensional mapping;
and the third processing unit is used for carrying out standardization and normalization processing.
8. The apparatus of claim 6, wherein the random forest algorithm model is generated by:
defining a training set and a test set, dividing the training set and the test set of the whole data set according to a proportion, and establishing an input matrix and an output matrix of the training set/the test set;
calculating an R square value, a mean square error and an average absolute error;
and tuning the model parameters according to the goodness of fit and the average standard error, and searching for an optimal random forest model based on a grid search algorithm.
9. The apparatus as claimed in claim 8, wherein the finding of the optimal random forest model based on the grid search algorithm by the random forest algorithm model specifically comprises:
determining the ranges of the number of decision trees k and the number of candidate split attributes m_try, setting the step sizes, and establishing a two-dimensional grid on the k and m_try coordinate system, wherein each grid node is a corresponding (k, m_try) parameter pair;
constructing a random forest for each parameter pair on the grid nodes, and estimating a classification error using the out-of-bag data;
selecting the parameters k and m_try with the smallest classification error, and if the classification error or the step size meets the requirement, outputting the optimal parameters and the classification error.
10. The apparatus of claim 9, wherein the random forest algorithm model calculating an R-squared value, a mean squared error, and a mean absolute error comprises:
wherein y represents the daily passenger flow volume and y_i represents the passenger flow volume on historical day i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911406473.0A CN111178633A (en) | 2019-12-31 | 2019-12-31 | Method and device for predicting scenic spot passenger flow based on random forest algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111178633A true CN111178633A (en) | 2020-05-19 |
Family
ID=70654257
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911406473.0A Pending CN111178633A (en) | 2019-12-31 | 2019-12-31 | Method and device for predicting scenic spot passenger flow based on random forest algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111178633A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106779196A (en) * | 2016-12-05 | 2017-05-31 | 中国航天系统工程有限公司 | A kind of tourist flow prediction and peak value regulation and control method based on tourism big data |
CN108877226A (en) * | 2018-08-24 | 2018-11-23 | 交通运输部规划研究院 | Scenic spot traffic for tourism prediction technique and early warning system |
CN109034469A (en) * | 2018-07-20 | 2018-12-18 | 成都中科大旗软件有限公司 | A kind of tourist flow prediction technique based on machine learning |
CN110135630A (en) * | 2019-04-25 | 2019-08-16 | 武汉数澎科技有限公司 | The short term needing forecasting method with multi-step optimization is returned based on random forest |
CN110175690A (en) * | 2019-04-04 | 2019-08-27 | 中兴飞流信息科技有限公司 | A kind of method, apparatus, server and the storage medium of scenic spot passenger flow forecast |
CN110322075A (en) * | 2019-07-10 | 2019-10-11 | 上饶市中科院云计算中心大数据研究院 | A kind of scenic spot passenger flow forecast method and system based on hybrid optimization RBF neural |
CN110443314A (en) * | 2019-08-08 | 2019-11-12 | 中国工商银行股份有限公司 | Scenic spot passenger flow forecast method and device based on machine learning |
Non-Patent Citations (9)
Title |
---|
戴剑良等: "《图书发行统计》", 31 December 1994 * |
杨修德等: "XGBoost在超短期负荷预测中的应用", 《电气传动自动化》 * |
温博文等: "基于改进网格搜索算法的随机森林参数优化", 《计算机工程与应用》 * |
王晓宇等: "基于BA-SVR的乡村游短期客流预测模型", 《计算机工程与设计》 * |
谢天保等: "基于网络搜索数据的游客量组合预测模型", 《计算机系统应用》 * |
邵良杉等: "改进GSM-RFC模型在回采巷道围岩稳定性分级的预测", 《辽宁工程技术大学学报(自然科学版)》 * |
陈元鹏等: "基于网格搜索随机森林算法的工矿复垦区土地利用分类", 《农业工程学报》 * |
马丽君等: "湖南"红三角"潜在游客时空分布特征及其影响因素", 《资源开发与市场》 * |
马银超等: "基于分类模型的日客流量预测", 《国土资源科技管理》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112330359A (en) * | 2020-11-04 | 2021-02-05 | 上饶市中科院云计算中心大数据研究院 | Smart tourist attraction saturation evaluation method and device |
CN112949939A (en) * | 2021-03-30 | 2021-06-11 | 福州市电子信息集团有限公司 | Taxi passenger carrying hotspot prediction method based on random forest model |
CN112949939B (en) * | 2021-03-30 | 2022-12-06 | 福州市电子信息集团有限公司 | Taxi passenger carrying hotspot prediction method based on random forest model |
CN113256695A (en) * | 2021-06-23 | 2021-08-13 | 武汉工程大学 | Random forest based terrain prediction model method for potassium sulfate production salt pond |
CN113256695B (en) * | 2021-06-23 | 2021-10-08 | 武汉工程大学 | Random forest based terrain prediction model method for potassium sulfate production salt pond |
CN113962437A (en) * | 2021-09-18 | 2022-01-21 | 深圳市城市交通规划设计研究中心股份有限公司 | Construction method of people stream prediction model and people stream situation prediction method of rail transit station |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200519 |