CN111178633A - Method and device for predicting scenic spot passenger flow based on random forest algorithm - Google Patents

Info

Publication number
CN111178633A
Authority
CN
China
Prior art keywords
random forest
error
passenger flow
data
historical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911406473.0A
Other languages
Chinese (zh)
Inventor
皮慧婷
洪学海
杨勇
陈鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Big Data Cloud Computing Center Of Chinese Academy Shangrao
Original Assignee
Institute Of Big Data Cloud Computing Center Of Chinese Academy Shangrao
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Big Data Cloud Computing Center Of Chinese Academy Shangrao
Priority to CN201911406473.0A
Publication of CN111178633A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/14 Travel agencies

Abstract

The embodiments of the application disclose a method and a device for predicting scenic spot passenger flow based on a random forest algorithm. The method comprises the following steps: establishing a parameter-optimized random forest algorithm model, tuning the model parameters according to the goodness of fit and the average standard error, and searching for the optimal random forest algorithm model with a grid search algorithm; and inputting the feature data set into the optimal random forest algorithm model to obtain the predicted future daily passenger flow of the scenic spot. The method and the device can effectively improve the accuracy of the prediction model, ensure the timeliness of prediction, and facilitate the management of cities and scenic spots.

Description

Method and device for predicting scenic spot passenger flow based on random forest algorithm
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a device for predicting scenic spot passenger flow based on a random forest algorithm.
Background
With the rapid development of the tourism industry in China, travel has become a normal part of life. The passenger flow of many scenic spots in China changes markedly with the seasons, which is typical of seasonal tourism. Whenever passenger flow peaks, cities and tourist attractions experience varying degrees of congestion and disorder, and even crowd conflicts, which endanger the personal and property safety of tourists and cause great trouble for the management of cities and scenic spots. A method for predicting the future passenger flow of a scenic spot is therefore urgently needed, so that scenic spot managers can take effective precautionary measures in advance according to the predicted passenger flow and the actual reception capacity of the scenic spot, thereby avoiding congestion, conflicts, and even personal injury and property damage, and ensuring the service quality and safety of the scenic spot. With the development of tourism big data, scenic spot passenger flow prediction currently faces two main problems. First, as historical data keeps growing, model training time keeps growing as well, and the timeliness of prediction is difficult to guarantee. Second, prediction accuracy depends on both the features and the prediction model, and unsuitable features or training models lead to unsatisfactory accuracy. Many existing methods for scenic spot passenger flow prediction offer some help for scenic spot management decisions, but they struggle to improve the timeliness and accuracy of the prediction model.
Disclosure of Invention
The embodiment of the application mainly aims to provide a method and a device for predicting scenic spot passenger flow based on a random forest algorithm, so that the accuracy of predicting scenic spot passenger flow is improved, and the timeliness of prediction is guaranteed.
In a first aspect, a method for predicting scenic spot passenger flow based on a random forest algorithm is provided, which comprises the following steps: first, establishing a parameter-optimized random forest algorithm model, tuning the model parameters according to the goodness of fit and the average standard error, and searching for the optimal random forest algorithm model with a grid search algorithm; then, inputting the feature data set into the optimal random forest algorithm model to obtain the predicted future daily passenger flow of the scenic spot, so that the scenic spot can prepare management measures in advance.
In one possible design, the feature data set is obtained as follows: collecting the historical passenger flow and feature factors of the scenic spot to obtain historical raw data, wherein the feature factors comprise weather, holidays, highest air temperature, lowest air temperature and seasons; completing the tourist data values missing from the historical raw data, using Lagrange interpolation or the mean of the historical data for the same period depending on the type of missing value; quantizing the discrete qualitative features through high-dimensional mapping; and performing standardization and normalization.
In one possible design, establishing the parameter-optimized random forest algorithm model includes: defining a training set and a test set by dividing the whole data set according to a proportion, and establishing the input matrix and output matrix of the training set and the test set; calculating the R-squared value, the mean square error and the mean absolute error; and tuning the model parameters according to the goodness of fit and the average standard error, and searching for the optimal random forest model with a grid search algorithm.
In one possible design, finding the optimal random forest model with the grid search algorithm specifically includes: determining the ranges of the number of decision trees k and the number of candidate split attributes m_try, setting the step size, and establishing a two-dimensional grid on the (k, m_try) coordinate system, wherein the grid nodes are the corresponding (k, m_try) parameter pairs; constructing a random forest for each parameter pair on the grid nodes, and estimating the classification error using the out-of-bag data; and selecting the parameter pair (k, m_try) with the smallest classification error, and, if the classification error or the step size meets the requirement, outputting the optimal parameters and the classification error.
In one possible design, calculating the R-squared value, the mean square error, and the mean absolute error includes:
R-squared value:
Figure BDA0002348767380000021
Mean square error:
Figure BDA0002348767380000022
Mean absolute error:
Figure BDA0002348767380000023
where y denotes the daily passenger flow and y_i denotes the historical passenger flow on day i.
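For reference, the conventional textbook definitions of these three metrics can be written as below. This is a sketch under stated assumptions, not the patent's own notation: the symbols \hat{y}_i (predicted flow on day i), \bar{y} (mean of the observed flows) and n (number of samples) do not appear in the text and are introduced here only for illustration.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
```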
In a second aspect, an embodiment of the present application further provides an apparatus, including:
a feature data set unit, used for processing historical raw data consisting of the historical passenger flow and feature factors of a scenic spot to obtain a feature data set; and
a random forest algorithm model, used for outputting the predicted future daily passenger flow of the scenic spot according to the input feature data set.
In one possible design, the feature data set unit includes:
an acquisition unit, used for collecting the historical passenger flow and feature factors of the scenic spot to obtain historical raw data, wherein the feature factors comprise weather, holidays, highest air temperature, lowest air temperature and seasons;
a first processing unit, used for completing the tourist data values missing from the historical raw data, using Lagrange interpolation or the mean of the historical data for the same period depending on the type of missing value;
a second processing unit, used for quantizing the discrete qualitative features through high-dimensional mapping; and
a third processing unit, used for performing standardization and normalization.
In one possible design, the random forest algorithm model is generated by: defining a training set and a test set by dividing the whole data set according to a proportion, and establishing the input matrix and output matrix of the training set and the test set; calculating the R-squared value, the mean square error and the mean absolute error; and tuning the model parameters according to the goodness of fit and the average standard error, and searching for the optimal random forest model with a grid search algorithm.
In one possible design, the random forest algorithm model finds the optimal random forest model with the grid search algorithm as follows: determining the ranges of the number of decision trees k and the number of candidate split attributes m_try, setting the step size, and establishing a two-dimensional grid on the (k, m_try) coordinate system, wherein the grid nodes are the corresponding (k, m_try) parameter pairs; constructing a random forest for each parameter pair on the grid nodes, and estimating the classification error using the out-of-bag data; and selecting the parameter pair (k, m_try) with the smallest classification error, and, if the classification error or the step size meets the requirement, outputting the optimal parameters and the classification error.
In one possible design, calculating the R-squared value, the mean square error, and the mean absolute error by the random forest algorithm model includes:
R-squared value:
Figure BDA0002348767380000031
Mean square error:
Figure BDA0002348767380000032
Mean absolute error:
Figure BDA0002348767380000041
where y denotes the daily passenger flow and y_i denotes the historical passenger flow on day i.
In a third aspect, the present application provides a computer program product, where the computer program product includes a computer program stored in a computer-readable storage medium, and the computer program is loaded by a controller to implement the method of any possible implementation manner of the first aspect.
In a fourth aspect, the present application provides a non-transitory computer-readable storage medium for storing a computer program, which is loaded by a processor to execute the instructions of the method of any possible implementation manner of the first aspect.
In a fifth aspect, embodiments of the present application provide a chip, where the chip includes a programmable logic circuit and/or program instructions, and when the chip is executed, the chip is configured to implement the method according to any possible implementation manner of the first aspect.
In summary, the method and the device provided by the embodiments of the application can effectively improve the accuracy of the prediction model and ensure the timeliness of prediction. First, the parameter-optimized random forest algorithm is suitable for discrete data, can extract the rules contained in the data without prior knowledge, and is cheaper and more convenient to use than methods such as neural networks, so that the timeliness of daily scenic spot passenger flow prediction is ensured provided that the latest prediction sample data is acquired in real time. Second, the parameter-optimized random forest algorithm tolerates missing values, outliers and noise well and is not prone to overfitting, and combining it with an analysis of the factors that influence scenic spot passenger flow further improves prediction accuracy.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a framework diagram of the random forest algorithm provided in an embodiment of the present application;
FIG. 2 is a flow chart of a method for predicting scenic spot passenger flow based on a random forest algorithm disclosed in the embodiment of the present application;
FIG. 3 is a flowchart of a method for finding an optimal random forest model based on a grid search algorithm, disclosed in an embodiment of the present application;
FIG. 4 is a schematic block diagram of a device for predicting scenic spot passenger flow based on a random forest algorithm, disclosed in the embodiment of the present application;
FIG. 5 is a schematic block diagram of a feature data set unit disclosed in an embodiment of the present application;
FIG. 6 shows an application example of predicting scenic spot passenger flow based on a random forest algorithm according to an embodiment of the present application.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
The Random Forest (RF) algorithm is a classification and regression algorithm that integrates multiple decision trees through the idea of ensemble learning. RF uses decision trees as base classifiers. It first applies the Bagging (bootstrap aggregating) method to generate training data sets that differ from one another, and then builds each decision tree with a random subspace strategy (Random Subspace Method): a subset of attributes is randomly selected from all attributes, and at each split of the tree the best attribute within that subset is chosen for splitting. For classification, the final class is the one that receives the most votes from the leaf nodes that the sample point reaches; for regression analysis (including data prediction), the final prediction is the mean of the values of the leaf nodes that the sample point reaches. This "double randomness" makes RF less prone to overfitting and introduces diversity among the sub-classifiers, which gives RF excellent performance. The framework of the random forest algorithm is shown in FIG. 1.
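The "double randomness" described above can be sketched in pure Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the base learners are depth-one regression trees (stumps) to keep the example short, and the function names `fit_stump`, `fit_forest` and `predict` are hypothetical. A practical model would grow deeper trees, typically through a library implementation.

```python
import random


def fit_stump(X, y, feature_ids):
    """Fit a depth-one regression tree using only the candidate features."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in feature_ids:
        for threshold in sorted({row[f] for row in X}):
            left = [t for row, t in zip(X, y) if row[f] <= threshold]
            right = [t for row, t in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue  # this threshold does not separate the sample
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    if best is None:
        # No valid split exists: fall back to predicting the sample mean.
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]


def fit_forest(X, y, n_trees, m_try, seed=0):
    """Bagging (bootstrap rows) plus random subspace (m_try random features)."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        feats = rng.sample(range(n_features), m_try)     # random attribute subset
        forest.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return forest


def predict(forest, row):
    """Regression output: the mean of the leaf values reached in each tree."""
    leaves = [lm if row[f] <= thr else rm for f, thr, lm, rm in forest]
    return sum(leaves) / len(leaves)


# Toy data (invented for illustration): [holiday code, highest air temperature]
X = [[0, 20], [0, 22], [1, 25], [1, 27], [0, 18], [2, 30], [2, 28], [1, 24]]
y = [100, 120, 250, 270, 90, 400, 380, 240]
forest = fit_forest(X, y, n_trees=25, m_try=1, seed=1)
print(predict(forest, [2, 29]))
```

Because each stump makes a single split, choosing the attribute subset once per tree coincides with choosing it at every split; deeper trees would re-sample the subset at each node.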
The random forest algorithm can effectively improve the accuracy of the prediction model and ensure the timeliness of prediction. First, the parameter-optimized random forest algorithm is suitable for discrete data, can extract the rules contained in the data without prior knowledge, and is cheaper and more convenient to use than methods such as neural networks, so that the timeliness of daily scenic spot passenger flow prediction is ensured provided that the latest prediction sample data is acquired in real time. Second, the parameter-optimized random forest algorithm tolerates missing values, outliers and noise well and is not prone to overfitting, and combining it with an analysis of the factors that influence scenic spot passenger flow further improves prediction accuracy.
Referring to FIG. 2, the method for predicting scenic spot passenger flow based on a random forest algorithm disclosed in the embodiment of the present application includes the following steps:
Step 201: establish a parameter-optimized random forest algorithm model, tune the model parameters according to the goodness of fit and the average standard error, and search for the optimal random forest algorithm model with a grid search algorithm.
Step 202: input the feature data set into the optimal random forest algorithm model to obtain the predicted future daily passenger flow of the scenic spot.
In one embodiment, for step 201, establishing the parameter-optimized random forest algorithm model includes the following steps:
a. Establishing the model: define a training set and a test set by dividing the whole data set according to a proportion, and establish the input matrix and output matrix of the training set and the test set.
The training matrix is:
Figure BDA0002348767380000061
Input matrix:
Figure BDA0002348767380000062
Output matrix:
Figure BDA0002348767380000063
where f_i(x) is the historical passenger flow; m_i is the date; [w_1, ..., w_n] represents the weather sequence, 3 columns of data, where each column of data represents a group of weather conditions including weather, highest air temperature and lowest air temperature; h_1 is the holiday indicator (weekday 0, ordinary weekend 1, holiday 2); and f_i(t) is the passenger flow to be predicted. A random forest algorithm is then trained on the data set to establish the model.
b. Determining the evaluation system: calculate the R-squared value, the mean square error and the mean absolute error of the prediction model.
c. Optimizing the model: tune the model parameters according to the goodness of fit and the average standard error, and search for the optimal random forest model with a grid search algorithm.
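Step a can be sketched as follows. The text does not state the split proportion or whether the split is chronological or random, so the 80/20 ratio, the chronological order, the column layout, and the helper names `split_by_ratio` and `to_matrices` are all assumptions made for illustration.

```python
def split_by_ratio(rows, train_ratio=0.8):
    """Split rows in their given (chronological) order, without shuffling."""
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]


def to_matrices(rows):
    """Separate the feature columns (input matrix) from the flow column (output matrix)."""
    inputs = [list(row[:-1]) for row in rows]
    outputs = [row[-1] for row in rows]
    return inputs, outputs


# Invented sample rows:
# (day index, weather code, highest temp, lowest temp, holiday code, passenger flow)
rows = [
    (1, 0, 25.0, 15.0, 0, 1200),
    (2, 1, 22.0, 14.0, 0, 900),
    (3, 0, 27.0, 16.0, 1, 2100),
    (4, 0, 28.0, 17.0, 1, 2300),
    (5, 2, 20.0, 12.0, 2, 3500),
]

train_rows, test_rows = split_by_ratio(rows)
X_train, y_train = to_matrices(train_rows)
X_test, y_test = to_matrices(test_rows)
print(len(X_train), len(X_test))
```

A chronological split keeps the test days strictly after the training days, which mirrors how the model would be used to forecast future flow; a shuffled split is an equally valid reading of "according to a proportion".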
In one embodiment, for step 202, the feature data set is obtained through the following steps:
a. Data collection: collect the historical passenger flow and feature factors of the scenic spot, where the feature factors mainly comprise weather, holidays, highest air temperature, lowest air temperature and seasons.
The collected influence factor data mainly covers five dimensions: weather, holidays, highest air temperature, lowest air temperature and seasons. Weather mainly refers to the influence of air temperature and rainfall on the number of tourists. Holidays comprise weekends and national statutory holidays, and different holidays carry different weights in their influence on the number of tourists in the scenic spot. Many scenic spots are sensitive to seasonal change, and the number of tourists differs from season to season. Air temperature refers to the influence of the daily highest and lowest air temperatures in the scenic spot on tourists' travel.
b. Missing value processing: the raw passenger flow data provided by a scenic spot is prone to missing values, which are handled in two cases. A few scattered missing values are completed by Lagrange interpolation; large runs of missing values are filled with the mean of the historical data for the same period.
c. Data preprocessing: the collected data set contains discrete qualitative features, and the statistical units and scopes of the input and output variables are inconsistent, which can cause a large loss of prediction precision. Before modeling, the discrete qualitative features are therefore quantized through high-dimensional mapping, and the input data is then standardized and normalized so that both the scenic spot passenger flow data and the data in the influence factor matrix are positive values, which helps improve prediction precision.
The normalization formula is shown in (1), where E(X) represents the mean and Var(X) represents the variance:
Figure BDA0002348767380000071
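The preprocessing in step c can be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: it assumes that "high-dimensional mapping" means one-hot encoding, that formula (1) is the usual z-score (X - E(X)) / sqrt(Var(X)), and that the subsequent normalization is min-max scaling to [0, 1]. The function names are hypothetical.

```python
import math


def one_hot(values):
    """Map each discrete label to a 0/1 vector with one column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def standardize(xs):
    """Z-score: subtract the mean E(X) and divide by sqrt(Var(X))."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    std = math.sqrt(var)
    if std == 0:
        return [0.0 for _ in xs]  # constant column carries no information
    return [(x - mean) / std for x in xs]


def min_max(xs):
    """Rescale to [0, 1] so that all values are non-negative."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


weather = ["sunny", "rain", "sunny", "cloudy"]   # invented sample labels
highest_temp = [25.0, 19.0, 28.0, 22.0]          # invented sample values

encoded, categories = one_hot(weather)
scaled_temp = min_max(standardize(highest_temp))
print(categories, encoded)
print(scaled_temp)
```

Z-scores alone can be negative, so the min-max step is what makes the final values non-negative, consistent with the statement that the processed data are positive values.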
As shown in FIG. 3, finding the optimal random forest model with the grid search algorithm specifically includes:
(1) Determine the ranges of the number of decision trees k and the number of candidate split attributes m_try, set the step size, and establish a two-dimensional grid on the (k, m_try) coordinate system, where the grid nodes are the corresponding (k, m_try) parameter pairs.
(2) Construct a random forest for each parameter pair on the grid nodes, and estimate the classification error using the out-of-bag data.
(3) Select the parameter pair (k, m_try) with the smallest classification error. If the classification error or the step size meets the requirement, output the optimal parameters and the classification error; otherwise, reduce the step size and repeat the above steps to continue the search.
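The three steps above can be sketched as a coarse-to-fine search. This is a hedged illustration, not the patent's implementation: `error_fn` is a hypothetical stand-in for "build a forest with (k, m_try) and return its out-of-bag error", the toy error surface is invented, and halving the step and stopping once both steps reach 1 is only one possible reading of "the step size meets the requirement".

```python
def grid_search(error_fn, k_range, m_range, k_step, m_step):
    """Coarse-to-fine search over (k, m_try) pairs; returns (best_pair, best_error)."""
    k_min, k_max = k_range
    m_min, m_max = m_range
    k_lo, k_hi, m_lo, m_hi = k_min, k_max, m_min, m_max
    while True:
        # Step (1): lay out the grid nodes for the current window and step size.
        nodes = [(k, m)
                 for k in range(k_lo, k_hi + 1, k_step)
                 for m in range(m_lo, m_hi + 1, m_step)]
        # Steps (2)-(3): evaluate every node and keep the one with the smallest error.
        best_k, best_m = min(nodes, key=lambda node: error_fn(*node))
        if k_step == 1 and m_step == 1:
            return (best_k, best_m), error_fn(best_k, best_m)
        # Otherwise shrink the window around the best node and reduce the step.
        k_lo, k_hi = max(k_min, best_k - k_step), min(k_max, best_k + k_step)
        m_lo, m_hi = max(m_min, best_m - m_step), min(m_max, best_m + m_step)
        k_step, m_step = max(1, k_step // 2), max(1, m_step // 2)


def toy_error(k, m_try):
    """Invented error surface with its minimum at k=130, m_try=3 (illustration only)."""
    return (k - 130) ** 2 / 1000 + (m_try - 3) ** 2


best_pair, best_error = grid_search(toy_error, (50, 250), (1, 5), k_step=50, m_step=2)
print(best_pair, best_error)
```

In a real run, `error_fn` would train a forest per node, so the coarse first pass keeps the number of forests small before the search narrows in on the promising region.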
In one embodiment, calculating the R-squared value, the mean square error, and the mean absolute error comprises:
R-squared value:
Figure BDA0002348767380000081
Mean square error:
Figure BDA0002348767380000082
Mean absolute error:
Figure BDA0002348767380000083
where y denotes the daily passenger flow and y_i denotes the historical passenger flow on day i.
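The three evaluation metrics can be computed as in the sketch below, which assumes the conventional textbook definitions (R-squared as one minus the ratio of the residual sum of squares to the total sum of squares). The function names and the sample values are hypothetical and chosen only for illustration.

```python
def r_squared(actual, predicted):
    """Goodness of fit: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_squared_error(actual, predicted):
    """Average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average of the absolute prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual_flow = [100, 200, 300]       # invented observed daily flows
predicted_flow = [110, 190, 310]    # invented model outputs

print(r_squared(actual_flow, predicted_flow))
print(mean_squared_error(actual_flow, predicted_flow))
print(mean_absolute_error(actual_flow, predicted_flow))
```

Note that `r_squared` is undefined when all observed values are identical, since the total sum of squares is then zero.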
As shown in FIG. 4, an embodiment of the present application further provides an apparatus, including:
a feature data set unit 401, used for processing historical raw data consisting of the historical passenger flow and feature factors of a scenic spot to obtain a feature data set; and
a random forest algorithm model 402, used for outputting the predicted future daily passenger flow of the scenic spot according to the input feature data set.
In one embodiment, establishing the parameter-optimized random forest algorithm model comprises the following steps:
a. Establishing the model: define a training set and a test set by dividing the whole data set according to a proportion, and establish the input matrix and output matrix of the training set and the test set.
The training matrix is:
Figure BDA0002348767380000084
Input matrix:
Figure BDA0002348767380000085
Output matrix:
Figure BDA0002348767380000091
where f_i(x) is the historical passenger flow; m_i is the date; [w_1, ..., w_n] represents the weather sequence, 3 columns of data, where each column of data represents a group of weather conditions including weather, highest air temperature and lowest air temperature; h_1 is the holiday indicator (weekday 0, ordinary weekend 1, holiday 2); and f_i(t) is the passenger flow to be predicted. A random forest algorithm is then trained on the data set to establish the model.
b. Determining the evaluation system: calculate the R-squared value, the mean square error and the mean absolute error of the prediction model.
c. Optimizing the model: tune the model parameters according to the goodness of fit and the average standard error, and search for the optimal random forest model with a grid search algorithm.
As shown in FIG. 5, in an embodiment, the feature data set unit specifically includes:
an acquisition unit 501, used for collecting the historical passenger flow and feature factors of the scenic spot to obtain historical raw data, wherein the feature factors comprise weather, holidays, highest air temperature, lowest air temperature and seasons;
a first processing unit 502, used for completing the tourist data values missing from the historical raw data, using Lagrange interpolation or the mean of the historical data for the same period depending on the type of missing value;
a second processing unit 503, used for quantizing the discrete qualitative features through high-dimensional mapping; and
a third processing unit 504, used for performing standardization and normalization.
For the acquisition unit 501, the collected influence factor data mainly covers five dimensions: weather, holidays, highest air temperature, lowest air temperature and seasons. Weather mainly refers to the influence of air temperature and rainfall on the number of tourists. Holidays comprise weekends and national statutory holidays, and different holidays carry different weights in their influence on the number of tourists in the scenic spot. Many scenic spots are sensitive to seasonal change, and the number of tourists differs from season to season. Air temperature refers to the influence of the daily highest and lowest air temperatures in the scenic spot on tourists' travel.
For the first processing unit 502, the raw passenger flow data provided by a scenic spot is prone to missing values, which are handled in two cases. A few scattered missing values are completed by Lagrange interpolation; large runs of missing values are filled with the mean of the historical data for the same period.
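The two missing-value strategies can be sketched as follows. This is a minimal illustration under stated assumptions: the daily series is a Python list with `None` marking missing days, the interpolation uses up to two known neighbours on each side (the text does not specify a window), and the function names are hypothetical.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the Lagrange polynomial passing through the points (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def fill_scattered(series, window=2):
    """Fill isolated None entries from up to `window` known neighbours on each side."""
    filled = list(series)
    for idx, value in enumerate(series):
        if value is None:
            left = [i for i in range(idx - 1, -1, -1) if series[i] is not None][:window]
            right = [i for i in range(idx + 1, len(series)) if series[i] is not None][:window]
            xs = sorted(left + right)
            ys = [series[i] for i in xs]
            filled[idx] = lagrange_interpolate(xs, ys, idx)
    return filled


def same_period_mean(values_from_other_years):
    """Fill a large gap with the mean flow of the same calendar day in other years."""
    return sum(values_from_other_years) / len(values_from_other_years)


daily_flow = [100, 110, None, 130, 140]   # invented series with one scattered gap
print(fill_scattered(daily_flow))
print(same_period_mean([1000, 1200, 1100]))
```

A small window keeps the polynomial degree low; high-degree Lagrange polynomials can oscillate strongly, which is one reason the same-period mean is the more stable choice for long gaps.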
For the second processing unit 503, the collected data set contains discrete qualitative features, and the statistical units and scopes of the input and output variables are inconsistent, which can cause a large loss of prediction precision. Before modeling, the discrete qualitative features are therefore quantized through high-dimensional mapping, and the input data is then standardized and normalized so that both the scenic spot passenger flow data and the data in the influence factor matrix are positive values, which helps improve prediction precision.
The normalization formula is shown in (1), where E(X) represents the mean and Var(X) represents the variance:
Figure BDA0002348767380000101
As shown in FIG. 3, finding the optimal random forest model with the grid search algorithm specifically includes:
(1) Determine the ranges of the number of decision trees k and the number of candidate split attributes m_try, set the step size, and establish a two-dimensional grid on the (k, m_try) coordinate system, where the grid nodes are the corresponding (k, m_try) parameter pairs.
(2) Construct a random forest for each parameter pair on the grid nodes, and estimate the classification error using the out-of-bag data.
(3) Select the parameter pair (k, m_try) with the smallest classification error. If the classification error or the step size meets the requirement, output the optimal parameters and the classification error; otherwise, reduce the step size and repeat the above steps to continue the search.
In one embodiment, calculating the R-squared value, the mean square error, and the mean absolute error comprises:
R-squared value:
Figure BDA0002348767380000102
Mean square error:
Figure BDA0002348767380000103
Mean absolute error:
Figure BDA0002348767380000104
where y denotes the daily passenger flow and y_i denotes the historical passenger flow on day i.
In a third aspect, the present application provides a computer program product, where the computer program product includes a computer program stored in a computer-readable storage medium, and the computer program is loaded by a controller to implement the method shown in fig. 2.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium for storing a computer program, which is loaded by a processor to execute the instructions of the method of fig. 2.
In a fifth aspect, embodiments of the present application provide a chip comprising programmable logic circuits and/or program instructions, which when run, implement the method of fig. 2.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only one logical division, and other divisions are possible in practice; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope of the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (10)

1. A method for predicting scenic spot passenger flow based on a random forest algorithm is characterized by comprising the following steps:
establishing a parameter-optimized random forest algorithm model, debugging model parameters according to goodness of fit and average standard error, and searching an optimal random forest algorithm model based on a grid search algorithm;
and inputting the feature data set into the optimal random forest algorithm model to obtain the predicted future daily passenger flow of the scenic spot.
2. The method of claim 1, wherein the feature data set is obtained in a manner comprising:
collecting historical passenger flow and characteristic factors of scenic spots to obtain historical original data, wherein the characteristic factors comprise weather, holidays, highest air temperature, lowest air temperature and seasons;
respectively completing tourist data values missing from the historical original data by using a Lagrange interpolation method and a contemporaneous historical data mean value aiming at different types of missing values;
quantizing the discrete qualitative characteristics through high-dimensional mapping;
performing standardization and normalization processing.
3. The method of claim 1, wherein the establishing a parameter-optimized random forest algorithm model comprises:
defining a training set and a test set, dividing the training set and the test set of the whole data set according to a proportion, and establishing an input matrix and an output matrix of the training set/the test set;
calculating an R square value, a mean square error and an average absolute error;
and debugging the model parameters according to the goodness of fit and the average standard error, and searching an optimal random forest model based on a grid search algorithm.
4. The method according to claim 1, wherein the finding of the optimal random forest model based on the grid search algorithm specifically comprises:
determining the number k of decision trees and the number m_try of candidate split attributes, setting the step sizes, and establishing a two-dimensional grid on the k and m_try coordinate system, wherein the grid nodes are the corresponding parameter pairs of k and m_try;
constructing a random forest for each parameter pair on the grid nodes, and estimating the classification error using the out-of-bag data;
selecting the parameters k and m_try with the smallest classification error, and if the classification error or the step size meets the requirement, outputting the optimal parameters and the classification error.
5. The method of claim 1, wherein calculating the R-squared, mean-squared and mean-absolute-error comprises:
R-squared value
Figure FDA0002348767370000021
Mean square error
Figure FDA0002348767370000022
Mean absolute error
Figure FDA0002348767370000023
Wherein y represents the daily passenger flow volume, and y_i represents the passenger flow volume on historical day i.
6. An apparatus, comprising:
the characteristic data set unit is used for processing historical original data consisting of historical passenger flow and characteristic factors of scenic spots to obtain a characteristic data set;
and the random forest algorithm model is used for outputting the predicted future daily passenger flow of the scenic spot according to the input feature data set.
7. The apparatus of claim 6, wherein the feature data set unit comprises:
the acquisition unit is used for collecting historical passenger flow and characteristic factors of scenic spots to obtain historical original data, wherein the characteristic factors comprise weather, holidays, maximum temperature, minimum temperature and seasons;
the first processing unit is used for completing the tourist data values missing from the historical original data by using a Lagrange interpolation method and a contemporaneous historical data mean value respectively aiming at different types of missing values;
the second processing unit is used for quantizing the discrete qualitative characteristics through high-dimensional mapping;
and the third processing unit is used for carrying out standardization and normalization processing.
8. The apparatus of claim 6, wherein the random forest algorithm model is generated by:
defining a training set and a test set, dividing the training set and the test set of the whole data set according to a proportion, and establishing an input matrix and an output matrix of the training set/the test set;
calculating an R square value, a mean square error and an average absolute error;
and debugging the model parameters according to the goodness of fit and the average standard error, and searching an optimal random forest model based on a grid search algorithm.
9. The apparatus as claimed in claim 8, wherein the finding of the optimal random forest model based on the grid search algorithm by the random forest algorithm model specifically comprises:
determining the number k of decision trees and the number m_try of candidate split attributes, setting the step sizes, and establishing a two-dimensional grid on the k and m_try coordinate system, wherein the grid nodes are the corresponding parameter pairs of k and m_try;
constructing a random forest for each parameter pair on the grid nodes, and estimating the classification error using the out-of-bag data;
selecting the parameters k and m_try with the smallest classification error, and if the classification error or the step size meets the requirement, outputting the optimal parameters and the classification error.
10. The apparatus of claim 9, wherein the random forest algorithm model calculating an R-squared value, a mean squared error, and a mean absolute error comprises:
R-squared value
Figure FDA0002348767370000031
Mean square error
Figure FDA0002348767370000032
Mean absolute error
Figure FDA0002348767370000033
Wherein y represents the daily passenger flow volume, and y_i represents the passenger flow volume on historical day i.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911406473.0A CN111178633A (en) 2019-12-31 2019-12-31 Method and device for predicting scenic spot passenger flow based on random forest algorithm

Publications (1)

Publication Number Publication Date
CN111178633A true CN111178633A (en) 2020-05-19

Family

ID=70654257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911406473.0A Pending CN111178633A (en) 2019-12-31 2019-12-31 Method and device for predicting scenic spot passenger flow based on random forest algorithm

Country Status (1)

Country Link
CN (1) CN111178633A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106779196A (en) * 2016-12-05 2017-05-31 中国航天系统工程有限公司 A kind of tourist flow prediction and peak value regulation and control method based on tourism big data
CN108877226A (en) * 2018-08-24 2018-11-23 交通运输部规划研究院 Scenic spot traffic for tourism prediction technique and early warning system
CN109034469A (en) * 2018-07-20 2018-12-18 成都中科大旗软件有限公司 A kind of tourist flow prediction technique based on machine learning
CN110135630A (en) * 2019-04-25 2019-08-16 武汉数澎科技有限公司 The short term needing forecasting method with multi-step optimization is returned based on random forest
CN110175690A (en) * 2019-04-04 2019-08-27 中兴飞流信息科技有限公司 A kind of method, apparatus, server and the storage medium of scenic spot passenger flow forecast
CN110322075A (en) * 2019-07-10 2019-10-11 上饶市中科院云计算中心大数据研究院 A kind of scenic spot passenger flow forecast method and system based on hybrid optimization RBF neural
CN110443314A (en) * 2019-08-08 2019-11-12 中国工商银行股份有限公司 Scenic spot passenger flow forecast method and device based on machine learning

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112330359A (en) * 2020-11-04 2021-02-05 上饶市中科院云计算中心大数据研究院 Smart tourist attraction saturation evaluation method and device
CN112949939A (en) * 2021-03-30 2021-06-11 福州市电子信息集团有限公司 Taxi passenger carrying hotspot prediction method based on random forest model
CN112949939B (en) * 2021-03-30 2022-12-06 福州市电子信息集团有限公司 Taxi passenger carrying hotspot prediction method based on random forest model
CN113256695A (en) * 2021-06-23 2021-08-13 武汉工程大学 Random forest based terrain prediction model method for potassium sulfate production salt pond
CN113256695B (en) * 2021-06-23 2021-10-08 武汉工程大学 Random forest based terrain prediction model method for potassium sulfate production salt pond
CN113962437A (en) * 2021-09-18 2022-01-21 深圳市城市交通规划设计研究中心股份有限公司 Construction method of people stream prediction model and people stream situation prediction method of rail transit station

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200519