CN116187501A - Low-temperature prediction based on Catboost model - Google Patents
- Publication number: CN116187501A
- Application number: CN202211509183.0A
- Authority
- CN
- China
- Prior art keywords
- model
- variable
- correlation
- data
- catboost
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
- G01W1/10—Devices for predicting weather conditions
- G06F17/15—Correlation function computation including computation of convolution operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
- G06Q50/06—Energy or water supply
- G06Q50/26—Government or public services
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention provides a low-temperature prediction method based on the Catboost model, implemented in the following steps: (1) acquire meteorological data; (2) establish lag features of the meteorological data through autocorrelation coefficients; (3) preprocess the data, specifically missing-value filling and data normalization; (4) select features with LassoCV; (5) establish the LassoCV-Catboost model for low-temperature prediction; (6) optimize the parameters of the Catboost model with the genetic algorithm (GA) to obtain the final low-temperature prediction model; (7) evaluate the model.
Description
Technical Field
The invention belongs to the technical field of early warning and monitoring of icing disasters on power transmission lines, and more specifically relates to a low-temperature prediction method based on the Catboost model. The invention can be used for predicting low temperature.
Background
Power-grid operating experience shows that line breakage and tower collapse caused by transmission-line icing inflict severe damage and threaten the safe, stable operation of the grid. Icing accidents mostly occur in microclimate areas; icing is a composite physical phenomenon driven by temperature, humidity, convection of cold and warm air, atmospheric circulation, wind and other factors. Because low temperature is one of the main causes of line icing, accurate low-temperature prediction provides good data support for short-term icing forecasting of transmission lines. Low-temperature data have time-series characteristics, yet most traditional prediction methods model a univariate series, whereas air-temperature change is actually driven by several meteorological factors acting together; wind direction, wind speed and relative humidity are strongly correlated with air temperature. Traditional time-series temperature models (multiple linear regression, the autoregressive integrated moving average model ARIMA and grey prediction) struggle to track the dynamic change of air temperature, and their predictions basically regress toward the mean.
Tao et al. proposed a long short-term memory (LSTM) temperature prediction model based on random forests; Niu Zhijuan et al. built temperature prediction models with a back-propagation (BP) neural network and a radial basis function (RBF) neural network using principal component analysis, which consider the influence of multivariate meteorological data on air temperature but ignore the time-series characteristics of those data themselves. Jiang Genwei et al. applied a PSO-RBF-ANN model to temperature prediction; although the structural parameters of the RBF model are optimized by particle swarm optimization, it still performs univariate time-series prediction, so its accuracy is limited.
Disclosure of Invention
The invention aims to provide a low-temperature prediction method based on the Catboost model, addressing the problems that traditional prediction methods learn poorly from massive data and do not fully consider the influence of multivariate meteorological data, and of its temporal correlation, on air-temperature change.
The method first establishes lag features using the autocorrelation coefficient, then exploits the ability of LassoCV to measure the importance of individual feature variables to screen, from the established features, those highly correlated with low temperature as input variables of the Catboost model, models the low-temperature time-series data, and finally optimizes the parameters of the Catboost model with the genetic algorithm (GA) to obtain the final low-temperature prediction model.
The implementation of the invention comprises the following steps:
(1) Acquiring meteorological data;
(2) Establishing lag features of the meteorological data through autocorrelation coefficients;
(3) Preprocessing the data, specifically missing-value filling and data normalization;
(4) Selecting features through LassoCV;
(5) Establishing the LassoCV-Catboost model for low-temperature prediction;
(6) Optimizing the parameters of the Catboost model through the genetic algorithm GA to obtain the final low-temperature prediction model;
(7) Evaluating the model.
further, the hysteresis feature establishing method in the step (2) is as follows:
measuring the current moment y by means of an autocorrelation coefficient t With a lag of y from k t-k Correlation between them. The correlation measure is the degree of correlation of two random variables, and the correlation coefficient can measure the linear correlation between the two variables.
r k Denoted by y t And his degree of correlation of the k-order hysteresis. r is (r) k Referred to as autocorrelation coefficients (Autocorrelation Coefficient, ACF), the autocorrelation study is a relationship of values of a time series at different time points.
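The lag-feature construction above can be sketched in a few lines. This is an illustrative, minimal implementation, not the patent's code: the function name `autocorr`, the example series and the 0.2 threshold are our own assumptions.

```python
# Minimal sketch: lag-k autocorrelation coefficient r_k, used to decide
# which lagged values of a meteorological series to keep as lag features.

def autocorr(y, k):
    """Lag-k autocorrelation coefficient r_k of the series y."""
    n = len(y)
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)              # total variance term
    num = sum((y[t] - mean) * (y[t - k] - mean)          # lag-k covariance term
              for t in range(k, n))
    return num / denom

# Example: keep only lags whose |r_k| exceeds a (hypothetical) threshold.
series = [5.0, 4.2, 3.9, 4.8, 5.5, 4.1, 3.7, 4.6, 5.3, 4.0]
selected_lags = [k for k in range(1, 5) if abs(autocorr(series, k)) > 0.2]
```

Each selected lag k then becomes one lag-feature column y_{t-k} in the model input.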
Further, the data preprocessing in step (3) proceeds as follows:
The data are processed and modeled in a Python environment. Missing and null values are filled with the fillna function, using the mean value. Next, the date and time information (year, month, day) is combined into a single datetime so that it can serve as the Pandas index.
The input data are then normalized. The normalization method chosen here is linear (Min-Max) scaling, computed as
x_i* = (x_i - min) / (max - min)
where x_i (i = 1, 2, 3, …, n) is a meteorological input feature, x_i* is the normalized value, max is the maximum of the meteorological element, and min is its minimum.
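The preprocessing step can be sketched as follows. The patent uses Pandas fillna; this is an equivalent standard-library sketch whose function names (`fill_mean`, `min_max`) are our own.

```python
# Minimal sketch of step (3): mean filling of missing values, then
# Min-Max normalization x* = (x - min) / (max - min).

def fill_mean(values):
    """Replace None (missing) entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max(values):
    """Scale values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

temps = fill_mean([2.0, None, 4.0, 6.0])   # -> [2.0, 4.0, 4.0, 6.0]
scaled = min_max(temps)                    # -> [0.0, 0.5, 0.5, 1.0]
```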
Further, the LassoCV feature selection in step (4) proceeds as follows:
Set the linear regression model as
Y = X^T β + ε (3)
where X = [x_1, x_2, …, x_i, …, x_n]^T with x_i = [x_{i,1}, x_{i,2}, …, x_{i,m}] ∈ R^{1×m} is the low-temperature feature data after autocorrelation preprocessing, Y = [y_1, y_2, …, y_n]^T ∈ R^{n×1} is the response variable, β = [β_1, β_2, …, β_m]^T is the model coefficients, and ε = [ε_1, ε_2, …, ε_n]^T ∈ R^{n×1} is the error vector. The ordinary least-squares estimate of the linear regression model is β̂ = (X^T X)^{-1} X^T Y. Adding the L1 constraint, i.e. LASSO, gives
β̂_LASSO = arg min_β { ||Y - Xβ||_2^2 + λ||β||_1 } (4)
The parameter λ is the penalty coefficient of the estimate; its value is determined by ten-fold cross-validation, and the parameter α is determined in the same way.
The LASSO regression is solved by least angle regression, a variable-screening algorithm built on the forward selection algorithm and the forward gradient algorithm that yields more accurate feature vectors. Specifically:
1) The forward selection algorithm proceeds as follows: from X = [x_1, x_2, …, x_i, …, x_n]^T, select the independent variable x_k = [x_{k,1}, x_{k,2}, …, x_{k,m}] closest to the target variable y_k, so that
y_k ≈ x_k β_k (5)
where the coefficient β_k is determined by projection,
β_k = <x_k, y_k> / <x_k, x_k> (6)
The variable residual is
y_res,k = y_k - x_k β_k (7)
The residual is defined as the new target variable, and the variable set without x_k becomes the new independent-variable set; the process repeats until the residual falls below the set range or the independent-variable set is empty, and the algorithm terminates.
2) The forward gradient algorithm selects, at each step, the feature variable x_k with the largest correlation to approximate the target variable y_k; unlike the forward selection algorithm, its residual is defined as
y_res,k = y_k - x_k β_k (8)
Taking this residual as the new objective function and keeping the original variable set X = [x_1, x_2, …, x_i, …, x_n]^T as the variable set, the computation of y_res,k = y_k - x_k β_k is repeated until the residual y_res,k falls below the set threshold range, yielding the optimal solution.
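The forward-selection loop above can be sketched in plain Python. This is an illustrative simplification with our own names (`dot`, `forward_select`), reducing the patent's procedure to repeated one-variable projection steps; it is not the full least-angle-regression solver.

```python
# Minimal sketch of forward selection: repeatedly pick the column most
# correlated with the current residual, fit it by projection (eq.-(6)-style),
# and subtract its contribution (eq.-(7)-style residual).

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def forward_select(columns, y, steps):
    """Greedy selection over `columns` (a list of equal-length lists).

    Returns the chosen column indices, in order, and the final residual."""
    chosen, residual = [], list(y)
    remaining = list(range(len(columns)))
    for _ in range(steps):
        if not remaining:
            break
        # column with the largest absolute correlation with the residual
        k = max(remaining,
                key=lambda j: abs(dot(columns[j], residual))
                              / dot(columns[j], columns[j]) ** 0.5)
        beta = dot(columns[k], residual) / dot(columns[k], columns[k])
        residual = [r - beta * v for r, v in zip(residual, columns[k])]
        chosen.append(k)
        remaining.remove(k)
    return chosen, residual
```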
Further, the genetic algorithm GA parameter optimization in step (6) proceeds as follows:
Low-temperature prediction is defined as using the historical meteorological-element sequence {…, x_{t-1}, x_t} to predict the future low-temperature sequence {x_{t+1}, x_{t+2}, …}. The preprocessed, feature-selected data set is fed into the Catboost model for training; the genetic algorithm GA combined with ten-fold cross-validation then optimizes the main hyperparameters of the Catboost model, including iterations, learning_rate, max_depth and criterion, to improve the accuracy of the model's low-temperature prediction.
Further, the model evaluation in step (7) proceeds as follows:
Using the actual and predicted low-temperature values, the model's accuracy is compared with the traditional prediction model ARIMA and with a long short-term memory network (LSTM). The root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) are chosen as the evaluation indices of the model:
RMSE = sqrt( (1/N) Σ_{i=1}^{N} (y'_i - y_i)^2 )
MAE = (1/N) Σ_{i=1}^{N} |y'_i - y_i|
MAPE = (100%/N) Σ_{i=1}^{N} |(y'_i - y_i) / y_i|
where y'_i is the predicted low-temperature value, y_i is the actual air-temperature value, and N is the number of samples. The LassoCV-Catboost test results are compared with the predictions of the ARIMA and LSTM models; the final fitting comparison is shown in FIG. 2.
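The three evaluation indices can be sketched directly from their definitions; the function names below are our own.

```python
# Minimal sketch of the evaluation indices: RMSE, MAE and MAPE between
# predicted (y'_i) and actual (y_i) low-temperature values.
import math

def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

def mae(pred, actual):
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)

def mape(pred, actual):
    """Mean absolute percentage error, in percent (actual values nonzero)."""
    return 100.0 * sum(abs((p - a) / a) for p, a in zip(pred, actual)) / len(pred)

pred, actual = [1.0, 2.0, 4.0], [1.0, 2.0, 2.0]
```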
Compared with the prior art, the invention has the following advantages:
First, the Catboost gradient-boosted-tree ensemble model is well suited to modeling multivariate time-series data; compared with traditional time-series prediction methods, its gradient one-side sampling (GOSS), exclusive feature bundling (EFB) and histogram algorithm (Hist) better handle high dimensionality, nonlinearity and local minima, giving stronger data-learning and generalization capability.
Second, LassoCV can analyze feature importance, and its regularization term guards against overfitting.
Third, the invention establishes lag features via autocorrelation coefficients, performs feature selection on the multivariate meteorological time series with LassoCV, and finally tunes parameters with the genetic algorithm GA, providing more effective and accurate data for model construction and reducing model complexity.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a graph of the final fit result of the present invention;
Detailed Description
The present invention is described in further detail below with reference to examples.
Those skilled in the art will appreciate that the following examples illustrate the invention and should not be construed as limiting its scope. Where specific techniques or conditions are not indicated, the examples follow techniques or conditions described in the literature of this field or the product specifications. Materials and equipment whose manufacturers are not indicated are conventional products available commercially.
The implementation flow of the low-temperature prediction method based on the Catboost model is shown in FIG. 1 and comprises the following steps:
(1) Acquiring meteorological data;
(2) Establishing lag features of the meteorological data through autocorrelation coefficients;
(3) Preprocessing the data, specifically missing-value filling and data normalization;
(4) Selecting features through LassoCV;
(5) Establishing the LassoCV-Catboost model for low-temperature prediction;
(6) Optimizing the parameters of the Catboost model through the genetic algorithm GA to obtain the final low-temperature prediction model;
(7) Evaluating the model.
Preferably, the specific process of step (2) is as follows:
The autocorrelation coefficient measures the correlation between the current value y_t and its k-lag value y_{t-k}. Correlation measures the degree of association of two random variables, and the correlation coefficient measures the linear dependence between them.
r_k denotes the degree of correlation between y_t and its k-order lag, and is called the autocorrelation coefficient (ACF); autocorrelation studies the relationship between the values of one time series at different time points.
Preferably, the specific process of step (3) is as follows:
The data are processed and modeled in a Python environment. Missing and null values are filled with the fillna function, using the mean value. Next, the date and time information (year, month, day) is combined into a single datetime so that it can serve as the Pandas index.
The input data are then normalized. The normalization method chosen here is linear (Min-Max) scaling, computed as
x_i* = (x_i - min) / (max - min)
where x_i (i = 1, 2, 3, …, n) is a meteorological input feature, x_i* is the normalized value, max is the maximum of the meteorological element, and min is its minimum.
Preferably, the specific process of step (4) is as follows:
Set the linear regression model as Y = X^T β + ε, where X = [x_1, x_2, …, x_i, …, x_n]^T with x_i = [x_{i,1}, x_{i,2}, …, x_{i,m}] ∈ R^{1×m} is the low-temperature feature data after autocorrelation preprocessing, Y = [y_1, y_2, …, y_n]^T ∈ R^{n×1} is the response variable, β = [β_1, β_2, …, β_m]^T is the model coefficients, and ε = [ε_1, ε_2, …, ε_n]^T ∈ R^{n×1} is the error vector. The ordinary least-squares estimate of the linear regression model is β̂ = (X^T X)^{-1} X^T Y. Adding the L1 constraint, i.e. LASSO, gives
β̂_LASSO = arg min_β { ||Y - Xβ||_2^2 + λ||β||_1 }
The parameter λ is the penalty coefficient of the estimate; its value is determined by ten-fold cross-validation, and the parameter α is determined in the same way.
The LASSO regression is solved by least angle regression, a variable-screening algorithm built on the forward selection algorithm and the forward gradient algorithm that yields more accurate feature vectors. Specifically:
1) The forward selection algorithm proceeds as follows: from X = [x_1, x_2, …, x_i, …, x_n]^T, select the independent variable x_k = [x_{k,1}, x_{k,2}, …, x_{k,m}] closest to the target variable y_k, so that
y_k ≈ x_k β_k (16)
where the coefficient β_k is determined by projection,
β_k = <x_k, y_k> / <x_k, x_k> (17)
The variable residual is
y_res,k = y_k - x_k β_k (18)
The residual is defined as the new target variable, and the variable set without x_k becomes the new independent-variable set; the process repeats until the residual falls below the set range or the independent-variable set is empty, and the algorithm terminates.
2) The forward gradient algorithm selects, at each step, the feature variable x_k with the largest correlation to approximate the target variable y_k; unlike the forward selection algorithm, its residual is defined as
y_res,k = y_k - x_k β_k (19)
Taking this residual as the new objective function and keeping the original variable set X = [x_1, x_2, …, x_i, …, x_n]^T as the variable set, the computation of y_res,k = y_k - x_k β_k is repeated until the residual y_res,k falls below the set threshold range, yielding the optimal solution.
The specific steps of the LASSO algorithm are:
Step 1: according to equations (16) and (17), solve for the variable x_k having the highest correlation with the objective function, remove it from the variable set, and determine the new target variable according to equation (19);
Step 2: repeat Step 1 until a new variable x_l is obtained whose correlation with the target variable y_res,k equals the correlation of x_k with y_res,k;
Step 3: on the angular bisector of x_k and x_l, re-approximate by equation (19) to obtain a variable x_t whose correlation with y_res,k equals that of x_k and x_l with y_res,k; add x_t to the feature set and take the common angular bisector of the feature set as the new approach direction;
Step 4: cycle through the above process until y_res,k is small enough or the variable set is empty; the final feature set contains the required feature variables.
Preferably, the specific process of step (6) is as follows:
Low-temperature prediction is defined as using the historical meteorological-element sequence {…, x_{t-1}, x_t} to predict the future low-temperature sequence {x_{t+1}, x_{t+2}, …}. The preprocessed, feature-selected data set is fed into the Catboost model for training; the genetic algorithm GA combined with ten-fold cross-validation then optimizes the main hyperparameters of the Catboost model, including iterations, learning_rate, max_depth and criterion, to improve the accuracy of the model's low-temperature prediction.
Preferably, the specific process of step (7) is as follows:
Using the actual and predicted low-temperature values, the model's accuracy is compared with the traditional prediction model ARIMA and with a long short-term memory network (LSTM). The root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) are chosen as the evaluation indices:
RMSE = sqrt( (1/N) Σ_{i=1}^{N} (y'_i - y_i)^2 )
MAE = (1/N) Σ_{i=1}^{N} |y'_i - y_i|
MAPE = (100%/N) Σ_{i=1}^{N} |(y'_i - y_i) / y_i|
where y'_i is the predicted low-temperature value, y_i is the actual air-temperature value, and N is the number of samples. The LassoCV-Catboost test results are compared with the predictions of the ARIMA and LSTM models; the final fitting comparison is shown in FIG. 2.
Application example of the invention:
(1) Data acquisition: meteorological data were downloaded from the Chinese Academy of Sciences (http://www.resdc.cn/) website.
(2) The invention predicts historical low-temperature data of a site in Yunnan Province and evaluates and compares the models.
(3) The comparison chart is shown in FIG. 2 of the accompanying drawings. The invention achieves an RMSE of 1.432, an MAE of 1.223 and a MAPE of 11.38%.
Claims (7)
1. A low-temperature prediction method based on the Catboost model, characterized in that it fully considers the influence of multivariate meteorological data, and of its temporal correlation, on temperature change, and comprises the following steps:
(1) Acquiring meteorological data;
(2) Establishing lag features of the meteorological data through autocorrelation coefficients;
(3) Preprocessing the data, specifically missing-value filling and data normalization;
(4) Selecting features through LassoCV;
(5) Establishing the LassoCV-Catboost model for low-temperature prediction;
(6) Optimizing the parameters of the Catboost model through the genetic algorithm GA to obtain the final low-temperature prediction model;
(7) Evaluating the model.
2. The Catboost-model-based low-temperature prediction method according to claim 1, wherein the lag features in step (2) are established as follows:
the autocorrelation coefficient measures the correlation between the current value y_t and its k-lag value y_{t-k}; correlation measures the degree of association of two random variables, and the correlation coefficient measures the linear dependence between them;
r_k denotes the degree of correlation between y_t and its k-order lag, and is called the autocorrelation coefficient (ACF); autocorrelation studies the relationship between the values of one time series at different time points.
3. The Catboost-model-based low-temperature prediction method according to claim 1, wherein the data preprocessing in step (3) comprises:
processing and modeling the data in a Python environment, filling missing and null values with the fillna function using the mean value, and combining the date and time information (year, month, day) into a single datetime to serve as the Pandas index;
normalizing the input data by linear (Min-Max) scaling, computed as
x_i* = (x_i - min) / (max - min)
4. The Catboost-model-based low-temperature prediction method according to claim 1, wherein the LassoCV feature selection in step (4) proceeds as follows:
set the linear regression model as
Y = X^T β + ε (3)
wherein X = [x_1, x_2, …, x_i, …, x_n]^T with x_i = [x_{i,1}, x_{i,2}, …, x_{i,m}] ∈ R^{1×m} is the low-temperature feature data after autocorrelation preprocessing, Y = [y_1, y_2, …, y_n]^T ∈ R^{n×1} is the response variable, β = [β_1, β_2, …, β_m]^T is the model coefficients, and ε = [ε_1, ε_2, …, ε_n]^T ∈ R^{n×1} is the error vector; the ordinary least-squares estimate of the linear regression model is β̂ = (X^T X)^{-1} X^T Y; adding the L1 constraint, i.e. LASSO, gives
β̂_LASSO = arg min_β { ||Y - Xβ||_2^2 + λ||β||_1 } (4)
the parameter λ is the penalty coefficient of the estimate; its value is determined by ten-fold cross-validation, and the parameter α is determined in the same way;
the LASSO regression algorithm is solved by the minimum regression method, the minimum regression method is a variable screening algorithm based on a forward selection algorithm and a forward gradient algorithm, and more accurate feature vectors can be obtained, and the method is specifically described as follows:
1) The calculation process of the forward selection algorithm is as follows: from X = [x_1, x_2, …, x_i, …, x_n]^T, select the independent variable x_i = [x_{i,1}, x_{i,2}, …, x_{i,m}] closest to the target variable y_k, so that y_k is approximated by x_i β_i;
wherein the coefficient β_k is determined by the above method, and the variable residual is y_res = y_k − x_k β_k; the variable residual is defined as the new target variable, while the set X with x_k removed is taken as the new independent-variable set; this process is repeated until the residual is smaller than a set range or the independent-variable set is empty, and the algorithm then terminates;
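The forward-selection loop in 1) can be sketched as a greedy procedure: pick the column most correlated with the current residual, regress it out, and treat the residual as the new target. A self-contained NumPy illustration on synthetic data (not data from the patent):

```python
import numpy as np

def forward_select(X, y, n_steps):
    """Greedy forward selection: at each step pick the column of X most
    correlated with the current residual, fit its coefficient by least
    squares, and make the residual the new target variable."""
    resid = np.asarray(y, dtype=float).copy()
    chosen = []
    for _ in range(n_steps):
        k = int(np.argmax(np.abs(X.T @ resid)))   # most correlated variable
        chosen.append(k)
        xk = X[:, k]
        beta = (xk @ resid) / (xk @ xk)           # least-squares coefficient beta_k
        resid = resid - beta * xk                 # y_res = y_k - x_k * beta_k
    return chosen, resid

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = 3.0 * X[:, 2] + 0.01 * rng.standard_normal(50)  # target driven by column 2
chosen, resid = forward_select(X, y, 1)
# chosen picks column 2; the residual left over is just the small noise term.
```

The forward gradient variant in 2) differs only in taking a small step toward the correlated variable instead of regressing it out completely.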
2) The forward gradient algorithm at each step selects the feature variable x_k with the largest correlation to approximate the target variable y_k; unlike the forward selection algorithm, its residual is defined as:
y_{res,k} = y_k − x_k β_k (8)
the residual is taken as the new objective function and the original variable set X = [x_1, x_2, …, x_i, …, x_n]^T as the variable set; the calculation y_{res,k} = y_k − x_k β_k is repeated until the residual y_{res,k} is smaller than the set threshold range, yielding the optimal solution;
the LASSO algorithm comprises the following specific steps:
Step 1: solve, according to equations (5) and (6), for the variable x_k having the highest correlation with the objective function; remove it from the variable set, and determine the new target variable according to formula (8);
Step 2: repeat Step 1 until a new variable x_l is obtained whose correlation with the target variable y_{res,k} is the same as the correlation between x_k and y_{res,k};
Step 3: on the angular bisector of x_k and x_l, re-approximate by equation (8) to obtain a variable x_t such that the correlation of x_t with y_{res,k} is the same as that of x_k and x_l with y_{res,k}; add the variable x_t to the feature set, and take the common angular bisector of the feature set as the new approach direction;
Step 4: cycle the above process until y_{res,k} is small enough or the variable set is empty; the final feature set contains the required feature variables.
5. The method for predicting low temperature based on the Catboost model as claimed in claim 1, wherein in step (5) the features selected by LassoCV are fed into the Catboost model for training.
6. The method for predicting low temperature based on the Catboost model as claimed in claim 1, wherein the specific steps of the genetic algorithm GA optimization parameters in the step (6) are as follows:
the low-temperature prediction is defined as a sequence of meteorological elements { …, x }, obtained with historical time t-1 ,x t Low temperature sequence {, x } to predict future time t+1 ,x t+2 …, the data set after pretreatment and feature selection is brought into a Catboost model for training, and then a genetic algorithm GA is selected to combine with ten-fold cross validation to optimize main super parameters of the Catboost model, including interfaces, learning_rate, max_depth and criterion, so as to improve the precision of model low-temperature prediction.
7. The method for predicting low temperature based on the Catboost model as claimed in claim 1, wherein the specific steps of the model evaluation in the step (7) are as follows:
according to the actual low-temperature values and the predicted low-temperature values, the model of the invention is compared in accuracy with the traditional ARIMA prediction model and a long short-term memory network (LSTM); root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) are selected as evaluation indexes of the model, with calculation formulas: RMSE = sqrt((1/N) Σ_{i=1}^{N} (y_i′ − y_i)²), MAE = (1/N) Σ_{i=1}^{N} |y_i′ − y_i|, MAPE = (100%/N) Σ_{i=1}^{N} |(y_i′ − y_i)/y_i|;
wherein y_i′ is the predicted low-temperature value, y_i is the actual air-temperature value, and N is the number of data points; the test results of the LassoCV-Catboost model are compared with the prediction results of the ARIMA model and the LSTM model, and the final fitting-result comparison chart is shown in fig. 2.
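The three evaluation indexes named in claim 7, written out in NumPy (these are the standard definitions; the sample values are illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    d = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.sqrt(np.mean(d ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    d = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.mean(np.abs(d)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; undefined if any true value is 0."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

actual = [2.0, 4.0, 5.0]   # hypothetical observed low temperatures
pred   = [2.5, 3.5, 5.0]   # hypothetical model output
# mae(actual, pred) == 1/3, mape(actual, pred) == 12.5
```

Note that MAPE divides by the actual value, so it is unstable when observed temperatures are near 0 °C; temperatures in Kelvin or an offset scale avoid this.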
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211509183.0A CN116187501A (en) | 2022-11-29 | 2022-11-29 | Low-temperature prediction based on Catboost model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211509183.0A CN116187501A (en) | 2022-11-29 | 2022-11-29 | Low-temperature prediction based on Catboost model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116187501A true CN116187501A (en) | 2023-05-30 |
Family
ID=86441046
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211509183.0A Pending CN116187501A (en) | 2022-11-29 | 2022-11-29 | Low-temperature prediction based on Catboost model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116187501A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110046756A (en) * | 2019-04-08 | 2019-07-23 | 东南大学 | Short-time weather forecasting method based on Wavelet Denoising Method and Catboost |
CN111311025A (en) * | 2020-03-17 | 2020-06-19 | 南京工程学院 | Load prediction method based on meteorological similar days |
KR102149053B1 (en) * | 2020-05-14 | 2020-08-31 | 주식회사 애자일소다 | Modeling system and method for predicting component |
CN113641959A (en) * | 2021-08-13 | 2021-11-12 | 山东电工电气集团有限公司 | High-voltage cable joint temperature trend prediction method |
CN113705877A (en) * | 2021-08-23 | 2021-11-26 | 武汉大学 | Real-time monthly runoff forecasting method based on deep learning model |
CN113821895A (en) * | 2021-09-01 | 2021-12-21 | 南方电网科学研究院有限责任公司 | Construction method and device of power transmission line icing thickness prediction model and storage medium |
CN115018193A (en) * | 2022-07-01 | 2022-09-06 | 北京华能新锐控制技术有限公司 | Time series wind energy data prediction method based on LSTM-GA model |
2022-11-29 CN CN202211509183.0A patent/CN116187501A/en active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110046756A (en) * | 2019-04-08 | 2019-07-23 | 东南大学 | Short-time weather forecasting method based on Wavelet Denoising Method and Catboost |
CN111311025A (en) * | 2020-03-17 | 2020-06-19 | 南京工程学院 | Load prediction method based on meteorological similar days |
KR102149053B1 (en) * | 2020-05-14 | 2020-08-31 | 주식회사 애자일소다 | Modeling system and method for predicting component |
CN113641959A (en) * | 2021-08-13 | 2021-11-12 | 山东电工电气集团有限公司 | High-voltage cable joint temperature trend prediction method |
CN113705877A (en) * | 2021-08-23 | 2021-11-26 | 武汉大学 | Real-time monthly runoff forecasting method based on deep learning model |
CN113821895A (en) * | 2021-09-01 | 2021-12-21 | 南方电网科学研究院有限责任公司 | Construction method and device of power transmission line icing thickness prediction model and storage medium |
CN115018193A (en) * | 2022-07-01 | 2022-09-06 | 北京华能新锐控制技术有限公司 | Time series wind energy data prediction method based on LSTM-GA model |
Non-Patent Citations (5)
Title |
---|
周元哲: "Python Data Analysis and Machine Learning", 30 June 2022, China Machine Press, pages 116-118 *
成立明 et al.: "Python Crawlers, Data Analysis and Visualization: Tool Explanations and Practical Cases", 31 January 2021, Harbin Institute of Technology Press, pages 142-237 *
艾小伟: "Python Programming: From Basic Development to Data Analysis", 31 July 2021, China Machine Press, pages 248-249 *
董力铭 et al.: "Coupled modeling with the categorical gradient boosting algorithm (CatBoost) and the bat algorithm (Bat) to predict water surface evaporation in northwestern China", Water Saving Irrigation, no. 2, 10 February 2021 (2021-02-10), pages 63-69 *
钟晓妮 et al.: "Chinese Dictionary of Biomedical Statistics: Descriptive Statistics Volume", 31 December 2020, China Statistics Press, page 79 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Bouzgou et al. | Minimum redundancy–maximum relevance with extreme learning machines for global solar radiation forecasting: Toward an optimized dimensionality reduction for solar time series | |
CN111310968A (en) | LSTM neural network circulation hydrological forecasting method based on mutual information | |
CN111105104A (en) | Short-term power load prediction method based on similar day and RBF neural network | |
CN111028100A (en) | Refined short-term load prediction method, device and medium considering meteorological factors | |
CN115271186B (en) | Reservoir water level prediction and early warning method based on delay factor and PSO RNN Attention model | |
CN109143408B (en) | Dynamic region combined short-time rainfall forecasting method based on MLP | |
CN113536665B (en) | Road surface temperature short-term prediction method and system based on characteristic engineering and LSTM | |
CN113139605A (en) | Power load prediction method based on principal component analysis and LSTM neural network | |
CN111506868B (en) | Ultra-short-term wind speed prediction method based on HHT weight optimization | |
CN118095570A (en) | Intelligent load prediction method and system for transformer area, electronic equipment, medium and chip | |
CN115329930A (en) | Flood process probability forecasting method based on mixed deep learning model | |
CN116205508A (en) | Distributed photovoltaic power generation abnormality diagnosis method and system | |
CN115310648A (en) | Medium-and-long-term wind power combination prediction method based on multi-meteorological variable model identification | |
CN114429238A (en) | Wind turbine generator fault early warning method based on space-time feature extraction | |
CN116960962A (en) | Mid-long term area load prediction method for cross-area data fusion | |
CN112861418A (en) | Short-term icing thickness prediction method for stay cable based on GA-WOA-GRNN network | |
CN117200223A (en) | Day-ahead power load prediction method and device | |
CN117290673A (en) | Ship energy consumption high-precision prediction system based on multi-model fusion | |
JP7342369B2 (en) | Prediction system, prediction method | |
CN115034426B (en) | Rolling load prediction method based on phase space reconstruction and multi-model fusion Stacking integrated learning mode | |
CN115936236A (en) | Method, system, equipment and medium for predicting energy consumption of cigarette factory | |
CN116187501A (en) | Low-temperature prediction based on Catboost model | |
CN115907228A (en) | Short-term power load prediction analysis method based on PSO-LSSVM | |
CN115759343A (en) | E-LSTM-based user electric quantity prediction method and device | |
CN112581311B (en) | Method and system for predicting long-term output fluctuation characteristics of aggregated multiple wind power plants |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||