CN117391221B

CN117391221B - NDVI prediction integrated optimization method and system based on machine learning

Info

Publication number: CN117391221B
Application number: CN202311687644.8A
Authority: CN
Inventors: 周泽慧; 黄卫东; 翟青; 孙殿臣
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2023-12-11
Filing date: 2023-12-11
Publication date: 2024-02-20
Anticipated expiration: 2043-12-11
Also published as: CN117391221A

Abstract

The invention discloses an NDVI prediction integrated optimization method and system based on machine learning. The method comprises: acquiring research data of a research area, then screening and preprocessing the data to form an input variable data set, the research data including at least an NDVI data set and a climate variable data set; selecting at least one type of NDVI prediction model and constructing an NDVI prediction model set comprising at least two types of NDVI prediction models; constructing an NDVI prediction integrated optimization model comprising an objective function, a weight matrix and constraint conditions; and solving the NDVI prediction integrated optimization model with a preconfigured algorithm, determining the optimal weights, calculating and evaluating the prediction accuracy, and outputting a prediction result. To address the tendency of a single machine learning model to underfit or overfit, a linearly weighted NDVI prediction integrated model is constructed, and the weight of each model is determined by a genetic algorithm with enhanced elite retention, which reduces the adverse effects of model uncertainty.

Description

NDVI prediction integrated optimization method and system based on machine learning

Technical Field

The invention relates to remote sensing image or spectral data processing technology, and in particular to an NDVI prediction integrated optimization method and system based on machine learning.

Background

The southwest river basin of China spans several climate zones, including cold, warm, subtropical and tropical zones. Temperature and precipitation gradually decrease from south to north, the terrain varies greatly from west to east, vegetation types are abundant, and the degree to which climate factors influence vegetation differs from area to area. Vegetation is an important component of the terrestrial ecosystem and plays an important role in the energy exchange, biogeochemical cycle and hydrologic cycle processes of the land surface. Understanding vegetation coverage is therefore of great significance to regional sustainable development and ecological environment protection.

The vegetation index is a simple, effective and empirical measure of surface vegetation status. It can be used to detect vegetation growth status and vegetation coverage and to eliminate part of the radiation error. NDVI (Normalized Difference Vegetation Index) is an important tool for studying changes in vegetation coverage; it is currently the most widely used vegetation index and reflects well the dynamic change of vegetation on the underlying surface of a region.

NDVI prediction is mainly studied with multi-factor models, which include multi-factor parametric models and multi-factor non-parametric models. A multi-factor parametric model describes the statistical relationship between the dependent variable and two or more independent variables. Such a model has many independent variables, and some of them (such as climate factors) suffer from large spatial quantification errors, so the model has certain limitations and uncertainty. A multi-factor non-parametric model is a class of functions defined directly from the characteristics of the remote sensing data, in which data with different structural characteristics can be associated with vegetation indexes in linear or nonlinear form through specific rules. Compared with parametric models, non-parametric models based on machine learning estimate and predict vegetation indexes better, and their accuracy improves as the number of input data samples increases. However, the internal mechanism of a machine-learning-based non-parametric model is complex and cannot be expressed intuitively, and such a model is prone to underfitting or overfitting. Moreover, the prediction results obtained with different machine learning algorithms differ considerably, so the results carry large uncertainty. In addition, in areas with large variations in terrain and climate, such as the southwest river basin, existing machine learning methods predict vegetation indexes poorly.

Therefore, research and innovation are needed to improve or optimize machine-learning-based non-parametric models so that the vegetation indexes of the southwest river basin can be accurately estimated and predicted, thereby solving the above problems in the prior art.

Disclosure of Invention

The invention aims to provide an NDVI prediction integrated optimization method and system based on machine learning, so as to solve the above problems in the prior art.

According to one aspect of the application, the machine-learning-based NDVI prediction integrated optimization method comprises the following steps:

S1, acquiring research data of a research area, then screening and preprocessing the data to form an input variable data set; the research data includes at least an NDVI data set and a climate variable data set;

S2, selecting at least two types of NDVI prediction models, and constructing an NDVI prediction model set comprising at least two NDVI prediction models;

S3, constructing an NDVI prediction integrated optimization model, wherein the model comprises an objective function, a weight matrix and constraint conditions;

S4, solving the NDVI prediction integrated optimization model with a preconfigured algorithm, determining the optimal weights, outputting a prediction result, and calculating and evaluating the prediction accuracy.

According to one aspect of the application, the step S1 further includes:

Step S11, determining the extent of the research area, dividing the research area into at least two sub-basins according to basin division standards, and acquiring time series data in a preset format as research data; the research data includes at least an NDVI data set and a climate variable data set;

Step S12, using a variable optimization method to screen out the climate variables and time-lag influencing factors that are strongly related to the NDVI of each sub-basin of the basin, and constructing an input variable set; the input variables include: shortwave radiation, wind speed, precipitation, temperature, barometric pressure, dew point temperature, and vapor pressure deficit (VPD).

According to one aspect of the application, the step S2 further includes:

Step S21, screening at least two types of machine learning models, wherein each type comprises at least one machine learning model; the machine learning models include at least: a linear regression model, a support vector machine model, a KNN model, a random forest model and an extreme gradient boosting decision tree model;

Step S22, establishing an NDVI prediction model for each machine learning model and obtaining the prediction result of each model.

According to one aspect of the application, the step S3 further includes:

Step S31, constructing an NDVI prediction integrated optimization model based on linear weighting, wherein the model comprises an objective function, a weight matrix and constraint conditions;

Step S32, determining the weight of each model by a genetic algorithm with enhanced elite retention, so that the root mean square error between the predicted values of the integrated model and the observed values is minimized.

According to one aspect of the application, the step S4 further includes:

Step S41, constructing a solving algorithm set, wherein the solving algorithms include a genetic algorithm;

Step S42, solving the NDVI prediction integrated optimization model with a solving algorithm to obtain the weight of each sub-model, and analyzing the performance and adaptability of the predicted NDVI in the verification period using the correlation coefficient, the relative deviation and the root mean square error.
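By way of illustration only, the three evaluation indexes named in step S42 could be computed as in the following sketch. The function names, the sample values, and the particular definition of relative deviation (total prediction error divided by total observed value) are assumptions made for this example; the source does not give the formulas.

```python
import math

def correlation(obs, pred):
    """Pearson correlation coefficient between observations and predictions."""
    n = len(obs)
    mean_o, mean_p = sum(obs) / n, sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    return cov / (std_o * std_p)

def relative_deviation(obs, pred):
    """Assumed definition: summed prediction error over summed observations."""
    return sum(p - o for o, p in zip(obs, pred)) / sum(obs)

def rmse(obs, pred):
    """Root mean square error between observations and predictions."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

# Hypothetical verification-period NDVI values.
observed = [0.30, 0.45, 0.60, 0.55, 0.40]
predicted = [0.32, 0.43, 0.58, 0.57, 0.41]
print(correlation(observed, predicted),
      relative_deviation(observed, predicted),
      rmse(observed, predicted))
```

A correlation coefficient close to 1, a relative deviation close to 0 and a small root mean square error would together indicate good agreement in the verification period.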

According to one aspect of the application, the step S12 further includes:

Step S12a, using a generalized linear regression method to establish, for each sub-basin, a linear relation model between the NDVI and the climate variables, calculating the coefficient and significance level of each variable, and selecting variables whose significance level is smaller than a preset value as candidate variables;

Step S12b, using a stepwise regression method to perform forward, backward or bidirectional stepwise variable selection for each sub-basin, and selecting the optimal variable combination as candidate variables according to the AIC or BIC index;

Step S12c, for each sub-basin, using the AIC or BIC index to compare the candidate variable combinations obtained by the generalized linear regression method and the stepwise regression method, and selecting the variable combination with the smallest index value as the input variable set.

According to one aspect of the present application, the step S22 further includes:

Step S22a, using a linear regression model to establish, for each sub-basin, a linear regression model between the NDVI and the input variable set, and estimating the parameters by the least squares method or the ridge regression method, so as to obtain the prediction result of the linear regression model;

Step S22b, using a support vector machine model to establish, for each sub-basin, a support vector regression model between the NDVI and the input variable set, and performing nonlinear mapping with a kernel function, so as to obtain the prediction result of the support vector machine model;

Step S22c, using a KNN model to establish, for each sub-basin, a K-nearest-neighbor regression model between the NDVI and the input variable set, and calculating similarity using the Euclidean distance or the Manhattan distance, so as to obtain the prediction result of the KNN model;

Step S22d, using a random forest model to establish, for each sub-basin, a random forest regression model between the NDVI and the input variable set, and randomizing the features and samples using bootstrap sampling and random feature selection, so as to obtain the prediction result of the random forest model;

Step S22e, using an extreme gradient boosting decision tree model to establish, for each sub-basin, an extreme gradient boosting decision tree regression model between the NDVI and the input variable set, and performing model optimization and overfitting control using the gradient boosting method and regularization terms, so as to obtain the prediction result of the extreme gradient boosting decision tree model;

Step S22f, for each sub-basin, ranking the machine learning models in descending order of fitting effect, selecting the top N models, obtaining a weighted result, and thereby obtaining and storing the best machine learning model corresponding to each sub-basin. In the subsequent steps, the NDVI prediction integrated optimization model of the whole basin is built from the machine learning models of the sub-basins.

Alternatively, step S22f may include: constructing class identifiers for the sub-basins according to the NDVI condition of vegetation in the basin, and assigning a class identifier to each sub-basin to form at least two classes of sub-basins;

based on the prediction results of the machine learning models (i.e. the prediction results of steps S22a to S22e), collecting the machine learning models corresponding to each class of sub-basins, counting their frequencies and sorting them in descending order, and establishing a candidate set of machine learning models for each class of sub-basins;

for each class of sub-basins, sequentially checking whether the prediction results of any two sub-basins before or after the same mutation point meet a preset threshold; if so, clustering the sub-basins of that class; otherwise, selecting the machine learning models from the candidate set in descending order of frequency, replacing the machine learning model of the current sub-basin with each in turn, re-running the simulation to produce a prediction result, and checking whether the prediction result meets the preset threshold, until all machine learning models in the candidate set have been tried;

and constructing the NDVI prediction integrated optimization model based on the clustered sub-basins.

According to an aspect of the application, the step S32 further includes:

Step S32a, initializing a group of random weight vectors as the initial population, and calculating the fitness function value corresponding to each weight vector, namely the root mean square error (RMSE) between the predicted values of the integrated model and the observed values;

Step S32b, selecting a predetermined proportion of the best weight vectors as elite individuals, and copying the elite individuals directly into the next-generation population;

Step S32c, selecting two weight vectors from the current population as parent individuals using the roulette method or the tournament method, generating two new weight vectors as child individuals using a crossover method or a mutation method, and calculating the fitness function values of the new weight vectors;

Step S32d, repeating step S32c until a preset number of child individuals have been generated, and combining the child individuals with the elite individuals to form the next-generation population;

Step S32e, judging whether a termination condition is reached, wherein the termination condition comprises a maximum number of iterations, a minimum fitness function value or a minimum weight change amplitude; if so, outputting the optimal weight vector as the final result; if not, returning to step S32b and continuing the optimization.
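Steps S32a to S32e can be illustrated with the following minimal sketch of an elite-retaining genetic algorithm for the ensemble weights. Several details are assumptions made for this example and are not stated in the source: the constraint that weights are non-negative and sum to 1, tournament selection with three contestants, arithmetic crossover, Gaussian mutation, termination by maximum iteration count only, and all parameter values and sample data.

```python
import math
import random

def rmse(obs, pred):
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def normalize(w):
    """Assumed constraint: clip weights to be non-negative and rescale to sum to 1."""
    w = [max(x, 0.0) for x in w]
    s = sum(w)
    return [x / s for x in w] if s > 0 else [1.0 / len(w)] * len(w)

def fitness(w, model_preds, obs):
    """RMSE of the linearly weighted ensemble; lower is better."""
    combined = [sum(wi * p[t] for wi, p in zip(w, model_preds)) for t in range(len(obs))]
    return rmse(obs, combined)

def tournament(pop, scores, rng, k=3):
    """Pick the fittest of k randomly drawn individuals."""
    idx = rng.sample(range(len(pop)), k)
    return pop[min(idx, key=lambda i: scores[i])]

def elite_ga(model_preds, obs, pop_size=40, elite_frac=0.1, generations=60,
             mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    m = len(model_preds)
    pop = [normalize([rng.random() for _ in range(m)]) for _ in range(pop_size)]  # S32a
    n_elite = max(1, int(elite_frac * pop_size))
    history = []  # best RMSE of each generation
    for _ in range(generations):  # S32e: stop at the maximum iteration count
        scores = [fitness(w, model_preds, obs) for w in pop]
        order = sorted(range(pop_size), key=lambda i: scores[i])
        history.append(scores[order[0]])
        next_pop = [pop[i] for i in order[:n_elite]]  # S32b: copy elites unchanged
        while len(next_pop) < pop_size:  # S32c and S32d: breed until the population is full
            a, b = tournament(pop, scores, rng), tournament(pop, scores, rng)
            alpha = rng.random()
            for child in ([alpha * x + (1 - alpha) * y for x, y in zip(a, b)],
                          [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]):
                if rng.random() < mutation_rate:
                    child[rng.randrange(m)] += rng.gauss(0.0, 0.1)
                next_pop.append(normalize(child))
        pop = next_pop[:pop_size]
    scores = [fitness(w, model_preds, obs) for w in pop]
    best = min(range(pop_size), key=lambda i: scores[i])
    history.append(scores[best])
    return pop[best], history

# Hypothetical observations and predictions from three sub-models.
obs = [0.30, 0.45, 0.60, 0.55, 0.40, 0.35]
preds = [
    [0.35, 0.50, 0.65, 0.60, 0.45, 0.40],  # biased high
    [0.25, 0.40, 0.55, 0.50, 0.35, 0.30],  # biased low
    [0.32, 0.41, 0.63, 0.52, 0.44, 0.33],  # noisy
]
weights, history = elite_ga(preds, obs)
print(weights, history[-1])
```

Because the elite individuals are carried over unchanged, the best RMSE recorded in `history` can never increase from one generation to the next, which is the practical benefit of elite retention.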

According to an aspect of the application, the step S11 further includes determining whether the data values are valid and whether mutation points exist in the time series data, specifically as follows:

Step S11a, for the NDVI data set, selecting pixels with NDVI values larger than a threshold as effective pixels;

Step S11b, constructing a mutation point detection method set, carrying out time series analysis on the data of each sub-basin, and detecting whether mutation points exist using a mutation point detection method;

Step S11c, for each sub-basin, dividing the data into a plurality of time periods according to the number and positions of the detected mutation points, so that the data in each time period reach a preset autocorrelation coefficient or smoothing coefficient.

According to another aspect of the present application, a machine learning based NDVI predictive integrated optimization system includes:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the processor, and the instructions, when executed by the processor, implement the machine-learning-based NDVI prediction integrated optimization method of any one of the above technical solutions.

To address the tendency of a single machine learning model to underfit or overfit, the present application constructs a linearly weighted NDVI prediction integrated model and determines the weight of each model by a genetic algorithm with enhanced elite retention, which reduces the adverse effects of model uncertainty. The related technical advantages are described in the Detailed Description.

Drawings

Fig. 1 is a flow chart of the present invention.

Fig. 2 is a flow chart of step S1 of the present invention.

Fig. 3 is a flow chart of step S2 of the present invention.

Fig. 4 is a flow chart of step S3 of the present invention.

Fig. 5 is a flow chart of step S4 of the present invention.

Detailed Description

As shown in Fig. 1, an NDVI prediction integrated optimization method based on machine learning is provided, which includes the following steps:

S1, acquiring research data of a research area, then screening and preprocessing the data to form an input variable data set; the research data includes at least an NDVI data set and a climate variable data set; the climate variable data set comprises a current-month climate variable data set and a time-lag (also called time-delay) climate variable data set; time-lag data refers to the climate variable data of the previous month, the previous two months, the previous three months or the previous M months, used for example to capture the effect of the previous month's precipitation on the vegetation index NDVI.
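By way of illustration only, a time-lag climate variable data set of the kind described above could be assembled as follows. The function name `add_lagged_features` and the sample precipitation values are assumptions made for this example.

```python
def add_lagged_features(series, max_lag):
    """For each month t >= max_lag, build the row [x_t, x_(t-1), ..., x_(t-max_lag)]."""
    rows = []
    for t in range(max_lag, len(series)):
        rows.append([series[t - lag] for lag in range(max_lag + 1)])
    return rows

# Hypothetical monthly precipitation series.
precip = [10.0, 20.0, 30.0, 40.0, 50.0]
rows = add_lagged_features(precip, max_lag=2)
for row in rows:
    print(row)  # current month first, then the previous one and two months
```

The first `max_lag` months are dropped because their lagged values are not available; the NDVI series would need to be trimmed in the same way before modeling.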

In this embodiment, the integrated optimization model overcomes the shortcomings of a single model: multiple types of NDVI prediction models, such as a linear regression model, a neural network model and a support vector machine model, are used together and combined by weighting through a weight matrix, which improves prediction accuracy and stability. According to the objective function and constraint conditions, a preconfigured algorithm (such as a genetic algorithm or a particle swarm algorithm) is used to solve the NDVI prediction integrated optimization model, so that the optimal weights are determined and the different models are dynamically adjusted and optimized. In addition, multi-source, multi-temporal, multi-band and multi-scale remote sensing data, including the NDVI data set and the climate variable data set, can be used as input variables, which increases the information content and representativeness of the data. Finally, the effectiveness and reliability of the model are determined by calculating and evaluating prediction accuracy indexes such as the correlation coefficient, root mean square error and mean absolute error. The method enables high-accuracy, high-efficiency and low-cost estimation of vegetation coverage and provides a scientific basis for evaluating vegetation resources, protecting the vegetation ecological environment and formulating vegetation management strategies. It also enables accurate monitoring and analysis of vegetation growth conditions and trends, and provides data support for exploring the relationships between vegetation, climate change and human activities.

As shown in Fig. 2, according to an aspect of the present application, the step S1 further includes:

Step S11, determining the extent of the research area, dividing the research area into at least two sub-basins according to basin division standards, and acquiring time series data in a preset format as research data; the research data includes at least an NDVI data set and a climate variable data set. Alternatively, different sub-basins are divided according to geographic features and ecological conditions, and the corresponding remote sensing data and meteorological data are acquired to provide a basis for subsequent modeling.

Step S12, using a variable optimization method to screen out, from the research data, the climate variables that affect the NDVI of each sub-basin of the southwest basin, including current-month climate variables and time-lag climate variables, and constructing an input variable set; the input variables include: shortwave radiation, wind speed, precipitation, temperature, barometric pressure, dew point temperature, and vapor pressure deficit (VPD). The step S12 further includes:

Step S12a, using a generalized linear regression method to establish, for each sub-basin, a linear relation model between the NDVI and the climate variables, calculating the coefficient and significance level of each variable, and selecting variables whose significance level is smaller than a preset value as candidate variables. This step preliminarily screens out climate variables that are highly correlated with NDVI and stable.

Step S12b, using a stepwise regression method to perform forward, backward or bidirectional stepwise variable selection for each sub-basin, and selecting the optimal variable combination as candidate variables according to the AIC or BIC index. This step further optimizes the input variable set, removes redundant or irrelevant variables, and improves model efficiency.

Step S12c, for each sub-basin, using the AIC or BIC index to compare the candidate variable combinations obtained by the generalized linear regression method and the stepwise regression method, and selecting the variable combination with the smallest index value as the input variable set. This step weighs model complexity against goodness of fit to select the most suitable input variable set.

In this embodiment, the most suitable data, such as current-month climate variables and time-lag factors, are selected according to the characteristics of each sub-basin, which improves the accuracy and applicability of NDVI prediction. The variable optimization method reduces the number and dimension of the input variables, lowers the complexity and computational cost of the model, and improves its efficiency and stability. The AIC or BIC index is used to evaluate model complexity and goodness of fit together and to select the optimal input variable set, which helps avoid overfitting or underfitting.
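The stepwise selection of step S12b can be illustrated with the following minimal sketch of forward selection driven by AIC. It covers only the forward direction and ordinary least squares; the Gaussian AIC form n·ln(RSS/n) + 2k, the helper names and the sample data are assumptions made for this example, and the significance screening of step S12a is not shown.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols_aic(columns, y):
    """Fit y ~ intercept + columns by least squares; return AIC up to a constant."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(XtX, Xty)
    rss = sum((y[i] - sum(bj * xj for bj, xj in zip(beta, X[i]))) ** 2 for i in range(n))
    return n * math.log(max(rss, 1e-12) / n) + 2 * k

def forward_stepwise(candidates, y):
    """Greedily add the variable that lowers AIC the most; stop when none does."""
    selected, best = [], ols_aic([], y)
    while True:
        trials = {name: ols_aic([candidates[s] for s in selected + [name]], y)
                  for name in candidates if name not in selected}
        if not trials:
            break
        name = min(trials, key=trials.get)
        if trials[name] >= best:
            break
        selected.append(name)
        best = trials[name]
    return selected, best

# Hypothetical data: NDVI depends on precipitation and temperature, not on wind.
candidates = {
    "precip": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "temp": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
    "wind": [5.0, 3.0, 6.0, 2.0, 7.0, 1.0, 4.0, 8.0],
}
noise = [0.004, 0.002, -0.003, -0.001, 0.003, -0.004, 0.001, -0.002]
y = [0.1 * p + 0.05 * t + e
     for p, t, e in zip(candidates["precip"], candidates["temp"], noise)]
selected, aic = forward_stepwise(candidates, y)
print(selected, aic)
```

Replacing the penalty `2 * k` with `k * math.log(n)` would give the BIC variant mentioned in the text.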

Step S11a, for the NDVI data set, selecting pixels with an NDVI value greater than 0.1 as effective pixels. In some embodiments, this specifically includes:

pixels with an NDVI value greater than 0.1 are regarded as effective vegetation pixels; the NDVI pixel values of each sub-basin are compared with the 0.1 threshold to determine whether each pixel is an effective vegetation pixel, which improves data accuracy.
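The thresholding of step S11a can be sketched as follows. The 0.1 threshold comes from the embodiment; representing missing pixels as `None`, averaging the effective pixels into a single sub-basin value, and the sample grid are assumptions made for this example.

```python
NDVI_THRESHOLD = 0.1  # threshold stated in the embodiment

def valid_pixel_mean(ndvi_grid, threshold=NDVI_THRESHOLD):
    """Mean NDVI over effective pixels (value > threshold); None marks missing data."""
    valid = [v for row in ndvi_grid for v in row if v is not None and v > threshold]
    return sum(valid) / len(valid) if valid else None

# Hypothetical 2 x 3 NDVI grid for one sub-basin and one month.
grid = [[0.05, 0.40, None],
        [0.60, 0.10, 0.20]]
print(valid_pixel_mean(grid))  # only 0.40, 0.60 and 0.20 exceed the threshold
```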

Step S11b, constructing a mutation point detection method set, carrying out time series analysis on the data of each sub-basin, and detecting whether mutation points exist using a mutation point detection method. This step identifies abnormal or discontinuous change points that may exist in the data and improves the quality and credibility of the data.

In some embodiments, the mutation point detection methods specifically include:

A cumulative-sum-based method, which determines whether mutation points exist by calculating the cumulative sum of the data and comparing it with a threshold. A Bayesian method, which estimates the number and locations of mutation points that may be present in the data using Bayesian inference and model selection. A quantile-regression-based method, which identifies mutation points by fitting a quantile regression model and checking whether the model parameters change between time periods.

Alternatively, the KS (Kolmogorov-Smirnov) method is used: the data are first divided into two sub-samples, the pre-mutation data and the post-mutation data, and the cumulative distribution function (CDF) of each is calculated. Next, the maximum difference D between the two CDFs is calculated using the KS test formula and compared with the critical value D_α corresponding to a given significance level α. If D is greater than D_α, the null hypothesis is rejected and the two sub-samples are considered significantly different, i.e. a mutation point exists; if D is less than or equal to D_α, the null hypothesis cannot be rejected and the two sub-samples are considered not significantly different, i.e. no mutation point exists.
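The KS comparison described above can be illustrated with the following sketch. The statistic D is computed directly from the two empirical CDFs; the critical value uses the common large-sample approximation D_α = sqrt(-ln(α/2)/2) · sqrt((n+m)/(n·m)), which is an assumption of this example (the source does not say how D_α is obtained), as are the function names and the sample series.

```python
import math

def ks_statistic(a, b):
    """Maximum gap D between the empirical CDFs of samples a and b."""
    d = 0.0
    for x in list(a) + list(b):
        fa = sum(1 for v in a if v <= x) / len(a)
        fb = sum(1 for v in b if v <= x) / len(b)
        d = max(d, abs(fa - fb))
    return d

def ks_critical(n, m, alpha=0.05):
    """Large-sample approximation of the critical value D_alpha (assumed here)."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))

def has_mutation_point(series, split, alpha=0.05):
    """Reject the null hypothesis when D exceeds D_alpha for the split at `split`."""
    before, after = series[:split], series[split:]
    return ks_statistic(before, after) > ks_critical(len(before), len(after), alpha)

# Hypothetical NDVI series with a level shift after the eighth value.
shifted = [0.30, 0.31, 0.29, 0.32, 0.30, 0.31, 0.28, 0.30,
           0.50, 0.52, 0.49, 0.51, 0.53, 0.50, 0.48, 0.51]
stable = shifted[:8] + shifted[:8]
print(has_mutation_point(shifted, 8), has_mutation_point(stable, 8))
```

For short series the large-sample approximation is rough, so an exact small-sample table or a library implementation may be preferable in practice.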

In some embodiments, a sliding window method is used, with the following specific steps:

Determining the length and the step size of the window, generally chosen according to the characteristics of the data and the goal of the analysis; the smaller the length, the higher the sensitivity, but the more easily noise is introduced; the smaller the step size, the higher the accuracy, but the greater the amount of calculation.

Placing the window at the starting position of the time series and calculating statistics of the data in the window, such as the mean, variance and extrema.

Moving the window forward by one step, and repeating the previous step until the window reaches the end of the time series.

Plotting the window statistic as a function of time and comparing the curve with the statistic of the whole series; if the statistic shows an obvious jump or fluctuation in some interval, one or more mutation points may exist in that interval.

Regions in which mutation points may be present can be further analyzed as needed, for example by verifying or localizing the mutation points with other methods.
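The sliding window steps above can be sketched as follows, using the window mean as the statistic. Flagging a window whose mean differs from the previous window's mean by more than a fixed threshold is a simplification assumed for this example; the source describes visual comparison against the whole-series statistic. The function names, threshold and sample series are likewise hypothetical.

```python
def sliding_means(series, length, step):
    """Return (start_index, mean) for each window position."""
    return [(s, sum(series[s:s + length]) / length)
            for s in range(0, len(series) - length + 1, step)]

def candidate_jumps(series, length, step, threshold):
    """Window starts where the mean jumps by more than threshold from the previous window."""
    means = sliding_means(series, length, step)
    return [s2 for (s1, m1), (s2, m2) in zip(means, means[1:]) if abs(m2 - m1) > threshold]

# Hypothetical series with a level shift at index 6.
series = [0.3] * 6 + [0.5] * 6
print(sliding_means(series, length=3, step=3))
print(candidate_jumps(series, length=3, step=3, threshold=0.1))
```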

Step S11c, for each sub-basin, dividing the data into a plurality of time periods according to the number and positions of the detected mutation points, so that the data in each time period reach a preset autocorrelation coefficient (which measures the stability of the data) or smoothing coefficient. This step eliminates the influence of mutation points on data analysis and modeling and improves the accuracy and stability of the model.

In some embodiments, the division methods specifically include: a dynamic-programming-based method, which searches for the optimal segmentation scheme with a dynamic programming algorithm so that the data fitting error in each time period is minimized; a minimum-description-length-based method, which uses the minimum description length principle to balance the number of segments against the quality of the segments so that the data in each time period have minimal complexity; and a hierarchical-clustering-based method, which uses a hierarchical clustering algorithm to aggregate similar or adjacent data into one time period.
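The dynamic-programming-based division can be illustrated with the following sketch, which splits a series into a given number of contiguous segments so that the total squared deviation from each segment's mean is minimal. Using the segment mean as the fitted model, fixing the number of segments in advance, and the sample series are assumptions made for this example.

```python
def segment_cost(prefix, prefix_sq, i, j):
    """Sum of squared deviations from the mean for series[i:j]."""
    s = prefix[j] - prefix[i]
    sq = prefix_sq[j] - prefix_sq[i]
    return sq - s * s / (j - i)

def best_segmentation(series, n_segments):
    """Return the sorted breakpoints minimizing total within-segment squared error."""
    n = len(series)
    prefix, prefix_sq = [0.0], [0.0]
    for v in series:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)
    inf = float("inf")
    # cost[k][j]: best error when series[:j] is split into k segments
    cost = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + segment_cost(prefix, prefix_sq, i, j)
                if c < cost[k][j]:
                    cost[k][j], back[k][j] = c, i
    # Walk the back-pointers from the end to recover the breakpoints.
    breaks, j = [], n
    for k in range(n_segments, 1, -1):
        j = back[k][j]
        breaks.append(j)
    return sorted(breaks)

# Hypothetical series with one level shift; one mutation point means two segments.
series = [0.30, 0.31, 0.29, 0.30, 0.50, 0.51, 0.49, 0.50]
print(best_segmentation(series, n_segments=2))
```

Each returned breakpoint is the index at which a new time period starts.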

This embodiment effectively handles the nonlinearity, non-stationarity and non-uniformity that may exist in the time series data and enhances the reliability and representativeness of the data. The mutation point detection methods combine different detection principles and indexes, which improves the ability to identify mutation points and the sensitivity of detection. The way the data are divided and processed is adjusted dynamically according to the data characteristics of different sub-basins and different time periods, which improves the adaptability and flexibility of the model.

As shown in Fig. 3, according to an aspect of the present application, the step S2 further includes:

Step S21, screening at least two types of machine learning models, wherein each type comprises at least one machine learning model; the machine learning models include at least: a linear regression model, a support vector machine model, a KNN model, a random forest model and an extreme gradient boosting decision tree model. This step selects machine learning models of different complexity and performance to suit different feature and data distribution scenarios. In some embodiments, the types of machine learning models include a first algorithm model, a second algorithm model and a third algorithm model. The first algorithm model may be a basic linear algorithm model, the second algorithm model may be a classical machine learning algorithm model, and the third algorithm model may be an advanced machine learning algorithm model. For example, in some embodiments, the basic linear algorithm model includes multiple linear regression (LR); the classical machine learning algorithm models include K-nearest neighbors (KNN), the support vector machine (SVM) and random forest (RF); and the advanced machine learning algorithm model includes the extreme gradient boosting decision tree (XGBoost).

Step S22, establishing an NDVI prediction model for each machine learning model and obtaining the prediction result of each model. This embodiment takes full advantage of the strengths of the multiple types of machine learning models, such as the simplicity and interpretability of the first algorithm model, the generalization ability of the second algorithm model, and the accuracy and flexibility of the third algorithm model. Weights are assigned and integrated optimization is performed according to the prediction results of the different machine learning models, which improves the accuracy and stability of NDVI prediction.

According to one aspect of the present application, the step S22 further includes:

Step S22e, using an extreme gradient boosting decision tree model to establish, for each sub-basin, an extreme gradient boosting decision tree regression model between the NDVI and the input variable set, and performing model optimization and overfitting control using the gradient boosting method and a regularization term, so as to obtain the prediction result of the extreme gradient boosting decision tree model.

Step S22f, for each sub-basin, ranking the machine learning models in descending order of fitting effect, selecting the top N models, obtaining a weighted result, and thereby obtaining and storing the best machine learning model corresponding to each sub-basin. In the subsequent steps, the NDVI prediction integrated optimization model of the whole basin is built from the machine learning models of the sub-basins. N is 1, 2 or 3.
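The ranking part of step S22f can be sketched as follows. Measuring the fitting effect by validation RMSE (lower is better), the function name `top_n_models`, and all score values are assumptions made for this example; the subsequent weighting of the selected models is not shown.

```python
def top_n_models(rmse_by_model, n=2):
    """Rank models by validation RMSE (lower is better) and keep the best n names."""
    ranked = sorted(rmse_by_model.items(), key=lambda kv: kv[1])
    return [name for name, _ in ranked[:n]]

# Hypothetical validation RMSE per model for two sub-basins.
scores = {
    "basin_A": {"LR": 0.061, "SVM": 0.048, "KNN": 0.055, "RF": 0.041, "XGBoost": 0.039},
    "basin_B": {"LR": 0.050, "SVM": 0.052, "KNN": 0.047, "RF": 0.049, "XGBoost": 0.051},
}
best = {basin: top_n_models(s, n=2) for basin, s in scores.items()}
print(best)
```

With these made-up scores the two sub-basins keep different model pairs, which is the situation the per-sub-basin selection is meant to handle.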

In this embodiment, the strengths of the multiple types of machine learning models are used together, such as the simplicity and interpretability of the linear regression model, the generalization and nonlinear fitting capability of the support vector machine model, the speed and simplicity of the KNN model, the accuracy and stability of the random forest model, and the accuracy and interpretability of the extreme gradient boosting decision tree model; the accuracy and stability of NDVI prediction are improved through weight assignment and integrated optimization. The most suitable combination of machine learning models is selected according to the characteristics of each sub-basin to suit different features and data distributions, which improves the accuracy and applicability of NDVI prediction. This embodiment uses the multi-source, multi-temporal and multi-scale remote sensing data and meteorological data contained in the input variable set, which increases the information content and representativeness of the data and improves the reliability and comparability of NDVI prediction.

In a further embodiment, step S22f may be:

constructing a class identifier of each sub-drainage basin according to the NDVI condition of vegetation in the drainage basin, and distributing the class identifier for each sub-drainage basin to form at least two types of sub-drainage basins;

In this embodiment, the research watershed is clustered based on the NDVI condition, so that division into different sub-watersheds according to the watershed division standard is solved, but the sub-watersheds which are similar in prediction result through the machine learning model can be combined, so that complexity and calculation workload of the integrated model are reduced, and the method is convenient to deploy in a scene with limited calculation resources and storage resources. For example, in some scenarios, regions with different heights or regions in different administrative regions may belong to different sub-domains according to the domain division criteria, so as to obtain at least two sub-domains similar at the geographic level, but after being simulated by prediction of the machine learning model, they are found to be similar at the data prediction level, so that clustering or combination can be performed to establish similar sub-domains at the logic level.

In the above embodiment, the mutation points of the time series can also be taken into account, so that sub-basins are clustered at the data prediction level separately for each time period. For example, if a scenario has one mutation point, the prediction process is divided into two segments: in the first segment, sub-basin a may belong to the same class as sub-basins b and c at the data prediction level, while in the second segment, sub-basin a may belong to the same class as sub-basins b and d.

Of course, part of the similarity between the prediction results of sub-basins may be a deviation introduced by the weighting. In that case, the machine learning models can be re-selected from the candidate set and their prediction results recombined, after which it is judged again whether the prediction results are sufficiently similar. In this way, relatively accurate prediction results can be obtained for each class of sub-basin using as few machine learning models as possible.

In another embodiment of the present application, step S22a further includes performing hierarchical optimization on the linear regression model established between the input variable set and NDVI, specifically as follows:

retrieving the normalized input variable data, constructing an original input feature matrix, obtaining the feature values, and mapping all feature values to the same order of magnitude to facilitate construction of the polynomial mapping;

computing second-order or third-order products of the feature values in the original input feature matrix to form a new polynomial feature matrix;

concatenating the original feature matrix and the polynomial feature matrix along the feature dimension to obtain a feature representation matrix;

using the new feature representation matrix to train and build a linear regression model between NDVI and the input variables, and predicting the target variable.

In this embodiment, the polynomial features introduce nonlinear factors, but the regression model itself remains a linear model. Polynomial features are selected according to the results for the prediction target, and redundant and non-key features are removed to prevent overfitting.
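The feature construction described above can be sketched as follows. This is a minimal illustration only: the function name `polynomial_features`, the parameter `max_degree` and the sample values are assumptions made for the example and do not appear in the application. The sketch appends all second-order (and optionally third-order) products to the original, already normalized features; the later feature selection step is not shown.

```python
from itertools import combinations_with_replacement


def polynomial_features(rows, max_degree=2):
    """Return rows extended with products of 2..max_degree feature values.

    rows: list of feature rows (already normalized to a common scale).
    The original features come first, followed by the product terms,
    i.e. the two matrices are concatenated along the feature dimension.
    """
    expanded = []
    for row in rows:
        new_row = list(row)  # original feature matrix part
        for degree in range(2, max_degree + 1):
            # every multiset of `degree` column indices gives one product term
            for idx in combinations_with_replacement(range(len(row)), degree):
                product = 1.0
                for i in idx:
                    product *= row[i]
                new_row.append(product)
        expanded.append(new_row)
    return expanded


# Hypothetical normalized inputs: two samples, two climate variables.
X = [[0.2, 0.5], [1.0, 0.0]]
X_poly = polynomial_features(X, max_degree=2)
# Each row now holds x1, x2, x1*x1, x1*x2, x2*x2.
```

A linear regression fitted on `X_poly` is still linear in its coefficients, which is the point made in the paragraph above.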

In this embodiment, the raw feature data include at least climate data, that is, time series statistics such as temperature, humidity and precipitation. They may also include remote sensing image data, that is, spectral band data reflecting the surface vegetation cover. In some embodiments, other relevant regional data (e.g., land use data) may also be used.

For the climate statistics, the time series values of the observed climate variables are used directly as feature values. For remote sensing image data, the images are first preprocessed and feature values are then extracted, mainly by one of two methods: (1) using the pixel values of all bands directly as feature values to form multidimensional spectral features; or (2) extracting descriptive image features, such as anisotropy, edges and texture, and using the resulting feature vectors as feature values.

As shown in fig. 4, according to an aspect of the present application, the step S3 further includes:

In this step, linear weighting is used to combine the NDVI prediction models so that their strengths and weaknesses offset one another, improving the accuracy and stability of the integrated model. The genetic algorithm with enhanced elite retention searches for the optimal weights efficiently, improving the efficiency and flexibility of the integrated model.

According to an aspect of the application, the step S32 is further:

In this embodiment, the genetic algorithm with enhanced elite retention searches for the optimal weights efficiently, improving the prediction accuracy and stability of the integrated model. Crossover and mutation operators adjust the weight vectors locally or globally while maintaining population diversity, improving the efficiency and flexibility of the integrated model. The termination condition automatically determines, against a preset criterion, whether the algorithm has reached the optimal solution, improving the reliability and comparability of the integrated model.

As shown in fig. 5, according to an aspect of the present application, step S4 further includes:

S41, constructing a set of solving algorithms, the solving algorithms including optimization algorithms such as a genetic algorithm;

S42, solving for the weight of each sub-model in the NDVI prediction integrated optimization model using a solving algorithm, and analyzing the performance and adaptability of the NDVI prediction in the verification period using the correlation coefficient, relative deviation and root mean square error.

The correlation coefficient (CC) measures the degree of linear correlation between each model's predicted values and the observed values; the closer CC is to 1, the higher the correlation and the better the prediction. The relative deviation (BIAS) measures the average deviation between each model's predicted values and the observed values; the closer BIAS is to 0, the smaller the deviation and the better the prediction. The root mean square error (RMSE) measures the root mean square error between each model's predicted values and the observed values; the closer RMSE is to 0, the smaller the error and the better the prediction. The results of these indices are compared comprehensively, the performance differences of each model across sub-basins and time periods are analyzed, and the advantages and limitations of the integrated optimization model relative to a single machine learning model or a simple-average integrated model are summarized.
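The three evaluation indices can be computed as in the sketch below. The application does not give formulas in this passage, so the definitions are assumptions: CC is taken as the Pearson correlation coefficient, BIAS as the sum of prediction errors divided by the sum of observations, and RMSE in its usual form. The function names and sample values are likewise illustrative.

```python
import math


def correlation_coefficient(pred, obs):
    """Pearson correlation between predictions and observations."""
    n = len(obs)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_p * var_o)


def relative_bias(pred, obs):
    """Assumed definition: total error relative to total observed value."""
    return sum(p - o for p, o in zip(pred, obs)) / sum(obs)


def rmse(pred, obs):
    """Root mean square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


# Hypothetical monthly NDVI values for one sub-basin in the verification period.
observed = [0.30, 0.45, 0.60, 0.52]
predicted = [0.32, 0.43, 0.63, 0.50]

scores = {
    "CC": correlation_coefficient(predicted, observed),
    "BIAS": relative_bias(predicted, observed),
    "RMSE": rmse(predicted, observed),
}
```

Computing the same dictionary for every sub-model and for the integrated model gives the comparison table that the analysis above calls for.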

In a further embodiment, the base data for NDVI prediction are preprocessed as follows:

Step S1: data acquisition.

S11: NDVI data collection. The data come from the MODIS Level-3 product MOD13C2 and are used at a spatial resolution of 0.1° × 0.1° and a monthly temporal resolution. MOD13C2 is a vegetation index product derived from MODIS sensor observations: it is a monthly global composite on a 0.05° climate modeling grid that provides NDVI and EVI together with per-pixel quality information. These data are useful for studying ecosystem processes such as vegetation growth and carbon cycling. The MOD13C2 dataset can be used to analyze global vegetation growth trends and to monitor environmental changes such as drought and grassland degradation.

S12: meteorological data collection. The meteorological data involved (air temperature, precipitation, wind speed, radiation, air pressure, potential evapotranspiration and actual evapotranspiration) are all derived from the ERA5-Land reanalysis dataset. The spatial resolution is 0.1° × 0.1°, and the temporal resolution is monthly.

Step S2: data preprocessing. The following preprocessing is performed on the data in this example:

S21: NDVI data preprocessing. Quality control is applied to the NDVI data, values less than 0.1 are removed, and the NDVI values within the study area are extracted.

S22: meteorological data preprocessing. Units are converted: precipitation from m to mm, temperature from K to °C, and air pressure from Pa to hPa. The two orthogonal wind speed components are combined into a wind speed by vector synthesis. The vapor pressure deficit (VPD) is calculated from the temperature, air pressure and dew point temperature; the grid point data are converted into raster data; and finally all meteorological data within the study area are extracted.
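The unit conversions and the VPD calculation in S22 can be sketched as follows. The function names are illustrative, and the saturation vapor pressure formula (a Tetens-type expression in kPa) is an assumption, since the application does not state which formula is used. The application lists air pressure among the VPD inputs; this simplified sketch uses only air temperature and dew point temperature.

```python
import math


def kelvin_to_celsius(t_k):
    return t_k - 273.15


def metres_to_millimetres(depth_m):
    return depth_m * 1000.0


def pascal_to_hectopascal(p_pa):
    return p_pa / 100.0


def wind_speed(u, v):
    """Magnitude of the wind vector from its two orthogonal components."""
    return math.hypot(u, v)


def saturation_vapor_pressure_kpa(t_c):
    # Assumed Tetens-type formula; temperature in degrees Celsius, result in kPa.
    return 0.6108 * math.exp(17.27 * t_c / (t_c + 237.3))


def vapor_pressure_deficit_kpa(t_k, dewpoint_k):
    """VPD = saturation pressure at air temperature minus actual vapor pressure.

    The actual vapor pressure is approximated by the saturation pressure
    at the dew point temperature.
    """
    es = saturation_vapor_pressure_kpa(kelvin_to_celsius(t_k))
    ea = saturation_vapor_pressure_kpa(kelvin_to_celsius(dewpoint_k))
    return max(es - ea, 0.0)


# Hypothetical monthly means for one grid cell.
vpd = vapor_pressure_deficit_kpa(t_k=293.15, dewpoint_k=283.15)
speed = wind_speed(3.0, 4.0)
```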

In a further embodiment, a spatio-temporal statistical model is introduced. Techniques such as a Poisson regression model and a spatial autoregressive model are used to account for the influence of spatially heterogeneous factors, such as extreme events, seasonality and snowfall changes, on grassland change, which improves the goodness of fit and explanatory power of the model.

S35, constructing a Poisson regression model. The NDVI value of each pixel is assumed to follow a Poisson distribution whose mean is a nonlinear function of the meteorological variables (precipitation, air temperature and VPD), extreme events (drought and flood), seasonality (sine and cosine functions) and snowfall changes (snow depth and snow water equivalent). The model parameters are estimated by maximum likelihood, and a fitted value (NDVI_f) is calculated for each pixel.

Constructing a spatial autoregressive model. Taking the spatial correlation between pixels into account, a spatial lag term and a spatial error term are introduced to construct a spatial autoregressive model with NDVI_f as the dependent variable and the meteorological variables, extreme events, seasonality and snowfall changes as independent variables. The model parameters are estimated by maximum likelihood, and a predicted value (NDVI_p) is calculated for each pixel.

The merits of the spatio-temporal statistical model and the multiple regression model of the above embodiment are compared using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). The goodness of fit and predictive ability of the spatio-temporal statistical model are evaluated using indices such as the correlation coefficient (CC), relative deviation (BIAS) and root mean square error (RMSE).
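A comparison by AIC and BIC can be sketched as below, using the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of estimated parameters, n the number of observations and ln L the maximized log-likelihood. The model names, log-likelihood values and parameter counts are made-up placeholders for illustration, not results from the application.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Placeholder values: (maximized log-likelihood, number of parameters).
candidates = {
    "multiple_regression": (-120.0, 4),
    "spatio_temporal": (-105.0, 9),
}
n_obs = 200  # assumed number of observations used to fit both models

aic_scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
bic_scores = {name: bic(ll, k, n_obs) for name, (ll, k) in candidates.items()}

best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
```

Both criteria penalize extra parameters, BIC more strongly once n exceeds about 7, so a more complex model is preferred only if its likelihood gain outweighs the penalty.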

It should be noted that step S34 and step S35 may be processed in parallel with step S31 to step S34.

In a further embodiment, the following problem of the above embodiments is addressed: a multiple regression model is used to simulate the NDVI values of climate-affected vegetation, but uncertainty in the model structure, the model parameters and the meteorological data may prevent the hydrologic model from producing accurate quantitative results. For example, multiple regression models may suffer from multicollinearity, heteroscedasticity and nonlinearity, which require appropriate testing and correction; meteorological data may contain observation errors, spatial interpolation errors and unit conversion errors, which require appropriate quality control and calibration.

The method also comprises the following steps:

S3a, constructing a Bayesian multiple regression model. The NDVI value of each pixel is assumed to follow a normal distribution whose mean is a linear function of the meteorological variables (precipitation, air temperature and VPD) and whose variance is an unknown parameter. Using Bayesian methods with a given prior distribution, the model parameters are inferred with an MCMC algorithm, and the posterior mean (NDVI_b) and posterior standard deviation (NDVI_s) of each pixel are calculated.

A posterior predictive check is used to test the goodness of fit and predictive ability of the Bayesian multiple regression model, and indices such as the mean absolute error (MAE), root mean square error (RMSE) and deviance information criterion (DIC) are calculated. The corrected prediction model is compared with the prediction model of the above embodiment.

In a further embodiment, the method may further include:

S3i, constructing a random forest model. Using a machine learning method, more climate variables, such as precipitation, air temperature, VPD, nitrogen deposition and soil moisture, are selected from the meteorological data as independent variables, and a random forest model is built with the NDVI data as the dependent variable. Training and testing are performed using the bootstrap method, and a predicted value (NDVI_r) is calculated for each pixel. A variable importance measure is used to evaluate the degree of influence of each climate variable in the random forest model on vegetation change and to determine the sensitive and dominant variables. The goodness of fit and predictive ability of the random forest model are evaluated using indices such as the correlation coefficient (CC), relative deviation (BIAS) and root mean square error (RMSE).

at least one processor; and

a memory communicatively coupled to at least one of the processors; wherein,

In one embodiment of the present application, the above procedure may be further simplified as:

preprocessing all data to obtain a long-sequence monthly NDVI data set and climate variable data, the latter comprising short-wave radiation, wind speed, precipitation, near-surface temperature, near-surface air pressure and dew point temperature; considering the time-lag effect of the factors influencing the NDVI vegetation index, selecting the climate variables related to the NDVI of each sub-basin using three variable selection methods, namely a generalized linear regression algorithm, stepwise regression and AIC, and constructing an input variable set;

constructing NDVI prediction models based on linear regression (LR), support vector machine (SVM), K-nearest neighbor (KNN), random forest (RF) and extreme gradient boosting decision tree (XGBoost), and inputting the input variable set, with the monthly NDVI data as the dependent variable, into each of the five machine learning models to obtain NDVI prediction results for the southwest river basin;

constructing an NDVI prediction integrated optimization model based on linear weighting, determining the weight of each model using the genetic algorithm with enhanced elite retention (SEGA), and inputting the prediction results of each model and the measured values into the integrated model to obtain NDVI prediction results based on multi-model integrated optimization; and comparing the NDVI prediction performance of the models in terms of CC, BIAS and RMSE.
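The linear-weighting step can be illustrated with a small elitist genetic algorithm. This is a sketch under stated assumptions, not the SEGA procedure of the application: the constraints (non-negative weights summing to 1), the arithmetic crossover, the Gaussian mutation, the fixed generation count and all names and sample values are choices made for the example. Elite retention is modeled by letting parents and offspring compete together, so the best weight vector found is never lost. The fitness is the RMSE between the integrated prediction and the observations.

```python
import math
import random


def rmse(pred, obs):
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def ensemble(weights, model_preds):
    """Linear weighting; model_preds[m][t] is model m's prediction at time t."""
    steps = range(len(model_preds[0]))
    return [sum(w * preds[t] for w, preds in zip(weights, model_preds))
            for t in steps]


def normalize(weights):
    """Enforce the assumed constraints: weights >= 0 and sum(weights) == 1."""
    clipped = [max(w, 0.0) for w in weights]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in clipped]


def elitist_ga(model_preds, obs, pop_size=30, generations=100, sigma=0.1, seed=0):
    rng = random.Random(seed)
    n_models = len(model_preds)

    def fitness(w):
        return rmse(ensemble(w, model_preds), obs)

    # Start from each single model (one-hot weights) plus random vectors.
    population = [[1.0 if i == m else 0.0 for i in range(n_models)]
                  for m in range(n_models)]
    while len(population) < pop_size:
        population.append(normalize([rng.random() for _ in range(n_models)]))

    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(population, 2)
            alpha = rng.random()
            # arithmetic crossover followed by Gaussian mutation
            child = [alpha * x + (1 - alpha) * y + rng.gauss(0.0, sigma)
                     for x, y in zip(a, b)]
            children.append(normalize(child))
        # elite retention: parents and children are merged and the best kept
        population = sorted(population + children, key=fitness)[:pop_size]

    best = population[0]
    return best, fitness(best)


# Made-up example: three sub-model predictions against observed NDVI.
observed = [0.30, 0.42, 0.55, 0.61, 0.48]
model_predictions = [
    [o + 0.06 for o in observed],    # model that overestimates
    [o - 0.02 for o in observed],    # model that underestimates
    [0.35, 0.40, 0.50, 0.66, 0.45],  # model with mixed errors
]
best_weights, best_rmse = elitist_ga(model_predictions, observed)
```

Because the initial population contains every single model and the best individuals always survive, the integrated model in this sketch can never score worse on the fitting data than the best individual sub-model.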

In this application, adding a time-lagged climate variable data set helps the model better capture the trend of vegetation growth. For example, the effect of precipitation on vegetation growth tends to lag: ample precipitation can make vegetation grow vigorously, but excessive precipitation can also cause vegetation to die. If only the precipitation data of the current month are used to predict NDVI, the model is easily affected by short-term precipitation and the prediction becomes inaccurate. If time-lagged precipitation data are used, the model can better capture the influence of precipitation on vegetation growth, which improves prediction accuracy. Time-lagged data also reduce the risk of overfitting and improve the robustness of the model. For example, the effect of air temperature on vegetation growth is cumulative, and both the current-month air temperature and the time-lagged air temperature affect vegetation growth. If only the current-month air temperature is used to predict NDVI, the model is easily affected by short-term climate variation, which leads to overfitting; if time-lagged air temperature data are used, the model can better learn how air temperature influences vegetation growth, which reduces the risk of overfitting. Likewise, a short period of heavy rain strongly affects the current-month precipitation data and makes the prediction inaccurate, whereas with time-lagged precipitation data the model can resist the influence of short-term precipitation, which improves its robustness.
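Building time-lagged inputs from monthly series can be sketched as follows. The function name `add_lagged_features`, the variable names, the maximum lag of two months and the sample values are all assumptions for illustration; the application does not specify the lag length in this passage.

```python
def add_lagged_features(series, max_lag):
    """Build one feature row per month from current and lagged values.

    series: dict mapping a variable name to its list of monthly values
            (all lists have the same length).
    Returns a list of dicts; the first row corresponds to month index
    max_lag, because earlier months lack a full lag history.
    """
    n_months = len(next(iter(series.values())))
    rows = []
    for t in range(max_lag, n_months):
        row = {}
        for name, values in series.items():
            for lag in range(max_lag + 1):
                # lag 0 is the current month, lag 1 the previous month, ...
                row[f"{name}_lag{lag}"] = values[t - lag]
        rows.append(row)
    return rows


# Hypothetical five-month record for one sub-basin.
climate = {
    "precip_mm": [10.0, 30.0, 80.0, 20.0, 5.0],
    "temp_c": [2.0, 6.0, 12.0, 17.0, 21.0],
}
rows = add_lagged_features(climate, max_lag=2)
```

Each row can then be paired with the NDVI value of the same month, so that a single heavy-rain month appears alongside the preceding months rather than alone.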

The preferred embodiments of the present invention have been described in detail above, but the present invention is not limited to the specific details of the above embodiments, and various equivalent changes can be made to the technical solution of the present invention within the scope of the technical concept of the present invention, and all the equivalent changes belong to the protection scope of the present invention.

Claims

1. The NDVI prediction integration optimization method based on machine learning is characterized by comprising the following steps of:

S2, selecting at least two types of machine learning models, and constructing an NDVI prediction model set comprising at least two NDVI prediction models;

S4, solving the NDVI prediction integrated optimization model by adopting a preconfigured algorithm, determining optimal weight, outputting a prediction result, and calculating and evaluating prediction precision;

the step S1 further includes:

Step S12, using a variable optimization method to screen out the climate variables and time-lagged influencing factors that are strongly related to the NDVI of each sub-basin of the basin, and constructing an input variable set; the input variables include: short-wave radiation, wind speed, precipitation, temperature, air pressure, dew point temperature and vapor pressure deficit (VPD);

the step S12 further includes:

step S12c, comparing candidate variable combinations obtained by a generalized linear regression method and a stepwise regression method for each sub-drainage basin by using AIC indexes or BIC indexes, and selecting the variable combination with the minimum index value as an input variable set;

the step S2 further includes:

Step S22, respectively establishing an NDVI prediction model aiming at each machine learning model and obtaining a prediction result of each model;

the step S22 is further:

step S22f, for each sub-basin, ranking the machine learning models in descending order of fitting effect, taking the N best-performing machine learning models, obtaining a weighted result, and obtaining and storing the best machine learning model corresponding to each sub-basin, N being a natural number.

2. The machine learning based NDVI prediction integration optimization method of claim 1, wherein step S3 further comprises:

3. The machine learning based NDVI prediction integration optimization method of claim 2, wherein step S4 further comprises:

4. The machine learning based NDVI prediction integration optimization method of claim 3, wherein the step S32 is further to:

step S32a, initializing a group of random weight vectors as an initial population, and calculating a fitness function value corresponding to each weight vector, namely, a Root Mean Square Error (RMSE) between an integrated model predicted value and an observed value;

5. The machine learning based NDVI prediction integration optimization method of claim 4, wherein step S11 further includes determining whether the data value is valid and determining whether the time series data has a mutation point, specifically as follows:

6. An NDVI predictive integrated optimization system based on machine learning, comprising:

At least one processor; and

a memory communicatively coupled to at least one of the processors; wherein,

the memory stores instructions executable by the processor for execution by the processor to implement the machine learning based NDVI prediction integration optimization method of any one of claims 1-5.