CN114840506A - Data dimension reduction method and device, computer equipment and storage medium

Info

Publication number
CN114840506A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210363550.4A
Other languages
Chinese (zh)
Inventor
乔颖
鲁宗相
孙书鑫
王楠
袁帅
程艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Original Assignee
Tsinghua University
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Application filed by Tsinghua University and Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Priority to CN202210363550.4A
Publication of CN114840506A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The application relates to a data dimension reduction method, a data dimension reduction device, a computer device, a storage medium and a computer program product. The method comprises: acquiring multiple influence factors of wind power and the corresponding influence factor data; performing multiple collinearity diagnosis on the influence factors according to the influence factor data of a target season; if multiple collinearity exists among the influence factors, calculating a ridge regression coefficient of each influence factor according to the influence factor data of the target season, preset influence factor observation data and ridge parameters; and screening the influence factors according to the correspondence between the ridge regression coefficients and the ridge parameters, eliminating the influence factors whose ridge regression coefficients do not satisfy a preset ridge regression coefficient change condition, to obtain a plurality of target influence factors. With this method, the influence factors can be screened by their ridge regression coefficients, reducing data redundancy while preserving the application value of the data.

Description

Data dimension reduction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of power system technologies, and in particular, to a data dimension reduction method, apparatus, computer device, storage medium, and computer program product.
Background
With rising demand for low-carbon electricity, renewable energy, chiefly wind power and photovoltaics, is gradually replacing fossil energy and accounts for a growing share of power generation. Wind power is the most widely deployed new energy source, but its typical randomness and unpredictability make it harder to draw up a reasonable power planning scheme. How to estimate the generation capacity of new energy as accurately as possible under unpredictable natural conditions is an urgent problem to be solved.
The massive data available in the current power grid can be used to analyze and mine the characteristics of wind power generation. However, linear or nonlinear relationships often exist among the different kinds of influence factors collected by a wind measuring tower, so some influence factors are redundant. Redundant data not only increases modeling complexity but can also make the related models inaccurate. To better exploit the value of numerical weather forecast data, the data redundancy problem needs to be studied and an effective method adopted to screen the data and reduce its dimension, thereby reducing redundancy and improving the application value of the data.
Dimension reduction methods in the conventional art can be divided into linear and nonlinear methods. Representative linear algorithms include Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Linear methods embed the original data set into a linear structure; the model calculation is simple and works well on data that are linear and Gaussian distributed, but the dimension reduction effect deteriorates markedly when the actual data have an obvious nonlinear structure. Among common nonlinear methods, the kernel function, which is the most critical element of Kernel Principal Component Analysis (KPCA), is difficult to select and at present can only be chosen by experience. The t-distributed stochastic neighbor embedding (t-SNE) algorithm has high computational complexity and can only reduce data to 2 or 3 dimensions; in wind power prediction the data points become mixed together and cannot be distinguished, so the dimension reduction effect is poor.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a data dimension reduction method, apparatus, computer device, computer-readable storage medium and computer program product capable of improving the application value of data.
In a first aspect, the present application provides a data dimension reduction method. The method comprises the following steps:
acquiring multiple influence factors of wind power and influence factor data, wherein the influence factor data comprises parameter values of the multiple influence factors at multiple moments in a target season;
performing multiple collinearity diagnosis on the multiple influence factors according to the influence factor data of the target season to obtain multiple collinearity diagnosis parameters of each influence factor;
if it is determined, according to the multiple collinearity diagnosis parameters of the influence factors, that multiple collinearity exists among the multiple influence factors, calculating a ridge regression coefficient of each influence factor according to the influence factor data of the target season, preset influence factor observation data and ridge parameters;
and screening the influence factors according to the correspondence between the ridge regression coefficients and the ridge parameters, and eliminating the influence factors whose ridge regression coefficients do not satisfy a preset ridge regression coefficient change condition, to obtain a plurality of target influence factors.
In one embodiment, after the step of performing multiple collinearity diagnosis on the plurality of influencing factors according to the influencing factor data of the target season to obtain multiple collinearity diagnosis parameters of each influencing factor, the method further includes:
and judging whether multiple collinearity exists among the multiple influencing factors according to the multiple collinearity diagnosis parameter of each influencing factor.
In one embodiment, the multiple collinearity diagnosis parameters of the influence factors include a tolerance;
the performing multiple collinearity diagnosis on the multiple influence factors according to the influence factor data of the target season to obtain multiple collinearity diagnosis parameters of each influence factor comprises:
for each influence factor, performing a least squares calculation on the influence factor data of the target season corresponding to the influence factor to determine a multiple correlation coefficient of the influence factor;
calculating the tolerance according to a preset target value and the square of the multiple correlation coefficient;
the judging whether multiple collinearity exists among the multiple influence factors according to the multiple collinearity diagnosis parameter of each influence factor comprises:
if the tolerance of each influence factor is less than a first preset threshold, determining that multiple collinearity exists among the multiple influence factors.
In one embodiment, the multiple collinearity diagnosis parameters of the influence factors include a tolerance and a variance inflation factor;
the performing multiple collinearity diagnosis on the multiple influence factors according to the influence factor data of the target season to obtain multiple collinearity diagnosis parameters of each influence factor comprises:
for each influence factor, performing a least squares calculation on the influence factor data of the target season corresponding to the influence factor to determine a multiple correlation coefficient of the influence factor;
calculating the tolerance according to a preset target value and the square of the multiple correlation coefficient;
taking the reciprocal of the tolerance as the variance inflation factor of the influence factor;
the judging whether multiple collinearity exists among the multiple influence factors according to the multiple collinearity diagnosis parameter of each influence factor comprises:
judging whether multiple collinearity exists among the multiple influence factors according to the variance inflation factor of each influence factor;
and if the variance inflation factor of each influence factor is larger than a second preset threshold, determining that multiple collinearity exists among the multiple influence factors.
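The tolerance and variance inflation factor steps above can be sketched as follows. This is only an illustrative reading, under several assumptions not stated in the text: each influence factor is regressed by ordinary least squares on all the other factors (with an intercept), the "preset target value" is taken to be 1 so that tolerance = 1 - R², and no column is constant. The function name `tolerance_and_vif` is hypothetical.

```python
import numpy as np

def tolerance_and_vif(X):
    """Return (tolerance, vif), one entry per column (influence factor) of X.

    Assumptions (not from the source): each factor is regressed on all
    other factors plus an intercept, and the preset target value is 1.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    tol = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other factors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot   # squared multiple correlation coefficient
        tol[j] = 1.0 - r2            # tolerance = 1 - R^2
    vif = 1.0 / np.maximum(tol, 1e-12)  # reciprocal of tolerance, guarded against 0
    return tol, vif
```

The resulting values would then be compared with the first and second preset thresholds. Commonly quoted rules of thumb are a tolerance below 0.1 or a variance inflation factor above 10, but the thresholds actually used are not given in this part of the text.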
In one embodiment, the multiple collinearity diagnosis parameters of the influence factors include a condition index;
the performing multiple collinearity diagnosis on the multiple influence factors according to the influence factor data of the target season to obtain multiple collinearity diagnosis parameters of each influence factor comprises:
determining a first data matrix according to the influence factor data of the target season;
determining a first observation matrix according to the first data matrix and the transposed matrix of the first data matrix;
calculating at least one eigenvalue of the first observation matrix;
taking the ratio of the maximum eigenvalue to the minimum eigenvalue as a maximum condition index;
the judging whether multiple collinearity exists among the multiple influence factors according to the multiple collinearity diagnosis parameter of each influence factor comprises:
judging whether multiple collinearity exists among the multiple influence factors according to the maximum condition index;
and if the maximum condition index is larger than or equal to a third preset threshold, determining that multiple collinearity exists among the multiple influence factors.
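A minimal sketch of the condition index calculation follows, assuming that the first observation matrix is the product of the transposed data matrix and the data matrix, and that any standardization of the columns is done by the caller. The function name `max_condition_index` is hypothetical. The sketch follows the plain eigenvalue ratio described above; some textbooks define the condition index as the square root of this ratio.

```python
import numpy as np

def max_condition_index(X):
    """Ratio of the largest to the smallest eigenvalue of X^T X.

    Assumption (not from the source): the first observation matrix is X^T X.
    Returns infinity when X^T X is numerically singular.
    """
    X = np.asarray(X, dtype=float)
    xtx = X.T @ X                      # first observation matrix (assumed form)
    eig = np.linalg.eigvalsh(xtx)      # real eigenvalues in ascending order
    lam_min, lam_max = eig[0], eig[-1]
    if lam_min <= 0.0:
        return float("inf")
    return float(lam_max / lam_min)
```

A very large value indicates that some linear combination of the influence factors is close to zero, that is, near-exact collinearity.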
In one embodiment, the calculating a ridge regression coefficient for each influence factor according to the influence factor data of the target season, preset influence factor observation data, and ridge parameters includes:
determining a first data matrix according to the influence factor data of the target season;
determining a first observation matrix according to the first data matrix and the transposed matrix of the first data matrix;
determining a target matrix according to the first observation matrix, the ridge parameters and a preset standard value;
determining a second observation matrix according to the transposed matrix of the first data matrix and the preset influence factor observation data;
and obtaining ridge regression coefficients of the influence factors according to the target matrix and the second observation matrix.
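In matrix notation, the steps above appear to correspond to the standard ridge estimator. The notation is an assumption made here for illustration: X denotes the first data matrix, y the preset influence factor observation data, k the ridge parameter, and I the identity matrix, which is assumed to be what the "preset standard value" refers to. Under these assumptions X^T X is the first observation matrix, X^T X + kI gives the target matrix, and X^T y is the second observation matrix:

```latex
\hat{\beta}(k) = \left( X^{\top} X + k I \right)^{-1} X^{\top} y, \qquad k \ge 0
```

For k = 0 this reduces to the ordinary least squares estimate, which is unstable when X^T X is nearly singular; a positive k improves the conditioning of the matrix being inverted.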
In one embodiment, the screening the influence factors according to the correspondence between the ridge regression coefficients and the ridge parameters and eliminating the influence factors whose ridge regression coefficients do not satisfy a preset ridge regression coefficient change condition, to obtain a plurality of target influence factors, comprises:
determining, as first influence factors, the influence factors whose ridge regression coefficients satisfy a preset stability condition and have absolute values smaller than a preset stability threshold while the ridge parameter is within a preset range;
determining, as second influence factors, the influence factors whose ridge regression coefficients do not satisfy the preset stability condition but satisfy a preset ridge regression coefficient change trend condition while the ridge parameter is within the preset range;
screening the influence factors and removing the first influence factors and the second influence factors to obtain a plurality of target influence factors.
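The screening rule above can be illustrated with the sketch below. The preset conditions are not specified in this part of the text, so every concrete criterion here is a hypothetical stand-in: "stable" is taken to mean that the coefficient varies by at most `stable_tol` over the ridge parameter range, "small" that its absolute value stays below `small_abs`, and the "change trend condition" that the coefficient shrinks to at most `shrink_ratio` of its initial magnitude as the ridge parameter grows.

```python
import numpy as np

def screen_factors(trace, stable_tol=0.05, small_abs=0.1, shrink_ratio=0.2):
    """Return the column indices of the retained (target) influence factors.

    trace: array of shape (n_k, p); row i holds the ridge regression
    coefficients of all p factors at the i-th ridge parameter, with rows
    ordered by increasing ridge parameter.
    All three thresholds are hypothetical illustration values.
    """
    trace = np.asarray(trace, dtype=float)
    spread = trace.max(axis=0) - trace.min(axis=0)
    stable = spread <= stable_tol                     # assumed stability condition
    small = np.abs(trace).max(axis=0) < small_abs     # stays close to zero
    first = stable & small                            # first influence factors
    shrinks = np.abs(trace[-1]) <= shrink_ratio * np.abs(trace[0])
    second = ~stable & shrinks                        # second influence factors
    return np.flatnonzero(~(first | second))
```

For example, a factor whose coefficient stays near 0.01 over the whole range would be removed as a first influence factor, while one that falls from 1.0 to 0.05 would be removed as a second influence factor.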
In one embodiment, the method further comprises:
inputting the influence factor data of the target season corresponding to the plurality of target influence factors into a pre-trained convolutional neural network to obtain the predicted wind power generation power of the target season.
In a second aspect, the application further provides a data dimension reduction device. The device comprises:
the acquisition module is used for acquiring various influence factors of the wind power and influence factor data of a target season, wherein the influence factor data comprises parameter values of the various influence factors at multiple moments;
the diagnosis module is used for performing multiple collinearity diagnosis on the multiple influence factors according to the influence factor data of the target season to obtain multiple collinearity diagnosis parameters of each influence factor;
the calculation module is used for calculating ridge regression coefficients of the various influence factors according to the influence factor data of the target season, preset influence factor observation data and ridge parameters if the multiple collinearity among the various influence factors is determined according to the multiple collinearity diagnosis parameters of the influence factors;
and the screening module is used for screening all the influence factors according to the corresponding relation between the ridge regression coefficient and the ridge parameters, eliminating the influence factors of which the ridge regression coefficient does not meet the change condition of the preset ridge regression coefficient, and obtaining a plurality of target influence factors.
In one embodiment, the apparatus further comprises:
and the judging module is used for judging whether multiple collinearity exists among the multiple influencing factors according to the multiple collinearity diagnosis parameter of each influencing factor.
In one embodiment, the multicollinearity diagnostic parameter of the influencing factor comprises a tolerance;
the diagnostic module is specifically configured to:
for each influence factor, performing a least squares calculation on the influence factor data of the target season corresponding to the influence factor to determine a multiple correlation coefficient of the influence factor;
calculating the tolerance according to a preset target value and the square of the multiple correlation coefficient;
the judgment module is specifically configured to:
and if the tolerance of each influence factor is less than a first preset threshold, determining that multiple collinearity exists among the plurality of influence factors.
In one embodiment, the multiple collinearity diagnostic parameters of the influencing factors include tolerance, variance inflation factor;
the diagnostic module is specifically configured to:
for each influence factor, performing a least squares calculation on the influence factor data of the target season corresponding to the influence factor to determine a multiple correlation coefficient of the influence factor;
calculating the tolerance according to a preset target value and the square of the multiple correlation coefficient;
taking the reciprocal of the tolerance as the variance inflation factor of the influence factor;
the judgment module is specifically configured to:
judging whether multiple collinearity exists among the multiple influence factors according to the variance inflation factor of each influence factor;
and if the variance inflation factor of each influence factor is larger than a second preset threshold, determining that multiple collinearity exists among the multiple influence factors.
In one embodiment, the multicollinearity diagnostic parameter of the influencing factor comprises a condition index;
the diagnostic module is specifically configured to:
determining a first data matrix according to the influence factor data of the target season;
determining a first observation matrix according to the first data matrix and the transposed matrix of the first data matrix;
calculating at least one eigenvalue of the first observation matrix;
taking the ratio of the maximum eigenvalue to the minimum eigenvalue as a maximum condition index;
the judgment module is specifically configured to:
judging whether multiple collinearity exists among the multiple influence factors or not according to the maximum condition index;
and if the maximum condition index is larger than or equal to a third preset threshold, determining that multiple collinearity exists among the multiple influence factors.
In one embodiment, the calculation module is specifically configured to:
determining a first data matrix according to the influence factor data of the target season;
determining a first observation matrix according to the first data matrix and the transposed matrix of the first data matrix;
determining a target matrix according to the first observation matrix, the ridge parameters and a preset standard value;
determining a second observation matrix according to the transposed matrix of the first data matrix and the preset influence factor observation data;
and obtaining a ridge regression coefficient of each influence factor according to the target matrix and the second observation matrix.
In one embodiment, the screening module is specifically configured to:
determining the influence factors of which the ridge regression coefficients meet preset stability conditions and the absolute values are smaller than a preset stability threshold as first influence factors under the condition that the ridge parameters are within a preset range;
determining the influence factors of which the ridge regression coefficients do not accord with a preset stable condition and accord with a preset ridge regression coefficient change trend condition as second influence factors under the condition that the ridge parameters are in a preset range;
screening the influence factors and removing the first influence factors and the second influence factors to obtain a plurality of target influence factors.
In one embodiment, the apparatus further comprises:
and the prediction module is used for inputting the influence factor data of the target season corresponding to the plurality of target influence factors into a pre-trained convolutional neural network to obtain the predicted wind power generation power of the target season.
In a third aspect, the application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the above steps when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the above steps.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprises a computer program which when executed by a processor performs the above steps.
In the above data dimension reduction method, apparatus, computer device, storage medium and computer program product, multiple influence factors of wind power and influence factor data are acquired, the influence factor data comprising parameter values of the multiple influence factors at multiple moments in a target season; multiple collinearity diagnosis is performed on the multiple influence factors according to the influence factor data of the target season to obtain multiple collinearity diagnosis parameters of each influence factor; if it is determined, according to the multiple collinearity diagnosis parameters, that multiple collinearity exists among the multiple influence factors, a ridge regression coefficient of each influence factor is calculated according to the influence factor data of the target season, preset influence factor observation data and ridge parameters; and the influence factors are screened according to the correspondence between the ridge regression coefficients and the ridge parameters, eliminating the influence factors whose ridge regression coefficients do not satisfy a preset ridge regression coefficient change condition, to obtain a plurality of target influence factors. With this method, the influence factors can be screened by their ridge regression coefficients, reducing data redundancy while preserving the application value of the data.
Drawings
FIG. 1 is a flow diagram illustrating a data dimension reduction method according to an embodiment;
FIG. 2 is a flow chart illustrating the steps of calculating the tolerance in one embodiment;
FIG. 3 is a flow chart illustrating the steps of determining based on tolerance according to one embodiment;
FIG. 4 is a schematic flow chart illustrating the step of calculating the variance inflation factor in one embodiment;
FIG. 5 is a flowchart illustrating the determining step according to the variance inflation factor in one embodiment;
FIG. 6 is a schematic flow chart diagram illustrating the step of calculating a maximum condition index in one embodiment;
FIG. 7 is a flowchart illustrating the determining step according to the maximum condition index in one embodiment;
FIG. 8 is a schematic flow chart of the step of calculating the ridge regression coefficient of each influencing factor in one embodiment;
FIG. 9 is a graph illustrating ridge regression coefficients for various influencing factors as a function of ridge parameters, in accordance with an embodiment;
FIG. 10 is a graph showing ridge regression coefficients as a function of ridge parameters for the influencing factors No. 1, 5, 12, 17, 22, and 23 in one embodiment;
FIG. 11 is a graph showing ridge regression coefficients as a function of ridge parameters for influencing factors No. 2, 4, 5, 6, and 7 according to an embodiment;
FIG. 12A is a diagram illustrating an example of distribution of error frequency before and after dimensionality reduction for spring data of influencing factors;
FIG. 12B is a diagram illustrating an example of an error frequency distribution of summer data before and after dimensionality reduction for the influencing factors;
FIG. 12C is a diagram illustrating an example of error frequency distributions of autumn data before and after dimensionality reduction for influencing factors;
FIG. 12D is a diagram illustrating an error frequency distribution of the winter data before and after dimensionality reduction for the influencing factors in one embodiment;
FIG. 13 is a diagram illustrating comparison of predicted values and actual values of sample points of a partial test set in autumn before and after dimensionality reduction for influencing factors in one embodiment;
FIG. 14 is a block diagram of an apparatus for dimensionality reduction of data according to one embodiment;
FIG. 15 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In an embodiment, a data dimension reduction method is provided. This embodiment is described by taking application of the method to a terminal as an example. It can be understood that the method can also be applied to a server, or to a system including the terminal and the server, in which case it is implemented through interaction between the terminal and the server. The terminal can be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, internet of things device or portable wearable device. The internet of things device can be a smart speaker, smart television, smart air conditioner, smart vehicle-mounted device, and the like. The portable wearable device can be a smart watch, smart bracelet, head-mounted device, and the like. The server can be implemented as an independent server or as a server cluster formed by a plurality of servers. The main purpose of this embodiment is to overcome the defects of existing dimension reduction methods by reducing the dimension of the data with a ridge regression algorithm, so as to screen out the influence factors that have a large influence on wind power prediction. As shown in FIG. 1, the data dimension reduction method includes the following steps:
Step 102, acquiring multiple influence factors of the wind power and influence factor data.
Wherein the influence factor data comprises parameter values of a plurality of influence factors at a plurality of times within the target season.
Specifically, the influence factors of the wind power are the factors that affect the wind power prediction of a wind farm, for example the 35m temperature (the temperature at a height of 35m at the wind farm location), momentum flux, 30m wind direction, 170m wind speed and 100m wind speed. The target season may be any one of a first season, a second season, a third season and a fourth season. For example, March to May of each year is spring (first season), June to August is summer (second season), September to November is autumn (third season), and December to February of the following year is winter (fourth season). When performing data dimension reduction, the terminal needs to acquire the multiple influence factors related to the wind power prediction of the wind farm and, for each of them, the sequence of parameter values at multiple moments in the target season.
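The month-to-season division given in the example above can be written as a small helper. The function name `season_of` and the season labels are illustrative choices, not part of the original text.

```python
def season_of(month: int) -> str:
    """Map a calendar month (1-12) to the season used in the example above."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    if 3 <= month <= 5:
        return "spring"   # first season: March to May
    if 6 <= month <= 8:
        return "summer"   # second season: June to August
    if 9 <= month <= 11:
        return "autumn"   # third season: September to November
    return "winter"       # fourth season: December to February
```

Samples whose timestamps map to the target season would then form the influence factor data for that season.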
Optionally, after the terminal determines the wind farm, the terminal may determine 24 influencing factors related to the prediction of the wind power according to specific conditions of the wind farm, and the types of the specific influencing factors may be as shown in table 1 below:
TABLE 1 (provided as an image in the original publication; contents not reproduced here)
Step 104, performing multicollinearity diagnosis on the multiple influence factors according to the influence factor data of the target season to obtain the multicollinearity diagnosis parameters of each influence factor.
The multicollinearity diagnosis is used for calculating multicollinearity diagnosis parameters of each influence factor, the multicollinearity diagnosis parameters are used for representing the degree of linear correlation relationship among different influence factors, and the multicollinearity refers to the linear correlation relationship among different influence factors.
Specifically, the terminal calculates multiple collinearity diagnosis parameters (values of multiple indexes) of each influence factor according to influence factor data of each influence factor of a target season, wherein the multiple collinearity diagnosis parameters of the influence factors represent whether multiple collinearity relations exist between the influence factors and other multiple influence factors.
Step 106, if it is determined according to the multicollinearity diagnosis parameters of the influence factors that multicollinearity exists among the multiple influence factors, calculating the ridge regression coefficient of each influence factor according to the influence factor data of the target season, the preset influence factor observation data and the ridge parameter.
Specifically, under the condition that the terminal determines that multiple collinearity exists among multiple influence factors according to multiple collinearity diagnosis parameters of the influence factors, the terminal can calculate the ridge regression coefficient of each influence factor according to influence factor data of each influence factor in a target season, preset influence factor observation data and ridge parameters.
Optionally, the ridge parameter k may take any value within a target range, for example 0 ≤ k ≤ 1. For each influence factor, the terminal can then calculate the ridge regression coefficient corresponding to each of a set of different ridge parameters k, obtaining the correspondence between the ridge regression coefficient of each influence factor and the ridge parameter over the target range.
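One way to tabulate this correspondence is to sweep k over a grid and solve the ridge system once per value. The sketch below assumes the standard ridge estimate (X^T X + kI)^(-1) X^T Y and a layout in which rows are times and columns are influence factors; the name `ridge_trace` and the toy data are hypothetical.

```python
import numpy as np

def ridge_trace(X, Y, ks):
    """Return an array of shape (len(ks), n_factors): one row of ridge coefficients per k."""
    gram = X.T @ X                      # n_factors x n_factors
    xty = X.T @ Y
    eye = np.eye(X.shape[1])
    # Solve (X^T X + k I) b = X^T Y for each k instead of forming an explicit inverse.
    return np.array([np.linalg.solve(gram + k * eye, xty) for k in ks])

# Hypothetical toy data: 3 time samples, 2 influence factors.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([1.0, 2.0, 3.0])
ks = np.arange(1, 81) / 100.0           # 0.01, 0.02, ..., 0.80
trace = ridge_trace(X, Y, ks)
print(trace.shape)
```

Each column of `trace` is the curve of one influence factor's coefficient against k, which is the information the later screening step inspects.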
Step 108, screening the influence factors according to the correspondence between the ridge regression coefficient and the ridge parameter, and eliminating the influence factors whose ridge regression coefficients satisfy a preset ridge regression coefficient change condition, to obtain a plurality of target influence factors.
Specifically, the preset ridge regression coefficient change condition may be that the ridge regression coefficient is stable and the absolute value of the ridge regression coefficient is smaller than a preset threshold; the preset ridge regression coefficient variation condition may also be that the ridge regression coefficient is unstable and tends to 0 as the ridge parameter k increases. The terminal can screen among the multiple influence factors, eliminate the influence factors of which the ridge regression coefficients meet the change conditions of the preset ridge regression coefficients, and reserve the influence factors of which the ridge regression coefficients do not meet the change conditions of the preset ridge regression coefficients to obtain the multiple target influence factors.
Optionally, the terminal may further screen the plurality of target influence factors. For example, the terminal may determine a target parameter value of the ridge parameter, calculate the absolute value of the ridge regression coefficient of each influence factor when the ridge parameter equals the target parameter value, rank the target influence factors by that absolute value, and take the target number of target influence factors with the largest absolute values as the updated target influence factors, where the target number may be five.
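The optional second screening amounts to ranking by absolute coefficient and keeping the first few. A minimal sketch follows, assuming the coefficients at the target parameter value are already available as a name-to-value mapping; the function name `top_factors` and the coefficient values are invented for illustration.

```python
def top_factors(coefficients, count=5):
    """Return the names of the `count` factors with the largest |ridge coefficient|."""
    ranked = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)
    return ranked[:count]


# Hypothetical ridge coefficients at the target parameter value (values are made up).
coefficients = {
    "170m wind speed": 0.62,
    "100m wind speed": 0.41,
    "30m wind direction": -0.08,
    "35m temperature": -0.15,
    "momentum flux": 0.27,
    "factor_6": 0.01,
}
print(top_factors(coefficients, count=5))
```

Ranking on the absolute value keeps factors with a strong negative influence as well as those with a strong positive one.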
In one example, the terminal determines a first range of the ridge parameter k (for example, when k > 0.4) for stabilizing the ridge regression coefficient of each influence factor according to the corresponding relationship between the ridge regression coefficient of each influence factor and the ridge parameter, and the terminal may randomly determine an arbitrary value within the first range as the target parameter value (for example, 0.8) of the ridge parameter.
In the above data dimension reduction method, multiple influence factors of the wind power and the influence factor data of a target season are acquired. Multicollinearity diagnosis is performed on the influence factors according to the influence factor data of the target season to obtain the multicollinearity diagnosis parameters of each influence factor. If it is determined from these parameters that multicollinearity exists among the influence factors, the ridge regression coefficient of each influence factor is calculated according to the influence factor data of the target season, the preset influence factor observation data and the ridge parameter. The influence factors are then screened according to the ridge regression coefficients, and those whose ridge regression coefficients satisfy the preset ridge regression coefficient change condition are eliminated, yielding a plurality of target influence factors. In this way, the influence factors can be screened through their ridge regression coefficients, data redundancy is reduced while the application value of the data is preserved, and the influence factors that strongly affect the wind power prediction of the wind farm are retained.
In one embodiment, after the step of performing multiple collinearity diagnosis on a plurality of influencing factors according to the influencing factor data of the target season to obtain multiple collinearity diagnosis parameters of each influencing factor, the data dimension reduction method further includes:
judging whether multicollinearity exists among the multiple influence factors according to the multicollinearity diagnosis parameter of each influence factor.
Specifically, the multiple collinearity diagnosis parameter of each influencing factor may include one or more of a tolerance, a variance expansion factor, and a condition index of the influencing factor, and the terminal may determine whether multiple collinearity exists or not among the multiple influencing factors according to a parameter value of each indicator included in the multiple collinearity diagnosis parameter of each influencing factor.
In one embodiment, the multicollinearity diagnosis parameter of the influence factor includes a tolerance.
Accordingly, as shown in fig. 2, the specific processing procedure of "performing multiple collinearity diagnosis on multiple influencing factors according to the influencing factor data of the target season to obtain multiple collinearity diagnosis parameters of each influencing factor" in step 104 includes:
Step 202, for each influence factor, performing least squares calculation on the influence factor data of the target season corresponding to the influence factor, and determining the multiple correlation coefficient of the influence factor.
Specifically, for each influence factor, the terminal performs multiple linear regression by a preset least squares method on the first influence factor data of that influence factor in the target season and the second influence factor data of the other influence factors in the target season, and calculates the multiple correlation coefficient $R_j$ corresponding to the influence factor.
Optionally, there may be n influence factors related to the wind power prediction of the wind farm, so that $R_j$ is the multiple correlation coefficient between the j-th influence factor $x_j$ and the other n-1 influence factors. The larger the multiple correlation coefficient, the closer the linear correlation between $x_j$ and the other influence factors.
Step 204, calculating the tolerance according to a preset target value and the square of the multiple correlation coefficient.
Specifically, for the j-th influence factor $x_j$, the terminal may take the difference between the preset target value (e.g., 1) and the square of the multiple correlation coefficient, $R_j^2$, as the tolerance $\mathrm{TOL}_j$ of the influence factor.
Optionally, the terminal may calculate the tolerance $\mathrm{TOL}_j$ of the j-th influence factor $x_j$ by the following formula:
$$\mathrm{TOL}_j = 1 - R_j^2, \quad j = 1, 2, \ldots, n$$
where n represents the number of influence factors.
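Steps 202 and 204 can be sketched as follows: each factor is regressed on the remaining factors by least squares, and the tolerance is obtained as one minus the squared multiple correlation coefficient. The sketch assumes a NumPy array with rows as times and columns as influence factors, and it includes an intercept term in each regression; both the layout and the name `tolerance` are assumptions made for illustration.

```python
import numpy as np

def tolerance(X):
    """Return TOL_j = 1 - R_j^2 for every column j of X (samples x factors)."""
    n_samples, n_factors = X.shape
    tol = np.empty(n_factors)
    for j in range(n_factors):
        y = X[:, j]                                   # factor j as the response
        others = np.delete(X, j, axis=1)              # the other n-1 factors
        A = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares fit
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        tol[j] = ss_res / ss_tot                      # equals 1 - R_j^2
    return tol

# Hypothetical data: the third column is almost the sum of the first two.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X_collinear = np.column_stack([a, b, a + b + 1e-3 * rng.normal(size=200)])
print(tolerance(X_collinear))
```

Computing the ratio of residual to total sum of squares directly avoids forming $R_j^2$ and then subtracting it from 1, which loses precision when $R_j^2$ is very close to 1.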
Accordingly, as shown in fig. 3, the specific implementation of the step "determining whether multiple collinearity exists among multiple influencing factors according to the multiple collinearity diagnosis parameter of each influencing factor" includes:
Step 302, if the tolerance of each influence factor is smaller than a first preset threshold, determining that multicollinearity exists among the multiple influence factors.
Specifically, the first preset threshold may be determined by the terminal according to an actual application scenario. The terminal compares the tolerance of the influencing factor with a first preset threshold, and if the tolerances of all the influencing factors are smaller than the first preset threshold, multiple collinearity among the influencing factors can be determined. If the tolerance of all the influencing factors is greater than or equal to the first preset threshold, it can be determined that multiple collinearity does not exist among the influencing factors.
In one example, the larger the multiple correlation coefficient, the closer the linear correlation among the influence factors: the larger $R_j^2$, the smaller $\mathrm{TOL}_j$. If the j-th influence factor $x_j$ is severely collinear with all the other influence factors, $\mathrm{TOL}_j \approx 0$; otherwise, $\mathrm{TOL}_j \approx 1$. The closer $\mathrm{TOL}_j$ is to 0, the more closely $x_j$ is related to all the other influence factors and the more likely multicollinearity is to occur.
In this embodiment, by calculating the tolerance of the influencing factors and comparing the tolerance with the first preset threshold, the effect of accurately determining whether multiple collinearity exists among multiple influencing factors can be achieved.
In one embodiment, the multicollinearity diagnosis parameters of the influence factor include a tolerance and a variance inflation factor. Accordingly, as shown in fig. 4, the specific processing procedure of "performing multiple collinearity diagnosis on multiple influencing factors according to the influencing factor data of the target season to obtain multiple collinearity diagnosis parameters of each influencing factor" includes:
Step 402, for each influence factor, performing least squares calculation on the influence factor data of the target season corresponding to the influence factor, and determining the multiple correlation coefficient of the influence factor.
Specifically, this step is performed in a manner similar to step 202 and is not repeated here; refer to the description of step 202 above.
Step 404, calculating the tolerance according to a preset target value and the square of the multiple correlation coefficient.
Specifically, this step is performed in a manner similar to step 204 and is not repeated here; refer to the description of step 204 above.
Step 406, taking the reciprocal of the tolerance as the variance inflation factor of the influence factor.
Specifically, the terminal may calculate the variance inflation factor $\mathrm{VIF}_j$ of the j-th influence factor $x_j$ by the following formula:
$$\mathrm{VIF}_j = 1 / \mathrm{TOL}_j$$
Accordingly, as shown in fig. 5, the specific processing procedure of "determining whether multiple collinearity exists among multiple influencing factors according to the multiple collinearity diagnosis parameter of each influencing factor" includes:
Step 502, if the variance inflation factor of each influence factor is greater than or equal to a second preset threshold, determining that multicollinearity exists among the multiple influence factors.
Specifically, the second preset threshold may be determined by the terminal according to an actual application scenario. The terminal compares the variance expansion factor of the influencing factor with a second preset threshold, and if the variance expansion factors of all the influencing factors are larger than or equal to the second preset threshold, multiple collinearity among the influencing factors can be determined. If the variance expansion factors of all the influencing factors are smaller than the second preset threshold value, it can be determined that no multicollinearity exists among the influencing factors.
In one example, if the j-th influence factor $x_j$ is uncorrelated with all the other influence factors (e.g., the other n-1), then $R_j^2 = 0$ and $\mathrm{VIF}_j = 1$; if $x_j$ is linearly correlated with the other influence factors (e.g., the other n-1), then $R_j^2 > 0$, $\mathrm{TOL}_j < 1$ and $\mathrm{VIF}_j > 1$, and the larger $\mathrm{VIF}_j$ is, the more severe the collinearity among the influence factors. For example, when $\mathrm{VIF}_j \ge 10$, severe collinearity can be considered to exist among the influence factors, and the second preset threshold may be 10.
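The reciprocal relation and the threshold comparison can be sketched in a few lines. The helper names are hypothetical, the tolerances are made-up inputs, and the check follows the rule as worded in this embodiment (every factor's variance inflation factor reaches the threshold); a practitioner might instead flag collinearity when any single factor does.

```python
def variance_inflation_factors(tolerances):
    """VIF_j = 1 / TOL_j; a zero tolerance is reported as infinity."""
    return [float("inf") if t == 0 else 1.0 / t for t in tolerances]


def all_reach_threshold(vifs, threshold=10.0):
    """Rule as worded in the embodiment: all VIFs are at or above the threshold."""
    return all(v >= threshold for v in vifs)


# Hypothetical tolerances for three influence factors.
tolerances = [0.05, 0.02, 0.08]
vifs = variance_inflation_factors(tolerances)
print(vifs, all_reach_threshold(vifs))
```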
In this embodiment, by calculating the tolerance of the influencing factors and the variance expansion factor, comparing the tolerance with the first preset threshold, and comparing the variance expansion factor with the second preset threshold, the effect of accurately determining whether multiple collinearity exists among multiple influencing factors can be achieved.
In one embodiment, the multiple collinearity diagnostic parameter of the influencing factor comprises a condition index.
Accordingly, as shown in fig. 6, the specific processing procedure of "performing multiple collinearity diagnosis on multiple influencing factors according to the influencing factor data of the target season to obtain multiple collinearity diagnosis parameters of each influencing factor" includes:
Step 602, determining a first data matrix according to the influence factor data of the target season.
The influence factor data of the target season may include the parameter values of multiple influence factors at multiple times of the target season, and the multiple influence factors may be n influence factors $x_1, x_2, \ldots, x_n$, where $x_1$ may denote the parameter values of the first influence factor at a plurality of times within the target season.
Specifically, the terminal may determine the first data matrix X according to the parameter values of the multiple influence factors at multiple times of the target season, i.e., $X = [x_1, x_2, \ldots, x_n]$.
Step 604, determining a first observation matrix according to the first data matrix and the transposed matrix of the first data matrix.
Specifically, the transposed matrix of the first data matrix may be $X^T$, and the terminal may take the product of the transposed matrix $X^T$ and the first data matrix X as the first observation matrix $X^T X$.
Step 606, calculating at least one characteristic root of the first observation matrix.
Specifically, the terminal may calculate all the characteristic roots (eigenvalues) of the first observation matrix according to a preset characteristic root calculation method, and the characteristic roots may be ordered as $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$.
Step 608, taking the ratio of the maximum characteristic root to the minimum characteristic root as the maximum condition index.
Specifically, the terminal may calculate the k-th condition index $\eta_k$ of the first observation matrix by the following formula:
$$\eta_k = \lambda_1 / \lambda_k, \quad k = 2, 3, \ldots, n$$
where $\lambda_1$ is the largest characteristic root and $\lambda_k$ is the k-th largest characteristic root. In this way, the terminal may determine that the maximum condition index is the ratio of the largest characteristic root to the smallest characteristic root, that is, $\eta_n = \lambda_1 / \lambda_n$.
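Steps 602 to 608 can be sketched with a symmetric eigenvalue solver. The sketch assumes the first observation matrix is the n-by-n product of the transposed data matrix and the data matrix, with rows of X as times and columns as influence factors; the function name `condition_indices` and the toy matrix are hypothetical.

```python
import numpy as np

def condition_indices(X):
    """Return eta_k = lambda_1 / lambda_k for k = 2..n, from the eigenvalues of X^T X."""
    gram = X.T @ X                          # first observation matrix (n x n)
    eig = np.linalg.eigvalsh(gram)[::-1]    # eigvalsh is ascending; reverse to descending
    return eig[0] / eig[1:]

# Hypothetical data matrix with two influence factors on very different scales.
X = np.array([[2.0, 0.0],
              [0.0, 1.0]])
eta = condition_indices(X)
print(eta, eta.max())                       # the last (largest) entry is the maximum condition index
```

With exactly collinear columns the smallest eigenvalue is zero up to rounding, so the maximum condition index becomes extremely large or infinite, which is consistent with very large values signalling severe multicollinearity.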
Accordingly, as shown in fig. 7, the specific implementation of the step "determining whether multiple collinearity exists among multiple influencing factors according to the multiple collinearity diagnosis parameter of each influencing factor" includes:
Step 702, if the maximum condition index is greater than or equal to a third preset threshold, determining that multicollinearity exists among the multiple influence factors.
Specifically, the third preset threshold may be determined by the terminal according to the actual application scenario. The terminal compares the maximum condition index, determined from the influence factor data of the multiple influence factors in the target season, with the third preset threshold. If the maximum condition index is greater than or equal to the third preset threshold, the terminal can determine that multicollinearity exists among the multiple influence factors. If the maximum condition index is smaller than the third preset threshold, it can be determined that no multicollinearity exists among the multiple influence factors.
In one example, the terminal may determine a fourth preset threshold and a fifth preset threshold. If the maximum condition index $\eta_n$ satisfies $0 \le \eta_n \le$ the fourth preset threshold, the terminal determines that no multicollinearity exists among the n influence factors $x_1, x_2, \ldots, x_n$; if the fourth preset threshold $< \eta_n \le$ the fifth preset threshold, moderate multicollinearity exists among them; and if $\eta_n >$ the fifth preset threshold, severe multicollinearity exists among them.
Optionally, the fourth preset threshold may be 100 and the fifth preset threshold may be 1000. In that case, if $0 \le \eta_n \le 100$, no multicollinearity exists among the n influence factors $x_1, x_2, \ldots, x_n$; if $100 < \eta_n \le 1000$, moderate multicollinearity exists among them; and if $\eta_n > 1000$, severe multicollinearity exists among them.
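The three-band rule with the example thresholds of 100 and 1000 can be written as a small classifier. The function name `collinearity_level` and the returned labels are illustrative assumptions.

```python
def collinearity_level(max_condition_index, moderate=100.0, severe=1000.0):
    """Classify the maximum condition index using the example thresholds 100 and 1000."""
    if max_condition_index < 0:
        raise ValueError("a condition index cannot be negative")
    if max_condition_index <= moderate:
        return "none"        # 0 <= eta_n <= 100
    if max_condition_index <= severe:
        return "moderate"    # 100 < eta_n <= 1000
    return "severe"          # eta_n > 1000


for value in (50.0, 500.0, 7.6806e17):
    print(value, collinearity_level(value))
```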
In this embodiment, the effect of accurately determining whether multiple collinearity exists among multiple influence factors can be achieved by calculating the maximum condition index of the influence factors and comparing the maximum condition index with a third preset threshold.
In one embodiment, as shown in fig. 8, the specific process of "calculating a ridge regression coefficient of each influencing factor according to the influencing factor data of the target season, the preset influencing factor observation data and the ridge parameters" in step 106 includes:
Step 802, determining a first data matrix according to the influence factor data of the target season.
The influence factor data of the target season may include the parameter values of multiple influence factors at multiple times of the target season, and the multiple influence factors may be n influence factors $x_1, x_2, \ldots, x_n$, where $x_1$ may denote the parameter values of the first influence factor at a plurality of times within the target season.
Specifically, the terminal may determine the first data matrix X according to the parameter values of the multiple influence factors at multiple times of the target season, i.e., $X = [x_1, x_2, \ldots, x_n]$.
Step 804, determining a first observation matrix according to the first data matrix and the transposed matrix of the first data matrix.
Specifically, the transposed matrix of the first data matrix may be $X^T$, and the terminal may take the product of the transposed matrix $X^T$ and the first data matrix X as the first observation matrix $X^T X$.
Step 806, determining a target matrix according to the first observation matrix, the ridge parameters and a preset standard value.
Specifically, the ridge parameter may be k, and the preset standard value may be a value that keeps the characteristic roots of the matrix away from zero and gives a better prediction effect, for example the identity matrix I. The terminal can thus obtain the first product kI of the ridge parameter k and the preset standard value I, and take the sum of the first product kI and the first observation matrix $X^T X$ as the target matrix $X^T X + kI$.
Step 808, determining a second observation matrix according to the transposed matrix of the first data matrix and the observation data of the preset influence factor.
Specifically, the preset influence factor observation data may be Y, and the terminal may take the product of the transposed matrix $X^T$ of the first data matrix and the preset influence factor observation data Y as the second observation matrix $X^T Y$.
In one example, Y is a set of values of the observed variable at different times, which can be expressed by the following formula:
$$Y = X\beta + \varepsilon$$
where X is the data matrix of the influence factors, which are the independent variables; Y is a set of observed values consisting of the generated power of the wind farm at a plurality of times, and is the dependent variable; $\beta$ is the set of regression coefficients of the corresponding independent variables in the multiple linear regression model; and $\varepsilon$ is the random error term.
Step 810, obtaining ridge regression coefficients of the influencing factors according to the target matrix and the second observation matrix.
Specifically, the terminal obtains the ridge regression model matrix from the target matrix and the second observation matrix, so that the terminal can calculate the ridge regression coefficients of the influence factors by the following formula (the ridge regression model matrix):
$$\hat{\beta}(k) = (X^T X + kI)^{-1} X^T Y$$
where, in the data matrix $X = [x_1, x_2, \ldots, x_n]$ of the influence factors, each column $x_i$ ($i = 1, 2, \ldots, n$) denotes the set of values of one influence factor at different times, the ridge parameter k varies between 0 and 1, Y is the set of values of the observed variable at different times, and $\hat{\beta}(k)$ is the set of ridge regression coefficients of the n influence factors for a given ridge parameter k.
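Steps 802 to 810 map directly onto a few matrix operations. The sketch below assumes the standard ridge estimate with the identity matrix as the preset standard value, and it solves the linear system rather than forming the inverse explicitly; the function name `ridge_coefficients` and the sample data are hypothetical.

```python
import numpy as np

def ridge_coefficients(X, Y, k):
    """Ridge estimate (X^T X + k I)^(-1) X^T Y for one ridge parameter k."""
    target = X.T @ X + k * np.eye(X.shape[1])   # target matrix: first observation matrix + kI
    second_observation = X.T @ Y                # second observation matrix
    return np.linalg.solve(target, second_observation)

# Hypothetical example with two nearly collinear influence factors.
X = np.array([[1.0, 1.0], [2.0, 2.001], [3.0, 2.999], [4.0, 4.0]])
Y = np.array([2.0, 4.0, 6.0, 8.0])
print(ridge_coefficients(X, Y, 0.8))
```

Setting k to 0 recovers the ordinary least squares estimate whenever the first observation matrix is invertible, so the same routine can be used to compare the two estimates.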
In one example, for the multiple linear regression model $Y = X\beta + \varepsilon$, where Y is the observation vector, an estimate of the vector $\beta$ can be obtained by the least squares method, specifically by the following formula:
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
Figure BDA0003586122640000181
Thus, if multicollinearity exists among the influence factors, the columns of X are linearly dependent or nearly so, $X^T X$ is singular or nearly singular, and some of its characteristic roots are close to zero, so that
Figure BDA0003586122640000182
may be large, which produces a large deviation between the estimated and actual values of the regression coefficients, so that the conventional least squares method loses stability and reliability. In this case, the present invention re-determines the matrix of the ridge regression model by the method of the above embodiment, so that the characteristic roots of the matrix are kept away from zero and
Figure BDA0003586122640000183
is reduced, yielding an estimate of the vector $\beta$.
In this embodiment, $X^T X + kI$ is used in place of $X^T X$. Because $X^T X + kI$ is nonsingular, the characteristic roots of the matrix of the ridge regression model are kept away from zero, which reduces the error of the calculated coefficient estimates and yields more accurate ridge regression coefficients.
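The effect of adding kI on the characteristic roots can be stated in one line. The following is a short worked derivation using standard linear algebra; the symbols $v_i$ for the eigenvectors of $X^T X$ are introduced here for illustration.

```latex
% If X^T X v_i = \lambda_i v_i with \lambda_1 \ge \dots \ge \lambda_n \ge 0, then for any k > 0:
(X^T X + kI)\, v_i = (\lambda_i + k)\, v_i, \qquad i = 1, \ldots, n .
% Every characteristic root is shifted up by k, so the smallest one is at least k > 0
% and the matrix is invertible. For \lambda_n > 0 the ratio of extreme roots also shrinks:
\frac{\lambda_1 + k}{\lambda_n + k} \;\le\; \frac{\lambda_1}{\lambda_n} .
```

The inequality follows from $k\lambda_n \le k\lambda_1$, so a larger k gives a better conditioned system at the cost of a more biased estimate.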
In one embodiment, the step of "screening each influence factor according to a correspondence between a ridge regression coefficient and a ridge parameter, and removing influence factors whose ridge regression coefficients satisfy a predetermined ridge regression coefficient change condition to obtain a plurality of target influence factors" includes a specific processing procedure of:
Under the condition that the ridge parameter is within a preset range, determining the influence factors whose ridge regression coefficients satisfy a preset stable condition and whose absolute values are smaller than a preset stability threshold as first influence factors; under the condition that the ridge parameter is within the preset range, determining the influence factors whose ridge regression coefficients do not satisfy the preset stable condition but satisfy a preset ridge regression coefficient change trend condition as second influence factors; and screening the influence factors and removing the first influence factors and the second influence factors to obtain a plurality of target influence factors.
Specifically, the terminal may calculate the correspondence between the ridge regression coefficient and the ridge parameter of each influence factor according to the ridge regression model matrix, and screen the influence factors according to that correspondence to obtain a plurality of target influence factors. The preset stable condition may be that the rate of change of the ridge regression coefficient is less than a sixth preset threshold, which may be ten percent of the maximum rate of change of the ridge regression coefficients. The preset ridge regression coefficient change trend condition may be that the ridge regression coefficient of the influence factor tends to zero as the ridge parameter increases. The preset stability threshold may be determined according to the actual value of the preset standard value.
Optionally, the terminal may screen the influence factors according to the correspondence between the ridge regression coefficient and the ridge parameter of each influence factor: it eliminates the influence factors whose ridge regression coefficient $\hat{\beta}_i(k)$ is stable and small in absolute value, and it eliminates the influence factors whose ridge regression coefficient $\hat{\beta}_i(k)$ is unstable and rapidly approaches zero as the ridge parameter k increases.
In one example, as the ridge parameter k increases, the ridge regression coefficient $\hat{\beta}_i(k)$ of the influence factor $x_i$ eventually tends to a stable value $b_i$, which can express the coefficient of influence of $x_i$ on the observed variable Y. When $b_i$ is positive, $x_i$ has a positive influence on the observed variable; when $b_i$ is negative, $x_i$ has a negative influence on the observed variable. When the absolute value of $b_i$ is small, the influence of $x_i$ on Y is small and $x_i$ can be screened out, thereby realizing dimension reduction.
In one example, the terminal may determine a growth step (for example, 0.01) for the ridge parameter, whose value may range from 0.01 to 0.8. The terminal can calculate how the ridge regression coefficient $\hat{\beta}_i(k)$ of each influence factor varies with the ridge parameter k. According to the influence factor data of each influence factor in the target season, the terminal can calculate the variation value $\Delta\hat{\beta}_i$ of the ridge regression coefficient as k goes from 0.01 to 0.8, together with the maximum absolute variation value $\max_i |\Delta\hat{\beta}_i|$ over the different influence factors. An influence factor $x_i$ whose variation value $|\Delta\hat{\beta}_i|$ is less than the threshold derived from the maximum for that season (for example, ten percent of it, matching the sixth preset threshold described above) is considered to meet the preset stable condition.
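The stability test and the two removal conditions can be sketched on a precomputed table of ridge coefficients. Several choices below are assumptions made only for illustration: the table has one row per k in increasing order and one column per factor, the change is measured between the first and last rows, "tends to zero" is approximated by a small final absolute value, and the names `stable_mask`, `keep_mask`, `stability_threshold` and `zero_tol` are hypothetical.

```python
import numpy as np

def stable_mask(trace, fraction=0.10):
    """True where |coefficient change over the k range| is below fraction * the largest change."""
    change = np.abs(trace[-1] - trace[0])
    return change < fraction * change.max()

def keep_mask(trace, stability_threshold, zero_tol, fraction=0.10):
    """True for factors that survive both removal conditions."""
    final = np.abs(trace[-1])
    stable = stable_mask(trace, fraction)
    first = stable & (final < stability_threshold)    # stable and small in absolute value
    second = ~stable & (final < zero_tol)             # unstable and shrinking towards zero
    return ~(first | second)

# Hypothetical coefficients at k = 0.01 (first row) and k = 0.8 (last row) for three factors.
trace = np.array([[1.00, 0.50, 0.02],
                  [0.20, 0.48, 0.02]])
print(stable_mask(trace), keep_mask(trace, stability_threshold=0.1, zero_tol=0.05))
```

In this toy table the third factor is stable but negligible and is removed, while the first factor changes a lot yet stays clearly away from zero and is kept.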
In one embodiment, the data dimension reduction method further comprises:
inputting the influence factor data of the target season corresponding to the plurality of target influence factors into a pre-trained convolutional neural network to obtain the predicted wind power generation power of the target season.
Specifically, the terminal may divide the influence factor data of the target season of the plurality of target influence factors into training data and test data according to time, where the training data further includes actual generated power of a wind farm corresponding to the target season.
In the embodiment, the generated power of the wind power plant is predicted through the data of the plurality of target influence factors with reduced dimensions, so that better training and predicting effects can be achieved, and more accurate power generation capability assessment can be realized.
The following describes a specific implementation procedure of the data dimension reduction method, with reference to a detailed embodiment:
For example, the data dimension reduction method can be applied to a target wind farm in a target area. The capacity of the wind farm may be 75 MW, its position may be xx°E, xx°N, and the influence factors may be collected at an interval of one power data point every 15 minutes. The influence factor data can be provided by the target anemometer tower at the same longitude and latitude.
For example, the influence factor data of the influence factors may cover the period from 20:00 on January 1, xxx9 to 20:00 on December 30, xx10. Each wind farm has a corresponding nearest anemometer tower that provides a numerical weather forecast, and the numerical weather forecast contains 24 possible influence factors of wind power. The data of the first 20 days of each month are taken as the training set, and the data of the last 10 days as the test set. Spring is March to May, summer is June to August, autumn is September to November, and winter is December to February of the following year. Training and testing are carried out by season, and the target season can be any one of spring, summer, autumn and winter.
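The day-of-month split can be sketched with the standard library. The function name `split_by_day_of_month` is hypothetical, the year in the sample timestamps is an arbitrary placeholder because the source elides it, and the sketch treats every day after the 20th as test data, which is an assumption for months that do not have exactly 30 days.

```python
from datetime import datetime, timedelta

def split_by_day_of_month(timestamps, train_days=20):
    """Return (train_indices, test_indices): days 1..train_days train, later days test."""
    train, test = [], []
    for index, ts in enumerate(timestamps):
        (train if ts.day <= train_days else test).append(index)
    return train, test

# Hypothetical 15-minute samples covering 20 to 21 March of a placeholder year.
start = datetime(2001, 3, 20, 23, 0)
timestamps = [start + timedelta(minutes=15 * i) for i in range(8)]
train_idx, test_idx = split_by_day_of_month(timestamps)
print(train_idx, test_idx)
```

The indices can then be used to select the matching rows of the influence factor matrix and of the generated power series for the target season.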
Step 1, the terminal performs multiple linear regression by the least squares method on the 24-dimensional possible influence factors in the numerical weather prediction (NWP) data of the target wind farm and the wind power data (the influence factor data of the influence factors), and calculates the R-squared statistic. The result may be 0.5762, from which it is preliminarily judged that multicollinearity exists in the regression equation.
Step 2, multicollinearity diagnosis is performed on the 24 influence factors (X1 to X24) corresponding to the 24 dimensions; the tolerance and variance inflation factor of each influence factor are shown in table 2. For a number of influence factors the tolerance is less than or equal to 0.1 and the variance inflation factor is greater than or equal to 10, which indicates serious multicollinearity between those influence factors and the rest. The maximum variance inflation factor is 2467.2 and the minimum tolerance is 0.0004053, far less than 0.1, so a high degree of multicollinearity exists among the independent variables. The maximum condition index of the matrix $X^T X$ formed from the observation data is $7.6806 \times 10^{17}$, far greater than 100, which indicates severe multicollinearity.
TABLE 2
Figure BDA0003586122640000201
Figure BDA0003586122640000211
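The diagnostics of step 2 can be sketched as follows, purely as an illustration. The sketch assumes that the tolerance of a factor is 1 minus the R-squared obtained by regressing that factor on all the others, that the variance inflation factor is the reciprocal of the tolerance, and that the maximum condition index is the ratio of the largest to the smallest characteristic root of X^T X, as described in this application. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def tolerance_and_vif(X: np.ndarray):
    """Return (tolerance, VIF) arrays, one entry per column of X."""
    n, p = X.shape
    tol = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        ss_tot = ((X[:, j] - X[:, j].mean()) ** 2).sum()
        r2 = 1.0 - (resid @ resid) / ss_tot   # squared multiple correlation
        tol[j] = 1.0 - r2
    return tol, 1.0 / tol


def max_condition_index(X: np.ndarray) -> float:
    """Ratio of the largest to the smallest eigenvalue of X^T X."""
    eig = np.linalg.eigvalsh(X.T @ X)
    return float(eig.max() / eig.min())


# Synthetic demonstration: the third column is almost x1 + x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.01, size=500)
X_demo = np.column_stack([x1, x2, x3])
tol, vif = tolerance_and_vif(X_demo)
cond = max_condition_index(X_demo)
```

With exactly duplicated columns (such as identical factors in real data) the tolerance becomes zero and the reciprocal is undefined, so such columns would need to be handled before calling this sketch.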
Step 3: the multicollinearity among the influence factors is addressed by ridge regression; the ridge parameter k is determined by the ridge trace method, and the independent variables after dimension reduction are finally obtained. Preset screening condition 1 may be: eliminate independent variables whose ridge regression coefficient
Figure BDA0003586122640000212
is stable and small in absolute value. Preset screening condition 2 may be: eliminate independent variables whose ridge regression coefficient
Figure BDA0003586122640000213
is unstable but rapidly approaches zero as the ridge parameter k increases. The curves of the ridge regression coefficients of the various influence factors as functions of the ridge parameter may be as shown in fig. 9; the ridge regression coefficients of the influence factors essentially stabilize once the ridge parameter k exceeds 0.4. The case k = 0.8 is selected, in which the values of the ridge regression coefficients can represent the influence of the different influence factors on wind power prediction, and the subsequent data dimension reduction is performed.
According to preset screening condition 1 and preset screening condition 2, the terminal can remove the influence factors whose ridge regression coefficients are stable and small in absolute value, as well as the influence factors whose ridge regression coefficients are unstable and rapidly tend to zero as the ridge parameter increases. The ridge parameter k is increased from 0.01 to 0.8 in steps of 0.01, and the ridge regression coefficients of the different factors
Figure BDA0003586122640000221
are examined as k varies. For the training set data of each season, the variation value of the ridge regression coefficient as k goes from 0.01 to 0.8,
Figure BDA0003586122640000222
is calculated, together with the maximum, over the different influence factors, of the absolute value of the variation value
Figure BDA0003586122640000223
which is denoted
Figure BDA0003586122640000224
If the ridge regression coefficient variation value
Figure BDA0003586122640000225
of a factor is less than the
Figure BDA0003586122640000226
corresponding to the season, the factor x_i is considered relatively stable.
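The ridge trace described above can be sketched as follows, as an illustration under stated assumptions. The sketch assumes the standard ridge estimator (X^T X + kI)^{-1} X^T y on standardized data, sweeps k from 0.01 to 0.8 in steps of 0.01, and computes the variation value of each coefficient between the two ends of the sweep. The exact stability threshold is shown only as an image in the source, so it is not reproduced here. All names and the synthetic data are hypothetical.

```python
import numpy as np


def standardize(X: np.ndarray, y: np.ndarray):
    """Center and scale so that X^T X is the correlation matrix."""
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))
    yc = y - y.mean()
    ys = yc / np.sqrt((yc ** 2).sum())
    return Xs, ys


def ridge_coefficients(X: np.ndarray, y: np.ndarray, k: float) -> np.ndarray:
    """Ridge estimate (X^T X + k I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)


def ridge_trace(X: np.ndarray, y: np.ndarray, ks: np.ndarray) -> np.ndarray:
    """Rows correspond to values of k, columns to influence factors."""
    return np.array([ridge_coefficients(X, y, k) for k in ks])


# Synthetic demonstration data with two nearly collinear columns.
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)
x3 = rng.normal(size=300)
X_raw = np.column_stack([x1, x2, x3])
y_raw = 2.0 * x1 + 0.5 * x3 + rng.normal(scale=0.5, size=300)

Xs, ys = standardize(X_raw, y_raw)
ks = np.arange(1, 81) / 100.0          # 0.01, 0.02, ..., 0.80
trace = ridge_trace(Xs, ys, ks)
delta = trace[-1] - trace[0]           # variation value per factor
max_abs_delta = np.abs(delta).max()    # largest absolute variation
```

Plotting each column of `trace` against `ks` gives one ridge trace curve per influence factor, analogous to the curves referred to in the figures.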
Specifically, some ridge regression coefficients
Figure BDA0003586122640000227
remain near zero with almost no change in magnitude, as shown in fig. 10 by the curves of influence factors No. 12 and No. 22, and likewise by the curves of influence factors No. 1, No. 5, No. 17 and No. 23.
Specifically, if the absolute value of the ridge regression coefficient
Figure BDA0003586122640000228
of influence factor x_i at k = 0.8 also does not exceed 10% of
Figure BDA0003586122640000229
the influence factor is considered to meet preset screening condition 1. For the spring training data of this station, the influence factors X3 (30 m wind direction), X10 (100 m wind direction), X11 (10 m wind direction), X12 (10 m sea level wind direction), X13 (sea level air pressure), X14 (cloud amount) and X19 (ground air pressure) should be eliminated. Screening is then performed with preset screening condition 2. The larger the absolute value of the ridge regression coefficient
Figure BDA00035861226400002210
of influence factor x_i at k = 0.8, the greater its influence on the observed value of the wind power. In this application, 5 influence factors are retained per season, so among the remaining influence factors, the 5 with the largest absolute values of
Figure BDA00035861226400002211
are retained. For the spring, autumn and winter data, the 5 influence factors X2 (momentum flux), X4 (170 m wind speed), X5 (100 m wind speed), X6 (30 m wind speed) and X7 (10 m wind speed) are retained for subsequent analysis, and the curves of these five influence factors may be as shown in fig. 11. Note that X7 (10 m wind speed) and X8 (10 m sea level wind speed) are identical in the data, so the duplicate influence factor X8 is screened out and X7 is retained. In summer, the 5 influence factors with the largest absolute values of ridge regression coefficients are X4 (170 m wind speed), X5 (100 m wind speed), X6 (30 m wind speed), X16 (sensible heat flux) and X24 (2 m relative humidity), which may be related to the influence of high summer temperatures on wind turbine operation and to the marine climate of the target region. The five influence factors retained after dimension reduction for the target wind farm in the target area may be as shown in Table 3:
TABLE 3
Figure BDA00035861226400002212
Figure BDA0003586122640000231
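The final selection step can be sketched as follows, as a simplified illustration only. The reference quantity for the 10% rule is shown only as an image in the source; this sketch assumes it is the largest absolute ridge regression coefficient at k = 0.8, and it omits the separate stability check. The function name `select_factors`, its parameters, and the coefficient values in the example are all made up for illustration.

```python
import numpy as np


def select_factors(beta_at_k, names, n_keep=5, small_ratio=0.10):
    """Drop factors with small |coefficient|, then keep the n_keep largest."""
    abs_beta = np.abs(np.asarray(beta_at_k, dtype=float))
    cutoff = small_ratio * abs_beta.max()   # assumed reference for "10%"
    candidates = [i for i in range(len(names)) if abs_beta[i] > cutoff]
    ranked = sorted(candidates, key=lambda i: abs_beta[i], reverse=True)
    return [names[i] for i in ranked[:n_keep]]


# Made-up coefficients at k = 0.8 for eight hypothetical factors.
demo_names = ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"]
demo_beta = [0.02, 0.5, -0.01, 0.4, -0.45, 0.3, 0.35, 0.03]
kept = select_factors(demo_beta, demo_names)
```

In this made-up example, X1, X3 and X8 fall below the cutoff, and the remaining five factors are returned in descending order of absolute coefficient.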
Step 4: CNN (convolutional neural network) prediction is carried out separately with all the normalized independent variables and with the independent variables selected by ridge regression. The data of the first 20 days of each month in each season are taken as the training set, and the data of the last 10 days as the test set. Since CNN results fluctuate randomly to some extent, 10 predictions are made for each case. The root mean square error (RMSE) between the predicted and actual values is used as the error index, and the errors are compared and analyzed as shown in Tables 4 and 5; the accuracy of wind power prediction using the dimension-reduced data can be considerably improved. The row "spring, before dimension reduction" gives the prediction errors when all 24 independent variables of the wind farm are used, and the row "spring, after dimension reduction" gives the prediction errors when the 5 independent variables retained after dimension reduction are used. The last column gives the prediction error obtained when the dimension-reduced data are trained with a whole-year training set; this error is larger than the result obtained when the data are divided by season, which indicates that the data need to be processed by season.
TABLE 4
Figure BDA0003586122640000232
Figure BDA0003586122640000241
TABLE 5
Figure BDA0003586122640000242
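The error index used in step 4 can be sketched as follows. The function `rmse` implements the usual root mean square error; averaging over repeated runs in `mean_rmse` is an assumption about how the 10 predictions per case might be summarized, since the source does not state how they are aggregated. Both names are hypothetical.

```python
import numpy as np


def rmse(pred, actual) -> float:
    """Root mean square error between predictions and actual values."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((pred - actual) ** 2)))


def mean_rmse(run_predictions, actual) -> float:
    """Average RMSE over several prediction runs (assumed aggregation)."""
    return float(np.mean([rmse(p, actual) for p in run_predictions]))
```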
Figs. 12A, 12B, 12C and 12D show the error frequency distributions before and after data dimension reduction in the different seasons. After dimension reduction, the prediction effect improves for the summer, autumn and winter data, most markedly in autumn. Some sample points of the autumn test set are taken, and the predicted values before and after dimension reduction are compared with the actual values, as shown in fig. 13. The predicted values after dimension reduction are represented by a dotted line; compared with the predicted values before dimension reduction, they are closer to the actual values at most sampling points, so dimension reduction does improve the prediction effect.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in sequence as indicated by the arrows, the steps are not necessarily executed in that sequence. Unless explicitly stated otherwise herein, the execution of the steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a data dimension reduction device for realizing the data dimension reduction method. The implementation scheme for solving the problem provided by the apparatus is similar to the implementation scheme described in the above method, so specific limitations in one or more embodiments of the data dimension reduction apparatus provided below may refer to the limitations of the data dimension reduction method in the foregoing, and details are not described here.
In one embodiment, as shown in fig. 14, there is provided a data dimension reduction apparatus 900, including:
the obtaining module 901 is configured to obtain multiple influence factors of the wind power and influence factor data, where the influence factor data includes parameter values of the multiple influence factors at multiple times in a target season.
And the diagnosis module 902 is configured to perform multiple collinearity diagnosis on multiple influence factors according to the influence factor data of the target season to obtain multiple collinearity diagnosis parameters of each influence factor.
A calculating module 903, configured to calculate a ridge regression coefficient of each influence factor according to the influence factor data of the target season, preset influence factor observation data, and the ridge parameter if it is determined that multiple collinearity exists among multiple influence factors according to the multiple collinearity diagnosis parameter of each influence factor.
The screening module 904 is configured to screen each influence factor according to a corresponding relationship between the ridge regression coefficient and the ridge parameter, and remove influence factors whose ridge regression coefficient satisfies a predetermined ridge regression coefficient change condition, to obtain a plurality of target influence factors.
The modules in the data dimension reduction apparatus 900 can be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, the apparatus further comprises:
and the judging module is used for judging whether multiple collinearity exists among the multiple influencing factors according to the multiple collinearity diagnosis parameter of each influencing factor.
In one embodiment, the multiple collinearity diagnostic parameters of the influencing factors include a tolerance;
the diagnostic module is specifically configured to:
for each influence factor, performing least squares calculation on the influence factor data of the target season corresponding to the influence factor to determine a multiple correlation coefficient of the influence factor;
calculating the tolerance according to a preset target value and the square of the multiple correlation coefficient;
the judgment module is specifically configured to:
and if the tolerance of each influence factor is less than a first preset threshold, determining that multiple collinearity exists among the plurality of influence factors.
In one embodiment, the multiple collinearity diagnostic parameters of the influencing factors include a tolerance and a variance inflation factor;
the diagnostic module is specifically configured to:
for each influence factor, performing least squares calculation on the influence factor data of the target season corresponding to the influence factor to determine a multiple correlation coefficient of the influence factor;
calculating the tolerance according to a preset target value and the square of the multiple correlation coefficient;
taking the reciprocal of the tolerance as the variance inflation factor of the influence factor;
the judgment module is specifically configured to:
judging whether multiple collinearity exists among the multiple influence factors according to the variance inflation factor of each influence factor;
and if the variance inflation factor of each influence factor is larger than a second preset threshold value, determining that multiple collinearity exists among the plurality of influence factors.
In one embodiment, the multicollinearity diagnostic parameter of the influencing factor comprises a condition index;
the diagnostic module is specifically configured to:
determining a first data matrix according to the influence factor data of the target season;
determining a first observation matrix according to the first data matrix and the transposed matrix of the first data matrix;
calculating at least one characteristic root of the first observation matrix;
taking the ratio of the maximum characteristic root to the minimum characteristic root as a maximum condition index;
the judgment module is specifically configured to:
judging whether multiple collinearity exists among the multiple influence factors or not according to the maximum condition index;
and if the maximum condition index is larger than or equal to a third preset threshold value, determining that multiple collinearity exists among the multiple influencing factors.
In one embodiment, the calculation module is specifically configured to:
determining a first data matrix according to the influence factor data of the target season;
determining a first observation matrix according to the first data matrix and the transposed matrix of the first data matrix;
determining a target matrix according to the first observation matrix, the ridge parameters and a preset standard value;
determining a second observation matrix according to the transposed matrix of the first data matrix and the preset influence factor observation data;
and obtaining ridge regression coefficients of the influence factors according to the target matrix and the second observation matrix.
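For orientation, the steps listed above can be read against the standard ridge estimator. The mapping below is an interpretation, not a statement of this application: it assumes that the first data matrix is a standardized matrix X, that the preset influence factor observation data is a vector y, and that the preset standard value is the identity matrix I.

```latex
\begin{aligned}
\text{first observation matrix:}\quad & X^{\mathsf T} X \\
\text{target matrix:}\quad & X^{\mathsf T} X + k I \\
\text{second observation matrix:}\quad & X^{\mathsf T} y \\
\text{ridge regression coefficients:}\quad & \hat{\beta}(k) = \left(X^{\mathsf T} X + k I\right)^{-1} X^{\mathsf T} y
\end{aligned}
```

Under this reading, k = 0 recovers the ordinary least squares estimate, and increasing k shrinks the coefficients toward zero.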
In one embodiment, the screening module is specifically configured to:
determining the influence factors of which the ridge regression coefficients meet preset stability conditions and the absolute values are smaller than a preset stability threshold as first influence factors under the condition that the ridge parameters are within a preset range;
determining the influence factors of which the ridge regression coefficients do not accord with a preset stable condition and accord with a preset ridge regression coefficient change trend condition as second influence factors under the condition that the ridge parameters are in a preset range;
screening each influence factor, and removing the first influence factor and the second influence factor to obtain a plurality of target influence factors.
In one embodiment, the apparatus further comprises:
and the prediction module is used for inputting the influence factor data of the target season corresponding to the plurality of target influence factors into a pre-trained convolutional neural network to obtain the predicted wind power generation power of the target season.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 15. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing relevant data of the influencing factors. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data dimension reduction method.
Those skilled in the art will appreciate that the architecture shown in fig. 15 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, carries out the steps in the method embodiments described above.
It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetic Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum computing based data processing logic devices, etc., without limitation.
All possible combinations of the technical features in the above embodiments may not be described for the sake of brevity, but should be considered as being within the scope of the present disclosure as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their description is specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application should be subject to the appended claims.

Claims (12)

1. A method for reducing dimensions of data, the method comprising:
acquiring various influence factors and influence factor data of wind power, wherein the influence factor data comprises parameter values of the various influence factors at multiple moments in a target season;
performing multiple collinearity diagnosis on the multiple influence factors according to the influence factor data of the target season to obtain multiple collinearity diagnosis parameters of each influence factor;
If multiple collinearity exists among the multiple influence factors according to the multiple collinearity diagnosis parameters of the influence factors, calculating ridge regression coefficients of the influence factors according to the influence factor data of the target season, preset influence factor observation data and ridge parameters;
and screening all influence factors according to the corresponding relation between the ridge regression coefficient and the ridge parameters, and eliminating the influence factors of which the ridge regression coefficient does not meet the change condition of the preset ridge regression coefficient to obtain a plurality of target influence factors.
2. The method of claim 1, wherein after the step of performing the multiple collinearity diagnosis on the plurality of influencing factors according to the influencing factor data of the target season to obtain the multiple collinearity diagnosis parameter of each influencing factor, the method further comprises:
and judging whether multiple collinearity exists among the multiple influencing factors according to the multiple collinearity diagnosis parameter of each influencing factor.
3. The method of claim 2, wherein the multicollinearity diagnostic parameter of the influencing factor comprises a tolerance;
the multiple collinearity diagnosis is carried out on the multiple influence factors according to the influence factor data of the target season to obtain multiple collinearity diagnosis parameters of each influence factor, and the multiple collinearity diagnosis parameters comprise:
for each influence factor, performing least squares calculation on influence factor data of a target season corresponding to the influence factor to determine a multiple correlation coefficient of the influence factor;
calculating the tolerance according to a preset target value and the square of the multiple correlation coefficient;
the judging whether multiple collinearity exists among the multiple influence factors according to the multiple collinearity diagnosis parameter of each influence factor comprises the following steps:
and if the tolerance of each influence factor is less than a first preset threshold, determining that multiple collinearity exists among the plurality of influence factors.
4. The method of claim 2, wherein the multicollinearity diagnostic parameters of the influencing factors comprise a tolerance, a variance inflation factor;
the multiple collinearity diagnosis is carried out on the multiple influence factors according to the influence factor data of the target season to obtain multiple collinearity diagnosis parameters of each influence factor, and the multiple collinearity diagnosis parameters comprise:
for each influence factor, performing least squares calculation on the influence factor data of the target season corresponding to the influence factor to determine a multiple correlation coefficient of the influence factor;
calculating the tolerance according to a preset target value and the square of the multiple correlation coefficient;
taking the reciprocal of the tolerance as a variance inflation factor of the influence factor;
the judging whether multiple collinearity exists among the multiple influence factors according to the multiple collinearity diagnosis parameter of each influence factor comprises the following steps:
and if the variance inflation factor of each influence factor is larger than a second preset threshold value, determining that multiple collinearity exists among the plurality of influence factors.
5. The method of claim 2, wherein the multicollinearity diagnostic parameter of the influencing factor comprises a condition index;
the multiple collinearity diagnosis is carried out on the multiple influence factors according to the influence factor data of the target season to obtain multiple collinearity diagnosis parameters of each influence factor, and the multiple collinearity diagnosis parameters comprise:
determining a first data matrix according to the influence factor data of the target season;
determining a first observation matrix according to the first data matrix and the transposed matrix of the first data matrix;
calculating at least one characteristic root of the first observation matrix;
taking the ratio of the maximum characteristic root to the minimum characteristic root as a maximum condition index;
the judging whether multiple collinearity exists among the multiple influence factors according to the multiple collinearity diagnosis parameter of each influence factor comprises the following steps:
And if the maximum condition index is larger than or equal to a third preset threshold value, determining that multiple collinearity exists among the multiple influencing factors.
6. The method as claimed in claim 1, wherein said calculating a ridge regression coefficient for each said influence factor according to influence factor data of said target season, preset influence factor observation data and ridge parameters comprises:
determining a first data matrix according to the influence factor data of the target season;
determining a first observation matrix according to the first data matrix and the transposed matrix of the first data matrix;
determining a target matrix according to the first observation matrix, the ridge parameters and a preset standard value;
determining a second observation matrix according to the transposed matrix of the first data matrix and the preset influence factor observation data;
and obtaining a ridge regression coefficient of each influence factor according to the target matrix and the second observation matrix.
7. The method as claimed in claim 1, wherein the selecting each influence factor according to the corresponding relationship between the ridge regression coefficient and the ridge parameter, and removing the influence factors whose ridge regression coefficient does not satisfy the variation condition of the predetermined ridge regression coefficient to obtain a plurality of target influence factors comprises:
Determining the influence factors of which the ridge regression coefficients meet preset stability conditions and the absolute values are smaller than a preset stability threshold as first influence factors under the condition that the ridge parameters are within a preset range;
determining the influence factors of which the ridge regression coefficients do not accord with a preset stable condition and accord with a preset ridge regression coefficient change trend condition as second influence factors under the condition that the ridge parameters are in a preset range;
screening each influence factor, and removing the first influence factor and the second influence factor to obtain a plurality of target influence factors.
8. The method of claim 1, further comprising:
and inputting the influence factor data of the target seasons corresponding to the plurality of target influence factors into a pre-trained convolutional neural network to obtain the predicted wind power generation power of the target seasons.
9. An apparatus for data dimension reduction, the apparatus comprising:
the acquiring module is used for acquiring various influence factors of the wind power and influence factor data, wherein the influence factor data comprises parameter values of the various influence factors at multiple moments in a target season;
the diagnosis module is used for performing multiple collinearity diagnosis on the multiple influence factors according to the influence factor data of the target season to obtain multiple collinearity diagnosis parameters of each influence factor;
The calculation module is used for calculating ridge regression coefficients of the various influence factors according to the influence factor data of the target season, preset influence factor observation data and ridge parameters if the multiple collinearity among the various influence factors is determined according to the multiple collinearity diagnosis parameters of the various influence factors;
and the screening module is used for screening all the influence factors according to the corresponding relation between the ridge regression coefficient and the ridge parameters, eliminating the influence factors of which the ridge regression coefficient does not meet the change condition of the preset ridge regression coefficient, and obtaining a plurality of target influence factors.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 8.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
12. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 8 when executed by a processor.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210363550.4A CN114840506A (en) 2022-04-08 2022-04-08 Data dimension reduction method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114840506A true CN114840506A (en) 2022-08-02

Family

ID=82563389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210363550.4A Pending CN114840506A (en) 2022-04-08 2022-04-08 Data dimension reduction method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114840506A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination