CN115481844A - Distribution network material demand prediction system based on feature extraction and improved SVR model - Google Patents

Info

Publication number
CN115481844A
Authority
CN
China
Prior art keywords
data
module
project
optimization
historical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110664592.7A
Other languages
Chinese (zh)
Inventor
刘康军
马婉仪
黄振球
黎莫林
江健武
刘永忠
梁曦匀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Power Supply Bureau Co Ltd
Original Assignee
Shenzhen Power Supply Bureau Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Power Supply Bureau Co Ltd filed Critical Shenzhen Power Supply Bureau Co Ltd
Priority to CN202110664592.7A priority Critical patent/CN115481844A/en
Publication of CN115481844A publication Critical patent/CN115481844A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06315Needs-based resource requirements planning or analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/06Asset management; Financial planning or analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Strategic Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Development Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Marketing (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Tourism & Hospitality (AREA)
  • Game Theory and Decision Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Operations Research (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Technology Law (AREA)
  • Molecular Biology (AREA)
  • Educational Administration (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)

Abstract

The invention relates to a distribution network material demand prediction system based on feature extraction and an improved SVR model, which comprises: a data import module for importing historical project data for a selected time period; a preprocessing module for extracting data from the imported historical project data to obtain a material historical data set; a data grouping module for grouping the material historical data set according to an industry and engineering classification knowledge base to obtain grouped material historical data; a material type screening module for ranking the materials in the historical projects by importance according to the project attribute data and screening the material types to be predicted; an SVR model training module for training a corresponding SVR model for each selected material to be predicted; and a prediction module for acquiring the project attribute data of a material according to the input material type and feeding it to the corresponding SVR model to predict the demand quantity of that material. The invention improves both prediction efficiency and prediction performance on unbalanced data.

Description

Distribution network material demand prediction system based on feature extraction and improved SVR model
Technical Field
The invention relates to the technical field of material demand prediction and data processing, and in particular to a distribution network material demand prediction system based on feature extraction and an improved SVR model.
Background
Distribution network material demand forecasting is an important capability for modern power grid enterprises when formulating purchasing strategies and planning development, and an important means of improving management level and operating efficiency. How to forecast material demand scientifically and reasonably has gradually become a key topic for power grid enterprises. Distribution network material demand forecasting applies data mining to historical project material usage data to uncover the underlying patterns of material use, forecast the future material demand of projects, and give decision-makers a basis for decisions. It helps make distribution network material management leaner, while the accuracy and timeliness of the forecast limit how far material management can improve. Forecasting material demand accurately and promptly is therefore of great practical significance.
In distribution network material demand prediction, traditional multiple regression models, time series methods and the like rely on linear techniques; they do not handle nonlinear data properly and perform poorly on it. Improving traditional prediction methods with artificial intelligence algorithms makes it possible to extract the composite features of a material series comprehensively and so further improve prediction accuracy. By training on historical distribution network material demand data, machine learning and deep learning algorithms can discover features that are otherwise hard to find, making predictions more accurate. However, although these newer techniques can cope with nonlinear, complex and variable material demand, they also bring problems such as heavy computation, difficult parameter design and slow convergence. For example, neural network models have many parameters to optimize, which hampers their use in engineering practice, and they are prone to over-learning (overfitting) and poor generalization.
Disclosure of Invention
In view of the above analysis, the present invention aims to provide a distribution network material demand prediction system based on feature extraction and an improved SVR model, so as to improve the accuracy of distribution network material prediction.
The technical scheme provided by the invention is as follows:
The invention discloses a distribution network material demand prediction system based on feature extraction and an improved SVR model, which comprises:
a data import module for importing historical project data for a selected time period, the historical project data comprising material demand data and the project attribute data corresponding to the material demand;
a preprocessing module for extracting data from the imported historical project data to obtain a material historical data set;
a data grouping module for grouping the material historical data set according to an industry and engineering classification knowledge base to obtain grouped material historical data;
a material type screening module for ranking the materials in the historical projects by importance according to the project attribute data and screening the material types to be predicted;
an SVR model training module for training a corresponding SVR model for each selected material to be predicted; and
a prediction module for acquiring the project attribute data of a material according to the input material type and feeding it to the corresponding SVR model to predict the demand quantity of that material.
Further, the SVR model training module comprises a data set partitioning module, a support vector machine modeling configuration module, and a model prediction and optimization module;
the data set partitioning module is used for dividing the data into a training set and a test set;
the support vector machine modeling configuration module is used for configuring the SVR model, namely the kernel function and the parameters (C, g); the kernel function is a Gaussian radial basis function (RBF) kernel; C is the penalty parameter applied to prediction errors; g is the coefficient of the kernel function;
the model prediction and optimization module is used for optimizing the parameters (C, g) of the SVR model to improve the accuracy of demand prediction and the generalization capability of the model.
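For reference, the Gaussian radial basis kernel named above has the standard form below, written with g as the kernel coefficient to match the parameter pair (C, g). The symbols x_i and x_j (two feature vectors) are notation assumed here for illustration and do not appear in the patent text.

```latex
K(x_i, x_j) = \exp\!\left(-g \,\lVert x_i - x_j \rVert^{2}\right), \qquad g > 0
```

A larger g makes the kernel more local, while a larger C penalizes training errors more heavily, which is why the two parameters are tuned jointly.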
Further, the model prediction and optimization module comprises a first prediction and optimization module and a second prediction and optimization module;
the first prediction and optimization module is used for performing a first optimization of the SVR parameters (C, g) with a particle swarm algorithm, quickly locating the optimal parameter interval and obtaining the optimized parameter result (C, g)_1;
the second prediction and optimization module is used for performing a second parameter optimization within the optimal parameter interval with a grid search method to obtain the optimized parameter result (C, g)_2.
Further, the first optimization method in the first prediction and optimization module comprises:
1) Normalizing the divided training set and test set;
2) Setting the initial search range of the parameters (C, g), the number of evolution generations and the population size of the particle swarm algorithm, the initial value of the inertia factor, and the maximum number of iterations;
3) Determining the individual best position and the global best position of the particles by calculating the fitness of each particle;
4) Performing an iteration, updating the velocity and position of each particle according to the individual best position and global best position determined in the previous iteration, to obtain the corresponding parameters (C, g);
5) Judging whether the iteration end condition is met; if so, ending the iteration and outputting the parameters (C, g) as the optimal parameter values (C, g)_1 of the particle swarm optimization; otherwise, returning to 3) and continuing the iterative update.
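Steps 2) to 5) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `toy_error` is a hypothetical stand-in for the SVR validation error, the inertia and acceleration values (`w`, `c1`, `c2`), the search bounds and the fixed iteration count are all assumed, and the normalization of step 1) is omitted.

```python
import random

def pso_search(fitness, bounds, n_particles=20, n_iter=50,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise fitness(*position) inside box `bounds` with a basic particle swarm."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Step 3: fitness gives each particle's individual best and the global best.
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(*p) for p in pos]
    best_i = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[best_i][:], pbest_val[best_i]
    for _ in range(n_iter):  # Step 5: end condition is a fixed iteration count.
        # Step 4: move every particle using the bests from the previous pass.
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
        # Back to step 3: re-score the particles and refresh the bests.
        for i in range(n_particles):
            val = fitness(*pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
            if val < gbest_val:
                gbest, gbest_val = pos[i][:], val
    return tuple(gbest), gbest_val

def toy_error(C, g):
    # Hypothetical error surface standing in for cross-validated SVR error.
    return (C - 10.0) ** 2 / 100.0 + (g - 0.5) ** 2

# Assumed search range: C in [0.1, 100], g in [0.01, 10].
(C1, g1), err1 = pso_search(toy_error, bounds=[(0.1, 100.0), (0.01, 10.0)])
```

In a real system the fitness would train an SVR with the candidate (C, g) and return its validation error; the returned pair plays the role of (C, g)_1.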
Further, in the second parameter optimization method of the second prediction and optimization module, a grid search with a preset step length is carried out over a second search interval around the parameters (C, g)_1 to determine the final optimal parameter result (C, g)_2.
The second search interval is determined as follows: starting from the parameters (C, g)_1, the search interval is expanded by the preset step length until the first search condition and the second search condition are both met;
the first search condition is that the expanded search interval is larger than the parameter interval within which the first optimization, by the particle swarm algorithm, could not escape a local optimum;
the second search condition is that the time taken by the second parameter optimization over the expanded search interval does not exceed a preset maximum optimization time.
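The fine search around the particle swarm result can be sketched as a fixed-step grid with a time budget. This is a simplified illustration: `toy_error`, the centre point, the step sizes, the half-widths and the time limit are all assumed values, and the interval is given directly rather than grown step by step as the text describes.

```python
import itertools
import time

def grid_refine(fitness, center, step, half_width, bounds, max_seconds=5.0):
    """Score every point of a fixed-step grid centred on `center`; keep the best."""
    axes = []
    for c, s, h, (lo, hi) in zip(center, step, half_width, bounds):
        n = round(h / s)  # number of grid steps on each side of the centre
        pts = [c + k * s for k in range(-n, n + 1)]
        axes.append([p for p in pts if lo <= p <= hi])
    start = time.monotonic()
    best, best_val = tuple(center), fitness(*center)
    for point in itertools.product(*axes):
        # Second search condition: never exceed the preset optimization time.
        if time.monotonic() - start > max_seconds:
            break
        val = fitness(*point)
        if val < best_val:
            best, best_val = point, val
    return best, best_val

def toy_error(C, g):
    # Hypothetical error surface standing in for cross-validated SVR error.
    return (C - 10.0) ** 2 / 100.0 + (g - 0.5) ** 2

# Assumed particle swarm result (C, g)_1 = (9.3, 0.62), refined on a fine grid.
best2, val2 = grid_refine(toy_error, center=(9.3, 0.62), step=(0.1, 0.01),
                          half_width=(1.0, 0.2),
                          bounds=[(0.1, 100.0), (0.01, 10.0)])
```

Because the grid always includes its centre, the refined result can never be worse than the starting point.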
Further, the preprocessing module comprises a feature data cleaning module, a first feature data extraction module, a second feature data extraction module and a material historical data set module;
the feature data cleaning module is used for cleaning the data and, based on outlier detection with kernel K-Means clustering, processing abnormal values in the material demand data of the historical project data and in the corresponding project attribute data;
the first feature data extraction module is used for screening the data to obtain the material demand feature data and the corresponding project attribute feature data, including year, project name, material code and investment amount;
the second feature data extraction module is used for performing semantic recognition, text analysis and word segmentation on the project names in the project attribute data based on natural language processing, and extracting the feature data contained in the project names, including region, site, industry type and engineering type;
the material historical data set module is used for combining the material demand feature data with the corresponding feature data, including year, region, site, industry type, project name, material code and investment amount, into the material historical data set.
Further, the processing of abnormal values of the project attribute data in the feature data cleaning module comprises:
1) Clustering the historical project data with kernel K-Means: the data set D is divided into K clusters, where D_a denotes a cluster and c_a its cluster center, a = 1, …, K, and a indexes the cluster to which each feature data item in the material historical data belongs;
2) For each missing data item A_b (a feature data item in the material historical data whose value is null), finding the cluster D_a in which it lies and searching D_a for the data item A_c most similar to A_b, where b ≠ c and neither b nor c exceeds the number of data items in D_a; filling A_b with the attributes of A_c; repeating until all missing data are filled, which gives the filled data set F_filled;
3) Setting a threshold on the objective function for each cluster in F_filled and deleting the data objects one at a time; if deleting a data object reduces the objective function significantly, marking it as an outlier and adding it to the data set B;
4) Checking whether the data set B contains any filled missing values; if so, returning to 1) and performing the clustering and filling again; if not, ending;
5) Once the repetitions in 4) reach the set iteration threshold, filling the remaining missing data with the cluster mean and ending.
Further, the objective function is
[Equation image BDA0003115868450000041 is not reproduced in this text]
where dist is the kernel distance from the missing value to each cluster; x ∈ c_a, a = 1, …, K.
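The equation itself is missing from this text. One form consistent with the surrounding definitions, assuming the standard kernel K-Means sum of squared kernel distances (an assumption, not confirmed by the source), would be:

```latex
J = \sum_{a=1}^{K} \sum_{x \in D_a} \operatorname{dist}(x, c_a)^{2},
\qquad
\operatorname{dist}(x, c_a)^{2}
  = \kappa(x, x)
  - \frac{2}{|D_a|} \sum_{y \in D_a} \kappa(x, y)
  + \frac{1}{|D_a|^{2}} \sum_{y \in D_a} \sum_{z \in D_a} \kappa(y, z)
```

Here κ denotes the kernel function and |D_a| the number of objects in cluster D_a; both symbols are introduced for illustration only.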
Further, the second feature data extraction module comprises a natural language processing module and a label encoding module;
the natural language processing module is used for performing semantic recognition, text analysis and word segmentation on the project names in the project attribute data using Jieba word segmentation, and extracting the text-type feature data contained in the project names, including region, site, industry type and engineering type;
the label encoding module is used for label-encoding the region, site, industry type and engineering type respectively, converting them into numerical feature data.
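Label encoding of the extracted text features can be sketched as below. The field names and the sample values are hypothetical, and the word segmentation step is assumed to have already produced them; codes are assigned in sorted order, which is one common convention rather than something the source specifies.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Hypothetical text features already extracted from project names.
records = [
    {"region": "Futian", "engineering_type": "cable"},
    {"region": "Nanshan", "engineering_type": "overhead line"},
    {"region": "Luohu", "engineering_type": "cable"},
    {"region": "Futian", "engineering_type": "substation"},
]

# One encoder per text column, so each feature keeps its own single dimension.
encoders = {
    field: fit_label_encoder(r[field] for r in records)
    for field in ("region", "engineering_type")
}
encoded = [{f: encoders[f][r[f]] for f in encoders} for r in records]
```

Unlike one-hot encoding, this keeps one numeric column per text feature, which matches the stated goal of preserving the dimensionality of the original data.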
Further, the material type screening module comprises a sorting module and a screening module;
the sorting module is used for ranking the materials in the historical projects by importance according to their demand frequency and/or their share of the value amount;
the screening module is used for screening out the material types to be predicted according to the ranking from the sorting module and a set threshold.
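The ranking and screening can be sketched as follows. The record layout, the sample figures and both thresholds are hypothetical; demand frequency is taken here as the fraction of projects that use a material, and value share as its fraction of the total amount, which is one reasonable reading of the text.

```python
from collections import defaultdict

def screen_materials(rows, min_freq, min_value_share):
    """Rank materials by demand frequency and value share, then apply thresholds."""
    projects = {r["project"] for r in rows}
    seen_in = defaultdict(set)   # material -> projects that requested it
    amount = defaultdict(float)  # material -> total value amount
    for r in rows:
        seen_in[r["material"]].add(r["project"])
        amount[r["material"]] += r["amount"]
    total = sum(amount.values())
    scored = [(m, len(seen_in[m]) / len(projects), amount[m] / total)
              for m in amount]
    scored.sort(key=lambda t: (t[1], t[2]), reverse=True)
    # "and/or" in the text is read here as: either criterion is enough.
    return [m for m, freq, share in scored
            if freq >= min_freq or share >= min_value_share]

# Hypothetical historical records: one row per project and material.
rows = [
    {"project": "P1", "material": "cable", "amount": 100.0},
    {"project": "P1", "material": "transformer", "amount": 300.0},
    {"project": "P1", "material": "clamp", "amount": 5.0},
    {"project": "P2", "material": "cable", "amount": 120.0},
    {"project": "P2", "material": "clamp", "amount": 4.0},
    {"project": "P3", "material": "cable", "amount": 90.0},
    {"project": "P3", "material": "transformer", "amount": 250.0},
    {"project": "P4", "material": "cable", "amount": 110.0},
    {"project": "P4", "material": "meter", "amount": 21.0},
]
selected = screen_materials(rows, min_freq=0.75, min_value_share=0.2)
```

With these figures, cable passes on frequency and transformer passes on value share, while clamp and meter fall below both thresholds.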
The invention can realize at least one of the following beneficial effects:
Semantic recognition, word segmentation and information extraction are carried out with natural language processing, so that key information is extracted from the project name and used as feature variables; this addresses the problem that there are many original variables and the main ones are hard to identify accurately. Text-type variables are converted to numbers by label encoding (Label Encoding), which keeps the dimensionality of the original data, and are used as input variables of the model, which solves the model input problem.
A power grid material demand prediction model is established, making full use of the ability of machine learning to handle multiple inputs and multiple outputs to predict power grid material demand; optimizing the support vector regression model effectively reduces the computing resources required and improves prediction efficiency and accuracy.
Support vector regression (SVR) has particular advantages for small-sample regression: it needs neither a large number of samples nor empirical assumptions, and it can avoid overfitting while minimizing both empirical risk and structural risk. In particular, refining the model with multi-granularity grid search and the particle swarm algorithm further improves its prediction accuracy.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a schematic connection diagram of a distribution network material demand prediction system in an embodiment of the present invention;
FIG. 2 is a flow chart of optimizing SVR parameters in an embodiment of the present invention;
FIG. 3 is a diagram illustrating the statistics of the proportion of the investment amount of the material in the embodiment of the present invention;
FIG. 4 is a diagram illustrating the result of forecasting the investment amount of the material in the embodiment of the present invention;
FIG. 5 is a graph comparing the predicted material investment amount with the actual investment amount in an embodiment of the present invention;
FIG. 6 is a comparison of the predicted results of the first type of material in the embodiment of the present invention;
FIG. 7 is a comparison graph of the predicted results of the second type of material in the embodiment of the present invention;
FIG. 8 is a comparison diagram of the predicted result of the third kind of material in the embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and which together with the embodiments of the invention serve to explain the principles of the invention.
An embodiment of the present invention discloses a distribution network material demand prediction system based on feature extraction and an improved SVR model which, as shown in FIG. 1, includes:
a data import module for importing historical project data for a selected time period, the historical project data comprising material demand data and the project attribute data corresponding to the material demand;
a preprocessing module for extracting data from the imported historical project data to obtain a material historical data set;
a data grouping module for grouping the material historical data set according to an industry and engineering classification knowledge base to obtain grouped material historical data;
a material type screening module for ranking the materials in the historical projects by importance according to the project attribute data and screening the material types to be predicted;
an SVR model training module for training a corresponding SVR model for each selected material to be predicted; and
a prediction module for acquiring the project attribute data of a material according to the input material type and feeding it to the corresponding SVR model to predict the demand quantity of that material.
Specifically, historical project material data for a given range of years are extracted from the distribution network material project information base through the interactive window of the human-computer interface of the data import module. For example, when 2010-2019 is entered or selected in the interactive window, the data import module extracts the historical project material data for the 10 years from 2010 to 2019 from the distribution network material project information base. The information base contains data such as the project historical material demand list, the project comprehensive material receiving table, the project historical balance and the demand summary results; the comprehensive material receiving table, which contains valid historical data, is selected as the historical project material data.
Specifically, the preprocessing module includes a feature data cleaning module, a first feature data extraction module, a second feature data extraction module and a material historical data set module;
the feature data cleaning module is used for cleaning the data and, based on outlier detection with kernel K-Means clustering, processing abnormal values in the material demand data of the historical project data and in the corresponding project attribute data;
the first feature data extraction module is used for screening the data to obtain the material demand feature data and the corresponding project attribute feature data, including year, project name, material code and investment amount;
the second feature data extraction module is used for performing semantic recognition, text analysis and word segmentation on the project names in the project attribute data based on natural language processing, and extracting the feature data contained in the project names, including region, site, industry type and engineering type;
the material historical data set module is used for combining the material demand feature data with the corresponding feature data, including year, region, site, industry type, project name, material code and investment amount, into the material historical data set.
More specifically, the feature data cleaning module can clean the data manually or automatically. Since the prediction targets distribution network materials, only the data records of distribution network projects are retained. On this basis, duplicate records are checked and records with identical content are merged; blank records are deleted, as are records that differ markedly from those of other years.
For missing values, manual cleaning is used: a missing investment amount is filled with the mean investment amount of the same project, and a missing material demand is filled with the mean material demand of the same project.
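Although the text describes this filling as a manual step, the same rule is easy to express in code. The sketch below assumes a hypothetical record layout in which a missing value is stored as `None`; the field names and figures are illustrative only.

```python
from statistics import mean

def fill_with_project_mean(rows, field):
    """Return copies of `rows` with None in `field` replaced by the project mean."""
    known = {}
    for r in rows:
        if r[field] is not None:
            known.setdefault(r["project"], []).append(r[field])
    filled = []
    for r in rows:
        r = dict(r)  # copy, so the input records are left untouched
        if r[field] is None and r["project"] in known:
            r[field] = mean(known[r["project"]])
        filled.append(r)
    return filled

# Hypothetical records; None marks a missing investment amount.
rows = [
    {"project": "A", "investment": 10.0},
    {"project": "A", "investment": None},
    {"project": "A", "investment": 20.0},
    {"project": "B", "investment": None},
]
filled = fill_with_project_mean(rows, "investment")
```

A project with no known values at all, like project B here, is left unfilled rather than guessed.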
For abnormal or erroneous values, a local clustering method is used to detect anomalies in the investment amount and material demand of distribution network materials, and the records containing abnormal values are then refilled. Using a clustering algorithm improves the accuracy of outlier detection for material demand and investment amount, reduces time and space complexity, and keeps the computational cost relatively low.
Because the distribution network data are multidimensional, the idea is to convert them into one-dimensional data and then detect abnormal values by examining the relationship between each distribution network material data point and the material project clusters. Specifically, outlier detection is based on kernel K-Means clustering: a kernel method is added to K-Means outlier detection, so that the data are mapped with a kernel function, outliers are detected in the high-dimensional space, and the kernel distance serves as the similarity measure. When the kernel K-Means clustering algorithm is used for outlier detection, the focus is on the chosen objective function: whether a data object is an outlier is decided by how much the objective function changes when the object is added or deleted. In the clustering algorithm, outliers are defined as follows:
If a data object x does not belong to any cluster D_a, x is defined as an outlier, and outlier detection can be expressed as finding the objects that cause an abnormal change in the objective function
[Equation image BDA0003115868450000071 is not reproduced in this text]
where dist is the kernel distance from the missing value to each cluster; x ∈ c_a, a = 1, …, K (a indexes the cluster to which each feature in the material historical data belongs, and c_a is the center of the a-th cluster). If dist increases significantly when an object x is added, x can be judged to be an outlier.
The attribute values of an outlier deviate significantly from the expected or common attribute values. Thus, if a filled value is detected as an outlier, the filled value is very likely inaccurate. Based on this property, after the data are filled, kernel K-Means clustering is used to detect outliers and to check whether each outlier is a filled missing value, with
[Equation image BDA0003115868450000072 is not reproduced in this text]
as the decision function. The filled values confirmed as outliers are extracted, the missing data set is reconstructed, and the data filling algorithm is iterated until no filled value is detected as an outlier.
More specifically, the present embodiment processes abnormal values of the material demand data and the project attribute data corresponding to the material demand in the historical project data based on outlier detection of kernel K-Means clustering, including:
1) Cluster using kernel K-Means, dividing the historical project data into individual clusters; the clustered data set is denoted D and is divided into K clusters, each denoted $D_a$ with corresponding cluster center $c_a$, $a = 1, \dots, K$;
2) Find the cluster $D_a$ containing the missing data $A_b$, and search $D_a$ for the data $A_c$ most similar to $A_b$, where $b \neq c$ and neither b nor c exceeds the number of data in $D_a$; fill $A_b$ with the attributes of $A_c$, and repeat until all missing data are filled, giving the filled data set $F_{fill}$;
3) For each cluster in the data set $F_{fill}$, set a threshold on the objective function and delete the data objects one by one; if deleting a data object significantly reduces the objective function, mark that object as an outlier and add it to the data set B;
4) Check whether the data set B contains any filled missing values; if so, return to 1) and cluster and fill again; if not, end;
5) If the iteration in step 4) reaches the set iteration threshold, fill the remaining missing data with the cluster mean and end.
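Step 3) can be sketched as a delete-one test on a single cluster. This is a simplified illustration, not the embodiment's implementation: it assumes one-dimensional data, substitutes the plain sum of squared errors for the kernel objective function, and expresses the threshold as a relative reduction; the function names and sample values are hypothetical.

```python
def sse(points):
    """Sum of squared errors of 1-D points around their mean."""
    if not points:
        return 0.0
    mean = sum(points) / len(points)
    return sum((p - mean) ** 2 for p in points)

def deletion_outliers(points, threshold):
    """Return the points whose removal reduces the cluster SSE by more than
    `threshold`, expressed as a fraction of the original SSE (an assumption)."""
    base = sse(points)
    flagged = []
    for i, p in enumerate(points):
        rest = points[:i] + points[i + 1:]
        drop = (base - sse(rest)) / base if base > 0 else 0.0
        if drop > threshold:
            flagged.append(p)
    return flagged

# Hypothetical demand values in one cluster; 40.0 stands in for a bad filled value.
cluster = [10.0, 11.0, 9.5, 10.5, 40.0]
print(deletion_outliers(cluster, threshold=0.5))
```

In the full procedure, any flagged object that is also a filled value would be sent back to step 1) for re-clustering and re-filling.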
In the K-Means algorithm, deleting objects far from the center of their cluster significantly reduces the cluster's sum of squared errors (SSE), which improves the accuracy of outlier detection for material demand and investment amount. For duplicate values, records judged to have the same attributes are merged into a single record in the database; for example, records that express the same material or project name in different ways can be merged.
The first feature data extraction module is used to screen data; it can screen the cleaned data automatically or manually, and retains only the 6 useful dimensions related to power grid material demand prediction: 'year', 'project name', 'material code', 'material name', 'material demand amount', and 'investment amount'.
The second characteristic data extraction module is used for carrying out semantic recognition, text analysis and word segmentation processing on the project name of the project attribute data based on natural language processing, and extracting characteristic data including regions, sites, industry types and engineering types contained in the project name;
specifically, the 'project name' retained by the first feature data extraction module contains many factors that influence power grid material demand, so semantic recognition, word segmentation, and information extraction are performed on the project names in the cleaned and screened data, and the key information is extracted as feature variables, which provides the precondition for the feature encoding described below.
More specifically, the second feature data extraction module comprises a natural language processing module and a tag coding module;
the natural language processing module performs semantic recognition, text analysis, and word segmentation on the project name of the project attribute data using the Jieba word segmentation tool, and extracts text-type feature data including the region, site, industry type, and engineering type contained in the project name;
the project name is segmented with Jieba into four parts (project service area, project service specific place, project service object, and project service content), from which the 4 corresponding features are extracted: region, site, industry type, and engineering type.
The specific extraction method comprises the following steps:
the 'region' feature takes the administrative districts of Shenzhen as its values;
the 'site' feature takes the transportation sites of each district in Shenzhen as its values;
the 'industry type' feature takes as its values the industries to which the service objects of each Shenzhen Power Supply Bureau project belong; the industries are divided according to the 2017 edition of the 'national economy industry classification' table;
the 'engineering type' feature mainly takes the categories of project service content as its values, extracted from the last fields of the project name.
For a conventional project, taking the project name 'Longhua/Dalang station/F15 south China beautiful garden line #01719 Longhua power supply station/district reforming project' as an example, the word segmentation result is: region 'Longhua'; site 'Dalang station'; industry type 'power, heat, gas and water production and supply industry'; engineering type 'district reforming project';
for a project whose industry type is that of the company served, taking 'Shenzhen Guangming New zone// Priseko precision materials (Shenzhen) Limited company/industry expansion assembly project' as an example, the industry type of the company is obtained from the Qichacha or Tianyancha enterprise information website using a Python crawler.
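The field layout of the two example names can be illustrated with a much simpler technique than Jieba segmentation. The sketch below assumes the project name is slash-delimited exactly as in the translated examples, which may not hold for the original Chinese names; the function name and dictionary keys are hypothetical. The industry type is not derived here because, as described above, it requires a classification lookup.

```python
def parse_project_name(name):
    """Split a slash-delimited project name into illustrative fields.

    Assumes the layout region/site/service object/engineering type seen in the
    two examples above; this is plain string splitting, not Jieba segmentation.
    """
    parts = [p.strip() for p in name.split("/")]
    if len(parts) < 4:
        raise ValueError("expected at least four slash-separated fields")
    return {
        "region": parts[0],
        "site": parts[1] or None,                 # empty when the name contains '//'
        "service_object": "/".join(parts[2:-1]),
        "engineering_type": parts[-1],            # taken from the last field
    }

conventional = parse_project_name(
    "Longhua/Dalang station/F15 south China beautiful garden line #01719 "
    "Longhua power supply station/district reforming project"
)
company = parse_project_name(
    "Shenzhen Guangming New zone// Priseko precision materials (Shenzhen) "
    "Limited company/industry expansion assembly project"
)
print(conventional["region"], conventional["site"], conventional["engineering_type"])
print(company["region"], company["site"], company["service_object"])
```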
The label encoding module label-encodes the regions, sites, industry types, and engineering types respectively, converting them into numerical feature data.
Because the input of the prediction model must be numerical, the text features need to be converted into numerical features. The specific method is label encoding (Label Encoding), that is, serialized label encoding, applied separately to the text-type features such as region, site, industry type, and engineering type. This not only preserves the dimensionality of the original data, but also allows these text features to be used together with the other numerical features as input variables of the prediction model. The following tables show the features of several randomly drawn projects and their label-encoded results.
TABLE 1 characteristic Properties of raw data
Figure BDA0003115868450000091
TABLE 2 tag-encoded feature attributes
Figure BDA0003115868450000101
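A minimal sketch of label encoding follows, assuming each distinct text value is assigned an integer in sorted order. The rows are hypothetical and are not the values shown in Tables 1 and 2; the function names are illustrative.

```python
def fit_label_encoder(values):
    """Map each distinct text value to an integer code, assigned in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

def encode_rows(rows, columns):
    """Label-encode the given text columns; other columns are left unchanged."""
    encoders = {c: fit_label_encoder(r[c] for r in rows) for c in columns}
    encoded = [{**r, **{c: encoders[c][r[c]] for c in columns}} for r in rows]
    return encoded, encoders

# Hypothetical project records with two text features and one numerical feature.
rows = [
    {"region": "Longhua", "engineering_type": "district reforming project", "investment": 120.0},
    {"region": "Guangming", "engineering_type": "industry expansion project", "investment": 80.0},
    {"region": "Longhua", "engineering_type": "industry expansion project", "investment": 50.0},
]
encoded, encoders = encode_rows(rows, ["region", "engineering_type"])
print(encoders["region"])
print(encoded[0])
```

The number of columns is unchanged after encoding, which is the dimensionality-preserving property mentioned above.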
The material historical data set module combines the output data of the first feature data extraction module and the second feature data extraction module, forming a material historical data set from the material demand feature data and the corresponding feature data including year, region, site, industry type, project name, material code, and investment amount.
Specifically, the data grouping module groups the project material historical data set, based on the established industry and engineering classification knowledge bases, to obtain grouped material historical data: the project types P, industry types Q, and engineering types M in the project knowledge base, the industry knowledge base, and the engineering knowledge base are freely combined to obtain the grouped historical material data. For example, the materials belonging to P1, Q1, M1 form group G1, the materials belonging to P1, Q1, M2 form group G2, and so on, giving P × Q × M groups as the input of the subsequent model.
More specifically, industry and engineering knowledge bases are established according to division standards.
And an industry and engineering classification knowledge base in the data grouping module is established according to characteristic data including industry types and engineering types in the project material historical data set.
Preferably, all industry types and engineering types included in the project material historical data set and project types included in the project material historical data set are counted; establishing a project knowledge base according to different classifications of the counted project types; establishing an industry knowledge base according to different classifications of the counted industry types; and establishing an engineering knowledge base according to different classifications of the statistical engineering types.
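The grouping by combinations of project type, industry type, and engineering type can be sketched as follows. The knowledge base contents, field names, and material codes are hypothetical placeholders.

```python
from collections import defaultdict
from itertools import product

def group_materials(records):
    """Group material codes by (project type, industry type, engineering type)."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["project_type"], rec["industry_type"], rec["engineering_type"])
        groups[key].append(rec["material_code"])
    return dict(groups)

# Hypothetical knowledge bases: 2 project types, 3 industry types, 2 engineering types.
project_types = ["P1", "P2"]
industry_types = ["Q1", "Q2", "Q3"]
engineering_types = ["M1", "M2"]
all_keys = list(product(project_types, industry_types, engineering_types))

records = [
    {"project_type": "P1", "industry_type": "Q1", "engineering_type": "M1", "material_code": "MAT-A"},
    {"project_type": "P1", "industry_type": "Q1", "engineering_type": "M2", "material_code": "MAT-B"},
    {"project_type": "P1", "industry_type": "Q1", "engineering_type": "M1", "material_code": "MAT-C"},
]
groups = group_materials(records)
print(len(all_keys), groups)
```

Only the combinations that actually occur in the historical data produce non-empty groups; the product gives the upper bound of P × Q × M groups.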
Specifically, the predicted material type screening module comprises a sorting module and a screening module;
the sorting module ranks the materials by importance according to their demand frequency ratio and/or value amount ratio in historical projects;
the project attribute data comprises material demand characteristic data and data corresponding to the material demand characteristic data, including year, project name, material code and investment amount;
the sorting comprises the following steps:
sorting I, sorting the importance of the materials according to the demand frequency ratio of the materials in the historical project;
the demand frequency is determined from the material demand in the project attribute data and the corresponding year, project name, and material code; the historical demand frequency ratio of each material in each project is calculated from the specific year, project, material type, and demand data.
Sorting II, sorting the importance of the materials according to the value ratio of the materials in the historical project;
the value amount is determined from the material demand in the project attribute data and the corresponding year, project name, material code, and investment amount; the historical value amount ratio of each material in each project is calculated from the specific year, project, material type, investment amount, and demand data.
And the screening module is used for screening out the material types needing to be predicted according to the sorting result of the sorting module and the set threshold value.
More specifically, the screening module classifies and ranks the distribution network materials by consumption quantity and value, exploiting the regularity between material quantity and value ratio that 'a few materials account for most of the funds, while most materials account for only a small share of the funds';
the materials that are few in quantity but large in value ratio are classified as the first type, the materials that are many in quantity but small in value ratio as the second type, and the materials between these two as the third type; according to the set thresholds, the materials are sorted into three material data sets, which are taken as the screened materials to be predicted.
Specifically:
first, the materials whose value ratio ranks within a first threshold (top 30) and whose demand frequency ratio ranks within a second threshold (top 30) form the first material data set;
second, the materials whose value ratio ranks within the first threshold (top 30) form the second material data set, with a cumulative ratio of 78.65%;
third, the materials whose demand frequency ratio ranks within the second threshold (top 30) form the third material data set, with a cumulative ratio of 63.39%.
In the historical project data in the embodiment, the three material data sets selected from the material library by the screening module comprise 46 materials such as a cable working well prefabricated part, a polyethylene plastic-coated flared steel pipe, a power distribution terminal safety module and the like; and then, respectively carrying out predictive analysis on 46 materials of the three types of materials so as to control the three types of materials by adopting different management methods at a later stage.
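The screening by rank thresholds can be sketched as below. The share values and material labels are hypothetical, and the example uses a threshold of 2 instead of 30 to keep the data small; treating the three data sets as possibly overlapping sets is an assumption.

```python
def top_n(share_by_material, n):
    """Return the n materials with the largest share."""
    ranked = sorted(share_by_material, key=share_by_material.get, reverse=True)
    return set(ranked[:n])

def screen_materials(value_share, freq_share, n=30):
    """Build the three data sets from value-ratio and demand-frequency-ratio ranks."""
    top_value = top_n(value_share, n)
    top_freq = top_n(freq_share, n)
    return {
        "both": top_value & top_freq,     # top n by value ratio and by frequency ratio
        "value": top_value,               # top n by value ratio
        "frequency": top_freq,            # top n by demand frequency ratio
        "to_predict": top_value | top_freq,
    }

# Hypothetical shares for four materials.
value_share = {"A": 0.50, "B": 0.30, "C": 0.15, "D": 0.05}
freq_share = {"A": 0.10, "B": 0.40, "C": 0.20, "D": 0.30}
result = screen_materials(value_share, freq_share, n=2)
print(result)
```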
The SVR model training module comprises a data set partitioning module, a support vector machine modeling configuration module and a model prediction and optimization module;
The data set dividing module divides the data set into a training set and a verification set. Specifically, 80% of the data are used as the training set and 20% as the verification set.
The support vector machine modeling configuration module configures the model parameters, including the number of cross-validation folds, the kernel function, and the parameters (C, g) of the kernel function. Specifically, the number of cross-validation folds is 10; the kernel function is the Gaussian radial basis function (rbf); the parameter C is the penalty on errors and g is the coefficient of the kernel function. Both parameters are determined by the two parameter optimization algorithms described below, and the initial interval for each is set to $[2^{-8}, 2^{8}]$.
And the model prediction and optimization module is used for optimizing the parameters (C, g) of the SVR model and improving the prediction precision of the demand and the generalization capability of the model. In this embodiment, a particle swarm algorithm and a grid search method are combined to optimize parameters C and g of the SVR model. FIG. 2 is a flow chart of optimization of SVR parameters by a combination of particle swarm optimization and grid search.
(1) Particle swarm algorithm
Particle Swarm Optimization (PSO) is a swarm-intelligence optimization algorithm in the field of intelligent computing, inspired by the predation behavior of bird flocks. For an optimization problem, the particle swarm algorithm first randomly initializes, in a T-dimensional solution space, a population $X = \{X_1, X_2, \dots, X_f\}$ composed of f particles; the position of each particle can be written $X_t = (x_{t1}, x_{t2}, \dots, x_{tT})^T$ ($t = 1, 2, \dots, f$), where t indexes the particle and T is the dimension of the solution space. The position of each particle represents a candidate solution to the problem being optimized; substituting it into the objective function gives the fitness value of the particle. Each particle then searches iteratively for new solutions by continually updating its position; after each iteration, the best solution found by the particle itself, $p_{td}$, and the global best solution found by the whole population, $g_{td}$, are obtained. Before the next iteration, each particle updates its velocity and position in the solution space according to the following formulas:
Velocity: $v_{td}^{k+1} = \omega v_{td}^{k} + C_1 r_1 (p_{td}^{k} - x_{td}^{k}) + C_2 r_2 (g_{td}^{k} - x_{td}^{k})$ (1)
Position: $x_{td}^{k+1} = x_{td}^{k} + v_{td}^{k+1}$ (2)
where d = 1, 2 denotes the d-th dimension of the particle, that is, the first dimension is the parameter C and the second dimension is the parameter g; k is the current iteration number; $\omega$ is the inertia weight; $v_{td}^{k+1}$ is the updated velocity of the t-th particle in the d-th dimension at iteration k+1; $C_1$ and $C_2$ are learning factors; $r_1$ and $r_2$ are random numbers distributed in the interval [0, 1]; $p_{td}$ is the historical best position found by the t-th particle individually; and $g_{td}$ is the historical best position of the whole population, that is, the best position found by all particles. These parameters are set according to the embodiment. In addition, to prevent the particles from searching blindly, their velocity and position are limited to fixed ranges: $[-v_{max}, v_{max}]$ and $[-x_{max}, x_{max}]$.
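One update of a single particle under the velocity and position formulas can be sketched as follows. The function and argument names are hypothetical, and the position bounds are passed as a general interval `[x_min, x_max]` (an assumption) so that the same sketch works for a positive search range such as $[2^{-8}, 2^{8}]$.

```python
import random

def clamp(value, lo, hi):
    """Limit value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))

def pso_step(x, v, p_best, g_best, omega, c1, c2, v_max, x_min, x_max, rng=None):
    """Apply the velocity update, then the position update, to one particle."""
    rng = rng or random
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        vd = (omega * v[d]
              + c1 * r1 * (p_best[d] - x[d])    # pull toward the particle's own best
              + c2 * r2 * (g_best[d] - x[d]))   # pull toward the population best
        vd = clamp(vd, -v_max, v_max)
        new_v.append(vd)
        new_x.append(clamp(x[d] + vd, x_min, x_max))
    return new_x, new_v

# Hypothetical particle: dimension 0 is C, dimension 1 is g.
x, v = pso_step([4.0, 0.5], [0.0, 0.0], [8.0, 0.25], [16.0, 0.125],
                omega=0.9, c1=1.5, c2=1.5, v_max=2.0,
                x_min=2 ** -8, x_max=2 ** 8, rng=random.Random(0))
print(x, v)
```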
(2) Grid search method
The grid search method is an exhaustive method: a number of grid divisions are taken along each dimension of the parameter space, and all grid intersections in the space are evaluated to obtain the optimal solution. The algorithm first divides the value range of each parameter at fixed intervals to obtain multiple parameter combinations; each combination is evaluated once, cross validation is applied to obtain the prediction error of each combination, and the combination with the smallest prediction error gives the optimal parameter values. The advantage of the method is that the parameter combinations can be evaluated in parallel, which gives high computational efficiency and improves optimization accuracy and speed. Its disadvantage is that grid search traverses every grid point, which produces a large number of unnecessary calculations and causes the operation time to grow as a power of the number of grid divisions.
Because the grid search algorithm covers too many invalid search regions and its running time depends strongly on the step size, this embodiment combines the Particle Swarm Optimization (PSO) algorithm with the grid search method (Grid-Search) to optimize the parameters C and g of the SVR. The particle swarm algorithm is first used for a coarse search over a large range to determine a preliminary optimal parameter range; the grid search method is then used for a precise search with a small step size over a small range, as a secondary parameter optimization, and the optimal parameter combination (C, g) is finally determined. This avoids searching invalid regions and allows the algorithm to locate the global optimal solution quickly.
The prediction and optimization module specifically comprises a first prediction and optimization module and a second prediction and optimization module;
the first prediction and optimization module performs the first optimization of the SVR parameters (C, g) using the particle swarm algorithm, quickly searching for and locating the optimal parameter interval and obtaining the optimized parameters $(C, g)_1$. If multiple groups $(C, g)_1$ attain the highest prediction accuracy, the group with the smallest parameter C is selected.
The first optimization method in the first prediction and optimization module is implemented as follows:
1) Carrying out normalization processing on the divided training set and test set of the distribution network material data;
2) Set the initial range of the parameters (C, g) to be searched, the initial number of evolution generations and the population size of the particle swarm algorithm, the initial value of the inertia factor, and the maximum number of iterations. The effective range of the parameters C and g is $10^{-8}$ to $10^{8}$, so the initial search range of (C, g) is set to $C \in [2^{-8}, 2^{8}]$, $g \in [2^{-8}, 2^{8}]$. For material demand prediction, the population size is set to 20 and the initial number of evolution generations to 200; the learning factor $C_1$ is initially 1.5, and $C_2$ is likewise given an initial value; the inertia weight factor $\omega$ that multiplies the velocity in the velocity update formula (1) is initially set to 0.9, which gives PSO strong global search ability; $\omega$ decreases as the iterations proceed, which gives PSO strong local search ability, and $\omega = 1$ when the iteration ends. $r_1$ and $r_2$ are random numbers in the interval [0, 1]. The maximum number of iterations is set to 200.
3) Determining the individual optimal position and the global optimal position of the particle by calculating the fitness of the particle;
the fitness of the particles is used to determine the best position found by each particle, $p_{td}^{k}$ (the individual optimal solution), and the best position found by the population, $g_{td}^{k}$ (the population optimal solution). The mean square error (MSE) or the $R^2$ statistic is generally chosen as the fitness value, that is, the optimization objective function value of the SVR; in this embodiment, the best accuracy of the model constructed for each material is used as the fitness value of each particle.
4) Performing iterative operation, and updating the speed and the position of the current particle according to the individual optimal position and the global optimal position of the particle determined last time to obtain corresponding parameters (C, g);
specifically, to update the velocity and position of a particle, the individual optimal direction $p_{td}^{k} - x_{td}^{k}$ and the population optimal direction $g_{td}^{k} - x_{td}^{k}$ are determined from the individual optimal position and the global optimal position found in step 3); combined with the inertia direction $v_{td}^{k}$, formula (1) and formula (2) are used to update the velocity $v_{td}^{k+1}$ and position $x_{td}^{k+1}$ of each particle and obtain the corresponding parameters (C, g). Positions that cross the boundary must be adjusted; in this embodiment, the boundary value is taken when the boundary is crossed. Finally, the updated historical optimal position $p_{td}^{k+1}$ and global optimal position $g_{td}^{k+1}$ are obtained.
5) Determine whether the iteration end condition is met; if so, end the iteration and output the parameters (C, g) as the optimal parameter values $(C, g)_1$ of the particle swarm parameter optimization; otherwise, return to step 3) and continue the iterative update. The iteration end condition is that the maximum number of iterations is reached.
Through the above process, the optimal parameter values $(C, g)_1$ of the first (particle swarm) parameter optimization are obtained.
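Steps 2) to 5) can be sketched as a complete loop under stated assumptions. The objective `toy_error` is a hypothetical stand-in for the cross-validated SVR error; the search is carried out over the exponents $(\log_2 C, \log_2 g) \in [-8, 8]$; the value `c2=1.5` and the final inertia weight `w_end=0.4` are illustrative choices that are not taken from the embodiment.

```python
import random

def toy_error(log2_c, log2_g):
    """Hypothetical stand-in for the cross-validated SVR error; minimum at (3, -2)."""
    return (log2_c - 3.0) ** 2 + (log2_g + 2.0) ** 2

def pso_search(objective, lo=-8.0, hi=8.0, particles=20, iterations=200,
               c1=1.5, c2=1.5, w_start=0.9, w_end=0.4, seed=0):
    """Minimize objective(log2_c, log2_g) with a basic particle swarm."""
    rng = random.Random(seed)
    v_max = (hi - lo) / 2.0
    xs = [[rng.uniform(lo, hi) for _ in range(2)] for _ in range(particles)]
    vs = [[0.0, 0.0] for _ in range(particles)]
    p_best = [x[:] for x in xs]
    p_val = [objective(*x) for x in xs]
    g_idx = min(range(particles), key=p_val.__getitem__)
    g_best, g_val = p_best[g_idx][:], p_val[g_idx]
    for k in range(iterations):
        # Inertia weight decreases linearly as the iterations proceed.
        w = w_start - (w_start - w_end) * k / max(1, iterations - 1)
        for t in range(particles):
            for d in range(2):
                r1, r2 = rng.random(), rng.random()
                v = (w * vs[t][d]
                     + c1 * r1 * (p_best[t][d] - xs[t][d])
                     + c2 * r2 * (g_best[d] - xs[t][d]))
                vs[t][d] = max(-v_max, min(v_max, v))
                # Out-of-range positions are set to the boundary value.
                xs[t][d] = max(lo, min(hi, xs[t][d] + vs[t][d]))
            val = objective(*xs[t])
            if val < p_val[t]:
                p_best[t], p_val[t] = xs[t][:], val
                if val < g_val:
                    g_best, g_val = xs[t][:], val
    return g_best, g_val

best, err = pso_search(toy_error)
c_1, g_1 = 2.0 ** best[0], 2.0 ** best[1]   # convert exponents back to (C, g)
print(best, err, c_1, g_1)
```

In practice the objective would train and cross-validate an SVR for each candidate (C, g), which is why limiting the number of evaluations matters.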
The second prediction and optimization module performs the secondary parameter optimization within the optimal parameter interval using the grid search method, obtaining the optimized parameters $(C, g)_2$.
In particular, the secondary parameter optimization starts from $(C, g)_1$: with $(C, g)_1$ as the center of the search interval, the search range is re-determined and narrowed, and the grid search algorithm performs a second, precise optimization with a small step size, gradually escaping the local optimum of the particle swarm algorithm; the best of the resulting groups of optimization results is taken as $(C, g)_2$.
More specifically, the optimal parameters $(C, g)_1$ obtained in the first optimization module are taken as the target point, and the spatial range of the grid search, the search step size, and the other variables are initialized. A suitable interval around $(C, g)_1$ is selected for the small-step search; the step size in a conventional grid search is generally set to 0.1, whereas in the small-step fine search of this embodiment the initial step size is set to 0.05.
The search interval of the second optimization is determined as follows: starting from the parameters $(C, g)_1$, the search interval is expanded by a preset step length until the first search condition and the second search condition are both met;
the first search condition is that the search interval expanded by the preset step length must be larger than the parameter interval within which the first (particle swarm) optimization cannot escape the local optimum. Meeting the first search condition avoids the situation in which the secondary optimization interval is chosen too narrow to escape the local optimum.
The second search condition is that the time taken by the secondary parameter optimization within the expanded search interval does not exceed the preset maximum optimization time. Meeting the second search condition avoids excessive optimization time and low efficiency.
The search range of (C, g) is therefore re-determined by gradually expanding it as a compromise between the first and second search conditions. Once the new (C, g) search interval has been determined, it is passed to the second, small-step precise grid search to determine the optimal parameters $(C, g)_2$.
Finally, the parameters $(C, g)_2$ obtained in this way are substituted back into the kernel function of the support vector machine to establish the SVR model optimized by the particle swarm algorithm and the grid search method.
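The secondary optimization can be sketched as an interval expansion followed by a fine grid search with step 0.05. Several simplifications are assumed: the number of grid evaluations stands in for the preset maximum optimization time, the minimum half-width stands in for the interval required by the first search condition, and the objective is a hypothetical error surface rather than a cross-validated SVR.

```python
def choose_half_width(min_half_width, step, max_points):
    """Expand the half-width by `step` while the grid stays within `max_points`."""
    half = min_half_width
    while (2 * round((half + step) / step) + 1) ** 2 <= max_points:
        half += step
    return half

def fine_grid_search(objective, center, half_width, step=0.05):
    """Evaluate every grid point around `center` and keep the one with lowest error."""
    n = int(round(half_width / step))
    best, best_val = None, float("inf")
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            c = center[0] + i * step
            g = center[1] + j * step
            if c <= 0 or g <= 0:      # C and g must stay positive
                continue
            val = objective(c, g)
            if val < best_val:
                best, best_val = (c, g), val
    return best, best_val

def toy_error(c, g):
    """Hypothetical error surface with its minimum at C = 2.1, g = 0.3."""
    return (c - 2.1) ** 2 + (g - 0.3) ** 2

first_result = (2.0, 0.25)            # stands in for the first-stage result
half = choose_half_width(min_half_width=0.2, step=0.05, max_points=441)
second_result, err = fine_grid_search(toy_error, first_result, half, step=0.05)
print(half, second_result, err)
```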
Compared with the conventional method for optimizing the parameters of the SVR by using a single grid search method and a single particle swarm optimization, the improved SVR model which integrates the two parameter optimization modes can utilize the advantage that the particle swarm optimization converges in an optimal interval quickly, the early-stage search speed of the grid search algorithm is accelerated, and the fine search with small step length is used in the later-stage search, so that the calculation time is shortened, the search speed is higher, and the prediction accuracy is improved.
In addition, the embodiment of the invention also comprises an evaluation module used for evaluating the prediction precision of the model;
Figure BDA0003115868450000151
in the prediction accuracy formula
Figure BDA0003115868450000152
The predicted value obtained by using the prediction model is shown, and y represents the actual value.
Specifically, the prediction module is configured to obtain item attribute data of the material according to the input material type, and import a corresponding SVR model to predict the required amount of the material. The input material types can be input in corresponding input fields on a human-computer interaction interface or selected through a pull-down menu.
In the first prediction modeling, the demand for each material is predicted from the historical project material data.
In practice, however, when the material demand for the next year is to be predicted, the total investment amount corresponding to that demand is given before the prediction; the total investment amount is only an example, and other attribute data corresponding to the material demand may also be given before the prediction.
Based on this, the prediction module of the present embodiment may employ two predictions.
In the first prediction, according to the proportion of the annual investment sum of various materials in the total investment sum, predicting the proportion of the investment sum of the materials in the next year in the total investment sum, and then obtaining a predicted value of the investment sum of the materials in the next year;
in the second prediction, the next-year investment amount obtained in the first prediction is used as input project attribute data of the material to predict the demand for the material in the next year.
The specific steps are as follows: to predict the material demand for 2020, the project quantities and investment amount ratios of the most recent three years, 2017-2019, are used to predict the project quantities and investment amount ratios of each type of project for 2020.
In the first prediction, the prediction is performed,
(1) And counting the proportion of the investment money corresponding to various materials in the historical data in the total investment money.
Fig. 3 takes material 10044001645 as an example and shows the proportion of its annual investment amount in the total investment amount for 2011-2019:
(2) And taking the average value of the investment amount ratios of the material in the last three years as the investment amount ratio of the material in the next year, and further obtaining the investment amount of the material in the next year.
Having counted the proportion of the annual investment amount of material 10044001645 in the total investment amount for each year from 2011 to 2019, the average of these proportions over 2017-2019 is taken as the proportion of the investment amount of the material in the total investment amount for 2020, from which the predicted investment amount of the material for 2020 is obtained; the specific result is the material investment amount prediction shown in Fig. 4;
and comparing the predicted value of the investment amount in the 2020 project attribute data of the material 10044001645 obtained by the method with the real investment amount of the material in the 2020 project attribute data. In particular the predicted amount is compared to the actual amount as illustrated in fig. 5.
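The first prediction is simple arithmetic and can be sketched as follows. The ratios and the total investment amount are hypothetical numbers chosen for illustration; they are not the values shown in Figs. 3 to 5.

```python
def forecast_investment(ratio_by_year, total_next_year, window=3):
    """Average the investment ratios of the last `window` years and apply the
    average to next year's given total investment amount."""
    years = sorted(ratio_by_year)[-window:]
    mean_ratio = sum(ratio_by_year[y] for y in years) / len(years)
    return mean_ratio, mean_ratio * total_next_year

# Hypothetical share of one material in the total investment, by year.
ratios = {2016: 0.050, 2017: 0.020, 2018: 0.030, 2019: 0.025}
mean_ratio, amount_2020 = forecast_investment(ratios, total_next_year=1_000_000.0)
print(mean_ratio, amount_2020)
```

Only 2017-2019 enter the average; the predicted amount is then used as an input feature in the second prediction.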
In the second prediction, the prediction is performed,
after the project investment amount of the material for 2020 is obtained, the features of the material for 2020 are used as input variables of the prediction model to predict the demand for the material in 2020; the specific prediction process is the same as that applied to the 2011-2019 historical project material data in the first prediction modeling.
The performance of the model is reflected by the prediction precision of the prediction model, and in addition, the model can be quantitatively evaluated by adopting the general model error, fitting degree and efficiency as measurement indexes for evaluating the performance of the model, and the method specifically comprises two aspects of model accuracy and model efficiency:
a. accuracy of model
The accuracy indicators are the Mean Relative Error (MRE) and the coefficient of determination ($R^2$). $R^2$ represents the degree to which the model input variables explain the output variable, is also called the goodness of fit, and takes a value between 0 and 1. The smaller the MRE and the closer $R^2$ is to 1, the higher the accuracy of the model.
Figure BDA0003115868450000171
where $W_{true}$ represents the true value, $W_{pred}$ represents the predicted value,
Figure BDA0003115868450000172
represents the mean of the true values, and N is the number of samples.
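The two accuracy indicators can be computed as sketched below, assuming the standard definitions $MRE = \frac{1}{N}\sum |W_{true} - W_{pred}| / W_{true}$ and $R^2 = 1 - \sum (W_{true} - W_{pred})^2 / \sum (W_{true} - \bar{W}_{true})^2$; the exact form used in the embodiment is given only in the figure above. The sample values are hypothetical.

```python
def mean_relative_error(true_values, predicted_values):
    """Mean of |true - predicted| / |true|; true values are assumed to be nonzero."""
    n = len(true_values)
    return sum(abs(t - p) / abs(t) for t, p in zip(true_values, predicted_values)) / n

def r_squared(true_values, predicted_values):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(true_values) / len(true_values)
    ss_res = sum((t - p) ** 2 for t, p in zip(true_values, predicted_values))
    ss_tot = sum((t - mean_true) ** 2 for t in true_values)
    return 1.0 - ss_res / ss_tot

# Hypothetical demand quantities.
w_true = [100.0, 200.0, 300.0]
w_pred = [110.0, 190.0, 300.0]
mre = mean_relative_error(w_true, w_pred)
r2 = r_squared(w_true, w_pred)
print(mre, r2)
```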
b. Efficiency of model
The training Time of the model is calculated. The shorter the training time, the higher the prediction efficiency of the model.
As shown in fig. 6-8, a comparison graph of the predicted results of the three types of materials is shown;
For the materials ranking in the top 30 by both value ratio and demand frequency ratio, which are the most important and have rich historical data, the prediction effect is good. Except for the material 'distribution terminal safety module', the prediction accuracy with Dataset1 reaches 90%, and for most materials it is as high as 95%. By contrast, because the usage structure of the materials may have changed slightly in 2019, the prediction accuracy with Dataset2 is slightly lower, but except for individual materials it still reaches 85%.
For the materials ranking in the top 30 by value ratio, the investment amount is large but the total demand is small, so historical data are relatively scarce and the model cannot effectively learn the pattern of material use, which lowers the prediction accuracy. With Dataset1, the prediction accuracy reaches 85% except for individual materials. Because these materials have little historical data and include many newly introduced materials, the model learns poorly and Dataset2 cannot effectively predict some of them, so the Dataset1 results can be used as the final prediction for these materials.
For the materials ranking in the top 30 by demand frequency ratio, the investment amount ratio is small, but the demand frequency is high and the materials are therefore important. Because the demand frequency is high, historical data are rich and the model can better learn the usage pattern, so the prediction accuracy with Dataset1 is relatively good: except for individual materials it exceeds 85%, for most materials it reaches 90%, and for one third of the materials it even reaches 95%. However, some of these materials have come into heavy use only in recent years and their historical data are relatively scarce, so the prediction accuracy with Dataset2 is lower.
In summary, the embodiments of the present invention have the following beneficial effects:
1. Feature extraction: semantic recognition, word segmentation and information extraction are carried out using natural language processing, and key information is extracted from the project name and used as feature variables, which solves the problem that there are many original variables and the main variables are difficult to identify accurately;
2. Establishing a material knowledge base: Pareto analysis, namely the 80/20 rule and the ABC classification method, is used to classify the materials by value (investment amount) and demand, which addresses the imbalance between the value and the demand of the materials;
3. Establishing a project information base: a project information base comprising an industry-type project base and an engineering-type project base is established according to defined division rules, in preparation for subsequently exploring the commonalities of materials within the same project type and the trends of material demand across years;
4. On the basis of a complete data processing flow, a distribution network material demand prediction model is established, and the multi-input, multi-output capability of machine learning is fully used to predict distribution network material demand;
5. By optimizing the support vector regression model, the consumption of computing resources is effectively reduced, and the prediction efficiency and prediction accuracy are improved;
6. The support vector regression machine has unique advantages in small-sample regression; because it does not need a large number of samples or empirical assumptions, it can avoid overfitting while achieving both empirical risk minimization and structural risk minimization.
The above description covers only preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto; any change or substitution that a person skilled in the art could readily conceive within the technical scope of the present invention falls within the scope of protection of the present invention.

Claims (10)

1. A distribution network material demand prediction system based on feature extraction and an improved SVR model, characterized by comprising:
the data import module is used for importing historical project data of a corresponding time period according to the selected time period; the historical project data comprises material demand data and project attribute data corresponding to the material demand;
the preprocessing module is used for carrying out data extraction on the imported historical project data to obtain a material historical data set;
the data grouping module is used for grouping the material historical data set according to the industry and engineering classification knowledge base to obtain material grouping historical data;
the predicted material type screening module is used for ranking the materials in the historical projects by importance according to the project attribute data and screening out the material types to be predicted;
the SVR model training module is used for training, for each selected material to be predicted, a corresponding SVR model;
and the prediction module is used for acquiring the project attribute data of the material according to the input material type and feeding it into the corresponding SVR model to predict the demand quantity of the material.
2. The distribution network material demand prediction system of claim 1, wherein the SVR model training module comprises a data set partitioning module, a support vector machine modeling configuration module, and a model prediction and optimization module;
the data set dividing module is used for dividing a training set and a test set;
the support vector machine modeling configuration module is used for configuring parameters of the SVR model, including a kernel function and parameters (C, g) in the kernel function; the kernel function is a Gaussian radial basis kernel function; parameter C is the penalty for misclassification; g is the coefficient of the kernel function;
and the model prediction and optimization module is used for optimizing the parameters (C, g) of the SVR model and improving the prediction precision of the demand and the generalization capability of the model.
3. The distribution network material demand forecasting system of claim 2, wherein the model forecasting and optimization module comprises a first forecasting and optimization module and a second forecasting and optimization module;
the first prediction and optimization module is used for carrying out a first optimization of the parameters (C, g) of the SVR model by adopting a particle swarm algorithm, quickly searching for and locating an optimal parameter interval, and obtaining the optimized parameter result (C, g)₁;
the second prediction and optimization module is used for performing secondary parameter optimization within the optimal parameter interval by using a grid search method to obtain the optimized parameter result (C, g)₂.
4. The distribution network material demand prediction system of claim 3, wherein the first optimization method in the first prediction and optimization module comprises:
1) Carrying out normalization processing on the divided training set and test set;
2) Setting the initial search range of the parameters (C, g) to be searched, the initial number of generations and the population size of the particle swarm algorithm, the initial value of the inertia factor, and the maximum number of iterations;
3) Determining the individual optimal position and the global optimal position of the particles by calculating the fitness of each particle;
4) Performing the iterative operation, and updating the current velocity and position of each particle according to the individual optimal position and the global optimal position determined in the previous iteration to obtain the corresponding parameters (C, g);
5) Judging whether the iteration end condition is met; if yes, ending the iteration and outputting the optimal parameter values (C, g)₁ obtained by the particle swarm optimization; otherwise, returning to 3) and continuing the iterative update.
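By way of illustration only, steps 2) to 5) above can be sketched as a basic particle swarm search over (C, g). This is a minimal sketch under stated assumptions, not the claimed implementation: the fitness function `toy_error` is a hypothetical stand-in for the SVR validation error, the hyperparameters `w`, `c1`, `c2` and the search bounds are illustrative, the inertia factor is held constant, the only end condition is the maximum number of iterations, and the normalization of step 1) is omitted.

```python
import random


def pso_search(fitness, bounds, n_particles=20, max_iter=50,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise fitness(C, g) inside bounds = [(C_lo, C_hi), (g_lo, g_hi)]."""
    rng = random.Random(seed)
    # Step 2: initialise the swarm inside the initial search range.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * len(bounds) for _ in range(n_particles)]
    # Step 3: the fitness determines the individual and global best positions.
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(*p) for p in pos]
    best = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[best][:], pbest_val[best]

    for _ in range(max_iter):  # Step 5: end condition is max_iter (assumption).
        # Step 4: move every particle using the bests from the previous iteration.
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the search range.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
        # Step 3 again: re-evaluate the fitness and refresh the best positions.
        for i in range(n_particles):
            val = fitness(*pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
        best = min(range(n_particles), key=lambda i: pbest_val[i])
        if pbest_val[best] < gbest_val:
            gbest, gbest_val = pbest[best][:], pbest_val[best]
    return tuple(gbest), gbest_val


def toy_error(C, g):
    """Hypothetical stand-in for the validation error of an SVR with (C, g)."""
    return (C - 10.0) ** 2 + (g - 0.5) ** 2


cg_1, err_1 = pso_search(toy_error, [(0.1, 100.0), (0.01, 10.0)])
print(cg_1, err_1)
```

In practice the fitness would be replaced by an error measured on held-out data for a model trained with the candidate (C, g), which is far more expensive than the toy function used here.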
5. The distribution network material demand prediction system of claim 3, wherein the secondary parameter optimization method in the second prediction and optimization module comprises: performing grid search optimization with a preset step length within a second predicted search interval around the parameters (C, g)₁, and determining the optimal parameter result (C, g)₂ that is finally required;
the second predicted search interval is determined by the following process: starting from the parameters (C, g)₁, enlarging the search interval by the preset step length until the first search condition and the second search condition are met simultaneously, and taking the resulting interval as the second predicted search interval;
the first search condition is that the search interval enlarged by the preset step length is larger than the parameter interval in which the first optimization by the particle swarm algorithm is trapped in a local optimum and cannot escape;
the second search condition is that the optimization time for carrying out the secondary parameter optimization in the search interval enlarged by the preset step length does not exceed the preset longest optimization time.
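The secondary optimization described above can be sketched, purely as an illustration, as a grid search on a box grown around the first-stage result. The two search conditions are replaced here by simple proxies that are assumptions of this sketch: a hypothetical `min_half_width` per parameter stands in for "larger than the interval in which the first optimization is trapped", and a hypothetical `max_evals` budget on the number of grid points stands in for the longest optimization time. The same number of steps is used on both axes, and `toy_error` is again a hypothetical fitness.

```python
from itertools import product


def second_search_interval(step, min_half_width, max_evals):
    """Return k, the number of preset steps to extend on each side of the centre."""
    k = 1
    # Proxy for the first condition: grow until every half-width is large enough.
    while any(k * s < m for s, m in zip(step, min_half_width)):
        k += 1
    # Proxy for the second condition: the grid must fit in the evaluation budget.
    if (2 * k + 1) ** len(step) > max_evals:
        raise ValueError("no interval satisfies both search conditions")
    return k


def grid_refine(fitness, center, step, min_half_width, max_evals):
    """Grid search with a preset step around center = (C, g) from the first stage."""
    k = second_search_interval(step, min_half_width, max_evals)
    # Candidate values per axis; C and g are kept strictly positive.
    axes = [[c + j * s for j in range(-k, k + 1) if c + j * s > 0]
            for c, s in zip(center, step)]
    best = min(product(*axes), key=lambda p: fitness(*p))
    return best, fitness(*best)


def toy_error(C, g):
    """Hypothetical stand-in for the validation error of an SVR with (C, g)."""
    return (C - 10.0) ** 2 + (g - 0.5) ** 2


cg_2, err_2 = grid_refine(toy_error, center=(9.0, 1.0), step=(1.0, 0.25),
                          min_half_width=(3.0, 0.5), max_evals=100)
print(cg_2, err_2)
```

Counting grid points is only a rough substitute for a time limit; a measured time per evaluation could be multiplied by the grid size to approximate the preset longest optimization time.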
6. The distribution network material demand forecasting system of claim 1, wherein
the preprocessing module comprises: a characteristic data cleaning module, a first characteristic data extraction module, a second characteristic data extraction module and a material historical data set module;
the characteristic data cleaning module is used for cleaning the data and processing abnormal values of the material demand data in the historical project data and of the project attribute data corresponding to the material demand, based on outlier detection using kernel K-Means clustering;
the first characteristic data extraction module is used for screening the data to obtain the material demand characteristic data and the corresponding project attribute characteristic data, including year, project name, material code and investment amount;
the second characteristic data extraction module is used for carrying out semantic recognition, text analysis and word segmentation processing on the project names of the project attribute data based on natural language processing, and extracting characteristic data including regions, sites, industry types and engineering types contained in the project names;
and the material historical data set module is used for forming a material historical data set by the material demand characteristic data and the corresponding characteristic data including year, region, site, industry type, project name, material code and investment amount.
7. The system for predicting demand of materials for distribution networks according to claim 2, wherein the processing of abnormal values of project attribute data in the feature data cleaning module comprises:
1) Clustering with kernel K-Means to divide the historical project data into clusters: the clustered data set is denoted D and is divided into K clusters, where D_a denotes a cluster whose cluster center is c_a, a = 1, …, K, and a is the index of the cluster to which each feature datum in the material historical data belongs;
2) For each missing datum A_b, i.e. a feature datum in the material historical data whose value is null, finding the cluster D_a in which it lies and searching D_a for the datum A_c most similar to A_b, where b ≠ c and neither b nor c exceeds the number of data in the cluster D_a; filling A_b with the attributes of A_c; iterating until all missing data are filled, the filled data set being denoted F_fill;
3) For each cluster in the data set F_fill, setting a threshold for the objective function and deleting the data objects one at a time; if deleting a data object significantly reduces the objective function, marking that data object as an outlier and adding it to the data set B;
4) Checking whether the data set B contains any filled missing value; if so, jumping to 1) to perform the clustering and filling again; if not, ending;
5) After the iteration in step 4) has been repeated up to the set iteration threshold, filling the missing data of that iteration with the cluster mean and ending.
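As a partial illustration, the filling of step 2) can be sketched as follows. The sketch covers only that step and rests on several assumptions: the cluster labels are taken as given (the kernel K-Means clustering of step 1) is not implemented), similarity is measured by Euclidean distance over the features that both records have, a donor record must itself contain the values being filled, and the data and the function name are hypothetical.

```python
import math


def fill_from_nearest_in_cluster(rows, labels):
    """Fill None entries of each row from the most similar row in its cluster.

    rows   : list of feature lists, with None marking a missing value
    labels : cluster index of each row (assumed to come from a prior clustering)
    """
    filled = [row[:] for row in rows]
    for b, row in enumerate(rows):
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            continue
        best_c, best_dist = None, math.inf
        for c, other in enumerate(rows):
            if c == b or labels[c] != labels[b]:
                continue  # the donor must be a different row of the same cluster
            if any(other[j] is None for j in missing):
                continue  # the donor must hold the values that are missing
            shared = [j for j, v in enumerate(row)
                      if v is not None and other[j] is not None]
            if not shared:
                continue
            dist = math.sqrt(sum((row[j] - other[j]) ** 2 for j in shared))
            if dist < best_dist:
                best_c, best_dist = c, dist
        if best_c is not None:
            for j in missing:
                filled[b][j] = rows[best_c][j]
    return filled


# Hypothetical records: [investment amount, demand quantity].
rows = [[1.0, 10.0], [1.1, None], [5.0, 50.0], [0.9, 9.0]]
labels = [0, 0, 1, 0]
print(fill_from_nearest_in_cluster(rows, labels))
```

Rows for which no donor exists are left unfilled in this sketch; the cluster-mean fallback of step 5) would handle such cases.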
8. The distribution network material demand forecasting system of claim 3, wherein the objective function
Figure FDA0003115868440000031
where dist is the kernel distance from the missing value to each cluster; x ∈ c_a, a = 1, …, K.
9. The distribution network material demand forecasting system of claim 1, wherein the second characteristic data extraction module comprises a natural language processing module and a tag encoding module;
the natural language processing module is used for carrying out semantic recognition, text analysis and word segmentation processing on the project names of the project attribute data based on Jieba word segmentation, a natural language processing technique, and extracting text-type characteristic data including the regions, sites, industry types and engineering types contained in the project names;
and the label coding module is used for label-encoding the regions, the sites, the industry types and the engineering types respectively and converting them into numerical characteristic data.
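The label coding step can be illustrated with the short sketch below. It assumes that word segmentation has already extracted the text fields from the project names, and it shows only the conversion of one text-type feature into integer codes. The function names, the field values and the choice of sorted order for assigning codes are assumptions of this sketch.

```python
def fit_label_encoder(values):
    """Map each distinct text value to an integer code (sorted order, by assumption)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, mapping):
    """Replace every text value by its integer code."""
    return [mapping[value] for value in values]


# Hypothetical region names extracted from project names.
regions = ["Nanshan", "Futian", "Nanshan", "Luohu"]
region_codes = fit_label_encoder(regions)
print(region_codes)
print(encode(regions, region_codes))
```

One encoder would be fitted per feature (region, site, industry type, engineering type), and the mapping fitted on historical data would be reused unchanged when encoding new projects.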
10. The distribution network material demand forecasting system of claim 1, wherein the forecast material category screening module comprises a ranking module and a screening module;
the sorting module is used for ranking the materials in the historical projects by importance according to their demand frequency share and/or value share;
and the screening module is used for screening out the material types needing to be predicted according to the sorting result of the sorting module and the set threshold value.
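A minimal sketch of the ranking and screening is given below for illustration. It assumes that each historical record carries a material code and an investment amount, counts the demand frequency as the number of records per material, and uses a top-N cut-off as the set threshold. The record layout, the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def screen_materials(records, top_n):
    """Rank materials by demand frequency and by value, then keep the top_n of each.

    records : iterable of (material_code, investment_amount) pairs, one per demand
    """
    freq = defaultdict(int)
    value = defaultdict(float)
    for code, amount in records:
        freq[code] += 1
        value[code] += amount
    # Ranking by raw totals gives the same order as ranking by share of the total.
    by_freq = sorted(freq, key=lambda c: (-freq[c], c))[:top_n]
    by_value = sorted(value, key=lambda c: (-value[c], c))[:top_n]
    return {
        "both": sorted(set(by_freq) & set(by_value)),
        "value_only": sorted(set(by_value) - set(by_freq)),
        "frequency_only": sorted(set(by_freq) - set(by_value)),
    }


# Hypothetical demand records.
records = [("A", 100.0), ("A", 100.0), ("A", 50.0),
           ("B", 1000.0), ("C", 10.0), ("C", 10.0)]
print(screen_materials(records, top_n=2))
```

The three groups returned correspond to materials that rank highly on both criteria, on value alone, and on demand frequency alone.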
CN202110664592.7A 2021-06-15 2021-06-15 Distribution network material demand prediction system based on feature extraction and improved SVR model Pending CN115481844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110664592.7A CN115481844A (en) 2021-06-15 2021-06-15 Distribution network material demand prediction system based on feature extraction and improved SVR model

Publications (1)

Publication Number Publication Date
CN115481844A true CN115481844A (en) 2022-12-16

Family

ID=84420166

Country Status (1)

Country Link
CN (1) CN115481844A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination