CN110990461A - Big data analysis model algorithm model selection method and device, electronic equipment and medium - Google Patents

Big data analysis model algorithm model selection method and device, electronic equipment and medium

Info

Publication number
CN110990461A
Authority
CN
China
Prior art keywords
model
data
analysis
algorithm
power grid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911292789.1A
Other languages
Chinese (zh)
Inventor
王宏刚
纪鑫
刘识
赵晓龙
余婷
刘�文
李君婷
赵宇亮
张帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Big Data Center Of State Grid Corp Of China
Original Assignee
Big Data Center Of State Grid Corp Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Big Data Center Of State Grid Corp Of China filed Critical Big Data Center Of State Grid Corp Of China
Priority to CN201911292789.1A priority Critical patent/CN110990461A/en
Priority to CN202010194935.3A priority patent/CN111324642A/en
Publication of CN110990461A publication Critical patent/CN110990461A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Electricity, gas or water supply
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Primary Health Care (AREA)
  • Databases & Information Systems (AREA)
  • Fuzzy Systems (AREA)
  • Operations Research (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • Game Theory and Decision Science (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention provides a big data analysis model algorithm model selection method and device, electronic equipment and a medium. The method comprises: matching a corresponding model category according to the application scenario and data characteristics of power grid service data; processing the power grid service data with each of at least two analysis models corresponding to the model category to obtain processing results; and evaluating the analysis models according to the processing results and the evaluation parameters corresponding to the model category, and recommending a model based on the evaluation results. In this technical solution, the category of model needed for big data analysis is determined quickly from the application scenario and data characteristics, which improves the efficiency of big data analysis.

Description

Big data analysis model algorithm model selection method and device, electronic equipment and medium
Technical Field
The embodiment of the invention relates to the technical field of big data, in particular to a big data analysis model algorithm model selection method, a big data analysis model algorithm model selection device, electronic equipment and a medium.
Background
Big data technology is currently an active research topic across industries in China and abroad. As big data brings technical challenges worldwide, China is paying increasing attention to its practical application. In recent years, the management focus of the State Grid has shifted from centralization and unification to refinement and efficiency. Against the background of rapidly developing information technology and the wide application of digital technologies in the "Internet+" era, combining the development of the State Grid with digital technology has become a clear trend. Big data enables the integration, analysis and processing of data and supports the retrieval of massive data for State Grid business. Working from a large volume of high-dimensional variable data, big data technology presents the overall design of the grid directly and vividly, and can better support power grid planning and development.
Applying big data in the power grid has great commercial and social value, and mining the value of power big data presents great opportunities. Based on the synchronous transmission of data and energy, the smart grid promotes the deep integration of energy and information technology, gradually forming a strong, reliable, clean, environmentally friendly and interactive energy management network supported by an operating system that carries both energy and data. Mining power grid big data enables intelligent management of electricity use and greatly improves energy efficiency. Users can therefore follow their electricity usage, consumption data, real-time electricity price and so on in real time, which enables the secondary circulation and efficient use of energy. Power grid planning covers a large volume of assets over a wide area, so asset management is difficult and requires a large amount of basic data. This is where big data has an advantage: big data technology can improve the distribution network's asset management system, modeling methods and information interaction, and thereby fundamentally raise the level of distribution asset management. In terms of data sources, however, most data is not very open and is difficult to obtain. In terms of data quality, the granularity of the data obtainable in the power industry, and the timeliness, integrity and consistency of data acquisition, have not reached the ideal level and should be continuously improved.
Models and algorithms are the two core problems in big data analysis. Research on big data analysis models can be divided into three levels: descriptive analysis, predictive analysis and prescriptive analysis. Descriptive analysis explores historical data and describes what happened; this level includes clustering, association rule mining and pattern discovery for finding regularities in the data, and visual analysis for describing them. Predictive analysis predicts future probabilities and trends, for example prediction based on logistic regression or on classifiers. Prescriptive analysis gives recommendations for future decisions based on the desired results, the specific scenario, the available resources and knowledge of past and current events, for example complex system analysis based on simulation and optimal solution generation under given constraints. Research on big data analysis algorithms designs efficient algorithms for specific analysis models and studies how to improve their scalability, real-time performance and so on. Power grid big data has the 5V characteristics of big data (volume, velocity, variety, value and veracity) and also has characteristics closely tied to power production, such as diverse data sources, low data quality, complex information content, uncertain coupling and high real-time requirements. As a result, power grid big data analysis models are more complex and diverse, and the real-time requirements on algorithms are higher.
At present, each kind of model and algorithm has a set of parameters and indexes by which it can be evaluated and compared. In practice, however, analysts have different technical backgrounds. If model selection, model use, parameter configuration, model evaluation and so on are left to analysts, the accuracy and scientific soundness of model selection cannot be well guaranteed, and models become harder to apply in power grid scenarios.
Disclosure of Invention
The embodiment of the invention provides a big data analysis model algorithm model selection method, a big data analysis model algorithm model selection device, electronic equipment and a medium, which can provide an intelligent model selection scheme and simplify model selection work.
In a first aspect, an embodiment of the present invention provides a big data analysis model algorithm model selection method, including:
matching a corresponding model category according to the application scenario and data characteristics of the power grid service data;
processing the power grid service data with each of at least two analysis models corresponding to the model category to obtain processing results;
and evaluating the analysis models according to the processing results and the evaluation parameters corresponding to the model category, and recommending a model based on the evaluation results.
Optionally, before matching the corresponding model category according to the application scenario and the data characteristics of the power grid service data, the method further includes:
establishing an association relation between analysis models and model categories, wherein the model categories comprise an association rule model, a classification model, a regression model and a clustering model;
establishing a mapping relation between the application scenario and data characteristics of the power grid service data and the model categories;
and storing, in a model library, the analysis models, the association relation between the analysis models and the model categories, and the mapping relation between the application scenario and data characteristics of the power grid service data and the model categories.
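The three stored items can be pictured as two lookup tables held in one object. The following is only a minimal sketch under stated assumptions: the class name `ModelLibrary`, its method names, and the example scenario, feature and model names are all hypothetical and do not come from the patent, which does not specify a storage layout.

```python
from dataclasses import dataclass, field

# The four category names follow the text; the layout below is an assumption.
CATEGORIES = ("association_rule", "classification", "regression", "clustering")


@dataclass
class ModelLibrary:
    # analysis model name -> model category (the association relation)
    model_category: dict = field(default_factory=dict)
    # (application scenario, data characteristic) -> model category (the mapping relation)
    scenario_mapping: dict = field(default_factory=dict)

    def register_model(self, name, category):
        if category not in CATEGORIES:
            raise ValueError(f"unknown model category: {category}")
        self.model_category[name] = category

    def map_scenario(self, scenario, data_feature, category):
        if category not in CATEGORIES:
            raise ValueError(f"unknown model category: {category}")
        self.scenario_mapping[(scenario, data_feature)] = category

    def models_for(self, scenario, data_feature):
        """Return the matched category and the names of all models registered under it."""
        category = self.scenario_mapping[(scenario, data_feature)]
        models = sorted(n for n, c in self.model_category.items() if c == category)
        return category, models


# Hypothetical entries, for illustration only.
library = ModelLibrary()
library.register_model("decision_tree", "classification")
library.register_model("neural_network", "classification")
library.register_model("k_means", "clustering")
library.map_scenario("fault_diagnosis", "labeled_discrete", "classification")
```

Querying `library.models_for("fault_diagnosis", "labeled_discrete")` then returns the category together with its registered models, which is the lookup the matching step relies on.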
Optionally, the mapping relation between the application scenario and data characteristics of the power grid service data and the model categories is established in the following manner:
using an association rule model to analyze and mine the relations and associations among the power grid service data;
using a classification model or a regression model to process labeled power grid service data in a supervised scenario;
and using a clustering model to process power grid service data that has no labels but needs to be grouped.
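These three rules can be read as a small decision function. The sketch below is an illustration only: the function name and its three boolean flags are hypothetical, and the split between classification (discrete target) and regression (continuous target) is taken from the later description of the two model types rather than from this list.

```python
def match_category(wants_associations, has_labels, target_is_continuous=False):
    """Pick a model category from three hypothetical flags describing the data.

    wants_associations: the task is to mine relations among the data
    has_labels: the data is labeled (supervised scenario)
    target_is_continuous: the label is a continuous value rather than a discrete one
    """
    if wants_associations:
        return "association_rule"
    if has_labels:
        return "regression" if target_is_continuous else "classification"
    # No labels, but the data still needs to be grouped.
    return "clustering"
```

A real mapping would presumably also key on the application scenario; the flags here stand in for whatever data characteristics the model library records.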
Optionally, before evaluating the analysis model according to the processing result and the evaluation parameter corresponding to the model category, the method further includes:
presetting the evaluation parameters corresponding to each model category, wherein the evaluation parameters of the association rule model comprise support and confidence; the evaluation parameters of the classification model comprise precision, recall, F-score, accuracy and the ROC curve; the evaluation parameters of the regression model comprise the sum of squared errors and the coefficient of determination; and the evaluation parameters of the clustering model comprise estimating the clustering tendency, determining the number of clusters in the data set and measuring the clustering quality.
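One plain way to preset these parameters is a lookup table keyed by model category. This is a sketch, not the patent's configuration format: the key and metric identifiers are hypothetical spellings, and the regression entry assumes the translated phrase refers to two metrics, the sum of squared errors and the coefficient of determination.

```python
# Preset evaluation parameters per model category (identifiers are assumptions).
EVALUATION_PARAMETERS = {
    "association_rule": ["support", "confidence"],
    "classification": ["precision", "recall", "f_score", "accuracy", "roc_curve"],
    "regression": ["sum_of_squared_errors", "coefficient_of_determination"],
    "clustering": ["clustering_tendency", "number_of_clusters", "clustering_quality"],
}


def evaluation_parameters_for(category):
    """Return the preset evaluation parameters for a model category."""
    try:
        return EVALUATION_PARAMETERS[category]
    except KeyError:
        raise ValueError(f"no evaluation parameters preset for {category!r}") from None
```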
Optionally, matching the corresponding model category according to the application scenario and the data characteristics of the power grid service data includes:
querying the model library based on the application scenario and data characteristics of the power grid service data to obtain the model category corresponding to the application scenario and data characteristics;
and acquiring at least two analysis models corresponding to the model category.
Optionally, the processing the power grid service data based on at least two analysis models corresponding to the model categories respectively to obtain processing results includes:
and respectively inputting the power grid service data into the analysis models corresponding to the model types, and taking the output results of the analysis models as processing results.
Optionally, the evaluating the analysis model according to the processing result and the evaluation parameter corresponding to the model category, and performing model recommendation based on the evaluation result includes:
determining the matching degree of the analysis model and the application scene according to the processing result and the evaluation parameters corresponding to the model types, and evaluating the analysis model according to the matching degree;
and generating model recommendation information based on the analysis model with the highest matching degree, and displaying the model recommendation information.
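The text does not say how the matching degree is computed from the evaluation parameters. Purely as an assumed stand-in, the sketch below averages evaluation scores that are taken to lie in [0, 1] and recommends the model with the highest average. The function names and the example scores are invented for illustration and are not results from the patent.

```python
def matching_degree(scores):
    """Average a dict of evaluation scores, each assumed to lie in [0, 1]."""
    if not scores:
        raise ValueError("at least one evaluation score is required")
    return sum(scores.values()) / len(scores)


def recommend(evaluations):
    """evaluations maps model name -> dict of scores; return (best name, its degree)."""
    if len(evaluations) < 2:
        raise ValueError("the method compares at least two analysis models")
    degrees = {name: matching_degree(s) for name, s in evaluations.items()}
    best = max(degrees, key=degrees.get)
    return best, degrees[best]


# Made-up scores for two candidate models of the same category.
evaluations = {
    "decision_tree": {"precision": 0.80, "recall": 0.70},
    "neural_network": {"precision": 0.90, "recall": 0.80},
}
best_model, best_degree = recommend(evaluations)
print(f"Recommended model: {best_model} (matching degree {best_degree:.2f})")
```

A weighted average, or a rule that depends on the application scenario, would fit the same interface; the simple mean is chosen only to keep the example short.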
In a second aspect, an embodiment of the present invention further provides a big data analysis model algorithm model selection device, where the device includes:
the model category matching module is used for matching corresponding model categories according to the application scenes and the data characteristics of the power grid service data;
the data processing module is used for respectively processing the power grid service data based on at least two analysis models corresponding to the model types to obtain processing results;
and the model evaluation module is used for evaluating the analysis model according to the processing result and the evaluation parameters corresponding to the model types and recommending the model based on the evaluation result.
Optionally, the device further includes:
a model library building module, used for establishing an association relation between analysis models and model categories before the corresponding model category is matched according to the application scenario and data characteristics of the power grid service data, wherein the model categories comprise an association rule model, a classification model, a regression model and a clustering model;
establishing a mapping relation between the application scenario and data characteristics of the power grid service data and the model categories;
and storing, in a model library, the analysis models, the association relation between the analysis models and the model categories, and the mapping relation between the application scenario and data characteristics of the power grid service data and the model categories.
Optionally, the mapping relation between the application scenario and data characteristics of the power grid service data and the model categories is established in the following manner:
using an association rule model to analyze and mine the relations and associations among the power grid service data;
using a classification model or a regression model to process labeled power grid service data in a supervised scenario;
and using a clustering model to process power grid service data that has no labels but needs to be grouped.
Optionally, the device further includes:
an evaluation parameter setting module, used for presetting the evaluation parameters corresponding to each model category before the analysis models are evaluated according to the processing results and the evaluation parameters corresponding to the model category, wherein the evaluation parameters of the association rule model comprise support and confidence; the evaluation parameters of the classification model comprise precision, recall, F-score, accuracy and the ROC curve; the evaluation parameters of the regression model comprise the sum of squared errors and the coefficient of determination; and the evaluation parameters of the clustering model comprise estimating the clustering tendency, determining the number of clusters in the data set and measuring the clustering quality.
Optionally, the model category matching module is specifically configured to:
querying the model library based on the application scenario and data characteristics of the power grid service data to obtain the model category corresponding to the application scenario and data characteristics;
and acquiring at least two analysis models corresponding to the model category.
Optionally, the data processing module is specifically configured to:
and respectively inputting the power grid service data into the analysis models corresponding to the model types, and taking the output results of the analysis models as processing results.
Optionally, the model evaluation module is specifically configured to:
determining the matching degree of the analysis model and the application scene according to the processing result and the evaluation parameters corresponding to the model types, and evaluating the analysis model according to the matching degree;
and generating model recommendation information based on the analysis model with the highest matching degree, and displaying the model recommendation information.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the big data analysis model algorithm model selection method provided by the embodiments of the present invention.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the big data analysis model algorithm model selection method provided by the embodiment of the present invention.
According to the technical solution of the embodiment of the invention, a corresponding model category is matched based on the application scenario and data characteristics of the power grid service data; the power grid service data is processed with each of at least two analysis models corresponding to the model category to obtain processing results; the analysis models are evaluated according to the processing results and the evaluation parameters corresponding to the model category; and a model is recommended based on the evaluation results. The model category needed for data analysis is thus determined quickly from the application scenario and data characteristics, which improves the efficiency of big data analysis. In addition, the analysis models are evaluated quantitatively through pre-configured evaluation parameters, which improves the accuracy and scientific soundness of model recommendation.
Drawings
FIG. 1 is a flowchart of a big data analysis model algorithm model selection method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a big data analysis model algorithm model selection apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
FIG. 1 is a flowchart of a big data analysis model algorithm model selection method according to an embodiment of the present invention. The embodiment is applicable to big data analysis, and the method may be executed by a big data analysis model algorithm model selection device, which may be implemented in software and/or hardware and is generally integrated in an electronic device. As shown in FIG. 1, the method includes:
Step 110: matching a corresponding model category according to the application scenario and data characteristics of the power grid service data.
The model categories include an association rule model, a classification model, a regression model, a clustering model and the like. Each model category contains multiple analysis models for big data analysis. For example, the analysis models used for big data analysis are grouped so that analysis models of the same type fall into one class, and the analysis models in that class are regarded as belonging to the same model category.
In the embodiment of the invention, before the corresponding model category is matched according to the application scenario and data characteristics of the power grid service data, an association relation between the analysis models and the model categories and a mapping relation between the application scenario and data characteristics of the power grid service data and the model categories are established in advance. The analysis models, the association relation and the mapping relation are stored in a model library.
It should be noted that there are many ways to establish the mapping relation between the application scenario and data characteristics of the power grid service data and the model categories, and the embodiment of the present invention is not particularly limited in this respect. For example, the mapping relation may be established as follows: an association rule model is used to analyze and mine the relations and associations among the power grid service data; or a classification model or a regression model is used to process labeled power grid service data in a supervised scenario; or a clustering model is used to process power grid service data that has no labels but needs to be grouped.
Illustratively, the model library is queried based on the application scenario and data characteristics of the power grid service data to obtain the corresponding model category, and at least two analysis models corresponding to that model category are acquired.
Step 120: processing the power grid service data with each of at least two analysis models corresponding to the model category to obtain processing results.
Illustratively, the power grid service data is input into each analysis model corresponding to the model category, and the output of each analysis model is taken as a processing result. Because a model category contains multiple analysis models, the same power grid service data serves as the input of at least two analysis models of that category, and each model's output is a processing result.
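As a sketch of this step, the example below feeds one data set to two deliberately simple regression models and collects each output as a processing result. The two models (a mean baseline and a least-squares line), the function names and the numbers are assumptions chosen so that the example runs without external libraries; the patent does not name specific models at this point.

```python
def mean_model(xs, ys):
    """Baseline: always predict the mean of the observed targets."""
    mean_y = sum(ys) / len(ys)
    return [mean_y for _ in xs]


def linear_model(xs, ys):
    """Fit an ordinary least-squares line to (xs, ys) and evaluate it on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [slope * x + intercept for x in xs]


def process(models, xs, ys):
    """Run every model on the same data; return model name -> processing result."""
    if len(models) < 2:
        raise ValueError("at least two analysis models are required")
    return {name: model(xs, ys) for name, model in models.items()}


# Made-up data standing in for power grid service data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
results = process({"mean": mean_model, "linear": linear_model}, xs, ys)
```

Keeping the results keyed by model name makes it straightforward to pass them, one model at a time, to whatever evaluation follows.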
Step 130: evaluating the analysis models according to the processing results and the evaluation parameters corresponding to the model category, and recommending a model based on the evaluation results.
Illustratively, the matching degree between each analysis model and the application scenario is determined according to the processing results and the evaluation parameters corresponding to the model category, and the analysis models are evaluated according to the matching degree. Model recommendation information is generated based on the analysis model with the highest matching degree and is displayed.
It should be noted that, before the analysis models are evaluated according to the processing results and the evaluation parameters corresponding to the model category, the evaluation parameters corresponding to each model category are preset. The evaluation parameters of the association rule model include support and confidence; those of the classification model include precision, recall, F-score, accuracy and the ROC curve; those of the regression model include the sum of squared errors and the coefficient of determination; and those of the clustering model include estimating the clustering tendency, determining the number of clusters in the data set and measuring the clustering quality.
According to the technical solution of this embodiment, a corresponding model category is matched based on the application scenario and data characteristics of the power grid service data; the power grid service data is processed with each of at least two analysis models corresponding to the model category to obtain processing results; the analysis models are evaluated according to the processing results and the evaluation parameters corresponding to the model category; and a model is recommended based on the evaluation results. The model category needed for data analysis is thus determined quickly from the application scenario and data characteristics, which improves the efficiency of big data analysis. In addition, the analysis models are evaluated quantitatively through pre-configured evaluation parameters, which improves the accuracy and scientific soundness of model recommendation.
In an exemplary embodiment, the current mainstream big data analysis models and algorithms are surveyed and organized, the models and algorithms are grouped into categories according to service scenarios and application requirements, and a reasonable model and algorithm library framework is proposed. This supports construction of the model library and provides algorithm services for mining the value of big data. The work has two parts: accurate selection of the Big Data Center's models and algorithms, and evaluation of those models and algorithms.
Accurate selection of the Big Data Center's models and algorithms. Data mining is a step in knowledge discovery in databases. This work studies common algorithms suited to discovering value in the company's big data, forms a data mining algorithm library and supports the company's data analysis work. Taking the characteristics of power grid services into account, the Big Data Center studies the application scenarios and service applicability of the various models and algorithms, screens the application scenarios, service types and data types that each algorithm suits, and gives reasoned suggestions.
Evaluation of the Big Data Center's models and algorithms. A suitable model and algorithm is selected according to the selection basis above, and the matching degree between the model and the application scenario, and the model's applicability, are evaluated in terms of accuracy, recall, sensitivity and so on.
For a given power grid big data analysis problem, the problem is classified as association rule analysis, classification analysis, regression analysis or cluster analysis according to the characteristics of the problem, the type of data, the requirements to be met and the input and output forms of the big data analysis model. These main analysis models, and combinations of them, are applied to analysis tasks in a variety of practical power scenarios, such as scheduling parameter optimization, power forecasting, fault detection and diagnosis, customer demand analysis and service type identification.
Association rule mining was originally proposed for market basket analysis and is mainly used to analyze the association relations among data. Its initial purpose was to mine the associations among different commodities in a transaction database, obtain general rules about customers' purchasing patterns, and use those rules to guide merchants in designing shelf layouts. In actual industrial processes, many scenarios can be modeled with similar association relations: frequent itemset mining is used to uncover implicit rules, and a data-driven and knowledge-driven approach replaces traditional decision making that depends heavily on experience. Support and confidence are the common evaluation indexes for association rules.
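Support and confidence have standard definitions: the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule A -> B is support(A and B) divided by support(A). The sketch below computes both; the transactions are invented alarm records used only as an illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs, so confidence is undefined")
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / base


# Made-up transactions: each set lists the events recorded together.
transactions = [
    {"overload", "high_temp"},
    {"overload", "high_temp", "trip"},
    {"overload"},
    {"high_temp"},
]
rule_support = support(transactions, {"overload", "high_temp"})
rule_confidence = confidence(transactions, {"overload"}, {"high_temp"})
```

Here the itemset appears in two of four transactions, and two of the three transactions containing `overload` also contain `high_temp`.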
The classification model and the regression model are mainly used for analyzing data with labels, the application scene of the classification model and the regression model is mainly supervised learning, and the classification model and the regression model are widely applied to the field of analysis, judgment and prediction based on big data. The classification model mainly aims at the judgment and prediction of discrete attribute values, such as fault detection and diagnosis and customer segmentation; while regression models are primarily directed to continuous attribute values such as production and sales of products.
Decision tree models and neural network models are the main models in classification and regression analysis based on power big data. A decision tree is a tree-like structure that characterizes the mapping between object attributes and object values. The decision tree model is simple and intuitive, has strong interpretability and good analysis and prediction capabilities, and is suitable for a plurality of scenes of industrial big data analysis.
The neural network model can represent complex nonlinear functions and performs well on classification and regression tasks. The technological processes, product quality, energy consumption, faults and so on of actual industrial scenarios (such as power grids, production lines and manufacturing equipment) are influenced by multiple factors; the influence is nonlinear, and coupling often exists among the influencing factors. Training a neural network on the big data generated in these processes can represent such complex processes effectively and supports process flow optimization, quality management and monitoring, energy consumption optimization, fault detection and early warning, and the like.
For the classification model, the commonly used evaluation indexes include precision, recall, F-score, accuracy, the ROC curve and the like. For the regression model, the commonly used evaluation indexes include the sum of squared errors and the coefficient of determination.
The cluster analysis model can generalize objects with similar patterns into a cluster, is a typical unsupervised learning model and is mainly used for analyzing and dividing data without labels. The cluster analysis model is good at extracting intrinsic relations from seemingly complex, unknown objects. Therefore, in grid big data analysis, a cluster analysis model is used to analyze relationships between complex parameters, refine customer groups, and the like. The clustering assessment estimates the feasibility of clustering on the data set and the quality of the results produced by the clustering method. The cluster evaluation mainly comprises the following steps: estimating clustering tendency, determining the number of clusters in the data set and determining clustering quality.
In an exemplary embodiment, for a power grid big data analysis problem to be analyzed, the method first analyzes and abstracts the problem and its scenario into a corresponding logical abstract problem. Specifically: an association rule analysis model is adopted for analyzing and mining relationships; a classification or regression analysis model is adopted for labeled data and supervised problems; and a clustering model is adopted for data that has no labels but needs to be grouped.
After the power grid big data model algorithm is divided into categories such as an association rule analysis model, a classification regression analysis model and a clustering analysis model, according to different application problems, the method adopts model algorithms of different categories and corresponding evaluation indexes to realize type selection recommendation and quantitative evaluation.
Association rule analysis model. For association rule analysis, the algorithms mainly selected are the Apriori algorithm and the FP-Growth algorithm. The theoretical basis of the Apriori algorithm is two important properties of frequent itemsets: every subset of a frequent itemset is frequent, and every superset of an infrequent itemset is infrequent. The algorithm first scans the data set once to obtain the frequent 1-itemsets; then, iterating level by level, it generates the candidate k-itemsets from the frequent (k-1)-itemsets and uses the properties above to screen the frequent k-itemsets out of the candidates, until no new frequent itemsets are produced. Thanks to these properties, the Apriori algorithm is far more efficient than a brute-force search, and because its idea is simple it is widely applied in association rule analysis. However, the classical Apriori algorithm suffers from two major problems: first, when the data volume is large, the algorithm generates a large number of candidate sets; second, the algorithm needs to scan the data set many times, which incurs a large I/O overhead.
The FP-Growth algorithm compresses the data into an FP-tree data structure and therefore does not need to generate candidate sets. The algorithm first scans the data set twice to construct the FP-tree, and then mines the constructed FP-tree with a divide-and-conquer strategy, without scanning the data set again. When the transactions in the data set share many overlapping paths on the FP-tree and the FP-tree is small enough, the FP-Growth algorithm can run several orders of magnitude faster than the Apriori algorithm.
Evaluation of the association rule analysis model. In practical applications, an association rule is an implication of the form X → Y, where X and Y are proper subsets of the itemset I and the intersection of X and Y is the empty set. X is called the antecedent and Y the consequent. For the rule X → Y over a transaction set T, the support is (X, Y).count / T.count and the confidence is (X, Y).count / X.count, where (X, Y).count is the number of transactions in T that contain both X and Y, X.count is the number of transactions in T that contain X, and T.count is the total number of transactions in T.
Association rule mining finds, in the transaction set, all association rules that meet the minimum support and minimum confidence thresholds; such rules are also called strong association rules. The support of a rule reflects how often the rule applies: if the support is small, the rule covers only a small part of the transaction set and may have arisen by chance. If the confidence is low, it is difficult to infer Y from X.
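The support and confidence definitions above can be illustrated with a minimal Python sketch. The function name `support_confidence` and the toy transactions are hypothetical and are not part of the described method; the sketch only counts transactions as the expressions prescribe.

```python
def support_confidence(transactions, x, y):
    """Return (support, confidence) of the rule X -> Y over a list of transactions."""
    x, y = frozenset(x), frozenset(y)
    t_count = len(transactions)                                   # T.count
    x_count = sum(1 for t in transactions if x <= set(t))         # X.count
    xy_count = sum(1 for t in transactions if (x | y) <= set(t))  # (X, Y).count
    support = xy_count / t_count
    confidence = xy_count / x_count if x_count else 0.0
    return support, confidence


# Hypothetical transactions: events recorded together for one device.
transactions = [
    {"overload", "high_temperature", "trip"},
    {"overload", "trip"},
    {"high_temperature"},
    {"overload", "high_temperature"},
]
support, confidence = support_confidence(transactions, {"overload"}, {"trip"})
print(support, confidence)
```

A rule would then be kept as a strong association rule only if both values reach the chosen minimum thresholds.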
Classification and regression analysis model. For the classification and regression analysis model, the algorithms mainly selected are the decision tree algorithm and the neural network algorithm. The core of the decision tree algorithm is to select a suitable test attribute at each node of the decision tree and to partition the data set according to that attribute, so as to construct a complete decision tree. The decision tree algorithms mainly selected are the ID3 algorithm, the C4.5 algorithm and the CART algorithm.
The ID3 algorithm introduces the information entropy theory into the decision tree learning, selects the test attributes of tree nodes by taking the information gain as the standard, and recursively constructs the decision tree. The ID3 algorithm is simple in concept and has strong learning ability. However, since the ID3 algorithm favors handling attributes with more values, there is a problem with overfitting; the algorithm is sensitive to noisy data and can only process discrete values, not continuous attribute values.
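As an illustration of selecting a test attribute by information gain, the following Python sketch computes the entropy of the labels and the gain of each discrete attribute. The helper names and the toy records are hypothetical; the sketch covers only the attribute selection step, not the recursive construction of the tree.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy records with two discrete attributes.
rows = [
    {"load": "high", "season": "summer"},
    {"load": "high", "season": "winter"},
    {"load": "low", "season": "summer"},
    {"load": "low", "season": "winter"},
]
labels = ["fault", "fault", "normal", "normal"]

gains = {attr: information_gain(rows, labels, attr) for attr in ("load", "season")}
best_attribute = max(gains, key=gains.get)  # attribute chosen for the node
print(gains, best_attribute)
```

Here "load" separates the two classes completely, so its gain equals the full label entropy, while "season" carries no information.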
The core of the C4.5 algorithm is to replace the information gain with the information gain ratio as the criterion for selecting the test attribute. This improvement effectively overcomes the bias of the ID3 algorithm toward attributes with many values. In the process of constructing the decision tree, the C4.5 algorithm introduces a pruning strategy so as to avoid overfitting the data. Furthermore, the C4.5 algorithm adds discretization of continuous attributes, which enables the algorithm to process continuous attribute values. However, when the algorithm processes continuous attribute values, the data must be scanned and sorted, which reduces execution efficiency, and the algorithm can only process data that fits in memory.
The CART algorithm uses the Gini coefficient, which represents the purity of the data, as the criterion for attribute partitioning. Compared with the ID3 and C4.5 algorithms, which select the test attribute based on information entropy, the method based on the Gini coefficient is simpler to compute and still approximates the entropy-based criterion well. In addition, the CART algorithm further simplifies the calculation of the Gini coefficient through binary recursive partitioning and obtains a simpler and more intuitive binary decision tree model. The CART algorithm discretizes continuous attributes with an idea similar to that of the C4.5 algorithm and is therefore able to process continuous attribute values. However, when there are too many attribute types and the decision tree is highly complex, the error of the CART algorithm is large.
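The Gini-based binary split can be sketched as follows. This is a minimal Python illustration under the assumption that candidate thresholds for a continuous attribute are the midpoints between adjacent distinct values; the function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_of_binary_split(values, labels, threshold):
    """Weighted Gini impurity after splitting into value <= threshold and value > threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Pick the midpoint threshold with the lowest weighted Gini impurity (assumed candidate set)."""
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(midpoints, key=lambda t: gini_of_binary_split(values, labels, t))


# Hypothetical continuous attribute (e.g. a temperature reading) and labels.
values = [10, 20, 30, 40]
labels = ["normal", "normal", "fault", "fault"]
threshold = best_threshold(values, labels)
print(threshold, gini_of_binary_split(values, labels, threshold))
```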
The core of the neural network algorithm is to train the neural network model, that is, to adjust the parameters of the model according to the training data so as to optimize its representation capability. The earliest neural network learning algorithm was the perceptron training rule, which adjusts the network connection weights based on the difference between the target output and the actual output of the training examples until the perceptron correctly classifies all the training data. The perceptron training algorithm converges for linearly separable training data but does not converge for linearly inseparable training data.
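The perceptron training rule and its convergence behavior can be illustrated with a short Python sketch. The function name, the learning rate and the epoch limit are assumptions made for the example; the logical AND and XOR data sets stand in for linearly separable and linearly inseparable training data.

```python
def train_perceptron(samples, targets, lr=1.0, max_epochs=100):
    """Train a single perceptron; return (weights, bias, converged)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(samples, targets):
            # Actual output: 1 if the weighted sum exceeds 0, else 0.
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if o != t:
                errors += 1
                # Adjust weights by the difference between target and actual output.
                w = [wi + lr * (t - o) * xi for wi, xi in zip(w, x)]
                b += lr * (t - o)
        if errors == 0:  # every training example classified correctly
            return w, b, True
    return w, b, False


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_w, and_b, and_ok = train_perceptron(inputs, [0, 0, 0, 1])  # linearly separable
xor_w, xor_b, xor_ok = train_perceptron(inputs, [0, 1, 1, 0])  # not linearly separable
print(and_ok, xor_ok)
```

The epoch limit is only there so that the non-separable case terminates; the rule itself has no stopping point for such data.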
The BP algorithm is the most widely applied and most representative neural network learning algorithm at present. In addition to the feedforward neural network model, most neural network models such as a Radial Basis Function (RBF) neural network, a recurrent neural network, and a convolutional neural network can also be trained by using a BP algorithm.
Evaluation of the classification analysis model. For the classification model, the evaluation indexes adopted are as follows:
After the samples are classified, they can be divided into four classes according to the classification result:
TP: the true label is 1 and the predicted label is 1.
FP: the true label is 0 and the predicted label is 1.
FN: the true label is 1 and the predicted label is 0.
TN: the true label is 0 and the predicted label is 0.
Precision:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
F-Score, the harmonic mean of precision and recall, which lies closer to the smaller of the two values:
$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Accuracy:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
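The four counts and the classification indexes can be computed as in the following Python sketch. The function name `classification_metrics` and the toy label lists are hypothetical; the F-score shown is the harmonic mean of precision and recall, and returning 0.0 when a denominator is zero is a choice made for the example.

```python
def classification_metrics(y_true, y_pred):
    """Compute TP/FP/FN/TN and the derived indexes for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}


# Hypothetical true and predicted labels.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```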
ROC curve: to draw an ROC curve, first count the positive and negative samples according to the labels, and assume there are P positive samples and N negative samples. Set the tick interval of the horizontal axis to 1/N and that of the vertical axis to 1/P. Then sort the samples by the prediction probability output by the model, from high to low. Traverse the samples in order while drawing the curve from the origin: for each positive sample, draw one tick interval along the vertical axis; for each negative sample, draw one tick interval along the horizontal axis. When all samples have been traversed, the curve ends at the point (1, 1) and the ROC curve is complete. The closer the area under the curve is to 1, the better the classification effect.
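The drawing procedure can be sketched in Python as follows. The function names and the toy scores are hypothetical, and the sketch assumes that no two samples share the same prediction probability (ties would need to be grouped). The area under the curve is obtained with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return the ROC curve as a list of (x, y) points, starting at the origin."""
    p = sum(1 for t in y_true if t == 1)  # number of positive samples
    n = len(y_true) - p                   # number of negative samples
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if y_true[i] == 1:
            tp += 1  # one step of 1/P along the vertical axis
        else:
            fp += 1  # one step of 1/N along the horizontal axis
        points.append((fp / n, tp / p))
    return points


def area_under_curve(points):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical labels and model output probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points[-1], auc)
```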
Evaluation of the regression analysis model. For the regression model, the evaluation indexes adopted are as follows, where $y_i$ is the true value of the i-th sample, $\hat{y}_i$ is its predicted value, $\bar{y}$ is the mean of the true values, and n is the number of samples:
Root mean square error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
In general, the RMSE reflects well how far the predicted values of the regression model deviate from the true values. In practice, however, if there are a few outliers with a very large deviation, the RMSE becomes very poor even when the number of outliers is very small. An index that is more robust than the RMSE, such as the mean absolute percentage error (MAPE), can be defined as:
$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
Coefficient of determination (R-Square):
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
The closer the coefficient of determination is to 1, the better.
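A minimal Python sketch of the three regression indexes follows. It assumes the usual definitions (RMSE as the square root of the mean squared error, MAPE as a percentage, R-Square as one minus the ratio of residual to total sum of squares); the function names and the toy values are hypothetical.

```python
import math


def rmse(y, y_hat):
    """Root mean square error between true values y and predictions y_hat."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mape(y, y_hat):
    """Mean absolute percentage error in percent; true values must be non-zero."""
    return 100.0 / len(y) * sum(abs((a - b) / a) for a, b in zip(y, y_hat))


def r_square(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical true loads and predicted loads.
y = [100.0, 200.0, 300.0, 400.0]
y_hat = [110.0, 190.0, 310.0, 390.0]
print(rmse(y, y_hat), mape(y, y_hat), r_square(y, y_hat))
```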
Cluster analysis model. Cluster analysis algorithms are mainly classified into hierarchical clustering, partition-based clustering, density-based clustering, and grid-based clustering.
The basic idea of the hierarchical clustering algorithm is to group data layer by layer to form a hierarchical clustering result with a dendrogram structure. Depending on how the hierarchy is constructed, hierarchical clustering can be divided into two categories: agglomerative hierarchical clustering and divisive hierarchical clustering. Agglomerative hierarchical clustering works bottom-up: each individual is initially regarded as one class, and the classes are merged layer by layer. Divisive hierarchical clustering works top-down: all individuals are initially taken as one class, and the classes are then split layer by layer. The main hierarchical clustering algorithms are the BIRCH algorithm, the CURE algorithm and the ROCK algorithm.
The core idea of the BIRCH algorithm is to build a clustering feature tree (CF tree) and perform cluster analysis on the CF tree. The BIRCH algorithm has high execution efficiency because the data set only needs to be scanned once and the clustering process is completed in memory. However, BIRCH does not cluster well when the clusters in the data set have non-convex distributions.
The CURE algorithm can process mass data and identify clusters of different shapes and sizes. The algorithm uses multiple points in the data space to represent a cluster, thereby filtering outliers and better identifying clusters that are non-spherical and of varying sizes. In addition, the algorithm adopts a random sampling and partitioning strategy to process large-scale data, so that better time efficiency is obtained.
The ROCK algorithm is an improvement over the CURE algorithm. Building on the CURE algorithm, the ROCK algorithm adds handling of categorical attributes, and it improves the robustness of the algorithm by examining the similarity between data points and the number of neighbors they have in common.
A partition-based clustering algorithm first requires the number of clusters to be specified; the algorithm then optimizes an objective function step by step through iteration and finally obtains the specified number of result clusters. The K-means algorithm is a typical partition-based clustering algorithm. It represents each class by the mean of all data in the class, that is, the cluster center. The algorithm starts from k random cluster centers and iteratively assigns each point to the class of its nearest cluster center until the cluster centers converge. The algorithm is simple and efficient, with low time and space complexity, and is therefore widely applied in cluster analysis. However, the K-means algorithm has a number of deficiencies. It can only process numerical data, and it clusters poorly on sample sets that are not normally distributed or are non-uniform; it is sensitive to the initial values, and the initial cluster centers strongly influence the clustering result; in addition, it is sensitive to noise and outliers.
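The iteration of the K-means algorithm can be sketched in Python as follows. To keep the example reproducible, the initial cluster centers are passed in explicitly rather than drawn at random, which is an assumption of the sketch; the function name and the toy points are hypothetical.

```python
def kmeans(points, centers, max_iter=100):
    """Iterate assignment and mean update from the given initial centers."""
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the class of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its class
        # (an empty class keeps its previous center).
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # centers no longer move: converged
            break
        centers = new_centers
    return centers, clusters


# Hypothetical two-dimensional points forming two groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (8.0, 8.0), (8.2, 7.8), (7.8, 8.2)]
centers, clusters = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
print(centers, [len(c) for c in clusters])
```

Starting instead from two centers inside the same group, for example (1.0, 1.0) and (1.2, 0.8), this data set converges to a poor partition, which illustrates the sensitivity to initial values noted above.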
The core idea of the improved K-means++ algorithm is to select k points that are far apart from one another as the initial cluster centers; this way of selecting cluster centers is intuitive and effective. To address the sensitivity of the K-means algorithm to noise and outliers, Kaufman et al. proposed the K-medoids algorithm, which represents each class by an actual point in the cluster instead of the average of all its points, and thereby handles abnormal values effectively.
A density-based clustering algorithm groups data that reach a certain density into a cluster, so it can handle clusters of any shape and effectively eliminate sparse abnormal points. The DBSCAN algorithm is a classical density-based clustering algorithm. It starts from any unmarked point, takes the maximal set of density-connected points as a cluster, and obtains all clusters in the same way. The DBSCAN algorithm does not need the number of clusters to be specified, can process data of any shape, and is insensitive to abnormal points. However, since DBSCAN uses a global density threshold, if the density of the clusters is not uniform, the algorithm treats all clusters whose density is below the threshold as outliers. The OPTICS algorithm orders the neighborhood points by density and finds clusters of different densities by means of visualization.
A grid-based clustering algorithm divides the data space into a finite number of grid cells, calculates the density of the data mapped into each cell, and merges adjacent dense cells into a clustering result. The computation time of such algorithms is independent of the number of data points and of the input order, and they can cluster data of various shapes. However, since the accuracy of clustering depends on the number of grid cells, clustering quality improves only at the expense of time. A typical grid-based clustering algorithm is the STING algorithm.
The STING algorithm divides the data space into multiple levels of rectangular units for different levels of resolution, wherein the upper level units are divided into multiple lower level units, and statistical information of each unit attribute is pre-calculated and stored to perform a query operation. The algorithm starts from a certain level of units, and inquires the units meeting the constraint condition layer by layer downwards, and the obtained inquiry result is equivalent to the clustering result. The STING algorithm facilitates parallel and incremental updates and has high execution efficiency. However, the algorithm can only obtain clusters with vertical or horizontal boundaries, and the accuracy of the clustering result is poor.
Estimation of clustering tendency. For a given data set, it is evaluated whether the data set has a non-random structure. Blindly applying a clustering method to a data set will return some clusters, but these may be misleading. Cluster analysis of a data set is meaningful only if the data contain a non-random structure. Clustering tendency assessment determines whether a given data set has a non-random structure that can lead to meaningful clusters. For a data set without any non-random structure, such as uniformly distributed points in the data space, the clusters returned by a clustering algorithm are random and meaningless. Clustering requires a non-uniform distribution of the data. A commonly used evaluation index is the Hopkins statistic, a spatial statistic that tests the spatial randomness of a variable distributed in space. The calculation steps are as follows:
(1) Uniformly sample n points $p_1, p_2, \ldots, p_n$ from the space of D. For each point $p_i$ ($1 \le i \le n$), find the nearest neighbor of $p_i$ in D, and let $x_i$ be the distance between $p_i$ and its nearest neighbor in D, i.e.
$$x_i = \min_{v \in D} \mathrm{dist}(p_i, v)$$
(2) Uniformly sample n points $q_1, q_2, \ldots, q_n$ from D. For each point $q_i$ ($1 \le i \le n$), find the nearest neighbor of $q_i$ in $D - \{q_i\}$, and let $y_i$ be the distance between $q_i$ and its nearest neighbor in $D - \{q_i\}$, i.e.
$$y_i = \min_{v \in D,\, v \ne q_i} \mathrm{dist}(q_i, v)$$
(3) Compute the Hopkins statistic H:
$$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$$
If D is uniformly distributed, then $\sum_{i=1}^{n} y_i$ and $\sum_{i=1}^{n} x_i$ will be very close, so H is about 0.5. If D is highly skewed, then $\sum_{i=1}^{n} y_i$ will be significantly smaller than $\sum_{i=1}^{n} x_i$, and thus H will be close to 0.
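The three steps can be sketched in Python as follows. The sketch assumes that the "space of D" is the axis-aligned bounding box of the data and that the second sample is drawn from D without replacement; both are assumptions made for the example, as are the function name and the `seed` parameter.

```python
import math
import random


def hopkins_statistic(data, n, seed=0):
    """Hopkins statistic H = sum(y) / (sum(x) + sum(y)) for a list of points."""
    rng = random.Random(seed)
    dims = len(data[0])
    lo = [min(p[d] for p in data) for d in range(dims)]
    hi = [max(p[d] for p in data) for d in range(dims)]

    # Step 1: n points drawn uniformly from the bounding box of D (assumed data space);
    # x_i is the distance from p_i to its nearest neighbor in D.
    x_sum = 0.0
    for _ in range(n):
        p = tuple(rng.uniform(lo[d], hi[d]) for d in range(dims))
        x_sum += min(math.dist(p, v) for v in data)

    # Step 2: n points drawn from D itself (n must not exceed len(data));
    # y_i is the distance from q_i to its nearest neighbor in D - {q_i}.
    y_sum = 0.0
    for i in rng.sample(range(len(data)), n):
        y_sum += min(math.dist(data[i], v) for j, v in enumerate(data) if j != i)

    # Step 3: the statistic itself.
    return y_sum / (x_sum + y_sum)


# Hypothetical strongly clustered data: two tight groups far apart.
gen = random.Random(1)
data = [(gen.uniform(0.0, 0.1), gen.uniform(0.0, 0.1)) for _ in range(20)]
data += [(gen.uniform(10.0, 10.1), gen.uniform(10.0, 10.1)) for _ in range(20)]
h = hopkins_statistic(data, n=10)
print(h)
```

For data like this, H is expected to be close to 0, which indicates a non-random structure worth clustering.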
Estimation of the number of clusters. The K-means algorithm requires the number of clusters in the data set as a parameter, and this number can also be seen as an interesting and important summary statistic of the data set. It is therefore desirable to estimate the number of clusters before using a clustering algorithm to derive the detailed clusters. Common methods are the elbow method and the cross-validation method.
Elbow method. Given k > 0, the data set is clustered with an algorithm such as K-means and the sum of the within-cluster variances var(k) is calculated. Then var is plotted against k. The first (or most significant) inflection point of the curve suggests the "correct" number of clusters.
Cross-validation. Divide the data into m parts; build a clustering model from m-1 parts and evaluate the clustering quality on the remaining part (the sum of the distances between the test samples and their nearest cluster centers). For each k > 0 to be compared, repeat this m times, compare the overall quality, and select the k that gives the best clustering quality.
Evaluation of clustering quality. After a clustering method has been applied to the data set, the quality of the resulting clusters needs to be evaluated. Two types of methods are commonly used: extrinsic methods and intrinsic methods.
Extrinsic method. This is a supervised method and requires reference data. A measure is used to judge how well the clustering result agrees with the reference data. Several measures are commonly used.
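(1) Index-based measures. Let m be the number of samples and consider all m(m-1)/2 pairs of samples: a is the number of pairs placed in the same cluster by the clustering result and in the same class by the reference data; b is the number of pairs in the same cluster but in different reference classes; c is the number of pairs in different clusters but in the same reference class; and d is the number of pairs in different clusters and in different reference classes.
Jaccard Coefficient (JC):
$$JC = \frac{a}{a + b + c}$$
FM Index (Fowlkes and Mallows Index, FMI):
$$FMI = \sqrt{\frac{a}{a + b} \cdot \frac{a}{a + c}}$$
Rand Index (RI):
$$RI = \frac{2(a + d)}{m(m - 1)}$$
where a + b + c + d = m(m-1)/2.
The values of these performance measures all lie in the interval [0, 1], and larger values are better.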
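The three indexes can be computed from pair counts as in the following Python sketch. It assumes the usual pairwise meaning of a, b, c and d: a counts pairs in the same cluster and the same reference class, b pairs in the same cluster but different reference classes, c pairs in different clusters but the same reference class, and d pairs that differ in both. The function names and the toy labelings are hypothetical.

```python
import math
from itertools import combinations


def pair_counts(clusters, reference):
    """Count sample pairs by agreement between clustering result and reference data."""
    a = b = c = d = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = reference[i] == reference[j]
        if same_cluster and same_class:
            a += 1
        elif same_cluster:
            b += 1
        elif same_class:
            c += 1
        else:
            d += 1
    return a, b, c, d


def external_indices(clusters, reference):
    """Return (JC, FMI, RI); assumes a + b and a + c are non-zero."""
    a, b, c, d = pair_counts(clusters, reference)
    m = len(clusters)
    jc = a / (a + b + c)
    fmi = math.sqrt(a / (a + b) * a / (a + c))
    ri = 2 * (a + d) / (m * (m - 1))
    return jc, fmi, ri


# Hypothetical cluster IDs from a clustering result and classes from reference data.
clusters = [0, 0, 0, 1, 1, 1]
reference = [0, 0, 1, 1, 1, 1]
print(pair_counts(clusters, reference), external_indices(clusters, reference))
```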
(2) Quality measure method: let $Q(C, C_g)$ denote the quality measure of a clustering C given the reference data $C_g$.
The quality of Q depends on four criteria:
Cluster homogeneity: the purer the clusters, the better.
Cluster completeness: samples that belong to the same class in the reference data should be assigned to the same cluster.
Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (a miscellaneous cluster).
Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces.
BCubed precision and recall: the precision of an object indicates how many other objects in the same cluster belong to the same class as the object. The recall of an object reflects how many objects of the same class are assigned to the same cluster.
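Let $D = \{o_1, o_2, \ldots, o_n\}$ be a set of objects and let C be a clustering on D. Let $L(o_i)$ ($1 \le i \le n$) be the class of $o_i$ given by the reference data, and let $C(o_i)$ be the cluster_ID of $o_i$ in C. For two objects $o_i$ and $o_j$ ($1 \le i, j \le n$, $i \ne j$), the correctness of the relationship between them in the clustering C is given by
$$\mathrm{Correctness}(o_i, o_j) = \begin{cases} 1, & \text{if } L(o_i) = L(o_j) \Leftrightarrow C(o_i) = C(o_j) \\ 0, & \text{otherwise} \end{cases}$$
BCubed precision is defined as:
$$\mathrm{Precision}_{\mathrm{BCubed}} = \frac{1}{n}\sum_{i=1}^{n} \frac{\sum_{o_j:\, i \ne j,\, C(o_i) = C(o_j)} \mathrm{Correctness}(o_i, o_j)}{\left\| \{ o_j \mid i \ne j,\, C(o_i) = C(o_j) \} \right\|}$$
BCubed recall is defined as:
$$\mathrm{Recall}_{\mathrm{BCubed}} = \frac{1}{n}\sum_{i=1}^{n} \frac{\sum_{o_j:\, i \ne j,\, L(o_i) = L(o_j)} \mathrm{Correctness}(o_i, o_j)}{\left\| \{ o_j \mid i \ne j,\, L(o_i) = L(o_j) \} \right\|}$$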
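BCubed precision and recall can be sketched in Python as follows. The function name and the toy labelings are hypothetical, and an object that has no other object in its cluster (or in its class) contributes 0 to the corresponding sum, which is a convention chosen for the example.

```python
def bcubed(labels, clusters):
    """Return (BCubed precision, BCubed recall) for reference labels and cluster IDs."""
    n = len(labels)

    def correctness(i, j):
        # 1 if "same reference class" and "same cluster" agree for the pair, else 0.
        return 1 if (labels[i] == labels[j]) == (clusters[i] == clusters[j]) else 0

    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if j != i and clusters[j] == clusters[i]]
        same_class = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if same_cluster:
            precision += sum(correctness(i, j) for j in same_cluster) / len(same_cluster)
        if same_class:
            recall += sum(correctness(i, j) for j in same_class) / len(same_class)
    return precision / n, recall / n


# Hypothetical reference classes and cluster IDs for six objects.
labels = [0, 0, 1, 1, 1, 1]
clusters = [0, 0, 0, 1, 1, 1]
bc_precision, bc_recall = bcubed(labels, clusters)
print(bc_precision, bc_recall)
```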
Intrinsic method. This is an unsupervised approach and needs no reference data. The degree of intra-cluster compactness and the degree of inter-cluster separation are evaluated directly.
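Consider the cluster partition $C = \{C_1, C_2, \ldots, C_k\}$ of the clustering result, and define the average distance between samples within a cluster C:
$$\mathrm{avg}(C) = \frac{2}{|C|(|C| - 1)} \sum_{1 \le i < j \le |C|} \mathrm{dist}(x_i, x_j)$$
Farthest distance between samples within cluster C:
$$\mathrm{diam}(C) = \max_{1 \le i < j \le |C|} \mathrm{dist}(x_i, x_j)$$
Distance between the closest samples of cluster $C_i$ and cluster $C_j$:
$$d_{\min}(C_i, C_j) = \min_{x_i \in C_i,\, x_j \in C_j} \mathrm{dist}(x_i, x_j)$$
Distance between the center points $\mu_i$ and $\mu_j$ of cluster $C_i$ and cluster $C_j$:
$$d_{\mathrm{cen}}(C_i, C_j) = \mathrm{dist}(\mu_i, \mu_j)$$
DB Index (Davies-Bouldin Index, DBI):
$$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \ne i} \left( \frac{\mathrm{avg}(C_i) + \mathrm{avg}(C_j)}{d_{\mathrm{cen}}(C_i, C_j)} \right)$$
Dunn Index (DI):
$$DI = \min_{1 \le i \le k} \left\{ \min_{j \ne i} \left( \frac{d_{\min}(C_i, C_j)}{\max_{1 \le l \le k} \mathrm{diam}(C_l)} \right) \right\}$$
The smaller the DBI value, the better; the DI is the opposite: the larger the value, the better.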
Silhouette coefficient. For each object o in D, where $C_i$ is the cluster to which o belongs, calculate the average distance a(o) between o and the other objects in $C_i$:
$$a(o) = \frac{\sum_{o' \in C_i,\, o' \ne o} \mathrm{dist}(o, o')}{|C_i| - 1}$$
b(o) is the minimum average distance from o to the clusters that do not contain o:
$$b(o) = \min_{C_j:\, 1 \le j \le k,\, j \ne i} \left\{ \frac{\sum_{o' \in C_j} \mathrm{dist}(o, o')}{|C_j|} \right\}$$
The silhouette coefficient is defined as:
$$s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}$$
The value of the silhouette coefficient is between -1 and 1.
The value of a(o) reflects the compactness of the cluster to which o belongs. The smaller the value, the more compact the cluster.
The value of b(o) captures the degree of separation of o from other clusters. The larger the value of b(o), the more separated o is from other clusters.
When the silhouette coefficient of o is close to 1, the cluster containing o is compact and o is far from other clusters, which is the desirable case. When the silhouette coefficient is negative, o is on average closer to the objects of another cluster than to the objects of its own cluster; in many cases this is very bad and indicates a poor clustering result.
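The coefficient of a single object can be computed as in the following Python sketch. The function name and the one-dimensional toy points are hypothetical; the sketch assumes at least two clusters and returns 0.0 for an object that is alone in its cluster, which is a convention chosen for the example.

```python
import math


def silhouette(points, clusters, i):
    """Silhouette coefficient s(o) = (b - a) / max(a, b) of the object with index i."""
    own = clusters[i]
    same = [math.dist(points[i], points[j])
            for j in range(len(points)) if j != i and clusters[j] == own]
    if not same:  # singleton cluster: a(o) is undefined, use 0.0 by convention
        return 0.0
    a = sum(same) / len(same)  # average distance within the object's own cluster

    # Average distance to every other cluster; b is the smallest of them.
    others = {}
    for j, p in enumerate(points):
        if clusters[j] != own:
            others.setdefault(clusters[j], []).append(math.dist(points[i], p))
    b = min(sum(d) / len(d) for d in others.values())
    return (b - a) / max(a, b)


# Hypothetical one-dimensional points (as 1-tuples).
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette(points, [0, 0, 1, 1], 0)  # compact, well separated clusters
bad = silhouette(points, [0, 1, 0, 1], 0)   # object grouped with a distant point
print(good, bad)
```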
The method provides a complete model algorithm selection and evaluation method for power grid big data analysis. With this method, the scenario characteristics and data types can be determined quickly for the power grid big data analysis scenarios commonly encountered by the State Grid, which simplifies problem analysis. For a determined scenario, the algorithms and models integrated in the method can be selected directly for analysis, so that the type of model to be used is determined quickly and the efficiency of big data analysis is improved. Because the method also integrates the evaluation parameters for the corresponding algorithms and models, it can be used directly for quantitative evaluation of those algorithms and models, which makes model evaluation more accurate and better founded.
Fig. 2 is a block diagram of a big data analysis model algorithm model selection device according to an embodiment of the present invention. The device can rapidly determine the analysis model to be used by executing a big data analysis model algorithm model selection method. The apparatus may be implemented by software and/or hardware and is typically integrated in an electronic device. As shown in fig. 2, the apparatus includes:
the model class matching module 210 is configured to match corresponding model classes according to application scenarios and data characteristics of the power grid service data;
the data processing module 220 is configured to process the power grid service data respectively based on at least two analysis models corresponding to the model categories to obtain processing results;
and the model evaluation module 230 is configured to evaluate the analysis model according to the processing result and the evaluation parameter corresponding to the model category, and recommend the model based on the evaluation result.
The big data analysis model algorithm model selection device provided by the embodiment of the invention can execute the big data analysis model algorithm model selection method provided by any embodiment of the invention, the realization principle and the technical effect of the big data analysis model algorithm model selection device are similar to those of the big data analysis model algorithm model selection method, and the details are not repeated here.
In an exemplary embodiment, the apparatus further comprises:
the model base building module is used for establishing an association between each analysis model and a model category before the corresponding model category is matched according to the application scenario and data characteristics of the power grid service data, wherein the model categories comprise an association rule model, a classification model, a regression model and a clustering model;
establishing a mapping relationship between the application scenario and data characteristics of the power grid service data and the model categories;
and storing, in a model base, the analysis models, the associations between the analysis models and the model categories, and the mapping relationship between the application scenario and data characteristics of the power grid service data and the model categories.
In an exemplary embodiment, the mapping relationship between the application scene and the data characteristic of the power grid service data and the model category is established in the following way:
adopting an association rule model for analyzing and mining the relationships among the power grid service data;
adopting a classification model or a regression model for processing labeled power grid service data in supervised scenarios;
and adopting a clustering model for processing power grid service data that has no labels but needs to be grouped.
In an exemplary embodiment, the apparatus further comprises:
and the evaluation parameter setting module is used for presetting the evaluation parameters corresponding to the model categories before the analysis model is evaluated according to the processing result and the evaluation parameters corresponding to the model category, wherein the evaluation parameters of the association rule model comprise support and confidence; the evaluation parameters of the classification model comprise precision, recall, F-score, accuracy and the ROC curve; the evaluation parameters of the regression model comprise the sum of squared errors and the coefficient of determination; and the evaluation parameters of the clustering model comprise the estimated clustering tendency, the determined number of clusters in the data set and the measured clustering quality.
In an exemplary embodiment, the model class matching module is specifically configured to:
inquiring the model base based on the application scene and the data characteristics of the power grid service data to obtain the model category corresponding to the application scene and the data characteristics;
and acquiring at least two analysis models corresponding to the model types.
In an exemplary embodiment, the data processing module is specifically configured to:
and respectively inputting the power grid service data into the analysis models corresponding to the model types, and taking the output results of the analysis models as processing results.
In an exemplary embodiment, the model evaluation module is specifically configured to:
determining the matching degree of the analysis model and the application scene according to the processing result and the evaluation parameters corresponding to the model types, and evaluating the analysis model according to the matching degree;
and generating model recommendation information based on the analysis model with the highest matching degree, and displaying the model recommendation information.
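A minimal sketch of the evaluate-and-recommend step follows. It runs every candidate model on the same data, scores each one, and recommends the highest scorer. The text does not define how the matching degree is calculated, so this example simply assumes it equals classification accuracy; the two threshold "models" and the sample data are toy placeholders.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recommend_model(models, samples, labels):
    """Score each candidate model and return (best model name, all scores)."""
    if len(models) < 2:
        raise ValueError("at least two candidate analysis models are required")
    scores = {}
    for name, model in models.items():
        predictions = [model(x) for x in samples]      # processing result
        scores[name] = accuracy(labels, predictions)   # assumed matching degree
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for two analysis models of the same category.
def threshold_at_50(x):
    return 1 if x > 50 else 0


def threshold_at_80(x):
    return 1 if x > 80 else 0


samples = [30, 60, 90, 40, 70]
labels = [0, 1, 1, 0, 1]
best, scores = recommend_model(
    {"threshold_50": threshold_at_50, "threshold_80": threshold_at_80},
    samples,
    labels,
)
print(f"Recommended model: {best} (matching degree {scores[best]:.2f})")
```

In practice the matching degree would likely combine several of the preset evaluation parameters for the model category rather than a single metric.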
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 3, the electronic device includes a processor 310, a memory 320, an input device 330, and an output device 340. The electronic device may include one or more processors 310; one processor 310 is taken as an example in fig. 3. The processor 310, the memory 320, the input device 330 and the output device 340 in the electronic device may be connected by a bus or by other means; connection by a bus is taken as an example in fig. 3.
The memory 320 is a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules (e.g., the model class matching module 210, the data processing module 220, and the model evaluation module 230) corresponding to the big data analysis model algorithm selection method in the embodiment of the present invention. The processor 310 executes various functional applications and data processing of the electronic device by running the software programs, instructions and modules stored in the memory 320, that is, matching corresponding model classes according to application scenarios and data characteristics of the power grid service data; respectively processing the power grid service data based on at least two analysis models corresponding to the model types to obtain processing results; and evaluating the analysis model according to the processing result and the evaluation parameters corresponding to the model types, and recommending the model based on the evaluation result.
The memory 320 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the memory 320 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 320 may further include memory located remotely from the processor 310, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus. The output device 340 may include a display device such as a display screen.
Embodiments of the present invention also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a big data analysis model algorithm model selection method, the method comprising: matching corresponding model categories according to the application scenario and the data characteristics of the power grid service data; respectively processing the power grid service data based on at least two analysis models corresponding to the model category to obtain processing results; and evaluating the analysis models according to the processing results and the evaluation parameters corresponding to the model category, and recommending a model based on the evaluation result.
Of course, the computer-executable instructions contained in the storage medium provided by the embodiments of the present invention are not limited to the method operations described above, and may also perform related operations in the big data analysis model algorithm model selection method provided by any embodiment of the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software together with the necessary general-purpose hardware, and can certainly also be implemented by hardware alone, although the former is the better implementation in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the big data analysis model algorithm model selection apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A big data analysis model algorithm model selection method is characterized by comprising the following steps:
matching corresponding model types according to the application scene and the data characteristics of the power grid service data;
respectively processing the power grid service data based on at least two analysis models corresponding to the model types to obtain processing results;
and evaluating the analysis model according to the processing result and the evaluation parameters corresponding to the model types, and recommending the model based on the evaluation result.
2. The method of claim 1, further comprising, before matching the corresponding model class according to the application scenario and data characteristics of the grid service data:
establishing an association relationship between analysis models and model categories, wherein the model categories comprise an association rule model, a classification model, a regression model and a clustering model;
establishing a mapping relationship between the application scenario and data characteristics of the power grid service data and the model categories;
and storing, in a model base, the analysis models, the association relationship between the analysis models and the model categories, and the mapping relationship between the application scenario and data characteristics of the power grid service data and the model categories.
3. The method according to claim 2, comprising establishing the mapping relationship between the application scenario and data characteristics of the power grid service data and the model category in the following manner:
analyzing and mining the connections and relationships among the power grid service data by adopting an association rule model;
processing labeled power grid service data in supervised scenarios by adopting a classification model or a regression model;
and processing power grid service data that has no labels but needs to be classified by adopting a clustering model.
4. The method of claim 2, further comprising, prior to evaluating the analytical model based on the processing results and the evaluation parameters corresponding to the model categories:
presetting evaluation parameters corresponding to the model categories, wherein the evaluation parameters of the association rule model comprise support and confidence; the evaluation parameters of the classification model comprise precision, recall, F-score, accuracy and the ROC curve; the evaluation parameters of the regression model comprise the sum of squared errors and the coefficient of determination; and the evaluation parameters of the clustering model comprise estimating the clustering tendency, determining the number of clusters in the data set, and measuring the clustering quality.
5. The method of claim 2, wherein matching the corresponding model classes according to the application scenarios and data characteristics of the grid service data comprises:
inquiring the model base based on the application scene and the data characteristics of the power grid service data to obtain the model category corresponding to the application scene and the data characteristics;
and acquiring at least two analysis models corresponding to the model types.
6. The method according to claim 1, wherein the processing the grid service data based on at least two analysis models corresponding to the model categories respectively to obtain processing results comprises:
and respectively inputting the power grid service data into the analysis models corresponding to the model types, and taking the output results of the analysis models as processing results.
7. The method of claim 1, wherein evaluating the analytical model according to the processing result and the evaluation parameter corresponding to the model category, and performing model recommendation based on the evaluation result comprises:
determining the matching degree of the analysis model and the application scene according to the processing result and the evaluation parameters corresponding to the model types, and evaluating the analysis model according to the matching degree;
and generating model recommendation information based on the analysis model with the highest matching degree, and displaying the model recommendation information.
8. A big data analysis model algorithm model selection device is characterized by comprising:
the model category matching module is used for matching corresponding model categories according to the application scenes and the data characteristics of the power grid service data;
the data processing module is used for respectively processing the power grid service data based on at least two analysis models corresponding to the model types to obtain processing results;
and the model evaluation module is used for evaluating the analysis model according to the processing result and the evaluation parameters corresponding to the model types and recommending the model based on the evaluation result.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the big data analysis model algorithm model selection method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a big data analysis model algorithm selection method according to any one of claims 1 to 7.
CN201911292789.1A 2019-12-12 2019-12-12 Big data analysis model algorithm model selection method and device, electronic equipment and medium Withdrawn CN110990461A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911292789.1A CN110990461A (en) 2019-12-12 2019-12-12 Big data analysis model algorithm model selection method and device, electronic equipment and medium
CN202010194935.3A CN111324642A (en) 2019-12-12 2020-03-19 Model algorithm type selection and evaluation method for power grid big data analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911292789.1A CN110990461A (en) 2019-12-12 2019-12-12 Big data analysis model algorithm model selection method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN110990461A true CN110990461A (en) 2020-04-10

Family

ID=70093961

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201911292789.1A Withdrawn CN110990461A (en) 2019-12-12 2019-12-12 Big data analysis model algorithm model selection method and device, electronic equipment and medium
CN202010194935.3A Pending CN111324642A (en) 2019-12-12 2020-03-19 Model algorithm type selection and evaluation method for power grid big data analysis

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202010194935.3A Pending CN111324642A (en) 2019-12-12 2020-03-19 Model algorithm type selection and evaluation method for power grid big data analysis

Country Status (1)

Country Link
CN (2) CN110990461A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257418A (en) * 2020-10-30 2021-01-22 北京青丝科技有限公司 Questionnaire data processing method and device and storage medium
CN112487720A (en) * 2020-11-30 2021-03-12 重庆大学 Method and system for quickly partitioning wind pressure coefficient based on K-means three-dimensional clustering algorithm and storage medium
CN112506913A (en) * 2021-02-02 2021-03-16 广东工业大学 Big data architecture construction method for manufacturing industry data space
CN112632000A (en) * 2020-12-30 2021-04-09 北京天融信网络安全技术有限公司 Log file clustering method and device, electronic equipment and readable storage medium
CN112948687A (en) * 2021-03-25 2021-06-11 重庆高开清芯智联网络科技有限公司 Node message recommendation method based on name card file characteristics
CN112966778A (en) * 2021-03-29 2021-06-15 上海冰鉴信息科技有限公司 Data processing method and device for unbalanced sample data
CN113048807A (en) * 2021-03-15 2021-06-29 太原理工大学 Air cooling unit backpressure abnormity detection method
CN113159220A (en) * 2021-05-14 2021-07-23 中国人民解放军军事科学院国防工程研究院工程防护研究所 Random forest based concrete penetration depth empirical algorithm evaluation method and device
CN113282686A (en) * 2021-06-03 2021-08-20 光大科技有限公司 Method and device for determining association rule of unbalanced sample
CN113408601A (en) * 2021-06-10 2021-09-17 共达地创新技术(深圳)有限公司 Model generation method, electronic device, and storage medium
CN113591884A (en) * 2020-04-30 2021-11-02 上海高德威智能交通系统有限公司 Method, device and equipment for determining character recognition model and storage medium
CN113705849A (en) * 2020-05-21 2021-11-26 富士通株式会社 Information processing apparatus, information processing method, and computer program
CN113822327A (en) * 2021-07-31 2021-12-21 云南电网有限责任公司信息中心 Algorithm recommendation method based on data characteristics and analytic hierarchy process

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985815A (en) * 2020-08-21 2020-11-24 国网能源研究院有限公司 Method and device for screening energy and power operation evaluation indexes
TWI821641B (en) * 2021-03-12 2023-11-11 殷祐科技股份有限公司 Artificial intelligent manufacturing & production energy-saving system and method thereof
CN113642850A (en) * 2021-07-20 2021-11-12 国网江苏省电力有限公司南通供电分公司 Data fusion method and terminal for power distribution network planning
CN116703165B (en) * 2023-08-03 2024-01-19 国网山西省电力公司营销服务中心 Electric power metering data security risk assessment method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1483739A2 (en) * 2001-09-27 2004-12-08 BRITISH TELECOMMUNICATIONS public limited company Method and apparatus for data analysis
CN105654196A (en) * 2015-12-29 2016-06-08 中国电力科学研究院 Adaptive load prediction selection method based on electric power big data
CN109165249B (en) * 2018-08-07 2020-08-04 阿里巴巴集团控股有限公司 Data processing model construction method and device, server and user side
CN109726749A (en) * 2018-12-21 2019-05-07 齐鲁工业大学 A kind of Optimal Clustering selection method and device based on multiple attribute decision making (MADM)
CN110457360A (en) * 2019-06-18 2019-11-15 北京易莱信科技有限公司 A kind of modeling method and system based on data mining

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591884A (en) * 2020-04-30 2021-11-02 上海高德威智能交通系统有限公司 Method, device and equipment for determining character recognition model and storage medium
CN113591884B (en) * 2020-04-30 2023-11-14 上海高德威智能交通系统有限公司 Method, device, equipment and storage medium for determining character recognition model
CN113705849A (en) * 2020-05-21 2021-11-26 富士通株式会社 Information processing apparatus, information processing method, and computer program
CN112257418A (en) * 2020-10-30 2021-01-22 北京青丝科技有限公司 Questionnaire data processing method and device and storage medium
CN112487720A (en) * 2020-11-30 2021-03-12 重庆大学 Method and system for quickly partitioning wind pressure coefficient based on K-means three-dimensional clustering algorithm and storage medium
CN112487720B (en) * 2020-11-30 2022-11-22 重庆大学 Method and system for quickly partitioning wind pressure coefficient based on K-means three-dimensional clustering algorithm and storage medium
CN112632000A (en) * 2020-12-30 2021-04-09 北京天融信网络安全技术有限公司 Log file clustering method and device, electronic equipment and readable storage medium
CN112632000B (en) * 2020-12-30 2023-11-10 北京天融信网络安全技术有限公司 Log file clustering method, device, electronic equipment and readable storage medium
CN112506913A (en) * 2021-02-02 2021-03-16 广东工业大学 Big data architecture construction method for manufacturing industry data space
CN113048807A (en) * 2021-03-15 2021-06-29 太原理工大学 Air cooling unit backpressure abnormity detection method
CN112948687A (en) * 2021-03-25 2021-06-11 重庆高开清芯智联网络科技有限公司 Node message recommendation method based on name card file characteristics
CN112948687B (en) * 2021-03-25 2023-05-02 重庆高开清芯智联网络科技有限公司 Node message recommendation method based on name card file characteristics
CN112966778B (en) * 2021-03-29 2024-03-15 上海冰鉴信息科技有限公司 Data processing method and device for unbalanced sample data
CN112966778A (en) * 2021-03-29 2021-06-15 上海冰鉴信息科技有限公司 Data processing method and device for unbalanced sample data
CN113159220A (en) * 2021-05-14 2021-07-23 中国人民解放军军事科学院国防工程研究院工程防护研究所 Random forest based concrete penetration depth empirical algorithm evaluation method and device
CN113282686A (en) * 2021-06-03 2021-08-20 光大科技有限公司 Method and device for determining association rule of unbalanced sample
CN113282686B (en) * 2021-06-03 2023-11-07 光大科技有限公司 Association rule determining method and device for unbalanced sample
CN113408601A (en) * 2021-06-10 2021-09-17 共达地创新技术(深圳)有限公司 Model generation method, electronic device, and storage medium
CN113822327A (en) * 2021-07-31 2021-12-21 云南电网有限责任公司信息中心 Algorithm recommendation method based on data characteristics and analytic hierarchy process

Also Published As

Publication number Publication date
CN111324642A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
CN110990461A (en) Big data analysis model algorithm model selection method and device, electronic equipment and medium
Cheng et al. Data and knowledge mining with big data towards smart production
Hachicha et al. A survey of control-chart pattern-recognition literature (1991–2010) based on a new conceptual classification scheme
Jiang et al. Evolutionary dynamic multi-objective optimisation: A survey
CN108985380B (en) Point switch fault identification method based on cluster integration
CN110990718B (en) Social network model building module of company image lifting system
Duan et al. Root cause analysis approach based on reverse cascading decomposition in QFD and fuzzy weight ARM for quality accidents
CN114861788A (en) Load abnormity detection method and system based on DBSCAN clustering
CN114118269A (en) Energy big data aggregation analysis method based on typical service scene
CN114066073A (en) Power grid load prediction method
Cao et al. Froth image clustering with feature semi-supervision through selection and label information
Hao et al. A new method for noise data detection based on DBSCAN and SVDD
Mithy et al. Classification of Iris Flower Dataset using Different Algorithms
Goyle et al. Dataassist: A machine learning approach to data cleaning and preparation
Zhang et al. Anomaly detection method for building energy consumption in multivariate time series based on graph attention mechanism
CN114444573A (en) Power customer label generation method based on big data clustering technology
Ye et al. A Novel Self-Supervised Learning-Based Anomalous Node Detection Method Based on an Autoencoder for Wireless Sensor Networks
Zhou et al. Pre-clustering active learning method for automatic classification of building structures in urban areas
CN113705920A (en) Generation method of water data sample set for thermal power plant and terminal equipment
Xu et al. The unordered time series fuzzy clustering algorithm based on the adaptive incremental learning
Liu et al. Inventory Management of Automobile After-sales Parts Based on Data Mining
Безкоровайний et al. Mathematical models of a multi-criteria problem of reengineering topological structures of ecological monitoring networks
Chang Evaluation model of enterprise lean management effect based on data mining
Liu et al. An abnormal detection of positive active total power based on local outlier factor
Liu et al. A novel effective distance measure and a relevant algorithm for optimizing the initial cluster centroids of K-means

Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20200410