Data set classification method for operation accident analysis of an integrated energy system
Technical Field
The invention belongs to the field of data processing and analysis for integrated energy systems, and in particular relates to a data set classification method for analyzing operation accidents of an integrated energy system.
Background
With the continuous advance of science and technology, China's intelligent Internet of Things has developed rapidly, and automation and information systems now provide new technical means for analyzing and controlling faults in integrated energy systems, bringing the energy network into the era of big data. As energy network automation and supply-and-demand consumption information acquisition systems become widespread, the levels of automation and informatization keep rising and the integrated energy system is evolving toward automation and intelligence, which places higher demands on the construction of an intelligent energy network. Large energy enterprises have begun to build enterprise-level data integration platforms that fully cover their core business systems. With the rapid development of information technology and storage media, data volume and scope continue to grow, people are exposed to ever more data set information, and enterprise management is eager to extract useful information quickly from these massive data. Conventional techniques can no longer meet the needs of modern management, and manual processing of data set information costs more and more time and effort, so research on information mining technology is becoming increasingly important. Moreover, as the energy network keeps growing, a failure of energy network equipment causes a large amount of fault information to be transmitted to the energy network dispatching center. How to extract valuable information from massive fault data and discover knowledge in time is a problem that urgently needs to be solved.
For example, Chinese invention patent CN 105824945A discloses a method for collecting technical resource data of the global energy internet, which automatically extracts text information from a target URL using an extraction model based on natural language processing, stores the extracted data on a local hard disk, and then automatically classifies the data using a naive Bayes text classification technique. Although this method improves efficiency and accuracy compared with manual data classification, it is not suitable for classifying long data sets. Because the energy network covers a wide geographic area and connects a large number of intelligent devices, a single operation accident often yields a large number of data sets for fault cause analysis, and the data set for one accident can run to hundreds or even thousands of characters. Applying the Bayesian models, convolutional neural networks, or self-attention models commonly used in the prior art to such long data sets often incurs an extremely high computational cost. A technical scheme that achieves high accuracy while keeping the computational cost low when classifying long energy network data sets is therefore urgently needed.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the shortcomings of the above technical scheme, a data set classification method for analyzing operation accidents of an integrated energy system. The method screens out the long data sets, establishes a classical bag-of-words (BOW) model, trains a token list on the original text corpus, outputs a token feature list ST, and shortens each original data set into a new data set of length 300 that covers most of the important information of the original; the new data set is then fed into a convolutional neural network model for classification. This solves the problem that existing classification models are unsuitable for long data sets. The method comprises the following steps:
Step 1: acquiring an operation accident cause data set of the integrated energy system; specifically, a large number of fault-cause records from energy monitoring equipment are obtained (taking one energy company as an example), the collected fault data sets are analyzed, and the long data sets, those exceeding 300 characters, are screened out;
Step 2: performing multidimensional preprocessing on the long data set to obtain a preprocessed long data set; the preprocessing comprises: (1) cleaning and normalizing the data set: punctuation marks, special symbols, and meaningless common words are deleted, because they do not help the system analyze and predict the content of the data set and only increase computational complexity; missing values, outliers, and duplicate values are also removed. Data that cannot be classified, such as samples lacking important fields, are filtered out. Outlier analysis applies predefined rules to eliminate or replace unreasonable and erroneous data. Duplicate analysis compares the fields of different data samples and eliminates repeated data; (2) simplifying the data set by data reduction, which comprises dimensionality reduction and numerosity reduction. Dimensionality reduction can reduce the number of variables through principal component analysis and correlation analysis to obtain a simplified or compressed representation of the raw data; (3) identifying abnormal data in the data set: even after the data set has been cleaned, normalized, and simplified, energy network fault-cause data generally still contain abnormal information such as unreasonable or repeated data. Such data differ from most of the data in the same data set and are called abnormal sample data. Because abnormal sample data directly degrade the computational accuracy of the model and cause large errors, an outlier diagnosis technique is used to find and eliminate them;
Step 3: establishing a classical bag-of-words (BOW) model and training it on the long data set obtained in step 2 to obtain a feature list of the long data set, training a gradient boosting classifier on the feature list to obtain the feature importances identified by that classifier, and sorting the features by importance in descending order to obtain a token feature list ST containing the N most important features;
Step 4: selecting any two parts of the long data set obtained in step 2 as Part1 and Part2, storing Part1 as the beginning of a new data set Set_new and Part2 as the end of Set_new, where Part1 + Part2 ≪ 1, and deleting these two parts from the long data set, so that the remainder Part3 = 1 - (Part1 + Part2) keeps a reasonable proportion of the original long data set;
Step 5: incorporating the important features in Part3 into the new data set Set_new by iterative discrimination; specifically, the token feature list ST from step 3 is iterated, starting from the most important token, and each token is searched for in the original data set. If the token is not found, the next important feature is selected and the search is repeated; if the token is present in the original data set, the token and its neighbors are selected and added to the new data set Set_new;
Step 6: setting the number of characters of the new data set Set_new to 300, and repeating step 5 until Set_new is filled;
Step 7: repeating steps 1 to 6 until all long data sets have been processed, and classifying, with a convolutional neural network, both the new data sets finally obtained in step 6 and the short data sets that were not screened out in step 1. A convolutional neural network is a type of neural network that is efficient, simple to train, and fast in classification tasks, and it is well suited to short data sets; it is selected here in view of the particular characteristics of energy network equipment fault data sets. A convolutional neural network replaces general matrix multiplication with the convolution operation and extracts features by repeatedly flipping, sliding, and summing a kernel over the input. It was first applied in computer vision, where, faced with massive image data, it progressively reduces the dimensionality of the data while retaining the main features. As research has deepened, convolutional neural networks have in recent years also been widely applied to natural language processing tasks. A convolutional neural network mainly comprises a convolutional layer, a pooling layer, and an output layer. The convolutional layer acts as the network's sensor and extracts features from the input data. Feature extraction yields a large amount of information, which the pooling layer further screens and compresses so that only the effective information is retained. The pooling layer is typically connected to the output layer through a fully connected layer, which is equivalent to the hidden layer of a conventional feedforward neural network. The output layer produces the final prediction. The convolutional neural network model of the embodiment mainly uses a one-dimensional convolutional layer and a max-over-time pooling layer; its specific structure is shown in Fig. 2.
Further, step 7 specifically comprises the following sub-steps:
step 7.1: converting the new data set Set_new obtained in step 6, or the short data set, into a word vector matrix; if the number of words is n and the dimension of each word vector is k, the size of the matrix is n × k;
step 7.2: feeding the word vector matrix to the input of the convolutional neural network model for data feature extraction;
step 7.3: performing feature fusion on the features extracted in step 7.2 within the convolutional neural network model: the convolutional layers produce a plurality of feature maps, and a pooling layer then compresses them, reducing the computational complexity of the network and extracting the main features; the pooling layer uses the max pooling function, chosen to retain the most important features; finally, the result passes through a concatenation layer and an output layer, which assigns the data sets to a plurality of different categories;
step 7.4: calculating the probability of the data set under each category with a softmax classifier to obtain the final classification result.
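Sub-steps 7.1 to 7.4 can be illustrated with a minimal forward-pass sketch in plain Python. It is not the model of Fig. 2: the function names (conv1d_max_pool, softmax, classify), the kernel widths, the ReLU activation, and the random untrained weights are all assumptions chosen only to show how an n × k word vector matrix flows through one-dimensional convolution, max-over-time pooling, concatenation, and softmax.

```python
import math
import random


def conv1d_max_pool(matrix, kernel, bias=0.0):
    """Slide one h x k kernel over an n x k word-vector matrix (n >= h),
    apply ReLU (an assumed activation), and keep the largest response."""
    h = len(kernel)
    responses = []
    for start in range(len(matrix) - h + 1):
        total = bias
        for r in range(h):
            total += sum(w * x for w, x in zip(kernel[r], matrix[start + r]))
        responses.append(max(0.0, total))
    return max(responses)  # max-over-time pooling


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(matrix, kernels, out_weights, out_bias):
    """Concatenate one pooled feature per kernel, then apply a dense
    output layer followed by softmax."""
    pooled = [conv1d_max_pool(matrix, kernel) for kernel in kernels]
    scores = [
        sum(w * p for w, p in zip(row, pooled)) + b
        for row, b in zip(out_weights, out_bias)
    ]
    return softmax(scores)


# Hypothetical example: 12 words, 4-dimensional word vectors, 3 categories.
rng = random.Random(0)
n, k = 12, 4
matrix = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
kernels = [
    [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(h)]
    for h in (2, 3, 4)  # assumed kernel widths
]
out_weights = [[rng.uniform(-1, 1) for _ in kernels] for _ in range(3)]
probs = classify(matrix, kernels, out_weights, [0.0, 0.0, 0.0])
print(probs)
```

Because the weights are random, the printed probabilities carry no meaning; in practice the kernels and output weights would be learned from labeled fault data sets.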
Based on the above technical scheme, the data set classification method for analyzing the causes of operation accidents of an integrated energy system has the following technical effects:
The method processes the long data sets: for each long data set, a classical bag-of-words (BOW) model is established and trained to obtain a feature list; a gradient boosting classifier is trained on the feature list to obtain the feature importances it identifies; and the token features used by the machine learning classifier are sorted by importance in descending order to obtain the token feature list ST. The resulting new data set covers most of the important information of the original data set and is fed into a convolutional neural network model for classification. The method solves the problem that existing classification models are unsuitable for long data sets while also reducing the amount of computation to a certain extent, that is, it balances accuracy against computational efficiency.
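The construction of the token feature list ST can be sketched as follows. The bag-of-words counting is done with the standard library; the ranking, however, is a plainly simpler stand-in, not the gradient boosting feature importance the method specifies: each token is scored by the spread of its mean count across classes. The function names and the toy fault records are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs):
    """Collect every distinct token across the tokenized documents."""
    return sorted({tok for doc in docs for tok in doc})


def bag_of_words(doc, vocab):
    """Represent one tokenized document as a vector of token counts."""
    counts = Counter(doc)
    return [counts[tok] for tok in vocab]


def rank_tokens(docs, labels, top_n):
    """Return the top_n tokens by a stand-in importance score.

    ASSUMPTION: the score is (max - min) of the per-class mean count.
    The method itself ranks tokens by the feature importances of a
    trained gradient boosting classifier, which is not reproduced here.
    """
    vocab = build_vocabulary(docs)
    rows = [bag_of_words(doc, vocab) for doc in docs]
    classes = sorted(set(labels))
    scores = {}
    for col, tok in enumerate(vocab):
        means = []
        for cls in classes:
            values = [row[col] for row, y in zip(rows, labels) if y == cls]
            means.append(sum(values) / len(values))
        scores[tok] = max(means) - min(means)
    # Sort by descending score; ties are broken alphabetically.
    ranked = sorted(vocab, key=lambda tok: (-scores[tok], tok))
    return ranked[:top_n]


# Hypothetical tokenized fault-cause records with their categories.
docs = [
    ["breaker", "trip", "line"],
    ["breaker", "trip", "relay"],
    ["pump", "leak", "line"],
    ["pump", "leak", "valve"],
]
labels = ["electrical", "electrical", "thermal", "thermal"]
ST = rank_tokens(docs, labels, top_n=4)
print(ST)
```

With a trained gradient boosting classifier available, only the scoring step would change: the per-token score would be replaced by that classifier's feature importance, and the descending sort would stay the same.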
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a data set classification method for analyzing an operation accident of an integrated energy system according to an embodiment of the present application;
Fig. 2 is a diagram of a convolutional neural network structure provided in an embodiment of the present application;
Fig. 3 is a diagram of the long data set processing procedure provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of this application. The concepts involved in this application are explained first with reference to the drawings; these explanations are intended only to aid understanding of the application and do not limit its scope.
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the data set classification method of the invention for analyzing the causes of operation accidents of an integrated energy system comprises the following steps:
Step 1: acquiring an operation accident cause data set of the integrated energy system; specifically, a large number of fault-cause records from energy monitoring equipment are obtained (taking one energy company as an example), the collected fault data sets are analyzed, and the long data sets, those exceeding 300 characters, are screened out;
Step 2: performing multidimensional preprocessing on the long data set to obtain a preprocessed long data set; the preprocessing comprises: (1) cleaning and normalizing the data set: punctuation marks, special symbols, and meaningless common words are deleted, because they do not help the system analyze and predict the content of the data set and only increase computational complexity; missing values, outliers, and duplicate values are also removed. Data that cannot be classified, such as samples lacking important fields, are filtered out. Outlier analysis applies predefined rules to eliminate or replace unreasonable and erroneous data. Duplicate analysis compares the fields of different data samples and eliminates repeated data; (2) simplifying the data set by data reduction, which comprises dimensionality reduction and numerosity reduction. Dimensionality reduction can reduce the number of variables through principal component analysis and correlation analysis to obtain a simplified or compressed representation of the raw data; (3) identifying abnormal data in the data set: even after the data set has been cleaned, normalized, and simplified, energy network fault-cause data generally still contain abnormal information such as unreasonable or repeated data. Such data differ from most of the data in the same data set and are called abnormal sample data. Because abnormal sample data directly degrade the computational accuracy of the model and cause large errors, an outlier diagnosis technique is used to find and eliminate them;
Step 3: establishing a classical bag-of-words (BOW) model and training it on the long data set obtained in step 2 to obtain a feature list of the long data set, training a gradient boosting classifier on the feature list to obtain the feature importances identified by that classifier, and sorting the features by importance in descending order to obtain a token feature list ST containing the N most important features; at this point the token feature list ST on its own has lost much of the original information and a classification model recognizes it inefficiently, so using ST directly as the input to the classification model would give unsatisfactory classification results;
Step 4: selecting any two parts of the long data set obtained in step 2 as Part1 and Part2, storing Part1 as the beginning of the new data set Set_new and Part2 as the end of Set_new, where Part1 + Part2 ≪ 1, and deleting these two parts from the long data set, so that the remainder Part3 = 1 - (Part1 + Part2) keeps a reasonable proportion in both the original long data set and the new data set. In this embodiment, candidate values Part1 ∈ {0.1, 0.2, 0.3, 0.4, 0.5} and Part2 ∈ {0, 0.05, 0.1, 0.15} were substituted into the long data set processing program, with the number of important features contained in Set_new used as the evaluation index; the best result was obtained with Part1 = 0.1 and Part2 = 0.1;
Step 5: incorporating the important features in Part3 into the new data set Set_new by iterative discrimination; specifically, the token feature list ST from step 3 is iterated, starting from the most important token, and each token is searched for in the original data set. If the token is not found, the next important feature is selected and the search is repeated; if the token is present in the original data set, the token and the neighboring tokens immediately before and after it are selected and added to the new data set Set_new. The specific process is shown in Fig. 3;
Step 6: setting the number of characters of the new data set Set_new to 300, and repeating step 5 until Set_new is filled;
Step 7: repeating steps 1 to 6 until all long data sets have been processed, and classifying, with a convolutional neural network, both the new data sets finally obtained in step 6 and the short data sets that were not screened out in step 1. A convolutional neural network is a type of neural network that is efficient, simple to train, and fast in classification tasks, and it is well suited to short data sets; it is selected here in view of the particular characteristics of energy network equipment fault data sets. A convolutional neural network replaces general matrix multiplication with the convolution operation and extracts features by repeatedly flipping, sliding, and summing a kernel over the input. It was first applied in computer vision, where, faced with massive image data, it progressively reduces the dimensionality of the data while retaining the main features. As research has deepened, convolutional neural networks have in recent years also been widely applied to natural language processing tasks. A convolutional neural network mainly comprises a convolutional layer, a pooling layer, and an output layer. The convolutional layer acts as the network's sensor and extracts features from the input data. Feature extraction yields a large amount of information, which the pooling layer further screens and compresses so that only the effective information is retained. The pooling layer is typically connected to the output layer through a fully connected layer, which is equivalent to the hidden layer of a conventional feedforward neural network. The output layer produces the final prediction. The convolutional neural network model of this embodiment mainly uses a one-dimensional convolutional layer and a max-over-time pooling layer; its specific structure is shown in Fig. 2.
Further, step 7 specifically comprises the following sub-steps:
step 7.1: converting the new data set Set_new obtained in step 6, or the short data set, into a word vector matrix; if the number of words is n and the dimension of each word vector is k, the size of the matrix is n × k;
step 7.2: feeding the word vector matrix to the input of the convolutional neural network model for data feature extraction;
step 7.3: performing feature fusion on the features extracted in step 7.2 within the convolutional neural network model: the convolutional layers produce a plurality of feature maps, and a pooling layer then compresses them, reducing the computational complexity of the network and extracting the main features; the pooling layer uses the max pooling function, chosen to retain the most important features in the data set; finally, the result passes through a concatenation layer and an output layer, which assigns the data sets to a plurality of different categories;
step 7.4: calculating the probability of the data set under each category with a softmax classifier to obtain the final classification result.
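The shortening procedure of steps 4 to 6 can be sketched in plain Python as below. Several points the text leaves open are filled with labeled assumptions: the data set is handled as a list of tokens rather than characters, Part1 and Part2 are read as fractions of the new data set's length and are taken from the start and end of the original, selected tokens keep their original order, and "neighbors" means one token on each side. The function name shorten and its parameters are hypothetical.

```python
def shorten(tokens, ranked_tokens, target_len=300, part1=0.1, part2=0.1, window=1):
    """Shorten a long token list to at most target_len tokens.

    ASSUMPTIONS (not fixed by the text): part1 and part2 are fractions of
    target_len taken from the start and end of the original; `window` is
    the number of neighbors kept on each side of an important token.
    """
    tokens = list(tokens)
    if len(tokens) <= target_len:
        return tokens  # short data sets are passed through unchanged

    n_head = round(target_len * part1)        # Part1: beginning of Set_new
    n_tail = round(target_len * part2)        # Part2: end of Set_new
    lo, hi = n_head, len(tokens) - n_tail     # Part3 spans tokens[lo:hi]
    budget = target_len - n_head - n_tail     # room left for Part3 tokens

    keep = set()  # indices in Part3 selected for Set_new
    for tok in ranked_tokens:                 # iterate ST, most important first
        for i in range(lo, hi):
            if tokens[i] != tok:
                continue                      # not this token; keep searching
            # Keep the token together with its neighbors inside Part3.
            for j in range(max(lo, i - window), min(hi, i + window + 1)):
                if len(keep) < budget:
                    keep.add(j)
        if len(keep) >= budget:
            break                             # Set_new is filled (step 6)

    middle = [tokens[j] for j in sorted(keep)]
    return tokens[:n_head] + middle + tokens[hi:]


# Hypothetical example: 1000 distinct tokens, three important tokens.
tokens = [f"w{i}" for i in range(1000)]
short = shorten(tokens, ["w500", "w42", "w900"], target_len=20)
print(short)
```

The defaults part1 = 0.1 and part2 = 0.1 follow the values this embodiment reports as optimal. If ST is exhausted before the budget is used up, the result is simply shorter than target_len; how to pad in that case is not specified in the text.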
Based on the above technical scheme, the data set classification method for analyzing the causes of operation accidents of an integrated energy system screens out the long data sets, establishes a classical bag-of-words (BOW) model to train a token list, outputs a token feature list ST, and shortens each original data set by feature extraction into a new data set of length 300 that covers most of the important information of the original; the new data set is then fed into a convolutional neural network model for classification. The method solves the problem that existing classification models are unsuitable for long data sets while greatly reducing the amount of computation.
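The screening of step 1 and the text clean-up in part (1) of step 2 can be sketched as follows. The record layout (dictionaries with "id" and "text" fields), the stop-word list, and the function names are hypothetical; the dimensionality reduction and outlier diagnosis of parts (2) and (3) are not covered.

```python
import re

# Hypothetical stop-word list; a real deployment would supply its own.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is"}


def split_by_length(records, limit=300):
    """Step 1: separate long data sets (more than `limit` characters)
    from short ones."""
    long_sets = [r for r in records if len(r["text"]) > limit]
    short_sets = [r for r in records if len(r["text"]) <= limit]
    return long_sets, short_sets


def clean_text(text):
    """Delete punctuation and special symbols, then drop stop words."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def preprocess(records, required=("id", "text")):
    """Step 2, part (1): drop records with missing fields, normalize the
    text, and remove records whose cleaned text repeats an earlier one."""
    seen = set()
    cleaned = []
    for rec in records:
        if any(rec.get(field) in (None, "") for field in required):
            continue  # missing value in an important field
        text = clean_text(rec["text"])
        if not text or text in seen:
            continue  # empty after cleaning, or a duplicate
        seen.add(text)
        cleaned.append({**rec, "text": text})
    return cleaned


# Hypothetical raw fault-cause records.
raw = [
    {"id": 1, "text": "Breaker tripped!!! The relay, of line 3."},
    {"id": 2, "text": "breaker tripped the relay of line 3"},
    {"id": 3, "text": ""},
    {"id": 4},
]
cleaned = preprocess(raw)
print(cleaned)
```

In the method itself, split_by_length would run first and only the long data sets would go through the preprocessing; the short ones go straight to the classifier in step 7.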
The embodiments and/or implementations described above merely illustrate preferred embodiments and/or implementations of the present technology and do not limit its implementation in any way. Those skilled in the art can make modifications or changes without departing from the scope of the technology disclosed herein, and the results should be regarded as technology or implementations substantially the same as the present technology.