CN114638558B - Data set classification method for operation accident analysis of comprehensive energy system - Google Patents

Info

Publication number
CN114638558B
Authority
CN
China
Prior art keywords
data set
data
new
long
mark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210540826.1A
Other languages
Chinese (zh)
Other versions
CN114638558A (en
Inventor
李强
胡浩瀚
赵峰
李温静
刘永清
郭正雄
闫松
董建强
佘文魁
朱传晶
纪元
戴彬
刘晓静
张来东
彭晓武
田永茂
张雪成
倪升亚
李琳
张健
韩永跃
任承欢
张瑞超
强凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Richsoft Electric Power Information Technology Co ltd
State Grid Information and Telecommunication Co Ltd
Original Assignee
Tianjin Richsoft Electric Power Information Technology Co ltd
State Grid Information and Telecommunication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Richsoft Electric Power Information Technology Co ltd and State Grid Information and Telecommunication Co Ltd
Priority to CN202210540826.1A
Publication of CN114638558A
Application granted
Publication of CN114638558B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/20 Administration of product repair or maintenance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Electricity, gas or water supply
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/80 Management or planning
    • Y02P90/82 Energy audits or management systems therefor

Abstract

In the data set classification method for analyzing the causes of operation accidents of an integrated energy system, long data sets are screened out, a classic bag-of-words (BOW) model is built and trained on the token list, and a token feature list ST is output. Each original data set is then shortened to a new data set of length 300 that retains most of the important information of the original, and the new data set is fed into a convolutional neural network model for classification. The method addresses the poor fit of existing classification models to long data sets and at the same time reduces the amount of computation to some extent, striking a balance between accuracy and computational efficiency.

Description

Data set classification method for operation accident analysis of comprehensive energy system
Technical Field
The invention belongs to the field of data processing and analysis of an integrated energy system, and particularly relates to a data set classification method for analysis of operation accidents of the integrated energy system.
Background
With the continuing development of science and technology, the intelligent Internet of Things in China has developed rapidly, and automation and information systems now provide new technical means for analyzing and controlling faults in integrated energy systems, so the energy network is gradually entering the era of big data. As energy network automation and supply, demand, and consumption information acquisition systems become widespread, the levels of automation and informatization keep rising, and integrated energy systems are becoming more automated and intelligent, which places higher requirements on the construction of an intelligent energy network. Large energy enterprises have begun building enterprise-level data integration platforms to achieve full coverage of their core business systems. With the rapid development of information technology and storage media, data volume and scope keep expanding, the amount of data set information people handle keeps growing, and enterprise management is eager to obtain useful information quickly from this mass of data. Conventional techniques can no longer meet the needs of modern management, and processing data set information manually takes ever more time and effort. Research on information mining technology is therefore increasingly important. As the energy network grows in scale, a large amount of fault information is transmitted to the energy network dispatching center whenever network equipment fails. How to extract valuable information from massive fault data and discover knowledge in time is a problem that urgently needs to be solved.
For example, Chinese invention patent CN 105824945A discloses a method for collecting technical resource data of the global energy internet. For a target URL, it automatically extracts text information using an extraction model based on natural language processing, stores the extracted data on a local hard disk, and then automatically classifies the data using a naive Bayes text classification technique. Although this approach improves efficiency and accuracy compared with manual data classification, it is not suitable for classifying long data sets. Because the energy network covers a wide geographic area and connects a large number of intelligent devices, a single operation accident often produces a large number of data sets for fault cause analysis, and the data set for one accident can exceed hundreds or even thousands of characters. Applying the models commonly used in the prior art, such as a Bayesian model, a convolutional neural network, or a self-attention model, to such long data sets often incurs an extremely high computational cost. The prior art therefore urgently needs a technical scheme that classifies long energy network data sets with higher accuracy while keeping the computational cost low.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a data set classification method for analyzing operation accidents of an integrated energy system that overcomes the above shortcomings of the prior art. The method screens out long data sets, builds a classic bag-of-words (BOW) model, trains it on the original text corpus to produce a token list, and outputs a token feature list ST. It then shortens each original data set into a new data set of length 300 that retains most of the important information of the original, and feeds the new data set into a convolutional neural network model for classification, thereby addressing the poor fit of existing classification models to long data sets. The method comprises the following steps:
Step 1: acquiring a data set of operation accident causes of the integrated energy system, that is, obtaining a large number of fault causes from the energy monitoring equipment; taking an energy company as an example, the collected fault data sets are analyzed and the long data sets of more than 300 characters are screened out;
Step 2: performing multidimensional preprocessing on the long data set to obtain a preprocessed long data set; the preprocessing comprises: (1) normalizing the data set: punctuation marks, special symbols, and meaningless common words are deleted, because they do not help the system analyze and predict the content of the data set and only increase computational complexity, and missing values, outliers, and duplicate values are removed from the data set. Data that cannot be classified, such as samples lacking important fields, is filtered out. Outlier analysis defines rules for eliminating or replacing unreasonable and erroneous data. Duplicate value analysis compares the fields of different data samples and eliminates repeated data; (2) simplifying the data set by data reduction, which comprises dimensionality reduction and numerosity reduction. Dimensionality reduction can reduce the number of variables through principal component analysis and correlation analysis to obtain a simplified or compressed representation of the raw data; (3) detecting abnormal data in the data set: even after the data set has been cleaned, normalized, and simplified, the energy network fault cause data generally still contains abnormal information such as unreasonable or repeated data, which differs from most of the data in the same data set and is called abnormal sample data. Abnormal sample data directly affects the computational accuracy of the model and causes large errors, so an outlier diagnosis technique is used to find and eliminate it;
Step 3: establishing a classic BOW model and training it on the long data set obtained in step 2 to obtain a feature list of the long data set, training a gradient boosting classifier on the feature list to obtain the feature importance identified by the classifier, and sorting the features by importance from highest to lowest to obtain a token feature list ST with N important features;
Step 4: selecting any two parts of the long data set obtained in step 2 as Part1 and Part2, storing Part1 as the beginning of a new data set Set_new and Part2 as the end of Set_new, where Part1 + Part2 ≪ 1, and deleting the two parts from the long data set, so that the remainder Part3 = 1 - (Part1 + Part2) is kept in the original long data set in a reasonable proportion;
Step 5: incorporating the important features in Part3 into the new data set Set_new by iterative discrimination; specifically, the token feature list ST from step 3 is iterated, starting from the token previously determined to be the most important, and this token is searched for in the original data set. If it is not found, the next most important feature is selected and the search is repeated; if the selected token is present in the original data set, the token and its neighboring tokens are selected and added to the new data set Set_new;
Step 6: setting the number of characters of the new data set Set_new to 300, and repeating step 5 until Set_new is filled;
Step 7: repeating steps 1 to 6 until all the long data sets are processed, and using a convolutional neural network to classify the new data sets finally obtained in step 6 together with the short data sets that were not screened out in step 1. A convolutional neural network is a kind of neural network that is efficient, simple to train, and fast in classification tasks, and it is well suited to short data sets. Given the particular nature of energy network equipment fault data sets, a convolutional neural network is chosen to classify them. A convolutional neural network replaces matrix multiplication with the convolution operation and extracts features through repeated flipping, sliding, and summation; it was first applied in computer vision. In image recognition, when faced with massive image data, a convolutional neural network can progressively reduce the dimensionality of the data and finally retain its main features. As research has deepened, convolutional neural networks have in recent years been widely applied to natural language processing tasks. A convolutional neural network mainly comprises a convolutional layer, a pooling layer, and an output layer. The convolutional layer acts as the network's perception unit and extracts features from the input data. Feature extraction yields a large amount of information, which the pooling layer further screens and compresses so that the effective information is retained. The pooling layer is usually connected to the output layer through a fully connected layer, which is equivalent to the hidden layer of a conventional feedforward neural network. The final output layer produces the prediction. The convolutional neural network model of the embodiment mainly uses a one-dimensional convolutional layer and a max-over-time pooling layer, and its specific structure is shown in Fig. 2.
Further, step 7 specifically comprises the following sub-steps: step 7.1: converting the new data set Set_new obtained in step 6, or the short data set, into a word vector matrix, where if the number of words is n and the dimension of the word vector is k, the size of the matrix is n × k; step 7.2: using the word vector matrix as the input of the convolutional neural network model to extract data features; step 7.3: performing feature fusion on the features extracted in step 7.2 through the convolutional neural network model: the convolutional layers in the model produce a plurality of feature maps, and a pooling layer then compresses the features, which reduces the computational complexity of the network and extracts the main features; the pooling layer uses the max pooling function, chosen to obtain the most important features; finally, the output of the convolutional neural network model passes through a concatenation layer and an output layer to yield the data sets in their different categories; step 7.4: calculating the probability of the data set under each category with a softmax classifier to obtain the final classification result.
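The shortening procedure in steps 4 to 6 can be illustrated with a minimal Python sketch. It is not the patented implementation. Several points are assumptions made only for illustration: length is counted in tokens rather than characters, Part1 and Part2 are read as fractions of the 300-unit budget (the text does not say whether they refer to the budget or to the original data set), one neighbor is taken on each side of a matched token, and the selected tokens keep their original order. The names `shorten`, `ranked_tokens`, and `window` are hypothetical.

```python
from typing import List


def shorten(tokens: List[str], ranked_tokens: List[str],
            part1: float = 0.1, part2: float = 0.1,
            target_len: int = 300, window: int = 1) -> List[str]:
    """Build Set_new from one long token sequence (steps 4 to 6).

    Assumptions (not stated in the text): length is counted in tokens,
    part1 and part2 are fractions of target_len, and `window` neighbors
    are taken on each side of a matched token.
    """
    if len(tokens) <= target_len:
        return list(tokens)  # short data sets pass through unchanged

    n_head = int(target_len * part1)
    n_tail = int(target_len * part2)
    head = tokens[:n_head]                                   # Part1
    tail = tokens[len(tokens) - n_tail:] if n_tail else []   # Part2
    middle = tokens[n_head:len(tokens) - n_tail]             # Part3
    budget = target_len - n_head - n_tail

    keep = set()  # positions in `middle` selected so far
    for tok in ranked_tokens:  # most important token first
        for i, current in enumerate(middle):
            if len(keep) >= budget:
                break
            if current != tok:
                continue
            lo, hi = max(0, i - window), min(len(middle), i + window + 1)
            for j in range(lo, hi):
                if len(keep) < budget:
                    keep.add(j)
        if len(keep) >= budget:
            break

    selected = [middle[i] for i in sorted(keep)]  # preserve original order
    return head + selected + tail
```

If the ranked list is exhausted before the budget is used up, this sketch returns a sequence shorter than `target_len`; how such a case should be padded is not specified in the text.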
Based on the above technical scheme, the data set classification method for analyzing the causes of operation accidents of the integrated energy system has the following technical effects:
In the data set classification method for analyzing the causes of operation accidents of the integrated energy system, long data sets are processed as follows: a classic BOW model is built and trained on each long data set to obtain a feature list, a gradient boosting classifier is trained on the feature list to obtain the feature importance it identifies, and the token features used by the machine learning classifier are sorted by importance from highest to lowest to obtain the token feature list ST. The new data set retains most of the important information of the original data set and is fed into a convolutional neural network model for classification. The method addresses the poor fit of existing classification models to long data sets and at the same time reduces the amount of computation to some extent, striking a balance between accuracy and computational efficiency.
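As a rough illustration of how a bag-of-words feature list and an importance-ranked token list ST fit together, the following Python sketch builds a count matrix and sorts the vocabulary by a supplied importance score. The gradient boosting classifier itself is not implemented here: the `importances` argument stands in for whatever per-feature importance a trained classifier reports, and the function names and the numbers in the usage lines are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def bag_of_words(docs: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Return the vocabulary and one token-count row per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, count in Counter(doc).items():
            row[index[tok]] = count
        matrix.append(row)
    return vocab, matrix


def rank_tokens(vocab: List[str], importances: Sequence[float], n: int) -> List[str]:
    """Sort tokens by importance, highest first, and keep the top n (list ST)."""
    order = sorted(range(len(vocab)), key=lambda i: importances[i], reverse=True)
    return [vocab[i] for i in order[:n]]


# Usage with made-up documents and made-up importance scores.
docs = [["pump", "fault", "pump"], ["valve", "fault"]]
vocab, matrix = bag_of_words(docs)
st = rank_tokens(vocab, [0.1, 0.7, 0.2], n=2)
```

In practice the count matrix would be the training input of the classifier, and the importance scores would be read from the trained model rather than written by hand.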
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a data set classification method for analyzing an operation accident of an integrated energy system according to an embodiment of the present application;
Fig. 2 is a diagram of the convolutional neural network structure provided in an embodiment of the present application;
Fig. 3 shows the long data set processing procedure provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. Based on the embodiments in this application, all other embodiments that can be obtained by a person skilled in the art without making any creative effort belong to the protection scope of this application, and the concept related to this application will be first explained with reference to the attached drawings. It should be noted that the following descriptions of the concepts are only for the purpose of facilitating understanding of the contents of the present application, and do not represent limitations on the scope of the present application.
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the technical scheme of the invention is a data set classification method for analyzing the causes of operation accidents of an integrated energy system, comprising the following steps:
Step 1: acquiring a data set of operation accident causes of the integrated energy system, that is, obtaining a large number of fault causes from the energy monitoring equipment; taking an energy company as an example, the collected fault data sets are analyzed and the long data sets of more than 300 characters are screened out;
Step 2: performing multidimensional preprocessing on the long data set to obtain a preprocessed long data set; the preprocessing comprises: (1) normalizing the data set: punctuation marks, special symbols, and meaningless common words are deleted, because they do not help the system analyze and predict the content of the data set and only increase computational complexity, and missing values, outliers, and duplicate values are removed from the data set. Data that cannot be classified, such as samples lacking important fields, is filtered out. Outlier analysis defines rules for eliminating or replacing unreasonable and erroneous data. Duplicate value analysis compares the fields of different data samples and eliminates repeated data; (2) simplifying the data set by data reduction, which comprises dimensionality reduction and numerosity reduction. Dimensionality reduction can reduce the number of variables through principal component analysis and correlation analysis to obtain a simplified or compressed representation of the raw data; (3) detecting abnormal data in the data set: even after the data set has been cleaned, normalized, and simplified, the energy network fault cause data generally still contains abnormal information such as unreasonable or repeated data, which differs from most of the data in the same data set and is called abnormal sample data. Abnormal sample data directly affects the computational accuracy of the model and causes large errors, so an outlier diagnosis technique is used to find and eliminate it;
Step 3: establishing a classic BOW model and training it on the long data set obtained in step 2 to obtain a feature list of the long data set, training a gradient boosting classifier on the feature list to obtain the feature importance identified by the classifier, and sorting the features by importance from highest to lowest to obtain a token feature list ST with N important features; at this point the token feature list has lost a considerable amount of information and a classification model identifies it inefficiently, so the classification result is not ideal if the token feature list ST itself is used as the input data of the classification model;
Step 4: selecting any two parts of the long data set obtained in step 2 as Part1 and Part2, storing Part1 as the beginning of the new data set Set_new and Part2 as the end of Set_new, where Part1 + Part2 ≪ 1, and deleting the two parts from the long data set, so that the remainder Part3 = 1 - (Part1 + Part2) keeps reasonable proportions in both the original long data set and the new data set. In this embodiment, Part1 is drawn from {0.1, 0.2, 0.3, 0.4, 0.5} and Part2 from {0, 0.05, 0.1, 0.15}, each combination is substituted into the long data set processing program, and the number of important features contained in Set_new is used as the evaluation index; the processing effect is found to be best when Part1 is set to 0.1 and Part2 is set to 0.1;
Step 5: incorporating the important features in Part3 into the new data set Set_new by iterative discrimination; specifically, the token feature list ST from step 3 is iterated, starting from the token previously determined to be the most important, and this token is searched for in the original data set. If it is not found, the next most important feature is selected and the search is repeated; if the selected token exists in the original data set, the token and the neighboring tokens before and after it are selected and added to the new data set Set_new; the specific process is shown in Fig. 3;
Step 6: setting the number of characters of the new data set Set_new to 300, and repeating step 5 until Set_new is filled;
Step 7: repeating steps 1 to 6 until all the long data sets are processed, and using a convolutional neural network to classify the new data sets finally obtained in step 6 together with the short data sets that were not screened out in step 1. A convolutional neural network is a kind of neural network that is efficient, simple to train, and fast in classification tasks, and it is well suited to short data sets. Given the particular nature of energy network equipment fault data sets, a convolutional neural network is chosen to classify them. A convolutional neural network replaces matrix multiplication with the convolution operation and extracts features through repeated flipping, sliding, and summation; it was first applied in computer vision. In image recognition, when faced with massive image data, a convolutional neural network can progressively reduce the dimensionality of the data and finally retain its main features. As research has deepened, convolutional neural networks have in recent years been widely applied to natural language processing tasks. A convolutional neural network mainly comprises a convolutional layer, a pooling layer, and an output layer. The convolutional layer acts as the network's perception unit and extracts features from the input data. Feature extraction yields a large amount of information, which the pooling layer further screens and compresses so that the effective information is retained. The pooling layer is usually connected to the output layer through a fully connected layer, which is equivalent to the hidden layer of a conventional feedforward neural network. The final output layer produces the prediction. The convolutional neural network model of the embodiment mainly uses a one-dimensional convolutional layer and a max-over-time pooling layer, and its specific structure is shown in Fig. 2.
Further, step 7 specifically comprises the following sub-steps: step 7.1: converting the new data set Set_new obtained in step 6, or the short data set, into a word vector matrix, where if the number of words is n and the dimension of the word vector is k, the size of the matrix is n × k; step 7.2: using the word vector matrix as the input of the convolutional neural network model to extract data features; step 7.3: performing feature fusion on the features extracted in step 7.2 through the convolutional neural network model: the convolutional layers in the model produce a plurality of feature maps, and a pooling layer then compresses the features, which reduces the computational complexity of the network and extracts the main features; the pooling layer uses the max pooling function, chosen to obtain the most important features in the data set; finally, the output of the convolutional neural network model passes through a concatenation layer and an output layer to yield the data sets in their different categories; step 7.4: calculating the probability of the data set under each category with a softmax classifier to obtain the final classification result.
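The data flow of sub-steps 7.1 to 7.4 can be traced with a small pure-Python forward pass: an n × k word vector matrix goes through one-dimensional convolution filters, each feature map is reduced by max-over-time pooling, the pooled values are concatenated, and a linear output layer followed by softmax gives class probabilities. This is only a sketch of the general technique. The filter values, output weights, and function names are hypothetical, no activation function or training is shown, and the actual layer configuration of Fig. 2 is not reproduced.

```python
import math
from typing import List

Matrix = List[List[float]]


def conv1d(x: Matrix, kernel: Matrix, bias: float = 0.0) -> List[float]:
    """Slide one h x k filter down an n x k word vector matrix."""
    h = len(kernel)
    feature_map = []
    for start in range(len(x) - h + 1):
        total = bias
        for r in range(h):
            for c in range(len(kernel[r])):
                total += x[start + r][c] * kernel[r][c]
        feature_map.append(total)
    return feature_map


def max_over_time(feature_map: List[float]) -> float:
    """Keep only the strongest response of one filter."""
    return max(feature_map)


def softmax(logits: List[float]) -> List[float]:
    """Turn output-layer scores into class probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(x: Matrix, kernels: List[Matrix],
             weights: Matrix, biases: List[float]) -> List[float]:
    """Convolution, pooling, concatenation, linear output layer, softmax."""
    pooled = [max_over_time(conv1d(x, k)) for k in kernels]  # concatenation
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


# Usage with a 3 x 2 word vector matrix and two hypothetical 2 x 2 filters.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
probs = classify(x, kernels, weights=[[1.0, 0.0], [0.0, 1.0]], biases=[0.0, 0.0])
```

Because each filter contributes a single pooled value regardless of n, the same output layer works for data sets of different lengths, which is one reason max-over-time pooling suits variable-length text.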
Based on the above technical scheme, the data set classification method for analyzing the causes of operation accidents of the integrated energy system screens out long data sets, builds a classic BOW model to train on the token list, and outputs a token feature list ST; through feature extraction it shortens each original data set into a new data set of length 300 that retains most of the important information of the original, and the new data set is fed into a convolutional neural network model for classification. The method addresses the poor fit of existing classification models to long data sets while greatly reducing the amount of computation.
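The screening in step 1 and part of the normalization in step 2 can also be sketched in a few lines of Python. The sketch assumes each accident record is a plain string, uses the 300-character threshold from the text, and drops empty and exactly repeated records. The stop-word set and the function names are hypothetical placeholders, and rule-based outlier handling and data reduction are not covered.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Illustrative stop words only; a real list would be domain specific.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is"}


def screen(records: Iterable[Optional[str]],
           limit: int = 300) -> Tuple[List[str], List[str]]:
    """Split records into long (> limit characters) and short data sets,
    skipping missing values and exact duplicates."""
    long_sets, short_sets, seen = [], [], set()
    for rec in records:
        if rec is None or not rec.strip():
            continue  # missing value
        if rec in seen:
            continue  # repeated value
        seen.add(rec)
        (long_sets if len(rec) > limit else short_sets).append(rec)
    return long_sets, short_sets


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation and special symbols with spaces,
    and drop stop words."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]
```

Only the long data sets returned by `screen` would go through the shortening procedure; the short ones would be passed to the classifier directly, as step 7 describes.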
The embodiments and/or implementations described above only illustrate preferred embodiments and/or implementations of the present technology and do not limit its implementation in any way. Those skilled in the art can make modifications or changes without departing from the scope of the technology disclosed herein, and such modified versions should be regarded as technology or implementations substantially the same as the present technology.

Claims (5)

1. A data set classification method for analyzing operation accident causes of an integrated energy system is characterized by comprising the following steps: the method comprises the following steps:
step 1: acquiring a comprehensive energy system operation accident reason data set, and screening out a long data set with characters exceeding 300;
step 2: carrying out multidimensional preprocessing on the long data set to obtain a preprocessed long data set;
step 3: establishing a classical BOW (bag-of-words) model and training it on the long data set obtained in step 2 to obtain a feature list of the long data set; training a gradient boosting classifier on the feature list to obtain the feature importances identified by the classifier; and sorting the feature importances in descending order to obtain a token feature list ST with N important features;
step 4: selecting any two parts of data from the long data set obtained in step 2 as part1 and part2, storing part1 as the beginning of a new data set Set_new and part2 as the end of Set_new, where part1 + part2 ≪ 1; deleting the two parts from the long data set, so that the remainder kept in the original long data set, part3, accounts for 80% of it;
step 5: identifying the important features in part3 by iterative discrimination, and merging the important features into the new data set Set_new;
step 6: setting the character count of the new data set Set_new to 300, and repeating step 5 until Set_new is filled;
step 7: repeating steps 1 to 6 until all long data sets have been processed, and classifying, with a convolutional neural network, the new data sets finally obtained in step 6 together with the short data sets not screened out in step 1;
wherein the token feature list ST of step 3 is iterated, starting from the mark (token) previously determined to be the most important, and that mark is searched for in the original data set; if the mark is not found, the next most important feature is selected and the search is repeated; if the selected mark exists in the original data set, the neighboring marks before and after it are selected, and the mark and its neighbors are added to the new data set Set_new;
the new data set Set_new or the short data set is converted into a word vector matrix;
step 7 specifically comprises the following substeps:
step 7.1: converting the new data set Set_new obtained in step 6, or the short data set, into a word vector matrix, wherein if the number of words is n and the dimension of the word vectors is k, the size of the matrix is n × k;
step 7.2: using the word vector matrix as the input of the convolutional neural network model, which extracts the data features;
step 7.3: performing a feature fusion operation on the features extracted in step 7.2 through the convolutional neural network model;
step 7.4: calculating the probability of the data set under each category with a softmax classifier to obtain the final classification result.
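Steps 4 to 6 and the iteration over ST can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the budget is counted in tokens rather than characters, part1 and part2 are taken as the first and last 10% of the sequence (capped at a quarter of the budget so they cannot exhaust it), only the first occurrence of each mark is used, the neighbor window is two marks on each side, and the selected marks are emitted in their original order. The function name `shorten` and all parameter names are hypothetical.

```python
def shorten(tokens, ranked_tokens, budget=300, head_frac=0.1, tail_frac=0.1, window=2):
    """Shorten a long token sequence to at most `budget` tokens."""
    n = len(tokens)
    if n <= budget:
        return list(tokens)  # short data sets pass through unchanged

    # part1 / part2: beginning and end of the sequence (capped; an assumption).
    head_len = min(int(n * head_frac), budget // 4)
    tail_len = min(int(n * tail_frac), budget // 4)
    head = tokens[:head_len]
    tail = tokens[n - tail_len:]
    middle = tokens[head_len:n - tail_len]  # part3: what remains after removal

    room = budget - head_len - tail_len
    chosen = set()  # indices into `middle` that will be kept
    for mark in ranked_tokens:  # most important mark first
        if len(chosen) >= room:
            break  # Set_new is full
        try:
            i = middle.index(mark)
        except ValueError:
            continue  # mark not found: try the next important feature
        # Keep the mark together with its neighbors before and after.
        for j in range(max(0, i - window), min(len(middle), i + window + 1)):
            if len(chosen) < room:
                chosen.add(j)

    return head + [middle[j] for j in sorted(chosen)] + tail


tokens = ["w%d" % i for i in range(1000)]
tokens[500] = "fault"
set_new = shorten(tokens, ["missing", "fault"], budget=300)
```

If the ranked list runs out before the budget is reached, this sketch simply returns a shorter result; how the original method pads or continues in that case is not stated in the text.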
2. The data set classification method for analyzing operation accident causes of an integrated energy system according to claim 1, wherein the multidimensional preprocessing in step 2 comprises: (1) standardizing the data set: deleting punctuation marks, special symbols, and meaningless common words, and removing missing values, abnormal values, and duplicate values from the data set; (2) simplifying the data set by data reduction, including dimensionality reduction and numerosity reduction, wherein the dimensionality reduction reduces the number of variables through principal component analysis and correlation analysis to obtain a simplified or compressed representation of the original data; (3) identifying and eliminating abnormal data in the data set: on the standardized and simplified data set, searching for and eliminating outlier data using an outlier diagnosis technique.
3. The data set classification method for analyzing operation accident causes of an integrated energy system according to claim 2, wherein removing the missing values, abnormal values, and duplicate values from the data set specifically comprises: searching the original data for missing values according to the characteristics of the original data, and eliminating or supplementing them, wherein unimportant missing values are supplemented by filling from adjacent data, so that data which cannot be classified are filtered out; formulating rules to eliminate or replace unreasonable and erroneous data; and comparing the fields of different data samples and rejecting duplicate data.
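The missing-value and duplicate handling described above might look like the following sketch. The record layout (dictionaries with a required `text` field and a fillable `location` field) is purely hypothetical, and "adjacent data filling" is interpreted here as copying the value from the nearest preceding kept record, which is only one possible reading.

```python
def clean(records, required=("text",), fillable=("location",)):
    """Drop unclassifiable records, fill minor gaps, and remove duplicates."""
    cleaned = []
    seen = set()
    previous = {}  # last known value of each fillable field
    for rec in records:
        rec = dict(rec)  # work on a copy; the input is left untouched
        # A record missing a required field cannot be classified: filter it out.
        if any(not rec.get(field) for field in required):
            continue
        # Unimportant gaps are filled from the adjacent (preceding) record.
        for field in fillable:
            if not rec.get(field) and field in previous:
                rec[field] = previous[field]
            if rec.get(field):
                previous[field] = rec[field]
        # Compare all fields across samples and reject repeated records.
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


records = [
    {"text": "transformer overheated", "location": "station A"},
    {"text": "valve leak", "location": None},
    {"text": None, "location": "station B"},
    {"text": "transformer overheated", "location": "station A"},
]
result = clean(records)
```

Rule-based replacement of unreasonable values is omitted, since the text does not say which rules are used.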
4. The data set classification method for analyzing operation accident causes of an integrated energy system according to claim 1, wherein step 7.3 specifically comprises: obtaining a plurality of feature maps through a convolutional layer of the model; compressing the features through a pooling layer, which reduces the computational complexity of the network and extracts the main features; and finally producing the output of the convolutional neural network model through a concatenation layer and an output layer, to obtain data sets of a plurality of different categories.
5. The data set classification method for analyzing operation accident causes of an integrated energy system according to claim 4, wherein the pooling layer uses a max-pooling function, which selects the maximum feature value.
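The classification stage (an n × k word vector matrix, convolution, max pooling, concatenation, and softmax) can be illustrated with a small forward pass in plain Python. This is a sketch with toy, hand-picked weights rather than a trained model; the ReLU activation, the kernel shapes, and the names `conv1d`, `softmax`, and `classify` are assumptions introduced for illustration.

```python
import math


def conv1d(matrix, kernel, bias=0.0):
    """Slide an h x k kernel down an n x k matrix (requires n >= h).

    Returns a feature map of n - h + 1 ReLU activations.
    """
    h = len(kernel)
    feature_map = []
    for start in range(len(matrix) - h + 1):
        total = bias
        for r in range(h):
            for c in range(len(kernel[r])):
                total += matrix[start + r][c] * kernel[r][c]
        feature_map.append(max(0.0, total))
    return feature_map


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(matrix, kernels, weights, biases):
    """Convolution, max pooling per feature map, concatenation, dense layer, softmax."""
    pooled = [max(conv1d(matrix, kernel)) for kernel in kernels]
    scores = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


# Toy input: n = 4 words, k = 2 dimensions per word vector.
matrix = [[1, 0], [0, 1], [1, 1], [0, 0]]
kernels = [[[1, 0], [0, 1]], [[1, 1]]]  # one kernel of height 2, one of height 1
probs = classify(matrix, kernels, weights=[[1, 0], [0, 0]], biases=[0.0, 0.0])
```

Max pooling keeps one value per feature map regardless of n, which is why inputs of different lengths can share the same output layer.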
CN202210540826.1A 2022-05-19 2022-05-19 Data set classification method for operation accident analysis of comprehensive energy system Active CN114638558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210540826.1A CN114638558B (en) 2022-05-19 2022-05-19 Data set classification method for operation accident analysis of comprehensive energy system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210540826.1A CN114638558B (en) 2022-05-19 2022-05-19 Data set classification method for operation accident analysis of comprehensive energy system

Publications (2)

Publication Number Publication Date
CN114638558A CN114638558A (en) 2022-06-17
CN114638558B true CN114638558B (en) 2022-08-23

Family

ID=81953295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210540826.1A Active CN114638558B (en) 2022-05-19 2022-05-19 Data set classification method for operation accident analysis of comprehensive energy system

Country Status (1)

Country Link
CN (1) CN114638558B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824945A (en) * 2016-03-21 2016-08-03 中国电力科学研究院 Method for collecting global energy Internet technology resource data
US10424048B1 (en) * 2019-02-15 2019-09-24 Shotspotter, Inc. Systems and methods involving creation and/or utilization of image mosaic in classification of acoustic events
CN111489554A (en) * 2020-05-12 2020-08-04 哈尔滨工业大学 Urban road traffic accident prevention and control analysis method based on Bow-tie model
CN112036472A (en) * 2020-08-28 2020-12-04 长安大学 Visual image classification method and system for power system
CN112989052A (en) * 2021-04-19 2021-06-18 北京建筑大学 Chinese news text classification method based on combined-convolutional neural network
CN113792825A (en) * 2021-11-17 2021-12-14 国网江苏省电力有限公司营销服务中心 Fault classification model training method and device for electricity information acquisition equipment
CN113850330A (en) * 2021-09-27 2021-12-28 华北电力大学 Power distribution network fault cause detection method based on short-time Fourier transform and convolutional neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402697B2 (en) * 2016-08-01 2019-09-03 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
CN111046945B (en) * 2019-12-10 2023-10-24 北京化工大学 Fault type and damage degree diagnosis method based on combined convolutional neural network
CN111767398A (en) * 2020-06-30 2020-10-13 国网新疆电力有限公司电力科学研究院 Secondary equipment fault short text data classification method based on convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
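Word Embedding based News Classification by using CNN; Faisal Ahmed et al.; IEEE; 2021-08-26; pp. 609-613 *
A review of research on classification methods for imbalanced data sets (不平衡数据集分类方法研究综述); Zhou Yu et al.; Application Research of Computers (计算机应用研究); 2022-01-11; pp. 1-8 *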

Also Published As

Publication number Publication date
CN114638558A (en) 2022-06-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220803

Address after: Room 608, block J, Haitai green industrial base, No.6 Haitai development road, Huayuan Industrial Zone, Binhai New Area, Tianjin 300384

Applicant after: TIANJIN RICHSOFT ELECTRIC POWER INFORMATION TECHNOLOGY Co.,Ltd.

Applicant after: STATE GRID INFORMATION & TELECOMMUNICATION GROUP Co.,Ltd.

Address before: Room 608, block J, Haitai green industrial base, No.6 Haitai development road, Huayuan Industrial Zone, Binhai New Area, Tianjin 300384

Applicant before: TIANJIN RICHSOFT ELECTRIC POWER INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant