CN114154561B - Electric power data management method based on natural language processing and random forest - Google Patents

Electric power data management method based on natural language processing and random forest

Info

Publication number
CN114154561B
CN114154561B · CN202111345415.9A · CN202111345415A
Authority
CN
China
Prior art keywords
data
random forest
model
feature
natural language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111345415.9A
Other languages
Chinese (zh)
Other versions
CN114154561A (en)
Inventor
刘伟
叶磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Central China Technology Development Of Electric Power Co ltd
State Grid Corp of China SGCC
Original Assignee
Hubei Central China Technology Development Of Electric Power Co ltd
State Grid Corp of China SGCC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Central China Technology Development Of Electric Power Co ltd, State Grid Corp of China SGCC filed Critical Hubei Central China Technology Development Of Electric Power Co ltd
Priority to CN202111345415.9A priority Critical patent/CN114154561B/en
Publication of CN114154561A publication Critical patent/CN114154561A/en
Application granted granted Critical
Publication of CN114154561B publication Critical patent/CN114154561B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention provides a power data management method based on natural language processing and random forests, which comprises the following steps: a first step of data extraction to obtain a training set F; a second step of extracting feature data from the training set F and segmenting the model data to obtain a feature data set; a third step of removing stop words from the feature data set to form a data set; a fourth step of segmenting the data set into words and applying a word2vec transformation to form word vectors; a fifth step of classifying the word vectors with a random forest algorithm; a sixth step of constructing a random forest classification model; and a seventh step in which, after the random forest classification model is determined, data are classified during the use stage, the abnormal data in each class are returned to the user, and the normal data are recommended to the user as a reference for correction. The invention uses big data to classify and analyze data anomalies and provides the data producer with correction suggestions, which can reduce data problems at the source and provide a reference for correcting the data source.

Description

Electric power data management method based on natural language processing and random forest
Technical Field
The invention relates to the technical field of computer science, in particular to a power data management method based on natural language processing and random forests.
Background
Power data, especially power equipment archive data, is the foundation of power grid production work. At present, archive data for many kinds of production equipment are stored in the equipment (asset) operation and maintenance lean management system (PMS 2.0); the total data volume already exceeds 100 GB and involves more than 200 equipment types, for example transformers and bus bars.
The equipment archive data is maintained by front-line team personnel, and every link of power production is based on it. Only when the accuracy of the equipment archive data is ensured can all power-related processes and business be carried out accurately, providing firmer support for power operation, maintenance and analysis decision-making.
At present, power grid production equipment archive data suffers from problems such as incompleteness and inaccuracy, for example incomplete key parameters in equipment archives and filling errors in equipment ledger parameters. These problems, especially inaccurate data, are difficult to check by refining rules and then developing checking programs; the current practice is manual checking by operation and maintenance personnel, which is inefficient, difficult and ineffective.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a power data management method based on natural language processing and random forests, which can realize governance and automatic error checking of power equipment archive data.
A power data management method based on natural language processing and random forests comprises the following steps:
The first step: data extraction to obtain a training set: obtain the model data and rated capacity data of pole-mounted transformers, and take 70% of these data as the training set F;
The second step: extract feature data from the training set F and segment the model data to obtain a feature data set S = {s1, s2, s3, ..., sn};
The third step: remove stop words from the feature data set S to form a data set S', S' = {s1, s2, s3, ..., sm}, where m ≤ n;
The fourth step: perform word segmentation on the data set S' and then apply a word2vec transformation to form word vectors v(s'), where v(s') denotes the word vector of the data set S' after the word2vec transformation and k denotes the length of the word vector;
The fifth step: classify the word vectors v(s') with the random forest algorithm, with the rated capacity data L as the label column;
The sixth step: construct the random forest classification model: from the classification results obtained in the fifth step, compute the accuracy of the random forest classification model; if the accuracy does not reach the expected threshold, return to the fourth and fifth steps for parameter adjustment until the accuracy reaches the expected threshold;
The seventh step: after the random forest classification model is determined, classify the data during the use stage, return the abnormal data in each class to the user, and recommend the normal data to the user as a reference for correction.
Further, in the first step, data cleaning and filtering are performed after data extraction: first, rows whose transformer model field or rated capacity field is empty are filtered out; then rows whose transformer model field does not contain "-" are filtered out; finally, rows whose transformer model field contains neither "M" nor "m" are filtered out.
Further, in the third step, removing stop words from the feature data set S specifically comprises: replacing the "-" and "/" in the transformer model field with a space.
Further, in the fifth step, the process of classifying, with the random forest algorithm, the word vectors v(s') formed from the model data is as follows:
(1) Set the total number of decision trees in the random forest to B; a single decision tree b is generated as follows:
(a) Randomly select N samples, with replacement, from the word vectors v(s');
(b) Then recursively generate a random forest tree T_b;
(2) Output the set of random forest trees {T_b}, b = 1, 2, ..., B;
(3) Make a classification prediction for a new data point x (i.e., model data newly entered by the user): let C_b(x) denote the class predicted for the new data point x by the b-th tree; the random forest prediction is then C(x) = majority vote{C_b(x), b = 1, 2, ..., B}, as sketched in the toy example below.
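A toy illustration of the majority-vote aggregation in step (3), written as a minimal sketch under the assumption that each tree is represented by an object exposing a predict() method; it is not the patent's implementation.

```python
# Toy sketch of the majority vote in step (3); the `trees` objects and their
# predict() interface are assumptions for illustration only.
from collections import Counter

def rf_predict(trees, x):
    votes = [tree.predict(x) for tree in trees]     # C_b(x) for b = 1, ..., B
    return Counter(votes).most_common(1)[0][0]      # the class with the most votes wins
```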
Further, the specific steps of recursively generating the random forest tree T_b include:
i. randomly select k' feature dimensions from the k dimensions of the word vector, where k' ≤ k;
ii. from the k' candidate features, select the one feature that minimizes the uncertainty of the data set information and use it to split the data; this feature is also called the best split feature;
iii. split the node on the best split feature into two child nodes, and repeat until each node is sufficiently pure, finally forming a complete random forest tree T_b; if the decision tree formed by these split nodes reaches the set maximum depth, splitting stops regardless of whether the nodes are sufficiently pure.
Further, the methods for computing the minimum uncertainty of the data set information include: those based on information gain, those based on the information gain ratio, and those based on the Gini coefficient.
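For reference, a minimal sketch of these three impurity measures (information gain for ID3, information gain ratio for C4.5, Gini coefficient for CART), written from their standard textbook definitions rather than taken from the patent text:

```python
# Standard impurity measures used to pick the best split feature (a sketch,
# not the patent's own implementation).
import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def information_gain(parent: np.ndarray, left: np.ndarray, right: np.ndarray) -> float:
    """Reduction in entropy achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

def gain_ratio(parent: np.ndarray, left: np.ndarray, right: np.ndarray) -> float:
    """Information gain normalised by the split's intrinsic information (C4.5 style)."""
    n = len(parent)
    split_info = entropy(np.array([0] * len(left) + [1] * len(right)))
    return information_gain(parent, left, right) / split_info if split_info > 0 else 0.0
```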
Further, the sixth step specifically comprises: manually verify the data within each class of the classification result obtained in the fifth step, pick out abnormal data and misjudged data, and check the misjudgment rate of each class to obtain the verification accuracy; average the accuracy over all classes to obtain the accuracy of the random forest classification model; judge whether the model accuracy reaches the expected threshold; if not, return to the fourth and fifth steps and re-adjust the word vector length k of the fourth step as well as, in the fifth step, the number of decision trees B, the method used to minimize the uncertainty of the data set information, and the maximum depth of the decision trees, until the accuracy reaches the expected threshold.
Further, the hyper-parameters of the random forest classification model are determined with a grid search, that is, every combination is tried by exhaustive traversal and the best-performing parameters are taken as the final result.
The invention uses natural language processing and random forest techniques to carry out data governance: it automatically diagnoses anomalies in large volumes of data and provides suggestions for correcting them, which reduces the strong dependence of data verification work on business staff, enables automatic processing of scattered data anomalies for which no rules can be extracted, and avoids the heavy workload of manual screening.
Drawings
FIG. 1 is a flow chart of the power data governance method based on natural language processing and random forests of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To describe the present embodiment, sample data is introduced (see Table 1).
TABLE 1 sample data
As can be seen from Table 1, the sample data has three columns, two of which (the main transformer model and the rated capacity) carry physical and business meaning. According to the business rules, each specific transformer model corresponds to a unique rated capacity, but the model can be written in many different ways; for example, the models S9-M-50/10, S11-M-50/10, S9-50 and S9-50KVA all correspond to a rated capacity of 50. The value 50 is hidden somewhere in the model string, but its position is not fixed and there is no explicit rule; experienced business staff can usually judge from the model value what the corresponding rated capacity should be, but doing so is inefficient and laborious.
As shown in FIG. 1, the embodiment of the invention provides a power data management method based on natural language processing and random forests, which extracts features from the transformer model with the word2vec algorithm and then builds a model on these features with the random forest classification algorithm. The specific steps are as follows:
The first step: data extraction to obtain a training set: obtain the model data and rated capacity data of pole-mounted transformers, and take 70% of these data as the training set F. The data can be cleaned and filtered after extraction: first, rows whose transformer model field or rated capacity field is empty are filtered out; then rows whose transformer model field does not contain "-" are filtered out; finally, rows whose transformer model field contains neither "M" nor "m" are filtered out. The transformer model field is named xh and the rated capacity field is named edrl;
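As an illustration only, a minimal sketch of this extraction and cleaning step, under the assumption that the archive records are available as a pandas DataFrame with the field names used above (xh for the transformer model, edrl for the rated capacity); the helper name build_training_set is a placeholder:

```python
# Sketch of the first-step cleaning and 70% training split, assuming a pandas
# DataFrame with columns "xh" and "edrl".
import pandas as pd

def build_training_set(df: pd.DataFrame, train_frac: float = 0.7) -> pd.DataFrame:
    """Filter obviously unusable rows and sample 70% of them as the training set F."""
    df = df.dropna(subset=["xh", "edrl"])                 # drop rows with empty model / capacity
    df = df[df["xh"].str.contains("-", regex=False)]      # keep rows whose model contains "-"
    df = df[df["xh"].str.contains("m", case=False)]       # keep rows containing "M" or "m"
    return df.sample(frac=train_frac, random_state=42)    # 70% of the cleaned rows -> training set F
```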
The second step: extract feature data from the training set F and segment the model data to obtain a feature data set S = {s1, s2, s3, ..., sn};
The third step: remove stop words (e.g. "-" and "/") from the feature data set S to form a data set S', S' = {s1, s2, s3, ..., sm}, where m ≤ n. For example, the "-" and "/" in the transformer model field are replaced with a space, and the processed and transformed field is named xh1; the transformer model "S9-M-50/10" (xh) thus becomes "S9 M 50 10" (xh1) after processing;
The fourth step: perform word segmentation on the data set S' and then apply a word2vec transformation to form word vectors v(s'), where v(s') denotes the word vector of the data set S' after the word2vec transformation and k denotes the length of the word vector;
Specifically, a tokenizer is used to segment the content of the processed transformer model field (xh1), and the segmented array field is named xh2; for example, "S9 M 50 10" (xh1) becomes "[S9, M, 50, 10]" (xh2) after tokenization. The word2vec model is then trained on the field xh2; its output field is named rawFeatures and is a multidimensional feature vector. For example, with "[S9, M, 50, 10]" (xh2) as the input of the word2vec model, the output of the word2vec model is:
[-0.3870379527409871,0.883052121847868,0.16217718521753946,0.24961639444033304,0.09006961186726888,-0.3612159974873066](rawFeatures)。
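A minimal sketch of this tokenization and word2vec step, assuming the gensim library; the vector length k, the training parameters, and the pooling of token vectors into a single rawFeatures vector by averaging are illustrative assumptions, not details taken from the patent:

```python
# Sketch of the fourth step: tokenize xh1 into xh2, train word2vec, and derive
# a k-dimensional rawFeatures vector per transformer model (pooling by mean is
# an assumption for illustration).
from gensim.models import Word2Vec
import numpy as np

def to_tokens(xh1: str) -> list[str]:
    """'S9 M 50 10' (xh1) -> ['S9', 'M', '50', '10'] (xh2)."""
    return xh1.split()

def train_word2vec(xh2_corpus: list[list[str]], k: int = 6) -> Word2Vec:
    # window, min_count and sg are illustrative training parameters
    return Word2Vec(sentences=xh2_corpus, vector_size=k, window=3, min_count=1, sg=1)

def to_raw_features(model: Word2Vec, xh2: list[str]) -> np.ndarray:
    """Average the token vectors into one k-dimensional feature vector (rawFeatures)."""
    return np.mean([model.wv[t] for t in xh2 if t in model.wv], axis=0)
```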
The fifth step: classify the word vectors v(s') with the random forest algorithm, with the rated capacity data L as the label column;
In the fifth step, the process of classifying, with the random forest algorithm, the word vectors v(s') formed from the model data is as follows:
1. Set the total number of decision trees in the random forest to B; a single decision tree b is generated as follows:
(a) Randomly select N samples, with replacement, from the word vectors v(s');
(b) Then recursively generate a random forest tree T_b through the following three steps:
i. randomly select k' feature dimensions from the k dimensions of the word vector, where k' ≤ k;
ii. from the k' candidate features, select the one feature that minimizes the uncertainty of the data set information and use it to split the data; this feature is also called the best split feature. The three ways of computing the minimum uncertainty of the data set information are ID3 (based on information gain), C4.5 (based on the information gain ratio) and CART (based on the Gini coefficient).
iii. split the node on the best split feature into two child nodes, and repeat until each node is sufficiently pure, finally forming a complete random forest tree T_b; if the decision tree formed by these split nodes reaches the set maximum depth, splitting stops regardless of whether the nodes are sufficiently pure.
2. Output the set of random forest trees {T_b}, b = 1, 2, ..., B;
3. Make a classification prediction for a new data point x (i.e., model data newly entered by the user):
let C_b(x) denote the class predicted for the new data point x by the b-th tree; the random forest prediction is then C(x) = majority vote{C_b(x), b = 1, 2, ..., B}.
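As an illustration of this fifth step, a minimal sketch using scikit-learn's RandomForestClassifier; the hyper-parameter values shown (the number of trees B, the maximum depth, the Gini criterion) are placeholders for the values tuned in the sixth step, not values given in the patent:

```python
# Sketch of the fifth-step classifier on the word2vec features, assuming
# scikit-learn; hyper-parameters are illustrative placeholders.
from sklearn.ensemble import RandomForestClassifier

def fit_capacity_classifier(raw_features, edrl_labels, B: int = 100, max_depth: int = 10):
    """raw_features: word2vec vectors v(s'); edrl_labels: the rated-capacity label column L."""
    clf = RandomForestClassifier(
        n_estimators=B,          # total number of trees B
        criterion="gini",        # CART-style split; "entropy" corresponds to gain-based criteria
        max_features="sqrt",     # k' features drawn from the k dimensions at each split, k' <= k
        max_depth=max_depth,     # stop splitting once the set maximum depth is reached
        bootstrap=True,          # N samples drawn with replacement for each tree
        random_state=0,
    )
    clf.fit(raw_features, edrl_labels)
    return clf                   # clf.predict(x) returns the majority vote over the B trees
```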
The sixth step: construct the random forest classification model: manually verify the data within each class of the classification result obtained in the fifth step, pick out abnormal data and misjudged data, and check the misjudgment rate of each class to obtain the verification accuracy; average the accuracy over all classes to obtain the accuracy of the random forest classification model; judge whether the model accuracy reaches the expected threshold; if not, return to the fourth and fifth steps and re-adjust the word vector length k of the fourth step as well as, in the fifth step, the number of decision trees B, the method used to minimize the uncertainty of the data set information, and the maximum depth of the decision trees, until the accuracy reaches the expected threshold (for example, the expected accuracy threshold may be set to 95%). The whole process of determining the hyper-parameters of the random forest classification model uses a grid search, that is, every combination is tried by exhaustive traversal and the best-performing parameters are taken as the final result.
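A minimal sketch of such a grid search, assuming scikit-learn's GridSearchCV; the parameter grid values are illustrative, and in practice the word-vector length k would be tuned by re-running the word2vec step outside this grid:

```python
# Sketch of the sixth-step hyper-parameter search via exhaustive grid search.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200],      # number of decision trees B
    "criterion": ["gini", "entropy"],    # method used to minimize data-set uncertainty
    "max_depth": [5, 10, 20],            # maximum depth of the decision trees
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="accuracy", cv=3)
# search.fit(raw_features, edrl_labels)
# if search.best_score_ < 0.95: adjust k or the grid and repeat, per the sixth step
```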
The seventh step: after the random forest classification model is determined, classify the data during the use stage, return the abnormal data in each class to the user, and recommend the normal data to the user as a reference for correction.
Further, during use, the label column of the model (namely the rated capacity data) is corrected through user feedback, which increases the probability of correct classification by the random forest.
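A minimal sketch of this use stage; the rule used here to flag a record as abnormal (its recorded capacity disagreeing with the model's prediction) and the helper names are assumptions for illustration, since the patent text does not spell out the flagging logic:

```python
# Sketch of the seventh step: predict a rated capacity per record, flag
# disagreements as suspected anomalies, and return recommendations.
import pandas as pd

def govern(df: pd.DataFrame, clf, featurize) -> pd.DataFrame:
    """df has columns xh (model) and edrl (recorded rated capacity)."""
    X = [featurize(xh) for xh in df["xh"]]         # word2vec features, as in the fourth step
    df = df.assign(recommended_edrl=clf.predict(X))
    df["suspected_anomaly"] = df["edrl"].astype(str) != df["recommended_edrl"].astype(str)
    return df                                      # anomalies go back to the user with a recommendation
```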
The rated capacity values recommended for the transformer models in part of the sample data are shown in Table 2:
TABLE 2
The invention uses natural language processing and random forest techniques to carry out data governance: it automatically diagnoses large volumes of data and provides suggestions for correcting them, which reduces the strong dependence of data verification work on business staff. For scattered data anomalies that follow no extractable rule, machine learning makes automatic processing possible and avoids the heavy workload of manual screening (checking every 100 records manually typically takes about 3 days of work, whereas the method of the invention can govern tens of thousands of records in a few minutes, with a governance accuracy above 95%).
The foregoing is merely illustrative embodiments of the present invention, and the present invention is not limited thereto, and any changes or substitutions that may be easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (8)

1. A power data management method based on natural language processing and random forests is characterized in that: the method comprises the following steps:
The first step: data extraction to obtain a training set: obtain the model data and rated capacity data of pole-mounted transformers, and take 70% of these data as the training set F;
The second step: extract feature data from the training set F and segment the model data to obtain a feature data set S = {s1, s2, s3, ..., sn};
The third step: remove stop words from the feature data set S to form a data set S', S' = {s1, s2, s3, ..., sm}, where m ≤ n;
The fourth step: perform word segmentation on the data set S' and then apply a word2vec transformation to form word vectors v(s'), where v(s') denotes the word vector of the data set S' after the word2vec transformation and k denotes the length of the word vector;
The fifth step: classify the word vectors v(s') with the random forest algorithm, with the rated capacity data L as the label column;
The sixth step: construct the random forest classification model: from the classification results obtained in the fifth step, compute the accuracy of the random forest classification model; if the accuracy does not reach the expected threshold, return to the fourth and fifth steps for parameter adjustment until the accuracy reaches the expected threshold;
The seventh step: after the random forest classification model is determined, classify the data during the use stage, return the abnormal data in each class to the user, and recommend the normal data to the user as a reference for correction.
2. A method of power data management based on natural language processing and random forests as claimed in claim 1, characterized by: in the first step, data cleaning and filtering are also performed after data extraction: first, rows whose transformer model field or rated capacity field is empty are filtered out; then rows whose transformer model field does not contain "-" are filtered out; finally, rows whose transformer model field contains neither "M" nor "m" are filtered out.
3. A method of power data management based on natural language processing and random forests as claimed in claim 1, characterized by: in the third step, removing stop words from the feature data set S specifically comprises: replacing the "-" and "/" in the transformer model field with a space.
4. A method of power data management based on natural language processing and random forests as claimed in claim 1, characterized by: in the fifth step, the process of classifying, with the random forest algorithm, the word vectors v(s') formed from the model data is as follows:
(1) Set the total number of decision trees in the random forest to B; a single decision tree b is generated as follows:
(a) Randomly select N samples, with replacement, from the word vectors v(s');
(b) Then recursively generate a random forest tree T_b;
(2) Output the set of random forest trees {T_b}, b = 1, 2, ..., B;
(3) Make a classification prediction for a new data point x (i.e., model data newly entered by the user): let C_b(x) denote the class predicted for the new data point x by the b-th tree; the random forest prediction is then C(x) = majority vote{C_b(x), b = 1, 2, ..., B}.
5. The method for managing power data based on natural language processing and random forests according to claim 4, characterized in that: the specific steps of recursively generating the random forest tree T_b include:
i. randomly select k' feature dimensions from the k dimensions of the word vector, where k' ≤ k;
ii. from the k' candidate features, select the one feature that minimizes the uncertainty of the data set information and use it to split the data; this feature is also called the best split feature;
iii. split the node on the best split feature into two child nodes, and repeat until each node is sufficiently pure, finally forming a complete random forest tree T_b; if the decision tree formed by these split nodes reaches the set maximum depth, splitting stops regardless of whether the nodes are sufficiently pure.
6. The method for managing power data based on natural language processing and random forests according to claim 5, characterized in that: the methods for computing the minimum uncertainty of the data set information include: those based on information gain, those based on the information gain ratio, and those based on the Gini coefficient.
7. A method of power data management based on natural language processing and random forests as claimed in claim 1, characterized by: the sixth step specifically comprises: manually verify the data within each class of the classification result obtained in the fifth step, pick out abnormal data and misjudged data, and check the misjudgment rate of each class to obtain the verification accuracy; average the accuracy over all classes to obtain the accuracy of the random forest classification model; judge whether the model accuracy reaches the expected threshold; if not, return to the fourth and fifth steps and re-adjust the word vector length k of the fourth step as well as, in the fifth step, the number of decision trees B, the method used to minimize the uncertainty of the data set information, and the maximum depth of the decision trees, until the accuracy reaches the expected threshold.
8. The method for managing power data based on natural language processing and random forests according to claim 7, characterized in that: the hyper-parameters of the random forest classification model are determined with a grid search, that is, every combination is tried by exhaustive traversal and the best-performing parameters are taken as the final result.
CN202111345415.9A 2021-11-15 2021-11-15 Electric power data management method based on natural language processing and random forest Active CN114154561B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111345415.9A CN114154561B (en) 2021-11-15 2021-11-15 Electric power data management method based on natural language processing and random forest

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111345415.9A CN114154561B (en) 2021-11-15 2021-11-15 Electric power data management method based on natural language processing and random forest

Publications (2)

Publication Number Publication Date
CN114154561A (en) 2022-03-08
CN114154561B (en) 2024-02-27

Family

ID=80460062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111345415.9A Active CN114154561B (en) 2021-11-15 2021-11-15 Electric power data management method based on natural language processing and random forest

Country Status (1)

Country Link
CN (1) CN114154561B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066553A (en) * 2017-03-24 2017-08-18 北京工业大学 A kind of short text classification method based on convolutional neural networks and random forest
CN108537281A (en) * 2018-04-13 2018-09-14 贵州电网有限责任公司 A kind of power consumer feature recognition sorting technique based on random forest
CN109472293A (en) * 2018-10-12 2019-03-15 国家电网有限公司 A kind of grid equipment file data error correction method based on machine learning
CN110059183A (en) * 2019-03-22 2019-07-26 重庆邮电大学 A kind of automobile industry User Perspective sensibility classification method based on big data
AU2020100709A4 (en) * 2020-05-05 2020-06-11 Bao, Yuhang Mr A method of prediction model based on random forest algorithm
WO2020119403A1 (en) * 2018-12-13 2020-06-18 平安医疗健康管理股份有限公司 Hospitalization data abnormity detection method, apparatus and device, and readable storage medium
WO2021022970A1 (en) * 2019-08-05 2021-02-11 青岛理工大学 Multi-layer random forest-based part recognition method and system
CN112364928A (en) * 2020-11-18 2021-02-12 浙江工业大学 Random forest classification method in transformer substation fault data diagnosis
CN112417863A (en) * 2020-11-27 2021-02-26 中国科学院电子学研究所苏州研究院 Chinese text classification method based on pre-training word vector model and random forest algorithm

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066553A (en) * 2017-03-24 2017-08-18 北京工业大学 A kind of short text classification method based on convolutional neural networks and random forest
CN108537281A (en) * 2018-04-13 2018-09-14 贵州电网有限责任公司 A kind of power consumer feature recognition sorting technique based on random forest
CN109472293A (en) * 2018-10-12 2019-03-15 国家电网有限公司 A kind of grid equipment file data error correction method based on machine learning
WO2020119403A1 (en) * 2018-12-13 2020-06-18 平安医疗健康管理股份有限公司 Hospitalization data abnormity detection method, apparatus and device, and readable storage medium
CN110059183A (en) * 2019-03-22 2019-07-26 重庆邮电大学 A kind of automobile industry User Perspective sensibility classification method based on big data
WO2021022970A1 (en) * 2019-08-05 2021-02-11 青岛理工大学 Multi-layer random forest-based part recognition method and system
AU2020100709A4 (en) * 2020-05-05 2020-06-11 Bao, Yuhang Mr A method of prediction model based on random forest algorithm
CN112364928A (en) * 2020-11-18 2021-02-12 浙江工业大学 Random forest classification method in transformer substation fault data diagnosis
CN112417863A (en) * 2020-11-27 2021-02-26 中国科学院电子学研究所苏州研究院 Chinese text classification method based on pre-training word vector model and random forest algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Prediction of power customer appeals based on deep neural networks (基于深度神经网络的电力客户诉求预判); 彭路, 朱君, 邹云峰; 计算机与现代化; 2020-05-15 (05); 26-32 *
Application research of the random forest algorithm for book subject classification (面向图书主题分类的随机森林算法的应用研究); 孙彦雄, 李业丽, 边玉宁; 计算机技术与发展; 2020-06-10 (06); 71-76 *

Also Published As

Publication number Publication date
CN114154561A (en) 2022-03-08

Similar Documents

Publication Publication Date Title
CN111782472B (en) System abnormality detection method, device, equipment and storage medium
Fan et al. Chaff from the wheat: Characterizing and determining valid bug reports
CN108549954B (en) Risk model training method, risk identification device, risk identification equipment and risk identification medium
WO2019238109A1 (en) Fault root cause analysis method and apparatus
Sethi et al. DLPaper2Code: Auto-generation of code from deep learning research papers
Kobayashi et al. Towards an NLP-based log template generation algorithm for system log analysis
Angeli et al. Stanford’s 2014 slot filling systems
CN109492106B (en) Automatic classification method for defect reasons by combining text codes
Zheng et al. A self-adaptive temporal-spatial self-training algorithm for semisupervised fault diagnosis of industrial processes
CN112364352A (en) Interpretable software vulnerability detection and recommendation method and system
CN112685324A (en) Method and system for generating test scheme
CN112926627A (en) Equipment defect time prediction method based on capacitive equipment defect data
CN113221960A (en) Construction method and collection method of high-quality vulnerability data collection model
CN104021180A (en) Combined software defect report classification method
CN112487146A (en) Legal case dispute focus acquisition method and device and computer equipment
CN114117029B (en) Solution recommendation method and system based on multi-level information enhancement
CN113590396A (en) Method and system for diagnosing defect of primary device, electronic device and storage medium
CN114154561B (en) Electric power data management method based on natural language processing and random forest
CN116245107B (en) Electric power audit text entity identification method, device, equipment and storage medium
CN117131449A (en) Data management-oriented anomaly identification method and system with propagation learning capability
CN110597796B (en) Big data real-time modeling method and system based on full life cycle
CN117370568A (en) Power grid main equipment knowledge graph completion method based on pre-training language model
CN115438190B (en) Power distribution network fault auxiliary decision knowledge extraction method and system
CN115470854A (en) Information system fault classification method and classification system
Kusa et al. Vombat: A tool for visualising evaluation measure behaviour in high-recall search tasks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant