CN111966825A - Power grid equipment defect text classification method based on machine learning - Google Patents

Power grid equipment defect text classification method based on machine learning Download PDF

Info

Publication number
CN111966825A
CN111966825A CN202010683964.6A
Authority
CN
China
Prior art keywords
word
data
defect
training
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010683964.6A
Other languages
Chinese (zh)
Inventor
郑泽忠
李慕杰
姜宇轩
王志勇
牟范
江邵斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010683964.6A priority Critical patent/CN111966825A/en
Publication of CN111966825A publication Critical patent/CN111966825A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a power grid equipment defect text classification method based on machine learning, belonging to the field of Chinese text analysis and natural language processing. The method first converts words into numerical form, then extracts keywords from the sentences of a defect text, combines the keyword word vectors into a word matrix, feeds that matrix into a neural network for classification, and thereby identifies which type of defect the text describes. The method therefore offers high classification accuracy and speed; by extracting core words from the sample data the invention also optimizes the sample features, reducing the computation required by the subsequent neural-network classification and improving the computation speed.

Description

Power grid equipment defect text classification method based on machine learning
Technical Field
The invention belongs to the field of Chinese text analysis and natural language processing, and particularly relates to a power grid equipment defect text classification method based on machine learning.
Background
With the intelligent and informatized construction of the power grid, grid enterprises have accumulated large amounts of data, gradually forming the electric power big data that both academia and industry now focus on. In the electric power field, research has concentrated mainly on structured data mining and on image recognition, while research on power text mining has only just begun.
The defect descriptions recorded by large numbers of field workers contain much valuable information, yet organizing these data manually requires experts and is time-consuming and labor-intensive. Some research on power grid text exists, but no deep-learning-based results have yet been applied to it.
Disclosure of Invention
The invention aims to provide a power grid equipment defect text classification method based on machine learning, which is used for solving the problem of automatic classification of power grid equipment defect texts in the prior art.
In order to achieve the purpose, the technical scheme of the invention is as follows: a power grid equipment defect text classification method based on machine learning comprises the following steps:
Step 1: acquiring the existing defect text csv data; finding defect types in the defect text csv data whose number of data items is below a set threshold and deleting all data under those defect types; finding records whose description contents are identical but whose actual IDs and defect types differ, determining which of those defect types contains the most data items, and classifying the records into that defect type;
extracting the defect description data, defect type ID data and defect type name data from the remaining defect text csv data;
Step 2: regrouping all samples of the defect description data and the defect type name data according to the classification format of the defect type ID data, and labeling all data by defect type to serve as training data;
Step 3: preprocessing each sample in the training data;
Step 3.1: removing stop words from the training data according to a Chinese stop-word library and replacing them with spaces;
Step 3.2: setting a frequency threshold and deleting any word whose frequency of occurrence in the samples is below that threshold;
Step 3.3: performing text preprocessing with a word-vector training model in preparation for neural network training;
Step 3.3.1: building a Huffman tree with a root node and a number of leaf nodes, the root node and the leaf nodes being connected through intermediate nodes; selecting from a defect text database all of the terms most representative of defects, the number of leaf nodes being equal to the number of selected terms and each leaf node representing one of them;
Step 3.3.2: training the Huffman tree, in which the root node and each intermediate node compute a probability that approaches 0 or 1, where 0 means the word is passed down one branch of the node and 1 means it is passed down the other; randomly assigning every word in the defect text database a vector of 0s and 1s, called its word vector; training the root and intermediate nodes of the Huffman tree with all the obtained word vectors, and regarding training as finished when every word is routed to a leaf node whose meaning is the same as or similar to that word's;
Step 3.3.3: extracting 2A subject words representing the defect from the data processed in step 3.2; feeding each word into the root node of the Huffman tree and passing it down until it reaches a leaf node, the word vector of that leaf node then representing the word vector of the word;
Step 3.3.4: forming a matrix from the 2A word vectors obtained in step 3.3.3 and using it as the input of a neural network;
Step 4: building a neural network and training it;
training the established neural network with the samples preprocessed in step 3 until the network converges;
Step 5: classifying the power grid equipment defect data to be classified using the trained Huffman tree and the neural network.
The specific method of step 4 is as follows:
a neural network consisting of first convolution - first pooling - second convolution - second pooling - fully connected layers is built and trained with the matrix obtained in step 3;
the word-embedding matrix of each training sample is input into the first convolution layer, the two-dimensional matrix is expanded to three dimensions for the convolution, and the first pooling, second convolution, second pooling and fully connected processing then follow in sequence; both convolution layers are computed with layers.convolution2d, 100 convolution kernels of size 20 x 20, and the relu activation function.
The method classifies the power grid defect data with a neural network and therefore offers high classification accuracy and speed; by extracting core words from the sample data, the invention also optimizes the sample features, reducing the computation required by the subsequent neural-network classification and improving the computation speed.
Drawings
FIG. 1 is an example of a raw data set of one embodiment of the present invention.
FIG. 2 is an example of raw data processing and saving according to an embodiment of the present invention.
FIG. 3 is an example of a preprocessed word embedding matrix according to one embodiment of the invention.
FIG. 4 is an example of preprocessed words for the FastText model of an embodiment of the present invention.
FIG. 5 is an example of the run results of the FastText model of one embodiment of the present invention; (a) shows the training results on the training set and (b) the results on the test set.
Detailed Description
Example 1:
a power grid equipment defect text classification method based on machine learning comprises the following steps:
s1, preprocessing the collected defect text csv file, and taking out defect description, defect type ID and defect type name columns in the csv data as training data;
s2, on the basis of the step S1, classifying the data group into a plurality of csv files according to the defect type ID, and marking the csv files as supervised learning samples;
s3, preprocessing the text, stopping words and carrying out original feature dimension reduction through a word vector training model;
s4, constructing a two-layer convolutional neural network by using an equal sample extraction training method, and importing the embedded word vector matrix for learning;
s5, circularly extracting sample data for training, analyzing the trained model and evaluating the accuracy;
the five-step process is detailed below in a simulation experiment using existing defect text for classification.
Step 1: preprocessing a file;
The information items that represent the classification basis in the collected csv files are inspected with Notepad, then exported column by column and given labels and IDs for use in the subsequent supervised training, and the header data are unified for easy reading. If description contents are identical but the actual IDs and defect type labels differ, different workers used inconsistent terms when recording, so these entries are merged under the same defect description for learning. Defect types with too few entries are removed to prevent them from interfering with learning; only about 3% of the data is removed in the end, which has essentially no influence on building the classification model. An example of a raw data set is shown in FIG. 1.
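A minimal pandas sketch of this cleaning step is given below; the file name, column names and the minimum-count threshold are illustrative assumptions, not values taken from the original text.

```python
# Hypothetical sketch of the step-1 cleaning; file/column names and the threshold are assumed.
import pandas as pd

df = pd.read_csv("defects.csv", encoding="gb2312")

# Drop defect types whose entry count falls below a threshold (only ~3% of data is lost).
MIN_SAMPLES = 20  # assumed threshold
counts = df["defect_type_id"].value_counts()
df = df[df["defect_type_id"].isin(counts[counts >= MIN_SAMPLES].index)]

# For identical descriptions recorded under different IDs, keep the majority defect type.
majority = (df.groupby("defect_description")["defect_type_id"]
              .agg(lambda s: s.value_counts().idxmax()))
df["defect_type_id"] = df["defect_description"].map(majority)

# Keep only the three columns used as training data.
df = df[["defect_description", "defect_type_id", "defect_type_name"]]
```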
step 2: classifying and making a learning sample;
The data are split into multiple csv files by defect type ID, with the defect type ID used as the file name, and stored together in a data folder under the project directory for convenient access; when a file is loaded, the corresponding defect type ID is attached to facilitate word segmentation; the csv files are read in GB2312 encoding. An example of raw data processing and saving is shown in FIG. 2.
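A sketch of this step under the same assumed column names; one csv file per defect type ID is written to a data folder in GB2312 encoding, with all names illustrative.

```python
# Hypothetical sketch of step 2; paths and column names are assumptions.
import os
import pandas as pd

df = pd.read_csv("cleaned_defects.csv", encoding="gb2312")
os.makedirs("data", exist_ok=True)

for type_id, group in df.groupby("defect_type_id"):
    # One file per defect type, named by the defect type ID, for convenient later loading.
    group.to_csv(os.path.join("data", f"{type_id}.csv"), index=False, encoding="gb2312")
```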
Step 3: text preprocessing and embedding of training words into a matrix;
All defect text descriptions are loaded, stop words are removed and the training words are embedded into a matrix; the stop words come from a Chinese stop-word library, and the removed stop words are replaced with spaces. Word embedding is performed with the CBOW word-vector training model, together with a hierarchical (multi-layer) Softmax model.
the specific operation is as follows: in order to extract features in each sentence, non-meaning words such as "in", "out", and the like in a sentence are removed, and because these words frequently appear in any sentence, they do not help to distinguish two sentences having different meanings. The remaining semantic words are separated by spaces so that a sentence is processed into a collection of individual words.
Since a computer cannot understand how different words differ, the words must be mapped into a word-embedding matrix in order to capture the meaning behind them and the associations between them. Each word in a sentence is represented by an array, so that the computer can judge from the array's features whether two words have similar meanings or how they are related. The maximum number of elements in each array can be given an upper limit m, which is usually related to the total number of words in the text. Every sentence in the text can therefore be represented by a matrix, and the computer can understand the meaning behind the sentence by extracting features from this matrix. The initial values of this word-embedding matrix are determined as follows.
Assume the total number of extracted words in the sentence is 2A, so the sentence forms a 2A x m matrix. To initialize and update this matrix, each word is first assigned a random word vector, giving word vectors W1, W2, ..., W2A for the 2A words.
the word vectors for 2A words are added up, i.e.:
X_w = ∑_{i=1}^{2A} W_i
the output of the model can be understood as a binary tree whose leaf nodes are all words in the trained article, and then the word frequency is used as the weight of all the leaf nodes. The method is a Huffman tree, in the Huffman tree, the number of leaf nodes is all word numbers in a training article, and non-leaf nodes are the number of leaf nodes minus one.
The Huffman tree is constructed to obtain, for a given word, the probability of which word in the total vocabulary is closest to it, i.e. the classification basis. Starting from the root node of the Huffman tree, every left subtree is denoted 0 and every right subtree 1; at each node a mapping function is evaluated, and if its value is less than 0.5 the judgment proceeds into the left subtree, otherwise into the right subtree. This repeats until the leaf node closest to the word is reached, completing the traversal. In a Huffman tree, high-weight words lie close to the root node and low-weight words lie far from it. With the tree so established, the mapping function is:
σ(X_w, θ) = 1 / (1 + e^(−X_w·θ))
In the formula, X_w is the word vector of the leaf node and θ is the logistic-regression model parameter, which must be solved from the training samples. In short, if a word w occurs in a sentence, the Huffman tree contains a unique path p_w from the root node to the node where w is located; along this path there are l_w − 1 branches, each treated as a binary classification that yields one probability, and multiplying these probabilities gives the required conditional probability. Substituting this into the log-likelihood function:
L = ∑_{ω∈2A} log p(X_ω | 2A)
the following likelihood functions can be obtained:
L = ∑_{ω∈2A} ∑_{i=1}^{l−1} log p(n(ω, i+1) | X_ω, θ_i), where p(n(ω, i+1) | X_ω, θ_i) = σ(X_ω·θ_i) if the 1 branch is taken at node i, and 1 − σ(X_ω·θ_i) if the 0 branch is taken
Here n(ω, i) denotes the i-th node on the path of word ω in the Huffman tree, the total number of nodes from the root node to the leaf node of the word being judged is l, and i ∈ (1, l); the model parameter attached to each node is θ_i. This likelihood function is maximized by stochastic gradient ascent, during which θ_i and the word vector X_w are updated continuously. Finally a complete Huffman tree is obtained with all leaf nodes filled by different words, giving a good preprocessing result. All word vectors have now been updated, and the resulting word-embedding matrix is trained in the next step. A small numerical illustration of the path probability is sketched below.
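The sketch below only illustrates the probability described above: a sigmoid at each inner node on a word's path gives the probability of taking the 0 or 1 branch, and the product over the branches is the conditional probability whose logarithm enters L. The vectors, the path and its length are made-up values.

```python
# Illustrative only: random vectors and an arbitrary path stand in for a trained Huffman tree.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m = 100                                           # word-vector dimension
x_w = np.random.randn(m)                          # word vector X_w
thetas = [np.random.randn(m) for _ in range(3)]   # theta_i for the 3 inner nodes on the path
branches = [1, 0, 1]                              # branch taken at each node (0 = left, 1 = right)

prob = 1.0
for theta, d in zip(thetas, branches):
    p_one = sigmoid(x_w @ theta)                  # probability of the "1" branch at this node
    prob *= p_one if d == 1 else (1.0 - p_one)

log_term = np.log(prob)                           # this word's contribution to the log-likelihood L
print(prob, log_term)
```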
The pre-trained word-vector matrix is stored; the preprocessed word-embedding matrix is shown in FIG. 3. A sketch of training and saving such vectors with a CBOW model follows.
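This sketch uses gensim's Word2Vec as one possible implementation of CBOW with hierarchical softmax; the toy corpus, file name and vector size are assumptions for illustration.

```python
# Sketch with an assumed toy corpus and file name.
from gensim.models import Word2Vec

corpus = [["变压器", "套管", "渗油"],        # each sample: a list of segmented defect words (toy data)
          ["断路器", "分闸", "失灵"]]

model = Word2Vec(sentences=corpus, vector_size=100, sg=0, hs=1, negative=0,
                 window=5, min_count=1, epochs=10)   # sg=0: CBOW, hs=1: hierarchical softmax
model.save("defect_word2vec.model")                  # store the pre-trained word-vector matrix
vec = model.wv["变压器"]                             # 100-dimensional vector for one word
```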
Step 4: establishing the CNN model;
From all preprocessed texts (defect descriptions), the same number of texts is taken for each defect type ID. A CNN model composed of convolution, pooling and fully connected layers is built, and the word-embedding matrices are split into a training set and a test set in a 0.75/0.25 ratio, the training set being used to train the model.
The specific operation is as follows:
the width of the convolution kernel is consistent with that of the word Embedding matrix (Embedding Layer), a convolution kernel is used for processing sentences, and a vector can be obtained after convolution. Assuming that there is a convolution kernel, which is a matrix W with a width d and a height h, the parameters to be updated are all the elements of the matrix W, and the total number is h times d. Each sentence is processed, and a matrix with s rows and d columns can be obtained through convolution kernel operation:
A∈Rs×d
when A [ i: j ] represents that the matrix is from the i row to the j row of the matrix A, the formula of the convolution operation is as follows:
oi=w·A[i:i+h-1]
it may be biased by a bias b, simply denoted by f (x) an activation function. This can be used to formulate the characteristics as follows:
ci=f(oi+b)
if this operation is repeated a number of times on the convolution kernel, the feature c ∈ Rs-h +1 can be obtained, for a total of s-h +1 features. To obtain richer and more characteristic expressions, other highly different convolution kernels may be employed for polling.
The pooling operation follows. This step is also called sub-sampling or down-sampling. It usually acts on non-overlapping regions and includes modes such as mean pooling and max pooling. The essence of pooling is down-sampling: for example, a 6 x 6 matrix can be down-sampled to a 3 x 3 matrix with max pooling.
After the convolution calculation by a convolution kernel, the resulting matrix is reduced in dimension through pooling. Max pooling, as the name implies, takes the maximum value of the convolved matrix (for a 6 x 6 matrix, taking the maximum over each 2 x 2 block yields a 3 x 3 matrix), which prevents a few individual values from dominating the whole; a numpy sketch of the convolution and of this pooling example follows.
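A small numpy sketch of the two operations just described: a kernel of height h sliding over an s x d sentence matrix produces s−h+1 features, and 2 x 2 max pooling reduces a 6 x 6 matrix to 3 x 3. All sizes and values are illustrative.

```python
# Illustrative sizes only; mirrors o_i = W·A[i:i+h-1], c_i = f(o_i + b) and the 6x6 -> 3x3 pooling example.
import numpy as np

s, d, h = 10, 8, 3                    # sentence length, embedding width, kernel height
A = np.random.randn(s, d)             # sentence matrix (one row per word)
W = np.random.randn(h, d)             # convolution kernel
b = 0.1                               # bias
relu = lambda x: np.maximum(x, 0.0)

# Convolution: slide the kernel over the rows of A, producing s-h+1 features.
c = np.array([relu(np.sum(W * A[i:i + h]) + b) for i in range(s - h + 1)])

# Max pooling: reduce a 6x6 matrix to 3x3 by taking the maximum over each 2x2 block.
M = np.random.randn(6, 6)
pooled = M.reshape(3, 2, 3, 2).max(axis=(1, 3))   # shape (3, 3)
print(c.shape, pooled.shape)
```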
The first role of the pooling layer is dimensionality reduction: in the example, the 6 x 6 matrix becomes a 3 x 3 matrix. The intuitive benefit is reduced computational load, and because of the convolution calculations that follow, the saving in computation is greatly amplified.
A pooling layer also helps guarantee feature invariance, including translation, rotation and scale invariance: features obtained from the original input after convolution do not change after the pooling operation. Translation invariance means the same feature can still be recognized elsewhere, for example in other sentences.
The pooling layer also ensures a fixed-length output, which is important for text classification: input matrices of different sizes must still produce outputs of the same size. The convolved matrices are pooled and spliced together, and a border of padding values (all 1s) is added so that the final outputs have the same size.
Finally comes the fully connected layer. The matrices obtained after max pooling are spliced together and fed into Softmax, which gives the probability of each category, e.g. the probability that the label is 1 and the probability that it is -1. This completes the forward flow of TextCNN used for prediction. With supervised learning, the loss function can then be computed from the predicted label and the actual training-set label; the gradients are used to update the parameters of the Softmax function, the max-pooling operation, the activation function and the convolution kernels, completing one round of training.
In the experiment the preset word-embedding dimension is 100. The pre-trained word-embedding matrix is used as the feature input to the first convolution layer; the two-dimensional word-vector matrix is first expanded to three dimensions (the third dimension is 1), and layers.convolution2d is then applied. 100 convolution kernels of size 20 x 20 are used, followed by the relu activation function, i.e. relu is applied to the output x of the convolution before gradient descent. The second convolution and pooling layers are the same as the first and feed into the fully connected layer. The optimizer is Adam, and the learning rate is adjusted according to the loss function.
The step size is 100 and 10,000 training iterations are carried out; the running time and the loss are output, and the trained model is saved. A sketch of a model under these settings is given below.
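The following tf.keras sketch is one possible realization of the stated settings; the input length, class count, pooling size and padding are assumptions not fixed by the text.

```python
# Sketch only: MAX_WORDS, NUM_CLASSES, pooling size and padding are assumed.
import tensorflow as tf

MAX_WORDS, EMBED_DIM, NUM_CLASSES = 500, 100, 30

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_WORDS, EMBED_DIM)),            # pre-trained word-embedding matrix
    tf.keras.layers.Reshape((MAX_WORDS, EMBED_DIM, 1)),             # 2-D matrix -> 3-D (third dimension 1)
    tf.keras.layers.Conv2D(100, (20, 20), activation="relu", padding="same"),  # first convolution
    tf.keras.layers.MaxPooling2D((2, 2)),                           # first pooling
    tf.keras.layers.Conv2D(100, (20, 20), activation="relu", padding="same"),  # second convolution
    tf.keras.layers.MaxPooling2D((2, 2)),                           # second pooling
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),       # fully connected output
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```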
And 5: analyzing the model and evaluating the accuracy;
A mapping is built between the defect type and y_train / y_test, and a map function is used to map train_target and test_target to the defect type IDs so that prediction accuracy can be judged conveniently. y_predicted (the label predicted for the x_test set by the model trained in S4) is compared with the actual y_test (the true labels of the test set) to obtain the corresponding accuracy, and the model is adjusted according to the accuracy and the loss function reported during training. A piece of text can also be entered manually, and its category is judged from the per-category probabilities given by the model.
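Continuing the sketch above, the accuracy comparison might look as follows; x_test, y_test and the index-to-ID mapping id_map are assumed to exist from the earlier (hypothetical) data preparation.

```python
# Sketch only: model comes from the sketch above; x_test, y_test and id_map are assumed inputs.
import numpy as np

probs = model.predict(x_test)                           # per-class probabilities for the test set
y_predicted = probs.argmax(axis=1)                      # predicted class indices
pred_ids = np.array([id_map[i] for i in y_predicted])   # map indices back to defect type IDs
true_ids = np.array([id_map[i] for i in y_test])
accuracy = float(np.mean(pred_ids == true_ids))
print(f"test accuracy: {accuracy:.4f}")
```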
Example 2:
on the basis of embodiment 1, the CNN model building in step 4 may be replaced by the following steps:
classification is performed by building a FastText model:
step 1: preprocessing the text labelled with defect type IDs, removing stop words, and dividing a training set and a test set;
step 2: establishing a softmax function that normalizes the output-layer values into the 0-1 interval, turning the neuron outputs into a probability distribution;
step 3: carrying out model training, adjusting the parameters, and evaluating the model result through Precision and Recall.
Wherein, the specific operation of the step 1 is as follows:
for all the preprocessed texts (defect descriptions), the same number of texts are taken for training for each defect type ID. Cutting out a training set and a test set according to the proportion of 0.75/0.25, carrying out de-stop word processing on the text of the training set by using a Chinese common stop word library, and storing the corresponding defect type ID in the format of __ label __ + defect type ID before the processed text.
Wherein, the specific operation of the step 2 is as follows:
the word vector training model mentioned above is to give each word after word segmentation to the other word, and then to do operations such as dimension reduction on the basis. But this leads to a problem: morphological features inside the word are ignored, such as: "it generates a vector for each word. "in this context, it is segmented into vectors such as" it "," each "," word "," generate "," word vector ". However, if there is another sentence with words such as "word segmentation word" and "processing generation", it is clear that the words with close meaning are expressed by two different word vectors, and if the two sentences are under the same label, the training result of the word segmentation obviously greatly reduces the accuracy. In the aforementioned word vector training model method, the information of the word grains in the word and the information between the word grains are lost due to their integration.
To solve this problem, FastText represents each word with character-level N-grams. For the four-character word "处理生成" ("process generation"), taking N = 2, its character 2-grams are:
"<处", "处理", "理生", "生成", "成>"
where "<" marks the prefix of the word and ">" marks the suffix. These N-grams are then used to represent the word, and the word vectors of these five N-grams together represent the word vector of the original word. This brings two benefits:
For low-frequency words, the word vectors generated in this way express the meaning more accurately, because their N-grams can coincide with those of other words, which helps when handling related or compound words.
Word vectors can still be obtained for words that do not appear in the current article but may appear in other samples. Several constants are defined next:
VOCAB_SIZE = 2000: the size of the vocabulary, i.e. how many words in total receive word vectors; it is simply set to 2000 here;
EMBEDDING_DIM: the dimension of the vector each word is mapped to at the output of the embedding layer, set to 100 (the illustrative vector below has length 64). For example, the term "deep learning" may be represented by a length-64 real-valued vector such as [0.123456, 1.123465, 2.1234656, 6.46487987, 0.2165466, …];
MAX_WORDS = 500: the maximum number of words used when building the word-embedding matrix. Learning texts differ in length and yield different numbers of words after segmentation; to give the neural network an input of fixed dimension, a maximum word count is set, and texts with fewer words are padded with blanks, i.e. 0. An example of preprocessed words for the FastText model is shown in FIG. 4.
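A sketch of these constants and of padding each index sequence to MAX_WORDS with zeros; the hand-rolled vocabulary and the toy samples are assumptions standing in for whatever tokenizer is actually used.

```python
# Sketch only: a hand-rolled vocabulary stands in for the real tokenizer.
import numpy as np

VOCAB_SIZE = 2000       # how many words receive word vectors in total
EMBEDDING_DIM = 100     # dimension of each word vector after the embedding layer
MAX_WORDS = 500         # maximum number of words kept per text

texts = ["变压器 套管 渗油", "断路器 分闸 失灵"]      # pre-segmented, space-separated samples (toy data)
vocab = {}
for t in texts:
    for w in t.split():
        vocab.setdefault(w, len(vocab) + 1)          # index 0 is reserved for padding

def to_padded_indexes(text: str) -> np.ndarray:
    idx = [vocab[w] for w in text.split()][:MAX_WORDS]
    return np.array(idx + [0] * (MAX_WORDS - len(idx)))   # fill the remainder with 0

padded = np.stack([to_padded_indexes(t) for t in texts])   # shape (2, 500)
print(padded.shape)
```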
Wherein, the specific operation of the step 3 is as follows:
The model is built in the following steps:
Word embedding, i.e. the input layer (embedding layer), comes first. The input to the embedding layer is a collection of documents, with each input text represented by the sequence of its word indexes. For example, the sequence [20, 50, 90, 200] may represent the short text "Chinese text classification machine learning", where the indexes of "Chinese", "text", "classification" and "machine learning" in the vocabulary are 20, 50, 90 and 200 respectively; each word vector on input has dimension EMBEDDING_DIM, and input_shape = (BATCH_SIZE, MAX_WORDS);
The hidden layer (projection layer) comes next. It is in fact a summing-and-averaging operation over the word vectors supplied by the embedding matrix. Its input_shape is the output_shape of the embedding layer, and its output_shape is (BATCH_SIZE, EMBEDDING_DIM);
The output layer (softmax layer) is added next. The real fastText output layer is hierarchical Softmax, which is replaced here by plain Softmax. This layer is sized by CLASS_NUM, the number of classes for which probabilities are produced, and its output_shape is (BATCH_SIZE, CLASS_NUM);
Finally, the loss function, optimizer, evaluation metric and model compilation need to be set. In this project the loss function is categorical cross-entropy, matching the softmax output layer above; the optimizer is SGD, i.e. stochastic gradient descent; and the evaluation metric is accuracy, i.e. the comparison accuracy output at the end. A sketch of this three-layer model follows.
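The tf.keras sketch below mirrors the three layers just described, with plain Softmax standing in for hierarchical Softmax; CLASS_NUM and the other constants are assumed values.

```python
# Sketch only: CLASS_NUM and constants are assumed; GlobalAveragePooling1D plays the role of the projection (averaging) layer.
import tensorflow as tf

VOCAB_SIZE, EMBEDDING_DIM, CLASS_NUM = 2000, 100, 30

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),        # input (embedding) layer
    tf.keras.layers.GlobalAveragePooling1D(),                    # hidden (projection) layer: average the word vectors
    tf.keras.layers.Dense(CLASS_NUM, activation="softmax"),      # output (softmax) layer
])
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
model.summary()
```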
Generally the model must be adjusted repeatedly, based on the obtained accuracy, starting from the first round of parameters. The figure shows the final accuracy of the output model; the variables used during parameter tuning are as follows:
model: the choice of model, either skip-gram or CBOW. As described above, skip-gram predicts the surrounding words from the current word, whereas CBOW does the opposite. Skip-gram is selected in this experiment.
lr: the learning rate. It is adjusted continually according to the results, typically around 0.1; too high a value may cause overfitting, and too low a value slows training. The final learning rate in this experiment is 1.8; because the texts are short and there are many classes, overfitting is unlikely even with a high learning rate.
epoch: the number of training passes. The epoch count is related to the learning rate and is usually set to 50; 50 epochs are used in this experiment.
dim: the preset word-vector dimension. The more words remain after preprocessing, the larger the value should be; a suitable value can be found by repeated trials on the corpus. dim is set to 64 in this experiment.
ws: the window size. With the default ws = 5, the five words before and the five words after the current word are predicted.
wordNgrams: the word-level n-gram granularity. The default is 1-gram, i.e. each word gets its own vector and no sub-word units are used. Setting bi-grams (wordNgrams = 2) treats every two units as one, which increases the number of tokens. Using 2-grams in the classification keeps a reasonable training rate and saves training time.
loss: the default setting is negative sampling. Instead of a softmax output layer over all words, only a subset of words, together with the target word, is activated as output-layer dimensions when producing the retained target word.
bucket, set to 10000000.
minCount: the word-frequency cutoff. The default is 5, i.e. words occurring fewer than 5 times are ignored. If the corpus is relatively large, it is generally set smaller; words with too low a frequency are ignored because essentially nothing can be learned from them.
Finally, the finished model is invoked, its output produced and the model stored. An example of the FastText model's run results is shown in FIG. 5; a sketch of training with the fastText package under the parameters above is given below.
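The sketch uses the official fastText Python package; the input and output file names follow the earlier sketches and are assumptions, while the parameter values are those discussed above.

```python
# Sketch only: file names are assumed; parameter values follow the discussion above.
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    lr=1.8,             # learning rate
    epoch=50,           # number of training passes
    dim=64,             # word-vector dimension
    ws=5,               # context window size
    wordNgrams=2,       # use word bigrams
    minCount=5,         # ignore words occurring fewer than 5 times
    bucket=10_000_000,  # hash buckets for the n-grams
    loss="ns",          # negative-sampling output layer
)
model.save_model("defect_fasttext.bin")
n, precision, recall = model.test("test.txt")   # evaluate Precision and Recall on the test set
print(n, precision, recall)
```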

Claims (3)

1. A power grid equipment defect text classification method based on machine learning comprises the following steps:
step 1: acquiring the existing defect text csv data; finding defect types in the defect text csv data whose number of data items is below a set threshold and deleting all data under those defect types; finding records whose description contents are identical but whose actual IDs and defect types differ, determining which of those defect types contains the most data items, and classifying the records into that defect type;
extracting the defect description data, defect type ID data and defect type name data from the remaining defect text csv data;
step 2: regrouping all samples of the defect description data and the defect type name data according to the classification format of the defect type ID data, and labeling all data by defect type to serve as training data;
step 3: preprocessing each sample in the training data;
step 3.1: removing stop words from the training data according to a Chinese stop-word library and replacing them with spaces;
step 3.2: setting a frequency threshold and deleting any word whose frequency of occurrence in the samples is below that threshold;
step 3.3: performing text preprocessing with a word-vector training model in preparation for neural network training;
step 3.3.1: building a Huffman tree with a root node and a number of leaf nodes, the root node and the leaf nodes being connected through intermediate nodes; selecting from a defect text database all of the terms most representative of defects, the number of leaf nodes being equal to the number of selected terms and each leaf node representing one of them;
step 3.3.2: training the Huffman tree, in which the root node and each intermediate node compute a probability that approaches 0 or 1, where 0 means the word is passed down one branch of the node and 1 means it is passed down the other; randomly assigning every word in the defect text database a vector of 0s and 1s, called its word vector; training the root and intermediate nodes of the Huffman tree with all the obtained word vectors, and regarding training as finished when every word is routed to a leaf node whose meaning is the same as or similar to that word's;
step 3.3.3: extracting 2A subject words representing the defect from the data processed in step 3.2; feeding each word into the root node of the Huffman tree and passing it down until it reaches a leaf node, the word vector of that leaf node then representing the word vector of the word;
step 3.3.4: forming a matrix from the 2A word vectors obtained in step 3.3.3 and using it as the input of a neural network;
step 4: building a neural network and training it;
training the established neural network with the samples preprocessed in step 3 until the network converges;
step 5: classifying the power grid equipment defect data to be classified using the trained Huffman tree and the neural network.
2. The machine learning-based power grid equipment defect text classification method according to claim 1, wherein the specific method of step 4 is as follows:
a neural network consisting of first convolution - first pooling - second convolution - second pooling - fully connected layers is built and trained with the matrix obtained in step 3;
the word-embedding matrix of each training sample is input into the first convolution layer, the two-dimensional matrix is expanded to three dimensions for the convolution, and the first pooling, second convolution, second pooling and fully connected processing then follow in sequence; both convolution layers are computed with layers.convolution2d, 100 convolution kernels of size 20 x 20, and the relu activation function.
3. The machine learning-based power grid equipment defect text classification method according to claim 1, wherein the specific preprocessing method of step 3.3 is as follows: setting the total number of extracted words to 2A, assigning each word a random word vector so that the words form a 2A x m matrix, and updating this matrix so as to obtain the word vectors W1, W2, ..., W2A of all 2A words;
the word vectors for 2A words are added up, i.e.:
X_w = ∑_{i=1}^{2A} W_i
the output of the model can be understood as a binary tree whose leaf nodes are all the words of the training corpus, with word frequency used as the weight of each leaf node; this is a Huffman tree, in which the number of leaf nodes equals the number of words in the training corpus;
the Huffman tree is constructed to obtain, for a given word, the probability of which word in the total vocabulary is closest to it, i.e. the classification basis; starting from the root node of the Huffman tree, every left subtree is denoted 0 and every right subtree 1; at each node a mapping function is evaluated, and if its value is less than 0.5 the judgment proceeds into the left subtree, otherwise into the right subtree; this repeats until the leaf node closest to the word is reached, completing the traversal; in the Huffman tree, high-weight words lie close to the root node and low-weight words lie far from it; with the tree so established, the mapping function is:
σ(X_w, θ) = 1 / (1 + e^(−X_w·θ))
in the formula, X_w is the word vector of the leaf node and θ is the logistic-regression model parameter, which must be solved from the training samples; in short, if a word w occurs in a sentence, the Huffman tree contains a unique path p_w from the root node to the node where w is located; along this path there are l_w − 1 branches, each treated as a binary classification that yields one probability, and multiplying these probabilities gives the required conditional probability; substituting this into the log-likelihood function:
L = ∑_{ω∈2A} log p(X_ω | 2A)
the following likelihood functions can be obtained:
L = ∑_{ω∈2A} ∑_{i=1}^{l−1} log p(n(ω, i+1) | X_ω, θ_i), where p(n(ω, i+1) | X_ω, θ_i) = σ(X_ω·θ_i) if the 1 branch is taken at node i, and 1 − σ(X_ω·θ_i) if the 0 branch is taken
here n(ω, i) denotes the i-th node on the path of word ω in the Huffman tree, the total number of nodes from the root node to the leaf node of the word being judged is l, and i ∈ (1, l); the model parameter corresponding to each node is θ_i; this likelihood function is maximized by stochastic gradient ascent, during which θ_i and the word vector X_w are continuously updated; finally a complete Huffman tree is obtained with all leaf nodes filled by different words, all word vectors are updated at that point, and the updated vectors are combined into a word matrix that is input into the neural network;
step 4: building a neural network and training it;
training the established neural network with the samples preprocessed in step 3 until the network converges;
step 5: classifying the power grid equipment defect data to be classified using the trained Huffman tree and the neural network.
CN202010683964.6A 2020-07-16 2020-07-16 Power grid equipment defect text classification method based on machine learning Pending CN111966825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010683964.6A CN111966825A (en) 2020-07-16 2020-07-16 Power grid equipment defect text classification method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010683964.6A CN111966825A (en) 2020-07-16 2020-07-16 Power grid equipment defect text classification method based on machine learning

Publications (1)

Publication Number Publication Date
CN111966825A true CN111966825A (en) 2020-11-20

Family

ID=73362166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010683964.6A Pending CN111966825A (en) 2020-07-16 2020-07-16 Power grid equipment defect text classification method based on machine learning

Country Status (1)

Country Link
CN (1) CN111966825A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011166A (en) * 2021-04-19 2021-06-22 华北电力大学 Relay protection defect text synonym recognition method based on decision tree classification
CN113111750A (en) * 2021-03-31 2021-07-13 智慧眼科技股份有限公司 Face living body detection method and device, computer equipment and storage medium
CN113111183A (en) * 2021-04-20 2021-07-13 通号(长沙)轨道交通控制技术有限公司 Traction power supply equipment defect grade classification method
CN113157675A (en) * 2021-03-05 2021-07-23 深圳供电局有限公司 Defect positioning method and device, computer equipment and storage medium
CN113435195A (en) * 2021-07-01 2021-09-24 贵州电网有限责任公司 Defect intelligent diagnosis model construction method based on main transformer load characteristics
CN113486347A (en) * 2021-06-30 2021-10-08 福州大学 Deep learning hardware Trojan horse detection method based on semantic understanding
CN114036946B (en) * 2021-11-26 2023-07-07 浪潮卓数大数据产业发展有限公司 Text feature extraction and auxiliary retrieval system and method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326346A (en) * 2016-08-06 2017-01-11 上海高欣计算机系统有限公司 Text classification method and terminal device
CN108596470A (en) * 2018-04-19 2018-09-28 浙江大学 A kind of power equipments defect text handling method based on TensorFlow frames
CN110895565A (en) * 2019-11-29 2020-03-20 国网湖南省电力有限公司 Method and system for classifying fault defect texts of power equipment
CN111027595A (en) * 2019-11-19 2020-04-17 电子科技大学 Double-stage semantic word vector generation method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326346A (en) * 2016-08-06 2017-01-11 上海高欣计算机系统有限公司 Text classification method and terminal device
CN108596470A (en) * 2018-04-19 2018-09-28 浙江大学 A kind of power equipments defect text handling method based on TensorFlow frames
CN111027595A (en) * 2019-11-19 2020-04-17 电子科技大学 Double-stage semantic word vector generation method
CN110895565A (en) * 2019-11-29 2020-03-20 国网湖南省电力有限公司 Method and system for classifying fault defect texts of power equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘梓权 et al.: "Research on a Convolutional-Neural-Network-Based Defect Text Classification Model for Power Equipment" (基于卷积神经网络的电力设备缺陷文本分类模型研究), 《电网技术》 (Power System Technology) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157675A (en) * 2021-03-05 2021-07-23 深圳供电局有限公司 Defect positioning method and device, computer equipment and storage medium
CN113111750A (en) * 2021-03-31 2021-07-13 智慧眼科技股份有限公司 Face living body detection method and device, computer equipment and storage medium
CN113011166A (en) * 2021-04-19 2021-06-22 华北电力大学 Relay protection defect text synonym recognition method based on decision tree classification
CN113111183A (en) * 2021-04-20 2021-07-13 通号(长沙)轨道交通控制技术有限公司 Traction power supply equipment defect grade classification method
CN113486347A (en) * 2021-06-30 2021-10-08 福州大学 Deep learning hardware Trojan horse detection method based on semantic understanding
CN113486347B (en) * 2021-06-30 2023-07-14 福州大学 Deep learning hardware Trojan horse detection method based on semantic understanding
CN113435195A (en) * 2021-07-01 2021-09-24 贵州电网有限责任公司 Defect intelligent diagnosis model construction method based on main transformer load characteristics
CN113435195B (en) * 2021-07-01 2023-10-03 贵州电网有限责任公司 Defect intelligent diagnosis model construction method based on main transformer load characteristics
CN114036946B (en) * 2021-11-26 2023-07-07 浪潮卓数大数据产业发展有限公司 Text feature extraction and auxiliary retrieval system and method

Similar Documents

Publication Publication Date Title
CN111966825A (en) Power grid equipment defect text classification method based on machine learning
CN109697232B (en) Chinese text emotion analysis method based on deep learning
CN111694924B (en) Event extraction method and system
CN108984526B (en) Document theme vector extraction method based on deep learning
CN111209738B (en) Multi-task named entity recognition method combining text classification
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN111177383A (en) Text entity relation automatic classification method fusing text syntactic structure and semantic information
CN114911945A (en) Knowledge graph-based multi-value chain data management auxiliary decision model construction method
CN111428513A (en) False comment analysis method based on convolutional neural network
JP2024503036A (en) Methods and systems for improved deep learning models
US11164044B2 (en) Systems and methods for tagging datasets using models arranged in a series of nodes
CN111709225A (en) Event cause and effect relationship judging method and device and computer readable storage medium
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN114722198A (en) Method, system and related device for determining product classification code
CN113221569A (en) Method for extracting text information of damage test
CN117151222A (en) Domain knowledge guided emergency case entity attribute and relation extraction method thereof, electronic equipment and storage medium
CN112347247A (en) Specific category text title binary classification method based on LDA and Bert
CN111460817A (en) Method and system for recommending criminal legal document related law provision
CN115759095A (en) Named entity recognition method and device for tobacco plant diseases and insect pests
CN115713970A (en) Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network
CN115827871A (en) Internet enterprise classification method, device and system
CN115098707A (en) Cross-modal Hash retrieval method and system based on zero sample learning
CN112651590B (en) Instruction processing flow recommending method
CN115129890A (en) Feedback data map generation method and generation device, question answering device and refrigerator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201120)