CN111966825A - Power grid equipment defect text classification method based on machine learning - Google Patents

Power grid equipment defect text classification method based on machine learning Download PDF

Info

Publication number
CN111966825A
CN111966825A CN202010683964.6A
Authority
CN
China
Prior art keywords
word
data
defect
training
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010683964.6A
Other languages
Chinese (zh)
Inventor
郑泽忠
李慕杰
姜宇轩
王志勇
牟范
江邵斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010683964.6A priority Critical patent/CN111966825A/en
Publication of CN111966825A publication Critical patent/CN111966825A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a power grid equipment defect text classification method based on machine learning, belonging to the field of Chinese text analysis and natural language processing. The method first converts words into numerical form, then extracts keywords from the sentences of a defect text, combines the keyword word vectors into a word matrix, feeds that matrix into a neural network for classification, and thereby identifies which type of defect the text describes. The method therefore offers high classification accuracy and speed; by extracting core words from the sample data the invention also optimizes the sample features, reducing the computation required by the subsequent neural-network classification and improving the computation speed.

Description

Power grid equipment defect text classification method based on machine learning
Technical Field
The invention belongs to the field of Chinese text analysis and natural language processing, and particularly relates to a power grid equipment defect text classification method based on machine learning.
Background
With the intelligent and informatized construction of the power grid, grid enterprises have accumulated large amounts of data, gradually forming the electric power big data that both academia and industry now focus on. In the electric power field, research has concentrated mainly on structured data mining and on image recognition, while research on power text mining has only just begun.
The defect descriptions recorded by large numbers of field workers contain much valuable information, yet organizing these data manually requires experts and is time-consuming and labor-intensive. Some research on power grid text exists, but no deep-learning-based results have yet been applied to it.
Disclosure of Invention
The invention aims to provide a power grid equipment defect text classification method based on machine learning, which is used for solving the problem of automatic classification of power grid equipment defect texts in the prior art.
In order to achieve the purpose, the technical scheme of the invention is as follows: a power grid equipment defect text classification method based on machine learning comprises the following steps:
Step 1: acquiring the existing defect text csv data; finding defect types in the defect text csv data whose number of data items is below a set threshold and deleting all data under those defect types; finding records whose description contents are identical but whose actual IDs and defect types differ, determining which of those defect types contains the most data items, and classifying the records into that defect type;
extracting the defect description data, defect type ID data and defect type name data from the remaining defect text csv data;
Step 2: regrouping all samples of the defect description data and the defect type name data according to the classification format of the defect type ID data, and labeling all data by defect type to serve as training data;
Step 3: preprocessing each sample in the training data;
Step 3.1: removing stop words from the training data according to a Chinese stop-word library and replacing them with spaces;
Step 3.2: setting a frequency threshold and deleting any word whose frequency of occurrence in the samples is below that threshold;
Step 3.3: performing text preprocessing with a word-vector training model in preparation for neural network training;
Step 3.3.1: building a Huffman tree with a root node and a number of leaf nodes, the root node and the leaf nodes being connected through intermediate nodes; selecting from a defect text database all of the terms most representative of defects, the number of leaf nodes being equal to the number of selected terms and each leaf node representing one of them;
Step 3.3.2: training the Huffman tree, in which the root node and each intermediate node compute a probability that approaches 0 or 1, where 0 means the word is passed down one branch of the node and 1 means it is passed down the other; randomly assigning every word in the defect text database a vector of 0s and 1s, called its word vector; training the root and intermediate nodes of the Huffman tree with all the obtained word vectors, and regarding training as finished when every word is routed to a leaf node whose meaning is the same as or similar to that word's;
Step 3.3.3: extracting 2A subject words representing the defect from the data processed in step 3.2; feeding each word into the root node of the Huffman tree and passing it down until it reaches a leaf node, the word vector of that leaf node then representing the word vector of the word;
Step 3.3.4: forming a matrix from the 2A word vectors obtained in step 3.3.3 and using it as the input of a neural network;
Step 4: building a neural network and training it;
training the established neural network with the samples preprocessed in step 3 until the network converges;
Step 5: classifying the power grid equipment defect data to be classified using the trained Huffman tree and the neural network.
The specific method of step 4 is as follows:
a neural network consisting of first convolution - first pooling - second convolution - second pooling - fully connected layers is built and trained with the matrix obtained in step 3;
the word-embedding matrix of each training sample is input into the first convolution layer, the two-dimensional matrix is expanded to three dimensions for the convolution, and the first pooling, second convolution, second pooling and fully connected processing then follow in sequence; both convolution layers are computed with layers.convolution2d, 100 convolution kernels of size 20 x 20, and the relu activation function.
The method classifies the power grid defect data with a neural network and therefore offers high classification accuracy and speed; by extracting core words from the sample data, the invention also optimizes the sample features, reducing the computation required by the subsequent neural-network classification and improving the computation speed.
Drawings
FIG. 1 is an example of a raw data set of one embodiment of the present invention.
FIG. 2 is an example of raw data processing and saving according to an embodiment of the present invention.
FIG. 3 is an example of a preprocessed word embedding matrix according to one embodiment of the invention.
FIG. 4 is an example of preprocessed words for the FastText model of an embodiment of the present invention.
FIG. 5 is an example of the run results of the FastText model of one embodiment of the present invention; (a) shows the training results on the training set and (b) the results on the test set.
Detailed Description
Example 1:
a power grid equipment defect text classification method based on machine learning comprises the following steps:
s1, preprocessing the collected defect text csv file, and taking out defect description, defect type ID and defect type name columns in the csv data as training data;
s2, on the basis of the step S1, classifying the data group into a plurality of csv files according to the defect type ID, and marking the csv files as supervised learning samples;
s3, preprocessing the text, stopping words and carrying out original feature dimension reduction through a word vector training model;
s4, constructing a two-layer convolutional neural network by using an equal sample extraction training method, and importing the embedded word vector matrix for learning;
s5, circularly extracting sample data for training, analyzing the trained model and evaluating the accuracy;
the five-step process is detailed below in a simulation experiment using existing defect text for classification.
Step 1: preprocessing a file;
The information items that represent the classification basis in the collected csv files are inspected with Notepad, then exported column by column and given labels and IDs for use in the subsequent supervised training, and the header data are unified for easy reading. If description contents are identical but the actual IDs and defect type labels differ, different workers used inconsistent terms when recording, so these entries are merged under the same defect description for learning. Defect types with too few entries are removed to prevent them from interfering with learning; only about 3% of the data is removed in the end, which has essentially no influence on building the classification model. An example of a raw data set is shown in FIG. 1.
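A minimal pandas sketch of this cleaning step is given below; the file name, column names and the minimum-count threshold are illustrative assumptions, not values taken from the original text.

```python
# Hypothetical sketch of the step-1 cleaning; file/column names and the threshold are assumed.
import pandas as pd

df = pd.read_csv("defects.csv", encoding="gb2312")

# Drop defect types whose entry count falls below a threshold (only ~3% of data is lost).
MIN_SAMPLES = 20  # assumed threshold
counts = df["defect_type_id"].value_counts()
df = df[df["defect_type_id"].isin(counts[counts >= MIN_SAMPLES].index)]

# For identical descriptions recorded under different IDs, keep the majority defect type.
majority = (df.groupby("defect_description")["defect_type_id"]
              .agg(lambda s: s.value_counts().idxmax()))
df["defect_type_id"] = df["defect_description"].map(majority)

# Keep only the three columns used as training data.
df = df[["defect_description", "defect_type_id", "defect_type_name"]]
```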
step 2: classifying and making a learning sample;
The data are split into multiple csv files by defect type ID, with the defect type ID used as the file name, and stored together in a data folder under the project directory for convenient access; when a file is loaded, the corresponding defect type ID is attached to facilitate word segmentation; the csv files are read in GB2312 encoding. An example of raw data processing and saving is shown in FIG. 2.
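A sketch of this step under the same assumed column names; one csv file per defect type ID is written to a data folder in GB2312 encoding, with all names illustrative.

```python
# Hypothetical sketch of step 2; paths and column names are assumptions.
import os
import pandas as pd

df = pd.read_csv("cleaned_defects.csv", encoding="gb2312")
os.makedirs("data", exist_ok=True)

for type_id, group in df.groupby("defect_type_id"):
    # One file per defect type, named by the defect type ID, for convenient later loading.
    group.to_csv(os.path.join("data", f"{type_id}.csv"), index=False, encoding="gb2312")
```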
Step 3: text preprocessing and embedding of training words into a matrix;
All defect text descriptions are loaded, stop words are removed and the training words are embedded into a matrix; the stop words come from a Chinese stop-word library, and the removed stop words are replaced with spaces. Word embedding is performed with the CBOW word-vector training model, together with a hierarchical (multi-layer) Softmax model.
the specific operation is as follows: in order to extract features in each sentence, non-meaning words such as "in", "out", and the like in a sentence are removed, and because these words frequently appear in any sentence, they do not help to distinguish two sentences having different meanings. The remaining semantic words are separated by spaces so that a sentence is processed into a collection of individual words.
Since a computer cannot understand how different words differ, the words must be mapped into a word-embedding matrix in order to capture the meaning behind them and the associations between them. Each word in a sentence is represented by an array, so that the computer can judge from the array's features whether two words have similar meanings or how they are related. The maximum number of elements in each array can be given an upper limit m, which is usually related to the total number of words in the text. Every sentence in the text can therefore be represented by a matrix, and the computer can understand the meaning behind the sentence by extracting features from this matrix. The initial values of this word-embedding matrix are determined as follows.
Assume the total number of extracted words in the sentence is 2A, so the sentence forms a 2A x m matrix. To initialize and update this matrix, each word is first assigned a random word vector, giving word vectors W1, W2, ..., W2A for the 2A words.
the word vectors for 2A words are added up, i.e.:
X_w = ∑_{i=1}^{2A} W_i
the output of the model can be understood as a binary tree whose leaf nodes are all words in the trained article, and then the word frequency is used as the weight of all the leaf nodes. The method is a Huffman tree, in the Huffman tree, the number of leaf nodes is all word numbers in a training article, and non-leaf nodes are the number of leaf nodes minus one.
The Huffman tree is constructed to obtain, for a given word, the probability of which word in the total vocabulary is closest to it, i.e. the classification basis. Starting from the root node of the Huffman tree, every left subtree is denoted 0 and every right subtree 1; at each node a mapping function is evaluated, and if its value is less than 0.5 the judgment proceeds into the left subtree, otherwise into the right subtree. This repeats until the leaf node closest to the word is reached, completing the traversal. In a Huffman tree, high-weight words lie close to the root node and low-weight words lie far from it. With the tree so established, the mapping function is:
σ(X_w, θ) = 1 / (1 + e^(−X_w·θ))
In the formula, X_w is the word vector of the leaf node and θ is the logistic-regression model parameter, which must be solved from the training samples. In short, if a word w occurs in a sentence, the Huffman tree contains a unique path p_w from the root node to the node where w is located; along this path there are l_w − 1 branches, each treated as a binary classification that yields one probability, and multiplying these probabilities gives the required conditional probability. Substituting this into the log-likelihood function:
L = ∑_{ω∈2A} log p(X_ω | 2A)
the following likelihood functions can be obtained:
L = ∑_{ω∈2A} ∑_{i=1}^{l−1} log p(n(ω, i+1) | X_ω, θ_i), where p(n(ω, i+1) | X_ω, θ_i) = σ(X_ω·θ_i) if the 1 branch is taken at node i, and 1 − σ(X_ω·θ_i) if the 0 branch is taken
Here n(ω, i) denotes the i-th node on the path of word ω in the Huffman tree, the total number of nodes from the root node to the leaf node of the word being judged is l, and i ∈ (1, l); the model parameter attached to each node is θ_i. This likelihood function is maximized by stochastic gradient ascent, during which θ_i and the word vector X_w are updated continuously. Finally a complete Huffman tree is obtained with all leaf nodes filled by different words, giving a good preprocessing result. All word vectors have now been updated, and the resulting word-embedding matrix is trained in the next step. A small numerical illustration of the path probability is sketched below.
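The sketch below only illustrates the probability described above: a sigmoid at each inner node on a word's path gives the probability of taking the 0 or 1 branch, and the product over the branches is the conditional probability whose logarithm enters L. The vectors, the path and its length are made-up values.

```python
# Illustrative only: random vectors and an arbitrary path stand in for a trained Huffman tree.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m = 100                                           # word-vector dimension
x_w = np.random.randn(m)                          # word vector X_w
thetas = [np.random.randn(m) for _ in range(3)]   # theta_i for the 3 inner nodes on the path
branches = [1, 0, 1]                              # branch taken at each node (0 = left, 1 = right)

prob = 1.0
for theta, d in zip(thetas, branches):
    p_one = sigmoid(x_w @ theta)                  # probability of the "1" branch at this node
    prob *= p_one if d == 1 else (1.0 - p_one)

log_term = np.log(prob)                           # this word's contribution to the log-likelihood L
print(prob, log_term)
```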
The pre-trained word-vector matrix is stored; the preprocessed word-embedding matrix is shown in FIG. 3. A sketch of training and saving such vectors with a CBOW model follows.
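This sketch uses gensim's Word2Vec as one possible implementation of CBOW with hierarchical softmax; the toy corpus, file name and vector size are assumptions for illustration.

```python
# Sketch with an assumed toy corpus and file name.
from gensim.models import Word2Vec

corpus = [["变压器", "套管", "渗油"],        # each sample: a list of segmented defect words (toy data)
          ["断路器", "分闸", "失灵"]]

model = Word2Vec(sentences=corpus, vector_size=100, sg=0, hs=1, negative=0,
                 window=5, min_count=1, epochs=10)   # sg=0: CBOW, hs=1: hierarchical softmax
model.save("defect_word2vec.model")                  # store the pre-trained word-vector matrix
vec = model.wv["变压器"]                             # 100-dimensional vector for one word
```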
Step 4: establishing the CNN model;
From all preprocessed texts (defect descriptions), the same number of texts is taken for each defect type ID. A CNN model composed of convolution, pooling and fully connected layers is built, and the word-embedding matrices are split into a training set and a test set in a 0.75/0.25 ratio, the training set being used to train the model.
The specific operation is as follows:
the width of the convolution kernel is consistent with that of the word Embedding matrix (Embedding Layer), a convolution kernel is used for processing sentences, and a vector can be obtained after convolution. Assuming that there is a convolution kernel, which is a matrix W with a width d and a height h, the parameters to be updated are all the elements of the matrix W, and the total number is h times d. Each sentence is processed, and a matrix with s rows and d columns can be obtained through convolution kernel operation:
A∈Rs×d
when A [ i: j ] represents that the matrix is from the i row to the j row of the matrix A, the formula of the convolution operation is as follows:
oi=w·A[i:i+h-1]
it may be biased by a bias b, simply denoted by f (x) an activation function. This can be used to formulate the characteristics as follows:
ci=f(oi+b)
if this operation is repeated a number of times on the convolution kernel, the feature c ∈ Rs-h +1 can be obtained, for a total of s-h +1 features. To obtain richer and more characteristic expressions, other highly different convolution kernels may be employed for polling.
The pooling operation follows. This step is also called sub-sampling or down-sampling. It usually acts on non-overlapping regions and includes modes such as mean pooling and max pooling. The essence of pooling is down-sampling: for example, a 6 x 6 matrix can be down-sampled to a 3 x 3 matrix with max pooling.
After the convolution calculation by a convolution kernel, the resulting matrix is reduced in dimension through pooling. Max pooling, as the name implies, takes the maximum value of the convolved matrix (for a 6 x 6 matrix, taking the maximum over each 2 x 2 block yields a 3 x 3 matrix), which prevents a few individual values from dominating the whole; a numpy sketch of the convolution and of this pooling example follows.
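A small numpy sketch of the two operations just described: a kernel of height h sliding over an s x d sentence matrix produces s−h+1 features, and 2 x 2 max pooling reduces a 6 x 6 matrix to 3 x 3. All sizes and values are illustrative.

```python
# Illustrative sizes only; mirrors o_i = W·A[i:i+h-1], c_i = f(o_i + b) and the 6x6 -> 3x3 pooling example.
import numpy as np

s, d, h = 10, 8, 3                    # sentence length, embedding width, kernel height
A = np.random.randn(s, d)             # sentence matrix (one row per word)
W = np.random.randn(h, d)             # convolution kernel
b = 0.1                               # bias
relu = lambda x: np.maximum(x, 0.0)

# Convolution: slide the kernel over the rows of A, producing s-h+1 features.
c = np.array([relu(np.sum(W * A[i:i + h]) + b) for i in range(s - h + 1)])

# Max pooling: reduce a 6x6 matrix to 3x3 by taking the maximum over each 2x2 block.
M = np.random.randn(6, 6)
pooled = M.reshape(3, 2, 3, 2).max(axis=(1, 3))   # shape (3, 3)
print(c.shape, pooled.shape)
```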
The first role of the pooling layer is dimensionality reduction: in the example, the 6 x 6 matrix becomes a 3 x 3 matrix. The intuitive benefit is reduced computational load, and because of the convolution calculations that follow, the saving in computation is greatly amplified.
A pooling layer also helps guarantee feature invariance, including translation, rotation and scale invariance: features obtained from the original input after convolution do not change after the pooling operation. Translation invariance means the same feature can still be recognized elsewhere, for example in other sentences.
The pooling layer also ensures a fixed-length output, which is important for text classification: input matrices of different sizes must still produce outputs of the same size. The convolved matrices are pooled and spliced together, and a border of padding values (all 1s) is added so that the final outputs have the same size.
Finally comes the fully connected layer. The matrices obtained after max pooling are spliced together and fed into Softmax, which gives the probability of each category, e.g. the probability that the label is 1 and the probability that it is -1. This completes the forward flow of TextCNN used for prediction. With supervised learning, the loss function can then be computed from the predicted label and the actual training-set label; the gradients are used to update the parameters of the Softmax function, the max-pooling operation, the activation function and the convolution kernels, completing one round of training.
In the experiment the preset word-embedding dimension is 100. The pre-trained word-embedding matrix is used as the feature input to the first convolution layer; the two-dimensional word-vector matrix is first expanded to three dimensions (the third dimension is 1), and layers.convolution2d is then applied. 100 convolution kernels of size 20 x 20 are used, followed by the relu activation function, i.e. relu is applied to the output x of the convolution before gradient descent. The second convolution and pooling layers are the same as the first and feed into the fully connected layer. The optimizer is Adam, and the learning rate is adjusted according to the loss function.
The step size is 100 and 10,000 training iterations are carried out; the running time and the loss are output, and the trained model is saved. A sketch of a model under these settings is given below.
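The following tf.keras sketch is one possible realization of the stated settings; the input length, class count, pooling size and padding are assumptions not fixed by the text.

```python
# Sketch only: MAX_WORDS, NUM_CLASSES, pooling size and padding are assumed.
import tensorflow as tf

MAX_WORDS, EMBED_DIM, NUM_CLASSES = 500, 100, 30

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_WORDS, EMBED_DIM)),            # pre-trained word-embedding matrix
    tf.keras.layers.Reshape((MAX_WORDS, EMBED_DIM, 1)),             # 2-D matrix -> 3-D (third dimension 1)
    tf.keras.layers.Conv2D(100, (20, 20), activation="relu", padding="same"),  # first convolution
    tf.keras.layers.MaxPooling2D((2, 2)),                           # first pooling
    tf.keras.layers.Conv2D(100, (20, 20), activation="relu", padding="same"),  # second convolution
    tf.keras.layers.MaxPooling2D((2, 2)),                           # second pooling
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),       # fully connected output
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```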
And 5: analyzing the model and evaluating the accuracy;
A mapping is built between the defect type and y_train / y_test, and a map function is used to map train_target and test_target to the defect type IDs so that prediction accuracy can be judged conveniently. y_predicted (the label predicted for the x_test set by the model trained in S4) is compared with the actual y_test (the true labels of the test set) to obtain the corresponding accuracy, and the model is adjusted according to the accuracy and the loss function reported during training. A piece of text can also be entered manually, and its category is judged from the per-category probabilities given by the model.
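Continuing the sketch above, the accuracy comparison might look as follows; x_test, y_test and the index-to-ID mapping id_map are assumed to exist from the earlier (hypothetical) data preparation.

```python
# Sketch only: model comes from the sketch above; x_test, y_test and id_map are assumed inputs.
import numpy as np

probs = model.predict(x_test)                           # per-class probabilities for the test set
y_predicted = probs.argmax(axis=1)                      # predicted class indices
pred_ids = np.array([id_map[i] for i in y_predicted])   # map indices back to defect type IDs
true_ids = np.array([id_map[i] for i in y_test])
accuracy = float(np.mean(pred_ids == true_ids))
print(f"test accuracy: {accuracy:.4f}")
```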
Example 2:
on the basis of embodiment 1, the CNN model building in step 4 may be replaced by the following steps:
classification is performed by building a FastText model:
step 1: preprocessing the text labelled with defect type IDs, removing stop words, and dividing a training set and a test set;
step 2: establishing a softmax function that normalizes the output-layer values into the 0-1 interval, turning the neuron outputs into a probability distribution;
step 3: carrying out model training, adjusting the parameters, and evaluating the model result through Precision and Recall.
Wherein, the specific operation of the step 1 is as follows:
for all the preprocessed texts (defect descriptions), the same number of texts are taken for training for each defect type ID. Cutting out a training set and a test set according to the proportion of 0.75/0.25, carrying out de-stop word processing on the text of the training set by using a Chinese common stop word library, and storing the corresponding defect type ID in the format of __ label __ + defect type ID before the processed text.
Wherein, the specific operation of the step 2 is as follows:
the word vector training model mentioned above is to give each word after word segmentation to the other word, and then to do operations such as dimension reduction on the basis. But this leads to a problem: morphological features inside the word are ignored, such as: "it generates a vector for each word. "in this context, it is segmented into vectors such as" it "," each "," word "," generate "," word vector ". However, if there is another sentence with words such as "word segmentation word" and "processing generation", it is clear that the words with close meaning are expressed by two different word vectors, and if the two sentences are under the same label, the training result of the word segmentation obviously greatly reduces the accuracy. In the aforementioned word vector training model method, the information of the word grains in the word and the information between the word grains are lost due to their integration.
To solve this problem, FastText represents each word with character-level N-grams. For the four-character word "处理生成" ("process generation"), taking N = 2, its character 2-grams are:
"<处", "处理", "理生", "生成", "成>"
where "<" marks the prefix of the word and ">" marks the suffix. These N-grams are then used to represent the word, and the word vectors of these five N-grams together represent the word vector of the original word. This brings two benefits:
For low-frequency words, the word vectors generated in this way express the meaning more accurately, because their N-grams can coincide with those of other words, which helps when handling related or compound words.
Word vectors can still be obtained for words that do not appear in the current article but may appear in other samples. Several constants are defined next:
VOCAB_SIZE = 2000: the size of the vocabulary, i.e. how many words in total receive word vectors; it is simply set to 2000 here;
EMBEDDING_DIM: the dimension of the vector each word is mapped to at the output of the embedding layer, set to 100 (the illustrative vector below has length 64). For example, the term "deep learning" may be represented by a length-64 real-valued vector such as [0.123456, 1.123465, 2.1234656, 6.46487987, 0.2165466, …];
MAX_WORDS = 500: the maximum number of words used when building the word-embedding matrix. Learning texts differ in length and yield different numbers of words after segmentation; to give the neural network an input of fixed dimension, a maximum word count is set, and texts with fewer words are padded with blanks, i.e. 0. An example of preprocessed words for the FastText model is shown in FIG. 4.
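A sketch of these constants and of padding each index sequence to MAX_WORDS with zeros; the hand-rolled vocabulary and the toy samples are assumptions standing in for whatever tokenizer is actually used.

```python
# Sketch only: a hand-rolled vocabulary stands in for the real tokenizer.
import numpy as np

VOCAB_SIZE = 2000       # how many words receive word vectors in total
EMBEDDING_DIM = 100     # dimension of each word vector after the embedding layer
MAX_WORDS = 500         # maximum number of words kept per text

texts = ["变压器 套管 渗油", "断路器 分闸 失灵"]      # pre-segmented, space-separated samples (toy data)
vocab = {}
for t in texts:
    for w in t.split():
        vocab.setdefault(w, len(vocab) + 1)          # index 0 is reserved for padding

def to_padded_indexes(text: str) -> np.ndarray:
    idx = [vocab[w] for w in text.split()][:MAX_WORDS]
    return np.array(idx + [0] * (MAX_WORDS - len(idx)))   # fill the remainder with 0

padded = np.stack([to_padded_indexes(t) for t in texts])   # shape (2, 500)
print(padded.shape)
```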
Wherein, the specific operation of the step 3 is as follows:
The model is built in the following steps:
Word embedding, i.e. the input layer (embedding layer), comes first. The input to the embedding layer is a collection of documents, with each input text represented by the sequence of its word indexes. For example, the sequence [20, 50, 90, 200] may represent the short text "Chinese text classification machine learning", where the indexes of "Chinese", "text", "classification" and "machine learning" in the vocabulary are 20, 50, 90 and 200 respectively; each word vector on input has dimension EMBEDDING_DIM, and input_shape = (BATCH_SIZE, MAX_WORDS);
The hidden layer (projection layer) comes next. It is in fact a summing-and-averaging operation over the word vectors supplied by the embedding matrix. Its input_shape is the output_shape of the embedding layer, and its output_shape is (BATCH_SIZE, EMBEDDING_DIM);
The output layer (softmax layer) is added next. The real fastText output layer is hierarchical Softmax, which is replaced here by plain Softmax. This layer is sized by CLASS_NUM, the number of classes for which probabilities are produced, and its output_shape is (BATCH_SIZE, CLASS_NUM);
Finally, the loss function, optimizer, evaluation metric and model compilation need to be set. In this project the loss function is categorical cross-entropy, matching the softmax output layer above; the optimizer is SGD, i.e. stochastic gradient descent; and the evaluation metric is accuracy, i.e. the comparison accuracy output at the end. A sketch of this three-layer model follows.
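The tf.keras sketch below mirrors the three layers just described, with plain Softmax standing in for hierarchical Softmax; CLASS_NUM and the other constants are assumed values.

```python
# Sketch only: CLASS_NUM and constants are assumed; GlobalAveragePooling1D plays the role of the projection (averaging) layer.
import tensorflow as tf

VOCAB_SIZE, EMBEDDING_DIM, CLASS_NUM = 2000, 100, 30

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),        # input (embedding) layer
    tf.keras.layers.GlobalAveragePooling1D(),                    # hidden (projection) layer: average the word vectors
    tf.keras.layers.Dense(CLASS_NUM, activation="softmax"),      # output (softmax) layer
])
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
model.summary()
```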
Generally the model must be adjusted repeatedly, based on the obtained accuracy, starting from the first round of parameters. The figure shows the final accuracy of the output model; the variables used during parameter tuning are as follows:
model: the choice of model, either skip-gram or CBOW. As described above, skip-gram predicts the surrounding words from the current word, whereas CBOW does the opposite. Skip-gram is selected in this experiment.
lr: the learning rate. It is adjusted continually according to the results, typically around 0.1; too high a value may cause overfitting, and too low a value slows training. The final learning rate in this experiment is 1.8; because the texts are short and there are many classes, overfitting is unlikely even with a high learning rate.
epoch: the number of training passes. The epoch count is related to the learning rate and is usually set to 50; 50 epochs are used in this experiment.
dim: the preset word-vector dimension. The more words remain after preprocessing, the larger the value should be; a suitable value can be found by repeated trials on the corpus. dim is set to 64 in this experiment.
ws: the window size. With the default ws = 5, the five words before and the five words after the current word are predicted.
wordNgrams: the word-level n-gram granularity. The default is 1-gram, i.e. each word gets its own vector and no sub-word units are used. Setting bi-grams (wordNgrams = 2) treats every two units as one, which increases the number of tokens. Using 2-grams in the classification keeps a reasonable training rate and saves training time.
loss: the default setting is negative sampling. Instead of a softmax output layer over all words, only a subset of words, together with the target word, is activated as output-layer dimensions when producing the retained target word.
bucket, set to 10000000.
minCount: the word-frequency cutoff. The default is 5, i.e. words occurring fewer than 5 times are ignored. If the corpus is relatively large, it is generally set smaller; words with too low a frequency are ignored because essentially nothing can be learned from them.
Finally, the finished model is invoked, its output produced and the model stored. An example of the FastText model's run results is shown in FIG. 5; a sketch of training with the fastText package under the parameters above is given below.
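The sketch uses the official fastText Python package; the input and output file names follow the earlier sketches and are assumptions, while the parameter values are those discussed above.

```python
# Sketch only: file names are assumed; parameter values follow the discussion above.
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    lr=1.8,             # learning rate
    epoch=50,           # number of training passes
    dim=64,             # word-vector dimension
    ws=5,               # context window size
    wordNgrams=2,       # use word bigrams
    minCount=5,         # ignore words occurring fewer than 5 times
    bucket=10_000_000,  # hash buckets for the n-grams
    loss="ns",          # negative-sampling output layer
)
model.save_model("defect_fasttext.bin")
n, precision, recall = model.test("test.txt")   # evaluate Precision and Recall on the test set
print(n, precision, recall)
```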

Claims (3)

1. A power grid equipment defect text classification method based on machine learning comprises the following steps:
step 1: acquiring the existing defect text csv data; finding defect types in the defect text csv data whose number of data items is below a set threshold and deleting all data under those defect types; finding records whose description contents are identical but whose actual IDs and defect types differ, determining which of those defect types contains the most data items, and classifying the records into that defect type;
extracting the defect description data, defect type ID data and defect type name data from the remaining defect text csv data;
step 2: regrouping all samples of the defect description data and the defect type name data according to the classification format of the defect type ID data, and labeling all data by defect type to serve as training data;
step 3: preprocessing each sample in the training data;
step 3.1: removing stop words from the training data according to a Chinese stop-word library and replacing them with spaces;
step 3.2: setting a frequency threshold and deleting any word whose frequency of occurrence in the samples is below that threshold;
step 3.3: performing text preprocessing with a word-vector training model in preparation for neural network training;
step 3.3.1: building a Huffman tree with a root node and a number of leaf nodes, the root node and the leaf nodes being connected through intermediate nodes; selecting from a defect text database all of the terms most representative of defects, the number of leaf nodes being equal to the number of selected terms and each leaf node representing one of them;
step 3.3.2: training the Huffman tree, in which the root node and each intermediate node compute a probability that approaches 0 or 1, where 0 means the word is passed down one branch of the node and 1 means it is passed down the other; randomly assigning every word in the defect text database a vector of 0s and 1s, called its word vector; training the root and intermediate nodes of the Huffman tree with all the obtained word vectors, and regarding training as finished when every word is routed to a leaf node whose meaning is the same as or similar to that word's;
step 3.3.3: extracting 2A subject words representing the defect from the data processed in step 3.2; feeding each word into the root node of the Huffman tree and passing it down until it reaches a leaf node, the word vector of that leaf node then representing the word vector of the word;
step 3.3.4: forming a matrix from the 2A word vectors obtained in step 3.3.3 and using it as the input of a neural network;
step 4: building a neural network and training it;
training the established neural network with the samples preprocessed in step 3 until the network converges;
step 5: classifying the power grid equipment defect data to be classified using the trained Huffman tree and the neural network.
2. The machine learning-based power grid equipment defect text classification method according to claim 1, wherein the specific method of step 4 is as follows:
a neural network consisting of first convolution - first pooling - second convolution - second pooling - fully connected layers is built and trained with the matrix obtained in step 3;
the word-embedding matrix of each training sample is input into the first convolution layer, the two-dimensional matrix is expanded to three dimensions for the convolution, and the first pooling, second convolution, second pooling and fully connected processing then follow in sequence; both convolution layers are computed with layers.convolution2d, 100 convolution kernels of size 20 x 20, and the relu activation function.
3. The machine learning-based power grid equipment defect text classification method according to claim 1, wherein the specific preprocessing method of step 3.3 is as follows: setting the total number of extracted words to 2A, assigning each word a random word vector so that the words form a 2A x m matrix, and updating this matrix so as to obtain the word vectors W1, W2, ..., W2A of all 2A words;
the word vectors for 2A words are added up, i.e.:
X_w = ∑_{i=1}^{2A} W_i
the output of the model can be understood as a binary tree whose leaf nodes are all the words of the training corpus, with word frequency used as the weight of each leaf node; this is a Huffman tree, in which the number of leaf nodes equals the number of words in the training corpus;
the Huffman tree is constructed to obtain, for a given word, the probability of which word in the total vocabulary is closest to it, i.e. the classification basis; starting from the root node of the Huffman tree, every left subtree is denoted 0 and every right subtree 1; at each node a mapping function is evaluated, and if its value is less than 0.5 the judgment proceeds into the left subtree, otherwise into the right subtree; this repeats until the leaf node closest to the word is reached, completing the traversal; in the Huffman tree, high-weight words lie close to the root node and low-weight words lie far from it; with the tree so established, the mapping function is:
σ(X_w, θ) = 1 / (1 + e^(−X_w·θ))
in the formula, X_w is the word vector of the leaf node and θ is the logistic-regression model parameter, which must be solved from the training samples; in short, if a word w occurs in a sentence, the Huffman tree contains a unique path p_w from the root node to the node where w is located; along this path there are l_w − 1 branches, each treated as a binary classification that yields one probability, and multiplying these probabilities gives the required conditional probability; substituting this into the log-likelihood function:
L = ∑_{ω∈2A} log p(X_ω | 2A)
the following likelihood functions can be obtained:
L = ∑_{ω∈2A} ∑_{i=1}^{l−1} log p(n(ω, i+1) | X_ω, θ_i), where p(n(ω, i+1) | X_ω, θ_i) = σ(X_ω·θ_i) if the 1 branch is taken at node i, and 1 − σ(X_ω·θ_i) if the 0 branch is taken
here n(ω, i) denotes the i-th node on the path of word ω in the Huffman tree, the total number of nodes from the root node to the leaf node of the word being judged is l, and i ∈ (1, l); the model parameter corresponding to each node is θ_i; this likelihood function is maximized by stochastic gradient ascent, during which θ_i and the word vector X_w are continuously updated; finally a complete Huffman tree is obtained with all leaf nodes filled by different words, all word vectors are updated at that point, and the updated vectors are combined into a word matrix that is input into the neural network;
step 4: building a neural network and training it;
training the established neural network with the samples preprocessed in step 3 until the network converges;
step 5: classifying the power grid equipment defect data to be classified using the trained Huffman tree and the neural network.
CN202010683964.6A 2020-07-16 2020-07-16 Power grid equipment defect text classification method based on machine learning Pending CN111966825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010683964.6A CN111966825A (en) 2020-07-16 2020-07-16 Power grid equipment defect text classification method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010683964.6A CN111966825A (en) 2020-07-16 2020-07-16 Power grid equipment defect text classification method based on machine learning

Publications (1)

Publication Number Publication Date
CN111966825A true CN111966825A (en) 2020-11-20

Family

ID=73362166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010683964.6A Pending CN111966825A (en) 2020-07-16 2020-07-16 Power grid equipment defect text classification method based on machine learning

Country Status (1)

Country Link
CN (1) CN111966825A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011166A (en) * 2021-04-19 2021-06-22 华北电力大学 Relay protection defect text synonym recognition method based on decision tree classification
CN113111750A (en) * 2021-03-31 2021-07-13 智慧眼科技股份有限公司 Face living body detection method and device, computer equipment and storage medium
CN113111183A (en) * 2021-04-20 2021-07-13 通号(长沙)轨道交通控制技术有限公司 Traction power supply equipment defect grade classification method
CN113157675A (en) * 2021-03-05 2021-07-23 深圳供电局有限公司 Defect positioning method and device, computer equipment and storage medium
CN113435195A (en) * 2021-07-01 2021-09-24 贵州电网有限责任公司 Defect intelligent diagnosis model construction method based on main transformer load characteristics
CN113486347A (en) * 2021-06-30 2021-10-08 福州大学 Deep learning hardware Trojan horse detection method based on semantic understanding
CN114036946B (en) * 2021-11-26 2023-07-07 浪潮卓数大数据产业发展有限公司 Text feature extraction and auxiliary retrieval system and method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326346A (en) * 2016-08-06 2017-01-11 上海高欣计算机系统有限公司 Text classification method and terminal device
CN108596470A (en) * 2018-04-19 2018-09-28 浙江大学 A kind of power equipments defect text handling method based on TensorFlow frames
CN110895565A (en) * 2019-11-29 2020-03-20 国网湖南省电力有限公司 Method and system for classifying fault defect texts of power equipment
CN111027595A (en) * 2019-11-19 2020-04-17 电子科技大学 Double-stage semantic word vector generation method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326346A (en) * 2016-08-06 2017-01-11 上海高欣计算机系统有限公司 Text classification method and terminal device
CN108596470A (en) * 2018-04-19 2018-09-28 浙江大学 A kind of power equipments defect text handling method based on TensorFlow frames
CN111027595A (en) * 2019-11-19 2020-04-17 电子科技大学 Double-stage semantic word vector generation method
CN110895565A (en) * 2019-11-29 2020-03-20 国网湖南省电力有限公司 Method and system for classifying fault defect texts of power equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘梓权 et al.: "Research on a Convolutional-Neural-Network-Based Defect Text Classification Model for Power Equipment" (基于卷积神经网络的电力设备缺陷文本分类模型研究), 《电网技术》 (Power System Technology) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157675A (en) * 2021-03-05 2021-07-23 深圳供电局有限公司 Defect positioning method and device, computer equipment and storage medium
CN113111750A (en) * 2021-03-31 2021-07-13 智慧眼科技股份有限公司 Face living body detection method and device, computer equipment and storage medium
CN113011166A (en) * 2021-04-19 2021-06-22 华北电力大学 Relay protection defect text synonym recognition method based on decision tree classification
CN113111183A (en) * 2021-04-20 2021-07-13 通号(长沙)轨道交通控制技术有限公司 Traction power supply equipment defect grade classification method
CN113486347A (en) * 2021-06-30 2021-10-08 福州大学 Deep learning hardware Trojan horse detection method based on semantic understanding
CN113486347B (en) * 2021-06-30 2023-07-14 福州大学 Deep learning hardware Trojan horse detection method based on semantic understanding
CN113435195A (en) * 2021-07-01 2021-09-24 贵州电网有限责任公司 Defect intelligent diagnosis model construction method based on main transformer load characteristics
CN113435195B (en) * 2021-07-01 2023-10-03 贵州电网有限责任公司 Defect intelligent diagnosis model construction method based on main transformer load characteristics
CN114036946B (en) * 2021-11-26 2023-07-07 浪潮卓数大数据产业发展有限公司 Text feature extraction and auxiliary retrieval system and method

Similar Documents

Publication Publication Date Title
CN111966825A (en) Power grid equipment defect text classification method based on machine learning
CN109697232B (en) Chinese text emotion analysis method based on deep learning
CN111694924B (en) Event extraction method and system
CN108984526B (en) Document theme vector extraction method based on deep learning
CN111209738B (en) Multi-task named entity recognition method combining text classification
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN111177383A (en) Text entity relation automatic classification method fusing text syntactic structure and semantic information
CN114911945A (en) Knowledge graph-based multi-value chain data management auxiliary decision model construction method
CN111428513A (en) False comment analysis method based on convolutional neural network
JP2024503036A (en) Methods and systems for improved deep learning models
US11164044B2 (en) Systems and methods for tagging datasets using models arranged in a series of nodes
CN111709225A (en) Event cause and effect relationship judging method and device and computer readable storage medium
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN114722198A (en) Method, system and related device for determining product classification code
CN113221569A (en) Method for extracting text information of damage test
CN117151222A (en) Domain knowledge guided emergency case entity attribute and relation extraction method thereof, electronic equipment and storage medium
CN112347247A (en) Specific category text title binary classification method based on LDA and Bert
CN111460817A (en) Method and system for recommending criminal legal document related law provision
CN115759095A (en) Named entity recognition method and device for tobacco plant diseases and insect pests
CN115713970A (en) Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network
CN115827871A (en) Internet enterprise classification method, device and system
CN115098707A (en) Cross-modal Hash retrieval method and system based on zero sample learning
CN112651590B (en) Instruction processing flow recommending method
CN115129890A (en) Feedback data map generation method and generation device, question answering device and refrigerator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201120)