CN111694956A - Feature expansion-based bert power grid defect text classification method - Google Patents
- Publication number
- CN111694956A (application CN202010430653.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
The invention discloses a feature-expansion-based bert method for classifying power grid defect texts, belonging to the field of machine-learning data processing, and in particular to error-data identification. The method preprocesses the historical defect information of the power grid with k-means clustering, applies a feature-expansion method to reduce the dimensionality of and expand the defect text features, and then classifies the defect texts with a bert model on the basis of the expanded features. This replaces the original manual classification of defect data and achieves fast, accurate classification.
Description
Technical Field
The invention belongs to the field of machine-learning data processing, and in particular to error-data identification.
Background
With the construction of smart, informatized power grids, grid enterprises have accumulated large amounts of data, gradually forming the electric power big data now of common concern to academia and industry. Research in the power field has mainly focused on structured data mining, with some attention to image recognition, while power text mining has only just begun. The defect descriptions recorded by large numbers of workers contain much of value, yet organizing these data manually requires specialists and is time- and labor-consuming. Some research on power grid text exists, but there are no published results on classifying defect texts.
Disclosure of Invention
Aiming at the deficiencies in the background art, the invention solves the problems that, in the prior art, power text information is processed slowly and defect data cannot be identified.
The technical scheme of the invention is a bert power grid defect text classification method based on feature expansion, which comprises the following steps:
step 1: preprocessing data;
step 1.1: reading data in an original csv file, and counting the number of defect types in the data in the original csv file and description characters of each defect type;
step 1.2: classifying all defect data in the original csv file according to defect types;
step 1.3: performing K-means clustering on the defect types according to the description characters of the defect types, and finally clustering into K types;
step 2: extracting characteristics;
step 2.1: extracting the characteristics of all data in each type of clustered defects, wherein the data comprises description characters of the defects and characters of the data;
step 2.2: calculating the dispersion DI_ic(f, C_i) of each extracted feature using the following formula:

$$\mathrm{DI}_{ic}(f, C_i) = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(tf_{ij}(f) - \overline{tf}_i(f)\right)^{2}}$$

wherein n denotes the total number of data items in class C_i; tf_ij(f) denotes the number of occurrences of feature f in the j-th data item of class C_i; and $\overline{tf}_i(f)$ denotes the mean number of occurrences of f over all data items of the class, calculated as:

$$\overline{tf}_i(f) = \frac{1}{n}\sum_{j=1}^{n} tf_{ij}(f)$$
step 2.3: sorting the features by their dispersion DI_ic(f, C_i), setting a dispersion threshold, and combining the features whose dispersion exceeds the threshold;
step 3: performing feature expansion on all short-text data in the defect data;
selecting, from all the features obtained in step 2, the features most similar to the features of the short-text data, and expanding the short text according to them;
step 4: reclassifying all the expanded defect data with a bert model to obtain the classification result.
Further, the specific method of step 3 is as follows:
step 3.1: if a feature of the short-text data can be found directly among the features finally obtained in step 2, the feature is retained;
step 3.2: if a feature is not found among the features finally obtained in step 2, calculating the similarity R(f_i, f_j) between that feature of the short-text data and every feature finally obtained in step 2 using the following formula:

$$R(f_i, f_j) = P(f_i, f_j)\,\log_{2}\!\left(\frac{P(f_i, f_j)}{P(f_i)\,P(f_j)} + 1\right)$$

wherein feature f_i is a feature of the short-text data, feature f_j is a feature obtained in step 2, P(f_i, f_j) denotes the probability that features f_i and f_j occur together in the data set, and P(f_i) and P(f_j) denote the probabilities that f_i and f_j, respectively, occur in all the data;
step 3.3: finding, among all the features finally obtained in step 2, the feature with the greatest similarity, and replacing the feature in the short-text data with it.
The method preprocesses the historical defect information of the power grid with k-means clustering, applies a feature-expansion method to reduce the dimensionality of and expand the defect text features, and classifies the defect texts with a bert model on the basis of the expanded features; it achieves fast and accurate classification.
Drawings
FIG. 1 is a flow chart of data consolidation.
FIG. 2 is a flow chart of k-means clustering.
FIG. 3 is a flow chart of bert text classification.
Detailed Description
1. Data pre-processing
1) First, the data are read; since the original file is stored as a csv file, it cannot be used directly.
2) The number of defect types in the csv file is counted, and the defect type names are stored in a txt file, one name per line.
3) The txt file holding the defect type names is then read, and its contents are stored in a list.
4) One defect type is read from the list. All defect types in the csv file are traversed; if a record's defect type matches the one from the list, the corresponding defect description is stored, in txt format, in a folder named after that defect type.
5) All defect types in the list are traversed, repeating operation 4).
6) The sorted defect description files are read and the data are clustered with the k-means algorithm. The cluster number k is first set to 2 and the silhouette coefficient at that value is computed; k is then set to each value from 3 to 227 in turn and the silhouette coefficient at each value is computed in sequence. The k with the best silhouette coefficient is selected as the optimal number of categories.
7) The optimal cluster number so obtained is used as the K value of the k-means algorithm, and the clustering is carried out.
8) Finally, the clustered data are stored in the corresponding folders.
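The clustering procedure of steps 6)-8) can be sketched as follows. The TF-IDF vectorization and the scikit-learn `KMeans`/`silhouette_score` API are assumptions of this sketch; the patent names only k-means and the silhouette coefficient, not a concrete implementation.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

def cluster_defect_texts(texts, k_min=2, k_max=227):
    """Cluster defect descriptions, choosing k by silhouette coefficient."""
    X = TfidfVectorizer().fit_transform(texts)  # assumed vectorization
    best_k, best_score, best_labels = None, -1.0, None
    # try each candidate k in turn and keep the best silhouette score
    for k in range(k_min, min(k_max, X.shape[0] - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```

The clustered texts would then be written out to one folder per cluster, as in step 8).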
2. Feature extension
1) Main framework for feature expansion
For a given short text set D = {d_i | i = 1, 2, …, n} consisting of n short texts, where d_i denotes a short text, the aim of the invention is to construct a classifier on D. Each d_i = {f_k}, where f_k denotes a term of short text d_i; in a short text the number of terms f_k is generally small. The algorithm divides into two steps: a) selecting from the short text set D the features most discriminative for text classification, forming a feature space F and thereby reducing dimensionality; b) for any short text d_i, selecting from the feature space F features highly similar to a term f_k and using them to expand d_i, increasing the ability of d_i to represent its category. These two steps are described in detail below.
2) Feature space selection
This step selects the features with high category discriminability from the document set to form a feature space F. Feature selection proceeds with two aims: a) the features in each short text with high category-discriminating capability should appear in the feature space F as far as possible, ensuring a direct correlation between each sample and the feature space; b) the selected features should appear in most texts of a text category, so that the feature distribution is balanced and sparseness of the text features is avoided.
Therefore, two factors need to be considered during feature selection: a) the difference in the total number of texts per category; b) the degree of association between the selected feature and the document category. Within a document category, the features that contribute significantly to that category are selected; features with such high category discriminability represent the document category well. The intra-class dispersion DI_ic of a feature represents the distribution of the feature within a class, and is calculated as shown in formula 1:

$$\mathrm{DI}_{ic}(f, C_i) = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(tf_{ij}(f) - \overline{tf}_i(f)\right)^{2}} \qquad (1)$$

wherein n denotes the total number of documents in class C_i; tf_ij(f) denotes the number of occurrences of feature f in the j-th document of class C_i; and $\overline{tf}_i(f)$ denotes the mean number of occurrences of f over all documents of the class, calculated as shown in formula 2:

$$\overline{tf}_i(f) = \frac{1}{n}\sum_{j=1}^{n} tf_{ij}(f) \qquad (2)$$
The smaller the intra-class dispersion of a feature f, the more uniform the distribution of f within the class and the better its category-discriminating capability. Within each class, the features are sorted by DI_ic from large to small, and all non-repeating features are then combined to form the feature space F.
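The feature-space selection can be sketched as follows, assuming the standard-deviation form of formulas 1 and 2 as reconstructed above, and a hypothetical top-m cut per class in place of the unspecified dispersion threshold:

```python
import math

def intra_class_dispersion(docs, feature):
    """Intra-class dispersion DI_ic(f, C_i) of one feature over the
    documents (token lists) of a single class C_i."""
    n = len(docs)
    tf = [doc.count(feature) for doc in docs]   # tf_ij(f)
    mean = sum(tf) / n                          # formula 2: mean term frequency
    return math.sqrt(sum((t - mean) ** 2 for t in tf) / n)  # formula 1

def build_feature_space(classes, top_m=50):
    """Union of each class's top_m features, ranked by dispersion from
    large to small; top_m stands in for the unspecified threshold."""
    F = set()
    for docs in classes:
        vocab = {f for doc in docs for f in doc}
        ranked = sorted(vocab, key=lambda f: intra_class_dispersion(docs, f),
                        reverse=True)
        F.update(ranked[:top_m])
    return F
```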
3) Feature extension
The short text is expanded with the features that have the greatest correlation with the features f_k it contains. The current mainstream way to compute the correlation between features is mutual information, which intuitively reflects the direct correlation between features and categories; however, mutual information is very sensitive to the inaccuracy brought by sparse data, and the sparse-data problem can make the mutual information value between features negative, which is inconvenient for later application and processing. The invention therefore adopts a correlation formula R(f_i, f_j) improved on mutual information, which to a certain extent avoids the problem of binary mutual information formed by low-frequency words exceeding that formed by high-frequency words, and weakens the influence of data sparsity on the correlation between features, as shown in formula 3:

$$R(f_i, f_j) = P(f_i, f_j)\,\log_{2}\!\left(\frac{P(f_i, f_j)}{P(f_i)\,P(f_j)} + 1\right) \qquad (3)$$

wherein P(f_i, f_j) denotes the probability that features f_i and f_j occur together in the data set; P(f_i) denotes the probability that f_i occurs in the data set, and P(f_j) the probability that f_j occurs in the data set.
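The correlation computation can be sketched as follows, assuming formula 3 takes the improved-mutual-information form reconstructed above and estimating the probabilities P(·) as document frequencies over the corpus:

```python
import math

def correlation(f_i, f_j, docs):
    """Improved mutual-information correlation R(f_i, f_j): the +1 inside
    the log keeps R non-negative, and the leading joint probability damps
    pairs of low-frequency words."""
    n = len(docs)
    p_i = sum(f_i in d for d in docs) / n            # P(f_i)
    p_j = sum(f_j in d for d in docs) / n            # P(f_j)
    p_ij = sum((f_i in d) and (f_j in d) for d in docs) / n  # P(f_i, f_j)
    if p_ij == 0.0:
        return 0.0
    return p_ij * math.log2(p_ij / (p_i * p_j) + 1)
```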
The feature extension algorithm is described as follows:

Input: short text set D = (d_1, d_2, …, d_n).
Output: the vector V_i(d_i) of each short text d_i.

for each document d_i = (f_1, f_2, …, f_m) in D:
    select the feature space F
    for each feature f_k of d_i, k = 1, 2, …, m:
        if f_k is in the feature space F:
            retain the word corresponding to f_k
        else:
            compute the mutual information value R(f_k, f′) between f_k and every feature f′ in F
            select the feature f′ with the maximum mutual information
            retain the word corresponding to f′
return the document features
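The extension loop above can be sketched as follows; the correlation function repeats the assumed form of formula 3 so that the sketch is self-contained:

```python
import math

def correlation(f_i, f_j, docs):
    # improved-MI correlation (assumed form of formula 3)
    n = len(docs)
    p_i = sum(f_i in d for d in docs) / n
    p_j = sum(f_j in d for d in docs) / n
    p_ij = sum((f_i in d) and (f_j in d) for d in docs) / n
    return 0.0 if p_ij == 0.0 else p_ij * math.log2(p_ij / (p_i * p_j) + 1)

def extend_short_text(doc, F, docs):
    """Keep terms already in the feature space F; replace each
    out-of-space term with its most correlated feature in F."""
    out = []
    for f_k in doc:
        if f_k in F:
            out.append(f_k)    # retain the word itself
        else:
            best = max(F, key=lambda fp: correlation(f_k, fp, docs))
            out.append(best)   # replace with the closest feature f'
    return out
```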
3. Text classification
The power grid defect texts are classified with a bert model on the basis of the features extracted in step 2. The classification accuracy is as follows:

precision | recall | F1-score
---|---|---
0.99 | 0.99 | 0.98
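The classification step can be sketched as follows, assuming the HuggingFace transformers library and the `bert-base-chinese` checkpoint; the model name, `num_labels`, and the `join_features` helper are illustrative assumptions, and in practice the model would first be fine-tuned on the labeled defect texts:

```python
def join_features(feature_tokens):
    """bert takes raw text, so re-join the expanded feature tokens."""
    return "".join(feature_tokens)

def classify_defects(texts, model_name="bert-base-chinese", num_labels=10):
    """Label expanded defect texts with a BERT sequence classifier."""
    import torch
    from transformers import BertForSequenceClassification, BertTokenizer
    tok = BertTokenizer.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels)
    model.eval()
    enc = tok(texts, padding=True, truncation=True, max_length=64,
              return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return logits.argmax(dim=-1).tolist()  # one predicted class per text
```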
Claims (2)
1. A bert power grid defect text classification method based on feature expansion, comprising the following steps:
step 1: preprocessing data;
step 1.1: reading data in an original csv file, and counting the number of defect types in the data in the original csv file and description characters of each defect type;
step 1.2: classifying all defect data in the original csv file according to defect types;
step 1.3: performing K-means clustering on the defect types according to the description characters of the defect types, and finally clustering into K types;
step 2: extracting characteristics;
step 2.1: extracting the characteristics of all data in each type of clustered defects, wherein the data comprises description characters of the defects and characters of the data;
step 2.2: calculating the dispersion DI_ic(f, C_i) of each extracted feature using the following formula:

$$\mathrm{DI}_{ic}(f, C_i) = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(tf_{ij}(f) - \overline{tf}_i(f)\right)^{2}}$$

wherein n denotes the total number of data items in class C_i; tf_ij(f) denotes the number of occurrences of feature f in the j-th data item of class C_i; and $\overline{tf}_i(f)$ denotes the mean number of occurrences of f over all data items of the class, calculated as:

$$\overline{tf}_i(f) = \frac{1}{n}\sum_{j=1}^{n} tf_{ij}(f)$$
step 2.3: sorting the features by their dispersion DI_ic(f, C_i), setting a dispersion threshold, and combining the features whose dispersion exceeds the threshold;
step 3: performing feature expansion on all short-text data in the defect data;
selecting, from all the features obtained in step 2, the features most similar to the features of the short-text data, and expanding the short text according to them;
step 4: reclassifying all the expanded defect data with a bert model to obtain the classification result.
2. The feature expansion-based bert power grid defect text classification method as claimed in claim 1, wherein the specific method in the step 3 is as follows:
step 3.1: if a feature of the short-text data can be found directly among the features finally obtained in step 2, the feature is retained;
step 3.2: if a feature is not found among the features finally obtained in step 2, calculating the similarity R(f_i, f_j) between that feature of the short-text data and every feature finally obtained in step 2 using the following formula:

$$R(f_i, f_j) = P(f_i, f_j)\,\log_{2}\!\left(\frac{P(f_i, f_j)}{P(f_i)\,P(f_j)} + 1\right)$$

wherein feature f_i is a feature of the short-text data, feature f_j is a feature obtained in step 2, P(f_i, f_j) denotes the probability that features f_i and f_j occur together in the data set, and P(f_i) and P(f_j) denote the probabilities that f_i and f_j, respectively, occur in all the data;
step 3.3: finding, among all the features finally obtained in step 2, the feature with the greatest similarity, and replacing the feature in the short-text data with it.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010430653.9A CN111694956A (en) | 2020-05-20 | 2020-05-20 | Feature expansion-based bert power grid defect text classification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111694956A true CN111694956A (en) | 2020-09-22 |
Family
ID=72477999
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010430653.9A Pending CN111694956A (en) | 2020-05-20 | 2020-05-20 | Feature expansion-based bert power grid defect text classification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111694956A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108763348A (en) * | 2018-05-15 | 2018-11-06 | 南京邮电大学 | A kind of classification improved method of extension short text word feature vector |
Non-Patent Citations (4)
Title |
---|
REN Ying: "Research on Automatic Classification of Customer-Service Work Orders Based on the Pre-trained BERT Model", Yunnan Electric Power Technology * |
LI Wenchao et al.: "A New Clustering Algorithm Based on Hierarchical and K-means Methods", Proceedings of the 26th Chinese Control Conference * |
YANG Chaoqun: "Research on Short Text Classification Based on Its Own Features", China Master's Theses Full-text Database, Information Science and Technology * |
XIE Le: "Research on Defect Texts of Capacitive Equipment Based on Data Mining", China Master's Theses Full-text Database, Information Science and Technology * |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200922 |