CN111694956A - Feature-expansion-based BERT power grid defect text classification method - Google Patents


Info

Publication number
CN111694956A
Authority
CN
China
Prior art keywords
feature
data
features
defect
short text
Prior art date
Legal status
Pending
Application number
CN202010430653.9A
Other languages
Chinese (zh)
Inventor
郑泽忠
牟范
谢乐
杨宇霆
江邵斌
侯安锴
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date: 2020-05-20
Filing date: 2020-05-20
Publication date: 2020-09-22
Application filed by University of Electronic Science and Technology of China

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques with fixed number of clusters, e.g. K-means clustering


Abstract

The invention discloses a feature-expansion-based BERT method for classifying power grid defect texts, belonging to the field of machine-learning data processing and, in particular, to the identification of erroneous data. The method uses historical power grid defect information, which it preprocesses with k-means clustering. A feature expansion method then reduces the dimensionality of the defect text features and expands them. On the basis of the expanded features, the defect texts are classified with a BERT model. This replaces the original practice of classifying defect data manually and achieves fast, accurate classification.

Description

Feature-expansion-based BERT power grid defect text classification method
Technical Field
The invention belongs to the field of machine-learning data processing, and in particular to the identification of erroneous data.
Background
With the construction of intelligent and informatized power grids, grid enterprises have accumulated a large amount of data, gradually forming the electric power big data of common interest to academia and industry. Research in the electric power field has focused mainly on mining structured data, with some attention to image recognition, but electric power text mining has only just begun. The defect descriptions recorded by large numbers of field workers contain much of value, yet organizing these data manually requires domain experts and is time- and labor-intensive. Some research on power grid text exists, but there are as yet no published results on classifying defect texts.
Disclosure of Invention
To address the shortcomings described in the Background section, the invention solves two problems of the prior art: electric power text information is processed slowly, and defect data cannot be identified automatically.
The technical scheme of the invention is a feature-expansion-based BERT power grid defect text classification method, comprising the following steps:
Step 1: preprocess the data;
Step 1.1: read the data in the original csv file and count the number of defect types and the descriptive text of each defect type;
Step 1.2: classify all defect data in the original csv file by defect type;
Step 1.3: perform K-means clustering on the defect types according to their descriptive text, finally clustering them into K classes;
Step 2: extract features;
Step 2.1: extract features from all the data in each clustered defect class, the data comprising the descriptive text of the defects and the text of each record;
Step 2.2: calculate the intra-class dispersion DI_ic(f, C_i) of each extracted feature with the following formula:

$$DI_{ic}(f, C_i) = \frac{\sqrt{\frac{1}{n}\sum_{j=1}^{n}\bigl(tf_{ij}(f) - \overline{tf_i}(f)\bigr)^{2}}}{\overline{tf_i}(f)}$$

where n is the total number of data items in class C_i, tf_ij(f) is the number of occurrences of feature f in the j-th data item of class C_i, and \overline{tf_i}(f) is the mean number of occurrences of f over all data items of the class, computed as

$$\overline{tf_i}(f) = \frac{1}{n}\sum_{j=1}^{n} tf_{ij}(f)$$

Step 2.3: sort the features by their dispersion DI_ic(f, C_i), set a dispersion threshold, and combine the features whose dispersion exceeds the threshold;
Step 3: perform feature expansion on all short text data in the defect data: from all the features obtained in step 2, select the features most similar to those of the short text data and expand the short texts with them;
Step 4: reclassify all the expanded defect data with a BERT model to obtain the classification result.
Further, the specific method of step 3 is as follows:
Step 3.1: if a feature of the short text data is found directly among the features obtained in step 2, retain it;
Step 3.2: if a feature is not found among the features obtained in step 2, compute the similarity R(f_i, f_j) between that feature and every feature obtained in step 2 with the following formula:

$$R(f_i, f_j) = P(f_i, f_j)\,\log_2\frac{P(f_i, f_j)}{P(f_i)\,P(f_j)}$$

where feature f_i is a feature of the short text data, feature f_j is a feature obtained in step 2, P(f_i, f_j) is the probability that features f_i and f_j occur together in the data set, and P(f_i) and P(f_j) are the probabilities that f_i and f_j, respectively, occur in all the data;
Step 3.3: among all the features obtained in step 2, find the feature with the greatest similarity and replace the feature in the short text data with it.
The method uses the historical defect information of the power grid, preprocessing it with k-means clustering. A feature expansion method reduces the dimensionality of the defect text features and expands them. On the basis of the expanded features, a BERT model classifies the defect texts. The method classifies quickly and accurately.
Drawings
FIG. 1 is a flow chart of data consolidation.
FIG. 2 is a flow chart of k-means clustering.
FIG. 3 is a flow chart of BERT text classification.
Detailed Description
1. Data preprocessing
1) First the data are read; the original file is stored as csv and cannot be used directly.
2) The number of defect types in the csv file is counted, and the defect type names are stored in a txt file, one name per line.
3) The txt file holding the defect type names is then read, and the names are stored in a list.
4) One defect type is read from the list, and all entries in the csv file are traversed; whenever an entry's defect type matches it, the corresponding defect description is stored, in txt format, in a folder named after that defect type.
5) All defect types in the list are traversed, repeating operation 4) for each.
6) The sorted defect description files are read and the data are clustered with the k-means algorithm. The cluster count k is first set to 2 and the silhouette coefficient at that k is computed; k is then set to 3 through 227 in turn, computing the silhouette coefficient for each value. The k whose silhouette coefficient is largest is selected as the optimal number of categories.
7) The optimal cluster number obtained is used as the K value of the k-means algorithm, and the clustering is performed.
8) Finally, the clustered data are stored in the corresponding folders.
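The k-selection loop in 6) can be sketched as follows. This is a minimal illustration assuming scikit-learn and TF-IDF vectorization of the descriptions; the patent does not specify the vectorization, and the file and column names are placeholders. The sketch keeps the k whose silhouette coefficient is largest, since a larger silhouette indicates better-separated clusters.

```python
# Sketch of steps 6)-8): choose k by silhouette coefficient, then cluster.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

df = pd.read_csv("defects.csv")                         # illustrative file and column names
X = TfidfVectorizer().fit_transform(df["description"])  # vectorization is an assumption

best_k, best_score = 2, -1.0
for k in range(2, 228):                                 # k = 2 .. 227, as in 6)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)                 # silhouette coefficient at this k
    if score > best_score:                              # larger silhouette = better separation
        best_k, best_score = k, score

# 7)-8): cluster with the optimal k; each cluster would then be written to its own folder.
df["cluster"] = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
```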
2. Feature extension
1) Main framework for feature expansion
Given a short text set of n documents, D = {d_i, i = 1, 2, …, n}, where d_i denotes a short text, the invention aims to construct a classifier on D. Each d_i = {f_k}, where f_k denotes a term of the short text d_i; in a short text the number of terms f_k is generally small. The algorithm has two main steps: a) select from the short text set D the features that discriminate best between text categories to form a feature space F, which also reduces the dimensionality; b) for each short text d_i, select from the feature space F the features most similar to the terms f_k of d_i and use them to expand d_i, increasing its ability to represent its category. These two steps are described in detail below.
2) Feature space selection
This step selects features with high category discriminability from the document set to form the feature space F. Feature selection must satisfy two requirements: a) the features in each short text that discriminate well between categories should appear in F as far as possible, ensuring a direct correlation between every sample and the feature space; b) a selected feature should appear in most texts of its category, so that its distribution is balanced and text feature sparsity is avoided.
Two factors must therefore be considered during feature selection: a) the difference in the total number of texts per category; b) the degree of association between a selected feature and its document category. Within a document category, features that contribute significantly to that category are selected, so that the category is represented by features of high category distinctiveness. The intra-class dispersion DI_ic of a feature describes the distribution of the feature within a class; its calculation is given in formula (1).
$$DI_{ic}(f, C_i) = \frac{\sqrt{\frac{1}{n}\sum_{j=1}^{n}\bigl(tf_{ij}(f) - \overline{tf_i}(f)\bigr)^{2}}}{\overline{tf_i}(f)} \qquad (1)$$

where n is the total number of documents in class C_i, tf_ij(f) is the number of occurrences of feature f in the j-th document of class C_i, and \overline{tf_i}(f) is the mean number of occurrences of f over all documents of the class, given by formula (2):

$$\overline{tf_i}(f) = \frac{1}{n}\sum_{j=1}^{n} tf_{ij}(f) \qquad (2)$$
The smaller the intra-class dispersion of a feature f, the more uniformly f is distributed within the class and the better its ability to distinguish the class. For each class the features are sorted by DI_ic from large to small, and all non-repeated features are then combined to form the feature space F.
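A compact sketch of formulas (1) and (2) and of the feature-space construction in step 2.3 follows; the representation of documents as token lists and the threshold value are assumptions, since the patent leaves them unspecified.

```python
# Sketch of intra-class dispersion (formulas 1 and 2) and feature-space selection.
import math

def intra_class_dispersion(feature: str, class_docs: list[list[str]]) -> float:
    """DI_ic(f, C_i): std of per-document term counts, divided by their mean."""
    n = len(class_docs)
    counts = [doc.count(feature) for doc in class_docs]  # tf_ij(f), j = 1..n
    mean = sum(counts) / n                               # formula (2)
    if mean == 0:
        return float("inf")                              # feature absent from the class
    variance = sum((c - mean) ** 2 for c in counts) / n
    return math.sqrt(variance) / mean                    # formula (1)

def select_feature_space(classes: dict[str, list[list[str]]],
                         threshold: float) -> set[str]:
    """Score every term of every class by DI_ic and keep those past the
    threshold (step 2.3); the union of non-repeated features forms F."""
    F: set[str] = set()
    for docs in classes.values():
        vocab = {term for doc in docs for term in doc}
        F.update(t for t in vocab
                 if intra_class_dispersion(t, docs) > threshold)
    return F
```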
3) Feature extension
Each short text is expanded with the features most strongly correlated with the features f_k it contains. The usual way to compute the correlation between features is mutual information, which directly reflects the association between features and categories; mutual information, however, is very sensitive to the inaccuracy introduced by sparse data, and sparsity can make the mutual information between features negative, which is inconvenient for later processing. The invention therefore adopts a correlation formula R(f_i, f_j) based on an improvement of mutual information. To a certain extent it avoids the problem that the binary mutual information of low-frequency words exceeds that of high-frequency words, weakening the influence of data sparsity on the correlation between features; see formula (3).
$$R(f_i, f_j) = P(f_i, f_j)\,\log_2\frac{P(f_i, f_j)}{P(f_i)\,P(f_j)} \qquad (3)$$

where P(f_i, f_j) is the probability that features f_i and f_j occur together in the data set, and P(f_i) and P(f_j) are the probabilities that f_i and f_j, respectively, occur in the data set.
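Formula (3) can be estimated from document-level co-occurrence counts. A minimal sketch follows, assuming maximum-likelihood probability estimates over the corpus (the patent does not fix how the probabilities are estimated).

```python
# Sketch of the improved mutual-information correlation, formula (3).
import math

def correlation(fi: str, fj: str, docs: list[set[str]]) -> float:
    """R(f_i, f_j) = P(f_i, f_j) * log2(P(f_i, f_j) / (P(f_i) * P(f_j)))."""
    n = len(docs)
    p_i = sum(fi in d for d in docs) / n                    # P(f_i)
    p_j = sum(fj in d for d in docs) / n                    # P(f_j)
    p_ij = sum((fi in d) and (fj in d) for d in docs) / n   # P(f_i, f_j)
    if p_ij == 0.0:
        return 0.0  # the features never co-occur: no correlation
    return p_ij * math.log2(p_ij / (p_i * p_j))
```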
The feature extension algorithm is described as follows:
Input: short text set D = (d_1, d_2, …, d_n).
Output: the vector V_i(d_i) of each short text d_i.

Select the feature space F
For each document d_i = (f_1, f_2, …, f_m) in D:
    traverse each feature f_k of d_i, k = 1, 2, …, m:
        if f_k is in the feature space F:
            retain the word corresponding to f_k
        if f_k is not in the feature space F:
            compute the mutual information value R(f_k, f) between f_k and each feature f in F,
            select the feature f' with the maximum mutual information,
            and retain the word corresponding to f'
The loop executes n times, once per document
Return the document features
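The algorithm above translates into the following sketch; the function names are illustrative, and `similarity` stands for any implementation of R(f_i, f_j), such as the `correlation` sketch above.

```python
# Runnable sketch of the feature extension algorithm.
from typing import Callable

def expand_document(doc: list[str], F: set[str],
                    similarity: Callable[[str, str], float]) -> list[str]:
    """Map every term of `doc` into the feature space F."""
    expanded = []
    for f_k in doc:
        if f_k in F:
            expanded.append(f_k)  # f_k already in F: retain its word
        else:
            # f_k not in F: retain the F-feature f' with maximal R(f_k, f)
            best = max(F, key=lambda f: similarity(f_k, f))
            expanded.append(best)
    return expanded  # in the patent's notation, the vector V_i(d_i)

# Illustrative usage, reusing the corpus-level `correlation` sketch:
# V_i = expand_document(doc, F, lambda a, b: correlation(a, b, corpus_docs))
```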
3. Text classification
The power grid defect texts are classified with a BERT model on the basis of the features extracted in step 2. The classification accuracy is as follows:

precision  recall  F1-score
  0.99      0.99     0.98
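The patent names neither the BERT variant nor the training setup. A minimal fine-tuning sketch follows, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint; these, the hyperparameters, and the toy batch are all assumptions. In practice the expanded texts from step 3 and their defect-type labels would form the training batches.

```python
# Minimal BERT fine-tuning sketch for the classification step.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

NUM_DEFECT_CLASSES = 10  # illustrative; use the real number of defect types
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=NUM_DEFECT_CLASSES)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# `train_batches` yields (expanded defect texts, integer type labels);
# the two examples below are purely illustrative.
train_batches = [(["变压器渗油", "绝缘子破损"], [0, 1])]

model.train()
for texts, labels in train_batches:
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    out = model(**enc, labels=torch.tensor(labels))
    out.loss.backward()        # cross-entropy over the defect classes
    optimizer.step()
    optimizer.zero_grad()
```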

Claims (2)

1. A feature-expansion-based BERT power grid defect text classification method, comprising the following steps:
Step 1: preprocess the data;
Step 1.1: read the data in the original csv file and count the number of defect types and the descriptive text of each defect type;
Step 1.2: classify all defect data in the original csv file by defect type;
Step 1.3: perform K-means clustering on the defect types according to their descriptive text, finally clustering them into K classes;
Step 2: extract features;
Step 2.1: extract features from all the data in each clustered defect class, the data comprising the descriptive text of the defects and the text of each record;
Step 2.2: calculate the intra-class dispersion DI_ic(f, C_i) of each extracted feature with the following formula:

$$DI_{ic}(f, C_i) = \frac{\sqrt{\frac{1}{n}\sum_{j=1}^{n}\bigl(tf_{ij}(f) - \overline{tf_i}(f)\bigr)^{2}}}{\overline{tf_i}(f)}$$

where n is the total number of data items in class C_i, tf_ij(f) is the number of occurrences of feature f in the j-th data item of class C_i, and \overline{tf_i}(f) is the mean number of occurrences of f over all data items of the class, computed as

$$\overline{tf_i}(f) = \frac{1}{n}\sum_{j=1}^{n} tf_{ij}(f)$$

Step 2.3: sort the features by their dispersion DI_ic(f, C_i), set a dispersion threshold, and combine the features whose dispersion exceeds the threshold;
Step 3: perform feature expansion on all short text data in the defect data: from all the features obtained in step 2, select the features most similar to those of the short text data and expand the short texts with them;
Step 4: reclassify all the expanded defect data with a BERT model to obtain the classification result.
2. The feature-expansion-based BERT power grid defect text classification method of claim 1, wherein the specific method of step 3 is:
Step 3.1: if a feature of the short text data is found directly among the features obtained in step 2, retain it;
Step 3.2: if a feature is not found among the features obtained in step 2, compute the similarity R(f_i, f_j) between that feature and every feature obtained in step 2 with the following formula:

$$R(f_i, f_j) = P(f_i, f_j)\,\log_2\frac{P(f_i, f_j)}{P(f_i)\,P(f_j)}$$

where feature f_i is a feature of the short text data, feature f_j is a feature obtained in step 2, P(f_i, f_j) is the probability that features f_i and f_j occur together in the data set, and P(f_i) and P(f_j) are the probabilities that f_i and f_j, respectively, occur in all the data;
Step 3.3: among all the features obtained in step 2, find the feature with the greatest similarity and replace the feature in the short text data with it.
CN202010430653.9A (priority date 2020-05-20, filing date 2020-05-20): Feature-expansion-based BERT power grid defect text classification method, pending (en).

Priority Applications (1)

Application Number: CN202010430653.9A
Priority Date: 2020-05-20, Filing Date: 2020-05-20
Title: Feature-expansion-based BERT power grid defect text classification method


Publications (1)

Publication Number: CN111694956A, Publication Date: 2020-09-22



Patent Citations (1)

* Cited by examiner, † Cited by third party
CN108763348A * (priority 2018-05-15, published 2018-11-06; assignee: Nanjing University of Posts and Telecommunications): An improved classification method for extended short-text word feature vectors

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Ren Ying (任莹): "Research on automatic classification of customer-service work orders based on a pretrained BERT model", Yunnan Electric Power Technology *
Li Wenchao (李文超) et al.: "A new clustering algorithm based on hierarchical and K-means methods", Proceedings of the 26th Chinese Control Conference *
Yang Chaoqun (杨超群): "Research on short text classification based on intrinsic features", China Master's Theses Full-text Database, Information Science and Technology *
Xie Le (谢乐): "Research on defect texts of capacitive equipment based on data mining", China Master's Theses Full-text Database, Information Science and Technology *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200922)