CN111694956A - Feature expansion-based bert power grid defect text classification method - Google Patents
- Publication number
- CN111694956A (application CN202010430653.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
The invention discloses a feature-expansion-based bert method for classifying power grid defect texts, belonging to the field of machine-learning data processing, and in particular to error-data identification. The method preprocesses the historical defect information of the power grid with k-means clustering, applies a feature-expansion method to reduce the dimensionality of and expand the defect text features, and then classifies the defect texts with a bert model on the basis of the expanded features. This replaces the original manual classification of defect data and achieves fast, accurate classification.
Description
Technical Field
The invention belongs to the field of machine-learning data processing, and in particular to error-data identification.
Background
With the construction of smart, informatized power grids, grid enterprises have accumulated large amounts of data, gradually forming the electric power big data now of common concern to academia and industry. Research in the power field has mainly focused on structured data mining, with some attention to image recognition, while power text mining has only just begun. The defect descriptions recorded by large numbers of workers contain much of value, yet organizing these data manually requires specialists and is time- and labor-consuming. Some research on power grid text exists, but there are no published results on classifying defect texts.
Disclosure of Invention
Aiming at the deficiencies in the background art, the invention solves the problems that, in the prior art, power text information is processed slowly and defect data cannot be identified.
The technical scheme of the invention is a bert power grid defect text classification method based on feature expansion, which comprises the following steps:
step 1: preprocessing data;
step 1.1: reading data in an original csv file, and counting the number of defect types in the data in the original csv file and description characters of each defect type;
step 1.2: classifying all defect data in the original csv file according to defect types;
step 1.3: performing K-means clustering on the defect types according to the description characters of the defect types, and finally clustering into K types;
step 2: extracting characteristics;
step 2.1: extracting the characteristics of all data in each type of clustered defects, wherein the data comprises description characters of the defects and characters of the data;
step 2.2: calculating the dispersion DI_ic(f, C_i) of each extracted feature using the following formula:

$$\mathrm{DI}_{ic}(f, C_i) = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(tf_{ij}(f) - \overline{tf}_i(f)\right)^{2}}$$

wherein n denotes the total number of data items in class C_i; tf_ij(f) denotes the number of occurrences of feature f in the j-th data item of class C_i; and $\overline{tf}_i(f)$ denotes the mean number of occurrences of f over all data items of the class, calculated as:

$$\overline{tf}_i(f) = \frac{1}{n}\sum_{j=1}^{n} tf_{ij}(f)$$
step 2.3: sorting the features by their dispersion DI_ic(f, C_i), setting a dispersion threshold, and combining the features whose dispersion exceeds the threshold;
step 3: performing feature expansion on all short-text data in the defect data;
selecting, from all the features obtained in step 2, the features most similar to the features of the short-text data, and expanding the short text according to them;
step 4: reclassifying all the expanded defect data with a bert model to obtain the classification result.
Further, the specific method of step 3 is as follows:
step 3.1: if a feature of the short-text data can be found directly among the features finally obtained in step 2, the feature is retained;
step 3.2: if a feature is not found among the features finally obtained in step 2, calculating the similarity R(f_i, f_j) between that feature of the short-text data and every feature finally obtained in step 2 using the following formula:

$$R(f_i, f_j) = P(f_i, f_j)\,\log_{2}\!\left(\frac{P(f_i, f_j)}{P(f_i)\,P(f_j)} + 1\right)$$

wherein feature f_i is a feature of the short-text data, feature f_j is a feature obtained in step 2, P(f_i, f_j) denotes the probability that features f_i and f_j occur together in the data set, and P(f_i) and P(f_j) denote the probabilities that f_i and f_j, respectively, occur in all the data;
step 3.3: finding, among all the features finally obtained in step 2, the feature with the greatest similarity, and replacing the feature in the short-text data with it.
The method preprocesses the historical defect information of the power grid with k-means clustering, applies a feature-expansion method to reduce the dimensionality of and expand the defect text features, and classifies the defect texts with a bert model on the basis of the expanded features; it achieves fast and accurate classification.
Drawings
FIG. 1 is a flow chart of data consolidation.
FIG. 2 is a flow chart of k-means clustering.
FIG. 3 is a flow chart of bert text classification.
Detailed Description
1. Data pre-processing
1) First, the data are read; since the original file is stored as a csv file, it cannot be used directly.
2) The number of defect types in the csv file is counted, and the defect type names are stored in a txt file, one name per line.
3) The txt file holding the defect type names is then read, and its contents are stored in a list.
4) One defect type is read from the list. All defect types in the csv file are traversed; if a record's defect type matches the one from the list, the corresponding defect description is stored, in txt format, in a folder named after that defect type.
5) All defect types in the list are traversed, repeating operation 4).
6) The sorted defect description files are read and the data are clustered with the k-means algorithm. The cluster number k is first set to 2 and the silhouette coefficient at that value is computed; k is then set to each value from 3 to 227 in turn and the silhouette coefficient at each value is computed in sequence. The k with the best silhouette coefficient is selected as the optimal number of categories.
7) The optimal cluster number so obtained is used as the K value of the k-means algorithm, and the clustering is carried out.
8) Finally, the clustered data are stored in the corresponding folders.
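The clustering procedure of steps 6)-8) can be sketched as follows. The TF-IDF vectorization and the scikit-learn `KMeans`/`silhouette_score` API are assumptions of this sketch; the patent names only k-means and the silhouette coefficient, not a concrete implementation.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

def cluster_defect_texts(texts, k_min=2, k_max=227):
    """Cluster defect descriptions, choosing k by silhouette coefficient."""
    X = TfidfVectorizer().fit_transform(texts)  # assumed vectorization
    best_k, best_score, best_labels = None, -1.0, None
    # try each candidate k in turn and keep the best silhouette score
    for k in range(k_min, min(k_max, X.shape[0] - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```

The clustered texts would then be written out to one folder per cluster, as in step 8).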
2. Feature extension
1) Main framework for feature expansion
For a given short text set D = {d_i | i = 1, 2, …, n} consisting of n short texts, where d_i denotes a short text, the aim of the invention is to construct a classifier on D. Each d_i = {f_k}, where f_k denotes a term of short text d_i; in a short text the number of terms f_k is generally small. The algorithm divides into two steps: a) selecting from the short text set D the features most discriminative for text classification, forming a feature space F and thereby reducing dimensionality; b) for any short text d_i, selecting from the feature space F features highly similar to a term f_k and using them to expand d_i, increasing the ability of d_i to represent its category. These two steps are described in detail below.
2) Feature space selection
This step selects the features with high category discriminability from the document set to form a feature space F. Feature selection proceeds with two aims: a) the features in each short text with high category-discriminating capability should appear in the feature space F as far as possible, ensuring a direct correlation between each sample and the feature space; b) the selected features should appear in most texts of a text category, so that the feature distribution is balanced and sparseness of the text features is avoided.
Therefore, two factors need to be considered during feature selection: a) the difference in the total number of texts per category; b) the degree of association between the selected feature and the document category. Within a document category, the features that contribute significantly to that category are selected; features with such high category discriminability represent the document category well. The intra-class dispersion DI_ic of a feature represents the distribution of the feature within a class, and is calculated as shown in formula 1:

$$\mathrm{DI}_{ic}(f, C_i) = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(tf_{ij}(f) - \overline{tf}_i(f)\right)^{2}} \qquad (1)$$

wherein n denotes the total number of documents in class C_i; tf_ij(f) denotes the number of occurrences of feature f in the j-th document of class C_i; and $\overline{tf}_i(f)$ denotes the mean number of occurrences of f over all documents of the class, calculated as shown in formula 2:

$$\overline{tf}_i(f) = \frac{1}{n}\sum_{j=1}^{n} tf_{ij}(f) \qquad (2)$$
The smaller the intra-class dispersion of a feature f, the more uniform the distribution of f within the class and the better its category-discriminating capability. Within each class, the features are sorted by DI_ic from large to small, and all non-repeating features are then combined to form the feature space F.
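The feature-space selection can be sketched as follows, assuming the standard-deviation form of formulas 1 and 2 as reconstructed above, and a hypothetical top-m cut per class in place of the unspecified dispersion threshold:

```python
import math

def intra_class_dispersion(docs, feature):
    """Intra-class dispersion DI_ic(f, C_i) of one feature over the
    documents (token lists) of a single class C_i."""
    n = len(docs)
    tf = [doc.count(feature) for doc in docs]   # tf_ij(f)
    mean = sum(tf) / n                          # formula 2: mean term frequency
    return math.sqrt(sum((t - mean) ** 2 for t in tf) / n)  # formula 1

def build_feature_space(classes, top_m=50):
    """Union of each class's top_m features, ranked by dispersion from
    large to small; top_m stands in for the unspecified threshold."""
    F = set()
    for docs in classes:
        vocab = {f for doc in docs for f in doc}
        ranked = sorted(vocab, key=lambda f: intra_class_dispersion(docs, f),
                        reverse=True)
        F.update(ranked[:top_m])
    return F
```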
3) Feature extension
The short text is expanded with the features that have the greatest correlation with the features f_k it contains. The current mainstream way to compute the correlation between features is mutual information, which intuitively reflects the direct correlation between features and categories; however, mutual information is very sensitive to the inaccuracy brought by sparse data, and the sparse-data problem can make the mutual information value between features negative, which is inconvenient for later application and processing. The invention therefore adopts a correlation formula R(f_i, f_j) improved on mutual information, which to a certain extent avoids the problem of binary mutual information formed by low-frequency words exceeding that formed by high-frequency words, and weakens the influence of data sparsity on the correlation between features, as shown in formula 3:

$$R(f_i, f_j) = P(f_i, f_j)\,\log_{2}\!\left(\frac{P(f_i, f_j)}{P(f_i)\,P(f_j)} + 1\right) \qquad (3)$$

wherein P(f_i, f_j) denotes the probability that features f_i and f_j occur together in the data set; P(f_i) denotes the probability that f_i occurs in the data set, and P(f_j) the probability that f_j occurs in the data set.
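The correlation computation can be sketched as follows, assuming formula 3 takes the improved-mutual-information form reconstructed above and estimating the probabilities P(·) as document frequencies over the corpus:

```python
import math

def correlation(f_i, f_j, docs):
    """Improved mutual-information correlation R(f_i, f_j): the +1 inside
    the log keeps R non-negative, and the leading joint probability damps
    pairs of low-frequency words."""
    n = len(docs)
    p_i = sum(f_i in d for d in docs) / n            # P(f_i)
    p_j = sum(f_j in d for d in docs) / n            # P(f_j)
    p_ij = sum((f_i in d) and (f_j in d) for d in docs) / n  # P(f_i, f_j)
    if p_ij == 0.0:
        return 0.0
    return p_ij * math.log2(p_ij / (p_i * p_j) + 1)
```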
The feature extension algorithm is described as follows:

Input: short text set D = (d_1, d_2, …, d_n).
Output: the vector V_i(d_i) of each short text d_i.

for each document d_i = (f_1, f_2, …, f_m) in D:
    select the feature space F
    for each feature f_k of d_i, k = 1, 2, …, m:
        if f_k is in the feature space F:
            retain the word corresponding to f_k
        else:
            compute the mutual information value R(f_k, f′) between f_k and every feature f′ in F
            select the feature f′ with the maximum mutual information
            retain the word corresponding to f′
return the document features
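The extension loop above can be sketched as follows; the correlation function repeats the assumed form of formula 3 so that the sketch is self-contained:

```python
import math

def correlation(f_i, f_j, docs):
    # improved-MI correlation (assumed form of formula 3)
    n = len(docs)
    p_i = sum(f_i in d for d in docs) / n
    p_j = sum(f_j in d for d in docs) / n
    p_ij = sum((f_i in d) and (f_j in d) for d in docs) / n
    return 0.0 if p_ij == 0.0 else p_ij * math.log2(p_ij / (p_i * p_j) + 1)

def extend_short_text(doc, F, docs):
    """Keep terms already in the feature space F; replace each
    out-of-space term with its most correlated feature in F."""
    out = []
    for f_k in doc:
        if f_k in F:
            out.append(f_k)    # retain the word itself
        else:
            best = max(F, key=lambda fp: correlation(f_k, fp, docs))
            out.append(best)   # replace with the closest feature f'
    return out
```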
3. Text classification
The power grid defect texts are classified with a bert model on the basis of the features extracted in step 2. The classification accuracy is as follows:

precision | recall | F1-score
---|---|---
0.99 | 0.99 | 0.98
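The classification step can be sketched as follows, assuming the HuggingFace transformers library and the `bert-base-chinese` checkpoint; the model name, `num_labels`, and the `join_features` helper are illustrative assumptions, and in practice the model would first be fine-tuned on the labeled defect texts:

```python
def join_features(feature_tokens):
    """bert takes raw text, so re-join the expanded feature tokens."""
    return "".join(feature_tokens)

def classify_defects(texts, model_name="bert-base-chinese", num_labels=10):
    """Label expanded defect texts with a BERT sequence classifier."""
    import torch
    from transformers import BertForSequenceClassification, BertTokenizer
    tok = BertTokenizer.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels)
    model.eval()
    enc = tok(texts, padding=True, truncation=True, max_length=64,
              return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return logits.argmax(dim=-1).tolist()  # one predicted class per text
```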
Claims (2)
1. A bert power grid defect text classification method based on feature expansion, comprising the following steps:
step 1: preprocessing data;
step 1.1: reading data in an original csv file, and counting the number of defect types in the data in the original csv file and description characters of each defect type;
step 1.2: classifying all defect data in the original csv file according to defect types;
step 1.3: performing K-means clustering on the defect types according to the description characters of the defect types, and finally clustering into K types;
step 2: extracting characteristics;
step 2.1: extracting the characteristics of all data in each type of clustered defects, wherein the data comprises description characters of the defects and characters of the data;
step 2.2: calculating the dispersion DI_ic(f, C_i) of each extracted feature using the following formula:

$$\mathrm{DI}_{ic}(f, C_i) = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(tf_{ij}(f) - \overline{tf}_i(f)\right)^{2}}$$

wherein n denotes the total number of data items in class C_i; tf_ij(f) denotes the number of occurrences of feature f in the j-th data item of class C_i; and $\overline{tf}_i(f)$ denotes the mean number of occurrences of f over all data items of the class, calculated as:

$$\overline{tf}_i(f) = \frac{1}{n}\sum_{j=1}^{n} tf_{ij}(f)$$
step 2.3: sorting the features by their dispersion DI_ic(f, C_i), setting a dispersion threshold, and combining the features whose dispersion exceeds the threshold;
step 3: performing feature expansion on all short-text data in the defect data;
selecting, from all the features obtained in step 2, the features most similar to the features of the short-text data, and expanding the short text according to them;
step 4: reclassifying all the expanded defect data with a bert model to obtain the classification result.
2. The feature expansion-based bert power grid defect text classification method as claimed in claim 1, wherein the specific method in the step 3 is as follows:
step 3.1: if a feature of the short-text data can be found directly among the features finally obtained in step 2, the feature is retained;
step 3.2: if a feature is not found among the features finally obtained in step 2, calculating the similarity R(f_i, f_j) between that feature of the short-text data and every feature finally obtained in step 2 using the following formula:

$$R(f_i, f_j) = P(f_i, f_j)\,\log_{2}\!\left(\frac{P(f_i, f_j)}{P(f_i)\,P(f_j)} + 1\right)$$

wherein feature f_i is a feature of the short-text data, feature f_j is a feature obtained in step 2, P(f_i, f_j) denotes the probability that features f_i and f_j occur together in the data set, and P(f_i) and P(f_j) denote the probabilities that f_i and f_j, respectively, occur in all the data;
step 3.3: finding, among all the features finally obtained in step 2, the feature with the greatest similarity, and replacing the feature in the short-text data with it.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010430653.9A CN111694956A (en) | 2020-05-20 | 2020-05-20 | Feature expansion-based bert power grid defect text classification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111694956A true CN111694956A (en) | 2020-09-22 |
Family
ID=72477999
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010430653.9A Pending CN111694956A (en) | 2020-05-20 | 2020-05-20 | Feature expansion-based bert power grid defect text classification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111694956A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108763348A (en) * | 2018-05-15 | 2018-11-06 | 南京邮电大学 | A kind of classification improved method of extension short text word feature vector |
Non-Patent Citations (4)
Title |
---|
REN Ying: "Research on Automatic Classification of Customer-Service Work Orders Based on the Pre-trained BERT Model", Yunnan Electric Power Technology * |
LI Wenchao et al.: "A New Clustering Algorithm Based on Hierarchical and K-means Methods", Proceedings of the 26th Chinese Control Conference * |
YANG Chaoqun: "Research on Short Text Classification Based on Its Own Features", China Master's Theses Full-text Database, Information Science and Technology * |
XIE Le: "Research on Defect Texts of Capacitive Equipment Based on Data Mining", China Master's Theses Full-text Database, Information Science and Technology * |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200922 |