CN115357719B - Power audit text classification method and device based on improved BERT model - Google Patents

Power audit text classification method and device based on improved BERT model

Info

Publication number
CN115357719B
Authority
CN
China
Prior art keywords
training
text
epat
model
bert model
Prior art date
Legal status
Active
Application number
CN202211283079.4A
Other languages
Chinese (zh)
Other versions
CN115357719A (en)
Inventor
孟庆霖
穆健
戴斐斐
赵宝国
王霞
崔霞
宋岩
葛晓舰
吕元旭
赵战云
唐厚燕
王瑞
许良
徐业朝
徐晓萱
马剑
李常春
郭保伟
李婧
Current Assignee
Tianjin Chengxi Guangyuan Power Engineering Co ltd
Tianjin Ninghe District Ningdong Shengyuan Power Engineering Co ltd
Tianjin Tianyuan Electric Power Engineering Co ltd
State Grid Tianjin Electric Power Co Training Center
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Original Assignee
Tianjin Chengxi Guangyuan Power Engineering Co ltd
Tianjin Ninghe District Ningdong Shengyuan Power Engineering Co ltd
Tianjin Tianyuan Electric Power Engineering Co ltd
State Grid Tianjin Electric Power Co Training Center
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by Tianjin Chengxi Guangyuan Power Engineering Co ltd, Tianjin Ninghe District Ningdong Shengyuan Power Engineering Co ltd, Tianjin Tianyuan Electric Power Engineering Co ltd, State Grid Tianjin Electric Power Co Training Center, State Grid Corp of China SGCC, State Grid Tianjin Electric Power Co Ltd filed Critical Tianjin Chengxi Guangyuan Power Engineering Co ltd
Priority to CN202211283079.4A priority Critical patent/CN115357719B/en
Publication of CN115357719A publication Critical patent/CN115357719A/en
Application granted granted Critical
Publication of CN115357719B publication Critical patent/CN115357719B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Economics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Water Supply & Treatment (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a device for classifying power audit texts based on an improved BERT model, wherein the classification method comprises the following steps: acquiring a power text; constructing an EPAT-BERT model; inputting the power text into the EPAT-BERT model for pre-training to obtain a pre-trained EPAT-BERT model, the pre-training comprising word granularity mask language model training and entity granularity mask language model training performed respectively; fine-tuning the pre-trained EPAT-BERT model and then performing performance evaluation to determine a text classification EPAT-BERT model; and inputting the power audit text to be classified into the text classification EPAT-BERT model and outputting a category label of the power audit text. The two pre-training tasks provided by the invention use large-scale power text as the training corpus, so that the model grasps the lexicon, grammar and related knowledge in power texts and efficient automatic classification of power audit texts is achieved.

Description

Power audit text classification method and device based on improved BERT model
Technical Field
The invention belongs to the technical field of Natural Language Processing (NLP), and particularly relates to a method and a device for classifying electric power audit texts based on an improved BERT model.
Background
With the development of information technology, text classification techniques based on machine learning and neural networks, such as word2vec, RNN and LSTM, have been proposed in succession.
In recent years, the "pre-training + fine-tuning" paradigm has gradually become the latest research direction in text classification and achieves better results than earlier fully supervised neural models. However, existing pre-trained models are pre-trained on general corpora and do not use texts related to the power field, especially the power auditing field.
Power enterprise audit texts are short, domain-specific texts with distinct industry characteristics, such as high similarity between texts and fuzzy classification boundaries, and they differ from general language. Existing text classification models, applied directly, cannot account for these domain characteristics of power audit texts, so further designing a model adapted to these characteristics is an important problem to be solved.
Disclosure of Invention
Aiming at the problems, the invention provides a method and a device for classifying power audit texts based on an improved BERT model, and the specific technical scheme is as follows:
a power audit text classification method based on an improved BERT model comprises the following steps:
acquiring a power text;
constructing an EPAT-BERT model, wherein the EPAT-BERT model comprises a word granularity mask language model and an entity granularity mask language model;
inputting the power text into an EPAT-BERT model for pre-training to obtain a pre-trained EPAT-BERT model; the pre-training comprises respectively performing word granularity mask language model training and entity granularity mask language model training;
fine-tuning the pre-trained EPAT-BERT model, then carrying out performance evaluation, and determining a text classification EPAT-BERT model;
and inputting the power audit text to be classified into a text classification EPAT-BERT model, and outputting a class label of the power audit text.
Further, the obtaining of the power text specifically includes:
arranging professional vocabularies in the electric power field into a vocabulary V, searching a webpage containing the vocabularies in the vocabulary V in a Web data set, and obtaining a set W;
and extracting the text in the set W by using an extraction algorithm based on a regular expression to obtain a pre-training corpus C.
Further, pre-training the word granularity mask language model specifically comprises the following steps:
marking a position input vector corresponding to each word in the pre-training corpus C to obtain a vectorized input text;
adding identifiers to the front and back of sentences in the vectorized input text through the word granularity mask language model, and carrying out sentence segmentation to obtain a data set A;
dividing a data set A into a pre-training data set and a first verification set according to a set proportion;
and respectively inputting the pre-training data set and the first verification set into a word granularity mask language model for classification pre-training.
Further, the entity-granularity mask language model pre-training is specifically as follows:
introducing a knowledge graph related to electric power to mark out entities contained in the pre-training data set and the first verification set;
the mask language model of the entity granularity replaces each word in the corresponding entity with a mask mark, the position of each mask mark corresponds to a hidden layer vector, and the word at the position corresponding to each mask mark is predicted by connecting a full-connection layer;
and measuring the difference between the predicted value and the true value by adopting a loss function, calculating the loss function value on the first verification set after the pre-training of the mask language model with the entity granularity by using the pre-training data set reaches a set training turn, and stopping the pre-training of the mask language model with the entity granularity when the loss function value is not reduced any more.
Further, fine tuning is performed on the pre-trained EPAT-BERT model, which specifically includes:
extracting a certain amount of electric power audit text to form a data set, and marking each word in the data set with the vector representation of the word, the position coding of the word and the segmentation coding of the word to obtain a vectorized text T;
dividing the vectorized text T into a fine-tuning data set, a second verification set and a test set according to a set ratio;
the EPAT-BERT model adds a special mark at the beginning of the vectorized text T, and the vector at the output position of the special mark is taken as the vector representation of the whole input text;
and adding a full connection layer on the upper layer of the EPAT-BERT, calculating an F1 value on the second verification set after each training turn of the fine-tuning data set by the EPAT-BERT model after adding the full connection layer, and stopping training when the F1 value on the second verification set decreases, to finish the fine-tuning of the EPAT-BERT model.
Further, the performance evaluation is carried out on the fine-tuned EPAT-BERT model, and the text classification EPAT-BERT model is determined as follows:
calculating the classification accuracy of the fine-tuned EPAT-BERT model under the test set;
and comparing various evaluation indexes of the EPAT-BERT model and other pre-training language models in the test set, and if the comparison result meets the set requirement, determining that the trained EPAT-BERT model is a text classification EPAT-BERT model, wherein the various evaluation indexes comprise classification accuracy.
Further, the pre-training data set and the first verification set are respectively input into the word granularity mask language model for classification pre-training, which specifically comprises the following steps:
the word granularity mask language model carries out mask masking on Chinese characters in each sentence of a pre-training data set randomly according to a set proportion, the masked Chinese characters are predicted through output vectors corresponding to mask positions, a loss function is adopted to measure the difference between a predicted value and a true value, after the pre-training of the word granularity mask language model is carried out by using the pre-training data set to reach a set training turn, a loss function value is calculated on a first verification set, and when the loss function value does not decrease any more, the pre-training of the word granularity mask language model is stopped.
Further, the position input vector corresponding to each word in the pre-training corpus C is labeled as follows:
marking the position input vector Vw corresponding to each word w using the vector of the word, its absolute position coding and its segmentation coding, as follows:
Vw = Ww + Pw + Sw
in the formula, Ww is the vector of the character, namely the initial word vector of the character, used for distinguishing different Chinese characters; Pw indicates the position of the word, fusing sequence position information into the input data by absolute position coding; Sw represents the segmentation code.
Further, the method also comprises the following steps: and carrying out an ablation experiment on the text classification EPAT-BERT model, and evaluating the experiment result through various evaluation indexes to determine the pre-training effect.
Further, the various evaluation indexes further comprise precision rate, recall rate and F1 value.
Further, the F1 value is determined according to the precision rate and the recall rate of the EPAT-BERT model on the verification set.
Further, the entities are the same as or similar to the professional vocabulary of the power field and to the vocabulary and grammar in the grammar analysis toolkit.
The invention also provides a power audit text classification device based on the improved BERT model, which comprises the following steps:
the text processing module is used for acquiring a power text;
the model building module is used for building an EPAT-BERT model, wherein the EPAT-BERT model comprises a word granularity mask language model and an entity granularity mask language model;
the model pre-training module is used for inputting the power text into the EPAT-BERT model for pre-training to obtain the EPAT-BERT model after pre-training; the pre-training comprises respectively performing word granularity mask language model training and entity granularity mask language model training;
the model fine-tuning module is used for carrying out fine tuning on the pre-trained EPAT-BERT model and then carrying out performance evaluation to determine a text classification EPAT-BERT model;
and the text classification module is used for inputting the power audit text to be classified into a text classification EPAT-BERT model and outputting a class label of the power audit text.
Further, the text processing module is specifically configured to:
arranging professional vocabularies in the electric power field into a vocabulary V, searching a webpage containing the vocabularies in the vocabulary V in a Web data set, and obtaining a set W;
and extracting the text in the set W by using an extraction algorithm based on a regular expression to obtain a pre-training corpus C.
Further, the model pre-training module is configured to pre-train the word granularity mask language model as follows:
marking a position input vector corresponding to each word in the pre-training corpus C to obtain a vectorized input text;
adding identifiers to the front and back of sentences in the vectorized input text through the word granularity mask language model, and carrying out sentence segmentation to obtain a data set A;
dividing a data set A into a pre-training data set and a first verification set according to a set proportion;
and respectively inputting the pre-training data set and the first verification set into a word granularity mask language model for classification pre-training.
Further, the model pre-training module is configured to perform entity-granularity mask language model pre-training as follows:
introducing a knowledge graph related to electric power to mark out entities contained in the pre-training data set and the first verification set;
the mask language model of the entity granularity replaces each word in the corresponding entity with a mask mark, the position of each mask mark corresponds to a hidden layer vector, and the word at the position corresponding to each mask mark is predicted by connecting a full-connection layer;
and measuring the difference between the predicted value and the true value by adopting a loss function, calculating the loss function value on the first verification set after the pre-training of the mask language model with the entity granularity by using the pre-training data set reaches a set training turn, and stopping the pre-training of the mask language model with the entity granularity when the loss function value is not reduced any more.
Further, the model fine-tuning module is specifically configured to:
extracting a certain amount of electric power audit text to form a data set, and marking each word in the data set with the vector representation of the word, the position coding of the word and the segmentation coding of the word to obtain a vectorized text T;
dividing the vectorized text T into a fine-tuning data set, a second verification set and a test set according to a set ratio;
the EPAT-BERT model adds a special mark at the beginning of the vectorized text T, and the vector at the output position of the special mark is taken as the vector representation of the whole input text;
and adding a full connection layer on the upper layer of the EPAT-BERT, calculating an F1 value on the second verification set after each training turn of the fine-tuning data set by the EPAT-BERT model after adding the full connection layer, and stopping training when the F1 value on the second verification set decreases, to finish the fine-tuning of the EPAT-BERT model.
Further, the model fine-tuning module is further specifically configured to:
calculating the classification accuracy of the fine-tuned EPAT-BERT model under the test set;
comparing various evaluation indexes of the EPAT-BERT model and other pre-training language models in the test set, and if the comparison result meets the set requirement, determining the trained EPAT-BERT model as a text classification EPAT-BERT model, wherein the various evaluation indexes comprise classification accuracy.
The invention also provides a computer device comprising a processor and a memory;
wherein the memory has stored therein at least one instruction, at least one program, set of codes, or set of instructions that is loaded and executed by the processor to implement the improved BERT model based power audit text classification method.
The invention has the beneficial effects that: the invention provides a pre-training task of electric power audit texts with two granularities: a word-granular mask language model and an entity-granular mask language model. The two pre-training tasks take large-scale power texts as training corpora, and the models are respectively used for completing word granularity prediction and entity granularity prediction, so that the lexical method, grammar and related knowledge in the power texts are grasped, and the high-efficiency automatic classification of the power audit texts is realized.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 shows a flow diagram of a pre-training phase of a BERT model according to the prior art;
FIG. 2 shows a schematic flow diagram of a fine tuning phase of a BERT model according to the prior art;
FIG. 3 illustrates a word-granular mask language model pre-training flow diagram according to an embodiment of the invention;
FIG. 4 illustrates a mask language model pre-training flow diagram of entity granularity, according to an embodiment of the invention;
FIG. 5 is a flow diagram illustrating a method for classification of power audit text based on the improved BERT model according to an embodiment of the present invention;
fig. 6 shows a schematic structural diagram of a power audit text classification device based on an improved BERT model according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in this application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. In this application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings.
In order to facilitate understanding of the embodiment of the present application, pre-training, fine-tuning, natural language processing models and the power audit text are first briefly introduced below:
pre-training means that a pre-training task is designed that is independent of downstream tasks and the model is trained using a large amount of label-free data that is relevant to the task.
With the introduction of pre-trained models such as the natural language processing model BERT, the computer vision model MAE and the cross-modal retrieval model CLIP, pre-trained language models and "fine-tuning" have become one of the important research fields of natural language processing.
The earliest pre-training models focused on capturing the semantics of a single word and obtaining word embeddings. Later, the advent of models such as CoVe and ELMo made it possible to extract contextual features. With the advent of the Transformer network, emerging models such as BERT and GPT have made "pre-training + fine-tuning" a paradigm for solving natural language processing tasks. One advantage of this paradigm is that, since the model has already learned a large amount of lexical and semantic information during the pre-training phase, the fine-tuning phase requires only a small amount of fully supervised data for training and can achieve a better result than a non-pre-trained model.
The BERT model is a typical pre-trained model that uses the encoder of the Transformer network as its basic structure. As shown in fig. 1, the BERT model takes a sentence as input, e.g., "safety tools are all provided by the subcontracting unit"; the model automatically adds a "[CLS]" identifier before the sentence to indicate its beginning and a "[SEP]" identifier after the sentence to indicate its end. The model then converts the input into an id sequence, obtains the sequence of corresponding word vectors, and encodes the word vector sequence to obtain a context-dependent output for each word.
As shown in fig. 2, fine-tuning means training the pre-trained model again on a downstream task. Although the pre-training task is independent of downstream tasks, the pre-trained model is still able to learn common language structures, such as Chinese lexicon and grammar, during the pre-training phase. When the model is further trained using data from a downstream task, the parameters in the network change slightly from their original values, a process called "fine-tuning".
A power audit text is a text recorded by auditors of a power enterprise and is of great significance for the enterprise to complete its audit work. A power audit text usually comprises the audit content and method, audit concerns, problems found by the audit, the system basis, audit opinions, problem classification and other information manually recorded by power auditors.
Common power audit texts are shown in table 1. It can be seen that each section of audit text needs an auditor to manually label a four-level classification label, so that audit text classification is realized. However, manually labeling the four-level classification labels on a large scale consumes manpower and material resources, is inefficient, and is prone to errors. Therefore, efficient and automatic classification of the power audit texts becomes an urgent problem to be solved.
Table 1 power audit text example
(Table 1 is provided as an image in the original publication and is not reproduced here.)
Existing text-oriented pre-trained models such as BERT can be further fine-tuned to complete text classification tasks. However, for the field of power auditing, a suitable and generic pre-trained language model and pre-training tasks have not yet emerged, which leaves large room for improvement in power audit text classification.
The semantic domain of corpora related to the power field is closer to that of the power audit text classification task, so from the perspective of pre-training theory a domain-related pre-training task can enhance the performance of downstream tasks in that domain. Based on this research situation, the power audit text classification method based on an improved BERT (Bidirectional Encoder Representations from Transformers) model provides power audit text pre-training tasks at two granularities: a word-granularity mask language model and an entity-granularity mask language model. The two pre-training tasks take large-scale power texts as the training corpus and make the model complete word-granularity prediction and entity-granularity prediction respectively, so that the lexicon, grammar and related knowledge in power texts are grasped.
As shown in fig. 5, a power audit text classification method based on an improved BERT model includes the following steps:
S1, acquiring a power text, specifically: the professional vocabularies of the power field are first organized into a vocabulary V, then Web pages containing words from the vocabulary V are searched in a Web data set provided by Yahoo and recorded as a set W. The text in the set W is extracted with an extraction algorithm based on regular expressions and used as the pre-training corpus of the invention, recorded as C.
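A minimal sketch of this collection step is given below, assuming the vocabulary V is a plain-text word list and the candidate Web pages are stored as local HTML files; the file layout and the function name are illustrative assumptions, not part of the invention.

```python
import re
from pathlib import Path

def build_pretraining_corpus(vocab_file: str, pages_dir: str) -> list[str]:
    """Keep pages that contain at least one term of vocabulary V (set W),
    then extract their text with regular expressions (pre-training corpus C)."""
    vocab = [w.strip() for w in Path(vocab_file).read_text(encoding="utf-8").splitlines() if w.strip()]
    term_re = re.compile("|".join(map(re.escape, vocab)))
    tag_re = re.compile(r"<[^>]+>")   # drop HTML markup
    ws_re = re.compile(r"\s+")

    corpus = []
    for page in Path(pages_dir).glob("*.html"):
        raw = page.read_text(encoding="utf-8", errors="ignore")
        if term_re.search(raw):                          # page belongs to set W
            text = ws_re.sub(" ", tag_re.sub(" ", raw)).strip()
            corpus.append(text)                          # one document of corpus C
    return corpus
```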
The embodiment of the invention collects power-related text from the Internet, so that the model learns lexicon and syntax more related to power and is closer to the downstream audit text classification task.
S2, constructing an EPAT-BERT (Electric Power Audit Text BERT) model, wherein the EPAT-BERT model comprises a word granularity mask language model and an entity granularity mask language model.
It should be noted that the word granularity mask language model of the embodiment of the invention follows the existing general-corpus BERT model. Compared with general text, power text contains more professional nouns and places more emphasis on accurate word use, and adopting only a word-granularity mask language model leads to the problem of inaccurate factual information.
The power audit text is generally a highly specialized short text in which entities and knowledge related to the power audit industry often appear, while their frequency in general text is low. Existing research has shown that for this type of text, word-granularity mask language model training can be inaccurate. For example, when predicting "the second largest city of China is [MASK][MASK]", it is easy to predict an incorrect city, because the content to be predicted in this sentence depends on knowledge, whereas the word-granularity mask language model emphasizes lexical grammar and the smoothness of the sentence during prediction and sometimes ignores such knowledge information.
Therefore, the EPAT-BERT model of the invention further includes an entity-granularity masked language model (Entity-level Masked Language Model).
The contents to be predicted by the pre-training language model during pre-training need not only conform to the lexical or grammatical rules, but also learn corresponding facts or knowledge. This helps the pre-trained language model to understand the text even further, especially for highly specialized domain knowledge such as power audit text.
In the entity-granularity mask language model (Entity-level Masked Language Model) of the embodiment of the invention, the model does not only predict masked words: in the pre-training stage, entities consisting of several words are also masked, and the model predicts them. This process allows the model to learn knowledge related to power auditing rather than being limited to lexicon and syntax.
S3, inputting the power text into the EPAT-BERT model for pre-training to obtain the pre-trained EPAT-BERT model, wherein the pre-training specifically comprises the following steps: and respectively training the power text input word granularity mask language model and the entity granularity mask language model.
In the step, word granularity mask language model training can be carried out firstly, and then entity granularity mask language model training can be carried out; or the entity granularity mask language model training can be carried out firstly, and then the word granularity mask language model training can be carried out.
The step S31 of pre-training the power text input word granularity mask language model specifically comprises the following steps:
S311, marking the position input vector Vw corresponding to each word w in the pre-training corpus C to obtain a vectorized input text.
The position input vector Vw corresponding to each word w is marked using the vector of the word, its absolute position coding and its segmentation coding, as follows:
Vw = Ww + Pw + Sw
in the formula, Ww is the vector of the character, namely the initial word vector of the character, used for distinguishing different Chinese characters; Pw indicates the position of the word, fusing sequence position information into the input data by absolute position coding; Sw represents the segmentation code: when the input contains multiple sentences or multiple parts, different segments should be represented with different codes, whereas the input of EPAT-BERT has only one part, so the segmentation representation is unique.
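A sketch of this input representation in PyTorch follows; the vocabulary size, hidden size and maximum length shown are assumptions, since the embodiment does not fix them.

```python
import torch
import torch.nn as nn

class EPATBertEmbedding(nn.Module):
    """Builds Vw = Ww + Pw + Sw for every position of an input id sequence."""
    def __init__(self, vocab_size=21128, hidden_size=768, max_len=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_size)   # Ww: initial word vector
        self.pos_emb = nn.Embedding(max_len, hidden_size)       # Pw: absolute position code
        self.seg_emb = nn.Embedding(1, hidden_size)              # Sw: single segment for EPAT-BERT

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(input_ids.size(1), device=input_ids.device).unsqueeze(0)
        segments = torch.zeros_like(input_ids)                   # only one part, so the segment id is always 0
        return self.word_emb(input_ids) + self.pos_emb(positions) + self.seg_emb(segments)
```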
And S312, adding identifiers to the front and back of sentences in the vectorized input text through the word granularity mask language model, and carrying out sentence segmentation to obtain a data set A. Sentence segmentation is carried out, specifically: a "[CLS]" identifier is added before a sentence to indicate its beginning, and a "[SEP]" identifier is added after the sentence to indicate its end.
And S313, dividing the data set A into a pre-training data set and a first verification set according to a set proportion.
S314, respectively inputting the pre-training data set and the first verification set into a word granularity mask language model for classification pre-training, wherein the classification pre-training is as follows:
the word granularity mask language model randomly masks Chinese characters in each sentence of the pre-training data set according to a first set proportion, predicts the masked Chinese characters through the output vectors corresponding to the mask positions, adopts a cross-entropy loss function with an L2 regular term to measure the difference between the predicted value and the true value, and optimizes the loss function with an AdamW optimizer at a learning rate of 5e-5.
In the pre-training stage, the model is optimized using the pre-training data set; after every 8000 training turns, the loss function value is calculated on the first verification set, and when the loss function value no longer decreases, pre-training is stopped, avoiding model overfitting.
In this step, the first set proportion may be 20%. For example, as shown in fig. 3, the word granularity mask language model randomly selects 20% of the Chinese characters in a text to be masked, and the model then predicts them from the output vectors corresponding to the mask positions, where "[M]" represents the mask "[MASK]". The sentence "safety tools are all provided by the subcontracting unit" is input into the word granularity mask language model and randomly masked, and the masked characters are recovered by prediction.
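A sketch of the random character masking and loss set-up described above is given below; the ignore-index convention and the weight-decay value are assumptions, while the 20% masking ratio and the 5e-5 learning rate come from the embodiment.

```python
import torch

def mask_word_granularity(input_ids, mask_id, special_ids, ratio=0.2):
    """Randomly replace ~20% of the Chinese characters with [MASK];
    unmasked positions are ignored by the cross-entropy loss."""
    labels = input_ids.clone()
    probs = torch.full(input_ids.shape, ratio)
    for sid in special_ids:                 # never mask [CLS], [SEP] or padding
        probs[input_ids == sid] = 0.0
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100                  # -100 is ignored by nn.CrossEntropyLoss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_id
    return corrupted, labels

# AdamW's weight decay plays the role of the L2 regular term (decay value assumed):
# optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
```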
Because the pre-training corpus is changed from general Chinese text to power-related text, the model learns lexicon and grammar more relevant to power during the pre-training stage, and can therefore, in theory, achieve better results on downstream tasks related to power text.
S32, pre-training a mask language model of the power text input entity granularity, specifically as follows:
s321, marking out the entity included in the pre-training dataset and the first verification set by introducing a power-related knowledge graph, for example, the power-related knowledge graph may be an oww.
S322, the entity-granularity mask language model replaces each word in a marked entity with the special mask mark [MASK], and each [MASK] position corresponds to a hidden-layer vector. The word at the position corresponding to each [MASK] is predicted by connecting a full connection layer, a cross-entropy loss function with an L2 regular term is adopted to measure the difference between the predicted value and the true value, the model is optimized using the pre-training data set, the loss function value is calculated on the first verification set after every 8000 training turns, and pre-training of the entity-granularity mask language model is stopped when the loss function value no longer decreases.
For example, as shown in fig. 4, with such entity-granularity masking, the sentence "safety tools are all provided by the subcontracting unit" is input into the entity-granularity mask language model, the entity is masked, and "subcontracting unit" is recovered by prediction.
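A sketch of this entity-granularity masking follows, assuming the entity tagging step yields character-index spans (such as the span covering the entity "subcontracting unit"); the function name is illustrative.

```python
import torch

def mask_entity_granularity(input_ids: torch.Tensor, entity_spans, mask_id: int):
    """Entity-granularity masking for a single id sequence: every character inside a
    tagged entity span is replaced with [MASK], and the loss is computed only there."""
    labels = input_ids.clone()
    masked = torch.zeros_like(input_ids, dtype=torch.bool)
    for start, end in entity_spans:          # spans come from the power knowledge-graph tagging
        masked[start:end] = True
    labels[~masked] = -100                   # ignored by nn.CrossEntropyLoss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_id
    return corrupted, labels
```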
The content to be predicted by the model during pre-training is not only required to conform to the lexical or grammatical theory, but also required to learn corresponding facts or knowledge. The pre-training of the mask language model through entity granularity is helpful for the model to further understand texts, particularly the texts with highly-integrated professional domain knowledge, such as power audit texts.
According to the embodiment of the invention, by introducing the mask language model of the entity granularity, the model can learn more contents related to the domain knowledge on the basis of the language model task of the word granularity, so that texts related to the electric power domain can be more accurately understood, and the performance of downstream tasks is improved.
It should be noted that both the word-granularity mask language model and the entity-granularity mask language model are built using the Transformers and PyTorch libraries. Since EPAT-BERT requires pre-training from scratch, its model parameters are initialized randomly.
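A sketch of building such a randomly initialized model with the Transformers library is shown below; the configuration sizes are assumptions, as the embodiment only states that the parameters are initialized randomly.

```python
from transformers import BertConfig, BertForMaskedLM

# Assumed configuration sizes; no released checkpoint is loaded, so the weights are random.
config = BertConfig(vocab_size=21128, hidden_size=768,
                    num_hidden_layers=12, num_attention_heads=12)
model = BertForMaskedLM(config)  # prediction head over the vocabulary for the masked positions
```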
S4, carrying out fine adjustment on the pre-trained EPAT-BERT model, then carrying out performance evaluation, and determining a text classification EPAT-BERT model, which specifically comprises the following steps:
S41, extracting a certain amount of electric power audit texts to form a data set, and marking each word in the data set with the vector representation of the word, the position coding of the word and the segmentation coding of the word to obtain a vectorized text T, which is divided into a fine-tuning data set, a second verification set and a test set according to a set ratio (e.g., 8:1:1).
For example, 1000 pieces may be extracted from existing power audit text as a data set.
S42, the EPAT-BERT model adds a special mark [CLS] at the beginning of the vectorized text T and takes the vector at the [CLS] output position as the vector representation of the whole input text, and a full connection layer is added on the upper layer of EPAT-BERT, the number of neurons of the full connection layer being the total number of audit text categories. The entire EPAT-BERT thus forms an end-to-end neural network. In the fine-tuning training, a loss function with an L2 regularization term is used for optimization.
The EPAT-BERT model calculates the F1 value (F1-score) on the second verification set after each training turn over the fine-tuning data set, and when the F1 value on the second verification set decreases, training is stopped and fine-tuning of the EPAT-BERT model is complete.
In this step, the fine-tuning data set is used to optimize the model. The F1 value is determined from the precision (Precision) and recall (Recall) of the EPAT-BERT model on the second verification set; F1 is used as the basis for early stopping because this index synthesizes the other indexes and is representative.
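A sketch of the fine-tuning head and the F1-based early stopping described above is given below; the class and function names are illustrative and the number of categories is task-specific.

```python
import torch.nn as nn
from transformers import BertModel

class EPATBertClassifier(nn.Module):
    """The vector at the [CLS] output position feeds a fully connected layer
    with one neuron per audit-text category."""
    def __init__(self, encoder: BertModel, num_labels: int):
        super().__init__()
        self.encoder = encoder
        self.fc = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = out.last_hidden_state[:, 0]   # representation of the whole input text
        return self.fc(cls_vector)

def should_stop(f1_history):
    """Early stopping: halt once the F1 value on the second verification set drops."""
    return len(f1_history) >= 2 and f1_history[-1] < f1_history[-2]
```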
In the embodiment of the invention, the second verification set is input into the fine-tuned model; the obtained classification results comprise true positives TP, false positives FP, false negatives FN and true negatives TN, and the performance of the model is evaluated with the common evaluation indexes of classification accuracy (Accuracy), precision, recall and F1 value.
Calculating the precision rate of the EPAT-BERT model on the second verification setPThat is, the percentage of the real positive samples in the samples predicted to be positive by the model is calculated as follows:
Figure 257392DEST_PATH_IMAGE003
The recall R of the EPAT-BERT model on the second verification set, namely the proportion of truly positive samples among the samples whose actual labels are positive, is calculated as follows:
R = TP / (TP + FN)
The F1 value (F1-score) of the EPAT-BERT model on the second verification set, the harmonic mean of the precision and the recall and the most important evaluation criterion for the text classification, is calculated as follows:
F1 = 2 × P × R / (P + R)
S43, calculating the classification accuracy A of the fine-tuned EPAT-BERT model on the test set, namely the proportion of correctly classified samples among all samples in the test set, as follows:
A = (TP + TN) / (TP + FP + FN + TN)
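A sketch computing the four evaluation indexes from the confusion counts above, written for the binary or per-class case (averaging over classes would be an additional, assumed step):

```python
def evaluation_indexes(tp: int, fp: int, fn: int, tn: int):
    """Precision, recall, F1 and classification accuracy from the confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```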
according to the embodiment of the invention, the model can be ensured to have the most accurate generalization error by dividing the model into the training set, the verification set and the test set, which is better than the condition of only dividing the model into the training set and the test set.
And S44, comparing various evaluation indexes of the EPAT-BERT model and other pre-training language models in the test set, and if the comparison result meets the set requirement, determining the trained EPAT-BERT model as a text classification EPAT-BERT model.
In this step, two classical machine learning models are selected for comparison:
1. naive Bayes (Naive Bayes): the text is represented as a bag-of-words model and classified using a naive bayes algorithm.
2. Support Vector Machine (SVM): the text is represented as a bag-of-words model and classified using a support vector machine algorithm.
In addition, two deep learning models commonly used for text classification are chosen:
3. Text convolutional neural network (TextCNN): the word vector sequence corresponding to the text is regarded as a matrix, features are extracted from the matrix with a convolutional neural network, and learning is end-to-end.
4. Long short-term memory network (LSTM): the word vector sequence corresponding to the text is fed into the LSTM in order, and learning is end-to-end.
And finally, selecting a general pre-training BERT model for comparing to demonstrate the effectiveness of the power text pre-training task:
5. General pre-trained BERT model: pre-trained on a general corpus with the two pre-training tasks of word-granularity masked language modeling and next sentence prediction.
The evaluation indexes finally calculated on the test set by the different models are shown in table 2. From the experimental results, the following can be concluded:
1. Compared with the machine learning models (Naive Bayes and SVM), the neural-network-based deep learning models TextCNN and LSTM achieve better results on the four evaluation indexes, showing that neural-network-based models are superior to traditional machine learning models based on statistical learning.
2. Compared with the deep learning models, the pre-training-based BERT model improves further on the four evaluation indexes.
3. The EPAT-BERT model for power audit text classification provided by the invention is clearly superior to the general-corpus pre-trained BERT model, which demonstrates the effectiveness of the two-granularity pre-training tasks provided by the invention and the boost that domain-related pre-training gives to downstream tasks in the same domain.
TABLE 2 evaluation results of different models on the test set
(Table 2 is provided as an image in the original publication and is not reproduced here.)
S45, performing an ablation experiment on the text classification EPAT-BERT model, and evaluating the experiment result through various evaluation indexes to determine the pre-training effect.
The focus of the EPAT-BERT model is two pre-training tasks: a word-granular mask language model and an entity-granular mask language model. Therefore, it is important to explore the influence of the two pre-training tasks on the experimental results. In order to explore the influence of two pre-training tasks, the invention further designs two groups of ablation experiments.
TABLE 3 ablation test results
(Table 3 is provided as an image in the original publication and is not reproduced here.)
In the first set of experiments, the word-granularity and entity-granularity pre-training tasks in EPAT-BERT were removed in turn, and the resulting models are recorded as EPAT-BERT w/o W and EPAT-BERT w/o E, respectively. The experimental results show that when either of the two pre-training tasks is removed, the model declines on all four classification evaluation indexes (classification accuracy, precision, recall and F1 value), so the pre-training tasks of both granularities play an important role in further improving the classification effect on audit texts.
In addition, the improvement to the downstream task brought by entity-granularity pre-training is more pronounced than that of word-granularity pre-training. In the second set of experiments, the effect of the training order of the two pre-training tasks in EPAT-BERT on the results was explored, where "-WE" denotes word-granularity mask language model training followed by entity-granularity mask language model training, and "-EW" the reverse. The experimental results show that, compared with performing the two pre-training tasks separately and independently, training that fuses the two tasks is more effective; the order of the two tasks has no significant influence on the result.
And S5, inputting the power audit text to be classified into a text classification EPAT-BERT model, outputting a class label of the power audit text, and finishing a power audit text classification task.
As shown in fig. 6, based on the above method for classifying a power audit text based on an improved BERT model, an embodiment of the present invention further provides a device for classifying a power audit text based on an improved BERT model, which includes a text processing module, a model building module, a model pre-training module, a model fine-tuning module, and a text classification module.
Specifically, the text processing module is used for acquiring a power text; the model building module is used for building an EPAT-BERT model, wherein the EPAT-BERT model comprises a word granularity mask language model and an entity granularity mask language model; the model pre-training module is used for inputting the electric power text into the EPAT-BERT model for pre-training to obtain the pre-trained EPAT-BERT model; the pre-training comprises respectively performing word granularity mask language model training and entity granularity mask language model training; the model fine-tuning module is used for carrying out fine tuning on the pre-trained EPAT-BERT model and then carrying out performance evaluation to determine a text classification EPAT-BERT model; and the text classification module is used for inputting the power audit text to be classified into a text classification EPAT-BERT model and outputting a class label of the power audit text.
The present invention also provides a computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, the at least one instruction, at least one program, set of codes, or set of instructions being loaded and executed by the processor to implement the above-described method for power audit text classification based on an improved BERT model.
For example, the computer device may be a GPU cloud server configured as follows: the CPU is an Intel(R) Xeon(R) Silver 4114 @ 2.20 GHz, the GPUs are four NVIDIA Titan V cards with 12 GB of video memory each, the memory of the computer device is 256 GB, and the hard disk is 2 TB.
Software packages and frameworks required by the computer device include PyTorch 1.7.1, Transformers 4.7.0, scikit-learn 0.24.2, NumPy 1.19.5, pandas 1.1.5, and Matplotlib 3.3.4.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (11)

1. A power audit text classification method based on an improved BERT model is characterized by comprising the following steps:
acquiring a power text; arranging professional vocabularies in the electric power field into a vocabulary V, searching a webpage containing the vocabularies in the vocabulary V in a Web data set, and obtaining a set W; extracting the text in the set W by using an extraction algorithm based on a regular expression to obtain a pre-training corpus C;
constructing an EPAT-BERT model, wherein the EPAT-BERT model comprises a word granularity mask language model and an entity granularity mask language model;
inputting the power text into an EPAT-BERT model for pre-training to obtain a pre-trained EPAT-BERT model; the pre-training comprises respectively performing word granularity mask language model training and entity granularity mask language model training; the pre-training of the word granularity mask language model is specifically as follows: marking a position input vector corresponding to each word in the pre-training corpus C to obtain a vectorized input text; adding identifiers to the front and back of sentences in the vectorized input text through the word granularity mask language model, and carrying out sentence segmentation to obtain a data set A; dividing the data set A into a pre-training data set and a first verification set according to a set proportion; respectively inputting the pre-training data set and the first verification set into the word granularity mask language model for classification pre-training; the entity-granularity mask language model pre-training is specifically as follows: introducing a knowledge graph related to electric power to mark out entities contained in the pre-training data set and the first verification set; the entity granularity mask language model replaces each word in the corresponding entity with a mask mark, the position of each mask mark corresponds to a hidden layer vector, and the word at the position corresponding to each mask mark is predicted by connecting a full connection layer; measuring the difference between the predicted value and the true value by using a loss function, calculating a loss function value on the first verification set after the pre-training of the entity granularity mask language model with the pre-training data set reaches a set training turn, and stopping the pre-training of the entity granularity mask language model when the loss function value is not reduced any more;
fine-tuning the pre-trained EPAT-BERT model, then carrying out performance evaluation, and determining a text classification EPAT-BERT model; fine-tuning the pre-trained EPAT-BERT model specifically comprises: extracting a certain amount of electric power audit text to form a data set, and marking each word in the data set with the vector representation of the word, the position coding of the word and the segmentation coding of the word to obtain a vectorized text T; dividing the vectorized text T into a fine-tuning data set, a second verification set and a test set according to a set ratio; the EPAT-BERT model adds a special mark at the beginning of the vectorized text T, and the vector at the output position of the special mark is taken as the vector representation of the whole input text; adding a full connection layer on the upper layer of the EPAT-BERT, calculating an F1 value on the second verification set after each training turn of the fine-tuning data set by the EPAT-BERT model after the full connection layer is added, and stopping training when the F1 value on the second verification set decreases, to finish fine-tuning of the EPAT-BERT model;
and inputting the power audit text to be classified into a text classification EPAT-BERT model, and outputting a class label of the power audit text.
2. The electric power audit text classification method based on the improved BERT model as claimed in claim 1, wherein the performance evaluation is carried out on the fine-tuned EPAT-BERT model, and the text classification EPAT-BERT model is determined as follows:
calculating the classification accuracy of the fine-tuned EPAT-BERT model in a test set;
and comparing various evaluation indexes of the EPAT-BERT model and other pre-training language models in the test set, and if the comparison result meets the set requirement, determining that the trained EPAT-BERT model is a text classification EPAT-BERT model, wherein the various evaluation indexes comprise classification accuracy.
3. The method for classifying power audit texts based on an improved BERT model as claimed in claim 1, wherein the pre-training data set and the first validation set are respectively input into a word granularity mask language model for classification pre-training as follows:
the word granularity mask language model carries out mask masking on Chinese characters in each sentence of a pre-training data set randomly according to a set proportion, the masked Chinese characters are predicted through output vectors corresponding to mask positions, a loss function is adopted to measure the difference between a predicted value and a true value, after the pre-training of the word granularity mask language model is carried out by using the pre-training data set to reach a set training turn, a loss function value is calculated on a first verification set, and when the loss function value does not decrease any more, the pre-training of the word granularity mask language model is stopped.
4. The electric power audit text classification method based on the improved BERT model as claimed in claim 1 or 3, wherein the position input vector corresponding to each word in the labeled pre-training corpus C is as follows:
marking the position input vector Vw corresponding to each word w using the vector of the word, its absolute position coding and its segmentation coding, as follows:
Vw = Ww + Pw + Sw
in the formula, Ww is the vector of the character, namely the initial word vector of the character, used for distinguishing different Chinese characters; Pw indicates the position of the word, fusing sequence position information into the input data by absolute position coding; Sw represents the segmentation code.
5. The method of claim 2, further comprising the steps of: and carrying out an ablation experiment on the text classification EPAT-BERT model, and evaluating the experiment result through various evaluation indexes to determine the pre-training effect.
6. The method for classifying power audit texts based on an improved BERT model according to claim 2 or 5, wherein the evaluation indexes further comprise precision rate, recall rate and F1 value.
7. The improved BERT model-based power audit text classification method according to claim 1, wherein the F1 value is determined according to the precision rate and recall rate of the EPAT-BERT model on the verification set.
8. The power audit text classification method based on the improved BERT model as claimed in claim 1, wherein an entity is vocabulary whose vocabulary and grammar are similar to or the same as those in a professional vocabulary and grammar analysis toolkit for the power field.
9. A power audit text classification device based on an improved BERT model is characterized by comprising the following components:
the text processing module is used for acquiring the power text, specifically: organizing professional vocabulary in the electric power field into a vocabulary V, and searching a Web data set provided by Yahoo for Web pages containing vocabulary from the vocabulary V to obtain a set W; and extracting the text in the set W by using an extraction algorithm based on regular expressions to obtain a pre-training corpus C (see the text-extraction sketch following this claim);
the model building module is used for building an EPAT-BERT model, wherein the EPAT-BERT model comprises a word granularity mask language model and an entity granularity mask language model;
the model pre-training module is used for inputting the power text into the EPAT-BERT model for pre-training to obtain the pre-trained EPAT-BERT model; the pre-training comprises word granularity mask language model training and entity granularity mask language model training; the word granularity mask language model pre-training is specifically as follows: marking the position input vector corresponding to each word in the pre-training corpus C to obtain a vectorized input text; adding identifiers before and after the sentences in the vectorized input text through the word granularity mask language model and performing sentence segmentation to obtain a data set A; dividing the data set A into a pre-training data set and a first validation set according to a set proportion; and respectively inputting the pre-training data set and the first validation set into the word granularity mask language model for classification pre-training; the entity granularity mask language model pre-training is specifically as follows: introducing a knowledge graph related to electric power to mark the entities contained in the pre-training data set and the first validation set; the entity granularity mask language model replaces each word in a corresponding entity with a mask mark, the position of each mask mark corresponds to a hidden layer vector, and the word at the position corresponding to each mask mark is predicted by connecting a fully connected layer; a loss function is used to measure the difference between the predicted value and the true value, the loss function value is calculated on the first validation set after the entity granularity mask language model has been pre-trained with the pre-training data set for a set number of training rounds, and the pre-training of the entity granularity mask language model stops when the loss function value no longer decreases (see the entity-masking sketch following this claim);
the model fine-tuning module is used for fine-tuning the pre-trained EPAT-BERT model and then carrying out performance evaluation to determine the text classification EPAT-BERT model; the fine-tuning of the pre-trained EPAT-BERT model is specifically as follows: extracting a certain amount of power audit text to form a data set, and marking each word in the data set by the vector representation of the word, the position encoding of the word and the segmentation encoding of the word to obtain a vectorized text T; dividing the vectorized text T into a fine-tuning data set, a second validation set and a test set according to a set ratio; the EPAT-BERT model adds a special mark at the beginning of the vectorized text T and takes the vector at the output position of the special mark as the vector representation of the whole input text; a fully connected layer is added on top of the EPAT-BERT model, the EPAT-BERT model with the added fully connected layer calculates an F1 value on the second validation set after each training round over the fine-tuning data set, and training stops when the F1 value on the second validation set decreases, completing the fine-tuning of the EPAT-BERT model;
and the text classification module is used for inputting the power audit text to be classified into a text classification EPAT-BERT model and outputting a category label of the power audit text.
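A hedged sketch of the text-processing module in claim 9 is given below; the vocabulary file name, the way Web pages are supplied and the particular regular expressions are assumptions, since the claim only requires a regular-expression-based extraction algorithm.

```python
# Hedged sketch of the text-processing module; the vocabulary file, the way Web
# pages are supplied and the particular regular expressions are assumptions, since
# the claim only requires a regular-expression-based extraction algorithm.
import re

with open("power_vocabulary.txt", encoding="utf-8") as f:      # assumed vocabulary V file
    vocabulary_v = [line.strip() for line in f if line.strip()]

def page_matches(html):
    """A page joins set W if it contains at least one term of vocabulary V."""
    return any(term in html for term in vocabulary_v)

TAG_RE = re.compile(r"<(script|style)[^>]*>.*?</\1>", re.S)    # drop scripts and styles
MARKUP_RE = re.compile(r"<[^>]+>")                             # drop remaining HTML tags
SPACE_RE = re.compile(r"\s+")

def extract_text(html):
    """Regular-expression-based extraction of plain text for the pre-training corpus C."""
    text = TAG_RE.sub(" ", html)
    text = MARKUP_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()
```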
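The entity-granularity masking in the model pre-training module can be sketched as follows; the three example entities stand in for the power knowledge graph, the bert-base-chinese checkpoint stands in for the claimed model, and the sketch assumes one token per Chinese character so that character offsets map directly to token positions.

```python
# Entity-granularity masking sketch; the example entities stand in for the power
# knowledge graph, the bert-base-chinese checkpoint stands in for the claimed model,
# and one token per Chinese character is assumed so that character offsets map
# directly to token positions.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
mlm_model = BertForMaskedLM.from_pretrained("bert-base-chinese")
kg_entities = ["变压器", "电费结算", "配电网"]   # assumed entities from the knowledge graph

def mask_entities(sentence):
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)       # -100 = ignored by the loss
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    text = "".join(t for t in tokens if t not in ("[CLS]", "[SEP]"))
    for entity in kg_entities:
        start = text.find(entity)
        if start == -1:
            continue
        # +1 offsets the [CLS] token; every word (character) of the entity is masked.
        for pos in range(start + 1, start + 1 + len(entity)):
            labels[0, pos] = input_ids[0, pos]
            input_ids[0, pos] = tokenizer.mask_token_id
    return input_ids, enc["attention_mask"], labels

ids, attn, labels = mask_entities("审计发现变压器检修费用异常")
# The hidden vector at each mask position feeds the masked-language-model head
# (a fully connected layer) that predicts the masked characters; the returned loss
# measures the difference between the predicted and true values.
loss = mlm_model(input_ids=ids, attention_mask=attn, labels=labels).loss
```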
10. The improved BERT model-based power audit text classification device according to claim 9, wherein the model fine-tuning module is further configured to:
calculate the classification accuracy of the fine-tuned EPAT-BERT model on the test set;
and compare various evaluation indexes of the EPAT-BERT model and other pre-trained language models on the test set, and if the comparison result meets the set requirement, determine the trained EPAT-BERT model as the text classification EPAT-BERT model, wherein the various evaluation indexes include the classification accuracy.
11. A computer device comprising a processor and a memory;
wherein the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the power audit text classification method based on the improved BERT model as claimed in any one of claims 1-8.
CN202211283079.4A 2022-10-20 2022-10-20 Power audit text classification method and device based on improved BERT model Active CN115357719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211283079.4A CN115357719B (en) 2022-10-20 2022-10-20 Power audit text classification method and device based on improved BERT model

Publications (2)

Publication Number Publication Date
CN115357719A CN115357719A (en) 2022-11-18
CN115357719B true CN115357719B (en) 2023-01-03

Family

ID=84007751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211283079.4A Active CN115357719B (en) 2022-10-20 2022-10-20 Power audit text classification method and device based on improved BERT model

Country Status (1)

Country Link
CN (1) CN115357719B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115794803B (en) * 2023-01-30 2023-05-09 北京至臻云智能科技有限公司 Engineering audit problem monitoring method and system based on big data AI technology
CN115983242A (en) * 2023-02-16 2023-04-18 北京有竹居网络技术有限公司 Text error correction method, system, electronic device and medium
CN116562284B (en) * 2023-04-14 2024-01-26 湖北经济学院 Government affair text automatic allocation model training method and device
CN116662579B (en) * 2023-08-02 2024-01-26 腾讯科技(深圳)有限公司 Data processing method, device, computer and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717339A (en) * 2019-12-12 2020-01-21 北京百度网讯科技有限公司 Semantic representation model processing method and device, electronic equipment and storage medium
CN114265922A (en) * 2021-11-23 2022-04-01 清华大学 Automatic question answering and model training method and device based on cross-language
CN114936287A (en) * 2022-01-30 2022-08-23 阿里云计算有限公司 Knowledge injection method for pre-training language model and corresponding interactive system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560486A (en) * 2020-11-25 2021-03-26 国网江苏省电力有限公司电力科学研究院 Power entity identification method based on multilayer neural network, storage medium and equipment
CN113239700A (en) * 2021-04-27 2021-08-10 哈尔滨理工大学 Text semantic matching device, system, method and storage medium for improving BERT
CN113836315A (en) * 2021-09-23 2021-12-24 国网安徽省电力有限公司电力科学研究院 Electric power standard knowledge extraction system
CN114330312A (en) * 2021-11-03 2022-04-12 腾讯科技(深圳)有限公司 Title text processing method, apparatus, storage medium, and program
CN115114906A (en) * 2022-04-24 2022-09-27 腾讯科技(深圳)有限公司 Method and device for extracting entity content, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115357719A (en) 2022-11-18

Similar Documents

Publication Publication Date Title
CN110489555B (en) Language model pre-training method combined with similar word information
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN109145294B (en) Text entity identification method and device, electronic equipment and storage medium
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN112817561B (en) Transaction type functional point structured extraction method and system for software demand document
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN114580382A (en) Text error correction method and device
CN106570180A (en) Artificial intelligence based voice searching method and device
CN112818698B (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN113138920B (en) Software defect report allocation method and device based on knowledge graph and semantic role labeling
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN113919366A (en) Semantic matching method and device for power transformer knowledge question answering
CN115359799A (en) Speech recognition method, training method, device, electronic equipment and storage medium
CN114004231A (en) Chinese special word extraction method, system, electronic equipment and storage medium
CN114742069A (en) Code similarity detection method and device
CN112036186A (en) Corpus labeling method and device, computer storage medium and electronic equipment
CN113486174A (en) Model training, reading understanding method and device, electronic equipment and storage medium
CN116342167B (en) Intelligent cost measurement method and device based on sequence labeling named entity recognition
CN113705207A (en) Grammar error recognition method and device
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
CN116383414A (en) Intelligent file review system and method based on carbon check knowledge graph

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant