CN115357719B - Power audit text classification method and device based on improved BERT model - Google Patents

Power audit text classification method and device based on improved BERT model

Info

Publication number
CN115357719B
Authority
CN
China
Prior art keywords
training
text
epat
model
bert model
Prior art date
Legal status
Active
Application number
CN202211283079.4A
Other languages
Chinese (zh)
Other versions
CN115357719A (en)
Inventor
孟庆霖
穆健
戴斐斐
赵宝国
王霞
崔霞
宋岩
葛晓舰
吕元旭
赵战云
唐厚燕
王瑞
许良
徐业朝
徐晓萱
马剑
李常春
郭保伟
李婧
Current Assignee
Tianjin Chengxi Guangyuan Power Engineering Co ltd
Tianjin Ninghe District Ningdong Shengyuan Power Engineering Co ltd
Tianjin Tianyuan Electric Power Engineering Co ltd
State Grid Tianjin Electric Power Co Training Center
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Original Assignee
Tianjin Chengxi Guangyuan Power Engineering Co ltd
Tianjin Ninghe District Ningdong Shengyuan Power Engineering Co ltd
Tianjin Tianyuan Electric Power Engineering Co ltd
State Grid Tianjin Electric Power Co Training Center
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by Tianjin Chengxi Guangyuan Power Engineering Co ltd, Tianjin Ninghe District Ningdong Shengyuan Power Engineering Co ltd, Tianjin Tianyuan Electric Power Engineering Co ltd, State Grid Tianjin Electric Power Co Training Center, State Grid Corp of China SGCC, State Grid Tianjin Electric Power Co Ltd filed Critical Tianjin Chengxi Guangyuan Power Engineering Co ltd
Priority to CN202211283079.4A priority Critical patent/CN115357719B/en
Publication of CN115357719A publication Critical patent/CN115357719A/en
Application granted granted Critical
Publication of CN115357719B publication Critical patent/CN115357719B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Economics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Water Supply & Treatment (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a device for classifying power audit texts based on an improved BERT model, wherein the classification method comprises the following steps: acquiring a power text; constructing an EPAT-BERT model; inputting the power text into the EPAT-BERT model for pre-training to obtain a pre-trained EPAT-BERT model, the pre-training comprising word granularity mask language model training and entity granularity mask language model training performed respectively; fine-tuning the pre-trained EPAT-BERT model and then performing performance evaluation to determine a text classification EPAT-BERT model; and inputting the power audit text to be classified into the text classification EPAT-BERT model and outputting a category label of the power audit text. The two pre-training tasks provided by the invention use large-scale power text as the training corpus, so that the model grasps the lexicon, grammar and related knowledge in power texts and efficient automatic classification of power audit texts is achieved.

Description

Power audit text classification method and device based on improved BERT model
Technical Field
The invention belongs to the technical field of Natural Language Processing (NLP), and particularly relates to a method and a device for classifying electric power audit texts based on an improved BERT model.
Background
With the development of information technology, text classification techniques based on machine learning and neural networks, such as word2vec, RNN and LSTM, have been proposed in succession.
In recent years, the "pre-training + fine-tuning" paradigm has gradually become the latest research direction in text classification and achieves better results than earlier fully supervised neural models. However, existing pre-trained models are pre-trained on general corpora and do not use texts related to the power field, especially the power auditing field.
Power enterprise audit texts are short, domain-specific texts with distinct industry characteristics, such as high similarity between texts and fuzzy classification boundaries, and they differ from general language. Existing text classification models, applied directly, cannot account for these domain characteristics of power audit texts, so further designing a model adapted to these characteristics is an important problem to be solved.
Disclosure of Invention
Aiming at the problems, the invention provides a method and a device for classifying power audit texts based on an improved BERT model, and the specific technical scheme is as follows:
a power audit text classification method based on an improved BERT model comprises the following steps:
acquiring a power text;
constructing an EPAT-BERT model, wherein the EPAT-BERT model comprises a word granularity mask language model and an entity granularity mask language model;
inputting the power text into an EPAT-BERT model for pre-training to obtain a pre-trained EPAT-BERT model; the pre-training comprises respectively performing word granularity mask language model training and entity granularity mask language model training;
fine-tuning the pre-trained EPAT-BERT model, then carrying out performance evaluation, and determining a text classification EPAT-BERT model;
and inputting the power audit text to be classified into a text classification EPAT-BERT model, and outputting a class label of the power audit text.
Further, the obtaining of the power text specifically includes:
arranging professional vocabularies in the electric power field into a vocabulary V, searching a webpage containing the vocabularies in the vocabulary V in a Web data set, and obtaining a set W;
and extracting the text in the set W by using an extraction algorithm based on a regular expression to obtain a pre-training corpus C.
Further, pre-training the word granularity mask language model specifically comprises the following steps:
marking a position input vector corresponding to each word in the pre-training corpus C to obtain a vectorized input text;
adding identifiers to the front and back of sentences in the vectorized input text through the word granularity mask language model, and carrying out sentence segmentation to obtain a data set A;
dividing a data set A into a pre-training data set and a first verification set according to a set proportion;
and respectively inputting the pre-training data set and the first verification set into a word granularity mask language model for classification pre-training.
Further, the entity-granularity mask language model pre-training is specifically as follows:
introducing a knowledge graph related to electric power to mark out entities contained in the pre-training data set and the first verification set;
the mask language model of the entity granularity replaces each word in the corresponding entity with a mask mark, the position of each mask mark corresponds to a hidden layer vector, and the word at the position corresponding to each mask mark is predicted by connecting a full-connection layer;
and measuring the difference between the predicted value and the true value by adopting a loss function, calculating the loss function value on the first verification set after the pre-training of the mask language model with the entity granularity by using the pre-training data set reaches a set training turn, and stopping the pre-training of the mask language model with the entity granularity when the loss function value is not reduced any more.
Further, fine tuning is performed on the pre-trained EPAT-BERT model, which specifically includes:
extracting a certain amount of electric power audit text to form a data set, and marking each word in the data set with the vector representation of the word, the position coding of the word and the segmentation coding of the word to obtain a vectorized text T;
dividing the vectorized text T into a fine-tuning data set, a second verification set and a test set according to a set ratio;
the EPAT-BERT model adds a special mark at the beginning of the vectorized text T, and the vector at the output position of the special mark is taken as the vector representation of the whole input text;
and adding a full connection layer on the upper layer of the EPAT-BERT, calculating an F1 value on the second verification set after each training turn of the fine-tuning data set by the EPAT-BERT model after adding the full connection layer, and stopping training when the F1 value on the second verification set decreases, to finish the fine-tuning of the EPAT-BERT model.
Further, the performance evaluation is carried out on the fine-tuned EPAT-BERT model, and the text classification EPAT-BERT model is determined as follows:
calculating the classification accuracy of the fine-tuned EPAT-BERT model under the test set;
and comparing various evaluation indexes of the EPAT-BERT model and other pre-training language models in the test set, and if the comparison result meets the set requirement, determining that the trained EPAT-BERT model is a text classification EPAT-BERT model, wherein the various evaluation indexes comprise classification accuracy.
Further, the pre-training data set and the first verification set are respectively input into the word granularity mask language model for classification pre-training, which specifically comprises the following steps:
the word granularity mask language model carries out mask masking on Chinese characters in each sentence of a pre-training data set randomly according to a set proportion, the masked Chinese characters are predicted through output vectors corresponding to mask positions, a loss function is adopted to measure the difference between a predicted value and a true value, after the pre-training of the word granularity mask language model is carried out by using the pre-training data set to reach a set training turn, a loss function value is calculated on a first verification set, and when the loss function value does not decrease any more, the pre-training of the word granularity mask language model is stopped.
Further, the position input vector corresponding to each word in the pre-training corpus C is labeled as follows:
marking the position input vector Vw corresponding to each word w using the vector of the word, its absolute position coding and its segmentation coding, as follows:
Vw = Ww + Pw + Sw
in the formula, Ww is the vector of the character, namely the initial word vector of the character, used for distinguishing different Chinese characters; Pw indicates the position of the word, fusing sequence position information into the input data by absolute position coding; Sw represents the segmentation code.
Further, the method also comprises the following steps: and carrying out an ablation experiment on the text classification EPAT-BERT model, and evaluating the experiment result through various evaluation indexes to determine the pre-training effect.
Further, the various evaluation indexes further comprise precision rate, recall rate and F1 value.
Further, the F1 value is determined according to the precision rate and the recall rate of the EPAT-BERT model on the verification set.
Further, the entities are the same as or similar to the professional vocabulary of the power field and to the vocabulary and grammar in the grammar analysis toolkit.
The invention also provides a power audit text classification device based on the improved BERT model, which comprises the following steps:
the text processing module is used for acquiring a power text;
the model building module is used for building an EPAT-BERT model, wherein the EPAT-BERT model comprises a word granularity mask language model and an entity granularity mask language model;
the model pre-training module is used for inputting the power text into the EPAT-BERT model for pre-training to obtain the EPAT-BERT model after pre-training; the pre-training comprises respectively performing word granularity mask language model training and entity granularity mask language model training;
the model fine-tuning module is used for carrying out fine tuning on the pre-trained EPAT-BERT model and then carrying out performance evaluation to determine a text classification EPAT-BERT model;
and the text classification module is used for inputting the power audit text to be classified into a text classification EPAT-BERT model and outputting a class label of the power audit text.
Further, the text processing module is specifically configured to:
arranging professional vocabularies in the electric power field into a vocabulary V, searching a webpage containing the vocabularies in the vocabulary V in a Web data set, and obtaining a set W;
and extracting the text in the set W by using an extraction algorithm based on a regular expression to obtain a pre-training corpus C.
Further, the model pre-training module is configured to pre-train the word granularity mask language model as follows:
marking a position input vector corresponding to each word in the pre-training corpus C to obtain a vectorized input text;
adding identifiers to the front and back of sentences in the vectorized input text through the word granularity mask language model, and carrying out sentence segmentation to obtain a data set A;
dividing a data set A into a pre-training data set and a first verification set according to a set proportion;
and respectively inputting the pre-training data set and the first verification set into a word granularity mask language model for classification pre-training.
Further, the model pre-training module is configured to perform entity-granularity mask language model pre-training as follows:
introducing a knowledge graph related to electric power to mark out entities contained in the pre-training data set and the first verification set;
the mask language model of the entity granularity replaces each word in the corresponding entity with a mask mark, the position of each mask mark corresponds to a hidden layer vector, and the word at the position corresponding to each mask mark is predicted by connecting a full-connection layer;
and measuring the difference between the predicted value and the true value by adopting a loss function, calculating the loss function value on the first verification set after the pre-training of the mask language model with the entity granularity by using the pre-training data set reaches a set training turn, and stopping the pre-training of the mask language model with the entity granularity when the loss function value is not reduced any more.
Further, the model fine-tuning module is specifically configured to:
extracting a certain amount of electric power audit text to form a data set, and marking each word in the data set with the vector representation of the word, the position coding of the word and the segmentation coding of the word to obtain a vectorized text T;
dividing the vectorized text T into a fine-tuning data set, a second verification set and a test set according to a set ratio;
the EPAT-BERT model adds a special mark at the beginning of the vectorized text T, and the vector at the output position of the special mark is taken as the vector representation of the whole input text;
and adding a full connection layer on the upper layer of the EPAT-BERT, calculating an F1 value on the second verification set after each training turn of the fine-tuning data set by the EPAT-BERT model after adding the full connection layer, and stopping training when the F1 value on the second verification set decreases, to finish the fine-tuning of the EPAT-BERT model.
Further, the model fine-tuning module is further specifically configured to:
calculating the classification accuracy of the fine-tuned EPAT-BERT model under the test set;
comparing various evaluation indexes of the EPAT-BERT model and other pre-training language models in the test set, and if the comparison result meets the set requirement, determining the trained EPAT-BERT model as a text classification EPAT-BERT model, wherein the various evaluation indexes comprise classification accuracy.
The invention also provides a computer device comprising a processor and a memory;
wherein the memory has stored therein at least one instruction, at least one program, set of codes, or set of instructions that is loaded and executed by the processor to implement the improved BERT model based power audit text classification method.
The invention has the beneficial effects that: the invention provides a pre-training task of electric power audit texts with two granularities: a word-granular mask language model and an entity-granular mask language model. The two pre-training tasks take large-scale power texts as training corpora, and the models are respectively used for completing word granularity prediction and entity granularity prediction, so that the lexical method, grammar and related knowledge in the power texts are grasped, and the high-efficiency automatic classification of the power audit texts is realized.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 shows a flow diagram of a pre-training phase of a BERT model according to the prior art;
FIG. 2 shows a schematic flow diagram of a fine tuning phase of a BERT model according to the prior art;
FIG. 3 illustrates a word-granular mask language model pre-training flow diagram according to an embodiment of the invention;
FIG. 4 illustrates a mask language model pre-training flow diagram of entity granularity, according to an embodiment of the invention;
FIG. 5 is a flow diagram illustrating a method for classification of power audit text based on the improved BERT model according to an embodiment of the present invention;
fig. 6 shows a schematic structural diagram of a power audit text classification device based on an improved BERT model according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in this application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. In this application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings.
In order to facilitate understanding of the embodiment of the present application, pre-training, fine-tuning, natural language processing models and the power audit text are first briefly introduced below:
pre-training means that a pre-training task is designed that is independent of downstream tasks and the model is trained using a large amount of label-free data that is relevant to the task.
With the introduction of pre-trained models such as the natural language processing model BERT, the computer vision model MAE and the cross-modal retrieval model CLIP, pre-trained language models and "fine-tuning" have become one of the important research fields of natural language processing.
The earliest pre-training models focused on capturing the semantics of a single word and obtaining word embeddings. Later, the advent of models such as CoVe and ELMo made it possible to extract contextual features. With the advent of the Transformer network, emerging models such as BERT and GPT have made "pre-training + fine-tuning" a paradigm for solving natural language processing tasks. One advantage of this paradigm is that, since the model has already learned a large amount of lexical and semantic information during the pre-training phase, the fine-tuning phase requires only a small amount of fully supervised data for training and can achieve a better result than a non-pre-trained model.
The BERT model is a typical pre-trained model that uses the encoder of the Transformer network as its basic structure. As shown in fig. 1, the BERT model takes a sentence as input, e.g., "safety tools are all provided by the subcontracting unit"; the model automatically adds a "[CLS]" identifier before the sentence to indicate its beginning and a "[SEP]" identifier after the sentence to indicate its end. The model then converts the input into an id sequence, obtains the sequence of corresponding word vectors, and encodes the word vector sequence to obtain a context-dependent output for each word.
As shown in fig. 2, fine-tuning means training the pre-trained model again on a downstream task. Although the pre-training task is independent of downstream tasks, the pre-trained model is still able to learn common language structures, such as Chinese lexicon and grammar, during the pre-training phase. When the model is further trained using data from a downstream task, the parameters in the network change slightly from their original values, a process called "fine-tuning".
A power audit text is a text recorded by auditors of a power enterprise and is of great significance for the enterprise to complete its audit work. A power audit text usually comprises the audit content and method, audit concerns, problems found by the audit, the system basis, audit opinions, problem classification and other information manually recorded by power auditors.
Common power audit texts are shown in table 1. It can be seen that each section of audit text needs an auditor to manually label a four-level classification label, so that audit text classification is realized. However, manually labeling the four-level classification labels on a large scale consumes manpower and material resources, is inefficient, and is prone to errors. Therefore, efficient and automatic classification of the power audit texts becomes an urgent problem to be solved.
Table 1 power audit text example
(Table 1 is provided as an image in the original publication and is not reproduced here.)
Existing text-oriented pre-trained models such as BERT can be further fine-tuned to complete text classification tasks. However, for the field of power auditing, a suitable and generic pre-trained language model and pre-training tasks have not yet emerged, which leaves large room for improvement in power audit text classification.
The semantic domain of corpora related to the power field is closer to that of the power audit text classification task, so from the perspective of pre-training theory a domain-related pre-training task can enhance the performance of downstream tasks in that domain. Based on this research situation, the power audit text classification method based on an improved BERT (Bidirectional Encoder Representations from Transformers) model provides power audit text pre-training tasks at two granularities: a word-granularity mask language model and an entity-granularity mask language model. The two pre-training tasks take large-scale power texts as the training corpus and make the model complete word-granularity prediction and entity-granularity prediction respectively, so that the lexicon, grammar and related knowledge in power texts are grasped.
As shown in fig. 5, a power audit text classification method based on an improved BERT model includes the following steps:
S1, acquiring a power text, specifically: the professional vocabularies of the power field are first organized into a vocabulary V, then Web pages containing words from the vocabulary V are searched in a Web data set provided by Yahoo and recorded as a set W. The text in the set W is extracted with an extraction algorithm based on regular expressions and used as the pre-training corpus of the invention, recorded as C.
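A minimal sketch of this collection step is given below, assuming the vocabulary V is a plain-text word list and the candidate Web pages are stored as local HTML files; the file layout and the function name are illustrative assumptions, not part of the invention.

```python
import re
from pathlib import Path

def build_pretraining_corpus(vocab_file: str, pages_dir: str) -> list[str]:
    """Keep pages that contain at least one term of vocabulary V (set W),
    then extract their text with regular expressions (pre-training corpus C)."""
    vocab = [w.strip() for w in Path(vocab_file).read_text(encoding="utf-8").splitlines() if w.strip()]
    term_re = re.compile("|".join(map(re.escape, vocab)))
    tag_re = re.compile(r"<[^>]+>")   # drop HTML markup
    ws_re = re.compile(r"\s+")

    corpus = []
    for page in Path(pages_dir).glob("*.html"):
        raw = page.read_text(encoding="utf-8", errors="ignore")
        if term_re.search(raw):                          # page belongs to set W
            text = ws_re.sub(" ", tag_re.sub(" ", raw)).strip()
            corpus.append(text)                          # one document of corpus C
    return corpus
```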
The embodiment of the invention collects power-related text from the Internet, so that the model learns lexicon and syntax more related to power and is closer to the downstream audit text classification task.
S2, constructing an EPAT-BERT (Electric Power Audit Text BERT) model, wherein the EPAT-BERT model comprises a word granularity mask language model and an entity granularity mask language model.
It should be noted that the word granularity mask language model of the embodiment of the invention follows the existing general-corpus BERT model. Compared with general text, power text contains more professional nouns and places more emphasis on accurate word use, and adopting only a word-granularity mask language model leads to the problem of inaccurate factual information.
The power audit text is generally a highly specialized short text in which entities and knowledge related to the power audit industry often appear, while their frequency in general text is low. Existing research has shown that for this type of text, word-granularity mask language model training can be inaccurate. For example, when predicting "the second largest city of China is [MASK][MASK]", it is easy to predict an incorrect city, because the content to be predicted in this sentence depends on knowledge, whereas the word-granularity mask language model emphasizes lexical grammar and the smoothness of the sentence during prediction and sometimes ignores such knowledge information.
Therefore, the EPAT-BERT model of the invention further includes an entity-granularity masked language model (Entity-level Masked Language Model).
The contents to be predicted by the pre-training language model during pre-training need not only conform to the lexical or grammatical rules, but also learn corresponding facts or knowledge. This helps the pre-trained language model to understand the text even further, especially for highly specialized domain knowledge such as power audit text.
In the entity-granularity mask language model (Entity-level Masked Language Model) of the embodiment of the invention, the model does not only predict masked words: in the pre-training stage, entities consisting of several words are also masked, and the model predicts them. This process allows the model to learn knowledge related to power auditing rather than being limited to lexicon and syntax.
S3, inputting the power text into the EPAT-BERT model for pre-training to obtain the pre-trained EPAT-BERT model, wherein the pre-training specifically comprises the following steps: and respectively training the power text input word granularity mask language model and the entity granularity mask language model.
In the step, word granularity mask language model training can be carried out firstly, and then entity granularity mask language model training can be carried out; or the entity granularity mask language model training can be carried out firstly, and then the word granularity mask language model training can be carried out.
The step S31 of pre-training the power text input word granularity mask language model specifically comprises the following steps:
S311, marking the position input vector Vw corresponding to each word w in the pre-training corpus C to obtain a vectorized input text.
The position input vector Vw corresponding to each word w is marked using the vector of the word, its absolute position coding and its segmentation coding, as follows:
Vw = Ww + Pw + Sw
in the formula, Ww is the vector of the character, namely the initial word vector of the character, used for distinguishing different Chinese characters; Pw indicates the position of the word, fusing sequence position information into the input data by absolute position coding; Sw represents the segmentation code: when the input contains multiple sentences or multiple parts, different segments should be represented with different codes, whereas the input of EPAT-BERT has only one part, so the segmentation representation is unique.
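A sketch of this input representation in PyTorch follows; the vocabulary size, hidden size and maximum length shown are assumptions, since the embodiment does not fix them.

```python
import torch
import torch.nn as nn

class EPATBertEmbedding(nn.Module):
    """Builds Vw = Ww + Pw + Sw for every position of an input id sequence."""
    def __init__(self, vocab_size=21128, hidden_size=768, max_len=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_size)   # Ww: initial word vector
        self.pos_emb = nn.Embedding(max_len, hidden_size)       # Pw: absolute position code
        self.seg_emb = nn.Embedding(1, hidden_size)              # Sw: single segment for EPAT-BERT

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(input_ids.size(1), device=input_ids.device).unsqueeze(0)
        segments = torch.zeros_like(input_ids)                   # only one part, so the segment id is always 0
        return self.word_emb(input_ids) + self.pos_emb(positions) + self.seg_emb(segments)
```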
And S312, adding identifiers to the front and back of sentences in the vectorized input text through the word granularity mask language model, and carrying out sentence segmentation to obtain a data set A. Sentence segmentation is carried out, specifically: a "[CLS]" identifier is added before a sentence to indicate its beginning, and a "[SEP]" identifier is added after the sentence to indicate its end.
And S313, dividing the data set A into a pre-training data set and a first verification set according to a set proportion.
S314, respectively inputting the pre-training data set and the first verification set into a word granularity mask language model for classification pre-training, wherein the classification pre-training is as follows:
the word granularity mask language model randomly masks Chinese characters in each sentence of the pre-training data set according to a first set proportion, predicts the masked Chinese characters through the output vectors corresponding to the mask positions, adopts a cross-entropy loss function with an L2 regular term to measure the difference between the predicted value and the true value, and optimizes the loss function with an AdamW optimizer at a learning rate of 5e-5.
In the pre-training stage, the model is optimized using the pre-training data set; after every 8000 training turns, the loss function value is calculated on the first verification set, and when the loss function value no longer decreases, pre-training is stopped, avoiding model overfitting.
In this step, the first set proportion may be 20%. For example, as shown in fig. 3, the word granularity mask language model randomly selects 20% of the Chinese characters in a text to be masked, and the model then predicts them from the output vectors corresponding to the mask positions, where "[M]" represents the mask "[MASK]". The sentence "safety tools are all provided by the subcontracting unit" is input into the word granularity mask language model and randomly masked, and the masked characters are recovered by prediction.
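A sketch of the random character masking and loss set-up described above is given below; the ignore-index convention and the weight-decay value are assumptions, while the 20% masking ratio and the 5e-5 learning rate come from the embodiment.

```python
import torch

def mask_word_granularity(input_ids, mask_id, special_ids, ratio=0.2):
    """Randomly replace ~20% of the Chinese characters with [MASK];
    unmasked positions are ignored by the cross-entropy loss."""
    labels = input_ids.clone()
    probs = torch.full(input_ids.shape, ratio)
    for sid in special_ids:                 # never mask [CLS], [SEP] or padding
        probs[input_ids == sid] = 0.0
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100                  # -100 is ignored by nn.CrossEntropyLoss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_id
    return corrupted, labels

# AdamW's weight decay plays the role of the L2 regular term (decay value assumed):
# optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
```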
Because the pre-training corpus is changed from general Chinese text to power-related text, the model learns lexicon and grammar more relevant to power during the pre-training stage, and can therefore, in theory, achieve better results on downstream tasks related to power text.
S32, pre-training a mask language model of the power text input entity granularity, specifically as follows:
s321, marking out the entity included in the pre-training dataset and the first verification set by introducing a power-related knowledge graph, for example, the power-related knowledge graph may be an oww.
S322, the entity-granularity mask language model replaces each word in a marked entity with the special mask mark [MASK], and each [MASK] position corresponds to a hidden-layer vector. The word at the position corresponding to each [MASK] is predicted by connecting a full connection layer, a cross-entropy loss function with an L2 regular term is adopted to measure the difference between the predicted value and the true value, the model is optimized using the pre-training data set, the loss function value is calculated on the first verification set after every 8000 training turns, and pre-training of the entity-granularity mask language model is stopped when the loss function value no longer decreases.
For example, as shown in fig. 4, with such entity-granularity masking, the sentence "safety tools are all provided by the subcontracting unit" is input into the entity-granularity mask language model, the entity is masked, and "subcontracting unit" is recovered by prediction.
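A sketch of this entity-granularity masking follows, assuming the entity tagging step yields character-index spans (such as the span covering the entity "subcontracting unit"); the function name is illustrative.

```python
import torch

def mask_entity_granularity(input_ids: torch.Tensor, entity_spans, mask_id: int):
    """Entity-granularity masking for a single id sequence: every character inside a
    tagged entity span is replaced with [MASK], and the loss is computed only there."""
    labels = input_ids.clone()
    masked = torch.zeros_like(input_ids, dtype=torch.bool)
    for start, end in entity_spans:          # spans come from the power knowledge-graph tagging
        masked[start:end] = True
    labels[~masked] = -100                   # ignored by nn.CrossEntropyLoss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_id
    return corrupted, labels
```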
The content to be predicted by the model during pre-training is not only required to conform to the lexical or grammatical theory, but also required to learn corresponding facts or knowledge. The pre-training of the mask language model through entity granularity is helpful for the model to further understand texts, particularly the texts with highly-integrated professional domain knowledge, such as power audit texts.
According to the embodiment of the invention, by introducing the mask language model of the entity granularity, the model can learn more contents related to the domain knowledge on the basis of the language model task of the word granularity, so that texts related to the electric power domain can be more accurately understood, and the performance of downstream tasks is improved.
It should be noted that both the word-granularity mask language model and the entity-granularity mask language model are built using the Transformers and PyTorch libraries. Since EPAT-BERT requires pre-training from scratch, its model parameters are initialized randomly.
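A sketch of building such a randomly initialized model with the Transformers library is shown below; the configuration sizes are assumptions, as the embodiment only states that the parameters are initialized randomly.

```python
from transformers import BertConfig, BertForMaskedLM

# Assumed configuration sizes; no released checkpoint is loaded, so the weights are random.
config = BertConfig(vocab_size=21128, hidden_size=768,
                    num_hidden_layers=12, num_attention_heads=12)
model = BertForMaskedLM(config)  # prediction head over the vocabulary for the masked positions
```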
S4, carrying out fine adjustment on the pre-trained EPAT-BERT model, then carrying out performance evaluation, and determining a text classification EPAT-BERT model, which specifically comprises the following steps:
S41, extracting a certain amount of electric power audit texts to form a data set, and marking each word in the data set with the vector representation of the word, the position coding of the word and the segmentation coding of the word to obtain a vectorized text T, which is divided into a fine-tuning data set, a second verification set and a test set according to a set ratio (e.g., 8:1:1).
For example, 1000 pieces may be extracted from existing power audit text as a data set.
S42, the EPAT-BERT model adds a special mark [CLS] at the beginning of the vectorized text T and takes the vector at the [CLS] output position as the vector representation of the whole input text, and a full connection layer is added on the upper layer of EPAT-BERT, the number of neurons of the full connection layer being the total number of audit text categories. The entire EPAT-BERT thus forms an end-to-end neural network. In the fine-tuning training, a loss function with an L2 regularization term is used for optimization.
The EPAT-BERT model calculates the F1 value (F1-score) on the second verification set after each training turn over the fine-tuning data set, and when the F1 value on the second verification set decreases, training is stopped and fine-tuning of the EPAT-BERT model is complete.
In this step, the fine-tuning data set is used to optimize the model. The F1 value is determined from the precision (Precision) and recall (Recall) of the EPAT-BERT model on the second verification set; F1 is used as the basis for early stopping because this index synthesizes the other indexes and is representative.
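A sketch of the fine-tuning head and the F1-based early stopping described above is given below; the class and function names are illustrative and the number of categories is task-specific.

```python
import torch.nn as nn
from transformers import BertModel

class EPATBertClassifier(nn.Module):
    """The vector at the [CLS] output position feeds a fully connected layer
    with one neuron per audit-text category."""
    def __init__(self, encoder: BertModel, num_labels: int):
        super().__init__()
        self.encoder = encoder
        self.fc = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = out.last_hidden_state[:, 0]   # representation of the whole input text
        return self.fc(cls_vector)

def should_stop(f1_history):
    """Early stopping: halt once the F1 value on the second verification set drops."""
    return len(f1_history) >= 2 and f1_history[-1] < f1_history[-2]
```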
In the embodiment of the invention, the second verification set is input into the fine-tuned model; the obtained classification results comprise true positives TP, false positives FP, false negatives FN and true negatives TN, and the performance of the model is evaluated with the common evaluation indexes of classification accuracy (Accuracy), precision, recall and F1 value.
Calculating the precision rate of the EPAT-BERT model on the second verification setPThat is, the percentage of the real positive samples in the samples predicted to be positive by the model is calculated as follows:
Figure 257392DEST_PATH_IMAGE003
The recall R of the EPAT-BERT model on the second verification set, namely the proportion of truly positive samples among the samples whose actual labels are positive, is calculated as follows:
R = TP / (TP + FN)
The F1 value (F1-score) of the EPAT-BERT model on the second verification set, the harmonic mean of the precision and the recall and the most important evaluation criterion for the text classification, is calculated as follows:
F1 = 2 × P × R / (P + R)
S43, calculating the classification accuracy A of the fine-tuned EPAT-BERT model on the test set, namely the proportion of correctly classified samples among all samples in the test set, as follows:
A = (TP + TN) / (TP + FP + FN + TN)
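A sketch computing the four evaluation indexes from the confusion counts above, written for the binary or per-class case (averaging over classes would be an additional, assumed step):

```python
def evaluation_indexes(tp: int, fp: int, fn: int, tn: int):
    """Precision, recall, F1 and classification accuracy from the confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```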
according to the embodiment of the invention, the model can be ensured to have the most accurate generalization error by dividing the model into the training set, the verification set and the test set, which is better than the condition of only dividing the model into the training set and the test set.
And S44, comparing various evaluation indexes of the EPAT-BERT model and other pre-training language models in the test set, and if the comparison result meets the set requirement, determining the trained EPAT-BERT model as a text classification EPAT-BERT model.
In this step, two classical machine learning models are selected for comparison:
1. naive Bayes (Naive Bayes): the text is represented as a bag-of-words model and classified using a naive bayes algorithm.
2. Support Vector Machine (SVM): the text is represented as a bag-of-words model and classified using a support vector machine algorithm.
In addition, two deep learning models commonly used for text classification are chosen:
3. Text convolutional neural network (TextCNN): the word vector sequence corresponding to the text is regarded as a matrix, features are extracted from the matrix with a convolutional neural network, and learning is end-to-end.
4. Long short-term memory network (LSTM): the word vector sequence corresponding to the text is fed into the LSTM in order, and learning is end-to-end.
And finally, selecting a general pre-training BERT model for comparing to demonstrate the effectiveness of the power text pre-training task:
5. General pre-trained BERT model: pre-trained on a general corpus with the two pre-training tasks of word-granularity masked language modeling and next sentence prediction.
The evaluation indexes finally calculated on the test set by the different models are shown in table 2. From the experimental results, the following can be concluded:
1. Compared with the machine learning models (Naive Bayes and SVM), the neural-network-based deep learning models TextCNN and LSTM achieve better results on the four evaluation indexes, showing that neural-network-based models are superior to traditional machine learning models based on statistical learning.
2. Compared with the deep learning models, the pre-training-based BERT model improves further on the four evaluation indexes.
3. The EPAT-BERT model for power audit text classification provided by the invention is clearly superior to the general-corpus pre-trained BERT model, which demonstrates the effectiveness of the two-granularity pre-training tasks provided by the invention and the boost that domain-related pre-training gives to downstream tasks in the same domain.
TABLE 2 evaluation results of different models on the test set
(Table 2 is provided as an image in the original publication and is not reproduced here.)
S45, performing an ablation experiment on the text classification EPAT-BERT model, and evaluating the experiment result through various evaluation indexes to determine the pre-training effect.
The focus of the EPAT-BERT model is two pre-training tasks: a word-granular mask language model and an entity-granular mask language model. Therefore, it is important to explore the influence of the two pre-training tasks on the experimental results. In order to explore the influence of two pre-training tasks, the invention further designs two groups of ablation experiments.
TABLE 3 ablation test results
(Table 3 is provided as an image in the original publication and is not reproduced here.)
In the first set of experiments, the word-granularity and entity-granularity pre-training tasks in EPAT-BERT were removed in turn, and the resulting models are recorded as EPAT-BERT w/o W and EPAT-BERT w/o E, respectively. The experimental results show that when either of the two pre-training tasks is removed, the model declines on all four classification evaluation indexes (classification accuracy, precision, recall and F1 value), so the pre-training tasks of both granularities play an important role in further improving the classification effect on audit texts.
In addition, the improvement to the downstream task brought by entity-granularity pre-training is more pronounced than that of word-granularity pre-training. In the second set of experiments, the effect of the training order of the two pre-training tasks in EPAT-BERT on the results was explored, where "-WE" denotes word-granularity mask language model training followed by entity-granularity mask language model training, and "-EW" the reverse. The experimental results show that, compared with performing the two pre-training tasks separately and independently, training that fuses the two tasks is more effective; the order of the two tasks has no significant influence on the result.
And S5, inputting the power audit text to be classified into a text classification EPAT-BERT model, outputting a class label of the power audit text, and finishing a power audit text classification task.
As shown in fig. 6, based on the above method for classifying a power audit text based on an improved BERT model, an embodiment of the present invention further provides a device for classifying a power audit text based on an improved BERT model, which includes a text processing module, a model building module, a model pre-training module, a model fine-tuning module, and a text classification module.
Specifically, the text processing module is used for acquiring a power text; the model building module is used for building an EPAT-BERT model, wherein the EPAT-BERT model comprises a word granularity mask language model and an entity granularity mask language model; the model pre-training module is used for inputting the electric power text into the EPAT-BERT model for pre-training to obtain the pre-trained EPAT-BERT model; the pre-training comprises respectively performing word granularity mask language model training and entity granularity mask language model training; the model fine-tuning module is used for carrying out fine tuning on the pre-trained EPAT-BERT model and then carrying out performance evaluation to determine a text classification EPAT-BERT model; and the text classification module is used for inputting the power audit text to be classified into a text classification EPAT-BERT model and outputting a class label of the power audit text.
The present invention also provides a computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, the at least one instruction, at least one program, set of codes, or set of instructions being loaded and executed by the processor to implement the above-described method for power audit text classification based on an improved BERT model.
For example, the computer device may be a GPU cloud server configured as follows: the CPU is an Intel(R) Xeon(R) Silver 4114 @ 2.20 GHz, the GPUs are four NVIDIA Titan V cards with 12 GB of video memory each, the memory of the computer device is 256 GB, and the hard disk is 2 TB.
Software packages and frameworks required by the computer device include PyTorch 1.7.1, Transformers 4.7.0, scikit-learn 0.24.2, NumPy 1.19.5, pandas 1.1.5, and Matplotlib 3.3.4.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (11)

1. A power audit text classification method based on an improved BERT model is characterized by comprising the following steps:
acquiring a power text; arranging professional vocabularies in the electric power field into a vocabulary V, searching a webpage containing the vocabularies in the vocabulary V in a Web data set, and obtaining a set W; extracting the text in the set W by using an extraction algorithm based on a regular expression to obtain a pre-training corpus C;
constructing an EPAT-BERT model, wherein the EPAT-BERT model comprises a word granularity mask language model and an entity granularity mask language model;
inputting the power text into an EPAT-BERT model for pre-training to obtain a pre-trained EPAT-BERT model; the pre-training comprises respectively performing word granularity mask language model training and entity granularity mask language model training; the pre-training of the word granularity mask language model is specifically as follows: marking a position input vector corresponding to each word in the pre-training corpus C to obtain a vectorized input text; adding identifiers to the front and back of sentences in the vectorized input text through the word granularity mask language model, and carrying out sentence segmentation to obtain a data set A; dividing the data set A into a pre-training data set and a first verification set according to a set proportion; respectively inputting the pre-training data set and the first verification set into the word granularity mask language model for classification pre-training; the entity-granularity mask language model pre-training is specifically as follows: introducing a knowledge graph related to electric power to mark out entities contained in the pre-training data set and the first verification set; the entity granularity mask language model replaces each word in the corresponding entity with a mask mark, the position of each mask mark corresponds to a hidden layer vector, and the word at the position corresponding to each mask mark is predicted by connecting a full connection layer; measuring the difference between the predicted value and the true value by using a loss function, calculating a loss function value on the first verification set after the pre-training of the entity granularity mask language model with the pre-training data set reaches a set training turn, and stopping the pre-training of the entity granularity mask language model when the loss function value is not reduced any more;
fine-tuning the pre-trained EPAT-BERT model, then carrying out performance evaluation, and determining a text classification EPAT-BERT model; fine-tuning the pre-trained EPAT-BERT model specifically comprises: extracting a certain amount of electric power audit text to form a data set, and marking each word in the data set with the vector representation of the word, the position coding of the word and the segmentation coding of the word to obtain a vectorized text T; dividing the vectorized text T into a fine-tuning data set, a second verification set and a test set according to a set ratio; the EPAT-BERT model adds a special mark at the beginning of the vectorized text T, and the vector at the output position of the special mark is taken as the vector representation of the whole input text; adding a full connection layer on the upper layer of the EPAT-BERT, calculating an F1 value on the second verification set after each training turn of the fine-tuning data set by the EPAT-BERT model after the full connection layer is added, and stopping training when the F1 value on the second verification set decreases, to finish fine-tuning of the EPAT-BERT model;
and inputting the power audit text to be classified into a text classification EPAT-BERT model, and outputting a class label of the power audit text.
2. The electric power audit text classification method based on the improved BERT model as claimed in claim 1, wherein the performance evaluation is carried out on the fine-tuned EPAT-BERT model, and the text classification EPAT-BERT model is determined as follows:
calculating the classification accuracy of the fine-tuned EPAT-BERT model in a test set;
and comparing various evaluation indexes of the EPAT-BERT model and other pre-training language models in the test set, and if the comparison result meets the set requirement, determining that the trained EPAT-BERT model is a text classification EPAT-BERT model, wherein the various evaluation indexes comprise classification accuracy.
3. The method for classifying power audit texts based on an improved BERT model as claimed in claim 1, wherein the pre-training data set and the first validation set are respectively input into a word granularity mask language model for classification pre-training as follows:
the word granularity mask language model carries out mask masking on Chinese characters in each sentence of a pre-training data set randomly according to a set proportion, the masked Chinese characters are predicted through output vectors corresponding to mask positions, a loss function is adopted to measure the difference between a predicted value and a true value, after the pre-training of the word granularity mask language model is carried out by using the pre-training data set to reach a set training turn, a loss function value is calculated on a first verification set, and when the loss function value does not decrease any more, the pre-training of the word granularity mask language model is stopped.
4. The electric power audit text classification method based on the improved BERT model as claimed in claim 1 or 3, wherein the position input vector corresponding to each word in the labeled pre-training corpus C is as follows:
marking the position input vector Vw corresponding to each word w using the vector of the word, its absolute position coding and its segmentation coding, as follows:
Vw = Ww + Pw + Sw
in the formula, Ww is the vector of the character, namely the initial word vector of the character, used for distinguishing different Chinese characters; Pw indicates the position of the word, fusing sequence position information into the input data by absolute position coding; Sw represents the segmentation code.
5. The method of claim 2, further comprising the steps of: and carrying out an ablation experiment on the text classification EPAT-BERT model, and evaluating the experiment result through various evaluation indexes to determine the pre-training effect.
6. The method for classifying power audit texts based on an improved BERT model according to claim 2 or 5, wherein the evaluation indexes further comprise precision rate, recall rate and F1 value.
7. The improved BERT model-based power audit text classification method according to claim 1, wherein the F1 value is determined according to the precision rate and recall rate of the EPAT-BERT model on the verification set.
8. The power audit text classification method based on the improved BERT model as claimed in claim 1, wherein an entity is vocabulary whose vocabulary and grammar are similar to or the same as those in a professional vocabulary and grammar analysis toolkit for the power field.
9. A power audit text classification device based on an improved BERT model is characterized by comprising the following components:
the text processing module is used for acquiring the power text, specifically: organizing professional vocabulary in the electric power field into a vocabulary V, and searching a Web data set provided by Yahoo for Web pages containing vocabulary from the vocabulary V to obtain a set W; and extracting the text in the set W by using an extraction algorithm based on regular expressions to obtain a pre-training corpus C (see the text-extraction sketch following this claim);
the model building module is used for building an EPAT-BERT model, wherein the EPAT-BERT model comprises a word granularity mask language model and an entity granularity mask language model;
the model pre-training module is used for inputting the power text into the EPAT-BERT model for pre-training to obtain the pre-trained EPAT-BERT model; the pre-training comprises word granularity mask language model training and entity granularity mask language model training; the word granularity mask language model pre-training is specifically as follows: marking the position input vector corresponding to each word in the pre-training corpus C to obtain a vectorized input text; adding identifiers before and after the sentences in the vectorized input text through the word granularity mask language model and performing sentence segmentation to obtain a data set A; dividing the data set A into a pre-training data set and a first validation set according to a set proportion; and respectively inputting the pre-training data set and the first validation set into the word granularity mask language model for classification pre-training; the entity granularity mask language model pre-training is specifically as follows: introducing a knowledge graph related to electric power to mark the entities contained in the pre-training data set and the first validation set; the entity granularity mask language model replaces each word in a corresponding entity with a mask mark, the position of each mask mark corresponds to a hidden layer vector, and the word at the position corresponding to each mask mark is predicted by connecting a fully connected layer; a loss function is used to measure the difference between the predicted value and the true value, the loss function value is calculated on the first validation set after the entity granularity mask language model has been pre-trained with the pre-training data set for a set number of training rounds, and the pre-training of the entity granularity mask language model stops when the loss function value no longer decreases (see the entity-masking sketch following this claim);
the model fine-tuning module is used for fine-tuning the pre-trained EPAT-BERT model and then carrying out performance evaluation to determine the text classification EPAT-BERT model; the fine-tuning of the pre-trained EPAT-BERT model is specifically as follows: extracting a certain amount of power audit text to form a data set, and marking each word in the data set by the vector representation of the word, the position encoding of the word and the segmentation encoding of the word to obtain a vectorized text T; dividing the vectorized text T into a fine-tuning data set, a second validation set and a test set according to a set ratio; the EPAT-BERT model adds a special mark at the beginning of the vectorized text T and takes the vector at the output position of the special mark as the vector representation of the whole input text; a fully connected layer is added on top of the EPAT-BERT model, the EPAT-BERT model with the added fully connected layer calculates an F1 value on the second validation set after each training round over the fine-tuning data set, and training stops when the F1 value on the second validation set decreases, completing the fine-tuning of the EPAT-BERT model;
and the text classification module is used for inputting the power audit text to be classified into a text classification EPAT-BERT model and outputting a category label of the power audit text.
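A hedged sketch of the text-processing module in claim 9 is given below; the vocabulary file name, the way Web pages are supplied and the particular regular expressions are assumptions, since the claim only requires a regular-expression-based extraction algorithm.

```python
# Hedged sketch of the text-processing module; the vocabulary file, the way Web
# pages are supplied and the particular regular expressions are assumptions, since
# the claim only requires a regular-expression-based extraction algorithm.
import re

with open("power_vocabulary.txt", encoding="utf-8") as f:      # assumed vocabulary V file
    vocabulary_v = [line.strip() for line in f if line.strip()]

def page_matches(html):
    """A page joins set W if it contains at least one term of vocabulary V."""
    return any(term in html for term in vocabulary_v)

TAG_RE = re.compile(r"<(script|style)[^>]*>.*?</\1>", re.S)    # drop scripts and styles
MARKUP_RE = re.compile(r"<[^>]+>")                             # drop remaining HTML tags
SPACE_RE = re.compile(r"\s+")

def extract_text(html):
    """Regular-expression-based extraction of plain text for the pre-training corpus C."""
    text = TAG_RE.sub(" ", html)
    text = MARKUP_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()
```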
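The entity-granularity masking in the model pre-training module can be sketched as follows; the three example entities stand in for the power knowledge graph, the bert-base-chinese checkpoint stands in for the claimed model, and the sketch assumes one token per Chinese character so that character offsets map directly to token positions.

```python
# Entity-granularity masking sketch; the example entities stand in for the power
# knowledge graph, the bert-base-chinese checkpoint stands in for the claimed model,
# and one token per Chinese character is assumed so that character offsets map
# directly to token positions.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
mlm_model = BertForMaskedLM.from_pretrained("bert-base-chinese")
kg_entities = ["变压器", "电费结算", "配电网"]   # assumed entities from the knowledge graph

def mask_entities(sentence):
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)       # -100 = ignored by the loss
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    text = "".join(t for t in tokens if t not in ("[CLS]", "[SEP]"))
    for entity in kg_entities:
        start = text.find(entity)
        if start == -1:
            continue
        # +1 offsets the [CLS] token; every word (character) of the entity is masked.
        for pos in range(start + 1, start + 1 + len(entity)):
            labels[0, pos] = input_ids[0, pos]
            input_ids[0, pos] = tokenizer.mask_token_id
    return input_ids, enc["attention_mask"], labels

ids, attn, labels = mask_entities("审计发现变压器检修费用异常")
# The hidden vector at each mask position feeds the masked-language-model head
# (a fully connected layer) that predicts the masked characters; the returned loss
# measures the difference between the predicted and true values.
loss = mlm_model(input_ids=ids, attention_mask=attn, labels=labels).loss
```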
10. The improved BERT model-based power audit text classification device according to claim 9, wherein the model fine-tuning module is further configured to:
calculate the classification accuracy of the fine-tuned EPAT-BERT model on the test set;
and compare various evaluation indexes of the EPAT-BERT model and other pre-trained language models on the test set, and if the comparison result meets the set requirement, determine the trained EPAT-BERT model as the text classification EPAT-BERT model, wherein the various evaluation indexes include the classification accuracy.
11. A computer device comprising a processor and a memory;
wherein the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the power audit text classification method based on the improved BERT model as claimed in any one of claims 1-8.
CN202211283079.4A 2022-10-20 2022-10-20 Power audit text classification method and device based on improved BERT model Active CN115357719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211283079.4A CN115357719B (en) 2022-10-20 2022-10-20 Power audit text classification method and device based on improved BERT model

Publications (2)

Publication Number Publication Date
CN115357719A CN115357719A (en) 2022-11-18
CN115357719B true CN115357719B (en) 2023-01-03

Family

ID=84007751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211283079.4A Active CN115357719B (en) 2022-10-20 2022-10-20 Power audit text classification method and device based on improved BERT model

Country Status (1)

Country Link
CN (1) CN115357719B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115794803B (en) * 2023-01-30 2023-05-09 北京至臻云智能科技有限公司 Engineering audit problem monitoring method and system based on big data AI technology
CN115983242A (en) * 2023-02-16 2023-04-18 北京有竹居网络技术有限公司 Text error correction method, system, electronic device and medium
CN116562284B (en) * 2023-04-14 2024-01-26 湖北经济学院 Government affair text automatic allocation model training method and device
CN116662579B (en) * 2023-08-02 2024-01-26 腾讯科技(深圳)有限公司 Data processing method, device, computer and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717339A (en) * 2019-12-12 2020-01-21 北京百度网讯科技有限公司 Semantic representation model processing method and device, electronic equipment and storage medium
CN114265922A (en) * 2021-11-23 2022-04-01 清华大学 Automatic question answering and model training method and device based on cross-language
CN114936287A (en) * 2022-01-30 2022-08-23 阿里云计算有限公司 Knowledge injection method for pre-training language model and corresponding interactive system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560486A (en) * 2020-11-25 2021-03-26 国网江苏省电力有限公司电力科学研究院 Power entity identification method based on multilayer neural network, storage medium and equipment
CN113239700A (en) * 2021-04-27 2021-08-10 哈尔滨理工大学 Text semantic matching device, system, method and storage medium for improving BERT
CN113836315A (en) * 2021-09-23 2021-12-24 国网安徽省电力有限公司电力科学研究院 Electric power standard knowledge extraction system
CN114330312A (en) * 2021-11-03 2022-04-12 腾讯科技(深圳)有限公司 Title text processing method, apparatus, storage medium, and program
CN115114906A (en) * 2022-04-24 2022-09-27 腾讯科技(深圳)有限公司 Method and device for extracting entity content, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115357719A (en) 2022-11-18

Similar Documents

Publication Publication Date Title
CN110489555B (en) Language model pre-training method combined with similar word information
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN109145294B (en) Text entity identification method and device, electronic equipment and storage medium
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN112817561B (en) Transaction type functional point structured extraction method and system for software demand document
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN114580382A (en) Text error correction method and device
CN106570180A (en) Artificial intelligence based voice searching method and device
CN112818698B (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN113138920B (en) Software defect report allocation method and device based on knowledge graph and semantic role labeling
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN113919366A (en) Semantic matching method and device for power transformer knowledge question answering
CN115359799A (en) Speech recognition method, training method, device, electronic equipment and storage medium
CN114004231A (en) Chinese special word extraction method, system, electronic equipment and storage medium
CN114742069A (en) Code similarity detection method and device
CN112036186A (en) Corpus labeling method and device, computer storage medium and electronic equipment
CN113486174A (en) Model training, reading understanding method and device, electronic equipment and storage medium
CN116342167B (en) Intelligent cost measurement method and device based on sequence labeling named entity recognition
CN113705207A (en) Grammar error recognition method and device
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
CN116383414A (en) Intelligent file review system and method based on carbon check knowledge graph

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant