CN113343701A - Extraction method and device for text named entities of power equipment fault defects - Google Patents

Extraction method and device for text named entities of power equipment fault defects

Info

Publication number
CN113343701A
Authority
CN
China
Prior art keywords
entity information
text
class
entity
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110742874.4A
Other languages
Chinese (zh)
Other versions
CN113343701B (en)
Inventor
陈鹏
金杨
邰彬
杨贤
汪进锋
黄杨珏
姚瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Electric Power Research Institute of Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Electric Power Research Institute of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd and Electric Power Research Institute of Guangdong Power Grid Co Ltd
Priority to CN202110742874.4A
Publication of CN113343701A
Application granted
Publication of CN113343701B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method and a device for extracting named entities from power equipment fault defect text. The method comprises: acquiring a defect text of the power equipment and preprocessing it to obtain standardized text data; extracting class I entity information by a dictionary method, the class I entity information comprising device name, component name, fault type, fault class and voltage class; extracting class II entity information with the LTP tool, the class II entity information comprising production time and commissioning time; extracting class III entity information with the Bert-CRF algorithm, the class III entity information comprising line name and manufacturer name; and outputting the class I, class II and class III entity information as the named entity extraction result. The method and the device can improve the accuracy and efficiency of named entity extraction from power equipment fault defect text.

Description

Extraction method and device for text named entities of power equipment fault defects
Technical Field
The invention relates to the technical field of machine learning, in particular to a method, a device, a terminal and a storage medium for extracting a text named entity of a fault defect of power equipment.
Background
A large number of fault cases accumulate during the overhaul and maintenance of a power system. These cases are semi-structured and unstructured text data about power equipment, and such text accounts for more than 80 percent of the data in the power field. The defect texts contain key information closely related to the operating state of power equipment and the safety of the power grid, but only a small fraction of this text data is currently mined and used. Natural language processing makes it possible to process large volumes of power defect text and mine useful fault information such as equipment names and fault types, providing a sounder basis for fault diagnosis, operation and maintenance, and condition-based maintenance of the power system.
Existing entity extraction methods each rely on a single extraction technique. Because power equipment entities are of many types with widely differing characteristics, and some entity types are open-ended, a single technique cannot extract all the entities in a power fault defect text.
Disclosure of Invention
The purpose of the invention is to provide a method, a device, a terminal and a storage medium for extracting named entities from power equipment fault defect text that improve the accuracy and efficiency of the extraction.
In order to achieve the above object, the present invention provides a method for extracting a text named entity of a power equipment fault defect, including:
S1, acquiring a defect text of the power equipment, and preprocessing the defect text to obtain standardized text data;
S2, extracting class I entity information by adopting a dictionary method, wherein the class I entity information comprises: device name, component name, fault type, fault class, and voltage class;
S3, extracting class II entity information by adopting an LTP tool, wherein the class II entity information comprises: production time and commissioning time;
S4, extracting class III entity information by adopting a Bert-CRF algorithm, wherein the class III entity information comprises: a line name and a manufacturer name;
S5, outputting the class I entity information, the class II entity information and the class III entity information to obtain the extraction result of the named entity.
Further, S1 includes:
S11, removing words without actual meaning from the defect text according to a preset rule;
S12, removing special symbols from the defect text by adopting a regular expression, wherein the special symbols comprise: punctuation, numbers and special characters.
Further, S2 includes:
S21, importing a preset dictionary set and the standardized text data;
S22, assigning an entity class label to each dictionary in the dictionary set;
S23, traversing all dictionaries in the dictionary set;
S24, traversing each word in the current dictionary, and judging whether the word appears in the standardized text data; if so, recording the word as a class I entity of the standardized text data, and if not, entering S25;
S25, judging whether the traversal of the current dictionary has ended; if so, entering S26, and if not, entering S24;
S26, judging whether all dictionaries in the dictionary set have been traversed; if so, ending the extraction of the class I entity information, and if not, entering S23.
Further, S3 includes:
S31, importing the standardized text data;
S32, performing word segmentation on the standardized text data, and labeling the part of speech of each word to obtain a segmented word set;
S33, traversing the word set, and judging whether the current word is a time noun; if so, entering S34, and if not, entering S36;
S34, reading the word following the current word, and judging whether it is a time noun; if so, repeating S34, and if not, entering S35;
S35, joining the time nouns into a time entity;
S36, judging whether all words in the word set have been traversed; if so, ending the extraction of the class II entity information, and if not, entering S33.
Further, in S4, the following calculation formula is adopted:
$$\mathrm{similarity}(A,B)=\frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\,\sqrt{\sum_{i=1}^{n}B_i^{2}}}$$
in the formula, A and B are two word vectors obtained by training a language model, n is the dimension of the word vectors, and $A_i$ and $B_i$ are the values of each dimension.
The invention also provides an extraction device for text named entities of power equipment fault defects, comprising: a data acquisition module, a class I entity information extraction module, a class II entity information extraction module, a class III entity information extraction module and an output module, wherein,
the data acquisition module is used for acquiring a defect text of the power equipment, and preprocessing the defect text to obtain standardized text data;
the class I entity information extraction module is configured to extract class I entity information by using a dictionary method, where the class I entity information includes: device name, component name, fault type, fault class, and voltage class;
the class II entity information extraction module is configured to extract class II entity information by using an LTP tool, where the class II entity information includes: production time and commissioning time;
the class-III entity information extraction module is used for extracting class-III entity information by adopting a Bert-CRF algorithm, wherein the class-III entity information comprises: a line name and a manufacturer name;
and the output module is used for outputting the type I entity information, the type II entity information and the type III entity information to obtain an extraction result of the named entity.
Further, the data acquisition module is specifically configured to:
according to a preset rule, eliminating words without actual meanings in the defect text;
adopting a regular expression to eliminate special symbols in the defect text, wherein the special symbols comprise: punctuation, numbers and special characters.
Further, the class III entity information extraction module adopts the following calculation formula:
$$\mathrm{similarity}(A,B)=\frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\,\sqrt{\sum_{i=1}^{n}B_i^{2}}}$$
in the formula, A and B are two word vectors obtained by training a language model, n is the dimension of the word vectors, and $A_i$ and $B_i$ are the values of each dimension.
The present invention also provides a computer terminal device, comprising: one or more processors; and a memory coupled to the processors for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for extracting a text named entity of a power equipment fault defect as described in any one of the above.
The invention also provides a computer readable storage medium, on which a computer program is stored, wherein the computer program is used for implementing the method for extracting the text named entity of the power equipment fault defect when being executed by a processor.
Compared with the prior art, the method, the device, the terminal device and the computer readable storage medium for extracting text named entities of power equipment fault defects have the following advantages:
First, the 5 classes of entities in the power equipment fault defect text (equipment name, component name, fault type, fault level and voltage level) are extracted by dictionary matching. Second, the time entities in the defect text are extracted with the LTP tool. Finally, a CRF (conditional random field) replaces the SoftMax output layer of the Bert model, which overcomes the local-optimum problem of choosing each word's label independently. Together these improve the accuracy and efficiency of extracting named entities from power equipment fault defect text.
Drawings
FIG. 1 is a schematic flow chart of a method for extracting a text named entity of a power equipment fault defect according to the present invention;
FIG. 2 is a flow chart diagram of extracting class I entity information based on a dictionary provided by the present invention;
FIG. 3 is a schematic flow chart of extracting information of class II entities based on LTP tool provided by the present invention;
FIG. 4 is a schematic diagram of the Bert structure provided by the present invention;
FIG. 5 is a schematic diagram of the input structure of the Bert model provided by the present invention;
FIG. 6 is a schematic diagram of a Self-attention structure provided by the present invention;
FIG. 7 is a schematic structural diagram of class III entity extraction based on Bert-CRF according to the present invention;
FIG. 8 is a schematic structural diagram of an extraction apparatus for a text named entity of a power equipment fault defect provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
As shown in fig. 1, the method for extracting a text named entity of a power equipment fault defect according to an embodiment of the present invention at least includes the following steps:
S1, acquiring a defect text of the power equipment, and preprocessing the defect text to obtain standardized text data.
Specifically, words without actual meaning are first removed from the defect text according to a preset rule; special symbols, which include punctuation, numbers and special characters, are then removed with a regular expression.
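The two preprocessing steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the filler word list `FILLER_WORDS` stands in for the unspecified "preset rule", and the regular expression assumes that everything other than Chinese characters and ASCII letters counts as a special symbol.

```python
import re

# Hypothetical filler words; the source only refers to "a preset rule".
FILLER_WORDS = ["的", "了", "经检查"]

# Assumption: keep CJK characters and ASCII letters; punctuation, digits,
# whitespace and any other symbol are treated as special symbols.
SPECIAL_SYMBOLS = re.compile(r"[^\u4e00-\u9fffA-Za-z]")


def preprocess(defect_text, filler_words=FILLER_WORDS):
    """Return standardized text with filler words and special symbols removed."""
    for word in filler_words:
        defect_text = defect_text.replace(word, "")
    return SPECIAL_SYMBOLS.sub("", defect_text)
```

Calling `preprocess` on a raw defect record returns a single string that later steps can search or segment.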
S2, extracting class I entity information by adopting a dictionary method, wherein the class I entity information comprises: device name, component name, fault type, fault class, and voltage class.
Specifically, to balance running speed against recognition accuracy, the invention introduces a custom dictionary as an auxiliary segmentation dictionary when the LTP word segmentation system segments the power fault defect text. After segmentation, special symbols without actual meaning are removed by dictionary traversal. These steps markedly improve the preprocessing result, keep the subsequent tasks reliable, and improve the precision and efficiency of the subsequent entity extraction.
The algorithm flow for extracting class I entities based on dictionaries is shown in FIG. 2. First, the contents of the 5 entity classes are stored as dictionaries: each entity class has its own dictionary, and each dictionary carries a class label. For each preprocessed power equipment fault defect text record, the words in each dictionary are traversed in turn to check whether each word appears in the text record. If it does, the text record contains that entity, and the entity's class is the class of the dictionary it came from; otherwise the traversal continues. When all dictionaries have been traversed, the 5 classes of entities in the power equipment fault defect text have been extracted.
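The dictionary traversal described above can be sketched as a pair of nested loops. The function name, the label names and the dictionary contents below are hypothetical and chosen only for illustration; the source does not list the actual dictionary entries.

```python
def extract_class_i(text, dictionary_set):
    """Return (word, label) pairs for every dictionary word found in text.

    dictionary_set maps an entity class label to the list of words in
    that class's dictionary.
    """
    entities = []
    for label, words in dictionary_set.items():  # traverse all dictionaries
        for word in words:                       # traverse words of the current dictionary
            if word in text:                     # the word appears in the text record
                entities.append((word, label))
    return entities


# Hypothetical dictionary contents, for illustration only.
DICTIONARIES = {
    "device_name": ["变压器", "断路器"],
    "component_name": ["套管", "触头"],
    "fault_type": ["渗油", "烧损"],
    "fault_class": ["一般", "严重"],
    "voltage_class": ["110kV", "220kV"],
}
```

Plain substring matching keeps the sketch short; when dictionaries are large, a multi-pattern matcher would avoid rescanning the text once per word.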
S3, extracting class II entity information by adopting an LTP tool, wherein the class II entity information comprises: production time and commissioning time.
Specifically, the production time and the commissioning time both represent time and are collectively referred to as class II entities. Time extraction is a mature technology, and using the LTP tool to extract time entities effectively speeds up time entity extraction from power equipment fault defect text.
It should be noted that LTP stands for Language Technology Platform.
The algorithm flow for extracting class II entities with the LTP tool is shown in FIG. 3. First, the LTP tool segments each power equipment fault defect text record into words and labels the part of speech of each word. Each word and its part of speech are then traversed, and the part of speech is used to judge whether the word is a time noun. If it is, the next word is checked in the same way, and so on until a word that is not a time noun is reached; all the consecutive time nouns are then joined to form a time entity. If it is not, the traversal continues until the text record has been fully traversed.
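The merging of consecutive time nouns can be sketched as below. The sketch assumes that segmentation and part-of-speech tagging have already been done (for example by LTP) and that the result is available as a list of `(word, tag)` pairs; it does not call LTP itself. The tag value `"nt"` for temporal nouns is an assumption about the tagger's tag set, and the function name is hypothetical.

```python
TIME_NOUN_TAG = "nt"  # assumed part-of-speech tag for temporal nouns


def extract_time_entities(tagged_words):
    """Join each run of consecutive time nouns into one time entity.

    tagged_words is a list of (word, part_of_speech) pairs produced by a
    segmenter and part-of-speech tagger.
    """
    entities = []
    run = []
    for word, pos in tagged_words:
        if pos == TIME_NOUN_TAG:
            run.append(word)               # keep reading while words are time nouns
        elif run:
            entities.append("".join(run))  # a non-time word closes the current run
            run = []
    if run:                                # a run may also end with the text
        entities.append("".join(run))
    return entities
```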
S4, extracting class III entity information by adopting a Bert-CRF algorithm, wherein the class III entity information comprises: line name and manufacturer name.
Specifically, line names and manufacturer names grow and change over time and cannot be enumerated in advance; they are collectively referred to as class III entities. The invention extracts these two open-ended entity types with the Bert-CRF algorithm, which is obtained by replacing the softmax output layer of the Bert model with a CRF model, improving the accuracy and efficiency of entity extraction in the power field. Because the set of class III entities is continuously updated and cannot be extracted directly by existing methods, the invention improves on the Bert model and proposes a Bert-CRF-based model for extracting class III entities.
The Transformer model is a new architecture for text sequence networks, and Bert is a fine-tuning-based multi-layer bidirectional Transformer encoder whose structure is shown in FIG. 4. In FIG. 4, $Y_1, Y_2, \ldots, Y_N$ are the model outputs, which in an entity extraction task represent the label of each character; TrmE is the encoding (Encoder) structure of the Transformer and is the core of the Bert model; $X_1, X_2, \ldots, X_N$ are the model inputs, each comprising 3 parts: Token, Segment and Position. The model input is shown in FIG. 5, taking the two sentences "device" and "switch" as an example. The input character [CLS] is the start marker and [SEP] is the end marker. Token is the input text sequence; Segment is the sentence segmentation information, where EA represents the first sentence, denoted by "1", and EB represents the second sentence, denoted by "0"; Position is the position information, indicating the position index of each input character in the library. Each input $X_i$ is composed of Token + Segment + Position.
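To make the composition of each input $X_i$ concrete, the sketch below adds the Token, Segment and Position vectors element-wise, as in the standard Bert input layer. The lookup tables and their toy values are hypothetical; a trained model would learn these tables with far more dimensions.

```python
def embed_inputs(token_ids, segment_ids, token_table, segment_table, position_table):
    """Return one input vector per character: Token + Segment + Position."""
    inputs = []
    for position, (token_id, segment_id) in enumerate(zip(token_ids, segment_ids)):
        token_vec = token_table[token_id]
        segment_vec = segment_table[segment_id]
        position_vec = position_table[position]
        inputs.append([t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)])
    return inputs


# Toy 2-dimensional lookup tables, for illustration only.
TOKEN_TABLE = {0: [1.0, 0.0], 1: [0.0, 1.0]}
SEGMENT_TABLE = {1: [0.5, 0.5], 0: [0.0, 0.0]}
POSITION_TABLE = [[0.25, 0.25], [0.5, 0.5]]
```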
The Bert model has 12 layers of TrmE networks, and each layer of TrmE network is composed of 6 layers of Encoders. Each Encoder layer consists of a self-attention mechanism (Self-attention) and a feed-forward neural network (Feed-forward). The core of the Encoder is the Self-attention module, which computes the relation between a word and its context, assigns a different weight to each word in that context, and updates the word's vector representation according to those weights. Compared with traditional word vector methods, a word vector obtained with Self-attention better reflects the meaning the word expresses in the text.
The Self-attention structure is shown in FIG. 6. Each word has 3 vectors of the same length, namely the query vector Query (Q), the key vector Key (K) and the value vector Value (V), calculated as
$$Q = XW^{Q},\quad K = XW^{K},\quad V = XW^{V} \tag{1}$$
where X is the input matrix and $W^{Q}$, $W^{K}$, $W^{V}$ are weight matrices obtained through model training. The output Y of the Self-attention is
$$Y = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2}$$
In the formula, $d_k$ is the dimension of the key vectors, and $\sqrt{d_k}$ is a penalty factor whose effect is to ensure that the product of Q and K is not too large.
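The Self-attention computation can be sketched in plain Python as follows, assuming the standard scaled dot-product form in which the scores $QK^{T}$ are divided by the square root of the key dimension before the softmax. Matrices are lists of rows, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Compute softmax(Q K^T / sqrt(d_k)) V with Q = X W_Q, K = X W_K, V = X W_V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])                              # dimension of the key vectors
    k_t = [list(col) for col in zip(*k)]         # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]   # one weight per context word
    return matmul(weights, v)
```

Each output row is a weighted average of the value vectors, so a word's new representation mixes in its context in proportion to the attention weights.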
The Bert model uses a multi-head attention mechanism (Multi-head attention) built on Self-attention. The number of heads is the number of Self-attention modules used, which is 12 here. In Multi-head attention, each Self-attention head focuses on different context information of a word, and the output matrix MultiHead can be expressed as
$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\mathrm{head}_2,\ldots,\mathrm{head}_k)\,W^{O} \tag{3}$$
where Concat denotes that the $\mathrm{head}_i$ matrices are concatenated horizontally and then multiplied by the concatenation matrix $W^{O}$; $\mathrm{head}_i$ denotes the output matrix of each Self-attention, i.e.
$$\mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},\,KW_i^{K},\,VW_i^{V}) \tag{4}$$
After obtaining the label values of the sequence's word vectors, the Bert model can further extract word vector features, and the softmax layer selects for each word the label with the maximum value, which completes the word vector feature classification. In this classification process, however, the Bert model considers only local information and easily falls into a local optimum. In fact, contextual relations exist between the current word and its neighboring words, and between the label of the current word and the labels of its neighbors. The CRF takes these sequence-level relations into account, selects the globally optimal labels, and so completes the word vector classification. The two differ in how they label a sequence: because the Bert model simply selects the highest-valued label for each character, it can make errors when predicting the labels of a sequence. The CRF instead considers the total value $C(y_1,\ldots,y_m)$ of the entire sequence, which comprises an initial value, transition values and character values, and is calculated as
$$C(y_1,\ldots,y_m)=b(y_1)+\sum_{t=1}^{m}s_t(y_t)+\sum_{t=1}^{m-1}T(y_t,y_{t+1})+e(y_m) \tag{5}$$
In the formula, $b(y_1)$ and $e(y_m)$ are the values of the initial and end states respectively, $s_t(y_t)$ is the value of label $y_t$ at time t, and $T(y_t,y_{t+1})$ is the value of the transition from the state of label $y_t$ to the state of label $y_{t+1}$. From all possible sequences, the CRF selects the label sequence with the highest value, that is, the one with the maximum label sequence probability. The probability of a label sequence is
$$p(y\mid w)=\frac{\exp\big(C(y)\big)}{\sum_{i=1}^{n}\exp(C_i)} \tag{6}$$
where w is the input character sequence, y is the output label sequence, $C_i$ is the value of the i-th candidate sequence, and n is the number of all possible label sequences.
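The difference between per-character selection and sequence-level selection can be illustrated with a small sketch. It assumes the sequence value is the sum of the initial value, the character values, the transition values and the end value, and that the sequence probability is a softmax over the values of all candidate sequences. All numbers and names are hypothetical, and the brute-force enumeration is for illustration only; practical CRF decoders use the Viterbi algorithm instead.

```python
import itertools
import math


def sequence_value(tags, start, end, emit, trans):
    """Total value of a tag sequence: initial + character + transition + end values."""
    value = start[tags[0]] + end[tags[-1]]
    value += sum(emit[t][tag] for t, tag in enumerate(tags))
    value += sum(trans[a][b] for a, b in zip(tags, tags[1:]))
    return value


def best_sequence(labels, start, end, emit, trans):
    """Enumerate every tag sequence; return the most probable one and its probability."""
    candidates = list(itertools.product(labels, repeat=len(emit)))
    values = [sequence_value(c, start, end, emit, trans) for c in candidates]
    normalizer = sum(math.exp(v) for v in values)
    best = max(range(len(candidates)), key=lambda i: values[i])
    return candidates[best], math.exp(values[best]) / normalizer


# Hypothetical values for a 2-character input with labels "B" and "I".
LABELS = ["B", "I"]
START = {"B": 0.0, "I": -2.0}
END = {"B": 0.0, "I": 0.0}
EMIT = [{"B": 1.0, "I": 1.5}, {"B": 0.5, "I": 1.0}]
TRANS = {"B": {"B": -1.0, "I": 1.0}, "I": {"B": 0.0, "I": 0.0}}
```

With these values, choosing the highest character value independently at each position gives ("I", "I"), whereas scoring whole sequences favors ("B", "I") because of the initial and transition values.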
The invention uses the Bert model to train text word vectors, replaces the SoftMax output layer of the Bert model with a CRF, and extracts the two entity types, line name and manufacturer name, from power equipment fault defect text, as shown in FIG. 7. The Bert-CRF model first converts the input power equipment fault defect text into the standard input format of the Bert model, obtaining the Token, Segment and Position information of each character. Each Encoder layer learns from this information and passes the learned features to the next Encoder layer, and after iterating through the layers the final feature information is output to the CRF model. The CRF model then uses the feature information to compute the maximum-probability label sequence for the text sequence, yielding the label of each character. To extract line names and manufacturer names with the Bert-CRF model, the power equipment fault defect text must be labeled manually so that the label of each character is identified. By learning text features and then predicting the label of each character, the Bert-CRF model identifies the line names and manufacturer names in the power equipment fault defect text.
The invention takes power equipment fault defect text records provided by a power grid as the data set and extracts 8 types of entities with 3 algorithms. This section mainly verifies the class III entity extraction algorithm based on Bert-CRF proposed by the invention. Finally, the extraction results for all entities in the power equipment fault defect text are presented and analyzed with the corresponding indexes.
In the experiment, the Bert-CRF model is optimized with the Adam algorithm. The initial learning rate and the dropout rate are both 0.1, the training and test batch sizes are 32 and 8 respectively, and training runs for 30 epochs. The number of attention heads in the Bert model is 12, the dimensionality of the text feature vector is 768, and the number of fully connected layers connected to the CRF is 7. The model is implemented with the TensorFlow framework. 2000 power equipment fault defect text records are selected and labeled to test the entity extraction results. For the line name and manufacturer name tests, 1600 text records are used as the training set for learning text features, and 400 text records are used as the test set for evaluating model performance.
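For reference, the experimental settings stated above can be collected into a configuration sketch. The values come from the text; the key names and the `split_records` helper are hypothetical, and the in-order split is an assumption because the source does not say how the 1600 and 400 records were chosen.

```python
# Values are taken from the experiment description; key names are hypothetical.
EXPERIMENT_CONFIG = {
    "optimizer": "Adam",
    "initial_learning_rate": 0.1,
    "dropout": 0.1,
    "train_batch_size": 32,
    "test_batch_size": 8,
    "epochs": 30,
    "attention_heads": 12,
    "feature_dim": 768,
    "fully_connected_layers": 7,
    "labeled_records": 2000,
    "train_records": 1600,
    "test_records": 400,
}


def split_records(records, train_size):
    """Split labeled records into (training set, test set), preserving order."""
    return records[:train_size], records[train_size:]
```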
The class III entity extraction algorithm based on Bert-CRF has two parts, word vector representation and feature classification, and the class III entity extraction results are analyzed from both aspects. The first is text vectorization, the most important step of a text processing task: the quality of a downstream task generally depends on the word vectors obtained by training the Bert language model on text. Word vectors represent the semantics contained in the text and can be evaluated with a semantic relatedness task. The invention uses a word similarity task, taking the correlation (similarity) between two words as the measurement index, calculated as
$$\mathrm{similarity}(A,B)=\frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\,\sqrt{\sum_{i=1}^{n}B_i^{2}}} \tag{7}$$
In the formula, A and B are two word vectors obtained by training a language model, n is the dimension of the word vectors, and $A_i$ and $B_i$ are the values corresponding to each dimension.
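The similarity measure can be sketched as below, assuming it is the cosine of the angle between the two word vectors, which is the usual measure matching the symbols A, B, n, $A_i$ and $B_i$ defined in the text. The function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0.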
S5, outputting the class I entity information, the class II entity information and the class III entity information to obtain the extraction result of the named entity.
In one embodiment of the present invention, S1 includes:
S11, removing words without actual meaning from the defect text according to a preset rule;
S12, removing special symbols from the defect text by adopting a regular expression, wherein the special symbols comprise: punctuation, numbers and special characters.
In one embodiment of the present invention, S2 includes:
S21, importing a preset dictionary set and the standardized text data;
S22, assigning an entity class label to each dictionary in the dictionary set;
S23, traversing all dictionaries in the dictionary set;
S24, traversing each word in the current dictionary, and judging whether the word appears in the standardized text data; if so, recording the word as a class I entity of the standardized text data, and if not, entering S25;
S25, judging whether the traversal of the current dictionary has ended; if so, entering S26, and if not, entering S24;
S26, judging whether all dictionaries in the dictionary set have been traversed; if so, ending the extraction of the class I entity information, and if not, entering S23.
In one embodiment of the present invention, S3 includes:
S31, importing the standardized text data;
S32, performing word segmentation on the standardized text data, and labeling the part of speech of each word to obtain a segmented word set;
S33, traversing the word set, and judging whether the current word is a time noun; if so, entering S34, and if not, entering S36;
S34, reading the word following the current word, and judging whether it is a time noun; if so, repeating S34, and if not, entering S35;
S35, joining the time nouns into a time entity;
S36, judging whether all words in the word set have been traversed; if so, ending the extraction of the class II entity information, and if not, entering S33.
In one embodiment of the present invention, the calculation formula of S4 is as follows:
$$\mathrm{similarity}(A,B)=\frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\,\sqrt{\sum_{i=1}^{n}B_i^{2}}}$$
in the formula, A and B are two word vectors obtained by training a language model, n is the dimension of the word vectors, and $A_i$ and $B_i$ are the values of each dimension.
For a better understanding of the invention, reference may be made to the following example:
Taking the line names "tea south line" and "tea north line" and the manufacturer names "Kubo electronics technology Co., Ltd" and "Shen san electric appliance manufacturing Co., Ltd" as examples, the pairwise similarity between these 4 words was calculated in order to analyze how well the Bert model extracts features from power equipment fault defect text. The results of the cross experiment are shown in Table 1.
TABLE 1 word similarity
Figure BDA0003142082320000132
Figure BDA0003142082320000141
As can be seen from Table 1, the similarity between "tea south line" and "tea north line" is as high as 0.98, indicating that the two are highly correlated and that the two lines are probably in the same place; the similarity between "Kubo electronics technology Co., Ltd" and "Shen san electric appliance manufacturing Co., Ltd", both of which are manufacturer names, is 0.92. By contrast, the similarity between a line name and a manufacturer name is below 0.8, indicating that the correlation between them is weak. The test results show that entity words of the same category have higher similarity and entity words of different categories have lower similarity, which is consistent with the actual situation; the Bert language model adopted by the invention is therefore effective.
Building on the Bert model, the invention replaces the softmax output layer of the original Bert model with a conditional random field (CRF), a machine learning model, to obtain the Bert-CRF model, which is used to extract two types of entities, line names and manufacturer names, from power equipment fault defect text. To illustrate the effectiveness of the method, the results are compared with those obtained by using the CRF model alone and the Bert model alone to extract the two types of entities; the experimental results of the 3 models are shown in Table 2.
TABLE 2 comparison of the results
Figure BDA0003142082320000142
Figure BDA0003142082320000151
As can readily be seen from Table 2, the Bert-CRF model improves on both the CRF model and the Bert model for the line name entity and the manufacturer name entity alike. The F1 value of the Bert-CRF model for line name recognition reaches 95.26%, which is 3.06 percentage points higher than that of the CRF model and 0.35 percentage points higher than that of the Bert model; its F1 value for manufacturer name recognition reaches 94.46%, which is 4.66 percentage points higher than that of the CRF model and 0.55 percentage points higher than that of the Bert model. The experimental results show that the Bert-CRF model provided by the invention is effective in extracting the two types of entities, line names and manufacturer names, from power equipment fault defect text.
In addition, for all 3 models (CRF, Bert and Bert-CRF), the recognition results for line names are better than those for manufacturer names on all 3 metrics: accuracy (precision), recall and F1 value. The margins are 2.82, 1.99 and 2.40 percentage points respectively for the CRF model, 1.11, 0.82 and 0.98 percentage points for the Bert model, and 0.79, 0.80 and 0.80 percentage points for the Bert-CRF model. Analysis of the power equipment fault defect text shows that the textual features of line names are more distinctive than those of manufacturer names and are therefore easier for the model to learn, so line names are recognized better than manufacturer names when the two types of target entities are identified.
Based on research and analysis of power equipment fault defect text, 3 methods, based respectively on dictionary matching, the LTP tool and the Bert-CRF model, are provided to extract entities with different characteristics: the dictionary matching method extracts 5 types of entities (equipment name, part name, fault type, fault level and voltage level), the LTP tool extracts time entities, and the Bert-CRF model extracts 2 types of entities (line name and manufacturer name). On the power equipment fault defect text records annotated for the invention, the results of extracting the 8 types of entities with these 3 methods are shown in Table 3.
TABLE 3 entity extraction results
Figure BDA0003142082320000161
Entity extraction based on dictionary matching achieves good results because the vocabulary is small and the method is simple: the accuracy, recall and F1 value for the 5 types of entities (equipment name, part name, fault type, fault level and voltage level) all reach 100%. The method is also easy to extend later, since the dictionaries can be supplemented directly. LTP is a relatively mature tool currently used for time extraction and performs well on time entities: for the power equipment fault defect text, the accuracy, recall and F1 value of time entity extraction reach 99.91%, 97.84% and 98.86% respectively, indicating that the tool is well chosen. The Bert-CRF model provided by the invention combines the advantages of the Bert and CRF models: the accuracy, recall and F1 value are 95.64%, 94.88% and 95.26% respectively for the line name entity, and 94.85%, 94.08% and 94.46% respectively for the manufacturer name entity. Compared with using the CRF algorithm or the Bert algorithm alone, the Bert-CRF model improves accuracy, recall and F1 value to some extent and can meet the requirements of use.
Overall, the processing flow for power equipment fault defect text that cascades the 3 entity extraction methods is effective: it solves the problem that the entities cannot be extracted directly because they are of many types with widely differing characteristics, and it achieves good extraction results.
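The reported F1 values can be cross-checked against the other two reported figures. The Python sketch below assumes that the figure translated as "accuracy" is the precision P and that F1 is the usual harmonic mean of precision and recall; the function name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (precision %, recall %) pairs as reported in the text above.
    reported = {
        "time (LTP)": (99.91, 97.84),
        "line name (Bert-CRF)": (95.64, 94.88),
        "manufacturer name (Bert-CRF)": (94.85, 94.08),
    }
    for name, (p, r) in reported.items():
        print(name, round(f1_score(p, r), 2))
```

Under these assumptions the three pairs reproduce, to two decimal places, the F1 values stated for the time, line name and manufacturer name entities.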
Compared with the prior art, the method for extracting text named entities of power equipment fault defects has the following beneficial effects:
First, a dictionary matching method is used to extract 5 types of entities (equipment name, part name, fault type, fault level and voltage level) from the power equipment fault defect text; second, the LTP tool is used to extract the time entities in the defect text; finally, a CRF (conditional random field) replaces the softmax output layer of the Bert model, which overcomes the problem that labels selected word by word are only locally optimal. The method of the invention thereby improves the accuracy and efficiency of extracting text named entities of power equipment fault defects.
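To illustrate why sequence-level decoding can differ from choosing the best label for each word independently, the following Python sketch contrasts a per-token argmax with Viterbi decoding over label transition scores, which is the standard decoding procedure for a linear-chain CRF. The scores, the B/I/O label set and the function names are hypothetical and are not taken from the disclosure or from any trained model.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]


def greedy_decode(emissions: List[Scores]) -> List[str]:
    """Pick the highest-scoring label for each token independently."""
    return [max(scores, key=scores.get) for scores in emissions]


def viterbi_decode(emissions: List[Scores],
                   transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the label sequence with the highest total score.

    Total score = sum of per-token scores + sum of transition scores;
    label pairs missing from `transitions` score 0.
    """
    labels = list(emissions[0])
    # best[label] = (score of the best path ending in label, that path)
    best = {lab: (emissions[0][lab], [lab]) for lab in labels}
    for scores in emissions[1:]:
        step = {}
        for cur in labels:
            prev = max(labels,
                       key=lambda p: best[p][0] + transitions.get((p, cur), 0.0))
            total = best[prev][0] + transitions.get((prev, cur), 0.0) + scores[cur]
            step[cur] = (total, best[prev][1] + [cur])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


if __name__ == "__main__":
    emissions = [{"O": 1.0, "B": 0.8, "I": 0.1},
                 {"O": 0.2, "B": 0.1, "I": 1.0}]
    transitions = {("O", "I"): -5.0}   # an entity may not start with "I"
    print(greedy_decode(emissions), viterbi_decode(emissions, transitions))
```

In this toy case the per-token choice yields the sequence O, I, which is not a well-formed entity, while the sequence-level decoding yields B, I because the penalized O-to-I transition lowers the total score of the greedy path.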
As shown in Fig. 8, the present invention further provides a device 200 for extracting text named entities of power equipment fault defects, including: a data acquisition module 201, a class I entity information extraction module 202, a class II entity information extraction module 203, a class III entity information extraction module 204, and an output module 205, wherein,
the data acquisition module 201 is configured to acquire a defect text of the power device, and preprocess the defect text to obtain standardized text data;
the class I entity information extraction module 202 is configured to extract class I entity information by using a dictionary method, where the class I entity information includes: device name, component name, fault type, fault class, and voltage class;
the class II entity information extraction module 203 is configured to extract class II entity information by using an LTP tool, where the class II entity information includes: production time and commissioning time;
the class III entity information extraction module 204 is configured to extract class III entity information by using a Bert-CRF algorithm, where the class III entity information includes: a line name and a manufacturer name;
the output module 205 is configured to output the class I entity information, the class II entity information, and the class III entity information, and obtain an extraction result of the named entity.
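Purely as an illustrative sketch, the cooperation of modules 201 to 205 might be expressed in Python as follows. The function name `run_pipeline`, the type aliases and the stand-in extractors in the usage example are hypothetical; a real device would plug in the dictionary, LTP and Bert-CRF extractors described above.

```python
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str]                      # (entity label, surface text)
Extractor = Callable[[str], List[Entity]]


def run_pipeline(raw_text: str,
                 preprocess: Callable[[str], str],
                 extractors: Dict[str, Extractor]) -> Dict[str, List[Entity]]:
    """Preprocess once, run every extractor on the result, collect the output."""
    text = preprocess(raw_text)                # data acquisition module 201
    # Modules 202 to 204 each see the same standardized text; the returned
    # mapping plays the role of output module 205.
    return {name: extract(text) for name, extract in extractors.items()}


if __name__ == "__main__":
    result = run_pipeline(
        "  transformer oil leakage  ",
        preprocess=str.strip,
        extractors={
            # Stand-in extractors for illustration only.
            "class_I": lambda t: [("fault_type", "oil leakage")] if "oil leakage" in t else [],
            "class_II": lambda t: [],
            "class_III": lambda t: [],
        },
    )
    print(result)
```

Keeping the three extractors behind one common call signature lets each be replaced or extended independently of the others.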
In an embodiment of the present invention, the data acquisition module is specifically configured to:
remove words without actual meaning from the defect text according to a preset rule;
remove special symbols from the defect text by using a regular expression, wherein the special symbols comprise: punctuation, numbers and special characters.
In one embodiment of the present invention, the class III entity information extraction module adopts the following calculation formula:
Figure BDA0003142082320000181
in the formula, A and B are two word vectors obtained by training a language model, n is the dimension of the word vectors, and A_i and B_i are the corresponding values of A and B in each dimension.
The present invention also provides a computer terminal device, comprising: one or more processors; and a memory coupled to the processors for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for extracting text named entities of power equipment fault defects as described in any one of the above.
It should be noted that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the terminal device and connects the various parts of the terminal device through various interfaces and lines.
The memory mainly includes a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like, and the data storage area may store related data and the like. In addition, the memory may be a high speed random access memory, may also be a non-volatile memory, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (FlashCard), and the like, or may also be other volatile solid state memory devices.
It should be noted that the terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the terminal device described is only an example and does not constitute a limitation; it may include more or fewer components, combine certain components, or use different components.
The invention also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the method for extracting a text-named entity of a power equipment fault defect as described in any one of the above.
It should be noted that the computer program may be divided into one or more modules/units (e.g., computer program), and the one or more modules/units are stored in the memory and executed by the processor to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the terminal device.
The above-mentioned embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above-mentioned embodiments are only examples of the present invention and are not intended to limit the scope of the present invention. It should be understood that any modifications, equivalents, improvements and the like, which come within the spirit and principle of the invention, may occur to those skilled in the art and are intended to be included within the scope of the invention.

Claims (10)

1. A method for extracting text named entities of power equipment fault defects, characterized by comprising the following steps:
S1, acquiring a defect text of the power equipment, and preprocessing the defect text to obtain standardized text data;
S2, extracting class I entity information by using a dictionary method, wherein the class I entity information comprises: device name, component name, fault type, fault class, and voltage class;
S3, extracting class II entity information by using an LTP tool, wherein the class II entity information comprises: production time and commissioning time;
S4, extracting class III entity information by using a Bert-CRF algorithm, wherein the class III entity information comprises: a line name and a manufacturer name;
and S5, outputting the class I entity information, the class II entity information and the class III entity information to obtain the extraction result of the named entities.
2. The method for extracting text named entities of power equipment fault defects according to claim 1, wherein step S1 includes:
S11, removing words without actual meaning from the defect text according to a preset rule;
S12, removing special symbols from the defect text by using a regular expression, wherein the special symbols comprise: punctuation, numbers and special characters.
3. The method for extracting text named entities of power equipment fault defects according to claim 1, wherein step S2 includes:
S21, importing a preset dictionary set and the standardized text data;
S22, assigning one class of entity label to each dictionary in the dictionary set;
S23, traversing the dictionaries in the dictionary set;
S24, traversing each word in the current dictionary and determining whether the word appears in the standardized text data; if so, recording the word as a class I entity of the standardized text data; if not, proceeding to S25;
S25, determining whether traversal of the current dictionary is complete; if so, proceeding to S26; if not, returning to S24;
and S26, determining whether all dictionaries in the dictionary set have been traversed; if so, ending the extraction of the class I entity information; if not, returning to S23.
4. The method for extracting text named entities of power equipment fault defects according to claim 1, wherein step S3 includes:
S31, importing the standardized text data;
S32, performing word segmentation on the standardized text data and labeling the part of speech of each word to obtain a segmented word set;
S33, traversing the word set and determining whether the current word is a time noun; if so, proceeding to S34; if not, proceeding to S36;
S34, reading the word following the current word and determining whether it is a time noun; if so, repeating S34; if not, proceeding to S35;
S35, combining the consecutive time nouns into a time entity;
and S36, determining whether all words in the word set have been traversed; if so, ending the extraction of the class II entity information; if not, returning to S33.
5. The method for extracting text named entities of power equipment fault defects according to claim 1, wherein step S4 uses the following calculation formula:
Figure FDA0003142082310000031
in the formula, A and B are two word vectors obtained by training a language model, n is the dimension of the word vectors, and A_i and B_i are the corresponding values of A and B in each dimension.
6. A device for extracting text named entities of power equipment fault defects, characterized by comprising: a data acquisition module, a class I entity information extraction module, a class II entity information extraction module, a class III entity information extraction module and an output module, wherein,
the data acquisition module is configured to acquire a defect text of the power equipment, and preprocess the defect text to obtain standardized text data;
the class I entity information extraction module is configured to extract class I entity information by using a dictionary method, where the class I entity information includes: device name, component name, fault type, fault class, and voltage class;
the class II entity information extraction module is configured to extract class II entity information by using an LTP tool, where the class II entity information includes: production time and commissioning time;
the class III entity information extraction module is configured to extract class III entity information by using a Bert-CRF algorithm, where the class III entity information includes: a line name and a manufacturer name;
and the output module is configured to output the class I entity information, the class II entity information and the class III entity information to obtain an extraction result of the named entities.
7. The device for extracting text named entities of power equipment fault defects according to claim 6, wherein the data acquisition module is specifically configured to:
remove words without actual meaning from the defect text according to a preset rule;
remove special symbols from the defect text by using a regular expression, wherein the special symbols comprise: punctuation, numbers and special characters.
8. The device for extracting text named entities of power equipment fault defects according to claim 6, wherein the class III entity information extraction module uses the following calculation formula:
Figure FDA0003142082310000041
in the formula, A and B are two word vectors obtained by training a language model, n is the dimension of the word vectors, and A_i and B_i are the corresponding values of A and B in each dimension.
9. A computer terminal device, comprising:
one or more processors;
a memory coupled to the processors for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for extracting text named entities of power equipment fault defects according to any one of claims 1 to 5.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for extracting text named entities of power equipment fault defects according to any one of claims 1 to 5.
CN202110742874.4A 2021-06-30 2021-06-30 Extraction method and device for text named entities of power equipment fault defects Active CN113343701B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110742874.4A CN113343701B (en) 2021-06-30 2021-06-30 Extraction method and device for text named entities of power equipment fault defects

Publications (2)

Publication Number Publication Date
CN113343701A true CN113343701A (en) 2021-09-03
CN113343701B CN113343701B (en) 2022-08-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant